C# XMLDocument Encoding?


I'm trying to code a function that validates an XML settings file, so if a node does not exist on the file, it should create it.
I have this function
private void addMissingSettings() {
    XmlDocument xmldocSettings = new XmlDocument();
    xmldocSettings.Load("settings.xml");
    XmlNode xmlMainNode = xmldocSettings.SelectSingleNode("settings");
    XmlNode xmlChildNode = xmldocSettings.CreateElement("ExampleNode");
    xmlChildNode.InnerText = "Hello World!";
    //add to parent node
    xmlMainNode.AppendChild(xmlChildNode);
    xmldocSettings.Save("settings.xml");
}
But on my XML file, if I have
<rPortSuffix desc="Read Suffix">&#xC;&#xA;</rPortSuffix>
<wPortSuffix desc="Write Suffix">&#x3;</wPortSuffix>
When I save the document, it saves those lines as
<rPortSuffix desc="Read Suffix">
</rPortSuffix>
<wPortSuffix desc="Write Suffix"></wPortSuffix>
<ExampleNode>Hello World!</ExampleNode>
Is there a way to prevent this behaviour? Like setting a working charset or something like that?

The two files are equivalent, and should be treated as being equivalent by all XML parsers, I believe.
Additionally, Unicode character U+0003 isn't a valid XML character, so you've fundamentally got other problems if you're trying to represent it in your file. Even though that particular .NET XML parser doesn't seem to object, other parsers may well do so.
If you need to represent absolutely arbitrary characters in your XML, I suggest you do so in some other form - e.g.
<rPortSuffix desc="Read Suffix">\u000c\u000a</rPortSuffix>
<wPortSuffix desc="Write Suffix">\u0003</wPortSuffix>
Obviously you'll then need to parse that text appropriately, but at least the XML parser won't get in the way, and you'll be able to represent any UTF-16 code unit.
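As a rough sketch of the "parse that text appropriately" step, the escaped values could be decoded after they are read back from the document. The class and method names below are hypothetical, and the sketch assumes every `\uXXXX` sequence in the text was written deliberately (so a literal backslash in the data would itself have to be stored as `\u005c`).

```csharp
using System.Globalization;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class EscapedSettings
{
    // Turns every literal "\uXXXX" sequence back into the UTF-16 code unit it names.
    public static string Decode(string escaped)
    {
        return Regex.Replace(escaped, @"\\u([0-9a-fA-F]{4})", m =>
            ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber,
                             CultureInfo.InvariantCulture)).ToString());
    }

    // Loads the XML, reads the text of the first node matching the XPath
    // and decodes it. Returns null when nothing matches.
    public static string ReadDecoded(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : Decode(node.InnerText);
    }
}
```

Because the XML only ever contains ordinary printable text, saving and reloading the file leaves the escaped values untouched.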

Related

C# Xml Encoding

I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all... way too much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach (XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
    XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
    categorieSystemNode.RemoveAll();
    XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
    string import_id = categories.Attributes["nodeName"].Value;
    importIdNode.InnerText = import_id;
    categorieSystemNode.AppendChild(importIdNode);
    [way more nodes which I process like this]
    // appended inside the loop, where categorieSystemNode is in scope
    newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
}
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro>&lt;p&gt;Whether your project [...] &lt;/p&gt;</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas

The main purpose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well-formed document.
So, when you use InnerText as in your example, you let the framework encode the string and insert it properly into the document. Whenever you read that same value back, it will be decoded and returned to you exactly as your original string.
But if you want to add an XML fragment anyway, you should stick with InnerXml or ImportNode. Be aware that this could lead to a more complex document structure, which you would probably like to avoid.
As a third possibility, you can use CreateCDataSection to add a CDATA section and put your text there.
You should definitely stay away from treating the XML document as a string and running Replace over it; stick with the framework and you'll be OK.
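The three options can be compared side by side. This is only a sketch: the `intro` element and the sample markup are borrowed from the question, while the class name and the `systems` root are assumptions made for illustration.

```csharp
using System.Xml;

// Hypothetical demo class, not part of the original answer.
public static class MarkupInsertion
{
    // Returns the serialized <intro> element for each of the three options.
    public static string[] Demo()
    {
        const string html = "<p>Whether your project</p>";
        var doc = new XmlDocument();
        doc.LoadXml("<systems/>");

        // 1. InnerText: the markup is stored as plain text, so it is escaped on output.
        XmlElement asText = doc.CreateElement("intro");
        asText.InnerText = html;

        // 2. InnerXml: the string is parsed, so <p> becomes a real child element.
        XmlElement asXml = doc.CreateElement("intro");
        asXml.InnerXml = html;

        // 3. CDATA: the markup stays text but is written without entity escaping.
        XmlElement asCData = doc.CreateElement("intro");
        asCData.AppendChild(doc.CreateCDataSection(html));

        return new[] { asText.OuterXml, asXml.OuterXml, asCData.OuterXml };
    }
}
```

In all three cases reading `InnerText` back yields the text content, but only the second option changes the structure of the document, which is why it is the one to be careful with.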

Avoid escaping &apos; entity with .NET XmlDocument class

How can I stop the XmlDocument class from replacing the &apos; entity with the ' character?
For example if I have:
string xml = "<a> &apos; </a>";
After doing
var doc = new XmlDocument();
doc.LoadXml(xml);
string output = doc.OuterXml;
The value of output is
"<a> ' </a>"
I need to avoid this because I must load an XML document, make some changes and sign it digitally, so the signed XML must be identical to the one that was loaded.

For your specific requirements, don't use XmlDocument or any other XML parser to parse the original document.
Do use XmlDocument or any other XML-specific classes to create your new document, except put a placeholder where the original document needs to go, like ORIGINAL_DOCUMENT_HERE. Then after you've generated the resulting text XML for your new document, replace ORIGINAL_DOCUMENT_HERE with your original received text, and then sign the result.
Not a normal way to work with XML, but should work for your specific use case.
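A minimal sketch of the placeholder idea follows. The `envelope` element and the class name are invented for illustration, and the sketch assumes the original text carries no XML declaration of its own and that the placeholder string does not occur anywhere else in the generated document.

```csharp
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class PlaceholderWrapper
{
    private const string Placeholder = "ORIGINAL_DOCUMENT_HERE";

    // Builds the new document with XmlDocument, then splices the original
    // text in verbatim so that its entities (such as &apos;) are never re-encoded.
    public static string Wrap(string originalXml)
    {
        var doc = new XmlDocument();
        XmlElement envelope = doc.CreateElement("envelope"); // assumed wrapper element
        envelope.InnerText = Placeholder;
        doc.AppendChild(envelope);

        // The original document is treated as opaque text from here on.
        return doc.OuterXml.Replace(Placeholder, originalXml);
    }
}
```

The returned string is what would be signed; it must not be passed through an XML parser and re-serialized before signing, or the entities would be rewritten again.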

Handling \x01 received from Flash's ExternalInterface

I'm receiving data from a Flash component embedded in a Windows Form. Unfortunately, if the data returned from the socket contains any of the following characters, the call to LoadXml below fails:
This is the callback method I have to receive data from the socket (via ExternalInterface in the Flash component).
private void player_FlashCall(object sender, _IShockwaveFlashEvents_FlashCallEvent e)
{
    String output = e.request;
    //output = CleanInvalidXmlChars(output);
    XmlDocument document = new XmlDocument();
    document.LoadXml(output);
    XmlAttributeCollection attributes = document.FirstChild.Attributes;
    String command = attributes.Item(0).InnerText;
    XmlNodeList list = document.GetElementsByTagName("arguments");
    process(list[0].InnerText);
}
I had a method to replace the characters with text (CleanInvalidXmlChars), but I don't think this is the right approach.
How can I load this data into an XmlDocument? Parsing it as XML makes it very easy to separate the method name, parameter names and parameter types that are returned.
Would appreciate any help at all.
Thanks.

If the “XML” contains any U+0001 (aka '\x01') or other similar characters, it is not valid XML. There is no way you can include those characters in XML (well, in XML 1.0, anyway). See the XML specification. If you need to pass e.g. binary data in XML, you need to convert it to a proper form, e.g. using Base-64.
If the data does contain those invalid characters, it is not XML, and therefore cannot be read using standard XML tools (I don’t think any of the standard .NET classes allows you to override that behavior). One option is to replace all those characters prior to use, as you tried. The invalid characters are basically all control characters (U+0000 through U+001F) except U+0009 (tab), U+000A (LF) and U+000D (CR), plus U+FFFE and U+FFFF (noncharacters). You could devise a safe transformation which would not lose any data: e.g. first replace all # characters with #0023, then replace any invalid character with #xxxx where xxxx is its hexadecimal code, and when processing the parsed XML data, replace all #xxxx back.
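The reversible transformation described above might look roughly like this. The names are hypothetical, and the sketch only covers the characters listed in the answer (it does not deal with unpaired surrogates). It escapes `#` and the invalid characters in a single pass, which gives the same result as the two-step description.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class HashEscaper
{
    // True for the characters the answer lists as not allowed in XML 1.0.
    private static bool IsInvalid(char c)
    {
        return (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
            || c == '\uFFFE' || c == '\uFFFF';
    }

    // Replaces '#' and every invalid character with "#XXXX" (four hex digits).
    public static string Escape(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            if (c == '#' || IsInvalid(c))
                sb.Append('#').Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Reverses Escape: every "#XXXX" becomes the character it encodes.
    public static string Unescape(string escaped)
    {
        return Regex.Replace(escaped, "#([0-9A-F]{4})", m =>
            ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber,
                             CultureInfo.InvariantCulture)).ToString());
    }
}
```

The idea would be to call `Escape` on the raw string before `LoadXml`, and `Unescape` on each value taken out of the parsed document.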
Another option is to drop the XML idea and just process it as a string. Just for inspiration, see e.g. this piece of code.

Reading XML file with Invalid character

I am using DataSet.ReadXml() to read an XML string. I get an error because the XML string contains the invalid character 0x1F, which is 'US', the unit separator. It is contained within fully formed tags.
The data is extracted from an Oracle DB, using a Perl script. What would be the best way to escape this character so that the XML is read correctly?
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
Between the C and the h in the bold part there is a US separator, which actually shows up as a space when pasted here. So I want to know: how can I ignore that in an XML string?

If you look at section 2.2 of the XML recommendation, you'll see that 0x1F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why doesn't the Perl script that produces this string fail when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
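For the small problem, the preprocessing step of discarding illegal characters could be sketched as follows. The class name is made up, and the character ranges are the ones from section 2.2 of the XML 1.0 recommendation.

```csharp
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class XmlCharFilter
{
    // Legal XML 1.0 characters inside the Basic Multilingual Plane.
    private static bool IsLegalBmpChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Drops every character that XML 1.0 does not allow.
    // Valid surrogate pairs (U+10000 and above) are kept intact.
    public static string Strip(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                sb.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            else if (IsLegalBmpChar(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

The cleaned string can then be handed to `DataSet.ReadXml` through a `StringReader`. Note that this silently drops data, which is only acceptable if the stray character really carries no meaning.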

Your XmlReader/TextReader must be created with the correct encoding. You can create it as below and pass it to your DataSet:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);

How do I work with an XML tag within a string?

I'm working in Microsoft Visual C# 2008 Express.
Let's say I have a string and the contents of the string is: "This is my <myTag myTagAttrib="colorize">awesome</myTag> string."
I'm telling myself that I want to do something to the word "awesome" - possibly call a function that does something called "colorize".
What is the best way in C# to go about detecting that this tag exists and getting that attribute? I've worked a little with XElements and such in C#, but mostly to do with reading in and out XML files.
Thanks!
-Adeena

Another solution:
var myString = "This is my <myTag myTagAttrib='colorize'>awesome</myTag> string.";
try
{
    // wrap the string in a root element so that it parses as one document
    var document = XDocument.Parse("<root>" + myString + "</root>");
    // needs "using System.Xml.XPath;"; the paths are relative to <root>
    var matches = document.Root.XPathSelectElements("myTag|myTag2");
    foreach (var element in matches)
    {
        switch (element.Name.ToString())
        {
            case "myTag":
                //do something with myTag like lookup attribute values and call other methods
                break;
            case "myTag2":
                //do something else with myTag2
                break;
        }
    }
}
catch (System.Xml.XmlException)
{
    //string was not well-formed XML
}
I also took into account your comment to Dabblernl where you want to parse multiple attributes on multiple elements.

You can extract the XML with a regular expression, load the extracted XML string into an XElement and go from there:
string text = @"This is my <myTag myTagAttrib='colorize'>awesome</myTag> text.";
Match match = Regex.Match(text, @"(<myTag.*</myTag>)");
string xml = match.Captures[0].Value;
XElement element = XElement.Parse(xml);
XAttribute attribute = element.Attribute("myTagAttrib");
if (attribute.Value == "colorize") DoSomethingWith(element.Value); // Value = "awesome"
This code will throw an exception if no myTag element was found, but that can be remedied by inserting a line of:
if (match.Captures.Count != 0)
{...}
It gets even more interesting if the string could hold more than just the MyTag Tag...

I'm a little confused about your example, because you switch between the string (text content), tags, and attributes. But I think what you want is XPath.
So if your XML stream looks like this:
<adeena><parent><child x="this is my awesome string">This is another awesome string</child></parent></adeena>
You'd use an XPath expression that looks like this to find the attribute:
//child/@x
and one like this to find the text value under the child tag:
//child
I'm a Java developer, so I don't know what XML libraries you'd use to do this. But you'll need a DOM parser to create a W3C Document class instance for you by reading in the XML file and then using XPath to pluck out the values.
There's a good XPath tutorial at W3Schools if you need it.
UPDATE:
If you're saying that you already have an XML stream as String, then the answer is to not read it from a file but from the String itself. Java has abstractions called InputStream and Reader that handle streams of bytes and chars, respectively. The source can be a file, a string, etc. Check your C# DOM API to see if it has something similar. You'll pass the string to a parser that will give back a DOM object that you can manipulate.
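In C#, the counterpart could look something like the sketch below, which parses the string directly with `XmlDocument.LoadXml` and runs the two XPath expressions with `SelectSingleNode`. The class name is hypothetical and the sample XML is a well-formed version of the stream shown earlier.

```csharp
using System.Xml;

// Hypothetical demo class, not part of the original answer.
public static class XPathDemo
{
    // Returns the value of the x attribute and the text under <child>.
    public static string[] Query()
    {
        const string xml =
            "<adeena><parent><child x=\"this is my awesome string\">" +
            "This is another awesome string</child></parent></adeena>";

        var doc = new XmlDocument();
        doc.LoadXml(xml); // parse from the string itself, no file involved

        XmlNode attribute = doc.SelectSingleNode("//child/@x");
        XmlNode element = doc.SelectSingleNode("//child");

        return new[] { attribute.Value, element.InnerText };
    }
}
```

`SelectSingleNode` returns null when nothing matches, so real code would check for that before dereferencing the result.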

Since the input is not well-formed XML you won't be able to parse it with any of the built in XML libraries. You'd need a regular expression to extract the well-formed piece. You could probably use one of the more forgiving HTML parsers like HtmlAgilityPack on CodePlex.

This is my solution to match any type of xml using Regex:
C# Better way to detect XML?

The XmlTextReader can parse XML fragments with a special constructor which may help in this situation, but I'm not positive about that.
There's an in-depth article here:
http://geekswithblogs.net/kobush/archive/2006/04/20/75717.aspx
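A related approach, sketched here as an alternative to the XmlTextReader constructor rather than as what the linked article does, is to create an `XmlReader` with `ConformanceLevel.Fragment`, which accepts text and elements at the top level. The class name and the `attribute:content` result format are made up for the example, and it assumes myTag contains only text.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class FragmentScanner
{
    // Collects "attribute:content" for every <myTag> found in mixed text.
    public static List<string> ReadTags(string text)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var found = new List<string>();

        using (XmlReader reader = XmlReader.Create(new StringReader(text), settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "myTag")
                {
                    string attrib = reader.GetAttribute("myTagAttrib");
                    // Reads the text content and moves past the end tag,
                    // so no extra Read() call is needed in this branch.
                    string content = reader.ReadElementContentAsString();
                    found.Add(attrib + ":" + content);
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return found;
    }
}
```

The tags themselves still have to be well-formed; an unclosed or badly nested tag makes the reader throw an `XmlException`.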
