C# Xml Encoding - c#

I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some of the code (not all of it; there is far too much):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
categorieSystemNode.RemoveAll();
XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
string import_id = categories.Attributes["nodeName"].Value;
importIdNode.InnerText = import_id;
categorieSystemNode.AppendChild(importIdNode);
[many more nodes that I process like this]
newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
}
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro>&lt;p&gt;Whether your project [...]&lt;/p&gt;</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas

The main purpose of XmlDocument is to provide an easy way to work with XML documents while making sure the result is a well-formed document.
So when you use InnerText, as in your example, you let the framework encode the string and insert it into the document properly. Whenever you read that value back, it is decoded and returned to you exactly as your original string.
If you want to add an XML fragment anyway, stick with InnerXml or ImportNode. Be aware that this can lead to a more complex document structure, which you would probably like to avoid.
As a third possibility, you can use CreateCDataSection to create a CDATA section and put your text there.
You should definitely stay away from treating the XML document as a string and calling Replace on it; stick with the framework and you'll be OK.
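A minimal sketch of the three options above, to make the difference visible. The class name, method names and the `intro` element are only illustrative assumptions, not part of the original code:

```csharp
using System;
using System.Xml;

public static class FragmentInsertionDemo
{
    // InnerText: the string is stored as text, so markup characters are escaped on output.
    public static string WithInnerText(string fragment)
    {
        var doc = new XmlDocument();
        XmlElement intro = doc.CreateElement("intro"); // hypothetical element name
        intro.InnerText = fragment;
        doc.AppendChild(intro);
        return doc.OuterXml;
    }

    // InnerXml: the string is parsed as XML, so it must be well-formed markup.
    public static string WithInnerXml(string fragment)
    {
        var doc = new XmlDocument();
        XmlElement intro = doc.CreateElement("intro");
        intro.InnerXml = fragment;
        doc.AppendChild(intro);
        return doc.OuterXml;
    }

    // CDATA: the string is stored verbatim inside a CDATA section.
    public static string WithCData(string fragment)
    {
        var doc = new XmlDocument();
        XmlElement intro = doc.CreateElement("intro");
        intro.AppendChild(doc.CreateCDataSection(fragment));
        doc.AppendChild(intro);
        return doc.OuterXml;
    }
}
```

Calling each method with `"<p>Hi</p>"` shows the escaped form for InnerText, nested elements for InnerXml, and a `<![CDATA[...]]>` wrapper for the third variant.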

Related

Making XmlReaderSettings CheckCharacters work for xml string

I have an xml string coming from Adobe PDF AcroForms, which apparently allows naming form fields starting with numeric characters. I'm trying to parse this string to an XDocument:
XDocument xDocument = XDocument.Parse(xmlString);
But whenever I encounter such a form field where the name starts with a numeric char, the xml parsing throws an XmlException:
Name cannot begin with the 'number' character
Other solutions I found were about using: XmlReaderSettings.CheckCharacters
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString), new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
But this also didn't work. Some articles attribute this to the following note in the MSDN documentation:
If the XmlReader is processing text data, it always checks that the
XML names and text content are valid, regardless of the property
setting. Setting CheckCharacters to false turns off character checking
for character entity references.
So I tried using:
using(MemoryStream memoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(xmlString)))
using (XmlReader xmlReader = XmlReader.Create(memoryStream, new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
This also didn't work.
Can anyone please help me figure out how to parse an XML string that contains elements whose names start with numeric characters?
How is the flag XmlReaderSettings.CheckCharacters supposed to be used?
You can't make a standard XML parser accept your format, even if it "looks like" XML, so stop trying. Standards-compliant XML parsers are not allowed to parse malformed XML. This was a deliberate design decision, based on all the problems that quirks mode caused with HTML parsing.
Writing your own parser isn't that hard. XML is very strict and, unless you need advanced features, the syntax is simple.
An LL parser can be written by hand. Both the lexer and the parser are simple.
A parser can also be generated with ANTLR from a simple grammar. Most likely, you'll even find example XML grammars.
You can also just take the source code of either of the .NET XML parsers and remove the validation you don't need. You can find both XmlDocument and XDocument in .NET Core's repository on GitHub.
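As a rough illustration of the MSDN note quoted in the question, the sketch below checks which inputs an XmlReader accepts with CheckCharacters switched on or off. The class and method names are hypothetical:

```csharp
using System;
using System.IO;
using System.Xml;

public static class CheckCharactersDemo
{
    // Returns true when the whole input can be read without an XmlException.
    public static bool CanRead(string xml, bool checkCharacters)
    {
        var settings = new XmlReaderSettings { CheckCharacters = checkCharacters };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { } // walk every node so all errors surface
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this helper, `CanRead("<1field>x</1field>", false)` still fails because the element name itself is invalid, whereas `CanRead("<a>&#x1;</a>", false)` succeeds: the flag only relaxes the check on character entity references.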

How to read xml string ignoring header?

I want to read an XML string while ignoring the header and the comments.
Ignoring the comments is simple, and I found a solution here.
But I can't find any way to ignore the header.
Let me give an example:
Consider this xml:
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Some comments -->
<Tag Attribute="3">
...
</Tag>
I want to read the XML into a string containing just the element "Tag" and any other elements, but without the "xml version" declaration and the comments.
The element "Tag" is only an example. Could exist many others.
So, I want only this:
<Tag Attribute="3">
...
</Tag>
The code that I've come so far:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreComments = true;
XmlReader reader = XmlReader.Create("...", settings);
xmlDoc.Load(reader);
And I can't find anything in XmlReaderSettings that does that.
Do I need to go node by node, choosing only the ones I want? Is there really no such setting?
EDIT 1:
To summarize my problem: I need the contents of the XML to use in a CDATA section of a web service call. When I send the comments or the XML version, I get an error specific to that part of the XML. So I assume that if I read the XML without the version header and the comments, I'll be good to go.
Here's a really simple solution.
using (var reader = XmlReader.Create(/*reader, stream, etc.*/))
{
reader.MoveToContent();
string content = reader.ReadOuterXml();
}
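A self-contained version of the snippet above might look like this. It assumes the XML is already available as a string and combines MoveToContent with the IgnoreComments setting from the question; the class and method names are made up for the example:

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlContentReader
{
    // Returns the markup of the root element, skipping the declaration
    // and any comments that precede it.
    public static string ReadRootElement(string xml)
    {
        var settings = new XmlReaderSettings { IgnoreComments = true };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();       // skips declaration, comments and whitespace
            return reader.ReadOuterXml(); // current element including its children
        }
    }
}
```

Note that this sketch returns only the root element; a document normally has just one, so that is usually what is wanted.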
Well, it seems that there is no setting to ignore the declaration, so I had to ignore it myself.
Here's the code I've written for those who might be interested:
private string _GetXmlWithoutHeadersAndComments(XmlDocument doc)
{
string xml = null;
// Loop through the child nodes and consider all but comments and declaration
if (doc.HasChildNodes)
{
StringBuilder builder = new StringBuilder();
foreach (XmlNode node in doc.ChildNodes)
if (node.NodeType != XmlNodeType.XmlDeclaration && node.NodeType != XmlNodeType.Comment)
builder.Append(node.OuterXml);
xml = builder.ToString();
}
return xml;
}
If you want to only get the Tag elements, you should just read the XML as normal, then find them using the XmlDocument's XPath capabilities.
For your xmlDoc object:
var nodes = xmlDoc.DocumentElement.SelectNodes("Tag");
You can then iterate through these like so:
foreach (XmlNode node in nodes) { }
Or, obviously, you could just put your SelectNodes query into the foreach loop, if you're never going to reuse the nodes object.
This will return all Tag elements within your XML document, and you can do whatever you see fit with them.
There's no need to ever encounter comments while using XmlDocument if you don't want to, and you're not going to end up getting results including either the header or the comments. Is there a particular reason you're trying to remove pieces of the XML before you begin parsing it?
Edit: Based on your edit, it seems like you're having a problem with the header giving an error when you try to pass it. You probably shouldn't straight-up remove the header, so your best option might be to change the header to one that you know works. You can change the header (declaration) like so:
XmlDeclaration xmlDeclaration;
xmlDeclaration = yourDocument.CreateXmlDeclaration(
yourVersion,
yourEncoding,
isStandalone);
yourDocument.ReplaceChild(xmlDeclaration, yourDocument.FirstChild);
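The declaration swap can be wrapped in a small helper. This is only a sketch: the helper name is hypothetical, and it additionally assumes you want a declaration inserted when the document has none:

```csharp
using System;
using System.Xml;

public static class DeclarationHelper
{
    // Replaces the existing XML declaration, or inserts one if it is missing.
    public static void SetDeclaration(XmlDocument doc, string encoding)
    {
        XmlDeclaration declaration = doc.CreateXmlDeclaration("1.0", encoding, null);
        if (doc.FirstChild is XmlDeclaration existing)
        {
            doc.ReplaceChild(declaration, existing);
        }
        else
        {
            // With a null reference node, InsertBefore appends to the end,
            // which covers the empty-document case.
            doc.InsertBefore(declaration, doc.FirstChild);
        }
    }
}
```

Checking the type of FirstChild first avoids accidentally replacing the root element when the input has no declaration.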

Avoid escaping &apos; entity with .NET XmlDocument class

How can I prevent the XmlDocument class from replacing the &apos; entity with the ' character?
For example if I have:
string xml = "<a> &apos; </a>";
After doing
var doc = new XmlDocument();
doc.LoadXml(xml);
string output = doc.OuterXml;
The value of output is
"<a>'</a>"
I need to avoid this because I must load an XML document, make some changes and sign it digitally, so the signed XML must be identical to what was loaded.
For your specific requirements, don't use XmlDocument or any other XML parser to parse the original document.
Do use XmlDocument or any other XML-specific classes to create your new document, except put a placeholder where the original document needs to go, like ORIGINAL_DOCUMENT_HERE. Then after you've generated the resulting text XML for your new document, replace ORIGINAL_DOCUMENT_HERE with your original received text, and then sign the result.
Not a normal way to work with XML, but should work for your specific use case.
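A minimal sketch of the placeholder approach, assuming the original document is to be embedded inside a wrapper element. The `envelope` element and the helper name are invented for the example; only the placeholder text comes from the answer:

```csharp
using System;
using System.Xml;

public static class PlaceholderEmbedding
{
    // Builds the surrounding document with XmlDocument, then splices the
    // untouched original text in place of the placeholder.
    public static string Embed(string originalXml)
    {
        const string placeholder = "ORIGINAL_DOCUMENT_HERE";

        var doc = new XmlDocument();
        XmlElement envelope = doc.CreateElement("envelope"); // hypothetical wrapper
        envelope.InnerText = placeholder;
        doc.AppendChild(envelope);

        // The original text never passes through a parser, so entities such
        // as &apos; are preserved exactly as received.
        return doc.OuterXml.Replace(placeholder, originalXml);
    }
}
```

The result should then be signed as text; reloading it into XmlDocument before signing would normalise the entities again.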

Get XML from XPathDocument

I am working on a stylesheet and have some initial XML. However, the XML is being manipulated a bit before styling and I would like to get the final XML sent into .Transform(). For instance, ...
XslCompiledTransform.Transform( xpd, xslArg, output )
...I would like to get the XML content of xpd (as a string), so I can work on the stylesheet in other tools.
Is there a quick-and-dirty way to get this? Either in the VS2010 immediate window or as a quick C# line or two before the call to .Transform()?
EDIT: The .Transform() I'm using is
public void Transform(IXPathNavigable input,
XsltArgumentList arguments, TextWriter results);
...and xpd is an XPathDocument.
Edit: I misunderstood the intent of your question. The simple answer is to get the XML for any IXPathNavigable (which includes XPathDocument), you can do this:
string xml = xpd.CreateNavigator().OuterXml;
Below is my original answer, which explains how you could modify the XML from an XPathDocument in code before feeding it into a transform:
If xpd is an XPathDocument, you might be able to just get an XPathNavigator from the XPathDocument:
XPathNavigator xpn = xpd.CreateNavigator();
and use that to modify the XML. When you're done modifying it, you can just pass either xpn or xpd into the Transform() method. On the other hand, MSDN says that XPathDocument's CreateNavigator() creates a readonly navigator, so that may be a bit of a hitch.
If it really is readonly, you should be able to do this:
XmlDocument doc = new XmlDocument();
doc.LoadXml(xpd.CreateNavigator().OuterXml);
then use doc to modify the XML and pass doc into the transform when you're done.
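Put together, a small helper for dumping any IXPathNavigable to a string could look like the sketch below. The class and method names are hypothetical, and the serialized text may be formatted differently from the input (for example re-indented):

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class XPathDocumentDump
{
    // Serializes the content of an XPathDocument (or any IXPathNavigable) to a string.
    public static string ToXmlString(IXPathNavigable input)
    {
        return input.CreateNavigator().OuterXml;
    }

    // Copies the content into an editable XmlDocument.
    public static XmlDocument ToEditableDocument(IXPathNavigable input)
    {
        var doc = new XmlDocument();
        doc.LoadXml(ToXmlString(input));
        return doc;
    }
}
```

In the debugger, evaluating `xpd.CreateNavigator().OuterXml` directly gives the same string without any helper.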

C# XMLDocument Encoding?

I'm trying to code a function that validates an XML settings file, so if a node does not exist in the file, it should create it.
I have this function
private void addMissingSettings() {
XmlDocument xmldocSettings = new XmlDocument();
xmldocSettings.Load("settings.xml");
XmlNode xmlMainNode = xmldocSettings.SelectSingleNode("settings");
XmlNode xmlChildNode = xmldocSettings.CreateElement("ExampleNode");
xmlChildNode.InnerText = "Hello World!";
//add to parent node
xmlMainNode.AppendChild(xmlChildNode);
xmldocSettings.Save("settings.xml");
}
But in my XML file, if I have
<rPortSuffix desc="Read Suffix">&#xC;&#xA;</rPortSuffix>
<wPortSuffix desc="Write Suffix">&#x3;</wPortSuffix>
When I save the document, it saves those lines as
<rPortSuffix desc="Read Suffix">
</rPortSuffix>
<wPortSuffix desc="Write Suffix"></wPortSuffix>
<ExampleNode>Hello World!</ExampleNode>
Is there a way to prevent this behaviour? Like setting a working charset or something like that?
The two files are equivalent, and should be treated as being equivalent by all XML parsers, I believe.
Additionally, Unicode character U+0003 isn't a valid XML character, so you've fundamentally got other problems if you're trying to represent it in your file. Even though that particular .NET XML parser doesn't seem to object, other parsers may well do so.
If you need to represent absolutely arbitrary characters in your XML, I suggest you do so in some other form - e.g.
<rPortSuffix desc="Read Suffix">\u000c\u000a</rPortSuffix>
<wPortSuffix desc="Write Suffix">\u0003</wPortSuffix>
Obviously you'll then need to parse that text appropriately, but at least the XML parser won't get in the way, and you'll be able to represent any UTF-16 code unit.
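One possible way to do that encoding and decoding is sketched below. The class name and the exact escaping rules (control characters and backslashes become `\uXXXX`) are assumptions chosen for the example, not an established convention:

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ControlCharEscaper
{
    // Replaces control characters and backslashes with \uXXXX sequences so
    // the value can be stored as ordinary XML text.
    public static string Escape(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            if (c == '\\' || char.IsControl(c))
                sb.Append("\\u").Append(((int)c).ToString("x4", CultureInfo.InvariantCulture));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Turns every \uXXXX sequence back into the UTF-16 code unit it names.
    public static string Unescape(string value)
    {
        return Regex.Replace(value, @"\\u([0-9a-fA-F]{4})", m =>
            ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber,
                             CultureInfo.InvariantCulture)).ToString());
    }
}
```

Escaping the backslash itself keeps the round trip unambiguous when the original text already contains something that looks like an escape sequence.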
