How do I preserve whitespace characters when parsing XML from C# LINQ - c#

What do I need to do in either my C# code or my XML document so that the XDocument parser reads literal whitespace for Values of XElements?
Background
I have an XML document, part of which looks like this:
<NewLineString>
</NewLineString>
<IndentString>    </IndentString>
I'm adding the values of each XElement to a data dictionary using a LINQ query; the .ForEach part looks like this:
.ForEach(x => SchemaDictionary.Add(
LogicHelper.GetEnumValue(x.Name.ToString()), x.Value));
To test to see if the whitespace values were preserved, I'm printing out a line of the character numbers of each value item in the data dictionary. In the following code, x represents a KeyValuePair and the Aggregate is simply making a string of the character integer values:
x.Value.ToCharArray()
.Aggregate<char,string>("",(word,c) => word + ((int)c).ToString() + " " )
));
I expected to see 10 13 for the <NewLineString> value and 32 32 32 32 for the <IndentString> value. However, nothing was printed for either value (note: other escaped values in the XML, such as &lt;, printed their character numbers correctly).
What do I need to do in either my C# code or my XML document so that my parser adds the complete whitespace string to the Data Dictionary?

Try loading your XDocument with the LoadOptions.PreserveWhitespace
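A minimal sketch of what that suggestion could look like, assuming the XML is available as a string (XDocument.Load accepts the same LoadOptions argument for files). The class and method names here are hypothetical and only for illustration:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class WhitespaceDemo
{
    // Parses the XML with whitespace preservation turned on and returns
    // the character codes of the named child element's value,
    // separated by single spaces.
    public static string CharCodes(string xml, string elementName)
    {
        XDocument doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);
        string value = doc.Root.Element(elementName).Value;
        return string.Join(" ", value.Select(c => (int)c));
    }
}
```

Calling WhitespaceDemo.CharCodes("<Root><IndentString>    </IndentString></Root>", "IndentString") should report four 32s. For the newline element, expect the parser to normalize line endings, so a CR LF pair in the file will most likely come back as a single 10 rather than 13 10.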

Try loading your document this way.
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.Load("book.xml");

or just modify your input xml to:
<NewLineString>
</NewLineString>
<IndentString xml:space="preserve">    </IndentString>

Related

Trouble with hexadecimal chars in XML data

I am trying to create an XML document using LINQ.
XElement element = new XElement("ManufacturerName", supplierName);
XDocument doc = new XDocument(element);
doc.Save("Sample.xml");
The supplierName has a special char at the end whose hexadecimal value is 0x1f.
Because of this character, the document cannot be saved.
In this instance it is this value; for other inputs it may be a different one.
So is there a way to remove any / all special chars?
Thanks in advance.
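One possible approach, not taken from the original thread: filter the string through XmlConvert.IsXmlChar before building the element. This is only a sketch; the class and method names are hypothetical, and it assumes that dropping the offending characters (rather than replacing them) is acceptable:

```csharp
using System;
using System.Linq;
using System.Xml;

static class XmlSanitizer
{
    // Removes every char that XmlConvert.IsXmlChar rejects, such as the
    // control character 0x1F. Assumption: characters are checked one at a
    // time, so surrogate pairs may be dropped as well; handle them
    // separately if the data can contain them.
    public static string StripInvalidXmlChars(string text)
    {
        return new string(text.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
    }
}
```

The cleaned value could then be passed to the constructor, for example new XElement("ManufacturerName", XmlSanitizer.StripInvalidXmlChars(supplierName)).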

Parsing XML which contains illegal characters

A message I receive from a server contains tags and in the tags is the data I need.
I try to parse the payload as XML but illegal character exceptions are generated.
I also made use of httpUtility and Security Utility to escape the illegal characters, only problem is, it will escape < > which is needed to parse the XML.
My question is, how do I parse XML when the data contained in it contains illegal non-XML characters? (& -> &amp;)
Thanks.
Example:
<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>
If & is the only invalid character, then you can use a regex to replace it with &amp;. We use a regex to avoid replacing the ampersands of already existing &amp;, &quot;, numeric character references, etc.
Regex can be as follows:
&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)
Sample code:
string content = @"<item><code>1234 & test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&amp;", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);
Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.
When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.
If you can't do that, you need to find some way using non-XML tools (e.g. custom perl scripts) to repair the XML before you let it anywhere near an XML parser. The way you do this will depend on the nature of the errors you need to repair.
Here is a more generalized solution than a regex. First declare an array and store in it each invalid character that you want to replace with its encoded version:
var invalidChars = new[] { '&' /* other characters go here */ };
Then read all the xml as a whole text:
var xmlContent = File.ReadAllText("path");
Then replace the invalid chars using LINQ and HttpUtility.HtmlEncode:
var validContent = string.Concat(xmlContent
.Select(x =>
{
if (invalidChars.Contains(x)) return HttpUtility.HtmlEncode(x);
return x.ToString();
}));
Then parse it using XDocument.Parse, that's all.

How to have special entities like &#10; in the XML document output while using LINQ to XML

I am trying to generate an XML document using LINQ to XML in which "\n" appears as "&#10;" in the XElement value. No matter what settings I try with the XmlWriter, it still emits a "\n" in the final XML document.
I tried the following: I extended the XmlWriter,
and I overrode the XmlWriterSettings and changed the NewLineHandling.
Neither option worked for me.
Any help/pointers will be appreciated.
Regards
Stephen
LINQ to XML works on top of XmlReader/XmlWriter. The XmlReader is an implementation of the XML processor/parser as described in the XML spec. That spec basically says that the parser needs to hide the actual representation in the text from the application above, meaning that both \n and &#10; should be reported as the same thing. That's what it does.
XmlWriter is the same thing backwards. Its purpose is to save the input in such a way that when parsed you will get exactly the same thing back.
So writing a text value "\n" will write it such that the parser will report back "\n" (in this case the output text is \n for a text node, but &#xA; for an attribute, due to the normalization which occurs in attribute values).
Following that idea, trying to write a text value "&#10;" will actually write out "&amp;#10;", because when the reader parses that it will get back the original "&#10;".
LINQ to XML uses XmlWriter to save the tree to an XML file. So you will get the above behavior.
You could write the tree into the XmlWriter yourself (or part of it), in which case you get more control. In particular, it will allow you to use the XmlWriter.WriteCharEntity method, which forces the writer to output the specified character as a character entity, that is, in the &#xXX; format. (Note that it will use the hex format, not the decimal.)
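A rough sketch of writing an element by hand with XmlWriter.WriteCharEntity. The element name, the text, and the class and method names are made up for illustration:

```csharp
using System;
using System.Text;
using System.Xml;

static class CharEntityDemo
{
    // Writes <Root>Hello,[entity]World!</Root>, forcing the line feed
    // to be emitted as a character entity instead of a raw newline.
    public static string WriteWithEntity()
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("Root");
            writer.WriteString("Hello,");
            writer.WriteCharEntity('\n'); // emitted in hex form, not decimal
            writer.WriteString("World!");
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

To apply this to an existing LINQ to XML tree you would have to walk the tree and write the affected text nodes yourself instead of calling Save on them.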
What is the reason for having the escaped value for '\n' in the XML element? The newline character is valid inside an XML element and when you parse the XML again, it will be parsed as you expect.
What you're looking for would happen if the newline character is placed within the value of an XML attribute:
XElement xEl = new XElement("Root",
new XAttribute("Value",
"Hello," + Environment.NewLine + "World!"));
Console.WriteLine(xEl);
Output:
<Root Value="Hello,&#xD;&#xA;World!" />

parsing XML with ampersand

I have a string which contains XML. I just want to parse it into an XElement, but it has an ampersand. I still have a problem parsing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XMLException.
string encodedXml = test.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
Or I even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);
Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&amp;".Replace("&", "&amp;") results in wow&amp;amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other character entities, such as &lt;, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&amp;");
The above works, but admittedly it doesn't cover the variety of other entities that start with an ampersand, such as &nbsp;, and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&amp;" the decode process would return "&wow&" then re-encoding it would return "&amp;wow&amp;", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);
Your string doesn't contain valid XML, that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&amp;" /></SubXML></MyXML>
HtmlEncode will not do the trick; it will probably create even more ampersands (for instance, a " might become &quot;, which is one of the XML entity references). Those are the following:
&amp; &
&apos; '
&quot; "
&lt; <
&gt; >
But you might also get things like &nbsp;, which is fine in HTML but not in XML. Therefore, like everybody else said, correct the XML first by making sure that any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your XML as a variable or text) and that occurs in the entity reference list is translated to its corresponding entity (so < would become &lt;). If the text containing the illegal character is text inside an XML node, you could take the easy way and surround the text with a CDATA section; this won't work for attributes, though.
Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;
The ampersand makes the XML invalid. This cannot be fixed by a stylesheet, so you need to write code with some other tool or in VB/C#/PHP/Delphi/Lisp/etc. to remove it or to translate it to &amp;.
This is the simplest and best approach. It works with all characters and lets you parse XML for any web service call, e.g. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
var node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.
You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you, as it will encode your '<' and '>' symbols as well and your string will no longer be XML.
I think that for this case the best solution would be to replace '&' with '&amp;'.
Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.

Create XML using Linq to XML and arrays

I am using Linq To XML to create XML that is sent to a third party. I am having difficulty understanding how to create the XML using Linq when part of information I want to send in the XML will be dynamic.
The dynamic part of the XML is held as a string[,] array. This multi dimensional array holds 2 values.
I can 'build' the dynamic XML up using a StringBuilder and store the values that were in the array in a string variable, but when I try to include this variable in the LINQ expression, its content is HTML-encoded rather than included as proper XML.
How would I go about adding in my dynamically built string to the XML being built up by Linq?
For Example:
//string below contains values passed into my class
string[,] AccessoriesSelected;
//I loop through the above array and build up my 'Tag' and store in string called AccessoriesXML
//simple linq to xml example with my AccessoriesXML value passed into it
XDocument RequestDoc = new XDocument(
new XElement("MainTag",
new XAttribute("Innervalue", "2")
),
AccessoriesXML);
'Tag' is an optional extra, it might appear in my XML multiple times or it might not - it's dependent on a user checking some checkboxes.
Right now when I run my code I see this:
<MainTag> blah blah </MainTag>
&lt;Tag&gt;&lt;InnerTag&gt; option1="valuefromarray0" option2="valuefromarray1" /&gt;&lt;Tag/&gt;
I want to return something like this:
<MainTag> blah blah </MainTag>
<Tag><InnerTag option1="valuefromarray0" option2="valuefromarray1" /></Tag>
<Tag><InnerTag option1="valuefromarray0" option2="valuefromarray1" /></Tag>
Any thoughts or suggestions? I can get this working using XmlDocument but I would like to get this working with Linq if it is possible.
Thanks for your help,
Rich
Building XElements with the ("name", "value") constructor will use the value text as literal text - and escape it if necessary to achieve that.
If you want to create the XElement programmatically from a snippet of XML text that you want to actually be interpreted as XML, you should use XElement.Load() (or XElement.Parse() when the XML is already in a string). This will parse the string as actual XML, instead of trying to assign the text of the string as an escaped literal value. Note that the snippet must have a single root element for this to work.
Try this:
XDocument RequestDoc = new XDocument(
new XElement("MainTag",
new XAttribute("Innervalue", "2")
),
XElement.Load(new StringReader(AccessoriesXML)));
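As an alternative to building a string and re-parsing it, the elements can be created straight from the array. The following is only a sketch: the "Request" wrapper element and the class and method names are assumptions added for illustration (an XML document can only have one root element), while the Tag/InnerTag/option1/option2 names are taken from the question:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class AccessoryXmlBuilder
{
    // Builds one <Tag><InnerTag option1=".." option2=".." /></Tag> per row
    // of the two-column array and places them after <MainTag> inside a
    // hypothetical <Request> wrapper element.
    public static XElement Build(string[,] accessoriesSelected)
    {
        var tags = Enumerable.Range(0, accessoriesSelected.GetLength(0))
            .Select(i => new XElement("Tag",
                new XElement("InnerTag",
                    new XAttribute("option1", accessoriesSelected[i, 0]),
                    new XAttribute("option2", accessoriesSelected[i, 1]))));

        return new XElement("Request",
            new XElement("MainTag", new XAttribute("Innervalue", "2")),
            tags);
    }
}
```

Because the attribute values are passed as plain strings, LINQ to XML escapes them where needed, and an empty array simply produces no Tag elements.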
