Keep special characters in XML - c#

I have a requirement where I need to read an XML file that may contain special characters, and I need to keep those special characters as-is. However, after calling XDocument.Load(), &apos; is turned into ' and &amp; into &.
Here is what the XML file may look like:
<root>
<child>This is a text with special characters such as &apos; and &amp;</child>
</root>
XDocument xDocument = null;
xDocument = XDocument.Load("myFile.xml", LoadOptions.SetBaseUri | LoadOptions.SetLineInfo | LoadOptions.PreserveWhitespace);
I've tried with encoding, but with no success. For example:
using (StreamReader oReader = new StreamReader("myFile.xml", Encoding.GetEncoding("utf-8")))
{
xDocument = XDocument.Load(oReader);
}
or
xDocument = XDocument.Parse(File.ReadAllText("myFile.xml", Encoding.UTF8));
Is there anything else that I can try?
Thanks.
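For reference, what is described above is the parser resolving entity references rather than an encoding problem: once parsed, &apos; and ' are the same character. A minimal sketch (the EntityDemo class, its method names, and the sample string are made up for illustration) shows the in-memory value and what is written back. On output, & is escaped again, while an apostrophe in element text is written literally.

```csharp
using System;
using System.Xml.Linq;

public static class EntityDemo
{
    // Hypothetical helper: returns the parsed text of <child>.
    public static string ReadChild(string xml) =>
        XDocument.Parse(xml).Root.Element("child").Value;

    // Hypothetical helper: parses the XML and serializes it back without reformatting.
    public static string RoundTrip(string xml) =>
        XDocument.Parse(xml).ToString(SaveOptions.DisableFormatting);

    public static void Main()
    {
        const string xml = "<root><child>a &apos; b &amp; c</child></root>";
        Console.WriteLine(ReadChild(xml));  // a ' b & c
        Console.WriteLine(RoundTrip(xml));  // <root><child>a ' b &amp; c</child></root>
    }
}
```

So the & survives a load/save round trip as &amp;, but the original &apos; spelling is not remembered by LINQ to XML.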

Related

replace new lines with "" in C#

I want to convert this:
<translation>
1 Sənədlər
</translation>
to
<translation>1 Sənədlər</translation> in XML using C#.
Please help me. This should apply only to the translation tags.
I tried this:
XDocument xdoc = XDocument.Load(path);
xdoc.Save(path, SaveOptions.DisableFormatting);
But it does not remove the new lines between <translation> tags.
What you have should work. You can verify it by dumping the XDocument to a string variable to confirm whether SaveOptions.DisableFormatting is removing the formatting.
For example, I tried the code below, and content does not contain any formatting, including newlines and whitespace.
XDocument xmlDoc = new XDocument(new XElement("Team", new XElement("Developer", "Sam")));
var content = xmlDoc.ToString(SaveOptions.DisableFormatting);
A new line is represented in code by "\n" and possibly also "\r". You can simply remove these:
string xmlString = "<translation>\r\n1 Sənədlər\r\n</translation>"; // With the 'new lines'
xmlString = xmlString.Replace("\r", "").Replace("\n", "");
This turns:
<translation>
1 Sənədlər
</translation>
into:
<translation>1 Sənədlər</translation>
I hope this helps.
You can strip out newlines manually in an environment-sensitive way by using
var content = xmlString.Replace(Environment.NewLine, string.Empty);
XML defines two types of whitespace, significant and insignificant.
Insignificant whitespace is the whitespace between elements where text content doesn't occur, whereas significant whitespace is the whitespace within elements that contain text content. You might find the graphic in this article useful to show the difference.
What you have in your translation element is significant whitespace; the element contains text so it is assumed to be part of the element contents. Without a schema or DTD that says it can be collapsed, no amount of changing the whitespace handling on read or write is going to remove this. These options only relate to the insignificant whitespace.
What you can do is apply your own processing: using LINQ to XML, you can trim the whitespace of all elements that contain only text using something like this:
var textElements = doc.Descendants()
.Where(element => element.Nodes().All(node => node is XText));
foreach (var element in textElements)
{
element.Value = element.Value.Trim();
}
See this fiddle for a demo.

XDocument Parse Ignore Chinese Characters

I have an XML string which contains some Chinese characters like �菅࿼Ჽ탽᫴. When parsing it with XDocument.Parse, it throws the exception below.
System.Xml.XmlException: '', hexadecimal value 0x01, is an invalid character
I tried converting the XML string to UTF-8, but the issue remains.
Any ideas?
Update:
The XML contains lots of elements, but with the answer below all the other elements are ignored and only the elements with special characters are converted. Is there anything that can be done with XDocument instead of XElement?
Using an XmlReader with XmlReaderSettings.CheckCharacters set to false will solve your issue.
UPDATE
Here is what I used to load my Japanese XML file.
string xmlText = "your xml data";
XElement node;
XmlReaderSettings xrs = new XmlReaderSettings();
xrs.CheckCharacters = false;
using (XmlReader rd = XmlReader.Create(new StringReader(xmlText), xrs))
{
node = XElement.Load(rd);
}
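Regarding the update in the question: the same reader can be handed to XDocument.Load, which keeps the whole document rather than a single element. Below is a minimal sketch with a made-up sample string and a hypothetical LenientXml.Load helper. As far as I know, CheckCharacters = false only relaxes the check on character references such as &#x1;, so a raw invalid character in the text may still be rejected by the reader.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class LenientXml
{
    // Hypothetical helper: loads a full XDocument with character checking turned off.
    public static XDocument Load(string xmlText)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var reader = XmlReader.Create(new StringReader(xmlText), settings))
        {
            return XDocument.Load(reader);
        }
    }

    public static void Main()
    {
        // Sample input (an assumption): one normal element, one with an invalid character reference.
        var doc = Load("<root><a>ok</a><b>x&#x1;y</b></root>");
        Console.WriteLine(doc.Root.Element("a").Value);         // ok
        Console.WriteLine(doc.Root.Element("b").Value.Length);  // 3
    }
}
```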

convert KOI8-R xml node into unicode in c#

I have the following xml:
<root>
<text><![CDATA[ОПЕЛХМЮБЮ ОПЕГ БЗПРЪЫ ЯЕ АЮПЮАЮМ, Б ЙНИРН ЯЕ]]></text>
</root>
I know this text was generated using the KOI8-R encoding (it displays correctly in my text editor only when I select this encoding while opening the XML file as text), and I would like to convert the value of this node into a string usable in C#. I can read the InnerText value of this node, but it's not what I'm expecting. Can someone show me the correct way to convert a string written with this encoding into a Unicode one?
Update
Following Jon Skeet's suggestion, the solution would look like this:
Encoding encoding = Encoding.GetEncoding("KOI8-R");
XmlDocument doc2 = new XmlDocument();
using (TextReader tr = new StreamReader(outputPath, encoding))
{
doc2.Load(tr);
}
How do you have that XML? It should have an XML declaration stating which encoding it's using; otherwise it's not correct simply in XML terms. You shouldn't be worrying about encodings after you've parsed the XML. So potentially something like:
Encoding encoding = Encoding.GetEncoding("KOI8-R");
XDocument doc;
using (var reader = new StreamReader("file.xml", encoding))
{
doc = XDocument.Load(reader);
}
... but as I say, the file itself should declare the encoding.

Escaping ONLY contents of Node in XML

I have a piece of code like the one below.
// Read from a file and assign the content to the variable named "s"
string s = "<item><name> Foo </name></item>";
XmlDocument doc = new XmlDocument();
doc.LoadXml(s);
But it stops working if the content has characters such as "<" or ">".
string s = "<item><name> Foo > Bar </name></item>";
I know I have to escape those characters before loading, but if I do it like this:
doc.LoadXml(System.Security.SecurityElement.Escape(s));
the tag delimiters (< and >) are escaped as well, and an error occurs.
How can I solve this problem?
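A tricky solution:
string s = "<item><name> Foo > Bar </name></item>";
// Encode the tags so they are protected, then swap the remaining raw < and > for placeholders.
s = Regex.Replace(s, @"<[^>]+?>", m => HttpUtility.HtmlEncode(m.Value)).Replace("<", "ojlovecd").Replace(">", "cdloveoj");
// Restore the tags, then turn the placeholders into entities.
s = HttpUtility.HtmlDecode(s).Replace("ojlovecd", "&lt;").Replace("cdloveoj", "&gt;");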
XmlDocument doc = new XmlDocument();
doc.LoadXml(s);
Assuming your content will never contain the characters "]]>", you can use CDATA.
string s = "<item><name><![CDATA[ Foo > Bar ]]></name></item>";
Otherwise, you'll need to html encode your special characters, and decode them before you use/display them (unless it's in a browser).
string s = "<item><name> Foo &gt; Bar </name></item>";
Assign the content of string to the InnerXml property of node.
var node = doc.CreateElement("root");
node.InnerXml = s;
Take a look at - Different ways how to escape an XML string in C#
It looks like what you have are plain strings, not valid XML. You can either have them generated as valid XML or, if you know the string is always going to be the name, leave the XML <item> and <name> tags out of the data.
Then, when you create the XmlDocument, call CreateElement and assign your string before saving the result.
XmlDocument doc = new XmlDocument();
XmlElement root = doc.CreateElement("item");
doc.AppendChild(root);
XmlElement name = doc.CreateElement("name");
name.InnerText = "the contents from your file";
root.AppendChild(name);
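To see that the DOM does the escaping for you with this approach, here is a small self-contained sketch. The NameDocument.Build helper and the sample text are assumptions for illustration.

```csharp
using System;
using System.Xml;

public static class NameDocument
{
    // Hypothetical helper: builds <item><name>...</name></item> from raw, unescaped text.
    public static string Build(string rawName)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("item");
        doc.AppendChild(root);
        XmlElement name = doc.CreateElement("name");
        name.InnerText = rawName;  // stored as text, not parsed as markup
        root.AppendChild(name);
        return doc.OuterXml;       // special characters are escaped on serialization
    }

    public static void Main()
    {
        Console.WriteLine(Build("Foo < Bar & Baz"));
        // <item><name>Foo &lt; Bar &amp; Baz</name></item>
    }
}
```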

XmlReader read document with unescaped &s

I am trying to parse an XML document that I received as a string from a web service call.
String content = ...;//long xml document
using(TextReader reader = new StringReader(content))
using(XmlReader xml_reader = XmlReader.Create(reader, settings))
{
XML = new XPathDocument(xml_reader);
}
However, I get an exception:
An error occurred while parsing EntityName. Line 1, position 1721.
I looked through the document around that character, and it was in the middle of a random tag. However, about 20-30 characters earlier I noticed unescaped ampersands (& characters), so I think that is the problem.
Running:
content.Substring(1700, 100);//results in the following text
"alue>1 time per day& with^honey~&water\\\\</Value></Frequency></Direction> </Directions> "
^unescaped & char 1721 is the 'w'
How can I successfully read this document as XML?
Verify that your XML encoding matches theirs (declared at the top of the document, something like <?xml version="1.0" encoding="ISO-8859-9"?>). Substitute the value from the web service's XML document for webserviceEncoding below:
using(XmlReader r = XmlReader.Create(new StreamReader(fileName, Encoding.GetEncoding(webserviceEncoding)))) {
XML = new XPathDocument( r );
// ...
}
If that doesn't work
Replace it in the string prior to loading it into an xml parser
Notify the webservice vendor
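The "replace it in the string" option could look roughly like this. It is a heuristic sketch, not a general fix: the AmpersandFixer class is hypothetical, and the pattern assumes that any & which does not start one of the five predefined entities or a numeric character reference is a stray ampersand.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandFixer
{
    // Matches an & that is NOT the start of &amp; &lt; &gt; &apos; &quot; or &#...; / &#x...;
    private static readonly Regex StrayAmpersand = new Regex(
        @"&(?!(?:amp|lt|gt|apos|quot|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: escapes stray ampersands and leaves existing entities alone.
    public static string EscapeStrayAmpersands(string xml) =>
        StrayAmpersand.Replace(xml, "&amp;");

    public static void Main()
    {
        // Sample input modelled on the fragment in the question, plus one valid entity.
        string content = "<Value>1 time per day& with^honey~&water &amp; tea</Value>";
        string repaired = EscapeStrayAmpersands(content);
        Console.WriteLine(XElement.Parse(repaired).Value);
        // 1 time per day& with^honey~&water & tea
    }
}
```

The repaired string can then be passed to the StringReader/XmlReader code from the question.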
