I know there is already a question about this: Why XElement Value property changing \r\n to \n?, but the question there was that XElement converts \r\n into \n.
My situation is a bit different: I store \n in my XML, but if I save the XML to a file or a file stream, I get \r\n back again (only in the file). Reading it returns \n.
Attached is a small snippet of code to see the result. I can understand that, per https://www.w3schools.com/Xml/xml_syntax.asp, \r\n is converted to \n when writing, but I see no reason why it is written as \r\n to the stream or file.
using System;
using System.IO;
using System.Xml.Linq;
var text = "Test1" + Environment.NewLine + "Test2" + Environment.NewLine + "Test3";
using MemoryStream stream = new();
var xmlSource = new XElement("Data", text.Replace(Environment.NewLine, "\n"));
xmlSource.Save(stream, SaveOptions.None);
stream.Position = 0;
var xmlTarget = XElement.Load(stream);
Console.WriteLine(xmlTarget.Value);
Is there some explanation on the given behavior?
Well, actually... you can use \r\n or \n in an XML document, but a conformant XML parser will normalize either of them to just \n when it reads the document. Usually those line-ending characters are only there so that people can read the document more easily; the application which uses the document normally ignores them. So you've only got a serious problem if the application which uses the document is rejecting or mishandling it.
However, yeah, it's nice for people to be able to read an XML document sometimes.
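One likely explanation for the \r\n in the file, offered as a sketch rather than a definitive diagnosis: XmlWriterSettings.NewLineHandling defaults to Replace, which rewrites line breaks in text content to NewLineChars when writing. The helper below (NewlineDemo and SaveWith are made-up names for illustration) saves the same element through an XmlWriter with an explicit handling mode, so the two results can be compared. NewLineChars is pinned to "\r\n" here only to make the result independent of the platform.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class NewlineDemo
{
    // Serializes the element with the given NewLineHandling and returns the markup.
    public static string SaveWith(XElement element, NewLineHandling handling)
    {
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,   // keep the output to the element itself
            NewLineHandling = handling,
            NewLineChars = "\r\n"        // assumption: fixed so the demo is platform independent
        };

        var sb = new StringBuilder();
        using (var writer = XmlWriter.Create(sb, settings))
        {
            element.Save(writer);
        }
        return sb.ToString();
    }
}
```

With NewLineHandling.None the \n in the text should survive unchanged; with NewLineHandling.Replace it is written as whatever NewLineChars is set to. Either way, reading the document back yields \n again.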
I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all... way too much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
categorieSystemNode.RemoveAll();
XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
string import_id = categories.Attributes["nodeName"].Value;
importIdNode.InnerText = import_id;
categorieSystemNode.AppendChild(importIdNode);
[way more Nodes which I proceed like this]
newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
}
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro>&lt;p&gt;Whether your project [...] &lt;/p&gt;</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas
The main purpose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well-formed document.
So, using InnerText as in your example, you let the framework encode the string and properly insert it into that document. Whenever you read that same value, it will be decoded and returned to you exactly as your original string.
But if you want to add an XML fragment anyway, you should stick with InnerXml or ImportNode. Be aware that this could lead to a more complex document structure, which you would probably like to avoid.
As a third possibility, you can use CreateCDataSection to add a CDATA section and put your text there.
You should definitely stay away from treating that XML document as a string and trying to Replace things; stick with the framework and you'll be OK.
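To make the three options concrete, here is a minimal sketch (the class name, element names, and sample text are invented for illustration) that puts the same string into an element via InnerText, InnerXml, and a CDATA section:

```csharp
using System;
using System.Xml;

public static class InnerTextVsInnerXml
{
    public static string Build()
    {
        var doc = new XmlDocument();
        var root = doc.CreateElement("systems");
        doc.AppendChild(root);

        // InnerText: the string is treated as text, so markup characters are escaped.
        var asText = doc.CreateElement("intro");
        asText.InnerText = "<p>Hello</p>";
        root.AppendChild(asText);

        // InnerXml: the string is parsed as an XML fragment and becomes child nodes.
        var asXml = doc.CreateElement("intro");
        asXml.InnerXml = "<p>Hello</p>";
        root.AppendChild(asXml);

        // CDATA: the string is stored verbatim inside a CDATA section.
        var asCData = doc.CreateElement("intro");
        asCData.AppendChild(doc.CreateCDataSection("<p>Hello</p>"));
        root.AppendChild(asCData);

        return doc.OuterXml;
    }
}
```

The first intro element comes out with &lt;p&gt; escapes, the second contains a real p element, and the third wraps the raw string in <![CDATA[...]]>. Note that InnerXml throws if the fragment is not well formed.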
Reading xml document with hex 0x19 character C#
Apparently the above character is invalid in XML, and the XML document I am about to load contains it.
Therefore it throws an XmlException every time I try to read it.
<description><![CDATA[Whenever I run<br>Whenever I run to you lost one<br>Its never done<br>Just hanging on<br><br>Just past has let me be<br>Returning as if dream<br>Shattered as belief<br><br>If you have to go dont say goodbye<br>If you have to go dont you cry<br>If you have to go I will get by<br>Someday Ill follow you and see you on the other side<br><br>But for the grace of love<br>Id will the meaning of<br>Heaven from above<br><br>Your picture out of time<br>Left aching in my mind<br>Shadows kept alive<br><br>If you have to go dont say goodbye<br>If you have to go dont you cry<br>If you have to go I will get by<br>I will follow you and see you on the other side<br><br>But for the grace of love<br>Id will the meaning of<br>Heaven from above<br><br>Long horses we are born<br>Creatures more than torn<br>Mourning our way home]]></description>
Apparently the above line contains that character
Just had the same problem. 0x19 is a control-character (End of Medium) which is pretty much invisible with text editors. What I did to parse the XML anyway was to replace this char with a space before parsing, like this:
var xmlString = GetXmlString();
xmlString = xmlString.Replace((char)0x19, ' ');
var doc = new XmlDocument();
doc.LoadXml(xmlString);
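If other control characters can turn up as well, the same idea can be generalized. The following is only a sketch: XmlSanitizer and Sanitize are hypothetical names, while XmlConvert.IsXmlChar is the framework's own check for characters allowed in XML 1.0. Every character that fails the check is replaced with a space before parsing.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Replaces every character that XML 1.0 does not allow with the given replacement.
    public static string Sanitize(string input, char replacement = ' ')
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
            {
                // A well-formed surrogate pair encodes a valid supplementary character; keep both halves.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else
            {
                // Control characters, lone surrogates, U+FFFE and U+FFFF end up here.
                sb.Append(replacement);
            }
        }
        return sb.ToString();
    }
}
```

Usage would then be doc.LoadXml(XmlSanitizer.Sanitize(GetXmlString())). Like the one-character Replace above, this changes the data, so it only fits when the stripped characters carry no meaning.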
I'm trying to create a simple app which reads XML using SAX (XmlTextReader) from a stream which contains not only the XML but also other data such as binary blobs and text. The structure of the stream is simply chunk based.
When entering my reading function, the stream is properly positioned at the beginning of the XML. I've reduced the issue to the following code example:
string xml = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><Models />" + (char)0x014;
XmlTextReader reader = new XmlTextReader(new StringReader(xml));
reader.MoveToContent();
reader.ReadStartElement("Models");
These few lines cause an exception when calling ReadStartElement due to the 0x014 at the end of the string.
The interesting thing about it is, that the code runs just fine when using the following input instead:
string xml = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><Models></Models>" + (char)0x014;
I don't want to read the whole document due to its size nor do I want to change the input as I need to stay backward compatible to older data inputs.
The only solution I can think of at first is a custom stream reader which doesn't continue to read after the last ending tag but that would involve some major parsing efforts.
Do you have any ideas on how to solve this issue? I've already tried to use LINQ's XDocument but that also failed.
Thank you very much in advance,
Cheers,
Romout
I don't know if this is quite what you are looking for, but if you instead call:
reader.IsStartElement("Models");
then the <Models/> node will only be tested for whether it is a start tag or empty-element tag and whether the name matches. The reader will not be moved beyond it (the Read() method will not be called).
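Put together with the snippet from the question, that suggestion might look like the sketch below. The wrapper class and method name are invented for illustration; it relies on the behavior reported in the question, namely that MoveToContent succeeds and only the subsequent Read fails.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ModelsProbe
{
    // Returns true if the first content node is a <Models> start tag or empty element,
    // without advancing the reader past it.
    public static bool ModelsElementIsPresent(string xml)
    {
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            reader.MoveToContent();
            return reader.IsStartElement("Models");
        }
    }
}
```

Because the reader never moves beyond the element, the trailing 0x14 is not parsed. For a non-empty Models element you would still need to stop reading at the matching end tag.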
I am trying to generate an XML document using LINQ to XML in which "\n" appears as "&#10;" in the XElement value. No matter what settings I try with the XmlWriter, it still emits a "\n" in the final XML document.
I tried the following: I extended the XmlWriter.
I overrode the XmlWriterSettings and changed the NewLineHandling.
Neither option worked for me.
Any help/pointers will be appreciated.
Regards
Stephen
LINQ to XML works on top of XmlReader/XmlWriter. The XmlReader is an implementation of the XML processor/parser as described in the XML spec. That spec basically says that the parser needs to hide the actual representation in the text from the application above, meaning that both \n and &#10; should be reported as the same thing. That's what it does.
XmlWriter is the same thing backwards. Its purpose is to save the input in such a way that, when parsed, you will get exactly the same thing back.
So writing a text value "\n" will write it such that the parser will report back "\n" (in this case the output text is \n for a text node, but &#xA; for an attribute, due to the normalization which occurs in attribute values).
Following that idea, trying to write a text value "&#10;" will actually write out "&amp;#10;", because when the reader parses that it will get back the original "&#10;".
LINQ to XML uses XmlWriter to save the tree to an XML file, so you will get the above behavior.
You could write the tree into the XmlWriter yourself (or part of it), in which case you get more control. In particular, it will allow you to use the XmlWriter.WriteCharEntity method, which forces the writer to output the specified character as a character entity, that is, in the &#xXX; format. (Note that it will use the hex format, not the decimal.)
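A minimal sketch of that last suggestion, assuming you are willing to write the element by hand (the class name, element name, and text are placeholders):

```csharp
using System;
using System.Text;
using System.Xml;

public static class CharEntityDemo
{
    // Writes <Data> with the line break emitted as a character entity instead of a raw \n.
    public static string Write()
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("Data");
            writer.WriteString("Line1");
            writer.WriteCharEntity('\n');   // emitted in hexadecimal form
            writer.WriteString("Line2");
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

The markup contains the entity, but any parser reading it back still reports a plain "\n", so this only matters to consumers that look at the raw text.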
What is the reason for having the escaped value for '\n' in the XML element? The newline character is valid inside an XML element and when you parse the XML again, it will be parsed as you expect.
What you're looking for would happen if the newline character is placed within the value of an XML attribute:
XElement xEl = new XElement("Root",
new XAttribute("Value",
"Hello," + Environment.NewLine + "World!"));
Console.WriteLine(xEl);
Output:
<Root Value="Hello,&#xD;&#xA;World!" />
Are there any classes to convert ASCII to the XML character set, preferably open source? I will be using this class in either VC++ or C#.
My ASCII text has some printable characters which are not in the XML character set.
I just tried to send a resume which is in the ASCII character set, and when I tried to store it in an online CRM I got this error message:
javax.xml.bind.UnmarshalException
- with linked exception:
[javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,22]
Message: Character reference "" is an invalid XML character.]
Thanks in advance
I had the same problem with Excel using the OpenXML document creation in C#.
My Excel Export feature would blow-up when building a doc with a bad ASCII character.
Somehow the string data, in my company's database, has funky characters in it.
Even though I used the Microsoft DocumentFormat.OpenXML assembly from their OpenXML SDK 2.0, it still didn't take care of this when assigning string values using their objects.
The Fix:
t.Text = Regex.Replace(sValue, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]", "?");
This cleans up the sValue string by replacing the offending characters with a question mark. You could replace them with any string, or just use an empty string to remove them.
The XML spec allows 0x09 (TAB), 0x0A (LF - Line Feed or NL - New Line), and 0x0D (CR - Carriage Return). The regex above takes care not to remove those.
The XML 1.1 Spec allows you to escape some of these characters.
For example, escaping 0x03 as &#x03; displays differently in HTML than in Office documents and Notepad, where it appears as L.
I use Asp.net and this is automatically taken care of in my GridView, so I do not need to replace these values - but I believe it may be the browser that takes care of it for all I know.
I thought of escaping these values in OpenXML, but when I looked at the output, it showed the escape markup. So Mike&#x03;TeeVee still shows up as Mike&#x03;TeeVee in Excel instead of as a rendered character such as MikeLTeeVee. This is why I preferred the Mike?TeeVee approach.
My hunch is that this is a bug in the current OpenXML, which encodes the allowed XML ASCII characters but lets the unsupported ASCII characters slip on through.
UPDATE:
I forgot I could look up how these characters are displayed using the "Open XML SDK 2.0 Productivity Tool" to see inside docs like Excel.
There I found it uses the format: _x0000_
Remember: XML 1.0 does not support escaping these values, but XML 1.1 does, so if you're using 1.1, then you can use this code to escape them.
Regular XML 1.1 Escaping:
t.Text = Regex.Replace(s, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
delegate(Match m)
{
// 0x00 is not supported in XML 1.0 or 1.1, so drop it
return m.Value[0] == '\0'
? ""
// &#x...; takes a hexadecimal code point, so format as hex
: ("&#x" + ((int)m.Value[0]).ToString("X") + ";");
});
If you're escaping strings for OpenXML, then use this instead:
t.Text = Regex.Replace(s, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
delegate(Match m)
{
// 0x00 is not supported in XML 1.0 or 1.1, so drop it
return m.Value[0] == '\0'
? ""
// the _xHHHH_ form takes four hexadecimal digits
: ("_x" + ((int)m.Value[0]).ToString("X4") + "_");
});
Your text won't have any printable characters which aren't available in XML, but it may have some unprintable characters which aren't available in XML.
In particular, Unicode values U+0000 to U+001F are invalid except for tab, carriage return and line feed. If you really need those other control characters, you'll have to create your own form of escaping for them, and unescape them at the other end.
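As a rough sketch of what such a home-grown scheme could look like: the class below escapes each forbidden control character as _xHHHH_ (borrowing the shape of the Open XML convention mentioned earlier) and reverses it on the other end. The class and method names are invented, and the scheme itself is just one possible choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlCharEscaper
{
    // C0 controls that XML 1.0 forbids: everything below U+0020 except tab, LF and CR.
    private static readonly Regex Forbidden = new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F]");

    // The escaped form produced by Escape: _x followed by four uppercase hex digits and _.
    private static readonly Regex Escaped = new Regex(@"_x([0-9A-F]{4})_");

    public static string Escape(string s)
    {
        return Forbidden.Replace(s, m => "_x" + ((int)m.Value[0]).ToString("X4") + "_");
    }

    public static string Unescape(string s)
    {
        return Escaped.Replace(s, m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
    }
}
```

One limitation to keep in mind: ordinary text that already happens to contain something like _x0041_ would be altered by Unescape, so a real implementation would also need to escape such literal sequences.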
The character reference in the error message is indeed not a valid XML character. You probably want either &#10; or &#13;.
Out of curiosity, I took a few minutes to write a simple routine in C# to pump out an XML string of the 128 ASCII characters. To my surprise, .NET didn't output a really valid XML document. I guess the way I output the element text wasn't quite right. Anyway, here is the code (comments are welcome):
XmlDocument doc = new XmlDocument();
doc.AppendChild(doc.CreateXmlDeclaration("1.0", "us-ascii", ""));
XmlElement elem = doc.CreateElement("ASCII");
doc.AppendChild(elem);
byte[] b = new byte[1];
for (int i = 0; i < 128; i++)
{
b[0] = Convert.ToByte(i);
XmlElement e = doc.CreateElement("ASCII_" + i.ToString().PadLeft(3,'0'));
e.InnerText = System.Text.ASCIIEncoding.ASCII.GetString(b);
elem.AppendChild(e);
}
Console.WriteLine(doc.OuterXml);
Here is the formatted output:
<?xml version="1.0" encoding="us-ascii" ?>
<ASCII>
<ASCII_000></ASCII_000>
<ASCII_001></ASCII_001>
<ASCII_002></ASCII_002>
<ASCII_003></ASCII_003>
<ASCII_004></ASCII_004>
<ASCII_005></ASCII_005>
<ASCII_006></ASCII_006>
<ASCII_007></ASCII_007>
<ASCII_008></ASCII_008>
<ASCII_009> </ASCII_009>
<ASCII_010>
</ASCII_010>
<ASCII_011></ASCII_011>
<ASCII_012></ASCII_012>
<ASCII_013>
</ASCII_013>
<ASCII_014></ASCII_014>
<ASCII_015></ASCII_015>
<ASCII_016></ASCII_016>
<ASCII_017></ASCII_017>
<ASCII_018></ASCII_018>
<ASCII_019></ASCII_019>
<ASCII_020></ASCII_020>
<ASCII_021></ASCII_021>
<ASCII_022></ASCII_022>
<ASCII_023></ASCII_023>
<ASCII_024></ASCII_024>
<ASCII_025></ASCII_025>
<ASCII_026></ASCII_026>
<ASCII_027></ASCII_027>
<ASCII_028></ASCII_028>
<ASCII_029></ASCII_029>
<ASCII_030></ASCII_030>
<ASCII_031></ASCII_031>
<ASCII_032> </ASCII_032>
<ASCII_033>!</ASCII_033>
<ASCII_034>"</ASCII_034>
<ASCII_035>#</ASCII_035>
<ASCII_036>$</ASCII_036>
<ASCII_037>%</ASCII_037>
<ASCII_038>&amp;</ASCII_038>
<ASCII_039>'</ASCII_039>
<ASCII_040>(</ASCII_040>
<ASCII_041>)</ASCII_041>
<ASCII_042>*</ASCII_042>
<ASCII_043>+</ASCII_043>
<ASCII_044>,</ASCII_044>
<ASCII_045>-</ASCII_045>
<ASCII_046>.</ASCII_046>
<ASCII_047>/</ASCII_047>
<ASCII_048>0</ASCII_048>
<ASCII_049>1</ASCII_049>
<ASCII_050>2</ASCII_050>
<ASCII_051>3</ASCII_051>
<ASCII_052>4</ASCII_052>
<ASCII_053>5</ASCII_053>
<ASCII_054>6</ASCII_054>
<ASCII_055>7</ASCII_055>
<ASCII_056>8</ASCII_056>
<ASCII_057>9</ASCII_057>
<ASCII_058>:</ASCII_058>
<ASCII_059>;</ASCII_059>
<ASCII_060>&lt;</ASCII_060>
<ASCII_061>=</ASCII_061>
<ASCII_062>&gt;</ASCII_062>
<ASCII_063>?</ASCII_063>
<ASCII_064>@</ASCII_064>
<ASCII_065>A</ASCII_065>
<ASCII_066>B</ASCII_066>
<ASCII_067>C</ASCII_067>
<ASCII_068>D</ASCII_068>
<ASCII_069>E</ASCII_069>
<ASCII_070>F</ASCII_070>
<ASCII_071>G</ASCII_071>
<ASCII_072>H</ASCII_072>
<ASCII_073>I</ASCII_073>
<ASCII_074>J</ASCII_074>
<ASCII_075>K</ASCII_075>
<ASCII_076>L</ASCII_076>
<ASCII_077>M</ASCII_077>
<ASCII_078>N</ASCII_078>
<ASCII_079>O</ASCII_079>
<ASCII_080>P</ASCII_080>
<ASCII_081>Q</ASCII_081>
<ASCII_082>R</ASCII_082>
<ASCII_083>S</ASCII_083>
<ASCII_084>T</ASCII_084>
<ASCII_085>U</ASCII_085>
<ASCII_086>V</ASCII_086>
<ASCII_087>W</ASCII_087>
<ASCII_088>X</ASCII_088>
<ASCII_089>Y</ASCII_089>
<ASCII_090>Z</ASCII_090>
<ASCII_091>[</ASCII_091>
<ASCII_092>\</ASCII_092>
<ASCII_093>]</ASCII_093>
<ASCII_094>^</ASCII_094>
<ASCII_095>_</ASCII_095>
<ASCII_096>`</ASCII_096>
<ASCII_097>a</ASCII_097>
<ASCII_098>b</ASCII_098>
<ASCII_099>c</ASCII_099>
<ASCII_100>d</ASCII_100>
<ASCII_101>e</ASCII_101>
<ASCII_102>f</ASCII_102>
<ASCII_103>g</ASCII_103>
<ASCII_104>h</ASCII_104>
<ASCII_105>i</ASCII_105>
<ASCII_106>j</ASCII_106>
<ASCII_107>k</ASCII_107>
<ASCII_108>l</ASCII_108>
<ASCII_109>m</ASCII_109>
<ASCII_110>n</ASCII_110>
<ASCII_111>o</ASCII_111>
<ASCII_112>p</ASCII_112>
<ASCII_113>q</ASCII_113>
<ASCII_114>r</ASCII_114>
<ASCII_115>s</ASCII_115>
<ASCII_116>t</ASCII_116>
<ASCII_117>u</ASCII_117>
<ASCII_118>v</ASCII_118>
<ASCII_119>w</ASCII_119>
<ASCII_120>x</ASCII_120>
<ASCII_121>y</ASCII_121>
<ASCII_122>z</ASCII_122>
<ASCII_123>{</ASCII_123>
<ASCII_124>|</ASCII_124>
<ASCII_125>}</ASCII_125>
<ASCII_126>~</ASCII_126>
<ASCII_127></ASCII_127>
</ASCII>
Update:
Added XML declaration with "us-ascii" encoding
Possibly you don't fully understand what a character set is. XML is not a character set, though XML based output does use character sets to encode data.
I'd recommend reading through Joel Spolsky's excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), then come back and have another go at your question.
You won't need an additional library to do that. From different encodings to embedded binary data, all of that is possible through the common .NET library. Can you just give a simple example?