Error parsing XML string to XDocument - c#

I have this XML string bn:
<Root><Row><ITEMNO>1</ITEMNO><USED>y</USED><PARTSOURCE>Buy</PARTSOURCE><QTY>2</QTY></Row><Row><ITEMNO>5</ITEMNO><PARTSOURCE>Buy</PARTSOURCE><QTY>5</QTY></Row></Root>
I am trying to convert it to an XDocument like this:
var doc = XDocument.Parse(bn);
However, I get this error:
Data at the root level is invalid. Line 1, position 1.
Am I missing something?
UPDATE:
This is the method I use to create the xml string:
public static string SerializeObjectToXml(Root rt)
{
var memoryStream = new MemoryStream();
var xmlSerializer = new XmlSerializer(typeof(Root));
var xmlTextWriter = new XmlTextWriter(memoryStream, Encoding.UTF8);
xmlSerializer.Serialize(xmlTextWriter, rt);
memoryStream = (MemoryStream)xmlTextWriter.BaseStream;
string xmlString = ByteArrayToStringUtf8(memoryStream.ToArray());
xmlTextWriter.Close();
memoryStream.Close();
memoryStream.Dispose();
return xmlString;
}
It does add extra characters to the start that I have to remove. Could I change something to make it correct from the start?

There are two characters at the beginning of your string that, although you can't see them, are still there and make the parse fail. Try this instead:
<Root><Row><ITEMNO>1</ITEMNO><USED>y</USED><PARTSOURCE>Buy</PARTSOURCE><QTY>2</QTY></Row><Row><ITEMNO>5</ITEMNO><PARTSOURCE>Buy</PARTSOURCE><QTY>5</QTY></Row></Root>
The character in question is U+FEFF, the byte-order mark, which tells the program reading the data whether it is big- or little-endian. It seems like you copied and pasted this from a file that wasn't decoded properly.
To remove it, you could use this (Replace returns a new string, so assign the result):
yourString = yourString.Replace("\uFEFF", "");

You have two unprintable characters (Zero-Width No-break Space) at the beginning of your string.
XML does not allow text outside the root element.

The accepted answer does unnecessary string processing but, in its defense, that's because you're dealing in strings when you don't have to. One of the great things about the .NET XML APIs is that they have robust internals. So instead of feeding a string to XDocument.Parse, feed a Stream or some type of TextReader to XDocument.Load. This way you aren't manually managing the encoding and the problems that creates, because the internals handle all of that for you. Byte-order marks are a pain in the neck, but if you're dealing in XML, .NET makes them easier to handle.
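A minimal sketch of that suggestion, assuming the XML is available as UTF-8 bytes that begin with a byte-order mark. The class and method names (BomLoadDemo, SampleWithBom, LoadFromBytes) are made up for illustration; XDocument.Load(Stream) is the only framework call doing the real work.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class BomLoadDemo
{
    // Hypothetical helper: builds UTF-8 bytes that start with the BOM (EF BB BF),
    // similar to what a BOM-emitting writer would leave in a MemoryStream.
    public static byte[] SampleWithBom()
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes("<Root><Row><ITEMNO>1</ITEMNO></Row></Root>");
        byte[] bytes = new byte[preamble.Length + body.Length];
        preamble.CopyTo(bytes, 0);
        body.CopyTo(bytes, preamble.Length);
        return bytes;
    }

    // Hands the raw bytes to the XML reader so it can detect the encoding
    // and consume the BOM itself, with no string cleanup beforehand.
    public static XDocument LoadFromBytes(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        {
            return XDocument.Load(stream);
        }
    }

    public static void Main()
    {
        XDocument doc = LoadFromBytes(SampleWithBom());
        Console.WriteLine(doc.Root.Name);
    }
}
```

If a string is still needed, passing new UTF8Encoding(false) to the writer instead of Encoding.UTF8 should avoid emitting the BOM in the first place.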

Related

C# Xml Encoding

I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all... way too much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
categorieSystemNode.RemoveAll();
XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
string import_id = categories.Attributes["nodeName"].Value;
importIdNode.InnerText = import_id;
categorieSystemNode.AppendChild(importIdNode);
[way more Nodes which I proceed like this]
}
newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro>&lt;p&gt;Whether your project [...] &lt;/p&gt;</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas
The main purpose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well-formed document.
So, when you use InnerText as in your example, you let the framework encode the string and properly insert it into that document. Whenever you read that same value, it will be decoded and returned to you exactly as your original string.
But if you want to add an XML fragment anyway, you should stick with InnerXml or ImportNode. Be aware that this can lead to a more complex document structure, which you would probably like to avoid.
As a third possibility, you can use CreateCDataSection to add a CDATA section and put your text there.
You should definitely stay away from treating that XML document as a string and trying to Replace things; stick with the framework and you'll be OK.
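The three options above can be compared side by side. This is only a sketch: the element names a, b and c and the InnerTextDemo class are invented for the example, and the fragment is a shortened stand-in for the real content.

```csharp
using System;
using System.Xml;

public static class InnerTextDemo
{
    public const string Fragment = "<p>Whether your project</p>";

    // Builds a small document and fills three sibling elements three different ways.
    public static XmlDocument Build()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<systems><a/><b/><c/></systems>");

        // InnerText: the string is stored as text, so markup characters are escaped on output.
        doc.SelectSingleNode("systems/a").InnerText = Fragment;

        // InnerXml: the string is parsed, so <p> becomes a real child element.
        doc.SelectSingleNode("systems/b").InnerXml = Fragment;

        // CDATA: the string is stored verbatim inside a CDATA section.
        doc.SelectSingleNode("systems/c").AppendChild(doc.CreateCDataSection(Fragment));

        return doc;
    }

    public static void Main()
    {
        XmlDocument doc = Build();
        foreach (XmlNode node in doc.DocumentElement.ChildNodes)
        {
            Console.WriteLine(node.OuterXml);
        }
    }
}
```

In all three cases InnerText read back from the element returns the same characters; only the serialized form differs.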

C# XPathDocument parsing string to XML with BOM

In some C# code, I am parsing a string to XML using XPathDocument.
The string is retrieved from SDL Trados Studio and, depending on the XML being worked on (how it was originally created and loaded for translation), it sometimes has a BOM and sometimes does not.
Edit: The 'xml' is actually built from the segments of the source and target text and the structure elements. The textual elements are escaped for XML, and the markup and text are joined into one string. So if the markup has a BOM in the XLIFF, then the string will have a BOM.
I am trying to parse any of these XML documents, independent of encoding. At this point my solution is to remove the BOM with Substring.
Here is my code:
//Recreate XML files (extractor returns two string arrays)
string strSourceXML = String.Join("", extractor.TextSrc);
string strTargetXML = String.Join("", extractor.TextTgt);
//strip BOM
strSourceXML = strSourceXML.Substring(strSourceXML.IndexOf("<?"));
strTargetXML = strTargetXML.Substring(strTargetXML.IndexOf("<?"));
//Transform XML with the preview XSL
var xSourceDoc = new XPathDocument(strSourceXML);
var xTargetDoc = new XPathDocument(strTargetXML);
I have searched for a better solution, through several articles, such as these, but I found no better solution yet:
XML - Data At Root Level is Invalid
Parsing XML with C#
Parsing complex XML with C#
Parsing : String to XML
XmlReader breaks on UTF-8 BOM
Any advice to solve this more elegantly?
The XPathDocument constructor that takes a String argument (https://msdn.microsoft.com/en-us/library/te0h7f95%28v=vs.110%29.aspx) expects a URI giving the location of the XML file. If you have a string containing XML markup, use a StringReader over that string, e.g.
XPathDocument xSourceDoc;
using (TextReader tr = new StringReader(strSourceXML))
{
xSourceDoc = new XPathDocument(tr);
}

How to save and load the contents of a multiline textbox to XML?

I have a multiline textbox, and I use databinding to bind its Text property to a string. This appears to work, but upon loading the string from XML, the returns (new lines) get lost. When I inspect the XML, the returns are there, but once the string is loaded from XML they are lost. Does anybody know why this is happening and how to do this right?
(I am not bound to use either a multiline textbox or a string property for binding; I just want a maintainable, and preferably elegant, solution.)
Edit: Basically, I use the XmlSerializer class:
loading:
using (StreamReader streamReader = new StreamReader(fileName))
{
XmlSerializer xmlSerializer = new XmlSerializer(typeof(T));
return (T)xmlSerializer.Deserialize(streamReader);
}
saving:
using (StreamWriter streamWriter = new StreamWriter(fileName))
{
Type t = typeof(T);
XmlSerializer xmlSerializer = new XmlSerializer(t);
xmlSerializer.Serialize(streamWriter, data);
}
When looking inside the XML, it saves multiline textbox data like this:
<OverriddenComponent>
<overrideInformation>
<Comments>first rule
second rule
third rule</Comments>
</overrideInformation>
</OverriddenComponent>
But those breaks no longer get displayed after the data is loaded.
What are the actual codes for the new lines: 0x0A, 0x0D, or both? I stumbled on a similar problem before. The characters from a string "got lost" because the textbox "converted" them on its own (or didn't understand them). Basically, your XML file may be encoded one way while your textbox uses another encoding, or the characters may be lost while reading from or writing to the file itself. So there are 3 places your string may be tampered with, without your knowledge:
During writing to the file (take notice what encoding you use)
During reading from the file
When displaying your string in textbox.
My advice is that you should assign the text that you read from the file to another string (not bound) before you assign it to the bound one and use a debugger to check how it changes. This http://home2.paulschou.net/tools/xlate/ is a useful tool to check what exactly is in your strings.
When I encountered this problem in my application, I ended up using binary/hex values of characters to write/read, and then converting them back when I needed to display. But I had to use a lot of strange ASCII codes. Maybe there's an easier solution for you out there.
EDIT: or it may be just some xml-related thing. Maybe you should use some other character to replace line break when writing it to xml?
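One way to check which line-break codes survive is to dump the string as hex before and after a round trip. The sketch below is an assumption-laden illustration, not the asker's code: OverrideInformation is a cut-down stand-in for the real type, and LineEndingDemo, RoundTrip and ToHex are invented names. It relies on the fact that XML parsers normalize line endings to a single line feed (0x0A), which may be what a textbox expecting 0x0D 0x0A fails to display.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Serialization;

// Cut-down stand-in for the real data class (hypothetical).
public class OverrideInformation
{
    public string Comments { get; set; }
}

public static class LineEndingDemo
{
    // Serializes and deserializes the comments through XmlSerializer, in memory.
    public static string RoundTrip(string comments)
    {
        var serializer = new XmlSerializer(typeof(OverrideInformation));
        string xml;
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, new OverrideInformation { Comments = comments });
            xml = writer.ToString();
        }
        using (var reader = new StringReader(xml))
        {
            return ((OverrideInformation)serializer.Deserialize(reader)).Comments;
        }
    }

    // Shows each UTF-16 code unit as two hex digits so line breaks become visible.
    public static string ToHex(string s)
    {
        return string.Join(" ", s.Select(c => ((int)c).ToString("X2")));
    }

    public static void Main()
    {
        string original = "a\r\nb";
        string loaded = RoundTrip(original);
        Console.WriteLine(ToHex(original));
        Console.WriteLine(ToHex(loaded));

        // Possible repair before binding: turn lone line feeds back into CR LF.
        string repaired = loaded.Replace("\r\n", "\n").Replace("\n", "\r\n");
        Console.WriteLine(ToHex(repaired));
    }
}
```

If the hex dump shows only 0A after loading, restoring the carriage returns before assigning the bound property is probably enough, and no custom encoding of the text is needed.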

C#: shield XmlTextReader from an occasional Unicode character

In C#, I have a XmlTextReader created directly from an HTTP response (I have no control over the XML content of the response).
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
XmlTextReader reader = new XmlTextReader(response.GetResponseStream());
It works, but sometimes one of the XML element nodes will contain a Unicode character (e.g. "é") which trips the reader. I've tried to use a StreamReader with declared encoding, but now the XmlTextReader quits out on the very first line: "Data invalid. Line 1, position 1":
StreamReader sReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.Unicode);
XmlTextReader reader = new XmlTextReader(sReader);
Is there a way to fix this? Alternatively, is there a way to prevent the XmlTextReader from parsing an element (I know its name) with a potentially offending character? I don't care about that particular element, I just don't want it to trip the reader.
EDIT: Quick fix: read the response into a StringBuilder ("sb"):
sb.Replace("é", "e");
StringReader strReader = new StringReader(sb.ToString());
XmlTextReader reader = new XmlTextReader(strReader);
It is not a Unicode character, it is an invalid character (not correctly encoded).
There is no way to shield an XmlTextReader from invalid XML. You need to either
Fix the server side to properly encode characters
Pre-process the text to do it yourself
In UTF-8, every non-ASCII character such as "é" is encoded with two or more bytes. You can use a hex editor to verify what the response actually contains.
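As a quick reference for that hex check, here is a small sketch that prints the bytes "é" should produce in two encodings. Utf8BytesDemo and HexBytes are names invented for this example.

```csharp
using System;
using System.Text;

public static class Utf8BytesDemo
{
    // Returns the encoded bytes of the text as dash-separated hex pairs.
    public static string HexBytes(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // "\u00E9" is "é"; the escape avoids depending on the source file's encoding.
        Console.WriteLine(HexBytes("\u00E9", Encoding.UTF8));    // C3-A9
        Console.WriteLine(HexBytes("\u00E9", Encoding.Unicode)); // E9-00 (UTF-16, little-endian)
    }
}
```

A lone E9 byte in a response that declares UTF-8 would mean the server sent Latin-1 style data, which is the kind of mismatch that makes the reader fail.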
What do you mean by "trips the reader"? Your first snippet of code should be fine - if the XML is genuinely in the encoding it declares (please look at the XML declaration) then it should be absolutely fine.
If the XML is genuinely broken, I would suggest performing some sort of filtering before XML parsing (e.g. loading the XML into a string with the right encoding, then fixing the declared encoding to match)... but we'll need to work out what's wrong with it first.

How do I preserve special characters when writing XML with XDocument.Save()?

My source XML has the copyright character in it as a character reference (for example &#x00A9;). When writing the XML with this code:
var stringWriter = new StringWriter();
segmentDoc.Save(stringWriter);
Console.WriteLine(stringWriter.ToString());
it is rendering that copyright character as a little "c" with a circle around it (©). I'd like to preserve the original character reference so it gets written back out unchanged. How can I do this?
Update: I also noticed that the source declaration looks like <?xml version="1.0" encoding="utf-8"?> but my saved output looks like <?xml version="1.0" encoding="utf-16"?>. Can I indicate that I want the output to still be utf-8? Would that fix it?
Update2: Also,   is getting output as ÿ. I definitely don't want that happening!
Update3: § is becoming a little box and that is wrong, too. It should be §
I strongly suspect you won't be able to do this. Fundamentally, the character reference and the literal © are different representations of the same thing, and I expect that the in-memory representation normalizes this.
What are you doing with the XML afterwards? Any sane application processing the resulting XML should be fine with it.
You may be able to persuade it to use the entity reference if you explicitly encode it with ASCII... but I'm not sure.
EDIT: You can definitely make it use a different encoding. You just need a StringWriter which reports that its "native" encoding is UTF-8. Here's a simple class you can use for that:
public class Utf8StringWriter : StringWriter
{
public override Encoding Encoding
{
get { return Encoding.UTF8; }
}
}
You could try changing it to use Encoding.ASCII as well and see what that does to the copyright sign...
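A short usage sketch for that class. The Utf8StringWriter definition is repeated so the example is self-contained; Utf8SaveDemo, Save and the sample segment element are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Same idea as the class above: a StringWriter that reports UTF-8 as its encoding.
public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class Utf8SaveDemo
{
    // Saves the document to a string; the XML declaration takes its
    // encoding name from the writer's Encoding property.
    public static string Save(XDocument doc)
    {
        using (var writer = new Utf8StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        var doc = new XDocument(new XElement("segment", "\u00A9 2010"));
        Console.WriteLine(Save(doc));
    }
}
```

This only changes the declared encoding; the © itself is still written as a literal character rather than as a character reference.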
I had the same problem when saving some Lithuanian characters this way. I found a way to cheat around it by replacing & with &amp; (writing &amp;#x00A9; to get &#x00A9;, and so on). It looks strange, but it worked for me :)
Maybe you can try a different document encoding; check out:
http://www.sagehill.net/docbookxsl/CharEncoding.html
It appears that UTF8 won't solve the problem. The following has the same symptoms as your code:
MemoryStream ms = new MemoryStream();
XmlTextWriter writer = new XmlTextWriter(ms, new UTF8Encoding());
segmentDoc.Save(writer);
ms.Seek(0L, SeekOrigin.Begin);
var reader = new StreamReader(ms);
var result = reader.ReadToEnd();
Console.WriteLine(result);
I tried the same approach with ASCII, but wound up with ? instead of ©.
I think using a string replace after converting the XML to a string is your best bet to get the effect you want. Of course, that could be cumbersome if you are interested in more than just the © symbol.
result = result.Replace("©", "\u0026#x00A9;");
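A possible alternative to the string replace, offered as a sketch rather than a confirmed fix: as far as I know, an XmlWriter created through XmlWriter.Create with an ASCII encoding writes characters that ASCII cannot represent as numeric character references instead of "?". AsciiSaveDemo and SaveAsAscii are invented names, and the behavior is worth verifying on your own runtime.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class AsciiSaveDemo
{
    // Saves the document through an XmlWriter configured for ASCII output.
    public static string SaveAsAscii(XDocument doc)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            // The output is pure ASCII, so decoding it as ASCII is lossless.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        var doc = new XDocument(new XElement("segment", "\u00A9 2010"));
        Console.WriteLine(SaveAsAscii(doc));
    }
}
```

Note that this applies to every non-ASCII character, and the writer chooses the form of the reference, so it will not reproduce the exact spelling used in the source file.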
