C# XPathDocument parsing string to XML with BOM - c#

In C#, I am parsing a string as XML using XPathDocument.
The string is retrieved from SDL Trados Studio. Depending on the XML being worked on (how it was originally created and loaded for translation), the string sometimes has a BOM and sometimes does not.
Edit: The 'xml' is actually assembled from the segments of the source and target text and the structure elements. The textual elements are escaped for XML, and the markup and text are joined into one string. So if the markup has a BOM in the XLIFF, the string will have a BOM.
I am trying to parse any of these XML strings, independent of encoding. At this point my solution is to remove the BOM with Substring.
Here is my code:
//Recreate XML files (extractor returns two string arrays)
string strSourceXML = String.Join("", extractor.TextSrc);
string strTargetXML = String.Join("", extractor.TextTgt);
//strip BOM
strSourceXML = strSourceXML.Substring(strSourceXML.IndexOf("<?"));
strTargetXML = strTargetXML.Substring(strTargetXML.IndexOf("<?"));
//Transform XML with the preview XSL
var xSourceDoc = new XPathDocument(strSourceXML);
var xTargetDoc = new XPathDocument(strTargetXML);
I have searched for a better solution, through several articles, such as these, but I found no better solution yet:
XML - Data At Root Level is Invalid
Parsing XML with C#
Parsing complex XML with C#
Parsing : String to XML
XmlReader breaks on UTF-8 BOM
Any advice to solve this more elegantly?

The XPathDocument constructor that takes a String argument (https://msdn.microsoft.com/en-us/library/te0h7f95%28v=vs.110%29.aspx) expects a URI giving the location of the XML file, not the markup itself. If you have a string containing XML markup, use a StringReader over that string, e.g.
XPathDocument xSourceDoc;
using (TextReader tr = new StringReader(strSourceXML))
{
xSourceDoc = new XPathDocument(tr);
}
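A minimal sketch combining the StringReader approach with the BOM removal from the question. The helper name Load and the sample markup are made up for illustration; TrimStart('\uFEFF') is one possible way to drop a leading BOM character and is not part of the answer above.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class BomSafeXPath
{
    // Hypothetical helper: strips any leading U+FEFF characters,
    // then parses the markup through a StringReader.
    public static XPathDocument Load(string xml)
    {
        string cleaned = xml.TrimStart('\uFEFF');
        using (var reader = new StringReader(cleaned))
        {
            // The document is fully read in the constructor,
            // so the reader can be disposed afterwards.
            return new XPathDocument(reader);
        }
    }

    static void Main()
    {
        // Sample input (assumption): XML markup preceded by a BOM character.
        string withBom = "\uFEFF<?xml version=\"1.0\"?><root><seg>text</seg></root>";
        XPathNavigator nav = Load(withBom).CreateNavigator();
        Console.WriteLine(nav.SelectSingleNode("/root/seg").Value);
    }
}
```

Unlike Substring(IndexOf("<?")), this does not depend on the string containing an XML declaration.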

Related

Making XmlReaderSettings CheckCharacters work for xml string

I have an xml string coming from Adobe PDF AcroForms, which apparently allows naming form fields starting with numeric characters. I'm trying to parse this string to an XDocument:
XDocument xDocument = XDocument.Parse(xmlString);
But whenever I encounter such a form field where the name starts with a numeric char, the xml parsing throws an XmlException:
Name cannot begin with the 'number' character
Other solutions I found were about using: XmlReaderSettings.CheckCharacters
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString), new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
But this also didn't work. Some articles pointed out the reason as one of the points mentioned in MSDN article:
If the XmlReader is processing text data, it always checks that the
XML names and text content are valid, regardless of the property
setting. Setting CheckCharacters to false turns off character checking
for character entity references.
So I tried using:
using(MemoryStream memoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(xmlString)))
using (XmlReader xmlReader = XmlReader.Create(memoryStream, new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
This also didn't work.
Can anyone please help me figure out how to parse an XML string that contains elements whose names start with numeric characters?
How is the flag XmlReaderSettings.CheckCharacters supposed to be used?
You can't make a standard XML parser parse your format, even if it "looks like" XML, so stop trying. Standards-compliant XML parsers are not allowed to accept malformed XML. This was a design decision, based on all the problems quirks mode caused with HTML parsing.
Writing your own parser isn't that hard. XML is very strict and, unless you need advanced features, the syntax is simple.
An LL parser can be written by hand. Both the lexer and the parser are simple.
Alternatively, a parser can be generated from a simple grammar using a parser generator such as ANTLR. Most likely, you'll even find example XML grammars.
You can also just take the source code of either of the .NET XML parsers and remove the validation you don't need. You can find both XmlDocument and XDocument in .NET Core's repository on GitHub.
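If a full custom parser is more than is needed, a different workaround (not the approach suggested above) is to pre-process the text so that the names become valid before parsing. The sketch below is only a rough illustration: the underscore prefix, the regular expression, and the sample input are all assumptions, and the pattern would also rewrite matching text inside comments or CDATA sections and does not handle attribute names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class DigitNameWorkaround
{
    // Hypothetical helper: prefixes element names that start with a digit
    // with an underscore, in both opening and closing tags.
    public static string FixNames(string xml)
    {
        return Regex.Replace(xml, @"<(/?)(\d)", "<${1}_${2}");
    }

    static void Main()
    {
        // Sample input (assumption): a field whose name starts with a digit.
        string raw = "<fields><1name>v</1name></fields>";
        XDocument doc = XDocument.Parse(FixNames(raw));
        Console.WriteLine(doc.Root.Elements().First().Name.LocalName);
    }
}
```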

Saving XML file from one location to another location using XML DOCUMENT

While saving an existing XML file to a new location, entities in the content are not preserved and are replaced with question marks.
See the screenshots below: the entity ‐ (- as hex) is present while reading, but it is replaced with a question mark after saving to another location.
While Reading as Inner XML
While Reading as Inner Text
After Saving XML File
EDIT 1
Below is my code
string path = @"C:\work\myxml.XML";
string pathnew = @"C:\work\myxml_new.XML";
//GetFileEncoding(path);
XmlDocument document = new XmlDocument();
XmlDeclaration xmlDeclaration = document.CreateXmlDeclaration("1.0","US-ASCII",null);
//document.CreateXmlDeclaration("1.0", null, null);
document.Load(path);
string x = document.InnerText;
document.Save(pathnew);
EDIT 2
My source file looks like the one below. I need to retain the entities as they are.
The issue here seems to be the handling of encoding of entity references by the specific XmlWriter implementation internal to XmlDocument.
The issue disappears if you create an XmlWriter yourself - the unsupported character will be correctly encoded as an entity reference. This XmlWriter is a different (and newer) implementation that sets an EncoderFallback that encodes characters as entity references for characters that can't be encoded. Per the remarks in the docs, the default fallback mechanism is to encode a question mark.
var settings = new XmlWriterSettings
{
Indent = true,
Encoding = Encoding.GetEncoding("US-ASCII")
};
using (var writer = XmlWriter.Create(pathnew, settings))
{
document.Save(writer);
}
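A small, self-contained way to check this behaviour, assuming an in-memory stream and a made-up one-element document instead of the files from the question:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class AsciiSaveDemo
{
    static void Main()
    {
        // Sample document (assumption): contains U+2010, which US-ASCII cannot represent.
        var document = new XmlDocument();
        document.LoadXml("<p>well&#x2010;known</p>");

        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.ASCII,
            OmitXmlDeclaration = true
        };

        using (var stream = new MemoryStream())
        {
            // Save through an XmlWriter created by us, as in the snippet above.
            using (var writer = XmlWriter.Create(stream, settings))
            {
                document.Save(writer);
            }

            // Inspect the bytes that were written: the hyphen should appear
            // as a character reference rather than as a question mark.
            Console.WriteLine(Encoding.ASCII.GetString(stream.ToArray()));
        }
    }
}
```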
As an aside, I'd recommend using the LINQ to XML XDocument API; it's much nicer to work with than the creaky old XmlDocument API. And its version of Save doesn't have this problem, either!

Error parsing XML string to XDocument

I have this XML string bn:
<Root><Row><ITEMNO>1</ITEMNO><USED>y</USED><PARTSOURCE>Buy</PARTSOURCE><QTY>2</QTY></Row><Row><ITEMNO>5</ITEMNO><PARTSOURCE>Buy</PARTSOURCE><QTY>5</QTY></Row></Root>
I am trying to convert it to an XDocument like this:
var doc = XDocument.Parse(bn);
However, I get this error:
Data at the root level is invalid. Line 1, position 1.
Am I missing something?
UPDATE:
This is the method I use to create the xml string:
public static string SerializeObjectToXml(Root rt)
{
var memoryStream = new MemoryStream();
var xmlSerializer = new XmlSerializer(typeof(Root));
var xmlTextWriter = new XmlTextWriter(memoryStream, Encoding.UTF8);
xmlSerializer.Serialize(xmlTextWriter, rt);
memoryStream = (MemoryStream)xmlTextWriter.BaseStream;
string xmlString = ByteArrayToStringUtf8(memoryStream.ToArray());
xmlTextWriter.Close();
memoryStream.Close();
memoryStream.Dispose();
return xmlString;
}
It does add extra characters to the start that I have to remove. Could I change something to make it correct from the start?
There are two characters at the beginning of your string that, although you can't see them, are still there and make parsing fail. Try this instead:
<Root><Row><ITEMNO>1</ITEMNO><USED>y</USED><PARTSOURCE>Buy</PARTSOURCE><QTY>2</QTY></Row><Row><ITEMNO>5</ITEMNO><PARTSOURCE>Buy</PARTSOURCE><QTY>5</QTY></Row></Root>
The character in question is U+FEFF, the byte-order mark, which tells the program reading the data whether it is big- or little-endian. It seems like you copied and pasted this from a file that wasn't decoded properly.
To remove it, you could use this:
yourString.Replace(((char)0xFEFF).ToString(), "")
You have two unprintable characters (Zero-Width No-break Space) at the beginning of your string.
XML does not allow text outside the root element.
The accepted answer does unnecessary string processing, but, in its defense, it's because you're unnecessarily dealing in string when you don't have to. One of the great things about the .NET XML APIs is that they have robust internals. So instead of trying to feed a string to XDocument.Parse, feed a Stream or some type of TextReader to XDocument.Load. This way, you aren't fooling with manually managing the encoding and any problems it creates, because the internals will handle all of that stuff for you. Byte-order marks are a pain in the neck, but if you're dealing in XML, .NET makes it easier to handle them.
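As a rough illustration of loading from a stream instead of a string, the sketch below builds a byte array that starts with the UTF-8 byte-order mark and hands it to XDocument.Load. The sample markup and the in-memory stream are assumptions; in practice the stream would come from wherever the XML is produced.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

static class LoadFromStreamDemo
{
    static void Main()
    {
        // Sample data (assumption): UTF-8 BOM bytes followed by the XML bytes.
        byte[] body = Encoding.UTF8.GetBytes("<Root><Row><QTY>2</QTY></Row></Root>");
        byte[] bytes = Encoding.UTF8.GetPreamble().Concat(body).ToArray();

        using (var stream = new MemoryStream(bytes))
        {
            // The reader behind XDocument.Load detects and skips the BOM itself.
            XDocument doc = XDocument.Load(stream);
            Console.WriteLine(doc.Root.Name);
        }
    }
}
```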

Loading string into XML Data

I am developing an application in which I read a file, convert the contents into a string, and then load the string as XML. The issue I am facing is that, while loading the string data, I get an invalid-characters exception. I am using the following piece of code. Could anyone help me resolve the issue? Thank you in advance.
ZipFileEntry objContactXML;
String xmlData = ASCIIEncoding.UTF8.GetString(objContactXML.FileData);
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlData);
Regards,
Sanchaita
Firstly, this is a nasty bit of code:
ASCIIEncoding.UTF8
Please use just Encoding.UTF8 - it's UTF-8, not ASCII.
Now, you can create a StringReader around your XML text data - but you'd actually be better off not turning it into string data at all. It may be encoded in something other than UTF-8 - and the XML parser knows how to deal with that. It's entirely possible that this is why you're running into problems with your current approach. Leave the data in binary and parse that:
using (MemoryStream stream = new MemoryStream(objContactXML.FileData))
{
xmlDoc.Load(stream);
}
As an aside, if you're using .NET 3.5 or higher, I would strongly advise you to use LINQ to XML (XDocument etc) instead of the old DOM API. LINQ to XML is a much nicer API.
In LINQ to XML, you'd use:
XDocument document;
using (MemoryStream stream = new MemoryStream(objContactXML.FileData))
{
document = XDocument.Load(stream);
}

C#: shield XmlTextReader from an occasional Unicode character

In C#, I have a XmlTextReader created directly from an HTTP response (I have no control over the XML content of the response).
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
XmlTextReader reader = new XmlTextReader(response.GetResponseStream());
It works, but sometimes one of the XML element nodes will contain a Unicode character (e.g. "é") which trips the reader. I've tried to use a StreamReader with declared encoding, but now the XmlTextReader quits out on the very first line: "Data invalid. Line 1, position 1":
StreamReader sReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.Unicode);
XmlTextReader reader = new XmlTextReader(sReader);
Is there a way to fix this? Alternatively, is there a way to prevent the XmlTextReader from parsing an element (I know its name) with a potentially offending character? I don't care about that particular element, I just don't want it to trip the reader.
EDIT: Quick fix: read the response into a StringBuilder ("sb"):
sb.Replace("é", "e");
StringReader strReader = new StringReader(sb.ToString());
XmlTextReader reader = new XmlTextReader(strReader);
It is not a Unicode character, it is an invalid character (not correctly encoded).
There is no way to shield an XmlTextReader from invalid XML. You need to either
Fix the server side to properly encode characters
Pre-process the text to do it yourself
In UTF-8, non-ASCII characters such as "é" are encoded with two or more bytes. You can use a hex editor to verify this.
What do you mean by "trips the reader"? Your first snippet of code should be fine - if the XML is genuinely in the encoding it declares (please look at the XML declaration) then it should be absolutely fine.
If the XML is genuinely broken, I would suggest performing some sort of filtering before XML parsing (e.g. loading the XML into a string with the right encoding, then fixing the declared encoding to match)... but we'll need to work out what's wrong with it first.
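One possible shape for that kind of pre-processing, sketched under the assumption that the payload is really ISO-8859-1 while its declaration claims UTF-8. The sample bytes are made up, and the actual encoding of the response would have to be established first. Decoding with a StreamReader means the parser receives characters, so it does not decode the bytes again according to the mismatched declaration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

static class DecodeThenParseDemo
{
    static void Main()
    {
        // Assumption: the real encoding is ISO-8859-1, although the
        // XML declaration says utf-8.
        Encoding actual = Encoding.GetEncoding("iso-8859-1");
        byte[] raw = actual.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><name>caf\u00E9</name>");

        using (var stream = new MemoryStream(raw))
        using (var reader = new StreamReader(stream, actual))
        {
            // The StreamReader does the decoding; XDocument only sees text.
            XDocument doc = XDocument.Load(reader);
            Console.WriteLine(doc.Root.Value == "caf\u00E9");
        }
    }
}
```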
