XmlTextReader passes end of XML document without recognizing

I'm trying to create a simple app that reads XML using a SAX-style reader (XmlTextReader) from a stream that contains not only the XML but also other data, such as binary blobs and text. The stream is simply chunk-based.
When entering my reading function, the stream is properly positioned at the beginning of the XML. I've reduced the issue to the following code example:
string xml = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><Models />" + (char)0x014;
XmlTextReader reader = new XmlTextReader(new StringReader(xml));
reader.MoveToContent();
reader.ReadStartElement("Models");
These few lines cause an exception when calling ReadStartElement, because of the 0x014 character at the end of the string.
The interesting thing is that the code runs just fine when using the following input instead:
string xml = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><Models></Models>" + (char)0x014;
I don't want to read the whole document because of its size, nor do I want to change the input, as I need to stay backward compatible with older data inputs.
The only solution I can think of so far is a custom stream reader that stops reading after the last end tag, but that would involve some major parsing effort.
Do you have any ideas on how to solve this issue? I've already tried to use LINQ's XDocument, but that also failed.
Thank you very much in advance,
Cheers,
Romout

I don't know if this is quite what you are looking for, but if you instead call:
reader.IsStartElement("Models");
then the <Models/> node is only tested for being a start tag or empty-element tag with a matching name. The reader is not moved beyond it (the Read() method is not called).
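
A minimal sketch of that suggestion, reusing the input from the question. The class and method names (ModelsProbe, IsModelsStart) are made up for illustration; IsStartElement and IsEmptyElement are existing XmlReader members. It relies on what the question reports, namely that the reader only trips over the trailing character when it is asked to read past <Models />:

```csharp
using System;
using System.IO;
using System.Xml;

static class ModelsProbe
{
    // Returns true if the first content node is a <Models> start tag or
    // empty-element tag. The reader stays positioned on that element.
    public static bool IsModelsStart(string xml, out bool isEmpty)
    {
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            // IsStartElement calls MoveToContent and then only tests the
            // current node; it does not call Read() to move past it.
            bool found = reader.IsStartElement("Models");

            // IsEmptyElement tells <Models /> apart from <Models>...</Models>.
            isEmpty = found && reader.IsEmptyElement;
            return found;
        }
    }
}
```

If IsEmptyElement is true there is nothing more to read for this element, so the caller can stop here instead of calling ReadStartElement.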

Related

C# Xml Encoding

I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all of it... way too much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
    XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
    categorieSystemNode.RemoveAll();
    XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
    string import_id = categories.Attributes["nodeName"].Value;
    importIdNode.InnerText = import_id;
    categorieSystemNode.AppendChild(importIdNode);
    // [way more nodes which I process like this]
    // Appended inside the loop, where categorieSystemNode is in scope.
    newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
}
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro>&lt;p&gt;Whether your project [...]</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas

The main purpose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well-formed document.
So, by using InnerText as in your example, you let the framework encode the string and insert it properly into the document. Whenever you read that same value back, it is decoded and returned to you exactly as your original string.
But if you want to add an XML fragment anyway, you should stick with InnerXml or ImportNode. Be aware that this could lead to a more complex document structure, which you would probably like to avoid.
As a third possibility, you can use CreateCDataSection to create a CDATA section and put your text there.
You should definitely stay away from treating the XML document as a string and calling Replace on it; stick with the framework and you'll be OK.
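
To make the three options concrete, here is a small sketch. The class name FragmentOptions and the placeholder elements a, b and c are invented for the example; InnerText, InnerXml and CreateCDataSection are the XmlDocument members mentioned above. It assumes the fragment passed in is well-formed, otherwise the InnerXml assignment throws:

```csharp
using System;
using System.Xml;

static class FragmentOptions
{
    // Builds <systems><a/><b/><c/></systems> and stores the same markup
    // string in three different ways, one per placeholder element.
    public static XmlDocument Build(string fragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<systems><a/><b/><c/></systems>");

        // 1) InnerText: stored as text, so '<' and '>' are escaped on output
        //    and decoded again when InnerText is read back.
        doc.SelectSingleNode("systems/a").InnerText = fragment;

        // 2) InnerXml: parsed as markup, so <p> becomes a real child element.
        doc.SelectSingleNode("systems/b").InnerXml = fragment;

        // 3) CDATA: stored as text, written unescaped inside a CDATA section.
        doc.SelectSingleNode("systems/c").AppendChild(doc.CreateCDataSection(fragment));

        return doc;
    }
}
```

Calling FragmentOptions.Build("<p>Whether your project</p>") and inspecting the OuterXml of each placeholder shows the difference between the escaped text, the nested element and the CDATA section.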

What's the best way to update xml in a file?

I have been looking all over for the best way to update xml in a file. I have just switched over to using XmlReader (coming from the XDocument method) for speed (not having to read the entire file in memory).
My XmlReader method works perfectly: when I need to read a value, it opens the XML, starts reading and ONLY reads up to the node needed, then closes everything. It's very fast and effective.
Now that I have that working, I want to make a method that UPDATES XML that is already in place. I would like to keep to the same idea and ONLY read into memory what is needed. So the idea would be: read up to the node I'm changing, then use the writer to UPDATE that value.
Everything I have seen has an XmlReader reading while an XmlWriter writes everything. If I did that, I assume I would have to let it run through the entire file, just like XDocument would do. As an example this answer.
Is it possible to maybe just use the reader and read up to the node I'm trying to edit then change the innerxml or something?
What's the fastest and most efficient method to update XML in a file?
I would like to only read into memory what I'm trying to edit, not the whole file.
I would also like to account for nodes that do not exist (that need to be added).

By design, XmlReader represents a "read-only forward-only" view of the document and cannot be used to update the content. Using the Load method of either XmlDocument, XDocument or XElement, will still cause the entire file to be read in to memory. (Under the hood, XDocument and XElement still use an XmlReader.) However, you can combine using a raw XmlReader and XElement together using the overloads of the Load method which take an XmlReader.
You don't describe your XML structure, but you would want to do something similar to this:
var reader = XmlReader.Create(@"file://c:\test.xml");
var document = XElement.Load(reader);
document.Add(new XElement("branch", "leaves"));
document.Save("Tree.xml");
To find a specific node (for example, with a specific attribute value), you'd want to do something similar to this:
var node = document.Descendants("branch")
    .SingleOrDefault(e => (string)e.Attribute("name") == "foo");

Most Efficient Way to Parse Only From Specific Keys in a Large XML with XMLReader

Suppose I have a large XML (200 - 1000+ MB) and I'm just looking to get a very small subset of data in the most efficient way.
Given a great solution from one of my previous questions, I ended up coding a solution to use an XMLReader mixed with XMLDocument / XPath.
So, supposing I have the following XML:
<Doc>
  <Big_Element1>
    ... LOTS of sub-elements ...
  </Big_Element1>
  .....
  <Small_Element1>
    <Sub_Element1_1 />
    ...
    <Sub_Element1_N />
  </Small_Element1>
  .....
  <Small_Element2>
    <Sub_Element2_1 />
    ...
    <Sub_Element2_N />
  </Small_Element2>
  .....
  <Big_ElementN>
    .......
  </Big_ElementN>
</Doc>
All I really need is the data from the Small_Elements. The Big_Elements are definitely very large (with many small sub-elements within them), so I'd like to not even enter them if I don't have to.
I came up with this form of solution:
Dim doc As XmlDocument
Dim xNd As XmlNode
Using reader As XmlReader = XmlReader.Create(uri)
    reader.MoveToContent()
    While reader.Read
        If reader.NodeType = XmlNodeType.Element Then
            Select Case UCase(reader.Name)
                Case "SMALL_ELEMENT1"
                    doc = New XmlDocument
                    xNd = doc.ReadNode(reader)
                    GetSmallElement1Data(xNd)
                Case "SMALL_ELEMENT2"
                    doc = New XmlDocument
                    xNd = doc.ReadNode(reader)
                    GetSmallElement2Data(xNd)
            End Select
        End If
    End While
End Using
And GetSmallElement1Data(xNd) & GetSmallElement2Data(xNd) are easy enough for me to deal with since they're small and so I use XPath within them to get the data I need.
But my question is that it seems this reader still goes through the entire XML rather than just skipping over the Big_Elements. Or is it not / this the correct way to have programmed this??
Also, I know this sample code was written in VB.net, but I'm equally comfortable with c# / VB.net solutions.
Any help / thoughts would be great!!!
Thanks!!!

"Suppose I have a large XML (200 - 1000+ MB)"
XmlReader is the only approach that does not parse the whole document to create an in-memory object model.
"But my question is that it seems this reader still goes through the entire XML rather than just skipping over the Big_Elements. Or is it not / this the correct way to have programmed this??"
The parser still has to read that content: it has no knowledge of which elements you are interested in.
Your only option to skip content (and thus avoid returning to your code from XmlReader.Read for every node) is to call XmlReader.Skip, which tells the parser that you are not interested in any descendants of the current node. The parser still needs to read and parse the text to find the matching end tag, but because your code does not run for each node in between, this will be quicker.
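
A rough C# sketch of that approach, for illustration only. The class and method names are invented, and matching on the "Small_Element" prefix is an assumption taken from the sample XML in the question. Note that both Skip and XmlDocument.ReadNode leave the reader on the node that follows the element, so the loop must not call Read again after either of them:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class SkipSketch
{
    // Collects the outer XML of every top-level Small_Element* child of the
    // root and skips all other child elements without visiting their content.
    public static List<string> ReadSmallElements(TextReader input)
    {
        var found = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // positioned on the root element
            reader.Read();          // move to the first node inside the root

            while (!reader.EOF)
            {
                if (reader.NodeType != XmlNodeType.Element)
                {
                    reader.Read();  // whitespace, comments, end tags, ...
                }
                else if (reader.Name.StartsWith("Small_Element", StringComparison.OrdinalIgnoreCase))
                {
                    // ReadNode consumes the whole element and leaves the
                    // reader on the node after its end tag.
                    XmlNode node = new XmlDocument().ReadNode(reader);
                    found.Add(node.OuterXml);
                }
                else
                {
                    // Skip also moves past the matching end tag, but no
                    // descendant is handed back to this loop.
                    reader.Skip();
                }
            }
        }
        return found;
    }
}
```

For a file, the input could be a StreamReader over the file; small elements nested inside a skipped big element are deliberately not reported.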

C#: shield XmlTextReader from an occasional Unicode character

In C#, I have a XmlTextReader created directly from an HTTP response (I have no control over the XML content of the response).
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
XmlTextReader reader = new XmlTextReader(response.GetResponseStream());
It works, but sometimes one of the XML element nodes will contain a Unicode character (e.g. "é") which trips the reader. I've tried to use a StreamReader with declared encoding, but now the XmlTextReader quits out on the very first line: "Data invalid. Line 1, position 1":
StreamReader sReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.Unicode);
XmlTextReader reader = new XmlTextReader(sReader);
Is there a way to fix this? Alternatively, is there a way to prevent the XmlTextReader from parsing an element (I know its name) with a potentially offending character? I don't care about that particular element, I just don't want it to trip the reader.
EDIT: Quick fix: read the response into a StringBuilder ("sb"):
sb.Replace("é", "e");
StringReader strReader = new StringReader(sb.ToString());
XmlTextReader reader = new XmlTextReader(strReader);

It is not a Unicode problem; it is an invalid character (one that is not correctly encoded).
There is no way to shield an XmlTextReader from invalid XML. You need to either:
- fix the server side so that it encodes characters properly, or
- pre-process the text to do it yourself.
In UTF-8, every non-ASCII character (such as "é") is encoded with two or more bytes. You can use a hex editor to verify it.

What do you mean by "trips the reader"? Your first snippet of code should be fine - if the XML is genuinely in the encoding it declares (please look at the XML declaration) then it should be absolutely fine.
If the XML is genuinely broken, I would suggest performing some sort of filtering before XML parsing (e.g. loading the XML into a string with the right encoding, then fixing the declared encoding to match)... but we'll need to work out what's wrong with it first.

How to change character encoding of XmlReader

I have a simple XmlReader:
XmlReader r = XmlReader.Create(fileName);
while (r.Read())
{
    Console.WriteLine(r.Value);
}
The problem is, the Xml file has ISO-8859-9 characters in it, which makes XmlReader throw "Invalid character in the given encoding." exception. I can solve this problem with adding <?xml version="1.0" encoding="ISO-8859-9" ?> line in the beginning but I'd like to solve this in another way in case I can't modify the source file. How can I change the encoding of XmlReader?

To force .NET to read the file in as ISO-8859-9, just use one of the many XmlReader.Create overloads, e.g.
using(XmlReader r = XmlReader.Create(new StreamReader(fileName, Encoding.GetEncoding("ISO-8859-9")))) {
    while(r.Read()) {
        Console.WriteLine(r.Value);
    }
}
However, that may not work because, IIRC, the W3C XML standard says something about when the XML declaration line has been read, a compliant parser should immediately switch to the encoding specified in the XML declaration regardless of what encoding it was using before. In your case, if the XML file has no XML declaration, the encoding will be UTF-8 and it will still fail. I may be talking nonsense here so try it and see. :-)
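
One way to try it is a small helper like the sketch below. DecodedXml and ReadTextValues are invented names, and the helper only covers the case from the question where the file has no XML declaration; how a conflicting declaration behaves is not checked here. On .NET Core and later, code pages such as ISO-8859-9 are, as far as I know, only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package, so treat the encoding lookup as an assumption about your runtime:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

static class DecodedXml
{
    // Decodes the bytes with the given encoding first, then lets XmlReader
    // parse the resulting characters. Returns the value of every text node.
    public static List<string> ReadTextValues(Stream stream, Encoding encoding)
    {
        var values = new List<string>();
        using (var text = new StreamReader(stream, encoding))
        using (XmlReader reader = XmlReader.Create(text))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    values.Add(reader.Value);
                }
            }
        }
        return values;
    }
}
```

For the file from the question this would be called as DecodedXml.ReadTextValues(File.OpenRead(fileName), Encoding.GetEncoding("ISO-8859-9")).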

The XmlTextReader class (which is what the static Create method is actually returning, since XmlReader is the abstract base class) is designed to automatically detect encoding from the XML file itself - there's no way to set it manually.
Simply ensure that you include the following XML declaration in the file you are reading:
<?xml version="1.0" encoding="ISO-8859-9"?>

If you can't ensure that the input file has the right header, you could look at one of the other 11 overloads to the XmlReader.Create method.
Some of these take an XmlReaderSettings variable or XmlParserContext variable, or both. I haven't investigated these, but there is a possibility that setting the appropriate values might help here.
There is the XmlReaderSettings.CheckCharacters property - the help for this states:
Instructs the reader to check characters and throw an exception if any characters are outside the range of legal XML characters. Character checking includes checking for illegal characters in the document, as well as checking the validity of XML names (for example, an XML name may not start with a numeral).
So setting this to false might help. However, the help also states:
If the XmlReader is processing text data, it always checks that the XML names and text content are valid, regardless of the property setting. Setting CheckCharacters to false turns off character checking for character entity references.
So further investigation is warranted.

Use a XmlTextReader instead of a XmlReader:
System.Text.Encoding.UTF8.GetString(YourXmlTextReader.Encoding.GetBytes(YourXmlTextReader.Value))
