How to verify using C# if an XML file is broken - c#

Is there anything built in to determine whether an XML file is valid? One way would be to read the entire content and verify that the string represents valid XML. Even then, how do I determine whether a string contains valid XML data?

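Create an XmlReader around a StringReader with the XML and read through the reader:
using (var reader = XmlReader.Create(new StringReader(xml))) // xml: the string to check
{
    while (reader.Read()) { } // visit every node; malformed XML throws XmlException
}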
If you don't get any exceptions, the XML is well-formed.
Unlike XDocument or XmlDocument, this will not hold an entire DOM tree in memory, so it will run quickly even on extremely large XML files.
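The read-through approach above can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: the class name XmlCheck and the method name IsWellFormed are made up for illustration, and it assumes the input string is not null. Note that with default XmlReaderSettings a document containing a DOCTYPE is rejected as well, because DTD processing is prohibited by default.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlCheck
{
    // Hypothetical helper: returns true if the text parses as a well-formed XML document.
    // Assumes xml is not null (StringReader would throw ArgumentNullException).
    public static bool IsWellFormed(string xml)
    {
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml)))
            {
                // Advance through every node; nothing is kept in memory.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            // The parser reported a well-formedness error.
            return false;
        }
    }
}
```

For example, XmlCheck.IsWellFormed("<a><b/></a>") should return true, while an unclosed tag such as "<a><b></a>" should return false.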

You can try to load the XML into an XmlDocument and catch the exception.
Here is the sample code:
var doc = new XmlDocument();
try
{
    doc.LoadXml(content);
}
catch (XmlException e)
{
    // Put code here that should be executed when the XML is not valid.
}
Hope it helps.

Have a look at this question:
How to check for valid xml in string input before calling .LoadXml()

Related

XDocument.Parse: Avoid replacing XXE references

I'm trying to protect against malicious XXE injections in the XML processed by my app. Therefore I'm using XDocument instead of XmlDocument.
The XML represents the payload of a web request, so I call XDocument.Parse on its string content. However, I'm seeing the XXE references contained in the XML (&xxe;) being replaced in the result with the actual value of ENTITY xxe.
Is it possible to parse the XML with XDocument without replacing &xxe;?
Thanks
EDIT:
I managed to avoid the replacement of the &xxe; references in the XML by using XmlResolver = null for XDocument.Load.
Instead of Parse, try Load with a pre-configured reader:
var xdoc = XDocument.Load(
    new XmlTextReader(new StringReader(xmlContent))
    {
        EntityHandling = EntityHandling.ExpandCharEntities
    });
From MSDN:
When EntityHandling is set to ExpandCharEntities, the reader expands character entities and returns general entities as EntityReference nodes.
Use the following example (VB.NET) to stop resolving external entities (schemas and DTDs).
Dim objXmlReader As System.Xml.XmlTextReader = Nothing
objXmlReader = New System.Xml.XmlTextReader(_patternFilePath)
objXmlReader.XmlResolver = Nothing
patternDocument = XDocument.Load(objXmlReader)
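A stricter alternative, sketched here in C#, is to refuse any DTD instead of leaving the entity references unexpanded. This is not what the answers above do: it rejects input that contains a DOCTYPE outright. The class name SafeXml and the method name ParseWithoutDtd are hypothetical; XmlReaderSettings.DtdProcessing and XmlReaderSettings.XmlResolver are the standard settings.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class SafeXml
{
    // Hypothetical helper: parse a string into an XDocument while refusing any DTD.
    public static XDocument ParseWithoutDtd(string xmlContent)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Prohibit, // throw XmlException on <!DOCTYPE ...>
            XmlResolver = null                      // never fetch external resources
        };

        using (var reader = XmlReader.Create(new StringReader(xmlContent), settings))
        {
            return XDocument.Load(reader);
        }
    }
}
```

With this sketch, a payload that declares an entity in a DOCTYPE should fail with an XmlException before any entity is resolved, so the caller can reject the request.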

How to check if a long string is a valid XML?

I have a string, and I want to process it if it is valid XML; if not, I want to tell the user that the string is not valid XML.
My code is this:
try
{
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(rawData);
    // And here I want to do some things with doc if it is valid XML.
}
catch
{
    // Tell the user that the string is not valid XML.
}
Now, if rawData contains valid XML, there is no problem. If rawData contains something else (like HELLOEVERYBODY!), it throws an exception, so I can tell the user the string is not valid XML.
But when rawData contains an HTML page, the process takes a long time (more than 20 seconds!).
It differs from page to page. For example, it processes stackoverflow.com quickly, but processing 1pezeshk.com takes a very long time.
Isn't there a faster way to validate XML before loading it into an XmlDocument?
I've seen this before. The problem is that XmlDocument tries to download the DTD for the document. In your sample this is http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd, which lets you open a connection but never returns anything. So a simple solution (without any error checking, mind you) is to remove everything before the <html> tag, like this:
WebClient wc = new WebClient();
wc.Encoding = Encoding.UTF8;
string data = wc.DownloadString("http://1pezeshk.com/");
data = data.Remove(0, data.IndexOf("<html"));
XmlDocument xml = new XmlDocument();
xml.LoadXml(data);
Edit
Browsing to http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd actually returns the DTD, but it took well over a minute to respond. Since you won't be doing DTD validation anyway, you should really just strip this from your HTML and then try to validate it as HTML.
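Another way to avoid the slow DTD download, without cutting the string by hand, is to tell the reader to skip the DOCTYPE. The following is a minimal sketch under that assumption; the class name NoDtdLoader and the method name TryLoad are hypothetical. Because the DTD is ignored, named entities that it would have declared (such as &nbsp;) will likely be reported as errors, so many real HTML pages will still fail, just quickly.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NoDtdLoader
{
    // Hypothetical helper: load markup into an XmlDocument without touching its DTD.
    public static bool TryLoad(string rawData, out XmlDocument doc)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip <!DOCTYPE ...>, no DTD processing
            XmlResolver = null                    // no network or file access for external resources
        };

        doc = new XmlDocument();
        try
        {
            using (var reader = XmlReader.Create(new StringReader(rawData), settings))
            {
                doc.Load(reader);
            }
            return true;
        }
        catch (XmlException)
        {
            // Not well-formed XML (or it uses entities that only the DTD declares).
            doc = null;
            return false;
        }
    }
}
```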

modifying pseudo-xml doc before it hits XmlDocument.Load

I have a series of... pseudo-xml files. What I mean by this, is they are almost XML files, but they are missing the xml declaration and a root node. e.g. conceptually it may look like this:
<a>info</a>
<b>info2</b>
What I want to do is load it into an XmlDocument object, e.g something similar to this:
XmlDocument xml = new XmlDocument();
using (StreamReader file = new StreamReader(File.Open(@"file.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)))
{
    xml.Load(file);
}
This is throwing errors, most likely because the pseudo-xml file is not well-formed. I need to somehow add a root node before it hits the Load. I don't want to modify the actual file, or have to save anything to disk (e.g. a new temp file). I'm stuck on this; any suggestions?
XmlDocument also has a LoadXml() method that parses an XML string. You can load your file content into a string, wrap it in a root element (adding the declaration if you need one) and call LoadXml().
Of course, with long files this can consume a lot of memory, so pay attention to that.
You could try this:
var xmlString = file.ReadToEnd();
xmlString = "<root>" + xmlString + "</root>";
xml.LoadXml(xmlString);
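If building one large string is a concern, a possible alternative is to read the file as an XML fragment and append each top-level node under a synthetic root. This is a sketch, not the answerers' method: the class name FragmentLoader, the method name LoadFragments, and the default root name "root" are made up for illustration, while ConformanceLevel.Fragment and XmlDocument.ReadNode are standard APIs.

```csharp
using System;
using System.IO;
using System.Xml;

public static class FragmentLoader
{
    // Hypothetical helper: wrap a rootless sequence of elements in a synthetic root element.
    public static XmlDocument LoadFragments(TextReader input, string rootName = "root")
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment, // allow several top-level elements
            IgnoreWhitespace = true                       // skip the line breaks between them
        };

        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement(rootName);
        doc.AppendChild(root);

        using (var reader = XmlReader.Create(input, settings))
        {
            XmlNode node;
            // ReadNode builds one node (with its subtree) per call and
            // returns null once the reader has reached the end of the input.
            while ((node = doc.ReadNode(reader)) != null)
            {
                root.AppendChild(node);
            }
        }
        return doc;
    }
}
```

It could be called with the StreamReader from the question, for example FragmentLoader.LoadFragments(file), leaving the file on disk untouched.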

validate xml string content including encoding using C#

I need to validate a string that contains XML data; no schema validation is required. All I need to do is make sure that the XML is well formed and properly encoded. For example, I want my code to identify this snippet of XML as invalid:
<?xml version="1.0" encoding="utf-8"?>
<parentNode> Positions1 ’</parentNode>
Using the LoadXml method of XmlDocument does not work; no errors are thrown when I load the snippet above.
I know how to do this if the content is in an XML file, as the following snippet shows:
XmlReaderSettings readerSettings = new XmlReaderSettings();
readerSettings.ConformanceLevel = ConformanceLevel.Document;
readerSettings.CheckCharacters = true;
readerSettings.ValidationType = ValidationType.None;
XmlReader xmlReader = XmlReader.Create(xmlFileName, readerSettings);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(xmlReader);
So, short of creating a temporary file to write out my XML string content and then creating an XmlReader instance to read it, is there any alternative? I would much appreciate it if someone could guide me in the right direction with this problem.
You have not fully understood what encoding means. Once you have a .NET string in memory, it is no longer raw data and therefore has no encoding, so LoadXml ignores the encoding declaration for good reason. What you want to do does not make much sense. But if you really want to do it:
You can convert your string into an in-memory stream, so you don't have to write a temporary file. Then you can use that stream instead of xmlFileName in your call to XmlReader.Create.
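The in-memory stream suggestion could look roughly like this. It is a sketch under assumptions: the class name InMemoryXml and the method name LoadFromString are hypothetical, and the string is encoded as UTF-8, which assumes any XML declaration in it also says utf-8. As the answer points out, bytes produced this way are always valid UTF-8, so this avoids the temporary file but cannot detect encoding damage that happened before the data became a string.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class InMemoryXml
{
    // Hypothetical helper: feed a string to XmlReader through a MemoryStream instead of a temp file.
    public static XmlDocument LoadFromString(string xmlContent)
    {
        var readerSettings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Document,
            CheckCharacters = true,
            ValidationType = ValidationType.None
        };

        // Assumption: the XML declaration (if present) declares utf-8, matching these bytes.
        byte[] bytes = Encoding.UTF8.GetBytes(xmlContent);

        using (var stream = new MemoryStream(bytes))
        using (var xmlReader = XmlReader.Create(stream, readerSettings))
        {
            var xdoc = new XmlDocument();
            xdoc.Load(xmlReader); // throws XmlException if the content is not well formed
            return xdoc;
        }
    }
}
```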
Achim,
Thanks for your detailed replies, I was able to finally come up with a solution that fits my needs. It involves grabbing the bytes out of the 'unicode' string and then transforming the bytes to utf8 encoding.
try
{
    // UTF-16 bytes of the in-memory string
    byte[] xmlContentInBytes = new System.Text.UnicodeEncoding().GetBytes(xmlContent);
    // No BOM; throwOnInvalidBytes = true
    System.Text.UTF8Encoding utf8 = new System.Text.UTF8Encoding(false, true);
    // Throws if the bytes are not valid UTF-8
    utf8.GetChars(xmlContentInBytes);
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
    return false;
}

How to resolve System.OutOfMemoryException when loading large XML file

I have this code in my program that loads files of 500 MB and up.
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
reader.Close();
I get this kind of error and don't know how to resolve the problem. Please send me some advice.
I would use an XmlReader to parse the document. It provides forward-only access to the data and does not keep the whole document in memory. Of course, the code can become much more complex without the convenience of the XmlDocument class.
This simple sample reads the file node by node through a single XmlReader.
using (var rdr = XmlReader.Create(new StreamReader("File.xml")))
{
    while (rdr.Read())
    {
        // Do what you will with the current node.
    }
}
See the methods and properties available to you when using the XmlReader at XmlReader Properties (MSDN)
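As an illustration of what the loop body might do, here is a minimal sketch that counts elements with a given name while streaming. The class name StreamingCount and the method name CountElements are hypothetical; NodeType and LocalName are standard XmlReader properties.

```csharp
using System;
using System.IO;
using System.Xml;

public static class StreamingCount
{
    // Hypothetical helper: count elements with a given local name without building a DOM.
    public static int CountElements(TextReader input, string localName)
    {
        int count = 0;
        using (var rdr = XmlReader.Create(input))
        {
            while (rdr.Read())
            {
                // Only start tags (including empty elements) are reported as Element nodes.
                if (rdr.NodeType == XmlNodeType.Element && rdr.LocalName == localName)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Memory use stays roughly constant regardless of file size, because only the current node is held at any time.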
You need something like SAX, but for .NET:
http://sourceforge.net/projects/saxdotnet/ or the XmlReader, which is basically a stream-based parser.
HTH
