How to check if a long string is a valid XML? - c#

I have an string, and I want to do some things with it if it is a valid XML; and If not, tell the user that the string is not a valid XML.
My code is this:
try
{
XmlDocument doc = new XmlDocument();
doc.LoadXml(rawData);
//And here I want to do some things with doc if it is a valid XML.
}
catch
{
//Tell the user that the string is not a valid XML.
}
Now, If rawData contains a valid XML data, there is no problem. Also if rawData contains something else (like HELLOEVERYBODY!), It will throw an exception, So I can tell the user the string is not a valid XML.
But When rawData contains a HTML page, The process takes a long time (more than 20 seconds!)...
It may differ from page to page. for example, it can process stackoverflow.com quickly, but processing 1pezeshk.com takes a long long time...
Isn't there any faster way to validate XML before loading it into a XmlDocument?

I've seen this before and the problem is that XmlDocument tries to download the DTD for the document. In your sample this is http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd which lets you open a connection but never returns anything. So a simple solution (without any type of error checking mind you) is to remove anything before the -tag like this.
WebClient wc = new WebClient();
wc.Encoding = Encoding.UTF8;
string data = wc.DownloadString("http://1pezeshk.com/");
data = data.Remove(0, data.IndexOf("<html"));
XmlDocument xml = new XmlDocument();
xml.LoadXml(data);
Edit
Browsing to http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd actully returns the DTD, but it took well over a minute to respond. Since you still won't do DTD-validation you should really just strip this from your HTML and then try to validate it as HTML.

Related

How do I display XML that has been posted to an ASHX handler on a page?

I am a new programmer working with C# and building a simple page, to which a third-party service posts XML. I would like to simply display this XML as text on a webpage.
I've started a new ASHX file that contains the following method:
public void ProcessRequest (HttpContext context)
{
StreamReader reader = new StreamReader(context.Request.InputStream);
String xmlData = reader.ReadToEnd();
XmlDocument doc = new XmlDocument();
doc.LoadXml(xmlData);
context.Response.ContentType = "text/xml";
context.Response.Write(xmlData);
}
However, when the third-party application tries to post to the page, I get an error that the Root Element is missing.
If I understand this error correctly, this would seem to imply that the XML that was posted is not formatted correctly.
1.) If I am incorrect in this assumption, is there something wrong with the way that I am trying to receive and display the XML data using this handler?
2.) Otherwise, how might I easily troubleshoot by examining the contents of the XML that is sent to the service using my handler?
1.) It's a good assumption, or the request is blank.
The doc.LoadXml(xmlData); line will be where it's complaining about about a missing root element.
2.) You could start by simply commenting out the above line, because you're not doing anything with the doc object yet.
You might also need to set the content type to text/plain
Longer term, you may want some try/catch logic around the LoadXml() call. Something like this perhaps...
try
{
doc.LoadXml(xmlData);
context.Response.ContentType = "text/xml";
context.Response.Write(xmlData);
}
catch (Exception e)
{
context.Response.ContentType = "text/plain";
context.Response.Write(e.Message);
}

modifying pseudo-xml doc before it hits XmlDocument.Load

I have a series of... pseudo-xml files. What I mean by this, is they are almost XML files, but they are missing the xml declaration and a root node. e.g. conceptually it may look like this:
<a>info</a>
<b>info2</b>
What I want to do is load it into an XmlDocument object, e.g something similar to this:
XmlDocument xml = new XmlDocument();
using (StreamReader file = new StreamReader(File.Open(#"file.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)))
{
xml.Load(file);
}
This is throwing errors, most likely due to the ill formatted pseudo-xml file. I need to somehow handle adding in a root node before it hits the Load. I don't want to modify the actual file, or have to save anything to disk (e.g. a new temp file). I'm stuck on this, any suggestions?
XmlDocument has also a LoadXml() method that parses an Xml string. You can load your file content into a string, add the declaration and call LoadXml().
Of course, when you are using long files, this can be very memory consuming, pay attention to that.
you could try this
var xmlString = file.ReadToEnd();
xmlString = "<root>" + xmlString + "</root>";
xml.LoadXml(xmlString);

validate xml string content including encoding using C#

I need to validate a string that contains XML Data, there is no schema validation required. All I need to do is make sure that the XML is well formed and properly encoded. For example, I want my code to identify this snippet of XML as invalid:
<?xml version="1.0" encoding="utf-8"?>
<parentNode> Positions1 ’</parentNode>
Using the LoadXML method in XMLDocument does not work, there are no errors thrown when I load the snippet above.
I am aware of how to do this if the content were in an XML file, the following snippet of code shows that:
XmlReaderSettings readerSettings = new XmlReaderSettings();
readerSettings.ConformanceLevel = ConformanceLevel.Document;
readerSettings.CheckCharacters = true;
readerSettings.ValidationType = ValidationType.None;
xmlReader = XmlReader.Create(xmlFileName, readerSettings);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(xmlReader);
So short of creating a temporary file to write out my xml string content and then creating an XmlReader instance to read it, is there any alternative? Appreciate much if someone could guide me in the right direction with this problem.
You have not fully understand what encoding means. If you have a .Net string in memory, it's no more "raw data" and has no encoding for that reason. And so LoadXML ingores for a good reason. So what you want to do makes not much sense at all. But if you really want to do it:
You can convert your string into a in memory stream, so you don't have to write a temporary file. Then you can use that stream instead of the xmlFileName in your call to XmlReader.Create.
Achim,
Thanks for your detailed replies, I was able to finally come up with a solution that fits my needs. It involves grabbing the bytes out of the 'unicode' string and then transforming the bytes to utf8 encoding.
try
{
byte[] xmlContentInBytes = new System.Text.UnicodeEncoding().GetBytes(xmlContent);
System.Text.UTF8Encoding utf8 = new System.Text.UTF8Encoding(false, true);
utf8.GetChars(xmlContentInBytes);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
return false;
}

How to Verify using C# if a XML file is broken

Is there anything built in to determine if an XML file is valid. One way would be to read the entire content and verify if the string represents valid XML content. Even then, how to determine if string contains valid XML data.
Create an XmlReader around a StringReader with the XML and read through the reader:
using (var reader = XmlReader.Create(something))
while(reader.Read())
;
If you don't get any exceptions, the XML is well-formed.
Unlike XDocument or XmlDocument, this will not hold an entire DOM tree in memory, so it will run quickly even on extremely large XML files.
You can try to load the XML into XML document and catch the exception.
Here is the sample code:
var doc = new XmlDocument();
try {
doc.LoadXml(content);
} catch (XmlException e) {
// put code here that should be executed when the XML is not valid.
}
Hope it helps.
Have a look at this question:
How to check for valid xml in string input before calling .LoadXml()

Need help for parsing HTML in C#

For personal use i am trying to parse a little html page that show in a simple grid the result of the french soccer championship.
var Url = "http://www.lfp.fr/mobile/ligue1/resultat.asp?code_jr_tr=J01";
WebResponse result = null;
WebRequest req = WebRequest.Create(Url);
result = req.GetResponse();
Stream ReceiveStream = result.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding(0);
StreamReader sr = new StreamReader(ReceiveStream, encode);
while (sr.Read() != -1)
{
Line = sr.ReadLine();
Line = Regex.Replace(Line, #"<(.|\n)*?>", " ");
Line = Line.Replace(" ", "");
Line = Line.TrimEnd();
Line = Line.TrimStart();
and then i really dont have a clue either take line by line or the
whole stream at one and how to retreive only the team's name with the next number that would be the score.
At the end i want to put both 2 team's with scores in a liste or xml to use it with an phone application
If anyone has an idea it would be great thanks!
Take a look at Html Agility Pack
You could put the stream into an XmlDocument, allowing you to query via something like XPath. Or you could use LINQ to XML with an XDocument.
It's not perfect though, because HTML files aren't always well-formed XML (don't we know it!), but it's a simple solution using stuff already available in the framework.
You'll need an SgmlReader, which provides an XML-like API over any SGML document (which an HTML document really is).
You could use the Regex.Match method to pull out the team name and score. Examine the html to see how each row is built up. This is a common technique in screen scraping.

Categories

Resources