validate xml string content including encoding using C#

validate xml string content including encoding using C# - c#

I need to validate a string that contains XML Data, there is no schema validation required. All I need to do is make sure that the XML is well formed and properly encoded. For example, I want my code to identify this snippet of XML as invalid:
<?xml version="1.0" encoding="utf-8"?>
<parentNode> Positions1 ’</parentNode>
Using the LoadXML method in XMLDocument does not work, there are no errors thrown when I load the snippet above.
I am aware of how to do this if the content were in an XML file, the following snippet of code shows that:
XmlReaderSettings readerSettings = new XmlReaderSettings();
readerSettings.ConformanceLevel = ConformanceLevel.Document;
readerSettings.CheckCharacters = true;
readerSettings.ValidationType = ValidationType.None;
xmlReader = XmlReader.Create(xmlFileName, readerSettings);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(xmlReader);
So short of creating a temporary file to write out my xml string content and then creating an XmlReader instance to read it, is there any alternative? Appreciate much if someone could guide me in the right direction with this problem.

You have not fully understand what encoding means. If you have a .Net string in memory, it's no more "raw data" and has no encoding for that reason. And so LoadXML ingores for a good reason. So what you want to do makes not much sense at all. But if you really want to do it:
You can convert your string into a in memory stream, so you don't have to write a temporary file. Then you can use that stream instead of the xmlFileName in your call to XmlReader.Create.

Achim,
Thanks for your detailed replies, I was able to finally come up with a solution that fits my needs. It involves grabbing the bytes out of the 'unicode' string and then transforming the bytes to utf8 encoding.
try
{
byte[] xmlContentInBytes = new System.Text.UnicodeEncoding().GetBytes(xmlContent);
System.Text.UTF8Encoding utf8 = new System.Text.UTF8Encoding(false, true);
utf8.GetChars(xmlContentInBytes);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
return false;
}

Related

How to check if a long string is a valid XML?

I have an string, and I want to do some things with it if it is a valid XML; and If not, tell the user that the string is not a valid XML.
My code is this:
try
{
XmlDocument doc = new XmlDocument();
doc.LoadXml(rawData);
//And here I want to do some things with doc if it is a valid XML.
}
catch
{
//Tell the user that the string is not a valid XML.
}
Now, If rawData contains a valid XML data, there is no problem. Also if rawData contains something else (like HELLOEVERYBODY!), It will throw an exception, So I can tell the user the string is not a valid XML.
But When rawData contains a HTML page, The process takes a long time (more than 20 seconds!)...
It may differ from page to page. for example, it can process stackoverflow.com quickly, but processing 1pezeshk.com takes a long long time...
Isn't there any faster way to validate XML before loading it into a XmlDocument?

I've seen this before and the problem is that XmlDocument tries to download the DTD for the document. In your sample this is http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd which lets you open a connection but never returns anything. So a simple solution (without any type of error checking mind you) is to remove anything before the -tag like this.
WebClient wc = new WebClient();
wc.Encoding = Encoding.UTF8;
string data = wc.DownloadString("http://1pezeshk.com/");
data = data.Remove(0, data.IndexOf("<html"));
XmlDocument xml = new XmlDocument();
xml.LoadXml(data);
Edit
Browsing to http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd actully returns the DTD, but it took well over a minute to respond. Since you still won't do DTD-validation you should really just strip this from your HTML and then try to validate it as HTML.

How to Verify using C# if a XML file is broken

Is there anything built in to determine if an XML file is valid. One way would be to read the entire content and verify if the string represents valid XML content. Even then, how to determine if string contains valid XML data.

Create an XmlReader around a StringReader with the XML and read through the reader:
using (var reader = XmlReader.Create(something))
while(reader.Read())
;
If you don't get any exceptions, the XML is well-formed.
Unlike XDocument or XmlDocument, this will not hold an entire DOM tree in memory, so it will run quickly even on extremely large XML files.

You can try to load the XML into XML document and catch the exception.
Here is the sample code:
var doc = new XmlDocument();
try {
doc.LoadXml(content);
} catch (XmlException e) {
// put code here that should be executed when the XML is not valid.
}
Hope it helps.

Have a look at this question:
How to check for valid xml in string input before calling .LoadXml()

Getting "ï»¿" at the beginning of my XML File after save() [duplicate]

This question already has answers here:
How can I remove the BOM from XmlTextWriter using C#?
(2 answers)
Closed 7 years ago.
I'm opening an existing XML file with C#, and I replace some nodes in there. All works fine. Just after I save it, I get the following characters at the beginning of the file:
ï»¿ (EF BB BF in HEX)
The whole first line:
ï»¿<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The rest of the file looks like a normal XML file.
The simplified code is here:
XmlDocument doc = new XmlDocument();
doc.Load(xmlSourceFile);
XmlNode translation = doc.SelectSingleNode("//trans-unit[#id='127']");
translation.InnerText = "testing";
doc.Save(xmlTranslatedFile);
I'm using a C# Windows Forms application with .NET 4.0.
Any ideas? Why would it do that? Can we disable that somehow? It's for Adobe InCopy, and it does not open it like this.
UPDATE:
Alternative Solution:
Saving it with the XmlTextWriter works too:
XmlTextWriter writer = new XmlTextWriter(inCopyFilename, null);
doc.Save(writer);

It is the UTF-8 BOM, which is actually discouraged by the Unicode standard:
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
Use of a BOM is neither required nor
recommended for UTF-8, but may be
encountered in contexts where UTF-8
data is converted from other encoding
forms that use a BOM or where the BOM
is used as a UTF-8 signature
You may disable it using:
var sw = new IO.StreamWriter(path, new System.Text.UTF8Encoding(false));
doc.Save(sw);
sw.Close();

It's a UTF-8 Byte Order Mark (BOM) and is to be expected.

You can try to change the encoding of the XmlDocument. Below is the example copied from MSDN
using System; using System.IO; using System.Xml;
public class Sample {
public static void Main() {
// Create and load the XML document.
XmlDocument doc = new XmlDocument();
string xmlString = "<book><title>Oberon's Legacy</title></book>";
doc.Load(new StringReader(xmlString));
// Create an XML declaration.
XmlDeclaration xmldecl;
xmldecl = doc.CreateXmlDeclaration("1.0",null,null);
xmldecl.Encoding="UTF-16";
xmldecl.Standalone="yes";
// Add the new node to the document.
XmlElement root = doc.DocumentElement;
doc.InsertBefore(xmldecl, root);
// Display the modified XML document
Console.WriteLine(doc.OuterXml);
}
}

As everybody else mentioned, it's Unicode issue.
I advise you to try LINQ To XML. Although not really related, I mention it as it's super easy compared to old ways and, more importantly, I assume it might have automatic resolutions to issues like these without extra coding from you.

Correcting Encoding in a large Xml File

I'm importing data from XML files containing this type of content:
<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>
The XML is loaded via:
XmlDocument doc = new XmlDocument();
try
{
doc.Load(fullFilePath);
}
When I execute this code with the data contained on top I get an exception about an illegal character. I understand that part just fine.
I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?
Update: I do not have any encoding declaration or <?xml in this document.
I've seen some links say to add it dynamically? Is this UTF-16 encoding?

It appears that:
The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
But it was decoded using windows-1252 (the "ANSI" code page).

If you look at the file with a hex editor (HXD or Visual Studio, for instance), what exactly do you see?
Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...
Edit: dan04 hit the nail on the head. ™ in cp-1252 has hex value 99, and š is 9a. In cp-437 and cp-850, hex 99 represents Ö, and 9a Ü.
The fix is simple: just specify this encoding when opening your XML file:
XmlDocument doc = new XmlDocument();
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
doc.Load(reader);
}

From here:
Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
using (var xmlreader = new XmlTextReader(stream))
{
xmlreader.MoveToContent();
encoding = xmlreader.Encoding;
}
}
You might want to take a look at this: How to best detect encoding in XML file?
For actual reading you can use StreamReader to take care of BOM(Byte order mark):
string xml;
using (var reader = new StreamReader("FilePath", true))
{ // ↑
xml= reader.ReadToEnd(); // detectEncodingFromByteOrderMarks
}
Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not it will default to UTF8.
Edit 2: Detecting Text Encoding for StreamReader

Obviously you provided a fragment of the XML document since it's missing a root element, so I'll assume that was your intention. Is there an xml processing instruction at the top like <?xml version="1.0" encoding="UTF-8" ?>?

Under High load XDocument.Parse Creating errors

I am trying to access this webservice, The problem is that sometimes XDocument.Parse is not able to process and generates an error System.Xml.XmlException: Root element is missing. on the line:
XDocument xmlDoc = XDocument.Parse(xmlData);
Even though the XML sent is correct according to my logs.
I was wondering, Is it possible that the StreamReader is not working properly
using (StreamReader reader = new StreamReader(context.Request.InputStream))
{
xmlData = reader.ReadToEnd();
}
XDocument xmlDoc = XDocument.Parse(xmlData);
By the way this is all under a Custom HttpHandler.
Can someone please me guide in the right direction for this.
Thanks

Does it work any more consistently if you use
XDocument.Load(new StreamReader(context.Request.InputStream))
instead of XDocument.Parse?

Your code sample doesn't include logging of the read inputstream. The problem is prior to this point.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

validate xml string content including encoding using C# - c#

Related

How to check if a long string is a valid XML?

How to Verify using C# if a XML file is broken

Getting "ï»¿" at the beginning of my XML File after save() [duplicate]

Correcting Encoding in a large Xml File

Under High load XDocument.Parse Creating errors

Categories

Resources