Testing whether or not something is parseable XML in C# [duplicate] - c#

This question already has answers here:
Check well-formed XML without a try/catch?
(11 answers)
Closed 9 years ago.
Does anyone know of a quick way to check if a string is parseable as XML in C#? Preferably something quick, low resource, which returns a boolean whether or not it will parse.
I'm working on a database app which deals with errors that are sometimes stored as XML, and sometimes not. Hence, I'd like to just be able to test the string I grab from the database (contained in a DataTable) very quickly...and not have to resort to any try / catch {} statements or other kludges...unless those are the only way to make it happen.

It sounds like you sometimes get back XML and sometimes get back "plain" (non-XML) text.
If that's the case you could just check that the text starts with <:
if (!string.IsNullOrEmpty(str) && str.TrimStart().StartsWith("<"))
var doc = XDocument.Parse(str);
Since "plain" messages seem unlikely to start with < this may be reasonable. The only thing you need to decide is what to do in the edge case that you have non-XML text that starts with a <?
If it were me I would default to trying to parse it and catching the exception:
if (!string.IsNullOrEmpty(str) && str.TrimStart().StartsWith("<"))
{
try
{
var doc = XDocument.Parse(str);
return //???
}
catch (Exception)
{
// Started with '<' but did not parse, so treat it as plain text.
return str;
}
}
else
{
return str;
}
That way the only time you have the overhead of a thrown exception is when you have a message that starts with < but is not valid XML.
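A minimal sketch of that idea, wrapped up as a Try-style helper so the caller gets both the boolean and the parsed document. The names XmlProbe and TryParseXml are made up for illustration; only XDocument.Parse and XmlException come from the framework.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlProbe
{
    // Hypothetical helper: returns true and the parsed document when 'text' is well-formed XML.
    public static bool TryParseXml(string text, out XDocument document)
    {
        document = null;

        // Cheap pre-check: text that does not start with '<' is treated as plain text.
        if (string.IsNullOrWhiteSpace(text) ||
            !text.TrimStart().StartsWith("<", StringComparison.Ordinal))
        {
            return false;
        }

        try
        {
            document = XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Started with '<' but was not well-formed XML.
            return false;
        }
    }
}
```

The exception cost is only paid for strings that begin with < and still fail to parse, which should be rare for plain error messages.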

You could try to parse the string into an XDocument. If it fails to parse, then you know that it is not valid.
string xml = "";
XDocument document = XDocument.Parse(xml);
And if you don't want to have the ugly try/catch visible, you can throw it into an extension method on the string class...
public static bool IsValidXml(this string xml)
{
try
{
XDocument.Parse(xml);
return true;
}
catch
{
return false;
}
}
Then your code simply looks like if (mystring.IsValidXml()) {

The only way you can really find out if something will actually parse is to...try and parse it.
An XML document should (but may not) have an XML declaration at the head of the file, following the BOM (if present). It should look something like this:
<?xml version="1.0" encoding="UTF-8" ?>
Though the encoding attribute is, I believe, optional (defaulting to UTF-8). It might also have a standalone attribute whose value is yes or no. If the declaration is present, that's a pretty good indicator that the document is supposed to be valid XML.
Riffing on @GaryWalker's excellent answer, something like this is about as good as it gets, I think (though the settings might need some tweaking, a custom no-op resolver perhaps). Just for kicks, I generated a 300 MB random XML file using XMark xmlgen (http://www.xml-benchmark.org/): validating it with the code below takes 1.7–1.8 seconds elapsed time on my desktop machine.
public static bool IsMinimallyValidXml( Stream stream )
{
XmlReaderSettings settings = new XmlReaderSettings
{
CheckCharacters = true ,
ConformanceLevel = ConformanceLevel.Document ,
DtdProcessing = DtdProcessing.Ignore ,
IgnoreComments = true ,
IgnoreProcessingInstructions = true ,
IgnoreWhitespace = true ,
ValidationFlags = XmlSchemaValidationFlags.None ,
ValidationType = ValidationType.None ,
} ;
bool isValid ;
using ( XmlReader xmlReader = XmlReader.Create( stream , settings ) )
{
try
{
while ( xmlReader.Read() )
{
; // This space intentionally left blank
}
isValid = true ;
}
catch (XmlException)
{
isValid = false ;
}
}
return isValid ;
}
static void Main( string[] args )
{
string text = "<foo>This &SomeEntity; is about as simple as it gets.</foo>" ;
Stream stream = new MemoryStream( Encoding.UTF8.GetBytes(text) ) ;
bool isValid = IsMinimallyValidXml( stream ) ;
return ;
}

The best answer I've seen for testing well-formed XML is "What is the fastest way to programmatically check the well-formedness of XML files in C#?" It covers using an XmlReader to do this efficiently.
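For an in-memory string, the XmlReader approach might look roughly like the sketch below. It also reports where the first error was found, using the LineNumber and LinePosition properties of XmlException. The helper name FindFirstXmlError is an assumption for illustration, not something from the linked answer.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormedCheck
{
    // Hypothetical helper: returns null when 'text' is well-formed,
    // otherwise a short description of the first error encountered.
    public static string FindFirstXmlError(string text)
    {
        if (text == null) return "Input is null.";

        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Document,
            DtdProcessing = DtdProcessing.Ignore
        };

        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(text), settings))
            {
                while (reader.Read())
                {
                    // Reading every node is enough to trigger the well-formedness checks.
                }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return string.Format("Line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }
}
```

No DOM is built here, so memory use stays low even for large inputs.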

Related

XmlException - given illegal XML from 3rd party; must process

There are several SO questions and answers about this when creating an XML file; but can't find any pertaining to when you are given bad XML from a 3rd party that you must process; note, the 3rd party cannot be held accountable for the illegal XML.
Ultimately, the .InnerText needs to be escaped or encoded (e.g. changed to legal XML characters) - and later decoded after proper XML parsing.
QUESTION: Are there any libraries that will Load() Invalid/Illegal XML files to allow quick navigation for such escaping/encoding? Or am I stuck having to manually parse the invalid xml, fixing it along the way ... ?
<?xml version="1.0" encoding="utf-8"?>
<ChunkData>
<Fields>
<Field1>some words < other words</Field1>
<Field2>some words > other words</Field2>
</Fields>
</ChunkData>
Although HtmlAgilityPack is awesome (and I'm using it in another project of my own), I was given no time to follow Alexei's advice - which is exactly the direction that I was looking for - can't parse it as XML? Cool, parse it as HTML ... didn't even cross my mind ...
Ended up with this, which does the trick (but is exactly what Alexei advised against):
private static string EncodeValues(string xml)
{
var doc = new List<string>();
var lines = xml.Split('\n');
foreach (var line in lines)
{
var output = line;
if (line.Contains("<Field") && !line.Contains("Fields>"))
{
var value = line.Parse(">", "</");
var encoded = HttpUtility.UrlEncode(value);
output = line.Replace(value, encoded);
}
doc.Add(output);
}
return string.Join("", doc);
}
private static Hashtable DecodeValues(IDictionary data)
{
var output = new Hashtable();
foreach (var key in data.Keys)
{
var value = (string)data[key];
output.Add(key, HttpUtility.UrlDecode(value));
}
return output;
}
Used in conjunction with an extension method I wrote quite a while ago ...
public static string Parse(this string s, string first, string second)
{
try
{
if (string.IsNullOrEmpty(s)) return "";
var start = s.IndexOf(first, StringComparison.InvariantCulture) + first.Length;
var end = s.IndexOf(second, start, StringComparison.InvariantCulture);
var length = end - start;
return (end > 0 && length < s.Length) ? s.Substring(start, length) : s.Substring(start);
}
catch (Exception) { return ""; }
}
Used as such (kept separate from the Transform and Hashtable creation methods for clarity):
xmlDocs[0] = EncodeValues(xmlDocs[0]); // in order to handle illegal chars in XML, encode InnerText
var doc = TransformXmlDocument(orgName, xmlDocs[0], xmlDocs[1]);
var data = GetHashtableFromXml(doc);
data = DecodeValues(data); // decode the values extracted from the hashtable
Regardless, I'm always looking for insight ... feel free to comment on this solution - or provide another.

XML Deserialization with special characters in C# XmlSerializer

I have an XML file which contains a special character (& is the special character causing issues), and I use the code below to deserialize the XML:
XMLDATAMODEL imported_data;
// Create an instance of the XmlSerializer specifying type and namespace.
XmlSerializer serializer = new XmlSerializer(typeof(XMLDATAMODEL));
// A FileStream is needed to read the XML document.
FileStream fs = new FileStream(path, FileMode.Open);
XmlReader reader = XmlReader.Create(fs);
// Use the Deserialize method to restore the object's state.
imported_data = (XMLDATAMODEL)serializer.Deserialize(reader);
fs.Close();
and the structure of my XML model is like this:
[XmlRoot(ElementName = "XMLDATAMODEL")]
public class XMLDATAMODEL
{
[XmlElement(ElementName = "EventName")]
public string EventName { get; set; }
[XmlElement(ElementName = "Location")]
public string Location { get; set; }
}
I also tried this code, with the encoding specified, but had no success:
// Declare an object variable of the type to be deserialized.
StreamReader streamReader = new StreamReader(path, System.Text.Encoding.UTF8, true);
XmlSerializer serializer = new XmlSerializer(typeof(XMLDATAMODEL));
imported_data = (XMLDATAMODEL)serializer.Deserialize(streamReader);
streamReader.Close();
Both approaches failed, and if I put the special character inside CDATA it seems to work.
How can I make it work for XML data without CDATA as well?
Here is my XML file content
http://pastebin.com/Cy7icrgS
And the error I am getting is: There is an error in XML document (2, 17).
The best answer I could get after looking around is that, unless you serialize the data yourself, it will be pretty troublesome to deserialize XML with special characters.
In your case, since the special character is &, you should convert it to &amp; before you deserialize. Unless the character & is converted to &amp;, we cannot really deserialize it with XmlSerializer. Yes, we can still read it by using
XmlReaderSettings settings = new XmlReaderSettings();
settings.CheckCharacters = false; //not to check false character, this setting can be set.
FileStream fs = new FileStream(xmlfolder + "\\xmltest.xml", FileMode.Open);
XmlReader reader = XmlReader.Create(fs, settings);
But we cannot deserialize it.
As for how to convert & to &amp;, there are various ways, each with pluses and minuses. But the bottom line in all conversions is: do not use the stream directly. Just take the data from the file and convert it to a string by using, for example, File.ReadAllText, and do the string processing there. After that, convert it to a MemoryStream and start the deserialization.
And now for the string processing before deserialization, there are a couple of ways to do it.
The easiest, and most of the time the most unsafe, would be string.Replace("&", "&amp;").
The other way, harder but safer, is to use Regex. Since your case is something inside CData, this could be a good way too.
Another way, harder yet safer, is to write your own parsing, line by line.
I have yet to find a common, safe way to do this conversion.
But as for your example, string.Replace would work. Also, you could potentially exploit the pattern (something inside CData) to use Regex.
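As a rough sketch of the Regex route: replace only those & characters that do not already start an entity reference, then deserialize from the repaired string. The LenientDeserializer class and its method names are assumptions for illustration; XMLDATAMODEL is copied from the question so the block stands on its own. One caveat: this would also rewrite a bare & inside a CDATA section, where it is legal, so treat it as a starting point rather than a general fix.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.Serialization;

[XmlRoot(ElementName = "XMLDATAMODEL")]
public class XMLDATAMODEL
{
    [XmlElement(ElementName = "EventName")]
    public string EventName { get; set; }
    [XmlElement(ElementName = "Location")]
    public string Location { get; set; }
}

public static class LenientDeserializer
{
    // Matches '&' that is NOT followed by a named (&amp;), decimal (&#38;)
    // or hexadecimal (&#x26;) entity reference.
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)");

    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }

    public static XMLDATAMODEL Deserialize(string rawXml)
    {
        var serializer = new XmlSerializer(typeof(XMLDATAMODEL));
        using (var reader = new StringReader(EscapeBareAmpersands(rawXml)))
        {
            return (XMLDATAMODEL)serializer.Deserialize(reader);
        }
    }
}
```

Compared with a plain string.Replace, the lookahead avoids double-escaping text that is already correct, such as an existing &amp;.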
Edit:
As for what are considered as special characters in XML and how to process them before hand, according to this, non-Roman characters are included.
Apart from the non-Roman characters, in here, there are 5 special characters listed:
< -> &lt;
> -> &gt;
" -> &quot;
' -> &apos;
& -> &amp;
And from here, we get one more:
% -> &#37;
Hope they can help you!
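If you control the code that writes the XML, the five replacements listed above do not have to be done by hand. A small sketch, assuming System.Security.SecurityElement.Escape is available on your target framework (the wrapper name EscapeForXml is made up here):

```csharp
using System;
using System.Security;

public static class XmlTextEscaping
{
    // Hypothetical wrapper: SecurityElement.Escape replaces <, >, ", ' and &
    // with their predefined XML entity references.
    public static string EscapeForXml(string value)
    {
        return SecurityElement.Escape(value);
    }
}
```

Text escaped this way can be placed inside an element and read back unchanged by any XML parser.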

Reading ReadSubtree()'s entire string before calling XElement.Load()

I'm using the following code to read xml subtrees from a stream (where reader is an XmlReader):
while (reader.Read())
{
using (XmlReader subTreeReader = reader.ReadSubtree())
{
try
{
XElement xmlData = XElement.Load(subTreeReader);
ProcessStreamingEvent(xmlData);
}
catch (XmlException ex)
{
// want to write xml subtree
}
}
}
It works fine most of the time, but once XElement.Load(subTreeReader) throws an exception like
Name cannot begin with the '[' character, hexadecimal value 0x5B. Line
1, position 775.
The message is mostly useless; I need to output the entire string that XElement.Load() tried to parse. How can I do this? In the exception handler there is very little information; subTreeReader.Nodetype is Text, and Value is "spread". Perhaps I need to read the content of subTreeReader before calling XElement.Load()? It seems like subTreeReader should contain the entire string since it knows where the tree ends...?
I tried to replace XElement xmlData = XElement.Load(subTreeReader) with:
string xmlString = subTreeReader.ReadOuterXml();
XElement xmlData = XElement.Load(xmlString);
but xmlString is empty (even when subTreeReader does contain valid xml).
Interesting (and a good thing) that it is not XmlReader.Read() or XmlReader.ReadSubtree() that throws an exception (when the stream may not contain XML), but rather XElement.Load(). It makes me wonder, though, how ReadSubtree() knows when to stop reading when the stream does not contain XML.
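One possible reason for the empty string: ReadOuterXml() returns an empty string unless the reader is positioned on an element (or attribute), and a reader fresh from ReadSubtree() has not been advanced onto anything yet. A sketch of capturing each child element as a string first, under the assumption that the events are child elements of a single root (the helper name is made up). Note that ReadOuterXml() itself will still throw an XmlException if the markup is malformed, so this only helps when the text must be kept for logging before a later step fails.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SubtreeCapture
{
    // Hypothetical helper: returns the outer XML of each child element of the root.
    public static List<string> ReadChildElementsAsStrings(TextReader input)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // position on the root element
            reader.Read();          // step inside the root
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadOuterXml returns the element's markup and advances past it,
                    // so Read() must not be called again in this branch.
                    results.Add(reader.ReadOuterXml());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return results;
    }
}
```

Each returned string can then be passed to XElement.Parse() inside a try/catch, with the string itself written to the log if parsing or processing fails.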

What would be the best way of checking whether a string contains XML tags?

I know that the following would find potential tags, but is there a better way to check if a string contains XML tags to prevent exceptions when reading/writing the string between XML files?
string testWord = "test<a>";
bool foundTag = Regex.IsMatch(testWord, @"^*<*>*$");
I'd use another Regex for that
Regex.IsMatch(testWord, @"<.+?>");
However, even if it does match, there is no guarantee that your file actually is an xml file, as the regex could also match strings like "<<a>" which is invalid, or "a <= b >= c" which is obviously not xml.
You should consider using the XmlDocument class instead.
XmlDocument xmlDoc = new XmlDocument();
try
{
xmlDoc.LoadXml(testWord); // LoadXml parses a string; Load expects a file path or URL
}
catch
{
// not an xml
}
Why don't you HtmlEncode the string before sending it via XML? This way you can avoid difficulties with Regex parsing tags.
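Along the same lines, if the XML is written with LINQ to XML, the encoding happens automatically and the tag check becomes unnecessary. A minimal sketch, with made-up helper names:

```csharp
using System;
using System.Xml.Linq;

public static class SafeXmlText
{
    // Wraps arbitrary text in an element; markup characters are escaped on output.
    public static string ToXmlElement(string elementName, string text)
    {
        return new XElement(elementName, text).ToString();
    }

    // Reads the text back; entity references are decoded by the parser.
    public static string FromXmlElement(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

With this, a value such as "test<a>" round-trips through the XML without being interpreted as a tag.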

How can I strip HTML tags from a string in ASP.NET?

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.
If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
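Put together, the regex steps described above might look like this in C#. This is only a sketch of that recipe (strip tags, collapse whitespace, trim, optionally decode entities) with a made-up class name, and it inherits the limitation about > inside attribute values.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTagStripper
{
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");
    private static readonly Regex Whitespace = new Regex(@"[\s\r\n]+");

    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // 1. Remove anything that looks like a tag (including an unclosed one at the end).
        string withoutTags = Tag.Replace(html, string.Empty);

        // 2. Collapse runs of whitespace into a single space and trim.
        string normalized = Whitespace.Replace(withoutTags, " ").Trim();

        // 3. Optional: turn character entities back into the actual characters.
        return WebUtility.HtmlDecode(normalized);
    }
}
```

Because of step 3 the result may contain < or > again, so it has to be re-encoded before being written back into an HTML page.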
Go download HtmlAgilityPack, now! ;)
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
output += node.InnerText;
}
Regex.Replace(htmlText, "<.*?>", string.Empty);
protected string StripHtml(string Txt)
{
return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function
I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;
I have written a pretty fast method in c# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages are, besides better performance, the ability to replace named and numbered HTML entities (those like &amp; and &#203;), comment block replacement, and more.
Please read the related article on CodeProject.
Thank you.
For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. This can fail on well-formatted HTML, though, so always add a catch with regex as a backup. Note this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
StringBuilder textOnly = new StringBuilder();
using (var reader = XmlNodeReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Text)
textOnly.Append(reader.ReadContentAsString());
}
}
cleaned = textOnly.ToString();
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
string textOnly = string.Empty;
Regex tagRemove = new Regex(@"<[^>]*(>|$)");
Regex compressSpaces = new Regex(@"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
return cleaned;
}
string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);
I've looked at the Regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp;. It returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
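A rough sketch of the blacklist idea, assuming the HtmlAgilityPack NuGet package is referenced. The tag list and the names VisibleText and Extract are illustrative assumptions only; adjust them to your own content.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, as used in the method above

public static class VisibleText
{
    // Assumption: an illustrative blacklist of tags whose content is not readable text.
    private static readonly HashSet<string> Ignored =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "script", "style", "svg", "head", "object" };

    public static string Extract(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Materialize the list first; removing nodes while enumerating is unsafe.
        List<HtmlNode> unwanted = doc.DocumentNode.Descendants()
            .Where(n => Ignored.Contains(n.Name))
            .ToList();
        foreach (HtmlNode node in unwanted)
        {
            node.Remove();
        }

        // DeEntitize turns entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

A whitelist would work the same way with the Where condition inverted, at the cost of having to enumerate every tag you do want.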
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerabilities and how to prevent them - i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc).
For those who are complaining about Michael Tiptop's solution not working, here is the .NET 4+ way of doing it:
public static string StripTags(this string markup)
{
try
{
StringReader sr = new StringReader(markup);
XPathDocument doc;
using (XmlReader xr = XmlReader.Create(sr,
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
// for multiple roots
}))
{
doc = new XPathDocument(xr);
}
return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
// XmlDocument or JavaScript's innerText
}
catch
{
return string.Empty;
}
}
using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);
You can also do this with AngleSharp which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of a HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.
For the second parameter, i.e. keeping some tags, you may need some code like this, using HtmlAgilityPack:
public string StripTags(HtmlNode documentNode, IList keepTags)
{
var result = new StringBuilder();
foreach (var childNode in documentNode.ChildNodes)
{
if (childNode.Name.ToLower() == "#text")
{
result.Append(childNode.InnerText);
}
else
{
if (!keepTags.Contains(childNode.Name.ToLower()))
{
result.Append(StripTags(childNode, keepTags));
}
else
{
result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
}
}
}
return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/
Simply use string.StripHTML();
