parsing XML with ampersand - c#

I have a string which contains XML, I just want to parse it into Xelement, but it has an ampersand. I still have a problem parseing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XMLException.
string encodedXml = test.Replace("&", "&").Replace("<", "<").Replace(">", ">").Replace("\"", """).Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
t
or Even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);

Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&".Replace("&", "&") results in wow&amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other characters, such as <, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&");
The above works, but admittedly it doesn't cover the variety of other characters that start with an ampersand, such as and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&" the decode process would return "&wow&" then re-encoding it would return "&wow&", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, #"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);

Your string doesn't contain valid XML, that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>"

HtmlEncode will not do the trick, it will probably create even more ampersands (for instance, a ' might become ", which is an Xml entity reference, which are the following:
& &
&apos; '
" "
< <
> >
But it might you get things like &nbsp, which is fine in html, but not in Xml. Therefore, like everybody else said, correct the xml first by making sure any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your xml as a variable or text) and that occurs in the entity reference list is translated to their corresponding entity (so < would become <). If the text containing the illegal character is text inside an xml node, you could take the easy way and surround the text with a CDATA element, this won't work for attributes though.

Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;

The ampersant makes the XML invalid. This cannot be fixed by a stylesheet so you need to write code with some other tool or code in VB/C#/PHP/Delphi/Lisp/Etc. to remove it or to translate it to &.

This is the simplest and best approach. Works with all characters and allows to parse XML for any web service call i.e. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
var node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}

If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.

You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you as it will encode your '<' and '>' symbols as well and your string will no longer be an XML.
I think that for this case the best solution would be to replace '&' with '& amp;' (with no space)

Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.

Related

Regex to get values between Double Quotes

I have a value i am pulling from a database
<iframe width="420" height="315" src="//www.youtube.com/embed/8GRDA1gG8R8" frameborder="0" allowfullscreen></iframe>
I am trying to get the src as a value using regex.
Regex.Match(details.Tables["MarketingDetails"].Rows[0]["MarketingVideo"].ToString(), "\\\"([^\\\"]*)\\\"").Groups[2].Value
that is how i am currently writing it
How would I write this to pull the correct value of src?
You could do it like this....
Match match = Regex.Match( #"<iframe width=""420"" height=""315"" src=""//www.youtube.com/embed/8GRDA1gG8R8"" frameborder=""0"" allowfullscreen></iframe>", #"src=(\""[^\""]*\"")");
Console.WriteLine (match.Groups[1].Value);
However, as others have already commented on your question... it's better practice to use an actual html parser.
Don't use regex to parse xml or html. It's not worth it. I'll let you read this post, and it sort of exagerates the point, but the main thing to keep in mind is you can get into a lot of trouble with regex and html.
So, instead you should use an actual html/xml parser! For starters, use XElement, a class built into the .net framework.
string input = "<iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/8GRDA1gG8R8\" frameborder=\"0\" allowfullscreen=''></iframe>";
XElement html = XElement.Parse(input);
string src = html.Attribute("src").Value;
This will make src have the value //www.youtube.com/embed/8GRDA1gG8R8. You can then split that up to get whatever you need from it.
I should also note that your input is not valid xml. allowfullscreen does not have a value attached, which is why I added =''.
If you need to get more complex, such as your input, use an HTML parser (XElement is meant for xml). Use the Html Agility Pack like this (using the previous example):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
string src = doc.DocumentNode
.Element("iframe")
.Attributes["src"]
.Value;
This parser is more forgiving for invalid or incorrect (or just irregular) inputs. This will parse your original input just fine (so missing the ='').

Parsing XML which contains illegal characters

A message I receive from a server contains tags and in the tags is the data I need.
I try to parse the payload as XML but illegal character exceptions are generated.
I also made use of httpUtility and Security Utility to escape the illegal characters, only problem is, it will escape < > which is needed to parse the XML.
My question is, how do I parse XML when the data contained in it contains illegal non XML characters? (& -> amp;)_
Thanks.
Example:
<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>
If you have only & as invalid character, then you can use regex to replace it with &. We use regex to prevent replacement of already existing &, ", o, etc. symbols.
Regex can be as follows:
&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)
Sample code:
string content = #"<item><code>1234 & test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, #"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);
Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.
When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.
If you can't do that, you need to find some way using non-XML tools (e.g. custom perl scripts) to repair the XML before you let it anywhere near an XML parser. The way you do this will depend on the nature of the errors you need to repair.
Here is more generalized solution than Regex. First declare an array, store each invalid character that you want to replace with encoded version into it:
var invalidChars = new [] { '&', other chars comes here.. };
Then read all the xml as a whole text:
var xmlContent = File.ReadAllText("path");
Then replace the invalid chars using LINQ and HttpUtility.HtmlEncode:
var validContent = string.Concat(xmlContent
.Select(x =>
{
if (invalidChars.Contains(x)) return HttpUtility.HtmlEncode(x);
return x.ToString();
}));
Then parse it using XDocument.Parse, that's all.

How to escape xml content in a raw string?

I am getting a string of 'xml' that contains some content that is unescaped. Here is a trivial example:
<link text="This is some text with "potentially" some quoted text in it." linktype="external" anchor="" target="" />
The problem I have is when you try to convert the above as a string using XmlDocument.LoadXml(), LoadXml() throws an exception because of the lack of escaping on the inner quotes for the content held by attribute 'text'. Is there a relatively painless way to escape the content specifically? Or am I just going to have to parse it/escape it/rebuild it myself?
i'm not generating this text, i just get it from another process in a string like this:
"<link text="This is some text with "potentially" some quoted text in it." linktype="external" anchor="" target="" />"
You need to use the html character encoding where " is "
But since your input is a malformed xml text you have to find a way to parse that text and replace the quotes with their encoded translation. Maybe some regex parsing..
Please consider this just a creative way to make the job. I know it's dirty but it will work in most cases:
private static string XmlEncodeQuotes(string target) {
string result = string.Empty;
for (int i = 0; i < target.Length; i++)
{
if (target[i] == '"')
{
if (target[i - 1] != '=')
if (!Regex.IsMatch(target.Substring(i), #"^""\s[a-zA-Z]+="""))
{
result += """;
continue;
}
}
result += target[i];
}
return result;
}
have you tried wrapping the portion of the xml document within a CDATA tag?
Will System.Security.SecurityElement.Escape() work for you? If not, there is an XmlTextWriter as well.
If you're simply asking how to escape a quote, that's done with
"
I'm not sure what you're dealing with, but the root of your problem is the fact that the data you are receiving is malformed.
Option 1) Unless you clean up the data, you will have a hard time getting most parsers to load invalid XML data. Some are more forgiving than others. You might have some luck with the HTML Agility Pack
Option 2) Use Regular Expressions to fix your XML.
Option 3) If coding a parsing solution is not an option use XSLT. Simply create transform and then add a template to fix the issues.

replacing an undefined tags inside an xml string using a regex

i need to replace an undefined tags inside an xml string.
example: <abc> <>sdfsd <dfsdf></abc><def><movie></def> (only <abc> and <def> are defined)
should result with: <abc> <>sdfsd <dfsdf></abc><def><movie><def>
<> and <dfsdf> are not predefined as and and does not have a closing tag.
it must be done with a regex!.
no using xml load and such.
i'm working with C# .Net
Thanks!
How about this:
string s = "<abc> <>sdfsd <dfsdf></abc><def><movie></def>";
string regex = "<(?!/?(?:abc|def)>)|(?<!</?(?:abc|def))>";
string result = Regex.Replace(s, regex, match =>
{
if (match.Value == "<")
return "<";
else
return ">";
});
Console.WriteLine(result);
Result:
<abc> <>sdfsd <dfsdf></abc><def><movie></def>
Also, when tested on your other test case (which by the way I found in a comment on the other question):
<abc>>sdfsdf<<asdada>>asdasd<>asdasd<asdsad>asds<</abc>
I get this result:
<abc>>sdfsdf<<asdada>>asdasd<>asdasd<asdsad>asds<</abc>
Let me guess... this doesn't work for you because you just thought of a new requirement? ;)
it must be done with a regex! no using xml load and such.
I must hammer this nail in with my boot! No using a hammer and such. It's an old story :)
You'll need to supply more information. Are "valid" tags allowed to be nested? Are the "valid" tags likely to change at any point? How robust does this need to be?
Assuming that your list of valid tags isn't going to change at any point, you could do it with a regex substitution:
s/<(?!\/?(your|valid|tags))([^>]*)>/<$1>/g

How do I work with an XML tag within a string?

I'm working in Microsoft Visual C# 2008 Express.
Let's say I have a string and the contents of the string is: "This is my <myTag myTagAttrib="colorize">awesome</myTag> string."
I'm telling myself that I want to do something to the word "awesome" - possibly call a function that does something called "colorize".
What is the best way in C# to go about detecting that this tag exists and getting that attribute? I've worked a little with XElements and such in C#, but mostly to do with reading in and out XML files.
Thanks!
-Adeena
Another solution:
var myString = "This is my <myTag myTagAttrib='colorize'>awesome</myTag> string.";
try
{
var document = XDocument.Parse("<root>" + myString + "</root>");
var matches = ((System.Collections.IEnumerable)document.XPathEvaluate("myTag|myTag2")).Cast<XElement>();
foreach (var element in matches)
{
switch (element.Name.ToString())
{
case "myTag":
//do something with myTag like lookup attribute values and call other methods
break;
case "myTag2":
//do something else with myTag2
break;
}
}
}
catch (Exception e)
{
//string was not not well formed xml
}
I also took into account your comment to Dabblernl where you want parse multiple attributes on multiple elements.
You can extract the XML with a regular expression, load the extracted xml string in a XElement and go from there:
string text=#"This is my<myTag myTagAttrib='colorize'>awesome</myTag> text.";
Match match=Regex.Match(text,#"(<MyTag.*</MyTag>)");
string xml=match.Captures[0].Value;
XElement element=XElement.Parse(xml);
XAttribute attribute=element.Attribute("myTagAttrib");
if(attribute.Value=="colorize") DoSomethingWith(element.Value);// Value=awesome
This code will throw an exception if no MyTag element was found, but that can be remedied by inserting a line of:
if(match.Captures.Count!=0)
{...}
It gets even more interesting if the string could hold more than just the MyTag Tag...
I'm a little confused about your example, because you switch between the string (text content), tags, and attributes. But I think what you want is XPath.
So if your XML stream looks like this:
<adeena/><parent><child x="this is my awesome string">This is another awesome string<child/><adeena/>
You'd use an XPath expression that looks like this to find the attribute:
//child/#x
and one like this to find the text value under the child tag:
//child
I'm a Java developer, so I don't know what XML libraries you'd use to do this. But you'll need a DOM parser to create a W3C Document class instance for you by reading in the XML file and then using XPath to pluck out the values.
There's a good XPath tutorial from the W3C schools if you need it.
UPDATE:
If you're saying that you already have an XML stream as String, then the answer is to not read it from a file but from the String itself. Java has abstractions called InputStream and Reader that handle streams of bytes and chars, respectively. The source can be a file, a string, etc. Check your C# DOM API to see if it has something similar. You'll pass the string to a parser that will give back a DOM object that you can manipulate.
Since the input is not well-formed XML you won't be able to parse it with any of the built in XML libraries. You'd need a regular expression to extract the well-formed piece. You could probably use one of the more forgiving HTML parsers like HtmlAgilityPack on CodePlex.
This is my solution to match any type of xml using Regex:
C# Better way to detect XML?
The XmlTextReader can parse XML fragments with a special constructor which may help in this situation, but I'm not positive about that.
There's an in-depth article here:
http://geekswithblogs.net/kobush/archive/2006/04/20/75717.aspx

Categories

Resources