Loading XML to an XDocument with a URL containing an ampersand - c#

XDocument xd = XDocument.Load("http://www.google.com/ig/api?weather=vilnius&hl=lt");
The ampersand & isn't a supported character in a string containing a URL when calling the Load() method. This error occurs:
XmlException was unhandled: Invalid character in the given encoding
How can you load XML from a URL into an XDocument where the URL has an ampersand in the querystring?

You need to URL-encode it as &:
XDocument xd = XDocument.Load(
"http://www.google.com/ig/api?weather=vilnius&hl=lt");
You might be able to get away with using WebUtility.HtmlEncode to perform this conversion automatically; however, be careful that this is not the intended use of that method.
Edit: The real issue here has nothing to do with the ampersand, but with the way Google is encoding the XML document using a custom encoding and failing to declare it. (Ampersands only need to be encoded when they occur within special contexts, such as the <a href="…" /> element of (X)HTML. Read Ampersands (&'s) in URLs for a quick explanation.)
Since the XML declaration does not specify the encoding, XDocument.Load is internally falling back to default UTF-8 encoding as required by XML specification, which is incompatible with the actual data.
To circumvent this issue, you can fetch the raw data and decode it manually using the sample below. I don’t know whether the encoding really is Windows-1252, so you might need to experiment a bit with other encodings.
string url = "http://www.google.com/ig/api?weather=vilnius&hl=lt";
byte[] data;
using (WebClient webClient = new WebClient())
data = webClient.DownloadData(url);
string str = Encoding.GetEncoding("Windows-1252").GetString(data);
XDocument xd = XDocument.Parse(str);

There is nothing wrong with your code - it is perfectly OK to have & in the query string, and it is how separate parameters are defined.
When you look at the error you'll see that it fails to load XML, not to query it from the Url:
XmlException: Invalid character in the given encoding. Line 1, position 473
which clearly points outside of your query string.
The problem could be "Apsiniaukę" (notice last character) in the XML response...

instead of "&" use "&" or "&" . and it will work fine .

Related

How to create a Xml Document from a fully escaped Xml string?

Question Background:
I have an XML response from a web service (that I am unable to control the content of) that I would like to validate. For example, often the response will have a URL in it that has query string parameters using a "&".
Code:
The following code gives an example of escaping an XML string with illegal characters. This will indeed produce an escaped string:
string xml = "<node>it's my \"node\" & i like it<node>";
string encodedXml = System.Security.SecurityElement.Escape(xml);
// RESULT: <node>it&apos;s my "node" & i like it<node>
If I know attempt to load this escaped XML into a new Xml Document, I will receive an error that the first character of the XML is not valid:
var doc = new XmlDocument();
// Error will occur here.
doc.LoadXml(encodedXml);
Error output:
Data at the root level is invalid. Line 1, position 1.
How do I load this escaped XML into an XML Document object?
This is not a valid XML document:
<node>it&apos;s my "node" & i like it<node>
When you escape the angle brackets on the tags, they are no longer treated as tags by the XML parser. It's all just text in an element -- but there's no element containing it. In XML, there must be a root element. That's a requirement. It may be an arbitrary requirement, and that may be unjust, but you'll never win an argument with a parser.
What you're doing is like giving this to a C# compiler:
string s = \"foo\" bar\";
The outer quotes shouldn't be escaped.
This is what you want:
string xml = "<node>it&apos;s my "node" & i like it</node>";
Note also that your original XML was broken already:
string xml = "<node>it's my \"node\" & i like it<node>";
Your "closing" tag isn't a closing tag. It should be </node>, not <node>.
If you are receiving a response from another web application / API / service, it is likely that the contents are Html encoded.
Take a look at the WebUtility class, particularly, HtmlDecode and UrlDecode. This is likely to convert your "string" data to proper Xml.
If you're receiving valid XML back from the service you can convert the response using something like this:
//...
WebResponse response = request.GetResponse();
XDocument doc = XDocument.Parse
((
new System.IO.StreamReader
(
response.GetResponseStream()
)
).ReadToEnd());
If you're receiving invalid XML from a service which should return valid XML, contact whoever owns/provides that service / raise a support ticket with them in the appropriate way.
Any other action is a hack. Sometimes that may be required (e.g. when you're dealing with a legacy system that's no longer supported with bugs that have never been corrected), but pursue the non-hacky routes first.

Reading xml file with linq and getting error : Invalid character in the given encoding [duplicate]

I faced a problem with reading the XML. The solution was found, but there are still some questions. The incorrect XML file is in encoded in UTF-8 and has appropriate mark in its header. But it also includes a char encoded in UTF-16 - 'é'. This code was used to read XML file for validating its content:
var xDoc = XDocument.Load(taxFile);
It raises exception for specified incorrect XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't raise exception for the incorrect file. But the 'é' character is loaded as �. My first question is "why does it work?".
Another point is using XmlReader doesn't raise exception until the node with 'é' is loaded.
XmlReader xmlTax = XmlReader.Create(filePath);
And again the workout with StreamReader helps. The same question.
It seems like the fix solution is not good enough, cause one day :) XML encoded in another format may appear and it could be proceed in the wrong way. BUT I've tried to process UTF-16 formatted XML file and it worked fine (configured to UTF-8).
The final question is if there are any options to be provided for XDocument/XmlReader to ignore characters encoding or smth like this.
Looking forward for your replies. Thanks in advance
The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without exception with StreamReader, it's because Encoding contains settings to control what happens when incompatible data is encountered
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallbackProperty. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.

KeyNotFoundException with using HtmlEntity.DeEntitize() method

I am currently working on a scraper written in C# 4.0. I use variety of tools, including the built-in WebClient and RegEx features of .NET. For a part of my scraper I am parsing a HTML document using HtmlAgilityPack. I got everything to work as I desired and went through some cleanup of the code.
I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I made a few tests and the method seemed to work great. But when I implemented the method in my code I kept getting KeyNotFoundException. There are no further details so I'm pretty lost. My code looks like this:
WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
The HTML downloaded is UTF-8 encoded. How can I get around the KeyNotFound exception?
I understand that the problem is due to occurrence of non-standard characters. Say, for example, Chinese, Japanese etc.
After you find out that what characters are causing a problem, perhaps you could search for the suitable patch to htmlagilitypack here
This may be of some help to you in case you want to modify the htmlagilitypack source yourself.
Four years later and I have the same problem with some encoded characters (version 1.4.9.5). In my case, there is a limited set of characters that might generate the problem, so I have just created a function to perform the replacements:
// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
var sb = new StringBuilder(str);
//TODO: add other replacements, as needed
return sb.Replace("&period;", ".")
.Replace("&abreve;", "ă")
.Replace("â", "â")
.ToString();
}
In my case, the string contains both html-encoded characters and UTF-8 characters, but the problem is related to some encoded characters only.
This is not an elegant solution, but a quick fix for all those text with a limited (and known) amount of problematic encoded characters.
My HTML had a block of text like so:
... found in sections: 233.9 & 517.3; ...
Despite the spacing and decimal point, it was interpreting & 517.3; as a unicode character.
Simply HTML Encoding the raw text fixed the problem for me.
string raw = "sections: 233.9 & 517.3;";
// turn '&' into '&', etc, before DeEntitizing
string encoded = System.Web.HttpUtility.HtmlEncode(raw);
string deEntitized = HtmlEntity.DeEntitize(encoded);
In my case I have fixed this by updating HtmlAgilityPack to version 1.5.0

XmlDocument.Load() method fails to decode € (euro)

I have an XML document file.xml which is encoded in Iso-latin-15 (aka Iso-Latin-9)
<?xml version="1.0" encoding="iso-8859-15"?>
<root xmlns="http://stackoverflow.com/demo">
<f>€.txt</f>
</root>
From my favorite text editor, I can tell this file is correctly encoded in Iso-Latin-15 (it is not UTF-8).
My software is written in C# and wants to extract the element f.
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("file.xml");
In real life, I have a XMLResolver to set credentials. But basically, my code is as simple as that. The loading goes smoothly, I don't have any exception raised.
Now, my problem when I extract the value:
//xnsm is the XmlNameSpace manager
XmlNode n = xmlDoc.SelectSingleNode("//root/f", xnsm);
if (n != null)
String filename = n.InnerText;
The Visual Studio debugger displays filename = □.txt
It could only be a Visual Studio bug. Unfortunately File.Exists(filename) returns false, whereas the file actually exist.
What's wrong?
If I remember correctly the XmlDocument.Load(string) method always assumes UTF-8, regardless of the XML encoding.
You would have to create a StreamReader with the correct encoding and use that as the parameter.
xmlDoc.Load(new StreamReader(
File.Open("file.xml"),
Encoding.GetEncoding("iso-8859-15")));
EDIT:
I just stumbled across KB308061 from Microsoft. There's an interesting passage:
Specify the encoding declaration in
the XML declaration section of the XML
document. For example, the following
declaration indicates that the
document is in UTF-16 Unicode encoding
format:
<?xml version="1.0" encoding="UTF-16"?>
Note that this declaration only
specifies the encoding format of an
XML document and does not modify or
control the actual encoding format of
the data.
Don't just use the debugger or the console to display the string as a string.
Instead, dump the contents of the string, one character at a time. For example:
foreach (char c in filename)
{
Console.WriteLine("{0}: {1:x4}", c, (int) c);
}
That will show you the real contents of the string, in terms of Unicode code points, instead of being constrained by what the current font can display.
Use the Unicode code charts to look up the characters specified.
Does your xml define its encoding correctly ? encoding="iso-8859-15" .. is that Iso-latin-15
Ideally, you should put your content inside a CDATA element .. so the xml would look like <f><![CDATA[€.txt]]></f>
Ideally, you should also escape all special characters with equivalent url-encoded (or http-encoded) values, because xml typically is for communicating through http.
I dont know the exact escape code for € .. but it would be something of this sort
<f><![CDATA[%3E.txt]]></f>
The above should make € be communicated correctly through the xml.

parsing XML with ampersand

I have a string which contains XML, I just want to parse it into Xelement, but it has an ampersand. I still have a problem parseing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XMLException.
string encodedXml = test.Replace("&", "&").Replace("<", "<").Replace(">", ">").Replace("\"", """).Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
t
or Even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);
Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&".Replace("&", "&") results in wow&amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other characters, such as <, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&");
The above works, but admittedly it doesn't cover the variety of other characters that start with an ampersand, such as and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&" the decode process would return "&wow&" then re-encoding it would return "&wow&", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, #"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);
Your string doesn't contain valid XML, that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>"
HtmlEncode will not do the trick, it will probably create even more ampersands (for instance, a ' might become ", which is an Xml entity reference, which are the following:
& &
&apos; '
" "
< <
> >
But it might you get things like &nbsp, which is fine in html, but not in Xml. Therefore, like everybody else said, correct the xml first by making sure any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your xml as a variable or text) and that occurs in the entity reference list is translated to their corresponding entity (so < would become <). If the text containing the illegal character is text inside an xml node, you could take the easy way and surround the text with a CDATA element, this won't work for attributes though.
Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;
The ampersant makes the XML invalid. This cannot be fixed by a stylesheet so you need to write code with some other tool or code in VB/C#/PHP/Delphi/Lisp/Etc. to remove it or to translate it to &.
This is the simplest and best approach. Works with all characters and allows to parse XML for any web service call i.e. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
var node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.
You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you as it will encode your '<' and '>' symbols as well and your string will no longer be an XML.
I think that for this case the best solution would be to replace '&' with '& amp;' (with no space)
Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.

Categories

Resources