This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 4 years ago.
Here is my code:
XmlDocument rssXmlDoc = new XmlDocument();
// Load the RSS file from the RSS URL
rssXmlDoc.Load("https://polsky.uchicago.edu/events/feed/");
var nsmgr = new XmlNamespaceManager(rssXmlDoc.NameTable);
nsmgr.AddNamespace("event", "http://www.w3.org/1999/XSL/Transform");
// Parse the Items in the RSS file
XmlNodeList rssNodes = rssXmlDoc.SelectNodes("rss/channel/item", nsmgr);
I know that the XML has some elements that contain "&", and I also know that it is really not up to me to fix this bad RSS feed; however, I am not certain if they will comply. Is there anything I can do?
The following exception is thrown:
An error occurred while parsing EntityName. Line 138, position 26.
You can't fix that with an XML parser because it's invalid XML. & isn't allowed without being escaped.
You can, however, read the bad XML in as a string, replace each bare & with &amp;, then process the string with your normal XML parser.
You can also bracket it in CDATA and get on with your life 8-)
PS. If you go with the first method, be sure to check for and handle the other "bad" characters like <>"' (less than, greater than, double quote, single quote)
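A minimal sketch of the first method, assuming the feed text is already in a string. The class and method names (FeedRepair, EscapeBareAmpersands, LoadLenient) are made up for illustration. The regex only rewrites an & that does not already begin something shaped like a character reference, so an existing &amp; or &#38; is left alone:

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class FeedRepair
{
    // Matches an "&" that does NOT start something shaped like a named
    // or numeric character reference (e.g. &amp; or &#38; or &#x26;).
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)");

    // Hypothetical helper: escapes stray ampersands only.
    public static string EscapeBareAmpersands(string rawXml)
    {
        return BareAmpersand.Replace(rawXml, "&amp;");
    }

    // Hypothetical helper: repair the text, then hand it to the normal parser.
    public static XmlDocument LoadLenient(string rawXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(EscapeBareAmpersands(rawXml));
        return doc;
    }
}
```

To use it with a feed, download the text first (for example with HttpClient.GetStringAsync) instead of passing the URL to XmlDocument.Load. Note that this does not help with undeclared named entities such as &nbsp;, which look like valid references to the regex but are still rejected by the parser.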
I use System.Security.SecurityElement.Escape() to take care of "XML encoding" requirements. It works essentially the same way as System.Web.HttpUtility.HtmlEncode.
https://learn.microsoft.com/en-us/dotnet/api/system.security.securityelement.escape
This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 3 years ago.
I have an XML file that users can change, adding different text to certain attributes, and then upload to my tool. The problem is that they sometimes include < and > in the values of the attributes. I want to change those to &lt; and &gt;.
For instance:
<title value="Tuition and fees paid with (Percent<5000) by Gender" />
Loading this causes an error using the following code:
XmlDocument xmldoc = new XmlDocument();
xmldoc.LoadXml(xmlString);
The issue I have is that I need every < and > in the user-generated attribute values to become the entities &lt; and &gt;. The problem is that I cannot just do a .Replace("<", "&lt;") because the actual XML markup needs those characters.
How is this done easily? The code is C#.Net.
Why are you allowing your users to send you invalid XML in the first place? You should deny such input. Isn't there a more suitable format for your users to send this data? Like a list of "key: value" strings?
Anyway you can fix this by your replace method, just make sure you start after the first and stop before the last < and >.
Something like this:
var trimmedXml = xmlString.Trim(); // remove whitespace at either end
// everything between the first '<' and the last '>'
var innerText = trimmedXml.Substring(1, trimmedXml.Length - 2);
innerText = innerText.Replace("<", "&lt;").Replace(">", "&gt;");
xmlString = trimmedXml[0] + innerText + trimmedXml[trimmedXml.Length - 1];
Of course you'll need to validate that the "XML" string at least contains </>.
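If the document has more than one element, a narrower variation is to touch only the text inside quoted attribute values. This is a rough sketch, not a general repair tool: it assumes attribute values are always delimited by double quotes and never contain a double quote themselves, and the names AttributeRepair and EscapeAngleBracketsInAttributes are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class AttributeRepair
{
    // Matches ="..." and captures the text between the double quotes.
    private static readonly Regex QuotedValue = new Regex("=\"([^\"]*)\"");

    // Escapes < and > inside each quoted value; markup outside is untouched.
    public static string EscapeAngleBracketsInAttributes(string xml)
    {
        return QuotedValue.Replace(xml, m =>
            "=\"" + m.Groups[1].Value.Replace("<", "&lt;").Replace(">", "&gt;") + "\"");
    }
}
```

The repaired string can then be passed to XmlDocument.LoadXml as usual, and the parser returns the original < and > when the attribute value is read back.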
This question already has answers here:
Error tolerant XML reader
(5 answers)
Closed 3 years ago.
There are already some posts on parsing XML to JSON, but I have not come across skipping validating XML and properly translating to JSON in C# yet.
I would like to translate (invalid) XML code to JSON using Json.NET. The XML contains special characters such as:
Space in <send to>, slash in <body/content>, ! in <!priority>.
In C# the XDocument.Parse(xmlString) always validates the XML, therefore converting will throw an exception. Decoding/encoding using the HtmlUtility affects the XML tags < and > and I haven't been able to use it. How can I make this work?
Some sample code can be found below.
Input (string):
<root>
<message>
<send to>some@email.com</send to>
<body/content>This is a message!</body/content>
<!priority>high</!priority>
</message>
</root>
Expected output (string):
{
"root": {
"message": {
"send to": "some@email.com",
"body/content": "This is a message!",
"!priority": "high"
}
}
}
Don't treat this as "invalid XML"; treat it as some proprietary syntax completely unrelated to XML. No XML tools are going to help you with this. You first need to define a grammar for the non-XML files, then you need to write a parser for that grammar. Having written that parser, you can either generate JSON directly, or you can generate XML and use an off-the-shelf XML-to-JSON converter.
Alternatively, if you possibly can, stop using proprietary syntax and use standards such as XML and JSON instead. Most people did that 20 years ago, and saved themselves a lot of money in the process.
Question Background:
I have an XML response from a web service (that I am unable to control the content of) that I would like to validate. For example, often the response will have a URL in it that has query string parameters using a "&".
Code:
The following code gives an example of escaping an XML string with illegal characters. This will indeed produce an escaped string:
string xml = "<node>it's my \"node\" & i like it<node>";
string encodedXml = System.Security.SecurityElement.Escape(xml);
// RESULT: &lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it&lt;node&gt;
If I now attempt to load this escaped XML into a new XmlDocument, I receive an error saying that the first character of the XML is not valid:
var doc = new XmlDocument();
// Error will occur here.
doc.LoadXml(encodedXml);
Error output:
Data at the root level is invalid. Line 1, position 1.
How do I load this escaped XML into an XML Document object?
This is not a valid XML document:
&lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it&lt;node&gt;
When you escape the angle brackets on the tags, they are no longer treated as tags by the XML parser. It's all just text in an element -- but there's no element containing it. In XML, there must be a root element. That's a requirement. It may be an arbitrary requirement, and that may be unjust, but you'll never win an argument with a parser.
What you're doing is like giving this to a C# compiler:
string s = \"foo\" bar\";
The outer quotes shouldn't be escaped.
This is what you want:
string xml = "<node>it&apos;s my &quot;node&quot; &amp; i like it</node>";
Note also that your original XML was broken already:
string xml = "<node>it's my \"node\" & i like it<node>";
Your "closing" tag isn't a closing tag. It should be </node>, not <node>.
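One way to get there without hand-writing the entities is to escape only the text content and add the tags afterwards. A small sketch, assuming the element name is known in advance; NodeBuilder and BuildNode are illustrative names only:

```csharp
using System.Security;
using System.Xml;

public static class NodeBuilder
{
    public static XmlDocument BuildNode(string text)
    {
        // Escape only the content; the surrounding tags stay as real markup.
        string xml = "<node>" + SecurityElement.Escape(text) + "</node>";

        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc;
    }
}
```

Reading DocumentElement.InnerText from the result gives back the original, unescaped text.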
If you are receiving a response from another web application / API / service, it is likely that the contents are Html encoded.
Take a look at the WebUtility class, particularly, HtmlDecode and UrlDecode. This is likely to convert your "string" data to proper Xml.
If you're receiving valid XML back from the service you can convert the response using something like this:
//...
WebResponse response = request.GetResponse();
XDocument doc = XDocument.Parse(
new System.IO.StreamReader(response.GetResponseStream()).ReadToEnd());
If you're receiving invalid XML from a service which should return valid XML, contact whoever owns/provides that service / raise a support ticket with them in the appropriate way.
Any other action is a hack. Sometimes that may be required (e.g. when you're dealing with a legacy system that's no longer supported with bugs that have never been corrected), but pursue the non-hacky routes first.
This question already has answers here:
Encoding space character in XML name
(2 answers)
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 5 years ago.
So I have to parse a simple XML file (there is only one level, no attributes, just elements and values) but the problem is that there are (or could be) spaces in the XML. I know that's bad (possibly terrible) practice, but I'm not the one that's building the XML, that's coming from an external library.
example:
<live key>test</live key>
<not live>test</not live>
<Test>hello</Test>
Right now my strategy is to read the XML (I have it as a string) one character at a time and just save each element name and value as I get to it, but that seems a bit too complicated.
Is there any easier way to do it? XmlReader throws an error because it expects well-formed XML: it treats "live" as the element name and "key" as an attribute, so it looks for a "=" and finds a ">" instead.
Unfortunately, the text returned by your library is not well-formed XML, so you cannot use an XML parser to parse it. The spaces in the tags are only part of the problem; there are other issues, for example, the absence of a "root" tag.
Fortunately, a single-level language is trivial enough to be matched with regular expressions. Regex-based "parsers" would be an awful choice for real XML, but this language is not real, so you could use regex at least as a workaround:
Regex rx = new Regex("<([^>\n]*)>(.*?)</(\\1)>");
var m = rx.Match(text);
while (m.Success) {
Console.WriteLine("{0}='{1}'", m.Groups[1], m.Groups[2]);
m = m.NextMatch();
}
The idea behind this approach is to find strings with "opening tags" that match "closing tags" with a slash.
It produces the following output for your input:
live key='test'
not live='test'
Test='hello'
Since it is a flat structure, maybe this could help:
MatchCollection ms = Regex.Matches(xml, @"\<([\w ]+?)\>(.*?)\<\/\1\>");
foreach (Match m in ms)
{
Trace.WriteLine(string.Format("{0} - {1}", m.Groups[1].Value, m.Groups[2].Value));
}
So you get a list of key-value pairs. The trace calls are only there for checking the results.
I am trying to parse the text below in C# with XmlDocument, but I can't load it. It says there are invalid characters. Even the browser doesn't display it correctly and complains about invalid characters. I need to loop through all the elements in this string.
Can someone please advise what's wrong here?
<div><b>Q1.
What is your name?:</b> BTB (Build the bank)</div>
<div><b>Q2.
How old are you?:</b> 29</div>
code is this:
XmlDocument xml = new XmlDocument();
xml.Load(item.Summary);
error is: "Illegal characters in path."
XmlDocument.Load expects a file name to load the xml from. Try LoadXml.
"BTB (Build the bank)" needs to be wrapped in its own tag if this shall be a valid xml. It is valid html though.
Also, xml must have a single top node.
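Putting the LoadXml and single-top-node points together, a sketch could look like the following. The wrapper element name "root" and the FragmentLoader/ReadDivs names are assumptions for illustration, and this only works when the fragment is otherwise well-formed (every tag closed, no HTML-only entities such as &nbsp;):

```csharp
using System.Collections.Generic;
using System.Xml;

public static class FragmentLoader
{
    public static List<string> ReadDivs(string fragment)
    {
        var doc = new XmlDocument();
        // LoadXml parses a string; Load would treat the argument as a path.
        // The hypothetical <root> wrapper supplies the single top-level element.
        doc.LoadXml("<root>" + fragment + "</root>");

        var texts = new List<string>();
        foreach (XmlNode div in doc.SelectNodes("/root/div"))
        {
            texts.Add(div.InnerText); // text of the div, including its <b> child
        }
        return texts;
    }
}
```

In the question's code this would be called as FragmentLoader.ReadDivs(item.Summary), yielding one string per div.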