Handle "XML" with incorrectly encoded HTML entities [duplicate] - c#

This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 3 years ago.
I have an XML file that users can edit, adding text to certain attributes, and then upload to my tool. The problem is that they sometimes include < and > in the attribute values. I want to change those to &lt; and &gt;.
For instance:
<title value="Tuition and fees paid with (Percent<5000) by Gender" />
Loading this causes an error using the following code:
XmlDocument xmldoc = new XmlDocument();
xmldoc.LoadXml(xmlString);
The issue is that every user-generated attribute value needs < and > replaced with their entities, &lt; and &gt;. I cannot simply do a .Replace("<", "&lt;") on the whole string, because the actual XML markup needs those characters.
How is this done easily? The code is C#.Net.

Why are you allowing your users to send you invalid XML in the first place? You should deny such input. Isn't there a more suitable format for your users to send this data? Like a list of "key: value" strings?
Anyway you can fix this by your replace method, just make sure you start after the first and stop before the last < and >.
Something like this:
var trimmedXml = xmlString.Trim(); // remove whitespace at either end
// everything between the first '<' and the last '>'
var innerText = trimmedXml.Substring(1, trimmedXml.Length - 2);
innerText = innerText.Replace("<", "&lt;").Replace(">", "&gt;");
xmlString = trimmedXml[0] + innerText + trimmedXml[trimmedXml.Length - 1];
Of course you'll need to validate that the "XML" string at least contains </>.
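The replace above only works when the whole string is a single element. For documents with several elements, one possible variation is to escape the angle brackets only inside quoted attribute values. The sketch below is not part of the original answer; the class name AttributeSanitizer is hypothetical, and it assumes attribute values are always double-quoted and never contain a double quote themselves.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class AttributeSanitizer
{
    // Matches an equals sign followed by a double-quoted value: ="..."
    // Assumption: values are double-quoted and contain no double quotes.
    private static readonly Regex QuotedValue = new Regex("=\\s*\"([^\"]*)\"");

    // Escapes < and > inside every quoted value, leaving the markup itself alone.
    public static string EscapeAngleBrackets(string xml)
    {
        return QuotedValue.Replace(xml, m =>
            "=\"" + m.Groups[1].Value.Replace("<", "&lt;").Replace(">", "&gt;") + "\"");
    }

    public static void Main()
    {
        string bad = "<title value=\"Tuition and fees paid with (Percent<5000) by Gender\" />";
        string repaired = EscapeAngleBrackets(bad);

        var doc = new XmlDocument();
        doc.LoadXml(repaired); // loads once the value is escaped
        Console.WriteLine(doc.DocumentElement.GetAttribute("value"));
    }
}
```

Text content that happens to contain ="..." would also be rewritten, so this is a workaround for known input, not a general repair tool.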

Related

How to convert invalid XML with special characters to JSON with Json.NET [duplicate]

This question already has answers here:
Error tolerant XML reader
(5 answers)
Closed 3 years ago.
There are already some posts on parsing XML to JSON, but I have not come across skipping validating XML and properly translating to JSON in C# yet.
I would like to translate (invalid) XML code to JSON using Json.NET. The XML contains special characters such as:
Space in <send to>, slash in <body/content>, ! in <!priority>.
In C# the XDocument.Parse(xmlString) always validates the XML, therefore converting will throw an exception. Decoding/encoding using the HtmlUtility affects the XML tags < and > and I haven't been able to use it. How can I make this work?
Some sample code can be found below.
Input (string):
<root>
  <message>
    <send to>some@email.com</send to>
    <body/content>This is a message!</body/content>
    <!priority>high</!priority>
  </message>
</root>
Expected output (string):
{
  "root": {
    "message": {
      "send to": "some@email.com",
      "body/content": "This is a message!",
      "!priority": "high"
    }
  }
}
Don't treat this as "invalid XML"; treat it as some proprietary syntax completely unrelated to XML. No XML tools are going to help you with this. You first need to define a grammar for the non-XML files, then you need to write a parser for that grammar. Having written that parser, you can either generate JSON directly, or you can generate XML and use an off-the-shelf XML-to-JSON converter.
Alternatively, if you possibly can, stop using proprietary syntax and use standards such as XML and JSON instead. Most people did that 20 years ago, and saved themselves a lot of money in the process.
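To make the "define a grammar, then write a parser" advice concrete, here is a minimal sketch under an assumed grammar: a tag is anything between < and >, a closing tag starts with a slash, tag names never start with a slash, there are no attributes, and sibling names are unique. It uses System.Text.Json.Nodes (.NET 6 or later) instead of Json.NET so that it has no external dependency. The class name LooseTagParser and the grammar itself are assumptions, not something the question specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;
using System.Text.RegularExpressions;

public static class LooseTagParser
{
    // A token is a closing tag </name>, an opening tag <name>, or a run of text.
    private static readonly Regex Token =
        new Regex(@"<(?<close>/)?(?<name>[^<>]+)>|(?<text>[^<]+)");

    // One open element: its name, its child elements, and its accumulated text.
    private sealed class Frame
    {
        public string Name = "";
        public JsonObject Children = new JsonObject();
        public string Text = "";
    }

    public static JsonObject Parse(string input)
    {
        var root = new Frame();
        var stack = new Stack<Frame>();
        stack.Push(root);

        foreach (Match m in Token.Matches(input))
        {
            if (m.Groups["text"].Success)
            {
                stack.Peek().Text += m.Groups["text"].Value;
            }
            else if (m.Groups["close"].Success)
            {
                Frame done = stack.Pop();
                if (stack.Count == 0 || done.Name != m.Groups["name"].Value)
                    throw new FormatException("Unexpected closing tag: " + m.Groups["name"].Value);

                // Elements with children become objects; leaves become strings.
                if (done.Children.Count > 0)
                    stack.Peek().Children[done.Name] = done.Children;
                else
                    stack.Peek().Children[done.Name] = JsonValue.Create(done.Text.Trim());
            }
            else
            {
                stack.Push(new Frame { Name = m.Groups["name"].Value });
            }
        }

        if (stack.Count != 1)
            throw new FormatException("Unclosed tag: " + stack.Peek().Name);
        return root.Children;
    }

    public static void Main()
    {
        string input =
            "<root>\n<message>\n<send to>some@email.com</send to>\n" +
            "<body/content>This is a message!</body/content>\n" +
            "<!priority>high</!priority>\n</message>\n</root>";
        Console.WriteLine(Parse(input).ToJsonString());
    }
}
```

Repeated sibling names would overwrite each other here; a real grammar would have to say whether those become arrays.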

Is There anyway to remove Ampersand prior to XmlDocument Load? [duplicate]

This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 4 years ago.
Here is my code:
XmlDocument rssXmlDoc = new XmlDocument();
// Load the RSS file from the RSS URL
rssXmlDoc.Load("https://polsky.uchicago.edu/events/feed/");
var nsmgr = new XmlNamespaceManager(rssXmlDoc.NameTable);
nsmgr.AddNamespace("event", "http://www.w3.org/1999/XSL;Transform");
// Parse the Items in the RSS file
XmlNodeList rssNodes = rssXmlDoc.SelectNodes("rss/channel/item", nsmgr);
I know that the XML has some elements that contain "&", and I also know that it is really not up to me to fix this bad RSS feed; however, I am not certain if they will comply. Is there anything I can do?
The following exception is thrown:
An error occurred while parsing EntityName. Line 138, position 26.
You can't fix that with an XML parser because it's invalid XML. & isn't allowed without being escaped.
You can, however, read the bad XML in as a string, replace each & with &amp;, and then process the string with your normal XML parser.
You can also bracket it in CDATA and get on with your life 8-)
PS. If you go with the first method, be sure to check for and handle the other "bad" characters like <>"' (less than, greater than, double quote, single quote)
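A plain string replace would also turn an already correct &amp; into &amp;amp;. One way around that, sketched below, is a regular expression that only matches ampersands which do not start a predefined entity or a numeric character reference. The class name AmpersandFixer is hypothetical, and the sketch assumes the feed has no CDATA sections or DTD-defined entities, where a bare-looking & can be legitimate.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class AmpersandFixer
{
    // An ampersand NOT followed by a predefined entity name or a numeric reference.
    private static readonly Regex BareAmpersand =
        new Regex("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }

    public static void Main()
    {
        // Stand-in for the downloaded feed text; the real feed content is not shown in the question.
        string bad = "<rss><channel><item><title>Research & Development &amp; more</title></item></channel></rss>";

        var doc = new XmlDocument();
        doc.LoadXml(EscapeBareAmpersands(bad));
        Console.WriteLine(doc.SelectSingleNode("rss/channel/item/title").InnerText);
    }
}
```

For the RSS case, the feed would first have to be downloaded as a string (for example with HttpClient.GetStringAsync) and then passed to LoadXml instead of calling Load with the URL.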
I use System.Security.SecurityElement.Escape() to take care of XML-encoding requirements. It works much like System.Web.HttpUtility.HtmlEncode.
https://learn.microsoft.com/en-us/dotnet/api/system.security.securityelement.escape
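Note that SecurityElement.Escape has to be applied to the text values, not to the whole document, otherwise the markup itself gets escaped as well. A small sketch, with a hypothetical EscapeDemo class and a made-up title element:

```csharp
using System;
using System.Security;
using System.Xml;

public static class EscapeDemo
{
    // Wraps raw text in a <title> element, escaping <, >, &, " and ' first.
    public static string BuildTitle(string rawText)
    {
        return "<title>" + SecurityElement.Escape(rawText) + "</title>";
    }

    public static void Main()
    {
        string raw = "Tom & Jerry <live> \"quoted\" 'single'";
        string xml = BuildTitle(raw);

        var doc = new XmlDocument();
        doc.LoadXml(xml); // parses, because the text was escaped before being embedded
        Console.WriteLine(doc.DocumentElement.InnerText == raw); // round-trip check
    }
}
```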

How to use a less-than sign in an XML document?

I am using C#/.NET and need to load an XML string into an XmlDocument. It usually loads fine, but when one of the nodes contains special values like the following, it fails.
Sometimes the data also contains HTML tags with style and class attributes. How can I load such a string into an XML document?
Here is a string that produces an error:
<restdata>
<listingAddress>
fsdfsdf dfdf <Not Specified=""> Argentina dsfsf</listingAddress>
<listingAddress>
xxk dfsdf 899993
</listingAddress>
</restdata>
In my case the error is probably caused by <Not Specified="">.
Sometimes there may also be HTML tags.
How can this be handled in a general way, so that any data loads correctly?
Generally, if your data contains characters that are reserved in XML, replace them with their encoded entities:
Use &lt; for <
Use &gt; for >
Use &amp; for &
Use &quot; for "
You can find a complete list of them here. If you need to programmatically encode HTML content in C#, you can use the HttpUtility.HtmlEncode() method:
// Your original text
var input = "<a href='http://example-site.com'>This is a link</a>";
// This yields &lt;a href=&#39;http://example-site.com&#39;&gt;This is a link&lt;/a&gt;
var encoded = HttpUtility.HtmlEncode(input);
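If you control the code that builds the XML, another option is to avoid string concatenation and let LINQ to XML do the escaping. The sketch below is an illustration, not part of the original answer: the ListingBuilder class is hypothetical, and the element names are taken from the question's sample.

```csharp
using System;
using System.Xml.Linq;

public static class ListingBuilder
{
    // Builds <restdata> with one <listingAddress> per value.
    // XElement escapes reserved characters in text when the tree is serialized.
    public static XElement Build(params string[] addresses)
    {
        var root = new XElement("restdata");
        foreach (string address in addresses)
        {
            root.Add(new XElement("listingAddress", address));
        }
        return root;
    }

    public static void Main()
    {
        XElement xml = Build("fsdfsdf dfdf <Not Specified=\"\"> Argentina dsfsf", "xxk dfsdf 899993");
        string serialized = xml.ToString();
        Console.WriteLine(serialized);

        // Parsing the serialized form gives back the original text.
        XElement reparsed = XElement.Parse(serialized);
        Console.WriteLine(reparsed.Element("listingAddress").Value);
    }
}
```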

Parsing XML with spaces in element names [duplicate]

This question already has answers here:
Encoding space character in XML name
(2 answers)
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 5 years ago.
So I have to parse a simple XML file (there is only one level, no attributes, just elements and values) but the problem is that there are (or could be) spaces in the XML. I know that's bad (possibly terrible) practice, but I'm not the one that's building the XML, that's coming from an external library.
example:
<live key>test</live key>
<not live>test</not live>
<Test>hello</Test>
Right now my strategy is to read the XML (I have it as a string) one character at a time and just save each element name and value as I get to it, but that seems a bit too complicated.
Is there an easier way to do it? XmlReader throws an error because it expects well-formed XML: it treats "live" as the element name and "key" as an attribute, so it looks for a "=" and finds a ">" instead.
Unfortunately, the text returned by your library is not well-formed XML, so you cannot use an XML parser to parse it. The spaces in the tags are only part of the problem; there are other issues, for example, the absence of the "root" tag.
Fortunately, a single-level language is trivial enough to be matched with regular expressions. Regex-based "parsers" would be an awful choice for real XML, but this language is not real, so you could use regex at least as a workaround:
Regex rx = new Regex("<([^>\n]*)>(.*?)</(\\1)>");
var m = rx.Match(text);
while (m.Success) {
Console.WriteLine("{0}='{1}'", m.Groups[1], m.Groups[2]);
m = m.NextMatch();
}
The idea behind this approach is to find strings with "opening tags" that match "closing tags" with a slash.
For your input, it produces the following output:
live key='test'
not live='test'
Test='hello'
As it is a flat structure maybe that could help:
MatchCollection ms = Regex.Matches(xml, @"\<([\w ]+?)\>(.*?)\<\/\1\>");
foreach (Match m in ms)
{
Trace.WriteLine(string.Format("{0} - {1}", m.Groups[1].Value, m.Groups[2].Value));
}
So you get a list of key-value pairs. The Trace calls are only there to check the results.

Innertext from XElement? [duplicate]

This question already has answers here:
Best way to get InnerXml of an XElement?
(15 answers)
Closed 9 years ago.
I'm having a hard time getting the correct value from the innertext of an XElement.
First, here's the XML that I'm using. This is a copy of our production data that results from a process in our workflow. In other words, I can't change the XML, I can only parse it. The element whose innertext I'd like to get has data inside that looks like XML, but it isn't. It is straight text from the tool that produced the XML. The element is called <creatorshapeutildata>.
Here is the line of code I've tried:
CreatorShapeUtilData = element.Descendants("creatorshapeutildata").Single().Value;
I've also tried this:
CreatorShapeUtilData = element.Descendants("creatorshapeutildata").First().Value;
I've also tried this:
CreatorShapeUtilData = element.Element("creatorshapeutildata").Value;
Unfortunately, the value that gets returned in every case looks like this:
33012-1true#FFFF003#FFFFFF2743337743358
I need the value returned to look like this:
"<creatorData type="object"><type type="int">33012</type>..."
This piece I'm working on is part of a larger program that uses XDocument, XElement, etc. I know an XmlElement has an InnerText property, but I think XElement does not, since I can't seem to find it in Intellisense.
So, is there any possible way to grab the exact text between the creatorshapeutil tags?
You're trying to get the exact opposite of the InnerText / Value properties: the raw XML content.
You can get the content including the outer node by calling element.ToString().
If you want to exclude the outer tag, you can call String.Concat(element.Nodes()).
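A short sketch of the difference between Value and the concatenated child nodes. The sample XML is an assumption pieced together from the fragment quoted in the question, and the wrapper element shape and the class InnerXmlDemo are hypothetical. Passing SaveOptions.DisableFormatting keeps ToString() from indenting nested elements, which matters if you need the text exactly as it appears between the tags.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class InnerXmlDemo
{
    // Serializes every child node and joins them: the content between the outer tags.
    public static string InnerXml(XElement element)
    {
        return string.Concat(
            element.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        XElement element = XElement.Parse(
            "<shape><creatorshapeutildata><creatorData type=\"object\">" +
            "<type type=\"int\">33012</type></creatorData></creatorshapeutildata></shape>");

        XElement data = element.Element("creatorshapeutildata");
        Console.WriteLine(data.Value);     // text only, markup stripped
        Console.WriteLine(InnerXml(data)); // child markup included
    }
}
```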
