Parsing XML with spaces in element names [duplicate] - c#

This question already has answers here:
Encoding space character in XML name
(2 answers)
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 5 years ago.
So I have to parse a simple XML file (there is only one level, no attributes, just elements and values) but the problem is that there are (or could be) spaces in the XML. I know that's bad (possibly terrible) practice, but I'm not the one that's building the XML, that's coming from an external library.
example:
<live key>test</live key>
<not live>test</not live>
<Test>hello</Test>
Right now my strategy is to read the XML (I have it as a string) one character at a time and just save each element name and value as I get to it, but that seems a bit too complicated.
Is there an easier way to do it? XmlReader throws an error because it assumes the XML is well-formed: it treats "live" as the element name and "key" as an attribute, so it looks for a "=" and finds a ">" instead.

Unfortunately, the text returned by your library is not well-formed XML, so you cannot use an XML parser to parse it. The spaces in the tags are only part of the problem; there are other issues, for example the absence of a "root" tag.
Fortunately, a single-level language is trivial enough to be matched with regular expressions. Regex-based "parsers" would be an awful choice for real XML, but this language is not real, so you could use regex at least as a workaround:
Regex rx = new Regex("<([^>\n]*)>(.*?)</(\\1)>");
var m = rx.Match(text);
while (m.Success) {
    Console.WriteLine("{0}='{1}'", m.Groups[1], m.Groups[2]);
    m = m.NextMatch();
}
The idea behind this approach is to find strings with "opening tags" that match "closing tags" with a slash.
Running it on your input produces the following output:
live key='test'
not live='test'
Test='hello'

As it is a flat structure, maybe this could help:
MatchCollection ms = Regex.Matches(xml, @"\<([\w ]+?)\>(.*?)\<\/\1\>");
foreach (Match m in ms)
{
    Trace.WriteLine(string.Format("{0} - {1}", m.Groups[1].Value, m.Groups[2].Value));
}
So you get a list of 'key-value' pairs. The traces are only for checking the results.

Related

Is There anyway to remove Ampersand prior to XmlDocument Load? [duplicate]

This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 4 years ago.
Here is my code:
XmlDocument rssXmlDoc = new XmlDocument();
// Load the RSS file from the RSS URL
rssXmlDoc.Load("https://polsky.uchicago.edu/events/feed/");
var nsmgr = new XmlNamespaceManager(rssXmlDoc.NameTable);
nsmgr.AddNamespace("event", "http://www.w3.org/1999/XSL;Transform");
// Parse the Items in the RSS file
XmlNodeList rssNodes = rssXmlDoc.SelectNodes("rss/channel/item", nsmgr);
I know that the XML has some elements that contain "&", and I also know that it is really not up to me to fix this bad RSS feed; however, I am not certain if they will comply. Is there anything I can do?
The following exception is thrown:
An error occurred while parsing EntityName. Line 138, position 26.
You can't fix that with an XML parser because it's invalid XML. & isn't allowed without being escaped.
You can, however, read in the bad XML as a string, do a string replace of & with &amp;, then process the string with your normal XML parser.
You can also bracket it in CDATA and get on with your life 8-)
PS. If you go with the first method, be sure to check for and handle the other "bad" characters like <>"' (less than, greater than, double quote, single quote)
I use System.Security.SecurityElement.Escape() to take care of "XML encoding" requirements. It works in essentially the same way as System.Web.HttpUtility.HtmlEncode.
https://learn.microsoft.com/en-us/dotnet/api/system.security.securityelement.escape
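The "read it as a string, fix the ampersands, then parse" suggestion could look roughly like the sketch below. The class and method names (LenientXml, LoadEscapingAmpersands) are made up for illustration, and the sketch assumes bare ampersands are the only defect in the feed; the negative lookahead keeps entities that are already escaped from being escaped a second time.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class LenientXml
{
    // Matches an '&' that does NOT already start one of the five predefined
    // XML entities or a numeric character reference.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[0-9a-fA-F]+);)");

    // Hypothetical helper: escapes bare ampersands, then parses the repaired text.
    public static XmlDocument LoadEscapingAmpersands(string rawXml)
    {
        string repaired = BareAmpersand.Replace(rawXml, "&amp;");
        var doc = new XmlDocument();
        doc.LoadXml(repaired); // still throws XmlException if other defects remain
        return doc;
    }
}
```

To use it with a feed, the text would first have to be downloaded as a string (for example with HttpClient) instead of passing the URL to XmlDocument.Load.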

Removing empty elements from xml with regex that matches a sequence twice

I'm looking to remove empty elements from an XML file because the reader expects a value. It's not an xsi:nil="true" element or a self-closing element without content, <Element />, as in "Deserialize Xml with empty elements in C#", but an element where the inner part is simply missing: <Element></Element>
I've tried writing my own code for removing these elements, but my code is too slow and the files too large. The end of every item will also contain this pattern. So the following regex would remove valid xml:
@"<.*></*>"
I need some sort of regex that makes sure the two * patterns match the same text.
So:
<Item><One>1</One><Two></Two><Three>3</Three></Item>
Would change into:
<Item><One>1</One><Three>3</Three></Item>
So the fact that it's all on one line makes this harder, because it means the end of the item comes right after the end of Three, producing the pattern I'd like to look for.
I don't have access to the original data that would allow recreating valid xml.
You want to capture one or more word characters inside <...> and match the closing tag by using a \1 backreference to what was captured by the first group.
<(\w+)></\1>
See demo at regex101
AFAIK there is no need to capture any group, because <a></b> (which a simple regex without capturing would match) is just invalid XML and can't be in your file (unless you're parsing HTML, in which case, even if it can be done, I'd suggest not using regex). Capturing a group is required only if you're matching non-empty nodes, which is not your case.
Note that your regex has a problem (besides the unescaped /): you're matching any character with ., but not every character is allowed in XML tag names. If you absolutely want to use .* then it should be .*? and you should exclude /.
What I would do is to keep regex as simple as possible (still matching valid XML node names or - even better - only what you know is your data input):
<\w+><\/\w+>
You should/may have a better check for the tag name; for example, \s*[\w\d]+\s* may be slightly better, though a regex with fewer steps will perform better on very large files. You may also want to allow an optional newline between the opening and closing tags.
Note that you may need to loop until no more replacements are done if, for example, you have <outer><inner></inner></outer> and you want it to be reduced to an empty string (especially in this case don't forget to compile your regex).
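A minimal sketch of that loop, assuming tag names consist only of \w characters and carry no attributes. The class and method names are hypothetical. It uses the backreference form so that only matching open/close pairs are removed, and it tolerates optional whitespace (including a newline) between the two tags.

```csharp
using System.Text.RegularExpressions;

public static class EmptyElementStripper
{
    // Compiled once and reused. \1 forces the closing tag name to equal the
    // opening one; \s* allows whitespace or a newline between the tags.
    private static readonly Regex EmptyPair =
        new Regex(@"<(\w+)>\s*</\1>", RegexOptions.Compiled);

    // Repeats the replacement until a pass changes nothing, so that elements
    // which become empty after their children are removed are stripped too.
    public static string RemoveEmptyElements(string xml)
    {
        string previous;
        do
        {
            previous = xml;
            xml = EmptyPair.Replace(xml, string.Empty);
        } while (xml != previous);
        return xml;
    }
}
```

With the question's input this returns <Item><One>1</One><Three>3</Three></Item>, and <outer><inner></inner></outer> is reduced to an empty string after two passes.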
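Use LINQ to XML (System.Xml.Linq):
string xml = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";
XElement item = XElement.Parse(xml);
// Rebuild the root from the descendants that have a value (suits a flat, one-level structure).
item = new XElement("Item", item.Descendants().Where(x => x.Value.Length != 0));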

Parsing XML which contains illegal characters

A message I receive from a server contains tags and in the tags is the data I need.
I try to parse the payload as XML but illegal character exceptions are generated.
I also used HttpUtility and the security utility to escape the illegal characters; the only problem is that they also escape < and >, which are needed to parse the XML.
My question is: how do I parse XML when the data in it contains illegal non-XML characters (& -> &amp;)?
Thanks.
Example:
<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>
If & is the only invalid character, you can use a regex to replace it with &amp;. A regex is used so that entities that are already escaped (&amp;, &quot;, numeric references such as &#111;, etc.) are not replaced again.
Regex can be as follows:
&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)
Sample code:
string content = @"<item><code>1234 & test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&amp;", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);
Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.
When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.
If you can't do that, you need to find some way using non-XML tools (e.g. custom perl scripts) to repair the XML before you let it anywhere near an XML parser. The way you do this will depend on the nature of the errors you need to repair.
Here is a more general solution than regex. First declare an array holding each invalid character that you want to replace with its encoded version:
var invalidChars = new[] { '&' /* other chars go here */ };
Then read all the xml as a whole text:
var xmlContent = File.ReadAllText("path");
Then replace the invalid chars using LINQ and HttpUtility.HtmlEncode:
var validContent = string.Concat(xmlContent
    .Select(x =>
    {
        if (invalidChars.Contains(x)) return HttpUtility.HtmlEncode(x);
        return x.ToString();
    }));
Then parse it using XDocument.Parse, that's all.

Regular expression - how to match xml value [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I want to use regular expression to get the airline code between <AirlineCode> and </AirlineCode> tags.
I only want the values of the <AirlineCode> tags that are w/in the <Flight> tags. There are more <AirlineCode> tags outside and I don't want the airline values from them.
I tried w/ the regex below but it's giving me all airline codes regardless of the position consideration mentioned. Please help.
var regex = new Regex(@"<AirlineCode>(.*?)</AirlineCode>", RegexOptions.IgnoreCase);
Match m = regex.Match("<PNRViewRS><AirGroup><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>DL</AirlineCode></Carrier></Flight><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>AA</AirlineCode></Carrier></Flight></AirGroup></PNRViewRS>");
int matchCount = 0;
while (m.Success)
{
    Console.WriteLine("Match" + (++matchCount));
    for (int i = 1; i <= 2; i++)
    {
        Group g = m.Groups[i];
        //do stuff...
    }
    m = m.NextMatch();
}
In general, it's a bad idea to try parsing XML with regular expressions. The reason is that regex is insufficiently expressive, even with back references and such. The questions linked in the comments are worth reading to understand why this is generally a bad idea.
That said, you can be successful if you know for certain the format of your file, and if you're willing to do a little non-regex parsing as well.
In your situation, you have essentially:
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
And you want all of the <AirlineCode> tags that occur within <Flight> tags.
The way to approach this problem is to extract the <Flight> tags and their contents with one regex, and then use another regex to extract the <AirlineCode> tags from those extracted <Flight> tags. Don't try to do it in a single regular expression. You will not succeed.
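The two-step approach might be sketched as follows. The names FlightCodeExtractor and Extract are invented for the example, and the patterns assume that <Flight> elements are never nested or self-closing and that <AirlineCode> carries no attributes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FlightCodeExtractor
{
    // Stage 1: each <Flight ...>...</Flight> block. Lazy quantifier, and
    // Singleline so that '.' also spans newlines inside a block.
    private static readonly Regex FlightBlock = new Regex(
        @"<Flight\b[^>]*>(.*?)</Flight>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Stage 2: airline codes, applied only to the text of one Flight block.
    private static readonly Regex AirlineCode = new Regex(
        @"<AirlineCode>(.*?)</AirlineCode>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static List<string> Extract(string xml)
    {
        var codes = new List<string>();
        foreach (Match flight in FlightBlock.Matches(xml))
        {
            foreach (Match code in AirlineCode.Matches(flight.Groups[1].Value))
            {
                codes.Add(code.Groups[1].Value);
            }
        }
        return codes;
    }
}
```

An <AirlineCode> that sits outside every <Flight> block never reaches the second regex, which is what keeps it out of the result.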
If your data really is that simple, then this will work. I won't say that I recommend this approach. There are too many things that can go wrong. Data formats have a distressing tendency to change, and that fragile regex solution is likely to break if the format changes even a little bit. An XML parser solution will be much more robust.
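For comparison, a parser-based version could look like the sketch below. It assumes the input is well-formed XML without namespaces, as in the question's sample, and the class name is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FlightCodeParser
{
    // Collects the text of every <AirlineCode> that has a <Flight> ancestor.
    // Assumes Flight elements are not nested inside each other; nested ones
    // would make the same code appear more than once.
    public static List<string> Extract(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("Flight")
                  .SelectMany(flight => flight.Descendants("AirlineCode"))
                  .Select(code => code.Value)
                  .ToList();
    }
}
```

Unlike the regex version, this keeps working if attributes are added to <AirlineCode> or the whitespace in the document changes.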

regular expression to eliminate text inside < and > [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Using C# regular expressions to remove HTML tags
I'm trying to write code that will return only the content of an HTML file. The best way I've figured revolves either around eliminating all elements within <...> brackets, or making a list of all text in between >...< brackets. I'm pretty new to regular expressions, but I'm pretty sure they're the way to go.
Here's the code I've tried:
Regex reg = new Regex(@"<.*>");
file = reg.Replace(file, "");
Which works, as long as there is only one <...> before a block of text. With any file that has two or more of those elements in sequence, like <...><...>, it just starts deleting any text it finds. Can someone tell me what I'm doing wrong?
Regexes are greedy by default (they match the longest string they can find). Depending on the language you are using, look for the +? or *? operators, which try the shortest match instead. Otherwise you must build another regex.
Well, the unexpected behavior you're getting is because your regular expression is greedy.
If you change your regex to
Regex reg = new Regex(@"<.*?>");
file = reg.Replace(file, "");
you'll get what you expect.
Also, know that regex doesn't handle nesting, which HTML has a lot of, and I'd avoid using regex to parse HTML unless you're trying to match a very specific thing in a specifically formed piece of HTML.
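The difference between the greedy and the lazy pattern can be seen side by side in a small sketch. The class and method names are illustrative only. The third variant, a negated character class, is a common alternative not mentioned above that also matches tags containing line breaks.

```csharp
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Greedy: .* runs from the first '<' to the LAST '>' on the line.
    public static string StripGreedy(string html) => Regex.Replace(html, @"<.*>", "");

    // Lazy: .*? stops at the first '>' after each '<'.
    public static string StripLazy(string html) => Regex.Replace(html, @"<.*?>", "");

    // Negated class: anything except '>' between the brackets.
    public static string StripNegated(string html) => Regex.Replace(html, @"<[^>]*>", "");
}
```

For the input <p><b>Hello</b> world</p>, StripGreedy returns an empty string, while StripLazy and StripNegated both return Hello world.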
