I have inherited a poorly written web application that seems to have errors when it tries to read in an xml document stored in the database that has an "&" in it. For example there will be a tag with the contents: "Prepaid & Charge". Is there some secret simple thing to do to have it not get an error parsing that character, or am I missing something obvious?
EDIT:
Are there any other characters that will cause this same type of parser error for not being well formed?
The problem is the xml is not well-formed. Properly generated xml would list the data like this:
Prepaid &amp; Charge
I've fixed the same problem before, and I did it with this regex:
Regex badAmpersand = new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");
Combine that with a string constant defined like this:
const string goodAmpersand = "&amp;";
Now you can say badAmpersand.Replace(<your input>, goodAmpersand);
Note that a simple String.Replace("&", "&amp;") isn't good enough, since you can't know in advance for a given document whether any & characters will be coded correctly, incorrectly, or even both in the same document.
The catches here are you have to do this to your xml document before loading it into your parser, which likely means an extra pass through the document. Also, it does not account for ampersands inside of a CDATA section. Finally, it only catches ampersands, not other illegal characters like <. Update: based on the comment, I need to update the expression for hex-coded (&#x...;) entities as well.
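The update mentioned above (leaving hex-coded references alone as well) might look something like the sketch below. The class and method names are made up for illustration, and the looser length limits on entity names and digits are my assumption, not the original expression. Like the original regex, it does nothing about CDATA sections or other illegal characters.

```csharp
using System.Text.RegularExpressions;

public static class AmpersandFixer
{
    // Matches "&" unless it starts a named entity (&amp;), a decimal
    // character reference (&#38;) or a hex character reference (&#x26;).
    private static readonly Regex BadAmpersand = new Regex(
        "&(?![a-zA-Z][a-zA-Z0-9]*;|#[0-9]+;|#[xX][0-9a-fA-F]+;)");

    // Replaces only the bare ampersands; already-encoded ones are left alone.
    public static string Escape(string xml)
    {
        return BadAmpersand.Replace(xml, "&amp;");
    }
}
```

Call AmpersandFixer.Escape on the raw string before handing it to the parser.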
Regarding which characters can cause problems, the actual rules are a little complex. For example, certain characters are allowed in data, but not as the first letter of an element name. And there's no simple list of illegal characters. Instead, large (non-contiguous) swaths of UNICODE are defined as legal, and anything outside that is illegal.
When it comes down to it, you have to trust your document source to have at least a certain amount of compliance and consistency. For example, I've found people are often smart enough to make sure the tags work properly and escape <, even if they don't know that & isn't allowed, hence your problem today. However, the best thing would be to get this fixed at the source.
Oh, and a note about the CDATA suggestion: I use that to make sure xml I'm creating is well-formed, but when dealing with existing xml from outside, I find the regex method easier.
The web application isn't at fault, the XML document is. Ampersands in XML should be encoded as &amp;. Failure to do so is a syntax error.
Edit: in answer to the followup question, yes there are all kinds of similar errors. For example, unbalanced tags, unencoded less-than signs, unquoted attribute values, octets outside of the character encoding and various Unicode oddities, unrecognised entity references, and so on. In order to get any decent XML parser to consume a document, that document must be well-formed. The XML specification requires that a parser encountering a malformed document throw a fatal error.
The other answers are all correct, and I concur with their advice, but let me just add one thing:
PLEASE do not make applications that work with non well-formed XML, it just makes the rest of our lives more difficult :).
Granted, there are times when you really just don't have a choice if you have no control over the other end, but you should really have it throwing a fatal error and complaining very loudly and explicitly about what is broken when such an event occurs.
You could probably take it one step further and say "Ack! This XML is broken in these places and for these reasons, here's how I tried to fix it to make it well-formed: ...".
I'm not overly familiar with the MSXML APIs, but most good XML parsers will allow you to install error handlers so that you can trap the exact line/column number where errors are appearing along with getting the error code and message.
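As a rough illustration in .NET rather than MSXML (so treat the choice of API as an assumption about your stack), XmlException carries the line and column of the failure. The helper name below is hypothetical.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class XmlDiagnostics
{
    // Returns "ok" for well-formed input, otherwise a message that
    // includes where the parser gave up.
    public static string Describe(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return "ok";
        }
        catch (XmlException ex)
        {
            return "line " + ex.LineNumber + ", column " + ex.LinePosition + ": " + ex.Message;
        }
    }
}
```

Logging that string loudly is usually enough to point whoever owns the data at the exact spot that is broken.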
Your database doesn't contain XML documents. It contains some well-formed XML documents and some strings that look like XML to a human.
If it's at all possible, you should fix this - in particular, you should fix whatever process is generating the malformed XML documents. Fixing the program that reads data out of this database is just putting wallpaper over a crack in the wall.
You can replace & with &amp;
Or you might also be able to use CDATA sections.
There are several characters which will cause XML data to be reported as badly-formed.
From w3schools:
Characters like "<" and "&" are illegal in XML elements.
The best solution for input you can't trust to be XML-compliant is to wrap it in CDATA tags, e.g.
<![CDATA[This is my wonderful & great user text]]>
Everything between <![CDATA[ and ]]> is treated as literal character data and is not parsed as markup.
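If you are the one writing the XML from .NET, a sketch like the following lets LINQ to XML emit the CDATA section instead of gluing the delimiters on by hand. The element name "note" and the helper name are made up. Keep in mind that the sequence ]]> cannot appear inside a CDATA section, which is one reason hand-wrapping untrusted text is fragile.

```csharp
using System.Xml.Linq;

public static class CDataExample
{
    // Wraps arbitrary text in an element whose content is a CDATA section.
    public static string Wrap(string userText)
    {
        return new XElement("note", new XCData(userText)).ToString();
    }
}
```

Parsing the result back with XElement.Parse and reading .Value should give you the original text, ampersands and angle brackets included.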
Related
I checked many threads with the same title, but all of them describe a different problem than mine. If I'm wrong, could someone point me to the right thread?
I'm building my own SAX XML parser, along with a test tool that sends an XML file to the parser. The test tool then receives the events fired by the SAX parser, reconstructs the XML from them, and compares the result with the transmitted file to verify that the parser works correctly.
Anyway, I'm using the OASIS test cases, in particular o-p04pass1:
<doc>
<abcdefghijklmnopqrstuvwxyz/>
<ABCDEFGHIJKLMNOPQRSTUVWXYZ/>
<A01234567890/>
<A.-:̀·/>
</doc>
So, to reconstruct <A.-:̀·/>, I'm using the following C# code, and I get the exception System.Xml.XmlException: The ':' character, hexadecimal value 0x3A, cannot be included in a name. on the following line:
currentXMLElement = new XElement(startString[0], "");
startString[0] will contain A.-:̀· which will cause the exception.
My questions are:
- The : is one of the accepted characters in a name according to the latest standard https://www.w3.org/TR/xml/#NT-NameStartChar, so why am I getting the exception?
- Am I using the correct function to reconstruct the XML, and if not, what is the best way to do it?
Thanks,
Mohammed Fawzy
For most people today, "XML" means "XML plus Namespaces", that is, the combination of XML 1.0 plus XML Namespaces 1.0 (or sometimes XML 1.1 plus Namespaces 1.1, though that combination is less popular). XML 1.0 can in theory be used without namespaces, but hardly anyone does so.
So you may well have found a test suite that uses colons in a way that is permitted by XML 1.0 but disallowed by XML Namespaces 1.0, and only you can decide what you want to do with such test cases. Whether your XML parser will accept them depends on what standards it chooses to conform to, and possibly on how it is configured. In the real world (as distinct from the world of conformance testing), documents that conform to XML 1.0 but don't conform to XML Namespaces 1.0 are of little practical interest.
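The difference between the two standards can be seen directly in .NET: XmlConvert.VerifyName checks a name against plain XML 1.0, while XmlConvert.VerifyNCName applies the "no colon" rule from XML Namespaces. The wrapper class below is only an illustration, and its names are hypothetical.

```csharp
using System.Xml;

public static class NameChecks
{
    // True if s is a Name in the XML 1.0 sense (colons allowed).
    public static bool IsXmlName(string s)
    {
        try { XmlConvert.VerifyName(s); return true; }
        catch (XmlException) { return false; }
    }

    // True if s is an NCName, i.e. a Name without any colon,
    // as required for local names under XML Namespaces.
    public static bool IsNCName(string s)
    {
        try { XmlConvert.VerifyNCName(s); return true; }
        catch (XmlException) { return false; }
    }
}
```

A name such as "A.-:x" passes the first check and fails the second, which matches the exception from XElement: LINQ to XML is namespace-aware and expects a colon-free local name.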
I'm trying to use C#'s XmlReader on a large series of XML files. They are all properly formatted except for a select few (unfortunately I'm not in a position to have them changed, because it would break a lot of other code).
The errors only come from one specific part of these offending XML files, and it's OK to just skip that part, but I don't want to stop reading the rest of the XML file.
The bad parts look like this:
<InterestingStuff>
...
<ErrorsHere OptionA|Something = "false" OptionB|SomethingElse = "false"/>
<OtherInterestingStuff>
...
</OtherInterestingStuff>
</InterestingStuff>
So really if I could just ignore invalid tags, or ignore the pipe symbol then I would be ok.
Trying to use XmlReader.Skip() when I see the name "ErrorsHere" doesn't work; apparently the reader has already read a bit ahead and throws the exception.
TLDR: How do I skip so I can read in the XML file above, using the XmlReader?
Edit:
Some people suggested just replacing the '|' symbol, but the point of XmlReader is to avoid loading the entire file and to traverse only the parts you want. Since I'm reading directly from files, I can't afford to read in entire files, replace all instances of '|', and then read parts again :).
I've experimented a bit with this in the past.
In general the input simply has to be well-formed. An XmlReader will go into an unrecoverable error-state when the basic XML rules are broken. It is easy to avoid schema-validation but that's not relevant here.
Your only option is to clean the input, that can be done in a streaming manner (custom Stream or TextReader) but that will require a light form of parsing. If you don't have pipe-symbols in valid positions it's easy.
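For the easy case (no pipe symbols in valid positions), the streaming clean-up could be a thin TextReader wrapper along these lines. This is only a sketch: the class name is invented, and replacing '|' with '_' is an arbitrary choice that also changes the attribute names you will see.

```csharp
using System.IO;

// Wraps another TextReader and turns every '|' into '_' on the fly,
// so the file is never loaded into memory as a whole.
public sealed class PipeReplacingReader : TextReader
{
    private readonly TextReader _inner;

    public PipeReplacingReader(TextReader inner)
    {
        _inner = inner;
    }

    public override int Peek()
    {
        int c = _inner.Peek();
        return c == '|' ? '_' : c;
    }

    public override int Read()
    {
        int c = _inner.Read();
        return c == '|' ? '_' : c;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int n = _inner.Read(buffer, index, count);
        for (int i = index; i < index + n; i++)
        {
            if (buffer[i] == '|') buffer[i] = '_';
        }
        return n;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

You would then pass it to XmlReader.Create, for example XmlReader.Create(new PipeReplacingReader(File.OpenText(path))), and read as usual. If pipes can legitimately occur in text content, this blunt replacement will alter them too, and that is where the "light form of parsing" comes in.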
XmlReader is strict. Any non-conformance will make it throw an error.
So no, you can't do that unless you write your own xml implementation. Fixup on the malformed data is probably easier.
Once I had a similar situation (with HTML files, not XML files). But I ended up using regular expression for each HTML file before entering it into my operation pipeline, to delete malformed parts. It came handy and was easier than struggling with the API. :)
Does anyone have/make/sell an error tolerant XML reader for .NET?
Yeah, I know, XML isn't designed to have errors in it and should be rejected if it's not valid .. blah blah. But sadly the real-world is imperfect and developers do make mistakes and I still want to be able to read their feeds even if I'm missing the odd element here or there because it wasn't encoded properly or had some other error in it. So please, no answers "fix the source" or "reject it".
So, does anyone have a component that can recover and handle common mistakes in XML files?
It's precisely because the real world is imperfect that XML is so widely used. What would be the functional specification for an error-tolerant XML parser? It's an open-ended problem. It's hard enough to parse all variations of well-formed XML without trying to second-guess all possible errors.
[... Waits for downvote.]
Look into HTML parsers, since HTML is almost XML.
Run the XML through Beautiful Soup first. That will clean your XML of errors so it parses correctly
For the specific case of an RSS feed and the specific case of individual corrupt item entries, you can use XmlTextReader to manually read in each item separately, handling the XmlException for invalid items. When an Exception occurs, you'll need to use a new Reader instance, as the original Reader is hosed. You'll still have to have valid <item> and </item> tags to identify each item, but you'll be able to recover from corrupt data within each item.
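Here is a simplified sketch of the same idea. Note that it swaps in a regex split plus XElement.Parse for the per-item XmlTextReader described above, and it assumes the whole feed is already in a string. Names are hypothetical. Items that use namespace prefixes declared on the root element will also be skipped, because each item is parsed in isolation.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public static class TolerantFeed
{
    // Relies on intact <item> ... </item> tags to find item boundaries.
    private static readonly Regex Item =
        new Regex(@"<item[\s>].*?</item>", RegexOptions.Singleline);

    // Parses every item on its own and silently drops the corrupt ones.
    public static List<XElement> ReadGoodItems(string feed)
    {
        var good = new List<XElement>();
        foreach (Match m in Item.Matches(feed))
        {
            try
            {
                good.Add(XElement.Parse(m.Value));
            }
            catch (XmlException)
            {
                // Corrupt item: skip it (or log it) and carry on.
            }
        }
        return good;
    }
}
```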
yes, I know it's old question, but recently I was looking for tolerant xml parser and found the following: XmlParser.
A Roslyn-inspired full-fidelity XML parser with no dependencies and a
simple Visual Studio XML language service.
The parser produces a full-fidelity syntax tree, meaning every
character of the source text is represented in the tree. The tree
covers the entire source text. The parser has no dependencies and can
easily be made portable.
You can add it to your project as a NuGet package. I tried this parser and it could read every XML file I gave it.
In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large amount of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when it comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
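To make the escaping point concrete, here is a small sketch comparing plain concatenation, concatenation with SecurityElement.Escape, and LINQ to XML. The class, method and element names are made up for the example.

```csharp
using System.Security;
using System.Xml;
using System.Xml.Linq;

public static class EscapingDemo
{
    // Plain concatenation: breaks as soon as the value contains & or <.
    public static string Concatenated(string value)
    {
        return "<type>" + value + "</type>";
    }

    // Concatenation with manual escaping: works, if you never forget it.
    public static string Escaped(string value)
    {
        return "<type>" + SecurityElement.Escape(value) + "</type>";
    }

    // LINQ to XML: escaping is handled by the library.
    public static string Built(string value)
    {
        return new XElement("type", value).ToString();
    }

    public static bool IsWellFormed(string xml)
    {
        try { XElement.Parse(xml); return true; }
        catch (XmlException) { return false; }
    }
}
```

With the value "Prepaid & Charge", only the first variant fails IsWellFormed, and as noted above you only find out when something parses it again.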
I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
XDocument doc = new XDocument(new XDeclaration("1.0","UTF-8","yes"),
new XElement("products", from p in collection
select new XElement("product",
new XAttribute("guid", p.ProductId),
new XAttribute("title", p.Title),
new XAttribute("version", p.Version))));
Can you find a way to do it easier than this? I can output it to a browser, save it to a document, add attributes/elements in seconds and so on ... just by adding couple lines of code. I can do practically everything with it without much of effort.
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
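One way to avoid the declared-versus-actual encoding mismatch is to let XmlWriter produce both the declaration and the bytes from the same Encoding object. The sketch below assumes you write to a stream; the class and element names are hypothetical.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class EncodedXml
{
    // Writes a one-element document. The encoding named in the XML
    // declaration and the bytes on the stream both come from 'encoding'.
    public static byte[] Write(string text, Encoding encoding)
    {
        var settings = new XmlWriterSettings { Encoding = encoding };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();
                writer.WriteElementString("value", text);
                writer.WriteEndDocument();
            }
            return stream.ToArray();
        }
    }
}
```

Loading the resulting bytes with XDocument.Load should give back the original text whether you passed Encoding.UTF8 or Encoding.Unicode, which is much harder to guarantee when the declaration is a hand-typed string.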
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
I think the problem is that you're treating the XML file as a plain text file you write strings into, rather than as a logical data store.
Obviously those libraries do the string manipulation for you, but reading and writing XML should be treated more like saving data into a database or something logically similar.
If you need trivial XML then it's fine. It's just that the maintainability of string concatenation breaks down when the XML becomes larger or more complex. You pay either at development or at maintenance time. The choice is always yours, but history suggests that maintenance is always more costly, so anything that makes it easier is generally worthwhile.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In @Sander's example of Linq-to-XML for example, it's clear to what parent element the "product" element belongs, to what element the "title" attribute applies, etc.
As you said, it's just awkward to build XML correctly using string concatenation, especially now that you have LINQ to XML, which allows for simple construction of an XML graph and will get namespaces, etc. correct.
Obviously context and how it is being used matters, such as in the logging example string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating an XML to be more of a chore than reading in one. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it in using an XMLReader tree. And if the XML file doesn't read as valid, I go back and find the problem within my saving routines and correct it. I refuse to perform mission-critical work until I have a working save/load system and know my tools are solid.
I guess it comes down to programmer preference. Sure, there are different ways of doing things, for sure, but for developing/testing/researching/debugging, this would be fine. However I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of the fact you're using StringBuilder or XMLNodes to save/read your file, if it is all gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.
I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?
I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
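HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(html); // html holds the page source as a string
// SelectNodes returns null when nothing matches, so guard before counting.
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath")?.Count ?? 0;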
Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
UPDATE
At the time of this update I've received 15 up and 9 downvotes. I think that maybe people aren't reading the question nor the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items then there is no way I would recommend regex as I stated at the beginning, it's problematic at best.
For dealing with HTML of all shapes and sizes I prefer the HTML Agility Pack (http://www.codeplex.com/htmlagilitypack). It lets you write XPath queries against the nodes you want and get the results back as a collection.
Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php
There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.
I don't think regexes are an ideal solution for HTML, since HTML is not context-free. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.
It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML, or act as an XmlReader:
Here are three good tools:
TagSoup, an open-source program, is a Java and SAX - based tool, developed by John Cowan. This is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built in support for HTML). A command line utility is also provided which outputs the well formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.
Reading its code would be a great learning exercise for every one of us.
From the description:
"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)
The one argument form is equivalent to
d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())
Parses the string as HTML and/or XML using some inbuilt heuristics to
control implied opening and closing of elements.
It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().
Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace) unless the input has explicit namespace declarations, in
which case these will be honoured.
Attribute names are lowercased if html-mode=true()"
Read a more detailed description here.
Hope this helped.
Cheers,
Dimitre Novatchev.
I agree with Chris Lively, because HTML is often not very well formed you probably are best off with a regular expression for this.
href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']
From here on RegExLib should get you started
You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:
Do you know if the "href" text will always be lower case?
Do you know if it will always use double quotes, single quotes, or nothing around the url?
Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
Is it possible to work with a document where the content describes html features (IE: href= could also be in the document and not belong to an anchor tag)?
What else can you tell us about the document?
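Depending on the answers, a deliberately simple regex might be enough. The sketch below makes its assumptions explicit: "href" may be in any case, the value may be double-quoted, single-quoted or unquoted, and no attempt is made to check that the match sits inside an anchor tag or that the value is a valid URL. All names are invented for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Three alternatives: "value", 'value', or an unquoted run of
    // characters up to whitespace or '>'.
    private static readonly Regex Href = new Regex(
        @"href\s*=\s*(?:""(?<u>[^""]*)""|'(?<u>[^']*)'|(?<u>[^\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns every href value found, in document order.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            urls.Add(m.Groups["u"].Value);
        }
        return urls;
    }
}
```

Anything that looks like href= in plain text or in a script will also be picked up, which is exactly the kind of case the questions above are meant to rule out.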
I've linked some code here that will let you use "LINQ to HTML"...
Looking for C# HTML parser