Removing and  to get valid XML? - c#

I have a WCF service (.NET C#) that sometimes returns for example 
 and  which is not correct XML.
I guess I could build a translator that are applied on each string field before sending response but it feels a bit sketchy, I do not know what to look for(more then the above) or what to translate it into. Maybe there is a existing solution for this?

These characters are allowed in XML 1.1 but not in XML 1.0. XML 1.1 has not been a great success and Microsoft has never supported it.
Does the XML declaration at the start of the file say version="1.1"?
A clean way to handle this would be to process the file using a parser that does support XML 1.1, converting it to XML 1.0 in the process. For example, you could do this with a simple Java SAX application, or XSLT if you prefer.
Quite what you want to translate these characters into is largely up to you. It depends whether they have any significance. If you want to translate them losslessly into XML 1.0, you could convert them to processing instructions such as <?char x1E?>.

&#xD stands for a new line (the way i know).
The behavoir is the same as Environment.NewLine. So you can replace it easily:
string text = yourString.ToString().Replace("&#xD", Environment.NewLine).
Dont know if this is what you're searchin for, but thats the only thing thats in my mind right now.
Hope it helps. :)

Related

How to use decimal entity codes in XML instead of hexadecimal entity codes

In my C# application, I am reading data from a source which has \r \n characters, and converting them to an XmlDocument. When using the CreateElement method of XmlDocument, it escapes them using hexadecimal entity codes, like 
 and
.
I have to send this XML to a 3rd party application, which accepts only decimal entity codes. So I have to send as 
 and
How can I configure XmlDocument to use decimal entity codes?
As soon as receiver app is crooked, the easies way is to introduce post-processing step that would bake your XML string to "acceptable" format. So, string.Replace() should help you here for sure. Not efficient, very effective. Sad but true.
When you have a receiving application that isn't accepting correct XML, the best thing to do is to change the receiving application. (Similarly, if you have a sending application that isn't sending correct XML, it's best to change the sending application). This is what standards are all about: if you want to get the benefits of XML, everyone has to play by the rules.
Producing constrained XML is generally going to be difficult and costly, because it constrains your choice of tools, or you have to write stuff by hand rather than using off-the-shelf software.
I don't think there are many tools that give you control over how numeric character references are written, but one that does is Saxon (PE or higher). You can use the extension property saxon:character-representation as an additional serialization property: see http://www.saxonica.com/documentation/index.html#!extensions/output-extras/serialization-parameters.

What's so bad about building XML with string concatenation?

In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large amount of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when to comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
XDocument doc = new XDocument(new XDeclaration("1.0","UTF-8","yes"),
new XElement("products", from p in collection
select new XElement("product",
new XAttribute("guid", p.ProductId),
new XAttribute("title", p.Title),
new XAttribute("version", p.Version))));
Can you find a way to do it easier than this? I can output it to a browser, save it to a document, add attributes/elements in seconds and so on ... just by adding couple lines of code. I can do practically everything with it without much of effort.
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et. al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
I think that the problem is that you aren't watching the xml file as a logical data storage thing, but as a simple textfile where you write strings.
It's obvious that those libraries do string manipulation for you, but reading/writing xml should be something similar to saving datas into a database or something logically similar
If you need trivial XML then it's fine. Its just the maintainability of string concatenation breaks down when the xml becomes larger or more complex. You pay either at development or at maintenance time. The choice is yours always - but history suggests the maintenance is always more costly and thus anything that makes it easier is worthwhile generally.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In #Sander's example of Linq-to-XML for example, it's clear to what parent element the "product" element belongs, to what element the "title" attribute applies, etc.
As you said, it's just awkward to build XML correct using string concatenation, especially now you have XML linq that allows for simple construction of an XML graph and will get namespaces, etc correct.
Obviously context and how it is being used matters, such as in the logging example string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating an XML to be more of a chore than reading in one. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it in using an XMLReader tree. And if the XML file doesn't read as valid, I go back and find the problem within my saving routines and corret it. But until I get a working save/load system, I refuse to perform mission-critical work until I know my tools are solid.
I guess it comes down to programmer preference. Sure, there are different ways of doing things, for sure, but for developing/testing/researching/debugging, this would be fine. However I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of the fact you're using StringBuilder or XMLNodes to save/read your file, if it is all gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.

Save attribute value of xml element with single quotes using linq to xml

How do I make the XDocument object save an attribute value of a element with single quotes?
I'm not sure that any of the formatting options for LINQ to XML allow you to specify that. Why do you need to? It's a pretty poor kind of XML handler which is going to care about it...
As long as you use single- and double-quotes in matched pairs and with correct nesting, standards-compliant XML processors won't care which style you use. Your question suggests that you are intending to process your XML output with tools that are not standards-compliant (or perhaps even not XML-aware). This is a dicey proposition at best, though I recognize that work situations and customer demands may not always give you the options of working with the right tools. I have co-workers who use sed and grep to sift through and modify XML files, and they often can get away with that. But if you have any choice at all, I recommend that you handle XML files with XML-aware tools all along the pipeline up to the point where the data is no longer marked up in XML. Doing otherwise will result in systems that are much more fragile than if you used XML-aware tools for all XML processing.
If you can't do that, then JacobE's suggestion is probably your best bet.
If it is absolutely necessary to have single quotes you could write your XML document to a string and then use a string replace to change from single to double quotes.

C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?
I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.
Are regular expressions the best way to achieve what I'm trying to accomplish?
I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).
HtmlDocument yourDoc = // load your HTML;
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath").Count;
Regular expressions are one way to do it, but it can be problematic.
Most HTML pages can't be parsed using standard html techniques because, as you've found out, most don't validate.
You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.
UPDATE
At the time of this update I've received 15 up and 9 downvotes. I think that maybe people aren't reading the question nor the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items then there is no way I would recommend regex as I stated at the beginning, it's problematic at best.
For dealing with HTML of all shapes and sizes I prefer to use the HTMLAgility pack # http://www.codeplex.com/htmlagilitypack it lets you write XPaths against the nodes you want and get those return in a collection.
Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php
There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.
I don't think regexes are an ideal solution for HTML, since HTML is not context-free. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.
It is always better, if possible not to rediscover the wheel. Some good tools exist that either convert HTML to well-formed XML, or act as an XmlReader:
Here are three good tools:
TagSoup, an open-source program, is a Java and SAX - based tool, developed by John Cowan. This is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft's Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built in support for HTML). A command line utility is also provided which outputs the well formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.
Reading its code would be a great learning exercise for everyone of us.
From the description:
"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)
The one argument form is equivalent to)
d:htmlparse(string,'http://ww.w3.org/1999/xhtml',true()))
Parses the string as HTML and/or XML using some inbuilt heuristics to)
control implied opening and closing of elements.
It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().
Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace unless the input has explict namespace declarations, in
which case these will be honoured.
Attribute names are lowercased if html-mode=true()"
Read a more detailed description here.
Hope this helped.
Cheers,
Dimitre Novatchev.
I agree with Chris Lively, because HTML is often not very well formed you probably are best off with a regular expression for this.
href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']
From here on RegExLib should get you started
You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:
Do you know if the "href" text will always be lower case?
Do you know if it will always use double quotes, single quotes, or nothing around the url?
Is it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
Is it possible to work with a document where the content describes html features (IE: href= could also be in the document and not belong to an anchor tag)?
What else can you tell us about the document?
I've linked some code here that will let you use "LINQ to HTML"...
Looking for C# HTML parser

Mapping internal data elements to external vendors' XML schema

I'm considering Altova MapForce (or something similar) to produce either XSLT and/or a Java or C# class to do the translation. Today, we pull data right out of the database and manually build an XML string that we post to a webservice.
Should it be db -> (internal)XML -> XSLT -> (External)XML? What do you folks do out there in the wide world?
I would use one of the out-of-the-box XML serialization classes to do your internal XML generation, and then use XSLT to transform to the external XML. You might generate a schema as well to enforce that the translation code (whatever will drive your XSLT translation) continues to get the XML it is expecting for translation in case of changes to the object breaks things.
There are a number of XSLT editors on the market that will help you do the mappings, but I prefer to just use a regular XML editor.
ya, I think you're heading down the right path with MapForce. If you don't want to write code to preform the actual transformation, MapForce can do that for you also. THis may be better long term b/c it's less code to maintain.
Steer clear of more expensive options (e.g. BizTalk) unless you really need to B2B integration and orchestration.
What database are you using? Oracle has some nice XML mapping tools. There are some Java binding tools (one is http://java.sun.com/developer/technicalArticles/WebServices/jaxb). However, if you have the luxory consider using Ruby which has nice built-in "to_xml" methods.
Tip #1: Avoid all use of XSLT.
The tool support sucks. The resulting solution will be unmaintainable.
Tip #2: Eliminate all unnecessary steps.
Just translate your resultset (assuming you're using JDBC or equiv) to the outbound XML.
Tip #3: Assume all use of a schema-based tool to be incorrect and plan accordingly.
In other words, just fake it. If you have to squirt out some mutant SOAP (redundant, I know) payload just mock up a working SOAP message and then turn it into a template. Velocity doesn't suck.
That said, the best/correct answer, is to use an "XML Writer" style solution. There's a few.
The best is the one I wrote, LOX (Lightweight Objects for XML).
The public API uses a Builder design pattern. Due to some magic under the hood, it's impossible to create malformed XML.
Please note: If XML is the answer, you've asked the wrong question. Sometimes, we're forced against our will to use it in some way. When that happens, it's crucial to use tools which minimize developer effort and improve code maintainability.

Categories

Resources