How to beautify incomplete XML documents

I am looking for a way to beautify incomplete XML documents. Ideally it should handle large inputs too (e.g. 10 MB, or maybe 100 MB).
Incomplete means that the documents are truncated at a random position. Up to that position the XML is syntactically valid. Beautify means adding line breaks and leading spaces between the tags.
In my case this is needed to analyse aborted streams. Without line breaks and indentation the data is really hard for a human to read.
I know there are some editors which can beautify incomplete documents, but I want to integrate the beautifier into my own analysis tool.
Unfortunately I didn't find a discussion or solution for this case.
The NuGet package GuiLabs.Language.Xml by Kirill Osenkov (repository XmlParser) seems to be a useful candidate for building my own beautifier, because it is designed to be error tolerant. Unfortunately there is too little documentation to understand how to use this parser.
Example xml:
<?xml encoding="UTF-8"?><X><B><C>aa</C><B/><A.B><X>bb</X></A.B><A p="pp"/><nn:A>cc</nn:A><D><E>eee</
Expected result as string:
<?xml encoding="UTF-8"?>
<X>
  <B>
    <C>aa</C>
    <B/>
    <A.B>
      <X>bb</X>
    </A.B>
    <A p="pp"/>
    <nn:A>cc</nn:A>
    <D>
      <E>eee</

The error-tolerant "XML" parser of AngleSharp.Xml can be used to parse your sample, although it adds the missing tags. You can then get an XML string representation of the built document and, with the help of the legacy XmlTextReader and XmlTextWriter, which allow you to ignore namespaces, at least indent the markup:
var xml = @"<?xml encoding=""UTF-8""?><X><B><C>aa</C><B/><A.B><X>bb</X></A.B><A p=""pp""/><nn:A>cc</nn:A><D><E>eee</";
var xmlParser = new XmlParser(new XmlParserOptions() { IsSuppressingErrors = true });
var doc = xmlParser.ParseDocument(xml);
Console.WriteLine(doc.ToMarkup());
using (StringReader sr = new StringReader(doc.ToXml()))
{
    using (XmlTextReader xr = new XmlTextReader(sr))
    {
        xr.Namespaces = false;
        using (XmlTextWriter xw = new XmlTextWriter(Console.Out))
        {
            xw.Namespaces = false;
            xw.Formatting = Formatting.Indented;
            xw.WriteNode(xr, false);
        }
    }
}
which gives, for example:
<X>
  <B>
    <C>aa</C>
    <B />
    <A.B>
      <X>bb</X>
    </A.B>
    <A p="pp" />
    <nn:A>cc</nn:A>
    <D>
      <E>eee</E>
    </D>
  </B>
</X>
As your text says "Until this position the XML has a valid syntax", and your comment suggests the errors in your sample are just due to sloppiness, I think it might also be possible to use WriteNode of an XmlWriter with XmlWriterSettings.Indent set to true on a standard XmlReader, as long as you catch the exception the XmlReader throws:
var xml = @"<?xml version=""1.0""?><root><section><p>Paragraph 1.</p><p>Paragraph 2.";
try
{
    using (StringReader sr = new StringReader(xml))
    {
        using (XmlReader xr = XmlReader.Create(sr))
        {
            using (XmlWriter xw = XmlWriter.Create(Console.Out, new XmlWriterSettings() { Indent = true }))
            {
                xw.WriteNode(xr, false);
            }
        }
    }
}
catch (XmlException e)
{
    Console.WriteLine();
    Console.WriteLine("Malformed input XML: {0}", e.Message);
}
gives
<?xml version="1.0"?>
<root>
  <section>
    <p>Paragraph 1.</p>
    <p>Paragraph 2.</p>
  </section>
</root>
Malformed input XML: Unexpected end of file has occurred. The following elements are not closed: p, section, root. Line 1, position 71.
So with WriteNode there is no need to handle every possible Readxxx method and node type and call the corresponding Writexxx method on the XmlWriter in your own code.
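If the closing tags that the writer appends when it is disposed are unwanted (the expected result in the question stops at the truncation point), one possible variation is to flush the writer and copy its output before disposing it. This is only a sketch under assumptions: the names PartialXmlFormatter and Beautify are made up for illustration, the input is assumed to arrive as a TextReader, and I have not measured how it behaves on inputs of 100 MB.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class PartialXmlFormatter
{
    // Hypothetical helper: indents as much of a truncated document as the
    // standard XmlReader accepts and returns it without synthesized end tags.
    public static string Beautify(TextReader input, out string error)
    {
        error = null;
        var sb = new StringBuilder();
        XmlWriter xw = XmlWriter.Create(sb, new XmlWriterSettings { Indent = true });
        try
        {
            using (XmlReader xr = XmlReader.Create(input))
            {
                // Copies nodes until the reader hits the truncation point and throws.
                xw.WriteNode(xr, false);
            }
        }
        catch (XmlException e)
        {
            error = e.Message;
        }

        // Flush pushes the buffered output into the StringBuilder without
        // closing the elements that are still open.
        xw.Flush();
        string partial = sb.ToString();

        // Disposing appends the missing end tags to sb, but the snapshot
        // taken above is not affected by that.
        xw.Dispose();
        return partial;
    }
}
```

Called as PartialXmlFormatter.Beautify(new StringReader(text), out string error), it returns the indented prefix and reports the reader's message in error. Note that a writer created this way may emit its own XML declaration, and for very large inputs writing to a file stream instead of a StringBuilder is probably the better choice.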

Does it have to be C#?
In Java, you should be able to pipe the output of a SAX parser into an indenting serializer by connecting a SAXSource to a StreamResult using an identity transformer, and then just make sure that when the SAX parser aborts, you trap the exception and close the output stream tidily.
I think you can probably do the same thing in C# but not quite as conveniently: coupling the events read from an XmlReader and sending the corresponding events to an XmlWriter is a lot more tedious because you have to write code for each separate kind of event.
If you want a C# solution and you're prepared to install Saxon enterprise edition, you can write a simple streaming transformation:
<transform version="3.0" xmlns="http://www.w3.org/1999/XSL/Transform">
<output method="xml" indent="yes"/>
<mode streamable="yes" on-no-match="shallow-copy"/>
</transform>
invoke it from the Saxon API using XsltTransformer with a Serializer as the destination, and again, catch the exception and flush/close the output stream to which the Serializer is writing.
Using Saxon on Java would be overkill because the identity transformer does this "out of the box".

Related

Namespaces, Schemas, Elements and Attributes in an XmlDocument in .NET

I'm putting this here because I saw a lot of Q&A for XML on StackOverflow while trying to solve my own problems, and figured that once I'd found it, I'd post what I found so when someone else needs some XML help, this might help them.
My goal: To create an XML document that contains the following XML Declaration, Schema & Namespace Information:
<?xml version="1.0" encoding="UTF-8"?>
<abc:abcXML xsi:schemaLocation="urn:abcXML:v12 http://www.test.com/XML/schemas/v12/abcXML_v12.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ase="urn:abcXML:v12">
I'd already done it in Python for a quick prototype using minidom, and it was very simple. I needed to do it in a .NET language though (C#), because that's what the business calls for. I'm quite familiar with C#, but I've always stayed away from processing XML with it because I honestly don't have an in-depth grasp of XML and its guidelines. Today, I had to face my demons.

Here's how I did it:
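The first part is simple enough: create a document, add the XML declaration, and look up the DocumentElement for the root (there's a catch here which I get to later):
XmlDocument xmlDoc = new XmlDocument();
XmlDeclaration xmlDeclaration = xmlDoc.CreateXmlDeclaration("1.0", "UTF-8", null);
XmlElement root = xmlDoc.DocumentElement; // still null for a new, empty document
xmlDoc.InsertBefore(xmlDeclaration, root); // a null reference node appends at the end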
The next part seems simple enough - create an element, give it a prefix, name and URI, then append it to the document. I thought this would work, but it doesn't (this is where the minimal understanding of XML comes into play):
XmlElement abcXML = xmlDoc.CreateElement("ase", "abcXML", "urn:abcXML:r38 http://www.w3.org/2001/XMLSchema-instance");
XmlAttribute xmlAttr = xmlDoc.CreateAttribute("xsi:schemaLocation", "urn:abcXML:v12 http://www.test.com/XML/schemas/v12/abcXML_v12.xsd");
abcXML.AppendChild(xmlAttr);
xmlDoc.AppendChild(abcXML);
I tried to use doc.LoadXml() and doc.CreateDocumentFragment() and write my own declarations. No - I would get "Unexpected end of file". For those interested in XmlDocumentFragment: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmldocumentfragment.innerxml?view=netcore-3.1
This Microsoft article about XML Schemas and Namespaces didn't directly help me: https://learn.microsoft.com/en-us/dotnet/standard/data/xml/including-or-importing-xml-schemas
After doing more reading on XML, and going through the documentation for XmlDocument, XmlElement and XmlAttribute, this is the solution:
XmlElement abcXML = xmlDoc.CreateElement("ase", "abcXML", "urn:abcXML:r38");
XmlAttribute xmlAttr = xmlDoc.CreateAttribute("xsi:schemaLocation", "http://www.w3.org/2001/XMLSchema-instance");
xmlAttr.InnerXml = "urn:abcXML:v12 http://www.test.com/XML/schemas/v12/abcXML_v12.xsd";
abcXML.Attributes.Append(xmlAttr);
xmlDoc.AppendChild(abcXML);
Now you can add elements to your document like so:
XmlElement header = xmlDoc.CreateElement(string.Empty, "Header", string.Empty);
abcXML.AppendChild(header);
To save the document, I used:
xmlDoc.Save(fileLocation);
I compared my output to the sample I had, and the file contents matched. I provided the output to the client, they uploaded it into the application they were using, and it failed: Row 1, Column 1 - Unexpected Character.
I had a suspicion it was encoding, and I was right. Using xmlDoc.Save(fileLocation) is correct, but it generates a UTF-8 file with the Byte Order Mark (BOM) at Row 1, Column 1. The XML parsing function in the application doesn't expect that, so the process failed. To fix that, I used the following method:
Encoding enc = new UTF8Encoding(false); /* This creates a UTF-8 encoding without the BOM */
using (System.IO.TextWriter tw = new System.IO.StreamWriter(filePath, false, enc))
{
    xmlDoc.Save(tw);
}
return true;
I generated the file again, sent it to the client, and it worked first go.
I hope someone finds this to be useful.

For complicated namespaces it is simpler to just parse the XML string. I like using LINQ to XML. Your sample XML is wrong: the namespace prefix is "ase" (not "abc").
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
                "<ase:abcXML xsi:schemaLocation=\"urn:abcXML:v12 http://www.test.com/XML/schemas/v12/abcXML_v12.xsd\"" +
                " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"" +
                " xmlns:ase=\"urn:abcXML:v12\">" +
                "</ase:abcXML>";
            XDocument doc = XDocument.Parse(xml);
            XElement root = doc.Root;
            XNamespace nsAse = root.GetNamespaceOfPrefix("ase");
        }
    }
}

C# Parsing XML in ISO-8859-1

I'm working on a tool for validating XML files grabbed from a mainframe. For reasons beyond my control every XML file is encoded in ISO 8859-1.
<?xml version="1.0" encoding="ISO 8859-1"?>
My C# application uses the System.Xml library to parse the XML and eventually extract the message string contained in one of the child nodes.
If I manually remove the XML declaration line it works just fine. But I'd like to find a solution that doesn't require manual intervention. Are there any elegant approaches to solving this? Thanks in advance.
The exception that is thrown reads:
'System.Xml.XmlException' occurred in System.Xml.dll. System does not support 'ISO 8859-1' encoding. Line 1, position 31
My code is
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(fileLocation); // fileLocation is the path to the XML file

As Jeroen pointed out in a comment, the encoding should be:
<?xml version="1.0" encoding="ISO-8859-1"?>
not:
<?xml version="1.0" encoding="ISO 8859-1"?>
(missing dash -).
You can use a StreamReader with an explicit encoding to read the file anyway:
using (var reader = new StreamReader("//fileLocation", Encoding.GetEncoding("ISO-8859-1")))
{
    var xmlDoc = new XmlDocument();
    xmlDoc.Load(reader);
    // ...
}
(from answer by competent_tech in other thread I linked in an earlier comment).
If you do not want the using statement, I guess you can do:
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml(File.ReadAllText("//fileLocation", Encoding.GetEncoding("ISO-8859-1")));
Instead of XmlDocument, you can use the XDocument class in the namespace System.Xml.Linq if you reference the assembly System.Xml.Linq.dll (available since .NET 3.5). It has static methods like Load(Stream) and Parse(string) which you can use as above.
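The XDocument variant could look roughly like this. It is a sketch, not a tested drop-in: Latin1Xml and its Load method are illustrative names, and it uses the XDocument.Load(TextReader) overload so that the StreamReader, not the XML declaration, decides how the bytes are decoded.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class Latin1Xml
{
    // Hypothetical helper: decodes the stream as ISO-8859-1 and parses the
    // resulting characters with LINQ to XML.
    public static XDocument Load(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.GetEncoding("ISO-8859-1")))
        {
            return XDocument.Load(reader);
        }
    }
}
```

For a file on disk this would be called as Latin1Xml.Load(File.OpenRead(path)), after which doc.Root and the usual LINQ to XML queries are available.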

XML Deserialize with UTF-8 encoding

I have searched a lot today and can't find how to deserialize with UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
<AvailabilityRequestV2 xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    siteid="0000"
    apikey="0000"
    async="false" waittime="0">
  <Type>4</Type>
  <Id>159266</Id>
  <Radius>0</Radius>
  <Latitude>0</Latitude>
  <Longitude>0</Longitude>
</AvailabilityRequestV2>
If I try this
string xmlString = File above;
XmlSerializer serializer = new XmlSerializer(typeof(AvailabilityRequestV2));
AvailabilityRequestV2 request = (AvailabilityRequestV2)serializer.Deserialize(
new MemoryStream(Encoding.UTF8.GetBytes(xmlString)));
If I hover the mouse over request in the debugger, I get this:
{<?xml version="1.0" encoding="utf-16"?><AvailabilityRequestV2
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
..................
How can I force it to be UTF-8?
I only found how to do this when serializing, not when deserializing.

You can use a StreamReader and specify UTF-8; you can also tell it to detect the encoding from the BOM if present:
using (StreamReader reader = new StreamReader("my.xml", Encoding.UTF8, true))
{
    XmlSerializer serializer = new XmlSerializer(typeof(SomeType));
    object result = serializer.Deserialize(reader);
}
However, I'm unsure what happens when the XML reader encounters the encoding="utf-16" directive within the XML; it may switch over.

Once you have slurped the contents of a file into a .Net/CLR string, it is UTF-16 encoded: it has been transformed from its original source encoding. The CLR uses UTF-16 internally—hence the reason for a char being 16 bits.
As a result, the encoding specified in the document's [original] XML Declaration is now at odds with the actual encoding of the document.
Best to pass a StreamReader as recommended by @Lloyd above.
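If the XML is already in a string, a StringReader can be handed to the serializer in the same way, which avoids converting the string back to bytes. A minimal sketch: the AvailabilityRequestV2 class below is a trimmed-down stand-in that maps only two members (the real class is not shown in the question), and RequestParser is an illustrative name.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Stand-in for the real type; only siteid and Id are mapped here.
public class AvailabilityRequestV2
{
    [XmlAttribute("siteid")]
    public string SiteId { get; set; }

    public int Id { get; set; }
}

public static class RequestParser
{
    // Hypothetical helper: deserializes from characters that are already
    // decoded, so no byte-level encoding has to be chosen at this point.
    public static AvailabilityRequestV2 FromString(string xml)
    {
        var serializer = new XmlSerializer(typeof(AvailabilityRequestV2));
        using (var reader = new StringReader(xml))
        {
            return (AvailabilityRequestV2)serializer.Deserialize(reader);
        }
    }
}
```

Elements and attributes that the stand-in class does not map are skipped by the serializer by default, so the full request from the question can still be passed in.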

I think the example from @Lloyd needs the new keyword:
using (StreamReader reader = new StreamReader("my.xml",Encoding.UTF8,true)) {

Remove whitespace in self closing tags when writing xml document

When writing out an xml document I need to write all self closing tags without any whitespace, for example:
<foo/>
instead of:
<foo />
The reason for this is that a vendor system that I'm interfacing with throws a fit otherwise. In an ideal world the vendor would fix their system, but I don't bet on that happening any time soon. What's the best way to get an XmlWriter to output the self closing tags without the space?
My current scheme is to do something like:
return xml.Replace(" />", "/>");
Obviously this is far from ideal. Is it possible to subclass the XmlWriter for that one operation? Is there a setting as part of the XmlWriterSettings that I've overlooked?

I don't think there is an option to avoid that one space in a self-closing tag. According to MSDN, for XmlTextWriter:
"When writing an empty element, an additional space is added between tag name and the closing tag, for example <item />. This provides compatibility with older browsers."
However, you can write the <elementName></elementName> syntax instead of the unwanted <elementName />. To do that, use the XmlWriter.WriteFullEndElement method, e.g.:
using System.Xml;
..
static void Main(string[] args)
{
    XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
    xmlWriterSettings.Indent = true;
    xmlWriterSettings.IndentChars = ("\t");
    xmlWriterSettings.OmitXmlDeclaration = true;
    XmlWriter writer = XmlWriter.Create("example.xml", xmlWriterSettings);
    writer.WriteStartElement("root");
    writer.WriteStartElement("element1");
    writer.WriteEndElement();
    writer.WriteStartElement("element2");
    writer.WriteFullEndElement();
    writer.WriteEndElement();
    writer.WriteEndDocument();
    writer.Close();
}
produces following XML document:
<root>
	<element1 />
	<element2></element2>
</root>

Use a different serializer, for example the Saxon serializer, which also runs on .NET. It so happens that the Saxon serializer does what you want.
It's horrible, of course, to choose products based on accidental behaviour that no self-respecting system would require, but you have to accept reality - if you want to trade with idiots, you have to behave like an idiot.

Try this:
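x.WriteStartElement("my-tag");
// myTagValue is a placeholder variable holding the text content of the tag
if (string.IsNullOrEmpty(myTagValue))
{
    // The tag has no value: write an empty whitespace node instead of text
    x.WriteWhitespace("");
}
else
{
    x.WriteString(myTagValue);
}
x.WriteEndElement();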

Proper name space management in .NET XmlWriter

I use .NET XML technologies quite extensively in my work. One of the things that I like very much is the XSLT engine, more precisely its extensibility. However, there is one little piece which keeps being a source of annoyance. It is nothing major or something we can't live with, but it is preventing us from producing the beautiful XML we would like to produce.
One of the things we do is transform nodes inline and import nodes from one XML document into another.
Sadly, when you save nodes to an XmlTextWriter (actually whatever XmlWriter.Create(Stream) returns), the namespace definitions all get thrown in there, regardless of whether they are necessary or not (i.e. already defined on an ancestor). You get XML like the following:
<root xmlns:abx="http://bladibla">
  <abx:child id="A">
    <grandchild id="B">
      <abx:grandgrandchild xmlns:abx="http://bladibla" />
    </grandchild>
  </abx:child>
</root>
Does anyone have a suggestion as to how to convince .NET to be efficient about its namespace definitions?
PS. As an added bonus I would like to override the default namespace, changing it as I write a node.

Use this code:
using (var writer = XmlWriter.Create("file.xml"))
{
    const string Ns = "http://bladibla";
    const string Prefix = "abx";
    writer.WriteStartDocument();
    writer.WriteStartElement("root");
    // set root namespace
    writer.WriteAttributeString("xmlns", Prefix, null, Ns);
    writer.WriteStartElement(Prefix, "child", Ns);
    writer.WriteAttributeString("id", "A");
    writer.WriteStartElement("grandchild");
    writer.WriteAttributeString("id", "B");
    writer.WriteElementString(Prefix, "grandgrandchild", Ns, null);
    // grandchild
    writer.WriteEndElement();
    // child
    writer.WriteEndElement();
    // root
    writer.WriteEndElement();
    writer.WriteEndDocument();
}
This code produced desired output:
<?xml version="1.0" encoding="utf-8"?>
<root xmlns:abx="http://bladibla">
  <abx:child id="A">
    <grandchild id="B">
      <abx:grandgrandchild />
    </grandchild>
  </abx:child>
</root>

Did you try this?
Dim settings = New XmlWriterSettings With {.Indent = True,
.NamespaceHandling = NamespaceHandling.OmitDuplicates,
.OmitXmlDeclaration = True}
Dim s As New MemoryStream
Using writer = XmlWriter.Create(s, settings)
...
End Using
The interesting part is NamespaceHandling.OmitDuplicates.
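In C#, the same setting could be applied when copying existing markup through a writer. This is a rough sketch rather than a verified fix for every case: NamespaceCleanup and Rewrite are made-up names, and it assumes the redundant declarations repeat exactly the same prefix and namespace URI as the one already in scope.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NamespaceCleanup
{
    // Hypothetical helper: pipes the input through an XmlWriter that is
    // configured to drop namespace declarations already in scope.
    public static string Rewrite(string xml)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            OmitXmlDeclaration = true,
            NamespaceHandling = NamespaceHandling.OmitDuplicates
        };
        var output = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            // Copies the whole document node by node.
            writer.WriteNode(reader, false);
        }
        return output.ToString();
    }
}
```

Passing the sample from the question through NamespaceCleanup.Rewrite should leave the xmlns:abx declaration on the root element only.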

I'm not sure this is what you're looking for, but you can use this kind of code when you start writing to the Xml stream:
myWriter.WriteAttributeString("xmlns", "abx", null, "http://bladibla");
The XmlWriter should remember it and not rewrite it anymore. It may not be 100% bulletproof, but it works most of the time.
