Preserve decimal character entities in XML C# - c#

I use XmlDocument class for loading XML and XmlWriter class for generating the file. I wanted to preserve the decimal character entities (characters in bold) that's present in the xml, like the one below,
<car id="wait for the signal
then proceed">
Tried options like XmlTextReader, but had no luck. After processing the above line in the file looks something like below,
<car id="wait for the signal
then proceed">
or
<car id="wait for the signal
then proceed">
XmlWriter code block i used,
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
Encoding = encoding,
NewLineHandling = NewLineHandling.None
};
XmlDataDocument xmlDataDocument = new XmlDataDocument
{
PreserveWhitespace = true
};
xmlDataDocument.LoadXml(xmlString);
using (XmlWriter writer = XmlWriter.Create(filepath, xmlWriterSettings))
{
if (writer != null)
{
xmlDataDocument.Save(writer);
}
}
any help one this is much appreciated.

I am unsure what you are trying to achieve here.
Both
and
are equivalent. They both reference a newline. One in is decimal form, the other is in hexadecimal form. Either will work with the XML parser.
If you read the official XML Documentation, it specifies that either method of referencing is allowed.
Why do you need to preserve the decimal form?

As far as an XML parser is concerned, it will always treat a numeric character reference in exactly the same way as it treats the character itself.
Your only possible way forward is to preprocess the file before the XML parser gets to see it, replacing the & with some other character such as §. And then of course, reverse the process afterwards.

Related

Preserve unicode whitespace characters during .NET XmlSerializer deserialize

I have a problem. I'm deserializing XML-document and performing a schema validation for it. This works fine but if a field value contains a Unicode whitespace character (Tab/U+0009), it will be replaced with a space.
I've narrowed this functionality down to schema validation. Without schema validation, the Unicode characters will be preserved. I believe the XmlReader performs normalization when performing the schema validation.
Here's the simplified code:
var settings = new XmlReaderSettings
{
Schemas = ...,
ValidationType = ValidationType.Schema,
ValidationFlags = XmlSchemaValidationFlags.ReportValidationWarnings
}
// ...
// doc is XDocument
using (var reader = XmlReader.Create(doc.CreateReader(), settings)
{
var deserializedObject = serializer.Deserialize(reader);
}
I found a suggestion to use XmlTextReader manually, but I can't set XmlReaderSettings for the reader that way. XmlTextReader would have a property Normalize that would (probably) fix my issue.
Another suggestion I found was to set XmlReader Normalization private property via reflection, but it seems that it has been removed in later versions as it seems it doesn't exist in .NET 6.
I can't edit the XML document or the schema file.
Any suggestions how to workaround this issue?

Making XmlReaderSettings CheckCharacters work for xml string

I have an xml string coming from Adobe PDF AcroForms, which apparently allows naming form fields starting with numeric characters. I'm trying to parse this string to an XDocument:
XDocument xDocument = XDocument.Parse(xmlString);
But whenever I encounter such a form field where the name starts with a numeric char, the xml parsing throws an XmlException:
Name cannot begin with the 'number' character
Other solutions I found were about using: XmlReaderSettings.CheckCharacters
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString), new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
But this also didn't work. Some articles pointed out the reason as one of the points mentioned in MSDN article:
If the XmlReader is processing text data, it always checks that the
XML names and text content are valid, regardless of the property
setting. Setting CheckCharacters to false turns off character checking
for character entity references.
So I tried using:
using(MemoryStream memoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(xmlString)))
using (XmlReader xmlReader = XmlReader.Create(memoryStream, new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
This also didn't work.
Can any one please help me in figuring out how to parse an xml string that contains xml elements whose name starts with numeric characters?
How is the flag XmlReaderSettings.CheckCharacters supposed to be used?
You can't make standard XML parser parse your format even if it "looks like" XML, stop trying. Standard-compliant XML parsers are disallowed to parse invalid XML. This was a design decision, based on all the problems quirks mode caused with HTML parsing.
Writing your own parser isn't that hard. XML is very strict and, unless you need advanced features, the syntax is simple.
LL parser can be written by hand. Both lexer and parser are simple.
LR parser can be generated using ANTLR and a simple grammar. Most likely, you'll even find example XML garmmars.
You can also just take either of .NET XML parsers' source code and remove validation you don't need. You can find both XmlDocument and XDocument in .NET Core's repository on GitHub.

C# Xml Encoding

I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all... way to much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
categorieSystemNode.RemoveAll();
XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
string import_id = categories.Attributes["nodeName"].Value;
importIdNode.InnerText = import_id;
categorieSystemNode.AppendChild(importIdNode);
[way more Nodes which I proceed like this]
}
newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro><p>Whether your project [...]</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas
The main propose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well formed document.
So, using InnerText as in your example, you let the framework encode the string and properly insert it into that document. Whenever you read that same value, it will be decoded and returned to you exactly as your original string.
But, if you want to add an XML fragment anyways, you should stick with InnerXml or ImportNode. You must be aware that could lead to a more complex document structure, and you probably would like to avoid that.
As a third possibility, you can use the CreateCDataSection to add a CDATA and add your text there.
You definitely should be away from treating that XML document as a string by trying Replace things; stick with the framework and you'll be ok.

Saving XML file from one location to another location using XML DOCUMENT

While saving the existing XML to new location, entities escaped from the content and replaced with Question Mark
See the snaps below entity ‐ (- as Hex) present while reading but its replaced with question mark after saving to another location.
While Reading as Inner XML
While Reading as Inner Text
After Saving XML File
EDIT 1
Below is my code
string path = #"C:\work\myxml.XML";
string pathnew = #"C:\work\myxml_new.XML";
//GetFileEncoding(path);
XmlDocument document = new XmlDocument();
XmlDeclaration xmlDeclaration = document.CreateXmlDeclaration("1.0","US-ASCII",null);
//document.CreateXmlDeclaration("1.0", null, null);
document.Load(path);
string x = document.InnerText;
document.Save(pathnew);
EDIT 2
My source file looks like below. I need to retain the entities as it is
The issue here seems to be the handling of encoding of entity references by the specific XmlWriter implementation internal to XmlDocument.
The issue disappears if you create an XmlWriter yourself - the unsupported character will be correctly encoded as an entity reference. This XmlWriter is a different (and newer) implementation that sets an EncoderFallback that encodes characters as entity references for characters that can't be encoded. Per the remarks in the docs, the default fallback mechanism is to encode a question mark.
var settings = new XmlWriterSettings
{
Indent = true,
Encoding = Encoding.GetEncoding("US-ASCII")
};
using (var writer = XmlWriter.Create(pathnew, settings))
{
document.Save(writer);
}
As an aside, I'd recomment using the LINQ to XML XDocument API, it's much nicer to work with than the old creaky XmlDocument API. And its version of Save doesn't have this problem, either!

Error parsing XML string to XDocument

I have this XML string bn:
<Root><Row><ITEMNO>1</ITEMNO><USED>y</USED><PARTSOURCE>Buy</PARTSOURCE><QTY>2</QTY></Row><Row><ITEMNO>5</ITEMNO><PARTSOURCE>Buy</PARTSOURCE><QTY>5</QTY></Row></Root>
I am trying to convert it to an XDocument like this:
var doc = XDocument.Parse(bn);
However, I get this error:
Data at the root level is invalid. Line 1, position 1.
Am I missing something?
UPDATE:
This is the method I use to create the xml string:
public static string SerializeObjectToXml(Root rt)
{
var memoryStream = new MemoryStream();
var xmlSerializer = new XmlSerializer(typeof(Root));
var xmlTextWriter = new XmlTextWriter(memoryStream, Encoding.UTF8);
xmlSerializer.Serialize(xmlTextWriter, rt);
memoryStream = (MemoryStream)xmlTextWriter.BaseStream;
string xmlString = ByteArrayToStringUtf8(memoryStream.ToArray());
xmlTextWriter.Close();
memoryStream.Close();
memoryStream.Dispose();
return xmlString;
}
It does add to the start that I have to remove. Could I change something to make it correct from the start?
There is two characters at the beginning of your string that, although you can't see them, are still there and make the string fail. Try this instead:
<Root><Row><ITEMNO>1</ITEMNO><USED>y</USED><PARTSOURCE>Buy</PARTSOURCE><QTY>2</QTY></Row><Row><ITEMNO>5</ITEMNO><PARTSOURCE>Buy</PARTSOURCE><QTY>5</QTY></Row></Root>
The character in question is this. This is a byte-order mark, basically telling the program reading it if it's big or little endian. It seems like you copied and pasted this from a file that wasn't decoded properly.
To remove it, you could use this:
yourString.Replace(((char)0xFEFF).ToString(), "")
You have two unprintable characters (Zero-Width No-break Space) at the beginning of your string.
XML does not allow text outside the root element.
The accepted answer does unnecessary string processing, but, in its defense, it's because you're unnecessarily dealing in string when you don't have to. One of the great things about the .NET XML APIs is that they have robust internals. So instead of trying to feed a string to XDocument.Parse, feed a Stream or some type of TextReader to XDocument.Load. This way, you aren't fooling with manually managing the encoding and any problems it creates, because the internals will handle all of that stuff for you. Byte-order marks are a pain in the neck, but if you're dealing in XML, .NET makes it easier to handle them.

Categories

Resources