I have an xml string coming from Adobe PDF AcroForms, which apparently allows naming form fields starting with numeric characters. I'm trying to parse this string to an XDocument:
XDocument xDocument = XDocument.Parse(xmlString);
But whenever I encounter such a form field where the name starts with a numeric char, the xml parsing throws an XmlException:
Name cannot begin with the 'number' character
Other solutions I found were about using: XmlReaderSettings.CheckCharacters
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString), new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
But this also didn't work. Some articles attribute this to the following remark in the MSDN documentation:
If the XmlReader is processing text data, it always checks that the
XML names and text content are valid, regardless of the property
setting. Setting CheckCharacters to false turns off character checking
for character entity references.
So I tried using:
using(MemoryStream memoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(xmlString)))
using (XmlReader xmlReader = XmlReader.Create(memoryStream, new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
This also didn't work.
Can anyone please help me figure out how to parse an xml string that contains xml elements whose names start with numeric characters?
How is the flag XmlReaderSettings.CheckCharacters supposed to be used?
You can't make a standard XML parser parse your format, even if it "looks like" XML, so stop trying. Standards-compliant XML parsers are not allowed to accept malformed XML. This was a design decision, based on all the problems quirks mode caused with HTML parsing.
Writing your own parser isn't that hard. XML is very strict and, unless you need advanced features, the syntax is simple.
An LL parser can be written by hand. Both the lexer and the parser are simple.
A parser can also be generated using ANTLR (which produces LL-style parsers) and a simple grammar. Most likely, you'll even find example XML grammars.
You can also just take the source code of either of the .NET XML parsers and remove the validation you don't need. You can find both XmlDocument and XDocument in .NET Core's repository on GitHub.
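If a full custom parser is more than you need, a different and much cruder technique than the ones above is to repair the names before parsing. The sketch below is only an illustration: `DigitNameFixer` and `ParseLenient` are hypothetical names, the `_` prefix is an arbitrary choice, and it assumes that `<` followed by a digit never appears inside comments or CDATA sections. It does not handle attribute names that start with a digit.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class DigitNameFixer
{
    // Matches the start of an opening or closing tag whose name begins with a digit.
    private static readonly Regex DigitTag = new Regex(@"<(/?)(?=[0-9])");

    // Hypothetical helper: prefixes such names with '_' so the text becomes
    // well-formed XML, then hands it to the standard parser.
    public static XDocument ParseLenient(string xml)
    {
        string repaired = DigitTag.Replace(xml, "<${1}_");
        return XDocument.Parse(repaired);
    }
}
```

After parsing, an element that was named `1stField` in the input is available as `_1stField`, so the prefix has to be stripped again if the original names matter.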
Related
I have a problem. I'm deserializing XML-document and performing a schema validation for it. This works fine but if a field value contains a Unicode whitespace character (Tab/U+0009), it will be replaced with a space.
I've narrowed this functionality down to schema validation. Without schema validation, the Unicode characters will be preserved. I believe the XmlReader performs normalization when performing the schema validation.
Here's the simplified code:
var settings = new XmlReaderSettings
{
Schemas = ...,
ValidationType = ValidationType.Schema,
ValidationFlags = XmlSchemaValidationFlags.ReportValidationWarnings
};
// ...
// doc is XDocument
using (var reader = XmlReader.Create(doc.CreateReader(), settings))
{
var deserializedObject = serializer.Deserialize(reader);
}
I found a suggestion to use XmlTextReader manually, but I can't set XmlReaderSettings for the reader that way. XmlTextReader would have a property Normalize that would (probably) fix my issue.
Another suggestion I found was to set the private Normalization property of XmlReader via reflection, but it seems to have been removed in later versions, since it doesn't exist in .NET 6.
I can't edit the XML document or the schema file.
Any suggestions on how to work around this issue?
I use XmlDocument class for loading XML and XmlWriter class for generating the file. I wanted to preserve the decimal character entities (characters in bold) that's present in the xml, like the one below,
<car id="wait for the signal&#10;then proceed">
I tried options like XmlTextReader, but had no luck. After processing, the above line in the file looks something like this,
<car id="wait for the signal&#xA;then proceed">
or
<car id="wait for the signal
then proceed">
The XmlWriter code block I used,
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
Encoding = encoding,
NewLineHandling = NewLineHandling.None
};
XmlDataDocument xmlDataDocument = new XmlDataDocument
{
PreserveWhitespace = true
};
xmlDataDocument.LoadXml(xmlString);
using (XmlWriter writer = XmlWriter.Create(filepath, xmlWriterSettings))
{
if (writer != null)
{
xmlDataDocument.Save(writer);
}
}
Any help on this is much appreciated.
I am unsure what you are trying to achieve here.
Both &#10; and &#xA; are equivalent. They both reference a newline: one is in decimal form, the other is in hexadecimal form. Either will work with the XML parser.
If you read the XML specification, it states that either form of character reference is allowed.
Why do you need to preserve the decimal form?
As far as an XML parser is concerned, it will always treat a numeric character reference in exactly the same way as it treats the character itself.
Your only possible way forward is to preprocess the file before the XML parser gets to see it, replacing the & with some other character such as §. And then of course, reverse the process afterwards.
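A minimal sketch of that preprocessing idea follows. It is an illustration under assumptions, not a tested solution: `EntityPreserver` and `RoundTrip` are hypothetical names, the placeholder uses the private-use character U+E000 (assumed not to occur in the input) instead of §, and only `&#` is masked so that named entities such as `&amp;` are still parsed normally.

```csharp
using System;
using System.Text;
using System.Xml;

public static class EntityPreserver
{
    // Hypothetical placeholder; assumed not to occur anywhere in the input.
    private const string Placeholder = "\uE000#";

    public static string RoundTrip(string xml)
    {
        // Hide numeric character references so the parser sees plain text.
        string masked = xml.Replace("&#", Placeholder);

        var document = new XmlDocument { PreserveWhitespace = true };
        document.LoadXml(masked);

        // ... modify the document here ...

        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            document.Save(writer);
        }

        // Restore the references exactly as they were written in the input.
        return output.ToString().Replace(Placeholder, "&#");
    }
}
```

Because the parser never sees the references, any code that reads the affected values between loading and saving gets the placeholder text rather than the referenced character.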
I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all... way too much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
categorieSystemNode.RemoveAll();
XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
string import_id = categories.Attributes["nodeName"].Value;
importIdNode.InnerText = import_id;
categorieSystemNode.AppendChild(importIdNode);
[way more Nodes which I proceed like this]
newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
}
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro>&lt;p&gt;Whether your project [...]</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas
The main purpose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well-formed document.
So, by using InnerText as in your example, you let the framework encode the string and properly insert it into that document. Whenever you read that same value, it will be decoded and returned to you exactly as your original string.
But if you want to add an XML fragment anyway, you should stick with InnerXml or ImportNode. Be aware that this could lead to a more complex document structure, which you would probably like to avoid.
As a third possibility, you can use CreateCDataSection to add a CDATA section and put your text there.
You should definitely stay away from treating that XML document as a string and calling Replace on it; stick with the framework and you'll be OK.
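The three options can be compared side by side. This is a small sketch with hypothetical names (`IntroBuilder` and its methods); the `intro` element is borrowed from the question, and the content passed to `AsMarkup` is assumed to be well-formed, since InnerXml throws otherwise.

```csharp
using System;
using System.Xml;

public static class IntroBuilder
{
    private static XmlDocument NewDocument()
    {
        var document = new XmlDocument();
        document.LoadXml("<intro />");
        return document;
    }

    // InnerText: the content is stored as text, so markup characters are escaped on output.
    public static string AsText(string content)
    {
        XmlDocument document = NewDocument();
        document.DocumentElement.InnerText = content;
        return document.OuterXml;
    }

    // InnerXml: the content is parsed and becomes real child nodes.
    public static string AsMarkup(string content)
    {
        XmlDocument document = NewDocument();
        document.DocumentElement.InnerXml = content;
        return document.OuterXml;
    }

    // CDATA: the content is kept verbatim inside a CDATA section.
    public static string AsCData(string content)
    {
        XmlDocument document = NewDocument();
        document.DocumentElement.AppendChild(document.CreateCDataSection(content));
        return document.OuterXml;
    }
}
```

Calling all three with the same string, for example `"<p>Hi</p>"`, makes the difference in the serialized output easy to see.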
When I save an existing XML document to a new location, entities in the content are lost and replaced with a question mark.
See the screenshots below: the entity &#x2010; (a hyphen, written as a hexadecimal character reference) is present when reading, but it is replaced with a question mark after saving to another location.
While Reading as Inner XML
While Reading as Inner Text
After Saving XML File
EDIT 1
Below is my code
string path = @"C:\work\myxml.XML";
string pathnew = @"C:\work\myxml_new.XML";
//GetFileEncoding(path);
XmlDocument document = new XmlDocument();
XmlDeclaration xmlDeclaration = document.CreateXmlDeclaration("1.0","US-ASCII",null);
//document.CreateXmlDeclaration("1.0", null, null);
document.Load(path);
string x = document.InnerText;
document.Save(pathnew);
EDIT 2
My source file looks like the one below. I need to retain the entities as they are.
The issue here seems to be the handling of encoding of entity references by the specific XmlWriter implementation internal to XmlDocument.
The issue disappears if you create an XmlWriter yourself - the unsupported character will be correctly encoded as an entity reference. This XmlWriter is a different (and newer) implementation that sets an EncoderFallback that encodes characters as entity references for characters that can't be encoded. Per the remarks in the docs, the default fallback mechanism is to encode a question mark.
var settings = new XmlWriterSettings
{
Indent = true,
Encoding = Encoding.GetEncoding("US-ASCII")
};
using (var writer = XmlWriter.Create(pathnew, settings))
{
document.Save(writer);
}
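To check the behaviour without touching the file system, the same idea can be run against a MemoryStream. This is only a sketch: `AsciiSaver` and `SaveAsAscii` are hypothetical names, and `Encoding.ASCII` is used as the US-ASCII encoding.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class AsciiSaver
{
    // Saves the document through an explicitly created XmlWriter and returns the
    // bytes decoded as ASCII, so that character references are visible in the result.
    public static string SaveAsAscii(XmlDocument document)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                document.Save(writer);
            }
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

Loading a document that contains U+2010 and passing it to `SaveAsAscii` should show the character written as a reference rather than as a question mark.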
As an aside, I'd recommend using the LINQ to XML XDocument API; it's much nicer to work with than the creaky old XmlDocument API. And its version of Save doesn't have this problem, either!
I have a simple XmlReader:
XmlReader r = XmlReader.Create(fileName);
while (r.Read())
{
Console.WriteLine(r.Value);
}
The problem is, the Xml file has ISO-8859-9 characters in it, which makes XmlReader throw "Invalid character in the given encoding." exception. I can solve this problem with adding <?xml version="1.0" encoding="ISO-8859-9" ?> line in the beginning but I'd like to solve this in another way in case I can't modify the source file. How can I change the encoding of XmlReader?
To force .NET to read the file in as ISO-8859-9, just use one of the many XmlReader.Create overloads, e.g.
using(XmlReader r = XmlReader.Create(new StreamReader(fileName, Encoding.GetEncoding("ISO-8859-9")))) {
while(r.Read()) {
Console.WriteLine(r.Value);
}
}
However, that may not work because, IIRC, the W3C XML standard says something about when the XML declaration line has been read, a compliant parser should immediately switch to the encoding specified in the XML declaration regardless of what encoding it was using before. In your case, if the XML file has no XML declaration, the encoding will be UTF-8 and it will still fail. I may be talking nonsense here so try it and see. :-)
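One way to "try it and see" is a small self-contained check like the sketch below. `Latin5Reader` and `ReadText` are hypothetical names. The call to `Encoding.RegisterProvider` is an assumption about the runtime: on .NET Core and .NET 5+ the ISO-8859-9 code page is only available after registering `CodePagesEncodingProvider`, while on .NET Framework the encoding is built in and that line can be dropped. As far as I know, when the reader is created from a TextReader the characters arrive already decoded, so the StreamReader's encoding is what counts.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class Latin5Reader
{
    // Decodes the bytes with the named encoding and returns the concatenated text nodes.
    public static string ReadText(byte[] xmlBytes, string encodingName)
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(encodingName);

        var text = new StringBuilder();
        using (var stream = new MemoryStream(xmlBytes))
        using (var streamReader = new StreamReader(stream, encoding))
        using (XmlReader reader = XmlReader.Create(streamReader))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        return text.ToString();
    }
}
```

Feeding it a document without an XML declaration whose bytes were produced with ISO-8859-9 shows whether the forced encoding is honoured.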
The XmlTextReader class (which is what the static Create method is actually returning, since XmlReader is the abstract base class) is designed to automatically detect encoding from the XML file itself - there's no way to set it manually.
Simply ensure that you include the following XML declaration in the file you are reading:
<?xml version="1.0" encoding="ISO-8859-9"?>
If you can't ensure that the input file has the right header, you could look at one of the other 11 overloads to the XmlReader.Create method.
Some of these take an XmlReaderSettings variable or XmlParserContext variable, or both. I haven't investigated these, but there is a possibility that setting the appropriate values might help here.
There is the XmlReaderSettings.CheckCharacters property - the help for this states:
Instructs the reader to check characters and throw an exception if any characters are outside the range of legal XML characters. Character checking includes checking for illegal characters in the document, as well as checking the validity of XML names (for example, an XML name may not start with a numeral).
So setting this to false might help. However, the help also states:
If the XmlReader is processing text data, it always checks that the XML names and text content are valid, regardless of the property setting. Setting CheckCharacters to false turns off character checking for character entity references.
So further investigation is warranted.
Use an XmlTextReader instead of an XmlReader:
System.Text.Encoding.UTF8.GetString(YourXmlTextReader.Encoding.GetBytes(YourXmlTextReader.Value))