While saving the existing XML to new location, entities escaped from the content and replaced with Question Mark
See the snaps below entity ‐ (- as Hex) present while reading but its replaced with question mark after saving to another location.
While Reading as Inner XML
While Reading as Inner Text
After Saving XML File
EDIT 1
Below is my code
string path = #"C:\work\myxml.XML";
string pathnew = #"C:\work\myxml_new.XML";
//GetFileEncoding(path);
XmlDocument document = new XmlDocument();
XmlDeclaration xmlDeclaration = document.CreateXmlDeclaration("1.0","US-ASCII",null);
//document.CreateXmlDeclaration("1.0", null, null);
document.Load(path);
string x = document.InnerText;
document.Save(pathnew);
EDIT 2
My source file looks like below. I need to retain the entities as it is
The issue here seems to be the handling of encoding of entity references by the specific XmlWriter implementation internal to XmlDocument.
The issue disappears if you create an XmlWriter yourself - the unsupported character will be correctly encoded as an entity reference. This XmlWriter is a different (and newer) implementation that sets an EncoderFallback that encodes characters as entity references for characters that can't be encoded. Per the remarks in the docs, the default fallback mechanism is to encode a question mark.
var settings = new XmlWriterSettings
{
Indent = true,
Encoding = Encoding.GetEncoding("US-ASCII")
};
using (var writer = XmlWriter.Create(pathnew, settings))
{
document.Save(writer);
}
As an aside, I'd recomment using the LINQ to XML XDocument API, it's much nicer to work with than the old creaky XmlDocument API. And its version of Save doesn't have this problem, either!
Related
I have a problem. I'm deserializing XML-document and performing a schema validation for it. This works fine but if a field value contains a Unicode whitespace character (Tab/U+0009), it will be replaced with a space.
I've narrowed this functionality down to schema validation. Without schema validation, the Unicode characters will be preserved. I believe the XmlReader performs normalization when performing the schema validation.
Here's the simplified code:
var settings = new XmlReaderSettings
{
Schemas = ...,
ValidationType = ValidationType.Schema,
ValidationFlags = XmlSchemaValidationFlags.ReportValidationWarnings
}
// ...
// doc is XDocument
using (var reader = XmlReader.Create(doc.CreateReader(), settings)
{
var deserializedObject = serializer.Deserialize(reader);
}
I found a suggestion to use XmlTextReader manually, but I can't set XmlReaderSettings for the reader that way. XmlTextReader would have a property Normalize that would (probably) fix my issue.
Another suggestion I found was to set XmlReader Normalization private property via reflection, but it seems that it has been removed in later versions as it seems it doesn't exist in .NET 6.
I can't edit the XML document or the schema file.
Any suggestions how to workaround this issue?
I use XmlDocument class for loading XML and XmlWriter class for generating the file. I wanted to preserve the decimal character entities (characters in bold) that's present in the xml, like the one below,
<car id="wait for the signal
then proceed">
Tried options like XmlTextReader, but had no luck. After processing the above line in the file looks something like below,
<car id="wait for the signal
then proceed">
or
<car id="wait for the signal
then proceed">
XmlWriter code block i used,
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
Encoding = encoding,
NewLineHandling = NewLineHandling.None
};
XmlDataDocument xmlDataDocument = new XmlDataDocument
{
PreserveWhitespace = true
};
xmlDataDocument.LoadXml(xmlString);
using (XmlWriter writer = XmlWriter.Create(filepath, xmlWriterSettings))
{
if (writer != null)
{
xmlDataDocument.Save(writer);
}
}
any help one this is much appreciated.
I am unsure what you are trying to achieve here.
Both
and
are equivalent. They both reference a newline. One in is decimal form, the other is in hexadecimal form. Either will work with the XML parser.
If you read the official XML Documentation, it specifies that either method of referencing is allowed.
Why do you need to preserve the decimal form?
As far as an XML parser is concerned, it will always treat a numeric character reference in exactly the same way as it treats the character itself.
Your only possible way forward is to preprocess the file before the XML parser gets to see it, replacing the & with some other character such as §. And then of course, reverse the process afterwards.
I have an xml string coming from Adobe PDF AcroForms, which apparently allows naming form fields starting with numeric characters. I'm trying to parse this string to an XDocument:
XDocument xDocument = XDocument.Parse(xmlString);
But whenever I encounter such a form field where the name starts with a numeric char, the xml parsing throws an XmlException:
Name cannot begin with the 'number' character
Other solutions I found were about using: XmlReaderSettings.CheckCharacters
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString), new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
But this also didn't work. Some articles pointed out the reason as one of the points mentioned in MSDN article:
If the XmlReader is processing text data, it always checks that the
XML names and text content are valid, regardless of the property
setting. Setting CheckCharacters to false turns off character checking
for character entity references.
So I tried using:
using(MemoryStream memoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(xmlString)))
using (XmlReader xmlReader = XmlReader.Create(memoryStream, new XmlReaderSettings() { CheckCharacters = false }))
{
XDocument xDocument = XDocument.Load(xmlReader);
}
This also didn't work.
Can any one please help me in figuring out how to parse an xml string that contains xml elements whose name starts with numeric characters?
How is the flag XmlReaderSettings.CheckCharacters supposed to be used?
You can't make standard XML parser parse your format even if it "looks like" XML, stop trying. Standard-compliant XML parsers are disallowed to parse invalid XML. This was a design decision, based on all the problems quirks mode caused with HTML parsing.
Writing your own parser isn't that hard. XML is very strict and, unless you need advanced features, the syntax is simple.
LL parser can be written by hand. Both lexer and parser are simple.
LR parser can be generated using ANTLR and a simple grammar. Most likely, you'll even find example XML garmmars.
You can also just take either of .NET XML parsers' source code and remove validation you don't need. You can find both XmlDocument and XDocument in .NET Core's repository on GitHub.
I'm freaking out with C# and XmlDocuments right now.
I need to parse XML data into another XML but I can't get special characters to work.
I'm working with XmlDocument and XmlNode.
What I tried so far:
- XmlDocument.CreateXmlDeclaration("1.0", "UTF-8", "yes");
- XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
What I know for sure:
- The input XML is also UTF-8
- The "InnerText" value is encoded without replacing the characters
Here is some code (not all... way to much code):
XmlDocument newXml = new XmlDocument();
newXml = (XmlDocument)systemsTemplate.Clone();
newXml.CreateXmlDeclaration("1.0", "UTF-8", "yes");
newXml.SelectSingleNode("systems").RemoveAll();
foreach(XmlNode categories in exSystems.SelectNodes("root/Content/Systems/SystemLine"))
{
XmlNode categorieSystemNode = systemsTemplate.SelectSingleNode("systems/system").Clone();
categorieSystemNode.RemoveAll();
XmlNode importIdNode = systemsTemplate.SelectSingleNode("systems/system/import_id").Clone();
string import_id = categories.Attributes["nodeName"].Value;
importIdNode.InnerText = import_id;
categorieSystemNode.AppendChild(importIdNode);
[way more Nodes which I proceed like this]
}
newXml.SelectSingleNode("systems").AppendChild(newXml.ImportNode(categorieSystemNode, true));
XmlTextWriter writer = new XmlTextWriter(outputDir + "systems.xml", Encoding.UTF8);
writer.Formatting = Formatting.Indented;
newXml.Save(writer);
writer.Flush();
writer.Close();
But what I get is this as an example:
<intro><p>Whether your project [...]</intro>
Instead of this:
<intro><p>Whether your project [...] </p></intro>
I do have other non-html tags in the XML so please don't provide HTML-parsing solutions :/
I know I could replace the characters with String.Replace() but that's dirty and unsafe (and slow with around 20K lines).
I hope there is a simpler way of doing this.
Kind regards,
Eriwas
The main propose of XmlDocument is to provide an easy way to work with XML documents while making sure the outcome is a well formed document.
So, using InnerText as in your example, you let the framework encode the string and properly insert it into that document. Whenever you read that same value, it will be decoded and returned to you exactly as your original string.
But, if you want to add an XML fragment anyways, you should stick with InnerXml or ImportNode. You must be aware that could lead to a more complex document structure, and you probably would like to avoid that.
As a third possibility, you can use the CreateCDataSection to add a CDATA and add your text there.
You definitely should be away from treating that XML document as a string by trying Replace things; stick with the framework and you'll be ok.
For a code in C#, I am parsing a string to XML using XPathDocument.
The string is retrieved from SDL Trados Studio and it depends on the XML that is being worked on (how it was originally created and loaded for translations) the string sometimes has a BOM sometimes not.
Edit: The 'xml' is actually parsed from the segments of the source and target text and the structure element. The textual elements are escaped for xml and the markup and text is joined in one string. So if the markup has BOM in the xliff, then the string will have BOM.
I am trying to actually parse any of the xmls, independent of encoding. So at this point my solution is to remove the BOM with Substring.
Here is my code:
//Recreate XML files (extractor returns two string arrays)
string strSourceXML = String.Join("", extractor.TextSrc);
string strTargetXML = String.Join("", extractor.TextTgt);
//strip BOM
strSourceXML = strSourceXML.Substring(strSourceXML.IndexOf("<?"));
strTargetXML = strTargetXML.Substring(strSourceXML.IndexOf("<?"));
//Transform XML with the preview XSL
var xSourceDoc = new XPathDocument(strSourceXML);
var xTargetDoc = new XPathDocument(strTargetXML);
I have searched for a better solution, through several articles, such as these, but I found no better solution yet:
XML - Data At Root Level is Invalid
Parsing XML with C#
Parsing complex XML with C#
Parsing : String to XML
XmlReader breaks on UTF-8 BOM
Any advice to solve this more elegantly?
The constructor of XPathDocument taking a String argument https://msdn.microsoft.com/en-us/library/te0h7f95%28v=vs.110%29.aspx takes a URI with the XML file location. If you have a string with XML markup then use a StringReader over that string e.g.
XPathDocument xSourceDoc;
using (TextReader tr = new StringReader(strSourceXML))
{
xSourceDoc = new XPathDocument(tr);
}