.NET XmlDocument LoadXML and Entities

.NET XmlDocument LoadXML and Entities - c#

When loading XML into an XmlDocument, i.e.
XmlDocument document = new XmlDocument();
document.LoadXml(xmlData);
is there any way to stop the process from replacing entities? I've got a strange problem where I've got a TM symbol (stored as the entity #8482) in the xml being converted into the TM character. As far as I'm concerned this shouldn't happen as the XML document has the encoding ISO-8859-1 (which doesn't have the TM symbol)
Thanks

This is a standard misunderstanding of the XML toolset. The whole business with "&#x", is a syntactic feature designed to cope with character encodings. Your XmlDocument isn't a stream of characters - it has been freed of character encoding issues - instead it contains an abstract model of XML type data. Words for this include DOM and InfoSet, I'm not sure exactly which is accurate.
The "&#x" gubbins won't exist in this model because the whole issue is irrelevant, it will return - if appropriate - when you transform the Info Set back into a character stream in some specific encoding.
This misunderstanding is sufficiently common to have made it into academic literature as part of a collection of similar quirks. Take a look at "Xml Fever" at this location: http://doi.acm.org/10.1145/1364782.1364795

What are you writing it to? A TextWriter? a Stream? what?
The following keeps the entity (well, it replaces it with the hex equivalent) - but if you do the same with a StringWriter it detects the unicode and uses that instead:
XmlDocument doc = new XmlDocument();
doc.LoadXml(#"<xml>™</xml>");
using (MemoryStream ms = new MemoryStream())
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.GetEncoding("ISO-8859-1");
XmlWriter xw = XmlWriter.Create(ms, settings);
doc.Save(xw);
xw.Close();
Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()));
}
Outputs:
<?xml version="1.0" encoding="iso-8859-1"?><xml>™</xml>

I confess things get a little confusing with XML documents and encodings, but I'd hope that it would get set appropriate when you save it again, if you're still using ISO-8859-1 - but that if you save with UTF-8, it wouldn't need to. In some ways, logically the document really contains the symbol rather the entity reference - the latter is just an encoding matter. (I'm thinking aloud here - please don't take this as authoritative information.)
What are you doing with the document after loading it?

I beleive if you enclose the entity contents in the CDATA section it should leave it all alone e.g.
<root>
<testnode>
<![CDATA[some text ™]]>
</testnode>
</root>

Entity references are not encoding specific. According to the W3C XML 1.0 Recommendation:
If the character reference begins with
"&#x", the digits and letters up to
the terminating ; provide a
hexadecimal representation of the
character's code point in ISO/IEC
10646.

The &#xxxx; entities are considered to be the character they represent. All XML is converted to unicode on reading and any such entities are removed in favor of the unicode character they represent. This includes any occurance for them in unicode source such as the string passed to LoadXML.
Similarly on writing any character that cannot be represented by the stream being written to is converted to a &#xxxx; entity. There is little point trying to preserve them.
A common mistake is expect to get a String from a DOM by some means that uses an encoding other then unicode. That just doesn't happen regardless of what the

Thanks for all of the help.
I've fixed my problem by writing a HtmlEncode function which actually replaces all of the characters before it spits them out to the webpage (instead of relying on the somewhat broken HtmlEncode() .NET function which only seems to encode a small subset of the characters necessary)

Related

Reading xml file with linq and getting error : Invalid character in the given encoding [duplicate]

I faced a problem with reading the XML. The solution was found, but there are still some questions. The incorrect XML file is in encoded in UTF-8 and has appropriate mark in its header. But it also includes a char encoded in UTF-16 - 'é'. This code was used to read XML file for validating its content:
var xDoc = XDocument.Load(taxFile);
It raises exception for specified incorrect XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't raise exception for the incorrect file. But the 'é' character is loaded as �. My first question is "why does it work?".
Another point is using XmlReader doesn't raise exception until the node with 'é' is loaded.
XmlReader xmlTax = XmlReader.Create(filePath);
And again the workout with StreamReader helps. The same question.
It seems like the fix solution is not good enough, cause one day :) XML encoded in another format may appear and it could be proceed in the wrong way. BUT I've tried to process UTF-16 formatted XML file and it worked fine (configured to UTF-8).
The final question is if there are any options to be provided for XDocument/XmlReader to ignore characters encoding or smth like this.
Looking forward for your replies. Thanks in advance

The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without exception with StreamReader, it's because Encoding contains settings to control what happens when incompatible data is encountered
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallbackProperty. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.

xml and & issue

I am new to XML and I am now trying to read an xml file.
I googled and try this way to read xml but I get this error.
Reference to undeclared entity 'Ccaron'. Line 2902, position 9.
When I go to line 2902 I got this,
<H0742>&Ccaron;opova 14, POB 1725,
SI-1000 Ljubljana</H0742>
This is the way I try
XmlDocument xDoc = new XmlDocument();
xDoc.Load(file);
XmlNodeList nodes = xDoc.SelectNodes("nodeName");
foreach (XmlNode n in nodes)
{
if (n.SelectSingleNode("H0742") != null)
{
row.IrNbr = n.SelectSingleNode("H0742").InnerText;
}
.
.
.
}
When I look at w3school, & is illegal in xml.
EDIT :
This is the encoding. I wonder it's related with xml somehow.
encoding='iso-8859-1'
Thanks in advance.
EDIT :
They gave me an .ENT file and I can reference online ftp.MyPartnerCompany.com/name.ent.
In this .ENT file
I see entities like that
<!ENTITY Cacute "Ć"> <!-- latin capital letter C with acute,
U+0106 Latin Extended-A -->
How can I reference it in my xml Parsing ?
I prefer to reference online since they may add new anytime.
Thanks in advance !!!

The first thing to be aware of is that the problem isn't in your software.
As you are new to XML, I'm going to guess that definining entities isn't something you've come across before. Character entities are shortcuts for arbitrary pieces of text (one or more characters). The most common place you are going to see them is in the situation you are in now. At some point, your XML has been created by someone who wanted to type the character 'Č' or 'č' (that's upper and lower case C with Caron if your font can't display it).
However, in XML we only have a few predeclared entities (ampersand, less than, greater than, double quote and apostraphe). Any other character entities need to be declared. In order to parse your file correctly you will need to do one of two things - either replace the character entity with something that doesn't cause the parser issues or declare the entity.
To declare the entity, you can use something called an "internal subset" - a specialised form of the DTD statement you might see at the top of your XML file. Something like this:
<!DOCTYPE root-element
[ <!ENTITY Ccaron "Č">
<!ENTITY ccaron "č">]
>
Placing that statement at the beginning of the XML file (change the 'root-element' to match yours) will allow the parser to resolve the entity.
Alternatively, simply change the &Ccaron; to Č and your problem will also be resolved.
The &# notation is a numeric entity, giving appropriate unicode value for the character (the 'x' indicates that it's in hex).
You could always just type the character too but that requires knowledge of the ins and outs of your keyboard and region.

&Ccaron; isn't XML it's not even defined in the HTML 4 entity reference. Which btw isn't XML. XML doesn't support all those entities, in fact, it supports very few of them but if you look up the entity and find it, you'll be able to use it's Unicode equivalent, which you can use. e.g. Š is invalid XML but Š isn't. (Scaron was the closest I could find to Ccaron).

Your XML file isn't well-formed and, so, can't be used as XmlDocument. Period.
You have two options:
Open that file as a regular text file and fixed that symptom.
Fix your XML generator, and that's your real problem. That generator isn't generating that file using System.Xml, but probably concatening several strings, as "XML is just a text file". You should repair it, or opening a generated XML file will be always a surprise.
EDIT: As you can't fix your XML generator, I recommend to open it with File.ReadAllText and execute an regular expression to re-encode that & or to strip off entire entity (as we can't translate it)
Console.WriteLine(
Regex.Replace("<H0742>&Ccaron;opova 14, { POB & SI-1000 &</H0742>",
#"&((?!#)\S*?;)?", match =>
{
switch (match.Value)
{
case "<":
case ">":
case "&":
case """:
case "&apos;":
return match.Value; // correctly encoded
case "&":
return "&";
default: // here you can choose:
// to remove entire entity:
return "";
// or just encode that & character
return "&" + match.Value.Substring(1);
}
}));

&Ccaron; is an entity reference. It is likely that the entity reference is intended to be for the character Č, in order to produce: Čopova.
However, that entity must be declared, or the XML parser will not know what should be substituted for the entity reference as it parses the XML.

solution :-
byte[] encodedString = Encoding.UTF8.GetBytes(xml);
// Put the byte array into a stream and rewind it to the beginning
MemoryStream ms = new MemoryStream(encodedString);
ms.Flush();
ms.Position = 0;
// Build the XmlDocument from the MemorySteam of UTF-8 encoded bytes
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(ms);

Unicode to Windows-1251 Conversion with XML(HTML)-escaping

I have XML-file and need to produce HTML-file with Windows-1251 encoding by applying XSL Transformation. A problem is that Unicode characters of XSL -file are not converted to HTML Unicode Escape Sequence like "ғ" during XSL Transformation, only "?" sign is written instead of them. How can I ask XslCompiledTransform.Transform method to do this conversion? Or is there any method to write HTML-string into Windows-1251 HTML file with applying HTML Unicode Escape Sequences, so that I can perform XSL Transformation to string and then by this method to write to a file with Windows-1251 encoding and with HTML-escaping of all unicode characters (something like Convert("ғ") will return "ғ")?
XmlReader xmlReader = XmlReader.Create(new StringReader("<Data><Name>The Wizard of Wishaw</Name></data>"));
XslCompiledTransform xslTrans = new XslCompiledTransform();
xslTrans.Load("sheet.xsl");
using (XmlTextWriter xmlWriter = new XmlTextWriter("result.html", Encoding.GetEncoding("Windows-1251")))
{
xslTrans.Transform(xmlReader, xmlWriter); // it writes Windows-1251 HTML-file but does not escape unicode characters, just writes "?" signs
}
Thanks all for help!
UPDATE
My output configuration tag in XSL-file:
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />
I do not even hope now that XSL will satisfy my needs. But I wonder that I do not have any method to check if character is acceptable by specified encoding. Something like
Char.IsEncodable('ғ', Encoding.GetEncoding('Windows-1251'))
My current solution is to convert all characters greater than 127 (c > 127) to &#dddd; escape strings, but my chief is not satisfied by the solution, because the source of generated HTML-file is not readable.

Do note that XML is both a data model and a serialization format. The data can use different character set than the serialization of this data.
It looks like the key reason to your problem is that your serialization process is trying to limit the character set of the data model, whereas you would like to set the character set of the serialization format. Let's have an example: <band>Motörhead</band> and <band>Motörhead</band> are equal XML documents. They have the same structure and exactly the same data. Because of the heavy metal umlaut, the character set of the data is unicode (or something bigger than ASCII) but, because the usage of a character reference ö, the character set of the latter serialization form of the document is ASCII. In order to process this data, your XML tools still need to be unicode aware in both cases, but when using the latter serialization, the I/O and file transfer tools don't need to be unicode aware.
My guess is that by telling the XMLTextWriter to use Windows-1251 encoding, it probably in practice tries to limit the character set of the data to the characters contained in Windows-1251 by discarding all the characters outside this character set and writing a ? character instead.
However, since you produce your XML document by an XSL transformation, you can control the character set of the serialization directly in your XSLT document. This is done by adding a encoding attribute to the xsl:output element. Modify it to look like this
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" encoding="windows-1251"/>
Now the XSLT processor takes care of the serialization to reduced character set and outputs a character reference for all characters in the data that are included in windows-1251.
If changing the character set of the data is really what you need, then you need to process your data with a suitable character conversion library that can guess the most suitable replacement character (like ö -> o).

try to complement your xsl-file with replacement rules a la
<xsl:value-of select="replace(.,'ғ','&#1171;')"/>
you may wish to do this using regex patterns instead:
<xsl:value-of select="replace(.,'&#(\d+);','&#$1;')"/>
your problem origins with the xml parser that substitutes the numeric entity reference with the corresponding unicode chars before the transformation takes place. thus the unknown chars (resp. '?')
end up in your converted document.
hope this helps,
best regards,
carsten

The correct solution would be to write the file in a Unicode encoding (such as UTF-8) and forget about CP-1251 and all other legacy encodings.
But I will assume that this is not an option for some reason.
The best alternative that I can devise is to do the character replacements in the string before handing it to the XmlReader. You should use the Encoding class to convert the string to an array of bytes in CP-1251, and create your own decoder fallback mechanism. The fallback mechanism can then insert the XML escape sequences. This way you are guarunteed to handle all (and exactly those) characters that are not in CP-1251.
Then you can convert the array of bytes (in CP-1251) into a normal .NET String (in UTF-16) and hand it to your XmlReader. The values that need to be escaped will already be escaped, so the final file should be written correctly.
UPDATE
I just realized the flaw of this method. The XmlWriter will further escape the & characters as &, so the escapes themselves will appear in the final document rather than the characters they represent.
This may require some very complicated solution!
ANOTHER UPDATE
Ignore that last update. Since you are reading the string in as XML, the escapes should be interpreted correctly. This is what I get for trying post quickly rather than thinking through the problem!
My proposed solution should work fine.

Have you tried specifying the encoding in the xsl:output?
(http://www.w3schools.com/xsl/el_output.asp)

The safest and most interoperable way to do this is to specify encoding="us-ascii" in your xsl:output element. Most XSLT processors support writing this encoding.
US-ASCII is a completely safe encoding as it is a compatible subset of UTF-8 (you may elect to label the emitted XML as having a "utf-8" encoding, as this will also be true: this can be done by specifying omit-xml-declaration="yes" for your xsl:output and manually prepending an "<?xml version='1.0' encoding='utf-8'?>" declaration to your output).
This approach works because when using US-ASCII encoding, a serializer is forced to use XML's escaping mechanism for characters beyond U+007F, and so will emit them as numeric character references (the "&#.....;" form).
When dealing with environments in which non-standard encodings are required, it is generally a good defensive technique to produce this kind of XML as it is completely conformant and works in practice with even some buggy consuming software.

How to prevent illegal characters to appear in my XML when retrieving it from SQL Server

Sometimes the string values of Properties in my Classes become odd. They contain illegal characters and are displayed like this (with boxes):
123[]45[]6789
I'm assuming those are illegal/unrecognized characters. I serialize all my objects to XML and then upload them via Web Service. When I retrieve them again, some characters are replaced with oddities. This happens most often with hyphens and dashes that have been typed using Word. Is that the cause of it?
Is there anyway I can check to see if the string contains any of these unrecognized characters via regex or something?

The first thing to remember, is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, there are non-characters, but there are no generally "special characters" or "illegal characters".
What you have here is either:
Perfectly normal characters for which your font doesn't have a glyph.
Perfectly normal characters that aren't printable (e.g. control characters).
An artefact of how the debugger works.
The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
An important one to look out for is U+FFFD (�) as it is sometimes used when a decoder has recieved a bunch of bytes that make no sense in the context of the encoding it is trying to use (e.g. 0x80 followed by 0x20 makes no sense in UTF-8, and one possible response is to use U+FFFD as a "something strange here" marker, other possible responses are throwing an error, and also silently ignoring the error or trying to guess at intent though those last two bring security issues).
Once you've this figured out, you can begin to reason about why it's getting in there if it isn't expected. Could it be an ecoding issue (charset written in is not the charset read in)? Could it be actually intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.
Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing it's fine", possibly something simple or something hard. Can't say yet.
Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so maybe you're making a deeper bug harder to find than it is now, or damaging perfectly good data.

Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.
Verify that the XML document itself is stored using the correct encoding to support the characters you need to store. Then verify when you are reading the file in that you are using the same encoding as the document i.e. if your XML document is stored as UTF-8 then you need to make sure when you read it in your encoding it as UTF-8.

Take a deeper look at the characters themselves, what are the acutal char values?
When a character shows up an a square it means you can't represent it visually. This is either because it's a non-visual character, or it's outside of your current char set.
edit, nope
In your example I'd venture a guess that your seeing imbedded newline characters.

Define the allowed characters and block everything else, i.e.:
// only lowercase letters and digits
if(Regex.IsMatch(yourString, #"^[a-z0-9]*$"))
{
// allowed
}
But I think your problem may lie somewhere else, because you say it comes from serializing (valid) string and then deserializing (invalid) strings. It is possibly that you use default serialization and that you don't apply proper ISerializable implementation for your classes (or proper use of the Serializable attributes), resulting in properties or fields being serialized that you don't want to be serialized.
PS: others have mentioned encoding issues, which is a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, xml) and be specific. If you are not, the default encoding is used, which can be different from system to system.
Edit: possible solution
Based on new information (see thread under original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with pretty dashes like "—" (—) when used in some fancy editing environment. Since it seems that there's some unclarity in how to fix SQL Server to accept proper encoded strings, you can also solve this in your XML.
When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use the proper numerical entities. When you deserialize, this will be properly parsed in your strings without further ado. Something along these lines:
Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII;
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header
But be aware of using StringBuilder or StringWriter, because it is fixed to using UTF-16, and the XmlWriter will always write in that encoding, more info on that issue at my blog, which is not compatible with SQL Server.
Note: when using the ASCII encoding, any character higher than 0x7F will be encoded. So, é will look like &#xE9 and the dash may look like &#x2014, but this means just the same and you should not worry about that. Every XML capable tool will properly interpret this input.
Note 2: the location where you want to change the way XML is written is the Web Service you talk of, that receives XML and then stores it into the SQL Server database. Before storing into SQL Server, the change must be applied. Earlier on in the chain is useless.

public static T DeserializeFromXml<T>(string xml)
{
T result;
XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
XmlSerializer serializer =serializerFactory.CreateSerializer(typeof(T));
using (StringReader sr3 = new StringReader(xml))
{
XmlReaderSettings settings = new XmlReaderSettings()
{
CheckCharacters = false // default value is true;
};
using (XmlReader xr3 = XmlTextReader.Create(sr3, settings))
{
result = (T)serializer.Deserialize(xr3);
}
}
return result;
}

How can I generate XML with CR, instead of CRLF in XmlTextWriter

I'm generating XML via XmlTextWriter.
The file looks good to my eyes, validates (at wc3), and was accepted by the client.
But a client vendor is complaining that the line-endings are CRLF, instead of just CR.
Well, I'm on a Win32 machine using C#, and CRLF is the Win32 standard line-ending.
Is there any way to change the line-endings in the XmlTextWriter?
Also -- shouldn't the line-endings not matter to a proper XML parser?
see also: What are carriage return, linefeed, and form feed?
NOTE: looks like the only answer is a sideways solution -- you have to use the XmlWriter instead of the XmlTextWriter

of course, moments after asking, I find a clue on MSDN (that I couldn't find via google) that refers to XmlWriterSettings.NewLineChars
which then led me to the unaccepted answer on SO: Writing XMLDocument to file with specific newline character (c#)
It's all in the terminology.....

Use the XmlWriterSettings to set what you want as your end of line char.
XmlWriterSettings mySettings = new XmlWriterSettings();
mySettings.NewLineChars = "\r";
XmlWriter writer = XmlWriter.Create(
new StreamWriter(#"c:\temp\hello.xml", mySettings);
I don't know where end of line characters would matter. I haven't run into it before.

What line ending is used should not matter to a properly implemented parser (see the spec), I quote (emphasis mine):
To simplify the tasks of applications,
the XML processor must behave as if it
normalized all line breaks in external
parsed entities (including the
document entity) on input, before
parsing, by translating both the
two-character sequence #xD #xA and any #xD
that is not followed by #xA to a single #xA character.
Therefore, you should be fine with the way you have it right now. You might want to ask what the client vendor is actually doing there, chances are that they are Doing it Wrong.

Use the XmlWriterSettings.NewLineChars property.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.