I have an XML file and need to produce an HTML file in Windows-1251 encoding by applying an XSL transformation. The problem is that Unicode characters from the XSL file are not converted to numeric character references like "&#1171;" during the XSL transformation; only a "?" sign is written instead of them. How can I ask the XslCompiledTransform.Transform method to do this conversion? Or is there any method to write an HTML string into a Windows-1251 HTML file while applying numeric character references, so that I can perform the XSL transformation to a string and then use this method to write it to a file in Windows-1251 encoding with escaping of all Unicode characters (something like Convert("ғ") would return "&#1171;")?
XmlReader xmlReader = XmlReader.Create(new StringReader("<Data><Name>The Wizard of Wishaw</Name></Data>"));
XslCompiledTransform xslTrans = new XslCompiledTransform();
xslTrans.Load("sheet.xsl");
using (XmlTextWriter xmlWriter = new XmlTextWriter("result.html", Encoding.GetEncoding("Windows-1251")))
{
xslTrans.Transform(xmlReader, xmlWriter); // writes a Windows-1251 HTML file but does not escape Unicode characters; writes "?" signs instead
}
Thanks all for help!
UPDATE
My output configuration tag in XSL-file:
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />
I no longer expect that XSLT alone will satisfy my needs. But I am surprised that there is no method to check whether a character is representable in a specified encoding. Something like
Char.IsEncodable('ғ', Encoding.GetEncoding("Windows-1251"))
My current solution is to convert all characters greater than 127 (c > 127) to &#dddd; escape sequences, but my chief is not satisfied with this solution, because the source of the generated HTML file is not readable.
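There is indeed no built-in Char.IsEncodable, but one can be approximated with the framework's exception fallbacks; here is a sketch (the method name mirrors the wished-for API and is not a real framework member):

```csharp
using System.Text;

static class EncodingProbe
{
    // Sketch of the wished-for Char.IsEncodable: returns true if the
    // character can be encoded in the named encoding without data loss.
    public static bool IsEncodable(char c, string encodingName)
    {
        // Exception-throwing fallbacks make unencodable characters detectable
        // instead of being silently replaced with '?'.
        Encoding enc = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            enc.GetBytes(new[] { c });
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

For example, `EncodingProbe.IsEncodable('ғ', "windows-1251")` should return false, since U+0493 is not part of code page 1251. Note that this per-char check mishandles surrogate pairs; it is a sketch of the idea, not production code.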
Do note that XML is both a data model and a serialization format. The data can use a different character set than the serialization of this data.
It looks like the key reason for your problem is that your serialization process is trying to limit the character set of the data model, whereas you would like to set the character set of the serialization format. Let's have an example: <band>Motörhead</band> and <band>Mot&#246;rhead</band> are equal XML documents. They have the same structure and exactly the same data. Because of the heavy metal umlaut, the character set of the data is Unicode (or something bigger than ASCII), but because of the use of a character reference (&#246;), the character set of the latter serialization form of the document is ASCII. In order to process this data, your XML tools still need to be Unicode aware in both cases, but when using the latter serialization, the I/O and file transfer tools don't need to be Unicode aware.
My guess is that by telling the XmlTextWriter to use Windows-1251 encoding, you in practice tell it to limit the character set of the data to the characters contained in Windows-1251: it discards every character outside this character set and writes a ? character instead.
However, since you produce your XML document by an XSL transformation, you can control the character set of the serialization directly in your XSLT document. This is done by adding an encoding attribute to the xsl:output element. Modify it to look like this:
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" encoding="windows-1251"/>
Now the XSLT processor takes care of the serialization to the reduced character set and outputs a character reference for every character in the data that is not included in windows-1251.
If changing the character set of the data is really what you need, then you need to process your data with a suitable character conversion library that can guess the most suitable replacement character (like ö -> o).
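On the C# side, a related knob is worth trying as well (a sketch; behavior can vary by framework version): create the writer with XmlWriter.Create and an explicit XmlWriterSettings.Encoding. Writers created this way generally emit numeric character references for characters the target encoding cannot represent, where the older XmlTextWriter may write "?":

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

class TransformSketch
{
    static void Main()
    {
        var xslTrans = new XslCompiledTransform();
        xslTrans.Load("sheet.xsl"); // file name from the question

        // Ask the writer itself to serialize to windows-1251; characters
        // outside the code page should come out as &#dddd; references.
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.GetEncoding("windows-1251"),
            Indent = true
        };

        using (XmlReader xmlReader = XmlReader.Create(
                   new StringReader("<Data><Name>The Wizard of Wishaw</Name></Data>")))
        using (XmlWriter xmlWriter = XmlWriter.Create("result.html", settings))
        {
            xslTrans.Transform(xmlReader, xmlWriter);
        }
    }
}
```

Alternatively, xslTrans.OutputSettings (which reflects the stylesheet's xsl:output element, including its encoding) can be passed straight to XmlWriter.Create instead of hand-built settings.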
try to complement your xsl file with replacement rules a la
<xsl:value-of select="replace(.,'ғ','&#1171;')"/>
you may wish to do this using regex patterns instead:
<xsl:value-of select="replace(.,'&#(\d+);','&amp;#$1;')"/>
your problem originates with the xml parser, which substitutes the numeric entity references with the corresponding unicode chars before the transformation takes place. thus the unknown chars (resp. '?') end up in your converted document.
hope this helps,
best regards,
carsten
The correct solution would be to write the file in a Unicode encoding (such as UTF-8) and forget about CP-1251 and all other legacy encodings.
But I will assume that this is not an option for some reason.
The best alternative that I can devise is to do the character replacements in the string before handing it to the XmlReader. You can use the Encoding class to convert the string to an array of bytes in CP-1251, and create your own encoder fallback mechanism. The fallback mechanism can then insert the XML escape sequences. This way you are guaranteed to handle all (and exactly those) characters that are not in CP-1251.
Then you can convert the array of bytes (in CP-1251) into a normal .NET String (in UTF-16) and hand it to your XmlReader. The values that need to be escaped will already be escaped, so the final file should be written correctly.
UPDATE
I just realized the flaw of this method. The XmlWriter will further escape the & characters as &amp;, so the escapes themselves will appear in the final document rather than the characters they represent.
This may require some very complicated solution!
ANOTHER UPDATE
Ignore that last update. Since you are reading the string in as XML, the escapes should be interpreted correctly. This is what I get for trying to post quickly rather than thinking through the problem!
My proposed solution should work fine.
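A sketch of the encoder fallback mechanism described above: a custom EncoderFallback that turns every character CP-1251 cannot encode into a &#dddd; reference. Class names here are illustrative, not framework types:

```csharp
using System.Text;

// Emits numeric character references for unencodable characters.
class CharRefFallback : EncoderFallback
{
    // Longest replacement is "&#1114111;" (10 chars), so 11 is safe.
    public override int MaxCharCount { get { return 11; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new CharRefFallbackBuffer();
    }
}

class CharRefFallbackBuffer : EncoderFallbackBuffer
{
    private string replacement = "";
    private int pos;

    // Called for a single unencodable char.
    public override bool Fallback(char charUnknown, int index)
    {
        replacement = "&#" + (int)charUnknown + ";";
        pos = 0;
        return true;
    }

    // Called for an unencodable surrogate pair.
    public override bool Fallback(char high, char low, int index)
    {
        replacement = "&#" + char.ConvertToUtf32(high, low) + ";";
        pos = 0;
        return true;
    }

    public override char GetNextChar()
    {
        return pos < replacement.Length ? replacement[pos++] : '\0';
    }

    public override bool MovePrevious()
    {
        if (pos == 0) return false;
        pos--;
        return true;
    }

    public override int Remaining { get { return replacement.Length - pos; } }
}
```

It would be used roughly like `Encoding.GetEncoding("windows-1251", new CharRefFallback(), DecoderFallback.ReplacementFallback)`; calling GetBytes on a string containing 'ғ' should then yield the bytes of "&#1171;" in place of that character.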
Have you tried specifying the encoding in the xsl:output?
(http://www.w3schools.com/xsl/el_output.asp)
The safest and most interoperable way to do this is to specify encoding="us-ascii" in your xsl:output element. Most XSLT processors support writing this encoding.
US-ASCII is a completely safe encoding as it is a compatible subset of UTF-8 (you may elect to label the emitted XML as having a "utf-8" encoding, as this will also be true: this can be done by specifying omit-xml-declaration="yes" for your xsl:output and manually prepending an "<?xml version='1.0' encoding='utf-8'?>" declaration to your output).
This approach works because when using US-ASCII encoding, a serializer is forced to use XML's escaping mechanism for characters beyond U+007F, and so will emit them as numeric character references (the "&#.....;" form).
When dealing with environments in which non-standard encodings are required, it is generally a good defensive technique to produce this kind of XML as it is completely conformant and works in practice with even some buggy consuming software.
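A sketch of wiring this up in C# (file names are illustrative; it assumes the stylesheet's xsl:output carries encoding="us-ascii" as described above):

```csharp
using System.Xml;
using System.Xml.Xsl;

class AsciiOutputSketch
{
    static void Main()
    {
        var xslt = new XslCompiledTransform();
        // sheet.xsl is assumed to contain:
        //   <xsl:output method="xml" encoding="us-ascii"/>
        xslt.Load("sheet.xsl");

        // OutputSettings mirrors the stylesheet's xsl:output element, so the
        // writer serializes as us-ascii and must emit numeric character
        // references for everything beyond U+007F.
        using (XmlWriter writer = XmlWriter.Create("result.xml", xslt.OutputSettings))
        {
            xslt.Transform("input.xml", writer);
        }
    }
}
```

If you also want the emitted document labeled as utf-8, combine omit-xml-declaration="yes" in the stylesheet with writing the `<?xml version='1.0' encoding='utf-8'?>` declaration to the file yourself before transforming, as described above.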
Related
I am trying to read a text file and write its contents to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How can I do this? Do I need to detect the input file's encoding first (which seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
I am looking for a way to read the file no matter which of the 2 encoding and write it correctly without knowing the encoding of input file before hand.
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv"));
Note that this batch command reads a UTF-8 file and an ANSI file and writes the output as ANSI with all characters preserved, so I'm looking to do the same thing but in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt
Q: The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind it's so hard to do something in C# that the command prompt can do easily.
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If "?" you posted (hex 0xefbfbd) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character.
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback: When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar character.
UPDATE:
The offending character was "»", whose UTF-8 bytes are 0xC2 0xBB. This is a right angle quote, a guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise Roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));
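For the "read either encoding without knowing it beforehand" part, one common approach (a sketch, not the only way) is to decode strictly as UTF-8 first and fall back to a single-byte code page only if that fails, since non-UTF-8 byte streams are rarely valid UTF-8 by accident:

```csharp
using System.IO;
using System.Text;

static class SafeReader
{
    // Decode as strict UTF-8 first; if the bytes are not valid UTF-8,
    // assume a single-byte "ANSI" code page (iso-8859-1 here, as above).
    public static string ReadTextAutoDetect(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        try
        {
            // throwOnInvalidBytes: true -> throws instead of emitting U+FFFD
            var strictUtf8 = new UTF8Encoding(false, true);
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        }
    }
}
```

The choice of fallback code page is an assumption; substitute whatever "ANSI" means on the systems that produce your files (e.g. windows-1252).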
I am using XmlTextWriter to create the XML.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now I need to write a '&', and XmlTextWriter automatically writes it as "&amp;". So is there any workaround for this?
I am creating the XML by reading a doc file. So when I read "–" I need to write "&ndash;" into the XML, but it gets written as "&amp;ndash;".
So, for example, if I am trying to write a node with the text good-bad, I actually need to write my XML as <node>good&ndash;bad</node>. This is a requirement of my project.
In a proper XML file, you cannot have a standalone & character unless it starts a character or entity reference. So if you need an XML node to contain the literal text good&ndash;bad, then it will have to be encoded as good&amp;ndash;bad. There is no workaround, as anything different would not be valid XML. The only way to make it work is to write the XML file as plain text exactly how you want it, but then it could not be read by an XML parser, as it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using(var sw = new StreamWriter(stream))
{
// other code to write XML-like data
sw.WriteLine("<node>good&ndash;bad</node>");
// other code to write XML-like data
}
As you discovered, another option is the WriteRaw() method on XmlTextWriter (in C#), which will write an unencoded string, but that does not change the fact that the result is not going to be a valid XML file when it is done.
But as I mentioned, if you tried to read this with an XML parser, it would fail, because &ndash; is not one of XML's predefined entities, so it is not valid XML.
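For completeness, here is a sketch of the WriteRaw() escape hatch mentioned above; note that the output is only well-formed XML if whatever you write raw happens to be valid:

```csharp
using System;
using System.Xml;

class RawDemo
{
    static void Main()
    {
        // WriteRaw copies the string verbatim: "&ndash;" reaches the output
        // unescaped -- but, as discussed, no XML parser will accept it later.
        using (XmlWriter writer = XmlWriter.Create(Console.Out))
        {
            writer.WriteStartElement("node");
            writer.WriteRaw("good&ndash;bad");
            writer.WriteEndElement();
        }
    }
}
```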
– (the en dash) has an HTML character entity (&ndash;), but escaping it in XML should not normally be necessary; the character itself is perfectly legal in XML text.
In the XML language, & is the escape character, so &amp; is the appropriate string representation of &. You cannot use a bare & character because it has a special meaning, and a lone & would be misinterpreted by the parser.
You will see similar behavior with the <, >, ", and ' characters. All have meaning within the XML language, so they must be escaped if you need to represent them as text in a document.
Here's a reference to all of the character entities in XML (and HTML) from Wikipedia. Each is always represented by the escape character and the name (&gt;, &lt;, &quot;, &apos;).
In XML, & must be escaped as &amp;. The & character is reserved for entities and thus not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Any software reading the XML has to decode the entities again: &lt; for < and &gt; for >, for example. Some related languages, like HTML, provide even more of these.
I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)
I am using DataSet.ReadXml() to read an XML string. I get an error, as the XML string contains the invalid character 0x1F, which is 'US', the unit separator. This is contained within fully formed tags.
The data is extracted from an Oracle DB using a Perl script. What would be the best way to escape this character so that the XML is read correctly?
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
Between the C and the h in the bold part is where the US separator sits; when pasted here it actually shows as a space. So I want to know: how can I ignore that character in an XML string?
If you look at section 2.2 of the XML recommendation, you'll see that 0x1F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
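A sketch of that preprocessing step, keeping only characters in the XML 1.0 Char production (on .NET 4 and later, XmlConvert.IsXmlChar performs this test for you):

```csharp
using System.Text;

static class XmlCleanup
{
    // Keep only characters in the XML 1.0 "Char" production:
    //   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    // (supplementary-plane characters, encoded as surrogate pairs, are
    // dropped here for simplicity).
    public static string StripInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            bool legal = c == 0x9 || c == 0xA || c == 0xD
                         || (c >= 0x20 && c <= 0xD7FF)
                         || (c >= 0xE000 && c <= 0xFFFD);
            if (legal) sb.Append(c);   // 0x1F, for example, is discarded
        }
        return sb.ToString();
    }
}
```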
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why doesn't the Perl script that produces this string fail when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
Your XmlReader/TextReader must be created with the correct encoding. You can create it as below and pass it to your DataSet:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);
I'm populating an XElement with information and writing it to an XML file using the XElement.Save(path) method. At some point, certain characters in the resulting file are being escaped; for example, > becomes &gt;.
This behaviour is unacceptable, since I need to store information in the XML that includes the > character as part of a password. How can I write the 'raw' content of my XElement object to XML without having these escaped?
Lack of this behavior is unacceptable.
A standalone unescaped > is invalid XML.
XElement is designed to produce valid XML.
If you want to get the unescaped content of the element, use the Value property.
The XML specification usually allows > to appear unescaped. XDocument plays it safe and escapes it although it appears in places where the escaping is not strictly required.
You can do a replace on the generated XML. Be aware per http://www.w3.org/TR/REC-xml#syntax, if this results in any ]]> sequences, the XML will not conform to the XML specification. Moreover, XDocument.Parse will actually reject such XML with the error "']]>' is not allowed in character data.".
XDocument doc = XDocument.Parse("<test>Test>Data</test>");
// Don't use this if it could result in any ]]> sequences!
string s = doc.ToString().Replace("&gt;", ">");
System.IO.File.WriteAllText(@"c:\path\test.xml", s);
In consideration that any spec-compliant XML parser must support &gt;, I'd highly recommend fixing the code that is processing the XML output of your program instead.
When loading XML into an XmlDocument, i.e.
XmlDocument document = new XmlDocument();
document.LoadXml(xmlData);
is there any way to stop the process from replacing entities? I've got a strange problem where a TM symbol (stored as the entity &#8482;) in the XML is being converted into the ™ character. As far as I'm concerned this shouldn't happen, as the XML document has the encoding ISO-8859-1 (which doesn't have the ™ symbol).
Thanks
This is a standard misunderstanding of the XML toolset. The whole business with "&#x...;" is a syntactic feature designed to cope with character encodings. Your XmlDocument isn't a stream of characters; it has been freed of character encoding issues and instead contains an abstract model of XML data. Words for this include DOM and Infoset; I'm not sure exactly which is accurate.
The "&#x" gubbins won't exist in this model because the whole issue is irrelevant, it will return - if appropriate - when you transform the Info Set back into a character stream in some specific encoding.
This misunderstanding is sufficiently common to have made it into academic literature as part of a collection of similar quirks. Take a look at "Xml Fever" at this location: http://doi.acm.org/10.1145/1364782.1364795
What are you writing it to? A TextWriter? a Stream? what?
The following keeps the entity (well, it replaces it with the hex equivalent) - but if you do the same with a StringWriter it detects the unicode and uses that instead:
XmlDocument doc = new XmlDocument();
doc.LoadXml(@"<xml>&#8482;</xml>");
using (MemoryStream ms = new MemoryStream())
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.GetEncoding("ISO-8859-1");
XmlWriter xw = XmlWriter.Create(ms, settings);
doc.Save(xw);
xw.Close();
Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()));
}
Outputs:
<?xml version="1.0" encoding="iso-8859-1"?><xml>&#x2122;</xml>
I confess things get a little confusing with XML documents and encodings, but I'd hope that it would get set appropriate when you save it again, if you're still using ISO-8859-1 - but that if you save with UTF-8, it wouldn't need to. In some ways, logically the document really contains the symbol rather the entity reference - the latter is just an encoding matter. (I'm thinking aloud here - please don't take this as authoritative information.)
What are you doing with the document after loading it?
I believe if you enclose the entity text in a CDATA section it should be left alone, e.g.
<root>
<testnode>
<![CDATA[some text &#8482;]]>
</testnode>
</root>
Entity references are not encoding specific. According to the W3C XML 1.0 Recommendation:
If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646.
The &#xxxx; entities are considered to be the characters they represent. All XML is converted to Unicode on reading, and any such entities are removed in favor of the Unicode character they represent. This includes any occurrence of them in Unicode source, such as the string passed to LoadXml.
Similarly, on writing, any character that cannot be represented by the stream being written to is converted to a &#xxxx; entity. There is little point trying to preserve them.
A common mistake is to expect to get a String from a DOM by some means that uses an encoding other than Unicode. That just doesn't happen, regardless of what encoding the original document used.
Thanks for all of the help.
I've fixed my problem by writing an HtmlEncode function which replaces all of the offending characters before it spits them out to the webpage (instead of relying on the somewhat limited HtmlEncode() .NET function, which only seems to encode a small subset of the characters necessary).
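For reference, the kind of function described might look like this sketch (the name is illustrative): it escapes everything outside printable ASCII, plus the markup-significant characters, as numeric references.

```csharp
using System.Text;

static class PageEncoding
{
    // Escape every non-ASCII character and the markup-significant
    // characters as numeric character references (&#dddd;).
    public static string HtmlEncodeAll(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (c > 127 || c == '<' || c == '>' || c == '&' || c == '"')
                sb.Append("&#").Append((int)c).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

As noted earlier in the thread, this keeps the page correct in any encoding at the cost of making the generated source harder to read.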