Ascii to XML Character set conversion - c#

Are there any classes to convert ASCII to the XML character set, preferably open source? I will be using the class in either VC++ or C#.
My ASCII has some printable characters which are not in the XML character set.
I just tried to send a resume which is in the ASCII character set, and when I tried to store it in an online CRM I got this error message:
javax.xml.bind.UnmarshalException
- with linked exception:
[javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,22]
Message: Character reference "&#x13" is an invalid XML character.]
Thanks in advance

I had the same problem with Excel using the OpenXML document creation in C#.
My Excel Export feature would blow-up when building a doc with a bad ASCII character.
Somehow the string data, in my company's database, has funky characters in it.
Even though I used the Microsoft DocumentFormat.OpenXML assembly from their OpenXML SDK 2.0, it still didn't take care of this when assigning string values using their objects.
The Fix:
t.Text = Regex.Replace(sValue, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]", "?");
This cleans up the sValue string by removing the offending characters and replacing them with a question mark. You could replace with any string or just use an empty string.
The XML spec allows 0x09 (TAB), 0x0A (LF - Line Feed or NL - New Line), and 0x0D (CR - Carriage Return). The regex above takes care not to remove those.
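As an alternative to a hand-written regex, the same rule can be sketched with the framework's own character check. This assumes .NET 4.0 or later, where XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair are available; the XmlSanitizer class and its Sanitize method are hypothetical names used only for illustration. Unlike the regex, this sketch keeps valid surrogate pairs intact.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Hypothetical helper: replaces every character that XML 1.0 forbids
    // with the given replacement string (pass "" to drop it instead).
    public static string Sanitize(string input, string replacement = "?")
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (XmlConvert.IsXmlChar(c))
            {
                // Ordinary legal character (includes TAB, LF and CR).
                sb.Append(c);
            }
            else if (i + 1 < input.Length && XmlConvert.IsXmlSurrogatePair(input[i + 1], c))
            {
                // A valid high/low surrogate pair: keep both halves together.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else
            {
                // Control character, lone surrogate, U+FFFE or U+FFFF.
                sb.Append(replacement);
            }
        }
        return sb.ToString();
    }
}
```

Usage would look like t.Text = XmlSanitizer.Sanitize(sValue); the replacement argument plays the same role as the "?" in the regex version.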
The XML 1.1 Spec allows you to escape some of these characters.
For example: using &#x03; for 0x03 appears as an unprintable character in HTML and as L in Office documents and Notepad.
I use Asp.net and this is automatically taken care of in my GridView, so I do not need to replace these values - but I believe it may be the browser that takes care of it for all I know.
I thought of escaping these values in OpenXML, but when I looked at the output, it showed the escape markup. So Mike&#x03;TeeVee still shows up as Mike&#x03;TeeVee in Excel instead of something like MikeTeeVee or MikeLTeeVee. This is why I preferred the Mike?TeeVee approach.
My hunch is this is a bug in the current OpenXML which encodes the allowed XML ASCII characters, but allows the unsupported ASCII characters to slip on through.
UPDATE:
I forgot I could look up how these characters are displayed using the "Open XML SDK 2.0 Productivity Tool" to see inside docs like Excel.
There I found it uses the format: _x0000_
Remember: XML 1.0 does not support escaping these values, but XML 1.1 does, so if you're using 1.1, then you can use this code to escape them.
Regular XML 1.1 Escaping:
t.Text = Regex.Replace(s, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
    delegate(Match m)
    {
        int code = m.Value[0];
        // 0x00 is not supported in 1.0 or 1.1; surrogates and U+FFFE/U+FFFF cannot be escaped either
        return (code == 0 || code > 0x1F)
            ? ""
            : ("&#x" + code.ToString("X2") + ";"); // hex digits, to match the &#x prefix
    });
If you're escaping strings for OpenXML, then use this instead:
t.Text = Regex.Replace(s, @"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x1F]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
    delegate(Match m)
    {
        int code = m.Value[0];
        // 0x00 is not supported in 1.0 or 1.1; surrogates and U+FFFE/U+FFFF are dropped as well
        return (code == 0 || code > 0x1F)
            ? ""
            : ("_x" + code.ToString("X4") + "_"); // four hex digits, e.g. _x0013_
    });

Your text won't have any printable characters which aren't available in XML - but it may have some unprintable characters which aren't available in XML.
In particular, Unicode values U+0000 to U+001F are invalid except for tab, carriage return and line feed. If you really need those other control characters, you'll have to create your own form of escaping for them, and unescape them at the other end.
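One simple form of "your own escaping" is Base64, which turns any text into characters that are always legal in XML. The sketch below is only an illustration: the ControlCharCodec class and its method names are made up, and it assumes both ends agree that the element content is Base64-encoded UTF-8.

```csharp
using System;
using System.Text;

public static class ControlCharCodec
{
    // Sender side: produce XML-safe text from a string that may contain control characters.
    public static string Encode(string raw)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(raw));
    }

    // Receiver side: recover the original string, control characters included.
    public static string Decode(string encoded)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
    }
}
```

The cost is that the text is no longer human-readable inside the XML, so this fits best when the control characters genuinely have to survive the round trip.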

The character reference &#x13; is indeed not a valid XML character. You probably want either &#xD; or &#13;.

Out of curiosity, I took a few minutes to write a simple routine in C# to pump out an XML string of the 128 ASCII characters. To my surprise, .NET didn't output a really valid XML document. I guess the way I output the element text wasn't quite right. Anyway, here is the code (comments are welcome):
XmlDocument doc = new XmlDocument();
doc.AppendChild(doc.CreateXmlDeclaration("1.0", "us-ascii", ""));
XmlElement elem = doc.CreateElement("ASCII");
doc.AppendChild(elem);
byte[] b = new byte[1];
for (int i = 0; i < 128; i++)
{
b[0] = Convert.ToByte(i);
XmlElement e = doc.CreateElement("ASCII_" + i.ToString().PadLeft(3,'0'));
e.InnerText = System.Text.ASCIIEncoding.ASCII.GetString(b);
elem.AppendChild(e);
}
Console.WriteLine(doc.OuterXml);
Here is the formatted output:
<?xml version="1.0" encoding="us-ascii" ?>
<ASCII>
<ASCII_000></ASCII_000>
<ASCII_001></ASCII_001>
<ASCII_002></ASCII_002>
<ASCII_003></ASCII_003>
<ASCII_004></ASCII_004>
<ASCII_005></ASCII_005>
<ASCII_006></ASCII_006>
<ASCII_007></ASCII_007>
<ASCII_008></ASCII_008>
<ASCII_009> </ASCII_009>
<ASCII_010>
</ASCII_010>
<ASCII_011></ASCII_011>
<ASCII_012></ASCII_012>
<ASCII_013>
</ASCII_013>
<ASCII_014></ASCII_014>
<ASCII_015></ASCII_015>
<ASCII_016></ASCII_016>
<ASCII_017></ASCII_017>
<ASCII_018></ASCII_018>
<ASCII_019></ASCII_019>
<ASCII_020></ASCII_020>
<ASCII_021></ASCII_021>
<ASCII_022></ASCII_022>
<ASCII_023></ASCII_023>
<ASCII_024></ASCII_024>
<ASCII_025></ASCII_025>
<ASCII_026></ASCII_026>
<ASCII_027></ASCII_027>
<ASCII_028></ASCII_028>
<ASCII_029></ASCII_029>
<ASCII_030></ASCII_030>
<ASCII_031></ASCII_031>
<ASCII_032> </ASCII_032>
<ASCII_033>!</ASCII_033>
<ASCII_034>"</ASCII_034>
<ASCII_035>#</ASCII_035>
<ASCII_036>$</ASCII_036>
<ASCII_037>%</ASCII_037>
<ASCII_038>&amp;</ASCII_038>
<ASCII_039>'</ASCII_039>
<ASCII_040>(</ASCII_040>
<ASCII_041>)</ASCII_041>
<ASCII_042>*</ASCII_042>
<ASCII_043>+</ASCII_043>
<ASCII_044>,</ASCII_044>
<ASCII_045>-</ASCII_045>
<ASCII_046>.</ASCII_046>
<ASCII_047>/</ASCII_047>
<ASCII_048>0</ASCII_048>
<ASCII_049>1</ASCII_049>
<ASCII_050>2</ASCII_050>
<ASCII_051>3</ASCII_051>
<ASCII_052>4</ASCII_052>
<ASCII_053>5</ASCII_053>
<ASCII_054>6</ASCII_054>
<ASCII_055>7</ASCII_055>
<ASCII_056>8</ASCII_056>
<ASCII_057>9</ASCII_057>
<ASCII_058>:</ASCII_058>
<ASCII_059>;</ASCII_059>
<ASCII_060>&lt;</ASCII_060>
<ASCII_061>=</ASCII_061>
<ASCII_062>&gt;</ASCII_062>
<ASCII_063>?</ASCII_063>
<ASCII_064>@</ASCII_064>
<ASCII_065>A</ASCII_065>
<ASCII_066>B</ASCII_066>
<ASCII_067>C</ASCII_067>
<ASCII_068>D</ASCII_068>
<ASCII_069>E</ASCII_069>
<ASCII_070>F</ASCII_070>
<ASCII_071>G</ASCII_071>
<ASCII_072>H</ASCII_072>
<ASCII_073>I</ASCII_073>
<ASCII_074>J</ASCII_074>
<ASCII_075>K</ASCII_075>
<ASCII_076>L</ASCII_076>
<ASCII_077>M</ASCII_077>
<ASCII_078>N</ASCII_078>
<ASCII_079>O</ASCII_079>
<ASCII_080>P</ASCII_080>
<ASCII_081>Q</ASCII_081>
<ASCII_082>R</ASCII_082>
<ASCII_083>S</ASCII_083>
<ASCII_084>T</ASCII_084>
<ASCII_085>U</ASCII_085>
<ASCII_086>V</ASCII_086>
<ASCII_087>W</ASCII_087>
<ASCII_088>X</ASCII_088>
<ASCII_089>Y</ASCII_089>
<ASCII_090>Z</ASCII_090>
<ASCII_091>[</ASCII_091>
<ASCII_092>\</ASCII_092>
<ASCII_093>]</ASCII_093>
<ASCII_094>^</ASCII_094>
<ASCII_095>_</ASCII_095>
<ASCII_096>`</ASCII_096>
<ASCII_097>a</ASCII_097>
<ASCII_098>b</ASCII_098>
<ASCII_099>c</ASCII_099>
<ASCII_100>d</ASCII_100>
<ASCII_101>e</ASCII_101>
<ASCII_102>f</ASCII_102>
<ASCII_103>g</ASCII_103>
<ASCII_104>h</ASCII_104>
<ASCII_105>i</ASCII_105>
<ASCII_106>j</ASCII_106>
<ASCII_107>k</ASCII_107>
<ASCII_108>l</ASCII_108>
<ASCII_109>m</ASCII_109>
<ASCII_110>n</ASCII_110>
<ASCII_111>o</ASCII_111>
<ASCII_112>p</ASCII_112>
<ASCII_113>q</ASCII_113>
<ASCII_114>r</ASCII_114>
<ASCII_115>s</ASCII_115>
<ASCII_116>t</ASCII_116>
<ASCII_117>u</ASCII_117>
<ASCII_118>v</ASCII_118>
<ASCII_119>w</ASCII_119>
<ASCII_120>x</ASCII_120>
<ASCII_121>y</ASCII_121>
<ASCII_122>z</ASCII_122>
<ASCII_123>{</ASCII_123>
<ASCII_124>|</ASCII_124>
<ASCII_125>}</ASCII_125>
<ASCII_126>~</ASCII_126>
<ASCII_127></ASCII_127>
</ASCII>
Update:
Added XML declaration with "us-ascii" encoding
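To see which of the 128 ASCII codes make that output invalid, a small check along these lines may help. It is only a sketch: it assumes .NET 4.0 or later for XmlConvert.IsXmlChar, and the AsciiXmlCheck class name is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AsciiXmlCheck
{
    // Returns the ASCII codes (0-127) that are not legal XML 1.0 characters.
    public static List<int> InvalidAsciiCodes()
    {
        var invalid = new List<int>();
        for (int i = 0; i < 128; i++)
        {
            if (!XmlConvert.IsXmlChar((char)i))
            {
                invalid.Add(i);
            }
        }
        return invalid;
    }
}
```

Per the XML 1.0 character rules, the codes reported should be the control characters other than tab, line feed and carriage return, which are exactly the elements XmlDocument wrote without complaint.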

Possibly you don't fully understand what a character set is. XML is not a character set, though XML based output does use character sets to encode data.
I'd recommend reading through Joel Spolsky's excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), then come back and have another go at your question.

You won't need an additional library to do that. From different encodings to embedded binary data, all of that is possible through the common .net library. Can you just give a simple example?

Related

Parsing XML which contains illegal characters

A message I receive from a server contains tags and in the tags is the data I need.
I try to parse the payload as XML but illegal character exceptions are generated.
I also used HttpUtility and Security Utility to escape the illegal characters; the only problem is that they also escape < and >, which are needed to parse the XML.
My question is: how do I parse XML when the data it contains includes illegal, unescaped characters (e.g. & that should be &amp;)?
Thanks.
Example:
<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>
If & is the only invalid character, you can use a regex to replace it with &amp;. The regex prevents replacement of already existing &amp;, &quot;, &#111;, etc. symbols.
Regex can be as follows:
&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)
Sample code:
string content = @"<item><code>1234 & test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&amp;", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);
Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.
When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.
If you can't do that, you need to find some way using non-XML tools (e.g. custom perl scripts) to repair the XML before you let it anywhere near an XML parser. The way you do this will depend on the nature of the errors you need to repair.
Here is a more generalized solution than regex. First declare an array holding each invalid character that you want to replace with its encoded version:
var invalidChars = new[] { '&' /* other chars go here */ };
Then read all the xml as a whole text:
var xmlContent = File.ReadAllText("path");
Then replace the invalid chars using LINQ and HttpUtility.HtmlEncode:
var validContent = string.Concat(xmlContent
.Select(x =>
{
if (invalidChars.Contains(x)) return HttpUtility.HtmlEncode(x);
return x.ToString();
}));
Then parse it using XDocument.Parse, that's all.

How to write '&' in xml?

I am using xmlTextWriter to create the xml.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now I need to write '&', and XmlTextWriter automatically writes it as "&amp;".
So is there any workaround for this?
I am creating the XML by reading a doc file. So if I read "-" then in the XML I need to write "&ndash;". But when writing, it comes out as "&amp;ndash;".
So, for example, if I am trying to write a node with the text good-bad, I actually need to write my XML as <node>good&ndash;bad</node>. This is a requirement of my project.
In a proper XML file, you cannot have a standalone & character unless it is an escape character. So if you need an XML node to contain good&ndash;bad, then it will have to be encoded as good&amp;ndash;bad. There is no workaround, as anything different would not be valid XML. The only way to make it work is to write the XML file as plain text exactly how you want it, but then it could not be read by an XML parser, as it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using(var sw = new StreamWriter(stream))
{
// other code to write XML-like data
sw.WriteLine("<node>good&ndash;bad</node>");
// other code to write XML-like data
}
As you discovered, another option is the WriteRaw() method on XmlTextWriter (in C#), which writes an unencoded string, but that does not change the fact that the result is not going to be a valid XML file.
But as I mentioned, if you tried to read this with an XML parser, it would fail because &ndash; is not a valid XML character entity, so it is not valid XML.
&ndash; is an HTML character entity, so escaping it in XML should not normally be necessary.
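If the consumer only needs the dash to arrive as an escaped reference, and not specifically as the named HTML entity, a numeric character reference keeps the file valid XML. The sketch below uses XmlWriter.WriteCharEntity for that; the DashDemo class is a made-up name, and whether a numeric reference satisfies the project requirement is an assumption.

```csharp
using System;
using System.IO;
using System.Xml;

public static class DashDemo
{
    // Builds <node>good[en dash]bad</node> with the dash written as a numeric character reference.
    public static string Build()
    {
        var sw = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var writer = XmlWriter.Create(sw, settings))
        {
            writer.WriteStartElement("node");
            writer.WriteString("good");
            writer.WriteCharEntity('\u2013'); // U+2013 EN DASH, emitted as a hex character reference
            writer.WriteString("bad");
            writer.WriteEndElement();
        }
        return sw.ToString();
    }
}
```

Any XML parser reads the result back as the text good–bad, with no DTD or entity declaration needed.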
In the XML language, & is the escape character, so &amp; is the appropriate string representation of &. You cannot use just a & character because it has a special meaning, and therefore a single & would be misinterpreted by the parser.
You will see similar behavior with the <, >, ", and ' characters. All have meaning within the XML language, so they must be escaped if you need to represent them as text in a document.
Here's a reference to all of the character entities in XML (and HTML) from Wikipedia. Each will always be represented by the escape character and the name (&gt;, &lt;, &quot;, &apos;).
In XML, & must be escaped as &amp;. The & character is reserved for entities and thus not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Any other software reading the XML has to decode the entity again. Other examples are &lt; for < and &gt; for >; related markup languages such as HTML provide even more of these.
I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)

KeyNotFoundException with using HtmlEntity.DeEntitize() method

I am currently working on a scraper written in C# 4.0. I use variety of tools, including the built-in WebClient and RegEx features of .NET. For a part of my scraper I am parsing a HTML document using HtmlAgilityPack. I got everything to work as I desired and went through some cleanup of the code.
I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I made a few tests and the method seemed to work great. But when I implemented the method in my code I kept getting KeyNotFoundException. There are no further details so I'm pretty lost. My code looks like this:
WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
The HTML downloaded is UTF-8 encoded. How can I get around the KeyNotFound exception?
I understand that the problem is due to occurrence of non-standard characters. Say, for example, Chinese, Japanese etc.
After you find out which characters are causing the problem, perhaps you could search for a suitable patch to HtmlAgilityPack here
This may be of some help to you in case you want to modify the htmlagilitypack source yourself.
Four years later and I have the same problem with some encoded characters (version 1.4.9.5). In my case, there is a limited set of characters that might generate the problem, so I have just created a function to perform the replacements:
// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
var sb = new StringBuilder(str);
//TODO: add other replacements, as needed
return sb.Replace("&period;", ".")
.Replace("&abreve;", "ă")
.Replace("&acirc;", "â")
.ToString();
}
In my case, the string contains both html-encoded characters and UTF-8 characters, but the problem is related to some encoded characters only.
This is not an elegant solution, but a quick fix for all those text with a limited (and known) amount of problematic encoded characters.
My HTML had a block of text like so:
... found in sections: 233.9 & 517.3; ...
Despite the spacing and decimal point, it was interpreting & 517.3; as a unicode character.
Simply HTML Encoding the raw text fixed the problem for me.
string raw = "sections: 233.9 & 517.3;";
// turn '&' into '&amp;', etc, before DeEntitizing
string encoded = System.Web.HttpUtility.HtmlEncode(raw);
string deEntitized = HtmlEntity.DeEntitize(encoded);
In my case I have fixed this by updating HtmlAgilityPack to version 1.5.0

xml and & issue

I am new to XML and I am now trying to read an xml file.
I googled and tried this way to read the XML, but I get this error.
Reference to undeclared entity 'Ccaron'. Line 2902, position 9.
When I go to line 2902 I got this,
<H0742>&Ccaron;opova 14, POB 1725,
SI-1000 Ljubljana</H0742>
This is the way I try
XmlDocument xDoc = new XmlDocument();
xDoc.Load(file);
XmlNodeList nodes = xDoc.SelectNodes("nodeName");
foreach (XmlNode n in nodes)
{
if (n.SelectSingleNode("H0742") != null)
{
row.IrNbr = n.SelectSingleNode("H0742").InnerText;
}
.
.
.
}
When I look at w3school, & is illegal in xml.
EDIT :
This is the encoding. I wonder it's related with xml somehow.
encoding='iso-8859-1'
Thanks in advance.
EDIT :
They gave me an .ENT file and I can reference online ftp.MyPartnerCompany.com/name.ent.
In this .ENT file
I see entities like that
<!ENTITY Cacute "Ć"> <!-- latin capital letter C with acute,
U+0106 Latin Extended-A -->
How can I reference it in my xml Parsing ?
I prefer to reference online since they may add new anytime.
Thanks in advance !!!
The first thing to be aware of is that the problem isn't in your software.
As you are new to XML, I'm going to guess that defining entities isn't something you've come across before. Character entities are shortcuts for arbitrary pieces of text (one or more characters). The most common place you are going to see them is in the situation you are in now. At some point, your XML has been created by someone who wanted to type the character 'Č' or 'č' (that's upper and lower case C with caron, if your font can't display it).
However, in XML we only have a few predeclared entities (ampersand, less than, greater than, double quote and apostrophe). Any other character entities need to be declared. In order to parse your file correctly you will need to do one of two things: either replace the character entity with something that doesn't cause the parser issues, or declare the entity.
To declare the entity, you can use something called an "internal subset" - a specialised form of the DTD statement you might see at the top of your XML file. Something like this:
<!DOCTYPE root-element
[ <!ENTITY Ccaron "&#x010C;">
<!ENTITY ccaron "&#x010D;">]
>
Placing that statement at the beginning of the XML file (change the 'root-element' to match yours) will allow the parser to resolve the entity.
Alternatively, simply change the &Ccaron; to &#x010C; and your problem will also be resolved.
The &# notation is a numeric character reference, giving the appropriate Unicode value for the character (the 'x' indicates that it's in hex).
You could always just type the character too but that requires knowledge of the ins and outs of your keyboard and region.
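For the C# side, here is a minimal sketch of parsing a document that declares the entity in an internal subset, as described above. It assumes .NET 4.0 or later (for XmlReaderSettings.DtdProcessing); the EntityDemo class, the root element name and the shortened sample text are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class EntityDemo
{
    // Parses a small document whose internal subset declares &Ccaron;
    // and returns the text of the H0742 element with the entity expanded.
    public static string ReadStreet()
    {
        const string xml =
            "<!DOCTYPE root [ <!ENTITY Ccaron \"&#x010C;\"> ]>" +
            "<root><H0742>&Ccaron;opova 14</H0742></root>";

        // DTD processing must be enabled explicitly on readers created this way.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            var doc = new XmlDocument();
            doc.Load(reader);
            return doc.SelectSingleNode("/root/H0742").InnerText;
        }
    }
}
```

Enabling DTD processing is reasonable for a file from a trusted partner, but it is usually left off for untrusted input.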
&Ccaron; isn't XML; it's not even defined in the HTML 4 entity reference (which, by the way, isn't XML either). XML doesn't support all those entities; in fact, it supports very few of them. But if you look up the entity and find it, you'll be able to use its Unicode equivalent instead. E.g. &Scaron; is invalid XML but &#352; isn't. (Scaron was the closest I could find to Ccaron.)
Your XML file isn't well-formed and, so, can't be used as XmlDocument. Period.
You have two options:
Open that file as a regular text file and fix that symptom.
Fix your XML generator, which is your real problem. That generator isn't producing the file using System.Xml, but is probably concatenating several strings, on the theory that "XML is just a text file". You should repair it, or opening a generated XML file will always be a surprise.
EDIT: As you can't fix your XML generator, I recommend opening the file with File.ReadAllText and running a regular expression to re-encode that & or to strip off the entire entity (as we can't translate it):
Console.WriteLine(
    Regex.Replace("<H0742>&Ccaron;opova 14, { POB & SI-1000 &</H0742>",
        @"&((?!#)\S*?;)?", match =>
        {
            switch (match.Value)
            {
                case "&lt;":
                case "&gt;":
                case "&amp;":
                case "&quot;":
                case "&apos;":
                    return match.Value; // correctly encoded
                case "&":
                    return "&amp;"; // bare ampersand: encode it
                default: // unknown entity; here you can choose:
                    // to remove the entire entity:
                    return "";
                    // or to just encode its & character:
                    // return "&amp;" + match.Value.Substring(1);
            }
        }));
&Ccaron; is an entity reference. It is likely that the entity reference is intended to be for the character Č, in order to produce: Čopova.
However, that entity must be declared, or the XML parser will not know what should be substituted for the entity reference as it parses the XML.
solution :-
byte[] encodedString = Encoding.UTF8.GetBytes(xml);
// Put the byte array into a stream and rewind it to the beginning
MemoryStream ms = new MemoryStream(encodedString);
ms.Flush();
ms.Position = 0;
// Build the XmlDocument from the MemoryStream of UTF-8 encoded bytes
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(ms);

Read a file with unicode characters

I have an ASP.NET C# page and am trying to read a file that has the following character ’ and convert it to '. (From slanted apostrophe to apostrophe.)
FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);
//strip out bad characters
content = content.Replace("’", "'");
This doesn't work and it changes the slanted apostrophes into ? marks.
I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you; however, examining content showed that the .NET Framework believed the character was Unicode character 65533, i.e. the "WTF?" character, before the string replacement. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:
content[0]; // 65533 '�'
The reason why the replace isn't working is simple - content doesn't contain the string you gave it:
content.IndexOf("’"); // -1
As for why the file reading isn't working properly - you are probably using the wrong encoding when reading the file. (If no encoding is specified then the .Net framework will try to determine the correct encoding for you, however there is no 100% reliable way to do this and so often it can get it wrong). The exact encoding you need depends on the file itself, however in my case the encoding being used was Extended ASCII, and so to read the file I just needed to specify the correct encoding:
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));
(See this question).
You also need to make sure that you specify the correct character in your replacement string - when using "odd" characters in code you may find it more reliable to specify the character by its character code, rather than as a string literal (which may cause problems if the encoding of the source file changes), for example the following worked for me:
content = content.Replace("\u0092", "'");
My bet is the file is encoded in Windows-1252. This is almost the same as ISO 8859-1. The difference is Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range". (Which is where the slanted apostrophe is located. i.e. 0x92)
//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");
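The difference between the two encodings can be checked directly on the byte in question. This is a small sketch with a made-up ApostropheDemo class; it assumes the .NET Framework, where code page 1252 is available out of the box (on .NET Core and later, Encoding.GetEncoding(1252) needs the code pages encoding provider registered first).

```csharp
using System;
using System.Text;

public static class ApostropheDemo
{
    // Decodes the single byte 0x92 with the given encoding and returns the resulting code point.
    public static int Decode0x92(Encoding encoding)
    {
        string s = encoding.GetString(new byte[] { 0x92 });
        return s[0];
    }
}
```

Calling ApostropheDemo.Decode0x92(Encoding.GetEncoding("iso-8859-1")) gives the control character U+0092, while ApostropheDemo.Decode0x92(Encoding.GetEncoding(1252)) gives U+2019, the slanted apostrophe. That is why replacing "\u0092" works after an ISO 8859-1 read and replacing "’" works after a Windows-1252 read.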
// This should replace smart single quotes with a straight single quote
content = Regex.Replace(content, @"(\u2018|\u2019)", "'");
// However, the better approach seems to be to read the page with the proper encoding and leave the quotes alone
var sreader = new StreamReader(fileinfo.OpenRead(), Encoding.GetEncoding(1252));
If you use String (capitalized) and not string, it should be able to handle any Unicode you throw at it. Try that first and see if that works.
