xml and & issue - c#

I am new to XML and I am now trying to read an xml file.
I googled and try this way to read xml but I get this error.
Reference to undeclared entity 'Ccaron'. Line 2902, position 9.
When I go to line 2902 I got this,
<H0742>&Ccaron;opova 14, POB 1725,
SI-1000 Ljubljana</H0742>
This is the way I try
XmlDocument xDoc = new XmlDocument();
xDoc.Load(file);
XmlNodeList nodes = xDoc.SelectNodes("nodeName");
foreach (XmlNode n in nodes)
{
if (n.SelectSingleNode("H0742") != null)
{
row.IrNbr = n.SelectSingleNode("H0742").InnerText;
}
.
.
.
}
When I look at w3school, & is illegal in xml.
EDIT :
This is the encoding. I wonder it's related with xml somehow.
encoding='iso-8859-1'
Thanks in advance.
EDIT :
They gave me an .ENT file and I can reference online ftp.MyPartnerCompany.com/name.ent.
In this .ENT file
I see entities like that
<!ENTITY Cacute "Ć"> <!-- latin capital letter C with acute,
U+0106 Latin Extended-A -->
How can I reference it in my xml Parsing ?
I prefer to reference online since they may add new anytime.
Thanks in advance !!!

The first thing to be aware of is that the problem isn't in your software.
As you are new to XML, I'm going to guess that definining entities isn't something you've come across before. Character entities are shortcuts for arbitrary pieces of text (one or more characters). The most common place you are going to see them is in the situation you are in now. At some point, your XML has been created by someone who wanted to type the character 'Č' or 'č' (that's upper and lower case C with Caron if your font can't display it).
However, in XML we only have a few predeclared entities (ampersand, less than, greater than, double quote and apostraphe). Any other character entities need to be declared. In order to parse your file correctly you will need to do one of two things - either replace the character entity with something that doesn't cause the parser issues or declare the entity.
To declare the entity, you can use something called an "internal subset" - a specialised form of the DTD statement you might see at the top of your XML file. Something like this:
<!DOCTYPE root-element
[ <!ENTITY Ccaron "Č">
<!ENTITY ccaron "č">]
>
Placing that statement at the beginning of the XML file (change the 'root-element' to match yours) will allow the parser to resolve the entity.
Alternatively, simply change the &Ccaron; to Č and your problem will also be resolved.
The &# notation is a numeric entity, giving appropriate unicode value for the character (the 'x' indicates that it's in hex).
You could always just type the character too but that requires knowledge of the ins and outs of your keyboard and region.

&Ccaron; isn't XML it's not even defined in the HTML 4 entity reference. Which btw isn't XML. XML doesn't support all those entities, in fact, it supports very few of them but if you look up the entity and find it, you'll be able to use it's Unicode equivalent, which you can use. e.g. Š is invalid XML but Š isn't. (Scaron was the closest I could find to Ccaron).

Your XML file isn't well-formed and, so, can't be used as XmlDocument. Period.
You have two options:
Open that file as a regular text file and fixed that symptom.
Fix your XML generator, and that's your real problem. That generator isn't generating that file using System.Xml, but probably concatening several strings, as "XML is just a text file". You should repair it, or opening a generated XML file will be always a surprise.
EDIT: As you can't fix your XML generator, I recommend to open it with File.ReadAllText and execute an regular expression to re-encode that & or to strip off entire entity (as we can't translate it)
Console.WriteLine(
Regex.Replace("<H0742>&Ccaron;opova 14, { POB & SI-1000 &</H0742>",
#"&((?!#)\S*?;)?", match =>
{
switch (match.Value)
{
case "<":
case ">":
case "&":
case """:
case "&apos;":
return match.Value; // correctly encoded
case "&":
return "&";
default: // here you can choose:
// to remove entire entity:
return "";
// or just encode that & character
return "&" + match.Value.Substring(1);
}
}));

&Ccaron; is an entity reference. It is likely that the entity reference is intended to be for the character Č, in order to produce: Čopova.
However, that entity must be declared, or the XML parser will not know what should be substituted for the entity reference as it parses the XML.

solution :-
byte[] encodedString = Encoding.UTF8.GetBytes(xml);
// Put the byte array into a stream and rewind it to the beginning
MemoryStream ms = new MemoryStream(encodedString);
ms.Flush();
ms.Position = 0;
// Build the XmlDocument from the MemorySteam of UTF-8 encoded bytes
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(ms);

Related

How to write '&' in xml?

I am using xmlTextWriter to create the xml.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now i need to write '&' but xmlTextWriter will automatically write this one as "&amp";
So is there any work around to do this?
I am creating xml by reading the doc file.So if I read "-" then in xml i need to write "&ndash";.So while writing it's written as "&amp";ndash.
So, for example, if I am trying to write a node with the text good-bad, I actually need to write my XML such as <node>good–bad</node>. This is a requirement of my project.
In a proper XML file, you cannot have a standalone & character unless it is an escape character. So if you need an XML node to contain good–bad, then it will have to be encoded as good&ndash;bad. There is no workaround as anything different would not be valid XML. The only way to make it work is to just write the XML file as a plain text how you want it, but then it could not be read by an XML parser as it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using(var sw = new StreamWriter(stream))
{
// other code to write XML-like data
sw.WriteLine("<node>good–bad</node>");
// other code to write XML-like data
}
As you discovered, another option is to use the WriteRaw() method on XmlTextWriter (in C#) will write an unencoded string, but it does not change the fact it is not going to be a valid XML file when it is done.
But as I mentioned, if you tried to read this with an XML Parser, it would fail because &ndash is not a valid XML character entity so it is not valid XML.
– is an HTML character entity, so escaping it in an XML should not normally be necessary.
In the XML language, & is the escape character, so & is appropriate string representation of &. You cannot use just a & character because the & character has a special meaning and therefore a single & character would be misinterpreted by the parser/
You will see similar behavior with the <, >, ", and' characters. All have meaning within the XML language so if you need to represent them as text in a document.
Here's a reference to all of the character entities in XML (and HTML) from Wikipedia. Each will always be represented by the escape character and the name (>, <, ", &apos;)
In XML, & must be escaped as &. The & character is reserved for entities and thus not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Another software reading the XML has to decode the entity again. < for < and > for > or other examples, some other languages like HTML which are based on XML provide even more of these.
I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)

How to remove xml "&" symbol

i am parasing xml file to dataset.I getting error if xml data contain "&" or some special char how to remove that?
How to remove "&" from below tag?
xml
< department departmentid=1 name="pen & Note" >
$
string departmentpath = HostingEnvironment.MapPath("~/App_Data/Department.xml");
DataSet departmentDS = new DataSet();
System.IO.FileStream dpReadXml = new System.IO.FileStream(departmentpath, System.IO.FileMode.Open);
try
{
departmentDS.ReadXml(dpReadXml);
}
catch (Exception ex)
{
//logg
}
You can replace it with &
The culture of XML is that the person who creates the XML is responsible for delivering well-formed XML that conforms to the spec; the recipient is expected to reject it if they get it wrong. So by trying to repair bad XML and turn it into good XML you are going against the grain. It's like getting served bad food in a restaurant: you should complain, rather than asking the people at the next table how to make it digestible.
The input you've provided has a lot more wrong with it than the ampersands. It's hardly recognizable as XML at all. You're never going to turn this mess into a robust data flow.
The code seems to be C#. But do add the correct language tag!
There are five special characters that often require escaping within XML documents. You can read this SO question.
There are two possibilities:
Let your DataSet::ReadXML method handle these special characters [Recommended]
Change all special characters from your input files [Not recommended]
The second method is not recommended since you cannot possibly always control the incoming data (and you probably would be wasting time pre-processing them if you do want to). In order for ReadXML to properly parse the special characters you will need to define a proper encoding too in your input XML.

Reading XML file with Invalid character

I am using Dataset.ReadXML() to read an XML string. I get an error as the XML string contains the Invalid Character 0x1F which is 'US' - Unit seperator. This is contained within fully formed tags.
The data is extracted from an Oracle DB, using a Perl script. How would be the best way to escape this character so that the XML is read correctly.
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
Is between the C and h in the bold part, is where there is a US seperator, which when pasted into this actually shows a space. So I want to know how can I ignore that in an XML string?
If you look at section 2.2 of the XML recommendation, you'll see that x01F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why is it doesn't the Perl script that produces this string failing when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
Your XmlReader/TextReader must be created with correct encoding. You can create it as below and pass to your Dataaset:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);

Ascii to XML Character set conversion

Is there any classes to convert ascii to xml characterset preferably opensource i will be using this class either in vc++ or C#
My ascii has some printable characters which is not there in xml character set
i just tried to sen a resume which is in ascii character set and i tried to store it in a online crm and i got this error message
javax.xml.bind.UnmarshalException
- with linked exception:
[javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,22]
Message: Character reference "&#x13" is an invalid XML character.]
Thanks in advance
I had the same problem with Excel using the OpenXML document creation in C#.
My Excel Export feature would blow-up when building a doc with a bad ASCII character.
Somehow the string data, in my company's database, has funky characters in it.
Even though I used the Microsoft DocumentFormat.OpenXML assembly from their OpenXML SDK 2.0, it still didn't take care of this when assigning string values using their objects.
The Fix:
t.Text = Regex.Replace(sValue, #"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x19]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]", "?");
This cleans up the sValue string by removing the offending characters and replacing them with a question mark. You could replace with any string or just use an empty string.
The XML Spec Allows 0x09 (TAB), 0x0A (LF - Line Feed or NL - New Line), and 0x0D (CR - Carriage Return). The RegEx above takes care not remove those.
The XML 1.1 Spec allows you to escape some of these characters.
For example: Using  for 0x03 appears as  in HTML and as L in Office documents and notepad.
I use Asp.net and this is automatically taken care of in my GridView, so I do not need to replace these values - but I believe it may be the browser that takes care of it for all I know.
I thought of escaping these values in OpenXML, but when I looked at the output, it showed the excape markup. So MikeTeeVee still shows up as MikeTeeVee in Excel instead of something like MikeTeeVee, or MikeLTeeVee. This is why I preferred the Mike?TeeVee approach.
My hunch is this is a bug in the current OpenXML which encodes the allowed XML ASCII characters, but allows the unsupported ASCII characters to slip on through.
UPDATE:
I forgot I could look up how these characters are displayed using the "Open XML SDK 2.0 Productivity Tool" to see inside docs like Excel.
There I found it uses the format: _x0000_
Remember: XML 1.0 does not support escaping these values, but XML 1.1 does, so if you're using 1.1, then you can use this code to escape them.
Regular XML 1.1 Escaping:
t.Text = Regex.Replace(s, #"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x19]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
delegate(Match m)
{
return (byte)(m.Value[0]) == 0 //0x00 is not Supported in 1.0 or 1.1
? ""
: ("&#x" + string.Format("{0:00}", (byte)(m.Value[0])) + ";");
});
If you're escaping strings for OpenXML, then use this instead:
t.Text = Regex.Replace(s, #"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x19]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
delegate(Match m)
{
return (byte)(m.Value[0]) == 0 //0x00 is not Supported in 1.0 or 1.1
? ""
: ("_x" + string.Format("{0:0000}", (byte)(m.Value[0])) + "_");
});
Your text won't have any printable characters which aren't available in XML - but it may have some unprintable characters which aren't available in XML.
In particular, Unicode values U+0000 to U+001F are invalid except for tab. carriage return and line feed. If you really need those other control characters, you'll have to create your own form of escaping for them, and unescape them at the other end.
The character reference &#x13 is indeed not a valid XML character. You probably want either &#xD or &#13.
Out of curiousity, I took a few minutes to write a simple routinein C# to pump out a XML string of the 128 ASCII characters, to my surprise, .NET didn't output a really valid XML document. I guess the way I output the element text wasn't quite right. Anyway here is the code (comments are welcomed):
XmlDocument doc = new XmlDocument();
doc.AppendChild(doc.CreateXmlDeclaration("1.0", "us-ascii", ""));
XmlElement elem = doc.CreateElement("ASCII");
doc.AppendChild(elem);
byte[] b = new byte[1];
for (int i = 0; i < 128; i++)
{
b[0] = Convert.ToByte(i);
XmlElement e = doc.CreateElement("ASCII_" + i.ToString().PadLeft(3,'0'));
e.InnerText = System.Text.ASCIIEncoding.ASCII.GetString(b);
elem.AppendChild(e);
}
Console.WriteLine(doc.OuterXml);
Here is the formatted output:
<?xml version="1.0" encoding="us-ascii" ?>
<ASCII>
<ASCII_000></ASCII_000>
<ASCII_001></ASCII_001>
<ASCII_002></ASCII_002>
<ASCII_003></ASCII_003>
<ASCII_004></ASCII_004>
<ASCII_005></ASCII_005>
<ASCII_006></ASCII_006>
<ASCII_007></ASCII_007>
<ASCII_008></ASCII_008>
<ASCII_009> </ASCII_009>
<ASCII_010>
</ASCII_010>
<ASCII_011></ASCII_011>
<ASCII_012></ASCII_012>
<ASCII_013>
</ASCII_013>
<ASCII_014></ASCII_014>
<ASCII_015></ASCII_015>
<ASCII_016></ASCII_016>
<ASCII_017></ASCII_017>
<ASCII_018></ASCII_018>
<ASCII_019></ASCII_019>
<ASCII_020></ASCII_020>
<ASCII_021></ASCII_021>
<ASCII_022></ASCII_022>
<ASCII_023></ASCII_023>
<ASCII_024></ASCII_024>
<ASCII_025></ASCII_025>
<ASCII_026></ASCII_026>
<ASCII_027></ASCII_027>
<ASCII_028></ASCII_028>
<ASCII_029></ASCII_029>
<ASCII_030></ASCII_030>
<ASCII_031></ASCII_031>
<ASCII_032> </ASCII_032>
<ASCII_033>!</ASCII_033>
<ASCII_034>"</ASCII_034>
<ASCII_035>#</ASCII_035>
<ASCII_036>$</ASCII_036>
<ASCII_037>%</ASCII_037>
<ASCII_038>&</ASCII_038>
<ASCII_039>'</ASCII_039>
<ASCII_040>(</ASCII_040>
<ASCII_041>)</ASCII_041>
<ASCII_042>*</ASCII_042>
<ASCII_043>+</ASCII_043>
<ASCII_044>,</ASCII_044>
<ASCII_045>-</ASCII_045>
<ASCII_046>.</ASCII_046>
<ASCII_047>/</ASCII_047>
<ASCII_048>0</ASCII_048>
<ASCII_049>1</ASCII_049>
<ASCII_050>2</ASCII_050>
<ASCII_051>3</ASCII_051>
<ASCII_052>4</ASCII_052>
<ASCII_053>5</ASCII_053>
<ASCII_054>6</ASCII_054>
<ASCII_055>7</ASCII_055>
<ASCII_056>8</ASCII_056>
<ASCII_057>9</ASCII_057>
<ASCII_058>:</ASCII_058>
<ASCII_059>;</ASCII_059>
<ASCII_060><</ASCII_060>
<ASCII_061>=</ASCII_061>
<ASCII_062>></ASCII_062>
<ASCII_063>?</ASCII_063>
<ASCII_064>#</ASCII_064>
<ASCII_065>A</ASCII_065>
<ASCII_066>B</ASCII_066>
<ASCII_067>C</ASCII_067>
<ASCII_068>D</ASCII_068>
<ASCII_069>E</ASCII_069>
<ASCII_070>F</ASCII_070>
<ASCII_071>G</ASCII_071>
<ASCII_072>H</ASCII_072>
<ASCII_073>I</ASCII_073>
<ASCII_074>J</ASCII_074>
<ASCII_075>K</ASCII_075>
<ASCII_076>L</ASCII_076>
<ASCII_077>M</ASCII_077>
<ASCII_078>N</ASCII_078>
<ASCII_079>O</ASCII_079>
<ASCII_080>P</ASCII_080>
<ASCII_081>Q</ASCII_081>
<ASCII_082>R</ASCII_082>
<ASCII_083>S</ASCII_083>
<ASCII_084>T</ASCII_084>
<ASCII_085>U</ASCII_085>
<ASCII_086>V</ASCII_086>
<ASCII_087>W</ASCII_087>
<ASCII_088>X</ASCII_088>
<ASCII_089>Y</ASCII_089>
<ASCII_090>Z</ASCII_090>
<ASCII_091>[</ASCII_091>
<ASCII_092>\</ASCII_092>
<ASCII_093>]</ASCII_093>
<ASCII_094>^</ASCII_094>
<ASCII_095>_</ASCII_095>
<ASCII_096>`</ASCII_096>
<ASCII_097>a</ASCII_097>
<ASCII_098>b</ASCII_098>
<ASCII_099>c</ASCII_099>
<ASCII_100>d</ASCII_100>
<ASCII_101>e</ASCII_101>
<ASCII_102>f</ASCII_102>
<ASCII_103>g</ASCII_103>
<ASCII_104>h</ASCII_104>
<ASCII_105>i</ASCII_105>
<ASCII_106>j</ASCII_106>
<ASCII_107>k</ASCII_107>
<ASCII_108>l</ASCII_108>
<ASCII_109>m</ASCII_109>
<ASCII_110>n</ASCII_110>
<ASCII_111>o</ASCII_111>
<ASCII_112>p</ASCII_112>
<ASCII_113>q</ASCII_113>
<ASCII_114>r</ASCII_114>
<ASCII_115>s</ASCII_115>
<ASCII_116>t</ASCII_116>
<ASCII_117>u</ASCII_117>
<ASCII_118>v</ASCII_118>
<ASCII_119>w</ASCII_119>
<ASCII_120>x</ASCII_120>
<ASCII_121>y</ASCII_121>
<ASCII_122>z</ASCII_122>
<ASCII_123>{</ASCII_123>
<ASCII_124>|</ASCII_124>
<ASCII_125>}</ASCII_125>
<ASCII_126>~</ASCII_126>
<ASCII_127></ASCII_127>
</ASCII>
Update:
Added XML decalration with "us-ascii" encoding
Possibly you don't fully understand what a character set is. XML is not a character set, though XML based output does use character sets to encode data.
I'd recommend reading through Joel Spolsky's excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), then come back and have another go at your question.
You won't need an additional library to do that. From different encodings to embedded binary data, all of that is possible through the common .net library. Can you just give a simple example?

.NET XmlDocument LoadXML and Entities

When loading XML into an XmlDocument, i.e.
XmlDocument document = new XmlDocument();
document.LoadXml(xmlData);
is there any way to stop the process from replacing entities? I've got a strange problem where I've got a TM symbol (stored as the entity #8482) in the xml being converted into the TM character. As far as I'm concerned this shouldn't happen as the XML document has the encoding ISO-8859-1 (which doesn't have the TM symbol)
Thanks
This is a standard misunderstanding of the XML toolset. The whole business with "&#x", is a syntactic feature designed to cope with character encodings. Your XmlDocument isn't a stream of characters - it has been freed of character encoding issues - instead it contains an abstract model of XML type data. Words for this include DOM and InfoSet, I'm not sure exactly which is accurate.
The "&#x" gubbins won't exist in this model because the whole issue is irrelevant, it will return - if appropriate - when you transform the Info Set back into a character stream in some specific encoding.
This misunderstanding is sufficiently common to have made it into academic literature as part of a collection of similar quirks. Take a look at "Xml Fever" at this location: http://doi.acm.org/10.1145/1364782.1364795
What are you writing it to? A TextWriter? a Stream? what?
The following keeps the entity (well, it replaces it with the hex equivalent) - but if you do the same with a StringWriter it detects the unicode and uses that instead:
XmlDocument doc = new XmlDocument();
doc.LoadXml(#"<xml>™</xml>");
using (MemoryStream ms = new MemoryStream())
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.GetEncoding("ISO-8859-1");
XmlWriter xw = XmlWriter.Create(ms, settings);
doc.Save(xw);
xw.Close();
Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()));
}
Outputs:
<?xml version="1.0" encoding="iso-8859-1"?><xml>™</xml>
I confess things get a little confusing with XML documents and encodings, but I'd hope that it would get set appropriate when you save it again, if you're still using ISO-8859-1 - but that if you save with UTF-8, it wouldn't need to. In some ways, logically the document really contains the symbol rather the entity reference - the latter is just an encoding matter. (I'm thinking aloud here - please don't take this as authoritative information.)
What are you doing with the document after loading it?
I beleive if you enclose the entity contents in the CDATA section it should leave it all alone e.g.
<root>
<testnode>
<![CDATA[some text ™]]>
</testnode>
</root>
Entity references are not encoding specific. According to the W3C XML 1.0 Recommendation:
If the character reference begins with
"&#x", the digits and letters up to
the terminating ; provide a
hexadecimal representation of the
character's code point in ISO/IEC
10646.
The &#xxxx; entities are considered to be the character they represent. All XML is converted to unicode on reading and any such entities are removed in favor of the unicode character they represent. This includes any occurance for them in unicode source such as the string passed to LoadXML.
Similarly on writing any character that cannot be represented by the stream being written to is converted to a &#xxxx; entity. There is little point trying to preserve them.
A common mistake is expect to get a String from a DOM by some means that uses an encoding other then unicode. That just doesn't happen regardless of what the
Thanks for all of the help.
I've fixed my problem by writing a HtmlEncode function which actually replaces all of the characters before it spits them out to the webpage (instead of relying on the somewhat broken HtmlEncode() .NET function which only seems to encode a small subset of the characters necessary)

Categories

Resources