Reading an XML document containing the hex 0x19 character in C#
Apparently the above character is invalid in XML, and the XML document I am about to load contains it.
Therefore it throws an XML exception every time I try to read it.
<description><![CDATA[Whenever I run<br>Whenever I run to you lost one<br>Its never done<br>Just hanging on<br><br>Just past has let me be<br>Returning as if dream<br>Shattered as belief<br><br>If you have to go dont say goodbye<br>If you have to go dont you cry<br>If you have to go I will get by<br>Someday Ill follow you and see you on the other side<br><br>But for the grace of love<br>Id will the meaning of<br>Heaven from above<br><br>Your picture out of time<br>Left aching in my mind<br>Shadows kept alive<br><br>If you have to go dont say goodbye<br>If you have to go dont you cry<br>If you have to go I will get by<br>I will follow you and see you on the other side<br><br>But for the grace of love<br>Id will the meaning of<br>Heaven from above<br><br>Long horses we are born<br>Creatures more than torn<br>Mourning our way home]]></description>
Apparently the above line contains that character
Just had the same problem. 0x19 is a control-character (End of Medium) which is pretty much invisible with text editors. What I did to parse the XML anyway was to replace this char with a space before parsing, like this:
var xmlString = GetXmlString();
xmlString = xmlString.Replace((char)0x19, ' ');
var doc = new XmlDocument();
doc.LoadXml(xmlString);
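The same idea can be extended to every character that XML 1.0 forbids, not only 0x19. The following is a minimal sketch, assuming the whole document fits in a string. The names `XmlSanitizer` and `StripInvalidXmlChars` are made up for illustration, and the ranges are the `Char` production from section 2.2 of the XML 1.0 recommendation.

```csharp
using System;
using System.Text;

public static class XmlSanitizer
{
    // True for UTF-16 code units that XML 1.0 allows on their own:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    private static bool IsLegalXmlChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Returns a copy of text with every character that is illegal in XML 1.0 removed.
    public static string StripInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // A valid surrogate pair encodes U+10000..U+10FFFF, which XML allows.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (IsLegalXmlChar(c))
            {
                sb.Append(c);
            }
            // Anything else (control characters, lone surrogates, U+FFFE/U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

With this helper, `doc.LoadXml(XmlSanitizer.StripInvalidXmlChars(xmlString))` would take the place of the single `Replace` call. Unlike the replace-with-space version, it drops the character, so neighbouring words are joined.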
In my website, admin uploads a .docx file. I convert the file into xml using OpenXmlPowerTools Api.
The issue is the document has some bullets in it.
• This is my bullet 1 in the document.
• This is my bullet 2 in the document.
XElement html = OpenXmlPowerTools.HtmlConverter.ConvertToHtml(wDoc, settings);
var htmlString = html.ToString();
File.WriteAllText(destFileName.FullName, htmlString, Encoding.UTF8);
Now when I open the xml file, it renders the bullets as below:-
I need to read each node of the XML, save it in the database, and reconstruct the HTML from the nodes.
Please don't ask me why so, as I am not the boss of the system.
How do I get the bullets render correctly in xml so that I can save the right
html in the database?
I fixed the same issue for my own requirement, and it has worked without problems so far.
In a case like this you will always need a workaround: copy the offending character, compare it against your input strings, and if it is found replace it with the equivalent encoded character. In your case that is the bullet character, "&bull;" or "&#8226;". Note that "&bull;" is an HTML entity and is only valid in XML if it is declared, whereas "&#8226;" works in both.
The code should look like:
listItem == "Compare with your copied character like one in your pic" ? "&#8226;" : listItem
you can find more equivalent characters at this link:
http://www.zytrax.com/tech/web/entities.html
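Rather than comparing against one copied character, the replacement can be generalised: every character outside ASCII is written as a numeric character reference, so the bullet becomes `&#8226;` regardless of the encoding the file is later read with. This is only a sketch under that assumption. `NonAsciiEncoder` and `ToNumericReferences` are hypothetical names, and the helper is meant for text and attribute values, since element names cannot contain character references.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NonAsciiEncoder
{
    // Replaces every non-ASCII code point with a decimal character reference (&#NNNN;).
    public static string ToNumericReferences(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(text, i))
            {
                // Combine the two UTF-16 code units into one code point.
                codePoint = char.ConvertToUtf32(text, i);
                i++;
            }
            else if (char.IsSurrogate(text[i]))
            {
                continue; // lone surrogate: not a valid character, skip it
            }
            else
            {
                codePoint = text[i];
            }

            if (codePoint < 128)
            {
                sb.Append((char)codePoint);
            }
            else
            {
                sb.Append("&#")
                  .Append(codePoint.ToString(CultureInfo.InvariantCulture))
                  .Append(';');
            }
        }
        return sb.ToString();
    }
}
```

For example, `NonAsciiEncoder.ToNumericReferences("• This is my bullet 1")` yields a string that starts with `&#8226;`.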
Hey I don't think XML can read bullets. I'll advise you programmatically handle it. Try and debug and see what the square is being represented as and then do an if statement to find it and replace it with a code you can define so that when you return it to use it you can convert that code if found to a bullet.
I am parsing an XML file into a DataSet. I get an error if the XML data contains "&" or other special characters. How can I remove them?
How do I remove "&" from the tag below?
xml
< department departmentid=1 name="pen & Note" >
string departmentpath = HostingEnvironment.MapPath("~/App_Data/Department.xml");
DataSet departmentDS = new DataSet();
System.IO.FileStream dpReadXml = new System.IO.FileStream(departmentpath, System.IO.FileMode.Open);
try
{
departmentDS.ReadXml(dpReadXml);
}
catch (Exception ex)
{
//logg
}
You can replace it with &amp;
The culture of XML is that the person who creates the XML is responsible for delivering well-formed XML that conforms to the spec; the recipient is expected to reject it if they get it wrong. So by trying to repair bad XML and turn it into good XML you are going against the grain. It's like getting served bad food in a restaurant: you should complain, rather than asking the people at the next table how to make it digestible.
The input you've provided has a lot more wrong with it than the ampersands. It's hardly recognizable as XML at all. You're never going to turn this mess into a robust data flow.
The code seems to be C#. But do add the correct language tag!
There are five special characters that often require escaping within XML documents. You can read this SO question.
There are two possibilities:
Let your DataSet.ReadXml method handle these special characters [Recommended]
Change all special characters in your input files [Not recommended]
The second method is not recommended, since you cannot always control the incoming data (and you would probably be wasting time pre-processing it even if you could). For ReadXml to parse the special characters properly, they must be escaped in the input, and the input XML must also declare a proper encoding.
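To illustrate what "handled by the parser" means in practice: when the XML is produced through an XML API, the special characters are escaped on the way out and restored on the way in, so no manual clean-up is needed. The sketch below uses LINQ to XML rather than `DataSet`. The element and attribute names come from the question, while the class name `EscapingRoundTrip` is made up.

```csharp
using System;
using System.Xml.Linq;

public static class EscapingRoundTrip
{
    // Builds the element through the API, so '&', '<', '"' etc. are escaped for us.
    public static string Serialize(string departmentName)
    {
        var element = new XElement("department",
            new XAttribute("departmentid", 1),
            new XAttribute("name", departmentName));
        return element.ToString();
    }

    // Parses the XML back and returns the unescaped attribute value.
    public static string ReadName(string xml)
    {
        return (string)XElement.Parse(xml).Attribute("name");
    }
}
```

Serializing `"pen & Note"` produces an attribute value containing `pen &amp; Note`, and reading it back returns the original `pen & Note`.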
I have an XML file with invalid characters. I searched the internet and haven't found any way other than reading the file as a text file and replacing the invalid characters one by one.
Can somebody please tell me the easiest way to remove invalid characters from an XML file?
ex xml stream:
<Year>where 12 > 13 occures </Year>
I would try HtmlAgilityPack. At least better than trying to parse manually.
HtmlAgilityPack.HtmlDocument hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml("<Year>where 12 > 13 occures </Year>");
using(StringWriter wr = new StringWriter())
{
using (XmlWriter xmlWriter = XmlWriter.Create(wr,
new XmlWriterSettings() { OmitXmlDeclaration = true }))
{
hdoc.Save(xmlWriter);
Console.WriteLine(wr.ToString());
}
}
this outputs:
<year>where 12 > 13 occures </year>
Start by thinking of the question differently. Your problem is that the input isn't valid XML. So you actually want to remove invalid characters from a non-XML file. That might sound pedantic, but it immediately indicates that tools designed for processing XML will be no use to you, because your input is not XML.
Fixing the problem at source is always better than trying to repair the damage later. But if you are going to embark on a repair strategy, the first thing is to define precisely what faults in the data you want to repair and how you intend to repair them. It's also a good idea to say clearly what constraints you apply to the solution: for example, does it matter if your repair accidentally changes the contents of any comments or CDATA sections?
Once you have defined your repair strategy, e.g. "replace any & by &amp; if it is not immediately followed by either #nn; or #xnn; or a name followed by ';'", coding it up becomes quite straightforward.
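That example strategy might be coded roughly as follows. It is a sketch, not a full repair tool: `AmpersandRepair` is a hypothetical name, the name pattern is a simplified ASCII-only approximation of an XML name, and the regex does not protect comments or CDATA sections, which is exactly the constraint mentioned above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmpersandRepair
{
    // Matches '&' unless it starts "#nn;", "#xnn;" or "name;".
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_:][A-Za-z0-9_.:\-]*);)");

    // Escapes only the ampersands that do not already begin a reference.
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }
}
```

References to undeclared entities are left untouched, so they still need a declaration or a separate replacement step before the file parses.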
I am new to XML and I am now trying to read an XML file.
I googled and tried this way of reading the XML, but I get this error.
Reference to undeclared entity 'Ccaron'. Line 2902, position 9.
When I go to line 2902 I got this,
<H0742>&Ccaron;opova 14, POB 1725,
SI-1000 Ljubljana</H0742>
This is the way I try
XmlDocument xDoc = new XmlDocument();
xDoc.Load(file);
XmlNodeList nodes = xDoc.SelectNodes("nodeName");
foreach (XmlNode n in nodes)
{
if (n.SelectSingleNode("H0742") != null)
{
row.IrNbr = n.SelectSingleNode("H0742").InnerText;
}
.
.
.
}
When I look at w3school, & is illegal in xml.
EDIT :
This is the encoding. I wonder if it's related to the problem somehow.
encoding='iso-8859-1'
Thanks in advance.
EDIT :
They gave me an .ENT file and I can reference online ftp.MyPartnerCompany.com/name.ent.
In this .ENT file
I see entities like that
<!ENTITY Cacute "Ć"> <!-- latin capital letter C with acute,
U+0106 Latin Extended-A -->
How can I reference it in my xml Parsing ?
I prefer to reference online since they may add new anytime.
Thanks in advance !!!
The first thing to be aware of is that the problem isn't in your software.
As you are new to XML, I'm going to guess that defining entities isn't something you've come across before. Character entities are shortcuts for arbitrary pieces of text (one or more characters). The most common place you are going to see them is in the situation you are in now. At some point, your XML has been created by someone who wanted to type the character 'Č' or 'č' (that's upper and lower case C with caron, if your font can't display it).
However, in XML we only have a few predeclared entities (ampersand, less than, greater than, double quote and apostrophe). Any other character entities need to be declared. In order to parse your file correctly you will need to do one of two things: either replace the character entity with something that doesn't cause the parser issues, or declare the entity.
To declare the entity, you can use something called an "internal subset" - a specialised form of the DTD statement you might see at the top of your XML file. Something like this:
<!DOCTYPE root-element
[ <!ENTITY Ccaron "&#x010C;">
<!ENTITY ccaron "&#x010D;">]
>
Placing that statement at the beginning of the XML file (change the 'root-element' to match yours) will allow the parser to resolve the entity.
Alternatively, simply change the &Ccaron; to &#x010C; and your problem will also be resolved.
The &# notation is a numeric character reference, giving the Unicode value of the character (the 'x' indicates that it's in hex).
You could always just type the character too but that requires knowledge of the ins and outs of your keyboard and region.
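In C#, the internal-subset approach could look roughly like this. It is a sketch under a few assumptions: the document is held in a string, `root` and the sample content stand in for the real file, and `EntityDeclarationDemo` is a made-up name. DTD processing is off by default for readers created with `XmlReader.Create`, so it has to be enabled explicitly, and that should only be done for input you trust.

```csharp
using System;
using System.IO;
using System.Xml;

public static class EntityDeclarationDemo
{
    public static string ReadAddress()
    {
        // Internal subset declaring the entities, followed by a tiny document
        // that uses one of them. "root" and the content are illustrative only.
        const string xml =
            "<!DOCTYPE root [\n" +
            "  <!ENTITY Ccaron \"&#x010C;\">\n" +
            "  <!ENTITY ccaron \"&#x010D;\">\n" +
            "]>\n" +
            "<root><H0742>&Ccaron;opova 14, POB 1725, SI-1000 Ljubljana</H0742></root>";

        var settings = new XmlReaderSettings
        {
            // The default is DtdProcessing.Prohibit, which rejects any DOCTYPE.
            DtdProcessing = DtdProcessing.Parse
        };

        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            var doc = new XmlDocument();
            doc.Load(reader);
            // The reader expands &Ccaron; using the declaration above.
            return doc.SelectSingleNode("/root/H0742").InnerText;
        }
    }
}
```

The returned text begins with "Čopova 14", showing that the parser resolved the entity from the declaration instead of raising the "undeclared entity" error.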
&Ccaron; isn't XML; it's not even defined in the HTML 4 entity reference (which, by the way, isn't XML either). XML doesn't support all those entities; in fact, it supports very few of them. But if you look up the entity and find it, you can use its Unicode equivalent instead. E.g. &Scaron; is invalid XML but &#352; isn't. (Scaron was the closest I could find to Ccaron.)
Your XML file isn't well-formed and, so, can't be used as XmlDocument. Period.
You have two options:
Open that file as a regular text file and fix that symptom.
Fix your XML generator, because that's your real problem. That generator isn't producing the file with System.Xml, but probably by concatenating several strings, as if "XML is just a text file". You should repair it, or opening a generated XML file will always be a surprise.
EDIT: As you can't fix your XML generator, I recommend opening the file with File.ReadAllText and running a regular expression to re-encode that & or to strip off the entire entity (as we can't translate it):
// Requires: using System; using System.Text.RegularExpressions;
Console.WriteLine(
    Regex.Replace("<H0742>&Ccaron;opova 14, &#123; POB & SI-1000 &</H0742>",
        // (?!#) leaves numeric references such as &#123; untouched
        @"&(?!#)(\S*?;)?", match =>
        {
            switch (match.Value)
            {
                case "&lt;":
                case "&gt;":
                case "&amp;":
                case "&quot;":
                case "&apos;":
                    return match.Value; // correctly encoded
                case "&":
                    return "&amp;";
                default: // here you can choose:
                    // to remove the entire entity:
                    return "";
                    // or to just encode its & character:
                    // return "&amp;" + match.Value.Substring(1);
            }
        }));
&Ccaron; is an entity reference. It is likely that the entity reference is intended to stand for the character Č, in order to produce: Čopova.
However, that entity must be declared, or the XML parser will not know what should be substituted for the entity reference as it parses the XML.
solution :-
byte[] encodedString = Encoding.UTF8.GetBytes(xml);
// Put the byte array into a stream and rewind it to the beginning
MemoryStream ms = new MemoryStream(encodedString);
ms.Flush();
ms.Position = 0;
// Build the XmlDocument from the MemoryStream of UTF-8 encoded bytes
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(ms);
I am using DataSet.ReadXml() to read an XML string. I get an error because the XML string contains the invalid character 0x1F, which is 'US' (Unit Separator). It is contained within fully formed tags.
The data is extracted from an Oracle DB using a Perl script. What would be the best way to escape this character so that the XML is read correctly?
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
The US separator sits between the C and the h in the bold part; when pasted here it shows up as a space. How can I ignore it in an XML string?
If you look at section 2.2 of the XML recommendation, you'll see that 0x1F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why doesn't the Perl script that produces this string fail when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
Your XmlReader/TextReader must be created with the correct encoding. You can create it as below and pass it to your DataSet:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);