Reading XML file with Invalid character

Reading XML file with Invalid character - c#

I am using DataSet.ReadXml() to read an XML string. I get an error because the XML string contains the invalid character 0x1F, which is 'US' (unit separator). The character appears inside otherwise well-formed tags.
The data is extracted from an Oracle DB using a Perl script. What is the best way to escape this character so that the XML is read correctly?
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
The US separator sits between the C and the h in the bold part; when pasted here it shows up as a space. How can I ignore it in an XML string?

If you look at section 2.2 of the XML recommendation, you'll see that 0x1F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why doesn't the Perl script that produces this string fail when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
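The preprocessing step suggested above (discarding every character that is not legal in XML 1.0) might look roughly like the following Python sketch. The function name strip_illegal_xml_chars and the choice of a space as the replacement are assumptions made for illustration; the sample string imitates the question's data with 0x1F between the C and the h.

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 does not allow (section 2.2 of the recommendation):
# every C0 control except tab (0x09), LF (0x0A) and CR (0x0D),
# plus the noncharacters U+FFFE and U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_illegal_xml_chars(text, replacement=""):
    """Return text with every XML-1.0-illegal character replaced."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


# Sample imitating the question: a unit separator inside STUDY_NAME.
raw = "<RESULT><STUDY_NAME>7360C\x1fhsd</STUDY_NAME></RESULT>"
clean = strip_illegal_xml_chars(raw, " ")
root = ET.fromstring(clean)
print(root.findtext("STUDY_NAME"))
```

This only treats the symptom; it does not explain why the control character is in the data in the first place.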

Your XmlReader/TextReader must be created with the correct encoding. You can create it as below and pass it to your DataSet:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);

Related

Is there a way to have an XmlReader preserve a character reference as text rather than converting it?

I'm using an xml reader to parse some xml and I'm wondering if I can have it read in a character entity reference as straight text rather than converting it to the actual character. So if I called ReadInnerXml() on the node:
<param name="id">don&apos;t convert this</param>
I would get "don&apos;t convert this" as opposed to what I'm currently getting, which is "don't convert this". This is necessary because any characters or character entity references should be handed back the way they came, as they are legacy content.
Any help appreciated!

No, I don't know of any XML parser that has this feature. The job of an XML parser is to parse the input, and that's what it will do.
If you can't fix the consumer of this process to handle XML properly, your best bet might be to preprocess the text by replacing & by (say) § so it doesn't mean anything special to the XML parser.
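A minimal sketch of that placeholder trick in Python, assuming the section sign never occurs in the real input (the function raises an error if it does). The function name is hypothetical, and the sketch only handles the direct text of a single element.

```python
import xml.etree.ElementTree as ET

# The section sign; assumed never to occur in the real input.
PLACEHOLDER = "\u00a7"


def inner_text_preserving_references(xml_text):
    """Parse one element and return its text with references left unexpanded."""
    if PLACEHOLDER in xml_text:
        raise ValueError("placeholder already present in input; pick another one")
    # Hide every ampersand so the parser sees no references to expand.
    masked = xml_text.replace("&", PLACEHOLDER)
    element = ET.fromstring(masked)
    # Put the ampersands back in the text handed to the caller.
    return (element.text or "").replace(PLACEHOLDER, "&")


result = inner_text_preserving_references(
    '<param name="id">don&apos;t convert this</param>'
)
print(result)
```

Because every ampersand is masked, the parser also stops rejecting stray ampersands, which may or may not be what you want for legacy content.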

Convert string to valid XML [duplicate]

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
@jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
@jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by @chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible as what appears to be predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &amp;: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.
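As a hedged illustration, the ampersand expression above can be applied in Python like this. The function name escape_bare_ampersands is made up for the example, and the re.IGNORECASE flag is an assumption so that uppercase hex digits in references such as &#x2A; are also recognized.

```python
import re

# The pattern from the answer: an ampersand that does not start a numeric
# or named reference. IGNORECASE lets the hex class match A-F as well.
BARE_AMPERSAND = re.compile(r"&(?!(?:#\d+|#x[0-9a-f]+|\w+);)", re.IGNORECASE)


def escape_bare_ampersands(text):
    """Escape only ampersands that are not already part of a reference."""
    return BARE_AMPERSAND.sub("&amp;", text)


print(escape_bare_ampersands('name="pen & Note"'))
print(escape_bare_ampersands("a &amp; b &#38; c &#x26; d"))
```

Anything shaped like a named reference (for example &foo;) is left alone even if the entity is undefined, so a parser may still reject the result.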

A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
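The CDATA idea could be sketched in Python as follows. It assumes the leaf element is known in advance, is never nested inside itself, and carries no attributes. The function name wrap_in_cdata and the doc root element are hypothetical, chosen for the example.

```python
import re
import xml.etree.ElementTree as ET


def wrap_in_cdata(xml_text, tag):
    """Wrap the raw content of every <tag>...</tag> pair in a CDATA section."""
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)

    def _wrap(match):
        # A literal "]]>" would end the CDATA section early, so split it
        # across two adjacent CDATA sections.
        body = match.group(1).replace("]]>", "]]]]><![CDATA[>")
        return "<{0}><![CDATA[{1}]]></{0}>".format(tag, body)

    return pattern.sub(_wrap, xml_text)


broken = (
    "<doc><description>Example:Description:"
    "<THIS-IS-PART-OF-DESCRIPTION></description></doc>"
)
fixed = wrap_in_cdata(broken, "description")
root = ET.fromstring(fixed)
print(root.findtext("description"))
```

Note that content which was already correctly escaped (for example &lt;) would come back literally once wrapped, so this only suits input that is known to be unescaped.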

The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use e.g. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.

IMO these cases should be solved by using JSoup.
Below is not really an answer for this specific case, but I found it on the web (thanks to inuyasha82 on Coderwall). This bit of code inspired me on another, similar problem while dealing with malformed XML, so I share it here.
Please do not edit what is below, as it is taken from the original website.
To be well-formed, an XML document must have a single root element.
So for example a well-formed XML document is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
This example shows how to work around that problem and successfully parse the malformed XML above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three elements:
A ByteArrayInputStream that contains the string: <root>
Our FileInputStream
A ByteArrayInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(streams));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);
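For comparison, here is a rough Python counterpart of the same idea. Instead of chaining streams, it feeds a synthetic opening tag, the rootless content, and a closing tag to an incremental parser. The function name parse_rootless and the root tag name are assumptions, and the input is assumed to be a text stream without an XML declaration.

```python
import io
import xml.etree.ElementTree as ET


def parse_rootless(stream, root_tag="root", chunk_size=65536):
    """Parse a text stream of sibling elements by supplying a synthetic root."""
    parser = ET.XMLParser()
    parser.feed("<{}>".format(root_tag))       # synthetic opening tag
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)                     # the rootless content itself
    parser.feed("</{}>".format(root_tag))      # synthetic closing tag
    return parser.close()                      # returns the synthetic root element


fragments = io.StringIO(
    "<element>a</element>\n<element>b</element>\n<element>c</element>"
)
root = parse_rootless(fragments)
print([child.text for child in root.findall("element")])
```

Feeding in chunks keeps memory use flat for large files, which is the same motivation as using SequenceInputStream rather than concatenating strings.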

How to write '&' in xml?

I am using xmlTextWriter to create the xml.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now I need to write '&', and XmlTextWriter automatically writes it as "&amp;".
Is there any workaround for this?
I am creating the XML by reading a doc file. If I read "-" then in the XML I need to write "&ndash;", but it is written as "&amp;ndash;".
So, for example, if I am trying to write a node with the text good-bad, I actually need to write my XML as <node>good&ndash;bad</node>. This is a requirement of my project.

In a proper XML file, you cannot have a standalone & character unless it is an escape character. So if you need an XML node to contain good&ndash;bad, then it will have to be encoded as good&amp;ndash;bad. There is no workaround, as anything different would not be valid XML. The only way to make it work is to write the XML file as plain text the way you want it, but then it could not be read by an XML parser, as it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using(var sw = new StreamWriter(stream))
{
// other code to write XML-like data
sw.WriteLine("<node>good&ndash;bad</node>");
// other code to write XML-like data
}
As you discovered, another option is the WriteRaw() method on XmlTextWriter (in C#), which writes an unencoded string, but that does not change the fact that the result will not be a valid XML file.
As I mentioned, if you tried to read this with an XML parser, it would fail: &ndash; is not a predefined XML character entity, so the document is not valid XML.
&ndash; is an HTML character entity, so escaping it in XML should not normally be necessary.
In the XML language, & is the escape character, so &amp; is the appropriate string representation of &. You cannot use a bare & character because it has a special meaning and would therefore be misinterpreted by the parser.
You will see similar behavior with the <, >, ", and ' characters. All have meaning within the XML language, so they must be escaped if you need to represent them as text in a document.
Here's a reference to all of the character entities in XML (and HTML) from Wikipedia. Each will always be represented by the escape character and the name (&gt;, &lt;, &quot;, &apos;).
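To make the escaping behavior concrete, here is a small Python sketch using only the standard library. It shows a serializer escaping & on output, a parser rejecting the HTML-only name &ndash; when no DTD defines it, and the numeric reference &#8211; being accepted instead. The helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A serializer escapes the ampersand for you; the en dash needs no entity.
node = ET.Element("node")
node.text = "good\u2013bad & ugly"
as_text = ET.tostring(node, encoding="unicode")
as_ascii = ET.tostring(node)  # default ASCII output uses a numeric reference


def parses(xml_text):
    """Return True if the standard parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(as_text)
print(as_ascii)
print(parses("<node>good&ndash;bad</node>"))   # named HTML entity, no DTD
print(parses("<node>good&#8211;bad</node>"))   # numeric character reference
print(escape("pen & Note"))
```

If the consumer really needs the dash expressed as a reference, the numeric form is the one that XML parsers accept without a DTD.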

In XML, & must be escaped as &amp;. The & character is reserved for entities and thus not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Any other software reading the XML has to decode the entities again. Other examples are &lt; for < and &gt; for >. Some other markup languages, such as HTML, define many more of these.

I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)

Handling \x01 received from Flash's ExternalInterface

I'm receiving data from a Flash component embedded in a Windows Form. Unfortunately, if the data returned from the socket contains any of the following characters, the call to loadXml below fails:
This is the callback method I have to receive data from the socket (via ExternalInterface in the Flash component).
private void player_FlashCall(object sender, _IShockwaveFlashEvents_FlashCallEvent e)
{
String output = e.request;
//output = CleanInvalidXmlChars(output);
XmlDocument document = new XmlDocument();
document.LoadXml(output);
XmlAttributeCollection attributes = document.FirstChild.Attributes;
String command = attributes.Item(0).InnerText;
XmlNodeList list = document.GetElementsByTagName("arguments");
process(list[0].InnerText);
}
I had a method to replace the characters with text (CleanInvalidXmlChars), but I don't think this is the right approach.
How can I load this data into an XML document? Doing so makes it very easy to separate the method name, parameter names and parameter types that are returned.
Would appreciate any help at all.
Thanks.

If the “XML” contains any U+0001 (aka '\x01') or other similar characters, it is not valid XML. There is no way you can include those characters in XML (well, in XML 1.0, anyway). See the XML specification. If you need to pass e.g. binary data in XML, you need to convert it to a proper form, e.g. using Base-64.
If the data does contain those invalid characters, it is not XML, and therefore cannot be read using standard XML tools (I don’t think any of the standard .NET classes allows you to override that behavior). You can replace all those characters prior to use, as you tried. The characters in question are basically all control characters (U+0000 through U+001F) except U+0009 (tab), U+000A (LF) and U+000D (CR), plus the noncharacters U+FFFE and U+FFFF. You could devise a safe transformation which would not lose any data: first replace all # characters with #0023, then replace any invalid character with #xxxx where xxxx is its hexadecimal code, and when processing the parsed XML data, replace all #xxxx back.
Another option is to drop the XML idea and just process it as a string. Just for inspiration, see e.g. this piece of code.
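The reversible escaping scheme described above could be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the incoming string contains no numeric character references (such as &#38;), since the # replacement would corrupt them.

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 forbids: C0 controls except tab, LF and CR,
# plus the noncharacters U+FFFE and U+FFFF.
_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")
_ESCAPED = re.compile(r"#([0-9A-F]{4})")


def encode_invalid(text):
    """Escape '#' first, then turn each invalid character into #xxxx."""
    text = text.replace("#", "#0023")
    return _INVALID.sub(lambda m: "#{:04X}".format(ord(m.group())), text)


def decode_invalid(text):
    """Reverse encode_invalid on text taken from the parsed document."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), text)


payload = "ret#1\x01val"
document = "<arguments>" + encode_invalid(payload) + "</arguments>"
parsed_text = ET.fromstring(document).text
print(parsed_text)
print(decode_invalid(parsed_text) == payload)
```

Escaping # before anything else is what makes the decoding unambiguous: after encoding, every # in the text is followed by exactly four hex digits.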

How to remove xml "&" symbol

I am parsing an XML file into a DataSet. I get an error if the XML data contains "&" or some other special character. How can I remove it?
How do I remove "&" from the tag below?
xml
< department departmentid=1 name="pen & Note" >
string departmentpath = HostingEnvironment.MapPath("~/App_Data/Department.xml");
DataSet departmentDS = new DataSet();
System.IO.FileStream dpReadXml = new System.IO.FileStream(departmentpath, System.IO.FileMode.Open);
try
{
departmentDS.ReadXml(dpReadXml);
}
catch (Exception ex)
{
//logg
}

You can replace it with &amp;

The culture of XML is that the person who creates the XML is responsible for delivering well-formed XML that conforms to the spec; the recipient is expected to reject it if they get it wrong. So by trying to repair bad XML and turn it into good XML you are going against the grain. It's like getting served bad food in a restaurant: you should complain, rather than asking the people at the next table how to make it digestible.
The input you've provided has a lot more wrong with it than the ampersands. It's hardly recognizable as XML at all. You're never going to turn this mess into a robust data flow.

The code seems to be C#. But do add the correct language tag!
There are five special characters that often require escaping within XML documents. You can read this SO question.
There are two possibilities:
Let your DataSet::ReadXML method handle these special characters [Recommended]
Change all special characters from your input files [Not recommended]
The second method is not recommended since you cannot possibly always control the incoming data (and you probably would be wasting time pre-processing them if you do want to). In order for ReadXML to properly parse the special characters you will need to define a proper encoding too in your input XML.
