How to keep &lt; in Value when using XmlReader - c#

XmlReader.Read converts &lt; to <.
When reading the sample XML fragment &lt;add &gt;, XmlReader.NodeType is XmlNodeType.Text, but XmlReader.Value contains <add >.
How can I retain the original form, &lt;add &gt;?

You can't.
If the actual content of that element is the escaped text, you need to escape it once more in the XML, like this:
&amp;lt;add&amp;gt;
This will then be read as
&lt;add&gt;

I can only guess you want to subsequently use the text 'plain' in another XML or HTML context.
The right answer is:
use an XmlWriter/XElement.ToString down the line, or
properly HtmlEncode it
Sidebar: XML != Text, so don't treat it as such. Don't cut/paste fragments. You'll run into a brick wall with unparsed character data, different character sets, different encodings, repeated escaping, unbalanced escaping, etc.
The XmlReader is supposed to read the Xml and give you the content. No other way about it.
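A minimal sketch of the round trip described above, assuming the text is re-emitted through XElement. The class and method names (EscapeRoundTrip, ReadText, WriteText) and the element name "item" are hypothetical and chosen only for illustration.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: shows that the reader hands back decoded text
// and that an XML-aware writer escapes it again on output.
public static class EscapeRoundTrip
{
    // Returns the value of the first text node found in the XML string.
    public static string ReadText(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    return reader.Value; // entities are already decoded here
                }
            }
        }
        return null;
    }

    // Wraps the plain text in an element; XElement escapes markup characters.
    public static string WriteText(string text)
    {
        return new XElement("item", text).ToString();
    }
}
```

Reading "<item>&lt;add &gt;</item>" with ReadText gives the decoded string, and passing that string to WriteText produces markup in which the < is escaped again, so the escaped form never has to be preserved by hand.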

Related

How to write '&' in xml?

I am using xmlTextWriter to create the xml.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now I need to write '&', and XmlTextWriter automatically writes it as "&amp;".
So is there any workaround for this?
I am creating the XML by reading a doc file. So if I read "-", I need to write "&ndash;" in the XML, but it is written as "&amp;ndash;".
So, for example, if I am trying to write a node with the text good-bad, I actually need to write my XML as <node>good&ndash;bad</node>. This is a requirement of my project.
In a proper XML file, you cannot have a standalone & character; it is only allowed as the start of an escape sequence. So if you need an XML node to contain the literal text good&ndash;bad, it has to be encoded as good&amp;ndash;bad. There is no workaround, as anything different would not be valid XML. The only way to make it work is to write the file as plain text exactly how you want it, but then it cannot be read by an XML parser because it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using (var sw = new StreamWriter(stream))
{
    // other code to write XML-like data
    sw.WriteLine("<node>good&ndash;bad</node>");
    // other code to write XML-like data
}
As you discovered, another option is the WriteRaw() method on XmlTextWriter (in C#), which writes an unencoded string, but that does not change the fact that the result is not a valid XML file.
As I mentioned, if you tried to read this with an XML parser, it would fail, because &ndash; is not a predefined XML character entity and so the document is not valid XML.
&ndash; is an HTML character entity, so escaping it in XML should not normally be necessary.
In the XML language, & is the escape character, so &amp; is the appropriate string representation of &. You cannot use a bare & character because it has a special meaning, and a single & would therefore be misinterpreted by the parser.
You will see similar behavior with the <, >, ", and ' characters. All of them have meaning within the XML language, so they must be escaped if you need to represent them as text in a document.
Here's a reference to all of the character entities in XML (and HTML) from Wikipedia. Each is represented by the escape character followed by the name (&gt;, &lt;, &quot;, &apos;).
In XML, & must be escaped as &amp;. The & character is reserved for entities and is not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Any other software reading the XML has to decode the entity again. &lt; for < and &gt; for > are other examples; other markup languages such as HTML define many more of these.
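As a rough illustration of the escaping rules above, the sketch below writes a text node through XmlWriter. The class and method names (AmpersandDemo, WriteNode) are hypothetical, and writing the dash as a literal character instead of &ndash; is an assumption about what the consumer of the file accepts.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper: writes one <node> element containing the given text.
public static class AmpersandDemo
{
    public static string WriteNode(string text)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("node");
            writer.WriteString(text); // & becomes &amp;, other text is kept as-is
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}
```

Calling AmpersandDemo.WriteNode("good\u2013bad & more") yields a node in which the ampersand is escaped as &amp; while the en dash (U+2013) is written as an ordinary character, which any XML parser can read back without a named entity.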
I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)

Cleanup xml file - Invalid character in the given encoding

I am integrating against Magento ecommerce using their "SOAP" api, and the API returns "XML" results. Problem is, this is not always well formed:
<product>
<entity_id>18</entity_id>
<price regular="2925  <span>Nok</span>"/>
...
In this specific case, the price regular attribute has both an invisible character 0xa0 (before the span tag), and < > within the attribute text.
I have no way to get proper well-formed XML from Magento it seems, so the alternative is to clean it up before I feed it to my XmlSerializer deserialization:
XmlSerializer serializer = new XmlSerializer(typeof(Responses.Product.product));
product = serializer.Deserialize(textReader) as Responses.Product.product;
The invisible character I can get rid of using a simple text replace, but I'm more unsure about the <> within the attribute text.
My question is: how do I clean it up so that it is valid XML?
The character 0x3c is the < character. For an invisible character you would rather be looking for something like the 0x09 TAB character.
To fix the broken markup you could look for that specific HTML tag in the content, using a regular expression to allow any currency within the tag:
xml = Regex.Replace(xml, "<span>([A-Za-z]{3})</span>", "&lt;span&gt;$1&lt;/span&gt;");
This works as long as there aren't any span elements with three-character content in the XML code itself. You could do similar replacements for other HTML tags, but try to keep the pattern as specific as possible, to avoid false positives.
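The replacement above can be checked with a small sketch like the following. The names MarkupRepair and EscapeCurrencySpan are hypothetical, and the sample input is a simplified stand-in for the Magento response, not its actual output.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: escapes the embedded <span>XXX</span> tag so the
// attribute value no longer contains a raw '<'.
public static class MarkupRepair
{
    public static string EscapeCurrencySpan(string xml)
    {
        return Regex.Replace(
            xml,
            "<span>([A-Za-z]{3})</span>",
            "&lt;span&gt;$1&lt;/span&gt;");
    }

    // Parses the repaired fragment and returns the decoded attribute value.
    public static string ReadRegularPrice(string brokenXml)
    {
        XElement price = XElement.Parse(EscapeCurrencySpan(brokenXml));
        return (string)price.Attribute("regular");
    }
}
```

After the repair, the fragment parses, and the attribute value is returned in decoded form with the span tag intact as text.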

Having trouble taking out all the newline, tab, and carriage return between two tags

I have been working on this for almost a day now. But I'm not able to take out all the newline, tab, and carriage return from ">" and "<"
This is a sample XML file I'm reading:
<Consequence_Note>
<Text>In some cases, integer coercion errors can lead to exploitable buffer
overflow conditions, resulting in the execution of arbitrary
code.</Text>
</Consequence_Note>
and this
<Consequence_Scope>Availability</Consequence_Scope>
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
My goal is to take out all the newlines, tabs, and carriage returns between these two delimiters (> and <). The only thing I've been able to achieve is to take out all the \n\t\r between ">" and "<" when there's nothing else in between the two tags. But I'm not able to take out the \n\t\r when there are other characters in between the two tags.
I need help with a regular expression that will take out all the newlines, tabs, and carriage returns between ">" and "<".
For example:
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
What I would like to have is:
<Consequence_Technical_Impact>DoS: resource consumption (CPU)</Consequence_Technical_Impact>
This is my code (I'm reading from a xml file):
String file = @"C:\Documents and Settings\YYC\Desktop\cwec_v2.1\cwec_v2.1.xml";
var lines = File.ReadAllText(file);
var replace = Regex.Replace(lines, @">([\r\n\t])*?<", "><");
File.WriteAllText(file, replace);
Don't parse HTML/XML with regexp (see "RegEx match open tags except XHTML self-contained tags")!
Use an XML reader for XML, or HtmlAgilityPack (or some other HTML tool) for HTML.
XML and HTML documents are complex enough that a regex will not always do the job correctly (in some cases it will, but not in general).
If you first read the document using an XmlReader with XmlReaderSettings.IgnoreWhitespace enabled, it will drop the insignificant whitespace between tags. Then you can simply write it back out with the correct writer settings.
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.ignorewhitespace.aspx
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlwritersettings.newlinehandling.aspx
A regex alternative can probably be built, but it will still have lots of issues with XML containing CDATA, comments, and other constructs that make XML hard to parse to begin with. If your XML is very structured, machine generated, and unchanging, you could create a regex to fix it, but on the other hand, you might also be able to fix the generator. The simplest regex that might work:
\s{2,}
replace with
[ ]
That strips out any run of whitespace longer than one character and replaces it with a single space. There is no need to treat other whitespace inside tags differently; that's what the XmlReader should handle anyway.
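A possible way to combine both suggestions is sketched below: parse the document with an XML API, then collapse whitespace only inside text nodes. This is an assumption-laden variation, not the answerers' exact method. It uses XDocument instead of a raw XmlReader, it uses \s+ instead of \s{2,} so that a lone newline is also replaced, and the names WhitespaceCollapser and Collapse are hypothetical.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: normalizes whitespace inside text nodes only.
public static class WhitespaceCollapser
{
    public static string Collapse(string xml)
    {
        // With the default load options, whitespace-only nodes between
        // elements are not kept in the tree.
        XDocument doc = XDocument.Parse(xml);

        // Copy to a list first so the tree is not modified while enumerating.
        // CDATA sections are left untouched.
        var textNodes = doc.DescendantNodes()
                           .OfType<XText>()
                           .Where(t => !(t is XCData))
                           .ToList();

        foreach (XText text in textNodes)
        {
            text.Value = Regex.Replace(text.Value, @"\s+", " ");
        }

        // DisableFormatting avoids re-indenting the output.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because the regex only ever sees decoded text values, it cannot damage tags, comments, or CDATA sections, which is the main risk of running it over the raw file.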

Reading XML file with Invalid character

I am using DataSet.ReadXml() to read an XML string. I get an error because the XML string contains the invalid character 0x1F, which is 'US', the unit separator. It is contained within fully formed tags.
The data is extracted from an Oracle DB using a Perl script. What would be the best way to escape this character so that the XML is read correctly?
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
Between the C and the h in the bold part there is a US separator, which shows up as a space when pasted here. How can I ignore it in an XML string?
If you look at section 2.2 of the XML recommendation, you'll see that 0x1F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why doesn't the Perl script that produces this string fail when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
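The preprocessing step mentioned above (discarding any character that is not legal in well-formed XML) might look roughly like this. The ranges follow the Char production in section 2.2 of the XML 1.0 recommendation; the names XmlSanitizer and StripInvalidXmlChars are hypothetical.

```csharp
using System.Text;

// Hypothetical helper: drops characters that XML 1.0 does not allow.
public static class XmlSanitizer
{
    public static string StripInvalidXmlChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            // A valid surrogate pair encodes a code point in
            // #x10000-#x10FFFF, which is allowed; keep both halves.
            if (char.IsHighSurrogate(c)
                && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            // Tab, line feed, carriage return, and the two BMP ranges.
            else if (c == 0x9 || c == 0xA || c == 0xD
                     || (c >= 0x20 && c <= 0xD7FF)
                     || (c >= 0xE000 && c <= 0xFFFD))
            {
                sb.Append(c);
            }
            // Anything else (such as 0x1F or a lone surrogate) is dropped.
        }
        return sb.ToString();
    }
}
```

Running the string through this before DataSet.ReadXml would remove the 0x1F, though it silently changes the data, so fixing the producer remains the better option.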
Your XmlReader/TextReader must be created with the correct encoding. You can create it as below and pass it to your DataSet:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);

Problem getting XML properly formatted

I use classes (autogenerated from a schema) to generate XML documents. It has worked fine until now, when I need to use inline HTML elements. I've tried several different methods, but as soon as I use the inline HTML, the "<" and ">" get replaced with &lt; etc.
Example:
<meta>
<name>test</name>
<value>test <br />new row</value>
</meta>
becomes "destroyed" later on when trying to get it as a string for database storage, the value is changed to:
<value>test &lt;br /&gt;new row</value>
How is it possible to keep the angle brackets intact?
You need to use CDATA sections for XML (or XML like) content.
The XML writer is escaping the reserved characters such as <, >, etc. If you're reading the text back using an XML reader, then your &lt; will be correctly read as <.
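A minimal sketch of the CDATA suggestion, assuming the document is written through XmlWriter directly (with schema-generated classes the equivalent hook would differ). The names CDataDemo and WriteValue are hypothetical.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper: writes a <value> element whose content is a CDATA section.
public static class CDataDemo
{
    public static string WriteValue(string html)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("value");
            writer.WriteCData(html); // angle brackets stay literal inside CDATA
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}
```

CDataDemo.WriteValue("test <br />new row") produces <value><![CDATA[test <br />new row]]></value>. Note that the br is then character data, not a child element, so a reader returns it as text.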
