How to write '&' in xml? - c#

I am using xmlTextWriter to create the xml.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now i need to write '&' but xmlTextWriter will automatically write this one as "&amp";
So is there any work around to do this?
I am creating xml by reading the doc file.So if I read "-" then in xml i need to write "&ndash";.So while writing it's written as "&amp";ndash.
So, for example, if I am trying to write a node with the text good-bad, I actually need to write my XML such as <node>good–bad</node>. This is a requirement of my project.

In a proper XML file, you cannot have a standalone & character unless it is an escape character. So if you need an XML node to contain good–bad, then it will have to be encoded as good&ndash;bad. There is no workaround as anything different would not be valid XML. The only way to make it work is to just write the XML file as a plain text how you want it, but then it could not be read by an XML parser as it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using(var sw = new StreamWriter(stream))
{
// other code to write XML-like data
sw.WriteLine("<node>good–bad</node>");
// other code to write XML-like data
}
As you discovered, another option is to use the WriteRaw() method on XmlTextWriter (in C#) will write an unencoded string, but it does not change the fact it is not going to be a valid XML file when it is done.
But as I mentioned, if you tried to read this with an XML Parser, it would fail because &ndash is not a valid XML character entity so it is not valid XML.
– is an HTML character entity, so escaping it in an XML should not normally be necessary.
In the XML language, & is the escape character, so & is appropriate string representation of &. You cannot use just a & character because the & character has a special meaning and therefore a single & character would be misinterpreted by the parser/
You will see similar behavior with the <, >, ", and' characters. All have meaning within the XML language so if you need to represent them as text in a document.
Here's a reference to all of the character entities in XML (and HTML) from Wikipedia. Each will always be represented by the escape character and the name (>, <, ", &apos;)

In XML, & must be escaped as &. The & character is reserved for entities and thus not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Another software reading the XML has to decode the entity again. < for < and > for > or other examples, some other languages like HTML which are based on XML provide even more of these.

I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)

Related

Is there a method in C# that converts just the characters with a special meaning in HTML and XML to entities?

My programme generates an XML file with data from a database that contains special HTML/XML characters, so those need to be converted to entities.
I tried to use System.Web.HttpUtility.HtmlEncode() for that, which works but it does much more than I want.
Certain (but not all) non-ASCII characters such as the middle dot and superscript numbers are also converted even though they have no special meaning in HTML and XML.
I get for instance the string a&#183;ba&#183;ĵur&#183;o instead of simply a·ba·ĵur·o.
As you see the circumflexed "j" is left as it is, but the dots are converted. This makes some data in the file difficult to read for humans;
I only want <, >, &, ' and " converted, nothing more.
Is there a method in C# that does that, or do I need to write my own?
Use SecurityElement.Escape("your HTML")

How to remove xml "&" symbol

i am parasing xml file to dataset.I getting error if xml data contain "&" or some special char how to remove that?
How to remove "&" from below tag?
xml
< department departmentid=1 name="pen & Note" >
$
string departmentpath = HostingEnvironment.MapPath("~/App_Data/Department.xml");
DataSet departmentDS = new DataSet();
System.IO.FileStream dpReadXml = new System.IO.FileStream(departmentpath, System.IO.FileMode.Open);
try
{
departmentDS.ReadXml(dpReadXml);
}
catch (Exception ex)
{
//logg
}
You can replace it with &
The culture of XML is that the person who creates the XML is responsible for delivering well-formed XML that conforms to the spec; the recipient is expected to reject it if they get it wrong. So by trying to repair bad XML and turn it into good XML you are going against the grain. It's like getting served bad food in a restaurant: you should complain, rather than asking the people at the next table how to make it digestible.
The input you've provided has a lot more wrong with it than the ampersands. It's hardly recognizable as XML at all. You're never going to turn this mess into a robust data flow.
The code seems to be C#. But do add the correct language tag!
There are five special characters that often require escaping within XML documents. You can read this SO question.
There are two possibilities:
Let your DataSet::ReadXML method handle these special characters [Recommended]
Change all special characters from your input files [Not recommended]
The second method is not recommended since you cannot possibly always control the incoming data (and you probably would be wasting time pre-processing them if you do want to). In order for ReadXML to properly parse the special characters you will need to define a proper encoding too in your input XML.

Having trouble taking out all the newline, tab, and carriage return between two tags

I have been working on this for almost a day now. But I'm not able to take out all the newline, tab, and carriage return from ">" and "<"
This is a sample XML file I'm reading:
<Consequence_Note>
<Text>In some cases, integer coercion errors can lead to exploitable buffer
overflow conditions, resulting in the execution of arbitrary
code.</Text>
</Consequence_Note>
and this
<Consequence_Scope>Availability</Consequence_Scope>
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
My goal is to take out all the newline, tab, and carriage return from these two tag (> and <). The only thing I'm able to achieve is to take out all the /n/t/r from ">" and "<" when there's nothing in between the two tags. But I'm not able to take out all the \n\t\r when there's other character in between the two tags.
I need help in how to have a regular expression that will take out all the newline, tag, and carriage return from ">" and "<"
For example:
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
What I would like to have is:
<Consequence_Technical_Impact>DoS: resource consumption (CPU)</Consequence_Technical_Impact>
This is my code (I'm reading from a xml file):
String file = #"C:\Documents and Settings\YYC\Desktop\cwec_v2.1\cwec_v2.1.xml";
var lines = File.ReadAllText(file);
var replace = Regex.Replace(lines, #">([\r\n\t])*?<", "><");
File.WriteAllText(file, replace);
Don't parse html/xml with regexp ( RegEx match open tags except XHTML self-contained tags )!
Use XML reader for xml or HtmlAgilityPack (or some other html tool) for html.
The xml/html documents are so complex, the regexp is not always (in some cases yes, but not generaly) do the work absolutelly right.
If you first read the document using an XmlReader it will remove the newlines from the input by default. then you can simply write it back out with the writer correct settings.
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.ignorewhitespace.aspx
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlwritersettings.newlinehandling.aspx
A regex alternative can probably be built, but it will still have lots and lots of issues with XML containing CData, comments and other constructs which make XML hard to parse to begin with. If you XML is very structured, machine generated and unchanging, you could create a regex to fix it, but on the other hand, you might also be able to fix the generator. Simplest regex that might work:
\s{2,}
replace with
[ ]
That strips out any whitespace which is longer than one character and replaces it with one space. No need to treat any other whitespace inside tags differently, that's what the XMLReader should do by default anyways.

Reading XML file with Invalid character

I am using Dataset.ReadXML() to read an XML string. I get an error as the XML string contains the Invalid Character 0x1F which is 'US' - Unit seperator. This is contained within fully formed tags.
The data is extracted from an Oracle DB, using a Perl script. How would be the best way to escape this character so that the XML is read correctly.
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
Is between the C and h in the bold part, is where there is a US seperator, which when pasted into this actually shows a space. So I want to know how can I ignore that in an XML string?
If you look at section 2.2 of the XML recommendation, you'll see that x01F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why is it doesn't the Perl script that produces this string failing when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
Your XmlReader/TextReader must be created with correct encoding. You can create it as below and pass to your Dataaset:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);

How do I stop XElement.Save from escaping characters?

I'm populating an XElement with information and writing it to an xml file using the XElement.Save(path) method. At some point, certain characters in the resulting file are being escaped - for example, > becomes >.
This behaviour is unacceptable, since I need to store information in the XML that includes the > character as part of a password. How can I write the 'raw' content of my XElement object to XML without having these escaped?
Lack of this behavior is unacceptable.
A standalone unescaped > is invalid XML.
XElement is designed to produce valid XML.
If you want to get the unescaped content of the element, use the Value property.
The XML specification usually allows > to appear unescaped. XDocument plays it safe and escapes it although it appears in places where the escaping is not strictly required.
You can do a replace on the generated XML. Be aware per http://www.w3.org/TR/REC-xml#syntax, if this results in any ]]> sequences, the XML will not conform to the XML specification. Moreover, XDocument.Parse will actually reject such XML with the error "']]>' is not allowed in character data.".
XDocument doc = XDocument.Parse("<test>Test>Data</test>");
// Don't use this if it could result in any ]]> sequences!
string s = doc.ToString().Replace(">", ">");
System.IO.File.WriteAllText(#"c:\path\test.xml", s);
In consideration that any spec-compliant XML parser must support >, I'd highly recommend fixing the code that is processing the XML output of your program.

Categories

Resources