The XML content looks like the following:
<xml>
<item content="abcd&#xD;&#xA;abcd&#xD;&#xA;abcd" />
</xml>
When I use XmlDocument to read the value of the content attribute, &#xD; and &#xA; are automatically escaped.
Code:
XmlDocument doc = new XmlDocument();
var content = doc.SelectSingleNode("/xml/item").Attributes["content"].Value;
How can I get the raw text without character escaping?
If these characters were written to the lexical XML stream without escaping, then they would be swallowed by the XML parser when the stream is read by the recipient, as a result of the XML line-ending normalisation rules. So you've got it the wrong way around: the reason they are escaped is in order to preserve them; if they weren't escaped, they would be lost.
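The point above can be checked with a small sketch. It uses LINQ to XML rather than the XmlDocument from the question, and the class and method names are made up for illustration. It compares a literal line feed inside an attribute with the escaped form &#xA;, then round-trips a value through the writer.

```csharp
using System;
using System.Xml.Linq;

public static class AttributeNewlineDemo
{
    // Parses a single <item> element and returns the value of its content attribute.
    public static string ReadContent(string xml)
    {
        return (string)XElement.Parse(xml).Attribute("content");
    }

    // Serializes an <item> element whose content attribute holds the given value.
    public static string WriteContent(string value)
    {
        return new XElement("item", new XAttribute("content", value)).ToString();
    }

    public static void Main()
    {
        // Literal line feed in the attribute: the parser normalizes it to a space.
        Console.WriteLine(ReadContent("<item content=\"abcd\nabcd\" />") == "abcd abcd");

        // Escaped line feed: the character reference survives parsing as '\n'.
        Console.WriteLine(ReadContent("<item content=\"abcd&#xA;abcd\" />") == "abcd\nabcd");

        // Writing and re-reading keeps the line break because the writer escapes it.
        string xml = WriteContent("abcd\nabcd");
        Console.WriteLine(ReadContent(xml) == "abcd\nabcd");
    }
}
```

If the goal is only to display the escaped form, the serialized string returned by WriteContent already contains it, so no manual string slicing of OuterXml should be needed.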
I found a workaround that works for me:
private static string GetAttributeValue(XmlNode node, string attributeName)
{
if (node == null || string.IsNullOrWhiteSpace(attributeName))
{
throw new ArgumentException();
}
const string CharLF = "&#xA;";
const string CharCR = "&#xD;";
string xmlContent = node.OuterXml;
if (!xmlContent.Contains(CharLF) && !xmlContent.Contains(CharCR))
{
// no special char, return its original value directly
return node.Attributes[attributeName].Value;
}
string value = string.Empty;
if (xmlContent.Contains(attributeName))
{
value = xmlContent.Substring(xmlContent.IndexOf(attributeName)).Trim();
value = value.Substring(value.IndexOf("\"") + 1);
value = value.Substring(0, value.IndexOf("\""));
}
return value;
}
I am trying to convert a file that contains some special characters to XML, but the conversion fails because of those characters.
I already have the regex code below, but it still does not work for me. Please help.
The code I have tried:
string filedata = @"D:\readwrite\test11.txt";
string input = ReadForFile(filedata);
string re1 = @"[^\u0000-\u007F]+";
string re5 = @"\p{Cs}";
data = Regex.Replace(input, re1, "");
data = Regex.Replace(input, re5, "");
XmlDocument xmlDocument = new XmlDocument();
try
{
xmlDocument = (XmlDocument)JsonConvert.DeserializeXmlNode(data);
var Xdoc = XDocument.Parse(xmlDocument.OuterXml);
}
catch (Exception ex)
{
Console.WriteLine(ex);
}
0x04 is a transmission control character and cannot appear in a text string. XmlDocument is right to reject it if it really does appear in your data. This does suggest that the regex you have doesn't do what you think it does. If I'm right, that regex will find the first instance of one or more of those invalid characters at the beginning of a line and replace it, but not all of them. The real question for me is why this non-text 'character' appears in data intended as XML in the first place.
I have other questions. I've never seen JsonConvert.DeserializeXmlNode before - I had to look up what it does. Why are you using a JSON function against the root of a document which presumably therefore contains no JSON? Why are you then taking that document, converting it back to a string, and then creating an XDocument from it? Why not just create an XDocument to start with?
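If the stray control characters really cannot be fixed at the source, one option is to drop every character that XML 1.0 does not allow before building the document. The following is a minimal sketch, not the asker's code: the class and method names are invented for illustration, and it relies on XmlConvert.IsXmlChar, which is available from .NET Framework 4.0 onward.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class XmlTextSanitizer
{
    // Returns the input with every character that XML 1.0 cannot represent removed.
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsSurrogate(c))
            {
                // Keep well-formed surrogate pairs; drop lone surrogates.
                if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
                {
                    sb.Append(c).Append(text[i + 1]);
                    i++;
                }
            }
            else if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // U+0004 is removed, so the element can be serialized.
        string dirty = "header\u0004 body";
        var doc = new XDocument(new XElement("data", RemoveInvalidXmlChars(dirty)));
        Console.WriteLine(doc);
    }
}
```

Unlike the regex in the question, this keeps valid non-ASCII text and only removes what the XML specification forbids.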
I'm using the following code to read xml subtrees from a stream (where reader is an XmlReader):
while (reader.Read())
{
using (XmlReader subTreeReader = reader.ReadSubtree())
{
try
{
XElement xmlData = XElement.Load(subTreeReader);
ProcessStreamingEvent(xmlData);
}
catch (XmlException ex)
{
// want to write xml subtree
}
}
}
It works fine most of the time, but once XElement.Load(subTreeReader) throws an exception like
Name cannot begin with the '[' character, hexadecimal value 0x5B. Line
1, position 775.
The message is mostly useless; I need to output the entire string that XElement.Load() tried to parse. How can I do this? In the exception handler there is very little information; subTreeReader.Nodetype is Text, and Value is "spread". Perhaps I need to read the content of subTreeReader before calling XElement.Load()? It seems like subTreeReader should contain the entire string since it knows where the tree ends...?
I tried to replace XElement xmlData = XElement.Load(subTreeReader) with:
string xmlString = subTreeReader.ReadOuterXml();
XElement xmlData = XElement.Load(xmlString);
but xmlString is empty (even when subTreeReader does contain valid xml).
Interesting (and a good thing) that it is not XmlReader.Read() or XmlReader.ReadSubtree() that throws the exception (when the stream may not contain XML), but rather XElement.Load(). It makes me wonder, though, how ReadSubtree() knows when to stop reading when the stream does not contain XML.
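One possible direction, sketched under a strong assumption: the raw text of each message is available as a string before it is parsed (for example because it was buffered first), which is not how the streaming code above works. In that case XmlException.LineNumber can be used to report the offending line along with the message. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlErrorContext
{
    // Returns null when the text parses; otherwise the parser message plus the offending line.
    public static string Describe(string xmlText)
    {
        try
        {
            XElement.Parse(xmlText);
            return null;
        }
        catch (XmlException ex)
        {
            string[] lines = xmlText.Split('\n');
            // LineNumber is 1-based; 0 means no position information is available.
            string line = ex.LineNumber >= 1 && ex.LineNumber <= lines.Length
                ? lines[ex.LineNumber - 1]
                : string.Empty;
            return ex.Message + Environment.NewLine + line;
        }
    }

    public static void Main()
    {
        // An element name starting with '[' is not well-formed, so this reports an error.
        Console.WriteLine(Describe("<a><[b/></a>"));
    }
}
```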
The idea of the application is simple: it is given a path and writes each file's path into an XML file. The problem I am facing is that a file name can contain invalid characters, which makes the application stop working. Here is the code I use to write the file information to XML:
// the collecting details method
private void Get_Properties(string path)
{
// Load the XML File
XmlDocument xml = new XmlDocument();
xml.Load("Details.xml");
foreach (string eachfile in Files)
{
try
{
FileInfo Info = new FileInfo(eachfile);
toolStripStatusLabel1.Text = "Adding : " + Info.Name;
// Create the Root element
XmlElement ROOT = xml.CreateElement("File");
if (checkBox1.Checked)
{
XmlElement FileName = xml.CreateElement("FileName");
FileName.InnerText = Info.Name;
ROOT.AppendChild(FileName);
}
if (checkBox2.Checked)
{
XmlElement FilePath = xml.CreateElement("FilePath");
FilePath.InnerText = Info.FullName;
ROOT.AppendChild(FilePath);
}
if (checkBox3.Checked)
{
XmlElement ModificationDate = xml.CreateElement("ModificationDate");
string lastModification = Info.LastWriteTime.ToString();
ModificationDate.InnerText = lastModification;
ROOT.AppendChild(ModificationDate);
}
if (checkBox4.Checked)
{
XmlElement CreationDate = xml.CreateElement("CreationDate");
string Creation = Info.CreationTime.ToString();
CreationDate.InnerText = Creation;
ROOT.AppendChild(CreationDate);
}
if (checkBox5.Checked)
{
XmlElement Size = xml.CreateElement("Size");
Size.InnerText = Info.Length.ToString() + " Bytes";
ROOT.AppendChild(Size);
}
xml.DocumentElement.InsertAfter(ROOT, xml.DocumentElement.LastChild);
// +1 step in progressbar
toolStripProgressBar1.PerformStep();
success_counter++;
Thread.Sleep(10);
}
catch (Exception ee)
{
toolStripProgressBar1.PerformStep();
error_counter++;
}
}
toolStripStatusLabel1.Text = "Now Writing the Details File";
xml.Save("Details.xml");
toolStripStatusLabel1.Text = success_counter + " Items has been added and "+ error_counter +" Items has Failed , Total Files Processed ("+Files.Count+")";
Files.Clear();
}
Here is what the XML looks like after the details are generated:
<?xml version="1.0" encoding="utf-8"?>
<Files>
<File>
<FileName>binkw32.dll</FileName>
<FilePath>D:\ALL DLLS\binkw32.dll</FilePath>
<ModificationDate>3/31/2012 5:13:56 AM</ModificationDate>
<CreationDate>3/31/2012 5:13:56 AM</CreationDate>
<Size>286208 Bytes</Size>
</File>
<File>
Examples of characters I would like to write to XML without issues:
BX]GC^O^_nI_C{jv_rbp&1b_H âo&psolher d) doိiniᖭ
icon_Áq偩侉₳㪏ံ�ぞ鵃_䑋屡1]
MAnaFor줡�
EDIT [PROBLEM SOLVED]
All I had to do was:
1- Convert the file name to UTF-8 bytes
2- Convert the UTF-8 bytes back to a string
Here is the method:
byte[] FilestoBytes = System.Text.Encoding.UTF8.GetBytes(Info.Name);
string utf8 = System.Text.Encoding.UTF8.GetString(FilestoBytes);
It's not clear which of your characters you're having problems with. So long as you use the XML API (instead of trying to write the XML out directly yourself) you should be fine with any valid text (broken surrogate pairs would probably cause an issue) but what won't be valid is Unicode code points less than space (U+0020), aside from tab, carriage return and line feed. They're simply not catered for in XML.
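A short sketch of that behaviour with LINQ to XML, using made-up file names: the API escapes markup characters and keeps non-ASCII text, while a control character such as U+0001 makes serialization fail. The class and method names are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class FileNameXmlDemo
{
    // Returns the serialized element, or null when the name holds a character
    // that XML 1.0 cannot represent (the writer throws ArgumentException).
    public static string TrySerialize(string fileName)
    {
        try
        {
            return new XElement("FileName", fileName).ToString();
        }
        catch (ArgumentException)
        {
            return null;
        }
    }

    public static void Main()
    {
        Console.WriteLine(TrySerialize("a&b.dll"));               // '&' is escaped by the API
        Console.WriteLine(TrySerialize("icon_\u5069\u4F89"));     // CJK characters pass through
        Console.WriteLine(TrySerialize("bad\u0001name") == null); // control character is rejected
    }
}
```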
The XML is probably malformed. Some characters cannot appear in XML files without being escaped.
For example, this is not valid:
<dummy>You & Me</dummy>
Instead you should use:
<dummy>You &amp; Me</dummy>
Illegal characters in XML are &, < and > (as well as " or ' in attributes)
In the Windows file system, only & and ' of these can appear in a file name (<, > and " are not allowed in file names).
When saving XML you can escape these characters. For example, for & you will need &amp;
Is there any C# function which could be used to escape and un-escape a string, which could be used to fill in the content of an XML element?
I am using VSTS 2008 + C# + .Net 3.0.
EDIT 1: I am concatenating simple and short XML files and I do not use serialization, so I need to explicitly escape XML characters by hand. For example, I need to put a<b into <foo></foo>, so I need to escape the string a<b and put it into the element foo.
SecurityElement.Escape(string s)
public static string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
XmlNode node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
public static string XmlUnescape(string escaped)
{
XmlDocument doc = new XmlDocument();
XmlNode node = doc.CreateElement("root");
node.InnerXml = escaped;
return node.InnerText;
}
EDIT: You say "I am concatenating simple and short XML file and I do not use serialization, so I need to explicitly escape XML character by hand".
I would strongly advise you not to do it by hand. Use the XML APIs to do it all for you - read in the original files, merge the two into a single document however you need to (you probably want to use XmlDocument.ImportNode), and then write it out again. You don't want to write your own XML parsers/formatters. Serialization is somewhat irrelevant here.
If you can give us a short but complete example of exactly what you're trying to do, we can probably help you to avoid having to worry about escaping in the first place.
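As a rough illustration of the XmlDocument.ImportNode approach mentioned above, here is a minimal sketch that appends the children of one document's root to another. The class name, method name and sample documents are assumptions made for the example.

```csharp
using System;
using System.Xml;

public static class XmlMergeDemo
{
    // Appends every child of the second document's root to the first document's root.
    public static string Merge(string firstXml, string secondXml)
    {
        var first = new XmlDocument();
        first.LoadXml(firstXml);
        var second = new XmlDocument();
        second.LoadXml(secondXml);

        foreach (XmlNode child in second.DocumentElement.ChildNodes)
        {
            // ImportNode makes a deep copy owned by the first document;
            // the node in the second document is left untouched.
            XmlNode imported = first.ImportNode(child, true);
            first.DocumentElement.AppendChild(imported);
        }
        return first.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(Merge("<root><a>1</a></root>", "<root><b>a&lt;b</b></root>"));
    }
}
```

The escaped text in the second document stays escaped in the merged output without any manual string handling.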
Original answer
It's not entirely clear what you mean, but normally XML APIs do this for you. You set the text in a node, and it will automatically escape anything it needs to. For example:
LINQ to XML example:
using System;
using System.Xml.Linq;
class Test
{
static void Main()
{
XElement element = new XElement("tag",
"Brackets & stuff <>");
Console.WriteLine(element);
}
}
DOM example:
using System;
using System.Xml;
class Test
{
static void Main()
{
XmlDocument doc = new XmlDocument();
XmlElement element = doc.CreateElement("tag");
element.InnerText = "Brackets & stuff <>";
Console.WriteLine(element.OuterXml);
}
}
Output from both examples:
<tag>Brackets &amp; stuff &lt;&gt;</tag>
That's assuming you want XML escaping, of course. If you don't, please post more details.
Thanks to @sehe for the one-line escape:
var escaped = new System.Xml.Linq.XText(unescaped).ToString();
I add to it the one-line un-escape:
var unescapedAgain = System.Xml.XmlReader.Create(new StringReader("<r>" + escaped + "</r>")).ReadElementString();
George, it's simple. Always use the XML APIs to handle XML. They do all the escaping and unescaping for you.
Never create XML by appending strings.
And if you want, like me when I found this question, to escape XML node names, like for example when reading from an XML serialization, use the easiest way:
XmlConvert.EncodeName(string nameToEscape)
It will also escape spaces and any non-valid characters for XML elements.
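A small sketch of how XmlConvert.EncodeName and its counterpart XmlConvert.DecodeName behave; the wrapper class and the sample names are only illustrative.

```csharp
using System;
using System.Xml;

public static class NameEscapingDemo
{
    // Encodes an arbitrary string so that it can be used as an XML element or attribute name.
    public static string Encode(string name)
    {
        return XmlConvert.EncodeName(name);
    }

    // Reverses EncodeName, turning _xHHHH_ sequences back into characters.
    public static string Decode(string encodedName)
    {
        return XmlConvert.DecodeName(encodedName);
    }

    public static void Main()
    {
        string encoded = Encode("Order Details");
        Console.WriteLine(encoded);          // the space becomes _x0020_
        Console.WriteLine(Decode(encoded));  // back to the original name
    }
}
```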
http://msdn.microsoft.com/en-us/library/system.security.securityelement.escape%28VS.80%29.aspx
Another take based on Jon Skeet's answer that doesn't return the tags:
void Main()
{
XmlString("Brackets & stuff <> and \"quotes\"").Dump();
}
public string XmlString(string text)
{
return new XElement("t", text).LastNode.ToString();
}
This returns just the value passed in, in XML encoded format:
Brackets &amp; stuff &lt;&gt; and "quotes"
WARNING: Necromancing
Still Darin Dimitrov's answer + System.Security.SecurityElement.Escape(string s) isn't complete.
In XML 1.1, the simplest and safest way is to just encode EVERYTHING.
Like &#09; for \t.
It isn't supported at all in XML 1.0.
For XML 1.0, one possible workaround is to base-64 encode the text containing the character(s).
//string EncodedXml = SpecialXmlEscape("привет мир");
//Console.WriteLine(EncodedXml);
//string DecodedXml = XmlUnescape(EncodedXml);
//Console.WriteLine(DecodedXml);
public static string SpecialXmlEscape(string input)
{
//string content = System.Xml.XmlConvert.EncodeName("\t");
//string content = System.Security.SecurityElement.Escape("\t");
//string strDelimiter = System.Web.HttpUtility.HtmlEncode("\t"); // XmlEscape("\t"); //XmlDecode(" ");
//strDelimiter = XmlUnescape(";");
//Console.WriteLine(strDelimiter);
//Console.WriteLine(string.Format("&#{0};", (int)';'));
//Console.WriteLine(System.Text.Encoding.ASCII.HeaderName);
//Console.WriteLine(System.Text.Encoding.UTF8.HeaderName);
string strXmlText = "";
if (string.IsNullOrEmpty(input))
return input;
System.Text.StringBuilder sb = new StringBuilder();
for (int i = 0; i < input.Length; ++i)
{
sb.AppendFormat("&#{0};", (int)input[i]);
}
strXmlText = sb.ToString();
sb.Clear();
sb = null;
return strXmlText;
} // End Function SpecialXmlEscape
XML 1.0:
public static string Base64Encode(string plainText)
{
var plainTextBytes = System.Text.Encoding.UTF8.GetBytes(plainText);
return System.Convert.ToBase64String(plainTextBytes);
}
public static string Base64Decode(string base64EncodedData)
{
var base64EncodedBytes = System.Convert.FromBase64String(base64EncodedData);
return System.Text.Encoding.UTF8.GetString(base64EncodedBytes);
}
The following functions will do the work. I didn't benchmark them against XmlDocument, but I guess they are much faster.
public static string XmlEncode(string value)
{
System.Xml.XmlWriterSettings settings = new System.Xml.XmlWriterSettings
{
ConformanceLevel = System.Xml.ConformanceLevel.Fragment
};
StringBuilder builder = new StringBuilder();
using (var writer = System.Xml.XmlWriter.Create(builder, settings))
{
writer.WriteString(value);
}
return builder.ToString();
}
public static string XmlDecode(string xmlEncodedValue)
{
System.Xml.XmlReaderSettings settings = new System.Xml.XmlReaderSettings
{
ConformanceLevel = System.Xml.ConformanceLevel.Fragment
};
using (var stringReader = new System.IO.StringReader(xmlEncodedValue))
{
using (var xmlReader = System.Xml.XmlReader.Create(stringReader, settings))
{
xmlReader.Read();
return xmlReader.Value;
}
}
}
Using a third-party library (Newtonsoft.Json) as alternative:
public static string XmlEscape(string unescaped)
{
if (unescaped == null) return null;
return JsonConvert.SerializeObject(unescaped);
}
public static string XmlUnescape(string escaped)
{
if (escaped == null) return null;
return JsonConvert.DeserializeObject(escaped, typeof(string)).ToString();
}
Examples of escaped string:
a<b ==> "a<b"
<foo></foo> ==> "foo></foo>"
NOTE:
In newer versions, the code written above may not work with escaping, so you need to specify how the strings will be escaped:
public static string XmlEscape(string unescaped)
{
if (unescaped == null) return null;
return JsonConvert.SerializeObject(unescaped, new JsonSerializerSettings()
{
StringEscapeHandling = StringEscapeHandling.EscapeHtml
});
}
Examples of escaped string:
a<b ==> "a\u003cb"
<foo></foo> ==> "\u003cfoo\u003e\u003c/foo\u003e"
SecurityElement.Escape does this job for you.
Use this method to replace invalid characters in a string before using the string in a SecurityElement. If invalid characters are used in a SecurityElement without being escaped, an ArgumentException is thrown.
The following table shows the invalid XML characters and their escaped equivalents.
Invalid XML character | Replaced with
< | &lt;
> | &gt;
" | &quot;
' | &apos;
& | &amp;
https://learn.microsoft.com/en-us/dotnet/api/system.security.securityelement.escape?view=net-5.0
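For reference, a minimal sketch of calling SecurityElement.Escape; the wrapper class and the sample strings are made up for illustration.

```csharp
using System;
using System.Security;

public static class SecurityElementEscapeDemo
{
    // Replaces <, >, ", ' and & with their predefined XML entities.
    public static string Escape(string text)
    {
        return SecurityElement.Escape(text);
    }

    public static void Main()
    {
        Console.WriteLine(Escape("a<b"));
        Console.WriteLine(Escape("Tom & Jerry's"));
    }
}
```

Note that this only handles the five markup characters; it does not remove control characters that XML 1.0 disallows.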