Parsing invalid characters to XML - c#

The idea is simple: the application is given a path and writes each file's path into an XML file. The problem I am facing is that a file name can contain invalid characters, which makes the application stop working. Here is the code I use to write the file information into XML:
// the collecting details method
private void Get_Properties(string path)
{
// Load the XML File
XmlDocument xml = new XmlDocument();
xml.Load("Details.xml");
foreach (string eachfile in Files)
{
try
{
FileInfo Info = new FileInfo(eachfile);
toolStripStatusLabel1.Text = "Adding : " + Info.Name;
// Create the Root element
XmlElement ROOT = xml.CreateElement("File");
if (checkBox1.Checked)
{
XmlElement FileName = xml.CreateElement("FileName");
FileName.InnerText = Info.Name;
ROOT.AppendChild(FileName);
}
if (checkBox2.Checked)
{
XmlElement FilePath = xml.CreateElement("FilePath");
FilePath.InnerText = Info.FullName;
ROOT.AppendChild(FilePath);
}
if (checkBox3.Checked)
{
XmlElement ModificationDate = xml.CreateElement("ModificationDate");
string lastModification = Info.LastWriteTime.ToString(); // LastWriteTime is the modification time (LastAccessTime is the last read)
ModificationDate.InnerText = lastModification;
ROOT.AppendChild(ModificationDate);
}
if (checkBox4.Checked)
{
XmlElement CreationDate = xml.CreateElement("CreationDate");
string Creation = Info.CreationTime.ToString();
CreationDate.InnerText = Creation;
ROOT.AppendChild(CreationDate);
}
if (checkBox5.Checked)
{
XmlElement Size = xml.CreateElement("Size");
Size.InnerText = Info.Length.ToString() + " Bytes";
ROOT.AppendChild(Size);
}
xml.DocumentElement.InsertAfter(ROOT, xml.DocumentElement.LastChild);
// +1 step in progressbar
toolStripProgressBar1.PerformStep();
success_counter++;
Thread.Sleep(10);
}
catch (Exception ee)
{
toolStripProgressBar1.PerformStep();
error_counter++;
}
}
toolStripStatusLabel1.Text = "Now Writing the Details File";
xml.Save("Details.xml");
toolStripStatusLabel1.Text = success_counter + " Items has been added and "+ error_counter +" Items has Failed , Total Files Processed ("+Files.Count+")";
Files.Clear();
}
Here is what the XML looks like after the details are generated:
<?xml version="1.0" encoding="utf-8"?>
<Files>
<File>
<FileName>binkw32.dll</FileName>
<FilePath>D:\ALL DLLS\binkw32.dll</FilePath>
<ModificationDate>3/31/2012 5:13:56 AM</ModificationDate>
<CreationDate>3/31/2012 5:13:56 AM</CreationDate>
<Size>286208 Bytes</Size>
</File>
<File>
Examples of file names I would like to write to XML without issue:
BX]GC^O^_nI_C{jv_rbp&1b_H âo&psolher d) doိiniᖭ
icon_Áq偩侉₳㪏ံ�ぞ鵃_䑋屡1]
MAnaFor줡�
EDIT [PROBLEM SOLVED]
All I had to do was:
1- Convert the file name to UTF-8 bytes
2- Convert the UTF-8 bytes back to a string
Here is the method:
byte[] FilestoBytes = System.Text.Encoding.UTF8.GetBytes(Info.Name);
string utf8 = System.Text.Encoding.UTF8.GetString(FilestoBytes);

It's not clear which of your characters you're having problems with. So long as you use the XML API (instead of trying to write the XML out directly yourself) you should be fine with any valid text (broken surrogate pairs would probably cause an issue) but what won't be valid is Unicode code points less than space (U+0020), aside from tab, carriage return and line feed. They're simply not catered for in XML.
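As a minimal sketch of the point above, the helper below drops the characters that XML 1.0 cannot represent (control characters other than tab, carriage return and line feed, plus unpaired surrogates) before the text is handed to the XML API. The class and method names are hypothetical; `XmlConvert.IsXmlChar` is the framework call (.NET 4.0 and later) used for the per-character check.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlTextSanitizer
{
    // Hypothetical helper: returns a copy of the input without the
    // characters that XML 1.0 cannot represent.
    public static string StripInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // Well-formed surrogate pair: keep both halves.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                // Ordinary character that XML 1.0 allows.
                sb.Append(c);
            }
            // Anything else (lone surrogate, forbidden control character) is dropped.
        }
        return sb.ToString();
    }
}
```

In the question's code this would be used as `FileName.InnerText = XmlTextSanitizer.StripInvalidXmlChars(Info.Name);`. A plausible reason the UTF-8 round trip in the edit helped is that `Encoding.UTF8` replaces unpaired surrogates with U+FFFD by default; that is an inference, not something the question confirms.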

The XML is probably malformed. Some characters cannot appear in XML text unless they are escaped.
For example, this is not valid:
<dummy>You & Me</dummy>
Instead you should use:
<dummy>You &amp; Me</dummy>
Illegal characters in XML are &, < and > (as well as " or ' in attributes)

Illegal characters in XML are &, < and > (as well as " or ' in attributes)
On the Windows file system, only & and ' of these can appear in a file name (<, > and " are not allowed).
When saving XML you can escape these characters. For example, & must be written as &amp;
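To illustrate the escaping rule, here is a small sketch (the class and method names are made up for the example) showing that the XML API applies the escaping itself when text is assigned through `InnerText`, so the caller passes the raw string.

```csharp
using System;
using System.Xml;

public static class EscapingDemo
{
    // Hypothetical helper: wraps raw text in a <dummy> element and
    // returns the serialized markup.
    public static string BuildDummy(string text)
    {
        var doc = new XmlDocument();
        XmlElement dummy = doc.CreateElement("dummy");
        dummy.InnerText = text;   // raw text in; escaping happens on output
        doc.AppendChild(dummy);
        return doc.OuterXml;
    }
}
```

Calling `EscapingDemo.BuildDummy("You & Me")` yields `<dummy>You &amp; Me</dummy>`, and reading `InnerText` back after loading that markup returns the original `You & Me`.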

Related

How to remove unused > character in xml string in c#

I have to remove some special characters and a stray ">" from an XML string.
Loading the XML throws a "Data at the root level is invalid" error.
public T ConvertXmlFromByte<T>(byte[] data) where T : class // constraint needed so that "T model = null" compiles
{
T model = null;
try
{
if (data != null)
{
XmlDocument xmlDoc = null;
XmlSerializer serializer = null;
string xml = "";
xmlDoc = new XmlDocument();
xml = Encoding.UTF8.GetString(data);
//xml = Regex.Replace(xml, @"[^&;:()a-zA-Z0-9\=./><_-~-]", string.Empty);
xmlDoc.LoadXml(xml);
}
}
catch (Exception ex)
{
_customLogger.Error(ex.Message, ex);
}
return model;
}
Below is my XML string:
<?xml version="1.0" encoding="utf-8" standalone="no"?><Test xmlns="https://cdn.Test.go.cr/xml-schemas/v4.3/test" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
</Test>>
You are trying to load the text as XML and then remove what appears to be an extra character.
The problem is that this extra character makes the text invalid XML, so an XML parser cannot load it; that is why you get this exception.
You must therefore treat it as a plain string: find the offending characters, fix the string, and save it again.
A simple Regex can do this. You don't state the exact conditions, so I assume that a doubled '>>' is always incorrect; amend as appropriate.
string contents = File.ReadAllText(@"c:\path\file.xml");
string output = Regex.Replace(contents, ">>", ">");
File.WriteAllText(@"c:\path\output.xml", output);

preserve &#xA, when reading XML

Xml content like following:
<xml>
<item content="abcd &#xD; abcd&#xA;abcd" />
</xml>
When using XmlDocument to read the content of the content attribute, &#xD; and &#xA; are automatically escaped.
Code:
XmlDocument doc = new XmlDocument();
var content = doc.SelectSingleNode("/xml/item").Attributes["content"].Value;
How can I get the raw text without character escaping?
If these characters were written to the lexical XML stream without escaping, then they would be swallowed by the XML parser when the stream is read by the recipient, as a result of the XML line-ending normalisation rules. So you've got it the wrong way around: the reason they are escaped is in order to preserve them; if they weren't escaped, they would be lost.
I got a workaround, it works for me:
private static string GetAttributeValue(XmlNode node, string attributeName)
{
if (node == null || string.IsNullOrWhiteSpace(attributeName))
{
throw new ArgumentException();
}
const string CharLF = "&#xA;";
const string CharCR = "&#xD;";
string xmlContent = node.OuterXml;
if (!xmlContent.Contains(CharLF) && !xmlContent.Contains(CharCR))
{
// no special char, return its original value directly
return node.Attributes[attributeName].Value;
}
string value = string.Empty;
if (xmlContent.Contains(attributeName))
{
value = xmlContent.Substring(xmlContent.IndexOf(attributeName)).Trim();
value = value.Substring(value.IndexOf("\"") + 1);
value = value.Substring(0, value.IndexOf("\""));
}
return value;
}
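A possible alternative to slicing `OuterXml`, sketched under the assumption that the goal is simply to get the character-reference form back: read the parsed attribute value, which holds the real carriage return and line feed characters, and re-encode those two characters by hand. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class AttributeNewlines
{
    // Reads the parsed value of /xml/item/@content; character references
    // such as &#xA; arrive here as the actual characters they denote.
    public static string ReadContent(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectSingleNode("/xml/item").Attributes["content"].Value;
    }

    // Turns CR and LF back into their character-reference spelling.
    public static string ToCharacterReferences(string value)
    {
        return value.Replace("\r", "&#xD;").Replace("\n", "&#xA;");
    }
}
```

This only restores the two references discussed here; any other escaped character in the attribute would still appear in its decoded form.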

XmlException - given illegal XML from 3rd party; must process

There are several SO questions and answers about this when creating an XML file; but can't find any pertaining to when you are given bad XML from a 3rd party that you must process; note, the 3rd party cannot be held accountable for the illegal XML.
Ultimately, the .InnerText needs to be escaped or encoded (e.g. changed to legal XML characters) - and later decoded after proper XML parsing.
QUESTION: Are there any libraries that will Load() Invalid/Illegal XML files to allow quick navigation for such escaping/encoding? Or am I stuck having to manually parse the invalid xml, fixing it along the way ... ?
<?xml version="1.0" encoding="utf-8"?>
<ChunkData>
<Fields>
<Field1>some words < other words</Field1>
<Field2>some words > other words</Field2>
</Fields>
</ChunkData>
Although HtmlAgilityPack is awesome (and I'm using it in another project of my own), I was given no time to follow Alexei's advice, which is exactly the direction I was looking for: can't parse it as XML? Cool, parse it as HTML. That didn't even cross my mind.
Ended up with this, which does the trick (but is exactly what Alexei advised against):
private static string EncodeValues(string xml)
{
var doc = new List<string>();
var lines = xml.Split('\n');
foreach (var line in lines)
{
var output = line;
if (line.Contains("<Field") && !line.Contains("Fields>"))
{
var value = line.Parse(">", "</");
var encoded = HttpUtility.UrlEncode(value);
output = line.Replace(value, encoded);
}
doc.Add(output);
}
return string.Join("", doc);
}
private static Hashtable DecodeValues(IDictionary data)
{
var output = new Hashtable();
foreach (var key in data.Keys)
{
var value = (string)data[key];
output.Add(key, HttpUtility.UrlDecode(value));
}
return output;
}
Used in conjunction with an Extension method I wrote quite awhile ago ...
public static string Parse(this string s, string first, string second)
{
try
{
if (string.IsNullOrEmpty(s)) return "";
var start = s.IndexOf(first, StringComparison.InvariantCulture) + first.Length;
var end = s.IndexOf(second, start, StringComparison.InvariantCulture);
var length = end - start;
return (end > 0 && length < s.Length) ? s.Substring(start, length) : s.Substring(start);
}
catch (Exception) { return ""; }
}
Used as such (kept separate from the Transform and Hashtable creation methods for clarity):
xmlDocs[0] = EncodeValues(xmlDocs[0]); // in order to handle illegal chars in XML, encode InnerText
var doc = TransformXmlDocument(orgName, xmlDocs[0], xmlDocs[1]);
var data = GetHashtableFromXml(doc);
data = DecodeValues(data); // decode the values extracted from the hashtable
Regardless, I'm always looking for insight ... feel free to comment on this solution - or provide another.
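One variation on the same pre-processing idea, offered as a sketch rather than a recommendation: XML-escape the field values instead of URL-encoding them, so the parser decodes them again on its own and no DecodeValues pass is needed. It assumes the values sit in `<FieldN>` elements with no nested markup and are not already escaped (an existing `&amp;` would be escaped twice). The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class FieldEscaper
{
    // Assumption: values live in <FieldN>...</FieldN> elements with no nested markup.
    private static readonly Regex FieldPattern =
        new Regex(@"<(Field\d+)>(.*?)</\1>", RegexOptions.Singleline);

    // Escapes &, < and > inside each field value so the result parses as XML.
    public static string EscapeFieldValues(string xml)
    {
        return FieldPattern.Replace(xml, m =>
        {
            string name = m.Groups[1].Value;
            string value = m.Groups[2].Value
                .Replace("&", "&amp;")   // must run first so later escapes are not re-escaped
                .Replace("<", "&lt;")
                .Replace(">", "&gt;");
            return "<" + name + ">" + value + "</" + name + ">";
        });
    }
}
```

After `EscapeFieldValues`, `XmlDocument.LoadXml` accepts the sample document, and `InnerText` of `Field1` returns `some words < other words` unchanged.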

Loading an XML file when the tags are written in Greek doesn't work, why?

When I load XML files with English tags everything works fine but when I try to load an XML file with tags written in the Greek Language nothing works, why is this happening?
Do I have to change the encoding somewhere in the code?
This is the code I use:
XmlDocument xdoc = new XmlDocument();
xdoc.Load(filename);
XmlNode root = xdoc.DocumentElement;
if (root.HasChildNodes)
{
for (int i = 0; i < root.ChildNodes.Count; i++)
{
richTextBox1.AppendText(root.ChildNodes[i].InnerXml + "\n");
}
}
I downloaded your file and deserialized/displayed it successfully.
public class ΦΑΡΜΑΚΑ
{
public string A;
public string ΦΑΡΜ_ΑΓΩΓΗ;
public string ΧΟΡΗΓΗΣΗ;
public string ΛΗΞΗΣ;
public string ΑMKA;
}
XmlSerializer xml = new XmlSerializer(typeof(ΦΑΡΜΑΚΑ[]),new XmlRootAttribute("dataroot"));
ΦΑΡΜΑΚΑ[] array = (ΦΑΡΜΑΚΑ[])xml.Deserialize(File.Open(@"D:\Downloads\bio3.xml", FileMode.Open));
richTextBox1.Text = String.Join(Environment.NewLine, array.Select(x => x.ΦΑΡΜ_ΑΓΩΓΗ));
Make sure your rich text box has its Multiline property set to true. The default is true, but you may have changed it. Also, use Environment.NewLine instead of \n.
Also .InnerText will get you the value without the tags. InnerXml gives you the markup as well.

Text boxes and Xml in C#

I just started using VS2010 and C#.
I'm trying to create an app which takes values from two textboxes and adds it to an existing xml file.
eg.
Text Box 1 Text Box 2
---------- ----------
A B
C D
E F
I want the resultant xml file to be like this.
<root>
<element>
<Entry1>A</Entry1>
<Entry2>B</Entry2>
</element>
</root>
and so on...
Can this be done using C#?
I'm unable to add the entries alternately, i.e. Entry1 should contain line 1 of Text Box 1 and Entry2 should contain line 1 of Text Box 2.
Any help would be appreciated.
Thanks
You need to split the string retrieved from the text box based on the new line like this:
string[] lines = theText.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
Once you have the values split for each text box, you can use the System.Xml.Linq.XDocument class and loop through the values retrieved above.
Something like this:
XDocument srcTree = new XDocument(new XElement("Root",
new XElement("entry1", "textbox value1")));
You can retrieve a xml document using a linq query or save it in an xml file using the Save method of XDocument
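Putting the two steps together, a minimal sketch of the pairing the question asks for might look like this. The names `EntryBuilder` and `BuildRoot` are invented for the example, and the inputs stand in for the `Text` of the two text boxes.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class EntryBuilder
{
    // Pairs line N of the first text with line N of the second and wraps
    // each pair in an <element> with <Entry1> and <Entry2> children.
    public static XElement BuildRoot(string text1, string text2)
    {
        string[] separators = { "\r\n", "\n" };
        string[] left = text1.Split(separators, StringSplitOptions.None);
        string[] right = text2.Split(separators, StringSplitOptions.None);

        // Zip stops at the shorter list, so unmatched trailing lines are ignored.
        return new XElement("root",
            left.Zip(right, (a, b) =>
                new XElement("element",
                    new XElement("Entry1", a),
                    new XElement("Entry2", b))));
    }
}
```

To append to an existing file instead of building a new root, one option is `XDocument doc = XDocument.Load(path); doc.Root.Add(EntryBuilder.BuildRoot(textBox1.Text, textBox2.Text).Elements()); doc.Save(path);`, where `path` is whatever file the application uses.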
The below code will give you a string of XML data from the textboxes:
private string createXmlTags(TextBox textBox1, TextBox textBox2)
{
string strXml = string.Empty;
string[] text1Val = textBox1.Text.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
string[] text2Val = textBox2.Text.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
int count = 1;
IList<XElement> testt = new List<XElement>();
for (int i = 0; i < text1Val.Length; i++)
{
testt.Add(new XElement("Entry" + count, text1Val[i]));
// Add the matching line from the second text box, if it exists and is not empty
if (i < text2Val.Length && !String.IsNullOrEmpty(text2Val[i]))
{
count = count + 1;
testt.Add(new XElement("Entry" + count, text2Val[i]));
}
count = count + 1;
}
foreach (var xElement in testt)
{
strXml += xElement.ToString();
}
return strXml;
}
You can then insert the code to an existing xml document. Follow: How can I build XML in C#? and How to change XML Attribute
Read here: XDocument or XmlDocument
I will have the decency of not copying the code from there. All the basics you need to know about creating an XML document are well explained there.
There are two options, I would personally go with XDocument.
I know there's no code in this answer but since you haven't tried anything, not even apparently searching Google (believe me, you'd find it), I'd rather point you in the right direction than "giving you the fish".
