How to get the xml contents without using the while loop - c#

I have an xml file which contains two start tags and end tags. And I need the contents within these two tags separately. Please check the below content.
<testing>
<test>
<text>test1</text>
</test>
<test>
<text>test2</text>
</test>
</testing>
As of now I am using a while loop and finding the start index and end index of the tags and then getting the contents using the substring method. Please check the below code.
string xml = File.ReadAllText(@"C:\testing_doc.txt");
int startindex = xml.IndexOf("<test>");
while (startindex > 0)
{
int endIndex = xml.IndexOf("</test>", startindex);
int length = endIndex - startindex;
string textValue = xml.Substring(startindex, length);
startindex = xml.IndexOf("<test>", endIndex); // getting the start index for the second test tag
}
Is there any other way to get the contents without using the while loop? Because using while seems to be kind of expensive and if text file is corrupted then it will cause other problems.
Thanks in advance,
Anish

You can use XPath, which is designed for querying XML, as follows:
var xml = @"<testing>
<test>
<text>test1</text>
</test>
<test>
<text>test2</text>
</test>
</testing>
";
var testing = XElement.Parse(xml);
var tests = testing.XPathEvaluate("test/text/text()") as IEnumerable;
foreach (var test in tests)
{
Console.WriteLine(test); // test1, test2
}

You could use the XmlDocument class, which is based on the W3C DOM (Document Object Model), together with XPath:
XmlDocument doc = new XmlDocument();
doc.Load(@"C:\testing_doc.txt");
XmlNodeList values = doc.SelectNodes("testing/test/text"); // using XPath
string str = string.Empty;
foreach (XmlNode x in values)
{
str += x.InnerText + ",";
}
str = str.TrimEnd(','); // TrimEnd returns a new string, so assign the result
Console.WriteLine(str); // test1,test2

If you want to do it manually, a regex can help:
string xml = File.ReadAllText(@"C:\testing_doc.txt");
string pattern = "<test>(.*?)</test>";
// Singleline lets '.' match the newlines inside each <test> element
foreach (Match match in Regex.Matches(xml, pattern, RegexOptions.Singleline))
{
System.Console.WriteLine(match.Groups[1].Value);
}
But consider using one of the available XML parsing libraries instead, such as XmlDocument or LINQ to XML.
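Since LINQ to XML is mentioned above, here is a minimal sketch of that route without XPath. The class and method names (TestReader, ReadTexts) are made up for illustration, and it assumes the file has the <testing>/<test>/<text> shape shown in the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class TestReader
{
    // Returns the <text> value of every <test> element directly under the root.
    public static List<string> ReadTexts(string xml)
    {
        return XElement.Parse(xml)                    // throws XmlException if the XML is malformed
            .Elements("test")                         // direct <test> children of <testing>
            .Select(t => (string)t.Element("text"))   // null when a <test> has no <text> child
            .ToList();
    }
}
```

Calling TestReader.ReadTexts(File.ReadAllText(path)) gives you one string per <test><text>test2</text></test></testing>");
if (texts.Count != 2) throw new System.Exception("expected two values");
if (texts[0] != "test1" || texts[1] != "test2") throw new System.Exception("unexpected values");
</test>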

Related

preserve &#xA; when reading XML

The XML content looks like the following:
<xml>
<item content="abcd &#xA; abcd&#xD;abcd" />
</xml>
When using XmlDocument to read the value of the content attribute, &#xA; and &#xD; are automatically escaped.
Code:
XmlDocument doc = new XmlDocument();
var content = doc.SelectSingleNode("/xml/item").Attributes["content"].Value;
How can I get the raw text without the character escaping?
If these characters were written to the lexical XML stream without escaping, then they would be swallowed by the XML parser when the stream is read by the recipient, as a result of the XML line-ending normalisation rules. So you've got it the wrong way around: the reason they are escaped is in order to preserve them; if they weren't escaped, they would be lost.
I got a workaround, it works for me:
private static string GetAttributeValue(XmlNode node, string attributeName)
{
if (node == null || string.IsNullOrWhiteSpace(attributeName))
{
throw new ArgumentException();
}
const string CharLF = "&#xA;";
const string CharCR = "&#xD;";
string xmlContent = node.OuterXml;
if (!xmlContent.Contains(CharLF) && !xmlContent.Contains(CharCR))
{
// no special char, return its original value directly
return node.Attributes[attributeName].Value;
}
string value = string.Empty;
if (xmlContent.Contains(attributeName))
{
value = xmlContent.Substring(xmlContent.IndexOf(attributeName)).Trim();
value = value.Substring(value.IndexOf("\"") + 1);
value = value.Substring(0, value.IndexOf("\""));
}
return value;
}
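If all you need is the character references back in the string, a simpler alternative might be to read the attribute value normally and re-encode the line-break characters yourself. This is only a sketch: the names (LineBreakEncoder, ReEncodeLineBreaks) are hypothetical, and it assumes the value you read contains real LF/CR characters where the references used to be.

```csharp
static class LineBreakEncoder
{
    // Turns real LF/CR characters back into XML character references.
    public static string ReEncodeLineBreaks(string value)
    {
        if (value == null)
        {
            return null;
        }
        return value
            .Replace("\r", "&#xD;")   // carriage return
            .Replace("\n", "&#xA;");  // line feed
    }
}
```

You would pass it node.Attributes[attributeName].Value. This avoids slicing OuterXml by hand, which can break when another attribute's name or value contains the attribute name.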

XmlException - given illegal XML from 3rd party; must process

There are several SO questions and answers about this when creating an XML file; but can't find any pertaining to when you are given bad XML from a 3rd party that you must process; note, the 3rd party cannot be held accountable for the illegal XML.
Ultimately, the .InnerText needs to be escaped or encoded (e.g. changed to legal XML characters) - and later decoded after proper XML parsing.
QUESTION: Are there any libraries that will Load() Invalid/Illegal XML files to allow quick navigation for such escaping/encoding? Or am I stuck having to manually parse the invalid xml, fixing it along the way ... ?
<?xml version="1.0" encoding="utf-8"?>
<ChunkData>
<Fields>
<Field1>some words < other words</Field1>
<Field2>some words > other words</Field2>
</Fields>
</ChunkData>
Although HtmlAgilityPack is awesome (and I'm using it in another project of my own), I was given no time to follow Alexei's advice, which is exactly the direction I was looking for: can't parse it as XML? Cool, parse it as HTML. That didn't even cross my mind.
Ended up with this, which does the trick (but is exactly what Alexei advised against):
private static string EncodeValues(string xml)
{
var doc = new List<string>();
var lines = xml.Split('\n');
foreach (var line in lines)
{
var output = line;
if (line.Contains("<Field") && !line.Contains("Fields>"))
{
var value = line.Parse(">", "</");
var encoded = HttpUtility.UrlEncode(value);
output = line.Replace(value, encoded);
}
doc.Add(output);
}
return string.Join("", doc);
}
private static Hashtable DecodeValues(IDictionary data)
{
var output = new Hashtable();
foreach (var key in data.Keys)
{
var value = (string)data[key];
output.Add(key, HttpUtility.UrlDecode(value));
}
return output;
}
Used in conjunction with an extension method I wrote quite a while ago:
public static string Parse(this string s, string first, string second)
{
try
{
if (string.IsNullOrEmpty(s)) return "";
var start = s.IndexOf(first, StringComparison.InvariantCulture) + first.Length;
var end = s.IndexOf(second, start, StringComparison.InvariantCulture);
var length = end - start;
return (end > 0 && length < s.Length) ? s.Substring(start, length) : s.Substring(start);
}
catch (Exception) { return ""; }
}
Used as such (kept separate from the Transform and Hashtable creation methods for clarity):
xmlDocs[0] = EncodeValues(xmlDocs[0]); // in order to handle illegal chars in XML, encode InnerText
var doc = TransformXmlDocument(orgName, xmlDocs[0], xmlDocs[1]);
var data = GetHashtableFromXml(doc);
data = DecodeValues(data); // decode the values extracted from the hashtable
Regardless, I'm always looking for insight ... feel free to comment on this solution - or provide another.
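One possible variation, offered as a sketch rather than a drop-in replacement: XML-escape the field values instead of URL-encoding them, so the parser decodes them for you and no separate decode pass is needed. The names (FieldRepair, EscapeFieldValues) are hypothetical. It assumes every value sits in a <FieldN>...</FieldN> element with no nested markup, and that the values contain no already-escaped entities (those would be escaped a second time).

```csharp
using System.Security;
using System.Text.RegularExpressions;

static class FieldRepair
{
    // Matches <Field1>...</Field1>, <Field2>...</Field2>, and so on.
    // Singleline lets a value span several lines.
    private static readonly Regex FieldPattern =
        new Regex(@"<(Field\d+)>(.*?)</\1>", RegexOptions.Singleline);

    // Escapes <, >, &, " and ' inside each field value so the result parses as XML.
    public static string EscapeFieldValues(string xml)
    {
        return FieldPattern.Replace(xml, m =>
            "<" + m.Groups[1].Value + ">" +
            SecurityElement.Escape(m.Groups[2].Value) +
            "</" + m.Groups[1].Value + ">");
    }
}
```

After this step the string can go straight into XDocument.Parse or XmlDocument.LoadXml, and InnerText / Value returns the original characters.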

Searching a String using C#

I have the following String "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>"
I need to get the id attribute value from the div tag. How can I retrieve it using C#?
Avoid parsing HTML with regex.
Regex is not a good choice for parsing HTML files: HTML is neither strict nor regular in its format.
Use HtmlAgilityPack instead. You can do it like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
List<string> itemList = doc.DocumentNode.SelectNodes("//div[@id]") // selects all divs that have an id attribute
.Select(x => x.Attributes["id"].Value) // selects the id attribute value
.ToList();
// itemList now contains the id attribute value of every div
If you're a masochist you can do this old school VB3 style:
string input = @"</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string startString = "div id='";
int startIndex = input.IndexOf(startString);
if (startIndex != -1)
{
startIndex += startString.Length;
int endIndex = input.IndexOf("'", startIndex);
string subString = input.Substring(startIndex, endIndex - startIndex);
}
Strictly solving the question asked, one of a myriad ways of solving it would be to isolate the div element, parse it as an XElement and then pull the attribute's value that way.
string bobo = "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string justDiv = bobo.Substring(bobo.IndexOf("<div"));
XElement xelem = XElement.Parse(justDiv);
var id = xelem.Attribute("id");
var value = id.Value;
There are certainly lots of ways to solve this but this one answers the mail.
A .NET Regex that looks something like this will do the trick
^</script><div id='(?<attrValue>[^']+)'.*$
you can then get hold of the value as
MatchCollection matches = Regex.Matches(input, @"^</script><div id='(?<attrValue>[^']+)'.*$");
if (matches.Count > 0)
{
string attrValue = matches[0].Groups["attrValue"].Value; // .Value gives the captured text
}

Escaping ONLY contents of Node in XML

I have a piece of code like the one below.
//Reading from a file and assign to the variable named "s"
string s = "<item><name> Foo </name></item>";
XmlDocument doc = new XmlDocument();
doc.LoadXml(s);
But it stops working if the content has characters such as "<" or ">".
string s = "<item><name> Foo > Bar </name></item>";
I know I have to escape those characters before loading, but if I do it like this:
doc.LoadXml(System.Security.SecurityElement.Escape(s));
the tags (<, >) are escaped as well, and as a result an error occurs.
How can I solve this problem?
A tricky solution:
string s = "<item><name> Foo > Bar </name></item>";
s = Regex.Replace(s, @"<[^>]+?>", m => HttpUtility.HtmlEncode(m.Value)).Replace("<","ojlovecd").Replace(">","cdloveoj");
s = HttpUtility.HtmlDecode(s).Replace("ojlovecd", ">").Replace("cdloveoj", "<");
XmlDocument doc = new XmlDocument();
doc.LoadXml(s);
Assuming your content will never contain the characters "]]>", you can use CDATA.
string s = "<item><name><![CDATA[ Foo > Bar ]]></name></item>";
Otherwise, you'll need to html encode your special characters, and decode them before you use/display them (unless it's in a browser).
string s = "<item><name> Foo &gt; Bar </name></item>";
Assign the content of string to the InnerXml property of node.
var node = doc.CreateElement("root");
node.InnerXml = s;
Take a look at - Different ways how to escape an XML string in C#
It looks like the strings you have generated are plain strings, not valid XML. You can either have them generated as valid XML or, if you know the strings are always going to be the name, leave the <item> and <name> tags out of the data.
Then, when you create the XmlDocument, call CreateElement and assign your string before re-saving the results.
XmlDocument doc = new XmlDocument();
XmlElement root = doc.CreateElement("item");
doc.AppendChild(root);
XmlElement name = doc.CreateElement("name");
name.InnerText = "the contents from your file";
root.AppendChild(name);
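The same idea can be sketched with LINQ to XML, which also escapes text content when it serializes. The names (ItemBuilder, BuildItemXml) are hypothetical, and it assumes the string read from the file is only the name, without the <item> and <name> tags.

```csharp
using System.Xml.Linq;

static class ItemBuilder
{
    // Builds <item><name>...</name></item>; special characters in the name
    // are escaped when the element is serialized.
    public static string BuildItemXml(string name)
    {
        var item = new XElement("item", new XElement("name", name));
        return item.ToString(SaveOptions.DisableFormatting);
    }
}
```

The returned string can be passed to doc.LoadXml, and reading the name element back yields the original text, including any "<", ">" or "&".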

How to encapsulate text into tags?

Let's say we have a string variable like this:
string input = "First regular, <b>bold</b>,<i>italic</i>,<u>underline</u>,<b><i><u>bold+italic+underline</u></i></b>";
The string can contain some HTML tags.
The question is: how can I wrap each "non-tagged" text part in some tag, to get something like this:
string output = "<plain>First regular, </plain><b>bold</b><plain>,</plain><i>italic</i><plain>,</plain><u>underline</u><plain>,</plain><b><i><u>bold+italic+underline</u></i></b>";
How can I do this in C#? With a regex? What should such a regex look like?
Maybe encapsulation isn't a good start. What I need is to create an XML structure from:
string input = "First regular, <b>bold</b>,<i>italic</i>,<u>underline</u>,<b><i><u>bold+italic+underline</u></i></b>";
I need to create
XDocument xml = XDocument.Parse("<plain>First regular, </plain><b>bold</b><plain>,</plain><i>italic</i><plain>,</plain><u>underline</u><plain>,</plain><b><i><u>bold+italic+underline</u></i></b>");
This code is a bit of a hack, but it should put you on the right path:
string input = "First regular, <b>bold</b>,<i>italic</i>,<u>underline</u>,<b><i><u>bold+italic+underline</u></i></b>";
input = "<data>" + input + "</data>";
XmlDocument xml = new XmlDocument();
xml.InnerXml = input;
XmlNodeList nodes = xml.SelectNodes("//text()");
foreach (XmlNode node in nodes) {
if (node.ParentNode.Name != "b" && node.ParentNode.Name != "i" && node.ParentNode.Name != "u") {
node.InnerText = "^^^^^" + node.InnerText + "$$$$$";
}
}
input = xml.DocumentElement.InnerXml.Replace("^^^^^", "<plain>").Replace("$$$$$", "</plain>");
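A sketch of the same approach with LINQ to XML, which avoids the placeholder markers by replacing the text nodes directly. The names (PlainWrapper, WrapPlainText) and the temporary <data> wrapper are assumptions for illustration, and the input has to be well-formed once it is wrapped.

```csharp
using System.Linq;
using System.Xml.Linq;

static class PlainWrapper
{
    // Wraps every top-level text run in <plain>...</plain>, leaving tagged parts untouched.
    public static string WrapPlainText(string input)
    {
        // A temporary root is needed because the fragment has several top-level nodes.
        var root = XElement.Parse("<data>" + input + "</data>", LoadOptions.PreserveWhitespace);

        // ToList() takes a snapshot so the tree can be modified while iterating.
        foreach (var text in root.Nodes().OfType<XText>().ToList())
        {
            text.ReplaceWith(new XElement("plain", text.Value));
        }

        // Serialize the children only, dropping the temporary root.
        return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Only text sitting directly under the root is wrapped, so text inside <b>, <i> or <u> stays as it is. Note that the result still has several top-level elements, so it needs a single root element before it can be loaded with XDocument.Parse.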
