html to XSLT conversion using C# - c#

I am trying to change a html page to a xslt page using C#,
for example if i have something like
#companyname#
i have to convert it into
<xsl:value-of select="test/companyname" />
I have a xsl file which has all these values. I dont want to replace the values here as they are to be further processed before replacing the original values.
The problem i am facing here is i have a trouble identifying(to replace the xml construct) if the value is in the attribute level of the tag or in the value level of the tag.
I am trying to use the regular expressions on it . Can someone help??

Html Agility Pack is the way to go. Don't forget to add the reference to it. This code illustrates one way of using HTML Agility Pack to create an XSLT which is what I think you want to do.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(#"<html>" +
"<a href='#compantnameURL1#'>#companyname1#</a>" +
"<a href='#compantnameURL2#'>#companyname2#</a>" +
"</html>");
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.IndentChars = (" ");
settings.Encoding = Encoding.UTF8;
using (XmlWriter writer = XmlWriter.Create(Console.Out, settings))
{
writer.WriteStartDocument();
writer.WriteStartElement("xsl", "stylesheet", "http://www.w3.org/1999/XSL/Transform");
writer.WriteStartElement("template", "http://www.w3.org/1999/XSL/Transform");
writer.WriteAttributeString("match", "/");
writer.WriteElementString("apply-templates", "http://www.w3.org/1999/XSL/Transform", "");
writer.WriteEndElement();
writer.WriteStartElement("template", "http://www.w3.org/1999/XSL/Transform");
writer.WriteAttributeString("match", "test/");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a"))
{
HtmlAttribute att = link.Attributes["href"];
writer.WriteStartElement("a");
writer.WriteStartElement("attribute", "http://www.w3.org/1999/XSL/Transform");
writer.WriteStartElement("value-of", "http://www.w3.org/1999/XSL/Transform");
writer.WriteAttributeString("select", att.Value);
writer.WriteEndElement();
writer.WriteEndElement();
writer.WriteStartElement("value-of", "http://www.w3.org/1999/XSL/Transform");
writer.WriteAttributeString("select", link.InnerText);
writer.WriteEndElement();
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.WriteEndDocument();
}

I'm not aware of a component that will get you all to XSLT, but the HTML Agility Pack is wonderful for any sort of HTML manipulation. The parser will provide a complete object tree with attributes, tags, styles, etc clearly defined, and it's easily queryable with XSLT.
Also, for a good discussion of parsing HTML with regex, see the first answer on this post.

Related

How to add data to an existing xml file in c#

I'm using this c# code to write data to xml file:
Employee[] employees = new Employee[2];
employees[0] = new Employee(1, "David", "Smith", 10000);
employees[1] = new Employee(12, "Cecil", "Walker", 120000);
using (XmlWriter writer = XmlWriter.Create("employees.xml"))
{
writer.WriteStartDocument();
writer.WriteStartElement("Employees");
foreach (Employee employee in employees)
{
writer.WriteStartElement("Employee");
writer.WriteElementString("ID", employee.Id.ToString());
writer.WriteElementString("FirstName", employee.FirstName);
writer.WriteElementString("LastName", employee.LastName);
writer.WriteElementString("Salary", employee.Salary.ToString());
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.WriteEndDocument();
}
Now suppose I restart my application and I want to add new data to the xml file without losing the existed data, using the same way will overwrite the data on my xml file, I tried to figure out how to do that and I searched for a similar example but I couldn't come to anything , any ideas ??
Perhaps you should look at some examples using datasets and xml:
http://www.codeproject.com/Articles/13854/Using-XML-as-Database-with-Dataset
or use System.Xml.Serialization.XmlSerializer, when you dont't have amount of records.
Example using XmlDocument
XmlDocument xd = new XmlDocument();
xd.Load("employees.xml");
XmlNode nl = xd.SelectSingleNode("//Employees");
XmlDocument xd2 = new XmlDocument();
xd2.LoadXml("<Employee><ID>20</ID><FirstName>Clair</FirstName><LastName>Doner</LastName><Salary>13000</Salary></Employee>");
XmlNode n = xd.ImportNode(xd2.FirstChild,true);
nl.AppendChild(n);
xd.Save(Console.Out);
Using an xml writer for small amounts of data is awkward. You would be better of using an XDocument that you either initialize from scratch for the first run, or read from an existing file in subsequent runs.
Using XDocument you can manipulate the XML with XElement and XAttribute instances and then write the entire thing out to a file when you want to persist it.

Xml within an Xml

I basically want to know how to insert a XmlDocument inside another XmlDocument.
The first XmlDocument will have the basic header and footer tags.
The second XmlDocument will be the body/data tag which must be inserted into the first XmlDocument.
string tableData = null;
using(StringWriter sw = new StringWriter())
{
rightsTable.WriteXml(sw);
tableData = sw.ToString();
}
XmlDocument xmlTable = new XmlDocument();
xmlTable.LoadXml(tableData);
StringBuilder build = new StringBuilder();
using (XmlWriter writer = XmlWriter.Create(build, new XmlWriterSettings { OmitXmlDeclaration = true }))
{
writer.WriteStartElement("dataheader");
//need to insert the xmlTable here somehow
writer.WriteEndElement();
}
Is there an easier solution to this?
Use importNode feature in your document parser.
You can use this code based on CreateCDataSection method
// Create an XmlCDataSection from your document
var cdata = xmlTable.CreateCDataSection("<test></test>");
XmlElement root = xmlTable.DocumentElement;
// Append the cdata section to your node
root.AppendChild(cdata);
Link : http://msdn.microsoft.com/fr-fr/library/system.xml.xmldocument.createcdatasection.aspx
I am not sure what you are really looking for but this can show how to merge two xml documents (using Linq2xml)
string xml1 =
#"<xml1>
<header>header1</header>
<footer>footer</footer>
</xml1>";
string xml2 =
#"<xml2>
<body>body</body>
<data>footer</data>
</xml2>";
var xdoc1 = XElement.Parse(xml1);
var xdoc2 = XElement.Parse(xml2);
xdoc1.Descendants().First(d => d.Name == "header").AddAfterSelf(xdoc2.Elements());
var newxml = xdoc1.ToString();
OUTPUT
<xml1>
<header>header1</header>
<body>body</body>
<data>footer</data>
<footer>footer</footer>
</xml1>
You will need to write the inner XML files in CDATA sections.
Use writer.WriteCData for such nodes, passing in the inner XML as text.
writer.WriteCData(xmlTable.OuterXml);
Another option (thanks DJQuimby) is to encode the XML to some XML compatible format (say base64) - note that the encoding used must be XML compatible and that some encoding schemes will increase the size of the encoded document (base64 adds ~30%).

Read value from HTML node

I'm new to XML/HTML-parsing. Don't even know the right words to do a proper search for duplicates.
I have this HTML file which looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">
<span fontFamily="SchoolHouse Cursive B" fontSize="18">I'm great!</span>
</p>
Now I need 00:00:00, 00:00:29 and I'm great! from it. I could read it like this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
if (reader.LocalName == "span")
{
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
}
I get values in variables a, b and c. But there was a slight change in HTML format. Now the HTML looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">I'm great! </p>
In this scenario how do I parse out 00:00:00, 00:00:29 and I'm great! ? I tried this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
But I get this error: This document already has a 'DocumentElement' node. at line doc.Load(reader). How to read correctly and what's causing the trouble? I am using .NET 2.0
It looks like you have HTML that you want to parse with a XML parser. That may also be the reason why you get the This document already has a 'DocumentElement' node. exception: because you have more than one root node, which is allowed (or better: tolerated) in HTML, but not XML.
Use an HTML parser instead. Unfortunatelly there is nothing built-in within the .NET framework. You have to take a third party library for that. A very good one is the HTML agility pack, that oleksii already mentioned in his comment.
Edit:
From your comments, I get the feeling your not familiar with the fact that there is no direct relation between HTML and XML. The graphic taken from here illustrates this quite well:
Neither is XML a subset of HTML, nor the other way around. Only if you have strict XHTML (rarely the case), you have an HTML document that can be parsed with an XML parser. But be aware if there is some mistake in the code of such an XHTML document, the parser will fail, while a common browser will continue to display the page. Also, the future of XHTML is quite unclear, now that HTML5 is comming to life slowly but steadily...
To sum up: To avoid all those pitfalls, take the easy road and go for an HTML parser.
Since you are wanting to parse HTML, you could use WebClient (or WebBrowser) to load the page and then use the HTML DOM to navigate through it. You need to add a reference to Microsoft HTML Object Library (COM) for the following code example:
string html;
WebClient webClient = new WebClient();
using (Stream stream = webClient.OpenRead(new Uri("http://www.google.com")))
using (StreamReader reader = new StreamReader(stream))
{
html = reader.ReadToEnd();
}
IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
doc.write(html);
foreach (IHTMLElement el in doc.all)
Console.WriteLine(el.tagName);
I have tried loading HTML into XML before, and its all too hard - fixing up unclosed tags (like <BR>), putting quotes around attributes, giving attributes without values a value, etc. Since I wanted to then use XSLT against it, after loading into the HTML DOM and navigated through it creating the relevant XML node for each HTML node. Then I had a proper XML representation of the HTML.

Parsing html -> xml and querying with Xpath

I want to parse a html page to get some data.
First, I convert it to XML document using SgmlReader.
Then, I load the result to XMLDocument and then navigate through XPath:
//contains html document
var loadedFile = LoadWebPage();
...
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = new StringReader(loadedFile);
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
This code works fine for most cases, except on this site - www.arrow.com (try to search something like OP295GS). I can get a table with result using the following XPath:
var node = doc.SelectSingleNode(".//*[#id='results-table']");
This gives me a node with several child nodes:
[0] {Element, Name="thead"}
[1] {Element, Name="tbody"}
[2] {Element, Name="tbody"}
FirstChild {Element, Name="thead"}
Ok, let's try to get some child nodes using XPath. But this doesn't work:
var childNodes = node.SelectNodes("tbody");
//childnodes.Count = 0
This also:
var childNode = node.SelectSingleNode("thead");
// childNode = null
And even this:
var childNode = doc.SelectSingleNode(".//*[#id='results-table']/thead")
What can be wrong in Xpath queries?
I've just tried to parse that HTML page with Html Agility Pack and my XPath queries work good. But my application use XmlDocument inside, Html Agility Pack doesn't suit me.
I even tried the following trick with Html Agility Pack, but Xpath queries doesn't work also:
//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));
XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);
Perhaps, web page contains errors (not all tags are closed and so on), but in spite of this I can see child nodes (through Quick Watch in Visual Studio), but cannot access them through XPath.
My XPath queries works correctly in Firefox + FirePath + XPather plugins, but don't work in .net XmlDocument :(
I have not used SqmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node has a namespace (note the xmlns:javaurlencoder):
<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">
This code is how I loop through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for or any of its parents have namespaces, you must create a XmlNamespaceManager and pass it along with your call to SelectNodes().
This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into a XmlDocument. Then, you won't need to fool with XmlNamespaceManager!
XmlDocument doc = new XmlDocument();
doc.Load(#"C:\temp\X.loadtest.xml");
Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
if (n.NodeType != XmlNodeType.Element) continue;
if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
{
namespaces.Add(n.Name, n.NamespaceURI);
}
}
// Inspect the namespaces dictionary to write the code below
XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI);
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder");
XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
// Do stuff
}
To be honest when I am trying to get information from a website I use regex.
Ok Kore Nordmann (in his php blog) thinks, this is not good. But some of the comments tell differently.
http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
But it is in php, so sorry for this =) Hope it helps anyway.

How to save XmlDocument with multiple indentation settings?

I need to save one XmlDocument to file with proper indentation (Formatting.Indented) but some nodes with their children have to be in one line (Formatting.None).
How to achieve that since XmlTextWriter accept setting for a whole document?
Edit after #Ahmad Mageed's resposne:
I didn't know that XmlTextWriter settings can be modified during writing. That's good news.
Right now I'm saving xmlDocument (which is already filled with nodes, to be specific it is .xaml page) this way:
XmlTextWriter writer = new XmlTextWriter(filePath, Encoding.UTF8);
writer.Formatting = Formatting.Indented;
xmlDocument.WriteTo(writer);
writer.Flush();
writer.Close();
It enables indentation in all nodes, of course. I need to disable indentation when dealing with all <Run> nodes.
In your example you write to XmlTextWriter "manually".
Is there an easy way to crawl through all xmlDocument nodes and write them to XmlTextWriter so I can detect <Run> nodes? Or do I have to write some kind of recursive method that will go to every child of current node?
What do you mean by "since XmlTextWriter accept setting for a whole document?" The XmlTextWriter's settings can be modified, unlike XmlWriter's once set. Similarly, how are you using XmlDocument? Please post some code to show what you've tried so that others have a better understanding of the issue.
If I understood correctly, you could modify the XmlTextWriter's formatting to affect the nodes you want to appear on one line. Once you're done you would reset the formatting back to be indented.
For example, something like this:
XmlTextWriter writer = new XmlTextWriter(...);
writer.Formatting = Formatting.Indented;
writer.Indentation = 1;
writer.IndentChar = '\t';
writer.WriteStartElement("root");
// people is some collection for the sake of an example
for (int index = 0; index < people.Count; index++)
{
writer.WriteStartElement("Person");
// some node condition to turn off formatting
if (index == 1 || index == 3)
{
writer.Formatting = Formatting.None;
}
// write out the node and its elements etc.
writer.WriteAttributeString("...", people[index].SomeProperty);
writer.WriteElementString("FirstName", people[index].FirstName);
writer.WriteEndElement();
// reset formatting to indented
writer.Formatting = Formatting.Indented;
}
writer.WriteEndElement();
writer.Flush();

Categories

Resources