Extracting data from an XML document without using an XML parser - c#

Here are some lines of the document:
<div class="rowleft">
<h3>Technical Fouls</h3>
<table class="num-left">
<tr class="datahl2b">
<td> </td>
<td>Players</td>
</tr>
<tr>
<td>DAL</td>
<td>
None</td>
</tr>
<tr>
<td>MIA</td>
<td>
Mike Miller</td>
<td>
Mike Miller, Jr.</td>
</tr>
</table>
</div>
I'm interested in extracting the values None, Mike Miller, and Mike Miller, Jr. from this. I tried various XML parsers, but 1) the performance is abysmal and 2) the document is apparently not well-formed XML.
One thing I've been thinking about is stripping the document of newlines, splitting it at something like <tr>, seeing which lines contain data (probably using StartsWith()), and extracting it with a regex. That would be efficient enough for my program (it doesn't really matter that parsing takes half a second when downloading the document takes five seconds), but I'm interested in alternative solutions.

HTML generally isn't well-formed XML. I suggest you use something like the HTML Agility Pack.

Trying to parse HTML with string manipulation and regexes is invariably going to be horribly error-prone.
If your document is not well-formed XML, I would recommend using the HTML Agility Pack.
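As a rough illustration of that suggestion, here is a minimal sketch that pulls the player names out of the sample table with the HTML Agility Pack. The class and method names (FoulsScraper, ExtractPlayers) are made up for this example, and the XPath assumes the real page looks like the snippet in the question: a table with class "num-left" whose header row is the only row carrying a class attribute.

```csharp
// Requires the HtmlAgilityPack NuGet package.
using System.Collections.Generic;
using HtmlAgilityPack;

static class FoulsScraper
{
    // Hypothetical helper: maps each team code (first cell) to the
    // player names found in the remaining cells of its row.
    public static Dictionary<string, List<string>> ExtractPlayers(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, List<string>>();

        // Assumption: data rows have no class attribute, the header row does.
        var rows = doc.DocumentNode.SelectNodes(
            "//table[@class='num-left']//tr[not(@class)]");
        if (rows == null) return result; // SelectNodes returns null on no match

        foreach (var row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 2) continue;

            string team = HtmlEntity.DeEntitize(cells[0].InnerText).Trim();
            var players = new List<string>();
            for (int i = 1; i < cells.Count; i++)
            {
                // Trim() removes the newline that precedes each name in the markup.
                players.Add(HtmlEntity.DeEntitize(cells[i].InnerText).Trim());
            }
            result[team] = players;
        }
        return result;
    }
}
```

A value of "None" is returned as an ordinary string here; filtering it out is left to the caller.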

Related

Split a dynamically generated HTML file

My application converts several types of documents into HTML files, then exposes the generated files to users and search engine robots.
My problem is that some documents contain more than 100 pages, so the generated HTML file is huge.
I am looking for a way to split such an HTML file into several pages.
One possible solution is to split by size or number of characters, but that is difficult because the split has to respect the structure and styling of the HTML.
For example, consider the following HTML file:
<p>
-- So long paragraph with more than 100 lines
</p>
<table>
<tr>
<td> </td>
</tr>
...... more than 10 rows
</table>
The split mechanism should create several files for the paragraph and one file for the table, like the following:
PAGE1.HTML
<p>
-- contains 20 lines of original text
</p>
PAGE2.HTML
<p>
-- contains 20 lines of original text
</p>
PAGE3.HTML
<p>
-- contains 20 lines of original text
</p>
...
PAGE6.HTML
<p>
<table>
<tr>
<td> </td>
</tr>
...... more than 10 rows
</table>
</p>
Please advise if you know a better solution or a tool for achieving this.

You have to disentangle the content from the HTML. If you opt for an intermediate format that you control, you can generate HTML files that each hold an appropriate amount of content.
Trying to cut the document after the HTML is generated is the worse and less efficient option. You can try to navigate the HTML document using, for example, HtmlAgilityPack, but without intimate knowledge of which elements you generate and in what structure, it's hard to pinpoint how to perform the split. Again, it will be much harder than splitting the content before it becomes HTML.
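To make the "split before it becomes HTML" idea concrete, here is a small sketch under a strong simplifying assumption: the intermediate format is just a list of plain-text lines, and each page gets a fixed number of them. The Paginator class and the linesPerPage parameter are invented for this example. A real converter would also have to keep blocks such as tables together on one page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

static class Paginator
{
    // Hypothetical helper: groups plain-text lines into pages of at most
    // linesPerPage lines and only then wraps each page in HTML markup.
    public static List<string> ToPages(IReadOnlyList<string> lines, int linesPerPage)
    {
        if (linesPerPage <= 0)
            throw new ArgumentOutOfRangeException(nameof(linesPerPage));

        var pages = new List<string>();
        for (int i = 0; i < lines.Count; i += linesPerPage)
        {
            // Encode each line so characters like < and & stay valid in HTML.
            var chunk = lines.Skip(i)
                             .Take(linesPerPage)
                             .Select(line => WebUtility.HtmlEncode(line));
            pages.Add("<p>\n" + string.Join("\n", chunk) + "\n</p>");
        }
        return pages;
    }
}
```

With 100 lines and linesPerPage set to 20, this yields five pages, matching the PAGE1 to PAGE5 layout in the question; the table would be emitted as its own page by a separate step.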

Given a web response, how do I extract a specific portion for further processing?

I have some code that gets a web response. How do I take that response and search for a table using its CSS class (class="data")? Once I have the table, I need to extract certain field values. For example, in the sample markup below, I need the values of Field #3 and Field #5, so "85" and "1", respectively.
<table width="570" border="0" cellpadding="1" cellspacing="2" class="data">
<tr>
<td width="158"><strong>Field #1:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #2:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #3:</strong></td>
<td width="99">85</td>
<td width="119"><strong>Field #4:</strong></td>
<td width="176">-259.34</td>
</tr>
<tr>
<td width="158"><strong>Field #5:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #6:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #7:</strong></td>
<td width="99">12</td>
<td width="119"><strong>Field #8:</strong></td>
<td width="176">123.23</td>
</tr>
</table>

Use the HTML Agility Pack and parse the HTML. If you want to do it the simplest way then go grab its beta (it supports LINQ).
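A minimal sketch of what that could look like with the XPath support in the HTML Agility Pack (no LINQ beta needed). The names DataTableReader and GetField are made up for this example, and the code assumes that every label cell in the class="data" table is directly followed by its value cell, as in the sample markup.

```csharp
// Requires the HtmlAgilityPack NuGet package.
using HtmlAgilityPack;

static class DataTableReader
{
    // Hypothetical helper: returns the text of the cell that follows the
    // cell whose text equals the label (for example "Field #3:"), or null.
    public static string GetField(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // All cells of the table with class="data", in document order.
        var cells = doc.DocumentNode.SelectNodes("//table[@class='data']//td");
        if (cells == null) return null; // SelectNodes returns null on no match

        for (int i = 0; i + 1 < cells.Count; i++)
        {
            string text = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            if (text == label)
            {
                // Assumption: the value cell comes right after the label cell.
                return HtmlEntity.DeEntitize(cells[i + 1].InnerText).Trim();
            }
        }
        return null;
    }
}
```

For the sample markup, GetField(html, "Field #3:") and GetField(html, "Field #5:") would give the two values asked for.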

As Randolf suggests, using HTML Agility Pack is a good option.
But, if you have control of the format of the HTML, it is also possible to do string parsing to extract the values you are after.
It is nearly trivial to download the entire HTML as a string and search for the string "<table" followed by the string "class=\"data\"". Then you can easily extract the values you are after by doing similar string manipulations.
I'm not saying you should do this, because the resulting code will be harder to read and maintain than the code using HTML Agility Pack, but it will save you an external dependency and your code will probably perform much better.
In a WP7 app I made, I started using HTML Agility Pack to parse some HTML and extract some values. This worked well, but it was quite slow. Switching to the string parsing regime made my code many times faster while returning the exact same result.
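The string-search approach described above might look roughly like this. It is only a sketch: StringScraper and GetField are invented names, and the code assumes the exact layout of the sample, where the table carries class="data" with double quotes and every label cell is immediately followed by its value cell. Any change in the markup can silently break it.

```csharp
using System;

static class StringScraper
{
    // Hypothetical helper: finds the label after the class="data" marker
    // and returns the contents of the next <td>, or null if anything is missing.
    public static string GetField(string html, string label)
    {
        int table = html.IndexOf("class=\"data\"", StringComparison.Ordinal);
        if (table < 0) return null;

        int labelPos = html.IndexOf(label, table, StringComparison.Ordinal);
        if (labelPos < 0) return null;

        // Skip to the end of the label's own cell.
        int cellEnd = html.IndexOf("</td>", labelPos, StringComparison.Ordinal);
        if (cellEnd < 0) return null;

        // The value is in the next opening <td ...> tag after that.
        int nextTd = html.IndexOf("<td", cellEnd, StringComparison.Ordinal);
        if (nextTd < 0) return null;

        int valueStart = html.IndexOf('>', nextTd);
        if (valueStart < 0) return null;
        valueStart++; // move past the '>'

        int valueEnd = html.IndexOf("</td>", valueStart, StringComparison.Ordinal);
        if (valueEnd < 0) return null;

        return html.Substring(valueStart, valueEnd - valueStart).Trim();
    }
}
```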

HTML not displaying correctly in Outlook 2007 emails?

EDITED:
I have written some correct HTML and passed this as a string into an email,
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>/n<html>
<head>
</head>
<body>
<table>
<tr>
<th>X</th>
<th>Y</th>
</tr>
<tr>
<td>Overall</td>
<td>207,890</td>
</tr>
<tr>
<td>a</td>
<td>100,568</td>
</tr>
<tr>
<td>b</td>
<td>107,322</td>
</tr>
</table>
</body>
</html>
I rewrote the HTML to be extremely simple, using only a table, but it's still not showing.

Generally, email clients don't seem to like decently formatted HTML. From conversations I've had with HTML developers:
Use inline styles, even if that means repeating yourself. No style sheets, not even in the head.
No fancy floating of divs.
Put everything in tables for formatting.
Generally, pretend it's 1999.

Your problem is probably not only Outlook 2007 but most other email clients as well.
Make sure that your HTML is very simple and does not use many external resources. Inline CSS is probably necessary. This article is a nice summary: http://css-tricks.com/using-css-in-html-emails-the-real-story/

How can I get all content within a <td> tag using the HTML Agility Pack?

So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:
<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td>
The data I want is in here <br />
and it's seperated by these annoying <br /> 's.
No id's, classes, or even a single <p> tag. </p> Just a bunch of <br /> tags.
</td>
</tr>
</table>
So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?
Update: Here is how I'm loading my doc
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(Url);

Since you are using Html Agility Pack already I would suggest using the methods it provides to find the information you want. There are a few ways to navigate the document, but one of the most concise is to use XPath. In this case you could use something like this:
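HtmlDocument doc = new HtmlDocument();
doc.Load("input.html");
// XPath selects attributes with @; tr[2] is the second row of the table.
HtmlNode node = doc.DocumentNode
    .SelectNodes("//table[@cellspacing='3']/tr[2]/td")
    .Single(); // Single() needs "using System.Linq;"
string text = node.InnerText;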
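Since the data in that cell is separated by <br /> tags, InnerText alone runs the pieces together. One possible way to get them individually is to walk the cell's child nodes and start a new piece at every br element. This is only a sketch; BrSplitter and SplitOnBr are names invented for the example.

```csharp
// Requires the HtmlAgilityPack NuGet package.
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class BrSplitter
{
    // Hypothetical helper: returns the text pieces of a cell, split at <br> elements.
    public static List<string> SplitOnBr(HtmlNode cell)
    {
        var parts = new List<string>();
        var current = new StringBuilder();

        foreach (HtmlNode child in cell.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Comment) continue;

            if (child.Name == "br")
                Flush(current, parts);           // a <br> ends the current piece
            else
                current.Append(child.InnerText); // text nodes and nested elements
        }
        Flush(current, parts);                   // whatever follows the last <br>
        return parts;
    }

    // Adds the accumulated text to the list (skipping empty pieces) and resets the buffer.
    private static void Flush(StringBuilder current, List<string> parts)
    {
        string piece = HtmlEntity.DeEntitize(current.ToString()).Trim();
        if (piece.Length > 0) parts.Add(piece);
        current.Clear();
    }
}
```

Passing the td node selected by the XPath query to SplitOnBr gives one string per <br />-separated chunk.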

"Something else" is the best answer -- HTML is best parsed by an HTML parser rather than via regular expressions. I'm no C# expert, but I hear the HTML Agility Pack is well-liked for this purpose.

I'd say som̡et̨hińg Else

You'd probably get better mileage with an xml parser.

If you're using the Agility Pack already, then it's just a matter of using something like doc.DocumentNode.SelectNodes("//table[@cellspacing='3']") to get the table in the document. Try looking through the documentation and coding examples. Since you already have structured data, it's ridiculous to go back to the text data and reparse.

Regex matching tags

I have the following piece of text from which I'd like to extract all the <td ????>???</td> tags
<tr id=row509>
<td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
<td align=center class='style4'>23</td>
<td align=center class='style10'>22</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td id=rowtot509 align=center class='style6'>0</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td align=center class='style6'>0</td>
</tr>
The expected result would be:
1. <td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
2. <td align=center class='style4'>23</td>
3. <td align=center class='style10'>22</td>
[..]
Any help? Thanks

What's the problem with using an HTML or XML library?
Using XML and XPath, for instance, this would just be a case of doing xml / td, in whatever way the library API supports that.
Regex is a lousy way of doing that, because XML is not a regular language. Specifically, you can nest tags inside other tags, and this is something that can't be represented with regular expressions.
So, while it would be easy to create a regular expression for the simple case (<td.*?</td>), it would easily break if the XML changed just a bit.
Granted, the XML here is broken, but you can fix it with a regex. :-) For instance, if you replace the pattern (\w+)=(\w+) with $1='$2' (the substitution syntax .NET uses), you'll get valid XML.
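A small sketch of that repair-then-parse idea, using only the .NET base class library. RowParser and ExtractCells are names made up for this example. The regex is applied to the whole fragment, so it assumes the cell text itself never contains something of the form name=value.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class RowParser
{
    // Hypothetical helper: quotes bare attribute values so the row becomes
    // well-formed XML, then returns its <td> children.
    public static List<XElement> ExtractCells(string rowMarkup)
    {
        // id=row509 becomes id='row509'; class='style1' is already quoted
        // and is left alone because ' is not matched by \w.
        string quoted = Regex.Replace(rowMarkup, @"(\w+)=(\w+)", "$1='$2'");

        // Throws XmlException if the fragment is still not well-formed.
        XElement row = XElement.Parse(quoted);
        return row.Elements("td").ToList();
    }
}
```

Each returned element exposes the cell text through its Value property and the re-serialized tag through ToString().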

I would agree with Daniel, but if you really must use a regex - get yourself a copy of RegexBuddy so you can quickly debug your expression. Best $40 I've spent in a long time.

Regular expressions are a pretty fragile tool to use for this kind of problem, especially if there's any risk at all that a table's cell content could be another table. (In that case, the first </td> tag you find after a <td> start tag may not actually be closing that element but a descendant element.)
A much more robust way to tackle problems like these is to parse the HTML into a DOM and then examine the DOM. The HTML Agility Pack is one that people seem to like.
