Match Table w/ Regex - c#

I'm trying to match a table w/ regex but I'm having some issues. I can't figure out exactly why it will not match properly. Here is the HTML:
<table class="integrationteamstats">
<tbody>
<tr>
<td class="right">
<span class="mediumtextBlack">Queue:</span>
</td>
<td class="left">
<span class="mediumtextBlack">0</span>
</td>
<td class="right">
<span class="mediumtextBlack">Aban:</span>
</td>
<td class="left">
<span class="mediumtextBlack">0%</span>
</td>
<td class="right">
<span class="mediumtextBlack">Staffed:</span>
</td>
<td class="left">
<span class="mediumtextBlack">0</span>
</td>
</tr>
<tr>
<td class="right">
<span class="mediumtextBlack">Wait:</span>
</td>
<td class="left">
<span class="mediumtextBlack">0:00</span>
</td>
<td class="right">
<span class="mediumtextBlack">Total:</span>
</td>
<td class="left">
<span class="mediumtextBlack">0</span>
</td>
<td class="right">
<span class="mediumtextBlack">On ACD:</span>
</td>
<td class="left">
<span class="mediumtextBlack">0</span>
</td>
</tr>
</tbody>
</table>
I need to get 2 pieces of information:
the data inside the td below Queue and the data inside the td below Wait (so the Queue count and wait time). Obviously the numbers are going to update frequently.
This is the regex I have for pulling the initial table, but it isn't working:
Match statstable = Regex.Match(this.html, "<table class=\"integrationteamstats\">(.*?)</table>");
And I'm not sure what regex I should use to get the data from the td's.
Before anyone asks: no, there is no way I can update the HTML to have an ID or anything of that nature. It's pretty much as is. The only thing that is consistent is the location of the td's.

Instead of regex, I suggest using the HTML Agility Pack to parse the HTML and query its structure.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
In general, regex is a poor choice for parsing HTML.
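As a rough illustration of querying the structure rather than pattern-matching the text, the sketch below looks up the cell that follows a given label with an XPath expression. It uses System.Xml.Linq from the base class library as a stand-in for the HTML Agility Pack, which only works here because this trimmed sample happens to be well-formed XML; on real-world HTML the same XPath would be passed to the HTML Agility Pack instead. The helper name ValueAfterLabel is hypothetical.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

// Trimmed copy of the question's table; assumed to be well-formed XML.
var doc = XDocument.Parse(@"<table class='integrationteamstats'><tbody>
  <tr>
    <td class='right'><span class='mediumtextBlack'>Queue:</span></td>
    <td class='left'><span class='mediumtextBlack'>0</span></td>
    <td class='right'><span class='mediumtextBlack'>Aban:</span></td>
    <td class='left'><span class='mediumtextBlack'>0%</span></td>
  </tr>
  <tr>
    <td class='right'><span class='mediumtextBlack'>Wait:</span></td>
    <td class='left'><span class='mediumtextBlack'>0:00</span></td>
  </tr>
</tbody></table>");

// Hypothetical helper: find the td whose span text equals the label,
// then take the span inside the td that immediately follows it.
string ValueAfterLabel(string label) =>
    doc.XPathSelectElement(
        "//table[@class='integrationteamstats']" +
        $"//td[normalize-space(span)='{label}']/following-sibling::td[1]/span")
       ?.Value.Trim();

string queue = ValueAfterLabel("Queue:");
string wait = ValueAfterLabel("Wait:");
Console.WriteLine($"Queue={queue}, Wait={wait}");
```

Keying on the label text instead of the cell position means the lookup should keep working if cells are reordered, as long as the labels themselves stay the same.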

Related

Datalist not displaying items ASP.NET

I have a serious problem with DataList and I HATE IT so much. I have a list of data that displays cart items in a table... basic, right?
Sure, but not for DataList! It's like DataList says I MUST DESTROY YOU. The DataList does not show any of the items that it should display!
Here are some screenshots that will let you understand my issue.
So basically the items are displayed in VS, but not in the web application.
PLEASE HELP :(
Here is the code:
<div class="container-sm cart-page">
<table>
<tr>
<th>المنتج</th>
<th>الكمية</th>
<th>السعر الفرعي</th>
</tr>
<asp:DataList ID="DataList1" runat="server" RepeatColumns="1" RepeatLayout="Flow">
<ItemTemplate>
<tr>
<td>
<div class="cart-info">
<img src="http://bestjquery.com/tutorial/product-grid/demo8/images/img-1.jpg" alt="camera">
<div>
<p>Camera 211</p>
<small>السعر: 50 ر.س.</small>
<br>
حذف
</div>
</div>
</td>
<td>
<input type="number" value="1" min="1" max="10"></td>
<td>50 ر.س.</td>
</tr>
</ItemTemplate>
</asp:DataList>
</table>
<div class="total-price">
<table>
<tr>
<td class="titles">السعر الفرعي</td>
<td>150 ر.س.</td>
</tr>
<tr>
<td class="titles">VAT</td>
<td>22.50 ر.س.</td>
</tr>
<tr>
<td class="titles">المجموع</td>
<td>172.50 ر.س.</td>
</tr>
<tr>
<td>
<a href="store.aspx">
<button type="submit" class="btn btn-primary btn-continue-shopping">إكمال التسوق</button>
</a>
</td>
<td>
<a href="checkout.aspx">
<button type="submit" class="btn btn-primary btn-checkout">إكمال الدفع</button>
</a>
</td>
</tr>
</table>
</div>
</div>
You need to set the DataSource property to some list or table of data, and then call DataBind() so the control actually renders the items.
Here is a link to the MSDN article: https://learn.microsoft.com/en-us/dotnet/api/system.web.ui.webcontrols.basedatalist.datasource?view=netframework-4.8#System_Web_UI_WebControls_BaseDataList_DataSource
What is your data source? Are you using a database?
If you do not have a database you can mock up a dataset in XML, JSON, or CSV and then deserialize it in your code.
I believe that you can bind anything that implements IList.

Parse HTML in c#

So I am trying to get SCN08_SS_GetCustomer_CAM from this HTML code.
<tr>
<td class="line2left bordered">
<div class="tablelabel typped virtualuser" style="margin-left:00px">SCN08_SS_GetCustomer_CAM</div>
</td>
<td class="line2right bordered">875.2</td>
<td class="line2right bordered">875.2</td>
<td class="line2right bordered">875.2</td>
<td class="line2right bordered">1</td>
<td class="line2right bordered">0</td>
<td class="line2right bordered">0</td>
<td class="line2right bordered"></td>
</tr>
I am basically building a desktop application using WPF. Coding in .net c#.
In HtmlAgilityPack there is a GetElementbyId method but no equivalent for getting an element by class. And in this HTML code there is no id. Hence I will have to get it by class.
So any ideas on how to code this guys?
Here is a nice and simple application
using System;
using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(@"pathtoyourpage.html");
// The div has no id, so select it by its exact class attribute value.
var result = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='tablelabel typped virtualuser']").InnerText;
Console.WriteLine(result);
Haven't tested it though

XPath drops contents of td column on an HTML page for screen scraping

Below you find an excerpt of code used to screen scrape an economic calendar.
The HTML page that it parses using XPath includes this row as the first row
in a table. (Only pasted this row instead of the entire HTML page.)
<tr class="calendar_row newday singleevent" data-eventid="42064"> <td class="date"><div class="date">Sun<div>Dec 23</div></div></td> <td class="time">All Day</td> <td class="currency">JPY</td> <td class="impact"> <div title="Non-Economic" class="holiday"></div> </td> <td class="event"><div>Bank Holiday</div></td> <td class="detail"><a class="calendar_detail level1" data-level="1"></a></td> <td class="actual"> </td> <td class="forecast"></td> <td class="previous"></td> <td class="graph"></td> </tr>
This code selects the first tr row using XPath:
var doc = new HtmlDocument();
doc.Load(new StringReader(html));
var rows = doc.DocumentNode.SelectNodes("//tr[@class=\"calendar_row\"]");
var rowHtml = rows[0].InnerHtml;
The problem is that rowHtml returns this:
<td class="date"></td> <td class="time">All Day</td> <td class="currency">EUR</td> <td class="impact"> <div title="Non-Economic" class="holiday"></div> </td> <td class="event"> <div>French Bank Holiday</div> </td> <td class="detail"><a class="calendar_detail level2" data-level="2"></a></td> <td class="actual"> </td> <td class="forecast"></td> <td class="previous"></td> <td class="graph"></td>
Now you can see that the contents of the td column for the date vanished! Why?
I've tried many things and am stumped as to why it drops the contents of that column.
The other columns have content that it keeps. So what's wrong with the date column?
Is there some kind of setting or property somewhere to cause or prevent dropping contents?
Even if you haven't got a clue what's wrong, suggestions on how to investigate it further would help.
Like @AlexeiLevenkov mentioned, you must be selecting a different row than the one you want. You've pruned too much of the essential problem away in an effort to simplify, but it's still clear what's wrong...
Consider that your input document might basically look like this:
<?xml version="1.0" encoding="UTF-8"?>
<table>
<tr class="calendar_row" data-eventid="12345">
<td>This IS NOT the tr you're looking for</td>
</tr>
<tr class="calendar_row newday singleevent" data-eventid="42064">
<td>This IS the tr you're looking for</td>
</tr>
</table>
The test @class="calendar_row" won't match against the tr you show, but it will match against the first row.
You could change your test to be contains(@class,'calendar_row') instead, but that would match both rows. You're going to have to identify some content or attribute that's unique to the row you desire. Perhaps the @data-eventid attribute would work -- can't tell without seeing your whole input file.
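The difference between the three tests can be checked with a small sketch like the one below. It runs the XPath expressions against the two-row example document using System.Xml.Linq from the base class library, purely so the snippet has no external dependency; the cell texts are made up, and with the HTML Agility Pack the same expressions would go to SelectNodes as in the question.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Two-row document modelled on the example above (cell texts are invented).
var doc = XDocument.Parse(@"<table>
  <tr class='calendar_row' data-eventid='12345'><td>not wanted</td></tr>
  <tr class='calendar_row newday singleevent' data-eventid='42064'><td>wanted</td></tr>
</table>");

// Exact comparison: only a row whose class attribute is exactly "calendar_row".
var exact = doc.XPathSelectElements("//tr[@class='calendar_row']").ToList();

// Substring test: any row whose class attribute contains "calendar_row".
var loose = doc.XPathSelectElements("//tr[contains(@class,'calendar_row')]").ToList();

// Attribute that is unique to the wanted row.
var byId = doc.XPathSelectElement("//tr[@data-eventid='42064']");

Console.WriteLine($"exact: {exact.Count}, contains: {loose.Count}");
Console.WriteLine($"by data-eventid: {byId?.Value}");
```

Note that contains() is a plain substring test, so it would also accept a class such as calendar_row_old; a unique attribute like data-eventid avoids that ambiguity.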

Multiline regex- works in RegexBuddy and online tester, but doesn't work in c#

I have some C# code, a regular expression, and an HTML source file. The regex works fine in RegexBuddy and http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx, but not in Visual Studio. Please help and explain what is wrong.
What I expect is:
Found 3 matches:
/setcard/?set=4387740&t=1&secure=xJHC9dYymGSnImebS4qLPw%3D%3D" onclick="return openUrl(this.href);" >Username</a></td> <td bgcolor="#F6F6F6">Message1 example text</td> <td bgcolor="#F6F6F6">16.10.11 23:20</td> has 5 groups:
1. 4387740
2. Username
3. 49244417
4. Message1 example text
5. 16.10.11 23:20
/setcard/?set=4387740&t=1&secure=xJHC9dYymGSnImebS4qLPw%3D%3D" onclick="return openUrl(this.href);" >Username2</a></td> <td>Message2 example text</td> <td>16.10.11 14:42</td> has 5 groups:
1. 4387740
2. Username2
3. 49223017
4. Message2 example text
5. 16.10.11 14:42
/setcard/?set=4387740&t=1&secure=xJHC9dYymGSnImebS4qLPw%3D%3D" onclick="return openUrl(this.href);" >Username3</a></td> <td bgcolor="#F6F6F6">Message3 example text</td> <td bgcolor="#F6F6F6">16.10.11 14:34</td> has 5 groups:
1. 4387740
2. Username3
3. 49222720
4. Message3 example text
5. 16.10.11 14:34
Regex
@"/setcard/\?set=([0-9]*).*;"" >(.*)</a></td>$\s.*/msg/\?id=([0-9]*).*ref\);"">(.*)</a></td>$\s\s?.*>(.*)</td>$"
C# Code
using (StreamReader rdr = File.OpenText("file.html"))
{ s = rdr.ReadToEnd(); }
Regex listMsgs = new Regex(@"/setcard/\?set=([0-9]*).*;"".>(.*)</a></td>$
.*/msg/\?id=([0-9]*).*ref\);"">(.*)</a></td>$
?.*>(.*)</td>$", RegexOptions.Multiline);
Match m = listMsgs.Match(s);
while (m.Success)
{}
HTML Source
<td bgcolor="#F6F6F6" class="c1"><IMG BORDER="0" SRC="transparent.gif" width="15px" height="15px" /></td>
<td bgcolor="#F6F6F6" style="width:108px"><a href="../../auswertung/setcard/?set=4387740&t=1&secure=xJHC9dYymGSnImebS4qLPw%3D%3D" onclick="return openUrl(this.href);" >Username</a></td>
<td bgcolor="#F6F6F6">Message1 example text</td>
<td bgcolor="#F6F6F6">16.10.11 23:20</td>
<td bgcolor="#F6F6F6">
</td>
<td bgcolor="#F6F6F6" align="center">
<img src="message_art1.gif" width="14" height="10" border="0" /> </td>
<td bgcolor="#F6F6F6"><input type="checkbox" name="messages[]" id="id_msg_1" value="49244417"></td>
</tr>
<tr height="20">
<td class="c1"><IMG BORDER="0" SRC="transparent.gif" width="15px" height="15px" /></td>
<td style="width:108px"><a href="../../auswertung/setcard/?set=4387740&t=1&secure=xJHC9dYymGSnImebS4qLPw%3D%3D" onclick="return openUrl(this.href);" >Username2</a></td>
<td>Message2 example text</td>
<td>16.10.11 14:42</td>
<td>
</td>
<td align="center">
2 </td>
<td><input type="checkbox" name="messages[]" id="id_msg_2" value="49223017"></td>
</tr>
<tr height="20">
<td bgcolor="#F6F6F6" class="c1"><IMG BORDER="0" SRC="transparent.gif" width="15px" height="15px" /></td>
<td bgcolor="#F6F6F6" style="width:108px"><a href="../../auswertung/setcard/?set=4387740&t=1&secure=xJHC9dYymGSnImebS4qLPw%3D%3D" onclick="return openUrl(this.href);" >Username3</a></td>
<td bgcolor="#F6F6F6">Message3 example text</td>
<td bgcolor="#F6F6F6">16.10.11 14:34</td>
<td bgcolor="#F6F6F6">
</td>
<td bgcolor="#F6F6F6" align="center">
2 </td>
<td bgcolor="#F6F6F6"><input type="checkbox" name="messages[]" id="id_msg_3" value="49222720"></td>
</tr>
<tr height="20">
This regex yields the expected result for me:
@"/setcard/\?set=([0-9]*).*?;""\s*>(.*?)</a></td>\s*.*?/msg/\?id=([0-9]*).*?ref\);"">(.*?)</a></td>\s*.*>(.*?)</td>"
It looks like you're using the $ metacharacter to match newlines, which is incorrect. That's a zero-width assertion: it matches the position just before a newline, without consuming the newline character(s). That means the .* following the $ has to consume it, but of course the dot doesn't match newlines.
There's really no point in using the anchor ($) in this case; you have to consume the newlines anyway, so just match them the way you match any other characters. If the newlines were required I would suggest using [\r\n]+, which will match one or more of any kind of newlines, whether they're \r\n (DOS/Windows style), \r (pre-OSX Mac), or \n (everything else). But in this case I don't think you need to be that specific; \s* (zero or more of any whitespace characters) seems to work fine. You also don't need the Multiline option any more.
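A minimal sketch of that behaviour, using a made-up two-line input with "\n" line endings rather than the question's file: the first pattern relies on $ to get past the line break and fails, while the second consumes the line break with \s* and matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: two table cells on separate lines, "\n" line endings.
string html = "<td>first</td>\n<td>second</td>";

// '$' only asserts a position before the newline and consumes nothing;
// '.' cannot match '\n', so this pattern can never reach the second line.
bool withAnchor = Regex.IsMatch(
    html, @"<td>(.*)</td>$.*<td>(.*)</td>", RegexOptions.Multiline);

// Consuming the line break explicitly with \s* lets the match continue.
Match m = Regex.Match(html, @"<td>(.*?)</td>\s*<td>(.*?)</td>");

Console.WriteLine(withAnchor);          // False
Console.WriteLine(m.Groups[1].Value);   // first
Console.WriteLine(m.Groups[2].Value);   // second
```

The second pattern needs no RegexOptions at all, which matches the remark that the Multiline option becomes unnecessary once the anchors are gone.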
Anywhere you're expecting any sort of newline character, you need to use a '\s' in your regex. Check out a similar question and answer here for more details.

Convert HTML Tags Using C#

I have a situation where I need to convert a nested HTML list like this:
<ol>
<li>Item 1
<ol>
<li> Item 1.1
<ol>
<li>Item 1.1.1</li>
<li>Item 1.1.2</li>
</ol>
</li>
</ol>
</li>
<li>Item 2
<ol>
<li> Item 2.1
<ol>
<li>Item 2.1.1</li>
<li>Item 2.1.2</li>
</ol>
</li>
</ol>
</li>
</ol>
Into separate indented tables (where each ol is a table, indented properly to look like the nested table). What would be the best way to do this? I've looked at the HtmlAgilityPack, but I couldn't figure out how to replace tags once I found them (I was able to find all the appropriate tags, but couldn't do anything with them)...
Basically, I need the output table(s) to look something like this:
<table>
<tr>
<td>
•
</td>
<td>
Item 1
</td>
</tr>
</table>
<table style="margin-left: 5px;">
<tr>
<td>
•
</td>
<td>
Item 1.1
</td>
</tr>
</table>
<table style="margin-left: 10px;">
<tr>
<td>
•
</td>
<td>
Item 1.1.1
</td>
</tr>
<tr>
<td>
•
</td>
<td>
Item 1.1.2
</td>
</tr>
</table>
<table>
<tr>
<td>
•
</td>
<td>
Item 2
</td>
</tr>
</table>
<table style="margin-left: 5px;">
<tr>
<td>
•
</td>
<td>
Item 2.1
</td>
</tr>
</table>
<table style="margin-left: 10px;">
<tr>
<td>
•
</td>
<td>
Item 2.1.1
</td>
</tr>
<tr>
<td>
•
</td>
<td>
Item 2.1.2
</td>
</tr>
</table>
Just break down what you're trying to do into simpler steps.
1) Replace all <ol> with <table><tr>
2) Replace all <li> with <td>
3) Replace all </li> with </td>
4) Replace all </ol> with </tr></table>
...Or similar. It's basically a straight-up translation if I'm understanding your issue correctly.
Would a regular expression replacing <ol> with <table> and <li> with <tr><td> not work?
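A plain tag-for-tag replacement may not produce the separate, indented tables shown in the question, so here is one possible alternative sketch that walks the parsed list recursively instead. It is not what either answer above proposes. It assumes the list is well-formed (every ol closed) so that System.Xml.Linq can parse it, it uses a shortened sample list, and it takes the 5px-per-level indent from the question's expected output. The helper name ToTables is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Shortened, well-formed sample list (not the question's full input).
var list = XElement.Parse(
    "<ol><li>Item 1<ol><li>Item 1.1</li><li>Item 1.2</li></ol></li><li>Item 2</li></ol>");

List<XElement> tables = ToTables(list, 0).ToList();
foreach (var tbl in tables) Console.WriteLine(tbl);

// Hypothetical helper: turns one <ol> into a sequence of tables.
// Consecutive items share a table; an item with a nested list closes the
// current table so the nested tables can follow it, indented one level more.
static IEnumerable<XElement> ToTables(XElement ol, int depth)
{
    XElement table = null;
    foreach (var li in ol.Elements("li"))
    {
        if (table == null)
        {
            table = new XElement("table");
            if (depth > 0)
                table.SetAttributeValue("style", $"margin-left: {depth * 5}px;");
        }

        // The item's own text is its direct text nodes, excluding nested lists.
        string text = string.Concat(li.Nodes().OfType<XText>().Select(n => n.Value)).Trim();
        table.Add(new XElement("tr", new XElement("td", "•"), new XElement("td", text)));

        var nested = li.Elements("ol").ToList();
        if (nested.Count > 0)
        {
            yield return table;   // close the current table before descending
            table = null;
            foreach (var child in nested)
                foreach (var inner in ToTables(child, depth + 1))
                    yield return inner;
        }
    }
    if (table != null) yield return table;
}
```

For the sample input this yields three tables: one for Item 1, one indented table holding Item 1.1 and Item 1.2, and one for Item 2.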
