I am using Windows Forms and C#.
For example, take this table:
<table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>
I load the page using the WebBrowser control. The page loads perfectly.
The next thing I want to do is search through all the rows in the table and check whether they contain a specific value, in this instance YES.
If a row contains it, I want that row passed back to me so I can store it as a string.
I want the row in HTML form, including the tags.
How can I accomplish this?
Please help me.
You can use the HtmlAgilityPack to easily parse the HTML. For example, to get all of the TD elements, you can do this:
string value = @"<table id=tbl>
<tbody>
<tr>
<td>HELLO</td>
<td>YES</td>
<td>TEST</td>
</tr>
<tr>
<td>BLAH BLAH</td>
<td>YES</td>
<td>TEST</td>
</tr>
</tbody>
</table>";
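// Requires the HtmlAgilityPack NuGet package and "using System.Diagnostics;" for Debug.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(value);
// Select every cell in the table; SelectNodes returns null when nothing matches.
var nodes = doc.GetElementbyId("tbl").SelectNodes("tbody/tr/td");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        Debug.WriteLine(node.InnerText);
    }
}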
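To get whole rows rather than individual cells, which is what the question asks for, something along these lines might work. It is only a sketch: RowFinder and FindRowsContaining are names I made up, and it assumes the HtmlAgilityPack NuGet package is referenced. It keeps each tr that has a td whose trimmed text equals the search value and returns the row's OuterHtml, so the tags are preserved.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class RowFinder
{
    // Hypothetical helper: returns the outer HTML of every <tr> inside the
    // table with the given id that has a <td> whose trimmed text equals value.
    public static List<string> FindRowsContaining(string html, string tableId, string value)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null when no element has that id.
        var table = doc.GetElementbyId(tableId);
        if (table == null)
        {
            return new List<string>();
        }

        return table.Descendants("tr")
            .Where(tr => tr.Elements("td").Any(td => td.InnerText.Trim() == value))
            .Select(tr => tr.OuterHtml) // OuterHtml keeps the <tr> and <td> tags
            .ToList();
    }
}
```

With the WebBrowser control, the markup can be read from its DocumentText property once the page has finished loading, for example RowFinder.FindRowsContaining(webBrowser1.DocumentText, "tbl", "YES"), where webBrowser1 is whatever your control is called.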
You can use this: http://simplehtmldom.sourceforge.net/ . It is a really simple way to search HTML files.
Just include simple_html_dom.php in your file and then follow the manual:
http://simplehtmldom.sourceforge.net/manual.htm
Your PHP code will look like this:
$html = file_get_html('File.html');
foreach ($html->find('td') as $element)
    echo $element->plaintext . '<br>';
Related
I am trying to use the HtmlAgilityPack package to find every href link inside td tags across an entire HTML page. The trick is that the tables start deep down in the HTML structure, and as far as I can tell HtmlAgilityPack won't simply let me ask for all tds inside trs on a page. Each table is wrapped in a parent div with the class "table-group", which is not shown in my sample below; maybe I can use that as a starting point? My biggest problem is that there are several parent elements above everything in the sample, and I want to skip all of them and start here.
Here is a sample of the structure I am trying to navigate:
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 1</td>
<td>1</td>
</tr>
<tr>
<td>Link 2</td>
<td>2</td>
</tr>
<tr>
<td>Link 3</td>
<td>3</td>
</tr>
</tbody>
</table>
<table>
<thead>
</thead>
<tbody>
<tr>
<td>Link 4</td>
<td>4</td>
</tr>
<tr>
<td>Link 5</td>
<td>5</td>
</tr>
<tr>
<td>Link 6</td>
<td>6</td>
</tr>
</tbody>
</table>
I would like my end result to be:
https://path-to-pdf1
https://path-to-pdf2
https://path-to-pdf3
https://path-to-pdf4
https://path-to-pdf5
https://path-to-pdf6
Here is what I have tried:
var html = @"https://myurl.com";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
foreach (var item in nodes)
{
Console.WriteLine(item.Attributes["href"].Value);
}
Console.ReadKey();
Change
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td/a[0]");
to
var nodes = htmlDoc.DocumentNode.SelectNodes("//table/tbody/tr/td[1]/a");
and you will get the result you want. XPath positions are 1-based, so a[0] never matches anything, while td[1] selects the first cell of each row. The XPath documentation has more details.
I tried this in an MVC project with the same HTML file.
Update:
I copied the HTML into a local page and retrieved the nodes successfully.
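For what it's worth, here is a sketch that starts from the "table-group" wrapper mentioned in the question instead of spelling out the full path from the document root. LinkExtractor and GetFirstCellLinks are hypothetical names, and the XPath assumes each link sits directly inside the first td of its row, which I can't confirm from the sample.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinkExtractor
{
    // Hypothetical helper: collects the href of every <a> found in the first
    // cell of any row below a <div class="table-group">.
    public static List<string> GetFirstCellLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" skips the unknown ancestors; td[1] is the first cell because
        // XPath positions start at 1, not 0.
        var anchors = doc.DocumentNode.SelectNodes(
            "//div[@class='table-group']//tr/td[1]/a[@href]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
    }
}
```

Using //tr rather than /table/tbody/tr also avoids depending on whether the source markup really contains a tbody element; HtmlAgilityPack does not add one the way browsers do.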
I have a bit of HTML that looks like this:
<table class="resultsTable">
<tbody>
<tr class="even">
<td width="35%"><strong>Name</strong></td>
<td>ACME ANVILS, INC.</td>
</tr>
</tbody>
</table>
and some C# code that looks like this:
var name = document.DocumentNode
    .SelectSingleNode("//*[text()='Name']/following::td").InnerText;
which happily returns
ACME ANVILS, INC.
However, there's a new wrinkle. The page in question now returns multiple results:
<table class="resultsTable">
<tbody>
<tr class="even">
<td width="35%"><strong>Name</strong></td>
<td>ACME ANVILS, INC.</td>
</tr>
</tbody>
</table>
<table class="resultsTable">
<tbody>
<tr class="even">
<td width="35%"><strong>Name</strong></td>
<td>ROAD RUNNER RACES, LLC</td>
</tr>
</tbody>
</table>
So now I'm working with
var tables = document.DocumentNode.SelectNodes("//table/tbody");
foreach (var table in tables)
{
var name = table.SelectSingleNode("//*[text()='Name']/following::td").InnerText;
...
}
Which falls over, because SelectSingleNode returns null.
How do I get my XPath to actually return a result, searching only within the specific table I have selected?
With the addition of a second table, two adjustments are required:
1. Change your absolute XPath,
   //*[text()='Name']/following::td
   to one relative to the current table or tbody element:
   .//*[text()='Name']/following::td
2. Account for there now being more than one td element on the following:: axis. Either just grab the first,
   (.//*[text()='Name']/following::td)[1]
   or, better, use the following-sibling:: axis instead, in combination with a test on the string value of td rather than a test on a text node, which might be buried beneath intervening formatting elements:
   .//td[.='Name']/following-sibling::td
See also Difference between Testing text() nodes vs string values in XPath.
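Put together in C#, the loop from the question might look roughly like this. NameReader and ReadNames are names invented for the example, and I have swapped the plain string-value test for normalize-space(.) so that stray whitespace around "Name" does not break the match; that part is my assumption, not something the page requires.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class NameReader
{
    // Hypothetical helper: reads the value cell next to the "Name" label
    // from every results table in the document.
    public static List<string> ReadNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        var tables = doc.DocumentNode.SelectNodes("//table[@class='resultsTable']");
        if (tables == null)
        {
            return names; // no result tables on the page
        }

        foreach (var table in tables)
        {
            // The leading "." keeps the search inside the current table;
            // a path starting with "//" would search the whole document again.
            var cell = table.SelectSingleNode(
                ".//td[normalize-space(.)='Name']/following-sibling::td");
            if (cell != null)
            {
                names.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
            }
        }

        return names;
    }
}
```

The null check on cell matters because SelectSingleNode returns null for a table that has no "Name" row, which would otherwise throw on InnerText.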
How can I get the text out of all td tags in the table with class "string_14", so that I can store it cleanly without any HTML markup?
This is the HTML I am working with:
<table class="string_14">
<tbody><tr>
<td>Postadr.:</td>
<td class="tab_space">Stenslivegen 67, 2817 Gjøvik</td>
</tr>
<tr>
<td>Telefon:</td>
<td class="tab_space">611 80 710</td>
</tr>
<tr>
<td>Mobil:</td>
<td class="tab_space">957 92 455</td>
</tr>
</tbody>
</table>
My code currently looks like this. What I need help with is the XPath for name: how should I write it to get a single td?
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(result));
HtmlNode root = doc.DocumentNode;
List<string> list = new List<string>();
foreach (HtmlNode div in root.SelectNodes("//div[@class='biz_list']"))
{
string name = doc.DocumentNode.SelectNodes("//d[@class='string_14']/@tr");
list.Add(name);
string att = div.OuterHtml;
list.Add(att);
}
My goal is to scrape a page and, at a later stage, save the results to an XML file.
I think what you want is this:
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//table[@class='string_14']//td[@class='tab_space']");
You can consult an XPath tutorial for more on this.
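If you also want to keep the label that goes with each value, a sketch like the following pairs the first cell of each row with its tab_space cell and returns plain text only. ContactScraper and ReadFields are hypothetical names, and the code assumes every row has the label in its first td, as in the sample.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ContactScraper
{
    // Hypothetical helper: maps each label cell (e.g. "Telefon:") to the
    // plain text of the tab_space cell in the same row.
    public static Dictionary<string, string> ReadFields(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new Dictionary<string, string>();
        var rows = doc.DocumentNode.SelectNodes("//table[@class='string_14']//tr");
        if (rows == null)
        {
            return fields; // table not found or it has no rows
        }

        foreach (var row in rows)
        {
            var label = row.SelectSingleNode("./td[1]");
            var value = row.SelectSingleNode("./td[@class='tab_space']");
            if (label == null || value == null)
            {
                continue; // skip rows that do not follow the label/value layout
            }

            // InnerText drops the tags; DeEntitize turns entities such as
            // &amp; back into characters. A repeated label overwrites the earlier one.
            fields[Clean(label)] = Clean(value);
        }

        return fields;
    }

    private static string Clean(HtmlNode node)
    {
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

A dictionary like this is easy to write out later, for example as one XML element per entry.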
Let's say I have this HTML:
<table class="c1">
<tr>
<td>Dog</td>
<td>Dog<td>
</tr>
<tr>
<td>Cat</td>
<td>Cat<td>
</tr>
</table>
What I tried:
HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");
HtmlNodeCollection urls = node.SelectNodes("a");
node contains the table, but urls is null. Why?
Use Descendants("a") instead of SelectNodes("a"). The XPath "a" only matches a elements that are direct children of the table, whereas Descendants searches the whole subtree.
This should work:
var node = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");
var urls = node.Descendants("a").ToList(); // ToList() needs "using System.Linq;"
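If you would rather stay with XPath, the relative path ".//a" reaches the same nodes as Descendants("a"). Below is a small sketch of that variant; AnchorFinder and GetHrefs are made-up names, and the href attribute is an assumption, since the sample above does not show the actual links.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class AnchorFinder
{
    // Hypothetical helper: returns the href of every link inside the
    // table with class "c1", at any nesting depth.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode.SelectSingleNode("//table[@class='c1']");
        if (table == null)
        {
            return new List<string>();
        }

        // "a" alone would only look at direct children of the table;
        // ".//a" looks at every descendant of the current node.
        var anchors = table.SelectNodes(".//a[@href]");
        if (anchors == null)
        {
            return new List<string>(); // SelectNodes returns null when nothing matches
        }

        return anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
    }
}
```

One practical difference: Descendants returns an empty sequence when there are no links, while SelectNodes returns null, so the XPath version needs the extra check.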
I'm using C# with HtmlAgilityPack. Everything works fine except when the table I'm looking for contains no rows. I'm trying to read only the data from the 1st table on the page. The problem is that if the first table contains no rows, HtmlAgilityPack seems to jump down to the 2nd table for some reason.
The HTML I'm trying to read looks something like this:
<table class='stats'>
<tr>
<td colspan='2'>This is the 1st table</td>
<tr>
<td>Column A</td>
<td>Column B</td>
</tr>
<tr>
<td>Value A</td>
<td>Value B</td>
</tr>
</table>
<table class='stats'>
<tr>
<td colspan='2'>This is the 2nd table</td>
<tr>
<td>Column 1</td>
<td>Column 2</td>
</tr>
<tr>
<td>Value 111</td>
<td>Value 222</td>
</tr>
</table>
I then retrieve the 1st table's values using the following line:
foreach (HtmlNode node in root.SelectNodes("//table[@class='stats']/tr[position() > 2]/td"))
How do I ensure the data I'm grabbing is only from the 1st table?
Thanks.
You could ensure that you only select the first matching table by using a position index [1] after the table selector.
Try the following:
"//table[@class='stats'][1]/tr[position()>2]/td"
If the first table has no rows, then you will get null back so you should check for that before iterating in the foreach.
For example you might want to do the following:
var elements = root.SelectNodes("//table[@class='stats'][1]/tr[position()>2]/td");
if (elements != null)
{
foreach (HtmlNode node in elements)
{
// process the td node
}
}
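One caveat worth knowing: in XPath, //table[@class='stats'][1] means "a stats table that is the first such table under its own parent", so it can still match several tables if they sit in different containers. Wrapping the path in parentheses, (//table[@class='stats'])[1], picks the first match in the whole document. Here is a sketch of that variant; StatsReader and GetFirstTableValues are hypothetical names, and it assumes the rows are direct children of the table, as in the sample.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class StatsReader
{
    // Hypothetical helper: returns the cell text of the data rows (row 3
    // onwards) of the first "stats" table only.
    public static List<string> GetFirstTableValues(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Parentheses apply [1] to the whole result set, i.e. the first
        // matching table in document order.
        var firstTable = doc.DocumentNode.SelectSingleNode("(//table[@class='stats'])[1]");
        if (firstTable == null)
        {
            return new List<string>();
        }

        // Relative path: only rows of this table, skipping title and header rows.
        var cells = firstTable.SelectNodes("./tr[position() > 2]/td");
        if (cells == null)
        {
            return new List<string>(); // the first table has no data rows
        }

        return cells.Select(td => td.InnerText.Trim()).ToList();
    }
}
```

Selecting the table first and then querying relative to it means an empty first table gives an empty result instead of falling through to the second table.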
You need to have an id on the table or row which uniquely identifies the table or row, and then use that id in the XPath.