How to sanitize html with HtmlAgilityPack? - c#

I'm facing a problem in my web scraper: I need to get the decimal number inside the cell team_a_col home:
<th>Med. goal subiti p/p</th>
<td class='team_a_col total'>0.76</td>
<td class='team_a_col home'>0.89
<td class='team_a_col away'>0.62</td></td>
so the result should be: 0.89
but as you can see, the HTML is badly structured, so instead of getting only 0.89 I also get the content of team_a_col away with this code:
node.SelectSingleNode(".//td[@class='team_a_col home']").InnerText.Trim();
How can I get only 0.89? The </td> should come before the <td class='team_a_col away'> cell.

You should set HtmlDocument.OptionFixNestedTags to true:
string html = "<th>Med. goal subiti p/p</th><td class='team_a_col total'>0.76</td><td class='team_a_col home'>0.89<td class='team_a_col away'>0.62</td></td>";
var doc = new HtmlAgilityPack.HtmlDocument
{
OptionFixNestedTags = true,
OptionCheckSyntax = true,
OptionAutoCloseOnEnd = true
};
doc.LoadHtml(html);
string tdText = doc.DocumentNode.SelectSingleNode(".//td[@class='team_a_col home']")?.InnerText.Trim();
With OptionFixNestedTags set to true, the result is: 0.89

Could you just take the whole line and then use a substring to fetch the data?
var node = doc.DocumentNode.SelectNodes("//htmlelement/htmlelement");
string[] nodeArray = node[0].OuterHtml.Split(' ');

Related

Correctly use regular expressions to extract word

I've got an ASP.NET Core project that requires me to read the response from a website and extract a certain word.
What I've tried was to replace the tags with white space, and remove the tags. Unfortunately, I'm not getting anywhere with this. What is a better approach?
I want to extract Toyota from these HTML tags:
<tr>
<td class="text-muted">Car Model</td>
<td><strong>Toyota 2015</strong></td>
</tr>
I've tried:
var documentSource = streamReader.ReadToEnd();
//removes html content
Regex remove = new Regex(@"<[^>].+?>");
var strippedSource = remove.Replace(documentSource.Replace("\n", ""), "");
//convert to array
string[] siteContextArray = strippedSource.Split(',');
//matching string
var match = new Regex("Car Model ([^2015]*)");
List<Model> modelList = new List<Model>();
Model model = new Model();
foreach (var item in siteContextArray)
{
var wordMatch = match.Match(item);
if (wordMatch.Success)
{
model.Add(
new Model
{
CarModel = wordMatch.Groups[1].Value
}
);
}
}
return modelList;
Use NuGet to add HTML Agility Pack to your solution.
Usage
var html = @"
<tr>
<td class=""text-muted"">Car Model</td>
<td><strong> Toyota 2015 </strong></td>
</tr>
<tr>
<td class=""text-muted"">Car Model</td>
<td><strong> Toyota 2016 </strong></td>
</tr>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var models = htmlDoc.DocumentNode
.SelectNodes("//tr/td[text()='Car Model']")
.Select(node => node.SelectSingleNode("following-sibling::*[1][self::td]").InnerText);
By the way, I think it would be nice to add a CSS class to the content element, like
<td class="car-model"><strong> Toyota 2016 </strong></td>
This will make the HTML more meaningful and easier to extract.

HTML Agility Pack - Grab Text after a node

I have some HTML that I'm parsing using C#
The sample text is below, though this is repeated about 150 times with different records
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
I'm trying to get the text in an array which will be like
customerArray [0,0] = Title
customerArray [0,1] = Mr
customerArray [1,0] = First Name
customerArray [1,1] = Fake
customerArray [2,0] = Surname
customerArray [2,1] = Guy
I can get the text in the array, but I'm having trouble getting the text after the STRONG closing tag up until the BR tag and then finding the next STRONG tag.
Any help would be appreciated.
You can use the XPath following-sibling::text()[1] to get the text node located directly after each strong. Here is a minimal but complete example:
var raw = @"<div>
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
{
var val = node.SelectSingleNode("following-sibling::text()[1]");
Console.WriteLine(node.InnerText + ", " + val.InnerText);
}
Output:
Title, : Mr
First name, : Fake
Surname, : Guy
You should be able to remove the ":" by doing simple string manipulation, if needed...
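As a minimal sketch of that string manipulation, assuming each value arrives in the form ": Mr" as in the output above, trimming the leading colon and surrounding whitespace is enough. The CleanValue helper name is hypothetical and not part of HtmlAgilityPack.

```csharp
using System;

// Hypothetical helper: strips surrounding whitespace and the leading ':'
// from a text node value such as ": Mr".
static string CleanValue(string raw)
{
    return raw.Trim().TrimStart(':').Trim();
}

Console.WriteLine(CleanValue(": Mr"));   // Mr
Console.WriteLine(CleanValue(": Fake")); // Fake
```

In the loop above, this would be applied to val.InnerText before printing.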
<strong> is a common tag, so here is something more specific to the sample format you provided:
var html = @"
<div>
<strong>First name</strong><em>italic</em>: Fake<br>
<strong>Bold</strong> <a href='#'>hyperlink</a><br>.
<strong>bold</strong>
<strong>bold</strong> <br>
text
</div>
<div>
<strong>Title</strong>: Mr<BR>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var document = new HtmlDocument();
document.LoadHtml(html);
// 1. <strong>
var strong = document.DocumentNode.SelectNodes("//strong");
if (strong != null)
{
foreach (var node in strong.Where(
// 2. followed by non-empty text node
x => x.NextSibling is HtmlTextNode
&& !string.IsNullOrEmpty(x.NextSibling.InnerText.Trim())
// 3. followed by <br>
&& x.NextSibling.NextSibling is HtmlNode
&& x.NextSibling.NextSibling.Name.ToLower() == "br"))
{
Console.WriteLine("{0} {1}", node.InnerText, node.NextSibling.InnerText);
}
}

Extract string from HTML

I want to extract the string KLE3KAN918D429 from the following html code:
<td class="Labels"> CODE (Sp Number): </td><td width="40.0%"> KLE3KAN918D429</td>
Is there a method in C# where I can specify the source text, start string, and end string, and get the string between start and end?
You are, as per the comments, probably better off using a parsing library to iterate the DOM structure but if you can make some assumptions about the html you'll be parsing, you could do something like below:
var html = "<td class=\"Labels\"> CODE (Sp Number): </td><td width=\"40.0%\"> KLE3KAN918D429</td>";
var labelIndex = html.IndexOf("<td class=\"Labels\">");
var pctIndex = html.IndexOf("%", labelIndex);
var closeIndex = html.IndexOf("<", pctIndex);
var key = html.Substring(pctIndex + 3, closeIndex - pctIndex - 3).Trim();
System.Diagnostics.Debug.WriteLine(key);
Likely quite brittle but sometimes quick and dirty is all that is required.
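The question asks for a method that takes a source text, a start string, and an end string. The .NET base class library has no single built-in call for this, so the sketch below builds one from IndexOf and Substring. The TryBetween name and signature are hypothetical, and the approach is just as brittle as the snippet above if the markup changes.

```csharp
using System;

// Hypothetical helper: finds the text between the first occurrence of `start`
// and the next occurrence of `end`. Returns false if either marker is missing.
static bool TryBetween(string source, string start, string end, out string result)
{
    result = string.Empty;
    int startIndex = source.IndexOf(start, StringComparison.Ordinal);
    if (startIndex < 0) return false;
    startIndex += start.Length; // move past the start marker
    int endIndex = source.IndexOf(end, startIndex, StringComparison.Ordinal);
    if (endIndex < 0) return false;
    result = source.Substring(startIndex, endIndex - startIndex);
    return true;
}

var sample = "<td class=\"Labels\"> CODE (Sp Number): </td><td width=\"40.0%\"> KLE3KAN918D429</td>";
if (TryBetween(sample, "<td width=\"40.0%\">", "</td>", out var code))
{
    Console.WriteLine(code.Trim()); // KLE3KAN918D429
}
```

Returning a bool with an out parameter lets the caller distinguish a missing marker from an empty value.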
As others already suggested, you should use something like HtmlAgilityPack for parsing html. Don't use regular expressions or other hacks for parsing html.
You have several td nodes in your html string. Getting last one is really easy with td[last()] XPath:
string html = "<td class=\"Labels\"> CODE (Sp Number): </td><td width=\"40.0%\"> KLE3KAN918D429</td>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var td = doc.DocumentNode.SelectSingleNode("td[last()]");
var result = td.InnerText.Trim(); // "KLE3KAN918D429"
I really suggest using HTMLAgilityPack for this.
It's as easy as:
var doc = new HtmlDocument();
doc.LoadHtml(@"<td class=""Labels""> CODE (Sp Number): </td><td width=""40.0%""> KLE3KAN918D429</td>");
var tdNode = doc.DocumentNode.SelectSingleNode("//td[@class='Labels' and text()=' CODE (Sp Number): ']/following-sibling::td[1]");
Console.WriteLine(tdNode.InnerText.Trim());
Before you start, add HtmlAgilityPack from NuGet:
Install-Package HtmlAgilityPack

Extract the contents of a string between two string delimiters using match in C#

So, say I'm parsing the following HTML string:
<html>
<head>
RANDOM JAVASCRIPT AND CSS AHHHHHH!!!!!!!!
</head>
<body>
<table class="table">
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
</table>
<body>
</html>
and I want to isolate the contents of the table (everything inside the table class).
Now, I used regex to accomplish this:
string pagesource = (method that extracts the html source and stores it into a string);
string[] splitSource = Regex.Split(pagesource, "<table class=\"member\">");
string[] memberList = Regex.Split(splitSource[1], "</table>");
//the list of table members will be in memberList[0];
//method to extract links from the table
ExtractLinks(memberList[0]);
I've been looking at other ways to do this extraction, and I came across the Match object in C#.
I'm attempting to do something like this:
Match match = Regex.Match(pageSource, "<table class=\"members\">(.|\n)*?</table>");
The purpose of the above was to hopefully extract a match value between the two delimiters, but, when I try to run it the match value is:
match.value = </table>
My question, as such, is: is there a way to extract data from my string that is slightly easier, more readable, or shorter than my method using regex? For this simple example, regex is fine, but for more complex examples, I find myself with the coding equivalent of scribbles all over my screen.
I would really like to use match, because it seems like a very neat and tidy class, but I can't seem to get it working for my needs. Can anyone help me with this?
Thank you very much!
Use an HTML parser, like HTML Agility Pack.
var doc = new HtmlDocument();
using (var wc = new WebClient())
using (var stream = wc.OpenRead(url))
{
doc.Load(stream);
}
var table = doc.DocumentNode.Element("html").Element("body").Element("table");
string tableHtml = table.OuterHtml;
You can use XPath with the HtmlAgilityPack:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var elements = doc.DocumentNode.SelectNodes("//table[@class='table']");
foreach (var ele in elements)
{
MessageBox.Show(ele.OuterHtml);
}
You have to add parentheses to the regular expression in order to capture the match; the captured text is then available in match.Groups[1].Value:
Match match = Regex.Match(pageSource, "<table class=\"members\">((.|\n)*?)</table>");
Anyways it seems that only Chuck Norris can parse HTML with regex correctly.
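For completeness, here is a minimal sketch of the capture-group approach using only System.Text.RegularExpressions. It assumes the page contains a table with class "members" as in the question; the pageSource content below is made-up sample input. RegexOptions.Singleline lets '.' match newlines, so the (.|\n) alternation is not needed.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input standing in for the downloaded page (an assumption for illustration).
var pageSource = "<body>\n<table class=\"members\">\n<tr>Name</tr>\n<tr>Name</tr>\n</table>\n</body>";

// Singleline makes '.' match '\n'; the lazy group (.*?) stops at the first </table>.
var match = Regex.Match(
    pageSource,
    "<table class=\"members\">(.*?)</table>",
    RegexOptions.Singleline);

// Group 0 is the whole match; group 1 is only the text between the delimiters.
var tableContents = match.Success ? match.Groups[1].Value : string.Empty;
Console.WriteLine(tableContents.Trim());
```

This still breaks on nested tables or attribute variations, which is why the parser-based answers above are the safer choice.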

Parsing html with the HTML Agility Pack and Linq

I have the following HTML
(..)
<tbody>
<tr>
<td class="name"> Test1 </td>
<td class="data"> Data </td>
<td class="data2"> Data 2 </td>
</tr>
<tr>
<td class="name"> Test2 </td>
<td class="data"> Data2 </td>
<td class="data2"> Data 2 </td>
</tr>
</tbody>
(..)
The information I have is the name => so "Test1" & "Test2". What I want to know is how can I get the data that's in "data" and "data2" based on the Name I have.
Currently I'm using:
var data =
from
tr in doc.DocumentNode.Descendants("tr")
from
td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
where
td.InnerText == "Test1"
select tr;
But I get {"Object reference not set to an instance of an object."} when I try to look in data
As for your attempt, you have two issues with your code:
ChildNodes is weird - it also returns whitespace text nodes, which don't have a class attribute (text nodes can't have attributes, of course).
As James Walford commented, the spaces around the text are significant, you probably want to trim them.
With these two corrections, the following works:
var data =
from tr in doc.DocumentNode.Descendants("tr")
from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
where td.InnerText.Trim() == "Test1"
select tr;
Here is the XPath way - hmmm... everyone seems to have forgotten about the power of XPath and concentrates exclusively on C# XLinq these days :-)
This function gets all data values associated with a name:
public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
return from HtmlNode node in
document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
select node.InnerText.Trim();
}
For example, this code will dump all 'Test2' data:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtml); // yourHtml is assumed to hold the HTML markup as a string
foreach (string data in GetData(doc, "Test2"))
{
Console.WriteLine(data);
}
Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
.SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
node => node.Descendants("td")
.ToDictionary(descendant => descendant.Attributes["class"].Value,
descendant => descendant.InnerText.Trim())
).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];
Here I turn every <tr> to a dictionary, where the class of the <td> is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.
Instead of
td.InnerText == "Test1"
try
td.InnerText == " Test1 "
or
td.InnerText.Trim() == "Test1"
I can recommend one of two ways:
http://htmlagilitypack.codeplex.com/, which converts the html to valid xml which can then be queried against with OOTB Linq.
Or,
Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which while not maintained on CodePlex ( that was a hint, Keith ), gives a reasonable working set of features to springboard from.
