Correctly use regular expressions to extract word


I've got an ASP.NET Core project that requires me to read the response from a website and extract a certain word.
I tried replacing the tags with white space and removing the tags, but unfortunately I'm not getting anywhere with this. What is a better approach?
I want to extract Toyota from these HTML tags:
<tr>
<td class="text-muted">Car Model</td>
<td><strong>Toyota 2015</strong></td>
</tr>
I've tried:
var documentSource = streamReader.ReadToEnd();
//removes html content
Regex remove = new Regex(@"<[^>].+?>");
var strippedSource = remove.Replace(documentSource.Replace("\n", ""), "");
//convert to array
string[] siteContextArray = strippedSource.Split(',');
//matching string
var match = new Regex("Car Model ([^2015]*)");
List<Model> modelList = new List<Model>();
Model model = new Model();
foreach (var item in siteContextArray)
{
var wordMatch = match.Match(item);
if (wordMatch.Success)
{
model.Add(
new Model
{
CarModel = wordMatch.Groups[1].Value
}
);
}
}
return modelList;

Install the HTML Agility Pack into your solution via NuGet.
Usage:
var html = @"
<tr>
<td class=""text-muted"">Car Model</td>
<td><strong> Toyota 2015 </strong></td>
</tr>
<tr>
<td class=""text-muted"">Car Model</td>
<td><strong> Toyota 2016 </strong></td>
</tr>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var models = htmlDoc.DocumentNode
.SelectNodes("//tr/td[text()='Car Model']")
.Select(node => node.SelectSingleNode("following-sibling::*[1][self::td]").InnerText);
By the way, I think it would be nice to add a CSS class to the content element, like
<td class="car-model"><strong> Toyota 2016 </strong></td>
which would make the HTML more meaningful and easier to extract from.
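The query above returns the whole cell text (for example " Toyota 2015 "), while the question asks for the word Toyota alone. A minimal sketch of that last step, assuming the cell always ends in a four-digit year; the class and method names (CarCell, ExtractMake) are hypothetical and not part of the original answer:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CarCell
{
    // Returns the text that precedes a trailing four-digit year,
    // or null when the cell text does not end in such a year.
    public static string ExtractMake(string cellText)
    {
        Match m = Regex.Match(cellText.Trim(), @"^(?<make>.+?)\s+(?<year>\d{4})$");
        return m.Success ? m.Groups["make"].Value : null;
    }

    public static void Main()
    {
        // The input mirrors the InnerText of the <strong> cell in the sample HTML.
        Console.WriteLine(ExtractMake(" Toyota 2015 "));
    }
}
```

Running the regular expression on the already-extracted cell text, rather than on the raw page, keeps the pattern short and independent of the surrounding markup.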

Related

How to sanitize html with HtmlAgilityPack?

I'm facing a problem in my web scraper: essentially, I need to get the decimal number inside the cell team_a_col home:
<th>Med. goal subiti p/p</th>
<td class='team_a_col total'>0.76</td>
<td class='team_a_col home'>0.89
<td class='team_a_col away'>0.62</td></td>
so the result should be: 0.89
But as you can see, the HTML is badly structured, so instead of getting only 0.89 I also get the content of team_a_col away with this code:
node.SelectSingleNode(".//td[@class='team_a_col home']").InnerText.Trim();
How can I get only 0.89? The </td> should come before <td class='team_a_col away'>.

You should set HtmlDocument.OptionFixNestedTags to true:
string html = "<th>Med. goal subiti p/p</th><td class='team_a_col total'>0.76</td><td class='team_a_col home'>0.89<td class='team_a_col away'>0.62</td></td>";
var doc = new HtmlAgilityPack.HtmlDocument
{
OptionFixNestedTags = true,
OptionCheckSyntax = true,
OptionAutoCloseOnEnd = true
};
doc.LoadHtml(html);
string tdText = doc.DocumentNode.SelectSingleNode(".//td[@class='team_a_col home']")?.InnerText.Trim();
With OptionFixNestedTags enabled, the result is: 0.89

Could you just take the whole line and then use substring operations to fetch the data?
var node = doc.DocumentNode.SelectNodes("//htmlelement/htmlelement");
string[] nodeArray = node[0].OuterHtml.Split(' ');

Regular expression to match everything, except HTML tags

<tr><td>Di, 12.04.16</td><td>1</td><td>D</td><td>D</td><td>255</td><td>ABC</td><tr>
I want to match only ABC, or whatever else stands between <td> and </td> (before and after ABC).
This pattern doesn't work for me:
((?!<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>[1-9][0-2]?</td><td>[A-Z]?[A-Z]?[A-Z]?[A-Z]?[1-5]?</td><td>(---|[A-Z]?[A-Z]?[A-Z]?[A-Z]?[1-5]?)</td><td>).*(?!</td></tr>))
Do you have any ideas?
Thanks for your help.

As Amy said, don't use regex to parse HTML. You can install the Html Agility Pack from NuGet and use the System.Linq namespace to query the parsed document.
For example:
string html = "<html><head></head><body><p class='testclass'>This is a paragraph.</p><table><tr><td>Di, 12.04.16</td><td>1</td><td>D</td><td>D</td><td>255</td><td>ABC</td><tr></table></body></html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var programmes = doc.DocumentNode.Descendants().Where(d => d.GetAttributeValue("class", "") == "testclass");
var trs = doc.DocumentNode.Descendants("tr"); // Gets all the tr elements
foreach (var tr in trs)
{
var tds = tr.Descendants("td").ToArray(); // Gets all the td elements in this row
// Sample: show the result in a TextBlock
foreach (var td in tds)
{
txt.Text = txt.Text + " " + td.InnerText;
}
}
The result is the inner text of each cell, appended to the TextBlock.

htmlagilitypack parse table by th

I am trying to parse the following table using the HTML Agility Pack.
<tr>
<th>
Anställda:
</th>
<td>
0 - 4
</td>
</tr>
<tr>
<th>
Oms (tkr):
</th>
<td>
5 409
</td>
</tr>
I'm trying to extract the value for Oms (tkr): (in this case 5 409).
The code below gives me the HTML table above. The problem is that I can't get the Oms (tkr) value out of it. Note also that Oms (tkr) is not always in the same place; it can be further down or further up in the table. By this I mean that Oms can sometimes be where Anställda is, and so forth.
foreach (HtmlAgilityPack.HtmlNode graf in (IEnumerable<HtmlAgilityPack.HtmlNode>)doc.DocumentNode.SelectNodes("//div[@id=\"info\"]//table")) {
var tabellHTML = graf.InnerHtml;
MessageBox.Show(tabellHTML);
}
I've tried to do:
if (tabellHTML.Contains("Oms"))
{
item.OMS = cells.InnerText;
}
But I can't seem to get the correct value. Any ideas what I'm doing wrong?

The following code:
HtmlDocument doc = new HtmlDocument();
doc.Load("test.htm");
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//th[starts-with(normalize-space(text()), 'Oms')]").InnerHtml.Trim());
will dump this:
Oms (tkr)
But you'll have to parse the end manually. The Html Agility Pack only knows about elements and attributes. The XPath expression means: select any TH element whose text content, once trimmed (normalize-space), starts with 'Oms'.
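The question asks for the value next to the heading rather than the heading itself. One possible way to get there is to extend the same XPath with a following-sibling step. This is a sketch that assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (ThLookup, ValueFor) are hypothetical, and the inline HTML is a shortened copy of the table from the question:

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ThLookup
{
    // Returns the trimmed text of the first <td> following the <th> whose
    // text starts with the given label, or null when no such cell exists.
    // Assumption: label contains no apostrophe, since it is pasted into the XPath.
    public static string ValueFor(HtmlDocument doc, string label)
    {
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//th[starts-with(normalize-space(text()), '" + label + "')]/following-sibling::td[1]");
        return cell?.InnerText.Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><th>\nAnställda:\n</th><td>\n0 - 4\n</td></tr>" +
                     "<tr><th>\nOms (tkr):\n</th><td>\n5 409\n</td></tr></table>");
        Console.WriteLine(ValueFor(doc, "Oms"));
    }
}
```

Because the lookup is keyed on the heading text, it does not matter whether the Oms (tkr) row appears before or after the Anställda row.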

Extract the contents of a string between two string delimiters using match in C#

So, say I'm parsing the following HTML string:
<html>
<head>
RANDOM JAVASCRIPT AND CSS AHHHHHH!!!!!!!!
</head>
<body>
<table class="table">
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
</table>
</body>
</html>
and I want to isolate the contents of the <table> element (everything inside the table).
Now, I used regex to accomplish this:
string pageSource = (method that extracts the html source and stores it into a string);
string[] splitSource = Regex.Split(pageSource, "<table class=\"member\">");
string[] memberList = Regex.Split(splitSource[1], "</table>");
//the list of table members will be in memberList[0];
//method to extract links from the table
ExtractLinks(memberList[0]);
I've been looking at other ways to do this extraction, and I came across the Match object in C#.
I'm attempting to do something like this:
Match match = Regex.Match(pageSource, "<table class=\"members\">(.|\n)*?</table>");
The purpose of the above was to hopefully extract a match value between the two delimiters, but, when I try to run it the match value is:
match.value = </table>
My question, as such, is: is there a way to extract data from my string that is slightly easier, more readable, or shorter than my method using regex? For this simple example, regex is fine, but for more complex examples, I find myself with the coding equivalent of scribbles all over my screen.
I would really like to use match, because it seems like a very neat and tidy class, but I can't seem to get it working for my needs. Can anyone help me with this?
Thank you very much!

Use an HTML parser, like HTML Agility Pack.
var doc = new HtmlDocument();
using (var wc = new WebClient())
using (var stream = wc.OpenRead(url))
{
doc.Load(stream);
}
var table = doc.DocumentNode.Element("html").Element("body").Element("table");
string tableHtml = table.OuterHtml;

You can use XPath with the HtmlAgilityPack:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var elements = doc.DocumentNode.SelectNodes("//table[@class='table']");
foreach (var ele in elements)
{
MessageBox.Show(ele.OuterHtml);
}

You have to add parentheses to the regular expression in order to capture the match:
Match match = Regex.Match(pageSource, "<table class=\"members\">((?:.|\n)*?)</table>");
Anyways it seems that only Chuck Norris can parse HTML with regex correctly.
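For the simple case in the question, the capture-group idea can be sketched with RegexOptions.Singleline, which lets . match newlines so the (.|\n) alternation is not needed. This is only an illustration under assumptions: the class and method names (TableExtractor, InnerTable) are hypothetical, the pattern uses class="table" from the sample HTML rather than "members", and it breaks as soon as the opening tag carries other attributes or tables are nested:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableExtractor
{
    // Returns the markup between <table class="table"> and the first </table>,
    // or null when the page contains no such table.
    public static string InnerTable(string pageSource)
    {
        Match match = Regex.Match(
            pageSource,
            "<table class=\"table\">(.*?)</table>",
            RegexOptions.Singleline); // Singleline: '.' also matches '\n'
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        string page = "<body><table class=\"table\">\n<tr>Name</tr>\n</table></body>";
        Console.WriteLine(InnerTable(page));
    }
}
```

Groups[0] (or match.Value) holds the whole match including the delimiters, while Groups[1] holds only what the parentheses captured.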

Parsing html with the HTML Agility Pack and Linq

I have the following HTML
(..)
<tbody>
<tr>
<td class="name"> Test1 </td>
<td class="data"> Data </td>
<td class="data2"> Data 2 </td>
</tr>
<tr>
<td class="name"> Test2 </td>
<td class="data"> Data2 </td>
<td class="data2"> Data 2 </td>
</tr>
</tbody>
(..)
The information I have is the name, i.e. "Test1" and "Test2". What I want to know is how I can get the data that's in "data" and "data2" based on the name I have.
Currently I'm using:
var data =
from
tr in doc.DocumentNode.Descendants("tr")
from
td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
where
td.InnerText == "Test1"
select tr;
But I get {"Object reference not set to an instance of an object."} when I try to look in data

As for your attempt, you have two issues with your code:
1. ChildNodes is weird: it also returns whitespace text nodes, which don't have a class attribute (text nodes can't have attributes at all).
2. As James Walford commented, the spaces around the text are significant; you probably want to trim them.
With these two corrections, the following works:
var data =
from tr in doc.DocumentNode.Descendants("tr")
from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
where td.InnerText.Trim() == "Test1"
select tr;
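The corrected query still indexes Attributes["class"] directly, so it would throw the same null reference exception on any td that has no class attribute. A defensive variant can use GetAttributeValue with a default instead. This is a sketch assuming the HtmlAgilityPack NuGet package; the class and method names (RowFinder, RowsNamed) are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class RowFinder
{
    // Returns every <tr> containing a <td class="name"> whose trimmed text equals name.
    // GetAttributeValue returns the supplied default ("") when the attribute is missing.
    public static List<HtmlNode> RowsNamed(HtmlDocument doc, string name)
    {
        return doc.DocumentNode.Descendants("tr")
            .Where(tr => tr.Descendants("td").Any(td =>
                td.GetAttributeValue("class", "") == "name" &&
                td.InnerText.Trim() == name))
            .ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td class=\"name\"> Test1 </td><td class=\"data\"> Data </td></tr>" +
                     "<tr><td> no class here </td></tr></table>");
        foreach (HtmlNode tr in RowsNamed(doc, "Test1"))
        {
            // Print the text of every cell in the matching row.
            foreach (HtmlNode td in tr.Descendants("td"))
                Console.WriteLine(td.InnerText.Trim());
        }
    }
}
```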

Here is the XPath way. Everyone seems to have forgotten about the power of XPath and concentrates exclusively on C# XLinq these days :-)
This function gets all data values associated with a name:
public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
return from HtmlNode node in
document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
select node.InnerText.Trim();
}
For example, this code will dump all 'Test2' data:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (string data in GetData(doc, "Test2"))
{
Console.WriteLine(data);
}

Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
.SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
node => node.Descendants("td")
.ToDictionary(descendant => descendant.Attributes["class"].Value,
descendant => descendant.InnerText.Trim())
).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];
Here I turn every <tr> to a dictionary, where the class of the <td> is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.

Instead of
td.InnerText == "Test1"
try
td.InnerText == " Test1 "
or
td.InnerText.Trim() == "Test1"

I can recommend one of two ways:
http://htmlagilitypack.codeplex.com/, which converts the HTML to valid XML that can then be queried with out-of-the-box LINQ.
Or,
Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which, while not maintained on CodePlex (that was a hint, Keith), gives a reasonable working set of features to springboard from.
