C# substring - parse all text in between

I am trying to parse text (mainly the URLs) from the HTML below, but I only want the URL that appears between the result-firstline-title div and the result-url js-result-url link, for every occurrence.
To be clear, I can already grab all the URLs from the HTML source, but each one is matched almost three times. I have a fix that removes duplicate URLs; however, if you look carefully at the HTML source, you will see that it still grabs the third URL, the one shown as the link text.
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
The Top Social Networking Sites People Are Using
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
The Top
</p>
</div>
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
Top 15 Most Popular Social Networking Sites | January 2019
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="http://www.ebizmba.com/articles/social-networking-websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
</a>
<p class="result-snippet">
Top 15 Most
</p>
</div>
I tried the following C# code to grab the text between the div tags, but it grabs everything, which I don't want.
int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);
To grab the URLs I am using the following:
var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);
What I want is to grab only the URLs from these:
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
so that the output shows only:
https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites

You can make this easier by using HtmlAgilityPack; just add it to your project with NuGet.
To add HtmlAgilityPack using NuGet, go to the Package Manager Console and type Install-Package HtmlAgilityPack -Version 1.11.3
After the installation you can extract the URLs as shown below.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");
var listOfUrls = new List<string>();
doc.DocumentNode.SelectNodes("//a").ToList()
.ForEach(x=>
{
//Use HasClass method to filter elements
if (!string.IsNullOrEmpty(x.GetAttributeValue("href", ""))
&& x.HasClass("result-title") && x.HasClass("js-result-title"))
{
listOfUrls.Add(x.GetAttributeValue("href", ""));
}
});
listOfUrls.ForEach(x => Console.WriteLine(x));
EDIT
Added && x.HasClass("result-title") && x.HasClass("js-result-title") to show only those elements that have both the result-title and js-result-title classes.
Another way
A shorter alternative that gets the same filtered values.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");
var listOfUrls = doc.DocumentNode.Descendants("a")
.Where(x => x.Attributes["class"] != null
&& x.Attributes["class"].Value == "result-title js-result-title")
.Select(x => x.GetAttributeValue("href", "")).ToList();
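A further variant, offered only as a sketch: the exact string comparison on the class attribute above stops matching if the site adds a class or changes the order. The XPath below matches the result-title class as a whole token instead, and Distinct() removes repeated URLs. The class and method names (ResultLinkDemo, ExtractTitleUrls) and the shortened sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ResultLinkDemo
{
    // Returns the distinct href values of <a> elements whose class list
    // contains the token "result-title" (hypothetical helper).
    public static List<string> ExtractTitleUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class list with spaces so "result-title" is matched as a
        // whole token, regardless of other classes or their order.
        var nodes = doc.DocumentNode.SelectNodes(
            "//a[contains(concat(' ', normalize-space(@class), ' '), ' result-title ')]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null)
            return new List<string>();

        return nodes
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        // Shortened sample modeled on the question's markup.
        const string html = @"
<div class='result-firstline-title'>
  <a class='result-title js-result-title'
     href='https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554'>The Top</a>
</div>
<a class='result-url js-result-url'
   href='https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554'>https://www.lifewire.com/top-...</a>";

        foreach (var url in ExtractTitleUrls(html))
            Console.WriteLine(url);
    }
}
```

Because the result-url link is never selected, neither its href nor its truncated link text ends up in the list.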

Related

C# HTMLNode get correctly innerText of div

I am trying to correctly extract the innerText of a list of divs that I get from a website.
This is what I came up with, but it is still a bit buggy: it misses the whitespace and the - symbol.
var first = mainmenuTitles[x].Descendants("div").FirstOrDefault(o => o.GetAttributeValue("class", "") == "left").Elements("a").ToList();
string final = "";
foreach (var countfirst in first)
{
final += countfirst.InnerText;
}
Console.WriteLine("Tittle: " + final);
This is what the HTML looks like:
<div class="row row-tall mt4">
<div class="clear">
<div class="left">
<a href="/soccer/italy/">
<strong>Italy</strong>
</a>
-
Serie C:: group B
</div> <div class="right fs11"> March 31 </div> </div> </div>
The text I am trying to get should look like this ->
Italy - Serie C:: group B
I am not an HTML guru, so forgive me if this is too simple and I am missing it.
You can write a query to look up all nodes with xpath //div/a and then concatenate the inner text to get the text you are looking for. Make sure you trim the text to get rid of extra spaces and returns.
Console.WriteLine(string.Join(" - ", doc.DocumentNode.SelectNodes("//div/a").Select(x => x.InnerText.Trim())));
Output:
Italy - Serie C:: group B
Side note: you can use different queries to make sure you get the right div by using the class name as well, e.g. .SelectNodes("//div[@class='row row-tall mt4']//a");. This will give you all the <a> tags under that div.
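In the markup from the question, the "- Serie C:: group B" part is bare text inside the div rather than inside an <a>, so another option is to read the InnerText of the whole div with class left and collapse the whitespace. This is only a sketch under that assumption; InnerTextDemo and GetLeftText are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class InnerTextDemo
{
    // Returns the text of the first <div class="left"> with runs of
    // whitespace collapsed to single spaces (hypothetical helper).
    public static string GetLeftText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var left = doc.DocumentNode.SelectSingleNode("//div[@class='left']");
        if (left == null)
            return string.Empty;

        // InnerText includes the text of child elements as well as the
        // bare text nodes that follow the <a>.
        string text = HtmlEntity.DeEntitize(left.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        const string html = @"
<div class='row row-tall mt4'>
  <div class='clear'>
    <div class='left'>
      <a href='/soccer/italy/'>
        <strong>Italy</strong>
      </a>
      -
      Serie C:: group B
    </div> <div class='right fs11'> March 31 </div> </div> </div>";

        Console.WriteLine(GetLeftText(html));
    }
}
```

For the sample above this should print Italy - Serie C:: group B, since the line breaks and indentation between the pieces collapse to single spaces.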

XPath retrieving values from multiple tags inside a node

I'm currently creating a crawler and I'm at the point where I need to abstract data in a set so I can send it to a database as a single row, nice and neat.
Here is a snippet of my program; so far it correctly goes to each page and retrieves the corresponding URL.
int tempflag = 0;
//linkValueList is full of sub urls previously crawled in the program
foreach (string str in linkValueList)
{
string tempURL = baseURL + str;
HtmlWeb tempWeb = new HtmlWeb();
HtmlDocument tempHtml = tempWeb.Load(tempURL);
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
//get the category from the linkNameList
string tempCategory = linkNameList.ElementAt(tempflag);
//grab url
string tempHref = node.GetAttributeValue("data-itemurl", string.Empty);
//grab image url
//grab brand
//grab name
//grab price
//send to database via INSERT
}
tempflag++;
}
Here is the site markup I am working with. This is an example of one item; each item looks similar.
<article .... itemprop="product" data-itemurl="Item's url">
<figure>
<a ....>
<img .... src="item's image source" ...>
</a>
<div ...>
<a>....</a>
</div>
</figure>
<div ...>
<a ....>
<div class="brand" itemprop="brand">Item's Brand</div>
<div class="title" itemprop="name">Item's Name</div>
</a>
<div ....>
<div class="msrp"></div>
<div class="price" itemprop="price">$18.99 - $119.99</div>
<span ...> ... </span>
</div>
</div>
</article>
As you can see I have already used XPath to get myself inside of the <article> tag to get the data-itemurl to retrieve the item's url. My question is now that I am already inside of the <article> tag, is there an easy way to now access the other tags nested inside?
I need to get to the <img> tag for the image's url, <div itemprop="brand"> for the brand, <div itemprop="name"> for the item name, and <div itemprop="price"> for the price.
As I mentioned before, I am trying to get all of that information in one go around so I can query it all into a database as a single insert statement at the end of each loop.
Sure, you can use another XPath to query within a given element. One thing to note, which has tripped up many people: never start a relative XPath with /, because it will search the entire document instead; start it with ./ (or .// for descendants). For example (SelectSingleNode() is assumed to always find the target element here; otherwise you need to check that the result is not null first):
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
// the leading . keeps each query inside the current <article>
string img = node.SelectSingleNode(".//img").GetAttributeValue("src","");
string brand = node.SelectSingleNode(".//div[@itemprop='brand']").InnerText.Trim();
.....
}
You can also use node.Descendants("img") or node.Descendants("div").Where(d => d.Attributes.Contains("itemprop") && d.Attributes["itemprop"].Value.Equals("price"))
Hope it helps.
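To make the "one row per article" idea concrete, here is a minimal sketch that collects all five fields with relative XPath queries and null checks. ProductRow, ProductScrapeDemo, TextOf and the sample values are invented for illustration; the database INSERT is left out.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container for one database row.
public sealed class ProductRow
{
    public string Url { get; set; }
    public string ImageUrl { get; set; }
    public string Brand { get; set; }
    public string Name { get; set; }
    public string Price { get; set; }
}

public static class ProductScrapeDemo
{
    // Trimmed inner text of the first match under context, or "" if none.
    private static string TextOf(HtmlNode context, string relativeXPath)
    {
        var node = context.SelectSingleNode(relativeXPath);
        return node == null ? string.Empty : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static List<ProductRow> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<ProductRow>();
        var articles = doc.DocumentNode.SelectNodes("//article[@itemprop='product']");
        if (articles == null)
            return rows; // SelectNodes returns null when nothing matches

        foreach (var article in articles)
        {
            // Every query starts with . so it stays inside this <article>.
            var img = article.SelectSingleNode(".//img");
            rows.Add(new ProductRow
            {
                Url = article.GetAttributeValue("data-itemurl", string.Empty),
                ImageUrl = img == null ? string.Empty : img.GetAttributeValue("src", string.Empty),
                Brand = TextOf(article, ".//div[@itemprop='brand']"),
                Name = TextOf(article, ".//div[@itemprop='name']"),
                Price = TextOf(article, ".//div[@itemprop='price']")
            });
        }
        return rows;
    }

    public static void Main()
    {
        const string html = @"
<article itemprop='product' data-itemurl='/item/1'>
  <figure><a><img src='/img/1.png'></a></figure>
  <div><a><div class='brand' itemprop='brand'>Brand A</div>
  <div class='title' itemprop='name'>Item One</div></a>
  <div><div class='price' itemprop='price'>$18.99 - $119.99</div></div></div>
</article>";

        foreach (var row in Extract(html))
            Console.WriteLine($"{row.Url} | {row.ImageUrl} | {row.Brand} | {row.Name} | {row.Price}");
    }
}
```

Each ProductRow can then be passed to whatever parameterized INSERT the crawler already uses.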

Cannot get content of specific div with html agility pack

I'm using Html Agility Pack to take some data from a website, and I have run into a small problem. I want to get some data from this div:
<div class="container middle">
<div class="details clearfix">
<dl>
<dt>Gara</dt>
<dd>Super League</dd>
<dt>Data</dt>
<dd><span class='timestamp' data-value='1467459300' data-format='d mmmm yyyy'>2 luglio 2016</span></dd>
<dt>Game week</dt>
<dd>15</dd>
<dt>calcio di inizio</dt>
<dd>
<span class='timestamp' data-value='1467459300' data-format='HH:MM'>13:35</span>
(<span class="game-minute">FP'</span>)
</dd>
</dl>
</div>
The problem is that there are two divs with the classes container middle and details clearfix, and I want the content only of the specific div pasted above. This div contains a dl tag holding each of the entries.
This is my code:
var url = "http://it.soccerway.com/matches/2016/07/02/china-pr/csl/henan-jianye/beijing-guoan-football-club/2207361/";
var doc = new HtmlDocument();
doc.LoadHtml(new WebClient().DownloadString(url));
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode("//div[@class='container middle']");
and this returns the wrong result, specifically this:
<div class="container middle">
<h3 class="thick scoretime score-orange">
0 - 0
</h3>
this is the complete source code.
Well, you could do the following, for this particular web-page:
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
Console.WriteLine(matchDetails[1].InnerHtml);
and then work with the HtmlNode via matchDetails[1]. To retrieve the other data you can use similar XPath queries, like:
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
var dl = matchDetails[1].SelectSingleNode(".//dl");
var dt = dl.SelectNodes(".//dt");
var dd = dl.SelectNodes(".//dd");
for (int i = 0; i < dt.Count; i++) {
var name = dt[i].InnerHtml;
var value = dd[i].InnerHtml;
Console.WriteLine(name + ": " + value);
}
Of course, you need some null checks and so on.
Querying for the div with class details clearfix should return the target div element. There is one crucial detail you need to be aware of, though: a . before / is needed to make the XPath relative to the context element referenced by infoDiv; otherwise the XPath will be evaluated on the root document context (as if it was called on doc.DocumentNode instead of on infoDiv):
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode(".//div[@class='details clearfix']");
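The difference between the two forms is easy to see on a tiny document. The sketch below uses invented markup (an id of info and the texts outside/inside are assumptions, not taken from the real page) and runs the same query from the same context node with and without the leading dot.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Runs the same class query from the #info node, once as "//..." and
    // once as ".//...", and returns the text each one finds.
    public static string[] Compare(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var info = doc.DocumentNode.SelectSingleNode("//div[@id='info']");
        if (info == null)
            throw new ArgumentException("sample markup must contain <div id='info'>");

        // No leading dot: evaluated from the document root, so the first
        // match anywhere in the document wins.
        var absolute = info.SelectSingleNode("//div[@class='container middle']");

        // Leading dot: only descendants of the context node are considered.
        var relative = info.SelectSingleNode(".//div[@class='container middle']");

        return new[]
        {
            absolute?.InnerText.Trim() ?? "(none)",
            relative?.InnerText.Trim() ?? "(none)"
        };
    }

    public static void Main()
    {
        const string html =
            "<div class='container middle'>outside</div>" +
            "<div id='info'><div class='container middle'>inside</div></div>";

        var result = Compare(html);
        Console.WriteLine("//  found: " + result[0]);
        Console.WriteLine(".// found: " + result[1]);
    }
}
```

With this sample, the query without the dot is expected to return the div that lies outside the context node, which mirrors the wrong result described in the question.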

How to get specific data using HtmlAgilityPack

I am using HtmlAgilityPack for scraping data from a web page (its URL is in the code below).
The structure is something like this:
<div id="left">
<h2>
<i id="bn7483" class="fa fa-volume-up fa-lg in au" title="Speak!"/>
<span class="in">(dhaarmika) </span>
<div class="row">
...
I need two pieces of data from there: one is "(dhaarmika)" and the other is the id "bn7483". Using this code
HtmlAgilityPack.HtmlDocument doc2 = web2.Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
HtmlNodeCollection nodes = doc2.DocumentNode.SelectNodes("//span[@class='in']");
I was able to get the first piece of data, "(dhaarmika)", but I couldn't get the second.
Could anyone tell me how to get the second one?
Another possible way is to select the preceding sibling of the <span> you already found:
var doc2 = new HtmlWeb().Load("http://www.shabdkosh.com/bn/translate/ধার্মিক");
var span = doc2.DocumentNode.SelectSingleNode("//span[@class='in']");
var i = span.SelectSingleNode("preceding-sibling::i[@id]")
.Attributes["id"]
.Value;
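The same lookup can be wrapped with null checks so that a page without the span or the icon does not throw. This is a sketch only: SiblingIdDemo and GetSpeakIconId are hypothetical names, and the sample writes the <i> element with an explicit closing tag to keep the markup unambiguous.

```csharp
using System;
using HtmlAgilityPack;

public static class SiblingIdDemo
{
    // Returns the id of the <i> that precedes the first <span class="in">,
    // or null when either element is missing (hypothetical helper).
    public static string GetSpeakIconId(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var span = doc.DocumentNode.SelectSingleNode("//span[@class='in']");
        if (span == null)
            return null;

        // preceding-sibling:: looks only at earlier children of the same parent.
        var icon = span.SelectSingleNode("preceding-sibling::i[@id]");
        return icon == null ? null : icon.GetAttributeValue("id", null);
    }

    public static void Main()
    {
        const string html = @"
<div id='left'>
  <h2>
    <i id='bn7483' class='fa fa-volume-up fa-lg in au' title='Speak!'></i>
    <span class='in'>(dhaarmika) </span>
  </h2>
</div>";

        Console.WriteLine(GetSpeakIconId(html) ?? "(not found)");
    }
}
```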

Searching in HTML file using C# where many similar tags exist

Imagine the part of an HTML file below:
<div class='span1 league'>
<div class='league-gold-1 leagues size-64'></div>
</div>
<div class='span4 stats'>
<div class='points'>
<span class="gold">491</span>
points
(<span class="gold">391</span> away for region #1)
</div>
<div class='games'>
Won <span class="text-success">37</span>,
lost <span class="text-error">51</span>,
ratio <span>42.05</span>%
</div>
<div class='race'>
Favorite Race:
<div class='race-terran races size-16'></div>
<span>Terran</span>
</div>
</div>
Say I need to get the number of won and lost games, which are 37 and 51 in this case, and also the points (in this case 491). I've been trying with Html Agility Pack but have had no success so far. If you know a way around this, please let me know!
Using HtmlAgilityPack
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(fname);
var won = doc.DocumentNode.SelectSingleNode("//div[@class='games']/*[@class='text-success']").InnerText;
var lost = doc.DocumentNode.SelectSingleNode("//div[@class='games']/*[@class='text-error']").InnerText;
var points = doc.DocumentNode.SelectSingleNode("//div[@class='points']/*[@class='gold']").InnerText;
You can also use Linq instead of XPath
var won = doc.DocumentNode.Descendants("span")
.First(s=>s.Attributes.Any(a=>a.Value=="text-success"))
.InnerText;
As a workaround you could try regex
Match m = Regex.Match(htmlstring, "<span class=\"text-success\">([0-9]+?)</span>.*?<span class=\"text-error\">([0-9]+?)</span>", RegexOptions.Singleline);
string won = m.Result("$1");
string loss = m.Result("$2");
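If the values are needed as numbers rather than strings, the XPath approach can be combined with int.TryParse. The following is a rough sketch using the markup from the question; StatsDemo and ReadInt are made-up names, and a missing or non-numeric node simply yields null.

```csharp
using System;
using System.Globalization;
using HtmlAgilityPack;

public static class StatsDemo
{
    // Parses the text of the first node matching xpath as an integer;
    // returns null if the node is missing or not numeric (hypothetical helper).
    public static int? ReadInt(HtmlNode root, string xpath)
    {
        var node = root.SelectSingleNode(xpath);
        if (node == null)
            return null;

        return int.TryParse(node.InnerText.Trim(), NumberStyles.Integer,
                            CultureInfo.InvariantCulture, out int value)
            ? value
            : (int?)null;
    }

    public static void Main()
    {
        const string html = @"
<div class='span4 stats'>
  <div class='points'>
    <span class='gold'>491</span> points
    (<span class='gold'>391</span> away for region #1)
  </div>
  <div class='games'>
    Won <span class='text-success'>37</span>,
    lost <span class='text-error'>51</span>,
    ratio <span>42.05</span>%
  </div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var root = doc.DocumentNode;

        // SelectSingleNode returns the first match in document order, so
        // the points query picks the first "gold" span.
        Console.WriteLine("Won:    " + ReadInt(root, "//div[@class='games']/span[@class='text-success']"));
        Console.WriteLine("Lost:   " + ReadInt(root, "//div[@class='games']/span[@class='text-error']"));
        Console.WriteLine("Points: " + ReadInt(root, "//div[@class='points']/span[@class='gold']"));
    }
}
```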
