I am trying to correctly extract the InnerText of a list of divs I am getting from a website.
This is what I came up with, but it is still a bit buggy: it misses the whitespace and the - symbol.
var first = mainmenuTitles[x].Descendants("div").FirstOrDefault(o => o.GetAttributeValue("class", "") == "left").Elements("a").ToList();
string final = "";
foreach (var countfirst in first)
{
final += countfirst.InnerText;
}
Console.WriteLine("Title: " + final);
This is what the HTML looks like:
<div class="row row-tall mt4">
<div class="clear">
<div class="left">
<a href="/soccer/italy/">
<strong>Italy</strong>
</a>
-
Serie C:: group B
</div>
<div class="right fs11"> March 31 </div>
</div>
</div>
The text I am trying to get should look like this ->
Italy - Serie C:: group B
I am not an HTML guru, so forgive me if it is too simple and I am missing it.
You can write a query to look up all nodes with xpath //div/a and then concatenate the inner text to get the text you are looking for. Make sure you trim the text to get rid of extra spaces and returns.
Console.WriteLine(string.Join(" - ", doc.DocumentNode.SelectNodes("//div/a").Select(x => x.InnerText.Trim())));
Output:
Italy - Serie C:: group B
Side note: you can use different queries to make sure you get the right div, for example by matching on the class name as well, e.g. .SelectNodes("//div[@class='row row-tall mt4']//a"). This will give you all the <a> tags under that div.
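In the HTML from the question, the "- Serie C:: group B" part is a bare text node inside the div with class left, not an <a> element, so a query that selects only anchors cannot see it. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and the markup matches the sample above (the html string and variable names are illustrative), that reads the whole div's InnerText and collapses the whitespace:

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Sample markup copied from the question.
        string html = @"<div class=""row row-tall mt4""><div class=""clear""><div class=""left"">
            <a href=""/soccer/italy/"">
                <strong>Italy</strong>
            </a>
            -
            Serie C:: group B
        </div> <div class=""right fs11""> March 31 </div></div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the div itself so that its bare text nodes are included.
        HtmlNode left = doc.DocumentNode.SelectSingleNode("//div[@class='left']");
        if (left != null)
        {
            // Decode HTML entities, then collapse runs of whitespace to one space.
            string text = HtmlEntity.DeEntitize(left.InnerText);
            string title = Regex.Replace(text, @"\s+", " ").Trim();
            Console.WriteLine("Title: " + title);
        }
    }
}
```

For the sample markup this should yield the title in the form the question asks for; real pages may need a more specific XPath than the class name alone.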
I am trying to parse text (mainly the URL) from the HTML below, but I would only like to grab the URL between the div tags (result-firstline-title) and (result-url js-result-url), for every occurrence.
To be clear, I am able to grab all the URLs from the HTML source below, but the problem is that each URL is grabbed almost 3 times. I have a fix for that, which is to remove duplicate URLs; however, if you look carefully at the HTML source, you will see that it also grabs the 3rd URL.
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
The Top Social Networking Sites People Are Using
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
The Top
</p>
</div>
<div class="result js-result card-mobile ">
<div class="result-firstline-container">
<div class="result-firstline-title">
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
Top 15 Most Popular Social Networking Sites | January 2019
</a>
</div>
</div>
<a
class="result-url js-result-url"
href="http://www.ebizmba.com/articles/social-networking-websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
</a>
<p class="result-snippet">
Top 15 Most
</p>
</div>
I have tried the following C# code to grab the text between the div tags, but it grabs everything, which I don't want.
int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);
To grab the URLs I am using the following:
var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);
What I want is to grab the URLs from these:
<a
class="result-title js-result-title"
href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"
>
<a
class="result-title js-result-title"
href="http://www.ebizmba.com/articles/social-networking-websites"
>
So that the outcome shows only:
https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites
You can make this easier by using HtmlAgilityPack; just include it in your project using NuGet.
To add HtmlAgilityPack using NuGet,
go to the Package Manager Console and type Install-Package HtmlAgilityPack -Version 1.11.3
After the installation you can extract the URLs like below.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");
var listOfUrls = new List<string>();
doc.DocumentNode.SelectNodes("//a").ToList()
.ForEach(x=>
{
//Use HasClass method to filter elements
if (!string.IsNullOrEmpty(x.GetAttributeValue("href", ""))
&& x.HasClass("result-title") && x.HasClass("js-result-title"))
{
listOfUrls.Add(x.GetAttributeValue("href", ""));
}
});
listOfUrls.ForEach(x => Console.WriteLine(x));
EDIT
Added && x.HasClass("result-title") && x.HasClass("js-result-title") to show only those elements which have the classes result-title and js-result-title.
Another way
A shorter way to get the filtered values:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");
var listOfUrls = doc.DocumentNode.Descendants("a")
.Where(x => x.Attributes["class"] != null
&& x.Attributes["class"].Value == "result-title js-result-title")
.Select(x => x.GetAttributeValue("href", "")).ToList();
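The same filtering can also be pushed into the XPath itself, so that only anchors directly inside a result-firstline-title div are visited. The following is a rough sketch under a few assumptions: HtmlAgilityPack is referenced, the class attribute of the div is exactly result-firstline-title, and the helper name ExtractResultUrls and the example.com markup are made up for illustration. SelectNodes returns null when nothing matches, so the sketch checks for that, and Distinct() drops repeated links.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: returns the distinct title-link URLs found in html.
    static List<string> ExtractResultUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only anchors that are direct children of the title div and carry an href.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//div[@class='result-firstline-title']/a[@href]");
        if (anchors == null)
            return new List<string>();   // no match: SelectNodes gave null

        return anchors
            .Select(a => a.GetAttributeValue("href", "").Trim())
            .Where(href => href.Length > 0)
            .Distinct()                  // guards against the same link appearing twice
            .ToList();
    }

    static void Main()
    {
        // Illustrative markup, reduced from the structure shown in the question.
        string html = @"<div class=""result-firstline-title"">
            <a class=""result-title js-result-title"" href=""https://example.com/a"">A</a>
        </div>
        <a class=""result-url js-result-url"" href=""https://example.com/a"">example.com/a</a>";

        foreach (string url in ExtractResultUrls(html))
            Console.WriteLine(url);
    }
}
```

Because the second anchor is not inside the title div, it is never selected, which avoids the duplicate without a separate clean-up pass.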
I'm currently creating a crawler and I'm at the point where I need to gather the data for each item into a set so I can send it to a database as a single row, nice and neat.
Here is a snippet of my program; so far it correctly goes to each page and retrieves the corresponding URL.
int tempflag = 0;
//linkValueList is full of sub urls previously crawled in the program
foreach (string str in linkValueList)
{
string tempURL = baseURL + str;
HtmlWeb tempWeb = new HtmlWeb();
HtmlDocument tempHtml = tempWeb.Load(tempURL);
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
//get the category from the linkNameList
string tempCategory = linkNameList.ElementAt(tempflag);
//grab url
string tempHref = node.GetAttributeValue("data-itemurl", string.Empty);
//grab image url
//grab brand
//grab name
//grab price
//send to database via INSERT
}
tempflag++;
}
Here is the site code I am working with. This is an example of one item; each item looks similar.
<article .... itemprop="product" data-itemurl="Item's url">
<figure>
<a ....>
<img .... src="item's image source" ...>
</a>
<div ...>
<a>....</a>
</div>
</figure>
<div ...>
<a ....>
<div class="brand" itemprop="brand">Item's Brand</div>
<div class="title" itemprop="name">Item's Name</div>
</a>
<div ....>
<div class="msrp"></div>
<div class="price" itemprop="price">$18.99 - $119.99</div>
<span ...> ... </span>
</div>
</div>
</article>
As you can see I have already used XPath to get myself inside of the <article> tag to get the data-itemurl to retrieve the item's url. My question is now that I am already inside of the <article> tag, is there an easy way to now access the other tags nested inside?
I need to get to the <img> tag for the image's url, <div itemprop="brand"> for the brand, <div itemprop="name"> for the item name, and <div itemprop="price"> for the price.
As I mentioned before, I am trying to get all of that information in one go around so I can query it all into a database as a single insert statement at the end of each loop.
Sure, you can use another XPath to query within a given element. One thing to note, which has tripped up many people: never start a relative XPath with /, because it will then be evaluated against the entire document instead of the current node. Start it with ./ if you need to. For example (SelectSingleNode() is assumed to always find the target element here; otherwise you need to check that the result is not null first):
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
// the leading dot keeps each query relative to the current <article>
string img = node.SelectSingleNode(".//img").GetAttributeValue("src","");
string brand = node.SelectSingleNode(".//div[@itemprop='brand']").InnerText.Trim();
.....
}
You can also use node.Descendants("img") or node.Descendants("div").Where(d => d.Attributes.Contains("itemprop") && d.Attributes["itemprop"].Value.Equals("price"))
Hope it helps.
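Putting the relative-XPath advice together with the null check it mentions, one possible shape for the loop is sketched below. It assumes HtmlAgilityPack is referenced; the helper TextOf, the sample markup, and the /item/1 and /img/1.jpg values are hypothetical stand-ins, and the database INSERT is left out.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: trimmed inner text of the first match, or "" if none.
    static string TextOf(HtmlNode scope, string relativeXPath)
    {
        HtmlNode hit = scope.SelectSingleNode(relativeXPath);
        return hit == null ? "" : HtmlEntity.DeEntitize(hit.InnerText).Trim();
    }

    static void Main()
    {
        // Reduced version of the item markup from the question.
        string html = @"<article itemprop=""product"" data-itemurl=""/item/1"">
            <figure><a><img src=""/img/1.jpg""></a></figure>
            <div><a>
                <div class=""brand"" itemprop=""brand"">Item's Brand</div>
                <div class=""title"" itemprop=""name"">Item's Name</div>
            </a>
            <div><div class=""price"" itemprop=""price"">$18.99 - $119.99</div></div></div>
        </article>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var articles = doc.DocumentNode.SelectNodes("//article[@itemprop='product']");
        if (articles == null) return;   // nothing matched on this page

        foreach (HtmlNode node in articles)
        {
            string url = node.GetAttributeValue("data-itemurl", string.Empty);

            // ".//" keeps every query scoped to the current <article>.
            HtmlNode imgNode = node.SelectSingleNode(".//img");
            string img = imgNode == null ? "" : imgNode.GetAttributeValue("src", "");
            string brand = TextOf(node, ".//div[@itemprop='brand']");
            string name = TextOf(node, ".//div[@itemprop='name']");
            string price = TextOf(node, ".//div[@itemprop='price']");

            // One row per item; a real crawler would build the INSERT here.
            Console.WriteLine(string.Join(" | ", url, img, brand, name, price));
        }
    }
}
```

Returning an empty string for a missing field is only one option; depending on the database schema, skipping the item or storing NULL may be more appropriate.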
I am trying to go through each html node and get its attribute and innerText. At the moment when I am scanning through any html I am getting this stupid #text node even though it doesn't exist.
Here is my html
<div class="demographic-info adr editable-item" id="demographics">
<div id="location-container" data-li-template="location">
<div id="location" class="editable-item">
<dl>
<dt>Location</dt>
<dd>
<span class="locality">Bolton, United Kingdom</span>
</dd>
<dt>Industry</dt>
<dd class="industry">Computer Games</dd>
</dl>
</div>
</div>
</div>
And here is my C#:
foreach (HtmlNode node in j.ChildNodes)
if (node.HasChildNodes)
checkNode(node);
static void checkNode(HtmlNode node)
{
foreach (HtmlNode n in node.ChildNodes)
{
if (n.HasChildNodes)
checkNode(n);
else
{
HtmlNode nodeValue = hasValueInNode(n);
if (nodeValue != null)
addCategories(nodeValue);
}
}
}
When I step through in debug mode to check which node the code is at, I get this:
1 = div, 2 = #text, 3 = div, 4 = #text, 5 = div, 6 = #text, 7 = dl ...
and so on!
I am guessing that it is detecting blank space or line breaks as a node, but this is such a waste of loops. Can someone explain this to me, and suggest a way to avoid it? Thanks
This is how HTML/XML works. There is a text node every time there is some text inside a node. In this case it happens to be whitespace, but it is still text and it cannot be discarded. The node is not "stupid" and it does exist.
Your code is free to check if the text node is whitespace and ignore it if you want to, or you can make the XML so that there isn't any whitespace.
Just as a thought: how would you tell the parser which whitespace should be important:
<div>
<div>Test<span>
</span>test</div>
</div>
So, should the parser just say "there's Test, then an empty span element, and then test, so actually the text inside is 'Testtest'"? Or how would it know what to do?
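If the whitespace-only text nodes are not needed, the traversal can simply skip them. A minimal sketch, assuming HtmlAgilityPack is referenced; the Walk method and the reduced markup are illustrative and not taken from the question's code:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    // Recursively prints element names and non-blank text, indented by depth.
    static void Walk(HtmlNode node, int depth)
    {
        string indent = new string(' ', depth * 2);
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Skip text nodes that contain nothing but whitespace.
                if (string.IsNullOrWhiteSpace(child.InnerText))
                    continue;
                Console.WriteLine(indent + "text: " + child.InnerText.Trim());
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine(indent + child.Name);
                Walk(child, depth + 1);
            }
            // Other node types (for example comments) are ignored here.
        }
    }

    static void Main()
    {
        string html = @"<div id=""location"">
            <dl>
                <dt>Location</dt>
                <dd><span class=""locality"">Bolton, United Kingdom</span></dd>
            </dl>
        </div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Walk(doc.DocumentNode, 0);
    }
}
```

The text nodes are still created by the parser; the loop just declines to process the blank ones, which keeps meaningful text such as "Location" intact.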
I'm using CsQuery to read values of HTML elements.
In advance, I don't know if the <a> element contains a <font> element or not.
Is there a way to read the InnerText of an anchor regardless of whether it contains a font element or not?
Scenario 1: Text inside font element
<div class="link">
<a href="http://www.example.com/1">
<font>Foo</font>
</a>
</div>
Scenario 2: Text without font element
<div class="link">
<a href="http://www.example.com/2">
Foo
</a>
</div>
I've got the following working solution:
var dom = CQ.CreateFromUrl("http://www.myurl.com");
var a = new CQ(dom.Select("div.link a").InnerHTML);
var font = a.Select("font");
var myValue = font.Count() > 0 ? font[0].InnerText : a[0].InnerText;
But it's a bit messy and I'd rather just always remove the font element, if present, so I could go for the anchor value right away. Something like Contents() in combination with UnWrap(), but I haven't managed to make it work. Ideas, anyone?
var dom = CQ.CreateFromUrl("http://www.myurl.com");
string result = dom[".link a"].Text();
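Text() returns the combined text of the selected elements and their descendants, so the presence or absence of the <font> wrapper should not matter. A small sketch of both scenarios, assuming the CsQuery package is referenced and using CQ.Create on in-memory strings (the two html strings are reduced copies of the question's markup) instead of fetching a URL:

```csharp
using System;
using CsQuery;

class Program
{
    static void Main()
    {
        // Scenario 1: text wrapped in <font>. Scenario 2: bare text.
        string withFont =
            @"<div class=""link""><a href=""http://www.example.com/1""> <font>Foo</font> </a></div>";
        string withoutFont =
            @"<div class=""link""><a href=""http://www.example.com/2""> Foo </a></div>";

        foreach (string html in new[] { withFont, withoutFont })
        {
            CQ dom = CQ.Create(html);

            // Text() flattens nested markup; Trim() removes the surrounding whitespace.
            string text = dom["div.link a"].Text().Trim();
            Console.WriteLine(text);
        }
    }
}
```

If the selector matches several anchors, Text() concatenates their text, so a page with multiple links would need to be iterated element by element.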
Imagine the part of an HTML file below:
<div class='span1 league'>
<div class='league-gold-1 leagues size-64'></div>
</div>
<div class='span4 stats'>
<div class='points'>
<span class="gold">491</span>
points
(<span class="gold">391</span> away for region #1)
</div>
<div class='games'>
Won <span class="text-success">37</span>,
lost <span class="text-error">51</span>,
ratio <span>42.05</span>%
</div>
<div class='race'>
Favorite Race:
<div class='race-terran races size-16'></div>
<span>Terran</span>
</div>
</div>
Say I need to get the number of won and lost games, which are 37 and 51 in this case, and also the points (in this case 491). I've been trying with Html Agility Pack but have had no success so far. If you know a way around this, please let me know!
Using HtmlAgilityPack
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(fname);
var won = doc.DocumentNode.SelectSingleNode("//div[@class='games']/*[@class='text-success']").InnerText;
var lost = doc.DocumentNode.SelectSingleNode("//div[@class='games']/*[@class='text-error']").InnerText;
var points = doc.DocumentNode.SelectSingleNode("//div[@class='points']/*[@class='gold']").InnerText;
You can also use LINQ instead of XPath
var won = doc.DocumentNode.Descendants("span")
.First(s=>s.Attributes.Any(a=>a.Value=="text-success"))
.InnerText;
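Both variants above throw if the expected element is missing. As a hedged sketch (HtmlAgilityPack assumed to be referenced; ReadInt is a hypothetical helper and the markup is a reduced copy of the question's), the lookups can be wrapped so that a missing or non-numeric value comes back as null instead:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: the integer in the first node matching xpath, or null.
    static int? ReadInt(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null) return null;

        int value;
        return int.TryParse(node.InnerText.Trim(), out value) ? value : (int?)null;
    }

    static void Main()
    {
        string html = @"<div class='points'><span class=""gold"">491</span> points</div>
<div class='games'>Won <span class=""text-success"">37</span>,
lost <span class=""text-error"">51</span></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        int? won = ReadInt(doc, "//div[@class='games']/span[@class='text-success']");
        int? lost = ReadInt(doc, "//div[@class='games']/span[@class='text-error']");
        // SelectSingleNode returns the first match, i.e. the first gold span.
        int? points = ReadInt(doc, "//div[@class='points']/span[@class='gold']");

        Console.WriteLine("won=" + won + ", lost=" + lost + ", points=" + points);
    }
}
```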
As a workaround you could try regex
Match m = Regex.Match(htmlstring, "<span class=\"text-success\">([0-9]+?)</span>.*?<span class=\"text-error\">([0-9]+?)</span>", RegexOptions.Singleline);
string won = m.Result("$1");
string loss = m.Result("$2");
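If the regex route is taken, it is worth checking Match.Success before reading the groups, since the pattern will not match when the markup changes. A minimal sketch using only System.Text.RegularExpressions; the htmlstring value is a shortened copy of the question's markup:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string htmlstring =
            "Won <span class=\"text-success\">37</span>,\n" +
            "lost <span class=\"text-error\">51</span>,";

        // Singleline lets ".*?" cross the line break between the two spans.
        Match m = Regex.Match(
            htmlstring,
            "<span class=\"text-success\">([0-9]+)</span>.*?<span class=\"text-error\">([0-9]+)</span>",
            RegexOptions.Singleline);

        if (m.Success)
        {
            int won = int.Parse(m.Groups[1].Value);
            int lost = int.Parse(m.Groups[2].Value);
            Console.WriteLine("won=" + won + ", lost=" + lost);   // won=37, lost=51
        }
        else
        {
            Console.WriteLine("pattern not found");
        }
    }
}
```

This stays tied to the exact attribute order and quoting in the page, which is why the parser-based answers are usually the more robust choice.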