I am writing a program that parses a website.
I managed to find a link on the page, but I had to pass the exact InnerText to find it.
I'm looking for a way to do the same thing but match on partial inner text.
Example:
The inner text is: "hi my name is"
I want to be able to find it by supplying only
"hi my"
foreach (var title in htmlNodes)
{
    if (keywords == title.SelectSingleNode("div/h1").InnerText)
    {
        if (color == title.SelectSingleNode("div/p").InnerText)
        {
            Console.WriteLine(title.SelectSingleNode("div/p/a").GetAttributeValue("href", "pas d'addresse"));
        }
    }
}
Here, keywords has to match the InnerText of div/h1 exactly. I want the match to be partial.
Here is the HTML:
<article>
<div class="inner-article">
<a style = "height:150px;" href="/shop/shirts/c712g63kx/p1us9bkh7">
<img width = "150" height="150" src="//assets.supremenewyork.com/146319/vi/qW2Nur88W30.jpg" alt="Qw2nur88w30">
</a>
<h1>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Tiger Stripe Rayon Shirt</a>
</h1>
<p>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Teal</a>
</p>
</div>
</article>
Thank you all for your answers!
I found out how to solve my problem. It was actually quite simple. Here is the code:
if ((title.SelectSingleNode("div/h1").InnerText).Contains(keywords))
Now the remaining problem is making the match case-insensitive.
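For the case-insensitive part, one option is to compare with String.IndexOf and StringComparison.OrdinalIgnoreCase instead of Contains. This is a minimal sketch using only the .NET base library; the class name TextMatch and the helper ContainsIgnoreCase are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;

public static class TextMatch
{
    // Hypothetical helper: true when 'text' contains 'fragment', ignoring case.
    public static bool ContainsIgnoreCase(string text, string fragment)
    {
        if (text == null || fragment == null) return false;
        return text.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        Console.WriteLine(ContainsIgnoreCase("Tiger Stripe Rayon Shirt", "tiger stripe")); // True
        Console.WriteLine(ContainsIgnoreCase("Tiger Stripe Rayon Shirt", "denim"));        // False
    }
}
```

In the loop from the question this would read something like: if (TextMatch.ContainsIgnoreCase(title.SelectSingleNode("div/h1").InnerText, keywords)).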
I am trying to correctly extract the InnerText of a list of divs I am getting from a website.
This is what I came up with, but it is still a bit buggy: it misses the whitespace and the - symbol.
var first = mainmenuTitles[x].Descendants("div").FirstOrDefault(o => o.GetAttributeValue("class", "") == "left").Elements("a").ToList();
string final = "";
foreach (var countfirst in first)
{
    final += countfirst.InnerText;
}
Console.WriteLine("Title: " + final);
This is what the HTML looks like:
<div class="row row-tall mt4">
<div class="clear">
<div class="left">
<a href="/soccer/italy/">
<strong>Italy</strong>
</a>
-
Serie C:: group B
</div>
<div class="right fs11"> March 31 </div>
</div>
</div>
The text I am trying to get should look like this:
Italy - Serie C:: group B
I am not an HTML guru, so forgive me if this is too simple and I am missing it.
You can write a query to look up all nodes with the XPath //div/a and then concatenate their inner text to get the text you are looking for. Make sure you trim the text to get rid of extra spaces and line breaks.
Console.WriteLine(string.Join(" - ", doc.DocumentNode.SelectNodes("//div/a").Select(x => x.InnerText.Trim())));
Output:
Italy - Serie C:: group B
Side note: you can use more specific queries to make sure you get the right div, for example by class name: .SelectNodes("//div[@class='left']/a"). This gives you all the <a> tags directly under that div.
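If the markup is exactly as shown in the question, the dash and the league name are plain text nodes inside the div with class "left" rather than links, so another option is to read the InnerText of that whole div and collapse the whitespace. The following is only a sketch under that assumption; the class TitleText and the helper Normalize are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class TitleText
{
    // Hypothetical helper: decodes entities and collapses each run of
    // whitespace (spaces, tabs, newlines) into a single space.
    public static string Normalize(string raw)
    {
        return Regex.Replace(HtmlEntity.DeEntitize(raw ?? ""), @"\s+", " ").Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Sample markup copied from the question's div with class "left".
        doc.LoadHtml(
            "<div class=\"left\">\n<a href=\"/soccer/italy/\">\n<strong>Italy</strong>\n</a>\n-\nSerie C:: group B\n</div>");

        // InnerText of the div includes the link text and the loose text nodes.
        var left = doc.DocumentNode.SelectSingleNode("//div[@class='left']");
        if (left != null)
        {
            Console.WriteLine("Title: " + Normalize(left.InnerText));
        }
    }
}
```

Because InnerText covers every descendant text node, this should keep working even if the league name turns out to be wrapped in a second link on the real page.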
I'm currently creating a crawler, and I'm at the point where I need to collect the data into a set so I can send it to a database as a single row, nice and neat.
Here is a snippet of my program. So far it correctly goes to each page and retrieves the correct corresponding URL:
int tempflag = 0;
//linkValueList is full of sub urls previously crawled in the program
foreach (string str in linkValueList)
{
    string tempURL = baseURL + str;
    HtmlWeb tempWeb = new HtmlWeb();
    HtmlDocument tempHtml = tempWeb.Load(tempURL);
    foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
    {
        //get the category from the linkNameList
        string tempCategory = linkNameList.ElementAt(tempflag);
        //grab url
        string tempHref = node.GetAttributeValue("data-itemurl", string.Empty);
        //grab image url
        //grab brand
        //grab name
        //grab price
        //send to database via INSERT
    }
    tempflag++;
}
Here is the site markup I am working with. This is an example of one item; each item looks similar:
<article .... itemprop="product" data-itemurl="Item's url">
<figure>
<a ....>
<img .... src="item's image source" ...>
</a>
<div ...>
<a>....</a>
</div>
</figure>
<div ...>
<a ....>
<div class="brand" itemprop="brand">Item's Brand</div>
<div class="title" itemprop="name">Item's Name</div>
</a>
<div ....>
<div class="msrp"></div>
<div class="price" itemprop="price">$18.99 - $119.99</div>
<span ...> ... </span>
</div>
</div>
</article>
As you can see I have already used XPath to get myself inside of the <article> tag to get the data-itemurl to retrieve the item's url. My question is now that I am already inside of the <article> tag, is there an easy way to now access the other tags nested inside?
I need to get to the <img> tag for the image's url, <div itemprop="brand"> for the brand, <div itemprop="name"> for the item name, and <div itemprop="price"> for the price.
As I mentioned before, I am trying to get all of that information in one go around so I can query it all into a database as a single insert statement at the end of each loop.
Sure, you can run another XPath query within a given element. One thing that trips many people up: never start a relative XPath with /, because that searches from the document root instead; start with ./ (or .// for any depth) if you need to. For example (SelectSingleNode() is assumed to always find the target element here; otherwise you need to check the result for null first):
foreach (HtmlNode node in tempHtml.DocumentNode.SelectNodes("//article[@itemprop='product']"))
{
    var img = node.SelectSingleNode(".//img").GetAttributeValue("src","");
    var brand = node.SelectSingleNode(".//div[@itemprop='brand']").InnerText.Trim();
    .....
}
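For the case where an element might be missing, the null check mentioned above could look roughly like this. It is a sketch, not the answerer's code: the class ProductFields, the helper TextOrEmpty, and the sample values in the markup (such as /item/1 and /img/1.jpg) are invented for illustration, and the markup is a trimmed-down version of the item shown in the question.

```csharp
using System;
using HtmlAgilityPack;

public static class ProductFields
{
    // Hypothetical helper: trimmed inner text of the first node matching a
    // relative XPath, or "" when nothing matches.
    public static string TextOrEmpty(HtmlNode context, string relativeXPath)
    {
        var hit = context.SelectSingleNode(relativeXPath);
        return hit == null ? "" : HtmlEntity.DeEntitize(hit.InnerText).Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<article itemprop='product' data-itemurl='/item/1'>
  <figure><a><img src='/img/1.jpg'></a></figure>
  <div><a>
    <div class='brand' itemprop='brand'>Item's Brand</div>
    <div class='title' itemprop='name'>Item's Name</div>
  </a>
  <div><div class='price' itemprop='price'>$18.99 - $119.99</div></div></div>
</article>");

        var articles = doc.DocumentNode.SelectNodes("//article[@itemprop='product']");
        if (articles == null) return; // SelectNodes returns null when nothing matches

        foreach (var node in articles)
        {
            // Every path starts with ".//" so it is resolved relative to this article.
            string url = node.GetAttributeValue("data-itemurl", "");
            string img = node.SelectSingleNode(".//img")?.GetAttributeValue("src", "") ?? "";
            string brand = TextOrEmpty(node, ".//div[@itemprop='brand']");
            string name = TextOrEmpty(node, ".//div[@itemprop='name']");
            string price = TextOrEmpty(node, ".//div[@itemprop='price']");

            // All five values are now available for a single INSERT.
            Console.WriteLine($"{url} | {img} | {brand} | {name} | {price}");
        }
    }
}
```

Returning an empty string for a missing field keeps one incomplete item from aborting the whole crawl; whether that or skipping the row is preferable depends on the database schema.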
Sure you can use node.Descendants("img") or node.Descendants("div").Where(d => d.Attributes.Contains("itemprop") && d.Attributes["itemprop"].Value.Equals("price"))
Hope it helps.
How do I get the text "Attractions" from the HTML below?
<li class="product">
<strong>
Attractions
</strong>
<span></span>
</li>
I usually do this with the code below when I need the text inside a span, but I need some help with the situation above.
foreach (HtmlNode selectNode in htmlDocument.DocumentNode.SelectNodes("//span[@class='cityName']"))
{
    Result = selectNode.InnerHtml;
}
How can I do this?
Result = htmlDocument.DocumentNode.SelectSingleNode("//li[@class='product']/strong").InnerText.Trim();
You can also use a foreach over SelectNodes, as in your own code above.
I have the following HTML:
<ul class="enh-toggle">
<li>
Design<sup>1</sup><span class="accordion"></span>
<ul id="design">
<li>
<strong>Dimensions</strong>
<ul><li>length:12.3cm</li></ul>
</li>
</ul>
</li>
</ul>
I use the following code to get ul[id='design']:
HtmlNode node = doc.DocumentNode.SelectSingleNode("//ul[@class='enh-toggle']//ul[@id='design']");
This works perfectly.
Now my question is how to get the text of the strong tag. I use the following code, but it doesn't work:
string text = node.SelectSingleNode("/li/strong").InnerText;
A variation on the "li/strong" answers:
string text = node.SelectSingleNode("./li/strong").InnerText;
A leading single slash in XPath means the root of the document. You just want the direct children of the current node, so a relative path is enough:
string text = node.SelectSingleNode("li/strong").InnerText;
I think it should just be:
string text = node.SelectSingleNode("li/strong").InnerText;
..without the leading /.
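To see the difference between the three forms side by side, here is a small sketch that runs them against a compact copy of the markup from the question. The class name RelativeXPathDemo and the helper Describe are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical helper: inner text of the first match, or a marker when
    // the XPath finds nothing from the given context node.
    public static string Describe(HtmlNode context, string xpath)
    {
        var hit = context.SelectSingleNode(xpath);
        return hit == null ? "(no match)" : hit.InnerText.Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<ul class='enh-toggle'><li>Design<ul id='design'>
  <li><strong>Dimensions</strong><ul><li>length:12.3cm</li></ul></li>
</ul></li></ul>");

        HtmlNode node = doc.DocumentNode.SelectSingleNode("//ul[@class='enh-toggle']//ul[@id='design']");

        // Absolute path: evaluated from the document root, where there is no <li> child.
        Console.WriteLine("/li/strong  -> " + Describe(node, "/li/strong"));
        // Relative paths: evaluated from the ul with id 'design'.
        Console.WriteLine("li/strong   -> " + Describe(node, "li/strong"));
        Console.WriteLine("./li/strong -> " + Describe(node, "./li/strong"));
    }
}
```

The first line is expected to report no match, while both relative forms should find the strong element; this is also why calling .InnerText directly on the result of "/li/strong" throws a NullReferenceException.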
Could someone please help me with parsing sequential HTML tags using HtmlAgilityPack in C#? I have two questions, listed below.
In this case, I want to parse the following HTML and store it in a structure (list, stack, etc.) so I can use the data effectively.
<h3> header </h3>
<p> paragraph 1</p>
<p>
<a href="www.google.com">Google</a>
<a href="www.gizmodo.com">Gizmodo</a>
</p>
<ul>
<li> something is here with a download
<a href="www.google.com">link</a>
</li>
<li> hello
<img src="www.imagesource.com"/>
</li>
</ul>
How do I parse this data in sequence?
If I use var ParaTags = HtmlDocument.DocumentNode.Descendants("p");,
then I only get the "p" tags. I don't know how to get "h3" and then "p" in sequence, because "p" is not inside "h3".
The following code returns all hyperlinks:
var links =
    from paras in document.DocumentNode.Descendants("p")
    from hyperLinks in paras.Descendants("a").Where(x => x.GetAttributeValue("href", "") != "")
    select hyperLinks;
What's the best way to parse and store this mixed content of strings, hyperlinks, and images,
so I can output it later in an efficient way? A list, a stack?
In other words, I want to store all the content from the HTML and preserve its format where possible, so I can reassemble the content in the proper format once I reload it into the app.
Thank you!
If you want to extract all href and src attributes you may try this:
using System;
using System.Linq;
using HtmlAgilityPack;

public class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.Load("test.html");
        var links =
            from element in document.DocumentNode.Descendants()
            let href = element.Attributes["href"]
            let src = element.Attributes["src"]
            where href != null || src != null
            select href != null ? href.Value : src.Value;
        foreach (var link in links)
        {
            Console.WriteLine(link);
        }
    }
}
outputs:
www.google.com
www.gizmodo.com
www.google.com
www.imagesource.com
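For the ordering part of the question (getting "h3", then "p", and so on in sequence), one possible approach is to walk Descendants() once, since it yields nodes in document order, and record text, links, and images as they appear. The sketch below is an assumption about what such a structure could look like: the ContentItem type, its Kind/Value fields, and the Flatten helper are all invented for illustration, and a flat list does not preserve every aspect of the original formatting.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SequentialWalk
{
    // Hypothetical record for one piece of content, kept in document order.
    public sealed class ContentItem
    {
        public string Kind { get; set; }   // e.g. "text:h3", "link", "image"
        public string Value { get; set; }  // the text, href, or src
    }

    public static List<ContentItem> Flatten(HtmlNode root)
    {
        var items = new List<ContentItem>();
        foreach (var node in root.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Skip whitespace-only text nodes between tags.
                var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0)
                    items.Add(new ContentItem { Kind = "text:" + node.ParentNode.Name, Value = text });
            }
            else if (node.Name == "a")
            {
                items.Add(new ContentItem { Kind = "link", Value = node.GetAttributeValue("href", "") });
            }
            else if (node.Name == "img")
            {
                items.Add(new ContentItem { Kind = "image", Value = node.GetAttributeValue("src", "") });
            }
        }
        return items;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Small sample in the spirit of the question's markup.
        doc.LoadHtml("<h3> header </h3><p> paragraph 1</p><p><a href=\"www.google.com\">Google</a></p>");

        foreach (var item in Flatten(doc.DocumentNode))
        {
            Console.WriteLine(item.Kind + ": " + item.Value);
        }
    }
}
```

A List keeps insertion order, which matches the reading order of the page; a stack would reverse it, so a list (or a small tree, if nesting matters) is probably the more natural fit here.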