Finding and extracting an email address using HtmlAgilityPack

Within HTML code, I am trying to extract an email address only if it is provided by the user. Here is a sample of the HTML:
<div class="header">
<div class="details">
<span>
<!-- Random description here which MAY contain an email address -->
</span>
</div>
</div>
I have managed to reach the <span> using HTML Agility Pack as follows:
var getWeb = new HtmlWeb();
var pageHtml = getWeb.Load("website here");
IEnumerable<string> listItemHtml = pageHtml.DocumentNode.SelectNodes(
@"//div[@class='header']
/div[@class='details']
/span").Select(span => span.InnerText);
My next challenge is to search through this text and check if an email has been provided, which I am unable to figure out. Could someone please help me with this?
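One possible way to do the search is a regular expression run over each InnerText string. The sketch below is only an illustration: the EmailFinder class and FindEmails method are hypothetical names, and the pattern is a deliberately simple approximation of an email address, not a full RFC 5322 matcher, so it may miss unusual addresses or accept some invalid ones.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmailFinder
{
    // Simplified pattern (an assumption): local part, "@", domain, dot, 2+ letter TLD.
    private static readonly Regex EmailPattern = new Regex(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        RegexOptions.Compiled);

    // Returns every distinct address found in the text (empty list if none).
    public static List<string> FindEmails(string text)
    {
        return EmailPattern.Matches(text ?? string.Empty)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

With the question's code, each string in listItemHtml could be passed to EmailFinder.FindEmails, and an empty result would mean the user did not provide an address. If the page encodes characters as HTML entities, decoding the text first (for example with System.Net.WebUtility.HtmlDecode) may be needed before matching.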

Related

How do I download the HTML code of the url with the images NOT being hidden

I am trying to do some web scraping, but when I download the HTML of the URL the images are hidden, even though they are visible in my browser: the downloaded markup contains "user-ad-row__image image image--is-hidden" instead of "user-ad-row__image image image--is-visible". I also tried WebClient to see whether it changed anything. I am using HtmlAgilityPack.
var url = "https://www.gumtree.com.au/s-motorcycles-scooters/wa/drz+400/k0c18322l3008845";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlAgilityPack.HtmlDocument();
WebClient client = new WebClient();
htmlDocument.LoadHtml(client.DownloadString(url));
<div class="user-ad-row__main-image-wrapper user-ad-row__main-image-wrapper--has-image"><img class="user-ad-row__image image image--is-hidden" src="" alt="Suzuki Drz-400E"></div>
</div>
<div class="user-ad-row__details">
<div class="user-ad-row__info">
<p class="user-ad-row__title">Suzuki Drz-400E</p>
<div class="user-ad-price user-ad-price--row"><span class="user-ad-price__price">$4,250</span>
<!-- -->
<!-- --><span class="user-ad-price__price-negotiable user-ad-price__price-negotiable--with-price">Negotiable</span>
<!-- -->
<!-- -->
<!-- -->
</div>
<ul class="user-ad-attributes">
<li class="user-ad-attributes__attribute">Learner Approved</li>
<li class="user-ad-attributes__attribute">6000 km</li>
</ul>
<p id="user-ad-desc-MAIN-1228533281" class="user-ad-row__description user-ad-row__description--regular">For sale 2008 Drz-400E excellent condition, well looked after starts first time evertime serviced about a month ago. Just paid 3 months rego. Call or text </p>
</div>
<div class="user-ad-row__extra-info">
<div class="user-ad-row__location"><span class="user-ad-row__location-area">Perth City Area</span>Perth<span class="user-ad-row__distance"> </span></div>
<p class="user-ad-row__age">15/09/2019</p>
</div>
</div>
<button id="" type="button" class="user-ad-row__watchlist-heart-wrapper watchlist-heart Button__buttonBase--3YR6h Button__button--2NsdC Button__buttonBasic--3CSBx" role=""><span class="" aria-hidden="true"><span class="icon-heart heart"></span></span>
</button>

The website you linked loads its images using JavaScript. HtmlAgilityPack only parses the HTML it is given and does not execute JavaScript, so the downloaded markup still shows the images as hidden.
Some solutions would be:
WebBrowser Class
It's kind of tricky if you want to mix it with the HtmlAgilityPack but provides decent performance.
Selenium
You can use Selenium plus a WebDriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible.
Javascript.Net
It allows you to run scripts using Chrome's V8 JavaScript engine. Near the bottom of the page there will be something like <script src="/latest/resources/react/app.full.something.js"></script>
If you are able to figure out how that loads then you should be able to get all of the images.

Retrieving specific URLs with HtmlAgilityPack C#

I'm currently attempting to use HtmlAgilityPack to extract specific links from an html page. I tried using plain C# to force my way in but that turned out to be a real pain. The links are all inside of <div> tags that all have the same class. Here's what I have:
HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);
//this should select only the <div> tags with the class acTrigger
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
//not sure how to dig further in to get the href values from each of the <a> tags
}
and the site's code looks along the lines of this:
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
There are a lot of links on this page, but the hrefs I need are in the <a> tags nested inside the <div class="acTrigger"> tags. It would be simple if each <a> had its own class, but unfortunately only the <div> tags have classes. What I need to do is grab each of those hrefs and store them so I can retrieve them later, go to each page, and retrieve more information from each page. I just need a nudge in the right direction to get over this hump, then I should be able to do the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find seem to extract all the URLs from a page, not specific ones. I just need a link to an example or documentation; any help is greatly appreciated.

You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way your HtmlNode is the <a> tag instead of the <div>.
To store the links you can use GetAttributeValue.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
// Get the value of the HREF attribute.
string hrefValue = node.GetAttributeValue( "href", string.Empty );
// Then store hrefValue for later.
}
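To store the links for later visits, one option is to collect them into a list and resolve each relative href against the site's base address. This is a minimal sketch under assumptions: the LinkCollector class, the CollectLinks method, and the baseUri parameter are hypothetical names, and the HTML is passed in as a string so the example does not depend on a live site. Note that SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkCollector
{
    // Collects the href of every <a> directly inside <div class="acTrigger">.
    public static List<Uri> CollectLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<Uri>();
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//div[@class='acTrigger']/a");
        if (anchors == null)
        {
            return links; // no matching nodes on this page
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            if (href.Length == 0)
            {
                continue; // skip anchors without an href
            }
            // Turn "/16014988/d/" into an absolute URL based on baseUri.
            links.Add(new Uri(baseUri, href));
        }
        return links;
    }
}
```

Each Uri in the returned list could then be handed to HtmlWeb.Load (via its AbsoluteUri string) to fetch the detail pages one at a time.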

HtmlAgilityPack (C#) can't read past hidden text

Using the following URL:
link to search results page
I am trying to first scrape the text from the a tag from this html that can be seen from the source code when viewed with Firebug:
<div id="search-results" class="search_results">
<div class="searchResultItem">
<div class="searchResultImage photo">
<h3 class="black">
<a class="linkmed " href="/content/1/2484243.html">加州旱象不减 开源节流声声急</a>
</h3>
<p class="resultPubDate">15.10.2014 06:08 </p>
<p class="resultText">
</div>
</div>
<p class="more-results">
But what I get back when I scrape the page is:
<div class="search_results" id="search-results">
<input type="hidden" name="ctl00$ctl00$cpAB$cp1$hidSearchType" id="hidSearchType">
</div>
<p class="more-results">
Is there any way to view the source the way Firebug does?

How are you scraping the page? Use something like Fiddler and check the request and the response for dynamic pages like these ones. The reason why Firebug sees more is because all of the dynamic elements have loaded already when you are viewing it in your browser, when in fact your scraping method is only one piece of the puzzle (the initial HTML).
Hint: For this search page, you will see that the request for the results data is actually a) a separate GET request with b) a long query string and c) a cookie on the header, which returns a JSON object containing the data. This is why the link you posted just gives me "undefined," because it does not contain the search data.

Parse a div with HTML Agility Pack

I have this HTML code:
<div class="singolo-contenuto link_azure">
<p><img src="" class="left pad2 field_foto" alt="" /><p> Message </p>
</div>
I need to capture "Message".
I'm trying with:
String message = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='singolo-contenuto link_azure']").InnerText;
but it doesn't work: I get a large part of the full page instead. What's wrong?

The XPath expression you have just gets you to the <div> tag. You need to get deeper into the last <p> tag. This will work:
var message = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='singolo-contenuto link_azure']//p[last()]").InnerText;

Getting value from string using specific conditions

I have HTML data in a string, from which I need to get only the paragraph values. Below is a sample of the HTML.
<html>
<head>
<title>
<script>
<div>
Some contents
</div>
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
<div>
Other html elements
</div>
So how can I get the data from the paragraphs using string manipulation?
Desired output:
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>

Give the div an ID, e.g.
<div id="test">
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
then use //div[@id='test']/p.
The solution broken down:
//div - All div elements
[@id='test'] - With an ID attribute whose value is test
/p - All p elements that are direct children of that div

I have used Html agility Pack for something like this. Then you can use LINQ to get what you want.

Xpath is the obvious answer (if the HTML is decent, has a root etc), failing that some third party widget like chilkat

If you use Html Agility Pack as mentioned in the other posts, you can get all paragraph elements in the html by using:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html string"); // LoadHtml parses a string; Load expects a file path or stream
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p");
Since you are using .net Framework 2.0, you would want an older version of Agility Pack, which can be found here: HTML Agility Pack
If you want just the text inside the paragraph, you can use
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p/text()");
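Putting those pieces together, a rough sketch might look like the following. The ParagraphReader class and ReadParagraphs method are hypothetical names, and the div is assumed to carry an id as suggested in the earlier answer. The code avoids LINQ and var so that it stays close to what older compilers accept, and it checks for null because SelectNodes returns null when nothing matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParagraphReader
{
    // Returns the trimmed text of each <p> directly inside the div with the given id.
    public static List<string> ReadParagraphs(string html, string divId)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> paragraphs = new List<string>();
        HtmlNodeCollection pNodes =
            doc.DocumentNode.SelectNodes("//div[@id='" + divId + "']/p");
        if (pNodes == null)
        {
            return paragraphs; // no such div, or it has no <p> children
        }

        foreach (HtmlNode p in pNodes)
        {
            paragraphs.Add(p.InnerText.Trim());
        }
        return paragraphs;
    }
}
```

Calling ParagraphReader.ReadParagraphs(html, "test") on the sample above would be expected to yield the three paragraph strings in document order.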
