C# HtmlAgilityPack: web scrape an HTML node inside the first HTML node

I am trying to access these nodes
on this website:
http://bcres.paragonrels.com/publink/default.aspx?GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&Report=Yes
However, they appear to be in a secondary HTML document within the initial one,
and I am not sure how to access that secondary document and then parse through it for the nodes I need.
This is an example of one of the nodes:
<div style="top:219px;left:555px;width:45px;height:14px;" id="" class="mls29">2</div>
I am using HtmlAgilityPack and I receive null whenever I try to access the div.
I tried working my way down the nodes, but it didn't work.
Any help, or a pointer to where I can look up the information needed to figure this out, would be appreciated.
var webGet = new HtmlWeb();
var document = webGet.Load("http://bcres.paragonrels.com/publink/default.aspx?GUID=d27a1d95-623d-4f6a-9e49-e2e46ede136c&Report=Yes");
var divTags = document.DocumentNode.SelectNodes("/html");
var text = document.DocumentNode.InnerText;
MessageBox.Show(text);

You will be able to scrape the data if you access the following URL:
http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63
As an example, scraping the Presented By field:
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://bcres.paragonrels.com/publink/Report.aspx?outputtype=HTML&GUID=2033c143-cdf1-46b4-9aac-2e27371de22d&ListingID=262103824:0&Report=Yes&view=29&layout_id=63");
// CssSelect comes from the ScrapySharp package (using ScrapySharp.Extensions;)
var presentedBy = hd.DocumentNode.CssSelect(".mls23.at-phone-link");
if (presentedBy != null)
{
    Console.WriteLine(presentedBy.FirstOrDefault().InnerText);
}
Remarks:
I use the ScrapySharp NuGet package along with HtmlAgilityPack, so I can scrape using CSS selectors instead of XPath expressions - something I find easier to do.
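For reference, a plain-XPath equivalent of the CssSelect call above might look like the following sketch; the class names are taken from that selector, and the contains() tests assume both classes appear somewhere in the element's class attribute:
var presentedBy = hd.DocumentNode.SelectNodes("//*[contains(concat(' ', normalize-space(@class), ' '), ' mls23 ') and contains(concat(' ', normalize-space(@class), ' '), ' at-phone-link ')]");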
The URL you are scraping from is your problem. I am scraping from the last GET request that is performed after the page is loaded, which you can find by using the Firefox developer tools to analyze the site's traffic/network requests and responses.
I have not yet identified what triggers this HTTP request in the end (it may be JavaScript code, or it may be one of the frame HTML documents requested by the main, frame-enabled document).
If you only have a couple of URLs like this to scrape, then even extracting the correct URL manually is an option.

Related

Webclient.DownloadString() does not retrieve current whole page

I know there is another question with a practically identical title here: Webclient.DownloadString does not retrieve the whole page
But the solution there doesn't help me; maybe somebody else has the same problem.
I'm trying to get the HTML code of this URL:
https://cubebrush.co/?freebies=true
To achieve that, I'm using the following code in C#:
WebClient webClient = new WebClient();
string webString = webClient.DownloadString("https://cubebrush.co/?freebies=true");
But the retrieved HTML lacks some information - for example, all the button tags inside the website. This can be quickly checked with the HtmlAgilityPack library by collecting all the tag names in the document with the following code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(webString);
HashSet<string> hs = new HashSet<string>();
foreach (var dec in doc.DocumentNode.Descendants())
{
    hs.Add(dec.Name);
}
If we run this, it will show 26 tags, but none of them will be a button tag. This makes sense, since the initial webString also lacks that "button information".
I've tried copying webString into a file to check whether, as the linked post suggests, the problem was with the debugger visualizer, but it isn't: the visualizer and the file look exactly the same.
Can somebody tell me what I'm doing wrong? Thanks!

HTMLAgilityPack load AJAX content for scraping

I'm trying to scrape a webpage using HtmlAgilityPack in a C# WebForms project.
All the solutions I've seen for doing this use a WebBrowser control. However, from what I can determine, this is only available in WinForms projects.
At present I'm calling the required page via this code:
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(inputUri);
HtmlAgilityPack.HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class=\"nav\"]");
An example bit of code that I've seen suggesting the WebBrowser control:
if (this.webBrowser1.Document.GetElementsByTagName("html")[0] != null)
    _htmlAgilityPackDocument.LoadHtml(this.webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml);
Any suggestions or pointers as to how to grab the page once the AJAX content has loaded will be appreciated.
It seems that with HtmlAgilityPack it is only possible to scrape content that is delivered in the HTML itself, so anything loaded via AJAX will not be visible to it.
Perhaps the easiest option, where feasible, is to use a browser-based tool such as Firebug to determine the source of the data loaded by AJAX, and then request that source directly. An added advantage of this can be the ability to scrape a larger dataset.
I struggled all day to get this right, so here is a FedEx tracking example of what the accepted answer is referring to (I think):
Dim body As String
body = "data={""TrackPackagesRequest"":{""appType"":""WTRK"",""appDeviceType"":""DESKTOP"",""supportHTML"":true,""supportCurrentLocation"":true,""uniqueKey"":"""",""processingParameters"":{},""trackingInfoList"":[{""trackNumberInfo"":{""trackingNumber"":" & Chr(34) & "YOUR TRACKING NUMBER HERE" & Chr(34) & ",""trackingQualifier"":"""",""trackingCarrier"":""""}}]}}"
body = body & "&action=trackpackages&locale=en_US&version=1&format=json"
With CreateObject("MSXML2.XMLHTTP")
    .Open("POST", "https://www.fedex.com/trackingCal/track", False)
    .setRequestHeader("Referer", "https://www.fedex.com/apps/fedextrack/?tracknumbers=YOUR TRACKING NUMBER HERE")
    .setRequestHeader("User-Agent", "Mozilla/5.0")
    .setRequestHeader("X-Requested-With", "XMLHttpRequest")
    .setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
    .send(body)
    Dim Reply = .responseText
End With
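For a C# project, a rough sketch of the same request using HttpClient might look like this; the endpoint, headers and form body are copied from the VB example above, and the tracking number placeholder is something you would substitute yourself:
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class FedExTrackingSketch
{
    static async Task Main()
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0");
        client.DefaultRequestHeaders.Add("X-Requested-With", "XMLHttpRequest");
        client.DefaultRequestHeaders.Add("Referer", "https://www.fedex.com/apps/fedextrack/?tracknumbers=YOUR TRACKING NUMBER HERE");

        // Same form-encoded body as in the VB example above.
        string body = @"data={""TrackPackagesRequest"":{""appType"":""WTRK"",""appDeviceType"":""DESKTOP"",""supportHTML"":true,""supportCurrentLocation"":true,""uniqueKey"":"""",""processingParameters"":{},""trackingInfoList"":[{""trackNumberInfo"":{""trackingNumber"":""YOUR TRACKING NUMBER HERE"",""trackingQualifier"":"""",""trackingCarrier"":""""}}]}}"
            + "&action=trackpackages&locale=en_US&version=1&format=json";

        var content = new StringContent(body, Encoding.UTF8, "application/x-www-form-urlencoded");
        var response = await client.PostAsync("https://www.fedex.com/trackingCal/track", content);
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}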
Alternatively, have you considered building a browser into your application using CefSharp and then using DevTools through the .NET interface?
You may have noticed that even dynamically AJAX/JS-generated HTML can be found using, for example, the Inspect Element option in Firefox, so that markup does reach your computer even if you can't get at it using traditional HTML scraping methods.
Another option to consider.
https://cefsharp.github.io/
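As a rough illustration of that idea, here is a minimal sketch using the CefSharp.OffScreen package to render the page (including any AJAX-generated markup) and then hand the resulting HTML to HtmlAgilityPack; the URL, the extra delay and the selector are placeholders, not part of the original answer:
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;
using HtmlAgilityPack;

class CefScrapeSketch
{
    static async Task Main()
    {
        Cef.Initialize(new CefSettings());

        using (var browser = new ChromiumWebBrowser("http://example.com/ajax-page"))
        {
            // Wait until the browser reports that loading has finished.
            var loaded = new TaskCompletionSource<bool>();
            browser.LoadingStateChanged += (s, e) =>
            {
                if (!e.IsLoading)
                    loaded.TrySetResult(true);
            };
            await loaded.Task;

            // Give client-side scripts a little extra time to populate the page.
            await Task.Delay(2000);

            // Grab the rendered DOM and parse it with HtmlAgilityPack as usual.
            string html = await browser.GetSourceAsync();
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"nav\"]");
        }

        Cef.Shutdown();
    }
}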

text returning as NULL using htmlagility pack + xpath

I'm currently playing around with HtmlAgilityPack; however, I don't seem to be getting any data back from the following URL:
http://cloud.tfl.gov.uk/TrackerNet/LineStatus
This is the code i'm using:
var url = @"http://cloud.tfl.gov.uk/TrackerNet/LineStatus";
var webGet = new HtmlWeb();
var doc = webGet.Load(url);
However, when I check the contents of 'doc', the text value is set to null. I've tried other URLs and I receive the HTML used on those sites. Is it just this particular URL, or am I doing something wrong? Any help would be appreciated.
HtmlAgilityPack is an HTML parser, so you won't have much success trying to parse a non-HTML document such as the XML that this URL returns.
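If the goal is simply to read that feed, a small sketch using LINQ to XML instead of HtmlAgilityPack might look like this; the element and attribute names are assumptions about the TrackerNet response, not something verified against the live feed:
using System;
using System.Xml.Linq;

class LineStatusSketch
{
    static void Main()
    {
        // Load the XML feed directly; no HTML parsing is involved.
        XDocument doc = XDocument.Load("http://cloud.tfl.gov.uk/TrackerNet/LineStatus");
        XNamespace ns = doc.Root.GetDefaultNamespace();

        // Assumed structure: <LineStatus> elements with <Line Name="..."/> and <Status Description="..."/> children.
        foreach (var status in doc.Descendants(ns + "LineStatus"))
        {
            string line = (string)status.Element(ns + "Line")?.Attribute("Name");
            string description = (string)status.Element(ns + "Status")?.Attribute("Description");
            Console.WriteLine(line + ": " + description);
        }
    }
}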

html parsing problem using C#

From here, I am trying to get stock quote data at ten-minute intervals.
I used WebClient to download the page content and regular expressions for parsing. It works fine for other URLs, but for this particular URL my parsing code does not work.
I think the problem is JavaScript: when I load the page in a browser, the page content loads first and then it takes some extra time to plot the data, so the site is probably using some client-side script for this page. Can anyone help me, please?
HTML Agility Pack will save you tons of headaches. Try it instead of using regexps to parse HTML.
For what it's worth, in the page you link to the quote data is indeed in JavaScript code - check http://www.nseindia.com/js/getquotedata.js and http://www.nseindia.com/js/quote_data.js
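Following that observation, one option is to skip the rendered page entirely and download those data scripts directly, then parse whatever format they contain; a minimal sketch using the URLs mentioned above:
using System;
using System.Net;

class QuoteDataSketch
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Fetch the script that carries the quote data instead of scraping the rendered HTML page.
            string quoteData = client.DownloadString("http://www.nseindia.com/js/quote_data.js");
            Console.WriteLine(quoteData);
        }
    }
}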
As per @Vinko Vrsalovic's answer, Html Agility Pack is your friend. Here is a sample:
WebClient client = new WebClient();
string source = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(source);
// Select every node that carries an href attribute.
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//*[@href]");
foreach (HtmlNode node in nodes)
{
    if (node.Attributes.Contains("class"))
    {
        if (node.Attributes["class"].Value.Contains("StockData"))
        {
            // Here is our info
        }
    }
}

Sending a web page by email programmatically. Image URLs not resolved

I am writing a SharePoint timer job, which needs to pull the content of a web page, and send that HTML as an email.
I am using HttpWebRequest and HttpWebResponse objects to pull the content.
The emailing functionality works fine except for one problem.
The web page which serves up the content of my email contains images.
When the HTML of the page is sent as an email, the image URLs inside the HTML are all relative; they are not resolved to absolute URLs.
How do I resolve the image URLs to their absolute paths inside the web page content?
Is there any straightforward way to do this? I don't want to run a regex over the HTML to replace all relative URLs with absolute URLs.
Try adding a base element to the head of the HTML document you retrieve. As its href attribute, use the URL of the page you are retrieving.
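A minimal sketch of that idea using HtmlAgilityPack; the html and pageUrl variables are placeholders for the content you pulled with HttpWebRequest and the address it came from:
var doc = new HtmlDocument();
doc.LoadHtml(html);

var head = doc.DocumentNode.SelectSingleNode("//head");
if (head != null)
{
    // <base href="..."> makes relative URLs in the mail resolve against the original page.
    var baseTag = doc.CreateElement("base");
    baseTag.SetAttributeValue("href", pageUrl);
    head.PrependChild(baseTag);
}

string htmlWithBase = doc.DocumentNode.OuterHtml;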
I found this cool CodePlex tool called HtmlAgilityPack.
http://www.codeplex.com/htmlagilitypack
Using this API, we can parse HTML much like we parse XML documents, and we can also query and search nodes using XPath.
I used the following code snippet to fix the image URLs:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlMessage);
// This selects all the image nodes
HtmlNodeCollection hrefNodes = htmlDoc.DocumentNode.SelectNodes("//img");
foreach (HtmlNode node in hrefNodes)
{
    string imgUrl = node.Attributes["src"].Value;
    node.Attributes["src"].Value = webAppUrl + imgUrl;
}
StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
htmlDoc.OptionOutputAsXml = false;
htmlDoc.Save(sw);
htmlMessage = sb.ToString();
I've run into this problem a few times, and I don't think there is any magic-wand method out there that does it all for you. HtmlAgilityPack does a good job of aggregating the content you need, but you will have to decipher it yourself. For example, getting the list of HtmlNodes matching "//img" could return any of the following items:
<img src="http://www.adg2435.com/pictures/pic.jpg"/> //absolute url
<img src="coolpicture.jpg"/> //relative to the page
<img src="pictures/pic.jpg"/>
<img src="./pictures/pic.jpg"/>
It is up to you to figure out which types of links are going to show up on the given webpage.
You also need to account for things like this (you may want to truncate the image URL after the ".jpg" extension):
<img src="/pictures/pic.jpg?45823593&xyz=95325235r0634945823ot49140200"/>
So, I find it handy to keep a few things on hand at any given time:
The source URL for the entire page
The domain of the given URL (to answer questions like "does the given src contain the domain?")
This is how you would get the domain of the source link:
Uri domainUri = new Uri(fullUrl);
string domainUrl = domainUri.GetLeftPart(UriPartial.Authority);
Potentially, you may want the subdomain (i.e. "http://www.mysite.com/pictures/")
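One way to handle all of the relative forms listed above without hand-rolled string checks is to let System.Uri resolve them against the page's address; a small sketch, where pageUrl is the page being scraped and src is a value taken from an img tag:
var baseUri = new Uri(pageUrl);

// Resolves absolute, page-relative, root-relative and "./" style src values alike.
var resolved = new Uri(baseUri, src);

// Drop any query string (e.g. "pic.jpg?45823593&...") if only the file itself matters.
string withoutQuery = resolved.GetLeftPart(UriPartial.Path);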
"I don't want to run a regex over the HTML code to replace all relative URLs with absolute URLs."
Too bad, because that's the only way you'll get the images to show up. Would you rather download all the images and embed them in the email too?
