Scraping data using HTMLAgilityPack - c#

In HTMLAgailityPack, how to get the data from the website which is not coming in the innerhtml method of it. For example, if in the link below:
https://www.theice.com/productguide/ProductSpec.shtml?specId=1496#expiry
The table starting with contract symbol is not coming in the innerhtmltext. Please let me know how to get this table data through HTMLAgailityPack?
Regards

You need to send a GET request to https://www.theice.com/productguide/ProductSpec.shtml?expiryDates=&specId=1496&_=1342907196619
The content is being loaded dynamically via javascript. Perhaps you can parse the innerhtmltext to see what link the javascript will send the GET request to

If its not 'coming in the innerhtml' that would mean that its being put in there by a script. I'm not able to check this page myself so I'm not sure.
If its coming from a script, you can't get it very easily. You can play around viewing the javascript and maybe being able to read the data as its coming in.
Basically install Firebug on your browser, and look at the data transfers being made. Sometimes you're lucky, sometimes you're not.
Or you can take the simple method and use the winforms WebBrowser control, load it in it, let it run the script then scrape from there. Note that this will leak memory and GDI handles like crazy.

Pleae use this XPath to get that table you want //*[#id="right"]/div/table
e.g.
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[#id="right"]/div/table"));
string html = node.InnerHtml;

Related

How to update viewsource of the webpage dynamically in .net

There is one website named "www.localbanya.com", i wanted to grab the HTML information from that site, they list products, the structure of their display is:
First they display some around 8-10 products on page-load, and
later when user scrolls down it generates more products.
Now as this is happening based on javascript, i am not able to get the whole page source using WebClient.
I wanted to know is there any way i can update the page-source while using WebClient class in .net to retrieve whole page information or any other alternative i can use to get the whole page HTML information, at once.
You can refer this for reference localbanya product page
Any help will be a appreciated.
WebClient obviously doesn't run the javascript.
so you gonna need some sort of a headless browser to do it.
There are many options for it, though I don't know any C# or .NET implementation..
You may look into Phantom JS and other headless browsers which replicate what a normal browser does and you can write scripts for it.
Also refer to this question
Headless browser for C# (.NET)?
You can also run something like Fiddler to see what requests were made from the page when scrolling down, to reverse engineer how the data is retrieved, and replicate that with a WebClient if possible.
Hope this Helps.

Download specific Tag Source

I have some web pages with A lot of tags.i want to download source Page where it tag is Span and its className is Something.
It's possible that i just download a part of page(source code) not whole page?
I know that i can do it with webbrowser(for example navigate to my destination page and search for specific tag and get its source code)
But with it,i must first get whole page and after that get specific tag.
there is any way (for example: WebClient class) to download just my specific tag with specific ClassName source code?
No, the HTTP protocol doesn't have any facility to do what you need (the only thing one can do is get a certain Range, but that requires you to know exactly where the data is, so that doesn't seem to help), you will have to download the entire page and then parse what you need.
I'm afraid that you cannot download just parts of your page and you need to load the whole page first. But to make it maybe easier, you can parse the HTML in XML and then work with it which is a lot easier.

How can I scrape a page that contains data updated with JavaScript after page load?

Im trying to scrape a page. Everything is ok, but when values are updated, the sourse code of page is still the same for a minute. Even when i refresh a page with slow internet connection, first i see old data, and only after page gets fully loaded values are current.
I guess javascript updates them. BUt it still has to download them somehow.
How can i get current values?
I write my program in C#, but if you have some ideas/advices/examples language doesnt really matter.
Thank you.
You're right - javascript is probably updating the data after load.
I could think of three ways to handle this:
Use a webbrowser control - I guess your using the HttpWebRequest object to retrieve values from the site. This won't work if you need to let the javascript to run. You can use the webbrowser control, let the javascript run and retrieve values from the DOM. Only thing I don't like about this approach is it feels like a hack and probably too clunky for prod applications. You also need to know when to read the contents of the DOM (an update might be inprogress in the background). Google "C# WebBrowser Control Read DOM Programmatically" or you can read more about that here.
I personally prefer this over the previous but it doesn't work all the time. First you need to inspect the website from firebug or something and see which urls are called from the background. Say for example the site is updating stock quotes using javascript. Most likely, its using an asynchronous request to retrieving the updated information from a webservice. Using firebug, you can view this under NET>XHR. Now is the hard part. Well, take a look at the request and the values returned. The idea is, you can try to retrieve the values your self and parse the contents - which can be a lot easier than scraping a page. The problem is, you would need to do a bit of reverse engineering to get it right. You might also encounter problems with authentication and/or encryption.
Lastly and my most preferred solution is asking the owner [of the site your are scraping] directly.
I think the WebBrowser control approach is probably OK and doesn't depend on third party libraries. Here is what I intend to use and it solves the problem of waiting for the page to complete loading:
private string ReadPage(string Link)
{
using (var client = new WebClient())
{
this.wbrwPages.Navigate(Link);
while (this.wbrwPages.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}
ReadPage = this.wbrwPages.DocumentText;
}
}
I will get information out of the HTML through some form of DOM or XPath treatment. I am curious if others will have comments about entering the 'while' loop and depending upon the 'complete' state to get me out of it. I may put a timer of some sort in there as well - just to be safe.

Grab details from web page

I need to write a C# code for grabbing contents of a web page. Steps looks like following
Browse to login page
I have user name and a password, provide it programatically and login
Then you are in detail page
You have to get some information there, like (prodcut Id, Des, etc.)
Then need to click(by code) on Detail View
Then you can get the price for that product from there.
Now it is done, so we can write detail line into text file like this...
ABC Printer::225519::285.00
Please help me on this, (Even VB.Net Code is ok, I can convert it to C#)
The WatiN library is probably what you want, then. Basically, it controls a web browser (native support for IE and Firefox, I believe, though they may have added more since I last used it) and provides an easy syntax for programmatically interacting with page elements within that browser. All you'll need are the names and/or IDs of those elements, or some unique way to identify them on the page.
You should be able to achieve this using the WebRequest class to retrieve pages, and the HTML Agility Pack to extract elements from HTML source.
yea I downloaded that library. Nice one.
Thanks for sharing it with me. But I have a issue with that library. The site I want to get data is having a "captcha" on the login page.
I can enter that value if this can show image and wait for my input.
Can we achive that from this library, if you can like to have a sample.
You should be able to achieve this by using two classes in C#, HttpWebRequest (to request the web pages) and perhaps XmlTextReader (to parse the HTML/XML response).
If you do not wish to use XmlTextReader, then I'd advise looking into Regular Expressions, as they are fantastically useful for extracting information from large bodies of text where-in patterns exist.
How to: Send Data Using the WebRequest Class

WebCrawling Dynamic Links

Anybody has any idea on crawling websites that have dynamic pages/queries? I mean if I click a certain link, it has different values every I try to reload it in a web browser. Now my webcrawler could not download the contents of these pages. Please advise.
it would be the same way even it is dynamic or not. actually a crawler is only a mater of 3 things
The url
The data it sent to server if it is a POST Method then
The cookie if authentication is required
that's all,
the common problem when doing crawler:
Miss-guess of default page [index.html, index.php, default.aspx etc].. actually it will work without it for all method [POST/GET]
One of each field name is not written exactly
ASP.Net form viewstate id field (i forgot the name) but i can be achieve easily
Dynamic page generated by javascript. this one is the hardest part and the most cases even google still have problem about this.
hope that help.
You might want to look at this question which details how to write a crawler or look at the source code for http://searcharoo.net/ which contains a good crawler (see here).

Categories

Resources