HtmlAgilityPack: the parsed value differs - C#

I am trying to parse a web document using HtmlAgilityPack (C#).
Specifically, I am looking for the href value of an a tag.
The page I'm parsing is http://www.ntis.go.kr/ThRndGateList.do
The parse succeeds, but the value is slightly different.
I do not know why.
The actual value on the web page is as follows:
The value obtained through HtmlAgilityPack is as follows:
As you can see, the href value I get contains a strange part starting with "jsessionid". What is the reason?
Thank you.

It is probably because you are logged in in your browser (in your case Chrome). If you make the request via HtmlAgilityPack, you are like a freshly opened browser:
Not logged in
Never on this page before
The web application you're trying to use generates a JSESSIONID when someone opens the page for the first time and, since the client has not yet sent back a session cookie, this ID is transferred via the URL.
This question could help you understand the technology behind the web application: Under what conditions is a JSESSIONID created?
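If the goal is only to compare or store the links, one option is to strip the session parameter from each href after parsing. The following is a minimal sketch, assuming the ID appears as a ";jsessionid=..." path parameter; the class name, method name and sample path in the usage note are hypothetical and not taken from the site.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SessionIdCleaner
{
    // Matches ";jsessionid=<value>" up to the next '?', '#' or ';'.
    private static readonly Regex JSessionId =
        new Regex(@";jsessionid=[^?#;]*", RegexOptions.IgnoreCase);

    // Returns the href with any jsessionid path parameter removed.
    public static string StripJSessionId(string href)
    {
        if (href == null) throw new ArgumentNullException(nameof(href));
        return JSessionId.Replace(href, string.Empty);
    }
}
```

For example, SessionIdCleaner.StripJSessionId("/page.do;jsessionid=ABC123?page=2") yields "/page.do?page=2". Sending the session cookie with the request is another way to avoid the rewritten URLs, but that depends on how the server is configured.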

Related

C# programmatic communication with a website

I have the following problem.
Example site: http://eisk.apphb.com/web-form-samples/listing-page.aspx
My C# application has to read data from a GridView, but only for a specific supervisor, so I need to programmatically change the value in the drop-down list.
I am having trouble changing this value and getting the page with the updated data in the grid view.
Please help me solve this.
I did something related in the past using Selenium, but they have since changed the API and it's not as easy as it was before. (They merged Selenium with WebDriver.)
You can see more about that in here:
http://www.seleniumhq.org/docs/05_selenium_rc.jsp
You can also use WatiN to do the same.
http://www.codeproject.com/Articles/17064/WatiN-Web-Application-Testing-In-NET
You could use an HttpWebRequest to download the site content then use HtmlAgilityPack to parse the HTML and get the data.

Scraping data using HTMLAgilityPack

With HtmlAgilityPack, how can I get data from a website when that data does not appear in the InnerHtml it returns? For example, on the page below:
https://www.theice.com/productguide/ProductSpec.shtml?specId=1496#expiry
The table starting with "Contract Symbol" does not appear in the InnerHtml text. Please let me know how to get this table data through HtmlAgilityPack.
Regards
You need to send a GET request to https://www.theice.com/productguide/ProductSpec.shtml?expiryDates=&specId=1496&_=1342907196619
The content is being loaded dynamically via JavaScript. Perhaps you can parse the InnerHtml text to see which link the JavaScript will send the GET request to.
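A rough sketch of that approach: request the URL the script calls, then parse the response with HtmlAgilityPack. It assumes the endpoint returns an HTML fragment containing ordinary table rows, which has not been verified for this site; AjaxTableSketch, ExtractRows and DownloadRowsAsync are illustrative names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class AjaxTableSketch
{
    // Returns the text of every <th>/<td> cell, grouped by table row.
    public static List<string[]> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string[]>();
        var trNodes = doc.DocumentNode.SelectNodes("//tr");
        if (trNodes == null) return rows; // SelectNodes returns null when nothing matches

        foreach (HtmlNode tr in trNodes)
        {
            var cells = tr.SelectNodes("./th|./td");
            if (cells == null) continue;
            rows.Add(cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim()).ToArray());
        }
        return rows;
    }

    // Downloads the URL that the page's script would request (assumed to
    // return HTML) and extracts its table rows.
    public static async Task<List<string[]>> DownloadRowsAsync(HttpClient client, string ajaxUrl)
    {
        string html = await client.GetStringAsync(ajaxUrl);
        return ExtractRows(html);
    }
}
```

If the endpoint returns JSON or another format instead, only the download step carries over and the parsing would need to change.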
If it's not 'coming in the innerhtml', that would mean that it's being put in there by a script. I'm not able to check this page myself, so I'm not sure.
If it's coming from a script, you can't get it very easily. You can play around with viewing the JavaScript and may be able to read the data as it comes in.
Basically install Firebug on your browser, and look at the data transfers being made. Sometimes you're lucky, sometimes you're not.
Or you can take the simple method and use the WinForms WebBrowser control: load the page in it, let it run the script, then scrape from there. Note that this will leak memory and GDI handles like crazy.
Please use this XPath to get the table you want: //*[@id="right"]/div/table
e.g.
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='right']/div/table");
string html = node.InnerHtml;

Post data to server and then parse the HTML-code C#

I'm trying to parse a website. The only problem is that the site doesn't use a specific URL for the page I want to parse. The content is displayed on the same web page using JavaScript, so the content differs depending on the search query.
Is it possible to choose a value from a dropdown menu, post that to the server, and then parse the HTML code in C#?
Clarification: The code is returned in HTML.
I know the name of the option from the dropdown I want to post, but how do I do that from code-behind?
Most sites do not really generate HTML in JavaScript. Much more often you see ASP.NET sites where JavaScript is used for a postback (and the name of the dropdown is posted back in the __EVENTTARGET field).
Then you can do the same in your application: you have to imitate filling in the form and pass all the fields to the server, including __VIEWSTATE and __EVENTTARGET.
Having said that, it might be against the site's terms of use.
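Imitating such a postback could be sketched as follows. This assumes a standard ASP.NET Web Forms page whose state lives in hidden input fields; PostbackSketch, BuildPostbackFields, PostbackAsync and their parameters are hypothetical names, and the dropdown's field name must be read from the real page.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class PostbackSketch
{
    // Copies every hidden input (e.g. __VIEWSTATE) from the downloaded page,
    // then sets the dropdown value and names the dropdown as the event target.
    public static Dictionary<string, string> BuildPostbackFields(
        string pageHtml, string dropDownName, string selectedValue)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(pageHtml);

        var fields = new Dictionary<string, string>();
        var hiddenInputs = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
        if (hiddenInputs != null) // null when the page has no hidden inputs
        {
            foreach (HtmlNode input in hiddenInputs)
            {
                string name = input.GetAttributeValue("name", "");
                if (name.Length > 0)
                    fields[name] = HtmlEntity.DeEntitize(input.GetAttributeValue("value", ""));
            }
        }

        fields["__EVENTTARGET"] = dropDownName;
        fields[dropDownName] = selectedValue;
        return fields;
    }

    // Posts the fields back to the page URL and returns the resulting HTML.
    public static async Task<string> PostbackAsync(
        HttpClient client, string pageUrl, Dictionary<string, string> fields)
    {
        using (var content = new FormUrlEncodedContent(fields))
        using (HttpResponseMessage response = await client.PostAsync(pageUrl, content))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

The returned HTML can then be loaded into a new HtmlDocument and queried as usual. Some pages also expect the other visible form fields to be posted, so this may need extending for a particular site.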
You definitely need to check out Selenium; it does exactly what you need. It is commonly used as a testing framework. However, you can use it to manipulate HTML tags even when the website uses JavaScript.
Note: Selenium allows you to open and manipulate a website using a browser such as Firefox, Chrome, IE, etc. However, I think what you need here is to use the WebDriver, which manipulates the website without opening a browser. Most of my experience using Selenium is with Java, but I found multiple tutorials online for .NET too.

Grab details from web page

I need to write C# code to grab the contents of a web page. The steps look like the following:
Browse to the login page
I have a user name and a password; provide them programmatically and log in
Then you are on the detail page
You have to get some information there, like (product Id, Des, etc.)
Then I need to click (by code) on Detail View
Then you can get the price for that product from there.
Now it is done, so we can write a detail line into a text file like this...
ABC Printer::225519::285.00
Please help me with this. (Even VB.NET code is OK; I can convert it to C#.)
The WatiN library is probably what you want, then. Basically, it controls a web browser (native support for IE and Firefox, I believe, though they may have added more since I last used it) and provides an easy syntax for programmatically interacting with page elements within that browser. All you'll need are the names and/or IDs of those elements, or some unique way to identify them on the page.
You should be able to achieve this using the WebRequest class to retrieve pages, and the HTML Agility Pack to extract elements from HTML source.
Yes, I downloaded that library. Nice one.
Thanks for sharing it with me. But I have an issue with that library. The site I want to get data from has a "captcha" on the login page.
I can enter that value if it can show the image and wait for my input.
Can we achieve that with this library? If so, I would like to have a sample.
You should be able to achieve this by using two classes in C#, HttpWebRequest (to request the web pages) and perhaps XmlTextReader (to parse the HTML/XML response).
If you do not wish to use XmlTextReader, then I'd advise looking into regular expressions, as they are fantastically useful for extracting information from large bodies of text in which patterns exist.
How to: Send Data Using the WebRequest Class

WebCrawling Dynamic Links

Does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean, if I click a certain link, it has different values every time I try to reload it in a web browser. Right now my web crawler cannot download the contents of these pages. Please advise.
It works the same way whether the page is dynamic or not. A crawler is really only a matter of three things:
The URL
The data sent to the server, if it is a POST request
The cookie, if authentication is required
That's all.
Common problems when writing a crawler:
Misguessing the default page (index.html, index.php, default.aspx, etc.). Actually, it will work without it for all methods (POST/GET).
A field name is not written exactly
The ASP.NET form view state field (I forgot the name), but it can be obtained easily
Dynamic pages generated by JavaScript. This one is the hardest part, and in most cases even Google still has problems with it.
Hope that helps.
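The three pieces named above (URL, POST data, cookie) map fairly directly onto HttpClient with a shared CookieContainer. Here is a minimal sketch; CrawlerSession is an illustrative name, and a real login flow may need extra headers or hidden fields.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public sealed class CrawlerSession : IDisposable
{
    private readonly HttpClient _client;

    // Cookies set by the server (e.g. after a login POST) are stored here
    // and sent automatically with later requests.
    public CookieContainer Cookies { get; } = new CookieContainer();

    public CrawlerSession()
    {
        var handler = new HttpClientHandler { CookieContainer = Cookies, UseCookies = true };
        _client = new HttpClient(handler);
    }

    // 1. The URL: a plain GET.
    public Task<string> GetAsync(string url) => _client.GetStringAsync(url);

    // 2. The data sent to the server: a form-encoded POST.
    public async Task<string> PostAsync(string url, IDictionary<string, string> formFields)
    {
        using (var content = new FormUrlEncodedContent(formFields))
        using (HttpResponseMessage response = await _client.PostAsync(url, content))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }

    public void Dispose() => _client.Dispose();
}
```

A typical use would be to call PostAsync on the login URL first and then GetAsync on the protected pages, reusing the same CrawlerSession so the authentication cookie is kept.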
You might want to look at this question which details how to write a crawler or look at the source code for http://searcharoo.net/ which contains a good crawler (see here).
