Scrape a table from a website - C#

I want to scrape the table from this site: http://www.x-rates.com/table/?from=INR&amount=1
I want this table,
and I want to do this in a C# Windows application.
I use a web request and response, and it shows me the source code of the whole page.
How can I pick out that specific table?
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.x-rates.com/table/?from=INR&amount=1");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
richTextBox.Text = reader.ReadToEnd();

Use the HTML Agility Pack for C# to extract the table from your response. Because the data sits in a regular HTML table, you can select it directly from the parsed document.
The question "Parsing HTML Table in C#" may help.
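As a rough sketch of what that could look like, the helper below loads the downloaded HTML into the HTML Agility Pack and returns the cell text of every row in one table. The class name TableScraper, the method name ExtractRows and the XPath you pass in are assumptions for illustration; the right XPath for the x-rates page has to be found by inspecting its markup.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class TableScraper
{
    // Returns one list of cell texts per <tr> of the table matched by tableXPath.
    // tableXPath is supplied by the caller (hypothetical example: "//table").
    public static List<List<string>> ExtractRows(string html, string tableXPath)
    {
        var rows = new List<List<string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode table = doc.DocumentNode.SelectSingleNode(tableXPath);
        if (table == null) return rows;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trs = table.SelectNodes(".//tr");
        if (trs == null) return rows;

        foreach (HtmlNode tr in trs)
        {
            HtmlNodeCollection cells = tr.SelectNodes("./th|./td");
            if (cells == null) continue;
            rows.Add(cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim()).ToList());
        }
        return rows;
    }
}
```

You would pass it the string you currently put into the rich text box, for example TableScraper.ExtractRows(reader.ReadToEnd(), "//table").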

Related

How to scrape data from another website which is built in AngularJS?

I have to get some specific data from another web page which is built in AngularJS.
What I have done until now:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
WebResponse response = request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
string result = reader.ReadToEnd();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
It does not return the proper HTML. After some searching, I suppose the site returns 4 items, but the page source shows only one item with this {{item.name}} type of syntax.
How can I solve this issue?
If you use HttpWebRequest, it will only return the HTML template, which does not contain any data. Because of how Angular works, data binding happens later, in JavaScript.
I suggest using the WebBrowser control instead of HttpWebRequest for scraping. With WebBrowser you should be able to get the complete HTML after the $scope is initialized and the data is added to the DOM.
To learn more about how to use WebBrowser in ASP.NET, you can check this link.

Parse and extract a value from streaming XML file?

After much ado, I managed to create a restful service in asp.net MVC following Omar's brilliant Restful Asp.net article
Just one little thing remains.
My ASP.NET MVC controller returns an XML file, which has this tag:
<FileCode>24233224</FileCode>
This is the console application I use to send a GET request, which gives me the whole XML file:
//Generate get request
string url = "http://localhost:1193/Home/index?File=343456789012286";
HttpWebRequest GETRequest = (HttpWebRequest)WebRequest.Create(url);
GETRequest.Method = "GET";
GETRequest.ContentType = "text/xml";
GETRequest.Accept = "text/xml";
Console.WriteLine("Sending GET Request");
HttpWebResponse GETResponse = (HttpWebResponse)GETRequest.GetResponse();
Stream GETResponseStream = GETResponse.GetResponseStream();
StreamReader sr = new StreamReader(GETResponseStream);
Console.WriteLine("Response from Server");
// This writes whole file on screen
Console.WriteLine(sr.ReadToEnd());
I could perhaps save this file and then use LINQ to parse it, but can't I just get the value of my tag without saving it? I simply need the FileCode.
Thank you :)
You could employ the XPathReader (source download).
It comes with source and a test suite.
It gives you the ability to work with high-level query constructs (XPath) in streaming mode.
There is also a similar article on CodeProject: Fast screen scraping with XPath over a modified XmlTextReader and SgmlReader
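If pulling in XPathReader is more than you need, the built-in XmlReader can also read the response stream forward-only and stop at the first FileCode element. This is a different technique from the XPathReader suggested above, shown only as a minimal sketch; the class name FileCodeReader and the method name ReadFileCode are made up for the example.

```csharp
using System.IO;
using System.Xml;

public static class FileCodeReader
{
    // Reads forward through the XML stream and returns the text of the first
    // <FileCode> element, or null if there is none. Nothing is saved to disk.
    public static string ReadFileCode(Stream xml)
    {
        using (XmlReader reader = XmlReader.Create(xml))
        {
            if (reader.ReadToFollowing("FileCode"))
            {
                return reader.ReadElementContentAsString().Trim();
            }
        }
        return null;
    }
}
```

In the console application from the question, you would call FileCodeReader.ReadFileCode(GETResponseStream) instead of wrapping the stream in a StreamReader and printing everything.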

How to scrape data

I am trying to scrape data from this URL: http://icecat.biz/en/p/Coby/DP102/desc.htm
I want to scrape the specs table from that page.
But when I checked the page source, the specs table was not there; I think the table is loaded using Ajax.
How can I get that table? What needs to be done?
I used the following code:
string Strproducturl = "http://icecat.biz/en/p/Coby/DP102/desc.htm";
System.Net.ServicePointManager.Expect100Continue = false;
HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(Strproducturl);
httpWebRequest.KeepAlive = true;
ASCIIEncoding encoding = new ASCIIEncoding();
HttpWebResponse httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
Stream responseStream = httpWebResponse.GetResponseStream();
StreamReader streamReader = new StreamReader(responseStream);
string response = streamReader.ReadToEnd();
As IanNorton mentioned, you'll need to make your request to the URL that Icecat use to load the specs using AJAX. For the example link you provided, the specs details URL you'll need to request will be:
http://icecat.biz/index.cgi?ajax=productPage;product_id=1091664;language=en;request=feature
You can then work your way through the HTML response to get the spec details you require.
You mentioned in your comment that the scraping process is automated. The specs URL has a simple format; you just need the product ID. However, if you don't have the IDs, only a series of URLs like the example one in your original question, you'll need to get the product ID from the URL you have.
For example, the URL example you gave redirects to a different URL:
http://icecat.biz/p/coby/dp102/digital-photo-frames-0716829961025-dp-102-digital-photo-frame-1091664.html
This URL contains the product ID, right at the end.
You could do a HttpWebRequest to your original URL, stop before it does the redirect and catch the redirecting URL:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://icecat.biz/en/p/Coby/DP102/desc.htm");
request.AllowAutoRedirect = false;
request.KeepAlive = true;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Declared outside the if block so it is still in scope afterwards.
string redirectUrl = null;
if (response.StatusCode == HttpStatusCode.Redirect)
{
    redirectUrl = response.GetResponseHeader("Location");
}
Once you've got the redirectUrl variable, you can use Regex to get the ID then do another HttpWebRequest to the specs detail URL.
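A possible sketch of that Regex step follows. It assumes the redirect URL always ends in "-<digits>.html", as the example above does; the class and method names (IcecatUrls, ExtractProductId, BuildSpecsUrl) are invented for illustration.

```csharp
using System.Text.RegularExpressions;

public static class IcecatUrls
{
    // Returns the run of digits just before ".html" at the end of the URL,
    // or null if the URL does not follow that pattern.
    public static string ExtractProductId(string redirectUrl)
    {
        Match m = Regex.Match(redirectUrl, @"-(\d+)\.html$");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Builds the specs URL in the format quoted earlier in this answer.
    public static string BuildSpecsUrl(string productId)
    {
        return "http://icecat.biz/index.cgi?ajax=productPage;product_id="
               + productId + ";language=en;request=feature";
    }
}
```

With the redirect URL shown above, ExtractProductId gives "1091664", and BuildSpecsUrl("1091664") gives the specs URL you can pass to the second HttpWebRequest.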
I would suggest that you use a library like HtmlAgilityPack to select various elements from the html document.
I took a quick look at the link and noticed that the data is actually loaded using an additional Ajax request. You can use the following URL to get the Ajax data:
http://icecat.biz/index.cgi?ajax=productPage;product_id=1091664;language=en;request=feature
Then use HtmlAgilityPack to parse that data.
I know this is very old, but you could more easily just retrieve the XML from
https://openIcecat-xml:freeaccess@data.icecat.biz/export/freexml.int/EN/1091664.xml
You will get all the images and descriptions as well :-)

How to read the response from a web site?

I have a website url which gives corresponding city names by taking zip code as input parameter. Now I want to know how to read the response from the site.
This is the link I am using http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680
You'll have to use the HttpWebRequest object to connect to the site and scrape the information from the response.
Look for HTML tags or class names that wrap the content you are trying to find, then use either regexes or string functions to get the required data.
Good example here:
Try this (you'll need to include System.Text and System.Net):
WebClient client = new WebClient();
string url = "http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680";
Byte[] requestedHTML;
requestedHTML = client.DownloadData(url);
UTF8Encoding objUTF8 = new UTF8Encoding();
string html = objUTF8.GetString(requestedHTML);
Response.Write(html);
The simplest way is to use the lightweight WebClient class in the System.Net namespace. The following example code will just download the entire response as a string:
using (WebClient wc = new WebClient())
{
string response = wc.DownloadString("http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680");
}
However, if you require more control over the request and response process, you can use the more heavyweight HttpWebRequest class. For instance, you may want to deal with different status codes or headers. There's an example of using HttpWebRequest in the article How to use HttpWebRequest and HttpWebResponse in .NET on CodeProject.
Use the WebClient class (http://msdn.microsoft.com/en-us/library/system.net.webclient%28v=VS.100%29.aspx) to request the page and get the response as a string.
WebClient wc = new WebClient();
String s = wc.DownloadString(DestinationUrl);
You can search the response for specific HTML using String.IndexOf, Substring, etc., or regular expressions, or try something like the HTML Agility Pack (http://htmlagilitypack.codeplex.com/), which was created specifically to help parse HTML.
First of all, you'd better find a good web service for this purpose.
And this is an HttpWebRequest example:
HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create("http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680");
httpRequest.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse httpResponse = (HttpWebResponse)httpRequest.GetResponse();
Stream dataStream = httpResponse.GetResponseStream();
You need to use HttpWebRequest to receive the content and some tool to parse the HTML and find what you need. One of the most popular libraries for working with HTML in C# is HtmlAgilityPack; you can see a simple example here: http://www.fairnet.com/post/2010/08/28/Html-screen-scraping-with-HtmlAgilityPack-Library.aspx
You can use a WebClient object, and an easy way to scrape the data is with XPath.

Extract news links from news website

Is there any reliable method to find the collection of links that lead to the detailed news pages? In other words, after visiting the first page of a website, I want only those links that refer to a news item. Any solution?
If it is for one particular website, you could always fetch the HTML of the site and extract the links to the news articles using regular expressions. Just find pieces in the HTML that your code can use to identify where the links are.
I did this a couple of times to scrape some info from a website.
An obvious question, though: is there no RSS feed available on the website?
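To make the regular-expression idea a bit more concrete, here is a minimal sketch that collects the href of every anchor tag and keeps only those containing a marker string. The marker (for example "/news/") is an assumption you would pick after looking at how one particular site builds its article URLs, and the names NewsLinks and ExtractLinks are made up for the example. A regex like this only copes with simple, quoted href attributes.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NewsLinks
{
    // Matches <a ... href="..."> or <a ... href='...'> and captures the URL.
    private static readonly Regex Href = new Regex(
        @"<a\s[^>]*href\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns every href in the page that contains hrefMustContain.
    public static List<string> ExtractLinks(string html, string hrefMustContain)
    {
        var links = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            string href = m.Groups[1].Value;
            if (href.Contains(hrefMustContain))
            {
                links.Add(href);
            }
        }
        return links;
    }
}
```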
You can do a simple WebRequest and download a page and search through the html for the content that you want to parse.
WebRequest req = WebRequest.Create
("http://www.domain.com/news.html");
req.Proxy = null;
using (WebResponse res = req.GetResponse())
using (Stream s = res.GetResponseStream())
using (StreamReader sr = new StreamReader(s))
File.WriteAllText("news.html", sr.ReadToEnd());
//search through html page for news content.
System.Diagnostics.Process.Start("news.html");
