Extract news links from a news website - C#

Is there a reliable way to find the collection of links that lead to detail news pages? In other words, after visiting the front page of a website, I just want the links that refer to news items. Any solution?

If it is for one specific website, you could always fetch the site's HTML and extract the links to the news articles using regular expressions. Just find patterns in the HTML that your code can use to identify where the links are.
I did this a couple of times to scrape some info from a website.
But maybe an obvious question: is there no RSS feed available on the website?
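If there is no feed, here is a minimal sketch of the regex approach described above. The "news/" path filter is an assumption; substitute whatever pattern marks article links on your target site.
// Minimal sketch: download the front page and pull out anchors whose href
// matches a site-specific pattern. "news/" is a placeholder assumption.
using System;
using System.Net;
using System.Text.RegularExpressions;

class NewsLinkExtractor
{
    static void Main()
    {
        string html;
        using (var wc = new WebClient())
            html = wc.DownloadString("http://www.domain.com/news.html");

        // Capture the href of every <a> tag whose URL contains "news/"
        var matches = Regex.Matches(html,
            @"<a[^>]+href=[""']([^""']*news/[^""']*)[""']",
            RegexOptions.IgnoreCase);

        foreach (Match m in matches)
            Console.WriteLine(m.Groups[1].Value);
    }
}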

You can do a simple WebRequest to download a page and then search through the HTML for the content you want to parse.
// Download the page to a local file, then inspect it for news content.
WebRequest req = WebRequest.Create("http://www.domain.com/news.html");
req.Proxy = null; // skip proxy auto-detection to speed up the request
using (WebResponse res = req.GetResponse())
using (Stream s = res.GetResponseStream())
using (StreamReader sr = new StreamReader(s))
    File.WriteAllText("news.html", sr.ReadToEnd());
// search through the saved HTML page for news content
System.Diagnostics.Process.Start("news.html");

Related

How to scrape data from another website which is built in AngularJS?

I have to get some specific data from another web page which is built in AngularJS.
What I have done until now:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
WebResponse response = request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
string result = reader.ReadToEnd();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
It's not returning the rendered HTML. I suppose (after searching) that the site is returning 4 items, but the page source shows only one item with {{item.name}}-style template syntax.
How can I solve this issue?
If you use HttpWebRequest, it will just return the HTML template; it will not contain any data. Due to the nature of Angular, data binding happens later on using JavaScript.
I suggest you use the WebBrowser control instead of HttpWebRequest for data scraping. Using WebBrowser you should be able to get the complete HTML after the $scope is initialized and the data has been added to the DOM.
To learn more about how to use WebBrowser in ASP.NET you can check this link
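As a rough sketch of the WebBrowser route (it is a WinForms control, so it needs an STA thread and a message loop), something like the following might work. The assumption that the Angular data is ready when DocumentCompleted fires may not hold for slower pages, in which case you would poll or add a delay; the URL is a placeholder.
// Hedged sketch: render the page in a WebBrowser control, then read the
// DOM after scripts have run. DocumentText would only give the template,
// so read the rendered body instead.
using System;
using System.Windows.Forms;

class AngularScraper
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            string renderedHtml = browser.Document.Body.OuterHtml;
            Console.WriteLine(renderedHtml);
            Application.ExitThread();
        };
        browser.Navigate("http://example.com/angular-page"); // placeholder URL
        Application.Run(); // pump messages so the control can load the page
    }
}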

scrape table from web site

I want to scrape a table from this site: http://www.x-rates.com/table/?from=INR&amount=1
I want this table,
and I want to do this in a C# Windows application.
I used WebRequest/WebResponse, but it shows me the whole page's source code.
How can I pick out that specific table?
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.x-rates.com/table/?from=INR&amount=1");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
richTextBox.Text = reader.ReadToEnd();
Use the C# HTML Agility Pack to extract the HTML table from your response. It's a table, and you can easily extract it from the HTML.
Parsing HTML Table in C#
Maybe the above link will help you.
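To make that concrete, here is a hedged sketch using the HTML Agility Pack. The XPath assumes the rates table carries a class containing "ratesTable"; inspect the live page source to confirm the real class or id before relying on it.
// Load the page with HtmlAgilityPack and dump the rows of the rates table.
// The class name in the XPath is an assumption, not verified against the site.
using System;
using System.Linq;
using HtmlAgilityPack;

class RatesTableScraper
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.x-rates.com/table/?from=INR&amount=1");

        HtmlNode table = doc.DocumentNode.SelectSingleNode(
            "//table[contains(@class, 'ratesTable')]");
        if (table == null) return; // selector guess was wrong; adjust it

        var rows = table.SelectNodes(".//tr");
        if (rows == null) return;
        foreach (HtmlNode row in rows)
        {
            var cells = row.SelectNodes("th|td");
            if (cells == null) continue;
            Console.WriteLine(string.Join("\t", cells.Select(c => c.InnerText.Trim())));
        }
    }
}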

How to scrape data

I am trying to scrape data from this URL: http://icecat.biz/en/p/Coby/DP102/desc.htm
I want to scrape the specs table from that page.
But I checked the page source and the spec table is not there; I think the table is loaded using Ajax.
How can I get that table? What needs to be done?
I used the following code:
// Plain GET request; this only returns the static page, not the Ajax data.
string Strproducturl = "http://icecat.biz/en/p/Coby/DP102/desc.htm";
System.Net.ServicePointManager.Expect100Continue = false;
HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(Strproducturl);
httpWebRequest.KeepAlive = true;
HttpWebResponse httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
Stream responseStream = httpWebResponse.GetResponseStream();
StreamReader streamReader = new StreamReader(responseStream);
string response = streamReader.ReadToEnd();
As IanNorton mentioned, you'll need to make your request to the URL that Icecat uses to load the specs via AJAX. For the example link you provided, the specs details URL you'll need to request is:
http://icecat.biz/index.cgi?ajax=productPage;product_id=1091664;language=en;request=feature
You can then work your way through the HTML response to get the spec details you require.
You mentioned in your comment that the scraping process is automated. The specs URL has a basic format; you just need the product ID. However, if you don't have the IDs, just a series of URLs like the example in your original question, you'll need to get the product ID from the URL you have.
For example, the URL you gave redirects to a different URL:
http://icecat.biz/p/coby/dp102/digital-photo-frames-0716829961025-dp-102-digital-photo-frame-1091664.html
This URL contains the product ID, right at the end.
You could make an HttpWebRequest to your original URL, stop it before it follows the redirect, and catch the redirect URL:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://icecat.biz/en/p/Coby/DP102/desc.htm");
request.AllowAutoRedirect = false; // stop before the redirect is followed
request.KeepAlive = true;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.Redirect)
{
    string redirectUrl = response.GetResponseHeader("Location");
}
Once you've got the redirectUrl variable, you can use a Regex to get the ID and then make another HttpWebRequest to the specs detail URL.
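Continuing from the redirectUrl variable above, a hedged sketch of that ID extraction, assuming the product ID is always the trailing number before ".html" as in the example URL:
// Pull the trailing numeric product ID out of the redirect URL
// (e.g. "...-digital-photo-frame-1091664.html") and build the Ajax specs URL.
using System.Text.RegularExpressions;

Match m = Regex.Match(redirectUrl, @"-(\d+)\.html$");
if (m.Success)
{
    string productId = m.Groups[1].Value; // "1091664" for the example URL
    string specsUrl = "http://icecat.biz/index.cgi?ajax=productPage;product_id="
                      + productId + ";language=en;request=feature";
}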
I would suggest that you use a library like HtmlAgilityPack to select the various elements from the HTML document.
I took a quick look at the link and noticed that the data is actually loaded using an additional Ajax request. You can use the following URL to get the Ajax data:
http://icecat.biz/index.cgi?ajax=productPage;product_id=1091664;language=en;request=feature
Then use HtmlAgilityPack to parse that data.
I know this is very old, but you could more easily just retrieve the XML from
https://openIcecat-xml:freeaccess@data.icecat.biz/export/freexml.int/EN/1091664.xml
You will get all the images and descriptions as well :-)
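A hedged sketch of reading that open Icecat XML with LINQ to XML; the ProductFeature element and Presentation_Value attribute names are guesses based on typical Icecat exports, so check the real payload before relying on them.
// Download the product XML with the free-access credentials quoted above
// and print feature values. Element/attribute names are assumptions.
using System;
using System.Net;
using System.Xml.Linq;

class IcecatXml
{
    static void Main()
    {
        using (var wc = new WebClient())
        {
            wc.Credentials = new NetworkCredential("openIcecat-xml", "freeaccess");
            string xml = wc.DownloadString(
                "https://data.icecat.biz/export/freexml.int/EN/1091664.xml");
            XDocument doc = XDocument.Parse(xml);
            foreach (XElement feature in doc.Descendants("ProductFeature"))
                Console.WriteLine((string)feature.Attribute("Presentation_Value"));
        }
    }
}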

how to read the response from a web site?

I have a website URL that returns the corresponding city names when given a ZIP code as an input parameter. Now I want to know how to read the response from the site.
This is the link I am using: http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680
You'll have to use the HttpWebRequest object to connect to the site and scrape the information from the response.
Look for HTML tags or class names that wrap the content you are trying to find, then use either regexes or string functions to get the required data.
Good example here:
Try this (you'll need to include System.Text and System.Net):
WebClient client = new WebClient();
string url = "http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680";
byte[] requestedHTML = client.DownloadData(url);
UTF8Encoding objUTF8 = new UTF8Encoding();
string html = objUTF8.GetString(requestedHTML);
Response.Write(html); // ASP.NET context; use Console.WriteLine elsewhere
The simplest way is to use the lightweight WebClient class in the System.Net namespace. The following example code will just download the entire response as a string:
using (WebClient wc = new WebClient())
{
string response = wc.DownloadString("http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680");
}
However, if you require more control over the request and response process, then you can use the heavier-weight HttpWebRequest class. For instance, you may want to deal with different status codes or headers. There's an example of using HttpWebRequest in the article How to use HttpWebRequest and HttpWebResponse in .NET on CodeProject.
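As a small hedged sketch of that heavier route, checking the status code and a header before reading the body:
// Make the request by hand so the status code and headers can be inspected
// before the body is read.
using System;
using System.IO;
using System.Net;

class ZipLookup
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create(
            "http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680");
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            if (response.StatusCode != HttpStatusCode.OK)
                return; // bail out on anything unexpected
            Console.WriteLine(response.ContentType); // headers are available here
            using (var reader = new StreamReader(response.GetResponseStream()))
                Console.WriteLine(reader.ReadToEnd());
        }
    }
}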
Use the WebClient class (http://msdn.microsoft.com/en-us/library/system.net.webclient%28v=VS.100%29.aspx) to request the page and get the response as a string:
WebClient wc = new WebClient();
String s = wc.DownloadString(DestinationUrl);
You can search the response for specific HTML using String.IndexOf, Substring, etc., regular expressions, or try something like the HTML Agility Pack (http://htmlagilitypack.codeplex.com/), which was created specifically to help parse HTML.
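Continuing from the two lines above (the string s holds the downloaded page), a quick-and-dirty String.IndexOf/Substring version might look like this; the "<td" anchor is only an assumption about where the city name sits in the zipinfo markup, and this kind of extraction is fragile compared to a real parser:
// Naive string extraction: find the first table cell and strip its tags.
int start = s.IndexOf("<td", StringComparison.OrdinalIgnoreCase);
int end = start >= 0 ? s.IndexOf("</td>", start, StringComparison.OrdinalIgnoreCase) : -1;
if (start >= 0 && end > start)
{
    string cell = s.Substring(start, end - start);
    string text = cell.Substring(cell.IndexOf('>') + 1); // drop the opening tag
}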
First of all, you'd be better off finding a proper web service for this purpose.
That said, here is an HttpWebRequest example:
HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create("http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680");
httpRequest.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse httpResponse = (HttpWebResponse)httpRequest.GetResponse();
Stream dataStream = httpResponse.GetResponseStream();
// read the stream to get the HTML as a string
string html = new StreamReader(dataStream).ReadToEnd();
You need to use HttpWebRequest to receive the content and some tool to parse the HTML and find what you need. One of the most popular libraries for working with HTML in C# is HtmlAgilityPack; you can see a simple example here: http://www.fairnet.com/post/2010/08/28/Html-screen-scraping-with-HtmlAgilityPack-Library.aspx
You can use a WebClient object, and an easy way to scrape the data is with XPath.
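Putting those two suggestions together, here is a hedged sketch that loads the zipinfo response into HtmlAgilityPack and queries it with XPath; the //td expression is a guess at the result-table layout, not verified against the live page:
// Download the page with WebClient, parse it, and query cells with XPath.
using System;
using System.Net;
using HtmlAgilityPack;

class ZipXPath
{
    static void Main()
    {
        string html;
        using (var wc = new WebClient())
            html = wc.DownloadString("http://zipinfo.com/cgi-local/zipsrch.exe?zip=60680");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null) return; // layout guess was wrong; adjust the XPath
        foreach (HtmlNode cell in cells)
            Console.WriteLine(cell.InnerText.Trim());
    }
}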

page posting issue when working in Screen Scraping

I am working on screen scraping and have done it successfully on 3 websites, but I have an issue with the last one.
Here is my URL. When I hit it with my parameter, it shows the result on the next page; it simply posts to another page and shows the result fine there.
Here is My Test
However, when I request it from my application, since I don't have an option to post, it only fetches the HTML of the requested page, that is, the HTML test link mentioned above, which actually has the parameter in the URL to get the result.
How can I handle this situation?
Please give me a hint.
Thanks
Here is my C# code; I am using HtmlAgilityPack:
string url = "http://mysampleURL";
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(url);
Use the WebClient class to post the form of the first page with the expected input values. The input values can be found in the source of the first page, but it's also possible to capture them using Fiddler, which is, imho, a great tool for these scenarios.
Example:
NameValueCollection values = new NameValueCollection();
values.Add("action","hotelPackageWizard#searchHotelOnly");
values.Add("packageType","HOTEL_ONLY");
// etc..
WebClient webclient = new WebClient();
webclient.Headers.Add("Content-Type","application/x-www-form-urlencoded");
byte[] responseArray = webclient.UploadValues("http://www.expedia.com/Hotels?rfrr=-905&","POST", values);
string response = System.Text.Encoding.ASCII.GetString(responseArray);
If the resource requires a POST, then you must submit a POST.
This is a fairly simple task. Here is an example from Rick Strahl's blog. The code is a bit rustic but works and will get you heading in the right direction:
string lcUrl = "http://www.west-wind.com/testpage.wwd";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);

// *** Send any POST data
string lcPostData =
    "Name=" + HttpUtility.UrlEncode("Rick Strahl") +
    "&Company=" + HttpUtility.UrlEncode("West Wind ");
loHttp.Method = "POST";
byte[] lbPostBuffer = System.Text.Encoding.GetEncoding(1252).GetBytes(lcPostData);
loHttp.ContentLength = lbPostBuffer.Length;
Stream loPostData = loHttp.GetRequestStream();
loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length);
loPostData.Close();

HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();
Encoding enc = System.Text.Encoding.GetEncoding(1252);
StreamReader loResponseStream = new StreamReader(loWebResponse.GetResponseStream(), enc);
string lcHtml = loResponseStream.ReadToEnd();
loWebResponse.Close();
loResponseStream.Close();
For screen-scraping tasks that involve posting forms such as log-ins, maintaining cookies, and taking care of XSRF tokens, one solution is to use cURL. But it is not easy.
I then explored Selenium and I love it. There are 2 things: 1) install Selenium IDE (works only in Firefox); 2) install Selenium RC Server.
After starting Selenium IDE, go to the site that you are trying to automate and start recording the events you perform on the site. Think of it as recording a macro in the browser. Afterwards, you get the code output for the language you want, as in the sketch below.
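For a sense of what that exported C# looks like, here is a rough sketch against the old Selenium RC client driver; the site URL and element locators are placeholders, not taken from the question:
// Rough shape of a Selenium RC script as the IDE would export it for C#.
// Requires the Selenium RC server running on localhost:4444.
using Selenium;

class RecordedScript
{
    static void Main()
    {
        ISelenium selenium = new DefaultSelenium(
            "localhost", 4444, "*firefox", "http://www.example.com/");
        selenium.Start();                       // connects to the RC server
        selenium.Open("/search");               // page recorded in the IDE
        selenium.Type("id=query", "hotel");     // recorded keystrokes replayed
        selenium.Click("id=submit");
        selenium.WaitForPageToLoad("30000");
        string html = selenium.GetHtmlSource(); // rendered page, after JS runs
        selenium.Stop();
    }
}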
Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.
I've uploaded a ppt that I made a while back. This should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html
In the above link select the option of a regular download.
I spent a good amount of time figuring it out, so I thought it might save somebody's time.
