How to scrape data from another website which is built in AngularJS? - c#

I have to get some specific data from another web page which is built in AngularJS.
What I have done until now:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
WebResponse response = request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
string result = reader.ReadToEnd();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
The response is not the HTML I see in the browser: the rendered page shows 4 items, but the page source contains only one item, written with template syntax such as {{item.name}}.
How can I solve this?

If you use HttpWebRequest, it returns only the HTML template, without any data. Because of how Angular works, data binding happens later, in JavaScript running in the browser.
I suggest using the WebBrowser control instead of HttpWebRequest for scraping. With WebBrowser you should be able to get the complete HTML after the $scope is initialized and the data has been added to the DOM.
To learn more about using WebBrowser in ASP.NET, you can check this link
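A quick way to confirm that a downloaded page is still an unrendered template is to look for leftover {{...}} binding expressions in it. The sketch below is only a diagnostic aid; the class and method names are hypothetical, and it assumes the site uses Angular's default {{ }} delimiters.

```csharp
using System.Text.RegularExpressions;

public static class TemplateCheck
{
    // Matches AngularJS-style interpolation markers such as {{item.name}}.
    // Assumption: the site keeps the default "{{" and "}}" delimiters.
    private static readonly Regex Binding = new Regex(@"\{\{\s*[^{}]+?\s*\}\}");

    // Returns true when the HTML still contains unrendered binding expressions,
    // which suggests the data is filled in later by JavaScript.
    public static bool LooksUnrendered(string html)
    {
        return !string.IsNullOrEmpty(html) && Binding.IsMatch(html);
    }
}
```

Calling TemplateCheck.LooksUnrendered(result) on the string read from the response would tell you whether a plain HTTP request is enough or whether a browser has to render the page first.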

Related

Parsing web site using HtmlAgilityPack does not return values as seen on browser

When parsing the site https://holfuy.com/en/weather/1284 HtmlAgilityPack returns "-" for relevant data.
string url = "https://holfuy.com/en/weather/1284";
var web = new HtmlWeb();
web.PreRequest += request =>
{
    request.CookieContainer = new System.Net.CookieContainer();
    return true;
};
HtmlDocument doc = web.Load(url);
string data = doc.DocumentNode.SelectNodes("//*[@id=\"j_pressure\"]")[0].InnerText;
Console.WriteLine(data);
What is the reason behind this?
It seems the data is loaded into the page dynamically. To parse it you need to drive a real browser, for example through Selenium with one of its available drivers, or, if you don't want to pull in all of Selenium, drive a headless browser such as PhantomJS directly. Once that is set up, load the page, allow a short delay for the data to render, and then parse it.
You can see more information here:
Running Scripts in HtmlAgilityPack
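The browser-driven approach could look roughly like the sketch below. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages plus a local Chrome with a matching driver, and it uses headless Chrome rather than PhantomJS. The class and method names are hypothetical; the element id j_pressure and the "-" placeholder come from the question. Instead of a fixed delay it polls until the placeholder is replaced.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class RenderedScraper
{
    // The page shows "-" until its scripts fill in the real value.
    public static bool IsPlaceholder(string text)
    {
        return string.IsNullOrWhiteSpace(text) || text.Trim() == "-";
    }

    // Loads the page in headless Chrome and returns the element's text
    // once it no longer holds the placeholder.
    public static string ReadRenderedText(string url, string elementId, TimeSpan timeout)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run without a visible window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);
            var wait = new WebDriverWait(driver, timeout);
            // Until() keeps polling while the lambda returns null and
            // throws WebDriverTimeoutException if the timeout elapses.
            return wait.Until(d =>
            {
                string text = d.FindElement(By.Id(elementId)).Text;
                return IsPlaceholder(text) ? null : text;
            });
        }
    }
}
```

For the page in the question, the call would be something like RenderedScraper.ReadRenderedText("https://holfuy.com/en/weather/1284", "j_pressure", TimeSpan.FromSeconds(15)).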

scrape table from web site

I want to scrape the table from this site: http://www.x-rates.com/table/?from=INR&amount=1
I want to do this in a C# Windows application.
I use a web request and response, which give me the source code of the whole page.
How can I pick out that specific table?
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.x-rates.com/table/?from=INR&amount=1");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
richTextBox.Text = reader.ReadToEnd();
Use the HTML Agility Pack for C# to extract the table from your response. Since the data sits in an ordinary HTML table, it is easy to select.
Parsing HTML Table in C#
The link above may help.
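As a rough illustration of that suggestion, the sketch below reads a table into rows of cell text with HtmlAgilityPack (NuGet package HtmlAgilityPack). The class and method names are hypothetical, and the XPath that identifies the right table on x-rates.com is not known here, so it is passed in as a parameter; you would need to inspect the page source to pick one.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableExtractor
{
    // Returns the selected table as rows of trimmed cell text.
    // tableXPath is an assumption supplied by the caller, e.g. "//table"
    // for the first table in the document.
    public static List<List<string>> ExtractRows(string html, string tableXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();
        HtmlNode table = doc.DocumentNode.SelectSingleNode(tableXPath);
        if (table == null) return rows;

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection trs = table.SelectNodes(".//tr");
        if (trs == null) return rows;

        foreach (HtmlNode tr in trs)
        {
            HtmlNodeCollection cells = tr.SelectNodes("./th|./td");
            if (cells == null) continue;
            // Decode entities such as &amp; and trim surrounding whitespace.
            rows.Add(cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim()).ToList());
        }
        return rows;
    }
}
```

With the code from the question, you would pass the string from reader.ReadToEnd() as html and then loop over the returned rows.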

WebRequest response content

I'm trying to get the response content of a given URL using HttpWebRequest:
var targetUri = new Uri("http://www.foo.com/Message/CheckMsg?msg=test");
var webRequest = (HttpWebRequest)WebRequest.Create(targetUri);
var webRequestResponse = webRequest.GetResponse();
The above code always returns the content of the home page (http://www.foo.com). I was expecting the content of the http://www.foo.com/Message page. Is something wrong, or am I missing something?
Is CheckMsg an HTML or PHP file? When I access websites using WebRequest, I always have to include the file extension; otherwise the website treats the path as a folder. I would recommend adding it:
var targetUri = new Uri("http://www.foo.com/Message/CheckMsg.html?msg=test");

How to scrape data

I am trying to scrape data from this URL: http://icecat.biz/en/p/Coby/DP102/desc.htm
I want to scrape the specs table from that page.
But when I checked the page source, the specs table was not there; I think the table is loaded using Ajax.
How can I get that table? What needs to be done?
I used the following code:
string Strproducturl = "http://icecat.biz/en/p/Coby/DP102/desc.htm";
System.Net.ServicePointManager.Expect100Continue = false;
HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(Strproducturl);
httpWebRequest.KeepAlive = true;
ASCIIEncoding encoding = new ASCIIEncoding();
HttpWebResponse httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
Stream responseStream = httpWebResponse.GetResponseStream();
StreamReader streamReader = new StreamReader(responseStream);
string response = streamReader.ReadToEnd();
As IanNorton mentioned, you'll need to make your request to the URL that Icecat uses to load the specs via AJAX. For the example link you provided, the specs details URL to request is:
http://icecat.biz/index.cgi?ajax=productPage;product_id=1091664;language=en;request=feature
You can then work your way through the HTML response to get the spec details you require.
You mentioned in your comment that the scraping process is automated. The specs URL has a simple format; you just need the product ID. However, if you don't have the IDs, only a series of URLs like the example in your original question, you'll need to get the product ID from the URL you have.
For example, the URL example you gave redirects to a different URL:
http://icecat.biz/p/coby/dp102/digital-photo-frames-0716829961025-dp-102-digital-photo-frame-1091664.html
This URL contains the product ID, right at the end.
You could make an HttpWebRequest to your original URL, stop before it follows the redirect, and capture the redirect URL:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://icecat.biz/en/p/Coby/DP102/desc.htm");
request.AllowAutoRedirect = false;
request.KeepAlive = true;
// Declared outside the if block so it is still in scope afterwards.
string redirectUrl = null;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.Redirect)
    {
        redirectUrl = response.GetResponseHeader("Location");
    }
}
Once you've got the redirectUrl variable, you can use Regex to get the ID then do another HttpWebRequest to the specs detail URL.
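The Regex step might look like the sketch below. It assumes the redirect URL always ends in "-<digits>.html", as in the example above, and that the specs URL keeps the format shown earlier; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class IcecatUrls
{
    // Assumption: the product ID is the run of digits right before ".html"
    // at the very end of the redirect URL.
    private static readonly Regex IdAtEnd =
        new Regex(@"-(\d+)\.html$", RegexOptions.IgnoreCase);

    // Returns the product ID, or null if the URL does not match the pattern.
    public static string ExtractProductId(string redirectUrl)
    {
        Match m = IdAtEnd.Match(redirectUrl ?? string.Empty);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Builds the AJAX specs URL in the format given in this answer.
    public static string BuildSpecsUrl(string productId)
    {
        return "http://icecat.biz/index.cgi?ajax=productPage;product_id="
            + productId + ";language=en;request=feature";
    }
}
```

You would pass redirectUrl to ExtractProductId, check the result for null, and then request the URL returned by BuildSpecsUrl.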
I would suggest that you use a library like HtmlAgilityPack to select various elements from the html document.
I took a quick look at the link and noticed that the data is actually loaded using an additional Ajax request. You can use the following URL to get the Ajax data:
http://icecat.biz/index.cgi?ajax=productPage;product_id=1091664;language=en;request=feature
Then use HtmlAgilityPack to parse that data.
I know this is very old but you could more easily just retrieve the XML from
https://openIcecat-xml:freeaccess@data.icecat.biz/export/freexml.int/EN/1091664.xml
You will also get all images and descriptions as well :-)

How can I Browse a page Programmatically?

I've seen numerous examples on how to get the contents of a URI. I also used HTMLAgilityPack a lot.
What I want is to create a unit testing environment for ASP websites.
I've seen BrowserSession and this question, but although the process seems fine, they do not log in to a website. I tried numerous well-known websites.
Any ideas on how to browse through code?
It sounds like you want to submit a form on a web page and view the HTML of the resulting page.
This method will take a form target URL and submit a post with the given named arguments in the parms Dictionary.
I have used the method below to perform password authentication on a web page and view the response after authentication. You will need to know the target Url and the form fields you wish to pass in the request.
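// Requires: using System; using System.Collections.Generic; using System.IO;
//           using System.Linq; using System.Net; using System.Text;
private string SubmitRequest(string url, Dictionary<string, string> parms)
{
    var req = WebRequest.Create(url);
    req.Method = "POST";
    req.ContentType = "application/x-www-form-urlencoded";
    // URL-encode keys and values so characters such as & and = do not corrupt the form body.
    string parmsString = string.Join("&", parms.Select(p =>
        Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
    // ContentLength is measured in bytes, not characters.
    byte[] body = Encoding.UTF8.GetBytes(parmsString);
    req.ContentLength = body.Length;
    using (Stream stream = req.GetRequestStream())
    {
        stream.Write(body, 0, body.Length);
    }
    using (var res = req.GetResponse())
    using (StreamReader reader = new StreamReader(res.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}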
If there is something more specific you are wanting or this is not what you are looking for then please post a comment.
My suggestion is to try some WebDriverJs tutorials and see if that works for you. It is mainly used for testing but can also be used for other purposes. I am using it to automate responding to users' queries on a shopping platform.
