I want to "simulate" navigation through a website and parse the responses.
I just want to make sure I am doing something reasonable before I start. I see two options for doing this:
Using the WebBrowser class.
Using the HttpWebRequest class.
So my initial thought was to use HttpWebRequest and just parse the response.
What do you guys think?
I also wanted to ask: I use C# because it's my strongest language, but what languages are commonly used for this kind of data mining from websites?
If you start doing it manually, you will probably end up hard-coding lots of special cases. Try the Html Agility Pack or something else that supports XPath expressions.
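For example, combining the HttpWebRequest approach you mention with the Agility Pack could look roughly like this (just a sketch; the URL, user agent and XPath are placeholders):
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Fetch the page yourself with HttpWebRequest...
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        request.UserAgent = "MyCrawler/1.0";

        string html;
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            html = reader.ReadToEnd();
        }

        // ...then hand the markup to the Agility Pack and query it with XPath.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return;

        foreach (var link in links)
            Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}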
There are a lot of mining and ETL tools out there for serious data mining needs.
For "user simulation" I would suggest the Selenium WebDriver or PhantomJS; PhantomJS is much faster but has some limitations in browser emulation, while Selenium provides almost complete browser feature support.
If you're going to mine data from a website, there is something you must do first in order to be 'polite' to the websites you are mining from: you have to obey the rules set in that website's robots.txt, which is almost always located at www.example.com/robots.txt.
Then use HTML Agility Pack to traverse the website.
Or convert the HTML document to XHTML using html2xhtml and then use an XML parser to traverse the site.
Remember to:
Check for duplicate pages. (The general idea is to hash the HTML doc at each URL; look up (super)shingles.)
Respect the robots.txt.
Get the absolute URL from each page.
Filter duplicate URLs from your queue (see the sketch below).
Keep track of the URLs you have visited (i.e. with a timestamp).
Parse your HTML doc and keep your queue updated.
Keywords: robots.txt, absolute URL, HTML parser, URL normalization, Mercator scheme.
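A simplified sketch of the absolute-URL and duplicate-filtering steps above (the robots.txt handling is left out here; a real crawler should use a proper robots.txt parser):
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class CrawlerSketch
{
    static readonly HashSet<string> Seen = new HashSet<string>();
    static readonly Queue<Uri> Frontier = new Queue<Uri>();

    static void EnqueueLinks(Uri pageUri, HtmlDocument doc)
    {
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        foreach (var a in anchors)
        {
            // Resolve relative hrefs against the current page to get absolute URLs.
            Uri absolute;
            if (!Uri.TryCreate(pageUri, a.GetAttributeValue("href", ""), out absolute))
                continue;

            // Crude normalization: drop the fragment, then filter duplicates
            // before adding the URL to the queue.
            string normalized = absolute.GetLeftPart(UriPartial.Query);
            if (Seen.Add(normalized))
                Frontier.Enqueue(absolute);
        }
    }
}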
Have fun.
There are lots of sites that use this (imo) annoying "infinite scrolling" style.
Examples of this are sites like Tumblr, Twitter, 9GAG, etc.
I recently tried to scrape some pics off of these sites programmatically with HtmlAgilityPack,
like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
var primary = doc.DocumentNode.SelectNodes("//img[@class='badge-item-img']");
var picstring = primary.Select(r => r.GetAttributeValue("src", null)).FirstOrDefault();
This works fine, but when I tried to load in the HTML from certain sites, I noticed that I only got back a small amount of content (let's say the first 10 "posts" or "pictures", or whatever).
This made me wonder if it would be possible to simulate "scrolling down to the bottom" of the page in C#.
This isn't just the case when I load the HTML programmatically: when I simply go to sites like Tumblr and check Firebug or "view source", I expected all the content to be in there somewhere, but a lot of it seems to be hidden/inserted with JavaScript. Only the content that is actually visible on my screen is present in the HTML source.
So my question is: is it possible to simulate infinitely scrolling down a page and loading in that HTML with C# (preferably)?
(I know that I can use APIs for Tumblr and Twitter, but I'm just trying to have some fun hacking stuff together with HtmlAgilityPack.)
There is no way to reliably do this for all such websites in one shot, short of embedding a web browser (which typically won't work in headless environments).
What you should consider doing instead is looking at the site's JavaScript in order to see what AJAX queries are used to fetch content as the user scrolls down.
Alternatively, use a web debugger in your browser (such as the one included in Chrome). These debuggers usually have a "network" pane you can use to inspect AJAX requests performed by the page. Looking at these requests as you scroll down should give you enough information to write C# code that simulates those requests.
You will then have to parse the response from those requests as whatever type of content that particular API delivers, which will probably be JSON or XML, but almost certainly not HTML. (This may be better for you anyway, since it will save you having to parse out display-oriented HTML, whereas the AJAX API will give you data objects that should be much easier to use.)
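Replaying one of those requests yourself might look roughly like this; the endpoint, query parameters, and JSON field names below are invented, and Json.NET is assumed for the parsing:
using System;
using System.Net;
using Newtonsoft.Json.Linq;   // Json.NET, for parsing the JSON response

class AjaxScrapeSketch
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Some endpoints only answer requests that look like XMLHttpRequests.
            client.Headers["X-Requested-With"] = "XMLHttpRequest";

            string json = client.DownloadString(
                "http://www.example.com/api/posts?offset=10&limit=10");

            JObject payload = JObject.Parse(json);
            foreach (JToken post in payload["posts"])        // hypothetical field name
                Console.WriteLine((string)post["image_url"]); // hypothetical field name
        }
    }
}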
Those sites are making asynchronous HTTP requests to load the subsequent page contents. Since the HTML Agility Pack doesn't have a JavaScript interpreter (thank heavens for that), you will need to make those requests yourself. Most sites will likely not return HTML fragments but rather JSON, in which case you'll need a JSON parser, not the HTML Agility Pack.
I want to submit Google queries like these:
http://www.google.ch/search?q=100+eur+to+chf
http://www.google.ch/search?q=1.5*17.5
...from a C# console application and capture the result reported back by Google (and ignore any links to other sites). Is there a specific Google API that helps me with this task?
I got this idea from the tool Launchy (launchy.net). Its GCalc plugin does this, and I found the source file for that module:
http://launchy.svn.sourceforge.net/viewvc/launchy/tags/2.5/plugins/gcalc/gcalc.cpp?revision=614&view=markup
It looks like GCalc does not use any Google API at all. I've got no clue how to do the same in C#, and I would prefer to use a proper API; if there isn't one, I could use some help/pointers on how to port the GCalc functionality to C# (.NET libraries/classes...?).
Google calculator results don't show up when using the API, so if you want them, you'll have to scrape the page. Be careful doing so, as it's against Google's terms of service and your IP may be banned if you send requests too frequently.
Once you've got the results page, use an HTML parser. The result is in a <b> tag (e.g. <b>1 + 1 = 2</b>; if it's not present, then you have no calculator result). Be careful of <sup> tags within the result (e.g. <b>(1 (m^2) kg) / 2 = 0.5 m<sup>2</sup> kg</b>). You might also want to decode the HTML entities.
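Assuming the results page has already been downloaded into a string (the next answer shows one way to fetch it), pulling the result out with an HTML parser might look like this sketch (a real results page can contain other <b> elements, so the selector here is simplified):
using HtmlAgilityPack;

class CalcResultSketch
{
    static string ExtractResult(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The calculator answer is said to live in a <b> element; a real page
        // may contain other <b> elements, so this selector is simplified.
        var bold = doc.DocumentNode.SelectSingleNode("//b");
        if (bold == null)
            return null;    // no calculator result on this page

        // InnerText flattens child tags such as <sup>, and DeEntitize decodes
        // HTML entities like &#215;.
        return HtmlEntity.DeEntitize(bold.InnerText);
    }
}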
You can use WebClient.DownloadString(String url). This way you get the page (HTML) as a string.
You then have to parse the result, but that shouldn't be hard. The Html Agility Pack is a good C# HTML parser that uses XPath for data retrieval.
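For example (the query URL here is just the one from the question):
using System;
using System.Net;

class DownloadSketch
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            client.Encoding = System.Text.Encoding.UTF8;
            string html = client.DownloadString("http://www.google.ch/search?q=1.5*17.5");
            // Hand the string to your HTML parser from here.
            Console.WriteLine(html.Length);
        }
    }
}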
Why not use HttpWebRequest and then parse the result, as macrog stated in his answer?
I need to write C# code for grabbing the contents of a web page. The steps look like the following:
Browse to the login page.
I have a user name and a password; provide them programmatically and log in.
Then you are on the detail page.
You have to get some information there, like product ID, description, etc.
Then you need to click (by code) on Detail View.
Then you can get the price for that product from there.
Once that is done, we can write a detail line into a text file like this:
ABC Printer::225519::285.00
Please help me with this. (Even VB.NET code is OK; I can convert it to C#.)
The WatiN library is probably what you want, then. Basically, it controls a web browser (native support for IE and Firefox, I believe, though they may have added more since I last used it) and provides an easy syntax for programmatically interacting with page elements within that browser. All you'll need are the names and/or IDs of those elements, or some unique way to identify them on the page.
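A rough sketch of what that can look like with WatiN; the URL, the element names/IDs ("username", "password", "btnLogin", "lblPrice") and the link text are made up and would need to be replaced with the real ones from the target pages:
using System;
using WatiN.Core;

class LoginSketch
{
    [STAThread]   // WatiN's IE automation requires a single-threaded apartment
    static void Main()
    {
        using (var browser = new IE("http://www.example.com/login"))
        {
            browser.TextField(Find.ByName("username")).TypeText("myUser");
            browser.TextField(Find.ByName("password")).TypeText("myPassword");
            browser.Button(Find.ById("btnLogin")).Click();

            // Click through to the detail view (the link text is an assumption).
            browser.Link(Find.ByText("Detail View")).Click();

            // Read values off the rendered page, e.g. the price label.
            string price = browser.Span(Find.ById("lblPrice")).Text;
            Console.WriteLine("ABC Printer::225519::" + price);
        }
    }
}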
You should be able to achieve this using the WebRequest class to retrieve pages, and the HTML Agility Pack to extract elements from HTML source.
Yeah, I downloaded that library. Nice one.
Thanks for sharing it with me. But I have an issue with it: the site I want to get data from has a CAPTCHA on the login page.
I could enter that value myself if the library can show the image and wait for my input.
Can we achieve that with this library? If so, I would like to see a sample.
You should be able to achieve this by using two classes in C#, HttpWebRequest (to request the web pages) and perhaps XmlTextReader (to parse the HTML/XML response).
If you do not wish to use XmlTextReader, then I'd advise looking into regular expressions, as they are fantastically useful for extracting information from large bodies of text wherein patterns exist.
How to: Send Data Using the WebRequest Class
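Along the lines of that article, posting the login form might look roughly like this (the URL and form field names are assumptions, and a real login page may also require hidden fields such as a view state or an anti-forgery token):
using System;
using System.IO;
using System.Net;
using System.Text;

class HttpLoginSketch
{
    static void Main()
    {
        var cookies = new CookieContainer();   // keeps the session cookie between requests

        var request = (HttpWebRequest)WebRequest.Create("http://www.example.com/login");
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes("username=myUser&password=myPassword");
        using (Stream stream = request.GetRequestStream())
            stream.Write(body, 0, body.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            Console.WriteLine(reader.ReadToEnd().Length);   // logged-in page; parse it next

        // Reuse the same CookieContainer on the follow-up request for the detail page.
    }
}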
I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (invalid URIs such as "/extra/url/to/base.html" and "#" links), but I also need to handle PHP, JavaScript, etc. On some sites, for example, the links are PHP, and when my web crawler tries to navigate to these, it fails. One example is a PHP/JavaScript accordion link page. How would I go about navigating/parsing these links?
Let's see if I understood your question correctly. I'm aware that this answer is probably inadequate, but if you need a more specific answer I'd need more details.
You're trying to program a web crawler but it cannot crawl URLs that end with .php?
If that's the case, you need to take a step back and think about why that is. It could be because the crawler chooses which URLs to crawl using a regex based on the URI scheme.
In most cases these URLs are just normal HTML, but they could also be a generated image (like a CAPTCHA) or a download link for a 700 MB ISO file, and there's no way to be certain without checking the header of the HTTP response from that URL.
Note: if you're writing your own crawler from scratch, you're going to need a good understanding of HTTP.
The first thing your crawler sees when it fetches a URL is the header, which contains a MIME content type; it tells a browser/crawler how to process and open the data (is it HTML, plain text, an .exe, etc.). You'll probably want to download pages based on the MIME type instead of the URL scheme. The MIME type for HTML is text/html, and you should check for it using the HTTP library you're using before downloading the rest of the content of a URL.
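A sketch of that check, looking only at the response headers before deciding whether to parse the body as HTML (the URL is a placeholder; some servers mishandle HEAD requests, in which case fall back to a GET and inspect ContentType before reading the body):
using System;
using System.Net;

class MimeCheckSketch
{
    static bool LooksLikeHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";    // ask for the headers only, not the body

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // ContentType can carry a charset suffix, e.g. "text/html; charset=utf-8".
            return response.ContentType.StartsWith("text/html",
                StringComparison.OrdinalIgnoreCase);
        }
    }
}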
The JavaScript problem
Same as above, except that running JavaScript in the crawler/parser is pretty uncommon for simple projects and might create more problems than it solves. Why do you need JavaScript?
A different solution
If you're willing to learn Python (or already know it) I suggest you look at Scrapy. It's a web crawling framework built similarly to the Django web framework. It's really easy to use and a lot of problems have already been solved so it could be a good starting point if you're trying to learn more about the technology.
Does anyone have experience with a query language for the web?
I am looking for a project, commercial or not, that does a good job of making a web page queryable and that can even follow links on it to aggregate information from a bunch of pages.
I would prefer a SQL- or LINQ-like syntax. I could of course download a web page and start doing some XPath on it, but I'm looking for a solution with a nice abstraction.
I found WebSQL (http://www.cs.utoronto.ca/~websql/), which looks good, but I'm not into Java. Its queries look like this:
SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z";
Are there others out there?
Is there a library that can be used in a .NET language?
See hpricot (a Ruby library).
# load the RedHanded home page
doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
# change the CSS class on links
(doc/"span.entryPermalink").set("class", "newLinks")
# remove the sidebar
(doc/"#sidebar").remove
# print the altered HTML
puts doc
It supports querying with CSS or XPath selectors.
Beautiful Soup and hpricot are the canonical choices for Python and Ruby, respectively.
For C#, I have used and appreciated the HTML Agility Pack. It does an excellent job of turning messy, invalid HTML into queryable goodness.
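For example, you can query the parsed document with LINQ rather than XPath (the URL and the "price" class name here are purely illustrative):
using System;
using System.Linq;
using HtmlAgilityPack;

class LinqQuerySketch
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/products");

        // Treat the DOM as an IEnumerable and filter it with ordinary LINQ.
        var prices = doc.DocumentNode
            .Descendants("span")
            .Where(n => n.GetAttributeValue("class", "").Contains("price"))
            .Select(n => n.InnerText.Trim());

        foreach (string p in prices)
            Console.WriteLine(p);
    }
}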
There is also this C# HTML parser, which looks good, but I've not tried it.
You are probably looking for SPARQL. It doesn't let you parse pages, but it's designed to solve the same problems (i.e. getting data out of a site -- from the cloud). It's a W3C standard, but Microsoft, apparently, does not support it yet, unfortunately.
I'm not sure whether this is exactly what you're looking for, but Freebase is an open database of information with a programmatic query interface.