I want to submit Google queries like these:
http://www.google.ch/search?q=100+eur+to+chf
http://www.google.ch/search?q=1.5*17.5
...from a C# console application and capture the result reported back by Google (and ignore any links to other sites). Is there a specific Google API that helps me with this task?
I got this idea from the tool Launchy (launchy.net). The GCalc plugin does this; I also found the source file for that module:
http://launchy.svn.sourceforge.net/viewvc/launchy/tags/2.5/plugins/gcalc/gcalc.cpp?revision=614&view=markup
It looks like GCalc does not use any Google API at all. I have no clue how to do the same in C#, though, and I would prefer to use a proper API. If there isn't one, I could use some help/pointers on how to port the GCalc functionality to C# (.NET libraries/classes...?)
Google calculator results don't show up when using the API, so if you want them you'll have to scrape the page. Be careful doing so: it's against Google's terms of service, and your IP may be banned if you send requests too frequently.
Once you've got the results page, use an HTML parser. The result is in a <b> tag (e.g. <b>1 + 1 = 2</b>); if no such tag is present, there is no calculator result. Be careful of <sup> tags within the result (e.g. <b>(1 (m^2) kg) / 2 = 0.5 m<sup>2</sup> kg</b>), and you might also want to decode the HTML entities.
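A rough sketch of that approach, assuming the Html Agility Pack is referenced (the query URL and the XPath are illustrative, and the markup Google serves can change at any time):

// Sketch: fetch a Google results page and pull out the calculator result.
// Assumes the HtmlAgilityPack package; selectors may break whenever Google changes its markup.
using System;
using System.Net;
using HtmlAgilityPack;

class GoogleCalcScraper
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://www.google.ch/search?q=100+eur+to+chf");

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Per the answer above, the calculator result sits in a <b> tag containing '='.
            var bold = doc.DocumentNode.SelectSingleNode("//b[contains(., '=')]");
            if (bold == null)
            {
                Console.WriteLine("No calculator result found.");
                return;
            }

            // InnerText flattens nested <sup> tags; DeEntitize decodes HTML entities.
            string result = HtmlEntity.DeEntitize(bold.InnerText);
            Console.WriteLine(result);
        }
    }
}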
You can use WebClient.DownloadString(string url); that gives you the page (HTML) as a string.
You then have to parse the result, but that shouldn't be hard. The Html Agility Pack is a good C# HTML parser that uses XPath for data retrieval.
Why not use HttpWebRequest and then parse the result, as macrog stated in his answer?
Related
I want to "simulate" navigation through a website and parse the responses.
I just want to make sure I am doing something reasonable before I start. I see two options for doing so:
Using the WebBrowser class.
Using the HttpWebRequest class.
So my initial thought was to use HttpWebRequest and just parse the response.
What do you guys think?
I also wanted to ask: I use C# because it's my strongest language, but what languages are commonly used for this kind of website mining?
If you start doing it manually, you will probably end up hard-coding lots of cases. Try Html Agility Pack or something else that supports XPath expressions.
There are a lot of mining and ETL tools out there for serious data mining needs.
For "user simulation" I would suggest using Selenum web driver or PhantomJS, which is much faster but has some limitations in browser emulation, while Selenium provides almost 100% browser features support.
If you're going to mine data from a website, there is something you must do first in order to be 'polite' to the websites you are mining from: obey the rules set in that website's robots.txt, which is almost always located at www.example.com/robots.txt.
Then use the HTML Agility Pack to traverse the website.
Or convert the HTML document to XHTML using html2xhtml and then use an XML parser to traverse the website.
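As a rough illustration of the Html Agility Pack route (the start URL is a placeholder), extracting every link on a page and resolving it to an absolute URL could look like this:

// Sketch: pull all links from a page and normalize them to absolute URLs.
// Assumes the HtmlAgilityPack package; the start URL is a placeholder.
using System;
using System.Net;
using HtmlAgilityPack;

class LinkExtractor
{
    static void Main()
    {
        var baseUri = new Uri("http://www.example.com/");
        string html = new WebClient().DownloadString(baseUri);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        foreach (var a in anchors)
        {
            // Resolve relative hrefs against the page URI (the "absolute URL" step below).
            var absolute = new Uri(baseUri, a.GetAttributeValue("href", ""));
            Console.WriteLine(absolute);
        }
    }
}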
Remember to:
Check for duplicate pages (the general idea is to hash the HTML doc at each URL; look up (super)shingles)
Respect the robots.txt
Get the absolute URL from each page
Filter duplicate URL from your queue
Keep track of the URLs you have visited (e.g. with a timestamp)
Parse your HTML doc and keep your queue updated.
Keywords: robots.txt, absolute URL, html parser, URL normalization, mercator scheme.
Have fun.
Hi, I am pretty new to the C# sphere. I've been in PHP and JavaScript since the beginning of this year. I want to scrape posts and comments from a blog. The site is http://www.somewhereinblog.net
What I want to do is
1. I want to log in programmatically
2. Then download the HTML
3. Then use regular expressions, XPath, or whatever comes in handy to separate the contents of posts and comments
I have been searching all over and understood very little, though I am quite sure I need to use HtmlAgilityPack. I don't know how to add a library to a C# console or Forms application. Can someone give me some help? I badly need this, and I have only been into C# for a week, so I would be grateful for some detailed information. Waiting eagerly.
Thanks in advance brothers.
Using WebClient you can log in and download the page.
Instead of the Html Agility Pack I like CsQuery, because it lets you use jQuery syntax on a string in C# code: you can download the HTML to a string, then search it and manipulate it much like you would with jQuery on an HTML page.
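A small sketch of that combination, assuming the CsQuery package is referenced (the CSS selectors are guesses, not the blog's real markup):

// Sketch: download a page with WebClient and query it with CsQuery's jQuery-style selectors.
// The selectors (".post-title", ".comment") are placeholders for whatever the blog actually uses.
using System;
using System.Net;
using CsQuery;

class BlogScraper
{
    static void Main()
    {
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString("http://www.somewhereinblog.net");
        }

        CQ dom = CQ.Create(html);

        // Same idea as $(".post-title") in jQuery.
        foreach (var title in dom[".post-title"])
        {
            Console.WriteLine(title.InnerText);
        }

        foreach (var comment in dom[".comment"])
        {
            Console.WriteLine(comment.InnerText);
        }
    }
}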
I'd like to add some kind of simple URL resolution and formatting to my C# and jQuery-based ASP.NET web application. I currently allow users to add simple text-based descriptions to items and leave simple comments ('simple' as in I only allow plain text).
What I need to support is the ability for a user to enter something like:
Check out this cool link: http://www.really-cool-site.com
...and have the URL above properly resolved as a link and automagically turned into a clickable link...kinda like the way the editor in StackOverflow works. Except that we don't want to support BBCode or any of its variants. The user experience would actually be more like the way Facebook resolves user-generated URL's.
What are some jQuery + C# solutions I should consider?
There's another question with a solution that might help you. It uses a regex in pure JS.
Personally though, I would do it server-side when the user submits it. That way, you only need to do it once, rather than every time you display that text. You could use a similar regex in C#.
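A minimal server-side sketch of that idea (the pattern is deliberately simple and will miss some edge cases; treat it as a starting point):

// Sketch: turn bare http/https URLs in plain text into anchor tags on the server.
// Real-world URLs have many more edge cases than this pattern handles.
using System.Text.RegularExpressions;

static class Linkifier
{
    public static string Linkify(string plainText)
    {
        return Regex.Replace(
            plainText,
            @"(https?://[^\s]+)",
            "<a href=\"$1\" rel=\"nofollow\">$1</a>");
    }
}

// Usage: Linkifier.Linkify("Check out this cool link: http://www.really-cool-site.com");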
I ended up using server-side C# code to do the linkification. I use an AJAX-jQuery wrapper to call into a PageMethod that does the work.
The PageMethod both linkifies and sanitizes the user-supplied string, then returns the result.
I use the Microsoft Anti-Cross Site Scripting Library (AntiXSS) to sanitize:
http://www.microsoft.com/download/en/details.aspx?id=5242
And I use C# code I found here and there to resolve and shorten links using good olde string parsing and regular expressions.
My method is not as cool as the way Facebook does it in real time, but at least now my users can add links to their descriptions and comments.
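A minimal sketch of what such a PageMethod might look like (class, method, and field names are illustrative, not the original code; the encode call assumes the AntiXSS 4.x Encoder class, and the regex stands in for the link-resolution logic):

// Sketch of a page method that sanitizes user text first and then linkifies it.
// Microsoft.Security.Application.Encoder comes from the AntiXSS library mentioned above.
using System.Text.RegularExpressions;
using System.Web.Services;
using Microsoft.Security.Application;

public partial class CommentsPage : System.Web.UI.Page
{
    [WebMethod]
    public static string LinkifyComment(string userText)
    {
        // Encode everything first so user-supplied markup can't execute.
        string safe = Encoder.HtmlEncode(userText);

        // Then turn plain URLs back into clickable anchors.
        return Regex.Replace(
            safe,
            @"(https?://[^\s]+)",
            "<a href=\"$1\" rel=\"nofollow\">$1</a>");
    }
}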
I want to design my own search engine application where all the results are displayed to the user on one single page (pulled from Google/Bing etc.), unlike Google, where they are spread across multiple pages.
Do any such APIs exist that can get me all those results?
PS. I am using C#, and I'm considering the IEnumerator interface for this.
If you just want to be able to serve search results to users, then the APIs provided by search engines are probably the way to go. As already mentioned there's Bing's Live Search API (which I've not used but looks fine), and also Google's Web Search API.
Additionally, there's Yahoo BOSS which I found very easy to use. However, it looks like BOSS is now a paid API - so depending on your budget/intention, it might not suit.
Google's Web Search API is now deprecated, but should still work for a small number of queries - it's the platform that tools like this number of results counter are built on. It's been replaced by the Google Custom Search API which depending on your needs may or may not work for you. I've not used it, but it looks fine, and is free for small numbers of queries.
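For illustration, one query against the Custom Search JSON API from C# looks roughly like this (the API key and search engine ID are placeholders you get from the Google API console; the response is JSON that you would then deserialize):

// Sketch: a single request to the Google Custom Search JSON API.
// "YOUR_API_KEY" and "YOUR_CX_ID" are placeholders from the Google API console.
using System;
using System.Net;

class CustomSearchClient
{
    static void Main()
    {
        string query = Uri.EscapeDataString("stack overflow");
        string url = "https://www.googleapis.com/customsearch/v1"
                   + "?key=YOUR_API_KEY&cx=YOUR_CX_ID&q=" + query;

        using (var client = new WebClient())
        {
            // The result is a JSON document; feed it to your favorite JSON parser.
            string json = client.DownloadString(url);
            Console.WriteLine(json);
        }
    }
}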
The problem with crawling and then parsing search pages is that search engines regularly change the underlying html of the search result pages - so any screen scraping approach will be quite brittle. Additionally, the terms of service of most commercial search engines prohibit automated access - if you go ahead anyway they may well block your crawler. These two problems are probably why awesome third party parsing APIs don't really exist.
What you can do is fetch data from the different APIs (Bing/Google etc.) and then display it to the user in one flow. Otherwise, crawling the search engines' result pages is against their terms of service.
For Google, you can go with the Google Custom Search API or, if you have products to search, the Google Shopping API.
For Bing, there is a simple and straightforward API.
Check out Apache Nutch. Is this what you are looking for?
Bing has an open API: http://www.bing.com/developers
Google gives you an API and then immediately takes it away: http://code.google.com/apis/websearch/docs/
The Google API is deprecated, and I think they have another one that is even more limited. Once upon a time they had an API that was comparable to Bing's.
For the exact scenario you mentioned, though, the best thing to do is first parse out the number of results and then keep iterating through the pages. You also need to handle errors well, because Google very often lies about the number of results it reports.
I am working on the same project.
Generate a sitemap:
private void SubmitSitemap(string portalName)
{
    // Ping the search engines to let them know we updated our sitemap.
    // The sitemap URL is built from a placeholder domain; substitute your own.
    string sitemapUrl = HttpUtility.UrlEncode("http://yourpath/" + portalName + "/sitemap.xml");

    // Resubmit to Google
    System.Net.WebRequest reqGoogle = System.Net.WebRequest.Create("http://www.google.com/webmasters/tools/ping?sitemap=" + sitemapUrl);
    reqGoogle.GetResponse().Close();

    // Resubmit to Ask
    System.Net.WebRequest reqAsk = System.Net.WebRequest.Create("http://submissions.ask.com/ping?sitemap=" + sitemapUrl);
    reqAsk.GetResponse().Close();

    // Resubmit to Yahoo
    System.Net.WebRequest reqYahoo = System.Net.WebRequest.Create("http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=" + sitemapUrl);
    reqYahoo.GetResponse().Close();

    // Resubmit to Bing
    System.Net.WebRequest reqBing = System.Net.WebRequest.Create("http://www.bing.com/webmaster/ping.aspx?siteMap=" + sitemapUrl);
    reqBing.GetResponse().Close();
}
Generate a robots.txt file and place it in your root directory. A friendly name and other details are also important for this purpose.
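For reference, a minimal robots.txt that allows crawling and advertises the sitemap looks like this (the domain is a placeholder):

User-agent: *
Disallow:
Sitemap: http://www.yourdomain.com/sitemap.xml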
I need to write C# code for grabbing the contents of a web page. The steps look like the following:
Browse to the login page
I have a user name and a password; provide them programmatically and log in
Then you are on the detail page
You have to get some information there, like (product Id, description, etc.)
Then you need to click (by code) on Detail View
Then you can get the price for that product from there.
Now it is done, so we can write a detail line into a text file like this...
ABC Printer::225519::285.00
Please help me with this. (Even VB.Net code is OK, I can convert it to C#.)
The WatiN library is probably what you want, then. Basically, it controls a web browser (native support for IE and Firefox, I believe, though they may have added more since I last used it) and provides an easy syntax for programmatically interacting with page elements within that browser. All you'll need are the names and/or IDs of those elements, or some unique way to identify them on the page.
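A rough WatiN sketch of that login-and-read flow (the URL, field names, link text, and element IDs are guesses; you need the real names/IDs from the target page):

// Sketch: drive IE with WatiN to log in and read a value from the detail page.
// The URL, field names, link text, and element IDs are placeholders.
using System;
using WatiN.Core;

class ProductGrabber
{
    [STAThread]
    static void Main()
    {
        using (var browser = new IE("http://www.example.com/login"))
        {
            browser.TextField(Find.ByName("username")).TypeText("myUser");
            browser.TextField(Find.ByName("password")).TypeText("myPassword");
            browser.Button(Find.ByValue("Log in")).Click();

            // Navigate into the detail view the same way a user would.
            browser.Link(Find.ByText("Detail View")).Click();

            string price = browser.Span(Find.ById("price")).Text;
            Console.WriteLine(price);
        }
    }
}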
You should be able to achieve this using the WebRequest class to retrieve pages, and the HTML Agility Pack to extract elements from HTML source.
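A sketch of the WebRequest route (the URLs, form field names, and XPath are placeholders; the real login form will differ, and sites with CAPTCHAs or JavaScript-driven logins won't work this way):

// Sketch: log in by POSTing the form fields, keep the session cookie, then parse a page.
// All URLs, form fields, and the XPath expression are placeholders.
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class WebRequestScraper
{
    static void Main()
    {
        var cookies = new CookieContainer();

        // 1. POST the login form.
        var login = (HttpWebRequest)WebRequest.Create("http://www.example.com/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes("username=myUser&password=myPassword");
        using (var stream = login.GetRequestStream())
            stream.Write(body, 0, body.Length);
        login.GetResponse().Close();

        // 2. GET the detail page with the same cookies.
        var detail = (HttpWebRequest)WebRequest.Create("http://www.example.com/products/detail");
        detail.CookieContainer = cookies;

        string html;
        using (var response = detail.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            html = reader.ReadToEnd();

        // 3. Extract the price with the Html Agility Pack.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var priceNode = doc.DocumentNode.SelectSingleNode("//span[@id='price']");
        Console.WriteLine(priceNode != null ? priceNode.InnerText : "price not found");
    }
}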
Yeah, I downloaded that library. Nice one.
Thanks for sharing it with me. But I have an issue with it: the site I want to get data from has a CAPTCHA on the login page.
I can enter that value if the code can show the image and wait for my input.
Can we achieve that with this library? If so, I would like to see a sample.
You should be able to achieve this by using two classes in C#: HttpWebRequest (to request the web pages) and perhaps XmlTextReader (to parse the HTML/XML response).
If you do not wish to use XmlTextReader, then I'd advise looking into regular expressions, as they are fantastically useful for extracting information from large bodies of text wherein patterns exist.
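For instance, a throwaway regex for pulling a price out of a known chunk of markup might look like this (the surrounding HTML is invented, and regexes like this are brittle against markup changes):

// Sketch: extract a price from a fragment of HTML with a regular expression.
// The markup here is invented; adjust the pattern to the real page.
using System;
using System.Text.RegularExpressions;

class PriceExtractor
{
    static void Main()
    {
        string html = "<span id=\"price\">285.00</span>";

        Match m = Regex.Match(html, @"<span id=""price"">\s*([\d.]+)\s*</span>");
        if (m.Success)
            Console.WriteLine("Price: " + m.Groups[1].Value);
    }
}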
How to: Send Data Using the WebRequest Class