Meta Search Engine / Web Scraping in Android Studio / Java - C#

I want to create an application that searches for something, with some filters, across various websites (I don't need to log in to those third-party websites, so the data is publicly available) and shows the results in my application. I have a few questions:
1. Is it legal?
2. Is this web scraping or a meta search engine?
3. Can I get more information (any web links/articles) to learn more about it? How do I achieve it technically? One way I know of is the XPath technique for scraping, but I am wondering if there are more ways.
I am NOT asking for the entire code, just how to start / any guidance.
Thank you in advance!

Firstly, you need to understand how search engines work.
Search engines like Google have special programs designed to mine information from the web, called "spiders" (or crawlers). What a spider does is basically crawl over web pages, following links and indexing content, so matching information can be found for a search query. However, that's a genuinely complex thing to build: it takes solid code and algorithm expertise to develop a spider yourself. If you can master it, it can be quite lucrative, but that's rare unless you're exceptionally good.
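On the practical side, the XPath technique mentioned in the question is the usual starting point. Below is a minimal sketch in C# using the HtmlAgilityPack library; the URL and the XPath expression are placeholders for illustration (on Android, an HTML parser library plays the same role).

// Minimal scraping sketch with HtmlAgilityPack (NuGet: HtmlAgilityPack).
// The URL and the XPath expression are placeholders - adapt to the target site.
using System;
using HtmlAgilityPack;

class ScrapeSketch
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/search?q=term"); // placeholder URL

        // Grab all links as a demonstration of XPath selection.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return; // no matches on the page

        foreach (HtmlNode node in nodes)
            Console.WriteLine(node.InnerText.Trim() + " -> " + node.GetAttributeValue("href", ""));
    }
}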

Related

Pull website text from one site and display it on another

Hi, I am trying to pull this string from courseweb.hopkinsschools.org and display it in my own ASP.NET application. I have been looking for a tutorial for a long time, but nothing works. Any help would be greatly appreciated.
[Screenshot of the needed string omitted.]
When I started doing work with websites and interfacing with other websites, I originally wanted to do what you're talking about - reading the text from pages - because that's how we as people interface with computers and websites.
But that is not how computers should interface with other websites unless absolutely necessary.
Moodle has an API for things like course management. It's kind of difficult to find information on, but it's called Moodle Web Services, if I remember correctly. I'll add a link back if I can find it.
These let you access Moodle in a computer-friendly way, i.e. a way your computer can easily understand, instead of trying to read web pages.
Edit
Here are some resources to get you started:
https://docs.moodle.org/dev/Web_services
https://code.google.com/p/mnet-csharp/
https://delog.wordpress.com/2010/08/31/integrating-a-c-app-with-moodle-using-xml-rpc/
https://delog.wordpress.com/2010/09/08/integrating-c-app-with-moodle-2/
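For a rough idea of what such a call looks like, here is a minimal C# sketch against Moodle's REST protocol. The site URL and token are placeholders; a Moodle administrator has to issue the token and enable the function for the web-service user (core_course_get_courses is one of the documented course functions).

// Sketch: call a Moodle web service over REST and dump the JSON reply.
// The site URL and token below are placeholders.
using System;
using System.Net;

class MoodleWsSketch
{
    static void Main()
    {
        string site = "https://your-moodle-site.example"; // placeholder
        string token = "YOUR_WS_TOKEN";                   // issued by the Moodle admin

        string url = site + "/webservice/rest/server.php"
                   + "?wstoken=" + token
                   + "&wsfunction=core_course_get_courses"
                   + "&moodlewsrestformat=json";

        using (var client = new WebClient())
        {
            Console.WriteLine(client.DownloadString(url)); // parse with a JSON library
        }
    }
}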

Connecting To A Website To Look Up A Word (Compiling Mass Data / Webcrawler)

I am currently developing a word-completion application in C# and, after getting the UI up and running, keyboard hooks set, and other things of that nature, I came to the realization that I need a WordList. The only issue is, I can't seem to find one with the appropriate information. I also don't want to spend an entire week formatting and gathering a WordList by hand.
The information I want is something like "TheWord, the definition, verb/etc."
So, it hit me. Why not download a basic word list with nothing but words (already did this; there are about 109,523 words), write a program that iterates through every word, connects to the internet, retrieves the data (definition etc.) from some arbitrary site, and creates XML data from said information? It could be 100% automated, and I would only have to wait for maybe an hour, depending on my internet connection speed.
This however, brought me to a few questions.
How should I connect to a site to look up these words? << This is my actual question.
How would I read this information from the website?
Would I piss off my ISP or the website for that matter?
Is this a really bad idea? Lol.
How do you guys think I should go about this?
EDIT
Someone noticed that Dictionary.com uses the word as a suffix in the URL. This will make it easy to iterate through the word file. I also see that the webpage is served as XHTML (or maybe just HTML). Here is the source for the word "Cat": http://pastebin.com/hjZj6AC1
For what you marked as your actual question - you just need to download the data from the website and find what you need.
A great tool for this is CsQuery, which lets you use jQuery selectors.
You could do something like this (the URL and selector here are placeholders - point them at the actual dictionary page and the element that holds the definition):
var dom = CQ.CreateFromUrl("http://www.jquery.com");
string definition = dom.Select(".definitionDiv").Text();
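Scaling that up to the full word list from the question, a rough sketch might look like the following. The URL pattern follows the Dictionary.com observation from the edit above; the CSS selector is hypothetical, so inspect the live page source to find the right one. The one-second delay keeps the request rate polite, which also speaks to the ISP/blocking worry.

// Sketch: look up every word from the list via CsQuery.
// The selector ".def-content" is hypothetical - check the page source.
using System;
using System.IO;
using System.Threading;
using CsQuery;

class WordLookup
{
    static void Main()
    {
        foreach (string word in File.ReadLines("wordlist.txt"))
        {
            var dom = CQ.CreateFromUrl("http://dictionary.reference.com/browse/" + word);
            string definition = dom.Select(".def-content").Text();
            Console.WriteLine(word + ": " + definition);
            Thread.Sleep(1000); // throttle to roughly one request per second
        }
    }
}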

Get HTML from Web Page and Create Setup Project for WPF Application (C#)

I'm trying to create a WPF application, such as a movie library, because I would like to manage and sort my movies with a pretty interface.
I'd like to build a library of all my movies, getting the information from the web, but I don't know how.
I thought about getting the information from a website, for example IMDb, but I don't know if it's legal to capture HTML from a page to get the nested information.
It's my first desktop application, and I would also like to know whether it is necessary to create a database within the project, and then create a setup project with a specified script to deploy it.
Sorry for the confusion, but I would like to know too many things :)
Thanks a lot in advance.
The legality of web scraping is a grey area. See my question, "Legality of Web Scraping vs Normal Use" and the corresponding answers for some insight.
Even if the legality is not a problem, web scraping is a flimsy approach because the webpage structure may change without notice, making your application suddenly useless until you update it to the new format. You are much better off using some sort of web API (if the site providing the information offers it).
Whether you need a database or not depends entirely on what your application will be doing and how you design it - it's not something any of us can tell you.
Same goes for the setup project - in fact I wouldn't worry about that until you actually have a working application. Take it step by step and keep the scope within control.
Yes, I did not think about an API.
It's a great idea - maybe use "themoviedb".
But if I create an application based on it, which has to show all the movies stored on your HDD and get, for example, the posters, the descriptions and the rankings, do I have to create a database, in your opinion?
Thanks a lot.
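For reference, querying themoviedb looks roughly like this sketch against v3 of its API; the API key is a placeholder you can get for free from themoviedb.org. Since the API returns posters, descriptions and rankings directly, a local database is mainly useful as a cache for the movies you have already matched.

// Sketch: search TMDb (api.themoviedb.org, API v3) for a movie by title.
// The API key is a placeholder - register at themoviedb.org for a free key.
using System;
using System.Net;

class TmdbSketch
{
    static void Main()
    {
        string apiKey = "YOUR_API_KEY"; // placeholder
        string url = "https://api.themoviedb.org/3/search/movie?api_key=" + apiKey
                   + "&query=" + Uri.EscapeDataString("The Godfather");

        using (var client = new WebClient())
        {
            // The JSON reply contains poster paths, overviews and vote averages;
            // cache locally whatever you need to avoid re-querying.
            Console.WriteLine(client.DownloadString(url));
        }
    }
}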

How to design a customized Search Engine?

I want to design my own search engine application where all the results are displayed to the user on one single page (from Google/Bing etc.), unlike Google, where they are spread across several pages.
Do any APIs exist that can get me all those results?
P.S. I am using C#, and I am considering the IEnumerator interface for this.
If you just want to be able to serve search results to users, then the APIs provided by search engines are probably the way to go. As already mentioned there's Bing's Live Search API (which I've not used but looks fine), and also Google's Web Search API.
Additionally, there's Yahoo BOSS which I found very easy to use. However, it looks like BOSS is now a paid API - so depending on your budget/intention, it might not suit.
Google's Web Search API is now deprecated, but should still work for a small number of queries - it's the platform that tools like this number of results counter are built on. It's been replaced by the Google Custom Search API which depending on your needs may or may not work for you. I've not used it, but it looks fine, and is free for small numbers of queries.
The problem with crawling and then parsing search pages is that search engines regularly change the underlying html of the search result pages - so any screen scraping approach will be quite brittle. Additionally, the terms of service of most commercial search engines prohibit automated access - if you go ahead anyway they may well block your crawler. These two problems are probably why awesome third party parsing APIs don't really exist.
What you can do is fetch data from different APIs (Bing/Google etc.) and then display it to the user in one flow, as sketched below. Otherwise, crawling the search engines directly is against their terms of service.
For Google, you can use the Google Custom Search API or, if you have products to search, the Google Shopping API.
For Bing, there is a simple and straightforward API.
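As a sketch of the "one flow" idea: query each provider's API and merge the results into a single stream. The endpoint URLs below are hypothetical stand-ins - substitute the documented Bing/Google endpoints and parse each provider's response format into a common shape.

// Sketch: fan a query out to several (hypothetical) search APIs and
// expose the merged results as one IEnumerable stream.
using System;
using System.Collections.Generic;
using System.Net;

class MetaSearchSketch
{
    static IEnumerable<string> Search(string query)
    {
        string[] endpoints =
        {
            "https://api.search-provider-a.example/search?q=", // hypothetical
            "https://api.search-provider-b.example/search?q="  // hypothetical
        };

        foreach (string endpoint in endpoints)
        {
            string raw;
            using (var client = new WebClient())
                raw = client.DownloadString(endpoint + Uri.EscapeDataString(query));

            // Parse each provider's response into a common result type here,
            // then yield the items so callers see a single combined flow.
            yield return raw;
        }
    }

    static void Main()
    {
        foreach (string result in Search("meta search"))
            Console.WriteLine(result);
    }
}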
Check out Apache Nutch. Is this what you are looking for?
Bing has an open API: http://www.bing.com/developers
Google gives you an API, then immediately takes it away: http://code.google.com/apis/websearch/docs/
The Google API is deprecated, and I think they have another one that is even more limited. Once upon a time they had an API that was comparable to Bing's.
For the exact scenario you mentioned, though, the best thing to do is first parse out the number of results, then keep iterating through the pages. You also need to handle errors well, because Google very often lies about the number of results it reports.
I'm working on the same project.
Generate a sitemap:
private void SubmitSitemap(string portalName)
{
    // Ping the search engines to let them know we updated our sitemap.
    // "http://yourpath" is a placeholder for your site's base URL.
    string sitemapUrl = HttpUtility.UrlEncode("http://yourpath/" + portalName + "/sitemap.xml");

    // Resubmit to Google
    System.Net.WebRequest reqGoogle = System.Net.WebRequest.Create(
        "http://www.google.com/webmasters/tools/ping?sitemap=" + sitemapUrl);
    reqGoogle.GetResponse().Close();

    // Resubmit to Ask
    System.Net.WebRequest reqAsk = System.Net.WebRequest.Create(
        "http://submissions.ask.com/ping?sitemap=" + sitemapUrl);
    reqAsk.GetResponse().Close();

    // Resubmit to Yahoo
    System.Net.WebRequest reqYahoo = System.Net.WebRequest.Create(
        "http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=" + sitemapUrl);
    reqYahoo.GetResponse().Close();

    // Resubmit to Bing
    System.Net.WebRequest reqBing = System.Net.WebRequest.Create(
        "http://www.bing.com/webmaster/ping.aspx?siteMap=" + sitemapUrl);
    reqBing.GetResponse().Close();
}
Generate a robots.txt file and place it in your root directory. Friendly URLs and other such details are also important for this purpose.
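For reference, a minimal robots.txt that allows crawling and advertises the sitemap looks like this (the sitemap URL is a placeholder):

User-agent: *
Disallow:
Sitemap: http://yourpath/sitemap.xml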

Writing a C# program that scans ecommerce websites and extracts product pictures + prices + descriptions from them

I'm developing an ecommerce search engine that allows you to search for products across a lot of ecommerce websites.
How do I approach the matter?
I need an application that will be able to scan websites, parse their HTML, and determine which of the images on a site are product images, which parts are product descriptions, and which are product prices.
Would be happy to hear any ideas or examples.
Thanks in advance.
Edit:
My question is not how to get the HTML from the websites (which is called screen scraping) but how to parse that information and understand which parts of the HTML contain the actual data I am looking for, and which do not.
You may find this thread helpful in your quest. I had outlined the basic steps there. Here's the link to all questions tagged as "Screen-scraping" on SO. Also, lots of material on the web - Google.
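To give a flavour of the "which element is the data" problem: a common first pass is to parse the page and score candidate nodes with heuristics, e.g. currency patterns for prices and size thresholds for product images. A rough sketch with HtmlAgilityPack follows - the URL is a placeholder and these heuristics are deliberately simplistic (schema.org/microdata markup, where a site provides it, is far more reliable).

// Rough heuristic extraction sketch with HtmlAgilityPack.
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class ProductExtractorSketch
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://shop.example.com/product/123"); // placeholder

        // Heuristic 1: short text nodes matching a currency pattern are price candidates.
        var priceRegex = new Regex(@"[$€£]\s?\d+([.,]\d{2})?");
        var textNodes = doc.DocumentNode.SelectNodes("//span | //div");
        if (textNodes != null)
            foreach (HtmlNode node in textNodes)
            {
                string text = node.InnerText.Trim();
                if (text.Length < 20 && priceRegex.IsMatch(text))
                    Console.WriteLine("Price candidate: " + text);
            }

        // Heuristic 2: large images are more likely to be product shots.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null)
            foreach (HtmlNode img in images)
                if (img.GetAttributeValue("width", 0) >= 200)
                    Console.WriteLine("Image candidate: " + img.GetAttributeValue("src", ""));
    }
}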
Most of the sites you'd be scraping (more correctly, web scraping) have partner APIs for "reseller"-type deals. Circumventing that with screen scraping will quickly get your IP blocked by their traffic servers, and could potentially land you in legal trouble.
This is ethically dubious at best.
