Scrape all of Google's search results based on certain criteria? - C#

I am working on my mapper and I need to get the full map of newegg.com
I could try to scrape NE directly (which somewhat violates NE's policies), but they have many products that are not available via direct NE search, only via a google.com search, and I need those links too.
Here is the search string that returns 16mil of results:
https://www.google.com/search?as_q=&as_epq=.com%2FProduct%2FProduct.aspx%3FItem%3D&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=newegg.com&as_occt=url&safe=off&tbs=&as_filetype=&as_rights=
I want my scraper to go over all the results and log the hyperlink for each one.
I can scrape all the links from Google search results, but Google limits each query to 100 pages (1,000 results), and again, Google is not happy with this approach. :)
I am new to this; could you advise / point me in the right direction? Are there any tools or methodologies that could help me achieve my goals?

Google takes a lot of steps to prevent you from crawling their pages and I'm not talking about merely asking you to abide by their robots.txt. I don't agree with their ethics, nor their T&C, not even the "simplified" version that they pushed out (but that's a separate issue).
If you want to be seen, then you have to let google crawl your page; however, if you want to crawl Google then you have to jump through some major hoops! Namely, you have to get a bunch of proxies so you can get past the rate limiting and the 302s + captcha pages that they post up any time they get suspicious about your "activity."
Despite being thoroughly aggravated by Google's T&C, I would NOT recommend that you violate it! However, if you absolutely need the data, then you can get a big list of proxies, load them into a queue, and pull a proxy from the queue each time you want to fetch a page. If the proxy works, put it back in the queue; otherwise, discard it. You could even keep a failure counter for each proxy and discard it once it exceeds some number of failures.
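If you do go down that road, a rough C# sketch of the proxy queue described above might look like this (the ProxyEntry type, the failure threshold, and the WebClient-based download are illustrative assumptions, not any particular library's API):

// A minimal sketch of the proxy queue described above. ProxyEntry, the
// failure threshold, and the WebClient download are illustrative assumptions.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;

class ProxyEntry
{
    public string Address { get; set; }
    public int Failures { get; set; }
}

class ProxyRotator
{
    private readonly ConcurrentQueue<ProxyEntry> _queue = new ConcurrentQueue<ProxyEntry>();
    private const int MaxFailures = 3; // discard a proxy after this many failures

    public ProxyRotator(IEnumerable<string> proxyAddresses)
    {
        foreach (var address in proxyAddresses)
            _queue.Enqueue(new ProxyEntry { Address = address });
    }

    public string Download(string url)
    {
        ProxyEntry proxy;
        while (_queue.TryDequeue(out proxy))
        {
            try
            {
                using (var client = new WebClient())
                {
                    client.Proxy = new WebProxy(proxy.Address);
                    string html = client.DownloadString(url);
                    _queue.Enqueue(proxy);          // proxy worked: back into the queue
                    return html;
                }
            }
            catch (WebException)
            {
                if (++proxy.Failures < MaxFailures)
                    _queue.Enqueue(proxy);          // give it another chance
                // otherwise drop it for good
            }
        }
        throw new InvalidOperationException("No working proxies left.");
    }
}

Each call takes the next proxy, retries with another one on failure, and only permanently drops a proxy after repeated failures.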

I've not tried it, but you can use Google's Custom Search API. Of course, it starts to cost money after 100 searches a day. I guess they must be running a business ;p
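For reference, a minimal sketch of calling the Custom Search JSON API from C# could look like the following; YOUR_API_KEY and YOUR_ENGINE_ID are placeholders you would obtain from the Google developer console, and the query string is just an example:

// A sketch of a single Custom Search request. YOUR_API_KEY and YOUR_ENGINE_ID
// are placeholders from the Google developer console; the query is an example.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class CustomSearchDemo
{
    static async Task Main()
    {
        string apiKey = "YOUR_API_KEY";      // placeholder
        string engineId = "YOUR_ENGINE_ID";  // placeholder
        string query = Uri.EscapeDataString("site:newegg.com Product.aspx?Item=");

        string url =
            $"https://www.googleapis.com/customsearch/v1?key={apiKey}&cx={engineId}&q={query}";

        using (var http = new HttpClient())
        {
            string json = await http.GetStringAsync(url);
            Console.WriteLine(json); // parse the "items" array for the result links
        }
    }
}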

It might be a bit late, but I think it is worth mentioning that you can scrape Google professionally and reliably without causing problems.
Actually, scraping Google poses no threat that I know of.
It is challenging if you are inexperienced, but I am not aware of a single case of legal consequences, and I follow this topic closely.
Maybe one of the largest cases of scraping happened some years ago, when Microsoft scraped Google to power Bing. Google was able to prove it by planting fake results that do not exist in the real world, and Bing suddenly picked them up.
Google named and shamed them, and that's all that happened as far as I remember.
Using the API is rarely a real option; it costs a lot of money for even a small number of results, and the free quota is rather small (40 lookups per hour before you are blocked).
The other downside is that the API does not mirror the real search results; in your case that may be less of a problem, but in most cases people want the real ranking positions.
Now, if you do not accept Google's TOS, or choose to ignore it (they did not care about your TOS when they scraped you in their startup days), you can go another route.
Mimic a real user and get the data directly from the SERPs.
The key here is to send around 10 requests per hour (this can be increased to 20) from each IP address (yes, you use more than one IP). That amount has proven to cause no problems with Google over the past years.
Use caching, databases, and IP rotation management to avoid hitting it more often than required.
The IP addresses need to be clean, unshared, and ideally without an abusive history.
The proxy list suggested earlier would complicate things a lot, as you receive unstable, unreliable IPs with questionable sharing and abuse history.
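As a rough illustration of that per-IP throttling (roughly 10 requests per hour means one request every ~6 minutes from any single address), something like the following could manage the rotation; the class name and the 6-minute figure are assumptions for the sketch, and you would still cache results so repeated queries never hit Google at all:

// A rough sketch of per-IP throttling: ~10 requests/hour means one request
// every ~6 minutes per address. The class name and numbers are assumptions.
using System;
using System.Collections.Generic;
using System.Linq;

class ThrottledIpPool
{
    private static readonly TimeSpan MinDelayPerIp = TimeSpan.FromMinutes(6);
    private readonly Dictionary<string, DateTime> _lastUsed = new Dictionary<string, DateTime>();

    public ThrottledIpPool(IEnumerable<string> ipAddresses)
    {
        foreach (string ip in ipAddresses)
            _lastUsed[ip] = DateTime.MinValue;
    }

    // Returns an IP that has "cooled down" for at least 6 minutes,
    // or null if every IP has been used too recently.
    public string TryAcquire()
    {
        DateTime now = DateTime.UtcNow;
        string candidate = _lastUsed
            .Where(kv => now - kv.Value >= MinDelayPerIp)
            .OrderBy(kv => kv.Value)          // least recently used first
            .Select(kv => kv.Key)
            .FirstOrDefault();

        if (candidate != null)
            _lastUsed[candidate] = now;       // mark it as used
        return candidate;
    }
}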
There is an open-source PHP project at http://scraping.compunect.com which contains all the features you need to get started; I used it for my own work, which has now been running for some years without trouble.
It is a finished project, mainly built to be used as a customizable base for your own project, but it runs standalone too.
Also, PHP is not a bad choice: I was originally skeptical, but I ran PHP (5) as a background process for two years without a single interruption.
The performance is easily good enough for such a project so I would give it a shot.
Otherwise, PHP code reads much like C/Java: you can see how things are done and repeat them in your own project.

Related

Real time monitor and block a web request

I am planning on creating free, open-source porn-blocker software that blocks porn websites as much as possible.
The idea is to create a list of websites like xxxporn.xxx or whatever, and whenever the user tries to visit one of those websites in any web browser, it just kills the request and the user goes nowhere.
I am good with programming and my problem isn't with code; I just want to know where I should start.
I heard about packet sniffers, so how do I do that in C#? All I want is a demo method or a code sample that shows me the currently visited websites and kills the request when a predefined website is visited.
I wrote a web crawler and had to deal with filtering out porn on free crawls.
Look for the following terms:
18 U.S.C 2257
18 U.S.C. 2257
section 2257 compliance
Most pornographic sites have these terms in their HTML source.
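A small C# sketch of that filter, assuming you simply download the page and scan its HTML for the phrases above (the class and method names here are illustrative):

// A small sketch of that filter: download the page and scan its HTML for the
// 18 U.S.C. 2257 compliance phrases. Class and method names are illustrative.
using System;
using System.Net;

class ContentFilter
{
    private static readonly string[] Markers =
    {
        "18 u.s.c 2257",
        "18 u.s.c. 2257",
        "section 2257 compliance"
    };

    public static bool LooksLikeAdultSite(string url)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(url).ToLowerInvariant();
            foreach (string marker in Markers)
            {
                if (html.Contains(marker))
                    return true;
            }
        }
        return false;
    }
}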
This is not an answer to your request, but rather just one view on the subject.
Porn is not something that just pops up while you are surfing regular websites, like this one for example... Porn is something that you need to look for.
If you have small children and you don't want them to be exposed to things that are not under your control, you can simply define all their allowed surfing destinations with Windows Firewall.
If you have older children and you are afraid that they might wander off in search of porn due to age or hormonal impulses, or get exposed simply by surfing all sorts of dubious and pirated websites, you should have a talk with them and explain things in a grown-up manner, not try to block out the reality of life in such a medieval way.
In this modern age, where the internet governs all aspects of our lives and there are far greater risks out there in cyberspace than porn, proper training and education on what is good and bad when surfing is the key to protecting kids from all risks and harmful content.
I apologize that this has nothing to do with programming.

Google Search returns 503 error for complicated search queries

When I try to download a Google Search result page using HttpWebRequest in C#, everything works very well if I use simple search terms, like
http://www.google.com/search?q=stackoverflow
But when I try to make it more complex, for example
http://www.google.com/search?q=inurl%3A%22goethe%22%20filetype%3Apdf
which means
inurl:"goethe" filetype:pdf
I receive a 503 error because Google thinks I'm a bot. Is there any workaround?
Edit: UserAgent is set to "Mozilla/5.0".
well.. if your search is done programmatically, then Google just so happens to be right.. you ARE a bot :-)
cheers!
I don't believe it has much to do with how complex your query happens to be. The only thing that really matters is if they think that you're a bot. If you're submitting queries at a very high rate, then Google will think you're a bot so there are several possible solutions:
Reduce the rate at which you're sending queries.
Use proxies to make multiple queries.
Additionally, it's important to note that if you make web requests without saving cookies, then that might be another "signal" for Google to think that you're a bot. You should also be very careful not to get the proxies blocked by Google, because you're scraping the big G. It's hard to find free proxies and if you abuse them, then they'll get shut down so be a good citizen!
Good luck!
Try Google Custom Search APIs and Tools. This will allow you to retrieve search results without fear of being denied access (up to a limit).
Alternatively, mimic all nuances of a typical search query. For example, in my browser, searching for inurl:"goethe" filetype:pdf results in this URL being requested.
Then there are cookies and other HTTP headers. Make it look a lot more like a browser is requesting it.
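As a sketch of what that could look like with HttpWebRequest: set a realistic User-Agent, typical Accept headers, and a shared CookieContainer so cookies persist across requests (the exact header values below are illustrative, not anything Google requires):

// A sketch of making the request look more like a browser: realistic
// User-Agent, typical Accept headers, and a CookieContainer shared across
// requests. The header values are illustrative.
using System;
using System.IO;
using System.Net;

class BrowserLikeRequest
{
    private static readonly CookieContainer Cookies = new CookieContainer();

    public static string Download(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
                            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Headers["Accept-Language"] = "en-US,en;q=0.5";
        request.CookieContainer = Cookies;   // persist cookies between requests

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}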

Obtaining Google search results' site position

I want to code an algorithm or parser which gets a site's position in Google search results. The issue is that every time Google's page layout changes, I will have to correct/change the algorithm. What do you think, guys: will it really change that often? Are there any techniques/advice/tricks for determining a site's position in Google?
How can I make a robust position-detection algorithm?
I want to use C#, .NET 2.0 and HtmlAgilityPack for this. Any advice or proposals will be very much appreciated. Thanks in advance, guys!
POST UPDATE
I know that Google will show a captcha to prevent automated queries. I have a special service for that which will recognize any captcha. Could you guys tell me about your experience with actually scraping the results?
Google offers a plethora of APIs to access their services. For searching, there's the Custom Search API.
I asked about this a year ago and got some good answers. Definitely the Agility Pack is the way to go.
In the end we did code up a rough scraper which did the job and ran without any problems. We were hitting Google relatively lightly (about 25 queries per day). We took the precaution of randomising 1) the order and 2) time of day and 3) time paused between queries. I don't know if any of that helped, but we were never hit by a captcha.
We don't bother with it much now.
Its main weaknesses were/are:
we only bothered to check the first page (we perhaps could have coded an enhanced version which looked at the first X pages, but maybe that would be a higher risk - in terms of being detected by Google).
its results were unreliable and jumped around. You could be 8th every day for weeks, except for a single random day when you were 3rd. Perhaps... the whole idea of carefully taking a daily or weekly reading and logging our ranking was too flawed to begin with.
To answer your question about Google breaking your code: Google didn't make a fundamentally breaking change in all the months we ran it, but they did change something which broke the "snapshot" we were saving of the results (maybe a CSS change?), which did nothing to improve the credibility of the results.
We went through this process a few months back. We tried the APIs mentioned above and the results were not even close to the actual search results. (Google this; there is lots of information out there.)
Scraping the page is a problem: Google seems to change the markup every few months and also has checks in place to work out whether you are human or not.
We eventually gave up and went with one of the commercially available (and frequently updated) bits of kit.
I've coded a couple of projects on this, parsing organic results and AdWords results. HTML Agility Pack is definitely the way to go.
I was running a query every 3 minutes, I think, and that never triggered a CAPTCHA.
As for the formatting changing, I was picking up on the ID of the UL (talking from memory here), and that only changed once in around a year (organic and AdWords).
As mentioned above, though, Google doesn't really like you doing this! :-)
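To make the parsing idea concrete, a rough HtmlAgilityPack sketch for finding a domain's position might look like this; note that the "rso" container id and the a/h3 structure are assumptions about Google's markup at one point in time and will need updating whenever the layout changes:

// A rough HtmlAgilityPack sketch for finding a domain's 1-based position on a
// downloaded results page. The "rso" id and the a/h3 structure are assumptions
// about Google's markup and will need adjusting when the layout changes.
using System;
using HtmlAgilityPack;

class SerpPositionParser
{
    public static int FindPosition(string serpHtml, string domain)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(serpHtml);

        // Anchors that wrap an h3 inside the organic results container.
        var links = doc.DocumentNode.SelectNodes("//div[@id='rso']//a[h3]");
        if (links == null)
            return -1;

        for (int i = 0; i < links.Count; i++)
        {
            string href = links[i].GetAttributeValue("href", "");
            if (href.IndexOf(domain, StringComparison.OrdinalIgnoreCase) >= 0)
                return i + 1;   // positions are 1-based
        }
        return -1;              // not on this page
    }
}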
I'm pretty sure that you will not easily get access to Google search results. They are constantly trying to stop people from doing it.
If you are thinking about screen scraping, be aware that they will start displaying captchas and you won't be able to get anything.

What is the professional / preferred way to handle input validation?

I am new to C# and SQL, but over the last few years of learning both in college, a question has really begun to burn inside me. Here it is:
It seems to me that there are really two very generic ways to handle input validation (i.e. checking for required fields, and that data is in the correct ranges, etc.).
The first, and the way traditionally shown, is: once you develop your UI and have connected it to a database back end in some manner, you check for correct input in the user interface, such as blank text boxes, number ranges, or ensuring a radio button or check box is selected, etc.
The second, and the way shown in database development, is to set check constraints on fields, such as no nulls allowed, unique values, and even ranges and required fields.
My dilemma is this: in modern languages like C# you can do general exception handling, and major-league fault tolerance is built into most databases like SQL Server with regard to handling data changes (committing all or none). Details like this, and to this level, would be hard to program in anything but the simplest of programs.
So my question is: why not build all the requirements directly into the table at the database back end? Take advantage of the aforementioned fault tolerance, forget about programming if statements to ensure correct data is input, and instead just use a generic catch-all exception handler if the data is not committed.
Perhaps that is how it is done; if so, I would really like to know for sure. If not, why? My preference is to avoid writing code whenever possible: less code, less debugging, and fewer problems when it comes to updating. So I would tend to go with the approach of letting the DB back end do the work. Is this generally the correct thing to do?
I know that general exception handling is considered "expensive" in terms of resources. But surely, once you get past 5 or 10 if statements to handle different fields and their constraints, it must be more efficient code-wise to just use a general exception handler. It certainly seems easier to understand overall. (At least the way I do it.)
Thanks for your help with this.
OK, here is why you need it in both places.
First, the integrity of the data should be paramount, and data can be changed directly in database tables (deliberately, through a script to, say, update a million prices; by accident; or even by disgruntled or criminal employees trying to disrupt the database or steal from the company). Therefore it is reckless to avoid using constraints directly in the database, and doing so leads to bad data.
Now, at the user interface level, you want to prevent the user from wasting his time submitting bad data, and you want to prevent the servers and networks from wasting their time trying to process it, so you write checks at that level. Plus, you don't want the data in an inconsistent state if you need to insert into several tables and aren't using a transaction (which you should be using, but I suspect that happens less often than it should). Plus, users hate it when you try the insert and it fails and tells them that X is wrong, then they fix X and now Y is wrong, but Y was wrong before too; the process just didn't get as far as Y the first time.
You do both.
Create constraints at the DB level, and check for those constraints at the client level as well.
The validation on the DB makes sure that no invalid data gets in your DB, no matter how the data is inputted.
The validation at the client side improves the user-experience.
You generally can't build all the checking logic into the database. Also, not validating user input sufficiently is a good way to open yourself up to attack.
One way to write less guard code in every method is 'Code Contracts', a product of Microsoft Research.
All input should be validated both client and server side. Always.
Also, with one giant catch it would be hard to tell which field was in error, so you would end up writing a lot of "which field exploded" code at the other end.
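To illustrate the trade-off the answers above describe, here is a small C# sketch contrasting per-field validation with the generic catch-all approach; the Customer type, table name, and column names are hypothetical:

// A small sketch contrasting the two approaches. With per-field checks you can
// tell the user exactly which field is wrong; with a single generic catch
// around the INSERT, all you get back is a SqlException to interpret.
// The Customer type, table name, and column names are hypothetical.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class Customer
{
    public string Name { get; set; }
    public int Age { get; set; }
}

class CustomerValidator
{
    // Client/server-side validation: one clear message per offending field.
    public static IList<string> Validate(Customer c)
    {
        var errors = new List<string>();
        if (string.IsNullOrWhiteSpace(c.Name))
            errors.Add("Name is required.");
        if (c.Age < 0 || c.Age > 130)
            errors.Add("Age must be between 0 and 130.");
        return errors;
    }
}

class CustomerRepository
{
    // The "generic catch" alternative: the database's NOT NULL / CHECK
    // constraints still reject bad rows, but you only learn that something
    // failed and would have to parse the exception message to find the field.
    public static bool TrySave(Customer c, string connectionString, out string error)
    {
        try
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "INSERT INTO Customers (Name, Age) VALUES (@name, @age)", conn))
            {
                cmd.Parameters.AddWithValue("@name", c.Name);
                cmd.Parameters.AddWithValue("@age", c.Age);
                conn.Open();
                cmd.ExecuteNonQuery();
                error = null;
                return true;
            }
        }
        catch (SqlException ex)
        {
            error = ex.Message;   // which column was it? you have to dig
            return false;
        }
    }
}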
While I generally advocate putting as much in the database as possible (which means you can have as high a degree of confidence about the "raw" data as possible), that isn't always feasible, even with the powerful constraints and triggers available in SQL.
In addition, there are high-level "integrity" rules which may change over time, and it is not realistic to always have temporally dynamic conditions in constraints, e.g. all HR records since 2007 must have a non-NULL birthdate, while prior ones are allowed to remain NULL, but no row can ever be set back to NULL.
My point is you can almost never put it all in the database.
Put the things in that you can, and put others at higher levels in the system. The database is a very important part of any system, but it isn't the only part. As long as its design helps it protect its perimeter and be able to provide reliable service and guarantee what it says it will guarantee so that other parts of the system can rely on their assumptions, then that's about the most you can ask for.
In addition to all the answers given here (for instance, that UI-level validation dramatically improves the UX and can completely change the "image" of your app): validation in the DB ensures correct data gets into the DB, but on the client it has to be done to ensure the client's data is entered correctly in the first place.
Consider the example of a standalone enterprise app. A client works from home and fills in 20 invoices late at night on his notebook in Mongolia. The day after, he comes back and syncs them with his office SAP server. If the error is only discovered during the data sync, you can imagine how awful that situation is.
Just an example; there could be plenty of others, I'm sure.
Good luck.
It's 2 years later and I have a decent amount of experience now. I am not going to accept my own answer as the right one, as many here have done a great job and I am very happy with their answers. But I want to add another important consideration that, looking back over my experience, has not been highlighted here. I also use Stack Overflow for reference as I progress, and I always find myself looking back over my questions and answers, which is another reason I wanted to add this. Like a note to my future self.
While working at that company, I was asked to build an app that would do job abc. Along with this I also had to build part of the database. As I was finishing with the company, I learned that they were writing another app which would use my database. Effectively my point is that, as many have pointed out, data is paramount, and you don't know how it is going to be accessed after you're gone.
I have also learned that there are 3 places that data needs to be verified:
on the actual database as explained
in the server-side code-behind, which is not the same as the DB or client-side validation
on the client side
There is another worry: with the advent of new tech like tablets and smartphones, there is yet another place where validation has to be implemented. The same rules, for a 4th time (unless it's a web app).
I later learned that prior to MVC we had CGI forms, which had something to do with handling data over the network (I humbly admit ignorance on the hardware side), but from what was explained to me it seems there may even be a 5th place to do validation (although I am open to being totally wrong about that).
I think the next guru in computer science will make a name for himself if he can find a way to abstract all that verification and validation to one place so that such rules don't have to be altered in a bunch of places.
worst case:
DB
Server side code
Client side code for web apps
What about if:
There may be a native client app (Windows, Linux, or Mac: at least 6 places now)
There may be various phone apps (Android, iPhone, and Windows Phone to name 3: at least 9 now)
There may be some CGI or whatever
This totals 10+ places without much exaggeration and there are other operating systems.
Even for a simple age range this is getting messy, but what if they bring out some new email format, or some other complicated validation, or you have to change a bunch of validation rules? Now you have to modify them across at least 3 or 4 places, which in itself is bad.
The major problem with that is that you are modifying a lot of code and infrastructure that has been invested in, tested, and usually proven to work and delivered to the market...
As the number of client sides grows, modifying well-tested code can't be a good thing. I think this is going to be a major headache in the future. I wonder if there will be a design pattern or best practice to resolve it. If anyone knows of one, please tell me.

Can Omniture SiteCatalyst or any web analytics software track critical page views?

I need to track only human visits to my article pages. I hear that SiteCatalyst is the best of the best for page tracking. Here is what I am trying to do. I need to track every human visit if possible, because this will affect the amount of money I have to pay. I will need to download site statistics for all of my pages with an accurate hit count. Again, I don't want to track spiders/bots. Once I download the site statistics, I will use them to update the hit counts for each of my articles. Then I will pay my writers according to how many hits they receive. Is SiteCatalyst able to do this? If not, who do you think can do something like this?
Luke - quick answer: there is currently no 100% accurate way to get this.
Omniture's SiteCatalyst does provide a very good tool for acquiring visitor information. You can acquire visitor information from any of the other vendors as well, including the free option, Google Analytics.
You may have been led to believe, as I had, that Omniture strips out all bots and spiders by default. Omniture states that most bots and spiders do not load images or execute JavaScript, which is what they rely upon for tracking. I am not sure what the exact percentage is, but not all bots and spiders act this way.
In order to get a more accurate report on the number of "humans", you will need to know the IP address of the visitor and possibly the user agent. You can get the agent and IP in PHP with these two variables: $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REMOTE_ADDR']. You will then need to strip out the IP addresses of known bots/spiders from your reporting. You can do this with lists like this one: http://www.user-agents.org/index.shtml, or manually by looking at the user agent. Beware of relying on the user agent, as a bot can easily spoof it. This will never be 100% accurate because new bots/spiders pop up every day. I suggest looking further into "click fraud".
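For a C# equivalent of that idea, a minimal sketch might check the request's user agent against a list of known crawler signatures before counting a page view; the signature list below is a tiny illustrative subset of the much longer real-world lists, and, as noted, a bot can spoof its user agent:

// A C# equivalent of that idea: check the visitor's user agent against known
// crawler signatures before counting the hit. The signature list is a tiny,
// illustrative subset of real lists (e.g. user-agents.org), and a bot can
// always spoof its user agent, so treat this as a best-effort filter.
using System;

class BotFilter
{
    private static readonly string[] KnownBotSignatures =
    {
        "googlebot", "bingbot", "slurp", "baiduspider", "yandexbot", "crawler", "spider"
    };

    public static bool LooksLikeBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
            return true;   // real browsers virtually always send a user agent

        string ua = userAgent.ToLowerInvariant();
        foreach (string signature in KnownBotSignatures)
        {
            if (ua.Contains(signature))
                return true;
        }
        return false;
    }
}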
Contact me if you want further info.
Omniture also weeds out traffic from known bots/spiders. But yeah... there is an accepted margin of error in the analytics industry because it can never be 100%, due to the nature of the currently available technology.
