I want to write an algorithm or parser that determines a site's position in Google search results. The issue is that every time Google changes its page layout, I will have to correct or change the algorithm. How often do you think the layout really changes? Are there any techniques, advice, or tricks for determining a site's position in Google?
How can I make the position-detection algorithm robust?
I want to use C#, .NET 2.0, and HtmlAgilityPack for this. Any advice or suggestions would be much appreciated. Thanks in advance, guys!
POST UPDATE
I know that Google will show a CAPTCHA to prevent automated queries. I have a special service for that which can solve any CAPTCHA. Could you tell me about your experience with actually scraping the results?
Google offers a plethora of APIs to access its services. For searching, there's the Custom Search API.
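If you go the API route, a minimal sketch of calling the Custom Search JSON API with WebClient might look like the following; the API key, search engine ID (cx), and query are placeholders you would supply yourself, and each result appears under items[].link in the returned JSON:

    using System;
    using System.Net;

    class CustomSearchExample
    {
        static void Main()
        {
            // Placeholders: supply your own API key and custom search engine ID (cx).
            string apiKey = "YOUR_API_KEY";
            string cx = "YOUR_SEARCH_ENGINE_ID";
            string query = "your search terms";

            string url = "https://www.googleapis.com/customsearch/v1"
                       + "?key=" + apiKey
                       + "&cx=" + cx
                       + "&q=" + Uri.EscapeDataString(query);

            using (WebClient client = new WebClient())
            {
                // The response is JSON; each result sits under items[].link, so your
                // site's position is its index in that array (plus one, plus the
                // "start" offset if you page through results).
                string json = client.DownloadString(url);
                Console.WriteLine(json);
            }
        }
    }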
I asked about this a year ago and got some good answers. Definitely the Agility Pack is the way to go.
In the end we did code up a rough scraper which did the job and ran without any problems. We were hitting Google relatively lightly (about 25 queries per day). We took the precaution of randomising 1) the order of the queries, 2) the time of day, and 3) the pause between queries. I don't know if any of that helped, but we were never hit by a CAPTCHA.
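For what it's worth, a rough sketch of that randomisation in C#; the query list and the delay ranges are made up for illustration, and RunQuery stands in for whatever fetch-and-parse you end up writing:

    using System;
    using System.Collections.Generic;
    using System.Threading;

    class RandomisedQueryRunner
    {
        static void Main()
        {
            // Hypothetical list of queries we want to check once per day.
            List<string> queries = new List<string> { "query one", "query two", "query three" };
            Random rng = new Random();

            // 1) Randomise the order of the queries (Fisher-Yates shuffle).
            for (int i = queries.Count - 1; i > 0; i--)
            {
                int j = rng.Next(i + 1);
                string tmp = queries[i];
                queries[i] = queries[j];
                queries[j] = tmp;
            }

            // 2) Start the daily run at a random time (here: wait 0-4 hours).
            Thread.Sleep(TimeSpan.FromMinutes(rng.Next(0, 240)));

            foreach (string query in queries)
            {
                RunQuery(query); // fetch and parse the results page (not shown here)

                // 3) Pause a random 2-10 minutes between queries.
                Thread.Sleep(TimeSpan.FromMinutes(rng.Next(2, 11)));
            }
        }

        static void RunQuery(string query)
        {
            Console.WriteLine("Checking ranking for: " + query);
        }
    }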
We don't bother with it much now.
Its main weaknesses were/are:
we only bothered to check the first page (we perhaps could have coded an enhanced version which looked at the first X pages, but maybe that would be a higher risk - in terms of being detected by Google).
its results were unreliable and jumped around. You could be 8th every day for weeks, except for a single random day when you were 3rd. Perhaps the whole idea of carefully taking a daily or weekly reading and logging our ranking was too flawed to begin with.
To answer your question about Google breaking your code: Google didn't make a fundamentally breaking change in all the months we ran it, but they did change something that broke the "snapshot" we were saving of the result (maybe a CSS change?), which did nothing to improve the credibility of the results.
We went through this process a few months back. We tried the APIs mentioned above, and the results were not even close to the actual search results. (Google it; there is lots of information on this.)
Scraping the page is a problem: Google seems to change the markup every few months, and they also have checks in place to work out whether or not you are human.
We eventually gave up and went with one of the commercially available (and frequently updated) bits of kit.
I've coded a couple of projects on this, parsing organic results and AdWords results. HTML Agility Pack is definitely the way to go.
I was running a query every 3 minutes, I think, and that never triggered a CAPTCHA.
As regards the formatting changing, I was picking up on the ID of the UL (talking from memory here), and that only changed once in around a year (for both organic and AdWords).
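As an illustration of anchoring on a container element, here is a sketch with HtmlAgilityPack; the container id ("search") and the link-selection XPath are assumptions that will need adjusting to whatever the markup looks like when you run it, and the naive link counting would also need refining so each result is counted exactly once:

    using System;
    using HtmlAgilityPack;

    class SerpPositionParser
    {
        // Returns the 1-based position of the first result whose URL contains `domain`,
        // or -1 if it is not on the page. `html` is a saved copy of a results page.
        static int FindPosition(string html, string domain)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Anchor on the id of the results container (assumed to be "search" here);
            // when the markup changes, this XPath is usually the only thing to update.
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//div[@id='search']//a[@href]");
            if (links == null) return -1;

            int position = 0;
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", "");
                if (!href.StartsWith("http")) continue; // skip internal/navigation links
                position++;
                if (href.Contains(domain)) return position;
            }
            return -1;
        }

        static void Main()
        {
            string html = System.IO.File.ReadAllText("serp.html"); // previously fetched page
            Console.WriteLine(FindPosition(html, "example.com"));
        }
    }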
As mentioned above though, Google don't really like you doing this! :-)
I'm pretty sure that you will not easily get access to Google search results. They are constantly trying to stop people from doing it.
If you are thinking about screen scraping, be aware that they will start displaying a CAPTCHA and you won't be able to get anything.
Related
I guess I'm going to take a lot of heat for this question, and maybe even some downvotes, but I am really lost here.
I know what SCORM stands for and what it is good for. I have seen the paid "engines" like scorm.com, but they start from $20K...
I work on LMS software; we have videos, courses and so on. My manager said, "We have a provider with a lot of courses in SCORM format; build a tool that imports them into our database."
Oh god, help me. Is there an easy way to do that, or am I facing a year of hard, unsatisfying work? (I don't know whether I can use the non-free ones; it depends on the price.)
ASP.NET, C# platform.
Short answer: buy the Rustici guys' engine. If "a lot of courses" means, say, 500, you're looking at $40 per course. You'll probably sell them for more. Or think of it as paying for 3 months of work instead of 12-20 months of your own work and suffering. Like taking a train instead of walking.
Long answer. I think a lot of people here assumed you want to import SCORM courses and have them work flawlessly. I'm not sure that's what your boss wants you to do. Maybe if you simply import the courses (upload and extract the zip files), determine the launcher file (described in the easily parseable imsmanifest.xml) and launch it in a popup/iframe, the course will just work.
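As a rough illustration, pulling the launcher out of the manifest can be quite small. This sketch ignores the SCORM namespaces and just resolves the resource referenced by the first item; real manifests (especially SCORM 2004 ones with sequencing) can be more involved:

    using System;
    using System.Xml;

    class ManifestReader
    {
        // Returns the href of the launch file declared in imsmanifest.xml,
        // or null if it cannot be determined.
        static string GetLauncher(string manifestPath)
        {
            XmlDocument doc = new XmlDocument();
            doc.Load(manifestPath);

            // Ignore namespaces for simplicity: find the first <item> with an identifierref,
            // then the <resource> it points to, and return that resource's href.
            XmlNodeList items = doc.SelectNodes("//*[local-name()='item' and @identifierref]");
            XmlNodeList resources = doc.SelectNodes("//*[local-name()='resource']");

            if (items != null && items.Count > 0 && resources != null)
            {
                string wanted = items[0].Attributes["identifierref"].Value;
                foreach (XmlNode res in resources)
                {
                    XmlAttribute id = res.Attributes["identifier"];
                    XmlAttribute href = res.Attributes["href"];
                    if (id != null && href != null && id.Value == wanted)
                        return href.Value;
                }
            }
            return null;
        }

        static void Main()
        {
            Console.WriteLine(GetLauncher("imsmanifest.xml"));
        }
    }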
Sure, you will not be able to receive scores and completions. Sure, if the course relies on some data from the LMS, like the student name, you won't be able to supply it. Sure, if the course cannot detect a SCORM API, it will throw an error at you. But you might be able to code a very basic fake API that does nothing, or one with some basic communication functions wired to your own platform, and you'll be able to launch all those courses. Maybe you don't need 90% of what SCORM offers/requires.
Someone mentioned climbing Mt. Everest. Well, think of it as photoshopping your happy sunburnt face onto a picture of the summit. Not quite the same result, but the effort invested is a million times smaller as well.
I wrote an LMS using ASP.NET. It took three of us over a year to write the SCORM engine and player. It is basic, and it was not easy. Tell your boss he just asked you to climb Everest without cold-weather gear :)
$20K vs. one or two developers' salaries for two years, and that's if they can read fast or have prior experience with SCORM 1.2 and/or 2004, plus all the pain and suffering in between. Rustici has all this figured out, and it also runs pretty quickly compared to other canned systems.
Beyond the API, the XML parsing, validation, access, sequencing rules, arbitrary limits, and error codes and messages, you're also deciding on an API implementation. There are LMS systems out there that literally talk to the backend on every GetValue/SetValue call, which seriously lags out the user experience. If I spent that much time building this, only to find out I did it the slowest possible way, I think I'd be curled up in a corner somewhere, rocking back and forth.
I would say, though, that this space is filled with a ton of legacy code stretching back to the early 2000s that is drastically due for an overhaul. Any code you manage to beg, borrow, or steal is going to be some old-school stuff. None of it is tied into a managed-code format or anything you could unit test without building all of that from scratch too.
Nope, no easy solution here, especially for C#. The commercial solutions cost that much because their developers went through the pain of "a year of hard, unsatisfying work". The open-source solutions are typically PHP-based. A few use Java.
I am working on my mapper, and I need to get the full map of newegg.com.
I could try to scrape NE directly (which kind of violates NE's policies), but they have many products that are not available via direct NE search, only via a google.com search; and I need those links too.
Here is the search string that returns 16 million results:
https://www.google.com/search?as_q=&as_epq=.com%2FProduct%2FProduct.aspx%3FItem%3D&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=newegg.com&as_occt=url&safe=off&tbs=&as_filetype=&as_rights=
I want my scraper to go over all results and log hyperlinks to all these results.
I can scrape all the links from Google search results, but Google has a limit of 100 pages per query (1,000 results), and again, Google is not happy with this approach. :)
I am new to this; could you advise me or point me in the right direction? Are there any tools or methodologies that could help me achieve my goals?
Google takes a lot of steps to prevent you from crawling their pages and I'm not talking about merely asking you to abide by their robots.txt. I don't agree with their ethics, nor their T&C, not even the "simplified" version that they pushed out (but that's a separate issue).
If you want to be seen, then you have to let google crawl your page; however, if you want to crawl Google then you have to jump through some major hoops! Namely, you have to get a bunch of proxies so you can get past the rate limiting and the 302s + captcha pages that they post up any time they get suspicious about your "activity."
Despite being thoroughly aggravated by Google's T&C, I would NOT recommend that you violate it! However, if you absolutely need the data, you can get a big list of proxies, load them into a queue, and pull a proxy from the queue each time you want to fetch a page. If the proxy works, put it back in the queue; otherwise, discard it. You could even keep a failure counter for each proxy and discard it once it exceeds some number of failures.
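A sketch of that queue idea in C#; the proxy addresses and the failure threshold are placeholders, and the fetch itself is reduced to a single WebClient call:

    using System;
    using System.Collections.Generic;
    using System.Net;

    class ProxyRotator
    {
        const int MaxFailures = 3; // arbitrary threshold before a proxy is discarded

        static Queue<WebProxy> proxies = new Queue<WebProxy>();
        static Dictionary<WebProxy, int> failures = new Dictionary<WebProxy, int>();

        static string Fetch(string url)
        {
            while (proxies.Count > 0)
            {
                WebProxy proxy = proxies.Dequeue();
                try
                {
                    using (WebClient client = new WebClient())
                    {
                        client.Proxy = proxy;
                        string html = client.DownloadString(url);
                        proxies.Enqueue(proxy); // it worked, so put it back in the queue
                        return html;
                    }
                }
                catch (WebException)
                {
                    // Count the failure and only re-queue the proxy while it is under the limit.
                    int count;
                    failures.TryGetValue(proxy, out count);
                    failures[proxy] = ++count;
                    if (count < MaxFailures) proxies.Enqueue(proxy);
                }
            }
            throw new InvalidOperationException("No working proxies left.");
        }

        static void Main()
        {
            proxies.Enqueue(new WebProxy("203.0.113.10", 8080)); // placeholder addresses
            proxies.Enqueue(new WebProxy("203.0.113.11", 8080));
            Console.WriteLine(Fetch("https://www.google.com/search?q=site:newegg.com").Length);
        }
    }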
I've not tried it, but you can use Google's Custom Search API. Of course, it starts to cost money after 100 searches a day. I guess they must be running a business ;p
It might be a bit late, but I think it is worth mentioning that you can scrape Google professionally and reliably without causing problems.
Actually, scraping Google poses no threat that I know of.
It is challenging if you are inexperienced, but I am not aware of a single case of legal consequences, and I follow this topic closely.
Maybe one of the largest cases of scraping happened some years ago, when Microsoft scraped Google to power Bing. Google was able to prove it by planting fake results which do not exist in the real world, and Bing suddenly picked them up.
Google named and shamed them; that's all that happened, as far as I remember.
Using the API is rarely a real option: it costs a lot of money even for a small number of results, and the free quota is rather small (40 lookups per hour before you are blocked).
The other downside is that the API does not mirror the real search results; in your case that may be less of a problem, but in most cases people want the real ranking positions.
Now, if you do not accept Google's TOS, or choose to ignore it (they did not care about your TOS when they scraped you in their startup days), you can go another route.
Mimic a real user and get the data directly from the SERPs.
The key here is to send around 10 requests per hour (this can be increased to 20) from each IP address (yes, you use more than one IP). That rate has proven to cause no problems with Google over the past few years.
Use caching, databases, and IP rotation management to avoid hitting Google more often than required.
The IP addresses need to be clean, unshared, and, if possible, without an abusive history.
The proxy list suggested earlier would complicate things a lot, as you receive unstable, unreliable IPs with questionable histories of abuse and sharing.
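To make the "roughly 10 requests per hour per IP" rule concrete, a small scheduler can track when each address was last used and only hand out an IP that still has budget left. A sketch, with the limit and the addresses as placeholders:

    using System;
    using System.Collections.Generic;

    class IpThrottle
    {
        const int RequestsPerHour = 10; // conservative budget per IP, per the advice above

        // Timestamps of recent requests, per IP address.
        static Dictionary<string, List<DateTime>> history = new Dictionary<string, List<DateTime>>();
        static List<string> addresses = new List<string> { "198.51.100.1", "198.51.100.2" }; // placeholders

        // Returns an IP that has made fewer than RequestsPerHour requests in the last hour,
        // or null if every address is exhausted (in which case the caller should wait).
        static string NextAvailableIp()
        {
            DateTime cutoff = DateTime.UtcNow.AddHours(-1);
            foreach (string ip in addresses)
            {
                List<DateTime> used;
                if (!history.TryGetValue(ip, out used))
                {
                    used = new List<DateTime>();
                    history[ip] = used;
                }
                used.RemoveAll(delegate(DateTime t) { return t < cutoff; }); // drop entries older than an hour
                if (used.Count < RequestsPerHour)
                {
                    used.Add(DateTime.UtcNow);
                    return ip;
                }
            }
            return null;
        }

        static void Main()
        {
            string ip = NextAvailableIp();
            Console.WriteLine(ip ?? "All IPs are over budget - wait before the next request.");
        }
    }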
There is an open-source PHP project at http://scraping.compunect.com which contains all the features you need to get started. I used it for my own work, which has now been running for some years without trouble.
It is a finished project, mainly built to be used as a customizable base for your own project, but it runs standalone too.
Also, PHP is not a bad choice: I was originally sceptical, but I ran PHP (5) as a background process for two years without a single interruption.
The performance is easily good enough for such a project, so I would give it a shot.
Otherwise, PHP code reads much like C or Java: you can see how things are done and repeat them in your own project.
I've been handed an idiotic task by my boss.
The task: given a web application that returns a paginated table, write software that "reads and parses it", since there is no web service that provides the raw data. It's essentially a "spider" or "crawler" application to steal data that was never meant to be accessed programmatically.
Now the thing: the application is built with the standard ASPX Web Forms engine, so there are no clean URLs or plain POSTs, just the dreadful postback mechanism crowded with JavaScript and inaccessible HTML. The pagination links call the infamous javascript:__doPostBack(param, param), so I don't think it would even work if I tried to simulate clicks on those links.
There are also inputs to filter the results, and they are part of the postback mechanism too, so I can't simulate a regular POST to get the results.
I was forced to do something like this in the past, but it was on a standard-like website with parameters in the querystring like pagesize and pagenumber so I was able to sort it out.
Does anyone have even a vague idea whether this is doable, or should I tell my boss to quit asking me to do this kind of stuff?
EDIT: Maybe I was a bit unclear about what I have to achieve. I have to parse, extract, and convert the data into another format, let's say Excel, not just read it. And this must be automated, without user input. I don't think Selenium would cut it.
EDIT: I just blogged about this situation. If anyone is interested, you can check my post at http://matteomosca.com/archive/2010/09/14/unethical-programming.aspx and comment on it.
Stop disregarding the tools suggested.
No, this isn't a parser you have to write yourself: WatiN and Selenium will both work in that scenario.
P.S. Had you mentioned anything about needing to extract the data from Flash/Flex/Silverlight or similar, this would be a different answer.
By the way, the reason to proceed or not is definitely not technical, but ethical and maybe even legal. See my comment on the question for my opinion on this.
WatiN will help you navigate the site from the perspective of the UI and grab the HTML for you, and you can find information on .NET DOM parsers here.
I already commented, but I think this is actually an answer.
You need a tool which can click client-side links and wait while the page reloads.
Tools like Selenium can do that.
Also (from the comments): WatiN, Watir.
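For example, here is a sketch using the Selenium WebDriver C# bindings; the URL, grid id, and "Next" link text are made-up placeholders. The point is that clicking the pager link lets the browser run __doPostBack itself, so you never have to fake the postback:

    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;

    class GridScraper
    {
        static void Main()
        {
            using (IWebDriver driver = new ChromeDriver())
            {
                driver.Navigate().GoToUrl("http://example.com/Report.aspx"); // placeholder URL

                while (true)
                {
                    // Read the rows of the current page of the grid (the id is a placeholder).
                    foreach (IWebElement row in driver.FindElements(By.CssSelector("#GridView1 tr")))
                    {
                        Console.WriteLine(row.Text); // here you would write to CSV/Excel instead
                    }

                    // Click the pager link; the browser executes __doPostBack for us.
                    var next = driver.FindElements(By.LinkText("Next"));
                    if (next.Count == 0) break; // no more pages
                    next[0].Click();
                }
            }
        }
    }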
#Insane, the CDC's website has this exact problem, and the data is public (and we taxpayers have paid for it), I'm trying to get the survey and question data from http://wwwn.cdc.gov/qbank/Survey.aspx and it's absurdly difficult. Not illegal or unethical, just a terrible implementation that appears to be intentionally making it difficult to get the data (also inaccessible to search engines).
I think Selenium is going to work for us, thanks for the suggestion.
Hello people from StackOverflow!
I come to you with yet another question. :)
As stated in some of my previous questions, I'm interested in creating a website that handles jobs and company openings for people to browse. I intend to have a way for people to upload CVs and apply for positions, and to have companies post jobs as well.
Since I've never done a project of this scope before, I fear that I may be neglecting certain things that are a must for a web-targeted application.
I realize that is a very broad question, perhaps too broad to even answer. However, I'd really like someone to provide just a little input on this. :)
What things do I need to have in mind when I create a website of this type?
I'm going to be using ASP.Net and C#.
Edit: Just to clarify, the website is going to be local to a country in Eastern Europe.
Taking on careers.stackoverflow then? :)
One of the biggest things isn't even technical: how are you going to pull in enough users to make the site take off?
It's a bit of a chicken-and-egg situation: if you don't have recruiters on the site, no one's CV will get viewed. If you don't have CVs listed, recruiters won't use the site. So first and foremost, you need to be thinking about how you will build up a community.
The site must have a good, easy-to-use user experience. Make it easy for everyone to achieve what they want.
What makes your site stand out from the others? Why should people use yours instead of another one?
You could start with the free "Job Site Starter Kit":
http://www.asp.net/downloads/starter-kits/job/
* Enables job seekers to post resumes
* Enables job seekers to search for job postings
* Enables employers to enter profile of their company
* Enables employers to post one or more job postings
First you need a community. It doesn't really matter which one, but it would help if you were also a member of this community. Let's take Underwater Basket Weavers. Then find a problem that this community has or something this community needs to share. Almost invariably it involves information exchange but in some cases it may actually be service based. Then focus your efforts on solving or supplementing that issue. For our Underwater Basket Weavers, we may have a need to share techniques on how to weave specific materials, where to get materials. How could they share this information and how could you make it interesting to them?
Know your audience. Learn their issues. Apply yourself to filling that void.
I am having a problem with the output of one of my team members. He always seems to be 'busy', yet I am unable to see exactly what code he has written; he seems to deliver very little, and what he does deliver seems to take a long time. I'd like to investigate further using TFS, and I was wondering whether there is any functionality in TFS that shows what has been written by an individual, or similar?
Just to clarify: I am NOT spying; I am trying to resolve a situation. This is only a starting point. I understand that quantity of code does not equate to being the best programmer.
Thanks for any answers.
Your best programmer may in fact write less code than your worst programmer; really good programmers often write less code. Be careful about using this information to evaluate performance. Since you are using TFS, I assume you are also using work item tracking. That is really a better way to evaluate performance than lines of code. See which check-ins cause the most problems, which fix the most defects, and how many rounds it takes for something to be truly fixed.
For me the simplest thing is to set up email alerts for check-ins. You get the check-in comment, some work item info (assuming they are associating/resolving on check-in), and a list of changed files, as they happen. It lets you see what part of the code that dev is in, and after a while you get a sense of when "it's quiet. Too quiet" because someone isn't checking in. It doesn't take the place of forensics on what he did all month, but it keeps me feeling connected. It also gives me intuitive signals like "he's in the reports, so I'll be able to show those to the user earlier in the cycle", or "jeez, he's doing all the no-thinking things like typos in error messages and not tackling his really hard stuff", or even "he's doing his pri 2 work while he has a large pile of pri 1". All of these enable a 30-second hallway conversation to deliver a course correction as close in time to the problem as possible.
See the following blog post I put together a while ago:
Getting Started with the TFS Data Warehouse
This one talks you through getting code churn for each area of your codebase, but it would be easy to add team members into that as well to get a breakdown by team member who did the check-in.
But I agree with the caveat in your question: this is not a good way to check on your colleague's productivity. Instead, I would talk with them to raise your concerns.
I am away from TFS right now, but you can view a list of check-ins by user in Team Explorer, and for each of these you can see the files which have been changed and look at the diffs.
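If you want the same list programmatically, here is a sketch using the TFS client object model; it assumes the TFS 2010-era VersionControlServer.QueryHistory overload, and the collection URL, server path, and user name are placeholders:

    using System;
    using System.Collections;
    using Microsoft.TeamFoundation.Client;
    using Microsoft.TeamFoundation.VersionControl.Client;

    class CheckinsByUser
    {
        static void Main()
        {
            // Placeholders: your collection URL, source path, and the team member's account name.
            TfsTeamProjectCollection tfs =
                new TfsTeamProjectCollection(new Uri("http://tfsserver:8080/tfs/DefaultCollection"));
            VersionControlServer vcs = tfs.GetService<VersionControlServer>();

            IEnumerable history = vcs.QueryHistory(
                "$/MyProject",          // path to query
                VersionSpec.Latest,     // as of the latest version
                0,                      // deletion id
                RecursionType.Full,     // include all sub-folders
                "DOMAIN\\jsmith",       // only this user's check-ins
                null, null,             // no from/to version restriction
                100,                    // last 100 changesets
                true,                   // include the individual file changes
                false);                 // slotMode

            foreach (Changeset cs in history)
            {
                Console.WriteLine("{0} {1} {2}", cs.ChangesetId, cs.CreationDate, cs.Comment);
                foreach (Change change in cs.Changes)
                    Console.WriteLine("    " + change.Item.ServerItem);
            }
        }
    }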
You can get this from the TFS cube, if you have it set up. There are a large number of dimensions within Code Churn. Some of this is also available in the TfsWarehouse relational database.
If you do have the cube set up, just point Excel at it and have some fun playing around. Keep in mind, though, that the numbers can point you in the wrong direction. Use discretion.