Retrieve old searches from Google Web History - C#

I want to retrieve old Google searches that I made a few months or years back and that are present in Google Web History. How can I programmatically retrieve them all?
https://www.google.com/history/?output=rss only provides recent Google searches, not all of them.
Also, this question: How can I retrieve my Google search history? doesn't provide an answer to my question!

You can pass month, day and year as parameters to obtain the history of a specific day.
E.g. https://www.google.com/history/lookup?month=12&day=1&yr=2010&output=rss for Dec 1, 2010.
There is no way to obtain the history for a full month or year, let alone the entire history. But this information about the parameters should at least enable you to obtain the entire history in a loop that goes one day further back in time on every iteration, as sketched below. Be careful that you don't leech too much in too short a time.
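A minimal sketch of such a loop, assuming the per-day RSS URL above keeps working; authentication against the Google account (cookies or credentials) is not shown here:

using System;
using System.Net;

// Walk backwards one day at a time from a start date; stop after maxDays.
static void FetchHistory(DateTime start, int maxDays)
{
    using (var client = new WebClient())
    {
        for (int i = 0; i < maxDays; i++)
        {
            DateTime day = start.AddDays(-i);
            string url = "https://www.google.com/history/lookup?month=" + day.Month +
                         "&day=" + day.Day + "&yr=" + day.Year + "&output=rss";
            string rss = client.DownloadString(url);   // parse/store the feed here
            System.Threading.Thread.Sleep(2000);       // be polite: don't leech too fast
        }
    }
}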

You really need to parse the HTML page by page and then fetch your data, because I don't think there is any alternative.

I think this will be very difficult.
I know this doesn't answer your question completely, but at least the web pages may be preserved. There are organizations and tools that allow you to recreate web pages from past dates - see for example http://www.mementoweb.org/.
UPDATE: I have just learnt that Memento has won a digital preservation award (http://www.dpconline.org/newsroom).

I know you're not looking to go back through every page, but you don't really need to parse the whole page, just look for the HTML that always precedes an entry. From me just starting up Google Web History and doing some simple searches: if you look through a history page, each string that you've searched for follows: <td style="padding:3px 0"><table id=bkmk_view_ class=noborder ><tr><td><table class="elem noborder"><tr><td class="grey" nowrap>Searched for </td><td nowrap><a title="http://www.google.com/search?q=
and is followed by & (ampersand). This sequence of preceding HTML is unique on the page, occurring only where historical search terms are listed.
If you use two terms, you get a + in between the terms. There are other conventions for different search modes; I didn't go through them all.
It looks like if you use BalusC's method to pass parameters, then you could retrieve the HTML, search the document for the string I mentioned (be sure to escape \" and other special characters), then copy the following characters until you reach a & character (sketched below). Then all you need to do is parse your search term, not the whole page. Go through the source code until you reach the end, then go to the next iteration of the loop.
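A rough sketch of that extraction, assuming html holds one downloaded history page and that the marker above stays stable (it may well change whenever Google changes the page layout):

using System;
using System.Collections.Generic;

static List<string> ExtractSearchTerms(string html)
{
    // The tail end of the HTML that precedes every "Searched for" entry.
    const string marker = "<a title=\"http://www.google.com/search?q=";
    var terms = new List<string>();
    int pos = 0;
    while ((pos = html.IndexOf(marker, pos, StringComparison.Ordinal)) != -1)
    {
        pos += marker.Length;
        int end = html.IndexOf('&', pos);              // the term ends at the ampersand
        if (end == -1) break;
        string term = html.Substring(pos, end - pos);
        terms.Add(Uri.UnescapeDataString(term.Replace('+', ' ')));  // '+' joins words
        pos = end;
    }
    return terms;
}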

static void GetGoogleWebHistory(int month, int day, int yr, string UserName, string Pass)
{
    // Per-day RSS lookup URL, as described in the answer above.
    string iURL = "http://www.google.com/history/lookup?month=" + month + "&day=" + day + "&yr=" + yr + "&output=rss";
    WebClient client = new WebClient();
    // Credentials for the Google account that owns the history (GData client library).
    // Note: as written these are never attached to the WebClient, so the request
    // below is effectively unauthenticated; the credentials still need to be wired in.
    GDataCredentials gdc = new GDataCredentials(UserName, Pass);
    RequestSettings rs = new RequestSettings(Guid.NewGuid().ToString(), gdc);
    // Download the feed and load it as XML for parsing.
    XmlDocument XDoc = new XmlDocument();
    XDoc.LoadXml(client.DownloadString(iURL));
}

Related

Reserved Symbol (&, /, etc) in URL Sharepoint Rest Query - Bad Request

One problem I have been facing off and on for the past few weeks was searching SharePoint for a list item value and getting a bad request error. I had two symbols causing problems: one was that I could not search for anything with an & symbol, and the other was a / (forward slash).
My code looked like:
ServiceContext context = new ServiceContext(new Uri("http://URL/_vti_bin/listdata.svc"));
context.Credentials = CredentialCache.DefaultCredentials;
var requestType = (System.Data.Services.Client.DataServiceQuery<ListTypeValue>)context.ListType
    .Where(v => v.Value.Equals(search));
After searching the internet, nothing valid came back besides changing IIS settings or converting the symbol to its URL-encoded value (NOTE: converting & to %26 still causes the bad request error).
I would really not recommend using the combination of StartsWith and Length - performance could become a real issue in that case. Assuming you need a string key and that you want your keys to be able to contain special characters, Peter Qian has blogged about the best recommendation we can give. (This behavior actually comes from a combination of IIS and the .NET stack.)
Check out http://blogs.msdn.com/b/peter_qian/archive/2010/05/25/using-wcf-data-service-with-restricted-characrters-as-keys.aspx for more details, but your problem should be solved by turning off ASP.NET request filtering. Note that this has non-trivial security risks. Peter points out some of them, and security filtering tools like asafaweb.com will complain about this solution.
Long story short: if you can use integers or avoid the restricted characters in keys, your Web application will be more secure.
I found this article and it gave me an idea. I was going to try using HEX, but since I am using a web service, I couldn't figure anything out. Finally I thought: hey, someone stated how they used substringof, why not try startswith!
Finally! A solution to this problem.
INVALID:
http://URL/_vti_bin/listdata.svc/ListType('Search & Stuff')
VALID:
http://URL/_vti_bin/listdata.svc/ListType()?$filter=startswith(Value,'Search & Stuff')
I took it a step further, since this could potentially return things I didn't want. I added a check that the length is equal to the search string's length, and it is working perfectly!
My C# Code looks like this now:
ServiceContext context = new ServiceContext(new Uri("http://URL/_vti_bin/listdata.svc"));
context.Credentials = CredentialCache.DefaultCredentials;
var requestType = (System.Data.Services.Client.DataServiceQuery<ListTypeValue>)context.ListType
    .Where(v => v.Value.StartsWith(search) && v.Value.Length == search.Length);
Hopefully this helps someone else out and saves some hair pulling!
UPDATE:
After getting replies from people, I found a better way to do this. Instead of the standard LINQ method of querying, there is another option: AddQueryOption. See the MSDN article.
System.Data.Services.Client.DataServiceQuery<ListTypeValue> lobtest = context.ListType
    .AddQueryOption("$filter", "(Value eq '" + CleanString(search) + "')");
Now I will implement the link posted by Mark Stafford and parse out reserved characters; a possible shape for that helper is sketched below.
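CleanString above is the poster's own helper, not shown in the post; a minimal sketch of what it might do, assuming OData's convention that a single quote inside a string literal is escaped by doubling it (other reserved characters would still need URL encoding):

// Hypothetical helper: escape single quotes for an OData string literal.
static string CleanString(string value)
{
    return value.Replace("'", "''");
}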

What are good ways to extract price, mileage and location from an auto-dealer website?

I have crawled some auto websites and am trying to extract information from these sites.
I need the following information: VIN, mileage, price and location.
I tried a regular expression approach, but it does not scale since I have around 20,000 websites to extract from. I want to try machine learning for the extraction.
Some context: all the webpages I have downloaded contain VINs; I have used a regex to find those.
In some webpages, the price is represented by any of the following words: price, market price, eprice, internet price, MSRP.
Some price texts are struck out and a lower price is offered next to them, as in the case of a discount. I want my program to take this into account, ignore the struck-out price and consider the other one.
Mileage is represented as mileage or miles.
I thought of using wrapper induction, but read that this approach would not work if the website changes its template.
Moreover, that approach takes time to train a classifier per pattern per website.
So what kind of approach or algorithm should I use to extract price, mileage and location from a webpage?
There are different ways to parse an HTML site:
you can use Regex
XPath can also be used to select the content
But the best way will be to use the HTML Agility Pack.
HTML Agility Pack example:
var doc = new HtmlWeb().Load(url);
var comments = doc.Descendants("div")
    .Where(div => div.GetAttributeValue("class", "") == "comment");
Here you can find an overview of different methods to parse HTML fields via C# (including examples).
You may take a look at HtmlAgilityPack. It allows you to parse HTML and extract the necessary information using XPath queries (CSS selectors are available via extensions such as Fizzler). It could make your code somewhat more resilient to changes in the design and structure of the website; a small sketch follows.
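A small sketch of that idea, assuming hypothetical markup (class names containing "price" and "mileage"); the XPath expressions would have to be adapted per site or per template family:

using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlWeb().Load(url);
// Take the first price-ish node that is not struck out ('del'/'strike' ancestors
// usually mark the discounted-away price the question wants to ignore).
var priceNode = doc.DocumentNode
    .SelectNodes("//*[contains(@class,'price')]")
    ?.FirstOrDefault(n => !n.Ancestors("del").Any() && !n.Ancestors("strike").Any());
var mileageNode = doc.DocumentNode
    .SelectSingleNode("//*[contains(@class,'mileage') or contains(text(),'miles')]");
string price = priceNode?.InnerText.Trim();
string mileage = mileageNode?.InnerText.Trim();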

Optimal way to cache time of day description

What is the best method to cache the following? I am creating an intranet web application template that will display the message, e.g., Good Morning, Justin Satyr! near the top of my master page header. Obviously, I will have to determine whether to show Morning, Afternoon or Evening. For clarity, my code is below:
string partOfDay;
var hours = DateTime.Now.Hour;
if (hours > 16)
{
    partOfDay = "evening";
}
else if (hours > 11)
{
    partOfDay = "afternoon";
}
else
{
    partOfDay = "morning";
}
I do not want to re-determine this on each page load because that seems moderately redundant and because I have to poll a SQL server to retrieve the user's full name. What is the best way to cache this information? If I cache it for the length of the session, then if the user begins using the application at 11:00 AM and finishes at 3:00 PM, it will still say Good Morning.
Is the best thing to do simply re-determine the M/A/E word each page load and cache the person's full name for the session? Or is there a better way?
I would just keep the user name in the Session object; the rest honestly is not worth caching and checking whether it is out of date, etc. Just re-run it on each page - provided you put the implementation into a common library/class so you keep your code DRY.
In my opinion there is absolutely no need to cache the part of day. User information can be made available in the Session.
If you are talking in an ASP.NET MVC context, you can use the System.Web.Helpers namespace, where you can find the WebCache helper. Then you need to calculate the number of minutes until the part of day changes and call WebCache.Set with parameters value = "your string" and minutesToCache = calculated_value, as sketched below.
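A minimal sketch of that calculation, assuming the same hour boundaries as the question (after 16 = evening, after 11 = afternoon) and the partOfDay string from the question's code:

using System;
using System.Web.Helpers;

var now = DateTime.Now;
// Next boundary: noon, 5 PM, or midnight, depending on the current part of day.
int boundaryHour = now.Hour > 16 ? 24 : now.Hour > 11 ? 17 : 12;
int minutesToCache = (int)(now.Date.AddHours(boundaryHour) - now).TotalMinutes;
WebCache.Set("partOfDay", partOfDay, minutesToCache, slidingExpiration: false);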
Old question, I know, but I don't cache mine, since the user's time may change during the session. I store their calculated time in my session (calculated based on their timezone), and then use this code at the top of all pages:
<strong>@(string.Format("Good {0}, ", SessionManager.GetUserCurrentDate().Hour > 16 ? "Evening" : SessionManager.GetUserCurrentDate().Hour > 11 ? "Afternoon" : "Morning") + SessionManager.GetDisplayName())</strong>
Works well for me!

Finding string segments in a string

I have a list of segments (15,000+), and I want to find the occurrences of those segments in a given string. A segment can be a single word or multiple words; I cannot assume space is a delimiter in the string.
e.g.
String "How can I download codec from internet for facebook, Professional programmer support"
[the string above may not make any sense but I am using it for illustration purposes]
segment list
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.
Output:
Download codec from internet
facebook
Professional programmer
Basically I am trying to do query reduction.
I want to achieve it in less than O(list length + string length) time.
As my list has more than 15,000 segments, it would be time consuming to search the entire list against the string.
The segments are prepared manually and placed in a txt file.
Regards
~Paul
You basically want a string search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively searching for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns; a compact sketch follows.
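A compact, illustrative C# sketch of Aho-Corasick (not production-hardened; it lower-cases everything on the assumption that matching should be case-insensitive):

using System;
using System.Collections.Generic;

class AhoCorasick
{
    class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    readonly Node root = new Node();

    // Insert one pattern (segment) into the trie.
    public void Add(string pattern)
    {
        var node = root;
        foreach (char c in pattern.ToLowerInvariant())
        {
            if (!node.Next.TryGetValue(c, out var child))
                node.Next[c] = child = new Node();
            node = child;
        }
        node.Output.Add(pattern);
    }

    // Build failure links breadth-first; call once after all Adds.
    public void Build()
    {
        var queue = new Queue<Node>();
        foreach (var child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var kv in node.Next)
            {
                var fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(kv.Key))
                    fail = fail.Fail;
                kv.Value.Fail = fail == null ? root : fail.Next[kv.Key];
                kv.Value.Output.AddRange(kv.Value.Fail.Output);
                queue.Enqueue(kv.Value);
            }
        }
    }

    // One pass over the text; yields every pattern occurrence it sees.
    public IEnumerable<string> Search(string text)
    {
        var node = root;
        foreach (char c in text.ToLowerInvariant())
        {
            while (node != root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out var next))
                node = next;
            foreach (var match in node.Output)
                yield return match;
        }
    }
}

Usage: Add each segment once, call Build, then Search makes a single pass over the input string and reports every segment it contains.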
In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:
http://en.wikipedia.org/wiki/Key_Word_in_Context
http://www.cs.duke.edu/~ola/ipc/kwic.html
What you're basically asking is how to write a custom lexer/parser.
Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).
Take a look at this question:
Poor man's lexer for C#
Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The deal with using regex in this situation is that your execution time will grow linearly as a function of the number of tokens you are matching against. So if you end up needing to "segment" more phrases, your execution time will get longer and longer.
What you need to do is make a single pass, pushing words onto a stack and checking whether they form a valid token after adding each one. If they don't, you continue (disregarding the token like a compiler disregards comments). A rough sketch follows.
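A rough sketch of that idea, assuming the segments live in a segments.txt file (a hypothetical name), input holds the string being searched, and capping the candidate length at the longest segment keeps the scan close to linear:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

var segments = new HashSet<string>(File.ReadLines("segments.txt"),
                                   StringComparer.OrdinalIgnoreCase);
int maxWords = segments.Max(s => s.Split(' ').Length);   // longest segment, in words
var words = input.Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);
for (int start = 0; start < words.Length; start++)
{
    string candidate = words[start];
    for (int len = 1; len <= maxWords && start + len <= words.Length; len++)
    {
        if (len > 1) candidate += " " + words[start + len - 1];
        if (segments.Contains(candidate))
            Console.WriteLine(candidate);   // segment found in the input
    }
}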
Hope this helps.

Detecting honest web crawlers

I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?
If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However if a web crawler is detected, we'd always give them the same version so that the index is consistent.
Also I'm using Java, but I would imagine the approach would be similar for any server-side technology.
You said matching the user agent on ‘bot’ may be awkward, but we’ve found it to be a pretty good match. Our studies have shown that it will cover about 98% of the hits you receive. We also haven’t come across any false positive matches yet either. If you want to raise this up to 99.9% you can include a few other well-known matches such as ‘crawler’, ‘baiduspider’, ‘ia_archiver’, ‘curl’ etc. We’ve tested this on our production systems over millions of hits.
Here are a few C# solutions for you:
1) Simplest
Is the fastest when processing a miss, i.e. traffic from a non-bot – a normal user.
Catches 99+% of crawlers.
bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
2) Medium
Is the fastest when processing a hit, i.e. traffic from a bot. Pretty fast for misses too.
Catches close to 100% of crawlers.
Matches ‘bot’, ‘crawler’, ‘spider’ upfront.
You can add to it any other known crawlers.
List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
3) Paranoid
Is pretty fast, but a little slower than options 1 and 2.
It’s the most accurate, and allows you to maintain the lists if you want.
You can maintain a separate list of names with ‘bot’ in them if you are afraid of false positives in future.
If we get a short match we log it and check it for a false positive.
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};
// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
string match = null;
if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));
if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);
bool iscrawler = match != null;
Notes:
It’s tempting to just keep adding names to the regex in option 1, but if you do this it will become slower. If you want a more complete list, then LINQ with a lambda is faster.
Make sure .ToLower() is outside of your LINQ method – remember the method is a loop, and you would be re-lowering the string during each iteration.
Always put the heaviest bots at the start of the list, so they match sooner.
Put the lists into a static class so that they are not rebuilt on every pageview (see the sketch below).
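A minimal sketch of that last point; static readonly fields are built once per app domain instead of on every request (list contents elided here, they are the same as in the options above):

using System.Collections.Generic;

public static class CrawlerLists
{
    // Built once, shared by all requests.
    public static readonly List<string> WithBot = new List<string>
    {
        "googlebot", "bingbot", "yandexbot" /* ... rest of Crawlers1 ... */
    };
    public static readonly List<string> WithoutBot = new List<string>
    {
        "baiduspider", "yahoo! slurp", "ia_archiver" /* ... rest of Crawlers2 ... */
    };
}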
Honeypots
The only real alternative to this is to create a ‘honeypot’ link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.
Positives: It will match some unknown crawlers that aren’t declaring themselves.
Negatives: Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.
You can find a very thorough database of data on known "good" web crawlers in the robotstxt.org Robots Database. Utilizing this data would be far more effective than just matching bot in the user-agent.
One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users won't see the link, leaving spiders and bots to follow it. For example, an empty anchor tag that points to a subfolder would record a GET request in your logs...
Many people use this method while running a honeypot to catch malicious bots that aren't following the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers, along the lines of the sketch below...
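A minimal sketch of the logging side, assuming ASP.NET WebForms and a hypothetical trap page; the invisible anchor on the real pages would point at this page's URL:

// Code-behind of the honeypot page: log whoever fetched it.
protected void Page_Load(object sender, EventArgs e)
{
    string ua = Request.UserAgent ?? "(none)";
    string line = DateTime.UtcNow.ToString("o") + "\t" +
                  Request.UserHostAddress + "\t" + ua + Environment.NewLine;
    System.IO.File.AppendAllText(Server.MapPath("~/App_Data/honeypot.log"), line);
}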
Any visitor whose entry page is /robots.txt is probably a bot.
Something quick and dirty like this might be a good start:
return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i
Note: Rails code, but the regex is generally applicable.
I'm pretty sure a large proportion of bots don't use robots.txt, however that was my first thought.
It seems to me that the best way to detect a bot is by the time between requests; if the time between requests is consistently fast, then it's a bot.
void CheckBrowserCaps()
{
    String labelText = "";
    System.Web.HttpBrowserCapabilities myBrowserCaps = Request.Browser;
    if (((System.Web.Configuration.HttpCapabilitiesBase)myBrowserCaps).Crawler)
    {
        labelText = "Browser is a search engine.";
    }
    else
    {
        labelText = "Browser is not a search engine.";
    }
    Label1.Text = labelText;
}
HttpCapabilitiesBase.Crawler Property
