Detecting honest web crawlers - C#
I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?
If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However if a web crawler is detected, we'd always give them the same version so that the index is consistent.
Also I'm using Java, but I would imagine the approach would be similar for any server-side technology.
You said matching the user agent on ‘bot’ may be awkward, but we’ve found it to be a pretty good match. Our studies have shown that it will cover about 98% of the hits you receive. We also haven’t come across any false positive matches yet either. If you want to raise this up to 99.9% you can include a few other well-known matches such as ‘crawler’, ‘baiduspider’, ‘ia_archiver’, ‘curl’ etc. We’ve tested this on our production systems over millions of hits.
Here are a few C# solutions for you:
1) Simplest
Fastest when processing a miss, i.e. traffic from a non-bot (a normal user).
Catches 99+% of crawlers.
// Request.UserAgent can be null, hence the null-coalescing guard.
bool iscrawler = Regex.IsMatch(Request.UserAgent ?? "", @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
2) Medium
Fastest when processing a hit, i.e. traffic from a bot. Pretty fast for misses too.
Catches close to 100% of crawlers.
Matches ‘bot’, ‘crawler’, ‘spider’ upfront.
You can add to it any other known crawlers.
List<string> Crawlers3 = new List<string>()
{
"bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
"lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
"atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
"cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
"esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
"gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
"htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
"image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
"lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
"motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
"netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
"patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
"raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
"searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
"curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
"urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
"webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
"webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
"wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
// Request.UserAgent can be null, hence the null-coalescing guard.
string ua = (Request.UserAgent ?? "").ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
3) Paranoid
Pretty fast, but a little slower than options 1 and 2.
It’s the most accurate, and allows you to maintain the lists if you want.
You can maintain a separate list of names with ‘bot’ in them if you are afraid of false positives in future.
If we get a short match we log it and check it for a false positive.
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
"googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
"yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
"botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
"ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
"dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
"irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
"simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
"vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
"spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};
// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
"baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
"nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
"bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
"cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
"fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
"gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
"havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
"jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
"larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
"merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
"muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
"objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
"phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
"roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
"senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
"spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
"titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
"webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
"webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
"robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
"legs","curl","webs","wget","sift","cmc"
};
string ua = (Request.UserAgent ?? "").ToLower();
string match = null;
if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));
if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);
bool iscrawler = match != null;
Notes:
It’s tempting to just keep adding names to the regex in option 1, but if you do this it will become slower. If you want a more complete list, then LINQ with a lambda is faster.
Make sure .ToLower() is called outside of your LINQ method – remember the method is effectively a loop, and you would be creating a new lowercased string on every iteration.
Always put the most frequently seen bots at the start of the list, so they match sooner.
Put the lists into a static class so that they are not rebuilt on every pageview.
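As a rough sketch of that last point (the class and member names here are just for illustration, not from the original answer), the lists can live in a static class and be reused across requests:

using System.Collections.Generic;

public static class CrawlerDetector
{
    // Built once per app domain; populate with the entries from option 2 (or 1/3) above.
    private static readonly List<string> Crawlers = new List<string>
    {
        "bot", "crawler", "spider", "80legs", "baidu", "yahoo! slurp", "ia_archiver"
        // ... the rest of the list ...
    };

    public static bool IsCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLower();
        return Crawlers.Exists(x => ua.Contains(x));
    }
}

Usage would then be something like bool iscrawler = CrawlerDetector.IsCrawler(Request.UserAgent);.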
Honeypots
The only real alternative to this is to create a ‘honeypot’ link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.
Positives: It will match some unknown crawlers that aren’t declaring themselves.
Negatives: Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.
You can find a very thorough database of known "good" web crawlers in the robotstxt.org Robots Database. Using this data would be far more effective than just matching 'bot' in the user agent.
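For example (only a sketch, and the file name is an assumption), you could export the user-agent tokens from the Robots Database to a text file with one entry per line, load them once, and match against them:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class RobotsDatabase
{
    // Hypothetical file holding one user-agent token per line,
    // exported from the robotstxt.org Robots Database.
    private static readonly HashSet<string> KnownCrawlerTokens =
        new HashSet<string>(
            File.ReadAllLines("robots-database-useragents.txt")
                .Select(line => line.Trim().ToLowerInvariant())
                .Where(line => line.Length > 0));

    public static bool IsKnownCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLowerInvariant();
        return KnownCrawlerTokens.Any(token => ua.Contains(token));
    }
}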
One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users wouldn't see the link, leaving only spiders and bots to follow it. For example, an empty anchor tag that points to a subfolder would record a GET request in your logs...
Many people use this method while running a honeypot to catch malicious bots that aren't following the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
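This isn't that author's actual solution, but a minimal sketch of the idea: hide an anchor such as <a href="/trap/honeypot.ashx" style="display:none"></a> in your pages (the path is a placeholder), then log whatever requests it in a generic handler:

using System.Web;

// honeypot.ashx: records the user agent and IP of anything that follows the hidden link.
public class HoneypotHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        string ua = context.Request.UserAgent ?? "(no user agent)";
        string ip = context.Request.UserHostAddress;

        // Replace with real logging (database, log file, etc.).
        System.Diagnostics.Trace.WriteLine("Honeypot hit: " + ip + " / " + ua);

        context.Response.ContentType = "text/plain";
        context.Response.Write("ok");
    }

    public bool IsReusable { get { return true; } }
}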
Any visitor whose entry page is /robots.txt is probably a bot.
Something quick and dirty like this might be a good start:
return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i
Note: rails code, but regex is generally applicable.
I'm pretty sure a large proportion of bots don't use robots.txt; however, that was my first thought.
It seems to me that the best way to detect a bot is by looking at the time between requests: if the time between requests is consistently fast, then it's a bot.
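A rough sketch of that heuristic (the thresholds, the per-IP key, and the in-memory store are all assumptions; a real implementation would need eviction and would have to cope with shared IPs):

using System;
using System.Collections.Concurrent;

static class RequestTimingHeuristic
{
    private class Stats
    {
        public DateTime LastRequest = DateTime.MinValue;
        public int FastRequests;
    }

    private static readonly ConcurrentDictionary<string, Stats> ByClient =
        new ConcurrentDictionary<string, Stats>();

    // Flag a client as a likely bot once it makes several requests in a row
    // with under a second between them. Both numbers are arbitrary.
    public static bool LooksLikeBot(string clientKey)
    {
        var stats = ByClient.GetOrAdd(clientKey, _ => new Stats());
        var now = DateTime.UtcNow;

        lock (stats)
        {
            if (now - stats.LastRequest < TimeSpan.FromSeconds(1))
                stats.FastRequests++;
            else
                stats.FastRequests = 0;

            stats.LastRequest = now;
            return stats.FastRequests >= 10;
        }
    }
}

It could be called per request with something like RequestTimingHeuristic.LooksLikeBot(Request.UserHostAddress).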
void CheckBrowserCaps()
{
    String labelText = "";
    // ASP.NET's browser capability files already flag many known crawlers.
    // HttpBrowserCapabilities inherits Crawler from HttpCapabilitiesBase, so no cast is needed.
    System.Web.HttpBrowserCapabilities myBrowserCaps = Request.Browser;
    if (myBrowserCaps.Crawler)
    {
        labelText = "Browser is a search engine.";
    }
    else
    {
        labelText = "Browser is not a search engine.";
    }
    Label1.Text = labelText;
}
HttpCapabilitiesBase.Crawler Property
Related
Is it advisable to use tokens for the purpose of syntax highlighting?
I'm trying to implement syntax highlighting in C# on Android, using Xamarin. I'm using the ANTLR v4 library for C# to achieve this. My code, which is currently syntax highlighting Java with this grammar, does not attempt to build a parse tree and use the visitor pattern. Instead, I simply convert the input into a list of tokens:

private static IList<IToken> Tokenize(string text)
{
    var inputStream = new AntlrInputStream(text);
    var lexer = new JavaLexer(inputStream);
    var tokenStream = new CommonTokenStream(lexer);
    tokenStream.Fill();
    return tokenStream.GetTokens();
}

Then I loop through all of the tokens in the highlighter and assign a color to them based on their kind.

public void HighlightAll(IList<IToken> tokens)
{
    int tokenCount = tokens.Count;
    for (int i = 0; i < tokenCount; i++)
    {
        var token = tokens[i];
        var kind = GetSyntaxKind(token);
        HighlightNext(token, kind);
        if (kind == SyntaxKind.Annotation)
        {
            var nextToken = tokens[++i];
            Debug.Assert(token.Text == "@" && nextToken.Type == Identifier);
            HighlightNext(nextToken, SyntaxKind.Annotation);
        }
    }
}

public void HighlightNext(IToken token, SyntaxKind tokenKind)
{
    int count = token.Text.Length;
    if (token.Type != -1)
    {
        _text.SetSpan(_styler.GetSpan(tokenKind), _index, _index + count, SpanTypes.InclusiveExclusive);
        _index += count;
    }
}

Initially, I figured this was wise because syntax highlighting is largely context-independent. However, I have already found myself needing to special-case identifiers in front of @, since I want those to get highlighted as annotations just as on GitHub (example). GitHub has further examples of coloring identifiers in certain contexts: here, List and ArrayList are colored, while mItems is not. I will likely have to add further code to highlight identifiers in those scenarios.

My question is, is it a good idea to examine tokens rather than a parse tree here? On one hand, I'm worried that I might have to end up doing a lot of special-casing for when a token's neighbors alter how it should be highlighted. On the other, parsing will add additional overhead for memory-constrained mobile devices, and make it more complicated to implement efficient syntax highlighting (e.g. not re-tokenizing/parsing everything) when the user edits text in the code editor. I also found it significantly less complicated to handle all of the token types rather than the parser rule types, because you just switch on token.Type rather than overriding a bunch of Visit* methods.

For reference, the full code of the syntax highlighter is available here.
It depends on what you are syntax highlighting. If you use a naive parser, then any syntax error in the text will cause highlighting to fail. That makes it quite a fragile solution since a lot of the texts you might want to syntax highlight are not guaranteed to be correct (particularly user input, which at best will not be correct until it is fully typed). Since syntax highlighting can help make syntax errors visible and is often used for that purpose, failing completely on syntax errors is counter-productive. Text with errors does not readily fit into a syntax tree. But it does have more structure than a stream of tokens. Probably the most accurate representation would be a forest of subtree fragments, but that is an even more awkward data structure to work with than a tree. Whatever the solution you choose, you will end up negotiating between conflicting goals: complexity vs. accuracy vs. speed vs. usability. A parser may be part of the solution, but so may ad hoc pattern matching.
Your approach is totally fine and pretty much what everybody's using. And it's totally normal to fine-tune type matching by looking around (and it's cheap, since the token types are cached). So you can always just look back or ahead in the token stream if you need to adjust the SyntaxKind actually used. Don't start parsing your input. It won't help you.
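For instance (SyntaxKind, GetSyntaxKind and the specific rules below are illustrative assumptions based on the question's code), looking around is only an index check into the token list:

// Sketch: classify a token, then peek at its neighbours to refine the kind.
// IToken comes from Antlr4.Runtime; SyntaxKind/GetSyntaxKind are the question's own helpers.
private SyntaxKind ClassifyWithLookaround(IList<IToken> tokens, int i)
{
    var kind = GetSyntaxKind(tokens[i]);

    // Example rule: an identifier immediately preceded by '@' is an annotation name.
    if (kind == SyntaxKind.Identifier && i > 0 && tokens[i - 1].Text == "@")
        return SyntaxKind.Annotation;

    // Example rule: an identifier immediately followed by '(' is probably a method name.
    if (kind == SyntaxKind.Identifier && i + 1 < tokens.Count && tokens[i + 1].Text == "(")
        return SyntaxKind.MethodName;

    return kind;
}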
I ended up choosing to use a parser because there were too many ad hoc rules. For example, although I wanted to color regular identifiers white, I wanted types in type declarations (e.g. C in class C) to be green. There ended up being about 20 of these special rules in total. Also, the added overhead of parsing turned out to be minuscule compared to other bottlenecks in my app. For those interested, you can view my code here: https://github.com/jamesqo/Repository/blob/e5d5653093861bc35f4c0ac71ad6e27265e656f3/Repository.EditorServices/Internal/Java/Highlighting/JavaSyntaxHighlighter.VisitMethods.cs#L19-L76. I've highlighted all of the ~20 special rules I've had to make.
String matching from a list
I am making a web browser and I am stuck on this one thing. I want the address bar to act as both an address bar and a search bar. First I tried checking whether the query ended with ".com", using if (adrBarTextBox.Text.EndsWith(".com")), but I immediately realized that not every domain ends with .com. The code I am currently using (and am stuck with) is:

// Populate the list.
var list = new List<string>();
list.Add(Properties.Settings.Default.suffix);

(Properties.Settings.Default.suffix is a list of every domain suffix currently available.)

// Search for this element.
if (adrBarTextBox.Text.Contains(list.something????))
{
    // Do something (I have this part all set up)
}

The part I am having trouble with is if (adrBarTextBox.Text.Contains(list. I know it doesn't make sense, but that's why I am asking. I have sat here thinking of a new way for hours and I am lost. I know that .Text.Contains(list) doesn't make sense, and that's what I am stuck with. I know the question is a bit noobish and there is probably some simple, easy answer staring me right in the face, but hey. We all have to learn from somewhere.
You may need this:

if (list.Any(x => adrBarTextBox.Text.Contains(x)))
{
    // ...
}
Use Uri.IsWellFormedUriString to determine if the input string is a valid URL. If you want to match a string with words against another list of words, use myList.Any(item => input.Contains(item));
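Putting the two suggestions together, a sketch of the address-bar decision might look like this (the fallback search URL and the source of the suffix list are placeholders):

using System;
using System.Collections.Generic;
using System.Linq;

// Treat the input as a URL if it is well formed or ends with a known domain suffix;
// otherwise fall back to a web search.
static string ResolveAddressBarInput(string input, List<string> domainSuffixes)
{
    input = input.Trim();

    if (Uri.IsWellFormedUriString(input, UriKind.Absolute))
        return input;

    // e.g. "example.org" has no scheme, so check the suffix list instead.
    if (domainSuffixes.Any(suffix => input.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)))
        return "http://" + input;

    // Placeholder search engine URL.
    return "https://www.google.com/search?q=" + Uri.EscapeDataString(input);
}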
Fuzzy pattern matching from emails in C#
I'm looking for a way to extract bits of data from emails. I'm primarily looking at subject lines and the email body, and extracting customer and order reference numbers. Imagine I'm a company where customers can email info@mydomain.com and they might add a specific customer number or order reference in the subject line or body of the email. However, they might not always provide these references in the optimal format. I want to extract the data out, and return a probability of how likely the data is valid. Is there some kind of technique I can use to attempt to scan an email and return a probable customer number and/or order reference with a degree of probability (a bit like Bayesian spam filtering)? I was considering some kind of regular expression engine, but that seemed too rigid. I was also looking at NUML.net and wondering if it could help me, but I'm a little out of my depth, since I'm not entirely sure what I need. I've come across the Levenshtein algorithm, but that seems to be matching two fixed strings, rather than a fixed string and a pattern. I'm imagining an API that looks a little like this:

// emailMessage is a Mandrill inbound object, in case anybody wonders
EmailScanResult results = EmailScanner.Scan(emailMessage, new[] { ScanType.CustomerNo, ScanType.OrderReference });
foreach (var result in results)
{
    var scanType = result.Type; // i.e. ScanType.CustomerNo
    var score = result.Score;   // e.g. 1.2
    var value = result.Value;   // CU-233454345-2321
}

Possible inputs for this are varied; e.g. for the same customer number:

DF-232322-AB2323
df-232322-AB2323
232322-ab2323
232322AB2323

What kinds of algorithms would be useful for such a task? Are there any recommended .NET libraries for this, and do you know of any appropriate examples?
If I got it right, you could use a regular expression with no problem. For example, with the input samples you gave, you could use a regex like:

([A-Za-z]{2}-)?\d{6}-?[A-Za-z]{0,2}\d{4}

The first part gets the DF- or df-, which may or may not occur:

([A-Za-z]{2}-)?

The second part gets the first group of digits:

\d{6}

Then, we say that it could have a dash:

-?

Finally, we get the last group, which may start with a couple of letters before its digits:

[A-Za-z]{0,2}\d{4}

This would cover the values you provided as samples, but you also could write other expressions to fetch other values. Or, maybe, you could use something like Lucene.net. From what I know, this could help you too.

http://pt.slideshare.net/nitin_stephens/lucene-basics
http://jsprunger.com/getting-started-with-lucene-net/
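The EmailScanner/ScanType API in the question is hypothetical, so here is only a minimal sketch of how the regex could be applied with a crude confidence score (the scoring rule is an assumption, not a real probability model):

using System;
using System.Text.RegularExpressions;

static class CustomerNoScanner
{
    // Optional 2-letter prefix, 6 digits, optional dash, optional 2 letters, 4 digits.
    private static readonly Regex CustomerNo = new Regex(
        @"([A-Za-z]{2}-)?\d{6}-?[A-Za-z]{0,2}\d{4}",
        RegexOptions.Compiled);

    public static void Scan(string emailText)
    {
        foreach (Match m in CustomerNo.Matches(emailText))
        {
            // Crude assumption: a match that includes the two-letter prefix is
            // more likely to be a real customer number.
            double score = m.Groups[1].Success ? 1.0 : 0.6;
            Console.WriteLine($"Possible customer number: {m.Value} (score {score})");
        }
    }
}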
Reserved Symbol (&, /, etc) in URL Sharepoint Rest Query - Bad Request
One problem I have been facing off and on for the past few weeks was trying to search SharePoint for a list item value, and I kept getting a bad request error. I had two symbols causing problems: one was that I could not search for something with an & symbol, and the other was a / (forward slash). My code looked like:

ServiceContext context = new ServiceContext(new Uri("http://URL/_vti_bin/listdata.svc"));
context.Credentials = CredentialCache.DefaultCredentials;
var requestType = (System.Data.Services.Client.DataServiceQuery<ListTypeValue>)context.ListType
    .Where(v => v.Value.Equals(search));

After searching the internet, nothing valid came back besides saying change IIS settings, or convert it to an ASCII HTML value (NOTE: converting & to %27 still causes a bad request error).
I would really not recommend using the combination of StartsWith and Length - performance could become a real issue in that case. Assuming you need a string key and that you want your keys to be able to contain special characters, Peter Qian has blogged about the best recommendation we can give. (This behavior actually comes from a combination of IIS and the .NET stack.) Check out http://blogs.msdn.com/b/peter_qian/archive/2010/05/25/using-wcf-data-service-with-restricted-characrters-as-keys.aspx for more details, but your problem should be solved by turning off ASP.NET request filtering. Note that this has non-trivial security risks. Peter points out some of them, and security filtering tools like asafaweb.com will complain about this solution. Long story short: if you can use integers or avoid the restricted characters in keys, your Web application will be more secure.
I found this article and it gave me an idea. I was going to try and use HEX, but since I am using a web service, I couldn't figure anything out. Finally I thought, hey, someone stated how they used substringof, why not try startswith! Finally! A solution to this problem.

INVALID: http://URL/_vti_bin/listdata.svc/ListType('Search & Stuff')
VALID: http://URL/_vti_bin/listdata.svc/ListType()?$filter=startswith(Value,'Search & Stuff')

I took it a step further since this could potentially return something other than what I wanted. I added a check that the length is equal to the search string's length, and it is working perfectly! My C# code looks like this now:

ServiceContext context = new ServiceContext(new Uri("http://URL/_vti_bin/listdata.svc"));
context.Credentials = CredentialCache.DefaultCredentials;
var requestType = (System.Data.Services.Client.DataServiceQuery<ListTypeValue>)context.ListType
    .Where(v => v.Value.StartsWith(search) && v.Value.Length == search.Length);

Hopefully this helps someone else out and saves some hair pulling!

UPDATE: After getting replies from people, I found a better way to do this. Instead of the standard LINQ method of querying, there is another option. See MSDN Article.

System.Data.Services.Client.DataServiceQuery<ListTypeValue> lobtest = context.ListType
    .AddQueryOption("$filter", "(Value eq '" + CleanString(search) + "')");

Now I will implement the link posted from Mark Stafford and parse out reserved characters.
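CleanString isn't shown in the post; below is only a minimal sketch of what such a helper might look like. The one rule I'd rely on is that OData string literals escape a single quote by doubling it; anything beyond that is an assumption for illustration.

// Hypothetical helper referenced as CleanString(search) above.
static string CleanString(string value)
{
    if (string.IsNullOrEmpty(value)) return string.Empty;

    // Escape single quotes for the OData string literal.
    return value.Replace("'", "''");
}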
Finding string segments in a string
I have a list of segments (15,000+ segments), and I want to find the occurrences of the segments in a given string. A segment can be a single word or multiple words, and I cannot assume space is a delimiter in the string. e.g.

String: "How can I download codec from internet for facebook, Professional programmer support" [the string above may not make any sense but I am using it for illustration purposes]

Segment list:
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.

Output:
Download codec from internet
facebook
Professional programmer

Basically I am trying to do a query reduction. I want to achieve it in less than O(list length + string length) time. As my list has more than 15,000 segments, it will be time consuming to search the entire list in the string. The segments are prepared manually and placed in a txt file.

Regards
~Paul
You basically want a string search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively making it so that it searches for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns.
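Here is a compact sketch of an Aho-Corasick matcher in C#, assuming case-insensitive matching and that reporting each distinct segment once is enough; it is an illustration of the technique rather than a tuned implementation:

using System.Collections.Generic;

class AhoCorasick
{
    private class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    private readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // Build the trie of patterns (lowercased for case-insensitive matching).
        foreach (var p in patterns)
        {
            var node = _root;
            foreach (var c in p.ToLowerInvariant())
            {
                if (!node.Next.TryGetValue(c, out var child))
                    node.Next[c] = child = new Node();
                node = child;
            }
            node.Output.Add(p);
        }

        // Breadth-first construction of failure links.
        var queue = new Queue<Node>();
        foreach (var child in _root.Next.Values) { child.Fail = _root; queue.Enqueue(child); }
        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            foreach (var kv in current.Next)
            {
                var fail = current.Fail;
                while (fail != null && !fail.Next.ContainsKey(kv.Key)) fail = fail.Fail;
                kv.Value.Fail = fail == null ? _root : fail.Next[kv.Key];
                kv.Value.Output.AddRange(kv.Value.Fail.Output);
                queue.Enqueue(kv.Value);
            }
        }
    }

    // Single pass over the text; reports every pattern that occurs at least once.
    public HashSet<string> Search(string text)
    {
        var found = new HashSet<string>();
        var node = _root;
        foreach (var c in text.ToLowerInvariant())
        {
            while (node != _root && !node.Next.ContainsKey(c)) node = node.Fail;
            if (node.Next.TryGetValue(c, out var next)) node = next;
            foreach (var match in node.Output) found.Add(match);
        }
        return found;
    }
}

Something like new AhoCorasick(segments).Search(query) would then return every segment found in the query, where segments is the list loaded from the txt file.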
In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index: http://en.wikipedia.org/wiki/Key_Word_in_Context http://www.cs.duke.edu/~ola/ipc/kwic.html
What you're basically asking is how to write a custom lexer/parser. Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison). Take a look at this question: Poor man's lexer for C#. Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The deal with using regex in this situation is that your execution time will grow linearly as a function of the number of tokens you are matching against. So, if you end up needing to "segment" more phrases, your execution time will get longer and longer. What you need to do is have a single pass, popping words onto a stack and checking if they are valid tokens after adding each one. If they aren't, then you need to continue (disregard the token like a compiler disregards comments). Hope this helps.
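As a rough sketch of that single-pass idea (note the assumption that words can be split on non-word characters, which the question warns may not always hold):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class PhraseMatcher
{
    // Normalize a phrase to lowercase words separated by single spaces.
    private static string Normalize(string s) =>
        string.Join(" ", Regex.Split(s.ToLowerInvariant(), @"\W+").Where(w => w.Length > 0));

    // Walk the words once, extending the current phrase word by word and
    // checking it against the known segments after each addition.
    public static HashSet<string> FindSegments(string input, IEnumerable<string> segments)
    {
        var segmentSet = new HashSet<string>(segments.Select(Normalize));
        int maxWords = segmentSet.Max(s => s.Split(' ').Length);

        var words = Regex.Split(input.ToLowerInvariant(), @"\W+")
                         .Where(w => w.Length > 0)
                         .ToArray();

        var found = new HashSet<string>();
        for (int i = 0; i < words.Length; i++)
        {
            string phrase = null;
            for (int len = 1; len <= maxWords && i + len <= words.Length; len++)
            {
                phrase = phrase == null ? words[i] : phrase + " " + words[i + len - 1];
                if (segmentSet.Contains(phrase)) found.Add(phrase);
            }
        }
        return found;
    }
}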