How to implement Stack Overflow's "are you a human" feature? - c#
On this site, if you do too many clicks or post comments too fast or something like that, you get redirected to the "are you a human" screen. Does anybody know how to do something similar?
It's almost certainly a heuristic that tries to "guess" that a user is some form of automated process, rather than a person, for example:
More than "x" requests to do the same thing in a row
More than "x" actions in a "y" period of time
Ordinarily the "x" and "y" values would be formulated to be ones that it would be unlikely for a "real person" to do, like:
Editing the same answer 5 times in a row
Downvoting 10 questions within 1 minute
Once you've got your set of rules, you can then implement some code that checks them at the start of each request - be it in a method that's called in Page_Load, something in your master page, or something in the ASP.NET pipeline. That's the easy bit! ;)
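For example, here is a minimal sketch of such a check kept in session; the RequestGate name, the thresholds and the redirect target are illustrative assumptions, not part of the original answer:

    using System;
    using System.Collections.Generic;
    using System.Web;

    // Hypothetical helper: counts recent actions per session and flags when a
    // configurable "x actions in y seconds" rule is broken.
    public static class RequestGate
    {
        private const int MaxActions = 10;                                  // "x"
        private static readonly TimeSpan Window = TimeSpan.FromMinutes(1);  // "y"

        public static bool IsProbablyBot(HttpContext context)
        {
            var stamps = context.Session["actionTimestamps"] as List<DateTime>
                         ?? new List<DateTime>();

            // Drop timestamps that have fallen out of the window, then record this action.
            stamps.RemoveAll(t => DateTime.UtcNow - t > Window);
            stamps.Add(DateTime.UtcNow);
            context.Session["actionTimestamps"] = stamps;

            return stamps.Count > MaxActions;
        }
    }

    // In Page_Load (or a base page / master page):
    // if (RequestGate.IsProbablyBot(HttpContext.Current))
    //     Response.Redirect("~/are-you-human.aspx");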
First of all, you need a CAPTCHA control; here is a very nice one for ASP.NET:
http://www.codeproject.com/KB/custom-controls/CaptchaControl.aspx
Then you can use it together with this idea for detecting DoS attacks:
http://weblogs.asp.net/omarzabir/archive/2007/10/16/prevent-denial-of-service-dos-attacks-in-your-web-application.aspx
Beware of a bug in that code: the line if (context.Request.Browser.Crawler) return false; must return true instead, or just remove that line entirely.
Then combine the two for your clicks or submits: if a user makes too many clicks within a period of time, or too many submits, you simply show the CAPTCHA control; and if the clicks are by far too many, you trigger the DoS protection. This way you have two solutions in one - DoS attack prevention, with a CAPTCHA at the same time.
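A rough sketch of that two-threshold idea, called from a base page or master page; the thresholds and the GetRecentActionCount helper are assumptions for illustration, not part of either linked article:

    // Sketch only: CheckThresholds would be called from Page_Load in a base page
    // or master page. GetRecentActionCount is an assumed helper that returns how
    // many clicks/submits this session has made in the last minute.
    protected void CheckThresholds()
    {
        int recentActions = GetRecentActionCount(Session);

        if (recentActions > 100)
        {
            // Far too many: treat it as a DoS attempt and block the request.
            Response.StatusCode = 403;
            Response.End();
        }
        else if (recentActions > 10)
        {
            // Suspicious but possibly human: challenge with the CAPTCHA page.
            Response.Redirect("~/are-you-human.aspx");
        }
    }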
I have made something similar myself, but I changed the source code of both quite a lot to fit my needs.
One more interesting link, with a different approach to handling DoS attacks:
http://madskristensen.net/post/Block-DoS-attacks-easily-in-ASPNET.aspx
Hope this helps.
At a guess...
Write an HTTP handler that records requests and stores them in session.
When a new request comes in, check to see how many requests are stored (and expire old ones).
If the amount of requests in the past few minutes exceeds a given threshold, redirect the user.
If you're doing this in ASP.NET WebForms, you could do this check in the site master page (or write an IHttpHandler).
If you're using an MVC framework, you could write a base controller that does this check for every action.
With Rails, you could write a before_filter.
With ASP.NET MVC, you could write an ActionFilterAttribute (see the sketch below).
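A minimal sketch of that MVC filter, assuming the timestamps are kept in session; the attribute name, threshold and redirect URL are all illustrative:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web.Mvc;

    // Illustrative only: counts requests per session over a sliding window and
    // redirects to a challenge page when the threshold is exceeded.
    public class ThrottleAttribute : ActionFilterAttribute
    {
        private const int MaxRequests = 30;
        private static readonly TimeSpan Window = TimeSpan.FromMinutes(1);

        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            var session = filterContext.HttpContext.Session;
            if (session == null)
            {
                base.OnActionExecuting(filterContext);
                return;
            }

            var stamps = session["requestTimestamps"] as List<DateTime> ?? new List<DateTime>();

            // Expire old entries, then record this request.
            stamps = stamps.Where(t => DateTime.UtcNow - t < Window).ToList();
            stamps.Add(DateTime.UtcNow);
            session["requestTimestamps"] = stamps;

            if (stamps.Count > MaxRequests)
                filterContext.Result = new RedirectResult("~/are-you-human");

            base.OnActionExecuting(filterContext);
        }
    }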
You should have a session to track the user activity.
In the session you can have a counter for commenting and posting, for example (a rough C# sketch, with the counters kept in Session):
    // On each post/comment event:
    var referenceTime = (DateTime?)Session["referenceTime"] ?? DateTime.UtcNow;
    var postsThisMinute = (int?)Session["postsThisMinute"] ?? 0;

    if (DateTime.UtcNow - referenceTime > TimeSpan.FromMinutes(1))
    {
        // The one-minute window has expired: start a new one.
        referenceTime = DateTime.UtcNow;
        postsThisMinute = 0;
    }
    postsThisMinute++;
    Session["referenceTime"] = referenceTime;
    Session["postsThisMinute"] = postsThisMinute;

    // ...

    if (postsThisMinute > 10)
        Response.Redirect("/are-you-human.htm");
where the are-you-human.htm page can show a reCAPTCHA, as they have here on StackOverflow.com.
See also: https://blog.stackoverflow.com/2009/07/are-you-a-human-being/
Just check how many hits per minute you get from a specific IP or session (or whatever), decide on your preferred threshold, and you're good to go.
I'd also check the user agent header of the request - if it doesn't look like a popular browser (or is empty) then throw the "are you a human?" page.
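For example, a very rough User-Agent check; the browser token list here is an illustration only and far from exhaustive:

    string ua = (Request.UserAgent ?? string.Empty).ToLowerInvariant();

    // Tokens found in mainstream browser user agents (illustrative, not exhaustive).
    // Note: many bots also claim "mozilla", so this only catches the crudest ones.
    string[] browserTokens = { "mozilla", "chrome", "safari", "firefox", "opera", "msie" };

    bool looksLikeBrowser = false;
    foreach (string token in browserTokens)
    {
        if (ua.Contains(token)) { looksLikeBrowser = true; break; }
    }

    if (!looksLikeBrowser)
        Response.Redirect("~/are-you-human.aspx");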
Related
Prevent externalhit caching
If I understood correctly, Facebook's externalhit scrapes a page every 24h for new data. Since my users are going to share dynamic images on Facebook, the image should not be cached, because it would change much more than once every 24h. Does externalhit ignore something like:

    context.Response.Cache.SetCacheability(HttpCacheability.NoCache);

Is there some way to force it not to cache the image? I know using the linter clears the cache, but it would be silly to instruct my users to run the linter every time they want to see a changed image instead of a cached one. I assume some script to lint URLs programmatically would be against their TOS?
Use a different URL for each image, and have the Like button point to that URL - that's basically the only way to do this. Otherwise you'd retroactively be changing the details of the thing which was liked, and the fields are locked after X likes and won't be updated (I think X = 100).
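One common way to get a distinct URL per image version is to append a version (or timestamp) query parameter whenever the image content changes; a hypothetical sketch:

    // Hypothetical helper: builds a per-version image URL so Facebook's scraper
    // sees a new resource whenever the image content changes.
    public static string GetShareableImageUrl(string baseUrl, int imageVersion)
    {
        // e.g. https://example.com/images/42.png?v=7
        return string.Format("{0}?v={1}", baseUrl, imageVersion);
    }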
I guess it's not possible! Just accepting the answer.
Optimal way to cache time of day description
What is the best method to cache the following? I am creating an intranet web application template that will display the message, e.g., Good Morning, Justin Satyr! near the top of my master page header. Obviously, I will have to determine whether to show Morning, Afternoon or Evening. For clarity, my code is below:

    string partOfDay;
    var hours = DateTime.Now.Hour;
    if (hours > 16)
    {
        partOfDay = "evening";
    }
    else if (hours > 11)
    {
        partOfDay = "afternoon";
    }
    else
    {
        partOfDay = "morning";
    }

I do not want to re-determine this on each page load because that seems moderately redundant and because I have to poll a SQL server to retrieve the user's full name. What is the best way to cache this information? If I cache it for the length of the session, then if the user begins using the application at 11:00 AM and finishes at 3:00 PM, it will still say Good Morning. Is the best thing to do simply re-determine the M/A/E word each page load and cache the person's full name for the session? Or is there a better way?
I would just keep the user name in the Session object; the rest honestly is not worth caching and checking whether it is out of date, etc. Just re-run it on each page - provided you put the implementation into a common library/class so you keep your code DRY.
In my opinion there is absolutely no need to cache the part of day. User information can be made available in the Session.
If you are talking about an ASP.NET MVC context, you can use the System.Web.Helpers namespace, where you can find the WebCache helper. Then you need to calculate the minutes until the part of day changes and call the WebCache.Set method with parameters: value = "your string", minutesToCache = calculated_value.
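A sketch of that idea, assuming the boundaries used elsewhere in this question (after 16:00 = evening, after 11:00 = afternoon, otherwise morning); note it caches application-wide using server time:

    using System;
    using System.Web.Helpers;

    // Cache the greeting only until the next part-of-day boundary.
    public static string GetPartOfDay()
    {
        string cached = WebCache.Get("partOfDay");
        if (cached != null)
            return cached;

        var now = DateTime.Now;
        string partOfDay = now.Hour > 16 ? "evening" : now.Hour > 11 ? "afternoon" : "morning";

        // Next boundary: midnight, 17:00 or 12:00, depending on where we are now.
        DateTime nextBoundary =
            now.Hour > 16 ? now.Date.AddDays(1) :
            now.Hour > 11 ? now.Date.AddHours(17) :
                            now.Date.AddHours(12);

        int minutesToCache = (int)Math.Ceiling((nextBoundary - now).TotalMinutes);
        WebCache.Set("partOfDay", partOfDay, minutesToCache, slidingExpiration: false);

        return partOfDay;
    }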
Old, I know, but I don't cache mine, due to the obvious reason that the user's time of day may change during the session. I store their calculated time in my session (calculated based on their timezone), and then use this code at the top of all pages:

    @(string.Format("Good {0}, ",
        SessionManager.GetUserCurrentDate().Hour > 16 ? "Evening"
        : SessionManager.GetUserCurrentDate().Hour > 11 ? "Afternoon" : "Morning")
      + SessionManager.GetDisplayName())

Works well for me!
How to detect Javascript pop-up notifications in WatiN?
I have what seems to be a rather common scenario I'm trying to work through. I have a site that accepts input through two different text fields. If the input is malformed or invalid, I receive a Javascript pop-up notification. I will not always receive one, but I should in the event of (like I said earlier) malformed data, or when a search result couldn't be found. How can I detect this in WatiN? A quick Google search produced results that show how to click through them, but I'm curious as to whether or not I can detect when I get one? In case anyone is wondering, I'm using WatiN to do some screen scraping for me, rather than integration testing :) Thanks in advance! Ian
Here's what I came up with. I read this question several times before I came up with the obvious solution: Can I read JavaScript alert box with WatiN?

This is the code I came up with. While it does force a delay of up to 3 seconds if the alert doesn't happen, it works perfectly for my scenario. Hope someone else finds this useful.

    // alertDialogHandler is a WatiN AlertDialogHandler that was attached to the
    // browser's DialogWatcher beforehand.
    frame.Button(Find.ByName("go")).ClickNoWait();

    System.Diagnostics.Stopwatch stopwatch = new System.Diagnostics.Stopwatch();
    stopwatch.Start();
    while (stopwatch.Elapsed.TotalMilliseconds < 3000d)
    {
        if (alertDialogHandler.Exists())
        {
            // Do whatever I want to do when there is an alert box.
            alertDialogHandler.OKButton.Click();
            break;
        }
        System.Threading.Thread.Sleep(100); // poll instead of spinning flat out
    }
Hard Code List of Years?
This is the scenario. You've got a web form and you want to prompt the customer to select their birth year. Do you:

a) hard code the values in the dropdown list?
b) grab valid years from a DB table?

I can see a maintenance nightmare with copying a set of years hard coded in .aspx files everywhere.

Updated: a for loop is not ideal (maintenance nightmare and error prone). The user then has to sift through 120 years that haven't even got here yet. I still like the DB approach:

* Single point of data
* No duplication of code
* Update the table as needed to add more years
* Year table values could be used for some other dropdown for some other purpose entirely, for something other than birth year

Simple as that. No need to go updating code everywhere. I feel for data that is universal like this, we shouldn't be hard coding this shiza into a bunch of pages, which is totally going against reuse and error prone... really it's not practical. I'd take the hit to the DB for this.

Updated (again... after thinking about this): Here's my idea. Just create a utility or helper method called GetYears that runs that loop and returns a List<int> back, and I can bind that to whatever I want (dropdownlist, etc.). And I like the web.config idea of maintaining that end year.
C) Use a for-loop to generate the years in a range of your choice. Something as simple as this pseudocode: for (int i = 1900 ; i < THIS_YEAR - 13 ; i++) { validyears.options.Add(i); }
Neither - provide a centralized service which can decide which mechanism to use; then the application doesn't care, and you are free to choose hardcoding, sliding-window or database mechanisms.

To expand, typically I would do something like this:

Define an IPopulatableYear interface which has a single AddYear method taking an int and constructing an appropriate ListItem or whatever.

Make MyYearListBox inherit from the regular ListBox and implement IPopulatableYear (this works for WinForms or WebForms).

Create a static method, or a singleton, or a method in your DAL, or whatever. Like this:

    PopulateYears(IPopulatableYear pl)
    {
        // Very simple implementation - change at will
        for (int lp = 2009; lp < 2009 + 10; lp++)
        {
            pl.AddYear(lp);
        }
    }

or

    PopulateYears(IPopulatableYear pl)
    {
        // A DB implementation
        SQLDataReader dr = DAL.YearSet(); // Your choice of mechanism here
        while (dr.Read())
        {
            pl.AddYear(dr[YEAR]);
        }
    }

or

    PopulateYears(IPopulatableYear pl)
    {
        // A DB limits implementation with different ranges defined in the database by key.
        // The key is determined by the control itself, so IPopulatableYear needs to
        // implement a .YearSetKey property.
        SQLDataReader dr = DAL.YearLimits(pl.YearSetKey); // Your choice of mechanism here
        for (int lp = dr[YEAR_MIN]; lp <= dr[YEAR_MAX]; lp++)
        {
            pl.AddYear(lp);
        }
    }

The mechanism is now centrally managed. Use MyYearListBox on your forms and call PopulateYears() on it. If your forms are smart, they can detect all MyYearListBox instances and call it, so you no longer have any new code - just drag it on.
Take a look at Enumerable.Range. I think making DB calls is FAR less performant than Enumerable.Range.
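For example (the 1900 floor and the 13-year minimum-age cutoff here simply mirror the other answer's assumptions):

    using System;
    using System.Linq;

    // Build the list of selectable birth years, newest first.
    int earliestYear = 1900;
    int latestYear = DateTime.Now.Year - 13;   // assumed minimum age, as in the loop answer

    var years = Enumerable.Range(earliestYear, latestYear - earliestYear + 1)
                          .Reverse()
                          .ToList();

    // e.g. bind to a dropdown: ddlBirthYear.DataSource = years; ddlBirthYear.DataBind();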
E) Use a text input box, because that will always work. (Be sure to validate it, of course, as a number. Include "Y2K" and "The year World War II started" in a dictionary of years, of course.)
How you present the year selection in the web form is irrelevant. It's an interface decision. Your server should not trust the data coming in, and should validate it accordingly. It's trivial to emulate a form submission, so it doesn't matter how it's presented. Heck, you can generate the drop down with javascript so there is no load on the server. You can validate with a rule on the backend, rather than a lookup.
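A minimal sketch of such a backend rule; the age bounds are illustrative assumptions:

    using System;

    // Validate the submitted birth year with a simple rule instead of a lookup table.
    public static bool IsValidBirthYear(string input)
    {
        int year;
        if (!int.TryParse(input, out year))
            return false;

        int currentYear = DateTime.Now.Year;
        // Illustrative bounds: no one older than 120, no one younger than 13.
        return year >= currentYear - 120 && year <= currentYear - 13;
    }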
Since you're raising this whole issue (and making a bunch of comments), maybe it's within your power to think long and hard about this. For the end user, it's hard to beat the ease-of-use of a text box. Yup, you're going to get bogus data, but computers are supposed to make things easier, not harder. Scrolling through a long list of years to find the year I know I was born is a nuisance. Especially with all those young whippersnappers and old farts who want to enter birth years that aren't anywhere close to mine! But stepping back even further... do you really need to ask the user their birth year in the first place? Is it that important to your application? Could you avoid the issue entirely by letting somebody else deal with that? Say by using OpenID, Windows Live ID or Facebook Connect?
Detecting honest web crawlers
I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents? If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However if a web crawler is detected, we'd always give them the same version so that the index is consistent. Also I'm using Java, but I would imagine the approach would be similar for any server-side technology.
You said matching the user agent on ‘bot’ may be awkward, but we’ve found it to be a pretty good match. Our studies have shown that it will cover about 98% of the hits you receive. We also haven’t come across any false positive matches yet either. If you want to raise this up to 99.9% you can include a few other well-known matches such as ‘crawler’, ‘baiduspider’, ‘ia_archiver’, ‘curl’ etc. We’ve tested this on our production systems over millions of hits. Here are a few C# solutions for you:

1) Simplest

Is the fastest when processing a miss, i.e. traffic from a non-bot (a normal user). Catches 99+% of crawlers.

    bool iscrawler = Regex.IsMatch(Request.UserAgent,
        @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google",
        RegexOptions.IgnoreCase);

2) Medium

Is the fastest when processing a hit, i.e. traffic from a bot. Pretty fast for misses too. Catches close to 100% of crawlers. Matches ‘bot’, ‘crawler’, ‘spider’ upfront. You can add to it any other known crawlers.

    List<string> Crawlers3 = new List<string>()
    {
        "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
        "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
        "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
        "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
        "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
        "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
        "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
        "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
        "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
        "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
        "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
        "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
        "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
        "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
        "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
        "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
        "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
        "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
        "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
    };

    string ua = Request.UserAgent.ToLower();
    bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));

3) Paranoid

Is pretty fast, but a little slower than options 1 and 2. It’s the most accurate, and allows you to maintain the lists if you want. You can maintain a separate list of names with ‘bot’ in them if you are afraid of false positives in future. If we get a short match we log it and check it for a false positive.
// crawlers that have 'bot' in their useragent List<string> Crawlers1 = new List<string>() { "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot", "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google", "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot", "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot", "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot", "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot", "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot", "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot", "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot" }; // crawlers that don't have 'bot' in their useragent List<string> Crawlers2 = new List<string>() { "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial", "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz", "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder", "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther", "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm", "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider", "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33", "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber", "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox", "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn", "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online", "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler", "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven", "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess", "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy", "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton", "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria", "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e", "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam", "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget", "legs","curl","webs","wget","sift","cmc" }; string ua = Request.UserAgent.ToLower(); string match = null; if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x)); else match = Crawlers2.FirstOrDefault(x => ua.Contains(x)); if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua); bool iscrawler = match != null; Notes: It’s tempting to just keep adding names to the regex option 1. But if you do this it will become slower. If you want a more complete list then linq with lambda is faster. 
Make sure .ToLower() is outside of your LINQ method - remember the method is a loop, and you would be lower-casing the string again on every iteration.

Always put the heaviest bots at the start of the list, so they match sooner.

Put the lists into a static class so that they are not rebuilt on every pageview.

Honeypots

The only real alternative to this is to create a ‘honeypot’ link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.

Positives: It will match some unknown crawlers that aren’t declaring themselves.

Negatives: Not all crawlers dig deep enough to hit every link on your site, so they may not reach your honeypot.
You can find a very thorough database of data on known "good" web crawlers in the robotstxt.org Robots Database. Utilizing this data would be far more effective than just matching bot in the user-agent.
One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users wouldn't see the link, so only spiders and bots would follow it. For example, an empty anchor tag that points to a subfolder would record a GET request in your logs... Many people use this method while running a HoneyPot to catch malicious bots that aren't following the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
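For example, a hidden honeypot link plus a logging handler might look roughly like this; the /trap.ashx path, the CSS hiding, and the logging helper are all illustrative assumptions, not the poster's actual solution:

    // Honeypot handler (illustrative). The page contains a link that humans never see,
    // e.g.:  <a href="/trap.ashx" style="display:none" rel="nofollow">&nbsp;</a>
    // and /trap.ashx is disallowed in robots.txt, so well-behaved crawlers skip it.
    using System.Web;

    public class TrapHandler : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            // Log the user agent and IP of whatever followed the hidden link.
            LogSuspectedBot(context.Request.UserAgent, context.Request.UserHostAddress);

            context.Response.StatusCode = 404; // give the bot nothing useful back
        }

        public bool IsReusable { get { return true; } }

        private static void LogSuspectedBot(string userAgent, string ip)
        {
            // Hypothetical: write to a database table or log file for later classification.
        }
    }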
Any visitor whose entry page is /robots.txt is probably a bot.
Something quick and dirty like this might be a good start: return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i Note: rails code, but regex is generally applicable.
I'm pretty sure a large proportion of bots don't use robots.txt; however, that was my first thought. It seems to me that the best way to detect a bot is by the time between requests: if the time between requests is consistently fast, then it's a bot.
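A rough sketch of that timing heuristic, keeping the last few request times per session and flagging consistently short gaps; the window size and threshold are made up for illustration:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;

    // Illustrative heuristic: if the last several gaps between requests are all very
    // short, the client is probably automated.
    public static class RequestTiming
    {
        public static bool LooksAutomated(HttpContext context)
        {
            var times = context.Session["requestTimes"] as List<DateTime> ?? new List<DateTime>();
            times.Add(DateTime.UtcNow);
            if (times.Count > 6)
                times.RemoveAt(0);               // keep only the most recent requests
            context.Session["requestTimes"] = times;

            if (times.Count < 6)
                return false;                     // not enough data to judge yet

            // Gaps between consecutive requests in the window.
            var gaps = times.Zip(times.Skip(1), (earlier, later) => (later - earlier).TotalMilliseconds);

            // Every gap under half a second => consistently fast => likely a bot.
            return gaps.All(g => g < 500);
        }
    }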
    void CheckBrowserCaps()
    {
        String labelText = "";
        System.Web.HttpBrowserCapabilities myBrowserCaps = Request.Browser;
        if (((System.Web.Configuration.HttpCapabilitiesBase)myBrowserCaps).Crawler)
        {
            labelText = "Browser is a search engine.";
        }
        else
        {
            labelText = "Browser is not a search engine.";
        }
        Label1.Text = labelText;
    }

HttpCapabilitiesBase.Crawler Property