Note:
It's not RedirectToAction() that causes the problem, but rewrite rules in web.config
Original Question:
There is a strange behaviour when using RedirectToAction() which I cannot reproduce locally (debug). At some point my route value gets changed from äa to äa, but only on the server (Azure). I added logging to pinpoint the exact spot and ended up here:
RedirectToAction("square", new { id = criteria.Trim().ToLower() });
Locally this redirects correctly to /find/square/äa, but on the server it ends up at /find/square/äa. I logged every step and when my string is passed to RedirectToAction it's still intact (even after Trim() and ToLower()), but broken after redirection. I'm using the default route
routes.MapRoute("Default", "{controller}/{action}/{id}", new { controller = "Home", action = "Index", id = UrlParameter.Optional }
with controller find, action square and id äa (in this case)
This is pretty surprising to me since .NET is usually very careful and robust when it comes to umlauts and UTF-8. I'm especially lost here, because I'm not able to reproduce the issue locally. I assume it's a server setting, but Azure is pretty sparse at this point. Did anyone experience a similar behaviour before?
It's the one part of my code, that I didn't post, the rewrite rules. I only found out after I had a closer look at the redirects in Fiddler. As a matter of fact I found an additional redirect after the one I was intentionally triggering and it was caused by one of my rewrite rules:
<rule name="Lowercase Path" stopProcessing="true" >
<match url="[A-Z]" ignoreCase="false" />
<conditions>
<add input="{REQUEST_METHOD}" matchType="Pattern" pattern="POST" ignoreCase="true" negate="true" />
<add input="{URL}" pattern="^.*\.(axd|css|js|jpg|jpeg|png|gif|txt|xml|svg|pdf)$" negate="true" ignoreCase="true" />
</conditions>
<action type="Redirect" url="{tolower:{URL}}" redirectType="Permanent" />
</rule>
In fact it's the following line:
<action type="Redirect" url="{tolower:{URL}}" redirectType="Permanent" />
As soon as I disable this rule, everything works. My other rules which don't use {ToLower:} are not producing any problems. I'm not sure, why this doesn't break things locally. While doing my research I found an article describing how Microsoft patched a broken rewrite module which caused similar issues with umlauts. It's speculation, but maybe the rewrite version on my azure instance is unpatched - or it is something completely different.
Unfortunately there are few alternatives to using a rewrite-rule with {tolower:{URL}}, so I had to do a rather ugly workaround and check my URL in Global.asax.cs like so:
protected void Application_BeginRequest(object sender, EventArgs e)
{
var pathAndQuery = System.Web.HttpUtility.UrlDecode(Request.Url.PathAndQuery);
if (pathAndQuery == null || Regex.IsMatch(pathAndQuery, #"^.*\.(axd|css|js|jpg|jpeg|png|gif|txt|xml|svg|pdf)$"))
{
return;
}
if (Regex.IsMatch(pathAndQuery, ".*[A-Z].*"))
{
Response.Redirect(pathAndQuery.ToLower(), false);
Response.StatusCode = 301;
}
}
I'm not entirely happy with this approach; it feels smelly, but until I found a better solution this has to suffice.
Related
I have a web-application for the company's inner needs. However, customers should have access only by some specific URL sent by company.
So, to make things short, site can be accessed by everyone, but only by typing concrete URL. No search engine allowed to index any page of the site.
How can I reach it?
I found an approach based on robots.txt, but could it be implemented in web.config file?
There is a Web.Config solution as well, based on the User Agent. It uses the IIS rewrite module. But it is not foolproof since User Agents can be easly changed. Note that I did not create the list myself but found it recently on the web, but cannot seem to find the article anymore.
<rewrite>
<rules>
<rule name="Abuse User Agents Blocking" stopProcessing="true">
<match url=".*" ignoreCase="false" />
<conditions logicalGrouping="MatchAny">
<add input="{HTTP_USER_AGENT}" pattern="^.*(1Noonbot|1on1searchBot|3D_SEARCH|3DE_SEARCH2|3GSE|50.nu|192.comAgent|360Spider|A6-Indexer|AASP|ABACHOBot|Abonti|abot|AbotEmailSearch|Aboundex|AboutUsBot|AccMonitor\ Compliance|accoona|AChulkov.NET\ page\ walker|Acme.Spider|AcoonBot|acquia-crawler|ActiveTouristBot|Acunetix|Ad\ Muncher|AdamM|adbeat_bot|adminshop.com|Advanced\ Email|AESOP_com_SpiderMan|AESpider|AF\ Knowledge\ Now\ Verity|aggregator:Vocus|ah-ha.com|AhrefsBot|AIBOT|aiHitBot|aipbot|AISIID|AITCSRobot|Akamai-SiteSnapshot|AlexaWebSearchPlatform|AlexfDownload|Alexibot|AlkalineBOT|All\ Acronyms|Amfibibot|AmPmPPC.com|AMZNKAssocBot|Anemone|Anonymous|Anonymouse.org|AnotherBot|AnswerBot|AnswerBus|AnswerChase\ PROve|AntBot|antibot-|AntiSantyWorm|Antro.Net|AONDE-Spider|Aport|Aqua_Products|AraBot|Arachmo|Arachnophilia|archive.org_bot|aria\ eQualizer|arianna.libero.it|Arikus_Spider|Art-Online.com|ArtavisBot|Artera|ASpider|ASPSeek|asterias|AstroFind|athenusbot|AtlocalBot|Atomic_Email_Hunter|attach|attrakt|attributor|Attributor.comBot|augurfind|AURESYS|AutoBaron|autoemailspider|autowebdir|AVSearch-|axfeedsbot|Axonize-bot|Ayna|b2w|BackDoorBot|BackRub|BackStreet\ Browser|BackWeb|Baiduspider-video|Bandit|BatchFTP|baypup|BDFetch|BecomeBot|BecomeJPBot|BeetleBot|Bender|besserscheitern-crawl|betaBot|Big\ Brother|Big\ Data|Bigado.com|BigCliqueBot|Bigfoot|BIGLOTRON|Bilbo|BilgiBetaBot|BilgiBot|binlar|bintellibot|bitlybot|BitvoUserAgent|Bizbot003|BizBot04|BizBot04\ kirk.overleaf.com|Black.Hole|Black\ Hole|Blackbird|BlackWidow|bladder\ fusion|Blaiz-Bee|BLEXBot|Blinkx|BlitzBOT|Blog\ Conversation\ Project|BlogMyWay|BlogPulseLive|BlogRefsBot|BlogScope|Blogslive|BloobyBot|BlowFish|BLT|bnf.fr_bot|BoaConstrictor|BoardReader-Image-Fetcher|BOI_crawl_00|BOIA-Scan-Agent|BOIA.ORG-Scan-Agent|boitho.com-dc|Bookmark\ Buddy|bosug|Bot\ Apoena|BotALot|BotRightHere|Botswana|bottybot|BpBot|BRAINTIME_SEARCH|BrokenLinkCheck.com|BrowserEmulator|BrowserMob|BruinBot|BSearchR&D|BSpider|btbot|Btsearch|Buddy|Buibui|BuildCMS|BuiltBotTough|Bullseye|bumblebee|BunnySlippers|BuscadorClarin|Butterfly|BuyHawaiiBot|BuzzBot|byindia|BySpider|byteserver|bzBot|c\ r\ a\ w\ l\ 3\ r|CacheBlaster|CACTVS\ Chemistry|Caddbot|Cafi|Camcrawler|CamelStampede|Canon-WebRecord|Canon-WebRecordPro|CareerBot|casper|cataguru|CatchBot|CazoodleBot|CCBot|CCGCrawl|ccubee|CD-Preload|CE-Preload|Cegbfeieh|Cerberian\ Drtrs|CERT\ FigleafBot|cfetch|CFNetwork|Chameleon|ChangeDetection|Charlotte|Check&Get|Checkbot|Checklinks|checkprivacy|CheeseBot|ChemieDE-NodeBot|CherryPicker|CherryPickerElite|CherryPickerSE|Chilkat|ChinaClaw|CipinetBot|cis455crawler|citeseerxbot|cizilla.com|ClariaBot|clshttp|Clushbot|cmsworldmap|coccoc|CollapsarWEB|Collector|combine|comodo|conceptbot|ConnectSearch|conpilot|ContentSmartz|ContextAd|contype|cookieNET|CoolBott|CoolCheck|Copernic|Copier|CopyRightCheck|core-project|cosmos|Covario-IDS|Cowbot-|Cowdog|crabbyBot|crawl|Crawl_Application|crawl.UserAgent|CrawlConvera|crawler|crawler_for_infomine|CRAWLER-ALTSE.VUNET.ORG-Lynx|crawler-upgrade-config|crawler.kpricorn.org|crawler#|crawler4j|crawler43.ejupiter.com|Crawly|CreativeCommons|Crescent|Crescent\ Internet\ ToolPak\ HTTP\ OLE\ Control|cs-crawler|CSE\ HTML\ Validator|CSHttpClient|Cuasarbot|culsearch|Curl|Custo|Cutbot|cvaulev|Cyberdog|CyberNavi_WebGet|CyberSpyder|CydralSpider).*$" />
<add input="{HTTP_USER_AGENT}" pattern="^.*(D1GArabicEngine|DA|DataCha0s|DataFountains|DataparkSearch|DataSpearSpiderBot|DataSpider|Dattatec.com|Dattatec.com-Sitios-Top|Daumoa|DAUMOA-video|DAUMOA-web|Declumbot|Deepindex|deepnet|DeepTrawl|dejan|del.icio.us-thumbnails|DelvuBot|Deweb|DiaGem|Diamond|DiamondBot|diavol|DiBot|didaxusbot|DigExt|Digger|DiGi-RSSBot|DigitalArchivesBot|DigOut4U|DIIbot|Dillo|Dir_Snatch.exe|DISCo|DISCo\ Pump|discobot|DISCoFinder|Distilled-Reputation-Monitor|Dit|DittoSpyder|DjangoTraineeBot|DKIMRepBot|DoCoMo|DOF-Verify|domaincrawler|DomainScan|DomainWatcher|dotbot|DotSpotsBot|Dow\ Jonesbot|Download|Download\ Demon|Downloader|DOY|dragonfly|Drip|drone|DTAAgent|dtSearchSpider|dumbot|Dwaar|Dwaarbot|DXSeeker|EAH|EasouSpider|EasyDL|ebingbong|EC2LinkFinder|eCairn-Grabber|eCatch|eChooseBot|ecxi|EdisterBot|EduGovSearch|egothor|eidetica.com|EirGrabber|ElisaBot|EllerdaleBot|EMail\ Exractor|EmailCollector|EmailLeach|EmailSiphon|EmailWolf|EMPAS_ROBOT|EnaBot|endeca|EnigmaBot|Enswer\ Neuro|EntityCubeBot|EroCrawler|eStyleSearch|eSyndiCat|Eurosoft-Bot|Evaal|Eventware|Everest-Vulcan|Exabot|Exabot-Images|Exabot-Test|Exabot-XXX|ExaBotTest|ExactSearch|exactseek.com|exooba|Exploder|explorersearch|extract|Extractor|ExtractorPro|EyeNetIE|ez-robot|Ezooms|factbot|FairAd\ Client|falcon|Falconsbot|fast-search-engine|FAST\ Data\ Document|FAST\ ESP|fastbot|fastbot.de|FatBot|Favcollector|Faviconizer|FDM|FedContractorBot|feedfinder|FelixIDE|fembot|fetch_ici|Fetch\ API\ Request|fgcrawler|FHscan|fido|Filangy|FileHound|FindAnISP.com_ISP_Finder|findlinks|FindWeb|Firebat|Fish-Search-Robot|Flaming\ AttackBot|Flamingo_SearchEngine|FlashCapture|FlashGet|flicky|FlickySearchBot|flunky|focused_crawler|FollowSite|Foobot|Fooooo_Web_Video_Crawl|Fopper|FormulaFinderBot|Forschungsportal|fr_crawler|Francis|Freecrawl|FreshDownload|freshlinks.exe|FriendFeedBot|frodo.at|froGgle|FrontPage|Froola|FU-NBI|full_breadth_crawler|FunnelBack|FunWebProducts|FurlBot|g00g1e|G10-Bot|Gaisbot|GalaxyBot|gazz|gcreep|generate_infomine_category_classifiers|genevabot|genieBot|GenieBotRD_SmallCrawl|Genieo|Geomaxenginebot|geometabot|GeonaBot|GeoVisu|GermCrawler|GetHTMLContents|Getleft|GetRight|GetSmart|GetURL.rexx|GetWeb!|Giant|GigablastOpenSource|Gigabot|Girafabot|GleameBot|gnome-vfs|Go-Ahead-Got-It|Go!Zilla|GoForIt.com|GOFORITBOT|gold|Golem|GoodJelly|Gordon-College-Google-Mini|goroam|GoSeebot|gotit|Govbot|GPU\ p2p|grab|Grabber|GrabNet|Grafula|grapeFX|grapeshot|GrapeshotCrawler|grbot|GreenYogi\ [ZSEBOT]|Gromit|GroupMe|grub|grub-client|Grubclient-|GrubNG|GruBot|gsa|GSLFbot|GT::WWW|Gulliver|GulperBot|GurujiBot|GVC|GVC\ BUSINESS|gvcbot.com|HappyFunBot|harvest|HarvestMan|Hatena\ Antenna|Hawler|Hazel's\ Ferret\ hopper|hcat|hclsreport-crawler|HD\ nutch\ agent|Header_Test_Client|healia|Helix|heritrix|hijbul-heritrix-crawler|HiScan|HiSoftware\ AccMonitor|HiSoftware\ AccVerify|hitcrawler_|hivaBot|hloader|HMSEbot|HMView|hoge|holmes|HomePageSearch|Hooblybot-Image|HooWWWer|Hostcrawler|HSFT\ -\ Link|HSFT\ -\ LVU|HSlide|ht:|htdig|Html\ Link\ Validator|HTMLParser|HTTP::Lite|httplib|HTTrack|Huaweisymantecspider|hul-wax|humanlinks|HyperEstraier|Hyperix).*$" />
<add input="{HTTP_USER_AGENT}" pattern="^.*(ia_archiver|IAArchiver-|ibuena|iCab|ICDS-Ingestion|ichiro|iCopyright\ Conductor|id-search|IDBot|IEAutoDiscovery|IECheck|iHWebChecker|IIITBOT|iim_405|IlseBot|IlTrovatore|Iltrovatore-Setaccio|ImageBot|imagefortress|ImagesHereImagesThereImagesEverywhere|ImageVisu|imds_monitor|imo-google-robot-intelink|IncyWincy|Industry\ Cortexcrawler|Indy\ Library|indylabs_marius|InelaBot|Inet32\ Ctrl|inetbot|InfoLink|INFOMINE|infomine.ucr.edu|InfoNaviRobot|Informant|Infoseek|InfoTekies|InfoUSABot|INGRID|Inktomi|InsightsCollector|InsightsWorksBot|InspireBot|InsumaScout|Intelix|InterGET|Internet\ Ninja|InternetLinkAgent|Interseek|IOI|ip-web-crawler.com|Ipselonbot|Iria|IRLbot|Iron33|Isara|iSearch|iSiloX|IsraeliSearch|IstellaBot|its-learning|IU_CSCI_B659_class_crawler|iVia|iVia\ Page\ Fetcher|JadynAve|JadynAveBot|jakarta|Jakarta\ Commons-HttpClient|Java|Jbot|JemmaTheTourist|JennyBot|Jetbot|JetBrains\ Omea\ Pro|JetCar|Jim|JoBo|JobSpider_BA|JOC|JoeDog|JoyScapeBot|JSpyda|JubiiRobot|jumpstation|Junut|JustView|Jyxobot|K.S.Bot|KakcleBot|kalooga|KaloogaBot|kanagawa|KATATUDO-Spider|Katipo|kbeta1|Kenjin.Spider|KeywenBot|Keyword.Density|Keyword\ Density|kinjabot|KIT-Fireball|Kitenga-crawler-bot|KiwiStatus|kmbot-|kmccrew|Knight|KnowItAll|Knowledge.com|Knowledge\ Engine|KoepaBot|Koninklijke|KrOWLer|KSbot|kuloko-bot|kulturarw3|KummHttp|Kurzor|Kyluka|L.webis|LabelGrab|Labhoo|labourunions411|lachesis|Lament|LamerExterminator|LapozzBot|larbin|LARBIN-EXPERIMENTAL|LBot|LeapTag|LeechFTP|LeechGet|LetsCrawl.com|LexiBot|LexxeBot|lftp|libcrawl|libiViaCore|libWeb|libwww|libwww-perl|likse|Linguee|Link|link_checker|LinkAlarm|linkbot|LinkCheck\ by\ Siteimprove.com|LinkChecker|linkdex.com|LinkextractorPro|LinkLint|linklooker|Linkman|LinkScan|LinksCrawler|LinksManager.com_bot|LinkSweeper|linkwalker|LiteFinder|LitlrBot|Little\ Grabber\ at\ Skanktale.com|Livelapbot|LM\ Harvester|LMQueueBot|LNSpiderguy|LoadTimeBot|LocalcomBot|locust|LolongBot|LookBot|Lsearch|lssbot|LWP|lwp-request|lwp-trivial|LWP::Simple|Lycos_Spider|Lydia\ Entity|LynnBot|Lytranslate|Mag-Net|Magnet|magpie-crawler|Magus|Mail.Ru|Mail.Ru_Bot|MAINSEEK_BOT|Mammoth|MarkWatch|MaSagool|masidani_bot_|Mass|Mata.Hari|Mata\ Hari|matentzn\ at\ cs\ dot\ man\ dot\ ac\ dot\ uk|maxamine.com--robot|maxamine.com-robot|maxomobot|Maxthon$|McBot|MediaFox|medrabbit|Megite|MemacBot|Memo|MendeleyBot|Mercator-|mercuryboard_user_agent_sql_injection.nasl|MerzScope|metacarta|Metager2|metager2-verification-bot|MetaGloss|METAGOPHER|metal|metaquerier.cs.uiuc.edu|METASpider|Metaspinner|MetaURI|MetaURI\ API|MFC_Tear_Sample|MFcrawler|MFHttpScan|Microsoft.URL|MIIxpc|miner|mini-robot|minibot|miniRank|Mirror|Missigua\ Locator|Mister.PiX|Mister\ PiX|Miva|MJ12bot|mnoGoSearch|mod_accessibility|moduna.com|moget|MojeekBot|MOMspider|MonkeyCrawl|MOSES|Motor|mowserbot|MQbot|MSE360|MSFrontPage|MSIECrawler|MSIndianWebcrawl|MSMOBOT|Msnbot|msnbot-products|MSNPTC|MSRBOT|MT-Soft|MultiText|My_Little_SearchEngine_Project|my-heritrix-crawler|MyApp|MYCOMPANYBOT|mycrawler|MyEngines-US-Bot|MyFamilyBot|Myra|nabot|nabot_|Najdi.si|Nambu|NAMEPROTECT|NatchCVS|naver|naverbookmarkcrawler|NaverBot|Navroad|NearSite|NEC-MeshExplorer|NeoScioCrawler|NerdByNature.Bot|NerdyBot|Nerima-crawl-).*$" />
<add input="{HTTP_USER_AGENT}" pattern="^.*(T-H-U-N-D-E-R-S-T-O-N-E|Tailrank|tAkeOut|TAMU_CRAWLER|TapuzBot|Tarantula|targetblaster.com|TargetYourNews.com|TAUSDataBot|taxinomiabot|Tecomi|TeezirBot|Teleport|Teleport\ Pro|TeleportPro|Telesoft|Teradex\ Mapper|TERAGRAM_CRAWLER|TerrawizBot|testbot|testing\ of|TextBot|thatrobotsite.com|The.Intraformant|The\ Dyslexalizer|The\ Intraformant|TheNomad|Theophrastus|theusefulbot|TheUsefulbot_|ThumbBot|thumbshots-de-bot|tigerbot|TightTwatBot|TinEye|Titan|to-dress_ru_bot_|to-night-Bot|toCrawl|Topicalizer|topicblogs|Toplistbot|TopServer\ PHP|topyx-crawler|Touche|TourlentaScanner|TPSystem|TRAAZI|TranSGeniKBot|travel-search|TravelBot|TravelLazerBot|Treezy|TREX|TridentSpider|Trovator|True_Robot|tScholarsBot|TsWebBot|TulipChain|turingos|turnit|TurnitinBot|TutorGigBot|TweetedTimes|TweetmemeBot|TwengaBot|TwengaBot-Discover|Twiceler|Twikle|twinuffbot|Twisted\ PageGetter|Twitturls|Twitturly|TygoBot|TygoProwler|Typhoeus|U.S.\ Government\ Printing\ Office|uberbot|ucb-nutch|UCSD-Crawler|UdmSearch|UFAM-crawler-|Ultraseek|UnChaos|unchaos_crawler_|UnisterBot|UniversalSearch|UnwindFetchor|UofTDB_experiment|updated|URI::Fetch|url_gather|URL-Checker|URL\ Control|URLAppendBot|URLBlaze|urlchecker|urlck|UrlDispatcher|urllib|URLSpiderPro|URLy.Warning|USAF\ AFKN\|usasearch|USS-Cosmix|USyd-NLP-Spider|Vacobot|Vacuum|VadixBot|Vagabondo|Validator|Valkyrie|vBSEO|VCI|VerbstarBot|VeriCiteCrawler|Verifactrola|Verity-URL-Gateway|vermut|versus|versus.integis.ch|viasarchivinginformation.html|vikspider|VIP|VIPr|virus-detector|VisBot|Vishal\ For\ CLIA|VisWeb|vlad|vlsearch|VMBot|VocusBot|VoidEYE|VoilaBot|Vortex|voyager|voyager-hc|voyager-partner-deep|VSE|vspider).*$" />
<add input="{HTTP_USER_AGENT}" pattern="^.*(W3C_Unicorn|W3C-WebCon|w3m|w3search|wacbot|wastrix|Water\ Conserve|Water\ Conserve\ Portal|WatzBot|wauuu\ engine|Wavefire|Waypath|Wazzup|Wazzup1.0.4800|wbdbot|web-agent|Web-Sniffer|Web.Image.Collector|Web\ CEO\ Online|Web\ Image\ Collector|Web\ Link\ Validator|Web\ Magnet|webalta|WebaltBot|WebAuto|webbandit|webbot|webbul-bot|WebCapture|webcheck|Webclipping.com|webcollage|WebCopier|WebCopy|WebCorp|webcrawl.net|webcrawler|WebDownloader\ for|Webdup|WebEMailExtrac|WebEMailExtrac.*|WebEnhancer|WebFerret|webfetch|WebFetcher|WebGather|WebGo\ IS|webGobbler|WebImages|Webinator-search2.fasthealth.com|Webinator-WBI|WebIndex|WebIndexer|weblayers|WebLeacher|WeblexBot|WebLinker|webLyzard|WebmasterCoffee|WebmasterWorld|WebmasterWorldForumBot|WebMiner|WebMoose|WeBot|WebPix|WebReaper|WebRipper|WebSauger|Webscan|websearchbench|WebSite|websitemirror|WebSpear|websphinx.test|WebSpider|Webster|Webster.Pro|Webster\ Pro|WebStripper|WebTrafficExpress|WebTrends\ Link\ Analyzer|webvac|webwalk|WebWalker|Webwasher|WebWatch|WebWhacker|WebXM|WebZip|Weddings.info|wenbin|WEPA|WeRelateBot|Whacker|Widow|WikiaBot|Wikio|wikiwix-bot-|WinHttp.WinHttpRequest|WinHTTP\ Example|WIRE|wired-digital-newsbot|WISEbot|WISENutbot|wish-la|wish-project|wisponbot|WMCAI-robot|wminer|WMSBot|woriobot|worldshop|WorQmada|Wotbox|WPScan|wume_crawler|WWW-Mechanize|www.freeloader.com.|WWW\ Collector|WWWOFFLE|wwwrobot|wwwster|WWWWanderer|wwwxref|Wysigot|X-clawler|Xaldon|Xenu|Xenu's|Xerka\ MetaBot|XGET|xirq|XmarksFetch|XoviBot|xqrobot|Y!J|Y!TunnelPro|yacy.net|yacybot|yarienavoir.net|Yasaklibot|yBot|YebolBot|yellowJacket|yes|YesupBot|Yeti|YioopBot|YisouSpider|yolinkBot|yoogliFetchAgent|yoono|Yoriwa|YottaCars_Bot|you-dir|Z-Add\ Link|zagrebin|Zao|zedzo.digest|zedzo.validate|zermelo|Zeus|Zeus\ Link\ Scout|zibber-v|zimeno|Zing-BottaBot|ZipppBot|zmeu|ZoomSpider|ZuiBot|ZumBot|Zyborg|Zyte).*$" />
<add input="{HTTP_USER_AGENT}" pattern="^.*(Nessus|NESSUS::SOAP|nestReader|Net::Trackback|NetAnts|NetCarta\ CyberPilot\ Pro|Netcraft|NetID.com|NetMechanic|Netprospector|NetResearchServer|NetScoop|NetSeer|NetShift=|NetSongBot|Netsparker|NetSpider|NetSrcherP|NetZip|NetZip-Downloader|NewMedhunt|news|News_Search_App|NewsGatherer|Newsgroupreporter|NewsTroveBot|NextGenSearchBot|nextthing.org|NHSEWalker|nicebot|NICErsPRO|niki-bot|NimbleCrawler|nimbus-1|ninetowns|Ninja|NjuiceBot|NLese|Nogate|Nomad-V2.x|NoteworthyBot|NPbot|NPBot-|NRCan\ intranet|NSDL_Search_Bot|nu_tch-princeton|nuggetize.com|nutch|nutch1|NutchCVS|NutchOrg|NWSpider|Nymesis|nys-crawler|ObjectsSearch|oBot|Obvius\ external\ linkcheck|Occam|Ocelli|Octopus|ODP\ entries|Offline.Explorer|Offline\ Explorer|Offline\ Navigator|OGspider|OmiExplorer_Bot|OmniExplorer_Bot|omnifind|OmniWeb|OnetSzukaj|online\ link\ validator|OOZBOT|Openbot|Openfind|Openfind\ data|OpenHoseBot|OpenIntelligenceData|OpenISearch|OpenSearchServer_Bot|OpiDig|optidiscover|OrangeBot|ORISBot|ornl_crawler_1|ORNL_Mercury|osis-project.jp|OsO|OutfoxBot|OutfoxMelonBot|OWLER-BOT|owsBot|ozelot|P3P\ Client|page_verifier|PageBitesHyperBot|Pagebull|PageDown|PageFetcher|PageGrabber|PagePeeker|PageRank\ Monitor|pamsnbot.htm|Panopy|panscient.com|Pansophica|Papa\ Foto|PaperLiBot|parasite|parsijoo|Pathtraq|Pattern|Patwebbot|pavuk|PaxleFramework|PBBOT|pcBrowser|pd-crawler|PECL::HTTP|penthesila|PeoplePal|perform_crawl|PerMan|PGP-KA|PHPCrawl|PhpDig|PicoSearch|pipBot|pipeLiner|Pita|pixfinder|PiyushBot|planetwork|PleaseCrawl|Plucker|Plukkie|Plumtree|Pockey|Pockey-GetHTML|PoCoHTTP|pogodak.ba|Pogodak.co.yu|Poirot|polybot|Pompos|Poodle\ predictor|PopScreenBot|PostPost|PrivacyFinder|ProjectWF-java-test-crawler|ProPowerBot|ProWebWalker|PROXY|psbot|psbot-page|PSS-Bot|psycheclone|pub-crawler|pucl|pulseBot\ \(pulse|Pump|purebot|PWeBot|pycurl|Python-urllib|pythonic-crawler|PythonWikipediaBot|q1|QEAVis\ agent|QFKBot|qualidade|Qualidator.com|QuepasaCreep|QueryN.Metasearch|QueryN\ Metasearch|quest.durato|Quintura-Crw|QunarBot|Qweery_robot.txt_CheckBot|QweeryBot|r2iBot|R6_CommentReader|R6_FeedFetcher|R6_VoteReader|RaBot|Radian6|radian6_linkcheck|RAMPyBot|RankurBot|RcStartBot|RealDownload|Reaper|REBI-shoveler|Recorder|RedBot|RedCarpet|ReGet|RepoMonkey|RepoMonkey\ Bait|Riddler|RIIGHTBOT|RiseNetBot|RiverglassScanner|RMA|RoboPal|Robosourcer|robot|robotek|robots|Robozilla|rogerBot|Rome\ Client|Rondello|Rotondo|Roverbot|RPT-HTTPClient|rtgibot|RufusBot|Runnk\ online\ rss\ reader|s~stremor-crawler|S2Bot|SafariBookmarkChecker|SaladSpoon|Sapienti|SBIder|SBL-BOT|SCFCrawler|Scich|ScientificCommons.org|ScollSpider|ScooperBot|Scooter|ScoutJet|ScrapeBox|Scrapy|SCrawlTest|Scrubby|scSpider|Scumbot|SeaMonkey$|Search-Channel|Search-Engine-Studio|search.KumKie.com|search.msn.com|search.updated.com|search.usgs.gov|Search\ Publisher|Searcharoo.NET|SearchBlox|searchbot|searchengine|searchhippo.com|SearchIt-Bot|searchmarking|searchmarks|searchmee_v|SearchmetricsBot|searchmining|SearchnowBot_v1|searchpreview|SearchSpider.com|SearQuBot|Seekbot|Seeker.lookseek.com|SeeqBot|seeqpod-vertical-crawler|Selflinkchecker|Semager|semanticdiscovery|Semantifire1|semisearch|SemrushBot|Senrigan|SEOENGWorldBot|SeznamBot|ShablastBot|ShadowWebAnalyzer|Shareaza|Shelob|sherlock|ShopWiki|ShowLinks|ShowyouBot|siclab|silk|Siphon|SiteArchive|SiteCheck-sitecrawl|sitecheck.internetseer.com|SiteFinder|SiteGuardBot|SiteOrbiter|SiteSnagger|SiteSucker|SiteSweeper|SiteXpert|SkimBot|SkimWordsBot|SkreemRBot|skygrid|Skywalker|Sleipnir|slow-crawler|SlySearch|smart-crawler|SmartDownload|Smarte|smartwit.com|Snake|Snapbot|SnapPreviewBot|Snappy|snookit|Snooper|Snoopy|SocialSearcher|SocSciBot|SOFT411\ Directory|sogou|sohu-search|sohu\ agent|Sokitomi|Solbot|sootle|Sosospider|Space\ Bison|Space\ Fung|SpaceBison|SpankBot|spanner|Spatineo\ Monitor\ Controller|special_archiver|SpeedySpider|Sphider|Sphider2|spider|Spider.TerraNautic.net|SpiderEngine|SpiderKU|SpiderMan|Spinn3r|Spinne|sportcrew-Bot|spyder3.microsys.com|sqlmap|Squid-Prefetch|SquidClamAV_Redirector|Sqworm|SrevBot|sslbot|SSM\ Agent|StackRambler|StarDownloader|statbot|statcrawler|statedept-crawler|Steeler|STEGMANN-Bot|stero|Stripper|Stumbler|suchclip|sucker|SumeetBot|SumitBot|SummizeBot|SummizeFeedReader|SuperBot|superbot.com|SuperHTTP|SuperLumin|SuperPagesBot|Supybot|SURF|Surfbot|SurfControl|SurveyBot|suzuran|SWEBot|swish-e|SygolBot|SynapticWalker|Syntryx\ ANT\ Scout\ Chassis\ Pheromone|SystemSearch-robot|Szukacz).*$" />
</conditions>
<action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
</rule>
</rules>
</rewrite>
More info:
http://www.blogtips.org/web-crawlers-love-the-good-but-kill-the-bad-and-the-ugly/
https://perishablepress.com/eight-ways-to-blacklist-with-apaches-mod_rewrite/
No, it cant be achieved by web config file.
Dealing with search bots is done with robots.txt, its pretty simple solution. As written above, crawlers cannot see your web config file.
But, in code you can determine that HttpUserAgent is google bot, after you determine that, u dont write content to web page, just send Response code 404, or which i think is better, code 410.
Im not 100% behind above solution because there was no need to use that, but logically i think that would do the same. And you would just waste google energy to crawl your page :)
Web.config is obviously not accessible to users or crawlers. You should be able to create a route "/robots.txt", if your application resides in the root folder of the website, and return the text that normally should be inside that file from a controller, OWIN middleware or HTTP handler. But why?
As explained in other answers the way to prevent bots from crawling a page is by using robots.txt.
However based on your description this might not even be needed.
Here is why:
In order for a page to be crawled it needs to be linked to. There are two ways for this to happen:
The page is linked in your root/front web page (i.e. the page shown by typing httpx://mysite.domain/). My understanding is this does not happen. The information you want out of search engines is only available from specific deep (secret?) URLs not linked in the web site.
The page is linked another web-site. I.e. someone posts the URL to the information in a different web-site. In this case if you want to prevent the content of the page being crawled you need to use robots.txt, however the URL itself will be indexed in the search engine as this is information in a different web page.
The question then is how likely is option 2? For an example of public URLs that are not indexed unless published by a third party see dropbox/google drive/one drive URLs used for sharing to everyone. The URL usually includes a GUID which by definition is practically impossible to guess or brute force.
in the following relative url I want to get the optional origin parameter but I just need the origin itself (that would be test in the following url not with-test-origin)
/somequery/with-test-origin/param/param/more
the url matches when I use this regular expression:
([^/]+)/(?:(?:with-)([^/]*)(-origin)/)?([^/]*/)?([^/]*/)?more
but not when I use this:
([^/]+)/(?:(?:with-)([^/]*)(?:-origin)/)?([^/]*/)?([^/]*/)?more
what is the problem? why it doesn't work when I don't want to catch (-origin) group?
Edit:
I test it in an application it worked fine. I also tried this in web.config:
([^/]+)/(?:(with-)([^/]*)(?:-origin)/)?([^/]*/)?([^/]*/)?more
in web.config (removed ?: from (?:with-) and put it back in (?:-origin) and it worked in web.config too. Is there a limitation in web.config file for the amount of groups or the groups I don't want to catch?
this is how I use it in my web.config file:
<rule name="tst" enabled="true" stopProcessing="true">
<match url="([^/]+)/(?:(?:with-)([^/]*)(?:-origin)/)?([^/]*/)?([^/]*/)?more" />
<action type="Rewrite" url="moreinfo.aspx?cat={R:1}&sub1={R:2}&sub2={R:3}&sub3={R:4}&sub4={R:5}" />
</rule>
The matching behavior should not change based on this difference. The problem is either:
Something else in your web.config file.
A bug in the regex engine or web.config.
After hours of struggles I found the problem at last.
the matches are zero based and are like {R:0} {R:1} , ...
When using ?: as the number of matches were reduced by 1 so for example the {R:1} that were valid before adding ?: turns to be invalid.
I am specifying a custom route and an action. What am I doing wrong? I am getting a 404 error. I set it up similar to this example:
http://www.codeproject.com/Articles/190267/Controllers-and-Routers-in-ASP-NET-MVC
URL:
http://localhost:14133/ScanSummary/mywebsite.com
Route:
routes.MapRoute(
"ScanSummary",
"ScanSummary/{domain}",
new { controller = "ScanSummary", action = "Get" }
);
Controller:
public class ScanSummaryController : Controller
{
public ActionResult Get(string domain)
{
return View();
}
}
Because of the .com extension, IIS thinks that this is a file on the disk and attempts to serve it directly instead of going through the pipeline.
One way to fix the issue is by running managed modules for all requests:
<system.webServer>
<modules runAllManagedModulesForAllRequests="true" />
...
</system.webServer>
Another (and IMHO better way) is to explicitly map the MVC handler to this endpoint:
<system.webServer>
<modules runAllManagedModulesForAllRequests="false" />
<handlers>
...
<add
name="SvanSummaryHandler"
path="ScanSummary/*"
verb="GET,HEAD,POST"
type="System.Web.Handlers.TransferRequestHandler"
preCondition="integratedMode,runtimeVersionv4.0" />
</handlers>
Also I would recommend you reading the following blog post to better understand the gotchas that are awaiting you if you intend to be passing arbitrary characters in the path portion of an URL: http://www.hanselman.com/blog/ExperimentsInWackinessAllowingPercentsAnglebracketsAndOtherNaughtyThingsInTheASPNETIISRequestURL.aspx
And simply follow what Scott Hanselmann suggests:
After ALL this effort to get crazy stuff in the Request Path, it's
worth mentioning that simply keeping the values as a part of the Query
String (remember WAY back at the beginning of this post?) is easier,
cleaner, more flexible, and more secure.
Oh and to make the things even funnier you may take a look at this blog post: http://bitquabit.com/post/zombie-operating-systems-and-aspnet-mvc/
After reading those did you start using query strings? I hope you did.
you need to replace
new { controller = "ScanSummary", action = "Get" }
with
new { controller = "ScanSummary", action = "Get", domain = UrlParameter.Optional }
I have rewritten an old application that then have quite a lot of external applications that redirects to a specific url containging the ?id={someid} querystring.. however.. episerver seems to do something with the id-key.. since that's never displayed in Request.QueryStrings.AllKeys..So my guess is that epi is doing some kind of url-rewriting or something.. is there any way to get around this and be able to use the id-key for only a specific page/location?
I would suggest using the IIS 7 url rewrite module to transform the id querystring key to something that will not conflict with EPiServer, as id is used for the page id in EPiServers internal url after the friendly url has been converted within the url rewrite provider.
The IIS url rewrite module would give you a chance to change this before hitting EPiServer.
You could write a regular expression to capture the id param, so you could add the following rule to your rewrite configuration
<rule name="QueryString">
<match url="(.*)" />
<conditions logicalGrouping="MatchAll" trackAllCaptures="false">
<add input="{QUERY_STRING}" pattern="(.*)id=([a-zA-Z0-9_-]+)(.*)" />
</conditions>
<action type="Rewrite" url="{R:1}?{C:1}oldid={C:2}{C:3}" appendQueryString="false" />
</rule>
Which would rewrite the url to /my-url-path/?oldid=123&another=param
More on how to implement as a rule in the IIS url rewrite module here
The querystring parameter id is reserved by EPiServer. You need to check it before EPi touches it.
Update: If you step up to EPi 7.5 the query parameter “ID” is no longer handled by EPi unless you register a EPiServer.Web.Routing.ClassicLinkRoute.
So I have a route like this in my MVC 3 application running under IIS 7:
routes.MapRoute(
"VirtualTourConfig",
"virtualtour/config.xml",
new { controller = "VirtualTour", action = "Config" }
);
The trick is that a file actually exists at /virtualtour/config.xml. It seems like the request is just returning the xml file at that location instead of hitting the route, which processes the XML, makes some changes and returns a custom XmlResult.
Any suggestions on how I can tell my application to hit the route and not the actual file in the event that the file exists on disk?
EDIT: It appears that I can use routes.RouteExistingFiles = true; in the RegisterRoutes method of Global.asax to tell the application to ignore files on disk. This, however, sets the flag globally and breaks a lot of other requests within the application. For example, I still want calls to /assets/css/site.css to return the CSS file without having to specifically set routes up for each static asset. So now the question becomes, is there a way to do this on a per-route basis?
So far the best answer to this that I have found is to globally apply routes.RouteExistingFiles=true and then selectively ignore the routes I want to pass through to existing files like .js, .css, etc. So I ended up with something like this:
public static void RegisterRoutes(RouteCollection routes)
{
routes.IgnoreRoute("{resource}.axd/{*pathInfo}");
routes.IgnoreRoute("*.js|css|swf");
routes.RouteExistingFiles = true;
routes.MapRoute(
"VirtualTourConfig",
"virtualtour/config.xml",
new { controller = "VirtualTour", action = "Config" }
);
}
If anyone has a better solution, I'd like to see it. I'd much prefer to selectively apply an "RouteExistingFIles" flag to individual routes but I don't know if there's a way to do that.
No solution here, just an idea.
You can try to implement solution based on your own VirtualPathProvider that will provide your own mapping of web paths to file system paths and use default provider for all paths you don't want to take care of.
A very simple solution, certainly if you consider Scott's answer, would be eg: routes.IgnoreRoute("templates/*.html");,
routes.IgnoreRoute("scripts/*.js"); and routes.IgnoreRoute("styles/*.css");.
This gives you both the simplicity of just supplying paths to the RoutesCollection and avoids the work of having to initiate eg. a VirtualPathProvider.
You could try UrlRewiting e.g.:
<configuration>
<system.webServer>
<rewrite>
<rules>
<rule name="VirtualTourConfig">
<match url="^virtualtour/config.xml" />
<action type="Rewrite" url="virtualtour/config" />
</rule>
</rules>
</rewrite>
NB I'm using this for a different use-case (i.e. serving an angular app from asp.net MVC) - I haven't tested whether MVC routing occurs AFTER url rewriting.
In my case I have a normal MVC route (i.e. /Dashboard/ => DashboardController.Index()) but need all relative paths in the Views/Dashboard/Index.cshtml to serve static files without getting confused by MVC routing i.e. /Dashboard/app/app.module.js must serve a static file. I use UrlRewriting to map /Dashboard/(.+) to /ng/Dashboard/{R:1} as follows:
<configuration>
<system.webServer>
<rewrite>
<rules>
<rule name="Dashboard">
<match url="^dashboard/(.+)" />
<action type="Rewrite" url="ng/dashboard/{R:1}" />
</rule>
</rules>
</rewrite>