Reserved Symbols (&, /, etc.) in URL SharePoint REST Query - Bad Request - C#

One problem I have been facing off and on for the past few weeks was searching SharePoint for a list item value and getting a Bad Request error. Two symbols were causing problems: I could not search for anything containing an & (ampersand), and the same was true for a / (forward slash).
My code looked like:
ServiceContext context = new ServiceContext(new Uri("http://URL/_vti_bin/listdata.svc"));
context.Credentials = CredentialCache.DefaultCredentials;
var requestType = (System.Data.Services.Client.DataServiceQuery<ListTypeValue>)context.ListType
.Where(v => v.Value.Equals(search));
After searching the internet, nothing useful came back beyond suggestions to change IIS settings or to encode the value (NOTE: converting & to %27 still causes the Bad Request error).

I would really not recommend using the combination of StartsWith and Length - performance could become a real issue in that case. Assuming you need a string key and that you want your keys to be able to contain special characters, Peter Qian has blogged about the best recommendation we can give. (This behavior actually comes from a combination of IIS and the .NET stack.)
Check out http://blogs.msdn.com/b/peter_qian/archive/2010/05/25/using-wcf-data-service-with-restricted-characrters-as-keys.aspx for more details, but your problem should be solved by turning off ASP.NET request filtering. Note that this has non-trivial security risks. Peter points out some of them, and security filtering tools like asafaweb.com will complain about this solution.
Long story short: if you can use integers or avoid the restricted characters in keys, your Web application will be more secure.

I found this article and it gave me an idea. I was going to try and use HEX, but since I am using a web service, I couldn't figure anything out. Finally I thought, hey, someone stated how they used substringof, why not try startswith!
Finally! A solution to this problem.
INVALID:
http://URL/_vti_bin/listdata.svc/ListType('Search & Stuff')
VALID:
http://URL/_vti_bin/listdata.svc/ListType()?$filter=startswith(Value,'Search & Stuff')
I took it a step further, since this could potentially return results I didn't want. I added a check that the length equals the search string's length, and it is working perfectly!
My C# Code looks like this now:
ServiceContext context = new ServiceContext(new Uri("http://URL/_vti_bin/listdata.svc"));
context.Credentials = CredentialCache.DefaultCredentials;
var requestType = (System.Data.Services.Client.DataServiceQuery<ListTypeValue>)context.ListType
.Where(v => v.Value.StartsWith(search) && v.Value.Length == search.Length);
Hopefully this helps someone else out and saves some hair pulling!
UPDATE:
After getting replies from people, I found a better way to do this. Instead of the standard LINQ method of querying, there is another option. See MSDN Article
System.Data.Services.Client.DataServiceQuery<ListTypeValue> lobtest = context.ListType
.AddQueryOption("$filter", "(Value eq '" + CleanString(search) + "')");
Now I will implement the approach from the link Mark Stafford posted and parse out reserved characters.
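For reference, here is a minimal sketch of what such a CleanString helper could look like. This is my own guess at the idea, not the code from the linked article: it assumes the goal is to escape single quotes the way OData string literals expect (by doubling them) and to strip the other characters that trigger request filtering.
// Hypothetical helper (not from the original post): sanitize a value before
// embedding it in an OData $filter string literal.
private static string CleanString(string value)
{
    if (string.IsNullOrEmpty(value))
        return value;

    // OData escapes a single quote inside a string literal by doubling it.
    string cleaned = value.Replace("'", "''");

    // Strip other characters known to trigger request filtering (e.g. & and /).
    foreach (char c in new[] { '&', '/' })
        cleaned = cleaned.Replace(c.ToString(), string.Empty);

    return cleaned;
}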


Google Cloud Vision DOCUMENT_TEXT_DETECTION language hints - how can I force it to use only one language?

I'm working on text recognition for documents, using the library for .NET Core.
var client = ImageAnnotatorClient.Create();
ImageContext context = new ImageContext();
foreach (var hints in context.LanguageHints)
{
    Console.WriteLine("hint " + hints);
}
var response = client.DetectDocumentText(fromBytes, context);
In Russian there is a phrase "Однажды вечером" ("One evening").
If I use Google DOCUMENT_TEXT_DETECTION without language hints, I get the result "ОДНАЖДЫ BELEPOM" (first word correct, second word wrong).
Okay, maybe it is unsure about the language, so let's indicate explicitly that it is Russian:
context.LanguageHints.Add("ru");
var response = client.DetectDocumentText(fromBytes, context);
The result is the same - the second word is entirely in Latin characters.
I think maybe I am specifying the hint incorrectly? Let's try another language as an example:
context.LanguageHints.Add("en");
var response = client.DetectDocumentText(fromBytes, context);
result:
DAHAYKALI BELEPOM
As we can see, the hints functionality does work by itself; it's just that the Russian word "ВЕЧЕРОМ" ("in the evening") is recognized with such low confidence that it falls back to the English OCR.
The question is: how do you disable every OCR language model except one (Russian in this case)? Let it at least try to recognize Russian letters, even if with lower confidence.
Compare your request to the one in the API; in the API you receive the correct value, "Однажды вечером".
(Scroll down; you have the request JSON there.)
As per this link, it is not possible to restrict the language; it is only a hint.
The solution, though I'm not sure if it will work, is to use Tesseract with a restricted character set and run this text through it: https://stackoverflow.com/a/70345684/14980950.
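If you go the Tesseract route, a rough sketch with the Tesseract NuGet wrapper might look like the following. This is my own illustration, not code from the linked answer; the tessdata and image paths are placeholders, and loading only the Russian traineddata is what effectively restricts recognition to one language.
// Sketch only: run the image through Tesseract with just the Russian
// language data loaded, so there is no Latin fallback available.
using Tesseract;

using (var engine = new TesseractEngine(@"./tessdata", "rus", EngineMode.Default))
using (var img = Pix.LoadFromFile("document.png"))   // placeholder image path
using (var page = engine.Process(img))
{
    Console.WriteLine(page.GetText());
    Console.WriteLine("Mean confidence: " + page.GetMeanConfidence());
}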

I'm trying to skip tokens in gplex (a flex/lex port) and using yylex() is causing a stack overflow. Is there a better way to skip?

So right now in my lexer I'm trying to skip certain tokens like comments and whitespace, except I need to add them to my "skipped list", rather than hiding them altogether.
In my Scanner frame I have
public int Skip(int sym) {
    Token t = _InitToken();
    t.SymbolId = sym;
    t.Line = Current.Line;
    t.Column = Current.Column;
    t.Position = Current.Position;
    t.Value = yytext;
    t.Skipped = null;
    _skipped.Add(t);
    return yylex();
}
Keep in mind this is C#, but the interface isn't much different from the C one in lex/flex.
I then use this function above in my scanner like so:
"/*" { if(!_TryReadUntilBlockEnd("*/")) return -1; return Skip(478); }
\/\/[^\n]* { return Skip(477); }
where 477 is my symbol ID (the lex file is generated, hence the lack of constants)
All _TryReadUntilBlockEnd("*/") does is read until it finds the trailing */, consuming it. It's a well-tested method and can be ignored for the purposes of this question, except as an explanation of how I match the end of a comment. It takes over the underlying input from gplex and advances the underlying input stream itself (like fgetc() or whatever it is in C, I forget). Basically it's neutral here, other than reading the entire comment. Skip(478) is the relevant bit, not this.
It works fine in many cases. The only problem is that I'm using it in a recursive descent parser that's parsing C#, so the stack gets heavy, and when I have a huge stream of line comments it overflows the stack.
I could solve it by finding some way to run a match without invoking a lex action again, instead of calling yylex(), if that's possible - that way I could rewrite it to be iterative - but I have no idea how, and what I've seen of the generated code suggests it's not possible.
The other way I can solve it - and this is my preferred way - is to match multiple C# line comments in one match. That way I only recurse once.
But that is a multiline match expression, which is disabled by default, I think?
How do I enable multiline matching in either flex, lex or gplex? Or is there another solution to the above problem? (gplex 1.2.2 preferred, but it's completely undocumented.)
I'll take anything at this point. Thanks in advance!
It turns out I shouldn't have been calling yylex() at all. Thanks to Jonathan and rici in the comments.
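For anyone hitting the same thing, the fix is simply not to recurse: record the skipped token and let the generated scanner keep going on its own. A sketch of what that looks like, assuming gplex behaves like flex here (an action that doesn't return just continues scanning):
// Sketch only: Skip() now just records the token; it no longer calls yylex().
public void Skip(int sym) {
    Token t = _InitToken();
    t.SymbolId = sym;
    t.Line = Current.Line;
    t.Column = Current.Column;
    t.Position = Current.Position;
    t.Value = yytext;
    t.Skipped = null;
    _skipped.Add(t);
}

// In the rules, don't return from the skip actions; the scanner then loops iteratively:
// "/*"        { if(!_TryReadUntilBlockEnd("*/")) return -1; Skip(478); }
// \/\/[^\n]*  { Skip(477); }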

Incorrect Encoding Causing Problems with DB Query

It's late and my caffeine IV is running low so my mind is mush and I'm having problems finding a solution to what I think is a simple encoding problem (which I have almost no experience dealing with).
I have a DB using EF6 Code First and everything seems to work well until I copy some text from a website forum contained within a code block. I checked the header and it's supposedly encoded in UTF-8.
I essentially take this text, split it into an array of strings, and check the DB for a record matching the string on each line. Everything was going well until I hit a problem with the string "Magnеtic" not matching anything in my DB table, yet when I went into SQL Server Management Studio and queried the table with LIKE '%Magnеtic%' I got a result.
I dropped the text from the website into Notepad++ with the text from the DB query and saw that they look equal:
Magnеtic
Magnеtic
Then, I changed the encoding to ANSI and it showed:
Magnetic <--From DB
Magnеtic <--From website
A tiny light bulb went on in my head, but my attempts to remedy this issue have failed.
I've tried using various methods but I think it's my fried brain attacking the problem with the wrong tools:
string.Compare(a, b) == 0
string.Equals(a, b)
a.ToUpperInvariant().Equals(b.ToUpperInvariant())
and probably a few others that I can't remember.
So now you know what my issue is and I feel this is such a simple problem to fix but, like I said, I'm fried and now need some community help.
I'm not a professional coder, more a hobbyist so I may not be using best practices or advanced techniques to do some things.
Edit:
Today I did some more searching and found a couple of methods that didn't work.
private string RemoveAccent(string txt)
{
    byte[] bytes = Encoding.GetEncoding("Cyrillic").GetBytes(txt);
    return Encoding.ASCII.GetString(bytes);
}
This one appears to remove the accented characters of the Cyrillic encoding. The result wasn't as expected but it DID have an effect.
Results:
Magn?tic <- Computer interpretation
Magnetic <- Visual representation
I also tried:
private string RemoveAccent2(string txt)
{
    char[] toReplace = "àèìòùÀÈÌÒÙ äëïöüÄËÏÖÜ âêîôûÂÊÎÔÛ áéíóúÁÉÍÓÚðÐýÝ ãñõÃÑÕšŠžŽçÇåÅøØ".ToCharArray();
    char[] replaceChars = "aeiouAEIOU aeiouAEIOU aeiouAEIOU aeiouAEIOUdDyY anoANOsSzZcCaAoO".ToCharArray();
    for (int i = 0; i < toReplace.Count(); i++)
    {
        txt = txt.Replace(toReplace[i], replaceChars[i]);
    }
    return txt;
}
This method didn't provide any changes.
What can help in these cases is to copy-paste the character into Google. In this case, the results point to the Wikipedia article about the Cyrillic letter Ye, which looks exactly like the Latin letter E but has a different code point in Unicode.
This means the results you're getting are correct: the string “Magnеtic” looks exactly the same as “Magnetic” (at least using common fonts), but it's actually a different string.
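A quick way to see this for yourself (my own snippet, not part of the answer) is to dump the code points of each string; the Latin 'e' shows up as U+0065 while the Cyrillic 'е' is U+0435.
// Print the Unicode code point of every character in a string.
static void DumpCodePoints(string s)
{
    foreach (char c in s)
        Console.Write($"U+{(int)c:X4} ");
    Console.WriteLine();
}

DumpCodePoints("Magnetic");   // Latin e    -> ... U+0065 ...
DumpCodePoints("Magnеtic");   // Cyrillic е -> ... U+0435 ...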

What is the way to apply extension methods?

When attempting to solve the problem
How many seven-element subsets (without repetition) are there in a set of nine elements?
I tried
IEnumerable<string> NineSet = new string[] { "a", "b", "c", "d", "e", "f", "g", "h", "i" };
var SevenSet =
    from first in NineSet
    from second in NineSet
    where first.CompareTo(second) < 0 && first.Count() + second.Count() == 7
    select new { first, second };
What is the problem that prevents me from attempting to use first.Count() and second.Count()? I did not check whether it is the best solution for the problem.
As already stated, what you have written down will lead you nowhere. This is a question of combinatorics. AFAIK there is nothing pre-made in the .NET Framework for solving combinatorics problems, so you will have to implement the correct algorithm yourself. If you get stuck, there are solutions out there, e.g. http://www.codeproject.com/KB/recipes/Combinatorics.aspx, where you can look at the source to see what you need to do.
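For what it's worth, the count itself doesn't need LINQ at all; here is a minimal sketch (mine, not from the answer) of the binomial coefficient, which gives C(9, 7) = 36 for this problem.
// Number of k-element subsets of an n-element set: C(n, k).
static long Choose(int n, int k)
{
    long result = 1;
    for (int i = 1; i <= k; i++)
        result = result * (n - k + i) / i;   // stays an integer at every step
    return result;
}

// Choose(9, 7) == 36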
first and second are strings, so you'll be counting their characters (this compiles, but IntelliSense hides it).
You're looking for something like NineSet.Count(first.Equals)
Well...
You haven't shown the error message, so it's hard to know what's wrong
Every element is of length exactly one, so I'm not sure what you're expecting to happen
As you know that first and second are strings, why aren't you using first.Length and second.Length?
As a side issue, I don't think this approach is going to solve the problem for you, I'm afraid...

Detecting honest web crawlers

I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?
If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However if a web crawler is detected, we'd always give them the same version so that the index is consistent.
Also I'm using Java, but I would imagine the approach would be similar for any server-side technology.
You said matching the user agent on ‘bot’ may be awkward, but we’ve found it to be a pretty good match. Our studies have shown that it will cover about 98% of the hits you receive. We also haven’t come across any false positive matches yet either. If you want to raise this up to 99.9% you can include a few other well-known matches such as ‘crawler’, ‘baiduspider’, ‘ia_archiver’, ‘curl’ etc. We’ve tested this on our production systems over millions of hits.
Here are a few c# solutions for you:
1) Simplest
Is the fastest when processing a miss. i.e. traffic from a non-bot – a normal user.
Catches 99+% of crawlers.
bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
2) Medium
Is the fastest when processing a hit. i.e. traffic from a bot. Pretty fast for misses too.
Catches close to 100% of crawlers.
Matches ‘bot’, ‘crawler’, ‘spider’ upfront.
You can add to it any other known crawlers.
List<string> Crawlers3 = new List<string>()
{
"bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
"lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
"atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
"cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
"esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
"gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
"htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
"image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
"lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
"motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
"netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
"patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
"raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
"searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
"curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
"urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
"webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
"webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
"wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
3) Paranoid
Is pretty fast, but a little slower than options 1 and 2.
It’s the most accurate, and allows you to maintain the lists if you want.
You can maintain a separate list of names with ‘bot’ in them if you are afraid of false positives in future.
If we get a short match we log it and check it for a false positive.
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
"googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
"yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
"botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
"ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
"dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
"irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
"simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
"vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
"spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};
// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
"baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
"nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
"bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
"cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
"fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
"gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
"havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
"jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
"larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
"merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
"muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
"objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
"phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
"roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
"senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
"spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
"titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
"webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
"webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
"robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
"legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
string match = null;
if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));
if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);
bool iscrawler = match != null;
Notes:
It’s tempting to just keep adding names to the regex option 1. But if you do this it will become slower. If you want a more complete list then linq with lambda is faster.
Make sure .ToLower() is outside of your linq method – remember the method is a loop and you would be modifying the string during each iteration.
Always put the heaviest bots at the start of the list, so they match sooner.
Put the lists into a static class so that they are not rebuilt on every pageview.
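A minimal sketch of that last point (my own wrapper, the names are made up): the list lives in a static class so it is built once, and the .ToLower() happens once per request, outside the lookup loop.
// Hypothetical wrapper class: the list is initialized once per app domain.
public static class CrawlerDetector
{
    private static readonly List<string> Crawlers = new List<string>
    {
        "bot", "crawler", "spider", "80legs", "baidu", "yahoo! slurp" // , ...rest of the list
    };

    public static bool IsCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLower();          // lowercase once, outside the loop
        return Crawlers.Exists(x => ua.Contains(x));
    }
}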
Honeypots
The only real alternative to this is to create a ‘honeypot’ link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.
Positives: It will match some unknown crawlers that aren't declaring themselves.
Negatives: Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.
You can find a very thorough database of data on known "good" web crawlers in the robotstxt.org Robots Database. Utilizing this data would be far more effective than just matching bot in the user-agent.
One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users wouldn't see the link, leaving spiders and bots to follow it. For example, an empty anchor tag that points to a subfolder would record a GET request in your logs...
Many people use this method while running a HoneyPot to catch malicious bots that aren't following the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
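By way of illustration (my own sketch, not the poster's actual handler), the honeypot page's code-behind can simply log whoever requested it, since a human should never land there:
// Hypothetical honeypot code-behind: anything requesting this page is logged
// as a probable crawler, because the link to it is invisible to real users.
protected void Page_Load(object sender, EventArgs e)
{
    string line = DateTime.UtcNow + "\t" + Request.UserHostAddress + "\t"
                + (Request.UserAgent ?? "(no user agent)") + Environment.NewLine;
    System.IO.File.AppendAllText(Server.MapPath("~/App_Data/honeypot.log"), line);
}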
Any visitor whose entry page is /robots.txt is probably a bot.
Something quick and dirty like this might be a good start:
return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i
Note: rails code, but regex is generally applicable.
I'm pretty sure a large proportion of bots don't use robots.txt, however that was my first thought.
It seems to me that the best way to detect a bot is by the time between requests; if the time between requests is consistently fast, then it's a bot.
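As a rough sketch of that idea (mine, purely illustrative; the window size and threshold are arbitrary): keep the last few request times per IP and flag clients whose average gap is suspiciously small.
// Hypothetical rate-based check (needs System.Collections.Concurrent,
// System.Collections.Generic and System.Linq): not production code,
// just the shape of the idea.
static readonly ConcurrentDictionary<string, Queue<DateTime>> Recent =
    new ConcurrentDictionary<string, Queue<DateTime>>();

static bool LooksLikeBot(string ip)
{
    var times = Recent.GetOrAdd(ip, _ => new Queue<DateTime>());
    lock (times)
    {
        times.Enqueue(DateTime.UtcNow);
        if (times.Count > 10) times.Dequeue();
        if (times.Count < 10) return false;                  // not enough data yet
        double seconds = (times.Last() - times.First()).TotalSeconds;
        return seconds / (times.Count - 1) < 0.5;            // avg gap under half a second
    }
}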
void CheckBrowserCaps()
{
    String labelText = "";
    System.Web.HttpBrowserCapabilities myBrowserCaps = Request.Browser;
    if (((System.Web.Configuration.HttpCapabilitiesBase)myBrowserCaps).Crawler)
    {
        labelText = "Browser is a search engine.";
    }
    else
    {
        labelText = "Browser is not a search engine.";
    }
    Label1.Text = labelText;
}
HttpCapabilitiesBase.Crawler Property
