String matching from a list - C#

I am making a web browser and I am stuck on one thing. I want the address bar to act as both an address bar and a search bar. First I tried querying the text with if (adrBarTextBox.Text.EndsWith(".com")), but immediately realized that not every domain ends with .com.
The code I am currently using (and am stuck with) is:
// Populate the list.
var list = new List<string>();
list.Add(Properties.Settings.Default.suffix);
(Properties.Settings.Default.suffix is a list of every domain suffix currently available.)
// Search for this element.
if (adrBarTextBox.Text.Contains(list.something????))
{
    // Do something (I have this part all set up)
}
The part I am having trouble with is
if (adrBarTextBox.Text.Contains(list.
I know it doesn't make sense, but that's why I am asking. I have sat here for hours trying to think of a new way and I am lost. I know that .Text.Contains(list) doesn't make sense, and that's what I am stuck on.
I know the question is a bit noobish and there is probably some simple, easy answer staring me right in the face, but hey, we all have to learn from somewhere.

You may need this (Any comes from System.Linq):
if (list.Any(x => adrBarTextBox.Text.Contains(x)))
{
    // ...
}

Use Uri.IsWellFormedUriString to determine if the input string is a valid URL.
If you want to match a string against a list of words, use
myList.Any(item => input.Contains(item));
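Putting both ideas together, a minimal sketch (Navigate is a hypothetical helper, and the search URL is just an example):
if (Uri.IsWellFormedUriString(adrBarTextBox.Text, UriKind.Absolute))
{
    // Looks like a real URL: go there directly.
    Navigate(adrBarTextBox.Text);
}
else
{
    // Otherwise treat the input as a search query.
    Navigate("https://www.google.com/search?q=" + Uri.EscapeDataString(adrBarTextBox.Text));
}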

Related

Get certain row by searching for a string

I am very new to C# and am trying to feel it out. Slow going so far! What I am trying to achieve should be relatively simple; I want to read a row from a CSV file with a search. I.e. if I search for username "Toby" it would fetch the entire row, preferably as an array.
Here is my users.csv file:
Id,Name,Password
1,flugs,password
2,toby,foo
I could post the code that I've tried, but I haven't even come close in previous attempts. It's a bit easier to do such a thing in Python, it may be easy in C# too but I'm far too new to know!
Does anyone have any ideas as to how I should approach/code this? Many thanks.
Easy to do in C# too:
var lineAsArray = File.ReadLines("path").First(s => s.Contains(",toby,")).Split(',');
If you want case insensitivity, use e.g. Contains(",toby,", StringComparison.OrdinalIgnoreCase) (note that this string.Contains overload requires .NET Core 2.1 or later)
If your user is going to type in "Toby", you can either concatenate a comma onto the start/end of it to follow this simplistic searching (which will find Toby anywhere on the line), or you can split each line first and look to see if the second element is Toby:
var lineAsArray = File.ReadLines("path").Select(s => s.Split(',')).First(a => a[1].Equals("toby"));
To make this one case insensitive, put a suitable StringComparison argument into the Equals using the same approach as above
The sky's the limit with how involved you want to get with it; using a library that parses CSV into objects that represent your lines with named, typed properties is probably where I'd stop. Take a look at CsvHelper from Josh Close or ServiceStack.Text, though there is no shortage of CSV parser libraries; it's been done to death!
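For illustration, a minimal sketch with CsvHelper (the class and file names are assumptions, and the constructor shown is the modern CsvHelper API):
using System;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;

public class User
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Password { get; set; }
}

// Read the file and find toby's row as a typed object rather than a string array.
using (var reader = new StreamReader("users.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    var toby = csv.GetRecords<User>()
                  .FirstOrDefault(u => u.Name.Equals("toby", StringComparison.OrdinalIgnoreCase));
}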

C# Refine class property which is a List<string>

I have a class property:-
public List<string> szTypeOfFileList { get; set; }
As the name suggests, the property stores the user's selection of file types of interest (.txt, .doc, .rtf, .pdf, etc.).
I am trying to assess whether there is a way I could refine this List as it is being populated by user entries, OR if I should wait for all entries and then call a separate method to refine the property.
What I mean by this is, let's say a particular user input is ".doc/.docx". Currently, this would be stored in the List as a single string item. However, I want it to be stored as two separate items. This will keep the code in one place and won't affect future modules and such.
private List<string> _szTypeOfFileList = new List<string>();
public List<string> szTypeOfFileList
{
    get
    {
        return _szTypeOfFileList;
    }
    set
    {
        // Some kind of validation/refining method here ?? //
    }
}
EDIT:-
Because my FileTypeList is coming from a checkbox list, I had to use a different methodology than the answer I accepted (which pointed me in the right direction).
foreach (object itemChecked in FileTypeList.CheckedItems)
{
    string[] values = itemChecked.ToString().Split('/');
    foreach (var item in values)
        TransactionBO.Instance.szTypeOfFileList.Add(item);
}
This part of my code is in the UI class before it is passed on to the Business class.
If you know that it'll always be split with a "/" character, just use a split on the string. Including a simple bit of verification to prevent obvious duplicates, you might do something along the lines of:
string[] values = x.Split('/');
foreach (string val in values)
{
    string ext = val.ToLower().Trim();
    if (!_szTypeOfFileList.Contains(ext))
    {
        _szTypeOfFileList.Add(ext);
    }
}
You can also use an array of characters in place of the '/' to split against, if you need to consider multiple characters in that spot.
I would consider changing the List to something more generic. Do they really need a List, or maybe a collection? An array? An enumerable?
Second, in your set method, you'll want to take their input, break it up, and add it. Here comes the question: is a List the best way of doing it?
What about duplicate data? Do you just add it again? Do you need to search for it in order to figure out if you're going to add it?
Think about a Dictionary or Hashtable, or any other type of collection that will help you out with your data.
var extensions = userInput.Split('/').ToList();
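Putting the suggestions above together, a minimal sketch of the property with a refining setter (assuming '/' is the only separator and duplicates should be dropped):
private readonly List<string> _szTypeOfFileList = new List<string>();

public List<string> szTypeOfFileList
{
    get { return _szTypeOfFileList; }
    set
    {
        _szTypeOfFileList.Clear();
        foreach (string entry in value)
        {
            // Split compound entries like ".doc/.docx" into separate items.
            foreach (string part in entry.Split('/'))
            {
                string ext = part.Trim().ToLower();
                if (ext.Length > 0 && !_szTypeOfFileList.Contains(ext))
                    _szTypeOfFileList.Add(ext);
            }
        }
    }
}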

What is a good substitute for a big switch-case?

I have objects called Country. At some point in the program, I want to set the field power of each object.
The power for each country is fixed and I have data for all 196 countries here on a piece of paper. My code should check, for instance, if the country's name is USA (and if so, set its power to 100) and so on.
I know I can do it with a switch-case, but what is the best, nicest, and most efficient way to do it?
You can store country-power pairs in a Dictionary<string, int>, then just get the power of a particular country by using the indexer:
var points = new Dictionary<string,int>();
// populate the dictionary...
var usa = points["USA"];
Edit: As suggested in the comments, you should store the information in an external file; XML, for example, would be a good choice. That way you don't have to modify the code to add or remove countries. You just store them in an XML file and edit it whenever you need, then parse it when your program starts and load the values into the Dictionary. You can use LINQ to XML for that. If you haven't used it before, there are good examples in the documentation to get you started.
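For example, a minimal LINQ to XML sketch, assuming a file shaped like <countries><country name="USA" power="100" /></countries>:
using System.Linq;
using System.Xml.Linq;

// Parse the XML once at startup and build the lookup dictionary.
var points = XDocument.Load("countries.xml")
    .Descendants("country")
    .ToDictionary(c => (string)c.Attribute("name"),
                  c => (int)c.Attribute("power"));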
Whilst Selman's answer is right and good, it does not show how to actually populate the Dictionary. Here it is:
var map = new Dictionary<string, int>
{
    { "USA", 100 },
    { "Germany", 110 }
};
You may, however, also just add entries as follows:
map.Add("USA", 100);
map.Add("Germany", 110);
Now you may access the value (as already mentioned by Selman):
map["USA"] = 50; // set new value for USA
int power = map["USA"]; // get new value
EDIT: As already mentioned in the comments and other answers, you may of course store the data in an external file or any other data store. Having said this, you may just initialize an empty dictionary and then fill it, using the Add method previously mentioned, with every record from that store.
This is the right question to begin with, but there are a lot of things you need to learn. Many folk have given you answers to the question you asked. I'm going to be annoyingly Zen and tell you to unask the question because there is a larger problem to resolve.
Instead of hard coding this, store the related properties in an n-tuple also known as a database row and use a database engine to manage the relation between the two. And then since you are using C# it would probably be smart to learn to use LINQ. But before you do that, learn a bit of data modelling theory, because data-modelling is what you are doing.
Since you said you have "objects" called "Country", and you have tagged your question "C#", it would seem that you have two forces at work in your code. One is that having to refer to a map, however efficiently implemented, is not as cheap as referring to a member variable. On the other hand there might be some benefit to a setup where all the attributes of a country can be found in the same place as the attributes of other countries (the map-oriented solutions do address this concern). But these forces can be reconciled something like this:
class Country
{
    private string name;
    private int power;

    public Country(string name, int power)
    {
        this.name = name;
        this.power = power;
    }
}

void MakeCountries()
{
    countries.Add(new Country("USA", 50));
    countries.Add(new Country("Germany", 60));
    // ....
}
Do you need to update your data at runtime?
Yes? Load the data from external storage into a dictionary.
No? Use a switch.
Let the compiler generate dictionaries and hash-based lookups for you.
When your profiler starts screaming, explore alternative solutions.
For example, see this answer to "What is quicker, switch on string or elseif on type?".
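To illustrate the switch route, a minimal sketch (the values are examples from this thread):
static int GetPower(string country)
{
    switch (country)
    {
        case "USA":     return 100;
        case "Germany": return 110;
        // ... the remaining countries ...
        default: throw new ArgumentException("Unknown country: " + country);
    }
}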
What about making an array of strings storing the country names in ascending order of their power? It would be simpler to implement; the index of each country could then represent its power. This only works if the powers are consecutive numbers.
If they are not, another simple way is to implement them as a linked list, so that you will be able to change it if you want: a list with two fields, one for the country and the other for the power.

Hard Code List of Years?

This is the scenario. You've got a web form and you want to prompt the customer to select their birth year.
a) hard code the values in the dropdown list?
b) Grab valid years from a DB table
I can see a maintenance nightmare with a set of years hard-coded in .aspx files everywhere.
updated:
A for loop is not ideal (maintenance nightmare and error prone), and the user then has to sift through 120 years, including ones that haven't even arrived yet.
I still like the DB approach:
* Single point of data
* No duplication of code
* Update the table as needed to add more years
* Year table values could be reused for some other dropdown, for some other purpose entirely, for something other than birth year
Simple as that. No need to go updating code everywhere. For data that is universal like this, we shouldn't be hard-coding this shiza into a bunch of pages, which goes totally against reuse and is error prone; really, it's not practical. I'd take the hit to the DB for this.
Updated (again...after thinking about this):
Here's my idea: just create a utility or helper method called GetYears that runs that loop and returns a List<int>, which I can bind to whatever I want (a dropdown list, etc.). And I like the web.config idea of maintaining the end year.
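A minimal sketch of that helper (the range bounds, control name, and web.config key are illustrative):
// Returns the years to offer, inclusive of both bounds.
public static List<int> GetYears(int startYear, int endYear)
{
    var years = new List<int>();
    for (int y = startYear; y <= endYear; y++)
        years.Add(y);
    return years;
}

// Usage: bind to any list control; the end year could come from
// ConfigurationManager.AppSettings["BirthYearEnd"] instead.
ddlBirthYear.DataSource = GetYears(1900, DateTime.Now.Year);
ddlBirthYear.DataBind();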
C) Use a for-loop to generate the years in a range of your choice.
Something as simple as this pseudocode:
for (int i = 1900; i < THIS_YEAR - 13; i++)
{
    validyears.options.Add(i);
}
Neither - provide a centralized service which can decide which mechanism to use, then the application doesn't care, and you are free to choose hardcoding, sliding window or database mechanisms.
To expand, typically, I would do something like this:
Define an IPopulatableYear interface which has a single AddYear method taking an int and constructing an appropriate ListItem or whatever.
Make MyYearListBox inherit from the regular ListBox and implement IPopulatableYear (this works for WinForms or WebForms).
Create a static method, or a singleton, or a method in your DAL, or whatever.
Like this:
PopulateYears(IPopulatableYear pl)
{
    // Very simple implementation - change at will
    for (int lp = 2009; lp < 2009 + 10; lp++)
    {
        pl.AddYear(lp);
    }
}
or
PopulateYears(IPopulatableYear pl)
{
    // A DB implementation
    SqlDataReader dr = DAL.YearSet(); // Your choice of mechanism here
    while (dr.Read())
    {
        pl.AddYear((int)dr[YEAR]);
    }
}
or
PopulateYears(IPopulatableYear pl)
{
    // A DB limits implementation with different ranges defined in the database
    // by key - the key is determined by the control itself, so IPopulatableYear
    // needs to implement a .YearSetKey property
    SqlDataReader dr = DAL.YearLimits(pl.YearSetKey); // Your choice of mechanism here
    for (int lp = (int)dr[YEAR_MIN]; lp <= (int)dr[YEAR_MAX]; lp++)
    {
        pl.AddYear(lp);
    }
}
The mechanism is now centrally managed.
Use MyYearListBox on your forms and call PopulateYears() on it. If your forms are smart, they can detect all MyYearListBox instances and call it, so you no longer have any new code - just drag it on.
Take a look at Enumerable.Range. I think making DB calls for something like this is FAR less performant than Enumerable.Range.
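For example (the bounds are illustrative; Enumerable.Range takes a start value and a count):
using System.Linq;

int start = 1900;
int count = DateTime.Now.Year - start + 1;
var years = Enumerable.Range(start, count).ToList();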
E) Use a text input box, because that will always work.
(Be sure to validate it, of course, as a number. Include "Y2K" and "The year World War II started" in a dictionary of years, of course.)
How you present the year selection in the web form is irrelevant. It's an interface decision. Your server should not trust the data coming in and should validate it accordingly. It's trivial to emulate a form submission, so it doesn't matter how it's presented. Heck, you can generate the drop-down with JavaScript so there is no load on the server.
You can validate with a rule on the backend, rather than a lookup.
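For instance, a minimal backend rule (the age cutoff is just an example):
bool IsPlausibleBirthYear(int year)
{
    // Accept years from 1900 up to 13 years before today.
    return year >= 1900 && year <= DateTime.Now.Year - 13;
}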
Since you're raising this whole issue (and making a bunch of comments), maybe it's within your power to think long and hard about this.
For the end user, it's hard to beat the ease-of-use of a text box. Yup, you're going to get bogus data, but computers are supposed to make things easier, not harder. Scrolling through a long list of years to find the year I know I was born is a nuisance. Especially with all those young whippersnappers and old farts who want to enter birth years that aren't anywhere close to mine!
But stepping back even further: do you really need to ask the user their birth year in the first place? Is it that important to your application? Could you avoid the issue entirely by letting somebody else deal with that, say by using OpenID, Windows Live ID, or Facebook Connect?

Detecting honest web crawlers

I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?
If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However if a web crawler is detected, we'd always give them the same version so that the index is consistent.
Also I'm using Java, but I would imagine the approach would be similar for any server-side technology.
You said matching the user agent on ‘bot’ may be awkward, but we’ve found it to be a pretty good match. Our studies have shown that it will cover about 98% of the hits you receive. We also haven’t come across any false positive matches yet either. If you want to raise this up to 99.9% you can include a few other well-known matches such as ‘crawler’, ‘baiduspider’, ‘ia_archiver’, ‘curl’ etc. We’ve tested this on our production systems over millions of hits.
Here are a few c# solutions for you:
1) Simplest
Is the fastest when processing a miss. i.e. traffic from a non-bot – a normal user.
Catches 99+% of crawlers.
bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
2) Medium
Is the fastest when processing a hit. i.e. traffic from a bot. Pretty fast for misses too.
Catches close to 100% of crawlers.
Matches ‘bot’, ‘crawler’, ‘spider’ upfront.
You can add to it any other known crawlers.
List<string> Crawlers3 = new List<string>()
{
"bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
"lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
"atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
"cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
"esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
"gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
"htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
"image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
"lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
"motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
"netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
"patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
"raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
"searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
"curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
"urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
"webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
"webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
"wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
3) Paranoid
Is pretty fast, but a little slower than options 1 and 2.
It’s the most accurate, and allows you to maintain the lists if you want.
You can maintain a separate list of names with ‘bot’ in them if you are afraid of false positives in future.
If we get a short match we log it and check it for a false positive.
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
"googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
"yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
"botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
"ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
"dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
"irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
"simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
"vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
"spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};
// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
"baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
"nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
"bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
"cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
"fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
"gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
"havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
"jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
"larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
"merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
"muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
"objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
"phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
"roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
"senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
"spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
"titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
"webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
"webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
"robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
"legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
string match = null;
if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));
if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);
bool iscrawler = match != null;
Notes:
It’s tempting to just keep adding names to the regex in option 1, but if you do, it will become slower. If you want a more complete list, then LINQ with a lambda is faster.
Make sure .ToLower() is outside of your LINQ method – remember the method is a loop, and you would be re-computing the lowercased string on every iteration.
Always put the heaviest bots at the start of the list, so they match sooner.
Put the lists into a static class so that they are not rebuilt on every pageview.
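A minimal sketch of that last point (the class and field names are illustrative):
// Built once per app domain instead of on every pageview.
public static class CrawlerLists
{
    public static readonly List<string> WithBot = new List<string>
    {
        "googlebot", "bingbot", /* ... the rest of Crawlers1 ... */
    };

    public static readonly List<string> WithoutBot = new List<string>
    {
        "baiduspider", "80legs", /* ... the rest of Crawlers2 ... */
    };
}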
Honeypots
The only real alternative to this is to create a ‘honeypot’ link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.
Positives: It will match some unknown crawlers that aren’t declaring themselves.
Negatives: Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.
You can find a very thorough database of known "good" web crawlers in the robotstxt.org Robots Database. Utilizing this data would be far more effective than just matching 'bot' in the user agent.
One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users wouldn't see the link, leaving spiders and bots to follow it. For example, an empty anchor tag that points to a subfolder would record a GET request in your logs...
Many people use this method while running a honeypot to catch malicious bots that aren't following the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
Any visitor whose entry page is /robots.txt is probably a bot.
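A rough sketch of that heuristic in ASP.NET (the session flag is illustrative, and note that many bots don't carry session cookies at all):
// Flag sessions whose very first request was for robots.txt.
if (Session.IsNewSession &&
    Request.Path.Equals("/robots.txt", StringComparison.OrdinalIgnoreCase))
{
    Session["ProbablyBot"] = true;
}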
Something quick and dirty like this might be a good start:
return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i
Note: Rails code, but the regex is generally applicable.
I'm pretty sure a large proportion of bots don't use robots.txt; however, that was my first thought.
It seems to me that the best way to detect a bot is by the time between requests: if the time between requests is consistently fast, then it's a bot.
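A rough sketch of that idea (the threshold is illustrative, and a real version would look at a rolling average rather than a single gap):
using System;
using System.Collections.Concurrent;

static readonly ConcurrentDictionary<string, DateTime> LastSeen =
    new ConcurrentDictionary<string, DateTime>();

static bool IsSuspiciouslyFast(string clientIp)
{
    DateTime now = DateTime.UtcNow;
    // Flag the client if this request follows the previous one too quickly.
    bool fast = LastSeen.TryGetValue(clientIp, out DateTime previous)
                && (now - previous) < TimeSpan.FromMilliseconds(200);
    LastSeen[clientIp] = now;
    return fast;
}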
void CheckBrowserCaps()
{
    string labelText;
    // Crawler is inherited from HttpCapabilitiesBase.
    System.Web.HttpBrowserCapabilities myBrowserCaps = Request.Browser;
    if (myBrowserCaps.Crawler)
    {
        labelText = "Browser is a search engine.";
    }
    else
    {
        labelText = "Browser is not a search engine.";
    }
    Label1.Text = labelText;
}
HttpCapabilitiesBase.Crawler Property
