Service or strategy to detect if users enter fake names? - C#

I'm searching for services/strategies to detect when names entered in forms are spammy, for example: asdasdasd, ksfhaiodsfh, wpoeiruopwieru, zcpoiqwqwea. In other words, random keyboard mashing.
I have tried Akismet, but it is not specifically for names (http://kemayo.wordpress.com/2005/12/02/akismet-py/).
Thanks in advance.

One strategy is having a blacklist of weird names and/or a whitelist of normal names, and rejecting/accepting names accordingly. But such lists can be difficult to build.

You could look for unusual character combinations, like many consecutive vowels or consonants, and watch your registrations to build a list of patterns (like asd) that recur in fake names.
I would refrain from automatically blocking those inputs, and rather mark them for manual examination.
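As a rough illustration of that heuristic, here is a C# sketch; the thresholds and the keyboard-run list are my guesses, to be tuned against real registration data:

using System.Text.RegularExpressions;

// Heuristic sketch: flag, rather than block, names with long consonant or
// vowel runs or tell-tale keyboard patterns. Thresholds are illustrative.
static bool LooksSuspicious(string name)
{
    string lower = name.ToLowerInvariant();
    if (Regex.IsMatch(lower, "[bcdfghjklmnpqrstvwxz]{5,}"))
        return true; // five or more consonants in a row
    if (Regex.IsMatch(lower, "[aeiou]{4,}"))
        return true; // four or more vowels in a row
    string[] keyboardRuns = { "asd", "sdf", "qwe", "wer", "jkl" };
    foreach (var run in keyboardRuns)
        if (lower.Contains(run))
            return true;
    return false;
}

Anything this flags would go into the manual-review queue rather than being rejected outright.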

Ask for a real email address and send the connection info there, then get the info from the account.
No approach is completely safe anyway.

If speed isn't an issue, download a list of the top 100k most common names, put them in an O(1) lookup data structure, and check whether the input is there. If it isn't, you can compare the input to the entries using a string similarity algorithm.
If you do, you will probably want to bucket the list by starting letter so you don't have to run that comparison against the entire list.
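A minimal sketch of that idea, assuming the name list is already loaded; the class, threshold, and the Levenshtein-based similarity measure are my choices, not a specific library's:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: an O(1) HashSet check first, then a similarity pass over only
// the bucket of known names sharing the input's first letter.
class NameChecker
{
    readonly HashSet<string> names;
    readonly Dictionary<char, List<string>> buckets;

    public NameChecker(IEnumerable<string> knownNames)
    {
        names = new HashSet<string>(knownNames, StringComparer.OrdinalIgnoreCase);
        buckets = names.GroupBy(n => char.ToLowerInvariant(n[0]))
                       .ToDictionary(g => g.Key, g => g.ToList());
    }

    public bool LooksLikeRealName(string input, double threshold = 0.75)
    {
        if (string.IsNullOrEmpty(input)) return false;
        if (names.Contains(input)) return true; // exact hit, O(1)
        if (!buckets.TryGetValue(char.ToLowerInvariant(input[0]), out var bucket))
            return false;
        return bucket.Any(n => Similarity(input.ToLowerInvariant(),
                                          n.ToLowerInvariant()) >= threshold);
    }

    // Similarity = 1 - (edit distance / length of the longer string).
    static double Similarity(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return 1.0 - (double)d[a.Length, b.Length] / Math.Max(a.Length, b.Length);
    }
}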

Related

Sort a text file where each line is a name followed by a score

I'm going to make a scoreboard for an XNA game but I'm having some trouble figuring out how. The player, after finishing a level, will enter his name when prompted. His name and time will be recorded in a text file, something like this:
sam 90
james 64
matthew 100
I'm trying to figure out a way to sort this data by the time taken only, without taking the name into account.
I haven't started coding this yet but if anybody can give me any ideas it would be greatly appreciated.
First, read the text file using File.ReadAllLines(...) so you get a string array. Then split each string on the space character (assuming users can't enter spaces in their names) and order by the second element, which should be the score. You have to convert it to an int with int.Parse(...) so it orders numerically rather than alphabetically.
using System;
using System.IO;
using System.Linq;

string[] scores = File.ReadAllLines("scorefile.txt");
// Parse the number after the space and sort from highest to lowest.
var orderedScores = scores.OrderByDescending(x => int.Parse(x.Split(' ')[1]));
foreach (var score in orderedScores)
{
    Console.WriteLine(score);
}
//outputs:
//matthew 100
//sam 90
//james 64
I would recommend using something like a semi-colon to separate the name and the score instead of a space, as that makes it much easier to handle in the case that users are allowed to enter spaces in their names.
Why not make this a database file (.sdf), which you can easily create on the fly if needed? That would be best for keeping track of the data, allowing sorting and future reuse.
SQLite is designed for this exact purpose and makes basic CRUD operations a doddle. You can also encrypt database files if your game grows and you want to start using it as a way to download/upload high scores and share scores with friends/the world.
Don't get me wrong, this is definitely a little more work than simply parsing a text file, but it is future-proof, and you get a lot of functionality right out of the box without having to write parsers and complex search routines.
XML is another definite choice, or JSON; all three are good alternatives. A plain text file probably isn't the way to go, though, as it will probably cause you more work in the end.
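For instance, a small sketch of the XML route using XmlSerializer, which ships with the .NET Framework that XNA targets; the ScoreEntry type and file path are illustrative:

using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class ScoreEntry
{
    public string Name { get; set; }
    public int Time { get; set; }
}

public static class ScoreStore
{
    static readonly XmlSerializer serializer = new XmlSerializer(typeof(List<ScoreEntry>));

    public static void Save(List<ScoreEntry> scores, string path)
    {
        using (var stream = File.Create(path))
            serializer.Serialize(stream, scores);
    }

    public static List<ScoreEntry> Load(string path)
    {
        using (var stream = File.OpenRead(path))
            return (List<ScoreEntry>)serializer.Deserialize(stream);
    }
}

Save and Load then replace the hand-written line parsing entirely, and sorting becomes a LINQ OrderBy over typed ints.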
Create the score file with lines in the format
name::score
then read each line and split on the separator:
string line = "Sam::255";
string name = line.Split(new[] { "::" }, StringSplitOptions.None)[0];
Do the same for the score, parsing it with int.Parse(...).

WP7 - Dynamic information request to a server based on information entered by the user

This time, I come here just to get some opinions/viewpoints.
I have an 'autocomplete' component that gets the city names of my country from my server. For each city name typed into this component, it should go to my server and get some info.
How am I doing it currently?
At each letter typed into this component, it requests a list of cities that start with the letters typed so far.
Obviously, that is not a good way to do it, because each request based on just one more letter gives me very similar lists.
Can you think of a better way to do it?
What is a better way? One that does not make unnecessary requests.
You could either preload all the city names locally (a country with 10,000 cities having an average name length of 11 bytes [10 single-byte characters + NUL] would require not much more than 110 KB of space, depending on the method of storage [possibly something closer to 200 KB?]; so if you're okay with a quite possibly very small delay when loading the page, and aren't worried much about phone data limits, I'd suggest this), or you could have the city names cached on the local machine, so that while unique key combinations will result in server fetches, a repeated key combination in a later lookup will not.
I'm not really experienced with this aspect of programming, though, so I'm probably not the best person to give this sort of advice.
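Still, to make the caching idea concrete, here is a rough C# sketch; FetchFromServerAsync stands in for whatever HTTP call the app already makes, and async syntax is used for brevity even though WP7-era code would likely use callbacks:

using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch: remember the city list returned for each prefix so repeated
// keystrokes never hit the server twice.
class CityCache
{
    readonly Dictionary<string, List<string>> cache = new Dictionary<string, List<string>>();

    public async Task<List<string>> GetCitiesAsync(string prefix)
    {
        if (cache.TryGetValue(prefix, out var cities))
            return cities;                           // cache hit: no request
        cities = await FetchFromServerAsync(prefix); // cache miss: one request
        cache[prefix] = cities;
        return cities;
    }

    Task<List<string>> FetchFromServerAsync(string prefix)
    {
        // Placeholder for the real server call.
        return Task.FromResult(new List<string>());
    }
}

A further refinement: since every result for "abc" is also a result for "ab", you can often filter the cached "ab" list locally instead of fetching at all.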

Finding string segments in a string

I have a list of segments (15,000+), and I want to find the occurrences of these segments in a given string. A segment can be a single word or multiple words; I cannot assume space is a delimiter in the string.
e.g.
String "How can I download codec from internet for facebook, Professional programmer support"
[the string above may not make any sense but I am using it for illustration purposes]
segment list
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.
Output:
Download codec from internet
facebook
Professional programmer
Basically, I am trying to do query reduction.
I want to achieve it in less than O(list length + string length) time.
As my list has more than 15,000 segments, it will be time-consuming to search the entire list against the string.
The segments are prepared manually and placed in a txt file.
Regards
~Paul
You basically want a multi-pattern string search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively searching for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns.
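A compact C# sketch of Aho-Corasick; the class and method names are mine, and production code would use a vetted implementation:

using System;
using System.Collections.Generic;

// Minimal Aho-Corasick: a trie over the patterns, failure links added
// via BFS, then a single scan over the text reporting every match.
class AhoCorasick
{
    class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    readonly Node root = new Node();

    public void AddPattern(string pattern)
    {
        var node = root;
        foreach (char c in pattern.ToLowerInvariant())
        {
            if (!node.Next.TryGetValue(c, out var child))
                node.Next[c] = child = new Node();
            node = child;
        }
        node.Output.Add(pattern);
    }

    public void Build()
    {
        var queue = new Queue<Node>();
        foreach (var child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var kv in node.Next)
            {
                char c = kv.Key;
                var child = kv.Value;
                // Follow failure links until a node with an edge on c is found.
                var fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(c))
                    fail = fail.Fail;
                child.Fail = fail == null ? root : fail.Next[c];
                child.Output.AddRange(child.Fail.Output); // inherit shorter matches
                queue.Enqueue(child);
            }
        }
    }

    public IEnumerable<string> Search(string text)
    {
        var node = root;
        foreach (char c in text.ToLowerInvariant())
        {
            while (node != root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out var next))
                node = next;
            foreach (var match in node.Output)
                yield return match;
        }
    }
}

You add each segment with AddPattern, call Build once, and then Search makes a single pass over the input string, reporting every segment it contains.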
In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:
http://en.wikipedia.org/wiki/Key_Word_in_Context
http://www.cs.duke.edu/~ola/ipc/kwic.html
What you're basically asking is how to write a custom lexer/parser.
Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).
Take a look at this question:
Poor man's lexer for C#
Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The trouble with regex in this situation is that your execution time grows linearly with the number of tokens you are matching against. So, if you end up needing to "segment" more phrases, your execution time will get longer and longer.
What you need to do is make a single pass, pushing words onto a stack and checking whether they form a valid token after each one is added. If they don't, then you need to continue (disregarding the token like a compiler disregards comments).
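One way to realize that single-pass idea in C#, using a word window rather than an explicit stack; segments are assumed to be normalized to lower case with punctuation stripped:

using System;
using System.Collections.Generic;
using System.Linq;

// Walk the words once; after each word, check every phrase ending at it
// (up to the longest segment's word count) against the segment set.
static List<string> FindSegments(string text, HashSet<string> segments)
{
    int maxWords = segments.Max(s => s.Split(' ').Length);
    var words = text.ToLowerInvariant()
                    .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);
    var found = new List<string>();
    for (int i = 0; i < words.Length; i++)
    {
        string phrase = words[i];
        if (segments.Contains(phrase)) found.Add(phrase);
        for (int j = i - 1; j >= 0 && i - j < maxWords; j--)
        {
            phrase = words[j] + " " + phrase;
            if (segments.Contains(phrase)) found.Add(phrase);
        }
    }
    return found;
}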
Hope this helps.

Efficient data structure for tags?

Imagine you wanted to serialize and deserialize Stack Overflow posts, including their tags, as space-efficiently as possible (in binary), but also with good performance for tag lookups. Is there a good data structure for that kind of scenario?
Stack Overflow has about 28,532 different tags. You could create a table with all tags and assign each an integer; furthermore, you could sort them by frequency so that the most common tags have the lowest numbers. Still, storing them simply as a string in the format "1 32 45" seems a bit inefficient, both from a searching and a storing perspective.
Another idea is to save the tags as a variable-length bit array, which is attractive from a lookup and serialization perspective. Since the most common tags come first, you could potentially fit the tags into a small amount of memory.
The problem would of course be that uncommon tags would yield huge bit arrays. Is there any standard for "compressing" bit arrays with large runs of 0s? Or should one use some other structure completely?
EDIT
I'm not looking for a DB solution, or a solution where I need to keep entire tables in memory, but a structure for filtering individual items.
Not to undermine your question, but 28k records is really not that many. Are you perhaps optimizing prematurely?
I would first stick to using 'regular' indices on a DB table. The hashing heuristics they use are typically very efficient and not trivial to beat (and if you can, is it really worth the time and effort, and are the gains large enough?).
Also, depending on where you actually run the tag query, will the user really notice the 200 ms gain you optimized for?
First measure then optimize :-)
EDIT
Without a DB I would probably have a master table holding all tags together with an ID (if possible held in memory). Keep a regular sorted list of tag IDs with each post.
Not sure how much storage based on commonality would help. A sorted list you can binary search may prove fast enough; measure :-)
Here you would need to iterate over all posts for every tag query, though.
If this ends up being too slow, you could resort to storing a pocket of post identifiers for each tag. This data structure may become somewhat large, though, and may require a file to seek and read against.
For a smaller table you could build one based on hashed values (with duplicates). This way you could quickly get down to a smaller candidate list of posts that need further checking to see whether they actually match.
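A sketch of the sorted-list variant; the types and names are illustrative:

using System;
using System.Collections.Generic;

// Each post keeps its tag IDs as a small sorted array, so a
// "does this post have tag X?" check is a binary search, and a
// tag query is one linear scan over the posts.
class Post
{
    public int Id;
    public int[] SortedTagIds; // kept sorted when the post is saved

    public bool HasTag(int tagId)
    {
        return Array.BinarySearch(SortedTagIds, tagId) >= 0;
    }
}

static IEnumerable<Post> PostsWithTag(IEnumerable<Post> posts, int tagId)
{
    foreach (var post in posts)
        if (post.HasTag(tagId))
            yield return post;
}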
You need a second table with two fields: tag_id and question_id.
That's it. Then you create indexes on (tag_id, question_id) and on (question_id, tag_id); those are covering indexes, so all your queries will be very fast.
I have a feeling you abstracted your question too much; you didn't say very much about how you want to access the data structure, which is very important.
That being said, I suggest counting the number of occurrences of each tag and then using Huffman coding to come up with the shortest encoding for the tags. This is not entirely perfect, but I'd stick with it until you've demonstrated that it's inappropriate. You can then associate the codes with each question.
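A sketch of that encoding step; this builds the codes with a simple repeated-sort merge rather than a priority queue, which is fine at 28k tags (all the names here are mine):

using System;
using System.Collections.Generic;
using System.Linq;

class HuffmanNode
{
    public string Tag;   // null for internal nodes
    public int Count;
    public HuffmanNode Left, Right;
}

class TagHuffman
{
    // Builds a bit-string code for each tag from its occurrence count;
    // frequent tags end up with the shortest codes.
    public static Dictionary<string, string> BuildCodes(Dictionary<string, int> tagCounts)
    {
        var nodes = tagCounts
            .Select(kv => new HuffmanNode { Tag = kv.Key, Count = kv.Value })
            .ToList();

        // Repeatedly merge the two least frequent nodes (O(n^2) sketch).
        while (nodes.Count > 1)
        {
            nodes.Sort((a, b) => a.Count.CompareTo(b.Count));
            var left = nodes[0];
            var right = nodes[1];
            nodes.RemoveRange(0, 2);
            nodes.Add(new HuffmanNode { Count = left.Count + right.Count, Left = left, Right = right });
        }

        var codes = new Dictionary<string, string>();
        Assign(nodes[0], "", codes);
        return codes;
    }

    static void Assign(HuffmanNode node, string prefix, Dictionary<string, string> codes)
    {
        if (node.Tag != null)
        {
            // Guard against the degenerate single-tag case (empty prefix).
            codes[node.Tag] = prefix.Length > 0 ? prefix : "0";
            return;
        }
        Assign(node.Left, prefix + "0", codes);
        Assign(node.Right, prefix + "1", codes);
    }
}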
If you want to look up the questions under a specific tag efficiently, you will need some kind of index. Maybe each Tag object could have an array of references (pointers, numeric IDs, etc.) to all the questions tagged with it. That way you simply find the tag object and you have an array pointing to all of its questions.
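That is essentially an inverted index; a minimal sketch with illustrative names:

using System.Collections.Generic;

// One dictionary mapping each tag ID to the list of question IDs
// carrying it, so a tag lookup is a single fetch.
class TagIndex
{
    readonly Dictionary<int, List<int>> questionsByTag = new Dictionary<int, List<int>>();

    public void Add(int tagId, int questionId)
    {
        if (!questionsByTag.TryGetValue(tagId, out var list))
            questionsByTag[tagId] = list = new List<int>();
        list.Add(questionId);
    }

    public IReadOnlyList<int> QuestionsFor(int tagId)
    {
        return questionsByTag.TryGetValue(tagId, out var list)
            ? (IReadOnlyList<int>)list
            : new int[0];
    }
}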

Fuzzy data matching for personal demographic information

Let's say I have a database filled with people with the following data elements:
PersonID (meaningless surrogate autonumber)
FirstName
MiddleInitial
LastName
NameSuffix
DateOfBirth
AlternateID (like an SSN, Military ID, etc.)
I get lots of data feeds in from all kinds of formats with every reasonable variation on these pieces of information you could think of. Some examples are:
FullName, DOB
FullName, Last 4 SSN
First, Last, DOB
When this data comes in, I need to write something to match it up. I don't need, or expect, to get more than an 80% match rate. After the automated match, I'll present the uncertain matches on a web page for someone to manually match.
Some of the complexities are:
Some data matches are better than others, and I would like to assign weight to those. For example, if the SSN matches exactly but the name is off because someone goes by their middle name, I would like to assign a much higher confidence value to that match than if the names match exactly but the SSNs are off.
The name matching has some difficulties. John Doe Jr is the same as John Doe II, but not the same as John Doe Sr., and if I get John Doe and no other information, I need to be sure the system doesn't pick one because there's no way to determine who to pick.
First name matching is really hard. You have Bob/Robert, John/Jon/Jonathon, Tom/Thomas, etc.
Just because I have a feed with FullName+DOB doesn't mean the DOB field is filled for every record. I don't want to miss a linkage just because the unmatched DOB kills the matching score. If a field is missing, I want to exclude it from the elements available for matching.
If someone manually matches, I want their match to affect all future matches. So, if we ever get the same exact data again, there's no reason not to automatically match it up next time.
I've seen that SSIS has fuzzy matching, but we don't use SSIS currently, and I find it pretty kludgy and nearly impossible to version control so it's not my first choice of a tool. But if it's the best there is, tell me. Otherwise, are there any (preferably free, preferably .NET or T-SQL based) tools/libraries/utilities/techniques out there that you've used for this type of problem?
There are a number of ways you can go about this, but having done this type of thing before, I will say up front that you run a lot of risk of "incorrect" matches between people.
Your input data is very sparse, and what you have isn't the most unique if not all values are present.
For example, with your First Name, Last Name, DOB situation: if you have all three parts for ALL records, then the matching gets a LOT easier to work with. If not, you expose yourself to a lot of potential issues.
One approach you might take, on the cruder side of things, is to simply create a process using a series of queries that identifies and classifies matching entries.
For example, first check for an exact match on name and SSN; if that is there, flag it, note it as 100%, and move on to the next set. Then you can explicitly define where your matching is fuzzy, so you know the potential ramifications of each match.
In the end you would have a list with flags indicating the match type, if any, for each record.
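A sketch of that flag-and-weight idea in C#; the Person shape, the weights, and the exact-match test are all illustrative (real code would plug in fuzzy comparers per field). Note how a missing field is excluded rather than penalized, per the requirement above:

using System;

class Person
{
    public string FirstName, LastName, DateOfBirth, AlternateId;
}

static double MatchConfidence(Person a, Person b)
{
    double score = 0, totalWeight = 0;
    void Compare(string x, string y, double weight)
    {
        // Missing data on either side: exclude the field entirely rather
        // than letting it drag the score down.
        if (string.IsNullOrEmpty(x) || string.IsNullOrEmpty(y))
            return;
        totalWeight += weight;
        if (string.Equals(x.Trim(), y.Trim(), StringComparison.OrdinalIgnoreCase))
            score += weight;
    }
    Compare(a.AlternateId, b.AlternateId, 5.0);  // SSN-style IDs weigh most
    Compare(a.LastName,    b.LastName,    3.0);
    Compare(a.DateOfBirth, b.DateOfBirth, 3.0);
    Compare(a.FirstName,   b.FirstName,   1.0);  // nicknames make this weak
    return totalWeight == 0 ? 0 : score / totalWeight;
}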
This is a problem called record linkage.
While it's for a Python library, the documentation for dedupe gives a good overview of how to approach the problem comprehensively.
Take a look at the Levenshtein algorithm, which gives you the distance between two strings; dividing that distance by the length of the longer string yields a percentage match.
http://en.wikipedia.org/wiki/Levenshtein_distance
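A compact sketch of the distance plus the percentage conversion; the helper names are mine:

using System;

static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,           // deletion
                d[i, j - 1] + 1),          // insertion
                d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)); // substitution
    return d[a.Length, b.Length];
}

static double PercentMatch(string a, string b)
{
    int longest = Math.Max(a.Length, b.Length);
    return longest == 0 ? 100.0 : (1.0 - (double)Levenshtein(a, b) / longest) * 100.0;
}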
I have previously implemented this to great success. It was a provider portal for a healthcare company, and providers registered themselves on the site. The matching was to take their portal registration and find the corresponding record in the main healthcare system. The processors who attended to this were presented with the most likely matches, ordered by percentage descending, and could easily choose the right account.
If false positives don't bother you and your names are primarily English, you can try algorithms like Soundex. SQL Server has it as a built-in function. Soundex isn't the best, but it does do fuzzy matching and is popular. Another alternative is Metaphone.
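For illustration, a compact C# Soundex sketch following the classic rules (H and W are transparent, vowels separate duplicate codes); a production system would use a tested library or the SQL Server built-in:

using System.Text;

static string Soundex(string name)
{
    if (string.IsNullOrEmpty(name)) return "";
    string upper = name.ToUpperInvariant();
    var code = new StringBuilder();
    code.Append(upper[0]);
    char lastDigit = MapToDigit(upper[0]);
    for (int i = 1; i < upper.Length && code.Length < 4; i++)
    {
        char d = MapToDigit(upper[i]);
        if (d != '0' && d != lastDigit)
            code.Append(d);
        // Vowels reset the duplicate check; H and W deliberately do not.
        if (d != '0' || "AEIOUY".IndexOf(upper[i]) >= 0)
            lastDigit = d;
    }
    return code.ToString().PadRight(4, '0');
}

static char MapToDigit(char c)
{
    if ("BFPV".IndexOf(c) >= 0) return '1';
    if ("CGJKQSXZ".IndexOf(c) >= 0) return '2';
    if ("DT".IndexOf(c) >= 0) return '3';
    if (c == 'L') return '4';
    if ("MN".IndexOf(c) >= 0) return '5';
    if (c == 'R') return '6';
    return '0'; // vowels, plus H, W, Y
}

Two names match when their codes are equal; for example, Soundex("Robert") and Soundex("Rupert") both give R163.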

Categories

Resources