So I've looked into fuzzy searching and the Levenshtein distance algorithm, and I'm not sure either is a true fit for what I'm doing. Please let me know your thoughts, if any...
How can I take a user's full name, and generate a list of similar names? I want to prevent a user from creating multiple accounts in an application by providing a "Hey are you sure none of these are you" as a final step before account creation.
I've found this article, but it's entirely SQL-based (http://stackoverflow.com/questions/988050/matching-records-based-on-person-name)
I'm using C# / LINQ and SQL Server.
Thanks for your time!
Here is a link to a SOUNDEX implementation in .NET:
http://www.codeproject.com/KB/recipes/soundex.aspx
I haven't used it but it seems to be rated well
If it were me, I would require an exact match on the last name, and then only try to guess variances of the first name. This would narrow down your field of work quite a bit.
Then, as you suggested in your comments, you could apply rules such as allowing the first name length to differ by only a few characters, and requiring that a threshold of, say, 80% of the characters match.
You could also restrict the search to first names that share the first X characters, since most English name variations occur after the first few characters.
Example:
John Doe
Johnny Doe
Johnathan Doe
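The rules above could be sketched roughly as follows. This is only an illustration in Python, not C#/LINQ, and the function name, the default values (prefix_len, max_len_diff, threshold) and the way "80% of the characters match" is measured (share of the shorter first name found, in order, in the longer one) are all my own assumptions.

```python
from difflib import SequenceMatcher

def similar_names(first, last, existing, prefix_len=3, max_len_diff=5, threshold=0.8):
    """Return the (first, last) pairs in `existing` that look like the same person.

    All parameter defaults are illustrative assumptions, not tuned values.
    """
    first_l, last_l = first.lower(), last.lower()
    matches = []
    for cand_first, cand_last in existing:
        cf = cand_first.lower()
        # Rule 1: the last name must match exactly (ignoring case).
        if cand_last.lower() != last_l:
            continue
        # Rule 2: the first X characters of the first name must match.
        if cf[:prefix_len] != first_l[:prefix_len]:
            continue
        # Rule 3: the first names may differ in length by only a few characters.
        if abs(len(cf) - len(first_l)) > max_len_diff:
            continue
        shorter, longer = sorted((cf, first_l), key=len)
        if not shorter:
            continue
        # Rule 4: enough characters of the shorter name appear, in order, in the longer one.
        blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
        matched = sum(block.size for block in blocks)
        if matched / len(shorter) >= threshold:
            matches.append((cand_first, cand_last))
    return matches

existing = [("Johnny", "Doe"), ("Johnathan", "Doe"), ("Jane", "Doe"), ("John", "Smith")]
print(similar_names("John", "Doe", existing))
```

In a real application the exact last-name filter would normally run in the database, so only a small candidate list reaches this kind of comparison.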
First of all I'd like to mention that I'm new to programming and to this site, so I'm still an infant in this world. However, I have a problem.
I have to make code that can compare two strings but the second string (from a file) will have unique identifiers within it. For example:
first string:
I have 10 cats and their fur is #000000
Second string from a file:
I have <d> cats and their fur is <h>
Although I probably don't need to explain, 'd' stands for a number or decimal and 'h' for hex. There are also 's' and 'a', which are associated with ASCII.
What's supposed to happen is that the first string can contain any number, of any length, and/or any hex value when the data comes in, while the rest of the message stays the same, e.g.
I have 1500 cats and their fur is #000000
The code will still treat the two strings as a match, as it effectively ignores anything that is an int or hex. (These identifiers are user-defined, so they can be anywhere in any string.)
The end game is that if it finds a relative match, the code will change the colour of the text in the app, among other things. It's basically to highlight errors in a log file.
I've searched high and low on Stack Overflow and looked into regex and string comparisons. I'm about to make a start on the code, but would like some input/help.
Obviously I'm not asking for something to be written for me, just to be pointed in the right direction so I can learn.
Many thanks in advance! And apologies if there is a similar post out there, but alas I couldn't find it if there is.
If I understand it correctly, I think I would solve this by replacing the <d> etc. with a regular expression. Then use that regex to replace the values with an empty string. That way you can compare the strings without the values.
Hope that makes sense. I didn't include any code because you asked for just some directions.
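Since a concrete shape may still help as a direction, here is a minimal sketch of a closely related variation: instead of stripping the values and comparing, the template itself is turned into one regex and matched against the incoming line. It is written in Python for brevity. The patterns for 'd' and 'h' (including the optional leading '#') are assumptions on my part, and 's' and 'a' are left out because their exact meaning isn't specified.

```python
import re

# Hypothetical patterns for two of the placeholders; adjust to your own definitions.
PLACEHOLDERS = {
    "d": r"\d+(?:\.\d+)?",   # an integer or a decimal number
    "h": r"#?[0-9A-Fa-f]+",  # a hex value with an optional leading '#'
}

# Matches <d>, <h>, ... for the known placeholder letters only.
PLACEHOLDER_RE = re.compile("<([" + "".join(PLACEHOLDERS) + "])>")

def template_to_regex(template):
    """Turn 'I have <d> cats' into a compiled regex with the literal text escaped."""
    parts = PLACEHOLDER_RE.split(template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)     # literal text between placeholders
        else:
            pattern += PLACEHOLDERS[part]  # captured placeholder letter
    return re.compile(pattern)

def matches(template, line):
    """True if the whole line fits the template."""
    return template_to_regex(template).fullmatch(line) is not None

template = "I have <d> cats and their fur is <h>"
print(matches(template, "I have 10 cats and their fur is #000000"))
print(matches(template, "I have 1500 cats and their fur is #000000"))
print(matches(template, "I have ten cats and their fur is #000000"))
```

Unknown placeholders such as <s> are treated as literal text here, so you would add a pattern for each identifier your users can define.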
Basically, I need to create a method that will do its darn best to take a name field and split it into title, firstNames and lastNames.
E.g. Mr. Daniel George Trump will become:
Title: Mr.
FirstNames: Daniel George
LastNames: Trump
or
Mr. Daniel George Trump and Mrs. Sarah Trump will become:
Title: Mr and Mrs
FirstNames: Daniel George & Sarah
LastNames: Trump & Trump (some inputs may be two people with different surnames).
Thanks
Don't. Don't try to interpret a name in a program. You'll only sometimes get it right.
People naming is so extremely complicated that its study, called anthroponymy, is the subject of a branch of anthropology.
Let's imagine you begin with a title, a given name, a middle name and a family name, the family name being the name of the father. So you just split the name into components and assign each component.
But this approach is plain wrong in hundreds of cases. Some people use just one given name and one family name. Others (Spanish, for example) use one given name and two family names. Some cultures (Hungarian, some Asian) reverse the order, putting the family name first and then the given name (Eastern order). Some use the name of the mother as the family name (matronyms). Some use the name of the clan, or a historical name of the family, or the birthplace. Some (Portuguese) usually set the mother's family name as the middle name. Some people from countries that use Eastern order reverse the order when addressing Western audiences; other people from the same countries don't. Combinations are infinite. A complete and terrible nightmare.
So the only correct way, in my opinion, is to keep a single name field and let users put whatever they want there. And don't try to interpret it.
But there are occasions where some external regulation may require your software to comply with a given set of separate fields. In that case, and only in that case, your only bet is to pass this directly to the users, allowing them to fill in the fields as they prefer.
You might make a list of standard titles (Mr., Mrs., etc.) and try to match any of those. Then for two-word names, use the first as the first name and the second as the last name. For longer names, maybe use the first two words as the first name and the rest as the last.
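A minimal sketch of that heuristic, in Python for brevity. The title list, the function name and the returned keys are illustrative assumptions, and it deliberately ignores the "Mr. X and Mrs. Y" case from the question. It will get many real names wrong, for the reasons given in the other answer.

```python
# Illustrative, far from exhaustive.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "prof"}

def split_name(full_name):
    """Best-effort split of a single person's name into title, first and last names."""
    words = full_name.split()
    title = ""
    # Treat the first word as a title if it is in the known list (with or without a dot).
    if words and words[0].rstrip(".").lower() in TITLES:
        title = words.pop(0)
    if len(words) <= 1:
        first, last = " ".join(words), ""
    elif len(words) == 2:
        first, last = words
    else:
        # Longer names: first two words as first names, the rest as last names.
        first, last = " ".join(words[:2]), " ".join(words[2:])
    return {"title": title, "first_names": first, "last_names": last}

print(split_name("Mr. Daniel George Trump"))
```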
I'm using C# and a SQL Server database.
I have an autocomplete field that works fine with normal characters. I would like it to autocomplete special characters too, such as ö, Ä, é, è, ...
I would also like it to recognise characters that may sound similar in some languages, such as 'b' and 'v', so that typing 'boor' would offer 'voor' as a possible suggestion.
Any ideas?
Thanks
Edit: The autocomplete textboxes are used for names and surnames (one for each). They are built with AutoCompleteStringCollection. They search the database for names or surnames that already exist.
This part of the application basically lets the user add new persons to the application (name, surname, etc.).
The goal is that when the user is creating a new person in the application, he/she gets a list of the persons with a name or surname similar to the one he/she is typing.
So if we already have 'James Smith' in the database and the user types Smyth, he/she should be offered the option to change it to Smith (as an autocomplete, maybe), saying "hey, do you mean 'Smith'?" That way we avoid the user creating the same person under a misspelled name.
Because we are working with names and surnames from people from all over the world, the errors in the creation of a new person can come from any language.
PS:
Would it be a good idea to create my own autocomplete, hiding/showing a listbox right under the textbox?
From what I've tried, the SOUNDEX function works really well for characters like ö, Ä, é, è, ... But I can't call the database for every single name or surname, so I don't know how best to use it.
I am not sure what you mean by autocompletion. Regarding the second part of your question, you probably need the SQL Server SOUNDEX function. It returns a four-character (SOUNDEX) code that can be used to evaluate the similarity of two strings.
Use it like:
SELECT SOUNDEX ('Smith'), SOUNDEX ('Smythe');
The words above are spelled almost identically, so they have the same Soundex code: S530 and S530.
I think the soundex may be used with various languages, though I am not totally sure.
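If calling the database for every keystroke is a concern, the codes can also be computed in the application. Below is a minimal sketch of the common American Soundex rules in Python. It is not SQL Server's implementation, so results may differ from SOUNDEX() in edge cases, and it simply skips any character outside a-z (accented letters included).

```python
# Letter groups that share a Soundex digit; vowels, y, h and w have no digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(word):
    """Return the four-character American Soundex code, or '' if there are no a-z letters."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so the same code after a vowel is kept
    return (result + "000")[:4]  # pad with zeros, truncate to four characters

print(soundex("Smith"), soundex("Smythe"))  # S530 S530
```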
Unfortunately, the only thing you can use as the autocomplete source is an AutoCompleteStringCollection.
But the logic that decides what is presented to the user (the box with the matching items below the textbox itself) is fully controlled by the TextBox and can't be influenced in any way.
So even if you use something like SoundEx() or Levenshtein distance, you can't tell the TextBox about it, because it always performs a String.StartsWith() on the given collection, and on selection it replaces the whole content with the selected value from the source.
That's something that already drove me crazy. You simply can't influence which items from the list are presented to the user, and you can't influence what happens when an item from the box is selected.
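If you do end up building your own suggestion list (a listbox under the textbox, as the question proposes), the matching step is then under your control. Here is a rough sketch of accent-insensitive prefix matching, in Python for brevity. The function names are hypothetical and it only covers characters that decompose into a base letter plus accents, not phonetic pairs like 'b' and 'v'.

```python
import unicodedata

def fold(text):
    """Lower-case the text and strip accents (ö -> o, é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()

def suggestions(typed, names):
    """Names whose accent-folded form starts with the accent-folded input."""
    key = fold(typed)
    return [name for name in names if fold(name).startswith(key)]

names = ["José", "Jörg", "Josefine"]
print(suggestions("jose", names))
print(suggestions("jor", names))
```

The same fold() key could be stored next to each name so that the list is filtered in memory rather than by a database call per keystroke.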
I would look into Levenshtein distance.
Soundex is rather primitive. It was originally developed to be calculated by hand. It produces a key and works well with Western names and surnames.
Levenshtein distance looks at two string values and produces a value based on their similarity. It looks for missing or substituted letters (there is no phonetic comparison, unlike Soundex).
Wikipedia reference: http://en.wikipedia.org/wiki/Levenstein_distance
Website for testing two string values using Levenshtein distance: http://gtools.org/levenshtein-calculate.php
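For reference, the distance itself is short to implement. This is a minimal sketch of the standard dynamic-programming version in Python (the function name is mine), keeping only two rows of the table at a time:

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]

print(levenshtein("Smith", "Smyth"))  # 1
print(levenshtein("boor", "voor"))    # 1
```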
I looked through the related questions, there were quite a few but I don't think any answered this question. I am very new to Regex but I'm trying to get better so bear with me please. I am trying to match several groups in a string, but in any order. Is this something I should be using Regex for? If so, how? If it matters, I plan to use these in IronPython.
EDIT: Someone asked me to be more specific, so here:
I want to use re.match with a regex like:
\[image\s*(?(#alt:(?<alt>.*?);).*(#title:(?<title>.*?);))*.*\](?<arg>.*?)\[\/image\]
But it will only match the named groups when they are in the right order, and separated with a space. I would like to be able to match the named groups in any order, as long as they appear where they do now in the regex.
A typical string that will be applied to this might look like:
[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien.png[/image]
But I should have no problem matching:
[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien.png[/image]
So the 'attributes' (things that come between '#' and ';' in the first 'tag') should be matched in any order, as long as they both appear.
The answer to the question in your title is "no" -- to match N groups "in any order", the regex should have an "or" (the | feature in the regex pattern) among the N! (N factorial) possible permutations of the groups, the product of all integers from 1 to N. That's a number which grows extremely fast -- for N equal to just 6, it's already 720, for 7, it's just over 5000, and so on at a dizzying pace -- so this approach is totally impractical for any N which isn't really tiny.
The solutions may be many, depending on what you want the groups to be separated with. Let's say, for example, that you don't care (if you DO care, edit your question with better specs).
In this case, if overlapping matches are impossible or are OK with you, make N separate regular expressions, one per group -- say these N compiled RE objects are in a list named grps, then
mos = [g.search(thestring) for g in grps]
is the list of match objects for the groups (None for a group which doesn't match). With the mos list you can do all sorts of checks and/or further manipulations, for example all(mos) is True if and only if all the groups matched, in which case [m.group() for m in mos] is the list of substrings that have been matched, and so on, and so forth.
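Applied to the attribute names from the question, that idea could look like the sketch below. The two patterns and the find_attributes helper are illustrative assumptions, and note that the search covers the whole string, not just the part inside the opening tag.

```python
import re

# One compiled pattern per group, in no particular order.
grps = [
    re.compile(r"#alt:(?P<alt>.*?);"),
    re.compile(r"#title:(?P<title>.*?);"),
]

def find_attributes(thestring):
    """Return {'alt': ..., 'title': ...} if every group matched, else None."""
    mos = [g.search(thestring) for g in grps]
    if not all(mos):
        return None
    found = {}
    for m in mos:
        found.update(m.groupdict())
    return found

s1 = "[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien.png[/image]"
s2 = "[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien.png[/image]"
print(find_attributes(s1))
print(find_attributes(s2))
```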
If you need non-overlapping matches, it's a bit more complicated -- you may extract the boundaries of all possible matches for each group, then see if there's a way to extract from these N lists a set of N intervals, one per list, so that no two of them intersect. This is a somewhat subtle algorithm (if you want reasonable speed for a large N, of course), so I think it's worth a separate question, and in any case it's not worth discussing right here when the very issue of whether it's needed or not depends on so incredibly many factors that you have not specified.
So, please edit your question with more precise specifications, first, and then things can perhaps be clarified to provide you with the code and/or algorithms you need.
Edit: I see the OP has now clarified the issue, at least to the extent of providing an example -- although, confusingly, he offers a RE pattern example and a string example that should not match, regardless of ordering (the RE specifies the presence of a substring #title which the example string does not have -- puzzling!).
Anyway, if the number of groups in the example (two which appear to be interchangeable, one which appears to have to occur in a specific spot) is representative of the OP's actual problems, then the total number of permutations of interest is just two, so joining the "just two" permutations with a vertical bar | would of course be quite feasible. Is that the case in the OP's real problems, though...?
Edit: if the number of permutations of interest is tiny, here's an example of one way to avoid the problem of repeated group names in the pattern (the syntax requires Python 2.7 or later, but only for the final "dict comprehension"; the same functionality is available in many earlier versions of Python, just with the less elegant dict(('a', ... syntax ;-)):
>>> r = re.compile(r'(?P<a1>a.*?a).*?(?P<b1>b.*?b)|(?P<b2>b.*?b).*?(?P<a2>a.*?a)')
>>> m = r.search('zzzakkkavvvbxxxbnnn')
>>> g = m.groupdict()
>>> d = {'a':(g.get('a1') or g.get('a2')), 'b':(g.get('b1') or g.get('b2'))}
>>> d
{'a': 'akkka', 'b': 'bxxxb'}
This is very similar to one of the key problems with using regular expressions to parse HTML: there is no requirement that attributes always be specified in the same order, and many tags have surprising attributes (like <br clear="all">). So it seems you are working with a very similar markup syntax.
Pyparsing addresses this problem in an indirect way - instead of trying to parse all different permutations, parse the general "#attrname:attribute value;" syntax, and keep track of the attributes keys and values in an attribute mapping data structure. The mapping makes it easy to get the "title" attribute, regardless of whether it came first or last in the image tag. This behavior is built into the pyparsing API methods, makeHTMLTags and makeXMLTags.
Of course, this markup is not XML, but a similar approach gives some pretty easy to work with results:
text = """[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien1.png[/image]
But I should have no problem matching:
[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien2.png[/image]
"""
from pyparsing import Suppress, Group, Word, alphas, SkipTo, Dict, ZeroOrMore
LBRACK,RBRACK,COLON,SEMI,AT = map(Suppress,"[]:;#")
tagAttribute = Group(AT + Word(alphas) + COLON + SkipTo(SEMI) + SEMI)
imageTag = LBRACK + "image" + Dict(ZeroOrMore(tagAttribute)) + RBRACK
imageLink = imageTag + SkipTo("[/image]")("text")
for taginfo in imageLink.searchString(text):
    print(taginfo.alt)
    print(taginfo.title)
    print(taginfo.text)
    print()
Prints:
alien
reddit alien
http://www.reddit.com/alien1.png
alien
reddit alien
http://www.reddit.com/alien2.png
I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.
Check out the wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around and pretty easy to implement an algorithm in whichever language you are using, in your case, C#.
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.
Consider using keyword matching together with edit-distance-based similarity. You might also combine these with data that maps what was 'originally searched' to what was 'actually clicked'.
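A rough sketch of combining the two signals, in Python for brevity. The equal weighting, the min_score cut-off and the function names are arbitrary assumptions. It also uses difflib's SequenceMatcher ratio as a stand-in for a true edit distance, and leaves out the click data entirely.

```python
from difflib import SequenceMatcher

def score(query, name):
    """Blend keyword overlap with whole-string similarity, both in the range 0..1."""
    q_tokens = set(query.lower().split())
    n_tokens = set(name.lower().split())
    # Share of the query's words that appear verbatim in the business name.
    keyword = len(q_tokens & n_tokens) / len(q_tokens) if q_tokens else 0.0
    # Character-level similarity, tolerant of typos within words.
    similarity = SequenceMatcher(None, query.lower(), name.lower()).ratio()
    return 0.5 * keyword + 0.5 * similarity

def search(query, names, limit=5, min_score=0.4):
    """Return up to `limit` names, best first, that score at least `min_score`."""
    ranked = sorted(((score(query, n), n) for n in names), reverse=True)
    return [n for s, n in ranked if s >= min_score][:limit]

names = ["ABC Business Name", "ABD Busness Name", "XYZ Plumbing"]
print(search("ABC Business Name", names))
```

Scoring every name like this is linear in the size of the list, so for a large list you would first narrow the candidates (for example by a shared keyword or prefix) and only rank the survivors.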
This is probably a crazy solution, but could you split the business name on spaces and then search on either all the words or maybe just the first couple?
So you might search on 'ABC' and 'Business' but leave out 'Name', as searching on every word might take too long.
You might even check whether the string is over a certain length, then trim it and search on just the first, say, 5 letters.
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name on spaces.
You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE gives a number which represents how "different" two strings are based on sound.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.