Fuzzy data matching for personal demographic information


Let's say I have a database filled with people with the following data elements:
PersonID (meaningless surrogate autonumber)
FirstName
MiddleInitial
LastName
NameSuffix
DateOfBirth
AlternateID (like an SSN, Military ID, etc.)
I get lots of data feeds in all kinds of formats, with every reasonable variation on these pieces of information you could think of. Some examples are:
FullName, DOB
FullName, Last 4 SSN
First, Last, DOB
When this data comes in, I need to write something to match it up. I don't need, or expect, to get more than an 80% match rate. After the automated match, I'll present the uncertain matches on a web page for someone to manually match.
Some of the complexities are:
Some data matches are better than others, and I would like to assign weight to those. For example, if the SSN matches exactly but the name is off because someone goes by their middle name, I would like to assign a much higher confidence value to that match than if the names match exactly but the SSNs are off.
The name matching has some difficulties. John Doe Jr is the same as John Doe II, but not the same as John Doe Sr., and if I get John Doe and no other information, I need to be sure the system doesn't pick one because there's no way to determine who to pick.
First name matching is really hard. You have Bob/Robert, John/Jon/Jonathon, Tom/Thomas, etc.
Just because I have a feed with FullName+DOB doesn't mean the DOB field is filled for every record. I don't want to miss a linkage just because the unmatched DOB kills the matching score. If a field is missing, I want to exclude it from the elements available for matching.
If someone manually matches, I want their match to affect all future matches. So, if we ever get the same exact data again, there's no reason not to automatically match it up next time.
I've seen that SSIS has fuzzy matching, but we don't use SSIS currently, and I find it pretty kludgy and nearly impossible to version control so it's not my first choice of a tool. But if it's the best there is, tell me. Otherwise, are there any (preferably free, preferably .NET or T-SQL based) tools/libraries/utilities/techniques out there that you've used for this type of problem?
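The weighting and missing-field requirements listed above can be sketched roughly as follows. This is only an illustration in Python (the same logic ports directly to C# or T-SQL); the field names and the weights in `WEIGHTS` are assumptions made up for the example, not values from any library.

```python
from typing import Optional

# Hypothetical weights: an AlternateID match counts for more than a name match.
WEIGHTS = {
    "alternate_id": 0.5,
    "last_name": 0.2,
    "first_name": 0.1,
    "date_of_birth": 0.2,
}


def match_score(incoming: dict, candidate: dict) -> Optional[float]:
    """Return a 0..1 confidence, ignoring fields that are missing on either side."""
    earned = 0.0
    available = 0.0
    for field, weight in WEIGHTS.items():
        a, b = incoming.get(field), candidate.get(field)
        if not a or not b:
            continue  # a missing field neither helps nor hurts the score
        available += weight
        if str(a).strip().lower() == str(b).strip().lower():
            earned += weight
    if available == 0:
        return None  # nothing could be compared at all
    return earned / available
```

With these example weights, a matching SSN with a different first name scores 0.875, while matching names with a different SSN score 0.375. Because the score is normalized by the fields that were present, a record that matches on last name alone would score 1.0, so in practice you would probably also require a minimum amount of available evidence before auto-matching.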

There are a number of ways to go about this, but having done this type of thing before, I will say up front that you run a real risk of "incorrect" matches between people.
Your input data is very sparse, and what you have is not very unique if some of the values are missing.
For example, in your First Name, Last Name, DOB situation, if you have all three parts for ALL records, the matching gets a LOT easier. If not, you expose yourself to a lot of potential problems.
One approach, on the more "crude" side of things, is to create a process using a series of queries that identifies and classifies matching entries.
For example, first check for an exact match on name and SSN; if it is there, flag it, note it as 100%, and move on to the next set. You can then explicitly define where you are fuzzy, so you know the potential ramifications of each kind of match.
In the end you would have a list with flags indicating the match type, if any, for each record.
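As a rough illustration of that flagging idea, here is a sketch in Python rather than SQL. The tier names and the field names are hypothetical; each `if` stands in for one of the queries, ordered from strictest to loosest.

```python
def classify_match(incoming: dict, candidate: dict) -> str:
    """Return a flag naming the strictest rule that the pair satisfies."""

    def same(field: str) -> bool:
        a, b = incoming.get(field), candidate.get(field)
        # A field only counts when it is present on both sides.
        return bool(a) and bool(b) and a.strip().lower() == b.strip().lower()

    name = same("first_name") and same("last_name")
    if name and same("alternate_id"):
        return "exact_name_and_id"  # the "100%" tier
    if same("alternate_id"):
        return "id_only"
    if name and same("date_of_birth"):
        return "name_and_dob"
    if name:
        return "name_only"  # weak evidence, probably needs manual review
    return "no_match"
```

Because every tier is explicit, you always know which rule produced a given match and can decide per tier whether to auto-accept it or send it to a person.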

This is a problem called record linkage.
While it's for a Python library, the documentation for dedupe gives a good overview of how to approach the problem comprehensively.

Take a look at the Levenshtein algorithm, which gives you 'the distance between two strings.' Dividing that distance by the length of the string (and subtracting the result from 1) gives you a percentage match.
http://en.wikipedia.org/wiki/Levenshtein_distance
I have previously implemented this to great success. It was a provider portal for a healthcare company, and providers registered themselves on the site. The matching was to take their portal registration and find the corresponding record in the main healthcare system. The processors who attended to this were presented with the most likely matches, ordered by percentage descending, and could easily choose the right account.
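For reference, a minimal Python sketch of the distance and a percentage derived from it. Dividing by the length of the longer string is an assumption on my part (the answer does not say which length to use), and the function names are made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every character differs."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `similarity("Jon", "John")` is 0.75, since one edit is needed and the longer string has four characters. Candidates can then be sorted by this value, descending, before they are shown to a reviewer.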

If the false positives don't bug you and your data is primarily English, you can try algorithms like Soundex. SQL Server has it as a built-in function. Soundex isn't the best, but it does do fuzzy matching and is popular. Another alternative is Metaphone.
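To show what Soundex actually computes, here is a small Python sketch of the common American Soundex rules (first letter plus three digits). It is meant as an illustration only; SQL Server's built-in function may treat some edge cases differently.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")  # vowels have no code and reset `previous`
        if code and code != previous:
            result += code
        previous = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")
```

Here `soundex("Jon")` and `soundex("John")` both give `J500`, and `Robert` and `Rupert` both give `R163`, which shows both the usefulness and the false positives. It does nothing for nicknames: `Bob` is `B100`, so Bob/Robert still needs a lookup table.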

Related

String likeness algorithms

I have two strings (they're going to be descriptions in a simple database eventually), let's say they're
String A: "Apple orange coconut lime jimmy buffet"
String B: "Car bicycle skateboard"
What I'm looking for is this. I want a function that will have the input "cocnut", and have the output be "String A"
We could have differences in capitalization, and the spelling won't always be spot on. The goal is a 'quick and dirty' search if you will.
Are there any .NET (or third-party) or otherwise recommended 'likeness algorithms' for strings, so I could check that the input has a 'pretty close fragment' and return it? My database is going to have like 50 entries, tops.

What you’re searching for is known as the edit distance between two strings. There exist plenty of implementations – here’s one from Stack Overflow itself.
Since you’re searching for only part of a string what you want is a locally optimal match rather than a global match as computed by this method.
This is known as the local alignment problem and once again it’s easily solvable by an almost identical algorithm – the only thing that changes is the initialisation (we don’t penalise whatever comes before the search string) and the selection of the optimum value (we don’t penalise whatever comes after the search string).
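A minimal Python sketch of that modification, using the strings from the question. The function names are made up for the example; the two commented lines are the only differences from an ordinary edit distance.

```python
def fragment_distance(needle: str, haystack: str) -> int:
    """Smallest edit distance between needle and any substring of haystack."""
    needle, haystack = needle.lower(), haystack.lower()
    # Initialisation: row 0 is all zeros, so skipping text before the match is free.
    previous = [0] * (len(haystack) + 1)
    for i, nc in enumerate(needle, start=1):
        current = [i]
        for j, hc in enumerate(haystack, start=1):
            cost = 0 if nc == hc else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    # Selection: the minimum over the last row, so text after the match is free.
    return min(previous)


def closest(query: str, entries: dict) -> str:
    """Return the key of the entry containing the best approximate fragment."""
    return min(entries, key=lambda key: fragment_distance(query, entries[key]))
```

With `entries = {"String A": "Apple orange coconut lime jimmy buffet", "String B": "Car bicycle skateboard"}`, `closest("cocnut", entries)` returns `"String A"`, because "cocnut" is one edit away from the fragment "coconut". With about 50 entries, scanning all of them on every search is fine.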

service or strategy to detect if users enter fake names?

I'm looking for services or strategies to detect when names entered in forms are spammy, for example: asdasdasd, ksfhaiodsfh, wpoeiruopwieru, zcpoiqwqwea. Crazy keyboard input.
I tried Akismet, but it is not specifically meant for names (http://kemayo.wordpress.com/2005/12/02/akismet-py/).
Thanks in advance.

One strategy is to have a blacklist of weird names and/or a whitelist of normal names, and use them to reject or accept names. But such lists can be difficult to find.

You could look for unusual character combinations, like many consecutive vowels or consonants, and watch your registrations to build a list of recurring patterns (like asd) in false names.
I would refrain from automatically blocking those inputs and would rather mark them for examination.
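A rough Python sketch of such a check. The pattern list and the run-length threshold are arbitrary assumptions for the example; you would build the real list from your own registrations.

```python
VOWELS = set("aeiouy")
# Hypothetical list of recurring keyboard patterns seen in fake names.
KEYBOARD_FRAGMENTS = ("asd", "qwe", "zxc")


def longest_run(text: str, predicate) -> int:
    """Length of the longest run of consecutive characters matching predicate."""
    best = current = 0
    for ch in text:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def review_reasons(name: str, max_run: int = 3) -> list:
    """Return the reasons a name should be reviewed (empty list: looks fine)."""
    lowered = name.lower()
    reasons = []
    if longest_run(lowered, lambda c: c in VOWELS) > max_run:
        reasons.append("long vowel run")
    if longest_run(lowered, lambda c: c.isalpha() and c not in VOWELS) > max_run:
        reasons.append("long consonant run")
    if any(fragment in lowered for fragment in KEYBOARD_FRAGMENTS):
        reasons.append("keyboard pattern")
    return reasons
```

This flags "asdasdasd" and "ksfhaiodsfh", but it also flags a real name like "Schmidt" (four consonants in a row) and misses "wpoeiruopwieru", which is exactly why the result should only mark a registration for examination.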

Ask for a real email address and send the connection details there. Then get the information from the account.
No method is really safe anyway.

If speed isn't an issue, download a list of the top 100k most common names, throw them in an O(1) lookup data structure, see if the input is there, and if not, you could always compare the input to the entries using a string similarity algorithm.
Although if you do, you will probably want to bucket by starting letter to prevent having to perform that calculation on the entire list.
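Something along these lines, sketched in Python with the standard library's `difflib` for the similarity step. The tiny name list and the 0.8 cutoff are placeholders; a real list would be loaded from a file.

```python
import difflib
from collections import defaultdict


def build_index(names):
    """Return (set for O(1) exact lookups, buckets keyed by first letter)."""
    exact = set()
    buckets = defaultdict(list)
    for name in names:
        key = name.strip().lower()
        if key:
            exact.add(key)
            buckets[key[0]].append(key)
    return exact, buckets


def check_name(candidate, exact, buckets, cutoff=0.8):
    """Classify a name as 'known', 'similar' (with suggestions) or 'unknown'."""
    key = candidate.strip().lower()
    if not key:
        return "unknown", []
    if key in exact:
        return "known", [key]
    # Only compare against names that start with the same letter.
    close = difflib.get_close_matches(key, buckets.get(key[0], []), n=3, cutoff=cutoff)
    return ("similar", close) if close else ("unknown", [])
```

With `exact, buckets = build_index(["Robert", "John", "Jonathan", "Thomas"])`, the input "Jonathon" comes back as similar to "jonathan", while "asdasdasd" is unknown. The trade-off of bucketing is that a typo in the first letter is never matched.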

WP7 - Dynamic information request to a server based on information entered by the user

This time, I come here just to get some opinions and points of view.
I have an 'autocomplete' component that gets the city names of my country from my server. For each city name typed into this component, it should go to my server and get some info.
How am I doing it currently?
For each letter typed into this component, it requests a list of cities that start with this letter.
Obviously, that is not a good way to do it, because each request based on just one letter gives me very similar lists.
Can you think of a better way to do it?
What is a better way that does not make unnecessary requests?

You could either preload all the city names locally (a country with 10,000 cities having an average name length of 11 bytes [10 single-byte characters + NUL] would require not much more than 110KB of space, depending on the method of storage [possibly something closer to 200KB?], so if you're okay with a [quite possibly very] small delay when loading the page and aren't worrying much about phone data limits, I'd suggest this), or you could have the city names be cached on the local machine, so while unique key combinations will result in server fetches, a repeated key combination in a later component will not.
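The caching option could look roughly like the sketch below (Python here, but the idea carries over to a phone client unchanged). `fetch` is a hypothetical callable standing in for the server request, and the code assumes the server returns the complete list of cities for a prefix, so that longer prefixes can be filtered locally.

```python
class PrefixCache:
    """Caches city lists per typed prefix and filters longer prefixes locally."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical: prefix -> list of city names
        self._cache = {}

    def suggest(self, typed: str) -> list:
        prefix = typed.strip().lower()
        if not prefix:
            return []
        # Try the full prefix first, then ever shorter ones already in the cache.
        for length in range(len(prefix), 0, -1):
            cached = self._cache.get(prefix[:length])
            if cached is not None:
                return [city for city in cached if city.lower().startswith(prefix)]
        cities = self._fetch(prefix)  # only reached for an unseen prefix
        self._cache[prefix] = cities
        return cities
```

After the user types "S", typing "Sa" or "San" is answered from the cached "s" list without another request. If the server truncates its results, the local filtering would be incomplete and each prefix would need its own fetch.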
I'm not really experienced with this aspect of programming, though, so I'm probably not the best person to give this sort of advice.

Category Matching - regex vs full text search

I have a fairly large category table with 1500 categories (some singular words others containing multiple) in it and I'm looking for the best way to match new products to these categories by their title.
I've been looking at using regex and looping through the product description for keywords, but this wouldn't be very efficient when trying to add over one thousand products at a time. I've also been looking at full text search (FREETEXT and CONTAINS), but FREETEXT search seems to bring back a lot of results, as it matches any and all words in a product description.
Has anyone done something similar in terms of trying to automate which category a product is by its description and can offer some advice or pointers?

So the question as I understand it is, given a description tell me what category this description is applicable to?
A common method to do this kind of work is to build a Naive Bayesian Classification process, and put all of your descriptions through this.
Classification like this usually takes place in two stages.
stage 1 : known description/category pairs are used to "train" the classifier.
stage 2 : Once the classifier is trained, you can then give it unknown data, and it would then return a probability that the description would match a given category.
The classifier in this approach is usually pretty accurate, but given that we are dealing with statistics, errors usually do creep in.
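To make the two stages concrete, here is a bare-bones multinomial Naive Bayes sketch in plain Python with add-one smoothing. The class and its training examples are invented for illustration; a real system would more likely use an existing library and better tokenization.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of examples
        self.vocabulary = set()

    def train(self, text: str, category: str) -> None:
        """Stage 1: learn from a known description/category pair."""
        words = text.lower().split()
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocabulary.update(words)

    def probabilities(self, text: str) -> dict:
        """Stage 2: return a probability per category for an unseen description."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            # Add-one smoothing; the extra 1 reserves a slot for unseen words.
            denominator = sum(counts.values()) + len(self.vocabulary) + 1
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            log_scores[category] = score
        if not log_scores:
            return {}
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}
```

After training on a few clothing and electronics titles, `probabilities("blue cotton shirt")` puts most of the weight on the clothing category. A low top probability is a reasonable signal to route the product to a person instead of categorizing it automatically.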

Approximate string matching

I know this question has been asked many times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably the suffix at the end of the company name and the abbreviated (short) names.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max

There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
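In case the linked implementations are unavailable, here is a compact Python sketch of the usual Jaro-Winkler definition (prefix bonus capped at four characters, scaling factor 0.1). Treat it as an illustration to port to C#, not as a tuned implementation.

```python
def jaro(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are within this many positions of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, scaling: float = 0.1) -> float:
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    # Strings that share a prefix get a bonus toward 1.0.
    return score + prefix * scaling * (1 - score)
```

The comparison is case-sensitive, so lowercase both names first. The classic example pair "MARTHA" and "MARHTA" scores about 0.961.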
If you want to do a bit more sophisticated matching, you can try to do some custom normalization of word forms commonly occurring in company names such as ltd/limited, inc/incorporated, corp/corporation to account for case insensitivity, abbreviations etc. This way if you compute
distance (normalize("foo corp."),
normalize("FOO CORPORATION") )
you should get the result to be 0 rather than 14 (which is what you would get if you computed the case-sensitive Levenshtein edit distance on the raw strings).
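A possible `normalize` along those lines, sketched in Python. The abbreviation table only holds the three pairs mentioned above and would need extending for real data.

```python
import re

# Abbreviation -> canonical word form; extend as needed for your data.
ABBREVIATIONS = {
    "ltd": "limited",
    "inc": "incorporated",
    "corp": "corporation",
}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)
```

Both `normalize("foo corp.")` and `normalize("FOO CORPORATION")` produce `"foo corporation"`, so any distance between them is 0. Splitting on punctuation means "W.E.S. Engineering" becomes "w e s engineering", which still differs from "wes engineering"; that case needs either a distance metric or a rule that joins single letters.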

Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.

In these simple examples, just removing all non-alphanumeric characters gives you a match. It is also the easiest to do, as you can pre-compute the data on each side and then do a straight equality match, which will be a lot faster than comparing every pair and calculating the edit distance.
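A small Python sketch of that pre-computation. The function names are invented for the example; in a database the key would simply be a stored, indexed column.

```python
from collections import defaultdict


def match_key(name: str) -> str:
    """Lowercase the name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_key_index(names):
    """Pre-compute keys once so that later lookups are plain equality checks."""
    index = defaultdict(list)
    for name in names:
        index[match_key(name)].append(name)
    return index
```

With the index built over the existing names, `index.get(match_key("companyA pty. ltd."))` finds "companyA pty ltd", and "WES Engineering" finds "W.E.S. Engineering". A bare "companyA" has a different key, so the missing-suffix case still needs suffix stripping or a fuzzy fallback.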

I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on a really large-scale system with name matching requirements similar to the ones you describe.
Name matching is not very straightforward, and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex and other phonetics-based algorithms, etc. A simple web search will give you all the details.
You can implement all of them in C#.
The irony is that they work when you try to match two given input strings. That is fine in theory and for demonstrating how fuzzy or approximate string matching works.
However, the grossly understated point is how to use them in production settings. Not everybody I know who went scouting for an approximate string matching algorithm knew how to apply it in a production environment.
I may have talked only about Lucene there, which is specific to Java, but there is also Lucene for .NET.
https://lucenenet.apache.org/
