I have two strings (they're going to be descriptions in a simple database eventually), let's say they're
String A: "Apple orange coconut lime jimmy buffet"
String B: "Car
bicycle skateboard"
What I'm looking for is this: I want a function that takes the input "cocnut" and returns "String A".
We could have differences in capitalization, and the spelling won't always be spot on. The goal is a 'quick and dirty' search if you will.
Are there any .NET (or third-party) libraries, or recommended 'likeness algorithms' for strings, that would let me check whether the input is a 'pretty close' fragment of an entry and return that entry? My database is going to have like 50 entries, tops.
What you’re searching for is known as the edit distance between two strings. There exist plenty of implementations – here’s one from Stack Overflow itself.
Since you’re searching for only part of a string what you want is a locally optimal match rather than a global match as computed by this method.
This is known as the local alignment problem and once again it’s easily solvable by an almost identical algorithm – the only thing that changes is the initialisation (we don’t penalise whatever comes before the search string) and the selection of the optimum value (we don’t penalise whatever comes after the search string).
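Since the question asks for .NET, here is a minimal C# sketch of that local variant, under the assumptions just described: the first DP row is initialised to zeros (a match may begin anywhere in the entry) and the answer is the minimum of the last row (a match may end anywhere). FragmentDistance is my own name, not a library call.

```csharp
using System;
using System.Linq;

static class FuzzySearch
{
    // Edit distance from `needle` to the closest fragment of `haystack`.
    // Standard Levenshtein DP, except we don't penalise whatever comes
    // before or after the matching fragment.
    public static int FragmentDistance(string needle, string haystack)
    {
        needle = needle.ToLowerInvariant();      // quick-and-dirty case folding
        haystack = haystack.ToLowerInvariant();

        var prev = new int[haystack.Length + 1]; // row 0: all zeros, so a
        var curr = new int[haystack.Length + 1]; // match may start anywhere

        for (int i = 1; i <= needle.Length; i++)
        {
            curr[0] = i; // deleting the first i needle characters costs i
            for (int j = 1; j <= haystack.Length; j++)
            {
                int subst = prev[j - 1] + (needle[i - 1] == haystack[j - 1] ? 0 : 1);
                curr[j] = Math.Min(subst, Math.Min(prev[j] + 1, curr[j - 1] + 1));
            }
            (prev, curr) = (curr, prev);
        }
        return prev.Min(); // minimum over the final row: match may end anywhere
    }
}
```

With ~50 entries you can simply compute FragmentDistance("cocnut", entry) for every entry and return the one with the smallest distance; "cocnut" is one edit away from the "coconut" fragment of String A, so String A wins.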
Good afternoon,
I'm hoping I can get an assist on this from someone: if not some example code, then some general direction I should be going with this.
Essentially I have two large lists (roughly 10,000-20,000 records each) of string terms and IDs. These lists are coming from two different data providers. The lists are obviously related to one another topically; however, each data provider has slight variations in its term naming conventions. For example, list1 would have a term "The Term (Some Subcategory)" and list2 would have "the term - some subcategory". Additionally, list1 could have "The Term (Some Subcategory)" and "The Term (Some Subcategory 2)" while list2 only has "the term - some subcategory".
Both lists have the following properties: "term" and "id". What I need to do is compare every term in both lists and, if a reasonable match is found, generate a new list containing "term", "list1id", and "list2id" properties. If no match is found for a term, I need it added to the list as well, with either "list1id" or "list2id" null/blank (which will indicate the origin of the unmatched term).
I'm willing to use a NuGet package to accomplish this, or if anyone has a good example of what I need, that would be helpful too. Essentially I'm attempting to generate a new merged list based on fuzzy term matches, while retaining the IDs of the matched terms somehow.
My research has dug up some similar articles and source code, such as https://matthewgladney.com/blog/data-science/using-levenshtein-distance-in-csharp-to-associate-lists-of-terms/ and https://github.com/wolfgarbe/symspell, but neither seems to fit what I need.
Where do I go from here with this? Any help would be awesome!
Nugs
Your question is pretty broad, but I will attempt a broad answer to, at least, get you started. I've done this sort of thing before.
Do it in two stages: first normalize, then match. By doing this you eliminate known but irrelevant causes of differences. To normalize, for example, make everything uppercase, remove whitespace, remove non-alphanumeric characters, etc. You'll need to be a little creative and work within any constraints you might have (is "Amy (pony)" the same thing as "Amy pony"?). Then calculate the distance.
Create a class with a few properties to contain the value from the left list, the value from the right list, the normalized values, the score, etc.
When you get a match, create an instance of that class, add it to a list or equivalent, remove the old values from the original lists, then keep going.
Try to write your code so you keep track of intermediate values (e.g. the normalized values, the scores, etc.). This will make it easier to debug and will allow you to log everything after you've finished processing.
Once you're done, you can then throw away intermediate values and keep just the things you identified as a match.
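To make that concrete, here is a rough C# sketch of the whole pipeline under the assumptions above; the normalization rules and the maxDistance cutoff are illustrative, and you will want to tune both against your real data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// One result row, keeping the intermediate values for debugging/logging.
class TermMatch
{
    public string Term;       // the term as it appeared in list1 (or list2 if unmatched)
    public string List1Id;    // null if the term only exists in list2
    public string List2Id;    // null if the term only exists in list1
    public string Normalized; // intermediate value, kept for logging
    public int Distance;      // edit distance of the winning pair
}

static class FuzzyMerge
{
    // Stage 1: uppercase and strip everything that isn't a letter or digit,
    // so "The Term (Some Subcategory)" and "the term - some subcategory"
    // normalize to the same string.
    static string Normalize(string s) =>
        Regex.Replace(s.ToUpperInvariant(), "[^A-Z0-9]", "");

    // Stage 2: for each list1 entry, take the closest remaining list2 entry
    // within maxDistance edits, then remove it so it can't match twice.
    public static List<TermMatch> Merge(
        List<(string Term, string Id)> list1,
        List<(string Term, string Id)> list2,
        int maxDistance = 3)
    {
        var remaining = list2.ToList();
        var results = new List<TermMatch>();

        foreach (var (term, id) in list1)
        {
            string norm = Normalize(term);
            var best = remaining
                .Select(r => (Entry: r, Dist: Levenshtein(norm, Normalize(r.Term))))
                .OrderBy(x => x.Dist)
                .FirstOrDefault();

            if (best.Entry.Term != null && best.Dist <= maxDistance)
            {
                results.Add(new TermMatch { Term = term, List1Id = id,
                    List2Id = best.Entry.Id, Normalized = norm, Distance = best.Dist });
                remaining.Remove(best.Entry); // consume the matched entry
            }
            else
            {
                results.Add(new TermMatch { Term = term, List1Id = id, Normalized = norm });
            }
        }

        // Whatever is left in list2 never matched anything in list1.
        results.AddRange(remaining.Select(r => new TermMatch
            { Term = r.Term, List2Id = r.Id, Normalized = Normalize(r.Term) }));

        return results;
    }

    // Standard dynamic-programming Levenshtein edit distance.
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }
}
```

Note that the nested comparison is O(n²), which at 10-20,000 records a side may be slow; matching exact normalized values first (a dictionary lookup) and only falling back to the fuzzy pass for the leftovers will shrink the workload considerably.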
I came across this question in an interview:
We have to find the number of occurrences of two given words in a text file with <= n words between them.
Example 1:
text: `this is first string this is second string`
keywords: `this, string`
n = 4
output = 2
"this is first string" is the first occurrence; the number of words between *this* and *string* is 2 (is, first), which is less than 4.
"this is second string" is the remaining string; the number of words between *this* and *string* is 2 (is, second), which is less than 4.
Therefore the answer is 2.
I have thought that I will use `Dictionary<string, List<int>>`.
My idea was to use the dictionary to get the list of positions where each particular word occurs, then iterate through both lists, increment the count whenever the condition is met, and finally display the count.
Is my thinking process correct? Please provide any suggestions to improve my solution.
Thanks,
Not an answer per se (as, quite honestly, I don't understand the question :P), but to add some general interview advice to the other answers:
In interviews, the interviewer is always looking for your thought process and evidence that you are a critical, logical thinker, not necessarily that you have excellent coding recall and can compile code in your brain.
In addition, interviews are a stressful process. By slowing down and talking out loud as you work things out, you not only come across as a better communicator and logical thinker (even if you get the question wrong), you also give yourself time to think.
Use a pen and paper, speak as you think, start from the top and work through it. I've gotten jobs even when I didn't know the answers to tech questions, by demonstrating that I can at least try to work things out ;-)
In short, it's not just down to technical prowess.
I think it depends on whether the call is done only once or multiple times per string. If it's something like
int getOccurrences(String str, String reference, int min_size) { ... }
then you don't really need the dictionary, not even a list. You can just iterate through the string to find occurrences of the words and then check the number of separators between them.
If on the other hand the problem is for arbitrary search/indexing, IMHO you do need a dictionary. I'd go for a dictionary where the key is the word and the value is a list of indexes where it occurs.
HTH
If you need to do that repeatedly for different pairs of words in the same text, then a word dictionary with a list of indexes is a good solution. However, if you were only looking for one pair, then two lists of indexes for those two words would be sufficient.
The lists allow you to separate the word detection operation from the counting logic.
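For illustration, here is a sketch of that counting step in C#, assuming the two sorted lists of word positions (word offsets, not character offsets) have already been collected; the pairing rule (each occurrence of the first word is paired with the nearest following occurrence of the second word) reproduces the answer of 2 from the example above:

```csharp
using System;
using System.Collections.Generic;

static class PairCounter
{
    // firstPos and secondPos are the sorted word positions of the two
    // keywords. For "this is first string this is second string":
    // firstPos = [0, 4] ("this"), secondPos = [3, 7] ("string").
    public static int CountPairs(List<int> firstPos, List<int> secondPos, int n)
    {
        int count = 0, j = 0;
        foreach (int i in firstPos)
        {
            // skip occurrences of the second word that come before this one
            while (j < secondPos.Count && secondPos[j] < i) j++;
            if (j == secondPos.Count) break;

            int wordsBetween = secondPos[j] - i - 1; // e.g. "is first" -> 2
            if (wordsBetween <= n) { count++; j++; }  // consume this pair
        }
        return count;
    }
}
```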
I know this question has been asked a lot of times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenges are probably the variations in company name endings and the abbreviated/short name forms.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try some custom normalization of word forms commonly occurring in company names, such as ltd/limited, inc/incorporated, corp/corporation, to account for case insensitivity, abbreviations, etc. This way, if you compute
distance(normalize("foo corp."), normalize("FOO CORPORATION"))
you should get the result to be 0 rather than 14 (which is what you would get if you computed the Levenshtein edit distance on the raw strings).
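Here is a rough sketch of what such a normalizer could look like in C#; the abbreviation table is purely illustrative and would need to grow to fit your data:

```csharp
using System.Text.RegularExpressions;

static class CompanyNames
{
    // Suffix abbreviations to expand after lowercasing and stripping
    // punctuation. Extend this table for whatever your data contains.
    static readonly (string Pattern, string Replacement)[] Abbreviations =
    {
        (@"\bcorp\b", "corporation"),
        (@"\binc\b",  "incorporated"),
        (@"\bltd\b",  "limited"),
        (@"\bpty\b",  "proprietary"),
    };

    public static string Normalize(string name)
    {
        string s = name.ToLowerInvariant();
        s = Regex.Replace(s, @"[^a-z0-9\s]", " ");   // "foo corp." -> "foo corp "
        foreach (var (pattern, replacement) in Abbreviations)
            s = Regex.Replace(s, pattern, replacement);
        return Regex.Replace(s, @"\s+", " ").Trim(); // collapse runs of spaces
    }
}
```

With this, Normalize("foo corp.") and Normalize("FOO CORPORATION") both come out as "foo corporation", so whatever distance metric you run afterwards sees them as identical.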
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest approach: you can pre-compute the normalized value on each side, then do a straight equality comparison, which will be a lot faster than cross-multiplying the lists and calculating edit distances.
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on a really large-scale system with name-matching requirements similar to the ones you have described.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple Google search will turn up all the details.
You can implement all of them in C#.
The irony is that they work when you are matching two given input strings. That is fine in theory, and for demonstrating the way fuzzy or approximate string matching works.
However, the grossly understated point is how to use them in a production setting. Not everybody I know of who was scouting for an approximate string matching algorithm knew how to solve the problem in a production environment.
My linked answer talks about Lucene, which is specific to Java, but there is a Lucene port for .NET also.
https://lucenenet.apache.org/
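To give a flavor of what that looks like, here is a minimal sketch against Lucene.NET 4.8; I'm writing the API from memory, so treat the exact calls as an approximation, and the "id"/"name" field names are made up for the example:

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Build an in-memory index of names, then run a fuzzy query against it.
var dir = new RAMDirectory();
var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
using (var writer = new IndexWriter(dir,
           new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
{
    writer.AddDocument(new Document
    {
        new StringField("id", "42", Field.Store.YES),
        new TextField("name", "John Smith", Field.Store.YES)
    });
}

using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);

// FuzzyQuery tolerates small edit distances, so "smth" still finds "Smith".
var hits = searcher.Search(new FuzzyQuery(new Term("name", "smth")), 10);
foreach (var scoreDoc in hits.ScoreDocs)
    Console.WriteLine(searcher.Doc(scoreDoc.Doc).Get("name"));
```

The point of the index is that you pay the analysis cost once at write time, and each lookup afterwards is fast, which is what makes the approach viable in production rather than only in a two-string demo.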
Let's say I have a database filled with people with the following data elements:
PersonID (meaningless surrogate autonumber)
FirstName
MiddleInitial
LastName
NameSuffix
DateOfBirth
AlternateID (like an SSN, Military ID, etc.)
I get lots of data feeds in from all kinds of formats with every reasonable variation on these pieces of information you could think of. Some examples are:
FullName, DOB
FullName, Last 4 SSN
First, Last, DOB
When this data comes in, I need to write something to match it up. I don't need, or expect, to get more than an 80% match rate. After the automated match, I'll present the uncertain matches on a web page for someone to manually match.
Some of the complexities are:
Some data matches are better than others, and I would like to assign weight to those. For example, if the SSN matches exactly but the name is off because someone goes by their middle name, I would like to assign a much higher confidence value to that match than if the names match exactly but the SSNs are off.
The name matching has some difficulties. John Doe Jr is the same as John Doe II, but not the same as John Doe Sr., and if I get John Doe and no other information, I need to be sure the system doesn't pick one because there's no way to determine who to pick.
First name matching is really hard. You have Bob/Robert, John/Jon/Jonathon, Tom/Thomas, etc.
Just because I have a feed with FullName+DOB doesn't mean the DOB field is filled for every record. I don't want to miss a linkage just because the unmatched DOB kills the matching score. If a field is missing, I want to exclude it from the elements available for matching.
If someone manually matches, I want their match to affect all future matches. So, if we ever get the same exact data again, there's no reason not to automatically match it up next time.
I've seen that SSIS has fuzzy matching, but we don't use SSIS currently, and I find it pretty kludgy and nearly impossible to version control so it's not my first choice of a tool. But if it's the best there is, tell me. Otherwise, are there any (preferably free, preferably .NET or T-SQL based) tools/libraries/utilities/techniques out there that you've used for this type of problem?
There are a number of ways that you can go about this, but having done this type of thing before, I will say up front that you run a lot of risk of making "incorrect" matches between people.
Your input data is very sparse, and given what you have, it isn't very unique when not all values are present.
For example, with your First Name, Last Name, DOB situation: if you have all three parts for ALL records, then the matching gets a LOT easier to work with. If not, though, you expose yourself to a lot of potential issues.
One approach you might take, on the more "crude" side of things, is to simply create a process using a series of queries that identifies and classifies matching entries.
For example, first check for an exact match on name and SSN; if that is there, flag it, note it as 100%, and move on to the next set. Then you can explicitly define where you are fuzzy, so you know the potential ramifications of your matching.
In the end you would have a list with flags indicating the match type, if any, for each record.
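As a sketch of that tiered classification in C# (the Person shape and the rules are illustrative, not a real schema):

```csharp
using System;

record Person(string FullName, string Ssn);

enum MatchType { ExactNameAndSsn, ExactSsn, ExactName, None }

static class Classifier
{
    // Strictest rule first; each tier tells you exactly how fuzzy you were.
    public static MatchType Classify(Person incoming, Person candidate)
    {
        bool ssn = incoming.Ssn != null && incoming.Ssn == candidate.Ssn;
        bool name = string.Equals(incoming.FullName, candidate.FullName,
                                  StringComparison.OrdinalIgnoreCase);

        if (ssn && name) return MatchType.ExactNameAndSsn; // flag as 100%
        if (ssn)         return MatchType.ExactSsn;        // strong, verify the name
        if (name)        return MatchType.ExactName;       // weak, queue for review
        return MatchType.None;
    }
}
```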
This is a problem called record linkage.
While it's for a python library, the documentation for dedupe gives a good overview of how to approach the problem comprehensively.
Take a look at the Levenshtein algorithm, which gives you the 'distance' between two strings; dividing that distance by the length of the longer string gives you a percentage match.
http://en.wikipedia.org/wiki/Levenshtein_distance
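A minimal sketch of that percentage calculation in C#, dividing by the longer string's length so that identical strings score 100 and disjoint strings score near 0:

```csharp
using System;

static class Similarity
{
    public static double PercentMatch(string a, string b)
    {
        int maxLen = Math.Max(a.Length, b.Length);
        return maxLen == 0 ? 100.0
             : (1.0 - (double)Levenshtein(a, b) / maxLen) * 100.0;
    }

    // Standard Levenshtein distance, computed one row at a time.
    static int Levenshtein(string a, string b)
    {
        var row = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) row[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            int diag = row[0]; row[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int tmp = row[j];
                row[j] = Math.Min(Math.Min(row[j] + 1, row[j - 1] + 1),
                                  diag + (a[i - 1] == b[j - 1] ? 0 : 1));
                diag = tmp;
            }
        }
        return row[b.Length];
    }
}
```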
I have previously implemented this to great success. It was a provider portal for a healthcare company, and providers registered themselves on the site. The matching was to take their portal registration and find the corresponding record in the main healthcare system. The processors who attended to this were presented with the most likely matches, ordered by percentage descending, and could easily choose the right account.
If the false positives don't bug you and your data is primarily English, you can try algorithms like Soundex. SQL Server has it as a built-in function. Soundex isn't the best, but it does do fuzzy matching and is popular. Another alternative is Metaphone.
For the project that I'm currently on, I have to deliver specially formatted strings to a 3rd party service for processing. And so I'm building up the strings like so:
string someString = string.Format("{0}{1}{2}: Some message. Some percentage: {3}%", token1, token2, token3, number);
Rather than hardcode the string, I was thinking of moving it into the project resources:
string someString = string.Format(Properties.Resources.SomeString, token1, token2, token3, number);
The second option is, in my opinion, not as readable as the first one, i.e. the person reading the code would have to pull up the string resources to work out what the final result should look like.
How do I get around this? Is the hardcoded format string a necessary evil in this case?
I do think this is a necessary evil, and one I've used frequently. Something smelly that I do is:
// "{0}{1}{2}: Some message. Some percentage: {3}%"
string someString = string.Format(Properties.Resources.SomeString
,token1, token2, token3, number);
...at least until the code is stable enough that I might be embarrassed to have it seen by others.
There are several reasons that you would want to do this, but the only great reason is if you are going to localize your application into another language.
If you are using resource strings there are a couple of things to keep in mind.
Include format strings whenever possible in the set of resource strings you want localized. This will allow the translator to reorder the position of the formatted items to make them fit better in the context of the translated text.
Avoid having strings in your format tokens that are in your language; it is better to use the tokens for numbers. For instance, the message:
"The value you specified must be between {0} and {1}"
is great if {0} and {1} are numbers like 5 and 10. If you are formatting in strings like "five" and "ten", this is going to make localization difficult.
You can get around the readability problem you are talking about simply by naming your resources well.
string someString = string.Format(Properties.Resources.IntegerRangeError, minValue, maxValue);
Evaluate whether you are generating user-visible strings at the right abstraction level in your code. In general, I tend to group all the user-visible strings in the code as close to the user interface as possible. If some low-level file I/O code needs to report errors, it should do so with exceptions, which you handle in your application with consistent error messages. This will also consolidate all of your strings that require localization instead of having them peppered throughout your code.
One thing you can do to help move hard-coded strings into a resource file, or even to speed up adding strings to one, is to use CodeRush Xpress, which you can download for free here: http://www.devexpress.com/Products/Visual_Studio_Add-in/CodeRushX/
Once you write your string, you can access the CodeRush menu and extract it to a resource file in a single step. Very nice.
Resharper has similar functionality.
I don't see why including the format string in the program is a bad thing. Unlike traditional undocumented magic numbers, it is quite obvious what it does at first glance. Of course, if you are using the format string in multiple places it should definitely be stored in an appropriate read-only variable to avoid redundancy.
I agree that keeping it in the resources is unnecessary indirection here. A possible exception would be if your program needs to be localized, and you are localizing through resource files.
Yes, you can. Let's see how:
String.Format(Resource_en.PhoneNumberForEmployeeAlreadyExist, letterForm.EmployeeName[i])
This will give me a dynamic message every time.
By the way, I'm using ResXManager.