Category Matching - regex vs full text search - c#

I have a fairly large category table with 1,500 categories in it (some are single words, others contain several), and I'm looking for the best way to match new products to these categories by their title.
I've been looking at using regex and looping through the product description for keywords, but this wouldn't be very efficient when trying to add over a thousand products at a time. I've also been looking at full-text search (FREETEXT and CONTAINS), but FREETEXT seems to bring back a lot of results, as it matches any and all words in a product description.
Has anyone done something similar, trying to automate which category a product belongs in based on its description, and can offer some advice or pointers?

So the question as I understand it is: given a description, tell me what category this description is applicable to.
A common method to do this kind of work is to build a Naive Bayesian Classification process, and put all of your descriptions through this.
Classification like this usually takes place in two stages.
Stage 1: known description/category pairs are used to "train" the classifier.
Stage 2: once the classifier is trained, you can give it unknown data, and it will return a probability that the description matches a given category.
The classifier in this approach is usually pretty accurate, but since we are dealing with statistics, errors do creep in.
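For illustration, here is a minimal sketch of that two-stage flow, hand-rolled rather than taken from any particular library; the class and method names (NaiveBayesClassifier, Train, Predict) are made up:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical multinomial Naive Bayes over word counts, with Laplace smoothing.
class NaiveBayesClassifier
{
    private readonly Dictionary<string, int> _docsPerCategory = new Dictionary<string, int>();
    private readonly Dictionary<string, Dictionary<string, int>> _wordCounts = new Dictionary<string, Dictionary<string, int>>();
    private readonly Dictionary<string, int> _wordsPerCategory = new Dictionary<string, int>();
    private readonly HashSet<string> _vocabulary = new HashSet<string>();
    private int _totalDocs;

    // Stage 1: "train" the classifier on known description/category pairs.
    public void Train(string description, string category)
    {
        _totalDocs++;
        _docsPerCategory[category] = _docsPerCategory.GetValueOrDefault(category) + 1;
        if (!_wordCounts.ContainsKey(category))
            _wordCounts[category] = new Dictionary<string, int>();

        foreach (var word in Tokenize(description))
        {
            _vocabulary.Add(word);
            _wordCounts[category][word] = _wordCounts[category].GetValueOrDefault(word) + 1;
            _wordsPerCategory[category] = _wordsPerCategory.GetValueOrDefault(category) + 1;
        }
    }

    // Stage 2: give it unknown data; returns categories ordered by (log) probability.
    public IEnumerable<(string Category, double LogScore)> Predict(string description)
    {
        var words = Tokenize(description).ToList();
        return _docsPerCategory.Keys
            .Select(cat =>
            {
                // Prior: how common the category is overall.
                double score = Math.Log((double)_docsPerCategory[cat] / _totalDocs);
                foreach (var word in words)
                {
                    int count = _wordCounts[cat].GetValueOrDefault(word);
                    // +1 (Laplace) smoothing so unseen words don't zero out a category.
                    score += Math.Log((count + 1.0) /
                                      (_wordsPerCategory.GetValueOrDefault(cat) + _vocabulary.Count));
                }
                return (Category: cat, LogScore: score);
            })
            .OrderByDescending(x => x.LogScore);
    }

    private static IEnumerable<string> Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', ',', '.', '-', '(', ')' }, StringSplitOptions.RemoveEmptyEntries);
}

Training is then just a loop over your existing, already-categorized product descriptions, and classifying a new product is a call like classifier.Predict(product.Title), taking the top-scoring category (or flagging it for review if the score is low).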

Related

Merging 2 lists using Levenshtein Distance on terms in list

Good afternoon,
I'm hoping I can get an assist on this from someone; if not some example code, then some general direction I should be going in with this.
Essentially I have two large lists (roughly 10-20,000 records each) of string terms and IDs. These lists are coming from two different data providers. The lists are obviously related to one another topically, however each data provider has slight variations in its term naming conventions. For example, list1 would have the term "The Term (Some Subcategory)" and list2 would have "the term - some subcategory". Additionally, list1 could have "The Term (Some Subcategory)" and "The Term (Some Subcategory 2)" while list2 only has "the term - some subcategory".
Both lists have the following properties: "term" and "id". What I need to do is compare every term in both lists and, if a reasonable match is found, generate a new list containing "term", "list1id", and "list2id" properties. If no match is found for a term, I need it to be added to the list as well with either "list1id" or "list2id" null/blank (which will indicate the origin of the unmatched term).
I'm willing to use a NuGet package to accomplish this, or if anyone has a good example of what I need, that would be helpful too. Essentially I'm attempting to generate a new merged list based on fuzzy term matches while somehow retaining the IDs of the matched terms.
My research has dug up some similar articles and source code such as https://matthewgladney.com/blog/data-science/using-levenshtein-distance-in-csharp-to-associate-lists-of-terms/ and https://github.com/wolfgarbe/symspell but neither seems to fit what I need.
Where do I go from here with this? Any help would be awesome!
Nugs
Your question is pretty broad, but I will attempt a broad answer to, at least, get you started. I've done this sort of thing before.
Do it in two stages: first normalize, then match. By doing this you eliminate known but irrelevant causes of differences. To normalize, for example, make everything upper case, remove whitespace, remove non-alphanumeric characters, etc. You'll need to be a little creative and work within any constraints you might have (is "Amy (pony)" the same thing as "Amy pony"?). Then calculate the distance.
Create a class with a few properties to contain the value from the left list, the value from the right list, the normalized values, the score, etc.
When you get a match, create an instance of that class, add it to a list or equivalent, remove the old values from the original lists, then keep going.
Try to write your code so you keep track of intermediate values (e.g. the normalized values, etc). This will make it easier to debug, and will allow you to log everything after you've done processing.
Once you're done, you can then throw away intermediate values and keep just the things you identified as a match.
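As a rough sketch of that pipeline under some assumptions: the TermRecord/MergedTerm types are made up to fit the question's term/id shape, the normalization is the caps/strip-punctuation kind described above, and the second, fuzzy pass (e.g. Levenshtein distance against the unclaimed leftovers) is only indicated by a comment.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

record TermRecord(string Term, string Id);                       // what each provider gives you
record MergedTerm(string Term, string List1Id, string List2Id);  // the output shape

static class TermMerger
{
    // Normalize: upper-case and strip everything that isn't a letter or digit, so
    // "The Term (Some Subcategory)" and "the term - some subcategory" collapse to the same key.
    public static string Normalize(string term) =>
        Regex.Replace(term.ToUpperInvariant(), "[^A-Z0-9]", "");

    public static List<MergedTerm> Merge(List<TermRecord> list1, List<TermRecord> list2)
    {
        // Index list2 by normalized term for O(1) lookups (first record wins on duplicate keys).
        var list2ByKey = new Dictionary<string, TermRecord>();
        foreach (var r in list2)
            list2ByKey.TryAdd(Normalize(r.Term), r);

        var results = new List<MergedTerm>();
        var claimed = new HashSet<string>();   // list2 ids that have already been matched

        foreach (var left in list1)
        {
            if (list2ByKey.TryGetValue(Normalize(left.Term), out var right))
            {
                results.Add(new MergedTerm(left.Term, left.Id, right.Id));
                claimed.Add(right.Id);
            }
            else
            {
                // No exact match after normalization: this is where a fuzzy pass
                // (edit distance against the unclaimed leftovers, below some
                // threshold) would go before giving up.
                results.Add(new MergedTerm(left.Term, left.Id, null));
            }
        }

        // Anything in list2 that was never claimed has no list1 counterpart.
        results.AddRange(list2.Where(r => !claimed.Contains(r.Id))
                              .Select(r => new MergedTerm(r.Term, null, r.Id)));
        return results;
    }
}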

General String Filter / Search

I'm looking for some bright ideas around filtering and string searches, or at least pointers to some good articles on the subject. I have a C# WP7 app that has a list of items, and it searches against another list. But I find that the filter isn't very good. Let's say I have Francisco in one list; it will find San Francisco Bay Bridge in the other list. One idea is to require an exact match, but then you miss things like "The Bay", which won't be matched to "Bay".
I guess I'm looking for best practices to make the filtering engine smarter; right now it pretty much just displays an item if it gets any match anywhere in the list. Below is the basic code I'm using, a very simple find on a name. I just want some ideas to make it more "smart".
subList.Select(sub => mainList.Where(main => main.AllItems.IndexOf(sub.Name, StringComparison.OrdinalIgnoreCase) >= 0))
Another popular approach is similarity in terms of the insertion, deletion, or substitution operations needed to go from one string (the search term) to the other (matches in the data store): the Levenshtein distance.
You can compute a score using the Levenshtein (or Edit) distance and then take the top N in terms of minimal changes needed below a certain threshold.
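A small self-contained sketch of that approach; the FuzzySearch name and the threshold values are just illustrative:

using System;
using System.Collections.Generic;
using System.Linq;

static class FuzzySearch
{
    // Classic dynamic-programming Levenshtein (edit) distance, case-insensitive.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = char.ToLowerInvariant(a[i - 1]) == char.ToLowerInvariant(b[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,      // deletion
                                            d[i, j - 1] + 1),     // insertion
                                   d[i - 1, j - 1] + cost);       // substitution
            }
        }
        return d[a.Length, b.Length];
    }

    // Score every candidate and keep the top N whose distance falls under a threshold.
    public static IEnumerable<string> TopMatches(string searchTerm, IEnumerable<string> candidates,
                                                 int maxDistance = 3, int topN = 10) =>
        candidates
            .Select(c => (Candidate: c, Distance: Levenshtein(searchTerm, c)))
            .Where(x => x.Distance <= maxDistance)
            .OrderBy(x => x.Distance)
            .Take(topN)
            .Select(x => x.Candidate);
}

For a short query against long strings (the "Francisco" vs "San Francisco Bay Bridge" case), you would typically compare the query against the individual words of each candidate rather than the whole string.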

Approximate string matching

I know this question has been asked a lot of times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably the company name suffix and abbreviated/short forms of the name.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try some custom normalization of word forms commonly occurring in company names, such as ltd/limited, inc/incorporated, corp/corporation, to account for case insensitivity, abbreviations, etc. This way, if you compute distance(normalize("foo corp."), normalize("FOO CORPORATION")), you should get the result to be 0 rather than 14 (which is what you would get if you computed the Levenshtein edit distance on the raw strings).
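A rough sketch of what that normalization might look like; the abbreviation table is only an illustrative starting point, not an exhaustive list:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class CompanyNames
{
    // Canonical forms for common company-name suffixes (illustrative; extend as needed).
    private static readonly Dictionary<string, string> Suffixes = new Dictionary<string, string>
    {
        ["limited"] = "ltd",
        ["incorporated"] = "inc",
        ["corporation"] = "corp",
        ["proprietary"] = "pty",
    };

    public static string Normalize(string name)
    {
        // Lower-case and strip punctuation so "pty. ltd." == "pty ltd" and "W.E.S." == "WES",
        // then replace long suffix forms with their canonical abbreviations.
        var cleaned = Regex.Replace(name.ToLowerInvariant(), @"[^a-z0-9\s]", "");
        var words = cleaned.Split(' ', StringSplitOptions.RemoveEmptyEntries)
                           .Select(w => Suffixes.TryGetValue(w, out var canon) ? canon : w);
        return string.Join(" ", words);
    }
}

// Normalize("foo corp.") and Normalize("FOO CORPORATION") both yield "foo corp",
// so their edit distance is 0.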
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest approach, since you can pre-compute the normalized data on each side and then do a straight equals match, which will be a lot faster than cross-multiplying and calculating edit distances.
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on a really large-scale system with name matching requirements similar to the ones you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple Google search will give you all the details.
You can implement all of them in C#.
The irony is, they work when you try to match two given input strings. That is fine in theory and for demonstrating how fuzzy or approximate string matching works.
However, the grossly understated point is how to use them in a production setting. Not everybody I know of who was scouting for an approximate string matching algorithm knew how to solve the same problem in a production environment.
In that answer I may have talked mostly about Lucene, which is specific to Java, but there is Lucene for .NET also.
https://lucenenet.apache.org/
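For what it's worth, here is a minimal sketch of indexing names and running a fuzzy (edit-distance based) query with Lucene.NET. This is written from memory against the 4.8 API, so treat the exact types and overloads as assumptions to verify against the current documentation:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var luceneVersion = LuceneVersion.LUCENE_48;
var dir = new RAMDirectory();
var analyzer = new StandardAnalyzer(luceneVersion);

// Index the company names once.
using (var writer = new IndexWriter(dir, new IndexWriterConfig(luceneVersion, analyzer)))
{
    foreach (var name in new[] { "companyA pty ltd", "WES Engineering" })
    {
        var doc = new Document();
        doc.Add(new TextField("name", name, Field.Store.YES));
        writer.AddDocument(doc);
    }
}

// FuzzyQuery matches terms within a small edit distance of the query term.
// (It works per term; multi-word names would need one fuzzy clause per word.)
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var hits = searcher.Search(new FuzzyQuery(new Term("name", "companya")), 10);

foreach (var hit in hits.ScoreDocs)
    Console.WriteLine(searcher.Doc(hit.Doc).Get("name"));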

Algorithm for text classification

I have millions of short (up to 30 words) documents which I need to split into several known categories. It's possible that a document matches several of the categories (seldom, but possible). It's also possible that a document doesn't match any of the categories (also seldom). I also have millions of documents which have already been categorized. What algorithm should I use to do the job? I don't need it to be fast. I need to be sure that the algorithm categorizes correctly (as far as possible).
What algorithm should I use? Is there an implementation of it in C#?
Thank you for your help!
Take a look at term frequency and inverse document frequency (tf-idf), and also cosine similarity, to find important words; you can then create categories and assign documents to categories based on similarity.
EDIT:
Found an example here
Interesting articles :
A self-organizing semantic map for information retrieval
WEBSOM - self-organizing maps of document collections
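A bare-bones, hand-rolled sketch of the tf-idf and cosine-similarity idea (illustrative only; with millions of documents you would use a proper library and compare new documents against per-category centroid vectors rather than against every document):

using System;
using System.Collections.Generic;
using System.Linq;

static class TfIdf
{
    static IEnumerable<string> Tokenize(string doc) =>
        doc.ToLowerInvariant().Split(new[] { ' ', ',', '.', ';', ':' },
                                     StringSplitOptions.RemoveEmptyEntries);

    // Build a sparse tf-idf vector for each document in the collection.
    public static List<Dictionary<string, double>> Vectorize(IList<string> docs)
    {
        var tokenized = docs.Select(d => Tokenize(d).ToList()).ToList();

        // Document frequency of each term across the collection.
        var df = new Dictionary<string, int>();
        foreach (var terms in tokenized)
            foreach (var t in terms.Distinct())
                df[t] = df.GetValueOrDefault(t) + 1;

        return tokenized.Select(terms =>
        {
            var tf = terms.GroupBy(t => t)
                          .ToDictionary(g => g.Key, g => (double)g.Count() / terms.Count);
            return tf.ToDictionary(kv => kv.Key,
                                   kv => kv.Value * Math.Log((double)docs.Count / df[kv.Key]));
        }).ToList();
    }

    // Cosine similarity between two sparse vectors (1.0 = identical direction).
    public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Where(kv => b.ContainsKey(kv.Key)).Sum(kv => kv.Value * b[kv.Key]);
        double normA = Math.Sqrt(a.Values.Sum(v => v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => v * v));
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }
}

To categorize, vectorize the already-labelled documents together with the new one, average each category's vectors into a centroid, and assign the new document to the category (or categories) whose centroid similarity exceeds a threshold, leaving it unassigned when nothing clears the bar.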
The major issue IMHO here is the length of the documents. I think I would call it phrase classification, and there is ongoing work on this because of Twitter. You could bring in additional text by performing a web search on the 30 words and then analyzing the top matches. There is a paper about this but I can't find it right now. Then I would try a feature-vector approach (tf-idf as in Jimmy's answer) and a multiclass SVM for classification.
Perhaps a decision tree combined with a NN?
You can use an SVM algorithm to classify text in C# with the libsvm.net library.

Fuzzy data matching for personal demographic information

Let's say I have a database filled with people with the following data elements:
PersonID (meaningless surrogate autonumber)
FirstName
MiddleInitial
LastName
NameSuffix
DateOfBirth
AlternateID (like an SSN, Military ID, etc.)
I get lots of data feeds in from all kinds of formats with every reasonable variation on these pieces of information you could think of. Some examples are:
FullName, DOB
FullName, Last 4 SSN
First, Last, DOB
When this data comes in, I need to write something to match it up. I don't need, or expect, to get more than an 80% match rate. After the automated match, I'll present the uncertain matches on a web page for someone to manually match.
Some of the complexities are:
Some data matches are better than others, and I would like to assign weight to those. For example, if the SSN matches exactly but the name is off because someone goes by their middle name, I would like to assign a much higher confidence value to that match than if the names match exactly but the SSNs are off.
The name matching has some difficulties. John Doe Jr is the same as John Doe II, but not the same as John Doe Sr., and if I get John Doe and no other information, I need to be sure the system doesn't pick one because there's no way to determine who to pick.
First name matching is really hard. You have Bob/Robert, John/Jon/Jonathon, Tom/Thomas, etc.
Just because I have a feed with FullName+DOB doesn't mean the DOB field is filled for every record. I don't want to miss a linkage just because the unmatched DOB kills the matching score. If a field is missing, I want to exclude it from the elements available for matching.
If someone manually matches, I want their match to affect all future matches. So, if we ever get the same exact data again, there's no reason not to automatically match it up next time.
I've seen that SSIS has fuzzy matching, but we don't use SSIS currently, and I find it pretty kludgy and nearly impossible to version control so it's not my first choice of a tool. But if it's the best there is, tell me. Otherwise, are there any (preferably free, preferably .NET or T-SQL based) tools/libraries/utilities/techniques out there that you've used for this type of problem?
There are a number of ways that you can go about this, but having done this type of thing before, I will go ahead and point out that you run a lot of risk of "incorrect" matches between people.
Your input data is very sparse, and what you have isn't very unique if not all values are present.
For example, with your First Name, Last Name, DOB situation, if you have all three parts for ALL records, then the matching gets a LOT easier to work with. If not, though, you expose yourself to a lot of potential issues.
One approach you might take, on the more "crude" side of things, is to create a process using a series of queries that identifies and classifies matching entries.
For example, first check for an exact match on name and SSN; if that is there, flag it, note it as 100%, and move on to the next set. Then you can explicitly define where you are fuzzy, so you know the potential ramifications of your matching.
In the end you would have a list with flags indicating the match type, if any for that record.
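A rough shape of that tiered approach in code; the Person/Incoming/MatchResult types and the specific tiers and confidence numbers are purely illustrative, and the fuzzier tiers would call whatever name normalization and distance functions you settle on:

using System;
using System.Collections.Generic;
using System.Linq;

record Person(int PersonId, string FirstName, string LastName, DateTime? DateOfBirth, string Ssn);
record Incoming(string FirstName, string LastName, DateTime? DateOfBirth, string Ssn);
record MatchResult(Incoming Record, Person Match, string Tier, int Confidence);

static class TieredMatcher
{
    public static MatchResult Classify(Incoming rec, IEnumerable<Person> people)
    {
        // Tier 1: exact name + SSN -> treat as certain.
        var exact = people.FirstOrDefault(p =>
            Same(p.FirstName, rec.FirstName) && Same(p.LastName, rec.LastName) &&
            rec.Ssn != null && p.Ssn == rec.Ssn);
        if (exact != null) return new MatchResult(rec, exact, "Name+SSN", 100);

        // Tier 2: exact SSN only -> high confidence; the name may differ (nicknames, middle names).
        var bySsn = rec.Ssn == null ? null : people.FirstOrDefault(p => p.Ssn == rec.Ssn);
        if (bySsn != null) return new MatchResult(rec, bySsn, "SSN only", 85);

        // Tier 3: exact name + DOB -> medium confidence, explicitly "fuzzy" territory.
        var byNameDob = rec.DateOfBirth == null ? null : people.FirstOrDefault(p =>
            Same(p.FirstName, rec.FirstName) && Same(p.LastName, rec.LastName) &&
            p.DateOfBirth == rec.DateOfBirth);
        if (byNameDob != null) return new MatchResult(rec, byNameDob, "Name+DOB", 70);

        // No automated match: send to the manual-review queue.
        return new MatchResult(rec, null, "Unmatched", 0);
    }

    static bool Same(string a, string b) =>
        string.Equals(a?.Trim(), b?.Trim(), StringComparison.OrdinalIgnoreCase);
}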
This is a problem called record linkage.
While it's for a python library, the documentation for dedupe gives a good overview of how to approach the problem comprehensively.
Take a look at the Levenshtein algorithm, which allows you to get 'the distance between two strings'; you can then divide that distance by the length of the string to get a percentage match.
http://en.wikipedia.org/wiki/Levenshtein_distance
I have previously implemented this to great success. It was a provider portal for a healthcare company, and providers registered themselves on the site. The matching was to take their portal registration and find the corresponding record in the main healthcare system. The processors who attended to this were presented with the most likely matches, ordered by percentage descending, and could easily choose the right account.
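For example, the distance-to-percentage step might look something like this (assuming the edit distance itself has already been computed by whatever Levenshtein implementation you use):

using System;

static class Similarity
{
    // Convert an edit distance into a 0-100% similarity score,
    // scaling by the length of the longer string.
    public static double Percent(string a, string b, int editDistance)
    {
        int maxLen = Math.Max(a.Length, b.Length);
        return maxLen == 0 ? 100.0 : (1.0 - (double)editDistance / maxLen) * 100.0;
    }
}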
If the false positives don't bug you and your data is primarily English, you can try algorithms like Soundex. SQL Server has it as a built-in function. Soundex isn't the best, but it does do fuzzy matching and is popular. Another alternative is Metaphone.
