I'm looking for some bright ideas around filtering and string searches, or at least pointers to some good articles on the subject. I have a C# WP7 app that has a list of items and searches it against another list, but I find that the filter isn't very good. Say I have "Francisco" in one list; the filter will match "San Francisco Bay Bridge" in the other. One idea is to require an exact match, but then you miss cases like "The Bay", which should still match "Bay".
I guess I'm looking for best practices to make the filtering engine smarter. Right now it pretty much displays an item if there is any match anywhere in the list. Below is the basic code I'm using, a very simple find on a name. I just want some ideas to make it "smarter".
subList.Select(subItem => mainList.Where(mainItem => mainItem.AllItems.IndexOf(subItem.Name, StringComparison.OrdinalIgnoreCase) >= 0))
Another popular approach is similarity in terms of the insertion, deletion, or substitution operations necessary to go from one string (the search term) to the other (a candidate match in the data store): the Levenshtein distance.
You can compute a score using the Levenshtein (or edit) distance, keep only the candidates below a certain threshold, and take the top N in terms of the fewest changes needed.
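For instance, a minimal sketch of the two-row dynamic-programming formulation (`query`, `mainList`, and the threshold of 3 are assumptions based on your snippet, not a definitive implementation):

```csharp
using System;
using System.Linq;

static class Fuzzy
{
    // Two-row Levenshtein distance: the number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }
}
```

Scoring and thresholding then looks something like:

```csharp
var matches = mainList
    .Select(m => new { Item = m, Score = Fuzzy.Levenshtein(query, m.Name) })
    .Where(x => x.Score <= 3)   // assumed threshold; tune for your data
    .OrderBy(x => x.Score)      // fewest edits first
    .Take(5);                   // top N
```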
Good afternoon,
I'm hoping I can get an assist on this from someone: if not some example code, then some general direction I should be going with this.
Essentially I have two large lists (roughly 10-20,000 records each) of string terms and IDs. These lists come from two different data providers. The lists are obviously related to one another topically, however each data provider has slight variations in its naming conventions. For example, list1 would have a term "The Term (Some Subcategory)" and list2 would have "the term - some subcategory". Additionally, list1 could have "The Term (Some Subcategory)" and "The Term (Some Subcategory 2)" while list2 only has "the term - some subcategory".
Both lists have the following properties: "term" and "id". What I need to do is compare every term in both lists and, if a reasonable match is found, generate a new list containing "term", "list1id", and "list2id" properties. If no match is found for a term, I need it also to be added to the list with either "list1id" or "list2id" null/blank (which will indicate the origin of the unmatched term).
I'm willing to use a NuGet package to accomplish this, or if anyone has a good example of what I need that would be helpful too. Essentially I'm attempting to generate a new merged list based on fuzzy term matches within each, while retaining the IDs of the matched terms somehow.
My research has dug up some similar articles and source code, such as https://matthewgladney.com/blog/data-science/using-levenshtein-distance-in-csharp-to-associate-lists-of-terms/ and https://github.com/wolfgarbe/symspell, but neither seems to fit what I need.
Where do I go from here with this? Any help would be awesome!
Nugs
Your question is pretty broad, but I will attempt a broad answer to at least get you started. I've done this sort of thing before.
Do it in two stages: first normalize, then match. This eliminates known but irrelevant causes of differences. To normalize, for example, convert everything to upper case, remove whitespace, remove non-alphanumeric characters, etc. You'll need to be a little creative and work within any constraints you might have (is "Amy (pony)" the same thing as "Amy pony"?). Then calculate the distance.
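A minimal normalizer along those lines might look like this (the exact rules, such as which characters to drop and how to treat parentheses, are assumptions you'll need to adapt to your data):

```csharp
using System.Linq;

static string Normalize(string term)
{
    // upper-case, then keep only letters and digits, which drops
    // whitespace, punctuation, and parentheses in one pass
    var keep = term.ToUpperInvariant().Where(char.IsLetterOrDigit);
    return new string(keep.ToArray());
}
```

With this, "The Term (Some Subcategory)" and "the term - some subcategory" both normalize to "THETERMSOMESUBCATEGORY", so the distance calculation only has to absorb genuine wording differences.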
Create a class with a few properties to contain the value from the left list, the value from the right list, the normalized values, the score, etc.
When you get a match, create an instance of that class, add it to a list or equivalent, remove the old values from the original lists, then keep going.
Try to write your code so you keep track of intermediate values (e.g. the normalized values, the scores). This will make it easier to debug, and will allow you to log everything after you've finished processing.
Once you're done, you can then throw away intermediate values and keep just the things you identified as a match.
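Putting those pieces together, a sketch of the whole thing might look like the following. The `TermMatch` shape, the `MaxScore` threshold, the `Term`/`Id` property names, and the `Normalize` and `Levenshtein` helpers (see the sketches above) are all assumptions to adapt:

```csharp
using System.Collections.Generic;
using System.Linq;

class TermMatch
{
    public string Term { get; set; }
    public string List1Id { get; set; }     // null when the term only exists in list2
    public string List2Id { get; set; }     // null when the term only exists in list1
    public string Normalized1 { get; set; } // intermediates kept for debugging/logging
    public string Normalized2 { get; set; }
    public int Score { get; set; }
}

// ... inside your matching routine:
const int MaxScore = 3; // assumed threshold; tune it against your real data

var results = new List<TermMatch>();
var remaining2 = list2.ToList();

foreach (var t1 in list1)
{
    string n1 = Normalize(t1.Term);
    var best = remaining2
        .Select(t2 => new { t2, n2 = Normalize(t2.Term) })
        .Select(x => new { x.t2, x.n2, Score = Fuzzy.Levenshtein(n1, x.n2) })
        .OrderBy(x => x.Score)
        .FirstOrDefault(x => x.Score <= MaxScore);

    if (best != null)
    {
        results.Add(new TermMatch
        {
            Term = t1.Term, List1Id = t1.Id, List2Id = best.t2.Id,
            Normalized1 = n1, Normalized2 = best.n2, Score = best.Score
        });
        remaining2.Remove(best.t2); // consume the match so it can't pair twice
    }
    else
    {
        results.Add(new TermMatch { Term = t1.Term, List1Id = t1.Id, Normalized1 = n1 });
    }
}

// anything left in list2 never matched a term from list1
foreach (var t2 in remaining2)
    results.Add(new TermMatch { Term = t2.Term, List2Id = t2.Id, Normalized2 = Normalize(t2.Term) });
```

Note that with 10-20,000 records per side this is an O(n*m) scan, so in practice you would first try exact matches on the normalized forms (a dictionary lookup) and only fall back to the distance loop for the leftovers.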
I have a fairly large category table with 1500 categories in it (some are single words, others contain multiple words), and I'm looking for the best way to match new products to these categories by their title.
I've been looking at using regex and looping through the product description for key words, but this wouldn't be very efficient when trying to add over a thousand products at a time. I've also been looking at full-text search (FREETEXT and CONTAINS), but FREETEXT seems to bring back a lot of results, since it matches any and all words in a product description.
Has anyone done something similar in terms of trying to automate which category a product is by its description and can offer some advice or pointers?
So the question as I understand it is: given a description, tell me which category this description applies to?
A common method to do this kind of work is to build a Naive Bayesian Classification process, and put all of your descriptions through this.
Classification like this usually takes place in two stages.
Stage 1: known description/category pairs are used to "train" the classifier.
Stage 2: once the classifier is trained, you can give it unknown data, and it returns a probability that the description matches a given category.
The classifier in this approach is usually pretty accurate, but given that we are dealing with statistics, errors do creep in.
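To make the two stages concrete, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing. The class shape and the whitespace tokenization are illustrative assumptions; for production you would likely reach for an existing library:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayesClassifier
{
    private readonly Dictionary<string, Dictionary<string, int>> _wordCounts =
        new Dictionary<string, Dictionary<string, int>>();
    private readonly Dictionary<string, int> _docCounts = new Dictionary<string, int>();
    private readonly HashSet<string> _vocabulary = new HashSet<string>();
    private int _totalDocs;

    // Stage 1: train on known description/category pairs.
    public void Train(string description, string category)
    {
        _totalDocs++;
        _docCounts[category] = _docCounts.TryGetValue(category, out var d) ? d + 1 : 1;
        if (!_wordCounts.TryGetValue(category, out var counts))
            _wordCounts[category] = counts = new Dictionary<string, int>();
        foreach (var word in Tokenize(description))
        {
            _vocabulary.Add(word);
            counts[word] = counts.TryGetValue(word, out var c) ? c + 1 : 1;
        }
    }

    // Stage 2: return the category with the highest log-posterior.
    public string Classify(string description) =>
        _docCounts.Keys.OrderByDescending(cat => LogPosterior(description, cat)).First();

    private double LogPosterior(string description, string category)
    {
        var counts = _wordCounts[category];
        double total = counts.Values.Sum();
        double logP = Math.Log((double)_docCounts[category] / _totalDocs); // class prior
        foreach (var word in Tokenize(description))
        {
            counts.TryGetValue(word, out var c);
            logP += Math.Log((c + 1.0) / (total + _vocabulary.Count)); // Laplace smoothing
        }
        return logP;
    }

    private static IEnumerable<string> Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ', ',', '.', '-' },
                                      StringSplitOptions.RemoveEmptyEntries);
}
```

Train it on your existing product/category pairs, then call `Classify` on each new title. Keeping the raw posteriors instead of just the winner also lets you flag low-confidence assignments for manual review.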
For an assignment for school I have to make a solver for the Rush Hour game. If you aren't familiar with Rush Hour, check this link: http://www.puzzles.com/products/rushhour.htm
For this solver I have to use the A* search algorithm. I looked around on the internet a bit, and I think I understand how the algorithm works, but I don't really have an idea how to implement it in the solver, nor how I should build up the grid for the cars. Can someone please give me some tips/help for this?
Not a complete solution..
To represent the grid of cars, I'd just use a rectangular array of cells where each cell is marked with an integer -- 0 indicates "empty", and each car has a particular number, so the different cars in the grid will manifest themselves as consecutive cells with the same number.
At this point, you should be able to write a function to return all the possible "moves" from a given grid, where a "move" is a transition from one grid state to another grid state -- you probably don't need to encode a better representation of a move than that.
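A sketch of that representation and move generation (assumes the standard 6x6 board; treat it as a starting point rather than tested solver code):

```csharp
using System.Collections.Generic;

static class RushHour
{
    // Returns every grid reachable by sliding one car one step.
    // 0 = empty cell; each car is a run of consecutive equal non-zero ids.
    public static IEnumerable<int[,]> Moves(int[,] grid)
    {
        int n = grid.GetLength(0);
        var cars = new HashSet<int>();
        foreach (int id in grid) if (id != 0) cars.Add(id);

        foreach (int id in cars)
        {
            // collect this car's cells in row-major order
            var cells = new List<(int r, int c)>();
            for (int r = 0; r < n; r++)
                for (int c = 0; c < n; c++)
                    if (grid[r, c] == id) cells.Add((r, c));

            bool horizontal = cells[0].r == cells[cells.Count - 1].r;

            foreach (int step in new[] { -1, +1 }) // slide backward / forward
            {
                var (hr, hc) = step == -1 ? cells[0] : cells[cells.Count - 1]; // leading cell
                int nr = hr + (horizontal ? 0 : step);
                int nc = hc + (horizontal ? step : 0);
                if (nr < 0 || nr >= n || nc < 0 || nc >= n || grid[nr, nc] != 0)
                    continue; // off the board or blocked

                var next = (int[,])grid.Clone();
                var (tr, tc) = step == -1 ? cells[cells.Count - 1] : cells[0]; // vacated tail
                next[tr, tc] = 0;
                next[nr, nc] = id;
                yield return next;
            }
        }
    }
}
```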
To implement A*, you'll need a naive heuristic for figuring out how good a move looks, so you know which moves to try first. I would suggest initially that any move which either moves the target car closer to the goal or makes space nearer the front of the target car might be a better candidate move. Like Will A said in the comments, unless you're solving a 1000x1000 Rush Hour board, this probably isn't a big deal.
That's all the tricky parts I can think of.
As mquander or Will have already pointed out, the A* algorithm might be a bit of an overkill for your problem.
I'll just give you some hints about other algorithms you can use to solve the problem.
I don't want to explain how those algorithms work, since you can find many good descriptions on the internet. However, if you have a question, don't hesitate to ask me.
You can use algorithms from the family of "uninformed search": for example breadth-first search, depth-first search, uniform cost search, depth-limited search, or iterative deepening search. If you use breadth-first search or uniform cost search, you may run into memory problems, since those algorithms have exponential space complexity (you have to keep the whole frontier in memory). Depth-first search (space complexity O(b*m)) is more memory-friendly, since the parts of the tree you visited first can be discarded if they don't contain the solution. Depth-limited search and iterative deepening search are almost the same, except that in iterative deepening search you increase the depth limit of your tree iteratively.
If you compare time complexity (b = branching factor of the tree, m = maximum depth of the tree, l = depth limit, d = depth of the shallowest solution):
breadth-first: O(b^(d+1))
uniform cost: O(b^(1 + ⌊C*/ε⌋)), where C* is the cost of the optimal solution and ε the smallest step cost
depth-first: O(b^m)
depth-limited: O(b^l) (if l ≥ d)
iterative deepening: O(b^d)
So as you can see, iterative deepening and breadth-first search perform quite well. The problem with depth-limited search is that if your solution is located deeper than your depth limit, you will not find it.
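For illustration, a generic iterative deepening sketch (the state type, successor function, and goal test are supplied by the caller; nothing here is Rush-Hour-specific):

```csharp
using System;
using System.Collections.Generic;

static class Search
{
    // Returns the path from start to a goal state, or null if none is
    // found within maxDepth. Runs depth-limited DFS with limits 0..maxDepth.
    public static List<TState> IterativeDeepening<TState>(
        TState start,
        Func<TState, IEnumerable<TState>> successors,
        Func<TState, bool> isGoal,
        int maxDepth)
    {
        for (int limit = 0; limit <= maxDepth; limit++)
        {
            var path = new List<TState> { start };
            if (DepthLimited(start, successors, isGoal, limit, path))
                return path;
        }
        return null;
    }

    private static bool DepthLimited<TState>(
        TState state,
        Func<TState, IEnumerable<TState>> successors,
        Func<TState, bool> isGoal,
        int limit,
        List<TState> path)
    {
        if (isGoal(state)) return true;
        if (limit == 0) return false;
        foreach (var next in successors(state))
        {
            path.Add(next);
            if (DepthLimited(next, successors, isGoal, limit - 1, path)) return true;
            path.RemoveAt(path.Count - 1); // backtrack
        }
        return false;
    }
}
```

For Rush Hour you would pass a move generator (like the one sketched in the answer above) as `successors` and a test that the target car has reached the exit as `isGoal`.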
Then you have the so-called "informed search", such as best-first search, greedy search, A*, hill climbing, or simulated annealing. In best-first search, you use an evaluation function for each node as an estimate of "desirability". The goal of greedy search is to expand the node which appears to bring you closest to the goal. Hill climbing and simulated annealing are very similar. Stuart Russell describes hill climbing as follows (which I like a lot): it is "like climbing Everest in thick fog with amnesia". It is simply a loop that continually moves in the direction of increasing value, i.e. you just "walk" in the direction which increases your evaluation function.
I would use one of the uninformed search algorithms, since they are very easy to implement (you just need to build the tree and traverse it correctly). Informed search usually performs better if you have a good evaluation function...
Hope that helps you...
I know this question has been asked many times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably handling company name suffixes and abbreviated forms.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit distance, where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (they have a .NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try some custom normalization of word forms commonly occurring in company names, such as ltd/limited, inc/incorporated, corp/corporation, to account for case insensitivity, abbreviations, etc. This way, if you compute
distance(normalize("foo corp."), normalize("FOO CORPORATION"))
you should get a result of 0 rather than 14 (which is what you would get if you computed the raw Levenshtein edit distance).
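A minimal version of that normalization (the suffix table is an assumption; extend it with whatever abbreviations occur in your data):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CompanyNames
{
    private static readonly Dictionary<string, string> Suffixes =
        new Dictionary<string, string>
        {
            { "LIMITED", "LTD" }, { "INCORPORATED", "INC" },
            { "CORPORATION", "CORP" }, { "PROPRIETARY", "PTY" },
        };

    public static string Normalize(string name)
    {
        // upper-case, split on punctuation and whitespace, then map
        // long suffix forms to their canonical abbreviations
        var words = name.ToUpperInvariant()
            .Split(new[] { ' ', '.', ',', '-' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Suffixes.TryGetValue(w, out var canon) ? canon : w);
        return string.Join(" ", words);
    }
}
// CompanyNames.Normalize("foo corp.") and CompanyNames.Normalize("FOO CORPORATION")
// both yield "FOO CORP", so their edit distance is 0.
```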
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest approach: you can pre-compute the normalized form on each side, then do a straight equality match, which will be a lot faster than cross-multiplying the lists and calculating edit distances.
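A sketch of that pre-computation (the `Name` property and list names are assumptions; duplicate normalized keys would also need handling, e.g. via GroupBy):

```csharp
using System;
using System.Linq;

static string Key(string name) =>
    new string(name.Where(char.IsLetterOrDigit).ToArray()).ToUpperInvariant();

// build the index once; each lookup is then O(1) instead of an
// edit-distance computation per candidate pair
var index = listB.ToDictionary(b => Key(b.Name));
foreach (var a in listA)
    if (index.TryGetValue(Key(a.Name), out var match))
        Console.WriteLine(a.Name + " -> " + match.Name);
```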
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on a really large-scale system with name matching requirements similar to those you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, there are many. A few of them: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple googling will give you all the details.
You can implement all of them in C#.
The irony is that they work when you are matching two given input strings: fine in theory, and good for demonstrating how fuzzy or approximate string matching works.
However, the grossly understated point is how to use the same thing in a production setting. Not everybody I know who was scouting for an approximate string matching algorithm knew how to solve that in a production environment.
I talked about Lucene, which is specific to Java, but there is Lucene for .NET also:
https://lucenenet.apache.org/
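As a sketch of what that looks like, written against my recollection of the Lucene.Net 4.8 API (verify the details against the current docs before relying on it):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion Version = LuceneVersion.LUCENE_48;

var dir = new RAMDirectory();
using (var writer = new IndexWriter(dir,
           new IndexWriterConfig(Version, new StandardAnalyzer(Version))))
{
    // index each company name once
    writer.AddDocument(new Document { new TextField("name", "CompanyA Pty Ltd", Field.Store.YES) });
    writer.AddDocument(new Document { new TextField("name", "WES Engineering", Field.Store.YES) });
}

using (var reader = DirectoryReader.Open(dir))
{
    var searcher = new IndexSearcher(reader);
    // FuzzyQuery matches indexed terms within a small edit distance
    var hits = searcher.Search(new FuzzyQuery(new Term("name", "companya")), 10);
    foreach (var hit in hits.ScoreDocs)
        Console.WriteLine(searcher.Doc(hit.Doc).Get("name"));
}
```

The point is that the index does the candidate generation for you, so you never compare every string against every other string.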
I have millions of short (up to 30 words) documents which I need to split into several known categories. It's possible that a document matches several of the categories (seldom, but possible). It's also possible that a document doesn't match any of the categories (also seldom). I also have millions of documents which have already been categorized. I don't need this to be fast, but I need to be sure that the algorithm categorizes correctly (as far as possible).
What algorithm should I use? Is there an implementation of it in C#?
Thank you for your help!
Take a look at term frequency and inverse document frequency (tf-idf), as well as cosine similarity, to find important words; you can then create categories and assign documents to categories based on similarity.
EDIT:
Found an example here
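The core of it fits in a few lines. A rough sketch (the tokenizer and the exact idf formula are standard but arbitrary choices, and recomputing document frequencies per call is naive; cache them for real use):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TfIdf
{
    // tf-idf weight for every term of one document, given the corpus
    public static Dictionary<string, double> Vector(string doc, List<string> corpus)
    {
        var terms = Tokenize(doc);
        var tf = terms.GroupBy(t => t)
                      .ToDictionary(g => g.Key, g => (double)g.Count() / terms.Count);
        return tf.ToDictionary(
            kv => kv.Key,
            kv => kv.Value * Math.Log((double)corpus.Count /
                      (1 + corpus.Count(d => Tokenize(d).Contains(kv.Key)))));
    }

    // cosine similarity between two sparse term-weight vectors
    public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Keys.Intersect(b.Keys).Sum(k => a[k] * b[k]);
        double na = Math.Sqrt(a.Values.Sum(v => v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => v * v));
        return na == 0 || nb == 0 ? 0 : dot / (na * nb);
    }

    private static List<string> Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
}
```

Representing each category by the vector of its already-categorized documents and assigning a new document to every category whose similarity exceeds a threshold naturally covers your "several categories" and "no category" cases.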
Interesting articles:
A self-organizing semantic map for information retrieval
WEBSOM - self-organizing maps of document collections
The major issue here, IMHO, is the length of the documents. I would call this phrase classification, and there is ongoing work on it because of Twitter. You could bring in additional text by performing a web search on the 30 words and then analyzing the top matches. There is a paper about this, but I can't find it right now. Then I would try a feature-vector approach (tf-idf, as in Jimmy's answer) and a multiclass SVM for classification.
Perhaps a decision tree combined with a NN?
You can use an SVM to classify text in C# with the libsvm.net library.