I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.
Check out the wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around and pretty easy to implement an algorithm in whichever language you are using, in your case, C#.
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.
Consider using keyword matching together with an edit-distance-based similarity score. You might also combine this with logged data that maps what users originally searched for to what they actually clicked.
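One rough way to combine keyword matching with a similarity score is sketched below in Python (the same logic ports to C#). It uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for an edit-distance similarity, which is an assumption made for brevity. The names `search_businesses` and `min_score`, and the sample data, are hypothetical.

```python
from difflib import SequenceMatcher


def token_similarity(a, b):
    """Similarity in [0, 1] between two words (stand-in for edit distance)."""
    return SequenceMatcher(None, a, b).ratio()


def search_businesses(query, names, min_score=0.6):
    """Return (name, score) pairs scoring at least min_score, best first."""
    query_tokens = query.lower().split()
    if not query_tokens:
        return []
    results = []
    for name in names:
        name_tokens = name.lower().split()
        if not name_tokens:
            continue
        # For each query keyword, take its best match among the name's words.
        best = [max(token_similarity(q, t) for t in name_tokens)
                for q in query_tokens]
        score = sum(best) / len(best)
        if score >= min_score:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data: a misspelled query against three business names.
names = ["ABC Business Name", "XYZ Plumbing", "Acme Bakery"]
matches = search_businesses("ABC Busness Nam", names)
```

Scoring every name like this is linear in the size of the list, so for a large list you would probably want to narrow the candidates first (for example by keyword) and only score the survivors.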
This is probably a crazy solution, but could you split the business name on spaces and then search on either all of the words or maybe just the first couple?
So you might search on 'ABC' and 'Business' but leave out 'Name' as this might take too long.
You might even check whether the string is over a certain length, then trim it and search on just the first, say, 5 letters.
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name by space.
You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE compares the SOUNDEX codes of two strings and returns an integer from 0 to 4, where 4 means the strings sound most alike.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.
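If you want the same kind of code outside the database, a minimal sketch of the classic American Soundex rules in Python follows (it ports easily to C#). This is an assumption-laden illustration: SQL Server's SOUNDEX may differ in edge cases, and `name_key` is a hypothetical helper that applies the code to each space-separated word, as suggested above.

```python
# Letter groups that share a Soundex digit; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character classic Soundex code for a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so repeated codes count again
    return (result + "000")[:4]


def name_key(name):
    """Hypothetical helper: Soundex code for each word of a business name."""
    return tuple(soundex(word) for word in name.split())
```

Two names can then be treated as candidates for a match when their `name_key` values are equal, which is roughly what matching on a computed SOUNDEX column would give you.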
Related
I'm creating a program that reads a scanned handwritten document and converts it to text. The recognized words must come from a dictionary of about 300 words that I create. As an example, if the handwritten word is recognized as "heilo", but my dictionary only contains "hello" and "world", it should convert it to "hello". However, if it recognized it as "planet", it shouldn't match it to anything. I think a possible approach would be to create a score of how closely the recognized word matches each word in the dictionary. If it doesn't get a minimum score, then no match is found.
I'm writing the application in C#. Are there any libraries/examples available that do something like this, or would I have to code everything from scratch?
Thanks
There is nothing in the standard libraries to compute the distance between words, but there are plenty of examples you can find on the internet: look up "edit distance" or "Levenshtein distance". The idea is to measure the similarity in terms of the number of changes to the first string in order to make it a second string. The distance between "heil" and "hello" is 2, because you need to replace "i" with "l" (first edit), and then append an "o" (the second edit).
When looking for an implementation or implementing your own, avoid the trivial implementation with a 2D array, because it's not memory-efficient. Use the modification with O(min(m,n)) memory requirements instead of the "naive" O(m*n).
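A minimal sketch of the two-row variant, written in Python here but straightforward to port to C#. The `best_match` helper and its `max_distance` cutoff are hypothetical; they illustrate the "minimum score" idea from the question using a maximum allowed distance instead.

```python
def levenshtein(a, b):
    """Edit distance keeping only two rows: O(min(len(a), len(b))) memory."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_match(word, dictionary, max_distance=2):
    """Closest dictionary word, or None if nothing is within max_distance."""
    if not dictionary:
        return None
    closest = min(dictionary, key=lambda entry: levenshtein(word, entry))
    return closest if levenshtein(word, closest) <= max_distance else None
```

With a dictionary of about 300 words, comparing the recognized word against every entry is cheap, so no index is needed. The right value for `max_distance` depends on your word lengths and is something to tune.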
I have no lib at hand to do what you need but searching the web knowing that you want to calculate the Levenshtein Distance might help you in your search.
Perhaps you should start with a spell checker - there are a number of libraries available that do this.
There are a few c# snippets online that will get the ball rolling:
Levenshtein:
http://www.dotnetperls.com/levenshtein
Boyer-Moore:
http://www-igm.univ-mlv.fr/~lecroq/string/node15.html#SECTION00150
Based on those, you can easily implement your own Word Matcher module.
I'm writing a word game. I have access to the dictionary object to validate the words. I need to find all possible words that contains a word and a set of additional characters.
for example:
Let's say the word is "MEN" and the set of additional characters is "WALOHTD". I need a way to find words like:
1. MEND
2. WOMEN
3. MENTAL
4. etc.
basically we are looking at all possible words that contain "MEN" and any of the specific additional characters.
I can certainly write code that loops through the entire dictionary to first find words that contain the subword and then check for the specific characters' existence, but that is not optimal. It's taking more than a second. Any help towards an optimal solution is greatly appreciated.
_rey
The problem is a mixture of that of regular language and that of searching a data structure.
Considering the first aspect alone, we'd be inclined to use a regular expression. You don't say if we can repeat the "additional characters". If we can, it's easy enough: [WALOTHD]*MEN[WALOTHD]* for your case, and that's easily adapted.
If we can't repeat, then we can start with [WALOTHD]{0,7}MEN[WALOTHD]{0,7} and filter out any that break the rule ("ALLOTMENT" matches that expression, but repeats L and T).
Or we can try to build a much more complicated regular expression, though I'm not sure the gains from the better expression would outweigh the cost of working out what it is.
Coming from the other side of searching a dictionary, a DAWG is very space-efficient and makes finding matches that contain substrings relatively efficient. It's not a complete match to this puzzle, as we have quite a few permutations of prefixes and suffixes to worry about. Without testing, I'd guess it'd be reasonably good if we can't repeat from the "additional", and horrible if we can. But that is just a guess. A GADDAG might well be worth looking at; it'd be bigger than a DAWG, but likely faster for this sort of search (GADDAGs are used in Scrabble-solving, which is pretty much the same problem that you have here).
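The "match with a simple expression, then filter" approach could look roughly like this in Python (the same idea works with C#'s Regex class). The function name `find_words` and the `allow_repeats` flag are hypothetical. The filter subtracts the stem's letters from the word and checks that what remains does not use any additional character more often than it is available.

```python
import re
from collections import Counter


def find_words(stem, extras, dictionary, allow_repeats=False):
    """Words made of the stem plus characters drawn from the extras."""
    allowed = re.escape(extras)
    pattern = re.compile(f"[{allowed}]*{re.escape(stem)}[{allowed}]*")
    available = Counter(extras)
    results = []
    for word in dictionary:
        if not pattern.fullmatch(word):
            continue
        if not allow_repeats:
            # Letters left over once the stem itself is accounted for.
            used = Counter(word) - Counter(stem)
            # Counter subtraction keeps only positive counts, so a non-empty
            # result means some extra character was used too many times.
            if used - available:
                continue
        results.append(word)
    return results


# Hypothetical sample dictionary.
words = ["MEND", "WOMEN", "MENTAL", "ALLOTMENT", "MENU", "AMEN"]
found = find_words("MEN", "WALOHTD", words)
```

This still walks the whole dictionary, so it addresses correctness rather than the speed concern; the data structures discussed next are aimed at the latter.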
I need a way to recognize urls with similar pattern, e.g. a function which returns true when matched
http://mysite.com/page/123
and
http://mysite.com/page/456
or
http://mysite.com/?page=123
and
http://mysite.com/?page=456
or
http://mysite.com/?page=123&param=2
and
http://mysite.com/?page=456&param=3
I don't need to check validity of urls here, only find out if the pattern is the same.
I probably need a regular expression for it, but can't figure out how to do it. Can anyone help? Thanks.
Maybe you can try Levenshtein distance
http://www.dotnetperls.com/levenshtein, which is used to find similarity between strings.
Use a longest common subsequence algorithm and divide the result by the length of either of the strings. If it's above an arbitrary threshold, they're similar enough.
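A small sketch of that ratio in Python, easily ported to C#. Dividing by the longer string's length is an assumption made here so the score stays between 0 and 1, and the name `lcs_similarity` is hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, keeping two rows only."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_similarity(a, b):
    """Common-subsequence length divided by the longer string's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


score = lcs_similarity("http://mysite.com/page/123",
                       "http://mysite.com/page/456")
```

For the first pair of URLs in the question the score is 23/26, so a threshold somewhere around 0.8 would treat them as the same pattern. Short URLs that differ in only a few characters will also score highly, so the threshold needs testing against your real data.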
Not a specific answer, but I feel that if you want this to work well in a generalised sense, you will need to be content-aware, i.e. you need to break each URL into subsections:
Protocol
Domain
Path
Querystrings
... And process each separately. The level of acceptable fuzziness will control how much you need to break up the URL, but each section would (I feel) need quite specific inspection. The protocol and domain could be straight string matches. The paths could perhaps be split by '/' and then, after basic length checks, the elements could be compared one by one, only comparing items of equal depth (using direct equality or a "change distance" like the Levenshtein distance mentioned earlier). The querystrings could be broken up into dictionaries via a simple split on "&" then by "=", which you could sort and compare however you want. This would also satisfy @MarcGravell's question about reordered querystring parameters.
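The section-by-section comparison might be sketched like this in Python (in C#, `System.Uri` would play the role of `urlsplit`). The rule in `segments_match`, that two path elements match when they are equal or both numeric, is an assumption chosen to fit the examples in the question; a distance-based rule could be swapped in. Querystrings are compared by parameter name only.

```python
from urllib.parse import parse_qs, urlsplit


def segments_match(a, b):
    """Assumed rule: equal, or both purely numeric (e.g. page ids)."""
    return a == b or (a.isdigit() and b.isdigit())


def same_pattern(url_a, url_b):
    """True when two URLs share protocol, domain, path shape and query keys."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Protocol and domain: straight string matches.
    if (a.scheme, a.netloc) != (b.scheme, b.netloc):
        return False
    # Path: split on '/', check depth, then compare element by element.
    path_a = [s for s in a.path.split("/") if s]
    path_b = [s for s in b.path.split("/") if s]
    if len(path_a) != len(path_b):
        return False
    if not all(segments_match(x, y) for x, y in zip(path_a, path_b)):
        return False
    # Querystring: compare sorted parameter names, ignoring values and order.
    keys_a = sorted(parse_qs(a.query, keep_blank_values=True))
    keys_b = sorted(parse_qs(b.query, keep_blank_values=True))
    return keys_a == keys_b
```

Because the parameter names are sorted before comparison, `?param=2&page=123` and `?page=456&param=3` are treated as the same pattern.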
So I've searched fuzzy searching, the Levenshtein Distance Algorithm and I'm not sure if either are a true fit for what I'm doing. Please let me know your thoughts, if any...
How can I take a user's full name, and generate a list of similar names? I want to prevent a user from creating multiple accounts in an application by providing a "Hey are you sure none of these are you" as a final step before account creation.
I've found this article, but it's entirely SQL-based (http://stackoverflow.com/questions/988050/matching-records-based-on-person-name)
I'm using c# / Linq, SqlServer.
Thanks for your time!
Here is a link to a SOUNDEX implementation in .NET:
http://www.codeproject.com/KB/recipes/soundex.aspx
I haven't used it but it seems to be rated well
If it were me, I would require an exact match on the last name, and then only try to guess variances of the first name. This would narrow down your field of work quite a bit.
Then, as you suggested in your comments, you could apply rules such as allowing the first name length to differ by a few characters, as well as requiring that a threshold of, say, 80% of the characters match.
Also, you can then only look at first names that also match the first X characters as well, as most English name deviations will be after X number of characters.
Example:
John Doe
Johnny Doe
Johnathan Doe
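A minimal sketch of the "exact last name, first X characters of the first name" rule in Python (it maps directly onto a LINQ `Where` clause in C#). The function name `similar_names` and the default `prefix_len` of 3 are hypothetical, and the sketch assumes a name is simply "first ... last" separated by spaces. The length and percentage rules mentioned above are left out and could be added as further filters.

```python
def split_name(full_name):
    """Assumed format: first word is the first name, last word the surname."""
    parts = full_name.lower().split()
    return parts[0], parts[-1]


def similar_names(full_name, existing_names, prefix_len=3):
    """Existing names with the same surname and a matching first-name prefix."""
    first, last = split_name(full_name)
    matches = []
    for candidate in existing_names:
        if not candidate.split():
            continue  # skip blank entries
        other_first, other_last = split_name(candidate)
        if other_last != last:
            continue  # require an exact match on the last name
        if other_first[:prefix_len] == first[:prefix_len]:
            matches.append(candidate)
    return matches


# Hypothetical existing accounts.
existing = ["Johnny Doe", "Johnathan Doe", "Jane Doe", "John Smith"]
suggestions = similar_names("John Doe", existing)
```

Here "Jane Doe" is rejected on the first-name prefix and "John Smith" on the last name, leaving the two variants from the example.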
We use Lucene.NET to implement a full text search on a client's website. The search itself works already but we now want to implement a modification.
Currently a * is appended to every term, which leads Lucene to perform what I would classify as a StartsWith search.
In the future we would like to have a search that performs something like a Contains rather than a StartsWith.
We use
Lucene.Net 2.9.2.2
StandardAnalyzer
default QueryParser
Samples:
(Title:Orch*) matches: Orchestra
but:
(Title:rch*) does not match: Orchestra
We want the first and the second one to both match Orchestra.
Basically I want the exact opposite of what was asked in this question. I'm not sure why Lucene performed a Contains rather than a StartsWith by default for this person:
Why is this Lucene query a "contains" instead of a "startsWith"?
How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.
First off, I assume you're using StandardAnalyzer, or something similar. The question you linked fails to recognize that you search for terms: in that case a* matches "Fleet Africa" because it's tokenized into "fleet" and "africa".
You need to call QueryParser.SetAllowLeadingWildcard(true) to be able to write queries like field:*value*. Are you actually changing the string that's passed to QueryParser?
You could parse the query as usual, and then implement a QueryVisitor that rewrites all TermQuery into WildcardQuery. That way you still support phrase searches.
I see no good things in rewriting queries into prefix- or wildcard-queries. There is very little shared between an orc, or a chest, and an Orchestra, but both words will match. Instead, hook up your customer with an analyzer that supports stemming, synonyms, and provide a spell correction feature to fix simple searching mistakes.
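@Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a character n-gram filter (Lucene's shingle filter produces word-level n-grams, which is not what is described here).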
Note that this will make your index massively larger, since instead of just storing "orchestra", you will store "orc", "rch", "che", "hes"... But just having a plain term query with leading wildcards will be massively slow. It will essentially have to look through every single term in your corpus.
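To make the trade-off concrete, here is a toy trigram index in Python. It does not use any Lucene API; `build_index` and `contains_search` are hypothetical names, and the sketch only mimics what indexing character n-grams would store and how a "contains" lookup could use them.

```python
from collections import defaultdict


def ngrams(term, n=3):
    """All overlapping character n-grams of a term, lower-cased."""
    term = term.lower()
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def build_index(titles, n=3):
    """Map every n-gram of every word to the titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            for gram in ngrams(word, n):
                index[gram].add(title)
    return index


def contains_search(index, fragment, n=3):
    """Titles containing the fragment; fragments shorter than n find nothing."""
    grams = ngrams(fragment, n)
    if not grams:
        return set()
    # Only titles that have every n-gram of the fragment can contain it.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing the n-grams does not guarantee a contiguous match, so confirm.
    return {t for t in candidates if fragment.lower() in t.lower()}


# Hypothetical sample titles.
index = build_index(["Orchestra", "Chamber Music"])
```

A single nine-letter word already produces seven index entries here, which is the growth the answer warns about. In exchange, both "orch" and "rch" become direct lookups instead of a scan over every term.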