Fuzzy searching from C#

I have a database of records in which some of the properties form an address. I have a C# web app that searches by address, but I need more than the wildcard symbol to retrieve matches. Is there a way to implement a fuzzy (approximate) search from the web app?
My two parameters are:
Address
Postcode
And only one needs to be populated to complete the search. Searching with both parameters should also be an available option.

Fuzzy matching is usually not built into databases because there is no efficient way to index columns for it. You either have to run the fuzzy-matching algorithm on every row, or build an index of every possible fuzzy match for each row. The first makes searching slow; the second makes insertions slow and drastically increases the size of the database. Depending on the exact matching algorithm and tolerance, a hybrid solution may be possible, but it will not be a trivial task.
In my own experience with fuzzy matching, I always kept one indexed field that had to match exactly, so that the amount of data I had to run the fuzzy match on was limited. If that is not possible in your case, building the index of all fuzzy matches might be the only solution.
Finally, you might want to step back and ask whether you really need a fuzzy match, or whether you just need to break the address look-up into the numerical part and the street name. Both can be extracted from the address the user enters before you attempt the look-up. You would then store the numerical and street portions of each address separately in your DB.
EDIT
One option would be to do an exact match on the numerical portion of the address, get the results back from the DB, and use the fuzzy match on the street portion to eliminate and order the results. This could get tricky with oddball addresses that have no numerical part, or if the user spells out the number, as in "One Main St". The best way to pull this off would be to create separate columns for the numerical and street-name portions of the address, which means updating your DB and parsing your existing data. You might then have to deal with other variations in the address, such as "SW" vs "South West", that could cause the fuzzy matching to fail.
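The split into a numerical part and a street part could be sketched as follows. This is only an illustration: the `AddressParser` class, its `Split` method, and the rule that the number is a run of leading digits with an optional letter suffix are all assumptions, not something taken from the question. Addresses without a leading number (including spelled-out ones such as "One Main St") come back with an empty number so the caller can fall back to a fuzzy match on the whole string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: separates a leading house number from the street text.
public static class AddressParser
{
    // Leading digits plus an optional letter suffix (e.g. "12B"),
    // then whitespace, then the rest of the line.
    private static readonly Regex LeadingNumber =
        new Regex(@"^\s*(?<number>\d+[A-Za-z]?)\s+(?<street>.+)$");

    // Returns the numeric part (empty when none is found) and the street part.
    public static (string Number, string Street) Split(string address)
    {
        if (string.IsNullOrWhiteSpace(address))
            return (string.Empty, string.Empty);

        Match m = LeadingNumber.Match(address);
        if (!m.Success)
            return (string.Empty, address.Trim());

        return (m.Groups["number"].Value, m.Groups["street"].Value.Trim());
    }
}
```

The number can then be used for the exact, indexed match, and the street text for the fuzzy comparison on the reduced result set.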

Related

Is searching for a substring in the key of a key-value pair more efficient when the data is stored in a data structure or in one long string?

I have a string searching problem and two ideas came to mind on how to implement it. I was wondering if people can indicate which method would give me more efficient performance, or perhaps even suggest a better way of doing it?
The problem is that I have a text file of around 450 KB containing data in the following format:
description1, code1\n
description2, code2\n
description3, code3\n
...
It is two columns of data delimited by a comma and each record consists of a description and a code.
The code is a short three character text that is not immediately meaningful to the user, which is why there is description data paired with the code.
The description data is a short sentence that describes to the user what the code means.
I'm trying to create a GUI where the user can enter a search keyword in an editable text field which is then used to search against the description data. The system would then return back all the filtered records, i.e., all the description data that has the keyword as a substring and the code that it is paired with for the user to select. This occurs for each character the user types.
The first idea that came to mind on how to implement this feature is to create a key-value pair collection using the description data as key, such as a NameValueCollection, and then use a foreach loop to go through each record and search the key for the matching substring.
The second idea is to read the whole text file into one long string, and use the String.IndexOf() method to search for the keyword and wherever there is a hit in the search, I extract that portion of the record to return to the user.
The second idea came to mind because I was concerned about the performance impact of the first. I've read that the IndexOf method used with StringComparison.Ordinal performs better than a Boyer–Moore string search algorithm, so would implementing it this way give better performance?
So when searching for a substring in the key, does it provide faster retrieval to store the whole file as a string or in a NameValueCollection, or are there better ways of doing this?

If you have a large collection of strings that you plan to search for the same substring, you have several options.
One option would be to use the Aho-Corasick string matching algorithm to search for the search query in every single one of the lines of the file. The total runtime of doing this will be O(m + n + z), where m is the length of the query, z is the number of total matches, and n is the total number of characters in all of the strings in the file.
A better but more complex option would be to build a generalized suffix tree out of all the lines of the file. You could then find all matching lines in time O(n + z), where (with the symbols now swapped relative to the previous paragraph) n is the length of the pattern to search for and z is the number of matches. This requires O(m) preprocessing time, where m is the total number of characters in the file. Once the tree is built, each query is much, much faster than with the first option, but you would probably have to find a good suffix tree library, as suffix tree construction algorithms are fairly complex.
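For comparison, the plain per-record scan that the question describes can be sketched as below. This is neither Aho-Corasick nor a suffix tree; it is only a baseline to measure against before adopting either. The `CodeLookup` class and its method names are hypothetical, and the sketch assumes the code is whatever follows the last comma on each line.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical baseline: keep the records in a list and scan every description.
public static class CodeLookup
{
    // Parses "description, code" lines, splitting on the LAST comma so that
    // descriptions which themselves contain commas stay intact.
    public static List<(string Description, string Code)> Parse(IEnumerable<string> lines)
    {
        var records = new List<(string Description, string Code)>();
        foreach (string line in lines)
        {
            int comma = line.LastIndexOf(',');
            if (comma < 0) continue; // skip malformed lines
            records.Add((line.Substring(0, comma).Trim(),
                         line.Substring(comma + 1).Trim()));
        }
        return records;
    }

    // Returns every record whose description contains the keyword,
    // compared ordinally and ignoring case.
    public static List<(string Description, string Code)> Search(
        List<(string Description, string Code)> records, string keyword)
    {
        var hits = new List<(string Description, string Code)>();
        foreach (var record in records)
        {
            if (record.Description.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                hits.Add(record);
        }
        return hits;
    }
}
```

The file would be parsed once at startup and `Search` called on each keystroke.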
Hope this helps!

Given a dictionary, what's the optimal way to find all possible words that contain a particular set of characters and a string

I'm writing a word game. I have access to the dictionary object to validate the words. I need to find all possible words that contain a given word and a set of additional characters.
For example:
Let's say the word is "MEN" and the set of additional characters is "WALOHTD". I need a way to find words like:
1. MEND
2. WOMEN
3. MENTAL
4. etc.
Basically, we are looking for all possible words that contain "MEN" and any of the specific additional characters.
I can certainly write code that loops through the entire dictionary to first find the words that contain the subword and then check for the specific characters' existence, but that is not optimal. It takes more than a second. Any help towards an optimal solution is greatly appreciated.
_rey

The problem is a mixture of that of regular language and that of searching a data structure.
Considering the first aspect alone, we'd be inclined to use a regular expression. You don't say whether the "additional characters" can be repeated. If they can, it's easy enough: [WALOTHD]*MEN[WALOTHD]* for your case, and that's easily adapted.
If we can't repeat, then we can start with [WALOTHD]{0,7}MEN[WALOTHD]{0,7} and filter out any that break the rule ("ALLOTMENT" matches that expression, but repeats L and T).
Or we can try to build a much more complicated regular expression, though I'm not sure the gains from the better expression would outweigh the cost of working out what it should be.
Coming from the other side, that of searching a dictionary, a DAWG (directed acyclic word graph) is very space-efficient and makes finding matches that contain substrings relatively efficient. It's not a complete match for this puzzle, as we have quite a few permutations of prefixes and suffixes to worry about. Without testing, I'd guess it would be reasonably good if we can't repeat the "additional" characters, and horrible if we can. But that is just a guess. A GADDAG might well be worth looking at. It would be bigger than a DAWG, but likely faster for this sort of search (GADDAGs are used in Scrabble solving, which is pretty much the same problem that you have here).
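The "match with a bounded regular expression, then filter out repeats" approach for the no-repeat case might look like the sketch below. The `WordFinder` class and its method names are hypothetical, and the sketch assumes the word and the additional characters are plain, non-empty strings of letters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for the no-repeat case: regex first, then a count check.
public static class WordFinder
{
    public static List<string> Find(IEnumerable<string> dictionary, string core, string extras)
    {
        core = core.ToUpperInvariant();
        extras = extras.ToUpperInvariant();

        // For core "MEN" and extras "WALOHTD" this builds
        // ^(?<pre>[WALOHTD]{0,7})MEN(?<post>[WALOHTD]{0,7})$
        string cls = "[" + extras + "]{0," + extras.Length + "}";
        var pattern = new Regex(
            "^(?<pre>" + cls + ")" + Regex.Escape(core) + "(?<post>" + cls + ")$");

        var result = new List<string>();
        foreach (string word in dictionary)
        {
            Match m = pattern.Match(word.ToUpperInvariant());
            if (!m.Success) continue;

            // Reject words that use an additional character more often than supplied.
            string used = m.Groups["pre"].Value + m.Groups["post"].Value;
            if (UsesOnlyAvailable(used, extras))
                result.Add(word);
        }
        return result;
    }

    private static bool UsesOnlyAvailable(string used, string extras)
    {
        // Count how many copies of each additional character are available.
        var available = new Dictionary<char, int>();
        foreach (char c in extras)
        {
            available.TryGetValue(c, out int n);
            available[c] = n + 1;
        }
        // Consume one copy per used character; fail when none is left.
        foreach (char c in used)
        {
            if (!available.TryGetValue(c, out int n) || n == 0) return false;
            available[c] = n - 1;
        }
        return true;
    }
}
```

This still visits every dictionary entry, so it addresses correctness rather than speed; the DAWG or GADDAG is what would avoid the full scan.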

Special characters in autocomplete (C#)

I'm using c# and a database in SQL Server.
I have an autocomplete field that works fine with normal characters. I would like it to handle special characters too, such as ö, Ä, é, è, ...
I would also like it to identify characters that may sound similar in some languages, such as 'b' and 'v', so that if I type 'boor' it would offer 'voor' as a possible suggestion.
Any ideas?
Thanks
Edit: The autocomplete textboxes are used for names and surnames (one for each). They are made with AutoCompleteStringCollection. They search in the database for names or surnames that already exist.
This part of the application basically lets the user add new persons to the application (name, surname, etc.).
The goal is that when the user is creating a new person, he/she gets a list of the persons whose name or surname is similar to the one being typed.
So if we already have 'James Smith' in the database and the user types Smyth, he/she should get the option to change it to Smith (as an autocomplete, maybe), along the lines of "hey, do you mean 'Smith'?" That way we avoid the user creating the same person again under a misspelled name.
Because we are working with names and surnames from people from all over the world, the errors in the creation of a new person can come from any language.
PS:
Would it be a good idea to create my own autocomplete, hiding/showing a listbox right under the textbox?
From what I've tried, the SOUNDEX function works really well for characters like ö, Ä, é, è, ... But I can't call the database for every single name or surname, so I don't really know how to use it.

I am not sure what you mean by autocompletion. Regarding the second part of your question, you probably need the SQL Server function SOUNDEX. It returns a four-character code that can be used to evaluate the similarity of two strings.
Use it like:
SELECT SOUNDEX ('Smith'), SOUNDEX ('Smythe');
The two names sound almost identical, so they produce the same SOUNDEX code: S530 and S530.
I think SOUNDEX can be used with various languages, though I am not totally sure.

Unfortunately, the only thing you can supply as the TextBox's AutoCompleteCustomSource is an AutoCompleteStringCollection.
The logic that decides what is presented to the user (the box with the matching items below the textbox itself) is fully controlled by the TextBox and can't be influenced in any way.
So even if you use something like SOUNDEX or Levenshtein distance, you can't pass the result on to the TextBox, because it always performs a String.StartsWith() on the given collection, and on selection it replaces the whole content with the selected value from the source.
That's something that has already driven me crazy. You simply can't influence which items from the list are presented to the user, and you can't influence what happens when an item from the box is selected.

I would look into Levenshtein distance.
Soundex is rather primitive. It was originally developed to be hand calculated. It results in a key and works well with western names and surnames.
Levenshtein distance compares two string values and returns the minimum number of single-character insertions, deletions, or substitutions needed to turn one into the other. It looks for missing or substituted letters and makes no phonetic comparison, unlike SOUNDEX.
Wikipedia reference: http://en.wikipedia.org/wiki/Levenstein_distance
Website for testing two string values using Levenshtein distance: http://gtools.org/levenshtein-calculate.php
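A minimal sketch of the distance in C#, using the standard two-row dynamic-programming formulation, is shown below. The `EditDistance` class name is hypothetical, and the comparison is case-sensitive; callers that want case-insensitive matching would normalize both strings first.

```csharp
using System;

// Hypothetical helper: classic Levenshtein distance with two rolling rows.
public static class EditDistance
{
    public static int Levenshtein(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        // prev[j] holds the distance between the first i-1 chars of a
        // and the first j chars of b; curr is the row being filled in.
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1,   // insertion
                             prev[j] + 1),      // deletion
                    prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }
}
```

For the example from the question, `EditDistance.Levenshtein("Smyth", "Smith")` is 1, so a small threshold would be enough to suggest the existing surname.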

Ideas for creating a "Did you mean XYZ" feature for a website

I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.

Check out the wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around and pretty easy to implement an algorithm in whichever language you are using, in your case, C#.
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.

Consider combining keyword matching with edit-distance-based similarity. You might also combine this with data that maps what users originally searched for to what they actually clicked.

This is probably a crazy solution, but could you split the business name on spaces and then search on either all of the words or maybe just the first couple?
So you might search on 'ABC' and 'Business' but leave out 'Name', as searching on every word might take too long.
You might even check whether the string exceeds a certain length, then trim it and search on just the first, say, 5 letters.
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name on spaces.

You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE gives a number which represents how "different" two strings are based on sound.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.

Regular expression to extract domain name from any domain

I'm trying to extract the domain name from a string in C#. You don't necessarily have to use a RegEx but we should be able to extract yourdomain.com from all of the following:
yourdomain.com
www.yourdomain.com
http://www.yourdomain.com
http://www.yourdomain.com/
store.yourdomain.com
http://store.yourdomain.com
whatever.youdomain.com
*.yourdomain.com
Also, any TLD is acceptable, so the same should work with .net, .org, .co.uk, etc. in place of .com in all of the above.

If no scheme present (no colon in string), prepend "http://" to make it a valid URL.
Pass string to Uri constructor.
Access the Uri's Host property.
Now you have the hostname. What exactly you consider the ‘domain name’ of a given hostname is a debatable point. I'm guessing you don't simply mean everything after the first dot.
It's not possible to distinguish hostnames like ‘whatever.youdomain.com’ from domains-in-an-SLD like ‘warwick.ac.uk’ from just the strings. Indeed, there is even a bit of grey area about what is and isn't a public SLD, given the efforts of some registrars to carve out their own niches.
A common approach is to maintain a big list of SLDs and other suffixes used by unrelated entities. This is what web browsers do to stop unwanted public cookie sharing. Once you've found a public suffix, you could add the one nearest prefix in the host name split by dots to get the highest-level entity responsible for the given hostname, if that's what you want. Suffix lists are hell to maintain, but you can piggy-back on someone else's efforts.
Alternatively, if your app has the time and network connection to do it, it could start sniffing for information on the hostname. For example, it could do a whois query for the hostname and then for each parent in turn until it got a result; that would be the domain name of the lowest-level entity responsible for the given hostname.
Or, if all that's too much work, you could try just chopping off any leading ‘www.’ present!
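The first three steps, plus the simple "chop off a leading www." fallback, could be sketched like this. The `HostExtractor` class and `TryGetHost` method are hypothetical names; the sketch deliberately does not attempt the public-suffix or whois approaches, so subdomains such as store.yourdomain.com come back unchanged.

```csharp
using System;

// Hypothetical helper: hostname via System.Uri, with a leading "www." removed.
public static class HostExtractor
{
    public static bool TryGetHost(string input, out string host)
    {
        host = string.Empty;
        if (string.IsNullOrWhiteSpace(input)) return false;

        string candidate = input.Trim();

        // No colon means no scheme, so prepend one to make it a valid URL.
        if (candidate.IndexOf(':') < 0)
            candidate = "http://" + candidate;

        // TryCreate avoids an exception on input that is not a valid URI.
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out var uri))
            return false;

        host = uri.Host;
        if (host.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
            host = host.Substring(4);

        return host.Length > 0;
    }
}
```

Reducing the result further to a registrable domain would still need a suffix list, as described above.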

I would recommend trying this yourself, using Regulator and a regex cheat sheet.
http://sourceforge.net/projects/regulator/
http://regexlib.com/CheatSheet.aspx
You can also find some good info on regular expressions at Coding Horror.

Have a look at this other answer. It was for PHP but you'll easily get the regex out of the 4-5 lines of PHP and you can benefit from the discussion that followed (see Alnitak's answer).

A regex doesn't really fit your requirement of "any TLD", since the set of TLDs is large, varied in format, and continually in flux. If you limited your scope to the following (matched case-insensitively):
(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]+$))
You would catch .anything and .co.anything, which I imagine covers most realistic cases...
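A sketch of applying that pattern from C# follows. It assumes the `co\.` branch is meant to accept a multi-letter country code (written here as `[A-Z]+`), that matching should ignore case, and that the input is a bare hostname with no path or trailing slash, since the pattern is anchored at the end of the string. The `DomainRegex` class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the limited-scope pattern.
public static class DomainRegex
{
    // Captures "name.tld" or "name.co.tld" at the end of the input.
    private static readonly Regex Pattern = new Regex(
        @"(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]+$))",
        RegexOptions.IgnoreCase);

    // Returns the captured domain, or an empty string when nothing matches.
    public static string Extract(string host)
    {
        Match m = Pattern.Match(host);
        return m.Success ? m.Groups["domain"].Value : string.Empty;
    }
}
```

Second-level suffixes other than `co.` (for example `ac.uk`) are not covered and would need further alternatives or a suffix list.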
