I'm using C# with a SQL Server database.
I have an autocomplete field that works fine with normal characters. I would like it to also handle special characters such as ö, Ä, é, è, ...
I would also like it to recognise characters that sound similar in some languages, such as 'b' and 'v', so that typing 'boor' would offer 'voor' as a possible suggestion.
Any ideas?
Thanks
Edit: The autocomplete textboxes are used for names and surnames (one for each). They are made with AutoCompleteStringCollection. They search in the database for names or surnames that already exist.
This part of the application basically lets the user add new persons to the application (name, surname, etc.).
The goal is that when the user is creating a new person, he/she gets a list of the persons whose name or surname is similar to the one being typed.
So if we already have 'James Smith' in the database and the user types 'Smyth', he/she should be offered the chance to change it to 'Smith' (as an autocomplete suggestion, maybe), along the lines of "hey, do you mean 'Smith'?" That way we avoid the user creating the same person again under a misspelled name.
Because we are working with names and surnames from people from all over the world, the errors in the creation of a new person can come from any language.
PS:
Would it be a good idea to create my own autocomplete, hiding/showing a listbox right under the textbox?
From what I have tried, the SOUNDEX function works really well for characters like ö, Ä, é, è, ... But I can't call the database for every single name or surname, so I don't really know how to use it.
I am not sure what you mean by autocompletion. Regarding the second part of your question, you probably need the SQL Server function SOUNDEX. It returns a four-character (SOUNDEX) code that can be used to evaluate the similarity of two strings.
Use it like:
SELECT SOUNDEX ('Smith'), SOUNDEX ('Smythe');
The two words sound alike, so both return the same SOUNDEX code, S530.
I think SOUNDEX may be usable with various languages, though I am not totally sure.
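Since the question mentions that calling the database for every name is not an option, the same kind of key can also be computed in C# once the existing names have been loaded. The following is only a minimal sketch of the classic American Soundex rules; the class and method names are made up for illustration, and it is an assumption that the result agrees with SQL Server's SOUNDEX for your data (edge cases may differ). Characters outside A-Z are simply ignored here, so accented letters would need to be reduced to their base letter first.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of .NET or SQL Server.
public static class SoundexSketch
{
    // Soundex digit for each letter A..Z; '0' means the letter carries no code.
    private const string Codes = "01230120022455012623010202";

    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return "";

        var key = new StringBuilder();
        char lastCode = '0';

        foreach (char c in word.ToUpperInvariant())
        {
            if (c < 'A' || c > 'Z') continue;          // skip anything that is not a plain letter

            char code = Codes[c - 'A'];
            if (key.Length == 0)
            {
                key.Append(c);                         // the first letter is kept as-is
                lastCode = code;
                continue;
            }

            if (c == 'H' || c == 'W') continue;        // H and W do not separate equal codes

            if (code != '0' && code != lastCode)
                key.Append(code);                      // new consonant group
            lastCode = code;                           // vowels reset the group

            if (key.Length == 4) break;
        }

        return key.Length == 0 ? "" : key.ToString().PadRight(4, '0');
    }
}
```

With this, names that share a key with what the user typed (for example SoundexSketch.Encode("Smith") and SoundexSketch.Encode("Smythe")) can be found in memory without a round trip per keystroke.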
Unfortunately, the only custom source a TextBox accepts (its AutoCompleteCustomSource) is an AutoCompleteStringCollection.
But the logic that decides what is presented to the user (the box with the matching items below the textbox itself) is fully controlled by the TextBox and can't be influenced in any way.
So even if you use something like SoundEx() or Levenshtein distance, you can't pass the result to the TextBox, because it always runs a String.StartsWith() against the given collection, and on selection it replaces the whole content with the selected value from the source.
That's something that already drove me crazy. You simply can't influence which items from the list are presented to the user, and you can't influence what happens when an item from the box is selected.
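If you do end up building your own suggestion list (a listbox shown under the textbox, as the question proposes), the matching itself can be a plain function. Below is a rough sketch, assuming the existing names are already in memory; NameSuggester, Fold and Suggest are hypothetical names. It strips combining accents via Unicode normalization so that 'zoe' finds 'Zoë'. Note that this maps ö to o (not to oe), and that some letters, such as ø, do not decompose this way and are left untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper for a hand-made suggestion list.
public static class NameSuggester
{
    // Decompose accented characters and drop the combining marks (é -> e, Ä -> A).
    public static string Fold(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var folded = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                folded.Append(c);
        }
        return folded.ToString().Normalize(NormalizationForm.FormC);
    }

    // Return the names whose accent-free form starts with the accent-free input, ignoring case.
    public static List<string> Suggest(IEnumerable<string> names, string typed)
    {
        string key = Fold(typed);
        return names
            .Where(n => Fold(n).StartsWith(key, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

The returned list could then be bound to the listbox on each text change; a phonetic or edit-distance filter could replace the StartsWith test in the same place.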
I would look into Levenshtein distance.
Soundex is rather primitive. It was originally developed to be hand calculated. It results in a key and works well with western names and surnames.
Levenshtein distance looks at two string values and produces a number based on their similarity. It counts missing, added, or substituted letters (there is no phonetic comparison as in SoundEx).
Wikipedia reference: http://en.wikipedia.org/wiki/Levenstein_distance
Website for testing two string values using Levenshtein distance: http://gtools.org/levenshtein-calculate.php
Related
First of all I'd like to mention that I'm new to programming and to this site, so I'm still an infant in this world. However, I have a problem.
I have to make code that can compare two strings but the second string (from a file) will have unique identifiers within it. For example:
first string:
I have 10 cats and their fur is #000000
Second string from a file:
I have <d> cats and their fur is <h>
Although I probably don't need to explain, 'd' stands for a number (integer or decimal) and 'h' for hex. There are also 's' and 'a', which are associated with ASCII.
What's supposed to happen is that the first string can contain any number (of any length) and/or any hex value when the data comes in, while the rest of the message stays the same, e.g.
I have 1500 cats and their fur is #000000
The code should still treat the two strings as a match, as it effectively ignores anything that is an int or a hex value. (These identifiers are user-defined, so they can be anywhere in any string.)
The end goal is that if it finds a match, the code will change the colour of the text in the app, among other things. It's basically to highlight errors in a log file.
I've searched high and low on Stack Overflow and looked into regex and string comparisons. I'm about to make a start on the code; however, I would like some input/help.
Obviously I'm not asking for something to be written for me, just to be pointed in the right direction so I can learn.
Many thanks in advance! And apologies if there is a similar post out there, but alas I couldn't find it if there is.
If I understand it correctly, I think I would solve this by replacing each placeholder (<d> etc.) with a regular expression that matches that kind of value. Then use those expressions to replace the actual values in the incoming string with an empty string. That way you can compare the two strings without the values.
Hope that makes sense. I didn't include any code because you asked for just some directions.
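For later readers who do want to see the idea in code, here is one possible sketch in C#. It is a variation on the suggestion above: instead of blanking out the values and comparing afterwards, it turns the whole template into a single anchored regex and matches the incoming line against it. The class name and the concrete sub-patterns for <d> and <h> are assumptions made for illustration; <s> and <a> are left out because the question does not define them precisely.

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: converts a template with <d>/<h> placeholders into a Regex.
public static class LogTemplate
{
    public static Regex Compile(string template)
    {
        var pattern = new StringBuilder("^");

        // The capturing group makes Regex.Split keep the placeholders in the result.
        foreach (string piece in Regex.Split(template, "(<[dh]>)"))
        {
            if (piece == "<d>")
                pattern.Append(@"\d+(?:\.\d+)?");       // assumed: integer or decimal
            else if (piece == "<h>")
                pattern.Append("#?[0-9A-Fa-f]+");       // assumed: hex digits, optional '#'
            else
                pattern.Append(Regex.Escape(piece));    // literal text must match exactly
        }

        pattern.Append("$");
        return new Regex(pattern.ToString());
    }
}
```

Usage would look like LogTemplate.Compile("I have <d> cats and their fur is <h>").IsMatch(line), with the compiled Regex kept around per template rather than rebuilt for every log line.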
I have a database containing records, some of whose properties form an address. I have a C# web app that features searches by address, but I need more than just the wildcard symbol to retrieve matches. Is there a means of implementing a fuzzy/rough search from the web app?
My two parameters are:
Address
Postcode
And only one needs to be populated to complete the search. Searching with both parameters should also be an available option.
Fuzzy matching is usually not built into DBs because there is no efficient way to index columns in this way. Basically you'll either have to run the fuzzy matching algorithm on every row or you'll have to create an index of every possible fuzzy match for each row. One will make searching slow; the other will make insertions slow and drastically increase the size of the DB. Depending on the exact fuzzy matching algorithm and tolerance, there could be a hybrid solution that you could implement, but this will not be a trivial task.
In my own experience with fuzzy matching, I always kept one indexed field that had to match exactly, so that the amount of data I had to run the fuzzy match on was limited. If that is not possible in your case, then building the index of all possible fuzzy matches might be the only solution.
Finally, you might want to step back and ask yourself whether you really need a fuzzy match, or whether you just need to break the address look-up into the numerical part and the street name. Both of those can be extracted from the address that the user enters before you attempt the look-up. Then you'd just have to store the numerical and street portions of your address in your DB separately.
EDIT
One option would be to do an exact match on the numerical portion of the address, get the results back from the DB, and use the fuzzy match on the street portion to eliminate and order the results. But this could get tricky with some oddball addresses that might not have a numerical part, or if the user spells out the numerical part, as in "One Main St". Also, the best way to pull this off would be to create separate columns for the numerical and street name portions of the address, which means updating your DB and doing some parsing on your data. And then you might have to deal with other issues in the address, like "SW" vs "South West", that could cause the fuzzy matching to fail.
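To make the splitting step concrete, here is a small sketch of pulling a leading house number off the user's input before the look-up. The class name, the regex, and the assumption that the number comes first are all illustrative; real address formats vary a lot, and inputs like "One Main St" simply fall through as having no numerical part.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: separates a leading house number from the rest of the address.
public static class AddressParts
{
    // Assumed format: digits, whitespace, then the street text.
    private static readonly Regex LeadingNumber = new Regex(@"^\s*(\d+)\s+(.+?)\s*$");

    public static bool TrySplit(string address, out string number, out string street)
    {
        Match m = LeadingNumber.Match(address);
        number = m.Success ? m.Groups[1].Value : "";
        street = m.Success ? m.Groups[2].Value : address.Trim();
        return m.Success;
    }
}
```

The number part would go into the exact-match query, and the street part would be compared against the returned rows with whatever fuzzy measure you pick.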
I'm creating a program that reads a scanned handwritten document and converts it to text. The recognized words must come from a dictionary of about 300 words that I create. As an example, if the handwritten word is recognized as "heilo", but my dictionary only contains "hello" and "world", it should convert it to "hello". However, if it is recognized as "planet", it shouldn't be matched to anything. I think a possible approach would be to compute a score of how closely the recognized word matches each word in the dictionary. If no word reaches a minimum score, then no match is found.
I'm writing the application in C#. Are there any libraries/examples available that do something like this, or would I have to code everything from scratch?
Thanks
There is nothing in the standard libraries to compute the distance between words, but there are plenty of examples you can find on the internet: look up "edit distance" or "Levenshtein distance". The idea is to measure the similarity in terms of the number of changes to the first string in order to make it a second string. The distance between "heil" and "hello" is 2, because you need to replace "i" with "l" (first edit), and then append an "o" (the second edit).
When looking for an implementation or implementing your own, avoid the trivial implementation with a 2D array, because it's not memory-efficient. Use the modification with O(min(m,n)) memory requirements instead of the "naive" O(m*n).
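As an illustration of the memory-friendly variant, here is a sketch that keeps only two rows of the table, sized by the shorter string, plus a small helper that picks the closest dictionary word under a cutoff. WordMatcher, Distance, BestMatch and the maxDistance parameter are names invented for this example, and the cutoff value is something you would have to tune for your own dictionary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; a sketch rather than a tuned implementation.
public static class WordMatcher
{
    // Levenshtein distance using two rows of length min(m, n) + 1.
    public static int Distance(string a, string b)
    {
        if (a.Length < b.Length) { string t = a; a = b; b = t; }   // make b the shorter string

        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;            // distance from empty prefix

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1),         // insertion, deletion
                    prev[j - 1] + cost);                            // substitution or match
            }
            int[] tmp = prev; prev = curr; curr = tmp;              // reuse the two rows
        }
        return prev[b.Length];
    }

    // Closest dictionary word within maxDistance edits, or null if none qualifies.
    public static string BestMatch(string word, IEnumerable<string> dictionary, int maxDistance)
    {
        string best = null;
        int bestDistance = maxDistance + 1;
        foreach (string candidate in dictionary)
        {
            int d = Distance(word, candidate);
            if (d < bestDistance) { best = candidate; bestDistance = d; }
        }
        return best;
    }
}
```

For the example in the question, WordMatcher.BestMatch("heilo", new[] { "hello", "world" }, 2) picks "hello", while "planet" is too far from both words and yields null.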
I have no lib at hand to do what you need but searching the web knowing that you want to calculate the Levenshtein Distance might help you in your search.
Perhaps you should start with a spell checker - there are a number of libraries available that do this.
There are a few C# snippets online that will get the ball rolling:
Levenshtein:
http://www.dotnetperls.com/levenshtein
Boyer-Moore:
http://www-igm.univ-mlv.fr/~lecroq/string/node15.html#SECTION00150
Based on those, you can easily implement your own Word Matcher module.
I have an XML file with two properties: word and link.
How can I replace the words in a text with links, using the XML information?
Ex.:
XML
<word>dog</word>
<link>http://www.dog.com</link>
Text: The dog is nice.
Result: The <a href="http://www.dog.com">dog</a> is nice.
Results OK.
The problems:
1- If the text contains the word dogs, the result is incorrect because of the "s".
2- I've tried splitting the text on spaces to fix it, but if the word is a compound like new year, the result is incorrect again.
Does anyone have any suggestions to do it and fix these problems (plural and compound words)?
Thanks for the help.
You can use Lucene.Net's contrib package Snowball for stemming (words->word, came->come, having->have, etc.). But you will still have trouble with compound words.
If you roll your own solution, I have had good success with the .NET pluralization capabilities:
http://msdn.microsoft.com/en-us/library/system.data.entity.design.pluralizationservices.pluralizationservice.aspx
Essentially, you can pass a word in its plural form and receive a singular version and vice versa.
This could be fairly intensive depending on how often the content changes, i.e. this wouldn't be a good choice for searching thousands of words in real time.
Assuming that you can pre-process/cache the results or that the source file is small, you could:
Run Once
Identify all candidate words from the source file.
Parse/split phrases and pass them through the pluralization libraries to determine their plural counterparts.
Generate (and precompile) simple regular expressions to locate the words that you do want to match. For example, if you want to match "dog" but not "dogs", you could create a regex like \bdog\b (a pattern such as dog[^s] would also match "dogma" and would miss "dog" at the very end of the text), which could then be executed against the text.
Run Whenever a Search/Replace is Needed
Run your list of source expressions against the text in question. I would suggest ordering the expressions from shortest to longest (otherwise a short expression may replace a word that was just parsed by a longer expression).
Again, this would be processor intensive to run in real-time (most solutions will be). As always, if you are parsing HTML, you should use an HTML parser, not a regular expression. In this case, you might use a proper parser to locate all text nodes and then perform the search/replace on them.
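As a rough sketch of the search/replace step on plain text, the following builds one combined, word-bounded pattern from the word/link pairs and replaces every hit in a single pass. This is a swapped-in variation on running the expressions one after another: with a single pass, already-inserted links are never re-scanned, and listing longer phrases first lets "new year" win over "year". The names Linker and AddLinks are hypothetical, matching is case-sensitive, no HTML encoding is done, and plural forms are deliberately not handled here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps known words/phrases in anchor tags.
public static class Linker
{
    public static string AddLinks(string text, IDictionary<string, string> links)
    {
        if (links.Count == 0) return text;

        // Longer phrases first, so "new year" is tried before "new" or "year".
        IEnumerable<string> alternatives = links.Keys
            .OrderByDescending(w => w.Length)
            .Select(w => Regex.Escape(w));

        // \b keeps "dog" from matching inside "dogs" or "dogma".
        string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

        return Regex.Replace(text, pattern,
            m => "<a href=\"" + links[m.Value] + "\">" + m.Value + "</a>");
    }
}
```

Called with a dictionary mapping "dog" to "http://www.dog.com", the text "The dog is nice." gets the word wrapped in a link, while "The dogs are nice." is left unchanged.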
An alternative solution would be to put the text and keyword list into a database and use SQL Server Full Text Indexing which tends to be pretty smart about these things and supports intelligent match predicates. You could even combine this with a CLR stored procedure to handle things that .NET excels at (like string parsing).
Regardless of the approach, this will not be an exact science.
You're likely going to need a dictionary. Create a text file/XML file that contains both the singular and plural forms of the words you want. At runtime, load them into a Dictionary<String, String>. Then look up the value of <word/> in the dictionary and extract its singular value.
I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.
Check out the wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around and pretty easy to implement an algorithm in whichever language you are using, in your case, C#.
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.
Consider using keyword matching together with edit-distance-based similarity. You might also combine this with data on what users originally searched for versus what they actually clicked.
This is probably a crazy solution, but could you split the business name by space and then search on either all the parts or maybe just the first couple?
So you might search on 'ABC' and 'Business' but leave out 'Name' as this might take too long.
You might even check to see if the string is of a certain length, then trim and just search on the first say 5 letters.
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name by space.
You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE compares the SOUNDEX codes of two strings and returns an integer from 0 to 4, where 4 means the two sound most alike.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.