I need a spell checker with the following specification:
Very scalable.
To be able to set a maximum edit distance for the suggested words.
To get suggestions based on provided word frequencies (most common word first).
I took a look at Hunspell:
I found the parameter MAXDIFF in the man page, but it doesn't seem to work as expected. Maybe I'm using it the wrong way.
file t.aff:
MAXDIFF 1
file dico.dic:
5
rouge
vert
bleu
bleue
orange
-
NHunspell.Hunspell h = new NHunspell.Hunspell("t.aff", "dico.dic");
List<string> s = h.Suggest("bleuue");
returns the same thing whether t.aff is empty or not:
bleue
bleu
We decided to use Apache Solr, which exactly fulfills our needs.
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck
A maxdiff of one should return fewer suggestions, but can still return more than one.
Even a maxdiff of zero can give more than a single result, but it should lower the chance; it depends on the n-gram. Trying a maxdiff of zero gives fewer results, but this still doesn't guarantee you will get a single suggestion.
For your requirement to sort on the most frequent word, the Google ngram corpus is publicly available.
I'm trying to solve a problem statement using C# as programming language.
In the problem, the system generates for an input (double/decimal), say Hi, a dataset containing a number of parameters (Fi, Pi and Ti). I somehow have to filter out only those entries in the dataset which satisfy the following conditions:
Fi > Fmin, where Fmin is some constant
Pi > Pmin, where Pmin is some constant
Ti < Tmax, where Tmax is some constant
Is there an efficient algorithm I could use in such cases to zero in on an optimal set of values of Hi for which the output parameter values are well within the constraints? I also thought using Genetic Algorithms makes sense here, but somehow I'm not able to formulate the problem in a way that fits a Genetic Algorithm.
Any pointers/suggestions are truly appreciated.
You can use a LINQ query:
var result = DataSet.Where(x => x.Fi > Fmin && x.Pi > Pmin && x.Ti < Tmax);
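A self-contained sketch of that filter, with hypothetical names for the entry type and the constants:

using System;
using System.Collections.Generic;
using System.Linq;

class Entry { public double Fi, Pi, Ti; }   // hypothetical shape of one dataset row

class FilterDemo
{
    static void Main()
    {
        var dataSet = new List<Entry>
        {
            new Entry { Fi = 1.2, Pi = 3.4, Ti = 0.5 },
            new Entry { Fi = 0.1, Pi = 5.0, Ti = 0.2 },
        };
        double Fmin = 1.0, Pmin = 2.0, Tmax = 1.0;   // example constants

        // Keep only the rows that satisfy all three constraints.
        var result = dataSet.Where(x => x.Fi > Fmin && x.Pi > Pmin && x.Ti < Tmax).ToList();
        Console.WriteLine(result.Count);   // 1
    }
}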
Well, it's hard for me to guess. I don't know the properties of the function for Fi etc.
A log-barrier method could be something interesting here, or the SQP method. But the function has to be differentiable.
Otherwise simulated annealing could be interesting.
But these are just some guesses. It really depends on the problem.
I doubt that a Genetic Algorithm makes sense, seeing as you have only one input variable (Hi) that determines the outputs (Fi, Pi, Ti). The power of a Genetic Algorithm is that it blends good solutions into new solutions. If your solution is only one number, blending two good solutions will probably mean that you're finding some Hi in between (such as the average -> 0.5Hi1 + 0.5Hi2, or some other linear combination aHi1 + (1-a)Hi2 with a between 0 and 1).
I would recommend looking into Multi-start Local Search heuristics, such as link. This is a pretty solid heuristic that allows you to explore the solution space for Hi.
In their simplest form, such heuristics calculate the performance for N random values of Hi, and then search for further improvements in the area of the best performing Hi values out of those N initial values.
This sort of stuff is also pretty straightforward to code, assuming that you have a way to obtain the Fi, Ti, and Pi values from your Hi input, and that you have some way to figure out which of your solutions perform 'best' (for instance through a fitness function as mentioned in the comments).
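In case it helps, here is a minimal, self-contained sketch of that idea; the evaluate delegate, the [lo, hi] search range, and the step schedule are all hypothetical placeholders for your actual Hi-to-fitness mapping.

using System;

static class MultiStartSearch
{
    // evaluate: maps a candidate Hi to a fitness score (higher is better).
    // Returns the best Hi found over several random starts, each refined locally.
    public static double Search(Func<double, double> evaluate,
                                double lo, double hi,
                                int starts = 20, int refineSteps = 50)
    {
        var rng = new Random();
        double bestH = lo, bestScore = double.NegativeInfinity;

        for (int s = 0; s < starts; s++)
        {
            // Random starting point in [lo, hi].
            double h = lo + rng.NextDouble() * (hi - lo);
            double step = (hi - lo) / 10.0;

            for (int i = 0; i < refineSteps; i++)
            {
                // Try a small move left and right; keep whichever improves the score.
                double best = h;
                foreach (double cand in new[] { h - step, h + step })
                    if (cand >= lo && cand <= hi && evaluate(cand) > evaluate(best))
                        best = cand;

                if (best == h) step /= 2;   // no improvement: shrink the neighbourhood
                else h = best;
            }

            double score = evaluate(h);
            if (score > bestScore) { bestScore = score; bestH = h; }
        }
        return bestH;
    }
}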
I found something very confusing when sorting a text file. Different algorithms/applications produce different results; for example, compare the two strings str1=";P" and str2="-_-".
Just for your reference, here are the ASCII codes for each char in those strings:
char(';') = 59; char('P') = 80;
char('-') = 45; char('_') = 95;
So I've tried different methods to determine which string is bigger; here are my results:
In Microsoft Office Excel Sorting command:
";P" < "-_-"
C++ std::string::compare(string &str2), i.e. str1.compare(str2)
";P" > "-_-"
C# string.CompareTo(), i.e. str1.CompareTo(str2)
";P" < "-_-"
C# string.CompareOrdinal(), i.e. CompareOrdinal(w1, w2)
";P" > "-_-"
As shown, the results varied! My intuitive result matches methods 2 and 4, since ASCII(';') = 59 is larger than ASCII('-') = 45.
So I have no idea why Excel and C# string.CompareTo() give the opposite answer. Note that in C# the second comparison function is named string.CompareOrdinal(). Does this imply that the default C# string.CompareTo() is not "ordinal"?
Could anyone explain this inconsistency?
And could anyone explain why, with CultureInfo = {en-US}, it says ";P" < "-_-"? What's the underlying motivation or principle? I have also heard about double multiplication behaving differently under different CultureInfo settings. It's rather a culture shock!
std::string::compare: "the result of a character comparison depends only on its character code". It's simply ordinal.
String.CompareTo: "performs a word (case-sensitive and culture-sensitive) comparison using the current culture". So, this is not ordinal, since typical users don't expect things to be sorted like that.
String::CompareOrdinal: Per the name, "performs a case-sensitive comparison using ordinal sort rules".
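A short sketch that shows the three behaviours side by side (the expected signs match the results reported in the question):

using System;

class CompareDemo
{
    static void Main()
    {
        string str1 = ";P", str2 = "-_-";

        // Culture-sensitive (current culture): linguistic weights, punctuation weighs little.
        Console.WriteLine(str1.CompareTo(str2));                                   // < 0 under en-US
        // Ordinal: raw UTF-16 code units, ';' (59) > '-' (45).
        Console.WriteLine(string.CompareOrdinal(str1, str2));                      // > 0
        // Explicitly requesting ordinal rules gives the same sign as CompareOrdinal.
        Console.WriteLine(string.Compare(str1, str2, StringComparison.Ordinal));   // > 0
    }
}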
EDIT: CompareOptions has a hint: "For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list."
Excel 2003 (and earlier) does a sort ignoring hyphens and apostrophes, so your sort really compares ; to _, which gives the result that you have. Here's a Microsoft Support link about it. Pretty sparse, but enough to get the point across.
I need to do a text search based on user input in a relatively large list (about 37K lines with 50 to 100 chars each). The search is done after entering each character and the result is shown in a UITableView. This is my current code:
if (input.Any(x => Char.IsUpper(x)))
return _list.Where(x => x.Desc.Contains(input));
else
return _list.Where(x => x.Desc.ToLower().Contains(input));
It performs okay on a MacBook running the simulator, but is too slow on an iPad.
One interesting thing I observed is that it takes longer and longer as the input grows. For example, say I type "examin": it takes about 1 second after entering e, 2 seconds after x, 5 seconds after a, but 28 seconds after m, and so on. Why is that?
I hope there is a simple way to improve it.
Always take care to avoid memory allocations in time sensitive code.
For example, we often write code that allocates strings without realizing it, e.g.
x => x.Desc.ToLower().Contains(input)
That will allocate a string to return from ToLower. From your description this will occur many times. You can easily avoid this by using:
x => x.Desc.IndexOf(input, StringComparison.OrdinalIgnoreCase) != -1
note: just select the StringComparison.*IgnoreCase that matches your needs.
Also, LINQ is nice but it hides allocations in many cases - maybe not in your case, but measuring is key to getting things faster. In some cases using another algorithm (like the one suggested in another answer) could give you much better results (but keep the allocations in mind ;-)
UPDATE:
Mono's Contains(string) will call, after a few checks, the following:
CultureInfo.CurrentCulture.CompareInfo.IndexOf (this, value, 0, length, CompareOptions.Ordinal);
which, combined with your ToLower call, means that using StringComparison.OrdinalIgnoreCase is a perfect (i.e. functionally identical) match for your existing code (it did not do any culture-specific comparison).
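Putting that together, a minimal sketch of the rewritten filter; the Item type and Search method are hypothetical stand-ins for the question's list elements:

using System;
using System.Collections.Generic;
using System.Linq;

class Item { public string Desc { get; set; } }

static class DescSearch
{
    // One case-insensitive ordinal search per item, no lowered copy of Desc allocated.
    public static IEnumerable<Item> Search(IEnumerable<Item> list, string input)
    {
        return list.Where(x => x.Desc.IndexOf(input, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}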
Generally I've found that contains operations are not preferable for search, so I'd recommend you take a look at the Mastering Core Data session (login required) video on the WWDC 2010 page (around the 10 min mark). Apple knows that 'contains' is terrible with SQLite on mobile devices, so you can essentially do what Apple does to sort of "hack" FTS on the version of SQLite they ship.
Essentially they do prefix matching by creating a table like:
[[ pk_id || input || normalized_input ]]
Where input and normalized_input are both indexed explicitly. Then they prefix match against the normalized value. So for instance, if a user is searching for 'snuggles' and so far they've typed 'snu', the prefix-matching query would look like:
normalized_input >= 'snu' and normalized_input < 'snv'
Not sure if this translates given your use case, but I thought it was worth mentioning. Hope it's helpful!
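Not Apple's actual code, but as a rough sketch, the exclusive upper bound for the range can be built by incrementing the last character of the typed prefix:

using System;

static class PrefixRange
{
    // Build a prefix-range WHERE clause for a normalized column: 'snu' matches rows
    // where normalized_input >= 'snu' AND normalized_input < 'snv'.
    // In real code pass the bounds as SQL parameters instead of interpolating them,
    // and handle the edge case where the last character is already the maximum value.
    public static string Clause(string prefix)
    {
        string upper = prefix.Substring(0, prefix.Length - 1)
                     + (char)(prefix[prefix.Length - 1] + 1);
        return $"normalized_input >= '{prefix}' AND normalized_input < '{upper}'";
    }
}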
You need to use a trie. See http://en.wikipedia.org/wiki/Trie
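For illustration, a minimal prefix-trie sketch (note that a plain trie accelerates prefix matching; for arbitrary substring matching you would need something like a suffix tree or suffix array):

using System.Collections.Generic;

// Minimal prefix trie: Insert is O(length of key), and all keys starting with a
// given prefix can be collected without scanning the whole 37K-entry list.
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<string> Entries = new List<string>();   // full strings ending at this node
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string key)
    {
        var node = _root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.Entries.Add(key);
    }

    public IEnumerable<string> StartingWith(string prefix)
    {
        var node = _root;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                yield break;                      // no key has this prefix
        foreach (var s in Collect(node))
            yield return s;
    }

    private static IEnumerable<string> Collect(TrieNode node)
    {
        foreach (var s in node.Entries) yield return s;
        foreach (var child in node.Children.Values)
            foreach (var s in Collect(child)) yield return s;
    }
}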
I know this question has been asked a lot of times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenges are probably the varying company-name endings and abbreviated/short names.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
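If you prefer not to pull in a dependency, here is a minimal, unoptimized Jaro-Winkler sketch using the standard constants (scaling factor 0.1, prefix capped at 4 characters):

using System;

static class JaroWinkler
{
    // Returns a similarity score in [0, 1]; 1 means identical.
    public static double Similarity(string s1, string s2)
    {
        if (s1.Length == 0 && s2.Length == 0) return 1.0;
        if (s1.Length == 0 || s2.Length == 0) return 0.0;

        int window = Math.Max(s1.Length, s2.Length) / 2 - 1;
        if (window < 0) window = 0;

        bool[] matched1 = new bool[s1.Length];
        bool[] matched2 = new bool[s2.Length];

        // Count characters that match within the allowed window.
        int matches = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            int start = Math.Max(0, i - window);
            int end = Math.Min(s2.Length - 1, i + window);
            for (int j = start; j <= end; j++)
            {
                if (matched2[j] || s1[i] != s2[j]) continue;
                matched1[i] = matched2[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Count transpositions among the matched characters.
        int k = 0, transpositions = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1[i] != s2[k]) transpositions++;
            k++;
        }
        transpositions /= 2;

        double m = matches;
        double jaro = (m / s1.Length + m / s2.Length + (m - transpositions) / m) / 3.0;

        // Winkler boost for a common prefix (up to 4 characters).
        int prefix = 0;
        while (prefix < Math.Min(4, Math.Min(s1.Length, s2.Length)) && s1[prefix] == s2[prefix])
            prefix++;

        return jaro + prefix * 0.1 * (1.0 - jaro);
    }
}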
If you want to do a bit more sophisticated matching, you can try some custom normalization of word forms commonly occurring in company names, such as ltd/limited, inc/incorporated, corp/corporation, to account for case insensitivity, abbreviations, etc. This way, if you compute
distance (normalize("foo corp."),
normalize("FOO CORPORATION") )
you should get the result to be 0 rather than 14 (which is what you would get if you computed levenshtein edit-distance).
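A hedged sketch of such a normalization step; the abbreviation map below is illustrative, not exhaustive:

using System;
using System.Collections.Generic;
using System.Linq;

static class CompanyNames
{
    // Illustrative map of common company-name suffixes to a canonical form.
    private static readonly Dictionary<string, string> Canonical = new Dictionary<string, string>
    {
        { "ltd", "limited" }, { "inc", "incorporated" },
        { "corp", "corporation" }, { "pty", "proprietary" },
    };

    public static string Normalize(string name)
    {
        // Lower-case, strip punctuation, expand known abbreviations, collapse whitespace.
        var words = new string(name.ToLowerInvariant()
                                   .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
                                   .ToArray())
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Canonical.TryGetValue(w, out var full) ? full : w);
        return string.Join(" ", words);
    }
}

// Normalize("foo corp.") and Normalize("FOO CORPORATION") both yield "foo corporation",
// so their edit distance is 0.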
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest approach: you can pre-compute the normalized form on each side and then do a straight equality check, which will be a lot faster than cross-multiplying and calculating edit distances.
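A minimal sketch of that pre-compute-and-compare approach:

using System.Linq;

static class NameKey
{
    // Keep only letters and digits, lower-cased; pre-compute this once per name.
    public static string Of(string name) =>
        new string(name.Where(char.IsLetterOrDigit).ToArray()).ToLowerInvariant();
}

// NameKey.Of("companyA pty ltd") == NameKey.Of("companyA pty. ltd.")   -> true
// NameKey.Of("WES Engineering")  == NameKey.Of("W.E.S. Engineering")   -> true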
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on a really large-scale system with name matching requirements similar to what you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, then there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple googling would give us all the details.
You can implement all of them in C#.
The irony is that they work when you try to match two given input strings. That is fine in theory and to demonstrate the way fuzzy or approximate string matching works.
However, the grossly understated point is how we use the same in production settings. Not everybody I know of who was scouting for an approximate string matching algorithm knew how to solve that in a production environment.
I talked about Lucene there, which is specific to Java, but there is Lucene for .NET also.
https://lucenenet.apache.org/
I know there are quite a few questions out there on generating combinations of elements, but I think this one has a certain twist that makes it worth a new question:
For a pet project of mine I have to pre-compute a lot of state to improve the runtime behavior of the application later. One of the steps I struggle with is this:
Given N tuples of two integers (let's call them points from here on, although they aren't points in my use case; they are roughly X/Y related, though), I need to compute all valid combinations for a given rule.
The rule might be something like
"Every point included excludes every other point with the same X coordinate"
"Every point included excludes every other point with an odd X coordinate"
I hope and expect that this fact leads to an improvement in the selection process, but my math skills are just being resurrected as I type and I'm unable to come up with an elegant algorithm.
The set of points (N) starts small but soon outgrows 64 (which rules out the "use a long as a bitmask" solutions).
I'm doing this in C#, but solutions in any language should be fine if they explain the underlying idea.
Thanks.
Update in response to Vlad's answer:
Maybe my idea to generalize the question was a bad one. My rules above were invented on the fly and are just placeholders. One realistic rule would look like this:
"Every point included excludes every other point in the triangle above the chosen point"
By that rule and by choosing (2,1) I'd exclude
(2,2) - directly above
(1,3) (2,3) (3,3) - next line
and so on
So the rules are fixed, not general. They are unfortunately more complex than the X/Y samples I initially gave.
How about "the x coordinate of every point included is the exact sum of some subset of the y coordinates of the other included points". If you can come up with a fast algorithm for that simply-stated constraint problem then you will become very famous indeed.
My point being that the problem as stated is so vague as to admit NP-complete or NP-hard problems. Constraint optimization problems are incredibly hard; if you cannot put extremely tight bounds on the problem then it very rapidly becomes not analyzable by machines in polynomial time.
For some special rule types your task seems to be simple. For example, for your example rule #1 you need to choose a subset of all possible values of X, and then for each value from the subset assign an arbitrary Y.
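For illustration, a sketch of that enumeration for rule #1, assuming the points are represented as (int X, int Y) value tuples:

using System.Collections.Generic;
using System.Linq;

static class Rule1Combinations
{
    // Rule #1 ("a chosen point excludes every other point with the same X"):
    // group the points by X, then for every X group either skip it or pick exactly
    // one of its points. The empty combination is included.
    public static IEnumerable<List<(int X, int Y)>> Enumerate(IEnumerable<(int X, int Y)> points)
    {
        var groups = points.GroupBy(p => p.X).Select(g => g.ToList()).ToList();

        IEnumerable<List<(int X, int Y)>> combos = new[] { new List<(int X, int Y)>() };
        foreach (var group in groups)
        {
            combos = combos.SelectMany(c =>
                new[] { c }                                                      // skip this X entirely
                .Concat(group.Select(p => new List<(int X, int Y)>(c) { p })))   // or take one of its points
                .ToList();
        }
        return combos;
    }
}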
For generic rules I doubt that it's possible to build an efficient algorithm without any AI.
My understanding of the problem is: given a method bool property( Point x ) const, find all points in the set for which property() is true. Is that reasonable?
The brute-force approach is to run all the points through property() and store the ones that return true. The time complexity of this is O( N ), where (a) N is the total number of points and (b) the property() method is O( 1 ). I guess you are looking for improvements over O( N ). Is that right?
For certain kinds of properties it is possible to improve on O( N ), provided a suitable data structure is used to store the points and suitable pre-computation (e.g. sorting) is done. However, this may not be true for an arbitrary property.
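For instance, if the property is a simple threshold on one coordinate, sorting the points once lets each query run in O( log N + k ) instead of O( N ); a sketch under that assumption:

using System;
using System.Collections.Generic;
using System.Linq;

static class ThresholdQuery
{
    // Pre-computation: sort the points by X once (O(N log N)).
    public static (int X, int Y)[] SortByX(IEnumerable<(int X, int Y)> points) =>
        points.OrderBy(p => p.X).ToArray();

    // Query: all points with X > threshold, via binary search for the first match.
    public static ArraySegment<(int X, int Y)> WithXGreaterThan((int X, int Y)[] sorted, int threshold)
    {
        int lo = 0, hi = sorted.Length;          // find the first index with X > threshold
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (sorted[mid].X > threshold) hi = mid; else lo = mid + 1;
        }
        return new ArraySegment<(int X, int Y)>(sorted, lo, sorted.Length - lo);
    }
}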