MonoTouch on iPad: How to make text search faster? - c#

I need to do text search based on user input in a relatively large list (about 37K lines, 50 to 100 chars per line). The search is done after each character is entered and the result is shown in a UITableView. This is my current code:
if (input.Any(x => Char.IsUpper(x)))
    return _list.Where(x => x.Desc.Contains(input));
else
    return _list.Where(x => x.Desc.ToLower().Contains(input));
It performs okay in the simulator on a MacBook, but is too slow on the iPad.
One interesting thing I observed is that it takes longer and longer as the input grows. For example, take "examin" as input: it takes about 1 second after entering e, 2 seconds after x, 5 seconds after a, but 28 seconds after m, and so on. Why is that?
I hope there is a simple way to improve it.

Always take care to avoid memory allocations in time-sensitive code.
For example, we often write code that allocates strings without realizing it, e.g.
x => x.Desc.ToLower().Contains(input)
That will allocate a string to return from ToLower. From your description this will occur many times. You can easily avoid it by using:
x => x.Desc.IndexOf(input, StringComparison.OrdinalIgnoreCase) != -1
Note: just select the StringComparison.*IgnoreCase value that matches your needs.
Also, LINQ is nice, but it hides allocations in many cases - maybe not in your case, but measuring is key to getting things faster. Here, using another algorithm (like the one suggested in another answer) could give you much better results (but keep the allocations in mind ;-)
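As an illustration of both points, here is a minimal sketch of the same filter without LINQ (illustrative only; it assumes _list is a List<Item> where Item has a string Desc property - the Item name is not from the question):
// Same ordinal, case-insensitive match, but with no ToLower allocations
// and no LINQ iterator objects; requires System and System.Collections.Generic.
var matches = new List<Item>();
foreach (var item in _list)
{
    if (item.Desc.IndexOf(input, StringComparison.OrdinalIgnoreCase) != -1)
        matches.Add(item);
}
return matches;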
UPDATE:
Mono's Contains(string) will call, after a few checks, the following:
CultureInfo.CurrentCulture.CompareInfo.IndexOf (this, value, 0, length, CompareOptions.Ordinal);
which, given your ToLower requirement, means that StringComparison.OrdinalIgnoreCase is a perfect (i.e. identical) match for your existing code (it did not do any culture-specific comparison anyway).

Generally I've found that 'contains' operations are not preferable for search, so I'd recommend you take a look at the Mastering Core Data session video (login required) on the WWDC 2010 page, around the 10-minute mark. Apple knows that 'contains' is terrible with SQLite on mobile devices, so you can essentially do what Apple does to sort of "hack" FTS on the version of SQLite they ship.
Essentially they do prefix matching by creating a table like:
[[ pk_id || input || normalized_input ]]
Where input and normalized_input are both indexed explicitly. Then they prefix match against the normalized value. So for instance if a user is searching for 'snuggles' and so far they've typed in 'snu' the prefix matching query would look like:
normalized_input >= 'snu' and normalized_input < 'snv'
Not sure if this translates given your use case, but I thought it was worth mentioning. Hope it's helpful!
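A small sketch of how that range query could be assembled in C# (the table and column names simply follow the example above and are illustrative; in real code the values should be bound as SQL parameters rather than concatenated):
// Assumes a non-empty prefix.
static string BuildPrefixQuery(string normalizedPrefix)
{
    // Compute the exclusive upper bound by bumping the last character: 'snu' -> 'snv'.
    char last = normalizedPrefix[normalizedPrefix.Length - 1];
    string upper = normalizedPrefix.Substring(0, normalizedPrefix.Length - 1) + (char)(last + 1);
    return "SELECT pk_id, input FROM search_index " +
           "WHERE normalized_input >= '" + normalizedPrefix + "' " +
           "AND normalized_input < '" + upper + "'";
}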

You need to use a trie. See http://en.wikipedia.org/wiki/Trie
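For reference, a minimal trie sketch in C# (illustrative, not from the answer; requires System.Collections.Generic). A plain trie gives fast prefix lookups, so to emulate a 'Contains' search you would insert each word (or each suffix) of every description rather than the whole string:
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<string> Entries = new List<string>(); // items reachable through this node
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string key, string value)
    {
        var node = _root;
        foreach (char c in key.ToLowerInvariant())
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
            node.Entries.Add(value); // every node on the path records the match
        }
    }

    public List<string> FindByPrefix(string prefix)
    {
        var node = _root;
        foreach (char c in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out node))
                return new List<string>(); // nothing starts with this prefix
        }
        return node.Entries;
    }
}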

Related

String likeness algorithms

I have two strings (they're going to be descriptions in a simple database eventually), let's say they're
String A: "Apple orange coconut lime jimmy buffet"
String B: "Car
bicycle skateboard"
What I'm looking for is this. I want a function that will have the input "cocnut", and have the output be "String A"
We could have differences in capitalization, and the spelling won't always be spot on. The goal is a 'quick and dirty' search if you will.
Are there any .NET (or third-party) 'likeness algorithms' for strings you could recommend, so I can check that the input is a 'pretty close fragment' of an entry and return it? My database is going to have like 50 entries, tops.
What you’re searching for is known as the edit distance between two strings. There exist plenty of implementations – here’s one from Stack Overflow itself.
Since you’re searching for only part of a string what you want is a locally optimal match rather than a global match as computed by this method.
This is known as the local alignment problem and once again it’s easily solvable by an almost identical algorithm – the only thing that changes is the initialisation (we don’t penalise whatever comes before the search string) and the selection of the optimum value (we don’t penalise whatever comes after the search string).
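As a rough sketch of that idea (the LocalEditDistance name is mine, and the costs are the usual unit edit costs): keep the standard edit-distance recurrence, but initialise row 0 to zeros (the match may start anywhere in the text) and take the minimum of the final row (the match may end anywhere):
// Smallest edit distance between `pattern` and any substring of `text`,
// compared case-insensitively; requires System and System.Linq.
static int LocalEditDistance(string pattern, string text)
{
    int m = pattern.Length, n = text.Length;
    var prev = new int[n + 1]; // row for the previous pattern prefix; all zeros initially
    var curr = new int[n + 1];

    for (int i = 1; i <= m; i++)
    {
        curr[0] = i; // deleting i pattern characters
        for (int j = 1; j <= n; j++)
        {
            int cost = char.ToLowerInvariant(pattern[i - 1]) == char.ToLowerInvariant(text[j - 1]) ? 0 : 1;
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1,  // insertion
                                        prev[j] + 1),     // deletion
                               prev[j - 1] + cost);       // substitution or match
        }
        var tmp = prev; prev = curr; curr = tmp;
    }
    return prev.Min(); // the best match may end at any position in the text
}
For the example above, LocalEditDistance("cocnut", "Apple orange coconut lime jimmy buffet") is 1, while the distance to String B is larger, so String A wins.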

String parsing and matching algorithm

I am solving the following problem:
Suppose I have a list of software packages whose names might look like this (the only known thing is that these names are formed as SOMETHING + VERSION, meaning that the version always comes after the name):
Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED
Efficient.Exclusive.Zip.Archiver.123.01
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip.Archiver14.06
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Custom.Zip.Archiver1
Now, I need to parse this list and select only the latest version of each package. For this example the expected result would be:
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
The current approach I use can be described in the following way:
Split the initial strings into groups by their starting letter,
ignoring spaces, case and special symbols.
(`E`, `Z`, `C` for the example list above)
Foreach element {
    Apply the regular expression (or a set of regular expressions)
    which tries to deduce the version from the string and performs
    the following conversion: `STRING -> (VERSION, STRING_BEFORE_VERSION)`

    // Example for this step:
    // 'Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED' ->
    // (122.24, Efficient.Exclusive.Zip.Archiver-PROPER)

    Search through the corresponding group (in this example - the 'E' group)
    and find every other string that starts with 'STRING_BEFORE_VERSION' or
    with a significant part of it. This comparison is performed in ignore-case
    and ignore-special-symbols mode.

    // The matches for this step:
    // Efficient.Exclusive.Zip.Archiver-PROPER, {122.24}
    // Efficient.Exclusive.Zip.Archiver, {123.01}
    // Efficient-Exclusive.Zip.Archiver, {126.24, 2011}
    // The last one will get picked, because the year is ignored.

    Get the possible version from each match, ***pick the latest, yield that match.***
    Remove every possible match (including the initial element) from the list.
}
This algorithm should (I assume) run in something like O(N * V + N lg N * M), where M stands for the average string-matching time and V stands for the version-regexp running time.
However, I suspect there is a better solution (there always is!), maybe specific data structure or better matching approach.
If you can suggest something or make some notes on the current approach, please do not hesitate to do this.
How about this? (Pseudo-Code)
Dictionary<string, string> latestPackages =
    new Dictionary<string, string>(packageNameComparer);

foreach element
{
    (package, version) = applyRegex(element);
    if (!latestPackages.ContainsKey(package) || isNewer)
    {
        latestPackages[package] = version;
    }
}
// print out latestPackages
Dictionary operations are O(1), so you have O(n) total runtime. No pre-grouping necessary and instead of storing all matches, you only store the one which is currently the newest.
Dictionary has a constructor which accepts an IEqualityComparer object. There you can implement your own semantics of equality between package names. Keep in mind, however, that you need to implement a GetHashCode method in this IEqualityComparer which must return the same value for objects that you consider equal. To reproduce the grouping above, you could return a hash code based on the first character of the string, which would reproduce the grouping you had, just inside your dictionary. However, you will get more performance from a smarter hash code with fewer collisions - maybe one using more characters, if that still yields the same groups.
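A possible sketch of such a comparer (the normalization rule - lowercase and strip non-alphanumerics - is an assumption; pick whatever matches your package-name semantics; requires System.Collections.Generic and System.Text):
sealed class PackageNameComparer : IEqualityComparer<string>
{
    private static string Normalize(string name)
    {
        var sb = new StringBuilder(name.Length);
        foreach (char c in name)
            if (char.IsLetterOrDigit(c))
                sb.Append(char.ToLowerInvariant(c));
        return sb.ToString();
    }

    public bool Equals(string x, string y)
    {
        return Normalize(x) == Normalize(y);
    }

    // Equal names must produce equal hash codes, so hash the normalized form.
    public int GetHashCode(string obj)
    {
        return Normalize(obj).GetHashCode();
    }
}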
I think you could probably use a DAWG (http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) here to good effect. I think you could simply cycle down each node till you hit one that has only 1 "child". On this node, you'll have common prefixes "up" the tree and version strings below. From there, parse the version strings by removing everything that isn't a digit or a period, splitting the string by the period and converting each element of the array to an integer. This should give you an int array for each version string. Identify the highest version, record it and travel to the next node with only 1 child.
EDIT: Populating a large DAWG is a pretty expensive operation but lookup is really fast.
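A small sketch of the version-parsing step described there (illustrative only; unlike the question's own step, it makes no attempt to ignore a year such as 2011): strip everything except digits and periods, split on '.', and compare the resulting arrays component by component:
// Requires System and System.Linq.
static int[] ParseVersion(string name)
{
    var digitsAndDots = new string(name.Where(c => char.IsDigit(c) || c == '.').ToArray());
    return digitsAndDots.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries)
                        .Select(int.Parse)
                        .ToArray();
}

// Component-by-component comparison; missing components count as 0.
static int CompareVersions(int[] a, int[] b)
{
    for (int i = 0; i < Math.Max(a.Length, b.Length); i++)
    {
        int ai = i < a.Length ? a[i] : 0;
        int bi = i < b.Length ? b[i] : 0;
        if (ai != bi) return ai.CompareTo(bi);
    }
    return 0;
}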

Linq keyword extraction - limit extraction scope

With regards to this solution.
Is there a way to limit the number of keywords taken into consideration? For example, I'd like only the first 1000 words of the text to be counted. There's a "Take" method in LINQ, but it serves a different purpose - all words would be counted, and N records returned. What's the right way to do this correctly?
Simply apply Take earlier - straight after the call to Split:
var results = src.Split()
                 .Take(1000)
                 .GroupBy(...) // etc
Well, strictly speaking LINQ is not necessarily going to read everything; Take will stop as soon as it can. The problem is that in the related question you look at Count, and it is hard to get a Count without consuming all the data. Likewise, string.Split will look at everything.
But if you wrote a lazy non-buffering Split function (using yield return) and you wanted the first 1000 unique words, then
var words = LazySplit(text).Distinct().Take(1000);
would work
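A possible implementation of the LazySplit mentioned above (the name and the whitespace-splitting rule are assumptions; requires System.Collections.Generic): it yields words one at a time instead of building the whole array up front, so Take(1000) stops the scan early:
static IEnumerable<string> LazySplit(string text)
{
    int start = 0;
    for (int i = 0; i <= text.Length; i++)
    {
        // A word ends at whitespace or at the end of the text.
        if (i == text.Length || char.IsWhiteSpace(text[i]))
        {
            if (i > start)
                yield return text.Substring(start, i - start);
            start = i + 1;
        }
    }
}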
Enumerable.Take does in fact stream results out; it doesn't buffer up its source entirely and then return only the first N. Looking at your original solution, though, the problem is that the input to the place where you would want to apply Take is String.Split. Unfortunately, that method doesn't use any sort of deferred execution; it eagerly creates an array of all the 'splits' and then returns it.
Consequently, the technique to get a streaming sequence of words from some text would be something like:
var words = src.StreamingSplit() // you'll have to implement that
               .Take(1000);
However, I do note that the rest of your query is:
    ...
    .GroupBy(str => str)   // group words by the value
    .Select(g => new
    {
        str = g.Key,       // the value
        count = g.Count()  // the count of that value
    });
Do note that GroupBy is a buffering operation - you can expect that all of the 1,000 words from its source will end up getting stored somewhere in the process of the groups being piped out.
As I see it, the options are:
1. If you don't mind going through all of the text for splitting purposes, then src.Split().Take(1000) is fine. The downsides are wasted time (to continue splitting after it is no longer necessary) and wasted space (to store all of the words in an array even though only the first 1,000 will be needed). However, the rest of the query will not operate on any more words than necessary.
2. If you can't afford to do (1) because of time / memory constraints, go with src.StreamingSplit().Take(1000) or equivalent. In this case, none of the original text will be processed after 1,000 words have been found.
Do note that those 1,000 words themselves will end up getting buffered by the GroupBy clause in both cases.

Any way to make this LINQ faster?

I have a LINQ expression that's slowing down my application.
I'm drawing a control, but to do this, I need to know the max width of the text that will appear in my column.
The way I'm doing that is this:
return Items.Max(w => TextRenderer.MeasureText(
    (w.RenatlUnit == null) ? "" : w.RenatlUnit.UnitNumber,
    this.Font).Width) + 2;
However, this iterates over ~1000 Items, and takes around 20% of the CPU time that is used in my drawing method. To make it worse, there are two other columns that this must be done with, so this LINQ statement on all the items/columns takes ~75-85% of the CPU time.
TextRenderer is from System.Windows.Forms package, and because I'm not using a monospaced font, MeasureText is needed to figure out the pixel width of a string.
How might I make this faster?
I don't believe that your problem lies in the speed of LINQ; it lies in the fact that you're calling MeasureText over 1000 times. I would imagine that taking your logic out of a LINQ query and putting it into an ordinary foreach loop would yield similar run times.
A better idea is probably to employ a little bit of sanity checking around what you're doing. If you go with reasonable inputs (and disregard the possibility of linebreaks), then you really only need to measure the text of strings that are, say, within 10% or so of the absolute longest (in terms of number of characters) string, then use the maximum value. In other words, there's no point in measuring the string "foo" if the largest value is "paleontology". There's no font that has widths THAT variable.
It's the MeasureText method that takes time, so the only way to increase the speed is to do less work.
You can cache the results of the calls to MeasureText in a dictionary; that way you don't have to remeasure strings that have already been measured.
You can calculate the values once and keep along with the data to display. Whenever you change the data, you recalculate the values. That way you don't have to measure the strings every time the control is drawn.
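A small sketch of that caching idea (the field and method names are mine, and it assumes the code lives inside the control so this.Font is available; requires System.Collections.Generic and System.Windows.Forms):
private readonly Dictionary<string, int> _widthCache = new Dictionary<string, int>();

private int MeasureCached(string text)
{
    int width;
    if (!_widthCache.TryGetValue(text, out width))
    {
        // Measure each distinct string only once and reuse the width on later draws.
        width = TextRenderer.MeasureText(text, this.Font).Width;
        _widthCache[text] = width;
    }
    return width;
}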
Step 0: Profile. Assuming you find that most of the execution time is indeed in MeasureText, then you can try the following to reduce the number of calls:
Compute the lengths of all individual characters. Since it sounds like you're rendering a number, this should be a small set.
Estimate the length of each string as numstr.Select(digitChar => digitLengthDict[digitChar]).Sum()
Take the strings with the top N lengths, and measure only those.
To avoid even most of the cost of the lookup+sum, also filter to include only those strings within 90% of the maximum string-length, as suggested.
e.g. Something like...
// somewhere else, during initialization - do only once.
var digitLengthDict = possibleChars.ToDictionary(
    c => c,
    c => TextRenderer.MeasureText(c.ToString(), this.Font).Width);
//...
var relevantStringArray = Items
    .Where(w => w.RenatlUnit != null)
    .Select(w => w.RenatlUnit.UnitNumber)
    .ToArray();
double minStrLen = 0.9 * relevantStringArray.Max(str => str.Length);
return (
    from numstr in relevantStringArray
    where numstr.Length >= minStrLen
    orderby numstr.Select(digitChar => digitLengthDict[digitChar]).Sum() descending
    select TextRenderer.MeasureText(numstr, this.Font).Width
).Take(10).Max() + 2;
If we knew more about the distribution of the strings, that would help.
Also, MeasureText isn't magic; it's quite possible you can duplicate its functionality entirely quite easily for a limited set of inputs. For instance, it would not surprise me to learn that the measured length of a string is precisely equal to the sum of the lengths of all characters in the string, minus the kerning overhang of all character bigrams in the string. If your string then consists of, say, 0-9, +, -, ,, ., and a terminator symbol, then a lookup table of 14 character widths and 15*15-1 kerning corrections might be enough to precisely emulate MeasureText at a far greater speed, and without much complexity.
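A rough sketch of that emulation (whether the formula really matches MeasureText is exactly the open question above; the lookup tables are assumed to be precomputed from real measurements; requires System.Collections.Generic):
static int EstimateWidth(string s,
                         Dictionary<char, int> charWidths,
                         Dictionary<string, int> kerningCorrections)
{
    int width = 0;
    for (int i = 0; i < s.Length; i++)
    {
        width += charWidths[s[i]];
        if (i > 0)
        {
            // Subtract the kerning overhang for each adjacent character pair.
            int correction;
            if (kerningCorrections.TryGetValue(s.Substring(i - 1, 2), out correction))
                width -= correction;
        }
    }
    return width;
}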
Finally, the best solution is to not solve the problem at all - perhaps you can rearchitect the application to not require such a precise number - if a simpler estimate were to suffice, you could avoid MeasureText almost completely.
Unfortunately, it doesn't look like LINQ is your problem. If you ran a for loop and did this same calculation, the amount of time would be the same order of magnitude.
Have you considered running this calculation on multiple threads? It would work nicely with Parallel LINQ.
Edit: It seems Parallel LINQ won't work because MeasureText is a GDI function and will simply be marshaled back to the UI thread (thanks @Adam Robinson for correcting me).
My guess is the issue is not the LINQ expression but calling MeasureText several thousand times.
I think you could work around the non-monospaced font issue by breaking the problem into 4 parts:
1. Find the digit with the biggest render size.
2. Find the apartment unit number with the most digits.
3. Create a string of the length determined in #2, filled with the digit determined in #1.
4. Pass the string created in #3 to MeasureText and use that as your basis.
This won't yield a perfect solution but it will ensure that you reserve at least enough space for your item and avoids the pitfall of calling MeasureText far too many times.
If you can't figure out how to make MeasureText faster, you could precalculate the width of all the characters in your font size and style and estimate the width of a string like that, although kerning of character pairs would suggest that it would probably be only an estimate and not precise.
You might want to consider as an approximation taking the length of the longest string and then finding the width of a string of that length of 0's (or whatever the widest digit is, I can't remember). That should be a much faster method, but it would only be an approximation and probably longer than necessary.
var longest = Items.Max(w => (w.RenatlUnit == null || w.RenatlUnit.UnitNumber == null)
    ? 0
    : w.RenatlUnit.UnitNumber.Length);
if (longest == 0)
{
    return 2;
}
return TextRenderer.MeasureText(new String('0', longest), this.Font).Width + 2;

Rule Evaluation Systems and "not exact" matches (ej: x < 3000)

I am designing a rule evaluation system which needs to handle a fact database and certain rules over that database. We currently have a modified version of RETE that works partially right, with some drawbacks.
The problem is that the rules aren't limited to exact matches; they must also support inequalities (as in less than) and other kinds of fuzzy calculations.
For example, suppose you have these facts:
(Salary John 58000)
(Salary Sara 78000)
(Employee John)
(Boss Sara)
(Married John Sara)
(Works John Stackoverflow)
you might have a rule that says:
(Salary ?w < 60000) /\ (Married ?w) /\ (Works ?w Stackoverflow) ==> Whatever
Obviously the result will be triggering the rule with a ?w value of "John", but the way we're doing that now is by looping through each element in the fact base that matches the beginning of the first expression (Salary X X), then making the comparison and storing the result in the fact base itself. For example, after the first pass you'll have the following item added to the fact base:
(Salary John 58000 < 60000)
and once that is done you perform the joins in the usual RETE way. That way it takes up a lot of space in the fact base, especially because a rule can refer to any number, so you keep those "calculated" facts around for as long as the rule is active.
On the other hand you can apply several rules with the first expression and you can keep using the standard matching algorithm to trigger the rules.
Does anyone know of any patterns, references or methods that handle this kind of behavior? The usual LEAPS, TREAT and RETE algorithms only handle (as far as I know) "exact" matching.
By the way, this is C# .NET.
CLIPS has supported conditional elements for as long as I've been aware of it - at least 15 years. Check out the basic programming guide for CLIPS and this CLIPS tutorial for examples. You can look at (or modify) the clips source for free.
CLIPS uses prefix notation, so your example conditional might look like:
(defrule fat-boy
   (person-data (name ?name) (weight ?weight))
   (test (> ?weight 100))
   =>
   (printout t ?name " weighs " ?weight " kg! " crlf)
)
As far as I understand the problem, all the fuzzy rules divide integer or floating point value ranges up into a limited number of subranges. For instance, if a salary is compared to 58000, 60000, 78000 values, you have 4 ranges: <58000, 58000-60000, 60000-78000, >78000.
If that is the case, maybe you can redefine your variables to be integers that are either 0,1,2,3, and thereby convert your inequality rules to equality rules.
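A tiny sketch of that conversion (the thresholds are the ones from this thread's example; the bucket numbering is an assumption):
// Maps a salary onto a range index so inequality tests become equality tests:
// 0: < 58000, 1: 58000-60000, 2: 60000-78000, 3: >= 78000.
static int SalaryBucket(double salary)
{
    var thresholds = new[] { 58000.0, 60000.0, 78000.0 };
    for (int i = 0; i < thresholds.Length; i++)
        if (salary < thresholds[i])
            return i;
    return thresholds.Length;
}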
