How to achieve proper Fuzzy search in Lucene.Net? - c#

I have implemented fuzzy search in Lucene.Net. If I search for Feature, only Feature, Featured, and featuring should come back. But results are returned on loose text matching instead: venture, culture, and so on, because "ture" is matched by the fuzzy search. My code is:
Query query = new FuzzyQuery(new Term("ContentText", searchString));
finalQuery.Add(query, BooleanClause.Occur.SHOULD);

You should take a look at the process called "Lemmatisation" (http://en.wikipedia.org/wiki/Lemmatisation). You want to build your index on the base form of each word (called the lemma), and do the same with your query.
Lucene supports the English language out of the box, so there should not be any problem with that.
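Stemming gets you most of the way there in practice. A minimal sketch, assuming Lucene.Net 3.x with the Snowball contrib analyzer (the field name comes from the question; everything else is illustrative):
// Use the same stemming analyzer at index time and at query time, so both
// sides agree on the reduced form ("featur" for Feature/Featured/featuring).
var analyzer = new Lucene.Net.Analysis.Snowball.SnowballAnalyzer(
    Lucene.Net.Util.Version.LUCENE_30, "English");

var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
// ... add documents and commit as usual ...

var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "ContentText", analyzer);
Query query = parser.Parse(searchString);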

You can pass a minimum-similarity value, and filter on the resulting scores, to improve the quality of the results. Another thing I have done in specific scenarios is to use multiple different query types, combine the results (filtering out low scores), and return one combined list. This works really well for things like an engine that dynamically assumes "did you mean..." results up front rather than asking you "did you mean?".
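For example, FuzzyQuery has a constructor overload that takes a minimum similarity; a sketch (0.7f is an illustrative threshold to tune against your data):
// Raise the similarity floor so distant matches like "venture" are cut off.
Query query = new FuzzyQuery(new Term("ContentText", searchString), 0.7f);
finalQuery.Add(query, BooleanClause.Occur.SHOULD);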

You probably need to set Parser.FuzzyMinSim
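A sketch, assuming Lucene.Net 3.x, where this is the FuzzyMinSim property on QueryParser (older versions expose a SetFuzzyMinSim method instead):
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "ContentText", analyzer);
parser.FuzzyMinSim = 0.7f; // applies to fuzzy terms parsed from user input, e.g. "feature~"
Query query = parser.Parse(searchString);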

What is the fastest way to search a List<T> across multiple properties?

I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
                            r.year == record.year &&
                            r.period == record.period).FirstOrDefault();
cost is a local List type. If I were doing a search on only one field, I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O, which can build indexes; however, it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (.StartsWith is much more important, since a lot of the original queries take advantage of the fact that a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections, which supports dictionaries with non-unique keys.
I tested ToLookup(), which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the trie structure for the other lookups.
Thanks.
Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach (var group in lookup)
{
    // do something with the items in the group.
}
Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.
Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
public int year;
public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar, where T is the element type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
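A sketch of the packaging step, assuming a hypothetical CostRecord class for the list's element type, with the fields used above:
// Build the dictionary once; each query then scans only the rows that
// share the record's year and period. CostRecord is a hypothetical name.
var byKey = new Dictionary<Key, List<CostRecord>>();
foreach (var r in cost)
{
    var key = new Key { year = r.year, period = r.period };
    List<CostRecord> bucket;
    if (!byKey.TryGetValue(key, out bucket))
        byKey[key] = bucket = new List<CostRecord>();
    bucket.Add(r);
}

string prefix = record.form.TrimEnd(); // hoisted out of the loop, see below
List<CostRecord> rows;
var match = byKey.TryGetValue(new Key { year = record.year, period = record.period }, out rows)
    ? rows.FirstOrDefault(r => r.ryp.StartsWith(prefix))
    : null;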
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
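For illustration, a minimal trie sketch (not production code): each node keeps the items whose key passes through it, so a prefix lookup is a single walk down the tree.
class TrieNode<T>
{
    public Dictionary<char, TrieNode<T>> Children = new Dictionary<char, TrieNode<T>>();
    public List<T> Items = new List<T>();
}

class Trie<T>
{
    private readonly TrieNode<T> root = new TrieNode<T>();

    public void Add(string key, T item)
    {
        var node = root;
        foreach (char c in key)
        {
            TrieNode<T> child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new TrieNode<T>();
            node = child;
            node.Items.Add(item); // item is now reachable from every prefix of key
        }
    }

    public IList<T> StartingWith(string prefix)
    {
        var node = root;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return new List<T>(); // no key has this prefix
        return node.Items;
    }
}
This trades memory for lookup speed: each item is stored once per character of its key.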
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.

Getting a specific field's values from Lucene

I just started learning how Lucene works, and I am trying to implement it in a site I already wrote with MySQL.
I have a field named city in my documents, and I want to get all the values for city from the documents.
I have found this question (which is exactly what I need): Get all lucene values that have a certain fieldName
but all they show there is a line of code, and as I said, I am not experienced enough to understand how to implement it.
Can someone please help me with some code to implement IndexReader.Open(directory,true).Terms(new Term("city", String.Empty))?
What comes before / after that declaration?
I have tried this:
System.IO.DirectoryInfo directoryPath = new System.IO.DirectoryInfo(Server.MapPath("LuceneIndex"));
Directory directory = FSDirectory.Open(directoryPath);
Lucene.Net.Index.TermEnum iReader = IndexReader.Open(directory,true).Terms(new Term("city", String.Empty));
but how do I iterate over the results?
This loop should iterate over all the terms:
Term curTerm = iReader.Term();
bool hasNext = true;
while (curTerm != null && hasNext)
{
    if (curTerm.Field() != "city")
        break; // terms are ordered by field name, so we are past the city values

    //do whatever you need with the current term....
    hasNext = iReader.Next();
    curTerm = iReader.Term();
}
I'm not familiar with C# API, but it looks very similar to the Java one.
What this code does is get an instance of IndexReader, with read-only access, that is used to read data from the Lucene index segments stored in directory. It then gets an enumeration of all terms, starting at the given one. The dictionary (the part of the index that stores the terms) in Lucene is organized in .tis files, ordered lexicographically first by field name and then by term text.
So this statement gives you an enumeration of all term texts, starting at the beginning of the field city (aside: in Java you would rather write new Term("city")). You then walk through the enumeration until you reach a Term whose field() is something different, as the loop above does.
A final note: generally, you should avoid doing things like this: it may for example limit your ability to distribute the index. If it turns out that this is a thing that you are doing at the very beginning of using Lucene, then it's probable that you are using it more like a document database than a search library.

Speed up LINQ query without where clauses

Quick LINQ performance question.
I have a database with many many records and it's used for a webshop.
All query logic and paging is done with LINQ, and it performs quite well.
This is because the usual search for products contains one or more where clauses, which shortens my result set to a couple of hundred results at most.
But.. there is an option to list all products (when no search criteria are provided), and that query is slow.. real slow. Even though I'm just asking for a single page with .Skip(20).Take(10), it's still slow, because the total result is something like 140,000 products. Is there a way to limit this (or every) query so that the speed of the whole thing stays okay?
I don't want to force my customers to provide one or more criteria.. but on the other hand, I have no problem telling them that they can never see more than 2000 products.
Thanks for helping!
Tys
Why don't you limit the number of records on the SQL side, as described in this post:
http://www.sqlservercurry.com/2009/06/skip-and-take-n-number-of-records-in.html
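A sketch of what that looks like from LINQ, assuming a LINQ to SQL data context with a hypothetical Products table: as long as the query stays an IQueryable, Skip/Take are translated into the paging SQL (ROW_NUMBER on SQL Server 2005+) described in the post, so only one page of rows comes back.
// Compose Skip/Take *before* enumerating, so paging happens in the database.
var page = db.Products
    .OrderBy(p => p.Id)   // a deterministic order is required for stable paging
    .Skip(20)
    .Take(10)
    .ToList();            // only these 10 rows cross the wire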
Watch out for any "premature" enumerations when you pass down queries/results in your code!
There are also several LINQ visualizers available, which can help you see what the LINQ expressions actually translate to. Or you can play around with expressions in LINQPad before integrating them in your code…
What you can do is have LINQ use a stored procedure from the database.
In that case it will be faster, because the database engine does the work and returns the result to LINQ; the database engine is made for that, and it is closer to the data than LINQ.
I suggest you give it a try and give us feedback.
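A sketch, assuming LINQ to SQL and a hypothetical GetProductPage procedure that takes skip/take parameters:
// DataContext.ExecuteQuery maps the procedure's rows onto an entity type.
// GetProductPage is a hypothetical stored procedure, shown for illustration.
var page = db.ExecuteQuery<Product>("EXEC GetProductPage {0}, {1}", 20, 10).ToList();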
You can check what indexes the table has and what the PK is. It could be that the table has no indexes at all, so records are compared by field values. You can also capture the query in SQL Profiler, run it separately, and analyse its query plan.

Union on two big LINQ queries not supported. How do I get around it (if at all possible)

This is quite a monster query I am writing.
A very quick description of the database: You have an image table, an imagetag table, which joins it to a tag table. Images can have 0 or many tags.
I have a query which does a full text search on the Image's title property and this works fine.
However, I want it so when you do a full text search, it also looks at the image's tag names to see if anything matches. For instance, you could have an image with a title of "Awesome Cakes", which has a tag of cooking. When a user does a full text search of cooking, it should find that image, because it has a corresponding tag.
Right?
So as I mentioned I have my fulltext method which works and returns a queryable list of images.
I have also made a method which finds images whose tags match the full text query:
IQueryable<Image> results = imageService.FullTextSearch(MakeSearchTerm(freeText));
IQueryable<Image> tagResults = imageService.FullTextTagSearch(freeText, tagService);
When I debug this and view the enumerations, they both have results.
What I want to do is union them into one result set.
The reason I want to do this, is later down the code, other filters are applied to the results before the actual query is executed, such as filtering by author and only taking the results needed for the particular page.
Unfortunately, when I try to union:
results = results.Concat(tagResults).Distinct();
I get: "Types in Union or Concat cannot be constructed with hierarchy."
I understand that this might not be possible :p But I'm just seeing if there are any good ideas and solutions, cheers.
I think there is a problem in the way you approach searching. Instead of returning Images, you could introduce a type that only captures the image metadata, with a key back to the image.
I don't get exactly what "Types in Union or Concat cannot be constructed with hierarchy" means, though. Is that an exception?
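One way to act on that flattening idea, sketched under the assumption that Image has an Id key and that the underlying table is reachable as imageService.Images (both hypothetical names): union on scalar keys instead of entity hierarchies, then re-query.
// Union plain IDs (no hierarchy involved), then fetch the images once.
var ids = results.Select(i => i.Id)
    .Union(tagResults.Select(i => i.Id));
var combined = imageService.Images.Where(i => ids.Contains(i.Id));
// combined is still IQueryable, so the later author/paging filters compose as before.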
I would try combining the FullTextSearch and FullTextTagSearch functions into something like this:
public IQueryable<Image> ImageSearch(string fullText)
{
    // image.ImageTags/Tag is an assumed navigation path matching the
    // image -> imagetag -> tag schema described in the question.
    return from image in ImageTable
           where image.title.Contains(fullText)
              || image.ImageTags.Any(it => it.Tag.name.Contains(fullText))
           select image;
}

Fastest way to find objects from a collection matched by condition on string member

Suppose I have a collection (be it an array, generic List, or whatever is the fastest solution to this problem) of a certain class, let's call it ClassFoo:
class ClassFoo
{
public string word;
public float score;
//... etc ...
}
Assume there are going to be around 50,000 items in the collection, all in memory.
Now I want to obtain, as fast as possible, all the instances in the collection that satisfy a condition on their word member, for example like this:
List<ClassFoo> result = new List<ClassFoo>();
foreach (ClassFoo cf in collection)
{
if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
result.Add(cf);
}
How do I get the results as fast as possible? Should I consider some advanced indexing techniques and datastructures?
The application domain for this problem is an autocompleter, that gets a query and gives a collection of suggestions as a result. Assume that the condition doesn't get any more complex than this. Assume also that there's going to be a lot of searches.
With the constraint that the condition clause can be "anything", then you're limited to scanning the entire list and applying the condition.
If there are limitations on the condition clause, then you can look at organizing the data to more efficiently handle the queries.
For example, the code sample with the "byFirstLetter" dictionary doesn't help at all with an "endsWith" query.
So, it really comes down to what queries you want to do against that data.
In Databases, this problem is the burden of the "query optimizer". In a typical database, if you have a database with no indexes, obviously every query is going to be a table scan. As you add indexes to the table, the optimizer can use that data to make more sophisticated query plans to better get to the data. That's essentially the problem you're describing.
Once you have a more concrete subset of the types of queries, you can make a better decision as to what structure is best. Also, you need to consider the amount of data. If you have a list of 10 elements, each less than 100 bytes, a scan of everything may well be the fastest thing you can do, since you have such a small amount of data. Obviously that doesn't scale to 1M elements, but even clever access techniques carry a cost in setup, maintenance (like index maintenance), and memory.
EDIT, based on the comment
If it's an auto completer, if the data is static, then sort it and use a binary search. You're really not going to get faster than that.
If the data is dynamic, then store it in a balanced tree, and search that. That's effectively a binary search, and it lets you keep add the data randomly.
Anything else is some specialization on these concepts.
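A sketch of the static case, assuming the words are held in a sorted string array (for the dynamic case, an ordered structure such as SortedSet<T> gives the same ordered access):
// Binary-search to the first entry >= query, then walk the contiguous run
// of entries that start with the query (ordinal order keeps them adjacent).
int i = Array.BinarySearch(sortedWords, query, StringComparer.Ordinal);
if (i < 0) i = ~i; // not found: the complement is the insertion point
var matches = new List<string>();
while (i < sortedWords.Length &&
       sortedWords[i].StartsWith(query, StringComparison.Ordinal))
    matches.Add(sortedWords[i++]);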
var answers = myList.Where(item => item.word.StartsWith(query) || item.word.EndsWith(query));
That's the easiest in my opinion; it should execute rather quickly.
Not sure I understand... All you can really do is optimize the rule, that's the part that needs to be fastest. You can't speed up the loop without just throwing more hardware at it.
You could parallelize if you have multiple cores or machines.
I'm not up on my Java right now, but I would think about the following things.
How are you creating your list? Perhaps you can create it already ordered in a way which cuts down on comparison time.
If you are just doing a straight loop through your collection, you won't see much difference between storing it as an array or as a linked list.
For storing the results, depending on how you are collecting them, the structure could make a difference (but assuming Java's generic structures are smart, it won't). As I said, I'm not up on my Java, but I assume that the generic linked list would keep a tail pointer. In this case, it wouldn't really make a difference. Someone with more knowledge of the underlying array vs linked list implementation and how it ends up looking in the byte code could probably tell you whether appending to a linked list with a tail pointer or inserting into an array is faster (my guess would be the array). On the other hand, you would need to know the size of your result set or sacrifice some storage space and make it as big as the whole collection you are iterating through if you wanted to use an array.
Optimizing your comparison query by figuring out which comparison is most likely to be true and doing that one first could also help. I.e., if in general 10% of the time a member of the collection starts with your query, and 30% of the time a member ends with it, you would want to do the end comparison first.
For your particular example, sorting the collection would help, as you could binary-chop to the first item that starts with query and terminate early when you reach the next one that doesn't; you could also produce a table of pointers to collection items sorted by the reverse of each string for the second clause.
In general, if you know the structure of the query in advance, you can sort your collection (or build several sorted indexes for your collection if there are multiple clauses) appropriately; if you do not, you will not be able to do better than linear search.
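A sketch of that reversed index (the binary-search walk from the earlier answer then applies unchanged to the Key field):
// EndsWith(query) on word is StartsWith(Reverse(query)) on the reversed key.
Func<string, string> reverse = s => new string(s.Reverse().ToArray());
var byReversedWord = collection
    .Select(cf => new { Key = reverse(cf.word), Item = cf })
    .OrderBy(x => x.Key, StringComparer.Ordinal)
    .ToList();
string reversedQuery = reverse(query);
// now binary-search byReversedWord for the run of Keys starting with reversedQuery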
If it's something where you populate the list once and then do many lookups (thousands or more) then you could create some kind of lookup dictionary that maps starts with/ends with values to their actual values. That would be a fast lookup, but would use much more memory. If you aren't doing that many lookups or know you're going to be repopulating the list at least semi-frequently I'd go with the LINQ query that CQ suggested.
You can create some sort of index, and it might get faster.
We can build an index like this:
var indexByFirstLetter = new Dictionary<char, List<ClassFoo>>();
foreach (var cf in collection)
{
    char first = cf.word[0];
    char last = cf.word[cf.word.Length - 1];
    if (!indexByFirstLetter.ContainsKey(first)) indexByFirstLetter[first] = new List<ClassFoo>();
    indexByFirstLetter[first].Add(cf);
    if (last == first) continue; // avoid adding the same item twice
    if (!indexByFirstLetter.ContainsKey(last)) indexByFirstLetter[last] = new List<ClassFoo>();
    indexByFirstLetter[last].Add(cf);
}
Then use it like this:
// Look in both the first-letter and last-letter buckets, since an
// EndsWith match is keyed by its last letter.
foreach (ClassFoo cf in indexByFirstLetter[query[0]]
    .Union(indexByFirstLetter[query[query.Length - 1]]))
{
    if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
        result.Add(cf);
}
Now we possibly do not have to loop through as many ClassFoo instances as in your example, but then again we have to keep the index up to date. There is no guarantee that it is faster, but it is definitely more complicated.
Depends. Are all your objects always going to be loaded in memory? Do you have a finite limit of objects that may be loaded? Will your queries have to consider objects that haven't been loaded yet?
If the collection will get large, I would definitely use an index.
In fact, if the collection can grow to an arbitrary size and you're not sure that you will be able to fit it all in memory, I'd look into an ORM, an in-memory database, or another embedded database. XPO from DevExpress for ORM or SQLite.Net for in-memory database comes to mind.
If you don't want to go this far, make a simple index consisting of the "word" member references mapping to class references.
If the set of possible criteria is fixed and small, you can assign a bitmask to each element in the list. The size of the bitmask is the size of the set of the criteria. When you create an element/add it to the list, you check which criteria it satisfies and then set the corresponding bits in the bitmask of this element. Matching the elements from the list will be as easy as matching their bitmasks with the target bitmask. A more general method is the Bloom filter.
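A sketch of the bitmask idea; the criteria names are made up for illustration:
[Flags]
enum Criteria { None = 0, IsNoun = 1, IsCommon = 2, IsSlang = 4 } // hypothetical criteria

class ClassFoo
{
    public string word;
    public float score;
    public Criteria mask; // set once, when the element is created/added
}

// Matching is then a single AND per element:
var wanted = Criteria.IsNoun | Criteria.IsCommon;
var matches = collection.Where(cf => (cf.mask & wanted) == wanted);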
