I just started learning how Lucene works and am trying to implement it in a site I already wrote with MySQL.
I have a field named city in my documents, and I want to get all the values for city from the documents.
I have found this question (which is exactly what I need): Get all lucene values that have a certain fieldName
But all they show there is a single line of code and, as I said, I am not experienced enough to know how to implement it.
Can someone please help me with some code to implement IndexReader.Open(directory, true).Terms(new Term("city", String.Empty))?
What comes before/after that declaration?
I have tried this:
System.IO.DirectoryInfo directoryPath = new System.IO.DirectoryInfo(Server.MapPath("LuceneIndex"));
Directory directory = FSDirectory.Open(directoryPath);
Lucene.Net.Index.TermEnum iReader = IndexReader.Open(directory,true).Terms(new Term("city", String.Empty));
But how do I iterate over the results?
This loop should iterate over all the terms:
Term curTerm = iReader.Term();
bool hasNext = true;
while (curTerm != null && hasNext)
{
    // do whatever you need with the current term...
    hasNext = iReader.Next();
    curTerm = iReader.Term();
}
I'm not familiar with the C# API, but it looks very similar to the Java one.
What this code does is get an instance of IndexReader with read-only access, which is used to read data from the Lucene index segments stored in directory. Then it gets an enumeration of all terms, starting at the given one. The dictionary (the part of the index that stores the terms) in Lucene is organized in .tis files, ordered lexicographically first by field name and then by term text.
So this statement gives you an enumeration of all term texts, starting at the beginning of the field city (aside: in Java you would rather write new Term("city")). You now need to find out the C# API of this enumeration and walk through it until you get a Term whose field() is something different.
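Putting the pieces together, here is a fuller sketch of what comes before and after that line (this assumes the older Lucene.Net 2.9-style API that your snippet uses, and reuses the Server.MapPath path from your question; error handling is omitted):

// Collects every distinct value of the "city" field from the index.
List<string> cities = new List<string>();
Directory directory = FSDirectory.Open(
    new System.IO.DirectoryInfo(Server.MapPath("LuceneIndex")));
IndexReader reader = IndexReader.Open(directory, true); // true = read-only
TermEnum terms = reader.Terms(new Term("city", String.Empty));
try
{
    Term curTerm = terms.Term();
    bool hasNext = true;
    while (curTerm != null && hasNext)
    {
        // Terms are ordered by field, so stop once we leave "city".
        if (curTerm.Field() != "city")
            break;
        cities.Add(curTerm.Text());
        hasNext = terms.Next();
        curTerm = terms.Term();
    }
}
finally
{
    terms.Close();
    reader.Close();
}

Afterwards, cities holds one entry per distinct city value in the index.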
A final note: generally you should avoid doing things like this; it may, for example, limit your ability to distribute the index. If it turns out that this is something you are doing at the very beginning of using Lucene, then you are probably using it more like a document database than a search library.
Currently I have 7,000 video entries, and I am having a hard time optimizing the search for Tags and Actress.
This is the code I am trying to modify. I tried using a HashSet; it is my first time using one, and I don't think I am doing it right.
Dictionary dictTag = JsonPairtoDictionary(tagsId,tagsName);
Dictionary dictActresss = JsonPairtoDictionary(actressId, actressName);
var listVid = new List<VideoItem>(db.VideoItems.ToList());
HashSet<VideoItem> lll = new HashSet<VideoItem>(listVid);
foreach (var tags in dictTag)
{
lll = new HashSet<VideoItem>(lll.Where(q => q.Tags.Exists(p => p.Id == tags.Key)));
}
foreach (var actress in dictActresss)
{
listVid = listVid.Where(q => q.Actress.Exists(p => p.Id == actress.Key)).ToList();
}
First, I get all the videos in the DB by using db.VideoItems.ToList().
Then it goes through a loop to check whether a tag exists.
Each VideoItem has a List<Tags>, and I use Exists to check whether a tag matches.
Then the same thing for Actress.
I am not sure if it's because I am in Debug mode with Application Insights active, but it is slow. I also get around 10-15 events per second with baseType:RemoteDependencyData, which I am not sure means it is still connected to the database (it should not be, since I should only be working with an in-memory list of all the videos).
After 7 minutes it is still processing, and that's the longest I have waited.
I am afraid to put this on my live site, since it will eat up my resources like candy.
Instead of optimizing the LINQ, you should optimize your database query.
Databases are great at optimized searches and at creating subsets, and will most likely be faster than anything you write. If you need to create a subset based on more than one database parameter, I would recommend creating some indexes and using those.
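In C# with Entity Framework, the same filtering can stay on the database side by composing an IQueryable instead of materializing the list first. A sketch (assuming VideoItems exposes Tags as a navigation collection, which your code suggests):

// Keeps the whole filter in SQL; nothing is pulled into memory
// until ToList() runs. Mirrors the "must match every tag" logic
// of the original nested loops.
IQueryable<VideoItem> query = db.VideoItems;

foreach (var tagId in dictTag.Keys)
{
    var id = tagId; // capture per iteration for the expression tree
    query = query.Where(v => v.Tags.Any(t => t.Id == id));
}

var result = query.ToList();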
Edit:
Example of a db query that would eliminate the first foreach loop (which is really multiple nested loops, and is where the time delay comes from):
select * from videos where tag in [list of tags]
Edit 2:
To make this as efficient as possible, have the database index the tag column. To create the index:
CREATE INDEX video_tags_idx ON videos (tag)
Use EXPLAIN to see whether the index is being used automatically (it should be):
explain select * from videos where tag in [list of tags]
If it doesn't show your index being used, you can look up the syntax to force its use.
The problem was not the optimization itself but how I was using Microsoft SQL / my ApplicationDbContext.
I realized this when I found the PredicateBuilder: http://www.albahari.com/nutshell/predicatebuilder.aspx
The problem with keyword search is that there can be multiple keywords, and the code I wrote above doesn't push the work down to SQL, which is what caused the long execution time.
Using the PredicateBuilder, it is possible to build dynamic conditions in LINQ.
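For anyone following along, the pattern from that page looks roughly like this when adapted to the names above (a sketch: v.Title stands in for whichever text column you actually search, and with Entity Framework you may also need LINQKit's AsExpandable() for the invoked expressions to translate):

// Builds one dynamic OR-predicate over all keywords, so the whole
// search runs as a single SQL query.
var predicate = PredicateBuilder.False<VideoItem>();

foreach (string keyword in keywords)
{
    string temp = keyword; // capture for the expression tree
    predicate = predicate.Or(v => v.Title.Contains(temp));
}

var matches = db.VideoItems.AsExpandable().Where(predicate).ToList();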
My code only works if I search for the name exactly as it appears in the table: I must search for the FULL name and spell it correctly, including uppercase letters.
E.g. I cannot find 'The Martian' by searching 'the martian' or 'martian', etc.
using (MovieEntities db = new MovieEntities())
{
    var searchMovie = new List<Movie>(db.Moviess.ToList());
    var searchFilter = new List<Movie>();
    foreach (var search in searchMovie)
    {
        if (search.Name.Contains(txtSearch.Text))
        {
            // so far, only adds if I search for its full, correctly-cased name
            searchFilter.Add(search);
        }
    }
    /* print out */
}
How can I make the search match if the name contains ANY part of txtSearch.Text, ignoring upper/lower case, etc.?
PS: I'm trying to learn LINQ, so I would also appreciate an alternative LINQ solution.
Thanks
This will keep the searches on the database side, which speeds them up, and most people have their databases configured to be case-insensitive, so you get that as a freebie.
using (MovieEntities db = new MovieEntities())
{
    var searchFilter = db.Moviess.AsQueryable();
    foreach (var word in txtSearch.Text.Split(' '))
    {
        searchFilter = searchFilter.Where(f => f.Name.Contains(word));
    }
    /* Print */
}
Note that you asked "how can I search if it contains ANY parts of the txtSearch.Text", which isn't quite what this does: it makes sure the name contains ALL the words of txtSearch.Text, in any order, which is the more typical behavior. It could be rewritten to match any part as well, but then a search for "the anything" would return an awful lot of results.
Your query should already match any string that includes txtSearch.Text, since you used Contains instead of Equals. The only thing missing is making it case-insensitive. You can do that by lowercasing (or uppercasing) both strings with the String.ToLower or String.ToUpper methods.
So in your case, it should be:
if (search.Name.ToLower().Contains(txtSearch.Text.ToLower()))
{
// ...
}
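As an aside, ToLower allocates two new strings on every comparison and can behave surprisingly under some cultures (the Turkish "i" is the classic example). An alternative sketch is a case-insensitive IndexOf:

```csharp
using System;

class Program
{
    // Case-insensitive "contains", without allocating lowercased copies.
    public static bool ContainsIgnoreCase(string haystack, string needle)
    {
        return haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    static void Main()
    {
        Console.WriteLine(ContainsIgnoreCase("The Martian", "martian")); // True
        Console.WriteLine(ContainsIgnoreCase("The Martian", "alien"));   // False
    }
}
```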
I have the following code, using Lucene.NET V4, to check if a file exists in my index.
bool exists = false;
IndexReader reader = IndexReader.Open(Lucene.Net.Store.FSDirectory.Open(lucenePath), false);
Term term = new Term("filepath", "\\myFile.PDF");
TermDocs docs = reader.TermDocs(term);
if (docs.Next())
{
exists = true;
}
The file myFile.PDF definitely exists, but the check always comes back false. When I look at docs in the debugger, its Doc and Freq properties state that they "threw an exception of type 'System.NullReferenceException'".
First of all, it's good practice to reuse the same IndexReader instance if you don't need to see deleted documents: it performs better, and it's thread-safe, so you can make it a static read-only field. (Although I can see you're passing false for the readOnly parameter, so if that is intentional, just ignore this paragraph.)
As for your case: are you tokenizing the filepath field values? Because if you are (e.g. by using StandardAnalyzer when indexing/searching), you will probably have problems finding values such as \myFile.PDF (with the default tokenizer the value is split into myFile and PDF; I'm not sure about the leading backslash).
Hope this helps.
You may have analyzed the field "filepath" during indexing with an analyzer that tokenizes or changes the content; e.g. the StandardAnalyzer tokenizes, lowercases, removes stopwords if specified, etc.
If you only need to query with the exact filepath, as in your example, use the KeywordAnalyzer for this field during indexing.
If you can't re-index at the moment, you need to find out which analyzer was used during indexing and use it to create your query. You have two options:
Use a query parser with the right analyzer and parse the query filepath:\\myFile.PDF. If the resulting query is a TermQuery, you can use its term as you did in your example. Otherwise, perform a search with the query.
Use the analyzer directly to create the terms from the TokenStream object. Again, if there is only one term, do it as you did; if there are multiple terms, create a phrase query.
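For the re-indexing route, a minimal sketch (assuming the Lucene.Net 3.x-style API your snippets use, and an IndexWriter named writer) is to store the path as a single untokenized term, after which your TermDocs lookup works as written:

// Index the path as ONE term, exactly as typed, so an exact-match
// TermQuery / TermDocs lookup can find it later.
var doc = new Document();
doc.Add(new Field("filepath", @"\myFile.PDF",
                  Field.Store.YES,
                  Field.Index.NOT_ANALYZED)); // no tokenizing, no lowercasing
writer.AddDocument(doc);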
I have implemented fuzzy search in Lucene.Net. If I search for Feature, only Feature, Featured, featuring, etc. should come back. But the results come back on loose text matching, like venture, culture, and so on, because "ture" matches in the fuzzy search. My code is:
Query query = new FuzzyQuery(new Term("ContentText", searchString));
finalQuery.Add(query, BooleanClause.Occur.SHOULD);
You should take a look at the process called "lemmatisation" (http://en.wikipedia.org/wiki/Lemmatisation). You want to build your index on the base form of each word (called the lemma), and do the same with your query.
Lucene supports the English language out of the box, so there should not be any problem with that.
You can also apply additional filters that check the minimum score as well as the minimum similarity, which can improve the quality of the results. Another thing I have done in specific scenarios is to use multiple different query types, combine the results (filtering out low scores), and return a combined list. This works really well for things like an engine that dynamically assumes "did you mean..." results up front rather than asking "did you mean?".
You probably need to set the fuzzy minimum similarity (the parser's FuzzyMinSim).
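Concretely, a sketch of tightening the query from the question (Lucene.Net 3.x-style constructor assumed; 0.7f and the prefix length of 3 are illustrative values to tune):

// Require at least 70% similarity, and force the first 3 characters
// to match exactly, which rules out venture/culture for "feature".
Query query = new FuzzyQuery(
    new Term("ContentText", searchString),
    0.7f,  // minimumSimilarity: 0..1, higher = stricter
    3);    // prefixLength: leading chars that must match exactly
finalQuery.Add(query, BooleanClause.Occur.SHOULD);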
I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
r.year == record.year &&
r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O which can build indexes, however it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important since a lot of the original queries take advantage of the fact that doing a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections, which supports dictionaries whose keys aren't unique.
I tested ToLookup(), which worked great. It's still not quite as fast as the original code, but it's at least acceptable: down from 45 seconds to 3-4 seconds. I'll take a look at the trie structure for the other lookups.
Thanks.
Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list using nested loops (O(n^2)) does. I infer this is what you're doing (since you assign to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
// do something with items in group.
}
Your startswith criteria is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.
Certainly you can do better than this. Let's start by noting that dictionaries are not useful only when you want to query one field; you can easily have a dictionary whose key is an immutable value aggregating many fields. So for this particular query, an immediate improvement would be to create a key type:
// Immutable composite key; readonly fields give value equality via the
// struct's default Equals/GetHashCode (override both for speed).
struct Key
{
    public readonly int year;
    public readonly int period;

    public Key(int year, int period)
    {
        this.year = year;
        this.period = period;
    }
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
The next step would be to use as the value type not an ICollection<T> but a trie (this looks promising), a data structure tailored to finding strings that have a specified prefix.
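A trie like the one described can be sketched minimally as follows (illustrative, not production-ready):

```csharp
using System.Collections.Generic;

// Minimal prefix trie: Add() inserts a word, StartsWith() yields
// every stored word beginning with the given prefix.
class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node root = new Node();

    public void Add(string word)
    {
        Node cur = root;
        foreach (char c in word)
        {
            Node next;
            if (!cur.Children.TryGetValue(c, out next))
                cur.Children[c] = next = new Node();
            cur = next;
        }
        cur.IsWord = true;
    }

    public IEnumerable<string> StartsWith(string prefix)
    {
        Node cur = root;
        foreach (char c in prefix)
        {
            if (!cur.Children.TryGetValue(c, out cur))
                yield break; // no stored word has this prefix
        }
        foreach (string s in Collect(cur, prefix))
            yield return s;
    }

    private static IEnumerable<string> Collect(Node node, string prefix)
    {
        if (node.IsWord)
            yield return prefix;
        foreach (var kv in node.Children)
            foreach (string s in Collect(kv.Value, prefix + kv.Key))
                yield return s;
    }
}
```

With the ryp values of each (year, period) bucket loaded into such a trie, the StartsWith lookup replaces the linear scan over the bucket.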
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Certainly, all of this applies only to the specific example given and may need to be revisited given other specifics of your situation, but in any case you should be able to extract a practical gain from this or something similar.