Lucene.NET - checking if document exists in index - c#

I have the following code, using Lucene.NET V4, to check if a file exists in my index.
bool exists = false;
IndexReader reader = IndexReader.Open(Lucene.Net.Store.FSDirectory.Open(lucenePath), false);
Term term = new Term("filepath", "\\myFile.PDF");
TermDocs docs = reader.TermDocs(term);
if (docs.Next())
{
exists = true;
}
The file myFile.PDF definitely exists, but exists always comes back as false. When I look at docs in the debugger, its Doc and Freq properties state that they "threw an exception of type 'System.NullReferenceException'".

First of all, it's good practice to reuse the same IndexReader instance if you don't need to see documents deleted after it was opened: it performs better and it's thread-safe, so you can keep it in a static read-only field. (That said, I can see you're passing false for the readOnly parameter, so if that's intentional, just ignore this paragraph.)
As for your case, are you tokenizing the filepath field values? If you are (e.g. by using StandardAnalyzer when indexing/searching), you will probably have problems finding values such as \myFile.PDF: with the default tokenizer the value is split into myFile and PDF (I'm not sure what happens to the leading backslash).
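To see what the analyzer actually produced for that value, you can run it through a TokenStream by hand. A rough sketch, assuming the Lucene.Net 3.x API (the attribute type is named differently in other versions):
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
TokenStream stream = analyzer.TokenStream("filepath", new StringReader(@"\myFile.PDF"));
var termAttr = stream.AddAttribute<ITermAttribute>();
while (stream.IncrementToken())
{
    // prints the tokens the index actually contains, e.g. "myfile" and "pdf"
    Console.WriteLine(termAttr.Term);
}
If this prints more than one token, a single TermDocs lookup on the raw path will never match.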
Hope this helps.

You may have analyzed the field "filepath" during indexing with an analyzer which tokenizes/changes the content, e.g. the StandardAnalyzer tokenizes, lowercases, removes stop words if specified, etc.
If you only need to query with the exact filepath like in your example, use the KeywordAnalyzer during indexing for this field.
If you can't re-index at the moment, you need to find out which analyzer was used during indexing and use it to create your query. You have two options:
Use a query parser with the right analyzer and parse the query filepath:\\myFile.PDF. If the resulting query is a TermQuery, you can use its term as you did in your example. Otherwise perform a search with the query.
Use the Analyzer directly to create the terms from the TokenStream object. Again, if there is only one term, do it as you did; if there are multiple terms, create a phrase query.
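If you can re-index, a minimal sketch of indexing the path untokenized, assuming a Lucene.Net 3.x-style API (field and variable names are illustrative):
var analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
analyzer.AddAnalyzer("filepath", new KeywordAnalyzer()); // keep the whole path as a single term
using (var writer = new IndexWriter(FSDirectory.Open(lucenePath), analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("filepath", @"\myFile.PDF", Field.Store.YES, Field.Index.ANALYZED));
    // ... other fields, analyzed normally by the StandardAnalyzer default ...
    writer.AddDocument(doc);
}
Indexed this way, the TermDocs lookup from the question should find the document by its exact path.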

Related

Get last index of character with LINQ to Entities

I'm getting the error:
LINQ to Entities does not recognize the method 'Int32 LastIndexOf(System.String)'
method, and this method cannot be translated into a store expression.
When using this code to tell if a person's last name starts with certain characters:
persons = persons.Where(c => c.FullName.IndexOf(" ") > 0 &&
c.FullName.Substring(c.FullName.LastIndexOf(" ")+1).StartsWith(lastNameSearch));
Any clue how to achieve this without using LastIndexOf()? Maybe I have to check for this after I grab results from the database using ToList()?
You are limited by the set of canonical functions that can be translated into an SQL query, so any solution has to be expressed using nothing more than what the canonical functions offer.
Luckily, one of the supported functions is the bool Contains(string) instance method. You can rewrite your check as
persons = persons.Where(c => c.FullName.Contains(" " + lastNameSearch));
This is not exactly equivalent to your current version (for people with more than two names it will also match on a middle name, which your original check won't), but it's pretty close and IMHO can be acceptable.
Of course it would be much better than any of this to keep the last names as a separate column in the database, if that is at all possible.
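If the approximation isn't acceptable, you can combine it with the in-memory check hinted at in the question: let the database do the coarse Contains filter, then verify the last name exactly on the client side. A sketch:
var filtered = persons
    .Where(c => c.FullName.Contains(" " + lastNameSearch)) // translated to SQL, coarse filter
    .AsEnumerable()                                         // switch to LINQ to Objects
    .Where(c => c.FullName
        .Substring(c.FullName.LastIndexOf(" ") + 1)
        .StartsWith(lastNameSearch))                        // exact check, runs in memory
    .ToList();
Only the rows that pass the coarse SQL filter are pulled out of the database, so the in-memory pass stays cheap.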

Passing query as string to elasticsearch using Mpdreamz/NEST

I started using NEST and got it working. I see that passing the query as a string is deprecated. Is there another way of doing this? Let's say I want to search for "test" in the whole index.
Passing as string is indeed deprecated but will not be removed.
To search for a term over all indices use:
this.ConnectedClient.Search<MyDTO>(s=>s
.AllIndices()
.Query(q=>q.Term(f=>f.Name, ""))
);
Make sure to look at the test project and the documentation, which have a lot of example code.
You can just use the querystring query type if all you are looking for is to search by a single word across all fields for a document type.
Client.Search<T>(s => s.Index("MyIndex").Query(q => q.QueryString("test")));

What is the fastest way to search a List<T> across multiple properties?

I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
r.year == record.year &&
r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O, which can build indexes; however, it fails for me on various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important, since a lot of the original queries take advantage of the fact that searching for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections, which supports dictionaries whose keys aren't unique.
I tested ToLookup(), which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other lookups.
Thanks.
Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
// do something with items in group.
}
Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.
Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
public int year;
public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar, where T is the element type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
The next step would be to use not an ICollection<T> as the value type but a trie, a data structure tailored to finding strings that have a specified prefix.
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
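Putting the key type, the dictionary of buckets, and the hoisted TrimEnd together, a rough sketch (assuming the elements of cost are a reference type exposing year, period and ryp as in the question):
// Build once per batch, reuse for every record.
var byPeriod = cost
    .GroupBy(r => new Key { year = r.year, period = r.period })
    .ToDictionary(g => g.Key, g => g.ToList());

// Per record: only rows sharing the same year/period are scanned.
string prefix = record.form.TrimEnd(); // computed once, outside any inner loop
var key = new Key { year = record.year, period = record.period };
var match = byPeriod.TryGetValue(key, out var candidates)
    ? candidates.FirstOrDefault(r => r.ryp.StartsWith(prefix))
    : null;
Built once and reused across all 100K-200K records, this turns the per-record work into a scan of one small bucket instead of the whole list.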
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.

getting a specific field values from lucene

I just started learning how Lucene works and am trying to implement it in a site I already wrote with MySQL.
I have a field named city in my documents, and I want to get all the values for city from the documents.
I have found this question (which is exactly what I need): Get all lucene values that have a certain fieldName
but all they show there is a line of code, and as I said, I am not experienced enough to understand how to implement it.
Can someone please help me with some code to implement IndexReader.Open(directory,true).Terms(new Term("city", String.Empty));
What comes before/after that declaration?
I have tried this:
System.IO.DirectoryInfo directoryPath = new System.IO.DirectoryInfo(Server.MapPath("LuceneIndex"));
Directory directory = FSDirectory.Open(directoryPath);
Lucene.Net.Index.TermEnum iReader = IndexReader.Open(directory,true).Terms(new Term("city", String.Empty));
But how do I iterate over the results?
This loop should iterate over all the terms:
Term curTerm = iReader.Term();
bool hasNext = true;
while (curTerm != null && hasNext)
{
//do whatever you need with the current term....
hasNext = iReader.Next();
curTerm = iReader.Term();
}
I'm not familiar with C# API, but it looks very similar to the Java one.
What this code does is get an instance of IndexReader with read-only access, which is used to read data from the Lucene index segments stored in directory. It then gets an enumeration of all terms, starting at the given one. The dictionary (the index part that stores the terms) in Lucene is organized in .tis files, ordered lexicographically first by field name and then by term text.
So this statement gives you an enumeration of all term texts, starting at the beginning of the field city (aside: in Java you would rather write new Term("city")). You now need to find out the C# API of this enumeration and walk through it until you reach a Term whose field() is something other than city.
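Putting that together with the loop above, a rough sketch that collects the distinct city values and stops once the enumeration leaves the city field (in some Lucene.Net versions Term/Field/Text are properties rather than methods):
var cities = new HashSet<string>();
Term curTerm = iReader.Term();
bool hasNext = true;
while (curTerm != null && hasNext && curTerm.Field() == "city")
{
    cities.Add(curTerm.Text()); // the raw indexed value for the city field
    hasNext = iReader.Next();
    curTerm = iReader.Term();
}
// remember to close/dispose the enumeration and the reader when done
Keep in mind that if city was analyzed at index time, these values are the individual tokens, not necessarily the original stored strings.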
A final note: generally, you should avoid doing things like this, as it may for example limit your ability to distribute the index. If this is something you find yourself doing at the very beginning of using Lucene, it's likely that you are using it more like a document database than a search library.

how to load a hashtable from a simple xml file using xmltextreader

Using XmlTextReader, how would I load a Hashtable?
XML:
<base><user name="john">2342343</user><user name="mark">239099393</user></base>
This was asked before, but the answer used some funky LINQ that I am not fully comfortable with just yet.
Well, the LINQ to XML solution is really easy, so I suggest we try to make you comfortable with that instead of creating a more complex solution. Here's the code, with plenty of explanation...
// Load the whole document into memory, as an element
XElement root = XElement.Load(xmlReader);
// Get a sequence of users
IEnumerable<XElement> users = root.Elements("user");
// Convert this sequence to a dictionary...
Dictionary<string, string> userMap = users.ToDictionary(
element => element.Attribute("name").Value, // Key selector
element => element.Value); // Value selector
Of course you could do this all in one go - and I'd probably combine the second and third statements. But that's about as conceptually simple as it's likely to get. It would become more complicated if you wanted to put error handling around the possibility that a user element might not have a name, admittedly. (This code will throw a NullReferenceException in that case.)
Note that this assumes you want the name as the key and id as value. If you want the hashtable the other way round, just switch the order of the lambda expressions.
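For completeness, a small sketch wiring this up with the sample XML from the question (the xml variable and the Console output are just for illustration):
string xml = "<base><user name=\"john\">2342343</user><user name=\"mark\">239099393</user></base>";
using (XmlReader xmlReader = XmlReader.Create(new StringReader(xml)))
{
    XElement root = XElement.Load(xmlReader);
    Dictionary<string, string> userMap = root.Elements("user").ToDictionary(
        element => element.Attribute("name").Value, // key: the name attribute
        element => element.Value);                  // value: the element text
    Console.WriteLine(userMap["john"]); // 2342343
    Console.WriteLine(userMap["mark"]); // 239099393
}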
