Lucene index field not searchable - c#

So I have a field in my Lucene index documents named "Field1" (for all intents and purposes).
When I open Luke and browse the documents, I see that most of them have this field. However, when I switch to the search tab and enter Field1:parameterValue, I get zero search results.
When doing the indexing, for the document, I have
doc.Add(new Field("Field1", field1, Field.Store.YES, Field.Index.ANALYZED));
Why is my field not able to be searched? As an aside, I can't find any documentation on Luke that explains what the "IdfpTSVopNLB#" column is in the document record either. I'm thinking this information could possibly be useful, so for one of the records that has this field, the column value is IdfpTS---N--- and the "Norm" column is 4.0

The "IdfpTSVopNLB#" field is a collection of flags. You should see a key to it in Luke.
I would guess your searches are failing because you aren't taking analysis into account. For instance, with your sample query Field1:parameterValue, if the field was analyzed by StandardAnalyzer at index time (and the query is not analyzed, or is keyword analyzed), you'll get no results. This is because "parameterValue" would have been lowercased by the analyzer, so the actual searchable term is "parametervalue" instead.
In the search tab, you should see a place to select an analyzer for Luke to use for query parsing. If you use the same analyzer you used to index the data, you may see better results.
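The lowercasing mismatch described above can be illustrated without Lucene at all. The following is only a minimal sketch: AnalysisDemo, Analyze and IndexTerms are hypothetical names, and the "analyzer" is a toy stand-in that splits on a few separators and lowercases, which is far less than StandardAnalyzer actually does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for an inverted index; this is NOT Lucene's API.
public static class AnalysisDemo
{
    // Simplified "analyzer": split on whitespace and a few punctuation
    // characters, then lowercase each token. It only mimics the lowercasing
    // step of a real analyzer.
    public static IEnumerable<string> Analyze(string text)
    {
        return text
            .Split(new[] { ' ', '\t', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.ToLowerInvariant());
    }

    // Build the set of indexed terms for one field value.
    public static HashSet<string> IndexTerms(string fieldValue)
    {
        return new HashSet<string>(Analyze(fieldValue), StringComparer.Ordinal);
    }

    public static void Main()
    {
        HashSet<string> terms = IndexTerms("some parameterValue here");

        // Raw query term, not analyzed: the case-sensitive lookup misses.
        Console.WriteLine(terms.Contains("parameterValue")); // False

        // Query term run through the same analysis: the lookup hits.
        Console.WriteLine(terms.Contains(Analyze("parameterValue").First())); // True
    }
}
```

The point is only that term lookup is an exact match, so the query side has to go through the same transformation as the index side.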

As it turns out, this is the correct way to do this. I just needed to delete the entire index and rebuild it from scratch to get the new values in. It didn't automatically update the existing indexes.

Related

Extract unique list of fields from matching documents

I am new to Lucene, so maybe I have misunderstood something about how it works.
I have indexed a few hundred thousand documents with many string fields. For example, suppose we have 5 string fields (named A, B, C, D, E); the first 3 (A, B, C) are indexed, while the last two (D, E) are unindexed and only stored in the document. Values in each field may be duplicated; for example, assume field A stores names and the name 'Richard' appears many times.
When I apply a query, I look for each term in each field. Suppose I get 3K documents that match my query.
Is it possible to get a list of unique (distinct) values for each field without scanning and grouping the results? I am particularly interested in this because I limit the number of documents I actually read, but I would like a complete list of unique values in each field across all matching documents (even the ones I don't read).
If this is possible, can I apply the same logic to the unindexed fields (D, E)?
A search returns all the documents that satisfy the query conditions. On that result you can run a highlighter (which will slow the process), and you can paginate to return the results in pages if you want.
The highlighter has several methods you can use (depending on your Lucene version; I am talking here about the latest version, 4.8.0), such as GetBestTextFragments(), which takes a parameter called maxNumberFragments. If you set that parameter to 1, it will return only one value from that particular field even if multiple values match the query.
I am not sure that answers your question, but I hope it helps. Regarding the unindexed fields, I don't think you can do that (although I have never tried it).

C# implement raven db full text search by the part of word

I have a grid and I need to support full text search. I need to support searching not only by the start or end of a word, but also by any part of a word. For example, if I have "MyWord", a search for "wor" should find it. If I try to use string.Contains(), I get the following error:
Contains is not supported, doing a substring match over a text field is a very slow operation, and is not allowed using the Linq API.
The recommended method is to use full text search (mark the field as Analyzed and use the Search() method to query it.
If I build a RavenDB index and mark the field as Analyzed, Contains still does not work. StartsWith() and EndsWith() work, but Contains does not. Using .Search() I get the same results. Another option is to use Lucene syntax:
.Where("Name:*partOfWord*")
and it works fine, but I don't want to combine linq with lucene syntax and I want to solve it using raven db indexes.
Have you any ideas how to implement full text search for raven db using indexes?
You want to be using an NGram analyzer, as described here. It's an analyzer you can add to your RavenDB server by dropping its DLL in the Analyzers folder.
You really don't want to do any *substr Lucene queries ("ending with" clauses, that is), because the performance is terrible. The inconsistency in coding style is a lesser problem.
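To see why n-gram analysis makes "part of a word" searches cheap, here is a minimal sketch of what such an analyzer emits at index time. It is an illustration only, not the RavenDB or Lucene analyzer itself: NGramDemo and NGrams are hypothetical names, and the gram sizes (2 to 6) are an assumption chosen for the example.

```csharp
using System;
using System.Collections.Generic;

public static class NGramDemo
{
    // Emit every lowercase substring whose length is between minGram and maxGram.
    public static HashSet<string> NGrams(string word, int minGram, int maxGram)
    {
        var grams = new HashSet<string>(StringComparer.Ordinal);
        string lower = word.ToLowerInvariant();
        for (int size = minGram; size <= maxGram; size++)
        {
            for (int start = 0; start + size <= lower.Length; start++)
            {
                grams.Add(lower.Substring(start, size));
            }
        }
        return grams;
    }

    public static void Main()
    {
        HashSet<string> grams = NGrams("MyWord", 2, 6);

        // "wor" was stored as a term of its own, so finding it is an exact
        // term lookup rather than a wildcard scan.
        Console.WriteLine(grams.Contains("wor")); // True
        Console.WriteLine(grams.Contains("xyz")); // False
    }
}
```

The trade-off is index size: every word contributes many terms, which is why n-gram analyzers are usually configured with minimum and maximum gram lengths.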
I use this query to search for people's full names by typing just part of the name. It is recommended to set a minimum length for the search string.
.Search(x => x.Name, "word to search" + "*", escapeQueryOptions: EscapeQueryOptions.AllowPostfixWildcard)

Indexed databases in Random Access Memory

I'm currently writing a small test web application for a jobs search system.
I have a table Vacancies (the main table to talk about).
I need to provide a rapid AJAX update of vacancies (in a suggestion list below the input control) that match a user query. Various DBMSs provide powerful extensions such as Full-Text Search in Microsoft SQL Server, but I think that scanning a physical file takes plenty of time. My idea is to transfer the whole Vacancies table into RAM, which, in my view, makes sense because retrieving data from memory takes less time.
So if a client types in a textbox something like "pro" - the suggest list shows up with suggestions:
-product manager
-professional designer
-programmer
-programmer C#
-programmer Java
-property administrator
-provision expert
When the user types another letter, "g", the textbox value becomes "prog"
and the list is refreshed:
-programmer
-programmer C#
-programmer Java
To make that possible, I plan to create a tree index with values saved in nodes, where a vacancy prefix acts as the index key and the node values are the vacancy names. The index is built and populated only once from the data table. See what I mean below:
"pro" -> {
"product manager",
"professional designer",
"programmer",
"programmer C#",
"programmer Java",
"property administrator",
"provision expert"
}
So the index builder must analyze the list of strings and find the shortest prefixes of vacancy names.
Then, when the builder finds a string with one more letter after a previously found prefix, it creates a child tree node ("prog") (the number of values in a node decreases with depth, as the list is filtered further) and attaches it to the parent node ("pro"):
"prog" -> {
"programmer",
"programmer C#",
"programmer Java"
}
Can you advise me on the types of tree indexes that naturally fit to solve this problem?
What's the best of them by the seek time?
Thanks
This problem was solved years ago; you are recreating Lucene.
For what it's worth, the type of tree you want is a Patricia tree or a radix tree. As for storing all the data in RAM, this is a bad idea because other applications use RAM too, not just your index. I am currently ripping out someone's custom database that was implemented this way, because they thought it was a good idea, and replacing it with a real database solution.
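For illustration, prefix lookup over the vacancy names from the question could be sketched as below. Note the swap: this is a plain trie (one character per edge), the uncompressed relative of a radix or Patricia tree, which merges single-child chains into one edge. PrefixTrie, Add and StartingWith are hypothetical names, and keys are lowercased as an assumption so that matching is case-insensitive.

```csharp
using System;
using System.Collections.Generic;

public sealed class PrefixTrie
{
    private sealed class Node
    {
        // Sorted children so that results come back in a stable, ordinal order.
        public readonly SortedDictionary<char, Node> Children = new SortedDictionary<char, Node>();
        // Non-null when a complete entry ends at this node (original casing kept).
        public string Value;
    }

    private readonly Node _root = new Node();

    // Insert an entry, creating one node per character of its lowercased form.
    public void Add(string entry)
    {
        Node node = _root;
        foreach (char c in entry.ToLowerInvariant())
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.Value = entry;
    }

    // Walk down to the node for the prefix, then collect every entry below it.
    public List<string> StartingWith(string prefix)
    {
        var results = new List<string>();
        Node node = _root;
        foreach (char c in prefix.ToLowerInvariant())
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                return results; // no entry has this prefix
            }
            node = next;
        }
        Collect(node, results);
        return results;
    }

    // Depth-first traversal of the subtree rooted at the given node.
    private static void Collect(Node node, List<string> results)
    {
        if (node.Value != null)
        {
            results.Add(node.Value);
        }
        foreach (Node child in node.Children.Values)
        {
            Collect(child, results);
        }
    }
}
```

After adding the seven names from the question, StartingWith("prog") returns "programmer", "programmer C#" and "programmer Java". Because keys are lowercased, two entries that differ only in case would overwrite each other in this sketch.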

How can I retrieve a list of terms from an Examine index?

I have implemented Examine in an Umbraco project and have created an index of my site's content. What I now need is a list of terms stored in that index for any given field.
This list of terms will be the basis for an autocomplete search field of a UI form.
How can I retrieve this list of terms based upon a specific field, e.g. nodeName?
Please note, I do not want to search against the nodeName field. I wish to read/retrieve the terms in the index associated with the field.
You may try this:
reader.Terms(new Term("nodeName", ""));
This does not seem to be possible through Examine itself, but since Examine is built on the Lucene library, it is a matter of "rolling your own": open an IndexReader instance and interrogate it using the reader.Terms() method.

Lucene.NET - Can't delete docs using IndexWriter

I'm taking over a project, so I'm still learning this. The project uses Lucene.NET. I also have no idea whether this piece of functionality is correct. Anyway, I am instantiating:
var writer = new IndexWriter(directory, analyzer, false);
For specific documents, I'm calling:
writer.DeleteDocuments(new Term(...));
In the end, I'm calling the usual writer.Optimize(), writer.Commit(), and writer.Close().
The field in the Term object is a Guid, converted to a string (.ToString("D")), and is stored in the document, using Field.Store.YES, and Field.Index.NO.
However, with these settings, I cannot seem to delete these documents. The goal is to delete, then add the updated versions, so I'm getting duplicates of the same document. I can provide more code/explanation if needed. Any ideas? Thanks.
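The field must be indexed. If a field is not indexed, its terms are not in the index, so they will not show up in enumeration and DeleteDocuments has nothing to match. For an identifier such as a GUID, Field.Index.NOT_ANALYZED indexes the value as a single exact term.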
I don't think there is anything wrong with how you are handling the writer.
It sounds as if the term you are passing to DeleteDocuments is not returning any documents. Have you tried to do a query using the same term to see if it returns any results?
Also, if your goal is simply to re-create the document, you can call UpdateDocument:
// Updates a document by first deleting the document(s) containing term and
// then adding the new document. The delete and then add are atomic as seen
// by a reader on the same index (flush may happen only after the add). NOTE:
// if this method hits an OutOfMemoryError you should immediately close the
// writer. See above for details.
You may also want to check out SimpleLucene (http://simplelucene.codeplex.com) - it makes it a bit easier to do basic Lucene tasks.
[Update]
Not sure how I missed it, but @Shashikant Kore is correct: you need to make sure the field is indexed, otherwise your term query will not return anything.
