Lucene.NET TextField not being indexed - c#

Using .NET 6.0 and Lucene.NET-4.8.0-beta00016 from NuGet
I am having an issue implementing the quickstart example from the website. When using TextField in a document, the field is not indexed. The search later in the BuildIndex method retrieves no results. If TextField is changed to StringField, the example works and the search returns a valid result.
Why does StringField work and TextField doesn't? I read that StringField is not analyzed but TextField is, so perhaps it's something to do with the StandardAnalyzer?
public class LuceneFullTextSearchService
{
    private readonly IndexWriter _writer;
    private readonly Analyzer _standardAnalyzer;

    public LuceneFullTextSearchService(string indexName)
    {
        // Compatibility version
        const LuceneVersion luceneVersion = LuceneVersion.LUCENE_48;
        string indexPath = Path.Combine(Environment.CurrentDirectory, indexName);
        Directory indexDir = FSDirectory.Open(indexPath);
        // Create an analyzer to process the text
        _standardAnalyzer = new StandardAnalyzer(luceneVersion);
        // Create an index writer
        IndexWriterConfig indexConfig = new IndexWriterConfig(luceneVersion, _standardAnalyzer)
        {
            OpenMode = OpenMode.CREATE_OR_APPEND,
        };
        _writer = new IndexWriter(indexDir, indexConfig);
    }

    public void BuildIndex(string searchPath)
    {
        Document doc = new Document();
        TextField docText = new TextField("title", "Apache", Field.Store.YES);
        doc.Add(docText);
        _writer.AddDocument(doc);
        // Flush and commit the index data to the directory
        _writer.Commit();
        // Parse the user's query text
        Query query = new TermQuery(new Term("title", "Apache"));
        // Search
        using DirectoryReader reader = _writer.GetReader(applyAllDeletes: true);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs topDocs = searcher.Search(query, n: 2);
        // Show results
        Document resultDoc = searcher.Doc(topDocs.ScoreDocs[0].Doc);
        string title = resultDoc.Get("title");
    }
}

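StandardAnalyzer includes a LowerCaseFilter, so the terms indexed for your TextField are lower-cased ("apache"); only the stored value keeps its original casing. A StringField is not analyzed, so its value is indexed verbatim as the single term "Apache", which is why that variant works.
A TermQuery is not analyzed either. When you build your query, the text you use is "Apache" rather than "apache", so it does not match the lower-cased term and produces no hits.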
// Parse the user's query text
Query query = new TermQuery(new Term("title", "Apache"));
Option 1
Lowercase your search term.
// Parse the user's query text
Query query = new TermQuery(new Term("title", "Apache".ToLowerInvariant()));
Option 2
Use a QueryParser with the same analyzer you use to build the index.
QueryParser parser = new QueryParser(luceneVersion, "title", _standardAnalyzer);
Query query = parser.Parse("Apache");
The Lucene.Net.QueryParser package contains several implementations (the above example uses the Lucene.Net.QueryParsers.Classic.QueryParser).
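To check what an analyzer actually writes to the index, you can run the text through it and list the resulting terms. The following is a minimal sketch, assuming Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package installed; the class and method names (AnalyzerInspector, Analyze) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class AnalyzerInspector
{
    // Hypothetical helper: returns the terms the analyzer produces for the given text.
    public static List<string> Analyze(Analyzer analyzer, string fieldName, string text)
    {
        var terms = new List<string>();
        using (TokenStream stream = analyzer.GetTokenStream(fieldName, text))
        {
            ICharTermAttribute termAttribute = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();                      // required before the first IncrementToken()
            while (stream.IncrementToken())
            {
                terms.Add(termAttribute.ToString());
            }
            stream.End();
        }
        return terms;
    }

    public static void Main()
    {
        using (Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            // Each printed term is what a TermQuery would have to match exactly.
            foreach (string term in Analyze(analyzer, "title", "Apache"))
            {
                Console.WriteLine(term);
            }
        }
    }
}
```

With StandardAnalyzer this should list the lower-cased form of the input, which is the value a hand-built TermQuery needs to use.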

Related

How to do regular expression search using lucene.Net

I'm using Lucene.Net version 3.0.3 and want to do a regular expression search. I tried the following code:
// code
String SearchExpression = "[DM]ouglas";
const int hitsLimit = 1000000;
//state the file location of the index
string indexFileLocation = IndexLocation;
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.Open(indexFileLocation);
//create an index searcher that will perform the search
Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);
var analyzer = new WhitespaceAnalyzer();
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, new[] {
Field_Content, }, analyzer);
Term t = new Term(Field_Content, SearchExpression);
RegexQuery scriptQuery = new RegexQuery(t);
string s = string.Format("{0}", SearchExpression);
var query = parser.Parse(s);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.Add(query, Occur.MUST);
var hits = searcher.Search(booleanQuery, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;
foreach (var hit in hits)
{
    var hitDocument = searcher.Doc(hit.Doc);
    string contentValue = hitDocument.Get(Field_Content);
}
// end of code
When I search with the pattern "Do*uglas", I get results.
But if I search with the pattern "[DM]ouglas", I get the following error:
"Cannot parse '[DM]ouglas': Encountered " "]" "] "" at line 1, column 3. Was expecting one of: "TO" ... <RANGEIN_QUOTED> ... <RANGEIN_GOOP> ...".
I also tried a simple search pattern like ".ouglas", which should give me results, as I have "Douglas" in my text content.
Does anyone know how to do regular expression search using lucene.Net version 3.0.3?
The StandardQueryParser does not support regular expressions at all. It is, instead, attempting to interpret that portion of the query as a range query.
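If you wish to use regexes to search, you will need to construct a RegexQuery manually. Your code already builds one (scriptQuery) but then searches with the parsed query instead; pass the RegexQuery itself to searcher.Search. Note that RegexQuery performance tends to be poor. You might be able to improve it by switching from JavaUtilRegexCapabilities to JakartaRegexpCapabilities.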

Why is the Lucene.NET IndexSearcher returning zero results?

I recently started working with Lucene.NET and I have a problem. I used an IndexWriter to index my documents in C:\\TestIndex, which I assume worked since it generated several .fnm, .frq, .cfx, .tii and .tis files.
The problem is that when I run a simple search against them, I never get any results back. Below is the code I use:
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
//Provide the directory where index is stored
Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\\TestIndex"));
IndexReader indexReader = IndexReader.Open(directory, true);
Searcher indexSearch = new IndexSearcher(indexReader);
Analyzer std = new StandardAnalyzer(Version.LUCENE_29);
QueryParser parser = new QueryParser(Version.LUCENE_29, "text", std);
Query qry = parser.Parse("morning");
// true opens the index in read only mode
Searcher srchr = new IndexSearcher(IndexReader.Open(directory, true));
TopScoreDocCollector cllctr = TopScoreDocCollector.Create(100, true);
ScoreDoc[] hits = cllctr.TopDocs().ScoreDocs;
srchr.Search(qry, cllctr);
for (int i = 0; i < hits.Length; i++)
{
    int docId = hits[i].Doc;
    float score = hits[i].Score;
    Document doc = srchr.Doc(docId);
    Console.WriteLine("Searched from Text: " + doc.Get("text"));
}
I tried several approaches but I never get any result. Do you have any idea?
Below is indexing code,
IndexWriter indexWriter =
    new IndexWriter(
        luceneDir,
        new StandardAnalyzer(Version.LUCENE_29),
        true,
        IndexWriter.MaxFieldLength.UNLIMITED);
string[] listOfFiles = Directory.GetFiles(@"C:\Projects\lucene.net-trunk\build\vs2010\demo\MyTestProject\TestDocs");
foreach (string s in listOfFiles)
{
    String content = File.ReadAllText(s);
    Document doc = new Document();
    String title = s;
    // adding title field
    doc.Add(new Field("title", title, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.AddDocument(doc);
}
indexWriter.Optimize();
indexWriter.Dispose();
Use Luke to inspect the index and confirm that it contains data. You can also run searches in Luke to validate your search criteria.
http://www.getopt.org/luke/
EDIT: Luke works with both Lucene and Lucene.NET indexes; you will need to install Java to use it.
EDIT
Update the line
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", std);
With
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "content", std);
You have set the default search field to "text", which doesn't exist in your index.
You are also fetching the wrong field in your Console.WriteLine call.
Make sure you use the same analyzer when indexing and searching (in your case it's StandardAnalyzer I guess):
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
...
Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\\TestIndex"));
var writer = new IndexWriter(
    directory,
    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
    true,
    new MaxFieldLength(int.MaxValue));
UPDATE
I'm using a slightly different approach for searching but, anyway, maybe you need to swap these two lines:
ScoreDoc[] hits = cllctr.TopDocs().ScoreDocs;
srchr.Search(qry, cllctr);
So it becomes:
srchr.Search(qry, cllctr);
ScoreDoc[] hits = cllctr.TopDocs().ScoreDocs;
meaning that the collector first collects the results when the search is executed and then you get your scored documents via the collector instance.
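The effect of the ordering can be seen in a small in-memory test. This is only a sketch under stated assumptions: it uses the Lucene.NET 4.8 API (where the collector method is GetTopDocs() rather than the TopDocs() of the 2.9/3.0 API in the question), and the class name, field name and sample text are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class CollectorOrderDemo
{
    // Returns { hits read before Search ran, hits read after Search ran }.
    public static int[] HitCountsBeforeAndAfterSearch()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        using (var directory = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(version))
        {
            // Index a single sample document (hypothetical content).
            using (var writer = new IndexWriter(directory, new IndexWriterConfig(version, analyzer)))
            {
                var doc = new Document();
                doc.Add(new TextField("content", "Good morning", Field.Store.YES));
                writer.AddDocument(doc);
            }

            using (DirectoryReader reader = DirectoryReader.Open(directory))
            {
                var searcher = new IndexSearcher(reader);
                var query = new TermQuery(new Term("content", "morning"));

                // Wrong order: the collector has not seen any documents yet.
                var unused = TopScoreDocCollector.Create(100, true);
                int before = unused.GetTopDocs().ScoreDocs.Length;

                // Right order: search first, then read the collected hits.
                var collector = TopScoreDocCollector.Create(100, true);
                searcher.Search(query, collector);
                int after = collector.GetTopDocs().ScoreDocs.Length;

                return new[] { before, after };
            }
        }
    }

    public static void Main()
    {
        int[] counts = HitCountsBeforeAndAfterSearch();
        Console.WriteLine("before search: " + counts[0] + ", after search: " + counts[1]);
    }
}
```

Two separate collectors are used so that reading one before the search cannot influence the other.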
Could you try explicitly specifying the field you're searching? For example:
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", std);
Lucene.Net.Search.Query qry = parser.Parse("content: morning");
I think that Lucene requires you to tell it on which field(s) (title, content...) you want to run your query.

Lucene.net - How do I create a negative query, ie. search for objects NOT containing something

I'm working on an EPiServer website using a Lucene.net based search engine.
I have a query for finding only pages with a certain pageTypeId. Now I want to do the opposite: find only pages that do NOT have a certain pageTypeId. Is that possible?
This is the code for creating a query to search only for pages with pageTypeId 1, 2 or 3:
public BooleanClause GetClause()
{
    var booleanQuery = new BooleanQuery();
    var typeIds = new List<string>();
    typeIds.Add("1");
    typeIds.Add("2");
    typeIds.Add("3");
    foreach (var id in this.typeIds)
    {
        var termQuery = new TermQuery(
            new Term(IndexFieldNames.PageTypeId, id));
        var clause = new BooleanClause(termQuery,
            BooleanClause.Occur.SHOULD);
        booleanQuery.Add(clause);
    }
    return new BooleanClause(booleanQuery,
        BooleanClause.Occur.MUST);
}
I want instead to create a query where I search for pages that have a pageTypeId that is NOT "4".
I tried simply replacing "SHOULD" and "MUST" with "MUST_NOT", but that didn't work.
Thanks to @goalie7960 for replying so quickly. Here is my revised code for searching for anything except some selected page types. This search includes all documents except those with pageTypeId "1", "2" or "3":
public BooleanClause GetClause()
{
    var booleanQuery = new BooleanQuery();
    booleanQuery.Add(new MatchAllDocsQuery(),
        BooleanClause.Occur.MUST);
    var typeIds = new List<string>();
    typeIds.Add("1");
    typeIds.Add("2");
    typeIds.Add("3");
    foreach (var typeId in this.typeIds)
    {
        booleanQuery.Add(new TermQuery(
            new Term(IndexFieldNames.PageTypeId, typeId)),
            BooleanClause.Occur.MUST_NOT);
    }
    return new BooleanClause(booleanQuery,
        BooleanClause.Occur.MUST);
}
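A BooleanQuery that contains only MUST_NOT clauses matches nothing, because negative clauses only remove documents from the set matched by the other clauses. Assuming all your docs have a pageTypeId, you can use a MatchAllDocsQuery and then a MUST_NOT to remove the docs you want to skip. Something like this should work, I think: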
BooleanQuery subQuery = new BooleanQuery();
subQuery.Add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
subQuery.Add(new TermQuery(new Term(IndexFieldNames.PageTypeId, "4")), BooleanClause.Occur.MUST_NOT);
return subQuery;
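The same pattern can be tried against a throwaway in-memory index. The sketch below assumes the Lucene.NET 4.8 API (where the enum is Occur rather than BooleanClause.Occur) and the Lucene.Net.Analysis.Common package; the class name, the "pageTypeId" field name and the sample ids are hypothetical.

```csharp
using System;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class NegativeQueryDemo
{
    // Indexes one document per id, then counts the documents whose id is NOT excludedId.
    public static int CountExcluding(string[] pageTypeIds, string excludedId)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        using (var directory = new RAMDirectory())
        using (var analyzer = new KeywordAnalyzer())
        {
            using (var writer = new IndexWriter(directory, new IndexWriterConfig(version, analyzer)))
            {
                foreach (string id in pageTypeIds)
                {
                    var doc = new Document();
                    // StringField indexes the id verbatim as a single term.
                    doc.Add(new StringField("pageTypeId", id, Field.Store.YES));
                    writer.AddDocument(doc);
                }
            }

            // Start from every document, then subtract the unwanted page type.
            var query = new BooleanQuery();
            query.Add(new MatchAllDocsQuery(), Occur.MUST);
            query.Add(new TermQuery(new Term("pageTypeId", excludedId)), Occur.MUST_NOT);

            using (DirectoryReader reader = DirectoryReader.Open(directory))
            {
                var searcher = new IndexSearcher(reader);
                return searcher.Search(query, 10).TotalHits;
            }
        }
    }

    public static void Main()
    {
        // Four documents, two of them with page type "4".
        Console.WriteLine(CountExcluding(new[] { "1", "2", "4", "4" }, "4"));
    }
}
```

Dropping the MatchAllDocsQuery clause from this sketch should leave the query with nothing to subtract from, which mirrors the "replace everything with MUST_NOT" attempt in the question.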

Lucene Searcher return only one match result

My search text is "ma", and I have two Lucene documents that contain "ma" in their text, but the search returns only one document.
Below is the code:
//adding document
document.Add(new Field("Text",text,Field.Store.YES, Field.Index.TOKENIZED));
//search logic :
IndexReader reader = IndexReader.Open(GetFileInfo(indexName));
//create an index searcher that will perform the search
IndexSearcher searcher = new IndexSearcher(reader);
//List of ID
List<string> searchResultID = new List<string>();
//build a query object
QueryParser parser = new QueryParser("Text", analyzer);
parser.SetAllowLeadingWildcard(true);
Query query = parser.Parse(searchText);
//execute the query
Hits hits = searcher.Search(query);
Maybe you could use luke. It's a useful diagnostic tool that can display the contents of an existing Lucene index and do other interesting stuff. I haven't used it myself, so I'm not sure, but I think it might help you in debugging this issue. Good luck!
I was able to solve my issue:
The index must be created only once. Check whether the index already exists; if it does not, create a new one, for example:
//The last bool parameter of the IndexWriter constructor says whether to create a new index (true) or open the existing one (false)
IndexWriter writer = new IndexWriter(GetFileInfo(indexName), analyzer, true);
Passing true every time overwrites the existing index, which is why only one document was found. When adding a new document, check whether the index exists; if it does, pass false as that parameter to the IndexWriter constructor:
IndexWriter writer = new IndexWriter(GetFileInfo(indexName), analyzer, false);
writer.AddDocument(CreateDocument(Id, text, dateTime));
writer.Optimize();
writer.Close();

Why does this Lucene.Net query fail?

I am trying to convert my search functionality to allow for fuzzy searches involving multiple words. My existing search code looks like:
// Split the search into separate queries per word, and combine them into one major query
var finalQuery = new BooleanQuery();
string[] terms = searchString.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (string term in terms)
{
    // Setup the fields to search
    string[] searchfields = new string[]
    {
        // Various strings denoting the document fields available
    };
    var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, searchfields, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
    finalQuery.Add(parser.Parse(term), BooleanClause.Occur.MUST);
}
// Perform the search
var directory = FSDirectory.Open(new DirectoryInfo(LuceneIndexBaseDirectory));
var searcher = new IndexSearcher(directory, true);
var hits = searcher.Search(finalQuery, MAX_RESULTS);
This works correctly, and if I have an entity with the name field of "My name is Andrew", and I perform a search for "Andrew Name", Lucene correctly finds the correct document. Now I want to enable fuzzy searching, so that "Anderw Name" is found correctly. I changed my method to use the following code:
const int MAX_RESULTS = 10000;
const float MIN_SIMILARITY = 0.5f;
const int PREFIX_LENGTH = 3;
if (string.IsNullOrWhiteSpace(searchString))
    throw new ArgumentException("Provided search string is empty");
// Split the search into separate queries per word, and combine them into one major query
var finalQuery = new BooleanQuery();
string[] terms = searchString.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (string term in terms)
{
    // Setup the fields to search
    string[] searchfields = new string[]
    {
        // Strings denoting document field names here
    };
    // Create a subquery where the term must match at least one of the fields
    var subquery = new BooleanQuery();
    foreach (string field in searchfields)
    {
        var queryTerm = new Term(field, term);
        var fuzzyQuery = new FuzzyQuery(queryTerm, MIN_SIMILARITY, PREFIX_LENGTH);
        subquery.Add(fuzzyQuery, BooleanClause.Occur.SHOULD);
    }
    // Add the subquery to the final query; every term must match in at least one field
    finalQuery.Add(subquery, BooleanClause.Occur.MUST);
}
// Perform the search
var directory = FSDirectory.Open(new DirectoryInfo(LuceneIndexBaseDirectory));
var searcher = new IndexSearcher(directory, true);
var hits = searcher.Search(finalQuery, MAX_RESULTS);
Unfortunately, with this code if I submit the search query "Andrew Name" (same as before) I get zero results back.
The core idea is that all terms must be found in at least one document field, but each term can reside in different fields. Does anyone have any idea why my rewritten query fails?
Final Edit: Ok it turns out I was over complicating this by a LOT, and there was no need to change from my first approach. After reverting back to the first code snippet, I enabled fuzzy searching by changing
finalQuery.Add(parser.Parse(term), BooleanClause.Occur.MUST);
to
finalQuery.Add(parser.Parse(term.Replace("~", "") + "~"), BooleanClause.Occur.MUST);
Your code works for me if I rewrite the searchString to lower-case. I'm assuming that you're using the StandardAnalyzer when indexing, and it will generate lower-case terms.
You need to 1) pass your tokens through the same analyzer (to enable identical processing), 2) apply the same logic as the analyzer or 3) use an analyzer which matches the processing you do (WhitespaceAnalyzer).
You want this line:
var queryTerm = new Term(term);
to look like this:
var queryTerm = new Term(field, term);
Right now you're searching field term (which probably doesn't exist) for the empty string (which will never be found).
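The lower-casing point from the first answer can be reproduced with a hand-built FuzzyQuery. This is a rough sketch, assuming the Lucene.NET 4.8 API, where FuzzyQuery takes a maximum edit count (here 2) instead of the float minimum similarity used in the question; the class name, the "name" field and the lowerCaseTerms switch are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class FuzzySearchDemo
{
    // Indexes "My name is Andrew" and counts hits for a per-word fuzzy query.
    public static int CountMatches(string searchString, bool lowerCaseTerms)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        const int maxEdits = 2;      // replaces the float similarity of older versions
        const int prefixLength = 3;  // first three characters must match exactly

        using (var directory = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(version))
        {
            using (var writer = new IndexWriter(directory, new IndexWriterConfig(version, analyzer)))
            {
                var doc = new Document();
                doc.Add(new TextField("name", "My name is Andrew", Field.Store.YES));
                writer.AddDocument(doc);
            }

            var finalQuery = new BooleanQuery();
            string[] words = searchString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                // FuzzyQuery terms are not analyzed, so mimic the analyzer's lower-casing by hand.
                string term = lowerCaseTerms ? word.ToLowerInvariant() : word;
                finalQuery.Add(new FuzzyQuery(new Term("name", term), maxEdits, prefixLength), Occur.MUST);
            }

            using (DirectoryReader reader = DirectoryReader.Open(directory))
            {
                var searcher = new IndexSearcher(reader);
                return searcher.Search(finalQuery, 10).TotalHits;
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine("as typed:    " + CountMatches("Anderw Name", lowerCaseTerms: false));
        Console.WriteLine("lower-cased: " + CountMatches("Anderw Name", lowerCaseTerms: true));
    }
}
```

Lower-casing by hand only imitates one step of StandardAnalyzer; running each word through the analyzer itself, or using the query parser with the "~" suffix as in the final edit of the question, keeps query-time processing identical to index-time processing.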
