Lucene.net does not find matches correctly

I am new to Lucene.net. After some searching I found that I can use Lucene in my project, but now I cannot fix the bugs in my code.
Let me explain with code.
First of all, I create the index like this:
var strIndexDir = path;
Directory indexDir = FSDirectory.Open(new DirectoryInfo(strIndexDir));
Analyzer std = new StandardAnalyzer(global::Lucene.Net.Util.Version.LUCENE_30);
// The writer was not declared in the original snippet; this line is an assumed reconstruction.
var idxw = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
foreach (var res in resturant)
{
    var doc = new Document();
    // Name is analyzed, so it is searchable by whole terms.
    var restaurantName = new Field("Name",
        res.Name, Field.Store.YES,
        Field.Index.ANALYZED, Field.TermVector.YES);
    // Id, Slug and Type are stored only (Field.Index.NO), so they cannot be searched.
    var restaurantId = new Field("Id",
        res.RestaurantId.ToString(), Field.Store.YES,
        Field.Index.NO, Field.TermVector.NO);
    var restaurantSlug = new Field("Slug",
        res.Slug, Field.Store.YES,
        Field.Index.NO, Field.TermVector.NO);
    // Address is indexed as one untokenized term (NOT_ANALYZED).
    var restaurantAddress = new Field("Address",
        res.Address ?? "empty", Field.Store.YES,
        Field.Index.NOT_ANALYZED, Field.TermVector.YES);
    var resturantType = new Field("Type",
        "restaurant", Field.Store.YES,
        Field.Index.NO, Field.TermVector.NO);
    doc.Add(restaurantName);
    doc.Add(restaurantId);
    doc.Add(restaurantSlug);
    doc.Add(restaurantAddress);
    doc.Add(resturantType);
    idxw.AddDocument(doc);
}
idxw.Optimize();
idxw.Close();
I think the indexing is OK, because I only want to find restaurants by name and address.
For the search query I use this:
string strIndexDir = path;
Analyzer std = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var directory = FSDirectory.Open(new DirectoryInfo(strIndexDir));
var parserName =
    new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Name", std);
var parserAddress =
    new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Address", std);
var parserSlug =
    new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Slug", std);
var parserTitle =
    new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Title", std);
// One read-only searcher is enough; the original snippet opened the index three times.
using (var srchr = new IndexSearcher(IndexReader.Open(directory, true)))
{
    var qryName = parserName.Parse(q);
    var qryAddress = parserAddress.Parse(q);
    var qrySlug = parserSlug.Parse(q);
    var qrytitle = parserTitle.Parse(q);
    var cllctr = TopScoreDocCollector.Create(10, true);
    srchr.Search(qryName, cllctr);
    srchr.Search(qryAddress, cllctr);
    srchr.Search(qrySlug, cllctr);
    srchr.Search(qrytitle, cllctr);
    var hits = cllctr.TopDocs().ScoreDocs;
}
Now let me describe the problem.
For example, I search for the keyword "box" (q="box") and want to find the restaurant whose name is boxshaharkgharb.
The problem is that hits is always empty, but when I type the full name (q="boxshaharkgharb") the result is correct.
How can I handle that?

By using the wildcard * (for example box*) you can make Lucene match on part of a term.
If you need to do this for all queries, you should review that choice, because Lucene performs best with whole-term searches. The reason is that wildcard queries are rewritten into constant-score queries by default, while term searches use relevancy to rank results.
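To make the difference between a whole-term query and a wildcard query concrete, here is a minimal sketch. It assumes the Lucene.Net 3.0.3 API and uses an in-memory RAMDirectory with a single made-up document; the class name PrefixSearchSketch and the method CountHits are hypothetical, not part of the question's code.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

static class PrefixSearchSketch
{
    // Indexes one restaurant name in memory and returns how many hits queryText produces.
    public static int CountHits(string queryText)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var dir = new RAMDirectory())
        {
            using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                var doc = new Document();
                // StandardAnalyzer keeps "boxshaharkgharb" as a single term.
                doc.Add(new Field("Name", "boxshaharkgharb", Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            } // disposing the writer commits the document

            using (var searcher = new IndexSearcher(dir, true))
            {
                var parser = new QueryParser(Version.LUCENE_30, "Name", analyzer);
                return searcher.Search(parser.Parse(queryText), 10).TotalHits;
            }
        }
    }

    static void Main()
    {
        Console.WriteLine(CountHits("box"));   // whole-term query
        Console.WriteLine(CountHits("box*"));  // prefix (wildcard) query
    }
}
```

Under these assumptions the first call should find nothing, because no indexed term equals "box", while the second should find the document, because the parser turns box* into a prefix query.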

Related

lucene.net match if search term has no space

I'm using lucene.net to perform searches on posts in my C# ASP.NET application. This is a sample document in my index:
var doc = new Document();
var title = new Field("Title", "the album hardwired to self-destruct released", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
title.Boost = 5;
doc.Add(title);
var ns_title = new Field("NoSpace_Title", "thealbumhardwiredtoselfdesctructreleased", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
ns_title.Boost = 5;
doc.Add(ns_title);
doc.Add(new Field("Body", "the body text of the post", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.Add(new Field("Id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.AddDocument(doc);
Problem:
If I search for self, destruct, or self destruct, I get a hit.
If I search for selfdestruct, I don't get a hit.
The search method:
var searchWords = s.Split(' ').ToList();
var directory = GetDirectory();
var reader = IndexReader.Open(directory, true);
var searcher = new IndexSearcher(reader);
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, "Title,NoSpace_Title,Body".Split(','), analyzer);
var booleanQuery = new BooleanQuery();
// Title:selfdestruct*NoSpace_Title:selfdestruct*Body:selfdestruct*
s = string.Join(" ", searchWords.Select(x => x.Contains("*") ? x : x + "*"));
Query query = parser.Parse(QueryParser.Escape(s));
query.Boost = 5;
booleanQuery.Add(query, Occur.SHOULD);
// Title:*selfdestruct*,NoSpace_Title:*selfdestruct*,Body:*selfdestruct*
// (I suppose this should work and get hit but it doesn't)
s = "*" + string.Join("", searchWords) + "*";
Query query2 = parser.Parse(QueryParser.Escape(s));
query2.Boost = 3;
booleanQuery.Add(query2, Occur.SHOULD);
// Title:selfdestruct~0.85 (fuzzy search)
s = string.Join(" ", searchWords.Select(x => x.Contains("~") ? x : x + "~0.85"));
Query query3 = parser.Parse(QueryParser.Escape(s));
booleanQuery.Add(query3, Occur.SHOULD);
var collector = TopScoreDocCollector.Create(1000, true);
searcher.Search(booleanQuery, collector);
var hits = collector.TopDocs().ScoreDocs;
var docs = hits.Select(x => searcher.Doc(x.Doc)).ToList();

You can support this by adding a ShingleFilter to your analyzer.
ShingleFilter combines adjacent tokens into single tokens so that they can be found without a space. By default it also outputs unigrams (that is, it keeps the single tokens as well). So, when you index "self-destruct", it will index the tokens "self", "destruct", and "selfdestruct".
An easy way to do this without creating your own custom analyzer is to use ShingleAnalyzerWrapper:
var analyzer = new ShingleAnalyzerWrapper(
new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
2);
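To illustrate what a shingle of size 2 means, here is a plain C# sketch that does not use Lucene at all. ShingleSketch, UnigramsAndBigrams and the separator parameter are hypothetical names for illustration only. It is worth checking which separator your version of ShingleFilter puts between joined tokens: if it joins with a space, the indexed shingle is "self destruct" rather than "selfdestruct".

```csharp
using System;
using System.Collections.Generic;

static class ShingleSketch
{
    // Returns every token plus every pair of adjacent tokens joined by separator.
    public static List<string> UnigramsAndBigrams(IList<string> tokens, string separator)
    {
        var output = new List<string>();
        for (int i = 0; i < tokens.Count; i++)
        {
            output.Add(tokens[i]);                                  // the unigram
            if (i + 1 < tokens.Count)
                output.Add(tokens[i] + separator + tokens[i + 1]);  // the bigram shingle
        }
        return output;
    }

    static void Main()
    {
        var tokens = new[] { "self", "destruct" };
        Console.WriteLine(string.Join(", ", UnigramsAndBigrams(tokens, "")));
        Console.WriteLine(string.Join(", ", UnigramsAndBigrams(tokens, " ")));
    }
}
```

With an empty separator the two tokens yield self, selfdestruct, destruct; with a space they yield self, self destruct, destruct, which a query for selfdestruct would still miss.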

How to Read Lucene.net returned query results

I can't figure out how to read the results returned from a Lucene.net query.
I have this code:
Initialization
var test = new Document();
test.Add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
test.Add(new Field("title", "the title", Field.Store.YES, Field.Index.ANALYZED));
test.Add(new Field("body", "the body of the question", Field.Store.YES, Field.Index.ANALYZED));
string path = HttpRuntime.AppDomainAppPath + "\\LuceneIndex";
Lucene.Net.Store.Directory directory = FSDirectory.Open(new DirectoryInfo(path));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
writer.AddDocument(test);
writer.Optimize();
writer.Flush(true, true, true);
writer.Dispose();
directory.Dispose();
analyzer.Dispose();
Reading the data
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
QueryParser parser = new QueryParser(Version.LUCENE_30, "ti", analyzer);
string path = HttpRuntime.AppDomainAppPath + "\\LuceneIndex";
Lucene.Net.Store.Directory directory = FSDirectory.Open(new DirectoryInfo(path));
IndexSearcher searcher = new IndexSearcher(directory);
Query query = new TermQuery(new Term("body", "body"));
TopDocs docs = searcher.Search(query,5);
analyzer.Dispose();
searcher.Dispose();
I inspected the data in docs, but it doesn't contain the Id of the matched search results.

You can get the results as follows:
IndexSearcher searcher = new IndexSearcher(directory);
Query query = new TermQuery(new Term("body", "body"));
TopDocs docs = searcher.Search(query, 5);
// Get the docID and score for the top docs
ScoreDoc[] results = docs.ScoreDocs;
foreach (ScoreDoc item in results)
{
    // Get the Lucene docID
    int luceneID = item.Doc;
    // Get the actual document for the docID from the index
    Document doc = searcher.Doc(luceneID);
    // Read the stored "id" field that was added at index time
    string id = doc.Get("id");
}
Lucene(.Net) assigns its own unique docIDs to the indexed documents, and these are independent of the ID stored in your id field. You can access the actual Document and its stored fields by calling searcher.Doc(int docID), or, if you have an IndexReader, reader.Doc(int docID).

How to get percentage matchingscore values in Lucene.Net 3.0

I am upgrading our search engine to Lucene.Net 3.0.3.0.
I am also completely revising the search engine for our websites, because there were some issues with scoring.
So I am building it from the ground up (again). The first thing that strikes me as weird is that the scoring values are incomprehensible. The previous version of Lucene I was using returned scores between 0 and 1, which are easily translated to a percentage.
After upgrading I get scoring values which I am not able to translate to a percentage.
The first version only contains documents with just a Name field and an ID field, which I add with the following code:
Document doc = new Document();
doc.Add(new Field("ID", studie.ID.ToString(), Field.Store.YES, Field.Index.NO));
doc.Add(new Field("indexNaam", studie.Naam.Replace("-", " ").ToLower(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
For searching I am using the following code:
string strIndexDir = @"C:\deploys\deploy3\live\index_studies2";
Analyzer std = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
//TermQuery qry = new TermQuery(new Lucene.Net.Index.Term("indexNaam", trefwoord));
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "indexNaam", std);
Lucene.Net.Search.Query qry = parser.Parse(trefwoord);
BooleanQuery bln = new BooleanQuery();
Lucene.Net.Store.Directory directory = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir)); //Provide the directory where index is stored
Lucene.Net.Search.Searcher srchr = new Lucene.Net.Search.IndexSearcher(Lucene.Net.Index.IndexReader.Open(directory, true));
TopScoreDocCollector cllctr = TopScoreDocCollector.Create(100, true);
bln.Add(qry,Occur.MUST);
srchr.Search(bln, cllctr);
ScoreDoc[] hits = cllctr.TopDocs().ScoreDocs;
for (int i = 0; i < hits.Length; i++)
{
int docId = hits[i].Doc;
float score = hits[i].Score;
Lucene.Net.Documents.Document doc = srchr.Doc(docId);
Studie studie =
new Studie
{
ID = doc.Get("ID"),
Naam = doc.Get("Naam"),
ActualScore = score.ToString(),
Score = System.Math.Round(score).ToString()
};
studies.Add(studie);
}
I have also collected the scoring explanation and notice that the Inverse Document Frequency (idf) now determines the value of the Score property.
Is there a good way to retrieve the percentage scoring values and why was this changed?
Thanks in advance.
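One common workaround, offered here only as a sketch and not as a built-in Lucene.Net feature, is to express each raw score as a percentage of the top score for that query. As far as I know, the older Hits class scaled scores so that the top one never exceeded 1, whereas TopDocs returns raw scores. The helper below is plain C#; ScorePercentSketch and ToPercentOfTop are hypothetical names, and the resulting numbers only rank hits relative to the best hit of the same query.

```csharp
using System;
using System.Linq;

static class ScorePercentSketch
{
    // Scales raw scores so that the best hit becomes 100.
    public static int[] ToPercentOfTop(float[] scores)
    {
        if (scores.Length == 0) return new int[0];
        float top = scores.Max();
        if (top <= 0f) return new int[scores.Length];  // all zeros when nothing scored
        return scores.Select(s => (int)Math.Round(100.0 * s / top)).ToArray();
    }

    static void Main()
    {
        // Example raw scores as they might come from hits[i].Score (made-up values).
        int[] percents = ToPercentOfTop(new[] { 4.0f, 2.0f, 1.0f });
        Console.WriteLine(string.Join(", ", percents));
    }
}
```

For the made-up scores 4, 2 and 1 this gives 100, 50 and 25. A value of 100 means "best hit for this query", not "perfect match".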

Lucene.net Field contains multiple values and how to search

Does anyone know the best way to search on a Field that holds multiple values?
string tagString = "";
foreach (var tag in tags)
{
    // Concatenate all tags into one string, separated by ':'
    tagString += ":" + tag;
}
doc.Add(new Field("Tags", tagString, Field.Store.YES, Field.Index.ANALYZED));
Let's say I want to search for all documents that have the tag "csharp". How could I best implement this?

I think what you are looking for is adding multiple fields with the same name to a single Document.
You create a single Document and add several "tags" fields to it, one per tag.
RAMDirectory ramDir = new RAMDirectory();
IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
Document doc = new Document();
Field tags = null;
string [] articleTags = new string[] {"C#", "WPF", "Lucene" };
foreach (string tag in articleTags)
{
// adds a field with same name multiple times to the same document
tags = new Field("tags", tag, Field.Store.YES, Field.Index.NOT_ANALYZED);
doc.Add(tags);
}
writer.AddDocument(doc);
writer.Commit();
// search
IndexReader reader = writer.GetReader();
IndexSearcher searcher = new IndexSearcher(reader);
// use an analyzer that treats the tags field as a Keyword (Not Analyzed)
PerFieldAnalyzerWrapper aw = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
aw.AddAnalyzer("tags", new KeywordAnalyzer());
QueryParser qp = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "tags", aw);
Query q = qp.Parse("+WPF +Lucene");
TopDocs docs = searcher.Search(q, null, 100);
Console.WriteLine(docs.totalHits); // 1 hit
q = qp.Parse("+WCF +Lucene");
docs = searcher.Search(q, null, 100);
Console.WriteLine(docs.totalHits); // 0 hit

Lucene .NET search results

I'm using this code to index:
public void IndexEmployees(IEnumerable<Employee> employees)
{
var indexPath = GetIndexPath();
var directory = FSDirectory.Open(indexPath);
var indexWriter = new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED);
foreach (var employee in employees)
{
var document = new Document();
document.Add(new Field("EmployeeId", employee.EmployeeId.ToString(), Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
document.Add(new Field("Name", employee.FirstName + " " + employee.LastName, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
document.Add(new Field("OfficeName", employee.OfficeName, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
document.Add(new Field("CompetenceRatings", string.Join(" ", employee.CompetenceRatings.Select(cr => cr.Name)), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
indexWriter.AddDocument(document);
}
indexWriter.Optimize();
indexWriter.Close();
var indexReader = IndexReader.Open(directory, true);
var spell = new SpellChecker.Net.Search.Spell.SpellChecker(directory);
spell.ClearIndex();
spell.IndexDictionary(new LuceneDictionary(indexReader, "Name"));
spell.IndexDictionary(new LuceneDictionary(indexReader, "OfficeName"));
spell.IndexDictionary(new LuceneDictionary(indexReader, "CompetenceRatings"));
}
public DirectoryInfo GetIndexPath()
{
return new DirectoryInfo(HttpContext.Current.Server.MapPath("/App_Data/EmployeeIndex/"));
}
And this code to find results (as well as suggestions):
public SearchResult Search(DirectoryInfo indexPath, string[] searchFields, string searchQuery)
{
var directory = FSDirectory.Open(indexPath);
var standardAnalyzer = new StandardAnalyzer(Version.LUCENE_29);
var indexReader = IndexReader.Open(directory, true);
var indexSearcher = new IndexSearcher(indexReader);
var parser = new MultiFieldQueryParser(Version.LUCENE_29, searchFields, standardAnalyzer);
//parser.SetDefaultOperator(QueryParser.Operator.OR);
var query = parser.Parse(searchQuery);
var hits = indexSearcher.Search(query, null, 5000);
return new SearchResult
{
Suggestions = FindSuggestions(indexPath, searchQuery),
LuceneDocuments = hits
.scoreDocs
.Select(scoreDoc => indexSearcher.Doc(scoreDoc.doc))
.ToArray()
};
}
public string[] FindSuggestions(DirectoryInfo indexPath, string searchQuery)
{
var directory = FSDirectory.Open(indexPath);
var spell = new SpellChecker.Net.Search.Spell.SpellChecker(directory);
var similarWords = spell.SuggestSimilar(searchQuery, 20);
return similarWords;
}
var searchResult = Search(GetIndexPath(), new[] { "Name", "OfficeName", "CompetenceRatings" }, "admin*");
Simple queries like admin or admin* don't give me any results, although I know that there is an employee with that name. I want to be able to find James Jameson if I search for James.
Thanks!

First thing. You have to commit the changes to the index.
indexWriter.Optimize();
indexWriter.Commit(); //Add This
indexWriter.Close();
Edit#2
Also, keep it simple until you get something that works.
Comment this stuff out.
//var indexReader = IndexReader.Open(directory, true);
//var spell = new SpellChecker.Net.Search.Spell.SpellChecker(directory);
//spell.ClearIndex();
//spell.IndexDictionary(new LuceneDictionary(indexReader, "Name"));
//spell.IndexDictionary(new LuceneDictionary(indexReader, "OfficeName"));
//spell.IndexDictionary(new LuceneDictionary(indexReader, "CompetenceRatings"));
Edit#3
The fields you are searching are probably not going to change often. I would include them in your search function.
string[] fields = new string[] { "Name", "OfficeName", "CompetenceRatings" };
The biggest reason I suggest this is that field names are case-sensitive. Sometimes you won't get any results because you searched the "name" field (which doesn't exist) instead of the "Name" field. The mistake is easier to spot this way.

In my (limited) experience working with Lucene, I've found that you have to build up your own query in order to get "google" like behavior. Here is what I do, YMMV, but it generates expected results in my application. The basic idea is you combine a term query (exact match), a prefix query (anything that begins with the term), and a fuzzy query for each term in the search string. The code below won't compile, but gives you the idea
Query GetQuery(string querystring)
{
Search.Search.BooleanQuery query = new Search.Search.BooleanQuery();
Search.Analysis.TokenStream tk = StandardAnalyzerInstance.TokenStream(null, new StringReader(querystring));
Search.Analysis.Tokenattributes.TermAttribute ta = tk.GetAttribute(typeof(Search.Analysis.Tokenattributes.TermAttribute)) as Search.Analysis.Tokenattributes.TermAttribute;
while (tk.IncrementToken())
{
string term = ta.Term();
Search.Search.BooleanQuery bq = new Search.Search.BooleanQuery();
bq.Add(new Search.Search.TermQuery(new Search.Index.Term("fieldToQuery", term)), Search.Search.BooleanClause.Occur.SHOULD);
bq.Add(new Search.Search.PrefixQuery(new Search.Index.Term("fieldToQuery", term)), Search.Search.BooleanClause.Occur.SHOULD);
bq.Add(new Search.Search.FuzzyQuery(new Search.Index.Term("fieldToQuery", term)), Search.Search.BooleanClause.Occur.SHOULD);
query.Add(bq, Search.Search.BooleanClause.Occur.MUST);
}
return query;
}

That Parse() method is inherited. Have you tried using the static methods that return a Query object?
Parse(Version matchVersion, String[] queries, String[] fields, Analyzer analyzer)
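A minimal sketch of calling that static overload, assuming the Lucene.Net 3.0.3 API (the question itself uses Version.LUCENE_29). The overload takes one query string per field, so both arrays must have the same length. StaticParseSketch and BuildQuery are hypothetical names.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

static class StaticParseSketch
{
    // Builds one query that looks for the same text in each of the given fields.
    public static Query BuildQuery(string text, string[] fields)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // The static Parse overload needs one query string per field.
        var queries = new string[fields.Length];
        for (int i = 0; i < fields.Length; i++)
            queries[i] = text;
        return MultiFieldQueryParser.Parse(Version.LUCENE_30, queries, fields, analyzer);
    }

    static void Main()
    {
        string[] fields = { "Name", "OfficeName", "CompetenceRatings" };
        Query query = BuildQuery("admin*", fields);
        // Printing the query shows which fields and terms it actually targets.
        Console.WriteLine(query);
    }
}
```

Printing the parsed query is a quick way to check that the field names (which are case-sensitive) and the wildcard ended up where you expected.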
