lucene.net match if search term has no space - c#

I'm using Lucene.NET to search posts in my C# ASP.NET application. This is a sample document in my index:
var doc = new Document();
var title = new Field("Title", "the album hardwired to self-destruct released", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
title.Boost = 5;
doc.Add(title);
var ns_title = new Field("NoSpace_Title", "thealbumhardwiredtoselfdesctructreleased", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
ns_title.Boost = 5;
doc.Add(ns_title);
doc.Add(new Field("Body", "the body text of the post", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.Add(new Field("Id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.AddDocument(doc);
Problem:
If I search for self, destruct, or self destruct, I get a hit.
If I search for selfdestruct, I don't get a hit.
The search method:
var searchWords = s.Split(' ').ToList();
var directory = GetDirectory();
var reader = IndexReader.Open(directory, true);
var searcher = new IndexSearcher(reader);
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, "Title,NoSpace_Title,Body".Split(','), analyzer);
var booleanQuery = new BooleanQuery();
// Title:selfdestruct*NoSpace_Title:selfdestruct*Body:selfdestruct*
s = string.Join(" ", searchWords.Select(x => x.Contains("*") ? x : x + "*"));
Query query = parser.Parse(QueryParser.Escape(s));
query.Boost = 5;
booleanQuery.Add(query, Occur.SHOULD);
// Title:*selfdestruct*,NoSpace_Title:*selfdestruct*,Body:*selfdestruct*
// (I suppose this should work and get hit but it doesn't)
s = "*" + string.Join("", searchWords) + "*";
Query query2 = parser.Parse(QueryParser.Escape(s));
query2.Boost = 3;
booleanQuery.Add(query2, Occur.SHOULD);
// Title:selfdestruct~0.85 (fuzzy search)
s = string.Join(" ", searchWords.Select(x => x.Contains("~") ? x : x + "~0.85"));
Query query3 = parser.Parse(QueryParser.Escape(s));
booleanQuery.Add(query3, Occur.SHOULD);
var collector = TopScoreDocCollector.Create(1000, true);
searcher.Search(booleanQuery, collector);
var hits = collector.TopDocs().ScoreDocs;
var docs = hits.Select(x => searcher.Doc(x.Doc)).ToList();

You can support this by adding a ShingleFilter into your analyzer.
ShingleFilter will combine adjacent tokens into single tokens to facilitate searching for them without a space. By default, it will output Unigrams as well (that is, it will also maintain the single tokens). So, when you index "self-destruct", it will index the tokens "self", "destruct", and "selfdestruct".
An easy way to do this without creating your own custom analyzer is to use ShingleAnalyzerWrapper:
var analyzer = new ShingleAnalyzerWrapper(
new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
2);
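To make the idea of shingling concrete, here is a minimal plain C# sketch that does not use Lucene at all. It emits each token (the unigram) plus the bigram it forms with the next token. The class and method names are made up for illustration. The separator is a parameter because, as far as I know, Lucene's shingle filter joins tokens with a single space by default, so it is worth checking whether your index ends up containing "selfdestruct" or "self destruct".

```csharp
using System.Collections.Generic;

public static class ShingleSketch
{
    // Returns every token followed by the two-token shingle it forms
    // with its right-hand neighbour, joined by the given separator.
    public static List<string> Shingles(IList<string> tokens, string separator)
    {
        var result = new List<string>();
        for (int i = 0; i < tokens.Count; i++)
        {
            result.Add(tokens[i]);                                  // unigram
            if (i + 1 < tokens.Count)
                result.Add(tokens[i] + separator + tokens[i + 1]);  // bigram shingle
        }
        return result;
    }
}
```

Calling ShingleSketch.Shingles(new[] { "self", "destruct" }, "") yields self, selfdestruct, destruct; with " " as the separator the middle token becomes "self destruct" instead.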

Related

Lucene field not included in otherwise working search

In a C# Lucene boolean query across multiple fields, three of the fields (Sku, VariantSkus and Mpc) are not included in the search, while the other fields work just fine.
Using Luke, I can see that the values are stored in the index. When I search in Luke with the query contained in the searcher (taken from the debugger in Visual Studio), I get the correct results.
Example:
Using the following query: (taken directly from the query value while debugging in Visual Studio)
(+Mpc:B118^5) (+Sku:B118^5) (+Brand:B118) (+VariantSkus:B118^4) (+DisplayName:B118^3) (+DisplayName:B118*) (+DisplayName:B118~0.5) (+MisspelledNames:B118) (+Description:B118^0.4)
This doesn't work while running the code (totalHits is 0 on the searcher), but in Luke it gives the expected result of matching the Mpc to the correct product.
I'm honestly quite confused as to why the same query does not work in the C# code.
Any help or suggestions would be appreciated.
Creation of the index:
public static String CreateLuceneIndex(string basePath, HttpContext context)
{
var stopwatch = new Stopwatch();
/* get the absolute path to the directory where the indexes will be created (and if it doesn't exist, create it) */
var dirPath = context.Server.MapPath(basePath);
if (!Directory.Exists(dirPath)) Directory.CreateDirectory(dirPath);
var di = new DirectoryInfo(dirPath);
var directory = FSDirectory.Open(di);
stopwatch.Start();
/* Select the standard Lucene analyser */
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var count = 0;
var catalog = ProductCatalog.All().First();
/* Open the index writer using the selected analyser */
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
using(var mediaRepository = new ProductMediaRepository())
{
var urlService = ObjectFactory.Instance.Resolve<IUrlService>();
// Get all the visible products from uCommerce we wish to index
foreach (var product in Product.Find(p => p.DisplayOnSite && p.ParentProduct == null))
{
var url = urlService.GetUrl(catalog, product);
var doc = new Document();
doc.Add(new Field("id", product.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES));
doc.Add(new Field("Url", url ?? String.Empty, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES));
doc.Add(new Field("Src", ImageService.GetProductMainImage(mediaRepository, product).Src ?? String.Empty
, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES));
doc.Add(new Field("Sku", product.Sku ?? String.Empty, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
var varianSkus = String.Join(" ", product.Variants.Select(variant => variant.VariantSku));
doc.Add(new Field("VariantSkus", varianSkus, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
doc.Add(new Field("DisplayName", product.DisplayName() ?? product.Name ?? String.Empty, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
var brands = String.Join(" ", product.Variants.Select(variant => variant.GetPropertyValue<String>("Brand")).Where(w => !String.IsNullOrWhiteSpace(w)));
doc.Add(new Field("Brand", brands ?? String.Empty, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
doc.Add(new Field("MisspelledNames", product.GetPropertyValue<String>("MisspelledNames") ?? String.Empty,
Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
doc.Add(new Field("Description", product.ShortDescription()?.StripHtml() ?? String.Empty, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
doc.Add(new Field("Mpc", product.GetPropertyValue<String>("MPC") ?? String.Empty, Field.Store.NO, Field.Index.NOT_ANALYZED, Field.TermVector.YES));
writer.AddDocument(doc);
count++;
}
writer.Optimize();
writer.Close();
}
stopwatch.Stop();
return $"Indexed {count} products in {stopwatch.Elapsed}.\n\n";
}
Searching:
public static ListItemsDtoModel ProductSearch(String searchTerm, String indexDirPath, Int32 maxResults = Int32.MaxValue)
{
searchTerm = searchTerm.Trim().ToLowerInvariant();
var searchWords = ParseSearchWords(searchTerm);
indexDirPath = HttpContext.Current.Server.MapPath(indexDirPath);
var di = new DirectoryInfo(indexDirPath);
using (var directory = FSDirectory.Open(di))
using (var searcher = new IndexSearcher(IndexReader.Open(directory, true)))
{
var query = new BooleanQuery();
query.Add(new BooleanClause(AddTermClauseGroup("Mpc", searchWords, 5), BooleanClause.Occur.SHOULD));
query.Add(new BooleanClause(AddTermClauseGroup("Sku", searchWords, 5), BooleanClause.Occur.SHOULD));
query.Add(new BooleanClause(AddTermClauseGroup("Brand", searchWords), BooleanClause.Occur.SHOULD));
query.Add(new BooleanClause(AddTermClauseGroup("VariantSkus", searchWords, 4), BooleanClause.Occur.SHOULD));
query.Add(new BooleanClause(AddTermClauseGroup("DisplayName", searchWords, 3), BooleanClause.Occur.SHOULD));
query.Add(new BooleanClause(AddWildcardClauseGroup("DisplayName", searchWords), BooleanClause.Occur.SHOULD));
query.Add(new BooleanClause(AddFuzzyTermClauseGroup("DisplayName", searchWords), BooleanClause.Occur.SHOULD));
query.Add(new BooleanClause(AddTermClauseGroup("MisspelledNames", searchWords), BooleanClause.Occur.SHOULD));
query.Add(new BooleanClause(AddTermClauseGroup("Description", searchWords, 0.4f), BooleanClause.Occur.SHOULD));
var searchResults = searcher.Search(query, maxResults);
return AsListItemsDtoModel(searchResults.ScoreDocs.Select(sd =>
{
var document = searcher.Doc(sd.doc);
return new ImageLinkDtoModel
{
Url = document.Get("Url"),
Text = document.Get("DisplayName"),
Alt = document.Get("DisplayName"),
Src = document.Get("Src"),
};
}).ToList());
}
}
private static String[] ParseSearchWords(string searchTerm)
{
return searchTerm.Split(' ', '-')
.Where(w => !String.IsNullOrWhiteSpace(w))
.Select(QueryParser.Escape)
.ToArray();
}
private static BooleanQuery AddTermClauseGroup(String field, IEnumerable<String> searchTerms, float boost = 1f)
{
var boostStr = Math.Abs(boost-1f) > 0.001 ? "^" + boost.ToString(CultureInfo.InvariantCulture) : String.Empty;
return AddClauseGroup(searchTerms, word => new TermQuery(new Term(field, word + boostStr)));
}
private static BooleanQuery AddFuzzyTermClauseGroup(String field, IEnumerable<String> searchTerms)
{
return AddClauseGroup(searchTerms, word => new FuzzyQuery(new Term(field, word), 0.5f));
}
private static BooleanQuery AddWildcardClauseGroup(String field, IEnumerable<String> searchTerms)
{
return AddClauseGroup(searchTerms, word => new WildcardQuery(new Term(field, word + "*")));
}
private static BooleanQuery AddClauseGroup(IEnumerable<String> searchTerms, Func<String, Query> createSubClause)
{
var query = new BooleanQuery();
foreach (var searchTerm in searchTerms)
{
query.Add(new BooleanClause(createSubClause(searchTerm), BooleanClause.Occur.MUST));
}
return query;
}
The problem is in the way you are applying boosts:
return AddClauseGroup(searchTerms, word => new TermQuery(new Term(field, word + boostStr)));
You can't incorporate the boost into the term itself in this way. There is no QueryParser in play here, so QueryParser syntax like "term^4" isn't going to work. It will just search for the string "term^4" with the default boost of 1.0. A TermQuery with a boost would look like:
Query query = new TermQuery(new Term(field, word));
query.Boost = boost;
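Applied to the helper from the question, the fix might look like the sketch below. It assumes Lucene.Net 3.0.3, where Boost is a property and Occur is a top-level enum; on 2.9 the equivalents would be SetBoost(...) and BooleanClause.Occur. The class name BoostedClauses is hypothetical.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class BoostedClauses
{
    // Builds a group in which every search term must match the given field.
    // The boost is set on each TermQuery rather than appended to the term text.
    public static BooleanQuery AddTermClauseGroup(string field, IEnumerable<string> searchTerms, float boost = 1f)
    {
        var group = new BooleanQuery();
        foreach (var word in searchTerms)
        {
            var termQuery = new TermQuery(new Term(field, word));
            termQuery.Boost = boost;          // boost lives on the query object
            group.Add(termQuery, Occur.MUST);
        }
        return group;
    }
}
```

Because TermQuery bypasses the analyzer, the term text still has to match what is in the index exactly, including case for NOT_ANALYZED fields.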

Lucene.net does not find matches correctly

I am new to Lucene.NET. After some searching I found that I can use Lucene in my project,
but now I cannot fix the bugs in my code.
Let me explain with code.
First of all, I create the index like this:
var strIndexDir = path;
Directory indexDir = FSDirectory.Open(new DirectoryInfo(strIndexDir));
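Analyzer std = new StandardAnalyzer(global::Lucene.Net.Util.Version.LUCENE_30);
// Assumed: the original post omits the writer creation; idxw is presumably built like this
var idxw = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);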
foreach (var res in resturant)
{
var doc = new Document();
var restaurantName = new Field("Name",
res.Name, Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.YES);
var restaurantId = new Field("Id",
res.RestaurantId.ToString(), Field.Store.YES,
Field.Index.NO, Field.TermVector.NO);
var restaurantSlug = new Field("Slug",
res.Slug, Field.Store.YES,
Field.Index.NO, Field.TermVector.NO);
var restaurantAddress = new Field("Address",
res.Address ?? "empty", Field.Store.YES,
Field.Index.NOT_ANALYZED, Field.TermVector.YES);
var resturantType = new Field("Type",
"restaurant", Field.Store.YES,
Field.Index.NO, Field.TermVector.NO);
doc.Add(restaurantName);
doc.Add(restaurantId);
doc.Add(restaurantSlug);
doc.Add(restaurantAddress);
doc.Add(resturantType);
idxw.AddDocument(doc);
}
idxw.Optimize();
idxw.Close();
I think the indexing is OK, because I only want to find restaurants by name and address.
For the search query I use this:
string strIndexDir = path;
Analyzer std = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var indexReader = IndexReader.Open(FSDirectory.Open(path), readOnly: true);
var parserName =
new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Name", std);
var parserAddress =
new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Address", std);
var parserSlug =
new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Slug", std);
var parserTitle =
new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Title", std);
var searcher = new IndexSearcher(FSDirectory.Open(path));
using (var srchr = new IndexSearcher(IndexReader.Open(directory,true)))
{
var qryName = parserName.Parse(q);
var qryAddress = parserAddress.Parse(q);
var qrySlug = parserSlug.Parse(q);
var qrytitle = parserTitle.Parse(q);
var cllctr = TopScoreDocCollector.Create(10, true);
searcher.Search(qryName, cllctr);
searcher.Search(qryAddress, cllctr);
searcher.Search(qrySlug, cllctr);
searcher.Search(qrytitle, cllctr);
var hits = cllctr.TopDocs().ScoreDocs;
}
Now let me explain the problem.
For example, I search for the keyword "box" (q="box") and want to find the restaurant whose name is boxshaharkgharb.
The problem is that the hit count is always 0, but when I type the full name (q="boxshaharkgharb") the result is correct.
How can I handle that?
By using the wildcard * you can force Lucene to search by fragment; for example, box* matches boxshaharkgharb.
If you need to do this for all queries, you should review your choice, as Lucene performs best with whole-term searches. The reason is that by default wildcard queries turn into constant-score queries, while term searches use relevancy to rank results.
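A trailing-wildcard search can also be built without the query parser. The sketch below assumes Lucene.Net 3.0.3 and the "Name" field from the question; the class and method names are made up for illustration. Queries built by hand are not analyzed, so the fragment is lowercased here to match the terms that StandardAnalyzer writes to the index.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class FragmentSearch
{
    // Matches every document whose "Name" field contains a term
    // starting with the given fragment, e.g. "box" -> boxshaharkgharb.
    public static Query NameStartsWith(string fragment)
    {
        return new PrefixQuery(new Term("Name", fragment.ToLowerInvariant()));
    }
}
```

The resulting query can be passed to searcher.Search(...) like any other. Typing box* into a QueryParser should produce an equivalent prefix query.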

How to Read Lucene.net returned query results

I can't figure out how to read the results returned from a Lucene.net query.
I have this code:
Initialization
var test = new Document();
test.Add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
test.Add(new Field("title", "the title", Field.Store.YES, Field.Index.ANALYZED));
test.Add(new Field("body", "the body of the question", Field.Store.YES, Field.Index.ANALYZED));
string path = HttpRuntime.AppDomainAppPath + "\\LuceneIndex";
Lucene.Net.Store.Directory directory = FSDirectory.Open(new DirectoryInfo(path));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
writer.AddDocument(test);
writer.Optimize();
writer.Flush(true, true, true);
writer.Dispose();
directory.Dispose();
analyzer.Dispose();
Reading the data
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
QueryParser parser = new QueryParser(Version.LUCENE_30, "ti", analyzer);
string path = HttpRuntime.AppDomainAppPath + "\\LuceneIndex";
Lucene.Net.Store.Directory directory = FSDirectory.Open(new DirectoryInfo(path));
IndexSearcher searcher = new IndexSearcher(directory);
Query query = new TermQuery(new Term("body", "body"));
TopDocs docs = searcher.Search(query,5);
analyzer.Dispose();
searcher.Dispose();
I inspected the data in docs, but it doesn't contain the Id of the matched search results.
You can get the results as follows:
IndexSearcher searcher = new IndexSearcher(directory);
Query query = new TermQuery(new Term("body", "body"));
TopDocs docs = searcher.Search(query, 5);
// Get id and score for the top docs
ScoreDoc[] results = docs.ScoreDocs;
foreach (ScoreDoc item in results)
{
// Get Lucene docID
int luceneID = item.Doc;
// Get actual document for the docID from index
Document doc = searcher.Doc(luceneID);
// Read the stored "id" field of the matched document
string id = doc.Get("id");
}
Lucene(.Net) has its own unique docIDs for the indexed documents, which are independent of the ID stored in your ID field. You can access the actual Document and its stored fields by calling searcher.Doc(int docID), or for an IndexReader you can call reader.Doc(int docID).

Lucene.net Field contains multiple values and how to search

Does anyone know the best way to search on a Field that holds multiple values?
string tagString = "";
foreach(var tag in tags)
{
tagString += ":" + tag;
}
doc.Add(new Field("Tags", tagString, Field.Store.YES, Field.Index.ANALYZED));
Let's say I want to search for all documents that have the tag "csharp". How could I best implement this?
I think what you are looking for is adding multiple fields with the same name to a single Document.
You create a single Document and add one "tags" Field to it per tag.
RAMDirectory ramDir = new RAMDirectory();
IndexWriter writer = new IndexWriter(ramDir, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
Document doc = new Document();
Field tags = null;
string [] articleTags = new string[] {"C#", "WPF", "Lucene" };
foreach (string tag in articleTags)
{
// adds a field with same name multiple times to the same document
tags = new Field("tags", tag, Field.Store.YES, Field.Index.NOT_ANALYZED);
doc.Add(tags);
}
writer.AddDocument(doc);
writer.Commit();
// search
IndexReader reader = writer.GetReader();
IndexSearcher searcher = new IndexSearcher(reader);
// use an analyzer that treats the tags field as a Keyword (Not Analyzed)
PerFieldAnalyzerWrapper aw = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
aw.AddAnalyzer("tags", new KeywordAnalyzer());
QueryParser qp = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "tags", aw);
Query q = qp.Parse("+WPF +Lucene");
TopDocs docs = searcher.Search(q, null, 100);
Console.WriteLine(docs.totalHits); // 1 hit
q = qp.Parse("+WCF +Lucene");
docs = searcher.Search(q, null, 100);
Console.WriteLine(docs.totalHits); // 0 hit

Lucene .NET search results

I'm using this code to index:
public void IndexEmployees(IEnumerable<Employee> employees)
{
var indexPath = GetIndexPath();
var directory = FSDirectory.Open(indexPath);
var indexWriter = new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED);
foreach (var employee in employees)
{
var document = new Document();
document.Add(new Field("EmployeeId", employee.EmployeeId.ToString(), Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
document.Add(new Field("Name", employee.FirstName + " " + employee.LastName, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
document.Add(new Field("OfficeName", employee.OfficeName, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
document.Add(new Field("CompetenceRatings", string.Join(" ", employee.CompetenceRatings.Select(cr => cr.Name)), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
indexWriter.AddDocument(document);
}
indexWriter.Optimize();
indexWriter.Close();
var indexReader = IndexReader.Open(directory, true);
var spell = new SpellChecker.Net.Search.Spell.SpellChecker(directory);
spell.ClearIndex();
spell.IndexDictionary(new LuceneDictionary(indexReader, "Name"));
spell.IndexDictionary(new LuceneDictionary(indexReader, "OfficeName"));
spell.IndexDictionary(new LuceneDictionary(indexReader, "CompetenceRatings"));
}
public DirectoryInfo GetIndexPath()
{
return new DirectoryInfo(HttpContext.Current.Server.MapPath("/App_Data/EmployeeIndex/"));
}
And this code to find results (as well as suggestions):
public SearchResult Search(DirectoryInfo indexPath, string[] searchFields, string searchQuery)
{
var directory = FSDirectory.Open(indexPath);
var standardAnalyzer = new StandardAnalyzer(Version.LUCENE_29);
var indexReader = IndexReader.Open(directory, true);
var indexSearcher = new IndexSearcher(indexReader);
var parser = new MultiFieldQueryParser(Version.LUCENE_29, searchFields, standardAnalyzer);
//parser.SetDefaultOperator(QueryParser.Operator.OR);
var query = parser.Parse(searchQuery);
var hits = indexSearcher.Search(query, null, 5000);
return new SearchResult
{
Suggestions = FindSuggestions(indexPath, searchQuery),
LuceneDocuments = hits
.scoreDocs
.Select(scoreDoc => indexSearcher.Doc(scoreDoc.doc))
.ToArray()
};
}
public string[] FindSuggestions(DirectoryInfo indexPath, string searchQuery)
{
var directory = FSDirectory.Open(indexPath);
var spell = new SpellChecker.Net.Search.Spell.SpellChecker(directory);
var similarWords = spell.SuggestSimilar(searchQuery, 20);
return similarWords;
}
var searchResult = Search(GetIndexPath(), new[] { "Name", "OfficeName", "CompetenceRatings" }, "admin*");
Simple queries like admin or admin* don't give me any results, although I know there is an employee with that name. I want to be able to find James Jameson if I search for James.
Thanks!
First thing. You have to commit the changes to the index.
indexWriter.Optimize();
indexWriter.Commit(); //Add This
indexWriter.Close();
Edit#2
Also, keep it simple until you get something that works.
Comment this stuff out.
//var indexReader = IndexReader.Open(directory, true);
//var spell = new SpellChecker.Net.Search.Spell.SpellChecker(directory);
//spell.ClearIndex();
//spell.IndexDictionary(new LuceneDictionary(indexReader, "Name"));
//spell.IndexDictionary(new LuceneDictionary(indexReader, "OfficeName"));
//spell.IndexDictionary(new LuceneDictionary(indexReader, "CompetenceRatings"));
Edit#3
The fields you are searching are probably not going to change often. I would include them in your search function.
string[] fields = new string[] { "Name", "OfficeName", "CompetenceRatings" };
The biggest reason I suggest this is that field names are case-sensitive: sometimes you won't get any results because you searched the "name" field (which doesn't exist) instead of the "Name" field. The mistake is easier to spot this way.
In my (limited) experience working with Lucene, I've found that you have to build up your own query in order to get "Google"-like behavior. Here is what I do; YMMV, but it generates the expected results in my application. The basic idea is that, for each term in the search string, you combine a term query (exact match), a prefix query (anything that begins with the term), and a fuzzy query. The code below won't compile, but it gives you the idea:
Query GetQuery(string querystring)
{
Search.Search.BooleanQuery query = new Search.Search.BooleanQuery();
Search.Analysis.TokenStream tk = StandardAnalyzerInstance.TokenStream(null, new StringReader(querystring));
Search.Analysis.Tokenattributes.TermAttribute ta = tk.GetAttribute(typeof(Search.Analysis.Tokenattributes.TermAttribute)) as Search.Analysis.Tokenattributes.TermAttribute;
while (tk.IncrementToken())
{
string term = ta.Term();
Search.Search.BooleanQuery bq = new Search.Search.BooleanQuery();
bq.Add(new Search.Search.TermQuery(new Search.Index.Term("fieldToQuery", term)), Search.Search.BooleanClause.Occur.SHOULD);
bq.Add(new Search.Search.PrefixQuery(new Search.Index.Term("fieldToQuery", term)), Search.Search.BooleanClause.Occur.SHOULD);
bq.Add(new Search.Search.FuzzyQuery(new Search.Index.Term("fieldToQuery", term)), Search.Search.BooleanClause.Occur.SHOULD);
query.Add(bq, Search.Search.BooleanClause.Occur.MUST);
}
return query;
}
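For reference, a variant of the same idea that should compile is sketched below. It assumes Lucene.Net 3.0.3 (ITermAttribute, the Term property, the top-level Occur enum); the 2.9 API differs in those details. The class name, method name, and field parameter are illustrative choices, not part of the answer above.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

public static class CombinedQueryBuilder
{
    // For every analyzed token, require that the field matches it
    // exactly, as a prefix, or fuzzily.
    public static BooleanQuery Build(string field, string queryString)
    {
        var query = new BooleanQuery();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (TokenStream stream = analyzer.TokenStream(field, new StringReader(queryString)))
        {
            var termAttr = stream.AddAttribute<ITermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
            {
                string term = termAttr.Term;
                var perTerm = new BooleanQuery();
                perTerm.Add(new TermQuery(new Term(field, term)), Occur.SHOULD);   // exact match
                perTerm.Add(new PrefixQuery(new Term(field, term)), Occur.SHOULD); // begins with term
                perTerm.Add(new FuzzyQuery(new Term(field, term)), Occur.SHOULD);  // near match
                query.Add(perTerm, Occur.MUST);
            }
        }
        return query;
    }
}
```

Running the input through the same analyzer that was used at index time keeps the query terms consistent with the indexed terms (for StandardAnalyzer, lowercased and with stop words removed).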
That Parse() method is inherited. Have you tried using the static methods that return a Query object?
Parse(Version matchVersion, String[] queries, String[] fields, Analyzer analyzer)
