Lucene.net and partial "starts with" phrase search - c#

I'm looking to build an auto-complete textbox over a large number of city names. The search should work as follows: I want a "starts with" search over a multi-word phrase. For example, if the user has typed "chicago he", only locations such as "Chicago Heights" should be returned.
I'm trying to use Lucene for this, but I'm having trouble understanding how it needs to be implemented.
I've tried what I think is the approach that should work:
I've indexed locations with KeywordAnalyzer (I've tried both TOKENIZED and UN_TOKENIZED):
doc.Add(new Field("Name", data.ToLower(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO));
And search for them via the following (I've also tried a variety of other queries/analyzers/etc):
var luceneQuery = new BooleanQuery();
var wildcardQuery = new WildcardQuery(new Term("Name", "chicago hei*"));
luceneQuery.Add(wildcardQuery, BooleanClause.Occur.MUST);
I'm not getting any results. Would appreciate any advice.

To do that, you need to index your field with the Field.Index.NOT_ANALYZED setting, which is the same as the UN_TOKENIZED you use, so it should work. Here's a working sample I quickly put together to test. I'm using the latest version available on NuGet.
// Index into a local folder; "true" creates a new index.
IndexWriter iw = new IndexWriter(@"C:\temp\sotests", new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), true);
Document doc = new Document();
// NOT_ANALYZED keeps the whole value as a single term, spaces included.
Field loc = new Field("location", "", Field.Store.YES, Field.Index.NOT_ANALYZED);
doc.Add(loc);
// Reuse the same document and field, changing only the value each time.
loc.SetValue("chicago heights");
iw.AddDocument(doc);
loc.SetValue("new-york");
iw.AddDocument(doc);
loc.SetValue("chicago low");
iw.AddDocument(doc);
loc.SetValue("montreal");
iw.AddDocument(doc);
loc.SetValue("paris");
iw.AddDocument(doc);
iw.Commit();
IndexSearcher ins = new IndexSearcher(iw.GetReader());
// Two-word prefix: matches against the single untokenized term.
WildcardQuery query = new WildcardQuery(new Term("location", "chicago he*"));
var hits = ins.Search(query);
for (int i = 0; i < hits.Length(); i++)
    Console.WriteLine(hits.Doc(i).GetField("location").StringValue());
Console.WriteLine("---");
// Shorter prefix, for comparison.
query = new WildcardQuery(new Term("location", "chic*"));
hits = ins.Search(query);
for (int i = 0; i < hits.Length(); i++)
    Console.WriteLine(hits.Doc(i).GetField("location").StringValue());
iw.Close();
Console.ReadLine();
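For a pure "starts with" match, a PrefixQuery expresses the same thing as a trailing-wildcard query. What follows is only a minimal sketch, assuming Lucene.Net 3.0.3 (where the Hits class used above is no longer available) and an in-memory RAMDirectory. The class and method names PrefixSearchSketch and FindByPrefix are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper, written against the Lucene.Net 3.0.3 API (an assumption).
public static class PrefixSearchSketch
{
    public static List<string> FindByPrefix(IEnumerable<string> names, string typed)
    {
        var results = new List<string>();
        using (var dir = new RAMDirectory())
        {
            // NOT_ANALYZED keeps each name as a single term, spaces included.
            using (var writer = new IndexWriter(dir, new KeywordAnalyzer(), true,
                                                IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (string name in names)
                {
                    var doc = new Document();
                    doc.Add(new Field("location", name.ToLowerInvariant(),
                                      Field.Store.YES, Field.Index.NOT_ANALYZED));
                    writer.AddDocument(doc);
                }
            }

            using (var searcher = new IndexSearcher(dir, true))
            {
                // Lowercase the typed text the same way as the indexed values.
                var query = new PrefixQuery(new Term("location", typed.ToLowerInvariant()));
                TopDocs top = searcher.Search(query, 10);
                foreach (ScoreDoc hit in top.ScoreDocs)
                    results.Add(searcher.Doc(hit.Doc).Get("location"));
            }
        }
        return results;
    }
}
```

Calling FindByPrefix with the city names and the text typed so far returns the stored, lowercased names whose single term starts with that text. The limit of 10 hits is an arbitrary choice for an auto-complete list.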

The only way to guarantee a "starts with" search is to put a delimiter at the beginning of the indexed string, so "diamond ring" is indexed as "lucenedelimiter diamond ring lucenedelimiter". This prevents "the famous Diamond Ridge Resort" from turning up in a search for "diamond ri*".
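To see why the delimiter anchors the match, here is a small simulation in plain C# with no Lucene involved. It assumes whitespace tokenization and treats the last typed word as a prefix. The delimiter value comes from the answer above; the class and method names are illustrative only.

```csharp
using System;

public static class StartAnchorSketch
{
    // Sentinel token; any token that cannot occur in real data would do.
    public const string Delimiter = "lucenedelimiter";

    // "Diamond Ring" -> "lucenedelimiter diamond ring lucenedelimiter"
    public static string Wrap(string value)
    {
        return Delimiter + " " + value.ToLowerInvariant() + " " + Delimiter;
    }

    // True when the typed words appear as consecutive tokens right after the
    // leading delimiter, with the last typed word matched as a prefix.
    public static bool StartsWithPhrase(string indexedValue, string typed)
    {
        char[] space = { ' ' };
        string[] tokens = indexedValue.Split(space, StringSplitOptions.RemoveEmptyEntries);
        string[] wanted = (Delimiter + " " + typed.ToLowerInvariant())
            .Split(space, StringSplitOptions.RemoveEmptyEntries);
        if (wanted.Length > tokens.Length) return false;

        for (int i = 0; i < wanted.Length; i++)
        {
            bool isLast = i == wanted.Length - 1;
            bool ok = isLast
                ? tokens[i].StartsWith(wanted[i], StringComparison.Ordinal)
                : tokens[i] == wanted[i];
            if (!ok) return false;
        }
        return true;
    }
}
```

Because the typed phrase is prefixed with the delimiter before comparison, it can only line up with the first tokens of the indexed value, which is what rules out matches in the middle of a longer name.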

Related

Unable to get the searched document using Lucene.net

I'm new to Lucene.NET. I have a situation where I need to search all the documents in a folder for a keyword entered by the user.
I've indexed all the files in the folder, built a query from the keywords entered by the user, and run the search.
The problem is that I get hits, but when I iterate over them I can't read the fields from the hit documents.
Here is my code:
public void Searching()
{
    Analyzer analyzer = new StandardAnalyzer(luceneVersion.Version.LUCENE_29);
    QueryParser parser = new QueryParser(luceneVersion.Version.LUCENE_29, "content", analyzer);
    Query query = parser.Parse(txtSearchText.Text);
    Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(txtIndexPath.Text.Trim()));
    Searcher searcher = new IndexSearcher(IndexReader.Open(directory, true));
    TopScoreDocCollector collector = TopScoreDocCollector.Create(100, true);
    searcher.Search(query, collector);
    ScoreDoc[] hits = collector.TopDocs().ScoreDocs;
    foreach (ScoreDoc hit in hits)
    {
        int id = hit.Doc;
        float score = hit.Score;
        Document doc = searcher.Doc(id);
        string content = doc.Get("content"); // null
    }
}
When I debug, the content I get back is null.
Am I missing anything in my code? This has been bugging me for half a day. Please help me out.
Thanks in advance.
I tried everything I could think of. The problem was that I had been indexing without storing the id field of the document in the index.
Here is the code I used while indexing:
doc.Add(new Field("id", id, Field.Store.NO, Field.Index.ANALYZED));
It should be like the following, so that the value is available in the index:
doc.Add(new Field("id", id, Field.Store.YES, Field.Index.ANALYZED));

Lucene .NET searching

Hi, I am trying to build an autocomplete system using the Lucene library to search over 170K records.
But there is a little problem.
For example, when I search for Candice Gra(...), it brings back records like
Candice Jackson
Candice Hamilton
Candice Hayes
But not Candice Graham. To make Lucene find Candice Graham, I need to type Candice Graham exactly.
Here is the code where I build the query:
Directory directory = FSDirectory.Open(new DirectoryInfo(context.Server.MapPath("
ISet<string> stopWordSet = new HashSet<string>(stopWords);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWordSet);
IndexReader indexReader = IndexReader.Open(directory, true);
Searcher indexSearch = new IndexSearcher(indexReader);
// Single field search
var queryParser = new QueryParser(Version.LUCENE_30,
"Title",
analyzer);
string strQuery = string.Format("{0}", q);
var query = queryParser.Parse(strQuery);
I can also build strQuery like this (with * appended to the query):
string strQuery = string.Format("{0}*", q);
But this brings back irrelevant records too.
For example if i search Candice Gra(...) again it returns records like
Grass
Gravity
Gray (etc.)
By the way, I also tried KeywordAnalyzer and SimpleAnalyzer, but these did not work either.
Any ideas?
You should escape your spaces if you want them included in the search:
var query = queryParser.Parse(QueryParser.Escape(strQuery));
I think you need to put an AND keyword between these two words.
"Candice" AND "Gra"
http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#AND
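One way to combine the two suggestions is to require every completed word and apply the wildcard only to the last, partially typed word. The following is a minimal sketch in plain C#, assuming whitespace-separated input and the classic query parser syntax (`+` for a required term, a trailing `*` for a prefix). The helper name is hypothetical, and input that contains query-syntax characters would additionally need escaping, which is left out here.

```csharp
using System;
using System.Linq;

public static class AutoCompleteQuerySketch
{
    // "Candice Gra" -> "+candice +gra*"
    public static string Build(string typed)
    {
        string[] words = typed.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return string.Empty;

        // Every word is required; only the last one is treated as a prefix.
        var clauses = words.Select((w, i) =>
            i == words.Length - 1 ? "+" + w + "*" : "+" + w);
        return string.Join(" ", clauses);
    }
}
```

The resulting string would be passed to queryParser.Parse in place of strQuery. Whether it returns the expected records still depends on how the Title field was analyzed at index time.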

How do I get a list of found words using Lucene.Net?

I have indexed documents. They have content:
Document 1:
Green table stood in the room. The room was small.
Document 2:
Green tables stood in the room. The room was large.
I'm searching for "green table", which finds both Document 1 and Document 2. I want to show which phrases were found: "green table" in the first document and "green tables" in the second. How do I get the list of found words ("green table" and "green tables")? I'm using Lucene.Net version 3.0.3.
You can use the Highlighter to mark the "found words".
If you want to find them for another reason, you can still use the Highlighter and then use a regex (or a simple substring loop) to extract the words.
For example:
Query objQuery = new TermQuery(new Term("content", strQuery));
QueryScorer scorer = new QueryScorer(objQuery, "content");
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b>", "</b>");
Highlighter highlighter = new Highlighter(formatter, scorer);
// One large fragment, so the whole field text comes back as a single snippet.
highlighter.TextFragmenter = new SimpleFragmenter(9999);
for (int i = 0; i < topRelatedDocs.ScoreDocs.Length; i++)
{
    int docId = topRelatedDocs.ScoreDocs[i].Doc;
    Document doc = searcher.Doc(docId);
    TokenStream stream = TokenSources.GetAnyTokenStream(searcher.IndexReader, docId, "content", analyzer);
    string strSnippet = highlighter.GetBestFragment(stream, doc.Get("content"));
    // GetBestFragment returns null when nothing in the text matched.
    if (strSnippet == null) continue;
    // Do what you want with the snippet here: add it to your result or, for example,
    // extract the marked words (a substring loop is used here; use whatever you need).
    List<string> foundPhrases = new List<string>();
    while (strSnippet.IndexOf("<b>") > -1)
    {
        // Skip past the opening tag so only the marked text is captured.
        int indexStart = strSnippet.IndexOf("<b>") + "<b>".Length;
        int indexEnd = strSnippet.IndexOf("</b>", indexStart);
        foundPhrases.Add(strSnippet.Substring(indexStart, indexEnd - indexStart));
        // Continue after the closing tag.
        strSnippet = strSnippet.Substring(indexEnd + "</b>".Length);
    }
}
Omri
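The regex route mentioned in the answer can be sketched as follows. This is a minimal example in plain C# that assumes the `<b>` and `</b>` markers configured in the formatter above and a snippet that contains no other markup; the class and method names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HighlightExtractionSketch
{
    // Collects the text between each <b>...</b> pair in a highlighted snippet.
    public static List<string> ExtractMarked(string snippet)
    {
        var found = new List<string>();
        // Lazy match so each pair of tags is captured separately.
        foreach (Match m in Regex.Matches(snippet, "<b>(.*?)</b>"))
            found.Add(m.Groups[1].Value);
        return found;
    }
}
```

Each marked span comes back as its own entry, so if the snippet marks "Green" and "tables" separately, they would need to be joined afterwards to rebuild a phrase.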

Why is the Lucene.NET IndexSearcher returning zero results?

I recently started working with Lucene.NET and I have a problem: I used an IndexWriter to index my documents in C:\\TestIndex, which I guess worked, since it generated several .fnm, .frq, .cfx, .tii, and .tis files.
The problem is that when I try to run a simple search over them, I never get any results back. Below is the code I use:
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
//Provide the directory where index is stored
Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\\TestIndex"));
IndexReader indexReader = IndexReader.Open(directory, true);
Searcher indexSearch = new IndexSearcher(indexReader);
Analyzer std = new StandardAnalyzer(Version.LUCENE_29);
QueryParser parser = new QueryParser(Version.LUCENE_29, "text", std);
Query qry = parser.Parse("morning");
// true opens the index in read only mode
Searcher srchr = new IndexSearcher(IndexReader.Open(directory, true));
TopScoreDocCollector cllctr = TopScoreDocCollector.Create(100, true);
ScoreDoc[] hits = cllctr.TopDocs().ScoreDocs;
srchr.Search(qry, cllctr);
for (int i = 0; i < hits.Length; i++)
{
int docId = hits[i].Doc;
float score = hits[i].Score;
Document doc = srchr.Doc(docId);
Console.WriteLine("Searched from Text: " + doc.Get("text"));
}
I tried several approaches but I never get any result. Do you have any idea?
Below is indexing code,
IndexWriter indexWriter =
    new IndexWriter(
        luceneDir,
        new StandardAnalyzer(Version.LUCENE_29),
        true,
        IndexWriter.MaxFieldLength.UNLIMITED);
string[] listOfFiles = Directory.GetFiles(@"C:\Projects\lucene.net-trunk\build\vs2010\demo\MyTestProject\TestDocs");
foreach (string s in listOfFiles)
{
    String content = File.ReadAllText(s);
    Document doc = new Document();
    String title = s;
    // adding title field
    doc.Add(new Field("title", title, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.AddDocument(doc);
}
indexWriter.Optimize();
indexWriter.Dispose();
Use Luke to inspect the index and make sure it has data. You can also run searches in it to validate your search criteria.
http://www.getopt.org/luke/
EDIT - (Luke works with both Lucene and Lucene.NET indexes; you will need to install Java to use it.)
EDIT
Update the line
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", std);
With
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "content", std);
You have set the default search field to "text", which doesn't exist in the index.
You are also trying to fetch the wrong field in your Console.WriteLine call.
Make sure you use the same analyzer when indexing and searching (in your case it's StandardAnalyzer I guess):
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
...
Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\\TestIndex"));
var writer = new IndexWriter(
directory,
new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
true,
new MaxFieldLength(int.MaxValue));
UPDATE
I'm using a slightly different approach for searching but, anyway, maybe you need to swap these two lines:
ScoreDoc[] hits = cllctr.TopDocs().ScoreDocs;
srchr.Search(qry, cllctr);
So it becomes:
srchr.Search(qry, cllctr);
ScoreDoc[] hits = cllctr.TopDocs().ScoreDocs;
meaning that the collector first collects the results when the search is executed and then you get your scored documents via the collector instance.
Could you try explicitly specifying the field you're searching? For example:
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", std);
Lucene.Net.Search.Query qry = parser.Parse("content: morning");
I think that Lucene requires you to tell it on which field(s) (title, content...) you want to run your query.

Lucene Searcher return only one match result

My search text is "ma" and I have two Lucene documents that contain "ma" in their text, but I only get one document back.
Below is the code:
//adding document
document.Add(new Field("Text",text,Field.Store.YES, Field.Index.TOKENIZED));
//search logic :
IndexReader reader = IndexReader.Open(GetFileInfo(indexName));
//create an index searcher that will perform the search
IndexSearcher searcher = new IndexSearcher(reader);
//List of ID
List<string> searchResultID = new List<string>();
//build a query object
QueryParser parser = new QueryParser("Text", analyzer);
parser.SetAllowLeadingWildcard(true);
Query query = parser.Parse(searchText);
//execute the query
Hits hits = searcher.Search(query);
Maybe you could use luke. It's a useful diagnostic tool that can display the contents of an existing Lucene index and do other interesting stuff. I haven't used it myself, so I'm not sure, but I think it might help you in debugging this issue. Good luck!
I was able to solve my issue:
The index must be created only once. Check whether the index exists, and only if it does not, create a new one with the IndexWriter. For example:
// The last bool parameter of the IndexWriter constructor says whether to create a new index (true) or open the existing one (false)
IndexWriter writer = new IndexWriter(GetFileInfo(indexName), analyzer, true);
When adding a new document, check whether the index exists; if it does, pass false as the bool parameter to the IndexWriter constructor:
IndexWriter writer = new IndexWriter(GetFileInfo(indexName), analyzer, false);
writer.AddDocument(CreateDocument(Id, text, dateTime));
writer.Optimize();
writer.Close();
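The existence check described above can be sketched like this. It is only an illustration, assuming Lucene.Net 3.0.3 and a Lucene Directory instance rather than the GetFileInfo helper from the question; the class and method names are hypothetical.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical helper, written against the Lucene.Net 3.0.3 API (an assumption).
public static class AppendToIndexSketch
{
    public static void AddText(Directory directory, string text)
    {
        // create == true wipes an existing index, so only pass it when none exists yet.
        bool create = !IndexReader.IndexExists(directory);
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, create,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("Text", text, Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }
}
```

Calling AddText twice on the same directory should leave two documents in the index, because the second call opens the existing index instead of recreating it.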
