Lucene.Net fails to search keyword "The"

I am using Lucene.Net (version 3.0.3). When I search for the keyword "The",
it returns fewer than 5 results even though plenty of records match that keyword.
It works fine for all other keywords.
Does Lucene have a problem with 'The'? :-)

As stated in the comments, your problem is the Analyzer you are using.
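StandardAnalyzer does various grammar-based processing, but it also removes a default set of English stop words (a, and, or, then, etc.), and "the" is one of them.
You can create it without stop words like this:
var a = new StandardAnalyzer(version, new HashSet<string>());
Passing an empty HashSet tells the analyzer that there are no stop words. Use the same analyzer when indexing, too, because stop words dropped at index time cannot be found later.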

Related

Fuzzy search using Lucene with Azure Search .NET SDK

I am trying to use Fuzzy search in combination with partial search and match boosting, using the Azure Search .NET API.
This is what I currently have, it doesn't work yet:
// Create SearchIndexClient
searchIndexClient= new SearchIndexClient("searchServiceName", "indexName", [credentials]);
// Set search params
var searchParameters = new SearchParameters(
includeTotalResultCount: true,
queryType: QueryType.Full);
// Set search string
string searchText = "elise*~^10";
// perform search.
var result = searchIndexClient.Documents.SearchAsync(searchText, searchParameters);
There is an entry in that index with a property Name with value 'Elyse'. This entry is not found using the above code. If I change the searchText to "elyse~", the entry does get returned.
I also could not get this to work in the Azure web portal search explorer (does that thing have a name?).
What am I missing here?
I think it may be an issue with escaping, but I am not sure how to fix it.
I looked at a bunch of documentation and Stack Overflow questions on the topic, but none showed a complete answer on how to make a fuzzy search call using the .NET SDK. So please respond in the form of complete code if possible.
Many thanks in advance.

I haven't compiled your application code, but it looks correct. The issue is that wildcard queries don't combine with the fuzzy operator the way you expect here.
There is a note in the documentation that says:
You cannot use a * or ? symbol as the first character of a search. No
text analysis is performed on wildcard search queries. At query time,
wildcard query terms are compared against analyzed terms in the search
index and expanded.
This means that specifying a fuzzy operator after a wildcard doesn't have any effect, and the result is the same as not applying it. In your example, elise*~^10 is effectively elise*^10 and therefore doesn't match "elyse".
One way to express this in a query is to use the OR operator: elise~^10 OR elise*^10. This will return the document containing "elyse" because of the first clause.
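As a rough sketch of that OR approach, the search text can be assembled from a single word before it is passed to SearchAsync. The class and method names below are hypothetical, and the helper assumes the word contains no characters that need escaping in the full Lucene query syntax.

```csharp
using System;

public static class FuzzyQueryBuilder
{
    // Builds "<term>~^<boost> OR <term>*^<boost>": one fuzzy clause and
    // one prefix clause, each boosted, joined with OR.
    // Assumption: 'term' is a single word with no reserved query characters.
    public static string FuzzyOrPrefix(string term, int boost)
    {
        if (string.IsNullOrWhiteSpace(term))
            throw new ArgumentException("term must not be empty", nameof(term));

        string t = term.Trim();
        return $"{t}~^{boost} OR {t}*^{boost}";
    }
}
```

Calling FuzzyQueryBuilder.FuzzyOrPrefix("elise", 10) yields the string elise~^10 OR elise*^10, which can be used as searchText together with QueryType.Full.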

How to customize Lucene.NET to search for words with symbols without case-sensitivity (e.g. "C#" or ".net")?

The standard analyzer does not work. From what I can understand, it changes this to a search for c and net
The WhitespaceAnalyzer would work but it's case sensitive.
The general rule is search should work like Google so hoping it's a configuration thing considering .net, c# have been out there for a while or there's a workaround for this.
Per the suggestions below, I tried a custom WhitespaceAnalyzer, but keywords separated by commas with no spaces are then not handled correctly, e.g.
java,.net,c#,oracle
will not be returned while searching which would be incorrect.
I came across PatternAnalyzer which is used to split the tokens but can't figure out how to use it in this scenario.
I'm using Lucene.Net 3.0.3 and .NET 4.0

Write your own custom analyzer class similar to SynonymAnalyzer in Lucene.Net – Custom Synonym Analyzer. Your override of TokenStream could solve this by pipelining the stream using WhitespaceTokenizer and LowerCaseFilter.
Remember that your indexer and searcher need to use the same analyzer.
Update: Handling multiple comma-delimited keywords
If you only need to handle unspaced comma-delimited keywords for searching, not indexing, then you could convert the search expression expr as below.
expr = expr.Replace(',', ' ');
Then pass expr to the QueryParser. If you want to support other delimiters like ';' you could do it like this:
var terms = expr.Split(new char[] { ',', ';'} );
expr = String.Join(" ", terms);
But you also need to check for a phrase expression like "sybase,c#,.net,oracle" (expression includes the quote " chars) which should not be converted (the user is looking for an exact match):
expr = expr.Trim();
if (!(expr.StartsWith("\"") && expr.EndsWith("\"")))
{
expr = expr.Replace(',', ' ');
}
The expression might include both a phrase and some keywords, like this:
"sybase,c#,.net,oracle" server,c#,.net,sybase
Then you need to parse and translate the search expression to this:
"sybase,c#,.net,oracle" server c# .net sybase
If you also need to handle unspaced comma-delimited keywords for indexing then you need to parse the text for unspaced comma-delimited keywords and store them in a distinct field eg. Keywords (which must be associated with your custom analyzer). Then your search handler needs to convert a search expression like this:
server,c#,.net,sybase
to this:
Keywords:server Keywords:c# Keywords:.net Keywords:sybase
or more simply:
Keywords:(server c# .net sybase)
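For the mixed case described above (a quoted phrase followed by unspaced comma-delimited keywords), one possible sketch walks the expression character by character and replaces delimiters only outside double quotes. SearchExpression and NormalizeDelimiters are hypothetical names, and the sketch assumes phrases contain no escaped quote characters.

```csharp
using System;
using System.Text;

public static class SearchExpression
{
    // Replaces ',' and ';' with a space, but only outside double-quoted
    // phrases, so that exact-match phrases are passed through untouched.
    // Assumption: quotes are balanced and never escaped inside a phrase.
    public static string NormalizeDelimiters(string expr)
    {
        if (expr == null) throw new ArgumentNullException(nameof(expr));

        var sb = new StringBuilder(expr.Length);
        bool inPhrase = false;

        foreach (char c in expr)
        {
            if (c == '"')
            {
                inPhrase = !inPhrase; // every quote opens or closes a phrase
                sb.Append(c);
            }
            else if (!inPhrase && (c == ',' || c == ';'))
            {
                sb.Append(' ');
            }
            else
            {
                sb.Append(c);
            }
        }

        return sb.ToString();
    }
}
```

Given "sybase,c#,.net,oracle" server,c#,.net,sybase as input, the phrase stays as it is and the trailing keywords come back separated by spaces, ready to be passed to the QueryParser.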

Use the WhitespaceAnalyzer and chain it with a LowerCaseFilter.
Use the same chain at search and index time. By converting everything to lower case, you make the search case-insensitive.
According to your problem description, that should work and be simple to implement.

For others who might be looking for an answer as well:
the final answer turned out to be to create a custom TokenFilter and a custom Analyzer using
that token filter along with WhitespaceTokenizer, LowerCaseFilter, etc. All in all it is about 30 lines of code. I will create a blog post and post the link here when I do; I have to create a blog first!

Lucene doesn't search text having '_' [duplicate]

I am using Lucene full text search for searching in my application.
But for example, if I search for 'Turbo_Boost' it returns 0 results.
For other text it works fine.
Any Idea?

Assuming you are using the StandardTokenizer, it will split on the underscore character.
You can get around this by providing your own Tokenizer which will keep the underscore in the Token that's returned (either through a combination of Filter instances or TokenFilter instances).

A general rule of thumb with Lucene is to tokenize your search queries using the same Tokenizer/Analyzer you used to index the data.
see http://wiki.apache.org/lucene-java/LuceneFAQ#Why_is_it_important_to_use_the_same_analyzer_type_during_indexing_and_search.3F

I can only think of a few reasons why your query would fail:
First, and probably the least likely, considering other text searches fine, you didn't set the document's field to be analyzed. It won't be tokenized, so you can only search against the exact value of the whole field. Again, this one is probably not your issue.
The second (related to the third), and fairly likely, would depend on how you're executing the search. If you are not using the QueryParser (which analyzes your text the same way you index it if constructed properly) and instead, say, you are using a TermQuery like:
var tq = new TermQuery(new Term("Field", "Turbo_Boost"));
That could cause your search to fail. This has to do with the Analyzer you used to index the document splitting or changing the case of "Turbo_Boost" when it was indexed, causing the string comparison at search time to fail.
The third, and even more likely, has to do with the Analyzer class you're using to index your items, versus the one you're using to search with. Using the same analyzer is important, because each analyzer uses a different Tokenizer that splits the text into searchable terms.
Let me give you some examples using your own Turbo_Boost query on how each analyzer will split the text into terms:
KeywordAnalyzer, WhitespaceAnalyzer -> Field:Turbo_Boost
SimpleAnalyzer, StopAnalyzer -> Field:turbo Field:boost
StandardAnalyzer -> Field:turbo Field:boost
You'll notice some of the Analyzers are splitting the term on the underscore character, while KeywordAnalyzer keeps it. It is extremely important that you use the same analyzer when you search, because you may not get the same results. It can also cause issues where sometimes the query will find results and other times it won't, all this depending on the query used.
As a side note, if you are using the StandardAnalyzer, it's also important that you pass the same Version to the analyzers used by the IndexWriter and the QueryParser, because there are differences in how the parsing is done depending on which version of Lucene you expect it to emulate.
My guess is that your issue is one of the reasons above.

How to perform a 'contains' search rather than 'starts with' using Lucene.Net

We use Lucene.NET to implement a full text search on a client's website. The search itself works already but we now want to implement a modification.
Currently a * is appended to every term, which leads Lucene to perform what I would classify as a StartsWith search.
In the future we would like to have a search that performs something like a Contains rather than a StartsWith.
We use
Lucene.Net 2.9.2.2
StandardAnalyzer
default QueryParser
Samples:
(Title:Orch*) matches: Orchestra
but:
(Title:rch*) does not match: Orchestra
We want the first and the second one to both match Orchestra.
Basically I want the exact opposite of what was asked in this question. I'm not sure why Lucene performed a Contains rather than a StartsWith by default for that person:
Why is this Lucene query a "contains" instead of a "startsWith"?
How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.

First off, I assume you're using StandardAnalyzer, or something similar. The linked question fails to take into account that you search for terms; in that case, a* will match "Fleet Africa" because it's tokenized into "fleet" and "africa".
You need to call QueryParser.SetAllowLeadingWildcard(true) to be able to write queries like field:*value*. Are you actually changing the string that's passed to QueryParser?
You could parse the query as usual, and then implement a QueryVisitor that rewrites all TermQuery into WildcardQuery. That way you still support phrase searches.
I see no good things in rewriting queries into prefix- or wildcard-queries. There is very little shared between an orc, or a chest, and an Orchestra, but both words will match. Instead, hook up your customer with an analyzer that supports stemming, synonyms, and provide a spell correction feature to fix simple searching mistakes.

@Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a Shingle Filter.
Note that this will make your index massively larger, since instead of just storing "orchestra", you will store "orc", "rch", "che", "hes"... But just having a plain term query with leading wildcards will be massively slow. It will essentially have to look through every single term in your corpus.
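The terms listed above ("orc", "rch", "che", "hes"...) are character trigrams of the word. To give a feel for how many extra terms a single word produces, here is a minimal sketch in plain C# that does not use Lucene at all; CharNGrams and Of are hypothetical names chosen for the illustration.

```csharp
using System;
using System.Collections.Generic;

public static class CharNGrams
{
    // Returns every substring of length n, sliding one character at a time.
    public static List<string> Of(string term, int n)
    {
        if (term == null) throw new ArgumentNullException(nameof(term));
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));

        var grams = new List<string>();
        for (int i = 0; i + n <= term.Length; i++)
        {
            grams.Add(term.Substring(i, n));
        }
        return grams;
    }
}
```

CharNGrams.Of("orchestra", 3) returns seven terms (orc, rch, che, hes, est, str, tra) for a single nine-letter word, which is why the index grows so much. A query for "rch" then becomes an exact term match instead of a leading-wildcard scan.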

Lucene - Wildcards in phrases

I am currently attempting to use Lucene to search data populated in an index.
I can match on exact phrases by enclosing them in quotes (i.e. "Processing Documents"), but cannot get Lucene to find that phrase with a wildcard query such as "Processing Document*".
The obvious difference is the wildcard at the end.
I am currently using Luke to view and search the index. (It drops the asterisk at the end of the phrase when parsing.)
Adding the quotes around the data seems to be the main culprit, as searching for document* works but "document*" does not.
Any assistance would be greatly appreciated.

Lucene 2.9 has ComplexPhraseQueryParser which can handle wildcards in phrases.

What you're looking for is FuzzyQuery which allows one to search for results with similar words based on Levenshtein distance. Alternatively you may also want to consider using slop of PhraseQuery (also available in MultiPhraseQuery) if the order of words isn't significant.

Not only does the QueryParser not support wildcards in phrases, PhraseQuery itself only supports Terms. MultiPhraseQuery comes closer, but as its summary says, you still need to enumerate the IndexReader.terms yourself to match the wildcard.

It seems that the default QueryParser cannot handle this. You can probably create a custom QueryParser for wildcards in phrases. If your example is representative, stemming may solve your problem. Please read the documentation for PorterStemFilter to see whether it fits.

Another alternative is to use NGrams and specifically the EdgeNGram. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
This will create indexes for ngrams or parts of words.
Documents, with a min ngram size of 5 and max ngram size of 8, would index:
Docum
Docume
Documen
Document
There is a bit of a tradeoff for index size and time.
One of the Solr books quotes as a rough guide:
Indexing takes 10 times longer
Uses 5 times more disk space
Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard characters in your queries, because you aren't doing a wildcard search: you are matching a search term against n-grams (parts of words).
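To make the edge n-gram idea concrete, the following sketch generates the front-anchored prefixes of a word in plain C#, without Lucene or Solr. EdgeNGrams and Front are hypothetical names; the min and max sizes mirror the 5 and 8 used in the example above.

```csharp
using System;
using System.Collections.Generic;

public static class EdgeNGrams
{
    // Returns the prefixes of 'term' whose length lies between minGram and
    // maxGram (inclusive), capped at the length of the term itself.
    public static List<string> Front(string term, int minGram, int maxGram)
    {
        if (term == null) throw new ArgumentNullException(nameof(term));
        if (minGram < 1 || maxGram < minGram)
            throw new ArgumentOutOfRangeException(nameof(minGram));

        var grams = new List<string>();
        int upper = Math.Min(maxGram, term.Length);
        for (int len = minGram; len <= upper; len++)
        {
            grams.Add(term.Substring(0, len));
        }
        return grams;
    }
}
```

EdgeNGrams.Front("Documents", 5, 8) produces the prefixes of length 5 through 8. A user typing "Docume" then matches one of those indexed terms exactly, with no wildcard involved.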

I was also looking for the same thing, and what I found is that PrefixQuery gives you something like "Processing Document*". The catch is that the field you are searching must be untokenized, and its value must be stored in lowercase (because the field is untokenized, the indexer won't lowercase the values for you). Here is the PrefixQuery code that worked for me:
List<SearchResult> results = new List<SearchResult>();
Lucene.Net.Store.Directory searchDir = FSDirectory.GetDirectory(this._indexLocation, false);
IndexSearcher searcher = new IndexSearcher( searchDir );
Hits hits;
BooleanQuery query = new BooleanQuery();
query.Add(new PrefixQuery(new Term(FILE_NAME_KEY, keyWords.ToLower())), BooleanClause.Occur.MUST);
hits = searcher.Search(query);
this.FillResults(hits, results);

Use a SpanNearQuery with a slop of 0.
Unfortunately there's no SpanWildcardQuery in Lucene.Net. Either you'll need to use SpanMultiTermQueryWrapper, or with little effort you can convert the java version to C#.
