Lucene doesn't search text having '_' [duplicate]

Lucene doesn't search text having '_' [duplicate] - c#

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Lucene search and underscores
I am using Lucene full text search for searching in my application.
But for example, if I search for 'Turbo_Boost' it returns 0 results.
For other text it works fine.
Any Idea?

Assuming you are using the StandardTokenizer, it will split on the underscore character.
You can get around this by providing your own Tokenizer which will keep the underscore in the Token that's returned (either through a combination of Filter instances or TokenFilter instances).

A general rule of thumb with Lucene is to tokenize your search queries using the same Tokenizer/Analyzer you used to index the data.
see http://wiki.apache.org/lucene-java/LuceneFAQ#Why_is_it_important_to_use_the_same_analyzer_type_during_indexing_and_search.3F

I can only think of a few reasons why your query would fail:
First, and probably the least likely, considering other text searches fine, you didn't set the document's field to be analyzed. It won't be tokenized, so you can only search against the exact value of the whole field. Again, this one is probably not your issue.
The second (related to the third), and fairly likely, would depend on how you're executing the search. If you are not using the QueryParser (which analyzes your text the same way you index it if constructed properly) and instead say you are using a TermQuery like:
var tq = new TermQuery("Field", "Turbo_Boost");
That could cause your search to possibly fail. This has to do with the Analyzer you used to index the document splitting or changing the case of "Turbo_Boost" when it was indexed, causing the string comparison at search-time to f
The third, and even more likely, has to do with the Analyzer class you're using to index your items, versus the one you're using to search with. Using the same analyzer is important, because each analyzer uses a different Tokenizer that splits the text into searchable terms.
Let me give you some examples using your own Turbo_Boost query on how each analyzer will split the text into terms:
KeywordAnalyzer, WhitespaceAnalyzer -> Field:Turbo_Boost
SimpleAnalyzer, StopAnalyzer -> Field:turbo Field:boost
StandardAnalyzer -> Field:turbo Field:boost
You'll notice some of the Analyzers are splitting the term on the underscore character, while KeywordAnalyzer keeps it. It is extremely important that you use the same analyzer when you search, because you may not get the same results. It can also cause issues where sometimes the query will find results and other times it won't, all this depending on the query used.
As a side note, if you are using the StandardAnalyzer, it's also important that you pass it the same Version to the IndexWriter and QueryParser, because there are differences in how the parsing is done depending on which version of Lucene you expect it to emulate.
My guess your issue is one of those above reasons.

Related

Fuzzy search using Lucene with Azure Search .NET SDK

I am trying to use Fuzzy search in combination with partial search and match boosting, using the Azure Search .NET API.
This is what I currently have, it doesn't work yet:
// Create SearchIndexClient
searchIndexClient= new SearchIndexClient("searchServiceName", "indexName", [credentials]);
// Set search params
var searchParameters = new SearchParameters(
includeTotalResultCount: true,
queryType: QueryType.Full);
// Set search string
string searchText = "elise*~^10";
// perform search.
var result = searchIndexClient.Documents.SearchAsync(searchText, searchParameters);
There is an entry in that index with a property Name with value 'Elyse'. This entry is not found using the above code. If i change the searchText to "elyse~", the entry does get returned.
I also could not get this to work in the Azure web portal search explorer (does that thing have a name?).
What am I missing here?
I think it may be an issue with escaping, but I am not sure how to fix it.
I looked at a bunch of documentation and Stack Overflow questions on the topic, but none showed a complete answer on how to make a fuzzy search call using the .NET SDK. So please respond in the form of complete code if possible.
Many thanks in advance.

I haven't compiled your application code but it looks correct. The issue here is that wildcard queries don't work with fuzzy operator as you are expecting it to work here.
There is a note in the documentation that says:
You cannot use a * or ? symbol as the first character of a search. No
text analysis is performed on wildcard search queries. At query time,
wildcard query terms are compared against analyzed terms in the search
index and expanded.
This means that specifying a fuzzy operator after a wildcard doesn't have any affect and the result is the same as not applying it. In your example, elise*~^10 is effectively elise*^10 and therefore doesn't match "elyse".
One way to express this as in a query is to use OR operator. elise~^10 OR elise*^10. This will return the doc containing "elyse" because of the 1st clause.

How to convert words to links?

I have a xml with two properties: word and link.
How can I replace the words on a text to a link using the xml information.
Ex.:
XML
<word>dog</word>
<link>http://www.dog.com</link>
Text: The dog is nice.
Result: The dog is nice.
Results OK.
The problems:
1- If the text has the word dogs the result is incorret, because of "s".
2- I've tested doing a split by space on text to fix it, but if the word is composed like new year the result is incorret again.
Does anyone have any suggestions to do it and fix these problems (plural and compound words)?
Thanks for the help.

You can use Lucene.Net's contrib package Snowball for stemming (words->word , came->come , having->have etc.). But you will still have troubles with compound words

If you roll your own solution, I have had good success with the .NET pluralization capabilities:
http://msdn.microsoft.com/en-us/library/system.data.entity.design.pluralizationservices.pluralizationservice.aspx
Essentially, you can pass a word in its plural form and receive a singular version and vice versa.
This could be fairly intensive depending on how often the content changed, i.e. this wouldn't be a good choice to search thousands of words in real time.
Assuming that you can pre-process/cache the results or that the source file is small, you could:
Run Once
Identify all candidate words from the source file.
Parse/split phrases and pass them through the pluralization libraries to determine their plural counterparts.
Generate (and precompile) simple regular expressions to locate the words that you do want to match. For example, if you want to match "dog" but not "dogs" you could create a regex like dog[^s] which could then be executed against the text.
Run Whenever a Search/Replace is Needed
Run your list of source expressions against the text in question. I would suggest ordering the expressions from shortest to longest (otherwise a short expression may replace a word that was just parsed by a longer expression).
Again, this would be processor intensive to run in real-time (most solutions will be). As always, if you are parsing HTML, you should use an HTML parser, not a regular expression. In this case, you might use a proper parser to locate all text nodes and then perform the search/replace on them.
An alternative solution would be to put the text and keyword list into a database and use SQL Server Full Text Indexing which tends to be pretty smart about these things and supports intelligent match predicates. You could even combine this with a CLR stored procedure to handle things that .NET excels at (like string parsing).
Regardless of the approach, this will not be an exact science.

You're likely going to need a dictionary. Create a text file/XML file that contains both the singular and plural forms of the words you want. At runtime, load them into a Dictionary<String, String>. Then look up the value of <word/> in the dictionary and extract its singular value.

Howto perform a 'contains' search rather than 'starts with' using Lucene.Net

We use Lucene.NET to implement a full text search on a clients website. The search itself works already but we now want to implement a modification.
Currently all terms get appended a * which leads Lucene to perform what I would classify as a StartsWith search.
In the future we would like to have a search that performs something like a Contains rather than a StartsWith.
We use
Lucene.Net 2.9.2.2
StandardAnalyzer
default QueryParser
Samples:
(Title:Orch*) matches: Orchestra
but:
(Title:rch*) does not match: Orchestra
We want the first and the second one to both match Orchestra.
Basically I want the exact opposite of what was asked in this question, I'm not sure why for this person Lucene performed a Contains and rather than a StartsWith by default:
Why is this Lucene query a "contains" instead of a "startsWith"?
How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.

First off, I assume you're using StandardAnalyzer, or something similar. Your linked question fail to understand that you search for terms, and his case a* will match "Fleet Africa" because it's tokenized into "fleet" and "africa".
You need to call QueryParser.SetAllowLeadingWildcard(true) to be able to write queries like field:*value*. Are you actually changing the string that's passed to QueryParser?
You could parse the query as usual, and then implement a QueryVisitor that rewrites all TermQuery into WildcardQuery. That way you still support phrase searches.
I see no good things in rewriting queries into prefix- or wildcard-queries. There is very little shared between an orc, or a chest, and an Orchestra, but both words will match. Instead, hook up your customer with an analyzer that supports stemming, synonyms, and provide a spell correction feature to fix simple searching mistakes.

#Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a Shingle Filter.
Note that this will make your index massively larger, since instead of just storing "orchestra", you will store "orc", "rch", "che", "hes"... But just having a plain term query with leading wildcards will be massively slow. It will essentially have to look through every single term in your corpus.

Lucene - Wildcards in phrases

I am currently attempting to use Lucene to search data populated in an index.
I can match on exact phrases by enclosing it in brackets (i.e. "Processing Documents"), but cannot get Lucene to find that phrase by doing any sort of "Processing Document*".
The obvious difference being the wildcard at the end.
I am currently attempting to use Luke to view and search the index. (it drops the asterisk at the end of the phrase when parsing)
Adding the quotes around the data seems to be the main culprit as searching for document* will work, but "document*" does not
Any assistance would be greatly appreciated

Lucene 2.9 has ComplexPhraseQueryParser which can handle wildcards in phrases.

What you're looking for is FuzzyQuery which allows one to search for results with similar words based on Levenshtein distance. Alternatively you may also want to consider using slop of PhraseQuery (also available in MultiPhraseQuery) if the order of words isn't significant.

Not only does the QueryParser not support wildcards in phrases, PhraseQuery itself only supports Terms. MultiPhraseQuery comes closer, but as its summary says, you still need to enumerate the IndexReader.terms yourself to match the wildcard.

It seems that the default QueryParser cannot handle this. You can probably create a custom QueryParser for wildcards in phrases. If your example is representative, stemming may solve your problem. Please read the documentation for PorterStemFilter to see whether it fits.

Another alternative is to use NGrams and specifically the EdgeNGram. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
This will create indexes for ngrams or parts of words.
Documents, with a min ngram size of 5 and max ngram size of 8, would index:
Docum
Docume
Document
Documents
There is a bit of a tradeoff for index size and time.
One of the Solr books quotes as a rough guide:
Indexing takes 10 times longer
Uses 5 times more disk space
Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries.
As you aren't doing a wildcard search, you are matching a search term on ngrams(parts of words).

I was also looking for the same thing and what i found is PrefixQuery gives u a combination of some thing like this "Processing Document*".But the thing is your field which you are searching for should be untokenized and store it in lowercase (reason for so since it is untokenized indexer wont save your field values in lowercase) for this to work.Here is code for PrefixQuery which worked for me :-
List<SearchResult> results = new List<SearchResult>();
Lucene.Net.Store.Directory searchDir = FSDirectory.GetDirectory(this._indexLocation, false);
IndexSearcher searcher = new IndexSearcher( searchDir );
Hits hits;
BooleanQuery query = new BooleanQuery();
query.Add(new PrefixQuery(new Term(FILE_NAME_KEY, keyWords.ToLower())), BooleanClause.Occur.MUST);
hits = searcher.Search(query);
this.FillResults(hits, results);

Use a SpanNearQuery with a slop of 0.
Unfortunately there's no SpanWildcardQuery in Lucene.Net. Either you'll need to use SpanMultiTermQueryWrapper, or with little effort you can convert the java version to C#.

Efficient string matching algorithm

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.
Here are my requirements:
Given a domain name, i.e. www.example.com, determine if it "matches" one in a list of entries.
Entries may be absolute matches, i.e. www.example.com.
Entries may include wildcards, i.e. *.example.com.
Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.
Language/environment: C# (.Net Framework 3.5)
I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.
I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.
My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?

If you're looking to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.
Rather than tokenize the structure by "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one shot.
The only differences between this and a regular spell-checking tree are:
The matching needs to be done in reverse
You have to take into account the wildcards
To address point #2, you would simply check for the "*" character at the end of a test.
A quick example:
Entries:
*.fark.com
www.cnn.com
Tree:
m -> o -> c -> . -> k -> r -> a -> f -> . -> *
\
-> n -> n -> c -> . -> w -> w -> w
Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.
Checking www.cern.com would fail on the second "n" of n,n,c,...
Checking dev.www.cnn.com would also fail, since the traversal ends on a character other than "*".

I would use Regex, just make sure to have it the expression compiled once (instead of it being calculated again and again).

you don't need regexp .. just reverse all the strings,
get rid of '*', and put a flag to indicate partial match
till this point passes.
Somehow, a trie or suffix trie looks most appropriate.
If the list of domains is known at compile time, you may look at
tokenizing at '.' and using multiple gperf generated machines.
Links:
google for trie
http://marknelson.us/1996/08/01/suffix-trees/

I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.
Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.
To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.
This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.
Additionally, I would have to bet that this approach would be faster than RegEx.

You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.
Here's an article on recursive descent parsers.

Assuming the rules are as you said: literal or start with a *.
Java:
public static boolean matches(String candidate, List<String> rules) {
for(String rule : rules) {
if (rule.startsWith("*")) {
rule = rule.substring(2);
}
if (candidate.endsWith(rule)) {
return true;
}
}
return false;
}
This scales to the number of rules you have.
EDIT:
Just to be clear here.
When I say "sort the rules", I really mean create a tree out of the rule characters.
Then you use the match string to try and walk the tree (i.e. if I have a string of xyz, I start with the x character, and see if it has a y branch, and then a z child).
For the "wildcards" I'd use the same concept, but populate it "backwards", and walk it with the back of the match candidate.
If you have a LOT (LOT LOT) of rules I would sort the rules.
For non wildcard matches, you iterate for each character to narrow the possible rules (i.e. if it starts with "w", then you work with the "w" rules, etc.)
If it IS a wildcard match, you do the exact same thing, but you work against a list of "backwards rules", and simply match form the end of the string against the end of the rule.

I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.

I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.

Have a look at RegExLib

Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:
Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the values would be a hashtable of subdomains under that TLD, iterated ad nauseum.

Given your requirements, I think you're on-track in thinking about working from the end of the string (TLD) towards the hostname. You could use regular expressions, but since you're not really using any of the power of a regexp, I don't see why you'd want to incur their cost. If you reverse the strings, it becomes more apparent that you're really just looking for prefix-matching ('*.example.com' becomes: "is 'moc.elpmaxe' the beginning of my input string?), which certainly doesn't require something as heavy-handed as regexps.
What structure you use to store your list of entries depends a lot on how big the list is and how often it changes... for a huge stable list, a tree/trie may be the most performant; an often-changing list needs a structure that is easy to initialize/update, and so on. Without more information, I'd be reluctant to suggest any one structure.

I guess I am tempted to answer your question with another one: what are you doing that you believe your bottleneck is some string matching above and beyond simmple string-compare? surely something else is listed higher up on your performance profiling?
I would use the obvious string compare tests first that'll be right 90% of the time and if they fail then fallback to a regex

If it was just matching strings, then you should look at trie datastructures and algorithms. An earlier answer suggests that, if all your wildcards are a single wildcard at the beginning, there are some specific algorithms you can use. However, a requirement to handle general wildcards means that, for fast execution, you're going to need to generate a state machine.
That's what a regex library does for you: "precompiling" the regex == generating the state machine; this allows the actual match at runtime to be fast. You're unlikely to get significantly better performance than that without extraordinary optimization efforts.
If you want to roll your own, I can say that writing your own state machine generator specifically for multiple wildcards should be educational. In that case, you'll need to read up on the kind of algorithms they use in regex libraries...

Investigate the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. These allow you to search the string more quickly than linear time, at the cost of a little pre-processing. Dropping the leading asterisk is of course crucial, as others have noted.
One source of information for these is:
KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html
BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.