I was looking for good sample code for searching an index using Lucene.Net. I found one that looks promising, but I have some confusion about it. If anyone familiar with Lucene.Net could have a look at the code and explain why the author constructed it this way, I would appreciate it.
Here is the URL where I got the code:
http://www.codeproject.com/Articles/320219/Lucene-Net-ultra-fast-search-for-MVC-or-WebForms
Here is the code:
private static IEnumerable<SampleData> _search(string searchQuery, string searchField = "")
{
    // validation
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
        return new List<SampleData>();

    // set up lucene searcher
    using (var searcher = new IndexSearcher(_directory, false))
    {
        var hits_limit = 1000;
        var analyzer = new StandardAnalyzer(Version.LUCENE_29);

        // search by single field
        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_29, searchField, analyzer);
            var query = parseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hits_limit).ScoreDocs;
            var results = _mapLuceneSearchResultsToDataList(hits, searcher);
            analyzer.Close();
            searcher.Close();
            searcher.Dispose();
            return results;
        }
        // search by multiple fields (ordered by RELEVANCE)
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_29, new[] { "Id", "Name", "Description" }, analyzer);
            var query = parseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hits_limit, Sort.RELEVANCE).ScoreDocs;
            var results = _mapLuceneSearchResultsToDataList(hits, searcher);
            analyzer.Close();
            searcher.Close();
            searcher.Dispose();
            return results;
        }
    }
}
I have a couple of questions about the above routine:
1) Why does the developer replace all * and ? characters with an empty string in the search term?
2) Why search once with QueryParser and again with MultiFieldQueryParser?
3) How does the developer detect whether the search term is one word or many words separated by spaces?
4) How can a wildcard search be done using this code? Where would the code need to change to handle wildcards?
5) How can I handle searching for similar words, e.g. if someone searches for "helo", then "hello"-related results should come back?
var hits = searcher.Search(query, 1000).ScoreDocs;
6) When my search returns 5000 records and I limit it to 1000, how can I show the other 4000 in a paginated fashion? What is the purpose of the limit? I assume it is for speed, but if I specify a limit, how can I show the remaining results? What would the logic be?
I will be glad if someone can discuss all of my points. Thanks.
1) Why does the developer replace all * and ? characters with an empty string in the search term?
Because those are special characters in wildcard searches. What the author does is check whether the search query contains anything besides wildcards. You don't usually want to search for just "*", for example.
2) Why search once with QueryParser and again with MultiFieldQueryParser?
He doesn't search with the QueryParsers per se; he parses the search query (a string) into Query objects. Those objects are then consumed by a Searcher object, which performs the actual search.
3) How does the developer detect whether the search term is one word or many words separated by spaces?
That's something the Parser object takes care of, not the developer.
4) How can a wildcard search be done using this code? Where would the code need to change to handle wildcards?
The wildcards are specified in the searchQuery parameter itself. Specifying "test*", for example, counts as a wildcard search. The details are in the Lucene query syntax documentation.
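For illustration, here is a minimal sketch of running a wildcard search against such an index directly with a WildcardQuery (the _directory field and the "Name" field name are assumptions carried over from the article's sample; this is untested, and minor API details vary between Lucene.Net versions):

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

// A WildcardQuery matches terms against a pattern:
// * matches any character sequence, ? matches a single character.
var query = new WildcardQuery(new Term("Name", "te*t"));

using (var searcher = new IndexSearcher(_directory, true))
{
    var hits = searcher.Search(query, 10).ScoreDocs; // top 10 matches
    // map hits back to your data objects, as the article's
    // _mapLuceneSearchResultsToDataList does
}
```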
5) How can I handle searching for similar words, e.g. if someone searches for "helo", then "hello"-related results should come back?
I think you want a fuzzy search.
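A hedged sketch of what that might look like with Lucene.Net's FuzzyQuery (the field name and similarity threshold here are illustrative assumptions, not from the article):

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

// FuzzyQuery matches terms within an edit distance of the given term,
// so searching "helo" can still find documents containing "hello".
var fuzzy = new FuzzyQuery(new Term("Name", "helo"), 0.7f); // 0.7f = minimum similarity (assumption)
```

Equivalently, with a QueryParser you can append "~" to the term in the query string, e.g. parser.Parse("helo~").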
6) When my search returns 5000 records and I limit it to 1000, how can I show the other 4000 in a paginated fashion? What is the purpose of the limit? I assume it is for speed, but if I specify a limit, how can I show the remaining results? What would the logic be?
Here's an article about that.
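One common pagination approach (a sketch only, not the article's code; the page size, variable names, and the _directory and query objects are assumptions) is to ask the searcher for enough hits to cover the requested page, then slice out that page:

```csharp
using System.Linq;
using Lucene.Net.Search;

int page = 3;       // zero-based page index (assumption)
int pageSize = 20;  // assumption

using (var searcher = new IndexSearcher(_directory, true))
{
    // Request hits up to the end of the requested page,
    // then skip the pages before it.
    var topDocs = searcher.Search(query, (page + 1) * pageSize);
    var pageHits = topDocs.ScoreDocs.Skip(page * pageSize).Take(pageSize).ToArray();
    int totalHits = topDocs.TotalHits; // total match count, for rendering the pager
}
```

The limit is indeed there for speed and memory: Lucene only keeps the top-N scored documents, so deep pages cost more because everything before them must be collected too.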
UPD: About multiple fields, the logic is the following:
If searchField is specified, it uses the simple parser, which produces a query like searchField:value1 searchField:value2, etc.
If, however, that parameter isn't there, the MultiFieldQueryParser expands the query across the listed fields ("Id", "Name", "Description"); the searchQuery can also specify fields and values explicitly, like "field1:value1 field2:value2". An example is on the same syntax page I mentioned previously.
UPD2: Don't hesitate to look at the Java documentation and examples for Lucene, as this was originally a Java project (hence there are a lot of Java examples and tutorials). Lucene.NET is a ported project, and the two share a lot of functionality and classes.
UPD3: About fuzzy search: you might also want to implement your own analyzer for synonym search (we used that technique in one of the commercial projects I worked on, to handle common typos along with synonyms).
I am trying to use fuzzy search in combination with partial search and match boosting, using the Azure Search .NET API.
This is what I currently have; it doesn't work yet:
// Create SearchIndexClient
searchIndexClient = new SearchIndexClient("searchServiceName", "indexName", [credentials]);

// Set search params
var searchParameters = new SearchParameters(
    includeTotalResultCount: true,
    queryType: QueryType.Full);

// Set search string
string searchText = "elise*~^10";

// Perform search.
var result = searchIndexClient.Documents.SearchAsync(searchText, searchParameters);
There is an entry in that index with a Name property whose value is 'Elyse'. This entry is not found using the above code. If I change searchText to "elyse~", the entry does get returned.
I also could not get this to work in the Azure web portal's search explorer (does that thing have a name?).
What am I missing here?
I think it may be an issue with escaping, but I am not sure how to fix it.
I have looked at a bunch of documentation and Stack Overflow questions on the topic, but none showed a complete answer on how to make a fuzzy search call using the .NET SDK. So please respond in the form of complete code if possible.
Many thanks in advance.
I haven't compiled your application code, but it looks correct. The issue here is that wildcard queries don't work with the fuzzy operator the way you are expecting them to here.
There is a note in the documentation that says:
You cannot use a * or ? symbol as the first character of a search. No
text analysis is performed on wildcard search queries. At query time,
wildcard query terms are compared against analyzed terms in the search
index and expanded.
This means that specifying a fuzzy operator after a wildcard doesn't have any effect; the result is the same as not applying it. In your example, elise*~^10 is effectively elise*^10 and therefore doesn't match "elyse".
One way to express this in a query is to use the OR operator: elise~^10 OR elise*^10. This will return the doc containing "elyse" because of the first clause.
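Applied to the question's own code, only the search string needs to change; everything else stays the same (a sketch, untested):

```csharp
// Combine a fuzzy clause with a prefix clause;
// the fuzzy clause is what matches "elyse".
string searchText = "elise~^10 OR elise*^10";

var searchParameters = new SearchParameters(
    includeTotalResultCount: true,
    queryType: QueryType.Full); // full Lucene syntax is required for ~, * and ^

var result = await searchIndexClient.Documents.SearchAsync(searchText, searchParameters);
```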
I am trying to add a search feature to my application which will allow someone to enter several words and search for them in my data.
Doing single words and phrases is simple:
if (x.Title.ToUpper().Contains(tbSearch.Text.ToUpper()) || x.Description.ToUpper().Contains(tbSearch.Text.ToUpper()))
BUT how do I work out that a search for "red car" should match a title like "the car that is red"? I know I could split on spaces and then search for each term, but this seems overcomplicated, and I would also need to strip out non-word characters.
I've been looking at using regexes, but am not sure whether they would search for items in order or in any order.
I guess I'm basically trying to create a simple Google-style search in my application.
Have you considered using a proper search engine such as Lucene? The StandardAnalyzer in Lucene uses the StandardTokenizer, which takes care of (some) special characters when tokenizing. It would, for example, split "red-car" into the tokens "red" and "car", thereby "removing" special characters.
In order to search in multiple fields in a Lucene index, you could use the MultiFieldQueryParser.
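A minimal sketch of what that could look like (field names, the index location, and the version constant are assumptions for illustration):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

// Parse the user's words once, searching both Title and Description.
var parser = new MultiFieldQueryParser(
    Lucene.Net.Util.Version.LUCENE_29,
    new[] { "Title", "Description" },
    analyzer);

// A query like "red car" matches "the car that is red" too,
// because the terms are matched individually, in any order.
var query = parser.Parse("red car");

using (var searcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo("index-dir")), true))
{
    var hits = searcher.Search(query, 10).ScoreDocs;
}
```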
I think you are looking for something like this:
public static bool HasWordsContaining(this string searchCriteria, string toFilter)
{
    // Matches toFilter at the start of the string or immediately after a space,
    // so "red" matches "red car" and "big red", but not the infix in "bored".
    var regex = new Regex(string.Format("^{0}| {0}", Regex.Escape(toFilter)), RegexOptions.IgnoreCase);
    return regex.IsMatch(searchCriteria);
}
Usage:
someList.Where(x=>x.Name.HasWordsContaining(searchedText)).ToList();
You might use CONTAINSTABLE for this (the example below uses the closely related FREETEXTTABLE). You can wrap it in a stored procedure and pass in the search string.
USE AdventureWorks2012
GO
SELECT
KEY_TBL.RANK,
FT_TBL.Description
FROM
Production.ProductDescription AS FT_TBL
INNER JOIN
FREETEXTTABLE
(
Production.ProductDescription,
Description,
'perfect all-around bike'
) AS KEY_TBL
ON FT_TBL.ProductDescriptionID = KEY_TBL.[KEY]
ORDER BY KEY_TBL.RANK DESC
GO
https://msdn.microsoft.com/en-us/library/ms142583.aspx
We have created a Lucene.Net index and search based on this URL: http://sonyblogpost.blogspot.in/. But we want the output to look like the following.
Example: if I search "featured",
I want to show related terms like "featured", "featuring", "feature".
Can anyone help me?
Thanks.
To perform a fuzzy search you'll create a MultiFieldQueryParser. Below is an example of how to do this:
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, new[] { "field1", "field2" }, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
Your version of Lucene.Net may vary.
Next you will get a Fuzzy query from the parser like this:
var query = parser.GetFuzzyQuery("fieldName", "featured", 0.7f);
The float value 0.7f is the minimum similarity. You can tweak this number until you get the desired results; it cannot be more than 1.0f. Executing this query using a Lucene Searcher will give you the results you expect.
You're probably looking for stemming: Stemming English words with Lucene. The link is Java, but you should be able to identify the corresponding parts of the Lucene.Net API.
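To sketch that in Lucene.Net terms (a hedged example: the class names follow the Lucene.Net 2.9-era analysis API, which may differ in your version), a stemming analyzer can be built by wrapping the standard tokenizer in a PorterStemFilter, so that "featured", "featuring" and "feature" all reduce to the same root at index time and at query time:

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class StemmingAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenize, lowercase, then reduce each token to its Porter stem.
        return new PorterStemFilter(
            new LowerCaseFilter(
                new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader)));
    }
}
```

Remember to use the same analyzer for both indexing and searching.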
I need to implement a process wherein a text file of roughly 50-150 KB is uploaded and matched against a large number of phrases (~10k).
I need to know which phrases match specifically.
A phrase could be "blah blah blah" or just "blah", meaning I need to take word boundaries into account, as I don't wish to include infix matches.
My first attempt was to just create a large pre-compiled list of regular expressions that look like @"\b{0}\b" (as the 10k phrases are constant, I can cache and re-use this list against multiple documents).
On my brand-new and very fast PC, this matching takes 10+ seconds, which I would like to reduce a great deal.
Any advice on how I might achieve this would be greatly appreciated!
Cheers,
Dave
You could use Lucene.NET and the ShingleFilter, as long as you don't mind having a cap on the number of words a phrase can have.
public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Emit shingles (word n-grams) of up to 6 lowercased tokens.
        return new ShingleFilter(new LowerCaseFilter(new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader)), 6);
    }
}
You can run the analyzer using this utility method.
public static IEnumerable<string> GetTerms(Analyzer analyzer, string keywords)
{
    var tokenStream = analyzer.TokenStream("content", new StringReader(keywords));
    var termAttribute = tokenStream.AddAttribute<ITermAttribute>();
    var terms = new HashSet<string>();
    while (tokenStream.IncrementToken())
    {
        // HashSet.Add is a no-op for duplicates, so no Contains check is needed.
        terms.Add(termAttribute.Term);
    }
    return terms;
}
Once you've retrieved all the terms, intersect them with your phrase list.
var matchingShingles = GetTerms(new MyAnalyzer(), "Here's my stuff I want to match");
var matchingPhrases = phrasesToMatch.Intersect(matchingShingles, StringComparer.OrdinalIgnoreCase);
I think you will find this method is much faster than regex matching, and it respects word boundaries.
You can use Lucene.Net.
This will create an index of your text, so that you can run really quick queries against it. This is a "full-text index".
This article explains what it's all about:
Lucene.net
This library was originally written in Java (Lucene), but there is a port to .NET (Lucene.Net).
You must take special care when choosing the stemmer. A stemmer takes the "root" of a word, so that several similar words can match (e.g. book and books will match). If you need exact matches, then you should take (or implement) a stemmer which returns the original words unchanged.
The same stemmer must be used for creating the index and for searching the results.
You should also have a look at the query syntax, because it's very powerful and allows partial matches, exact matches, and so on.
You can also have a look at this blog.
I am currently attempting to use Lucene to search data populated in an index.
I can match exact phrases by enclosing them in quotes (i.e. "Processing Documents"), but cannot get Lucene to find that phrase with any sort of "Processing Document*".
The obvious difference being the wildcard at the end.
I am currently using Luke to view and search the index (it drops the asterisk at the end of the phrase when parsing).
Adding the quotes around the term seems to be the main culprit, as searching for document* works, but "document*" does not.
Any assistance would be greatly appreciated.
Lucene 2.9 has ComplexPhraseQueryParser which can handle wildcards in phrases.
What you're looking for is FuzzyQuery, which allows you to search for results with similar words based on Levenshtein distance. Alternatively, you may also want to consider using the slop of PhraseQuery (also available in MultiPhraseQuery) if the order of the words isn't significant.
Not only does the QueryParser not support wildcards in phrases; PhraseQuery itself only supports Terms. MultiPhraseQuery comes closer, but as its summary says, you still need to enumerate IndexReader.terms yourself to match the wildcard.
It seems that the default QueryParser cannot handle this. You could probably create a custom QueryParser for wildcards in phrases. If your example is representative, stemming may solve your problem; please read the documentation for PorterStemFilter to see whether it fits.
Another alternative is to use NGrams, specifically the EdgeNGram: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
This will create index entries for ngrams, i.e. parts of words.
"Documents", with a min ngram size of 5 and a max ngram size of 8, would index:
Docum
Docume
Document
Documents
There is a bit of a tradeoff in index size and indexing time.
One of the Solr books quotes as a rough guide:
Indexing takes 10 times longer
Uses 5 times more disk space
Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard characters in your queries: since you aren't doing a wildcard search, you are matching the search term against ngrams (parts of words).
I was also looking for the same thing, and what I found is that PrefixQuery gives you something like "Processing Document*". The catch is that the field you are searching must be untokenized and stored in lowercase (since it is untokenized, the indexer won't lowercase your field values for you). Here is the PrefixQuery code that worked for me:
List<SearchResult> results = new List<SearchResult>();
Lucene.Net.Store.Directory searchDir = FSDirectory.GetDirectory(this._indexLocation, false);
IndexSearcher searcher = new IndexSearcher(searchDir);
BooleanQuery query = new BooleanQuery();
query.Add(new PrefixQuery(new Term(FILE_NAME_KEY, keyWords.ToLower())), BooleanClause.Occur.MUST);
Hits hits = searcher.Search(query);
this.FillResults(hits, results);
Use a SpanNearQuery with a slop of 0.
Unfortunately there's no SpanWildcardQuery in Lucene.Net. Either you'll need to use SpanMultiTermQueryWrapper, or with a little effort you can convert the Java version to C#.
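A sketch of that combination (assuming a Lucene.Net version that ships SpanMultiTermQueryWrapper, e.g. 3.0.3; the field name "content" is an assumption):

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Spans;

// Wrap the wildcard term so it can participate in a span query,
// then require the two parts to be adjacent and in order (slop 0),
// effectively a phrase query with a trailing wildcard.
var first = new SpanTermQuery(new Term("content", "processing"));
var second = new SpanMultiTermQueryWrapper<WildcardQuery>(
    new WildcardQuery(new Term("content", "document*")));
var phrase = new SpanNearQuery(new SpanQuery[] { first, second }, 0, true);
```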