I am trying to add a search feature to my application which will allow someone to enter several words and search for those in my data.
Doing single words and phrases is simple:
if (x.Title.ToUpper().Contains(tbSearch.Text.ToUpper()) || x.Description.ToUpper().Contains(tbSearch.Text.ToUpper()))
BUT how do I work out if someone entered a search for "red car" and the title was "the car that is red"? I know I could split on SPACE and then search for each term, but this seems overcomplicated, and I would also need to strip out non-word characters.
I've been looking at using regexes, but I'm not sure whether they would match the terms in order or in any order.
I guess I'm trying to basically create a simple google search in my application.
Have you considered using a proper search engine such as Lucene? The StandardAnalyzer in Lucene uses the StandardTokenizer, which takes care of (some) special characters when tokenizing. It would, for example, split "red-car" into the tokens "red" and "car", thereby "removing" special characters.
In order to search in multiple fields in a Lucene index, you could use the MultiFieldQueryParser.
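A minimal sketch of what that could look like in Lucene.Net (the field names "Title" and "Description", the index path, and the searcher setup are assumptions to match the question, not code from the answer):

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// StandardAnalyzer tokenizes "red car" into the terms "red" and "car",
// so word order no longer matters; MultiFieldQueryParser searches both fields.
var version = Lucene.Net.Util.Version.LUCENE_30;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, new[] { "Title", "Description" }, analyzer);

// Terms are OR'ed by default; the parser's default operator can be switched
// to AND if every word must be present (the exact API differs per version).
Query query = parser.Parse("red car");

var indexDirectory = FSDirectory.Open(new DirectoryInfo("path/to/index"));
var searcher = new IndexSearcher(indexDirectory, true); // read-only
TopDocs topDocs = searcher.Search(query, 10);           // top 10 matches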
I think you are looking for something like this:
public static bool HasWordsContaining(this string searchCriteria, string toFilter)
{
    // Match toFilter at the start of the string or right after a space,
    // i.e. at the beginning of any word, ignoring case.
    var regex = new Regex(string.Format("^{0}| {0}", Regex.Escape(toFilter)), RegexOptions.IgnoreCase);
    return regex.IsMatch(searchCriteria);
}
Usage:
someList.Where(x => x.Name.HasWordsContaining(searchedText)).ToList();
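To cover the multi-word case from the original question (all terms, any order), a hedged follow-up sketch: split the search text and require every term to match.

// Split on whitespace, drop empty entries, and keep only the items
// where every entered term matches the start of some word in the name.
var terms = searchedText.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
var results = someList
    .Where(x => terms.All(t => x.Name.HasWordsContaining(t)))
    .ToList();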
You might use CONTAINSTABLE for this (the MSDN example below uses the closely related FREETEXTTABLE). You can use a SPROC and pass in the search string.
USE AdventureWorks2012
GO

SELECT
    KEY_TBL.RANK,
    FT_TBL.Description
FROM
    Production.ProductDescription AS FT_TBL
INNER JOIN
    FREETEXTTABLE
    (
        Production.ProductDescription,
        Description,
        'perfect all-around bike'
    ) AS KEY_TBL
    ON FT_TBL.ProductDescriptionID = KEY_TBL.[KEY]
ORDER BY KEY_TBL.RANK DESC
GO
https://msdn.microsoft.com/en-us/library/ms142583.aspx
I need to implement a process, wherein a text file of roughly 50/150kb is uploaded, and matched against a large number of phrases (~10k).
I need to know which phrases match specifically.
A phrase could be "blah blah blah" or just "blah" - meaning I need to take word boundaries into account, as I don't wish to include infix matches.
My first attempt was to just create a large pre-compiled list of regular expressions that look like @"\b{0}\b" (as the 10k phrases are constant, I can cache and re-use this same list against multiple documents).
On my brand-new and very fast PC, this matching takes 10+ seconds, which I would like to reduce a great deal.
Any advice on how I may be able to achieve this would be greatly appreciated!
Cheers,
Dave
You could use Lucene.NET and the ShingleFilter, as long as you don't mind having a cap on the number of words a phrase can have.
public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Lowercase the tokens, then emit shingles (word n-grams) of up to 6 words.
        return new ShingleFilter(new LowerCaseFilter(new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader)), 6);
    }
}
You can run the analyzer using this utility method.
public static IEnumerable<string> GetTerms(Analyzer analyzer, string keywords)
{
    var tokenStream = analyzer.TokenStream("content", new StringReader(keywords));
    var termAttribute = tokenStream.AddAttribute<ITermAttribute>();
    var terms = new HashSet<string>();
    while (tokenStream.IncrementToken())
    {
        // HashSet ignores duplicates, so no Contains check is needed.
        terms.Add(termAttribute.Term);
    }
    return terms;
}
Once you've retrieved all the terms, do an intersect with your words list.
var matchingShingles = GetTerms(new MyAnalyzer(), "Here's my stuff I want to match");
var matchingPhrases = phrasesToMatch.Intersect(matchingShingles, StringComparer.OrdinalIgnoreCase);
I think you will find this method much faster than Regex matching, and it respects word boundaries.
You can use Lucene.Net
This will create an index of your text, so that you can make really quick queries against it. This is a "full text index".
This article explains what it's all about:
Lucene.net
This library was originally written in Java (Lucene), but there is a port to .NET (Lucene.Net).
You must take special care when choosing the stemmer. A stemmer takes the "root" of a word, so that several similar words can match (e.g. "book" and "books" will match). If you need exact matches, then you should take (or implement) a stemmer which returns the original words unchanged.
The same stemmer must be used for creating the index and for searching the results.
You must also have a look at the query syntax, because it's very powerful and allows for partial matches, exact matches, and so on.
You can also have a look at this blog.
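To make the stemmer advice concrete, here is a minimal sketch (not from the answer) of a custom analyzer that applies Lucene.Net's PorterStemFilter; the same analyzer would be passed both to the IndexWriter and to the QueryParser:

using System.IO;
using Lucene.Net.Analysis;

public class StemmingAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Lowercase each token, then reduce it to its Porter stem,
        // so "book" and "books" index and search identically.
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}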
My C# code stores some text.
I want to fetch some words without a known pattern which appear among words with known patterns. I don't want to fetch the words with the patterns.
i.e.
My company! 02-45895438 more details: myDomain.mysite.com
can I fetch like this?
<vendorName?>\\s*\\d{2}-\\d{8}\\s*more details: <site?>
vendorName = "My company!" or "My company! "
site = "myDomain.mysite.com"
Is there any way to do so with regex?
From your description, it seems like you want to find "myDomain.mysite.com" in the string "My company! 02-45895438 more details: myDomain.mysite.com". If that's the case, you can use a regex similar to this one to get the string you want:
(?<=My company! 02-45895438 more details: ).*
That should give you the substring based on the preceding match, but will omit the match itself from the capture.
You can do this by using parentheses. For example, this will give you the contents of a bold tag:
<b>([^<]+)</b>
You can then use Regex.Match to get a Match object, then get the groups via Match.Groups. Each group is a set of parentheses, so in this case there's one group that contains the tag's content.
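A minimal sketch of that flow (the html variable and its contents are illustrative):

using System.Text.RegularExpressions;

string html = "some <b>bold text</b> here";
Match match = Regex.Match(html, "<b>([^<]+)</b>");
if (match.Success)
{
    // Groups[0] is the whole match; Groups[1] is the first parenthesized group.
    string content = match.Groups[1].Value; // "bold text"
}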
This is the syntax I was looking for:
(?<TheServer>\w*)
like in:
string matchPattern = @"\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\";
see
http://en.csharp-online.net/CSharp_Regular_Expression_Recipes%E2%80%94Extracting_Groups_from_a_MatchCollection
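A short usage sketch for the named groups (the example UNC path is illustrative):

using System.Text.RegularExpressions;

string matchPattern = @"\\\\(?<TheServer>\w*)\\(?<TheService>\w*)\\";
Match m = Regex.Match(@"\\myServer\myService\", matchPattern);
string server = m.Groups["TheServer"].Value;   // "myServer"
string service = m.Groups["TheService"].Value; // "myService"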
There is a list of banned words (or strings, to be more general) and another list with, let's say, users' mails. I would like to excise all banned words from all mails.
Trivial example:
foreach (string word in wordsList)
{
    for (int i = 0; i < mailList.Count; i++)
    {
        // Strings are immutable - Replace returns a new string,
        // so the result has to be stored back.
        mailList[i] = mailList[i].Replace(word, String.Empty);
    }
}
How I can improve this algorithm?
Thanks for the advice. I voted a few answers up, but I didn't mark any as the answer since it was more discussion than solution. Some people confused banned words with bad words. In my case I don't have to bother with recognizing 'sh1t' or anything like that.
Simple approaches to profanity filtering won't work - and complex approaches don't work either, for the most part.
What happens when you get a word like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead - the intent is still clear, right?
See How do you implement a good profanity filter? for extensive discussion.
You could use RegEx to make things a little cleaner:
var bannedWords = @"\b(this|is|the|list|of|banned|words)\b";

foreach (var mail in mailList)
{
    var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
}
Even that, though, is far from perfect since people will always figure out a way around any type of filter.
You'll get the best performance by drawing up a finite state machine (FSM), or generating one, and then parsing your input one character at a time, walking through the states.
You can do this pretty easily with a function that takes your next input character and your current state and returns the next state; you produce output as you walk through the mail message's characters. You can draw the FSM on paper.
Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.
In that way you only need to walk each message a single time.
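A sketch of that idea (all names here are illustrative, not from the answer): the banned words are compiled into a trie whose nodes act as the FSM states, and the scan walks the trie from each message position. A full Aho-Corasick automaton would add failure links so each message really is walked only once.

using System.Collections.Generic;

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWordEnd; // accepting state
}

public static class BannedWordFsm
{
    public static TrieNode Build(IEnumerable<string> bannedWords)
    {
        var root = new TrieNode();
        foreach (var word in bannedWords)
        {
            var node = root;
            foreach (var c in word.ToLowerInvariant())
            {
                if (!node.Children.TryGetValue(c, out var next))
                    node.Children[c] = next = new TrieNode();
                node = next;
            }
            node.IsWordEnd = true;
        }
        return root;
    }

    // Yields (start, length) for every banned-word occurrence in the text.
    public static IEnumerable<(int Start, int Length)> Scan(TrieNode root, string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            var node = root;
            for (int i = start; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(char.ToLowerInvariant(text[i]), out node))
                    break;
                if (node.IsWordEnd)
                    yield return (start, i - start + 1);
            }
        }
    }
}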
Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).
In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.
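A minimal sketch of that suggestion, reusing wordsList and mailList from the question:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Escape each word, join them with "|", and wrap the alternation in \b...\b
// so only complete words are removed; each mail is then parsed exactly once.
string pattern = @"\b(" + string.Join("|", wordsList.Select(Regex.Escape)) + @")\b";
var banned = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

mailList = mailList.Select(mail => banned.Replace(mail, string.Empty)).ToList();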
A general algorithm would be to:
Generate a list of tokens based on the input string (i.e. by treating whitespace as token separators)
Compare each token against a list of banned words
Replace matched tokens
A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.
HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
    "bad",
};

string Input = "this is some bad text.";
string Output = Regex.Replace(Input, @"\b\w+\b",
    (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);
Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.
Don't try replacing words with more innocuous words unless it's funny. People get the joke on 4chan, but Yahoo groups about history had confused people because the "medireview" and "mediareview" periods were being discussed when "eval" (not profanity, but used in some XSS attacks that Yahoo had been hit by) was replaced with "review" in "medieval" and "mediaeval" (apparently, "medireview" is the American spelling of "mediareview"!).
In some circumstances it's possible to improve on it. Just for fun:
You can use a SortedList. If your mail list items are delimited lists (because you have a delimiter like ";"), you can do as below.
First, calculate the running time of your current algorithm:
Words: n items (each item has O(1) length).
Mail list: K items.
Each item in the mail list has an average length of Z.
Each sub-item in a mail list item has an average length of Y, so the average number of sub-items per mail list item is m = Z/Y.
Your algorithm takes O(n*K*Z), even with the Knuth-Morris-Pratt string-search algorithm.
1. Now, sort the words list in O(n log n).
2.1. Use mailingListItem.Split(";".ToCharArray()) on each mail list item: O(Z).
2.2. Sort the items in each mail list item: O(m log m).
Total sorting takes O(K*Z) in the worst case, given that m log m << Z.
3. Use a merge algorithm to merge the bad-word list with each specific mail list item: O((m + n) * K).
Total time is O((m+n)*K + m*Z + n^2); given that m << n, the total running time is O(n^2 + Z*K) in the worst case, which is smaller than O(n*K*Z) if n < K*Z (I think).
So if performance is very, very important, you can do this.
You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex would allow you to ensure you are only getting full words that match. You could use a pattern like this:
@"\bBADWORD\b"
Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.
Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipients are made more aware of what happened, rather than getting nonsensical sentences with missing words.
Well, you certainly don't want to make the clbuttic mistake of a naive string.Replace() to do it. The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if/how much that would slow your operation down, particularly for a large list of banned words). You could always just...not do it, since it's entirely futile no matter what - there are ways to make your intended words quite clear even without using the exact letters.
That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word.
/censorship is bullshit rant
I assume that you want to detect only complete words (separated by non-letter characters) and ignore words with a filter-word substring (like the p[ass]word example). In that case you should build yourself a HashSet of filter-words, scan the text for words, and for each word check its existence in the HashSet. If it's a filter word, then build the resulting StringBuilder object without it (or with an equal number of asterisks), as sketched below.
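A sketch of this approach (the method name is illustrative): collect letter runs as words and rebuild the text, masking any word found in the filter set. Build the HashSet with StringComparer.OrdinalIgnoreCase so case doesn't matter.

using System.Collections.Generic;
using System.Text;

static string Redact(string text, HashSet<string> filterWords)
{
    var result = new StringBuilder(text.Length);
    int i = 0;
    while (i < text.Length)
    {
        if (char.IsLetter(text[i]))
        {
            // Consume one complete word.
            int start = i;
            while (i < text.Length && char.IsLetter(text[i])) i++;
            string word = text.Substring(start, i - start);
            result.Append(filterWords.Contains(word) ? new string('*', word.Length) : word);
        }
        else
        {
            result.Append(text[i++]); // separators pass through unchanged
        }
    }
    return result.ToString();
}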
I had great results using this algorithm on codeproject.com, better than brute-force text replacements.
I have built a T-SQL query like this:
DECLARE @search nvarchar(1000) = 'FORMSOF(INFLECTIONAL,hills) AND FORMSOF(INFLECTIONAL,print) AND FORMSOF(INFLECTIONAL,emergency)'
SELECT * FROM Tickets
WHERE ID IN (
-- unioned subqueries using CONTAINSTABLE
...
)
The GUI for this search will be an aspx page with a single textbox where the user can search.
I plan to somehow construct the search term to be like the example above (@search).
I have some concerns, though:
Is the example search term above the best or only way to include the inflections of all words in the search?
Should I separate the words and construct the search term in C# or T-SQL? I tend to lean toward C# for decisions/looping/construction, but I want your opinion.
I hate building SQL dynamically because of the risk of injection. How can I guard against this?
Should I use FREETEXTTABLE instead? Is there a way to make FREETEXT look for ALL words instead of ANY?
In general, how else would you do this?
I recently used Full-Text Search, so I'll try to answer some of your questions.
• "I hate building sql dynamically because of the risk of injection. How can I guard against this?"
I used a sanitize method like this:
static string SanitizeInput(string searchPhrase)
{
    if (searchPhrase.Length > 200)
        searchPhrase = searchPhrase.Substring(0, 200);

    searchPhrase = searchPhrase.Replace(";", " ");
    searchPhrase = searchPhrase.Replace("'", " ");
    searchPhrase = searchPhrase.Replace("--", " ");
    searchPhrase = searchPhrase.Replace("/*", " ");
    searchPhrase = searchPhrase.Replace("*/", " ");
    searchPhrase = searchPhrase.Replace("xp_", " ");

    return searchPhrase;
}
• Should I use FREETEXTTABLE instead? Is there a way to make FREETEXT look for ALL words instead of ANY?
I did use FREETEXTTABLE, but I needed any of the words. As much as I've read about it (and I've read quite a bit), you have to use CONTAINSTABLE to search for ALL words, or different combinations. FREETEXTTABLE seems to be the lighter solution, but not the one to pick when you want deeper customizations.
Dan, I like your SanitizeInput method. I refactored it to make it more compact and enhance performance a little.
static string SanitizeInput(string searchPhrase, int maxLength)
{
    Regex r = new Regex(@";|'|--|xp_|/\*|\*/", RegexOptions.Compiled);
    return r.Replace(searchPhrase.Substring(0, searchPhrase.Length > maxLength ? maxLength : searchPhrase.Length), " ");
}

static string SanitizeInput(string searchPhrase)
{
    const int MAX_SEARCH_PHRASE_LENGTH = 200;
    return SanitizeInput(searchPhrase, MAX_SEARCH_PHRASE_LENGTH);
}
I agree that FreeTextTable is too lightweight of a solution.
In your example, you have the @search variable already defined. As a rule of thumb, you shouldn't concatenate dynamic text into raw SQL, due to the risk of injection. However, you can of course set the value of @search through a parameter on the calling command object in your application. This completely negates the risk of injection attacks.
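A minimal sketch of that (the connection string, sproc name, and parameter size are hypothetical):

using System.Data;
using System.Data.SqlClient;

// Hypothetical sproc "dbo.SearchTickets" wrapping the CONTAINSTABLE query;
// the search string travels as a parameter, never as concatenated SQL.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.SearchTickets", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.Add("@search", SqlDbType.NVarChar, 1000).Value = searchText;

    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // read matching ticket rows...
        }
    }
}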
I would recommend construction of the search term in C#; passing the final search term in as a parameter like already mentioned.
As far as I recall, FREETEXTTABLE uses word breakers to completely decompose the search terms into their individual components. However, the FREETEXTTABLE operator automatically decomposes words into inflectional equivalents also, so you won't have to construct a complex CONTAINSTABLE operator if you decide to use it.
You could INNER JOIN the results of multiple FREETEXTTABLE queries to produce an equivalent AND result.
All of our searches are on columns in the database that have predefined valid characters.
Our search algorithm incorporates this with a regex that only allows these predefined characters. Because of this, escaping the search string is not needed. Our regex weeds out any injection attempts in the web code (ASP & ASPX). For standard comments from users, we use escaping that changes all characters that may be used for harm in SQL, ASP, ASPX, and JavaScript.
The TransStar site http://latranstar.tann.com/ is using an extended form of Soundex to search for street names, addresses and cities anywhere in Southern California. The Soundex by itself eliminates any need for anti-injection code since it operates only on alpha characters.
I am currently attempting to use Lucene to search data populated in an index.
I can match on exact phrases by enclosing them in quotes (e.g. "Processing Documents"), but cannot get Lucene to find that phrase with any sort of "Processing Document*".
The obvious difference being the wildcard at the end.
I am currently attempting to use Luke to view and search the index. (It drops the asterisk at the end of the phrase when parsing.)
Adding the quotes around the data seems to be the main culprit, as searching for document* works, but "document*" does not.
Any assistance would be greatly appreciated.
Lucene 2.9 has ComplexPhraseQueryParser which can handle wildcards in phrases.
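A hedged sketch of how that might look (class and namespace names as in the newer Lucene.Net port; the 2.9-era Java contrib parser is analogous, and the field name "content" is an assumption):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers.ComplexPhrase;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Unlike the default QueryParser, this parser keeps a trailing wildcard
// inside a quoted phrase instead of dropping it.
var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
var parser = new ComplexPhraseQueryParser(LuceneVersion.LUCENE_48, "content", analyzer);
Query query = parser.Parse("\"Processing Document*\"");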
What you're looking for is FuzzyQuery, which allows you to search for results with similar words based on Levenshtein distance. Alternatively, you may want to consider using the slop of PhraseQuery (also available in MultiPhraseQuery) if the order of the words isn't significant.
Not only does the QueryParser not support wildcards in phrases, PhraseQuery itself only supports Terms. MultiPhraseQuery comes closer, but as its summary says, you still need to enumerate the IndexReader.terms yourself to match the wildcard.
It seems that the default QueryParser cannot handle this. You can probably create a custom QueryParser for wildcards in phrases. If your example is representative, stemming may solve your problem. Please read the documentation for PorterStemFilter to see whether it fits.
Another alternative is to use NGrams and specifically the EdgeNGram. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
This will create indexes for ngrams or parts of words.
"Documents", with a min ngram size of 5 and a max ngram size of 8, would index:
Docum
Docume
Documen
Document
There is a bit of a tradeoff for index size and time.
One of the Solr books quotes as a rough guide:
Indexing takes 10 times longer
Uses 5 times more disk space
Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries.
As you aren't doing a wildcard search, you are matching a search term on ngrams (parts of words).
I was also looking for the same thing, and what I found is that PrefixQuery gives you something like "Processing Document*". But for this to work, the field you are searching must be untokenized and stored in lowercase (since the field is untokenized, the indexer won't lowercase the values for you). Here is the PrefixQuery code which worked for me:
List<SearchResult> results = new List<SearchResult>();
Lucene.Net.Store.Directory searchDir = FSDirectory.GetDirectory(this._indexLocation, false);
IndexSearcher searcher = new IndexSearcher(searchDir);
Hits hits;

BooleanQuery query = new BooleanQuery();
// keyWords is lowercased to match the untokenized, lowercased field values.
query.Add(new PrefixQuery(new Term(FILE_NAME_KEY, keyWords.ToLower())), BooleanClause.Occur.MUST);

hits = searcher.Search(query);
this.FillResults(hits, results);
Use a SpanNearQuery with a slop of 0.
Unfortunately there's no SpanWildcardQuery in Lucene.Net. Either you'll need to use SpanMultiTermQueryWrapper, or with little effort you can convert the Java version to C#.
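A hedged sketch of that route (the field name "content" is an assumption; SpanMultiTermQueryWrapper as found in the newer Lucene.Net ports):

using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Spans;

// "processing" must be followed immediately (slop 0, in order) by any term
// starting with "document", i.e. the phrase "Processing Document*".
var exact = new SpanTermQuery(new Term("content", "processing"));
var wildcard = new SpanMultiTermQueryWrapper<WildcardQuery>(
    new WildcardQuery(new Term("content", "document*")));

var query = new SpanNearQuery(new SpanQuery[] { exact, wildcard }, 0, true);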