Trouble searching for acronyms in Lucene.NET - c#

I'm currently working on a Lucene.NET full-text search implementation. For the most part it's going quite well, but I'm having a few issues revolving around acronyms in the data...
As an example of what's going on: if I had "N.A.S.A." in the field I indexed, I'm able to match it with n.a.s.a. or nasa, but n.a.s.a doesn't match it, not even with a fuzzy search (n.a.s.a~).
My first thought is to strip out all the periods before indexing/searching, but that feels more like a workaround than a solution, and I was hoping for something cleaner.
Can anyone suggest any changes or a different analyzer (using StandardAnalyzer currently) that may be more suited to matching this kind of data?

The StandardAnalyzer uses the StandardTokenizer, which tokenizes 'N.A.S.A.' as 'nasa' but won't do the same for 'N.A.S.A'. That's why your original query matches both the input 'N.A.S.A.', which is processed into 'nasa', and the input 'nasa', which matches the already-tokenized value. It also explains why 'N.A.S.A' won't match anything, since the index only contains the token 'nasa'.
This can be seen when outputting the value from the token stream directly.
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

public static void Main(string[] args)
{
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    var stream = analyzer.TokenStream("f", new StringReader("N.A.S.A. N.A.S.A"));
    var termAttr = stream.GetAttribute<ITermAttribute>();
    while (stream.IncrementToken())
    {
        Console.WriteLine(termAttr.Term);
    }
    Console.ReadLine();
}
Outputs:
nasa
n.a.s.a
You would probably need to write a custom analyzer to handle this scenario. One solution would be to keep the original token in addition to the collapsed one, so a query like n.a* would work, but you would also need better detection of acronyms.
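As a rough sketch of the keep-the-original-token idea (hypothetical and untested; it assumes the same Lucene.NET 3.0 attribute API used above), a TokenFilter could emit a dot-stripped copy of every dotted token at the same position, so the index ends up holding both 'n.a.s.a' and 'nasa':
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;
using Lucene.Net.Util;

public class AcronymFilter : TokenFilter
{
    private readonly ITermAttribute termAttr;
    private readonly IPositionIncrementAttribute posIncrAttr;
    private AttributeSource.State pendingState;
    private string pendingTerm;

    public AcronymFilter(TokenStream input) : base(input)
    {
        termAttr = AddAttribute<ITermAttribute>();
        posIncrAttr = AddAttribute<IPositionIncrementAttribute>();
    }

    public override bool IncrementToken()
    {
        if (pendingTerm != null)
        {
            // Emit the collapsed variant at the same position as the original.
            RestoreState(pendingState);
            termAttr.SetTermBuffer(pendingTerm);
            posIncrAttr.PositionIncrement = 0;
            pendingTerm = null;
            return true;
        }
        if (!input.IncrementToken()) return false;
        string term = termAttr.Term;
        if (term.IndexOf('.') >= 0)
        {
            string collapsed = term.Replace(".", "");
            if (collapsed.Length > 0 && collapsed != term)
            {
                // Queue the dot-free copy to be emitted on the next call.
                pendingState = CaptureState();
                pendingTerm = collapsed;
            }
        }
        return true;
    }
}
The same filter would have to run at both index time and query time so the two token streams agree.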

Related

Is it possible to check for a verb without creating my own database?

I am playing around with sentence-string input for a project I'm working on in C# and wanted to see if there is an alternative way to search for a verb using a built-in function.
Currently I am using a database table with a list of regular verbs and cycling through those to check for a match, but wanted to see if there is a better way to do this.
Consider the following input:
"Develop string matching software for verb"
The program will read the string and check each word:
// pseudocode: IsVerb stands in for whatever lookup answers this question
if (IsVerb(word))
{
    m_verbs.Add(word);
}
Short answer:
There is a better way.
Long answer:
It's not that simple. The problem is that there is no language awareness built into the C# string class. That implementation detail rests on the developer's shoulders.
You have some grammatical (or perhaps lexical is a better word) issues to consider as Owen79 pointed out in his comment. Then there is the question of environment / resource restrictions.
You have a few options available to you:
Web-based dictionary services. You can query those with the words of your sentence and get back the 'status' of each word, then keep only the statuses you want, such as verbs. Here is a link to DictService, which also includes a C# code sample.
A text / XML / other file-based solution. Similar approach: you simply look up the words in the file and act according to the presence or absence of each word. You can cache (load into memory) the contents of the file to save on IO operations. Here are the links to lists of regular and irregular verbs.
A database solution, which is identical to the previous one except for loading the contents into memory. That part may be unnecessary, but it depends on your implementation requirements.
Bottom line: each solution will require some work, but whichever option you go for, the key aspects to consider are the platform and the resources available to you. If computational speed is a concern, you will most likely need some tricks to cut down on lookup times.
Hope this helps
You could load the common verbs from a text file on disk. If you have lots of verbs and worry about memory, you could bucket them into common and uncommon (or alphabetically) and then load the relevant dictionaries only when needed.
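A minimal sketch of that idea, assuming a hypothetical verbs.txt with one verb per line:
using System;
using System.Collections.Generic;
using System.IO;

class VerbLookup
{
    static void Main()
    {
        // Load the verb list once; a HashSet gives O(1) case-insensitive lookups.
        var verbs = new HashSet<string>(
            File.ReadAllLines("verbs.txt"), // hypothetical file name
            StringComparer.OrdinalIgnoreCase);

        Console.WriteLine(verbs.Contains("develop")); // True if listed
    }
}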
If you don't want to use the database option (although it's highly recommended), then you need to put them in a data structure (e.g. an array or list). You can then use the powerful System.Linq extension methods.
For example:
using System.Linq; // needed for Contains() on arrays

string[] allVerbs = new[] { "eat", "drink" }; // etc
string s = "Develop string matching software for verb";
var words = s.Split(' ');
foreach (var word in words)
{
    if (allVerbs.Contains(word.ToLower()))
        m_verbs.Add(word);
}

Matching a large number of strings/phrases

I need to implement a process wherein a text file of roughly 50-150 KB is uploaded and matched against a large number of phrases (~10k).
I need to know which phrases match specifically.
A phrase could be "blah blah blah" or just "blah" - meaning I need to take word-boundaries into account, as I don't wish to include infix matches.
My first attempt was to just create a large pre-compiled list of regular expressions that look like @"\b{0}\b" (since the 10k phrases are constant, I can cache and re-use this same list against multiple documents).
On my brand-new and very fast PC, this matching is taking 10+ seconds, which I would like to reduce a great deal.
Any advice on how I may be able to achieve this would be greatly appreciated!
Cheers,
Dave
You could use Lucene.NET and the ShingleFilter, as long as you don't mind having a cap on the number of words a phrase can have.
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Shingle;
using Lucene.Net.Analysis.Standard;

public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Emit lower-cased shingles (word n-grams) of up to 6 words.
        return new ShingleFilter(new LowerCaseFilter(
            new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader)), 6);
    }
}
You can run the analyzer using this utility method.
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

public static IEnumerable<string> GetTerms(Analyzer analyzer, string keywords)
{
    var tokenStream = analyzer.TokenStream("content", new StringReader(keywords));
    var termAttribute = tokenStream.AddAttribute<ITermAttribute>();
    var terms = new HashSet<string>();
    while (tokenStream.IncrementToken())
    {
        // HashSet.Add ignores duplicates, so no explicit Contains check is needed.
        terms.Add(termAttribute.Term);
    }
    return terms;
}
Once you've retrieved all the terms, do an intersect with your phrase list.
var matchingShingles = GetTerms(new MyAnalyzer(), "Here's my stuff I want to match");
var matchingPhrases = phrasesToMatch.Intersect(matchingShingles, StringComparer.OrdinalIgnoreCase);
I think you will find this method is much faster than Regex matching, and it respects word boundaries.
You can use Lucene.Net.
This will create an index of your text, so that you can run really quick queries against it. This is a "full-text index".
This article explains what it's all about:
Lucene.net
This library was originally written in Java (Lucene), but there is a port to .NET (Lucene.Net).
You must take special care when choosing the stemmer. A stemmer takes the "root" of a word so that several similar words can match (e.g. book and books will match). If you need exact matches, then you should take (or implement) a stemmer which returns the original words unchanged.
The same stemmer must be used for creating the index and for searching.
You should also have a look at the query syntax, because it's very powerful and allows partial matches, exact matches, and so on.
You can also have a look at this blog.
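To make the moving parts concrete, here is a minimal indexing/search sketch against Lucene.NET 3.0 (the field name, sample text, and query are placeholders; note that StandardAnalyzer does no stemming, which fits the exact-match advice above):
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

var dir = new RAMDirectory();
var analyzer = new StandardAnalyzer(Version.LUCENE_30);

// Index the uploaded text as a single document.
using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("content", "the uploaded text goes here",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}

// A quoted query gives an exact phrase match.
var searcher = new IndexSearcher(dir, true);
var query = new QueryParser(Version.LUCENE_30, "content", analyzer).Parse("\"exact phrase\"");
Console.WriteLine(searcher.Search(query, 10).TotalHits + " match(es)");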

Maybe I need a Regex?

I am making a simple console application for a home project. Basically, it monitors a folder for any files being added.
FileSystemWatcher fsw = new FileSystemWatcher(@"c:\temp");
fsw.Created += new FileSystemEventHandler(fsw_Created);
bool monitor = true;
while (monitor)
{
    fsw.WaitForChanged(WatcherChangeTypes.Created, 1000);
    if (Console.KeyAvailable)
    {
        monitor = false;
    }
}
Show("User has quit the process...", ConsoleColor.Yellow);
When a new file arrives, 'WaitForChanged' returns, and I can then start the work.
What I need to do is check the filename for patterns. In real life, I am putting video files into this folder. Based on the filename, I will have rules which move the files into specific directories. So for now, I'll have a list of key/value pairs holding a regex (I think?) and a folder. If the filename matches a regex, the file is moved into the related folder.
An example of a filename is:
CSI- NY.S07E01.The 34th Floor.avi
So, my Regex needs to look at it, and see if the words CSI "AND" (NY "OR" NewYork "OR" New York) exist. If they do, I will then move them to a \Series\CSI\NY\ folder.
I need the AND, because another file example for a different series is:
CSI- Crime Scene Investigation.S11E16.Turn On, Tune In, Drop Dead
So, for this one, I would need to have some NOTs. So, I need to check if the filename has CSI, but NOT ("New York" or "NY" or "NewYork")
Could someone assist me with these RegExs? Or maybe, there's a better method?
You can try storing the conditions in a Func<string, bool>:
Dictionary<Func<string, bool>, string> dic = new Dictionary<Func<string, bool>, string>();
Func<string, bool> f1 = x => x.Contains("CSI") && (x.Contains("NY") || x.Contains("New York"));
dic.Add(f1, "C://CSI/");
foreach (var pair in dic)
{
    if (pair.Key.Invoke("CSI- NY.S07E01.The 34th Floor.avi"))
    {
        // copy
        return;
    }
}
I think you have the right idea. The nice thing about this approach is that you can add/remove/edit regular expressions in a config file or similar store, which means you don't have to recompile the project every time you want to keep track of a new show.
A regular expression for CSI AND NY would look something like this.
First, if you want to check whether CSI exists in the filename, the regex is simply "CSI". Keep in mind it's case sensitive by default.
If you want to check whether NY, New York or NewYork exists in the filename, the regex is "((NY)|(New York)|(NewYork))". The bars indicate OR and the parentheses designate groups.
To combine the two you could run both regexes, and in some cases (where order is unimportant) that might be easier. However, if you always expect the sub-series to come after the show name, the syntax would be "(CSI).*((NY)|(New York)|(NewYork))". The period means "any character" and the asterisk means "zero or more".
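For instance, a sketch of the rule table from the question (the destination folder and the RegexOptions choice are just assumptions):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var rules = new Dictionary<string, string>
{
    // pattern -> destination folder; CSI must come before a NY variant
    { @"CSI.*((NY)|(New York)|(NewYork))", @"C:\Series\CSI\NY\" }
};

string fileName = "CSI- NY.S07E01.The 34th Floor.avi";
foreach (var rule in rules)
{
    if (Regex.IsMatch(fileName, rule.Key, RegexOptions.IgnoreCase))
    {
        Console.WriteLine("Move to " + rule.Value);
        break;
    }
}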
This does not look like a job for one regex, even if you succeed in tossing the whole thing into one. Regexes which match "anything without a given word" are a pain. I'd rather stick with two regexes for each rule: one which should match, and one which should NOT match, for the rule to be triggered. If you need both "CSI" and "NY" but don't want to fix any particular order within the filename, you may as well switch from a pair of regexes to a pair of lists of regexes. In general, it's better to put this logic into code and configuration and keep the regexes as simple as possible. And yes, you're quite likely to get away with a simple substring search; there's no real need for regexes as long as you keep your code smart enough.
Well, people have already given you some advice about doing this using:
Regular expressions
Func and storing exactly the C# code that will be executed against the file
so I'll just give you a different one.
I disagree with using regular expressions for this purpose. I agree with @Anton S. Kraievoy: I don't like regexes that match anything without a given word. It is easier to check: !text.Contains(word)
The second option looks perfect if you are looking for a fast solution, but...
If that is a more complex application, and you want to design it correctly, I think you should:
Define how you will store those patterns (in a class with members, in a string, etc.). A string example could be:
"CSI" & ("NY" || "Las Vegas")
Then write a module that will match a filename with that pattern.
You're creating a kind of DSL.
Why is it better than just pasting the C# code directly?
Well, because:
You can easily change the semantics of your patterns
You can generate the validation code in any language you want, because you're storing the patterns in a generic way.
The thing is how to match a pattern against a filename.
You have some options:
Write the grammar of your pattern and write a parser yourself.
Generate the matching code from the pattern (I'm not 100% sure this is possible; it depends on the grammar), e.g. with a regex-based rewrite that converts your pattern into C# code,
like: "A" & "B" to string.Contains("A") && string.Contains("B") or something like that (see the sketch after this list).
Use a tool to do that, like ANTLR.
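As a toy illustration of the second option (it assumes patterns contain only quoted literals joined by & and ||, with no nesting; anything more needs a real parser):
using System.Text.RegularExpressions;

static string ToCSharp(string pattern)
{
    // "CSI" & "NY"  ->  name.Contains("CSI") && name.Contains("NY")
    string code = Regex.Replace(pattern, "\"([^\"]+)\"", "name.Contains(\"$1\")");
    return code.Replace("&", "&&"); // "||" is already valid C#
}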

What's the best way to parse a string for "bad" words in C#?

I'm thinking of something like:
foreach (var word in paragraph.Split(' ')) {
    if (badWordArray.Contains(word)) {
        // do something about it
    }
}
but I'm sure there's a better way.
Thanks in advance!
UPDATE
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used. Then I'll review it myself to make sure it's legit. An auto flagging system of sorts.
While your way works, it may be a bit time consuming. There is a wonderful response here for a previous SO question. Though the question talks about PHP instead of C#, I think it can be easily ported.
Edit to add sample code:
// requires: using System.Text.RegularExpressions;
public string FilterWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.Replace(inputWords, "<3");
}
That should work for you, more or less.
Edit to answer OP clarification:
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used.
Much as the replacement portion above, you can see if something matches like so:
public bool HasBadWords(string inputWords) {
Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
return wordFilter.IsMatch(inputWords);
}
It will return true if the string you passed to it contains any words in the list.
At my job we put some automatic bad word filtering into our software (it's kind of shocking to be browsing the source and suddenly run across the array containing several pages of obscenity).
One tip is to pre-process the user input before testing it against your list, in case someone is trying to sneak something by you. So by way of preprocessing, we:
uppercase everything in the input
remove most non-alphanumerics (that is, just splice out any spaces, punctuation, etc.)
and then, assuming someone is trying to pass off digits for letters, do something like this: replace zero with O, 9 with G, 5 with S, etc. (get creative; see the sketch after this list)
And then get some friends to try to break it. It's fun.
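A minimal sketch of those three steps (the digit substitutions are just the examples given; extend them as needed):
using System.Text.RegularExpressions;

static string Normalize(string input)
{
    string s = input.ToUpperInvariant();    // 1. uppercase everything
    s = Regex.Replace(s, "[^A-Z0-9]", "");  // 2. splice out spaces, punctuation, etc.
    return s.Replace('0', 'O')              // 3. undo digit-for-letter swaps
            .Replace('9', 'G')
            .Replace('5', 'S');
}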
You could consider using a HashSet<string> or Dictionary<TKey, TValue> instead of the array, which can make the code more efficient: Contains() on an array is a linear scan, while HashSet<string>.Contains() or Dictionary.ContainsKey() is effectively O(1). This is especially true if you have a large list of profanities (not sure how many there are! :)
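A quick sketch of the flagging loop with a HashSet (the word list and input are placeholders):
using System;
using System.Collections.Generic;

var badWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "puppies", "kittens" // placeholder list
};

string paragraph = "some user-submitted text about kittens";
foreach (var word in paragraph.Split(' '))
{
    if (badWords.Contains(word))
    {
        Console.WriteLine("Flag for review: " + word); // notify, don't auto-remove
    }
}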

Best way to replace tokens in a large text template

I have a large text template which needs tokenized sections replaced by other text. The tokens look something like this: ##USERNAME##. My first instinct is just to use String.Replace(), but is there a better, more efficient way or is Replace() already optimized for this?
System.Text.RegularExpressions.Regex.Replace() is what you seek - IF your tokens are odd enough that you need a regex to find them.
Some kind soul did some performance testing, and between Regex.Replace(), String.Replace(), and StringBuilder.Replace(), String.Replace() actually came out on top.
The only situation in which I've had to do this is sending a templated e-mail. In .NET this is provided out of the box by the MailDefinition class. So this is how you create a templated message:
MailDefinition md = new MailDefinition();
md.BodyFileName = pathToTemplate;
md.From = "test@somedomain.com";
ListDictionary replacements = new ListDictionary();
replacements.Add("<%To%>", someValue);
// continue adding replacements
MailMessage msg = md.CreateMailMessage("test@someotherdomain.com", replacements, this);
After this, msg.Body would be created by substituting the values in the template. I guess you can take a look at MailDefinition.CreateMailMessage() with Reflector :). Sorry for being a little off-topic, but if this is your scenario I think it's the easiest way.
Well, depending on how many variables you have in your template, how many templates you have, etc. this might be a work for a full template processor. The only one I've ever used for .NET is NVelocity, but I'm sure there must be scores of others out there, most of them linked to some web framework or another.
string.Replace is fine. I'd prefer using a Regex, but I'm *** for regular expressions.
The thing to keep in mind is how big these templates are. If they're really big and memory is an issue, you might want to create a custom tokenizer that acts on a stream; that way you only hold a small part of the file in memory while you manipulate it.
But for the naive implementation, string.Replace should be fine.
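If you do go the streaming route, here is a crude line-at-a-time sketch (it assumes tokens never span a line break; the file names and replacement value are placeholders):
using System.IO;

// Stream the template through, holding only one line in memory at a time.
using (var reader = new StreamReader("template.txt")) // hypothetical input
using (var writer = new StreamWriter("output.txt"))   // hypothetical output
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        writer.WriteLine(line.Replace("##USERNAME##", "Dave"));
    }
}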
If you are doing multiple replaces on large strings then it might be better to use StringBuilder.Replace(), as the usual performance issues with strings will appear.
Regular expressions would be the quickest solution to code up, but if you have many different tokens it will get slower. If performance is not an issue, use this option.
A better approach would be to define a token delimiter, like your "##", that you can scan for in the text, then select the replacement from a hash table keyed by the text that follows the delimiter.
If this is part of a build script, then NAnt has a great feature for doing this called Filter Chains. The code for that is open source, so you could look at how it's done for a fast implementation.
Had to do something similar recently. What I did was (sketched below):
make a method that takes a dictionary (key = token name, value = the text to insert)
get all matches of your token format (##.+?## in your case, I guess; not that good at regular expressions :P) using Regex.Matches(input, pattern)
foreach over the results, using the dictionary to find the insert value for each token
return the result.
Done ;-)
If you want to test your regexes, I can suggest The Regulator.
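A minimal sketch of those steps, using the Regex.Replace overload that takes a MatchEvaluator instead of a manual foreach (the ##...## format is assumed from the question):
using System.Collections.Generic;
using System.Text.RegularExpressions;

static string ReplaceTokens(string input, IDictionary<string, string> replacements)
{
    // For each ##TOKEN## match, look the inner name up in the dictionary;
    // unknown tokens are left in place.
    return Regex.Replace(input, "##(.+?)##", m =>
    {
        string value;
        return replacements.TryGetValue(m.Groups[1].Value, out value)
            ? value
            : m.Value;
    });
}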
FastReplacer implements token replacement in O(n*log(n) + m) time and uses 3x the memory of the original string.
FastReplacer is good for executing many Replace operations on a large string when performance is important.
The main idea is to avoid modifying existing text or allocating new memory every time a string is replaced.
We have designed FastReplacer to help us on a project where we had to generate a large text with a large number of append and replace operations. The first version of the application took 20 seconds to generate the text using StringBuilder. The second improved version that used the String class took 10 seconds. Then we implemented FastReplacer and the duration dropped to 0.1 seconds.
If your template is large and you have lots of tokens, you probably don't want to walk it and replace the tokens one by one, as that would result in an O(N * M) operation where N is the size of the template and M is the number of tokens to replace.
The following method accepts a template and a dictionary of the key/value pairs you wish to replace. By initializing the StringBuilder to slightly larger than the size of the template, it should result in an O(N) operation (i.e. it shouldn't have to grow itself log N times).
Finally, you can move the building of the tokens into a Singleton as it only needs to be generated once.
static string SimpleTemplate(string template, Dictionary<string, string> replacements)
{
    // parse the message into an array of tokens
    Regex regex = new Regex("(##[^#]+##)");
    string[] tokens = regex.Split(template);

    // build the new message from the tokens
    var sb = new StringBuilder((int)((double)template.Length * 1.1));
    foreach (string token in tokens)
        sb.Append(replacements.ContainsKey(token) ? replacements[token] : token);

    return sb.ToString();
}
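Example usage; note that because Regex.Split keeps the captured delimiters, the dictionary keys must include the ## markers (the template text is a placeholder):
var output = SimpleTemplate(
    "Hello ##USERNAME##, your order has shipped.",
    new Dictionary<string, string> { { "##USERNAME##", "Dave" } });
// output: "Hello Dave, your order has shipped."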
This is an ideal use of Regular Expressions. Check out this helpful website, the .Net Regular Expressions class, and this very helpful book Mastering Regular Expressions.
