How to cut specified words from string - c#

There is a list of banned words ( or strings to be more general) and another list with let's say users mails. I would like to excise all banned words from all mails.
Trivial example:
foreach(string word in wordsList)
{
foreach(string mail in mailList)
{
mail.Replace(word,String.Empty);
}
}
How I can improve this algorithm?
Thanks for advices. I voted few answers up but I didn't mark any as answer since it was more like discussion than solution. Some people missed banned words with bad words. In my case I don't have to bother about recognize 'sh1t' or something like that.

Simple approaches to profanity filtering won't work - complex approaches don't work, for the most part, either.
What happens when you get a work like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead - the intent is still clear, right?
See How do you implement a good profanity filter? for extensive discussion.

You could use RegEx to make things a little cleaner:
var bannedWords = #"\b(this|is|the|list|of|banned|words)\b";
foreach(mail in mailList)
var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
Even that, though, is far from perfect since people will always figure out a way around any type of filter.

You'll get best performance by drawing up a finite state machine (FSM) (or generate one) and then parsing your input 1 character at a time and walking through the states.
You can do this pretty easily with a function that takes your next input char and your current state and that returns the next state, you also create output as you walk through the mail message's characters. You draw the FSM on a paper.
Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.
In that way you only need to walk each message a single time.

Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).
In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.

A general algorithm would be to:
Generate a list of tokens based on the input string (ie. by treating whitespace as token separators)
Compare each token against a list of banned words
Replace matched tokens
A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.
HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
"bad",
};
string Input = "this is some bad text.";
string Output = Regex.Replace(Input, #"\b\w+\b", (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);

Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.
Don't try replacing words with more innocuous words unless its funny. People get the joke on 4chan, but yahoo groups about history had confused people because the medireview and mediareview periods were being discussed when eval (not profanity, but is used in some XSS attacks that yahoo had been hit by) was replaced with review in medieval and mediaeval (apparantly, medireview is the American spelling of mediareview!).

In some circumstance is possible to improve it:
Just for fun:
u can use SortedList, if ur mailing list is mailing list (because u have a delimiter like ";") u can do as bellow:
first calculate ur running time algorithm:
Words: n item. (each item has an O(1) length).
mailing list: K item.
each item in mailing list average length of Z.
each sub item in mailing list item average length of Y so the average number of subitems in mailing list items is m = Z/Y.
ur algorithm takes O(n*K*Z). // the best way with knut algorithm
1.now if u sort the words list in O(n log n).
2.1- use mailingListItem.Split(";".ToCharArray()) for each mailing list item: O(Z).
2.2- sort the items in mailing list: O(m * log m)
total sorting takes O(K * Z) in worth case with respect to (m logm << Z).
3- use merge algorithm to merge items of bad word and specific mailing list: O((m + n) * k)
total time is O((m+n)*K + m*Z + n^2) with respect to m << n, total algorithm running time is O(n^2 + Z*K) in worth case, which is smaller than O(n*K*Z) if n < K * Z ( i think so).
So if performance is very very very important, u can do this.

You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex would allow you to assure you are only getting full words that match. You could use a pattern like this:
"\bBADWORD\b"
Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.

Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipents are made more aware what happened, rather than getting nonsensical sentences with missing words.

Well, you certainly don' want to make the clbuttic mistake of naive string.Replace() to do it. The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if/how much that would slow your operation down, particularly for a large list of banned words). You could always just...not do it, since it's entirely futile no matter what--there are ways to make your intended words quite clear even without using the exact letters.
That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word
/censorship is bullshit rant

I assume that you want to detect only complete words (separated by non-letter characters) and ignore words with a filter-word substring (like a p[ass]word example). In that case you should build yourself a HashSet of filter-words, scan the text for words, and for each word check its existence in HashSet. If it's a filter word then build resulting StringBuilder object without it (or with an equal number of asterisks).

I had great results using this algorithm on codeproject.com better than brute force text replacments.

Related

Searching for a substring in a key of a key-value pair more efficient when stored in a data structure or in a long string?

I have a string searching problem and two ideas came to mind on how to implement it. I was wondering if people can indicate which method would give me more efficient performance, or perhaps even suggest a better way of doing it?
The problem is I have a text file of around 450kb containing data in the following format:
description1, code1\n
description2, code2\n
description3, code3\n
...
It is two columns of data delimited by a comma and each record consists of a description and a code.
The code is a short three character text that is not immediately meaningful to the user, which is why there is description data paired with the code.
The description data is a short sentence that describes to the user what the code means.
I'm trying to create a GUI where the user can enter a search keyword in an editable text field which is then used to search against the description data. The system would then return back all the filtered records, i.e., all the description data that has the keyword as a substring and the code that it is paired with for the user to select. This occurs for each character the user types.
The first idea that came to mind on how to implement this feature is to create a key-value pair collection using the description data as key, such as a NameValueCollection, and then use a foreach loop to go through each record and search the key for the matching substring.
The second idea is to read the whole text file into one long string, and use the String.IndexOf() method to search for the keyword and wherever there is a hit in the search, I extract that portion of the record to return to the user.
The second idea came to mind because I was concerned by the performance impact that the first idea may have. I've read that the IndexOf method in use with StringComparison.Ordinal performs better than a Boyer–Moore string search algorithm so I think implementing it this way would have better performance?
So when searching for a substring in the key, does it provide faster retrieval to store the whole file as a string or in a NameValueCollection, or are there better ways of doing this?
If you have a wide collection of strings that you are planning on searching for the exact same substring, you have many options available.
One option would be to use the Aho-Corasick string matching algorithm to search for the search query in every single one of the lines of the file. The total runtime of doing this will be O(m + n + z), where m is the length of the query, z is the number of total matches, and n is the total number of characters in all of the strings in the file.
A better but more complex option would be to build a generalized suffix tree out of all the lines of the file. You could then find all matching lines in time O(n + z), where n is the length of the pattern to search for and z is the total number of lines in the file. This requires O(m) preprocessing time, where m is the total number of characters in the file. This is much, much faster than the first option, but you would probably have to find a good suffix tree library, as suffix tree construction algorithms are fairly complex.
Hope this helps!

Parsing A String - Is There A More Efficient Method than Checking Each Line?

I am working on a project to parse out a text file. The file is output from networking equipment. The incoming string is anywhere from a few thousand to tens of thousands of lines long. There will be a variable number of entries with keywords like these:
fcN/N is up
Hardware is Fibre Channel, SFP is short wave laser w/o OFC (SN)
Port WWN is 20:52:00:0d:ec:ef:b0:40
Admin port mode is F, trunk mode is on
snmp link state traps are enabled
Port vsan is 10
fcipN is up
.....
port-channel-N is trunking
......
The N is a number. There will always be the 'fcN/N' entries, there may or may not be the other two. The 'fcip' and 'port-channel' entries will have similar status information after each one as the fcN/N entries. All entries of the same type will be grouped - there won't be an fc followed by an fcip followed by another fc. Also as a general rule, all the fc entries are listed, then all the port-channel then all the fcip but I don't want to assume that. At the moment I have about 7 different RegEx patterns I am looking for. I do this by examining each line in turn, however managing all those is cumbersome. I thought about splitting the string on newline and then some kind of LINQ select to get all of each of the 3 types of entries, but that assumes they are always grouped in the same order. I also thought about 3 monster regexes to match everything from one entry to the next, but my experience is those are tough to get working and almost unreadable. Another thing I thought of was first match the three keywords - fc or port-channel or fcip, then have an if statement that matches the patterns unique to those. That is still matching each line for all 3 patterns though.
To be clear, I have the Regex patterns working. I am looking for a more efficient way to do this than test each line for 6 0r 8 matches.
Any other ideas?
I have two thought:
(1) Your last approach of using if statements to first find the right regex to apply is like to be quite efficient. I'd recommend it.
(2) You can compose regex's like this:
var pattern1 = #"abc";
var pattern2 = #"def";
var unionPattern = "((" + pattern1 + ")|(" + pattern2 + "))";
This makes it much more readable.
If you never want to find a match that spans lines you should split the file into lines first. That will improve efficiency because the regexes have smaller inputs and will backtrack less.
If your matches span multiple lines but they always start after a new-line, you can you can split the string into chunks first like this:
var chunks = Regex.Split(str, "((fc\d)|(fcip\d)|(port-channel-\d)));
You might get clearer and more concise code by using a parser combinator library, such as Sprache.
Not being a C# programmer, I'm not intimately familiar with this library (and there may well be others for C# as well), but I've used Scala parser combinators to good effect, and they build on and use regular expression parsing.
Whether it make your code more efficient likely depends on how inefficient your code now is.
Are you looking for raw speed, or efficiency? If the former, you can split the file into parts and have a thread parsing each part simultaneously. The trick will be finding a boundary to split on (so that each part contains only whole entries) quickly. You will also only want to go multithreaded if the total number of lines is large, or the overhead will outweigh the parallelization gains.

How to find the number of occurrences of a string within a huge string like a big book

I was recently asked this question during a C# interview session:
How would you efficiently find the number of occurrences of a word within a huge text like a big book (the Bible, a dictionary, etc).
I am wondering what would be the most efficient data structure to store the contents of the book in. The dirtiest soultion I could think of was to store it in a StringBuilder and find the count of the substrings, but I am sure there has to be a much better way to do this.
And for a reasonably sized string there are multiple ways of doing this using substring, regular expressions, etc but for a humongous string what is the most efficient way.
Update: What I am looking for is this:
Assuming there is a text file, lets again say the Bible, of size 20 MB, and I want to find the number of times the word "Jesus" occurs in the text, other than loading the entire 20 MB into a string or StringBuilder and using a substring or regex to find the match count, is there any other data structure that could be used to store the entire text contents. The actual search can be accomplished in multiple ways, what I am looking for is the most efficient "data structure" for the temporary storage.
Assuming you dont care about substrings, but just full words, I would use a hashtable. Can be built in linear time and the size is proportional to the number of distinct words. Dictionary<string,int> specifically. On my machine, it took about 450ms to load the entire bible into a hashtable and find all entries of the word "God".
Assuming you do a full word match (can be made to work for prefix matches too).
Construct a trie from the bible with the count information.
If you need to query a word, walk the trie, get the count.
If you need to do a substring match, you can try using a suffix tree (which is basically a trie, but you also include the suffixes).
This assumes the words to query change, the bible stays fixed...
Wikipedia has an interesting article on string searching:
http://en.wikipedia.org/wiki/String_searching_algorithm
and according to that article this algorithm is a sort of benchmark:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
Something the size of the bible is not so huge as to prevent the entire string to be cached in memory, so with the assumption you can... I have used this method before but it obviously would not be lightning fast. Strictly speaking in terms of efficient from a computational standpoint this is not the fastest, but from a speed of coding and reasonable speed I think this works until nanoseconds count.
string text = "a set of text to search in. fast to implement.";
string key = "to";
MessageBox.Show(text.Split(" ',.".ToCharArray()).Where(a => a == key).Count().ToString());
Edit: doesn't solve the final version of the question and may have misinterpretted the original question. Ignore.

Is there a way to match a set of groups in any order in a regex?

I looked through the related questions, there were quite a few but I don't think any answered this question. I am very new to Regex but I'm trying to get better so bear with me please. I am trying to match several groups in a string, but in any order. Is this something I should be using Regex for? If so, how? If it matters, I plan to use these in IronPython.
EDIT: Someone asked me to be more specific, so here:
I want to use re.match with a regex like:
\[image\s*(?(#alt:(?<alt>.*?);).*(#title:(?<title>.*?);))*.*\](?<arg>.*?)\[\/image\]
But it will only match the named groups when they are in the right order, and separated with a space. I would like to be able to match the named groups in any order, as long as they appear where they do now in the regex.
A typical string that will be applied to this might look like:
[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien.png[/image]
But I should have no problem matching:
[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien.png[/image]
So the 'attributes' (things that come between '#' and ';' in the first 'tag') should be matched in any order, as long as they both appear.
The answer to the question in your title is "no" -- to match N groups "in any order", the regex should have an "or" (the | feature in the regex pattern) among the N! (N factorial) possible permutations of the groups, the product of all integers from 1 to N. That's a number which grows extremely fast -- for N just equal 6, it's already 720, for 7, it's almost 5000, and so on at a dizzying pace -- so this approach is totally impractical for any N which isn't really tiny.
The solutions may be many, depending on what you want the groups to be separated with. Let's say, for example, that you don't care (if you DO care, edit your question with better specs).
In this case, if overlapping matches are impossible or are OK with you, make N separate regular expressions, one per group -- say these N compiled RE objects are in a list named grps, then
mos = [g.search(thestring) for g in grps]
is the list of match objects for the groups (None for a group which doesn't match). With the mos list you can do all sorts of checks and/or further manipulations, for example all(mos) is True if and only if all the groups matched, in which case [m.group() for m in mos] is the list of substrings that have been matched, and so on, and so forth.
If you need non-overlapping matches, it's a bit more complicated -- you may extract the boundaries of all possible matches for each group, then seeing if there's a way to extract from these N lists a set of N intervals, one per lists, so that no two of them are pairwise intersecting. This is a somewhat subtle algorithm (if you want reasonable speed for a large N, of course), so I think it's worth a separate question, and in any case it's not worth discussing right here when the very issue of whether it's needed or not depends on so incredibly many factors that you have not specified.
So, please edit your question with more precise specifications, first, and then things can perhaps be clarified to provide you with the code and/or algorithms you need.
Edit: I see the OP has now clarified the issue at least of the extent of providing an example -- although, confusingly, he offers a RE pattern example and a string example that should not match, regardless of ordering (the RE specifies the presence of a substring #title which the example string does not have -- puzzling!).
Anyway, if the number of groups in the example (two which appear to be interchangeable, one which appears to have to occur in a specific spot) is representative of the OP's actual problems, then the total number of permutations of interest is just two, so joining the "just two" permutations with a vertical bar | would of course be quite feasible. Is that the case in the OP's real problems, though...?
Edit: if the number of permutations of interest is tiny, here's an example of one way to avoid the problem of repeated group names in the pattern (syntax requires Python 2.7 or better, but that's just for the final "dict comprehension" -- the same functionality is available in many previous version of Python, just with the less elegant dict(('a', ... syntax;-)...:
>>> r = re.compile(r'(?P<a1>a.*?a).*?(?P<b1>b.*?b)|(?P<b2>b.*?b).*?(?P<a2>a.*?a)')
>>> m = r.search('zzzakkkavvvbxxxbnnn')
>>> g = m.groupdict()
>>> d = {'a':(g.get('a1') or g.get('a2')), 'b':(g.get('b1') or g.get('b2'))}
>>> d
{'a': 'akkka', 'b': 'bxxxb'}
This is very similar to one of the key problems with using regular expressions to parse HTML - there is no requirement that attributes always be specified in the same order, and many tags have surprising attributes (like <br clear="all">. So it seems you are working with a very similar markup syntax.
Pyparsing addresses this problem in an indirect way - instead of trying to parse all different permutations, parse the general "#attrname:attribute value;" syntax, and keep track of the attributes keys and values in an attribute mapping data structure. The mapping makes it easy to get the "title" attribute, regardless of whether it came first or last in the image tag. This behavior is built into the pyparsing API methods, makeHTMLTags and makeXMLTags.
Of course, this markup is not XML, but a similar approach gives some pretty easy to work with results:
text = """[image #alt:alien; #title:reddit alien;]http://www.reddit.com/alien1.png[/image]
But I should have no problem matching:
[image #title:reddit alien; #alt:alien;]http://www.reddit.com/alien2.png[/image]
"""
from pyparsing import Suppress, Group, Word, alphas, SkipTo, Dict, ZeroOrMore
LBRACK,RBRACK,COLON,SEMI,AT = map(Suppress,"[]:;#")
tagAttribute = Group(AT + Word(alphas) + COLON + SkipTo(SEMI) + SEMI)
imageTag = LBRACK + "image" + Dict(ZeroOrMore(tagAttribute)) + RBRACK
imageLink = imageTag + SkipTo("[/image]")("text")
for taginfo in imageLink.searchString(text):
print taginfo.alt
print taginfo.title
print taginfo.text
print
Prints:
alien
reddit alien
http://www.reddit.com/alien1.png
alien
reddit alien
http://www.reddit.com/alien2.png

C# Code/Algorithm to Search Text for Terms

We have 5mb of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.
What's the most efficient way to do this in .NET (ideally C#)?
Our ideas include regex's (a single one, lots of them) plus even the String.Contains stuff.
The input is a 2mb to 5mb text string - all text. Multiple hits are good, as in each term (of the 1000) that matches then we do want to know about it. Performance in terms of entire time to execute, don't care about footprint. Current algorithm gives about 60 seconds+ using naive string.contains. We don't want 'cat' to provide a match with 'category' or even 'cats' (i.e. entire term word must hit, no stemming).
We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (dont need position or frequency just yet). We get a new 2-5mb string every 10 seconds, so can't assume we can index the input. The 1000 terms are dynamic, although have a change rate of about 1 change an hour.
A naive string.Contains with 762 words (the final page) of War and Peace (3.13MB) runs in about 10s for me. Switching to 1000 GUIDs runs in about 5.5 secs.
Regex.IsMatch found the 762 words (much of which were probably in earlier pages as well) in about .5 seconds, and ruled out the GUIDs in 2.5 seconds.
I'd suggest your problem lies elsewhere...Or you just need some decent hardware.
Why reinvent the wheel? Why not just leverage something like Lucene.NET?
have you considered the following:
do you care about substring? lets say I am looking for the word "cat", nothing more or nothing less. now consider the Knuth-Morris-Pratt algorithm, or string.contains for "concatinate". both of these will return true (or an index). is this ok?
Also you will have to look into the idea of the stemmed or "Finite" state of the word. lets look for "diary" again, the test sentance is "there are many kinds of diaries". well to you and me we have the word "diaries" does this count? if so we will need to preprocess the sentance converting the words to a finite state (diaries -> diary) the sentance will become "there are many kind of diary". now we can say that Diary is in the sentance (please look at the porter Stemmer Algroithm)
Also when it comes to processing text (aka Natrual Langauge Processing) you can remove some words as noise, take for example "a, have, you, I, me, some, to" <- these could be considered as useless words, and can then be removed before any processing takes place? for example
"I have written some C# today", if i have 10,000 key works to look for I would have to scan the entire sentance 10,000 x the number of words in the sentance. removing noise before hand will shorting the processing time
"written C# today" <- removed noise, now there are lots less to look throught.
A great article on NLP can be found here. Sentance comparing
HTH
Bones
A modified Suffix tree would be very fast, though it would take up a lot of memory and I don't know how fast it would be to build it. After that however every search would take O(1).
Here's another idea: Make a class something like this:
class Word
{
string Word;
List<int> Positions;
}
For every unique word in your text you create an instance of this class. Positions array will store positions (counted in words, not characters) from the start of the text where this word was found.
Then make another two lists which will serve as indexes. One will store all these classes sorted by their texts, the other - by their positions in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.
Then to search for a phrase, you split that phrase into words. Look up the first word in the Dictionary (that's O(log(n))). From there you know what are the possible words that follow it in the text (you have them from the Positions array). Look at those words (use the position index to find them in O(1)) and go on, until you've found one or more full matches.
Are you trying to achieve a list of matched words or are you trying to highlight them in the text getting the start and length of the match position? If all you're trying to do is find out if the words exist, then you could use subset theory to fairly efficiently perform this.
However, I expect you're trying to each match's start position in the text... in which case this approach wouldn't work.
The most efficient approach I can think is to dynamically build a match pattern using a list and then use regex. It's far easier to maintain a list of 1000 items than it is to maintain a regex pattern based on those same 1000 items.
It is my understanding that Regex uses the same KMP algorithm suggested to efficiently process large amounts of data - so unless you really need to dig through and understand the minutiae of how it works (which might be beneficial for personal growth), then perhaps regex would be ok.
There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf
Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search in. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternations. This basically amortizes the overall cost of the search to O(n + m) where n is the length of your text and m is the length of all patterns, combined. Notice that this is a very good performance.
An alternative that's well suited for many patterns is the Wu Manber algorithm. I've already posted a very simplistic C++ implementation of the algorithm.
Ok, current rework shows this as fastest (psuedocode):
foreach (var term in allTerms)
{
string pattern = term.ToWord(); // Use /b word boundary regex
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
if (regex.IsMatch(bigTextToSearchForTerms))
{
result.Add(term);
}
}
What was surprising (to me at least!) is that running the regex 1000 times was faster that a single regex with 1000 alternatives, i.e. "/b term1 /b | /b term2 /b | /b termN /b" and then trying to use regex.Matches.Count
How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...
List<String> allTerms = new List<String>(new String(){"string1", "string2", "string3", "string4"});
List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase));
This uses classic predicates to implement the FIND method, so it should be quicker than LINQ:
static bool Match(string checkItem)
{
return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}
static void Main(string[] args)
{
List<String> allTerms = new List<String>(new String(){"string1", "string2", "string3", "string4"});
List<String> matches = allTerms.Find(Match);
}
Or this uses the lambda syntax to implement the classic predicate, which again should be faster than the LINQ, but is more readable than the previous syntax:
List<String> allTerms = new List<String>(new String(){"string1", "string2", "string3", "string4"});
List<String> matches = allTerms.Find(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));
I haven't tested any of them for performance, but they all implement your idea of iteration through the search list using the regex. It's just different methods of implementing it.

Categories

Resources