remove stop words from text C# - c#

I want to remove an array of stop words from an input string, and I have the following procedure:
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
string input = "Did you try this yourself before asking";
foreach (string word in arrToCheck )
{
input = input.Replace(word, "");
}
Is this the best way to perform this task, especially when I have 450 stop words and the input string is long? I prefer using the Replace method because I want to remove the stop words when they appear in different morphologies. For example, if the stop word is "do", then delete "do" from "doing", "does" and so on. Are there any suggestions for better and faster processing? Thanks in advance.

May I suggest a StringBuilder?
http://msdn.microsoft.com/en-us/library/system.text.stringbuilder.aspx
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
StringBuilder input = new StringBuilder("Did you try this yourself before asking");
foreach (string word in arrToCheck )
{
input.Replace(word, "");
}
Because it does all its processing inside its own data structure and doesn't allocate hundreds of new strings, I believe you will find it to be far more memory efficient.

There are a few aspects to this
Premature optimization
The method given works and is easy to understand/maintain. Is it causing a performance problem?
If not, then don't worry about it. If it ever causes a problem, then look at it.
Expected Results
In the example, what do you want the output to be?
"Did you this asking"
or
"Did you this  asking" (with the double space that is left behind where "yourself" was removed)
You have added spaces to the end of "try" and "before" but not "yourself". Why? Typo?
string.Replace() is case-sensitive. If you care about casing, you need to modify the code.
Working with partials is messy.
Words change in different tenses. The example of 'do' being removed from 'doing' works, but how about 'take' and 'taking'?
The order of the stop words matters because you are changing the input. It is possible (I've no idea how likely but possible) that a word which was not in the input before a change 'appears' in the input after the change. Do you want to go back and recheck each time?
Do you really need to remove the partials?
Optimizations
The current method is going to work its way through the input string n times, where n is the number of words to be redacted, creating a new string each time a replacement occurs. This is slow.
Using StringBuilder (see akatakritos' answer above) will speed that up somewhat, so I would try this first. Retest to see if this makes it fast enough.
Linq can be used
EDIT
Just splitting by ' ' to demonstrate. You would need to allow for punctuation marks as well and decide what should happen with them.
END EDIT
[TestMethod]
public void RedactTextLinqNoPartials() {
var arrToCheck = new string[] { "try", "yourself", "before" };
var input = "Did you try this yourself before asking";
var output = string.Join(" ",input.Split(' ').Where(wrd => !arrToCheck.Contains(wrd)));
Assert.AreEqual("Did you this asking", output);
}
This will remove all the whole words (and the spaces, so it will not be possible to see where the words were removed from), but without some benchmarking I would not say that it is faster.
Handling partials with LINQ becomes messy, but it can work if we only want one pass (no checking for 'discovered' words):
[TestMethod]
public void RedactTextLinqPartials() {
var arrToCheck = new string[] { "try", "yourself", "before", "ask" };
var input = "Did you try this yourself before asking";
var output = string.Join(" ", input.Split(' ').Select(wrd => {
var found = arrToCheck.FirstOrDefault(chk => wrd.IndexOf(chk) != -1);
return found != null
? wrd.Replace(found,"")
: wrd;
}).Where(wrd => wrd != ""));
Assert.AreEqual("Did you this ing", output);
}
Just from looking at this I would say that it is slower than the string.Replace() but without some numbers there is no way to tell. It is definitely more complicated.
Bottom Line
The String.Replace() approach (modified to use string builder and to be case insensitive) looks like a good first cut solution. Before trying anything more complicated I would benchmark it under likely performance conditions.
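As a rough illustration of a case-insensitive first cut, here is a minimal sketch. Note that it swaps string.Replace() for a single Regex.Replace call, which is a different technique from the one recommended above; the class and method names are made up for the example, and it has not been benchmarked.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class StopWordRegexSketch
{
    // Builds one alternation pattern from the stop words and removes every
    // case-insensitive occurrence (including partial-word matches) in one pass.
    public static string RemoveAll(string input, string[] stopWords)
    {
        // Regex.Escape guards against stop words that contain regex metacharacters.
        // Longer entries go first so that a longer stop word wins over a shorter
        // one that is its prefix.
        string pattern = string.Join("|",
            stopWords.OrderByDescending(w => w.Length)
                     .Select(w => Regex.Escape(w)));
        return Regex.Replace(input, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

Calling StopWordRegexSketch.RemoveAll(input, arrToCheck) visits the input once instead of once per stop word. Whether that is actually faster for 450 words is something only a benchmark can tell.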
hth,
Alan.

Here you go:
var words_to_remove = new HashSet<string> { "try", "yourself", "before" };
string input = "Did you try this yourself before asking";
string output = string.Join(
" ",
input
.Split(new[] { ' ', '\t', '\n', '\r' /* etc... */ })
.Where(word => !words_to_remove.Contains(word))
);
Console.WriteLine(output);
This prints:
Did you this asking
The HashSet provides extremely quick lookups, so 450 elements in words_to_remove should be no problem at all. Also, we are traversing the input string only once (instead of once per word to remove as in your example).
However, if the input string is very long, there are ways to make this more memory efficient (if not quicker), by not holding the split result in memory all at once.
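One possible way to avoid holding the whole split result in memory is to yield words lazily and append the survivors to a StringBuilder. This is only a sketch under the assumption that words are separated by whitespace; the class and method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class LazyStopWordFilter
{
    // Yields the whitespace-separated words of the input one at a time,
    // instead of materialising the whole Split() array first.
    public static IEnumerable<string> Words(string input)
    {
        int start = -1; // index where the current word began, or -1 if between words
        for (int i = 0; i <= input.Length; i++)
        {
            bool atSeparator = i == input.Length || char.IsWhiteSpace(input[i]);
            if (!atSeparator)
            {
                if (start < 0) start = i;
            }
            else if (start >= 0)
            {
                yield return input.Substring(start, i - start);
                start = -1;
            }
        }
    }

    // Joins every word that is not in stopWords with single spaces.
    public static string Remove(string input, ISet<string> stopWords)
    {
        var output = new StringBuilder(input.Length);
        foreach (string word in Words(input))
        {
            if (stopWords.Contains(word)) continue;
            if (output.Length > 0) output.Append(' ');
            output.Append(word);
        }
        return output.ToString();
    }
}
```

Only one word plus the growing output buffer is alive at any time, which may matter for very long inputs.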
To remove not just "do" but "doing", "does" etc... you'll have to include all these variants in the words_to_remove. If you wanted to remove prefixes in a general way, this would be possible to do (relatively) efficiently using a trie of words to remove (or alternatively a suffix tree of input string), but what to do when "do" is not a prefix of something that should be removed, such as "did"? Or when it is prefix of something that shouldn't be removed, such as "dog"?
BTW, to remove words no matter their case, simply pass the appropriate case-insensitive comparer to HashSet constructor, for example StringComparer.CurrentCultureIgnoreCase.
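A minimal sketch of the case-insensitive variant follows. It uses StringComparer.OrdinalIgnoreCase rather than the culture-aware comparer mentioned above, which is an assumption on my part; pick whichever comparer matches your text. The class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class CaseInsensitiveStopWords
{
    public static string Remove(string input, IEnumerable<string> stopWords)
    {
        // The comparer passed to the constructor makes Contains ignore case.
        var set = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);
        return string.Join(" ",
            input.Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
                 .Where(word => !set.Contains(word)));
    }
}
```

With this, "TRY" and "Yourself" in the input are removed even though the set holds the lowercase forms.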
--- EDIT ---
Here is another alternative:
var words_to_remove = new[] { " ", "try", "yourself", "before" }; // Note the space!
string input = "Did you try this yourself before asking";
string output = string.Join(
" ",
input.Split(words_to_remove, StringSplitOptions.RemoveEmptyEntries)
);
I'm guessing it should be slower (unless string.Split uses a hashtable internally), but is nice and tidy ;)

For a simple way to remove a list of strings from your sentence, and aggregate the results back together, you can do the following:
var input = "Did you try this yourself before asking";
var arrToCheck = new [] { "try ", "yourself", "before " };
var result = input.Split(arrToCheck, StringSplitOptions.None)
.Aggregate((first, second) => first + second);
This will break your original string apart by your word delimiters, and create one final string using the result set from the split array. (Passing arrToCheck.Count() as a count argument would cap the split at three pieces and leave "before" in the output.)
The result will be "Did you this  asking", with a double space where "yourself" was removed.

You can shorten your code with Array.ForEach (arrays have no instance ForEach method):
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
var test = new StringBuilder("Did you try this yourself before asking");
Array.ForEach(arrToCheck, x => test.Replace(x, ""));
Console.WriteLine(test.ToString());

String.Join(" ", input.Split(' ')
.Where(w => stop.Where(sW => sW == w).FirstOrDefault() == null)
.ToArray());

Related

Remove personal data from text with efficient replace on strings from a word list with 6 million words

I'm having trouble with the performance of a replace function that I am trying to build to remove personal information from text posts.
I have thousands of strings looking like this:
Hi, the user: f3a11-010101-01a1-1111 is a great co-worker. She lives in Manchester. If you want to contact her, the telephone number is 1111111. /Marcus
The strings often contain much more text.
In the end I want to replace all the info in the text that is in the word-list. The final string will look like this.
Hi, the user: [userID] is a great co-worker. She lives in [city]. If you want to contact her, the telephone number is [telephone]. /[Name]
The word list is in lowercase, but the function has to be case insensitive.
I have tried several approaches, which I have abandoned because they are too slow.
What I have now is the fastest. I work with C# 6.
Here is some pseudo code to explain it.
The wordlist is in a dictionary.
Dictionary<string, string> word = PopulateWordList();
I loop through each string with text that can have values that should be replaced.
foreach (var post in objColl)
{
string[] substrings = Regex.Split(post.Text, @"( |,|!|\.)");
replaced = false;
foreach (string str in substrings)
{
if (str.Length > 4)
{
stringToCompare = str.ToLower();
if (word.Keys.Contains(stringToCompare))
{
replaced = true;
str = words[stringToCompare];
}
}
}
if (replaced)
post.Text = String.Join("", substrings);
}
This code works but it is slow. The important thing is that words should be matched if they have trailing characters like .?! or signs before them like / and so on. Not all of these signs are included in the code above.
I have also tried splitting the string, populating a HashSet and checking whether it intersects with the dictionary keys. But for that to work you have to lowercase the string before you split it, and then when you do the replace you have to use the above code to preserve the casing. There was no real performance improvement, though, since almost every post has something to replace.
In my real code I also use parallel foreach loops, but I left that out of my example.
Before this code I tried Regex.Replace, which can ignore case, but it is very slow, and to prevent matching partial words you have to add spaces and trailing characters to the words.
Small example of trashed code:
foreach (var word in wordlist)
{
stringWithText = Regex.Replace(stringWithText, ' ' + word.Oldvalue + ' ', ' ' + word.Newvalue + ' ', RegexOptions.IgnoreCase);
stringWithText = Regex.Replace(stringWithText, ' ' + word.Oldvalue, '.' + word.Newvalue + '.', RegexOptions.IgnoreCase);
stringWithText = Regex.Replace(stringWithText, ' ' + word.Oldvalue, '!' + word.Newvalue + '!', RegexOptions.IgnoreCase);
//And several more replaces to handle every case. You also have to handle when the word is first or last in the string.
}
I have tried many more ways but nothing is faster than my first code. This topic is widely discussed and there have been many threads on it over the years. I have looked at many but not found any better way.
I'm using a StopWatch to measure the time it takes to go through the same smaller set of posts and wordlist so I know what time it takes with each code change.
Any ideas on how to improve this or if there is a completely different way to solve this? Just don't suggest sending the data to cloud API with an AI language model to solve this as it contains personal data.
There are also problems with conjugations that may not be in the word list, or with phrases like "Manchester City", which will be replaced with "[city] City".
So my code doesn't handle words containing spaces.
I also know I won't be able to solve this perfectly but it can probably be done better and faster.
This part is especially highly suboptimal.
if (word.Keys.Contains(stringToCompare))
{
replaced = true;
str = words[stringToCompare];
}
Change PopulateWordList() so that when it does a new Dictionary<string, string>() to use the constructor that allows a comparer, as so: new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase).
Then change:
stringToCompare = str.ToLower();
if (word.Keys.Contains(stringToCompare))
{
replaced = true;
str = words[stringToCompare];
}
to:
if (word.TryGetValue(str, out temp))
{
replaced = true;
str = temp;
}
And right before the outer foreach, make a string temp = null;
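Put together, the suggestion above might look like the following sketch. The sample entries and the names BuildWordList and ReplaceTokens are hypothetical stand-ins for the asker's PopulateWordList() and loop body; an indexed for loop is used because a foreach iteration variable cannot be assigned to.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class WordListLookup
{
    // Stand-in for PopulateWordList(): the comparer makes lookups case insensitive,
    // so no ToLower() call is needed per token.
    public static Dictionary<string, string> BuildWordList()
    {
        return new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "manchester", "[city]" },   // sample entries, invented for the example
            { "marcus", "[Name]" }
        };
    }

    // Replaces tokens in place and reports whether anything changed.
    public static bool ReplaceTokens(string[] tokens, Dictionary<string, string> wordList)
    {
        bool replaced = false;
        for (int i = 0; i < tokens.Length; i++)
        {
            string replacement;
            // One hash lookup instead of Keys.Contains followed by the indexer.
            if (wordList.TryGetValue(tokens[i], out replacement))
            {
                tokens[i] = replacement;
                replaced = true;
            }
        }
        return replaced;
    }
}
```

The separate `string replacement;` declaration keeps the code compatible with C# 6, which the question says is in use.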

How to take a string, split it into an array and then join it back together

I have some code that takes in a string from the console in [0,0,0,0] format. I then want to split it into an array but leave out the [] and only take the numbers. How do I do this? This is what I have. I thought I could split it all and remove the brackets afterwards, but it doesn't seem to keep the brackets and instead leaves an empty entry. Is there a way to split from index 1 to -1?
input = Console.ReadLine();
Char[] splitChars = {'[',',',']'};
List<string> splitString = new List<string>(input.Split(splitChars));
Console.WriteLine("[" + String.Join(",", splitString) + "]");
Console.ReadKey();
I love using LinqPad for such tasks, because you can use Console.WriteLine() to inspect the details of a result.
It becomes obvious that you have empty entries, i.e. "" at the beginning and the end. You want to remove those with the overloaded function that takes StringSplitOptions.RemoveEmptyEntries [MSDN]:
List<string> splitString = new List<string>(input.Split(splitChars, StringSplitOptions.RemoveEmptyEntries));
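To make the difference concrete, here is a small sketch; the class name BracketListParser is made up for the example.

```csharp
using System;

// Hypothetical helper, not part of the original answer.
public static class BracketListParser
{
    private static readonly char[] SplitChars = { '[', ',', ']' };

    // Without options: the leading '[' and trailing ']' each produce an empty entry.
    public static string[] ParseKeepingEmpties(string input)
    {
        return input.Split(SplitChars);
    }

    // With RemoveEmptyEntries: only the numbers remain.
    public static string[] Parse(string input)
    {
        return input.Split(SplitChars, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For the input "[1,2,3,4]" the first method returns six entries (an empty string at each end) and the second returns the four numbers, so "[" + String.Join(",", ...) + "]" rebuilds the original text.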

How to split lines but keep those with tabs?

I know how to split a long string by new lines
string[] lines = d.Split(new string[] { Environment.NewLine },
StringSplitOptions.RemoveEmptyEntries);
But I would like to
Split in new lines like this
and in new lines + tabs
    like this, in order to have
    an array which contains the first line and this second block.
Basically I would like to split not by one rule, but by two.
The resulting array should contain following strings:
[0] Split in new lines like this
[1] and in new lines + tabs
        like this, in order to have
        an array which contains the first line and this second block.
This trick should work. I have replaced "\n\t" with "\r" temporarily. After splitting the string, restored back the "\n\t" string. So your array lines, will have desired count of strings.
This way, you can get your desired output:
d = d.Replace("\n\t", "\r");
string[] lines = d.Split(new string[] {"\n"}, StringSplitOptions.RemoveEmptyEntries);
lines = lines.Select((line) => line = line.Replace("\r", "\n\t")).ToArray();
What is a tab?
If it is 4 spaces, use the following pattern:
string pattern = Environment.NewLine + @"(?!\s{4})";
If it is tabulation, use the following pattern:
string pattern = Environment.NewLine + @"(?!\t)";
Next, we use a regular expression:
string[] lines = Regex.Split(text, pattern);
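As a self-contained sketch of the negative-lookahead split, the version below matches `\r?\n` instead of Environment.NewLine so that it behaves the same for both newline styles; that substitution and the class name are my own assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class IndentAwareSplitter
{
    // Splits at every newline that is NOT followed by a tab, so tab-indented
    // continuation lines stay attached to the line before them.
    public static string[] SplitKeepingTabbedLines(string text)
    {
        // \r?\n  : a Unix or Windows newline
        // (?!\t) : negative lookahead, refuses the split if the next character is a tab
        return Regex.Split(text, @"\r?\n(?!\t)");
    }
}
```

For the text "a\nb\n\tc\n\td" this yields two entries, "a" and "b\n\tc\n\td".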
One should use the right tool for the right situation. To avoid using a tool available to one (a tool which is in every programming language btw) is foolish.
Regex is best when a discernible pattern has to be expressed which can't be done, or easily done, by the string functions. Below I use each tool in the situation it was best designed for...
The following is a three stage operation using regex, string op, and Linq.
1. Identify which lines have to be kept together due to the indent rule. This is done so as not to lose them in the main split operation: the operation replaces \r\n\t with a pipe character (|) to mark them as special. This is done with regex because we are able to effectively group and process the operations with minimal overhead.
2. Split all the remaining lines by the newline character, which gives us a glimpse of the final array of lines wanted.
3. Project (change) each line via LINQ's Select to change the | back to a \r\n.
Code
Regex.Replace(text, "\r\n\t", "|\t")
.Split(new string[] { Environment.NewLine }, StringSplitOptions.None )
.Select (rLine => rLine.Replace("|", Environment.NewLine));
Full code with before and after results as run in LinqPad. Note that .Dump() is only available in Linqpad to show results and is not a Linq extension.
Full code
string text = string.Format("{1}{0}{2}{0}\t\t{3}{0}\t\t{4}",
Environment.NewLine,
"Split in new lines like this",
"and in new lines + tabs",
"like this, in order to have",
"an array which contains the first line and this second block.");
text.Dump("Before");
var result = Regex.Replace(text, "\r\n\t", "|\t")
.Split(new string[] { Environment.NewLine }, StringSplitOptions.None )
.Select (rLine => rLine.Replace("|", Environment.NewLine));
result.Dump("after");
After you split all lines, you can just "join" tabbed lines like so:
List<string> lines = d.Split(new string[] { Environment.NewLine },
StringSplitOptions.RemoveEmptyEntries) // also avoids indexing into empty lines below
.ToList();
// loop through all lines, but skip the first (lets assume it isn't tabbed)
for (int i = 1; i < lines.Count; i++)
{
if (lines[i][0] == '\t') // current line starts with tab
{
lines[i - 1] += "\r\n" + lines[i]; // append it to prev line
lines.RemoveAt(i); // remove current line from list
i--; // and dec i so you don't skip an item
}
}
You could add more complex logic to consider different number of tabs if you wanted, but this should get you started.
If you expect many tabbed lines to be grouped together, you might want to use StringBuilder instead of string for increased performance in appending the lines back together.
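A possible StringBuilder-based variant of the loop above is sketched here. It builds each group in a buffer instead of repeatedly concatenating strings and removing list items. The class name is hypothetical, and it assumes (unlike the loop above) that a tabbed first line simply starts its own group.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class TabbedLineGrouper
{
    public static List<string> Group(string text)
    {
        var result = new List<string>();
        StringBuilder current = null; // buffer for the group being assembled

        // Accept both newline styles; RemoveEmptyEntries guarantees line[0] exists.
        string[] lines = text.Split(new[] { "\r\n", "\n" },
                                    StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            if (line[0] == '\t' && current != null)
            {
                // Tabbed line: append it to the group in progress.
                current.Append(Environment.NewLine).Append(line);
            }
            else
            {
                // Untabbed line: flush the previous group and start a new one.
                if (current != null) result.Add(current.ToString());
                current = new StringBuilder(line);
            }
        }
        if (current != null) result.Add(current.ToString());
        return result;
    }
}
```

Each input line is appended exactly once, so the cost stays linear even when many tabbed lines are grouped together.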

Using string.ToUpper on substring

Have an assignment to allow a user to input a word in C# and then display that word with the first and third characters changed to uppercase. Code follows:
namespace Capitalizer
{
class Program
{
static void Main(string[] args)
{
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
string Upper = text.ToUpper();
Console.WriteLine(Upper);
Console.ReadKey();
}
}
}
This of course generates the entire word in uppercase, which is not what I want. I can't seem to make text.ToUpper(0,2) work, and even then that'd capitalize the first three letters. Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
The simplest way I can think of to address your exact question as described — to convert to upper case the first and third characters of the input — would be something like the following:
StringBuilder sb = new StringBuilder(text);
sb[0] = char.ToUpper(sb[0]);
sb[2] = char.ToUpper(sb[2]);
text = sb.ToString();
The StringBuilder class is essentially a mutable string object, so when doing these kinds of operations is the most fluid way to approach the problem, as it provides the most straightforward conversions to and from, as well as the full range of string operations. Changing individual characters is easy in many data structures, but insertions, deletions, appending, formatting, etc. all also come with StringBuilder, so it's a good habit to use that versus other approaches.
But frankly, it's hard to see how that's a useful operation. I can't help but wonder if you have stated the requirements incorrectly and there's something more to this question than is seen here.
You could use LINQ:
var upperCaseIndices = new[] { 0, 2 };
var message = "hello";
var newMessage = new string(message.Select((c, i) =>
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c).ToArray());
Here is how it works. message.Select (inline LINQ query) selects characters from message one by one and passes into selector function:
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c
written as C# ?: shorthand syntax for if. It reads as "If index is present in the array, then select upper case character. Otherwise select character as is."
(c, i) => condition
is a lambda expression. See also:
Understand Lambda Expressions in 3 minutes
The rest is very simple - represent result as array of characters (.ToArray()), and create a new string based off that (new string(...)).
Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
That seems a lot more complicated than necessary. Once you have a character array, you can simply change the elements of that character array. In a separate function, it would look something like
string MakeFirstAndThirdCharacterUppercase(string word) {
var chars = word.ToCharArray();
chars[0] = char.ToUpper(chars[0]); // char has no instance ToUpper(); use the static method
chars[2] = char.ToUpper(chars[2]);
return new string(chars);
}
My simple solution:
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
foreach (string s in words)
{
char[] chars = s.ToCharArray();
chars[0] = char.ToUpper(chars[0]);
if (chars.Length > 2)
{
chars[2] = char.ToUpper(chars[2]);
}
Console.Write(new string(chars));
Console.Write(' ');
}
Console.ReadKey();

Regular Expression split string and get whats in brackets [ ] put into array

I am trying to use regex to split the string into 2 arrays to turn out like this.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
How do I split str1 to break off into 2 arrays that look like this:
ary1 = ['First Second','Third Forth','Fifth'];
ary2 = ['insideFirst','insideSecond'];
here is my solution
string str = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
MatchCollection matches = Regex.Matches(str, @"\[.*?\]");
string[] arr = matches.Cast<Match>()
.Select(m => m.Groups[0].Value.Trim(new char[]{'[',']'}))
.ToArray();
foreach (string s in arr)
{
Console.WriteLine(s);
}
string[] arr1 = Regex.Split(str, @"\[.*?\]")
.Select(x => x.Trim())
.ToArray();
foreach (string s in arr1)
{
Console.WriteLine(s);
}
Output
insideFirst
insideSecond
First Second
Third Forth
Fifth
Please try the code below. It's working fine for me.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var output = String.Join(";", Regex.Matches(str1, @"\[(.+?)\]")
.Cast<Match>()
.Select(m => m.Groups[1].Value));
string[] strInsideBreacket = output.Split(';');
for (int i = 0; i < strInsideBreacket.Count(); i++)
{
str1 = str1.Replace("[", ";");
str1 = str1.Replace("]", "");
str1 = str1.Replace(strInsideBreacket[i], "");
}
string[] strRemaining = str1.Split(';');
Here, strInsideBreacket is the array of bracketed values (insideFirst and insideSecond), and strRemaining is the array of the remaining text (First Second, Third Forth and Fifth).
Try this solution,
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var allWords = str1.Split(new char[] { '[', ']' }, StringSplitOptions.RemoveEmptyEntries);
var result = allWords.GroupBy(x => x.Contains("inside")).ToArray();
The idea is to first get all the words and then group them.
It seems to me that "user2828970" asked a question with an example, not with literal text he wanted to parse. In my mind, he could very well have asked this question:
I am trying to use regex to split a string like so.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"\d+");
The result of regexSplit is: I had, birds but, of them flew away.
However, I also want to know the value which resulted in the second string splitting away from its preceding text, and the value which resulted in the third string splitting away from its preceding text. i.e.: I want to know about 185 and 20.
The string could be anything, and the pattern to split by could be anything. The answer should not have hard-coded values.
Well, this simple function will perform that task. The code can be optimized to compile the regex, or re-organized to return multiple collections or different objects. But this is (nearly) the way I use it in production code.
public static List<Tuple<string, string>> RegexSplitDetail(this string text, string pattern)
{
var splitAreas = new List<Tuple<string, string>>();
var regexResult = Regex.Matches(text, pattern);
var regexSplit = Regex.Split(text, pattern);
for (var i = 0; i < regexSplit.Length; i++)
splitAreas.Add(new Tuple<string, string>(i == 0 ? null : regexResult[i - 1].Value, regexSplit[i]));
return splitAreas;
}
...
var result = exampleSentence.RegexSplitDetail(@"\d+");
This would return a single collection which looks like this:
{ null, "I had "}, // First value, had no value splitting it from a predecessor
{"185", " birds but "}, // Second value, split from the preceding string by "185"
{ "20", " of them flew away"} // Third value, split from the preceding string by "20"
Being that this is a .NET Question and, apart from my more favoured approach in my other answer, you can also capture the Split Value another VERY Simple way. You just then need to create a function to utilize the results as you see fit.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"(\d+)");
The result of regexSplit is: I had, 185, birds but, 20, of them flew away. As you can see, the split values exist within the split results.
Note the subtle difference compared to my other answer. In this regex split, I used a Capture Group around the entire pattern (\d+) You can't do that!!!?.. can you?
Using a Capture Group in a Split will force all capture groups of the Split Value between the Split Result Capture Groups. This can get messy, so I don't suggest doing it. It also forces somebody using your function(s) to know that they have to wrap their regexes in a capture group.
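If you do go the capture-group route, one way to hide the wrapping from callers is sketched below. The helper name is made up, and it assumes the caller's pattern contains no capture groups of its own, since those would add extra entries to the split result.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class CaptureSplitSketch
{
    // Returns (separator, text) pairs; the first pair has a null separator.
    public static List<Tuple<string, string>> SplitWithSeparators(string text, string pattern)
    {
        // Wrapping the whole pattern in one group makes Regex.Split return
        // text, separator, text, separator, ..., text.
        string[] parts = Regex.Split(text, "(" + pattern + ")");

        var result = new List<Tuple<string, string>>();
        result.Add(Tuple.Create((string)null, parts[0]));
        for (int i = 1; i + 1 < parts.Length; i += 2)
        {
            // parts[i] is the separator, parts[i + 1] the text that follows it.
            result.Add(Tuple.Create(parts[i], parts[i + 1]));
        }
        return result;
    }
}
```

For the example sentence and the pattern \d+, this produces the same three pairs as RegexSplitDetail, while running the regex over the text only once.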
