Remove specific characters from string C#


I have a string ["foo","bar","buzz"] coming from a view, and I want to remove the characters [, " and ].
I used string x = tags.Trim(new Char[] { '[', '"', ']' }); but the output I get is foo","bar","buzz instead of foo,bar,buzz.
I have tried trimming, but I am still having the problem.

As an alternative, you can use a "simple" Replace:
string x = tags.Replace("[", "")
               .Replace("\"", "")
               .Replace("]", "");
It isn't fast, but it's simple.
If you need more performance, you should use a different approach.
Please note that each Replace call returns a new string, and every call scans the whole string again. Although I often use this myself (because it is readable), it's not recommended for complex patterns or very long strings.
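If the chained Replace calls ever become a bottleneck, the same result can be produced in one pass over the string. The following is only a minimal sketch, assuming the input looks like the string in the question; the name unwanted is illustrative and not from the original posts.

```csharp
// Top-level statements (C# 9+); on older compilers place this inside Main.
using System;
using System.Text;

string tags = "[\"foo\",\"bar\",\"buzz\"]";
const string unwanted = "[]\"";   // characters to drop (illustrative name)

// Copy every character that is not in the unwanted set, in a single pass.
var sb = new StringBuilder(tags.Length);
foreach (char c in tags)
{
    if (unwanted.IndexOf(c) < 0)
        sb.Append(c);
}
string x = sb.ToString();

Console.WriteLine(x); // foo,bar,buzz
```

Trim only removes characters from the start and the end of a string, which is why the inner quotes survived in the question.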

By the power of Regex!
var x = Regex.Replace(tags, @"(\[|""|\])", "");

Personally, I'd switch your order of operations. For instance
String[] unformattedTags = tags.Split(',');
String[] formattedTags = unformattedTags.Select(itm => itm.Trim( '[','"',']')).ToArray();
This removes the restricted characters from each tag individually.

Related

Remove personal data from text with efficient replace on strings from a word list with 6 million words

I'm having trouble with the performance of a replace function that I am building to remove personal information from text posts.
I have thousands of strings that look like this:
Hi, the user: f3a11-010101-01a1-1111 is a great co-worker. She lives in Manchester. If you want to contact her, the telephone number is 1111111. /Marcus
The strings often contain much more text.
In the end I want to replace everything in the text that appears in the word list. The final string will look like this:
Hi, the user: [userID] is a great co-worker. She lives in [city]. If you want to contact her, the telephone number is [telephone]. /[Name]
The word list is in lowercase, but the function has to be case-insensitive.
I have tried several approaches, which I abandoned because they were too slow.
What I have now is the fastest. I work with C# 6.
Here is some pseudocode to explain it.
The word list is in a dictionary.
Dictionary<string, string> word = PopulateWordList();
I loop through each string whose text may contain values that should be replaced.
foreach (var post in objColl)
{
    string[] substrings = Regex.Split(post.Text, @"( |,|!|\.)");
    replaced = false;
    foreach (string str in substrings)
    {
        if (str.Length > 4)
        {
            stringToCompare = str.ToLower();
            if (word.Keys.Contains(stringToCompare))
            {
                replaced = true;
                str = words[stringToCompare];
            }
        }
    }
    if (replaced)
        post.Text = String.Join("", substrings);
}
This code works, but it is slow. The important thing is that words should still be matched when they have trailing characters like . ? ! or are preceded by signs like / and so on. Not all of these signs are handled in the code above.
I have also tried splitting the string, populating a HashSet, and checking whether it intersects with the dictionary keys. For that to work you have to lowercase the string before you split it, and then when you do the replace you have to use the code above to preserve the casing. There was no real performance improvement, as almost every post has something to replace.
In my real code I also use parallel foreach loops, but I left that out of the example.
Before this code I tried Regex.Replace, which can ignore case, but it is extremely slow, and to avoid replacing parts of words you have to add the surrounding spaces and trailing characters to the words.
Small example of trashed code:
foreach (var word in wordlist)
{
stringWithText = Regex.Replace(stringWithText, ' ' + word.Oldvalue + ' ', ' ' + word.Newvalue + ' ', RegexOptions.IgnoreCase);
stringWithText = Regex.Replace(stringWithText, ' ' + word.Oldvalue, '.' + word.Newvalue + '.', RegexOptions.IgnoreCase);
stringWithText = Regex.Replace(stringWithText, ' ' + word.Oldvalue, '!' + word.Newvalue + '!', RegexOptions.IgnoreCase);
//And several more replaces to handle every case. You also have to handle when the word is first or last in the string.
}
I have tried many more ways, but nothing is faster than my first code. This topic is widely discussed and there have been many threads on it over the years. I have looked at many of them but have not found a better way.
I'm using a Stopwatch to measure the time it takes to go through the same smaller set of posts and word list, so I know how long each code change takes.
Any ideas on how to improve this, or is there a completely different way to solve it? Just don't suggest sending the data to a cloud API with an AI language model, as it contains personal data.
There are also problems with conjugations that may not be in the word list, and with names like "Manchester City", which will be replaced with "[city] City".
So my code doesn't handle entries that contain spaces.
I also know I won't be able to solve this perfectly, but it can probably be done better and faster.

This part is particularly suboptimal:
if (word.Keys.Contains(stringToCompare))
{
    replaced = true;
    str = words[stringToCompare];
}
Change PopulateWordList() so that, where it creates the Dictionary<string, string>, it uses the constructor that accepts a comparer, like so: new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase).
Then change:
stringToCompare = str.ToLower();
if (word.Keys.Contains(stringToCompare))
{
    replaced = true;
    str = words[stringToCompare];
}
to:
if (word.TryGetValue(str, out temp))
{
    replaced = true;
    str = temp;
}
And right before the outer foreach, declare string temp = null;
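Put together, the suggestion might look like the sketch below. It is not the asker's real code: the two dictionary entries stand in for PopulateWordList(), the "/" delimiter is added here as an assumption so that "/Marcus" is split, and an indexed for loop is used because a foreach variable cannot be assigned to.

```csharp
// Top-level statements (C# 9+); on older compilers place this inside Main.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for PopulateWordList(): keys compare case-insensitively.
var word = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["manchester"] = "[city]",
    ["marcus"] = "[Name]",
};

string text = "She lives in Manchester. /Marcus";

// The capturing group makes Regex.Split return the delimiters too,
// so concatenating the pieces rebuilds the original punctuation.
string[] parts = Regex.Split(text, @"( |,|!|\.|/)");

bool replaced = false;
for (int i = 0; i < parts.Length; i++)
{
    // One dictionary lookup per token and no ToLower() allocation.
    if (parts[i].Length > 4 && word.TryGetValue(parts[i], out string temp))
    {
        parts[i] = temp;
        replaced = true;
    }
}

string result = replaced ? string.Concat(parts) : text;
Console.WriteLine(result); // She lives in [city]. /[Name]
```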

How to split a string on every parenthesis and keep all text in an array in C#

I'm trying to split a string on every parenthesis into an array in C#, keeping all of the text, including everything inside the parentheses.
Example: "hmmmmmmmm (asdfhqwe)asasd"
Should become: "hmmmmmmmm", "(asdfhqwe)" and "asasd".
My current code only takes what is inside the parentheses and discards the rest.
var output = input.Split('(', ')').Where((item, index) => index % 2 != 0).ToList();
How would I go about doing this (disregarding my current code)?

Use Regex.Split with a positive look-ahead and look-behind and optional whitespace, then filter out empty strings.
var tokens = Regex
    .Split(str, @"(?<=[)])\s*|\s*(?=[(])")
    .Where(s => s != string.Empty)
    .ToList();

Okay, so I do not know what the real string will look like in your application, but based on the provided string this would be my hack of a solution:
string sample = "hmmmmmmmm (asdfhqwe)asasd";
var result = sample.Replace("(", ",(").Replace(")", "),").Split(',');
So I replaced each place where the split should happen with a comma, but you can use any other character that will never occur in your string; '~', for example, could also work.
Without knowing all the required functionality, I can only say that this works for the scenario above.

Try this:
string[] subString = myString.Split(new char[] { '(', ')' });

remove stop words from text C#

I want to remove an array of stop words from an input string, and I have the following procedure:
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
string input = "Did you try this yourself before asking";
foreach (string word in arrToCheck )
{
input = input.Replace(word, "");
}
Is this the best way to do it, especially when I have 450 stop words and the input string is long? I prefer the Replace method because I want to remove the stop words even when they appear in different morphological forms. For example, if the stop word is "do", then "do" should be deleted from "doing", "does" and so on. Are there any suggestions for better and faster processing? Thanks in advance.

May I suggest a StringBuilder?
http://msdn.microsoft.com/en-us/library/system.text.stringbuilder.aspx
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
StringBuilder input = new StringBuilder("Did you try this yourself before asking");
foreach (string word in arrToCheck )
{
input.Replace(word, "");
}
Because it does all its processing inside its own data structure and doesn't allocate hundreds of new strings, I believe you will find it to be far more memory efficient.

There are a few aspects to this
Premature optimization
The method given works and is easy to understand/maintain. Is it causing a performance problem?
If not, then don't worry about it. If it ever causes a problem, then look at it.
Expected Results
In the example, what do you want the output to be?
"Did you this asking"
or
"Did you this  asking"
You have added spaces to the end of "try" and "before" but not "yourself". Why? Typo?
string.Replace() is case-sensitive. If you care about casing, you need to modify the code.
Working with partials is messy.
Words change in different tenses. The example of removing 'do' from 'doing' works, but how about 'take' and 'taking'?
The order of the stop words matters because you are changing the input. It is possible (I've no idea how likely but possible) that a word which was not in the input before a change 'appears' in the input after the change. Do you want to go back and recheck each time?
Do you really need to remove the partials?
Optimizations
The current method is going to work its way through the input string n times, where n is the number of words to be redacted, creating a new string each time a replacement occurs. This is slow.
Using StringBuilder (see akatakritos's answer above) will speed that up somewhat, so I would try this first. Retest to see if this makes it fast enough.
LINQ can also be used.
EDIT
Just splitting by ' ' to demonstrate. You would need to allow for punctuation marks as well and decide what should happen with them.
END EDIT
[TestMethod]
public void RedactTextLinqNoPartials() {
    var arrToCheck = new string[] { "try", "yourself", "before" };
    var input = "Did you try this yourself before asking";
    var output = string.Join(" ", input.Split(' ').Where(wrd => !arrToCheck.Contains(wrd)));
    Assert.AreEqual("Did you this asking", output);
}
This removes all the whole words (and the spaces, so it will not be possible to see where the words were removed from), but without some benchmarking I would not say that it is faster.
Handling partials with LINQ becomes messy, but it can work if we only want one pass (no checking for 'discovered' words):
[TestMethod]
public void RedactTextLinqPartials() {
    var arrToCheck = new string[] { "try", "yourself", "before", "ask" };
    var input = "Did you try this yourself before asking";
    var output = string.Join(" ", input.Split(' ').Select(wrd => {
        var found = arrToCheck.FirstOrDefault(chk => wrd.IndexOf(chk) != -1);
        return found != null
            ? wrd.Replace(found, "")
            : wrd;
    }).Where(wrd => wrd != ""));
    Assert.AreEqual("Did you this ing", output);
}
Just from looking at this I would say that it is slower than the string.Replace() but without some numbers there is no way to tell. It is definitely more complicated.
Bottom Line
The String.Replace() approach (modified to use StringBuilder and to be case-insensitive) looks like a good first-cut solution. Before trying anything more complicated, I would benchmark it under realistic conditions.
hth,
Alan.
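One possible way to get a single pass that is also case-insensitive, which plain string.Replace() cannot do, is a regular expression built from the stop words. This is a sketch of a different technique from the ones benchmarked above, not a measured improvement; the mixed-case input is made up for illustration.

```csharp
// Top-level statements (C# 9+); on older compilers place this inside Main.
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] arrToCheck = { "try", "yourself", "before" };
string input = "Did you Try this yourself BEFORE asking";

// Build one alternation pattern; Regex.Escape guards against special characters.
string pattern = string.Join("|", arrToCheck.Select(Regex.Escape));

// Single pass over the input, ignoring case; partial matches are removed too.
string output = Regex.Replace(input, pattern, "", RegexOptions.IgnoreCase);

// Collapse the runs of spaces left behind by the removals.
output = Regex.Replace(output, @" {2,}", " ").Trim();

Console.WriteLine(output); // Did you this asking
```

Because the pattern is not anchored to word boundaries, it removes partial matches just as Replace does; wrapping the alternation in \b(...)\b would restrict it to whole words.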

Here you go:
var words_to_remove = new HashSet<string> { "try", "yourself", "before" };
string input = "Did you try this yourself before asking";
string output = string.Join(
" ",
input
.Split(new[] { ' ', '\t', '\n', '\r' /* etc... */ })
.Where(word => !words_to_remove.Contains(word))
);
Console.WriteLine(output);
This prints:
Did you this asking
The HashSet provides extremely quick lookups, so 450 elements in words_to_remove should be no problem at all. Also, we are traversing the input string only once (instead of once per word to remove as in your example).
However, if the input string is very long, there are ways to make this more memory efficient (if not quicker), by not holding the split result in memory all at once.
To remove not just "do" but "doing", "does" etc... you'll have to include all these variants in the words_to_remove. If you wanted to remove prefixes in a general way, this would be possible to do (relatively) efficiently using a trie of words to remove (or alternatively a suffix tree of input string), but what to do when "do" is not a prefix of something that should be removed, such as "did"? Or when it is prefix of something that shouldn't be removed, such as "dog"?
BTW, to remove words no matter their case, simply pass the appropriate case-insensitive comparer to HashSet constructor, for example StringComparer.CurrentCultureIgnoreCase.
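As a small sketch of the case-insensitive variant just mentioned: the comparer is passed to the HashSet constructor and the rest stays the same. StringComparer.OrdinalIgnoreCase is used here as an assumption; StringComparer.CurrentCultureIgnoreCase works the same way if culture-aware comparison is needed. The mixed-case input is made up for illustration.

```csharp
// Top-level statements (C# 9+); on older compilers place this inside Main.
using System;
using System.Collections.Generic;
using System.Linq;

// The comparer makes Contains ignore case for every lookup.
var wordsToRemove = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "try", "yourself", "before"
};
string input = "Did you TRY this Yourself before asking";

string output = string.Join(
    " ",
    input
        .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
        .Where(w => !wordsToRemove.Contains(w)));

Console.WriteLine(output); // Did you this asking
```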
--- EDIT ---
Here is another alternative:
var words_to_remove = new[] { " ", "try", "yourself", "before" }; // Note the space!
string input = "Did you try this yourself before asking";
string output = string.Join(
" ",
input.Split(words_to_remove, StringSplitOptions.RemoveEmptyEntries)
);
I'm guessing it should be slower (unless string.Split uses a hashtable internally), but is nice and tidy ;)

For a simple way to remove a list of strings from your sentence and join the pieces back together, you can do the following:
var input = "Did you try this yourself before asking";
var arrToCheck = new [] { "try ", "yourself", "before " };
var result = input.Split(arrToCheck, StringSplitOptions.None)
                  .Aggregate((first, second) => first + second);
This breaks the original string apart at each of the listed words and builds one final string from the pieces returned by Split. (Passing arrToCheck.Count() as the count argument would cap the number of pieces and leave "before " in place.)
The result will be "Did you this  asking", with two spaces left where "yourself" was removed.

Shorten your code with Array.ForEach and a lambda:
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
var test = new StringBuilder("Did you try this yourself before asking");
Array.ForEach(arrToCheck, x => test.Replace(x, ""));
Console.WriteLine(test.ToString());

// stop is the collection of stop words
String.Join(" ", input.Split(' ')
    .Where(w => stop.Where(sW => sW == w).FirstOrDefault() == null)
    .ToArray());

C# search into a string for a specific pattern, and put in an Array

I'm having the following string as an example:
<tr class="row_odd"><td>08:00</td><td>08:10</td><td>TEST1</td></tr><tr class="row_even"><td>08:10</td><td>08:15</td><td>TEST2</td></tr><tr class="row_odd"><td>08:15</td><td>08:20</td><td>TEST3</td></tr><tr class="row_even"><td>08:20</td><td>08:25</td><td>TEST4</td></tr><tr class="row_odd"><td>08:25</td><td>08:30</td><td>TEST5</td></tr>
I need the output as a one-dimensional array.
Like 11111=myArray(0) , 22222=myArray(1) , 33333=myArray(2) ,......
I have already tried myString.Replace, but it seems I can only replace a single char that way. So I need to use regular expressions and a for loop to fill the array, but since this is my first C# project, that is a bridge too far for me.
Thanks,

It seems like you want to use a Regex search pattern and then return the matches (using a named group) as an array.
var regex = new Regex(@"act=\?(?<Id>\d+)");
var ids = regex.Matches(input).Cast<Match>()
    .Select(m => m.Groups["Id"])
    .Where(g => g.Success)
    .Select(g => Int32.Parse(g.Value))
    .ToArray();
(PS. I'm not positive about the regex pattern; you should check it yourself.)

Several ways you could do this. A couple are:
a) Use a regular expression to look for what you want in the string. Use a named group so you can access the matches directly.
http://www.regular-expressions.info/dotnet.html
b) Split the string at the location of your substrings (e.g. split on "act="). You'll have to do a bit more parsing to get what you want, but that won't be too difficult, since the value will be at the beginning of each split piece (and you will have other pieces that don't contain your substring).

Use a combination of IndexOf and Substring... something like this would work (not sure how much your string varies). This will probably be quicker than any Regex you come up with. Although, looking at the length of your string, it might not really be an issue.
public static List<string> GetList(string data)
{
    data = data.Replace("\"", ""); // get rid of annoying "'s
    string[] S = data.Split(new string[] { "act=" }, StringSplitOptions.None);
    var results = new List<string>();
    foreach (string s in S)
    {
        if (!s.Contains("<tr"))
        {
            string output = s.Substring(0, s.IndexOf(">"));
            results.Add(output);
        }
    }
    return results;
}

Split your string on the HTML tags, such as "<tr>", "</tr>", "<td>", "</td>", "<a>", "</a>", using the Split() method of your string variable. This gives you an array of the pieces.
See also: Split html row into string array

Does C# have a String Tokenizer like Java's?

I'm doing simple string input parsing and I am in need of a string tokenizer. I am new to C# but have programmed Java, and it seems natural that C# should have a string tokenizer. Does it? Where is it? How do I use it?

You could use String.Split method.
class ExampleClass
{
    public ExampleClass()
    {
        string exampleString = "there is a cat";
        // Split string on spaces. This will separate all the words in a string
        string[] words = exampleString.Split(' ');
        foreach (string word in words)
        {
            Console.WriteLine(word);
            // there
            // is
            // a
            // cat
        }
    }
}
For more information see Sam Allen's article about splitting strings in c# (Performance, Regex)

I just want to highlight the power of C#'s Split method and give a more detailed comparison, particularly for someone who comes from a Java background.
Like StringTokenizer in Java, which accepts a set of delimiter characters, Split can split on multiple delimiters, which makes regular expressions less necessary (although if you need regex, use regex by all means!). Take this for example:
str.Split(new char[] { ' ', '.', '?' })
This splits on three different delimiters, returning an array of tokens. We can also remove empty entries by passing a second parameter to the example above:
str.Split(new char[] { ' ', '.', '?' }, StringSplitOptions.RemoveEmptyEntries)
One thing Java's StringTokenizer does have that I believe C#'s String.Split lacks (at least Java 7 has this feature) is the ability to keep the delimiters as tokens. C#'s Split discards the delimiters. This could be important in, say, some NLP applications, but for more general-purpose applications it might not be a problem.
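If the delimiters are needed as tokens, one option outside String.Split is Regex.Split: when the pattern contains a capturing group, the captured text is included in the returned array. A minimal sketch, with a made-up sample sentence:

```csharp
// Top-level statements (C# 9+); on older compilers place this inside Main.
using System;
using System.Linq;
using System.Text.RegularExpressions;

string str = "Is there a cat? Yes.";

// Parentheses around the delimiter class make Regex.Split return the
// delimiters as elements of the result instead of discarding them.
string[] tokens = Regex.Split(str, @"([ .?])")
    .Where(t => t.Length > 0)   // drop empty strings between adjacent delimiters
    .ToArray();

// Words and delimiters alternate, so concatenating them rebuilds the input.
Console.WriteLine(string.Join("|", tokens));
```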

The Split method of string is what you need. In fact, Java's StringTokenizer is a legacy class whose use is discouraged in favor of String's split method.

I think the nearest in the .NET Framework is
string.Split()

For complex splitting you could use a regex creating a match collection.

_words = new List<string>(YourText.ToLower().Trim('\n', '\r').Split(' ').
Select(x => new string(x.Where(Char.IsLetter).ToArray())));
Or
_words = new List<string>(YourText.Trim('\n', '\r').Split(' ').
Select(x => new string(x.Where(Char.IsLetterOrDigit).ToArray())));

The closest equivalent to Java's method is:
Regex.Split(string, pattern);
where
string - the text you need to split
pattern - a string containing the regular expression to split the text on

use Regex.Split(string,"#|#");

Read this; the Split function has an overload that takes an array of separators.
http://msdn.microsoft.com/en-us/library/system.stringsplitoptions.aspx

If you're trying to do something like splitting command line arguments in a .NET Console app, you're going to have issues because .NET is either broken or is trying to be clever (which means it's as good as broken). I needed to be able to split arguments by the space character, preserving any literals that were quoted so they didn't get split in the middle. This is the code I wrote to do the job:
private static List<String> Tokenise(string value, char seperator)
{
    List<string> result = new List<string>();
    // Collapse double spaces into single spaces before tokenising.
    value = value.Replace("  ", " ").Replace("  ", " ").Trim();
    StringBuilder sb = new StringBuilder();
    bool insideQuote = false;
    foreach (char c in value.ToCharArray())
    {
        if (c == '"')
        {
            insideQuote = !insideQuote;
        }
        if ((c == seperator) && !insideQuote)
        {
            if (sb.ToString().Trim().Length > 0)
            {
                result.Add(sb.ToString().Trim());
                sb.Clear();
            }
        }
        else
        {
            sb.Append(c);
        }
    }
    if (sb.ToString().Trim().Length > 0)
    {
        result.Add(sb.ToString().Trim());
    }
    return result;
}

If you are using C# 3.0 (.NET 3.5) or later, you could write an extension method on System.String that does the splitting you need. You can then use the syntax:
myString.SplitByMyTokens();
More info and a useful example from MS here: http://msdn.microsoft.com/en-us/library/bb383977.aspx
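Such an extension method could be sketched as follows. The class name StringTokenExtensions and the delimiter set are hypothetical, chosen only for illustration; SplitByMyTokens is the name used above.

```csharp
using System;

public static class StringTokenExtensions
{
    // Hypothetical delimiter set; adjust it to your input.
    private static readonly char[] Delimiters = { ' ', ',', ';' };

    // The "this" modifier on the first parameter is what makes the method
    // callable as value.SplitByMyTokens().
    public static string[] SplitByMyTokens(this string value)
    {
        return value.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With that class in scope, "there is, a; cat".SplitByMyTokens() returns the four tokens there, is, a and cat.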
