C#: Removing common invalid characters from a string: improve this algorithm

Consider the requirement to strip invalid characters from a string. The characters just need to be removed and replaced with blank or string.Empty.
char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
foreach (char bad in BAD_CHARS)
{
if (someString.Contains(bad))
someString = someString.Replace(bad.ToString(), string.Empty);
}
I'd have really liked to do this:
if (BAD_CHARS.Any(bc => someString.Contains(bc)))
someString.Replace(bc,string.Empty); // bc is out of scope
Question:
Do you have any suggestions on refactoring this algorithm, or any simpler, easier to read, performant, maintainable algorithms?

I don't know about the readability of it, but a regular expression could do what you need it to:
someString = Regex.Replace(someString, @"[!@#$%_]", "");

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
someString = string.Concat(someString.Split(BAD_CHARS,StringSplitOptions.RemoveEmptyEntries));
should do the trick (sorry for any smaller syntax errors I'm on my phone)

The string class is immutable (although a reference type), hence all of its methods that appear to modify it, such as Replace, actually return a new string. Calling someString.Replace without assigning the result to anything will have no effect on your program. (It seems you have already fixed this problem.)
The main issue with your suggested algorithm is that it repeatedly allocates new strings, potentially causing a big performance hit. LINQ doesn't really help things here. (It doesn't make the code significantly shorter and certainly not any more readable, in my opinion.)
Try the following extension method. The key is the use of StringBuilder, which means only one block of memory is assigned for the result during execution.
private static readonly HashSet<char> badChars =
    new HashSet<char> { '!', '@', '#', '$', '%', '_' };
public static string CleanString(this string str)
{
var result = new StringBuilder(str.Length);
for (int i = 0; i < str.Length; i++)
{
if (!badChars.Contains(str[i]))
result.Append(str[i]);
}
return result.ToString();
}
This algorithm also makes use of the .NET 3.5 HashSet class to give O(1) lookup time for detecting a bad char. This makes the overall algorithm O(n) rather than the O(nm) of your posted one (m being the number of bad chars); it is also a lot better with memory usage, as explained above.

This one is faster than HashSet<T>. Also, if you have to perform this action often, please consider the foundations for this question I asked here.
private static readonly bool[] BadCharValues;
static StaticConstructor()
{
BadCharValues = new bool[char.MaxValue+1];
char[] badChars = { '!', '@', '#', '$', '%', '_' };
foreach (char c in badChars)
BadCharValues[c] = true;
}
public static string CleanString(string str)
{
var result = new StringBuilder(str.Length);
for (int i = 0; i < str.Length; i++)
{
if (!BadCharValues[str[i]])
result.Append(str[i]);
}
return result.ToString();
}

Extra tip: If you don't want to maintain the array of characters that are invalid in file names yourself, you can use Path.GetInvalidFileNameChars(). For paths, it's Path.GetInvalidPathChars().
private static string RemoveInvalidChars(string str)
{
return string.Concat(str.Split(Path.GetInvalidFileNameChars(), StringSplitOptions.RemoveEmptyEntries));
}

If you still want to do it in a LINQy way:
public static string CleanUp(this string orig)
{
    var badchars = new HashSet<char>() { '!', '@', '#', '$', '%', '_' };
    return new string(orig.Where(c => !badchars.Contains(c)).ToArray());
}

Something to consider -- if this is for passwords (say), you want to scan for and keep good characters, and assume everything else is bad. It's easier to correctly filter for good things than to try to guess all the bad things.
For Each Character
If Character is Good -> Keep it (copy to out buffer, whatever.)
jeff
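
The keep-only-good-characters idea above could be sketched in C# roughly as follows. The `AllowList` class, the `KeepAllowed` method and the particular rule for what counts as "good" (ASCII letters, digits and the hyphen) are assumptions made for illustration; they are not part of the original answer.

```csharp
using System.Text;

public static class AllowList
{
    // Hypothetical rule: only ASCII letters, digits and '-' are "good".
    private static bool IsGood(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') || c == '-';

    public static string KeepAllowed(string input)
    {
        var output = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Copy good characters to the output buffer; skip everything else.
            if (IsGood(c))
                output.Append(c);
        }
        return output.ToString();
    }
}
```

Calling `AllowList.KeepAllowed("pa$$-w0rd!")` would yield `pa-w0rd` under this rule. Anything not explicitly allowed is dropped, so new "bad" characters never need to be anticipated.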

Why would you have REALLY LIKED to do that? The code is absolutely no simpler, you're just forcing a query extension method into your code.
As an aside, the Contains check seems redundant, both conceptually and from a performance perspective. Contains has to run through the whole string anyway, so you may as well just call Replace(bad.ToString(), string.Empty) for every character and forget about whether or not it's actually present.
Of course, a regular expression is always an option, and may be more performant (if less clear) in a situation like this.
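
A minimal sketch of the loop without the Contains check, assuming an illustrative bad-character list; the `BadCharRemover` class and `RemoveBadChars` method are hypothetical names:

```csharp
public static class BadCharRemover
{
    // Illustrative list; substitute whatever characters count as invalid.
    private static readonly char[] BadChars = { '!', '@', '#', '$', '%', '_' };

    public static string RemoveBadChars(string input)
    {
        string result = input;
        foreach (char bad in BadChars)
        {
            // No Contains check: Replace leaves the text unchanged
            // when the character does not occur.
            result = result.Replace(bad.ToString(), string.Empty);
        }
        return result;
    }
}
```

This still scans the string once per bad character, so it keeps the O(nm) behaviour of the original; it only removes the redundant extra pass.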

This is pretty clean. It restricts the string to valid characters instead of removing invalid ones. You should probably split it into constants:
string clean = new string(@"Sour!ce Str&*(#ing".Where(c =>
    @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.".Contains(c)).ToArray());

Related

Using string.ToUpper on substring

Have an assignment to allow a user to input a word in C# and then display that word with the first and third characters changed to uppercase. Code follows:
namespace Capitalizer
{
class Program
{
static void Main(string[] args)
{
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
string Upper = text.ToUpper();
Console.WriteLine(Upper);
Console.ReadKey();
}
}
}
This of course generates the entire word in uppercase, which is not what I want. I can't seem to make text.ToUpper(0,2) work, and even then that'd capitalize the first three letters. Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.

The simplest way I can think of to address your exact question as described — to convert to upper case the first and third characters of the input — would be something like the following:
StringBuilder sb = new StringBuilder(text);
sb[0] = char.ToUpper(sb[0]);
sb[2] = char.ToUpper(sb[2]);
text = sb.ToString();
The StringBuilder class is essentially a mutable string object, so it is the most fluid way to approach these kinds of operations: it provides the most straightforward conversions to and from string, as well as the full range of string operations. Changing individual characters is easy in many data structures, but insertions, deletions, appending, formatting, etc. all also come with StringBuilder, so it's a good habit to use that versus other approaches.
But frankly, it's hard to see how that's a useful operation. I can't help but wonder if you have stated the requirements incorrectly and there's something more to this question than is seen here.

You could use LINQ:
var upperCaseIndices = new[] { 0, 2 };
var message = "hello";
var newMessage = new string(message.Select((c, i) =>
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c).ToArray());
Here is how it works. message.Select (inline LINQ query) selects characters from message one by one and passes into selector function:
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c
written as C# ?: shorthand syntax for if. It reads as "If index is present in the array, then select upper case character. Otherwise select character as is."
(c, i) => condition
is a lambda expression. See also:
Understand Lambda Expressions in 3 minutes
The rest is very simple - represent result as array of characters (.ToArray()), and create a new string based off that (new string(...)).

Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
That seems a lot more complicated than necessary. Once you have a character array, you can simply change the elements of that character array. In a separate function, it would look something like
string MakeFirstAndThirdCharacterUppercase(string word) {
var chars = word.ToCharArray();
chars[0] = char.ToUpper(chars[0]);
chars[2] = char.ToUpper(chars[2]);
return new string(chars);
}
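
The character-array approach throws an IndexOutOfRangeException for words shorter than three characters. A bounds-checked variant might look like the sketch below; the `Capitalizer` class name and the `UpperAt` helper are hypothetical, and the use of `char.ToUpperInvariant` is a choice made here rather than something the answers specify.

```csharp
using System;

public static class Capitalizer
{
    // Upper-cases the characters at the given indices, ignoring
    // any index that falls outside the word.
    public static string UpperAt(string word, params int[] indices)
    {
        if (word == null) throw new ArgumentNullException(nameof(word));

        char[] chars = word.ToCharArray();
        foreach (int i in indices)
        {
            if (i >= 0 && i < chars.Length)
                chars[i] = char.ToUpperInvariant(chars[i]);
        }
        return new string(chars);
    }
}
```

With this sketch, `Capitalizer.UpperAt("hello", 0, 2)` gives `HeLlo`, while a two-letter input such as `"hi"` gives `Hi` instead of failing.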

My simple solution:
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
foreach (string s in words)
{
char[] chars = s.ToCharArray();
chars[0] = char.ToUpper(chars[0]);
if (chars.Length > 2)
{
chars[2] = char.ToUpper(chars[2]);
}
Console.Write(new string(chars));
Console.Write(' ');
}
Console.ReadKey();

String is not splitting correctly

I am trying to split a string into a string[] made of the words the string originally held using the following code.
private string[] ConvertWordsFromFile(String NewFileText)
{
char[] delimiterChars = { ' ', ',', '.', ':', '/', '|', '<' , '>','/','@','#','$','%','^','&','*','"','(',')',';'};
string[] words = NewFileText.Split(delimiterChars);
return words;
}
I am then using this to add the words to a dictionary that keeps track of word keys and their frequency values. Duplicated words are not added as new keys; only the value is incremented. However, the last word is counted as a different word and is therefore made into a new key. How can I fix this?
This is the code I have for adding words to the dictionary :
public void AddWord(String newWord)
{
newWord = newWord.ToLower();
try
{
MyWords.Add(newWord, 1);
}
catch (ArgumentException)
{
MyWords[newWord]++;
}
}
To clarify, the problem I am having is that even if the word at the end of a string is a duplicate, it is still treated like a new word and therefore a new string.

Random guess: a space at the end produces an empty word that you don't expect. If so, use the appropriate option for Split:
var words = NewFileText.Split(delimiterChars,
StringSplitOptions.RemoveEmptyEntries);
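
To illustrate the suggestion, here is a small sketch that combines RemoveEmptyEntries with a counting dictionary. The `WordCounter` class, its shortened delimiter list and the use of TryGetValue in place of the question's try/catch are assumptions made for the example, not the asker's code.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Shortened, illustrative delimiter list.
    private static readonly char[] Delimiters = { ' ', ',', '.', ':', ';' };

    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>();

        // Without RemoveEmptyEntries, adjacent or trailing delimiters
        // would produce empty strings in the result.
        string[] words = text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            string key = word.ToLowerInvariant();
            counts.TryGetValue(key, out int current); // current is 0 when the key is absent
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

For the input `"The cat. The dog."` this sketch counts `the` twice and never creates an entry for the empty string.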

Split is not the best choice for what you want to do, because you end up having this kind of problem and you also have to specify all the delimiters, etc.
A much better option is to use a regular expression instead of your ConvertWordsFromFile method, as follows:
Regex.Split(theTextToBeSplitted, @"\W+")
This line will return an array containing all the 'words'. Once you have that, the next step is to create your dictionary. If you can use LINQ in your code, the easiest and cleanest way to do what you want is this:
var theTextToBeSplitted = "#Hi, this is a 'little' test: <I hope it is useful>";
var myDictionary = Regex.Split(theTextToBeSplitted, @"\W+")
.GroupBy(x => x)
.ToDictionary(x => x.Key, x => x.Count());
That's all you need.
Good luck!
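
One caveat worth sketching: Regex.Split returns an empty string when the input starts or ends with a match, so text that begins or ends with punctuation would add an empty key to the dictionary. The variant below filters those out and lower-cases the keys; the `RegexWordCounter` class name and the lower-casing step are assumptions added for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordCounter
{
    public static Dictionary<string, int> Count(string text)
    {
        return Regex.Split(text, @"\W+")
            .Where(word => word.Length > 0)            // drop empty leading/trailing entries
            .GroupBy(word => word.ToLowerInvariant())  // treat "Is" and "is" as the same word
            .ToDictionary(group => group.Key, group => group.Count());
    }
}
```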

Split a string by word using one of any or all delimiters?

I may have just hit the point where I'm overthinking it, but I'm wondering: is there a way to designate a list of special characters that should all be considered delimiters, and then split a string using that list? Example:
"battlestar.galactica-season 1"
should be returned as
battlestar galactica season 1
I'm thinking regex but I'm kinda flustered at the moment, been staring at it for too long.
EDIT:
Thanks guys for confirming my suspicion that I was overthinking it lol: here is what I ended up with:
//remove the delimiter
string[] tempString = fileTitle.Split(@"\/.-<>".ToCharArray());
fileTitle = "";
foreach (string part in tempString)
{
fileTitle += part + " ";
}
return fileTitle;
I suppose I could just replace delimiters with " " spaces as well... I will select an answer as soon as the timer is up!

The built-in String.Split method can take a collection of characters as delimiters.
string s = "battlestar.galactica-season 1";
string[] words = s.Split('.', '-');

The standard split method does that for you. It takes an array of characters:
public string[] Split(
params char[] separator
)

You can just call an overload of split:
myString.Split(new char[] { '.', '-', ' ' }, StringSplitOptions.RemoveEmptyEntries);
The char array is a list of delimiters to split on.

"battlestar.galactica-season 1".Split(new string[] { ".", "-" }, StringSplitOptions.RemoveEmptyEntries);

This may not be complete but something like this.
string value = "battlestar.galactica-season 1";
char[] delimiters = new char[] { '\r', '\n', '.', '-' };
string[] parts = value.Split(delimiters,
StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < parts.Length; i++)
{
Console.WriteLine(parts[i]);
}

Are you trying to split the string (make multiple strings) or do you just want to replace the special characters with a space as your example might also suggest (make 1 altered string).
For the first option just see the other answers :)
If you want to replace you could use
string title = "battlestar.galactica-season 1".Replace('.', ' ').Replace('-', ' ');
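
The two options can also be combined: split on every delimiter, then join the pieces with single spaces. This is only a sketch; the `TitleCleaner` class, the `Normalize` method and the delimiter list are hypothetical.

```csharp
using System;

public static class TitleCleaner
{
    // Illustrative delimiter list; the space is included so that
    // repeated spaces also collapse to one.
    private static readonly char[] Delimiters = { '.', '-', '_', ' ' };

    public static string Normalize(string title)
    {
        string[] parts = title.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", parts);
    }
}
```

Unlike chained Replace calls, this collapses runs of delimiters, so `"a..b--c"` becomes `a b c` rather than keeping doubled spaces.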

For more information on Split, with easy examples, see the following article, which also covers splitting on words (multiple chars):
C# Split Function explained

Filter a String

I want to make sure a string has only characters in this range
[a-z] && [A-Z] && [0-9] && [-]
so all letters and numbers plus the hyphen.
I tried this...
C# App:
char[] filteredChars = { ',', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '=', '{', '}', '[', ']', ':', ';', '"', '\'', '?', '/', '.', '<', '>', '\\', '|' };
string s = str.TrimStart(filteredChars);
This TrimStart() only seems to work with letters, not other characters like $ % etc.
Did I implement it wrong?
Is there a better way to do it?
I just want to avoid looping through each string's index checking because there will be a lot of strings to do...
Thoughts?
Thanks!

This seems like a perfectly valid reason to use a regular expression.
bool stringIsValid = Regex.IsMatch(inputString, @"^[a-zA-Z0-9\-]*?$");
In response to miguel's comment, you could do this to remove all unwanted characters:
string cleanString = Regex.Replace(inputString, @"[^a-zA-Z0-9\-]", "");
Note that the caret (^) is now placed inside the character class, thus negating it (matching any non-allowed character).

Here's a fun way to do it with LINQ - no ugly loops, no complicated RegEx:
private string GetGoodString(string input)
{
var allowedChars =
Enumerable.Range('0', 10).Concat(
Enumerable.Range('A', 26)).Concat(
Enumerable.Range('a', 26)).Concat(
Enumerable.Range('-', 1));
var goodChars = input.Where(c => allowedChars.Contains(c));
return new string(goodChars.ToArray());
}
Feed it "Hello, world? 123!" and it will return "Helloworld123".

Why not just use Replace instead? TrimStart will only remove the leading characters in your list...
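
A short sketch of the difference, using a hypothetical `TrimDemo` class and a three-character unwanted list chosen only for the example:

```csharp
public static class TrimDemo
{
    private static readonly char[] Unwanted = { '$', '%', '!' };

    // TrimStart stops at the first character that is not in the list,
    // so unwanted characters later in the string are left alone.
    public static string TrimOnly(string input) => input.TrimStart(Unwanted);

    // Replace removes every occurrence, wherever it appears.
    public static string RemoveEverywhere(string input)
    {
        string result = input;
        foreach (char c in Unwanted)
            result = result.Replace(c.ToString(), string.Empty);
        return result;
    }
}
```

For the input `"$%ab$c!"`, `TrimOnly` returns `ab$c!` while `RemoveEverywhere` returns `abc`.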

Try the following
public bool isStringValid(string input) {
if ( null == input ) {
throw new ArgumentNullException("input");
}
return System.Text.RegularExpressions.Regex.IsMatch(input, @"^[A-Za-z0-9\-]*$");
}

I'm sure that with a bit more time you can come up with something better, but this will give you a good idea:
public string NumberOrLetterOnly(string s)
{
string rtn = s;
for (int i = 0; i < s.Length; i++)
{
if (!char.IsLetterOrDigit(rtn[i]) && rtn[i] != '-')
{
rtn = rtn.Replace(rtn[i].ToString(), " ");
}
}
return rtn.Replace(" ", "");
}

I have tested these two solutions in Linqpad 5. The benefit of these is that they can be used not only for integers, but also decimals / floats with a number decimal separator, which is culture dependent. For example, in Norway we use the comma as the decimal separator, whereas in the US, the dot is used. The comma is used there as a thousands separator. Anyways, first the Linq version and then the Regex version. The most terse bit is accessing the Thread's static property for number separator, but you can compress this a bit using static at the top of the code, or better - put such functionality into C# extension methods, preferably having overloads with arbitrary Regex patterns.
string crappyNumber = @"40430dfkZZZdfldslkggh430FDFLDEFllll340-DIALNOWFORCHRISTSAKE.,CAKE-FORFIRSTDIAL920932903209032093294faøj##R#KKL##K";
string.Join("", crappyNumber.Where(c => char.IsDigit(c)|| c.ToString() == Thread.CurrentThread.CurrentCulture.NumberFormat.NumberDecimalSeparator)).Dump();
new String(crappyNumber.Where(c => Regex.IsMatch(c.ToString(), $"[\\d{Regex.Escape(Thread.CurrentThread.CurrentCulture.NumberFormat.NumberDecimalSeparator)}]")).ToArray()).Dump();
Note to the code above, the Dump() method dumps the results to Linqpad. Your code will of course skip this very last part. Also note that we got it down to a one liner, but it is a bit verbose still and can be put into C# extension methods as suggested.
Also, instead of string.Join, newing up a String object is more compact syntax and less error prone.
We got a crappy number as input, but we managed to get our number in the end! And it is Culture aware in C#!

Does C# have a String Tokenizer like Java's?

I'm doing simple string input parsing and I am in need of a string tokenizer. I am new to C# but have programmed Java, and it seems natural that C# should have a string tokenizer. Does it? Where is it? How do I use it?

You could use String.Split method.
class ExampleClass
{
public ExampleClass()
{
string exampleString = "there is a cat";
// Split string on spaces. This will separate all the words in a string
string[] words = exampleString.Split(' ');
foreach (string word in words)
{
Console.WriteLine(word);
// there
// is
// a
// cat
}
}
}
For more information see Sam Allen's article about splitting strings in c# (Performance, Regex)

I just want to highlight the power of C#'s Split method and give a more detailed comparison, particularly from someone who comes from a Java background.
Like Java's StringTokenizer, which treats every character of its delimiter string as a separator, Split can split on multiple delimiters, making regular expressions less necessary (although if one needs regex, use regex by all means!). Take for example this:
str.Split(new char[] { ' ', '.', '?' })
This splits on three different delimiters, returning an array of tokens. We can also remove empty entries by passing a second parameter to the above example:
str.Split(new char[] { ' ', '.', '?' }, StringSplitOptions.RemoveEmptyEntries)
One thing Java's StringTokenizer does have that I believe C#'s Split is lacking (at least Java 7 has this feature) is the ability to keep the delimiter(s) as tokens. C#'s Split will discard the delimiters. This could be important in, say, some NLP applications, but for more general-purpose applications this might not be a problem.
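
If the delimiters do need to be kept, one option in .NET is Regex.Split with a capturing group, since captured text is included in the returned array. The sketch below assumes a hypothetical `DelimiterKeepingSplitter` class and the same three delimiters used above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class DelimiterKeepingSplitter
{
    public static string[] SplitKeepingDelimiters(string input)
    {
        // The parentheses form a capture group, so each matched delimiter
        // is returned as its own element between the tokens.
        return Regex.Split(input, @"([ .?])")
            .Where(part => part.Length > 0)   // drop empty entries at the edges
            .ToArray();
    }
}
```

For `"Is it?"` this yields four elements: `Is`, a single space, `it` and `?`.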

The Split method of a string is what you need. In fact, the StringTokenizer class in Java is a legacy class whose use is discouraged in favor of Java's String.split method.

I think the nearest in the .NET Framework is
string.Split()

For complex splitting you could use a regex creating a match collection.
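
As a rough sketch of the match-collection approach, the pattern below treats a double-quoted run as one token and otherwise takes runs of non-whitespace characters. The `MatchTokenizer` class and the pattern itself are assumptions for illustration; real input may need a more careful grammar.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchTokenizer
{
    // Either a double-quoted run (quotes included) or a run of non-space characters.
    private static readonly Regex TokenPattern = new Regex("\"[^\"]*\"|\\S+");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(input))
            tokens.Add(match.Value);
        return tokens;
    }
}
```

Instead of describing what separates tokens, the pattern describes what a token looks like, which is often easier once quoting is involved.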

_words = new List<string>(YourText.ToLower().Trim('\n', '\r').Split(' ').
Select(x => new string(x.Where(Char.IsLetter).ToArray())));
Or
_words = new List<string>(YourText.Trim('\n', '\r').Split(' ').
Select(x => new string(x.Where(Char.IsLetterOrDigit).ToArray())));

The method most similar to Java's is:
Regex.Split(string, pattern);
where
string - the text you need to split
pattern - a regular expression pattern, as a string, on which the text is split

use Regex.Split(string,"#|#");

Read this; the Split function has an overload that takes an array of separators:
http://msdn.microsoft.com/en-us/library/system.stringsplitoptions.aspx

If you're trying to do something like splitting command line arguments in a .NET Console app, you're going to have issues because .NET is either broken or is trying to be clever (which means it's as good as broken). I needed to be able to split arguments by the space character, preserving any literals that were quoted so they didn't get split in the middle. This is the code I wrote to do the job:
private static List<String> Tokenise(string value, char seperator)
{
List<string> result = new List<string>();
value = value.Replace("  ", " ").Replace("  ", " ").Trim();
StringBuilder sb = new StringBuilder();
bool insideQuote = false;
foreach(char c in value.ToCharArray())
{
if(c == '"')
{
insideQuote = !insideQuote;
}
if((c == seperator) && !insideQuote)
{
if (sb.ToString().Trim().Length > 0)
{
result.Add(sb.ToString().Trim());
sb.Clear();
}
}
else
{
sb.Append(c);
}
}
if (sb.ToString().Trim().Length > 0)
{
result.Add(sb.ToString().Trim());
}
return result;
}

If you are using C# 3.0 (.NET 3.5) or later, you could write an extension method on System.String that does the splitting you need. You can then use the syntax:
string.SplitByMyTokens();
More info and a useful example from MS here http://msdn.microsoft.com/en-us/library/bb383977.aspx
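
A minimal sketch of such an extension method; the `StringTokenExtensions` class and the token list are hypothetical, and only the `SplitByMyTokens` name comes from the answer above.

```csharp
using System;

public static class StringTokenExtensions
{
    // Illustrative token list; replace with the delimiters you need.
    private static readonly char[] MyTokens = { ' ', ',', ';' };

    // "this string" makes the method callable as text.SplitByMyTokens().
    public static string[] SplitByMyTokens(this string text)
    {
        return text.Split(MyTokens, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With the class in scope, `"a, b;c".SplitByMyTokens()` returns the three tokens `a`, `b` and `c`.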
