Filter a String - c#

I want to make sure a string has only characters in this range
[a-z] && [A-Z] && [0-9] && [-]
so all letters and numbers plus the hyphen.
I tried this...
C# App:
char[] filteredChars = { ',', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '=', '{', '}', '[', ']', ':', ';', '"', '\'', '?', '/', '.', '<', '>', '\\', '|' };
string s = str.TrimStart(filteredChars);
This TrimStart() only seems to work with letters, not with other characters like $, %, etc.
Did I implement it wrong?
Is there a better way to do it?
I just want to avoid looping through each string's index checking because there will be a lot of strings to do...
Thoughts?
Thanks!

This seems like a perfectly valid reason to use a regular expression.
bool stringIsValid = Regex.IsMatch(inputString, @"^[a-zA-Z0-9\-]*?$");
In response to miguel's comment, you could do this to remove all unwanted characters:
string cleanString = Regex.Replace(inputString, @"[^a-zA-Z0-9\-]", "");
Note that the caret (^) is now placed inside the character class, thus negating it (matching any non-allowed character).

Here's a fun way to do it with LINQ - no ugly loops, no complicated RegEx:
private string GetGoodString(string input)
{
    var allowedChars =
        Enumerable.Range('0', 10).Concat(
        Enumerable.Range('A', 26)).Concat(
        Enumerable.Range('a', 26)).Concat(
        Enumerable.Range('-', 1));
    var goodChars = input.Where(c => allowedChars.Contains(c));
    return new string(goodChars.ToArray());
}
Feed it "Hello, world? 123!" and it will return "Helloworld123".

Why not just use Replace instead? TrimStart only removes the leading characters that are in your list...
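To make that difference concrete, here is a minimal sketch. The class and method names are made up for illustration, and the allowed set is assumed to be ASCII letters, digits and the hyphen, as in the question.

```csharp
using System;
using System.Linq;

public static class TrimVersusFilter
{
    // TrimStart stops at the first character that is NOT in the list,
    // so anything after that point is left untouched.
    public static string TrimLeading(string input)
    {
        return input.TrimStart('$', '%', '!');
    }

    // Hypothetical helper: inspects every character and keeps only
    // ASCII letters, digits and '-'.
    public static string KeepAllowed(string input)
    {
        return new string(input.Where(c =>
            (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            c == '-').ToArray());
    }
}
```

With the input "$%abc$def!", TrimLeading yields "abc$def!" while KeepAllowed yields "abcdef".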

Try the following
public bool isStringValid(string input) {
    if ( null == input ) {
        throw new ArgumentNullException("input");
    }
    // Verbatim string (@) so that \- reaches the regex engine unchanged
    return System.Text.RegularExpressions.Regex.IsMatch(input, @"^[A-Za-z0-9\-]*$");
}

I'm sure that with a bit more time you can come up with something better, but this will give you a good idea:
public string NumberOrLetterOnly(string s)
{
    string rtn = s;
    for (int i = 0; i < s.Length; i++)
    {
        if (!char.IsLetterOrDigit(rtn[i]) && rtn[i] != '-')
        {
            rtn = rtn.Replace(rtn[i].ToString(), " ");
        }
    }
    return rtn.Replace(" ", "");
}

I have tested these two solutions in LINQPad 5. Their benefit is that they work not only for integers but also for decimals and floats with a decimal separator, which is culture dependent. For example, in Norway we use the comma as the decimal separator, whereas the US uses the dot and reserves the comma for the thousands separator. First the LINQ version, then the Regex version. The least terse bit is reaching the decimal separator through the thread's culture; you can shorten it with a using static directive at the top of the code or, better, put the functionality into C# extension methods, preferably with overloads that accept arbitrary Regex patterns.
string crappyNumber = @"40430dfkZZZdfldslkggh430FDFLDEFllll340-DIALNOWFORCHRISTSAKE.,CAKE-FORFIRSTDIAL920932903209032093294faøj##R#KKL##K";
string.Join("", crappyNumber.Where(c => char.IsDigit(c) || c.ToString() == Thread.CurrentThread.CurrentCulture.NumberFormat.NumberDecimalSeparator)).Dump();
new string(crappyNumber.Where(c => Regex.IsMatch(c.ToString(), $"[\\d{Regex.Escape(Thread.CurrentThread.CurrentCulture.NumberFormat.NumberDecimalSeparator)}]")).ToArray()).Dump();
Note that the Dump() method prints the results in LINQPad; your own code will of course omit that call. Both versions are one-liners, but they are still a bit verbose and can be moved into C# extension methods as suggested.
Also, instead of string.Join, creating a new string from the char array is more compact and less error prone.
We got a crappy number as input, but we managed to get our number in the end, and the code is culture aware!
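The extension-method idea mentioned above could look roughly like the following sketch. The class and method names are assumptions, and the separator is passed in as a NumberFormatInfo rather than read from the current thread, purely so that the behaviour does not depend on the machine's culture.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class NumberFilterExtensions
{
    // Hypothetical extension method: keeps digits and the decimal separator
    // defined by the supplied number format, and drops everything else.
    public static string KeepNumberChars(this string input, NumberFormatInfo format)
    {
        string separator = format.NumberDecimalSeparator;
        return new string(input
            .Where(c => char.IsDigit(c) || c.ToString() == separator)
            .ToArray());
    }
}
```

For example, "abc12,5x.".KeepNumberChars(new NumberFormatInfo { NumberDecimalSeparator = "," }) gives "12,5". Passing Thread.CurrentThread.CurrentCulture.NumberFormat reproduces the behaviour of the one-liners.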

Related

CRM How to Split with Multiple Types of Characters and New Line

I'm trying to split a string in CRM using different characters (like whitespace, comma, period, colon, semicolon, slash, pipe). But I also need to split on a new line as well.
The below function is working to split using different characters:
string[] values = propertylist.Split(new Char[] { ' ', ',', '.', ':','\t', '/', ';', '|', '\\', '\r', '\n'});
I read that for a new line the separator must be '\r\n', but if I change the call from Split(new Char[] ...) to Split(new String[] ...) and switch to double quotation marks, I keep getting the error "Cannot convert from string[] to char[]...".
Any suggestions for this is appreciated very much. Thanks!
-elisabeth
Most likely you changed the code in your question to this:
string[] values =
propertylist.Split(new string[] { " ", ... "\r", "\n"});
The problem is that the method overload that accepts a string[] requires an additional parameter. Without it, the compiler falls back to the char[] overload and you get a compile error.
This is the closest match for your original code:
string[] values =
propertylist.Split(new string[] { " ", ... "\r\n"}, StringSplitOptions.None);
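As a rough, self-contained illustration of the string[] overload, the sketch below splits on a few single-character delimiters and on line breaks. The class name, method name and the particular separator list are assumptions chosen for the example, not the asker's full list.

```csharp
using System;

public static class PropertyListSplitter
{
    // "\r\n" is listed first so that a Windows line ending is treated
    // as one delimiter; "\r" and "\n" cover the other line-ending styles.
    private static readonly string[] Separators =
        { "\r\n", "\r", "\n", " ", ",", ";", "|" };

    public static string[] SplitProperties(string propertyList)
    {
        // RemoveEmptyEntries drops the empty strings that appear
        // when two delimiters sit next to each other.
        return propertyList.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Calling SplitProperties("a,b\r\nc;d|e f") returns the six entries a, b, c, d, e and f.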

Create a static method that will parse any string to detect special characters and include \ before them

In my MVC-App I want to create a method that will be used everywhere to avoid having any special characters like #, ", ' or anything else provoking a major problem.
So I'm trying to build this method using a regex that parses a string to detect if there's any special characters in the string and put a \ in front of them to make them harmless.
public static string ParseStringForSpecialChars(string stringToParse)
{
    const string regexItem = "^[a-zA-Z0-9 ]*$";
    string stringToReturn = Regex.Replace(stringToParse, regexItem, "\\");
    return stringToReturn;
}
There are many problems in my code:
1) I am not familiar with regex and I have trouble working out how to express what I want. Here, I think I was trying to detect whether there were any characters other than those in regexItem.
2) When the code hits the string stringToReturn = line, my app crashes, saying that the value cannot be null.
Can anyone help me out? Thanks!
EDIT
I have been asked to show an example of special characters, here they are:
'/', '.', '*', '+', '?', '|', '(', ')', '[', ']', '{', '}', '\\'
You get the idea, I just want to avoid sending a string to the database containing a ', because that will be interpreted as the end of a string and will provoke an error.
If you're worried about writing to SQL, check out:
SqlParameterCollection.AddWithValue.
As for your code, I think this is it:
public static string ParseStringForSpecialChars(string stringToParse)
{
    const string regexItem = "[^a-zA-Z0-9 ]";
    string stringToReturn = Regex.Replace(stringToParse, regexItem, @"\$&");
    return stringToReturn;
}
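To show what that replacement pattern does, here is a small sketch with a made-up class name. In a .NET replacement string the backslash is a literal character and $& stands for the whole match, so each character outside the allowed set gets a backslash put in front of it. As noted above, this is not a substitute for parameterized queries when the string is headed for a database.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpecialCharEscaper
{
    // Prefixes every character that is not a letter, digit or space
    // with a backslash. "$&" re-inserts the matched character itself.
    public static string Escape(string input)
    {
        return Regex.Replace(input, "[^a-zA-Z0-9 ]", @"\$&");
    }
}
```

For example, SpecialCharEscaper.Escape("O'Brien (test)") returns O\'Brien \(test\).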

c#: how to split string using default whitespaces + set of additional delimiters?

I need to split a string in C# using a set of delimiter characters. This set should include the default whitespaces (i.e. what you effectively get when you String.Split(null, StringSplitOptions.RemoveEmptyEntries)) plus some additional characters that I specify like '.', ',', ';', etc. So if I have a char array of those additional characters, how do I add all the default whitespaces to it, in order to then feed that expanded array to String.Split? Or is there a better way of splitting using my custom delimiter set + whitespaces? Thx
Just use the appropriate overload of string.Split if you're at least on .NET 2.0:
char[] separator = new[] { ' ', '.', ',', ';' };
string[] parts = text.Split(separator, StringSplitOptions.RemoveEmptyEntries);
I guess I was downvoted because the answer was incomplete. The OP asked for a way to split on all whitespace characters (there are 25 on my PC) as well as on other delimiters:
public static class StringExtensions
{
    static StringExtensions()
    {
        var whiteSpaceList = new List<char>();
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = Convert.ToChar(i);
            if (char.IsWhiteSpace(c))
            {
                whiteSpaceList.Add(c);
            }
        }
        WhiteSpaces = whiteSpaceList.ToArray();
    }
    public static readonly char[] WhiteSpaces;
    public static string[] SplitWhiteSpacesAndMore(this string str, IEnumerable<char> otherDelimiters, StringSplitOptions options = StringSplitOptions.None)
    {
        var separatorList = new List<char>(WhiteSpaces);
        separatorList.AddRange(otherDelimiters);
        return str.Split(separatorList.ToArray(), options);
    }
}
Now you can use this extension method in this way:
string str = "word1 word2\tword3.word4,word5;word6";
char[] separator = { '.', ',', ';' };
string[] split = str.SplitWhiteSpacesAndMore(separator, StringSplitOptions.RemoveEmptyEntries);
The answers above do not use all whitespace characters as delimiters, as you state in your request, only the ones specified by the program. In the solution examples above, this is only SPACE, but not TAB, CR, LF, and all the other Unicode-defined whitespace chars.
I have not found a way to retrieve the default whitespace chars from String. However, they are defined in Regex, and you can use that instead of String. In your case, adding period and comma to the Regex whitespace set:
Regex regex = new Regex(@"[\s\.,]+"); // The "+" will remove blank entries
string input = @"1.2 3, 4";
string[] tokens = regex.Split(input);
will produce
tokens[0] "1"
tokens[1] "2"
tokens[2] "3"
tokens[3] "4"
str.Split(" .,;".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
I use something like the following to ensure I'm always splitting on Split's default whitespace characters:
public static string[] SplitOnWhitespaceAnd(this string value,
char[] separator, StringSplitOptions options = StringSplitOptions.RemoveEmptyEntries)
=> value.Split().SelectMany(s => s.Split(separator, options)).ToArray();
Note that to be consistent with Microsoft's naming conventions, you'd want to use WhiteSpace rather than Whitespace.
Refer to Microsoft's Char.IsWhiteSpace documentation to see the whitespace characters split on by default.
string[] splitSentence(string sentence)
{
    return sentence
        .Replace(",", " , ")
        .Replace(".", " . ")
        .Split(' ', StringSplitOptions.RemoveEmptyEntries);
}
or
string[] result = test.Split(new string[] {"\n", "\r\n"},
StringSplitOptions.RemoveEmptyEntries);

Getting punctuation from the end of a string only

I'm looking for a C# snippet to remove and store any punctuation from the end of a string only.
Example:
Test! would return !
Test;; would return ;;
Test?:? would return ?:?
!!Test!?! would return !?!
I have a rather clunky solution at the moment but wondered if anybody could suggest a more succinct way to do this.
My punctuation list is
new char[] { '.', ':', '-', '!', '?', ',', ';' }
You could use the following regular expression:
\p{P}*$
This breaks down to:
\p{P} - Unicode punctuation
* - Any number of times
$ - End of line anchor
If you know that there will always be some punctuation at the end of the string, use + for efficiency.
And use it like this in order to get the punctuation:
string punctuation = Regex.Match(myString, @"\p{P}*$").Value;
To actually remove it:
string noPunctuation = Regex.Replace(myString, @"\p{P}*$", string.Empty);
Use a regex:
resultString = Regex.Replace(subjectString, @"[.:!?,;-]+$", "");
Explanation:
[.:!?,;-] # Match a character that's one of the enclosed characters
+ # Do this once or more (as many times as possible)
$ # Assert position at the end of the string
As Oded suggested, use \p{P} instead of [.:!?,;-] if you want to remove all punctuation characters, not just the ones from your list.
To also "store" the punctuation, you could split the string:
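splitArray = Regex.Split(subjectString, @"(?<!\p{P})(?=\p{P}+$)");
Then splitArray[0] contains the part before the punctuation, and splitArray[1] the punctuation characters, if there are any. The lookbehind (?<!\p{P}) makes the split happen only once, at the start of the trailing run; without it, every position inside the run matches and each trailing punctuation character ends up in its own array element.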
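Another way to both remove and store the trailing punctuation is a single Regex.Match with two named groups. This is only a sketch: the class, method and group names are invented for the example, and it uses \p{P} rather than the asker's shorter list.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingPunctuation
{
    // "body" is matched lazily and "punct" greedily, so "punct" ends up
    // holding the whole run of punctuation at the end of the string.
    private static readonly Regex Pattern =
        new Regex(@"^(?<body>.*?)(?<punct>\p{P}*)$", RegexOptions.Singleline);

    public static (string Body, string Punctuation) Split(string input)
    {
        Match match = Pattern.Match(input);
        return (match.Groups["body"].Value, match.Groups["punct"].Value);
    }
}
```

For "!!Test!?!" this returns the body "!!Test" and the punctuation "!?!"; for "Test" the punctuation part is an empty string.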
Using Linq:
var punctuationMap = new HashSet<char>(new char[] { '.', ':', '-', '!', '?', ',', ';' });
var endPunctuationChars = aString.Reverse().
TakeWhile(ch => punctuationMap.Contains(ch));
var result = new string(endPunctuationChars.Reverse().ToArray());
The HashSet is not mandatory, you can use Linq's Contains on the array directly.

C#: Removing common invalid characters from a string: improve this algorithm

Consider the requirement to strip invalid characters from a string. The characters just need to be removed, i.e. replaced with string.Empty.
char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
foreach (char bad in BAD_CHARS)
{
    if (someString.Contains(bad))
        someString = someString.Replace(bad.ToString(), string.Empty);
}
I'd have really liked to do this:
if (BAD_CHARS.Any(bc => someString.Contains(bc)))
someString.Replace(bc,string.Empty); // bc is out of scope
Question:
Do you have any suggestions on refactoring this algorithm, or any simpler, easier to read, performant, maintainable algorithms?
I don't know about the readability of it, but a regular expression could do what you need it to:
someString = Regex.Replace(someString, @"[!@#$%_]", "");
char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
someString = string.Concat(someString.Split(BAD_CHARS,StringSplitOptions.RemoveEmptyEntries));
should do the trick (sorry for any smaller syntax errors I'm on my phone)
The string class is immutable (although it is a reference type), so methods such as Replace return a new string rather than modifying the original. Calling someString.Replace without assigning the result has no effect on your program. It seems you have already fixed this problem.
The main issue with your suggested algorithm is that it repeatedly allocates new strings, potentially causing a big performance hit. LINQ doesn't really help here. (It doesn't make the code significantly shorter and certainly not any more readable, in my opinion.)
Try the following extension method. The key is the use of StringBuilder, which means only one block of memory is assigned for the result during execution.
private static readonly HashSet<char> badChars =
    new HashSet<char> { '!', '@', '#', '$', '%', '_' };
public static string CleanString(this string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!badChars.Contains(str[i]))
            result.Append(str[i]);
    }
    return result.ToString();
}
This algorithm also makes use of the .NET 3.5 HashSet class to give O(1) lookup time for detecting a bad char. This makes the overall algorithm O(n) rather than the O(nm) of your posted one (m being the number of bad chars); it is also a lot better with memory usage, as explained above.
This one is faster than HashSet<T>. Also, if you have to perform this action often, please consider the foundations for this question I asked here.
private static readonly bool[] BadCharValues;
static StaticConstructor()
{
    BadCharValues = new bool[char.MaxValue+1];
    char[] badChars = { '!', '@', '#', '$', '%', '_' };
    foreach (char c in badChars)
        BadCharValues[c] = true;
}
public static string CleanString(string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!BadCharValues[str[i]])
            result.Append(str[i]);
    }
    return result.ToString();
}
Extra tip: If you don't want to remember the array of chars that are invalid for file names, you could use Path.GetInvalidFileNameChars(). For paths, it's Path.GetInvalidPathChars().
private static string RemoveInvalidChars(string str)
{
    return string.Concat(str.Split(Path.GetInvalidFileNameChars(), StringSplitOptions.RemoveEmptyEntries));
}
If you still want to do it in a LINQy way:
public static string CleanUp(this string orig)
{
    var badchars = new HashSet<char>() { '!', '@', '#', '$', '%', '_' };
    return new string(orig.Where(c => !badchars.Contains(c)).ToArray());
}
Something to consider: if this is for passwords (say), you want to scan for and keep good characters, and assume everything else is bad. It's easier to correctly filter for good things than to try to guess all the bad things.
For Each Character
If Character is Good -> Keep it (copy to out buffer, whatever.)
jeff
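The pseudocode above might be sketched in C# as follows. The class name, the method names and the particular definition of a "good" character are assumptions made for the example; the point is only that the caller defines what is allowed and everything else is dropped.

```csharp
using System;
using System.Text;

public static class GoodCharFilter
{
    // Copies a character to the output buffer only if isGood accepts it.
    public static string KeepGood(string input, Func<char, bool> isGood)
    {
        var output = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (isGood(c))
                output.Append(c);
        }
        return output.ToString();
    }

    // Example predicate (an assumption): ASCII letters and digits only.
    public static bool IsAsciiLetterOrDigit(char c)
    {
        return (c >= 'a' && c <= 'z') ||
               (c >= 'A' && c <= 'Z') ||
               (c >= '0' && c <= '9');
    }
}
```

GoodCharFilter.KeepGood("p@ss w0rd!", GoodCharFilter.IsAsciiLetterOrDigit) returns "pssw0rd".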
Why would you have REALLY LIKED to do that? The code is absolutely no simpler, you're just forcing a query extension method into your code.
As an aside, the Contains check seems redundant, both conceptually and from a performance perspective. Contains has to run through the whole string anyway, you may as well just call Replace(bad.ToString(), string.Empty) for every character and forget about whether or not it's actually present.
Of course, a regular expression is always an option, and may be more performant (if less clear) in a situation like this.
This is pretty clean. It restricts the string to valid characters instead of removing invalid ones. You should probably split it into constants:
string clean = new string(@"Sour!ce Str&*(#ing".Where(c =>
    @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.".Contains(c)).ToArray());
