How do I strip non-alphanumeric characters (including spaces) from a string?

How do I strip non-alphanumeric characters from a string, and lose the spaces as well, in C# with Replace?
I want to keep a-z, A-Z, 0-9 and nothing more (not even " " spaces).
"Hello there(hello#)".Replace(regex-i-want, "");
should give
"Hellotherehello"
I have tried "Hello there(hello#)".Replace(@"[^A-Za-z0-9 ]", ""); but the spaces remain.

In your regex, you have excluded the spaces from being matched (and you haven't used Regex.Replace() which I had overlooked completely...):
result = Regex.Replace("Hello there(hello#)", @"[^A-Za-z0-9]+", "");
should work. The + makes the regex a bit more efficient by matching more than one consecutive non-alphanumeric character at once instead of one by one.
If you want to keep non-ASCII letters/digits, too, use the following regex:
@"[^\p{L}\p{N}]+"
which leaves
BonjourmesélèvesGutenMorgenliebeSchüler
instead of
BonjourmeslvesGutenMorgenliebeSchler
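To make the difference between the two character classes concrete, here is a minimal sketch that wraps each pattern in a helper. The class and method names (AlphanumericFilter, KeepAscii, KeepUnicode) and the sample input are made up for illustration; only Regex.Replace and the two patterns come from the answer above.

```csharp
using System.Text.RegularExpressions;

public static class AlphanumericFilter
{
    // Keeps only ASCII letters and digits; everything else is removed.
    public static string KeepAscii(string input) =>
        Regex.Replace(input, @"[^A-Za-z0-9]+", "");

    // Keeps every Unicode letter (\p{L}) and number (\p{N}),
    // so precomposed accented letters such as é survive.
    public static string KeepUnicode(string input) =>
        Regex.Replace(input, @"[^\p{L}\p{N}]+", "");
}
```

With the input "Bonjour mes élèves!" (accented letters stored as single precomposed characters), KeepAscii returns "Bonjourmeslves" and KeepUnicode returns "Bonjourmesélèves".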

You can use LINQ to keep only the required characters:
String source = "Hello there(hello#)";
// "Hellotherehello"
String result = new String(source
.Where(ch => Char.IsLetterOrDigit(ch))
.ToArray());
Or
String result = String.Concat(source
.Where(ch => Char.IsLetterOrDigit(ch)));
So there is no need for regular expressions.

Or you can do this too:
public static string RemoveNonAlphanumeric(string text)
{
StringBuilder sb = new StringBuilder(text.Length);
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9')
sb.Append(text[i]);
}
return sb.ToString();
}
Usage:
string text = SomeClass.RemoveNonAlphanumeric("text LaLa (lol) á ñ $ 123 ٠١٢٣٤");
//text: textLaLalol123

The mistake made above was using Replace incorrectly (it doesn't take regex, thanks CodeInChaos).
The following code should do what was specified:
Regex reg = new Regex(@"[^\p{L}\p{N}]+");//Thanks to Tim Pietzcker for regex
string regexed = reg.Replace("Hello there(hello#)", "");
This gives:
regexed = "Hellotherehello"

And as a replace operation as an extension method:
public static class StringExtensions
{
public static string ReplaceNonAlphanumeric(this string text, char replaceChar)
{
StringBuilder result = new StringBuilder(text.Length);
foreach(char c in text)
{
if(c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9')
result.Append(c);
else
result.Append(replaceChar);
}
return result.ToString();
}
}
And test:
[TestFixture]
public sealed class StringExtensionsTests
{
[Test]
public void Test()
{
Assert.AreEqual("text_LaLa__lol________123______", "text LaLa (lol) á ñ $ 123 ٠١٢٣٤".ReplaceNonAlphanumeric('_'));
}
}

var text = "Hello there(hello#)";
var rgx = new Regex("[^a-zA-Z0-9]");
text = rgx.Replace(text, string.Empty);

Use the following regex with Regex.Replace to strip those characters. Note that the \s inside the class keeps whitespace; drop it if the spaces must go as well:
([^A-Za-z0-9\s])

In .NET 4.0 you can use the IsNullOrWhiteSpace method of the String class to test whether a string is null, empty, or made up only of so-called white-space characters; it does not remove them. Please take a look here http://msdn.microsoft.com/en-us/library/system.string.isnullorwhitespace.aspx
However, as @CodeInChaos pointed out, there are plenty of characters which could be considered letters and numbers. You can use a regular expression if you only want to find A-Za-z0-9.

Related

Removing non-ASCII characters from string

I am trying to strip non-ASCII character from strings I am reading from a text file and can't get it to do so. I checked some of the suggestions from posts in SO and other sites, all to no avail.
This is what I have and what I have tried:
String in text file:
2021-03-26 10:00:16:648|2021-03-26 10:00:14:682|MPE->IDC|[10.20.30.40:41148]|203, ? ?'F?~?^?W?|?8wL?i??{?=kb ? Y R?
String read from the file:
"2021-03-26 10:00:16:648|2021-03-26 10:00:14:682|[10.20.30.40:41148]|203,\u0016\u0003\u0001\0?\u0001\0\0?\u0003\u0001'F?\u001e~\u0018?^?W\u0013?|?8wL\v?i??{?=kb\t?\tY\u0005\0\0R?"
Methods to get rid of non-ASCII characters:
Regex reAsciiPattern = new Regex(@"[^\u0000-\u007F]+"); // Non-ASCII characters
sLine = reAsciiPattern.Replace(sLine, ""); // remove non-ASCII chars
Regex reAsciiPattern2 = new Regex(@"[^\x00-\x7F]+"); // Non-ASCII characters
sLine = reAsciiPattern2.Replace(sLine, ""); // remove non-ASCII chars
string asAscii = Encoding.ASCII.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.GetEncoding(
Encoding.ASCII.EncodingName,
new EncoderReplacementFallback(string.Empty),
new DecoderExceptionFallback()
),
Encoding.UTF8.GetBytes(sLine)
)
);
What am I missing?
Thanks.

This can be done without a Regex using a loop and a StringBuilder:
var sb = new StringBuilder();
foreach(var ch in line) {
//printable Ascii range
if (ch >= 32 && ch < 127) {
sb.Append(ch);
}
}
line = sb.ToString();
Or you can use some LINQ:
line = string.Concat(
line.Where(ch => ch >= 32 && ch < 127)
);
If you must do this with Regex then the following should suffice (again this keeps printable ASCII only)
line = Regex.Replace(line, @"[^\u0020-\u007e]", "");
If you want all ASCII (including non-printable) characters, then modify the tests to
ch <= 127 // for the loops
@"[^\u0000-\u007f]" // for the regex

You can use the following regular expression to get rid of all non-printable characters.
Regex.Replace(sLine, @"[^\u0020-\u007E]+", string.Empty);

This is what worked for me based on a post here
using System.Text.RegularExpressions;
...
Regex reAsciiNonPrintable = new Regex(@"\p{C}+"); // Non-printable characters
string sLine;
using (StreamReader sr = File.OpenText(Path.Combine(Folder, FileName)))
{
while (!sr.EndOfStream)
{
sLine = sr.ReadLine().Trim();
if (!string.IsNullOrEmpty(sLine))
{
Match match = reAsciiNonPrintable.Match(sLine);
if (match.Success)
continue; // skip the line
...
}
...
}
....
}

Since a string is an IEnumerable<char> where each char represents one UTF-16 code unit (possibly a surrogate), you can also do:
var ascii = new string(sLine.Where(x => x <= sbyte.MaxValue).ToArray());
Or if you want only printable ASCII:
var asciiPrintable = new string(sLine.Where(x => ' ' <= x && x <= '~').ToArray());
I realize now that this is mostly a duplicate of pinkfloydx33's answer, so go and upvote that.
If the string contains accented letters, the result can depend on the normalization, so compare:
var sLine1 = "olé";
var sLine2 = sLine1.Normalize(NormalizationForm.FormD);
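The comparison hinted at above can be sketched as follows. The class and method names (NormalizationDemo, KeepAscii) are made up for illustration; the filter itself is the same x <= 127 test used earlier in this answer.

```csharp
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Drops every UTF-16 code unit above the ASCII range.
    public static string KeepAscii(string s) =>
        new string(s.Where(c => c <= sbyte.MaxValue).ToArray());

    // Decomposes first (é becomes e + U+0301), then drops non-ASCII,
    // so the base letter survives and only the combining accent is lost.
    public static string KeepAsciiAfterDecomposing(string s) =>
        KeepAscii(s.Normalize(NormalizationForm.FormD));
}
```

For "olé" with a precomposed é (U+00E9), KeepAscii gives "ol", while KeepAsciiAfterDecomposing gives "ole". Which of the two is wanted depends on whether losing the whole letter or only its accent is acceptable.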

Clean string to have only numbers c#

I want to keep only the digits from a string. I have tried this:
string phoneNumber = txtPhoneNumber.Text;
string cleanPhoneNumber = string.Empty;
foreach (char c in phoneNumber)
{
if (c.Equals('0') || c.Equals('1') || c.Equals('2') ||
c.Equals('3') || c.Equals('4') || c.Equals('5') ||
c.Equals('6') || c.Equals('7') || c.Equals('8') ||
c.Equals('9'))
cleanPhoneNumber += Convert.ToString(c);
}
The solution above worked, but I want to know if there is a more efficient way.

string b = string.Empty;
for (int i=0; i< a.Length; i++)
{
if (Char.IsDigit(a[i]))
b += a[i];
}
Or use Regex (note that Match returns only the first run of consecutive digits):
resultString = Regex.Match(subjectString, @"\d+").Value;

Since you probably want only the digits 0..9, not all Unicode digits (which include Persian, Indian digits etc.), char.IsDigit and the \d regular expression are not exact solutions.
LINQ:
string cleanPhoneNumber = string.Concat(phoneNumber.Where(c => c >= '0' && c <= '9'));
Regex:
either Sami's, integer's codes or
resultString = Regex.Match(subjectString, @"\d+", RegexOptions.ECMAScript ).Value;
which is Krystian Borysewicz's solution with ECMAScript option to be on the safe side.
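To illustrate why char.IsDigit is broader than 0..9, here is a small sketch comparing the two filters. The class and method names (DigitFilter and its members) and the sample input are hypothetical; the calls themselves (char.IsDigit, string.Concat, Regex.Replace) are standard .NET.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class DigitFilter
{
    // char.IsDigit accepts any Unicode decimal digit, e.g. Arabic-Indic ٣ (U+0663).
    public static string KeepUnicodeDigits(string s) =>
        string.Concat(s.Where(c => char.IsDigit(c)));

    // Explicit range test: keeps only the ASCII digits 0..9.
    public static string KeepAsciiDigits(string s) =>
        string.Concat(s.Where(c => c >= '0' && c <= '9'));

    // Regex equivalent of the range test; removes every non-0..9 character.
    public static string KeepAsciiDigitsRegex(string s) =>
        Regex.Replace(s, "[^0-9]", "");
}
```

For the input "(555) 12٣-4", KeepUnicodeDigits returns "55512٣4", while both ASCII variants return "555124".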

string phoneNumber = txtPhoneNumber.Text;
// Get numbers only
Regex numbersRegex = new Regex("[^0-9]");
var cleanPhoneNumber = numbersRegex.Replace(phoneNumber, "");

If you're looking to be efficient in terms of time then you should avoid using regex, as the Regex class will need to parse your expression before it applies it to the phone number.
The code below avoids regex and keeps memory allocations to a minimum. It only allocates twice: once for a buffer to store the digits, and once again at the end to create the string containing the valid digits.
string Clean(string text)
{
var validCharacters = new char[text.Length];
var next = 0;
for(int i = 0; i < text.Length; i++)
{
char c = text[i];
if(char.IsDigit(c))
{
validCharacters[next++] = c;
}
}
return new string(validCharacters, 0, next);
}

Using LINQ:
string cleanPhoneNumber = new String(phoneNumber.Where(Char.IsDigit).ToArray());

Regular expression help - ignoring parenthesis, ands, ors and whitespace again

Consider the following English phrase:
FRIEND AND COLLEAGUE AND (FRIEND OR COLLEAGUE AND (COLLEAGUE AND FRIEND AND FRIEND))
I want to be able to programmatically change arbitrary phrases, such as above, to something like:
SELECT * FROM RelationTable R1 JOIN RelationTable R2 ON R2.RelationName etc etc WHERE
R2.RelationName = FRIEND AND R2.RelationName = Colleague AND (R3.RelationName = FRIENd,
etc. etc.
My question is: how do I take the initial string, strip it of the following words and symbols: AND, OR, (, ),
then change each remaining word, and create a new string?
I can do most of it, but my main problem is that if I do a string.split and only get the words I care for, I can't really replace them in the original string because I lack their original index. Let me explain in a smaller example:
string input = "A AND (B AND C)"
Splitting the string on spaces, parentheses, etc. gives: A,B,C
input.Replace("A", "MyRandomPhrase")
But there is an A in AND.
So I moved into trying to create a regular expression that matches exact words, post split, and replaces. It started to look like this:
"(\(|\s|\))*" + itemOfInterest + "(\(|\s|\))+"
Am I on the right track or am I overcomplicating things? Thanks!

You can try using Regex.Replace, with \b word boundary regex
string input = "A AND B AND (A OR B AND (B AND A AND A))";
string pattern = "\\bA\\b";
string replacement = "MyRandomPhrase";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(input, replacement);
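Building on the \b idea, every term can be rewritten in one pass by matching whole words and letting a match evaluator decide what to do with each one. This is only a sketch: the class name TermRewriter, the Keywords set, and the rewriteTerm callback are hypothetical, and the R.RelationName output format in the usage note is just a placeholder for whatever SQL fragment is actually needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TermRewriter
{
    // Words that must be left untouched (assumed to be upper-case in the input).
    private static readonly HashSet<string> Keywords =
        new HashSet<string> { "AND", "OR" };

    // Rewrites every whole word that is not a keyword; parentheses,
    // whitespace and the keywords themselves stay where they are.
    public static string Rewrite(string input, Func<string, string> rewriteTerm) =>
        Regex.Replace(input, @"\b\w+\b",
            m => Keywords.Contains(m.Value) ? m.Value : rewriteTerm(m.Value));
}
```

For example, TermRewriter.Rewrite("A AND (B AND C)", t => "R.RelationName = '" + t + "'") yields "R.RelationName = 'A' AND (R.RelationName = 'B' AND R.RelationName = 'C')". Because the regex only matches whole words, the A inside AND is never touched, and no original indexes need to be tracked.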

class Program
{
static void Main(string[] args)
{
string text = "A AND (B AND C)";
List<object> result = ParseBlock(text);
Console.ReadLine();
}
private static List<object> ParseBlock(string text)
{
List<object> result = new List<object>();
int bracketsCount = 0;
int lastIndex = 0;
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (c == '(')
bracketsCount++;
else if (c == ')')
bracketsCount--;
if (bracketsCount == 0)
if (c == ' ' || i == text.Length - 1)
{
string substring = text.Substring(lastIndex, i + 1 - lastIndex).Trim();
object itm = substring;
if (substring[0] == '(')
itm = ParseBlock(substring.Substring(1, substring.Length - 2));
result.Add(itm);
lastIndex = i;
}
}
return result;
}
}

Change in string some part, but without one part - where are numbers

For example, I have a string like this:
ex250-r-ninja-08-10r_
How could I change it to this?
ex250 r ninja 08-10r_
As you can see, I change every - to a space, except where I have an XX-XX part. How could I do such a string replacement in C#? (The string could also be of a different length.)
So far I replace every - like this:
string correctString = errString.Replace("-", " ");
but how do I keep the - where the number pattern XX-XX occurs?

You can use regular expressions to only perform substitutions in certain cases. In this case, you want to perform a substitution if either side of the dash is a non-digit. That's not quite as simple as it might be, but you can use:
string ReplaceSomeHyphens(string input)
{
string result = Regex.Replace(input, @"(\D)-", "${1} ");
result = Regex.Replace(result, @"-(\D)", " ${1}");
return result;
}
It's possible that there's a more cunning way to do this in a single regular expression, but I suspect that it would be more complicated too :)
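One possible single-expression variant uses lookarounds instead of capture groups. This is a sketch of an alternative, not the code from the answer above; the class name HyphenReplacer is made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class HyphenReplacer
{
    // Replaces a hyphen when the character before it is not a digit,
    // or the character after it is not a digit. A hyphen with digits
    // on both sides (as in 08-10) is left alone.
    public static string ReplaceSomeHyphens(string input) =>
        Regex.Replace(input, @"(?<!\d)-|-(?!\d)", " ");
}
```

For "ex250-r-ninja-08-10r_" this returns "ex250 r ninja 08-10r_". One behavioural difference to be aware of: a hyphen at the very start or end of the string counts as having no digit next to it here, so it is replaced, whereas the two-pass version needs an actual non-digit character beside the hyphen.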

A very uncool approach using a StringBuilder. It replaces each - with a space unless the two characters before it and the two characters after it are all digits.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
bool replace = false;
char c = text[i];
if (c == '-')
{
if (i < 2 || i >= text.Length - 2) replace = true;
else
{
bool leftDigit = text.Substring(i - 2, 2).All(Char.IsDigit);
bool rightDigit = text.Substring(i + 1, 2).All(Char.IsDigit);
replace = !leftDigit || !rightDigit;
}
}
if (replace)
sb.Append(' ');
else
sb.Append(c);
}

Since you say you won't have hyphens at the start of your string then you need to capture every occurrence of - that is preceded by a group of characters which contains at least one letter and zero or many numbers. To achieve this, use positive lookbehind in your regex.
string strRegex = @"(?<=[a-z]+[0-9]*)-";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
string strTargetString = @"ex250-r-ninja-08-10r_";
string strReplace = @" ";
return myRegex.Replace(strTargetString, strReplace);

regex/linq to replace consecutive characters with count

I have the following method (written in C#/.NET). The input text consists only of letters (no digits). The returned value is another text in which each group of more than two consecutive identical characters is replaced with that character preceded by the repetition count.
Ex.: aAAbbbcccc -> aAA3b4c
public static string Pack(string text)
{
if (string.IsNullOrEmpty(text)) return text;
StringBuilder sb = new StringBuilder(text.Length);
char prevChar = text[0];
int prevCharCount = 1;
for (int i = 1; i < text.Length; i++)
{
char c = text[i];
if (c == prevChar) prevCharCount++;
else
{
if (prevCharCount > 2) sb.Append(prevCharCount);
else if (prevCharCount == 2) sb.Append(prevChar);
sb.Append(prevChar);
prevChar = c;
prevCharCount = 1;
}
}
if (prevCharCount > 2) sb.Append(prevCharCount);
else if (prevCharCount == 2) sb.Append(prevChar);
sb.Append(prevChar);
return sb.ToString();
}
The method is not too long. But does anyone have an idea how to do that in a more concise way using regex? Or LINQ?

How about:
static readonly Regex re = new Regex(@"(\w)(\1){2,}", RegexOptions.Compiled);
static void Main() {
string result = re.Replace("aAAbbbcccc",
match => match.Length.ToString() + match.Value[0]);
}
The regex is a word char, followed by the same char (back-reference) at least twice; the lambda takes the length of the match (match.Length) and appends the first character (match.Value[0]).
