Removing non-ASCII characters from string

I am trying to strip non-ASCII characters from strings I am reading from a text file and can't get it to work. I checked some of the suggestions from posts on SO and other sites, all to no avail.
This is what I have and what I have tried:
String in text file:
2021-03-26 10:00:16:648|2021-03-26 10:00:14:682|MPE->IDC|[10.20.30.40:41148]|203, ? ?'F?~?^?W?|?8wL?i??{?=kb ? Y R?
String read from the file:
"2021-03-26 10:00:16:648|2021-03-26 10:00:14:682|[10.20.30.40:41148]|203,\u0016\u0003\u0001\0?\u0001\0\0?\u0003\u0001'F?\u001e~\u0018?^?W\u0013?|?8wL\v?i??{?=kb\t?\tY\u0005\0\0R?"
Methods to get rid of non-ASCII characters:
Regex reAsciiPattern = new Regex(@"[^\u0000-\u007F]+"); // Non-ASCII characters
sLine = reAsciiPattern.Replace(sLine, ""); // remove non-ASCII chars
Regex reAsciiPattern2 = new Regex(@"[^\x00-\x7F]+"); // Non-ASCII characters
sLine = reAsciiPattern2.Replace(sLine, ""); // remove non-ASCII chars
string asAscii = Encoding.ASCII.GetString(
Encoding.Convert(
Encoding.UTF8,
Encoding.GetEncoding(
Encoding.ASCII.EncodingName,
new EncoderReplacementFallback(string.Empty),
new DecoderExceptionFallback()
),
Encoding.UTF8.GetBytes(sLine)
)
);
What am I missing?
Thanks.

This can be done without a Regex using a loop and a StringBuilder:
var sb = new StringBuilder();
foreach(var ch in line) {
//printable Ascii range
if (ch >= 32 && ch < 127) {
sb.Append(ch);
}
}
line = sb.ToString();
Or you can use some LINQ:
line = string.Concat(
line.Where(ch => ch >= 32 && ch < 127)
);
If you must do this with Regex then the following should suffice (again this keeps printable ASCII only)
line = Regex.Replace(line, @"[^\u0020-\u007e]", "");
If you want all ASCII (including non-printable) characters, then modify the tests to
ch <= 127 // for the loops
@"[^\u0000-\u007f]" // for the regex
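The distinction matters for the string in the question: characters such as \u0016 and \u0003 are control characters that lie inside the ASCII range, so a filter that strips only non-ASCII characters leaves them in place. A minimal sketch of the two filters side by side (the class and method names are illustrative and not part of the answer above):

```csharp
using System.Linq;

public static class AsciiFilterDemo
{
    // Keeps every ASCII code unit (0..127), control characters included.
    public static string KeepAscii(string s) =>
        string.Concat(s.Where(ch => ch <= 127));

    // Keeps only printable ASCII (space through '~').
    public static string KeepPrintableAscii(string s) =>
        string.Concat(s.Where(ch => ch >= 32 && ch < 127));
}
```

For an input such as "203,\u0016\u0003é", KeepAscii drops only the é and still returns the two control characters, while KeepPrintableAscii returns just "203,".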

You can use the following regular expression to remove every character outside the printable ASCII range (space through ~).
sLine = Regex.Replace(sLine, @"[^\u0020-\u007E]+", string.Empty);

This is what worked for me based on a post here
using System.Text.RegularExpressions;
...
Regex reAsciiNonPrintable = new Regex(@"\p{C}+"); // Non-printable characters
string sLine;
using (StreamReader sr = File.OpenText(Path.Combine(Folder, FileName)))
{
while (!sr.EndOfStream)
{
sLine = sr.ReadLine().Trim();
if (!string.IsNullOrEmpty(sLine))
{
Match match = reAsciiNonPrintable.Match(sLine);
if (match.Success)
continue; // skip the line
...
}
...
}
....
}

Since a string is an IEnumerable<char> where each char represents one UTF-16 code unit (possibly a surrogate), you can also do:
var ascii = new string(sLine.Where(x => x <= sbyte.MaxValue).ToArray());
Or if you want only printable ASCII:
var asciiPrintable = new string(sLine.Where(x => ' ' <= x && x <= '~').ToArray());
I realize now that this is mostly a duplicate of pinkfloydx33's answer, so go and upvote that.
If the string contains accented letters, the result can depend on the normalization, so compare:
var sLine1 = "olé";
var sLine2 = sLine1.Normalize(NormalizationForm.FormD);
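To make the comparison concrete, here is a small sketch that applies the printable-ASCII filter to both forms of the same word. The class and method names are made up for illustration, and the é is written as the escape \u00e9 so that the source file's encoding cannot affect the outcome:

```csharp
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Same printable-ASCII filter as above.
    public static string KeepPrintableAscii(string s) =>
        new string(s.Where(x => ' ' <= x && x <= '~').ToArray());

    // "olé" with a precomposed é (one code unit, U+00E9).
    public static string Composed() =>
        KeepPrintableAscii("ol\u00e9");

    // The same text decomposed into 'e' + combining acute accent (U+0301).
    public static string Decomposed() =>
        KeepPrintableAscii("ol\u00e9".Normalize(NormalizationForm.FormD));
}
```

Composed() should return "ol", because the precomposed é is a single non-ASCII code unit and is removed entirely. Decomposed() should return "ole", because FormD splits é into a plain e plus a combining accent and only the accent is filtered out.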

Related

Remove specific characters except last

I have a text string and I want to replace the dots with underscores except for the last character found in the string.
Example:
input = "video.coffee.example.mp4"
result = "video_coffee_example.mp4"
I have some code, but it replaces every dot, including the last one.
First option (failed):
static string replaceForUnderScore(string file)
{
return file = file.Replace(".", "_");
}
I implemented a second option that works for me, but I find it verbose and not very efficient:
static string replaceForUnderScore(string file)
{
string result = "";
var splits = file.Split(".");
var extension = splits.LastOrDefault();
splits = splits.Take(splits.Count() - 1).ToArray();
foreach (var strItem in splits)
{
result = result + "_" + strItem;
}
result = result.Substring(1, result.Length-1);
string finalResult = result + "."+extension;
return finalResult;
}
Is there a better way to do it?

Since you work with files, I suggest using the Path class: all we want is to change the file name while keeping the extension intact:
static string replaceForUnderScore(string file) =>
Path.GetFileNameWithoutExtension(file).Replace('.', '_') + Path.GetExtension(file);

You can replace all the dots with an underscore except for the last dot by asserting that there is still a dot present to the right when matching one.
string result = Regex.Replace(input, @"\.(?=[^.]*\.)", "_");
The result will be
video_coffee_example.mp4

Regex will help you to do this.
Add the namespace using System.Text.RegularExpressions;
And use this code:
var regex = new Regex(Regex.Escape("."));
var newText = regex.Replace("video.coffee.example.mp4", "_", 2);
Here we specified the maximum number of times to replace the .
The output would be the following:
video_coffee_example.mp4
Additionally, you can update the code to replace any number of dots excluding the last one.
var replaceChar = '.';
var regex = new Regex(Regex.Escape(replaceChar.ToString()));
var replaceWith = "_";
// The text to process
var text = "video.coffee.example.mp4";
// Count how many chars to replace excluding extension
var replaceCount = text.Count(s => s == replaceChar) - 1;
var newText = regex.Replace(text, replaceWith, replaceCount);

Off the top of my head but this might work.
return $"{file.Replace(".mp4","").Replace(".","_")}.mp4";

The simplest (and probably fastest) way is just to iterate over the string:
static string replaceForUnderScore(string file)
{
StringBuilder sb = new StringBuilder( file.Length ) ;
int lastDot = -1 ;
for ( int i = 0 ; i < file.Length ; ++i )
{
char c = file[i] ;
// if we found a '.', replace it with '_' and save its position
if ( c == '.' )
{
c = '_' ;
lastDot = i ;
}
sb.Append( c ) ;
}
// if we changed any '.' to '_', convert the last such replacement back to '.'
if ( lastDot >= 0 )
{
sb.Replace ( '_' , '.' , lastDot, 1 ); // oldChar is '_', newChar is '.': restore the last dot
}
return sb.ToString();
}
Another approach would be to use System.IO.Path. It's certainly the most succinct:
static string replaceForUnderScore( string file )
{
string ext = Path.GetExtension( file ) ;
string name = Path
.GetFileNameWithoutExtension( file )
.Replace( '.' , '_' )
;
return Path.ChangeExtension( name , ext ) ;
}
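A further variation, not taken from the answers above, is to locate the last dot with LastIndexOf and replace dots only in the part before it. This is a sketch with hypothetical names; it treats the input as a plain string, so unlike the Path version it keeps any directory prefix, but it would also replace dots inside directory names:

```csharp
public static class DotReplaceDemo
{
    public static string ReplaceAllButLastDot(string file)
    {
        int lastDot = file.LastIndexOf('.');
        if (lastDot < 0)
            return file; // no dot at all: nothing to replace

        // Replace dots before the last one, then append the last dot and what follows it.
        return file.Substring(0, lastDot).Replace('.', '_') + file.Substring(lastDot);
    }
}
```

With the input from the question, ReplaceAllButLastDot("video.coffee.example.mp4") gives "video_coffee_example.mp4".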

Change in string some part, but without one part - where are numbers

For example I have such string:
ex250-r-ninja-08-10r_
how could I change it to such string?
ex250 r ninja 08-10r_
As you can see, I change every - to a space, except in the XX-XX part. How can I do this replacement in C#? (The string can be of varying length.)
This is what I use to replace -:
string correctString = errString.Replace("-", " ");
But how do I leave the - in place where it is part of the numeric pattern XX-XX?

You can use regular expressions to only perform substitutions in certain cases. In this case, you want to perform a substitution if either side of the dash is a non-digit. That's not quite as simple as it might be, but you can use:
string ReplaceSomeHyphens(string input)
{
string result = Regex.Replace(input, @"(\D)-", "${1} ");
result = Regex.Replace(result, @"-(\D)", " ${1}");
return result;
}
It's possible that there's a more cunning way to do this in a single regular expression, but I suspect that it would be more complicated too :)
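One possible single-expression version uses lookarounds: match a hyphen that is either not preceded by a digit or not followed by a digit. This is a sketch rather than a drop-in equivalent; the class name is hypothetical, and it differs from the two-step version at the very start or end of the string (for example, a trailing hyphen after a digit is replaced here but not above):

```csharp
using System.Text.RegularExpressions;

public static class HyphenDemo
{
    // (?<!\d)-  : a hyphen with no digit immediately before it
    // -(?!\d)   : a hyphen with no digit immediately after it
    public static string ReplaceSomeHyphens(string input) =>
        Regex.Replace(input, @"(?<!\d)-|-(?!\d)", " ");
}
```

For the sample input, ReplaceSomeHyphens("ex250-r-ninja-08-10r_") gives "ex250 r ninja 08-10r_", leaving only the hyphen between 08 and 10 untouched.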

A very uncool approach using a StringBuilder. It replaces each - with a space unless both the two characters before it and the two characters after it are digits.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
bool replace = false;
char c = text[i];
if (c == '-')
{
if (i < 2 || i >= text.Length - 2) replace = true;
else
{
bool leftDigit = text.Substring(i - 2, 2).All(Char.IsDigit);
bool rightDigit = text.Substring(i + 1, 2).All(Char.IsDigit);
replace = !leftDigit || !rightDigit;
}
}
if (replace)
sb.Append(' ');
else
sb.Append(c);
}

Since you say you won't have hyphens at the start of your string, you need to match every - that is preceded by a group of characters containing at least one letter followed by zero or more digits. To achieve this, use a positive lookbehind in your regex.
string strRegex = @"(?<=[a-z]+[0-9]*)-";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
string strTargetString = @"ex250-r-ninja-08-10r_";
string strReplace = @" ";
return myRegex.Replace(strTargetString, strReplace);
For the sample input, this returns ex250 r ninja 08-10r_.

Regular expression to extract characters in between other characters

I have a string which is //{characters}\n.
And I need a regular expression to extract the character in between // and \n.

Regular expressions are nice and all, but why not use Substring?
string input = "//{characters}\n";
string result = input.Split('\n')[0].Substring(2);
or
string result = input.Substring(2, input.Length - 3);

Using RegEx:
Regex g;
Match m;
g = new Regex("//(.*)\n"); // if you have just alphabet characters replace .* with \w*
m = g.Match(input);
if (m.Success == true)
output = m.Groups[1].Value;

This should work:
string s1 = "//{characters}\n";
string final = (s1.Replace("//", "").Replace("\n", ""));

How do I strip non-alphanumeric characters (including spaces) from a string?

How do I strip non-alphanumeric characters from a string, and lose the spaces, in C# with Replace?
I want to keep a-z, A-Z, 0-9 and nothing more (not even " " spaces).
"Hello there(hello#)".Replace(regex-i-want, "");
should give
"Hellotherehello"
I have tried "Hello there(hello#)".Replace(@"[^A-Za-z0-9 ]", ""); but the spaces remain.

In your regex, you have excluded the spaces from being matched (and you haven't used Regex.Replace() which I had overlooked completely...):
result = Regex.Replace("Hello there(hello#)", @"[^A-Za-z0-9]+", "");
should work. The + makes the regex a bit more efficient by matching more than one consecutive non-alphanumeric character at once instead of one by one.
If you want to keep non-ASCII letters/digits, too, use the following regex:
@"[^\p{L}\p{N}]+"
which leaves
BonjourmesélèvesGutenMorgenliebeSchüler
instead of
BonjourmeslvesGutenMorgenliebeSchler
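The input that produced these two results is not shown, so the sketch below uses an assumed sentence that is consistent with them. The class and method names are illustrative only:

```csharp
using System.Text.RegularExpressions;

public static class AlphanumericDemo
{
    // Keeps only ASCII letters and digits.
    public static string AsciiOnly(string s) =>
        Regex.Replace(s, @"[^A-Za-z0-9]+", "");

    // Keeps any Unicode letter (\p{L}) or number (\p{N}).
    public static string UnicodeAware(string s) =>
        Regex.Replace(s, @"[^\p{L}\p{N}]+", "");
}
```

With the assumed input "Bonjour, mes élèves! Guten Morgen, liebe Schüler.", UnicodeAware keeps the accented letters while AsciiOnly drops them along with the punctuation and spaces.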

You can use Linq to filter out required characters:
String source = "Hello there(hello#)";
// "Hellotherehello"
String result = new String(source
.Where(ch => Char.IsLetterOrDigit(ch))
.ToArray());
Or
String result = String.Concat(source
.Where(ch => Char.IsLetterOrDigit(ch)));
So there is no need for regular expressions.
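Note that Char.IsLetterOrDigit is Unicode-aware, so it also keeps letters and digits outside a-z, A-Z and 0-9, which is more than the question asked for. A hedged sketch of both behaviours follows; the names are hypothetical, and the explicit c < 128 check is just one way to restrict the test to ASCII:

```csharp
using System.Linq;

public static class LetterOrDigitDemo
{
    // Keeps any Unicode letter or decimal digit.
    public static string UnicodeLettersAndDigits(string s) =>
        string.Concat(s.Where(c => char.IsLetterOrDigit(c)));

    // Keeps only a-z, A-Z and 0-9 by rejecting everything above the ASCII range first.
    public static string AsciiLettersAndDigits(string s) =>
        string.Concat(s.Where(c => c < 128 && char.IsLetterOrDigit(c)));
}
```

For "text LaLa (lol) á ñ $ 123 ٠١", the first method keeps á, ñ and the Arabic-Indic digits, while the second returns "textLaLalol123".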

Or you can do this too:
public static string RemoveNonAlphanumeric(string text)
{
StringBuilder sb = new StringBuilder(text.Length);
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9')
sb.Append(text[i]);
}
return sb.ToString();
}
Usage:
string text = SomeClass.RemoveNonAlphanumeric("text LaLa (lol) á ñ $ 123 ٠١٢٣٤");
//text: textLaLalol123

The mistake made above was using Replace incorrectly (it doesn't take regex, thanks CodeInChaos).
The following code should do what was specified:
Regex reg = new Regex(@"[^\p{L}\p{N}]+");//Thanks to Tim Pietzcker for regex
string regexed = reg.Replace("Hello there(hello#)", "");
This gives:
regexed = "Hellotherehello"

And as a replace operation as an extension method:
public static class StringExtensions
{
public static string ReplaceNonAlphanumeric(this string text, char replaceChar)
{
StringBuilder result = new StringBuilder(text.Length);
foreach(char c in text)
{
if(c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9')
result.Append(c);
else
result.Append(replaceChar);
}
return result.ToString();
}
}
And test:
[TestFixture]
public sealed class StringExtensionsTests
{
[Test]
public void Test()
{
Assert.AreEqual("text_LaLa__lol________123______", "text LaLa (lol) á ñ $ 123 ٠١٢٣٤".ReplaceNonAlphanumeric('_'));
}
}

var text = "Hello there(hello#)";
var rgx = new Regex("[^a-zA-Z0-9]");
text = rgx.Replace(text, string.Empty);

Use the following regex with Regex.Replace to strip all of those characters. Note that the \s in the class keeps white space; remove it if spaces should be stripped as well.
([^A-Za-z0-9\s])

In .NET 4.0 you can use the IsNullOrWhiteSpace method of the String class to detect strings that are null, empty, or made up only of so-called white-space characters; note that it only tests a string and does not remove anything. Please take a look here http://msdn.microsoft.com/en-us/library/system.string.isnullorwhitespace.aspx
However, as @CodeInChaos pointed out, there are plenty of characters which could be considered letters and numbers. You can use a regular expression if you only want to find A-Za-z0-9.

Replace placeholders in order

I have a part of a URL like this:
/home/{value1}/something/{anotherValue}
Now I want to replace the placeholders in braces with values from a string array.
I tried this regex pattern: \{[a-zA-Z_]\} but it doesn't work.
Later (in C#) I want to replace the first match with the first value of the array, the second with the second.
Update: The /'s can't be used as separators. Only the placeholders {...} should be replaced.
Example: /home/before{value1}/and/{anotherValue}
String array: {"Tag", "1"}
Result: /home/beforeTag/and/1
I hoped it could work like this:
string input = @"/home/before{value1}/and/{anotherValue}";
string pattern = @"\{[a-zA-Z_]\}";
string[] values = {"Tag", "1"};
MatchCollection mc = Regex.Match(input, pattern);
for(int i, ...)
{
mc.Replace(values[i];
}
string result = mc.GetResult;
Edit:
Thank you Devendra D. Chavan and ipr101,
both solutions are great!

You can try this code fragment,
// Begin with '{' followed by any number of word like characters and then end with '}'
var pattern = @"{\w*}";
var regex = new Regex(pattern);
var replacementArray = new [] {"abc", "cde", "def"};
var sourceString = @"/home/{value1}/something/{anotherValue}";
var matchCollection = regex.Matches(sourceString);
for (int i = 0; i < matchCollection.Count && i < replacementArray.Length; i++)
{
sourceString = sourceString.Replace(matchCollection[i].Value, replacementArray[i]);
}

[a-zA-Z_] describes a character class, which matches a single character. For words, you'll have to add * at the end (any number of characters within a-zA-Z_).
Then, to have 'value1' captured, you'll need to add number support : [a-zA-Z0-9_]*, which can be summarized with: \w*
So try this one : {\w*}
But for replacing in C#, string.Split('/') might be easier, as Fredrik proposed.

You could use a delegate, something like this -
string[] strings = {"dog", "cat"};
int counter = -1;
string input = @"/home/{value1}/something/{anotherValue}";
Regex reg = new Regex(@"\{([a-zA-Z0-9]*)\}");
string result = reg.Replace(input, delegate(Match m) {
counter++;
return strings[counter]; // return only the value, so the braces are replaced as well
});

My two cents:
// input string
string txt = "/home/{value1}/something/{anotherValue}";
// template replacements
string[] str_array = { "one", "two" };
// regex to match a template
Regex regex = new Regex("{[^}]*}");
// replace the first template occurrence for each element in array
foreach (string s in str_array)
{
txt = regex.Replace(txt, s, 1);
}
Console.Write(txt);
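A variation on the approaches above, sketched with hypothetical names, does the whole replacement in one pass with a match evaluator and an index. Because the evaluator's return value is inserted literally, a $ in a value is not treated as a substitution token, and already-inserted values are never rescanned. Leaving surplus placeholders untouched when the array runs out is a design assumption here, not something the question specifies:

```csharp
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    public static string FillInOrder(string template, string[] values)
    {
        int index = 0;
        // Each {...} placeholder takes the next value; once the values
        // are used up, remaining placeholders are kept as they are.
        return Regex.Replace(template, @"\{[^{}]*\}", m =>
            index < values.Length ? values[index++] : m.Value);
    }
}
```

With the example from the question, FillInOrder("/home/before{value1}/and/{anotherValue}", new[] { "Tag", "1" }) gives "/home/beforeTag/and/1".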
