How to remove non-ASCII word from a string in C#

I want to filter strings that contain some wrong (non-ASCII) letters. The text looks different in Notepad, Visual Studio 2010 and MySQL.
How can I check whether a string has non-ASCII letters, and how can I remove them?

You could use a regular expression to filter out non-ASCII characters:
string input = "AB £ CD";
string result = Regex.Replace(input, "[^\x0d\x0a\x20-\x7e\t]", "");

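You could use regular expressions:
Regex.Replace(input, "[^a-zA-Z0-9]+", "")
This removes everything except ASCII letters and digits, including spaces and punctuation. You could also use \W+ as the pattern to remove any non-word character, but in .NET \w is Unicode-aware by default, so that pattern keeps non-ASCII letters such as é.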

This has been a godsend:
Regex.Replace(input, @"[^\u0000-\u007F]", "");
I think I got it elsewhere originally, but the same answer is given here:
How can you strip non-ASCII characters from a string? (in C#)

string testString = Regex.Replace(OldString, @"[\u0000-\u0008\u000A-\u001F\u0100-\uFFFF]", "");

First, you need to determine what you mean by a "word". If it is non-ASCII, does that imply non-English?
Personally, I'd ask why you need to do this, and which fundamental assumption in your application conflicts with your data. Depending on the situation, I suggest you either re-encode the text from the source encoding (although this will be a lossy conversion) or address that fundamental assumption so that your application handles the data correctly.
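As a rough illustration of the "source encoding" point above: text often looks wrong because bytes were decoded with the wrong encoding, not because the characters are invalid. The sketch below is only an assumption about what might be going on; the class name, method name and the choice of ISO-8859-1 are hypothetical and should be replaced with whatever your data source actually produces.

```csharp
using System.Text;

public static class EncodingDemo
{
    // Decodes raw bytes using a named source encoding.
    // The encoding name is an assumption supplied by the caller.
    public static string Decode(byte[] raw, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(raw);
    }
}
```

For example, the bytes 0x41 0x42 0xA3 decode to "AB£" as "iso-8859-1", whereas decoding the same bytes as "utf-8" replaces the last byte with the U+FFFD replacement character.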

I think something as simple as this would probably work, wouldn't it?
// Requires "using System.Linq;" and must be declared inside a static class.
public static string AsciiOnly(this string input, bool includeExtendedAscii)
{
    // 127 is the last ASCII code point; 255 also keeps the Latin-1 range.
    int upperLimit = includeExtendedAscii ? 255 : 127;
    char[] asciiChars = input.Where(c => (int)c <= upperLimit).ToArray();
    return new string(asciiChars);
}
Example usage:
string input = "AB£ȼCD";
string asciiOnly = input.AsciiOnly(false); // returns "ABCD"
string extendedAsciiOnly = input.AsciiOnly(true); // returns "AB£CD"
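The answers above cover removal, but the question also asks how to check for non-ASCII characters. A minimal sketch, assuming "non-ASCII" means any UTF-16 code unit above 127 (the class and method names are hypothetical):

```csharp
using System.Linq;

public static class AsciiCheck
{
    // Returns true if any character lies outside the 7-bit ASCII range.
    public static bool HasNonAscii(string input)
    {
        return input.Any(c => c > 127);
    }
}
```

With this, AsciiCheck.HasNonAscii("AB £ CD") is true and AsciiCheck.HasNonAscii("ABCD") is false.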

Related

Get the first word from the string

I would like to get only the first word of the string regardless of any character or punctuation in front of it.
Sometimes, there could be , or . or !. I don't want these characters.
var s = "Hello, World";
var firstWord = s.Substring(0, s.IndexOf(" "));
This gives me Hello,. I would like to get Hello only.
How do I achieve this?

Simply use the following regex:
var s = "Hello, World";
var result = Regex.Match(s, @"^([\w\-]+)");
Console.WriteLine(result.Value); // Result is "Hello"
This will get the first word regardless of whether or not it ends with punctuation or simply precedes a space.

This will work for you. I assumed that words will be separated with whitespace.
var input = "Hello, World";
var output = Regex.Replace(input.Split()[0], @"[^0-9a-zA-Z\ ]+", "");

IndexOfAny (https://msdn.microsoft.com/fr-ca/library/11w09h50(v=vs.110).aspx) is an alternative if you know the list of characters you want to stop at. It really depends on the definition of "word" you want to use and which characters you want to handle. How do you want to handle characters like œ,é,µ,½,¶,ç,+,-,3...?
Also, do you want to handle locale? Some characters might have a classification that depends on the language.
Char has many methods that let you classify characters. See https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx.
There are also the regex solutions proposed by others.
So the best solution really depends on your needs. Do you need to handle any Unicode character properly, or only some specific ASCII characters?
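One way the Char classification methods mentioned above could be applied is sketched below. It assumes a "word" is a run of characters for which char.IsLetterOrDigit returns true, which is only one possible definition; the class and method names are hypothetical.

```csharp
using System.Linq;

public static class FirstWordDemo
{
    // Skips leading non-word characters, then takes letters and digits
    // until the first character that is neither.
    public static string FirstWord(string text)
    {
        var chars = text
            .SkipWhile(c => !char.IsLetterOrDigit(c))
            .TakeWhile(c => char.IsLetterOrDigit(c))
            .ToArray();
        return new string(chars);
    }
}
```

FirstWordDemo.FirstWord("Hello, World") returns "Hello". Because char.IsLetterOrDigit is Unicode-aware, accented letters such as é are kept, while + and - end the word.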

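LATE ENTRY:
If you don't want to use regular expressions:
private string GetFirstWord(string text)
{
    var candidate = text.Trim();
    if (!candidate.Any(Char.IsWhiteSpace))
        return candidate;
    // Passing null splits on any whitespace, matching the check above.
    return candidate.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).FirstOrDefault();
}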

Regex to match only numbers, no apostrophes

I want to match only numbers in the following string
String : "40’000"
Match : "40000"
Basically, I am trying to ignore the apostrophe.
I am using C#, in case it matters.
I can't use any C# methods; I need to use only a regex.

Replace like this; it removes all characters except digits:
string input = "40’000";
string result = Regex.Replace(input, @"[^\d]", "");

Since you said you just want to pick up numbers only, how about doing it without a regex?
var s = "40’000";
var result = new string(s.Where(char.IsDigit).ToArray());
Console.WriteLine(result); // 40000

I suggest using a regex to find the special characters rather than the digits, and then replacing them with an empty string.
A simple (?=\S)\D should be enough; the (?=\S) lookahead makes the pattern skip whitespace, such as a space at the end of the number.
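The lookahead pattern described above can be tried with a short sketch like the following. The wrapper class and method name are hypothetical; only the pattern comes from the answer.

```csharp
using System.Text.RegularExpressions;

public static class DigitsDemo
{
    // Removes every character that is neither a digit nor whitespace.
    public static string StripSeparators(string input)
    {
        return Regex.Replace(input, @"(?=\S)\D", "");
    }
}
```

DigitsDemo.StripSeparators("40’000 ") returns "40000 ": the apostrophe is removed, and the trailing space is left alone because the lookahead rejects whitespace.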

Replace like this; it removes all characters except digits and periods:
string input = "40’000";
string result = Regex.Replace(input, @"[^\d.]", "");

Don't complicate your life, use Regex.Replace
string s = "40'000";
string replaced = Regex.Replace(s, @"\D", "");

how to handle %20 while string comparison in c#

I am trying to compare two strings, but one of them contains a white space at the end. I used Trim() before comparing, but it didn't work because that white space is getting converted to %20, and I think Trim does not remove that. It is something like "abc" and "abc%20". What can I do in such a situation to compare the strings while ignoring case too?

%20 is the url-encoded version of space.
You can't strip it off directly using Trim(), but you can use HttpUtility.UrlDecode() to decode the %20 back to a space, then trim and compare exactly as you would otherwise:
using System.Web;
//...
var test1 = "HELLO%20";
var test2 = "hello";
Console.WriteLine(HttpUtility.UrlDecode(test1).Trim()
    .Equals(HttpUtility.UrlDecode(test2).Trim(),
        StringComparison.InvariantCultureIgnoreCase));
// prints True

Use HttpUtility.UrlDecode to decode the strings:
string s1 = "abc ";
string s2 = "abc%20";
if (System.Web.HttpUtility.UrlDecode(s1).Equals(System.Web.HttpUtility.UrlDecode(s2)))
{
//equals...
}
In a WinForms or Console (or any other non-ASP.NET) project, you will have to add a reference to the System.Web assembly.

Something like:
if (System.Uri.UnescapeDataString("abc%20").ToLower() == myString.ToLower()) {}

The "%20" is the url encoded version of the ' ' (space) character. Are you comparing an encoded URL parameter? If so, you can use:
string str = "abc%20";
string decoded = HttpUtility.UrlDecode(str); // decoded becomes "abc "
If you need to trim any white spaces, you should do this for the decoded string. The Trim method does not understand or recognize the encoded whitespace characters.
decoded = decoded.Trim();
Now you can compare with the decoded variable using:
decoded.Equals(otherValue, StringComparison.OrdinalIgnoreCase);
The StringComparison.OrdinalIgnoreCase is probably the fastest way for case-insensitive comparison between strings.
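The decode, trim and compare steps above can be combined into one helper. This is a minimal sketch that uses Uri.UnescapeDataString so that no reference to System.Web is needed; the class and method names are hypothetical. Unlike HttpUtility.UrlDecode, Uri.UnescapeDataString does not turn "+" into a space, so it only fits if your input uses %20.

```csharp
using System;

public static class UrlCompare
{
    // Decodes percent-escapes, trims surrounding whitespace,
    // then compares the two values ignoring case.
    public static bool EqualsDecoded(string a, string b)
    {
        string left = Uri.UnescapeDataString(a).Trim();
        string right = Uri.UnescapeDataString(b).Trim();
        return string.Equals(left, right, StringComparison.OrdinalIgnoreCase);
    }
}
```

UrlCompare.EqualsDecoded("abc%20", "ABC") returns true.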

Did you try this?
string before = "abc%20";
string after = before.Replace("%20", "").ToLower();

You can use String.Replace and since you mentioned case insensitivity String.ToLower like this:
var str1 = "abc";
var str2 = "Abc%20";
str1.Replace("%20", "").ToLower() == str2.Replace("%20", "").ToLower();
// will be true

It seems the root problem lies in how the URL is being encoded. If you specify the character encoding when encoding, you should be able to avoid the stray %20. The default encoding used by HttpUtility.UrlEncode is UTF-8. Here is the usage:
System.Web.HttpUtility.UrlEncode("ãName Marcos", System.Text.Encoding.GetEncoding("iso-8859-1"))
You can read more about character encoding on the Microsoft website.
If you encode properly, you can avoid the rest of the work.
And here is what you asked for.
In the second case, if you have to compare two strings as they are, decode them first with HttpUtility.UrlDecode:
bool result = HttpUtility.UrlDecode(stringOne).Equals(HttpUtility.UrlDecode(stringTwo));
The result variable then tells you whether they are equal:
Console.WriteLine("Result is {0}", result ? "equal." : "not equal.");
Hope it helps.

c# allowing slashes (and other similar characters) in regular expressions

I am trying to validate a string based on the characters entered. I want to be able to specify which characters are allowed in addition to letters and digits. Below is my extension method:
public static bool isAlphaNumeric(this string inputString, string allowedChars)
{
StringBuilder str = new StringBuilder(allowedChars);
str.Replace(" ", "\\s");
str.Replace("\n","\\\\");
str.Replace("/n", "////");
allowedChars = str.ToString();
Regex rg = new Regex(@"^[a-zA-Z0-9" + allowedChars + "]*$");
return rg.IsMatch(inputString);
}
The way I use this is:
string s = " te\m#as 1963' yili.ışçöÖÇÜ/nda olbnrdu"; // just a test string with no meaning
if (s.isAlphaNumeric("ışŞö\Ö#üÜçÇ ğ'Ğ/.")) {...}
Of course it gives an error:
parsing "^[a-zA-Z0-9ışŞö\Ö#üÜçÇ\sğ'Ğ/.]*$" - Unrecognized escape sequence
I am aware that the StringBuilder Replace calls are wrong. I want to accept all characters given in the allowedChars parameter. This can also include slashes (are there other characters similar to slashes that I am not aware of?). Given this, how can I get my replace logic to work? And is my approach correct? I am very new to regular expressions and have no clue how to work with them.

You need to use Regex.Escape on your string.
allowedChars = Regex.Escape(str.ToString());
ought to do it.

You're looking for Regex.Escape.
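One caveat with the approach above: Regex.Escape does not escape ] or -, both of which are significant inside a character class, so an allowed list containing them could still break the pattern. As an alternative sketch that sidesteps escaping entirely, the check can be done without a regex. The class and method names below are hypothetical, and the ASCII ranges mirror the a-zA-Z0-9 class from the question.

```csharp
using System.Linq;

public static class AllowedCharsDemo
{
    // True when every character is an ASCII letter, an ASCII digit,
    // or one of the explicitly allowed characters.
    public static bool IsAlphaNumericOr(string input, string allowedChars)
    {
        return input.All(c =>
            (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            allowedChars.IndexOf(c) >= 0);
    }
}
```

AllowedCharsDemo.IsAlphaNumericOr(@"a\b/c-]", @"\/-]") returns true, and an empty input is accepted, as with the * quantifier in the original pattern.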

How do i strip special characters from the end of a string?

I need to strip unknown characters from the end of a string returned from an SQL database. I also need to log when a special character occurs in the string.
What's the best way to do this?

You can use the Trim() method to trim blanks or specific characters from the end of a string. If you need to trim a certain number of characters, you can use the Substring() method. You can use regular expressions (System.Text.RegularExpressions namespace) to match patterns in a string and detect when they occur. See MSDN for more info.
If you need more help you'll need to provide a bit more info on what exactly you're trying to do.

First define what the unknown characters are (characters other than 0-9, a-z and A-Z?) and put them in an array.
Loop through the characters of the string and check whether each one occurs in the array; if so, remove it.
You can also call String.Replace with the unknown character as the first parameter and an empty string as the replacement.

Since you've specified that the legal characters are only alphanumeric, you could do something like this:
Match m = Regex.Match(original, "^([0-9A-Za-z]*)(.*)$");
string good = m.Groups[1].Value;
string bad = m.Groups[2].Value;
if (bad.Length > 0)
{
// log bad characters
}
Console.WriteLine(good);

Your definition of the problem is not precise, but here is a quick trick:
string input;
...
var trimed = input.TrimEnd(new[] {'#','$',...} /* array of unwanted characters */);
if(trimed != input)
    myLogger.Log(input.Substring(trimed.Length)); // log only the characters that were trimmed

Check out the Regex.Replace methods; there are lots of overloads. You can use the Match methods for the logging to identify all matches.
String badString = "HELLO WORLD!!!!";
Regex regex = new Regex("!{1,}$" );
String newString = regex.Replace(badString, String.Empty);
