How to remove non-INTERNATIONAL-alphanumeric characters from a string?

How can I remove (or recognize) non-alphanumeric characters such as '-', '*', '‡', '€', '⁋', '™' from a string without removing non-Latin alphanumeric characters such as 'Ж', 'ק', 'ओ', 'を'?
The removing part is easy; my issue is telling non-Latin letters and digits apart from non-Latin symbols.
* All existing Q&A I found filtered out non-Latin alphabets.

A simple solution (that works only for basic BMP characters) is:
string str = "-*‡€⁋™Жקओを";
var sb = new StringBuilder();
foreach (char ch in str)
{
    bool isLetter = char.IsLetterOrDigit(ch);
    if (isLetter)
    {
        sb.Append(ch);
    }
}
string str2 = sb.ToString();
Char.IsLetterOrDigit is described as:
"Indicates whether the specified Unicode character is categorized as a letter or a decimal digit."
If you want to support surrogate pairs, it becomes more complex:
// \U0001F000 is MAHJONG TILE EAST WIND
// \U0001D49E is MATHEMATICAL SCRIPT CAPITAL C
string str = "-*‡€⁋™Жקओを\U0001F000\U0001D49E";
var sb = new StringBuilder();
for (int i = 0; i < str.Length; i++)
{
    bool isLetter = char.IsLetterOrDigit(str, i);
    bool isHighSurrogate = char.IsHighSurrogate(str[i]);
    if (isLetter)
    {
        sb.Append(str, i, isHighSurrogate ? 2 : 1);
    }
    if (isHighSurrogate && i + 1 < str.Length && char.IsLowSurrogate(str[i + 1]))
    {
        i++;
    }
}
string str2 = sb.ToString();
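On .NET Core 3.0 or later, the same filtering can probably be written without manual surrogate handling by iterating over System.Text.Rune values (whole code points) instead of char values. The following is only a sketch under that assumption; the class name RuneFilter and the method name KeepLettersAndDigits are made up for illustration.

```csharp
using System.Text;

public static class RuneFilter
{
    // Keeps only Unicode letters and decimal digits. Iterating by Rune
    // (code point) means surrogate pairs need no special casing.
    public static string KeepLettersAndDigits(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (Rune rune in input.EnumerateRunes())
        {
            if (Rune.IsLetterOrDigit(rune))
            {
                // Rune.ToString() yields one or two UTF-16 code units.
                sb.Append(rune.ToString());
            }
        }
        return sb.ToString();
    }
}
```

Calling RuneFilter.KeepLettersAndDigits on the test string above should keep Ж, ק, ओ, を and the MATHEMATICAL SCRIPT CAPITAL C, and drop the symbols and the mahjong tile.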

Related

How to exchange numbers to alphabet and alphabet to numbers in a string?

How do I convert each digit in a string to its equivalent letter of the alphabet, and each letter to its numeric position (except 0, which should stay 0 for obvious reasons)?
So basically, if there is a string
string content="D93AK0F5I";
how can I convert it to the following?
string new_content="4IC11106E9";

I'm assuming you're aware this is not reversible, and that you're only using upper case and digits. Here you go...
private string Transpose(string input)
{
    StringBuilder result = new StringBuilder();
    foreach (var character in input)
    {
        if (character == '0')
        {
            result.Append(character);
        }
        else if (character >= '1' && character <= '9')
        {
            int offset = character - '1';
            char replacement = (char)('A' + offset);
            result.Append(replacement);
        }
        else if (character >= 'A' && character <= 'Z') // I'm assuming upper case only; feel free to duplicate for lower case
        {
            int offset = character - 'A' + 1;
            result.Append(offset);
        }
        else
        {
            throw new ApplicationException($"Unexpected character: {character}");
        }
    }
    return result.ToString();
}
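To illustrate why the mapping is not reversible, here is a small self-contained sketch. It repeats the same mapping in compact form (the class name TransposeDemo is hypothetical, and it throws ArgumentException rather than ApplicationException) so that it can be run on its own.

```csharp
using System;
using System.Text;

public static class TransposeDemo
{
    // Same mapping as above: 1-9 -> A-I, A-Z -> 1-26, 0 stays 0.
    public static string Transpose(string input)
    {
        var result = new StringBuilder();
        foreach (char c in input)
        {
            if (c == '0') result.Append(c);
            else if (c >= '1' && c <= '9') result.Append((char)('A' + (c - '1')));
            else if (c >= 'A' && c <= 'Z') result.Append(c - 'A' + 1); // appends the number as decimal text
            else throw new ArgumentException($"Unexpected character: {c}");
        }
        return result.ToString();
    }
}
```

Both Transpose("K") and Transpose("AA") produce "11", so the output "11" cannot be mapped back to a single input.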

Well, if you only need a one-way translation, here is quite a simple way to do it using LINQ (note that it produces lower-case letters):
string convert(string input)
{
    var chars = "0abcdefghijklmnopqrstuvwxyz";
    return string.Join("",
        input.Select(
            c => char.IsDigit(c) ?
                chars[int.Parse(c.ToString())].ToString() :
                (chars.IndexOf(char.ToLowerInvariant(c))).ToString())
    );
}
You can see a live demo on rextester.

You can use an ArrayList of letters. For example:
ArrayList alphabets = new ArrayList();
alphabets.Add("A");
alphabets.Add("B");
and so on.
Then parse your string character by character:
string s = "1BC34D";
char[] characters = s.ToCharArray();
for (int i = 0; i < characters.Length; i++)
{
    if (Char.IsDigit(characters[i]))
    {
        // '1' maps to index 0 ("A"); '0' has no letter and stays "0"
        int index = characters[i] - '1';
        var stringAlphabet = index >= 0 ? alphabets[index] : "0";
    }
    else
    {
        // The list holds strings, so search for a string; add 1 because "A" sits at index 0
        var digitCharacter = alphabets.IndexOf(characters[i].ToString()) + 1;
    }
}
This way you can get the letter for each digit and the numeric position of each letter.

LowerCase and UpperCase Alternating c#

I need to convert a phrase so that its letters alternate between lowercase and uppercase.
Example.
Input:
the girl is pretty.
Output:
tHe GiRl Is PrEtTy
I have tried the code below, but it only converts the first letter of each word:
char[] array = texto.ToCharArray();
if (array.Length >= 1)
{
    if (char.IsLower(array[0]))
    {
        array[0] = char.ToUpper(array[0]);
    }
}
for (int i = 1; i < array.Length; i++)
{
    if (array[i - 1] == ' ')
    {
        if (char.IsLower(array[i]))
        {
            array[i] = char.ToUpper(array[i]);
        }
    }
}
return new string(array);
Thanks

Fancy solution using LINQ:
string someString = "the girl is pretty";
string newString = string.Concat(
    someString.ToLower().AsEnumerable().Select((c, i) => i % 2 == 0 ? c : char.ToUpper(c)));
This basically does the following:
1. Convert the string to lowercase.
2. Iterate over each character.
3. Convert every second character to upper case.
4. Join the characters into a single string.
A more “classical” solution could look like this:
string someString = "the girl is pretty";
StringBuilder sb = new StringBuilder();
bool uppercase = false;
foreach (char c in someString)
{
    if (uppercase)
        sb.Append(char.ToUpper(c));
    else
        sb.Append(char.ToLower(c));
    uppercase = !uppercase;
}
string newString = sb.ToString();

The answer by poke is right, but it counts the spaces when alternating the case. I tweaked that answer so that spaces are copied through without affecting the alternation:
string someString = "the girl is pretty";
string space = " ";
char[] str = someString.ToCharArray();
char[] str2 = space.ToCharArray();
bool uppercase = false;
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
    if (c != str2[0])
    {
        if (uppercase)
            sb.Append(char.ToUpper(c));
        else
        {
            sb.Append(char.ToLower(c));
        }
        uppercase = !uppercase;
    }
    else
    {
        sb.Append(c);
    }
}
string newString = sb.ToString();

Convert Dash-Separated String to camelCase via C#

I have a large XML file that contains tag names that follow the dash-separated naming convention. How can I use C# to convert the tag names to the camel case naming convention?
The rules are:
1. Convert all characters to lower case
2. Capitalize the first character after each dash
3. Remove all dashes
Example
Before Conversion
<foo-bar>
<a-b-c></a-b-c>
</foo-bar>
After Conversion
<fooBar>
<aBC></aBC>
</fooBar>
Here's a code example that works, but it's slow to process - I'm thinking that there is a better way to accomplish my goal.
string ConvertDashToCamelCase(string input)
{
    input = input.ToLower();
    char[] ca = input.ToCharArray();
    StringBuilder sb = new StringBuilder();
    for(int i = 0; i < ca.Length; i++)
    {
        if(ca[i] == '-')
        {
            string t = ca[i + 1].ToString().ToUpper();
            sb.Append(t);
            i++;
        }
        else
        {
            sb.Append(ca[i].ToString());
        }
    }
    return sb.ToString();
}

The reason your original code was slow is because you're calling ToString all over the place unnecessarily. There's no need for that. There's also no need for the intermediate array of char. The following should be much faster, and faster than the version that uses String.Split, too.
string ConvertDashToCamelCase(string input)
{
    StringBuilder sb = new StringBuilder();
    bool caseFlag = false;
    for (int i = 0; i < input.Length; ++i)
    {
        char c = input[i];
        if (c == '-')
        {
            caseFlag = true;
        }
        else if (caseFlag)
        {
            sb.Append(char.ToUpper(c));
            caseFlag = false;
        }
        else
        {
            sb.Append(char.ToLower(c));
        }
    }
    return sb.ToString();
}
I'm not going to claim that the above is the fastest possible. In fact, there are several obvious optimizations that could save some time. But the above is clean and clear: easy to understand.
The key is the caseFlag, which you use to indicate that the next character copied should be set to upper case. Also note that I don't automatically convert the entire string to lower case. There's no reason to, since you'll be looking at every character anyway and can do the appropriate conversion at that time.
The idea here is that the code doesn't do any more work than it absolutely has to.
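As a rough illustration of the kind of small optimization mentioned above, one option is to pre-size the StringBuilder, since the result can never be longer than the input. This is only a sketch: the class name DashCase is hypothetical, and the switch to the culture-invariant char.ToUpperInvariant and char.ToLowerInvariant is my own assumption rather than part of the original answer. I have not measured whether it is actually faster.

```csharp
using System.Text;

public static class DashCase
{
    public static string ConvertDashToCamelCase(string input)
    {
        // Pre-size the buffer: removing dashes can only shrink the string.
        var sb = new StringBuilder(input.Length);
        bool caseFlag = false;
        foreach (char c in input)
        {
            if (c == '-')
            {
                // Remember that the next copied character must be upper case.
                caseFlag = true;
            }
            else
            {
                sb.Append(caseFlag ? char.ToUpperInvariant(c) : char.ToLowerInvariant(c));
                caseFlag = false;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, "foo-bar" becomes "fooBar" and "a-b-c" becomes "aBC", matching the rules in the question.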

For completeness, here's also a regular expression one-liner (inspired by this JavaScript answer):
string ConvertDashToCamelCase(string input) =>
    Regex.Replace(input, "-.", m => m.Value.ToUpper().Substring(1));
It replaces all occurrences of -x with x converted to upper case.
Special cases:
If you want to lower-case all other characters, replace input with input.ToLower() inside the expression:
string ConvertDashToCamelCase(string input) =>
    Regex.Replace(input.ToLower(), "-.", m => m.Value.ToUpper().Substring(1));
If you want to support multiple dashes between words (dash--case) and have all of the dashes removed (dashCase), replace - with -+ in the regular expression (to greedily match all sequences of dashes) and keep only the final character:
string ConvertDashToCamelCase(string input) =>
    Regex.Replace(input, "-+.", m => m.Value.ToUpper().Substring(m.Value.Length - 1));
If you want to support multiple dashes between words (dash--case) and remove only the final one (dash-Case), change the regular expression to match only a dash followed by a non-dash (rather than a dash followed by any character):
string ConvertDashToCamelCase(string input) =>
    Regex.Replace(input, "-[^-]", m => m.Value.ToUpper().Substring(1));
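To compare the variants side by side, here is a small sketch that wraps each expression in its own method. The class and method names (DashRegexVariants, Basic, Lowered, CollapseDashes, KeepExtraDashes) are made up for this illustration; the regular expressions are the ones shown above.

```csharp
using System.Text.RegularExpressions;

public static class DashRegexVariants
{
    // "-x" becomes "X"; everything else is left untouched.
    public static string Basic(string input) =>
        Regex.Replace(input, "-.", m => m.Value.ToUpper().Substring(1));

    // Lower-cases the whole input first.
    public static string Lowered(string input) =>
        Regex.Replace(input.ToLower(), "-.", m => m.Value.ToUpper().Substring(1));

    // A run of dashes plus the next character becomes that character in upper case.
    public static string CollapseDashes(string input) =>
        Regex.Replace(input, "-+.", m => m.Value.ToUpper().Substring(m.Value.Length - 1));

    // Only a dash directly followed by a non-dash is consumed.
    public static string KeepExtraDashes(string input) =>
        Regex.Replace(input, "-[^-]", m => m.Value.ToUpper().Substring(1));
}
```

For the input "dash--case", Basic gives "dash-case" (the first dash consumes the second), CollapseDashes gives "dashCase", and KeepExtraDashes gives "dash-Case". For "Foo-BAR", Lowered gives "fooBar".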

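string ConvertDashToCamelCase(string input)
{
    string[] words = input.Split('-');
    // Lower-case the first word; capitalize the first letter of every later word
    var converted = words.Select((word, index) =>
        index == 0 ? word.ToLower() : wordToCamelCase(word));
    return string.Join("", converted);
}
string wordToCamelCase(string input)
{
    // Consecutive dashes produce empty entries; return them unchanged
    if (input.Length == 0) return input;
    return char.ToUpper(input[0]) + input.Substring(1).ToLower();
}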

Here is an updated version of @Jim Mischel's answer that will ignore the content - i.e. it will only camelCase tag names.
string ConvertDashToCamelCase(string input)
{
    StringBuilder sb = new StringBuilder();
    bool caseFlag = false;
    bool tagFlag = false;
    for(int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        if(tagFlag)
        {
            if (c == '-')
            {
                caseFlag = true;
            }
            else if (caseFlag)
            {
                sb.Append(char.ToUpper(c));
                caseFlag = false;
            }
            else
            {
                sb.Append(char.ToLower(c));
            }
        }
        else
        {
            sb.Append(c);
        }
        // Reset tag flag if necessary
        if(c == '>' || c == '<')
        {
            tagFlag = (c == '<');
        }
    }
    return sb.ToString();
}
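Since the input is XML, another option is to let an XML parser find the tags and rename only the element names, which leaves text content and attributes alone. The sketch below uses LINQ to XML (System.Xml.Linq) and is not taken from any of the answers here: the class name XmlTagRenamer and the method name CamelCaseTagNames are hypothetical, and it assumes the document fits in memory.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XmlTagRenamer
{
    public static string CamelCaseTagNames(string xml)
    {
        XElement root = XElement.Parse(xml);
        // Snapshot the elements first, then rename each one in place.
        foreach (XElement element in root.DescendantsAndSelf().ToList())
        {
            string local = Regex.Replace(
                element.Name.LocalName.ToLowerInvariant(),
                "-.", m => m.Value.ToUpperInvariant().Substring(1));
            // Keep the element's namespace; only the local name changes.
            element.Name = element.Name.Namespace + local;
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

For example, "<foo-bar><a-b-c>Some-Text</a-b-c></foo-bar>" should come back as "<fooBar><aBC>Some-Text</aBC></fooBar>", with the dash inside the text left as it is.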

using System;
using System.Text;
public class MyString
{
    public static string ToCamelCase(string str)
    {
        char[] s = str.ToCharArray();
        StringBuilder sb = new StringBuilder();
        for(int i = 0; i < s.Length; i++)
        {
            if (s[i] == '-' || s[i] == '_')
                sb.Append(Char.ToUpper(s[++i]));
            else
                sb.Append(s[i]);
        }
        return sb.ToString();
    }
}

How to separate 1 string into multiple strings [duplicate]

This question already has answers here:
.NET - How can you split a "caps" delimited string into an array?
(19 answers)
Split a PascalCase string into separate words
(10 answers)
Regular expression, split string by capital letter but ignore TLA
(7 answers)
Closed 9 years ago.
How do I convert "ThisIsMyTestString" into "This Is My Test String" using C#?
Is there a fast way to do it?
I've sketched the following, but it's complicated and ugly:
String s = "ThisIsMyTestString";
List<String> strList = new List<String>();
int i = 0;
while (i < s.Length)
{
    String tmp = "";
    // Take the leading upper-case letter of the word, if any
    if (Char.IsUpper(s[i]))
    {
        tmp += s[i];
        i++;
    }
    // Then take everything up to the next upper-case letter
    while (i < s.Length && !Char.IsUpper(s[i]))
    {
        tmp += s[i];
        i++;
    }
    strList.Add(tmp);
}
String tmp2 = "";
for (int j = 0; j < strList.Count; j++)
{
    tmp2 += strList[j] + " ";
}

You can use Regex as outlined here:
Regular expression, split string by capital letter but ignore TLA
Your regex: "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))"
Find and replace with " $1"
string splitString = Regex.Replace("ThisIsMyTestString", "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))", " $1").TrimStart();
Here (?<=...) is a "positive lookbehind", a regex that must precede the match. In this case the lookbehind is "characters 'a' through 'z'".
(?=...) is a similar construct with lookahead, where the match has to be followed by the regex-described string. In this case the lookahead is "characters 'a' through 'z'".
In both cases the match itself is a single capital letter 'A' through 'Z': one that is either preceded by 'a'-'z' or followed by 'a'-'z'. Replacing each match with " $1" puts a space in front of that capital letter. Because the first letter of the string also matches, the result starts with a space, which TrimStart removes.
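To see how the two alternatives treat a run of capitals (the "TLA" case), here is a small runnable sketch. The class name CaseSplitter and the method name SplitWords are hypothetical; the pattern is the one from this answer, and the TrimStart call is there because the first capital letter of the input also matches.

```csharp
using System.Text.RegularExpressions;

public static class CaseSplitter
{
    // Inserts a space before each capital letter that is preceded by a
    // lower-case letter or followed by a lower-case letter.
    public static string SplitWords(string input) =>
        Regex.Replace(input, "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))", " $1").TrimStart();
}
```

SplitWords("ThisIsMyTestString") gives "This Is My Test String", and SplitWords("MyXMLParser") gives "My XML Parser": the M and L inside "XML" have no lower-case neighbour on the relevant side, so no space is inserted before them.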

Not best code, but it works
String.Join("", s.Select(c => Char.IsUpper(c) ? " " + c : c.ToString())).Trim()

lazyberezovsky beat me with a much simpler solution... but this creates less garbage so I won't delete it.
static void Main(string[] args)
{
    Console.WriteLine(SplitByCase("ThisIsMyString"));
    Console.ReadLine();
}
static string SplitByCase(string str, bool upper = true)
{
    return String.Join(" ", SplitIntoWords(str, c => Char.IsUpper(c)));
}
static IEnumerable<String> SplitIntoWords(string str, Func<char, bool> splitter)
{
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.Length; i++)
    {
        sb.Append(str[i]);
        if (i + 1 == str.Length || splitter(str[i + 1]))
        {
            yield return sb.ToString();
            sb.Clear();
        }
    }
}

This will do it for that string:
String s = "ThisIsMyTestString";
StringBuilder result = new StringBuilder();
result.Append(s[0]);
for (int i = 1; i < s.Length; i++)
{
    if (char.IsUpper(s[i]) )
        result.Append(' ');
    result.Append(s[i]);
}
s = result.ToString();

regex/linq to replace consecutive characters with count

I have the following method (written in C#/.NET). The input text consists only of letters (no digits). The return value is another text in which groups of more than two consecutive identical characters are replaced with a single occurrence of the character preceded by the repetition count.
Ex.: aAAbbbcccc -> aAA3b4c
public static string Pack(string text)
{
    if (string.IsNullOrEmpty(text)) return text;
    StringBuilder sb = new StringBuilder(text.Length);
    char prevChar = text[0];
    int prevCharCount = 1;
    for (int i = 1; i < text.Length; i++)
    {
        char c = text[i];
        if (c == prevChar) prevCharCount++;
        else
        {
            if (prevCharCount > 2) sb.Append(prevCharCount);
            else if (prevCharCount == 2) sb.Append(prevChar);
            sb.Append(prevChar);
            prevChar = c;
            prevCharCount = 1;
        }
    }
    if (prevCharCount > 2) sb.Append(prevCharCount);
    else if (prevCharCount == 2) sb.Append(prevChar);
    sb.Append(prevChar);
    return sb.ToString();
}
The method is not too long, but does anyone have an idea how to do this more concisely using regex or LINQ?

How about:
static readonly Regex re = new Regex(@"(\w)(\1){2,}", RegexOptions.Compiled);
static void Main() {
    string result = re.Replace("aAAbbbcccc",
        match => match.Length.ToString() + match.Value[0]);
}
The regex is a word char, followed by the same char (back-reference) at least twice; the lambda takes the length of the match (match.Length) and appends the first character (match.Value[0]).
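For a quick check against the original Pack method, the regex can be wrapped in a method with the same signature. This is just a sketch: the class name Packer and the field name Runs are hypothetical, and the null/empty guard is carried over from the question's method.

```csharp
using System.Text.RegularExpressions;

public static class Packer
{
    // A word character followed by at least two more copies of itself.
    private static readonly Regex Runs = new Regex(@"(\w)(\1){2,}", RegexOptions.Compiled);

    public static string Pack(string text) =>
        string.IsNullOrEmpty(text)
            ? text
            : Runs.Replace(text, m => m.Length.ToString() + m.Value[0]);
}
```

Packer.Pack("aAAbbbcccc") gives "aAA3b4c", as in the question's example; runs of one or two characters (such as "AA") never match and are copied through unchanged. Note that \w also matches digits and underscores, which is fine here only because the input is stated to contain letters only.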
