Splitting Chinese characters using ToCharArray() - C#

I am writing a C# program to split Chinese character input like this:
textbox input = "大大大"
expected output =
大
大
大
And the code is like this:
string aa = "大大大";
foreach (char c in aa.ToCharArray())
{
    Console.WriteLine(c);
}
It works fine for most characters.
However, for some characters such as "𧜏", I get a result like this:
textbox input = 𧜏大
output =
口
口
大
It looks like the program fails to handle this character.
Is there any way to solve this?

TL;DR:
Don't use ToCharArray() when working with non-"ASCII" (or at least, non-Latin-style) text.
Instead, use TextElementEnumerator.
Here's why.
Explanation
In Unicode, the 𧜏 character has a code-point of U+2770F which is outside the range supported by a single 16-bit UTF-16 value (i.e. 2 bytes, a single .NET Char value), so UTF-16 uses a pair of separate 16-bit values known as a surrogate pair to represent it:
using System.Globalization;
using Shouldly;

String input = "𧜏";
Char[] chars = input.ToCharArray();
chars.Length.ShouldBe( 2 ); // 2 * Char == 2 * 16-bits == 32 bits
Char.GetUnicodeCategory( chars[0] ).ShouldBe( UnicodeCategory.Surrogate );
Char.GetUnicodeCategory( chars[1] ).ShouldBe( UnicodeCategory.Surrogate );
Therefore, to meaningfully "split" a string like this, your program needs to be aware of surrogate pairs and must not split a pair up.
The code below is a simple program that extracts each Unicode code-point from a string and adds it to a list (it assumes well-formed UTF-16 input).
String input = "大𧜏大";

// Don't forget to Normalize!
input = input.Normalize();

List<UInt32> codepoints = new List<UInt32>( capacity: 3 );
for( Int32 i = 0; i < input.Length; i++ )
{
    Char c = input[i];
    if( Char.GetUnicodeCategory( c ) == UnicodeCategory.Surrogate )
    {
        Char former = c;
        Char latter = input[i+1];

        // The former (high) surrogate holds the upper 10 bits of the code-point's offset from 0x10000.
        // The latter (low) surrogate holds the lower 10 bits.
        UInt32 hi = former;
        UInt32 lo = latter;
        UInt32 codepoint = ((hi - 0xD800) * 0x400) + (lo - 0xDC00) + 0x10000;
        codepoints.Add( codepoint );

        i += 1; // Skip the next char: it's the low surrogate we just consumed.
    }
    else
    {
        codepoints.Add( c );
    }
}

codepoints.Dump(); // LINQPad's Dump(); use foreach + Console.WriteLine elsewhere.
// [0] = 22823  == '大'
// [1] = 161551 == '𧜏'
// [2] = 22823  == '大'
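As a worked example, 𧜏 is stored as the surrogate pair 0xD85D 0xDF0F, and the formula above recovers its code-point:
((0xD85D - 0xD800) * 0x400) + (0xDF0F - 0xDC00) + 0x10000
= (0x5D * 0x400) + 0x30F + 0x10000
= 0x17400 + 0x30F + 0x10000
= 0x2770F == 161551 == '𧜏'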
Note that when it comes to non-Latin-style alphabets, the concept of splitting a string up into discrete characters, glyphs, or graphemes is... complicated. In general, you don't want to split a string up into discrete Char values (as demonstrated above), but you shouldn't split it into code-points either. Instead, you'll want to split the string into grapheme clusters: a visual grouping of related glyphs, each represented by their own code-points, which in turn may be a single 16-bit .NET Char value or a surrogate pair of Char values.
Fortunately .NET has this functionality built in: System.Globalization.TextElementEnumerator.
using System.Globalization;

String input = "大𧜏大".Normalize();

TextElementEnumerator iter = StringInfo.GetTextElementEnumerator( input );
while( iter.MoveNext() )
{
    String graphemeCluster = iter.GetTextElement();
    Console.WriteLine( graphemeCluster );
}
Gives me the expected output:
大
𧜏
大
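As an aside: on .NET Core 3.0 and later there is also System.Text.Rune, which enumerates code-points directly (note: code-points, not grapheme clusters). A minimal sketch, assuming a modern runtime:
using System.Text;

String input = "大𧜏大".Normalize();

// Rune represents a whole Unicode scalar value, so surrogate pairs are decoded for you.
foreach( Rune r in input.EnumerateRunes() )
{
    Console.WriteLine( r.ToString() ); // 大, then 𧜏, then 大
}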

Related

Convert single-character string to char

I need to convert a single-character string to its ASCII code, similar to Visual Basic's Asc("a").
I need to do it in C#, with something similar to ToCharArray():
("a").ToCharArray()
returns
{char[1]}
[0]: 97 'a'
I need to have 97 alone.
A string can be indexed like an array of char, so you can get the first character using array indexing syntax; a char, used as an int (an implicit conversion), gives you the character's code value (the ASCII value, for characters in the ASCII range).
Try:
int num = "a"[0]; // num will be 97
// Which is the same as using a char directly to get the int value:
int num = 'a'; // num will be 97
What you're seeing that seems to be causing some confusion is how the char type is represented in the debugger: both the character and the int value are shown.
Here's an example of an int and a char in the debugger as well as in the console window (which is their ToString() representation):
int num = "a"[0];
char chr = "a"[0];
Console.WriteLine($"num = {num}");
Console.WriteLine($"chr = {chr}");
If you want to convert a single-character string to a char, do this:
char.Parse("a");
If you want to get the char code, do this:
char.ConvertToUtf32("a", 0); // returns 97
char chrReadLetter;
chrReadLetter = (char)char.ConvertToUtf32(txtTextBox1.Text.Substring(0, 1), 0);
Reads the first letter of the textbox into a character variable.
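Note that char.ConvertToUtf32 also handles characters outside the Basic Multilingual Plane, which indexing alone does not (this ties back to the surrogate-pair issue in the first question); a quick sketch:
// "𧜏" is stored as a surrogate pair, so [0] alone yields only the high surrogate.
int cp = char.ConvertToUtf32("𧜏", 0);
Console.WriteLine(cp);           // 161551 (U+2770F)
Console.WriteLine((int)"𧜏"[0]); // 55389 (0xD85D, just the high surrogate)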

Programming with C# console application

Convert.ToString((input.Split(' ').Length + 1), 2).PadLeft(8, '0')
Could anybody explain this line for me?
It takes an input string (input), splits it on the space character (input.Split(' ')) (presumably to get the number of "words"), adds 1 to the .Length of the resulting array (not sure why), converts that number to a binary string (Convert.ToString(int, 2) will convert the int to a base-2 number and return it as a string), then pads the left side of the string with the 0 character until it's 8 characters long (.PadLeft(8, '0')).
My guess is that this might be used in some kind of encoding/decoding algorithm(?).
Here it is in action:
var inputStrings = new List<string>
{
    "one",
    "two words",
    "this is three",
    "this one is four",
    "and this one has five"
};

foreach (var input in inputStrings)
{
    var result = Convert.ToString((input.Split(' ').Length + 1), 2).PadLeft(8, '0');
    Console.WriteLine($"{input.PadRight(22, ' ')} = {result}");
}

Console.Write("\nDone!\nPress any key to exit...");
Console.ReadKey();
Output
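Running the loop above prints (word count + 1, as 8-digit binary):
one                    = 00000010
two words              = 00000011
this is three          = 00000100
this one is four       = 00000101
and this one has five  = 00000110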
Split it up:
var stringItems = input.Split(' ');
Split the input string on spaces
int itemCount = stringItems.Length + 1;
However many items are in the collection, add one to that
var str = Convert.ToString(itemCount, 2);
Call the overload of Convert.ToString that takes two ints as parameters. Consulting the documentation,
it turns out that it:
Converts the value of a 32-bit signed integer to its equivalent string representation in a specified base
So we have a string from a 32 bit integer in base 2.
str.PadLeft(8, '0')
Make sure the string is 8 characters long in total, padding on the left with '0' characters.
Looks like we created a nicely padded binary number, though I have no idea what it means without more context.
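If you ever need to reverse that formatting, Convert.ToInt32 accepts the same base argument; a minimal sketch:
string binary = "00000100";
int value = Convert.ToInt32(binary, 2); // 4 -- leading zeros are ignored
Console.WriteLine(value);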

Six digit unicode escaped value comparison

I have a six-digit Unicode character, for example U+100000, which I wish to compare with another char in my C# code.
My reading of the MSDN documentation is that this character cannot be represented by a char, and must instead be represented by a string:
a Unicode character in the range U+10000 to U+10FFFF is not permitted in a character literal and is represented using a Unicode surrogate pair in a string literal
I feel that I'm missing something obvious, but how can you get the following comparison to work correctly:
public bool IsCharLessThan(char myChar, string upperBound)
{
    return myChar < upperBound; // will not compile as a char is not comparable to a string
}
Assert.IsTrue(AnExample('\u0066', "\u100000"));
Assert.IsFalse(AnExample("\u100000", "\u100000")); // again won't compile as this is a string and not a char
edit
OK, I think I need two methods: one to accept chars and another to accept 'big chars', i.e. strings. So:
public bool IsCharLessThan(char myChar, string upperBound)
{
    return true; // every char is less than a BigChar
}

public bool IsCharLessThan(string myBigChar, string upperBound)
{
    return string.Compare(myBigChar, upperBound) < 0;
}
Assert.IsTrue(AnExample('\u0066', "\u100000"));
Assert.IsFalse(AnExample("\u100022", "\u100000"));
To construct a string with the Unicode code point U+10FFFF using a string literal, you need to work out the surrogate pair involved.
In this case, you need:
string bigCharacter = "\uDBFF\uDFFF";
Or you can use char.ConvertFromUtf32:
string bigCharacter = char.ConvertFromUtf32(0x10FFFF);
It's not clear what you want your method to achieve, but if you need it to work with characters not in the BMP, you'll need to make it accept int instead of char, or a string.
As per the documentation for string, if you want to iterate over characters in a string as full Unicode values, use TextElementEnumerator or StringInfo.
Note that you do need to do this explicitly. If you just use ordinal values, it will check UTF-16 code units, not the UTF-32 code points. For example:
string text = "\uF000";
string upperBound = "\uDBFF\uDFFF";
Console.WriteLine(string.Compare(text, upperBound, StringComparison.Ordinal));
This prints out a value greater than zero, suggesting that text is greater than upperBound here. Instead, you should use char.ConvertToUtf32:
string text = "\uF000";
string upperBound = "\uDBFF\uDFFF";
int textUtf32 = char.ConvertToUtf32(text, 0);
int upperBoundUtf32 = char.ConvertToUtf32(upperBound, 0);
Console.WriteLine(textUtf32 < upperBoundUtf32); // True
So that's probably what you need to do in your method. You might want to use StringInfo.LengthInTextElements to check that the strings really are single UTF-32 code points first.
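Putting that together, a sketch of such a method, assuming each argument holds exactly one code-point (the method name and the guard are illustrative):
using System.Globalization;

public static bool IsCodePointLessThan(string value, string upperBound)
{
    // Guard: each argument should be a single text element (one code-point for this use case).
    if (new StringInfo(value).LengthInTextElements != 1 ||
        new StringInfo(upperBound).LengthInTextElements != 1)
    {
        throw new ArgumentException("Expected a single code-point per argument.");
    }

    // ConvertToUtf32 decodes a surrogate pair (or a single BMP char) at the given index.
    return char.ConvertToUtf32(value, 0) < char.ConvertToUtf32(upperBound, 0);
}

// Usage:
Console.WriteLine(IsCodePointLessThan("\u0066", "\U00100000")); // True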
From https://msdn.microsoft.com/library/aa664669.aspx, you have to use \U with full 8 hex digits. So for example:
string str1 = "\U0001F300";
string str2 = "\uD83C\uDF00";
bool eq = str1 == str2;
using the 🌀 (U+1F300 CYCLONE) emoji.

Convert binary string to ASCII

I've just started developing an app for WP7 and I'm trying to convert a string of binary back to ASCII using C#,
but I have no idea how it can be done.
Hope someone can help me out here.
example :
Input string: 0110100001100101011011000110110001101111
Output string: hello
The simple solution, using Substring and the built-in Convert.ToByte, could look like this:
using System.Linq;
using System.Text;

string input = "0110100001100101011011000110110001101111";

int charCount = input.Length / 8;
var bytes = from idx in Enumerable.Range(0, charCount)
            let str = input.Substring(idx * 8, 8)
            select Convert.ToByte(str, 2);

string result = Encoding.ASCII.GetString(bytes.ToArray());
Console.WriteLine(result); // hello
Another solution, doing the calculations yourself:
I added this in case you wanted to know how the calculations should be performed, rather than which method in the framework does it for you:
string input = "0110100001100101011011000110110001101111";
var chars = input.Select((ch,idx) => new { ch, idx});
var parts = from x in chars
group x by x.idx / 8 into g
select g.Select(x => x.ch).ToArray();
var bytes = parts.Select(BitCharsToByte).ToArray();
Console.WriteLine(Encoding.ASCII.GetString(bytes));
Where BitCharsToByte does the conversion from a char[] to the corresponding byte:
byte BitCharsToByte(char[] bits)
{
    int result = 0;
    int m = 1;

    // Walk from the least-significant (right-most) bit, doubling the place value each step.
    for (int i = bits.Length - 1; i >= 0; i--)
    {
        result += m * (bits[i] - '0');
        m *= 2;
    }

    return (byte)result;
}
Both of the above solutions do basically the same thing: first group the characters into groups of 8, then compute the byte value each group of bits represents, and finally use the ASCII encoding to convert those bytes to a string.
You can use the BitArray class and its CopyTo method to copy your bit string to a byte array.
Then you can convert your byte array to a string using Text.Encoding.UTF8.GetString(Byte[]).
See the BitArray documentation on MSDN.
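A sketch of that approach, with one caveat: BitArray maps index 0 to the least-significant bit of byte 0, so each 8-bit group has to be reversed before copying:
using System.Collections;
using System.Text;

string input = "0110100001100101011011000110110001101111";

// Build the bool[] with each 8-bit group reversed (MSB-first text -> LSB-first BitArray).
bool[] bits = new bool[input.Length];
for (int i = 0; i < input.Length; i++)
{
    int byteIndex = i / 8;
    int bitIndex = 7 - (i % 8);
    bits[byteIndex * 8 + bitIndex] = input[i] == '1';
}

var bitArray = new BitArray(bits);
byte[] bytes = new byte[input.Length / 8];
bitArray.CopyTo(bytes, 0);

Console.WriteLine(Encoding.ASCII.GetString(bytes)); // hello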

How to find wide characters from the given input string?

How do you find wide characters in a given input string (English letters)?
I have a business requirement to accept a last name (English letters) with a maximum length of 12, counting each wide character as length 2 and each normal character as length 1. Based on that, the input box should limit the number of characters it accepts.
UPDATED:
If you are talking about Asian characters (like Japanese 全角, i.e. full-width characters) then here is one way:
static readonly Encoding sjisEnc = Encoding.GetEncoding("Shift_JIS");

public static bool isZenkaku(string str)
{
    // Full-width characters occupy 2 bytes in Shift_JIS; half-width characters occupy 1.
    int num = sjisEnc.GetByteCount(str);
    return num == str.Length * 2;
}
You would use it like this:
string test = "testＴＥＳＴ！＋亜+123!１２３"; // mixes half-width and full-width characters
var widechars = test.Where(c => isZenkaku(c.ToString())).ToList();

foreach (var c in widechars)
{
    Console.WriteLine(c); // result is ＴＥＳＴ！＋亜１２３
}
I was looking into this a while ago; the String class's Length property tells you the number of characters, not the number of bytes. You can do something where, when the Length of the string is greater than 12, you return the left 12 characters. There could be anything up to 24 bytes in the string.
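For the stated requirement (a maximum display length of 12, where wide characters count as 2), a minimal sketch building on the isZenkaku helper above; the helper names and the truncation rule are assumptions:
using System.Text;

// Width of one character: 2 if full-width, otherwise 1.
static int CharWidth(char c) => isZenkaku(c.ToString()) ? 2 : 1;

// Keep appending characters until the running width would exceed maxWidth.
static string TruncateToWidth(string s, int maxWidth)
{
    int width = 0;
    var sb = new StringBuilder();

    foreach (char c in s)
    {
        width += CharWidth(c);
        if (width > maxWidth) break;
        sb.Append(c);
    }

    return sb.ToString();
}

// Usage: limit a last name to a display width of 12 (txtLastName is illustrative).
string lastName = TruncateToWidth(txtLastName.Text, 12);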
