How to find wide characters from the given input string? - c#

How do I find wide characters in a given input string (English letters)?
I have a business requirement to accept a last name (English letters) with a maximum length of 12, where a wide character counts as length 2 and a normal character as length 1. The input box should limit the number of characters it accepts accordingly.

UPDATED:
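If you are talking about Asian full-width characters (like Japanese 全角, zenkaku), then here is one way. It relies on Shift_JIS encoding full-width characters as 2 bytes and half-width characters as 1 byte.
// requires: using System.Text;
// sjisEnc was not defined in the original snippet; Shift_JIS is assumed from its name.
// On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
private static readonly Encoding sjisEnc = Encoding.GetEncoding("Shift_JIS");
public static bool isZenkaku(string str)
{
    // true when every character in str takes 2 bytes in Shift_JIS
    int num = sjisEnc.GetByteCount(str);
    return num == str.Length * 2;
}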
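You would use it like this (the test string mixes half-width and full-width letters, symbols, and digits):
string test = "testTEST!+亜+123!123";
var widechars = test.Where(c => isZenkaku(c.ToString())).ToList(); // requires: using System.Linq;
foreach (var c in widechars)
{
    Console.WriteLine(c); // prints TEST!+亜123, one character per line
}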

I looked into this a while ago. The String class's Length property tells you the number of characters (UTF-16 Char values), not the number of bytes. You can check whether the Length of the string is greater than 12 and, if so, keep only the left 12 characters. Those 12 characters could occupy anything up to 24 bytes.
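The requirement in the question (wide characters count as 2, others as 1, maximum 12) could be sketched without any code-page encoding by checking a few Unicode ranges directly. This is only a rough sketch: the class and method names are hypothetical, and the range list is an assumption covering common CJK and full-width blocks, not the complete East Asian Width table.

```csharp
using System;

public static class WidthLimiter
{
    // Assumption: these ranges approximate "wide" characters; the list is not exhaustive.
    public static bool IsWide(char c) =>
        (c >= '\u1100' && c <= '\u115F') ||   // Hangul Jamo
        (c >= '\u2E80' && c <= '\uA4CF') ||   // CJK radicals through Yi
        (c >= '\uAC00' && c <= '\uD7A3') ||   // Hangul syllables
        (c >= '\uF900' && c <= '\uFAFF') ||   // CJK compatibility ideographs
        (c >= '\uFF01' && c <= '\uFF60') ||   // full-width ASCII variants
        (c >= '\uFFE0' && c <= '\uFFE6');     // full-width currency and signs

    // Sums 2 for each wide Char and 1 for every other Char.
    public static int DisplayWidth(string s)
    {
        int width = 0;
        foreach (char c in s) width += IsWide(c) ? 2 : 1;
        return width;
    }

    // Returns the longest prefix whose display width does not exceed maxWidth.
    public static string TruncateToWidth(string s, int maxWidth)
    {
        int width = 0;
        for (int i = 0; i < s.Length; i++)
        {
            width += IsWide(s[i]) ? 2 : 1;
            if (width > maxWidth)
            {
                // Do not cut between the two halves of a surrogate pair.
                if (char.IsLowSurrogate(s[i]) && i > 0 && char.IsHighSurrogate(s[i - 1])) i--;
                return s.Substring(0, i);
            }
        }
        return s;
    }
}
```

An input handler could then reject or trim a last name with WidthLimiter.TruncateToWidth(name, 12). Note that a surrogate pair counts as 1 + 1 here, which happens to match the width of supplementary CJK ideographs.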

Related

Problem splitting Chinese characters using ToCharArray()

I am writing a C# program to split Chinese Character input like this
textbox input ="大大大"
expected output =
大
大
大
And the code is like this
string aa="大大大";
foreach (char c in aa.ToCharArray()){
Console.WriteLine(c);
}
It works fine for most of the characters.
However, for some characters such as "𧜏", I got the result like this
textbox input = 𧜏大
output =
口
口
大
It looks like the program fails to handle this character.
Is there a way to solve this?
TL;DR:
Don't use ToCharArray() when working with non-"ASCII" (or at least, non-Latin-style) text.
Instead, use TextElementEnumerator.
Here's why.
Explanation
In Unicode, the 𧜏 character has a code-point of U+2770F which is outside the range supported by a single 16-bit UTF-16 value (i.e. 2 bytes, a single .NET Char value), so UTF-16 uses a pair of separate 16-bit values known as a surrogate pair to represent it:
using System.Globalization; // for UnicodeCategory
using Shouldly;
String input = "𧜏";
Char[] chars = input.ToCharArray();
chars.Length.ShouldBe( 2 ); // 2*Char == 2*16-bits == 32 bits
Char.GetUnicodeCategory( chars[0] ).ShouldBe( UnicodeCategory.Surrogate );
Char.GetUnicodeCategory( chars[1] ).ShouldBe( UnicodeCategory.Surrogate );
Therefore, to meaningfully "split" a string like this, your program needs to be aware of surrogate pairs and must not split a pair up.
The code below is a simple program that extracts each Unicode code-point from a string and adds it to a list.
String input = "大𧜏大";
// Don't forget to Normalize!
input = input.Normalize();
List<UInt32> codepoints = new List<UInt32>( capacity: 3 );
for( Int32 i = 0; i < input.Length; i++ )
{
Char c = input[i];
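// Only combine a well-formed pair; a lone surrogate falls through to the else branch.
if( Char.IsHighSurrogate( c ) && i + 1 < input.Length && Char.IsLowSurrogate( input[i+1] ) )
{
Char former = c;
Char latter = input[i+1];
// The former (high) surrogate holds the upper 10 bits of (code-point - 0x10000), offset by 0xD800.
// The latter (low) surrogate holds the lower 10 bits, offset by 0xDC00.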
UInt32 hi = former;
UInt32 lo = latter;
UInt32 codepoint = ((hi - 0xD800) * 0x400) + (lo - 0xDC00) + 0x10000;
codepoints.Add( codepoint );
i += 1; // Skip the next char
}
else
{
codepoints.Add( c );
}
}
codepoints.Dump(); // Dump() is a LINQPad extension method
// [0] = 22823 == '大'
// [1] = 161551 == '𧜏'
// [2] = 22823 == '大'
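The surrogate arithmetic above can also be delegated to the framework. The sketch below uses char.IsSurrogatePair and char.ConvertToUtf32, both of which are part of System.Char; the class and method names are hypothetical. Note that ConvertToUtf32 throws an ArgumentException when it meets a lone surrogate, so this version assumes well-formed input.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Returns the Unicode code points of a well-formed UTF-16 string.
    public static List<int> Enumerate(string input)
    {
        var result = new List<int>();
        for (int i = 0; i < input.Length; i++)
        {
            // Check before converting, so we know whether two Char values are consumed.
            bool isPair = char.IsSurrogatePair(input, i);
            result.Add(char.ConvertToUtf32(input, i));
            if (isPair) i++; // skip the low surrogate
        }
        return result;
    }
}
```

Calling CodePoints.Enumerate on the three-character sample string yields the same three values as the manual loop.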
Note that when it comes to non-Latin-style alphabets, the concept of splitting a string up into discrete characters, glyphs, or graphemes is... complicated. In general, you don't want to split a string into discrete Char values (as shown above), but you shouldn't split it into code-points either. Instead, you'll want to split it into grapheme clusters: a visual grouping of related glyphs, each represented by its own code-point, which in turn may be a single .NET 16-bit Char value or a surrogate pair of Char values.
Fortunately, .NET has this functionality built into System.Globalization.TextElementEnumerator.
using System.Globalization;
String input = "大𧜏大".Normalize();
TextElementEnumerator iter = StringInfo.GetTextElementEnumerator( input );
while( iter.MoveNext() )
{
String graphemeCluster = iter.GetTextElement();
Console.WriteLine( graphemeCluster );
}
Gives me the expected output:
大
𧜏
大

How to convert a double value to string without rounding [duplicate]

This question already has answers here:
Truncate Two decimal places without rounding
I have this variable:
Double dou = 99.99;
I want to convert it to a string variable, and the string should be 99.9.
I can do it like this:
string str = String.Format("{0:0.#}", dou);
But the value I got is: 100 which is not 99.9.
So how could I implement that?
PS: This question has been marked as a duplicate. Yes, the two questions may share the same solution (although I think that's a workaround), but they come from different viewpoints.
For example, if there is another variable:
Double dou2 = 99.9999999;
I want to convert it to the string 99.999999. How should I do that? Like this:
Math.Truncate(1000000 * value) / 1000000;
But what if there are more digits after the decimal point?
You have to truncate at the second decimal position.
Double dou = 99.99;
// 99.99 * 10 = 999.9, truncated to 999, divided by 10 = 99.9
double douOneDecimal = System.Math.Truncate(dou * 10) / 10;
string str = String.Format("{0:0.0}", douOneDecimal);
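You can use the Floor method to round down (for negative numbers, Floor moves away from zero, so use Math.Truncate if you want to cut digits off instead):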
string str = (Math.Floor(dou * 10.0) / 10.0).ToString("0.0");
The format 0.0 means that it will show the decimal even if it is zero, e.g. 99.09 is formatted as 99.0 rather than 99.
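The multiply, truncate, and divide pattern can be generalized to any number of decimal places. The following is a minimal sketch, not a definitive implementation: the class and method names are hypothetical, and it assumes the value fits into a decimal, since the arithmetic is done in decimal to avoid binary floating-point artifacts when scaling.

```csharp
using System;
using System.Globalization;

public static class DecimalTruncation
{
    // Cuts off everything after the given number of decimal places, without rounding.
    public static string TruncateToString(double value, int decimals)
    {
        if (decimals < 0) throw new ArgumentOutOfRangeException(nameof(decimals));

        // Build 10^decimals as a decimal.
        decimal factor = 1m;
        for (int i = 0; i < decimals; i++) factor *= 10m;

        // Assumption: value is within the range of decimal; otherwise this cast throws OverflowException.
        decimal d = (decimal)value;
        decimal truncated = Math.Truncate(d * factor) / factor;

        // "F<n>" always prints exactly n decimal places.
        return truncated.ToString("F" + decimals, CultureInfo.InvariantCulture);
    }
}
```

For example, DecimalTruncation.TruncateToString(99.99, 1) gives "99.9" and DecimalTruncation.TruncateToString(99.9999999, 6) gives "99.999999".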
Update:
If you want to do this dynamically depending on the number of digits in the input, then you first have to decide how to determine how many digits there actually are in the input.
Double precision floating point numbers are not stored in decimal form, they are stored in binary form. That means that some numbers that you think have just a few digits actually have a lot. The number that you see as 1.1 is actually stored as approximately 1.100000000000000088817841970012523.
If you choose to use the number of digits that is shown when you format it into a string, then you would simply format it into a string and remove the last digit:
// requires: using System.Globalization;
// format number into a string, make sure it uses period as decimal separator
string str = dou.ToString(CultureInfo.InvariantCulture);
// find the decimal separator
int index = str.IndexOf('.');
// check if there is a fractional part
if (index != -1) {
// check if there are at least two fractional digits
if (index < str.Length - 2) {
// remove last digit
str = str.Substring(0, str.Length - 1);
} else {
// remove decimal separator and the fractional digit
str = str.Substring(0, index);
}
}

Splitting a string at certain differing character counts

I have a semi-complicated file full of lines. These lines can fall under one of two formats:
Person
Company
The specification for a company line is like so:
10 Characters = company identification, record specification and status of company
22 Characters = Blank spaces (filler)
8 Characters = number of employees and length of company name
Max of 161 characters = Company name + "<" delimiter
And the specification for a person:
12 Characters = Parent company number, appointment date and type
12 Characters = unique reference number
1 Character = corporate indicator
7 Characters = Blank spaces (filler)
16 Characters = confirmed appointment date and resignation date
8 Characters = Postcode
8 Characters = Date of Birth
4 Characters = length of variable data
Max of 1125 characters = Variable data delimited by "<"
First, I need to test the 11th character to determine the type of line. Roughly:
if (line[10] == ' ') // 11th character (zero-based index 10) is blank filler on a company line
{
    ItsACompany();
}
else
{
    ItsAPerson();
}
Then I need to do a custom count for every type of specification. So far, all I've found is a method that splits a string at every nth character and reads to the end of the string. That approach just repeats with the same n, which is not what I need.
I need an option that allows n to change per specification, and allows me to select all characters between char n and char y. Does such a thing exist?
To extract a block of text from a string you could write an extension method like this
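namespace StringExtension
{
    public static class MyExtensions
    {
        // Returns the characters from position x to position y (1-based, inclusive).
        public static string TakeBlock(this string input, int x, int y)
        {
            if (x < 1) x = 1;                       // clamp the start to the first character
            if (y > input.Length) y = input.Length; // clamp the end to the string length
            if (x > y) return string.Empty;         // nothing left to take
            return input.Substring(x - 1, y - x + 1);
        }
    }
}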
And then you could call it from your main code like this (supposing you are inside the method that extracts data for the person line):
string parentCompany = line.TakeBlock(1, 12);
string uniqueRef = line.TakeBlock(13,24);
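If you would rather describe a whole record as a list of field widths, a small splitter can walk the line once. This is a sketch under assumptions: the FixedWidth class and its method names are hypothetical, the widths come from the specification in the question, and whatever remains after the fixed fields is returned as one trailing field (the variable-length part delimited by "<").

```csharp
using System;
using System.Collections.Generic;

public static class FixedWidth
{
    // A company line has blank filler starting at the 11th character (index 10).
    public static bool IsCompany(string line) => line.Length > 10 && line[10] == ' ';

    // Cuts the line into consecutive fields of the given widths;
    // any remaining text is appended as a final field.
    public static List<string> Split(string line, params int[] widths)
    {
        var fields = new List<string>();
        int pos = 0;
        foreach (int width in widths)
        {
            if (pos >= line.Length) break;               // line is shorter than the specification
            int len = Math.Min(width, line.Length - pos);
            fields.Add(line.Substring(pos, len));
            pos += len;
        }
        if (pos < line.Length) fields.Add(line.Substring(pos)); // variable-length remainder
        return fields;
    }
}
```

A company line would then be read with FixedWidth.Split(line, 10, 22, 8) and a person line with FixedWidth.Split(line, 12, 12, 1, 7, 16, 8, 8, 4).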

Six digit unicode escaped value comparison

I have a six-digit Unicode character, for example U+100000, which I wish to compare with another char in my C# code.
My reading of the MSDN documentation is that this character cannot be represented by a char, and must instead be represented by a string.
a Unicode character in the range U+10000 to U+10FFFF is not permitted in a character literal and is represented using a Unicode surrogate pair in a string literal
I feel that I'm missing something obvious, but how can you get the following comparison to work correctly:
public bool IsCharLessThan(char myChar, string upperBound)
{
return myChar < upperBound; // will not compile as a char is not comparable to a string
}
Assert.IsTrue(AnExample('\u0066', "\u100000"));
Assert.IsFalse(AnExample("\u100000", "\u100000")); // again won't compile as this is a string and not a char
edit
OK, I think I need two methods, one to accept chars and another to accept 'big chars', i.e. strings. So:
public bool IsCharLessThan(char myChar, string upperBound)
{
return true; // every char is less than a BigChar
}
public bool IsCharLessThan(string myBigChar, string upperBound)
{
return string.Compare(myBigChar, upperBound) < 0;
}
Assert.IsTrue(AnExample('\u0066', "\u100000"));
Assert.IsFalse(AnExample("\u100022", "\u100000"));
To construct a string with the Unicode code point U+10FFFF using a string literal, you need to work out the surrogate pair involved.
In this case, you need:
string bigCharacter = "\uDBFF\uDFFF";
Or you can use char.ConvertFromUtf32:
string bigCharacter = char.ConvertFromUtf32(0x10FFFF);
It's not clear what you want your method to achieve, but if you need it to work with characters not in the BMP, you'll need to make it accept int instead of char, or a string.
As per the documentation for string, if you want to iterate over characters in a string as full Unicode values, use TextElementEnumerator or StringInfo.
Note that you do need to do this explicitly. If you just use ordinal values, it will check UTF-16 code units, not the UTF-32 code points. For example:
string text = "\uF000";
string upperBound = "\uDBFF\uDFFF";
Console.WriteLine(string.Compare(text, upperBound, StringComparison.Ordinal));
This prints out a value greater than zero, suggesting that text is greater than upperBound here. Instead, you should use char.ConvertToUtf32:
string text = "\uF000";
string upperBound = "\uDBFF\uDFFF";
int textUtf32 = char.ConvertToUtf32(text, 0);
int upperBoundUtf32 = char.ConvertToUtf32(upperBound, 0);
Console.WriteLine(textUtf32 < upperBoundUtf32); // True
So that's probably what you need to do in your method. You might want to use StringInfo.LengthInTextElements to check that the strings really are single UTF-32 code points first.
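Putting those pieces together, the comparison from the question might look like the sketch below. The class and method names are hypothetical; char.IsSurrogatePair and char.ConvertToUtf32 are the real System.Char methods. It assumes each string argument holds exactly one code point and rejects anything else, using a simple length check in place of StringInfo.LengthInTextElements.

```csharp
using System;

public static class CodePointComparer
{
    // Converts a string holding exactly one code point (one Char or one surrogate pair) to its value.
    public static int ToCodePoint(string s)
    {
        if (string.IsNullOrEmpty(s))
            throw new ArgumentException("Expected exactly one code point.", nameof(s));

        int expectedLength = char.IsSurrogatePair(s, 0) ? 2 : 1;
        if (s.Length != expectedLength)
            throw new ArgumentException("Expected exactly one code point.", nameof(s));

        // Throws ArgumentException if s is a lone surrogate.
        return char.ConvertToUtf32(s, 0);
    }

    // Compares by code point, not by UTF-16 code unit.
    public static bool IsLessThan(string value, string upperBound) =>
        ToCodePoint(value) < ToCodePoint(upperBound);

    // Overload for a single BMP char.
    public static bool IsLessThan(char value, string upperBound)
    {
        if (char.IsSurrogate(value))
            throw new ArgumentException("A lone surrogate is not a code point.", nameof(value));
        return value < ToCodePoint(upperBound);
    }
}
```

Note that the string literals passed in need the eight-digit \U escape (or a surrogate pair) for code points above U+FFFF.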
From https://msdn.microsoft.com/library/aa664669.aspx, you have to use \U with full 8 hex digits. So for example:
string str1 = "\U0001F300";
string str2 = "\uD83C\uDF00";
bool eq = str1 == str2;
using the cyclone emoji (U+1F300). Here eq is true, because both literals produce the same surrogate pair.

In C#, how to use String.Format for a number and pad left with zeros so it's always 6 chars

I want to use c# format to do this:
6 = "000006"
999999 = "999999"
100 = "000100"
-72 = error
1000000 = error
I was trying to use String.Format but without success.
Formatting will not produce an error if there are too many digits. You can achieve a left-padded 6 digit string just with
string output = number.ToString("000000");
If you need 7 digit strings to be invalid, you'll just need to code that.
if (number >= 0 && number < 1000000)
{
    output = number.ToString("000000");
}
else
{
    // the first argument of ArgumentOutOfRangeException is the parameter name
    throw new ArgumentOutOfRangeException(nameof(number));
}
To use string.Format, you would write
string output = string.Format("{0:000000}", number);
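The range check and the padding can be wrapped in one helper. This is a minimal sketch: the class and method names are hypothetical, and it assumes the input is an int. It uses the standard "D6" numeric format specifier, which pads integers with leading zeros to at least six digits.

```csharp
using System;
using System.Globalization;

public static class SixDigitFormatter
{
    // Returns the number left-padded with zeros to exactly 6 digits,
    // or throws when the value cannot be shown in 6 digits.
    public static string Format(int number)
    {
        if (number < 0 || number > 999999)
            throw new ArgumentOutOfRangeException(nameof(number), number, "Expected a value from 0 to 999999.");

        return number.ToString("D6", CultureInfo.InvariantCulture);
    }
}
```

With this, SixDigitFormatter.Format(6) returns "000006" and SixDigitFormatter.Format(100) returns "000100", while -72 and 1000000 throw.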
