I'd like to remove 4-byte UTF-8 characters that start with \xF0 (the byte value 0xF0) from a string and tried
sText = Regex.Replace (sText, "\xF0...", "");
This doesn't work. Using two backslashes did not work either.
The exact input is the content of https://de.wikipedia.org/w/index.php?title=Spezial:Exportieren&action=submit&pages=Unicode The 4-byte character is the one after the text "[[Violinschlüssel]] ", in hex notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. The expected output is 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..
What's wrong?
Such characters will be surrogate pairs in .NET which uses UTF-16. Each of them will be two UTF-16 code units, that is two char values.
To just remove them, you can do (using System.Linq;):
sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));
(uses an overload of Concat introduced in .NET 4.0 (Visual Studio 2010)).
Late addition: It may give better performance to use:
sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());
even if it looks worse. (Works in .NET 3.5 (Visual Studio 2008).)
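For illustration, a minimal sketch of that approach (the input string here is just an example containing the 𝄞 character, U+1D11E, from the question):
using System;
using System.Linq;

class SurrogateRemovalDemo
{
    static void Main()
    {
        // "el]] 𝄞 " - the G clef (U+1D11E) is stored as the surrogate pair D834 DD1E,
        // i.e. two char values, so the string length is 8 rather than 7.
        string sText = "el]] \uD834\uDD1E ";
        Console.WriteLine(sText.Length);      // 8

        sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));
        Console.WriteLine(sText.Length);      // 6 - both halves of the pair are removed
        Console.WriteLine(sText == "el]]  "); // True (the two surrounding spaces remain)
    }
}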
You are trying to search for byte values but C# strings are made from char values. The C# language spec at section "2.4.4.4 Character literals" states:
A character literal represents a single character, and usually consists of a character in quotes, as in 'a'.
...
A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following \x.
Hence the search for "\xF0..." is searching for the character U+00F0, which is represented in UTF-8 by the bytes C3 B0.
If you want to find and replace all Unicode characters whose first UTF-8 byte is 0xF0, then I believe you need to search for the character values whose first byte is 0xF0.
The character U+10000 is represented as F0 90 80 80 (the preceding code point is U+FFFF, which is EF BF BF). The first code point whose encoding starts with F1 is U+40000, which is F1 80 80 80, and the value before it is U+3FFFF, which is F0 BF BF BF.
Hence you need to remove characters in the range U+10000 to U+3FFFF. Note that the \x escape in a .NET regular expression takes exactly two hex digits, so "\\x10000" cannot express such a code point; the range has to be written in terms of the UTF-16 surrogate pairs that represent it, for example
sText = Regex.Replace (sText, "[\\uD800-\\uD8BF][\\uDC00-\\uDFFF]", "");
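A quick check of those byte sequences and of the surrogate range used in the pattern (a small sketch, using only code points already discussed in this answer):
using System;
using System.Text;
using System.Text.RegularExpressions;

class Utf8RangeCheck
{
    static void Main()
    {
        // U+10000 encodes to F0 90 80 80 in UTF-8; U+3FFFF encodes to F0 BF BF BF.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(char.ConvertFromUtf32(0x10000)))); // F0-90-80-80
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(char.ConvertFromUtf32(0x3FFFF)))); // F0-BF-BF-BF

        // In UTF-16, U+10000 and U+3FFFF become the surrogate pairs D800 DC00 and D8BF DFFF,
        // which is where the character-class ranges in the regex come from.
        string clef = "\uD834\uDD1E"; // U+1D11E, the character from the question
        string cleaned = Regex.Replace("]] " + clef + " (", "[\\uD800-\\uD8BF][\\uDC00-\\uDFFF]", "");
        Console.WriteLine(cleaned);   // "]]  ("
    }
}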
The relevant characters from the source quoted in the question have been extracted into the code below. The code then shows how those characters are held in .NET strings.
static void Main(string[] args)
{
string input = "] 𝄞 (";
Console.Write("Input length {0} : '{1}' : ", input.Length, input);
foreach (char cc in input)
{
Console.Write(" {0,2:X02}", (int)cc);
}
Console.WriteLine();
}
The output from the program is as below. This supports the surrogate pair explanation given by @Jeppe in his answer.
Input length 6 : '] ?? (' : 5D 20 D834 DD1E 20 28
Related
Why, when I turn an INT value to bytes, then to ASCII and back, do I get another value?
Example:
var asciiStr = new string(Encoding.ASCII.GetChars(BitConverter.GetBytes(2000)));
var intVal = BitConverter.ToInt32(Encoding.ASCII.GetBytes(asciiStr), 0);
Console.WriteLine(intVal);
// Result: 1855
ASCII is only 7-bit - code points above 127 are unsupported. Unsupported characters are converted to ? per the docs on Encoding.ASCII:
The ASCIIEncoding object that is returned by this property might not have the appropriate behavior for your app. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.
So 2000 decimal = D0 07 00 00 hexadecimal (little endian) = [unsupported character] [BEL character] [NUL character] [NUL character] = ? [BEL character] [NUL character] [NUL character] = 3F 07 00 00 hexadecimal (little endian) = 1855 decimal.
TL;DR: Everything's fine. But you're a victim of character replacement.
We start with 2000. Let's acknowledge, first, that this number can be represented in hexadecimal as 0x000007d0.
BitConverter.GetBytes
BitConverter.GetBytes(2000) is an array of 4 bytes, because 2000 is a 32-bit integer literal. So the 32-bit integer representation, in little endian (least significant byte first), is given by the byte sequence { 0xd0, 0x07, 0x00, 0x00 }. In decimal, those same bytes are { 208, 7, 0, 0 }.
Encoding.ASCII.GetChars
Uh oh! Problem. Here's where things likely took an unexpected turn for you.
You're asking the system to interpret those bytes as ASCII-encoded data. The problem is that ASCII uses codes from 0-127. The byte with value 208 (0xd0) doesn't correspond to any character encodable by ASCII. So what actually happens?
When decoding ASCII, if it encounters a byte that is out of the range 0-127 then it decodes that byte to a replacement character and moves to the next byte. This replacement character is a question mark ?. So the 4 chars you get back from Encoding.ASCII.GetChars are ?, BEL (bell), NUL (null) and NUL (null).
BEL is the ASCII name of the character with code 7, which traditionally elicits a beep when presented on a capable terminal. NUL (code 0) is a null character traditionally used for representing the end of a string.
new string
Now you create a string from that array of chars. In C# a string is perfectly capable of containing NUL characters within its body, so your string will have two NUL chars in it. They can be represented in C# string literals with "\0", in case you want to try that yourself. A C# string literal that represents the string you have would be "?\a\0\0". Did you know that the BEL character can be represented with the escape sequence \a? Many people don't.
Encoding.ASCII.GetBytes
Now you begin the reverse journey. Your string is comprised entirely of characters in the ASCII range. The encoding of a question mark is code 63 (0x3F), the BEL is 7, and the NUL is 0, so the bytes are { 0x3f, 0x07, 0x00, 0x00 }. Surprised? Well, you're encoding a question mark now, where before you provided a 208 (0xd0) byte that was not representable with ASCII encoding.
BitConverter.ToInt32
Converting these four bytes back to a 32-bit integer gives the integer 0x0000073f, which, in decimal, is 1855.
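A short sketch that prints each intermediate value of the round trip (just the steps above restated as code):
using System;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        byte[] bytes = BitConverter.GetBytes(2000);
        Console.WriteLine(BitConverter.ToString(bytes));        // D0-07-00-00

        // 0xD0 is not valid ASCII, so the decoder substitutes '?' (0x3F).
        string asciiStr = new string(Encoding.ASCII.GetChars(bytes));
        Console.WriteLine(asciiStr == "?\a\0\0");               // True

        byte[] reencoded = Encoding.ASCII.GetBytes(asciiStr);
        Console.WriteLine(BitConverter.ToString(reencoded));    // 3F-07-00-00

        Console.WriteLine(BitConverter.ToInt32(reencoded, 0));  // 1855
    }
}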
String encoding (ASCII, UTF8, SHIFT_JIS, etc.) is designed to pigeonhole human language into a binary (byte) form. It isn't designed to store arbitrary binary data, such as the binary form of an integer.
While your binary data will be interpreted as a string, some of the information will be lost, meaning that storing binary data in this way will fail in the general case. You can see the point where this fails using the following code:
for (int i = 0; i < 255; ++i)
{
var byteData = new byte[] { (byte)i };
var stringData = System.Text.Encoding.ASCII.GetString(byteData);
var encodedAsBytes = System.Text.Encoding.ASCII.GetBytes(stringData);
Console.WriteLine("{0} vs {1}", i, (int)encodedAsBytes[0]);
}
Try it online
As you can see it starts off well, because all of those character codes correspond to ASCII characters, but once we get up in the numbers (i.e. 128 and beyond), we start to require more than 7 bits to store the binary value. At this point it ceases to be decoded correctly, and we start seeing 63 come back instead of the input value.
Ultimately you will have this problem encoding binary data using any string encoding. You need to choose an encoding method specifically meant for storing binary data as a string.
Two popular methods are:
Hexadecimal
Base64 using ToBase64String and FromBase64String
Hexadecimal example (using the hex methods here):
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to hex
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = ByteArrayToString(bytesValue);
Console.WriteLine("As hex: {0}", stringValue); // outputs D0070000
// Convert form hex to bytes and then to int
byte[] decodedBytesValue = StringToByteArray(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
Try it online
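ByteArrayToString and StringToByteArray come from the answer linked above ("the hex methods here") and are not shown in this post; a possible, hypothetical implementation, in case you want the example to be self-contained, looks like this:
using System;

static class HexHelpers
{
    // Hypothetical stand-ins for the linked hex helpers: byte[] <-> hex string.
    public static string ByteArrayToString(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", ""); // { 0xD0, 0x07, 0x00, 0x00 } -> "D0070000"
    }

    public static byte[] StringToByteArray(string hex)
    {
        byte[] bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        return bytes;
    }
}
With using static HexHelpers; (or by pasting the two methods into the class that holds the snippet) the calls above resolve as written.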
Base64 example:
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to base64
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = Convert.ToBase64String(bytesValue);
Console.WriteLine("As base64: {0}", stringValue); // outputs 0AcAAA==
// Convert form base64 to bytes and then to int
byte[] decodedBytesValue = Convert.FromBase64String(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
Try it online
P.S. If you simply wanted to convert your integer to a string (e.g. "2000"), you can just use .ToString():
int initialValue = 2000;
string stringValue = initialValue.ToString();
With respect to this tool, I need to convert hexadecimal data, irrespective of how it is grouped, to the equivalent text. For example:
"HelloWorld" = 48656c6c6f576f726c64;
The solution needs to take into account that hexadecimal can be grouped in different lengths:
48656c6c 6f576f72 6c64
or
48 65 6c 6c 6f 57 6f 72 6c 64
All of the hexadecimal values supplied above read as HelloWorld when converted to text.
First, I would like to point out that this question has been asked many times on the web (here is one example). However, I am going to break this down step by step for you to hopefully teach you how to not only utilize your resources available on the web, but also how to solve your problem.
Overview: Converting from hexadecimal data to text that is able to be read by human beings is a straight-forward process in modern development languages; you clean the data (ensuring no illegal characters remain), then you convert down to the byte level so that you can work with the raw data. Finally, you'll convert that raw data into readable text utilizing a method that has already been created by Microsoft.
Important: Remember, for the conversion to work, you have to ensure you're converting in the same format that you started with (a short sketch after this list illustrates such a mismatch):
ASCII -> ASCII: Works Great!
ASCII -> UTF7: Not so much...
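For instance, a minimal sketch of such a mismatch (it uses UTF-8 and ASCII rather than UTF-7, and the sample string is purely illustrative); bytes produced by one encoding and decoded with another come back mangled:
using System;
using System.Text;

class EncodingMismatchDemo
{
    static void Main()
    {
        // UTF-8 needs two bytes for 'é'; the ASCII decoder cannot handle either of them.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("café");

        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));  // café  (same encoding: round-trips fine)
        Console.WriteLine(Encoding.ASCII.GetString(utf8Bytes)); // caf?? (mismatched encoding: data is lost)
    }
}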
Removing Illegal Characters: One of the first things you'll need to do is ensure the hexadecimal value that you're supplying doesn't contain any illegal characters. The simplest way to do this is to create an array of acceptable characters and then remove anything but these in a loop:
private string GetCleanHex(string hex) {
    string legalCharacters = "0123456789ABCDEF";
    string result = hex.ToUpper();
    foreach (char c in result) {
        if (!legalCharacters.Contains(c))
            result = result.Replace(c.ToString(), string.Empty);
    }
    return result;
}
Getting The Byte Array: Once you've cleaned out all illegal characters, you can now convert your hexadecimal string into a byte array. This is required to convert from hexadecimal to ASCII. This step was provided by the linked post above:
private byte[] GetBytesFromHex(string hex) {
    byte[] bytes = new byte[hex.Length / 2];
    for (int i = 0; i < bytes.Length; i++)
        bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
    return bytes;
}
Converting To Text: Now that you've cleaned your data, and converted it to a byte[], you can now convert that byte data into ASCII. This can be done using a method available in Encoding.ASCII called GetString:
string text = Encoding.ASCII.GetString(bytes);
The Final Result: Plug all of this into your application and you'll have successfully converted hexadecimal data into clean, readable text:
string hex = GetCleanHex("506c 65 61736520 72 656164 20686f77 2074 6f 2061 73 6b 2e");
byte[] bytes = GetBytesFromHex(hex);
string text = Encoding.ASCII.GetString(bytes);
Console.WriteLine(text);
Console.ReadKey();
The code above will print the following text to the console:
Please read how to ask.
I'm in the process of creating a program that will scrub extended ASCII characters from text documents. I'm trying to understand how C# is interpreting the different character sets and codes, and am noticing some oddities.
Consider:
namespace ASCIITest
{
class Program
{
static void Main(string[] args)
{
string value = "Slide™1½”C4®";
byte[] asciiValue = Encoding.ASCII.GetBytes(value); // byte array
char[] array = value.ToCharArray(); // char array
Console.WriteLine("CHAR\tBYTE\tINT32");
for (int i = 0; i < array.Length; i++)
{
char letter = array[i];
byte byteValue = asciiValue[i];
Int32 int32Value = array[i];
//
Console.WriteLine("{0}\t{1}\t{2}", letter, byteValue, int32Value);
}
Console.ReadLine();
}
}
}
Output from program
CHAR BYTE INT32
S 83 83
l 108 108
i 105 105
d 100 100
e 101 101
T 63 8482 <- trademark symbol
1 49 49
½ 63 189 <- fraction
" 63 8221 <- smartquotes
C 67 67
4 52 52
r 63 174 <- registered trademark symbol
In particular, I'm trying to understand why the extended ASCII characters (the ones with my notes added to the right of the third column) show up with the correct value when cast as int32, but all show up as 63 when cast as the byte value. What's going on here?
Encoding.ASCII.GetBytes replaces every character outside of the ASCII range (0-127) with a question mark (code 63).
So since your string contains characters outside of that range, your asciiValue has ? in place of all the interesting symbols like ™ - its char (Unicode) representation is 8482, which is indeed outside the 0-127 range.
Converting the string to a char array does not modify the character values, so you still keep the original Unicode codes (char is essentially a 16-bit integer) - casting one to the longer integer type Int32 does not change the value.
Below are possible conversion of that character into byte/integers:
var value = "™";
var ascii = Encoding.ASCII.GetBytes(value)[0]; // 63(`?`) - outside 0-127 range
var castToByte = (byte)(value[0]); // 34 = 8482 % 256
var Int16 = (Int16)value[0]; // 8482
var Int32 = (Int32)value[0]; // 8482
Details available at ASCIIEncoding Class
ASCIIEncoding corresponds to the Windows code page 20127. Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
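Since your stated goal is to scrub the extended characters rather than just inspect them, one possible approach (a sketch, not the only way) is to filter on the char values directly instead of round-tripping through Encoding.ASCII:
using System;
using System.Linq;

class ScrubDemo
{
    static void Main()
    {
        string value = "Slide™1½”C4®";

        // Keep only characters in the 7-bit ASCII range (0-127).
        string scrubbed = new string(value.Where(c => c <= 127).ToArray());

        Console.WriteLine(scrubbed); // Slide1C4
    }
}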
I am trying to parse through the first three characters of a string.
public List<string> sortModes(List<string> allModesNonSorted)
{
foreach (string s in allModesNonSorted)
{
char firstNumber = s[0];
char secondNumber = s[1];
char thirdNumber = s[2];
char.IsDigit(firstNumber);
char.IsDigit(secondNumber);
char.IsDigit(thirdNumber);
combinedNumbers = Convert.ToInt16(firstNumber) + Convert.ToInt16(secondNumber) + Convert.ToInt16(thirdNumber);
}
return allModesNonSorted;
}
It recognizes each character correctly, but adds on an extra value 53 or 55. Below when I add the numbers, the 53 and 55 are included. Why is it doing this??
53 is the Unicode value of '5', and 55 is the Unicode value of '7'. It's showing you both the numeric and character versions of the data.
You'll notice with secondNumber you see the binary value 0 and the character value '\0' as well.
If you want to interpret a string as an integer, you can use
int myInteger = int.Parse(myString);
Specifically if you know you always have the format
input = "999 Hz Bla bla"
you can do something like:
int firstSeparator = input.IndexOf(' ');
string frequency = input.Substring(0, firstSeparator);
int numericFrequency = int.Parse(frequency);
That will work no matter how many digits are in the frequency as long as the digits are followed by a space character.
53 is the ASCII value for the character '5'
55 is the ASCII value for the character '7'
this is just Visual Studio showing you extra details about the actual values.
You can proceed with your code.
Because you're treating them as Characters.
the character '5' has the ASCII code 53.
the simplest solution is to just subtract the character '0' from each of them; that will give you the numeric value of a single digit character.
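A quick sketch of that digit trick (the input string is a hypothetical example in the "999 Hz Bla bla" shape used earlier in this thread):
using System;

class DigitDemo
{
    static void Main()
    {
        string s = "576 Hz Bla bla"; // hypothetical example input

        int first  = s[0] - '0';     // '5' (code 53) - '0' (code 48) = 5
        int second = s[1] - '0';     // '7' (code 55) - '0' (code 48) = 7
        int third  = s[2] - '0';     // '6' (code 54) - '0' (code 48) = 6

        Console.WriteLine(first * 100 + second * 10 + third); // 576
    }
}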
53 and 55 are the ASCII values of the '5' and '7' characters (the way the characters are stored in memory).
If you need to convert them to Integers, take a look at this SO post.
How can I convert
hex UTF-8 bytes E0 A4 A4 to the hex code point 0924?
ref: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=e0+a4+a4&mode=bytes
I need this because when I read Unicode data in C# it is treated as a single-byte sequence and 3 characters are displayed instead of 1, but I need the 3-byte sequence (read 3 bytes and display a single character). I tried many solutions but didn't get the result.
If I can display or store a 3-byte UTF-8 character then I don't need the conversion.
The scenario is like this:
string str=getivrresult();
In str I have a word in which each character is a 3-byte UTF-8 sequence.
Edited:
string str="à¤¤";
// I want it as "त" in str.
Character त
Character name DEVANAGARI LETTER TA
Hex code point 0924
Decimal code point 2340
Hex UTF-8 bytes E0 A4 A4
Octal UTF-8 bytes 340 244 244
UTF-8 bytes as Latin-1 characters bytes à ¤ ¤
Thank You.
Use the GetString method of the Encoding class:
byte[] data = { 0xE0, 0xA4, 0xA4 };
string str = Encoding.UTF8.GetString(data);
The string now contains one character with the character code 0x924.
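A quick way to confirm that (a small sketch printing the decoded character and its code point):
using System;
using System.Text;

class DevanagariDemo
{
    static void Main()
    {
        byte[] data = { 0xE0, 0xA4, 0xA4 };
        string str = Encoding.UTF8.GetString(data);

        Console.WriteLine(str);                           // त
        Console.WriteLine(str.Length);                    // 1
        Console.WriteLine(((int)str[0]).ToString("X4"));  // 0924
    }
}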
//utf-8 Single Byte Sequence input
string str = "à¤¤";
int i = 0;
byte[] data=new byte[3];
foreach (char c in str)
{
    // Each char of the mis-decoded string holds one of the original byte values (E0, A4, A4).
    string tmpstr = String.Format("{0:x2}", (int)c);
    data[i] = Convert.ToByte(int.Parse(tmpstr, System.Globalization.NumberStyles.HexNumber));
    i++;
}
// utf-8 3-byte sequence output: stp now contains "त".
string stp = Encoding.UTF8.GetString(data);