I'm in the process of creating a program that will scrub extended ASCII characters from text documents. I'm trying to understand how C# is interpreting the different character sets and codes, and am noticing some oddities.
Consider:
using System;
using System.Text;

namespace ASCIITest
{
    class Program
    {
        static void Main(string[] args)
        {
            string value = "Slide™1½”C4®";
            byte[] asciiValue = Encoding.ASCII.GetBytes(value); // byte array
            char[] array = value.ToCharArray(); // char array
            Console.WriteLine("CHAR\tBYTE\tINT32");
            for (int i = 0; i < array.Length; i++)
            {
                char letter = array[i];
                byte byteValue = asciiValue[i];
                Int32 int32Value = array[i];
                Console.WriteLine("{0}\t{1}\t{2}", letter, byteValue, int32Value);
            }
            Console.ReadLine();
        }
    }
}
Output from program
CHAR BYTE INT32
S 83 83
l 108 108
i 105 105
d 100 100
e 101 101
T 63 8482 <- trademark symbol
1 49 49
½ 63 189 <- fraction
" 63 8221 <- smartquotes
C 67 67
4 52 52
r 63 174 <- registered trademark symbol
In particular, I'm trying to understand why the extended ASCII characters (the ones with my notes added to the right of the third column) show up with the correct value when cast as int32, but all show up as 63 when cast as the byte value. What's going on here?
The ASCII.GetBytes conversion replaces every character outside the ASCII range (0-127) with a question mark (code 63).
Since your string contains characters outside that range, your asciiValue has ? in place of all the interesting symbols. Take ™: its char (Unicode) representation is 8482, which is indeed outside the 0-127 range.
Converting the string to a char array does not modify the character values, so you still have the original Unicode codes. A char is essentially a 16-bit unsigned integer, and casting it to the wider integer type Int32 does not change the value.
Below are possible conversion of that character into byte/integers:
var value = "™";
var ascii = Encoding.ASCII.GetBytes(value)[0]; // 63 (`?`) - outside 0-127 range
var castToByte = (byte)(value[0]); // 34 = 8482 % 256
var asInt16 = (Int16)value[0]; // 8482
var asInt32 = (Int32)value[0]; // 8482
Details available at ASCIIEncoding Class
ASCIIEncoding corresponds to the Windows code page 20127. Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
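For the original goal of scrubbing such characters, a minimal sketch follows. It assumes that "scrub" means dropping every character above U+007F outright rather than replacing it; the names AsciiScrubber and Scrub are illustrative and do not come from the question.

```csharp
using System;
using System.Linq;

static class AsciiScrubber
{
    // Keeps only characters in the 7-bit ASCII range (U+0000 to U+007F).
    // Assumption: non-ASCII characters are simply removed, not substituted.
    public static string Scrub(string input)
    {
        return new string(input.Where(c => c <= 127).ToArray());
    }
}
```

With the string from the question, AsciiScrubber.Scrub("Slide™1½”C4®") returns "Slide1C4". Filtering on the char value avoids the round trip through Encoding.ASCII, which would leave a ? behind for each removed character.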
I have a file I am reading from to acquire a database of music files, like this;
00 6F 74 72 6B 00 00 02 57 74 74 79 70 00 00 00 .otrk...Wttyp...
06 00 6D 00 70 00 33 70 66 69 6C 00 00 00 98 00 ..m.p.3pfil...~.
44 00 69............. D.i.....
And so on. There could be hundreds to thousands of records in this file, each split at "otrk" into a string; that marker is the start of a new track.
The problem lies in the above: all the tracks start with otrk, and then have field identifiers, their lengths, and their values. For example, above:
ttyp = type, and the 06 following it is the length of the value, which is .m.p.3 or 00 6D 00 70 00 33
The next field identifier is pfil = filename, and this is where the issue lies, specifically with the length. Its value is 98, but when read into a string it becomes unrecognizable and defaults to a diamond with a question mark, with a value of 239, which is wrong. How can I avoid this and get the correct value in order to display it correctly?
My code to read the file;
db_file = File.ReadAllText(filePath, Encoding.UTF8);
and the code to split and sort through the file
string[] entries = content.Split(new string[] { "otrk" }, StringSplitOptions.None);
public List<Song> Songs { get; } = new List<Song>();
foreach(string entry in entries)
{
Songs.Add(Song.Create(entry));
}
Song.Create looks like;
public static Song Create(string dbString)
{
    Song toRet = new Song();
    for (int farthestReached = 0; farthestReached < dbString.Length;)
    {
        int startOfString = -1;
        int iLength = -1;
        byte[] b = Encoding.UTF8.GetBytes("0");
        // Gets the start index
        foreach (var l in labels)
        {
            startOfString = dbString.IndexOf(l, farthestReached);
            if (startOfString >= 0)
            {
                // get identifier index plus its length
                iLength = startOfString + 3;
                var valueIndex = iLength + 5;
                // get length of value
                string temp = dbString.Substring(iLength + 4, 1);
                b = Encoding.UTF8.GetBytes(temp);
                int xLen = b[0];
                // populate the label
                string fieldLabel = dbString.Substring(startOfString, l.Length);
                // populate the value
                string fieldValue = dbString.Substring(valueIndex, xLen);
                // set new
                farthestReached = xLen + valueIndex;
                switch (fieldLabel[0])
                {
                    case 'p':
                    case 't':
                        string stringValue = "";
                        foreach (char c in fieldValue)
                        {
                            if (c == 0)
                                continue;
                            stringValue += c;
                        }
                        assignStringField(toRet, fieldLabel, stringValue);
                        break;
                }
                break;
            }
        }
        // If a field was not found, there are no more fields
        if (startOfString == -1)
            break;
    }
    return toRet;
}
The file is not a UTF-8 file. The hex dump in the question makes it clear that it is neither UTF-8 nor a proper text file in any other text encoding. Rather, it looks like some binary (serialized) format with data fields of different types.
You cannot reliably read a binary file as if it were a text file, especially since some UTF-8 characters are represented by two or more bytes. A binary byte that precedes a text field may coincidentally equal the lead byte of a multi-byte UTF-8 sequence. The decoder then tries to decode a multi-byte sequence that does not align with the text field, and the first character(s) of that field are lost or misread.
Not only that, but certain byte values or byte sequences are not valid UTF-8 encodings of any character, and you "lose" such bytes when trying to read them as UTF-8 text. This is what happens to your length byte: 0x98 on its own is not valid UTF-8, so it is decoded as the replacement character U+FFFD (the diamond with a question mark), and encoding that character back to UTF-8 yields EF BF BD, whose first byte is 239.
Also, since several bytes can form a single UTF-8 character, you cannot rely on every individual byte being turned into a character with the same ordinal value, even if the byte is a valid ASCII value. Such a byte may be consumed as part of a multi-byte sequence and decoded into a single character whose ordinal value is entirely different from any of the bytes in that sequence.
That said, as far as I can tell, the text field in your little data snippet above does not look like UTF-8 at all. 00 6D 00 70 00 33 (*m*p*3) and 00 44 00 69 (*D*i) are definitely not UTF-8 -- note the zero bytes.
Thus, first consult the file format specification to figure out the actual text encoding used for the text fields in this file format. Don't guess. Don't assume. Don't believe. Look up, verify and confirm.
Secondly, since the file is not a proper text file (as already mentioned), you cannot read it like a text file with File.ReadAllText. Instead, read the raw byte data, for example with File.ReadAllBytes.
Find the otrk marker in the byte data of the file not as text, but by the 4 byte values this marker is made of.
Then, parse the byte data following the otrk marker according to the file format specification, and only decode the bytes that are actual text data into strings, using the correct text encoding as denoted by the file format specification.
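As a rough illustration of the steps above, here is a sketch that works on bytes instead of strings. The layout it assumes (a 4-byte ASCII tag, then a 4-byte big-endian length, then the value, with text stored as big-endian UTF-16) is only inferred from the hex dump in the question and must be checked against the real format specification. FieldParser and its members are made-up names.

```csharp
using System;
using System.Text;

static class FieldParser
{
    // Returns the index of the first occurrence of marker at or after start, or -1.
    public static int IndexOf(byte[] data, byte[] marker, int start)
    {
        for (int i = start; i <= data.Length - marker.Length; i++)
        {
            int j = 0;
            while (j < marker.Length && data[i + j] == marker[j]) j++;
            if (j == marker.Length) return i;
        }
        return -1;
    }

    // ASSUMPTION: a 4-byte big-endian length follows the 4-byte tag.
    public static int ReadLength(byte[] data, int tagIndex)
    {
        int p = tagIndex + 4;
        return (data[p] << 24) | (data[p + 1] << 16) | (data[p + 2] << 8) | data[p + 3];
    }

    // ASSUMPTION: text values are big-endian UTF-16 (suggested by 00 6D 00 70 00 33).
    public static string ReadText(byte[] data, int tagIndex)
    {
        int length = ReadLength(data, tagIndex);
        return Encoding.BigEndianUnicode.GetString(data, tagIndex + 8, length);
    }
}
```

Under those assumptions, feeding in the result of File.ReadAllBytes and locating Encoding.ASCII.GetBytes("ttyp") with IndexOf lets ReadText return "mp3", and ReadLength on the pfil tag returns 152 (0x98) rather than 239, because the length byte never passes through a text decoder.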
Why, when I turn INT value to bytes and to ASCII and back, I get another value?
Example:
var asciiStr = new string(Encoding.ASCII.GetChars(BitConverter.GetBytes(2000)));
var intVal = BitConverter.ToInt32(Encoding.ASCII.GetBytes(asciiStr), 0);
Console.WriteLine(intVal);
// Result: 1855
ASCII is only 7-bit - code points above 127 are unsupported. Unsupported characters are converted to ? per the docs on Encoding.ASCII:
The ASCIIEncoding object that is returned by this property might not have the appropriate behavior for your app. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.
So 2000 decimal = D0 07 00 00 hexadecimal (little endian) = [unsupported character] [BEL character] [NUL character] [NUL character] = ? [BEL character] [NUL character] [NUL character] = 3F 07 00 00 hexadecimal (little endian) = 1855 decimal.
TL;DR: Everything's fine. But you're a victim of character replacement.
We start with 2000. Let's acknowledge, first, that this number can be represented in hexadecimal as 0x000007d0.
BitConverter.GetBytes
BitConverter.GetBytes(2000) is an array of 4 bytes, because 2000 is a 32-bit integer literal. The 32-bit integer representation, in little endian (least significant byte first), is the byte sequence { 0xd0, 0x07, 0x00, 0x00 }. In decimal, those same bytes are { 208, 7, 0, 0 }.
Encoding.ASCII.GetChars
Uh oh! Problem. Here's where things likely took an unexpected turn for you.
You're asking the system to interpret those bytes as ASCII-encoded data. The problem is that ASCII uses codes from 0-127. The byte with value 208 (0xd0) doesn't correspond to any character encodable by ASCII. So what actually happens?
When decoding ASCII, if it encounters a byte that is out of the range 0-127 then it decodes that byte to a replacement character and moves to the next byte. This replacement character is a question mark ?. So the 4 chars you get back from Encoding.ASCII.GetChars are ?, BEL (bell), NUL (null) and NUL (null).
BEL is the ASCII name of the character with code 7, which traditionally elicits a beep when presented on a capable terminal. NUL (code 0) is a null character traditionally used for representing the end of a string.
new string
Now you create a string from that array of chars. In C# a string is perfectly capable of representing a NUL character within the body of a string, so your string will have two NUL chars in it. They can be represented in C# string literals with "\0", in case you want to try that yourself. A C# string literal that represents the string you have would be "?\a\0\0". Did you know that the BEL character can be represented with the escape sequence \a? Many people don't.
Encoding.ASCII.GetBytes
Now you begin the reverse journey. Your string consists entirely of characters in the ASCII range. The encoding of a question mark is code 63 (0x3F), the BEL is 7, and the NUL is 0, so the bytes are { 0x3f, 0x07, 0x00, 0x00 }. Surprised? Well, you're encoding a question mark now where before you provided a 208 (0xd0) byte that was not representable with ASCII encoding.
BitConverter.ToInt32
Converting these four bytes back to a 32-bit integer gives the integer 0x0000073f, which, in decimal, is 1855.
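To watch the replacement happen, the decode/encode pass can be isolated from the integer conversion. This is a small sketch only; AsciiRoundTrip and ThroughAscii are illustrative names.

```csharp
using System;
using System.Text;

static class AsciiRoundTrip
{
    // Returns the bytes as they look after one decode/encode pass through ASCII.
    public static byte[] ThroughAscii(byte[] raw)
    {
        string text = Encoding.ASCII.GetString(raw);  // bytes above 0x7F become '?'
        return Encoding.ASCII.GetBytes(text);         // '?' is encoded as 0x3F
    }
}
```

Printing BitConverter.ToString(AsciiRoundTrip.ThroughAscii(new byte[] { 0xD0, 0x07, 0x00, 0x00 })) shows 3F-07-00-00: only the first byte changed, and that single change is the whole difference between 2000 and 1855.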
String encoding (ASCII, UTF8, SHIFT_JIS, etc.) is designed to pigeonhole human language into a binary (byte) form. It isn't designed to store arbitrary binary data, such as the binary form of an integer.
While your binary data will be interpreted as a string, some of the information will be lost, meaning that storing binary data in this way will fail in the general case. You can see the point where this fails using the following code:
for (int i = 0; i <= 255; ++i) // cover every possible byte value
{
    var byteData = new byte[] { (byte)i };
    var stringData = System.Text.Encoding.ASCII.GetString(byteData);
    var encodedAsBytes = System.Text.Encoding.ASCII.GetBytes(stringData);
    Console.WriteLine("{0} vs {1}", i, (int)encodedAsBytes[0]);
}
As you can see, it starts off well because all of the character codes correspond to ASCII characters, but once the values reach 128 and beyond, we need more than 7 bits to store the binary value. At this point the data ceases to be decoded correctly, and we start seeing 63 come back instead of the input value.
Ultimately you will have this problem encoding binary data using any string encoding. You need to choose an encoding method specifically meant for storing binary data as a string.
Two popular methods are:
Hexadecimal
Base64 using ToBase64String and FromBase64String
Hexadecimal example (using the hex methods here):
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to hex
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = ByteArrayToString(bytesValue);
Console.WriteLine("As hex: {0}", stringValue); // outputs D0070000
// Convert from hex to bytes and then to int
byte[] decodedBytesValue = StringToByteArray(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
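The hexadecimal example calls ByteArrayToString and StringToByteArray without showing them. One possible implementation is sketched below; it is not necessarily the one the example was written against, and the containing class name HexHelpers is made up.

```csharp
using System;

static class HexHelpers
{
    // Formats each byte as two uppercase hex digits, with no separators.
    public static string ByteArrayToString(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "");
    }

    // Parses a string of hex digit pairs back into bytes.
    public static byte[] StringToByteArray(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Hex string must have an even length.", nameof(hex));
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        return bytes;
    }
}
```

To call them unqualified as the example does, either place them in the same class as the calling code or add using static HexHelpers; at the top of the file.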
Base64 example:
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to base64
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = Convert.ToBase64String(bytesValue);
Console.WriteLine("As base64: {0}", stringValue); // outputs 0AcAAA==
// Convert from base64 to bytes and then to int
byte[] decodedBytesValue = Convert.FromBase64String(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
P.S. If you simply wanted to convert your integer to a string (e.g. "2000") then you can simply use .ToString():
int initialValue = 2000;
string stringValue = initialValue.ToString();
I am converting a hex dump to text in order to get special characters such as symbols, but when I convert "0x18" I get "\u0018" as the value. Can anyone suggest a solution?
Here is my code:
public static string FromHexDump(string sText)
{
    Int32 lIdx;
    string prValue = "";
    for (lIdx = 1; lIdx < sText.Length; lIdx += 2)
    {
        string prString = "0x" + Mid(sText, lIdx, 2);
        string prUniCode = Convert.ToChar(Convert.ToInt64(prString, 16)).ToString();
        prValue = prValue + prUniCode;
    }
    return prValue;
}
I originally used VB. I have a database that already stores my password as encrypted text, and the value looks like BAA37D40186D. I loop over it in steps of 2, which gives 0xBA, 0xA3, 0x7D, 0x40, 0x18, 0x6D, and the VB result looks like this: º£}#m
You can use this code:
var myHex = '\x0633';
string formattedString = string.Format(@"\x{0:x4}", (int)myHex);
Or you can use this code from MSDN (https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/types/how-to-convert-between-hexadecimal-strings-and-numeric-types):
string hexValues = "48 65 6C 6C 6F 20 57 6F 72 6C 64 21";
string[] hexValuesSplit = hexValues.Split(' ');
foreach (string hex in hexValuesSplit)
{
    // Convert the number expressed in base-16 to an integer.
    int value = Convert.ToInt32(hex, 16);
    // Get the character corresponding to the integral value.
    string stringValue = Char.ConvertFromUtf32(value);
    char charValue = (char)value;
    Console.WriteLine("hexadecimal value = {0}, int value = {1}, char value = {2} or {3}",
        hex, value, stringValue, charValue);
}
The question is unclear - what is the database column's type? Does it contain 6 bytes, or 12 characters with the hex encoding of the bytes? In any case, this has nothing to do with special characters or encodings.
First, 0x18 is the byte value of the Cancel Character in the Latin 1 codepage, not the pound sign. That's 0xA3. It seems that the byte values in the question are just the Latin 1 bytes for the string in hex.
.NET strings are Unicode (UTF-16LE specifically). There's no UTF-8 string or Latin1 string. Encodings and code pages apply when converting bytes to strings or vice versa. This is done using the Encoding class, e.g. Encoding.GetBytes.
In this case, this code will convert the bytes to the expected string form, including the unprintable character:
var dbBytes=new byte[] {0xBA,0xA3,0x7D,0x40,0x18,0x6D};
// On .NET Core / .NET 5+, code page 1252 is only available after calling
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var latinEncoding=Encoding.GetEncoding(1252);
var result=latinEncoding.GetString(dbBytes);
The result is :
º£}@m
With the Cancel character between @ and m (byte 0x40 is @ in Latin 1, not #).
If the database column contains the byte values as strings :
it takes double the required space and
the hex values have to be converted back to bytes before converting to strings
The X format specifier converts numbers or bytes to their hex form. For each byte value, ToString("X2") returns a two-digit hex string; the 2 matters because a plain "x" would drop the leading zero of any value below 0x10.
The hex string can be produced from the original buffer with:
var dbBytes=new byte[] {0xBA,0xA3,0x7D,0x40,0x18,0x6D};
var hexString=String.Join("",dbBytes.Select(c=>c.ToString("X2")));
There are many questions that show how to parse a byte string into a byte array. I'll just steal Jared Parson's LINQ answer :
public static byte[] StringToByteArray(string hex) {
    return Enumerable.Range(0, hex.Length)
        .Where(x => x % 2 == 0)
        .Select(x => Convert.ToByte(hex.Substring(x, 2), 16))
        .ToArray();
}
With that, we can parse the hex string into a byte array and convert it to the original string :
var bytes=StringToByteArray(hexString);
var latinEncoding=Encoding.GetEncoding(1252);
var result=latinEncoding.GetString(bytes);
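Putting the two steps together, a self-contained sketch might look like the following. One deliberate swap: it uses ISO-8859-1 instead of code page 1252, on the assumption that none of the stored bytes fall in the 0x80-0x9F range where the two differ (true for the six bytes in the question); ISO-8859-1 is available without registering an extra encoding provider. HexToText and Decode are illustrative names.

```csharp
using System;
using System.Linq;
using System.Text;

static class HexToText
{
    // Turns a string of hex digit pairs into text, one character per byte.
    public static string Decode(string hex)
    {
        byte[] bytes = Enumerable.Range(0, hex.Length / 2)
            .Select(i => Convert.ToByte(hex.Substring(i * 2, 2), 16))
            .ToArray();
        // ASSUMPTION: ISO-8859-1 stands in for code page 1252 here.
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }
}
```

HexToText.Decode("BAA37D40186D") returns a six-character string whose fifth character is the control character U+0018. That character is the correct decoding of byte 0x18, not a conversion failure.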
First of all, you don't need a hex dump here; what you need to understand is Unicode. I would recommend reading about Unicode and encodings, and why they cause this kind of problem with strings.
PS: solution: StackOverflow
With respect to this tool, I need to convert hexadecimal data, irrespective of their combination to equivalent text. For example:
"HelloWorld" = 48656c6c6f576f726c64;
The solution needs to take into account that hexadecimal can be grouped in different lengths:
48656c6c 6f576f72 6c64
or
48 65 6c 6c 6f 57 6f 72 6c 64
All of the hexadecimal values supplied above read as HelloWorld when converted to text.
First, I would like to point out that this question has been asked many times on the web (here is one example). However, I am going to break this down step by step for you to hopefully teach you how to not only utilize your resources available on the web, but also how to solve your problem.
Overview: Converting from hexadecimal data to text that is able to be read by human beings is a straight-forward process in modern development languages; you clean the data (ensuring no illegal characters remain), then you convert down to the byte level so that you can work with the raw data. Finally, you'll convert that raw data into readable text utilizing a method that has already been created by Microsoft.
Important: Remember, for the conversion to work, you have to ensure you're converting in the same format that you started with:
ASCII -> ASCII: Works Great!
ASCII -> UTF7: Not so much...
Removing Illegal Characters: One of the first things you'll need to do is ensure the hexadecimal value that you're supplying doesn't contain any illegal characters. The simplest way to do this is to create an array of acceptable characters and then remove anything but these in a loop:
private static string GetCleanHex(string hex) {
    string legalCharacters = "0123456789ABCDEF";
    string result = hex.ToUpper();
    // The loop enumerates the original upper-cased string; result is reassigned as we go.
    foreach (char c in result) {
        if (!legalCharacters.Contains(c))
            result = result.Replace(c.ToString(), string.Empty);
    }
    return result;
}
Getting The Byte Array: Once you've cleaned out all illegal characters, you can now convert your hexadecimal string into a byte array. This is required to convert from hexadecimal to ASCII. This step was provided by the linked post above:
private static byte[] GetBytesFromHex(string hex) {
    byte[] bytes = new byte[hex.Length / 2];
    for (int i = 0; i < bytes.Length; i++)
        bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
    return bytes;
}
Converting To Text: Now that you've cleaned your data, and converted it to a byte[], you can now convert that byte data into ASCII. This can be done using a method available in Encoding.ASCII called GetString:
string text = Encoding.ASCII.GetString(bytes);
The Final Result: Plug all of this into your application and you'll have successfully converted hexadecimal data into clean, readable text:
string hex = GetCleanHex("506c 65 61736520 72 656164 20686f77 2074 6f 2061 73 6b 2e");
byte[] bytes = GetBytesFromHex(hex);
string text = Encoding.ASCII.GetString(bytes);
Console.WriteLine(text);
Console.ReadKey();
The code above will print the following text to the console:
Please read how to ask.
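The same three steps can also be folded into one method. The sketch below is an alternative arrangement rather than the code above: it filters with Uri.IsHexDigit instead of a list of legal characters, and it rejects an odd number of digits instead of silently ignoring the last one. HexText and ToAscii are made-up names.

```csharp
using System;
using System.Linq;
using System.Text;

static class HexText
{
    // Converts hex digits, in any grouping, to ASCII text.
    public static string ToAscii(string hex)
    {
        // Drop spaces and any other non-hex characters.
        string clean = new string(hex.Where(c => Uri.IsHexDigit(c)).ToArray());
        if (clean.Length % 2 != 0)
            throw new FormatException("Expected an even number of hex digits.");

        byte[] bytes = new byte[clean.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(clean.Substring(i * 2, 2), 16);

        return Encoding.ASCII.GetString(bytes);
    }
}
```

All three groupings from the question ("48656c6c6f576f726c64", "48656c6c 6f576f72 6c64" and "48 65 6c 6c 6f 57 6f 72 6c 64") come back as HelloWorld.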
I have a string in C# like this:
string only_number;
I assigned it a value = 40
When I check only_number[0], I get 52
When I check only_number[1], I get 48
Why is it adding 48 to the character at each position? Please advise.
A string is basically a char[]. So what you are seeing is the ASCII value of the chars '4' and '0'.
Proof: the difference between 4 and 0 equals the difference between 52 and 48.
Since it is a string, you didn't assign it 40. Instead you assigned it "40".
What you see is the ASCII code of '4' and '0'.
It's not adding 48 to the character. What you see is the character code, and the characters for digits start at 48 in Unicode:
'0' = 48
'1' = 49
'2' = 50
'3' = 51
'4' = 52
'5' = 53
'6' = 54
'7' = 55
'8' = 56
'9' = 57
A string is a sequence of char values, and each char value is a 16-bit integer holding a UTF-16 code unit, which for characters such as digits is simply their code point in the Unicode character set.
When you read from only_number[0] you get a char value that is '4', and the character code for that is 52. So, what you have done is reading a character from the string, and then converted that to an integer before you display it.
So:
char c = only_number[0];
Console.WriteLine(c); // displays 4
int n = (int)only_number[0]; // cast to integer
Console.WriteLine(n); // displays 52
int m = only_number[0]; // the cast is not needed, but the value is cast anyway
Console.WriteLine(m); // displays 52
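If the goal is the numeric value of a digit rather than its character code, subtracting the code of '0' is the usual approach. A minimal sketch, with DigitDemo and DigitValue as illustrative names:

```csharp
using System;

static class DigitDemo
{
    // Numeric value of a decimal digit character:
    // '4' has code 52 and '0' has code 48, so '4' - '0' is 4.
    public static int DigitValue(char c)
    {
        if (c < '0' || c > '9')
            throw new ArgumentOutOfRangeException(nameof(c), "Not an ASCII digit.");
        return c - '0';
    }
}
```

DigitDemo.DigitValue(only_number[0]) gives 4 and DigitDemo.DigitValue(only_number[1]) gives 0. To convert the whole string at once, int.Parse(only_number) returns 40.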
You are accessing this string and it is outputting the ASCII character codes for each of your two characters, '4' and '0' - please see here:
http://www.theasciicode.com.ar/ascii-control-characters/null-character-ascii-code-0.html
A string is an array of chars, which is why you received these results: it is basically displaying the ASCII codes of '4' and '0'.