I have a file I am reading from to acquire a database of music files, like this:
00 6F 74 72 6B 00 00 02 57 74 74 79 70 00 00 00 .otrk...Wttyp...
06 00 6D 00 70 00 33 70 66 69 6C 00 00 00 98 00 ..m.p.3pfil...~.
44 00 69............. D.i.....
Etc.; there could be hundreds to thousands of records in this file. I split it at "otrk" into strings, since that marks the start of a new track.
The problem lies in the structure above: every track starts with otrk and then has field identifiers, each with a length and a value. For example, above:
ttyp = type, and the 06 following it is the length of the value, which is .m.p.3, or 00 6D 00 70 00 33.
The next field identifier is pfil = filename, and herein lies the issue, specifically with the length, whose value is 98. However, when read into a string it becomes unrecognizable and defaults to a diamond with a question mark (the Unicode replacement character) with a value of 239, which is wrong. How can I avoid this and get the correct value so I can display it correctly?
My code to read the file:
db_file = File.ReadAllText(filePath, Encoding.UTF8);
and the code to split and sort through the file:
string[] entries = content.Split(new string[] { "otrk" }, StringSplitOptions.None);
public List<Song> Songs { get; } = new List<Song>();
foreach (string entry in entries)
{
    Songs.Add(Song.Create(entry));
}
Song.Create looks like:
public static Song Create(string dbString)
{
    Song toRet = new Song();
    for (int farthestReached = 0; farthestReached < dbString.Length;)
    {
        int startOfString = -1;
        int iLength = -1;
        byte[] b = Encoding.UTF8.GetBytes("0");
        // Gets the start index
        foreach (var l in labels)
        {
            startOfString = dbString.IndexOf(l, farthestReached);
            if (startOfString >= 0)
            {
                // get identifier index plus its length
                iLength = startOfString + 3;
                var valueIndex = iLength + 5;
                // get length of value
                string temp = dbString.Substring(iLength + 4, 1);
                b = Encoding.UTF8.GetBytes(temp);
                int xLen = b[0];
                // populate the label
                string fieldLabel = dbString.Substring(startOfString, l.Length);
                // populate the value
                string fieldValue = dbString.Substring(valueIndex, xLen);
                // set new
                farthestReached = xLen + valueIndex;
                switch (fieldLabel[0])
                {
                    case 'p':
                    case 't':
                        string stringValue = "";
                        foreach (char c in fieldValue)
                        {
                            if (c == 0)
                                continue;
                            stringValue += c;
                        }
                        assignStringField(toRet, fieldLabel, stringValue);
                        break;
                }
                break;
            }
        }
        // If a field was not found, there are no more fields
        if (startOfString == -1)
            break;
    }
    return toRet;
}
The file is not a UTF-8 file. The hex dump shown in the question makes it clear that it is not a UTF-8 file, nor a proper text file in any other text encoding. It rather looks like some binary (serialized) format, with data fields of different types.
You cannot reliably read a binary file naively as if it were a text file, especially since certain UTF-8 characters are represented by two or more bytes. It is quite likely that the UTF-8 decoder will get confused by all the binary data: a (binary) byte that does not belong to a text field can coincidentally equal the start byte of a multi-byte UTF-8 sequence, so the decoder may swallow the first character(s) of a text field while trying to decode a multi-byte sequence that does not align with the field.
Not only that, but certain byte values and byte sequences are not valid UTF-8 encodings of characters at all, and you would "lose" such bytes when trying to read them as UTF-8 text.
Also, since multiple bytes can form a single UTF-8 character, you cannot rely on every individual byte being turned into a character with the same ordinal value (even if the byte value happens to be a valid ASCII value): such a byte could be decoded merely as part of a multi-byte sequence, yielding a single character whose ordinal value is entirely different from the values of the bytes in that sequence.
That said, as far as I can tell, the text fields in your little data snippet above do not look like UTF-8 at all. 00 6D 00 70 00 33 (*m*p*3) and 00 44 00 69 (*D*i) are definitely not UTF-8 -- note the zero bytes. (If anything, those zero bytes hint at a 16-bit encoding such as big-endian UTF-16, but verify that against the specification.)
Thus, first consult the file format specification to figure out the actual text encoding used for the text fields in this file format. Don't guess. Don't assume. Don't believe. Look up, verify and confirm.
Secondly, since the file is not a proper text file (as already mentioned), you cannot read it like a text file with File.ReadAllText. Instead, read the raw byte data, for example with File.ReadAllBytes.
Find the otrk marker in the byte data of the file not as text, but by the 4 byte values this marker is made of.
Then, parse the byte data following the otrk marker according to the file format specification, and only decode the bytes that are actual text data into strings, using the correct text encoding as denoted by the file format specification.
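Putting that advice together, here is a minimal sketch, not a definitive parser: it reads the raw bytes, locates the otrk marker by its byte values, and shows where a text field would be decoded. The file name is hypothetical, and the big-endian UTF-16 guess for the text encoding (suggested by the zero bytes in your dump) is an assumption to verify against the specification.
using System;
using System.IO;
using System.Text;

class BinaryDbSketch
{
    // Find a byte pattern in raw data; string.IndexOf is not reliable on binary data.
    static int IndexOfBytes(byte[] haystack, byte[] needle, int start)
    {
        for (int i = start; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }

    static void Main()
    {
        byte[] data = File.ReadAllBytes("database.dat"); // hypothetical path
        byte[] otrk = Encoding.ASCII.GetBytes("otrk");

        for (int pos = IndexOfBytes(data, otrk, 0); pos >= 0;
             pos = IndexOfBytes(data, otrk, pos + otrk.Length))
        {
            // Parse the record header here per the specification, then decode
            // only the actual text bytes, e.g. (assuming big-endian UTF-16):
            // string value = Encoding.BigEndianUnicode.GetString(data, valueOffset, valueLength);
        }
    }
}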
For this tool, I need to convert hexadecimal data, regardless of how it is grouped, to the equivalent text. For example:
"HelloWorld" = 48656c6c6f576f726c64;
The solution needs to take into account that hexadecimal can be grouped in different lengths:
48656c6c 6f576f72 6c64
or
48 65 6c 6c 6f 57 6f 72 6c 64
All of the hexadecimal values supplied above read as HelloWorld when converted to text.
First, I would like to point out that this question has been asked many times on the web (here is one example). However, I am going to break this down step by step for you to hopefully teach you how to not only utilize your resources available on the web, but also how to solve your problem.
Overview: Converting from hexadecimal data to text that is able to be read by human beings is a straightforward process in modern development languages; you clean the data (ensuring no illegal characters remain), then you convert down to the byte level so that you can work with the raw data. Finally, you convert that raw data into readable text using a method that has already been created by Microsoft.
Important: Remember, for the conversion to work, you have to ensure you're converting in the same format that you started with:
ASCII -> ASCII: Works Great!
ASCII -> UTF7: Not so much...
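A quick sketch of why the mismatch matters (values chosen purely for illustration): bytes produced by one encoding generally don't round-trip through another.
byte[] utf16Bytes = Encoding.Unicode.GetBytes("Hi");   // 48 00 69 00
string wrong = Encoding.ASCII.GetString(utf16Bytes);   // "H\0i\0" - embedded NULs, not "Hi"
string right = Encoding.Unicode.GetString(utf16Bytes); // "Hi"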
Removing Illegal Characters: One of the first things you'll need to do is ensure the hexadecimal value you're supplying doesn't contain any illegal characters. The simplest way to do this is to define the set of acceptable characters and then remove anything else in a loop:
private string GetCleanHex(string hex) {
    string legalCharacters = "0123456789ABCDEF";
    string result = hex.ToUpper();
    foreach (char c in result) {
        if (!legalCharacters.Contains(c))
            result = result.Replace(c.ToString(), string.Empty);
    }
    return result;
}
Getting The Byte Array: Once you've cleaned out all illegal characters, you can now convert your hexadecimal string into a byte array. This is required to convert from hexadecimal to ASCII. This step was provided by the linked post above:
private byte[] GetBytesFromHex(string hex) {
    byte[] bytes = new byte[hex.Length / 2];
    for (int i = 0; i < bytes.Length; i++)
        bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
    return bytes;
}
Converting To Text: Now that you've cleaned your data, and converted it to a byte[], you can now convert that byte data into ASCII. This can be done using a method available in Encoding.ASCII called GetString:
string text = Encoding.ASCII.GetString(bytes);
The Final Result: Plug all of this into your application and you'll have successfully converted hexadecimal data into clean, readable text:
string hex = GetCleanHex("506c 65 61736520 72 656164 20686f77 2074 6f 2061 73 6b 2e");
byte[] bytes = GetBytesFromHex(hex);
string text = Encoding.ASCII.GetString(bytes);
Console.WriteLine(text);
Console.ReadKey();
The code above will print the following text to the console:
Please read how to ask.
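As a side note, on .NET 5 or later the byte-array step is built in: Convert.FromHexString does the same job as GetBytesFromHex (it rejects spaces and other separators, so run the cleaning step first). A minimal sketch, assuming that newer target:
string hex = GetCleanHex("48 65 6c 6c 6f 57 6f 72 6c 64"); // "48656C6C6F576F726C64"
byte[] bytes = Convert.FromHexString(hex);                 // built-in since .NET 5
Console.WriteLine(Encoding.ASCII.GetString(bytes));        // HelloWorld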
I am trying to encrypt a text and then decrypt it using a public/private key pair. I need the keys to be handled as strings, so I convert the keys from string to RSAParameters. But when I try to encrypt the text, it throws the error "The handle is not valid".
static RSAParameters _publicKey = new RSAParameters();
static RSAParameters _privateKey = new RSAParameters();
static string _strPublicKey = "-----BEGIN PUBLIC KEY-----\r\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsG5sBbnH7gXkrExzNOeK\r\nEjoRDHUH5uU+crH52Z1uXCEx8LiFow8RwrvZGqjYXgBwxzqOQwHJt3utoNVY0niP\r\nHjfPXwKTk79PkeET/mtRar1gEcCOr0/hgHxT3YGlQLw2ugVIulMzlUBRY4rceNv3\r\nEiSZ+4cnO04hJ6UiftrCfwTe6q9Hadp6B6SX+N9hgcHhRX4iR/VYUf/6cvN+NAgb\r\ntuF0Dk61C6ulh2Gvdj2TBCLaq1LPF5H+ghrRxjK/Zn6MG9BW2ju9g8zYKuufaaaM\r\ndbTmN+Z4f27nEOxFq5wWqeaWl53yrMia6xnOi7vtU8zcwBL7jSLgwrkyO8LabCdz\r\neQIDAQAB\r\n-----END PUBLIC KEY-----";
static string _strPrivateKey = "-----BEGIN RSA PRIVATE KEY-----\r\nMIIEogIBAAKCAQEAsG5sBbnH7gXkrExzNOeKEjoRDHUH5uU+crH52Z1uXCEx8LiF\r\now8RwrvZGqjYXgBwxzqOQwHJt3utoNVY0niPHjfPXwKTk79PkeET/mtRar1gEcCO\r\nr0/hgHxT3YGlQLw2ugVIulMzlUBRY4rceNv3EiSZ+4cnO04hJ6UiftrCfwTe6q9H\r\nadp6B6SX+N9hgcHhRX4iR/VYUf/6cvN+NAgbtuF0Dk61C6ulh2Gvdj2TBCLaq1LP\r\nF5H+ghrRxjK/Zn6MG9BW2ju9g8zYKuufaaaMdbTmN+Z4f27nEOxFq5wWqeaWl53y\r\nrMia6xnOi7vtU8zcwBL7jSLgwrkyO8LabCdzeQIDAQABAoIBAA2z5cvkC/UenA4N\r\nufzn5r9Xpy9Sf5SdRWZfEEqogYPCSECr9CUf7H81W71IU9WpLxkqIRZvMx1/C5Ms\r\nPsPJ/UOZjg+RAak9+I4Z7xWZfC9QGgAG9o4DJD54aYMQqKcIdy+nbWibQaxb3HZg\r\nuJLicqQEF7mDW7atcMHFf5JepzB6LO7u9mfgR03uHQh6r6ym27BTGwssSmEeOeiA\r\n+tOPEhCsbZMSs5+8aGMoV08OqjscytQCWDY8rwA8ZE/qis+cNxKo0OluRTde68mH\r\nbr42CZpNJNulhg4mZyxtrtC+D13VcRpFeKW7WbMBwEUJ8/liUBDvAPLB3Np46FsG\r\njcZfFmkCgYEAtSk5HGEk+y8dyl+l62u7oir5IEZf8vNsMJ+CrpF1C8e6qShRe9uy\r\nJ5pN/4dBb2thknOLsaw6K2qYGNkH3TYSpHusW7v4Iuy6ONmzXHCxcgconzCWJ0HI\r\nLWnYRZAHv8PdOxdFqukjLqFOz6fEyIJ5Ayp+7qxg3QRmE7bwnEwWYK8CgYEA+VEC\r\nBUVUd8tHLqMyaYuv7HXQlQ3J01fobts98k+xqQtKsMUu4SZQ/5uQF8Wk3P6AAKiU\r\ngeD81Fpu4PINrRH97R0twnlWru1fHNSHNenuLASW+3I4l975PhZJc5AeV7VpRcnY\r\nyZCMAKO/XRpfgTEsv5HNfidUsYuJ/9epQzVy6FcCgYAs1V3f6x621ys9OTybrZbL\r\nBG2REjmOq7V7tw4lW7QmzTAhyuuXhoBpkqN4+KU2CNIl51iMCP6AXin0BEoQ8d/d\r\nOwolzbgUFJfll+LunqkbejAQbXrLjlkW/Bnc5U81oyhuBk1khbwCP0N82p01rix6\r\nnxq4wIpcSElm2aBkXeQv2wKBgCvewUhEJtTdhC0Esn44AkDNimJwBq+VrGS1V3Un\r\n6M8iGYZ5bAJaR65ypSxJrvTkI4n6IAeqm1KShyg174ogvFnY5JBv4Xzub+oWy6QF\r\nAc/lDtw4ARVYOutd6JbZKT2twlRxbCAruzbxmV68oUmOaZ1b/pjQOury7tmCDVqy\r\nMQIJAoGAIqUiIAnKLBrXhfN0nqGt2iOnRl31Ef3p/pTNwhUBdJphi/zlE9JHTI2Q\r\nCiLCGyJTpr3FoPjIJZ2P+fRrB3FmVuGNVZw5s8g4ouuDWNyCL/upVwU84eAWMu7P\r\nDsUL5Ia5W7/Cm7d0/nqchUCkRslIsH+bfGj0NZ7qcE5H1D8Ee0A=\r\n-----END RSA PRIVATE KEY-----";
static void Main(string[] args)
{
    var message = "Hello World!";
    var data = Encrypt(message);
    var res = Decrypt(data);
}

private static RSAParameters GetRSAParameters(string pPublicKey)
{
    byte[] lDer;
    // Set RSAKeyInfo to the public key values.
    int lBeginStart = "-----BEGIN PUBLIC KEY-----".Length;
    int lEndLenght = "-----END PUBLIC KEY-----".Length;
    string KeyString = pPublicKey.Substring(lBeginStart, (pPublicKey.Length - lBeginStart - lEndLenght));
    lDer = Convert.FromBase64String(KeyString);
    // Create a new instance of the RSAParameters structure.
    RSAParameters lRSAKeyInfo = new RSAParameters();
    lRSAKeyInfo.Modulus = GetModulus(lDer);
    lRSAKeyInfo.Exponent = GetExponent(lDer);
    return lRSAKeyInfo;
}

private static byte[] GetModulus(byte[] pDer)
{
    // The header is 29 bytes (58 hex digits)
    // The modulus is assumed to be 128 bytes; as a hex string each byte is 2 digits => 256
    string lModulus = BitConverter.ToString(pDer).Replace("-", "").Substring(58, 256);
    return StringHexToByteArray(lModulus);
}

private static byte[] GetExponent(byte[] pDer)
{
    int lExponentLenght = pDer[pDer.Length - 3];
    string lExponent = BitConverter.ToString(pDer).Replace("-", "").Substring((pDer.Length * 2) - lExponentLenght * 2, lExponentLenght * 2);
    return StringHexToByteArray(lExponent);
}

public static byte[] StringHexToByteArray(string hex)
{
    return Enumerable.Range(0, hex.Length)
        .Where(x => x % 2 == 0)
        .Select(x => Convert.ToByte(hex.Substring(x, 2), 16))
        .ToArray();
}

public static string Encrypt(string mess)
{
    string response = "";
    var input = Encoding.UTF8.GetBytes(mess);
    using (var rsa = new RSACryptoServiceProvider(1024))
    {
        rsa.PersistKeyInCsp = false;
        rsa.ImportParameters(GetRSAParameters(_strPublicKey));
        var decrypt = rsa.Encrypt(input, true);
        response = Convert.ToBase64String(decrypt);
    }
    return response;
}

public static string Decrypt(string mess)
{
    string response = "";
    var input = Convert.FromBase64String(mess);
    using (var rsa = new RSACryptoServiceProvider(2048))
    {
        rsa.PersistKeyInCsp = false;
        rsa.ImportParameters(GetRSAParameters(_strPublicKey));
        var decrypt = rsa.Decrypt(input, false);
        response = Encoding.UTF8.GetString(decrypt);
    }
    return response;
}
Above is the code that I am using to encrypt and decrypt. After encrypting, I convert the result to a string and pass it to the decrypt function. Please advise.
Your attempt to read the DER is definitely lacking. Possibly because you have a 2048-bit key and your "parser" at best only works for a 1024-bit key (though I'm having trouble deciding what key it could legitimately read).
Your public key hex is
30820122300D06092A864886F70D01010105000382010F003082010A02820101
00B06E6C05B9C7EE05E4AC4C7334E78A123A110C7507E6E53E72B1F9D99D6E5C
2131F0B885A30F11C2BBD91AA8D85E0070C73A8E4301C9B77BADA0D558D2788F
1E37CF5F029393BF4F91E113FE6B516ABD6011C08EAF4FE1807C53DD81A540BC
36BA0548BA5333954051638ADC78DBF7122499FB87273B4E2127A5227EDAC27F
04DEEAAF4769DA7A07A497F8DF6181C1E1457E2247F55851FFFA72F37E34081B
B6E1740E4EB50BABA58761AF763D930422DAAB52CF1791FE821AD1C632BF667E
8C1BD056DA3BBD83CCD82AEB9F69A68C75B4E637E6787F6EE710EC45AB9C16A9
E696979DF2ACC89AEB19CE8BBBED53CCDCC012FB8D22E0C2B9323BC2DA6C2773
790203010001
The DER breakdown (because I can't help myself) is
30 82 01 22
30 0D
06 09 2A 86 48 86 F7 0D 01 01 01
05 00
03
82 01 0F
00
30 82 01 0A
02 82 01 01
00B06E6C05B9C7EE05E4AC4C7334E78A123A110C7507E6E53E72B1F9D99D6E5C
2131F0B885A30F11C2BBD91AA8D85E0070C73A8E4301C9B77BADA0D558D2788F
1E37CF5F029393BF4F91E113FE6B516ABD6011C08EAF4FE1807C53DD81A540BC
36BA0548BA5333954051638ADC78DBF7122499FB87273B4E2127A5227EDAC27F
04DEEAAF4769DA7A07A497F8DF6181C1E1457E2247F55851FFFA72F37E34081B
B6E1740E4EB50BABA58761AF763D930422DAAB52CF1791FE821AD1C632BF667E
8C1BD056DA3BBD83CCD82AEB9F69A68C75B4E637E6787F6EE710EC45AB9C16A9
E696979DF2ACC89AEB19CE8BBBED53CCDCC012FB8D22E0C2B9323BC2DA6C2773
79
02 03
010001
Which is the ASN.1 equivalent of
SEQUENCE
SEQUENCE
OBJECT IDENTIFIER id-rsaEncryption
NULL
BIT STRING (0 bits unused) (wrapping)
SEQUENCE
INTEGER (positive) 0xB06E...7379
INTEGER 0x010001
Your modulus extractor skips 29 bytes and reads the next 128. (Why it does byte[] => hex => substring => byte[] instead of doing new byte[size] and Buffer.BlockCopy is beyond me). The problem is, your key is a 2048-bit key (the modulus length is 0x0101, and once we remove the sign-padding byte it's 0x0100 bytes long, with the high bit set, so 256*8 = 2048 bit modulus). So you need to have read 256 bytes. And, worse, because the payload is bigger than expected, it doesn't start at the offset you think it does. So where you should end up with B0 6E 6C ... 27 73 79 (after skipping the padding byte) you ended up with 82 01 01 00 *B0 6E 6C ... A5 22 7E (the asterisk marking the desired start of your payload). So, that's bad, and you are definitely not encrypting what you think you are. (The A5 22 7E sequence occurs on the line starting with 36BA, ending with 3 bytes left to go on that line)
What's worse is the GetExponent method. It seems to try to combine actual DER reading with presupposing what the value is. It should conclude 01 00 01, but it reads the first 01 of the payload as a length, then uses that length to count from the end. So the exponent value ends up being just 01. The RSA algorithm doesn't work with e=1, and what's happening is probably that a post-encryption self-test decided that the results were garbage, and the Win32 crypto library concluded that the problem is that the caller gave it something that wasn't a valid key handle, as opposed to a valid handle to an invalid key. In addition to not understanding the byte[] => hex => substring => byte[] pattern here I don't understand why this code decided that pDer[-3] (to use syntax from another language) was the length value. DER can only be read "left to right", and if it knew "the exponent is always 0x010001" then a) that'd be length-encoded at pDer[-4] b) just reading the last three bytes would be forgivable and c) given how many assumptions this code makes it would've made more sense to just blindly assume 0x010001.
So, you need to a) replace your parser with either a correct parser or at least a specialized one which functions at your desired keysize and b) never use that private key for anything important, since you just posted it to the Internet (and the private key variable wasn't even used in this problem).
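For the "correct parser" option, here is a minimal sketch, assuming a .NET 5+ target where RSA.ImportFromPem is available: it parses the PEM header and the DER for you, so the hand-rolled modulus/exponent extraction disappears entirely. (Note the original code also mismatches padding: Encrypt passes true for OAEP while Decrypt passes false for PKCS#1 v1.5; the sketch uses OAEP on both sides.)
using System;
using System.Security.Cryptography;
using System.Text;

static class RsaPemSketch
{
    public static string Encrypt(string message, string publicKeyPem)
    {
        using var rsa = RSA.Create();
        rsa.ImportFromPem(publicKeyPem); // handles "BEGIN PUBLIC KEY" (SubjectPublicKeyInfo)
        byte[] cipher = rsa.Encrypt(Encoding.UTF8.GetBytes(message), RSAEncryptionPadding.OaepSHA1);
        return Convert.ToBase64String(cipher);
    }

    public static string Decrypt(string cipherBase64, string privateKeyPem)
    {
        using var rsa = RSA.Create();
        rsa.ImportFromPem(privateKeyPem); // handles "BEGIN RSA PRIVATE KEY" (PKCS#1) as well
        byte[] plain = rsa.Decrypt(Convert.FromBase64String(cipherBase64), RSAEncryptionPadding.OaepSHA1);
        return Encoding.UTF8.GetString(plain);
    }
}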
I've modified an example to send & receive from serial, and that works fine.
The device I'm connecting to has three commands I need to work with.
My experience is with C.
MAP - returns a list of field_names, (decimal) values & (hex) addresses
I can keep track of which values are returned as decimal or hex.
Each line is terminated with CR
:: Example:
MEMBERS:10 - number of (decimal) member names
NAME_LENGTH:15 - (decimal) length of each name string
NAME_BASE:0A34 - 10 c-strings of (15) characters each starting at address (0x0A34) (may have junk following each null terminator)
etc.
GET hexaddr hexbytecount - returns a list of 2-char hex values starting from (hexaddr).
The returned bytes are a mix of bytes/ints/longs and null-terminated C strings; the response line is terminated with CR
:: Example:
get 0a34 10 -- will return
0A34< 54 65 73 74 20 4D 65 20 4F 75 74 00 40 D3 23 0B
This happens to be 'Test Me Out'(00) followed by junk
etc.
PUT hexaddr hexbytevalue {{value...} {value...}} sends multiple hex byte values separated by spaces starting at hex address, terminated by CR/LF
These bytes are a mix of bytes/ints/longs and null-terminated C strings :: Example:
put 0a34 50 75 73 68 - (ascii Push)
Will replace the first 4-chars at 0x0A34 to become 'Push Me Out'
SAVED OK
See my previous answer about serial handling, which might be useful: Serial Port Polling and Data handling
to convert your response to actual text :-
var s = "0A34 < 54 65 73 74 20 4D 65 20 4F 75 74 00 40 D3 23 0B";
var hex = s.Substring(s.IndexOf("<") + 1).Trim().Split(new char[] {' '});
var numbers = hex.Select(h => Convert.ToInt32(h, 16)).ToList();
var toText = String.Join("",numbers.TakeWhile(n => n!=0)
.Select(n => Char.ConvertFromUtf32(n)).ToArray());
Console.WriteLine(toText);
which :-
skips through the string till after the < character, then splits the rest into hex strings
then, converts each hex string into ints ( base 16 )
then, takes each number till it finds a 0 and converts each number to text (using UTF32 encoding)
then, we join all the converted strings together to recreate the original text
alternatively, more condensed
var hex = s.Substring(s.IndexOf("<") + 1).Trim().Split(new char[] {' '});
var bytes = hex.Select(h => (byte) Convert.ToInt32(h, 16)).TakeWhile(n => n != 0);
var toText = Encoding.ASCII.GetString(bytes.ToArray());
for converting to hex from a number :-
Console.WriteLine(123.ToString("X"));  // 7B
Console.WriteLine(123.ToString("X4")); // 007B
Console.WriteLine(123.ToString("X8")); // 0000007B
Console.WriteLine(123.ToString("x4")); // 007b
You will also find that working with hex data is well documented at https://msdn.microsoft.com/en-us/library/bb311038.aspx
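Tying the formatting back to the device protocol, here is a small sketch that builds the PUT command from the question out of plain text (the address and space-separated framing are taken from the question's example; adjust to your device):
using System;
using System.Linq;
using System.Text;

class PutCommandSketch
{
    static void Main()
    {
        string text = "Push";
        // Each ASCII byte formatted as two hex digits: "50 75 73 68"
        string hexBytes = string.Join(" ",
            Encoding.ASCII.GetBytes(text).Select(b => b.ToString("X2")));
        string command = $"put 0a34 {hexBytes}"; // append CR/LF when writing to the port
        Console.WriteLine(command);              // put 0a34 50 75 73 68
    }
}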
I've noticed this strange issue. Check out this Vietnamese (according to Google Translate) string:
string line = "Mìng-dĕ̤ng-ngṳ̄";
string sub = "Mìng-dĕ̤ng-ngṳ";
line.Length
15
sub.Length
14
line.StartsWith(sub)
false
Which seems to me like a wrong result. So, I've implemented my own custom StartsWith function, which compares the strings char by char.
public bool CustomStartWith(string parent, string child)
{
    for (int i = 0; i < child.Length; i++)
    {
        if (parent[i] != child[i])
            return false;
    }
    return true;
}
And as I assumed, the results of running this function
CustomStartWith("Mìng-dĕ̤ng-ngṳ̄", "Mìng-dĕ̤ng-ngṳ")
true
What's going on here?! How's this possible?
The result returned by StartsWith is correct. By default, most string comparison methods perform culture-sensitive comparisons using the current culture, not plain byte sequences. Although your line starts with a byte sequence identical to sub, the substring it represents is not equivalent under most (or all) cultures.
If you really want a comparison that treats strings as plain byte sequences, use the overload:
line.StartsWith(sub, StringComparison.Ordinal); // true
If you want the comparison to be case-insensitive:
line.StartsWith(sub, StringComparison.OrdinalIgnoreCase); // true
Here's a more familiar example:
var line1 = "café"; // 63 61 66 E9 – precomposed character 'é' (U+00E9)
var line2 = "café"; // 63 61 66 65 301 – base letter e (U+0065) and
// combining acute accent (U+0301)
var sub = "cafe"; // 63 61 66 65
Console.WriteLine(line1.StartsWith(sub)); // false
Console.WriteLine(line2.StartsWith(sub)); // false
Console.WriteLine(line1.StartsWith(sub, StringComparison.Ordinal)); // false
Console.WriteLine(line2.StartsWith(sub, StringComparison.Ordinal)); // true
In the above examples, line2 starts with the same byte sequence as sub, followed by a combining acute accent (U+0301) to be applied to the final e. line1 uses the precomposed character for é (U+00E9), so its byte sequence does not match that of sub.
In real-world semantics, one would typically not consider cafe to be a substring of café; the e and é are treated as distinct characters. That é happens to be represented as a pair of characters starting with e is an internal implementation detail of the encoding scheme (Unicode) that should not affect results. This is demonstrated by the above example contrasting café and café; one would not expect different results unless specifically intending an ordinal (byte-by-byte) comparison.
Adapting this explanation to your example:
string line = "Mìng-dĕ̤ng-ngṳ̄"; // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73 304
string sub = "Mìng-dĕ̤ng-ngṳ"; // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73
Each .NET character represents a UTF-16 code unit, whose values are shown in the comments above. The first 14 code units are identical, which is why your char-by-char comparison evaluates to true (just like StringComparison.Ordinal). However, the 15th code unit in line is the combining macron, ◌̄ (U+0304), which combines with its preceding ṳ (U+1E73) to give ṳ̄.
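If you want to verify those code-unit values yourself, here is a minimal sketch that dumps each UTF-16 code unit of a string in hex:
using System;

class CodeUnitDump
{
    // Print each UTF-16 code unit of a string as hex.
    static void Dump(string s)
    {
        foreach (char c in s)
            Console.Write("{0:X} ", (int)c);
        Console.WriteLine();
    }

    static void Main()
    {
        Dump("Mìng-dĕ̤ng-ngṳ̄"); // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73 304
        Dump("Mìng-dĕ̤ng-ngṳ");  // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73
    }
}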
This is not a bug. String.StartsWith is in fact much smarter than a simple character-by-character check of your two strings. It takes into account your current culture (language settings, etc.), including contractions and special characters. (It does not care that you need two characters to end up with ṳ̄; it compares it as one.)
So if you don't want to take all those culture-specific settings into account, and just want to check using ordinal comparison, you have to tell the comparer that.
This is the correct way to do that (not ignoring the case, like Douglas did!):
line.StartsWith(sub, StringComparison.Ordinal);
I'm in the process of creating a program that will scrub extended ASCII characters from text documents. I'm trying to understand how C# is interpreting the different character sets and codes, and am noticing some oddities.
Consider:
using System;
using System.Text;

namespace ASCIITest
{
    class Program
    {
        static void Main(string[] args)
        {
            string value = "Slide™1½”C4®";
            byte[] asciiValue = Encoding.ASCII.GetBytes(value); // byte array
            char[] array = value.ToCharArray();                 // char array
            Console.WriteLine("CHAR\tBYTE\tINT32");
            for (int i = 0; i < array.Length; i++)
            {
                char letter = array[i];
                byte byteValue = asciiValue[i];
                Int32 int32Value = array[i];
                //
                Console.WriteLine("{0}\t{1}\t{2}", letter, byteValue, int32Value);
            }
            Console.ReadLine();
        }
    }
}
Output from the program:
CHAR BYTE INT32
S 83 83
l 108 108
i 105 105
d 100 100
e 101 101
T 63 8482 <- trademark symbol
1 49 49
½ 63 189 <- fraction
" 63 8221 <- smartquotes
C 67 67
4 52 52
r 63 174 <- registered trademark symbol
In particular, I'm trying to understand why the extended ASCII characters (the ones with my notes added to the right of the third column) show up with the correct value when cast as int32, but all show up as 63 when cast as the byte value. What's going on here?
The ASCII.GetBytes conversion replaces all characters outside the ASCII range (0-127) with a question mark (code 63).
So, since your string contains characters outside that range, your asciiValue has ? in place of all the interesting symbols like ™ - its char (Unicode) representation is 8482, which is indeed outside the 0-127 range.
Converting the string to a char array does not modify the values of the characters, so you still have the original Unicode code units (char is essentially a 16-bit integer) - and casting one to the longer integer type Int32 does not change the value.
Below are possible conversion of that character into byte/integers:
var value = "™";
var ascii = Encoding.ASCII.GetBytes(value)[0]; // 63(`?`) - outside 0-127 range
var castToByte = (byte)(value[0]); // 34 = 8482 % 256
var Int16 = (Int16)value[0]; // 8482
var Int32 = (Int16)value[0]; // 8482
Details available at ASCIIEncoding Class
ASCIIEncoding corresponds to the Windows code page 20127. Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
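Coming back to the stated goal of scrubbing the non-ASCII characters rather than letting the encoder substitute ?, one simple approach (a sketch, one of several reasonable options) is to filter the chars directly:
using System;
using System.Linq;

class Scrubber
{
    static void Main()
    {
        string value = "Slide™1½”C4®";
        // Keep only 7-bit ASCII (U+0000..U+007F); everything above is dropped.
        string scrubbed = new string(value.Where(c => c <= 0x7F).ToArray());
        Console.WriteLine(scrubbed); // Slide1C4
    }
}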