char[] array to text file, including non-printing control characters - C#

I have a rather tricky question (at least for me). I am currently writing a small and simple encryption and decryption program that uses polyalphabetic substitution.
This is the Encryption function:
static string Encr(string plainText, string key)
{
    char[] chars = new char[plainText.Length];
    int h = 0;
    for (int i = 0; i < plainText.Length; i++)
    {
        if (h == key.Length)
            h = 0;
        int j = plainText[i] + key[h];
        chars[i] = (char)j;
        h++;
    }
    StreamWriter sw = new StreamWriter(FILE_NAME, false, Encoding.Unicode);
    for (int x = 0; x < plainText.Length; x++)
    {
        sw.Write(chars[x]);
    }
    sw.Close();
    return new string(chars);
}
It works fine. Now my problem is that the output file created by StreamWriter contains additional unwanted 00 bytes (due to the Unicode encoding) and two wrong values at the beginning, also due to the Unicode encoding:
http://abload.de/img/unbenannt-16ilx4.jpg
(sorry, I can't post images directly because I have < 10 rep)
FF FE 8A 00 A6 00 CC 00 A4 00 B0 00
The values 8A, A6, CC, A4 and B0 are the correct ones; the FF FE at the beginning is completely useless for my encryption/decryption, and the 00 bytes are unwanted. (I know this is standard Unicode encoding; my question is how to achieve this without that encoding while still being able to display the corresponding Unicode characters.)
I hope it is clear what I want to achieve: I would like to write out the character values only.
So in this special case the hex view of the encrypted file would look like this:
8A A6 CC A4 B0, which according to http://unicode-table.com/ would correspond to the letters Š¦Ì¤°.
I have failed so far in all of my attempts to solve this. The solution is probably really easy, though...

The characters at the start are the 2 bytes of the encoding preamble that identify the encoding that is used. I think you would be better off converting to bytes and writing the bytes to a general binary file without any encoding.
As in:
static string Encr(string plainText, string key)
{
    char[] chars = new char[plainText.Length];
    int h = 0;
    for (int i = 0; i < plainText.Length; i++)
    {
        if (h == key.Length)
            h = 0;
        int j = plainText[i] + key[h];
        chars[i] = (char)j;
        h++;
    }
    File.WriteAllBytes(FILE_NAME, System.Text.Encoding.UTF8.GetBytes(chars));
    return new string(chars);
}

Is there a reason not to use Encoding.UTF8?

What you have appears to be working, but it's not really working. Once you have encrypted the character codes, it doesn't make sense to look at them as character codes any more; they are just numerical values. Most values will correspond to some character, but not all.
The correct output from the encryption would be an array of 16 bit integers, not a string of characters. You can then turn the numbers into some textual representation if you like, but you can't use them as character codes and reliably get the same result back when decrypting.
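For illustration, a minimal sketch of that idea might write the encrypted values as raw 16-bit integers with BinaryWriter and read them back for decryption. FILE_NAME is taken from the question's code; the ushort-per-value file layout and the helper names are assumptions for this sketch, not part of the original program:
// Sketch only: store the encrypted values as raw 16-bit integers instead of characters.
// Like the original code, this does not handle overflow past 65535.
static ushort[] EncrToValues(string plainText, string key)
{
    var values = new ushort[plainText.Length];
    for (int i = 0; i < plainText.Length; i++)
        values[i] = (ushort)(plainText[i] + key[i % key.Length]);

    using (var writer = new BinaryWriter(File.Open(FILE_NAME, FileMode.Create)))
        foreach (ushort v in values)
            writer.Write(v); // exactly 2 bytes per value, no BOM, no text encoding

    return values;
}

static string DecrFromFile(string key)
{
    byte[] raw = File.ReadAllBytes(FILE_NAME);
    var result = new char[raw.Length / 2];
    for (int i = 0; i < result.Length; i++)
    {
        ushort v = BitConverter.ToUInt16(raw, i * 2);
        result[i] = (char)(v - key[i % key.Length]);
    }
    return new string(result);
}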

Related

Reading UTF8 file, gives diamond with question mark on unknown characters

I have a file I am reading from to acquire a database of music files, like this:
00 6F 74 72 6B 00 00 02 57 74 74 79 70 00 00 00 .otrk...Wttyp...
06 00 6D 00 70 00 33 70 66 69 6C 00 00 00 98 00 ..m.p.3pfil...~.
44 00 69............. D.i.....
Etc. There could be hundreds to thousands of records in this file, all split at "otrk" into strings; "otrk" marks the start of a new track.
The problem actually lies in the above: every track starts with otrk and then has field identifiers, their length, and their value. For example, in the dump above:
ttyp = type, and the 06 following it is the length of the value, which is .m.p.3, or 00 6D 00 70 00 33.
Then the next field identifier is pfil = filename, and this is where the issue lies, specifically with the length: its value is 98, but when read into a string it becomes unrecognizable and defaults to a diamond with a question mark, with a value of 239, which is wrong. How can I avoid this and read the correct value so I can display it correctly?
My code to read the file:
db_file = File.ReadAllText(filePath, Encoding.UTF8);
and the code to split and sort through the file:
string[] entries = content.Split(new string[] { "otrk" }, StringSplitOptions.None);
public List<Song> Songs { get; } = new List<Song>();
foreach(string entry in entries)
{
Songs.Add(Song.Create(entry));
}
Song.Create looks like:
public static Song Create(string dbString)
{
    Song toRet = new Song();
    for (int farthestReached = 0; farthestReached < dbString.Length;)
    {
        int startOfString = -1;
        int iLength = -1;
        byte[] b = Encoding.UTF8.GetBytes("0");
        // Gets the start index
        foreach (var l in labels)
        {
            startOfString = dbString.IndexOf(l, farthestReached);
            if (startOfString >= 0)
            {
                // get identifier index plus its length
                iLength = startOfString + 3;
                var valueIndex = iLength + 5;
                // get length of value
                string temp = dbString.Substring(iLength + 4, 1);
                b = Encoding.UTF8.GetBytes(temp);
                int xLen = b[0];
                // populate the label
                string fieldLabel = dbString.Substring(startOfString, l.Length);
                // populate the value
                string fieldValue = dbString.Substring(valueIndex, xLen);
                // set new
                farthestReached = xLen + valueIndex;
                switch (fieldLabel[0])
                {
                    case 'p':
                    case 't':
                        string stringValue = "";
                        foreach (char c in fieldValue)
                        {
                            if (c == 0)
                                continue;
                            stringValue += c;
                        }
                        assignStringField(toRet, fieldLabel, stringValue);
                        break;
                }
                break;
            }
        }
        // If a field was not found, there are no more fields
        if (startOfString == -1)
            break;
    }
    return toRet;
}
The file is not a UTF-8 file. The hex dump shown in the question makes it clear that it is not a UTF-8 file, nor a proper text file in any other text encoding. It rather looks like some binary (serialized) format, with data fields of different types.
You cannot reliably read a binary file naively like a text file, especially considering that certain UTF-8 characters are represented by two or more bytes. It is quite likely that the UTF-8 decoder will get confused by all the binary data and miss the first character(s) of a text field: a preceding (binary) byte that does not belong to a text field can coincidentally be equal to a start byte of a multi-byte UTF-8 sequence, so the decoder tries to decode a multi-byte sequence that does not align with the text field and fails to identify the field's first character(s) correctly.
Not only that, but certain byte values and byte sequences are not valid UTF-8 encodings for characters, and you would "lose" such bytes when trying to read them as UTF-8 text.
Also, since multiple bytes can form a single UTF-8 character, you cannot rely on every individual byte being turned into a character with the same ordinal value (even if the byte value happens to be a valid ASCII value), because such a byte could be decoded merely as part of a multi-byte UTF-8 sequence into a single character whose ordinal value is entirely different from the values of the bytes in that sequence.
That said, as far as I can tell, the text fields in your little data snippet above do not look like UTF-8 at all. 00 6D 00 70 00 33 (.m.p.3) and 00 44 00 69 (.D.i) are definitely not UTF-8 -- note the zero bytes.
Thus, first consult the file format specification to figure out the actual text encoding used for the text fields in this file format. Don't guess. Don't assume. Don't believe. Look up, verify and confirm.
Secondly, since the file is not a proper text file (as already mentioned), you cannot read it like a text file with File.ReadAllText. Instead, read the raw byte data, for example with File.ReadAllBytes.
Find the otrk marker in the byte data of the file not as text, but by the 4 byte values this marker is made of.
Then, parse the byte data following the otrk marker according to the file format specification, and only decode the bytes that are actual text data into strings, using the correct text encoding as denoted by the file format specification.
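For illustration only, here is a rough sketch of that approach. It assumes, based purely on the hex dump in the question and not on any specification, that each field is a 4-byte ASCII label followed by a 4-byte big-endian length and then the value, and that text values are UTF-16 big-endian; the DumpFirstField and FindMarker helpers are made up for this sketch, and the assumed layout must be verified against the real file format documentation:
// Sketch only. The 4-byte big-endian length and the UTF-16BE text encoding are
// assumptions inferred from the hex dump above; confirm them before relying on them.
static void DumpFirstField(string filePath)
{
    byte[] data = File.ReadAllBytes(filePath);
    byte[] marker = Encoding.ASCII.GetBytes("otrk");

    int pos = FindMarker(data, marker, 0);
    if (pos < 0)
        return;

    // Skip the marker and the 4-byte (big-endian) record length that follows it.
    int fieldPos = pos + 4 + 4;

    // One field: 4-byte ASCII label, 4-byte big-endian length, then the value.
    string label = Encoding.ASCII.GetString(data, fieldPos, 4);
    int length = (data[fieldPos + 4] << 24) | (data[fieldPos + 5] << 16)
               | (data[fieldPos + 6] << 8) | data[fieldPos + 7];
    string value = Encoding.BigEndianUnicode.GetString(data, fieldPos + 8, length);
    Console.WriteLine(label + " = " + value);
}

// Finds a byte pattern in a byte buffer, starting at the given offset.
static int FindMarker(byte[] buffer, byte[] pattern, int start)
{
    for (int i = start; i <= buffer.Length - pattern.Length; i++)
    {
        bool match = true;
        for (int j = 0; j < pattern.Length; j++)
            if (buffer[i + j] != pattern[j]) { match = false; break; }
        if (match)
            return i;
    }
    return -1;
}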

Convert from Hexadecimal to Text

With respect to this tool, I need to convert hexadecimal data, regardless of how it is grouped, into the equivalent text. For example:
"HelloWorld" = 48656c6c6f576f726c64;
The solution needs to take into account that hexadecimal can be grouped in different lengths:
48656c6c 6f576f72 6c64
or
48 65 6c 6c 6f 57 6f 72 6c 64
All of the hexadecimal values supplied above read as HelloWorld when converted to text.
First, I would like to point out that this question has been asked many times on the web (here is one example). However, I am going to break this down step by step for you to hopefully teach you how to not only utilize your resources available on the web, but also how to solve your problem.
Overview: Converting from hexadecimal data to text that is able to be read by human beings is a straight-forward process in modern development languages; you clean the data (ensuring no illegal characters remain), then you convert down to the byte level so that you can work with the raw data. Finally, you'll convert that raw data into readable text utilizing a method that has already been created by Microsoft.
Important: Remember, for the conversion to work, you have to ensure you're converting in the same format that you started with:
ASCII -> ASCII: Works Great!
ASCII -> UTF7: Not so much...
Removing Illegal Characters: One of the first things you'll need to do is ensure the hexadecimal value that you're supplying doesn't contain any illegal characters. The simplest way to do this is to create an array of acceptable characters and then remove anything but these in a loop:
private string GetCleanHex(string hex) {
    string legalCharacters = "0123456789ABCDEF";
    string result = hex.ToUpper();
    foreach (char c in result) {
        if (!legalCharacters.Contains(c))
            result = result.Replace(c.ToString(), string.Empty);
    }
    return result;
}
Getting The Byte Array: Once you've cleaned out all illegal characters, you can now convert your hexadecimal string into a byte array. This is required to convert from hexadecimal to ASCII. This step was provided by the linked post above:
private byte[] GetBytesFromHex(string hex) {
    byte[] bytes = new byte[hex.Length / 2];
    for (int i = 0; i < bytes.Length; i++)
        bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
    return bytes;
}
Converting To Text: Now that you've cleaned your data and converted it to a byte[], you can convert that byte data into ASCII. This can be done using a method available in Encoding.ASCII called GetString:
string text = Encoding.ASCII.GetString(bytes);
The Final Result: Plug all of this into your application and you'll have successfully converted hexadecimal data into clean, readable text:
string hex = GetCleanHex("506c 65 61736520 72 656164 20686f77 2074 6f 2061 73 6b 2e");
byte[] bytes = GetBytesFromHex(hex);
string text = Encoding.ASCII.GetString(bytes);
Console.WriteLine(text);
Console.ReadKey();
The code above will print the following text to the console:
Please read how to ask.

How to convert from Hex to Unicode?

My application receives hex values from a client and converts them back to characters, which are usually Chinese characters, but I can't implement this properly. My current program can convert "e5a682e4bd95313233" to "如何123", but I am actually receiving "59824F55003100320033" from the client side for the same Chinese text "如何123", and my program is unable to convert it back into a string. Please help me with this.
Here is my current code:
byte[] uniMsg = null;
string msg = "59824F55003100320033";
uniMsg = StringToByteArray(msg.ToUpper());
msg = System.Text.Encoding.UTF8.GetString(uniMsg);

public static byte[] StringToByteArray(String hex)
{
    hex = hex.Replace("-", "");
    byte[] raw = new byte[hex.Length / 2];
    for (int i = 0; i < raw.Length; i++)
    {
        raw[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
    }
    return raw;
}
Appreciate any help on this. Thanks.
Solution:
I updated
msg = System.Text.Encoding.UTF8.GetString(uniMsg);
to
msg = System.Text.Encoding.BigEndianUnicode.GetString(uniMsg);
Thanks to @CodesInChaos for suggesting the encoding type.
It doesn't seem to be encoded in UTF-8. Note the 00 31 00 32 00 33 part; in UTF-8 it would be just 31 32 33. I think the hex string is in UTF-16 BE, because it is exactly 2 bytes per character and the values are 00-padded. Decode your byte array as UTF-16 and you will get a string. Then you can use it as a string, or re-encode it into any other encoding you need.
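A minimal sketch of that suggestion, reusing the StringToByteArray helper from the question:
// Sketch: decode the client's hex string as UTF-16 big-endian.
string hex = "59824F55003100320033";
byte[] bytes = StringToByteArray(hex);
string text = Encoding.BigEndianUnicode.GetString(bytes);
Console.WriteLine(text); // 如何123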

Why are ASCII values of a byte different when cast as Int32?

I'm in the process of creating a program that will scrub extended ASCII characters from text documents. I'm trying to understand how C# is interpreting the different character sets and codes, and am noticing some oddities.
Consider:
namespace ASCIITest
{
    class Program
    {
        static void Main(string[] args)
        {
            string value = "Slide™1½”C4®";
            byte[] asciiValue = Encoding.ASCII.GetBytes(value); // byte array
            char[] array = value.ToCharArray();                 // char array
            Console.WriteLine("CHAR\tBYTE\tINT32");
            for (int i = 0; i < array.Length; i++)
            {
                char letter = array[i];
                byte byteValue = asciiValue[i];
                Int32 int32Value = array[i];
                //
                Console.WriteLine("{0}\t{1}\t{2}", letter, byteValue, int32Value);
            }
            Console.ReadLine();
        }
    }
}
Output from program:
CHAR    BYTE    INT32
S       83      83
l       108     108
i       105     105
d       100     100
e       101     101
™       63      8482   <- trademark symbol
1       49      49
½       63      189    <- fraction
”       63      8221   <- smart quote
C       67      67
4       52      52
®       63      174    <- registered trademark symbol
In particular, I'm trying to understand why the extended ASCII characters (the ones with my notes added to the right of the third column) show up with the correct value when cast as int32, but all show up as 63 when cast as the byte value. What's going on here?
Encoding.ASCII.GetBytes replaces all characters outside of the ASCII range (0-127) with a question mark (code 63).
So since your string contains characters outside of that range, your asciiValue has ? instead of the interesting symbols like ™, whose char (Unicode) representation is 8482, which is indeed outside of the 0-127 range.
Converting the string to a char array does not modify the values of the characters, so you still have the original Unicode codes (char is essentially a 16-bit integer); casting one to the longer integer type Int32 does not change the value.
Below are possible conversions of that character into byte/integer values:
var value = "™";
var ascii = Encoding.ASCII.GetBytes(value)[0]; // 63 ('?') - outside 0-127 range
var castToByte = (byte)(value[0]);             // 34 = 8482 % 256
var Int16 = (Int16)value[0];                   // 8482
var Int32 = (Int32)value[0];                   // 8482
Details available at ASCIIEncoding Class
ASCIIEncoding corresponds to the Windows code page 20127. Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
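Since the stated goal is to scrub characters outside the 0-127 range rather than have them silently replaced with question marks, here is a minimal sketch of one possible approach that filters on the char value directly instead of going through Encoding.ASCII (the ScrubNonAscii helper name is made up for this sketch):
// Sketch: keep only characters whose Unicode code point is in the ASCII range 0-127.
// StringBuilder lives in System.Text.
static string ScrubNonAscii(string input)
{
    var builder = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (c <= 127) // a char compares as its numeric (Unicode) value
            builder.Append(c);
    }
    return builder.ToString();
}

// Usage: ScrubNonAscii("Slide™1½”C4®") returns "Slide1C4".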

Weird behavior when converting byte from text box to byte array to characters?

I have a textbox that I use to convert things like:
74 00 65 00 73 00 74 00
back into a string. The above says "test", but for some reason when I click the convert button it displays only the first letter "t" (74 00), while other byte arrays work as expected and the entire text is converted.
Here are the two pieces of code I have tried, which both produce the same behavior of not properly converting the entire byte array back to a word:
byte[] bArray = ByteStrToByteArray(iSequence.Text);
ASCIIEncoding enc = new ASCIIEncoding();
string word = enc.GetString(bArray);
iResult.Text = word + Environment.NewLine;
which uses the function:
private byte[] ByteStrToByteArray(string byteString)
{
    byteString = byteString.Replace(" ", string.Empty);
    byte[] buffer = new byte[byteString.Length / 2];
    for (int i = 0; i < byteString.Length; i += 2)
        buffer[i / 2] = (byte)Convert.ToByte(byteString.Substring(i, 2), 16);
    return buffer;
}
another way I was using is:
string str = iSequence.Text.Replace(" ", "");
byte[] bArray = Enumerable.Range(0, str.Length)
.Where(x => x % 2 == 0)
.Select(x => Convert.ToByte(str.Substring(x, 2), 16))
.ToArray();
ASCIIEncoding enc = new ASCIIEncoding();
string word = enc.GetString(bArray);
iResult.Text = word + Environment.NewLine;
I tried checking the lengths to see if it was iterating through everything, and it was...
I don't really know how to debug why this is happening with the above byte array; all the other byte arrays seem to work just fine, only this one outputs just its first letter.
Have I done something wrong that could somehow produce this behavior?
What could I try in order to find out what is wrong?
If you have the byte sequence
var bytes = new byte[] { 0x74, 0x00, 0x65, 0x00, 0x73, 0x00, 0x74, 0x00 };
and you decode it to a string using ASCII encoding (Encoding.ASCII), then you get
var result = Encoding.ASCII.GetString(bytes);
// result == "\x74\x00\x65\x00\x73\x00\x74\x00" == "t\0e\0s\0t\0"
Notice the Null \0 characters? When you display such a string in a textbox, only the part of the string until the first Null character is displayed.
Since you say the result should read "test", the input is actually not encoded in ASCII but in UTF-16LE (Encoding.Unicode).
var result = Encoding.Unicode.GetString(bytes);
// result == "\u0074\u0065\u0073\u0074" == "test"
You're converting a Unicode string to ASCII; you're not specifying the code page on your machine to convert from.
System.Text.Encoding.GetEncoding("codepage").GetString()
if my memory serves me correctly. Also note that any control in .NET is Unicode, so what you're trying to put into the text box (if the conversion isn't correct) could be an end-of-line character, or EOF, or any kind of control character. It all depends on your code page.
I tried debugging the first program using breakpoints in VS2010. I found out that the line
string word = enc.GetString(bArray);
output word as "t\0e\0s\0t".
The last line
iResult.Text = word + Environment.NewLine;
gives iResult.Text as simply "t".
So I was thinking that the null characters cause everything after the first \0 to be ignored when the text is displayed. I could be wrong, though, but try removing all occurrences of 00 from the input string.
I'm not really into C#. I'm only suggesting this because it looks like C++.
It works for me:
string outputText = "t\0e\0s\0t";
outputText = outputText.Replace("\0", " ");
