Binary data being mangled when transferred in a string between functions - C#

Hope this is a trivial problem that can be solved easily.
I'm trying to move the contents of a binary file from one location to another, but with a twist: I need to transfer it as a string, and this is where the file ends up slightly different from the source.
The reason for transferring it via a string is that the code that loads the file and the code that saves it only communicate through a host (this is a C# MEF application), and the interface forces me to send the data as a string, nothing else.
So what I'm doing is this (pseudo-ish, only core functionality remains):
// This part loads a binary file
string output = string.Empty; // The data to be transfered
byte[] fileContent; // The binary file content
fileContent = File.ReadAllBytes(fileName);
output = Encoding.Default.GetString(fileContent);
//output = Convert.ToBase64String(fileContent);
//output = Encoding.UTF7.GetString(fileContent);
//output = Encoding.UTF8.GetString(fileContent);
//output = Encoding.UTF32.GetString(fileContent);
//output = Encoding.ASCII.GetString(fileContent);
//output = Encoding.BigEndianUnicode.GetString(fileContent);
//output = Encoding.Unicode.GetString(fileContent);
Then the string is transferred to its destination part:
// This part saves a binary file
string input; // This is the data received
byte[] content = Encoding.Unicode.GetBytes(input);
File.WriteAllBytes(@"c:\test.png", content);
The destination file now differs slightly from the source file, a byte here and there, if I look at the files with an appropriate tool. The encoding on the sending side that works best is Unicode.
What am I missing here?

As was said in the comments, the safest option is Base64. But if you want a little more efficiency, any simple 8-bit encoding without gaps should work, as long as you use the same encoding to decode it. By simple I mean none of the Unicode multi-byte encodings. ASCII also won't work, since it's 7-bit.
Note on efficiency: each byte is actually stored in 2 bytes, since strings in C# are stored internally as UTF-16. But with Base64 you are using 8 bytes (4 two-byte characters) for every 3 bytes of binary.
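For completeness, a minimal sketch of the Base64 round trip (the file path and variable names are only placeholders):
// Sending side: turn the raw bytes into a Base64 string that survives any string transport.
byte[] fileContent = File.ReadAllBytes(fileName);
string transfer = Convert.ToBase64String(fileContent);
// Receiving side: decode the Base64 string back into the exact same bytes.
byte[] restored = Convert.FromBase64String(transfer);
File.WriteAllBytes(@"c:\test.png", restored);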
I tried using Encoding.GetEncoding(437), and it round-trips correctly on a local system:
// Fill a buffer with every possible byte value 0-255.
var b = new byte[256];
for (int i = 0; i < 256; i++)
    b[i] = (byte)i;

// Round-trip all of them through code page 437, a single-byte encoding without gaps.
var encoding = System.Text.Encoding.GetEncoding(437);
var s = encoding.GetString(b);
var b2 = encoding.GetBytes(s);

// Verify every byte survived the round trip.
for (int i = 0; i < 256; i++)
    if (b2[i] != i)
        Console.WriteLine("Error at " + i);

Related

C# UTF string conversion: characters which don't display correctly get converted to "unknown character" - how to prevent this?

I've got two strings derived from Windows filenames, which contain Unicode characters that do not display correctly in Windows (they show just the square box "unknown character" instead of the correct character). However, the filenames are valid and these files exist without problems in the operating system, which means I need to be able to deal with them correctly and accurately.
I'm loading the filenames the usual way:
string path = @"c:\folder";
foreach (FileInfo file in new DirectoryInfo(path).EnumerateFiles())
{
    string filename = file.FullName;
}
but for the purposes of explaining this problem, these are the two filenames I'm having issues with:
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
Two strings, two filenames with a single Unicode character plus an extension, both different. So far this is fine: I can read and write these files with no problem. However, I need to store these strings in an SQLite db and later retrieve them. Every attempt I make to do so results in both of these characters being changed to the "unknown character", so the original data is lost and I can no longer differentiate the two strings. At first I thought this was an SQLite issue, and I've made sure my db is in UTF-16, but it turns out it's the conversion to UTF-16 in C# that is causing the problem.
If I ignore SQLite entirely and simply try to manually convert these strings to UTF-16 (or to any other encoding), these characters are converted to the "unknown character" and the original data is lost. If I do this:
System.Text.Encoding enc = System.Text.Encoding.Unicode;
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
byte[] name1Bytes = enc.GetBytes(filename1);
byte[] name2Bytes = enc.GetBytes(filename2);
and then inspect the byte arrays 'name1Bytes' and 'name2Bytes', they are identical. I can see that the Unicode character in both cases has been converted to the byte pair 253, 255 - the replacement ("unknown") character. And sure enough, when I convert back
string newFilename1 = enc.GetString(name1Bytes);
string newFilename2 = enc.GetString(name2Bytes);
the original Unicode character in each case is lost and replaced with a diamond question mark symbol. I have lost the original filenames altogether.
It seems that these encoding conversions rely on the system font being able to display the characters, which is a problem, as these strings already exist as filenames and changing the filenames isn't an option. I need to preserve this data somehow when sending it to SQLite; on the way it will go through a conversion to UTF-16, and it's this conversion that the data needs to survive without loss.
If you cast a char to an int, you get the numeric value, bypassing the Unicode conversion mechanism:
foreach (char ch in filename1)
{
    int i = ch; // 0x0000de18 == 56856 for the first char in filename1
    // ... do whatever, e.g., create an int array, store it as base64
}
This turns out to work as well, and is perhaps more elegant:
foreach (int ch in filename1)
{
    // ...
}
So perhaps something like this:
string Encode(string raw)
{
    // Store each UTF-16 code unit as two bytes (low byte first), then Base64 the buffer.
    byte[] bytes = new byte[2 * raw.Length];
    int i = 0;
    foreach (int ch in raw)
    {
        bytes[i++] = (byte)(ch & 0xff);
        bytes[i++] = (byte)(ch >> 8);
    }
    return Convert.ToBase64String(bytes);
}

string Decode(string encoded)
{
    // Reverse the steps: Base64 back to bytes, then rebuild each char from its two bytes.
    byte[] bytes = Convert.FromBase64String(encoded);
    char[] chars = new char[bytes.Length / 2];
    for (int i = 0; i < chars.Length; ++i)
    {
        chars[i] = (char)(bytes[i * 2] | (bytes[i * 2 + 1] << 8));
    }
    return new string(chars);
}
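A quick hedged usage sketch of the two helpers above, using one of the problem filenames from the question:
string filename1 = "\ude18.txt";
string stored = Encode(filename1);        // plain ASCII Base64, safe to put in sqlite
string restored = Decode(stored);         // the lone surrogate comes back intact
Console.WriteLine(filename1 == restored); // True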

Problems with writing bytes format of string data in Text File in C#

I have a text file stored locally. I want to store string data in binary format there and then retrieve the data again. In the following code snippet, I have done the conversion.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class ConsoleApplication
{
    const string fileName = "AppSettings.dat";

    static void Main()
    {
        string someText = "settings";
        byte[] byteArray = Encoding.UTF8.GetBytes(someText);
        int byteArrayLength = byteArray.Length;

        using (BinaryWriter writer = new BinaryWriter(File.Open(fileName, FileMode.Create)))
        {
            writer.Write(someText);
        }

        byte[] x = new byte[byteArrayLength];
        if (File.Exists(fileName))
        {
            using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
            {
                x = reader.ReadBytes(byteArrayLength);
            }
            string str = Encoding.UTF8.GetString(x);
            Console.Write(str);
            Console.ReadKey();
        }
    }
}
In the AppSettings.dat file the bytes end up written in a human-readable way.
But when I assign some random values to a byte array and save it to a file using BinaryWriter, as in the following code snippet,
const string fileName = "AppSettings.dat";

static void Main()
{
    byte[] array = new byte[8];
    Random random = new Random();
    random.NextBytes(array);

    using (BinaryWriter writer = new BinaryWriter(File.Open(fileName, FileMode.Create)))
    {
        writer.Write(array);
    }
}
it actually saves the data in non-readable binary form in the file, as shown in the picture.
I don't understand why, in the first case, the byte data converted from the string ends up in human-readable form, when what I want is to save the data in non-readable byte form as in the later case. What's the explanation for this?
Is there any way I can store string data in binary format without resorting to brute force?
FYI - I don't want to keep the data in Base64 string format; I want it to be in binary format.
If security isn't a concern, and you just don't want the average user to find your data while meddling with the settings files, a simple XOR will do:
const string fileName = "AppSettings.dat";

static void Main()
{
    string someText = "settings";
    byte[] byteArray = Encoding.UTF8.GetBytes(someText);

    // Flip every bit of every byte before writing.
    for (int i = 0; i < byteArray.Length; i++)
    {
        byteArray[i] ^= 255;
    }
    File.WriteAllBytes(fileName, byteArray);

    if (File.Exists(fileName))
    {
        var x = File.ReadAllBytes(fileName);

        // Flip the bits back on the way in.
        for (int i = 0; i < x.Length; i++)
        {
            x[i] ^= 255;
        }
        string str = Encoding.UTF8.GetString(x);
        Console.Write(str);
        Console.ReadKey();
    }
}
It takes advantage of an interesting property of character encoding:
In ASCII, the 0-127 range contains the most used characters (a to z, 0 to 9), and the 128-255 range contains only special symbols and accents.
For compatibility reasons, in UTF-8 the 0-127 range contains the same characters as ASCII, and bytes in the 128-255 range have a special meaning (they tell the decoder that a character is encoded over multiple bytes).
All I do is flip every bit of each byte (XOR with 255), which in particular flips the high bit. Therefore everything in the 0-127 range ends up in the 128-255 range, and vice versa. Thanks to the property described above, no matter whether a text reader tries to parse the file as ASCII or as UTF-8, it will only get gibberish.
Please note that, while it doesn't produce human-readable content, it isn't secure at all. Don't use it to store sensitive data.
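A small worked example of the flip (the concrete values are my own illustration):
byte b = (byte)'s';                 // 0x73 = 115, a printable ASCII letter
byte flipped = (byte)(b ^ 255);     // 0x8C = 140, now in the 128-255 range
byte back = (byte)(flipped ^ 255);  // 0x73 again - XOR with the same mask undoes itself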
Notepad just reads your binary data and decodes it as UTF-8 text.
This code snippet would give you the same result.
byte[] randomBytes = new byte[20];
Random rand = new Random();
rand.NextBytes(randomBytes);
Console.WriteLine(Encoding.UTF8.GetString(randomBytes));
If you want to stop people from converting your data back to a string, then you need to encrypt it. Here is a project that can help you with that.
They will still be able to open the file in a text editor, because the editor will simply decode your encrypted data as UTF-8, but they can't convert it back into usable data unless they have the key to decrypt it.
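The linked project isn't reproduced here, but as a rough sketch of the idea (using the built-in System.Security.Cryptography.Aes class, not the project's API):
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class EncryptedSettingsSketch
{
    static void Main()
    {
        using (Aes aes = Aes.Create())
        {
            // A real application must persist or derive the key/IV somewhere safe;
            // here they are simply generated, which only works within one run.
            byte[] plain = Encoding.UTF8.GetBytes("settings");

            byte[] cipher;
            using (ICryptoTransform enc = aes.CreateEncryptor())
                cipher = enc.TransformFinalBlock(plain, 0, plain.Length);
            File.WriteAllBytes("AppSettings.dat", cipher); // a text editor sees only gibberish

            byte[] roundTrip;
            using (ICryptoTransform dec = aes.CreateDecryptor())
                roundTrip = dec.TransformFinalBlock(cipher, 0, cipher.Length);
            Console.WriteLine(Encoding.UTF8.GetString(roundTrip)); // "settings"
        }
    }
}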

Convert a String, which is already malformed

I have a class which uses another class that reads a text file.
The text file is written in ASCII, or to be precise, CP1252.
Background info: the text file is generated in Axapta, using the ASCIIio class, which writes the text with the writeRaw method.
The class I am using was written by a colleague, and it uses a C# StreamReader to read files. Normally this works fine because the files are written in UTF-8, but in this particular case they aren't.
So the StreamReader reads the file as UTF-8 and passes the resulting string to me.
I now have some letters, for example the Latin small letter o with diaeresis (ö), which aren't decoded as I need them to be.
A simple conversion of the string doesn't help in this case, and I can't figure out how to get the right letters.
So this is basically how he reads it:
char quotationChar = '"';
String line = "";
using (StreamReader reader = new StreamReader(fileName))
{
    if ((line = reader.ReadLine()) != null)
    {
        line = line.Replace(quotationChar.ToString(), "");
    }
}
return line;
What now happens is: in the text file I have the German word "Röhre", which, after reading it with the StreamReader, turns into R�hre (which looks stupid in a database).
I could try converting every letter:
Encoding enc = Encoding.GetEncoding(1252);
byte[] utf8_Bytes = new byte[line.Length];
for (int i = 0; i < line.Length; ++i)
{
    utf8_Bytes[i] = (byte)line[i];
}
String propEncodeString = enc.GetString(utf8_Bytes, 0, utf8_Bytes.Length);
That doesn't give me the right character!
byte[] myarr = Encoding.UTF8.GetBytes(line);
String propEncodeString = enc.GetString(myarr);
That also returns the wrong character.
I am aware that I could just solve the problem by using this:
using (StreamReader reader = new StreamReader(fileName, Encoding.Default, true))
But just for fun:
How can I get the right string from an already wrongly decoded string?
Once the bytes have been decoded with the wrong encoding, every byte sequence that isn't valid in that encoding is replaced with the same "bad data" replacement character, which means that data is simply lost and you can't just 'convert' it back to the correct character downstream. See this example: https://dotnetfiddle.net/XWysml
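As a small sketch of that lossy step, using the "Röhre" example from the question (the byte values assume CP1252 is the file's real encoding):
using System.Text;

// "Röhre" as it sits in the file when encoded with code page 1252.
byte[] fileBytes = { 0x52, 0xF6, 0x68, 0x72, 0x65 };

// Decoding with the wrong encoding replaces the invalid byte 0xF6 with U+FFFD...
string wrong = Encoding.UTF8.GetString(fileBytes);          // "R\uFFFDhre"

// ...and re-encoding only yields the replacement character's bytes, never 0xF6 again.
byte[] reEncoded = Encoding.UTF8.GetBytes(wrong);           // 0x52 0xEF 0xBF 0xBD 0x68 0x72 0x65

// The original bytes have to be decoded with the right encoding in the first place.
string right = Encoding.GetEncoding(1252).GetString(fileBytes); // "Röhre"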

How can I output arbitrary binary data as character representation in C#?

I'm trying to recreate the functionality of
slappasswd -h {md5}
on .NET.
I have this code in Perl:
use Digest::MD5;
use MIME::Base64;
$ctx = Digest::MD5->new;
$ctx->add('fredy');
print "Line $.: ", $ctx->clone->hexdigest, "\n";
print "Line $.: ", $ctx->digest, "\n";
$hashedPasswd = '{MD5}' . encode_base64($ctx->digest,'');
print $hashedPasswd . "\n";
I've tried to do the same in VB.NET, C#, etc., but the only part that works is the
$ctx->clone->hexdigest # result : b89845d7eb5f8388e090fcc151d618c8
part, in C#, using the MSDN sample:
static string GetMd5Hash(MD5 md5Hash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(input));

    // Create a new StringBuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}
With this code in a console app:
string source = "fredy";
using (MD5 md5Hash = MD5.Create())
{
    string hash = GetMd5Hash(md5Hash, source);
    Console.WriteLine("The MD5 hash of " + source + " is: " + hash + ".");
}
outputs: The MD5 hash of fredy is: b89845d7eb5f8388e090fcc151d618c8.
but I need to implement the $ctx->digest function; it outputs some binary data like
¸˜E×ë_ƒˆàüÁQÖÈ
This output is the same on Linux and Windows with Perl.
Any ideas?
Thanks
As I already said in my comment above, you are mixing some things up. What the digest in Perl creates is a set of bytes. When those are printed, Perl automatically converts them to a string representation, because (simplified) it assumes that if you print something it goes to a screen and you want to be able to read it. C# does not do that. That doesn't mean the Perl digest and the C# digest are not the same; just their representation is different.
You have already established that they are equal if you convert both of them to a hexadecimal representation.
Now what you need to do to get output in C# that looks like the string that Perl prints when you do this:
print $ctx->digest; # output: ¸˜E×ë_ƒˆàüÁQÖÈ
... is to convert the C# byte[] data to a string of characters.
That has been answered before, for example here: How to convert byte[] to string?
Using that technique, I believe your function to get it would look like this. Please note I am a Perl developer and I have no means of testing this. Consider it C#-like pseudo-code.
static string GetMd5PerlishString(MD5 md5Hash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(input));
    string result = System.Text.Encoding.UTF8.GetString(data);
    return result;
}
Now it should look the same.
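Side note: since the question's Perl code ultimately builds '{MD5}' . encode_base64($ctx->digest), the step you most likely want in C# is Base64 rather than a raw string. A hedged sketch (again untested against slappasswd):
static string GetLdapStyleMd5(MD5 md5Hash, string input)
{
    // Hash the password bytes, then Base64-encode the raw digest, as slappasswd does.
    byte[] data = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(input));
    return "{MD5}" + Convert.ToBase64String(data);
}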
Please also note that MD5 is no longer a secure hashing algorithm for passwords. Please do not use it to store user passwords!

Bit Array to String and back to Bit Array

Possible duplicate: Converting byte array to string and back again in C#
I am using Huffman Coding for compression and decompression of some text from here
The code in there builds a huffman tree to use it for encoding and decoding. Everything works fine when I use the code directly.
For my situation, I need to get the compressed content, store it, and decompress it whenever needed.
The output from the encoder and the input to the decoder are BitArrays.
When I try to convert this BitArray to a string and back to a BitArray and decode it, using the following code, I get a weird answer.
string input = Console.ReadLine();

Tree huffmanTree = new Tree();
huffmanTree.Build(input);
BitArray encoded = huffmanTree.Encode(input);

// Print the bits
Console.Write("Encoded Bits: ");
foreach (bool bit in encoded)
{
    Console.Write((bit ? 1 : 0) + "");
}
Console.WriteLine();

// Convert the bit array to bytes
Byte[] e = new Byte[(encoded.Length / 8 + (encoded.Length % 8 == 0 ? 0 : 1))];
encoded.CopyTo(e, 0);

// Convert the bytes to a string
string output = Encoding.UTF8.GetString(e);

// Convert the string back to bytes
e = Encoding.UTF8.GetBytes(output);

// Convert the bytes back to a bit array
BitArray todecode = new BitArray(e);

string decoded = huffmanTree.Decode(todecode);
Console.WriteLine("Decoded: " + decoded);
Console.ReadLine();
The output of the original code from the tutorial and the output of my code differ (screenshots omitted).
Where am I going wrong, friends? Help me, thanks in advance.
You cannot stuff arbitrary bytes into a string. That concept is just undefined. Conversions happen using Encoding.
string output = Encoding.UTF8.GetString(e);
e is just binary garbage at this point, it is not a UTF8 string. So calling UTF8 methods on it does not make sense.
Solution: Don't convert and back-convert to/from string. This does not round-trip. Why are you doing that in the first place? If you need a string use a round-trippable format like base-64 or base-85.
I'm pretty sure Encoding doesn't roundtrip - that is you can't encode an arbitrary sequence of bytes to a string, and then use the same Encoding to get bytes back and always expect them to be the same.
If you want to be able to roundtrip from your raw bytes to string and back to the same raw bytes, you'd need to use base64 encoding e.g.
http://blogs.microsoft.co.il/blogs/mneiter/archive/2009/03/22/how-to-encoding-and-decoding-base64-strings-in-c.aspx
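A minimal sketch of that Base64 round trip for a BitArray (the original bit count has to be stored alongside the string, because the byte array pads up to whole bytes):
using System;
using System.Collections;

static string BitsToBase64(BitArray bits)
{
    // Pack the bits into bytes, then Base64-encode those bytes.
    byte[] bytes = new byte[(bits.Length + 7) / 8];
    bits.CopyTo(bytes, 0);
    return Convert.ToBase64String(bytes);
}

static BitArray Base64ToBits(string base64, int bitCount)
{
    // Decode the bytes and trim off the padding bits added when packing.
    var bits = new BitArray(Convert.FromBase64String(base64));
    bits.Length = bitCount;
    return bits;
}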
