Converting a binary file to a string and vice versa

I created a webservice which returns a (binary) file. Unfortunately, I cannot use byte[] so I have to convert the byte array to a string.
What I do at the moment is the following (but it does not work):
Convert file to string:
byte[] arr = File.ReadAllBytes(fileName);
System.Text.UnicodeEncoding enc = new System.Text.UnicodeEncoding();
string fileAsString = enc.GetString(arr);
To check if this works properly, I convert it back via:
System.Text.UnicodeEncoding enc = new System.Text.UnicodeEncoding();
byte[] file = enc.GetBytes(fileAsString);
But at the end, the original byte array and the byte array created from the string aren't equal. Do I have to use another method to read the file to a byte array?

Use Convert.ToBase64String to convert it to text, and Convert.FromBase64String to convert back again.
Encoding is used to convert from text to a binary representation, and from a binary representation of text back to text again. In this case you don't have a binary representation of text - you just have arbitrary binary data... so Encoding is inappropriate. Even if you use an encoding which can "sort of" handle any binary data (e.g. ISO Latin 1) you'll find that many ways of transmitting text will fail when you've got control characters etc.
Base64 encoding will give you text which is just ASCII, and much easier to handle.
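To make the difference concrete, here is a minimal sketch comparing the two round trips. The class and method names (RoundTripDemo, ViaBase64, ViaUnicodeEncoding) are made up for illustration; only the Convert and Encoding calls come from the framework.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class RoundTripDemo
{
    // Binary -> Base64 text -> binary. Lossless for any input.
    public static byte[] ViaBase64(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text);
    }

    // Binary -> "decoded" UTF-16 string -> binary.
    // Lossy whenever the bytes are not valid UTF-16, because the
    // decoder substitutes a replacement character for invalid input.
    public static byte[] ViaUnicodeEncoding(byte[] data)
    {
        string text = Encoding.Unicode.GetString(data);
        return Encoding.Unicode.GetBytes(text);
    }
}
```

For the byte sequence 00 D8 41 00 (a lone UTF-16 surrogate followed by 'A'), ViaBase64 returns the input unchanged, while ViaUnicodeEncoding returns different bytes.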

Related

Best way to read text file into byte array in selected encoding?

Right now I use something like this:
Encoding.UTF8.GetBytes(File.ReadAllText(filename))
Any suggestions how to do that better?
And what encoding does the File.ReadAllBytes(filename) method use?
P.S. I need UTF-8 byte arrays to store text files in a database.

Best way to read file into byte array in selected encoding?
Character encoding is about storing text in binary form, as a specific sequence of bytes for each character. Another way of thinking about it is that the encoding is what gives meaning to the bytes. Without the context that some bytes represent text, the bytes are just bytes.
Files are just bytes too, and they can be interpreted however your application wants to interpret them.
When you decode bytes, you give meaning to those bytes according to the encoding used. For text encodings, you start with bytes and end up with characters.
You can't "decode" bytes from a file into a byte array. That doesn't give meaning to the bytes or produce any characters.
You can decode bytes into strings using a specific encoding though:
string allLinesFromFileAsAuto = File.ReadAllText(filename);
string allLinesFromFileAsUTF8 = File.ReadAllText(filename, Encoding.UTF8);
string allLinesFromFileAsASCII = File.ReadAllText(filename, Encoding.ASCII);
All three of these calls convert the bytes of the same file into strings, but the resulting strings can differ depending on the encoding you use.
And what encoding does the File.ReadAllBytes(filename) method use?
File.ReadAllBytes(filename) does not use any encoding. Files are just bytes. This method pulls all of a file's bytes into a byte array. You still have to decode those bytes into strings after getting that byte array. But this only works well for plaintext files.
I need utf-8 byte arrays to store files in db
Is this because your database uses UTF-8 encoding?
The encoding of a database defines how text is stored (as binary).
Binary data can be stored as-is, byte-for-byte, as "blobs" in most databases, regardless of the encoding.
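As a rough sketch of the distinction between raw bytes and decoded text: the helper names below (FileEncodingDemo, RawBytes, Decode) are hypothetical, and the example assumes a small file that fits in memory.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class FileEncodingDemo
{
    // Returns the file's bytes exactly as stored. No encoding is involved.
    public static byte[] RawBytes(string path)
    {
        return File.ReadAllBytes(path);
    }

    // Interprets the same bytes as text under the given encoding.
    // The result depends on the encoding, not only on the file.
    public static string Decode(string path, Encoding encoding)
    {
        return encoding.GetString(File.ReadAllBytes(path));
    }
}
```

If a file holds the five UTF-8 bytes of "café" (63 61 66 C3 A9), decoding with Encoding.UTF8 yields the four-character word, while Encoding.ASCII cannot represent the last two bytes and substitutes a question mark for each. If the file is already UTF-8 text, RawBytes is all you need to store it in a database.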

ReadAllText will try to infer the encoding of the file and convert its contents to a .NET string. Your first example will then convert that string to UTF-8 bytes no matter what the source encoding was.
Depending on the size of the files, loading everything into memory twice (once as a string, once as a byte array) could be costly. You can instead read the source file in chunks and convert it piece by piece.
ReadAllBytes reads the raw file as a series of bytes; there is no encoding or decoding involved.
If you are storing non-text files in the database, you should not encode the file as UTF-8.
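One way to do the chunked conversion is sketched below. The Transcoder class, its ToUtf8 method, and the 4096-character buffer size are assumptions made for this example; the caller has to know the source encoding (a byte order mark, if present, overrides it).

```csharp
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Transcoder
{
    // Re-encodes a text file as UTF-8 (without a byte order mark),
    // holding only one buffer of characters in memory at a time.
    public static void ToUtf8(string sourcePath, Encoding sourceEncoding, string targetPath)
    {
        using (var reader = new StreamReader(sourcePath, sourceEncoding))
        using (var writer = new StreamWriter(targetPath, false, new UTF8Encoding(false)))
        {
            var buffer = new char[4096]; // arbitrary chunk size
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }
}
```

Only use this for text files. For non-text files, copy the bytes unchanged.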

Converting a string representation of a file (byte array) back to a file in C#

As in the title, I'm trying to convert a string representation of a byte array back to the original file the bytes were taken from.
What I've done:
I've a web service that gets a whole file and sends it:
answer.FileByte = File.ReadAllBytes(@"C:\QRY.txt");
After the serialization in the transmitted result xml I've this line:
<a:FileByte>TVNIfGF8MjAxMzAxMDF8YQ1QSUR8YXxhfGF8YXxhfGF8YXwyMDEzMDEwMXxhfGF8YXxhfGF8YXxhfGF8YXxhfGF8YXxhDVBWMXxhfGF8YXxhfGF8YXxhfGF8YXxhfDIwMTMwMTAxfDIwMTMwMTAxfDB8MHxhDQo=</a:FileByte>
I've tried to convert it back with this line in another simple application:
//filepath is the path of the file created
//bytearray is the string from the xml (copypasted)
File.WriteAllBytes(filepath, Encoding.UTF8.GetBytes(bytearray));
I've used UTF8 as the encoding since the XML declares that it uses this charset. Keeping the datatype is not an option since I'm writing a simple utility to check the file conversion.
Maybe I'm missing something very basic but I'm not able to come up with a working solution.

This certainly isn't UTF8, the serializer probably converted it to Base64.
Use Convert.FromBase64String() to get your bytes back
Assuming that bytearray is the "TVNIfGF8M..." string, try:
string bytearray = ...;
File.WriteAllBytes(filepath, Convert.FromBase64String(bytearray));
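Wrapped up as a small utility, that could look like the sketch below. The class and method names (Base64FileRestorer, Restore) are made up for this example.

```csharp
using System;
using System.IO;

// Hypothetical helper class, for illustration only.
public static class Base64FileRestorer
{
    // Decodes Base64 text (for example, the content of the <a:FileByte>
    // element) back into the original bytes and writes them to disk.
    public static void Restore(string base64Text, string filePath)
    {
        byte[] original = Convert.FromBase64String(base64Text);
        File.WriteAllBytes(filePath, original);
    }
}
```

The first eight characters of the string in the question, "TVNIfGF8", decode to the six ASCII bytes of "MSH|a|", which suggests the payload really is Base64 of a text file.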

UTF8 is a way to convert arbitrary text into bytes.
It was used by ReadAllText() to turn the bytes on disk back into XML.
You're seeing a mechanism to convert arbitrary bytes into text that can fit into XML (that text is then converted to different bytes using UTF8).
It's probably Base64; use Convert.FromBase64String().

Why does this encrypted string have funny characters in it? It isn't readable?

I'm converting the encrypted text using UTF8, yet the resulting string has funny characters that I can't read, and I'm not sure whether I can send this text to the browser.
string message = "hello world";
var rsa = new RSACryptoServiceProvider(2048);
var c = new UTF8Encoding();
byte[] dataToEncrypt = c.GetBytes(message);
byte[] encryptedData = rsa.Encrypt(dataToEncrypt, false);
string output = c.GetString(encryptedData);
Console.WriteLine(output);
Console.ReadLine();
When I run the above, I get the following:
�VJI����J/;�>�:<�M����g�1�7�A.#�`J�s��~��)�Fn�����5�.���o���ҵ���jH3;G�<<��F�͗��~?�Y�#���j���6l{{�Y�$�]���nylz���X8u�\f�V1/�$�n+�\b��\b�fsAh՝G\n�\t���\b���6߇3����Ԕ���4��#هhI���'\0� T�n��|EϺ^7ú l��T\\!�w���QRWA%p��V\f��5�
I need to send this text back to the browser, or save it to a file and currently I'm not sure why I am getting these characters?

The problem is that you are taking an array of bytes that was not created by encoding text and using it as if it were. You can only decode data that was created by encoding; if you decode arbitrary data, you end up with garbage.
If you want the binary data produced by the encryption as a string, use for example base64 encoding:
string output = Convert.ToBase64String(encryptedData);
When you want to decrypt the data, use Convert.FromBase64String to get the byte array back, decrypt it, and use Encoding.UTF8.GetString to turn it back into the original string. There it will work to decode the data, because it was created by encoding the string in the first place.
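A minimal sketch of the full round trip follows. Note the assumptions: it uses the RSA base class with RSAEncryptionPadding.Pkcs1 instead of the RSACryptoServiceProvider and the `false` flag from the question, and the helper names (RsaTextDemo, EncryptToBase64, DecryptFromBase64) are invented for this example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class RsaTextDemo
{
    // text -> UTF-8 bytes -> RSA ciphertext -> Base64 text
    public static string EncryptToBase64(RSA rsa, string message)
    {
        byte[] plain = Encoding.UTF8.GetBytes(message);
        byte[] cipher = rsa.Encrypt(plain, RSAEncryptionPadding.Pkcs1);
        return Convert.ToBase64String(cipher);
    }

    // Base64 text -> RSA ciphertext -> UTF-8 bytes -> text
    public static string DecryptFromBase64(RSA rsa, string base64)
    {
        byte[] cipher = Convert.FromBase64String(base64);
        byte[] plain = rsa.Decrypt(cipher, RSAEncryptionPadding.Pkcs1);
        return Encoding.UTF8.GetString(plain);
    }
}
```

The same RSA instance (or at least the same key pair) has to be used for both calls. The Base64 string is plain ASCII, so it can be sent to a browser or saved to a text file.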

These two lines are pretending that the output of an RSA-encrypted UTF-8 sequence is a valid UTF-8 sequence:
var c = new UTF8Encoding();
string output = c.GetString(encryptedData);
But this is simply not the case: the RSA encryption maps byte values to other, (seemingly) arbitrary byte values. The resulting byte sequence doesn’t form a valid UTF-8 sequence (there is no reason to assume that it would), and thus cannot be treated as one.
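If you want to see this for yourself, a strict decoder can tell you whether a byte array is valid UTF-8. The sketch below uses the UTF8Encoding constructor with throwOnInvalidBytes set to true; the wrapper names (Utf8Check, IsValidUtf8) are hypothetical.

```csharp
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Utf8Check
{
    // Returns true only if the bytes form a valid UTF-8 sequence.
    public static bool IsValidUtf8(byte[] data)
    {
        // Second argument: throw on invalid bytes instead of
        // silently substituting replacement characters.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The default Encoding.UTF8 does not throw; it silently replaces invalid sequences, which is where the funny characters in the output come from.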
If you merely want a readable (or HTTP sendable) representation of your data, then Base64 is the way to go, as shown in other answers. Fundamentally, though, you should probably read Joel’s article about The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Encrypting a string will result in a byte array that contains non-printable characters. You'll want to convert it to base64 to have a readable version of it.

Best encoding and decoding for binary files in Java and C#

I know there are many ways to encode and decode data, and from what I have read, base64 is a good choice for encoding binary files (images, MP3s, video).
Now, when it comes to decoding, I will need to convert from the base64 and then get the string value. To get the string after decoding, I do this in C#:
System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Here I noticed that I have several choices for getting the string, such as ASCII, UNICODE, and DEFAULT.
The real question in this post is: if I use Java to encode and C# to decode the binary file, which choice should I use? I have tried several methods, and some of the characters could not be read and came out as question marks (?).
The closest I have come to readable output is with this in Java:
String encoded = Base64.encodeToString(fileData, Base64.CRLF);
and this in C#:
byte[] encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
string returnValue = System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Still, several characters cannot be read. Does anyone have a solution to this problem? Any feedback is much appreciated. Thanks in advance.

The thing about binary files is that they are binary (type byte[]). Most of the time you can not convert the bytes directly to a string (using Encoding.GetString(byte[])), because some of them may have values which can not be represented in a string (which is what you are experiencing).
Decoding binary data to a string with Encoding.GetString(byte[]) and then converting that string to BASE64 doesn't make sense at all, because you lose information in the first step. You need to convert the bytes directly to BASE64.
Converting a BASE64 string representation of a byte array to byte[] is OK - this gives you back the original binary data. However, converting this byte[] to string is not OK for the reason I've given above.
How BASE64 encoding is supposed to work is:
1. Get binary data as byte[]
2. Create BASE64 string from byte[]
3. Transfer BASE64 string
4. Create byte[] from BASE64 string
5. Continue working with byte[]
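On the C# side, the last two steps can be sketched as follows. The names (Base64Receiver, SaveReceivedFile) are hypothetical. As far as the line breaks from the Java side are concerned, Convert.FromBase64String ignores white space such as CR and LF, so the received text should not need any cleanup first.

```csharp
using System;
using System.IO;

// Hypothetical helper class, for illustration only.
public static class Base64Receiver
{
    // Create byte[] from the BASE64 string, then keep working
    // with the bytes (here: write them to a file) without ever
    // turning them into a string.
    public static byte[] SaveReceivedFile(string base64FromJava, string targetPath)
    {
        byte[] data = Convert.FromBase64String(base64FromJava);
        File.WriteAllBytes(targetPath, data);
        return data;
    }
}
```

The key point is what is missing: there is no Encoding.GetString call anywhere.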

You state that the input is "image, mp3, video", so: arbitrary binary. You then state that you're using base-64, meaning: for some reason you need to transfer / store this data as a string (note: transfer / storage as raw binary would usually be preferred - base-64 has overhead).
Now, when it comes to decoding, I will need to convert from the base64 and then get the string value.
There's the problem; there is no string value here. An "image, mp3, video" is simply not a "string value". What you can do is decode from the base-64 back to raw binary (trivial in either java or c#), but that is all you can do. If you needed a "string value" from raw binary, the only thing you can do is to re-encode it via base-64 (which would give you back what you started with), or some other base-n.
A text-encoding such as ASCII or UTF-8 only makes sense if the binary data is known to actually be text data stored in that encoding. You cannot use UTF-8 to "decode" binary that isn't actually UTF-8.

If you want to get a string after you decode your data, it implies that your data is somehow in text format. If this is the case, you need to know the file's original encoding, such as UTF-8. Then you can decode the strings properly. If your program only transfers files from one place to another without doing anything with their content, you had better leave them as the bytes you get from decoding.

1. Convert string object (Java or C#) to byte array using UTF-8 (or some other, if you have a reason for that) encoding.
2. You now have binary data, UTF-8 encoded text to be specific. If you need to transfer it somewhere, which does not support raw binary data or UTF-8 text or if you don't want to worry about some characters having special meaning like in XML, convert it to ASCII string using base64 encoding.
3. Do whatever you wish with the ASCII string (base64 even allows some whitespace mangling etc) to get it to decoder.
4. Convert ASCII string back to byte array with base64 decode.
5. Convert byte array back to string object (C# or Java) using UTF-8 encoding.
If binary data or UTF-8 text is ok, you can skip steps 2 and 4. But 1 and 5 are needed, because in languages like C# and Java, string is "logical characters", it is not bytes you can store or transfer (of course it's bytes in memory, usually UTF-16 or UTF-32, but you should not care about that). It must be converted to bytes using some encoding. UTF-x are the only ones which don't lose any characters, and UTF-8 is most space-efficient if most characters are from "western" alphabets.
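In C#, the five text steps collapse into two small functions. This is only a sketch; the names (TextOverBase64, Encode, Decode) are made up for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class TextOverBase64
{
    // Steps 1 and 2: string -> UTF-8 bytes -> Base64 (ASCII) text.
    public static string Encode(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Convert.ToBase64String(utf8);
    }

    // Steps 4 and 5: Base64 text -> UTF-8 bytes -> string.
    public static string Decode(string base64)
    {
        byte[] utf8 = Convert.FromBase64String(base64);
        return Encoding.UTF8.GetString(utf8);
    }
}
```

For example, Encode("hello world") gives "aGVsbG8gd29ybGQ=", and Decode turns that back into "hello world". Step 3 is whatever transport sits between the two calls.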
One special thing about base64 is that, while it consists only of 7-bit ASCII characters, you can put base64-encoded text into a C#/Java string object and convert it back to a base64-encoded byte array using practically any string encoding, since practically all string encodings in common use can represent the 7-bit ASCII characters. So you can take image data, base64-encode it, and put the resulting text into a String object without worrying about encodings and corruption.
Steps for binary files:
1. Get contents of binary file like PNG image file to byte array.
2. Same as step 2 above, except data is not UTF-8.
3. Same as step 3 above
4. Same as step 4 above
5. You now have byte array containing the PNG file contents from step 1.

How to convert string to base64 byte array, would this be valid?

I'm trying to write a function that converts a string to a base64 byte array. I've tried with this approach:
public byte[] stringToBase64ByteArray(String input)
{
byte[] ret = System.Text.Encoding.Unicode.GetBytes(input);
string s = Convert.ToBase64String(ret);
ret = System.Text.Encoding.Unicode.GetBytes(s);
return ret;
}
Would this function produce a valid result (provided that the string is in unicode)?
Thanks!

You can use:
From byte[] to string:
byte[] array = somebytearray;
string result = Convert.ToBase64String(array);
From string to byte[]:
array = Convert.FromBase64String(result);

Looks okay, although the approach is strange. But use Encoding.ASCII.GetBytes() to convert the base64 string to byte[]. Base64 encoding only contains ASCII characters. Using Unicode gets you an extra 0 byte for each character.
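A sketch of the function with that change applied is below. The class name Base64Bytes and the method name are placeholders; the first GetBytes call is kept as Unicode (UTF-16) as in the question.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Base64Bytes
{
    public static byte[] StringToBase64ByteArray(string input)
    {
        // UTF-16 little-endian bytes of the source string.
        byte[] utf16 = Encoding.Unicode.GetBytes(input);
        // Base64 text representing those bytes.
        string base64 = Convert.ToBase64String(utf16);
        // Base64 text is pure ASCII, so one byte per character is enough.
        return Encoding.ASCII.GetBytes(base64);
    }
}
```

For the input "hi", the Base64 text is "aABpAA==", so this returns 8 bytes. Using Encoding.Unicode for the last step would return 16 bytes, every second one being zero.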

Representing a string as a blob represented as a string is odd... any reason you can't just use the string directly?
The string is always unicode; it is the encoded bytes that change. Since base-64 is always <128, using unicode in the last part seems overkill (unless that is what the wire-format demands). Personally, I'd use UTF8 or ASCII for the last GetBytes so that each base-64 character only takes one byte.

All strings in .NET are unicode. This code will produce a valid result, but the consumer of the BASE64 string should also be unicode enabled.

Yes, it would output a base64-encoded string of the UTF-16 little-endian representation of your source string. Keep in mind that, AFAIK, it's not really common to use UTF-16 in base64; ASCII or UTF-8 is normally used. However, the important thing here is that the sender and the receiver agree on which encoding must be used.
I don't understand why you convert the base64 string back into an array of bytes: base64 is used to avoid encoding incompatibilities when transmitting, so you should keep it as a string and output it in the format required by the protocol you use to transmit the data. And, as Marc said, it's definitely overkill to use UTF-16 for that purpose, since base64 includes only 64 characters, all under 128.
