Best encode/decode for binary files in Java and C#

I know there are many types of encoding and decoding, and from what I have read, base64 is a great choice when it comes to encoding a binary file (image, mp3, video).
Now, when it comes to decode, I will need to convert from the base64 and then get the string value. To get the string after decoding, I do this (in C#):
System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Here I noticed that I have several choices for getting the string, such as ASCII, Unicode and Default.
The real question in this post: if I'm using Java to encode and C# to decode the binary file, which choice should I use? I have tried several methods, and some of the characters could not be read and came out as question marks (?).
The closest I have come to a readable result is with this in Java:
String encoded = Base64.encodeToString(fileData, Base64.CRLF);
and this in C#:
byte[] encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
string returnValue = System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Still, there are several characters that cannot be read. Does anyone have a solution for this problem? Any feedback is much appreciated. Thanks in advance.

The thing about binary files is that they are binary (type byte[]). Most of the time you can not convert the bytes directly to a string (using Encoding.GetString(byte[])), because some of them may have values which can not be represented in a string (which is what you are experiencing).
Converting binary data to a string with Encoding.GetString(byte[]) as a step towards BASE64 doesn't make sense at all, because you lose information when you convert the binary data to a string. You need to convert the byte[] directly to BASE64.
Converting a BASE64 string representation of a byte array to byte[] is OK - this gives you back the original binary data. However, converting this byte[] to string is not OK for the reason I've given above.
How BASE64 encoding is supposed to work is:
1. Get binary data as byte[]
2. Create BASE64 string from byte[]
3. Transfer BASE64 string
4. Create byte[] from BASE64 string
5. Continue working with byte[]
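The steps above can be sketched in Java roughly as follows. This is only a minimal illustration, assuming the standard java.util.Base64 class (Java 8+) rather than the Android Base64 class used in the question; the class and method names are made up for the example. The MIME decoder is used on the way back because it skips line breaks such as CRLF in the transferred text.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {
    // Steps 1-2: binary data -> Base64 text that is safe to move around as a string.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Step 4: Base64 text -> the original binary data.
    // The MIME decoder ignores characters outside the Base64 alphabet, e.g. CRLF.
    static byte[] decode(String base64) {
        return Base64.getMimeDecoder().decode(base64);
    }

    public static void main(String[] args) {
        // Hypothetical sample data: a PNG signature plus two extra bytes.
        byte[] original = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A, (byte) 0xFF, 0x00};
        String text = encode(original);      // step 3 would transfer this string
        byte[] restored = decode(text);      // step 5 keeps working with bytes
        System.out.println(text);
        System.out.println(Arrays.equals(original, restored));
    }
}
```

On the C# side the matching call would be Convert.FromBase64String, and the result stays a byte[].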

You state that the input is "image, mp3, video", so: arbitrary binary. You then state that you're using base-64, meaning: for some reason you need to transfer / store this data as a string (note: transfer / storage as raw binary would usually be preferred - base-64 has overhead).
Now, when it comes to decode, i will need to convert from the base64 and then get the string value.
There's the problem; there is no string value here. An "image, mp3, video" is simply not a "string value". What you can do is decode from the base-64 back to raw binary (trivial in either java or c#), but that is all you can do. If you needed a "string value" from raw binary, the only thing you can do is to re-encode it via base-64 (which would give you back what you started with), or some other base-n.
A text-encoding such as ASCII or UTF-8 only makes sense if the binary data is known to actually be text data stored in that encoding. You cannot use UTF-8 to "decode" binary that isn't actually UTF-8.
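To see why, here is a small Java sketch (the class and method names are invented for illustration) that pushes a few arbitrary bytes through a UTF-8 string and back. Bytes that are not valid UTF-8 are replaced during decoding, so the original data cannot be recovered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyTextDecode {
    // Decodes the bytes as if they were UTF-8 text, then encodes that text again.
    static byte[] throughUtf8String(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First bytes of a typical JPEG file; they are not valid UTF-8.
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] back = throughUtf8String(binary);
        System.out.println(Arrays.equals(binary, back)); // false: data was lost

        // Bytes that really are UTF-8 text survive the same trip.
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(text, throughUtf8String(text))); // true
    }
}
```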

If you want to get a string after you decode your data, that implies your data is somehow in text format. If this is the case, you need to know the file's original encoding, such as UTF-8. Then you can properly decode the strings. If your program only transfers files from one place to another without doing anything with their content, you had better leave them as bytes after decoding.

1. Convert the string object (Java or C#) to a byte array using UTF-8 (or some other encoding, if you have a reason for that).
2. You now have binary data, UTF-8 encoded text to be specific. If you need to transfer it somewhere that does not support raw binary data or UTF-8 text, or if you don't want to worry about some characters having special meaning like in XML, convert it to an ASCII string using base64 encoding.
3. Do whatever you wish with the ASCII string (base64 even allows some whitespace mangling etc.) to get it to the decoder.
4. Convert the ASCII string back to a byte array with base64 decode.
5. Convert the byte array back to a string object (C# or Java) using UTF-8 encoding.
If binary data or UTF-8 text is OK, you can skip steps 2 and 4. But steps 1 and 5 are needed, because in languages like C# and Java a string is "logical characters", not bytes you can store or transfer (of course it is bytes in memory, usually UTF-16 or UTF-32, but you should not care about that). It must be converted to bytes using some encoding. The UTF-x encodings are the only ones which don't lose any characters, and UTF-8 is the most space-efficient if most characters are from "western" alphabets.
One special thing about base64 is that, because its output is plain 7-bit ASCII characters, you can put base64 encoded text into a C#/Java string object and back into a base64 encoded byte array using practically any string encoding, since nearly all string encodings in common use can represent the 7-bit ASCII characters. So you can take image data, base64 encode it, and put the resulting text into a String object without worrying about encodings and corruption.
Steps for binary files:
1. Get contents of binary file like PNG image file to byte array.
2. Same as step 2 above, except data is not UTF-8.
3. Same as step 3 above
4. Same as step 4 above
5. You now have byte array containing the PNG file contents from step 1.
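As a rough Java sketch of the five text steps (the names pack and unpack are hypothetical, and UTF-8 is assumed on both ends):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class TextViaBase64 {
    // Steps 1-2: string -> UTF-8 bytes -> Base64 (ASCII-only) string.
    static String pack(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Steps 4-5: Base64 string -> UTF-8 bytes -> original string.
    static String unpack(String base64) {
        byte[] utf8 = Base64.getDecoder().decode(base64);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String packed = pack("h\u00e9llo \u2713");   // step 3: send this anywhere ASCII survives
        System.out.println(packed);
        System.out.println(unpack(packed).equals("h\u00e9llo \u2713")); // true
    }
}
```

For a binary file, the two UTF-8 conversions simply drop out and only the Base64 calls remain. The final new String(...) is only valid here because the bytes were produced from a string in the first place.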

Related

Best way to read text file into byte array in selected encoding?

Right now I use something like this:
Encoding.UTF8.GetBytes(File.ReadAllText(filename))
Any suggestions how to do that better?
And what encoding uses File.ReadAllBytes(filename) method?
P.S. I need utf-8 byte arrays to store text files in db
Character encoding is about storing text in binary form, as sequences of specific bytes for each character. Another way of thinking about it is that the encoding system is what gives meaning to some bytes. Without the context that some bytes represent text, the bytes are just bytes.
Files are just bytes too, and they can be interpreted however you want your application to interpret them.
When you decode bytes, you are giving meaning to those bytes according to the encoding system used. For text encodings, you start with bytes and end up with characters.
You can't "decode" bytes from a file into a byte array. That doesn't give meaning to the bytes or produce any characters.
You can decode bytes into strings using a specific encoding though:
string allLinesFromFileAsAuto = File.ReadAllText(filename);
string allLinesFromFileAsUTF8 = File.ReadAllText(filename, Encoding.UTF8);
string allLinesFromFileAsASCII = File.ReadAllText(filename, Encoding.ASCII);
All three of these methods convert bytes from the same file into strings, but the resulting strings will be different depending on the encoding you use.
And what encoding uses File.ReadAllBytes(filename) method?
File.ReadAllBytes(filename) does not use any encoding. Files are just bytes. This method pulls all of a file's bytes into a byte array. You still have to decode those bytes into strings after getting that byte array. But this only works well for plaintext files.
I need utf-8 byte arrays to store files in db
Is this because your database uses UTF-8 encoding?
The encoding of a database defines how text is stored (as binary).
Binary data can be stored as-is, byte-for-byte, as "blobs" in most databases, regardless of the encoding.
ReadAllText will try to infer the encoding of the file and convert it to .NET strings. Your first example will then convert those to UTF-8 bytes no matter what the source encoding was.
Depending on the size of the files, this could be costly, because the whole content ends up in memory twice (once as a string, once as bytes). You can instead read the source file in chunks and convert it piece by piece.
ReadAllBytes reads the raw file as a series of bytes; there's no encoding/decoding involved.
If you are storing non-text files in the database, you should not encode the file as UTF-8.
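The same distinction in Java, as a hedged sketch: the class name, method names and the choice of ISO-8859-1 as the source charset are assumptions for the example, and Java (unlike File.ReadAllText) does not try to detect the encoding, so it has to be passed in.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileBytes {
    // Raw bytes, no charset involved. Suitable for storing any file as a blob.
    static byte[] readRaw(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    // Text file in a known source charset -> UTF-8 bytes.
    static byte[] readAsUtf8(Path file, Charset sourceCharset) throws IOException {
        String text = new String(Files.readAllBytes(file), sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sample", ".txt");
        // "café" written as ISO-8859-1: the last character is the single byte 0xE9.
        Files.write(file, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(readRaw(file).length);                                // 4
        System.out.println(readAsUtf8(file, StandardCharsets.ISO_8859_1).length); // 5
        Files.delete(file);
    }
}
```

If the file is already UTF-8, readRaw alone gives the UTF-8 byte array and no decoding step is needed.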

when i decrypt base64 string in c# it shows me ????? question mark and some special character sign

I have a base64 string and I want to decrypt it through C# code, but when I apply the code below, it shows me question marks and different special characters.
I don't know why it is showing this.
byte[] data = Convert.FromBase64String(encpStr);
string decodedString = Encoding.UTF8.GetString(data);
I created a console project in Visual Studio to try this out. See my whole code below.
string encStr = "oJe6iooq+PbvArD+C7P7B/cHAAL9Dr2/vvIBFRcVCAYfxxEcygzMFB0eFNEWFC3VKibYLCknMiLeLzU7PC8pOPT19g==";
Console.WriteLine("******************\nEnc String:\n" + encStr + "\n\n\n**********************");
byte[] data = Convert.FromBase64String(encStr);
string decStr = Encoding.UTF8.GetString(data);
Console.WriteLine("\n\n\nDecr String: \n" + decStr + "\n\n");
Console.ReadKey();
By default many systems/languages display byte values that cannot be represented as characters using a ?, � or similar glyph.
Base64 is encoding, not encryption.
Base64 encoding is generally used when data needs to be handled as a character string but the data values are not representable as characters (or at least not all of them).
Data is generally a collection of 8-bit bytes, such as in an array. Examples include images, compiled computer code, encrypted data, etc.
Many byte values have no character representation and no displayable form. See ASCII, Unicode and in particular UTF-8, as well as Base64.
Base64 is used to encode binary data as text. If you decode Base64 data, you therefore get binary data. Most binary data cannot be represented as a string (or any other form of text) because it simply is not text.
Your Base64 is such an example. The result is not text, neither in UTF-8, nor in ASCII nor in any other encoding.
That's why you get a bunch of funny characters. They are placeholders for bytes that could not be decoded as characters.
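If the goal is just to look at what the Base64 string contains, a hex dump is more honest than a text decode. A minimal Java sketch, using the string from the question (the class name and the toHex helper are made up for the example):

```java
import java.util.Base64;

class DecodeToHex {
    // The Base64 string posted in the question.
    static final String SAMPLE =
        "oJe6iooq+PbvArD+C7P7B/cHAAL9Dr2/vvIBFRcVCAYfxxEcygzMFB0eFNEWFC3VKibYLCknMiLeLzU7PC8pOPT19g==";

    // Formats each byte as two lowercase hex digits, separated by spaces.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] data = Base64.getDecoder().decode(SAMPLE);
        System.out.println(data.length + " bytes");
        System.out.println(toHex(data));
    }
}
```

The dump starts with the byte a0, which cannot begin a UTF-8 character, so a UTF-8 decode is bound to produce replacement glyphs for this data.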

Why does this encrypted string have funny characters in it? It isn't readable?

I'm converting the encrypted text using UTF8, yet the resulting string has funny characters that I can't read, and I'm not sure whether I can send this text to the browser.
string message = "hello world";
var rsa = new RSACryptoServiceProvider(2048);
var c = new UTF8Encoding();
byte[] dataToEncrypt = c.GetBytes(message);
byte[] encryptedData = rsa.Encrypt(dataToEncrypt, false);
string output = c.GetString(encryptedData);
Console.WriteLine(output);
Console.ReadLine();
When I run the above, I get the following:
�VJI����J/;�>�:<�M����g�1�7�A.#�`J�s��~��)�Fn�����5�.���o���ҵ���jH3;G�<<��F�͗��~?�Y�#���j���6l{{�Y�$�]���nylz���X8u�\f�V1/�$�n+�\b��\b�fsAh՝G\n�\t���\b���6߇3����Ԕ���4��#هhI���'\0� T�n��|EϺ^7ú l��T\\!�w���QRWA%p��V\f��5�
I need to send this text back to the browser, or save it to a file and currently I'm not sure why I am getting these characters?
The problem is that you are taking an array of bytes that was not created by encoding text, and using it as if it was. You can only decode data that was created by encoding. If you decode arbitrary data, you end up with garbage.
If you want the binary data produced by the encryption as a string, use for example base64 encoding:
string output = Convert.ToBase64String(encryptedData);
When you want to decrypt the data, use Convert.FromBase64String to get the byte array back, decrypt it, and use Encoding.UTF8.GetString to turn it back into the original string. There it will work to decode the data, because it was created by encoding the string in the first place.
These two lines are pretending that the output of an RSA-encrypted UTF-8 sequence is a valid UTF-8 sequence:
var c = new UTF8Encoding();
string output = c.GetString(encryptedData);
But this is simply not the case: the RSA encryption maps byte values to other, (seemingly) arbitrary byte values. The resulting byte sequence doesn’t form a valid UTF-8 sequence (there is no reason to assume that it would), and thus cannot be treated as one.
If you merely want a readable (or HTTP sendable) representation of your data, then Base64 is the way to go, as shown in other answers. Fundamentally, though, you should probably read Joel’s article about The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Encrypting a string will result in a byte array that contains non-printable characters. You'll want to convert it to base64 to have a readable version of it.
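For comparison, the same flow in Java might look like the sketch below. It is an illustration only: the class and method names are invented, the key pair is generated on the spot, and the "RSA/ECB/PKCS1Padding" transformation is an assumption chosen to correspond to passing false (no OAEP) to RSACryptoServiceProvider.Encrypt.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Base64;
import javax.crypto.Cipher;

class RsaBase64Demo {
    static final String TRANSFORMATION = "RSA/ECB/PKCS1Padding";

    static KeyPair newKeyPair() throws GeneralSecurityException {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        return generator.generateKeyPair();
    }

    // text -> UTF-8 bytes -> encrypted bytes -> Base64 string
    static String encryptToBase64(String message, KeyPair keys) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.ENCRYPT_MODE, keys.getPublic());
        byte[] encrypted = cipher.doFinal(message.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(encrypted);
    }

    // Base64 string -> encrypted bytes -> UTF-8 bytes -> text
    static String decryptFromBase64(String base64, KeyPair keys) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.DECRYPT_MODE, keys.getPrivate());
        byte[] decrypted = cipher.doFinal(Base64.getDecoder().decode(base64));
        return new String(decrypted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        KeyPair keys = newKeyPair();
        String output = encryptToBase64("hello world", keys);
        System.out.println(output);                          // printable, safe to send or save
        System.out.println(decryptFromBase64(output, keys)); // hello world
    }
}
```

The UTF-8 decode only happens after decryption, when the bytes are once again the encoded form of the original text.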

How do I encode a Binary blob as Unicode blob?

I'm trying to store a Gzip serialized object into Active Directory's "Extension Attribute", more info here. This field is a Unicode string according to its oM syntax of 64.
What is the most efficient way to store a binary blob as Unicode? Once I get this down, the rest is a piece of cake.
There are, of course, many ways of reliably packing an arbitrary byte array into Unicode characters, but none of them are very efficient. It is very unfortunate that ActiveDirectory would choose to use Unicode for data that is not textual in nature. It’s like using a string to represent a 32-bit integer, or like using Nutella to write a love letter.
My recommendation would be to “play it safe” and use an ASCII-based encoding such as base64. The reason I recommend this is because there is already a built-in .NET implementation for this:
var base64Encoded = Convert.ToBase64String(byteArray);
var original = Convert.FromBase64String(base64Encoded);
In theory you could come up with an encoding that is more efficient than this by making use of more of the Unicode character set. However, in order to do so reliably, you would need to know quite a bit about Unicode.
Normally, this would be the way to convert between bytes and Unicode text:
// string from bytes
System.Text.Encoding.Unicode.GetString(bytes);
// bytes from string
System.Text.Encoding.Unicode.GetBytes(text);
EDIT:
But since not every possible byte sequence is a valid Unicode string, you should use a method that can create a string from an arbitrary byte sequence:
// string from bytes
Convert.ToBase64String(byteArray);
// bytes from string
Convert.FromBase64String(base64Encoded);
(Thanks to @Timwi who pointed this out!)
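Since efficiency was the question, it may help to estimate how much room the Base64 text needs before writing it to a size-limited attribute. Padded Base64 produces 4 characters for every 3 input bytes, rounded up. A small Java sketch of that arithmetic (the class and method names are hypothetical):

```java
import java.util.Base64;

class Base64Overhead {
    // Padded Base64 emits 4 characters for every 3 input bytes, rounded up.
    static int encodedLength(int byteCount) {
        return 4 * ((byteCount + 2) / 3);
    }

    public static void main(String[] args) {
        int[] sizes = {1, 3, 4, 1000};
        for (int size : sizes) {
            int actual = Base64.getEncoder().encodeToString(new byte[size]).length();
            System.out.println(size + " bytes -> " + encodedLength(size)
                    + " chars (encoder produced " + actual + ")");
        }
    }
}
```

So the text is about a third longer than the blob, and if the field stores each character as two bytes, the stored size is roughly 2.7 times the original.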

Converting a binary file to a string and vice versa

I created a webservice which returns a (binary) file. Unfortunately, I cannot use byte[] so I have to convert the byte array to a string.
What I do at the moment is the following (but it does not work):
Convert file to string:
byte[] arr = File.ReadAllBytes(fileName);
System.Text.UnicodeEncoding enc = new System.Text.UnicodeEncoding();
string fileAsString = enc.GetString(arr);
To check if this works properly, I convert it back via:
System.Text.UnicodeEncoding enc = new System.Text.UnicodeEncoding();
byte[] file = enc.GetBytes(fileAsString);
But at the end, the original byte array and the byte array created from the string aren't equal. Do I have to use another method to read the file to a byte array?
Use Convert.ToBase64String to convert it to text, and Convert.FromBase64String to convert back again.
Encoding is used to convert from text to a binary representation, and from a binary representation of text back to text again. In this case you don't have a binary representation of text - you just have arbitrary binary data... so Encoding is inappropriate. Even if you use an encoding which can "sort of" handle any binary data (e.g. ISO Latin 1) you'll find that many ways of transmitting text will fail when you've got control characters etc.
Base64 encoding will give you text which is just ASCII, and much easier to handle.
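The point about encodings that "sort of" work can be illustrated in Java, where ISO-8859-1 happens to map every byte to one character. This is only a sketch with made-up names: the Latin-1 string round-trips, but it is full of control characters, while the Base64 string is plain printable ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Latin1VersusBase64 {
    // True if every character is in the printable ASCII range 0x20..0x7E.
    static boolean isPrintableAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7E) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x07, 0x1B, (byte) 0xFF};  // NUL, BEL, ESC, 0xFF
        String latin1 = new String(data, StandardCharsets.ISO_8859_1);
        String base64 = Base64.getEncoder().encodeToString(data);

        // The Latin-1 string does round-trip back to the same bytes...
        System.out.println(Arrays.equals(data, latin1.getBytes(StandardCharsets.ISO_8859_1))); // true
        // ...but it contains control characters that many text channels mangle.
        System.out.println(isPrintableAscii(latin1)); // false
        System.out.println(isPrintableAscii(base64)); // true
    }
}
```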
