Best way to read text file into byte array in selected encoding? - c#

Now i use something like that:
Encoding.UTF8.GetBytes(File.ReadAllText(filename))
Any suggestions how to do that better?
And what encoding uses File.ReadAllBytes(filename) method?
P.S. I need utf-8 byte arrays to store text files in db

Best way to read file into byte array in selected encoding?
Character Encoding is about storing text in binary form, as sequences of specific bytes for each character. Another way of thinking about it is that the Encoding system is what gives meaning to some bytes. Without the context that some bytes represents text, the bytes are just bytes.
Files are just bytes too; And they can be interpreted however you want your application to interpret them.
When you decode bytes you are giving meaning to those bytes according the encoding system used. For text encodings, you start with bytes and end up with characters.
You can't "decode" bytes from a file into a byte array. That doesn't give meaning to the bytes or produce any characters.
You can decode bytes into strings using a specific encoding though:
string allLinesFromFileAsAuto = File.ReadAllText(filename);
string allLinesFromFileAsUTF8 = File.ReadAllText(filename, Encoding.UTF8);
string allLinesFromFileAsASCII = File.ReadAllText(filename, Encoding.ASCII);
All three of these methods convert bytes from the same file into strings, but the resulting strings will be different depending on the encoding you use.
And what encoding uses File.ReadAllBytes(filename) method?
File.ReadAllBytes(filename) does not use any encoding. Files are just bytes. This method pulls all of a file's bytes into a byte array. You still have to decode those bytes into strings after getting that byte array. But this only works well for plaintext files.
I need utf-8 byte arrays to store files in db
Is this because your database uses UTF-8 encoding?
The encoding of a database defines how text is stored (as binary).
Binary data can be stored as-is, byte-for-byte, as "blobs" in most databases, regardless of the encoding.

ReadAllText will try to infer the encoding of the file and convert it to .NET strings. Your first example will then convert those to UTF-8 bytes no matter what the source encoding was.
Depending on the size of the files, this could be costly to load it all to memory twice. You can do things to read chunks of the source file and convert it that way.
ReadAllBytes reads the raw file as a series of bytes, there's no encoding/decoding for that.
If you are storing non-text files in the database, you should not encode the file as UTF-8.

Related

Converting a string rappresentation of a file (byte array) back to a file in C#

As in the title I'm trying to convert back a string rappresentation of a bytearray to the original file where the bytes where taken.
What I've done:
I've a web service that gets a whole file and sends it:
answer.FileByte = File.ReadAllBytes(#"C:\QRY.txt");
After the serialization in the transmitted result xml I've this line:
<a:FileByte>TVNIfGF8MjAxMzAxMDF8YQ1QSUR8YXxhfGF8YXxhfGF8YXwyMDEzMDEwMXxhfGF8YXxhfGF8YXxhfGF8YXxhfGF8YXxhDVBWMXxhfGF8YXxhfGF8YXxhfGF8YXxhfDIwMTMwMTAxfDIwMTMwMTAxfDB8MHxhDQo=</a:FileByte>
I've tried to convert it back with this line in another simple application:
//filepath is the path of the file created
//bytearray is the string from the xml (copypasted)
File.WriteAllBytes(filepath, Encoding.UTF8.GetBytes(bytearray));
I've used UTF8 as enconding since the xml declares to use this charset. Keeping the datatype is not an option since I'm writing a simple utility to check the file conversion.
Maybe I'm missing something very basic but I'm not able to come up with a working solution.
This certainly isn't UTF8, the serializer probably converted it to Base64.
Use Convert.FromBase64String() to get your bytes back
Assuming that bytearray is the "TVNIfGF8M..." string, try:
string bytearray = ...;
File.WriteAllBytes(filepath, Convert.FromBase64String(bytearray));
UTF8 is a way to convert arbitrary text into bytes.
It was used by ReadAllText() to turn the bytes on disk back into XML.
You're seeing a mechanism to convert arbitrary bytes into text that can fit into XML. (that text is then convert to different bytes using UTF8).
It's probably Base64; use Convert.FromBase64String().

best encode decode for binary file in java and c#

i know there is many types of encode and decode and from what i have read, base64 is a great choice when it comes to encode binary file (image, mp3, video).
Now, when it comes to decode, i will need to convert from the base64 and then get the string value. the process to get the string after decode, i will require to do like this (in c#): System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
here i noticed that i have several choices on what to use to get the string, such as ASCII, UNICODE, DEFAULT.
the real question in this post is if im using java to encode and c# to decode the binary file, what is the best solution/choice should i use? i have tried several method and some of the character could not be read thus gives out question mark symbol (?).
however, the most closer encode decode that could be read the byte is when im using this in Java: String encoded = Base64.encodeToString(fileData, Base64.CRLF); meanwhile in c# im using like this: byte[] encodedDataAsBytes = System.Convert.FromBase64String(encodedData);
string returnValue = System.Text.Encoding.ASCII.GetString(encodedDataAsBytes);
Still, there are several character that cannot be read. Does anyone have solution for this problem statement? any feedback is much appreciated. thanks for advance.
The thing about binary files is that they are binary (type byte[]). Most of the time you can not convert the bytes directly to a string (using Encoding.GetString(byte[])), because some of them may have values which can not be represented in a string (which is what you are experiencing).
Converting binary data to string using Encoding.GetString(byte[]) to convert it to BASE64 doesn't make sense at all as you lose information when converting the binary information to string - you'd need to convert it directly to BASE64.
Converting a BASE64 string representation of a byte array to byte[] is OK - this gives you back the original binary data. However, converting this byte[] to string is not OK for the reason I've given above.
How BASE64 encoding is supposed to work is:
Get binary data as byte[]
Create BASE64 string from byte[]
Transfer BASE64 string
Create byte[] from BASE64 string
Continue working with byte[]
You state that that input is "image, mp3, video", so: arbitrary binary. You then state that you're using base-64, meaning: for some reason you need to transfer / store this data as a string (note: transfer / storage as raw binary would usually be preferred - base-64 has overhead).
Now, when it comes to decode, i will need to convert from the base64 and then get the string value.
There's the problem; there is no string value here. An "image, mp3, video" is simply not a "string value". What you can do is decode from the base-64 back to raw binary (trivial in either java or c#), but that is all you can do. If you needed a "string value" from raw binary, the only thing you can do is to re-encode it via base-64 (which would give you back what you started with), or some other base-n.
A text-encoding such as ASCII or UTF-8 only makes sense if the binary data is known to actually be text data stored in that encoding. You cannot use UTF-8 to "decode" binary that isn't actually UTF-8.
If you want to get string after you decode your data, it implies that your data in somehow in text format.If this is the case you should have the knowledge of the file's initial encoding, such as UTF-8. Then you can properly decode the strings. If your program only transfer files from one place to another without doing anything with its content, you better leave them as you decode.
Convert string object (Java or C#) to byte array using UTF-8 (or some other, if you have a reason for that) encoding.
You now have binary data, UTF-8 encoded text to be specific. If you need to transfer it somewhere, which does not support raw binary data or UTF-8 text or if you don't want to worry about some characters having special meaning like in XML, convert it to ASCII string using base64 encoding.
Do whatever you wish with the ASCII string (base64 even allows some whitespace mangling etc) to get it to decoder.
Convert ASCII string back to byte array with base64 decode.
Convert byte array back to string object (C# or Java) using UTF-8 encoding.
If binary data or UTF-8 text is ok, you can skip steps 2 and 4. But 1 and 5 are needed, because in languages like C# and Java, string is "logical characters", it is not bytes you can store or transfer (of course it's bytes in memory, usually UTF-16 or UTF-32, but you should not care about that). It must be converted to bytes using some encoding. UTF-x are the only ones which don't lose any characters, and UTF-8 is most space-efficient if most characters are from "western" alphabets.
One special thing about base64 is, that while it is actually 7-bit ASCII characters, you can put base64 encoded text to C#/Java string object and back to base64 encoded byte array using any string encoding, since all string encodings in use are superset of 7-bit ASCII. So you can take image data, base64 encode it, and put resulting text to String object without worries about encodings and corruption.
Steps for binary files:
Get contents of binary file like PNG image file to byte array.
Same as step 2 above, except data is not UTF-8.
Same as step 3 above
Same as step 4 above
You now have byte array containing the PNG file contents from step 1.

Convert file decoded in 64 bit to 32 and rewrite it in C#

I have a zip file decoded in base 64bit (it's a string). I want to take that string, convert it to 32bit and create the zip file. How can i do it?
Edited.
Check out http://www.atrevido.net/blog/2004/01/13/Base32%2BIn%2BNET.aspx if you need a base32 representation of your bytes.
If you just need to create a zip file from at base64 encoded string, convert it to byte[] and write it to a zip stream:
byte[] bytes = Convert.FromBase64String(base64String);
GZipStream stream..
stream.Write(bytes,0,bytes.length);
The base 64 string contains a representation of your bytes - it's not a 64 bit representation, it's a 64 characters representation: http://en.wikipedia.org/wiki/Base64
If you decoded a file in a String object, then you've got the work done. Now the string is represented by .NET runtime in some way you really should not care about.
Maybe you have some constraint on character encoding, that is ASCII, utf-8, and so on. If that is the case, you can use Encoding class and its utility methods for conversion.
Encoding.Convert() on MSDN

How do I encode a Binary blob as Unicode blob?

I'm trying to store a Gzip serialized object into Active Directory's "Extension Attribute", more info here. This field is a Unicode string according to it's oM syntax of 64.
What is the most efficient way to store a binary blob as Unicode? Once I get this down, the rest is a piece of cake.
There are, of course, many ways of reliably packing an arbitrary byte array into Unicode characters, but none of them are very efficient. It is very unfortunate that ActiveDirectory would choose to use Unicode for data that is not textual in nature. It’s like using a string to represent a 32-bit integer, or like using Nutella to write a love letter.
My recommendation would be to “play it safe” and use an ASCII-based encoding such as base64. The reason I recommend this is because there is already a built-in .NET implementation for this:
var base64Encoded = Convert.ToBase64String(byteArray);
var original = Convert.FromBase64String(base64Encoded);
In theory you could come up with an encoding that is more efficient than this by making use of more of the Unicode character set. However, in order to do so reliably, you would need to know quite a bit about Unicode.
Normally, this would be the way to convert between bytes and Unicode text:
// string from bytes
System.Text.Encoding.Unicode.GetString(bytes);
// bytes from string
System.Text.Encoding.Unicode.GetBytes(bytes);
EDIT:
But since not every possible byte sequence is a valid Unicode string, you should use a method that can create a string from an arbitrary byte sequence:
// string from bytes
Convert.ToBase64String(byteArray);
// bytes from string
Convert.FromBase64String(base64Encoded);
(Thanks to #Timwi who pointed this out!)

Converting a binary file to a string and vice versa

I created a webservice which returns a (binary) file. Unfortunately, I cannot use byte[] so I have to convert the byte array to a string.
What I do at the moment is the following (but it does not work):
Convert file to string:
byte[] arr = File.ReadAllBytes(fileName);
System.Text.UnicodeEncoding enc = new System.Text.UnicodeEncoding();
string fileAsString = enc.GetString(arr);
To check if this works properly, I convert it back via:
System.Text.UnicodeEncoding enc = new System.Text.UnicodeEncoding();
byte[] file = enc.GetBytes(fileAsString);
But at the end, the original byte array and the byte array created from the string aren't equal. Do I have to use another method to read the file to a byte array?
Use Convert.ToBase64String to convert it to text, and Convert.FromBase64String to convert back again.
Encoding is used to convert from text to a binary representation, and from a binary representation of text back to text again. In this case you don't have a binary representation of text - you just have arbitrary binary data... so Encoding is inappropriate. Even if you use an encoding which can "sort of" handle any binary data (e.g. ISO Latin 1) you'll find that many ways of transmitting text will fail when you've got control characters etc.
Base64 encoding will give you text which is just ASCII, and much easier to handle.

Categories

Resources