Prevent File.WriteAllText from writing a double Byte-Order Mark (BOM)? - c#

In the following example, if the string text starts with a BOM, File.WriteAllText() adds another one, so the file ends up with two BOMs.
I want to write the text twice:
Without a BOM at all
With a single BOM (if applicable to the encoding)
What is the canonical way to achieve this?
HttpWebResponse response = ...
byte[] bytes = ... // bytes from the response, possibly including a BOM
var encoding = Encoding.GetEncoding(
response.CharacterSet,
new EncoderExceptionFallback(),
new DecoderExceptionFallback()
);
string text = encoding.GetString(bytes); // will preserve BOM if any
System.IO.File.WriteAllText(fileName, text, encoding);

You are decoding and then re-encoding the data, which is unnecessary here.
The Encoding class has a GetPreamble() method that returns the preamble (called the BOM for the UTF-* encodings) as a byte[]. We can check whether the received byte array already starts with this prefix and, based on that, write the two versions of the file, adding or removing the prefix as necessary.
// Requires: using System.IO; using System.Linq; using System.Text;
var encoding = Encoding.GetEncoding(response.CharacterSet, new EncoderExceptionFallback(), new DecoderExceptionFallback());
// This will throw if the data is not valid in this encoding
encoding.GetCharCount(bytes);
byte[] preamble = encoding.GetPreamble();
bool hasPreamble = bytes.Take(preamble.Length).SequenceEqual(preamble);
if (hasPreamble)
{
    File.WriteAllBytes("WithPreamble.txt", bytes);
    // File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes
    using (var fs = File.Create("WithoutPreamble.txt"))
    {
        fs.Write(bytes, preamble.Length, bytes.Length - preamble.Length);
    }
}
else
{
    File.WriteAllBytes("WithoutPreamble.txt", bytes);
    using (var fs = File.Create("WithPreamble.txt"))
    {
        fs.Write(preamble, 0, preamble.Length);
        fs.Write(bytes, 0, bytes.Length);
    }
}
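The same preamble check can also be packaged as in-memory helpers, which may be convenient when the bytes are not written to disk straight away. This is only a sketch: the class and method names (PreambleHelper, HasPreamble, WithoutPreamble, WithPreamble) are hypothetical and not part of the answer above.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not from the original answer.
public static class PreambleHelper
{
    // True when the encoding defines a preamble (BOM) and data starts with it.
    public static bool HasPreamble(byte[] data, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length > 0
            && data.Length >= preamble.Length
            && data.Take(preamble.Length).SequenceEqual(preamble);
    }

    // Returns the data without its preamble (the same array when there is none).
    public static byte[] WithoutPreamble(byte[] data, Encoding encoding)
    {
        return HasPreamble(data, encoding)
            ? data.Skip(encoding.GetPreamble().Length).ToArray()
            : data;
    }

    // Returns the data with exactly one preamble at the front.
    // For encodings without a preamble (such as ASCII) the data is unchanged.
    public static byte[] WithPreamble(byte[] data, Encoding encoding)
    {
        return HasPreamble(data, encoding)
            ? data
            : encoding.GetPreamble().Concat(data).ToArray();
    }
}
```

Both results can then be passed to File.WriteAllBytes, which writes the bytes as they are and never adds a BOM of its own.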

Related

How can I change the encoding of a file 'without BOM' to a 'Windows-1252' encoded file?

This is my function to convert the encoding of a file.
Before the conversion I opened the file in Notepad++ and checked the encoding using the Encoding menu; it showed UTF-8. I tried to convert the file with the following function, but it was not converted to ASCII.
Please have a look at the function.
public static void ConvertFileEncoding(string srcFile, Encoding srcEncoding, string tempFile)
{
try
{
using (var reader = new StreamReader(srcFile))
using (var writer = new StreamWriter(tempFile, false, Encoding.ASCII))
{
char[] buf = new char[1024];
while (true)
{
int count = reader.Read(buf, 0, buf.Length);
if (count == 0)
{
break;
}
writer.Write(buf, 0, count);
}
}
System.IO.File.Copy(tempFile, srcFile, true); // Source file is replaced with Temp file
DeleteTempFile(tempFile);
// TODO: log success details
}
catch (Exception e)
{
// TODO: log failure details (must come before the throw, or it is unreachable)
throw new IOException("Encoding conversion failed.", e);
}
}
Please help me understand what goes wrong when I convert the file without a BOM to Windows-1252.
Characters with values below 128 are encoded identically in ASCII and UTF-8 (and in Windows-1252). If your file consists only of these characters (which is likely), then its bytes are the same whether it is treated as UTF-8 or ASCII.
A program can't be expected to distinguish the two, because the files are identical. UTF-8 is very commonly used now, so it is a reasonable guess when a program has nothing but the file's content to go on and wants to display an encoding.
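The point can be checked directly by comparing the bytes that the two encodings produce. The following is a small sketch; the class and method names (AsciiVersusUtf8, SameBytes) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not from the original answer.
public static class AsciiVersusUtf8
{
    // True when the text encodes to exactly the same bytes in ASCII and UTF-8.
    public static bool SameBytes(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text); // GetBytes never emits a BOM
        return ascii.SequenceEqual(utf8);
    }
}
```

For text made only of characters below 128, such as "Hello, world", SameBytes returns true, so nothing in the file can reveal which of the two encodings was used. For text such as "café" it returns false: UTF-8 encodes é as two bytes, while ASCII replaces it with a question mark.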

Base64 - CryptoStream with StreamWriter vs Convert.ToBase64String()

Following feedback from Alexei, a simplification of the question:
How do I use a buffered Stream approach to convert the contents of a CryptoStream (using ToBase64Transform) into a StreamWriter (Unicode encoding) without using Convert.ToBase64String()?
Note: Calling Convert.ToBase64String() throws OutOfMemoryException, hence the need for a buffered/Stream approach to the conversion.
You should probably implement a custom Stream rather than a TextWriter. Streams are much easier to compose than writers (for example, you can pass your stream to a compression stream).
To create a custom stream, derive from Stream and implement at least Write and Flush (and Read if you need a read/write stream). The rest is more or less optional and depends on your additional needs; a regular copy to another stream does not need anything else.
In the constructor, accept the inner stream to write to. Base64 always produces ASCII characters, so it is easy to write the output as UTF-8, with or without a BOM, directly to a stream; if you want to specify the encoding, you can wrap the inner stream in a StreamWriter internally.
In your Write implementation, buffer data until you have a block whose size is a multiple of 3 bytes (e.g. 300) and call Convert.ToBase64String on that portion. Make sure not to lose the not-yet-converted remainder. Since Base64 converts 3 bytes to 4 characters, a block whose size is a multiple of 3 never ends in =/== padding and can be concatenated with the next block. Write the converted portion to the inner stream/writer. Keep the block size relatively small, such as 3*10000, to avoid allocating your blocks on the large object heap.
In Flush, convert the last unwritten bytes (this is the only block that can end with = padding) and write them to the stream too.
For reading you need to be more careful: white space is allowed in Base64, so you can't read a fixed number of characters and convert them to bytes. The easiest approach is to read character by character from a StreamReader and convert each group of 4 non-space characters to bytes.
Note: you can consider writing/reading Base64 by hand directly from bytes. It will give you some performance benefit, but may be hard if you are not comfortable with bit shifting.
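The block-wise idea can be sketched without a full Stream subclass. The helper below (ChunkedBase64.Encode, a hypothetical name) reads the input in blocks of 3*10000 bytes, as suggested above, and writes each converted block to a TextWriter. It is a simplified illustration of the technique under those assumptions, not the custom stream the answer describes.

```csharp
using System;
using System.IO;

// Hypothetical helper, not from the original answer.
public static class ChunkedBase64
{
    // A multiple of 3, so only the final block can end with '=' padding.
    private const int BlockSize = 3 * 10000;

    public static void Encode(Stream input, TextWriter output)
    {
        byte[] buffer = new byte[BlockSize];
        while (true)
        {
            // Fill the buffer completely unless the input ends first,
            // because Stream.Read may return fewer bytes than requested.
            int filled = 0;
            while (filled < buffer.Length)
            {
                int read = input.Read(buffer, filled, buffer.Length - filled);
                if (read == 0) break;
                filled += read;
            }
            if (filled == 0) break;

            // Full blocks produce no padding, so the pieces concatenate cleanly.
            output.Write(Convert.ToBase64String(buffer, 0, filled));

            // A short block means the input is exhausted.
            if (filled < buffer.Length) break;
        }
    }
}
```

Passing a StreamWriter created with Encoding.Unicode as the output would produce the Unicode-encoded text asked for in the question, while only one block is held in memory at a time.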
Please try the following to encrypt. It takes file names/paths as input; adjust it as required. Using this I have encrypted a file of over 1 GB without any out-of-memory exception.
// Requires: using System; using System.IO; using System.Security.Cryptography;
public bool EncryptUsingStream(string inputFileName, string outputFileName)
{
    // Assumes you already have a key; this all-zero key is only a placeholder.
    // AES accepts 16-, 24- or 32-byte keys (a 128-byte key is rejected).
    byte[] key = new byte[32];
    try
    {
        using (Aes algorithm = Aes.Create())
        {
            algorithm.Key = key;
            using (ICryptoTransform transform = algorithm.CreateEncryptor())
            using (FileStream fsInput = new FileStream(inputFileName, FileMode.Open, FileAccess.Read))
            using (FileStream fsEncrypted = new FileStream(outputFileName, FileMode.Create, FileAccess.Write))
            {
                // First write the IV
                fsEncrypted.Write(algorithm.IV, 0, algorithm.IV.Length);
                // Then write the encrypted data through the stream
                using (CryptoStream cs = new CryptoStream(fsEncrypted, transform, CryptoStreamMode.Write))
                {
                    const int bufferSize = 1048576; // buffer size = 1 MB
                    byte[] buffer = new byte[bufferSize];
                    int bytesRead;
                    while ((bytesRead = fsInput.Read(buffer, 0, bufferSize)) > 0)
                    {
                        cs.Write(buffer, 0, bytesRead);
                    }
                }
            }
        }
        return true;
    }
    catch (Exception)
    {
        // Handle the exception or rethrow.
        return false;
    }
}

String back into byte array - not working

I have a stream that gets converted into a byte array.
I then take that byte array and turn it into a string.
When I try to turn that string back into a byte array it is not correct...see the code below.
private void Parse(Stream stream, Encoding encoding)
{
// Read the stream into a byte array
byte[] allData = ToByteArray(stream);
// Copy to a string for header parsing
string allContent = encoding.GetString(allData);
//This does not convert back right - just for demo purposes, not how the code is used
allData = encoding.GetBytes(allContent);
}
private byte[] ToByteArray(Stream stream)
{
byte[] buffer = new byte[32768];
using (MemoryStream ms = new MemoryStream())
{
while (true)
{
int read = stream.Read(buffer, 0, buffer.Length);
if (read <= 0)
return ms.ToArray();
ms.Write(buffer, 0, read);
}
}
}
Without more information, I'm fairly certain that this is a text encoding issue. Most likely, the text encoding in the stream differs from the encoding passed as your parameter, which results in different values at the byte level.
Here are a few good articles that explain why you're seeing what you're seeing.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
How to Determine Text File Encoding
General questions, relating to UTF or Encoding Form
I think changing the ToByteArray method to use a StreamReader that matches the encoding should work in this case, although without seeing more of the code I can't be certain.
private byte[] ToByteArray(Stream stream, System.Text.Encoding encoding)
{
using(var sr = new StreamReader(stream, encoding))
{
return encoding.GetBytes(sr.ReadToEnd());
}
}
EDIT
Since you're working with image data, you should use Convert.ToBase64String to convert the byte[] to a string. You can then use Convert.FromBase64String to convert it back into a byte[]. The reason encoding.GetBytes doesn't work is that the byte[] may contain data that cannot be represented as a string in that encoding.
private void Parse(Stream stream, Encoding encoding)
{
byte[] allData = ToByteArray(stream);
string allContent = Convert.ToBase64String(allData);
allData = Convert.FromBase64String(allContent);
}
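The difference between the two round trips can be seen with a few bytes that are not valid text. In this sketch the class and method names (RoundTrip, SurvivesEncoding, SurvivesBase64) are hypothetical, and the sample bytes in the note below are simply the first bytes of a typical JPEG file.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not from the original answer.
public static class RoundTrip
{
    // Decodes the bytes as text, re-encodes them, and compares with the original.
    public static bool SurvivesEncoding(byte[] data, Encoding encoding)
    {
        string text = encoding.GetString(data);
        return encoding.GetBytes(text).SequenceEqual(data);
    }

    // Base64 maps every byte sequence to text and back without loss.
    public static bool SurvivesBase64(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text).SequenceEqual(data);
    }
}
```

With the bytes FF D8 FF E0 and Encoding.UTF8, SurvivesEncoding returns false, because invalid sequences are replaced with U+FFFD while decoding, whereas SurvivesBase64 returns true.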

how to read this text from file

How do I read the text below?
‰€ˆ‡‰�#îõ‘þüŠ ꑯõù ‚†ƒ� -#�ª÷‘þü “‘
ª“îù )øþ¦ùý ¤ª—ùý î‘õ•þø—¤(#•¢þ¢�
ø¤÷¢ù ꑯõù îõ‘þü#^a—ú¤�ö^b•¦øû÷¢ð‘ö
¤�ù ¢�÷©^cˆˆƒ�#‚€� «.: õ¬ø¤Š
›¢øñ#…�…ˆí/Š…/
…€�…Š}TK{^aˆˆƒ�#†/„€€#}BF{#ª“îùû‘ý
î‘õ•þø—¤ý#ª“îùû‘ý î‘õ•þø—¤ý --
�¥õøöû‘#^c
I use this code, but it does not display all the characters:
FileStream fs = new FileStream(open.FileName, FileMode.Open, FileAccess.Read);
System.Text.Encoding enc = System.Text.Encoding.UTF8 ;
byte[] data = new byte[fs.Length];
fs.Read(data, 0, data.Length);
string text = enc.GetString(data);
and it shows this text:
†‰€ˆ‡‰Â�#îõâ� �˜Ã¾Ã¼Å
ꑯõù ‚†ƒ� -#�ª÷‘þü
“‘ ª“îù )øþ¦ùý
¤ª—ùý î‘õ•þø—¤(#�
�¢þ¢� ø¤÷¢ù ꑯõù
îõ‘þü#^a—ú¤�ö
^b•¦øû÷¢ð‘ö ¤�ù
¢Â�֩^cˆˆƒÂ�#‚₠¬Â� «.:
õ¬ø¤Š›¢øñ#…Â�…ˆí/Å
…/ …€Â�…Š}TK{^aˆˆƒÂ�#â€
/„€€#}BF{#ª“îùû ‘ý
î‘õ•þø—¤ý#�
�“îùû‘ý î‘õ
This is a DOS text file,
and the encoding of this text is:
IBM037
IBM437
IBM500
ASMO-708
DOS-720
ibm737
ibm775
ibm850
ibm852
IBM855
ibm857
IBM00858
IBM860
ibm861
DOS-862
IBM863
IBM864
IBM865
cp866
ibm869
IBM870
windows-874
cp875
shift_jis
gb2312
ks_c_5601-1987
big5
IBM1026
IBM01047
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
utf-16
unicodeFFFE
windows-1250
windows-1251
Windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
Johab
macintosh
x-mac-japanese
x-mac-chinesetrad
x-mac-korean
x-mac-arabic
x-mac-hebrew
x-mac-greek
x-mac-cyrillic
x-mac-chinesesimp
x-mac-romanian
x-mac-ukrainian
x-mac-thai
x-mac-ce
x-mac-icelandic
x-mac-turkish
x-mac-croatian
utf-32
utf-32BE
x-Chinese-CNS
x-cp20001
x-Chinese-Eten
x-cp20003
x-cp20004
x-cp20005
x-IA5
x-IA5-German
x-IA5-Swedish
x-IA5-Norwegian
us-ascii
x-cp20261
x-cp20269
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM290
IBM297
IBM420
IBM423
IBM424
x-EBCDIC-KoreanExtended
IBM-Thai
koi8-r
IBM871
IBM880
IBM905
IBM00924
EUC-JP
x-cp20936
x-cp20949
cp1025
koi8-u
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-13
iso-8859-15
x-Europa
iso-8859-8-i
iso-2022-jp
csISO2022JP
iso-2022-jp
iso-2022-kr
x-cp50227
euc-jp
EUC-CN
euc-kr
hz-gb-2312
GB18030
x-iscii-de
x-iscii-be
x-iscii-ta
x-iscii-te
x-iscii-as
x-iscii-or
x-iscii-ka
x-iscii-ma
x-iscii-gu
x-iscii-pa
utf-7
utf-8
To read the file you need to know which encoding it uses.
If you don't know, you can iterate through all the encodings and see whether one of them works.
const string FileName = "FileName";
foreach (var encodingInfo in Encoding.GetEncodings())
{
    try
    {
        var encoding = encodingInfo.GetEncoding();
        var text = File.ReadAllText(FileName, encoding);
        // Show at most the first 20 characters; Substring(0, 20) would throw on shorter text
        var preview = text.Substring(0, Math.Min(20, text.Length));
        Console.WriteLine("{0} - {1}", encodingInfo.Name, preview);
        // put a breakpoint here and check whether the text is readable
    }
    catch (Exception)
    {
        Console.WriteLine("Failed: {0}", encodingInfo.Name);
    }
}
Disclaimer: this assumes the file is a text file and isn't huge.
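One caveat with trying encodings in a loop: by default, most encodings replace invalid bytes with a substitute character instead of throwing, so a catch block rarely rules anything out. The sketch below uses a hypothetical helper (StrictDecoder.TryDecode) to show how an encoding can be requested with exception fallbacks so that invalid input is reported. Many single-byte code pages assign a character to almost every byte value, so they can pass this check even when they are the wrong choice; the result still has to be read by a person.

```csharp
using System;
using System.Text;

// Hypothetical helper, not from the original answer.
public static class StrictDecoder
{
    // Returns true and the decoded text only when every byte is valid in the encoding.
    public static bool TryDecode(byte[] data, Encoding encoding, out string text)
    {
        // Request the same code page with exception fallbacks, so invalid bytes
        // throw instead of being silently replaced.
        Encoding strict = Encoding.GetEncoding(
            encoding.CodePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            text = strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }
}
```

Calling it with File.ReadAllBytes(FileName) and each candidate encoding narrows the list to encodings in which the file is at least well formed.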
Well, it looks like you're trying to open a .dat file, which by the looks of it was probably written in a binary format.
Try the following code:
File readThis = new File("file directory");
byte[] aByte = new byte[(int)readThis.length()];
FileInputStream Fis = new FileInputStream(readThis);
Fis.read(aByte);
System.out.println("Contents: " + java.util.Arrays.toString(aByte));
Fis.close();
Let me know how it goes :)

Base64 Encode a PDF in C#?

Can someone shed some light on how to do this? I can do it for regular text or a byte array, but I'm not sure how to approach it for a PDF. Do I load the PDF into a byte array first?
Use File.ReadAllBytes to load the PDF file, and then encode the byte array as normal using Convert.ToBase64String(bytes).
byte[] fileBytes = File.ReadAllBytes(@"TestData\example.pdf");
var content = Convert.ToBase64String(fileBytes);
You can also do this in chunks, so that you don't have to use a large amount of memory all at once.
.NET includes an encoder that can do the chunking, but it lives in a somewhat unexpected place: the System.Security.Cryptography namespace.
I have tested the example code below, and I get identical output using either my method or Andrew's method above.
Here's how it works: you create a CryptoStream, which is a kind of adapter that plugs into another stream. You plug an ICryptoTransform into the CryptoStream (which in turn is attached to your file/memory/network stream), and it transforms the data while it is being read from or written to the stream.
Normally the transformation is encryption/decryption, but .NET includes ToBase64 and FromBase64 transformations as well, so we won't be encrypting, just encoding.
Here's the code. I included a (maybe poorly named) implementation of Andrew's suggestion so that you can compare the output.
class Base64Encoder
{
public void Encode(string inFileName, string outFileName)
{
System.Security.Cryptography.ICryptoTransform transform = new System.Security.Cryptography.ToBase64Transform();
using(System.IO.FileStream inFile = System.IO.File.OpenRead(inFileName),
outFile = System.IO.File.Create(outFileName))
using (System.Security.Cryptography.CryptoStream cryptStream = new System.Security.Cryptography.CryptoStream(outFile, transform, System.Security.Cryptography.CryptoStreamMode.Write))
{
// I'm going to use a 4k buffer, tune this as needed
byte[] buffer = new byte[4096];
int bytesRead;
while ((bytesRead = inFile.Read(buffer, 0, buffer.Length)) > 0)
cryptStream.Write(buffer, 0, bytesRead);
cryptStream.FlushFinalBlock();
}
}
public void Decode(string inFileName, string outFileName)
{
System.Security.Cryptography.ICryptoTransform transform = new System.Security.Cryptography.FromBase64Transform();
using (System.IO.FileStream inFile = System.IO.File.OpenRead(inFileName),
outFile = System.IO.File.Create(outFileName))
using (System.Security.Cryptography.CryptoStream cryptStream = new System.Security.Cryptography.CryptoStream(inFile, transform, System.Security.Cryptography.CryptoStreamMode.Read))
{
byte[] buffer = new byte[4096];
int bytesRead;
while ((bytesRead = cryptStream.Read(buffer, 0, buffer.Length)) > 0)
outFile.Write(buffer, 0, bytesRead);
outFile.Flush();
}
}
// this version of Encode pulls everything into memory at once
// you can compare the output of my Encode method above to the output of this one
// the output should be identical, but the CryptoStream version
// will use way less memory on a large file than this version.
public void MemoryEncode(string inFileName, string outFileName)
{
byte[] bytes = System.IO.File.ReadAllBytes(inFileName);
System.IO.File.WriteAllText(outFileName, System.Convert.ToBase64String(bytes));
}
}
I am also varying where I attach the CryptoStream. In the Encode method, I attach it to the output (writing) stream, so after creating the CryptoStream I use its Write() method.
When decoding, I attach it to the input (reading) stream, so I use the CryptoStream's Read() method. It doesn't really matter which stream I attach it to; I just have to pass the appropriate Read or Write enumeration member to the CryptoStream's constructor.
