C# - change file encoding without loading the whole file in memory

I need to change a file's encoding. The method I've used loads the whole file into memory:
string DestinationString = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(File.ReadAllText(FileName)));
File.WriteAllText(FileName, DestinationString, new System.Text.ASCIIEncoding());
This works for smaller files (in the case where I want to change the file's encoding to ASCII), but it won't work for files larger than 2 GB. How can I change the encoding without loading the whole file's content into memory?

You can't do so by writing to the same file - but you can easily do it to a different file, just by reading a chunk of characters at a time in one encoding and writing each chunk in the target encoding.
public void RewriteFile(string source, Encoding sourceEncoding,
                        string destination, Encoding destinationEncoding)
{
    // File.OpenText/File.CreateText have no encoding overloads,
    // so use StreamReader/StreamWriter directly.
    using (var reader = new StreamReader(source, sourceEncoding))
    using (var writer = new StreamWriter(destination, false, destinationEncoding))
    {
        char[] buffer = new char[16384];
        int charsRead;
        // Read a chunk of characters in the source encoding and
        // write it back out in the destination encoding.
        while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            writer.Write(buffer, 0, charsRead);
        }
    }
}
You could always end up with the original filename via renaming, of course.
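For example, a minimal sketch of that rename step, assuming the source file is UTF-8 (the file name here is a placeholder):

// Convert to a temporary file, then swap it in place of the original.
string original = "file.txt";
string temp = original + ".tmp";
RewriteFile(original, Encoding.UTF8, temp, Encoding.ASCII);
File.Delete(original);
File.Move(temp, original);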

Related

Wrong file extension added in image byte array

I have a method that takes a web-based image, converts it into a byte array and then passes that to a CDN as an image.
I have now used this successfully hundreds of times; however, on migrating a particular set of images, I've noticed that these JPEG files are being identified as .png files, and when they arrive at the CDN they are blank.
Using some code that I copied from the web, I am able to identify the file extension from the image byte array after it is built.
So, it's the conversion from the original image to byte array that is mysteriously updating the file type.
This is my method:
public byte[] Get(string fullPath)
{
    byte[] imageBytes = { };
    var imageRequest = (HttpWebRequest)WebRequest.Create(fullPath);
    var imageResponse = imageRequest.GetResponse();
    var responseStream = imageResponse.GetResponseStream();
    if (responseStream != null)
    {
        using (var br = new BinaryReader(responseStream))
        {
            imageBytes = br.ReadBytes(500000);
            br.Close();
        }
        responseStream.Close();
    }
    imageResponse.Close();
    return imageBytes;
}
I have also tried converting this to use MemoryStream instead.
I'm not sure what else I can do to ensure that this identifies the correct file type.
Edit
I have now updated the number of allowed bytes which has resulted in viable images.
However the issue with the JPEG files being altered to PNG is still ongoing.
It's only this selection of images that are affected.
These images were saved in an old CMS system so I do wonder if the way that they were saved is the cause?
Up to now, the code only reads 500,000 bytes for each file. If the file is larger than that, the end is truncated and the content is not valid anymore. In order to read all bytes, you can use the following code:
public byte[] Get(string fullPath)
{
    List<byte> imageBytes = new List<byte>(500000);
    var imageRequest = (HttpWebRequest)WebRequest.Create(fullPath);
    using (var imageResponse = imageRequest.GetResponse())
    using (var responseStream = imageResponse.GetResponseStream())
    using (var br = new BinaryReader(responseStream))
    {
        var buffer = new byte[500000];
        int bytesRead;
        while ((bytesRead = br.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Only add the bytes actually read; the final chunk is
            // usually shorter than the buffer.
            for (int i = 0; i < bytesRead; i++)
                imageBytes.Add(buffer[i]);
        }
    }
    return imageBytes.ToArray();
}
The sample above reads the data in chunks of 500,000 bytes - for most of your files, a single chunk should be sufficient. If a file is larger, the code keeps reading chunks until there are no more bytes to read, assembling them all in a list.
This ensures that all the bytes are read, even if the content is larger than 500,000 bytes.
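If you're on .NET 4 or later, a MemoryStream achieves the same thing without the intermediate List<byte>; a minimal sketch of that alternative (not tested against the poster's CDN setup):

public byte[] Get(string fullPath)
{
    var imageRequest = (HttpWebRequest)WebRequest.Create(fullPath);
    using (var imageResponse = imageRequest.GetResponse())
    using (var responseStream = imageResponse.GetResponseStream())
    using (var memory = new MemoryStream())
    {
        // CopyTo reads the response in chunks until the stream ends,
        // so no fixed byte limit is needed.
        responseStream.CopyTo(memory);
        return memory.ToArray();
    }
}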

Decompress tar files using C#

I'm searching for a way to add embedded resources to my solution. These resources will be folders with a lot of files in them. On user demand they need to be decompressed.
I'm searching for a way to store such folders in the executable without involving third-party libraries (this looks rather odd, but that is the task).
I have found that I can GZip and UnGZip them using the standard libraries. But GZip handles a single file only. In such cases TAR should come onto the scene. But I haven't found a TAR implementation among the standard classes.
Is it possible to decompress TAR files with bare C#?
While looking for a quick answer to the same question, I came across this thread, and was not entirely satisfied with the current answers, as they all point to taking dependencies on much larger third-party libraries, all just to achieve simple extraction of a tar.gz file to disk.
While the gz format could be considered rather complicated, tar on the other hand is quite simple. At its core, it just takes a bunch of files, prepends a 500-byte header (padded to 512 bytes) to each one describing the file, and writes them all to a single archive on a 512-byte alignment. There is no compression; that is typically handled by compressing the created file into a gz archive, which .NET conveniently has built in, and which takes care of the hard part.
Having looked at the spec for the tar format, there are only really 2 values (especially on Windows) we need to pick out from the header in order to extract the file from a stream. The first is the name, and the second is size. Using those two values, we need only seek to the appropriate position in the stream and copy the bytes to a file.
I made a very rudimentary, down-and-dirty method to extract a tar archive to a directory, and added some helper functions for opening from a stream or filename, and decompressing the gz file first using built-in functions.
The primary method is this:
public static void ExtractTar(Stream stream, string outputDir)
{
    var buffer = new byte[100];
    while (true)
    {
        // The entry name occupies the first 100 bytes of each 512-byte header.
        stream.Read(buffer, 0, 100);
        var name = Encoding.ASCII.GetString(buffer).Trim('\0');
        if (String.IsNullOrWhiteSpace(name))
            break;
        // Skip mode, uid and gid (8 bytes each), then read the 12-byte
        // size field, which is stored as an octal string.
        stream.Seek(24, SeekOrigin.Current);
        stream.Read(buffer, 0, 12);
        var size = Convert.ToInt64(Encoding.ASCII.GetString(buffer, 0, 12).Trim(), 8);
        // Skip the rest of the 512-byte header (512 - 100 - 24 - 12 = 376).
        stream.Seek(376L, SeekOrigin.Current);
        var output = Path.Combine(outputDir, name);
        if (!Directory.Exists(Path.GetDirectoryName(output)))
            Directory.CreateDirectory(Path.GetDirectoryName(output));
        using (var str = File.Open(output, FileMode.OpenOrCreate, FileAccess.Write))
        {
            var buf = new byte[size];
            stream.Read(buf, 0, buf.Length);
            str.Write(buf, 0, buf.Length);
        }
        // Entries are aligned on 512-byte boundaries; skip the padding.
        var pos = stream.Position;
        var offset = 512 - (pos % 512);
        if (offset == 512)
            offset = 0;
        stream.Seek(offset, SeekOrigin.Current);
    }
}
And here are a few helper functions for opening from a file, and for automating the decompression of a tar.gz file/stream before extracting.
public static void ExtractTarGz(string filename, string outputDir)
{
    using (var stream = File.OpenRead(filename))
        ExtractTarGz(stream, outputDir);
}

public static void ExtractTarGz(Stream stream, string outputDir)
{
    // A GZipStream is not seekable, so copy it first to a MemoryStream
    using (var gzip = new GZipStream(stream, CompressionMode.Decompress))
    using (var memStr = new MemoryStream())
    {
        const int chunk = 4096;
        var buffer = new byte[chunk];
        int read;
        // Read can legitimately return fewer bytes than requested before
        // the end of the stream, so loop until it returns 0.
        while ((read = gzip.Read(buffer, 0, chunk)) > 0)
        {
            memStr.Write(buffer, 0, read);
        }
        memStr.Seek(0, SeekOrigin.Begin);
        ExtractTar(memStr, outputDir);
    }
}

public static void ExtractTar(string filename, string outputDir)
{
    using (var stream = File.OpenRead(filename))
        ExtractTar(stream, outputDir);
}
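For example, extracting an archive might look like this (the paths are illustrative):

// Extract a gzipped tarball into a target directory.
ExtractTarGz(@"C:\temp\archive.tar.gz", @"C:\temp\output");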
Here is a gist of the full file with some comments.
Tar-cs will do the job, but it is quite slow. I would recommend using SharpCompress which is significantly quicker. It also supports other compression types and it has been updated recently.
using System;
using System.IO;
using SharpCompress.Common;
using SharpCompress.Reader;

private static String directoryPath = @"C:\Temp";

public static void unTAR(String tarFilePath)
{
    using (Stream stream = File.OpenRead(tarFilePath))
    {
        var reader = ReaderFactory.Open(stream);
        while (reader.MoveToNextEntry())
        {
            if (!reader.Entry.IsDirectory)
            {
                ExtractionOptions opt = new ExtractionOptions
                {
                    ExtractFullPath = true,
                    Overwrite = true
                };
                reader.WriteEntryToDirectory(directoryPath, opt);
            }
        }
    }
}
See tar-cs
using (FileStream unarchFile = File.OpenRead(tarfile))
{
    TarReader reader = new TarReader(unarchFile);
    reader.ReadToEnd("out_dir");
}
Since you are not allowed to use outside libraries, you are not restricted to a specific format of tar file either. In fact, it doesn't even need to be all in one file.
You can write your own tar-like utility in C# that walks a directory tree and produces two files: a "header" file that consists of a serialized dictionary mapping relative file paths to offset/length pairs, and a big file containing the content of the individual files concatenated into one giant blob. This is not a trivial task, but it's not overly complicated either.
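A rough sketch of the packing side of such a utility; everything here is illustrative (the method name, the index format, and the assumption that '|' never appears in paths):

// Walk a directory and write two files: an index of (relativePath, offset, length)
// entries, and a blob containing every file's bytes concatenated together.
public static void Pack(string sourceDir, string indexFile, string blobFile)
{
    using (var blob = File.Create(blobFile))
    using (var index = new StreamWriter(indexFile))
    {
        foreach (var path in Directory.GetFiles(sourceDir, "*", SearchOption.AllDirectories))
        {
            long offset = blob.Position;
            byte[] content = File.ReadAllBytes(path);
            blob.Write(content, 0, content.Length);
            string relative = path.Substring(sourceDir.Length).TrimStart('\\', '/');
            // One line per entry: path|offset|length
            index.WriteLine("{0}|{1}|{2}", relative, offset, content.Length);
        }
    }
}

Unpacking is then a matter of reading the index, seeking to each offset in the blob, and copying length bytes out to a file.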
There are two ways to compress/decompress in .NET. First, you can use the GZipStream and DeflateStream classes; both can compress your files. GZipStream produces the .gz format, so if you compress any file with GZipStream it can be opened with any popular compression application such as WinZip/WinRAR/7-Zip, but you can't open a DeflateStream-compressed file that way. These two classes have been available since .NET 2.0.
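For instance, compressing a single file with GZipStream looks roughly like this (the file names are placeholders; needs System.IO and System.IO.Compression):

using (FileStream input = File.OpenRead("data.txt"))
using (FileStream output = File.Create("data.txt.gz"))
using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
{
    byte[] buffer = new byte[4096];
    int read;
    // Push the raw bytes through the compressing stream in chunks.
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        gzip.Write(buffer, 0, read);
}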
The other way is the Package class. It's essentially the same idea as GZipStream and DeflateStream; the only difference is that you can compress multiple files, which can then be opened with WinZip/WinRAR/7-Zip. So that's all .NET has. But it's not even a generic .zip file;
it's something Microsoft uses to compress their Office files with the *x extensions. If you decompress any docx file with the Package class you can see everything stored in it. So don't use the .NET libraries for compressing, or even decompressing, because you can't make a generic compressed file or decompress a generic zip file. You'll have to consider a third-party library such as
http://www.icsharpcode.net/OpenSource/SharpZipLib/
or implement everything from the ground up.

C# - How to read a single file with normal and XML text elements

I am receiving a stream of data from a webservice and trying to save the contents of the stream to file. The stream contains standard lines of text alongside large chunks of xml data (on a single line). The size of the file is about 800Mb.
Problem: Receiving an out of memory exception when I process the xml section of each line.
==start file
line 1
line 2
<?xml version=.....huge line etc</xml>
line 3
line4
<?xml version=.....huge line etc</xml>
==end file
Here is the current code; as you can see, when it reads in the huge XML line, the memory usage spikes.
string readLine;
using (StreamReader reader = new StreamReader(downloadStream))
{
    while ((readLine = reader.ReadLine()) != null)
    {
        streamWriter.WriteLine(readLine); // writes to file
    }
}
I was trying to think of a solution where I used both a TextReader/StreamReader and XmlTextReader in combination to process each section. As I get to the xml section I could switch to the XmlTextReader and use the Read() method to read each node thus stopping the memory spike.
Any suggestions on how I could do this? Alternatively, I could create a custom XmlTextReader that was able to read in these lines? Any pointers for this?
Updated
A further problem to this is that I need to read this file back in and split out the two xml sections to separate xml files! I converted the solution to write the file using a binary writer and then started to read the file back in using a binary reader. I have text processing to detect the start of the xml section and specifically which xml section so I can map it to the correct file! However this causes problems reading in the binary file and doing detection...
using (BinaryReader reader = new BinaryReader(savedFileStream))
{
    while ((streamLine = reader.ReadString()) != null)
    {
        if (streamLine.StartsWith("<?xml version=\"1.0\" ?><tag1"))
            //xml file 1
        else if (streamLine.StartsWith("<?xml version=\"1.0\" ?><tag2"))
            //xml file 2
XML may contain all its content as one single line, so you'd probably be better off using a binary reader/writer where you can decide the read/write size.
An example below; here we read BUFFER_SIZE bytes in each iteration:
Stream s = new MemoryStream();
Stream outputStream = new MemoryStream();
int BUFFER_SIZE = 1024;
using (BinaryReader reader = new BinaryReader(s))
{
    BinaryWriter writer = new BinaryWriter(outputStream);
    byte[] buffer = new byte[BUFFER_SIZE];
    int read = buffer.Length;
    while (read != 0)
    {
        read = reader.Read(buffer, 0, BUFFER_SIZE);
        writer.Write(buffer, 0, read);
    }
    writer.Flush();
    writer.Close();
}
I don't know if this causes you problems with encodings etc, but I think you will have to read the file as binary.
If all you want to do is copy one stream to another without modifying the data, you don't need the Stream text or binary helpers (StreamReader, StreamWriter, BinaryReader, BinaryWriter, etc.), simply copy the stream.
internal static class StreamExtensions
{
    public static void CopyTo(this Stream readStream, Stream writeStream)
    {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = readStream.Read(buffer, 0, buffer.Length)) > 0)
            writeStream.Write(buffer, 0, read);
    }
}
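Usage is then a one-liner; for example (file names here are illustrative). Note that on .NET 4 and later, Stream.CopyTo is built in, so this extension is only needed on older frameworks:

using (var input = File.OpenRead("source.dat"))
using (var output = File.Create("destination.dat"))
{
    input.CopyTo(output);
}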
I think there is a memory leak.
Are you getting the out of memory exception after processing a few lines, or on the first line itself?
Also, there is no streamWriter.Flush() inside the while loop.
Don't you think there should be one?

Adding text to beginning and end of file in C#

I have a process which picks up a series of "xml" files. The reason I put xml in quotes is that the text in the files does not have a root element, which makes it invalid XML. In my processing I want to correct this: open up each file, add a root node to the beginning and end of the file, and then close it up. Here is what I had in mind, but this involves opening the file, reading the entire file, tacking on the nodes, and then writing the entire file out. These files may be more than 20 MB in size.
foreach (FileInfo file in files)
{
    //open the file
    StreamReader sr = new StreamReader(file.FullName);
    // add the opening and closing tags
    string text = "<root>" + sr.ReadToEnd() + "<root>";
    sr.Close();
    // now open the same file for writing
    StreamWriter sw = new StreamWriter(file.FullName, false);
    sw.Write(text);
    sw.Close();
}
Any recommendations?
To avoid holding the whole file in memory, rename the original file, then open it with StreamReader. Then open the original filename with StreamWriter to create a new file.
Write the <root> prefix to the file, then copy data in large-ish chunks from the reader to the writer. When you've transferred all the data, write the closing </root> (note the forward slash if you want it to be XML). Then close both files and delete the renamed original.
char[] buffer = new char[10000];
string renamedFile = file.FullName + ".orig";
File.Move(file.FullName, renamedFile);
using (StreamReader sr = new StreamReader(renamedFile))
using (StreamWriter sw = new StreamWriter(file.FullName, false))
{
    sw.Write("<root>");
    int read;
    while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
        sw.Write(buffer, 0, read);
    sw.Write("</root>");
}
File.Delete(renamedFile);
20 MB is not terribly much, but when you read it as a string, it will use about 40 MB of memory. That's not terribly much either, but it's processing that you don't need to do. You can handle it as raw bytes to reduce the memory usage, and to avoid decoding and re-encoding the data:
byte[] start = Encoding.UTF8.GetBytes("<root>");
byte[] ending = Encoding.UTF8.GetBytes("</root>");
byte[] data = File.ReadAllBytes(file.FullName);
// If the file starts with a UTF-8 BOM (EF BB BF), keep it in front of the root tag.
int bom = (data[0] == 0xEF) ? 3 : 0;
using (FileStream s = File.Create(file.FullName))
{
    if (bom > 0)
    {
        s.Write(data, 0, bom);
    }
    s.Write(start, 0, start.Length);
    s.Write(data, bom, data.Length - bom);
    s.Write(ending, 0, ending.Length);
}
If you need to reduce the memory usage much more, use a second file as Earwicker suggested.
Edit:
Added code to handle BOM (byte order mark).
I can't see any real improvement on this... which is kind of a bummer. Since there's no way to "shift" a file, you'll always have to move the bytes of the entire file to inject anything at the top.
You may find some performance benefit by using raw streams rather than the StreamReader which has to actually parse the stream as text.
If you do not want to do this in C#, it would be easy to handle at the command line or in a batch file.
ECHO ^<root^> > outfile.xml
TYPE temp.xml >> outfile.xml
ECHO ^</root^> >> outfile.xml
This would assume that you have some existing process for getting the data files that this could be hooked into.

Base64 Encode a PDF in C#?

Can someone shed some light on how to do this? I can do this for regular text or a byte array, but I'm not sure how to approach it for a PDF. Do I stuff the PDF into a byte array first?
Use File.ReadAllBytes to load the PDF file, and then encode the byte array as normal using Convert.ToBase64String(bytes).
Byte[] fileBytes = File.ReadAllBytes(@"TestData\example.pdf");
var content = Convert.ToBase64String(fileBytes);
There is a way that you can do this in chunks so that you don't have to burn a ton of memory all at once.
.NET includes an encoder that can do the chunking, but it's in kind of a weird place: the System.Security.Cryptography namespace.
I have tested the example code below, and I get identical output using either my method or Andrew's method above.
Here's how it works: You fire up a class called a CryptoStream. This is kind of an adapter that plugs into another stream. You plug a class called CryptoTransform into the CryptoStream (which in turn is attached to your file/memory/network stream) and it performs data transformations on the data while it's being read from or written to the stream.
Normally, the transformation is encryption/decryption, but .NET includes ToBase64 and FromBase64 transformations as well, so we won't be encrypting, just encoding.
Here's the code. I included a (maybe poorly named) implementation of Andrew's suggestion so that you can compare the output.
class Base64Encoder
{
    public void Encode(string inFileName, string outFileName)
    {
        System.Security.Cryptography.ICryptoTransform transform = new System.Security.Cryptography.ToBase64Transform();
        using (System.IO.FileStream inFile = System.IO.File.OpenRead(inFileName),
                                    outFile = System.IO.File.Create(outFileName))
        using (System.Security.Cryptography.CryptoStream cryptStream = new System.Security.Cryptography.CryptoStream(outFile, transform, System.Security.Cryptography.CryptoStreamMode.Write))
        {
            // I'm going to use a 4k buffer, tune this as needed
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = inFile.Read(buffer, 0, buffer.Length)) > 0)
                cryptStream.Write(buffer, 0, bytesRead);
            cryptStream.FlushFinalBlock();
        }
    }

    public void Decode(string inFileName, string outFileName)
    {
        System.Security.Cryptography.ICryptoTransform transform = new System.Security.Cryptography.FromBase64Transform();
        using (System.IO.FileStream inFile = System.IO.File.OpenRead(inFileName),
                                    outFile = System.IO.File.Create(outFileName))
        using (System.Security.Cryptography.CryptoStream cryptStream = new System.Security.Cryptography.CryptoStream(inFile, transform, System.Security.Cryptography.CryptoStreamMode.Read))
        {
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = cryptStream.Read(buffer, 0, buffer.Length)) > 0)
                outFile.Write(buffer, 0, bytesRead);
            outFile.Flush();
        }
    }

    // this version of Encode pulls everything into memory at once
    // you can compare the output of my Encode method above to the output of this one
    // the output should be identical, but the CryptoStream version
    // will use way less memory on a large file than this version.
    public void MemoryEncode(string inFileName, string outFileName)
    {
        byte[] bytes = System.IO.File.ReadAllBytes(inFileName);
        System.IO.File.WriteAllText(outFileName, System.Convert.ToBase64String(bytes));
    }
}
I am also playing around with where I attach the CryptoStream. In the Encode method, I am attaching it to the output (writing) stream, so when I instantiate the CryptoStream, I use its Write() method.
When I read, I'm attaching it to the input (reading) stream, so I use the Read() method on the CryptoStream. It doesn't really matter which stream I attach it to; I just have to pass the appropriate Read or Write enumeration member to the CryptoStream's constructor.
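For example, usage of the class above might look like this (the file names are placeholders):

var encoder = new Base64Encoder();
encoder.Encode(@"TestData\example.pdf", @"TestData\example.b64");
encoder.Decode(@"TestData\example.b64", @"TestData\example_roundtrip.pdf");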
