I'd like to make a simple text file viewer and I'd like it to be able to handle large files (possibly larger than the computer's memory).
I know that I need to implement something like a sliding buffer that will contain the currently visible portion of the file. The main problem is to determine the relation between lines and file offsets. If I just needed to be able to navigate by lines, I'd just need a linked list of lines and on line up/line down just read a new line from the file. But what should I do when I also want to go to, say, 50% of the file? I need to show the lines starting from the half of the file, so if the file is 10000 bytes long, I'd seek to byte 5000, look for a line break and display stuff from there. The problem is that I don't know what line I'm at when seeking like this.
So what I would like to know is what would be a suitable data structure for keeping these few lines in memory (the ones that will be painted on the screen).
Keep in mind that I don't need to edit the files, just view them, so I don't need to care about efficiency of the chosen approach for editing.
If you're reading in a defined chunk of bytes via a FileStream you could keep track of which byte you read last so you know where to pick up next to read more data chunks from the file. FileStream exposes Read() which allows you to specify an offset byte (position to start) and also how many bytes to read at a time.
After you read in your bytes you can decode them to UTF8 with a decoder, for instance, and then retrieve a char array with it. All of that should initialize your initial data. What I would do, since this will be displayed somewhere, is set up event handlers tied to scrolling. When you start scrolling down you can remove top lines from memory (at the same time counting their bytes before deleting so you can dynamically read in the next set of bytes with the exact same size) and append new lines to the bottom. Likewise for scrolling upward.
If you're wanting to figure out half of your data then you could try making a FileInfo object on the text file path and then using the Length property to return the number of bytes. Since streams deal in bytes this comes in handy when trying to read in a percentage. You can use that to define how many bytes to read in. You'll have to read data in to determine where line breaks are and set your last byte read as the CR-LF to pick up at the next line when you retrieve data again.
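For the percentage jump specifically, here's a minimal sketch of that idea (the method name and parameters are just illustrative, not from your post): seek to the target byte, then skip forward to the next line break so you start painting on a whole line.

static long SeekToFraction(FileStream fs, double fraction)
{
    long target = (long)(fs.Length * fraction); // e.g. 0.5 for "50% of the file"
    fs.Seek(target, SeekOrigin.Begin);
    int b;
    while ((b = fs.ReadByte()) != -1 && b != '\n')
    {
        // skip the rest of the partial line we landed in
    }
    return fs.Position; // first byte of the next full line (or end of file)
}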
Here's what I would do to read a predefined count of bytes from a file.
public static int LastByteRead = 0; // keep it zero indexed

public String[] GetFileChunk( String path, long chunkByteSize )
{
    FileStream fStream = null;
    int SuccessBytes = 0;
    long StreamSize;
    byte[] FileBytes;
    char[] FileTextChars;
    Decoder UtfDecoder = Encoding.UTF8.GetDecoder();
    FileInfo TextFileInfo = new FileInfo(path);

    if( File.Exists(path) )
    {
        try
        {
            StreamSize = (TextFileInfo.Length >= chunkByteSize) ? chunkByteSize : TextFileInfo.Length;
            fStream = new FileStream( path, FileMode.Open, FileAccess.Read );
            FileBytes = new byte[ StreamSize ];
            FileTextChars = new char[ StreamSize ]; // safe size: the char count can't exceed the byte count for UTF-8

            SuccessBytes = fStream.Read( FileBytes, 0, (Int32)StreamSize );
            if( SuccessBytes > 0 )
            {
                int charCount = UtfDecoder.GetChars( FileBytes, 0, SuccessBytes, FileTextChars, 0 );
                LastByteRead = SuccessBytes - 1;
                return new String( FileTextChars, 0, charCount ).Split('\n');
            }
            else
                return new String[1] {""};
        }
        catch (Exception ex)
        {
            Console.WriteLine( "ERROR: " + ex.Message );
        }
        finally
        {
            if( fStream != null )
                fStream.Close();
        }
    }
    return new String[0];
}
Maybe that will get you in the right direction at least.
I am trying to convert byte[] to base64 string format so that I can send that information to a third party. My code is as below:
byte[] ByteArray = System.IO.File.ReadAllBytes(path);
string base64Encoded = System.Convert.ToBase64String(ByteArray);
I am getting the below error:
Exception of type 'System.OutOfMemoryException' was thrown.
Can you help me please?
Update
I just spotted @PanagiotisKanavos' comment pointing to Is there a Base64Stream for .NET?. This does essentially the same thing as my code below attempts to achieve (i.e. allows you to process the file without having to hold the whole thing in memory in one go), but without the overhead/risk of self-rolled code, instead using a standard .Net library method for the job.
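For reference, a minimal sketch of that library-based approach (the file path variables are placeholders; it uses System.Security.Cryptography) is to pipe the input through a ToBase64Transform via a CryptoStream, so the whole file never has to sit in memory:

using (var input = File.OpenRead(inputFilePath))
using (var output = File.Create(outputFilePath))
using (var base64Stream = new CryptoStream(output, new ToBase64Transform(), CryptoStreamMode.Write))
{
    input.CopyTo(base64Stream); // streams the file through the transform in small blocks
}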
Original
The below code will create a new temporary file containing the Base64 encoded version of your input file.
This should have a lower memory footprint, since rather than processing all of the data at once, we handle it a chunk at a time.
To avoid holding the output in memory, I've pushed that back to a temp file, which is returned. When you later need to use that data for some other process, you'd need to stream it (i.e. so that again you're not consuming all of this data at once).
You'll also notice that I've used WriteLine instead of Write; which will introduce non base64 encoded characters (i.e. the line breaks). That's deliberate, so that if you consume the temp file with a text reader you can easily process it line by line.
However, you can amend per your needs.
void Main()
{
var inputFilePath = @"c:\temp\bigfile.zip";
var convertedDataPath = ConvertToBase64TempFile(inputFilePath);
Console.WriteLine($"Take a look in {convertedDataPath} for your converted data");
}
//inputFilePath = where your source file can be found. This is not impacted by the below code
//bufferSizeInBytesDiv3 = how many bytes to read at a time (divided by 3); the larger this value the more memory is required, but the better you'll find performance. The Div3 part is because we later multiply this by 3 / this ensures we never have to deal with remainders (i.e. since 3 bytes = 4 base64 chars)
public string ConvertToBase64TempFile(string inputFilePath, int bufferSizeInBytesDiv3 = 1024)
{
var tempFilePath = System.IO.Path.GetTempFileName();
using (var fileStream = File.Open(inputFilePath,FileMode.Open))
{
using (var reader = new BinaryReader(fileStream))
{
using (var writer = new StreamWriter(tempFilePath))
{
byte[] data;
while ((data = reader.ReadBytes(bufferSizeInBytesDiv3 * 3)).Length > 0)
{
writer.WriteLine(System.Convert.ToBase64String(data)); //NB: using WriteLine rather than Write; so when consuming this content consider removing line breaks (I've used this instead of write so you can easily stream the data in chunks later)
}
}
}
}
return tempFilePath;
}
First of all, I understand that I can solve this issue in different ways. I guess that this issue exists only because of using different methods in an incorrect way. But I want to find out what exactly happened in my example.
I was using StreamReader for reading a file. In order to get bytes from it, I decided to use BaseStream.Read:
int length = (int)reader.BaseStream.Length;
byte[] file = new byte[length];
while(!reader.EndOfStream)
{
int readBytes = reader.BaseStream.Read(file, 0,
(length-offset)>bufferSize?bufferSize:(length - offset));
for (int i = 0; i<readBytes; i++)
{
...
}
offset += readBytes;
}
BaseStream.Read refuses to get the last 1024 bytes when the StreamReader.EndOfStream property was used before reading. Later I found information that EndOfStream tries to read 1 byte, but in fact it reads 1024 bytes for performance reasons. Apparently that 1 KB becomes impossible to reach.
EDIT: If I remove the reader.EndOfStream check from the code, reader.BaseStream.Read works correctly. That was the main point of the question.
Again, I understand that this code example is absolutely inefficient. I'm just trying to understand how streams work in that example, and whether this issue exists only because of bad code (or whether StreamReader.BaseStream has some issues). Thanks in advance.
It is not that StreamReader.BaseStream has issues; it is a problem in your code, which works directly with the Stream wrapped inside the StreamReader.
From MSDN about StreamReader.DiscardBufferedData:
You need to call this method only when the position of the internal buffer and the BaseStream do not match. These positions can become mismatched when you read data into the buffer and then seek a new position in the underlying stream.
That means, in your case, that when the Stream has already reached its end position, the position tracked by the StreamReader's internal buffer still remains at the value it had before you read the underlying stream directly; therefore reader.EndOfStream is still false. That is why you cannot finish the loop.
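As a rough illustration of that mismatch (the file name is hypothetical):

using (var reader = new StreamReader("file.txt"))
{
    bool eof = reader.EndOfStream;          // forces the reader to fill its internal buffer (up to 1024 bytes)
    long pos = reader.BaseStream.Position;  // already advanced past everything that was buffered

    // If you then want to read the BaseStream directly, re-sync first:
    reader.BaseStream.Seek(0, SeekOrigin.Begin);
    reader.DiscardBufferedData();           // tells the StreamReader its buffer no longer matches the stream
}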
Edit:
I think you are missing something, so here is some code to prove that the end of the file is successfully reached. Run it and you will see your app repeatedly say: I'm at the end of the file!
static void Main()
{
using (StreamReader reader = new StreamReader(@"yourFile"))
{
int offset = 0;
int bufferSize = 102400;
int length = (int)reader.BaseStream.Length;
byte[] file = new byte[length];
while (!reader.EndOfStream)
{
// Add these lines:
Console.WriteLine(reader.BaseStream.Position);
Console.ReadLine();
int readBytes = reader.BaseStream.Read(file, 0,
(length - offset) > bufferSize ? bufferSize : (length - offset));
string str = Encoding.UTF8.GetString(file, 0, readBytes);
offset += readBytes;
if (reader.BaseStream.Position == length)
{
Console.WriteLine("I'm at the end of the file! Current Tickcount: " + Environment.TickCount);
Thread.Sleep(100);
}
}
}
}
Edit 2
But still, offset and length should be equal; in my case length - offset = 1024 (in the case of files bigger than 1 KB). Maybe I'm doing something wrong, but if I use files with a size less than 1 KB, readBytes always equals 0.
That is because on your first call to while (!reader.EndOfStream), the reader has to read the file (in this case 1024 bytes, read into its internal buffer) to determine whether the file has ended (see the two lines of code I added above). After that read, the file position has already moved forward 1024 bytes, which is why length - offset = 1024; and if your file is smaller than 1 KB, that first call already seeks to the end of the file. This is where you lose data.
On the second call, it doesn't seek, because you haven't sent any read request to the reader, so it considers the state unchanged and doesn't need to read the file again to check whether it is at the end. That is why the second call doesn't lose data.
I have a configuration file (.cfg) that I am using to create a command line application to add users to a SFTP server application.
The cfg file needs to have a certain number of reserved bytes for each entry in the cfg file. I am currently just appending a new user to the end of the file by creating a byte array and converting it to a string, then copying it to the file, but I've hit a snag. The config file requires 4 bytes at the end of the file.
The process I need to accomplish is to remove these trailing bytes from the file, append the new user, then append the bytes to the end.
So, now that you have some context behind my problem, here is the question:
How do you remove and add bytes from a byte array?
Here is the code I've got so far; it reads the user from one file and appends it to another.
static void Main(string[] args)
{
System.Text.ASCIIEncoding code = new System.Text.ASCIIEncoding(); //Encoding in ascii to pick up mad characters
StreamReader reader = new StreamReader("one_user.cfg", code, false, 1072);
string input = "";
input = reader.ReadToEnd();
//convert input string to bytes
byte[] byteArray = Encoding.ASCII.GetBytes(input);
MemoryStream stream = new MemoryStream(byteArray);
//Convert Stream to string
StreamReader byteReader = new StreamReader(stream);
String output = byteReader.ReadToEnd();
int len = System.Text.Encoding.ASCII.GetByteCount(output);
using (StreamWriter writer = new StreamWriter("freeFTPdservice.cfg", true, Encoding.ASCII, 5504))
{
writer.Write(output, true);
writer.Close();
}
Console.WriteLine("Appended: " + len);
Console.ReadLine();
reader.Close();
byteReader.Close();
}
To try and illustrate this point, here is a "diagram".
1) Add first user
File(appended text)Bytes at end (zeros)
2) Add second user
File(appended text)(appended text)bytes at end (zeros)
and so on.
To explicitly answer your question: How do you remove and add bytes from a byte array?
You can only do this by creating a new array and copying the bytes into it.
Fortunately, this is simplified by using Array.Resize():
byte[] array = new byte[10];
Console.WriteLine(array.Length); // Prints 10
Array.Resize(ref array, 20); // Copies contents of old array to new.
Console.WriteLine(array.Length); // Prints 20
If you need to remove bytes from the beginning - Array.Copy the bytes first and then resize (or copy to a new array if you don't like ref):
// remove 42 bytes from beginning of the array, add size checks as needed
Array.Copy(array, 42, array, 0, array.Length-42);
Array.Resize(ref array, array.Length-42);
You don't. You can copy to a new array of the desired size. Or you can work with a List<byte> and then create an array from that.
But, in your case, I would suggest looking into the file streams themselves... they let you read and write individual bytes or byte arrays and also:
Seek
which lets you move around to arbitrary locations in the file... So, for the use case you described, you would
open the file (for read/write access)
move to the end of the file
move back four bytes (do you know which ones they are? if not, this would be a good time to stash them)
write the new user
write the four bytes
close the file
Something like this:
using (var fs = new FileStream(PATH, FileMode.Open, FileAccess.ReadWrite))
{
fs.Seek(-4, SeekOrigin.End);
fs.Write(userBytes, 0, userBytes.Length);
fs.Write(fourBytesAtEnd, 0, fourBytesAtEnd.Length);
}
This also has the advantage of not having to slurp in the whole file and write it back out.
I have an application that is running on a stand-alone panel PC in a kiosk (C#/WPF). It performs some typical logging operations to a text file. The PC has some limited amount of disk space to store these logs as they grow.
What I need to do is be able to specify the maximum size that a log file is allowed to be. If, when attempting to write to the log, the max size is exceeded, new data will be written to the end of the log and the oldest data will be purged from the beginning.
Getting the file size is no problem, but are there any typical file manipulation techniques to keep a file under a certain size?
One technique to handle this is to have two log files which are half the maximum size each. You simply rotate between the two as you reach the max size of each file. Rotating to a file causes it to be overwritten with a new file.
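A minimal sketch of that rotation, assuming two file names and a 1 MB total budget (both assumptions; adjust to your own limits):

static void WriteLog(string message)
{
    const long halfMax = 512 * 1024;            // half of the 1 MB budget
    string a = "log.a.txt", b = "log.b.txt";

    // Continue in whichever file was written most recently; if it's full, switch and overwrite the other.
    string current = File.GetLastWriteTimeUtc(a) >= File.GetLastWriteTimeUtc(b) ? a : b;
    if (File.Exists(current) && new FileInfo(current).Length >= halfMax)
    {
        current = current == a ? b : a;
        File.WriteAllText(current, string.Empty); // rotating to a file overwrites it with a new one
    }
    File.AppendAllText(current, message + Environment.NewLine);
}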
A logging framework such as log4net has this functionality built in.
Try using Log4Net
http://www.codeproject.com/KB/aspnet/log4net.aspx
There's no easy way to strip data from the beginning of a file. So you have several options:
1) Keep the log in several smaller log files and delete the oldest "chunks" if the total size of all log files exceeds your limit. This is similar to what you want to do, but on a different level.
2) Rename the log file to "log.date" and start a new log. Similar to (1), but not an option if you have limited disk space.
3) If you have enough RAM and your log is relatively small so it fits in memory, you can do the following: map the whole file into memory using a memory-mapped file, then move the data you want to keep from the middle of the file to the beginning, and truncate the file. This is the only way to easily strip data from the beginning of the log file without creating a copy of it (see the sketch below).
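Here's a rough sketch of option 3 (System.IO.MemoryMappedFiles, .NET 4.0+), assuming the log comfortably fits in memory and you want to drop the first bytesToDrop bytes; the method name is made up:

static void TrimLogFront(string path, long bytesToDrop)
{
    long length = new FileInfo(path).Length;
    long remaining = length - bytesToDrop;

    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
    using (var accessor = mmf.CreateViewAccessor(0, length))
    {
        // Move the tail of the file over its beginning, chunk by chunk.
        var buffer = new byte[64 * 1024];
        long copied = 0;
        while (copied < remaining)
        {
            int chunk = (int)Math.Min(buffer.Length, remaining - copied);
            accessor.ReadArray(bytesToDrop + copied, buffer, 0, chunk);
            accessor.WriteArray(copied, buffer, 0, chunk);
            copied += chunk;
        }
    }

    // Truncate the file to what's left.
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
        fs.SetLength(remaining);
}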
Linux OS: check out logrotate - http://www.cyberciti.biz/faq/how-do-i-rotate-log-files/
Windows OS: try googling windows logrotate. For example: http://blog.arithm.com/2008/02/07/windows-log-file-rotation/
I wanted a simple solution as well, but I didn't want to add another dependency so I made a simple method. This has everything you need other than the part of compressing the old file to a zip, which you can find here: Create zip file in memory from bytes (text with arbitrary encoding)
static int iMaxLogLength = 2000; // Probably should be bigger, say 200,000
static int KeepLines = 5; // minimum of how much of the old log to leave
public static void ManageLogs(string strFileName)
{
try
{
FileInfo fi = new FileInfo(strFileName);
if (fi.Length > iMaxLogLength) // if the log file length is already too long
{
int TotalLines = 0;
var file = File.ReadAllLines(strFileName);
var LineArray = file.ToList();
var AmountToCull = (int)(LineArray.Count - KeepLines);
var trimmed = LineArray.Skip(AmountToCull).ToList();
File.WriteAllLines(strFileName, trimmed);
string archiveName = strFileName + "-" + DateTime.Now.ToString("MM-dd-yyyy") + ".zip";
File.WriteAllBytes(archiveName, Compression.Zip(string.Join("\n", file)));
}
}
catch (Exception ex)
{
Console.WriteLine("Failed to write to logfile : " + ex.Message);
}
}
I have this as part of the initialization / reinitialization section of my application, so it gets run a few times a day.
ErrorLogging.ManageLogs("Application.log");
I wouldn't use this for a file meant to be over, say, 1 MB, and it's not terribly efficient, but it works well if you need to solve the pesky problem of a log file that you can't conveniently maintain otherwise. Make sure the log file exists before you use this, though... or you could add code for that, as well as checking that the location exists, etc.
// This is how to call it
private void buttonLog_Click(object sender, EventArgs e)
{
c_Log.writeToFile(textBoxMessages.Text, "../../log.log", 1);
}
public static class c_Log
{
static int iMaxLogLength = 15000; // Probably should be bigger, say 200,000
static int iTrimmedLogLength = -1000; // minimum of how much of the old log to leave
static public void writeToFile(string strNewLogMessage, string strFile, int iLogLevel)
{
try
{
FileInfo fi = new FileInfo(strFile);
Byte[] bytesSavedFromEndOfOldLog = null;
if (fi.Length > iMaxLogLength) // if the log file length is already too long
{
using (BinaryReader br = new BinaryReader(File.Open(strFile, FileMode.Open)))
{
// Seek to our required position of what you want saved.
br.BaseStream.Seek(iTrimmedLogLength, SeekOrigin.End);
// Read what you want to save and hang onto it.
bytesSavedFromEndOfOldLog = br.ReadBytes((-1 * iTrimmedLogLength));
}
}
byte[] newLine = System.Text.ASCIIEncoding.ASCII.GetBytes(Environment.NewLine);
FileStream fs = null;
// If the log file is less than the max length, just open it at the end to write there
if (fi.Length < iMaxLogLength)
fs = new FileStream(strFile, FileMode.Append, FileAccess.Write, FileShare.Read);
else // If the log file is more than the max length, just open it empty
fs = new FileStream(strFile, FileMode.Create, FileAccess.Write, FileShare.Read);
using (fs)
{
// If you are trimming the file length, write what you saved.
if (bytesSavedFromEndOfOldLog != null)
{
Byte[] lineBreak = Encoding.ASCII.GetBytes("### " + DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss") + " *** *** *** Old Log Start Position *** *** *** *** ###");
fs.Write(newLine, 0, newLine.Length);
fs.Write(newLine, 0, newLine.Length);
fs.Write(lineBreak, 0, lineBreak.Length);
fs.Write(newLine, 0, newLine.Length);
fs.Write(bytesSavedFromEndOfOldLog, 0, bytesSavedFromEndOfOldLog.Length);
fs.Write(newLine, 0, newLine.Length);
}
Byte[] sendBytes = Encoding.ASCII.GetBytes(strNewLogMessage);
// Append your last log message.
fs.Write(sendBytes, 0, sendBytes.Length);
fs.Write(newLine, 0, newLine.Length);
}
}
catch (Exception ex)
{
; // Nothing to do...
//writeEvent("writeToFile() Failed to write to logfile : " + ex.Message + "...", 5);
}
}
}
I have to split a huge file into many smaller files. Each of the destination files is defined by an offset and length as the number of bytes. I'm using the following code:
private void copy(string srcFile, string dstFile, int offset, int length)
{
BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
reader.BaseStream.Seek(offset, SeekOrigin.Begin);
byte[] buffer = reader.ReadBytes(length);
BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
writer.Write(buffer);
}
Considering that I have to call this function about 100,000 times, it is remarkably slow.
Is there a way to make the Writer connected directly to the Reader? (That is, without actually loading the contents into the Buffer in memory.)
I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:
public static void CopySection(Stream input, string targetFile, int length)
{
byte[] buffer = new byte[8192];
using (Stream output = File.OpenWrite(targetFile))
{
int bytesRead = 1;
// This will finish silently if we couldn't read "length" bytes.
// An alternative would be to throw an exception
while (length > 0 && bytesRead > 0)
{
bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
output.Write(buffer, 0, bytesRead);
length -= bytesRead;
}
}
}
This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:
public static void CopySection(Stream input, string targetFile,
int length, byte[] buffer)
{
using (Stream output = File.OpenWrite(targetFile))
{
int bytesRead = 1;
// This will finish silently if we couldn't read "length" bytes.
// An alternative would be to throw an exception
while (length > 0 && bytesRead > 0)
{
bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
output.Write(buffer, 0, bytesRead);
length -= bytesRead;
}
}
}
Note that this also closes the output stream (due to the using statement) which your original code didn't.
The important point is that this will use the operating system file buffering more efficiently, because you reuse the same input stream, instead of reopening the file at the beginning and then seeking.
I think it'll be significantly faster, but obviously you'll need to try it to see...
This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
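For the small-files case, the BufferedStream idea is just a one-line wrap around the shared input stream (the 64 KB buffer size here is an arbitrary choice):

using (var raw = File.OpenRead(srcFile))
using (var input = new BufferedStream(raw, 64 * 1024))
{
    // CopySection(input, "part1.bin", 1000, buffer); ... and so on for each piece
}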
The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability as well as a benchmarking program that looks at different I/O methods, including BinaryReader and BinaryWriter. See my blog post at:
http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/
How large is length? You may do better to re-use a fixed sized (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.
(edit) something like:
private static void copy(string srcFile, string dstFile, int offset,
int length, byte[] buffer)
{
using(Stream inStream = File.OpenRead(srcFile))
using (Stream outStream = File.OpenWrite(dstFile))
{
inStream.Seek(offset, SeekOrigin.Begin);
int bufferLength = buffer.Length, bytesRead;
while (length > bufferLength &&
(bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
{
outStream.Write(buffer, 0, bytesRead);
length -= bytesRead;
}
while (length > 0 &&
(bytesRead = inStream.Read(buffer, 0, length)) > 0)
{
outStream.Write(buffer, 0, bytesRead);
length -= bytesRead;
}
}
}
You shouldn't re-open the source file each time you do a copy; better to open it once and pass the resulting BinaryReader to the copy function. Also, it might help if you order your seeks, so you don't make big jumps inside the file.
If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near to each other and reading the whole block you need for them, for example:
offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000
can be grouped to one read:
offset = 1234, length = 1116 (i.e. up to the end of the last range: 1350 + 1000 - 1234)
Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.
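A hypothetical sketch of that grouping, using the numbers above (the part file names and srcFile are placeholders, and it uses System.Linq for the slicing):

long groupOffset = 1234;
int groupLength = 1116;                     // 2350 - 1234: end of the last range minus start of the first
byte[] block = new byte[groupLength];

using (var src = File.OpenRead(srcFile))
{
    src.Seek(groupOffset, SeekOrigin.Begin);
    src.Read(block, 0, groupLength);

    // Each destination file is just a slice of the one big read:
    File.WriteAllBytes("part1.bin", block.Skip(1234 - (int)groupOffset).Take(34).ToArray());
    File.WriteAllBytes("part2.bin", block.Skip(1300 - (int)groupOffset).Take(40).ToArray());
    File.WriteAllBytes("part3.bin", block.Skip(1350 - (int)groupOffset).Take(1000).ToArray());
}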
Have you considered using the CCR? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes it very easy to do this.
static void Main(string[] args)
{
Dispatcher dp = new Dispatcher();
DispatcherQueue dq = new DispatcherQueue("DQ", dp);
Port<long> offsetPort = new Port<long>();
Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
new Handler<long>(Split)));
FileStream fs = File.Open(file_path, FileMode.Open);
long size = fs.Length;
fs.Dispose();
for (long i = 0; i < size; i += split_size)
{
offsetPort.Post(i);
}
}
private static void Split(long offset)
{
FileStream reader = new FileStream(file_path, FileMode.Open,
FileAccess.Read);
reader.Seek(offset, SeekOrigin.Begin);
long toRead = 0;
if (offset + split_size <= reader.Length)
toRead = split_size;
else
toRead = reader.Length - offset;
byte[] buff = new byte[toRead];
reader.Read(buff, 0, (int)toRead);
reader.Dispose();
File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
}
This code posts offsets to a CCR port which causes a Thread to be created to execute the code in the Split method. This causes you to open the file multiple times but gets rid of the need for synchronization. You can make it more memory efficient but you'll have to sacrifice speed.
The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?
Over 100,000 accesses (sum the times):
How much time is spent allocating the buffer array?
How much time is spent opening the file for read (is it the same file every time?)
How much time is spent in read and write operations?
If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a filestream for writes? (try it, do you get identical output? does it save time?)
Using FileStream + StreamWriter I know it's possible to create massive files in little time (less than 1 min 30 seconds). I generate three files totaling 700+ megabytes from one file using that technique.
Your primary problem with the code you're using is that you are opening a file every time. That is creating file I/O overhead.
If you knew the names of the files you would be generating ahead of time, you could extract the File.OpenWrite into a separate method; it will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.
No one suggests threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files. This way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files (a disk operation) will take WAY longer than splitting up the data. And of course you should verify first that a sequential approach is not adequate.
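As a rough sketch of that idea using the TPL (Task.Run is my substitution for raw threads; the 'sections' list and its fields are assumed, not from the question):

var tasks = sections.Select(s => Task.Run(() =>
{
    using (var input = File.OpenRead(srcFile))   // each task opens its own handle, so no locking is needed
    using (var output = File.Create(s.TargetFile))
    {
        input.Seek(s.Offset, SeekOrigin.Begin);
        var buffer = new byte[s.Length];
        int read = input.Read(buffer, 0, s.Length);
        output.Write(buffer, 0, read);
    }
})).ToArray();

Task.WaitAll(tasks);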
(For future reference.)
Quite possibly the fastest way to do this would be to use memory mapped files (so primarily copying memory, and the OS handling the file reads/writes via its paging/memory management).
Memory-mapped files are supported in managed code in .NET 4.0.
But as noted, you need to profile, and expect to switch to native code for maximum performance.
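A rough sketch of what that looks like on .NET 4.0+ (System.IO.MemoryMappedFiles; the method and file names are assumptions): map the source once, then cut each piece out through a view stream.

static void CopySectionMapped(MemoryMappedFile map, string targetFile, long offset, long length)
{
    using (var view = map.CreateViewStream(offset, length, MemoryMappedFileAccess.Read))
    using (var output = File.Create(targetFile))
    {
        view.CopyTo(output); // the OS pages the data in; no explicit buffer management needed
    }
}

// Usage: map the source file once and reuse it for every section.
// using (var map = MemoryMappedFile.CreateFromFile(srcFile, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
//     CopySectionMapped(map, "part1.bin", 0, 4096);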