SharpZipLib InflaterInputStream.Read behavior with a zero-length buffer - C#

I am working on a .NET Remoting compression sink using InflaterInputStream; I got the guidelines from a .NET Remoting book.
I am using the binary formatter, and the issue I am encountering is that when the .NET BinaryFormatter calls the InflaterInputStream.Read(...) method, it sometimes passes a zero-length byte array as the first parameter. The InflaterInputStream cannot handle a zero-length buffer and throws a "Don't know what to do" exception:
public override int Read(byte[] b, int off, int len)
{
for (;;) {
int count;
try {
count = inf.Inflate(b, off, len);
} catch (Exception e) {
throw new SharpZipBaseException(e.ToString());
}
if (count > 0) {
return count;
}
if (inf.IsNeedingDictionary) {
throw new SharpZipBaseException("Need a dictionary");
} else if (inf.IsFinished) {
return 0;
} else if (inf.IsNeedingInput) {
Fill();
} else {
throw new InvalidOperationException("Don't know what to do");
}
}
}
I am planning to add an "if" block at the top of the method so that it first checks whether the buffer is zero-length; if it is, it just returns 0. Something like:
public override int Read(byte[] b, int off, int len)
{
// Empty buffers should not go into the Inflate method.
if (b.Length == 0)
{
return 0;
}
// ... the rest of the existing Read implementation follows unchanged.
Would this be the most appropriate solution for this issue? The calls to this method come from the .NET BinaryFormatter, so I just handled it in this class, but I am not sure whether returning zero has any side effects on the BinaryFormatter.
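Returning 0 for an empty buffer is consistent with the Stream.Read contract (a zero return only signals end-of-stream when more than zero bytes were requested), so the guard should be safe. As a sketch of an alternative that avoids patching the library source, you could wrap the InflaterInputStream in a pass-through stream that short-circuits empty reads before they ever reach Inflate. The class name below is illustrative; note that it checks len rather than b.Length, which also covers a non-empty array passed with a zero count:

// A minimal sketch (class name is illustrative): a pass-through wrapper that
// short-circuits empty reads so they never reach Inflater.Inflate.
public class EmptyReadGuardStream : Stream
{
    private readonly Stream _inner;

    public EmptyReadGuardStream(Stream inner) { _inner = inner; }

    public override int Read(byte[] b, int off, int len)
    {
        // Checking len (not b.Length) also covers a non-empty array
        // passed with a zero count.
        if (len == 0)
            return 0;
        return _inner.Read(b, off, len);
    }

    public override bool CanRead { get { return _inner.CanRead; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return _inner.Length; } }
    public override long Position
    {
        get { return _inner.Position; }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { _inner.Flush(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}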

Related

Gzip only after a threshold reached?

I have a requirement to archive all the data used to build a report every day. I compress most of the data using gzip, as some of the datasets can be very large (10mb+). I write each individual protobuf graph to a file. I also whitelist a fixed set of known small object types, and added some code to detect whether the file is gzipped when I read it. This is because a small file, when compressed, can actually be bigger than uncompressed.
Unfortunately, just due to the nature of the data, I may only have a few elements of a larger object type, and the whitelist approach can be problematic.
Is there any way to write an object to a stream and, only if it reaches a threshold (like 8kb), compress it? I don't know the size of the object beforehand, and sometimes I have an object graph with an IEnumerable<T> that might be considerable in size.
Edit:
The code is fairly basic. I did skim over the fact that I store this in a FILESTREAM db table, but that shouldn't really matter for implementation purposes. I removed some of the extraneous code.
public async Task SerializeModel<T>(TransactionalDbContext dbConn, T item, DateTime archiveDate, string name)
{
var continuation = (await dbConn
.QueryAsync<PathAndContext>(_getPathAndContext, new {archiveDate, model=name})
.ConfigureAwait(false))
.First();
var useGzip = !_whitelist.Contains(typeof(T));
using (var fs = new SqlFileStream(continuation.Path, continuation.Context, FileAccess.Write,
FileOptions.SequentialScan | FileOptions.Asynchronous, 64*1024))
using (var buffer = useGzip ? new GZipStream(fs, CompressionLevel.Optimal) : default(Stream))
{
_serializerModel.Serialize(buffer ?? fs, item);
}
dbConn.Commit();
}
During the serialization, you can use an intermediate stream to accomplish what you are asking for. Something like this will do the job
class SerializationOutputStream : Stream
{
Stream outputStream, writeStream;
byte[] buffer;
int bufferedCount;
long position;
public SerializationOutputStream(Stream outputStream, int compressThreshold = 8 * 1024)
{
writeStream = this.outputStream = outputStream;
buffer = new byte[compressThreshold];
}
public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
public override void SetLength(long value) { throw new NotSupportedException(); }
public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
public override bool CanRead { get { return false; } }
public override bool CanSeek { get { return false; } }
public override bool CanWrite { get { return writeStream != null && writeStream.CanWrite; } }
public override long Length { get { throw new NotSupportedException(); } }
public override long Position { get { return position; } set { throw new NotSupportedException(); } }
public override void Write(byte[] buffer, int offset, int count)
{
if (count <= 0) return;
var newPosition = position + count;
if (this.buffer == null)
writeStream.Write(buffer, offset, count);
else
{
int bufferCount = Math.Min(count, this.buffer.Length - bufferedCount);
if (bufferCount > 0)
{
Array.Copy(buffer, offset, this.buffer, bufferedCount, bufferCount);
bufferedCount += bufferCount;
}
int remainingCount = count - bufferCount;
if (remainingCount > 0)
{
writeStream = new GZipStream(outputStream, CompressionLevel.Optimal);
try
{
writeStream.Write(this.buffer, 0, this.buffer.Length);
writeStream.Write(buffer, offset + bufferCount, remainingCount);
}
finally { this.buffer = null; }
}
}
position = newPosition;
}
public override void Flush()
{
if (buffer == null)
writeStream.Flush();
else if (bufferedCount > 0)
{
try { outputStream.Write(buffer, 0, bufferedCount); }
finally { buffer = null; }
}
}
protected override void Dispose(bool disposing)
{
try
{
if (!disposing || writeStream == null) return;
try { Flush(); }
finally { writeStream.Close(); }
}
finally
{
writeStream = outputStream = null;
buffer = null;
base.Dispose(disposing);
}
}
}
and use it like this
using (var stream = new SerializationOutputStream(new SqlFileStream(continuation.Path, continuation.Context, FileAccess.Write,
FileOptions.SequentialScan | FileOptions.Asynchronous, 64*1024)))
_serializerModel.Serialize(stream, item);
datasets can be very large (10mb+)
On most devices, that is not very large. Is there a reason you can't read in the entire object before deciding whether to compress? Note also the suggestion from @Niklas to read in one buffer's worth of data (e.g. 8K) before deciding whether to compress.
This is because a small file, when compressed, can actually be bigger than uncompressed.
The thing that makes a small file potentially larger is the ZIP header, in particular the dictionary. Some ZIP libraries allow you to use a custom dictionary known while compressing and uncompressing. I used SharpZipLib for this many years back.
It is more effort, in terms of coding and testing, to use this approach. If you feel that the benefit is worthwhile, it may provide the best approach.
Note no matter what path you take, you will physically store data using multiples of the block size of your storage device.
if the object is 1 byte or 100mb I have no idea
Note that protocol buffers is not really designed for large data sets
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are really just a collection of small pieces, where each small piece may be a structured piece of data.
If your largest object can comfortably serialize into memory, first serialize it into a MemoryStream, then either write that MemoryStream to your final destination, or run it through a GZipStream and then to your final destination. If the largest object cannot comfortably serialize into memory, I'm not sure what further advice to give.
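For reference, a minimal sketch of that buffer-first approach. The _serializerModel name and the 8 KB threshold are assumptions carried over from the question; the method name is illustrative:

// A sketch, not a drop-in implementation: serialize to memory first, then
// decide whether compression is worthwhile.
void SerializeWithThreshold<T>(Stream destination, T item, int threshold = 8 * 1024)
{
    using (var ms = new MemoryStream())
    {
        _serializerModel.Serialize(ms, item);
        ms.Position = 0;
        if (ms.Length <= threshold)
        {
            // Small payload: gzip overhead would likely outweigh any savings.
            ms.CopyTo(destination);
        }
        else
        {
            // leaveOpen keeps the destination usable after the gzip wrapper is disposed.
            using (var gz = new GZipStream(destination, CompressionLevel.Optimal, leaveOpen: true))
            {
                ms.CopyTo(gz);
            }
        }
    }
}

On read-back, the gzip magic bytes (0x1F 0x8B) distinguish the two cases, which the question's existing is-it-gzipped detection already handles.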

IEnumerable to Stream

I would like to do something roughly equivalent to the code example below. I want to generate and serve a stream of data without necessarily having the entire data set in memory at any one time.
It seems like I would need some implementation of Stream that accepts an IEnumerable<string> (or IEnumerable<byte>) in its constructor. Internally this Stream would only walk the IEnumerable as the Stream is being read or as needed. But I don't know of any Stream implementation like this.
Am I on the right track? Do you know of any way to do something like this?
public FileStreamResult GetResult()
{
IEnumerable<string> data = GetDataForStream();
Stream dataStream = ToStringStream(Encoding.UTF8, data);
return File(dataStream, "text/plain", "Result");
}
private IEnumerable<string> GetDataForStream()
{
for (int i = 0; i < 10000; i++)
{
yield return i.ToString();
yield return "\r\n";
}
}
private Stream ToStringStream(Encoding encoding, IEnumerable<string> data)
{
// I have to write my own implementation of stream?
throw new NotImplementedException();
}
Here's a read-only Stream implementation that uses an IEnumerable<byte> as input:
public class ByteStream : Stream, IDisposable
{
private readonly IEnumerator<byte> _input;
private bool _disposed;
public ByteStream(IEnumerable<byte> input)
{
_input = input.GetEnumerator();
}
public override bool CanRead => true;
public override bool CanSeek => false;
public override bool CanWrite => false;
public override long Length => 0;
public override long Position { get; set; } = 0;
public override int Read(byte[] buffer, int offset, int count)
{
int i = 0;
for (; i < count && _input.MoveNext(); i++)
buffer[i + offset] = _input.Current;
return i;
}
public override long Seek(long offset, SeekOrigin origin) => throw new InvalidOperationException();
public override void SetLength(long value) => throw new InvalidOperationException();
public override void Write(byte[] buffer, int offset, int count) => throw new InvalidOperationException();
public override void Flush() => throw new InvalidOperationException();
void IDisposable.Dispose()
{
if (_disposed)
return;
_input.Dispose();
_disposed = true;
}
}
What you then still need is a function that converts IEnumerable<string> to IEnumerable<byte>:
public static IEnumerable<byte> Encode(IEnumerable<string> input, Encoding encoding)
{
byte[] newLine = encoding.GetBytes(Environment.NewLine);
foreach (string line in input)
{
byte[] bytes = encoding.GetBytes(line);
foreach (byte b in bytes)
yield return b;
foreach (byte b in newLine)
yield return b;
}
}
And finally, here's how to use this in your controller:
public FileResult GetResult()
{
IEnumerable<string> data = GetDataForStream();
var stream = new ByteStream(Encode(data, Encoding.UTF8));
return File(stream, "text/plain", "Result.txt");
}
I created a class called ProducerConsumerStream that does this. The producer writes data to the stream and the consumer reads. There's a buffer in the middle so that the producer can "write ahead" a little bit. You can define the size of the buffer.
Anyway, if it's not exactly what you're looking for, I suspect it will give you a good idea of how it's done. See Building a new type of stream.
Update
The link went stale, so I've copied my code here. The original article is still available on the Wayback machine at https://web.archive.org/web/20151210235510/http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=852
First, the ProducerConsumerStream class:
using System;
using System.IO;
using System.Threading;
using System.Diagnostics;
namespace Mischel.IO
{
// This class is safe for 1 producer and 1 consumer.
public class ProducerConsumerStream : Stream
{
private byte[] CircleBuff;
private int Head;
private int Tail;
public bool IsAddingCompleted { get; private set; }
public bool IsCompleted { get; private set; }
// For debugging
private long TotalBytesRead = 0;
private long TotalBytesWritten = 0;
public ProducerConsumerStream(int size)
{
CircleBuff = new byte[size];
Head = 1;
Tail = 0;
}
[Conditional("JIM_DEBUG")]
private void DebugOut(string msg)
{
Console.WriteLine(msg);
}
[Conditional("JIM_DEBUG")]
private void DebugOut(string fmt, params object[] parms)
{
DebugOut(string.Format(fmt, parms));
}
private int ReadBytesAvailable
{
get
{
if (Head > Tail)
return Head - Tail - 1;
else
return CircleBuff.Length - Tail + Head - 1;
}
}
private int WriteBytesAvailable { get { return CircleBuff.Length - ReadBytesAvailable - 1; } }
private void IncrementTail()
{
Tail = (Tail + 1) % CircleBuff.Length;
}
public override int Read(byte[] buffer, int offset, int count)
{
if (disposed)
{
throw new ObjectDisposedException("The stream has been disposed.");
}
if (IsCompleted)
{
throw new EndOfStreamException("The stream is empty and has been marked complete for adding.");
}
if (count == 0)
{
return 0;
}
lock (CircleBuff)
{
DebugOut("Read: requested {0:N0} bytes. Available = {1:N0}.", count, ReadBytesAvailable);
while (ReadBytesAvailable == 0)
{
if (IsAddingCompleted)
{
IsCompleted = true;
return 0;
}
Monitor.Wait(CircleBuff);
}
// If Head < Tail, then there are bytes available at the end of the buffer
// and also at the front of the buffer.
// If reading from Tail to the end doesn't fulfill the request,
// and there are still bytes available,
// then read from the start of the buffer.
DebugOut("Read: Head={0}, Tail={1}, Avail={2}", Head, Tail, ReadBytesAvailable);
IncrementTail();
int bytesToRead;
if (Tail > Head)
{
// When Tail > Head, we know that there are at least
// (CircleBuff.Length - Tail) bytes available in the buffer.
bytesToRead = CircleBuff.Length - Tail;
}
else
{
bytesToRead = Head - Tail;
}
// Don't read more than count bytes!
bytesToRead = Math.Min(bytesToRead, count);
Buffer.BlockCopy(CircleBuff, Tail, buffer, offset, bytesToRead);
Tail += (bytesToRead - 1);
int bytesRead = bytesToRead;
// At this point, either we've exhausted the buffer,
// or Tail is at the end of the buffer and has to wrap around.
if (bytesRead < count && ReadBytesAvailable > 0)
{
// We haven't fulfilled the read.
IncrementTail();
// Tail is always equal to 0 here.
bytesToRead = Math.Min((count - bytesRead), (Head - Tail));
Buffer.BlockCopy(CircleBuff, Tail, buffer, offset + bytesRead, bytesToRead);
bytesRead += bytesToRead;
Tail += (bytesToRead - 1);
}
TotalBytesRead += bytesRead;
DebugOut("Read: returning {0:N0} bytes. TotalRead={1:N0}", bytesRead, TotalBytesRead);
DebugOut("Read: Head={0}, Tail={1}, Avail={2}", Head, Tail, ReadBytesAvailable);
Monitor.Pulse(CircleBuff);
return bytesRead;
}
}
public override void Write(byte[] buffer, int offset, int count)
{
if (disposed)
{
throw new ObjectDisposedException("The stream has been disposed.");
}
if (IsAddingCompleted)
{
throw new InvalidOperationException("The stream has been marked as complete for adding.");
}
lock (CircleBuff)
{
DebugOut("Write: requested {0:N0} bytes. Available = {1:N0}", count, WriteBytesAvailable);
int bytesWritten = 0;
while (bytesWritten < count)
{
while (WriteBytesAvailable == 0)
{
Monitor.Wait(CircleBuff);
}
DebugOut("Write: Head={0}, Tail={1}, Avail={2}", Head, Tail, WriteBytesAvailable);
int bytesToCopy = Math.Min((count - bytesWritten), WriteBytesAvailable);
CopyBytes(buffer, offset + bytesWritten, bytesToCopy);
TotalBytesWritten += bytesToCopy;
DebugOut("Write: {0} bytes written. TotalWritten={1:N0}", bytesToCopy, TotalBytesWritten);
DebugOut("Write: Head={0}, Tail={1}, Avail={2}", Head, Tail, WriteBytesAvailable);
bytesWritten += bytesToCopy;
Monitor.Pulse(CircleBuff);
}
}
}
private void CopyBytes(byte[] buffer, int srcOffset, int count)
{
// Insert at head
// The copy might require two separate operations.
// copy as much as can fit between Head and end of the circular buffer
int offset = srcOffset;
int bytesCopied = 0;
int bytesToCopy = Math.Min(CircleBuff.Length - Head, count);
if (bytesToCopy > 0)
{
Buffer.BlockCopy(buffer, offset, CircleBuff, Head, bytesToCopy);
bytesCopied = bytesToCopy;
Head = (Head + bytesToCopy) % CircleBuff.Length;
offset += bytesCopied;
}
// Copy the remainder, which will go from the beginning of the buffer.
if (bytesCopied < count)
{
bytesToCopy = count - bytesCopied;
Buffer.BlockCopy(buffer, offset, CircleBuff, Head, bytesToCopy);
Head = (Head + bytesToCopy) % CircleBuff.Length;
}
}
public void CompleteAdding()
{
if (disposed)
{
throw new ObjectDisposedException("The stream has been disposed.");
}
lock (CircleBuff)
{
DebugOut("CompleteAdding: {0:N0} bytes written.", TotalBytesWritten);
IsAddingCompleted = true;
Monitor.Pulse(CircleBuff);
}
}
public override bool CanRead { get { return true; } }
public override bool CanSeek { get { return false; } }
public override bool CanWrite { get { return true; } }
public override void Flush() { /* does nothing */ }
public override long Length { get { throw new NotImplementedException(); } }
public override long Position
{
get { throw new NotImplementedException(); }
set { throw new NotImplementedException(); }
}
public override long Seek(long offset, SeekOrigin origin)
{
throw new NotImplementedException();
}
public override void SetLength(long value)
{
throw new NotImplementedException();
}
private bool disposed = false;
protected override void Dispose(bool disposing)
{
if (!disposed)
{
base.Dispose(disposing);
disposed = true;
}
}
}
}
And an example of how to use it:
class Program
{
static readonly string TestText = "This is a test of the emergency broadcast system.";
static readonly byte[] TextBytes = Encoding.UTF8.GetBytes(TestText);
const int Megabyte = 1024 * 1024;
const int TestBufferSize = 12;
const int ProducerBufferSize = 4;
const int ConsumerBufferSize = 5;
static void Main(string[] args)
{
Console.WriteLine("TextBytes contains {0:N0} bytes.", TextBytes.Length);
using (var pcStream = new ProducerConsumerStream(TestBufferSize))
{
Thread ProducerThread = new Thread(ProducerThreadProc);
Thread ConsumerThread = new Thread(ConsumerThreadProc);
ProducerThread.Start(pcStream);
Thread.Sleep(2000);
ConsumerThread.Start(pcStream);
ProducerThread.Join();
ConsumerThread.Join();
}
Console.Write("Done. Press Enter.");
Console.ReadLine();
}
static void ProducerThreadProc(object state)
{
Console.WriteLine("Producer: started.");
var pcStream = (ProducerConsumerStream)state;
int offset = 0;
while (offset < TestText.Length)
{
int bytesToWrite = Math.Min(ProducerBufferSize, TestText.Length - offset);
pcStream.Write(TextBytes, offset, bytesToWrite);
offset += bytesToWrite;
}
pcStream.CompleteAdding();
Console.WriteLine("Producer: {0:N0} total bytes written.", offset);
Console.WriteLine("Producer: exit.");
}
static void ConsumerThreadProc(object state)
{
Console.WriteLine("Consumer: started.");
var instream = (ProducerConsumerStream)state;
int testOffset = 0;
var inputBuffer = new byte[TextBytes.Length];
int bytesRead;
do
{
int bytesToRead = Math.Min(ConsumerBufferSize, inputBuffer.Length - testOffset);
bytesRead = instream.Read(inputBuffer, testOffset, bytesToRead);
//Console.WriteLine("Consumer: {0:N0} bytes read.", bytesRead);
testOffset += bytesRead;
} while (bytesRead != 0);
Console.WriteLine("Consumer: {0:N0} total bytes read.", testOffset);
// Compare bytes read with TextBytes
for (int i = 0; i < TextBytes.Length; ++i)
{
if (inputBuffer[i] != TextBytes[i])
{
Console.WriteLine("Read error at position {0}", i);
break;
}
}
Console.WriteLine("Consumer: exit.");
}
}
I had the same problem. In my case, a third-party package only accepts streams, but I have an IEnumerable and couldn't find an answer online, so I wrote my own, which I'll share:
public class IEnumerableStringReader : TextReader
{
private readonly IEnumerator<string> _enumerator;
private bool eof = false; // is set to true when .MoveNext tells us there is no more data.
private char[] curLine = null;
private int curLinePos = 0;
private bool disposed = false;
public IEnumerableStringReader(IEnumerable<string> input)
{
_enumerator = input.GetEnumerator();
}
private void GetNextLine()
{
if (eof) return;
eof = !_enumerator.MoveNext();
if (eof) return;
curLine = $"{_enumerator.Current}\r\n" // IEnumerable<string> input implies newlines exist between the lines.
.ToCharArray();
curLinePos = 0;
}
public override int Peek()
{
if (disposed) throw new ObjectDisposedException("The stream has been disposed.");
if (curLine == null || curLinePos == curLine.Length) GetNextLine();
if (eof) return -1;
return curLine[curLinePos];
}
public override int Read()
{
if (disposed) throw new ObjectDisposedException("The stream has been disposed.");
if (curLine == null || curLinePos == curLine.Length) GetNextLine();
if (eof) return -1;
return curLine[curLinePos++];
}
public override int Read(char[] buffer, int index, int count)
{
if (disposed) throw new ObjectDisposedException("The stream has been disposed.");
if (count == 0) return 0;
int charsReturned = 0;
int maxChars = Math.Min(count, buffer.Length - index); // Assuming we don't run out of input chars, we return count characters if we can. If the space left in the buffer is not big enough we return as many as will fit in the buffer.
while (charsReturned < maxChars)
{
if (curLine == null || curLinePos == curLine.Length) GetNextLine();
if (eof) return charsReturned;
int maxCurrentCopy = maxChars - charsReturned;
int charsAtTheReady = curLine.Length - curLinePos; // chars available in current line
int copySize = Math.Min(maxCurrentCopy, charsAtTheReady); // stop at end of buffer.
// can't use Buffer.BlockCopy because it's byte based and we're dealing with chars.
Array.ConstrainedCopy(curLine, curLinePos, buffer, index, copySize);
index += copySize;
curLinePos += copySize;
charsReturned += copySize;
}
return charsReturned;
}
public override string ReadLine()
{
if (curLine == null || curLinePos == curLine.Length) GetNextLine();
if (eof) return null;
if (curLinePos > 0) // this is necessary in case the client uses both Read() and ReadLine() calls
{
var tmp = new string(curLine, curLinePos, (curLine.Length - curLinePos) - 2); // create a new string from the remainder of the char array. The -2 is because GetNextLine appends a crlf.
curLinePos = curLine.Length; // so next call will re-read
return tmp;
}
// read full line.
curLinePos = curLine.Length; // so next call will re-read
return _enumerator.Current; // if all the client does is call ReadLine this (faster) code path will be taken.
}
protected override void Dispose(bool disposing)
{
if (!disposed)
{
_enumerator.Dispose();
base.Dispose(disposing);
disposed = true;
}
}
}
In my case, I want to use it as input to Datastreams.Csv:
using (var tr = new IEnumerableStringReader(input))
using (var reader = new CsvReader(tr))
{
while (reader.ReadRecord())
{
// do whatever
}
}
Using the EnumerableToStream NuGet package, you would implement your method like so:
using EnumerableToStream;
private Stream ToStringStream(Encoding encoding, IEnumerable<string> data)
{
return data.ToStream(encoding);
}
I had the same requirement and ended up rolling my own implementation which I have been using for a while now. Getting all the nitty-gritty details just right took some time and effort. For instance, you want your IEnumerable to be disposed after the stream is read to the end and you don't want multibyte characters to be partially written to the buffer.
In this particular implementation, reading the stream does zero allocations, unlike other implementations using encoding.GetBytes(line).
After seeing this question, I decided to release the code as a NuGet package. Hope it saves you a few hours. The source code is on GitHub.
Steve Sadler wrote a perfectly working answer. However, he makes it way more difficult than needed.
According to the reference source of TextReader, you only need to override Peek and Read:
A subclass must minimally implement the Peek() and Read() methods.
So first I write a function that converts IEnumerable<string> into IEnumerable<char> where a new line is added at the end of each string:
private static IEnumerable<char> ReadCharacters(IEnumerable<string> lines)
{
foreach (string line in lines)
{
foreach (char c in line + Environment.NewLine)
{
yield return c;
}
}
}
Environment.NewLine is the part that adds the new line at the end of each string.
Now the class is fairly straightforward:
class EnumStringReader : TextReader
{
public EnumStringReader(IEnumerable<string> lines)
{
this.enumerator = ReadCharacters(lines).GetEnumerator();
this.dataAvailable = this.enumerator.MoveNext();
}
private bool disposed = false;
private bool dataAvailable;
private readonly IEnumerator<char> enumerator;
The constructor takes a sequence of lines to read. It uses the function written earlier to convert this sequence into a sequence of characters with the added Environment.NewLine.
It gets the enumerator of the converted sequence and moves to the first character, remembering whether there is a first character in dataAvailable.
Now we are ready to Peek: if no data available: return -1, otherwise return the current character as int. Do not move forward:
public override int Peek()
{
this.ThrowIfDisposed();
return this.dataAvailable ? this.enumerator.Current : -1;
}
Read: if no data available, return -1, otherwise return the current character as int. Move forward to the next character and remember whether there is data available:
public override int Read()
{
this.ThrowIfDisposed();
if (this.dataAvailable)
{
char nextChar = this.enumerator.Current;
this.dataAvailable = this.enumerator.MoveNext();
return (int)nextChar;
}
else
{
return -1;
}
}
Don't forget to override Dispose(bool) where you dispose the enumerator.
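For completeness, a sketch of those two pieces, i.e. the Dispose override plus the ThrowIfDisposed helper that Peek and Read call above (both use the disposed and enumerator fields already declared in the class):

protected override void Dispose(bool disposing)
{
    if (!this.disposed)
    {
        if (disposing)
        {
            // Disposing the enumerator also ends any iterator block backing the sequence.
            this.enumerator.Dispose();
        }
        this.disposed = true;
    }
    base.Dispose(disposing);
}

private void ThrowIfDisposed()
{
    if (this.disposed)
    {
        throw new ObjectDisposedException(this.GetType().FullName);
    }
}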
That is all that is needed. All other functions will use these two.
Now to fill your stream with the lines:
IEnumerable<string> lines = ...
using (TextWriter writer = System.IO.File.CreateText(...))
{
using (TextReader reader = new EnumStringReader(lines))
{
// either write per char:
while (reader.Peek() != -1)
{
char c = (char)reader.Read();
writer.Write(c);
}
// or write per line:
string line = reader.ReadLine();
// line is without newLine!
while (line != null)
{
writer.WriteLine(line);
line = reader.ReadLine();
}
// or write per block
char[] buf = new char[4096];
int nrRead = reader.ReadBlock(buf, 0, buf.Length);
while (nrRead > 0)
{
writer.Write(buf, 0, nrRead);
nrRead = reader.ReadBlock(buf, 0, buf.Length);
}
}
}

How to compute hash of a large file chunk?

I want to be able to compute the hashes of arbitrarily sized chunks of a file in C#.
e.g.: Compute the hash of the 3rd gigabyte of a 4 GB file.
The main problem is that I don't want to load the entire file into memory, as there could be several files and the offsets could be quite arbitrary.
AFAIK, HashAlgorithm.ComputeHash allows me to use either a byte buffer or a stream. The stream would let me compute the hash efficiently, but for the entire file, not just a specific chunk.
I was thinking of creating an alternate FileStream object and passing it to ComputeHash, where I would override the FileStream methods and read only a certain chunk of the file.
Is there a better solution than this, preferably using the built-in C# libraries?
Thanks.
You should pass in either:
A byte array containing the chunk of data to compute the hash from
A stream that restricts access to the chunk you want to compute the hash from
The second option isn't all that hard; here's a quick LINQPad program I threw together. Note that it lacks quite a bit of error handling, such as checking that the chunk is actually available (i.e. that the position and length you pass in actually exist and don't fall off the end of the underlying stream).
Needless to say, if this should end up as production code I would add a lot of error handling, and write a bunch of unit-tests to ensure all edge-cases are handled correctly.
You would construct the PartialStream instance for your file like this:
const long gb = 1024 * 1024 * 1024;
using (var fileStream = new FileStream(@"d:\temp\too_long_file.bin", FileMode.Open))
using (var chunk = new PartialStream(fileStream, 2 * gb, 1 * gb))
{
var hash = hashAlgorithm.ComputeHash(chunk);
}
Here's the LINQPad test program:
void Main()
{
var buffer = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();
using (var underlying = new MemoryStream(buffer))
using (var partialStream = new PartialStream(underlying, 64, 32))
{
var temp = new byte[1024]; // too much, ensure we don't read past window end
partialStream.Read(temp, 0, temp.Length);
temp.Dump();
// should output 64-95 and then 0's for the rest (64-95 = 32 bytes)
}
}
public class PartialStream : Stream
{
private readonly Stream _UnderlyingStream;
private readonly long _Position;
private readonly long _Length;
public PartialStream(Stream underlyingStream, long position, long length)
{
if (!underlyingStream.CanRead || !underlyingStream.CanSeek)
throw new ArgumentException("underlyingStream");
_UnderlyingStream = underlyingStream;
_Position = position;
_Length = length;
_UnderlyingStream.Position = position;
}
public override bool CanRead
{
get
{
return _UnderlyingStream.CanRead;
}
}
public override bool CanWrite
{
get
{
return false;
}
}
public override bool CanSeek
{
get
{
return true;
}
}
public override long Length
{
get
{
return _Length;
}
}
public override long Position
{
get
{
return _UnderlyingStream.Position - _Position;
}
set
{
_UnderlyingStream.Position = value + _Position;
}
}
public override void Flush()
{
throw new NotSupportedException();
}
public override long Seek(long offset, SeekOrigin origin)
{
switch (origin)
{
case SeekOrigin.Begin:
return _UnderlyingStream.Seek(_Position + offset, SeekOrigin.Begin) - _Position;
case SeekOrigin.End:
return _UnderlyingStream.Seek(_Position + _Length + offset, SeekOrigin.Begin) - _Position;
case SeekOrigin.Current:
return _UnderlyingStream.Seek(offset, SeekOrigin.Current) - _Position;
default:
throw new ArgumentException("origin");
}
}
public override void SetLength(long length)
{
throw new NotSupportedException();
}
public override int Read(byte[] buffer, int offset, int count)
{
long left = _Length - Position;
if (left < count)
count = (int)left;
return _UnderlyingStream.Read(buffer, offset, count);
}
public override void Write(byte[] buffer, int offset, int count)
{
throw new NotSupportedException();
}
}
You can use TransformBlock and TransformFinalBlock directly. That's pretty similar to what HashAlgorithm.ComputeHash does internally.
Something like:
using (var hashAlgorithm = new SHA256Managed())
using (var fileStream = File.OpenRead(...))
{
fileStream.Position = ...;
long bytesToHash = ...;
var buf = new byte[4 * 1024];
while (bytesToHash > 0)
{
var bytesRead = fileStream.Read(buf, 0, (int)Math.Min(bytesToHash, buf.Length));
if (bytesRead == 0)
throw new InvalidOperationException("Unexpected end of stream");
hashAlgorithm.TransformBlock(buf, 0, bytesRead, null, 0);
bytesToHash -= bytesRead;
}
hashAlgorithm.TransformFinalBlock(buf, 0, 0);
var hash = hashAlgorithm.Hash;
return hash;
}
Your suggestion - passing in a restricted access wrapper for your FileStream - is the cleanest solution. Your wrapper should defer everything to the wrapped Stream except the Length and Position properties.
How? Simply create a class that inherits from Stream (a minimal skeleton follows the list below). Make the constructor take:
Your source Stream (in your case, a FileStream)
The chunk start position
The chunk end position
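A minimal skeleton of that wrapper under those assumptions (the ChunkStream name is illustrative; the fuller PartialStream earlier in this thread shows a version with more features):

public class ChunkStream : Stream
{
    private readonly Stream _source;
    private readonly long _start;
    private readonly long _end;

    public ChunkStream(Stream source, long start, long end)
    {
        _source = source;
        _start = start;
        _end = end;
        _source.Position = start;
    }

    // Length and Position are reported relative to the chunk window.
    public override long Length { get { return _end - _start; } }
    public override long Position
    {
        get { return _source.Position - _start; }
        set { _source.Position = value + _start; }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Clamp the read so it never crosses the end of the chunk;
        // everything else defers to the wrapped stream.
        long remaining = _end - _source.Position;
        if (remaining <= 0)
            return 0;
        return _source.Read(buffer, offset, (int)Math.Min(count, remaining));
    }

    public override bool CanRead { get { return _source.CanRead; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}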
As an extension - here is a list of all the Stream implementations that are available: http://msdn.microsoft.com/en-us/library/system.io.stream%28v=vs.100%29.aspx#inheritanceContinued
To easily compute the hash of a chunk of a larger stream, use these two methods:
HashAlgorithm.TransformBlock
HashAlgorithm.TransformFinalBlock
Here's a LINQPad program that demonstrates:
void Main()
{
const long gb = 1024 * 1024 * 1024;
using (var stream = new FileStream(@"d:\temp\largefile.bin", FileMode.Open))
{
stream.Position = 2 * gb; // 3rd gb-chunk
byte[] buffer = new byte[32768];
long amount = 1 * gb;
using (var hashAlgorithm = SHA1.Create())
{
while (amount > 0)
{
int bytesRead = stream.Read(buffer, 0,
(int)Math.Min(buffer.Length, amount));
if (bytesRead > 0)
{
amount -= bytesRead;
if (amount > 0)
hashAlgorithm.TransformBlock(buffer, 0, bytesRead,
buffer, 0);
else
hashAlgorithm.TransformFinalBlock(buffer, 0, bytesRead);
}
else
throw new InvalidOperationException();
}
hashAlgorithm.Hash.Dump();
}
}
}
To answer your original question ("Is there a better solution..."):
Not that I know of.
This seems to be a very special, non-trivial task, so a little extra work might be involved anyway. I think your approach of using a custom Stream class goes in the right direction; I'd probably do exactly the same.
And Gusdor and xander have already provided very helpful information on how to implement that — good job guys!

Save/load 2 XDocuments to/from one stream

I've got 2 XDocuments. One is some meta data, the other is a lot of data.
On the Xbox (XNA), I'd like to be able to save both to a file stream, meta data XDoc first, then the actual data XDoc.
I'd then like to be able to access just the meta data XDoc (ignoring the rest of the file stream), and also to be able to access the meta data XDoc and the data XDoc.
Currently i'm saving/loading as follows:
public void Serialise(Stream SaveStream, object Obj)
{
XDocument XDoc = new XDocument(new XElement(@"SaveData", new XAttribute(@"Version", @"1.0"),
GetXMLElement(Obj)));
XDoc.Save(SaveStream);
}
public object Deserialise(Stream ObjectStream)
{
XDocument XDoc = XDocument.Load(ObjectStream); // Error line
switch (XDoc.Element(@"SaveData").Attribute(@"Version").Value)
{
case @"1.0":
return GetObject(XDoc.Element(@"SaveData").FirstNode as XElement);
default:
throw new NotSupportedException("This save file version (" + XDoc.Element(@"SaveData").Attribute(@"Version").Value +
") is not supported, please upgrade your game.");
}
}
To save meta data followed by actual data, I'm just calling Serialise twice on the same stream.
I get a file as below:
<?xml version="1.0" encoding="utf-8"?>
<SaveData Version="1.0">
....
</SaveData><?xml version="1.0" encoding="utf-8"?>
<SaveData Version="1.0">
....
</SaveData>
The problem comes when I try to read the first XDoc: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it. Line 18, position 14.
Any help would be greatly appreciated.
The XML declaration (<?xml version=....?>) can only appear once at the beginning of the document - that is the error you are seeing. You also need a root node, so you can't serialize two different documents together under one file in this manner. If you fixed the XML declaration, this would probably be the next error you get.
If you want to save and load the documents from one file, then you need to combine them into a single document for serialization/deserialization.
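A minimal sketch of that combine-into-one-document approach; the Container/Meta/Data element names are illustrative:

// Wrap both documents' roots under one wrapper element so the file is a
// single well-formed XML document.
public void SerialiseBoth(Stream saveStream, XDocument metaDoc, XDocument dataDoc)
{
    var combined = new XDocument(
        new XElement("Container",
            new XElement("Meta", metaDoc.Root),
            new XElement("Data", dataDoc.Root)));
    combined.Save(saveStream);
}

public XDocument DeserialiseMeta(Stream stream)
{
    // Note: this still parses the whole file; it only discards the data
    // part, so it does not avoid reading the large section from disk.
    var combined = XDocument.Load(stream);
    return new XDocument(combined.Root.Element("Meta").Elements().First());
}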
I ended up writing my own Stream, which can be thought of as a multi-stream. It allows you to treat one stream as multiple streams in succession, i.e. pass a MultiStream to an XML parser (or anything else) and it'll read up to a marker that says 'this is the end of the stream'. If you then pass the same stream to another XML parser, it'll read from that marker to the next one, or to EOF:
public class MultiStream : Stream
{
private readonly byte[] _RandomBytes = "410801dd-6f14-4d68-8e0e-29686d212cb2".Select(c => (byte)c).ToArray();
private Queue<byte> _RollingBytesRead;
private Stream _UnderlyingStream;
private bool UnderlyingEOF = false;
private bool EOFMarker = false;
private int BufferedBytesToRead = 0;
public MultiStream(Stream UnderlyingStream)
: base()
{
_UnderlyingStream = UnderlyingStream;
_RollingBytesRead = new Queue<byte>(_RandomBytes.Length);
}
public override bool CanRead
{
get { return !UnderlyingEOF || _UnderlyingStream.CanRead; }
}
public override bool CanSeek
{
get { return false; }
}
public override bool CanWrite
{
get { return _UnderlyingStream.CanWrite; }
}
public override void Flush()
{
_UnderlyingStream.Flush();
}
public override long Length
{
get { throw new NotSupportedException(); }
}
public override long Position
{
get
{
throw new NotSupportedException();
}
set
{
throw new NotSupportedException();
}
}
public override int ReadByte()
{
if (EOFMarker)
return -1;
// This should read the next byte from the underlying stream, check for the random bytes EOF marker, then return the next byte from the buffer
// If our buffer is smaller than the random bytes and we've not hit the EOF, then we need to fill it
while (!UnderlyingEOF && _RollingBytesRead.Count < _RandomBytes.Length)
{
int BytesRead = _UnderlyingStream.ReadByte();
if (BytesRead == -1)
{
UnderlyingEOF = true;
BufferedBytesToRead = _RollingBytesRead.Count;
}
else
{
_RollingBytesRead.Enqueue((byte)BytesRead);
}
}
if (EncounteredEndOfStreamBytes()) // Now check to see if the buffer matches our EOF marker
{
// If it does stop now, since we don't want to output any of the EOF marker.
BufferedBytesToRead = 0;
_RollingBytesRead.Clear();
EOFMarker = true;
return -1;
}
else if (UnderlyingEOF) // If we've already encountered the end of the underlying stream and have a buffer,
// then output the next byte since it's not part of the EOF marker, it's part of the stream
{
if (BufferedBytesToRead != 0)
{
BufferedBytesToRead--;
return _RollingBytesRead.Dequeue();
}
else
{
return -1;
}
}
else
{
int ByteRead = _UnderlyingStream.ReadByte();
if (ByteRead == -1)
{
// We've reached the end so we should output the buffer
UnderlyingEOF = true;
BufferedBytesToRead = _RollingBytesRead.Count;
// Recurse once just to avoid repeating code above
return ReadByte();
}
else
{
byte BufferedByte = _RollingBytesRead.Dequeue();
_RollingBytesRead.Enqueue((byte)ByteRead);
return BufferedByte;
}
}
}
public override int Read(byte[] buffer, int offset, int count)
{
// offset is the index in the destination buffer at which to start storing
// bytes; it must not be consumed as bytes to skip from the stream.
int BufferIndex = 0;
while (count > 0)
{
// Read the next byte (includes checks for our end of stream marker) and actually returns the buffered byte (not the next underlying stream read byte)
int ByteRead = ReadByte();
if (ByteRead == -1)
{
break;
}
buffer[offset + BufferIndex] = (byte)ByteRead;
count--;
BufferIndex++;
}
return BufferIndex;
}
private bool EncounteredEndOfStreamBytes()
{
if (_RollingBytesRead.Count != _RandomBytes.Length)
return false;
byte[] QueueArray = _RollingBytesRead.ToArray();
for (int i = 0; i < _RandomBytes.Length; i++)
{
if (_RandomBytes[i] != QueueArray[i])
return false;
}
return true;
}
public override long Seek(long offset, SeekOrigin origin)
{
throw new NotSupportedException();
}
public override void SetLength(long value)
{
throw new NotSupportedException();
}
public override void Write(byte[] buffer, int offset, int count)
{
_UnderlyingStream.Write(buffer, offset, count);
}
public void WriteStreamSeperator()
{
Write(_RandomBytes, 0, _RandomBytes.Length);
}
public void AdvanceToNextStream()
{
if (UnderlyingEOF)
throw new InvalidOperationException("No more streams");
// If we're not currently at an EOF marker, advance until we get to one.
while (!EOFMarker)
{
ReadByte();
}
EOFMarker = false;
_RollingBytesRead.Clear();
}
}
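For illustration, hypothetical usage of this class for the two-document scenario, assuming metaDoc and dataDoc are the two XDocuments:

// Writing: save both documents into one file, separated by the marker.
using (var fs = File.Create("save.xml"))
{
    var multi = new MultiStream(fs);
    metaDoc.Save(multi);
    multi.WriteStreamSeperator();
    dataDoc.Save(multi);
}

// Reading: each XDocument.Load sees what looks like a complete stream.
using (var fs = File.OpenRead("save.xml"))
{
    var multi = new MultiStream(fs);
    XDocument meta = XDocument.Load(multi);  // stops at the separator
    multi.AdvanceToNextStream();             // reset past the marker
    XDocument data = XDocument.Load(multi);
}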

Difference between multiple BinaryReader.Read() and BinaryReader.ReadBytes(int i)

I was searching for a BinaryReader.Skip function when I came across this feature request on MSDN.
It said you can provide your own BinaryReader.Skip() function by using this.
Looking at this code alone, I'm wondering why they chose this way to skip a certain number of bytes:
for (int i = 0; i < count; i++) {
reader.ReadByte();
}
Is there a difference between that and:
reader.ReadBytes(count);
Even if it's just a small optimization, I'd like to understand, because right now it doesn't make sense to me why you would use the for loop.
public static void Skip(this BinaryReader reader, int count) {
if (reader.BaseStream.CanSeek) {
reader.BaseStream.Seek(count, SeekOrigin.Current);
}
else {
for (int i = 0; i < count; i++) {
reader.ReadByte();
}
}
}
No, there is no difference. EDIT: Assuming that the stream has enough bytes.
The ReadByte method simply forwards to the underlying Stream's ReadByte method.
The ReadBytes method calls the underlying stream's Read until it reads the required number of bytes.
It's defined like this:
public virtual byte[] ReadBytes(int count) {
if (count < 0) throw new ArgumentOutOfRangeException("count", Environment.GetResourceString("ArgumentOutOfRange_NeedNonNegNum"));
Contract.Ensures(Contract.Result<byte[]>() != null);
Contract.Ensures(Contract.Result<byte[]>().Length <= Contract.OldValue(count));
Contract.EndContractBlock();
if (m_stream==null) __Error.FileNotOpen();
byte[] result = new byte[count];
int numRead = 0;
do {
int n = m_stream.Read(result, numRead, count);
if (n == 0)
break;
numRead += n;
count -= n;
} while (count > 0);
if (numRead != result.Length) {
// Trim array. This should happen on EOF & possibly net streams.
byte[] copy = new byte[numRead];
Buffer.InternalBlockCopy(result, 0, copy, 0, numRead);
result = copy;
}
return result;
}
For most streams, ReadBytes will probably be faster.
ReadByte will throw an EndOfStreamException if the end of the stream is reached, whereas ReadBytes will not. It depends on whether you want Skip to throw if it cannot skip the requested number of bytes without reaching the end of the stream.
ReadBytes is faster than multiple ReadByte calls.
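To make that difference concrete, here is a sketch (an editorial assumption, not part of the original answers) of a Skip variant that uses ReadBytes on non-seekable streams and reports whether the full count was skipped:

// ReadBytes returns a shorter array at end of stream instead of throwing,
// so the caller can tell whether the full count was actually skipped.
public static bool TrySkip(this BinaryReader reader, int count)
{
    if (reader.BaseStream.CanSeek)
    {
        reader.BaseStream.Seek(count, SeekOrigin.Current);
        return true;
    }
    return reader.ReadBytes(count).Length == count;
}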
It's a very small optimization which will occasionally skip bytes (rather than reading them into ReadByte). Think of it this way:
if(vowel)
{
println(vowel);
}
else
{
nextLetter();
}
If you can prevent that extra function call, you save a little runtime.
