Gzip only after a threshold is reached? - C#

I have a requirement to archive all the data used to build a report every day. I compress most of the data using gzip, as some of the datasets can be very large (10 MB+). I write each individual protobuf graph to a file. I also whitelist a fixed set of known small object types, and I added some code to detect whether the file is gzipped when I read it back. This is because a small file, when compressed, can actually be bigger than the uncompressed version.
Unfortunately, just due to the nature of the data, I may only have a few elements of a larger object type, and the whitelist approach can be problematic.
Is there any way to write an object to a stream and only compress it if it reaches a threshold (like 8 KB)? I don't know the size of the object beforehand, and sometimes I have an object graph with an IEnumerable<T> that might be considerable in size.
Edit:
The code is fairly basic. I did skim over the fact that I store this in a FILESTREAM db table; that shouldn't really matter for implementation purposes. I removed some of the extraneous code.
public async Task SerializeModel<T>(TransactionalDbContext dbConn, T item, DateTime archiveDate, string name)
{
    var continuation = (await dbConn
        .QueryAsync<PathAndContext>(_getPathAndContext, new {archiveDate, model = name})
        .ConfigureAwait(false))
        .First();

    var useGzip = !_whitelist.Contains(typeof(T));

    using (var fs = new SqlFileStream(continuation.Path, continuation.Context, FileAccess.Write,
        FileOptions.SequentialScan | FileOptions.Asynchronous, 64 * 1024))
    using (var stream = useGzip ? new GZipStream(fs, CompressionLevel.Optimal) : default(Stream))
    {
        _serializerModel.Serialize(stream ?? fs, item);
    }

    dbConn.Commit();
}

During the serialization, you can use an intermediate stream to accomplish what you are asking for. Something like this will do the job
class SerializationOutputStream : Stream
{
    Stream outputStream, writeStream;
    byte[] buffer;
    int bufferedCount;
    long position;

    public SerializationOutputStream(Stream outputStream, int compressThreshold = 8 * 1024)
    {
        writeStream = this.outputStream = outputStream;
        buffer = new byte[compressThreshold];
    }

    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return writeStream != null && writeStream.CanWrite; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position { get { return position; } set { throw new NotSupportedException(); } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        if (count <= 0) return;
        var newPosition = position + count;
        if (this.buffer == null)
        {
            // Threshold already exceeded earlier - write straight through (compressed).
            writeStream.Write(buffer, offset, count);
        }
        else
        {
            // Still below the threshold - keep buffering.
            int bufferCount = Math.Min(count, this.buffer.Length - bufferedCount);
            if (bufferCount > 0)
            {
                Array.Copy(buffer, offset, this.buffer, bufferedCount, bufferCount);
                bufferedCount += bufferCount;
            }
            int remainingCount = count - bufferCount;
            if (remainingCount > 0)
            {
                // Threshold reached - switch to a GZipStream, push the buffered
                // bytes through it, then write the remainder of this call.
                writeStream = new GZipStream(outputStream, CompressionLevel.Optimal);
                try
                {
                    writeStream.Write(this.buffer, 0, this.buffer.Length);
                    writeStream.Write(buffer, offset + bufferCount, remainingCount);
                }
                finally { this.buffer = null; }
            }
        }
        position = newPosition;
    }

    public override void Flush()
    {
        if (buffer == null)
            writeStream.Flush();
        else if (bufferedCount > 0)
        {
            // Never reached the threshold - write the buffered bytes uncompressed.
            try { outputStream.Write(buffer, 0, bufferedCount); }
            finally { buffer = null; }
        }
    }

    protected override void Dispose(bool disposing)
    {
        try
        {
            if (!disposing || writeStream == null) return;
            try { Flush(); }
            finally { writeStream.Close(); }
        }
        finally
        {
            writeStream = outputStream = null;
            buffer = null;
            base.Dispose(disposing);
        }
    }
}
and use it like this
using (var stream = new SerializationOutputStream(new SqlFileStream(continuation.Path, continuation.Context, FileAccess.Write,
FileOptions.SequentialScan | FileOptions.Asynchronous, 64*1024)))
_serializerModel.Serialize(stream, item);

datasets can be very large (10 MB+)
On most devices, that is not very large. Is there a reason you can't read in the entire object before deciding whether to compress? Note also the suggestion from @Niklas to read in one buffer's worth of data (e.g. 8 KB) before deciding whether to compress.
This is because a small file, when compressed, can actually be bigger than the uncompressed version.
The thing that makes a small file potentially larger is the ZIP header, in particular the dictionary. Some ZIP libraries allow you to use a custom dictionary that is known to both the compressing and decompressing side. I used SharpZipLib for this many years back.
It is more effort, in terms of coding and testing, to use this approach. If you feel that the benefit is worthwhile, it may provide the best approach.
Note no matter what path you take, you will physically store data using multiples of the block size of your storage device.
if the object is 1 byte or 100 MB I have no idea
Note that Protocol Buffers are not really designed for large data sets:
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are really just a collection of small pieces, where each small piece may be a structured piece of data.
If your largest object can comfortably serialize into memory, first serialize it into a MemoryStream, then either write that MemoryStream to your final destination, or run it through a GZipStream and then to your final destination. If the largest object cannot comfortably serialize into memory, I'm not sure what further advice to give.
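For example, a minimal sketch of that approach, reusing the serializer from the question (destination stands in for the SqlFileStream, and the 8 KB threshold is arbitrary):
// Sketch only: serialize to memory first, then decide whether compression is worth it.
const int compressionThreshold = 8 * 1024;             // arbitrary threshold

using (var ms = new MemoryStream())
{
    _serializerModel.Serialize(ms, item);
    ms.Position = 0;

    if (ms.Length < compressionThreshold)
    {
        ms.CopyTo(destination);                        // small graph: store uncompressed
    }
    else
    {
        using (var gz = new GZipStream(destination, CompressionLevel.Optimal, leaveOpen: true))
            ms.CopyTo(gz);                             // large graph: compress
    }
}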

Related

How can I stream a massive object directly to S3, without a MemoryStream or local file?

I am trying to write a massive object to AWS S3 (e.g. 25 GB).
Currently I can get it working in two ways:
Write the content to a file on local disk, then send the file to S3 using multi-part upload
Write the content to a MemoryStream, then send that stream to S3 using multi-part upload
However, I don't like either approach, because I need to reserve a large amount of disk space or memory for the operation. I am generating this content in code, so I was hoping to open a stream to an S3 object, and generate the content directly to that object. But I can't see how to make that work.
Is it possible to build a massive object in S3 without representing the entire object in a local file or memory first?
(Note: My question is very similar to this question, but that question doesn't have a useful answer.)
I was able to get it working by breaking the overall payload into chunks, and sending each individual chunk as a separate MemoryStream.
Technically this solution still uses a MemoryStream, but that's OK, since I can control how much memory is used by adjusting the chunk size. For my test, I created a 25GB file while keeping memory usage well below that (~2 GB IIRC).
Here is my solution:
private const string BucketName = "YOUR-BUCKET-NAME-HERE";
private static readonly RegionEndpoint BucketRegion = RegionEndpoint.USEast1;
private const string Key = "massive-file-test";
// We're going to send 100 chunks of 256 MB each, for a total of 25 GB.
// The content will be the asterisk ("*") repeated for the desired size.
private const int ChunkSizeMb = 256;
private const int TotalSizeGb = 25;
public static void Main(string[] args)
{
Console.WriteLine($"Writing object to {BucketName}, {Key}");
int totalChunks = TotalSizeGb * 1024 / ChunkSizeMb;
int chunkSizeBytes = ChunkSizeMb * 1024 * 1024;
string payload = new String('*', chunkSizeBytes);
// Initiate the request.
InitiateMultipartUploadRequest initiateRequest = new InitiateMultipartUploadRequest
{
BucketName = BucketName,
Key = Key
};
List<UploadPartResponse> uploadResponses = new List<UploadPartResponse>();
IAmazonS3 s3Client = new AmazonS3Client(BucketRegion);
InitiateMultipartUploadResponse initResponse = s3Client.InitiateMultipartUpload(initiateRequest);
// Open a stream to build the input.
for (int i = 0; i < totalChunks; i++)
{
// Write the next chunk to the input stream.
Console.WriteLine($"Writing chunk {i} of {totalChunks}");
using (var stream = ToStream(payload))
{
// Write the next chunk to s3.
UploadPartRequest uploadRequest = new UploadPartRequest
{
BucketName = BucketName,
Key = Key,
UploadId = initResponse.UploadId,
PartNumber = i + 1,
PartSize = chunkSizeBytes,
InputStream = stream,
};
uploadResponses.Add(s3Client.UploadPart(uploadRequest));
}
}
// Complete the request.
CompleteMultipartUploadRequest completeRequest = new CompleteMultipartUploadRequest
{
BucketName = BucketName,
Key = Key,
UploadId = initResponse.UploadId
};
completeRequest.AddPartETags(uploadResponses);
s3Client.CompleteMultipartUpload(completeRequest);
Console.WriteLine("Script is complete. Press any key to exit...");
Console.ReadKey();
}
private static Stream ToStream(string s)
{
var stream = new MemoryStream();
var writer = new StreamWriter(stream);
writer.Write(s);
writer.Flush();
stream.Position = 0;
return stream;
}
Here is what AnonCoward started, finished off by adding seeking. Seeking is a trivial operation for a stream that does nothing except write asterisks into its buffer; if you were generating more complex data it would be harder work, but here all you need to do is set the position and report success, because no matter where you seek to in the stream, the behavior of producing asterisks is always the same.
class AsteriskGeneratingStream : Stream
{
    long _pos = 0;
    long _length = 0;

    public AsteriskGeneratingStream(long length)
    {
        _length = length;
    }

    public override long Length => _length;

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Create the data as needed
        if (count + _pos > _length)
            count = (int)(_length - _pos);
        for (int i = 0; i < count; i++)
            buffer[offset + i] = (byte)'*';
        _pos += count;
        return count;
    }

    public override bool CanRead => true;

    public override long Seek(long offset, SeekOrigin origin)
    {
        // Let's just trust that the caller will be sensible and not pass e.g. a negative offset
        if (origin == SeekOrigin.Begin)
            _pos = offset;
        else if (origin == SeekOrigin.Current)
            _pos += offset;
        else if (origin == SeekOrigin.End)
            _pos = _length + offset;
        return _pos;
    }

    public override bool CanSeek => true;
    public override bool CanWrite => false;
    public override long Position { get => _pos; set => _pos = value; }
    public override void Flush() { }
    public override void SetLength(long value) { _length = value; }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotImplementedException(); }
}
class Program
{
    static void Main(string[] args)
    {
        long objectSize = 25L * 1024 * 1024;
        var s3 = new AmazonS3Client(Amazon.RegionEndpoint.USWest1);
        var xfer = new TransferUtility(s3, new TransferUtilityConfig
        {
            MinSizeBeforePartUpload = 5L * 1024 * 1024
        });
        var helper = new AsteriskGeneratingStream(objectSize);
        xfer.Upload(helper, "bucket-name", "object-key");
    }
}
Note: I can't guarantee it'll work right off the bat, because I'm on a cellphone and can't test this via C# fiddle, but let's see how it blows up! 😀
If you can create the object on the fly, or at least cache fairly small segments, you can create a stream that serves the data up to S3. Note that unless you can also create any part of the object out of order, you need to prevent the AWS SDK from using a multi-part upload, which will slow down the transfer speed.
class DataStream : Stream
{
long _pos = 0;
long _length = 0;
public DataStream(long length)
{
_length = length;
}
public override long Length => _length;
public override int Read(byte[] buffer, int offset, int count)
{
// Create the data as needed, on demand
// For this example, just cycle through 0 to 256 in the data over and over again
if (count + _pos > _length)
{
count = (int)(_length - _pos);
}
for (int i = 0; i < count; i++)
{
buffer[i + offset] = (byte)((_pos + i) % 256);
}
_pos += count;
return count;
}
public override bool CanRead => true;
// Stub out all other methods. For a seekable stream
// Seek() and Postion need to be implemented, along with CanSeek changed
public override long Seek(long offset, SeekOrigin origin) { throw new NotImplementedException(); }
public override bool CanSeek => false;
public override bool CanWrite => false;
public override long Position { get => _pos; set => throw new NotImplementedException(); }
public override void Flush() { throw new NotImplementedException(); }
public override void SetLength(long value) { throw new NotImplementedException(); }
public override void Write(byte[] buffer, int offset, int count) { throw new NotImplementedException(); }
}
class Program
{
static void Main(string[] args)
{
long objectSize = 25L * 1024 * 1024;
var s3 = new AmazonS3Client(Amazon.RegionEndpoint.USWest1);
// Prevent a multi-part upload, which requires a seekable stream
var xfer = new TransferUtility(s3, new TransferUtilityConfig
{
MinSizeBeforePartUpload = objectSize + 1
});
var helper = new DataStream(objectSize);
xfer.Upload(helper, "bucket-name", "object-key");
}
}

Prevent memory leaks on reinitialise

I have a class that can open memory mapped files, read and write to it :
public class Memory
{
    protected bool _lock;
    protected Mutex _locker;
    protected MemoryMappedFile _descriptor;
    protected MemoryMappedViewAccessor _accessor;

    public void Open(string name, int size)
    {
        _descriptor = MemoryMappedFile.CreateOrOpen(name, size);
        _accessor = _descriptor.CreateViewAccessor(0, size, MemoryMappedFileAccess.ReadWrite);
        _locker = new Mutex(true, Guid.NewGuid().ToString("N"), out _lock);
    }

    public void Close()
    {
        _accessor.Dispose();
        _descriptor.Dispose();
        _locker.Close();
    }

    public Byte[] Read(int count, int index = 0, int position = 0)
    {
        Byte[] bytes = new Byte[count];
        _accessor.ReadArray<Byte>(position, bytes, index, count);
        return bytes;
    }

    public void Write(Byte[] data, int count, int index = 0, int position = 0)
    {
        _locker.WaitOne();
        _accessor.WriteArray<Byte>(position, data, index, count);
        _locker.ReleaseMutex();
    }
}
Usually I use it this way:
var data = new byte[5];
var m = new Memory();
m.Open("demo", data.Length);
m.Write(data, 5);
m.Close();
I would like to implement some kind of lazy loading for opening and want to open file only when I am ready to write there something, e.g. :
public void Write(string name, Byte[] data, int count, int index = 0, int position = 0)
{
_locker.WaitOne();
Open(name, sizeof(byte) * count); // Now I don't need to call Open() before the write
_accessor.WriteArray<Byte>(position, data, index, count);
_locker.ReleaseMutex();
}
Question: when I call the Write method several times (in a loop), it will cause member variables (like _locker) to be reinitialised. I would like to know: is it safe to do it this way, and can it cause memory leaks or unpredictable behavior with the mutex?
If you open in the write method using a lock, it's safe to close before you release the mutex.
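In other words, something along these lines could work (a sketch only; it assumes _locker is created in the constructor rather than in Open, and that Close no longer disposes the mutex):
public void Write(string name, Byte[] data, int count, int index = 0, int position = 0)
{
    _locker.WaitOne();
    try
    {
        Open(name, sizeof(byte) * count);                  // lazy open under the lock
        _accessor.WriteArray<Byte>(position, data, index, count);
    }
    finally
    {
        Close();                                           // close before the mutex is released
        _locker.ReleaseMutex();
    }
}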
When you are dealing with unmanaged resources and disposable objects, it's always better to implement the IDisposable interface correctly. Here is some more information.
Then you can initialise the Memory instance in a using clause
using (var m = new Memory())
{
// Your read write
}
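A minimal sketch of what that could look like on the Memory class (assuming the fields from the question; no finalizer is needed because the fields are themselves managed wrappers):
public class Memory : IDisposable
{
    // ... fields and methods from the question ...
    private bool _disposed;

    public void Dispose()
    {
        if (_disposed) return;
        if (_accessor != null) _accessor.Dispose();
        if (_descriptor != null) _descriptor.Dispose();
        if (_locker != null) _locker.Close();
        _disposed = true;
        GC.SuppressFinalize(this);
    }
}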

How to compute hash of a large file chunk?

I want to be able to compute the hashes of arbitrarily sized chunks of a file in C#.
e.g.: compute the hash of the 3rd gigabyte of a 4 GB file.
The main problem is that I don't want to load the entire file into memory, as there could be several files and the offsets could be quite arbitrary.
AFAIK, HashAlgorithm.ComputeHash allows me to use either a byte buffer or a stream. The stream would allow me to compute the hash efficiently, but for the entire file, not just for a specific chunk.
I was thinking of creating an alternative FileStream object and passing it to ComputeHash, where I would overload the FileStream methods and have it read only a certain chunk of a file.
Is there a better solution than this, preferably using the built-in C# libraries?
Thanks.
You should pass in either:
A byte array containing the chunk of data to compute the hash from
A stream that restricts access to the chunk you want to compute the hash from
The second option isn't all that hard; here's a quick LINQPad program I threw together. Note that it lacks quite a bit of error handling, such as checking that the chunk is actually available (i.e. that you're passing in a position and length of the stream that actually exist and don't fall off the end of the underlying stream).
Needless to say, if this should end up as production code I would add a lot of error handling, and write a bunch of unit-tests to ensure all edge-cases are handled correctly.
You would construct the PartialStream instance for your file like this:
const long gb = 1024 * 1024 * 1024;

using (var fileStream = new FileStream(@"d:\temp\too_long_file.bin", FileMode.Open))
using (var chunk = new PartialStream(fileStream, 2 * gb, 1 * gb))
{
    var hash = hashAlgorithm.ComputeHash(chunk);
}
Here's the LINQPad test program:
void Main()
{
var buffer = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();
using (var underlying = new MemoryStream(buffer))
using (var partialStream = new PartialStream(underlying, 64, 32))
{
var temp = new byte[1024]; // too much, ensure we don't read past window end
partialStream.Read(temp, 0, temp.Length);
temp.Dump();
// should output 64-95 and then 0's for the rest (64-95 = 32 bytes)
}
}
public class PartialStream : Stream
{
private readonly Stream _UnderlyingStream;
private readonly long _Position;
private readonly long _Length;
public PartialStream(Stream underlyingStream, long position, long length)
{
if (!underlyingStream.CanRead || !underlyingStream.CanSeek)
throw new ArgumentException("underlyingStream");
_UnderlyingStream = underlyingStream;
_Position = position;
_Length = length;
_UnderlyingStream.Position = position;
}
public override bool CanRead
{
get
{
return _UnderlyingStream.CanRead;
}
}
public override bool CanWrite
{
get
{
return false;
}
}
public override bool CanSeek
{
get
{
return true;
}
}
public override long Length
{
get
{
return _Length;
}
}
public override long Position
{
get
{
return _UnderlyingStream.Position - _Position;
}
set
{
_UnderlyingStream.Position = value + _Position;
}
}
public override void Flush()
{
throw new NotSupportedException();
}
public override long Seek(long offset, SeekOrigin origin)
{
switch (origin)
{
case SeekOrigin.Begin:
return _UnderlyingStream.Seek(_Position + offset, SeekOrigin.Begin) - _Position;
case SeekOrigin.End:
return _UnderlyingStream.Seek(_Position + _Length + offset, SeekOrigin.Begin) - _Position;
case SeekOrigin.Current:
return _UnderlyingStream.Seek(offset, SeekOrigin.Current) - _Position;
default:
throw new ArgumentException("origin");
}
}
public override void SetLength(long length)
{
throw new NotSupportedException();
}
public override int Read(byte[] buffer, int offset, int count)
{
long left = _Length - Position;
if (left < count)
count = (int)left;
return _UnderlyingStream.Read(buffer, offset, count);
}
public override void Write(byte[] buffer, int offset, int count)
{
throw new NotSupportedException();
}
}
You can use TransformBlock and TransformFinalBlock directly. That's pretty similar to what HashAlgorithm.ComputeHash does internally.
Something like:
using (var hashAlgorithm = new SHA256Managed())
using (var fileStream = File.OpenRead(...))
{
    fileStream.Position = ...;
    long bytesToHash = ...;
    var buf = new byte[4 * 1024];
    while (bytesToHash > 0)
    {
        var bytesRead = fileStream.Read(buf, 0, (int)Math.Min(bytesToHash, buf.Length));
        hashAlgorithm.TransformBlock(buf, 0, bytesRead, null, 0);
        bytesToHash -= bytesRead;
        if (bytesRead == 0)
            throw new InvalidOperationException("Unexpected end of stream");
    }
    hashAlgorithm.TransformFinalBlock(buf, 0, 0);
    var hash = hashAlgorithm.Hash;
    return hash;
}
Your suggestion - passing in a restricted access wrapper for your FileStream - is the cleanest solution. Your wrapper should defer everything to the wrapped Stream except the Length and Position properties.
How? Simply create a class that inherits from Stream. Make the constructor take:
Your source Stream (in your case, a FileStream)
The chunk start position
The chunk end position
As an extension, here is a list of all the Stream implementations that are available: http://msdn.microsoft.com/en-us/library/system.io.stream%28v=vs.100%29.aspx#inheritanceContinued
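A bare-bones skeleton of that wrapper might look like this (essentially the same idea as the PartialStream shown above; the name and members here are illustrative):
// Sketch: restrict reads to a window [start, start + length) of the wrapped stream.
public class ChunkStream : Stream
{
    private readonly Stream _inner;
    private readonly long _start;
    private readonly long _length;

    public ChunkStream(Stream inner, long start, long length)
    {
        _inner = inner;
        _start = start;
        _length = length;
        _inner.Position = start;
    }

    public override long Length { get { return _length; } }
    public override long Position
    {
        get { return _inner.Position - _start; }
        set { _inner.Position = value + _start; }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        long remaining = _length - Position;
        if (remaining <= 0) return 0;
        return _inner.Read(buffer, offset, (int)Math.Min(count, remaining));
    }

    // Everything else defers to the wrapped stream or is unsupported.
    public override bool CanRead { get { return _inner.CanRead; } }
    public override bool CanSeek { get { return _inner.CanSeek; } }
    public override bool CanWrite { get { return false; } }
    public override void Flush() { _inner.Flush(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}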
To easily compute the hash of a chunk of a larger stream, use these two methods:
HashAlgorithm.TransformBlock
HashAlgorithm.TransformFinalBlock
Here's a LINQPad program that demonstrates:
void Main()
{
const long gb = 1024 * 1024 * 1024;
using (var stream = new FileStream(@"d:\temp\largefile.bin", FileMode.Open))
{
stream.Position = 2 * gb; // 3rd gb-chunk
byte[] buffer = new byte[32768];
long amount = 1 * gb;
using (var hashAlgorithm = SHA1.Create())
{
while (amount > 0)
{
int bytesRead = stream.Read(buffer, 0,
(int)Math.Min(buffer.Length, amount));
if (bytesRead > 0)
{
amount -= bytesRead;
if (amount > 0)
hashAlgorithm.TransformBlock(buffer, 0, bytesRead,
buffer, 0);
else
hashAlgorithm.TransformFinalBlock(buffer, 0, bytesRead);
}
else
throw new InvalidOperationException();
}
hashAlgorithm.Hash.Dump();
}
}
}
To answer your original question ("Is there a better solution..."):
Not that I know of.
This seems to be a very special, non-trivial task, so a little extra work might be involved anyway. I think your approach of using a custom Stream-class goes in the right direction, I'd probably do exactly the same.
And Gusdor and xander have already provided very helpful information on how to implement that — good job guys!

Is there any way to calculate or get the serialization time for displaying in a ProgressBar?

I use C# .net 4.0 and don't see any possible way to do it, but maybe you know? :)
I do my serialization in that way:
public static void SaveCollection<T>(string file_name, T list)
{
BinaryFormatter bf = new BinaryFormatter();
FileStream fs = null;
try
{
fs = new FileStream(Application.StartupPath + "/" + file_name, FileMode.Create);
bf.Serialize(fs, list);
fs.Flush();
fs.Close();
}
catch (Exception exc)
{
if (fs != null)
fs.Close();
string msg = "Unable to save collection {0}\nThe error is {1}";
MessageBox.Show(Form1.ActiveForm, string.Format(msg, file_name, exc.Message));
}
}
So, let's say you actually know in advance the size of your object graph, which may itself be difficult, but let's just assume you do :). You could do this:
public class MyStream : MemoryStream {
public long bytesWritten = 0;
public override void Write(byte[] buffer, int offset, int count) {
base.Write(buffer, offset, count);
bytesWritten += count;
}
public override void WriteByte(byte value) {
bytesWritten += 1;
base.WriteByte(value);
}
}
Then you could use it like:
BinaryFormatter bf = new BinaryFormatter();
var s = new MyStream();
bf.Serialize(s, new DateTime[200]);
This will give you the bytes as they are written, so you could calculate the time using that. Note: it's possible you might need to override a few more methods of the stream class.
I don't believe there is. My advice would be to time how long it takes to serialize (repeating the measurement several hundred or thousand times), average them, and then use that as a constant for calculating serialization progress.
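For example, a rough sketch of that measurement (sampleList is a stand-in for a representative collection; the run count is arbitrary):
// Rough sketch: time a number of serializations up front, average them,
// and reuse the average as a constant when estimating progress later.
const int runs = 100;                                    // arbitrary sample size
var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < runs; i++)
{
    using (var ms = new MemoryStream())
        new BinaryFormatter().Serialize(ms, sampleList); // representative data
}
sw.Stop();
double averageMs = sw.Elapsed.TotalMilliseconds / runs;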
You could start a timer that runs at a specific frequency (say 4 times a second, but that really has no bearing on anything but how frequently you want to update progress) that calculates how long data currently has taken to transfer, then estimate the remaining time. For example:
private void timer1_Tick(object sender, EventArgs e)
{
int currentBytesTransferred = Thread.VolatileRead(ref this.bytesTransferred);
TimeSpan timeTaken = DateTime.Now - this.startDateTime;
var bps = currentBytesTransferred / timeTaken.TotalSeconds;
TimeSpan remaining = new TimeSpan(0, 0, 0, (int)((this.totalBytesToTransfer - currentBytesTransferred) / bps));
// TODO: update UI with remaining
}
This assumes you're updating this.bytesTransferred on another thread and you're targeting AnyCPU.
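For completeness, a sketch of the other half of that assumption, counting the bytes from the worker thread as they are written (the CountingStream helper is made up, not part of the question's code):
// Hypothetical helper: a pass-through stream that reports every write,
// so the worker thread can keep this.bytesTransferred up to date.
class CountingStream : MemoryStream
{
    private readonly Action<int> _onWrite;
    public CountingStream(Action<int> onWrite) { _onWrite = onWrite; }

    public override void Write(byte[] buffer, int offset, int count)
    {
        base.Write(buffer, offset, count);
        _onWrite(count);
    }
}

// On the worker thread:
// var target = new CountingStream(n => Interlocked.Add(ref this.bytesTransferred, n));
// new BinaryFormatter().Serialize(target, list);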

Save/load 2 XDocuments to/from one stream

I've got 2 XDocuments. One is some meta data, the other is a lot of data.
On the Xbox (XNA), I'd like to be able to save both to a file stream, meta data XDoc first, then the actual data XDoc.
I'd then like to be able to access just the meta data XDoc (ignoring the rest of the file stream), and also to be able to access the meta data XDoc and the data XDoc.
Currently i'm saving/loading as follows:
public void Serialise(Stream SaveStream, object Obj)
{
    XDocument XDoc = new XDocument(new XElement(@"SaveData", new XAttribute(@"Version", @"1.0"),
        GetXMLElement(Obj)));
    XDoc.Save(SaveStream);
}

public object Deserialise(Stream ObjectStream)
{
    XDocument XDoc = XDocument.Load(ObjectStream); // Error line
    switch (XDoc.Element(@"SaveData").Attribute(@"Version").Value)
    {
        case @"1.0":
            return GetObject(XDoc.Element(@"SaveData").FirstNode as XElement);
        default:
            throw new NotSupportedException("This save file version (" + XDoc.Element(@"SaveData").Attribute(@"Version").Value +
                ") is not supported, please upgrade your game.");
    }
}
To save the meta data followed by the actual data, I'm just calling Serialise twice on the same stream.
I get a file as below:
<?xml version="1.0" encoding="utf-8"?>
<SaveData Version="1.0">
....
</SaveData><?xml version="1.0" encoding="utf-8"?>
<SaveData Version="1.0">
....
</SaveData>
The problem comes when I try to read the first XDoc: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it. Line 18, position 14.
Any help would be greatly appreciated.
The XML declaration (<?xml version=....?>) can only appear once at the beginning of the document - that is the error you are seeing. You also need a root node, so you can't serialize two different documents together under one file in this manner. If you fixed the XML declaration, this would probably be the next error you get.
If you want to save and load the documents from one file, then you need to combine them into a single document for serialization/deserialization.
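For example, a sketch of combining the two documents under a single root before saving (element names here are made up, and metaDoc/dataDoc stand for the two XDocuments from the question):
// Sketch only: wrap both documents under one root so the file stays a single
// well-formed XML document. "Container", "Meta" and "Data" are invented names.
var combined = new XDocument(
    new XElement("Container",
        new XElement("Meta", metaDoc.Root),
        new XElement("Data", dataDoc.Root)));
combined.Save(saveStream);

// Reading just the metadata back still means parsing the whole file:
// var meta = XDocument.Load(loadStream).Root.Element("Meta").Element("SaveData");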
I ended up writing my own Stream, which can be thought of as a multi-stream. It allows you to treat one stream as multiple streams in succession: i.e. pass a MultiStream to an XML parser (or anything else) and it'll read up to a marker which says 'this is the end of the stream'. If you then pass that same stream to another XML parser, it'll read from that marker to the next one, or to EOF:
public class MultiStream : Stream
{
private readonly byte[] _RandomBytes = "410801dd-6f14-4d68-8e0e-29686d212cb2".Select(c => (byte)c).ToArray();
private Queue<byte> _RollingBytesRead;
private Stream _UnderlyingStream;
private bool UnderlyingEOF = false;
private bool EOFMarker = false;
private int BufferedBytesToRead = 0;
public MultiStream(Stream UnderlyingStream)
: base()
{
_UnderlyingStream = UnderlyingStream;
_RollingBytesRead = new Queue<byte>(_RandomBytes.Length);
}
public override bool CanRead
{
get { return !UnderlyingEOF || _UnderlyingStream.CanRead; }
}
public override bool CanSeek
{
get { return false; }
}
public override bool CanWrite
{
get { return _UnderlyingStream.CanWrite; }
}
public override void Flush()
{
_UnderlyingStream.Flush();
}
public override long Length
{
get { throw new NotSupportedException(); }
}
public override long Position
{
get
{
throw new NotSupportedException();
}
set
{
throw new NotSupportedException();
}
}
public override int ReadByte()
{
if (EOFMarker)
return -1;
// This should read the next byte from the underlying stream, check for the random bytes EOF marker, then return the next byte from the buffer
// If our buffer is smaller than the random bytes and we've not hit the EOF, then we need to fill it
while (!UnderlyingEOF && _RollingBytesRead.Count < _RandomBytes.Length)
{
int BytesRead = _UnderlyingStream.ReadByte();
if (BytesRead == -1)
{
UnderlyingEOF = true;
BufferedBytesToRead = _RollingBytesRead.Count;
}
else
{
_RollingBytesRead.Enqueue((byte)BytesRead);
}
}
if (EncounteredEndOfStreamBytes()) // Now check to see if the buffer matches our EOF marker
{
// If it does stop now, since we don't want to output any of the EOF marker.
BufferedBytesToRead = 0;
_RollingBytesRead.Clear();
EOFMarker = true;
return -1;
}
else if (UnderlyingEOF) // If we've already encountered the end of the underlying stream and have a buffer,
// then output the next byte since it's not part of the EOF marker, it's part of the stream
{
if (BufferedBytesToRead != 0)
{
BufferedBytesToRead--;
return _RollingBytesRead.Dequeue();
}
else
{
return -1;
}
}
else
{
int ByteRead = _UnderlyingStream.ReadByte();
if (ByteRead == -1)
{
// We've reached the end so we should output the buffer
UnderlyingEOF = true;
BufferedBytesToRead = _RollingBytesRead.Count;
// Recurse once just to avoid repeating code above
return ReadByte();
}
else
{
byte BufferedByte = _RollingBytesRead.Dequeue();
_RollingBytesRead.Enqueue((byte)ByteRead);
return BufferedByte;
}
}
}
public override int Read(byte[] buffer, int offset, int count)
{
// Fill 'buffer' starting at 'offset'; ReadByte() already handles the
// end-of-stream marker checks against the underlying stream
int bytesWritten = 0;
while (count > 0)
{
int ByteRead = ReadByte();
if (ByteRead == -1)
{
break;
}
buffer[offset + bytesWritten] = (byte)ByteRead;
bytesWritten++;
count--;
}
return bytesWritten;
}
private bool EncounteredEndOfStreamBytes()
{
if (_RollingBytesRead.Count != _RandomBytes.Length)
return false;
byte[] QueueArray = _RollingBytesRead.ToArray();
for (int i = 0; i < _RandomBytes.Length; i++)
{
if (_RandomBytes[i] != QueueArray[i])
return false;
}
return true;
}
public override long Seek(long offset, SeekOrigin origin)
{
throw new NotSupportedException();
}
public override void SetLength(long value)
{
throw new NotSupportedException();
}
public override void Write(byte[] buffer, int offset, int count)
{
_UnderlyingStream.Write(buffer, offset, count);
}
public void WriteStreamSeperator()
{
Write(_RandomBytes, 0, _RandomBytes.Length);
}
public void AdvanceToNextStream()
{
if (UnderlyingEOF)
throw new InvalidOperationException("No more streams");
// If we're not currently at an EOF marker, advance until we get to one.
while (!EOFMarker)
{
ReadByte();
}
EOFMarker = false;
_RollingBytesRead.Clear();
}
}
