C# MongoDb: insert entire collection from a stream

I have a process that archives MongoDb collections by getting an IAsyncCursor and writing the raw bytes out to an Azure Blob stream. This seems to be quite efficient and works. Here is the working code.
var cursor = await clientDb.GetCollection<RawBsonDocument>(collectionPath)
    .Find(new BsonDocument())
    .ToCursorAsync();
while (cursor.MoveNext())
{
    foreach (var document in cursor.Current)
    {
        var bytes = new byte[document.Slice.Length];
        document.Slice.GetBytes(0, bytes, 0, document.Slice.Length);
        blobStream.Write(bytes, 0, bytes.Length);
    }
}
However, in order to move this data from the archive back into MongoDb, the only approach I've found is to load the entire raw byte array into a MemoryStream and then call .InsertOneAsync(). This works fine for smaller collections, but for very large collections I get MongoDb errors. It also obviously isn't very memory efficient. Is there any way to stream raw byte data into MongoDb, or to use a cursor like I do on the read?
var rawRef = clientDb.GetCollection<RawBsonDocument>(collectionPath);
using (var ms = new MemoryStream())
{
    await stream.CopyToAsync(ms);
    var bytes = ms.ToArray();
    var rawBson = new RawBsonDocument(bytes);
    await rawRef.InsertOneAsync(rawBson);
}
Here is the error I get if the collection is too large.
MongoDB.Driver.MongoConnectionException : An exception occurred while sending a message to the server.
---- System.IO.IOException : Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
-------- System.Net.Sockets.SocketException : An established connection was aborted by the software in your host machine.

Instead of copying the whole stream into a byte array and parsing that into a single RawBsonDocument, you can deserialize the documents one by one, e.g.:
while (stream.Position < stream.Length)
{
    var rawBson = BsonSerializer.Deserialize<RawBsonDocument>(stream);
    await rawRef.InsertOneAsync(rawBson);
}
The stream is read one document at a time. The sample above inserts each document directly into the database. If you want to insert in batches, you can collect a reasonable number of documents in a list and use InsertManyAsync.
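A sketch of the batched variant, using the question's `rawRef` collection and `stream` input; the batch size of 1000 is an arbitrary choice of mine, to be tuned to your document sizes:

```csharp
// Requires the MongoDB.Driver / MongoDB.Bson NuGet packages.
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Bson.Serialization;

const int batchSize = 1000; // arbitrary; tune to your document sizes
var batch = new List<RawBsonDocument>(batchSize);
while (stream.Position < stream.Length)
{
    // Deserialize one document at a time, never the whole stream.
    batch.Add(BsonSerializer.Deserialize<RawBsonDocument>(stream));
    if (batch.Count == batchSize)
    {
        await rawRef.InsertManyAsync(batch);
        batch.Clear();
    }
}
if (batch.Count > 0)
    await rawRef.InsertManyAsync(batch); // flush the final partial batch
```

This keeps memory bounded at roughly one batch of documents while still reducing round trips compared to one InsertOneAsync per document.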

Related

Strange results from OpenReadAsync() when reading data from Azure Blob storage

I'm having a go at modifying an existing C# (.NET Core) app that reads a type of binary file to use Azure Blob Storage.
I'm using WindowsAzure.Storage (8.6.0).
The issue is that this app reads the binary data from files from a Stream in very small blocks (e.g. 5000-6000 bytes). This reflects how the data is structured.
Example pseudo code:
var blocks = new List<byte[]>();
var numberOfBytesToRead = 6240;
var numberOfBlocksToRead = 1700;
using (var stream = await blob.OpenReadAsync())
{
    stream.Seek(3000, SeekOrigin.Begin); // start reading at a particular position
    for (int i = 1; i <= numberOfBlocksToRead; i++)
    {
        byte[] traceValues = new byte[numberOfBytesToRead];
        stream.Read(traceValues, 0, numberOfBytesToRead);
        blocks.Add(traceValues);
    }
}
If I try to read a 10 MB file using OpenReadAsync(), I get invalid/junk values in the byte arrays after around 4,190,000 bytes.
If I set StreamMinimumReadSize to 100 MB it works.
If I read more data per block (e.g. 1 MB) it works.
Some of the files can be more than 100 MB, so setting StreamMinimumReadSize may not be the best solution.
What is going on here, and how can I fix this?
Are the invalid/junk values zeros? If so (and maybe even if not), check the return value from stream.Read. That method is not guaranteed to read the number of bytes you ask for; it may read fewer. In that case you are supposed to call it again in a loop until it has read the total amount you want. A quick web search will show you lots of examples of the necessary looping.
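A minimal sketch of such a loop (the names are mine, not from the original app):

```csharp
using System;
using System.IO;

// Demo: a MemoryStream standing in for the blob stream.
var demo = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
var block = new byte[5];
ReadExactly(demo, block, 5);
Console.WriteLine(block[4]); // 5

// Read exactly `count` bytes from `stream` into `buffer`, looping until
// done. Stream.Read may return fewer bytes than requested, so a single
// call is not enough; throw if the stream ends early instead.
static void ReadExactly(Stream stream, byte[] buffer, int count)
{
    int offset = 0;
    while (offset < count)
    {
        int read = stream.Read(buffer, offset, count - offset);
        if (read == 0)
            throw new EndOfStreamException(
                $"Stream ended after {offset} of {count} bytes.");
        offset += read;
    }
}
```

In the pseudocode above, replacing the single `stream.Read(traceValues, 0, numberOfBytesToRead)` call with `ReadExactly(stream, traceValues, numberOfBytesToRead)` would fill each block completely regardless of how the underlying stream chunks its reads.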

Principles behind FileStreaming

I've been working on a project recently that involves a lot of FileStreaming, something which I've not really touched on before.
To try and better acquaint myself with the principles of such methods, I've written some code that (theoretically) downloads a file from one dir to another, and gone through it step by step, commenting in my understanding of what each step achieves, like so...
Get fileinfo object from DownloadRequest Object
RemoteFileInfo fileInfo = svr.DownloadFile(request);
DownloadFile method in WCF Service
public RemoteFileInfo DownloadFile(DownloadRequest request)
{
    RemoteFileInfo result = new RemoteFileInfo(); // create empty fileinfo object
    try
    {
        // set filepath
        string filePath = System.IO.Path.Combine(request.FilePath, @"\", request.FileName);
        System.IO.FileInfo fileInfo = new System.IO.FileInfo(filePath); // get fileinfo from path
        // check if exists
        if (!fileInfo.Exists)
            throw new System.IO.FileNotFoundException("File not found",
                request.FileName);
        // open stream
        System.IO.FileStream stream = new System.IO.FileStream(filePath,
            System.IO.FileMode.Open, System.IO.FileAccess.Read);
        // return result
        result.FileName = request.FileName;
        result.Length = fileInfo.Length;
        result.FileByteStream = stream;
    }
    catch (Exception ex)
    {
        // do something
    }
    return result;
}
Use returned FileStream from fileinfo to read into a new write stream
// set new location for downloaded file
string basePath = System.IO.Path.Combine(@"C:\SST Software\DSC\Compilations\", compName, @"\");
string serverFileName = System.IO.Path.Combine(basePath, file);
double totalBytesRead = 0.0;
if (!Directory.Exists(basePath))
    Directory.CreateDirectory(basePath);
int chunkSize = 2048;
byte[] buffer = new byte[chunkSize];
// create new write file stream
using (System.IO.FileStream writeStream = new System.IO.FileStream(serverFileName, FileMode.OpenOrCreate, FileAccess.ReadWrite))
{
    do
    {
        // read bytes from fileinfo stream
        int bytesRead = fileInfo.FileByteStream.Read(buffer, 0, chunkSize);
        totalBytesRead += (double)bytesRead;
        if (bytesRead == 0) break;
        // write bytes to output stream
        writeStream.Write(buffer, 0, bytesRead);
    } while (true);
    // report end
    Console.WriteLine(fileInfo.FileName + " has been written to " + basePath + " - Done!");
    writeStream.Close();
}
What I was hoping for is any clarification or expansion on what exactly happens when using a FileStream.
I can achieve the download, and now I know what code I need to write in order to perform such a download, but I would like to know more about why it works. I can find no 'beginner-friendly' or step by step explanations on the web.
What is happening here behind the scenes?
A stream is just an abstraction; fundamentally it works like a pointer into a collection of data.
Take the string "Hello World!" for example: it is just a collection of characters, which are fundamentally just bytes.
As a stream, it could be represented as having:
A length of 12 (possibly more, including termination characters etc.)
A position in the stream.
You read a stream by moving the position around and requesting data.
So reading the text above could be (in pseudocode) seen to be like this:
do
get next byte
add gotten byte to collection
while not the end of the stream
the entire data is now in the collection
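The pseudocode above, as a minimal C# sketch over an in-memory stream:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// "Hello World!" as bytes, wrapped in a stream.
var stream = new MemoryStream(Encoding.ASCII.GetBytes("Hello World!"));

// Read one byte at a time until the end of the stream.
var collected = new List<byte>();
int next;
while ((next = stream.ReadByte()) != -1) // -1 signals end of stream
    collected.Add((byte)next);

// The entire data is now in the collection.
Console.WriteLine(Encoding.ASCII.GetString(collected.ToArray())); // Hello World!
```

Real code reads in larger chunks rather than byte by byte, but the position-advancing loop is the same idea.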
Streams are really useful when it comes to accessing data from sources such as the file system or remote machines.
Imagine a file that is several gigabytes in size, if the OS loaded all of that into memory any time a program wanted to read it (say a video player), there would be a lot of problems.
Instead, what happens is the program requests access to the file, and the OS returns a stream; the stream tells the program how much data there is, and allows it to access that data.
Depending on the implementation, the OS may load a certain amount of data into memory ahead of the program accessing it; this is known as a buffer.
Fundamentally though, the program just requests the next bit of data, and the OS either gets it from the buffer, or from the source (e.g. the file on disk).
The same principle applies to streams between different computers, except requesting the next bit of data may very well involve a trip to the remote machine to request it.
The .NET FileStream class and the Stream base class ultimately just defer to the Windows system APIs for working with streams; there's nothing particularly special about them: it's what you can do with the abstraction that makes them so powerful.
Writing to a stream is just the same, but in the other direction: it puts data into the buffer, ready for the requester to access.
Infinite Data
As a user pointed out, streams can be used for data of indeterminate length.
All stream operations take time, so reading a stream is typically a blocking operation that will wait until data is available.
So you could loop forever while the stream is still open, and just wait for data to come in - an example of this in practice would be a live video broadcast.
I've since located a book - C# 5.0 All-In-One For Dummies - It explains everything about all Stream classes, how they work, which one is most appropriate and more.
Only been reading about 30 minutes, already have such a better understanding. Excellent guide!

C# Networkstream BeginRead How to obtain buffer length/size?

I have a problem obtaining the right buffer size in my application.
From what I have read, the buffer size is normally declared before reading:
byte[] buffer = new byte[2000];
and then used to receive the result.
However, this method will stop once the received data contains '00', but my return data contains something like 5300000002000000EF0000000A00, and the length is not fixed; it can be this short or up to 400 bytes.
So the problem comes: if I define a fixed length like above, e.g. 2000, the return value is
5300000002000000EF0000000A000000000000000000000000000000000000000000000000000..........
making me unable to split the bytes into the correct amount.
Can anyone show me how to obtain the actual received data size from the NetworkStream, or any method/trick to get what I need?
Thanks in advance.
Network streams have no length.
Unfortunately, your question is light on detail, so it's hard to offer specific advice. But you have a couple of options:
If the high-level protocol being used here offers a way to know the length of the data that will be sent, use that. This could be as simple as the remote host sending the byte count before the rest of the data, or some command you could send to the remote host to query the length of the data. Without knowing what high-level protocol you're using, it's not possible to say whether this is even an option or not.
Write the incoming data into a MemoryStream object. This would always work, whether or not the high-level protocol offers a way to know in advance how much data to expect. Note that if it doesn't, then you will simply have to receive data until the end of the network stream.
The latter option looks something like this:
MemoryStream outputStream = new MemoryStream();
int readByteCount;
byte[] rgb = new byte[1024]; // can be any size
while ((readByteCount = inputStream.Read(rgb, 0, rgb.Length)) > 0)
{
outputStream.Write(rgb, 0, readByteCount);
}
return outputStream.ToArray();
This assumes you have a network stream named "inputStream".
I show the above mainly because it illustrates the more general practice of reading from a network stream in pieces and then storing the result elsewhere. Also, it is easily adapted to directly reading from a socket instance (you didn't mention what you're actually using for network I/O).
However, if you are actually using a Stream object for your network I/O, then as of .NET 4.0 there is a more convenient way to write the above:
MemoryStream outputStream = new MemoryStream();
inputStream.CopyTo(outputStream);
return outputStream.ToArray();
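For comparison, the first option (a length-prefixed protocol) might look like the sketch below. This assumes a hypothetical protocol in which the sender writes a 4-byte length prefix in the machine's native byte order before the payload; your actual protocol may do something different, so treat this as an illustration only:

```csharp
using System;
using System.IO;

// Demo: a MemoryStream standing in for the network stream, with the
// hypothetical sender writing a 4-byte length prefix then the payload.
byte[] payload = { 0x53, 0x00, 0x00, 0x00, 0x02 };
var wire = new MemoryStream();
wire.Write(BitConverter.GetBytes(payload.Length), 0, 4); // native byte order
wire.Write(payload, 0, payload.Length);
wire.Position = 0;

byte[] received = ReadLengthPrefixed(wire);
Console.WriteLine(received.Length); // 5

// Read a 4-byte length prefix, then exactly that many payload bytes.
static byte[] ReadLengthPrefixed(Stream stream)
{
    byte[] header = ReadExactly(stream, 4);
    int length = BitConverter.ToInt32(header, 0);
    return ReadExactly(stream, length);
}

// Read() can return short counts on a network stream, hence the loop.
static byte[] ReadExactly(Stream stream, int count)
{
    var buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int read = stream.Read(buffer, offset, count - offset);
        if (read == 0) throw new EndOfStreamException("Stream ended early.");
        offset += read;
    }
    return buffer;
}
```

With a prefix like this you know exactly how many bytes to receive, so there is no need to guess a buffer size or scan for terminators.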

Effective approach to writing a compressed XML document and tracking its current size

My application produces compressed (gzip) XML messages to transfer over HTTPS to a web server. It receives up to 10,000 data packets/sec from client apps through TCP/IP. The size of each packet can be from 500 to 2000 bytes. Each received packet should be added to a corresponding XML message in the collection based on the packet metadata. When the length of a particular message reaches a specified limit (1024 KB, for example), it should be saved to a file in a specified folder and then sent from there to the web server.
What is the correct/efficient approach to do this?
I am considering two ways:
1) GZipStream > FileStream
Write each packet on the fly directly to the corresponding file and watch its size. But I am not sure that such frequent writes are good, and I would like to keep the file-system workload as low as possible.
2) GZipStream > MemoryStream > File
Use memory as a buffer for messages. In this case I would create a MemoryStream for each message and keep it open while adding packets. But I don't know how to get the memory stream's current length without closing it.
Example:
I can only get ms.Length after the ms.Close() call.
var ms = new MemoryStream();
var gstream = new GZipStream(ms, CompressionMode.Compress);
var writer = XmlWriter.Create(gstream);
writer.WriteStartElement("x", "root", "123");
writer.WriteElementString("packet_ID", "some data");
writer.WriteEndElement();
writer.Close();
gstream.Close();
ms.Close();
Console.WriteLine("Memory size: " + ms.ToArray().Length);
Console.ReadLine();
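One way to watch the compressed size without closing the MemoryStream is the leaveOpen constructor overloads, sketched below. Note the hedge: the length observed after Flush() is only the bytes emitted so far (flushing behavior varies across framework versions), so treat it as a lower bound that becomes exact once the GZipStream is disposed.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

var ms = new MemoryStream();
// leaveOpen: true keeps ms usable after the gzip stream is disposed.
using (var gstream = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
using (var writer = XmlWriter.Create(gstream))
{
    writer.WriteStartElement("x", "root", "123");
    writer.WriteElementString("packet_ID", "some data");
    writer.WriteEndElement();
    writer.Flush();   // push buffered XML into the gzip stream
    gstream.Flush();  // push compressed bytes into ms (a lower bound)
    Console.WriteLine("Compressed so far: " + ms.Length);
}
// The gzip stream is now finalized; ms holds the complete message.
Console.WriteLine("Final size: " + ms.Length);
```

This avoids closing and re-creating streams per size check, at the cost of the intermediate length being approximate until disposal.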

How can I send and receive a large file over HTTP in C#

I am working on developing an HTTP server/client, and I can currently send small files over it, such as .txt files and other easy-to-read files that do not require much memory. However, when I want to send a larger file, say an .exe or a large .pdf, I get memory errors. These occur because, before I try to send or receive a file, I have to specify the size of my byte[] buffer. Is there a way to get the size of the buffer while reading it from the stream?
I want to do something like this:
//Create the stream.
private Stream dataStream = response.GetResponseStream();
//read bytes from stream into buffer.
byte[] byteArray = new byte[Convert.ToInt32(dataStream.Length)];
dataStream.Read(byteArray, 0, byteArray.Length);
However when calling "dataStream.Length" it throws the error:
ExceptionError: This stream does not support seek operations.
Can someone offer some advice as to how I can get the length of my byte[] from the stream?
Thanks,
You can use the stream's CopyTo method:
MemoryStream m = new MemoryStream();
dataStream.CopyTo(m);
byte[] byteArray = m.ToArray();
You can also write directly to a file:
var fs = File.Create("....");
dataStream.CopyTo(fs);
The network layer has no way of knowing how long the response stream is.
However, the server is supposed to tell you how long it is; look in the Content-Length response header.
If that header is missing or incorrect, you're out of luck; you'll need to keep reading until you run out of data.
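Putting both answers together, a sketch of a helper that pre-sizes the buffer when Content-Length is known and falls back to reading until end-of-stream when it is -1 (the helper name and its truncation check are my additions, not from the thread):

```csharp
using System;
using System.IO;

// Demo: a MemoryStream standing in for response.GetResponseStream(),
// with a known Content-Length of 100 bytes.
var body = new byte[100];
new Random(7).NextBytes(body);
byte[] bytes = ReadResponse(new MemoryStream(body), 100);
Console.WriteLine(bytes.Length); // 100

// Copy the response stream into memory. contentLength is -1 when the
// server did not send a Content-Length header.
static byte[] ReadResponse(Stream dataStream, long contentLength)
{
    var output = contentLength > 0
        ? new MemoryStream((int)contentLength) // pre-size when known
        : new MemoryStream();                  // otherwise grow as needed
    dataStream.CopyTo(output); // reads in chunks until end of stream
    if (contentLength >= 0 && output.Length != contentLength)
        throw new IOException("Response was truncated.");
    return output.ToArray();
}
```

With the question's objects this would be called as `ReadResponse(response.GetResponseStream(), response.ContentLength)`; HttpWebResponse.ContentLength returns -1 when the header is absent.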
