Stitching together multiple streams in one Stream class - C#

I want to make a class (let's call the class HugeStream) that takes an IEnumerable<Stream> in its constructor. This HugeStream should implement the Stream abstract class.
Basically, I have 1 to many pieces of UTF-8 streams coming from a DB that, when put together, make a gigantic XML document. The HugeStream needs to be file-backed so that I can seek back to position 0 of the whole stitched-together stream at any time.
Anyone know how to make a speedy implementation of this?
I saw something similar created at this page but it does not seem optimal for handling large numbers of large streams. Efficiency is the key.
On a side note, I'm having trouble visualizing Streams and am a little confused now that I need to implement my own. If anyone knows of a good tutorial on implementing the Stream class, please let me know; I haven't found any good articles browsing around, just lots of articles on using existing FileStreams and MemoryStreams. I'm a very visual learner and for some reason can't find anything useful to study the concept.
Thanks,
Matt

If you only read data sequentially from the HugeStream, then it simply needs to read each child stream (and append it into a local file as well as returning the read data to the caller) until the child-stream is exhausted, then move on to the next child-stream. If a Seek operation is used to jump "backwards" in the data, you must start reading from the local cache file; when you reach the end of the cache file, you must resume reading the current child stream where you left off.
So far, this is all pretty straightforward to implement - you just need to redirect the Read calls to the appropriate stream, and switch streams as each one runs out of data.
The inefficiency of the quoted article is that it runs through all the streams on every read to work out where to continue reading from. To improve on this, you need to open the child streams only as you need them, and keep track of the currently open stream so you can just keep reading more data from it until it is exhausted. Then open the next stream as your "current" stream and carry on. This is pretty straightforward, as you have a linear sequence of streams, so you just step through them one by one, i.e. something like:
int currentStreamIndex = 0;
Stream currentStream = childStreams[currentStreamIndex++];
...
public override int Read(byte[] buffer, int offset, int count)
{
    int totalBytesRead = 0;
    while (count > 0)
    {
        // Read what we can from the current stream.
        int numBytesRead = currentStream.Read(buffer, offset, count);
        totalBytesRead += numBytesRead;
        count -= numBytesRead;
        offset += numBytesRead;

        // A read of zero bytes means we have exhausted the current child stream
        // (Read may legally return fewer bytes than requested mid-stream).
        // Move on to the next stream and loop around to read more data.
        if (numBytesRead == 0)
        {
            // If we run out of child streams to read from, we're at the end of
            // the HugeStream, and there is no more data to read.
            if (currentStreamIndex >= numberOfChildStreams)
                break;

            // Otherwise, close the current child stream and open the next one.
            currentStream.Close();
            currentStream = childStreams[currentStreamIndex++];
        }
    }

    // Here, you'd append the data you've just read (into buffer) to your
    // local cache stream.
    return totalBytesRead;
}
To allow seeking backwards, you just have to introduce a local cache file stream that you copy all the data into as you read (see the comment in my pseudocode above). You also need some state so you know whether you are currently reading from the cache file or from the current child stream. Seeking within the cache is easy, because the cache represents the entire history of the data read from the HugeStream, so seek offsets are identical between the HugeStream and the cache; you simply redirect any Read calls to pull the data out of the cache stream.
If a read or seek takes you back to the end of the cache stream, you resume reading data from the current child stream: just go back to the logic above and continue appending data to your cache stream.
If you wish to be able to support full random access within the HugeStream you will need to support seeking "forwards" (beyond the current end of the cache stream). If you don't know the lengths of the child streams beforehand, you have no choice but to simply keep reading data into your cache until you reach the seek offset. If you know the sizes of all the streams, then you could seek directly and more efficiently to the right place, but you will then have to devise an efficient means for storing the data you read to the cache file and recording which parts of the cache file contain valid data and which have not actually been read from the DB yet - this is a bit more advanced.
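To illustrate, here is a minimal sketch of how a Seek override might redirect between the cache and the live child streams; cacheStream, position, readingFromCache and ReadAndCacheUntil are hypothetical names invented for this example, not part of the design above:
public override long Seek(long offset, SeekOrigin origin)
{
    // Translate the request into an absolute offset within the HugeStream.
    // SeekOrigin.End is omitted because the total length may not be known.
    long target = origin == SeekOrigin.Begin ? offset : position + offset;

    if (target <= cacheStream.Length)
    {
        // Everything up to the target has been read before; serve
        // subsequent Reads from the cache file.
        readingFromCache = true;
    }
    else
    {
        // Seeking forwards: keep reading child streams (appending to the
        // cache) until the cache covers the target offset.
        ReadAndCacheUntil(target);
        readingFromCache = false;
    }

    position = target;
    return position;
}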
I hope that makes sense to you and gives you a better idea of how to proceed...
(You shouldn't need to implement much more than the Read and Seek overrides to get this working.)

Related

How to duplicate a stream

Currently I have this code
SessionStream(Request.Content.ReadAsStreamAsync(), new { });
I need to somehow "mirror" the incoming stream so that I have two instances of it.
Something like the following pseudo code:
Task<Stream> stream = Request.Content.ReadAsStreamAsync();
SessionStream(stream, new { });
Stream theOtherStream;
stream.Result.CopyToAsync(theOtherStream);
TheOtherStream(theOtherStream, new { });
A technique that always works is to copy the stream to a MemoryStream and then use that.
Often, it is more efficient to just seek the original stream back to the beginning using the Seek method. That only works if this stream supports seeking.
If you do not want to buffer and cannot seek, you need to push the stream contents blockwise to the two consumers: read a block, write it twice.
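A minimal sketch of that block-wise push, assuming two writable target streams (source, consumer1 and consumer2 are placeholder names):
byte[] buffer = new byte[81920];
int bytesRead;
while ((bytesRead = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    // Write the same block to both consumers before reading the next one;
    // the slower consumer paces the whole pipeline.
    await consumer1.WriteAsync(buffer, 0, bytesRead);
    await consumer2.WriteAsync(buffer, 0, bytesRead);
}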
If in addition you need a pull model (i.e. hand a readable stream to some component) it gets really hard and threading is involved. You'd need to write a push-to-pull adapter which is always hard.
usr's answer is still correct in 2020, but for those wondering why it is not trivial, here is a short explanation.
The idea behind streams is that writing to a stream and reading from it are independent. Usually the process of reading is much faster than writing (think about receiving data over the network: you can read the data as soon as it arrives), so usually the reader waits for a new portion of data, processes it as soon as it arrives, drops it to free the memory, and then waits for the next portion.
This allows processing potentially infinite data stream (for example, application log stream) without using much RAM.
Suppose now we have 2 readers (as required by the question). A portion of data arrives, and then we have to wait for both readers to read it before we can drop it, which means it must be stored in memory until both readers are done with it. The problem is that the readers can process the data at very different speeds: one might write it to a file while the other just counts symbols in memory. In that case either the fast one has to wait for the slow one before reading further, or we need to save the data to an in-memory buffer and let the readers read from it. In the worst case we end up with a full copy of the input stream in memory, basically creating a MemoryStream instance.
To implement the first option, you would have to implement a stream reader that knows which of your two consumers is faster and, taking that into account, distributes and drops the data accordingly.
If you are sure you have enough memory, and processing speed is not critical, just use the memory stream:
using var memStream = new MemoryStream();
await incomingStream.CopyToAsync(memStream);
// CopyToAsync leaves the position at the end, so rewind before each use.
memStream.Seek(0, SeekOrigin.Begin);
UseTheStreamForTheFirstTime(memStream);
memStream.Seek(0, SeekOrigin.Begin);
UseTheStreamAnotherTime(memStream);

WinRT - Read from IInputStream one byte at a time until a specific byte encountered

I have an IInputStream that I want to read data from until I encounter a certain byte, at which point I will pass the IInputStream to some other object to consume the rest of the stream.
This is what I've come up with:
public async Task HandleInputStream(IInputStream instream)
{
    using (var dataReader = new DataReader(instream))
    {
        byte b;
        do
        {
            await dataReader.LoadAsync(1);
            b = dataReader.ReadByte();
            // Do something with the byte
        } while (b != <some condition>);
        dataReader.DetachStream();
    }
}
It seems like calling LoadAsync for one byte at a time will be horribly slow. My dilemma is that if I pick a buffer size (like 1024) and load that much, and my value shows up 10 bytes in, then this method will have already consumed the next 1014 bytes of data and will have to pass them along to the next method for processing.
Is there a better way to approach this, or is this an acceptable solution?
If the value you're looking for is not too far from the beginning of the stream, this kind of reading shouldn't be that slow. How far into the stream are you expecting it? Have you tested the performance?
Depending on the type of stream you are using, you might be able to use other approaches:
If it supports seeking backwards (e.g. you're reading from a file), you could read larger chunks at once, as long as you keep track of the offset at which you found your value. You can then seek the stream back to that position before you hand it off.
If that's not possible you could create another intermediate memory stream into which you would copy the remaining part of the buffer you have already read, followed by the rest of the stream. This works even if you can't seek backwards. The only problem might be memory consumption if the stream is too large.
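For illustration, a rough sketch of that second approach, assuming the IInputStream has been wrapped via AsStreamForRead() (the System.IO extension for WinRT streams) and that matchIndex and bytesRead came out of the chunked search; all names here are hypothetical:
Stream source = instream.AsStreamForRead();
// ... read a chunk into buffer, find the delimiter at matchIndex ...

var rest = new MemoryStream();
// First write the leftover bytes that were read past the delimiter...
rest.Write(buffer, matchIndex + 1, bytesRead - (matchIndex + 1));
// ...then copy everything that has not been read yet. Note this buffers
// the whole remainder in memory, so it only suits smaller streams.
await source.CopyToAsync(rest);
rest.Seek(0, SeekOrigin.Begin);
// Hand `rest` to the component that consumes the remainder of the data.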

Something about Stream

I've been working on something that makes use of streams, and I found that I'm not clear on some stream concepts (you can also see another question I posted, About redirected stdout in System.Diagnostics.Process).
1. How do you indicate that you have finished writing to a stream - do you write something like an EOF?
2. Following on from the previous question: if I have written an EOF (or something like it) to a stream but didn't close the stream, and I then want to write something else to the same stream, can I just start writing to it with no more setup required?
3. If a procedure tries to read a stream (like stdin) that no one has written anything to, the read blocks. Eventually some data arrives and the procedure reads until the writing is done, which is indicated by a return of 0 bytes read rather than by blocking. Now, if the procedure issues another read on the same stream, it still gets a count of 0 and returns immediately, while I was expecting it to block, since no one is writing to the stream now. So does a stream hold different states between "opened but nothing written yet" and "someone has finished a writing session"?
I'm using Windows and the .NET Framework, if anything here is platform-specific.
Thanks a lot!
This depends on the concrete stream. For example, reading from a MemoryStream would not block as you describe. A MemoryStream has an explicit size, and as you read from the stream the position advances through it until you reach the end, at which point Read will return 0. If there was no data in the MemoryStream, the first Read would immediately return 0.
What you describe fits a NetworkStream, in which case reading from the stream will block until data becomes available; when the "server" side closes the underlying Socket wrapped by the NetworkStream, the Read will return 0.
So the actual details depend on the stream, but at a high level they are all treated the same, i.e. you can read from a stream until Read returns 0.
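That contract is what the standard read loop relies on; for example:
byte[] buffer = new byte[4096];
int bytesRead;
// Works for any Stream: a return value of 0 always means end of stream.
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    // Process buffer[0 .. bytesRead - 1] here.
}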
There is no "EOF" with streams. You write to a stream until you close it, which prevents it from being written further.
Streams just read and write bytes. That's all.

My custom XML reader is a two-legged turtle. Suggestions?

I wrote a custom XML reader because I needed something that would not read ahead from the source stream. I wanted the ability to have an object read its data from the stream without negatively affecting the stream for the parent object. That way, the stream can be passed down the object tree.
It's a minimal implementation, meant only to serve the purpose of the project that uses it (right now). It works well enough, except for one method -- ReadString. That method is used to read the current element's content as a string, stopping when the end element is reached. It determines this by counting nesting levels. Meanwhile, it's reading from the stream, character by character, adding to a StringBuilder for the resulting string.
For a collection element, this can take a long time. I'm sure there is much that can be done to better implement this, so this is where my continuing education begins once again. I could really use some help/guidance. Some notes about methods it calls:
Read - returns the next byte in the stream or -1.
ReadUntilChar - calls Read until the specified character or -1 is reached, appending to a string with StringBuilder.
Without further ado, here is my two-legged turtle. Constants have been replaced with the actual values.
public string ReadString() {
    int level = 0;
    long originalPosition = m_stream.Position;
    StringBuilder sb = new StringBuilder();
    int read; // Read() returns the next byte or -1, so this needs to be an int, not an sbyte.
    try {
        // We are already within the element that contains the string.
        // Read until we reach an end element when the level == 0.
        // We want to leave the reader positioned at the end element.
        do {
            sb.Append(ReadUntilChar('<'));
            if ((read = Read()) == '/') {
                // End element
                if (level == 0) {
                    // End element for the element in context; the string is complete.
                    // Back up over the two bytes of the end element just read.
                    m_stream.Seek(-2, System.IO.SeekOrigin.Current);
                    break;
                } else {
                    // End element for a child element.
                    // Add the two bytes read to the resulting string and continue.
                    sb.Append('<');
                    sb.Append('/');
                    level--;
                }
            } else {
                // Start element
                level++;
                sb.Append('<');
                sb.Append((char)read);
            }
        } while (read != -1);
        return sb.ToString().Trim();
    } catch {
        // Return to the original position that we started at.
        m_stream.Seek(originalPosition - m_stream.Position, System.IO.SeekOrigin.Current);
        throw;
    }
}
Right off the bat, you should be using a profiler for performance optimization if you haven't already (I'd recommend SlimTune if you're on a budget). Without one you're just taking slightly-educated stabs in the dark.
Once you've profiled the parser you should have a good idea of where the ReadString() method is spending all its time, which will make your optimizing much easier.
One suggestion I'd make at the algorithm level is to scan the stream first, and then build the contents out: instead of consuming each character as you see it, mark where you find the <, >, and </ markers. Once you have those positions you can pull the data out of the stream in blocks rather than throwing characters into a StringBuilder one at a time. This will optimize away a significant number of StringBuilder.Append calls, which may increase your performance (this is where profiling would help).
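As a rough sketch of that idea (my own illustration, assuming the underlying stream is read in buffered blocks; not a drop-in replacement for the posted method):
// Read a whole block, then append runs of text between markup characters
// in one call instead of one character at a time.
byte[] block = new byte[4096];
int blockLen = m_stream.Read(block, 0, block.Length);
int runStart = 0;
for (int i = 0; i < blockLen; i++)
{
    if (block[i] == '<')
    {
        // Append everything since the last marker in a single call.
        sb.Append(Encoding.UTF8.GetString(block, runStart, i - runStart));
        runStart = i;
        // ... apply the start/end element and nesting-level logic here ...
    }
}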
You may find this analysis useful for optimizing string operations, if they prove to be the source of the slowness.
But really, profile.
Your implementation assumes the Stream is seekable. If it is known to be seekable, why do anything special? Just create an XmlReader at your position, consume the data, ditch the reader, and seek the Stream back to where you started.
How large is the XML? You may find that throwing the data into a DOM (XmlDocument / XDocument / etc.) is a viable way of getting a reader that does what you need without requiring lots of rework. In the case of XmlDocument, an XmlNodeReader would suffice, for example (it would also provide XPath support if you want to use non-trivial queries).
I wrote a custom XML reader because I needed something that would not read ahead from the source stream. I wanted the ability to have an object read its data from the stream without negatively affecting the stream for the parent object. That way, the stream can be passed down the object tree.
That sounds more like a job for XmlReader.ReadSubTree(), which lets you create a new XmlReader to pass to another object to initialise itself from the reader without it being able to read beyond the bounds of the current element.
The ReadSubtree method is not intended to create a copy of the XML data that you can work with independently. Rather, it can be used to create a boundary around an XML element. This is useful if you need to pass data to another component for processing and you wish to limit how much of your data the component can access. When you pass an XmlReader returned by the ReadSubtree method to another application, the application can access only that XML element, rather than the entire XML document.
It does say that after reading the subtree the parent reader is re-positioned to the "EndElement" of the current element rather than remaining at the beginning, but is that likely to be a problem?
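For example (a sketch; reader and child are placeholder names):
using (XmlReader subtree = reader.ReadSubtree())
{
    // The child object can consume as much as it likes, but the subtree
    // reader cannot move beyond the bounds of the current element.
    child.ReadFrom(subtree);
}
// Afterwards, `reader` is positioned on the EndElement of the element
// that was wrapped, not back at its start.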
Why not use an existing one, like this one?

What is a good method to handle line based network I/O streams?

Note: Let me apologize for the length of this question; I had to put a lot of information into it. I hope that doesn't cause too many people to simply skim it and make assumptions. Please read it in its entirety. Thanks.
I have a stream of data coming in over a socket. This data is line oriented.
I am using the APM (Asynchronous Programming Model) of .NET (BeginRead, etc.). This precludes using stream-based I/O because async I/O is buffer-based. It is possible to repackage the data and send it to a stream, such as a MemoryStream, but there are issues there as well.
The problem is that my input stream (which I have no control over) doesn't give me any information about how long the stream is. It is simply a stream of newline-terminated lines looking like this:
COMMAND\n
...Unpredictable number of lines of data...\n
END COMMAND\n
....repeat....
So, using APM, and since I don't know how long any given data set will be, it is likely that blocks of data will cross buffer boundaries, requiring multiple reads, and those multiple reads will also span multiple blocks of data.
Example:
Byte buffer[1024] = ".................blah\nThis is another l"
[another read]
"ine\n.............................More Lines..."
My first thought was to use a StringBuilder and simply append the buffer contents to it. This works to some extent, but I found it difficult to extract blocks of data. I tried using a StringReader to read the newline-terminated data, but there was no way to know whether you were getting a complete line or not: StringReader returns a partial line at the end of the last block added, followed by null afterwards. There isn't a way to know if what was returned was a full newline-terminated line of data.
Example:
// Note: no newline at the end
StringBuilder sb = new StringBuilder("This is a line\nThis is incomp..");
// StringReader takes a string, not a StringBuilder.
StringReader sr = new StringReader(sb.ToString());
string s = sr.ReadLine(); // returns "This is a line"
s = sr.ReadLine();        // returns "This is incomp.."
What's worse is that if I just keep appending data, the buffers get bigger and bigger, and since this could run for weeks or months at a time, that's not a good solution.
My next thought was to remove blocks of data from the StringBuilder as I read them. This required writing my own ReadLine function, but then I'm stuck locking the data during reads and writes. Also, the larger blocks of data (which can consist of hundreds of reads and megabytes of data) require scanning the entire buffer looking for newlines. It's not efficient and pretty ugly.
I'm looking for something that has the simplicity of a StreamReader/Writer with the convenience of async I/O.
My next thought was to use a MemoryStream: write the blocks of data to the memory stream, then attach a StreamReader to it and use ReadLine. But again, I have the issue of knowing whether the last read in the buffer is a complete line or not, plus it's even harder to remove the "stale" data from the stream.
I also thought about using a thread with synchronous reads. This has the advantage that, using a StreamReader, ReadLine() will always return a full line, except in broken-connection situations. However, this has issues with canceling the connection, and certain kinds of network problems can result in hung, blocking sockets for an extended period of time. I'm using async I/O because I don't want to tie up a thread for the life of the program blocking on data receive.
The connection is long-lasting, and data will continue to flow over time. During the initial connection there is a large flow of data, and once that flow is done the socket remains open waiting for real-time updates. I don't know precisely when the initial flow has "finished", since the only way to know is that no more data is sent right away. This means I can't wait for the initial data load to finish before processing; I'm pretty much stuck processing "in real time" as it comes in.
So, can anyone suggest a good method to handle this situation in a way that isn't overly complicated? I really want this to be as simple and elegant as possible, but I keep coming up with more and more complicated solutions due to all the edge cases. I guess what I want is some kind of FIFO in which I can easily keep appending more data while at the same time popping data out of it that matches certain criteria (i.e., newline-terminated strings).
That's quite an interesting question. The solution for me in the past has been to use a separate thread with synchronous operations, as you propose. (I managed to get around most of the problems with blocking sockets using locks and lots of exception handlers.) Still, using the in-built asynchronous operations is typically advisable as it allows for true OS-level async I/O, so I understand your point.
Well I've gone and written a class for accomplishing what I believe you need (in a relatively clean manner I would say). Let me know what you think.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class AsyncStreamProcessor : IDisposable
{
    protected StringBuilder _buffer;  // Buffer for unprocessed data.
    private bool _isDisposed = false; // True if object has been disposed.

    public AsyncStreamProcessor()
    {
        _buffer = null;
    }

    public IEnumerable<string> Process(byte[] newData)
    {
        // Note: replace the following encoding method with whatever you are reading.
        // The trick here is to add an extra line break to the new data so that the
        // algorithm recognises a single line break at the end of the new data.
        using (var newDataReader = new StringReader(Encoding.ASCII.GetString(newData) + Environment.NewLine))
        {
            // Read all lines from the new data, returning all but the last.
            // The last line is guaranteed to be incomplete (or possibly complete
            // except for the line break, which will be processed with the next
            // packet of data).
            string line, prevLine = null;
            while ((line = newDataReader.ReadLine()) != null)
            {
                if (prevLine != null)
                {
                    yield return (_buffer == null ? string.Empty : _buffer.ToString()) + prevLine;
                    _buffer = null;
                }
                prevLine = line;
            }

            // Store the last (incomplete) line in the buffer.
            if (_buffer == null)
                // Note: the (* 2) gives a prediction of the length of the incomplete
                // line, so that the buffer does not have to be expanded in most/all
                // situations. Change it to whatever seems appropriate.
                _buffer = new StringBuilder(prevLine, prevLine.Length * 2);
            else
                _buffer.Append(prevLine);
        }
    }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    private void Dispose(bool disposing)
    {
        if (!_isDisposed)
        {
            if (disposing)
            {
                // Dispose managed resources.
                _buffer = null;
                GC.Collect();
            }
            // Dispose native resources.

            // Remember that the object has been disposed.
            _isDisposed = true;
        }
    }
}
An instance of this class should be created for each NetworkStream and the Process function should be called whenever new data is received (in the callback method for BeginRead, before you call the next BeginRead I would imagine).
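For example, the read callback might look something like this (a sketch; _networkStream, _readBuffer, _processor and HandleLine are assumed names):
private void OnDataReceived(IAsyncResult ar)
{
    int bytesRead = _networkStream.EndRead(ar);

    // Trim the buffer to the bytes actually received before processing.
    byte[] data = new byte[bytesRead];
    Array.Copy(_readBuffer, data, bytesRead);

    foreach (string line in _processor.Process(data))
        HandleLine(line);

    // Only start the next read once processing is done, since the
    // processor is not thread-safe.
    _networkStream.BeginRead(_readBuffer, 0, _readBuffer.Length, OnDataReceived, null);
}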
Note: I have only verified this code with test data, not actual data transmitted over the network. However, I wouldn't anticipate any differences...
Also, a warning that the class is of course not thread-safe, but as long as BeginRead isn't executed again until after the current data has been processed (as I presume you are doing), there shouldn't be any problems.
Hope this works for you. Let me know if there are remaining issues and I will try to modify the solution to deal with them. (There could well be some subtlety of the question I missed, despite reading it carefully!)
What you're explaining in your question reminds me very much of ASCIZ strings. That may be a helpful start.
I had to write something similar to this in college for a project I was working on. Unfortunately, I had control over the sending socket, so I inserted a message-length field as part of the protocol. However, I think a similar approach may benefit you.
How I approached my solution: I would send something like 5HELLO, so first I'd see 5 and know I had a message length of 5, and therefore the message I needed was 5 characters. However, if on my async read I only got 5HE, I would see that I have a message length of 5, but I was only able to read 3 bytes off the wire (let's assume ASCII characters). Because of this, I knew I was missing some bytes and stored what I had in a fragment buffer. I had one fragment buffer per socket, thereby avoiding any synchronization problems. The rough process is:
1. Read from the socket into a byte array, recording how many bytes were read.
2. Scan through byte by byte until you find a newline character (this becomes very complex if you're receiving not ASCII characters but characters that could be multiple bytes; you're on your own for that).
3. Turn your frag buffer into a string and append your read buffer, up to the newline, to it. Drop this string as a completed message onto a queue or its own delegate to be processed. (You can optimize these buffers by having your read socket write into the same byte array as your fragment, but that's harder to explain.)
4. Continue looping through; every time you find a newline, create a string from the byte array between the recorded start/end positions and drop it on the queue/delegate for processing.
5. Once you hit the end of the read buffer, copy anything that's left into the frag buffer.
6. Call BeginRead on the socket, which will jump back to step 1 when data is available.
Then you use another thread to read your queue of incoming messages, or just let the thread pool handle it using delegates, and do whatever data processing you have to do. Someone will correct me if I'm wrong, but there are very few thread-synchronization issues with this, since you can only be reading or waiting to read from the socket at any one time, so there's no worry about locks (except if you're populating a queue; I used delegates in my implementation). There are a few details you will need to work out on your own, like how big a frag buffer to leave: if you receive no newlines on a read, the entire message must be appended to the fragment buffer without overwriting anything. I think it ran me about 700-800 lines of code in the end, but that included the connection setup, negotiation for encryption, and a few other things.
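A compressed sketch of steps 2-5 (my own illustration, not the original college code; _frag is the per-socket fragment buffer described above):
private byte[] _frag = new byte[0];

private IEnumerable<string> ExtractLines(byte[] readBuffer, int bytesRead)
{
    int start = 0;
    for (int i = 0; i < bytesRead; i++)
    {
        if (readBuffer[i] == (byte)'\n')
        {
            // Completed message: any leftover fragment plus the bytes
            // read up to (but not including) this newline.
            yield return Encoding.ASCII.GetString(_frag)
                       + Encoding.ASCII.GetString(readBuffer, start, i - start);
            _frag = new byte[0];
            start = i + 1;
        }
    }

    // Copy whatever is left (an incomplete message) into the frag buffer.
    byte[] rest = new byte[_frag.Length + (bytesRead - start)];
    Buffer.BlockCopy(_frag, 0, rest, 0, _frag.Length);
    Buffer.BlockCopy(readBuffer, start, rest, _frag.Length, bytesRead - start);
    _frag = rest;
}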
This setup performed very well for me: I was able to push up to 80 Mbps on a 100 Mbps Ethernet LAN using this implementation on a 1.8 GHz Opteron, including encryption processing. And since you're tied to the socket, the server will scale, as multiple sockets can be worked on at the same time. If you need items processed in order, you'll need to use a queue, but if order doesn't matter, then delegates will give you very scalable performance out of the thread pool.
Hope this helps. It's not meant to be a complete solution, but a direction in which to start looking.
*Just a note: my implementation was done purely at the byte level and supported encryption; I used characters in my example to make it easier to visualize.
