I am working on parsing a fairly complicated data stream generated by a usb device that emulates a keyboard. The easiest way for me to conceptualize and deal with the data would be if I had a method called something like GetNextInputCharacter, and I could do all the parsing in one go without having to do it one piece at a time as I receive input. Unfortunately I won't know how many bytes to expect in the stream until I have parsed into it significantly, and I would rather not wait until the end to parse it anyway.
Is there any mechanism or design pattern that can take asynchronous input from the key event and pass it to my parse method on demand? All I can think of is an IEnumerable that does a busy wait on a FIFO that the key event populates and yields them out one at a time. That seems like a bit of a hack, but maybe it would work. I just want a way for the parse routine to pretend like the input is already there and take it without knowing that it has to wait for the events.
How about parsing a Stream, and making the parser block until it has enough characters to make a sensible result? Then the async data from the USB device can just write to the stream. You'd probably have to write your own Stream implementation, but that isn't too hard.
This is a common enough pattern, by the way: when you use the built-in .NET serialization, the deserializers block on reading an input stream, which may be coming over a network socket.
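A minimal sketch of that idea (illustrative, not from any library): a read-only Stream backed by a BlockingCollection&lt;byte&gt;, where Read blocks until the device event handler has supplied data.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;

// Sketch: a read-only Stream whose Read blocks until the producer
// (e.g. the USB key-event handler) has written enough bytes.
public class BlockingReadStream : Stream
{
    private readonly BlockingCollection<byte> _buffer = new BlockingCollection<byte>();

    // Called from the async event handler.
    public void Write(byte b) => _buffer.Add(b);

    // Call when the device signals end-of-input.
    public void CompleteWriting() => _buffer.CompleteAdding();

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read = 0;
        while (read < count)
        {
            byte b;
            // Block for the first byte; afterwards, take only what's
            // already available so Read returns promptly with some data.
            if (!_buffer.TryTake(out b, read == 0 ? -1 : 0))
                break; // no more data right now, or writing completed
            buffer[offset + read++] = b;
        }
        return read;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

The parse routine can then read from this stream as if all the input were already there, while the key-event handler feeds bytes in from another thread.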
How about something like this:
...
var stream = //Set up stream
var data = from dataStream in StreamStuff(stream) select dataStream;
...
private IEnumerable<string> StreamStuff(Stream stream)
{
    var buffer = new byte[1024];
    int bytesRead;
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Do some stuff with your read bytes
        var yourProcessedData = Encoding.UTF8.GetString(buffer, 0, bytesRead);
        yield return yourProcessedData;
    }
    stream.Close();
}
I need to check a NamedPipeClientStream to see if there are bytes for it to read before I attempt to read it. The reason for this is that the thread stops on any read operation if there's nothing to read, and I simply cannot have that. I must be able to continue even if there are no bytes to read.
I've also tried wrapping it in a StreamReader, which I've seen suggested, but that has the same result.
StreamReader sr = new StreamReader(myPipe);
string temp;
while ((temp = sr.ReadLine()) != null) // Thread blocks in ReadLine
{
    Console.WriteLine("Received from server: {0}", temp);
}
I either need for the read operations to not wait until there are bytes to read, or a way to check if there are bytes to read before attempting the read operations.
PipeStream does not support the Length, Position or ReadTimeout properties, or Seek...
This is a very bad pattern. Structure your code so that there's a reading thread that always tries to read until the stream has ended. Then, make your threads communicate to achieve the logic and control flow you want.
It is generally not possible to check whether an arbitrary Stream has data available. I think it's possible with named pipes. But even if you do that you need to ensure that incoming bytes will be read in a timely manner. There is no event for that. Even if you manage all of this the code will be quite nasty. It will not be easy to mentally verify.
For that reason, simply keep a reading loop alive. You could make that reading loop enqueue the data into a queue (maybe BlockingCollection). Then other threads can check that queue for data or wait for data to arrive. The stream will always be drained correctly. You can signal the stream end by enqueueing null.
When I say "thread" I mean any primitive that gives you the appearance of a thread. These days you would never use Thread. Rather, use async/await or Task.
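A sketch of that structure, with illustrative names: a background task drains the pipe into a BlockingCollection&lt;byte[]&gt;, and a null chunk marks end-of-stream as suggested above.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class PipeReaderLoop
{
    // Drains the stream on a dedicated task; a null chunk signals end-of-stream.
    public static BlockingCollection<byte[]> StartReadLoop(Stream stream)
    {
        var queue = new BlockingCollection<byte[]>();
        Task.Run(async () =>
        {
            var buffer = new byte[4096];
            int n;
            while ((n = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                var chunk = new byte[n];
                Array.Copy(buffer, chunk, n);
                queue.Add(chunk);
            }
            queue.Add(null); // end-of-stream marker
        });
        return queue;
    }
}

// Consumer side: either poll without blocking, or block until data arrives.
// byte[] chunk;
// if (queue.TryTake(out chunk)) { /* data was available */ }
// chunk = queue.Take(); // blocking wait
```

The stream is always drained correctly, and other threads get the non-blocking check the question asked for via TryTake.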
I have a service that takes an input Stream containing CSV data that needs to be bulk-inserted into a database, and my application is using async/await wherever possible.
The process is: Parse stream using CsvHelper's CsvParser, add each row to DataTable, use SqlBulkCopy to copy the DataTable to the database.
The data could be any size, so I'd like to avoid reading the whole thing into memory at one time - obviously I'll have all that data in the DataTable by the end anyway, so I'd essentially have two copies in memory.
I would like to do all of this as asynchronously as possible, but CsvHelper doesn't have any async methods so I've come up with the following workaround:
using (var inputStreamReader = new StreamReader(inputStream))
{
    while (!inputStreamReader.EndOfStream)
    {
        // Read line from the input stream
        string line = await inputStreamReader.ReadLineAsync();

        using (var memoryStream = new MemoryStream())
        using (var streamWriter = new StreamWriter(memoryStream))
        using (var memoryStreamReader = new StreamReader(memoryStream))
        using (var csvParser = new CsvParser(memoryStreamReader))
        {
            await streamWriter.WriteLineAsync(line);
            await streamWriter.FlushAsync();
            memoryStream.Position = 0;

            // Loop through all the rows (should only be one as we only read a single line...)
            while (true)
            {
                var row = csvParser.Read();

                // No more rows to process
                if (row == null)
                {
                    break;
                }

                // Add row to DataTable
            }
        }
    }
}
Are there any issues with this solution? Is it even necessary? I've seen that the CsvHelper devs specifically did not add async functionality (https://github.com/JoshClose/CsvHelper/issues/202) but I don't really follow the reasoning behind not doing so.
EDIT: I've just realised that this solution isn't going to work for instances where a column contains a line break anyway :( Guess I'll just have to copy the whole input Stream to a MemoryStream or something
EDIT2: Some more information.
This is in an async method in a library where I am trying to do async all the way down. It'll likely be consumed by an MVC controller (if I just wanted to offload it from a UI thread I would just Task.Run() it). Mostly the method will be waiting on external sources such as a database / DFS, and I would like for the thread to be freed while it is.
CsvParser.Read() is going to block even if what's blocking is reading the Stream (e.g. if the data I'm attempting to read resides on a server on the other side of the world), whereas if CsvHelper were to implement an async method that uses TextReader.ReadAsync(), then I wouldn't be blocked waiting for my data to arrive from Dubai. As far as I can tell, I'm not asking for an async wrapper around a synchronous method.
EDIT3: Update from far in the future! Async functionality was actually added to CsvHelper back in 2017. I hope someone at the company I was working for has upgraded to the newer version since then!
Eric Lippert explained the usefulness of async-await using a metaphor of cooking a meal in a restaurant. According to his explanation, it is not useful to do something asynchronously if your thread has nothing else to do.
Also, be aware that while your thread is doing something, it cannot do something else. Only while your thread is waiting for something can it do other things. One of the things you wait for in your process is the reading of a file. While the thread is reading the file line by line, it has to wait several times for lines to be read. During this waiting it could do other things, like parsing the already-read CSV data and sending the parsed data to your destination.
Parsing the data is not a process where your thread has to wait for some other process to finish, as it has to when reading a file or sending data to a database. That's why there is no async version of the parsing process. A normal async-await wouldn't help keep your thread busy, because during the parsing process there is nothing to await, so during the parsing your thread wouldn't have time to do something else.
You could of course convert the parsing process to an awaitable task using Task.Run(() => ParseReadData(...)) and await that task, but in the analogy of Eric Lippert's restaurant this would be defrosting a cook to do the job while you are sitting behind the counter doing nothing.
However, if your thread has something meaningful to do, while the read CSV-data is being parsed, like responding to user input, then it might be useful to start the parsing in a separate task.
If your complete reading - parsing - updating database process doesn't need interaction with the user, but you need your thread to be free to do other things while doing the process, consider putting the complete process in a separate task, and start the task without awaiting for it. In that case you only use your interface thread to start the other task, and your interface thread is free to do other things. Starting this new task is a relatively small cost in comparison to the total time of your process.
Once again: if your thread has nothing else to do, let this thread do the processing, don't start other tasks to do it.
Here is a good article on exposing async wrappers on sync methods, and why CsvHelper didn't do it. http://blogs.msdn.com/b/pfxteam/archive/2012/03/24/10287244.aspx
If you don't want to block the UI thread, run the processing on a background thread.
CsvHelper pulls in a buffer of data. The size of the buffer is a setting that you can change if you like. If your server is on the other side of the world, it'll buffer some data, then read it. More than likely, it'll take several reads before the buffer is used.
CsvHelper also yields records, so if you don't actually get a row, nothing is read. If you only read a couple rows, only that much of the file is read (actually the buffer size).
If you're worried about memory, there are a couple simple options.
Buffer the data. You can bulk copy in 100 or 1000 rows at a time instead of the whole file. Just keep doing that until the file is done.
Use a FileStream. If you need to read the whole file at once for some reason, use a FileStream instead and write the whole thing to disc. It will be slower, but you won't be using a bunch of memory.
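The first option might be sketched like this; bulkCopyAsync here is a stand-in for whatever wraps your SqlBulkCopy.WriteToServerAsync call, and the batch size is illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BulkInsertHelper
{
    // Sketch of option 1: accumulate rows and flush every batchSize rows,
    // so only one batch's worth of data is held in memory at a time.
    public static async Task BulkInsertInBatchesAsync(
        IEnumerable<string[]> rows,
        Func<List<string[]>, Task> bulkCopyAsync,
        int batchSize = 1000)
    {
        var batch = new List<string[]>(batchSize);
        foreach (var row in rows)
        {
            batch.Add(row);
            if (batch.Count >= batchSize)
            {
                await bulkCopyAsync(batch); // e.g. fill a DataTable and bulk copy it
                batch.Clear();
            }
        }
        if (batch.Count > 0)
            await bulkCopyAsync(batch); // flush the final partial batch
    }
}
```

Because CsvHelper yields records lazily, the file is only read as far as the rows you have pulled so far, so this keeps both the CSV side and the DataTable side bounded.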
The (entire) documentation for the position property on a stream says:
When overridden in a derived class, gets or sets the position within the current stream.
The Position property does not keep track of the number of bytes from the stream that have been consumed, skipped, or both.
That's it.
OK, so we're fairly clear on what it doesn't tell us, but I'd really like to know what it in fact does stand for. What is 'the position' for? Why would we want to alter or read it? If we change it - what happens?
In a practical example, I have a stream that periodically gets written to, and I have a thread that attempts to read from it (ideally ASAP).
From reading many SO issues, I reset the position field to zero to start my reading. Once this is done:
Does this affect where the writer to this stream is going to attempt to put the data? Do I need to keep track of the last write position myself? (ie if I set the position to zero to read, does the writer begin to overwrite everything from the first byte?)
If so, do I need a semaphore/lock around this 'position' field (subclassing, perhaps?) due to my two threads accessing it?
If I don't handle this property, does the writer just overflow the buffer?
Perhaps I don't understand the Stream itself - I'm regarding it as a FIFO pipe: shove data in at one end, and suck it out at the other.
If it's not like this, then do I have to keep copying the data past my last read (ie from position 0x84 on) back to the start of my buffer?
I've seriously tried to research all of this for quite some time - but I'm new to .NET. Perhaps the Streams have a long, proud (undocumented) history that everyone else implicitly understands. But for a newcomer, it's like reading the manual to your car, and finding out:
The accelerator pedal affects the volume of fuel and air sent to the fuel injectors. It does not affect the volume of the entertainment system, or the air pressure in any of the tires, if fitted.
Technically true, but seriously, what we want to know is that if you mash it to the floor, you go faster.
EDIT - Bigger Picture
I have data coming in either from a serial port, a socket, or a file, and have a thread that sits there waiting for new data, and writing it to one or more streams - all identical.
One of these streams I can access from a telnet session from another pc, and that all works fine.
The problem I'm having now is parsing the data in code in the same program (on another of the duplicated streams). I'm duplicating the data to a MemoryStream, and have a thread to sit and decipher the data, and pass it back up to the UI.
This thread does a dataStream.BeginRead() into its own buffer, which returns some amount of data, up to but not more than the count argument. After I've dealt with whatever I got back from the BeginRead, I copy the remaining data (from the end of my read point to the end of the stream) to the start of my buffer so it won't overflow.
At this point, since both the writing and reading are asynchronous, I don't know if I can change the position (since it's a 'cursor' - thanks Jon). Even if I send a message to the other thread to say that I've just read 28 bytes, or whatever, it won't know which 28 bytes they were, and won't know how to reset its cursor/position.
I haven't subclassed any streams - I've just created a MemoryStream, and passed that to the thread that duplicates the data out to whatever streams are needed.
This all feels too complex to be the right way of doing it - I'm just unable to find a simple example I can modify as needed.
How else do people deal with a long-term sporadic data stream that needs to be sent to some other task that isn't instantaneous to perform?
EDIT: Probable Solution
While trying to write a Stream wrapper around a queue due to information in the answers, I stumbled upon this post by Stephen Toub.
He has written a BlockingStream, and explains:
Most streams in the .NET Framework are not thread safe, meaning that multiple threads can't safely access an instance of the stream concurrently and most streams maintain a single position at which the next read or write will occur. BlockingStream, on the other hand, is thread safe, and, in a sense, it implicitly maintains two positions, though neither is exposed as a numerical value to the user of the type.
BlockingStream works by maintaining an internal queue of data buffers written to it. When data is written to the stream, the buffer written is enqueued. When data is read from the stream, a buffer is dequeued in a first-in-first-out (FIFO) order, and the data in it is handed back to the caller. In that sense, there is a position in the stream at which the next write will occur and a position at which the next read will occur.
This seems exactly like what I was looking for - so thanks for the answers, guys; I only found this because of your answers.
I think that you are expecting a little too much from the documentation. It does tell you exactly what everything does, but it doesn't tell you much about how to use it. If you are not familiar with streams, reading only the documentation will not give you enough information to actually understand how to use them.
Let's look at what the documentation says:
"When overridden in a derived class,
gets or sets the position within the
current stream."
This is "standard documentation speak" for saying that the property is intended for keeping track of the position in the stream, but that the Stream class itself doesn't provide the actual implementation of that. The implementation lies in classes that derive from the Stream class, like a FileStream or a MemoryStream. Each have their own system of maintaining the position, because they work against completely different back ends.
There can even be implementations of streams where the Position property doesn't make sense. You can use the CanSeek property to find out if a stream implementation supports a position.
"The Position property does not keep
track of the number of bytes from the
stream that have been consumed,
skipped, or both."
This means that the Position property represents an absolute position in the back end implementation; it's not just a counter of what's been read or written. The methods for reading and writing the stream use the position to keep track of where to read or write, not the other way around.
For a stream implementation that doesn't support a position, it could still have returned how many bytes have been read or written, but it doesn't. The Position property should reflect an actual place in the data, and if it can't do that it should throw a NotSupportedException exception.
Now, let's look at your case:
Using a StreamReader and a StreamWriter against the same stream is tricky, and mostly pointless. The stream only has one position, and that will be used for both reading and writing, so you would have to keep track of two separate positions. Also, you would have to flush the buffer after each read and write operation, so that there is nothing left in the buffers and the Position of the stream is up to date when you retrieve it. This means that the StreamReader and StreamWriter can't be used as intended, and only act as a wrapper around the stream.
If you are using the StreamReader and StreamWriter from different threads, you have to synchronise every operation. Two threads can never use the stream at the same time, so a read/write operation would have to do:
lock
set position of the stream from local copy
read/write
flush buffer
get position of the stream to local copy
end lock
A stream can be used as a FIFO buffer that way, but there are other ways that may be better suited for your needs. A Queue<T> for example works as an in-memory FIFO buffer.
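For example, a hand-rolled in-memory FIFO along those lines (a sketch: Queue&lt;byte&gt; plus a lock, with Monitor to block the reader) avoids sharing a stream Position between threads entirely:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Sketch: a thread-safe byte FIFO. The writer and reader never touch a
// shared Position; each operation is atomic under one lock.
public class ByteFifo
{
    private readonly Queue<byte> _queue = new Queue<byte>();
    private readonly object _gate = new object();

    public void Write(byte[] data, int offset, int count)
    {
        lock (_gate)
        {
            for (int i = 0; i < count; i++)
                _queue.Enqueue(data[offset + i]);
            Monitor.PulseAll(_gate); // wake any blocked reader
        }
    }

    // Blocks until at least one byte is available, then returns up to count bytes.
    public int Read(byte[] buffer, int offset, int count)
    {
        lock (_gate)
        {
            while (_queue.Count == 0)
                Monitor.Wait(_gate);
            int n = Math.Min(count, _queue.Count);
            for (int i = 0; i < n; i++)
                buffer[offset + i] = _queue.Dequeue();
            return n;
        }
    }
}
```

This gives the FIFO-pipe behaviour the question was assuming a Stream would provide, without any of the Position bookkeeping.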
The position is the "cursor" for both writing and reading. So yes, after resetting the Position property to 0, it will start overwriting existing data.
You should be careful when dealing with a stream from multiple threads in the first place, to be honest. It's not clear whether you've written a new Stream subclass, or whether you're just the client of an existing stream, but either way you need to be careful.
It's not clear what you mean by "If I don't handle this property" - what do you mean by "handle" here? Again, it would help if you were clearer on what you were doing.
A Stream may act like a pipe... it really depends on what you're doing with it. It's unclear what you mean by "do I have to keep copying the data past my last read" - and unclear what you mean by your buffer, too.
If you could give an idea of the bigger picture of what you're trying to achieve, that would really help.
Does anyone know of a lazy stream implementation in .net? IOW, I want a to create a method like this:
public Stream MyMethod() {
    return new LazyStream(...whatever parameters..., delegate() {
        ... some callback code.
    });
}
and when my other code calls MyMethod() to retrieve the stream, it will not actually perform any work until someone tries to read from the stream. The usual way would be to make MyMethod take the stream parameter as a parameter, but that won't work in my case (I want to give the returned stream to an MVC FileStreamResult).
To further explain, what I'm looking for is to create a layered series of transformations, so
Database result set =(transformed to)=> byte stream =(chained to)=> GZipStream =(passed to)=> FileStreamResult constructor.
The result set can be huge (GB), so I don't want to cache the result in a MemoryStream, which I can pass to the GZipStream constructor. Rather, I want to fetch from the result set as the GZipStream requests data.
Most stream implementations are, by nature, lazy streams. Typically, any stream will not read information from its source until it is requested by the user of the stream (other than some extra "over-reading" to allow for buffering to occur, which makes stream usage much faster).
It would be fairly easy to make a Stream implementation that did no reading until necessary by overriding Read to open the underlying resource and then read from it when used, if you need a fully lazy stream implementation. Just override Read, CanRead, CanWrite, and CanSeek.
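A minimal sketch of such a lazy stream; the Func&lt;Stream&gt; factory signature is an assumption to match the question's delegate idea, and only Read and the capability properties are overridden.

```csharp
using System;
using System.IO;

// Sketch: wraps a factory delegate; the inner stream is only created
// on the first Read, so returning a LazyStream does no work up front.
public class LazyStream : Stream
{
    private readonly Func<Stream> _factory;
    private Stream _inner;

    public LazyStream(Func<Stream> factory) => _factory = factory;

    private Stream Inner => _inner ?? (_inner = _factory());

    public override int Read(byte[] buffer, int offset, int count)
        => Inner.Read(buffer, offset, count);

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner?.Dispose();
        base.Dispose(disposing);
    }
}
```

The factory could open the database result set and return a stream over it, so nothing happens until FileStreamResult starts pulling bytes.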
In your Stream class you have to implement several methods of System.IO.Stream including the Read method.
What you do in this method is up to you. If you choose to call a delegate - this is up to you as well, and of course you can pass this delegate as one of the parameters of your constructor. At least this is how I would do it.
Unfortunately it will take more than implementing the Read method, and your delegate will not cover the other required methods.
This answer (https://stackoverflow.com/a/22048857/1037948) links to this article about how to write your own stream class.
To quote the answer:
The producer writes data to the stream and the consumer reads. There's a buffer in the middle so that the producer can "write ahead" a little bit. You can define the size of the buffer.
To quote the original source:
You can think of the ProducerConsumerStream as a queue that has a Stream interface. Internally, it's implemented as a circular buffer. Two indexes keep track of the insertion and removal points within the buffer. Bytes are written at the Head index, and removed from the Tail index.
If Head wraps around to Tail, then the buffer is full and the producer has to wait for some bytes to be read before it can continue writing. Similarly, if Tail catches up with Head, the consumer has to wait for bytes to be written before it can proceed.
The article goes on to describe some weird cases when the pointers wrap around, with full code samples.
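Stripped of the locking and blocking, the index arithmetic the article describes might look like this (a single-threaded sketch for clarity; the real ProducerConsumerStream adds synchronization and waiting):

```csharp
// Sketch of the circular-buffer bookkeeping: bytes go in at the Head
// index, come out at the Tail index, and both wrap modulo the capacity.
public class CircularByteBuffer
{
    private readonly byte[] _buffer;
    private int _head, _tail, _count;

    public CircularByteBuffer(int capacity) => _buffer = new byte[capacity];

    public int Count => _count;
    public bool IsFull => _count == _buffer.Length;

    // Returns false when full; a blocking version would wait for a read instead.
    public bool TryWrite(byte b)
    {
        if (IsFull) return false;
        _buffer[_head] = b;
        _head = (_head + 1) % _buffer.Length; // wrap around
        _count++;
        return true;
    }

    // Returns false when empty; a blocking version would wait for a write instead.
    public bool TryRead(out byte b)
    {
        if (_count == 0) { b = 0; return false; }
        b = _buffer[_tail];
        _tail = (_tail + 1) % _buffer.Length; // wrap around
        _count--;
        return true;
    }
}
```

Keeping an explicit count sidesteps the ambiguity the article discusses when Head == Tail (which otherwise means either "empty" or "full").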
Does anyone know where I can find a Stream splitter implementation?
I'm looking to take a Stream, and obtain two separate streams that can be independently read and closed without impacting each other. These streams should each return the same binary data that the original stream would. No need to implement Position or Seek and such... Forward only.
I'd prefer if it didn't just copy the whole stream into memory and serve it up multiple times, which would be fairly simple enough to implement myself.
Is there anything out there that could do this?
I have made a SplitStream available on github and NuGet.
It goes like this.
using (var inputSplitStream = new ReadableSplitStream(inputSourceStream))
using (var inputFileStream = inputSplitStream.GetForwardReadOnlyStream())
using (var outputFileStream = File.OpenWrite("MyFileOnAnyFilestore.bin"))
using (var inputSha1Stream = inputSplitStream.GetForwardReadOnlyStream())
using (var outputSha1Stream = SHA1.Create())
{
    inputSplitStream.StartReadAhead();

    Parallel.Invoke(
        () => {
            var bytes = outputSha1Stream.ComputeHash(inputSha1Stream);
            var checksumSha1 = string.Join("", bytes.Select(x => x.ToString("x2")));
        },
        () => {
            inputFileStream.CopyTo(outputFileStream);
        }
    );
}
I have not tested it on very large streams, but give it a try.
github: https://github.com/microknights/SplitStream
Not out of the box.
You'll need to buffer the data from the original stream in a FIFO manner, discarding only data which has been read by all "reader" streams.
I'd use:
A "management" object holding some sort of queue of byte[] holding the chunks to be buffered and reading additional data from the source stream if required
Some "reader" instances which known where and on what buffer they are reading, and which request the next chunk from the "management" and notify it when they don't use a chunk anymore, so that it may be removed from the queue
This could be tricky without risking keeping everything buffered in memory (if the streams are at BOF and EOF respectively).
I wonder whether it isn't easier to write the stream to disk, copy it, and have two streams reading from disk, with self-deletion built into the Close() (i.e. write your own Stream wrapper around FileStream).
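As a sketch of that disk-based approach: you may not even need to build the self-deletion into a custom Close(), since FileStream already supports FileOptions.DeleteOnClose. The helper below (illustrative, not a library API) spills the source to two temp files and hands back readers that clean up after themselves.

```csharp
using System.IO;

public static class StreamTee
{
    // Sketch: copy the source stream to two temp files, then return two
    // independent forward readers that delete their file when disposed.
    public static (Stream, Stream) SplitViaDisk(Stream source)
    {
        string pathA = Path.GetTempFileName();
        string pathB = Path.GetTempFileName();

        // Write the source once, duplicating each chunk to both files.
        using (var a = File.OpenWrite(pathA))
        using (var b = File.OpenWrite(pathB))
        {
            var buffer = new byte[4096];
            int n;
            while ((n = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                a.Write(buffer, 0, n);
                b.Write(buffer, 0, n);
            }
        }

        // DeleteOnClose removes each temp file when its reader is disposed.
        Stream readA = new FileStream(pathA, FileMode.Open, FileAccess.Read,
            FileShare.Read | FileShare.Delete, 4096, FileOptions.DeleteOnClose);
        Stream readB = new FileStream(pathB, FileMode.Open, FileAccess.Read,
            FileShare.Read | FileShare.Delete, 4096, FileOptions.DeleteOnClose);
        return (readA, readB);
    }
}
```

It trades speed for memory, as noted above, but each returned stream can be read and closed completely independently.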
There seems to be a valid implementation of this called EchoStream:
http://www.codeproject.com/Articles/3922/EchoStream-An-Echo-Tee-Stream-for-NET
It's a very old implementation (2003), but it should provide some context.
found via Redirect writes to a file to a stream C#
You can't really do this without duplicating at least part of the source stream - mostly because it doesn't sound like you can control the rate at which they are consumed (multiple threads?). You could do something clever where one stream reads ahead of the other (thereby making the copy at that point only), but the complexity of this sounds like it's not worth the trouble.
I do not think you will be able to find a generic implementation to do just that. A Stream is rather abstract, you don't know where the bytes are coming from. For instance you don't know if it will support seeking; and you don't know the relative cost of operations. (The Stream might be an abstraction of reading data from a remote server, or even off a backup tape !).
If you are able to have a MemoryStream and store the contents once, you can create two separate streams using the same buffer; and they will behave as independent Streams but only use the memory once.
Otherwise, I think you are best off by creating a wrapper class that stores the bytes read from one stream, until they are also read by the second stream. That would give you the desired forward-only behaviour - but in worst case, you might risk storing all of the bytes in memory, if the second Stream is not read until the first Stream has completed reading all content.
With the introduction of async / await, so long as all but one of your reading tasks are async, you should be able to process the same data twice using only a single OS thread.
What I think you want, is a linked list of the data blocks you have seen so far. Then you can have multiple custom Stream instances that hold a pointer into this list. As blocks fall off the end of the list, they will be garbage collected. Reusing the memory immediately would require some other kind of circular list and reference counting. Doable, but more complicated.
When your custom Stream can answer a ReadAsync call from the cache, copy the data, advance the pointer down the list and return.
When your Stream has caught up to the end of the cache list, you want to issue a single ReadAsync to the underlying stream, without awaiting it, and cache the returned Task with the data block. So if any other Stream reader also catches up and tries to read more before this read completes, you can return the same Task object.
This way, both readers will hook their await continuation to the result of the same ReadAsync call. When the single read returns, both reading tasks will sequentially execute the next step of their process.
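A sketch of that caching scheme (illustrative; it assumes each reader requests chunk i+1 only after chunk i has completed, so the underlying reads stay sequential):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Sketch: a chunk cache where each reader keeps its own index, and the
// read that fills the next chunk is issued once and shared as a Task.
public class SharedStreamCache
{
    private readonly Stream _source;
    private readonly List<Task<byte[]>> _chunks = new List<Task<byte[]>>();
    private readonly object _gate = new object();

    public SharedStreamCache(Stream source) => _source = source;

    // Two readers asking for the same index get the very same Task,
    // so both continuations hook onto a single underlying ReadAsync.
    public Task<byte[]> GetChunkAsync(int index)
    {
        lock (_gate)
        {
            if (index < _chunks.Count)
                return _chunks[index];  // already cached (or still in flight)
            var task = ReadNextAsync(); // one read, shared by all readers
            _chunks.Add(task);
            return task;
        }
    }

    private async Task<byte[]> ReadNextAsync()
    {
        var buffer = new byte[4096];
        int n = await _source.ReadAsync(buffer, 0, buffer.Length);
        if (n == 0) return null;        // end of stream
        var chunk = new byte[n];
        Array.Copy(buffer, chunk, n);
        return chunk;
    }
}
```

A custom Stream per reader would wrap this, copying out of the returned chunk and advancing its own index; letting old list entries fall away for garbage collection (rather than reusing buffers) keeps the sketch simple, as the answer notes.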