C# Stream Design Question

I have an application right now that uses a pipeline design. In the first stage it reads some data and files into a Stream. There are some intermediate stages that do stuff to the stream of data, and then there is a final stage that writes the stream out somewhere. This all happens serially: one stage completes and then hands off to the next stage.
This has all been working just great, but now the amount of data is starting to get quite a bit larger (potentially hundreds of GB), so I'm thinking I will need to do something to alleviate this. My initial thought is what I'm looking for feedback on (being an independent developer, I just don't have anywhere to bounce the idea off of).
I'm thinking of creating a parallel pipeline. The object that starts off the pipeline would create all of the stages and kick each one off in its own thread. When the first stage gets the stream to a certain size, it will pass that stream off to the next stage for processing and start up a new stream of its own to continue filling up. The idea here is that the final stage will be closing out streams as the first stage is building new ones, so my memory usage would be kept lower.
So questions:
1) Any high level thoughts on directions for this design?
2) Is there a simpler approach that you can think of that might apply here?
3) Is there anything existing out there that does something like this that I could reuse (not a product I have to buy)?
Thanks,
MikeD

The producer/consumer model is a good way to proceed, and Microsoft's new Parallel Extensions should provide most of the groundwork for you. Look into the Task object; there's a preview release available for .NET 3.5 / VS2008.
Your first task should read blocks of data from your stream and then pass them on to other tasks. Then have as many tasks in the middle as logically fit; smaller tasks are (generally) better. The only thing you need to watch out for is making sure the last task saves the data in the order it was read, because the tasks in the middle may finish in a different order from the one they started in.
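As a rough illustration of that hand-off, here is a minimal sketch using BlockingCollection<T> (which shipped with the final Parallel Extensions in .NET 4); ReadChunks, Transform and WriteChunk are hypothetical stand-ins for your stage logic, and the required usings are System.Collections.Concurrent and System.Threading.Tasks:
// Minimal pipeline sketch: one thread per stage with FIFO queues between
// stages, so ordering is preserved by construction. ReadChunks, Transform
// and WriteChunk are hypothetical placeholders for your own stage logic.
var toTransform = new BlockingCollection<byte[]>(boundedCapacity: 8);
var toWrite = new BlockingCollection<byte[]>(boundedCapacity: 8);

var reader = Task.Factory.StartNew(() => {
    foreach (byte[] chunk in ReadChunks())      // stage 1: produce chunks
        toTransform.Add(chunk);                 // blocks when the queue is full
    toTransform.CompleteAdding();
});

var transformer = Task.Factory.StartNew(() => {
    foreach (byte[] chunk in toTransform.GetConsumingEnumerable())
        toWrite.Add(Transform(chunk));          // stage 2: process
    toWrite.CompleteAdding();
});

var writer = Task.Factory.StartNew(() => {
    foreach (byte[] chunk in toWrite.GetConsumingEnumerable())
        WriteChunk(chunk);                      // stage 3: persist
});

Task.WaitAll(reader, transformer, writer);
The bounded capacity is what keeps memory in check: the reader stalls whenever the downstream stages fall behind.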

For the design you've suggested, you'll want to read up on producer/consumer problems if you haven't already. You'll need a good understanding of how to use semaphores in that situation.
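For reference, the classic semaphore-based bounded buffer looks roughly like this (a sketch only; the capacity of 8 and the byte[] chunk type are illustrative assumptions):
// Classic bounded buffer guarded by two semaphores plus a lock.
// 'slots' counts free space, 'items' counts queued work.
Queue<byte[]> buffer = new Queue<byte[]>();
Semaphore slots = new Semaphore(8, 8);   // capacity 8
Semaphore items = new Semaphore(0, 8);
object gate = new object();

void Produce(byte[] chunk) {
    slots.WaitOne();                     // wait for free space
    lock (gate) buffer.Enqueue(chunk);
    items.Release();                     // signal one item available
}

byte[] Consume() {
    items.WaitOne();                     // wait for an item
    byte[] chunk;
    lock (gate) chunk = buffer.Dequeue();
    slots.Release();                     // signal one slot freed
    return chunk;
}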
Another approach you could try is to create multiple identical pipelines, each in a separate thread. This would probably be easier to code because it has a lot less inter-thread communication. However, depending on your data you may not be able to split it into chunks this way.

In each stage, do you read the entire chunk of data, do the manipulation, then send the entire chunk to the next stage?
If that is the case, you are using a "push" technique where you push the entire chunk of data to the next stage. Could you handle things in a more stream-like manner using a "pull" technique instead? In that model each stage is a stream, and as you read data from that stream, it pulls data from the previous stream by calling Read on it. As each stream is read, it reads from the previous stream in small bits, processes them, and returns the processed data. The destination stream determines how many bytes to read from the previous stream, so you never have to hold large amounts of data in memory. This is how applications like BizTalk work. There are some blogs about how BizTalk pipeline streams work, and I think they might be exactly what you want.
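To make the pull model concrete, here is a minimal sketch of a transforming read-only stream; TransformInPlace is a hypothetical placeholder for the real processing:
// A read-only stream that pulls from an inner stream on demand and
// transforms each block as it passes through. No large buffers needed.
public class TransformingStream : Stream {
    private readonly Stream inner;
    public TransformingStream(Stream inner) { this.inner = inner; }

    // Pull from upstream only when the consumer asks for bytes.
    public override int Read(byte[] buffer, int offset, int count) {
        int read = inner.Read(buffer, offset, count);
        TransformInPlace(buffer, offset, read);   // hypothetical per-block work
        return read;
    }

    private static void TransformInPlace(byte[] buffer, int offset, int count) {
        // ... your processing here ...
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { inner.Flush(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
Stages then chain naturally, e.g. new TransformingStream(new TransformingStream(File.OpenRead(path))), and the final stage simply copies from the outermost stream to the destination.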
Here's a multi-part blog entry that you might find interesting:
Part 1
Part 2
Part 3
Part 4
Part 5

Using a double buffer technique for concurrent reading and writing?

I have a relatively simple case where:
My program will be receiving updates via WebSockets and will be using these updates to update its local state. These updates will be very small (usually 1-1000 bytes of JSON, so < 1 ms to deserialize) but will be very frequent (up to ~1000/s).
At the same time, the program will be reading/evaluating this local state and outputting its results.
Both of these tasks should run in parallel and will run for the duration of the program, i.e. never stop.
Local state size is relatively small, so memory usage isn't a big concern.
The tricky part is that updates need to happen "atomically", so that the reader never sees a local state to which, for example, only half of an update has been applied. The state is not constrained to primitives and could contain arbitrary classes as far as I can tell at the moment, so I cannot solve it with something simple like Interlocked atomic operations. I plan on running each task on its own thread, so a total of two threads in this case.
To achieve this goal, I thought of using a double-buffer technique, where:
It keeps two copies of the state so one can be read from while the other is being written to.
The threads communicate which copy they are using via a lock: the writer thread locks a copy while writing to it; the reader thread requests the lock after it's done with its current copy; the writer sees that the reader now holds that copy, so it switches to the other one.
The writing thread keeps track of the state updates it has applied to the current copy, so when it switches to the other copy it can "catch up".
That's the general gist of the idea, but the actual implementation will be a bit different of course.
I've tried to look up whether this is a common solution but couldn't really find much info, so it's got me wondering:
Is it viable, or am I missing something?
Is there a better approach?
Is it a common solution? If so what's it commonly referred to as?
(bonus) Is there a good resource I could read up on for topics related to this?
I feel I've pretty much run into a dead end: I cannot find more resources and info (because I don't know what to search for) to judge whether this approach is "good". I plan on writing this in C#/.NET, but I assume the techniques and solutions translate to any language. All insights appreciated.
You actually need four buffers/objects: two owned by the reader, one owned by the writer, and one sitting in the "mailbox".
The reader: each time it finishes a group of atomic operations on its newer object, it uses an interlocked exchange to swap its older object handle (pointer or index, it doesn't matter) with the mailbox one. Then it looks at the newly obtained object and compares its sequence number to that of the object it just read (and is still holding) to find out which is newer.
The writer: it writes a complete copy of the latest data into its object, then uses an interlocked exchange to swap the newly written object with the mailbox one.
As you can see, the writer can steal the mailbox object at any time, but never the one that the reader is using, so read operations stay atomic. And the reader can steal the mailbox object at any time, but never the one the writer is using, so write operations stay atomic.
As long as the interlocked-exchange function produces the correct memory fence (release for the swap done in the writer thread, acquire for the reader thread), the objects can themselves be arbitrarily complex.
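Here is a rough C# sketch of that scheme; MyState and CopyInto are hypothetical placeholders for your actual state type and deep-copy logic:
// Four objects total: the reader owns two (newest + spare), the writer owns
// one, and one sits in the shared mailbox slot. Interlocked.Exchange gives
// a full memory fence on .NET, so MyState can be arbitrarily complex.
class Snapshot { public long Seq; public MyState State = new MyState(); }

class MailboxExchange {
    private Snapshot mailbox = new Snapshot();      // the shared slot

    // --- writer thread state ---
    private Snapshot writerCopy = new Snapshot();
    private long seq;

    public void Publish(MyState latest) {
        writerCopy.Seq = ++seq;
        CopyInto(writerCopy.State, latest);         // write a complete copy
        writerCopy = Interlocked.Exchange(ref mailbox, writerCopy);
    }

    // --- reader thread state ---
    private Snapshot newest = new Snapshot();
    private Snapshot spare = new Snapshot();

    public MyState AcquireLatest() {
        spare = Interlocked.Exchange(ref mailbox, spare);   // steal the mailbox
        if (spare.Seq > newest.Seq) {                       // newer than what we hold?
            Snapshot t = newest; newest = spare; spare = t;
        }
        return newest.State;    // safe: the writer can never obtain this object
    }

    private static void CopyInto(MyState target, MyState source) {
        // ... hypothetical deep copy of your state ...
    }
}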
If I understand correctly, the writes themselves are synchronous. If so, then maybe it's not necessary to keep two copies or even to use locks.
Maybe something like this could work?
volatile State state = populateInitialState();
...
// Reader thread
public State doRead() {
    // copying is optional: the writer never mutates a published State,
    // so returning 'state' directly would also be safe
    return makeCopyOfState(state);
}
...
// Writer thread: copy-on-write -- mutate a private copy, then publish it
// with a single atomic reference assignment
public void updateState() {
    State newState = makeCopyOfState(state);
    // make changes in newState
    state = newState;   // 'volatile' ensures readers see the new reference
}
It looks like you are using the input-process-output pattern in a multithreaded pipeline. Sometimes the input and processing phases (or processing and output phases) are merged when the problem is simple.
You have added a C# tag, so something like a BlockingCollection might be a useful way to communicate between the input and output threads. Since the local state is relatively small (your words), posting a data object containing a copy of the local state from the input thread to the output thread could be a simple solution. This follows a share-nothing philosophy, which satisfies the atomicity requirement because a snapshot of the current state is queued. The "catch up" capability is satisfied because the queue contains the backlog of state changes.
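A minimal sketch of that share-nothing hand-off, assuming hypothetical LocalState/Update classes with Apply and Clone methods:
// Input thread posts immutable snapshots; output thread consumes them.
// LocalState.Clone() is a hypothetical deep copy of your state class.
var snapshots = new BlockingCollection<LocalState>(boundedCapacity: 1024);

// Input thread: apply each websocket update, then queue a snapshot.
void OnUpdate(Update u) {
    localState.Apply(u);                 // hypothetical mutation
    snapshots.Add(localState.Clone());   // queue an atomic snapshot
}

// Output thread: drain snapshots and evaluate each one.
void OutputLoop() {
    foreach (LocalState s in snapshots.GetConsumingEnumerable())
        Evaluate(s);                     // hypothetical read/evaluate step
}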
Generally, Messaging Patterns and Conversation Patterns are useful resources when trying to work out what to communicate and how to communicate between 2 or more threads (or processes, services, servers, etc).

How to write from multiple threads to a TcpListener?

Let's say I have a static list List<string> dataQueue, where data keeps getting added at random intervals and also at a varying rate (1-1000 entries/second).
My main objective is to send the data from the list to the server; I'm using the TcpClient class.
What I've done so far is to send the data synchronously to the server from a single thread:
byte[] bytes = Encoding.ASCII.GetBytes(message);
tcpClient.GetStream().Write(bytes, 0, bytes.Length);
//The client is already connected at the start
And I remove the entry from the list, once the data is sent.
This works fine, but the data is not sent fast enough: the list keeps growing and consuming more memory while it is iterated and sent one entry at a time.
My question is: can I use the same tcpClient object to write concurrently from another thread, or can I use another TcpClient object with a new connection to the same server in another thread? What is the most efficient (quickest) way to send this data to the server?
PS: I don't want to use UDP
Right; this is a fun topic which I think I can opine about. It sounds like you are sharing a single socket between multiple threads - perfectly valid as long as you do it very carefully. A TCP socket is a logical stream of bytes, so you can't use it concurrently as such, but if your code is fast enough, you can share the socket very effectively, with each message being consecutive.
Probably the very first thing to look at is: how are you actually writing the data to the socket? what is your framing/encoding code like? If this code is simply bad/inefficient: it can probably be improved. For example, is it indirectly creating a new byte[] per string via a naive Encode call? Are there multiple buffers involved? Is it calling Send multiple times while framing? How is it approaching the issue of packet fragmentation? etc
As a very first thing to try - you could avoid some buffer allocations:
var enc = Encoding.ASCII;
byte[] bytes = ArrayPool<byte>.Shared.Rent(enc.GetMaxByteCount(message.Length));
// note: leased buffers can be oversized; and in general, GetMaxByteCount will
// also be oversized; so it is *very* important to track how many bytes you've used
int byteCount = enc.GetBytes(message, 0, message.Length, bytes, 0);
tcpClient.GetStream().Write(bytes, 0, byteCount);
ArrayPool<byte>.Shared.Return(bytes);
This uses a leased buffer to avoid creating a byte[] each time - which can massively improve GC impact. If it was me, I'd also probably be using a raw Socket rather than the TcpClient and Stream abstractions, which frankly don't gain you a lot. Note: if you have other framing to do: include that in the size of the buffer you rent, use appropriate offsets when writing each piece, and only write once - i.e. prepare the entire buffer once - avoid multiple calls to Send.
Right now, it sounds like you have a queue and dedicated writer; i.e. your app code appends to the queue, and your writer code dequeues things and writes them to the socket. This is a reasonable way to implement things, although I'd add some notes:
List<T> is a terrible way to implement a queue - removing things from the start requires a reshuffle of everything else (which is expensive); if possible, prefer Queue<T>, which is implemented perfectly for your scenario
it will require synchronization, meaning you need to ensure that only one thread alters the queue at a time - this is typically done via a simple lock, i.e. lock(queue) { queue.Enqueue(newItem); } on the producer side, and SomeItem next; lock(queue) { next = queue.Count == 0 ? null : queue.Dequeue(); } if (next != null) { ...write it... } on the writer side; a fuller sketch follows this list.
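Putting those two notes together, here is a minimal sketch of the dedicated-writer pattern; using Monitor.Wait/Pulse instead of polling is my own assumption, not something this answer prescribes:
// Shared queue, guarded by a lock; producers Pulse to wake the writer thread.
Queue<byte[]> queue = new Queue<byte[]>();

// App code: enqueue a prepared frame and signal the writer.
void Post(byte[] frame) {
    lock (queue) {
        queue.Enqueue(frame);
        Monitor.Pulse(queue);            // wake the writer if it is waiting
    }
}

// Dedicated writer thread: dequeue one frame at a time and send it.
void WriterLoop(NetworkStream stream) {
    while (true) {
        byte[] frame;
        lock (queue) {
            while (queue.Count == 0)
                Monitor.Wait(queue);     // sleep until Post() pulses
            frame = queue.Dequeue();
        }
        stream.Write(frame, 0, frame.Length);  // only this thread touches the socket
    }
}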
This approach is simple, and has some advantages in terms of avoiding packet fragmentation - the writer can use a staging buffer, and only actually write to the socket when a certain threshold is buffered, or when the queue is empty, for example - but it has the possibility of creating a huge backlog when stalls occur.
However! The fact that a backlog has occurred indicates that something isn't keeping up; this could be the network (bandwidth), the remote server (CPU) - or perhaps the local outbound network hardware. If this is only happening in small blips that then resolve themselves - fine (especially if it happens when some of the outbound messages are huge), but: one to watch.
If this kind of backlog is recurring, then frankly you need to consider that you're simply saturated for the current design, so you need to unblock one of the pinch points:
making sure your encoding code is efficient is step zero
you could move the encode step into the app-code, i.e. prepare a frame before taking the lock, encode the message, and only enqueue an entirely prepared frame; this means that the writer thread doesn't have to do anything except dequeue, write, recycle - but it makes buffer management more complex (obviously you can't recycle buffers until they've been completely processed)
reducing packet fragmentation may help significantly, if you're not already taking steps to achieve that
otherwise, you might need (after investigating the blockage):
better local network hardware (NIC) or physical machine hardware (CPU etc)
multiple sockets (and queues/workers) to round-robin between, distributing load
perhaps multiple server processes, with a port per server, so your multiple sockets are talking to different processes
a better server
multiple servers
Note: in any scenario that involves multiple sockets, you want to be careful not to go mad and have too many dedicated worker threads; if that number goes above, say, 10 threads, you probably want to consider other options - perhaps involving async IO and/or pipelines (below).
For completeness, another basic approach is to write from the app-code; this approach is even simpler, and avoids the backlog of unsent work, but: it means that now your app-code threads themselves will back up under load. If your app-code threads are actually worker threads, and they're blocked on a sync/lock, then this can be really bad; you do not want to saturate the thread-pool, as you can end up in the scenario where no thread-pool threads are available to satisfy the IO work required to unblock whichever writer is active, which can land you in real problems. This is not usually a scheme that you want to use for high load/volume, as it gets problematic very quickly - and it is very hard to avoid packet fragmentation since each individual message has no way of knowing whether more messages are about to come in.
Another, more recent option to consider is "pipelines" (System.IO.Pipelines); this is a new IO framework in .NET designed for high-volume networking, giving particular attention to things like async IO, buffer re-use, and a well-implemented buffer/backlog mechanism that makes it possible to use the simple writer approach (synchronize while writing) and have that not translate into direct sends - it manifests as an async writer with access to a backlog, which makes packet-fragmentation avoidance simple and efficient. This is quite an advanced area, but it can be very effective. The problematic part for you will be that it is designed for async usage throughout, even for writes - so if your app code is currently synchronous, this could be a pain to implement. But it is an area to consider. I have a number of blog posts talking about this topic, and a range of OSS examples and real-life libraries that make use of pipelines that I can point you at, but this isn't a "quick fix" - it is a radical overhaul of your entire IO layer. It also isn't a magic bullet - it can only remove overhead due to local IO processing costs.
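For a flavour of the API, here is a minimal sketch of pushing a message through a Pipe (assuming the System.IO.Pipelines package, with usings for System.IO.Pipelines, System.Buffers, System.Net.Sockets and System.Threading.Tasks; wiring this to your real socket loop is omitted):
// Producer side: copy a message into the pipe's buffer and flush.
// The pipe, not the caller, decides when bytes actually move onward.
async Task WriteMessageAsync(PipeWriter writer, byte[] message) {
    Memory<byte> buffer = writer.GetMemory(message.Length); // lease pipe-owned space
    message.CopyTo(buffer);
    writer.Advance(message.Length);          // commit the bytes we wrote
    await writer.FlushAsync();               // applies back-pressure when full
}

// Consumer side: drain whatever has accumulated and push it to the socket.
async Task SendLoopAsync(PipeReader reader, NetworkStream stream) {
    while (true) {
        ReadResult result = await reader.ReadAsync();
        foreach (ReadOnlyMemory<byte> segment in result.Buffer)
            await stream.WriteAsync(segment);             // one send per segment
        reader.AdvanceTo(result.Buffer.End);              // mark bytes consumed
        if (result.IsCompleted) break;
    }
}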

Write Large File Listing To File Efficiently in C#

I have what I would consider to be a fairly common problem, but have not managed to find a good solution on my own or by browsing this forum.
Problem
I have written a tool to get a file listing of a folder with some additional information such as file name, file path, file size, hash, etc.
The biggest problem that I have is that some of the folders contain millions of files (maybe 50 million in the structure).
Possible Solutions
I have two solutions, but neither of them are ideal.
Every time a file is read, the information is written straight to file. This is OK, but it means I can't multi-thread the enumeration without the threads contending for the lock on the output file.
Every time a file is read, the information is added to some form of collection such as a ConcurrentBag. This means I can multi-thread the enumeration of the files and add them to the collection. Once the enumeration is done, I can write the whole collection to a file using File.WriteAllLines; however, adding 50 million entries to the collection makes most machines run out of memory.
Other Options
Is there any way to add items to a collection and then write them to a file when it gets to a certain number of records in the collection or something like that?
I looked into a BlockingCollection, but that will just fill up really quickly, as the producers will be multi-threaded while the consumer is only single-threaded.
Create a FileStream that is shared by all threads. Before writing to that FileStream, a thread must lock it. FileStream has an internal buffer (4096 bytes, if I remember right), so it doesn't actually hit the disk on every write. You can wrap it in a BufferedStream if 4096 bytes is still not enough.
BlockingCollection is precisely what you need. You can create one with a large buffer and have a single writer thread writing to a file that it keeps open for the duration of the run.
If reading is the dominant operation time-wise the queue will be near empty the whole time and total time will be just slightly more than the read time.
If writing is the dominant operation time-wise the queue will fill up until you reach the limit you set (to prevent out of memory situations) and producers will only advance as the writer advances. The total time will be the time needed to write all the records to a single file sequentially and you cannot do better than that (when writer is the slowest part).
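A minimal sketch of that arrangement (bounded queue, many producers, one writer thread; root and ComputeRecord are hypothetical placeholders for your folder and per-file hashing/stat work):
// Bounded queue: producers block when 100000 records are pending,
// which caps memory while the single writer catches up.
var records = new BlockingCollection<string>(boundedCapacity: 100000);

var writerThread = Task.Run(() => {
    using (var writer = new StreamWriter("listing.txt"))
        foreach (string line in records.GetConsumingEnumerable())
            writer.WriteLine(line);          // file stays open for the whole run
});

Parallel.ForEach(
    Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories),
    path => records.Add(ComputeRecord(path)));  // hypothetical per-file work

records.CompleteAdding();                    // let the writer drain and exit
writerThread.Wait();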
You may be able to get slightly better performance by pipelining through multiple blocking collections, e.g. making the hash calculation (a CPU-bound operation) separate from the read and write operations. If you want to do that, though, consider the TPL Dataflow library.

Best strategy to implement reader for large text files

We have an application which logs its processing steps into text files. These files are used during implementation and testing to analyse problems. Each file is up to 10MB in size and contains up to 100,000 text lines.
Currently the analysis of these logs is done by opening a text viewer (Notepad++ etc) and looking for specific strings and data depending on the problem.
I am building an application which will help the analysis. It will enable a user to read files, search, highlight specific strings and other specific operations related to isolating relevant text.
The files will not be edited!
While playing a little with some concepts, I found out immediately that TextBox (and RichTextBox) don't handle the display of large text very well. I managed to implement a viewer using DataGridView with acceptable performance, but that control does not support color highlighting of specific strings.
I am now thinking of holding the entire text file in memory as a string, and only displaying a very limited number of records in the RichTextBox. For scrolling and navigating I thought of adding an independent scrollbar.
One problem I have with this approach is how to get specific lines from the stored string.
If anyone has any ideas or can highlight problems with my approach, thank you.
I would suggest loading the whole thing into memory, but as a collection of strings rather than a single string. It's very easy to do that:
string[] lines = File.ReadAllLines("file.txt");
Then you can search for matching lines with LINQ, display them easily etc.
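For example, a trivial sketch (the search term is hypothetical):
// Find all lines mentioning a term, keeping their original line numbers.
var matches = lines
    .Select((text, index) => new { text, index })
    .Where(l => l.text.Contains("ERROR"))
    .ToList();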
Here is an approach that scales well on modern CPUs with multiple cores.
You create an iterator block that yields the lines from the text file (or multiple text files if required):
IEnumerable<String> GetLines(String fileName) {
    using (var streamReader = File.OpenText(fileName)) {
        while (!streamReader.EndOfStream)
            yield return streamReader.ReadLine();
    }
}
You then use PLINQ to search the lines in parallel. Doing that can speed up the search considerably if you have a modern CPU.
GetLines(fileName)
.AsParallel()
.AsOrdered()
.Where(line => ...)
.ForAll(line => ...);
You supply a predicate in Where that matches the lines you need to extract. You then supply an action to ForAll that will send the lines to their final destination.
This is a simplified version of what you need to do. Your application is a GUI application and you cannot perform the search on the main thread. You will have to start a background task for this. If you want this task to be cancellable you need to check a cancellation token in the while loop in the GetLines method.
ForAll will call the action on threads from the thread pool. If you want to add the matching lines to a user interface control you need to make sure that this control is updated on the user interface thread. Depending on the UI framework you use there are different ways to do that.
This solution assumes that you can extract the lines you need in a single forward pass of the file. If you need to do multiple passes, perhaps based on user input, you may need to cache all the lines in memory instead. Caching 10 MB is not much, but let's say you decide to search multiple files: caching 1 GB can strain even a powerful computer, whereas using less memory and more CPU, as I suggest, will let you search very big files within a reasonable time on a modern desktop PC.
I suppose that when one has multiple gigabytes of RAM available, one naturally gravitates towards the "load the whole file into memory" path, but is anyone here really satisfied with such a shallow understanding of the problem? What happens when this guy wants to load a 4 gigabyte file? (Yeah, probably not likely, but programming is often about abstractions that scale and the quick fix of loading the whole thing into memory just isn't scalable.)
There are, of course, competing pressures: do you need a solution yesterday, or do you have the luxury of time to dig into the problem and learn something new? The framework also influences your thinking by presenting block-mode files as streams... you have to check the stream's BaseStream.CanSeek value and, if that is true, access the BaseStream.Seek() method to get random access. Don't get me wrong, I absolutely love the .NET framework, but I see a construction site where a bunch of "carpenters" can't put up the frame for a house because the air-compressor is broken and they don't know how to use a hammer. Wax-on, wax-off, teach a man to fish, etc.
So if you have time, look into a sliding window. You can probably do this the easy way by using a memory-mapped file (let the framework/OS manage the sliding window), but the fun solution is to write it yourself. The basic idea is that you only have a small chunk of the file loaded into memory at any one time (the part of the file that is visible in your interface with maybe a small buffer on either side). As you move forward through the file, you can save the offsets of the beginning of each line so that you can easily seek to any earlier section of the file.
Yes, there are performance implications... welcome to the real world where one is faced with various requirements and constraints and must find the acceptable balance between time and memory utilization. This is the fun of programming... figuring out the various ways that a goal can be reached and learning what the tradeoffs are between the various paths. This is how you grow beyond the skill levels of that guy in the office who sees every problem as a nail because he only knows how to use a hammer.
[/rant]
I would suggest using MemoryMappedFile in .NET 4 (or via P/Invoke in previous versions) to handle just the small portion of the file that is visible on screen, instead of wasting memory and time loading the entire file.
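A minimal sketch of reading one window of a large file through a memory-mapped view; offset and windowSize would be driven by the scrollbar position, and Display is a hypothetical UI callback:
// Map the file once, then open small read-only views on demand.
// Requires: using System.IO; using System.IO.MemoryMappedFiles;
using (var mmf = MemoryMappedFile.CreateFromFile(
           "huge.log", FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
{
    long offset = 0;                 // e.g. a saved line-start position
    long windowSize = 64 * 1024;     // only this much is touched in memory
                                     // (offset + windowSize must stay within the file)

    using (var view = mmf.CreateViewStream(offset, windowSize,
                                           MemoryMappedFileAccess.Read))
    using (var reader = new StreamReader(view))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            Display(line);           // hypothetical UI update
    }
}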

Fast data recording/logging on a separate thread in C#

We're developing an application which reads data from a number of external hardware devices continuously. The data rate is between 0.5MB - 10MB / sec, depending on the external hardware configuration.
The reading of the external devices is currently being done on a BackgroundWorker. Trying to write the acquired data to disk with this same BackgroundWorker does not appear to be a good solution, so what we want to do is queue this data to be written to a file, and have another thread dequeue the data and write it. Note that there will be a single producer and a single consumer of the data.
We're thinking of using a synchronized queue for this purpose. But we thought this wheel must have been invented so many times already, so we should ask the SO community for some input.
Any suggestions or comments on things that we should watch out for would be appreciated.
I would do a combination of what mr 888 suggests. Basically you have two background workers:
one that reads from the hardware device,
one that writes the data to disk.
Hardware background worker:
Adds chunks of data from the hardware to the Queue<>, in whatever format you have it in.
Write background worker:
Parses the data if needed and dumps it to disk.
One thing to consider here: is getting the data from the hardware to disk as fast as possible important?
If yes, then I would have the write background worker run in a loop with a 10 ms or 100 ms sleep, checking whether the queue has data.
If no, then I would have it sleep a similar amount (assuming the speed you get from your hardware changes periodically) and only write to disk once it has accumulated around 50-60 MB of data. I would consider doing it this way because a modern desktop hard drive can write about 60 MB per second (enterprise drives can be much quicker), and constantly writing data to it in small chunks is a waste of IO bandwidth.
I am pretty confident that your queue will be pretty much OK. But make sure that you use an efficient method of storing/retrieving the data, so that memory allocation/deallocation doesn't dominate your logging procedure. I would go for a pre-allocated memory buffer and use it as a circular queue.
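A minimal sketch of such a pre-allocated circular queue (the capacity and chunk size are illustrative assumptions, and the lock keeps it simple rather than lock-free):
// Fixed-size ring of pre-allocated chunks: the producer copies a chunk in,
// the consumer copies the oldest chunk out; no per-message allocation
// after start-up.
class ChunkRing {
    private readonly byte[][] slots;     // pre-allocated chunk storage
    private readonly int[] lengths;      // bytes actually used in each slot
    private int head, tail, count;
    private readonly object gate = new object();

    public ChunkRing(int capacity, int chunkSize) {
        slots = new byte[capacity][];
        lengths = new int[capacity];
        for (int i = 0; i < capacity; i++)
            slots[i] = new byte[chunkSize];   // allocate everything up front
    }

    // Producer: copy a chunk in; returns false when the ring is full.
    public bool TryWrite(byte[] source, int length) {
        lock (gate) {
            if (count == slots.Length) return false;
            Buffer.BlockCopy(source, 0, slots[head], 0, length);
            lengths[head] = length;
            head = (head + 1) % slots.Length;
            count++;
            return true;
        }
    }

    // Consumer: copy the oldest chunk out; returns 0 when the ring is empty.
    public int TryRead(byte[] destination) {
        lock (gate) {
            if (count == 0) return 0;
            int length = lengths[tail];
            Buffer.BlockCopy(slots[tail], 0, destination, 0, length);
            tail = (tail + 1) % slots.Length;
            count--;
            return length;
        }
    }
}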
You might need queuing, e.g. (note that a plain Queue<T> is not thread-safe across two threads, so use ConcurrentQueue<T> from System.Collections.Concurrent as below, or guard a Queue<T> with a lock):
// a thread-safe queue shared by the acquisition and logging threads
protected ConcurrentQueue<Byte[]> myQ = new ConcurrentQueue<Byte[]>();
// or
protected ConcurrentQueue<Stream> myQ;
// when you get the content:
myQ.Enqueue(...);
and use another thread to pop the queue:
// logging thread
protected void Logging() {
    while (true) {
        Byte[] content;
        while (myQ.TryDequeue(out content)) {
            // save content
        }
        System.Threading.Thread.Sleep(1000);
    }
}
I have a similar situation; in my case I used an asynchronous lock-free queue combined with a LIFO of synchronization objects.
Basically, the threads that write to the queue set a sync object in the LIFO, while the other "worker" threads reset sync objects in the LIFO.
We have a fixed number of sync objects, equal to the number of threads. The reason for using a LIFO is to keep the minimum number of threads running and to make better use of the cache system.
Have you tried MSMQ?
