What are some strategies for working multi-threading into libraries? - c#

I'm in the process of writing a library that deals with long-running tasks like file downloading and processing large amounts of text. I want to multi-thread this library so that these tasks won't freeze up the applications that use them.
Do you have any advice for doing so in a structured manner, or specific classes I should use or avoid? I was thinking of using the IAsyncResult interface (http://msdn.microsoft.com/en-us/library/system.iasyncresult.aspx), or perhaps some BackgroundWorkers.

so that these tasks won't freeze up the applications that use them.
If this is your goal, you should look into the standard asynchronous programming patterns in the framework.
If your library is targeting .NET 4, have it return Task and Task<T>, as this will ease the transition to the async support coming in the next release of C# and VB.NET. This also has the very nice benefit of allowing synchronous usage with no extra work on your part, since the user can always just do:
var result = foo.BarAsync().Result; // Getting Task<T>.Result blocks, effectively making this synchronous
If you're targeting .NET 3.5 or earlier, you should consider using the Event-based asynchronous pattern, as it is used in more of the current APIs than the APM.
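For illustration (this sketch is mine, not from the original answer), a library method that returns Task<string> on .NET 4 might look roughly like this; TextDownloader and DownloadTextAsync are hypothetical names:

using System;
using System.Net;
using System.Threading.Tasks;

public class TextDownloader
{
    // Hypothetical library class: expose the long-running work as a Task<string>
    // so callers can block on .Result today and await it once async/await arrives.
    public Task<string> DownloadTextAsync(Uri uri)
    {
        // Task.Factory.StartNew queues the work on the thread pool (.NET 4).
        return Task.Factory.StartNew(() =>
        {
            using (var client = new WebClient())
            {
                return client.DownloadString(uri);
            }
        });
    }
}

A caller can then choose between blocking on DownloadTextAsync(uri).Result (synchronous) or attaching continuations (asynchronous).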

Related

What is the main difference between .NET async and Google Go's lightweight threads?

When you call runtime.GOMAXPROCS(1) in Go, the runtime will only use one thread for all your goroutines. When doing I/O, a goroutine will yield and let the other goroutines run on the same thread.
This seems very similar to how the .NET Async CTP does cooperative concurrency when you are not using a background thread.
My question is: what advantages or drawbacks can you think of for one method over the other?
Making value judgements is always a tricky thing so I'll highlight 3 differences. You decide whether they fall into the "pro" or "con" bucket.
While both Go and async allow you to write async code in a straightforward way, in .NET you have to be aware of which part of your code is async and which isn't (i.e. you have to explicitly use the async/await keywords; see the sketch after this list). In Go you don't need to know that: the runtime makes it "just work", and there is no special syntax to mark async code.
Go's design doesn't require any special code in the standard library. .NET required adding new code to the standard library for every async operation, essentially doubling the API surface in those cases; e.g. there is a new async HTTP download API, and the old, non-async HTTP download API has to remain for backwards compatibility.
Go's design and implementation are orders of magnitude simpler. A small piece of runtime code (the scheduler) takes care of suspending goroutines that block on system calls and yielding to sleeping goroutines. There is no need for any special async support in the standard library.
The .NET implementation first required adding the aforementioned new APIs. Furthermore, the .NET implementation is based on the compiler rewriting code with async/await into equivalent state machines. It's very clever but also rather complicated. The practical result was that the first Async CTP had known bugs, while Go's implementation was working pretty much from the beginning.
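To illustrate the first point, here is a minimal sketch (mine, not part of the original answer) showing how asynchrony has to be explicitly marked in C#; it assumes .NET 4.5's HttpClient and a placeholder URL:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class ExplicitAsyncSketch
{
    // The method must be marked "async", must return Task<T>,
    // and every asynchronous call inside it must be marked "await".
    static async Task<int> GetPageLengthAsync(string url)
    {
        using (var client = new HttpClient())
        {
            string body = await client.GetStringAsync(url);
            return body.Length;
        }
    }

    static void Main()
    {
        // The caller also has to care: blocking on .Result makes the call synchronous.
        Console.WriteLine(GetPageLengthAsync("http://example.com").Result);
    }
}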
Ultimately, it doesn't really matter. async/await is the best way to write async code in .NET. Goroutines are the best way to get that in Go. Both are great, especially compared to alternatives in most other languages.

Multithreading in .Net

I have a configuration XML which is used by a batch module in my .NET 3.5 Windows application.
Each node in the XML is mapped to a .NET class. Each class does processing like mathematical calculations, making DB calls, etc.
The batch module loads the XML, identifies the class associated with each node, and then processes it.
Now, we have the following requirements:
1. Let's say there are 3 classes [3 nodes in the XML]: A, B, and C.
Class A can be dependent on class B, i.e. we need to execute class B before processing class A. Class C processing should be done on a separate thread.
2. If a thread is running, then we should be able to cancel that thread in the middle of its processing.
We need to implement this whole module using .NET multi-threading.
My questions are:
1. Is it possible to implement requirement #1 above? If yes, how?
2. Given these requirements, is .NET 3.5 a good idea, or would .NET 4.0 be a better choice? I would like to know the advantages and disadvantages, please.
Thanks for reading.
You'd be better off using the Task Parallel Library (TPL) in .NET 4.0. It'll give you lots of nice features for abstracting the actual business of creating threads in the thread pool. You could use the parallel tasks pattern to create a Task for each of the jobs defined in the XML and the TPL will handle the scheduling of those tasks regardless of the hardware. In other words if you move to a machine with more cores the TPL will schedule more threads.
1) The TPL supports the notion of continuation tasks. You can use these to enforce task ordering and pass the result of one Task or future from the antecedent to the continuation. This is the futures pattern.
// The antecedent task. Can also be created with Task.Factory.StartNew.
Task<DayOfWeek> taskA = new Task<DayOfWeek>(() => DateTime.Today.DayOfWeek);
// The continuation. Its delegate takes the antecedent task
// as an argument and can return a different type.
Task<string> continuation = taskA.ContinueWith((antecedent) =>
{
return String.Format("Today is {0}.",
antecedent.Result);
});
// Start the antecedent.
taskA.Start();
// Use the continuation's result.
Console.WriteLine(continuation.Result);
2) Thread cancellation is supported by the TPL, but it is cooperative cancellation. In other words, the code running in the Task must periodically check whether it has been cancelled and shut down cleanly. The TPL has good support for cancellation. Note that if you were to use threads directly you would run into the same limitation: Thread.Abort is almost never a viable solution.
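As a rough illustration (not part of the original answer), cooperative cancellation with the TPL might look like this; the loop body is a stand-in for your per-node processing:

using System;
using System.Threading;
using System.Threading.Tasks;

class CancellationSketch
{
    static void Main()
    {
        var cts = new CancellationTokenSource();

        Task worker = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 1000; i++)
            {
                // Cooperative cancellation: the task polls the token and bails out cleanly.
                cts.Token.ThrowIfCancellationRequested();
                Thread.Sleep(10); // stand-in for real processing
            }
        }, cts.Token);

        cts.Cancel(); // request cancellation from the coordinating thread

        try
        {
            worker.Wait();
        }
        catch (AggregateException)
        {
            Console.WriteLine("Worker was cancelled: " + worker.IsCanceled);
        }
    }
}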
While you're at it you might want to look at a dependency injection container like Unity for generating configured objects from your XML configuration.
Answer to comment (below)
Jimmy: I'm not sure I understand holtavolt's comment. What is true is that using parallelism only pays off if the amount of work being done is significant; otherwise your program may spend more time managing parallelism than doing useful work. The actual datasets don't have to be large, but the work needs to be significant.
For example, if your inputs were large numbers and you were checking to see if they were prime, then the dataset would be very small but parallelism would still pay off, because the computation is costly for each number or block of numbers. Conversely, you might have a very large dataset of numbers that you were checking for evenness. This would involve a very large set of data, but the calculation is still very cheap, and a parallel implementation might still not be more efficient.
The canonical example is using Parallel.For instead of for to iterate over a dataset (large or small) but only perform a simple numerical operation like addition. In this case the expected performance improvement of utilizing multiple cores is outweighed by the overhead of creating parallel tasks and scheduling and managing them.
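For example, a minimal sketch (mine) of that canonical case, where the per-element work is a cheap addition and the array size is arbitrary, so scheduling overhead can outweigh the multi-core gain:

using System;
using System.Threading;
using System.Threading.Tasks;

class CheapWorkSketch
{
    static void Main()
    {
        var data = new int[10000000];
        for (int i = 0; i < data.Length; i++) data[i] = 1;

        // Sequential: trivial work per element.
        long sum = 0;
        for (int i = 0; i < data.Length; i++) sum += data[i];

        // Parallel version of the same cheap operation. The per-element work is so
        // small that creating, scheduling and managing the parallel tasks may cost
        // more than the work itself.
        long parallelSum = 0;
        Parallel.For(0, data.Length,
            () => 0L,                                 // thread-local seed
            (i, state, local) => local + data[i],     // cheap per-element body
            local => Interlocked.Add(ref parallelSum, local));

        Console.WriteLine(sum == parallelSum);
    }
}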
Of course it can be done.
Assuming you're new to multithreading and you want one thread per class, then I would look into the BackgroundWorker class and basically use it in the different classes to do the processing.
Which version of .NET you want to use also depends on whether this is going to run on client machines. But I would go for .NET 4 simply because it's the newest, and if you want to split up a single task into multiple threads, it has built-in classes for this.
Given your use case, the Thread and BackgroundWorker classes should be sufficient. As you'll discover in reading the MSDN information regarding these classes, you will want to support cancellation as your means of shutting down a running thread before it's complete (thread "killing" is something to be avoided if at all possible); see the sketch after the links below.
.NET 4.0 has added some advanced items in the Task Parallel Library (TPL) - where Tasks are defined and managed with some smarter affinity for their most recently used core (to provide better cache behavior, etc.), however this seems like overkill for your use case, unless you expect to be running very large datasets. See these sites for more information:
http://msdn.microsoft.com/en-us/library/dd460717.aspx
http://archive.msdn.microsoft.com/ParExtSamples
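As a hedged illustration (not from the original answer) of cooperative cancellation with BackgroundWorker, something along these lines could work; the sleeps stand in for real processing:

using System;
using System.ComponentModel;
using System.Threading;

class WorkerSketch
{
    static void Main()
    {
        var worker = new BackgroundWorker { WorkerSupportsCancellation = true };

        worker.DoWork += (sender, e) =>
        {
            var bw = (BackgroundWorker)sender;
            for (int i = 0; i < 100; i++)
            {
                if (bw.CancellationPending)   // cooperative check
                {
                    e.Cancel = true;
                    return;
                }
                Thread.Sleep(50);             // stand-in for real processing
            }
        };

        worker.RunWorkerCompleted += (sender, e) =>
            Console.WriteLine(e.Cancelled ? "Cancelled." : "Completed.");

        worker.RunWorkerAsync();
        Thread.Sleep(200);
        worker.CancelAsync();                 // request cancellation

        Console.ReadLine();                   // keep the console app alive for the callback
    }
}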

Available parallel technologies in .Net

I am new to the .NET platform. I did a search and found that there are several ways to do parallel computing in .NET:
Parallel tasks in the Task Parallel Library, which is .NET 3.5.
PLINQ, .NET 4.0
Asynchronous programming, .NET 2.0 (async is mainly used for I/O-heavy tasks; F# has a concise syntax supporting this). I list this because in Mono there seems to be no TPL or PLINQ, so if I need to write cross-platform parallel programs, I can use async.
.NET threads. No version limitation.
Could you give some short comments on these or add more methods in this list?
You do need to do a fair amount of research in order to determine how to multithread effectively. There are some good technical articles on the Microsoft Parallel Computing team's site.
Off the top of my head, there are several ways to go about multithreading:
Thread class.
ThreadPool, which also has support for I/O-bound operations and an I/O completion port.
Begin*/End* asynchronous operations.
Event-based asynchronous programming (or "EBAP") components, which use SynchronizationContext.
BackgroundWorker, which is an EBAP that defines an asynchronous operation.
Task class (Task Parallel Library) in .NET 4.
Parallel LINQ. There is a good article on Parallel.ForEach (Task Parallel Library) vs. PLINQ.
Rx or "LINQ to Events", which does not yet have a non-Beta version but is nearing completion and looks promising.
(F# only) Asynchronous workflows.
Update: There is an article Understanding and Applying Parallel Patterns with the .NET Framework 4 available for download that gives some direction on which solutions to use for which kinds of parallel scenarios (though it assumes .NET 4 and doesn't cover Rx).
Strictly speaking, the distinction between parallel, asynchronous and concurrent should be made here.
Parallel means that a "task" is split among several smaller sub-"tasks" that can be run at the same time. This requires a multi-core CPU or a multi-CPU computer, where each task has its dedicated core or CPU. Or multiple computers. PLINQ (data parallelism) and TPL (task parallelism) fall into this category.
Asynchronous means that tasks run without blocking each other. F#'s async expression, Rx, Begin/End pattern are all APIs for async programming.
Concurrency is a concept more broad than parallelization and asynchrony.
Concurrency means that several "tasks" run at the same time, interacting with each other. But these "tasks" don't have to run on separate physical computing units, as is meant in parallelization. For example, multitasking operating systems can execute multiple processes concurrently even on single-core single-CPU computers, using time slices.
Concurrency can be achieved for example with the Actor model and message passing (e.g. F#'s mailbox, Erlang processes (Retlang in .Net))
Threads are a relatively low-level concept compared to the concepts above. Threads are tasks running within a process, running concurrently and managed directly by the operating system's scheduler. You can implement parallelization when the operating system maps each thread to a separate core, or an Actor model by implementing message queuing, routing, etc on each thread.
There are also some .NET libraries for data parallel programming which target the Graphics Processing Unit (GPU) including:
Microsoft Accelerator is for data parallel programming and can target either the GPU or multi-core processors.
Brama is for LINQ-style data transformations that run on the GPU.
CUDA.NET provides a wrapper to allow CUDA to be used from .NET programs.
There is also the Reactive Extensions for .NET (Rx)
Rx is basically linq queries for events. It allows you to process and combine asynchronous data streams in the same way linq allows you to work with collections. So you would probably use it in conjunction with other parallel technologies as a way of bringing the results of your parallel operations together without having to worry about locks and other low level threading primitives.
Expert to Expert: Brian Beckman and Erik Meijer - Inside the .NET Reactive Framework (Rx) gives a good overview of what Rx is all about.
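As a rough sketch (mine, and assuming a current Rx release referenced via System.Reactive.Linq rather than the Beta mentioned above), composing an asynchronous stream with LINQ-style operators looks like this:

using System;
using System.Reactive.Linq; // Rx (Reactive Extensions) assumed to be referenced

class RxSketch
{
    static void Main()
    {
        // An observable sequence that produces a value every 100 ms.
        IObservable<long> ticks = Observable.Interval(TimeSpan.FromMilliseconds(100));

        // Compose it with LINQ-style operators, just as you would a collection.
        IDisposable subscription = ticks
            .Where(x => x % 2 == 0)
            .Select(x => x * 10)
            .Take(5)
            .Subscribe(x => Console.WriteLine("Got {0}", x));

        Console.ReadLine(); // keep the process alive while values arrive
        subscription.Dispose();
    }
}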
EDIT: Another library worth mention is the Concurrency and Coordination Runtime (CCR), it's been around for a long time (earlier than '06) and is shipped as part of the Microsoft Robotics Studio.
Rx has a lot of the same cool ideas that the CCR has inside it, but with a much nicer API in my opinion. There's still some interesting stuff in the CCR though so it might be worth checking out. There's also a distributed services framework that works with the CCR that might make it useful depending on what you're doing.
Expert to Expert: Meijer and Chrysanthakopoulos - Concurrency, Coordination and the CCR
One more is the new Task Parallel library in .NET 4.0, which is similar and along the lines of what you've already discovered, but this may be an interesting read:
Task Parallel Library
The two major ways to do parallel work are threads and the new task-based library, the TPL.
The asynchronous programming you mention is nothing more than a new thread in the thread pool.
PLINQ, Rx and the others mentioned are actually extensions sitting on top of the new task scheduler.
The best article explaining the new architecture of the task scheduler and all the libraries on top of it (Visual Studio 2010 and the new .NET 4.0 TPL task-based parallelism) is here, by Steve Teixeira, Product Unit Manager for Parallel Developer Tools at Microsoft:
http://www.drdobbs.com/visualstudio/224400670
Otherwise, Dr. Dobb's has a dedicated parallel programming section here: http://www.drdobbs.com/go-parallel/index.jhtml
The main difference between, say, threads and the new task-based parallel programming is that you no longer need to think in terms of threads, or about how you manage pools and the underlying OS and hardware. The TPL takes care of that; you just use tasks (see the sketch below). That is a huge change in the way you do parallel work at any level, including the abstraction.
So in .NET you actually do not have many choices:
Thread
The new task-based approach (task scheduler).
Obviously the task-based approach is the way to go.
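For instance, a minimal sketch (mine; ExpensiveWork is just a placeholder) of "just using tasks" and letting the scheduler worry about threads:

using System;
using System.Threading.Tasks;

class TaskVsThreadSketch
{
    static void Main()
    {
        // You express *what* should run in parallel; the TPL's scheduler
        // decides how many thread-pool threads to use on this machine.
        Task<int>[] tasks =
        {
            Task.Factory.StartNew(() => ExpensiveWork(1)),
            Task.Factory.StartNew(() => ExpensiveWork(2)),
            Task.Factory.StartNew(() => ExpensiveWork(3))
        };

        Task.WaitAll(tasks);
        foreach (var t in tasks)
            Console.WriteLine(t.Result);
    }

    // Placeholder for some CPU-bound work.
    static int ExpensiveWork(int n) { return n * n; }
}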
cheers
Valko

Java's ThreadPoolExecutor equivalent for C#?

I used to make good use of Java's ThreadPoolExecutor class and have yet to find a good equivalent in C#. I know of ThreadPool.QueueUserWorkItem, which is useful in many cases but no good if you want to control the number of threads assigned to a task or have multiple individual queues for different task types.
For example, I liked to use a ThreadPoolExecutor with a single thread to guarantee sequential execution of asynchronous calls. Is there an easy way to do this in C#? Is there a non-static thread pool implementation?
Until .NET 4.0 and the TPL, there is no such feature built in.
However, see this article:
As part of the Reactive Extensions (Rx), the Task Parallel Library was backported to .NET 3.5. If you add a reference to the System.Threading.dll included in its distribution, you can use the TPL with .NET 3.5.
There are also thread pools built into the Concurrency and Coordination Runtime, which is freely available for use. See this MSDN article for use.
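For the single-thread, sequential-execution case specifically, a rough sketch of a home-grown equivalent (this is not a standard API; SingleThreadExecutor is a made-up class built on .NET 4's BlockingCollection) could look like this:

using System;
using System.Collections.Concurrent;
using System.Threading;

// A hypothetical single-threaded "executor": work items are queued and
// executed one at a time, in order, on a dedicated thread.
public class SingleThreadExecutor : IDisposable
{
    private readonly BlockingCollection<Action> _queue = new BlockingCollection<Action>();
    private readonly Thread _thread;

    public SingleThreadExecutor()
    {
        _thread = new Thread(() =>
        {
            foreach (Action work in _queue.GetConsumingEnumerable())
                work(); // runs strictly sequentially
        });
        _thread.IsBackground = true;
        _thread.Start();
    }

    public void Execute(Action work)
    {
        _queue.Add(work);
    }

    public void Dispose()
    {
        _queue.CompleteAdding(); // let the worker drain the queue and exit
        _thread.Join();
    }
}

Actions queued with Execute run strictly in order on the dedicated thread, which mimics a single-threaded ThreadPoolExecutor; Dispose drains the queue and stops the worker.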

Coroutines in C#

I am looking at ways to implement coroutines (user-scheduled threads) in C#. When using C++ I was using fibers. I see on the internet that fibers do not exist in C#. I would like to get similar functionality.
Is there any "right" way to implement coroutines in C#?
I have thought of implementing this using threads that acquire a single execution mutex, plus one scheduler thread which releases this mutex for each coroutine. But this seems very costly (it forces a context switch between each coroutine).
I have also seen the yield iterator functionality, but as I understand it you can't yield within an inner function (only in the original IEnumerator method). So this does me little good.
I believe that with the new .NET 4.5 / C# 5, the async/await pattern should meet your needs.
async Task<string> DownloadDocument(Uri uri)
{
    var webClient = new WebClient();
    // Await the download; control returns to the caller until it completes.
    var doc = await webClient.DownloadStringTaskAsync(uri);
    // do some more async work
    return doc;
}
I suggest looking at http://channel9.msdn.com/Events/TechEd/Australia/Tech-Ed-Australia-2011/DEV411 for more info. It is a great presentation.
Also http://msdn.microsoft.com/en-us/vstudio/gg316360 has some great information.
If you are using an older version of .NET, there is an Async CTP available for older .NET with a go-live license, so you can use it in production environments. Here is a link to the CTP: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=9983
If you don't like either of the above options, I believe you could follow the async iterator pattern as outlined here: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=9983
Edit: You can now use these: Is there a fiber api in .net?
I believe that you should look at the Reactive Extensions for .NET. For example, coroutines can be simulated using iterators and the yield statement.
However you may want to read this SO question too.
Here is an example of using threads to implement coroutines:
So I cheat. I use threads, but I only let one of them run at a time. When I create a coroutine, I create a thread, and then do some handshaking that ends with a call to Monitor.Wait(), which blocks the coroutine thread; it won't run anymore until it's unblocked. When it's time to call into the coroutine, I do a handoff that ends with the calling thread blocked and the coroutine thread runnable. Same kind of handoff on the way back.
Those handoffs are kind of expensive, compared with other implementations. If you need speed, you'll want to write your own state machine and avoid all this context switching. (Or you'll want to use a fiber-aware runtime; switching fibers is pretty cheap.) But if you want expressive code, I think coroutines hold some promise.
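A hedged sketch of that thread-handoff idea (mine, using two SemaphoreSlim handshakes instead of Monitor.Wait; ThreadCoroutine and its members are illustrative names, not the quoted author's code):

using System;
using System.Threading;

// Illustrative only: a coroutine backed by a dedicated thread, where control
// is handed back and forth so that only one thread runs at a time.
public class ThreadCoroutine
{
    private readonly SemaphoreSlim _resume = new SemaphoreSlim(0); // signals the coroutine to run
    private readonly SemaphoreSlim _yield  = new SemaphoreSlim(0); // signals the caller to continue
    private bool _finished;

    public ThreadCoroutine(Action<ThreadCoroutine> body)
    {
        var thread = new Thread(() =>
        {
            _resume.Wait();        // wait for the first MoveNext
            body(this);
            _finished = true;
            _yield.Release();      // hand control back one last time
        });
        thread.IsBackground = true;
        thread.Start();
    }

    // Called from inside the coroutine body to hand control back to the caller.
    public void Yield()
    {
        _yield.Release();
        _resume.Wait();
    }

    // Called by the caller; returns false once the coroutine body has completed.
    public bool MoveNext()
    {
        if (_finished) return false;
        _resume.Release();
        _yield.Wait();
        return !_finished;
    }
}

And a usage fragment, where control ping-pongs between the caller and the coroutine body:

var co = new ThreadCoroutine(c =>
{
    Console.WriteLine("step 1");
    c.Yield();
    Console.WriteLine("step 2");
});
while (co.MoveNext())
    Console.WriteLine("back in the caller");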
It's 2020, lots of things have evolved in C#. I've published an article on this topic, Asynchronous coroutines with C# 8.0 and IAsyncEnumerable:
In the C# world, they (coroutines) have been popularized by the Unity game development platform, and Unity uses IEnumerator-style methods and yield return for that.
Prior to C# 8, it wasn't possible to combine await and yield return within the same method, making it difficult to use asynchrony inside coroutines. Now, with the compiler's support for IAsyncEnumerable, it can be done naturally.
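A minimal sketch (mine, not from the linked article) of combining await and yield return with C# 8 and IAsyncEnumerable; the delays and values are placeholders:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class AsyncCoroutineSketch
{
    // An "asynchronous coroutine": it can both await and yield return.
    static async IAsyncEnumerable<int> ProduceAsync()
    {
        for (int i = 0; i < 3; i++)
        {
            await Task.Delay(100); // asynchronous work between yields
            yield return i;        // hand a value (and control) back to the consumer
        }
    }

    static async Task Main()
    {
        await foreach (int value in ProduceAsync())
            Console.WriteLine(value);
    }
}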
Channels: the missing piece
Pipelines are the missing piece relative to channels in golang. Channels are actually what make golang tick; they are its core concurrency tool. If you're using something like a coroutine in C# but relying on thread synchronisation primitives (semaphore, monitor, interlocked, etc.), then it's not the same.
Almost the same - Pipelines, but baked in
Eight years later, .NET Standard (.NET Framework / .NET Core) has support for Pipelines [https://blogs.msdn.microsoft.com/dotnet/2018/07/09/system-io-pipelines-high-performance-io-in-net/]. Pipelines are preferred for network processing. ASP.NET Core now ranks among the top 11 frameworks for plaintext request throughput [https://www.techempower.com/benchmarks/#section=data-r16&hw=ph&test=plaintext].
Microsoft advises this best practice for interfacing with network traffic: the awaited network bytes (completion-port I/O) should be put into a pipeline, and another thread should read data from the pipeline asynchronously. Many pipelines can be used in series for various processing stages on the byte stream. The pipeline has a reader and a writer cursor, and the virtual buffer size causes backpressure on the writer to reduce unnecessary use of memory for buffering, typically slowing down network traffic.
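A minimal sketch (mine) of that writer/reader handoff, assuming the System.IO.Pipelines package is referenced; the string payload and Task.Run wrappers are just illustrative:

using System;
using System.Buffers;
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;

class PipelineSketch
{
    static async Task Main()
    {
        var pipe = new Pipe();

        // Writer side: pretend these bytes arrived from the network.
        Task writing = Task.Run(async () =>
        {
            await pipe.Writer.WriteAsync(Encoding.UTF8.GetBytes("hello, pipeline"));
            pipe.Writer.Complete(); // no more data
        });

        // Reader side: consume asynchronously on another thread.
        Task reading = Task.Run(async () =>
        {
            while (true)
            {
                ReadResult result = await pipe.Reader.ReadAsync();
                ReadOnlySequence<byte> buffer = result.Buffer;

                if (!buffer.IsEmpty)
                    Console.WriteLine(Encoding.UTF8.GetString(buffer.ToArray()));

                // Tell the pipe how much was consumed so it can release memory.
                pipe.Reader.AdvanceTo(buffer.End);

                if (result.IsCompleted) break;
            }
            pipe.Reader.Complete();
        });

        await Task.WhenAll(writing, reading);
    }
}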
There are some critical differences between Pipelines and Go channels. Pipelines aren't the same as a golang channel: Pipelines are about passing mutable bytes, whereas golang channels are for signalling with memory references (including pointers). Finally, there is no equivalent of select with Pipelines.
(Pipelines use Spans [https://adamsitnik.com/Span/], which have been around for a little while but are now deeply optimised in .NET Core. Spans improve performance significantly. The .NET Core support improves performance further, but only incrementally, so using them on .NET Framework is perfectly fine.)
So Pipelines are a built-in standard that should help replace golang channels in .NET, but they are not the same, and there will be plenty of cases where Pipelines are not the answer.
Direct Implementations of Golang Channel
https://codereview.stackexchange.com/questions/32500/golang-channel-in-c - this is some custom code, and not complete.
You would need to be careful (as with golang) that messages passed via a .NET channel indicate a change of ownership of an object. This is something only the programmer can track and check, and if you get it wrong, you'll have two or more threads accessing data without synchronisation.
You may be interested in this library, which hides the usage of coroutines. For example, to write to a file:
//Prepare the file stream
FileStream sourceStream = File.Open("myFile.bin", FileMode.OpenOrCreate);
sourceStream.Seek(0, SeekOrigin.End);
//Invoke the task ('result' is assumed to be a byte[] produced earlier in the coroutine)
yield return InvokeTaskAndWait(sourceStream.WriteAsync(result, 0, result.Length));
//Close the stream
sourceStream.Close();
This library uses one thread to run all coroutines and allows calling into tasks for the truly asynchronous operations. For example, to call another method as a coroutine (i.e. yielding for its return):
//Given the signature
//IEnumerable<string> ReadText(string path);
var container = new Container();
yield return InvokeLocalAndWait(() => _globalPathProvider.ReadText(path), container);
var data = container.RawData as string;
