Thread Queue Process - C#

I'm building this program in Visual Studio 2010 using C# and .NET 4.0.
The goal is to use threads and a queue to improve performance.
I have a list of URLs I need to process:
string[] urls = { url1, url2, url3, etc.} //up to 50 urls
I have a function that takes in a URL and processes it:
public void processUrl(string url) {
//some operation
}
Originally, I created a for-loop to go through each URL:
for (int i = 0; i < urls.Length; i++)
    processUrl(urls[i]);
The method works, but the program is slow because it goes through the URLs one after another.
So the idea is to use threading to reduce the total time, but I'm not too sure how to approach that.
Say I want to create 5 threads to process URLs at the same time.
When I start the program, it will start processing the first 5 URLs. When one is done, the program starts processing the 6th URL; when another one is done, it starts processing the 7th, and so on.
The problem is, I don't know how to actually create a 'queue' of URLs and be able to go through the queue and process it.
Can anyone help me with this?
-- EDIT at 1:42PM --
I ran into another issue when running 5 processes at the same time.
The processUrl function involves writing to a log file, and if multiple processes time out at the same time, they write to the same log file simultaneously, and I think that's throwing an error.
I'm assuming that's the issue because the error message I got was "The process cannot access the file 'data.log' because it is being used by another process."

The simplest option would be to just use Parallel.ForEach. Provided processUrl is thread safe, you could write:
Parallel.ForEach(urls, processUrl);
I wouldn't suggest restricting it to 5 threads (the scheduler scales automatically under normal circumstances), but this can be done via:
Parallel.ForEach(urls, new ParallelOptions { MaxDegreeOfParallelism = 5}, processUrl);
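Regarding the edit: the "file in use" error means two threads are opening data.log at once. Provided the logging happens somewhere inside processUrl, one minimal fix (a sketch only; the question never shows the actual logging code, so File.AppendAllText here is an assumption) is to serialize the writes with a shared lock:
// Hypothetical logging helper - assumes the log is written via File.AppendAllText.
private static readonly object logLock = new object();

private static void WriteLog(string message)
{
    lock (logLock) // only one thread may touch data.log at a time
    {
        File.AppendAllText("data.log", message + Environment.NewLine);
    }
}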
That being said, URL processing is, by its nature, typically IO-bound, not CPU-bound. If you could use Visual Studio 2012, a better option would be to rework this to use the new async support in the language. This would require changing your method to something more like:
public async Task ProcessUrlAsync(string url)
{
    // Use await with async methods in the implementation...
}
You could then use the new async support in the loop:
// Create an enumerable of Tasks - this will start all the async operations...
var tasks = urls.Select(url => ProcessUrlAsync(url));
await Task.WhenAll(tasks); // "Await" until they all complete
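For completeness, here is a minimal sketch of how the pieces could fit together (the HttpClient usage is an assumption, since the question never shows what processUrl does; requires System.Net.Http and System.Linq):
public async Task ProcessAllUrlsAsync(string[] urls)
{
    using (var client = new HttpClient())
    {
        // Starts all downloads, then asynchronously waits for every one of them.
        var tasks = urls.Select(url => ProcessUrlAsync(client, url));
        await Task.WhenAll(tasks);
    }
}

private async Task ProcessUrlAsync(HttpClient client, string url)
{
    string content = await client.GetStringAsync(url); // IO-bound: no thread is blocked while waiting
    // ...process the downloaded content...
}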

Use a Parallel.ForEach with the MaxDegreeOfParallelism set to the number of threads you want (or leave it unset and let .NET do the work for you):
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = 5;
Parallel.ForEach(urls, parallelOptions, url =>
{
processUrl(url);
});

If you really want to create threads to accomplish your task instead of using parallel execution:
Suppose that I want one thread for each URL:
string[] urls = {"url1", "url2", "url3"};
I just start a new Thread instance for each URL (or for each batch of 5 URLs):
foreach (var thread in urls.Select(url => new Thread(() => DownloadUrl(url))))
thread.Start();
And the method to download your URL:
private static void DownloadUrl(string url)
{
Console.WriteLine(url);
}
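If you want the exact queue behaviour described in the question (5 workers pulling from a shared queue of URLs), here is a minimal sketch using ConcurrentQueue<string> from System.Collections.Concurrent (available in .NET 4); processUrl is the method from the question:
var queue = new ConcurrentQueue<string>(urls);
var workers = new Thread[5];
for (int i = 0; i < workers.Length; i++)
{
    workers[i] = new Thread(() =>
    {
        string url;
        // Each worker keeps dequeuing the next URL until the queue is empty.
        while (queue.TryDequeue(out url))
            processUrl(url);
    });
    workers[i].Start();
}
foreach (var worker in workers)
    worker.Join(); // block until all 5 workers have drained the queue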

Related

How does Console.WriteLine affect parallel execution

I am playing with and learning about async and parallel programming. I have a list of addresses and want to DNS-resolve them. I have made this function for that:
private static Task<string> ResolveAsync(string ipAddress)
{
return Task.Run(() => Dns.GetHostEntry(ipAddress).HostName);
}
Now, in the program I am resolving addresses like this; the idea is to use parallel programming:
//getting orderedClientIps before
var taskArray = new List<Task>();
foreach (var orderedClientIp in orderedClientIps)
{
var task = new Task(async () =>
{
orderedClientIp.Address = await ResolveAsync(orderedClientIp.Ip);
});
taskArray.Add(task);
task.Start();
}
Task.WaitAll(taskArray.ToArray());
foreach (var orderedClientIp in orderedClientIps)
{
Console.WriteLine($"{(orderedClientIp.Ip)} ({orderedClientIp.Ip}) - {orderedClientIp.Count}");
}
So, here we wait for all the addresses to resolve, and then in a separate iteration print them.
What interests me, what would be the difference if instead of printing in separate iteration, I would do something like this:
foreach (var orderedClientIp in orderedClientIps)
{
var task = new Task(async () =>
{
orderedClientIp.Address = await ResolveAsync(orderedClientIp.Ip);
Console.WriteLine($"{(orderedClientIp.Ip)} ({orderedClientIp.Ip}) - {orderedClientIp.Count}");
});
taskArray.Add(task);
task.Start();
}
Task.WaitAll(taskArray.ToArray());
I have tried executing both: the second version writes to the console one by one, whereas the first version writes everything out after waiting for all of them.
I think that the first approach is parallel and better, but I am not quite sure of the differences. What is, in the context of async and parallel programming, different in the second approach? And does the second approach somehow violate the Task.WaitAll() line?
The difference in the output behaviour you see is simply related to the point in time where you write the output.
Second approach: "and it writes to console one by one"
That's because the code to write the output is called as soon as any task is "done". That happens at different points in time, and thus you see them being output "one by one".
First approach: "in the first instance writes them all out after waiting them."
That's because you do just that in your code. Wait until all is done and then output sequentially what you have found.
Your example cannot be judged by the output behaviour when deciding which version runs things more in parallel.
In fact, for all practical purposes they are identical. The overhead of Console.WriteLine inside the task (compared to the actual DNS lookup) should be negligible.
It would be different for compute-intensive work, but then you should probably be using Parallel.ForEach anyway.
So where should you output then? It depends. If you need to show the information (here the DNS-lookup result) as soon as possible, then do it from inside the Task. If it can wait until all is done (which might take some time), then do it at the end.
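As an aside, here is a hedged sketch of printing each result as soon as it completes, without the new Task(async ...) wrapper (which has the extra subtlety that the outer Task completes at its first await, so Task.WaitAll does not actually wait for the lookups to finish):
var tasks = orderedClientIps.Select(async orderedClientIp =>
{
    orderedClientIp.Address = await ResolveAsync(orderedClientIp.Ip);
    // Printed as each individual lookup completes.
    Console.WriteLine($"{orderedClientIp.Address} ({orderedClientIp.Ip}) - {orderedClientIp.Count}");
}).ToArray();
await Task.WhenAll(tasks); // completes only after every lookup and printout is done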
Writing to the console is not asynchronous, because the console is not async by default. With the Console call you partially "synchronize" your tasks. Maybe:
var task = Task.Run(async () =>
{
    orderedClientIp.Address = await ResolveAsync(orderedClientIp.Ip);
    return $"{orderedClientIp.Address} ({orderedClientIp.Ip}) - {orderedClientIp.Count}";
}).ContinueWith(previousTask => Console.WriteLine(previousTask.Result));

Is there a way to limit the number of parallel Tasks globally in an ASP.NET Web API application?

I have an ASP.NET 5 Web API application which contains a method that takes objects from a List<T> and makes HTTP requests to a server, 5 at a time, until all requests have completed. This is accomplished using a SemaphoreSlim, a List<Task>(), and awaiting on Task.WhenAll(), similar to the example snippet below:
public async Task<ResponseObj[]> DoStuff(List<Input> inputData)
{
const int maxDegreeOfParallelism = 5;
var tasks = new List<Task<ResponseObj>>();
using var throttler = new SemaphoreSlim(maxDegreeOfParallelism);
foreach (var input in inputData)
{
tasks.Add(ExecHttpRequestAsync(input, throttler));
}
ResponseObj[] responses = await Task.WhenAll(tasks).ConfigureAwait(false);
return responses;
}
private async Task<ResponseObj> ExecHttpRequestAsync(Input input, SemaphoreSlim throttler)
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
using var request = new HttpRequestMessage(HttpMethod.Post, "https://foo.bar/api");
request.Content = new StringContent(JsonConvert.SerializeObject(input), Encoding.UTF8, "application/json");
var response = await HttpClientWrapper.SendAsync(request).ConfigureAwait(false);
var responseBody = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
var responseObject = JsonConvert.DeserializeObject<ResponseObj>(responseBody);
return responseObject;
}
finally
{
throttler.Release();
}
}
This works well; however, I am looking to limit the total number of Tasks that are being executed in parallel globally throughout the application, so as to allow scaling up of this application. For example, if 50 requests to my API came in at the same time, this would start at most 250 tasks running in parallel. If I wanted to limit the total number of Tasks being executed at any given time to, say, 100, is it possible to accomplish this? Perhaps via a Queue<T>? Would the framework automatically prevent too many tasks from being executed? Or am I approaching this problem in the wrong way, and would I instead need to queue the incoming requests to my application?
I'm going to assume the code is fixed, i.e., Task.Run is removed and the WaitAsync / Release are adjusted to throttle the HTTP calls instead of List<T>.Add.
I am looking to limit the total number of Tasks that are being executed in parallel globally throughout the application, so as to allow scaling up of this application.
This does not make sense to me. Limiting your tasks limits your scaling up.
For example, if 50 requests to my API came in at the same time, this would start at most 250 tasks running parallel.
Concurrently, sure, but not in parallel. It's important to note that these aren't 250 threads, and that they're not 250 CPU-bound operations waiting for free thread pool threads to run on, either. These are Promise Tasks, not Delegate Tasks, so they don't "run" on a thread at all. It's just 250 objects in memory.
If I wanted to limit the total number of Tasks that are being executed at any given time to say 100, is it possible to accomplish this?
Since (these kinds of) tasks are just in-memory objects, there should be no need to limit them, any more than you would need to limit the number of strings or List<T>s. Apply throttling where you do need it; e.g., number of HTTP calls done simultaneously per request. Or per host.
Would the framework automatically prevent too many tasks from being executed?
The framework has nothing like this built-in.
Perhaps via a Queue? Or am I approaching this problem in the wrong way, and would I instead need to Queue the incoming requests to my application?
There's already a queue of requests. It's handled by IIS (or whatever your host is). If your server gets too busy (or gets busy very suddenly), the requests will queue up without you having to do anything.
If I wanted to limit the total number of Tasks that are being executed at any given time to say 100, is it possible to accomplish this?
What you are looking for is to limit the MaximumConcurrencyLevel of what's called the TaskScheduler. You can create your own task scheduler that regulates the MaximumConcurrencyLevel of the tasks it manages. I would recommend implementing a queue-like object that tracks incoming requests and currently working requests, and waits for the current requests to finish before consuming more. The below information may still be relevant.
The task scheduler is in charge of how Tasks are prioritized, and in charge of tracking the tasks and ensuring that their work is completed, at least eventually.
The way it does this is actually very similar to what you mentioned, in general the way the Task Scheduler handles tasks is in a FIFO (First in first out) model very similar to how a ConcurrentQueue<T> works (at least starting in .NET 4).
Would the framework automatically prevent too many tasks from being executed?
By default the TaskScheduler that is created with most applications appears to default to a MaximumConcurrencyLevel of int.MaxValue. So theoretically yes.
The fact that there is practically no limit to the number of tasks (at least with the default TaskScheduler) might not be that big of a deal for your scenario.
Tasks are separated into two types, at least when it comes to how they are assigned to the available thread pools. They're separated into Local and Global queues.
Without going too far into detail, the way it works is: if a task creates other tasks, those new tasks are part of the parent task's queue (a local queue). Tasks spawned by a parent task are limited to the parent's thread pool (unless the task scheduler takes it upon itself to move queues around).
If a task isn't created by another task, it's a top-level task and is placed into the global queue. These would normally be assigned their own thread (if available), and if one isn't available, they are treated in a FIFO model, as mentioned above, until their work can be completed.
This is important because although you can limit the amount of concurrency with the TaskScheduler, it may not necessarily matter - say you have a top-level task that's marked as long-running and is in charge of processing your incoming requests. This would be helpful, since all the tasks spawned by this top-level task will be part of that task's local queue and therefore won't spam all the available threads in your thread pool.
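To make that last idea concrete, here is a hedged sketch (BlockingCollection, the Input type, and ProcessItem are all illustrative, not part of the question's code) of one long-running top-level task draining a work queue:
var workItems = new BlockingCollection<Input>();

// One top-level, long-running consumer that serializes the incoming work.
Task.Factory.StartNew(() =>
{
    foreach (var item in workItems.GetConsumingEnumerable()) // blocks until items arrive
        ProcessItem(item); // the actual per-item work
}, TaskCreationOptions.LongRunning); // hint to give this loop its own dedicated thread

// Incoming requests simply enqueue work instead of starting new top-level tasks:
workItems.Add(someInput);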
When you have a bunch of items and you want to process them asynchronously and with limited concurrency, the SemaphoreSlim is a great tool for this job. There are two ways that it can be used. One way is to create all the tasks immediately and have each task acquire the semaphore before doing its main work, and the other is to throttle the creation of the tasks while the source is enumerated. The first technique is eager, and so it consumes more RAM, but it's more maintainable because it is easier to understand and implement. The second technique is lazy, and it's more efficient if you have millions of items to process.
The technique that you have used in the sample code as shown is the first (eager) one.
Here is an example of using two SemaphoreSlims in order to impose two maximum concurrency policies, one per request and one globally. First the eager approach:
private const int maxConcurrencyGlobal = 100;
private static SemaphoreSlim globalThrottler
= new SemaphoreSlim(maxConcurrencyGlobal, maxConcurrencyGlobal);
public async Task<ResponseObj[]> DoStuffAsync(IEnumerable<Input> inputData)
{
const int maxConcurrencyPerRequest = 5;
var perRequestThrottler
= new SemaphoreSlim(maxConcurrencyPerRequest, maxConcurrencyPerRequest);
Task<ResponseObj>[] tasks = inputData.Select(async input =>
{
await perRequestThrottler.WaitAsync();
try
{
await globalThrottler.WaitAsync();
try
{
return await ExecHttpRequestAsync(input);
}
finally { globalThrottler.Release(); }
}
finally { perRequestThrottler.Release(); }
}).ToArray();
return await Task.WhenAll(tasks);
}
The Select LINQ operator provides an easy and intuitive way to project items to tasks.
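A usage sketch (the input list is illustrative):
var inputs = new List<Input> { /* ... */ };
// At most 5 concurrent HTTP calls for this request, and 100 app-wide across all requests.
ResponseObj[] responses = await DoStuffAsync(inputs);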
And here is the lazy approach for doing exactly the same thing:
private const int maxConcurrencyGlobal = 100;
private static SemaphoreSlim globalThrottler
= new SemaphoreSlim(maxConcurrencyGlobal, maxConcurrencyGlobal);
public async Task<ResponseObj[]> DoStuffAsync(IEnumerable<Input> inputData)
{
const int maxConcurrencyPerRequest = 5;
var perRequestThrottler
= new SemaphoreSlim(maxConcurrencyPerRequest, maxConcurrencyPerRequest);
var tasks = new List<Task<ResponseObj>>();
foreach (var input in inputData)
{
await perRequestThrottler.WaitAsync();
await globalThrottler.WaitAsync();
Task<ResponseObj> task = Run(async () =>
{
try
{
return await ExecHttpRequestAsync(input);
}
finally
{
try { globalThrottler.Release(); }
finally { perRequestThrottler.Release(); }
}
});
tasks.Add(task);
}
return await Task.WhenAll(tasks);
static async Task<T> Run<T>(Func<Task<T>> action) => await action();
}
This implementation assumes that the await globalThrottler.WaitAsync() will never throw, which is a given according to the documentation. This will no longer be the case if you decide later to add support for cancellation, and you pass a CancellationToken to the method. In that case you would need one more try/finally wrapper around the task-creation logic. The first (eager) approach could be enhanced with cancellation support without such considerations. Its existing try/finally infrastructure is already sufficient.
It is also important that the internal helper Run method is implemented with async/await. Eliding the async/await would be an easy mistake to make, because in that case any exception thrown synchronously by the ExecHttpRequestAsync method would be rethrown immediately, and it would not be encapsulated in a Task<ResponseObj>. Then the task returned by the DoStuffAsync method would fail without releasing the acquired semaphores, and also without awaiting the completion of the already started operations. That's another argument for preferring the eager approach. The lazy approach has too many gotchas to watch for.
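For illustration, here is a hedged sketch of the eager approach with cancellation support added (the only changes from the version above are the CancellationToken parameter and the two WaitAsync calls):
public async Task<ResponseObj[]> DoStuffAsync(IEnumerable<Input> inputData,
    CancellationToken cancellationToken)
{
    const int maxConcurrencyPerRequest = 5;
    var perRequestThrottler
        = new SemaphoreSlim(maxConcurrencyPerRequest, maxConcurrencyPerRequest);
    Task<ResponseObj>[] tasks = inputData.Select(async input =>
    {
        // WaitAsync(token) throws OperationCanceledException *before* acquiring,
        // so each finally below only ever releases a semaphore it actually holds.
        await perRequestThrottler.WaitAsync(cancellationToken);
        try
        {
            await globalThrottler.WaitAsync(cancellationToken);
            try { return await ExecHttpRequestAsync(input); }
            finally { globalThrottler.Release(); }
        }
        finally { perRequestThrottler.Release(); }
    }).ToArray();
    return await Task.WhenAll(tasks);
}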

Downloading while processing

I have the following scenario, a timer every x minutes:
download an item to work on from a REST service (made in PHP)
run a batch process to elaborate the item
Now the application is fully functional, but I want to speed up the entire process by downloading another item (if present on the REST service) while the application is processing the current one.
I think that I need a buffer/queue to accomplish this, like BlockingCollection, but I've no idea how to use it.
What's the right way to accomplish what I'm trying to do?
Thank you in advance!
What you can do is create a function which checks for new files to download. Have this function start as its own background thread that runs in an infinite loop, checking for new downloads in each iteration. If it finds any files that need downloading, call a separate function to download the file as a new thread. This new download function can then call the processing function as yet another thread once the file finishes downloading. With this approach you will be running all tasks in parallel for multiple files if needed.
Functions can be started on new threads like this:
Thread thread = new Thread(FunctionName);
thread.Start();
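And since the question mentions BlockingCollection, here is a hedged sketch of the same producer/consumer wiring with a bounded buffer (WorkItem, DownloadNextItem, and ProcessItem are placeholders for your REST download and batch elaboration):
// Buffer of at most 1 pre-downloaded item, so the downloader stays one step ahead.
var items = new BlockingCollection<WorkItem>(boundedCapacity: 1);

var producer = new Thread(() =>
{
    WorkItem item;
    while ((item = DownloadNextItem()) != null) // null = nothing left on the REST service
        items.Add(item); // blocks while the buffer is full
    items.CompleteAdding();
});

var consumer = new Thread(() =>
{
    foreach (var item in items.GetConsumingEnumerable()) // blocks until an item is available
        ProcessItem(item);
});

producer.Start();
consumer.Start();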
Use Microsoft's Reactive Framework (NuGet "System.Reactive"). Then you can do this:
var x_minutes = 5;
var query =
from t in Observable.Interval(TimeSpan.FromMinutes(x_minutes))
from i in Observable.Start(() => DownloadAnItem())
from e in Observable.Start(() => ElaborateItem(i))
select new { i, e };
var subscription =
query.Subscribe(x =>
{
// Do something with each `x.i` & `x.e`
});
Multi-threaded and simple.
If you want to stop processing then just call subscription.Dispose().

Multithreading to open file and update a class object

If I am creating Tasks using a for loop, will those tasks run in parallel, or will they just run one after the other?
Here is my code -
private void initializeAllSpas()
{
Task[] taskArray = new Task[spaItems.Count];
for (int i = 0; i < spaItems.Count; i++)
{
    int index = i; // copy the loop variable so the lambda doesn't capture 'i' itself
    taskArray[i] = Task.Factory.StartNew(() => spaItems[index].initializeThisSpa());
}
Task.WhenAll(taskArray).Wait();
foreach (var task in taskArray) task.Dispose();
}
where spaItems is a list of items from another class, call it SpaItem, in which the initializeThisSpa() function opens a file and updates the information for that particular SpaItem.
My question is: does the above code actually execute initializeThisSpa() on all of the spaItems at the same time? If not, how can I correct that?
(I ignored syntax issues, if any, and have not tested.)
At the same time?
Not guaranteed. At best, there will definitely be at least a few nanoseconds' difference.
Tasks are placed in a queue.
Every task waits for its opportunity to get a thread from the thread pool for its turn of execution.
It all depends on the availability of threads in the thread pool. If no thread is available, the task waits in the queue.
There are different states a task goes through before its final execution. Here is a good explanation; after going through this link, you will see that it is almost impossible to call a function at exactly the same time from multiple tasks.
https://blogs.msdn.microsoft.com/pfxteam/2009/08/30/the-meaning-of-taskstatus/
You can run tasks sequentially (one after another), calling a specific function, by creating tasks with methods like ContinueWith, ContinueWhenAll, and ContinueWhenAny.
An example is below in MSDN documentation link.
https://msdn.microsoft.com/en-us/library/dd321473(v=vs.110).aspx
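For example, here is a hedged sketch of chaining a follow-up action once every task in the question's taskArray has finished:
Task.Factory.ContinueWhenAll(taskArray, completedTasks =>
{
    // Runs once, after every initializeThisSpa() call has completed.
    Console.WriteLine("All {0} spa items initialized.", completedTasks.Length);
});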

Blocking queue with task

Problem:
I have a filesystem where files appear, and I want to upload them over WCF. What I want is to limit the maximum degree of parallelism to some ThreadMaxConcurrency.
The idea is to utilize the Producer-Consumer pattern with a blocking queue. The producing part is UploadNewFile, CreateFolder, etc.
What I am missing is the consumer part.
I also don't know how to delay a single task.
For example: DON'T upload new files before FolderWasCreated for them.
I am using .NET 4.5, and I don't know how to utilize a blocking queue, how to properly monitor whether there is a task to run, and how to postpone a task until another one completes (enqueueing it at the end again would work, I guess).
You should use TPL Dataflow, which is a framework that does all of that for you. You create an ActionBlock, give it a delegate, and set its MaxDegreeOfParallelism.
It should look similar to this:
var block = new ActionBlock<string>(folderName =>
{
UploadFolder(folderName);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
foreach (var folderName in GetFolderNames())
{
block.Post(folderName);
}
block.Complete();
await block.Completion;
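As for not uploading files before their folder exists, one hedged option (EnsureFolderExists and UploadFile are illustrative names) is to chain a folder-creation TransformBlock in front of the upload ActionBlock, so a file only reaches the uploader after its folder was created:
var createFolder = new TransformBlock<string, string>(filePath =>
{
    EnsureFolderExists(filePath); // your folder-creation logic
    return filePath;              // hand the file on to the uploader
});
var upload = new ActionBlock<string>(filePath => UploadFile(filePath),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
createFolder.LinkTo(upload, new DataflowLinkOptions { PropagateCompletion = true });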
