I created the following code to compare images and check if they are similar. Since that takes quite a while, I tried to optimize my code using multithreading.
I worked with BackgroundWorker in the past and was now starting to use Tasks, but I am still not fully familiar with that.
Code below:
allFiles is a list of images to be compared.
chunksToCompare contains subsets of the tuples of files to compare (each tuple is a pair of files), so each task can compare, e.g., 20 tuples of files.
The code below works fine in general but has two issues:
Progress reporting does not really make sense, since progress is only updated when all tasks have completed, which takes quite a while.
Depending on the size of the files, each task has a different processing time: the code below always waits until all (64) tasks are completed before the next batch is started, which is obviously not optimal.
Many thanks in advance for any hint or idea.
// List for results
List<SimilarImage> similarImages = new List<SimilarImage>();
// create chunk of files to send to a thread
var chunksToCompare = GetChunksToCompare(allFiles);
// position of processed chunks of files
var i = 0;
// number of tasks
var taskCount = 64;
while (true)
{
// list of all tasks
List<Task<List<SimilarImage>>> tasks = new();
// create single tasks
for (var n = 0; n < taskCount; n++)
{
var task = (i + 1 + n < chunksToCompare.Count) ?
GetSimilarImageAsync2(chunksToCompare[i + n], threshold) : null;
if (task != null) tasks.Add(task);
}
// wait for all tasks to complete
await Task.WhenAll(tasks.Where(i => i != null));
// get results of single task and add it to list
foreach (var task in tasks)
{
if (task?.Result != null) similarImages.AddRange(task.Result);
}
// progress of processing
i += tasks.Count;
// report the progress
progress.Report(new ProgressInformation() { Count = chunksToCompare.Count,
Position = i + 1 });
// exit condition
if (i + 1 >= chunksToCompare.Count) break;
}
return similarImages;
More info: I am using .NET 6. Images are stored on an SSD. With my test dataset it took 6:30 minutes with sequential and 4:00 minutes with parallel execution. I am using a library that only takes the paths of two images and then compares them. There is a lot of overhead because the same image is reloaded multiple times. I looked for a different library to compare images, but I was not successful.
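One way the two issues described above could be approached is to throttle with a SemaphoreSlim instead of fixed batches, so a new chunk starts as soon as any slot frees up and progress is reported per finished chunk. This is only a minimal sketch: `ThrottledRunner`, `RunAsync` and the `work` delegate are hypothetical names, `work` stands in for a call such as GetSimilarImageAsync2(chunk, threshold), and it is assumed that `work` really runs asynchronously (for example by wrapping the comparison in Task.Run). Progress is reported as a plain count of completed chunks rather than the ProgressInformation type from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRunner
{
    // Runs `work` for every chunk with at most `maxConcurrency` chunks in flight.
    // `work` is a hypothetical stand-in for GetSimilarImageAsync2(chunk, threshold).
    public static async Task<List<TResult>> RunAsync<TChunk, TResult>(
        IReadOnlyList<TChunk> chunks,
        Func<TChunk, Task<List<TResult>>> work,
        int maxConcurrency,
        IProgress<int> progress = null)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var completed = 0;

        async Task<List<TResult>> RunOneAsync(TChunk chunk)
        {
            await gate.WaitAsync();                 // wait for a free slot
            try
            {
                // Treat a null result as "no matches", as the original loop does.
                return await work(chunk) ?? new List<TResult>();
            }
            finally
            {
                gate.Release();                     // free the slot as soon as this chunk ends
                progress?.Report(Interlocked.Increment(ref completed));
            }
        }

        // WhenAll returns the per-chunk results in the original chunk order.
        var perChunk = await Task.WhenAll(chunks.Select(RunOneAsync));
        return perChunk.SelectMany(r => r).ToList();
    }
}
```

A call might look like `await ThrottledRunner.RunAsync(chunksToCompare, c => GetSimilarImageAsync2(c, threshold), Environment.ProcessorCount, progress)`. Whether the processor count is a better limit than 64 depends on how much of the comparison is CPU-bound, so that value is a guess.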
Related
I've been using Parallel.ForEach to do some time-consuming processing on collections of items. The processing is actually handled by an external command line tool and I cannot change that. However, it seems that the Parallel.ForEach will get "stuck" on a long running item from the collection. I've distilled the problem down and can show that Parallel.ForEach is, in fact, waiting for this long one to finish and not allowing any others through. I've written a console app to demonstrate the problem:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace testParallel
{
class Program
{
static int inloop = 0;
static int completed = 0;
static void Main(string[] args)
{
// initialize an array of integers to hold the wait duration (in milliseconds)
var items = Enumerable.Repeat(10, 1000).ToArray();
// set one of the items to 10 seconds
items[50] = 10000;
// Initialize our line for reporting status
Console.Write(0.ToString("000") + " Threads, " + 0.ToString("000") + " completed");
// Start the loop in a task (to avoid SO answers having to do with the Parallel.ForEach call, itself, not being parallel)
var t = Task.Factory.StartNew(() => Process(items));
// Wait for the operations to complete
t.Wait();
// Report finished
Console.WriteLine("\nDone!");
}
static void Process(int[] items)
{
// SpinWait (not sleep or yield or anything) for the specified duration
Parallel.ForEach(items, (msToWait) =>
{
// increment the counter for how many threads are in the loop right now
System.Threading.Interlocked.Increment(ref inloop);
// determine at what time we should stop spinning
var e = DateTime.Now + new TimeSpan(0, 0, 0, 0, msToWait);
// spin until the target time
while (DateTime.Now < e) /* no body -- just a hard loop */;
// count another completed
System.Threading.Interlocked.Increment(ref completed);
// we're done with this iteration
System.Threading.Interlocked.Decrement(ref inloop);
// report status
Console.Write("\r" + inloop.ToString("000") + " Threads, " + completed.ToString("000") + " completed");
});
}
}
}
Basically, I make an array of int to store the number of milliseconds a given operation takes. I set them all to 10 except for one, which I set to 10000 (so, 10 seconds). I kick off the Parallel.ForEach in a task and process each integer in a hard spin wait (so it shouldn't be yielding or sleeping or anything).
On each iteration, I report how many iterations are in the body of the loop right now, and how many iterations we have completed. Mostly, it goes along fine. However, toward the end (time-wise), it reports "001 Threads, 987 Completed".
My question is why doesn't it use 7 of the other cores to work on the remaining 13 "jobs"? This one long-running iteration should not keep it from processing other elements in the collection, right?
This example happens to be a fixed collection, but it could easily be set to be an enumerable. We wouldn't want to stop fetching the next item in the enumerable just because one was taking a long time.
I found the answer (or at least, an answer). It has to do with the chunk partitioning. The SO answer here got it for me. So basically, at the top of my "Process" function, if I change from this:
static void Process(int[] items)
{
Parallel.ForEach(items, (msToWait) => { ... });
}
to this
static void Process(int[] items)
{
// Partitioner and EnumerablePartitionerOptions live in System.Collections.Concurrent
var partitioner = Partitioner.Create(items, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, (msToWait) => { ... });
}
it grabs the work one at a time. For the more typical case of a parallel for each, where the body doesn't take more than a second, I can certainly see chunking the sets of work. In my use case, however, each body part can take anywhere from half a second to 5 hours. I certainly would not want a bunch of the 10-second variety elements to be blocked by one 5 hour element. So, in this case, the overhead of "one-at-a-time" is well worth it.
I have a List<string> of file paths with thousands of files. I want to process them in a function in parallel with 8 threads.
ParallelOptions opt= new ParallelOptions();
opt.TaskScheduler = null;
opt.MaxDegreeOfParallelism = 8;
Parallel.ForEach(fileList, opt, item => DoSomething(item));
This code works fine for me, but it only guarantees that at most 8 threads run, and I want 8 threads to run at all times. The CLR decides the number of threads to use depending on CPU load.
Please suggest a threading approach that always uses 8 threads for the computation with minimum overhead.
Use a producer / consumer model. Create one producer and 8 consumers. For example:
BlockingCollection<string> _filesToProcess = new BlockingCollection<string>();
// start 8 tasks to do the processing
List<Task> _consumers = new List<Task>();
for (int i = 0; i < 8; ++i)
{
var t = Task.Factory.StartNew(ProcessFiles, TaskCreationOptions.LongRunning);
_consumers.Add(t);
}
// Populate the queue
foreach (var filename in fileList)
{
_filesToProcess.Add(filename);
}
// Mark the collection as complete for adding
_filesToProcess.CompleteAdding();
// wait for consumers to finish
Task.WaitAll(_consumers.ToArray(), Timeout.Infinite);
Your processing method removes things from the BlockingCollection and processes them:
void ProcessFiles()
{
foreach (var filename in _filesToProcess.GetConsumingEnumerable())
{
// do something with the file name
}
}
That will keep 8 threads running until the collection is empty. Assuming, of course, you have 8 cores on which to run the threads. If you have fewer available cores, then there will be a lot of context switching, which will cost you.
See BlockingCollection for more information.
With a static counter, you might be able to track the number of currently running threads.
Every time you start a task, you can use Task.ContinueWith (http://msdn.microsoft.com/en-us/library/dd270696.aspx) to be notified when it is over, so you can start another one.
This way there will always be 8 tasks running.
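The "start another one when one finishes" idea could be sketched roughly as follows. Note that this sketch swaps in Task.WaitAny for the ContinueWith callback mentioned above, since a blocking loop is shorter to show. `RollingTasks` and `ProcessAll` are hypothetical names, and `doSomething` stands in for the DoSomething method from the question. Task.Run uses the thread pool, so this keeps up to 8 tasks queued or running but does not pin 8 dedicated threads.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class RollingTasks
{
    // Keeps up to `maxRunning` tasks alive: whenever one finishes, the next item is started.
    public static void ProcessAll<T>(IEnumerable<T> items, Action<T> doSomething, int maxRunning = 8)
    {
        var running = new List<Task>();
        foreach (var item in items)
        {
            if (running.Count >= maxRunning)
            {
                // Block until any running task ends, then drop it from the list.
                int finished = Task.WaitAny(running.ToArray());
                running[finished].GetAwaiter().GetResult(); // rethrows its exception, if any
                running.RemoveAt(finished);
            }
            var current = item;
            running.Add(Task.Run(() => doSomething(current)));
        }
        // Wait for the tasks that are still in flight.
        Task.WaitAll(running.ToArray());
    }
}
```

With the question's names, the call would be `RollingTasks.ProcessAll(fileList, DoSomething, 8)`.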
OrderablePartitioner<Tuple<int, int>> chunkPart = Partitioner.Create(0, fileList.Count, 1);//Partition the list in chunk of 1 entry
ParallelOptions opt= new ParallelOptions();
opt.TaskScheduler = null;
opt.MaxDegreeOfParallelism = 8;
Parallel.ForEach(chunkPart, opt, chunkRange =>
{
for (int i = chunkRange.Item1; i < chunkRange.Item2; i++)
{
DoSomething(fileList[i]); // fileList is a List<string> of paths in the question
}
});
I'm using C# Parallel.ForEach to process more than thousand subsets of data. One set takes 5-30 minutes to process, depending on size of the set. In my computer with option
ParallelOptions po = new ParallelOptions();
po.MaxDegreeOfParallelism = Environment.ProcessorCount;
I get 8 parallel tasks. As I understand it, the jobs are divided equally between the parallel tasks (e.g. the first task gets jobs 1, 9, 17, etc., the second gets 2, 10, 18, etc.); therefore, one task can finish its jobs sooner than the others because its sets of data took less time.
The problem is that four parallel tasks finish their jobs within 24 hours, but the last one finishes in 48 hours. Is there some way to organize the parallelism so that all parallel tasks finish at about the same time, i.e. all tasks keep working until all jobs are done?
Since the jobs are not equal, you can't split the number of jobs between processors and have them finish at about the same time. I think what you need here is 8 worker threads that retrieve the next job in line. You will have to use a lock on the function to get the next job.
Somebody correct me if I'm wrong, but off the top of my head... a worker thread could be given a function like this:
public void ProcessJob()
{
for (Job myJob = GetNextJob(); myJob != null; myJob = GetNextJob())
{
// process job
}
}
And the function to get the next job would look like:
private List<Job> jobs;
private int currentJob = 0;
private Job GetNextJob()
{
lock (jobs)
{
Job job = null;
if (currentJob < jobs.Count)
{
job = jobs[currentJob];
currentJob++;
}
return job;
}
}
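As a variation on the locked GetNextJob above, the same "each worker pulls the next job" pattern might be written with a ConcurrentQueue, which makes the explicit lock unnecessary. This is a sketch under assumptions: `JobPump`, `Run` and the `process` delegate are hypothetical names, and the full list of jobs is assumed to be known before the workers start.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class JobPump
{
    // Starts `workerCount` workers; each one keeps taking the next job until none are left.
    public static void Run<TJob>(IEnumerable<TJob> jobs, Action<TJob> process, int workerCount)
    {
        var queue = new ConcurrentQueue<TJob>(jobs);
        var workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Factory.StartNew(() =>
            {
                // TryDequeue is thread-safe, so no explicit lock is needed here.
                while (queue.TryDequeue(out var job))
                {
                    process(job);
                }
            }, TaskCreationOptions.LongRunning))
            .ToArray();
        Task.WaitAll(workers);
    }
}
```

Because a worker only takes a new job once it has finished its current one, a single slow job delays only the worker that picked it up.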
It seems that there is no ready-to-use solution and it has to be created.
My previous code was:
var ListOfSets = (from x in Database
group x by x.SetID into z
select new { ID = z.Key}).ToList();
ParallelOptions po = new ParallelOptions();
po.MaxDegreeOfParallelism = Environment.ProcessorCount;
Parallel.ForEach(ListOfSets, po, SingleSet=>
{
AnalyzeSet(SingleSet.ID);
});
To share work equally between all CPUs, I still use Parallel to do the work, but instead of ForEach I use For and an idea from Matt. The new code is:
Parallel.For(0, Environment.ProcessorCount, i =>
{
while (true)
{
double SetID;
lock (ListOfSets)
{
// check and remove under the same lock so another worker cannot empty the list in between
if (ListOfSets.Count == 0) break;
SetID = ListOfSets[0].ID;
ListOfSets.RemoveAt(0);
}
AnalyzeSet(SetID);
}
});
So, thank you for your advice.
One option, as suggested by others, is to manage your own producer consumer queue. I'd like to note that using the BlockingCollection makes this very easy to do.
BlockingCollection<JobData> queue = new BlockingCollection<JobData>();
//add data to queue; if it can be done quickly, just do it inline.
//If it's expensive, start a new task/thread just to add items to the queue.
foreach (JobData job in data)
queue.Add(job);
queue.CompleteAdding();
for (int i = 0; i < Environment.ProcessorCount; i++)
{
Task.Factory.StartNew(() =>
{
foreach (var job in queue.GetConsumingEnumerable())
{
ProcessJob(job);
}
}, TaskCreationOptions.LongRunning);
}
I'm reading the great article series on Eric Lippert's blog about C# 5's new asynchrony features. There he uses an example of a method that fetches documents from a remote location and, once retrieved, archives them on a storage drive. This is the code he uses:
async Task<long> ArchiveDocumentsAsync(List<Url> urls)
{
long count = 0;
Task archive = null;
for(int i = 0; i < urls.Count; ++i)
{
var document = await FetchAsync(urls[i]);
count += document.Length;
if (archive != null)
await archive;
archive = ArchiveAsync(document);
}
return count;
}
Now imagine that fetching documents is very quick. So the first document is fetched. After that, it's started to be archived, while the second document is being fetched. Now imagine the second document has been fetched and the first document is still being archived. Will this piece of code start fetching the third document or wait until the first document has been archived?
As Eric says in his article, this code is converted by the compiler to this:
Task<long> ArchiveDocuments(List<Url> urls)
{
var taskBuilder = AsyncMethodBuilder<long>.Create();
State state = State.Start;
TaskAwaiter<Document> fetchAwaiter = null;
TaskAwaiter archiveAwaiter = null;
int i;
long count = 0;
Task archive = null;
Document document;
Action archiveDocuments = () =>
{
switch(state)
{
case State.Start: goto Start;
case State.AfterFetch: goto AfterFetch;
case State.AfterArchive: goto AfterArchive;
}
Start:
for(i = 0; i < urls.Count; ++i)
{
fetchAwaiter = FetchAsync(urls[i]).GetAwaiter();
state = State.AfterFetch;
if (fetchAwaiter.BeginAwait(archiveDocuments))
return;
AfterFetch:
document = fetchAwaiter.EndAwait();
count += document.Length;
if (archive != null)
{
archiveAwaiter = archive.GetAwaiter();
state = State.AfterArchive;
//----> interesting part! <-----
if (archiveAwaiter.BeginAwait(archiveDocuments))
return; //Returns if archive is still working => Fetching of next document not done
AfterArchive:
archiveAwaiter.EndAwait();
}
archive = ArchiveAsync(document);
}
taskBuilder.SetResult(count);
return;
};
archiveDocuments();
return taskBuilder.Task;
}
Additional question:
If the execution is stopped, would it be possible to continue with fetching documents? If yes, how?
Will this piece of code start fetching the third document or wait until the first document has been archived?
It waits. The point of the article is to describe how the control flow works with the transformation, not to actually describe the best possible system for managing the fetch-archive operation.
Suppose you did have a hundred documents to fetch and archive, and you really didn't care what order they happened in. (*) You could make a new asynchronous method "FetchAndArchive" that fetches one document asynchronously and then archives it asynchronously. You could then call that method a hundred times from another asynchronous method that makes a hundred tasks, each one of which asynchronously fetches a document and archives it. The result of that method is a combined task that represents the work of doing those hundred tasks, each of which represents the work of doing two tasks.
In this scenario, whenever one of the fetch operations can't produce its result immediately, one of the tasks that is ready to do its archive step can run.
I didn't want to get into task combinators in this article; I wanted to concentrate on a more simple control flow.
(*) You might care what order they happened in if instead of "download a document and archive it" the operation was "fetch the next video in this series and play it". You don't want to play them out-of-order even if they can more efficiently arrive out-of-order. Rather, you want to download the next one while the current one is playing.
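The "FetchAndArchive" idea described in this answer could look roughly like the sketch below. It is not code from the article: `Archiver` is a hypothetical class, plain strings stand in for the Url and Document types, and the `fetchAsync` and `archiveAsync` delegates are placeholders for FetchAsync and ArchiveAsync. Task.WhenAll plays the role of the combined task.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class Archiver
{
    // Fetches one document and then archives it; returns the document length.
    static async Task<long> FetchAndArchiveAsync(
        string url,
        Func<string, Task<string>> fetchAsync,
        Func<string, Task> archiveAsync)
    {
        string document = await fetchAsync(url);
        await archiveAsync(document);
        return document.Length;
    }

    // Starts one fetch-and-archive task per URL and combines them into a single task.
    public static async Task<long> ArchiveDocumentsAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetchAsync,
        Func<string, Task> archiveAsync)
    {
        var tasks = urls.Select(u => FetchAndArchiveAsync(u, fetchAsync, archiveAsync));
        long[] lengths = await Task.WhenAll(tasks);
        return lengths.Sum();
    }
}
```

Here no document waits for another one's archive step, which is only acceptable when the order of the operations does not matter, as the footnote points out.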
This piece of code makes it wait until the previous document is archived before starting to archive the next. And it will only start downloading the third once it started to archive the second.
if (archive != null)
await archive;
But I think fetching is usually slow because it downloads from the internet, whereas archiving is fast since it writes to a local hard disk. Of course, that depends on your exact use case.
Without using async/await, the same* function in pseudo-code would be something like
long ArchiveDocumentsAsync(List<Url> urls)
{
long count = 0;
Task archive = null;
for(int i = 0; i < urls.Count; ++i)
{
Task<Something> documentTask = FetchAsync(urls[i]);
//Wait for the completion of the task.
documentTask.Wait();
//Get the results.
Something document = documentTask.Result;
count += document.Length;
if (archive != null) {
//Wait for the completion of the task.
archive.Wait();
}
archive = ArchiveAsync(document);
}
return count;
}
Note that we never have two Fetches or two Archivings at the same time. The 2nd Archiving cannot start before the 1st Archiving is completed, and the 3rd Fetch cannot start before the 2nd Archiving is started.
(*) Now for the Async magic:
The compiler generates code so that the calls to Wait() do not actually block execution of the current thread. The function ArchiveDocumentsAsync simply "yields" to its caller (unless its caller is itself awaiting its result, in which case control is yielded to the caller's caller, and so on).
The compiler-generated machinery makes sure execution continues right where it had stopped once the awaited task has completed.
Note: Eric Lippert already answered this question. I just want to give my two cents and write down my understanding so you guys can warn here if it's wrong.
I have an application that has many cases. Each case has many multipage TIF files. I need to convert the TIF files to PDF files. Since there are so many files, I thought I could thread the conversion process. I'm currently limiting the process to ten conversions at a time (i.e. ten threads). When one conversion completes, another should start.
This is the current setup I'm using.
private void ConvertFiles()
{
List<AutoResetEvent> semaphores = new List<AutoResetEvent>();
foreach(String fileName in filesToConvert)
{
String file = fileName;
if(semaphores.Count >= 10)
{
WaitHandle.WaitAny(semaphores.ToArray());
}
AutoResetEvent semaphore = new AutoResetEvent(false);
semaphores.Add(semaphore);
ThreadPool.QueueUserWorkItem(
delegate
{
Convert(file);
semaphore.Set();
semaphores.Remove(semaphore);
}, null);
}
if(semaphores.Count > 0)
{
WaitHandle.WaitAll(semaphores.ToArray());
}
}
Using this sometimes results in an exception stating that the WaitHandle.WaitAll() or WaitHandle.WaitAny() array parameter must not exceed a length of 65. What am I doing wrong in this approach and how can I correct it?
There are a few problems with what you have written.
1st, it isn't thread safe. You have multiple threads adding, removing and waiting on the array of AutoResetEvents. The individual elements of the List can be accessed on separate threads, but anything that adds, removes, or checks all elements (like the WaitAny call), need to do so inside of a lock.
2nd, there is no guarantee that your code will only process 10 files at a time. The code between when the size of the List is checked, and the point where a new item is added is open for multiple threads to get through.
3rd, there is potential for the threads started in the QueueUserWorkItem to convert the same file. Without capturing the fileName inside the loop, the thread that converts the file will use whatever value is in fileName when it executes, NOT whatever was in fileName when you called QueueUserWorkItem.
This codeproject article should point you in the right direction for what you are trying to do: http://www.codeproject.com/KB/threads/SchedulingEngine.aspx
EDIT:
var semaphores = new List<AutoResetEvent>();
foreach (String fileName in filesToConvert)
{
String file = fileName;
AutoResetEvent[] array;
lock (semaphores)
{
array = semaphores.ToArray();
}
if (array.Count() >= 10)
{
WaitHandle.WaitAny(array);
}
var semaphore = new AutoResetEvent(false);
lock (semaphores)
{
semaphores.Add(semaphore);
}
ThreadPool.QueueUserWorkItem(
delegate
{
Convert(file);
lock (semaphores)
{
semaphores.Remove(semaphore);
}
semaphore.Set();
}, null);
}
Personally, I don't think I'd do it this way...but, working with the code you have, this should work.
Are you using a real semaphore (System.Threading)? When using semaphores, you typically allocate your max resources and it'll block for you automatically (as you add & release). You can go with the WaitAny approach, but I'm getting the feeling that you've chosen the more difficult route.
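A minimal sketch of the semaphore approach suggested here, assuming a runtime with SemaphoreSlim and Task available (.NET 4 or later) and C# 8 for the `using var` declaration. `LimitedConverter` and `ConvertAll` are hypothetical names, and `convert` stands in for the Convert method from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class LimitedConverter
{
    // Converts all files with at most `maxConcurrent` conversions running at once.
    public static void ConvertAll(IEnumerable<string> files, Action<string> convert, int maxConcurrent = 10)
    {
        using var slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
        var tasks = new List<Task>();
        foreach (var file in files)
        {
            slots.Wait();                        // blocks while maxConcurrent conversions are running
            tasks.Add(Task.Run(() =>
            {
                try { convert(file); }
                finally { slots.Release(); }     // free the slot even if convert throws
            }));
        }
        // All slots have been released by the time every task has finished.
        Task.WaitAll(tasks.ToArray());
    }
}
```

The semaphore does the counting, so no list of wait handles is built up and the limit on the WaitAny/WaitAll array size is never reached.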
Looks like you need to remove the handle that triggered the WaitAny function to proceed:
if(semaphores.Count >= 10)
{
int index = WaitHandle.WaitAny(semaphores.ToArray());
semaphores.RemoveAt(index);
}
So basically I would remove the:
semaphores.Remove(semaphore);
call from the thread and use the above to remove the signaled event and see if that works.
Maybe you shouldn't create so many events?
// input
var filesToConvert = new List<string>();
Action<string> Convert = Console.WriteLine;
// limit
const int MaxThreadsCount = 10;
var fileConverted = new AutoResetEvent(false);
long threadsCount = 0;
// start
foreach (var file in filesToConvert) {
if (Interlocked.Read(ref threadsCount) >= MaxThreadsCount) // reached max threads count (read only; the increment happens below)
fileConverted.WaitOne(); // wait for one of started threads
Interlocked.Increment(ref threadsCount);
ThreadPool.QueueUserWorkItem(
delegate {
Convert(file);
Interlocked.Decrement(ref threadsCount);
fileConverted.Set();
});
}
// wait
while (Interlocked.Read(ref threadsCount) > 0) // paranoia?
fileConverted.WaitOne();