.Net collecting the same list in different task - c#

I created a list of objects and I want to fill it from different tasks. It looks correct, but it doesn't work.
This is my code:
var splittedDataList = Extensions.ListExtensions.SplitList(source, 500);
// Create a list of tasks
var poolTasks = new List<Task>();
var objectList = new List<Car>();
for (int i = 0; i < splittedDataList.Count; i++)
{
    var data = splittedDataList[i];
    poolTasks.Add(Task.Factory.StartNew(() =>
    {
        // Collect list of car
        objectList = CollectCarList(data);
    }));
}
// Wait all tasks to finish
Task.WaitAll(poolTasks.ToArray());

public List<Car> CollectCarList(List<Car> list)
{
    ///
    return list;
}

The code is using Tasks as if they were threads to flatten a nested list. Tasks aren't threads, they're a promise that something will produce a result in the future. In JavaScript they're actually called promises.
The question's exact code is flattening a nested list. This can easily be done with Enumerable.SelectMany(), e.g.:
var cars = source.SelectMany(data => data).ToList();
Flattening isn't an expensive operation so there shouldn't be any need for parallelism. If there are really that many items, Parallel LINQ can be used with .AsParallel(). LINQ operators after that are executed using parallel algorithms and collected at the end:
var cars = source.AsParallel()
                 .SelectMany(data => data)
                 .ToList();
Parallel LINQ is far more useful if it's used to parallelize the real time-consuming processing before flattening:
var cars = source.AsParallel()
                 .Select(data => DoSomethingExpensive(data))
                 .SelectMany(data => data)
                 .ToList();
Parallel LINQ is built for data parallelism - processing large amounts of in-memory data by partitioning the input and using worker tasks to process each partition with minimal synchronization between workers. It's definitely not meant for executing lots of asynchronous operations concurrently. There are other high-level tools for that.
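For the "lots of asynchronous operations" case, Task.WhenAll is the usual high-level tool. A minimal sketch, with FetchChunkAsync as a hypothetical stand-in for a real async source such as an HTTP or database call:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Demo
{
    // Hypothetical async source standing in for e.g. an HTTP or DB call.
    static async Task<List<int>> FetchChunkAsync(int start)
    {
        await Task.Delay(10);                       // simulate I/O latency
        return Enumerable.Range(start, 3).ToList();
    }

    static async Task Main()
    {
        // Start all operations without awaiting each one in turn...
        var tasks = new[] { 0, 10, 20 }.Select(FetchChunkAsync);
        // ...then await them together and flatten the results.
        var chunks = await Task.WhenAll(tasks);
        var flat = chunks.SelectMany(c => c).ToList();
        Console.WriteLine(flat.Count);              // 9
    }
}
```

The operations run concurrently on the thread pool's IO completion machinery rather than tying up worker threads the way Task.Factory.StartNew does.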

First off, List&lt;T&gt; is not thread-safe. If you really wanted to fill a list from different tasks, you would probably want to use some sort of concurrent collection.
https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent?view=net-6.0
The second question is: why would you want to do this? In your current example all this work is CPU bound anyway, so creating multiple tasks does not really get you anywhere. It's not going to speed anything up; in fact it will do quite the contrary, as the task scheduling and synchronization will add overhead to the processing.
If your input lists were coming from various other async tasks, e.g. calls to a database, then this might make more sense. In any case, based on what I see above, this would do what you're asking.
object ListLock = new object();

async void Main()
{
    var splittedDataList = new List<List<int>>
    {
        Enumerable.Range(0, 500).ToList(),
        Enumerable.Range(0, 500).ToList()
    };
    // Create a list of tasks
    var poolTasks = new List<Task>();
    var objectList = new List<int>();
    for (int i = 0; i < splittedDataList.Count; i++)
    {
        var data = splittedDataList[i];
        poolTasks.Add(Task.Factory.StartNew(() =>
        {
            lock (ListLock)
            {
                // Collect list of car
                objectList.AddRange(CollectCarList(data));
            }
        }));
    }
    // Wait all tasks to finish
    Task.WaitAll(poolTasks.ToArray());
    objectList.Dump();
}

// You can define other methods, fields, classes and namespaces here
public List<int> CollectCarList(List<int> list)
{
    ///
    return list;
}
I changed the list to a simple List of int as I didn't know what the definition of Car was in your application. The lock is required to overcome the thread-safety issue with List&lt;T&gt;. It could be removed if you used some kind of concurrent collection. I just want to reiterate that what this code is doing in its current state is pointless. You would be better off doing all this on a single thread unless there is some actual async IO going on somewhere else.
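If the chunks really did come from async calls, a lock-free variation on the code above is to have each task return its own list and merge after Task.WhenAll, so no shared list is ever touched concurrently. A sketch, again using int in place of Car:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Demo
{
    // Stand-in for real async work (e.g. a database call per chunk).
    static async Task<List<int>> CollectCarListAsync(List<int> chunk)
    {
        await Task.Delay(10);   // simulate I/O
        return chunk;
    }

    static async Task Main()
    {
        var splittedDataList = new List<List<int>>
        {
            Enumerable.Range(0, 500).ToList(),
            Enumerable.Range(0, 500).ToList(),
        };

        // Each task returns its own list: no shared state, no lock needed.
        var tasks = splittedDataList.Select(CollectCarListAsync);
        var results = await Task.WhenAll(tasks);
        var objectList = results.SelectMany(r => r).ToList();
        Console.WriteLine(objectList.Count);   // 1000
    }
}
```

Returning results instead of mutating shared state is generally the simplest way to sidestep thread-safety issues entirely.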

Related

Why Some records are missing when using parallel.forEach? [duplicate]

This question already has answers here:
multiple threads adding elements to one list. why are there always fewer items in the list than expected?
(2 answers)
Closed 2 years ago.
In my code, I'm getting a list of menus from the database and mapping them to DTO objects.
Due to the nested children, I decided to use Parallel.ForEach for mapping the entities, but I bumped into a weird issue: when the ForEach is finished, some of the records are not mapped!
The number of missed records is different each time - one time it's one, another time more!
public List<TreeStructureDto> GetParentNodes()
{
    var data = new List<TreeStructureDto>();
    var result = MenuDLL.Instance.GetTopParentNodes();
    Parallel.ForEach(result, res =>
    {
        data.Add(new Mapper().Map(res));
    });
    return data;
}
But when I'm debugging, the count of my original data is 59, while after mapping the count of my final list is 58!
My mapper class is as follows:
public TreeStructureDto Map(Menu menu)
{
    return new TreeStructureDto()
    {
        id = menu.Id.ToString(),
        children = true,
        text = menu.Name,
        data = new MenuDto()
        {
            Id = menu.Id,
            Name = menu.Name,
            ParentId = menu.ParentId,
            Script = menu.Script,
            SiblingsOrder = menu.SiblingsOrder,
            systemGroups = menu.MenuSystemGroups.Select(x => Map(x)).ToList()
        }
    };
}
I appreciate your helps in advance.
You are adding to a single list concurrently, which is not valid because List<T> is not thread-safe (most types are not thread-safe; this isn't a fault of List<T> - the fault is simply: never assume something is thread-safe unless you've checked).
If the bulk of the CPU work in that per-item callback is the new Mapper().Map(res) part, then you may be able to fix this with synchronization, i.e.
Parallel.ForEach(result, res =>
{
    var item = new Mapper().Map(res);
    lock (data)
    {
        data.Add(item);
    }
});
which prevents threads fighting while adding, but still allows the Map part to run concurrently and independently. Note that the order is going to be undefined, though; you might want some kind of data.Sort(...) after the Parallel.ForEach has finished.
An alternative solution to locking inside a Parallel.ForEach would be to use PLINQ:
public List<TreeStructureDto> GetParentNodes()
{
    var mapper = new Mapper();
    return MenuDLL.Instance.GetTopParentNodes()
        .AsParallel()
        .Select(mapper.Map)
        .ToList();
}
AsParallel uses multiple threads to perform the mappings, but no collection needs to be accessed via multiple threads concurrently.
As mentioned by Marc, this may or may not prove more efficient for your situation, so you should benchmark both approaches, as well as comparing to a single-threaded approach.
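One more detail worth checking when comparing them: PLINQ does not preserve source order by default. If the DTO list must come out in the same order as the menus, AsOrdered restores that (at some cost). A minimal sketch, with a trivial x * 2 mapping standing in for Mapper.Map:

```csharp
using System;
using System.Linq;

class Demo
{
    static void Main()
    {
        var source = Enumerable.Range(0, 100).ToList();

        // AsOrdered makes the parallel pipeline emit results in source order,
        // even though the Select calls themselves run on multiple threads.
        var mapped = source.AsParallel()
                           .AsOrdered()
                           .Select(x => x * 2)
                           .ToList();

        Console.WriteLine(mapped.SequenceEqual(source.Select(x => x * 2))); // True
    }
}
```

Without AsOrdered, the same query would produce the same elements but in an unpredictable order, which is the parallel analogue of the lock-then-Sort approach above.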

Parallel.For loop - Assigning a unique data entity for each thread

I have 100 records to parallelize, from 1 to 100. I can conveniently use a Parallel.For to execute them in parallel as follows, which will scale based on computing resources:
Parallel.For(0, limit, i =>
{
    DoWork(i);
});
But there are certain restrictions: each thread needs to work with its own Data entity, and there are a limited number of Data entities, say 10, which are created in advance by cloning each other and saved in a structure like a Dictionary or List. I can restrict the amount of parallelization using the following code:
Parallel.For(0, limit, new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
{
    DoWork(i);
});
But the issue is how to assign a unique Data entity to each incoming thread, such that the entity is not used by any other thread currently executing. Since the number of threads and Data entities is the same, starvation is not an issue. I can think of a way in which I create a boolean flag for each Data entity specifying whether it's in use; we then iterate through the dictionary or list to find the next available entity and lock the overall assignment process, so that only one thread is assigned an entity at a time. But in my view this should have a much more elegant solution; my version is just a workaround, not really a fix. My logic is:
Parallel.For(0, limit, new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
{
    lock (All_Threads_Common_Object)
    {
        // Check for an available data entity using the boolean
        // Assign the Data entity
    }
    DoWork(i);
    // Reset the boolean value so another thread can use the entity
});
Please let me know if the question needs further clarification
Use the overload of Parallel.For which accepts a thread local initialization function.
Parallel.For<DataEntity>(0, limit,
    // will run once for each thread
    () => GetThreadLocalDataEntity(),
    // main loop body, will run once per iteration
    (i, loop, threadDataEntity) =>
    {
        DoWork(i, threadDataEntity);
        return threadDataEntity; // we must return it here to adhere to the Func signature
    },
    // will run once for each thread after the loop
    threadDataEntity => threadDataEntity.Dispose() // if necessary
);
The main advantage of this method vs. the one you posted in the question, is that assignment of DataEntity happens once per thread, not once per loop iteration.
You can use a concurrent collection to store your 10 objects.
Each worker will pull one data entity out, use it, and give it back. The use of the concurrent collection is important, because in your scenario the normal one is not thread-safe.
Like so:
var queue = new ConcurrentQueue<DataEntity>();
// fill the queue with 10 items
Parallel.For(0, limit, new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
{
    DataEntity x;
    if (!queue.TryDequeue(out x))
        throw new InvalidOperationException();
    DoWork(i, x);
    queue.Enqueue(x);
});
Or, if blocking needs to be provided, wrap the thing in a BlockingCollection.
Edit: Do not wrap it in a loop to keep waiting. Rather, use the BlockingCollection like this:
var entities = new BlockingCollection<DataEntity>(new ConcurrentQueue<DataEntity>());
// fill the collection with 10 items
Parallel.For(0, limit, new ParallelOptions { MaxDegreeOfParallelism = 10 }, i =>
{
    DataEntity x = entities.Take();
    DoWork(i, x);
    entities.Add(x);
});

Task List in a multithreaded environment - "enumeration operation may not execute"

I have a Task factory that's kicking off many tasks, sometimes over 1000. I add every Task to a list, and remove it when the Task has completed.
var scan = Task.Factory.StartNew(() =>
{
    return operation.Run();
}, token.Token);
operations.Add(scan);
When a task Completes:
var finishedTask = scan.ContinueWith(resultTask =>
    OperationComplete(resultTask),
    TaskContinuationOptions.OnlyOnRanToCompletion);

public virtual void OperationComplete(Task task)
{
    operations.Remove(task);
}
When all are complete:
Task.Factory.ContinueWhenAll(operations.ToArray(),
    result =>
    {
        AllOperationsComplete();
    }, TaskContinuationOptions.None);
Then, at certain points in my application I want to get the count of running tasks. (This is where I get the error: "Collection was modified; enumeration operation may not execute.")
public int Count()
{
    int running = operations.Count<Task>((x) => x.Status == TaskStatus.Running);
    return running;
}
A couple questions:
1) Should I even worry about removing the tasks from the list? The list could easily grow into the 1000s.
2) What's the best way to make Count() safe? Creating a new List and adding operations to it will still enumerate the collection, if I remember right.
Either you need to lock to make sure only one thread accesses the list at a time (whether that's during removal or counting) or you should use a concurrent collection. Don't forget that Count(Func<T, bool>) needs to iterate over the collection in order to perform the count - it's like using a foreach loop... and you can't modify a collection (in general) while you're iterating over it.
I suspect that ConcurrentBag is an appropriate choice here - and as you're using TPL, presumably you have the .NET 4 concurrent collections available...
You need to make sure you don't modify a collection while you're iterating. Most collections don't support that. A lock would likely suffice.
But you'll likely want to revisit the design. Locking a collection for an extended period of time will likely kill any performance gains you were hoping to get from asynchronous Tasks.
Given the code is already checking status as part of the count call, and assuming you aren't doing the count until after all tasks are in the collection, just not removing them seems like the simplest answer. Make sure to actually measure perf differences if you decide to switch out List for something else, especially if the number of times that Count call is done is low relative to the size of the collection. :)
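If the list exists only to answer "how many are still running?", another option is to keep an atomic counter with Interlocked instead of enumerating any collection at all. A sketch under that assumption (the StartTracked helper is hypothetical, not from the question):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Demo
{
    static int _running;   // incremented/decremented atomically, never enumerated

    public static int Count() => Volatile.Read(ref _running);

    // Wraps a work item so the counter tracks it from start to finish.
    static Task StartTracked(Action work)
    {
        Interlocked.Increment(ref _running);
        return Task.Run(() =>
        {
            try { work(); }
            finally { Interlocked.Decrement(ref _running); }
        });
    }

    static void Main()
    {
        var tasks = Enumerable.Range(0, 20)
                              .Select(_ => StartTracked(() => Thread.Sleep(50)))
                              .ToArray();
        Task.WaitAll(tasks);
        Console.WriteLine(Count());   // 0 once all tasks have finished
    }
}
```

This avoids both the lock and the "collection modified during enumeration" exception, at the cost of losing the ability to inspect the individual tasks.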
You can use a ConcurrentDictionary to keep track of your tasks (a ConcurrentBag doesn't let you remove specific items).
ConcurrentDictionary<Task, string> runningTasks = new ConcurrentDictionary<Task, string>();

// Track the original task; its continuation removes it once it's done.
Task task = Task.Factory.StartNew(() =>
{
    // Do your stuff
});
runningTasks.TryAdd(task, "Hello I'm a task");
task.ContinueWith(processedTask =>
{
    string outString; // A string we don't care about
    runningTasks.TryRemove(processedTask, out outString);
});
// Add lots more tasks to runningTasks

while (runningTasks.Count > 0)
{
    Console.WriteLine("I'm still waiting...");
    Thread.Sleep(1000);
}
If you want to do a proper "WaitAll":
try
{
    Task[] keys = runningTasks.Keys.ToArray();
    Task.WaitAll(keys);
}
catch { } // WaitAll throws an AggregateException if any task faulted or was cancelled
Hope it helps.

Parallel.ForEach with adding to list

I'm trying to run multiple functions that connect to a remote site (by network) and return a generic list. But I want to run them simultaneously.
For example:
public static List<SearchResult> Search(string title)
{
    // Initialize a new temp list to hold all search results
    List<SearchResult> results = new List<SearchResult>();
    // Query all providers simultaneously
    Parallel.ForEach(Providers, currentProvider =>
    {
        List<SearchResult> tmpResults = currentProvider.SearchTitle(title);
        // Add results from the current provider
        results.AddRange(tmpResults);
    });
    // Return all combined results
    return results;
}
As I see it, multiple insertions to 'results' may happen at the same time, which may crash my application.
How can I avoid this?
You can use a concurrent collection.
The System.Collections.Concurrent namespace provides several thread-safe collection classes that should be used in place of the corresponding types in the System.Collections and System.Collections.Generic namespaces whenever multiple threads are accessing the collection concurrently.
You could, for example, use ConcurrentBag, since you have no guarantee of the order in which the items will be added.
Represents a thread-safe, unordered collection of objects.
// In the class scope:
object lockMe = new object();

// In the function:
lock (lockMe)
{
    results.AddRange(tmpResults);
}
Basically a lock means that only one thread can have access to that critical section at the same time.
For those who prefer code:
public static ConcurrentBag<SearchResult> Search(string title)
{
    var results = new ConcurrentBag<SearchResult>();
    Parallel.ForEach(Providers, currentProvider =>
    {
        // SearchTitle returns a list, so add each result individually
        foreach (var result in currentProvider.SearchTitle(title))
            results.Add(result);
    });
    return results;
}
The Concurrent Collections are new for .Net 4; they are designed to work with the new parallel functionality.
See Concurrent Collections in the .NET Framework 4:
Before .NET 4, you had to provide your own synchronization mechanisms if multiple threads might be accessing a single shared collection. You had to lock the collection ...
... the [new] classes and interfaces in System.Collections.Concurrent [added in .NET 4] provide a consistent implementation for [...] multi-threaded programming problems involving shared data across threads.
This could be expressed concisely using PLINQ's AsParallel and SelectMany:
public static List<SearchResult> Search(string title)
{
    return Providers.AsParallel()
        .SelectMany(p => p.SearchTitle(title))
        .ToList();
}
}

What is the correct PLINQ syntax to convert this foreach loop to parallel execution?

Update 2011-05-20 12:49AM: The foreach is still 25% faster than the parallel solution for my application. And don't use the collection count for max parallelism; use something closer to the number of cores on your machine.
I have an IO bound task that I would like to run in parallel. I want to apply the same operation to every file in a folder. Internally, the operation results in a Dispatcher.Invoke that adds the computed file info to a collection on the UI thread. So, in a sense, the work result is a side effect of the method call, not a value returned directly from the method call.
This is the core loop that I want to run in parallel
foreach (ShellObject sf in sfcoll)
    ProcessShellObject(sf, curExeName);
The context for this loop is here:
var curExeName = Path.GetFileName(Assembly.GetEntryAssembly().Location);
using (ShellFileSystemFolder sfcoll = ShellFileSystemFolder.FromFolderPath(_rootPath))
{
    // This works, but is not parallel.
    foreach (ShellObject sf in sfcoll)
        ProcessShellObject(sf, curExeName);

    // This doesn't work.
    // My attempt at PLINQ. This code never calls method ProcessShellObject.
    var query = from sf in sfcoll.AsParallel().WithDegreeOfParallelism(sfcoll.Count())
                let p = ProcessShellObject(sf, curExeName)
                select p;
}
private String ProcessShellObject(ShellObject sf, string curExeName)
{
    String unusedReturnValueName = sf.ParsingName;
    try
    {
        DesktopItem di = new DesktopItem(sf);
        // Update DesktopItem stuff
        di.PropertyChanged += new PropertyChangedEventHandler(DesktopItem_PropertyChanged);
        ControlWindowHelper.MainWindow.Dispatcher.Invoke(
            (Action)(() => _desktopItemCollection.Add(di)));
    }
    catch (Exception ex)
    {
    }
    return unusedReturnValueName;
}
Thanks for any help!
+tom
EDIT: Regarding the update to your question. I hadn't spotted that the task was IO-bound - and presumably all the files are from a single (traditional?) disk. Yes, that would go slower - because you're introducing contention in a non-parallelizable resource, forcing the disk to seek all over the place.
IO-bound tasks can still be parallelized effectively sometimes - but it depends on whether the resource itself is parallelizable. For example, an SSD (which has much smaller seek times) may completely change the characteristics you're seeing - or if you're fetching over the network from several individually-slow servers, you could be IO-bound but not on a single channel.
You've created a query, but never used it. The simplest way of forcing everything to be used with the query would be to use Count() or ToList(), or something similar. However, a better approach would be to use Parallel.ForEach:
var options = new ParallelOptions { MaxDegreeOfParallelism = sfcoll.Count() };
Parallel.ForEach(sfcoll, options, sf => ProcessShellObject(sf, curExeName));
I'm not sure that setting the max degree of parallelism like that is the right approach though. It may work, but I'm not sure. A different way of approaching this would be to start all the operations as tasks, specifying TaskCreationOptions.LongRunning.
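Echoing the update at the top of the question (use something closer to the number of cores than the collection count), a minimal sketch of the same loop capped at Environment.ProcessorCount; the empty body stands in for ProcessShellObject:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Demo
{
    static void Main()
    {
        var items = Enumerable.Range(0, 100).ToArray();

        // Cap parallelism near the core count rather than the item count;
        // hundreds of concurrent workers on one spinning disk mostly cause seeking.
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.ForEach(items, options, item =>
        {
            // Stand-in for ProcessShellObject(sf, curExeName)
        });

        Console.WriteLine("done");
    }
}
```

For a truly IO-bound workload on a single disk, even this may not beat the plain foreach, as the questioner's benchmark suggests.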
Your query object created via LINQ is an IEnumerable. It gets evaluated only if you enumerate it (e.g. via a foreach loop):
var query = from sf in sfcoll.AsParallel().WithDegreeOfParallelism(sfcoll.Count())
            let p = ProcessShellObject(sf, curExeName)
            select p;

foreach (var q in query)
{
    // ....
}
// or:
var results = query.ToArray(); // also enumerates query
You should add a line at the end:
var results = query.ToList();
