Flatten nested loops to one list with LINQ - c#

I'm replacing an old parallelisation helper class of mine with the TPL classes now. My old code has proven very unreliable when errors occur in the action code and it doesn't seem to be built for what I'm doing now.
The first list of jobs was easily translated to Parallel.ForEach. But here comes a nested and indexed loop that I can't resolve so easily.
int streamIndex = 0;
foreach (var playlist in selectedPlaylists)
{
var localPlaylist = playlist;
foreach (var streamFile in playlist.StreamFiles)
{
var localStreamFile = streamFile;
var localStreamIndex = streamIndex++;
// Action that uses localPlaylist, localStreamFile and localStreamIndex
...
// Save each job's result to its assigned place in the list
lock (streamsList)
{
streamsList[localStreamIndex] = ...;
}
}
}
The local variables are for proper closure support as the foreach iteration variable was shared.
I'm thinking of something like
selectedPlaylists.SelectMany(p => p.StreamFiles)
but then I'm losing the association of where each streamFile came from, and the index which should be deterministic as it's used for ordering the results in the results list. Is there a way to keep these associations with Linq and also add that counter while enumerating the list? Maybe like this (made-up pseudocode):
selectedPlaylists
.SelectMany(p => new
{
Playlist = p,
StreamFile = ~~each one of p.StreamFiles~~,
Index = ~~Counter()~~
})
I could keep those old nested foreach loops and collect all jobs in a list, then use Parallel.Invoke, but that seems more complex than it needs to be. I'd like to know if there's a simple Linq feature I don't know yet.

Well you could do something like this...
// Results keyed by stream index
Dictionary<int, object> streamsList = new Dictionary<int, object>();
// First create a composition that holds the playlist and the stream file
selectedPlaylists.SelectMany(playList => playList.StreamFiles.Select(streamFile => new { PlayList = playList, StreamFile = streamFile }))
// Then add the respective index to each of these
.Select((composition, i) => new { StreamFile = composition.StreamFile, PlayList = composition.PlayList, LocalStreamIndex = i })
.AsParallel()
.WithCancellation(yourTokenGoesHere)
.WithDegreeOfParallelism(theDegreeGoesHere)
.ForAll(indexedComposition =>
{
object result = somefunc(indexedComposition.LocalStreamIndex, indexedComposition.PlayList, indexedComposition.StreamFile);
// Don't call the function inside the lock, or AsParallel is useless.
lock (streamsList)
{
streamsList[indexedComposition.LocalStreamIndex] = result;
}
});

To flatten the StreamFiles, keep the association with the PlayList, and index them, you can use this query:
int index = 0;
var query = selectedPlaylists
.SelectMany(p => p.StreamFiles
.Select(s =>
new {
PlayList = p,
Index = index++,
StreamFile = s
}));
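One caveat with a captured counter such as index++: the query is lazy, so enumerating it a second time keeps counting from where the first pass stopped, and incrementing it from several threads is not safe. A minimal sketch of an alternative, assuming made-up playlist data (the tuple shapes and file names are illustrative, not from the question) and C# 9 top-level statements, is to let the indexed Select overload supply the number:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in data: each playlist is a name plus its stream files.
var selectedPlaylists = new List<(string Name, string[] StreamFiles)>
{
    ("A", new[] { "a1.m2ts", "a2.m2ts" }),
    ("B", new[] { "b1.m2ts" }),
};

// Pair every stream file with its playlist, then number the flattened sequence.
// The index comes from the Select overload, so no captured counter is mutated.
var jobs = selectedPlaylists
    .SelectMany(p => p.StreamFiles.Select(s => (Playlist: p, StreamFile: s)))
    .Select((x, i) => (x.Playlist, x.StreamFile, Index: i))
    .ToList();

foreach (var job in jobs)
{
    Console.WriteLine($"{job.Index}: {job.Playlist.Name} / {job.StreamFile}");
}
```

Because ToList materializes the sequence before any parallel work starts, each job keeps the same index no matter how it is later scheduled.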

Related

How is this parallel for not processing all elements?

I've created this normal for loop:
public static Dictionary<string,Dictionary<string,bool>> AnalyzeFiles(IEnumerable<string> files, IEnumerable<string> dependencies)
{
Dictionary<string, Dictionary<string, bool>> filesAnalyzed = new Dictionary<string, Dictionary<string, bool>>();
foreach (var item in files)
{
filesAnalyzed[item] = AnalyzeFile(item, dependencies);
}
return filesAnalyzed;
}
The loop just checks if each file that is in the variable "files" has all the dependencies specified in the variable "dependencies".
The "files" variable should contain only unique elements because each one is used as a key in the resulting dictionary, but I check this before calling the method.
The for loop works correctly and all elements are processed on a single thread, so I wanted to improve performance by switching to a parallel for loop. The problem is that not all the elements in "files" are processed by the parallel for (in my test case I get 30 elements instead of 53).
I've tried increasing the timespan, and also removing all the "Monitor.TryEnter" code and using just a lock(filesAnalyzed), but I still got the same result.
I'm not very familiar with the parallel for, so it might be something in the syntax that I'm using.
public static Dictionary<string,Dictionary<string,bool>> AnalyzeFiles(IEnumerable<string> files, IEnumerable<string> dependencies)
{
var filesAnalyzed = new Dictionary<string, Dictionary<string, bool>>();
Parallel.For<KeyValuePair<string, Dictionary<string, bool>>>(
//start index
0,
//end index
files.Count(),
// initialization?
()=>new KeyValuePair<string, Dictionary<string, bool>>(),
(index, loop, result) =>
{
var temp = new KeyValuePair<string, Dictionary<string, bool>>(
files.ElementAt(index),
AnalyzeFile(files.ElementAt(index), dependencies));
return temp;
}
,
//finally
(x) =>
{
if (Monitor.TryEnter(filesAnalyzed, new TimeSpan(0, 0, 30)))
{
try
{
filesAnalyzed.Add(x.Key, x.Value);
}
finally
{
Monitor.Exit(filesAnalyzed);
}
}
}
);
return filesAnalyzed;
}
Any feedback is appreciated.
Assuming the code inside AnalyzeFile and dependencies is thread safe, how about something like this:
var filesAnalyzed = files
.AsParallel()
.Select(x => new{Item = x, File = AnalyzeFile(x, dependencies)})
.ToDictionary(x => x.Item, x=> x.File);
Rewrite your normal loop this way:
Parallel.ForEach(files, item =>
{
filesAnalyzed[item] = AnalyzeFile(item, dependencies);
});
You should also use ConcurrentDictionary instead of Dictionary to make the whole process thread-safe.
You can simplify your code a lot if you use Parallel LINQ instead :
public static Dictionary<string,Dictionary<string,bool>> AnalyzeFiles(IEnumerable<string> files, IEnumerable<string> dependencies)
{
var filesAnalyzed = ( from item in files.AsParallel()
let result=AnalyzeFile(item, dependencies)
select (Item:item,Result:result)
).ToDictionary( it=>it.Item,it=>it.Result);
return filesAnalyzed;
}
I used tuple syntax in this case to avoid noise. It also cuts down on allocations.
Using method syntax, the same can be written as :
var filesAnalyzed = files.AsParallel()
.Select(item => (Item: item, Result: AnalyzeFile(item, dependencies)))
.ToDictionary(it => it.Item, it => it.Result);
Dictionary<> isn't thread-safe for modification. If you wanted to use Parallel.ForEach without locking, you'd have to use ConcurrentDictionary:
var filesAnalyzed = new ConcurrentDictionary<string, Dictionary<string, bool>>();
Parallel.ForEach(files, item => {
filesAnalyzed[item] = AnalyzeFile(item, dependencies);
});
In this case at least, there is no benefit in using Parallel over PLINQ.
It's hard to say exactly what is going wrong without debugging the code. Just looking at it, though, I would have used a ConcurrentDictionary for the filesAnalyzed variable instead of a normal Dictionary and got rid of the Monitor.
I would also check whether the same key already exists in the filesAnalyzed dictionary; it could be that you are trying to add a key-value pair with a key that has already been added.
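One likely explanation for the missing entries, offered as a sketch rather than a confirmed diagnosis: in Parallel.For<TLocal>, the final delegate (localFinally) runs once per task, not once per iteration. If each iteration returns a fresh KeyValuePair as the local state, only the last pair each task produced reaches the dictionary. The example below keeps a per-task list instead and merges it at the end. The AnalyzeFile body and the generated file names are hypothetical stand-ins, and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for AnalyzeFile: reports which dependencies a file name contains.
static Dictionary<string, bool> AnalyzeFile(string file, IEnumerable<string> dependencies) =>
    dependencies.ToDictionary(d => d, d => file.Contains(d));

var files = Enumerable.Range(0, 53).Select(i => $"file{i}.txt").ToArray();
var dependencies = new[] { "1", "2" };
var filesAnalyzed = new Dictionary<string, Dictionary<string, bool>>();

Parallel.For(
    0, files.Length,
    // localInit: one private list per task
    () => new List<KeyValuePair<string, Dictionary<string, bool>>>(),
    // body: append to the task's own list, so no locking is needed here
    (index, loopState, local) =>
    {
        local.Add(new KeyValuePair<string, Dictionary<string, bool>>(
            files[index], AnalyzeFile(files[index], dependencies)));
        return local;
    },
    // localFinally: runs once per task, so merge everything that task collected
    local =>
    {
        lock (filesAnalyzed)
        {
            foreach (var pair in local)
            {
                filesAnalyzed.Add(pair.Key, pair.Value);
            }
        }
    });

Console.WriteLine(filesAnalyzed.Count); // 53
```

Materializing files as an array also avoids calling ElementAt on an IEnumerable, which may re-enumerate the source on every call.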

Using Parallel.ForEach to speed up processing of a file, but can't return results in the correct order

I'm trying to use a Parallel.ForEach loop to speed up my processing of a file, but I can't figure out how to make it build the output in order. This is the code I have so far:
string[] lines = File.ReadAllLines(fileName);
List<string> list_lines = new List<string>(lines);
Parallel.ForEach(list_lines, async line =>
{
processedData += await processSingleLine(line);
});
As you can see, it has no ordering logic. I have looked for something that fits, but I haven't found anything that I could get even close to working.
Preferably, each line would be processed in parallel while processedData is built up in the same order the lines were sent out. I realize this might be beyond my current skill level, so any advice would be welcome.
EDIT:
After reading the answers below, I tried two methods:
ConcurrentDictionary<int, string> result = new ConcurrentDictionary<int, string>();
Parallel.For(0, lines.Length, i =>
{
// process your data and save to dict
result[i] = processData(lines[i]);
});
and
ConcurrentDictionary<int, string> result = new ConcurrentDictionary<int, string>();
for (var i = 0; i < lines.Length; i++)
{
result[i] = lines[i];
}
Array.Clear(lines,0, lines.Length);
Parallel.ForEach(result, line =>
{
result[line.Key] = encrypt(line.Value, key);
});
Yet both appear to use only about one core (on a 4-core processor), around 30% of the total in Task Manager, whereas before I implemented the ordering it was using close to 80% of the CPU.
You can try using Parallel.For instead of Parallel.ForEach. Then you will have indexes for your lines. I.e.:
string[] lines = File.ReadAllLines(fileName);
// use thread safe collection for catching the results in parallel
ConcurrentDictionary<int, Data> result = new ConcurrentDictionary<int, Data>();
Parallel.For(0, lines.Length, i =>
{
// process your data and save to dict
result[i] = processData(lines[i]);
});
// having data in dict you can easily retrieve initial order
Data[] orderedData = new Data[lines.Length];
for(var i=0; i<lines.Length; i++)
{
orderedData[i] = result[i];
}
EDIT: As the comments under your question point out, you can't use async methods here. Parallel.ForEach takes an Action, so an async lambda becomes an async void method: the loop starts the work and returns without waiting for it to finish, and you get no results back. If you want to parallelize asynchronous code, you can use multiple Task.Run calls, like here:
string[] lines = File.ReadAllLines(fileName);
var tasks = lines.Select(
l => Task.Run<Data>(
async () => {
return await processAsync(l);
})).ToList();
var results = await Task.WhenAll(tasks);
NOTE: Should work, but didn't check it.
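On the ordering question, Task.WhenAll returns its results in the order of the tasks passed in, not in the order they complete. A small sketch that shows this, assuming a hypothetical ProcessAsync that deliberately makes earlier lines finish later (C# 9 top-level statements):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical async stand-in for the per-line work.
static async Task<string> ProcessAsync(string line, int delayMs)
{
    await Task.Delay(delayMs);
    return line.ToUpperInvariant();
}

string[] lines = { "one", "two", "three" };

// Earlier lines get longer delays, so they complete last.
var tasks = lines
    .Select((line, i) => Task.Run(() => ProcessAsync(line, (lines.Length - i) * 20)))
    .ToList();

// The result array follows the order of the task list, not completion order.
string[] results = await Task.WhenAll(tasks);
Console.WriteLine(string.Join(",", results)); // ONE,TWO,THREE
```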
I believe AsOrdered() does what you want. Note that it is a PLINQ method, called after AsParallel(), rather than a member of Parallel.ForEach.
Taking the data structure list_lines and the method processSingleLine from your code, the following should preserve the order and have parallel execution:
var parallelQuery = from line in list_lines.AsParallel().AsOrdered()
select processSingleLine(line);
foreach (var processedLine in parallelQuery)
{
Console.Write(processedLine);
}
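The same idea in method syntax, as a self-contained sketch: ProcessSingleLine here is a hypothetical synchronous stand-in (the question's version is awaited, so it would need a synchronous counterpart), and the sample lines are invented. It assumes C# 9 top-level statements.

```csharp
using System;
using System.Linq;

// Hypothetical synchronous stand-in for the per-line work.
static string ProcessSingleLine(string line) => line.ToUpperInvariant();

string[] lines = { "first", "second", "third", "fourth" };

// AsOrdered makes PLINQ keep source order in the results,
// even though the lines may be processed on several threads.
string[] processed = lines
    .AsParallel()
    .AsOrdered()
    .Select(line => ProcessSingleLine(line))
    .ToArray();

// Concatenate once at the end instead of using += from multiple threads.
string processedData = string.Concat(processed);
Console.WriteLine(processedData); // FIRSTSECONDTHIRDFOURTH
```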

Thread-safe changes to a ConcurrentDictionary

I am populating a ConcurrentDictionary in a Parallel.ForEach loop:
var result = new ConcurrentDictionary<int, ItemCollection>();
Parallel.ForEach(allRoutes, route =>
{
// Some heavy operations
lock(result)
{
if (!result.ContainsKey(someKey))
{
result[someKey] = new ItemCollection();
}
result[someKey].Add(newItem);
}
});
How do I perform the last steps in a thread-safe manner without using the lock statement?
EDIT: Assume that ItemCollection is thread-safe.
I think you want GetOrAdd, which is explicitly designed to either fetch an existing item, or add a new one if there's no entry for the given key.
var collection = result.GetOrAdd(someKey, _ => new ItemCollection());
collection.Add(newItem);
As noted in the question comments, this assumes that ItemCollection is thread-safe.
You need to use the GetOrAdd method.
var result = new ConcurrentDictionary<int, ItemCollection>();
int someKey = ...;
var newItem = ...;
ItemCollection collection = result.GetOrAdd(someKey, _ => new ItemCollection());
collection.Add(newItem);
Assuming ItemCollection.Add is not thread-safe, you will need a lock, but you can reduce the size of the critical region.
var collection = result.GetOrAdd(someKey, k => new ItemCollection());
lock(collection)
collection.Add(...);
Update: Since it seems to be thread-safe, you don't need the lock at all
var collection = result.GetOrAdd(someKey, k => new ItemCollection());
collection.Add(...);
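A runnable sketch of the GetOrAdd pattern, using ConcurrentBag<int> as a stand-in for a thread-safe ItemCollection (that substitution, the key choice, and the input range are assumptions for illustration; C# 9 top-level statements):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var result = new ConcurrentDictionary<int, ConcurrentBag<int>>();

// Group the numbers 0..999 by their last digit, in parallel.
Parallel.ForEach(Enumerable.Range(0, 1000), n =>
{
    int someKey = n % 10;
    // GetOrAdd returns the one collection stored under the key. The factory may run
    // more than once under contention, but only one instance ends up in the dictionary.
    var collection = result.GetOrAdd(someKey, _ => new ConcurrentBag<int>());
    collection.Add(n);
});

Console.WriteLine(result.Count);                    // 10
Console.WriteLine(result.Values.Sum(b => b.Count)); // 1000
```

Since the value factory can be invoked more than once when threads race on the same key, it should be cheap and free of side effects.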

Is there a more efficient way of creating a list based on an existing list and a lookup list?

I have the following method that takes an extremely long time to run and would love some help to make it run faster and or be more efficient.
The main responsibility of the method is to take a list of data points created from a CSV file, map the Name property of each file datapoint to the HistorianTagname property in a list of tagnames (matching on the DataLoggerTagname property), and create a resulting list from the mapping. If the mapping does not exist, the file datapoint is ignored.
I know that was long-winded, but I hope it makes sense. It may be easier just to look at the method:
private IEnumerable<DataPoint> GetHistorianDatapoints(IEnumerable<DataPoint> fileDatapoints, IEnumerable<Tagname> historianTagnames)
{
/**
** REFACTOR THIS
**/
foreach (var fileDatapoint in fileDatapoints)
{
var historianTagname = historianTagnames.FirstOrDefault(x => x.DataLoggerTagname.Equals(fileDatapoint.Name, StringComparison.OrdinalIgnoreCase));
if (historianTagname != null)
{
var historianDatapoint = new DataPoint();
historianDatapoint.Name = historianTagname.HistorianTagname;
historianDatapoint.Date = fileDatapoint.Date;
historianDatapoint.Value = fileDatapoint.Value;
yield return historianDatapoint;
}
}
}
Notes:
I have complete control of the classes and mapping methods, so if I am doing something fundamentally wrong, I would love to know!
Thanks!
I would start by fixing up:
var historianTagname = historianTagnames.FirstOrDefault(x => x.DataLoggerTagname.Equals(fileDatapoint.Name, StringComparison.OrdinalIgnoreCase))
That's a pretty expensive operation to run every iteration through this loop.
Below is my proposition:
private IEnumerable<DataPoint> GetHistorianDatapoints(IEnumerable<DataPoint> fileDatapoints, IEnumerable<Tagname> historianTagnames)
{
var tagNameDictionary = historianTagnames.ToDictionary(t => t.DataLoggerTagname, StringComparer.OrdinalIgnoreCase);
foreach (var fileDatapoint in fileDatapoints)
{
if (tagNameDictionary.ContainsKey(fileDatapoint.Name))
{
var historianTagname = tagNameDictionary[fileDatapoint.Name];
var historianDatapoint = new DataPoint();
historianDatapoint.Name = historianTagname.HistorianTagname;
historianDatapoint.Date = fileDatapoint.Date;
historianDatapoint.Value = fileDatapoint.Value;
yield return historianDatapoint;
}
}
}
As @Sheldon Warkentin said, FirstOrDefault is probably the bottleneck of your function. It is better to make historianTagnames a Dictionary where the name is the key; then in your function you can get the value by key.
Something like below:
// this is passed to method
IDictionary<string, Tagname> historianTagnames;
// .. method body
var historianTagname = historianTagnames[fileDatapoint.Name];
Of course, you need to add the proper checks for missing keys.
As others have said, a Dictionary<string, Tagname> might perform better.
var historianDict = new Dictionary<string, Tagname>();
foreach (var tagName in historianTagnames) {
historianDict[tagName.DataLoggerTagname.ToLowerInvariant()] = tagName;
}
foreach (var fileDatapoint in fileDatapoints) {
if (historianDict.ContainsKey(fileDatapoint.Name.ToLowerInvariant())) {
// ...
}
}
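The dictionary approaches above can be combined with TryGetValue, which tests for the key and fetches the value in a single lookup instead of ContainsKey followed by the indexer. A minimal sketch, with tuples standing in for the DataPoint and Tagname classes and invented sample values (C# 9 top-level statements):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data in place of the real Tagname and DataPoint objects.
var historianTagnames = new[]
{
    (DataLoggerTagname: "TEMP_1", HistorianTagname: "Plant.Temperature1"),
    (DataLoggerTagname: "Flow_2", HistorianTagname: "Plant.Flow2"),
};
var fileDatapoints = new[]
{
    (Name: "temp_1", Value: 21.5),
    (Name: "unknown", Value: 0.0),
    (Name: "FLOW_2", Value: 3.2),
};

// Build the lookup once; the comparer makes the keys case-insensitive.
var lookup = historianTagnames.ToDictionary(
    t => t.DataLoggerTagname, t => t.HistorianTagname, StringComparer.OrdinalIgnoreCase);

var mapped = new List<(string Name, double Value)>();
foreach (var point in fileDatapoints)
{
    // Unmapped datapoints are skipped, as in the original method.
    if (lookup.TryGetValue(point.Name, out var historianName))
    {
        mapped.Add((historianName, point.Value));
    }
}

foreach (var m in mapped)
{
    Console.WriteLine($"{m.Name} = {m.Value}");
}
```

Note that ToDictionary throws if two tagnames share the same key under the chosen comparer, so duplicates would need to be resolved beforehand.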

Create sequence consisting of multiple property values

I have an existing collection of objects with two properties of interest. Both properties are of the same type. I want to create a new sequence consisting of the property values. Here's one way (I'm using tuples instead of my custom type for simplicity):
var list = new List<Tuple<string, string>>
{ Tuple.Create("dog", "cat"), Tuple.Create("fish", "frog") };
var result =
list.SelectMany(x => new[] {x.Item1, x.Item2});
foreach (string item in result)
{
Console.WriteLine(item);
}
Results in:
dog
cat
fish
frog
This gives me the results I want, but is there a better way to accomplish this (in particular, without the need to create arrays or collections)?
Edit:
This also works, at the cost of iterating over the collection twice:
var result = list.Select(x => x.Item1).Concat(list.Select(x => x.Item2));
If you want to avoid creating another collection, you could yield the results instead.
void Main()
{
var list = new List<Tuple<string, string>>
{ Tuple.Create("dog", "cat"), Tuple.Create("fish", "frog") };
foreach (var element in GetSingleList(list))
{
Console.WriteLine (element);
}
}
// A reusable extension method would be a better approach.
IEnumerable<T> GetSingleList<T>(IEnumerable<Tuple<T,T>> list) {
foreach (var element in list)
{
yield return element.Item1;
yield return element.Item2;
}
}
I think your approach is fine and I would stick with that. The use of the array nicely gets the job done when using SelectMany, and the final result is an IEnumerable<string>.
There are some alternate approaches, but I think they're more verbose than your approach.
Aggregate approach:
var result = list.Aggregate(new List<string>(), (seed, t) =>
{
seed.Add(t.Item1);
seed.Add(t.Item2);
return seed;
});
result.ForEach(Console.WriteLine);
ForEach approach:
var result = new List<string>();
list.ForEach(t => { result.Add(t.Item1); result.Add(t.Item2); });
result.ForEach(Console.WriteLine);
In both cases a new List<string> is created.
