Why is this Parallel.For not processing all elements? - C#

I've created this normal foreach loop:
public static Dictionary<string, Dictionary<string, bool>> AnalyzeFiles(IEnumerable<string> files, IEnumerable<string> dependencies)
{
    Dictionary<string, Dictionary<string, bool>> filesAnalyzed = new Dictionary<string, Dictionary<string, bool>>();
    foreach (var item in files)
    {
        filesAnalyzed[item] = AnalyzeFile(item, dependencies);
    }
    return filesAnalyzed;
}
The loop just checks whether each file in "files" has all the dependencies specified in "dependencies".
The "files" variable should only contain unique elements, because it is used as the key of the resulting dictionary, but I check this before calling the method.
The loop works correctly and processes all elements on a single thread, so I wanted to improve performance by switching to a parallel loop. The problem is that not all the elements from "files" are processed by the parallel version (in my test case I get 30 elements instead of 53).
I've tried increasing the timeout, and removing all the Monitor.TryEnter code in favor of a plain lock(filesAnalyzed), but I still get the same result.
I'm not very familiar with Parallel.For, so it might be something in the syntax I'm using.
public static Dictionary<string, Dictionary<string, bool>> AnalyzeFiles(IEnumerable<string> files, IEnumerable<string> dependencies)
{
    var filesAnalyzed = new Dictionary<string, Dictionary<string, bool>>();
    Parallel.For<KeyValuePair<string, Dictionary<string, bool>>>(
        // start index
        0,
        // end index
        files.Count(),
        // initialization?
        () => new KeyValuePair<string, Dictionary<string, bool>>(),
        (index, loop, result) =>
        {
            var temp = new KeyValuePair<string, Dictionary<string, bool>>(
                files.ElementAt(index),
                AnalyzeFile(files.ElementAt(index), dependencies));
            return temp;
        },
        // finally
        (x) =>
        {
            if (Monitor.TryEnter(filesAnalyzed, new TimeSpan(0, 0, 30)))
            {
                try
                {
                    filesAnalyzed.Add(x.Key, x.Value);
                }
                finally
                {
                    Monitor.Exit(filesAnalyzed);
                }
            }
        }
    );
    return filesAnalyzed;
}
Any feedback is appreciated.

Assuming the code inside AnalyzeFile and dependencies is thread-safe, how about something like this:
var filesAnalyzed = files
    .AsParallel()
    .Select(x => new { Item = x, File = AnalyzeFile(x, dependencies) })
    .ToDictionary(x => x.Item, x => x.File);
As for why your version drops elements: the last delegate you pass to Parallel.For is a localFinally, which runs once per worker task rather than once per iteration, so only each task's final KeyValuePair ever reaches filesAnalyzed.

Rewrite your normal loop this way:
Parallel.ForEach(files, item =>
{
    filesAnalyzed[item] = AnalyzeFile(item, dependencies);
});
You should also use a ConcurrentDictionary instead of a Dictionary to make the whole process thread-safe.
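A minimal end-to-end sketch of that suggestion (the AnalyzeFile body here is a stand-in for the real method from the question):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Analyzer
{
    // Stand-in for the real AnalyzeFile from the question.
    static Dictionary<string, bool> AnalyzeFile(string file, IEnumerable<string> deps)
    {
        var result = new Dictionary<string, bool>();
        foreach (var dep in deps) result[dep] = true;
        return result;
    }

    public static Dictionary<string, Dictionary<string, bool>> AnalyzeFiles(
        IEnumerable<string> files, IEnumerable<string> dependencies)
    {
        // ConcurrentDictionary's indexer is safe for concurrent writes,
        // so no Monitor/lock is needed in the loop body.
        var filesAnalyzed = new ConcurrentDictionary<string, Dictionary<string, bool>>();
        Parallel.ForEach(files, item =>
        {
            filesAnalyzed[item] = AnalyzeFile(item, dependencies);
        });
        return new Dictionary<string, Dictionary<string, bool>>(filesAnalyzed);
    }
}
```

Every element of files reaches the result because each iteration writes directly into the shared dictionary, instead of funneling results through a per-task localFinally.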

You can simplify your code a lot if you use Parallel LINQ instead:
public static Dictionary<string, Dictionary<string, bool>> AnalyzeFiles(IEnumerable<string> files, IEnumerable<string> dependencies)
{
    var filesAnalyzed = (from item in files.AsParallel()
                         let result = AnalyzeFile(item, dependencies)
                         select (Item: item, Result: result)
                        ).ToDictionary(it => it.Item, it => it.Result);
    return filesAnalyzed;
}
I used tuple syntax in this case to avoid noise. It also cuts down on allocations.
Using method syntax, the same can be written as:
var filesAnalyzed = files.AsParallel()
    .Select(item => (Item: item, Result: AnalyzeFile(item, dependencies)))
    .ToDictionary(it => it.Item, it => it.Result);
Dictionary<> isn't thread-safe for modification. If you wanted to use Parallel.ForEach without locking, you'd have to use a ConcurrentDictionary:
var filesAnalyzed = new ConcurrentDictionary<string, Dictionary<string, bool>>();
Parallel.ForEach(files, file =>
{
    filesAnalyzed[file] = AnalyzeFile(file, dependencies);
});
In this case at least, there is no benefit in using Parallel over PLINQ.

Hard to say what exactly is going wrong without debugging the code. Just looking at it, though, I would have used a ConcurrentDictionary for the filesAnalyzed variable instead of a normal Dictionary, and gotten rid of the Monitor.
I would also check whether the same key already exists in filesAnalyzed; it could be that you are trying to add a KeyValuePair with a key that has already been added to the dictionary.
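A quick sketch of that duplicate-key check using ConcurrentDictionary, whose TryAdd reports a duplicate instead of throwing (the file name here is made up):

```csharp
using System;
using System.Collections.Concurrent;

class Demo
{
    static void Main()
    {
        var filesAnalyzed = new ConcurrentDictionary<string, bool>();
        // TryAdd returns false instead of throwing when the key is already
        // present, so a duplicate in "files" shows up as a skipped add,
        // not an ArgumentException.
        Console.WriteLine(filesAnalyzed.TryAdd("a.cs", true));  // True
        Console.WriteLine(filesAnalyzed.TryAdd("a.cs", true));  // False
    }
}
```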

Related

Add items to ConcurrentBag by multithreading

I'm trying to add multiple values to a ConcurrentBag, but no values actually get inside. At first I tried using List, but that apparently isn't thread-safe, so I searched around and people suggest using ConcurrentBag. I tried using Thread.Sleep(100) with List and that worked, but it's slower. How can I properly add values? The debugger always shows "Count: 0". Here's my code:
foreach (KeyValuePair<string, string> entry in test_Words)
{
    Form1.fr.progressBar1.Value++;
    new Thread(delegate ()
    {
        switch (test_Type)
        {
            case "Definitions":
                bagOfExercises.Add(Read(Definitions.get(entry.Value, entry.Key)));
                break;
            case "Examples":
                bagOfExercises.Add(Read(Examples.get(entry.Value, entry.Key)).Replace(entry.Key, new string('_', entry.Key.Length)));
                break;
        }
    }).Start();
}
Example for PLinq:
Func<KeyValuePair<string, string>, string> get;
if (test_Type == "Definitions")
{
    get = kvp => Read(Definitions.get(kvp.Value, kvp.Key));
}
else
{
    get = kvp => Read(Examples.get(kvp.Value, kvp.Key)).Replace(kvp.Key, new string('_', kvp.Key.Length));
}
var results = test_Words.AsParallel()
    .WithDegreeOfParallelism(test_Words.Count())
    .Select(get)
    .ToList();
This tries to use one thread per entry. Normally, PLinq will decide what the best use of resources is, but in this case we know something PLinq cannot know: we wait a lot on external resources, so this can be done massively in parallel.

Adding more than one value to a single key in a hashtable in c#?

I'm trying to make a program that reads strings containing a word and its meaning, for example:
Book: Cover with Papers in between
Book: Reserve
Whenever I try my code I get an error because each key has to be unique. Is there a way to work around this?
Hashtable ht = new Hashtable();
var fileStream = new FileStream(@"e:\test.txt", FileMode.Open, FileAccess.Read);
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        ht.Add(line.Split(':')[0], line.Split(':')[1]);
    }
}
if (ht.ContainsKey("Book"))
{
    listBox1.Items.Add(ht["Book"].ToString());
}
In the general case, you could use a List<string> for the value, and just Add to it. However, you can probably simplify with LINQ via ToLookup:
var groups = File.ReadLines(path)
    .Select(line => line.Split(':'))
    .ToLookup(x => x[0], x => x[1].Trim());
Now you can access groups[key] which gives you all the values with that prefix, or you can foreach over groups to get each combination of .Key and values.
In terms of your code, this is:
var groups = File.ReadLines(@"e:\test.txt")
    .Select(line => line.Split(':'))
    .ToLookup(x => x[0], x => x[1].Trim());
foreach (var val in groups["Book"])
    listBox1.Items.Add(val);
(no need to check for existence first; it simply yields nothing if there is no match)
However! You only need to do this if you still want all the values after this code, i.e. you use groups somewhere else. If you don't, you can be more frugal and just abandon the unwanted data:
var values = File.ReadLines(@"e:\test.txt")
    .Where(line => line.StartsWith("Book:"))
    .Select(line => line.Substring(5).Trim());
foreach (var val in values)
    listBox1.Items.Add(val);
Edit: minor thing - a vexing method signature means that line.Split(':') actually allocates a new array every time, because of params; so I usually use:
static readonly char[] Colon = { ':' };
and
line.Split(Colon)
which is measurably more efficient if it is on a hot path.
Use a Dictionary where the value is a list of strings:
var myDictionary = new Dictionary<string, List<string>>();
And now you'd do the following:
if (!myDictionary.ContainsKey(key))
{
    myDictionary.Add(key, new List<string>());
}
myDictionary[key].Add(value);
Use a Dictionary<string, List<string>> instead of the Hashtable.
Depending on what you want to achieve, you may use a SortedList or a SortedDictionary that you initialize with your own IComparer, which allows duplicate keys.
Have a look at these Stack Overflow posts for details:
C# Sortable collection which allows duplicate keys
Is there an alternative to Dictionary/SortedList that allows duplicates?
The drawback of this solution is that you cannot access the elements by key anymore, only by index.
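The trick those posts describe is a comparer that never reports two keys as equal, so the sorted collection happily stores "duplicates". A minimal sketch (the class name DuplicateKeyComparer is made up here):

```csharp
using System;
using System.Collections.Generic;

// A comparer that never returns 0: equal keys are reported as "greater",
// so SortedList never detects a duplicate and never throws on Add.
class DuplicateKeyComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        int result = string.CompareOrdinal(x, y);
        return result == 0 ? 1 : result;
    }
}

class Demo
{
    static void Main()
    {
        var list = new SortedList<string, string>(new DuplicateKeyComparer());
        list.Add("Book", "Cover with Papers in between");
        list.Add("Book", "Reserve"); // no exception now
        Console.WriteLine(list.Count); // 2
    }
}
```

As the answer notes, the price is that key lookups (the indexer, ContainsKey) no longer work, since no key ever compares equal; you read the entries back through Keys/Values by index.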
Use a Dictionary instead:
Dictionary<string, List<string>> dic = new Dictionary<string, List<string>>();

Flatten nested loops to one list with LINQ

I'm replacing an old parallelisation helper class of mine with the TPL classes now. My old code has proven very unreliable when errors occur in the action code and it doesn't seem to be built for what I'm doing now.
The first list of jobs was easily translated to Parallel.ForEach. But here comes a nested and indexed loop that I can't resolve so easily.
int streamIndex = 0;
foreach (var playlist in selectedPlaylists)
{
    var localPlaylist = playlist;
    foreach (var streamFile in playlist.StreamFiles)
    {
        var localStreamFile = streamFile;
        var localStreamIndex = streamIndex++;
        // Action that uses localPlaylist, localStreamFile and localStreamIndex
        ...
        // Save each job's result to its assigned place in the list
        lock (streamsList)
        {
            streamsList[localStreamIndex] = ...;
        }
    }
}
The local variables are for proper closure support as the foreach iteration variable was shared.
I'm thinking of something like
selectedPlaylists.SelectMany(p => p.StreamFiles)
but then I'm losing the association of where each streamFile came from, and the index which should be deterministic as it's used for ordering the results in the results list. Is there a way to keep these associations with Linq and also add that counter while enumerating the list? Maybe like this (made-up pseudocode):
selectedPlaylists
.SelectMany(p => new
{
Playlist = p,
StreamFile = ~~each one of p.StreamFiles~~,
Index = ~~Counter()~~
})
I could keep those old nested foreach loops and collect all jobs in a list, then use Parallel.Invoke, but that seems more complex than it needs to be. I'd like to know if there's a simple Linq feature I don't know yet.
Well, you could do something like this...
Dictionary<int, object> streamsList = new Dictionary<int, object>();
// First create a composition that holds the playlist and the streamfile
selectedPlaylists.SelectMany(playList => playList.StreamFiles.Select(streamFile => new { PlayList = playList, StreamFile = streamFile }))
    // then, for all of these, add the respective index
    .Select((composition, i) => new { StreamFile = composition.StreamFile, PlayList = composition.PlayList, LocalStreamIndex = i })
    .AsParallel()
    .WithCancellation(yourTokenGoesHere)
    .WithDegreeOfParallelism(theDegreeGoesHere)
    .ForAll(indexedComposition =>
    {
        object result = somefunc(indexedComposition.LocalStreamIndex, indexedComposition.PlayList, indexedComposition.StreamFile);
        lock (streamsList) // don't call the function inside the lock, or the AsParallel is useless
            streamsList[indexedComposition.LocalStreamIndex] = result;
    });
To flatten the StreamFiles, keep the association with the PlayList, and index them, you can use this query:
int index = 0;
var query = selectedPlaylists
    .SelectMany(p => p.StreamFiles
        .Select(s =>
            new
            {
                PlayList = p,
                Index = index++,
                StreamFile = s
            }));

Linq Query On IDictionaryEnumerator Possible?

I need to clear items from cache that contain a specific string in the key. I have started with the following and thought I might be able to do a linq query
var enumerator = HttpContext.Current.Cache.GetEnumerator();
But I can't? I was hoping to do something like
var enumerator = HttpContext.Current.Cache.GetEnumerator().Key.Contains("subcat");
Any ideas on how I could achieve this?
The Enumerator created by the Cache generates DictionaryEntry objects. Furthermore, a Cache may have only string keys.
Thus, you can write the following:
var httpCache = HttpContext.Current.Cache;
var toRemove = httpCache.Cast<DictionaryEntry>()
    .Select(de => (string)de.Key)
    .Where(key => key.Contains("subcat"))
    .ToArray(); // use .ToArray() to avoid concurrent modification issues
foreach (var keyToRemove in toRemove)
    httpCache.Remove(keyToRemove);
However, this is a potentially slow operation when the cache is large: the cache is not designed to be used like this. You should ask yourself whether an alternative design isn't possible and preferable. Why do you need to remove several cache keys at once, and why aren't you grouping cache keys by substring?
Since Cache is an IEnumerable, you can freely apply all the LINQ methods you need to it. The only thing you need is to cast it to IEnumerable<DictionaryEntry>:
var keysQuery = HttpContext.Current.Cache
    .Cast<DictionaryEntry>()
    .Select(entry => (string)entry.Key)
    .Where(key => key.Contains("subcat"));
Now keysQuery is a lazily evaluated collection of all keys containing "subcat". But if you need to remove such entries from the cache, the simplest way is to just use a foreach statement.
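For illustration, here's that foreach-removal pattern with a plain Hashtable standing in for HttpContext.Current.Cache (both enumerate as DictionaryEntry, which is what makes the Cast<DictionaryEntry>() work; the cache keys are made up):

```csharp
using System;
using System.Collections;
using System.Linq;

class Demo
{
    static void Main()
    {
        // Hashtable stands in for the ASP.NET Cache in this sketch.
        var cache = new Hashtable
        {
            ["subcat-shoes"] = 1,
            ["subcat-hats"] = 2,
            ["homepage"] = 3
        };

        // Materialize the matching keys first (ToList) so we don't
        // modify the table while still enumerating it.
        var toRemove = cache.Cast<DictionaryEntry>()
            .Select(entry => (string)entry.Key)
            .Where(key => key.Contains("subcat"))
            .ToList();

        foreach (var key in toRemove)
            cache.Remove(key);

        Console.WriteLine(cache.Count); // 1
    }
}
```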
I don't think it is a great idea to walk the entire cache anyway, but you could do it non-LINQ with something like:
var iter = HttpContext.Current.Cache.GetEnumerator();
using (iter as IDisposable)
{
    while (iter.MoveNext())
    {
        string s;
        if ((s = iter.Key as string) != null && s.Contains("subcat"))
        {
            // ... let the magic happen
        }
    }
}
to do it with LINQ you could do something like:
public static class Utils
{
    public static IEnumerable<KeyValuePair<object, object>> ForLinq(this IDictionaryEnumerator iter)
    {
        using (iter as IDisposable)
        {
            while (iter.MoveNext()) yield return new KeyValuePair<object, object>(iter.Key, iter.Value);
        }
    }
}
and use like:
var items = HttpContext.Current.Cache.GetEnumerator().ForLinq()
    .Where(pair => ((string)pair.Key).Contains("subcat"));

Collection of strings to dictionary

Given an ordered collection of strings:
var strings = new string[] { "abc", "def", "def", "ghi", "ghi", "ghi", "klm" };
Use LINQ to create a dictionary of string to number of occurrences of that string in the collection:
IDictionary<string,int> stringToNumOccurrences = ...;
Preferably do this in a single pass over the strings collection...
var dico = strings.GroupBy(x => x).ToDictionary(x => x.Key, x => x.Count());
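Running that one-liner over the sample input gives the expected counts:

```csharp
using System;
using System.Linq;

class Demo
{
    static void Main()
    {
        var strings = new[] { "abc", "def", "def", "ghi", "ghi", "ghi", "klm" };
        // Group identical strings, then map each group's key to its size.
        var dico = strings.GroupBy(x => x).ToDictionary(x => x.Key, x => x.Count());
        Console.WriteLine(dico["abc"]); // 1
        Console.WriteLine(dico["def"]); // 2
        Console.WriteLine(dico["ghi"]); // 3
        Console.WriteLine(dico["klm"]); // 1
    }
}
```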
Timwi/Darin's suggestion will perform this in a single pass over the original collection, but it will create multiple buffers for the groupings. LINQ isn't really very good at doing this kind of counting, and a problem like this was my original motivation for writing Push LINQ. You might like to read my blog post on it for more details about why LINQ isn't terribly efficient here.
Push LINQ and the rather more impressive implementation of the same idea - Reactive Extensions - can handle this more efficiently.
Of course, if you don't really care too much about the extra efficiency, go with the GroupBy answer :)
EDIT: I hadn't noticed that your strings were ordered. That means you can be much more efficient, because you know that once you've seen string x and then string y, if x and y are different, you'll never see x again. There's nothing in LINQ to make this particularly easier, but you can do it yourself quite easily:
public static IDictionary<string, int> CountEntries(IEnumerable<string> strings)
{
    var dictionary = new Dictionary<string, int>();
    using (var iterator = strings.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            // No entries
            return dictionary;
        }
        string current = iterator.Current;
        int currentCount = 1;
        while (iterator.MoveNext())
        {
            string next = iterator.Current;
            if (next == current)
            {
                currentCount++;
            }
            else
            {
                dictionary[current] = currentCount;
                current = next;
                currentCount = 1;
            }
        }
        // Write out the trailing result
        dictionary[current] = currentCount;
    }
    return dictionary;
}
This is O(n), with no dictionary lookups involved other than when writing the values. An alternative implementation would use foreach and a current value starting off at null... but that ends up being pretty icky in a couple of other ways. (I've tried it :) When I need special-case handling for the first value, I generally go with the above pattern.
Actually you could do this with LINQ using Aggregate, but it would be pretty nasty.
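For the curious, one way that Aggregate version could look, a sketch that folds a single mutable dictionary through the sequence, which is exactly why it reads so badly:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        var strings = new[] { "abc", "def", "def", "ghi", "ghi", "ghi", "klm" };
        // Aggregate threads one mutable dictionary through every element:
        // effectively a foreach loop wearing a LINQ costume.
        var counts = strings.Aggregate(
            new Dictionary<string, int>(),
            (dict, s) =>
            {
                dict[s] = dict.TryGetValue(s, out var c) ? c + 1 : 1;
                return dict;
            });
        Console.WriteLine(counts["ghi"]); // 3
    }
}
```

It is a single pass with one dictionary, so it is not inefficient; it is just mutation dressed up as a fold.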
The standard LINQ way is this:
stringToNumOccurrences = strings.GroupBy(s => s)
    .ToDictionary(g => g.Key, g => g.Count());
If this is actual production code, I'd go with Timwi's response.
If this is indeed homework and you're expected to write your own implementation, it shouldn't be too tough. Here are just a couple of hints to point you in the right direction:
Dictionary<TKey, TValue> has a ContainsKey method.
The IDictionary<TKey, TValue> interface's this[TKey] property is settable; i.e., you can do dictionary[key] = 1 (which means you can also do dictionary[key] += 1).
From those clues I think you should be able to figure out how to do it "by hand."
If you are looking for a particularly efficient (fast) solution, then GroupBy is probably too slow for you. You could use a loop:
var strings = new string[] { "abc", "def", "def", "ghi", "ghi", "ghi", "klm" };
var stringToNumOccurrences = new Dictionary<string, int>();
foreach (var str in strings)
{
    if (stringToNumOccurrences.ContainsKey(str))
        stringToNumOccurrences[str]++;
    else
        stringToNumOccurrences[str] = 1;
}
return stringToNumOccurrences;
This is a foreach version like the one that Jon mentions that he finds "pretty icky" in his answer. I'm putting it in here, so there's something concrete to talk about.
I must admit that I find it simpler than Jon's version and can't really see what's icky about it. Jon? Anyone?
static Dictionary<string, int> CountOrderedSequence(IEnumerable<string> source)
{
    var result = new Dictionary<string, int>();
    string prev = null;
    int count = 0;
    foreach (var s in source)
    {
        if (prev != s && count > 0)
        {
            result.Add(prev, count);
            count = 0;
        }
        prev = s;
        ++count;
    }
    if (count > 0)
    {
        result.Add(prev, count);
    }
    return result;
}
Updated to add a necessary check for empty source - I still think it's simpler than Jon's :-)
