How to avoid collection modification during JSON serialization in looped multithreaded task?

I have a problem serializing to a JSON file with Newtonsoft.Json.
In a loop I am firing off tasks on various threads:
List<Task> jockeysTasks = new List<Task>();
for (int i = 1; i < 1100; i++)
{
int j = i;
Task task = Task.Run(async () =>
{
LoadedJockey jockey = new LoadedJockey();
jockey = await Task.Run(() => _scrapServices.ScrapSingleJockeyPL(j));
if (jockey.Name != null)
{
_allJockeys.Add(jockey);
}
UpdateStatusBar = j * 100 / 1100;
if (j % 100 == 0)
{
await Task.Run(() => _dataServices.SaveAllJockeys(_allJockeys)); //saves everything to JSON file
}
});
jockeysTasks.Add(task);
}
await Task.WhenAll(jockeysTasks);
When j % 100 == 0, the task tries to save the collection _allJockeys to the file (I will add a counter to make this more reliable, but that is not the point):
public void SaveAllJockeys(List<LoadedJockey> allJockeys)
{
if (allJockeys.Count != 0)
{
if (File.Exists(_jockeysFileName)) File.Delete(_jockeysFileName);
try
{
using (StreamWriter file = File.CreateText(_jockeysFileName))
{
JsonSerializer serializer = new JsonSerializer();
serializer.Serialize(file, allJockeys);
}
}
catch (Exception e)
{
dialog.ShowDialog("Could not save the results, " + e.ToString(), "Error");
}
}
}
During that time, I believe, other tasks are adding new items to the collection, and the serializer throws this exception:
Collection was modified; enumeration operation may not execute.
As I read in THE ARTICLE, you can change the type of iteration to avoid the exception. As far as I know, I cannot change how the Newtonsoft.Json package iterates the collection.
Thank you in advance for any tips on how to avoid the exception and save the collection without unexpected changes.

You should probably inherit from List and use a ReaderWriterLock (https://learn.microsoft.com/en-us/dotnet/api/system.threading.readerwriterlock?view=netframework-4.8)
i.e. (not tested pseudo C#)
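public class MyJockeys : List<LoadedJockey>
{
    private readonly System.Threading.ReaderWriterLock _rwLock = new System.Threading.ReaderWriterLock();

    // Hides List<T>.Add, so the lock is only taken when Add is called through a MyJockeys reference.
    public new void Add(LoadedJockey j)
    {
        // Acquire before the try block: if the acquire times out, there is no lock to release.
        _rwLock.AcquireWriterLock(5000); // or whatever you deem an acceptable timeout
        try
        {
            base.Add(j);
        }
        finally
        {
            _rwLock.ReleaseWriterLock();
        }
    }

    public string ToJSON()
    {
        _rwLock.AcquireReaderLock(5000); // or whatever you deem an acceptable timeout
        try
        {
            // Serialize here using Newtonsoft while no writer can modify the list
            return Newtonsoft.Json.JsonConvert.SerializeObject(this);
        }
        finally
        {
            _rwLock.ReleaseReaderLock();
        }
    }
    // And hide Remove and anything else you need in the same way
}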
Get the idea?
Hope this helps.
Regards,
Adam.

I tried calling ToList() on the collection, which creates a copy of the list, and that fixed the problem.
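Copying the list helps because the serializer then enumerates a private snapshot. A minimal sketch of that idea with the adds and the copy guarded by the same lock is below. JockeyStore, SnapshotDemo and the string items are hypothetical stand-ins for the question's types, and System.Text.Json is used only so the sketch runs without extra packages; Newtonsoft.Json would be called on the snapshot in the same place.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical wrapper: every access to the inner list goes through one lock.
public sealed class JockeyStore
{
    private readonly object _gate = new object();
    private readonly List<string> _names = new List<string>();

    public void Add(string name)
    {
        lock (_gate) { _names.Add(name); }
    }

    // Returns a private copy, so callers can enumerate it while other threads keep adding.
    public string[] Snapshot()
    {
        lock (_gate) { return _names.ToArray(); }
    }
}

public static class SnapshotDemo
{
    // Stand-in for SaveAllJockeys: serializes the copy, never the live list.
    private static string ToJson(string[] snapshot) => JsonSerializer.Serialize(snapshot);

    public static async Task<int> RunAsync()
    {
        var store = new JockeyStore();
        var tasks = Enumerable.Range(1, 1000).Select(i => Task.Run(() =>
        {
            store.Add("jockey" + i);
            if (i % 100 == 0)
            {
                _ = ToJson(store.Snapshot()); // in the real code, write this string to the file
            }
        }));
        await Task.WhenAll(tasks);
        return store.Snapshot().Length;
    }
}
```

The lock is held only for the short copy, not for the file write, so scraping tasks are not blocked while the JSON is being written.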

Related

Multiple consumers update result array inconsistently

I have a large file in which each row can be processed separately, so I launch one reader and multiple parsers.
Each parser writes its result back to a result holder array for further processing.
I found that if I launch more than one parser, the result holder array has different content on each run, no matter whether I use ConcurrentQueue, BlockingCollection or something else.
I ran the program repeatedly and printed the result array each time; with more than one parser, the output differs from run to run.
string[] result = new string[nRow];
static BlockingCollection<queueItem> myBlk = new BlockingCollection<queueItem>();
static void Main()
{
Reader();
}
static void parserThread()
{
while (myBlk.IsCompleted == false)
{
queueItem one;
if (myBlk.TryTake(out one) == false)
{
System.Threading.Thread.Sleep(tSleep);
}
else
{
oneDataRow(one.seqIndex, one.line);
}
}
}
static void oneDataRow(int rowIndex, string line)
{
result[rowIndex] = // some process with line
}
static void Reader()
{
for (int i = 0; i < 10; i++)
{
Task t = new Task(() => parserThread());
t.Start();
}
StreamReader sr = new StreamReader(path);
string line;
int nRead=0;
while((line = sr.ReadLine()) != null)
{
string innerLine = line;
int innerN = nRead;
myBlk.Add(new queueItem(innerN, innerLine));
nRead++;
}
myBlk.CompleteAdding();
sr.Close();
while (myBlk.IsCompleted == false)
{
System.Threading.Thread.Sleep(tSleep);
}
}
class queueItem
{
public int seqIndex = 0;
public string line = "";
public queueItem(int RowOrder, string content)
{
seqIndex = RowOrder;
line = content;
}
}

The way you are waiting for the process to complete is problematic:
while (myBlk.IsCompleted == false)
{
System.Threading.Thread.Sleep(tSleep);
}
Here is the description of the IsCompleted property:
Gets whether this BlockingCollection<T> has been marked as complete for adding and is empty.
In your case the completion of the BlockingCollection should not signal the completion of the whole operation, because the last lines taken from the collection may not be processed yet.
Instead you should store the worker tasks in an array (or list), and wait for them to complete.
Task.WaitAll(tasks);
In general you should rarely use the IsCompleted property for anything other than for logging debug information. Using it for controlling the execution flow introduces race conditions in most cases.
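A minimal sketch of that approach is below. ParserPipeline and the upper-casing "parse" step are hypothetical stand-ins for the question's code. It also swaps the TryTake/Sleep polling loop for GetConsumingEnumerable, which blocks until an item arrives and ends once the collection is marked complete and drained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class ParserPipeline
{
    public static string[] Run(string[] lines, int parserCount)
    {
        var result = new string[lines.Length];
        using (var queue = new BlockingCollection<(int Index, string Line)>())
        {
            // Start the parsers and keep their tasks so they can be awaited later.
            Task[] parsers = Enumerable.Range(0, parserCount)
                .Select(_ => Task.Run(() =>
                {
                    // Ends when CompleteAdding has been called and the queue is empty.
                    foreach (var item in queue.GetConsumingEnumerable())
                    {
                        // Each index is written by exactly one parser.
                        result[item.Index] = item.Line.ToUpperInvariant();
                    }
                }))
                .ToArray();

            // The reader: enqueue every row together with its position.
            for (int i = 0; i < lines.Length; i++)
            {
                queue.Add((i, lines[i]));
            }
            queue.CompleteAdding();

            // Wait for the workers, not for the queue: the last rows taken
            // from the queue may still be in the middle of being processed.
            Task.WaitAll(parsers);
        }
        return result;
    }
}
```

Calling ParserPipeline.Run(lines, 10) returns only after every row has been parsed, so the array is complete when it is read.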

Tasks combine result and continue

I have 16 tasks doing the same job, and each of them returns an array. I want to combine the results in pairs and repeat the same job until I have only one task. I don't know the best way to do this.
public static IComparatorNetwork[] Prune(IComparatorNetwork[] nets, int numTasks)
{
var tasks = new Task[numTasks];
var netsPerTask = nets.Length/numTasks;
var start = 0;
var concurrentSet = new ConcurrentBag<IComparatorNetwork>();
for(var i = 0; i < numTasks; i++)
{
IComparatorNetwork[] taskNets;
if (i == numTasks - 1)
{
taskNets = nets.Skip(start).ToArray();
}
else
{
taskNets = nets.Skip(start).Take(netsPerTask).ToArray();
}
start += netsPerTask;
tasks[i] = Task.Factory.StartNew(() =>
{
var pruner = new Pruner();
concurrentSet.AddRange(pruner.Prune(taskNets));
});
}
Task.WaitAll(tasks.ToArray());
if(numTasks > 1)
{
return Prune(concurrentSet.ToArray(), numTasks/2);
}
return concurrentSet.ToArray();
}
Right now I wait for all tasks to complete, then repeat with half as many tasks until only one is left. I would like to avoid waiting for all of them on each iteration. I am very new to parallel programming, so the approach is probably bad.
The code I am trying to parallelize is the following:
public IComparatorNetwork[] Prune(IComparatorNetwork[] nets)
{
var result = new List<IComparatorNetwork>();
for (var i = 0; i < nets.Length; i++)
{
var isSubsumed = false;
for (var index = result.Count - 1; index >= 0; index--)
{
var n = result[index];
if (nets[i].IsSubsumed(n))
{
isSubsumed = true;
break;
}
if (n.IsSubsumed(nets[i]))
{
result.Remove(n);
}
}
if (!isSubsumed)
{
result.Add(nets[i]);
}
}
return result.ToArray();
}

So what you're fundamentally doing here is aggregating values, but in parallel. Fortunately, PLINQ already has an implementation of Aggregate that works in parallel. So in your case you can simply wrap each element in the original array in its own one element array, and then your Prune operation is able to combine any two arrays of nets into a new single array.
public static IComparatorNetwork[] Prune(IComparatorNetwork[] nets)
{
return nets.Select(net => new[] { net })
.AsParallel()
.Aggregate((a, b) => new Pruner().Prune(a.Concat(b).ToArray()));
}
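To try the wrap-then-aggregate pattern without the IComparatorNetwork types, here is a small sketch on plain integers. AggregateDemo is hypothetical, and Union stands in for the pruning step as an order-independent way of combining two arrays. The sketch assumes a non-empty input, since Aggregate without a seed has nothing to return for an empty sequence.

```csharp
using System;
using System.Linq;

public static class AggregateDemo
{
    public static int[] MergeAll(int[] values)
    {
        return values
            .Select(v => new[] { v })                   // wrap each element in a one-element array
            .AsParallel()                               // let PLINQ partition the work
            .Aggregate((a, b) => a.Union(b).ToArray())  // combine any two partial results
            .OrderBy(v => v)                            // combination order is not fixed, so sort at the end
            .ToArray();
    }
}
```

Because PLINQ may combine partial results in any order, the combining function has to give the same final answer regardless of how the inputs are grouped.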
I'm not super knowledgeable about the internals of their Aggregate method, but I would imagine it's likely pretty good and doesn't spend a lot of time waiting unnecessarily. But if you want to write your own, so that you can be sure the workers always pull in new work as soon as there is new work, here is my own implementation. Feel free to compare the two in your specific situation to see which performs best for your needs. Note that PLINQ is configurable in many ways, so feel free to experiment with other configurations to see what works best for your situation.
public static T AggregateInParallel<T>(this IEnumerable<T> values, Func<T, T, T> function, int numTasks)
{
Queue<T> queue = new Queue<T>();
foreach (var value in values)
queue.Enqueue(value);
if (!queue.Any())
return default(T); //Consider throwing or doing something else here if the sequence is empty
(T, T)? GetFromQueue()
{
lock (queue)
{
if (queue.Count >= 2)
{
return (queue.Dequeue(), queue.Dequeue());
}
else
{
return null;
}
}
}
var tasks = Enumerable.Range(0, numTasks)
.Select(_ => Task.Run(() =>
{
var pair = GetFromQueue();
while (pair != null)
{
var result = function(pair.Value.Item1, pair.Value.Item2);
lock (queue)
{
queue.Enqueue(result);
}
pair = GetFromQueue();
}
}))
.ToArray();
Task.WaitAll(tasks);
return queue.Dequeue();
}
And the calling code for this version would look like:
public static IComparatorNetwork[] Prune2(IComparatorNetwork[] nets)
{
return nets.Select(net => new[] { net })
.AggregateInParallel((a, b) => new Pruner().Prune(a.Concat(b).ToArray()), nets.Length / 2);
}
As mentioned in comments, you can make the pruner's Prune method much more efficient by having it accept two collections, not just one, and only comparing items from each collection with the other, knowing that all items from the same collection will not subsume any others from that collection. This makes the method not only much shorter, simpler, and easier to understand, but also removes a sizeable portion of the expensive comparisons. A few minor adaptations can also greatly reduce the number of intermediate collections created.
public static IReadOnlyList<IComparatorNetwork> Prune(IReadOnlyList<IComparatorNetwork> first, IReadOnlyList<IComparatorNetwork> second)
{
var firstItemsNotSubsumed = first.Where(outerNet => !second.Any(innerNet => outerNet.IsSubsumed(innerNet)));
var secondItemsNotSubsumed = second.Where(outerNet => !first.Any(innerNet => outerNet.IsSubsumed(innerNet)));
return firstItemsNotSubsumed.Concat(secondItemsNotSubsumed).ToList();
}
With that, the calling code needs only minor adaptations to ensure the types match up and that you pass in both collections rather than concatenating them first.
public static IReadOnlyList<IComparatorNetwork> Prune(IReadOnlyList<IComparatorNetwork> nets)
{
return nets.Select(net => (IReadOnlyList<IComparatorNetwork>)new[] { net })
.AggregateInParallel((a, b) => Pruner.Prune(a, b), nets.Count / 2);
}

How to cache slow resource initialisation from C# Web API REST Server?

Context
I am trying to implement a REST API web service that "wraps" an existing C program.
Problem / Goal
Given that the C program has slow initialisation time and high RAM usage when I tell it to open a specific folder (assume this cannot be improved), I am thinking of caching the C handle/object, so the next time a GET request hits the same folder, I can use the existing handle.
What I've tried
First declare a static dictionary mapping from folder path to handle:
static ConcurrentDictionary<string, IHandle> handles = new ConcurrentDictionary<string, IHandle>();
In my GET function:
IHandle theHandle = handles.GetOrAdd(dir.Name, x => {
return new Handle(x); //this is the slow and memory-intensive function
});
This way, whenever a specific folder has been GET'd before, it will already have a handle ready for me to use.
Why it's not good
So now I run the risk of running out of memory if too many folders are cached simultaneously. How might I add a GC-like background process to TryRemove() and call IHandle.Dispose() on old handles, perhaps in a Least Recently Used or Least Frequently Used policy? Ideally it should start triggering only upon low physical memory available.
I have tried adding the following statement in the GET function, but it seems too hacky and is very limited in function. This way works OK only if I always want handles to expire after 10 seconds, and it does not restart the timer if a subsequent request comes in within 10 seconds.
HostingEnvironment.QueueBackgroundWorkItem(ct =>
{
System.Threading.Thread.Sleep(10000);
if (handles.TryRemove(dir.Name, out var handle2))
handle2.Dispose();
});
What this question is not
I don't think caching the output is the solution here. After I return the result of this GET request (it's just the metadata of the folder contents), there might be another GET request for more in-depth data, which requires calling Handle's methods.
I hope my question is clear enough!

Handles closing on low memory.
ConcurrentQueue<(string, IHandle)> handles = new ConcurrentQueue<(string, IHandle)>();
void CheckMemory_OptionallyReleaseOldHandles()
{
var performance = new System.Diagnostics.PerformanceCounter("Memory", "Available MBytes");
while (performance.NextValue() <= YOUR_TRESHHOLD)
{
if (handles.TryDequeue(out ValueTuple<string, IHandle> value))
{
value.Item2.Dispose();
}
}
}
Your Get method.
IHandle GetHandle()
{
IHandle theHandle = handles.FirstOrDefault(v => v.Item1 == dir.Name).Item2;
if (theHandle == null)
{
theHandle = new Handle(dir.Name);
handles.Enqueue((dir.Name, theHandle));
}
return theHandle;
}
Your background task.
void SetupMemoryCheck()
{
Action<CancellationToken> BeCheckingTheMemory = ct =>
{
for(;;)
{
if (ct.IsCancellationRequested)
{
break;
}
CheckMemory_OptionallyReleaseOldHandles();
Thread.Sleep(500);
};
};
HostingEnvironment.QueueBackgroundWorkItem(ct =>
{
var tf = new TaskFactory(ct, TaskCreationOptions.LongRunning, TaskContinuationOptions.None, TaskScheduler.Current);
tf.StartNew(() => BeCheckingTheMemory(ct));
});
}
I assume the collection will hold only a few elements, so there is no need for a dictionary.

I didn't catch your LRU/LFU requirement the first time. Here is a hybrid LRU/LFU cache model you can look at.
Handles closing on low memory.
/*
* string – handle name,
* IHandle – the handle,
* int – hit count,
*/
ConcurrentDictionary<string, (IHandle, int)> handles = new ConcurrentDictionary<string, (IHandle, int)>();
void FreeResources()
{
if (handles.Count == 0)
{
return;
}
var performance = new System.Diagnostics.PerformanceCounter("Memory", "Available MBytes");
while (performance.NextValue() <= YOUR_TRESHHOLD)
{
int maxIndex = (int)Math.Ceiling(handles.Count / 2.0d);
KeyValuePair<string, (IHandle, int)> candidate = handles.First();
for (int index = 1; index < maxIndex; index++)
{
KeyValuePair<string, (IHandle, int)> item = handles.ElementAt(index);
if(item.Value.Item2 < candidate.Value.Item2)
{
candidate = item;
}
}
candidate.Value.Item1.Dispose();
handles.TryRemove(candidate.Key, out _);
}
}
Get method.
IHandle GetHandle(Dir dir, int handleOpenAttemps = 1)
{
if(handles.TryGetValue(dir.Name, out (IHandle, int) handle))
{
handle.Item2++;
handles[dir.Name] = handle; // the tuple is a copy, so write the new hit count back
}
else
{
if(new System.Diagnostics.PerformanceCounter("Memory", "Available MBytes").NextValue() < YOUR_TRESHHOLD)
{
FreeResources();
}
try
{
handle.Item1 = new Handle(dir.Name);
}
catch (OutOfMemoryException)
{
if (handleOpenAttemps == 2)
{
return null;
}
FreeResources();
return GetHandle(dir, handleOpenAttemps + 1); // post-increment would pass the old value and retry forever
}
catch (Exception)
{
// Your handling.
}
handle.Item2 = 1;
handles.TryAdd(dir.Name, handle);
}
return handle.Item1;
}
Background task.
void SetupMemoryCheck()
{
Action<CancellationToken> BeCheckingTheMemory = ct =>
{
for (;;)
{
if (ct.IsCancellationRequested) break;
FreeResources();
Thread.Sleep(500);
}
};
HostingEnvironment.QueueBackgroundWorkItem(ct =>
{
new Task(() => BeCheckingTheMemory(ct), TaskCreationOptions.LongRunning).Start();
});
}
If you expect big collection the for loop could be optimised.
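For comparison, a strict least-recently-used policy can be kept without scanning the collection, by pairing a dictionary with a linked list ordered by last use. The sketch below is only an illustration under stated assumptions: LruHandleCache and FakeHandle are hypothetical names, and eviction is triggered by a fixed capacity rather than by available memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the expensive native handle.
public sealed class FakeHandle : IDisposable
{
    public string Path { get; }
    public bool IsDisposed { get; private set; }
    public FakeHandle(string path) { Path = path; }
    public void Dispose() { IsDisposed = true; }
}

public sealed class LruHandleCache<THandle> where THandle : IDisposable
{
    private readonly int _capacity;
    private readonly Func<string, THandle> _factory;
    private readonly object _gate = new object();
    // Most recently used entry is at the front of the list.
    private readonly LinkedList<KeyValuePair<string, THandle>> _order =
        new LinkedList<KeyValuePair<string, THandle>>();
    private readonly Dictionary<string, LinkedListNode<KeyValuePair<string, THandle>>> _map =
        new Dictionary<string, LinkedListNode<KeyValuePair<string, THandle>>>();

    public LruHandleCache(int capacity, Func<string, THandle> factory)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
    }

    public int Count { get { lock (_gate) { return _map.Count; } } }

    public THandle GetOrCreate(string key)
    {
        lock (_gate)
        {
            if (_map.TryGetValue(key, out var node))
            {
                // Cache hit: move the entry to the front.
                _order.Remove(node);
                _order.AddFirst(node);
                return node.Value.Value;
            }
            if (_map.Count >= _capacity)
            {
                // Cache full: evict and dispose the least recently used handle.
                var oldest = _order.Last;
                _order.RemoveLast();
                _map.Remove(oldest.Value.Key);
                oldest.Value.Value.Dispose();
            }
            THandle handle = _factory(key);
            _map[key] = _order.AddFirst(new KeyValuePair<string, THandle>(key, handle));
            return handle;
        }
    }
}
```

Two trade-offs of this sketch: the slow factory runs while the lock is held, so handle creation is serialized, and an evicted handle is disposed even if another request is still using it. Reference counting or a lease per request would be needed to make eviction safe under load.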

Parallel.ForEach returning before object's method which makes rate limited API calls [duplicate]

I am working on a plugin for a program that needs to make API calls. I was previously making them all synchronously, which worked but was slow.
To combat this I am trying to make the calls asynchronous. I can make 10 per second, so I tried the following:
Parallel.ForEach(
items.Values,
new ParallelOptions { MaxDegreeOfParallelism = 10 },
async item => {
await item.UpdateMarketData(client, HQOnly.Checked, retainers);
await Task.Delay(1000);
}
);
client is an HttpClient object and the rest is used to build the API call or for the stuff done to the result of the API call. Each time item.UpdateMarketData() is called 1 and only 1 API call is made.
This code seems to be finishing very quickly and as I understand it, the program should wait for a Parallel.ForEach() to complete before continuing.
The data that should be set by item.UpdateMarketData() is not being set either. To make sure, I even set MaxDegreeOfParallelism = 1 and the delay to 3 seconds, and it still finished very quickly despite having ~44 items to go through. Any help would be appreciated.
UpdateMarketData() is included below just in case it is relevant:
public async Task UpdateMarketData(TextBox DebugTextBox,HttpClient client,
bool HQOnly, List<string> retainers)
{
HttpResponseMessage sellers_result = null;
try
{
sellers_result = await client.GetAsync(String.Format(
"www.apiImCalling/items/{0}?key=secretapikey", ID));
}
catch (Exception e)
{
System.Windows.Forms.MessageBox.Show(
String.Format("{0} Exception caught.", e));
sellers_result = null;
}
var results = JsonConvert.DeserializeObject<RootObjectMB>(
sellers_result.Content.ReadAsStringAsync().Result);
int count = 0;
OnMB = false;
LowestOnMB = false;
LowestPrice = int.MaxValue;
try
{
foreach (var x in results.Prices)
{
if (x.IsHQ | !(HQOnly && RequireHQ))
{
count++;
if (count == 1)
{
LowestPrice = x.PricePerUnit;
}
if (retainers.Contains(x.RetainerName))
{
Retainer = x.RetainerName;
OnMB = true;
Price = x.PricePerUnit;
if (count == 1)
{
LowestOnMB = true;
}
}
if (LowestPrice == x.PricePerUnit
&& x.RetainerName != Retainer)
{
LowestOnMB = false;
}
}
}
}
catch (Exception e)
{
System.Windows.Forms.MessageBox.Show(
String.Format("{0} Exception caught.", e));
}
}

async doesn't work with Parallel. One is asynchronous, the other is parallel, and these are two completely different styles of concurrency.
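A small sketch of why the loop in the question returns early: Parallel.ForEach takes an Action<T>, so the async lambda is compiled as async void, and the loop treats each iteration as finished as soon as the lambda reaches its first await. AsyncVoidDemo is a hypothetical name and the 500 ms delay is arbitrary.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncVoidDemo
{
    // Returns how many iterations had actually completed when Parallel.ForEach returned.
    public static int CompletedWhenForEachReturned()
    {
        int completed = 0;

        // The async lambda becomes an async void method: Parallel.ForEach has
        // no Task to wait on, so it cannot observe the work after the await.
        Parallel.ForEach(new[] { 1, 2, 3 }, async item =>
        {
            await Task.Delay(500);
            Interlocked.Increment(ref completed);
        });

        // Parallel.ForEach has already returned here, long before the delays elapse.
        return Volatile.Read(ref completed);
    }
}
```

On an ordinary run the method returns 0, because none of the delayed continuations has run by the time the loop is considered done.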
To restrict the concurrency of asynchronous operations, use SemaphoreSlim. E.g.:
var mutex = new SemaphoreSlim(10);
var tasks = items.Values.Select(item => DoUpdateMarketData(item)).ToList();
await Task.WhenAll(tasks);
async Task DoUpdateMarketData(Item item)
{
await mutex.WaitAsync();
try
{
await item.UpdateMarketData(client, HQOnly.Checked, retainers);
await Task.Delay(1000);
}
finally { mutex.Release(); }
}
You may find my book helpful; this is covered in recipe 11.5.

Rather than a Parallel.ForEach loop, you can collect the tasks and wait for all of them to complete.
var tasks = new List<Task>();
foreach (var val in items.Values)
tasks.Add(val.UpdateMarketData(client, HQOnly.Checked, retainers)); // the async method already returns a started Task
try
{
// Wait for all the tasks to finish.
Task.WaitAll(tasks.ToArray());
//make use of WhenAll method if you dont want to block thread, and want to use async/await
Console.WriteLine("update completed");
}
catch (AggregateException e)
{
Console.WriteLine("\nThe following exceptions have been thrown by WaitAll(): (THIS WAS EXPECTED)");
for (int j = 0; j < e.InnerExceptions.Count; j++)
{
Console.WriteLine("\n-------------------------------------------------\n{0}", e.InnerExceptions[j].ToString());
}
}

Unit tests testing thread safety - Object not available randomly

We have some legacy code that tests thread safety on a number of classes. A recent hardware upgrade (from 2 to 4 core) is presenting random failures with an exception accessing an item from List<>.
[Test]
public void CheckThreadSafeInThreadPool()
{
Console.WriteLine("Initialised ThreadLocalDataContextStore...");
var container = new ContextContainerTest();
Console.WriteLine("Starting...");
container.StartPool();
while (container.ThreadNumber < 5)
{
Thread.Sleep(1000);
}
foreach (var message in container.Messages)
{
Console.WriteLine(message);
if (message.Contains("A supposedly new thread is able to see the old value"))
{
Assert.Fail("Thread leaked values - not thread safe");
}
}
Console.WriteLine("Complete");
}
public class ContextContainerTest
{
private ThreadLocalDataContextStore store;
public int ThreadNumber;
public List<string> Messages;
public void StartPool()
{
Messages = new List<string>();
store = new ThreadLocalDataContextStore();
store.ClearContext();
var msoContext = new MsoContext();
msoContext.Principal = new GenericPrincipal(new GenericIdentity("0"), null);
store.StoreContext(msoContext);
for (var counter = 0; counter < 5; counter++)
{
Messages.Add(string.Format("Assigning work item {0}", counter));
ThreadPool.QueueUserWorkItem(ExecuteMe, counter);
}
}
public void ExecuteMe(object input)
{
string hashCode = Thread.CurrentThread.GetHashCode().ToString();
if (store.GetContext() == null || store.GetContext().Principal == null)
{
Messages.Add(string.Format("[{0}] A New Thread", hashCode));
var msoContext = new MsoContext();
msoContext.Principal = new GenericPrincipal(new GenericIdentity("2"), null);
store.StoreContext(msoContext);
}
else if (store.GetContext().Principal.Identity.Name == "1")
{
Messages.Add(string.Format("[{0}] Thread reused", hashCode));
}
else
{
Messages.Add(string.Format("[{0}] A supposedly new thread is able to see the old value {1}"
, hashCode, store.GetContext().GetDiagnosticInformation()));
}
Messages.Add(string.Format("[{0}] Context at starting: {1}", hashCode, store.GetContext().GetDiagnosticInformation()));
store.GetContext().SetAsCurrent(new GenericPrincipal(new GenericIdentity("99"), null));
Messages.Add(string.Format("[{0}] Context at End: {1}", hashCode, store.GetContext().GetDiagnosticInformation()));
store.GetContext().SetAsCurrent(new GenericPrincipal(new GenericIdentity("1"), null));
Thread.Sleep(80);
ThreadNumber++;
}
}
The failure is random, and occurs at the following section of code within the test itself;
foreach (var message in container.Messages)
{
Console.WriteLine(message);
if (message.Contains("A supposedly new thread is able to see the old value"))
{
Assert.Fail("Thread leaked values - not thread safe");
}
}
A subtle change resolves the issue, but a colleague argues that we should not need it: why is message null when Messages is not, and why does it work most of the time but not always?
if (message != null && message.Contains("A supposedly new thread is able to see the old value"))
{
}
Another solution was to change the List to a thread-safe collection, but that doesn't answer why the issue arose in the first place.

List<T> is not thread safe. If you are using .NET 4 or above, you can use ConcurrentBag<T> from System.Collections.Concurrent; on older versions you have to implement a thread-safe collection yourself. See this, it might help.
Hope I was helpful.
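As for why a null can appear at all, one plausible explanation is that List<T>.Add is not atomic: two threads adding at the same moment can both advance the count while only one slot gets written, which leaves a null entry behind. More cores make that overlap more likely. A minimal sketch of a thread-safe message log is below. MessageLog and MessageLogDemo are hypothetical names, and ConcurrentQueue<T> is used instead of ConcurrentBag<T> as an assumption that the messages should keep their insertion order.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed class MessageLog
{
    private readonly ConcurrentQueue<string> _messages = new ConcurrentQueue<string>();
    private int _finished;

    public void Add(string message) => _messages.Enqueue(message);

    // ThreadNumber++ in the original test is not atomic either; Interlocked makes it so.
    public void MarkFinished() => Interlocked.Increment(ref _finished);

    public int Finished => Volatile.Read(ref _finished);

    // ToArray returns a snapshot that is safe to enumerate while workers keep adding.
    public string[] Snapshot() => _messages.ToArray();
}

public static class MessageLogDemo
{
    public static (int Count, int Nulls, int Finished) Run()
    {
        var log = new MessageLog();
        Task[] workers = Enumerable.Range(0, 5)
            .Select(id => Task.Run(() =>
            {
                for (int i = 0; i < 1000; i++)
                {
                    log.Add("[" + id + "] message " + i);
                }
                log.MarkFinished();
            }))
            .ToArray();

        Task.WaitAll(workers);
        string[] snapshot = log.Snapshot();
        return (snapshot.Length, snapshot.Count(m => m == null), log.Finished);
    }
}
```

With a concurrent collection every message is stored, so the null check in the test becomes unnecessary rather than a workaround.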
