ConcurrentDictionary reading and removing items in parallel - C#

Question background:
I am currently learning about concurrent operations using the .NET 6 Parallel.ForEachAsync loop.
I have created the following program, which runs two Tasks in parallel via Task.WhenAll. The KeepAliveAsync method makes the HttpClient calls with a maximum degree of parallelism of 3 over a list of ten items. The DisposeAsync method removes these items from the concurrent dictionary, but before doing so the CleanItemBeforeDisposal method clears the property values of the Item object.
Code:
[TestClass]
public class DisposeTests
{
    private ConcurrentDictionary<int, Item> items = new ConcurrentDictionary<int, Item>();
    private bool keepAlive = true;

    [TestMethod]
    public async Task Test()
    {
        //Arrange
        string uri = "https://website.com";
        IEnumerable<int> itemsToAdd = Enumerable.Range(1, 10);
        IEnumerable<int> itemsToDispose = Enumerable.Range(1, 10);

        foreach (var itemToAdd in itemsToAdd)
        {
            items.TryAdd(itemToAdd, new Item { Uri = uri });
        }

        //Act
        await Task.WhenAll(KeepAliveAsync(), DisposeAsync(itemsToDispose));

        //Assert
        Assert.IsTrue(items.Count == 0);
    }

    private async Task KeepAliveAsync()
    {
        HttpClient httpClient = new HttpClient();
        do
        {
            ParallelOptions parallelOptions = new()
            {
                MaxDegreeOfParallelism = 3,
            };

            await Parallel.ForEachAsync(items.ToArray(), parallelOptions, async (item, token) =>
            {
                var response = await httpClient.GetStringAsync(item.Value.Uri);
                item.Value.DataResponse = response;
                item.Value.DataResponse.ToUpper();
            });
        } while (keepAlive == true);
    }

    private async Task DisposeAsync(IEnumerable<int> itemsToRemove)
    {
        var itemsToDisposeFiltered = items.ToList().FindAll(a => itemsToRemove.Contains(a.Key));

        ParallelOptions parallelOptions = new()
        {
            MaxDegreeOfParallelism = 3,
        };

        await Parallel.ForEachAsync(itemsToDisposeFiltered.ToArray(), parallelOptions, async (itemsToDispose, token) =>
        {
            await Task.Delay(500);
            CleanItemBeforeDisposal(itemsToDispose);

            bool removed = items.TryRemove(itemsToDispose);
            if (removed == true)
            {
                Debug.WriteLine($"DisposeAsync - Removed item {itemsToDispose.Key} from the list");
            }
            else
            {
                Debug.WriteLine($"DisposeAsync - Did not remove item {itemsToDispose.Key} from the list");
            }
        });

        keepAlive = false;
    }

    private void CleanItemBeforeDisposal(KeyValuePair<int, Item> itemToDispose)
    {
        itemToDispose.Value.Uri = null;
        itemToDispose.Value.DataResponse = null;
    }
}
The Issue:
The code runs, but I am running into an issue: the Uri property of the Item object is set to null by the CleanItemBeforeDisposal method (called from DisposeAsync, by design), but the HttpClient call in the parallel KeepAliveAsync method can still run against that shared Item, at which point the Uri is null and the call errors with:
System.InvalidOperationException: An invalid request URI was provided. Either the request URI must be an absolute URI or BaseAddress must be set.
I have used the ToArray method on the shared ConcurrentDictionary because I believe this creates a snapshot of the dictionary at the time it is called, but obviously this won't solve the race condition.
What is the correct way to go about handling a situation where two processes are accessing one shared list where there is the possibility one process has changed properties of an entity of that list which the other process requires?

I'll try to answer the question directly without getting into details about the design, etc.
ConcurrentDictionary is thread safe, meaning that multiple threads can safely add and remove items from the dictionary. That thread safety does not apply at all to whatever objects are stored as values in the dictionary.
If multiple threads have a reference to an instance of Item and are updating its properties, all sorts of unpredictable things can happen.
To directly answer the question:
What is the correct way to go about handling a situation where two processes are accessing one shared list where there is the possibility one process has changed properties of an entity of that list which the other process requires?
There is no correct way to handle that possibility. If you want the code to work in a predictable way you must eliminate that possibility.
It looks like you might have hoped that somehow the two operations will stay in sync. They won't. Even if they did, just once, it might never happen again. It's unpredictable.
If you actually need to set the Uri and DataResponse to null, it's probably better to do that for each Item on the same thread, right after you're done using those values. If you do these three things
Execute the request for a single Item
Do something with the values
Set them to null
...one after another in the same thread, it's impossible for them to happen out of order.
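To make the ordering point concrete, here is a minimal sketch for a single Item (ProcessAndCleanAsync is a hypothetical helper name, not a drop-in replacement for the test above):
// Hypothetical helper: handles one Item end to end.
// Because the three steps run one after another in the same async flow,
// the Uri can never be null while the request is being made.
private static async Task ProcessAndCleanAsync(HttpClient httpClient, Item item)
{
    // 1. Execute the request for this single Item
    item.DataResponse = await httpClient.GetStringAsync(item.Uri);

    // 2. Do something with the values
    var shouted = item.DataResponse.ToUpper();

    // 3. Set them to null, right after you're done using them
    item.Uri = null;
    item.DataResponse = null;
}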
(But do you need to set them to null at all? It's not clear what that's for. If you just didn't do it, then there wouldn't be a problem to solve.)

Related

adding items asynchronously to a list within an async method in C# .NET

I'm trying to better understand await and asynchronous operations in C# and .NET. I have an object objectA which, among other things, contains a list of errors. I also have a list of ids, and I need to perform an async method on each id. If my async method catches an error, I want to add it to the objectA field that contains the list of errors. My worry is that this might not be thread safe, as multiple threads might be trying to modify the same object at the same time. I don't care about the order of the errors. Is some type of locking automatically handled? Should I just return the error from the method and then add it to the list at the end?
async Task Main()
{
    var objectA = new ExampleObject();
    List<Task> getIdtasks = new List<Task>();
    foreach (var id in ids)
    {
        getIdtasks.Add(GetByIdAsync(id, objectA));
    }
    await Task.WhenAll(getIdtasks);
}

private async Task<object> GetByIdAsync(object id, ExampleObject objectA)
{
    try
    {
        var objectFromId = await DoSomethingAsync(id);
        return objectFromId;
    }
    catch (Exception e)
    {
        objectA.errors.Add(e);
    }
    return null;
}
List<T> is not thread safe, so if you call List.Add from different threads at the same time it can fail or corrupt the list.
You can either put a lock around the List.Add call, or use one of the thread-safe collections.
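As a minimal sketch of both options (the field names here are illustrative, not from the question):
// Option 1: guard the non-thread-safe List<Exception> with a lock
private readonly object errorsLock = new object();
private readonly List<Exception> errors = new List<Exception>();

private void AddError(Exception e)
{
    lock (errorsLock)
    {
        errors.Add(e);
    }
}

// Option 2: use a thread-safe collection instead,
// e.g. ConcurrentBag<T> from System.Collections.Concurrent
private readonly ConcurrentBag<Exception> concurrentErrors = new ConcurrentBag<Exception>();

private void AddErrorConcurrent(Exception e)
{
    concurrentErrors.Add(e); // safe to call from multiple threads
}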

How to get state from service fabric actor without waiting for other methods to complete?

I have a service running that is iterating over X actors asking them for their state using the ActorProxy.
It's important to me that this service is not blocked waiting for some other long-running method in the actor, e.g. reminder callbacks.
Is there some way to call the simple GetState() example below so that the call completes right away, without blocking in case a reminder is running?
class Actor : IMyActor
{
    public Task<MyState> GetState() => StateManager.GetStateAsync<MyState>("key");
}
Alternatively: what is the proper way for the service to make the call so that, if the actor doesn't reply within 5 seconds, it just continues?
var proxy = ActorProxy.Create<IMyActor>(actorId);
var state = await proxy.GetState(); // this will wait until the actor is ready to send back the state.
It is possible to read the actor state even for Actors that are currently executing a blocking method. Actors store their state using an IActorStateManager which in turn uses an IActorStateProvider. The IActorStateProvider is instantiated once per ActorService. Each partition instantiates the ActorService that is responsible for hosting and running actors. The actor service is at the core a StatefulService (or rather StatefulServiceBase which is the base class that regular stateful service uses). With this in mind, we can work with the ActorService that caters to our Actors the same way we would work with a regular service, i.e. with a service interface based on IService.
The IActorStateProvider (Implemented by KvsActorStateProvider if you are using Persisted state) has two methods that we can use:
Task<T> LoadStateAsync<T>(ActorId actorId, string stateName, CancellationToken cancellationToken = default(CancellationToken));
Task<PagedResult<ActorId>> GetActorsAsync(int numItemsToReturn, ContinuationToken continuationToken, CancellationToken cancellationToken);
Calls to these methods are not affected by actors' locks, which makes sense since these are designed to support all actors on a partition.
Example:
Create a custom ActorService and use that one to host your actors:
public interface IManyfoldActorService : IService
{
    Task<IDictionary<long, int>> GetCountsAsync(CancellationToken cancellationToken);
}

public class ManyfoldActorService : ActorService, IManyfoldActorService
{
    ...
}
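The elided body needs at least a constructor that forwards to the ActorService base class; a minimal sketch (my assumption, matching the two-argument registration shown next) could be:
public class ManyfoldActorService : ActorService, IManyfoldActorService
{
    // Forward the context and actor type information to the base ActorService.
    public ManyfoldActorService(StatefulServiceContext context, ActorTypeInformation actorTypeInfo)
        : base(context, actorTypeInfo)
    {
    }

    // GetCountsAsync (shown further down) completes the IManyfoldActorService implementation.
}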
Register the new ActorService in Program.Main:
ActorRuntime.RegisterActorAsync<ManyfoldActor>(
    (context, actorType) => new ManyfoldActorService(context, actorType)).GetAwaiter().GetResult();
Assuming we have a simple Actor with the following method:
Task IManyfoldActor.SetCountAsync(int count, CancellationToken cancellationToken)
{
    Task.Delay(TimeSpan.FromSeconds(30), cancellationToken).GetAwaiter().GetResult();
    var task = this.StateManager.SetStateAsync("count", count, cancellationToken);
    ActorEventSource.Current.ActorMessage(this, $"Finished set {count} on {this.Id.GetLongId()}");
    return task;
}
It waits for 30 seconds (to simulate a long-running, blocking method call) and then sets the state value "count" to an int.
In a separate service we can now call the SetCountAsync for the Actors to generate some state data:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    var actorProxyFactory = new ActorProxyFactory();
    long iterations = 0;
    while (true)
    {
        cancellationToken.ThrowIfCancellationRequested();
        iterations += 1;

        var actorId = iterations % 10;
        var count = Environment.TickCount % 100;

        var manyfoldActor = actorProxyFactory.CreateActorProxy<IManyfoldActor>(new ActorId(actorId));
        manyfoldActor.SetCountAsync(count, cancellationToken).ConfigureAwait(false);
        ServiceEventSource.Current.ServiceMessage(this.Context, $"Set count {count} on {actorId} # {iterations}");

        await Task.Delay(TimeSpan.FromSeconds(3), cancellationToken);
    }
}
This method simply loops endlessly, changing the values of the actors. (Note the correlation between the total of 10 actors, the 3-second delay between calls and the 30-second delay in the actor; it is designed this way to prevent an infinite build-up of actor calls waiting for a lock.) Each call is also executed as fire-and-forget, so we can continue on to updating the state of the next actor before the previous one returns. It's a silly piece of code, designed purely to prove the theory.
Now in the actor service we can implement the method GetCountsAsync like this:
public async Task<IDictionary<long, int>> GetCountsAsync(CancellationToken cancellationToken)
{
    ContinuationToken continuationToken = null;
    var actors = new Dictionary<long, int>();
    do
    {
        var page = await this.StateProvider.GetActorsAsync(100, continuationToken, cancellationToken);
        foreach (var actor in page.Items)
        {
            var count = await this.StateProvider.LoadStateAsync<int>(actor, "count", cancellationToken);
            actors.Add(actor.GetLongId(), count);
        }
        continuationToken = page.ContinuationToken;
    }
    while (continuationToken != null);

    return actors;
}
This uses the underlying ActorStateProvider to query for all known Actors (for that partition) and then reads the state for each one directly, thereby 'bypassing' the Actor and not being blocked by the actor's method execution.
Final piece, some method that can call our ActorService and call GetCountsAsync across all partitions:
public IDictionary<long, int> Get()
{
    var applicationName = FabricRuntime.GetActivationContext().ApplicationName;
    var actorServiceName = $"{typeof(IManyfoldActorService).Name.Substring(1)}";
    var actorServiceUri = new Uri($"{applicationName}/{actorServiceName}");

    var fabricClient = new FabricClient();
    var partitions = new List<long>();
    var servicePartitionList = fabricClient.QueryManager.GetPartitionListAsync(actorServiceUri).GetAwaiter().GetResult();
    foreach (var servicePartition in servicePartitionList)
    {
        var partitionInformation = servicePartition.PartitionInformation as Int64RangePartitionInformation;
        partitions.Add(partitionInformation.LowKey);
    }

    var serviceProxyFactory = new ServiceProxyFactory();
    var actors = new Dictionary<long, int>();
    foreach (var partition in partitions)
    {
        var actorService = serviceProxyFactory.CreateServiceProxy<IManyfoldActorService>(actorServiceUri, new ServicePartitionKey(partition));
        var counts = actorService.GetCountsAsync(CancellationToken.None).GetAwaiter().GetResult();
        foreach (var count in counts)
        {
            actors.Add(count.Key, count.Value);
        }
    }
    return actors;
}
Running this code now gives us 10 actors that each get their state updated roughly every 33 seconds, with each actor busy for 30 seconds each time. The actor service sees the updated state as soon as each actor method returns.
There are some things omitted in this sample; for instance, when loading the state in the actor service we should probably guard against timeouts.
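As a rough sketch of such a guard (my addition, not part of the original sample, reusing the names from GetCountsAsync above), each state read could be bounded with a linked, self-cancelling token:
// Sketch: bound each state read to a timeout.
using (var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken))
{
    timeoutCts.CancelAfter(TimeSpan.FromSeconds(5));
    try
    {
        var count = await this.StateProvider.LoadStateAsync<int>(actor, "count", timeoutCts.Token);
        actors.Add(actor.GetLongId(), count);
    }
    catch (OperationCanceledException)
    {
        // Skip this actor if the read took too long.
    }
}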
There is no way to do this. Actors are single-threaded. If they are doing long running work that they are waiting to complete inside any actor method, then any other method including those from outside will have to wait.
Thank you for all the help. I was able to take your example and get it to work with a few tweaks. The only problem I had was an unknown-type error when passing the data back to the original application service. I was getting:
"ArrayOfKeyValueOflonglong is not expected. Add any types not known statically to the list of known types - for example, by using the KnownTypeAttribute attribute or by adding them to the list of known types passed to DataContractSerializer"
So I changed the return type of GetCountsAsync to a List<T> and decorated my class with the DataContract and DataMember attributes, and it worked fine. It seems like the ability to retrieve state data from many actors across partitions should be a core part of the actor service, and you should not have to create a custom actor service to get at the StateProvider information. Once again, thank you!
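For reference, a decorated result type along those lines might look like this (ActorCount and its members are hypothetical names, pieced together from the comment above):
using System.Runtime.Serialization;

// Hypothetical DTO returned instead of IDictionary<long, int>, so the
// DataContractSerializer knows the type statically.
[DataContract]
public class ActorCount
{
    [DataMember]
    public long ActorId { get; set; }

    [DataMember]
    public int Count { get; set; }
}

// The service method signature then becomes something like:
// Task<List<ActorCount>> GetCountsAsync(CancellationToken cancellationToken);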

C# Parallel - Adding items to the collection being iterated over, or equivalent?

Right now, I've got a C# program that performs the following steps on a recurring basis:
Grab current list of tasks from the database
Using Parallel.ForEach(), do work for each task
However, some of these tasks are very long-running. This delays the processing of other pending tasks because we only look for new ones at the start of the program.
Now, I know that modifying the collection being iterated over isn't possible (right?), but is there some equivalent functionality in the C# Parallel framework that would allow me to add work to the list while also processing items in the list?
Generally speaking, you're right that modifying a collection while iterating it is not allowed. But there are other approaches you could be using:
Use ActionBlock<T> from TPL Dataflow. The code could look something like:
var actionBlock = new ActionBlock<MyTask>(
    task => DoWorkForTask(task),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });

while (true)
{
    var tasks = GrabCurrentListOfTasks();
    foreach (var task in tasks)
    {
        actionBlock.Post(task);
        await Task.Delay(someShortDelay);
        // or use Thread.Sleep() if you don't want to use async
    }
}
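One note I'll add (not in the original answer): if the producing loop ever ends, the usual ActionBlock shutdown pattern is to signal completion and wait for the block to drain:
actionBlock.Complete();        // tell the block no more items will be posted
await actionBlock.Completion;  // wait until all posted items have been processed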
Use BlockingCollection<T>, which can be modified while consuming items from it, along with GetConsumingPartitioner() from ParallelExtensionsExtras to make it work with Parallel.ForEach():
var collection = new BlockingCollection<MyTask>();

Task.Run(async () =>
{
    while (true)
    {
        var tasks = GrabCurrentListOfTasks();
        foreach (var task in tasks)
        {
            collection.Add(task);
            await Task.Delay(someShortDelay);
        }
    }
});

Parallel.ForEach(collection.GetConsumingPartitioner(), task => DoWorkForTask(task));
Here is an example of an approach you could try. I think you want to get away from Parallel.ForEaching and do something with asynchronous programming instead because you need to retrieve results as they finish, rather than in discrete chunks that could conceivably contain both long running tasks and tasks that finish very quickly.
This approach uses a simple sequential loop to retrieve results from a list of asynchronous tasks. In this case, you should be safe to use a simple non-thread safe mutable list because all of the mutation of the list happens sequentially in the same thread.
Note that this approach uses Task.WhenAny in a loop which isn't very efficient for large task lists and you should consider an alternative approach in that case. (See this blog: http://blogs.msdn.com/b/pfxteam/archive/2012/08/02/processing-tasks-as-they-complete.aspx)
This example is based on: https://msdn.microsoft.com/en-GB/library/jj155756.aspx
private async Task<ProcessResult> processTask(ProcessTask task)
{
    // do something intensive with data
}

private IEnumerable<ProcessTask> GetOutstandingTasks()
{
    // retrieve some tasks from db
}

private async Task ProcessAllData()
{
    List<Task<ProcessResult>> taskQueue =
        GetOutstandingTasks()
            .Select(tsk => processTask(tsk))
            .ToList(); // grab initial task queue

    while (taskQueue.Any()) // iterate while tasks need completing
    {
        Task<ProcessResult> firstFinishedTask = await Task.WhenAny(taskQueue); // get first to finish
        taskQueue.Remove(firstFinishedTask); // remove the one that finished
        ProcessResult result = await firstFinishedTask; // get the result
        // do something with task result
        taskQueue.AddRange(GetOutstandingTasks().Select(tsk => processTask(tsk))); // add more tasks that need performing
    }
}

Task Outliving lifetime of Http Request

I am attempting to implement some async code in my application. I start a new Task and don't wait for the result. That task in turn news up further Tasks whose results it does await. These secondary Tasks use HttpContext (I need to get the User from the HTTP context) because they fire off an API call which relies on HttpContext.Current.User.
I was using this answer in order to pass the current context into the Task.
So my code is as below:
var context = HttpContext.Current;
Task.Factory.StartNew(() =>
{
    HttpContext.Current = context;
    ExecuteMethodAndContinue();
});

private static void ExecuteMethodAndContinue()
{
    var myService = ServiceManager.GetMyService();
    var query = GetQuery();
    var files = myService.GetFiles(query).ToList();
    //Remaining code removed for brevity
}
The implementation of GetFiles, which is called from other places in the code as well, is below:
public IDictionary<FileName, FileDetails> GetFiles(MyQuery query)
{
    var countries = GetAllCountries();
    var context = HttpContext.Current;
    var taskList = countries.Select(c => Task.Factory.StartNew(() =>
    {
        HttpContext.Current = context;
        return new Dictionary<FileName, FileDetails> { { c, GetFilesInCountry(query, c) } };
    })).ToList();

    try
    {
        // Wait on all queries completing
        Task.WaitAll(taskList.ToArray<Task>());
    }
    catch (AggregateException ae)
    {
        throw new ApplicationException("Failed.", ae);
    }

    // Return collated results
    return taskList.SelectMany(t => t.Result).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
}
The GetFilesInCountry method is what actually contains the API call that relies on HttpContext.Current.User. When I hit a breakpoint on the return new line in GetFiles, I can see that HttpContext.Current.User is correctly set, as expected. However, when I break inside the GetFilesInCountry method and hover over HttpContext.Current.User, I find that it is null.
I think this is because the HTTP request from which I started the first call (ExecuteMethodAndContinue) has finished, which is why the User on the current context is null.
Is there something straightforward I can do to correctly work around this?
The easiest way of course would be to never use HttpContext.Current. It's not a good practice anyway - you should only access the HttpContext in the request thread it's associated with. Instead, you can just make sure all the methods that require, e.g., a user name get the user name as an argument:
var username = HttpContext.Current.User.Identity.Name;
var taskList = countries.Select(c => Task.Factory.StartNew(() =>
{
    return new Dictionary<FileName, FileDetails> { { c, GetFilesInCountry(query, c, username) } };
})).ToList();
If this is impractical for some reason (it probably isn't a very good reason, but fixing legacy applications to work like this can be a chore), you can replace the HttpContext.Current accesses with something a bit more specific, and not tied to a particular request. And, uh, thread-safe:
public static class UserContext
{
    [ThreadStatic]
    public static string Username;
}
So your calling code would look something like this:
var username = HttpContext.Current.User.Identity.Name;
var taskList = countries.Select(c => Task.Factory.StartNew(() =>
{
    UserContext.Username = username;
    return new Dictionary<FileName, FileDetails> { { c, GetFilesInCountry(query, c) } };
})).ToList();
And whenever you'd usually use HttpContext.Current.User.Identity.Name, you'll use UserContext.Username instead (don't forget to also fill the UserContext in the main request thread).
The huge caveat with this is that it gets completely crazy when you have asynchronous code in there; you're on the thread-pool, so you're not the exclusive user of those threads, and any awaits or continuations are free to be performed on any available thread-pool thread (there's no marshalling to a synchronization context). So anywhere you're creating more tasks, be it through manual Task.Run, await, ContinueWith or whatever, you'll lose this context. Just as importantly, there's no place where you can clear this information - this can obviously be a huge security hole, as concurrent requests may have different parts of the code execute with different user contexts. If you choose to go this path, you'd better read up a lot about making this kind of thing safe. You'll probably have to code your own synchronization context to hold this information, and make sure all the asynchronous stuff in your application marshalls back to this synchronization context. In short - don't do this. Really. It isn't worth it. You'll have so many obscure bugs that are very hard to reproduce, there's no way it will be worth it.

Download HTML pages concurrently using the Async CTP

Attempting to write an HTML crawler using the Async CTP, I have gotten stuck as to how to write a recursion-free method for accomplishing this.
This is the code I have so far.
private readonly ConcurrentStack<LinkItem> _LinkStack;
private readonly Int32 _MaxStackSize;
private readonly WebClient client = new WebClient();

Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) =>
{
    string html = await client.DownloadStringTaskAsync(uri);
    return LinkFinder.Find(html, BaseURL);
};

Action<LinkItem> DownloadAndPush = async (o) =>
{
    List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href);
    if (this._LinkStack.Count() + result.Count <= this._MaxStackSize)
    {
        this._LinkStack.PushRange(result.ToArray());
        o.Processed = true;
    }
};

Parallel.ForEach(this._LinkStack, (o) =>
{
    DownloadAndPush(o);
});
But obviously this doesn't work as I would hope, because at the time Parallel.ForEach executes the first (and only) iteration I only have 1 item. The simplest approach I can think of is to make the ForEach recursive, but I can't (I don't think) do this as I would quickly run out of stack space.
Could anyone please guide me as to how I can restructure this code, to create what I would describe as a recursive continuation that adds items until either the MaxStackSize is reached or the system runs out of memory?
I think the best way to do something like this using C# 5/.NET 4.5 is to use TPL Dataflow. There is even a walkthrough on how to implement a web crawler using it.
Basically, you create one "block" that takes care of downloading one URL and getting the link from it:
var cts = new CancellationTokenSource();

Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
    async link =>
    {
        // WebClient is not guaranteed to be thread-safe,
        // so we shouldn't use one shared instance
        var client = new WebClient();
        string html = await client.DownloadStringTaskAsync(link.Href);
        return LinkFinder.Find(html, link.BaseURL);
    };

var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
    downloadFromLink,
    new ExecutionDataflowBlockOptions
    { MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });
You can set MaxDegreeOfParallelism to any value you want. It says at most how many URLs can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.
Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:
var links = new List<LinkItem>();
var storeBlock = new ActionBlock<LinkItem>(
    linkItem =>
    {
        links.Add(linkItem);
        if (links.Count == maxSize)
            cts.Cancel();
    });
Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using a collection that is not thread-safe should be okay here.
We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.
var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);
The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.
Now we can connect the blocks together:
linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);
Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):
linkFinderBlock.Post(firstItem);
And finally wait until the processing is complete:
try
{
    linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
    if (!(ex.InnerException is TaskCanceledException))
        throw;
}
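For context, the LinkItem type is never shown in the question; a minimal shape that would satisfy both snippets (an assumption inferred from the Href, BaseURL and Processed usages, not the asker's actual class) is:
// Assumed shape of LinkItem, inferred from the code above.
public class LinkItem
{
    public string BaseURL { get; set; }
    public string Href { get; set; }
    public bool Processed { get; set; }
}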
