Is there a way in C# rx to handle backpressure?
I'm trying to call a web api from the results of a paged query. This web api is very fragile and I need to have no more than, say, 3 concurrent calls, so the program should be something like:
Fetch a page from the db
Call the web api for each record on the page, with a maximum of three concurrent calls
Save the results back to db
Fetch another page and repeat until there are no more results.
I'm not really getting the sequence I'm after: the db fetches all the records regardless of whether they can be processed yet.
I've tried a variety of things, including tweaking the ObserveOn operator, implementing a semaphore, and a few other things. Could I get a little guidance on implementing something like this?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Concurrency;
using System.Reactive.Linq;
using System.Reactive.Threading.Tasks;
using System.Threading;
using System.Threading.Tasks;
using Castle.Core.Internal;
using Xunit;
using Xunit.Abstractions;
namespace ProductValidation.CLI.Tests.Services
{
public class Example
{
private readonly ITestOutputHelper output;
public Example(ITestOutputHelper output)
{
this.output = output;
}
[Fact]
public async Task RunsObservableToCompletion()
{
var repo = new Repository(output);
var client = new ServiceClient(output);
var results = repo.FetchRecords()
.Select(x => client.FetchMoreInformation(x).ToObservable())
.Merge(1)
.Do(async x => await repo.Save(x));
await results.LastOrDefaultAsync();
}
}
public class Repository
{
private readonly ITestOutputHelper output;
public Repository(ITestOutputHelper output)
{
this.output = output;
}
public IObservable<int> FetchRecords()
{
return Observable.Create<int>(async (observer) =>
{
var page = 1;
var products = await FetchPage(page);
while (!products.IsNullOrEmpty())
{
foreach (var product in products)
{
observer.OnNext(product);
}
page += 1;
products = await FetchPage(page);
}
observer.OnCompleted();
})
.ObserveOn(SynchronizationContext.Current);
}
private async Task<IEnumerable<int>> FetchPage(int page)
{
// Simulate fetching a paged query.
await Task.Delay(500).ToObservable().ObserveOn(new TaskPoolScheduler(new TaskFactory()));
output.WriteLine("Fetching page {0}", page);
if (page >= 4) return Enumerable.Empty<int>();
return Enumerable.Range(1, 3).Select(_ => page);
}
public async Task Save(string id)
{
await Task.Delay(50); //Simulates latency
}
}
public class ServiceClient
{
private readonly ITestOutputHelper output;
private readonly SemaphoreSlim semaphore;
public ServiceClient(ITestOutputHelper output)
{
this.output = output;
this.semaphore = new SemaphoreSlim(2);
}
public async Task<string> FetchMoreInformation(int id)
{
try
{
output.WriteLine("Calling the web client for {0}", id);
await semaphore.WaitAsync(); // Protection for the webapi not sending too many calls
await Task.Delay(1000); //Simulates latency
return id.ToString();
}
finally
{
semaphore.Release();
}
}
}
}
Rx does not support backpressure, so there is no easy way to fetch the records from the DB at the same tempo that they are processed. You could use a Subject<Unit> as a signaling mechanism: push a value every time a record is processed, and devise a way for the producing site to fetch a new record from the DB when a signal is received. But it would be a messy and unidiomatic solution. TPL Dataflow is a more suitable tool than Rx for this kind of work. It natively supports the BoundedCapacity configuration option.
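To give a taste of it, here is a rough sketch of what a Dataflow version of your pipeline could look like. This is a sketch under assumptions: it reuses your Repository and ServiceClient, assumes FetchPage is made accessible to the caller (it is private in your posted code), and requires the System.Threading.Tasks.Dataflow package:
var callApi = new TransformBlock<int, string>(
    id => client.FetchMoreInformation(id),
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 3, // at most 3 concurrent web api calls
        BoundedCapacity = 10        // at most 10 records buffered ahead
    });

var saveToDb = new ActionBlock<string>(
    result => repo.Save(result),
    new ExecutionDataflowBlockOptions { BoundedCapacity = 10 });

callApi.LinkTo(saveToDb, new DataflowLinkOptions { PropagateCompletion = true });

// The producer awaits SendAsync, which waits (asynchronously) while the
// pipeline is full; this waiting is the backpressure.
var page = 1;
var products = await repo.FetchPage(page); // assumed public here
while (products.Any())
{
    foreach (var product in products)
        await callApi.SendAsync(product);
    products = await repo.FetchPage(++page);
}
callApi.Complete();
await saveToDb.Completion;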
Some comments regarding the code you've posted, that are not directly related to the backpressure issue:
The Merge operator with a maxConcurrent parameter imposes a limit on the concurrent subscriptions to the inner sequences, but this will have no effect in case the inner sequences are already up and running. So you have to ensure that the inner sequences are cold, and a handy way to do this is the Defer operator:
.Select(x => Observable.Defer(() =>
client.FetchMoreInformation(x).ToObservable()))
A more common way to convert asynchronous methods to deferred observable sequences is the FromAsync operator:
.Select(x => Observable.FromAsync(() => client.FetchMoreInformation(x)))
Btw the Do operator does not understand async delegates, so instead of:
.Do(async x => await repo.Save(x));
...which creates async void lambdas, it's better to do this:
.Select(x => Observable.FromAsync(() => repo.Save(x)))
.Merge(1);
Update: Here is an example of how you could use a SemaphoreSlim in order to implement backpressure in Rx:
const int boundedCapacity = 10;
using var semaphore = new SemaphoreSlim(boundedCapacity, boundedCapacity);
IObservable<int> results = repo
.FetchRecords(semaphore)
.Select(x => Observable.FromAsync(() => client.FetchMoreInformation(x)))
.Merge(1)
.Select(x => Observable.FromAsync(() => repo.Save(x)))
.Merge(1)
.Do(_ => semaphore.Release());
await results.DefaultIfEmpty();
And inside the FetchRecords method:
//...
await semaphore.WaitAsync();
observer.OnNext(product);
//...
This is a fragile solution, because it depends on propagating all elements through the pipeline. If in the future you decide to include filtering or throttling inside the pipeline, then the one-to-one relationship between WaitAsync and Release will be violated, with the most probable outcome being a deadlocked pipeline.
This question is going to be quite long but I want to explain my code and thought process as thoroughly as possible, so here goes...
I am coding a web crawler in C# which is supposed to search through Wikipedia from a given source link and find a way to a destination link. For example, you can give it the toaster Wiki page link and the pancake Wiki page link, and it should output a route which takes you from toaster to pancake. In other words, I want to find the shortest path between two Wiki articles.
I think I have coded that up correctly. I created two classes; one is called CrawlerPage, and here is its body:
using HtmlAgilityPack;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
namespace Wikipedia_Crawler
{
internal class CrawlerPage
{
public string mainLink;
private List<CrawlerPage> _pages = new();
public CrawlerPage(string mainLink)
{
this.mainLink = mainLink;
}
public async Task<List<CrawlerPage>> GetPages()
{
var pagesLinks = await Task.Run(() => GetPages(this));
foreach(var page in pagesLinks)
{
_pages.Add(new CrawlerPage(page));
}
return _pages;
}
private HashSet<string> GetPages(CrawlerPage page)
{
string result = "";
using (HttpClient client = new HttpClient())
{
using (HttpResponseMessage response = client.GetAsync(page.mainLink).Result)
{
using (HttpContent content = response.Content)
{
result = content.ReadAsStringAsync().Result;
}
}
}
var wikiLinksList = ParseLinks(result)
.Where(x => x.Contains("/wiki/") && !x.Contains("https://") && !x.Contains(".jpg") &&
!x.Contains(".png"))
.AsParallel()
.ToList();
var wikiLinksHashSet = new HashSet<string>();
foreach(var wikiLink in wikiLinksList)
{
wikiLinksHashSet.Add("https://en.wikipedia.org" + wikiLink);
}
HashSet<string> ParseLinks(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
return nodes == null ? new HashSet<string>() : nodes.AsParallel().ToList().ConvertAll(
r => r.Attributes.AsParallel().ToList().ConvertAll(
i => i.Value)).SelectMany(j => j).AsParallel().ToHashSet();
}
return wikiLinksHashSet;
}
}
}
The class above is supposed to represent a Wiki article. It contains its own link (the mainLink field) and a list of every other page linked from that page (the _pages field). The GetPages() methods read the page's HTML and parse it into a HashSet containing only the links of interest (links to other articles), discarding all the junk links.
The second class is a Crawler class that performs BFS (breadth-first search). Code below:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
namespace Wikipedia_Crawler
{
internal class Crawler
{
private int _maxDepth;
private int _currDepth;
public Crawler(int maxDepth)
{
_currDepth = 0;
_maxDepth = maxDepth;
}
public async Task CrawlParallelAsync(string sourceLink, string destinationLink)
{
var sourcePage = new CrawlerPage(sourceLink);
var destinationPage = new CrawlerPage(destinationLink);
var visited = new HashSet<string>();
Queue <CrawlerPage> queue = new();
queue.Enqueue(sourcePage);
while (queue.Count > 0)
{
var currPage = queue.Dequeue();
Console.WriteLine(currPage.mainLink);
var currPageSubpages = await Task.Run(() => currPage.GetPages());
if (currPage.mainLink == destinationPage.mainLink || _currDepth == _maxDepth)
{
visited.Add(currPage.mainLink);
break;
}
if (visited.Contains(currPage.mainLink))
continue;
visited.Add(currPage.mainLink);
foreach (var page in currPageSubpages)
{
if (!visited.Contains(page.mainLink))
{
queue.Enqueue(page);
}
}
}
foreach (var visitedPage in visited)
{
Console.WriteLine(visitedPage);
}
}
}
}
Note that I am not incrementing _currDepth yet; the idea is that if the search goes too deep, it should stop because the route is getting too long.
The class above works as follows: it enqueues the page with sourceLink and performs a standard BFS: it dequeues a page, checks if it has been visited, checks if it is the destination page, then gets every subpage of that page (using currPage.GetPages()) and adds them to the queue. I believe the algorithm works fine, although it is extremely sluggish and therefore not of any practical use.
My conclusion: it absolutely needs to run asynchronously and in parallel in order to be efficient. I have tried with Tasks, as you can tell, but that doesn't improve the performance at all. My intuition tells me that every time we read the subpages of a page we should do that asynchronously and in parallel, and every time we start crawling a page we should do that asynchronously and in parallel as well. I have no idea how to achieve that; do I need to completely refactor my code? Should I create a new crawler every time I enqueue a subpage?
I'm lost, can you help me figure it out?
You could consider using the new (.NET 6) API Parallel.ForEachAsync. This method accepts an enumerable sequence, and invokes an asynchronous delegate for each element in the sequence, with a specific degree of parallelism. One overload of this method is particularly interesting, because it accepts an IAsyncEnumerable<T> as input, which is essentially an asynchronous stream of data. You could create such a stream dynamically with an iterator method (a method that yields), but it is probably easier to use a Channel<T> that exposes its contents as IAsyncEnumerable<T>. Here is a rough demonstration of this idea:
var channel = Channel.CreateUnbounded<CrawlerPage>();
channel.Writer.TryWrite(new CrawlerPage(sourceLink));
var cts = new CancellationTokenSource();
var options = new ParallelOptions()
{
MaxDegreeOfParallelism = 10,
CancellationToken = cts.Token
};
await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), options, async (page, ct) =>
{
CrawlerPage[] subpages = await GetPagesAsync(page);
foreach (var subpage in subpages) channel.Writer.TryWrite(subpage);
});
The parallel loop will continue crunching pages until the channel.Writer.Complete() method is called and then all remaining pages in the channel are consumed, or until the CancellationTokenSource is canceled.
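One hedged sketch of a completion strategy is to count the in-flight pages and complete the writer when the count drops to zero (GetPagesAsync is the same assumed method as in the snippet above):
int pending = 1; // the source page is already in the channel
await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), options, async (page, ct) =>
{
    CrawlerPage[] subpages = await GetPagesAsync(page);
    foreach (var subpage in subpages)
    {
        Interlocked.Increment(ref pending); // count the new work first...
        channel.Writer.TryWrite(subpage);
    }
    // ...so that pending only reaches zero when no page is left anywhere.
    if (Interlocked.Decrement(ref pending) == 0)
        channel.Writer.Complete();
});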
Calling client.GetAsync(page.mainLink).Result makes your code wait synchronously. Use await client.GetAsync(page.mainLink) instead. If you await properly, you should not need Task.Run here; Task.Run is for offloading synchronous work so that it executes asynchronously.
If you want parallelism, you can await several Tasks using Task.WhenAll.
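For example, the synchronous download inside GetPages could be rewritten along these lines (a sketch only; _httpClient is an assumed shared HttpClient instance and links is a hypothetical IEnumerable<string>, neither of which is in the original code):
private static readonly HttpClient _httpClient = new HttpClient();

private async Task<string> DownloadAsync(string link)
{
    // Await instead of blocking on .Result, and reuse one HttpClient.
    using HttpResponseMessage response = await _httpClient.GetAsync(link);
    return await response.Content.ReadAsStringAsync();
}

// Download several pages in parallel and wait for all of them:
string[] bodies = await Task.WhenAll(links.Select(DownloadAsync));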
I have some time-consuming code in a foreach that uses task/await.
It includes pulling data from the database, generating HTML, POSTing that to an API, and saving the replies to the DB.
A mock-up looks like this
List<label> labels = db.labels.ToList();
foreach (var x in list)
{
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid == y.userid)
                                           .Select(y => y.ID)
                                           .Contains(q.id));
//Render the HTML
//do some fast stuff with objects
List<response> res = await api.sendMessage(object); //POST
//put all the responses in the db
foreach (var r in res)
{
db.responses.add(r);
}
db.SaveChanges();
}
Time-wise, generating the HTML and POSTing it to the API seem to take most of the time.
Ideally it would be great if I could generate the HTML for the next item while waiting for the current POST to finish, before posting the next item.
Other ideas are also welcome.
How would one go about this?
I first thought of adding a Task above the foreach and waiting for it to finish before making the next POST, but then how do I process the last iteration... it feels messy...
You can do it in parallel, but you will need a different context in each Task.
Entity Framework is not thread safe, so you can't use one context in parallel tasks.
var tasks = myLabels.Select(async label =>
{
    using (var db = new MyDbContext())
    {
        // do processing...
        var response = await api.getresponse();
        db.Responses.Add(response);
        await db.SaveChangesAsync();
    }
});
await Task.WhenAll(tasks);
In this case, all tasks will appear to run in parallel, and each task will have its own context.
If you don't create a new context per task, you will get the error mentioned in this question: Does Entity Framework support parallel async queries?
It's more an architecture problem than a code issue here, imo.
You could split your work into two separate parts:
Get data from database and generate HTML
Send API request and save response to database
You could run them both in parallel and use a queue to coordinate that: whenever your HTML is ready, it's added to the queue, and another worker proceeds from there, taking that HTML and sending it to the API.
Both parts can be done in a multithreaded way too, e.g. you can process multiple items from the queue at the same time by having a set of workers look for items in the queue; see the sketch below.
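A minimal sketch of that queue using System.Threading.Channels (the data-access, rendering, POST, and save helpers are hypothetical names standing in for your own code):
var queue = Channel.CreateBounded<string>(100); // HTML payloads waiting to be POSTed

// Producer: pull data, render HTML, enqueue.
var producer = Task.Run(async () =>
{
    foreach (var item in GetItemsFromDatabase()) // hypothetical data access
    {
        string html = RenderHtml(item);          // hypothetical renderer
        await queue.Writer.WriteAsync(html);     // waits when the queue is full
    }
    queue.Writer.Complete();
});

// Consumers: several workers POST and save in parallel.
var consumers = Enumerable.Range(0, 3).Select(_ => Task.Run(async () =>
{
    await foreach (var html in queue.Reader.ReadAllAsync())
    {
        var response = await PostToApiAsync(html); // hypothetical API call
        await SaveToDatabaseAsync(response);       // hypothetical save
    }
})).ToList();

await Task.WhenAll(consumers.Append(producer));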
This screams for the producer/consumer pattern: a producer produces data at a different speed than the consumer consumes it. Once the producer has nothing more to produce, it notifies the consumer that no more data is expected.
MSDN has a nice example of this pattern where several dataflow blocks are chained together: the output of one block is the input of the next.
Walkthrough: Creating a Dataflow Pipeline
The idea is as follows:
Create a class that will generate the HTML.
This class has an object of class System.Threading.Tasks.Dataflow.BufferBlock<T>
An async procedure creates all HTML output and awaits SendAsync to push the data to the buffer block
The buffer block implements the interface ISourceBlock<T>. The class exposes this as a get property:
The code:
class MyProducer<T>
{
    private System.Threading.Tasks.Dataflow.BufferBlock<T> bufferBlock = new BufferBlock<T>();

    public ISourceBlock<T> Output { get { return this.bufferBlock; } }

    public async Task ProcessAsync()
    {
        while (somethingToProduce)
        {
            T producedData = ProduceOutput(...);
            await this.bufferBlock.SendAsync(producedData);
        }
        // no data to send anymore. Mark the output complete:
        this.bufferBlock.Complete();
    }
}
A second class takes this ISourceBlock. It waits at the source block until data arrives and processes it.
do this in an async function
stop when no more data is available
The code:
public class MyConsumer<T>
{
public ISourceBlock<T> Source { get; set; }
public async Task ProcessAsync()
{
while (await this.Source.OutputAvailableAsync())
{ // there is input of type T, read it:
var input = await this.Source.ReceiveAsync();
// process input
}
// if here, no more input expected. finish.
}
}
Now put it together:
private async Task ProduceOutput<T>()
{
var producer = new MyProducer<T>();
var consumer = new MyConsumer<T>() {Source = producer.Output};
var producerTask = Task.Run( () => producer.ProcessAsync());
var consumerTask = Task.Run( () => consumer.ProcessAsync());
// while both tasks are working you can do other things.
// wait until both tasks are finished:
await Task.WhenAll(new Task[] {producerTask, consumerTask});
}
For simplicity I've left out exception handling and cancellation. Stack Overflow has articles about exception handling and cancellation of Tasks:
Keep UI responsive using Tasks, Handle AggregateException
Cancel an Async Task or a List of Tasks
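As a hint of what the cancellation side could look like in the consumer (a sketch only, reusing the MyConsumer class above):
public async Task ProcessAsync(CancellationToken token)
{
    // OutputAvailableAsync and ReceiveAsync both accept a token,
    // so cancellation unblocks the consumer while it is waiting.
    while (await this.Source.OutputAvailableAsync(token))
    {
        var input = await this.Source.ReceiveAsync(token);
        // process input, observing the token in long-running work
    }
}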
This is what I ended up using: (https://stackoverflow.com/a/25877042/275990)
List<ToSend> sendToAPI = new List<ToSend>();
List<label> labels = db.labels.ToList();
foreach (var x in list) {
var myLabels = labels.Where(q => !db.filter.Where(y => x.userid == y.userid)
                                           .Select(y => y.ID)
                                           .Contains(q.id));
//Render the HTML
//do some fast stuff with objects
sendToAPI.Add(theObjectWithHtml); // hypothetical name: the object carrying the rendered HTML
}
int maxParallelPOSTs=5;
await TaskHelper.ForEachAsync(sendToAPI, maxParallelPOSTs, async i => {
using (NasContext db2 = new NasContext()) {
List<response> res = await api.sendMessage(i.object); //POST
//put all the responses in the db
foreach (var r in res)
{
db2.responses.add(r);
}
db2.SaveChanges();
}
});
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body) {
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext()) {
await body(partition.Current).ContinueWith(t => {
if (t.Exception != null) {
string problem = t.Exception.ToString();
}
//observe exceptions
});
}
}));
}
Basically this lets me generate the HTML synchronously, which is fine since it only takes a few seconds to generate thousands, but lets me POST and save to the DB asynchronously, with as many parallel posts as I predefine. In this case I'm posting to the Mandrill API; parallel posts are no problem.
Please, observe the following code snippet:
var result = await GetSource(1000).SelectMany(s => getResultAsync(s).ToObservable()).ToList();
The problem with this code is that getResultAsync runs concurrently in an unconstrained fashion. Which could be not what we want in certain cases. Suppose I want to restrict its concurrency to at most 10 concurrent invocations. What is the Rx.NET way to do it?
I am enclosing a simple console application that demonstrates the subject and my lame solution of the described problem.
There is a bit of extra code, like the Stats class and the artificial random sleeps. They are there to ensure I truly get concurrent execution and can reliably compute the max concurrency reached during the process.
The method RunUnconstrained demonstrates the naive, unconstrained run. The method RunConstrained shows my solution, which is not very elegant. Ideally, I would like to constrain the concurrency by simply applying a dedicated Rx operator to the monad, of course without sacrificing performance.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reactive.Linq;
using System.Reactive.Threading.Tasks;
using System.Threading;
using System.Threading.Tasks;
namespace RxConstrainedConcurrency
{
class Program
{
public class Stats
{
public int MaxConcurrentCount;
public int CurConcurrentCount;
public readonly object MaxConcurrentCountGuard = new object();
}
static void Main()
{
RunUnconstrained().GetAwaiter().GetResult();
RunConstrained().GetAwaiter().GetResult();
}
static async Task RunUnconstrained()
{
await Run(AsyncOp);
}
static async Task RunConstrained()
{
using (var sem = new SemaphoreSlim(10))
{
await Run(async (s, pause, stats) =>
{
// ReSharper disable AccessToDisposedClosure
await sem.WaitAsync();
try
{
return await AsyncOp(s, pause, stats);
}
finally
{
sem.Release();
}
// ReSharper restore AccessToDisposedClosure
});
}
}
static async Task Run(Func<string, int, Stats, Task<int>> getResultAsync)
{
var stats = new Stats();
var rnd = new Random(0x1234);
var result = await GetSource(1000).SelectMany(s => getResultAsync(s, rnd.Next(30), stats).ToObservable()).ToList();
Debug.Assert(stats.CurConcurrentCount == 0);
Debug.Assert(result.Count == 1000);
Debug.Assert(!result.Contains(0));
Debug.WriteLine("Max concurrency = " + stats.MaxConcurrentCount);
}
static IObservable<string> GetSource(int count)
{
return Enumerable.Range(1, count).Select(i => i.ToString()).ToObservable();
}
static Task<int> AsyncOp(string s, int pause, Stats stats)
{
return Task.Run(() =>
{
int cur = Interlocked.Increment(ref stats.CurConcurrentCount);
if (stats.MaxConcurrentCount < cur)
{
lock (stats.MaxConcurrentCountGuard)
{
if (stats.MaxConcurrentCount < cur)
{
stats.MaxConcurrentCount = cur;
}
}
}
try
{
Thread.Sleep(pause);
return int.Parse(s);
}
finally
{
Interlocked.Decrement(ref stats.CurConcurrentCount);
}
});
}
}
}
You can do this in Rx using the overload of Merge that constrains the number of concurrent subscriptions to inner observables.
This form of Merge is applied to a stream of streams.
Ordinarily, using SelectMany to invoke an async task from an event does two jobs: it projects each event into an observable stream whose single event is the result, and it flattens all the resulting streams together.
To use Merge we must use a regular Select to project each event into the invocation of an async task, (thus creating a stream of streams), and use Merge to flatten the result. It will do this in a constrained way by only subscribing to a supplied fixed number of the inner streams at any point in time.
We must be careful to only invoke each asynchronous task invocation upon subscription to the wrapping inner stream. Conversion of an async task to an observable with ToObservable() will actually call the async task immediately, rather than on subscription, so we must defer the evaluation until subscription using Observable.Defer.
Here's an example putting all these steps together:
void Main()
{
var xs = Observable.Range(0, 10); // source events
// "Double" here is our async operation to be constrained,
// in this case to 3 concurrent invocations
xs.Select(x => Observable.Defer(() => Double(x).ToObservable()))
  .Merge(3)
  .Subscribe(Console.WriteLine,
      () => Console.WriteLine("Max: " + MaxConcurrent));
}
private static int Concurrent;
private static int MaxConcurrent;
private static readonly object gate = new Object();
public async Task<int> Double(int x)
{
var concurrent = Interlocked.Increment(ref Concurrent);
lock(gate)
{
MaxConcurrent = Math.Max(concurrent, MaxConcurrent);
}
await Task.Delay(TimeSpan.FromSeconds(1));
Interlocked.Decrement(ref Concurrent);
return x * 2;
}
The maximum concurrency output here will be "3". Remove the Merge to go "unconstrained" and you'll get "10" instead.
Another (equivalent) way of getting the Defer effect that reads a bit nicer is to use FromAsync instead of Defer + ToObservable:
xs.Select(x => Observable.FromAsync(() => Double(x))).Merge(3)
Imagine the following class:
public class Checker
{
public async Task<bool> Check() { ... }
}
Now, imagine a list of instances of this class:
IEnumerable<Checker> checkers = ...
Now I want to control that every instance will return true:
checkers.All(c => c.Check());
Now, this won't compile, since Check() returns a Task<bool> not a bool.
So my question is: How can I best enumerate the list of checkers?
And how can I shortcut the enumeration as soon as a checker returns false?
(something I presume All( ) does already)
"Asynchronous sequences" can always cause some confusion. For example, it's not clear whether your desired semantics are:
Start all checks simultaneously, and evaluate them as they complete.
Start the checks one at a time, evaluating them in sequence order.
There's a third possibility (start all checks simultaneously, and evaluate them in sequence order), but that would be silly in this scenario.
I recommend using Rx for asynchronous sequences. It gives you a lot of options, and while it is a bit hard to learn, it also forces you to think about exactly what you want.
The following code will start all checks simultaneously and evaluate them as they complete:
IObservable<bool> result = checkers.ToObservable()
.SelectMany(c => c.Check()).All(b => b);
It first converts the sequence of checkers to an observable, calls Check on them all, and checks whether they are all true. The first Check that completes with a false value will cause result to produce a false value.
In contrast, the following code will start the checks one at a time, evaluating them in sequence order:
IObservable<bool> result = checkers.Select(c => c.Check().ToObservable())
.Concat().All(b => b);
It first converts the sequence of checkers to a sequence of observables, and then concatenates those sequences (which starts them one at a time).
If you do not wish to use observables much and don't want to mess with subscriptions, you can await them directly. E.g., to call Check on all checkers and evaluate the results as they complete:
bool all = await checkers.ToObservable().SelectMany(c => c.Check()).All(b => b);
And how can I shortcut the enumeration as soon as a checker returns false?
This will check the tasks' result in order of completion. So if task #5 is the first to complete, and returns false, the method returns false immediately, regardless of the other tasks. Slower tasks (#1, #2, etc) would never be checked.
public static async Task<bool> AllAsync(this IEnumerable<Task<bool>> source)
{
var tasks = source.ToList();
while(tasks.Count != 0)
{
var finishedTask = await Task.WhenAny(tasks);
if(! finishedTask.Result)
return false;
tasks.Remove(finishedTask);
}
return true;
}
Usage:
bool result = await checkers.Select(c => c.Check())
.AllAsync();
All wasn't built with async in mind (like all LINQ), so you would need to implement that yourself:
async Task<bool> CheckAll()
{
foreach(var checker in checkers)
{
if (!await checker.Check())
{
return false;
}
}
return true;
}
You could make it more reusable with a generic extension method:
public static async Task<bool> AllAsync<TSource>(this IEnumerable<TSource> source, Func<TSource, Task<bool>> predicate)
{
foreach (var item in source)
{
if (!await predicate(item))
{
return false;
}
}
return true;
}
And use it like this:
var result = await checkers.AllAsync(c => c.Check());
You could do
checkers.All(c => c.Check().Result);
but that would run the tasks synchronously, which may be very slow depending on the implementation of Check().
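If you don't need to stop at the first false result, a simple non-blocking alternative is to await all the checks and then aggregate:
bool[] results = await Task.WhenAll(checkers.Select(c => c.Check()));
bool all = results.All(b => b);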
Here's a fully functional test program, following in the steps of dcastro:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace AsyncCheckerTest
{
public class Checker
{
public int Seconds { get; private set; }
public Checker(int seconds)
{
Seconds = seconds;
}
public async Task<bool> CheckAsync()
{
await Task.Delay(Seconds * 1000);
return Seconds != 3;
}
}
class Program
{
static void Main(string[] args)
{
var task = RunAsync();
task.Wait();
Console.WriteLine("Overall result: " + task.Result);
Console.ReadLine();
}
public static async Task<bool> RunAsync()
{
var checkers = new List<Checker>();
checkers
.AddRange(Enumerable.Range(1, 5)
.Select(i => new Checker(i)));
return await checkers
.Select(c => c.CheckAsync())
.AllAsync();
}
}
public static class ExtensionMethods
{
public static async Task<bool> AllAsync(this IEnumerable<Task<bool>> source)
{
var tasks = source.ToList();
while (tasks.Count != 0)
{
Task<bool> finishedTask = await Task.WhenAny(tasks);
bool checkResult = finishedTask.Result;
if (!checkResult)
{
Console.WriteLine("Completed at " + DateTimeOffset.Now + "...false");
return false;
}
Console.WriteLine("Working... " + DateTimeOffset.Now);
tasks.Remove(finishedTask);
}
return true;
}
}
}
Here's sample output:
Working... 6/27/2014 1:47:35 AM -05:00
Working... 6/27/2014 1:47:36 AM -05:00
Completed at 6/27/2014 1:47:37 AM -05:00...false
Overall result: False
Note that the entire evaluation ended when the exit condition was reached, without waiting for the rest of the tasks to finish.
As a more out-of-the-box alternative, this seems to run the tasks in parallel and return shortly after the first failure:
var allResult = checkers
.Select(c => Task.Factory.StartNew(() => c.Check().Result))
.AsParallel()
.All(t => t.Result);
I'm not too hot on TPL and PLINQ so feel free to tell me what's wrong with this.
I have a simple console app where I want to call many URLs in a loop and put the results in a database table. I am using .NET 4.5 and async I/O to fetch the URL data. Here is a simplified version of what I am doing. All methods are async except for the database operation. Do you guys see any issues with this? Are there better ways of optimizing?
private async Task Run(){
var items = repo.GetItems(); // sync method to get list from database
var tasks = new List<Task>();
// add each call to task list and process result as it becomes available
// rather than waiting for all downloads
foreach(Item item in items){
tasks.Add(GetFromWeb(item.url).ContinueWith(response => { AddToDatabase(response.Result);}));
}
await Task.WhenAll(tasks); // wait for all tasks to complete.
}
private async Task<string> GetFromWeb(string url) {
HttpResponseMessage response = await GetAsync(url);
return await response.Content.ReadAsStringAsync();
}
private void AddToDatabase(string item){
// add data to database.
}
Your solution is acceptable. But you should check out TPL Dataflow, which allows you to set up a dataflow "mesh" (or "pipeline") and then shove the data through it.
For a problem this simple, Dataflow won't really add much other than getting rid of the ContinueWith (I always find manual continuations awkward). But if you plan to add more steps or change your data flow in the future, Dataflow should be something you consider.
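For a taste of it, the same flow as a minimal two-block pipeline might look roughly like this (the block options are illustrative, not prescriptive; it requires the System.Threading.Tasks.Dataflow package):
var download = new TransformBlock<Item, string>(
    item => GetFromWeb(item.url),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

var save = new ActionBlock<string>(data => AddToDatabase(data));

download.LinkTo(save, new DataflowLinkOptions { PropagateCompletion = true });

foreach (Item item in repo.GetItems())
    download.Post(item);
download.Complete();
await save.Completion;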
Your solution is pretty much correct, with just two minor mistakes (both of which cause compiler errors). First, you don't call ContinueWith on the result of List.Add; you need to call ContinueWith on the task and then add the continuation to your list, which is solved by just moving a parenthesis. You also need to call Result on the response Task.
Here is the section with the two minor changes:
tasks.Add(GetFromWeb(item.url)
.ContinueWith(response => { AddToDatabase(response.Result);}));
Another option is to leverage a method that takes a sequence of tasks and orders them by the order that they are completed. Here is my implementation of such a method:
public static IEnumerable<Task<T>> Order<T>(this IEnumerable<Task<T>> tasks)
{
var taskList = tasks.ToList();
var taskSources = new BlockingCollection<TaskCompletionSource<T>>();
var taskSourceList = new List<TaskCompletionSource<T>>(taskList.Count);
foreach (var task in taskList)
{
var newSource = new TaskCompletionSource<T>();
taskSources.Add(newSource);
taskSourceList.Add(newSource);
task.ContinueWith(t =>
{
var source = taskSources.Take();
if (t.IsCanceled)
source.TrySetCanceled();
else if (t.IsFaulted)
source.TrySetException(t.Exception.InnerExceptions);
else if (t.IsCompleted)
source.TrySetResult(t.Result);
}, CancellationToken.None, TaskContinuationOptions.PreferFairness, TaskScheduler.Default);
}
return taskSourceList.Select(tcs => tcs.Task);
}
Using this your code can become:
private async Task Run()
{
IEnumerable<Item> items = repo.GetItems(); // sync method to get list from database
foreach (var task in items.Select(item => GetFromWeb(item.url))
.Order())
{
await task.ConfigureAwait(false);
AddToDatabase(task.Result);
}
}
Just thought I'd throw my hat in as well with an Rx solution:
using System.Reactive.Concurrency;
using System.Reactive.Linq;
using System.Reactive.Threading.Tasks;

private Task Run()
{
    // Project each item into the (already started) download task...
    var fromWebObservable = from item in repo.GetItems().ToObservable(Scheduler.Default)
                            select GetFromWeb(item.url);

    // ...then unwrap each Task<string> as it completes and save the result.
    return fromWebObservable
        .SelectMany(task => task)
        .Do(AddToDatabase)
        .ToTask();
}