TPL Dataflow: Bounded capacity and waiting for completion - c#

Below I have replicated a real life scenario as a LINQPad script for the sake of simplicity:
var total = 1 * 1000 * 1000;
var cts = new CancellationTokenSource();
var threads = Environment.ProcessorCount;
int capacity = 10;
var edbOptions = new ExecutionDataflowBlockOptions{BoundedCapacity = capacity, CancellationToken = cts.Token, MaxDegreeOfParallelism = threads};
var dbOptions = new DataflowBlockOptions {BoundedCapacity = capacity, CancellationToken = cts.Token};
var gdbOptions = new GroupingDataflowBlockOptions {BoundedCapacity = capacity, CancellationToken = cts.Token};
var dlOptions = new DataflowLinkOptions {PropagateCompletion = true};
var counter1 = 0;
var counter2 = 0;
var delay1 = 10;
var delay2 = 25;
var action1 = new Func<IEnumerable<string>, Task>(async x => {await Task.Delay(delay1); Interlocked.Increment(ref counter1);});
var action2 = new Func<IEnumerable<string>, Task>(async x => {await Task.Delay(delay2); Interlocked.Increment(ref counter2);});
var actionBlock1 = new ActionBlock<IEnumerable<string>>(action1, edbOptions);
var actionBlock2 = new ActionBlock<IEnumerable<string>>(action2, edbOptions);
var batchBlock1 = new BatchBlock<string>(5, gdbOptions);
var batchBlock2 = new BatchBlock<string>(5, gdbOptions);
batchBlock1.LinkTo(actionBlock1, dlOptions);
batchBlock2.LinkTo(actionBlock2, dlOptions);
var bufferBlock1 = new BufferBlock<string>(dbOptions);
var bufferBlock2 = new BufferBlock<string>(dbOptions);
bufferBlock1.LinkTo(batchBlock1, dlOptions);
bufferBlock2.LinkTo(batchBlock2, dlOptions);
var bcBlock = new BroadcastBlock<string>(x => x, dbOptions);
bcBlock.LinkTo(bufferBlock1, dlOptions);
bcBlock.LinkTo(bufferBlock2, dlOptions);
var mainBlock = new TransformBlock<int, string>(x => x.ToString(), edbOptions);
mainBlock.LinkTo(bcBlock, dlOptions);
mainBlock.Dump("Main Block");
bcBlock.Dump("Broadcast Block");
bufferBlock1.Dump("Buffer Block 1");
bufferBlock2.Dump("Buffer Block 2");
actionBlock1.Dump("Action Block 1");
actionBlock2.Dump("Action Block 2");
foreach(var i in Enumerable.Range(1, total))
await mainBlock.SendAsync(i, cts.Token);
mainBlock.Complete();
await Task.WhenAll(actionBlock1.Completion, actionBlock2.Completion);
counter1.Dump("Counter 1");
counter2.Dump("Counter 2");
I have two issues with this code:
Although I limited BoundedCapacity of all appropriate blocks to 10 elements, it seems like I can push all 1,000,000 messages almost at once. Is this expected behavior?
Although the entire network is configured to propagate completion, it seems like all blocks complete almost immediately after calling mainBlock.Complete(). I expect both counter1 and counter2 to end up equal to total. Is there a way to achieve such behavior?

Yes, this is the expected behavior, because of the BroadcastBlock:
Provides a buffer for storing at most one element at time, overwriting each message with the next as it arrives.
This means that if you link BroadcastBlock to blocks with BoundedCapacity, you will lose messages.
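The message loss can be observed with a small sketch like the one below. The class and method names (BroadcastLossDemo, RunAsync) are made up for illustration, and the exact number of processed items depends on timing, so the code only reports the count rather than promising a specific value.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class, not part of the question's code.
public static class BroadcastLossDemo
{
    // Posts `total` items through a BroadcastBlock into one slow, bounded consumer
    // and returns how many items that consumer actually processed.
    public static async Task<int> RunAsync(int total)
    {
        var processed = 0;
        var slowConsumer = new ActionBlock<int>(
            async _ =>
            {
                await Task.Delay(10);                 // simulate slow work
                Interlocked.Increment(ref processed);
            },
            new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

        var broadcast = new BroadcastBlock<int>(x => x);
        broadcast.LinkTo(slowConsumer, new DataflowLinkOptions { PropagateCompletion = true });

        // The broadcast block accepts every item right away; while the consumer
        // is full, newer items overwrite the one it has not picked up yet.
        for (var i = 0; i < total; i++)
            broadcast.Post(i);

        broadcast.Complete();
        await slowConsumer.Completion;
        return processed;
    }
}
```

Calling `await BroadcastLossDemo.RunAsync(100)` should typically return a number well below 100, because the producer is never slowed down by the bounded consumer.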
To fix that, you could create a custom block that behaves like BroadcastBlock but guarantees delivery to all targets. Doing that is not trivial, so you might be satisfied with a simpler variant (originally from my old answer):
public static ITargetBlock<T> CreateGuaranteedBroadcastBlock<T>(
IEnumerable<ITargetBlock<T>> targets, DataflowBlockOptions options)
{
var targetsList = targets.ToList();
var block = new ActionBlock<T>(
async item =>
{
foreach (var target in targetsList)
{
await target.SendAsync(item);
}
}, new ExecutionDataflowBlockOptions
{
BoundedCapacity = options.BoundedCapacity,
CancellationToken = options.CancellationToken
});
block.Completion.ContinueWith(task =>
{
foreach (var target in targetsList)
{
if (task.Exception != null)
target.Fault(task.Exception);
else
target.Complete();
}
});
return block;
}
Usage in your case would be:
var bcBlock = CreateGuaranteedBroadcastBlock(
new[] { bufferBlock1, bufferBlock2 }, dbOptions);
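Note that the helper returns an ITargetBlock<T>, not a source, so the targets are passed in up front and the bcBlock.LinkTo(...) calls from the question are no longer needed. Here is a minimal end-to-end sketch under that assumption; the helper is repeated so the snippet compiles on its own, and GuaranteedBroadcastDemo, RunAsync, and the two consumers are illustrative names, not part of the original network.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class wrapping the helper from the answer.
public static class GuaranteedBroadcastDemo
{
    public static ITargetBlock<T> CreateGuaranteedBroadcastBlock<T>(
        IEnumerable<ITargetBlock<T>> targets, DataflowBlockOptions options)
    {
        var targetsList = targets.ToList();
        var block = new ActionBlock<T>(
            async item =>
            {
                foreach (var target in targetsList)
                    await target.SendAsync(item);     // waits while a bounded target is full
            },
            new ExecutionDataflowBlockOptions
            {
                BoundedCapacity = options.BoundedCapacity,
                CancellationToken = options.CancellationToken
            });

        // Forward completion (or the fault) to every target.
        block.Completion.ContinueWith(task =>
        {
            foreach (var target in targetsList)
            {
                if (task.Exception != null) target.Fault(task.Exception);
                else target.Complete();
            }
        });
        return block;
    }

    // Sends `total` items to two bounded consumers of different speed and
    // returns how many items each of them processed.
    public static async Task<(int First, int Second)> RunAsync(int total)
    {
        int first = 0, second = 0;
        var bounded = new ExecutionDataflowBlockOptions { BoundedCapacity = 2 };
        var consumer1 = new ActionBlock<int>(
            async _ => { await Task.Delay(1); Interlocked.Increment(ref first); }, bounded);
        var consumer2 = new ActionBlock<int>(
            async _ => { await Task.Delay(2); Interlocked.Increment(ref second); }, bounded);

        var broadcast = CreateGuaranteedBroadcastBlock(
            new ITargetBlock<int>[] { consumer1, consumer2 },
            new DataflowBlockOptions { BoundedCapacity = 2 });

        for (var i = 0; i < total; i++)
            await broadcast.SendAsync(i);             // back pressure reaches the producer

        broadcast.Complete();
        await Task.WhenAll(consumer1.Completion, consumer2.Completion);
        return (first, second);
    }
}
```

With this wiring, `await GuaranteedBroadcastDemo.RunAsync(50)` should report 50 for both consumers, since every SendAsync waits until the target has room instead of dropping the item.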

Related

paging over all ingested docs in elastic search

I am trying to use primitive code like this:
var pageSize = 100;
var startPosition = 0;
do
{
var searchResponse = client.Search<Bla>(s => s
.Index(indexName)
.Query(q => q.MatchAll()
).From(startPosition).Size(pageSize)
);
startPosition = startPosition + pageSize;
} while (true);
to page over all ingested documents. This breaks the server as the requests are too frequent I believe. I could slow things down by going to sleep for a few milliseconds, but I think this would still not be best practice.
I know there is also the concept of scrolling. How would I use this in my scenario, where I would like to act upon each page's result?
PS:
static void Main(string[] args)
{
var indexName = "document";
var client = GetClient(indexName);
var pageSize = 1000;
var numberOfSlices = 4;
var scrollObserver = client.ScrollAll<Document>("1m", numberOfSlices, s => s
.MaxDegreeOfParallelism(numberOfSlices)
.Search(search => search
.Index(indexName).MatchAll()
.Size(pageSize)
)
).Wait(TimeSpan.FromMinutes(60), r =>
{
// do something with documents from a given response.
var documents = r.SearchResponse.Documents.ToList();
Console.WriteLine(documents[0].Id);
});
}
I am familiar with the observer pattern but not sure what exactly these components mean:
"1m"
numberOfSlices
TimeSpan.FromMinutes(60)
Something along those lines seems to work:
const string indexName = "bla";
var client = GetClient(indexName);
const int scrollTimeout = 1000;
var initialResponse = client.Search<Document>
(scr => scr.Index(indexName)
.From(0)
.Take(100)
.MatchAll()
.Scroll(scrollTimeout))
;
List<XYZ> results;
results = new List<XYZ>();
if (!initialResponse.IsValid || string.IsNullOrEmpty(initialResponse.ScrollId))
throw new Exception(initialResponse.ServerError.Error.Reason);
if (initialResponse.Documents.Any())
results.AddRange(initialResponse.Documents);
var scrollid = initialResponse.ScrollId;
bool isScrollSetHasData = true;
while (isScrollSetHasData)
{
var loopingResponse = client.Scroll<XYZ>(scrollTimeout, scrollid);
if (loopingResponse.IsValid)
{
results.AddRange(loopingResponse.Documents);
scrollid = loopingResponse.ScrollId;
}
isScrollSetHasData = loopingResponse.Documents.Any();
// do some amazing stuff
}
client.ClearScroll(new ClearScrollRequest(scrollid));

TPL Dataflow - block not processing as expected

I have a set of simple blocks which are mostly processed serially, but I have two blocks which I want to process in parallel (processBlock1 and processBlock2). I have only just started playing around with TPL Dataflow, so I am new to it.
However, in the code below I can see processBlock1 being called but never processBlock2. I was hoping they would both be kicked off in parallel.
class Program
{
static void Main(string[] args)
{
var readBlock = new TransformBlock<int, int>(x => DoSomething(x, "readBlock"),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 }); //1
var processBlock1 =
new TransformBlock<int, int>(x => DoSomething(x, "processBlock1")); //2
var processBlock2 =
new TransformBlock<int, int>(x => DoSomething(x, "processBlock2")); //3
var saveBlock =
new ActionBlock<int>(
x => Save(x)); //4
readBlock.LinkTo(processBlock1,
new DataflowLinkOptions { PropagateCompletion = true }); //5
readBlock.LinkTo(processBlock2,
new DataflowLinkOptions { PropagateCompletion = true }); //6
processBlock1.LinkTo(
saveBlock); //7
processBlock2.LinkTo(
saveBlock); //8
readBlock.Post(1); //10
Task.WhenAll(
processBlock1.Completion,
processBlock2.Completion)
.ContinueWith(_ => saveBlock.Complete()); //11
readBlock.Complete(); //12
saveBlock.Completion.Wait(); //13
Console.WriteLine("Processing complete!");
Console.ReadLine();
}
private static int DoSomething(int i, string method)
{
Console.WriteLine($"Do Something, calling method : { method}");
return i;
}
private static async Task<int> DoSomethingAsync(int i, string method)
{
DoSomething(i, method);
return i;
}
private static void Save(int i)
{
Console.WriteLine("Save!");
}
}
By default, a TPL Dataflow block offers each message to its linked targets in link order and hands it only to the first one that accepts it.
Use a BroadcastBlock to send a message to many components.
void Main()
{
var random = new Random();
var readBlock = new TransformBlock<int, int>(x => { return DoSomething(x, "readBlock"); },
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 }); //1
var broadcastBlock = new BroadcastBlock<int>(i => i); // ⬅️ Here
var processBlock1 =
new TransformBlock<int, int>(x => DoSomething(x, "processBlock1")); //2
var processBlock2 =
new TransformBlock<int, int>(x => DoSomething(x, "processBlock2")); //3
var saveBlock =
new ActionBlock<int>(
x => Save(x)); //4
readBlock.LinkTo(broadcastBlock, new DataflowLinkOptions { PropagateCompletion = true });
broadcastBlock.LinkTo(processBlock1,
new DataflowLinkOptions { PropagateCompletion = true }); //5
broadcastBlock.LinkTo(processBlock2,
new DataflowLinkOptions { PropagateCompletion = true }); //6
processBlock1.LinkTo(
saveBlock); //7
processBlock2.LinkTo(
saveBlock); //8
readBlock.Post(1); //10
readBlock.Post(2); //10
Task.WhenAll(
processBlock1.Completion,
processBlock2.Completion)
.ContinueWith(_ => saveBlock.Complete());
readBlock.Complete(); //12
saveBlock.Completion.Wait(); //13
Console.WriteLine("Processing complete!");
}
// Define other methods and classes here
private static int DoSomething(int i, string method)
{
Console.WriteLine($"Do Something, calling method : { method} {i}");
return i;
}
private static Task<int> DoSomethingAsync(int i, string method)
{
DoSomething(i, method);
return Task.FromResult(i);
}
private static void Save(int i)
{
Console.WriteLine("Save! " + i);
}
It appears that you're posting only one item to the graph, and the first consumer to consume it wins. There's no implied 'tee' functionality in the graph you've made--so there's no possible parallelism there.
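The "first consumer wins" behavior is easy to reproduce in isolation. The sketch below uses made-up names (LinkOrderDemo, RunAsync) and a plain BufferBlock as the source; both targets are unbounded, so the first linked target never declines a message.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class, not taken from the question.
public static class LinkOrderDemo
{
    // Links one source to two unbounded targets and returns how many
    // messages each target received.
    public static async Task<(int First, int Second)> RunAsync()
    {
        int first = 0, second = 0;
        var source = new BufferBlock<int>();
        var target1 = new ActionBlock<int>(_ => { Interlocked.Increment(ref first); });
        var target2 = new ActionBlock<int>(_ => { Interlocked.Increment(ref second); });

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        source.LinkTo(target1, linkOptions);   // linked first, so it is offered every message first
        source.LinkTo(target2, linkOptions);   // only sees messages that target1 declines

        source.Post(1);
        source.Post(2);
        source.Post(3);
        source.Complete();

        await Task.WhenAll(target1.Completion, target2.Completion);
        return (first, second);
    }
}
```

Here target1 should end up with all three messages and target2 with none, which matches what the question observed with processBlock1 and processBlock2.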

Blocking collection when collect results inside ActionBlock

I think in the test method the "results" collection variable has to be of type BlockingCollection<int> instead of List<int>. Prove it to me if I am wrong. I have taken this example from https://blog.stephencleary.com/2012/11/async-producerconsumer-queue-using.html
private static async Task Produce(BufferBlock<int> queue, IEnumerable<int> values)
{
foreach (var value in values)
{
await queue.SendAsync(value);
}
}
public async Task ProduceAll(BufferBlock<int> queue)
{
var producer1 = Produce(queue, Enumerable.Range(0, 10));
var producer2 = Produce(queue, Enumerable.Range(10, 10));
var producer3 = Produce(queue, Enumerable.Range(20, 10));
await Task.WhenAll(producer1, producer2, producer3);
queue.Complete();
}
[TestMethod]
public async Task ConsumerReceivesCorrectValues()
{
var results = new List<int>();
// Define the mesh.
var queue = new BufferBlock<int>(new DataflowBlockOptions { BoundedCapacity = 5, });
var consumerOptions = new ExecutionDataflowBlockOptions { BoundedCapacity = 1, };
var consumer = new ActionBlock<int>(x => results.Add(x), consumerOptions);
queue.LinkTo(consumer, new DataflowLinkOptions { PropagateCompletion = true, });
// Start the producers.
var producers = ProduceAll(queue);
// Wait for everything to complete.
await Task.WhenAll(producers, consumer.Completion);
// Ensure the consumer got what the producer sent.
Assert.IsTrue(results.OrderBy(x => x).SequenceEqual(Enumerable.Range(0, 30)));
}
Since ActionBlock<T> restricts its delegate to one-execution-at-a-time by default (MaxDegreeOfParallelism of 1), it is not necessary to use BlockingCollection<T> instead of List<T>.
The test in your code passes just fine for me, as expected.
If ActionBlock<T> were passed an option with a higher MaxDegreeOfParallelism, then you would need to protect the List<T> or replace it with a BlockingCollection<T>.
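For the higher-parallelism case, one option is to keep the List<int> and guard it with a lock. This is a minimal sketch with made-up names (ParallelCollectDemo, RunAsync, gate); a concurrent collection would work as well.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class, not from the blog post.
public static class ParallelCollectDemo
{
    // Collects 30 values through an ActionBlock that runs up to four
    // delegates at once, protecting the shared list with a lock.
    public static async Task<List<int>> RunAsync()
    {
        var results = new List<int>();
        var gate = new object();

        var consumer = new ActionBlock<int>(
            x =>
            {
                lock (gate)                  // List<T> is not safe for concurrent Add calls
                {
                    results.Add(x);
                }
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        foreach (var value in Enumerable.Range(0, 30))
            await consumer.SendAsync(value);

        consumer.Complete();
        await consumer.Completion;
        return results;                      // order is not guaranteed with parallelism > 1
    }
}
```

As in the original test, sort the results before comparing them with the expected sequence, since parallel execution does not preserve arrival order in the list.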

Performance in compare collections method

I have a method Comparer, with which I compare some properties of the objects in two collections.
public IEnumerable<Product> Comparer(IEnumerable<Product> collection, IEnumerable<Product> target, string comparissonKey)
{
var count = 0;
var stopwatch = Stopwatch.StartNew();
var result = new ConcurrentBag<Product>();
var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 };
Parallel.ForEach(collection, parallelOptions, obj =>
{
count++;
if (count == 60000)
{
stopwatch.Stop();
//breakpoint
var aux = stopwatch.Elapsed;
}
var comparableObj = obj;
comparableObj.IsDifferent = false;
bool hasTargetObject = false;
comparableObj.Exist = true;
Product objTarget = null;
foreach (Product p in target)
{
if (obj.Key == p.Key)
{
objTarget = p;
break;
}
}
if (objTarget != null)
{
//Do stuff
}
if (hasTargetObject) return;
if (comparableObj.IsDifferent)
{
//Do Stuff
}
});
return result.ToList();
}
If I execute the method like this, it takes almost 50 seconds to reach the breakpoint at the aux variable.
If I comment out the second foreach (inside the Parallel.ForEach), it reaches the breakpoint in less than 1 second.
I need to find the corresponding object in the target collection using the Key, so I made the second foreach. I used LINQ where clause but I got no better results. Any suggestions to improve this method performance?
You can improve performance by using a dictionary:
public IEnumerable<Product> Comparer(IEnumerable<Product> collection, IEnumerable<Product> target, string comparissonKey)
{
var count = 0;
var stopwatch = Stopwatch.StartNew();
var result = new ConcurrentBag<Product>();
var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 2 };
// create a dictionary for fast lookup
var targetDictionary = target.ToDictionary(p => p.Key);
Parallel.ForEach(collection, parallelOptions, obj =>
{
// count is shared across threads, so increment it atomically
if (Interlocked.Increment(ref count) == 60000)
{
stopwatch.Stop();
//breakpoint
var aux = stopwatch.Elapsed;
}
var comparableObj = obj;
comparableObj.IsDifferent = false;
bool hasTargetObject = false;
comparableObj.Exist = true;
Product objTarget = null;
// lookup using dictionary
if (targetDictionary.TryGetValue(obj.Key, out objTarget))
{
//Do stuff
}
if (hasTargetObject) return;
if (comparableObj.IsDifferent)
{
//Do Stuff
}
});
return result.ToList();
}
If Key really is a unique key, use a HashSet<Product>: it has IntersectWith, which is very fast.
http://msdn.microsoft.com/en-us/library/bb359438.aspx
On your Product class you will need to override GetHashCode and Equals. Base GetHashCode on the Key, and always override GetHashCode whenever you override Equals.
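The HashSet suggestion could look roughly like this. The Product class here is a stripped-down stand-in with only a Key property (the real class has more members), and IntersectDemo and CommonByKey are names invented for the sketch.

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-in for the real Product class; only Key takes part in equality.
public sealed class Product
{
    public string Key { get; set; }

    public override bool Equals(object obj) =>
        obj is Product other && Key == other.Key;

    // Must agree with Equals: equal keys produce equal hash codes.
    public override int GetHashCode() =>
        Key == null ? 0 : Key.GetHashCode();
}

public static class IntersectDemo
{
    // Returns the products from `collection` whose Key also occurs in `target`.
    public static HashSet<Product> CommonByKey(
        IEnumerable<Product> collection, IEnumerable<Product> target)
    {
        var common = new HashSet<Product>(collection);
        common.IntersectWith(target);        // keeps only elements that are also in target
        return common;
    }
}
```

One caveat with this approach: the resulting set holds the instances from collection, not the matching instances from target. If you need to compare properties of both objects, the dictionary lookup from the other answer gives you the target object directly.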

Switch async Task to sync task

I have the following code:
Task.Factory.ContinueWhenAll(items.Select(p =>
{
return CreateItem(p);
}).ToArray(), completedTasks => { Console.WriteLine("completed"); });
Is it possible to convert ContinueWhenAll to a synchronous method? I want to switch back between async and sync.
Edit: I should mention that each of the "tasks" in the ContinueWhenAll call should execute synchronously.
If you want to leave your existing code intact and have a variable option of executing synchronously you should make these changes:
bool isAsync = false; // some flag to check for async operation
var batch = Task.Factory.ContinueWhenAll(items.Select(p =>
{
return CreateItem(p);
}).ToArray(), completedTasks => { Console.WriteLine("completed"); });
if (!isAsync)
batch.Wait();
This way you can toggle it programmatically instead of by editing your source code. And you can keep the continuation code the same for both methods.
Edit:
Here is a simple pattern for having the same method represented as a synchronous and async version:
public Item CreateItem(string name)
{
return new Item(name);
}
public Task<Item> CreateItemAsync(string name)
{
return Task.Factory.StartNew(() => CreateItem(name));
}
Unless I am mistaken, this is what you're looking for:
Task.WaitAll(tasks);
//continuation code here
I think you can try this, using TaskContinuationOptions for a simple scenario:
var taskFactory = new TaskFactory(TaskScheduler.Default);
var random = new Random();
var tasks = Enumerable.Range(1, 30).Select(p => {
return taskFactory.StartNew(() => {
var timeout = random.Next(5, p * 50);
Thread.Sleep(timeout / 2);
Console.WriteLine(@" 1: ID = " + p);
return p;
}).ContinueWith(t => {
Console.WriteLine(@"* 2: ID = " + t.Result);
}, TaskContinuationOptions.ExecuteSynchronously);
}).ToArray();
Task.WaitAll(tasks);
Or use TPL Dataflow for a more complex scenario:
var step2 = new ActionBlock<int>(i => {
Thread.Sleep(i);
Console.WriteLine(@"* 2: ID = " + i);
}, new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = 1,
//MaxMessagesPerTask = 1
});
var random = new Random();
var tasks = Enumerable.Range(1, 50).Select(p => {
return Task.Factory.StartNew(() => {
var timeout = random.Next(5, p * 50);
Thread.Sleep(timeout / 2);
Console.WriteLine(@" 1: ID = " + p);
return p;
}).ContinueWith(t => {
Thread.Sleep(t.Result);
step2.Post(t.Result);
});
}).ToArray();
await Task.WhenAll(tasks).ContinueWith(t => step2.Complete());
await step2.Completion;
