Data Propagation in TPL Dataflow Pipeline with BatchBlock.TriggerBatch() - C#

In my Producer-Consumer scenario I have multiple consumers, and each consumer sends an action to external hardware, which may take some time. My pipeline looks somewhat like this:
BatchBlock --> TransformBlock --> BufferBlock --> (Several) ActionBlocks
I have set the BoundedCapacity of my ActionBlocks to 1.
What I want, in theory, is to trigger the BatchBlock to send a group of items to the TransformBlock only when one of my ActionBlocks is available for operation. Until then the BatchBlock should just keep buffering elements and not pass them on to the TransformBlock. My batch sizes are variable. Because a batch size is mandatory, I use a really high upper limit for the BatchBlock's batch size, but I never really want to reach that limit; I would like to trigger my batches depending on the availability of the ActionBlocks performing the actual work.
I have achieved this with the help of the TriggerBatch() method: I call BatchBlock.TriggerBatch() as the last action in my ActionBlocks. Interestingly, after several days of working properly, the pipeline has come to a halt. Upon checking I found out that sometimes the inputs to the BatchBlock come in after the ActionBlocks are done with their work. In that case the ActionBlocks do call TriggerBatch at the end of their work, but since at that point there is no input in the BatchBlock at all, the call to TriggerBatch is fruitless. And when inputs do flow into the BatchBlock a while later, there is no one left to call TriggerBatch and restart the pipeline. I was looking for a way to check whether something is in fact present in the input buffer of the BatchBlock, but there is no such feature available; I also could not find a way to check whether a TriggerBatch call was fruitful.
Could anyone suggest a possible solution to my problem? Unfortunately, using a Timer to trigger batches is not an option for me. Except for the start of the pipeline, the throttling should be governed only by the availability of one of the ActionBlocks.
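The failure mode can be reduced to a minimal, standalone sketch (illustrative only, not the production code): a TriggerBatch call that lands on an empty BatchBlock is a no-op, and items that are posted afterwards simply wait for a full batch.
var batchBlock = new BatchBlock<int>(1000);
var printer = new ActionBlock<int[]>(batch =>
    Console.WriteLine("Got a batch of {0} item(s)", batch.Length));
batchBlock.LinkTo(printer);

batchBlock.TriggerBatch(); // fires while the block is still empty: nothing is produced
batchBlock.Post(1);        // arrives afterwards...
batchBlock.Post(2);        // ...and nobody ever calls TriggerBatch again
// The two items now sit in the BatchBlock until 998 more arrive.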
The example code is here:
static BatchBlock<int> _groupReadTags;
static void Main(string[] args)
{
_groupReadTags = new BatchBlock<int>(1000);
var bufferOptions = new DataflowBlockOptions{BoundedCapacity = 2};
BufferBlock<int> _frameBuffer = new BufferBlock<int>(bufferOptions);
var consumerOptions = new ExecutionDataflowBlockOptions { BoundedCapacity = 1};
int batchNo = 1;
TransformBlock<int[], int> _workingBlock = new TransformBlock<int[], int>(list =>
{
Console.WriteLine("\n\nWorking on Batch Number {0}", batchNo);
//_groupReadTags.TriggerBatch();
int sum = 0;
foreach (int item in list)
{
Console.WriteLine("Elements in batch {0} :: {1}", batchNo, item);
sum += item;
}
batchNo++;
return sum;
});
ActionBlock<int> _worker1 = new ActionBlock<int>(async x =>
{
Console.WriteLine("Number from ONE :{0}",x);
await Task.Delay(500);
Console.WriteLine("BatchBlock Output Count : {0}", _groupReadTags.OutputCount);
_groupReadTags.TriggerBatch();
},consumerOptions);
ActionBlock<int> _worker2 = new ActionBlock<int>(async x =>
{
Console.WriteLine("Number from TWO :{0}", x);
await Task.Delay(2000);
_groupReadTags.TriggerBatch();
}, consumerOptions);
_groupReadTags.LinkTo(_workingBlock);
_workingBlock.LinkTo(_frameBuffer);
_frameBuffer.LinkTo(_worker1);
_frameBuffer.LinkTo(_worker2);
_groupReadTags.Post(10);
_groupReadTags.Post(20);
_groupReadTags.TriggerBatch();
Task postingTask = new Task(() => PostStuff());
postingTask.Start();
Console.ReadLine();
}
static void PostStuff()
{
for (int i = 0; i < 10; i++)
{
_groupReadTags.Post(i);
Thread.Sleep(100);
}
Parallel.Invoke(
() => _groupReadTags.Post(100),
() => _groupReadTags.Post(200),
() => _groupReadTags.Post(300),
() => _groupReadTags.Post(400),
() => _groupReadTags.Post(500),
() => _groupReadTags.Post(600),
() => _groupReadTags.Post(700),
() => _groupReadTags.Post(800)
);
}

Here is an alternative BatchBlock implementation with some extra features. It includes a TriggerBatch method with this signature:
public int TriggerBatch(int nextMinBatchSizeIfEmpty);
Invoking this method will trigger a batch immediately if the input queue is not empty; otherwise it will set a temporary MinBatchSize that affects only the next batch. You could invoke this method with a small value for nextMinBatchSizeIfEmpty to ensure that, in case a batch cannot be produced right now, the next batch is produced sooner than the BatchSize configured in the block's constructor would allow.
This method returns the size of the batch produced. It returns 0 if the input queue is empty, the output queue is full, or the block has completed.
public class BatchBlockEx<T> : ITargetBlock<T>, ISourceBlock<T[]>
{
private readonly ITargetBlock<T> _input;
private readonly IPropagatorBlock<T[], T[]> _output;
private readonly Queue<T> _queue;
private readonly object _locker = new object();
private int _nextMinBatchSize = Int32.MaxValue;
public Task Completion { get; }
public int InputCount { get { lock (_locker) return _queue.Count; } }
public int OutputCount => ((BufferBlock<T[]>)_output).Count;
public int BatchSize { get; }
public BatchBlockEx(int batchSize, DataflowBlockOptions dataflowBlockOptions = null)
{
if (batchSize < 1) throw new ArgumentOutOfRangeException(nameof(batchSize));
dataflowBlockOptions = dataflowBlockOptions ?? new DataflowBlockOptions();
if (dataflowBlockOptions.BoundedCapacity != DataflowBlockOptions.Unbounded &&
dataflowBlockOptions.BoundedCapacity < batchSize)
throw new ArgumentOutOfRangeException(nameof(batchSize),
"Number must be no greater than the value specified in BoundedCapacity.");
this.BatchSize = batchSize;
_output = new BufferBlock<T[]>(dataflowBlockOptions);
_queue = new Queue<T>(batchSize);
_input = new ActionBlock<T>(async item =>
{
T[] batch = null;
lock (_locker)
{
_queue.Enqueue(item);
if (_queue.Count == batchSize || _queue.Count >= _nextMinBatchSize)
{
batch = _queue.ToArray(); _queue.Clear();
_nextMinBatchSize = Int32.MaxValue;
}
}
if (batch != null) await _output.SendAsync(batch).ConfigureAwait(false);
}, new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 1,
CancellationToken = dataflowBlockOptions.CancellationToken
});
var inputContinuation = _input.Completion.ContinueWith(async t =>
{
try
{
T[] batch = null;
lock (_locker)
{
if (_queue.Count > 0)
{
batch = _queue.ToArray(); _queue.Clear();
}
}
if (batch != null) await _output.SendAsync(batch).ConfigureAwait(false);
}
finally
{
if (t.IsFaulted)
{
_output.Fault(t.Exception.InnerException);
}
else
{
_output.Complete();
}
}
}, TaskScheduler.Default).Unwrap();
this.Completion = Task.WhenAll(inputContinuation, _output.Completion);
}
public void Complete() => _input.Complete();
void IDataflowBlock.Fault(Exception ex) => _input.Fault(ex);
public int TriggerBatch(Func<T[], bool> condition, int nextMinBatchSizeIfEmpty)
{
if (nextMinBatchSizeIfEmpty < 1)
throw new ArgumentOutOfRangeException(nameof(nextMinBatchSizeIfEmpty));
int count = 0;
lock (_locker)
{
if (_queue.Count > 0)
{
T[] batch = _queue.ToArray();
if (condition == null || condition(batch))
{
bool accepted = _output.Post(batch);
if (accepted) { _queue.Clear(); count = batch.Length; }
}
_nextMinBatchSize = Int32.MaxValue;
}
else
{
_nextMinBatchSize = nextMinBatchSizeIfEmpty;
}
}
return count;
}
public int TriggerBatch(Func<T[], bool> condition)
=> TriggerBatch(condition, Int32.MaxValue);
public int TriggerBatch(int nextMinBatchSizeIfEmpty)
=> TriggerBatch(null, nextMinBatchSizeIfEmpty);
public int TriggerBatch() => TriggerBatch(null, Int32.MaxValue);
DataflowMessageStatus ITargetBlock<T>.OfferMessage(
DataflowMessageHeader messageHeader, T messageValue,
ISourceBlock<T> source, bool consumeToAccept)
{
return _input.OfferMessage(messageHeader, messageValue, source,
consumeToAccept);
}
T[] ISourceBlock<T[]>.ConsumeMessage(DataflowMessageHeader messageHeader,
ITargetBlock<T[]> target, out bool messageConsumed)
{
return _output.ConsumeMessage(messageHeader, target, out messageConsumed);
}
bool ISourceBlock<T[]>.ReserveMessage(DataflowMessageHeader messageHeader,
ITargetBlock<T[]> target)
{
return _output.ReserveMessage(messageHeader, target);
}
void ISourceBlock<T[]>.ReleaseReservation(DataflowMessageHeader messageHeader,
ITargetBlock<T[]> target)
{
_output.ReleaseReservation(messageHeader, target);
}
IDisposable ISourceBlock<T[]>.LinkTo(ITargetBlock<T[]> target,
DataflowLinkOptions linkOptions)
{
return _output.LinkTo(target, linkOptions);
}
}
Another overload of the TriggerBatch method allows you to examine the batch that can currently be produced, and decide whether or not it should be triggered:
public int TriggerBatch(Func<T[], bool> condition);
The BatchBlockEx class does not support the Greedy and MaxNumberOfGroups options of the built-in BatchBlock.
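A small usage sketch follows (the names mirror the question's pipeline and the values are illustrative, not part of the class above): each worker triggers on completion, and if nothing is buffered yet, nextMinBatchSizeIfEmpty makes the very next item form a batch on its own, so the stall described in the question cannot happen.
var groupReadTags = new BatchBlockEx<int>(1000);

var worker1 = new ActionBlock<int>(async x =>
{
    await Task.Delay(500); // simulate the hardware call
    // Trigger whatever is buffered; if the buffer is empty, let the next
    // single item that arrives become a batch by itself.
    groupReadTags.TriggerBatch(nextMinBatchSizeIfEmpty: 1);
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

// The condition overload can veto a batch, e.g. only release small batches:
int produced = groupReadTags.TriggerBatch(batch => batch.Length <= 10);
Console.WriteLine("Batch of {0} item(s) produced", produced);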

I have found that using TriggerBatch in this way is unreliable:
_groupReadTags.Post(10);
_groupReadTags.Post(20);
_groupReadTags.TriggerBatch();
Apparently TriggerBatch is intended to be used inside the block, not outside it like this. I have seen this result in odd timing issues, like items from the next batch being included in the current batch, even though TriggerBatch was called first.
Please see my answer to this question for an alternative using DataflowBlock.Encapsulate: BatchBlock produces batch with elements sent after TriggerBatch()
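For reference, the general shape of such an encapsulation looks like the following sketch (a hypothetical helper, not the exact code from the linked answer): a custom target does the batching decision itself, a BufferBlock serves as the source, and DataflowBlock.Encapsulate exposes the pair as a single propagator, so no outside TriggerBatch calls are needed.
public static IPropagatorBlock<T, T[]> CreateManualBatchBlock<T>(int batchSize)
{
    var queue = new Queue<T>();
    var source = new BufferBlock<T[]>();
    var target = new ActionBlock<T>(item =>
    {
        queue.Enqueue(item);
        if (queue.Count >= batchSize)
        {
            source.Post(queue.ToArray());
            queue.Clear();
        }
    }); // default options: single-threaded, so the queue needs no locking

    target.Completion.ContinueWith(_ =>
    {
        if (queue.Count > 0) source.Post(queue.ToArray()); // flush the leftovers
        source.Complete(); // (fault propagation omitted for brevity)
    }, TaskScheduler.Default);

    return DataflowBlock.Encapsulate(target, source);
}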


How to close all tasks if at least one is over

There is a decimal parameter. Suppose it is equal to 100. There is a Task that reduces it by 0.1 every 100 ms. As soon as the parameter becomes equal to 1, the task should end and the parameter should not decrease any more. This works without problems if there is only one Task, but if there are 2, 3, 100... then the parameter eventually becomes less than 1. I tried to use a CancellationToken to end all tasks, but the result is still the same. My code:
class Program
{
static decimal param = 100;
static CancellationTokenSource cancelTokenSource = new CancellationTokenSource();
static CancellationToken token;
static void Main(string[] args)
{
int tasksCount = 16;
token = cancelTokenSource.Token;
Console.WriteLine("Start Param = {0}", param);
Console.WriteLine("Tasks Count = {0}", tasksCount);
var tasksList = new List<Task>();
for (var i = 0; i < tasksCount; i++)
{
Task task = new Task(Decrementation, token);
tasksList.Add(task);
}
tasksList.ForEach(x => x.Start());
Task.WaitAny(tasksList.ToArray());
Console.WriteLine("Result = {0}", param);
Console.Read();
}
private static void Decrementation()
{
while (true)
{
if (token.IsCancellationRequested)
{
break;
}
if (CanTakeMore())
{
Task.Delay(100);
param = param - 0.1m;
}
else
{
cancelTokenSource.Cancel();
return;
}
}
}
private static bool CanTakeMore()
{
if (param > 1)
{
return true;
}
else
{
return false;
}
}
}
The output varies between runs, but it is always less than 1. How can I fix this?
Your tasks are checking and modifying the same shared value in parallel, to whatever degree your CPU architecture and/or operating system allows.
Any number of your tasks can get a CanTakeMore() result of true "at the same time" (for example, when they all call it while the shared value is 1.1m), and then every one of them that received true will proceed to decrease the shared value.
This problem can usually be avoided by using a lock statement:
private static object _lockObj = new object();
private static void Decrementation()
{
while (true)
{
if (token.IsCancellationRequested)
{
break;
}
lock (_lockObj)
{
if (CanTakeMore())
{
Task.Delay(100); // Note: this needs an `await`, but the method we're in is NOT `async`...!
param = param - 0.1m;
}
else
{
cancelTokenSource.Cancel();
return;
}
}
}
}
Here is a working .NET Fiddle: https://dotnetfiddle.net/BA6tHX
While your code has other issues (such as the fact that you must await Task.Delay), the fundamental problem is that you must either lock your entire read/write operation, or modify your implementation to enable atomic reads and writes.
One option is to take your incoming decimal and convert it to a 32-bit integer, multiplying the parameter by 10 for each decimal place of precision you need. In this case it would be 100 * 10, since you have 1 place of precision.
This enables you to use Thread.VolatileRead in conjunction with Interlocked.CompareExchange to produce the behavior you are looking for (working example).
void Main()
{
int tasksCount = 16;
token = cancelTokenSource.Token;
Console.WriteLine("Start Param = {0}", param);
Console.WriteLine("Tasks Count = {0}", tasksCount);
var tasksList = new List<Task>();
for (var i = 0; i < tasksCount; i++)
{
Task task = new Task(Decrementation, token);
tasksList.Add(task);
}
tasksList.ForEach(x => x.Start());
Task.WaitAny(tasksList.ToArray());
Console.WriteLine("Result = {0}", param);
Console.Read();
}
static int param = 1000;
static CancellationTokenSource cancelTokenSource = new CancellationTokenSource();
static CancellationToken token;
private static void Decrementation()
{
while (true)
{
if (token.IsCancellationRequested)
{
break;
}
int temp = Thread.VolatileRead(ref param);
if (temp == 1)
{
cancelTokenSource.Cancel();
return;
}
int updatedValue = temp - 1;
if (Interlocked.CompareExchange(ref param, updatedValue, temp) == temp)
{
// the update was successful. Delay (or do additional work)
// this still does nothing
// You might want to make your method async or switch to a timer
Task.Delay(100);
}
}
}
The advantage of Thread.VolatileRead + Interlocked.CompareExchange over a straight lock is that if there is any significant work being done in the lock, this approach will perform significantly better. When benchmarked against the following reasonable locking implementation using decimal subtraction:
private static object _locker = new object();
private static decimal param2 = 100.0m;
private static void DecrementationLock()
{
while (true)
{
if (token.IsCancellationRequested)
{
break;
}
lock (_locker)
{
if (param2 > 1)
{
param2 = param2 - 0.1m;
Task.Delay(100);
}
else
{
cancelTokenSource.Cancel();
return;
}
}
}
}
even though the Task.Delay is not awaited in either case, the lock code is over 2.5x slower. That said, in cases where no work is being done, there is essentially no execution time difference between the two approaches.
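For completeness, here is a sketch of how the delay could actually take effect: make the worker async (the field and token names follow the answer above) and start it with Task.Run instead of the Task constructor, then wait with Task.WhenAny.
private static async Task DecrementationAsync()
{
    while (!token.IsCancellationRequested)
    {
        int temp = Thread.VolatileRead(ref param);
        if (temp <= 1)
        {
            cancelTokenSource.Cancel();
            return;
        }
        if (Interlocked.CompareExchange(ref param, temp - 1, temp) == temp)
        {
            await Task.Delay(100); // the pause now actually happens between decrements
        }
    }
}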

How to keep track of faulted items in TPL pipeline in (thread)safe way

I am using a TPL pipeline design together with Stephen Cleary's Try library. In short, it wraps a value/exception and floats it down the pipeline, so even items that have thrown exceptions inside their processing methods end up with Status = RanToCompletion when I await resultsBlock.Completion. So I need some other way to register faulted items. Here is a small sample:
var downloadBlock = new TransformBlock<int, Try<int>>(construct => Try.Create(() =>
{
//SomeProcessingMethod();
return 1;
}));
var processBlock = new TransformBlock<Try<int>, Try<int>>(construct => construct.Map(value =>
{
//SomeProcessingMethod();
return 1;
}));
var resultsBlock = new ActionBlock<Try<int>>(construct =>
{
if (construct.IsException)
{
var exception = construct.Exception;
switch (exception)
{
case GoogleApiException gex:
//_notificationService.NotifyUser("OMG, my dear sir, I think I messed something up:/"
//Register that this item was faulted, so we know that we need to retry it.
break;
default:
break;
}
}
});
One solution would be to create a List<int> FaultedItems where I would insert all faulted items in my exception-handling block, and then, after await resultsBlock.Completion;, I could check whether the list is non-empty and create a new pipeline for the faulted items. My question is: if I use a List<int>, am I at risk of running into thread-safety problems if I decide to play with the MaxDegreeOfParallelism settings, and would I be better off using some ConcurrentCollection? Or maybe this approach is flawed in some other way?
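As for the narrower thread-safety question: with MaxDegreeOfParallelism greater than 1, appending to a plain List<int> from the ActionBlock delegate would indeed be a data race, so a concurrent collection (or a lock around the list) is needed. A minimal sketch, assuming the item's id travels alongside the Try:
var faultedItems = new ConcurrentQueue<int>();

var resultsBlock = new ActionBlock<(int Id, Try<int> Construct)>(item =>
{
    if (item.Construct.IsException)
        faultedItems.Enqueue(item.Id); // safe to call from multiple threads
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

// after: await resultsBlock.Completion;
// if (!faultedItems.IsEmpty) { /* build the retry pipeline from faultedItems */ }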
I converted a retry-block implementation from an answer to a similar question, to work with Stephen Cleary's Try types as input and output. The method CreateRetryTransformBlock returns a TransformBlock<Try<TInput>, Try<TOutput>>, and the method CreateRetryActionBlock returns something that is practically an ActionBlock<Try<TInput>>.
Three more options are available on top of the standard execution options: MaxAttemptsPerItem, MinimumRetryDelay and MaxRetriesTotal.
public class RetryExecutionDataflowBlockOptions : ExecutionDataflowBlockOptions
{
/// <summary>The limit after which an item is returned as failed.</summary>
public int MaxAttemptsPerItem { get; set; } = 1;
/// <summary>The minimum delay duration before retrying an item.</summary>
public TimeSpan MinimumRetryDelay { get; set; } = TimeSpan.Zero;
/// <summary>The limit after which the block transitions to a faulted
/// state (unlimited is the default).</summary>
public int MaxRetriesTotal { get; set; } = -1;
}
public class RetryLimitException : Exception
{
public RetryLimitException(string message, Exception innerException)
: base(message, innerException) { }
}
public static TransformBlock<Try<TInput>, Try<TOutput>>
CreateRetryTransformBlock<TInput, TOutput>(
Func<TInput, Task<TOutput>> transform,
RetryExecutionDataflowBlockOptions dataflowBlockOptions)
{
if (transform == null) throw new ArgumentNullException(nameof(transform));
if (dataflowBlockOptions == null)
throw new ArgumentNullException(nameof(dataflowBlockOptions));
int maxAttemptsPerItem = dataflowBlockOptions.MaxAttemptsPerItem;
int maxRetriesTotal = dataflowBlockOptions.MaxRetriesTotal;
TimeSpan retryDelay = dataflowBlockOptions.MinimumRetryDelay;
if (maxAttemptsPerItem < 1) throw new ArgumentOutOfRangeException(
nameof(dataflowBlockOptions.MaxAttemptsPerItem));
if (maxRetriesTotal < -1) throw new ArgumentOutOfRangeException(
nameof(dataflowBlockOptions.MaxRetriesTotal));
if (retryDelay < TimeSpan.Zero) throw new ArgumentOutOfRangeException(
nameof(dataflowBlockOptions.MinimumRetryDelay));
var internalCTS = CancellationTokenSource
.CreateLinkedTokenSource(dataflowBlockOptions.CancellationToken);
var maxDOP = dataflowBlockOptions.MaxDegreeOfParallelism;
var taskScheduler = dataflowBlockOptions.TaskScheduler;
var exceptionsCount = 0;
SemaphoreSlim semaphore;
if (maxDOP == DataflowBlockOptions.Unbounded)
{
semaphore = new SemaphoreSlim(Int32.MaxValue);
}
else
{
semaphore = new SemaphoreSlim(maxDOP, maxDOP);
// The degree of parallelism is controlled by the semaphore
dataflowBlockOptions.MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded;
// Use a limited-concurrency scheduler for preserving the processing order
dataflowBlockOptions.TaskScheduler = new ConcurrentExclusiveSchedulerPair(
taskScheduler, maxDOP).ConcurrentScheduler;
}
var block = new TransformBlock<Try<TInput>, Try<TOutput>>(async item =>
{
// Continue on captured context after every await
if (item.IsException) return Try<TOutput>.FromException(item.Exception);
var result1 = await ProcessOnceAsync(item);
if (item.IsException || result1.IsValue) return result1;
for (int i = 2; i <= maxAttemptsPerItem; i++)
{
await Task.Delay(retryDelay, internalCTS.Token);
var result = await ProcessOnceAsync(item);
if (result.IsValue) return result;
}
return result1; // Return the first-attempt exception
}, dataflowBlockOptions);
dataflowBlockOptions.MaxDegreeOfParallelism = maxDOP; // Restore initial value
dataflowBlockOptions.TaskScheduler = taskScheduler; // Restore initial value
_ = block.Completion.ContinueWith(_ => internalCTS.Dispose(),
TaskScheduler.Default);
return block;
async Task<Try<TOutput>> ProcessOnceAsync(Try<TInput> item)
{
await semaphore.WaitAsync(internalCTS.Token);
try
{
var result = await item.Map(transform);
if (item.IsValue && result.IsException)
{
ObserveNewException(result.Exception);
}
return result;
}
finally
{
semaphore.Release();
}
}
void ObserveNewException(Exception ex)
{
if (maxRetriesTotal == -1) return;
uint newCount = (uint)Interlocked.Increment(ref exceptionsCount);
if (newCount <= (uint)maxRetriesTotal) return;
if (newCount == (uint)maxRetriesTotal + 1)
{
internalCTS.Cancel(); // The block has failed
throw new RetryLimitException($"The max retry limit " +
$"({maxRetriesTotal}) has been reached.", ex);
}
throw new OperationCanceledException();
}
}
public static ITargetBlock<Try<TInput>> CreateRetryActionBlock<TInput>(
Func<TInput, Task> action,
RetryExecutionDataflowBlockOptions dataflowBlockOptions)
{
if (action == null) throw new ArgumentNullException(nameof(action));
var block = CreateRetryTransformBlock<TInput, object>(async input =>
{
await action(input).ConfigureAwait(false); return null;
}, dataflowBlockOptions);
var nullTarget = DataflowBlock.NullTarget<Try<object>>();
block.LinkTo(nullTarget);
return block;
}
Usage example:
var downloadBlock = CreateRetryTransformBlock(async (int construct) =>
{
int result = await DownloadAsync(construct);
return result;
}, new RetryExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = 10,
MaxAttemptsPerItem = 3,
MaxRetriesTotal = 100,
MinimumRetryDelay = TimeSpan.FromSeconds(10)
});
var processBlock = new TransformBlock<Try<int>, Try<int>>(
construct => construct.Map(async value =>
{
return await ProcessAsync(value);
}));
downloadBlock.LinkTo(processBlock,
new DataflowLinkOptions() { PropagateCompletion = true });
To keep things simple, in case an item has been retried the maximum number of times, the exception preserved is the first one that occurred; the subsequent exceptions are lost. In most cases the lost exceptions are going to be of the same type as the first one anyway.
Caution: The above implementation does not have an efficient input queue. If you feed this block with millions of items, the memory usage will explode.
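One way to keep that in check (a sketch, assuming the producer can await): give the retry block a BoundedCapacity through its options and feed it with SendAsync, so the producer is throttled instead of piling items up in memory.
static async Task FeedAsync(ITargetBlock<Try<int>> retryBlock, IEnumerable<int> ids)
{
    foreach (var id in ids)
    {
        // Waits whenever the block is full instead of growing an unbounded queue.
        await retryBlock.SendAsync(Try.Create(() => id));
    }
    retryBlock.Complete();
}

// e.g. new RetryExecutionDataflowBlockOptions
//      { BoundedCapacity = 1000, MaxDegreeOfParallelism = 10, MaxAttemptsPerItem = 3 }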

How to check an IEnumerable for multiple conditions with a single enumeration without buffering?

I have a very long sequence of data in the form of an IEnumerable, and I would like to check it against a number of conditions. Each condition returns a value of true or false, and I want to know whether all conditions are true. My problem is that I cannot afford to materialize the IEnumerable by calling ToList, because it is simply too long (> 10,000,000,000 elements). Nor can I afford to enumerate the sequence multiple times, once per condition, because each time I will get a different sequence. I am searching for an efficient way to perform this check, using the existing LINQ functionality if possible.
Clarification: I am asking for a general solution, not for a solution to the specific example problem that is presented below.
Here is a dummy version of my sequence:
static IEnumerable<int> GetLongSequence()
{
var random = new Random();
for (long i = 0; i < 10_000_000_000; i++) yield return random.Next(0, 100_000_000);
}
And here is an example of the conditions that the sequence must satisfy:
var source = GetLongSequence();
var result = source.Any(n => n % 28_413_803 == 0)
&& source.All(n => n < 99_999_999)
&& source.Average(n => n) > 50_000_001;
Unfortunately this approach invokes GetLongSequence three times, so it doesn't satisfy the requirements of the problem.
I tried to write a LINQy extension method for the above, hoping that it could give me some ideas:
public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
params Func<IEnumerable<TSource>, bool>[] conditions)
{
foreach (var condition in conditions)
{
if (!condition(source)) return false;
}
return true;
}
This is how I intend to use it:
var result = source.AllConditions
(
s => s.Any(n => n % 28_413_803 == 0),
s => s.All(n => n < 99_999_999),
s => s.Average(n => n) > 50_000_001,
// more conditions...
);
Unfortunately this offers no improvement. The GetLongSequence is again invoked three times.
After hitting my head against the wall for an hour, without making any progress, I figured out a possible solution. I could run each condition in a separate thread, and synchronize their access to a single shared enumerator of the sequence. So I ended up with this monstrosity:
public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
params Func<IEnumerable<TSource>, bool>[] conditions)
{
var locker = new object();
var enumerator = source.GetEnumerator();
var barrier = new Barrier(conditions.Length);
long index = -1;
bool finished = false;
IEnumerable<TSource> OneByOne()
{
try
{
while (true)
{
TSource current;
lock (locker)
{
if (finished) break;
if (barrier.CurrentPhaseNumber > index)
{
index = barrier.CurrentPhaseNumber;
finished = !enumerator.MoveNext();
if (finished)
{
enumerator.Dispose(); break;
}
}
current = enumerator.Current;
}
yield return current;
barrier.SignalAndWait();
}
}
finally
{
barrier.RemoveParticipant();
}
}
var results = new ConcurrentQueue<bool>();
var threads = conditions.Select(condition => new Thread(() =>
{
var result = condition(OneByOne());
results.Enqueue(result);
})
{ IsBackground = true }).ToArray();
foreach (var thread in threads) thread.Start();
foreach (var thread in threads) thread.Join();
return results.All(r => r);
}
For the synchronization I used a Barrier. This solution actually works way better than I thought: it can process almost 1,000,000 elements per second on my machine. It is not fast enough though, since it needs almost 3 hours to process the full sequence of 10,000,000,000 elements, and I can't wait for the result for longer than 5 minutes. Any ideas about how I could run these conditions efficiently in a single thread?
If you need to ensure that the sequence is enumerated only once, conditions operating on the whole sequence are not useful.
One possibility that comes to my mind is to have an interface which is called for each element of the sequence and implement this interface in different ways for your specific conditions:
bool Example()
{
var source = GetLongSequence();
var conditions = new List<IEvaluate<int>>
{
new Any<int>(n => n % 28_413_803 == 0),
new All<int>(n => n < 99_999_999),
new Average(d => d > 50_000_001)
};
foreach (var item in source)
{
foreach (var condition in conditions)
{
condition.Evaluate(item);
}
}
return conditions.All(c => c.Result);
}
static IEnumerable<int> GetLongSequence()
{
var random = new Random();
for (long i = 0; i < 10_000_000_000; i++) yield return random.Next(0, 100_000_000);
}
interface IEvaluate<T>
{
void Evaluate(T item);
bool Result { get; }
}
class Any<T> : IEvaluate<T>
{
private bool _result;
private readonly Func<T, bool> _predicate;
public Any(Func<T, bool> predicate)
{
_predicate = predicate;
_result = false;
}
public void Evaluate(T item)
{
if (_predicate(item))
{
_result = true;
}
}
public bool Result => _result;
}
class All<T> : IEvaluate<T>
{
private bool _result;
private readonly Func<T, bool> _predicate;
public All(Func<T, bool> predicate)
{
_predicate = predicate;
_result = true;
}
public void Evaluate(T item)
{
if (!_predicate(item))
{
_result = false;
}
}
public bool Result => _result;
}
class Average : IEvaluate<int>
{
private long _sum;
private long _count; // long: the sequence can have more than int.MaxValue elements
Func<double, bool> _evaluate;
public Average(Func<double, bool> evaluate)
{
_evaluate = evaluate; // remember the predicate used by Result
}
public void Evaluate(int item)
{
_sum += item;
_count++;
}
public bool Result => _evaluate((double)_sum / _count);
}
If all you want is to check these three conditions on a single thread in only one enumeration, I wouldn't use LINQ and would manually aggregate the checks:
bool anyVerified = false;
bool allVerified = true;
double averageSoFar = 0;
foreach (int n in GetLongSequence()) {
anyVerified = anyVerified || n % 28_413_803 == 0;
allVerified = allVerified && n < 99_999_999;
averageSoFar += n / 10_000_000_000.0; // divide as double, otherwise integer division yields 0
// Early out conditions here...
}
return anyVerified && allVerified && averageSoFar > 50_000_001;
This could be made more generic if you plan to do these checks often but it looks like it satisfies all your requirements.
May I also suggest another method, based on the Enumerable.Aggregate LINQ extension method.
public static class Parsing {
public static bool ParseOnceAndCheck(this IEnumerable<int> collection, Func<int, bool> all, Func<int, bool> any, Func<double, bool> average) {
// Aggregate the two boolean results, the sum of all values and the count of values...
(bool allVerified, bool anyVerified, long sum, long count) = collection.Aggregate(
ValueTuple.Create(true, false, 0L, 0L),
(tuple, item) => ValueTuple.Create(tuple.Item1 && all(item), tuple.Item2 || any(item), tuple.Item3 + item, tuple.Item4 + 1)
);
// ... and summarizes the result (divide as double to avoid integer division)
return allVerified && anyVerified && average((double)sum / count);
}
}
You could call this extension method in a very similar way to the usual LINQ methods, but there would be only one enumeration of your sequence:
IEnumerable<int> sequence = GetLongSequence();
bool result = sequence.ParseOnceAndCheck(
all: n => n < 99_999_999,
any: n => n % 28_413_803 == 0,
average: a => a > 50_000_001
);
I found a single-threaded solution that uses the Reactive Extensions library. On the one hand it's an excellent solution regarding features and ease of use, since all the methods that are available in LINQ for IEnumerable are also available in Rx for IObservable. On the other hand it is a bit disappointing regarding performance, as it is as slow as the wacky multi-threaded solution presented in my question.
Update: I discarded the previous two implementations (one using the method Replay, the other using the method Publish) in favour of a new one that uses the class Subject. This class is a low-level combination of an IObservable and an IObserver. I post the items of the source IEnumerable to it, and they are then propagated to all the IObservable<bool>s provided by the caller. The performance is now decent, only 40% slower than Klaus Gütter's excellent solution. Also, I can now break out of the loop early if a condition (like All) can be determined to be false before the end of the enumeration.
public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
params Func<IObservable<TSource>, IObservable<bool>>[] conditions)
{
var subject = new Subject<TSource>();
var result = true;
foreach (var condition in conditions)
{
condition(subject).SingleAsync().Subscribe(onNext: value =>
{
if (value) return;
result = false;
});
}
foreach (var item in source)
{
if (!result) break;
subject.OnNext(item);
}
return result;
}
Usage example:
var result = source.AllConditions
(
o => o.Any(n => n % 28_413_803 == 0),
o => o.All(n => n < 99_999_999),
o => o.Average(n => n).Select(v => v > 50_000_001)
);
Each condition should return an IObservable containing a single boolean value. This is not enforceable by the Rx API, so I used the System.Reactive.Linq.SingleAsync method to enforce it at runtime (by throwing an exception if a result doesn't comply with this contract).
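For example, a condition about the total number of elements also has to be expressed as a single-element IObservable<bool> (a sketch; the threshold is illustrative):
var result = source.AllConditions
(
    o => o.Any(n => n % 28_413_803 == 0),
    o => o.Count().Select(c => c > 1_000_000) // Count() emits exactly one value
);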

C# - Lock not working but Volatile AND Lock does?

I am trying to poll an API as fast and as efficiently as possible to get market data. The API allows you to get market data from batchSize markets per request. The API allows you to have 3 concurrent requests but no more (or throws errors).
I may be requesting data from many more than batchSize different markets.
I continuously loop through all of the markets, requesting the data in batches, one batch per thread and 3 threads at any time.
The total number of markets (and hence batches) can change at any time.
I'm using the following code:
private static object lockObj = new object();
private void PollMarkets()
{
const int NumberOfConcurrentRequests = 3;
for (int i = 0; i < NumberOfConcurrentRequests; i++)
{
int batch = 0;
Task.Factory.StartNew(async () =>
{
while (true)
{
if (markets.Count > 0)
{
List<string> batchMarketIds;
lock (lockObj)
{
var numBatches = (int)Math.Ceiling((double)markets.Count / batchSize);
batchMarketIds = markets.Keys.Skip(batch*batchSize).Take(batchSize).ToList();
batch = (batch + 1) % numBatches;
}
var marketData = await GetMarketData(batchMarketIds);
// Do something with marketData
}
else
{
await Task.Delay(1000); // wait for some markets to be added.
}
}
});
}
}
Even though there is a lock in the critical section, each thread starts with batch = 0 (each thread is often polling for duplicate data).
If I change batch to a private volatile field the above code works as I want it to (volatile and lock).
So for some reason my lock doesn't work? I feel like it's something obvious but I'm missing it.
I believe that it is best here to use a lock instead of a volatile field, is this also correct?
Thanks
The issue was that you were defining the batch variable inside the for loop. That meant that the threads were using their own variable instead of sharing it.
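The same pitfall can be shown in isolation (a standalone sketch, unrelated to the market API): each iteration of the loop creates a brand-new local, so every task counts from zero unless the counter is hoisted out of the loop into a shared field.
using System;
using System.Threading.Tasks;

class ClosureDemo
{
    static readonly object Gate = new object();
    static int sharedCounter = 0; // a field: shared by all tasks

    static void Main()
    {
        var tasks = new Task[3];
        for (int i = 0; i < tasks.Length; i++)
        {
            int localCounter = 0; // declared inside the loop: each task gets its own copy
            tasks[i] = Task.Run(() =>
            {
                int localNext, sharedNext;
                lock (Gate)
                {
                    localNext = localCounter++;   // every task starts again from 0
                    sharedNext = sharedCounter++; // keeps advancing across tasks
                }
                Console.WriteLine($"local: {localNext}, shared: {sharedNext}");
            });
        }
        Task.WaitAll(tasks);
    }
}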
In my mind, you should use a Queue<> to create a jobs pipeline.
Something like this:
private int batchSize = 10;
private Queue<int> queue = new Queue<int>();
private void AddMarket(params int[] marketIDs)
{
lock (queue)
{
foreach (var marketID in marketIDs)
{
queue.Enqueue(marketID);
}
if (queue.Count >= batchSize)
{
Monitor.Pulse(queue);
}
}
}
private void Start()
{
for (var tid = 0; tid < 3; tid++)
{
Task.Run(async () =>
{
while (true)
{
List<int> toProcess;
lock (queue)
{
if (queue.Count < batchSize)
{
Monitor.Wait(queue);
continue;
}
toProcess = new List<int>(batchSize);
for (var count = 0; count < batchSize; count++)
{
toProcess.Add(queue.Dequeue());
}
if (queue.Count >= batchSize)
{
Monitor.Pulse(queue);
}
}
var marketData = await GetMarketData(toProcess);
}
});
}
}
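Assuming the two methods above live in a class (call it MarketPoller, a hypothetical name) and are made accessible, usage might look like this sketch, with batchSize = 10 as defined above:
var poller = new MarketPoller();                     // hypothetical host class for the code above
poller.Start();                                      // spins up the 3 consumers
poller.AddMarket(Enumerable.Range(1, 25).ToArray()); // 2 full batches of 10 are dispatched,
                                                     // 5 ids wait for the next Pulse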

Using Parallel Linq Extensions to union two sequences, how can one yield the fastest results first?

Let's say I have two sequences returning integers 1 to 5.
The first returns 1, 2 and 3 very fast, but 4 and 5 take 200ms each.
public static IEnumerable<int> FastFirst()
{
for (int i = 1; i < 6; i++)
{
if (i > 3) Thread.Sleep(200);
yield return i;
}
}
The second returns 1, 2 and 3 with a 200ms delay, but 4 and 5 are returned fast.
public static IEnumerable<int> SlowFirst()
{
for (int i = 1; i < 6; i++)
{
if (i < 4) Thread.Sleep(200);
yield return i;
}
}
Unioning both these sequences gives me just the numbers 1 to 5.
FastFirst().Union(SlowFirst());
I cannot guarantee which of the two methods has delays at which points, so the order of execution cannot guarantee a solution for me. Therefore, I would like to parallelise the union, in order to minimise the (artificial) delay in my example.
A real-world scenario: I have a cache that returns some entities, and a datasource that returns all entities. I'd like to be able to return an iterator from a method that internally parallelises the request to both the cache and the datasource so that the cached results yield as fast as possible.
Note 1: I realise this is still wasting CPU cycles; I'm not asking how can I prevent the sequences from iterating over their slow elements, just how I can union them as fast as possible.
Update 1: I've tailored achitaka-san's great response to accept multiple producers, and to use ContinueWhenAll to set the BlockingCollection's CompleteAdding just once. I'm putting it here since it would get lost in the comments, which lack formatting. Any further feedback would be great!
public static IEnumerable<TResult> SelectAsync<TResult>(
params IEnumerable<TResult>[] producer)
{
var resultsQueue = new BlockingCollection<TResult>();
var taskList = new HashSet<Task>();
foreach (var result in producer)
{
taskList.Add(
Task.Factory.StartNew(
() =>
{
foreach (var product in result)
{
resultsQueue.Add(product);
}
}));
}
Task.Factory.ContinueWhenAll(taskList.ToArray(), x => resultsQueue.CompleteAdding());
return resultsQueue.GetConsumingEnumerable();
}
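A usage sketch for this helper, combined with a uniqueness filter (Distinct is lazy, so it streams results as they arrive; FastFirst and SlowFirst are the sequences from the question):
foreach (var value in SelectAsync(FastFirst(), SlowFirst()).Distinct())
{
    Console.WriteLine(value); // each value printed once, as soon as the first producer yields it
}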
Take a look at this.
The first method just returns everything in the order the results come. The second checks uniqueness.
If you chain them you will get the result you want, I think.
public static class Class1
{
public static IEnumerable<TResult> SelectAsync<TResult>(
IEnumerable<TResult> producer1,
IEnumerable<TResult> producer2,
int capacity)
{
var resultsQueue = new BlockingCollection<TResult>(capacity);
var producer1Done = false;
var producer2Done = false;
Task.Factory.StartNew(() =>
{
foreach (var product in producer1)
{
resultsQueue.Add(product);
}
producer1Done = true;
if (producer1Done && producer2Done) { resultsQueue.CompleteAdding(); }
});
Task.Factory.StartNew(() =>
{
foreach (var product in producer2)
{
resultsQueue.Add(product);
}
producer2Done = true;
if (producer1Done && producer2Done) { resultsQueue.CompleteAdding(); }
});
return resultsQueue.GetConsumingEnumerable();
}
public static IEnumerable<TResult> SelectAsyncUnique<TResult>(this IEnumerable<TResult> source)
{
HashSet<TResult> knownResults = new HashSet<TResult>();
foreach (TResult result in source)
{
if (knownResults.Contains(result)) {continue;}
knownResults.Add(result);
yield return result;
}
}
}
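Chained together as suggested, a usage sketch (using the FastFirst/SlowFirst sequences from the question) might look like this:
foreach (var value in Class1
    .SelectAsync(FastFirst(), SlowFirst(), capacity: 10)
    .SelectAsyncUnique())
{
    Console.WriteLine(value); // each of 1..5 exactly once, in arrival order
}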
The cache would be nearly instant compared to fetching from the database, so you could read from the cache first and return those items, then read from the database and return the items except those that were found in the cache.
If you try to parallelise this, you will add a lot of complexity but get quite a small gain.
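A sketch of that sequential approach (hypothetical cache/database sources; items are compared by their own equality):
public static IEnumerable<T> CachedThenFresh<T>(
    IEnumerable<T> cacheItems, IEnumerable<T> databaseItems)
{
    var seen = new HashSet<T>();
    foreach (var item in cacheItems)
        if (seen.Add(item)) yield return item;  // fast: served from the cache
    foreach (var item in databaseItems)
        if (seen.Add(item)) yield return item;  // only what the cache didn't have
}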
Edit:
If there is no predictable difference in the speed of the sources, you could run them in threads and use a synchronised hash set to keep track of which items you have already got, put the new items in a queue, and let the main thread read from the queue:
public static IEnumerable<TItem> GetParallel<TItem, TKey>(Func<TItem, TKey> getKey, params IEnumerable<TItem>[] sources) {
HashSet<TKey> found = new HashSet<TKey>();
List<TItem> queue = new List<TItem>();
object sync = new object();
int alive = 0;
object aliveSync = new object();
foreach (IEnumerable<TItem> source in sources) {
lock (aliveSync) {
alive++;
}
new Thread(s => {
foreach (TItem item in s as IEnumerable<TItem>) {
TKey key = getKey(item);
lock (sync) {
if (found.Add(key)) {
queue.Add(item);
}
}
}
lock (aliveSync) {
alive--;
}
}).Start(source);
}
while (true) {
lock (sync) {
if (queue.Count > 0) {
foreach (TItem item in queue) {
yield return item;
}
queue.Clear();
}
}
lock (aliveSync) {
if (alive == 0) break;
}
Thread.Sleep(100);
}
}
Test stream:
public static IEnumerable<int> SlowRandomFeed(Random rnd) {
int[] values = new int[100];
for (int i = 0; i < 100; i++) {
int pos = rnd.Next(i + 1);
values[i] = i;
int temp = values[pos];
values[pos] = values[i];
values[i] = temp;
}
foreach (int value in values) {
yield return value;
Thread.Sleep(rnd.Next(200));
}
}
Test:
Random rnd = new Random();
foreach (int item in GetParallel(n => n, SlowRandomFeed(rnd), SlowRandomFeed(rnd), SlowRandomFeed(rnd), SlowRandomFeed(rnd))) {
Console.Write("{0:0000 }", item);
}
