TPL Complete vs Completion - c#

I need to import customer-related data from a legacy DB and perform several transformations along the way. This means a single entry triggers additional "events" (synchronize products, create invoices, etc.).
My initial solution was a simple parallel approach.
It works okay, but sometimes it has issues. If the customers currently being processed need to wait for the same type of events, their processing queues might get stuck and eventually time out, causing all dependent events to fail too (they depend on the one that failed). It doesn't happen all the time, but it's annoying.
So I got another idea: work in batches. I mean limiting not only the number of customers being processed at the same time, but also the number of events that are broadcast to the queues. While searching around for ideas, I found this answer, which points to TPL Dataflow.
I made a skeleton to get familiar with it. I set up a simple pipeline, but I'm a bit confused about when to call Complete() and when to await Completion.
The steps are the following:
Make a list of numbers (the ids of the customers to be imported) - this is outside the import logic; it is just there to be able to trigger the rest of the logic
Create a BatchBlock (to be able to limit the number of customers processed at the same time)
Create a single MyClass1 item based on the id (TransformBlock<int, MyClass1>)
Perform some logic and generate a collection of MyClass2 (TransformManyBlock<MyClass1, MyClass2>) - as an example, sleep for 1 second
Perform some logic on every item of the collection (ActionBlock<MyClass2>) - as an example, sleep for 1 second
Here's the full code:
public static class Program
{
private static void Main(string[] args)
{
var batchBlock = new BatchBlock<int>(2);
for (var i = 1; i < 10; i++)
{
batchBlock.Post(i);
}
batchBlock.Complete();
while (batchBlock.TryReceive(null, out var ids))
{
var transformBlock = new TransformBlock<int, MyClass1>(delegate (int id)
{
Console.WriteLine($"TransformBlock(id: {id})");
return new MyClass1(id, "Star Wars");
});
var transformManyBlock = new TransformManyBlock<MyClass1, MyClass2>(delegate (MyClass1 myClass1)
{
Console.WriteLine($"TransformManyBlock(myClass1: {myClass1.Id}|{myClass1.Value})");
Thread.Sleep(1000);
return GetMyClass22Values(myClass1);
});
var actionBlock = new ActionBlock<MyClass2>(delegate (MyClass2 myClass2)
{
Console.WriteLine($"ActionBlock(myClass2: {myClass2.Id}|{myClass2.Value})");
Thread.Sleep(1000);
});
transformBlock.LinkTo(transformManyBlock);
transformManyBlock.LinkTo(actionBlock);
foreach (var id in ids)
{
transformBlock.Post(id);
}
// this is the point when I'm not 100% sure
//transformBlock.Complete();
//transformManyBlock.Complete();
//transformManyBlock.Completion.Wait();
actionBlock.Complete();
actionBlock.Completion.Wait();
}
Console.WriteLine();
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
private static IEnumerable<MyClass2> GetMyClass22Values(MyClass1 myClass1)
{
return new List<MyClass2>
{
new MyClass2(1, myClass1.Id+ " did this"),
new MyClass2(2, myClass1.Id+ " did that"),
new MyClass2(3, myClass1.Id+ " did this again")
};
}
}
public class MyClass1
{
public MyClass1(int id, string value)
{
Id = id;
Value = value;
}
public int Id { get; set; }
public string Value { get; set; }
}
public class MyClass2
{
public MyClass2(int id, string value)
{
Id = id;
Value = value;
}
public int Id { get; set; }
public string Value { get; set; }
}
So the point I struggle with is the end, where I'd need to call Complete() or wait for Completion. I can't seem to find the right combination. I'd like to see an output as follows:
TransformBlock(id: 1)
TransformBlock(id: 2)
TransformManyBlock(myClass1: 1|Star Wars)
TransformManyBlock(myClass1: 2|Star Wars)
ActionBlock(myClass2: 1|1 did this)
ActionBlock(myClass2: 2|1 did that)
ActionBlock(myClass2: 3|1 did this again)
ActionBlock(myClass2: 1|2 did this)
ActionBlock(myClass2: 2|2 did that)
ActionBlock(myClass2: 3|2 did this again)
TransformBlock(id: 3)
TransformBlock(id: 4)
TransformManyBlock(myClass1: 3|Star Wars)
TransformManyBlock(myClass1: 4|Star Wars)
ActionBlock(myClass2: 1|3 did this)
ActionBlock(myClass2: 2|3 did that)
ActionBlock(myClass2: 3|3 did this again)
ActionBlock(myClass2: 1|4 did this)
ActionBlock(myClass2: 2|4 did that)
ActionBlock(myClass2: 3|4 did this again)
[the rest of the items]
Press any key to continue...
Can anyone point me in the right direction?

You're almost there. You need to call Complete on the first block in the pipeline and then await Completion on the last block. In your links you also need to propagate completion, like this:
// An async entry point must return Task (C# 7.1 or later); "async void Main" does not compile.
private static async Task Main(string[] args) {
var transformBlock = new TransformBlock<int, MyClass1>(delegate (int id)
{
Console.WriteLine($"TransformBlock(id: {id})");
return new MyClass1(id, "Star Wars");
});
var transformManyBlock = new TransformManyBlock<MyClass1, MyClass2>(delegate (MyClass1 myClass1)
{
Console.WriteLine($"TransformManyBlock(myClass1: {myClass1.Id}|{myClass1.Value})");
Thread.Sleep(1000);
return GetMyClass22Values(myClass1);
});
var actionBlock = new ActionBlock<MyClass2>(delegate (MyClass2 myClass2)
{
Console.WriteLine($"ActionBlock(myClass2: {myClass2.Id}|{myClass2.Value})");
Thread.Sleep(1000);
});
//propagate completion
transformBlock.LinkTo(transformManyBlock, new DataflowLinkOptions() { PropagateCompletion = true });
transformManyBlock.LinkTo(actionBlock, new DataflowLinkOptions() { PropagateCompletion = true});
// ids is the batch of customer ids received from the BatchBlock in the question
foreach(var id in ids) {
transformBlock.Post(id);
}
//Complete the first block
transformBlock.Complete();
//wait for completion to flow to the last block
await actionBlock.Completion;
}
You can also incorporate the batch block into your pipeline and remove the need for the TryReceive call, but that seems like another part of your flow.
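As a rough sketch of that idea (not the answerer's code), the BatchBlock can be linked directly to a block that handles one batch at a time, with completion propagated through the link. The names BatchedPipelineSketch and RunAsync are made up for illustration, and the per-batch work is reduced to recording the batch:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BatchedPipelineSketch
{
    // Groups the ids into batches and returns the batches in the order they were processed.
    public static async Task<List<int[]>> RunAsync(IEnumerable<int> ids, int batchSize)
    {
        var processed = new List<int[]>();

        var batchBlock = new BatchBlock<int>(batchSize);

        // ActionBlock processes one message at a time by default,
        // so a batch is finished before the next one starts.
        var processBatch = new ActionBlock<int[]>(batch =>
        {
            // Placeholder for the real per-batch work (assumption for this sketch).
            processed.Add(batch);
        });

        // Completing batchBlock will complete processBatch once it has drained.
        batchBlock.LinkTo(processBatch, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var id in ids)
        {
            batchBlock.Post(id);
        }

        // Complete the first block, then wait for the last one.
        batchBlock.Complete();
        await processBatch.Completion;

        return processed;
    }
}
```

With ids 1 to 5 and a batch size of 2, the action block should see the batches [1, 2], [3, 4] and [5]; the final short batch is emitted when Complete() is called on the BatchBlock.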
Edit
Example of propagating completion to multiple blocks:
public static async Task Main(string[] args) { // an async entry point must return Task, not void
var sourceBlock = new BufferBlock<int>();
var processBlock1 = new ActionBlock<int>(i => Console.WriteLine($"Block1 {i}"));
var processBlock2 = new ActionBlock<int>(i => Console.WriteLine($"Block2 {i}"));
sourceBlock.LinkTo(processBlock1);
sourceBlock.LinkTo(processBlock2);
var sourceBlockCompletion = sourceBlock.Completion.ContinueWith(tsk => {
if(!tsk.IsFaulted) {
processBlock1.Complete();
processBlock2.Complete();
} else {
((IDataflowBlock)processBlock1).Fault(tsk.Exception);
((IDataflowBlock)processBlock2).Fault(tsk.Exception);
}
});
//Send some data...
sourceBlock.Complete();
await Task.WhenAll(sourceBlockCompletion, processBlock1.Completion, processBlock2.Completion);
}

Related

DataflowBlock ITargetSource.AsObservable() not triggering OnNext()

I'm trying to use a dataflow block and I need to spy on the items passing through for unit testing.
In order to do this, I'm using the AsObservable() method on the ISourceBlock<T> of my TransformBlock<TInput, T>,
so I can check after execution that each block of my pipeline has generated the expected values.
Pipeline
{
...
var observer = new MyObserver<string>();
_block = new TransformManyBlock<string, string>(MyHandler, options);
_block.LinkTo(_nextBlock);
_block.AsObservable().Subscribe(observer);
_block.Post("Test");
...
}
MyObserver
public class MyObserver<T> : IObserver<T>
{
public List<Exception> Errors = new List<Exception>();
public bool IsComplete = false;
public List<T> Values = new List<T>();
public void OnCompleted()
{
IsComplete = true;
}
public void OnNext(T value)
{
Values.Add(value);
}
public void OnError(Exception e)
{
Errors.Add(e);
}
}
So basically I subscribe my observer to the TransformBlock, and I expect each value passing through to be registered in my observer's Values list.
But while IsComplete is set to true and OnError() successfully registers exceptions,
the OnNext() method never gets called unless it is the last block of the pipeline...
I can't figure out why, because the _nextBlock linked to this source block successfully receives the data, proving that some data is exiting the block.
From what I understand, AsObservable is supposed to report every value exiting the block, not only the values that have not been consumed by other linked blocks...
What am I doing wrong?
Your messages are being consumed by _nextBlock before you get a chance to read them.
If you comment out the line _block.LinkTo(_nextBlock); it would likely work.
The sole purpose of AsObservable is to allow a block to be consumed from Rx. It doesn't change the internal workings of the block to broadcast messages to multiple targets. For that you need a special block: BroadcastBlock.
I would suggest broadcasting to another block and using that block to Subscribe.
BroadcastBlock’s mission in life is to enable all targets linked from
the block to get a copy of every element published
var options = new DataflowLinkOptions {PropagateCompletion = true};
var broadcastBlock = new BroadcastBlock<string>(x => x);
var bufferBlock = new BufferBlock<string>();
var actionBlock = new ActionBlock<string>(s => Console.WriteLine("Action " + s));
broadcastBlock.LinkTo(bufferBlock, options);
broadcastBlock.LinkTo(actionBlock, options);
bufferBlock.AsObservable().Subscribe(s => Console.WriteLine("peek " + s));
for (var i = 0; i < 5; i++)
await broadcastBlock.SendAsync(i.ToString());
broadcastBlock.Complete();
await actionBlock.Completion;
Output
peek 0
Action 0
Action 1
Action 2
Action 3
Action 4
peek 1
peek 2
peek 3
peek 4

How do I iterate over two lists using Foreach.Parallel

I am facing quite a struggle: I want to iterate over a list using Parallel.ForEach.
So picture this:
static List<string> proxyList = new List<string>();
static List<string> websiteList = new List<string>();
Each list is looking something like this
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
And the website list
https://google.com
https://spotify.com
https://duckduckgo.com
https://amazon.com
I want to achieve something like this, but I have no idea how. No matter how I twist and turn it, I can't seem to find the logic.
foreach (var proxy in proxyList)
{
    if (proxyIsAlive)
    {
        // Try to connect to the first website in the website list
    }
    else
    {
        // Try the next proxy until I get a working one
        // and then try to connect to the most recent one
    }
}
The issue I am facing is that I have no idea how to access the websites in the website list, I want to connect to
EDIT: this is what my logic looks like so far
private static void Connect()
{
string tproxy = "";
int port;
foreach (var website in websiteList)
{
foreach (var proxy in proxyList)
{
var proxySplit = proxy.Split(':');
tproxy = proxySplit[0];
port = Convert.ToInt32(proxySplit[1]);
//if(ProxyIsAlive)
}
//Use that proxy down here to connect
}
}
I only want to move out of the proxy foreach IF ProxyIsAlive returns true
Notes:
This uses nested Parallel.ForEach loops, as per your original question.
It assumes you want to check the last good proxy first.
If a good proxy is found, it updates the success time.
It processes each website and proxy in parallel.
Note: Parallel.ForEach is suited to CPU-bound tasks. You need to be
careful that you aren't just wasting resources by blocking threads that wait for
I/O operations to complete.
Class to hold proxy info
public class Proxy
{
public string Host { get; set; }
public int Port { get; set; }
public DateTime LastSuccess { get; set; }
public Proxy(string value)
{
var proxySplit = value.Split(':');
Host = proxySplit[0];
Port = Convert.ToInt32(proxySplit[1]);
LastSuccess = DateTime.MinValue;
}
}
Code to run in parallel
var proxies = proxyList.Select(x => new Proxy(x)).ToList();
Parallel.ForEach(websiteList, new ParallelOptions { MaxDegreeOfParallelism = 4 }, site =>
{
Parallel.ForEach(proxies.OrderByDescending(x => x.LastSuccess), new ParallelOptions { MaxDegreeOfParallelism = 4 }, proxy =>
{
if(!CheckProxy(proxy))
{
//check next proxy
return;
}
// if we found a good proxy
// update the lastSuccess so we check that first
proxy.LastSuccess = DateTime.Now;
// do something to the website
});
});
Note: This may not be the best approach. If you have CPU-bound code,
parallelism is appropriate; if you have I/O-bound code, asynchrony is
appropriate. In this case, HttpClientExtension.GetHttpResponse is
clearly I/O, so the ideal consuming code would be asynchronous.
I would consider reading up on parallel execution for I/O-bound operations.
Existing SO questions deal with this, such as:
Parallel execution for IO bound operations
Parallel.ForEach vs Async Forloop in Heavy I/O Ops
So this is what I ended up doing.
private static void ConnectToWebsite()
{
    foreach (var website in websiteList)
    {
        foreach (var proxy in proxyList)
        {
            var proxySplit = proxy.Split(':');
            var proxyIP = proxySplit[0];
            // Skip entries whose port is not a valid number
            if (!int.TryParse(proxySplit[1], out var port))
                continue;
            // getCMYIP is defined elsewhere in the original program (not shown here)
            if (HttpClientExtension.GetHttpResponse(getCMYIP, proxyIP, port))
            {
                Console.WriteLine(website + proxy);
                break; // stop at the first working proxy for this website
            }
        }
    }
}
That will check proxies until it finds a working one.
Now I need to make this async to speed things up.
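A minimal async sketch of the same loop, assuming a hypothetical isAliveAsync probe in place of HttpClientExtension.GetHttpResponse (an async counterpart of that method is not shown in this thread). The names ProxyProbeSketch, FindFirstAliveAsync and AssignAsync are invented for illustration. Websites are handled concurrently, throttled by a SemaphoreSlim, and each lookup stops at the first proxy that responds:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ProxyProbeSketch
{
    // Tries the proxies in order and returns the first one the probe accepts, or null if none does.
    public static async Task<string> FindFirstAliveAsync(
        IEnumerable<string> proxies, Func<string, Task<bool>> isAliveAsync)
    {
        foreach (var proxy in proxies)
        {
            if (await isAliveAsync(proxy))
                return proxy;
        }
        return null;
    }

    // Picks a proxy for every website, running at most maxConcurrency lookups at a time.
    public static async Task<Dictionary<string, string>> AssignAsync(
        IReadOnlyList<string> websites,
        IReadOnlyList<string> proxies,
        Func<string, Task<bool>> isAliveAsync,
        int maxConcurrency)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrency))
        {
            var lookups = websites.Select(async website =>
            {
                await throttle.WaitAsync();
                try
                {
                    var proxy = await FindFirstAliveAsync(proxies, isAliveAsync);
                    return new KeyValuePair<string, string>(website, proxy);
                }
                finally
                {
                    throttle.Release();
                }
            });

            // No thread is blocked while the probes are waiting on I/O.
            var pairs = await Task.WhenAll(lookups);
            return pairs.ToDictionary(p => p.Key, p => p.Value);
        }
    }
}
```

The probe is passed in as a delegate so the sketch stays independent of any particular HTTP client; in the original program it would wrap whatever asynchronous request checks the proxy.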

DropQueue mechanism for RX.net

I came across a back pressure issue with RX.net I can't find a solution for. I have an observable real-time stream of log messages.
var logObservable = /* Observable stream of log messages */
I want to expose it via a TCP interface that serializes the real-time log messages from logObservable before they are sent over the wire. So I do the following:
foreach (var message in logObservable.ToEnumerable())
{
// 1. Serialize message
// 2. Send it over the wire.
}
The problem arises with .ToEnumerable() in a back-pressure scenario, e.g. if the client on the other end pauses the stream. .ToEnumerable() caches the items, which results in a lot of memory usage. I'm looking for a mechanism like a DropQueue that only buffers, let's say, the last 10 messages, e.g.:
var observableStream = logObservable.DropQueue(10).ToEnumerable();
Is this the right way to solve this issue? And do you know how to implement such a mechanism to avoid possible back-pressure issues?
My DropQueue implementation:
public static IEnumerable<TSource> ToDropQueue<TSource>(
this IObservable<TSource> source,
int queueSize,
Action backPressureNotification = null,
CancellationToken token = default(CancellationToken))
{
var queue = new BlockingCollection<TSource>(new ConcurrentQueue<TSource>(), queueSize);
var isBackPressureNotified = false;
var subscription = source.Subscribe(
item =>
{
var isBackPressure = queue.Count == queue.BoundedCapacity;
if (isBackPressure)
{
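// Drop the oldest item to make space for the next one. TryTake returns
// immediately, whereas Take() would block if the consumer emptied the queue in the meantime.
TSource dropped;
queue.TryTake(out dropped);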
// Fire back-pressure notification if defined
if (!isBackPressureNotified && backPressureNotification != null)
{
backPressureNotification();
isBackPressureNotified = true;
}
}
else
{
isBackPressureNotified = false;
}
queue.Add(item);
},
exception => queue.CompleteAdding(),
() => queue.CompleteAdding());
token.Register(() => { subscription.Dispose(); });
using (new CompositeDisposable(subscription, queue))
{
foreach (var item in queue.GetConsumingEnumerable())
{
yield return item;
}
}
}
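For comparison, newer .NET versions ship System.Threading.Channels, whose bounded channels can drop the oldest item when they are full. This is a different technique from the BlockingCollection code above and is shown only as a sketch; DropOldestSketch and KeepLast are illustrative names, and in the Rx scenario the TryWrite call would sit in the subscription's OnNext handler:

```csharp
using System.Collections.Generic;
using System.Threading.Channels;

public static class DropOldestSketch
{
    // Pushes all items through a bounded channel and returns what a slow reader would still find.
    public static List<int> KeepLast(IEnumerable<int> items, int capacity)
    {
        var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity)
        {
            // When the channel is full, the oldest buffered item is discarded
            // to make room for the item being written.
            FullMode = BoundedChannelFullMode.DropOldest
        });

        foreach (var item in items)
        {
            // With DropOldest the write never waits for the reader.
            channel.Writer.TryWrite(item);
        }
        channel.Writer.Complete();

        // Simulate a reader that only starts consuming after the burst.
        var kept = new List<int>();
        while (channel.Reader.TryRead(out var item))
        {
            kept.Add(item);
        }
        return kept;
    }
}
```

Writing the numbers 1 to 25 into a channel of capacity 10 leaves 16 to 25 for the reader.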

how to wait for several callbacks to be received

I have such code:
public void IssueOrders(List<OrderAction> actions)
{
foreach (var action in actions)
{
if (action is AddOrder)
{
uint userId = apiTransactions.PlaceOrder((action as AddOrder).order);
Console.WriteLine("order is placing userId = " + userId);
}
// TODO: implement other actions
}
// how to wait until OnApiTransactionsDataMessageReceived for all userId is received?
// TODO: need to update actions with received data here
}
private void OnApiTransactionsDataMessageReceived(object sender, DataMessageReceivedEventArgs e)
{
var dataMsg = e.message;
var userId = dataMsg.UserId;
apiTransactions.PlaceOrder is asynchronous: it returns a userId immediately, but the data arrives later in the OnApiTransactionsDataMessageReceived callback.
So, for example, if I place 3 orders, I will receive 3 userIds, for example 1, 3, and 4. Now I need to wait until the data for all of these userIds has been received.
userId is always increasing, if that matters. It is almost a sequence of consecutive integers, but some numbers may be omitted due to parallel execution.
UPD Note:
IssueOrders can be executed in parallel from different threads
the callback may be called BEFORE PlaceOrder returns
UPD2
Likely I need to refactor PlaceOrder code below so I can guarantee that userId is known before "callback" is received:
public uint PlaceOrder(Order order)
{
Publisher pub = GetPublisher();
SchemeDesc schemeDesc = pub.Scheme;
MessageDesc messageDesc = schemeDesc.Messages[0]; //AddMM
FieldDesc fieldDesc = messageDesc.Fields[3];
Message sendMessage = pub.NewMessage(MessageKeyType.KeyName, "FutAddOrder");
DataMessage smsg = (DataMessage)sendMessage;
uint userId = counter.Next();
FillDataMessageWithPlaceOrder(smsg, order, userId);
System.Console.WriteLine("posting message dump: {0}", sendMessage);
pub.Post(sendMessage, PublishFlag.NeedReply);
sendMessage.Dispose();
return userId;
}
So I need to split PlaceOrder into two methods: CreateOrder, which returns the userId, and PostOrder, which returns void. This guarantees that I know the userId by the time the callback is received.
I'd check out the ForkJoin method in the Reactive Framework. It will block until multiple async calls have completed.
Edit: It seems that ForkJoin() was only ever included in an experimental release of Rx. Here's a discussion of what you want based on Merge().
One of the silliest approaches that still works would be:
public void IssueOrders(List<OrderAction> actions)
{
var userIds = new List<uint>();
lock(theHashMap)
theHashMap[userIds] = "blargh";
foreach (var action in actions)
{
if (action is AddOrder)
{
lock(userIds)
{
uint userId = apiTransactions.PlaceOrder((action as AddOrder).order);
Console.WriteLine("order is placing userId = " + userId);
userIds.Add(userId);
}
}
// TODO: implement other actions
}
// waiting:
do
{
lock(userIds)
if(userIds.Count == 0)
break;
Thread.Sleep(???); // adjust the time depending on how long you wait for a callback on average
}while(true);
lock(theHashMap)
theHashMap.Remove(userIds);
// now you have the guarantee that all were received
}
private Dictionary<List<uint>, string> theHashMap = new Dictionary<List<uint>,string>();
private void OnApiTransactionsDataMessageReceived(object sender, DataMessageReceivedEventArgs e)
{
var dataMsg = e.message;
var userId = dataMsg.UserId;
// do some other things
lock(theHashMap)
foreach(var list in theHashMap.Keys)
lock(list)
if(list.Remove(userId))
break;
}
But this is a quite crude approach. It's hard to suggest anything more unless you explain what you mean by "wait", as Jon asked in the comments. For example, you might want to leave IssueOrders, wait somewhere else, and just be sure that some extra job is done when all replies have arrived. Or maybe you cannot leave IssueOrders until all are received? Etc.
Edit: please note that near the Add, the lock must be taken before PlaceOrder; otherwise, when the callback arrives very quickly, it may attempt to remove the ID before it is added. Also note that this implementation is very naive: the callback must search and lock through all the lists each time. With a few additional dictionaries/maps/indexes it could be optimized a lot, but I did not do that here for readability.
In case you are able to change the API, consider using the Task Parallel Library; your code will get much simpler with that.
Otherwise a ManualResetEvent might help you:
private Dictionary<uint, ManualResetEvent> m_Events = new Dictionary<uint, ManualResetEvent>();
public void IssueOrders(List<OrderAction> actions)
{
foreach (var action in actions)
{
if (action is AddOrder)
{
uint userId = apiTransactions.PlaceOrder((action as AddOrder).order);
// Attention: Race condition if PlaceOrder finishes
// before the MRE is created and added to the dictionary!
m_Events[userId] = new ManualResetEvent(false);
Console.WriteLine("order is placing userId = " + userId);
}
// TODO: implement other actions
}
// WaitAll expects a WaitHandle[]; ToArray() requires "using System.Linq;"
WaitHandle.WaitAll(m_Events.Values.ToArray());
// TODO: Dispose the created MREs
}
private void OnApiTransactionsDataMessageReceived(object sender, DataMessageReceivedEventArgs e)
{
var dataMsg = e.message;
var userId = dataMsg.UserId;
m_Events[userId].Set();
}
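If each reply can be modelled as a task, one option is a TaskCompletionSource per userId, registered before the message is posted (which matches the CreateOrder/PostOrder split suggested in the question). This is only a sketch: PendingReplies, Register and Complete are hypothetical names, and the reply payload is simplified to a string:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class PendingReplies
{
    private readonly ConcurrentDictionary<uint, TaskCompletionSource<string>> pending =
        new ConcurrentDictionary<uint, TaskCompletionSource<string>>();

    // Call this BEFORE posting the message, so that even a very fast callback finds the entry.
    public Task<string> Register(uint userId)
    {
        var tcs = new TaskCompletionSource<string>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        pending[userId] = tcs;
        return tcs.Task;
    }

    // Call this from the callback. Returns false if nobody is waiting for this userId.
    public bool Complete(uint userId, string data)
    {
        TaskCompletionSource<string> tcs;
        return pending.TryRemove(userId, out tcs) && tcs.TrySetResult(data);
    }
}
```

IssueOrders would collect the tasks returned by Register and then await Task.WhenAll(tasks), or call Task.WaitAll(tasks) if it has to block. Because the dictionary is keyed by userId, concurrent IssueOrders calls from different threads do not interfere with each other.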

What is an efficient method for in-order processing of events using CCR?

I was experimenting with CCR iterators as a solution to a task that requires parallel processing of tons of data feeds, where the data from each feed needs to be processed in order. None of the feeds are dependent on each other, so the in-order processing can be paralleled per-feed.
Below is a quick and dirty mockup with one integer feed, which simply shoves integers into a Port at a rate of about 1.5K/second, and then pulls them out using a CCR iterator to keep the in-order processing guarantee.
class Program
{
static Dispatcher dispatcher = new Dispatcher();
static DispatcherQueue dispatcherQueue =
new DispatcherQueue("DefaultDispatcherQueue", dispatcher);
static Port<int> intPort = new Port<int>();
static void Main(string[] args)
{
Arbiter.Activate(
dispatcherQueue,
Arbiter.FromIteratorHandler(new IteratorHandler(ProcessInts)));
int counter = 0;
Timer t = new Timer( (x) =>
{ for(int i = 0; i < 1500; ++i) intPort.Post(counter++);}
, null, 0, 1000);
Console.ReadKey();
}
public static IEnumerator<ITask> ProcessInts()
{
while (true)
{
yield return intPort.Receive();
int currentValue;
if( (currentValue = intPort) % 1000 == 0)
{
Console.WriteLine("{0}, Current Items In Queue:{1}",
currentValue, intPort.ItemCount);
}
}
}
}
What surprised me greatly was that CCR could not keep up on a Core i7 box, with the queue size growing without bound. In another test measuring the latency from Post() to Receive() under a load of ~100 Posts/sec, the latency between the first Post() and Receive() in each batch was around 1 ms.
Is there something wrong with my mockup? If so, what is a better way of doing this using CCR?
Yes, I agree, this does indeed seem weird. Your code seems initially to perform smoothly, but after a few thousand items, processor usage rises to the point where performance is really lacklustre. This disturbs me and suggests a problem in the framework. After a play with your code, I can't really identify why this is the case. I'd suggest taking this problem to the Microsoft Robotics Forums and seeing if you can get George Chrysanthakopoulos (or one of the other CCR brains) to tell you what the problem is. I can however surmise that your code as it stands is terribly inefficient.
The way that you are dealing with "popping" items from the Port is very inefficient. Essentially the iterator is woken each time there is a message in the Port and it deals with only one message (despite the fact that there might be several hundred more in the Port), then hangs on the yield while control is passed back to the framework. At the point that the yielded receiver causes another "awakening" of the iterator, many many messages have filled the Port. Pulling a thread from the Dispatcher to deal with only a single item (when many have piled up in the meantime) is almost certainly not the best way to get good throughput.
I've modded your code such that after the yield, we check the Port to see if there are any further messages queued and deal with them too, thereby completely emptying the Port before we yield back to the framework. I've also refactored your code somewhat to use CcrServiceBase which simplifies the syntax of some of the tasks you are doing:
internal class Test:CcrServiceBase
{
private readonly Port<int> intPort = new Port<int>();
private Timer timer;
public Test() : base(new DispatcherQueue("DefaultDispatcherQueue",
new Dispatcher(0,
"dispatcher")))
{
}
public void StartTest() {
SpawnIterator(ProcessInts);
var counter = 0;
timer = new Timer(x =>
{
for (var i = 0; i < 1500; ++i)
intPort.Post(counter++);
}
,
null,
0,
1000);
}
public IEnumerator<ITask> ProcessInts()
{
while (true)
{
yield return intPort.Receive();
int currentValue = intPort;
ReportCurrent(currentValue);
while(intPort.Test(out currentValue))
{
ReportCurrent(currentValue);
}
}
}
private void ReportCurrent(int currentValue)
{
if (currentValue % 1000 == 0)
{
Console.WriteLine("{0}, Current Items In Queue:{1}",
currentValue,
intPort.ItemCount);
}
}
}
Alternatively, you could do away with the iterator completely, as it's not really well used in your example (although I'm not entirely sure what effect this has on the order of processing):
internal class Test : CcrServiceBase
{
private readonly Port<int> intPort = new Port<int>();
private Timer timer;
public Test() : base(new DispatcherQueue("DefaultDispatcherQueue",
new Dispatcher(0,
"dispatcher")))
{
}
public void StartTest()
{
Activate(
Arbiter.Receive(true,
intPort,
i =>
{
ReportCurrent(i);
int currentValue;
while (intPort.Test(out currentValue))
{
ReportCurrent(currentValue);
}
}));
var counter = 0;
timer = new Timer(x =>
{
for (var i = 0; i < 500000; ++i)
{
intPort.Post(counter++);
}
}
,
null,
0,
1000);
}
private void ReportCurrent(int currentValue)
{
if (currentValue % 1000000 == 0)
{
Console.WriteLine("{0}, Current Items In Queue:{1}",
currentValue,
intPort.ItemCount);
}
}
}
Both of these examples increase throughput by orders of magnitude. Hope this helps.
