Take from IObservable until collection reaches count or time elapses - C#

I want to fill a collection until any of these two conditions is satisfied:
either the allowed time of 5 seconds has elapsed, or
the collection has reached a count of 5 items.
If either condition is fulfilled, the method I subscribed to should be executed (in this case Console.WriteLine):
static void Main(string[] args)
{
var sourceCollection = Source().ToObservable();
var bufferedCollection = sourceCollection.Buffer(
() => Observable.Amb(
Observable.Timer(TimeSpan.FromSeconds(5)//,
//Observable.TakeWhile(bufferedCollection, a=> a.Count < 5)
))
);
bufferedCollection.Subscribe(col =>
{
Console.WriteLine("count of items is now {0}", col.Count);
});
Console.ReadLine();
}
static IEnumerable<int> Source()
{
var random = new Random();
var lst = new List<int> { 1,2,3,4,5 };
while(true)
{
yield return lst[random.Next(lst.Count)];
Thread.Sleep(random.Next(0, 1500));
}
}
I managed to make it work with Observable.Timer, but the TakeWhile doesn't work. How do I check the collection count? Does TakeWhile work for this, or is there some other method? I'm sure it's something simple.

I got it - the answer was in the documentation of Buffer: there's an overload that takes a parameter specifying the maximum count. So I don't need Observable.Amb; I can just say:
var sourceCollection = Source().ToObservable();
var maxBufferCount = 5;
var bufferedCollection = sourceCollection.Buffer(TimeSpan.FromSeconds(5), maxBufferCount, Scheduler.Default);
bufferedCollection.Subscribe(col =>
{
Console.WriteLine("count of items is now {0}", col.Count);
});
Console.ReadLine();
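For completeness, the closing-selector idea from the question can also be made to work. Below is an untested sketch (assuming using System.Reactive and System.Reactive.Linq): the source is published so that the closing selector can count items from the same sequence, and Amb closes the buffer with whichever signal fires first.
var bufferedCollection = sourceCollection.Publish(source =>
    source.Buffer(() => Observable.Amb(
        // close the current buffer after 5 seconds...
        Observable.Timer(TimeSpan.FromSeconds(5)).Select(_ => Unit.Default),
        // ...or after the next 5 items, whichever comes first
        source.Take(5).LastAsync().Select(_ => Unit.Default))));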

Related

Tasks combine result and continue

I have 16 tasks doing the same job, and each of them returns an array. I want to combine the results in pairs and do the same job again until I have only one task. I don't know the best way to do this.
public static IComparatorNetwork[] Prune(IComparatorNetwork[] nets, int numTasks)
{
var tasks = new Task[numTasks];
var netsPerTask = nets.Length/numTasks;
var start = 0;
var concurrentSet = new ConcurrentBag<IComparatorNetwork>();
for(var i = 0; i < numTasks; i++)
{
IComparatorNetwork[] taskNets;
if (i == numTasks - 1)
{
taskNets = nets.Skip(start).ToArray();
}
else
{
taskNets = nets.Skip(start).Take(netsPerTask).ToArray();
}
start += netsPerTask;
tasks[i] = Task.Factory.StartNew(() =>
{
var pruner = new Pruner();
// Note: ConcurrentBag<T> has no built-in AddRange; presumably an extension method in the author's code.
concurrentSet.AddRange(pruner.Prune(taskNets));
});
}
Task.WaitAll(tasks.ToArray());
if(numTasks > 1)
{
return Prune(concurrentSet.ToArray(), numTasks/2);
}
return concurrentSet.ToArray();
}
Right now I am waiting for all tasks to complete, then I repeat with half as many tasks until I have only one. I would like to not have to wait for all of them on each iteration. I am very new to parallel programming, so the approach is probably bad.
The code I am trying to parallelize is the following:
public IComparatorNetwork[] Prune(IComparatorNetwork[] nets)
{
var result = new List<IComparatorNetwork>();
for (var i = 0; i < nets.Length; i++)
{
var isSubsumed = false;
for (var index = result.Count - 1; index >= 0; index--)
{
var n = result[index];
if (nets[i].IsSubsumed(n))
{
isSubsumed = true;
break;
}
if (n.IsSubsumed(nets[i]))
{
result.Remove(n);
}
}
if (!isSubsumed)
{
result.Add(nets[i]);
}
}
return result.ToArray();
}
So what you're fundamentally doing here is aggregating values, but in parallel. Fortunately, PLINQ already has an implementation of Aggregate that works in parallel. So in your case you can simply wrap each element of the original array in its own one-element array, and then your Prune operation is able to combine any two arrays of nets into a new single array.
public static IComparatorNetwork[] Prune(IComparatorNetwork[] nets)
{
return nets.Select(net => new[] { net })
.AsParallel()
.Aggregate((a, b) => new Pruner().Prune(a.Concat(b).ToArray()));
}
I'm not super knowledgeable about the internals of their Aggregate method, but I would imagine it's pretty good and doesn't spend a lot of time waiting unnecessarily. But if you want to write your own, so that you can be sure the workers are always pulling in new work as soon as there is new work, here is my own implementation. Feel free to compare the two in your specific situation to see which performs best for your needs. Note that PLINQ is configurable in many ways; feel free to experiment with other configurations to see what works best for your situation.
public static T AggregateInParallel<T>(this IEnumerable<T> values, Func<T, T, T> function, int numTasks)
{
Queue<T> queue = new Queue<T>();
foreach (var value in values)
queue.Enqueue(value);
if (!queue.Any())
return default(T); //Consider throwing or doing something else here if the sequence is empty
(T, T)? GetFromQueue()
{
lock (queue)
{
if (queue.Count >= 2)
{
return (queue.Dequeue(), queue.Dequeue());
}
else
{
return null;
}
}
}
var tasks = Enumerable.Range(0, numTasks)
.Select(_ => Task.Run(() =>
{
var pair = GetFromQueue();
while (pair != null)
{
var result = function(pair.Value.Item1, pair.Value.Item2);
lock (queue)
{
queue.Enqueue(result);
}
pair = GetFromQueue();
}
}))
.ToArray();
Task.WaitAll(tasks);
return queue.Dequeue();
}
And the calling code for this version would look like:
public static IComparatorNetwork[] Prune2(IComparatorNetwork[] nets)
{
return nets.Select(net => new[] { net })
.AggregateInParallel((a, b) => new Pruner().Prune(a.Concat(b).ToArray()), nets.Length / 2);
}
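To sanity-check the helper in isolation, here is a trivial (hypothetical) usage that just sums integers:
var numbers = Enumerable.Range(1, 100);
int sum = numbers.AggregateInParallel((a, b) => a + b, numTasks: 4);
Console.WriteLine(sum); // prints 5050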
As mentioned in comments, you can make the pruner's Prune method much more efficient by having it accept two collections, not just one, and only comparing items from each collection with the other, knowing that all items from the same collection will not subsume any others from that collection. This makes the method not only much shorter, simpler, and easier to understand, but also removes a sizeable portion of the expensive comparisons. A few minor adaptations can also greatly reduce the number of intermediate collections created.
public static IReadOnlyList<IComparatorNetwork> Prune(IReadOnlyList<IComparatorNetwork> first, IReadOnlyList<IComparatorNetwork> second)
{
var firstItemsNotSubsumed = first.Where(outerNet => !second.Any(innerNet => outerNet.IsSubsumed(innerNet)));
var secondItemsNotSubsumed = second.Where(outerNet => !first.Any(innerNet => outerNet.IsSubsumed(innerNet)));
return firstItemsNotSubsumed.Concat(secondItemsNotSubsumed).ToList();
}
With that, the calling code just needs minor adaptations to ensure the types match up and that you pass in both collections rather than concatenating them first.
public static IReadOnlyList<IComparatorNetwork> Prune(IReadOnlyList<IComparatorNetwork> nets)
{
return nets.Select(net => (IReadOnlyList<IComparatorNetwork>)new[] { net })
.AggregateInParallel((a, b) => Pruner.Prune(a, b), nets.Count / 2);
}

Simultaneous data insertion into a list with multithreading

I'm trying to optimize a small program. So here is the basic idea:
I have an array of unfiltered data, and I want to pass it to a function which will call another function twice, for data filtering and insertion into a new list. The first call will take the data from the original array in the range from 0 to half the array's length, and the second will do the same with the range from half to the last item. This way, I should get simultaneous insertion of filtered data into the same list. After the insertion is complete, the filtered list can be passed to the rest of the program. Here's the code:
static void Main(string[] args)
{
// the unfiltered list
int[] oldArray = new int[6] {1,2,3,4,5,6};
// filtered list
List<int> newList= new List<int>();
// Functions is my static class
Functions.Insert(newList, oldArray);
Continue_Program_With_Filtered_List(newList);
// remaining functions...
}
And here is the Functions class:
public static class Functions
{
public static void Insert(List<int> newList, int[] oldArray)
{
new Thread(() =>
{
Inserter(newList, oldArray, true);
}).Start();
new Thread(() =>
{
Inserter(newList, oldArray, false);
}).Start();
// I need to wait the result here of both threads
// and make sure that every item from oldArray has been filtered
// before I proceed to the next function in Main()
}
public static void Inserter(List<int> newList, int[] oldArray, bool countUp)
{
bool filterIsValid = false;
int length = oldArray.Length;
int halflen = (int)Math.Floor((decimal)length / 2);
if (countUp)
{
// from 0 to half length
for (int i = 0; i < halflen; i++)
{
// filtering conditions here to set value of filterIsValid
if(filterIsValid)
newList.Add(oldArray[i]);
}
}
else
{
// from half length to full length
for (int i = halflen + 1; i < length; i++)
{
// filtering conditions here to set value of filterIsValid
if(filterIsValid)
newList.Add(oldArray[i]);
}
}
}
}
So the problem is that I must wait for both threads started by Functions.Insert() to complete, and for every item to have been filtered, before the newList is passed to the next function in Main().
I have no idea how to use Tasks or async methods on something like this. This is just an outline of the program, by the way. Any help?
In your case, using PLINQ may also be an option.
static void Main(string[] args)
{
// the unfiltered list
int[] oldArray = new int[6] { 1, 2, 3, 4, 5, 6 };
// filtered list
List<int> newList = oldArray.AsParallel().Where(filter).ToList(); // "filter" stands for your filtering predicate
// remaining functions...
}
You can also use AsOrdered() to preserve order.
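For example (a sketch; the x % 2 == 0 predicate just stands in for your real filter):
// AsOrdered() makes PLINQ preserve the source order in the output,
// at some cost in parallel throughput.
List<int> newList = oldArray.AsParallel()
                            .AsOrdered()
                            .Where(x => x % 2 == 0)
                            .ToList();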
To come back to your initial question, here's what you can do
Note: This is a solution with minimal changes to your original code; whether there are other possible optimizations is a separate question.
Additional note: Keep in mind that there can still be concurrency issues depending on what else you do with the arguments passed to that function.
public static async Task Insert(List<int> newList, int[] oldArray)
{
ConcurrentBag<int> concurrentBag = new ConcurrentBag<int>();
var task1 = Task.Factory.StartNew(() =>
{
Inserter(concurrentBag, oldArray, true);
});
var task2 = Task.Factory.StartNew(() =>
{
Inserter(concurrentBag, oldArray, false);
});
await Task.WhenAll(task1, task2);
newList.AddRange(concurrentBag);
}
public static void Inserter(ConcurrentBag<int> newList, int[] oldArray, bool countUp)
{
//Same code
}
Edit: Your second for-loop is wrong; change it to this or you will lose one item:
for (int i = halflen; i < length; i++)

TPL Dataflow, confused about core design

I have been using TPL Dataflow quite a bit but am stumbling about an issue that I cannot resolve:
I have the following architecture:
BroadCastBlock<List<object1>> -> 2 different TransformBlock<List<Object1>, Tuple<int, List<Object1>>> -> both link to TransformManyBlock<Tuple<int, List<Object1>>, Object2>
I vary the lambda expression within the TransformManyBlock at the end of the chain: (a) code that performs operations on the streamed tuple, (b) no code at all.
Within the TransformBlocks I measure the time, starting from the arrival of the first item and stopping when TransformBlock.Completion indicates the block completed (broadCastBlock links to the transform blocks with PropagateCompletion set to true).
What I cannot reconcile is why the transformBlocks in case (b) complete about 5-6 times faster than in case (a). This completely goes against the intent of the TDF design. The items from the transform blocks were passed on to the transformManyBlock, so what the transformManyBlock does with the items should not matter at all for when the transform blocks complete. I do not see a single reason why anything that goes on in the transformManyBlock should have a bearing on the preceding TransformBlocks.
Can anyone reconcile this weird observation?
Here is some code to show the difference. When running the code make sure to change the following two lines from:
tfb1.transformBlock.LinkTo(transformManyBlock);
tfb2.transformBlock.LinkTo(transformManyBlock);
to:
tfb1.transformBlock.LinkTo(transformManyBlockEmpty);
tfb2.transformBlock.LinkTo(transformManyBlockEmpty);
in order to observe the difference in runtime of the preceding transformBlocks.
class Program
{
static void Main(string[] args)
{
Test test = new Test();
test.Start();
}
}
class Test
{
private const int numberTransformBlocks = 2;
private int currentGridPointer;
private Dictionary<int, List<Tuple<int, List<Object1>>>> grid;
private BroadcastBlock<List<Object1>> broadCastBlock;
private TransformBlockClass tfb1;
private TransformBlockClass tfb2;
private TransformManyBlock<Tuple<int, List<Object1>>, Object2>
transformManyBlock;
private TransformManyBlock<Tuple<int, List<Object1>>, Object2>
transformManyBlockEmpty;
private ActionBlock<Object2> actionBlock;
public Test()
{
grid = new Dictionary<int, List<Tuple<int, List<Object1>>>>();
broadCastBlock = new BroadcastBlock<List<Object1>>(list => list);
tfb1 = new TransformBlockClass();
tfb2 = new TransformBlockClass();
transformManyBlock = new TransformManyBlock<Tuple<int, List<Object1>>, Object2>
(newTuple =>
{
for (int counter = 1; counter <= 10000000; counter++)
{
double result = Math.Sqrt(counter + 1.0);
}
return new Object2[0];
});
transformManyBlockEmpty
= new TransformManyBlock<Tuple<int, List<Object1>>, Object2>(
tuple =>
{
return new Object2[0];
});
actionBlock = new ActionBlock<Object2>(list =>
{
int tester = 1;
//flush transformManyBlock
});
//linking
broadCastBlock.LinkTo(tfb1.transformBlock
, new DataflowLinkOptions
{ PropagateCompletion = true }
);
broadCastBlock.LinkTo(tfb2.transformBlock
, new DataflowLinkOptions
{ PropagateCompletion = true }
);
//link either to ->transformManyBlock or -> transformManyBlockEmpty
tfb1.transformBlock.LinkTo(transformManyBlock);
tfb2.transformBlock.LinkTo(transformManyBlock);
transformManyBlock.LinkTo(actionBlock
, new DataflowLinkOptions
{ PropagateCompletion = true }
);
transformManyBlockEmpty.LinkTo(actionBlock
, new DataflowLinkOptions
{ PropagateCompletion = true }
);
//completion
Task.WhenAll(tfb1.transformBlock.Completion
, tfb2.transformBlock.Completion)
.ContinueWith(_ =>
{
transformManyBlockEmpty.Complete();
transformManyBlock.Complete();
});
transformManyBlock.Completion.ContinueWith(_ =>
{
Console.WriteLine("TransformManyBlock (with code) completed");
});
transformManyBlockEmpty.Completion.ContinueWith(_ =>
{
Console.WriteLine("TransformManyBlock (empty) completed");
});
}
public void Start()
{
const int numberBlocks = 100;
const int collectionSize = 300000;
//send collection numberBlock-times
for (int i = 0; i < numberBlocks; i++)
{
List<Object1> list = new List<Object1>();
for (int j = 0; j < collectionSize; j++)
{
list.Add(new Object1(j));
}
broadCastBlock.Post(list);
}
//mark broadCastBlock complete
broadCastBlock.Complete();
Console.WriteLine("Core routine finished");
Console.ReadLine();
}
}
class TransformBlockClass
{
private Stopwatch watch;
private bool isStarted;
private int currentIndex;
public TransformBlock<List<Object1>, Tuple<int, List<Object1>>> transformBlock;
public TransformBlockClass()
{
isStarted = false;
watch = new Stopwatch();
transformBlock = new TransformBlock<List<Object1>, Tuple<int, List<Object1>>>
(list =>
{
if (!isStarted)
{
StartUp();
isStarted = true;
}
return new Tuple<int, List<Object1>>(currentIndex++, list);
});
transformBlock.Completion.ContinueWith(_ =>
{
ShutDown();
});
}
private void StartUp()
{
watch.Start();
}
private void ShutDown()
{
watch.Stop();
Console.WriteLine("TransformBlock : Time elapsed in ms: "
+ watch.ElapsedMilliseconds);
}
}
class Object1
{
public int val { get; private set; }
public Object1(int val)
{
this.val = val;
}
}
class Object2
{
public int value { get; private set; }
public List<Object1> collection { get; private set; }
public Object2(int value, List<Object1> collection)
{
this.value = value;
this.collection = collection;
}
}
EDIT: I posted another piece of code, this time using collections of value types, and I cannot reproduce the problem I am observing in the code above. Could it be that passing around reference types and operating on them concurrently (even within different dataflow blocks) blocks and causes contention?
class Program
{
static void Main(string[] args)
{
Test test = new Test();
test.Start();
}
}
class Test
{
private BroadcastBlock<List<int>> broadCastBlock;
private TransformBlock<List<int>, List<int>> tfb11;
private TransformBlock<List<int>, List<int>> tfb12;
private TransformBlock<List<int>, List<int>> tfb21;
private TransformBlock<List<int>, List<int>> tfb22;
private TransformManyBlock<List<int>, List<int>> transformManyBlock1;
private TransformManyBlock<List<int>, List<int>> transformManyBlock2;
private ActionBlock<List<int>> actionBlock1;
private ActionBlock<List<int>> actionBlock2;
public Test()
{
broadCastBlock = new BroadcastBlock<List<int>>(item => item);
tfb11 = new TransformBlock<List<int>, List<int>>(item =>
{
return item;
});
tfb12 = new TransformBlock<List<int>, List<int>>(item =>
{
return item;
});
tfb21 = new TransformBlock<List<int>, List<int>>(item =>
{
return item;
});
tfb22 = new TransformBlock<List<int>, List<int>>(item =>
{
return item;
});
transformManyBlock1 = new TransformManyBlock<List<int>, List<int>>(item =>
{
Thread.Sleep(100);
//or you can replace the Thread.Sleep(100) with actual work,
//no difference in results. This shows that the issue at hand is
//unrelated to starvation of threads.
return new List<int>[1] { item };
});
transformManyBlock2 = new TransformManyBlock<List<int>, List<int>>(item =>
{
return new List<int>[1] { item };
});
actionBlock1 = new ActionBlock<List<int>>(item =>
{
//flush transformManyBlock
});
actionBlock2 = new ActionBlock<List<int>>(item =>
{
//flush transformManyBlock
});
//linking
broadCastBlock.LinkTo(tfb11, new DataflowLinkOptions
{ PropagateCompletion = true });
broadCastBlock.LinkTo(tfb12, new DataflowLinkOptions
{ PropagateCompletion = true });
broadCastBlock.LinkTo(tfb21, new DataflowLinkOptions
{ PropagateCompletion = true });
broadCastBlock.LinkTo(tfb22, new DataflowLinkOptions
{ PropagateCompletion = true });
tfb11.LinkTo(transformManyBlock1);
tfb12.LinkTo(transformManyBlock1);
tfb21.LinkTo(transformManyBlock2);
tfb22.LinkTo(transformManyBlock2);
transformManyBlock1.LinkTo(actionBlock1
, new DataflowLinkOptions
{ PropagateCompletion = true }
);
transformManyBlock2.LinkTo(actionBlock2
, new DataflowLinkOptions
{ PropagateCompletion = true }
);
//completion
Task.WhenAll(tfb11.Completion, tfb12.Completion).ContinueWith(_ =>
{
Console.WriteLine("TransformBlocks 11 and 12 completed");
transformManyBlock1.Complete();
});
Task.WhenAll(tfb21.Completion, tfb22.Completion).ContinueWith(_ =>
{
Console.WriteLine("TransformBlocks 21 and 22 completed");
transformManyBlock2.Complete();
});
transformManyBlock1.Completion.ContinueWith(_ =>
{
Console.WriteLine
("TransformManyBlock (from tfb11 and tfb12) finished");
});
transformManyBlock2.Completion.ContinueWith(_ =>
{
Console.WriteLine
("TransformManyBlock (from tfb21 and tfb22) finished");
});
}
public void Start()
{
const int numberBlocks = 100;
const int collectionSize = 300000;
//send collection numberBlock-times
for (int i = 0; i < numberBlocks; i++)
{
List<int> list = new List<int>();
for (int j = 0; j < collectionSize; j++)
{
list.Add(j);
}
broadCastBlock.Post(list);
}
//mark broadCastBlock complete
broadCastBlock.Complete();
Console.WriteLine("Core routine finished");
Console.ReadLine();
}
}
Okay, final attempt ;-)
Synopsis:
The observed time delta in scenario 1 can be fully explained by differing behavior of the garbage collector.
When running scenario 1 linking the transformManyBlocks, the runtime behavior is such that garbage collections are triggered during the creation of new items (Lists) on the main thread, which is not the case when running scenario 1 with the transformManyBlockEmptys linked.
Note that creating a new reference type instance (Object1) results in a call to allocate memory in the GC heap which in turn may trigger a GC collection run. As quite a few Object1 instances (and lists) are created, the garbage collector has quite a bit more work to do scanning the heap for (potentially) unreachable objects.
Therefore the observed difference can be minimized by any of the following:
Turning Object1 from a class to a struct (thereby ensuring that memory for the instances is not allocated on the heap); a sketch follows below.
Keeping a reference to the generated lists (thereby reducing the time the garbage collector needs to identify unreachable objects).
Generating all the items before posting them to the network.
(Note: I cannot explain why the garbage collector behaves differently in scenario 1 "transformManyBlock" vs. scenario 1 "transformManyBlockEmpty", but data collected via the ConcurrencyVisualizer clearly shows the difference.)
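For illustration, the struct mitigation mentioned above is a small change to the type declaration (a sketch; the rest of the code stays the same):
// As a struct, Object1 instances are stored inline in the List<Object1>'s
// backing array instead of as separate objects on the GC heap, so the
// collector has far fewer references to trace.
struct Object1
{
    public int val { get; private set; }
    public Object1(int val) : this() { this.val = val; }
}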
Results:
(Tests were run on a Core i7 980X, 6 cores, HT enabled):
I modified scenario 2 as follows:
// Start a stopwatch per tfb
int tfb11Cnt = 0;
Stopwatch sw11 = new Stopwatch();
tfb11 = new TransformBlock<List<int>, List<int>>(item =>
{
if (Interlocked.CompareExchange(ref tfb11Cnt, 1, 0) == 0)
sw11.Start();
return item;
});
// [...]
// completion
Task.WhenAll(tfb11.Completion, tfb12.Completion).ContinueWith(_ =>
{
Console.WriteLine("TransformBlocks 11 and 12 completed. SW11: {0}, SW12: {1}",
sw11.ElapsedMilliseconds, sw12.ElapsedMilliseconds);
transformManyBlock1.Complete();
});
Results:
Scenario 1 (as posted, i.e. linked to transformManyBlock):
TransformBlock : Time elapsed in ms: 6826
TransformBlock : Time elapsed in ms: 6826
Scenario 1 (linked to transformManyBlockEmpty):
TransformBlock : Time elapsed in ms: 3140
TransformBlock : Time elapsed in ms: 3140
Scenario 1 (transformManyBlock, Thread.Sleep(200) in loop body):
TransformBlock : Time elapsed in ms: 4949
TransformBlock : Time elapsed in ms: 4950
Scenario 2 (as posted but modified to report times):
TransformBlocks 21 and 22 completed. SW21: 619 ms, SW22: 669 ms
TransformBlocks 11 and 12 completed. SW11: 669 ms, SW12: 667 ms
Next, I changed scenario 1 and 2 to prepare the input data prior to posting it to the network:
// Scenario 1
//send collection numberBlock-times
var input = new List<List<Object1>>(numberBlocks);
for (int i = 0; i < numberBlocks; i++)
{
var list = new List<Object1>(collectionSize);
for (int j = 0; j < collectionSize; j++)
{
list.Add(new Object1(j));
}
input.Add(list);
}
foreach (var inp in input)
{
broadCastBlock.Post(inp);
Thread.Sleep(10);
}
// Scenario 2
//send collection numberBlock-times
var input = new List<List<int>>(numberBlocks);
for (int i = 0; i < numberBlocks; i++)
{
List<int> list = new List<int>(collectionSize);
for (int j = 0; j < collectionSize; j++)
{
list.Add(j);
}
//broadCastBlock.Post(list);
input.Add(list);
}
foreach (var inp in input)
{
broadCastBlock.Post(inp);
Thread.Sleep(10);
}
Results:
Scenario 1 (transformManyBlock):
TransformBlock : Time elapsed in ms: 1029
TransformBlock : Time elapsed in ms: 1029
Scenario 1 (transformManyBlockEmpty):
TransformBlock : Time elapsed in ms: 975
TransformBlock : Time elapsed in ms: 975
Scenario 1 (transformManyBlock, Thread.Sleep(200) in loop body):
TransformBlock : Time elapsed in ms: 972
TransformBlock : Time elapsed in ms: 972
Finally, I changed the code back to the original version, but kept a reference to the created lists around:
var lists = new List<List<Object1>>();
for (int i = 0; i < numberBlocks; i++)
{
List<Object1> list = new List<Object1>();
for (int j = 0; j < collectionSize; j++)
{
list.Add(new Object1(j));
}
lists.Add(list);
broadCastBlock.Post(list);
}
Results:
Scenario 1 (transformManyBlock):
TransformBlock : Time elapsed in ms: 6052
TransformBlock : Time elapsed in ms: 6052
Scenario 1 (transformManyBlockEmpty):
TransformBlock : Time elapsed in ms: 5524
TransformBlock : Time elapsed in ms: 5524
Scenario 1 (transformManyBlock, Thread.Sleep(200) in loop body):
TransformBlock : Time elapsed in ms: 5098
TransformBlock : Time elapsed in ms: 5098
Likewise, changing Object1 from a class to a struct results in both blocks completing at about the same time (and about 10x faster).
Update: The answer below does not suffice to explain the observed behavior.
In scenario one a tight loop is executed inside the TransformMany lambda, which will hog the CPU and will starve other threads for processor resources. That's the reason why a delay in the execution of the Completion continuation task can be observed. In scenario two a Thread.Sleep is executed inside the TransformMany lambda giving other threads the chance to execute the Completion continuation task. The observed difference in runtime behavior is not related to TPL Dataflow. To improve the observed deltas it should suffice to introduce a Thread.Sleep inside the loop's body in scenario 1:
for (int counter = 1; counter <= 10000000; counter++)
{
double result = Math.Sqrt(counter + 1.0);
// Back off for a little while
Thread.Sleep(200);
}
(Below is my original answer. I didn't read the OP's question carefully enough, and only understood what he was asking about after having read his comments. I still leave it here as a reference.)
Are you sure that you are measuring the right thing? Note that when you do something like this: transformBlock.Completion.ContinueWith(_ => ShutDown()); then your time measurement will be influenced by the behavior of the TaskScheduler (e.g. how long it takes until the continuation task starts executing). Although I was not able to observe the difference you saw on my machine, I got more precise results (in terms of the delta between tfb1 and tfb2 completion times) when using dedicated threads for measuring time:
// Within your Test.Start() method...
Thread timewatch = new Thread(() =>
{
var sw = Stopwatch.StartNew();
tfb1.transformBlock.Completion.Wait();
Console.WriteLine("tfb1.transformBlock completed within {0} ms",
sw.ElapsedMilliseconds);
});
Thread timewatchempty = new Thread(() =>
{
var sw = Stopwatch.StartNew();
tfb2.transformBlock.Completion.Wait();
Console.WriteLine("tfb2.transformBlock completed within {0} ms",
sw.ElapsedMilliseconds);
});
timewatch.Start();
timewatchempty.Start();
//send collection numberBlock-times
for (int i = 0; i < numberBlocks; i++)
{
// ... rest of the code

Why is one loop performing better than the other, both memory-wise and performance-wise?

I have the following two loops in C#, and I am running them over a collection of 10,000 records downloaded with paging using "yield return"
First
foreach(var k in collection) {
repo.Save(k);
}
Second
var collectionEnum = collection.GetEnumerator();
while (collectionEnum.MoveNext()) {
var k = collectionEnum.Current;
repo.Save(k);
k = null;
}
It seems that the second loop consumes less memory and is faster than the first loop. The memory difference, I understand, may be because of k being set to null (even though I am not sure). But how come it is faster than foreach?
Following is the actual code
[Test]
public void BechmarkForEach_Test() {
bool isFirstTimeSync = true;
Func<Contact, bool> afterProcessing = contactItem => {
return true;
};
var contactService = CreateSerivce("/administrator/components/com_civicrm");
var contactRepo = new ContactRepository(new Mock<ILogger>().Object);
contactRepo.Drop();
contactRepo = new ContactRepository(new Mock<ILogger>().Object);
Profile("For Each Profiling",1,()=>{
var localenumertaor=contactService.Download();
foreach (var item in localenumertaor) {
if (isFirstTimeSync)
item.StateFlag = 1;
item.ClientTimeStamp = DateTime.UtcNow;
if (item.StateFlag == 1)
contactRepo.Insert(item);
else
contactRepo.Update(item);
afterProcessing(item);
}
contactRepo.DeleteAll();
});
}
[Test]
public void BechmarkWhile_Test() {
bool isFirstTimeSync = true;
Func<Contact, bool> afterProcessing = contactItem => {
return true;
};
var contactService = CreateSerivce("/administrator/components/com_civicrm");
var contactRepo = new ContactRepository(new Mock<ILogger>().Object);
contactRepo.Drop();
contactRepo = new ContactRepository(new Mock<ILogger>().Object);
var itemsCollection = contactService.Download().GetEnumerator();
Profile("While Profiling", 1, () =>
{
while (itemsCollection.MoveNext()) {
var item = itemsCollection.Current;
//if First time sync then ignore and overwrite the stateflag
if (isFirstTimeSync)
item.StateFlag = 1;
item.ClientTimeStamp = DateTime.UtcNow;
if (item.StateFlag == 1)
contactRepo.Insert(item);
else
contactRepo.Update(item);
afterProcessing(item);
item = null;
}
contactRepo.DeleteAll();
});
}
static void Profile(string description, int iterations, Action func) {
// clean up
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
// warm up
func();
var watch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++) {
func();
}
watch.Stop();
Console.Write(description);
Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}
I am using the micro-benchmarking approach from a Stack Overflow question itself: benchmarking-small-code
The time taken is:
For Each Profiling Time Elapsed 5249 ms
While Profiling Time Elapsed 116 ms
Your foreach version calls var localenumertaor = contactService.Download(); inside the profile action, while the enumerator version calls it outside of the Profile call.
On top of that, the first execution of the iterator version will exhaust the items in the enumerator, and on subsequent iterations itemsCollection.MoveNext() will return false and skip the inner loop entirely.
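You can see the exhaustion effect in isolation with a few lines (a minimal sketch):
// An iterator is forward-only: once consumed, MoveNext() keeps returning false.
IEnumerator<int> e = Enumerable.Range(0, 3).GetEnumerator();
while (e.MoveNext()) Console.Write(e.Current); // prints 012
while (e.MoveNext()) Console.Write(e.Current); // prints nothing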

How to Quickly Remove Items From a List

I am looking for a way to quickly remove items from a C# List<T>. The documentation states that the List.Remove() and List.RemoveAt() operations are both O(n)
List.Remove
List.RemoveAt
This is severely affecting my application.
I wrote a few different remove methods and tested them all on a List<String> with 500,000 items. The test cases are shown below...
Overview
I wrote a method that would generate a list of strings that simply contains string representations of each number ("1", "2", "3", ...). I then attempted to remove every 5th item in the list. Here is the method used to generate the list:
private List<String> GetList(int size)
{
List<String> myList = new List<String>();
for (int i = 0; i < size; i++)
myList.Add(i.ToString());
return myList;
}
Test 1: RemoveAt()
Here is the test I used to test the RemoveAt() method.
private void RemoveTest1(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list.RemoveAt(i);
}
Test 2: Remove()
Here is the test I used to test the Remove() method.
private void RemoveTest2(ref List<String> list)
{
List<int> itemsToRemove = new List<int>();
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list.Remove(list[i]);
}
Test 3: Set to null, sort, then RemoveRange
In this test, I looped through the list one time and set the to-be-removed items to null. Then, I sorted the list (so null would be at the top), and removed all the items at the top that were set to null.
NOTE: This reordered my list, so I may have to go put it back in the correct order.
private void RemoveTest3(ref List<String> list)
{
int numToRemove = 0;
for (int i = 0; i < list.Count; i++)
{
if (i % 5 == 0)
{
list[i] = null;
numToRemove++;
}
}
list.Sort();
list.RemoveRange(0, numToRemove);
// Now they're out of order...
}
Test 4: Create a new list, and add all of the "good" values to the new list
In this test, I created a new list, and added all of my keep-items to the new list. Then, I put all of these items into the original list.
private void RemoveTest4(ref List<String> list)
{
List<String> newList = new List<String>();
for (int i = 0; i < list.Count; i++)
{
if (i % 5 == 0)
continue;
else
newList.Add(list[i]);
}
list.RemoveRange(0, list.Count);
list.AddRange(newList);
}
Test 5: Set to null and then FindAll()
In this test, I set all the to-be-deleted items to null, then used the FindAll() feature to find all the items that are not null
private void RemoveTest5(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list[i] = null;
list = list.FindAll(x => x != null);
}
Test 6: Set to null and then RemoveAll()
In this test, I set all the to-be-deleted items to null, then used the RemoveAll() feature to remove all the items that are null
private void RemoveTest6(ref List<String> list)
{
for (int i = 0; i < list.Count; i++)
if (i % 5 == 0)
list[i] = null;
list.RemoveAll(x => x == null);
}
Client Application and Outputs
int numItems = 500000;
Stopwatch watch = new Stopwatch();
// List 1...
watch.Start();
List<String> list1 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest1(ref list1);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 2...
watch.Start();
List<String> list2 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest2(ref list2);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 3...
watch.Reset(); watch.Start();
List<String> list3 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest3(ref list3);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 4...
watch.Reset(); watch.Start();
List<String> list4 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest4(ref list4);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 5...
watch.Reset(); watch.Start();
List<String> list5 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest5(ref list5);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
// List 6...
watch.Reset(); watch.Start();
List<String> list6 = GetList(numItems);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
RemoveTest6(ref list6);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine();
Results
00:00:00.1433089 // Create list
00:00:32.8031420 // RemoveAt()
00:00:32.9612512 // Forgot to reset stopwatch :(
00:04:40.3633045 // Remove()
00:00:00.2405003 // Create list
00:00:01.1054731 // Null, Sort(), RemoveRange()
00:00:00.1796988 // Create list
00:00:00.0166984 // Add good values to new list
00:00:00.2115022 // Create list
00:00:00.0194616 // FindAll()
00:00:00.3064646 // Create list
00:00:00.0167236 // RemoveAll()
Notes And Comments
The first two tests do not actually remove every 5th item from the list, because the remaining items shift down after each removal. In fact, out of 500,000 items, only 83,334 were removed (it should have been 100,000). I am okay with this - clearly the Remove()/RemoveAt() methods are not a good idea anyway.
Although I tried to remove the 5th item from the list, in reality there will not be such a pattern. Entries to be removed will be random.
Although I used a List<String> in this example, that will not always be the case. It could be a List<Anything>
Not putting the items in the list to begin with is not an option.
The other methods (3 - 6) all performed much better, comparatively. However, I am a little concerned: in 3, 5, and 6 I was forced to set a value to null, and then remove all the items according to this sentinel. I don't like that approach because I can envision a scenario where one of the items in the list might legitimately be null, and it would get removed unintentionally.
My question is: What is the best way to quickly remove many items from a List<T>? Most of the approaches I've tried look really ugly, and potentially dangerous, to me. Is a List the wrong data structure?
Right now, I am leaning towards creating a new list and adding the good items to the new list, but it seems like there should be a better way.
List isn't an efficient data structure when it comes to removal. You would do better to use a doubly linked list (LinkedList), as removal simply requires reference updates in the adjacent entries.
If the order does not matter, then there is a simple O(1) List removal method.
public static class ListExt
{
// O(1)
public static void RemoveBySwap<T>(this List<T> list, int index)
{
list[index] = list[list.Count - 1];
list.RemoveAt(list.Count - 1);
}
// O(n)
public static void RemoveBySwap<T>(this List<T> list, T item)
{
int index = list.IndexOf(item);
RemoveBySwap(list, index);
}
// O(n)
public static void RemoveBySwap<T>(this List<T> list, Predicate<T> predicate)
{
int index = list.FindIndex(predicate);
RemoveBySwap(list, index);
}
}
This solution is friendly for memory traversal, so even if you need to find the index first it will be very fast.
Notes:
Finding the index of an item must be O(n), since the list cannot stay sorted while items are being swapped around.
Linked lists are slow on traversal, especially for large collections with long life spans.
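A quick usage sketch (note how the remaining items get reordered):
var letters = new List<string> { "a", "b", "c", "d" };
letters.RemoveBySwap(1);             // index overload: letters is now { "a", "d", "c" }
letters.RemoveBySwap(x => x == "c"); // predicate overload: letters is now { "a", "d" }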
If you're happy creating a new list, you don't have to go through setting items to null. For example:
// This overload of Where provides the index as well as the value. Unless
// you need the index, use the simpler overload which just provides the value.
List<string> newList = oldList.Where((value, index) => index % 5 != 0)
.ToList();
However, you might want to look at alternative data structures, such as LinkedList<T> or HashSet<T>. It really depends on what features you need from your data structure.
I feel a HashSet, LinkedList or Dictionary will serve you much better.
You could always remove the items from the end of the list. List removal is O(1) when performed on the last element, since all it does is decrement the count; there is no shifting of subsequent elements involved (which is the reason why list removal is O(n) in general).
for (int i = list.Count - 1; i >= 0; --i)
list.RemoveAt(i);
Or you could do this:
List<int> listA;
List<int> listB;
...
List<int> resultingList = listA.Except(listB).ToList(); // note: Except treats the inputs as sets, so duplicates are dropped
OK, try RemoveAll used like this:
static void Main(string[] args)
{
Stopwatch watch = new Stopwatch();
watch.Start();
List<Int32> test = GetList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
test.RemoveAll( t=> t % 5 == 0);
List<String> test2 = test.ConvertAll(delegate(int i) { return i.ToString(); });
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test.Count).ToString());
Console.ReadLine();
}
static private List<Int32> GetList(int size)
{
List<Int32> test = new List<Int32>();
for (int i = 0; i < size; i++)
test.Add(i);
return test;
}
This only loops twice and removes exactly 100,000 items.
My output for this code:
00:00:00.0099495
00:00:00.1945987
1000000
Updated to try a HashSet
static void Main(string[] args)
{
Stopwatch watch = new Stopwatch();
do
{
// Test with list
watch.Reset(); watch.Start();
List<Int32> test = GetList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
List<String> myList = RemoveTest(test);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test.Count).ToString());
Console.WriteLine();
// Test with HashSet
watch.Reset(); watch.Start();
HashSet<String> test2 = GetStringList(500000);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
watch.Reset(); watch.Start();
HashSet<String> myList2 = RemoveTest(test2);
watch.Stop(); Console.WriteLine(watch.Elapsed.ToString());
Console.WriteLine((500000 - test2.Count).ToString()); // count the HashSet, not the earlier list
Console.WriteLine();
} while (Console.ReadKey().Key != ConsoleKey.Escape);
}
static private List<Int32> GetList(int size)
{
List<Int32> test = new List<Int32>();
for (int i = 0; i < size; i++)
test.Add(i);
return test;
}
static private HashSet<String> GetStringList(int size)
{
HashSet<String> test = new HashSet<String>();
for (int i = 0; i < size; i++)
test.Add(i.ToString());
return test;
}
static private List<String> RemoveTest(List<Int32> list)
{
list.RemoveAll(t => t % 5 == 0);
return list.ConvertAll(delegate(int i) { return i.ToString(); });
}
static private HashSet<String> RemoveTest(HashSet<String> list)
{
list.RemoveWhere(t => Convert.ToInt32(t) % 5 == 0);
return list;
}
This gives me:
00:00:00.0131586
00:00:00.1454723
100000
00:00:00.3459420
00:00:00.2122574
100000
I've found that when dealing with large lists, this is often faster. The speed of the Remove and of finding the right item in the dictionary more than makes up for creating the dictionary. A couple of things, though: the original list has to have unique values, and I don't think the order is guaranteed once you are done.
List<long> hundredThousandItemsInOrignalList;
List<long> fiftyThousandItemsToRemove;
// populate lists...
Dictionary<long, long> originalItems = hundredThousandItemsInOrignalList.ToDictionary(i => i);
foreach (long i in fiftyThousandItemsToRemove)
{
originalItems.Remove(i);
}
List<long> newList = originalItems.Select(i => i.Key).ToList();
Lists are faster than LinkedLists until n gets really big. The reason for this is that so-called cache misses occur much more frequently with LinkedLists than with Lists, and memory lookups are quite expensive. As a List is implemented as an array, the CPU can load a bunch of data at once, because it knows the required data is stored next to each other. However, a linked list does not give the CPU any hint about which data is required next, which forces the CPU to do far more memory lookups. (By the way, by "memory" I mean RAM.)
For further details take a look at: https://jackmott.github.io/programming/2016/08/20/when-bigo-foolsya.html
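A rough way to see the effect yourself (a sketch, not a rigorous benchmark; assumes System.Diagnostics and System.Linq):
var data = Enumerable.Range(0, 10000000).ToList();
var linked = new LinkedList<int>(data);
var sw = Stopwatch.StartNew();
long sum = 0;
foreach (var x in data) sum += x;   // contiguous backing array: cache-friendly
Console.WriteLine("List traversal: {0} ms", sw.ElapsedMilliseconds);
sw.Restart();
sum = 0;
foreach (var x in linked) sum += x; // node-to-node pointer chasing: cache misses
Console.WriteLine("LinkedList traversal: {0} ms", sw.ElapsedMilliseconds);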
The other answers (and the question itself) offer various ways of dealing with this "slug" (slowness bug) using the built-in .NET Framework classes.
But if you're willing to switch to a third-party library, you can get better performance simply by changing the data structure, and leaving your code unchanged except for the list type.
The Loyc Core libraries include two types that work the same way as List<T> but can remove items faster:
DList<T> is a simple data structure that gives you a 2x speedup over List<T> when removing items from random locations
AList<T> is a sophisticated data structure that gives you a large speedup over List<T> when your lists are very long (but may be slower when the list is short).
If you still want to use a List as an underlying structure, you can use the following extension method, which does the heavy lifting for you.
using System.Collections.Generic;
using System.Linq;
namespace Library.Extensions
{
public static class ListExtensions
{
public static IEnumerable<T> RemoveRange<T>(this List<T> list, IEnumerable<T> range)
{
// Note: Intersect and Except are set-based, so duplicate elements are collapsed.
var removed = list.Intersect(range).ToArray();
if (!removed.Any())
{
return Enumerable.Empty<T>();
}
var remaining = list.Except(removed).ToArray();
list.Clear();
list.AddRange(remaining);
return removed;
}
}
}
A simple stopwatch test gives results of about 200 ms for the removal. Keep in mind this is not a rigorous benchmark.
public class Program
{
static void Main(string[] args)
{
var list = Enumerable
.Range(0, 500_000)
.Select(x => x.ToString())
.ToList();
var allFifthItems = list.Where((_, index) => index % 5 == 0).ToArray();
var sw = Stopwatch.StartNew();
list.RemoveRange(allFifthItems);
sw.Stop();
var message = $"{allFifthItems.Length} elements removed in {sw.Elapsed}";
Console.WriteLine(message);
}
}
Output:
100000 elements removed in 00:00:00.2291337
