I have 16 tasks doing the same job, each of them return an array. I want to combine the results in pairs and do same job until I have only one task. I don't know what is the best way to do this.
public static IComparatorNetwork[] Prune(IComparatorNetwork[] nets, int numTasks)
{
var tasks = new Task[numTasks];
var netsPerTask = nets.Length/numTasks;
var start = 0;
var concurrentSet = new ConcurrentBag<IComparatorNetwork>();
for(var i = 0; i < numTasks; i++)
{
IComparatorNetwork[] taskNets;
if (i == numTasks - 1)
{
taskNets = nets.Skip(start).ToArray();
}
else
{
taskNets = nets.Skip(start).Take(netsPerTask).ToArray();
}
start += netsPerTask;
tasks[i] = Task.Factory.StartNew(() =>
{
var pruner = new Pruner();
concurrentSet.AddRange(pruner.Prune(taskNets));
});
}
Task.WaitAll(tasks.ToArray());
if(numTasks > 1)
{
return Prune(concurrentSet.ToArray(), numTasks/2);
}
return concurrentSet.ToArray();
}
Right now I am waiting for all tasks to complete then I repeat with half of the tasks until I have only one. I would like to not have to wait for all on each iteration. I am very new with parallel programming probably the approach is bad.
The code I am trying to parallelize is the following:
public IComparatorNetwork[] Prune(IComparatorNetwork[] nets)
{
var result = new List<IComparatorNetwork>();
for (var i = 0; i < nets.Length; i++)
{
var isSubsumed = false;
for (var index = result.Count - 1; index >= 0; index--)
{
var n = result[index];
if (nets[i].IsSubsumed(n))
{
isSubsumed = true;
break;
}
if (n.IsSubsumed(nets[i]))
{
result.Remove(n);
}
}
if (!isSubsumed)
{
result.Add(nets[i]);
}
}
return result.ToArray();
}`
So what you're fundamentally doing here is aggregating values, but in parallel. Fortunately, PLINQ already has an implementation of Aggregate that works in parallel. So in your case you can simply wrap each element in the original array in its own one element array, and then your Prune operation is able to combine any two arrays of nets into a new single array.
public static IComparatorNetwork[] Prune(IComparatorNetwork[] nets)
{
return nets.Select(net => new[] { net })
.AsParallel()
.Aggregate((a, b) => new Pruner().Prune(a.Concat(b).ToArray()));
}
I'm not super knowledgeable about the internals of their aggregate method, but I would imagine it's likely pretty good and doesn't spend a lot of time waiting unnecessarily. But, if you want to write your own, so that you can be sure the workers are always pulling in new work as soon as their is new work, here is my own implementation. Feel free to compare the two in your specific situation to see which performs best for your needs. Note that PLINQ is configurable in many ways, feel free to experiment with other configurations to see what works best for your situation.
public static T AggregateInParallel<T>(this IEnumerable<T> values, Func<T, T, T> function, int numTasks)
{
Queue<T> queue = new Queue<T>();
foreach (var value in values)
queue.Enqueue(value);
if (!queue.Any())
return default(T); //Consider throwing or doing something else here if the sequence is empty
(T, T)? GetFromQueue()
{
lock (queue)
{
if (queue.Count >= 2)
{
return (queue.Dequeue(), queue.Dequeue());
}
else
{
return null;
}
}
}
var tasks = Enumerable.Range(0, numTasks)
.Select(_ => Task.Run(() =>
{
var pair = GetFromQueue();
while (pair != null)
{
var result = function(pair.Value.Item1, pair.Value.Item2);
lock (queue)
{
queue.Enqueue(result);
}
pair = GetFromQueue();
}
}))
.ToArray();
Task.WaitAll(tasks);
return queue.Dequeue();
}
And the calling code for this version would look like:
public static IComparatorNetwork[] Prune2(IComparatorNetwork[] nets)
{
return nets.Select(net => new[] { net })
.AggregateInParallel((a, b) => new Pruner().Prune(a.Concat(b).ToArray()), nets.Length / 2);
}
As mentioned in comments, you can make the pruner's Prune method much more efficient by having it accept two collections, not just one, and only comparing items from each collection with the other, knowing that all items from the same collection will not subsume any others from that collection. This makes the method not only much shorter, simpler, and easier to understand, but also removes a sizeable portion of the expensive comparisons. A few minor adaptations can also greatly reduce the number of intermediate collections created.
public static IReadOnlyList<IComparatorNetwork> Prune(IReadOnlyList<IComparatorNetwork> first, IReadOnlyList<IComparatorNetwork> second)
{
var firstItemsNotSubsumed = first.Where(outerNet => !second.Any(innerNet => outerNet.IsSubsumed(innerNet)));
var secondItemsNotSubsumed = second.Where(outerNet => !first.Any(innerNet => outerNet.IsSubsumed(innerNet)));
return firstItemsNotSubsumed.Concat(secondItemsNotSubsumed).ToList();
}
With the the calling code just needs minor adaptations to ensure the types match up and that you pass in both collections rather than concatting them first.
public static IReadOnlyList<IComparatorNetwork> Prune(IReadOnlyList<IComparatorNetwork> nets)
{
return nets.Select(net => (IReadOnlyList<IComparatorNetwork>)new[] { net })
.AggregateInParallel((a, b) => Pruner.Prune(a, b), nets.Count / 2);
}
Related
I have a very long sequence of data is the form of IEnumerable, and I would like to check it for a number of conditions. Each condition returns a value of true or false, and I want to know if all conditions are true. My problem is that I can not afford to materialize the IEnumerable by calling ToList, because it is simply too long (> 10,000,000,000 elements). Neither I can afford to enumerate the sequence multiple times, one for each condition, because each time I will get a different sequence. I am searching for an efficient way to perform this check, using the existing LINQ functionality if possible.
Clarification: I am asking for a general solution, not for a solution of the specific example problem that is presented bellow.
Here is a dummy version of my sequence:
static IEnumerable<int> GetLongSequence()
{
var random = new Random();
for (long i = 0; i < 10_000_000_000; i++) yield return random.Next(0, 100_000_000);
}
And here is an example of the conditions that the sequence must satisfy:
var source = GetLongSequence();
var result = source.Any(n => n % 28_413_803 == 0)
&& source.All(n => n < 99_999_999)
&& source.Average(n => n) > 50_000_001;
Unfortunately this approach invokes three times the GetLongSequence, so it doesn't satisfy the requirements of the problem.
I tried to write a Linqy extension method of the above, hoping that this could give me some ideas:
public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
params Func<IEnumerable<TSource>, bool>[] conditions)
{
foreach (var condition in conditions)
{
if (!condition(source)) return false;
}
return true;
}
This is how I intend to use it:
var result = source.AllConditions
(
s => s.Any(n => n % 28_413_803 == 0),
s => s.All(n => n < 99_999_999),
s => s.Average(n => n) > 50_000_001,
// more conditions...
);
Unfortunately this offers no improvement. The GetLongSequence is again invoked three times.
After hitting my head against the wall for an hour, without making any progress, I figured out a possible solution. I could run each condition in a separate thread, and synchronize their access to a single shared enumerator of the sequence. So I ended up with this monstrosity:
public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
params Func<IEnumerable<TSource>, bool>[] conditions)
{
var locker = new object();
var enumerator = source.GetEnumerator();
var barrier = new Barrier(conditions.Length);
long index = -1;
bool finished = false;
IEnumerable<TSource> OneByOne()
{
try
{
while (true)
{
TSource current;
lock (locker)
{
if (finished) break;
if (barrier.CurrentPhaseNumber > index)
{
index = barrier.CurrentPhaseNumber;
finished = !enumerator.MoveNext();
if (finished)
{
enumerator.Dispose(); break;
}
}
current = enumerator.Current;
}
yield return current;
barrier.SignalAndWait();
}
}
finally
{
barrier.RemoveParticipant();
}
}
var results = new ConcurrentQueue<bool>();
var threads = conditions.Select(condition => new Thread(() =>
{
var result = condition(OneByOne());
results.Enqueue(result);
})
{ IsBackground = true }).ToArray();
foreach (var thread in threads) thread.Start();
foreach (var thread in threads) thread.Join();
return results.All(r => r);
}
For the synchronization a used a Barrier. This solution actually works way better than I thought. It can process almost 1,000,000 elements per second in my machine. It is not fast enough though, since it needs almost 3 hours to process the full sequence of 10,000,000,000 elements. And I can't wait for the result for longer than 5 minutes. Any ideas about how I could run these conditions efficiently in a single thread?
If you need to ensure that the sequence is enumerated only once, conditions operating on the whole sequence are not useful.
One possibility that comes to my mind is to have an interface which is called for each element of the sequence and implement this interface in different ways for your specific conditions:
bool Example()
{
var source = GetLongSequence();
var conditions = new List<IEvaluate<int>>
{
new Any<int>(n => n % 28_413_803 == 0),
new All<int>(n => n < 99_999_999),
new Average(d => d > 50_000_001)
};
foreach (var item in source)
{
foreach (var condition in conditions)
{
condition.Evaluate(item);
}
}
return conditions.All(c => c.Result);
}
static IEnumerable<int> GetLongSequence()
{
var random = new Random();
for (long i = 0; i < 10_000_000_000; i++) yield return random.Next(0, 100_000_000);
}
interface IEvaluate<T>
{
void Evaluate(T item);
bool Result { get; }
}
class Any<T> : IEvaluate<T>
{
private bool _result;
private readonly Func<T, bool> _predicate;
public Any(Func<T, bool> predicate)
{
_predicate = predicate;
_result = false;
}
public void Evaluate(T item)
{
if (_predicate(item))
{
_result = true;
}
}
public bool Result => _result;
}
class All<T> : IEvaluate<T>
{
private bool _result;
private readonly Func<T, bool> _predicate;
public All(Func<T, bool> predicate)
{
_predicate = predicate;
_result = true;
}
public void Evaluate(T item)
{
if (!_predicate(item))
{
_result = false;
}
}
public bool Result => _result;
}
class Average : IEvaluate<int>
{
private long _sum;
private int _count;
Func<double, bool> _evaluate;
public Average(Func<double, bool> evaluate)
{
}
public void Evaluate(int item)
{
_sum += item;
_count++;
}
public bool Result => _evaluate((double)_sum / _count);
}
If all you want is check for these three conditions on a single thread in only one enumeration, I wouldn't use LINQ and manually aggregate the checks:
bool anyVerified = false;
bool allVerified = true;
double averageSoFar = 0;
foreach (int n in GetLongSequence()) {
anyVerified = anyVerified || n % 28_413_803 == 0;
allVerified = allVerified && n < 99_999_999;
averageSoFar += n / 10_000_000_000;
// Early out conditions here...
}
return anyVerified && allVerified && averageSoFar > 50_000_001;
This could be made more generic if you plan to do these checks often but it looks like it satisfies all your requirements.
Can I also suggest you another method based on the Enumerable.Aggregate LINQ extension method.
public static class Parsing {
public static bool ParseOnceAndCheck(this IEnumerable<int> collection, Func<int, bool> all, Func<int, bool> any, Func<double, bool> average) {
// Aggregate the two boolean results, the sum of all values and the count of values...
(bool allVerified, bool anyVerified, int sum, int count) = collection.Aggregate(
ValueTuple.Create(true, false, 0, 0),
(tuple, item) => ValueTuple.Create(tuple.Item1 && all(item), tuple.Item2 || any(item), tuple.Item3 + item, tuple.Item4 + 1)
);
// ... and summarizes the result
return allVerified && anyVerified && average(sum / count);
}
}
You could call this extension method in a very similar way than you would usual LINQ methods but there would be only one enumeration of your sequence:
IEnumerable<int> sequence = GetLongSequence();
bool result = sequence.ParseOnceAndCheck(
all: n => n < 99_999_999,
any: n => n % 28_413_803 == 0,
average: a => a > 50_000_001
);
I found a single-threaded solution that uses the Reactive Extensions library. On the one hand it's an excellent solution regarding features and ease of use, since all methods that are available in LINQ for IEnumerable are also available in RX for IObservable. On the other hand it is a bit disappointing regarding performance, as it is as slow as my wacky multi-threaded solution that is presented inside my question.
Update: I discarded the previous two implementations (one using the method Replay, the other using the method Publish) with a new one that uses the class Subject. This class is a low-level combination of an IObservable and IObserver. I am posting to it the items of the source IEnumerable, which are then propagated to all the IObservable<bool>'s provided by the caller. The performance is now decent, only 40% slower than Klaus Gütter's excellent solution. Also I can now break from the loop early if a condition (like All) can be determined to be false before the end of the enumeration.
public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
params Func<IObservable<TSource>, IObservable<bool>>[] conditions)
{
var subject = new Subject<TSource>();
var result = true;
foreach (var condition in conditions)
{
condition(subject).SingleAsync().Subscribe(onNext: value =>
{
if (value) return;
result = false;
});
}
foreach (var item in source)
{
if (!result) break;
subject.OnNext(item);
}
return result;
}
Usage example:
var result = source.AllConditions
(
o => o.Any(n => n % 28_413_803 == 0),
o => o.All(n => n < 99_999_999),
o => o.Average(n => n).Select(v => v > 50_000_001)
);
Each condition should return an IObservable containing a single boolean value. This is not enforcible by the RX API, so I used the System.Reactive.Linq.SingleAsync method to enforce it at runtime (by throwing an exception if a result doesn't comply to this contract).
I know it's not the best idea to get all values from a cache, but I have a special use case for an internal tool.
I have the following code:
public IEnumerable<KeyValuePair<K, V>> GetAllStoreEntries()
{
var keys = RedisStore.Server.Keys(pattern: $"{GetCompletePrefix()}:*", pageSize: pageSize).ToList();
foreach (var subList in Split(keys, pageSize))
{
var subKeys = subList.ToArray();
var values = RedisStore.RedisCache.StringGet(subKeys); <---------
for (var i = 0; i < values.Length; i++)
{
var value = values[i];
if (value.HasValue)
{
var svalue = JsonConvert.DeserializeObject<V>(value);
yield return new KeyValuePair<K, V>(GetKKey(subKeys[i]), svalue);
}
}
}
}
On the marked position I want to use StringGetAsync, but I have no idea how to restructure the code that I can use it.
How can I rewrite GetAllStoreEntries, so that it fetches the splitted keys-list in parallel?
I wonder if your best bet here might be to populate a dictionary using an overlapped pipeline, something like (not exact code, I'm making this up as I go):
async Task<Dictionary<K, V>> GetAllStoreEntries()
{
var keys = RedisStore.Server.Keys(pattern: $"{GetCompletePrefix()}:*", pageSize: pageSize);
var result = new Dictionary<K, V>();
const int OVERLAP = 10; // the number of in-flight operations
await keys.Pipe<RedisKey, RedisValue>(OVERLAP,
key => redis.StringGet(key),
(key, val) => result.Add(GetKey(key), Deserialize(val)));
return result;
}
using (untested):
public static async Task Pipe<TSource, TResult>(this IEnumerable<TSource> source,
int pipeLength,
Func<TSource, Task<TResult>> map, Action<TSource, TResult> reduce)
{
var buffer = new (TSource val, Task<TResult> task)[pipeLength];
uint index = 0;
foreach (var item in source)
{
var newTask = map(item);
if (newTask.IsCompleted)
{ // completed synchronously; check for error and move on
reduce(item, newTask.Result);
}
else
{
var thisIndex = index++ % pipeLength;
var oldPair = buffer[thisIndex];
if (oldPair.task != null) reduce(oldPair.val, await oldPair.task);
buffer[thisIndex] = (item, newTask);
}
}
for (int i = 0; i < pipeLength; i++)
{
var oldPair = buffer[(i + index) % pipeLength];
if (oldPair.task != null) reduce(oldPair.val, await oldPair.task);
}
}
What this attempts to do is:
iterate the keys lazily - no ToList()
have multiple commands in flight at a time
process the results efficiently and sequentially
as an async operation that populates a dictionary
Completely untested, but... in theory it should work! And: it doesn't pay latency per key, thanks to overlapped pipelined async operations. Increasing OVERLAP may reduce latency delays further, as may increasing pageSize.
This question already has answers here:
Create batches in LINQ
(21 answers)
Closed 3 years ago.
I am developing a C# program which has an "IEnumerable users" that stores the ids of 4 million users. I need to loop through the IEnumerable and extract a batch 1000 ids each time to perform some operations in another method.
How do I extract 1000 ids at a time from start of the IEnumerable, do some thing else, then fetch the next batch of 1000 and so on?
Is this possible?
You can use MoreLINQ's Batch operator (available from NuGet):
foreach(IEnumerable<User> batch in users.Batch(1000))
// use batch
If simple usage of library is not an option, you can reuse implementation:
public static IEnumerable<IEnumerable<T>> Batch<T>(
this IEnumerable<T> source, int size)
{
T[] bucket = null;
var count = 0;
foreach (var item in source)
{
if (bucket == null)
bucket = new T[size];
bucket[count++] = item;
if (count != size)
continue;
yield return bucket.Select(x => x);
bucket = null;
count = 0;
}
// Return the last bucket with all remaining elements
if (bucket != null && count > 0)
{
Array.Resize(ref bucket, count);
yield return bucket.Select(x => x);
}
}
BTW for performance you can simply return bucket without calling Select(x => x). Select is optimized for arrays, but selector delegate still would be invoked on each item. So, in your case it's better to use
yield return bucket;
Sounds like you need to use Skip and Take methods of your object. Example:
users.Skip(1000).Take(1000)
this would skip the first 1000 and take the next 1000. You'd just need to increase the amount skipped with each call
You could use an integer variable with the parameter for Skip and you can adjust how much is skipped. You can then call it in a method.
public IEnumerable<user> GetBatch(int pageNumber)
{
return users.Skip(pageNumber * 1000).Take(1000);
}
The easiest way to do this is probably just to use the GroupBy method in LINQ:
var batches = myEnumerable
.Select((x, i) => new { x, i })
.GroupBy(p => (p.i / 1000), (p, i) => p.x);
But for a more sophisticated solution, see this blog post on how to create your own extension method to do this. Duplicated here for posterity:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
{
List<T> nextbatch = new List<T>(batchSize);
foreach (T item in collection)
{
nextbatch.Add(item);
if (nextbatch.Count == batchSize)
{
yield return nextbatch;
nextbatch = new List<T>();
// or nextbatch.Clear(); but see Servy's comment below
}
}
if (nextbatch.Count > 0)
yield return nextbatch;
}
How about
int batchsize = 5;
List<string> colection = new List<string> { "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"};
for (int x = 0; x < Math.Ceiling((decimal)colection.Count / batchsize); x++)
{
var t = colection.Skip(x * batchsize).Take(batchsize);
}
try using this:
public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
this IEnumerable<TSource> source,
int batchSize)
{
var batch = new List<TSource>();
foreach (var item in source)
{
batch.Add(item);
if (batch.Count == batchSize)
{
yield return batch;
batch = new List<TSource>();
}
}
if (batch.Any()) yield return batch;
}
and to use above function:
foreach (var list in Users.Batch(1000))
{
}
You can achieve that using Take and Skip Enumerable extension method. For more information on usage checkout linq 101
Something like this would work:
List<MyClass> batch = new List<MyClass>();
foreach (MyClass item in items)
{
batch.Add(item);
if (batch.Count == 1000)
{
// Perform operation on batch
batch.Clear();
}
}
// Process last batch
if (batch.Any())
{
// Perform operation on batch
}
And you could generalize this into a generic method, like this:
static void PerformBatchedOperation<T>(IEnumerable<T> items,
Action<IEnumerable<T>> operation,
int batchSize)
{
List<T> batch = new List<T>();
foreach (T item in items)
{
batch.Add(item);
if (batch.Count == batchSize)
{
operation(batch);
batch.Clear();
}
}
// Process last batch
if (batch.Any())
{
operation(batch);
}
}
You can use Take operator linq
Link : http://msdn.microsoft.com/fr-fr/library/vstudio/bb503062.aspx
In a streaming context, where the enumerator might get blocked in the middle of the batch, simply because the value is not yet produced (yield) it is useful to have a timeout method so that the last batch is produced after a given time. I used this for example for tailing a cursor in MongoDB. It's a little bit complicated, because the enumeration has to be done in another thread.
public static IEnumerable<List<T>> TimedBatch<T>(this IEnumerable<T> collection, double timeoutMilliseconds, long maxItems)
{
object _lock = new object();
List<T> batch = new List<T>();
AutoResetEvent yieldEventTriggered = new AutoResetEvent(false);
AutoResetEvent yieldEventFinished = new AutoResetEvent(false);
bool yieldEventTriggering = false;
var task = Task.Run(delegate
{
foreach (T item in collection)
{
lock (_lock)
{
batch.Add(item);
if (batch.Count == maxItems)
{
yieldEventTriggering = true;
yieldEventTriggered.Set();
}
}
if (yieldEventTriggering)
{
yieldEventFinished.WaitOne(); //wait for the yield to finish, and batch to be cleaned
yieldEventTriggering = false;
}
}
});
while (!task.IsCompleted)
{
//Wait for the event to be triggered, or the timeout to finish
yieldEventTriggered.WaitOne(TimeSpan.FromMilliseconds(timeoutMilliseconds));
lock (_lock)
{
if (batch.Count > 0) //yield return only if the batch accumulated something
{
yield return batch;
batch.Clear();
yieldEventFinished.Set();
}
}
}
task.Wait();
}
Let's say I have two sequences returning integers 1 to 5.
The first returns 1, 2 and 3 very fast, but 4 and 5 take 200ms each.
public static IEnumerable<int> FastFirst()
{
for (int i = 1; i < 6; i++)
{
if (i > 3) Thread.Sleep(200);
yield return i;
}
}
The second returns 1, 2 and 3 with a 200ms delay, but 4 and 5 are returned fast.
public static IEnumerable<int> SlowFirst()
{
for (int i = 1; i < 6; i++)
{
if (i < 4) Thread.Sleep(200);
yield return i;
}
}
Unioning both these sequences give me just numbers 1 to 5.
FastFirst().Union(SlowFirst());
I cannot guarantee which of the two methods has delays at what point, so the order of the execution cannot guarantee a solution for me. Therefore, I would like to parallelise the union, in order to minimise the (artifical) delay in my example.
A real-world scenario: I have a cache that returns some entities, and a datasource that returns all entities. I'd like to be able to return an iterator from a method that internally parallelises the request to both the cache and the datasource so that the cached results yield as fast as possible.
Note 1: I realise this is still wasting CPU cycles; I'm not asking how can I prevent the sequences from iterating over their slow elements, just how I can union them as fast as possible.
Update 1: I've tailored achitaka-san's great response to accept multiple producers, and to use ContinueWhenAll to set the BlockingCollection's CompleteAdding just the once. I just put it here since it would get lost in the lack of comments formatting. Any further feedback would be great!
public static IEnumerable<TResult> SelectAsync<TResult>(
params IEnumerable<TResult>[] producer)
{
var resultsQueue = new BlockingCollection<TResult>();
var taskList = new HashSet<Task>();
foreach (var result in producer)
{
taskList.Add(
Task.Factory.StartNew(
() =>
{
foreach (var product in result)
{
resultsQueue.Add(product);
}
}));
}
Task.Factory.ContinueWhenAll(taskList.ToArray(), x => resultsQueue.CompleteAdding());
return resultsQueue.GetConsumingEnumerable();
}
Take a look at this.
The first method just returns everything in order results come.
The second checks uniqueness. If you chain them you will get the result you want I think.
public static class Class1
{
public static IEnumerable<TResult> SelectAsync<TResult>(
IEnumerable<TResult> producer1,
IEnumerable<TResult> producer2,
int capacity)
{
var resultsQueue = new BlockingCollection<TResult>(capacity);
var producer1Done = false;
var producer2Done = false;
Task.Factory.StartNew(() =>
{
foreach (var product in producer1)
{
resultsQueue.Add(product);
}
producer1Done = true;
if (producer1Done && producer2Done) { resultsQueue.CompleteAdding(); }
});
Task.Factory.StartNew(() =>
{
foreach (var product in producer2)
{
resultsQueue.Add(product);
}
producer2Done = true;
if (producer1Done && producer2Done) { resultsQueue.CompleteAdding(); }
});
return resultsQueue.GetConsumingEnumerable();
}
public static IEnumerable<TResult> SelectAsyncUnique<TResult>(this IEnumerable<TResult> source)
{
HashSet<TResult> knownResults = new HashSet<TResult>();
foreach (TResult result in source)
{
if (knownResults.Contains(result)) {continue;}
knownResults.Add(result);
yield return result;
}
}
}
The cache would be nearly instant compared to fetching from the database, so you could read from the cache first and return those items, then read from the database and return the items except those that were found in the cache.
If you try to parallelise this, you will add a lot of complexity but get quite a small gain.
Edit:
If there is no predictable difference in the speed of the sources, you could run them in threads and use a synchronised hash set to keep track of which items you have already got, put the new items in a queue, and let the main thread read from the queue:
public static IEnumerable<TItem> GetParallel<TItem, TKey>(Func<TItem, TKey> getKey, params IEnumerable<TItem>[] sources) {
HashSet<TKey> found = new HashSet<TKey>();
List<TItem> queue = new List<TItem>();
object sync = new object();
int alive = 0;
object aliveSync = new object();
foreach (IEnumerable<TItem> source in sources) {
lock (aliveSync) {
alive++;
}
new Thread(s => {
foreach (TItem item in s as IEnumerable<TItem>) {
TKey key = getKey(item);
lock (sync) {
if (found.Add(key)) {
queue.Add(item);
}
}
}
lock (aliveSync) {
alive--;
}
}).Start(source);
}
while (true) {
lock (sync) {
if (queue.Count > 0) {
foreach (TItem item in queue) {
yield return item;
}
queue.Clear();
}
}
lock (aliveSync) {
if (alive == 0) break;
}
Thread.Sleep(100);
}
}
Test stream:
public static IEnumerable<int> SlowRandomFeed(Random rnd) {
int[] values = new int[100];
for (int i = 0; i < 100; i++) {
int pos = rnd.Next(i + 1);
values[i] = i;
int temp = values[pos];
values[pos] = values[i];
values[i] = temp;
}
foreach (int value in values) {
yield return value;
Thread.Sleep(rnd.Next(200));
}
}
Test:
Random rnd = new Random();
foreach (int item in GetParallel(n => n, SlowRandomFeed(rnd), SlowRandomFeed(rnd), SlowRandomFeed(rnd), SlowRandomFeed(rnd))) {
Console.Write("{0:0000 }", item);
}
I'm looking for a better alternative for Enumerable.Count() == n. The best I've been able to come up with is:
static class EnumerableExtensions
{
public static bool CountEquals<T>(this IEnumerable<T> items, int n)
{
if (n <= 0) throw new ArgumentOutOfRangeException("n"); // use Any()
var iCollection = items as System.Collections.ICollection;
if (iCollection != null)
return iCollection.Count == n;
int count = 0;
bool? retval = null;
foreach (var item in items)
{
count++;
if (retval.HasValue)
return false;
if (count == n)
retval = true;
}
if (retval.HasValue)
return retval.Value;
return false;
}
}
class Program
{
static void Main(string[] args)
{
var items0 = new List<int>();
var items1 = new List<int>() { 314 };
var items3 = new List<int>() { 1, 2, 3 };
var items5 = new List<int>() { 1, 2, 3, 4, 5 };
var items10 = Enumerable.Range(0, 10);
var itemsLarge = Enumerable.Range(0, Int32.MaxValue);
Console.WriteLine(items0.CountEquals(3));
Console.WriteLine(items1.CountEquals(3));
Console.WriteLine(items3.CountEquals(3));
Console.WriteLine(items5.CountEquals(3));
Console.WriteLine(itemsLarge.CountEquals(3));
}
}
Can I do any better? Is there a way to generalize this even more—passing in the comparision?
You can use a combination of Take and Count to get rid of the loop entirely:
public static bool CountEquals<T>(this IEnumerable<T> items, int n)
{
var iCollection = items as System.Collections.ICollection;
if (iCollection != null)
return iCollection.Count == n;
return items.Take(n + 1).Count() == n;
}
Using Enumerable.Count would be much better than your code above. It already optimizes for ICollection internally.
That being said, if you must keep your extension, you could simplify the loop a bit:
int count = 0;
foreach (var item in items)
{
count++;
if(count > n)
return false;
}
return count == n;
What exactly do you mean by "better"? Faster? Easier?
Essentially what you seem to have done is written a specialized method that's optimized for one specific task. You mention generalizing it, but its performance advantage stems from the fact that it's so specific (assuming there is a performance advantage--methods like Count are already tuned pretty hard for performance, and the compiler's pretty good at optimizing stuff like this).
Premature optimization is the root of all evil. If the performance of this particular operation is so important that it's worth replacing the twenty-odd-character expression xyz.Count() == abc with dozens of lines of code, you may want to try other methods of increasing performance, like refactoring. For most situations, just the overhead from using managed code is going to dwarf the performance bonus you get (if any).
That being said -- if you have something like 10 million items, and your target count is a whole lot less, I'm pretty sure the below will short-circuit the iteration:
int count = 0;
var subset = items.TakeWhile(x => count++ < n + 1);
return count == n + 1;
Easy to read, easy to maintain, and probably just as fast.