I have a sequence of stock ticks coming in, and I want to take all the data from the last hour and do some processing on it. I am trying to achieve this with Reactive Extensions 2.0. I read in another post to use Interval, but I think that is deprecated.
Would this extension method solve your problem?
public static IObservable<T[]> RollingBuffer<T>(
    this IObservable<T> @this,
    TimeSpan buffering)
{
    return Observable.Create<T[]>(o =>
    {
        var list = new LinkedList<Timestamped<T>>();
        return @this.Timestamp().Subscribe(tx =>
        {
            // Add the newest value, then trim anything older than the buffering window.
            list.AddLast(tx);
            while (list.First.Value.Timestamp < DateTime.Now.Subtract(buffering))
            {
                list.RemoveFirst();
            }
            o.OnNext(list.Select(tx2 => tx2.Value).ToArray());
        }, ex => o.OnError(ex), () => o.OnCompleted());
    });
}
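For the question's scenario it might be used like this (ticks stands for your tick stream and ProcessWindow for whatever processing you do; both are placeholder names, not from the question):
ticks.RollingBuffer(TimeSpan.FromHours(1))
     .Subscribe(window => ProcessWindow(window));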
You are looking for the Window operators!
Here is a lengthy article I wrote on working with sequences of coincidence (overlapping windows of sequences)
http://introtorx.com/Content/v1.0.10621.0/17_SequencesOfCoincidence.html
So if you wanted to build a rolling aggregate (a rolling sum here; an average works the same way), you could use this sort of code:
var scheduler = new TestScheduler();
var notifications = new Recorded<Notification<double>>[30];
for (int i = 0; i < notifications.Length; i++)
{
notifications[i] = new Recorded<Notification<double>>(i*1000000, Notification.CreateOnNext<double>(i));
}
//Push values into an observable sequence 0.1 seconds apart, with values from 0 to 29
var source = scheduler.CreateHotObservable(notifications);
source.GroupJoin(
source, //Take values from myself
_=>Observable.Return(0, scheduler), //Just the first value
_=>Observable.Timer(TimeSpan.FromSeconds(1), scheduler),//Window period, change to 1hour
(lhs, rhs)=>rhs.Sum()) //Aggregation you want to do.
.Subscribe(i=>Console.WriteLine (i));
scheduler.Start();
And we can see it outputs the rolling sums as it receives values:
0, 1, 3, 6, 10, 15, 21, 28...
Very likely Buffer is what you are looking for:
var hourlyBatch = ticks.Buffer(TimeSpan.FromHours(1));
Or, if you need a sliding last-hour window rather than the consecutive, non-overlapping batches Buffer produces, and assuming the data is already Timestamped, simply use Scan:
public static IObservable<IReadOnlyList<Timestamped<T>>> SlidingWindow<T>(
    this IObservable<Timestamped<T>> self, TimeSpan length)
{
    return self.Scan(new LinkedList<Timestamped<T>>(),
        (list, newSample) =>
        {
            // Add the new sample, then drop samples that have fallen out of the window.
            list.AddLast(newSample);
            var oldest = newSample.Timestamp - length;
            while (list.Count > 0 && list.First.Value.Timestamp < oldest)
                list.RemoveFirst();
            return list;
        }).Select(l => l.ToList().AsReadOnly());
}
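A quick usage sketch, assuming the tick stream is not yet timestamped (ticks and ProcessLastHour are placeholder names):
ticks.Timestamp()
     .SlidingWindow(TimeSpan.FromHours(1))
     .Subscribe(window => ProcessLastHour(window));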
I am using C# and ASP.NET Core 6 MVC.
I have a requirement to fetch all results from an API using an offset, whether there are just 64 records or 6300. I need to adjust the offset and make concurrent (parallel) calls to get all the records, and I want to do this the best way.
I am calling an API that returns at most 100 records per call, although the overall total (totalResult) can be around 65, 120, 1500, 2520, 6534, etc. There is an integer offset which I can pass into the API to get the next 100 results each time. By default it is zero, which brings back at most the first 100 records.
For example, for a totalResult of 65, offset 0 is sufficient as it will bring all 65 records. For a totalResult of 150, offset 0 will bring 100 records, and then for the next iteration the offset has to be 100 to bring the rest. Likewise, for 6530 records the offset has to be adjusted to 100, 200, 300... to get all results.
Now, I need to run this task in parallel to avoid delays.
This is my function:
var offset = 0;
// My async call method
var addressResult = await _postcode.GetAddresses(strPostcode, offset);
if (addressResult?.Results != null && addressResult.Results.Any())
{
    // concurrency code to run here with offset
    int total = addressResult.Header.TotalResults; // total result, e.g. 6500
    var thePostcoderesult = addressResult.Results;
    // max result could be any number, depending on the total result
    int maxresult = thePostcoderesult.Count();
}
So in the end, when all the concurrent calls to the API finish, thePostcoderesult should have all the results added to it.
var thePostcoderesult = addressResult.Results;
Now, I am aware we can achieve this through
await Parallel.ForEachAsync(offsets, options, async (offset, ct) =>
with the help of the accepted answer of How to make multiple API calls faster?
I tried implementing that logic, but it only gives me up to 1000 results; the offset and the parallel loop are not aligned somehow. The tasks run only 10 times, which gives 1000 results, although there are 1630 results for the postcode I am searching.
Here is my updated code, but as I mentioned it does not run for (or wait on) the full set of offsets.
var offset = 0;
var addressResult = await _postcode.GetAddresses(strPostcode, offset);
if (addressResult?.Results != null && addressResult.Results.Any())
{
int total = addressResult.Header.TotalResults;
// Setting offset here - but something is not right
IEnumerable<int> offsets = Enumerable
    .Range(0, total)
    .Select(n => checked(n * 100))
    .TakeWhile(o => o < Volatile.Read(ref total));
// wanted to use 10 parallel threads which is a safe bet I believe
var options = new ParallelOptions() { MaxDegreeOfParallelism = 10 };
var thePostcoderesult = new List<AddressResult>();
await Parallel.ForEachAsync(offsets, options, async (offset, ct) =>
{
var addressResult = await _postcode.GetAddresses(strPostcode, offset);
if (offset == 0)
{ //I am not using it
//Volatile.Write(ref total, Jresult.Results.Count());
}
thePostcoderesult.AddRange(addressResult.Results);
});
return thePostcoderesult;
}
Apologies in advance for the detailed post. If you can help me do this in a more correct or neat way, please do.
Many thanks
You've got a lot going on there; I don't think it needs to be quite that complicated. Since it seems the initial GetAddresses call tells you how many records you're going to have, you can do something like this:
var initialResponse = await _postcode.GetAddresses(strPostcode, 0);
if (initialResponse?.Results == null || !initialResponse.Results.Any())
{
    return Array.Empty<AddressResult>(); // nothing to fetch
}
var totalPostCodeResults = new AddressResult[initialResponse.Header.TotalResults];
// fill up to the first 100 since you have it and bail if that's all there is
FillItems(initialResponse.Results, totalPostCodeResults, 0);
if(totalPostCodeResults.Length <= 100)
return totalPostCodeResults;
// Fill the offsets (aka start indexes) starting at 100
var offsets = new List<int>();
var offset = 100;
while(offset < totalPostCodeResults.Length)
{
offsets.Add(offset);
offset+=100;
}
// TODO: add the last one using modulus
// Kick off a task for each offset range
var tasks = new Task[offsets.Count];
for(int i = 0; i < tasks.Length; i++)
{
// copy i to scoped variable to avoid parallel messiness
var index = i;
tasks[index] = Task.Run(async () => {
var response = await _postcode.GetAddresses(strPostcode, offsets[index]);
FillItems(response.Results, totalPostCodeResults, offsets[index]);
});
}
// Wait for all of them to finish
Task.WaitAll(tasks);
return totalPostCodeResults;
void FillItems(List<AddressResult> results, AddressResult[] totalArray, int startIndex)
{
var index = startIndex;
results.ForEach(item => totalArray[index++] = item);
}
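One possible refinement, since the surrounding method is already async (it awaits the initial GetAddresses call): replacing the blocking Task.WaitAll(tasks) with an awaited Task.WhenAll keeps the thread free while the remaining pages download:
await Task.WhenAll(tasks);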
This question is a follow up on How to optimize performance in a simple TPL DataFlow pipeline?
The source code is here - https://github.com/MarkKharitonov/LearningTPLDataFlow
Given:
Several solutions covering about 400 C# projects, encompassing thousands of C# source files totaling more than 10,000,000 lines of code.
A file containing string literals, one per line.
I want to produce a JSON file listing all the occurrences of the literals in the source code. For every matching line I want to have the following pieces of information:
The project path
The C# file path
The matching line itself
The matching line number
And all the records arranged as a dictionary keyed by the respective literal.
So the challenge is to do it as efficiently as possible (in C#, of course).
The DataFlow pipeline can be found in this file - https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs
Here it is:
private void Run(string workspaceRoot, string outFilePath, string[] literals, bool searchAllFiles, int workSize, int maxDOP1, int maxDOP2, int maxDOP3, int maxDOP4)
{
var res = new SortedDictionary<string, List<MatchingLine>>();
var projects = (workspaceRoot + "build\\projects.yml").YieldAllProjects();
var progress = new Progress();
var taskSchedulerPair = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, Environment.ProcessorCount);
var produceCSFiles = new TransformManyBlock<ProjectEx, CSFile>(p => YieldCSFiles(p, searchAllFiles), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP1
});
var produceCSFileContent = new TransformBlock<CSFile, CSFile>(CSFile.PopulateContentAsync, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP2
});
var produceWorkItems = new TransformManyBlock<CSFile, (CSFile CSFile, int Pos, int Length)>(csFile => csFile.YieldWorkItems(literals, workSize, progress), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP3,
TaskScheduler = taskSchedulerPair.ConcurrentScheduler
});
var produceMatchingLines = new TransformManyBlock<(CSFile CSFile, int Pos, int Length), MatchingLine>(o => o.CSFile.YieldMatchingLines(literals, o.Pos, o.Length, progress), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP4,
TaskScheduler = taskSchedulerPair.ConcurrentScheduler
});
var getMatchingLines = new ActionBlock<MatchingLine>(o => AddResult(res, o));
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
produceCSFiles.LinkTo(produceCSFileContent, linkOptions);
produceCSFileContent.LinkTo(produceWorkItems, linkOptions);
produceWorkItems.LinkTo(produceMatchingLines, linkOptions);
produceMatchingLines.LinkTo(getMatchingLines, linkOptions);
var progressTask = Task.Factory.StartNew(() =>
{
var delay = literals.Length < 10 ? 1000 : 10000;
for (; ; )
{
var current = Interlocked.Read(ref progress.Current);
var total = Interlocked.Read(ref progress.Total);
Console.Write("Total = {0:n0}, Current = {1:n0}, Percents = {2:P} \r", total, current, ((double)current) / total);
if (progress.Done)
{
break;
}
Thread.Sleep(delay);
}
Console.WriteLine();
}, TaskCreationOptions.LongRunning);
projects.ForEach(p => produceCSFiles.Post(p));
produceCSFiles.Complete();
getMatchingLines.Completion.GetAwaiter().GetResult();
progress.Done = true;
progressTask.GetAwaiter().GetResult();
res.SaveAsJson(outFilePath);
}
The default parameters are (https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs#L24-L28):
private int m_maxDOP1 = 3;
private int m_maxDOP2 = 20;
private int m_maxDOP3 = Environment.ProcessorCount;
private int m_maxDOP4 = Environment.ProcessorCount;
private int m_workSize = 1_000_000;
My idea is to divide the work into work items, where a work item's size is computed by multiplying the number of lines in the respective file by the count of the string literals. So, if a C# file contains 500 lines, then searching it for all 3401 literals results in work of size 3401 * 500 = 1,700,500.
The work item size is 1,000,000 by default, so in the aforementioned example the file would result in 2 work items:
Literals 0..1999
Literals 2000..3400
And it is the responsibility of the produceWorkItems block to generate these work items from files.
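For illustration only, here is a minimal reconstruction (not the repository code) of how such work items could be produced. The CSFile type, the (CSFile, Pos, Length) tuple shape and workSize come from the pipeline above; the lineCount and literalCount parameters stand in for whatever the real CSFile and literals array expose:
public static IEnumerable<(CSFile CSFile, int Pos, int Length)> YieldWorkItemsSketch(
    CSFile csFile, int lineCount, int literalCount, int workSize)
{
    // How many literals one work item may cover so that lineCount * Length stays within workSize.
    int literalsPerItem = Math.Max(1, workSize / Math.Max(1, lineCount));
    for (int pos = 0; pos < literalCount; pos += literalsPerItem)
    {
        yield return (csFile, pos, Math.Min(literalsPerItem, literalCount - pos));
    }
}
For the 500-line file and 3401 literals mentioned above, a workSize of 1,000,000 gives 2000 literals per item, i.e. exactly the two work items Literals 0..1999 and 2000..3400.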
Example runs:
C:\work\TPLDataFlow [master ≡]> .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\xyz\tip -o c:\temp -l C:\temp\2.txt
Locating all the instances of the 4 literals found in the file C:\temp\2.txt in the C# code ...
Total = 49,844,516, Current = 49,702,532, Percents = 99.72%
Elapsed: 00:00:18.4320676
C:\work\TPLDataFlow [master ≡]> .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\xyz\tip -o c:\temp -l C:\temp\1.txt
Locating all the instances of the 3401 literals found in the file c:\temp\1.txt in the C# code ...
Total = 42,379,095,775, Current = 42,164,259,870, Percents = 99.49%
Elapsed: 01:44:13.4289270
Question
Many work items are undersized. If I have 3 C# files, 20 lines each, my current code would produce 3 work items, because in my current implementation work items never cross a file boundary. This is inefficient. Ideally, they would be batched into a single work item, because 60 * 3401 = 204060 < 1000000. But the BatchBlock cannot be used here, because it expects me to provide the batch size, which I do not know - it depends on the work items in the pipeline.
How would you achieve such batching?
I have realized something. Maybe it is obvious, but I have just figured it out. The TPL DataFlow library is of no use if one can buffer all the items first. So in my case I can do that: I can buffer and sort the items from large to small, and then a simple Parallel.ForEach will do the work just fine. Having realized that, I changed my implementation to use Rx (Reactive Extensions) like this:
Phase 1 - get all the items, this is where all the IO is
var input = (workspaceRoot + "build\\projects.yml")
.YieldAllProjects()
.ToObservable()
.Select(project => Observable.FromAsync(() => Task.Run(() => YieldFiles(project, searchAllFiles))))
.Merge(2)
.SelectMany(files => files)
.Select(file => Observable.FromAsync(file.PopulateContentAsync))
.Merge(10)
.ToList()
.GetAwaiter().GetResult()
.AsList();
input.Sort((x, y) => y.EstimatedLineCount - x.EstimatedLineCount);
Phase 2 - find all the matching lines (CPU only)
var res = new SortedDictionary<string, List<MatchingLine>>();
input
.ToObservable()
.Select(file => Observable.FromAsync(() => Task.Run(() => file.YieldMatchingLines(literals, 0, literals.Count, progress).ToList())))
.Merge(maxDOP.Value)
.ToList()
.GetAwaiter().GetResult()
.SelectMany(m => m)
.ForEach(m => AddResult(res, m));
So, even though I have hundreds of projects, thousands of files and millions of lines of code, it is not the scale for TPL DataFlow, because my tool can read all the files into memory, rearrange them in a favorable order and then process them.
Regarding the first question (configuring the pipeline), I can't really offer any guidance. Optimizing the parameters of a dataflow pipeline seems like a black art to me!
Regarding the second question (how to batch a workload consisting of work items having unknown size at compile time), you could use the custom BatchBlock<T> below. It uses the DataflowBlock.Encapsulate method in order to combine two dataflow blocks into one. The first block is an ActionBlock<T> that receives the input and puts it into a buffer, and the second is a BufferBlock<T[]> that holds the batched items and propagates them downstream. The weightSelector is a lambda that returns the weight of each received item. When the accumulated weight surpasses the batchWeight threshold, a batch is emitted.
public static IPropagatorBlock<T, T[]> CreateDynamicBatchBlock<T>(
int batchWeight, Func<T, int> weightSelector,
DataflowBlockOptions options = null)
{
// Arguments validation omitted
options ??= new DataflowBlockOptions();
var outputBlock = new BufferBlock<T[]>(options);
List<T> buffer = new List<T>();
int sumWeight = 0;
var inputBlock = new ActionBlock<T>(async item =>
{
checked
{
int weight = weightSelector(item);
if (weight + sumWeight > batchWeight && buffer.Count > 0)
await SendBatchAsync();
buffer.Add(item);
sumWeight += weight;
if (sumWeight >= batchWeight) await SendBatchAsync();
}
}, new()
{
BoundedCapacity = options.BoundedCapacity,
CancellationToken = options.CancellationToken,
TaskScheduler = options.TaskScheduler,
MaxMessagesPerTask = options.MaxMessagesPerTask,
NameFormat = options.NameFormat
});
PropagateCompletion(inputBlock, outputBlock, async () =>
{
if (buffer.Count > 0) await SendBatchAsync();
});
Task SendBatchAsync()
{
var batch = buffer.ToArray();
buffer.Clear();
sumWeight = 0;
return outputBlock.SendAsync(batch);
}
static async void PropagateCompletion(IDataflowBlock source,
IDataflowBlock target, Func<Task> postCompletionAction)
{
try { await source.Completion.ConfigureAwait(false); } catch { }
Exception ex =
source.Completion.IsFaulted ? source.Completion.Exception : null;
try { await postCompletionAction(); }
catch (Exception actionError) { ex = actionError; }
if (ex != null) target.Fault(ex); else target.Complete();
}
return DataflowBlock.Encapsulate(inputBlock, outputBlock);
}
Usage example:
var batchBlock = CreateDynamicBatchBlock<WorkItem>(1_000_000, wi => wi.Size);
If the int weight type does not have enough range and overflows, you could switch to long or double.
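Since the encapsulated block implements IPropagatorBlock<T, T[]>, it links into a dataflow pipeline like any other block. A hedged sketch, where upstream and downstream are hypothetical blocks producing WorkItem and consuming WorkItem[] respectively:
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
upstream.LinkTo(batchBlock, linkOptions);
batchBlock.LinkTo(downstream, linkOptions);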
I am trying to get the minimum, maximum and sum (for the average) of a large array. I would love to substitute my regular for loop with Parallel.For.
UInt16 tempMin = (UInt16)(Math.Pow(2,mfvm.cameras[openCamIndex].bitDepth) - 1);
UInt16 tempMax = 0;
UInt64 tempSum = 0;
for (int i = 0; i < acquisition.frameDataShorts.Length; i++)
{
if (acquisition.frameDataShorts[i] < tempMin)
tempMin = acquisition.frameDataShorts[i];
if (acquisition.frameDataShorts[i] > tempMax)
tempMax = acquisition.frameDataShorts[i];
tempSum += acquisition.frameDataShorts[i];
}
I know how to solve this using Tasks by cutting the array myself. However, I would love to learn how to use Parallel.For for this, since as I understand it, it should be able to do this very elegantly.
I found this tutorial from MSDN for calculating the sum; however, I have no idea how to extend it to do all three things (min, max and sum) in a single pass.
Results:
Ok I tried PLINQ solution and I have seen some serious improvements.
3 passes (Min, Max, Sum) are, on my i7 (2x4 cores), 4x faster than the sequential approach. However, I tried the same code on a Xeon (2x8 cores) and the results are completely different: parallel (again 3 passes) is actually twice as slow as the sequential approach (which is about 5x faster than on my i7).
In the end I partitioned the array myself with the Task factory, and I get slightly better results on all computers.
I assume that the main issue here is that three different variables have to be remembered each iteration. You can utilize a Tuple for this purpose:
var lockObject = new object();
var arr = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
var min = arr[0];
var max = arr[0];
Parallel.For(0, arr.Length,
() => new Tuple<long, int, int>(0, arr[0], arr[0]),
(i, loop, temp) => new Tuple<long, int, int>(temp.Item1 + arr[i], Math.Min(temp.Item2, arr[i]),
Math.Max(temp.Item3, arr[i])),
x =>
{
lock (lockObject)
{
total += x.Item1;
min = Math.Min(min, x.Item2);
max = Math.Max(max, x.Item3);
}
}
);
I must warn you, though, that this implementation runs about 10x slower (on my machine) than the simple for loop approach you demonstrated in your question, so proceed with caution.
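Much of that cost is likely the Tuple allocated on every iteration (plus the delegate calls); a value-tuple accumulator keeps the same structure without the per-element heap allocation. A sketch under that assumption, reusing arr, total, min, max and lockObject from above:
Parallel.For<(long Sum, int Min, int Max)>(0, arr.Length,
    // per-partition seed
    () => (0L, arr[0], arr[0]),
    // fold one element into the partition-local state (value tuples are structs, so no per-element allocation)
    (i, loop, acc) => (acc.Sum + arr[i], Math.Min(acc.Min, arr[i]), Math.Max(acc.Max, arr[i])),
    // merge each partition's result into the shared totals exactly once
    acc =>
    {
        lock (lockObject)
        {
            total += acc.Sum;
            min = Math.Min(min, acc.Min);
            max = Math.Max(max, acc.Max);
        }
    });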
I don't think Parallel.For is a good fit here, but try this out:
public class MyArrayHandler {
public async Task GetMinMaxSum() {
var myArray = Enumerable.Range(0, 1000);
var maxTask = Task.Run(() => myArray.Max());
var minTask = Task.Run(() => myArray.Min());
var sumTask = Task.Run(() => myArray.Sum());
var results = await Task.WhenAll(maxTask,
minTask,
sumTask);
var max = results[0];
var min = results[1];
var sum = results[2];
}
}
Edit
Just for fun, due to the comments regarding performance, I took a couple of measurements. Also, I found this: Fastest way to find sum.
10,000,000 values:
GetMinMaxSum: 218 ms
GetMinMaxSumAsync: 308 ms
public class MinMaxSumTests {
[Test]
public async Task GetMinMaxSumAsync() {
var myArray = Enumerable.Range(0, 10000000).Select(x => (long)x).ToArray();
var sw = new Stopwatch();
sw.Start();
var maxTask = Task.Run(() => myArray.Max());
var minTask = Task.Run(() => myArray.Min());
var sumTask = Task.Run(() => myArray.Sum());
var results = await Task.WhenAll(maxTask,
minTask,
sumTask);
var max = results[0];
var min = results[1];
var sum = results[2];
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
[Test]
public void GetMinMaxSum() {
var myArray = Enumerable.Range(0, 10000000).Select(x => (long)x).ToArray();
var sw = new Stopwatch();
sw.Start();
long tempMin = long.MaxValue; // start from the extremes so the comparisons hold for arbitrary data
long tempMax = long.MinValue;
long tempSum = 0;
for (int i = 0; i < myArray.Length; i++) {
if (myArray[i] < tempMin)
tempMin = myArray[i];
if (myArray[i] > tempMax)
tempMax = myArray[i];
tempSum += myArray[i];
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
}
Do not reinvent the wheel: Min, Max, Sum and similar operations are aggregations. Since .NET 3.5 you have had handy LINQ extension methods which already provide the solution:
using System.Linq;
var sequence = Enumerable.Range(0, 10).Select(s => (uint)s).ToList();
Console.WriteLine(sequence.Sum(s => (double)s));
Console.WriteLine(sequence.Max());
Console.WriteLine(sequence.Min());
Though they are declared as extensions for IEnumerable, they have some internal optimizations for IList and array types, so you should measure how your code works on those types compared to plain IEnumerables.
In your case this isn't enough, as you clearly do not want to iterate over the array more than once, so the magic comes from PLINQ (a.k.a. Parallel LINQ). You need to add only one method call to aggregate your array in parallel:
var sequence = Enumerable.Range(0, 10000000).Select(s => (uint)s).AsParallel();
Console.WriteLine(sequence.Sum(s => (double)s));
Console.WriteLine(sequence.Max());
Console.WriteLine(sequence.Min());
This option adds some overhead for synchronizing the items, but it scales well, providing similar behavior for both small and big enumerations. From MSDN:
PLINQ is usually the recommended approach whenever you need to apply the parallel aggregation pattern to .NET applications. Its declarative nature makes it less prone to error than other approaches, and its performance on multicore computers is competitive with them.
Implementing parallel aggregation with PLINQ doesn't require adding locks in your code. Instead, all the synchronization occurs internally, within PLINQ.
However, if you still want to investigate the performance of the different kinds of operations, you can use the Parallel.For and Parallel.ForEach method overloads with an aggregation approach, something like this:
double[] sequence = ...
object lockObject = new object();
double sum = 0.0d;
Parallel.ForEach(
// The values to be aggregated
sequence,
// The local initial partial result
() => 0.0d,
// The loop body
(x, loopState, partialResult) =>
{
return Normalize(x) + partialResult;
},
// The final step of each local context
(localPartialSum) =>
{
// Enforce serial access to single, shared result
lock (lockObject)
{
sum += localPartialSum;
}
}
);
return sum;
If you need additional partitioning of your data, you can use a Partitioner with these methods:
var rangePartitioner = Partitioner.Create(0, sequence.Length);
Parallel.ForEach(
// The input intervals
rangePartitioner,
    // ... same aggregation code as above
);
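Spelled out, a hedged sketch of the range-partitioned version: each partition is an (inclusive, exclusive) index range, so the loop body iterates over the slice itself (sequence, Normalize, lockObject and sum as in the snippet above):
var rangePartitioner = Partitioner.Create(0, sequence.Length);
Parallel.ForEach(
    rangePartitioner,
    // per-partition partial sum
    () => 0.0d,
    (range, loopState, partialResult) =>
    {
        // Process the whole index range locally, without any locking.
        for (int i = range.Item1; i < range.Item2; i++)
            partialResult += Normalize(sequence[i]);
        return partialResult;
    },
    localPartialSum =>
    {
        // Merge each partition's result into the shared total.
        lock (lockObject) { sum += localPartialSum; }
    });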
Also, the Aggregate method can be used with PLINQ, with some custom merge logic (MSDN has an illustration of this pattern).
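A minimal sketch of that pattern, computing min, max and sum in a single parallel pass with per-partition accumulators and a final merge (the tuple-shaped accumulator is my own illustration, not from MSDN):
var data = Enumerable.Range(0, 10_000_000).Select(x => (long)x).ToArray();
var stats = data.AsParallel().Aggregate(
    // seed factory: one local accumulator per partition
    () => (Min: long.MaxValue, Max: long.MinValue, Sum: 0L),
    // fold each element into the partition-local accumulator
    (acc, x) => (Min: Math.Min(acc.Min, x), Max: Math.Max(acc.Max, x), Sum: acc.Sum + x),
    // merge two partition accumulators
    (a, b) => (Min: Math.Min(a.Min, b.Min), Max: Math.Max(a.Max, b.Max), Sum: a.Sum + b.Sum),
    // final projection of the merged accumulator
    acc => acc);
Console.WriteLine($"Min = {stats.Min}, Max = {stats.Max}, Sum = {stats.Sum}");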
Useful links:
Parallel Aggregation
Enumerable.Min<TSource>(IEnumerable<TSource>) method
Enumerable.Sum method
Enumerable.Max<TSource> (IEnumerable<TSource>) method
I'm fairly new to programming and I've been playing around with writing some random functions.
I wrote the below function which works loosely based on eratosthenes sieve. Initially though, I was having a problem with the updatedEntries IEnumerable.
The updatedEntries was sort of populating (something to do with deferred execution, I gather - in debug mode the 'current' was null, but the results view contained the relevant items). However, when RemoveWhere was applied to oddPrimesAndMultiples, the items in updatedEntries disappeared, even though I don't see why they should still be linked to the items in oddPrimesAndMultiples. (I could just be completely misunderstanding what's going on, of course, and the problem might be something else entirely!)
The problem doesn't arise if I change updatedEntries to a List rather than an IEnumerable and I've actually now rewritten that statement without using LINQ to (potentially?) make better use of the fact I'm using a SortedSet anyway...but I would still like to know why the issue arose in the first place!
Here is my code:
public static IEnumerable<int> QuickPrimes()
{
int firstPrime = 2;
int firstOddPrime = 3;
int currentValue = firstOddPrime;
int currentMinimumMultiple;
SortedSet<Tuple<int, int>> oddPrimesAndMultiples = new SortedSet<Tuple<int, int>>() { new Tuple<int, int> (firstOddPrime, firstOddPrime) };
IEnumerable<Tuple<int, int>> updatedEntries;
yield return firstPrime;
yield return firstOddPrime;
while (true)
{
currentMinimumMultiple = oddPrimesAndMultiples.First().Item1;
while (currentValue < currentMinimumMultiple)
{
yield return currentValue;
oddPrimesAndMultiples.Add(new Tuple<int, int> (currentValue * 3, currentValue));
currentValue += 2;
}
updatedEntries = oddPrimesAndMultiples.Where(tuple => tuple.Item1 == currentMinimumMultiple)
.Select(t => new Tuple<int, int>(t.Item1 + 2 * t.Item2, t.Item2));
oddPrimesAndMultiples.RemoveWhere(t => t.Item1 == currentMinimumMultiple);
oddPrimesAndMultiples.UnionWith(updatedEntries);
currentValue += 2;
}
}
and here is the Main method where I'm testing the function:
static void Main(string[] args)
{
foreach(int prime in Problems.QuickPrimes())
{
Console.WriteLine(prime);
if (prime > 20) return;
}
}
Many thanks in advance!
The trap is that updatedEntries is defined in one line, but actually executed later.
To bring it back to basics, see this code snippet (from LINQPad):
var ints = new SortedSet<int>( new[] { 1,2,3,4,5,6,7,8,9,10});
var updatedEntries = ints.Where(i => i > 5); // No ToList()!
updatedEntries.Dump();
This shows 6, 7, 8, 9, 10.
ints.RemoveWhere(i => i > 7);
updatedEntries.Dump();
Now this shows 6, 7, because updatedEntries is re-executed.
ints.UnionWith(updatedEntries);
This adds 6, 7, while you expected it to add the first listing 6, 7, 8, 9, 10.
So when defining an IEnumerable you should always be aware of when it's actually executed. It always acts on the state of the program at that particular point.
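Applied to the question's code, that is also why materializing the projection fixes it: calling ToList() at the point of definition captures the updated tuples before RemoveWhere mutates the set:
updatedEntries = oddPrimesAndMultiples
    .Where(tuple => tuple.Item1 == currentMinimumMultiple)
    .Select(t => new Tuple<int, int>(t.Item1 + 2 * t.Item2, t.Item2))
    .ToList();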
First off, I want to apologize if my code is bad or if my description is poor. This is one of my first times working with C# threading/tasks. What I'm trying to do in my code is to go through a list of names and, for each 50 names in the list, start a new task and pass those 50 names off to another method that will perform calculation-heavy work on the data. My code only works for the first 50 names in the list; it returns 0 results for every batch after that, and I can't seem to figure out why.
public static async void startInitialDownload(string value)
{
IEnumerable<string> names = await Helper.getNames(value, 0);
decimal multiple = names.Count() / 50;
string[] results;
int num1 = 0;
int num2 = 0;
for (int i = 0; i < multiple + 1; i++)
{
num1 = i * 50;
num2 = (50 * (i + 1));
results = names.TakeWhile((name, index) => index >= num1 && index < num2).ToArray();
Task current = Task.Factory.StartNew(() => getCurrentData(results));
await current.ConfigureAwait(false);
}
}
Realise the enumerable into a list, so that it will be calculated once, not each iteration in the loop. You can use the Skip and Take methods to get a range of the list:
public static async void startInitialDownload(string value) {
IEnumerable<string> names = await Helper.getNames(value, 0);
List<string> nameList = names.ToList();
for (int i = 0; i < nameList.Count; i += 50) {
string[] results = nameList.Skip(i).Take(50).ToArray();
Task current = Task.Factory.StartNew(() => getCurrentData(results));
await current.ConfigureAwait(false);
}
}
Or you can add items to a list, and execute it when it has the right size:
public static async void startInitialDownload(string value) {
IEnumerable<string> names = await Helper.getNames(value, 0);
List<string> buffer = new List<string>();
foreach (string s in names) {
buffer.Add(s);
if (buffer.Count == 50) {
Task current = Task.Factory.StartNew(() => getCurrentData(buffer.ToArray()));
await current.ConfigureAwait(false);
buffer = new List<string>();
}
}
if (buffer.Count > 0) {
Task current = Task.Factory.StartNew(() => getCurrentData(buffer.ToArray()));
await current.ConfigureAwait(false);
}
}
The name TakeWhile suggests that it only takes entries while the condition is true. So if it starts off by reading an entry for which the condition is false, it never takes anything.
So the first loop, you're starting with num1 = 0. So it reads entries from num1 to num2.
The second loop, you're starting with num1 being 50. So it starts reading again ... and at the very first entry (index 0) the condition is false, so it stops and takes nothing.
You might try using Where, or using Skip beforehand.
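Concretely, either of these would select the intended slice in place of the TakeWhile call (num1, num2, names and results are the variables from the question):
results = names.Where((name, index) => index >= num1 && index < num2).ToArray();
// or, equivalently:
results = names.Skip(num1).Take(num2 - num1).ToArray();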
The tl;dr; of it: I don't think your problem has anything to do with parallel tasks, I think it's due to using the wrong LINQ method to pull the names you want to use.
As I understand it from Stephen Cleary's response to a similar (though not identical) question, you don't need to use ConfigureAwait() there.
Here's the link in question: on stack overflow
And here's what I would do instead with the last two lines of your for loop:
Task.Factory.StartNew(() => getCurrentData(results));
That's it. By using the factory, and by not awaiting, you are letting that task run on its own (possibly on a new thread). Provided that your storage is all thread-safe (see System.Collections.Concurrent, by the way), you should be all set.
Caveat: if you aren't showing us what lies after the await then your results may vary.
It's not a direct solution, but it might work.
public static IEnumerable<T[]> MakeBuckets<T>(IEnumerable<T> source, int maxSize)
{
List<T> currentBucket = new List<T>(maxSize);
foreach (var s in source)
{
currentBucket.Add(s);
if (currentBucket.Count >= maxSize)
{
yield return currentBucket.ToArray();
currentBucket = new List<T>(maxSize);
}
}
if(currentBucket.Any())
yield return currentBucket.ToArray();
}
Later you can iterate through the result of the MakeBuckets function.
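A hedged usage sketch for the question's scenario (names and getCurrentData come from the question):
foreach (string[] bucket in MakeBuckets(names, 50))
{
    await Task.Run(() => getCurrentData(bucket));
}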