Why does Observable.Generate() throw System.StackOverflowException? - c#

I´m writing a C# (.NET 4.5) application that is used to aggregate time based events for reporting purposes. To make my query logic reusable for both realtime and historical data I make use of the Reactive Extensions (2.0) and their IScheduler infrastructure (HistoricalScheduler and friends).
For example, assume we create a list of events (sorted chronologically, but they may coincide!) whose only payload ist their timestamp and want to know their distribution across buffers of a fixed duration:
const int num = 100000;
const int dist = 10;
var events = new List<DateTimeOffset>();
var curr = DateTimeOffset.Now;
var gap = new Random();
var time = new HistoricalScheduler(curr);
for (int i = 0; i < num; i++)
{
events.Add(curr);
curr += TimeSpan.FromMilliseconds(gap.Next(dist));
}
var stream = Observable.Generate<int, DateTimeOffset>(
0,
s => s < events.Count,
s => s + 1,
s => events[s],
s => events[s],
time);
stream.Buffer(TimeSpan.FromMilliseconds(num), time)
.Subscribe(l => Console.WriteLine(time.Now + ": " + l.Count));
time.AdvanceBy(TimeSpan.FromMilliseconds(num * dist));
Running this code results in a System.StackOverflowException with the following stack trace (it´s the last 3 lines all the way down):
mscorlib.dll!System.Threading.Interlocked.Exchange<System.IDisposable>(ref System.IDisposable location1, System.IDisposable value) + 0x3d bytes
System.Reactive.Core.dll!System.Reactive.Disposables.SingleAssignmentDisposable.Dispose() + 0x37 bytes
System.Reactive.Core.dll!System.Reactive.Concurrency.ScheduledItem<System.DateTimeOffset>.Cancel() + 0x23 bytes
...
System.Reactive.Core.dll!System.Reactive.Disposables.AnonymousDisposable.Dispose() + 0x4d bytes
System.Reactive.Core.dll!System.Reactive.Disposables.SingleAssignmentDisposable.Dispose() + 0x4f bytes
System.Reactive.Core.dll!System.Reactive.Concurrency.ScheduledItem<System.DateTimeOffset>.Cancel() + 0x23 bytes
...
Ok, the problem seems to come from my use of Observable.Generate(), depending on the list size (num) and regardless of the choice of scheduler.
What am I doing wrong? Or more generally, what would be the preferred way to create an IObservable from an IEnumerable of events that provide their own timestamps?

(update - realized I didn't provide an alternative: see at bottom of answer)
The problem is in how Observable.Generate works - it's used to unfold a corecursive (think recursion turned inside out) generator based on the arguments; if those arguments end up generating a very nested corecursive generator, you'll blow your stack.
From this point on, I'm speculating a lot (don't have the Rx source in front of me) (see below), but I'm willing to bet your definition ends up expanding into something like:
initial_state =>
generate_next(initial_state) =>
generate_next(generate_next(initial_state)) =>
generate_next(generate_next(generate_next(initial_state))) =>
generate_next(generate_next(generate_next(generate_next(initial_state)))) => ...
And on and on until your call stack gets big enough to overflow. At, say, a method signature + your int counter, that'd be something like 8-16 bytes per recursive call (more depending on how the state machine generator is implemented), so 60,000 sounds about right (1M / 16 ~ 62500 maximum depth)
EDIT: Pulled up the source - confirmed: the "Run" method of Generate looks like this - take note of the nested calls to Generate:
protected override IDisposable Run(
IObserver<TResult> observer,
IDisposable cancel,
Action<IDisposable> setSink)
{
if (this._timeSelectorA != null)
{
Generate<TState, TResult>.α α =
new Generate<TState, TResult>.α(
(Generate<TState, TResult>) this,
observer,
cancel);
setSink(α);
return α.Run();
}
if (this._timeSelectorR != null)
{
Generate<TState, TResult>.δ δ =
new Generate<TState, TResult>.δ(
(Generate<TState, TResult>) this,
observer,
cancel);
setSink(δ);
return δ.Run();
}
Generate<TState, TResult>._ _ =
new Generate<TState, TResult>._(
(Generate<TState, TResult>) this,
observer,
cancel);
setSink(_);
return _.Run();
}
EDIT: Derp, didn't offer any alternatives...here's one that might work:
(EDIT: fixed Enumerable.Range, so stream size won´t be multiplied by chunkSize)
const int num = 160000;
const int dist = 10;
var events = new List<DateTimeOffset>();
var curr = DateTimeOffset.Now;
var gap = new Random();
var time = new HistoricalScheduler(curr);
for (int i = 0; i < num; i++)
{
events.Add(curr);
curr += TimeSpan.FromMilliseconds(gap.Next(dist));
}
// Size too big? Fine, we'll chunk it up!
const int chunkSize = 10000;
var numberOfChunks = events.Count / chunkSize;
// Generate a whole mess of streams based on start/end indices
var streams =
from chunkIndex in Enumerable.Range(0, (int)Math.Ceiling((double)events.Count / chunkSize) - 1)
let startIdx = chunkIndex * chunkSize
let endIdx = Math.Min(events.Count, startIdx + chunkSize)
select Observable.Generate<int, DateTimeOffset>(
startIdx,
s => s < endIdx,
s => s + 1,
s => events[s],
s => events[s],
time);
// E pluribus streamum
var stream = Observable.Concat(streams);
stream.Buffer(TimeSpan.FromMilliseconds(num), time)
.Subscribe(l => Console.WriteLine(time.Now + ": " + l.Count));
time.AdvanceBy(TimeSpan.FromMilliseconds(num * dist));

OK, I´ve taken a different factory method that doesn´t require lamdba expressions as state transitions and now I don´t see any stack overflows anymore. I´m not yet sure if this would qualify as a correct answer to my question, but it works and I thought I´d share it here:
var stream = Observable.Create<DateTimeOffset>(o =>
{
foreach (var e in events)
{
time.Schedule(e, () => o.OnNext(e));
}
time.Schedule(events[events.Count - 1], () => o.OnCompleted());
return Disposable.Empty;
});
Manually scheduling the events before (!) returning the subscription seems awkward to me, but in this case it can be done inside the lambda expression.
If there is anything wrong about this approach, please correct me. Also, I´d still be happy to hear what implicit assumptions by System.Reactive I have violated with my original code.
(Oh my, I should have checked that earlier: with RX v1.0, the original Observable.Generate() does in fact seem to work!)

Related

How to search vast code base for multiple literal strings efficiently?

This question is a follow up on How to optimize performance in a simple TPL DataFlow pipeline?
The source code is here - https://github.com/MarkKharitonov/LearningTPLDataFlow
Given:
Several solutions covering about 400 C# projects encompassing thousands of C# source files totaling in more than 10,000,000 lines of code.
A file containing string literals, one per line.
I want to produce a JSON file listing all the occurrences of the literals in the source code. For every matching line I want to have the following pieces of information:
The project path
The C# file path
The matching line itself
The matching line number
And all the records arranged as a dictionary keyed by the respective literal.
So the challenge is to do it as efficiently as possible (in C#, of course).
The DataFlow pipeline can be found in this file - https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs
Here it is:
private void Run(string workspaceRoot, string outFilePath, string[] literals, bool searchAllFiles, int workSize, int maxDOP1, int maxDOP2, int maxDOP3, int maxDOP4)
{
var res = new SortedDictionary<string, List<MatchingLine>>();
var projects = (workspaceRoot + "build\\projects.yml").YieldAllProjects();
var progress = new Progress();
var taskSchedulerPair = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, Environment.ProcessorCount);
var produceCSFiles = new TransformManyBlock<ProjectEx, CSFile>(p => YieldCSFiles(p, searchAllFiles), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP1
});
var produceCSFileContent = new TransformBlock<CSFile, CSFile>(CSFile.PopulateContentAsync, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP2
});
var produceWorkItems = new TransformManyBlock<CSFile, (CSFile CSFile, int Pos, int Length)>(csFile => csFile.YieldWorkItems(literals, workSize, progress), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP3,
TaskScheduler = taskSchedulerPair.ConcurrentScheduler
});
var produceMatchingLines = new TransformManyBlock<(CSFile CSFile, int Pos, int Length), MatchingLine>(o => o.CSFile.YieldMatchingLines(literals, o.Pos, o.Length, progress), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP4,
TaskScheduler = taskSchedulerPair.ConcurrentScheduler
});
var getMatchingLines = new ActionBlock<MatchingLine>(o => AddResult(res, o));
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
produceCSFiles.LinkTo(produceCSFileContent, linkOptions);
produceCSFileContent.LinkTo(produceWorkItems, linkOptions);
produceWorkItems.LinkTo(produceMatchingLines, linkOptions);
produceMatchingLines.LinkTo(getMatchingLines, linkOptions);
var progressTask = Task.Factory.StartNew(() =>
{
var delay = literals.Length < 10 ? 1000 : 10000;
for (; ; )
{
var current = Interlocked.Read(ref progress.Current);
var total = Interlocked.Read(ref progress.Total);
Console.Write("Total = {0:n0}, Current = {1:n0}, Percents = {2:P} \r", total, current, ((double)current) / total);
if (progress.Done)
{
break;
}
Thread.Sleep(delay);
}
Console.WriteLine();
}, TaskCreationOptions.LongRunning);
projects.ForEach(p => produceCSFiles.Post(p));
produceCSFiles.Complete();
getMatchingLines.Completion.GetAwaiter().GetResult();
progress.Done = true;
progressTask.GetAwaiter().GetResult();
res.SaveAsJson(outFilePath);
}
The default parameters are (https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs#L24-L28):
private int m_maxDOP1 = 3;
private int m_maxDOP2 = 20;
private int m_maxDOP3 = Environment.ProcessorCount;
private int m_maxDOP4 = Environment.ProcessorCount;
private int m_workSize = 1_000_000;
My idea is to divide the work into work items, where a work item size is computed by multiplying the number of lines in the respective file by the count of the string literals. So, if a C# file contains 500 lines, then searching it for all the 3401 literals results in a work of size 3401 * 500 = 1700500
The unit of work is by default 1000000 lines, so in the aforementioned example the file would result in 2 work items:
Literals 0..1999
Literals 2000..3400
And it is the responsibility of the produceWorkItems block to generate these work items from files.
Example runs:
C:\work\TPLDataFlow [master ≡]> .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\xyz\tip -o c:\temp -l C:\temp\2.txt
Locating all the instances of the 4 literals found in the file C:\temp\2.txt in the C# code ...
Total = 49,844,516, Current = 49,702,532, Percents = 99.72%
Elapsed: 00:00:18.4320676
C:\work\TPLDataFlow [master ≡]> .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\xyz\tip -o c:\temp -l C:\temp\1.txt
Locating all the instances of the 3401 literals found in the file c:\temp\1.txt in the C# code ...
Total = 42,379,095,775, Current = 42,164,259,870, Percents = 99.49%
Elapsed: 01:44:13.4289270
Question
Many work items are undersized. If I have 3 C# files, 20 lines each, my current code would produce 3 work items, because in my current implementation work items never cross a file boundary. This is inefficient. Ideally, they would be batched into a single work item, because 60 * 3401 = 204060 < 1000000. But the BatchBlock cannot be used here, because it expects me to provide the batch size, which I do not know - it depends on the work items in the pipeline.
How would you achieve such batching ?
I have realized something. Maybe it is obvious, but I have just figured it out. The TPL DataFlow library is of no use if one can buffer all the items first. So in my case - I can do that. And so, I can buffer and sort the items from large to small. This way a simple Parallel.ForEach will do the work just fine. Having realized that I changed my implementation to use Reactive like this:
Phase 1 - get all the items, this is where all the IO is
var input = (workspaceRoot + "build\\projects.yml")
.YieldAllProjects()
.ToObservable()
.Select(project => Observable.FromAsync(() => Task.Run(() => YieldFiles(project, searchAllFiles))))
.Merge(2)
.SelectMany(files => files)
.Select(file => Observable.FromAsync(file.PopulateContentAsync))
.Merge(10)
.ToList()
.GetAwaiter().GetResult()
.AsList();
input.Sort((x, y) => y.EstimatedLineCount - x.EstimatedLineCount);
Phase 2 - find all the matching lines (CPU only)
var res = new SortedDictionary<string, List<MatchingLine>>();
input
.ToObservable()
.Select(file => Observable.FromAsync(() => Task.Run(() => file.YieldMatchingLines(literals, 0, literals.Count, progress).ToList())))
.Merge(maxDOP.Value)
.ToList()
.GetAwaiter().GetResult()
.SelectMany(m => m)
.ForEach(m => AddResult(res, m));
So, even though I have hundreds of projects, thousands of files and millions lines of code - it is not the scale for TPL DataFlow, because my tool can read all the files into memory, rearrange in a favorable order and then process.
Regarding the first question (configuring the pipeline), I can't really offer any guidance. Optimizing the parameters of a dataflow pipeline seems like a black art to me!
Regarding the second question (how to batch a work load consisting of work items having unknown size at compile time), you could use the custom BatchBlock<T> below. It uses the DataflowBlock.Encapsulate method in order to combine two dataflow blocks to one. The first block in an ActionBlock<T> that receives the input and puts it into a buffer, and the second is a BufferBlock<T[]> that holds the batched items and propagates them downstream. The weightSelector is a lambda that returns the weight of each received item. When the accumulated weight surpasses the batchWeight threshold, a batch is emitted.
public static IPropagatorBlock<T, T[]> CreateDynamicBatchBlock<T>(
int batchWeight, Func<T, int> weightSelector,
DataflowBlockOptions options = null)
{
// Arguments validation omitted
options ??= new DataflowBlockOptions();
var outputBlock = new BufferBlock<T[]>(options);
List<T> buffer = new List<T>();
int sumWeight = 0;
var inputBlock = new ActionBlock<T>(async item =>
{
checked
{
int weight = weightSelector(item);
if (weight + sumWeight > batchWeight && buffer.Count > 0)
await SendBatchAsync();
buffer.Add(item);
sumWeight += weight;
if (sumWeight >= batchWeight) await SendBatchAsync();
}
}, new()
{
BoundedCapacity = options.BoundedCapacity,
CancellationToken = options.CancellationToken,
TaskScheduler = options.TaskScheduler,
MaxMessagesPerTask = options.MaxMessagesPerTask,
NameFormat = options.NameFormat
});
PropagateCompletion(inputBlock, outputBlock, async () =>
{
if (buffer.Count > 0) await SendBatchAsync();
});
Task SendBatchAsync()
{
var batch = buffer.ToArray();
buffer.Clear();
sumWeight = 0;
return outputBlock.SendAsync(batch);
}
static async void PropagateCompletion(IDataflowBlock source,
IDataflowBlock target, Func<Task> postCompletionAction)
{
try { await source.Completion.ConfigureAwait(false); } catch { }
Exception ex =
source.Completion.IsFaulted ? source.Completion.Exception : null;
try { await postCompletionAction(); }
catch (Exception actionError) { ex = actionError; }
if (ex != null) target.Fault(ex); else target.Complete();
}
return DataflowBlock.Encapsulate(inputBlock, outputBlock);
}
Usage example:
var batchBlock = CreateDynamicBatchBlock<WorkItem>(1_000_000, wi => wi.Size);
If the weight int type has not enough range and overflows, you could switch to long or double.

How to assert uniqueness of a huge collection of strings?

Let's say I have an algorithm which takes an unsigned 64-bit integer as input, and yields a string as a result. The string's alphabet is limited to [a-z, A-Z, 0-9] and its' maximum length is 16. So that's or 47,672,401,706,823,533,450,263,330,816 possible results.
I would like to assert the uniqueness of the algorithm's output. Read: I want to verify there are no collisions.
Is there an easy/quick 'n dirty way to do this, without having to fall back to (e.g.) some kind of database?
[EDIT]
Some clarification: the concerns uttered in the comments are legit, but no worries, I wasn't really planning on iterating over all possible combinations, my lifespan will probably be sub-1 century ;) Nor did I write my own algorithm to generate unique ID's. I just saw this and started wondering how one would go about asserting uniqueness for algorithms with very large result sets that can't be handled in-memory
[/EDIT]
As said in the comments, It would take a very long time to compute every possible entries, but just for fun, here is a try:
var workspace = new DirectoryInfo("MyWorkspace");
if (workspace.Exists)
{
workspace.Delete();
}
workspace.Create();
var limit = 23997907;
var buffer = new HashSet<string>();
ulong i = 0;
int j = 0;
var stopWatch = Stopwatch.StartNew();
while (i <= ulong.MaxValue)
{
var result = YourSuperAlgorythm(i);
// Check the result with current results
if (buffer.Contains(result))
{
throw new Exception("Failure !");
}
// Check the result with older results
foreach (var file in workspace.GetFiles())
{
var content = new HashSet<string>(File.ReadAllText(file.FullName).Split(';'));
if (content.Contains(result))
{
throw new Exception("Failure !");
}
}
buffer[j] = result;
i++;
j++;
if (j == arrayLimit)
{
stopWatch.Stop();
Console.WriteLine("Resetting. This loop takes " + stopWatch.Elapsed.TotalMilliseconds + "ms");
j = 0;
var file = Path.GetRandomFileName();
File.WriteAllText(Path.Combine(workspace.FullName, file), String.Join(";", buffer));
buffer = new HashSet<string>();
stopWatch.Restart();
}
}
You could probably optimize it but you won't have enought of a lifetime to check the results. For now, it did not even create a file to store the first set of entries :D. I will edit this post when one loop will be done!
Your only option is to prove mathematically your algorithm. Good luck with that...
EDIT1: for my test, I use this function:
private static string YourSuperAlgorythm(ulong i)
{
return i.ToString("x");
}
EDIT2: One loop takes 1477221.4261ms (~25min). And then the String.Join(";", buffer) line failed (OutOfMemory). So 23997907 is not the max value for my try. It must be decreased!

Getting Min, Max, Sum with a single parallel for loop

I am trying to get minimum, maximum and sum (for the average) from a large array. I would love to substitute my regular for loop with parallel.for
UInt16 tempMin = (UInt16)(Math.Pow(2,mfvm.cameras[openCamIndex].bitDepth) - 1);
UInt16 tempMax = 0;
UInt64 tempSum = 0;
for (int i = 0; i < acquisition.frameDataShorts.Length; i++)
{
if (acquisition.frameDataShorts[i] < tempMin)
tempMin = acquisition.frameDataShorts[i];
if (acquisition.frameDataShorts[i] > tempMax)
tempMax = acquisition.frameDataShorts[i];
tempSum += acquisition.frameDataShorts[i];
}
I know how to solve this using Tasks with cutting the array myself. However I would love to learn how to use parallel.for for this. Since as I understand it, it should be able to do this very elegantly.
I found this tutorial from MSDN for calculating the Sum, however I have no idea how to extend it to do all three things (min, max and sum) in a single passage.
Results:
Ok I tried PLINQ solution and I have seen some serious improvements.
3 passes (Min, Max, Sum) are on my i7 (2x4 Cores) 4x times faster then sequential aproach. However I tried the same code on Xeon (2x8 core) and results are completelly different. Parallel (again 3 passes) are actually twice as slow as sequential aproach (which is like 5x faster then on my i7).
In the end I have separated the array myself with Task Factory and I have slightly better results on all computers.
I assume that the main issue here is that three different variables are have to be remembered each iteration. You can utilize Tuple for this purpose:
var lockObject = new object();
var arr = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
var min = arr[0];
var max = arr[0];
Parallel.For(0, arr.Length,
() => new Tuple<long, int, int>(0, arr[0], arr[0]),
(i, loop, temp) => new Tuple<long, int, int>(temp.Item1 + arr[i], Math.Min(temp.Item2, arr[i]),
Math.Max(temp.Item3, arr[i])),
x =>
{
lock (lockObject)
{
total += x.Item1;
min = Math.Min(min, x.Item2);
max = Math.Max(max, x.Item3);
}
}
);
I must warn you, though, that this implementation runs about 10x slower (on my machine) than the simple for loop approach you demonstrated in your question, so proceed with caution.
I don't think parallel.for is good fit here but try this out:
public class MyArrayHandler {
public async Task GetMinMaxSum() {
var myArray = Enumerable.Range(0, 1000);
var maxTask = Task.Run(() => myArray.Max());
var minTask = Task.Run(() => myArray.Min());
var sumTask = Task.Run(() => myArray.Sum());
var results = await Task.WhenAll(maxTask,
minTask,
sumTask);
var max = results[0];
var min = results[1];
var sum = results[2];
}
}
Edit
Just for fun due to the comments regarding performance I took a couple measurements. Also, found this Fastest way to find sum.
#10,000,000 values
GetMinMax: 218ms
GetMinMaxAsync: 308ms
public class MinMaxSumTests {
[Test]
public async Task GetMinMaxSumAsync() {
var myArray = Enumerable.Range(0, 10000000).Select(x => (long)x).ToArray();
var sw = new Stopwatch();
sw.Start();
var maxTask = Task.Run(() => myArray.Max());
var minTask = Task.Run(() => myArray.Min());
var sumTask = Task.Run(() => myArray.Sum());
var results = await Task.WhenAll(maxTask,
minTask,
sumTask);
var max = results[0];
var min = results[1];
var sum = results[2];
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
[Test]
public void GetMinMaxSum() {
var myArray = Enumerable.Range(0, 10000000).Select(x => (long)x).ToArray();
var sw = new Stopwatch();
sw.Start();
long tempMin = 0;
long tempMax = 0;
long tempSum = 0;
for (int i = 0; i < myArray.Length; i++) {
if (myArray[i] < tempMin)
tempMin = myArray[i];
if (myArray[i] > tempMax)
tempMax = myArray[i];
tempSum += myArray[i];
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
}
Do not reinvent the wheel, Min, Max Sum and similar operations are aggregations. Since .NET v3.5 you have a handy versions of LINQ extension methods which are already providing you the solution:
using System.Linq;
var sequence = Enumerable.Range(0, 10).Select(s => (uint)s).ToList();
Console.WriteLine(sequence.Sum(s => (double)s));
Console.WriteLine(sequence.Max());
Console.WriteLine(sequence.Min());
Though they are declared as the extensions for IEnumerable, they have some internal improvements for IList and Array types, so you should measure how your code will work on that types and on IEnumerable's.
In your case this isn't enough, as you clearly do not want to iterate other one array more than one time, so the magic goes here: PLINQ (a.k.a. Parallel-LINQ). You need to add only one method to aggregate your array in parallel:
var sequence = Enumerable.Range(0, 10000000).Select(s => (uint)s).AsParallel();
Console.WriteLine(sequence.Sum(s => (double)s));
Console.WriteLine(sequence.Max());
Console.WriteLine(sequence.Min());
This option add some overhead for synchronization the items, but it do scale well, providing a similar time either for small and big enumerations. From MSDN:
PLINQ is usually the recommended approach whenever you need to apply the parallel aggregation pattern to .NET applications. Its declarative nature makes it less prone to error than other approaches, and its performance on multicore computers is competitive with them.
Implementing parallel aggregation with PLINQ doesn't require adding locks in your code. Instead, all the synchronization occurs internally, within PLINQ.
However, if you still want to investigate the performance for different types of the operations, you can use the Parallel.For and Parallel.ForaEach methods overloads with some aggregation approach, something like this:
double[] sequence = ...
object lockObject = new object();
double sum = 0.0d;
Parallel.ForEach(
// The values to be aggregated
sequence,
// The local initial partial result
() => 0.0d,
// The loop body
(x, loopState, partialResult) =>
{
return Normalize(x) + partialResult;
},
// The final step of each local context
(localPartialSum) =>
{
// Enforce serial access to single, shared result
lock (lockObject)
{
sum += localPartialSum;
}
}
);
return sum;
If you need additional partition for your data, you can use a Partitioner for the methods:
var rangePartitioner = Partitioner.Create(0, sequence.Length);
Parallel.ForEach(
// The input intervals
rangePartitioner,
// same code here);
Also Aggregate method can be used for the PLINQ, with some merge logic
(illustration from MSDN again):
Useful links:
Parallel Aggregation
Enumerable.Min<TSource>(IEnumerable<TSource>) method
Enumerable.Sum method
Enumerable.Max<TSource> (IEnumerable<TSource>) method

Take N values from observable on each interval

I have an observable which streams a value for each ms. , this is done every 250 ms. ( meaning 250 values in 250 ms (give or take) ).
Mock sample code :
IObservable<IEnumerable<int>> input = from _ in Observable.Interval(TimeSpan.FromMilliseconds(250))
select CreateSamples(250);
input.Subscribe(values =>
{
for (int i = 0; i < values.Count(); i++)
{
Console.WriteLine("Value : {0}", i);
}
});
Console.ReadKey();
private static IEnumerable<int> CreateSamples(int count)
{
for (int i = 0; i < 250; i++)
{
yield return i;
}
}
What i need is to create some form of process observable which process the input observable in a rate of 8 values every 33 ms
Something along the line of this :
IObservable<IEnumerable<int>> process = from _ in Observable.Interval(TimeSpan.FromMilliseconds(33))
select stream.Take(8);
I was wondering 2 things :
1) How can i write the first sample with the built in operators that reactive extensions provides ?
2) How can i create that process stream which takes values from the input stream
which with the behavior iv'e described ?
I tried using Window as a suggestion from comments below .
input.Window(TimeSpan.FromMilliseconds(33)).Take(8).Subscribe(winObservable => Debug.WriteLine(" !! "));
It seems as though i get 8 and only 8 observables of an unknown number of values
What i require is a recurrence of 8 values every 33 ms. from input observable.
What the code above did is 8 observables of IEnumrable and then stand idle.
EDIT : Thanks to James World . here's a sample .
var input = Observable.Range(1, int.MaxValue);
var timedInput = Observable.Interval(TimeSpan.FromMilliseconds(33))
.Zip(input.Buffer(8), (_, buffer) => buffer);
timedInput.SelectMany(x => x).Subscribe(Console.WriteLine);
But now it get's trickier i need for the Buffer value to calculated
i need this to be done by the actual MS passed between Intervals
when you write a TimeSpan.FromMilliseconds(33) the Interval event of the timer would actually be raised around 45 ms give or take .
Is there any way to calculate the buffer , something like PSUDO
input.TimeInterval().Buffer( s => s.Interval.Milliseconds / 4)
You won't be able to do this with any kind of accuracy with a reasonable solution because .NET timer resolution is 15ms.
If the timer was fast enough, you would have to flatten and repackage the stream with a pacer, something like:
// flatten stream
var fs = input.SelectMany(x => x);
// buffer 8 values and release every 33 milliseconds
var xs = Observable.Interval(TimeSpan.FromMilliseconds(33))
.Zip(fs.Buffer(8), (_,buffer) => buffer);
Although as I said, this will give very jittery timing. If that kind of timing resolution is important to you, go native!
I agree with James' analysis.
I'm wondering if this query gives you a better result:
IObservable<IList<int>> input =
Observable
.Generate(
0,
x => true,
x => x < 250 ? x + 1 : 0,
x => x,
x => TimeSpan.FromMilliseconds(33.0 / 8.0))
.Buffer(TimeSpan.FromMilliseconds(33.0));

Grid Computing API

I want to write a distributed software system (system where you can execute programs faster than on a single pc), that can execute different kinds of programs.(As it is a school project, I'll probably execute programs like Prime finder and Pi calculator on it)
My preferences is that it should written for C# with .NET, have good documentation, be simple to write(not new in C# with .NET, but I'm not professional) and to be able to write tasks for the grid easily and/or to load programs to the network directly from .exe.
I've looked a little at the:
MPAPI
Utilify(from the makers of Alchemy)
NGrid (Outdated?)
Which one is the best for my case? Do you have any experience with them?
ps. I'm aware of many similar questions here, but they were either outdated, not with proper answers or didn't answer my question, and therefore I choose to ask again.
I just contacted the founder of Utilify (Krishna Nadiminti) and while active development has paused for now, he has kindly released all the source code here on Bitbucket.
I think it is worth continuing this project as there are literally no comparable alternative as of now (even commercial). I may start working on it but don't wait for me :).
Got same problem. I tried NGrid, Alchemi and MS
PI.net.
After all i decided to start my own open source project to play around, check here: http://lucygrid.codeplex.com/.
UPDATE:
See how looks PI example:
The function passed to AsParallelGrid will be executed by the grid nodes.
You can play with it running the DEMO project.
/// <summary>
/// Distributes simple const processing
/// </summary>
class PICalculation : AbstractDemo
{
public int Steps = 100000;
public int ChunkSize = 50;
public PICalculation()
{
}
override
public string Info()
{
return "Calculates PI over the grid.";
}
override
public string Run(bool enableLocalProcessing)
{
double sum = 0.0;
double step = 1.0 / (double)Steps;
/* ORIGINAL VERSION
object obj = new object();
Parallel.ForEach(
Partitioner.Create(0, Steps),
() => 0.0,
(range, state, partial) =>
{
for (long i = range.Item1; i < range.Item2; i++)
{
double x = (i - 0.5) * step;
partial += 4.0 / (1.0 + x * x);
}
return partial;
},
partial => { lock (obj) sum += partial; });
*/
sum = Enumerable
.Range(0, Steps)
// Create bucket
.GroupBy(s => s / 50)
// Local variable initialization is not distributed over the grid
.Select(i => new
{
Item1 = i.First(),
Item2 = i.Last() + 1, // Inclusive
Step = step
})
.AsParallelGrid(data =>
{
double partial = 0;
for (var i = data.Item1; i != data.Item2 ; ++i)
{
double x = (i - 0.5) * data.Step;
partial += (double)(4.0 / (1.0 + x * x));
}
return partial;
}, new GridSettings()
{
EnableLocalProcessing = enableLocalProcessing
})
.Sum() * step;
return sum.ToString();
}
}

Categories

Resources