How to search vast code base for multiple literal strings efficiently? - c#

This question is a follow up on How to optimize performance in a simple TPL DataFlow pipeline?
The source code is here - https://github.com/MarkKharitonov/LearningTPLDataFlow
Given:
Several solutions covering about 400 C# projects, encompassing thousands of C# source files and totaling more than 10,000,000 lines of code.
A file containing string literals, one per line.
I want to produce a JSON file listing all the occurrences of the literals in the source code. For every matching line I want to have the following pieces of information:
The project path
The C# file path
The matching line itself
The matching line number
And all the records arranged as a dictionary keyed by the respective literal.
So the challenge is to do it as efficiently as possible (in C#, of course).
The DataFlow pipeline can be found in this file - https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs
Here it is:
private void Run(string workspaceRoot, string outFilePath, string[] literals, bool searchAllFiles, int workSize, int maxDOP1, int maxDOP2, int maxDOP3, int maxDOP4)
{
var res = new SortedDictionary<string, List<MatchingLine>>();
var projects = (workspaceRoot + "build\\projects.yml").YieldAllProjects();
var progress = new Progress();
var taskSchedulerPair = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, Environment.ProcessorCount);
var produceCSFiles = new TransformManyBlock<ProjectEx, CSFile>(p => YieldCSFiles(p, searchAllFiles), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP1
});
var produceCSFileContent = new TransformBlock<CSFile, CSFile>(CSFile.PopulateContentAsync, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP2
});
var produceWorkItems = new TransformManyBlock<CSFile, (CSFile CSFile, int Pos, int Length)>(csFile => csFile.YieldWorkItems(literals, workSize, progress), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP3,
TaskScheduler = taskSchedulerPair.ConcurrentScheduler
});
var produceMatchingLines = new TransformManyBlock<(CSFile CSFile, int Pos, int Length), MatchingLine>(o => o.CSFile.YieldMatchingLines(literals, o.Pos, o.Length, progress), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP4,
TaskScheduler = taskSchedulerPair.ConcurrentScheduler
});
var getMatchingLines = new ActionBlock<MatchingLine>(o => AddResult(res, o));
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
produceCSFiles.LinkTo(produceCSFileContent, linkOptions);
produceCSFileContent.LinkTo(produceWorkItems, linkOptions);
produceWorkItems.LinkTo(produceMatchingLines, linkOptions);
produceMatchingLines.LinkTo(getMatchingLines, linkOptions);
var progressTask = Task.Factory.StartNew(() =>
{
var delay = literals.Length < 10 ? 1000 : 10000;
for (; ; )
{
var current = Interlocked.Read(ref progress.Current);
var total = Interlocked.Read(ref progress.Total);
Console.Write("Total = {0:n0}, Current = {1:n0}, Percents = {2:P} \r", total, current, ((double)current) / total);
if (progress.Done)
{
break;
}
Thread.Sleep(delay);
}
Console.WriteLine();
}, TaskCreationOptions.LongRunning);
projects.ForEach(p => produceCSFiles.Post(p));
produceCSFiles.Complete();
getMatchingLines.Completion.GetAwaiter().GetResult();
progress.Done = true;
progressTask.GetAwaiter().GetResult();
res.SaveAsJson(outFilePath);
}
The default parameters are (https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs#L24-L28):
private int m_maxDOP1 = 3;
private int m_maxDOP2 = 20;
private int m_maxDOP3 = Environment.ProcessorCount;
private int m_maxDOP4 = Environment.ProcessorCount;
private int m_workSize = 1_000_000;
My idea is to divide the work into work items, where the size of a work item is the number of lines in the respective file multiplied by the number of string literals. So, if a C# file contains 500 lines, then searching it for all 3401 literals results in work of size 3401 * 500 = 1700500.
The unit of work is by default 1000000 line-literal pairs (for a 500-line file that is at most 2000 literals per work item), so in the aforementioned example the file would result in 2 work items:
Literals 0..1999
Literals 2000..3400
And it is the responsibility of the produceWorkItems block to generate these work items from files.
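The splitting rule described above can be sketched as follows. This is only an illustration, under the assumption that each work item covers a contiguous range of literals for a single file; `SplitLiterals` and its parameter names are hypothetical and are not taken from the linked repository.

```csharp
using System;
using System.Collections.Generic;

static class WorkItemSketch
{
    // Hypothetical helper: splits the literal indices for one file into ranges so that
    // lineCount * Length stays at or below workSize (with at least one literal per range).
    public static IEnumerable<(int Pos, int Length)> SplitLiterals(
        int lineCount, int literalCount, int workSize)
    {
        int literalsPerItem = Math.Max(1, workSize / Math.Max(1, lineCount));
        for (int pos = 0; pos < literalCount; pos += literalsPerItem)
        {
            yield return (pos, Math.Min(literalsPerItem, literalCount - pos));
        }
    }
}
```

For a 500-line file, 3401 literals and a work size of 1,000,000, this yields (0, 2000) and (2000, 1401), that is, literals 0..1999 and 2000..3400.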
Example runs:
C:\work\TPLDataFlow [master ≡]> .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\xyz\tip -o c:\temp -l C:\temp\2.txt
Locating all the instances of the 4 literals found in the file C:\temp\2.txt in the C# code ...
Total = 49,844,516, Current = 49,702,532, Percents = 99.72%
Elapsed: 00:00:18.4320676
C:\work\TPLDataFlow [master ≡]> .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\xyz\tip -o c:\temp -l C:\temp\1.txt
Locating all the instances of the 3401 literals found in the file c:\temp\1.txt in the C# code ...
Total = 42,379,095,775, Current = 42,164,259,870, Percents = 99.49%
Elapsed: 01:44:13.4289270
Question
Many work items are undersized. If I have 3 C# files, 20 lines each, my current code would produce 3 work items, because in my current implementation work items never cross a file boundary. This is inefficient. Ideally, they would be batched into a single work item, because 60 * 3401 = 204060 < 1000000. But the BatchBlock cannot be used here, because it expects me to provide the batch size, which I do not know - it depends on the work items in the pipeline.
How would you achieve such batching?

I have realized something. Maybe it is obvious, but I have just figured it out. The TPL DataFlow library is of no use if one can buffer all the items first. So in my case - I can do that. And so, I can buffer and sort the items from large to small. This way a simple Parallel.ForEach will do the work just fine. Having realized that I changed my implementation to use Reactive like this:
Phase 1 - get all the items, this is where all the IO is
var input = (workspaceRoot + "build\\projects.yml")
.YieldAllProjects()
.ToObservable()
.Select(project => Observable.FromAsync(() => Task.Run(() => YieldFiles(project, searchAllFiles))))
.Merge(2)
.SelectMany(files => files)
.Select(file => Observable.FromAsync(file.PopulateContentAsync))
.Merge(10)
.ToList()
.GetAwaiter().GetResult()
.AsList();
input.Sort((x, y) => y.EstimatedLineCount - x.EstimatedLineCount);
Phase 2 - find all the matching lines (CPU only)
var res = new SortedDictionary<string, List<MatchingLine>>();
input
.ToObservable()
.Select(file => Observable.FromAsync(() => Task.Run(() => file.YieldMatchingLines(literals, 0, literals.Count, progress).ToList())))
.Merge(maxDOP.Value)
.ToList()
.GetAwaiter().GetResult()
.SelectMany(m => m)
.ForEach(m => AddResult(res, m));
So, even though I have hundreds of projects, thousands of files and millions of lines of code, it is not the scale for TPL DataFlow, because my tool can read all the files into memory, rearrange them in a favorable order and then process them.
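The "buffer everything, sort from large to small, then run a simple parallel loop" idea can also be sketched with Parallel.ForEach. This is an illustration under several assumptions, not the code from the repository: the `files` dictionary (path to lines) and `FindAll` are hypothetical stand-ins, and a plain ordinal `string.Contains` stands in for the real matching logic.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class LargestFirstSketch
{
    public static SortedDictionary<string, List<(string File, int Line)>> FindAll(
        IReadOnlyDictionary<string, string[]> files, IReadOnlyList<string> literals, int maxDop)
    {
        var hits = new ConcurrentBag<(string Literal, string File, int Line)>();

        // Largest files first, so that the long-running items start early.
        var ordered = files.OrderByDescending(f => f.Value.Length).ToList();

        // NoBuffering hands out one file at a time, which keeps the largest-first order.
        var partitioner = Partitioner.Create(ordered, EnumerablePartitionerOptions.NoBuffering);
        Parallel.ForEach(partitioner, new ParallelOptions { MaxDegreeOfParallelism = maxDop }, file =>
        {
            for (int i = 0; i < file.Value.Length; i++)
                foreach (var literal in literals)
                    if (file.Value[i].Contains(literal, StringComparison.Ordinal))
                        hits.Add((literal, file.Key, i + 1)); // line numbers are 1-based
        });

        // Group the hits by literal and sort them for a stable output.
        var result = new SortedDictionary<string, List<(string File, int Line)>>();
        foreach (var group in hits.GroupBy(h => h.Literal))
            result[group.Key] = group.Select(h => (h.File, h.Line))
                                     .OrderBy(h => h.File).ThenBy(h => h.Line).ToList();
        return result;
    }
}
```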

Regarding the first question (configuring the pipeline), I can't really offer any guidance. Optimizing the parameters of a dataflow pipeline seems like a black art to me!
Regarding the second question (how to batch a work load consisting of work items having unknown size at compile time), you could use the custom BatchBlock<T> below. It uses the DataflowBlock.Encapsulate method in order to combine two dataflow blocks into one. The first block is an ActionBlock<T> that receives the input and puts it into a buffer, and the second is a BufferBlock<T[]> that holds the batched items and propagates them downstream. The weightSelector is a lambda that returns the weight of each received item. When the accumulated weight surpasses the batchWeight threshold, a batch is emitted.
public static IPropagatorBlock<T, T[]> CreateDynamicBatchBlock<T>(
int batchWeight, Func<T, int> weightSelector,
DataflowBlockOptions options = null)
{
// Arguments validation omitted
options ??= new DataflowBlockOptions();
var outputBlock = new BufferBlock<T[]>(options);
List<T> buffer = new List<T>();
int sumWeight = 0;
var inputBlock = new ActionBlock<T>(async item =>
{
checked
{
int weight = weightSelector(item);
if (weight + sumWeight > batchWeight && buffer.Count > 0)
await SendBatchAsync();
buffer.Add(item);
sumWeight += weight;
if (sumWeight >= batchWeight) await SendBatchAsync();
}
}, new()
{
BoundedCapacity = options.BoundedCapacity,
CancellationToken = options.CancellationToken,
TaskScheduler = options.TaskScheduler,
MaxMessagesPerTask = options.MaxMessagesPerTask,
NameFormat = options.NameFormat
});
PropagateCompletion(inputBlock, outputBlock, async () =>
{
if (buffer.Count > 0) await SendBatchAsync();
});
Task SendBatchAsync()
{
var batch = buffer.ToArray();
buffer.Clear();
sumWeight = 0;
return outputBlock.SendAsync(batch);
}
static async void PropagateCompletion(IDataflowBlock source,
IDataflowBlock target, Func<Task> postCompletionAction)
{
try { await source.Completion.ConfigureAwait(false); } catch { }
Exception ex =
source.Completion.IsFaulted ? source.Completion.Exception : null;
try { await postCompletionAction(); }
catch (Exception actionError) { ex = actionError; }
if (ex != null) target.Fault(ex); else target.Complete();
}
return DataflowBlock.Encapsulate(inputBlock, outputBlock);
}
Usage example:
var batchBlock = CreateDynamicBatchBlock<WorkItem>(1_000_000, wi => wi.Size);
If the weight int type has not enough range and overflows, you could switch to long or double.
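To see which batches the weight rule produces, here is the same rule as a plain synchronous iterator, without any dataflow blocks. `BatchByWeight` is a hypothetical helper written for illustration; it mirrors the two checks in the ActionBlock above (flush before adding an item that would overflow the batch, flush after adding once the threshold is reached) and uses long weights.

```csharp
using System;
using System.Collections.Generic;

static class WeightBatchingSketch
{
    public static IEnumerable<T[]> BatchByWeight<T>(
        IEnumerable<T> source, long batchWeight, Func<T, long> weightSelector)
    {
        var buffer = new List<T>();
        long sumWeight = 0;
        foreach (var item in source)
        {
            long weight = weightSelector(item);
            // Flush first if this item would push a non-empty batch over the threshold.
            if (sumWeight + weight > batchWeight && buffer.Count > 0)
            {
                yield return buffer.ToArray();
                buffer.Clear();
                sumWeight = 0;
            }
            buffer.Add(item);
            sumWeight += weight;
            // Flush right away once the threshold is reached (covers oversized single items).
            if (sumWeight >= batchWeight)
            {
                yield return buffer.ToArray();
                buffer.Clear();
                sumWeight = 0;
            }
        }
        // Emit whatever is left when the input ends.
        if (buffer.Count > 0) yield return buffer.ToArray();
    }
}
```

With weights 400,000, 300,000, 500,000, 1,200,000 and 100,000 and a threshold of 1,000,000, the batches are [400,000, 300,000], [500,000], [1,200,000] and [100,000].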

Related

Fast reading large table

I have csv file structured as below:
1,0,2.2,0,0,0,0,1.2,0
0,1,2,4,0,1,0.2,0.1,0
0,0,2,3,0,0,0,1.2,2.1
0,0,0,1,2,1,0,0.2,0.1
0,0,1,0,2.1,0.1,0,1.2
0,0,2,3,0,1.1,0.1,1.2
0,0.2,0,1.2,2,0,3.2,0
0,0,1.2,0,2.2,0,0,1.1
but with 10k columns and 10k rows.
I want to read it in such a way that I get a dictionary
with the row index as the key and, as the value, a float array filled with every value in that row.
For now my code looks like this:
var lines = File.ReadAllLines(filePath).ToList();
var result = lines.AsParallel().AsOrdered().Select((line, index) =>
{
var values = line?.Split(',').Where(v =>!string.IsNullOrEmpty(v))
.Select(f => f.Replace('.', ','))
.Select(float.Parse).ToArray();
return (index, values);
}).ToDictionary(d => d.Item1, d => d.Item2);
but it takes up to 30 seconds to finish, which is quite slow, and I want to make it faster.
While there are many small optimizations you can make, what is really killing you is the garbage collector because of all the allocations.
Your code takes 12 seconds to run on my machine. Reading the file uses 2 of those 12 seconds.
By using all the optimizations mentioned in the comments (using File.ReadLines, StringSplitOptions.RemoveEmptyEntries, and float.Parse(f, CultureInfo.InvariantCulture) instead of calling string.Replace), we get down to 9 seconds. There are still a lot of allocations, especially by File.ReadLines. Can we do better?
Just activate server GC in the app.config:
<runtime>
<gcServer enabled="true" />
</runtime>
With that, the execution time drops to 6 seconds using your code, and 3 seconds using the optimizations mentioned above. At that point, file I/O takes more than 60% of the execution time, so it's not really worth optimizing further.
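Whether the setting actually took effect can be checked at run time. The snippet below is a small sketch using GCSettings from System.Runtime; the `GcInfo.Describe` name is made up for this example. Note that the app.config element applies to .NET Framework; on .NET Core and later the equivalent switch is the ServerGarbageCollector project property or System.GC.Server in runtimeconfig.json.

```csharp
using System.Runtime;

static class GcInfo
{
    // Reports which GC flavor the current process is running with.
    public static string Describe() =>
        $"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}";
}
```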
Final version of the code:
var lines = File.ReadLines(filePath);
var separator = new[] {','};
var result = lines.AsParallel().AsOrdered().Select((line, index) =>
{
var values = line?.Split(separator, StringSplitOptions.RemoveEmptyEntries)
.Select(f => float.Parse(f, CultureInfo.InvariantCulture)).ToArray();
return (index, values);
}).ToDictionary(d => d.Item1, d => d.Item2);
Replacing the Split and Replace with hand parsing, using InvariantInfo to accept the period as the decimal point, removing the wasteful ReadAllLines().ToList() and letting AsParallel() read from the file while parsing makes it about four times faster on my PC.
var lines = File.ReadLines(filepath);
var result = lines.AsParallel().AsOrdered().Select((line, index) => {
var values = new List<float>(10000);
var pos = 0;
while (pos < line.Length) {
var commapos = line.IndexOf(',', pos);
commapos = commapos < 0 ? line.Length : commapos;
var fs = line.Substring(pos, commapos - pos);
if (fs != String.Empty) // remove if no value is ever missing
values.Add(float.Parse(fs, NumberFormatInfo.InvariantInfo));
pos = commapos + 1;
}
return values;
}).ToList();
Also replaced ToArray on values with a List as that is generally faster (ToList is preferred over ToArray).
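If the target runtime is .NET Core 2.1 or later, the Substring allocation for every field can also be avoided by parsing slices of the line directly. The following is a sketch under that assumption; `ParseLine` is a hypothetical helper, not part of either answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SpanCsvSketch
{
    // Parses one comma-separated line into floats without allocating a string per field.
    public static float[] ParseLine(string line)
    {
        var values = new List<float>();
        ReadOnlySpan<char> rest = line.AsSpan();
        while (!rest.IsEmpty)
        {
            int comma = rest.IndexOf(',');
            ReadOnlySpan<char> field = comma < 0 ? rest : rest.Slice(0, comma);
            if (!field.IsEmpty) // skip missing values, like RemoveEmptyEntries does
                values.Add(float.Parse(field, NumberStyles.Float, CultureInfo.InvariantCulture));
            rest = comma < 0 ? ReadOnlySpan<char>.Empty : rest.Slice(comma + 1);
        }
        return values.ToArray();
    }
}
```

It can be dropped into the PLINQ query in place of the Split/Select chain, for example `Select((line, index) => (index, SpanCsvSketch.ParseLine(line)))`.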
using Microsoft.VisualBasic.FileIO;
protected void CSVImport(string importFilePath)
{
string csvData = System.IO.File.ReadAllText(importFilePath, System.Text.Encoding.GetEncoding("WINDOWS-1250"));
foreach (string row in csvData.Split('\n'))
{
var parser = new TextFieldParser(new StringReader(row));
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
string[] fields;
fields = parser.ReadFields();
//do what you need with data in array
}
}

Getting Min, Max, Sum with a single parallel for loop

I am trying to get the minimum, maximum and sum (for the average) from a large array. I would love to replace my regular for loop with Parallel.For.
UInt16 tempMin = (UInt16)(Math.Pow(2,mfvm.cameras[openCamIndex].bitDepth) - 1);
UInt16 tempMax = 0;
UInt64 tempSum = 0;
for (int i = 0; i < acquisition.frameDataShorts.Length; i++)
{
if (acquisition.frameDataShorts[i] < tempMin)
tempMin = acquisition.frameDataShorts[i];
if (acquisition.frameDataShorts[i] > tempMax)
tempMax = acquisition.frameDataShorts[i];
tempSum += acquisition.frameDataShorts[i];
}
I know how to solve this using Tasks by splitting the array myself. However, I would love to learn how to use Parallel.For for this, since as I understand it, it should be able to do this very elegantly.
I found this tutorial from MSDN for calculating the Sum, however I have no idea how to extend it to do all three things (min, max and sum) in a single pass.
Results:
OK, I tried the PLINQ solution and I have seen some serious improvements.
On my i7 (2x4 cores), 3 passes (Min, Max, Sum) are 4x faster than the sequential approach. However, I tried the same code on a Xeon (2x8 cores) and the results are completely different. Parallel (again 3 passes) is actually twice as slow as the sequential approach (which is about 5x faster than on my i7).
In the end I split the array myself with Task Factory and I get slightly better results on all computers.
I assume that the main issue here is that three different variables have to be remembered on each iteration. You can use a Tuple for this purpose:
var lockObject = new object();
var arr = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
var min = arr[0];
var max = arr[0];
Parallel.For(0, arr.Length,
() => new Tuple<long, int, int>(0, arr[0], arr[0]),
(i, loop, temp) => new Tuple<long, int, int>(temp.Item1 + arr[i], Math.Min(temp.Item2, arr[i]),
Math.Max(temp.Item3, arr[i])),
x =>
{
lock (lockObject)
{
total += x.Item1;
min = Math.Min(min, x.Item2);
max = Math.Max(max, x.Item3);
}
}
);
I must warn you, though, that this implementation runs about 10x slower (on my machine) than the simple for loop approach you demonstrated in your question, so proceed with caution.
I don't think Parallel.For is a good fit here, but try this out:
public class MyArrayHandler {
public async Task GetMinMaxSum() {
var myArray = Enumerable.Range(0, 1000);
var maxTask = Task.Run(() => myArray.Max());
var minTask = Task.Run(() => myArray.Min());
var sumTask = Task.Run(() => myArray.Sum());
var results = await Task.WhenAll(maxTask,
minTask,
sumTask);
var max = results[0];
var min = results[1];
var sum = results[2];
}
}
Edit
Just for fun due to the comments regarding performance I took a couple measurements. Also, found this Fastest way to find sum.
#10,000,000 values
GetMinMax: 218ms
GetMinMaxAsync: 308ms
public class MinMaxSumTests {
[Test]
public async Task GetMinMaxSumAsync() {
var myArray = Enumerable.Range(0, 10000000).Select(x => (long)x).ToArray();
var sw = new Stopwatch();
sw.Start();
var maxTask = Task.Run(() => myArray.Max());
var minTask = Task.Run(() => myArray.Min());
var sumTask = Task.Run(() => myArray.Sum());
var results = await Task.WhenAll(maxTask,
minTask,
sumTask);
var max = results[0];
var min = results[1];
var sum = results[2];
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
[Test]
public void GetMinMaxSum() {
var myArray = Enumerable.Range(0, 10000000).Select(x => (long)x).ToArray();
var sw = new Stopwatch();
sw.Start();
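long tempMin = long.MaxValue; // start above any value so the first element becomes the minimum
long tempMax = long.MinValue; // start below any value so the first element becomes the maximum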
long tempSum = 0;
for (int i = 0; i < myArray.Length; i++) {
if (myArray[i] < tempMin)
tempMin = myArray[i];
if (myArray[i] > tempMax)
tempMax = myArray[i];
tempSum += myArray[i];
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
}
Do not reinvent the wheel: Min, Max, Sum and similar operations are aggregations. Since .NET 3.5 you have handy LINQ extension methods that already provide the solution:
using System.Linq;
var sequence = Enumerable.Range(0, 10).Select(s => (uint)s).ToList();
Console.WriteLine(sequence.Sum(s => (double)s));
Console.WriteLine(sequence.Max());
Console.WriteLine(sequence.Min());
Though they are declared as extensions for IEnumerable, they have some internal improvements for IList and Array types, so you should measure how your code performs on those types and on plain IEnumerables.
In your case this isn't enough, as you clearly do not want to iterate over the array more than once, so here is where the magic comes in: PLINQ (a.k.a. Parallel LINQ). You need to add only one method call to aggregate your array in parallel:
var sequence = Enumerable.Range(0, 10000000).Select(s => (uint)s).AsParallel();
Console.WriteLine(sequence.Sum(s => (double)s));
Console.WriteLine(sequence.Max());
Console.WriteLine(sequence.Min());
This option adds some synchronization overhead, but it scales well, giving similar times for both small and big enumerations. From MSDN:
PLINQ is usually the recommended approach whenever you need to apply the parallel aggregation pattern to .NET applications. Its declarative nature makes it less prone to error than other approaches, and its performance on multicore computers is competitive with them.
Implementing parallel aggregation with PLINQ doesn't require adding locks in your code. Instead, all the synchronization occurs internally, within PLINQ.
However, if you still want to investigate the performance of different kinds of operations, you can use the Parallel.For and Parallel.ForEach method overloads with some aggregation approach, something like this:
double[] sequence = ...
object lockObject = new object();
double sum = 0.0d;
Parallel.ForEach(
// The values to be aggregated
sequence,
// The local initial partial result
() => 0.0d,
// The loop body
(x, loopState, partialResult) =>
{
return Normalize(x) + partialResult;
},
// The final step of each local context
(localPartialSum) =>
{
// Enforce serial access to single, shared result
lock (lockObject)
{
sum += localPartialSum;
}
}
);
return sum;
If you need additional partitioning of your data, you can use a Partitioner with these methods:
var rangePartitioner = Partitioner.Create(0, sequence.Length);
Parallel.ForEach(
// The input intervals
rangePartitioner,
// same code here);
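Combining the range partitioner with thread-local state gives a single-pass min/max/sum. This sketch assumes a non-empty ushort[] input like the one in the question; `MinMaxSum` is an illustrative name. Each task runs a tight inner loop over its index range and keeps its partial result in a value tuple, so there is no per-element delegate call or allocation, and the lock is taken once per task rather than once per element.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class RangePartitionSketch
{
    public static (ushort Min, ushort Max, ulong Sum) MinMaxSum(ushort[] data)
    {
        if (data.Length == 0) throw new ArgumentException("Array must not be empty.", nameof(data));

        var gate = new object();
        ushort min = ushort.MaxValue, max = ushort.MinValue;
        ulong sum = 0;

        Parallel.ForEach(
            Partitioner.Create(0, data.Length),                          // index ranges [from, to)
            () => (Min: ushort.MaxValue, Max: ushort.MinValue, Sum: 0UL), // per-task partial result
            (range, loopState, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    ushort v = data[i];
                    if (v < local.Min) local.Min = v;
                    if (v > local.Max) local.Max = v;
                    local.Sum += v;
                }
                return local;
            },
            local =>
            {
                lock (gate) // merge the partial results, one at a time
                {
                    if (local.Min < min) min = local.Min;
                    if (local.Max > max) max = local.Max;
                    sum += local.Sum;
                }
            });

        return (min, max, sum);
    }
}
```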
The Aggregate method can also be used with PLINQ, with some merge logic.
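A sketch of what such an Aggregate call can look like for min, max and sum in one parallel pass, using the overload that takes a seed factory, an update function, a combine function and a result selector. `MinMaxSum` and the tuple element names are illustrative choices; for an empty input this version simply returns the seed values.

```csharp
using System;
using System.Linq;

static class PlinqAggregateSketch
{
    public static (int Min, int Max, long Sum) MinMaxSum(int[] data) =>
        data.AsParallel().Aggregate(
            // Seed factory: a fresh accumulator for every partition.
            () => (Min: int.MaxValue, Max: int.MinValue, Sum: 0L),
            // Update: fold one element into the partition's accumulator.
            (acc, v) => (Math.Min(acc.Min, v), Math.Max(acc.Max, v), acc.Sum + v),
            // Combine: merge the accumulators of two partitions.
            (a, b) => (Math.Min(a.Min, b.Min), Math.Max(a.Max, b.Max), a.Sum + b.Sum),
            // Result selector: return the merged accumulator unchanged.
            acc => acc);
}
```

No explicit lock is needed here, because the merge step is expressed as the combine function and PLINQ performs the synchronization internally.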
Useful links:
Parallel Aggregation
Enumerable.Min<TSource>(IEnumerable<TSource>) method
Enumerable.Sum method
Enumerable.Max<TSource> (IEnumerable<TSource>) method

Using IEnumerable.TakeWhile but only returns one set of results

First off I want to apologize if my code is bad or if my description is poor. This is one of my first times working with C# threading/tasks. What I'm trying to do in my code is to go through a list of names and for each 50 names in the list, start a new task and pass off those 50 names to another method that will perform calculation heavy methods on the data. My code only works for the first 50 names in the list and it returns 0 results for every other time and I can't seem to figure out why.
public static async void startInitialDownload(string value)
{
IEnumerable<string> names = await Helper.getNames(value, 0);
decimal multiple = names.Count() / 50;
string[] results;
int num1 = 0;
int num2 = 0;
for (int i = 0; i < multiple + 1; i++)
{
num1 = i * 50;
num2 = (50 * (i + 1));
results = names.TakeWhile((name, index) => index >= num1 && index < num2).ToArray();
Task current = Task.Factory.StartNew(() => getCurrentData(results));
await current.ConfigureAwait(false);
}
}
Realise the enumerable into a list, so that it will be calculated once, not each iteration in the loop. You can use the Skip and Take methods to get a range of the list:
public static async void startInitialDownload(string value) {
IEnumerable<string> names = await Helper.getNames(value, 0);
List<string> nameList = names.ToList();
for (int i = 0; i < nameList.Count; i += 50) {
string[] results = nameList.Skip(i).Take(50).ToArray();
Task current = Task.Factory.StartNew(() => getCurrentData(results));
await current.ConfigureAwait(false);
}
}
Or you can add items to a list, and execute it when it has the right size:
public static async void startInitialDownload(string value) {
IEnumerable<string> names = await Helper.getNames(value, 0);
List<string> buffer = new List<string>();
foreach (string s in names) {
buffer.Add(s);
if (buffer.Count == 50) {
Task current = Task.Factory.StartNew(() => getCurrentData(buffer.ToArray()));
await current.ConfigureAwait(false);
buffer = new List<string>();
}
}
if (buffer.Count > 0) {
Task current = Task.Factory.StartNew(() => getCurrentData(buffer.ToArray()));
await current.ConfigureAwait(false);
}
}
The name TakeWhile suggests that it only takes entries while the condition is true. So if it starts off by reading an entry for which the condition is false, it never takes anything.
So in the first loop iteration, you're starting with num1 = 0, and it reads the entries from num1 up to (but not including) num2.
In the second iteration, you're starting with num1 being 50. So it starts reading again from the beginning of the list ... and for the first entry it hits (index 0), the condition is false, so it stops.
You might try using Where, or using Skip beforehand.
The tl;dr; of it: I don't think your problem has anything to do with parallel tasks, I think it's due to using the wrong LINQ method to pull the names you want to use.
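The difference is easy to reproduce on a small list. In this sketch (the names `PagingSketch`, `WithTakeWhile` and `WithSkipTake` are made up for the demonstration), the second page of 50 comes back empty with TakeWhile but not with Skip and Take:

```csharp
using System.Collections.Generic;
using System.Linq;

static class PagingSketch
{
    // Same predicate as in the question: stops at the first index for which it is false.
    public static string[] WithTakeWhile(IEnumerable<string> names, int from, int to) =>
        names.TakeWhile((name, index) => index >= from && index < to).ToArray();

    // Skips the first 'from' entries, then takes the next 'to - from'.
    public static string[] WithSkipTake(IEnumerable<string> names, int from, int to) =>
        names.Skip(from).Take(to - from).ToArray();
}
```

With 120 names, `WithTakeWhile(names, 50, 100)` returns no elements because index 0 already fails the predicate, while `WithSkipTake(names, 50, 100)` returns the 50 names at indices 50 to 99.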
As I understand it from Stephen Cleary's response to a similar (though not identical) question, you don't need to use ConfigureAwait() there.
Here's the link in question: on stack overflow
And here's what I would do instead with the last two lines of your for loop:
Task.Factory.StartNew(() => getCurrentData(results));
That's it. By using the factory, and by not awaiting, you are letting that task run on its own (possibly on a new thread). Provided that your storage is all thread safe (see: System.Collections.Concurrent btw) then you should be all set.
Caveat: if you aren't showing us what lies after the await then your results may vary.
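If the caller still needs to know when all batches have finished, one option is to keep the started tasks and await them together. This is a sketch, assuming the per-batch work is CPU-bound and thread-safe; `ProcessInBatchesAsync` and the `process` delegate are hypothetical stand-ins for startInitialDownload and getCurrentData.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class BatchRunnerSketch
{
    public static async Task<int[]> ProcessInBatchesAsync(
        IReadOnlyList<string> names, int batchSize, Func<string[], int> process)
    {
        var tasks = new List<Task<int>>();
        for (int i = 0; i < names.Count; i += batchSize)
        {
            // Materialize the batch before starting the task, so the lambda
            // captures a fixed array rather than a lazily evaluated query.
            string[] batch = names.Skip(i).Take(batchSize).ToArray();
            tasks.Add(Task.Run(() => process(batch)));
        }
        // The results come back in the order in which the tasks were created.
        return await Task.WhenAll(tasks);
    }
}
```

Returning a Task instead of using async void also lets the caller observe exceptions thrown by any of the batches.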
It's not a direct solution, but it might work.
public static IEnumerable<T[]> MakeBuckets<T>(IEnumerable<T> source, int maxSize)
{
List<T> currentBucket = new List<T>(maxSize);
foreach (var s in source)
{
currentBucket.Add(s);
if (currentBucket.Count >= maxSize)
{
yield return currentBucket.ToArray();
currentBucket = new List<T>(maxSize);
}
}
if(currentBucket.Any())
yield return currentBucket.ToArray();
}
Later you can iterate through the result of the MakeBuckets function.
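On .NET 6 and later, the built-in Enumerable.Chunk extension does the same bucketing, so a helper like MakeBuckets is only needed on older targets. A minimal sketch (the `ChunkSketch` wrapper exists only to make the example self-contained):

```csharp
using System.Collections.Generic;
using System.Linq;

static class ChunkSketch
{
    // Splits the sequence into arrays of at most 'size' elements; the last one may be shorter.
    public static List<string[]> InBatchesOf(IEnumerable<string> names, int size) =>
        names.Chunk(size).ToList();
}
```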

Why does Observable.Generate() throw System.StackOverflowException?

I'm writing a C# (.NET 4.5) application that is used to aggregate time-based events for reporting purposes. To make my query logic reusable for both realtime and historical data I make use of the Reactive Extensions (2.0) and their IScheduler infrastructure (HistoricalScheduler and friends).
For example, assume we create a list of events (sorted chronologically, but they may coincide!) whose only payload is their timestamp, and we want to know their distribution across buffers of a fixed duration:
const int num = 100000;
const int dist = 10;
var events = new List<DateTimeOffset>();
var curr = DateTimeOffset.Now;
var gap = new Random();
var time = new HistoricalScheduler(curr);
for (int i = 0; i < num; i++)
{
events.Add(curr);
curr += TimeSpan.FromMilliseconds(gap.Next(dist));
}
var stream = Observable.Generate<int, DateTimeOffset>(
0,
s => s < events.Count,
s => s + 1,
s => events[s],
s => events[s],
time);
stream.Buffer(TimeSpan.FromMilliseconds(num), time)
.Subscribe(l => Console.WriteLine(time.Now + ": " + l.Count));
time.AdvanceBy(TimeSpan.FromMilliseconds(num * dist));
Running this code results in a System.StackOverflowException with the following stack trace (it's the last 3 lines all the way down):
mscorlib.dll!System.Threading.Interlocked.Exchange<System.IDisposable>(ref System.IDisposable location1, System.IDisposable value) + 0x3d bytes
System.Reactive.Core.dll!System.Reactive.Disposables.SingleAssignmentDisposable.Dispose() + 0x37 bytes
System.Reactive.Core.dll!System.Reactive.Concurrency.ScheduledItem<System.DateTimeOffset>.Cancel() + 0x23 bytes
...
System.Reactive.Core.dll!System.Reactive.Disposables.AnonymousDisposable.Dispose() + 0x4d bytes
System.Reactive.Core.dll!System.Reactive.Disposables.SingleAssignmentDisposable.Dispose() + 0x4f bytes
System.Reactive.Core.dll!System.Reactive.Concurrency.ScheduledItem<System.DateTimeOffset>.Cancel() + 0x23 bytes
...
Ok, the problem seems to come from my use of Observable.Generate(), depending on the list size (num) and regardless of the choice of scheduler.
What am I doing wrong? Or more generally, what would be the preferred way to create an IObservable from an IEnumerable of events that provide their own timestamps?
(update - realized I didn't provide an alternative: see at bottom of answer)
The problem is in how Observable.Generate works - it's used to unfold a corecursive (think recursion turned inside out) generator based on the arguments; if those arguments end up generating a very nested corecursive generator, you'll blow your stack.
From this point on, I'm speculating a lot (don't have the Rx source in front of me) (see below), but I'm willing to bet your definition ends up expanding into something like:
initial_state =>
generate_next(initial_state) =>
generate_next(generate_next(initial_state)) =>
generate_next(generate_next(generate_next(initial_state))) =>
generate_next(generate_next(generate_next(generate_next(initial_state)))) => ...
And on and on until your call stack gets big enough to overflow. At, say, a method signature + your int counter, that'd be something like 8-16 bytes per recursive call (more depending on how the state machine generator is implemented), so 60,000 sounds about right (1M / 16 ~ 62500 maximum depth)
EDIT: Pulled up the source - confirmed: the "Run" method of Generate looks like this - take note of the nested calls to Generate:
protected override IDisposable Run(
IObserver<TResult> observer,
IDisposable cancel,
Action<IDisposable> setSink)
{
if (this._timeSelectorA != null)
{
Generate<TState, TResult>.α α =
new Generate<TState, TResult>.α(
(Generate<TState, TResult>) this,
observer,
cancel);
setSink(α);
return α.Run();
}
if (this._timeSelectorR != null)
{
Generate<TState, TResult>.δ δ =
new Generate<TState, TResult>.δ(
(Generate<TState, TResult>) this,
observer,
cancel);
setSink(δ);
return δ.Run();
}
Generate<TState, TResult>._ _ =
new Generate<TState, TResult>._(
(Generate<TState, TResult>) this,
observer,
cancel);
setSink(_);
return _.Run();
}
EDIT: Derp, didn't offer any alternatives...here's one that might work:
(EDIT: fixed Enumerable.Range, so stream size won't be multiplied by chunkSize)
const int num = 160000;
const int dist = 10;
var events = new List<DateTimeOffset>();
var curr = DateTimeOffset.Now;
var gap = new Random();
var time = new HistoricalScheduler(curr);
for (int i = 0; i < num; i++)
{
events.Add(curr);
curr += TimeSpan.FromMilliseconds(gap.Next(dist));
}
// Size too big? Fine, we'll chunk it up!
const int chunkSize = 10000;
var numberOfChunks = (int)Math.Ceiling((double)events.Count / chunkSize);
// Generate a whole mess of streams based on start/end indices
var streams =
from chunkIndex in Enumerable.Range(0, numberOfChunks) // the second argument is a count, so no "- 1" (it would drop the last chunk)
let startIdx = chunkIndex * chunkSize
let endIdx = Math.Min(events.Count, startIdx + chunkSize)
select Observable.Generate<int, DateTimeOffset>(
startIdx,
s => s < endIdx,
s => s + 1,
s => events[s],
s => events[s],
time);
// E pluribus streamum
var stream = Observable.Concat(streams);
stream.Buffer(TimeSpan.FromMilliseconds(num), time)
.Subscribe(l => Console.WriteLine(time.Now + ": " + l.Count));
time.AdvanceBy(TimeSpan.FromMilliseconds(num * dist));
OK, I've taken a different factory method that doesn't require lambda expressions as state transitions and now I don't see any stack overflows anymore. I'm not yet sure if this would qualify as a correct answer to my question, but it works and I thought I'd share it here:
var stream = Observable.Create<DateTimeOffset>(o =>
{
foreach (var e in events)
{
time.Schedule(e, () => o.OnNext(e));
}
time.Schedule(events[events.Count - 1], () => o.OnCompleted());
return Disposable.Empty;
});
Manually scheduling the events before (!) returning the subscription seems awkward to me, but in this case it can be done inside the lambda expression.
If there is anything wrong about this approach, please correct me. Also, I'd still be happy to hear what implicit assumptions by System.Reactive I have violated with my original code.
(Oh my, I should have checked that earlier: with RX v1.0, the original Observable.Generate() does in fact seem to work!)

reactive extensions sliding time window

I have a sequence of stock ticks coming in and I want to take all the data in the last hour and do some processing on it. I am trying to achieve this with Reactive Extensions 2.0. I read in another post that I should use Interval, but I think that is deprecated.
Would this extension method solve your problem?
public static IObservable<T[]> RollingBuffer<T>(
this IObservable<T> @this,
TimeSpan buffering)
{
return Observable.Create<T[]>(o =>
{
var list = new LinkedList<Timestamped<T>>();
return @this.Timestamp().Subscribe(tx =>
{
list.AddLast(tx);
while (list.First.Value.Timestamp < DateTime.Now.Subtract(buffering))
{
list.RemoveFirst();
}
o.OnNext(list.Select(tx2 => tx2.Value).ToArray());
}, ex => o.OnError(ex), () => o.OnCompleted());
});
}
You are looking for the Window operators!
Here is a lengthy article I wrote on working with sequences of coincidence (overlapping windows of sequences)
http://introtorx.com/Content/v1.0.10621.0/17_SequencesOfCoincidence.html
So if you wanted to build a rolling average you could use this sort of code
var scheduler = new TestScheduler();
var notifications = new Recorded<Notification<double>>[30];
for (int i = 0; i < notifications.Length; i++)
{
notifications[i] = new Recorded<Notification<double>>(i*1000000, Notification.CreateOnNext<double>(i));
}
//Push values into an observable sequence 0.1 seconds apart with values from 0 to 29
var source = scheduler.CreateHotObservable(notifications);
source.GroupJoin(
source, //Take values from myself
_=>Observable.Return(0, scheduler), //Just the first value
_=>Observable.Timer(TimeSpan.FromSeconds(1), scheduler),//Window period, change to 1hour
(lhs, rhs)=>rhs.Sum()) //Aggregation you want to do.
.Subscribe(i=>Console.WriteLine (i));
scheduler.Start();
And we can see it output the rolling sums as it receives values.
0, 1, 3, 6, 10, 15, 21, 28...
Very likely Buffer is what you are looking for:
var hourlyBatch = ticks.Buffer(TimeSpan.FromHours(1));
Or assuming data is already Timestamped, simply using Scan:
public static IObservable<IReadOnlyList<Timestamped<T>>> SlidingWindow<T>(this IObservable<Timestamped<T>> self, TimeSpan length)
{
return self.Scan(new LinkedList<Timestamped<T>>(),
(ll, newSample) =>
{
ll.AddLast(newSample);
var oldest = newSample.Timestamp - length;
while (ll.Count > 0 && ll.First.Value.Timestamp < oldest)
ll.RemoveFirst();
return ll;
}).Select(l => l.ToList().AsReadOnly());
}
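The eviction rule used by the Scan-based version can be looked at in isolation, without Rx. The class below is a sketch with made-up names (`SlidingWindowSketch<T>`, `Add`); it keeps the samples whose timestamp lies within `length` of the newest sample and returns a snapshot after every addition.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

sealed class SlidingWindowSketch<T>
{
    private readonly Queue<(DateTimeOffset Timestamp, T Value)> _items =
        new Queue<(DateTimeOffset Timestamp, T Value)>();
    private readonly TimeSpan _length;

    public SlidingWindowSketch(TimeSpan length) => _length = length;

    // Adds a sample (timestamps are assumed to be non-decreasing) and returns the current window.
    public IReadOnlyList<T> Add(DateTimeOffset timestamp, T value)
    {
        _items.Enqueue((timestamp, value));
        var oldest = timestamp - _length;
        // Drop everything that is older than the window relative to the newest sample.
        while (_items.Count > 0 && _items.Peek().Timestamp < oldest)
            _items.Dequeue();
        return _items.Select(i => i.Value).ToList();
    }
}
```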
