Drawing signal with a lot of samples

I need to display a set of signals. Each signal is defined by millions of samples. Just processing the collection (for converting samples to points according to bitmap size) of samples takes a significant amount of time (especially during scrolling).
So I implemented some kind of downsampling. I just skip some points: take every 2nd, every 3rd, every 50th point depending on signal characteristics. It increases speed very much but significantly distorts signal form.
Are there any smarter approaches?

We've had a similar issue in a recent application. Our visualization (a simple line graph) became too cluttered when zoomed out to see the full extent of the data (about 7 days of samples with a sample taken every 6 seconds more or less), so down-sampling was actually the way to go. If we didn't do that, zooming out wouldn't have much meaning, as all you would see was just a big blob of lines smeared out over the screen.
It all depends on how you are going to implement the down-sampling. There are two (simple) approaches: down-sample at the moment you get your sample, or down-sample at display time.
What really gives a huge performance boost in both of these cases is the proper selection of your data-sources.
Let's say you have 7 million samples, and your viewing window is just interested in the last million points. If your implementation depends on an IEnumerable, this means that the IEnumerable will have to MoveNext 6 million times before actually starting. However, if you're using something which is optimized for random reads (a List comes to mind), you can implement your own enumerator for that, more or less like this:
public IEnumerator<T> GetEnumerator(int start, int count, int skip)
{
    // Assumes the class has a field named _data that holds the samples as a List<T>.
    // Walks the window [start, start + count) and returns every skip-th sample.
    int end = Math.Min(start + count, _data.Count);
    for (int i = start; i < end; i += skip)
    {
        yield return _data[i];
    }
}
Obviously this is a very naive implementation, but you can do whatever you want within the for-loop (for example, use an algorithm that averages the surrounding samples). However, averaging will usually smooth out any extreme spikes in your signal, so be wary of that.
Another approach would be to create some generalized versions of your dataset for different ranges, which update themselves whenever you receive a new sample. You usually don't need to update the complete dataset; just updating the end of your set is probably good enough. This allows you to do a bit more advanced processing of your data, but it will cost more memory. You will have to cache the distinct 'layers' of detail in your application.
However, reading your (short) explanation, I think a display-time optimization might be good enough. You will always get a distortion in your signal if you generalize. You always lose data. It's up to the algorithm you choose on how this distortion will occur, and how noticeable it will be.
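One display-time reduction that keeps spikes visible is to split the visible window into one bucket per pixel column and keep only the minimum and maximum of each bucket. The sketch below is an illustration of that idea, not code from the original answer; the class name SignalDecimator and the method MinMax are hypothetical, and it assumes the samples are available as an indexable list of doubles.

```csharp
using System;
using System.Collections.Generic;

public static class SignalDecimator // hypothetical name, for illustration only
{
    // Reduces samples[start .. start + count) to at most 2 * buckets values by
    // keeping the minimum and maximum of each bucket, so spikes are preserved.
    public static List<double> MinMax(IReadOnlyList<double> samples, int start, int count, int buckets)
    {
        var result = new List<double>(Math.Max(buckets, 0) * 2);
        int end = Math.Min(start + count, samples.Count);
        int length = end - start;
        if (length <= 0 || buckets <= 0) return result;

        for (int b = 0; b < buckets; b++)
        {
            // Integer bucket boundaries; long arithmetic avoids overflow on huge inputs.
            int from = start + (int)((long)b * length / buckets);
            int to = start + (int)((long)(b + 1) * length / buckets);
            if (from >= to) continue; // more buckets than samples

            int minIndex = from, maxIndex = from;
            for (int i = from + 1; i < to; i++)
            {
                if (samples[i] < samples[minIndex]) minIndex = i;
                if (samples[i] > samples[maxIndex]) maxIndex = i;
            }

            // Emit the two extremes in their original order.
            if (minIndex == maxIndex)
            {
                result.Add(samples[minIndex]);
            }
            else if (minIndex < maxIndex)
            {
                result.Add(samples[minIndex]);
                result.Add(samples[maxIndex]);
            }
            else
            {
                result.Add(samples[maxIndex]);
                result.Add(samples[minIndex]);
            }
        }
        return result;
    }
}
```

With buckets set to the bitmap width, each pixel column gets at most two points, so the amount of drawing no longer depends on the number of samples in view, only the scan does.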

You need a better sampling algorithm. You can also use the parallel processing features of C#; refer to the Task Parallel Library.

Related

Data Structures & Techniques for operating on large data volumes (1 mln. recs and more)

A WPF .NET 4.5 app that I have been developing initially worked on small data volumes. It now works on much larger volumes, in the region of 1 million records and more, and of course I started running out of memory. The data comes from an MS SQL database and needs to be loaded into a local data structure, because it is then transformed, processed and referenced by code running in the CLR, so continuous and uninterrupted data access is required. However, not all of the data has to be loaded into memory straight away, only when it is actually accessed. As a small example, an Inverse Distance Interpolator uses this data to produce interpolated maps, and all of the data needs to be passed to it for continuous grid generation.
I have re-written some parts of the app that process data, so that they only load x rows at any given time and use a sliding window approach, which works. However, doing this for the rest of the app will require some time investment, and I wonder if there is a more robust and standard way of approaching this design problem (there has to be, I am not the first one)?
tldr; Does C# provide any data structures or techniques for accessing large amounts of data in an uninterrupted manner, so that it behaves like an IEnumerable but the data is not in memory until it is actually accessed or required, or is it completely up to me to manage memory usage? My ideal would be a structure that automatically implements a buffer-like mechanism, loading more data as and when it is accessed and freeing the memory of data that has been accessed and is no longer of interest. Like some DataTable with an internal buffer maybe?

As far as iterating through a very large data set that is too large to fit in memory goes, you can use a producer-consumer model. I used something like this when I was working with a custom data set that contained billions of records--about 2 terabytes of data total.
The idea is to have a single class that contains both producer and consumer. When you create a new instance of the class, it spins up a producer thread that fills a constrained concurrent queue. And that thread keeps the queue full. The consumer part is the API that lets you get the next record.
You start with a shared concurrent queue. I like the .NET BlockingCollection for this.
Here's an example that reads a text file and maintains a queue of 10,000 text lines.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public class TextFileLineBuffer
{
    private const int QueueSize = 10000;
    private BlockingCollection<string> _buffer = new BlockingCollection<string>(QueueSize);
    private CancellationTokenSource _cancelToken;
    private StreamReader _reader;

    public TextFileLineBuffer(string filename)
    {
        // File is opened here so that any exception is thrown on the calling thread.
        _reader = new StreamReader(filename);
        _cancelToken = new CancellationTokenSource();
        // start task that reads the file
        Task.Factory.StartNew(ProcessFile, TaskCreationOptions.LongRunning);
    }

    public string GetNextLine()
    {
        if (_buffer.IsCompleted)
        {
            // The buffer is empty because the file has been read
            // and all lines returned.
            // You can either call this an error and throw an exception,
            // or you can return null.
            return null;
        }
        // If there is a record in the buffer, it is returned immediately.
        // Otherwise, Take does a non-busy wait.
        // You might want to catch the OperationCanceledException here and return null
        // rather than letting the exception escape.
        // Take can also throw InvalidOperationException if the producer completes
        // between the IsCompleted check above and this call.
        return _buffer.Take(_cancelToken.Token);
    }

    private void ProcessFile()
    {
        while (!_reader.EndOfStream && !_cancelToken.Token.IsCancellationRequested)
        {
            var line = _reader.ReadLine();
            try
            {
                // This will block if the buffer already contains QueueSize records.
                // As soon as a space becomes available, this will add the record
                // to the buffer.
                _buffer.Add(line, _cancelToken.Token);
            }
            catch (OperationCanceledException)
            {
                // Cancelled while waiting for space in the buffer; stop reading.
                break;
            }
        }
        _buffer.CompleteAdding();
    }

    public void Cancel()
    {
        _cancelToken.Cancel();
    }
}
That's the bare bones of it. You'll want to add a Dispose method that will make sure that the thread is terminated and that the file is closed.
I've used this basic approach to good effect in many different programs. You'll have to do some analysis and testing to determine the optimum buffer size for your application. You want something large enough to keep up with the normal data flow and also handle bursts of activity, but not so large that it exceeds your memory budget.
IEnumerable modifications
If you want to support IEnumerable<T>, you have to make some minor modifications. I'll extend my example to support IEnumerable<String>.
First, you have to change the class declaration:
public class TextFileLineBuffer: IEnumerable<string>
Then, you have to implement GetEnumerator:
public IEnumerator<String> GetEnumerator()
{
    foreach (var s in _buffer.GetConsumingEnumerable())
    {
        yield return s;
    }
}

IEnumerator IEnumerable.GetEnumerator()
{
    return GetEnumerator();
}
With that, you can initialize the thing and then pass it to any code that expects an IEnumerable<string>. So it becomes:
var items = new TextFileLineBuffer(filename);
DoSomething(items);

void DoSomething(IEnumerable<string> list)
{
    foreach (var s in list)
        Console.WriteLine(s);
}

@Sergey The producer-consumer model (proposed by Jim Mischel) is probably your safest solution for complete scalability.
However, if you want to increase the room for the elephant (using your visual metaphor, which fits very well), then compression on the fly is a viable option. Decompress when used and discard after use, leaving the core data structure compressed in memory. Obviously it depends on the data and how much it lends itself to compression, but there is a lot of room in most data structures. If you have ON and OFF flags for some meta data, they can be buried in the unused bits of 16/32 bit numbers, or at least held in bits rather than bytes; use 16 bit integers for lat / longs with a constant scaling factor to convert each to real numbers before use; strings can be compressed using WinZip-type libraries, or indexed so that only ONE copy is held and no duplicates exist in memory, etc.
Decompression (albeit custom made) on the fly can be lightning fast.
This whole process can be very laborious I admit, but can definitely keep the room large enough as the elephant grows - in some instances. (Of course, it may never be good enough if the data is simply growing indefinitely)
EDIT: Re any sources...
Hi @Sergey, I wish I could!! Truly! I have used this technique for data compression and really the whole thing was custom designed on a whiteboard with one or two coders involved.
It's certainly not (all) rocket science, but it's good to fully scope out the nature of all the data. Then you know (for example) that a certain figure will never exceed 9999, so you can choose how to store it in the minimum number of bits, and then allocate the left-over bits (assuming 32 bit storage) to other values. (A real world example is the number of fingers a person has: loosely speaking you could set an upper limit at 8 or 10, although 12 is possible, and even 20 is remotely feasible if they have extra fingers. You can see what I mean.) Lat / Longs are the PERFECT example of numbers that will never cross logical boundaries (unless you use wrap-around values...). That is, latitudes are always between -90 and +90 (just guessing which type of Lat Longs), which is very easy to reduce / convert as the range of values is so neat.
So we did not rely 'directly' on any third party literature. Only upon algorithms designed for specific types of data.
In other projects, for fast real time DSP (processing) the smarter (experienced game programmers) coders would convert floats to 16 bit ints and have a global scaling factor calculated to give max precision for the particular data stream (accelerometers, LVDT, Pressure gauges, etc) you are collecting.
This reduced the transmitted AND stored data without losing ANY information. Similarly, for real time wave / signal data you could use (Fast) Fourier Transform to turn your noisy wave into its Amplitude, Phase and Spectrum components - literally half of the data values, without actually losing any (significant) data. (Within these algorithms, the data 'loss' is completely measurable - so you can decide if you are in fact losing data)
Similarly there are algorithms like Rainflow Analysis (nothing to do with rain, more about cycles and frequency) which reduce your data a lot. Peak detection and vector analysis can be enough for some other signals, which basically throws out about 99% of the data... The list is endless, but the technique MUST be intimately suited to your data. And you may have many different types of data, each lending itself to a different 'reduction' technique. I'm sure you can google 'lossless data reduction' (although I think the term lossless was coined by music processing and is a little misleading, since digital music has already lost the upper and lower frequency ranges... I digress). Please post what you find (if of course you have the time / inclination to research this further).
I would be interested to discuss your meta data, perhaps a large chunk can be 'reduced' quite elegantly...
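As a rough illustration of the "16 bit integers with a constant scaling factor" idea for latitudes, here is a minimal sketch. The class name LatitudeCodec and the choice of scale are assumptions made for this example, not something taken from the projects described above. Note that this particular encoding is lossy: with 16 bits over the range -90 to +90 the step is about 0.0027 degrees (roughly 300 m of latitude), so whether that is acceptable depends entirely on your data.

```csharp
using System;

public static class LatitudeCodec // hypothetical name, for illustration only
{
    // Assumed scale: maps [-90, +90] degrees onto [-32767, +32767].
    private const double Scale = short.MaxValue / 90.0;

    // Packs a latitude into 2 bytes instead of the 8 bytes of a double.
    public static short Encode(double latitudeDegrees)
    {
        if (latitudeDegrees < -90.0 || latitudeDegrees > 90.0)
            throw new ArgumentOutOfRangeException(nameof(latitudeDegrees));
        return (short)Math.Round(latitudeDegrees * Scale);
    }

    // Converts the stored value back to degrees just before use.
    public static double Decode(short stored)
    {
        return stored / Scale;
    }
}
```

Storing a million latitudes this way takes about 2 MB instead of 8 MB; the same pattern works for any value whose range and required precision are known in advance.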

Making C# mandelbrot drawing more efficient

First of all, I am aware that this question really sounds as if I didn't search, but I did, a lot.
I wrote a small Mandelbrot drawing code for C#, it's basically a windows form with a PictureBox on which I draw the Mandelbrot set.
My problem is that it's pretty slow. Without a deep zoom it does a pretty good job, and moving around and zooming is pretty smooth, taking less than a second per drawing. But once I start to zoom in a little and get to places which require more calculations, it becomes really slow.
On other Mandelbrot applications my computer does really fine on places which work much slower in my application, so I'm guessing there is much I can do to improve the speed.
I did the following things to optimize it:
Instead of using the SetPixel GetPixel methods on the bitmap object, I used LockBits method to write directly to memory which made things a lot faster.
Instead of using complex number objects (with classes I made myself, not the built-in ones), I emulated complex numbers using 2 variables, re and im. Doing this allowed me to cut down on multiplications, because squaring the real part and the imaginary part is something that is done a few times during the calculation, so I just save the square in a variable and reuse the result without the need to recalculate it.
I use 4 threads to draw the Mandelbrot, each thread does a different quarter of the image and they all work simultaneously. As I understood, that means my CPU will use 4 of its cores to draw the image.
I use the Escape Time Algorithm, which as I understood is the fastest?
Here is how I move between the pixels and calculate. It's commented, so I hope it's understandable:
//Pixel by pixel loop:
for (int r = rRes; r < wTo; r++)
{
    for (int i = iRes; i < hTo; i++)
    {
        //These calculations are to determine what complex number corresponds to the (r,i) pixel.
        double re = (r - (w/2))*step + zeroX;
        double im = (i - (h/2))*step - zeroY;
        //Create the Z complex number
        double zRe = 0;
        double zIm = 0;
        //Variables to store the squares of the real and imaginary part.
        double multZre = 0;
        double multZim = 0;
        //Start iterating with the complex number to determine its escape time (mandelValue)
        int mandelValue = 0;
        while (multZre + multZim < 4 && mandelValue < iters)
        {
            /* The new real part equals re(z)^2 - im(z)^2 + re(c), we store it in a temp variable
             * tempRe because we still need re(z) in the next calculation
             */
            double tempRe = multZre - multZim + re;
            /* The new imaginary part is equal to 2*re(z)*im(z) + im(c)
             * Instead of multiplying these by 2 I add re(z) to itself and then multiply by im(z), which
             * means I just do 1 multiplication instead of 2.
             */
            zRe += zRe;
            zIm = zRe * zIm + im;
            zRe = tempRe; // We can now put the temp value in its place.
            // Do the squaring now, they will be used in the next calculation.
            multZre = zRe * zRe;
            multZim = zIm * zIm;
            //Increase the mandelValue by one, because the iteration is now finished.
            mandelValue += 1;
        }
        //After the mandelValue is found, this colors its pixel accordingly (unsafe code, accesses memory directly):
        //(Unimportant for my question, I doubt the problem is with this because my code becomes really slow
        // as the number of ITERATIONS grows, this only executes more as the number of pixels grows).
        Byte* pos = px + (i * str) + (pixelSize * r);
        byte col = (byte)((1 - ((double)mandelValue / iters)) * 255);
        pos[0] = col;
        pos[1] = col;
        pos[2] = col;
    }
}
What can I do to improve this? Do you find any obvious optimization problems in my code?
Right now there are 2 ways I know I can improve it:
I need to use a different type for numbers, double is limited with accuracy and I'm sure there are better non-built-in alternative types which are faster (they multiply and add faster) and have more accuracy, I just need someone to point me where I need to look and tell me if it's true.
I can move processing to the GPU. I have no idea how to do this (OpenGL maybe? DirectX? is it even that simple or will I need to learn a lot of stuff?). If someone can send me links to proper tutorials on this subject or tell me in general about it that would be great.
Thanks a lot for reading that far and hope you can help me :)

If you decide to move the processing to the GPU, you can choose from a number of options. Since you are using C#, XNA will allow you to use HLSL. RB Whitaker has the easiest XNA tutorials if you choose this option. Another option is OpenCL. OpenTK comes with a demo program of a Julia set fractal. This would be very simple to modify to display the Mandelbrot set. See here
Just remember to find the GLSL shader that goes with the source code.
"About the GPU, examples are no help for me because I have absolutely no idea about this topic, how does it even work and what kind of calculations the GPU can do (or how is it even accessed?)"
Different GPU software works differently however ...
Typically a programmer will write a program for the GPU in a shader language such as HLSL, GLSL or OpenCL. The program written in C# will load the shader code and compile it, and then use functions in an API to send a job to the GPU and get the result back afterwards.
Take a look at FX Composer or RenderMonkey if you want some practice with shaders without having to worry about APIs.
If you are using HLSL, the rendering pipeline looks like this.
The vertex shader is responsible for taking points in 3D space and calculating their position in your 2D viewing field. (Not a big concern for you since you are working in 2D)
The pixel shader is responsible for applying shader effects to the pixels after the vertex shader is done.
OpenCL is a different story; it's geared towards general purpose GPU computing (i.e. not just graphics). It's more powerful and can be used for GPUs, DSPs, and building supercomputers.

WRT coding for the GPU, you can look at Cudafy.Net (it does OpenCL too, which is not tied to NVidia) to start getting an understanding of what's going on and perhaps even do everything you need there. I've quickly found it - and my graphics card - unsuitable for my needs, but for the Mandelbrot at the stage you're at, it should be fine.
In brief: You code for the GPU with a flavour of C (Cuda C or OpenCL normally) then push the "kernel" (your compiled C method) to the GPU followed by any source data, and then invoke that "kernel", often with parameters to say what data to use - or perhaps a few parameters to tell it where to place the results in its memory.
When I've been doing fractal rendering myself, I've avoided drawing to a bitmap for the reasons already outlined and deferred the render phase. Besides that, I tend to write massively multithreaded code which is really bad for trying to access a bitmap. Instead, I write to a common store - most recently I've used a MemoryMappedFile (a builtin .Net class) since that gives me pretty decent random access speed and a huge addressable area. I also tend to write my results to a queue and have another thread deal with committing the data to storage; the compute times of each Mandelbrot pixel will be "ragged" - that is to say that they will not always take the same length of time. As a result, your pixel commit could be the bottleneck for very low iteration counts. Farming it out to another thread means your compute threads are never waiting for storage to complete.
I'm currently playing with the Buddhabrot visualisation of the Mandelbrot set, looking at using a GPU to scale out the rendering (since it's taking a very long time with the CPU) and having a huge result-set. I was thinking of targetting an 8 gigapixel image, but I've come to the realisation that I need to diverge from the constraints of pixels, and possibly away from floating point arithmetic due to precision issues. I'm also going to have to buy some new hardware so I can interact with the GPU differently - different compute jobs will finish at different times (as per my iteration count comment earlier) so I can't just fire batches of threads and wait for them all to complete without potentially wasting a lot of time waiting for one particularly high iteration count out of the whole batch.
Another point that I hardly ever see being made about the Mandelbrot Set is that it is symmetrical about the real axis. If your view contains both sides of that axis, you might be doing up to twice as much calculating as you need to.

For moving the processing to the GPU, you have lots of excellent examples here:
https://www.shadertoy.com/results?query=mandelbrot
Note that you need a WebGL-capable browser to view that link. Works best in Chrome.
I'm no expert on fractals, but you seem to have come far already with the optimizations. Going beyond that may make the code much harder to read and maintain, so you should ask yourself whether it is worth it.
One technique I've often observed in other fractal programs is this: While zooming, calculate the fractal at a lower resolution and stretch it to full size during render. Then render at full resolution as soon as zooming stops.
Another suggestion is that when you use multiple threads, you should take care that each thread doesn't read/write memory close to what other threads are writing, because this will cause cache contention and hurt performance. One good algorithm could be to split the work up in scanlines (instead of four quarters like you do now). Create a number of threads, then as long as there are lines left to process, assign a scanline to a thread that is available. Let each thread write the pixel data to a local piece of memory and copy this back to the main bitmap after each line (to avoid cache contention).
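The scanline idea can be sketched roughly as follows. This is a minimal illustration, not the asker's code: it uses Parallel.For from the Task Parallel Library in place of hand-managed threads, returns iteration counts instead of writing to a bitmap, and the names MandelbrotRows, EscapeTime and Render are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

public static class MandelbrotRows // hypothetical name, for illustration only
{
    // Escape-time iteration count for the point c = (cRe, cIm).
    public static int EscapeTime(double cRe, double cIm, int maxIterations)
    {
        double zRe = 0, zIm = 0, zRe2 = 0, zIm2 = 0;
        int n = 0;
        while (zRe2 + zIm2 < 4 && n < maxIterations)
        {
            zIm = 2 * zRe * zIm + cIm;   // uses the old zRe
            zRe = zRe2 - zIm2 + cRe;     // uses the old squares
            zRe2 = zRe * zRe;
            zIm2 = zIm * zIm;
            n++;
        }
        return n;
    }

    // Computes one scanline per work item. Rows are handed out dynamically,
    // so a thread that finishes a cheap row simply picks up the next one.
    public static int[][] Render(int width, int height, double centerRe, double centerIm,
                                 double step, int maxIterations)
    {
        var rows = new int[height][];
        Parallel.For(0, height, y =>
        {
            var row = new int[width]; // local buffer, written by this work item only
            double cIm = (y - height / 2) * step + centerIm;
            for (int x = 0; x < width; x++)
            {
                double cRe = (x - width / 2) * step + centerRe;
                row[x] = EscapeTime(cRe, cIm, maxIterations);
            }
            rows[y] = row; // publish the finished line
        });
        return rows;
    }
}
```

Because expensive rows (those crossing the set) and cheap rows are interleaved across threads, this tends to balance the load better than four fixed quarters, where one quarter may contain most of the slow pixels.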

Norms, rules or guidelines for calculating and showing "ETA/ETC" for a process

ETC = "Estimated Time of Completion"
I'm counting the time it takes to run through a loop and showing the user some numbers that tells him/her how much time, approximately, the full process will take. I feel like this is a common thing that everyone does on occasion and I would like to know if you have any guidelines that you follow.
Here's an example I'm using at the moment:
int itemsLeft; //This holds the number of items to run through.
double timeLeft;
TimeSpan TsTimeLeft;
List<double> avrage = new List<double>();
//Despite its name, this holds the time in SECONDS each loop takes to complete
//(0.01 is added on every 10 ms tick), reset every loop.
double milliseconds;

//The background worker calls this event once for each item. The total number
//of items are in the hundreds for this particular application and every loop takes
//roughly one second.
private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    //An item has been completed!
    itemsLeft--;
    avrage.Add(milliseconds);
    //Get an average time per item and multiply it with items left.
    timeLeft = avrage.Sum() / avrage.Count * itemsLeft;
    TsTimeLeft = TimeSpan.FromSeconds(timeLeft);
    this.Text = String.Format("ETC: {0}:{1:D2}:{2:D2} ({3:N2}s/file)",
        TsTimeLeft.Hours,
        TsTimeLeft.Minutes,
        TsTimeLeft.Seconds,
        avrage.Sum() / avrage.Count);
    //Only using the last 20-30 logs in the calculation to prevent an unnecessarily long List<>.
    if (avrage.Count > 30)
        avrage.RemoveRange(0, 10);
    milliseconds = 0;
}

//this.profiler.Interval = 10;
private void profiler_Tick(object sender, EventArgs e)
{
    milliseconds += 0.01;
}
As I am a programmer at the very start of my career I'm curious to see what you would do in this situation. My main concern is the fact that I calculate and update the UI for every loop, is this bad practice?
Are there any do's/don't's when it comes to estimations like this? Are there any preferred ways of doing it, e.g. update every second, update every ten logs, calculate and update UI separately? Also when would an ETA/ETC be a good/bad idea.

The real problem with estimating the time taken by a process is the quantification of the workload. Once you can quantify that, you can make a better estimate.
Examples of good estimates
File system I/O or network transfer. Whether or not file systems have bad performance, you can know in advance the total number of bytes to be processed, and you can measure the speed. Once you have these, and once you can monitor how many bytes you have transferred, you get a good estimate. Random factors may affect your estimate (e.g. another application starts in the meantime), but you still get a meaningful value
Encryption on large streams. For the reasons above. Even if you are computing a MD5 hash, you always know how many blocks have been processed, how many are to be processed and the total.
Item synchronization. This is a little trickier. If you can assume that the per-unit workload is constant or you can make a good estimate of the time required to process an item when variance is low or insignificant, then you can make another good estimate of the process. Pick email synchronization: if you don't know the byte size of the messages (otherwise you fall in case 1) but common practice tells that the majority of emails have quite the same size, then you can use the mean of the time taken to download/upload all processed emails to estimate the time taken to process a single email. This won't work in 100% of the cases and is subject to error, but you still see progress bar progressing on a large account
In general the rule is that you can make a good estimate of ETC/ETA (ETA is actually the date and time the operation is expected to complete) if you have a homogeneous process whose numbers you know. Homogeneity guarantees that the time to process a work item is comparable to the others, i.e. the time taken to process a previous item can be used to estimate future ones. The numbers are used to make correct calculations.
Examples of bad estimates
Operations on a number of files of unknown size. This time you know only how many files you want to process (e.g. to download) but you don't know their size in advance. Once the size of the files has a high variance you run into trouble. Having downloaded half of the files, when these were the smallest and sum up to 10% of the total bytes, can you be said to be halfway? No! You just see the progress bar growing fast to 50% and then much more slowly
Heterogeneous processes. E.g. Windows installations. As pointed out by @HansPassant, Windows installations provide a worse-than-bad estimate. Installing Windows software involves several processes including: file copy (this can be estimated), registry modifications (usually never estimated), execution of transactional code. The real problem is the last. Transactional processes involving execution of custom installer code are discussed below
Execution of generic code. This can never be estimated. A code fragment involves conditional statements. The execution of these involve changing paths depending on a condition external to the code. This means, for example, that a program behaves differently whether you have a printer installed or not, whether you have a local or a domain account, etc.
Conclusions
Estimating the duration of a software process is neither an impossible nor an exact (deterministic) task.
It's not impossible because, even in the case of code fragments, you can either find a model for your code (pick a LU factorization as an example, this may be estimated). Or you might redesign your code splitting it into an estimation phase - where you first determine the branch conditions - and an execution phase, where all pre-determined branches are taken. I said might because this task is in practice impossible: most code determines branches as effects of previous conditions, meaning that estimating a branch actually involves running the code. Chicken and egg circle
It's not a deterministic process. Computer systems, especially multitasking ones, are affected by a number of random factors that may impact your estimated process. You will never get a correct estimate before running your process. At most, you can detect external factors and re-estimate your process. The gap between your estimate and the real duration of the process converges to zero as you get closer to the end of the process (lim [x->N] |est(x) - N| = 0, where est(x) is the estimate of the total duration made after x time units and N is the real process duration)
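For the homogeneous case (a known number of items with comparable per-item cost), the estimate boils down to a moving average of recent item durations multiplied by the items left. Here is a minimal sketch under that assumption; the class name EtaEstimator and the default window of 30 items are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;

public sealed class EtaEstimator // hypothetical name, for illustration only
{
    private readonly Queue<double> _recent = new Queue<double>();
    private readonly int _window;
    private double _sum; // running sum of the durations currently in the window, in seconds

    public EtaEstimator(int window = 30)
    {
        if (window <= 0) throw new ArgumentOutOfRangeException(nameof(window));
        _window = window;
    }

    // Call once per completed item with the time that item took.
    public void RecordItem(TimeSpan duration)
    {
        _recent.Enqueue(duration.TotalSeconds);
        _sum += duration.TotalSeconds;
        if (_recent.Count > _window) _sum -= _recent.Dequeue();
    }

    // Returns null until at least one item has been measured.
    public TimeSpan? Estimate(int itemsLeft)
    {
        if (_recent.Count == 0) return null;
        return TimeSpan.FromSeconds(_sum / _recent.Count * itemsLeft);
    }
}
```

The per-item duration can be measured with System.Diagnostics.Stopwatch around each item rather than with a 10 ms timer, and the UI only needs to read Estimate at whatever refresh rate is comfortable (for example once per second), independently of how often items complete.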

If your user interface is so obscure that you have to explain that ETC doesn't mean Etcetera, then you are doing it wrong. Every user understands what a progress bar does, so there is no need to help.
Nothing is quite as annoying as an inaccurate progress bar. Particularly ones that promise a quick finish but then don't deliver. I'd give the progress bar displayed by any installer on Windows as a good example of one that is fundamentally broken. Just not a shining example of an implementation that you should pursue.
Such a progress bar is broken because it is utterly impossible to guess up front how long it is going to take to install a program. File systems have very unpredictable perf. This is a very common problem with estimating execution time. Better UI models are the spinning dots you'd see in a video player and many programs in Windows 8. Or the marquee style supported by the common ProgressBar control. Just feedback that says "I'm not dead, working on it". Even the hour-glass cursor is better than a bad estimate. If you have something to report beyond a technicality that no user is really interested in then don't hesitate to display that. Like the number of files you've processed or the number of kilobytes you've downloaded. The actual value of the number isn't that useful, seeing the rate at which it increases is the interesting tidbit.

HLSL Computation - process pixels in order?

Imagine I want to, say, compute the first one million terms of the Fibonacci sequence using the GPU. (I realize this will exceed the precision limit of a 32-bit data type - just used as an example)
Given a GPU with 40 shaders/stream processors, and cheating by using a reference book, I can break up the million terms into 40 blocks of 25,000 terms, and seed each shader with the two start values:
unit 0: 1,1 (which then calculates 2,3,5,8,blah blah blah)
unit 1: 25,000th term
unit 2: 50,000th term
...
How, if possible, could I go about ensuring that pixels are processed in order? If the first few pixels in the input texture have values (with RGBA for simplicity)
0,0,0,1 // initial condition
0,0,0,1 // initial condition
0,0,0,2
0,0,0,3
0,0,0,5
...
How can I ensure that I don't try to calculate the 5th term before the first four are ready?
I realize this could be done in multiple passes by setting a "ready" bit whenever a value is calculated, but that seems incredibly inefficient and sort of eliminates the benefit of performing this type of calculation on the GPU.
OpenCL/CUDA/etc probably provide nice ways to do this, but I'm trying (for my own edification) to get this to work with XNA/HLSL.
Links or examples are appreciated.
Update/Simplification
Is it possible to write a shader that uses values from one pixel to influence the values from a neighboring pixel?

You cannot determine the order in which the pixels are processed. If you could, that would break the massive pixel throughput of the shader pipelines. What you can do is calculate the Fibonacci sequence using the non-recursive (closed-form) formula.
In your question, you are actually trying to serialize the shader units to run one after another. You can use the CPU right away and it will be much faster.
By the way, multiple passes aren't as slow as you might think, but they won't help you in your case. You cannot really calculate any next value without knowing the previous ones, thus killing any parallelization.
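The non-recursive formula is usually known as Binet's formula, F(n) = round(phi^n / sqrt(5)) with phi = (1 + sqrt(5)) / 2. Because each term depends only on its own index, every pixel could evaluate it independently. The sketch below shows the formula in C# rather than HLSL purely for illustration; the class name is hypothetical, and with double precision the rounded result should only be trusted for fairly small n (roughly n up to 70), so it does not solve the precision problem mentioned in the question.

```csharp
using System;

public static class Fibonacci // hypothetical name, for illustration only
{
    // Binet's closed form, using the indexing F(1) = 1, F(2) = 1.
    // No previous terms are needed, so any term can be computed on its own.
    public static double ClosedForm(int n)
    {
        double sqrt5 = Math.Sqrt(5.0);
        double phi = (1.0 + sqrt5) / 2.0;
        return Math.Round(Math.Pow(phi, n) / sqrt5);
    }
}
```

In a pixel shader the same expression would be evaluated from the pixel's own index, which removes the dependency on neighbouring pixels entirely, at the cost of floating point accuracy for large n.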

Double ended priority queue

I have a set of data and I want to find the biggest and smallest items (multiple times), what's the best way to do this?
For anyone interested in the application, I'm developing a level of detail system and I need to find the items with the biggest and smallest screen space error, obviously every time I subdivide/merge an item I have to insert it into the queue but every time the camera moves the entire dataset changes - so it might be best to just use a sorted list and defer adding new items until the next time I sort (since it happens so often)

You can use a Min-Max Heap as described in the paper Min-Max Heaps and Generalized Priority Queues:
"A simple implementation of double ended priority queues is presented. The proposed structure, called a min-max heap, can be built in linear time; in contrast to conventional heaps, it allows both FindMin and FindMax to be performed in constant time; Insert, DeleteMin, and DeleteMax operations can be performed in logarithmic time."
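If implementing a min-max heap is more than you need, a simpler stand-in in C# is a SortedSet, which exposes both Min and Max. This is a different structure from the one in the paper (a balanced tree rather than a heap, so finding the extremes is not constant time), and the class below is only a sketch with hypothetical names. It assumes each item has an integer id, which is stored next to the error so that two items with the same error are not treated as duplicates.

```csharp
using System;
using System.Collections.Generic;

public sealed class DoubleEndedQueue // hypothetical name, for illustration only
{
    // Tuples compare by Error first, then by Id, so equal errors can coexist.
    private readonly SortedSet<(double Error, int Id)> _items =
        new SortedSet<(double Error, int Id)>();

    public int Count
    {
        get { return _items.Count; }
    }

    public void Insert(double error, int id)
    {
        _items.Add((error, id));
    }

    // Removes and returns the item with the smallest error.
    public (double Error, int Id) DeleteMin()
    {
        if (_items.Count == 0) throw new InvalidOperationException("Queue is empty.");
        var min = _items.Min;
        _items.Remove(min);
        return min;
    }

    // Removes and returns the item with the largest error.
    public (double Error, int Id) DeleteMax()
    {
        if (_items.Count == 0) throw new InvalidOperationException("Queue is empty.");
        var max = _items.Max;
        _items.Remove(max);
        return max;
    }
}
```

If the whole dataset changes on every camera move, rebuilding this set each frame may cost more than sorting a list once, so it is worth measuring both against your actual update pattern.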
