Making C# Mandelbrot drawing more efficient

First of all, I am aware that this question really sounds as if I didn't search, but I did, a lot.
I wrote a small Mandelbrot drawing code for C#, it's basically a windows form with a PictureBox on which I draw the Mandelbrot set.
My problem is that it's pretty slow. Without a deep zoom it does a good job: moving around and zooming are smooth and take less than a second per drawing. But once I start to zoom in a little and reach places which require more calculations, it becomes really slow.
Other Mandelbrot applications run fine on my computer in places where my application is much slower, so I'm guessing there is a lot I can do to improve the speed.
I did the following things to optimize it:
Instead of using the SetPixel/GetPixel methods on the bitmap object, I used the LockBits method to write directly to memory, which made things a lot faster.
Instead of using complex number objects (with classes I made myself, not the built-in ones), I emulated complex numbers using 2 variables, re and im. Doing this allowed me to cut down on multiplications, because squaring the real part and the imaginary part is done a few times during the calculation, so I just save the square in a variable and reuse the result without recalculating it.
I use 4 threads to draw the Mandelbrot; each thread does a different quarter of the image and they all work simultaneously. As I understand it, that means my CPU will use 4 of its cores to draw the image.
I use the escape time algorithm, which as I understand it is the fastest?
Here is how I move between the pixels and calculate; it's commented, so I hope it's understandable:
//Pixel by pixel loop:
for (int r = rRes; r < wTo; r++)
{
    for (int i = iRes; i < hTo; i++)
    {
        //These calculations determine which complex number corresponds to the (r,i) pixel.
        double re = (r - (w / 2)) * step + zeroX;
        double im = (i - (h / 2)) * step - zeroY;
        //Create the Z complex number
        double zRe = 0;
        double zIm = 0;
        //Variables to store the squares of the real and imaginary parts.
        double multZre = 0;
        double multZim = 0;
        //Start iterating with the complex number to determine its escape time (mandelValue)
        int mandelValue = 0;
        while (multZre + multZim < 4 && mandelValue < iters)
        {
            /* The new real part equals re(z)^2 - im(z)^2 + re(c); we store it in a temp variable
             * tempRe because we still need re(z) in the next calculation.
             */
            double tempRe = multZre - multZim + re;
            /* The new imaginary part is equal to 2*re(z)*im(z) + im(c).
             * Instead of multiplying by 2, I add re(z) to itself and then multiply by im(z),
             * which means I do 1 multiplication instead of 2.
             */
            zRe += zRe;
            zIm = zRe * zIm + im;
            zRe = tempRe; // We can now put the temp value in its place.
            // Compute the squares now; they will be used in the next iteration.
            multZre = zRe * zRe;
            multZim = zIm * zIm;
            //Increase the mandelValue by one, because the iteration is now finished.
            mandelValue += 1;
        }
        //After the mandelValue is found, this colors its pixel accordingly (unsafe code, accesses memory directly):
        //(Unimportant for my question; I doubt the problem is here, because my code becomes really slow
        // as the number of ITERATIONS grows, while this only executes more as the number of pixels grows.)
        byte* pos = px + (i * str) + (pixelSize * r);
        byte col = (byte)((1 - ((double)mandelValue / iters)) * 255);
        pos[0] = col;
        pos[1] = col;
        pos[2] = col;
    }
}
What can I do to improve this? Do you find any obvious optimization problems in my code?
Right now there are 2 ways I know I can improve it:
I need to use a different type for numbers; double has limited accuracy, and I'm sure there are better non-built-in alternative types which are faster (they multiply and add faster) and have more accuracy. I just need someone to point me to where I should look and tell me if it's true.
I can move processing to the GPU. I have no idea how to do this (OpenGL maybe? DirectX? Is it even that simple, or will I need to learn a lot of stuff?). If someone can send me links to proper tutorials on this subject, or tell me about it in general, that would be great.
Thanks a lot for reading this far, and I hope you can help me :)

If you decide to move the processing to the GPU, you can choose from a number of options. Since you are using C#, XNA will allow you to use HLSL. RB Whitaker has the easiest XNA tutorials if you choose this option. Another option is OpenCL; OpenTK comes with a demo program of a Julia set fractal, which would be very simple to modify to display the Mandelbrot set. See here
Just remember to find the GLSL shader that goes with the source code.
About the GPU, examples are no help for me because I have absolutely no idea about this topic, how does it even work and what kind of calculations the GPU can do (or how is it even accessed?)
Different GPU software works differently however ...
Typically a programmer will write a program for the GPU in a shader language such as HLSL, GLSL or OpenCL. The program written in C# will load the shader code, compile it, and then use functions in an API to send a job to the GPU and get the result back afterwards.
Take a look at FX Composer or RenderMonkey if you want some practice with shaders without having to worry about APIs.
If you are using HLSL, the rendering pipeline looks like this.
The vertex shader is responsible for taking points in 3D space and calculating their position in your 2D viewing field. (Not a big concern for you since you are working in 2D)
The pixel shader is responsible for applying shader effects to the pixels after the vertex shader is done.
OpenCL is a different story; it's geared towards general-purpose GPU computing (i.e. not just graphics). It's more powerful and can be used for GPUs, DSPs, and building supercomputers.

WRT coding for the GPU, you can look at Cudafy.Net (it does OpenCL too, which is not tied to NVidia) to start getting an understanding of what's going on and perhaps even do everything you need there. I've quickly found it - and my graphics card - unsuitable for my needs, but for the Mandelbrot at the stage you're at, it should be fine.
In brief: You code for the GPU with a flavour of C (Cuda C or OpenCL normally) then push the "kernel" (your compiled C method) to the GPU followed by any source data, and then invoke that "kernel", often with parameters to say what data to use - or perhaps a few parameters to tell it where to place the results in its memory.
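To make that flow concrete, here is a rough sketch using Cudafy.Net, written from memory of its sample programs; exact method names and signatures may differ slightly between versions, and the kernel shown (a trivial per-element square) is only a stand-in for a real Mandelbrot kernel.

using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public static class GpuSketch
{
    [Cudafy]                                   // this method becomes the GPU "kernel"
    public static void Square(GThread thread, int[] input, int[] output, int n)
    {
        int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        if (i < n)
            output[i] = input[i] * input[i];
    }

    public static int[] Run(int[] host)
    {
        CudafyModule km = CudafyTranslator.Cudafy();              // compile the [Cudafy] methods
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);  // grab a GPU (or the emulator)
        gpu.LoadModule(km);

        int[] devIn = gpu.Allocate<int>(host);                    // device-side buffers
        int[] devOut = gpu.Allocate<int>(host.Length);
        gpu.CopyToDevice(host, devIn);                            // push source data to the GPU

        gpu.Launch(host.Length / 256 + 1, 256).Square(devIn, devOut, host.Length); // invoke the kernel

        int[] result = new int[host.Length];
        gpu.CopyFromDevice(devOut, result);                       // pull the results back
        return result;
    }
}

For a Mandelbrot kernel the idea is the same: each GPU thread computes the escape time for one pixel and writes it into the output array.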
When I've been doing fractal rendering myself, I've avoided drawing to a bitmap for the reasons already outlined and deferred the render phase. Besides that, I tend to write massively multithreaded code which is really bad for trying to access a bitmap. Instead, I write to a common store - most recently I've used a MemoryMappedFile (a builtin .Net class) since that gives me pretty decent random access speed and a huge addressable area. I also tend to write my results to a queue and have another thread deal with committing the data to storage; the compute times of each Mandelbrot pixel will be "ragged" - that is to say that they will not always take the same length of time. As a result, your pixel commit could be the bottleneck for very low iteration counts. Farming it out to another thread means your compute threads are never waiting for storage to complete.
I'm currently playing with the Buddhabrot visualisation of the Mandelbrot set, looking at using a GPU to scale out the rendering (since it's taking a very long time with the CPU) and having a huge result-set. I was thinking of targeting an 8 gigapixel image, but I've come to the realisation that I need to diverge from the constraints of pixels, and possibly away from floating point arithmetic due to precision issues. I'm also going to have to buy some new hardware so I can interact with the GPU differently - different compute jobs will finish at different times (as per my iteration count comment earlier) so I can't just fire batches of threads and wait for them all to complete without potentially wasting a lot of time waiting for one particularly high iteration count out of the whole batch.
Another point to make that I hardly ever see being made about the Mandelbrot Set is that it is symmetrical. You might be doing twice as much calculating as you need to.
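As a sketch of how that could look in your setup (MandelValue and PlotPixel are hypothetical stand-ins for your existing escape-time loop and unsafe pixel write): when the viewport is vertically centred on the real axis, c and its conjugate have the same escape time, so only the top half needs computing.

// Only valid while the viewport is symmetric about the real axis.
for (int i = 0; i < h / 2; i++)              // top half of the image only
{
    for (int r = 0; r < w; r++)
    {
        int value = MandelValue(r, i);       // hypothetical: your existing escape-time loop
        PlotPixel(r, i, value);              // hypothetical: your existing unsafe pixel write
        PlotPixel(r, h - 1 - i, value);      // mirror row below the real axis
    }
}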

For moving the processing to the GPU, you have lots of excellent examples here:
https://www.shadertoy.com/results?query=mandelbrot
Note that you need a WebGL-capable browser to view that link; it works best in Chrome.
I'm no expert on fractals, but you seem to have come far already with the optimizations. Going beyond that may make the code much harder to read and maintain, so you should ask yourself whether it is worth it.
One technique I've often observed in other fractal programs is this: While zooming, calculate the fractal at a lower resolution and stretch it to full size during render. Then render at full resolution as soon as zooming stops.
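A minimal sketch of that idea with GDI+, assuming a hypothetical RenderMandelbrot(width, height) wrapping your existing renderer and a fullSizeBitmap you display in the PictureBox:

using System.Drawing;
using System.Drawing.Drawing2D;

// While zooming: render at a quarter of the pixel count and stretch it.
Bitmap preview = RenderMandelbrot(width / 2, height / 2);     // hypothetical helper
using (Graphics g = Graphics.FromImage(fullSizeBitmap))
{
    g.InterpolationMode = InterpolationMode.NearestNeighbor;  // cheap stretch, no smoothing
    g.DrawImage(preview, 0, 0, width, height);
}
// When zooming stops: call RenderMandelbrot(width, height) for the full-resolution image.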
Another suggestion: when you use multiple threads, take care that each thread doesn't read/write memory belonging to other threads, because this will cause cache collisions and hurt performance. One good algorithm could be to split the work up into scanlines (instead of four quarters as you do now). Create a number of threads; then, as long as there are lines left to process, assign a scanline to a thread that is available. Let each thread write the pixel data to a local piece of memory and copy this back to the main bitmap after each line (to avoid cache collisions).
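A rough sketch of that scanline scheme, assuming a hypothetical ComputeScanline(y, rowBuffer) that runs the escape-time loop for one row, and scan0/stride taken from the LockBits BitmapData; Parallel.For plays the role of "give a line to whichever thread is free":

using System;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

Parallel.For(0, height, y =>
{
    byte[] rowBuffer = new byte[stride];                  // thread-local line, nothing shared
    ComputeScanline(y, rowBuffer);                        // hypothetical per-row escape-time loop
    // Each thread writes a disjoint row of the locked bitmap, one copy per line.
    Marshal.Copy(rowBuffer, 0, IntPtr.Add(scan0, y * stride), stride);
});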

Related

Full CPU usage for Parallel.For loops

I am writing a WPF application that processes an image data stream from an IR camera. The application uses a class library for processing steps such as rescaling or colorizing, which I am also writing myself. An image processing step looks something like this:
void ProcessFrame(double[,] frame)
{
    int width = frame.GetLength(1);
    int height = frame.GetLength(0);
    byte[,] result = new byte[height, width];
    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
            ManipulatePixel(frame[row, col]);
    });
}
Frames are processed by a task that runs in the background. The issue is that, depending on how costly the specific processing algorithm is (ManipulatePixel()), the application can't keep up with the camera's frame rate any more. However, I have noticed that despite me using parallel for loops, the application simply won't use all of the CPU that is available; the Task Manager performance tab shows about 60-80% CPU usage.
I have used the same processing algorithms in C++ before, using the concurrency::parallel_for loops from the Parallel Patterns Library. The C++ code uses all of the CPU it can get, as I would expect, and I also tried PInvoking a C++ DLL from my C# code, doing the same algorithm that runs slowly in the C# library; it also uses all the CPU power available, CPU usage is right at 100% virtually the whole time, and there is no trouble at all keeping up with the camera.
Outsourcing the code into a C++ DLL and then marshalling it back into C# is an extra hassle I'd of course rather avoid. How do I make my C# code actually make use of all the CPU potential? I have tried increasing process priority like this:
using (Process process = Process.GetCurrentProcess())
process.PriorityClass = ProcessPriorityClass.RealTime;
Which has an effect, but only a very small one. I also tried setting the degree of parallelism for the Parallel.For() loops like this:
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
and then passing that to the Parallel.For() loop, this had no effect at all but I suppose that's not surprising since the default settings should already be optimized. I also tried setting this in the application configuration:
<runtime>
<Thread_UseAllCpuGroups enabled="true"></Thread_UseAllCpuGroups>
<GCCpuGroup enabled="true"></GCCpuGroup>
<gcServer enabled="true"></gcServer>
</runtime>
but this actually makes it run even slower.
EDIT:
The ProcessFrame code block I quoted originally was actually not quite correct. What I was doing at the time was:
void ProcessFrame(double[,] frame)
{
    byte[,] result = new byte[frame.GetLength(0), frame.GetLength(1)];
    Parallel.For(0, frame.GetLength(0), row =>
    {
        for (var col = 0; col < frame.GetLength(1); ++col)
            ManipulatePixel(frame[row, col]);
    });
}
Sorry for this; I was paraphrasing code at the time and I didn't realize that this is an actual pitfall that produces different results. I have since changed the code to what I originally wrote (i.e. the width and height variables set at the beginning of the function, and the array's length properties queried only once each instead of in the for loop's conditional statements). Thank you @Seabizkit, your second comment inspired me to try this. The change in fact already makes the code run noticeably faster; I didn't realize this because C++ doesn't have 2D arrays, so I had to pass the pixel dimensions as separate arguments anyway. Whether it is fast enough as it is, I cannot say yet.
Also thank you for the other answers, they contain a lot of things I don't know yet but it's great to know what I have to look for. I'll update once I reached a satisfactory result.
I would need to have all of your code and be able to run it locally in order to diagnose the problem, because your posting is devoid of details (I would need to see inside your ManipulatePixel function, as well as the code that calls ProcessFrame), but here are some general tips that apply in your case.
2D arrays in .NET are significantly slower than 1D arrays and jagged arrays, even in .NET Core today - this is a longstanding bug.
See here:
https://github.com/dotnet/coreclr/issues/4059
Why are multi-dimensional arrays in .NET slower than normal arrays?
Multi-dimensional array vs. One-dimensional
So consider changing your code to use either a jagged array (which also helps with memory locality/proximity caching, as each thread would have its own private buffer) or a 1D array with your own code being responsible for bounds-checking.
Or better-yet: use stackalloc to manage the buffer's lifetime and pass that by-pointer (unsafe ahoy!) to your thread delegate.
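A minimal sketch of the plain 1D-array option, with illustrative names (the stackalloc/pointer variant would look similar, just with a byte* instead of the managed array):

int width = 640, height = 480;
double[] frame = new double[width * height];        // row-major, one contiguous block

Parallel.For(0, height, row =>
{
    int rowStart = row * width;                      // computed once per row
    for (int col = 0; col < width; col++)
    {
        double sample = frame[rowStart + col];       // no multi-dimensional bounds checks
        // ... ManipulatePixel(sample) ...
    }
});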
Sharing memory buffers between threads makes it harder for the system to optimize safe memory accesses.
Avoid allocating a new buffer for each frame encountered - if a frame has a limited lifespan then consider using reusable buffers using a buffer-pool.
Consider using the SIMD and AVX features in .NET. While modern C/C++ compilers are smart enough to compile code to use those instructions, the .NET JIT isn't so hot - but you can make explicit calls into SIMD/AVX instructions using the SIMD-enabled types (you'll need to use .NET Core 2.0 or later for the best accelerated functionality).
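As an illustration only (not your actual pipeline), here is a clamp over raw ushort samples using System.Numerics.Vector<T>; hardware acceleration requires RyuJIT (.NET 4.6+ or .NET Core):

using System;
using System.Numerics;

static void ClampInPlace(ushort[] data, ushort min, ushort max)
{
    var vMin = new Vector<ushort>(min);
    var vMax = new Vector<ushort>(max);
    int i = 0;
    // Process Vector<ushort>.Count samples per iteration (typically 8 or 16).
    for (; i <= data.Length - Vector<ushort>.Count; i += Vector<ushort>.Count)
    {
        var v = new Vector<ushort>(data, i);
        Vector.Min(Vector.Max(v, vMin), vMax).CopyTo(data, i);
    }
    for (; i < data.Length; i++)                     // scalar tail
        data[i] = Math.Min(Math.Max(data[i], min), max);
}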
Also, avoid copying individual bytes or scalar values inside a for loop in C#, instead consider using Buffer.BlockCopy for bulk copy operations (as these can use hardware memory copy features).
Regarding your observation of "80% CPU usage": if you have a loop in a program then that will cause 100% CPU usage within the time-slices provided by the operating system. If you don't see 100% usage, then either:
Your code is actually running faster than real-time (this is a good thing!), unless you're certain your program can't keep up with the input, or
Your code's thread (or threads) is blocked by something, such as a blocking IO call or a misplaced Thread.Sleep. Use tools like ETW to see what your process is doing when you think it should be CPU-bound.
Ensure you aren't using any lock (Monitor) calls or using other thread or memory synchronization primitives.
Efficiency matters (it is not a true-[PARALLEL] problem, but it may, yet need not, benefit from "just"-[CONCURRENT] work)
The BEST, yet rather hard, way, if ultimate performance is a MUST:
In-line assembly, optimised per the cache-line sizes in the CPU hierarchy, and keep indexing that follows the actual memory layout of the 2D data { column-wise | row-wise }. Given there is no 2D-kernel transformation mentioned, your process does not need to "touch" any topological neighbours; the indexing can step in whatever order "across" both ranges of the 2D domain, and ManipulatePixel() may get more efficient transforming blocks of pixels rather than bearing all the overheads of calling a function for each isolated, atomised 1px (ILP + cache-efficiency are on your side).
Given your target production-platform CPU family, best use the (block-SIMD)-vectorised instructions available from AVX2, or better AVX-512, code. As you most probably know, you may use C/C++ with AVX intrinsics for performance optimisation and assembly inspection, and finally "copy" the best resulting assembly into your C# assembly in-lining. Nothing will run faster. Tricks with CPU-core affinity mapping and eviction/reservation are indeed a last resort, yet may help in almost hard-real-time production settings (though hard R/T systems are seldom developed in an ecosystem with non-deterministic behaviour).
A CHEAP, few-seconds step:
Test and benchmark the run-time per batch of frames of a reversed composition: move the more-"expensive" part, the Parallel.For(...{...}), inside the for(var col = 0; col < width; ++col){...} to see the change in the costs of instantiating the Parallel.For() instrumentation.
Next, if going this cheap way, think about re-factoring ManipulatePixel() to work on at least a block of data, aligned with the data-storage layout and a multiple of the cache-line length (for cache hits, memory access costs improve to ~ 0.5 ~ 5 [ns], versus ~ 100 ~ 380 [ns] otherwise). Here, a will to distribute the work (the worse per 1px) across all NUMA-CPU-cores will result in paying way more time, due to extended access latencies for cross-NUMA (non-local) memory addresses; and besides never re-using an expensively cached block of fetched data, you knowingly pay excessive costs for cross-NUMA (non-local) memory fetches (from which you "use" just 1px and "throw" away all the rest of the cached block, as those pixels will get re-fetched and manipulated on some other CPU core at some other time ~ a triple waste of time ~ sorry to have mentioned that explicitly, but when shaving off each possible [ns] this cannot happen in a production pipeline).
Anyway, let me wish you perseverance and good luck on your steps forwards to gain the needed efficiency back onto your side.
Here's what I ended up doing, mostly based on Dai's answer:
made sure to query image pixel dimensions once at the beginning of the processing functions, not within the for loop's conditional statement. With parallel loops, it seems this creates competing access to those properties from multiple threads, which noticeably slows things down.
removed allocation of output buffers within the processing functions. They now return void and accept the output buffer as an argument. The caller creates one buffer for each image processing step (filtering, scaling, colorizing) only, which doesn't change in size but gets overwritten with each frame.
removed an extra data processing step where raw image data in the format ushort (what the camera originally spits out) was converted to double (actual temperature values). Instead, processing is applied to the raw data directly. Conversion to actual temperatures will be dealt with later, as necessary.
I also tried, without success, to use 1D arrays instead of 2D but there is actually no difference in performance. I don't know if it's because the bug Dai mentioned was fixed in the meantime, but I couldn't confirm 2D arrays to be any slower than 1D arrays.
Probably also worth mentioning, the ManipulatePixel() function in my original post was actually more of a placeholder rather than a real call to another function. Here's a more proper example of what I am doing to a frame, including the changes I made:
private static void Rescale(ushort[,] originalImg, byte[,] scaledImg, in (ushort, ushort) limits)
{
    Debug.Assert(originalImg != null);
    Debug.Assert(originalImg.Length != 0);
    Debug.Assert(scaledImg != null);
    Debug.Assert(scaledImg.Length == originalImg.Length);
    ushort min = limits.Item1;
    ushort max = limits.Item2;
    int width = originalImg.GetLength(1);
    int height = originalImg.GetLength(0);
    Parallel.For(0, height, row =>
    {
        for (var col = 0; col < width; ++col)
        {
            ushort value = originalImg[row, col];
            if (value < min)
                scaledImg[row, col] = 0;
            else if (value > max)
                scaledImg[row, col] = 255;
            else
                scaledImg[row, col] = (byte)(255.0 * (value - min) / (max - min));
        }
    });
}
This is just one step and some others are much more complex but the approach would be similar.
Some of the things mentioned, like SIMD/AVX or user3666197's answer, are unfortunately well beyond my abilities right now, so I couldn't test them out.
It's still relatively easy to put enough processing load into the stream to tank the frame rate, but for my application the performance should be enough now. Thanks to everyone who provided input; I'll mark Dai's answer as accepted because I found it the most helpful.

Possible Rendering Performance Optimizations

I was doing some benchmarking today using C# and OpenTK, just to see how much I could actually render before the framerate dropped. The numbers I got were pretty astronomical, and I am quite happy with the outcome of my tests.
In my project I am loading the blender monkey, which is 968 triangles. I then instance it and render it 100 times. This means that I am rendering 96,800 triangles per frame. This number far exceeds anything that I would need to render during any given scene in my game. And after this I pushed it even further and rendered 2000 monkeys at varying locations. I was now rendering a whopping 1,936,000 (almost 2 million triangles per frame) and the framerate was still locked at 60 frames per second! That number just blew my mind. I pushed it even further and finally the framerate started to drop, but this just means that the limit is roughly 4 million triangles per frame with instancing.
I was just wondering though, because I am using some legacy OpenGL, if this could still be pushed even further—or should I even bother?
For my tests I load the blender monkey model, store it into a display list using the deprecated calls like:
modelMeshID = MeshGenerator.Generate( delegate {
    GL.Begin( PrimitiveType.Triangles );
    foreach( Face f in model.Faces ) {
        foreach( ModelVertex p in f.Points ) {
            Vector3 v = model.Vertices[ p.Vertex ];
            Vector3 n = model.Normals[ p.Normal ];
            Vector2 tc = model.TexCoords[ p.TexCoord ];
            GL.Normal3( n.X , n.Y , n.Z );
            GL.TexCoord2( tc.Y , tc.X );
            GL.Vertex3( v.X , v.Y , v.Z );
        }
    }
    GL.End();
} );
and then call that list x amount of times. My question, though, is whether I could speed this up if I threw VAOs (Vertex Array Objects) into the display list instead of the old GL.Vertex3 API? Would this affect performance at all? Or would it produce the same outcome as the display list?
Here is a screen grab of a few thousand:
My system specs:
CPU: AMD Athlon IIx4(quad core) 620 2.60 GHz
Graphics Card: AMD Radeon HD 6800
My question, though, is whether I could speed this up if I threw VAOs (Vertex Array Objects) into the display list instead of the old GL.Vertex3 API? Would this affect performance at all? Or would it produce the same outcome as the display list?
No.
The main problem you're going to run into is that display lists and vertex arrays don't go well with each other. Using buffer objects they kind of work, but display lists themselves are legacy, like the immediate-mode drawing API.
However, even if you manage to get the VBO drawing from within a display list right, there'll be hardly an improvement: when compiling the display list, the OpenGL driver knows that everything that is arriving will be "frozen" eventually. This allows for some very aggressive internal optimization; all the geometry data will be packed up into a buffer object on the GPU, and state changes are coalesced. AMD is not quite as good at this game as NVidia, but they're not bad either; display lists are heavily used in CAD applications, and before ATI addressed the entertainment market they were focused on CAD, so their display list implementation is not bad at all. If you pack all the relevant state changes required for a particular drawing call into the display list, then when calling the display list you'll likely drop into the fast path.
I pushed it even further and finally the framerate started to drop, but this just means that the limit is roughly 4 million triangles per frame with instancing.
What's actually limiting you there is the overhead on calling the display list. I suggest you add a little bit more geometry into the DL and try again.
Display Lists are shockingly efficient. That they got removed from modern OpenGL is mostly because they can be effectively used only with the immediate mode drawing commands. Also recent things like transform feedback and conditional rendering would have been very difficult to integrate into display lists. So they got removed; and rightfully so, because Display Lists are kind of awkward to work with.
Now if you look at Vulkan the essential idea is to set up as much of the drawing commands (state changes, resource bindings and so on) upfront in command buffers and reuse those for varying data. This is like if you could create multiple display lists and have them make babies.
Using vertex lists with Begin and End causes the monkey geometry to be sent to the GPU every iteration, going through PCI-E, which is the slowest memory interface you have during rendering. Also, depending on your GL implementation, every call to GL can have more or less overhead of its own. If you used buffer objects, all that overhead would be gone, because you only send the monkey over once and then all you need is a draw call every iteration.
However, the monkey geometry is tiny (just a few kB), so sending it over the PCI-E bus (at something like 16 GB/s), plus the few hundred iterations of the "geometry loop", would not even take a millisecond. And even that will not touch your frame rate because, unless you are explicitly synchronizing, it will be completely absorbed by pipelining: the copying and the draw call will run while the GPU is still busy rendering the previous frame. By the time the GPU starts rendering the next frame, the data is already there.
That is why I am guessing that, given you have a fairly optimized GL implementation (good drivers), using buffer objects would not yield any speed-up. Note that in the face of bigger and more complex geometry and rendering operations, buffer objects will of course become crucial to performance. Small buffers might even stay cached on chip between draw calls.
Nevertheless, as a serious speed-freak, you definitely want to double-check and verify these sorts of guesstimates :)
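For reference, a rough OpenTK sketch of the buffer-object path using fixed-function client arrays (matching the legacy style above); positions is an assumed Vector3[] of the monkey's vertex positions, and exact GL.BufferData overloads vary a little between OpenTK versions:

// One-time setup: upload the geometry once.
int vbo = GL.GenBuffer();
GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
GL.BufferData(BufferTarget.ArrayBuffer,
              (IntPtr)(positions.Length * Vector3.SizeInBytes),
              positions, BufferUsageHint.StaticDraw);

// Per frame, per instance: no geometry crosses PCI-E, just a draw call.
GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
GL.EnableClientState(ArrayCap.VertexArray);
GL.VertexPointer(3, VertexPointerType.Float, 0, IntPtr.Zero);
GL.DrawArrays(PrimitiveType.Triangles, 0, positions.Length);
GL.DisableClientState(ArrayCap.VertexArray);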

Drawing signal with a lot of samples

I need to display a set of signals. Each signal is defined by millions of samples. Just processing the collection (for converting samples to points according to bitmap size) of samples takes a significant amount of time (especially during scrolling).
So I implemented some kind of downsampling. I just skip some points: take every 2nd, every 3rd, or every 50th point, depending on signal characteristics. It increases speed very much but significantly distorts the signal's shape.
Are there any smarter approaches?
We've had a similar issue in a recent application. Our visualization (a simple line graph) became too cluttered when zoomed out to see the full extent of the data (about 7 days of samples with a sample taken every 6 seconds, more or less), so down-sampling was actually the way to go. If we hadn't done that, zooming out wouldn't have had much meaning, as all you would see would be a big blob of lines smeared over the screen.
It all depends on how you are going to implement the down-sampling. There's two (simple) approaches: down-sample at the moment you get your sample or down-sample at display time.
What really gives a huge performance boost in both of these cases is the proper selection of your data-sources.
Let's say you have 7 million samples, and your viewing window is just interested in the last million points. If your implementation depends on an IEnumerable, this means that the IEnumerable will have to MoveNext 6 million times before actually starting. However, if you're using something which is optimized for random reads (a List comes to mind), you can implement your own enumerator for that, more or less like this:
public IEnumerator<T> GetEnumerator(int start, int count, int skip)
{
    // assume we have a field in the class which contains the data as a List<T>, named _data
    for (int i = start; i < start + count && i < _data.Count; i += skip)
    {
        yield return _data[i];
    }
}
Obviously this is a very naive implementation, but you can do whatever you want within the for loop (use an algorithm based on the surrounding samples to average?). However, this approach will usually smooth out any extreme spikes in your signal, so be wary of that.
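For example, a variant that averages each bucket instead of skipping samples outright (still naive, and it will also flatten spikes, just less abruptly):

public IEnumerable<double> Downsample(IReadOnlyList<double> data, int start, int count, int skip)
{
    int end = Math.Min(start + count, data.Count);
    for (int i = start; i < end; i += skip)
    {
        double sum = 0;
        int n = Math.Min(skip, end - i);
        for (int j = 0; j < n; j++)          // average the samples inside this bucket
            sum += data[i + j];
        yield return sum / n;                // one value per bucket
    }
}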
Another approach would be to create some generalized versions of your dataset for different ranges, which update themselves whenever you receive a new signal. You usually don't need to update the complete dataset; just updating the end of your set is probably good enough. This allows you to do a bit more advanced processing of your data, but it will cost more memory. You will have to cache the distinct 'layers' of detail in your application.
However, reading your (short) explanation, I think a display-time optimization might be good enough. You will always get a distortion in your signal if you generalize. You always lose data. It's up to the algorithm you choose how this distortion occurs, and how noticeable it will be.
You need a better sampling algorithm; you can also employ the parallel processing features of C#. Refer to the Task Parallel Library.

HLSL Computation - process pixels in order?

Imagine I want to, say, compute the first one million terms of the Fibonacci sequence using the GPU. (I realize this will exceed the precision limit of a 32-bit data type - just used as an example)
Given a GPU with 40 shaders/stream processors, and cheating by using a reference book, I can break up the million terms into 40 blocks of 25,000 terms, and seed each shader with the two start values:
unit 0: 1,1 (which then calculates 2,3,5,8, blah blah blah)
unit 1: 25,000th term
unit 2: 50,000th term
...
How, if possible, could I go about ensuring that pixels are processed in order? If the first few pixels in the input texture have values (with RGBA for simplicity)
0,0,0,1 // initial condition
0,0,0,1 // initial condition
0,0,0,2
0,0,0,3
0,0,0,5
...
How can I ensure that I don't try to calculate the 5th term before the first four are ready?
I realize this could be done in multiple passes by setting a "ready" bit whenever a value is calculated, but that seems incredibly inefficient and sort of eliminates the benefit of performing this type of calculation on the GPU.
OpenCL/CUDA/etc probably provide nice ways to do this, but I'm trying (for my own edification) to get this to work with XNA/HLSL.
Links or examples are appreciated.
Update/Simplification
Is it possible to write a shader that uses values from one pixel to influence the values from a neighboring pixel?
You cannot determine the order the pixels are processed. If you could, that would break the massive pixel throughput of the shader pipelines. What you can do is calculating the Fibonacci sequence using the non-recursive formula.
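For reference, the non-recursive (closed-form, Binet) formula looks like the sketch below; note that double precision only gives exact results up to roughly F(70), so for a million terms you would need arbitrary-precision arithmetic anyway.

// F(n) = (phi^n - psi^n) / sqrt(5), with phi = (1+sqrt(5))/2 and psi = (1-sqrt(5))/2
static long Fibonacci(int n)
{
    double sqrt5 = Math.Sqrt(5.0);
    double phi = (1.0 + sqrt5) / 2.0;
    double psi = (1.0 - sqrt5) / 2.0;
    return (long)Math.Round((Math.Pow(phi, n) - Math.Pow(psi, n)) / sqrt5);
}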
In your question, you are actually trying to serialize the shader units to run one after another. You can use the CPU right away and it will be much faster.
By the way, multiple passes aren't as slow as you might think, but they won't help you in your case. You cannot really calculate any next value without knowing the previous ones, thus killing any parallelization.

Fast sub-pixel laser dot detection

I am using XNA to build a project where I can draw "graffiti" on my wall using an LCD projector and a monochrome camera that is filtered to see only hand held laser dot pointers. I want to use any number of laser pointers -- don't really care about differentiating them at this point.
The wall is 10' x 10', and the camera is only 640x480 so I'm attempting to use sub-pixel measurement using a spline curve as outlined here: tpub.com
The camera runs at 120 fps (8-bit), so my question to you all is: what is the fastest way to find that subpixel laser dot center? Currently I'm using a brute-force 2D search to find the brightest pixel in the image (0 - 254) before doing the spline interpolation. That method is not very fast, and each frame takes longer to compute than the interval at which frames come in.
Edit: To clarify, in the end my camera data is represented by a 2D array of bytes indicating pixel brightness.
What I'd like to do is use an XNA shader to crunch the image for me. Is that practical? From what I understand, there really isn't a way to keep persistent variables in a Pixel Shader such as running totals, averages, etc.
But for argument's sake, let's say I found the brightest pixels using brute force, then stored them and their neighboring pixels for the spline curve into X number of vertices using texcoords. Is it practical then to use HLSL to compute a spline curve using texcoords?
I am also open to suggestions outside of my XNA box, be it DX10/DX11, maybe some sort of FPGA, etc. I just don't really have much experience with ways of crunching data like this. I figure if they can do something like this on a Wii-Mote using 2 AA batteries, then I'm probably going about this the wrong way.
Any ideas?
If by brute-forcing you mean looking at every pixel independently, it is basically the only way of doing it. You will have to scan through all the image's pixels, no matter what you want to do with the image. Although you might not need to find the brightest pixels, you can filter the image by color (e.g. if you're using a red laser). This is easily done using an HSV color-coded image. If you are looking for faster algorithms, try OpenCV. It's been optimized again and again for image processing, and you can use it in C# via a wrapper:
http://www.codeproject.com/KB/cs/Intel_OpenCV.aspx
OpenCV can also help you easily find the point centers and track each point.
Is there a reason you are using a 120 fps camera? You know the human eye can only see about 30 fps, right? I'm guessing it's to follow very fast laser movements... You might want to consider bringing it down, because real-time processing at 120 fps will be very hard to achieve.
Running through 640*480 bytes to find the highest byte should run within a millisecond, even on slow processors. No need to take the route of shaders.
I would advise optimizing your loop.
For instance, this is really slow (because it does a multiplication with every array lookup):
byte highest = 0;
int foundX = -1, foundY = -1;
for (int y = 0; y < 480; y++)
{
    for (int x = 0; x < 640; x++)
    {
        if (myBytes[x][y] > highest)
        {
            highest = myBytes[x][y];
            foundX = x;
            foundY = y;
        }
    }
}
this is much faster:
byte[] myBytes = new byte[640 * 480];
// fill it with your image
byte highest = 0;
int found = -1, foundX = -1, foundY = -1;
int len = 640 * 480;
for (int i = 0; i < len; i++)
{
    if (myBytes[i] > highest)
    {
        highest = myBytes[i];
        found = i;
    }
}
if (found != -1)
{
    foundX = found % 640;
    foundY = found / 640;
}
This is off the top of my head so sorry for errors ;^)
You're dealing with some pretty complex maths if you want sub-pixel accuracy. I think this paper is something to consider. Unfortunately, you'll have to pay to see it using that site. If you've got access to a suitable library, they may be able to get hold of it for you.
The link in the original post suggested doing 1000 spline calculations for each axis; it treated x and y independently, which is OK for circular images but is a bit off if the image is a skewed ellipse. You could use the following to get a reasonable estimate:
xc = sum(xn * f(xn)) / sum(f(xn))
where xc is the mean, xn is a point along the x-axis and f(xn) is the value at the point xn. So for this:
          *
       *  *
       *  *
       *  *
       *  *
       *  *
       *  *  *
    *  *  *  *
    *  *  *  *
 *  *  *  *  *  *
------------------
 2  3  4  5  6  7
gives:
sum (xn * f(xn)) = 2 * 1 + 3 * 3 + 4 * 9 + 5 * 10 + 6 * 4 + 7 * 1
sum (f(xn)) = 1 + 3 + 9 + 10 + 4 + 1
xc = 128 / 28 = 4.57
and repeat for the y-axis.
Brute-force is the only real way, however your idea of using a shader is good - you'd be offloading the brute-force check from the CPU, which can only look at a small number of pixels simultaneously (roughly 1 per core), to the GPU, which likely has 100+ dumb cores (pipelines) that can simultaneously compare pixels (your algorithm may need to be modified a bit to work well with the 1 instruction-many cores arrangement of a GPU).
The biggest issue I see is whether or not you can move that data to the GPU fast enough.
Another optimization to consider: if you're drawing, then the current location of the pointer is probably close to the last location of the pointer. Remember the last recorded position of the pointer between frames, and only scan a region close to that position... say a 1'x1' area. Only if the pointer isn't found in that area should you scan the whole surface.
Obviously, there will be a tradeoff between how quickly your program can scan, and how quickly you'll be able to move your mouse before the camera "loses" the pointer and has to go to the slow, full-image scan. A little experimentation will probably reveal the optimum value.
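A sketch of that tracking idea, where ScanRegion is a hypothetical brightest-pixel search over a sub-rectangle of the row-major byte frame:

Point? FindDot(byte[] frame, int width, int height, Point? lastPos, int radius)
{
    if (lastPos.HasValue)
    {
        int x0 = Math.Max(0, lastPos.Value.X - radius);
        int y0 = Math.Max(0, lastPos.Value.Y - radius);
        int x1 = Math.Min(width, lastPos.Value.X + radius);
        int y1 = Math.Min(height, lastPos.Value.Y + radius);
        Point? hit = ScanRegion(frame, width, x0, y0, x1, y1);   // hypothetical helper
        if (hit.HasValue)
            return hit;                                          // found near the last position
    }
    return ScanRegion(frame, width, 0, 0, width, height);        // fall back to a full scan
}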
Cool project, by the way.
Put the camera slightly out of focus and bitblt against a neutral sample. You can quickly scan rows for non-zero values. Also, if you are at 8 bits and pick up 4 bytes at a time, you can process the image faster. As others pointed out, you might reduce the frame rate. If you have less fidelity than the resulting image, there isn't much point in the high scan rate.
(The slightly out-of-focus camera will help get just the brightest points and reduce false positives if you have a busy surface... of course assuming you are not shooting a smooth/flat surface.)
Start with a black output buffer. Forget about subpixel for now. Every frame, every pixel, do this:
outbuff=max(outbuff,inbuff);
Do subpixel filtering to a third "clean" buffer when you're done with the image. Or do a chunk or a line of the screen at a time in real time. Advantage: real-time "rough" view of the drawing, cleaned up as you go.
When you convert from the rough output buffer to the "clean" third buffer, you can clear the rough to black. This lets you keep drawing forever without slowing down.
By drawing the "clean" over top of the "rough", maybe in a slightly different color, you'll have the best of both worlds.
This is similar to what paint programs do--if you draw really fast, you see a rough version, then the paint program "cleans up" the image when it has time.
Some comments on the algorithm:
I've seen a lot of cheats in this arena. I've played Sonic on a Sega Genesis emulator that upsamples, and it has some pretty wild algorithms that work very well and are very fast.
You actually have some advantages you can gain because you might know the brightness and the radius of the dot.
You might just look at each pixel and its 8 neighbors and let those 9 pixels "vote" according to their brightness for where the subpixel lies.
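A sketch of that 3x3 "voting" as a brightness-weighted centroid; here img is assumed to be indexed [y, x], and (px, py) is the brightest pixel found by the brute-force scan, assumed not to lie on the image border:

static void SubPixelCentre(byte[,] img, int px, int py, out double cx, out double cy)
{
    double sum = 0, sx = 0, sy = 0;
    for (int dy = -1; dy <= 1; dy++)
    {
        for (int dx = -1; dx <= 1; dx++)
        {
            double w = img[py + dy, px + dx];   // brightness is the pixel's "vote"
            sum += w;
            sx += w * (px + dx);
            sy += w * (py + dy);
        }
    }
    cx = sx / sum;                               // sub-pixel centre
    cy = sy / sum;
}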
Other thoughts
Your hand is not that accurate when you control a laser pointer. Try getting all the dots every 10 frames or so, identifying which beams are which (based on previous motion, and accounting for new dots, turned-off lasers, and dots that have entered or left the visual field), then just drawing a high-resolution curve. Don't worry about sub-pixel accuracy in the input; just draw the curve into the high-res output.
Use a Catmull-Rom spline, which goes through all control points.
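For reference, a Catmull-Rom segment between control points p1 and p2 (p0 and p3 are the neighbouring points, t runs from 0 to 1); the curve passes through every control point, which is what makes it convenient here:

static PointF CatmullRom(PointF p0, PointF p1, PointF p2, PointF p3, float t)
{
    float t2 = t * t, t3 = t2 * t;
    float x = 0.5f * (2f * p1.X + (p2.X - p0.X) * t
            + (2f * p0.X - 5f * p1.X + 4f * p2.X - p3.X) * t2
            + (3f * p1.X - p0.X - 3f * p2.X + p3.X) * t3);
    float y = 0.5f * (2f * p1.Y + (p2.Y - p0.Y) * t
            + (2f * p0.Y - 5f * p1.Y + 4f * p2.Y - p3.Y) * t2
            + (3f * p1.Y - p0.Y - 3f * p2.Y + p3.Y) * t3);
    return new PointF(x, y);
}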
