I have a legacy map viewer application using WinForms. It is sloooooow. (The speed used to be acceptable, but then Google Maps and Google Earth came along and users got spoiled. Now I am permitted to make it faster :)
After doing all the obvious speed improvements (caching, parallel execution, not drawing what does not need to be drawn, etc), my profiler shows me that the real choking point is the coordinate transformations when converting points from map-space to screen-space.
Normally, the conversion code looks like this:
public Point MapToScreen(PointF input)
{
    // Note that North is negative!
    var result = new Point(
        (int)((input.X - this.currentView.X) * this.Scale),
        (int)((input.Y - this.currentView.Y) * this.Scale));
    return result;
}
The real implementation is trickier. Latitudes/longitudes are represented as integers. To avoid losing precision, they are multiplied by 2^20 (~1 million). This is how a coordinate is represented.
public struct Position
{
    public const int PrecisionCompensationPower = 20;
    public const int PrecisionCompensationScale = 1048576; // 2^20

    public readonly int LatitudeInt;  // North is negative!
    public readonly int LongitudeInt;
}
It is important that the possible scale factors are also explicitly bound to powers of 2. This allows us to replace the multiplication with a bitshift. So the real algorithm looks like this:
public Point MapToScreen(Position input)
{
    Point result = new Point();
    result.X = (input.LongitudeInt - this.UpperLeftPosition.LongitudeInt) >>
               (Position.PrecisionCompensationPower - this.ZoomLevel);
    result.Y = (input.LatitudeInt - this.UpperLeftPosition.LatitudeInt) >>
               (Position.PrecisionCompensationPower - this.ZoomLevel);
    return result;
}
(UpperLeftPosition represents the upper-left corner of the screen in map space.)
I am thinking now of offloading this calculation to the GPU. Can anyone show me an example of how to do that?
We use .NET 4.0, and the code should preferably run on Windows XP, too. Furthermore, we cannot use libraries licensed under the GPL.
I suggest you look at using OpenCL and Cloo to do this - take a look at the vector add example and then change it to map the values, using two ComputeBuffers (one for each of LatitudeInt and LongitudeInt in each point) mapped to 2 output ComputeBuffers. I suspect the OpenCL code would look something like this:
__kernel void CoordTrans(__global const int *lat,
                         __global const int *lon,
                         const int ulpLat,
                         const int ulpLon,
                         const int zl,
                         __global int *outx,
                         __global int *outy)
{
    int i = get_global_id(0);
    const int pcp = 20;
    outx[i] = (lon[i] - ulpLon) >> (pcp - zl);
    outy[i] = (lat[i] - ulpLat) >> (pcp - zl);
}
but in practice you would do more than one coordinate transform per work-item. I need to rush off; I recommend you read up on OpenCL before doing this.
Also, if the number of coordinates is reasonable (somewhere below 100,000 to 1,000,000), the non-GPU solution will likely be faster.
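For completeness, here is a rough sketch of what the host side might look like with Cloo. This is only my guess, modelled on Cloo's VectorAdd sample; kernelSource is assumed to hold the kernel above, and the exact API may differ between Cloo versions:
// NOTE: a minimal, untested sketch; error handling and device selection omitted.
using System;
using Cloo;

public static void CoordTransGpu(int[] lat, int[] lon, int ulpLat, int ulpLon, int zl,
                                 int[] outX, int[] outY, string kernelSource)
{
    ComputePlatform platform = ComputePlatform.Platforms[0];
    var properties = new ComputeContextPropertyList(platform);

    using (var context = new ComputeContext(ComputeDeviceTypes.Gpu, properties, null, IntPtr.Zero))
    using (var queue = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None))
    using (var program = new ComputeProgram(context, kernelSource))
    {
        program.Build(null, null, null, IntPtr.Zero);

        using (var kernel = program.CreateKernel("CoordTrans"))
        using (var latBuf = new ComputeBuffer<int>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, lat))
        using (var lonBuf = new ComputeBuffer<int>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, lon))
        using (var outXBuf = new ComputeBuffer<int>(context, ComputeMemoryFlags.WriteOnly, outX.Length))
        using (var outYBuf = new ComputeBuffer<int>(context, ComputeMemoryFlags.WriteOnly, outY.Length))
        {
            // Buffers and scalars are bound to the kernel arguments in declaration order.
            kernel.SetMemoryArgument(0, latBuf);
            kernel.SetMemoryArgument(1, lonBuf);
            kernel.SetValueArgument(2, ulpLat);
            kernel.SetValueArgument(3, ulpLon);
            kernel.SetValueArgument(4, zl);
            kernel.SetMemoryArgument(5, outXBuf);
            kernel.SetMemoryArgument(6, outYBuf);

            // One work-item per coordinate; read the results back synchronously.
            queue.Execute(kernel, null, new long[] { lat.Length }, null, null);
            queue.ReadFromBuffer(outXBuf, ref outX, true, null);
            queue.ReadFromBuffer(outYBuf, ref outY, true, null);
        }
    }
}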
Now, one year later, the problem arose again, and we found a very banal answer. I feel a bit stupid for not realizing it earlier. We draw the geographic elements to a bitmap via ordinary WinForms GDI+, which is hardware accelerated. All we have to do is NOT do the transformation ourselves, but set the transform parameters of the System.Drawing.Graphics object:
Graphics.TranslateTransform(...) and Graphics.ScaleTransform(...)
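To illustrate, a minimal sketch only: the field names currentView and Scale are the ones from the question, and the call order relies on GDI+'s default MatrixOrder.Prepend, so the translation is applied to the points before the scaling.
protected override void OnPaint(PaintEventArgs e)
{
    Graphics g = e.Graphics;

    // Screen = (Map - currentView) * Scale, expressed as a GDI+ transform.
    g.ScaleTransform((float)this.Scale, (float)this.Scale);
    g.TranslateTransform(-this.currentView.X, -this.currentView.Y);

    // From here on we draw using map-space coordinates and let GDI+ map them to the screen,
    // e.g. g.DrawLines(pen, mapSpacePoints);
}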
We do not even need the trick with the bit shifting.
:)
I'm coming from a CUDA background, and can only speak for NVIDIA GPUs, but here goes.
The problem with doing this on a GPU is your operation/transfer time.
You have on the order of 1 operation to perform per element. You'd really want to do more than this per element to get a real speed improvement. The bandwidth between global memory and the threads on a GPU is around 100 GB/s. So, if you have to load one 4-byte integer to do one operation, your theoretical maximum speed is 100/4 = 25 GFLOPS. This is far from the hundreds of GFLOPS advertised.
Note this is the theoretical maximum; the real result might be worse. And it is even worse if you're loading more than one element. In your case, it looks like 2, so you might get a maximum of 12.5 GFLOPS from it. In practice, it will almost certainly be lower.
If this sounds ok to you though, then go for it!
XNA can be used to do all the transformations you require and gives very good performance. It can also be hosted inside a WinForms application: http://create.msdn.com/en-US/education/catalog/sample/winforms_series_1
First of all, I am aware that this question really sounds as if I didn't search, but I did, a lot.
I wrote a small Mandelbrot drawing code for C#, it's basically a windows form with a PictureBox on which I draw the Mandelbrot set.
My problem is that it's pretty slow. Without a deep zoom it does a pretty good job and moving around and zooming is pretty smooth, taking less than a second per drawing, but once I start to zoom in a little and get to places which require more calculations it becomes really slow.
On other Mandelbrot applications my computer does really fine on places which work much slower in my application, so I'm guessing there is much I can do to improve the speed.
I did the following things to optimize it:
Instead of using the SetPixel GetPixel methods on the bitmap object, I used LockBits method to write directly to memory which made things a lot faster.
Instead of using complex number objects (with classes I made myself, not the built-in ones), I emulated complex numbers using 2 variables, re and im. Doing this allowed me to cut down on multiplications because squaring the real part and the imaginary part is something that is done a few times during the calculation, so I just save the square in a variable and reuse the result without the need to recalculate it.
I use 4 threads to draw the Mandelbrot, each thread does a different quarter of the image and they all work simultaneously. As I understood, that means my CPU will use 4 of its cores to draw the image.
I use the Escape Time Algorithm, which as I understood is the fastest?
Here is how I move between the pixels and calculate; it's commented, so I hope it's understandable:
//Pixel by pixel loop:
for (int r = rRes; r < wTo; r++)
{
    for (int i = iRes; i < hTo; i++)
    {
        //These calculations determine what complex number corresponds to the (r,i) pixel.
        double re = (r - (w / 2)) * step + zeroX;
        double im = (i - (h / 2)) * step - zeroY;

        //Create the Z complex number
        double zRe = 0;
        double zIm = 0;

        //Variables to store the squares of the real and imaginary part.
        double multZre = 0;
        double multZim = 0;

        //Start iterating with the complex number to determine its escape time (mandelValue)
        int mandelValue = 0;
        while (multZre + multZim < 4 && mandelValue < iters)
        {
            /*The new real part equals re(z)^2 - im(z)^2 + re(c); we store it in a temp variable
              tempRe because we still need re(z) in the next calculation.
            */
            double tempRe = multZre - multZim + re;

            /*The new imaginary part is equal to 2*re(z)*im(z) + im(c).
             * Instead of multiplying by 2 I add re(z) to itself and then multiply by im(z), which
             * means I just do 1 multiplication instead of 2.
             */
            zRe += zRe;
            zIm = zRe * zIm + im;
            zRe = tempRe; // We can now put the temp value in its place.

            // Do the squaring now; they will be used in the next calculation.
            multZre = zRe * zRe;
            multZim = zIm * zIm;

            //Increase the mandelValue by one, because the iteration is now finished.
            mandelValue += 1;
        }

        //After the mandelValue is found, this colors its pixel accordingly (unsafe code, accesses memory directly):
        //(Unimportant for my question; I doubt the problem is with this, because my code becomes really slow
        // as the number of ITERATIONS grows, and this only executes more as the number of pixels grows.)
        Byte* pos = px + (i * str) + (pixelSize * r);
        byte col = (byte)((1 - ((double)mandelValue / iters)) * 255);
        pos[0] = col;
        pos[1] = col;
        pos[2] = col;
    }
}
What can I do to improve this? Do you find any obvious optimization problems in my code?
Right now there are 2 ways I know I can improve it:
I need to use a different type for numbers; double has limited accuracy, and I'm sure there are non-built-in alternative types which are faster (they multiply and add faster) and have more accuracy. I just need someone to point me to where I should look and tell me if it's true.
I can move processing to the GPU. I have no idea how to do this (OpenGL maybe? DirectX? is it even that simple or will I need to learn a lot of stuff?). If someone can send me links to proper tutorials on this subject or tell me in general about it that would be great.
Thanks a lot for reading that far and hope you can help me :)
If you decide to move the processing to the GPU, you can choose from a number of options. Since you are using C#, XNA will allow you to use HLSL. RB Whitaker has the easiest XNA tutorials if you choose this option. Another option is OpenCL. OpenTK comes with a demo program of a Julia set fractal. This would be very simple to modify to display the Mandelbrot set. See here
Just remember to find the GLSL shader that goes with the source code.
About the GPU, examples are no help for me because I have absolutely no idea about this topic, how does it even work and what kind of calculations the GPU can do (or how is it even accessed?)
Different GPU software works differently however ...
Typically a programmer will write a program for the GPU in a shader language such as HLSL, GLSL or OpenCL. The program written in C# will load the shader code and compile it, and then use functions in an API to send a job to the GPU and get the result back afterwards.
Take a look at FX Composer or RenderMonkey if you want some practice with shaders without having to worry about APIs.
If you are using HLSL, the rendering pipeline looks like this.
The vertex shader is responsible for taking points in 3D space and calculating their position in your 2D viewing field. (Not a big concern for you since you are working in 2D)
The pixel shader is responsible for applying shader effects to the pixels after the vertex shader is done.
OpenCL is a different story; it's geared towards general-purpose GPU computing (i.e. not just graphics). It's more powerful and can be used for GPUs, DSPs, and building supercomputers.
WRT coding for the GPU, you can look at Cudafy.Net (it does OpenCL too, which is not tied to NVidia) to start getting an understanding of what's going on and perhaps even do everything you need there. I've quickly found it - and my graphics card - unsuitable for my needs, but for the Mandelbrot at the stage you're at, it should be fine.
In brief: You code for the GPU with a flavour of C (Cuda C or OpenCL normally) then push the "kernel" (your compiled C method) to the GPU followed by any source data, and then invoke that "kernel", often with parameters to say what data to use - or perhaps a few parameters to tell it where to place the results in its memory.
When I've been doing fractal rendering myself, I've avoided drawing to a bitmap for the reasons already outlined and deferred the render phase. Besides that, I tend to write massively multithreaded code which is really bad for trying to access a bitmap. Instead, I write to a common store - most recently I've used a MemoryMappedFile (a builtin .Net class) since that gives me pretty decent random access speed and a huge addressable area. I also tend to write my results to a queue and have another thread deal with committing the data to storage; the compute times of each Mandelbrot pixel will be "ragged" - that is to say that they will not always take the same length of time. As a result, your pixel commit could be the bottleneck for very low iteration counts. Farming it out to another thread means your compute threads are never waiting for storage to complete.
I'm currently playing with the Buddhabrot visualisation of the Mandelbrot set, looking at using a GPU to scale out the rendering (since it's taking a very long time with the CPU) and having a huge result-set. I was thinking of targeting an 8 gigapixel image, but I've come to the realisation that I need to diverge from the constraints of pixels, and possibly away from floating point arithmetic due to precision issues. I'm also going to have to buy some new hardware so I can interact with the GPU differently - different compute jobs will finish at different times (as per my iteration count comment earlier) so I can't just fire batches of threads and wait for them all to complete without potentially wasting a lot of time waiting for one particularly high iteration count out of the whole batch.
Another point to make that I hardly ever see being made about the Mandelbrot Set is that it is symmetrical. You might be doing twice as much calculating as you need to.
For moving the processing to the GPU, you have lots of excellent examples here:
https://www.shadertoy.com/results?query=mandelbrot
Note that you need a WebGL-capable browser to view that link. It works best in Chrome.
I'm no expert on fractals, but you seem to have come far already with the optimizations. Going beyond that may make the code much harder to read and maintain, so you should ask yourself whether it is worth it.
One technique I've often observed in other fractal programs is this: While zooming, calculate the fractal at a lower resolution and stretch it to full size during render. Then render at full resolution as soon as zooming stops.
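A minimal illustration of that idea (RenderMandelbrot is a placeholder for your existing rendering routine, assumed here to take a target size):

// While the user is zooming/panning: render a cheap quarter-resolution preview...
Bitmap preview = RenderMandelbrot(width / 4, height / 4);
e.Graphics.InterpolationMode = InterpolationMode.NearestNeighbor;
e.Graphics.DrawImage(preview, new Rectangle(0, 0, width, height)); // ...and stretch it to full size.

// As soon as zooming stops: render and show the full-resolution image.
Bitmap full = RenderMandelbrot(width, height);
e.Graphics.DrawImage(full, 0, 0);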
Another suggestion is that when you use multiple threads, you should take care that each thread doesn't read or write memory used by other threads, because this will cause cache collisions and hurt performance. One good approach could be to split the work up into scanlines (instead of four quarters, like you do now). Create a number of threads, then, as long as there are lines left to process, assign a scanline to a thread that is available. Let each thread write the pixel data to a local piece of memory and copy it back to the main bitmap after each line (to avoid cache collisions).
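A rough sketch of that scanline scheme. This uses .NET 4's Parallel.For instead of hand-managed threads, and RenderScanline/CommitRow are hypothetical helpers standing in for your per-row computation and the copy into the locked bitmap:

Parallel.For(0, height, y =>
{
    // Each task owns a whole row, so threads never write to the same cache lines.
    byte[] row = new byte[width * pixelSize];
    RenderScanline(y, row);        // compute the escape times / colors for row y into local memory
    CommitRow(bitmapData, y, row); // one short, contiguous copy back into the locked bitmap
});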
I use the gdal_retile.py script to cut rasters into tiles (rewritten in C#). Everything works fine, but I want my script to work a little differently. What I want is to change the first level scale. I want it to be calculated using this pattern:
private const double MetersPerInch = 0.0254;
private const double DPI = 96;

private double GetScale(int meters, int pixels)
{
    // Cast to double first so the division isn't truncated to an integer.
    return (double)meters / pixels / MetersPerInch * DPI;
}
For example, if I get a raster of 4k x 4k px that covers 100 km, then:
scale = 100000 / 4000 / 0.0254 * 96 = ~94488
Now I need to find the first scale higher than the computed one that is a power of 2. In this case it is 1:131072, and I should set it as the scale for the first level. The next levels' scales should then be successive powers of 2: [1:262144, 1:524288, 1:1048576, ...].
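A small sketch of that logic in C# (the method and variable names are mine, not from gdal_retile):

private const double MetersPerInch = 0.0254;
private const double DPI = 96;

private static double GetFirstLevelScale(int meters, int pixels)
{
    double scale = (double)meters / pixels / MetersPerInch * DPI; // e.g. 100000 / 4000 / 0.0254 * 96 = ~94488
    double powerOfTwoScale = 1;
    while (powerOfTwoScale < scale)
        powerOfTwoScale *= 2;       // first power of 2 above the computed scale
    return powerOfTwoScale;         // e.g. 131072
}

// The following levels are then powerOfTwoScale * 2, * 4, * 8, ... (1:262144, 1:524288, ...).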
Could anyone help me modify the script? I don't care about the language (it can be done in Python or C#).
Thanks in advance for any solutions!
Ok, I did it!
Here's the source code (maybe someone will need it in the future). Be careful: I made my changes in lines 370, 432 and 433. If you want to revert my changes to the original, just replace the _scaleFactor variable with 2 and remove all the _scaleFactor calculations: http://pastebin.com/0qUCVk9J
I'm working on an image processing library which extends OpenCV, HALCON, etc. The library must work with .NET Framework 3.5, and since my experience with .NET is limited, I would like to ask some questions regarding performance.
I have encountered a few specific things which I cannot properly explain to myself, and I would like to ask a) why, and b) what the best practice is to deal with these cases.
My first question is about Math.Pow. I already found some answers here on Stack Overflow which explain (a) quite well, but not (b), what to do about it. My benchmark program looks like this:
Stopwatch watch = new Stopwatch(); // from System.Diagnostics
watch.Start();
for (int i = 0; i < 1000000; i++)
{
    double result = Math.Pow(4, 7); // the function call
}
watch.Stop();
The result was not very nice (~300 ms on my computer). (I ran the test 10 times and calculated the average value.)
My first idea was to check whether this is because it is a static function. So I implemented my own class:
class MyMath
{
    public static double Pow(double x, double y) // Using some expensive functions to calculate the power
    {
        return Math.Exp(Math.Log(x) * y);
    }

    public static double PowLoop(double x, int y) // Using a loop
    {
        double res = x;
        for (int i = 1; i < y; i++)
            res *= x;
        return res;
    }

    public static double Pow7(double x) // Using inline multiplications
    {
        return x * x * x * x * x * x * x;
    }
}
The third thing I checked was replacing Math.Pow(4,7) directly with 4*4*4*4*4*4*4.
The results are (the average out of 10 test runs)
300 ms Math.Pow(4,7)
356 ms MyMath.Pow(4,7) //gives wrong rounded results
264 ms MyMath.PowLoop(4,7)
92 ms MyMath.Pow7(4)
16 ms 4*4*4*4*4*4*4
Now my situation is basically this: don't use Math.Pow. My only problem is: do I really have to implement my own Math class now? It seems somehow inefficient to implement my own class just for the power function. (Btw, PowLoop and Pow7 are even faster in the Release build by ~25%, while Math.Pow is not.)
So my final questions are
a) Am I wrong to avoid Math.Pow altogether (except maybe for fractional exponents)? (Which makes me somewhat sad.)
b) If you have code to optimize, do you really write all such mathematical operations out directly?
c) Is there maybe already a faster (open-source) library for mathematical operations?
d) The source of my question is basically this: I had assumed that the .NET Framework itself already provides very optimized code / compile results for such basic operations - be it the Math class or array handling - and I was a little surprised how much benefit I could gain by writing my own code. Are there other general areas in C# where I cannot trust the framework directly?
Two things to bear in mind:
You probably don't need to optimise this bit of code. You've just done a million calls to the function in less than a second. Is this really going to cause big problems in your program?
Math.Pow is probably fairly optimal anyway. At a guess, it will be calling a proper numerics library written in a lower level language, which means you shouldn't expect orders of magnitude increases.
Numerical programming is harder than you think. Even the algorithms that you think you know how to calculate, aren't calculated that way. For example, when you calculate the mean, you shouldn't just add up the numbers and divide by how many numbers you have. (Modern numerics libraries use a two pass routine to correct for floating point errors.)
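For illustration, a sketch of such a two-pass mean. This is the classic corrected two-pass formulation; I am not claiming any particular library implements exactly this:

static double Mean(double[] x)
{
    // First pass: the naive mean.
    double sum = 0;
    for (int i = 0; i < x.Length; i++)
        sum += x[i];
    double mean = sum / x.Length;

    // Second pass: in exact arithmetic the residuals sum to zero, so whatever
    // remains is an estimate of the accumulated rounding error.
    double correction = 0;
    for (int i = 0; i < x.Length; i++)
        correction += x[i] - mean;

    return mean + correction / x.Length;
}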
That said, if you decide that you definitely do need to optimise, then consider using integers rather than floating point values, or outsourcing this to another numerics library.
Firstly, integer operations are much faster than floating point. If you don't need floating point values, don't use the floating point data type. This is generally true for any programming language.
Secondly, as you have stated yourself, Math.Pow can handle reals. It makes use of a much more intricate algorithm than a simple loop. No wonder it is slower than simply looping. If you get rid of the loop and just do n multiplications, you are also cutting off the overhead of setting up the loop - thus making it faster. But if you don't use a loop, you have to know the value of the exponent beforehand - it can't be supplied at runtime.
I am not really sure why Math.Exp and Math.Log would be faster. But if you use Math.Log, you can't compute the power of negative values.
Basically, ints are faster, and avoiding loops avoids extra overhead. But you trade off some flexibility when you go for those. It is generally a good idea to avoid reals when all you need are integers, but in this case coding up a custom function when one already exists seems a little too much.
The question you have to ask yourself is whether this is worth it. Is Math.Pow actually slowing your program down? And in any case, the Math.Pow already bundled with your language is often the fastest or very close to that. If you really wanted to make an alternate implementation that is really general purpose (i.e. not limited to only integers, positive values, etc.), you will probably end up using the same algorithm used in the default implementation anyway.
When you are talking about making a million iterations of a line of code then obviously every little detail will make a difference.
Math.Pow() is a function call which will be substantially slower than your manual 4*4...*4 example.
Don't write your own class, as it's doubtful you'll be able to write anything more optimised than the standard Math class.
I'm measuring some system performance data to store it in a database. From those data points I'm drawing line graphs over time. By their nature, those data points are a bit noisy, i.e. every single point deviates at least a bit from the local mean value. When drawing the line graph straight from one point to the next, it produces jagged graphs. At a large time scale like > 10 data points per pixel, this noise is compressed into a wide jagged line area that is, say, 20px high instead of 1px as at smaller scales.
I've read about line smoothing, anti-aliasing, simplifying and all these things. But everything I've found seems to be about something else.
I don't need anti-aliasing, .NET already does that for me when drawing the line on the screen.
I don't want simplification. I need the extreme values to remain visible, at least most of them.
I think it goes in the direction of spline curves, but I couldn't find many example images to evaluate whether the described thing is what I want. I did find a highly scientific book at Google Books though, full of half-page-long formulas, which I didn't feel like reading through right now...
To give you an example, just look at Linux/Gnome's system monitor application. It draws the recent CPU/memory/network usage with a smoothed line. This may be a bit oversimplified, but I'd give it a try and see if I can tweak it.
I'd prefer C# code but algorithms or code in other languages is fine, too, as long as I can port it to C# without external references.
You can do some data smoothing. Instead of using the real data, apply a simple smoothing algorithm that keeps the peaks, like a Savitzky-Golay filter.
You can get the coefficients here.
The easiest thing to do is:
Take the top coefficients from the website I linked to:
// For np = 5 data points
var h = 35.0f;
var coeff = new float[] { 17, 12, -3 };             // coefficients from the site
var easyCoeff = new float[] { -3, 12, 17, 12, -3 }; // it's symmetrical
var center = 2;                                     // = the center index of the easyCoeff array

// now for every point of your data you calculate a smoothed point:
for (int x = center; x < data.Length - center; x++)
{
    smoothed[x] =
        ((data[x - 2] * easyCoeff[center - 2]) +
         (data[x - 1] * easyCoeff[center - 1]) +
         (data[x - 0] * easyCoeff[center - 0]) +
         (data[x + 1] * easyCoeff[center + 1]) +
         (data[x + 2] * easyCoeff[center + 2])) / h;
}
The first 2 and last 2 points cannot be smoothed when using 5 points.
If you want your data to be more "smoothed", you can experiment with coefficients for larger numbers of data points.
Now you can draw a line through your "smoothed" data. The larger your np (number of points), the smoother your data. You also lose some peak accuracy, but not as much as when simply averaging some points together.
You cannot fix this in the graphics code. If your data is noisy then the graph is going to be noisy as well, no matter what kind of line smoothing algorithm you use. You'll need to filter the data first. Create a second data set with points that are interpolated from the original data. A Least Squares fit is a common technique. Averaging is simple to implement but tends to hide extremes.
I think what you are looking for is a routine to provide 'splines'. Here is a link describing splines:
http://en.wikipedia.org/wiki/Spline_(mathematics)
If that is the case I don't have any recommendations for a spline library, but an initial google search turned up a bunch.
Sorry for no code, but hopefully knowing the terminology will aid you in your search.
Bob
Reduce the number of data points, using MIN/MAX/AVG before you display them. It'll look nicer and it'll be faster
Graphs of network traffic often use a weighted average. You can sample once per second into a circular list of length 10 and for the graph, at each sample, graph the average of the samples.
If 10 isn't enough you can store many more. You don't need to recalculate the average from scratch, either:
new_average = (old_average*10 - replaced_sample + new_sample)/10
If you don't want to store all 10, however, you can approximate with this:
new_average = old_average*9/10 + new_sample/10
Lots of routers use this to save on storage. This ramps toward the current traffic rate exponentially.
If you do implement this, do something like this:
new_average = old_average*min(9,number_of_samples)/10 + new_sample/10
number_of_samples++
to avoid the initial ramp-up. You should also adjust the 9/10, 1/10 ratio to actually reflect the time period of each sample, because your timer won't fire exactly once per second.
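A literal C# rendering of that update rule (the field names are mine):

private double average = 0;
private int numberOfSamples = 0;

public void AddSample(double newSample)
{
    average = average * Math.Min(9, numberOfSamples) / 10.0 + newSample / 10.0;
    numberOfSamples++;
}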
I am using XNA to build a project where I can draw "graffiti" on my wall using an LCD projector and a monochrome camera that is filtered to see only hand held laser dot pointers. I want to use any number of laser pointers -- don't really care about differentiating them at this point.
The wall is 10' x 10', and the camera is only 640x480 so I'm attempting to use sub-pixel measurement using a spline curve as outlined here: tpub.com
The camera runs at 120fps (8-bit), so my question to you all is: what is the fastest way to find that subpixel laser dot center? Currently I'm using a brute-force 2D search to find the brightest pixel in the image (0-254) before doing the spline interpolation. That method is not very fast, and each frame takes longer to compute than the interval at which frames arrive.
Edit: To clarify, in the end my camera data is represented by a 2D array of bytes indicating pixel brightness.
What I'd like to do is use an XNA shader to crunch the image for me. Is that practical? From what I understand, there really isn't a way to keep persistent variables in a Pixel Shader such as running totals, averages, etc.
But for argument's sake, let's say I found the brightest pixels using brute force, then stored them and their neighboring pixels for the spline curve into X number of vertices using texcoords. Is it practical then to use HLSL to compute a spline curve using texcoords?
I am also open to suggestions outside of my XNA box, be it DX10/DX11, maybe some sort of FPGA, etc. I just don't really have much experience with ways of crunching data like this. I figure if they can do something like this on a Wiimote using 2 AA batteries, then I'm probably going about this the wrong way.
Any ideas?
If by brute-forcing you mean looking at every pixel independently, it is basically the only way of doing it. You will have to scan through all the image's pixels, no matter what you want to do with the image. Although you might not need to find the brightest pixels; you can filter the image by color (e.g. if you're using a red laser). This is easily done using an HSV-coded image. If you are looking for some faster algorithms, try OpenCV. It's been optimized again and again for image processing, and you can use it in C# via a wrapper:
http://www.codeproject.com/KB/cs/Intel_OpenCV.aspx
OpenCV can also help you easily find the point centers and track each point.
Is there a reason you are using a 120fps camera? You know the human eye can only see about 30fps, right? I'm guessing it's to follow very fast laser movements... You might want to consider bringing it down, because real-time processing of 120fps will be very hard to achieve.
Running through 640*480 bytes to find the highest byte should run within a millisecond, even on slow processors. No need to take the shader route.
I would advise optimizing your loop instead.
For instance, this is really slow (because it does a multiplication with every array lookup):
byte highest = 0;
int foundX = -1, foundY = -1;
for (int y = 0; y < 480; y++)
{
    for (int x = 0; x < 640; x++)
    {
        if (myBytes[x][y] > highest)
        {
            highest = myBytes[x][y];
            foundX = x;
            foundY = y;
        }
    }
}
This is much faster:
byte[] myBytes = new byte[640 * 480];
// fill it with your image

byte highest = 0;
int found = -1, foundX = -1, foundY = -1;
int len = 640 * 480;
for (int i = 0; i < len; i++)
{
    if (myBytes[i] > highest)
    {
        highest = myBytes[i];
        found = i;
    }
}
if (found != -1)
{
    foundX = found % 640;
    foundY = found / 640;
}
This is off the top of my head so sorry for errors ;^)
You're dealing with some pretty complex maths if you want sub-pixel accuracy. I think this paper is something to consider. Unfortunately, you'll have to pay to see it using that site. If you've got access to a suitable library, they may be able to get hold of it for you.
The link in the original post suggested doing 1000 spline calculations for each axis - it treated x and y independently, which is OK for circular images but is a bit off if the image is a skewed ellipse. You could use the following to get a reasonable estimate:
xc = sum (xn.f(xn)) / sum (f(xn))
where xc is the mean, xn is a point along the x-axis and f(xn) is the value at the point xn. So for this:
*
* *
* *
* *
* *
* *
* * *
* * * *
* * * *
* * * * * *
------------------
2 3 4 5 6 7
gives:
sum (xn.f(xn)) = 2 * 1 + 3 * 3 + 4 * 9 + 5 * 10 + 6 * 4 + 7 * 1
sum (f(xn)) = 1 + 3 + 9 + 10 + 4 + 1
xc = 128 / 28 = 4.57
and repeat for the y-axis.
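As a sketch, the weighted-centroid estimate above could look like this in C#, applied along the peak's row; the window size and the flat byte-array layout are my assumptions:

static float SubPixelX(byte[] image, int width, int peakX, int peakY, int halfWindow)
{
    float weightedSum = 0, total = 0;
    for (int x = peakX - halfWindow; x <= peakX + halfWindow; x++)
    {
        float f = image[peakY * width + x]; // f(xn): brightness at column x in the peak's row
        weightedSum += x * f;
        total += f;
    }
    return weightedSum / total;             // xc = sum(xn * f(xn)) / sum(f(xn))
}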
Brute-force is the only real way, however your idea of using a shader is good - you'd be offloading the brute-force check from the CPU, which can only look at a small number of pixels simultaneously (roughly 1 per core), to the GPU, which likely has 100+ dumb cores (pipelines) that can simultaneously compare pixels (your algorithm may need to be modified a bit to work well with the 1 instruction-many cores arrangement of a GPU).
The biggest issue I see is whether or not you can move that data to the GPU fast enough.
Another optimization to consider: if you're drawing, then the current location of the pointer is probably close to the last location of the pointer. Remember the last recorded position of the pointer between frames, and only scan a region close to that position... say a 1'x1' area. Only if the pointer isn't found in that area should you scan the whole surface.
Obviously, there will be a tradeoff between how quickly your program can scan, and how quickly you'll be able to move your mouse before the camera "loses" the pointer and has to go to the slow, full-image scan. A little experimentation will probably reveal the optimum value.
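In code, the idea might look roughly like this (FindBrightest is a hypothetical helper wrapping the existing brute-force scan, here restricted to a rectangle; the margin value is only an example):

int margin = 64; // search window around the last known dot position, in pixels

int x0 = Math.Max(0, lastX - margin), y0 = Math.Max(0, lastY - margin);
int x1 = Math.Min(width, lastX + margin), y1 = Math.Min(height, lastY + margin);

Point? dot = FindBrightest(frame, x0, y0, x1, y1);
if (dot == null)
    dot = FindBrightest(frame, 0, 0, width, height); // pointer escaped the window: full scan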
Cool project, by the way.
Put the camera slightly out of focus and bitblt against a neutral sample. You can quickly scan rows for non-zero values. Also, if you are at 8 bits and pick up 4 bytes at a time, you can process the image faster. As others pointed out, you might reduce the frame rate. If you have less fidelity than the resulting image, there isn't much point in the high scan rate.
(The slight out of focus camera will help get just the brightest points and reduce false positives if you have a busy surface... of course assuming you are not shooting a smooth/flat surface)
Start with a black output buffer. Forget about subpixel for now. Every frame, every pixel, do this:
outbuff=max(outbuff,inbuff);
Do subpixel filtering to a third "clean" buffer when you're done with the image. Or do a chunk or a line of the screen at a time in real time. Advantage: real-time "rough" view of the drawing, cleaned up as you go.
When you convert from the rough output buffer to the "clean" third buffer, you can clear the rough to black. This lets you keep drawing forever without slowing down.
By drawing the "clean" over top the "rough," maybe in a slightly different color, you'll have the best of both worlds.
This is similar to what paint programs do--if you draw really fast, you see a rough version, then the paint program "cleans up" the image when it has time.
Some comments on the algorithm:
I've seen a lot of cheats in this arena. I've played Sonic on a Sega Genesis emulator that upsamples, and it has some pretty wild algorithms that work very well and are very fast.
You actually have some advantages you can gain because you might know the brightness and the radius of the dot.
You might just look at each pixel and its 8 neighbors and let those 9 pixels "vote" according to their brightness for where the subpixel lies.
Other thoughts
Your hand is not that accurate when you control a laser pointer. Try getting all the dots every 10 frames or so, identifying which beams are which (based on previous motion, and accounting for new dots, turned-off lasers, and dots that have entered or left the visual field), then just drawing a high resolution curve. Don't worry about sub pixel in the input--just draw the curve into the high res output.
Use a Catmull-Rom spline, which goes through all control points.
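For reference, a standard uniform Catmull-Rom segment between p1 and p2 looks like this (p0 and p3 are the neighbouring control points, and t runs from 0 to 1; since you're on XNA, you could also use the built-in MathHelper.CatmullRom per component instead):

static PointF CatmullRom(PointF p0, PointF p1, PointF p2, PointF p3, float t)
{
    float t2 = t * t, t3 = t2 * t;
    float x = 0.5f * (2f * p1.X + (-p0.X + p2.X) * t
            + (2f * p0.X - 5f * p1.X + 4f * p2.X - p3.X) * t2
            + (-p0.X + 3f * p1.X - 3f * p2.X + p3.X) * t3);
    float y = 0.5f * (2f * p1.Y + (-p0.Y + p2.Y) * t
            + (2f * p0.Y - 5f * p1.Y + 4f * p2.Y - p3.Y) * t2
            + (-p0.Y + 3f * p1.Y - 3f * p2.Y + p3.Y) * t3);
    return new PointF(x, y);
}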