HLSL Computation - process pixels in order? - C#

Imagine I want to, say, compute the first one million terms of the Fibonacci sequence using the GPU. (I realize this will exceed the precision limit of a 32-bit data type - just used as an example)
Given a GPU with 40 shaders/stream processors, and cheating by using a reference book, I can break the million terms into 40 blocks of 25,000 terms each and seed each shader with its two start values:
unit 0: 1,1 (which then calculates 2,3,5,8,blah blah blah)
unit 1: 25,000th term
unit 2: 50,000th term
...
How, if possible, could I go about ensuring that pixels are processed in order? If the first few pixels in the input texture have values (with RGBA for simplicity)
0,0,0,1 // initial condition
0,0,0,1 // initial condition
0,0,0,2
0,0,0,3
0,0,0,5
...
How can I ensure that I don't try to calculate the 5th term before the first four are ready?
I realize this could be done in multiple passes by setting a "ready" bit whenever a value is calculated, but that seems incredibly inefficient and sort of eliminates the benefit of performing this type of calculation on the GPU.
OpenCL/CUDA/etc probably provide nice ways to do this, but I'm trying (for my own edification) to get this to work with XNA/HLSL.
Links or examples are appreciated.
Update/Simplification
Is it possible to write a shader that uses values from one pixel to influence the values from a neighboring pixel?

You cannot determine the order the pixels are processed in. If you could, that would break the massive pixel throughput of the shader pipelines. What you can do is calculate the Fibonacci sequence using the non-recursive (closed-form) formula.
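For illustration, here is a minimal C# sketch of that closed form (Binet's formula); the same arithmetic ports directly to a pixel shader. Note that double precision only holds up to roughly F(70), echoing the precision caveat in the question.

// Sketch: closed-form Fibonacci, no recursion and no ordering requirement.
static double Fibonacci(int n)
{
    double sqrt5 = Math.Sqrt(5.0);
    double phi = (1.0 + sqrt5) / 2.0; // golden ratio
    double psi = (1.0 - sqrt5) / 2.0; // its conjugate
    return (Math.Pow(phi, n) - Math.Pow(psi, n)) / sqrt5;
}

Since each term depends only on its index, every pixel can evaluate this independently, which is exactly the kind of work the pixel pipeline is built for.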
In your question, you are actually trying to serialize the shader units to run one after another. You might as well use the CPU right away; it will be much faster.
By the way, multiple passes aren't as slow as you might think, but they won't help you in your case: you cannot calculate any next value without knowing the previous ones, and that kills any parallelization.

Related

Dijkstra algorithm expanded with extra limit variable

I am having trouble implementing this in my current pathfinding algorithm.
Currently I have Dijkstra written and working as it should, but I need to take a step further and add a limit (range). I can explain better with an image:
Let's say I have a range of 80. I want to go from A to E. My current algorithm works as it should, so it results in A->B->E.
However, I need to travel only on edges whose weight does not exceed the range of 80, which would mean that A->B->E is no longer an option; it becomes A->C->D->B->E (considering that the range/limit resets at every stop).
So far, I have implemented a bool named Possible which returns, for a single leg of the path (e.g. A->B), whether it is possible given my limit/range.
My main problem is that I do not know where/how to start. My only idea was to see where Possible is false (A->B on the total route A->B->E) and run the algorithm from A to E again, excluding the B stop/vertex.
Is this a good approach? Because of that, my big-O complexity would double (as far as I understand it).
I see two ways of doing this:
Create a new graph G' that contains only the edges with weight < 80, and look for the shortest path there. The reduction takes O(V+E) time and O(V+E) additional memory.
Change Dijkstra's algorithm to ignore edges heavier than 80: simply skip them when relaxing the neighbor vertices (see the sketch below). Complexity and memory usage stay the same in this case.
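A minimal C# sketch of that second option (not the asker's code; the graph is assumed to be an adjacency list, and PriorityQueue requires .NET 6+):

// Standard Dijkstra; the only change is the edge-weight filter.
static int[] DijkstraWithLimit(List<(int To, int Weight)>[] graph, int source, int limit)
{
    var dist = new int[graph.Length];
    Array.Fill(dist, int.MaxValue);
    dist[source] = 0;

    var queue = new PriorityQueue<int, int>();
    queue.Enqueue(source, 0);

    while (queue.TryDequeue(out int v, out int d))
    {
        if (d > dist[v]) continue;        // stale queue entry, skip it
        foreach (var (to, weight) in graph[v])
        {
            if (weight > limit) continue; // the only change: ignore out-of-range edges
            if (dist[v] + weight < dist[to])
            {
                dist[to] = dist[v] + weight;
                queue.Enqueue(to, dist[to]);
            }
        }
    }
    return dist;
}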
Create a temporary version of your graph, and set all weights above the threshold to infinity. Then run the ordinary Dijkstra algorithm on it.
Complexity will increase or not, depending on your version of the algorithm:
if you have O(V^2) then it will increase to O(E + V^2)
if you have the O(ElogV) version then it will increase to O(E + ElogV)
if you have the O(E + VlogV) version it will remain the same
As ArsenMkrt noted, you can also remove these edges, which makes even more sense but makes the complexity a bit worse. Modifying the algorithm to just skip those edges seems to be the best option, though, as he suggested in his answer.

Making C# mandelbrot drawing more efficient

First of all, I am aware that this question really sounds as if I didn't search, but I did, a lot.
I wrote a small Mandelbrot drawing program in C#; it's basically a Windows Form with a PictureBox on which I draw the Mandelbrot set.
My problem is that it's pretty slow. Without a deep zoom it does a pretty good job, and moving around and zooming are pretty smooth (less than a second per drawing), but once I start to zoom in a little and get to places which require more calculations it becomes really slow.
Other Mandelbrot applications run fine on my computer in places that are much slower in my application, so I'm guessing there is much I can do to improve the speed.
I did the following things to optimize it:
Instead of using the SetPixel/GetPixel methods on the Bitmap object, I used the LockBits method to write directly to memory, which made things a lot faster.
Instead of using complex number objects (with classes I made myself, not the built-in ones), I emulated complex numbers using 2 variables, re and im. Doing this allowed me to cut down on multiplications, because squaring the real part and the imaginary part is something that is done a few times during the calculation, so I just save the squares in variables and reuse the results without recalculating them.
I use 4 threads to draw the Mandelbrot; each thread does a different quarter of the image and they all work simultaneously. As I understand it, that means my CPU will use 4 of its cores to draw the image.
I use the Escape Time Algorithm, which, as I understand it, is the fastest?
Here is how I move between the pixels and calculate; it's commented, so I hope it's understandable:
//Pixel by pixel loop:
for (int r = rRes; r < wTo; r++)
{
    for (int i = iRes; i < hTo; i++)
    {
        //These calculations determine which complex number corresponds to the (r,i) pixel.
        double re = (r - (w/2))*step + zeroX;
        double im = (i - (h/2))*step - zeroY;

        //Create the Z complex number
        double zRe = 0;
        double zIm = 0;

        //Variables to store the squares of the real and imaginary part.
        double multZre = 0;
        double multZim = 0;

        //Start iterating with the complex number to determine its escape time (mandelValue)
        int mandelValue = 0;
        while (multZre + multZim < 4 && mandelValue < iters)
        {
            /*The new real part equals re(z)^2 - im(z)^2 + re(c); we store it in a temp variable
              tempRe because we still need re(z) in the next calculation
            */
            double tempRe = multZre - multZim + re;

            /*The new imaginary part is equal to 2*re(z)*im(z) + im(c)
             * Instead of multiplying by 2 I add re(z) to itself and then multiply by im(z),
             * which means I do 1 multiplication instead of 2.
             */
            zRe += zRe;
            zIm = zRe * zIm + im;
            zRe = tempRe; // We can now put the temp value in its place.

            // Do the squaring now; the results will be used in the next calculation.
            multZre = zRe * zRe;
            multZim = zIm * zIm;

            //Increase the mandelValue by one, because the iteration is now finished.
            mandelValue += 1;
        }

        //After the mandelValue is found, this colors its pixel accordingly (unsafe code, accesses memory directly):
        //(Unimportant for my question; I doubt the problem is here, because my code becomes really slow
        // as the number of ITERATIONS grows - this only executes more as the number of pixels grows.)
        Byte* pos = px + (i * str) + (pixelSize * r);
        byte col = (byte)((1 - ((double)mandelValue / iters)) * 255);
        pos[0] = col;
        pos[1] = col;
        pos[2] = col;
    }
}
What can I do to improve this? Do you find any obvious optimization problems in my code?
Right now there are 2 ways I know I can improve it:
I need to use a different type for numbers; double is limited in accuracy, and I'm sure there are non-built-in alternative types which are faster (multiply and add faster) and more accurate. I just need someone to point me to where I should look and tell me whether it's true.
I can move the processing to the GPU. I have no idea how to do this (OpenGL maybe? DirectX? Is it even that simple, or will I need to learn a lot of stuff?). If someone can send me links to proper tutorials on this subject, or tell me about it in general, that would be great.
Thanks a lot for reading that far and hope you can help me :)
If you decide to move the processing to the GPU, you can choose from a number of options. Since you are using C#, XNA will allow you to use HLSL. RB Whitaker has the easiest XNA tutorials if you choose this option. Another option is OpenCL. OpenTK comes with a demo program of a Julia set fractal; this would be very simple to modify to display the Mandelbrot set. See here
Just remember to find the GLSL shader that goes with the source code.
About the GPU, examples are no help for me because I have absolutely no idea about this topic, how does it even work and what kind of calculations the GPU can do (or how is it even accessed?)
Different GPU software works differently, but typically a programmer writes a program for the GPU in a shader language such as HLSL, GLSL or OpenCL. The program written in C# loads and compiles the shader code, and then uses API functions to send a job to the GPU and fetch the result back afterwards.
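In XNA 4.0 that load/compile/dispatch flow looks roughly like this (a sketch inside the usual Game class; "Mandelbrot" is a hypothetical compiled .fx effect in the content project, and "iters" a parameter it would declare):

Effect effect = Content.Load<Effect>("Mandelbrot"); // load the compiled HLSL
effect.Parameters["iters"].SetValue(300);           // hand data to the shader

spriteBatch.Begin(SpriteSortMode.Immediate, BlendState.Opaque,
                  null, null, null, effect);
spriteBatch.Draw(inputTexture, GraphicsDevice.Viewport.Bounds, Color.White);
spriteBatch.End(); // the pixel shader now runs once per output pixel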
Take a look at FX Composer or RenderMonkey if you want some practice with shaders without having to worry about APIs.
If you are using HLSL, the rendering pipeline looks like this.
The vertex shader is responsible for taking points in 3D space and calculating their position in your 2D viewing field. (Not a big concern for you since you are working in 2D)
The pixel shader is responsible for applying shader effects to the pixels after the vertex shader is done.
OpenCL is a different story; it's geared towards general-purpose GPU computing (i.e. not just graphics). It's more powerful and can be used for GPUs, DSPs, and building supercomputers.
WRT coding for the GPU, you can look at Cudafy.Net (it does OpenCL too, which is not tied to NVidia) to start getting an understanding of what's going on and perhaps even do everything you need there. I've quickly found it - and my graphics card - unsuitable for my needs, but for the Mandelbrot at the stage you're at, it should be fine.
In brief: You code for the GPU with a flavour of C (Cuda C or OpenCL normally) then push the "kernel" (your compiled C method) to the GPU followed by any source data, and then invoke that "kernel", often with parameters to say what data to use - or perhaps a few parameters to tell it where to place the results in its memory.
When I've been doing fractal rendering myself, I've avoided drawing to a bitmap for the reasons already outlined and deferred the render phase. Besides that, I tend to write massively multithreaded code, which is really bad for trying to access a bitmap. Instead, I write to a common store - most recently I've used a MemoryMappedFile (a built-in .NET class), since that gives me pretty decent random-access speed and a huge addressable area. I also tend to write my results to a queue and have another thread deal with committing the data to storage; the compute times of each Mandelbrot pixel will be "ragged" - that is to say, they will not always take the same length of time. As a result, your pixel commit could be the bottleneck for very low iteration counts. Farming it out to another thread means your compute threads are never waiting for storage to complete.
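A minimal sketch of that producer/consumer split (my illustration of the pattern, not the answerer's actual code; CommitPixel is a hypothetical method that writes to the backing store):

var results = new BlockingCollection<(int X, int Y, int Iterations)>(1 << 16);

// One writer thread drains the queue, so compute threads never wait on storage.
var writer = Task.Run(() =>
{
    foreach (var (x, y, iters) in results.GetConsumingEnumerable())
        CommitPixel(x, y, iters); // hypothetical: writes to the MemoryMappedFile
});

// Compute threads just do: results.Add((x, y, mandelValue));
// When every compute thread has finished:
results.CompleteAdding();
writer.Wait();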
I'm currently playing with the Buddhabrot visualisation of the Mandelbrot set, looking at using a GPU to scale out the rendering (since it's taking a very long time with the CPU) and handling a huge result set. I was thinking of targeting an 8-gigapixel image, but I've come to the realisation that I need to diverge from the constraints of pixels, and possibly away from floating-point arithmetic due to precision issues. I'm also going to have to buy some new hardware so I can interact with the GPU differently - different compute jobs will finish at different times (as per my iteration-count comment earlier), so I can't just fire off batches of threads and wait for them all to complete without potentially wasting a lot of time waiting for one particularly high iteration count out of the whole batch.
Another point to make that I hardly ever see being made about the Mandelbrot Set is that it is symmetrical. You might be doing twice as much calculating as you need to.
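For example (a sketch under the assumption that the viewport is vertically centered on the real axis; RenderRow and CopyRow are hypothetical helpers):

// Only compute the top half (im >= 0); the bottom half is its mirror image.
for (int y = 0; y <= h / 2; y++)
{
    RenderRow(y);
    int mirror = h - 1 - y;
    if (mirror != y)
        CopyRow(y, mirror); // conjugate points have identical escape times
}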
For moving the processing to the GPU, you have lots of excellent examples here:
https://www.shadertoy.com/results?query=mandelbrot
Note that you need a WebGL-capable browser to view that link. It works best in Chrome.
I'm no expert on fractals, but you seem to have come far already with the optimizations. Going beyond that may make the code much harder to read and maintain, so you should ask yourself whether it is worth it.
One technique I've often observed in other fractal programs is this: While zooming, calculate the fractal at a lower resolution and stretch it to full size during render. Then render at full resolution as soon as zooming stops.
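A sketch of that idea with GDI+ (RenderMandelbrot is a hypothetical renderer taking a target size, and g is the Graphics you draw with):

// While zooming: render a quarter-resolution frame and let GDI+ stretch it.
Bitmap preview = RenderMandelbrot(w / 4, h / 4);
g.InterpolationMode = InterpolationMode.NearestNeighbor; // cheap upscale
g.DrawImage(preview, new Rectangle(0, 0, w, h));
// When zooming stops: render RenderMandelbrot(w, h) and draw that instead.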
Another suggestion: when you use multiple threads, take care that each thread doesn't read/write memory belonging to other threads, because this will cause cache collisions and hurt performance. One good algorithm is to split the work up into scanlines (instead of four quarters, as you do now). Create a number of threads; then, as long as there are lines left to process, assign a scanline to a thread that is available. Let each thread write the pixel data to a local piece of memory and copy it back to the main bitmap after each line (to avoid cache collisions). A sketch of this scheme follows.
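A minimal sketch (my illustration, not the answerer's code): ComputePixel stands in for the escape-time loop from the question, w/h/str/pixelSize are the variables already used there, and bitmapData is the BitmapData returned by LockBits.

int nextLine = -1; // each worker atomically claims the next unprocessed row
Parallel.For(0, Environment.ProcessorCount, _ =>
{
    var lineBuffer = new byte[w * pixelSize]; // thread-local row, no sharing
    int y;
    while ((y = Interlocked.Increment(ref nextLine)) < h)
    {
        for (int x = 0; x < w; x++)
        {
            byte col = ComputePixel(x, y); // hypothetical: the escape-time loop
            int o = x * pixelSize;
            lineBuffer[o] = lineBuffer[o + 1] = lineBuffer[o + 2] = col;
        }
        // One block copy per finished line into the locked bitmap memory.
        Marshal.Copy(lineBuffer, 0, bitmapData.Scan0 + y * str, lineBuffer.Length);
    }
});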

Drawing signal with a lot of samples

I need to display a set of signals. Each signal is defined by millions of samples. Just processing the collection of samples (converting samples to points according to the bitmap size) takes a significant amount of time, especially during scrolling.
So I implemented a kind of downsampling: I just skip some points, taking every 2nd, every 3rd, or every 50th point, depending on the signal characteristics. This increases speed very much but significantly distorts the signal's shape.
Are there any smarter approaches?
We've had a similar issue in a recent application. Our visualization (a simple line graph) became too cluttered when zoomed out to see the full extent of the data (about 7 days of samples, with a sample taken every 6 seconds more or less), so down-sampling was actually the way to go. If we hadn't done that, zooming out wouldn't have had much meaning, as all you would see would be a big blob of lines smeared over the screen.
It all depends on how you are going to implement the down-sampling. There are two (simple) approaches: down-sample at the moment you receive a sample, or down-sample at display time.
What really gives a huge performance boost in both of these cases is the proper selection of your data-sources.
Let's say you have 7 million samples, and your viewing window is just interested in the last million points. If your implementation depends on an IEnumerable, this means that the IEnumerable will have to MoveNext 6 million times before actually starting. However, if you're using something which is optimized for random reads (a List comes to mind), you can implement your own enumerator for that, more or less like this:
public IEnumerator<T> GetEnumerator(int start, int count, int skip)
{
    // Assume the class has a field containing the data as a List<T>, named _data.
    // Walk a window of 'count' indices starting at 'start', visiting every 'skip'-th element.
    for (int i = start; i < start + count && i < _data.Count; i += skip)
    {
        yield return _data[i];
    }
}
Obviously this is a very naive implementation, but you can do whatever you want within the for-loop (use an algorithm based on the surrounding samples to average?). Note, however, that this approach will usually smooth away any extreme spikes in your signal, so be wary of that.
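One spike-preserving alternative worth noting (my suggestion, not part of the original answer) is min/max decimation: emit the minimum and maximum of each bucket instead of a single sample, so narrow peaks survive the downsampling. A minimal sketch:

public static IEnumerable<double> MinMaxDownsample(IReadOnlyList<double> data, int bucketSize)
{
    for (int start = 0; start < data.Count; start += bucketSize)
    {
        double min = double.MaxValue, max = double.MinValue;
        int end = Math.Min(start + bucketSize, data.Count);
        for (int i = start; i < end; i++)
        {
            if (data[i] < min) min = data[i];
            if (data[i] > max) max = data[i];
        }
        yield return min; // drawing min and max per bucket keeps the envelope
        yield return max;
    }
}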
Another approach would be to create generalized versions of your dataset for different ranges, which update themselves whenever you receive a new signal. You usually don't need to update the complete dataset; just updating the end of your set is probably good enough. This allows you to do a bit more advanced processing of your data, but it will cost more memory. You will have to cache the distinct 'layers' of detail in your application.
However, reading your (short) explanation, I think a display-time optimization might be good enough. You will always get distortion in your signal if you generalize; you always lose data. The algorithm you choose determines how this distortion occurs and how noticeable it is.
You need a better sampling algorithm. You can also employ the parallel-processing features of C#; refer to the Task Parallel Library.

How to do interpolation

I have very little data for my analysis, so I want to produce more data for analysis through interpolation.
My dataset contains 23 independent attributes and 1 dependent attribute... how can interpolation be done here?
EDIT:
My main problem is a shortage of data; I have to increase the size of my dataset. The attributes are categorical: for example, attribute A may be low, medium, or high. Is interpolation the right approach for this or not?
This is a mathematical problem, but there is too little information in the question to answer it properly. Depending on the distribution of your real data, you may try to find a function that it follows. You could also try to interpolate the data using an artificial neural network, but that would be complex. The thing is, to find interpolations you need to analyze the data you already have, which defeats the purpose. There is probably more to this problem than is explained. What is the nature of the data? Can you place it in n-dimensional space? What do you expect to get from the analysis?
Roughly speaking, to interpolate an array:
double[] data = LoadData();
double requestedIndex = /* set to the index you want - e.g. 1.25 to interpolate between values at data[1] and data[2] */;
int previousIndex = (int)requestedIndex; // in example, would be 1
int nextIndex = previousIndex + 1; // in example, would be 2
double factor = requestedIndex - (double)previousIndex; // in example, would be 0.25
// in example, this would give 75% of data[1] plus 25% of data[2]
double result = (data[previousIndex] * (1.0 - factor)) + (data[nextIndex] * factor);
This is really pseudo-code; it doesn't perform range-checking, assumes your data is in an object or array with an indexer, and so on.
Hope that helps to get you started - any questions please post a comment.
If the 23 independent variables are sampled in a hyper-grid (regularly spaced), then you can choose to partition into hyper-cubes and do linear interpolation of the dependent value from the vertex closest to the origin along the vectors defined from that vertex along the hyper-cube edges away from the origin. In general, for a given partitioning, you project the interpolation point onto each vector, which gives you a new 'coordinate' in that particular space, which can then be used to compute the new value by multiplying each coordinate by the difference of the dependent variable, summing the results, and adding to the dependent value at the local origin. For hyper-cubes, this projection is straightforward (you simply subtract the nearest vertex position closest to the origin.)
If your samples are not uniformly spaced, then the problem is much more challenging, as you would need to choose an appropriate partitioning if you wanted to perform linear interpolation. In principle, Delaunay triangulation generalizes to N dimensions, but it's not easy to do and the resulting geometric objects are a lot harder to understand and interpolate than a simple hyper-cube.
One thing you might consider is if your data set is naturally amenable to projection so that you can reduce the number of dimensions. For instance, if two of your independent variables dominate, you can collapse the problem to 2-dimensions, which is much easier to solve. Another thing you might consider is taking the sampling points and arranging them in a matrix. You can perform an SVD decomposition and look at the singular values. If there are a few dominant singular values, you can use this to perform a projection to the hyper-plane defined by those basis vectors and reduce the dimensions for your interpolation. Basically, if your data is spread in a particular set of dimensions, you can use those dominating dimensions to perform your interpolation, since you don't really have much information in the other dimensions anyway.
I agree with the other commentators, however, that your premise may be off. You generally don't want to interpolate to perform analysis, as you're just choosing to interpolate your data in different ways and the choice of interpolation biases the analysis. It only makes sense if you have a compelling reason to believe that a particular interpolation is physically consistent and you simply need additional points for a particular algorithm.
May I suggest cubic spline interpolation:
http://www.coastrd.com/basic-cubic-spline-interpolation
Unless you have a very specific need, this is easy to implement and calculates splines well.
Have a look at the regression methods presented in The Elements of Statistical Learning; most of them can be tested in R. There are plenty of models that can be used: linear regression, local models and so on.

How to efficiently spread objects on a 2D surface in a "natural" way?

I would like to efficiently generate positions for objects on a given surface. As you probably guessed, this is for a game. The surface is actually 3D terrain, but the third dimension does not matter, as it is determined by the terrain height.
The problem is that I would like to do this in the most efficient and easy way possible but still get good results. What I mean by "natural" is something like what is mentioned in this article about Perlin noise (trees forming forests, large to small groups spread out over the land). The approach is nice, but too complicated. I need to do this quite often, and preferably without any more textures involved, even at the cost of worse performance (so the results won't be as pretty, but still good enough to give a nice natural terrain with vegetation).
The number of objects placed varies, but it is generally around 50. A nice enhancement would be to somehow restrict placement of objects in areas of very high altitude (mountains), but I guess that could be done by placing a few more objects and deleting those placed above a given altitude.
This might not be the answer you are looking for, but I believe that Perlin Noise is the solution to your problem.
Perlin Noise itself involves no textures; I believe you have a misunderstanding about what it is. For your purposes, it is basically a 2D grid holding, for each point, a value between 0 and 1. You don't need to generate any textures. See this description of it for more information and an elegant explanation. The basics of Perlin Noise involve making a few random noise maps, starting with one with very few points, each new one having twice as many points of randomness (and lower amplitude), and adding them together.
Especially, if your map is discretely tiled, you don't even have to generate the noise at a high resolution :)
How "often" are you planning to do this? If you're going to be doing it 10+ times every single frame, then Perlin Noise might not be your answer. However, if you're doing it once every few seconds (or less), then I don't think that you should have any worries about speed impact -- at least, for 2D Perlin Noise.
Having established that, you could look at this question and my answer to it, which attempt something very similar to what you are trying to do. The basic steps are:
Generate Perlin noise; higher turbulence = less clumping and more isolated features.
Set a "threshold" (e.g. 0.5): anything above this threshold is considered "on" and anything below it is considered "off". Lower threshold = more frequent, higher threshold = less frequent.
Populate the "on" tiles with whatever you are placing (see the sketch after the examples below).
Here are some samples of Perlin noise used to generate a 50x50 tile-based map. Note that the only difference between the nature of the two is the threshold: bigger clumps mean a lower threshold, smaller clumps mean a higher one.
A forest, with blue trees and brown undergrowth
A marsh, with deep areas surrounded by shallower areas
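To make steps 1-3 concrete, here is a rough C# sketch using hash-based value noise as a stand-in for true Perlin noise (Lattice, Noise2D and PlaceObject are my hypothetical names, not from the answer; this is one octave - sum several at doubling frequency and halving amplitude for the full effect):

// Deterministic "random" value in [0,1) for integer grid coordinates.
static double Lattice(int x, int y)
{
    unchecked
    {
        int n = x + y * 57;
        n = (n << 13) ^ n;
        n = n * (n * n * 15731 + 789221) + 1376312589;
        return (n & 0x7fffffff) / 2147483648.0;
    }
}

// Bilinearly interpolated noise between the four surrounding lattice points.
static double Noise2D(double x, double y)
{
    int xi = (int)Math.Floor(x), yi = (int)Math.Floor(y);
    double fx = x - xi, fy = y - yi;
    double a = Lattice(xi, yi),     b = Lattice(xi + 1, yi);
    double c = Lattice(xi, yi + 1), d = Lattice(xi + 1, yi + 1);
    double top = a + (b - a) * fx;
    double bottom = c + (d - c) * fx;
    return top + (bottom - top) * fy;
}

// Steps 2-3: tiles whose noise value clears the threshold get an object.
static void Populate(int size, double frequency, double threshold)
{
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            if (Noise2D(x * frequency, y * frequency) > threshold)
                PlaceObject(x, y); // hypothetical placement call
}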
Note you'll have to tweak the constants a bit, but you could do something like this
First, pick a random point. (say 24,50).
Next, identify points of interest for this object. If it's a rock, your points might be the two mountains at 15,13 and 50,42. If it were a forest, you might compute the "centers" of a couple of local forests.
Next, calculate the distance vectors between the point and the points of interest, and scale them by some constant.
Now, add all those vectors to the point.
Next, determine whether the object is in a legal position. If it is, move on to the next object. If it's not, repeat the process.
Adapt as necessary. :-)
One thing: if you want to reject things like trees on mountains, you don't add extra tries; you keep trying to place an object until you find a suitable location, or until you've tried a bunch of times and need to bail out because it doesn't look placeable.
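A tiny sketch of that retry loop (all names are hypothetical):

// Keep re-rolling a position until it is legal or we give up on this object.
static bool TryPlace(Random rng, Func<int, int, float> heightAt,
                     int mapW, int mapH, float maxAltitude,
                     out int x, out int y, int maxTries = 20)
{
    for (int attempt = 0; attempt < maxTries; attempt++)
    {
        x = rng.Next(mapW);
        y = rng.Next(mapH);
        if (heightAt(x, y) <= maxAltitude) // reject mountain tops
            return true;
    }
    x = y = 0;
    return false; // caller skips this object rather than forcing a bad spot
}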
