I was doing some benchmarking today using C# and OpenTK, just to see how much I could actually render before the framerate dropped. The numbers I got were pretty astronomical, and I am quite happy with the outcome of my tests.
In my project I am loading the Blender monkey, which is 968 triangles. I then instance it and render it 100 times, which means I am rendering 96,800 triangles per frame. This far exceeds anything I would need to render in any given scene in my game. After that I pushed it even further and rendered 2000 monkeys at varying locations. I was now rendering a whopping 1,936,000 triangles (almost 2 million) per frame, and the framerate was still locked at 60 frames per second! That number just blew my mind. I pushed it even further and finally the framerate started to drop, but this just means that the limit is roughly 4 million triangles per frame with instancing.
I was just wondering though, because I am using some legacy OpenGL, if this could still be pushed even further—or should I even bother?
For my tests I load the Blender monkey model and store it in a display list using the deprecated calls, like so:
modelMeshID = MeshGenerator.Generate(delegate {
    GL.Begin(PrimitiveType.Triangles);
    foreach (Face f in model.Faces)
    {
        foreach (ModelVertex p in f.Points)
        {
            Vector3 v = model.Vertices[p.Vertex];
            Vector3 n = model.Normals[p.Normal];
            Vector2 tc = model.TexCoords[p.TexCoord];
            GL.Normal3(n.X, n.Y, n.Z);
            GL.TexCoord2(tc.Y, tc.X);
            GL.Vertex3(v.X, v.Y, v.Z);
        }
    }
    GL.End();
});
and then call that list x number of times. My question, though, is whether I could speed this up by putting VAOs (Vertex Array Objects) into the display list instead of the old GL.Vertex3 API. Would this affect performance at all, or would it produce the same outcome as the display list?
Here is a screen grab of a few thousand:
My system specs:
CPU: AMD Athlon II X4 620 (quad core) @ 2.60 GHz
Graphics Card: AMD Radeon HD 6800
My question, though, is whether I could speed this up by putting VAOs (Vertex Array Objects) into the display list instead of the old GL.Vertex3 API. Would this affect performance at all, or would it produce the same outcome as the display list?
No.
The main problem you're going to run into is that display lists and vertex arrays don't play well with each other. With buffer objects they kind of work, but display lists themselves are legacy, just like the immediate mode drawing API.
However, even if you manage to get the VBO drawing from within a display list right, there will be only a slight improvement, if any. When compiling the display list the OpenGL driver knows that everything arriving will eventually be "frozen". This allows for some very aggressive internal optimization: all the geometry data is packed into a buffer object on the GPU, and state changes are coalesced. AMD is not quite as good at this game as NVIDIA, but they're not bad either; display lists are heavily used in CAD applications, and before ATI addressed the entertainment market they were focused on CAD, so their display list implementation is not bad at all. If you pack all the relevant state changes required for a particular draw call into the display list, then calling the display list will likely drop you into the fast path.
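For illustration, here is a minimal sketch (C# with OpenTK's legacy bindings; the texture ID, positions array and instance count are placeholders, not your code) of what "packing the relevant state into the display list" might look like, so that a single GL.CallList per instance hits the fast path:

// Compile once: state changes plus geometry go into the same list.
int listId = GL.GenLists(1);
GL.NewList(listId, ListMode.Compile);
GL.Enable(EnableCap.Texture2D);
GL.BindTexture(TextureTarget.Texture2D, monkeyTextureId); // hypothetical texture ID
GL.Begin(PrimitiveType.Triangles);
// ... emit vertices exactly as in the question ...
GL.End();
GL.EndList();

// Per frame: one cheap call per instance, no per-vertex API overhead.
for (int i = 0; i < instanceCount; i++)
{
    GL.PushMatrix();
    GL.Translate(positions[i]);   // positions: hypothetical Vector3[] of instance locations
    GL.CallList(listId);
    GL.PopMatrix();
}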
I pushed it even further and finally the framerate started to drop, but this just means that the limit is roughly 4 million triangles per frame with instancing.
What's actually limiting you there is the overhead of calling the display list. I suggest you add a bit more geometry to the DL and try again.
Display lists are shockingly efficient. They were removed from modern OpenGL mostly because they can only be used effectively with the immediate mode drawing commands. Also, recent features like transform feedback and conditional rendering would have been very difficult to integrate with display lists. So they got removed, and rightfully so, because display lists are kind of awkward to work with.
Now if you look at Vulkan, the essential idea is to set up as much of the drawing commands (state changes, resource bindings and so on) as possible upfront in command buffers and reuse those for varying data. This is as if you could create multiple display lists and have them make babies.
Using vertex lists with Begin and End causes the monkey geometry to be sent to the GPU every iteration, going through PCI-E, which is the slowest memory interface you have during rendering. Also, depending on your GL implementation, every call into GL can have more or less overhead of its own. If you used buffer objects, all that overhead would be gone, because you only send the monkey over once and then all you need is a draw call every iteration.
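As a rough sketch (assuming a flat position-only float array built from the model, and using OpenTK's legacy client-state API rather than full VAOs; BuildInterleavedArray, positions and instanceCount are placeholders), the buffer-object version might look like this:

// One-time setup: upload the monkey's vertex data to a buffer object.
float[] vertexData = BuildInterleavedArray(model); // hypothetical helper: x,y,z per vertex
int vbo;
GL.GenBuffers(1, out vbo);
GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
GL.BufferData(BufferTarget.ArrayBuffer,
    (IntPtr)(vertexData.Length * sizeof(float)),
    vertexData, BufferUsageHint.StaticDraw);

// Per frame: bind once, then issue one cheap draw call per instance.
// (Normals and texture coordinates omitted for brevity.)
GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
GL.EnableClientState(ArrayCap.VertexArray);
GL.VertexPointer(3, VertexPointerType.Float, 0, 0);
for (int i = 0; i < instanceCount; i++)
{
    GL.PushMatrix();
    GL.Translate(positions[i]);
    GL.DrawArrays(PrimitiveType.Triangles, 0, vertexData.Length / 3);
    GL.PopMatrix();
}
GL.DisableClientState(ArrayCap.VertexArray);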
However, the monkey geometry is tiny (just a few KB), so sending it over the PCI-E bus (at something like 16 GB/s), plus the few hundred iterations of the "geometry loop", would not even take a millisecond. And even that will not touch your frame rate because, unless you are explicitly synchronizing, it will be completely absorbed by pipelining: the copying and the draw call will run while the GPU is still busy rendering the previous frame. By the time the GPU starts rendering the next frame, the data is already there.
That is why I am guessing that, given a fairly optimized GL implementation (good drivers), using buffer objects would not yield any speed-up here. Note that with bigger and more complex geometry and rendering operations, buffer objects will of course become crucial to performance. Small buffers might even stay cached on chip between draw calls.
Nevertheless, as a serious speed-freak, you definitely want to double-check and verify these sorts of guesstimates :)
Related
I'm trying to construct a program in C# that generates a 3D model of a structure composed of beams, and then creates some views of the object (front, side, top and isometric).
As I don't need to draw surfaces (the edges are enough), I've been calculating each line to draw, and then drawing it with
GraphicObject.DrawLine(myPen, x1, y1, x2, y2)
This worked fine so far, but as I keep adding parts to the structure, the refresh of GraphicObject takes too much time. So I'm getting into line visibility checks to reduce the number of lines to draw.
I've searched Wikipedia and some PDFs on the subject, but all I found is oriented toward surfaces. So my question: is there a simplified algorithm for checking the visibility of object edges, or should I go for a different approach, like considering surfaces?
Any suggestions would be appreciated, thanks for your help.
Additional notes/questions:
My current approach:
calculate every beam in a local axis (all vertices)
=> move them to their global position
=> create a list with pairs of points (projected and scaled to the view)
=> GraphicObject.DrawLine over the list of point pairs (see the sketch below)
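A minimal sketch of that pipeline (using modern C# for brevity; the Beam record and an orthographic front view are assumptions for illustration, not the actual types in the project):

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Numerics;

record Beam(Vector3 Origin, Vector3[] LocalVertices, (int A, int B)[] Edges);

static void DrawStructure(Graphics g, Pen pen, IEnumerable<Beam> beams, float scale)
{
    foreach (var beam in beams)
    {
        // Local -> global: translate the beam's local vertices to its position.
        Vector3[] world = Array.ConvertAll(beam.LocalVertices, v => v + beam.Origin);

        // Project and scale: a front view simply drops the depth (Z) coordinate.
        // (Note that GDI+ Y grows downward; flip the sign if needed.)
        foreach (var (a, b) in beam.Edges)
        {
            g.DrawLine(pen,
                world[a].X * scale, world[a].Y * scale,
                world[b].X * scale, world[b].Y * scale);
        }
    }
}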
Would the whole thing be faster if I calculated the view per pixel rather than using the DrawLine method?
Screenshots follow with the type of structure it's going to do (not fully complete yet):
Structure view
Structure detail
There are two ways to improve the performance.
a) Move the computation to the graphics card.
b) Use a kd-tree or some other spatial data structure to quickly remove the non-visible edges.
Here are more details:
For a), a lot of your computations are multiplying many vertices (vectors of length 3) by some matrix. CPUs are slow at this because they only do a couple of these operations at a time. Switching to a GPU, for example using CUDA, will let you do more of them in parallel, with a better memory access infrastructure. You can also use OpenGL/DirectX/Vulkan or whatever to render the lines themselves, to skip having to get the results back from the graphics card and whatever other hiccups get introduced by Windows code/libraries. This will help improve performance in almost all cases.
For b), it only helps when you are not looking at the entire scene (if you are, you really do need to draw everything). In that case you can store your scene in a kd-tree or some other spatial data structure and use it to quickly discard things that are definitely outside of the view area. You usually need to intersect some cuboid with a pyramid/frustum, so there's a bit more math involved.
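As a simple starting point before a full kd-tree, a coarse per-beam bounding-box test against the view rectangle already discards most off-screen edges. A sketch, assuming the same hypothetical Beam type and 2D projection as in the earlier snippet:

using System;
using System.Drawing;

static bool IsPotentiallyVisible(Beam beam, RectangleF view, float scale)
{
    // Compute the beam's projected, scaled bounding box.
    float minX = float.MaxValue, minY = float.MaxValue;
    float maxX = float.MinValue, maxY = float.MinValue;
    foreach (var v in beam.LocalVertices)
    {
        float x = (v.X + beam.Origin.X) * scale;
        float y = (v.Y + beam.Origin.Y) * scale;
        minX = Math.Min(minX, x); maxX = Math.Max(maxX, x);
        minY = Math.Min(minY, y); maxY = Math.Max(maxY, y);
    }
    // Skip the whole beam if its box misses the viewport.
    return view.IntersectsWith(RectangleF.FromLTRB(minX, minY, maxX, maxY));
}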
As a compromise that should help in large scenes where you want to see everything, you can consider adjusting the level of detail. From your example, the red beams across are composed of 8 or so components. If you are far enough away, you won't be able to distinguish the 8, so just draw one. This works great if you have a large number of rounded edges, as you can simplify a lot of them.
First of all, I am aware that this question really sounds as if I didn't search, but I did, a lot.
I wrote a small Mandelbrot drawing program in C#; it's basically a Windows Forms app with a PictureBox on which I draw the Mandelbrot set.
My problem is that it's pretty slow. Without a deep zoom it does a pretty good job, and moving around and zooming is pretty smooth, taking less than a second per drawing; but once I start to zoom in a little and get to places which require more calculations, it becomes really slow.
Other Mandelbrot applications run fine on my computer in places that are much slower in my application, so I'm guessing there is a lot I can do to improve the speed.
I did the following things to optimize it:
Instead of using the SetPixel/GetPixel methods on the bitmap object, I used the LockBits method to write directly to memory, which made things a lot faster.
Instead of using complex number objects (with classes I made myself, not the built-in ones), I emulated complex numbers using 2 variables, re and im. Doing this allowed me to cut down on multiplications, because squaring the real part and the imaginary part is done a few times during the calculation, so I just save the square in a variable and reuse the result without needing to recalculate it.
I use 4 threads to draw the Mandelbrot; each thread does a different quarter of the image and they all work simultaneously. As I understand it, that means my CPU will use 4 of its cores to draw the image.
I use the Escape Time Algorithm, which, as I understand it, is the fastest?
Here is how I move between the pixels and calculate; it's commented, so I hope it's understandable:
//Pixel by pixel loop:
for (int r = rRes; r < wTo; r++)
{
    for (int i = iRes; i < hTo; i++)
    {
        //These calculations determine which complex number corresponds to the (r,i) pixel.
        double re = (r - (w / 2)) * step + zeroX;
        double im = (i - (h / 2)) * step - zeroY;

        //Create the Z complex number
        double zRe = 0;
        double zIm = 0;

        //Variables to store the squares of the real and imaginary parts.
        double multZre = 0;
        double multZim = 0;

        //Start iterating with the complex number to determine its escape time (mandelValue)
        int mandelValue = 0;
        while (multZre + multZim < 4 && mandelValue < iters)
        {
            /* The new real part equals re(z)^2 - im(z)^2 + re(c); we store it in a temp variable
             * tempRe because we still need re(z) in the next calculation.
             */
            double tempRe = multZre - multZim + re;

            /* The new imaginary part is equal to 2*re(z)*im(z) + im(c).
             * Instead of multiplying by 2, I add re(z) to itself and then multiply by im(z),
             * which means I do 1 multiplication instead of 2.
             */
            zRe += zRe;
            zIm = zRe * zIm + im;
            zRe = tempRe; // We can now put the temp value in its place.

            // Do the squaring now; the results will be used in the next iteration.
            multZre = zRe * zRe;
            multZim = zIm * zIm;

            //Increase the mandelValue by one, because the iteration is now finished.
            mandelValue += 1;
        }

        //After the mandelValue is found, this colors its pixel accordingly (unsafe code, accesses memory directly).
        //(Unimportant for my question; I doubt the problem is here because my code becomes really slow
        // as the number of ITERATIONS grows, and this only executes more as the number of pixels grows.)
        Byte* pos = px + (i * str) + (pixelSize * r);
        byte col = (byte)((1 - ((double)mandelValue / iters)) * 255);
        pos[0] = col;
        pos[1] = col;
        pos[2] = col;
    }
}
What can I do to improve this? Do you find any obvious optimization problems in my code?
Right now there are 2 ways I know I can improve it:
I need to use a different type for numbers; double has limited precision, and I'm sure there are non-built-in alternative types which are faster (they multiply and add faster) and have more accuracy. I just need someone to point me to where I should look and tell me whether this is true.
I can move processing to the GPU. I have no idea how to do this (OpenGL maybe? DirectX? Is it even that simple, or will I need to learn a lot of stuff?). If someone can send me links to proper tutorials on this subject, or tell me about it in general, that would be great.
Thanks a lot for reading that far and hope you can help me :)
If you decide to move the processing to the GPU, you can choose from a number of options. Since you are using C#, XNA will allow you to use HLSL. RB Whitaker has the easiest XNA tutorials if you choose this option. Another option is OpenCL. OpenTK comes with a demo program of a Julia set fractal; it would be very simple to modify it to display the Mandelbrot set. See here
Just remember to find the GLSL shader that goes with the source code.
About the GPU, examples are no help for me because I have absolutely no idea about this topic, how does it even work and what kind of calculations the GPU can do (or how is it even accessed?)
Different GPU software works differently however ...
Typically a programmer will write a program for the GPU in a shader language such as HLSL, GLSL or OpenCL. The program written in C# will load the shader code and compile it, and then use functions in an API to send a job to the GPU and get the result back afterwards.
Take a look at FX Composer or RenderMonkey if you want some practice with shaders without having to worry about APIs.
If you are using HLSL, the rendering pipeline looks like this.
The vertex shader is responsible for taking points in 3D space and calculating their position in your 2D viewing field. (Not a big concern for you since you are working in 2D)
The pixel shader is responsible for applying shader effects to the pixels after the vertex shader is done.
OpenCL is a different story; it's geared towards general purpose GPU computing (i.e. not just graphics). It's more powerful and can be used for GPUs, DSPs, and building supercomputers.
WRT coding for the GPU, you can look at Cudafy.Net (it does OpenCL too, which is not tied to NVIDIA) to start getting an understanding of what's going on, and perhaps even do everything you need there. I quickly found it - and my graphics card - unsuitable for my needs, but for the Mandelbrot at the stage you're at, it should be fine.
In brief: you code for the GPU with a flavour of C (CUDA C or OpenCL normally), then push the "kernel" (your compiled C method) to the GPU, followed by any source data, and then invoke that "kernel", often with parameters to say what data to use - or perhaps a few parameters to tell it where to place the results in its memory.
When I've been doing fractal rendering myself, I've avoided drawing to a bitmap for the reasons already outlined and deferred the render phase. Besides that, I tend to write massively multithreaded code, which is really bad for trying to access a bitmap. Instead, I write to a common store - most recently I've used a MemoryMappedFile (a built-in .NET class), since that gives me pretty decent random access speed and a huge addressable area. I also tend to write my results to a queue and have another thread deal with committing the data to storage; the compute time of each Mandelbrot pixel will be "ragged" - that is to say, pixels will not always take the same length of time. As a result, your pixel commit could be the bottleneck for very low iteration counts. Farming it out to another thread means your compute threads are never waiting for storage to complete.
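A minimal sketch of that producer/consumer split (PixelResult, WriteToStore and ComputeEscapeTime are illustrative placeholders, not my actual code):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

record PixelResult(int X, int Y, int Iterations);

static void RenderWithCommitterThread(int width, int height)
{
    var results = new BlockingCollection<PixelResult>(boundedCapacity: 1 << 16);

    // Single committer thread: only it touches the output storage.
    var committer = Task.Run(() =>
    {
        foreach (var p in results.GetConsumingEnumerable())
            WriteToStore(p);   // hypothetical: writes to the memory-mapped file / bitmap
    });

    // Compute threads just compute and enqueue; they only block if the bounded queue is full.
    Parallel.For(0, height, y =>
    {
        for (int x = 0; x < width; x++)
            results.Add(new PixelResult(x, y, ComputeEscapeTime(x, y)));   // hypothetical compute
    });

    results.CompleteAdding();
    committer.Wait();
}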
I'm currently playing with the Buddhabrot visualisation of the Mandelbrot set, looking at using a GPU to scale out the rendering (since it's taking a very long time with the CPU) and handling a huge result set. I was thinking of targeting an 8 gigapixel image, but I've come to the realisation that I need to diverge from the constraints of pixels, and possibly away from floating point arithmetic due to precision issues. I'm also going to have to buy some new hardware so I can interact with the GPU differently - different compute jobs will finish at different times (as per my iteration count comment earlier), so I can't just fire batches of threads and wait for them all to complete without potentially wasting a lot of time waiting for one particularly high iteration count out of the whole batch.
Another point that I hardly ever see being made about the Mandelbrot set is that it is symmetrical about the real axis. You might be doing twice as much calculating as you need to.
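A sketch of exploiting that symmetry (this only applies when the view is centred on the real axis; ComputeScanline and CopyScanline are placeholders for your own row routines):

// Only compute rows at or above the real axis, then mirror them below it.
for (int y = 0; y <= height / 2; y++)
{
    ComputeScanline(y);                      // hypothetical: fills row y of the buffer
    int mirrored = height - 1 - y;           // matching row below the axis
    if (mirrored != y)
        CopyScanline(from: y, to: mirrored); // hypothetical copy
}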
For moving the processing to the GPU, you have lots of excellent examples here:
https://www.shadertoy.com/results?query=mandelbrot
Note that you need a WebGL-capable browser to view that link. It works best in Chrome.
I'm no expert on fractals, but you seem to have come far already with the optimizations. Going beyond that may make the code much harder to read and maintain, so you should ask yourself whether it is worth it.
One technique I've often observed in other fractal programs is this: while zooming, calculate the fractal at a lower resolution and stretch it to full size during render. Then render at full resolution as soon as zooming stops.
Another suggestion is that when you use multiple threads, you should take care that each thread doesn't read/write memory belonging to other threads, because this will cause cache collisions and hurt performance. One good algorithm could be to split the work up into scanlines (instead of four quarters like you do now). Create a number of threads; then, as long as there are lines left to process, assign a scanline to a thread that is available. Let each thread write the pixel data to a local piece of memory and copy it back to the main bitmap after each line (to avoid cache collisions).
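A rough sketch of that scanline scheme (an interlocked counter hands out the next line; ComputeLine, CopyLineToBitmap, width, height and pixelSize are placeholders for your existing code):

using System;
using System.Threading;

int nextLine = -1;
var workers = new Thread[Environment.ProcessorCount];

for (int t = 0; t < workers.Length; t++)
{
    workers[t] = new Thread(() =>
    {
        var lineBuffer = new byte[width * pixelSize]; // thread-local scratch line
        int y;
        while ((y = Interlocked.Increment(ref nextLine)) < height)
        {
            ComputeLine(y, lineBuffer);          // hypothetical: escape-time loop for one row
            CopyLineToBitmap(y, lineBuffer);     // hypothetical: single write back to the bitmap
        }
    });
    workers[t].Start();
}
foreach (var w in workers) w.Join();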
Probably not a very descriptive title, but I'm doing my best. It's my first time posting on StackOverflow, and I'm relatively new to programming in C# (first started around a year ago using Unity, and decided a few days ago to upgrade to XNA). That being said, please be kind to me.
I'm planning out the mechanics of a 2D game that I'm designing, and while most of it seems straightforward after playing around in XNA, there's one issue that I keep coming back to that I have yet to come up with a satisfactory answer for. The issue involves the layering of sprites into composite / complex sprites. For example, a character in the game might be wielding one or two of any number of weapons. I did do a bit of research on the topic, and found some people recommending to use the RenderTarget class to draw a series of sprites as one, and some recommending simply drawing the sprites on top of one another during Draw(). These topics, however, were mostly focused on the relatively simple case of having a single character in the game.
In my case, the game will have a number of sprite-based characters who have totally different postures / animations. There are around 10 right now, and there will probably be more added later in development. There will likewise be a largish number of weapons (probably around 20 to start) that will be composited onto the characters. That much I'm comfortable with. However, the problem is that each of the characters would require the weapon sprites to be draw in different locations and with different rotations during each frame of a character's animation.
I've considered a couple approaches to how to pull this off, but they all have pretty massive drawbacks.
The first was simply drawing a spritesheet of each weapon for each character, the same size as the appropriate character. The benefit of this approach would be the ease of just adding the call to draw the additional sprite on top of the base character without having to do any calculations. The downside is that this creates an inordinate number of extra sprite sheets (200 extra sheets for 10 characters x 20 weapons).
The second was creating a class to handle the weapon sprite. The WeaponSprite class would be attached to a single texture for each weapon, and would then store information about the offset / rotation to use when drawn, based on the character that it is attached to. The problem with this is that organizing the offsets / rotations on a per-frame basis would be incredibly tedious, and I can't think of any easy way to pull the information based on the frame required. (I had the idea of making an AnimationFrame class to keep track of the animation name, directional facing and frame number of each character, and then using a dictionary in the weapon class to load the proper data based on the name of the current frame, but something about the idea seemed really ill-conceived). This method also has the drawback of requiring a relatively large amount of memory to pull off (assuming a Vector2 is 8 bytes and a float is 4, having 10 characters and 20 weapons would require 192KB of memory given the current number of frames being used, which would only get larger as more weapons were added). I had an offshoot of that idea (which I sort of stole from another post on here about the same topic) of using a reserved alpha value pixel to link the offset and the 'origin' of each weapon, calculating the position at runtime and then only having to store the rotational float in the aforementioned dictionary.
Since I'm new to XNA (and still pretty green on C#), I figured I'd post and let the experts chime in. Am I on the right track with my methods, or am I missing something really simple? Thank you very much in advance for your help, and please let me know if you need any additional information.
Wow, big question. I can't really tell you exactly how to implement this. But I can give you some helpful nuggets of advice:
Advice #0: Whenever any kind of compositing problem comes up, people come out of the woodwork recommending "render targets" as some kind of compositing panacea. They are usually wrong. Avoid using render targets if you can. You only need them if you are doing effects on the final, composite image (blends, blurs, etc). Otherwise just draw your sprites over the top of each other directly to the backbuffer.
Advice #1: You want to pack all your sprites onto a single sprite sheet, if possible. If you exceed the texture size limit, you'll have to be clever about how you partition your sprites across sheets. The reason is performance - you want to limit the number of texture swaps - see this answer for details.
You may be able to use an existing sprite-packer for XNA. If you can find a suitable one, I recommend you use it. A good one will allow you to treat a packed sprite just as you would treat a texture when calling SpriteBatch.Draw.
Advice #2: Do not worry about how much space your positioning data takes at runtime. 192kb is almost nothing - the size of a small texture.
The upshot of this, and #1, is to store as much as possible in your positioning meta-data, and avoid duplicate textures.
How you store your meta-data almost doesn't matter.
Advice #3: You can change both your storage requirements and your content creation story from an n × m problem into an n + m problem (n characters and m weapons). Simply store weapons with only an "origin", and store characters with an "origin" and a "hand position & rotation". Then render such that the origin of the weapon lines up with the hand of the character (the maths is very simple; see the sketch after the size estimate below).
Then you can add characters without worrying about what weapons exist, and add weapons without worrying about what characters exist.
Here's an example of how much space this needs: 10 characters × 20 bytes + 20 weapons × 8 bytes = 360 bytes. Nice and small! (Although you'll probably want many more attachment points - different kinds for different weapons, hats, whatever. Edit: oops, I didn't include animation frames - but it's still a relatively small amount of data.)
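A minimal sketch of the attachment maths, inside an XNA draw routine (the type and field names here are illustrative, not a prescribed design):

struct WeaponData
{
    public Texture2D Texture;
    public Vector2 Origin;             // grip point, in weapon texture pixels
}

struct HandAttachment                   // per character, per animation frame
{
    public Vector2 Position;            // hand position relative to the character sprite
    public float Rotation;              // hand rotation in radians
}

// Draw the character, then the weapon pinned to the hand:
void DrawArmedCharacter(SpriteBatch batch, Texture2D character, Vector2 characterPos,
                        HandAttachment hand, WeaponData weapon)
{
    batch.Draw(character, characterPos, Color.White);
    batch.Draw(weapon.Texture,
        characterPos + hand.Position,   // the weapon's origin lands on the hand
        null,                           // full source rectangle
        Color.White,
        hand.Rotation,                  // weapon rotates with the hand
        weapon.Origin,                  // rotate/position around the grip point
        1f, SpriteEffects.None, 0f);
}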
Advice #4: The trickiest part, as you seem to be hinting at in your post, is content creation.
As you hint at, ideally you would want to be able to edit the attachment points directly in your image editor. This is a compelling idea. Special alpha values are only appropriate if your sprites have no anti-aliasing. You could theoretically do something with layers and different colours. The hardest part is figuring out how to encode rotation.
You could use an XNA content pipeline processor to extract data from the image at build-time. However this gets very expensive to implement (especially if you've not done it before - the content pipeline is badly under-documented). Unless your art requirements are truly enormous, it is almost certainly not worth the extra development time required to make the content pipeline extension. By the time you're done, you could have hand-coded the positioning data several times over.
My recommendation, then, is to store the extra data in an easy-to-edit XML file. I recommend using XNA's XML Content Importer. It can be tricky to get the hang of the formatting at first, and you have to remember to include the appropriate assembly referencing. But once you know how to use it, it's the easiest way to get structured data into XNA quickly.
Until recently, our game checked collisions by getting the colour data from a section of the background texture of the scene. This worked very well, but as the design changed, we needed to check against multiple textures and it was decided to render these all to a single RenderTarget2D and check collisions on that.
public bool TouchingBlackPixel(GameObject p)
{
    /*
       Calculate the rectangle under the player...
       sourceX, sourceY: position of the top-left corner of the rectangle
       sizeX, sizeY: approximate (cast to int from float) size of the box
    */
    Rectangle sourceRectangle = new Rectangle(sourceX, sourceY,
                                              (int)sizeX, (int)sizeY);
    Color[] retrievedColor = new Color[(int)(sizeX * sizeY)];
    p.tileCurrentlyOn.collisionLayer.GetData(0, sourceRectangle, retrievedColor,
                                             0, retrievedColor.Count());
    /*
       Check collisions
    */
}
The problem that we've been having is that, since moving to the render target, we are experiencing massive reductions in FPS.
From what we've read, it seems as if the issue is that in order to get data from the RenderTarget2D, you need to transfer data from the GPU to the CPU and that this is slow. This is compounded by us needing to run the same function twice (once for each player) and not being able to keep the same data (they may not be on the same tile).
We've tried moving the GetData calls to the tile's Draw function and storing the data in a member array, but this does not seem to have solved the problem (as we are still calling GetData on a tile quite regularly - down from twice every update to once every draw).
Any help which you could give us would be great as the power that this collision system affords us is quite fantastic, but the overhead which render targets have introduced will make it impossible to keep.
The simple answer is: Don't do that.
It sounds like offloading the compositing of your collision data to the GPU was a performance optimisation that didn't work - so the correct course of action would be to revert the change.
You should simply do your collision checks all on the CPU. And I would further suggest that it is probably faster to run your collision algorithm multiple times and determine a collision response by combining the results, rather than compositing the whole scene onto a single layer and running collision detection once.
This is particularly the case if you are using the render target to support transformations before doing collision.
(For simple 2D pixel collision detection, see this sample. If you need support for transformations, see this sample as well.)
I suppose your tile's collision layer does not change, or at least changes infrequently. If so, you can store the colors for each tile in an array or other structure. This decreases the amount of data transferred from the GPU to the CPU, but requires that the additional data stored in RAM is not too big.
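For illustration, a sketch of that caching idea (the field and method names are assumptions based on the snippet above, not your actual code):

// Cache the collision layer's pixels once per tile (or whenever the layer changes).
Color[] cachedPixels;   // field on the tile
int layerWidth;

void CacheCollisionLayer(Texture2D collisionLayer)
{
    layerWidth = collisionLayer.Width;
    cachedPixels = new Color[collisionLayer.Width * collisionLayer.Height];
    collisionLayer.GetData(cachedPixels);   // one GPU -> CPU transfer, up front
}

// Later, collision checks just index the cached array - no GetData per frame.
bool IsBlackAt(int x, int y)
{
    return cachedPixels[y * layerWidth + x] == Color.Black;
}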
While profiling my app using PIX, I noticed that the GPU is spending (in DX10 mode) most of its time idle, waiting for a resource that is not available. (And for this problem the GPU is always in step with the CPU - for example, if the CPU is processing frame X, the GPU is also processing frame X.)
Some notes:
1) The app is GPU limited (the CPU is basically idle - 20% CPU usage in the heaviest scene)
My questions are:
1) How should I interpret these results? In PIX, every frame on the GPU side I see 2-3 little red bars (which, as far as I know, mean a resource was unavailable) followed by a medium/big gray bar (which means the GPU was idle). The CPU side, on the other hand, has some operations, a big empty bar and then some more operations (is it waiting for something?)
Another note: when the GPU is idle, the CPU is generally working. (The opposite is obviously not true.)
2) What calls can make a resource become unavailable?
Is a Map with DISCARD considered a blocking call?
A query to get the DESC of an object?
Is sharing a shader Effect a source of contention?
What else?
My general frame is :
41 DrawPrimitives/DrawIndexedPrimitives calls (most objects are instanced)
7-8 locks on a vertex buffer with DISCARD
9 pixel shader/vertex shader changes
1 SetRenderTarget
Thanks!
P.S. Screenshot of pix
http://img191.imageshack.us/img191/6800/42594100.jpg
If I use a single draw call (with the same GPU load - for example a particle engine with x particles, or an instanced object) instead of the full game, I get a full blue bar and the GPU is correctly 2-3 frames behind the CPU...
EDIT: I'm focusing more and more on the Effect framework, which is probably the reason for this problem. I share one effect between multiple objects to save memory and the time to create them. Is it safe to assume this doesn't cause contention?
What comes to mind with the provided information:
Do you use double buffering with vsync? Maybe they are both waiting for the backbuffer to become available. Try triple buffering or immediate presentation.
Have you tried locking your vertex buffer with a NOOVERWRITE circular strategy instead of 8 DISCARDs? Maybe there is too much memory pressure for the GPU to allocate a new buffer for each discard. Also, some hardware doesn't allow discarding the same vertex buffer more than X times before it gets to render its contents.
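For reference, a sketch of the circular NOOVERWRITE pattern (the Lock/Write/Unlock calls here are placeholders for whatever your wrapper over D3D exposes; the two branches correspond to DISCARD and NO_OVERWRITE lock semantics):

enum LockMode { Discard, NoOverwrite }   // stand-ins for the real D3D lock flags

int cursor = 0;                           // current write offset into the dynamic VB, in bytes
const int VertexBufferSize = 1 << 20;     // assumed capacity of the dynamic buffer

// Append 'data' to the dynamic vertex buffer; returns the byte offset to draw from.
int AppendToDynamicVB(byte[] data)
{
    if (cursor + data.Length > VertexBufferSize)
    {
        // Buffer full: orphan it once per wrap (DISCARD lock).
        LockVertexBuffer(0, VertexBufferSize, LockMode.Discard);        // placeholder call
        cursor = 0;
    }
    else
    {
        // Plenty of room: promise the driver we won't touch in-flight data
        // (NO_OVERWRITE lock), so it never has to stall or reallocate.
        LockVertexBuffer(cursor, data.Length, LockMode.NoOverwrite);    // placeholder call
    }

    WriteToLockedRegion(data);   // placeholder: copy the vertices in
    UnlockVertexBuffer();        // placeholder

    int drawOffset = cursor;
    cursor += data.Length;
    return drawOffset;
}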
Since you are sharing the same effect, are the parameters also shared?