WinForms: How to show the output of a USB3 Vision webcam - C#

I got a task to show frames from a camera as fast as possible. The camera is a Basler Dart and it can produce more than 70 frames per second. Unfortunately it does not support stream output; it produces bitmaps.
Our current solution is similar to the example from Basler, but it uses a PictureBox to show the frames. I've read that this is slow and that a better approach should be used. I agree that it is slow, because displaying all 70 fps takes 25% of the CPU (the camera display alone; the rest of the app takes only 5%). Unfortunately I haven't found a better solution.
private void OnImageGrabbed(Object sender, ImageGrabbedEventArgs e)
{
    if (InvokeRequired)
    {
        // If called from a different thread, we must use the Invoke method to marshal the call to the proper GUI thread.
        // The grab result will be disposed after the event call. Clone the event arguments for marshaling to the GUI thread.
        BeginInvoke(new EventHandler<ImageGrabbedEventArgs>(OnImageGrabbed), sender, e.Clone());
        return;
    }
    try
    {
        // Acquire the image from the camera. Only show the latest image. The camera may acquire images faster than the images can be displayed.
        // Get the grab result.
        IGrabResult grabResult = e.GrabResult;
        // Check if the image can be displayed.
        if (grabResult.IsValid)
        {
            // Reduce the number of displayed images to a reasonable amount if the camera is acquiring images very fast.
            if (!stopWatch.IsRunning || stopWatch.ElapsedMilliseconds > 100)
            {
                stopWatch.Restart();
                bool reqBitmapOldDispose = true;
                if (grConvBitmap == null || grConvBitmap.Width != grabResult.Width || grConvBitmap.Height != grabResult.Height)
                {
                    grConvBitmap = new Bitmap(grabResult.Width, grabResult.Height, PixelFormat.Format32bppRgb);
                    grConvRect = new Rectangle(0, 0, grConvBitmap.Width, grConvBitmap.Height);
                }
                else
                {
                    reqBitmapOldDispose = false;
                }
                // Lock the bits of the bitmap.
                BitmapData bmpData = grConvBitmap.LockBits(grConvRect, ImageLockMode.ReadWrite, grConvBitmap.PixelFormat);
                // Place the pointer to the buffer of the bitmap.
                converter.OutputPixelFormat = PixelType.BGRA8packed;
                IntPtr ptrBmp = bmpData.Scan0;
                converter.Convert(ptrBmp, bmpData.Stride * grConvBitmap.Height, grabResult);
                grConvBitmap.UnlockBits(bmpData);
                // Assign a temporary variable to dispose the bitmap after assigning the new bitmap to the display control.
                Bitmap bitmapOld = pictureBox.Image as Bitmap;
                // Provide the display control with the new bitmap. This action automatically updates the display.
                pictureBox.Image = grConvBitmap;
                if (bitmapOld != null && reqBitmapOldDispose)
                {
                    // Dispose the bitmap.
                    bitmapOld.Dispose();
                }
            }
        }
    }
    catch (Exception exception)
    {
        ShowException(exception, "OnImageGrabbed");
    }
    finally
    {
        e.DisposeGrabResultIfClone();
    }
}
My idea was to move the load to the GPU. I tried SharpGL (OpenGL for C#), but it seemed unable to consume 70 textures per second; I suspect that is because I put the solution together in 60 minutes while learning the basics of OpenGL.
My question is: what should I use instead of the PictureBox to improve performance and decrease CPU load? Should I use OpenGL, or just limit the number of displayed frames (as the example does)? The PC has only integrated graphics (Intel i3, 6th gen).

When it comes to performance, I would generally recommend doing some profiling to see where the actual bottleneck is. It is much easier to find performance problems when you have some data to go on rather than just guessing. A good profiler should also tell you a bit about garbage-collection penalties and allocation rates.
Overall I would say the example code looks quite decent:
There is some rate control, even if the limit of 100ms/10fps looks rather low to me.
There does not seem to be much unnecessary copying going on, as far as I can see.
It looks like you are reusing and updating the bitmap rather than recreating it every frame.
Some possible things you could try:
If the camera is monochrome you could probably skip the conversion stage and just do a memory copy from the grab buffer to the bitmap.
If the camera is a high resolution model you could consider binning pixels to reduce the resolution of the images.
We are using WriteableBitmap in WPF with fairly good performance, but I'm not sure how it compares to a WinForms PictureBox.
You could try doing the drawing yourself, i.e. attach a handler to the Paint event of a panel and use one of the Graphics.DrawImage* methods to draw the latest bitmap (see the sketch after this list). I have no idea whether this will make a significant performance difference.
If the conversion takes any significant time you could try doing it on a background thread. If you do, you will need some way of making sure that no bitmap is accessed from both the worker thread and the UI thread at the same time.
The rate control could probably be improved to "process images as fast as the UI thread can handle" instead of a fixed rate.
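To illustrate the owner-drawn suggestion above, here is a minimal sketch of a double-buffered panel that just blits the most recent frame in its Paint handler. The class and member names (CameraPanel, PresentFrame, _latestFrame) are made up for this example and are not part of the Basler sample:
using System.Drawing;
using System.Windows.Forms;

public class CameraPanel : Panel
{
    private Bitmap _latestFrame;

    public CameraPanel()
    {
        // Double buffering avoids flicker and cuts down on redraw cost.
        DoubleBuffered = true;
    }

    // Call this from the grab handler (on the UI thread) with the freshly converted bitmap.
    public void PresentFrame(Bitmap frame)
    {
        _latestFrame = frame;
        Invalidate();   // schedule a repaint; OnPaint does the actual drawing
    }

    protected override void OnPaint(PaintEventArgs e)
    {
        base.OnPaint(e);
        if (_latestFrame != null)
        {
            // Drawing at the bitmap's native size avoids a costly GDI+ rescale.
            e.Graphics.DrawImageUnscaled(_latestFrame, 0, 0);
        }
    }
}
The grab handler would then call PresentFrame(grConvBitmap) instead of assigning pictureBox.Image; whether this actually beats the PictureBox is something only profiling will tell.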
Another thing to consider is whether hardware rendering is being used. This is normally the case, but there are situations where Windows falls back to software rendering, with very high CPU usage as a result:
Servers may lack a GPU at all
Virtualization might not virtualize the GPU
Remote desktop might use software rendering

Related

C# how to find image inside larger image in around 500-700 milliseconds?

I've been working on image recognition that grabs the screen as a bitmap in WinForms from a 727x115 area every 700 milliseconds. The GetPixel/SetPixel approach is way too slow, and I don't really know how to use the other methods I have found.
Bitmap bitmap = new Bitmap(Screen.PrimaryScreen.Bounds.Width, Screen.PrimaryScreen.Bounds.Height);
Graphics g = Graphics.FromImage(bitmap);
g.CopyFromScreen(896, 1250, 0, 0, bitmap.Size);
Bitmap myPic = Resources.SARCUT;
This creates an image of the area of the screen, and myPic is the image that needs to be found within the 727x115 area, as stated before. I've tried using AForge, Emgu, and LockBits, but I couldn't convert the bitmaps to the right format and never got it to work.
Any suggestions?
Bitmap operations, and any image operation together with rendering, are handled by GDI+ in .NET. GDI+, albeit faster than its predecessor GDI, is still notably slow. Also, you are performing a copy operation, and that will always represent a performance hit. If you really need to improve performance you should not use the GDI+ framework, which means you have to operate on the bitmaps directly and at a lower level. However, that last statement is very broad, because it depends on exactly what you want to accomplish and how. Finally, if you want to compare two images you should avoid doing it pixel by pixel and instead do it byte by byte; that is faster, since no indexing format and no value encoding have to be taken into account.
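For the byte-by-byte comparison mentioned above, a rough sketch using LockBits could look like the following. The helper name AreBitmapsEqual is made up, and it assumes both bitmaps have the same size and pixel format:
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

static bool AreBitmapsEqual(Bitmap a, Bitmap b)
{
    if (a.Size != b.Size || a.PixelFormat != b.PixelFormat)
        return false;

    var rect = new Rectangle(Point.Empty, a.Size);
    BitmapData da = a.LockBits(rect, ImageLockMode.ReadOnly, a.PixelFormat);
    BitmapData db = b.LockBits(rect, ImageLockMode.ReadOnly, b.PixelFormat);
    try
    {
        // Copy the raw pixel bytes out once and compare them directly,
        // instead of calling GetPixel for every coordinate.
        int byteCount = da.Stride * da.Height;
        var bytesA = new byte[byteCount];
        var bytesB = new byte[byteCount];
        Marshal.Copy(da.Scan0, bytesA, 0, byteCount);
        Marshal.Copy(db.Scan0, bytesB, 0, byteCount);

        for (int i = 0; i < byteCount; i++)
            if (bytesA[i] != bytesB[i])
                return false;
        return true;
    }
    finally
    {
        a.UnlockBits(da);
        b.UnlockBits(db);
    }
}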

Pre-generate Graphics object to improve printing performance

I have an application that prints invoices. I'd like to be able to pre-generate the invoices in a background task/process so I can reduce the downtime required to send the document to the printer when prompted by the user or other automation events. I'm looking for something like this...
Graphics _g;

// background task would call this method
void GenerateInvoice(Invoice i)
{
    _g = ???? // ????
    _g.DrawImage...
    _g.DrawString....
}

// user action, or automation event, would call this method...
void PrintInvoice()
{
    if (_g == null)
        throw new DocumentNotPreparedException();
    PrintDocument pd = new PrintDocument();
    pd.PrinterSettings.PrinterName = "My Fast Printer";
    pd.PrintPage += PrintHandler;
    pd.Print();
}

void PrintHandler(object o, PrintPageEventArgs e)
{
    // ????
    e.Graphics = _g;
}
Any suggestions on what needs to be done in and around the '???' sections?
I'd like to be able to pre-generate the invoices in a background task/process so I can reduce the downtime required to send the document to the printer
First step is to make sure you know what the source of the "downtime" is. It would be unusual for the bottleneck to exist in your own program's rendering code. Most often, a major source of printer slowness is either in the print driver itself (e.g. a driver with a lot of code and data that has to be paged in to handle the job), or dealing with a printer that requires client-side rasterization of the page images (which requires lots of memory to support the high-resolution bitmaps needed, which in turn can be slow on some machines, and of course greatly increases the time spent sending those rasterized images to the printer, over whatever connection you're using).
If and when you've determined it's your own code that's slow, and after you've also determined that your own code is fundamentally as efficient as you can make it, then you might consider pre-rendering as a way of improving the user experience. You have two main options here: rendering into a bitmap, and rendering into a metafile.
Personally, I would recommend the latter. A metafile will preserve your original rendering commands, providing a resolution-independent and memory-efficient representation of your printing data. This would be particularly valuable if your output consists primarily of line-drawings and text output.
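As a rough sketch of the metafile route (the _invoiceMetafile field and the sample drawing calls are placeholders, not the asker's actual invoice layout):
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Drawing.Printing;
using System.IO;

Metafile _invoiceMetafile;

// background task would call this method
void GenerateInvoice(Invoice i)
{
    // A reference device context is required to create a metafile.
    using (Graphics refGraphics = Graphics.FromHwnd(IntPtr.Zero))
    {
        IntPtr hdc = refGraphics.GetHdc();
        _invoiceMetafile = new Metafile(new MemoryStream(), hdc, EmfType.EmfPlusOnly);
        refGraphics.ReleaseHdc(hdc);
    }

    // Record the drawing commands; they are replayed later at printer resolution.
    using (Graphics g = Graphics.FromImage(_invoiceMetafile))
    {
        g.DrawString("Invoice", SystemFonts.DefaultFont, Brushes.Black, 10, 10);
        // ... g.DrawImage / g.DrawString calls for the rest of the invoice ...
    }
}

void PrintHandler(object o, PrintPageEventArgs e)
{
    // Replay the recorded commands onto the printer's Graphics.
    e.Graphics.DrawImage(_invoiceMetafile, e.MarginBounds);
}
Because the metafile stores the drawing commands rather than pixels, the replay in PrintHandler stays resolution-independent.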
If you render into a bitmap instead, you will want to make sure you allocate a bitmap at least the same resolution as that being supported by the printer for your print job. Otherwise, you will lose significant image quality in the process and your printouts will not look very good. Note though that if you go this route, you run the risk of incurring the same sort of memory-related slowdown that would theoretically be an issue when dealing with the printer driver directly.
Finally, in terms of choosing between the two techniques, one scenario in which the bitmap approach might be preferable to the metafile approach is if your print job output consists primarily of a large number of bitmaps which are already at or near the resolution supported by the printer. In this case, flattening those bitmaps into a single page-sized bitmap could actually reduce the memory footprint. Drawing them into a metafile would require each individual bitmap to be stored in the metafile, and if the total size of those bitmaps is larger than the single page-sized bitmap, that would of course use even more memory. Flattening them into a single bitmap would allow you to avoid having a large number of individual, large bitmaps in memory all at once.
But really, the above is mostly theoretical. You're suggesting adding a great level of complexity to your printing code, in order to address a problem that is most likely not one you can solve in the first place, because the problem most likely does not lie in your own code at all. You should make sure you've examined very carefully the reason for slow printing, before heading down this path.

Possible Rendering Performance Optimizations

I was doing some benchmarking today using C# and OpenTK, just to see how much I could actually render before the framerate dropped. The numbers I got were pretty astronomical, and I am quite happy with the outcome of my tests.
In my project I am loading the Blender monkey, which is 968 triangles. I then instance it and render it 100 times. This means that I am rendering 96,800 triangles per frame, which far exceeds anything I would need to render in any given scene in my game. After this I pushed it even further and rendered 2000 monkeys at varying locations. I was now rendering a whopping 1,936,000 triangles per frame (almost 2 million) and the framerate was still locked at 60 frames per second. That number just blew my mind. I pushed it even further and finally the framerate started to drop, but this just means that the limit is roughly 4 million triangles per frame with instancing.
I was just wondering, though, because I am using some legacy OpenGL, whether this could still be pushed even further, or whether I should even bother.
For my tests I load the Blender monkey model and store it in a display list using deprecated calls like:
modelMeshID = MeshGenerator.Generate( delegate {
    GL.Begin( PrimitiveType.Triangles );
    foreach( Face f in model.Faces ) {
        foreach( ModelVertex p in f.Points ) {
            Vector3 v = model.Vertices[ p.Vertex ];
            Vector3 n = model.Normals[ p.Normal ];
            Vector2 tc = model.TexCoords[ p.TexCoord ];
            GL.Normal3( n.X , n.Y , n.Z );
            GL.TexCoord2( tc.Y , tc.X );
            GL.Vertex3( v.X , v.Y , v.Z );
        }
    }
    GL.End();
} );
and then call that list x times. My question, though, is whether I could speed this up by putting VAOs (Vertex Array Objects) into the display list instead of the old GL.Vertex3 API. Would this affect performance at all? Or would it produce the same outcome as the display list?
Here is a screen grab of a few thousand:
My system specs:
CPU: AMD Athlon II X4 620 (quad core), 2.60 GHz
Graphics Card: AMD Radeon HD 6800
My question, though, is whether I could speed this up by putting VAOs (Vertex Array Objects) into the display list instead of the old GL.Vertex3 API. Would this affect performance at all? Or would it produce the same outcome as the display list?
No.
The main problem you're going to run into is that display lists and vertex arrays don't go well with each other. Using buffer objects they kind of work, but display lists themselves are legacy, just like the immediate-mode drawing API.
However, even if you manage to get the VBO drawing from within a display list right, there will be hardly any improvement: when compiling the display list, the OpenGL driver knows that everything arriving will eventually be "frozen". This allows for some very aggressive internal optimization; all the geometry data gets packed into a buffer object on the GPU and state changes are coalesced. AMD is not quite as good at this game as NVidia, but they're not bad either; display lists are heavily used in CAD applications, and before ATI addressed the entertainment market they were focused on CAD, so their display list implementation is not bad at all. If you pack all the relevant state changes required for a particular drawing call into the display list, then when calling the display list you'll likely drop into the fast path.
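As an illustration of that last point, a display list that bakes in both the state changes and the geometry might be recorded like this (the list id and texture handle are assumptions for the example, not from the question's code):
using OpenTK.Graphics.OpenGL;

int monkeyTexture = 0;  // assumed texture handle created elsewhere
int listId = GL.GenLists(1);
GL.NewList(listId, ListMode.Compile);
// Everything recorded here is frozen at compile time, so the driver can
// coalesce the state changes and pack the geometry into GPU-side buffers.
GL.BindTexture(TextureTarget.Texture2D, monkeyTexture);
GL.Begin(PrimitiveType.Triangles);
// ... GL.Normal3 / GL.TexCoord2 / GL.Vertex3 calls as in the question ...
GL.End();
GL.EndList();

// Later, once per instance and per frame:
GL.CallList(listId);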
I pushed it even further and finally the framerate started to drop, but this just means that the limit is roughly 4 million triangles per frame with instancing.
What's actually limiting you there is the overhead of calling the display list. I suggest you put a little bit more geometry into the DL and try again.
Display Lists are shockingly efficient. That they got removed from modern OpenGL is mostly because they can be effectively used only with the immediate mode drawing commands. Also recent things like transform feedback and conditional rendering would have been very difficult to integrate into display lists. So they got removed; and rightfully so, because Display Lists are kind of awkward to work with.
Now if you look at Vulkan the essential idea is to set up as much of the drawing commands (state changes, resource bindings and so on) upfront in command buffers and reuse those for varying data. This is like if you could create multiple display lists and have them make babies.
Using vertex lists with Begin and End causes the monkey geometry to be sent to the GPU every iteration, going through PCI-E, which is the slowest memory interface you have during rendering. Also, depending on your GL implementation, every call to GL can have more or less overhead of its own. If you used buffer objects, all that overhead would be gone, because you only send the monkey over once and then all you need is a draw call every iteration.
However, the monkey geometry is tiny (just a few KB), so sending it over the PCI-E bus (at something like 16 GB/s), plus the few hundred iterations of the "geometry loop", would not even take a millisecond. And even that will not touch your frame rate because, unless you are explicitly synchronizing, it will be completely absorbed by pipelining: the copying and the draw call will run while the GPU is still busy rendering the previous frame. By the time the GPU starts rendering the next frame, the data is already there.
That is why I am guessing that, given a fairly optimized GL implementation (good drivers), using buffer objects would not yield any speed-up here. Note that with bigger and more complex geometry and rendering operations, buffer objects will of course become crucial to performance. Small buffers might even stay cached on-chip between draw calls.
Nevertheless, as a serious speed-freak, you definitely want to double-check and verify these sorts of guesstimates :)
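For reference, the buffer-object path described above might look roughly like this in OpenTK (positions only; the field names and the flat float[] layout are assumptions for the example):
using System;
using OpenTK.Graphics.OpenGL;

int vbo;
int vertexCount;

void UploadMonkey(float[] positions)   // x,y,z triplets, already flattened
{
    vertexCount = positions.Length / 3;
    vbo = GL.GenBuffer();
    GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
    // One-time copy of the geometry into GPU memory.
    GL.BufferData(BufferTarget.ArrayBuffer,
                  (IntPtr)(positions.Length * sizeof(float)),
                  positions,
                  BufferUsageHint.StaticDraw);
}

void DrawMonkey()
{
    GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
    GL.EnableClientState(ArrayCap.VertexArray);
    GL.VertexPointer(3, VertexPointerType.Float, 0, IntPtr.Zero);
    // A single draw call replaces the per-vertex GL.Vertex3 calls.
    GL.DrawArrays(PrimitiveType.Triangles, 0, vertexCount);
    GL.DisableClientState(ArrayCap.VertexArray);
}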

Fast desktop image capture

I am trying to develop a basic screen sharing and collaboration app in C#. I am currently working on capturing the screen, finding areas of the screen that have changed and subsequently need to be transmitted to the end client.
I am having a problem in that the overall frame rate of the screen capture is too low. I have a fairly good algorithm for finding areas of the screen that have changed. Given a byte array of pixels on the screen it calculates areas that have changed in 2-4ms, however the overall frame rate I am getting is 15-18 fps (i.e. taking somewhere around 60ms per frame). The bottleneck is capturing the data on the screen as a byte array which is taking around 35-50ms. I have tried a couple of different techniques and can't push the fps past 20.
At first I tried something like this:
var _bmp = new Bitmap(screenSectionToMonitor.Width, screenSectionToMonitor.Height);
var _gfx = Graphics.FromImage(_bmp);
_gfx.CopyFromScreen(_screenSectionToMonitor.X, _screenSectionToMonitor.Y, 0, 0, new Size(_screenSectionToMonitor.Width, _screenSectionToMonitor.Height), CopyPixelOperation.SourceCopy);
var data = _bmp.LockBits(new Rectangle(0, 0, _screenSectionToMonitor.Width, _screenSectionToMonitor.Height), ImageLockMode.ReadOnly, _bmp.PixelFormat);
var ptr = data.Scan0;
Marshal.Copy(ptr, _screenshot, 0, _screenSectionToMonitor.Height * _screenSectionToMonitor.Width * _bytesPerPixel);
_bmp.UnlockBits(data);
This is too slow taking around 45ms just to run the code above for a single 1080p screen. This makes the overall frame rate too slow to be smooth, so I then tried using DirectX as per the example here:
http://www.codeproject.com/Articles/274461/Very-fast-screen-capture-using-DirectX-in-Csharp
However, this didn't really net any results. It marginally increased the speed of the screen capture, but it was still much too slow (around 25-40 ms), and the small increase wasn't worth the overhead of the extra DLLs, code, etc.
After googling around a bit I couldn't really find any better solutions, so my question is: what is the best way to capture the pixels currently displayed on the screen? An ideal solution would:
Capture the screen as an array of bytes as RGBA
Work on older windows platforms (e.g. Windows XP and above)
Work with multiple displays
Uses existing system libraries rather than 3rd party DLLs
All these points are negotiable for a solution that returns a decent overall frame rate, in the region of 5-10 ms for the actual capture, so the frame rate can be 40-60 fps.
Alternatively, if there is no solution that matches the above, am I taking the wrong path to calculating screen changes? Is there a better way to determine the areas of the screen that have changed?
Perhaps you can access the screen buffers at a lower level of code and hook directly into the layers and regions Windows uses as part of its screen updates. It sounds like you are after the raw display changes and Windows already has to keep track of this data. Just offering a direction for you to pursue while you find someone more knowledgeable.

Resource Contention

While profiling my app using PIX, I noticed that (in DX10 mode) the GPU spends most of its time idle, waiting for a resource that is not available (and for this problem the GPU is always in step with the CPU: for example, if the CPU is processing frame X, the GPU is also processing frame X).
Some notes:
1) The app is GPU limited (the CPU is basically idle, around 20% CPU usage in the heaviest scene).
My questions are :
1) How should I interpret these results? In PIX, on the GPU side of every frame I see 2-3 little red bars (which, as far as I know, mean a resource is unavailable) and after them a medium/big gray bar (which means the GPU is idle). The CPU side, on the other hand, shows some operations, a big empty bar and then some other operations (is it waiting for something?).
Another note: when the GPU is idle the CPU is generally working (the opposite is obviously not true).
2) What calls can make a resource become unavailable?
Is a Map with DISCARD considered a blocking call?
A query to get the DESC of an object?
Is sharing a shader Effect considered contention?
What else?
My typical frame consists of:
41 DrawPrimitives/DrawIndexedPrimitives calls (most objects are instanced)
7-8 locks on a vertex buffer with DISCARD
9 pixel shader/vertex shader changes
1 SetRenderTarget
Thanks!
P.S. Screenshot from PIX:
http://img191.imageshack.us/img191/6800/42594100.jpg
If I use a single draw call (with the same GPU load, for example a particle engine with x particles, or an instanced object) instead of the full game, I get a full blue bar and the GPU is correctly 2-3 frames behind the CPU...
EDIT: I'm focusing more and more on the Effect framework, which is probably the cause of this problem. I share one effect between multiple objects to save memory and the time needed to create them. Is it safe to assume this does not cause contention?
What comes to mind with the provided information:
Do you use double buffering with vsync? Maybe both are waiting for the back buffer to become available. Try triple buffering or immediate presentation.
Have you tried locking your vertex buffer with a NO_OVERWRITE circular strategy instead of 8 DISCARDs (see the sketch below)? Maybe there is too much memory pressure for the GPU to reallocate a new buffer for each discard. Also, some hardware doesn't allow discarding the same vertex buffer more than X times before it gets to render its contents.
Since you are sharing the same effect, are the parameters also shared?
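To make the second point concrete, here is a sketch of the circular NO_OVERWRITE strategy: keep appending behind data the GPU may still be reading, and only DISCARD when the buffer wraps. The class is purely illustrative; MapDiscard and MapNoOverwrite stand in for whatever Map/Lock call your D3D wrapper exposes and are not a real API:
class DynamicVertexRing
{
    private readonly int _capacityBytes;
    private int _writeOffset;

    public DynamicVertexRing(int capacityBytes)
    {
        _capacityBytes = capacityBytes;
    }

    // Returns the byte offset at which the caller may write sizeBytes of vertex data.
    public int Append(int sizeBytes)
    {
        if (_writeOffset + sizeBytes > _capacityBytes)
        {
            // Buffer exhausted: only now do we DISCARD, which hands the driver a
            // fresh region while the GPU keeps reading from the old one.
            MapDiscard();
            _writeOffset = 0;
        }
        else
        {
            // Normal case: append behind data the GPU may still be reading,
            // so no stall and no reallocation is needed.
            MapNoOverwrite();
        }

        int offset = _writeOffset;
        _writeOffset += sizeBytes;
        return offset;
    }

    // Placeholders for the real Map/Lock calls of your D3D wrapper.
    private void MapDiscard() { /* e.g. Map with WRITE_DISCARD */ }
    private void MapNoOverwrite() { /* e.g. Map with WRITE_NO_OVERWRITE */ }
}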
