C# - slow float array processing (frame averaging)

I built a project in C# with a WinForms GUI on the .NET Framework. The aim of the program is to get data (image frames) from a frame grabber card and save it after doing some pre-processing.
The attached camera is quite fast: it has a resolution of 640x480 with 16 bits per pixel and runs at 350 frames per second.
I added the manufacturer's library, which raises an event for every incoming frame and delivers the data as a one-dimensional Int16 array.
What I want to do is average the frames of every second and save the result in binary as a float array. Look here:
void Camera_OnRawImage(object sender, short[] data)
{
    float[] frame = Array.ConvertAll(data, x => (float)x);
    this.AverageProcess.AddImage(frame);
}
My problem is that there is a lot of data coming into my application, so the program is not able to process and save all the data before the next frames are pending to raise an event.
I already tried different approaches, like:
- multiple threads to calculate the arrays in parallel
- using marshalling methods for faster array access
- converting the arrays to Math.NET vectors and using the Intel MKL provider for Math.NET to speed up the processing. Look here:
internal void AddImage(float[] image)
{
    currentFrameOfMeasurement++;
    Vector<float> newImage = Vector<float>.Build.DenseOfArray(image);
    newImage = newImage.Multiply(preAverageFactor);
    var row = FrameMat.Column(currentFrameOfPeriod).Add(newImage);
    FrameMat.SetColumn(currentFrameOfPeriod, row);
    currentFrameOfPeriod++;
    if (currentFrameOfPeriod >= this.Settings.FramesPerPeriod)
    {
        currentFrameOfPeriod = 0;
        currentPeriodCount++;
    }
}
But that is also a little bit too slow. RAM usage keeps increasing while grabbing the data. My method needs more than 10 ms per frame, which is too slow.
Processing and saving the data must be possible: there is an application from the manufacturer which does a lot of processing on the grabbed data. That program seems to be written in Delphi; maybe they are using a better library for these operations.
My question is: does anybody have a suggestion for how I can speed up processing and calculating these arrays?
Thank you in advance

Use a synchronized queue (for example Queue.Synchronized or ConcurrentQueue<T>), enqueue every frame that comes into your program, and write a worker thread that processes the items in the queue.
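A minimal sketch of that pattern, assuming a BlockingCollection<short[]> as the queue, 350 frames per averaging period and a simple BinaryWriter for the output; the class name, the queue bound of 1024 and the output format are only illustrative. Accumulating into a preallocated float[] also avoids the per-frame Array.ConvertAll allocation:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class FrameAverager
{
    // Bounded queue so the grabber cannot outrun the consumer without limit.
    private readonly BlockingCollection<short[]> queue = new BlockingCollection<short[]>(1024);
    private const int PixelCount = 640 * 480;
    private const int FramesPerPeriod = 350;   // one second of frames

    // Called from the grabber event; it only enqueues and does no work.
    public void Camera_OnRawImage(object sender, short[] data)
    {
        queue.Add(data);
    }

    // Runs on a separate thread and drains the queue.
    public Task StartConsumer(string outputPath)
    {
        return Task.Run(() =>
        {
            var sum = new float[PixelCount];
            int framesInPeriod = 0;

            foreach (var frame in queue.GetConsumingEnumerable())
            {
                // Accumulate in place; no per-frame float[] allocation.
                for (int i = 0; i < PixelCount; i++)
                    sum[i] += frame[i];

                if (++framesInPeriod == FramesPerPeriod)
                {
                    for (int i = 0; i < PixelCount; i++)
                        sum[i] /= FramesPerPeriod;

                    // Append the averaged frame as raw floats.
                    using (var writer = new BinaryWriter(File.Open(outputPath, FileMode.Append)))
                        foreach (var value in sum)
                            writer.Write(value);

                    Array.Clear(sum, 0, sum.Length);
                    framesInPeriod = 0;
                }
            }
        });
    }
}

The important part is that the event handler returns immediately; averaging and disk I/O happen on the consumer thread, and the bounded capacity makes it obvious (by blocking the producer) if the consumer still cannot keep up.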

Related

How to Plot Data Faster

I want to plot all the data points I get from the TCP server, but I could not figure out a way to plot all of them. Currently I print the string to the text box, and from the text box only the first line is plotted.
This is real-time data plotting for an oscilloscope GUI.
How can I plot all the values?
I tested with a sine wave from an I2S mic; it gave a distorted signal when plotted with the following code.
int t;
private void Timer1_Tick(object sender, EventArgs e)
{
    one = new Thread(test);
    one.Start();
    t++;
}

public void test()
{
    byte[] bytes = new byte[client.ReceiveBufferSize];
    var readCount = stream.Read(bytes, 0, bytes.Length);
    string datastring = Encoding.UTF8.GetString(bytes);
    txtdata.Invoke((MethodInvoker)(() => txtdata.Text = datastring.Substring(0, 100)));
    txtno.Invoke((MethodInvoker)(() =>
        txtno.Text = ("\nnumber of bytes read: " + readCount)
    ));
    String ch1 = txtdata.Text;
    String[] ch1y = ch1.Split(new char[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
    for (int a = 1; a < ch1y.Length - 1; a++)
    {
        chart1.Invoke((MethodInvoker)(() =>
            chart1.Series[0].Points.AddXY(t, Convert.ToDouble(ch1y[a]))
        ));
    }
}
The issue here is not how fast you plot the data, but the fact that you are trying to plot real-time analogue values using a non-uniform asynchronous pattern.
It sounds like you are trying to approach this from a first-principles perspective, but you are skipping a lot of background understanding that is necessary to pull this off.
The oscilloscope has many functions that allow you to focus in on the specific width or length of the analogue sample you see on the display; this might represent a tiny sample of audio. The amplitude of the line on the display usually represents the voltage of the electrical pulse at a relative point in time, not a specific point in time.
If your sine wave represents a constant 2 kHz and you can see 2 peaks (and 2 troughs) on the screen, then you are actually looking at data captured over an interval of 1 ms. Drawing a wavy line like that in a chart requires many points, so your initial attempt to plot this as a single X,Y ordinate is going to need many discrete points, perhaps hundreds, over that 1 ms to render the same sort of line. You would also have to be especially careful that the X interval represents the exact amount of time, or the points you plot will not be in the right place with respect to the time the value was sampled.
This frequency of processing is not something you want to try to achieve in C#; that would need to be done way down at the hardware level.
The ESP32 you are using to sample the analogue input will be capturing a specific packet of data at a specific interval; you may have seen phrases like 8-bit 16 kHz Mono to describe audio quality.
Depending on what processing is going on in the ESP32, the bytes that you receive will usually represent the total sample. We then need to use the bitness to determine how to break the bytes into an array of values; it is not a single value.
So 8-bit Mono means that every 8 bits represent a single value, so in this case each byte is a separate value. Mono means that this is a 1-dimensional array of values; in Stereo, or 2-channel interleaved, the bytes represent 2 separate arrays of values, so every second byte actually goes into the second array...
So you don't simply convert the bytes to text using UTF8 encoding; you need to use other libraries, like NAudio, to decode this, though you can try it manually.
In simple C# terms, assuming 8-bit mono, you could plot the entire sample using the array index as X and the byte value at that index as Y; however, this still isn't going to quite match what you see on the oscilloscope.
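A minimal sketch of that idea, reusing the question's client, stream and chart1 fields, and assuming 8-bit mono samples at an assumed 16 kHz sample rate (only used here to scale the X axis to time):

// Read one buffer of raw bytes from the network stream (assumed to be 8-bit mono samples).
byte[] bytes = new byte[client.ReceiveBufferSize];
int readCount = stream.Read(bytes, 0, bytes.Length);

const double sampleRate = 16000.0;   // assumption: 16 kHz capture on the ESP32

chart1.Invoke((MethodInvoker)(() =>
{
    chart1.Series[0].Points.Clear();
    for (int i = 0; i < readCount; i++)
    {
        double time = i / sampleRate;                   // X: time of this sample within the buffer
        chart1.Series[0].Points.AddXY(time, bytes[i]);  // Y: the raw 8-bit sample value
    }
}));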
Instead, have a think about what you actually want to see on the screen, and do some research into using Fast Fourier Transform (FFT) calculations to analyse the analogue readings; there are even ways to do this directly on the ESP32, which may reduce the processing you do in the C# code.
I reduced the baud rate. After that the wave shows an almost pure sine wave at 2 kHz, but only the 2 kHz wave looks like that. Instead of running a thread, the chart was plotted on each timer tick with the Add command.

Possible Rendering Performance Optimizations

I was doing some benchmarking today using C# and OpenTK, just to see how much I could actually render before the framerate dropped. The numbers I got were pretty astronomical, and I am quite happy with the outcome of my tests.
In my project I am loading the blender monkey, which is 968 triangles. I then instance it and render it 100 times. This means that I am rendering 96,800 triangles per frame. This number far exceeds anything that I would need to render during any given scene in my game. And after this I pushed it even further and rendered 2000 monkeys at varying locations. I was now rendering a whopping 1,936,000 (almost 2 million triangles per frame) and the framerate was still locked at 60 frames per second! That number just blew my mind. I pushed it even further and finally the framerate started to drop, but this just means that the limit is roughly 4 million triangles per frame with instancing.
I was just wondering though, because I am using some legacy OpenGL, if this could still be pushed even further—or should I even bother?
For my tests I load the blender monkey model, store it into a display list using the deprecated calls like:
modelMeshID = MeshGenerator.Generate( delegate {
    GL.Begin( PrimitiveType.Triangles );
    foreach( Face f in model.Faces ) {
        foreach( ModelVertex p in f.Points ) {
            Vector3 v = model.Vertices[ p.Vertex ];
            Vector3 n = model.Normals[ p.Normal ];
            Vector2 tc = model.TexCoords[ p.TexCoord ];
            GL.Normal3( n.X , n.Y , n.Z );
            GL.TexCoord2( tc.Y , tc.X );
            GL.Vertex3( v.X , v.Y , v.Z );
        }
    }
    GL.End();
} );
and then call that list x amount of times. My question though, is if I could speed this up if I threw VAO's (Vertex Array Objects) into the display list instead of the old GL.Vertex3 api? Would this effect performance at all? Or would it create the same outcome with the display list?
Here is a screen grab of a few thousand:
My system specs:
CPU: AMD Athlon IIx4(quad core) 620 2.60 GHz
Graphics Card: AMD Radeon HD 6800
My question, though, is whether I could speed this up if I used VAOs (Vertex Array Objects) in the display list instead of the old GL.Vertex3 API. Would this affect performance at all, or would it produce the same outcome as the display list?
No.
The main problem you're going to run into is that display lists and vertex arrays don't go well with each other. Using buffer objects they kind of work, but display lists themselves are legacy, like the immediate mode drawing API.
However, even if you manage to get the VBO drawing from within a display list right, there'll be only a slight improvement: when compiling the display list, the OpenGL driver knows that everything arriving will eventually be "frozen". This allows for some very aggressive internal optimization; all the geometry data gets packed up into a buffer object on the GPU, and state changes are coalesced. AMD is not quite as good at this game as NVidia, but they're not bad either; display lists are heavily used in CAD applications, and before ATI addressed the entertainment market they were focused on CAD, so their display list implementation is not bad at all. If you pack all the relevant state changes required for a particular drawing call into the display list, then when calling the display list you'll likely drop into the fast path.
I pushed it even further and finally the framerate started to drop, but this just means that the limit is roughly 4 million triangles per frame with instancing.
What's actually limiting you there is the overhead on calling the display list. I suggest you add a little bit more geometry into the DL and try again.
Display Lists are shockingly efficient. That they got removed from modern OpenGL is mostly because they can be effectively used only with the immediate mode drawing commands. Also recent things like transform feedback and conditional rendering would have been very difficult to integrate into display lists. So they got removed; and rightfully so, because Display Lists are kind of awkward to work with.
Now if you look at Vulkan the essential idea is to set up as much of the drawing commands (state changes, resource bindings and so on) upfront in command buffers and reuse those for varying data. This is like if you could create multiple display lists and have them make babies.
Using vertex lists with Begin and End causes the monkey geometry to be sent to the GPU every iteration, going through PCI-E, which is the slowest memory interface you have during rendering. Also, depending on your GL implementation, every call to GL can have more or less overhead on its own. If you used buffer objects, all that overhead would be gone, because you only send the monkey over once and then all you need is a draw call every iteration.
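A minimal sketch of that with OpenTK, assuming the mesh has already been flattened into a float[] called vertexData containing packed XYZ positions, and an instanceCount variable (both names are just for this example; normals and texture coordinates are omitted for brevity). It uses the legacy client-state API so it stays compatible with the rest of the fixed-function code:

// One-time setup: upload the monkey into a vertex buffer object.
int vbo = GL.GenBuffer();
GL.BindBuffer( BufferTarget.ArrayBuffer, vbo );
GL.BufferData( BufferTarget.ArrayBuffer,
               (IntPtr)(vertexData.Length * sizeof(float)),
               vertexData,
               BufferUsageHint.StaticDraw );
GL.EnableClientState( ArrayCap.VertexArray );
GL.VertexPointer( 3, VertexPointerType.Float, 0, IntPtr.Zero );

// Per frame: no geometry transfer, just one draw call per instance.
for( int i = 0; i < instanceCount; i++ ) {
    // ... set the model matrix for instance i ...
    GL.DrawArrays( PrimitiveType.Triangles, 0, vertexData.Length / 3 );
}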
However, the monkey geometry is tiny (just a few kB), so sending it over the PCI-E bus (at something like 16 GB/s), plus the few hundred iterations of the "geometry loop", would not even take a millisecond. And even that will not touch your frame rate because, unless you are explicitly synchronizing, it will be completely absorbed by pipelining: the copying and the draw call will run while the GPU is still busy rendering the previous frame. By the time the GPU starts rendering the next frame, the data is already there.
That is why I am guessing that, given a fairly optimized GL implementation (good drivers), using buffer objects would not yield any speed-up. Note that in the face of bigger and more complex geometry and rendering operations, buffer objects will of course become crucial to performance. Small buffers might even stay cached on chip between draw calls.
Nevertheless, as a serious speed-freak, you definitely want to double-check and verify these sorts of guesstimates :)

My application slows down when processing Kinect RGB images with EMGU library

I'm currently using the Kinect SDK with C# (a WPF application). I need to get the RGB stream and process the images with the EMGU library.
The problem is that when I try to process the image with EMGU (like converting the image's format and changing the colour of some pixels), the application slows down and takes too long to respond.
I'm using 8 GB RAM / Intel HD Graphics 4000 / Intel Core i7.
Here's my simple code :
http://pastebin.com/5frLRwMN
Please help me :'(
I have run considerably heavier code (blob analysis) with the Kinect on a per frame basis and got away with great performance on a machine of similar configuration as yours, so I believe we can rule out your machine as the problem. I don't see any EMGU code in your sample however. In your example, you loop through 307k pixels with a pair of for loops. This is naturally a costly procedure to run, depending on the code in your loops. As you might expect, GetPixel and SetPixel are very slow methods to execute.
To speed up your code, first turn your image into an Emgu Image. Then to access your image, use a Byte:
Byte workImageBlue = image.Data[x, y, 0];
Byte workImageGreen = image.Data[x, y, 1];
...
The third index refers to the BGR channel (0 = blue, 1 = green, 2 = red). To set the pixel to another colour, try something like this:
byte[,,] workIm = image.Data;
workIm[x, y, 0] = 255;
workIm[x, y, 1] = 20;
...
Alternatively, you can set the pixel to a colour directly:
image[x, y] = new Bgr(Color.Blue);
This might be slower however.
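As a minimal sketch of the full per-pixel loop, assuming the Kinect frame has already been copied into a Bitmap called kinectBitmap (a name used only for this example), and noting that Emgu's Data array is indexed [row, column, channel]:

// Wrap the frame in an Emgu image once; GetPixel/SetPixel are never used.
Image<Bgr, byte> image = new Image<Bgr, byte>(kinectBitmap);
byte[,,] data = image.Data;

for (int row = 0; row < image.Rows; row++)
{
    for (int col = 0; col < image.Cols; col++)
    {
        // Read the BGR channels directly from the array.
        byte b = data[row, col, 0];
        byte g = data[row, col, 1];
        byte r = data[row, col, 2];

        // Example processing: recolour near-white pixels.
        if (r > 200 && g > 200 && b > 200)
        {
            data[row, col, 0] = 255;   // blue
            data[row, col, 1] = 20;    // green
            data[row, col, 2] = 20;    // red
        }
    }
}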
Image processing is always slow, and if you do it at 30 fps it's normal for your app to hang: real-time image processing is always a challenge. You may need to drop some frames in order to increase performance (...or perhaps switch to native C++ and look for a faster library).
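A minimal sketch of dropping frames while the previous one is still being processed, assuming the Kinect SDK v1 ColorFrameReady event and a hypothetical ProcessWithEmgu method standing in for the actual EMGU work:

private int processing;   // 0 = idle, 1 = busy

private void Sensor_ColorFrameReady(object sender, ColorImageFrameReadyEventArgs e)
{
    // If the previous frame is still being processed, drop this one.
    if (Interlocked.CompareExchange(ref processing, 1, 0) != 0)
        return;

    byte[] pixels;
    int width, height;
    using (var frame = e.OpenColorImageFrame())
    {
        if (frame == null) { processing = 0; return; }
        pixels = new byte[frame.PixelDataLength];
        frame.CopyPixelDataTo(pixels);
        width = frame.Width;
        height = frame.Height;
    }

    // Do the EMGU work off the UI thread so the WPF window stays responsive.
    Task.Run(() =>
    {
        try { ProcessWithEmgu(pixels, width, height); }   // hypothetical processing method
        finally { processing = 0; }
    });
}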

Making C# mandelbrot drawing more efficient

First of all, I am aware that this question really sounds as if I didn't search, but I did, a lot.
I wrote a small Mandelbrot drawing code for C#, it's basically a windows form with a PictureBox on which I draw the Mandelbrot set.
My problem is that it's pretty slow. Without a deep zoom it does a pretty good job and moving around and zooming is pretty smooth, taking less than a second per drawing, but once I start to zoom in a little and get to places which require more calculations it becomes really slow.
On other Mandelbrot applications my computer does really fine on places which work much slower in my application, so I'm guessing there is much I can do to improve the speed.
I did the following things to optimize it:
- Instead of using the SetPixel/GetPixel methods on the bitmap object, I used the LockBits method to write directly to memory, which made things a lot faster.
- Instead of using complex number objects (with classes I made myself, not the built-in ones), I emulated complex numbers using 2 variables, re and im. Doing this allowed me to cut down on multiplications, because squaring the real part and the imaginary part is something that is done a few times during the calculation, so I just save the square in a variable and reuse the result without recalculating it.
- I use 4 threads to draw the Mandelbrot; each thread does a different quarter of the image and they all work simultaneously. As I understand it, that means my CPU will use 4 of its cores to draw the image.
- I use the Escape Time Algorithm, which as I understand it is the fastest.
Here is how I move between the pixels and calculate; it's commented so I hope it's understandable:
//Pixel by pixel loop:
for (int r = rRes; r < wTo; r++)
{
    for (int i = iRes; i < hTo; i++)
    {
        //These calculations determine which complex number corresponds to the (r,i) pixel.
        double re = (r - (w/2))*step + zeroX;
        double im = (i - (h/2))*step - zeroY;
        //Create the Z complex number
        double zRe = 0;
        double zIm = 0;
        //Variables to store the squares of the real and imaginary part.
        double multZre = 0;
        double multZim = 0;
        //Start iterating with the complex number to determine its escape time (mandelValue)
        int mandelValue = 0;
        while (multZre + multZim < 4 && mandelValue < iters)
        {
            /*The new real part equals re(z)^2 - im(z)^2 + re(c); we store it in a temp variable
              tempRe because we still need re(z) in the next calculation
            */
            double tempRe = multZre - multZim + re;
            /*The new imaginary part is equal to 2*re(z)*im(z) + im(c)
             * Instead of multiplying these by 2 I add re(z) to itself and then multiply by im(z), which
             * means I just do 1 multiplication instead of 2.
             */
            zRe += zRe;
            zIm = zRe * zIm + im;
            zRe = tempRe; // We can now put the temp value in its place.
            // Do the squaring now; they will be used in the next calculation.
            multZre = zRe * zRe;
            multZim = zIm * zIm;
            //Increase the mandelValue by one, because the iteration is now finished.
            mandelValue += 1;
        }
        //After the mandelValue is found, this colors its pixel accordingly (unsafe code, accesses memory directly):
        //(Unimportant for my question, I doubt the problem is with this because my code becomes really slow
        // as the number of ITERATIONS grows; this only executes more as the number of pixels grows).
        Byte* pos = px + (i * str) + (pixelSize * r);
        byte col = (byte)((1 - ((double)mandelValue / iters)) * 255);
        pos[0] = col;
        pos[1] = col;
        pos[2] = col;
    }
}
What can I do to improve this? Do you find any obvious optimization problems in my code?
Right now there are 2 ways I know I can improve it:
I need to use a different type for numbers; double is limited in accuracy, and I'm sure there are better non-built-in alternative types which are faster (they multiply and add faster) and have more accuracy. I just need someone to point me to where I need to look and tell me whether that's true.
I can move processing to the GPU. I have no idea how to do this (OpenGL maybe? DirectX? is it even that simple or will I need to learn a lot of stuff?). If someone can send me links to proper tutorials on this subject or tell me in general about it that would be great.
Thanks a lot for reading that far and hope you can help me :)
If you decide to move the processing to the GPU, you can choose from a number of options. Since you are using C#, XNA will allow you to use HLSL. RB Whitaker has the easiest XNA tutorials if you choose this option. Another option is OpenCL. OpenTK comes with a demo program of a Julia set fractal, which would be very simple to modify to display the Mandelbrot set. See here
Just remember to find the GLSL shader that goes with the source code.
About the GPU, examples are no help for me because I have absolutely no idea about this topic, how does it even work and what kind of calculations the GPU can do (or how is it even accessed?)
Different GPU software works differently however ...
Typically a programmer will write a program for the GPU in a shader language such as HLSL, GLSL or OpenCL. The program written in C# will load the shader code and compile it, and then use functions in an API to send a job to the GPU and get the result back afterwards.
Take a look at FX Composer or RenderMonkey if you want some practice with shaders without having to worry about APIs.
If you are using HLSL, the rendering pipeline looks like this.
The vertex shader is responsible for taking points in 3D space and calculating their position in your 2D viewing field. (Not a big concern for you since you are working in 2D)
The pixel shader is responsible for applying shader effects to the pixels after the vertex shader is done.
OpenCL is a different story; it's geared towards general-purpose GPU computing (i.e. not just graphics). It's more powerful and can be used for GPUs, DSPs, and building supercomputers.
WRT coding for the GPU, you can look at Cudafy.Net (it does OpenCL too, which is not tied to NVidia) to start getting an understanding of what's going on and perhaps even do everything you need there. I've quickly found it - and my graphics card - unsuitable for my needs, but for the Mandelbrot at the stage you're at, it should be fine.
In brief: You code for the GPU with a flavour of C (Cuda C or OpenCL normally) then push the "kernel" (your compiled C method) to the GPU followed by any source data, and then invoke that "kernel", often with parameters to say what data to use - or perhaps a few parameters to tell it where to place the results in its memory.
When I've been doing fractal rendering myself, I've avoided drawing to a bitmap for the reasons already outlined and deferred the render phase. Besides that, I tend to write massively multithreaded code, which is really bad for trying to access a bitmap. Instead, I write to a common store - most recently I've used a MemoryMappedFile (a built-in .NET class) since that gives me pretty decent random access speed and a huge addressable area. I also tend to write my results to a queue and have another thread deal with committing the data to storage; the compute times of each Mandelbrot pixel will be "ragged" - that is to say that they will not always take the same length of time. As a result, your pixel commit could be the bottleneck for very low iteration counts. Farming it out to another thread means your compute threads are never waiting for storage to complete.
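A minimal sketch of that split, assuming results are plain pixel-index/iteration-count pairs and a fixed-size MemoryMappedFile as the backing store (all type and method names here are only for this example):

using System;
using System.Collections.Concurrent;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

struct PixelResult
{
    public long Index;       // linear pixel index into the image
    public int Iterations;   // escape-time value for that pixel
}

class ResultWriter
{
    private readonly BlockingCollection<PixelResult> results = new BlockingCollection<PixelResult>();
    private readonly MemoryMappedFile store;

    public ResultWriter(long pixelCount)
    {
        // Reserve 4 bytes (one int) per pixel up front.
        store = MemoryMappedFile.CreateNew("mandelbrot-results", pixelCount * sizeof(int));
    }

    // Called by the compute threads; cheap, never touches the disk directly.
    public void Post(long index, int iterations)
    {
        results.Add(new PixelResult { Index = index, Iterations = iterations });
    }

    // A single dedicated thread commits results, so compute threads never wait on storage.
    public Task StartCommitThread()
    {
        return Task.Run(() =>
        {
            using (var accessor = store.CreateViewAccessor())
            {
                foreach (var r in results.GetConsumingEnumerable())
                    accessor.Write(r.Index * sizeof(int), r.Iterations);
            }
        });
    }

    public void Complete()
    {
        results.CompleteAdding();
    }
}

The compute threads only ever call Post, so a slow disk shows up as queue growth rather than stalled workers.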
I'm currently playing with the Buddhabrot visualisation of the Mandelbrot set, looking at using a GPU to scale out the rendering (since it's taking a very long time with the CPU) and having a huge result set. I was thinking of targeting an 8-gigapixel image, but I've come to the realisation that I need to diverge from the constraints of pixels, and possibly away from floating-point arithmetic due to precision issues. I'm also going to have to buy some new hardware so I can interact with the GPU differently - different compute jobs will finish at different times (as per my iteration count comment earlier), so I can't just fire batches of threads and wait for them all to complete without potentially wasting a lot of time waiting for one particularly high iteration count out of the whole batch.
Another point to make that I hardly ever see being made about the Mandelbrot Set is that it is symmetrical. You might be doing twice as much calculating as you need to.
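A minimal sketch of exploiting that symmetry, assuming the view is centred on the real axis so row y mirrors row height - 1 - y; iterations, width, height and ComputePixel are names used only for this example (ComputePixel stands in for the escape-time loop above):

// Only compute the top half (plus the middle row when the height is odd)...
int half = (height + 1) / 2;
for (int y = 0; y < half; y++)
    for (int x = 0; x < width; x++)
        iterations[y, x] = ComputePixel(x, y);   // the existing escape-time loop

// ...and mirror it into the bottom half.
for (int y = half; y < height; y++)
    for (int x = 0; x < width; x++)
        iterations[y, x] = iterations[height - 1 - y, x];

This only pays off while the visible area actually straddles the real axis; once you zoom into an off-axis region, the mirrored half falls outside the view and you are back to computing every pixel.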
For moving the processing to the GPU, you have lots of excellent examples here:
https://www.shadertoy.com/results?query=mandelbrot
Note that you need a WebGL-capable browser to view that link. It works best in Chrome.
I'm no expert on fractals, but you seem to have come far already with the optimizations. Going beyond that may make the code much harder to read and maintain, so you should ask yourself whether it is worth it.
One technique I've often observed in other fractal programs is this: While zooming, calculate the fractal at a lower resolution and stretch it to full size during render. Then render at full resolution as soon as zooming stops.
Another suggestion: when you use multiple threads, you should take care that each thread doesn't read/write memory belonging to other threads, because this will cause cache collisions and hurt performance. One good algorithm could be to split the work up into scanlines (instead of four quarters as you do now). Create a number of threads, then, as long as there are lines left to process, assign a scanline to a thread that is available. Let each thread write the pixel data to a local piece of memory and copy this back to the main bitmap after each line (to avoid cache collisions), as sketched below.
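A minimal sketch of that scanline scheme; RenderLine and CommitLine are hypothetical (one runs the escape-time loop for a single row into a thread-local buffer, the other copies that buffer into the locked bitmap), and width, height and pixelSize are assumed to match the variables in the question:

// Requires System.Threading. Each worker pulls the next free scanline from an atomic counter.
int nextLine = -1;
int threadCount = Environment.ProcessorCount;
var workers = new Thread[threadCount];

for (int t = 0; t < threadCount; t++)
{
    workers[t] = new Thread(() =>
    {
        // A private line buffer per thread, so threads never write to shared memory mid-line.
        var lineBuffer = new byte[width * pixelSize];
        int y;
        while ((y = Interlocked.Increment(ref nextLine)) < height)
        {
            RenderLine(y, lineBuffer);   // hypothetical: escape-time loop for row y
            CommitLine(y, lineBuffer);   // hypothetical: copy the finished row into the bitmap
        }
    });
    workers[t].Start();
}

foreach (var worker in workers)
    worker.Join();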

HLSL: Returning an array of float4?

I have the following function in HLSL:
float4[] GetAllTiles(float type) {
    float4 tiles[128];
    int i=0;
    [unroll(32768)] for(int x=0;x<MapWidth;x++) {
        [unroll(32768)] for(int y=0;y<MapHeight;y++) {
            float2 coordinate = float2(x,y);
            float4 entry = tex2D(MapLayoutSampler, coordinate);
            float entryType=GetTileType(entry);
            if(entryType == type) {
                tiles[i++]=entry;
            }
        }
    }
    return tiles;
}
However, it says that it can't define a return type of float4[]. How do I do this?
In short:
You can't return an array of floats defined in the function in HLSL.
HLSL code (on the GPU) is not like C code on the CPU. It is executed concurrently on many GPU cores.
HLSL code gets executed at every vertex (in the vertex shader) or at every pixel (in the pixel shader). So for every vertex you give the GPU, this code will be executed.
This HLSL introduction should give you a sense of how a few lines of HLSL code get executed on every pixel, producing a new image from the input:
http://www.neatware.com/lbstudio/web/hlsl.html
In your example code, you are looping over the entire map, which is probably not what you want to do at all, as the function you posted will be executed at every pixel (or vertex) given in your input.
Transferring your logic from the CPU to the GPU via HLSL code can be very difficult, as GPUs are not currently designed to do general purpose computation. The task you are trying to do must be very parallel, and if you want it to be fast on the GPU, then you need to express the problem in terms of drawing images, and reading from textures.
Read the tutorial I linked to get started with HLSL :)
Return a structure with the array in it. You can pass parameters in as a raw array, but it must be in a structure if it's the return value. :)
What Olhovsky said is true, though: if you're converting from C to direct C/compute code, you should have the iterations laid out as separate threads. But don't forget that a GPU also has a lot of serial power as well, and you need to take that into account in your budget for maximum efficiency. For example, the minimum number of threads you need is the number of cores on your GPU; for a GTX 980, that's 2048.
