Faster conversion of BGR packed to RGB planar pixel format - C#

From an SDK I get images in the packed BGR pixel format, i.e. BGRBGRBGR. For another application, I need to convert this to the planar RGB format RRRGGGBBB.
I am using C# on .NET 4.5 (32-bit), and the data is in byte arrays of the same size.
Right now I am iterating through the source array and assigning the B, G and R values to their appropriate places in the target array, but that takes too long (180 ms for a 1.3 megapixel image). The processor the code runs on has access to MMX, SSE, SSE2, SSE3 and SSSE3.
Is there a way to speed up the conversion?
edit: Here is the conversion I am using:
// the array with the BGRBGRBGR pixel data
byte[] source;
// the array with the RRRGGGBBB pixel data
byte[] result;
// the amount of pixels in one channel, width*height
int imageSize;
for (int i = 0; i < source.Length; i += 3)
{
    result[i/3] = source[i + 2];                 // R
    result[i/3 + imageSize] = source[i + 1];     // G
    result[i/3 + imageSize * 2] = source[i];     // B
}
edit: I tried splitting the access to the source array into three loops, one for each channel, but it didn't really help. So I'm open to suggestions.
for (int i = 0; i < source.Length; i += 3)
{
    result[i/3] = source[i + 2];                 // R
}
for (int i = 0; i < source.Length; i += 3)
{
    result[i/3 + imageSize] = source[i + 1];     // G
}
for (int i = 0; i < source.Length; i += 3)
{
    result[i/3 + imageSize * 2] = source[i];     // B
}
Bump, because the question is still unanswered. Any advice is very much appreciated!

You could try to use SSE3's PSHUFB - Packed Shuffle Bytes instruction. Make sure you are using aligned memory read/writes. You will have to do something tricky to deal with the last dangling B value in each XMMWORD-sized block. Might be tough to get it right but should be a huge speedup. You could also look for library code. I'm guessing you will need to make a C or C++ DLL and use P/Invoke, but maybe there is a way to use SSE instructions from C# that I don't know about.
Edit: this question is about a slightly different problem (ARGB to BGR), but the techniques used are similar to what you need.
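For illustration only, here is a minimal sketch of what the managed side of that P/Invoke route could look like; the DLL name, entry point and signature are invented for this example and are not part of the original answer:
// Hypothetical native export, e.g. compiled into BgrToPlanar.dll with SSSE3 intrinsics:
//   extern "C" __declspec(dllexport)
//   void BgrPackedToRgbPlanar(const unsigned char* src, unsigned char* dst, int pixelCount);
using System.Runtime.InteropServices;
static class NativeConvert
{
    [DllImport("BgrToPlanar.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void BgrPackedToRgbPlanar(byte[] source, byte[] result, int pixelCount);
}
// Usage: NativeConvert.BgrPackedToRgbPlanar(source, result, imageSize);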

Basler has an SDK for its cameras, called Basler Pylon, which works on Windows and Linux.
This SDK has APIs for C++, C# and more.
It has an image conversion class, PixelDataConverter, which seems to be what you need.

Related

How to take array segments out of a byte array after every X step?

I have a big byte array (around 50 KB) and I need to extract numeric values from it. Every three bytes represent one value.
I tried working with LINQ's Skip and Take, but it's really slow given the large size of the array.
This is my very slow routine:
List<int> ints = new List<int>();
for (int i = 0; i <= fullFile.Count(); i+=3)
{
    ints.Add(BitConverter.ToInt16(fullFile.Skip(i).Take(i + 3).ToArray(), 0));
}
I think my approach to this is wrong.
Your code
First of all, ToInt16 only uses two bytes, so your third byte will be discarded.
You can't use ToInt32 either, as it would include one extra byte.
Let's review this:
fullFile.Skip(i).Take(i + 3).ToArray()
...and take a careful look at Take(i + 3). It says that you want to copy a larger and larger buffer. For instance, when i is at index 32000, you copy 32003 bytes into your new buffer.
That's why the code is quite slow.
The code is also slow because you allocate a lot of temporary byte buffers of growing size, all of which have to be garbage collected.
You could also have done it like this:
List<int> ints = new List<int>();
var workBuffer = new byte[4];
for (int i = 0; i + 3 <= fullFile.Length; i += 3)
{
    // Copy the three bytes into the beginning of the temp buffer
    Buffer.BlockCopy(fullFile, i, workBuffer, 0, 3);
    // Now we can use ToInt32, as the last byte is always zero
    var value = BitConverter.ToInt32(workBuffer, 0);
    ints.Add(value);
}
Quite easy to understand, but not the fastest code.
A better solution
So the most efficient way is to do the conversion yourself with bit shifting.
Something like:
List<int> ints = new List<int>();
for (int i = 0; i + 3 <= fullFile.Length; i += 3)
{
    // This code assumes little-endian byte order
    var value = (fullFile[i + 2] << 16)
              + (fullFile[i + 1] << 8)
              + fullFile[i];
    ints.Add(value);
}
This code does not allocate anything extra (except the list of ints) and should be quite fast.
You can read more about shift operators on MSDN, and about endianness.

Converting image into CvMat in OpenCV for training neural network

I'm writing a program which uses OpenCV's neural network module with C# and the OpenCvSharp library. It must recognize the user's face, so in order to train the network I need a set of samples. The problem is how to convert a sample image into an array suitable for training. What I've got is a 200x200 Bitmap image and a network with 40,000 input neurons, 200 hidden neurons and one output:
CvMat layerSizes = Cv.CreateMat(3, 1, MatrixType.S32C1);
layerSizes[0, 0] = 40000;
layerSizes[1, 0] = 200;
layerSizes[2, 0] = 1;
Network = new CvANN_MLP(layerSizes, MLPActivationFunc.SigmoidSym, 0.6, 1);
Then I try to convert the Bitmap images into a CvMat array:
private void getTrainingMat(int cell_count, CvMat trainMAt, CvMat responses)
{
    CvMat res = Cv.CreateMat(cell_count, 10, MatrixType.F32C1); // 10 is a number of samples
    responses = Cv.CreateMat(10, 1, MatrixType.F32C1);          // array of supposed outputs
    int counter = 0;
    foreach (Bitmap b in trainSet)
    {
        IplImage img = BitmapConverter.ToIplImage(b);
        Mat imgMat = new Mat(img);
        for (int i = 0; i < imgMat.Height; i++)
        {
            for (int j = 0; j < imgMat.Width; j++)
            {
                int val = imgMat.Get<int>(i, j);
                res[counter, 0] = imgMat.Get<int>(i, j);
            }
            responses[i, 0] = 1;
        }
        trainMAt = res;
    }
}
And then, when trying to train it, I get this exception:
input training data should be a floating-point matrix with the number of rows equal to the number of training samples and the number of columns equal to the size of 0-th (input) layer
Code for training:
trainMAt = Cv.CreateMat(inp_layer_size, 10, MatrixType.F32C1);
responses = Cv.CreateMat(inp_layer_size, 1, MatrixType.F32C1);
getTrainingMat(inp_layer_size, trainMAt, responses);
Network.Train(trainMAt, responses, new CvMat(), null, Parameters);
I'm new to OpenCV and I think I did something wrong in the conversion because I don't fully understand the CvMat structure. Where is my error, and is there another way of transforming the bitmap?
With the number of rows equal to the number of training samples
That's 10 samples.
and the number of columns equal to the size of 0-th (input) layer
That's inp_layer_size.
trainMAt = Cv.CreateMat(10, inp_layer_size, MatrixType.F32C1);
responses = Cv.CreateMat(10, 1, MatrixType.F32C1); // 10 labels for 10 samples
I primarily do C++, so forgive me if I'm misunderstanding, but your pixel loop will also need adapting.
Your inner loop looks broken: you assign to val but never use it, and you also never increment your counter.
In addition, assigning trainMAt = res; for every image in your outer loop doesn't seem like a very good idea.
I am certain you will get it to operate correctly; just keep in mind that the goal is to flatten each image into a single row, so you end up with 10 rows and inp_layer_size columns.
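As a rough illustration only (this reuses the question's own OpenCvSharp calls, is untested, and assumes the bitmaps are 8-bit grayscale), the loop could be reshaped like this; it also returns the matrices via out parameters so the caller actually sees the filled data:
private void getTrainingMat(int inp_layer_size, out CvMat trainMat, out CvMat responses)
{
    trainMat = Cv.CreateMat(10, inp_layer_size, MatrixType.F32C1); // 10 samples, one image per row
    responses = Cv.CreateMat(10, 1, MatrixType.F32C1);             // one label per sample
    int row = 0;
    foreach (Bitmap b in trainSet)
    {
        IplImage img = BitmapConverter.ToIplImage(b);
        Mat imgMat = new Mat(img);
        for (int i = 0; i < imgMat.Height; i++)
        {
            for (int j = 0; j < imgMat.Width; j++)
            {
                // flatten the 200x200 image into columns 0..inp_layer_size-1 of this row
                trainMat[row, i * imgMat.Width + j] = imgMat.Get<byte>(i, j);
            }
        }
        responses[row, 0] = 1; // supposed output for this sample
        row++;
    }
}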

How do I downsample an 8-bit bitmap array of size 20x30 to 10x15?

How do I downsample an 8-bit bitmap array of size 20x30 to 10x15, i.e. original_bitmap_array[20][30] to downsample_array[10][15]?
This is the original array, which is a 1-and-0 representation of the character 'A':
00000000111100000000
00000001111100000000
00000001111100000000
00000001111110000000
00000011100110000000
00000011000110000000
00000011000111000000
00000011000011000000
00000111000011000000
00000110000011000000
00000110000011000000
00000110000011000000
00001110000001100000
00001110000001100000
00001100000001100000
00001100000000110000
00001100000000110000
00011111111111110000
00011111111111111000
00011111111111111000
00011111111111111000
00111111111111111000
00110000000000011000
00110000000000011100
01110000000000011100
01110000000000001100
01110000000000001110
01100000000000001110
11100000000000000110
11100000000000000111
Now can somebody tell me how to downsample this into a 10x15 array without losing the character 'A'?
It all depends on how you want to sample the image.
One easy approach for your example is to take every second pixel from the 20x30 array and put it in the 10x15 array.
You probably want to expand on this and sample 2x2 blocks of pixels, using bilinear interpolation.
Because I don't know exactly how you want to sample the image, it is hard to give you more information on this.
Update:
org_img[20][30];     // monochrome values
sampled_img[10][15];
for (int i = 0; i < 10; i++)
{
    for (int j = 0; j < 15; j++)
    {
        int average = org_img[2*i][2*j] + org_img[2*i+1][2*j]
                    + org_img[2*i][2*j+1] + org_img[2*i+1][2*j+1];
        average = average >> 2;  // integer division by 4
        sampled_img[i][j] = average;
    }
}
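A C# version of the same 2x2 averaging, assuming the image is held in a jagged byte array indexed [x][y] as in the question (the method and array names are made up for illustration), could look like this:
// 2x2 block averaging: each output pixel is the mean of a 2x2 block of input pixels.
// orgImg is assumed to be 20 columns by 30 rows, indexed [x][y].
static byte[][] Downsample2x2(byte[][] orgImg)
{
    var sampled = new byte[10][];
    for (int x = 0; x < 10; x++)
    {
        sampled[x] = new byte[15];
        for (int y = 0; y < 15; y++)
        {
            int sum = orgImg[2 * x][2 * y]     + orgImg[2 * x + 1][2 * y]
                    + orgImg[2 * x][2 * y + 1] + orgImg[2 * x + 1][2 * y + 1];
            sampled[x][y] = (byte)(sum / 4);   // plain truncating average
        }
    }
    return sampled;
}
Note that with 0/1 input values a truncating average will wipe out single-pixel strokes; thresholding the 2x2 sum instead (for example, output 1 if any of the four pixels is 1) preserves the character better.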
You can use Matrix.Scale function for this.
Here is an example

FFT implementation

I have a problem implementing an FFT. The target device is Windows Phone 7.
This is how I'm doing it.
buffer is a byte array with a fixed size of 1024.
var o = Observable.FromEvent<EventArgs>(Microphone.Default, "BufferReady");
o.Subscribe(evt =>
{
    double[] dImageArray = this.buffer.Select(i => Convert.ToDouble(i)).ToArray();
    fftoutput = Saluse.MediaKit.Sample.FourierTransform.FFTDb(ref dImageArray);
});
The class I'm using (as you can see) is from Saluse MediaKit (source).
Is this the right approach, or am I mistaken somewhere?
I've managed to perform a good FFT with AForge (this library has saved me several times). This is the proper way to obtain the waveform data from the mic:
double[] sampleBuffer = new double[buffer.Length / 2];
int h = 0;
for (int i = 0; i < buffer.Length; i += 2)
{
    sampleBuffer[h] = Convert.ToDouble(BitConverter.ToInt16((byte[])buffer, i));
    h++;
}
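To go from those samples to an actual spectrum, a minimal sketch using AForge's FourierTransform could look like this (my addition, assuming AForge.Math is referenced; the 512 samples above are already a power of two, which the FFT requires):
// Build a complex buffer from the real-valued samples
AForge.Math.Complex[] fftBuffer = new AForge.Math.Complex[sampleBuffer.Length];
for (int i = 0; i < sampleBuffer.Length; i++)
{
    fftBuffer[i] = new AForge.Math.Complex(sampleBuffer[i], 0);
}
// In-place forward FFT
AForge.Math.FourierTransform.FFT(fftBuffer, AForge.Math.FourierTransform.Direction.Forward);
// The magnitudes of the first half of the bins give the spectrum
// (the second half mirrors it for real-valued input)
double[] magnitudes = new double[fftBuffer.Length / 2];
for (int i = 0; i < magnitudes.Length; i++)
{
    magnitudes[i] = fftBuffer[i].Magnitude;
}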
Following up with another question: I would love to make a visual equalizer, but I have no idea how to.

Can this code be optimised?

I have some image processing code that loops through 2 multi-dimensional byte arrays (of the same size). It takes a value from the source array, performs a calculation on it and then stores the result in another array.
int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int x = 0; x < xSize; x++)
{
    for (int y = 0; y < ySize; y++)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
}
The loop currently takes ~11 ms, which I assume is mostly due to accessing the byte array values, as the calculation itself is pretty simple (2 multiplications and 1 addition).
Is there anything I can do to speed this up? It is a time-critical part of my program and this code gets called 80-100 times per second, so any speed gain, however small, will make a difference. Also, at the moment xSize = 768 and ySize = 576, but this will increase in the future.
Update: Thanks to Guffa (see his answer below), the following code saves me 4-5 ms per loop, although it is unsafe code.
int size = ResultImageData.Length;
int counter = 0;
unsafe
{
    fixed (byte* r = ResultImageData, c = CurrentImageData, a = AlphaImageData)
    {
        while (size > 0)
        {
            *(r + counter) = (byte)(*(c + counter) * AlphaValue +
                                    *(a + counter) * OneMinusAlphaValue);
            counter++;
            size--;
        }
    }
}
To get any real speedup for this code you would need to use pointers to access the arrays; that removes all the index calculations and bounds checking.
int size = ResultImageData.Length;
unsafe
{
    fixed (byte* rp = ResultImageData, cp = CurrentImageData, ap = AlphaImageData)
    {
        byte* r = rp;
        byte* c = cp;
        byte* a = ap;
        while (size > 0)
        {
            *r = (byte)(*c * AlphaValue + *a * OneMinusAlphaValue);
            r++;
            c++;
            a++;
            size--;
        }
    }
}
Edit:
Fixed variables can't be changed, so I added code to copy the pointers to new pointers that can be changed.
These are all independent calculations so if you have a multicore CPU you should be able to gain some benefit by parallelizing the calculation. Note that you'd need to keep the threads around and just hand them work to do since the overhead of thread creation would probably make this slower rather than faster if the threads are recreated each time.
The other thing that may work is farming the work off to the graphics processor. Look at this question for some ideas, for example, using Accelerator.
An option would be to use unsafe code: fixing the array in memory and use pointer operations. I doubt the speed increase will be that dramatic though.
One note: how are you timing this? If you are using DateTime, be aware that this class has poor resolution. You should add an outer loop and repeat the operation, say, ten times; I bet the result is less than 110 ms.
for (int outer = 0; outer < 10; ++outer)
{
    for (int x = 0; x < xSize; x++)
    {
        for (int y = 0; y < ySize; y++)
        {
            ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                           (AlphaImageData[x, y] * OneMinusAlphaValue));
        }
    }
}
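As an aside (my addition, not part of the original answer), System.Diagnostics.Stopwatch is a more reliable way to time this than DateTime, for example:
var sw = System.Diagnostics.Stopwatch.StartNew();
for (int outer = 0; outer < 10; ++outer)
{
    // ... run the blend loop from above here ...
}
sw.Stop();
Console.WriteLine("Average per iteration: {0:F2} ms", sw.Elapsed.TotalMilliseconds / 10);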
Since it appears that each cell in the matrix is calculated entirely independently of the others, you may want to look into having more than one thread handle this. To avoid the cost of creating threads you could use a thread pool.
If the matrix is of sufficient size, this could be a very nice speed gain. On the other hand, if it is too small, it may not help (and may even hurt). Worth a try though.
An example (pseudo code) could look like this:
void process(int x, int y) {
    ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                   (AlphaImageData[x, y] * OneMinusAlphaValue));
}
ThreadPool pool(3); // 3 threads big
int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int x = 0; x < xSize; x++) {
    for (int y = 0; y < ySize; y++) {
        pool.schedule(x, y); // this will add all tasks to the pool's work queue
    }
}
pool.waitTilFinished(); // wait until all scheduled tasks are complete
EDIT: Michael Meadows mentioned in a comment that PLINQ may be a suitable alternative: http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
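For what it's worth, here is a minimal sketch of the same idea using the Task Parallel Library's Parallel.For (my addition, assuming .NET 4 or later is available); partitioning the work by row rather than per cell keeps the scheduling overhead low:
// Requires .NET 4+ and using System.Threading.Tasks;
Parallel.For(0, xSize, x =>
{
    for (int y = 0; y < ySize; y++)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
});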
I'd recommend running a few empty tests to figure out what your theoretical bounds are. For example, take out the calculation from inside the loop and see how much time is saved. Try replacing the double loop with a single loop that runs the same number of times and see how much time that saves. Then you can be sure you are going down the right path for optimization (the two paths I see are flattening the double loop into a single loop and working with the multiplication [maybe using a lookup table would be faster]).
Just real quick: you can get an optimization by looping in reverse and comparing against 0. Most CPUs have a fast op for comparison with 0.
E.g.
int xSize = ResultImageData.GetLength(0) - 1;
int ySize = ResultImageData.GetLength(1) - 1; // minor optimization suggested by commenter
for (int x = xSize; x >= 0; --x)
{
    for (int y = ySize; y >= 0; --y)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
}
See http://dotnetperls.com/Content/Decrement-Optimization.aspx
You are probably suffering from bounds checking. As Jon Skeet states, a jagged array instead of a multidimensional one (that is, data[][] instead of data[,]) will be faster, strange as that may seem.
The compiler will optimize
for (int i = 0; i < data.Length; i++)
by eliminating the per-element range check. But it's a special case; it won't do the same for GetLength().
For the same reason, caching or hoisting the Length property (putting it in a variable like xSize) also used to be a bad thing, though I haven't been able to verify that with Framework 3.5.
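A rough sketch of the jagged-array variant (my addition; resultRows, currentRows and alphaRows are hypothetical byte[][] copies of the three arrays): using each row's own Length as the inner loop bound is what lets the JIT drop the bounds check.
for (int x = 0; x < resultRows.Length; x++)
{
    byte[] resultRow = resultRows[x];
    byte[] currentRow = currentRows[x];
    byte[] alphaRow = alphaRows[x];
    for (int y = 0; y < resultRow.Length; y++)
    {
        resultRow[y] = (byte)(currentRow[y] * AlphaValue + alphaRow[y] * OneMinusAlphaValue);
    }
}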
Try swapping the x and y for loops for a more linear memory access pattern and (thus) fewer cache misses, like so:
int xSize = ResultImageData.GetLength(0);
int ySize = ResultImageData.GetLength(1);
for (int y = 0; y < ySize; y++)
{
    for (int x = 0; x < xSize; x++)
    {
        ResultImageData[x, y] = (byte)((CurrentImageData[x, y] * AlphaValue) +
                                       (AlphaImageData[x, y] * OneMinusAlphaValue));
    }
}
If you are using LockBits to get at the image buffer, you should loop through y in the outer loop and x in the inner loop as that is how it is stored in memory (by row, not column). I would say that 11ms is pretty darn fast though...
Does the image data have to be stored in a multi-dimensional (rectangular) array? If you use jagged arrays instead, you may well find the JIT has more optimizations available (including removing the bounds checking).
If CurrentImageData and/or AlphaImageData don't change every time you run your code snippet, you could store the product prior to running the code snippet you show and avoid that multiplication in your loops.
Edit: Another thing I just thought of: Sometimes int operations are quicker than byte operations. Offset this with your processor cache utilization (you'll increase the data size considerably and stand a greater risk of a cache miss).
With 442,368 additions and 884,736 multiplications for the calculation, I would think 11 ms is actually extremely slow on a modern CPU.
While I don't know much about the specifics of .NET, I do know that high-speed calculation is not its strong suit. In the past I've built Java apps with similar problems, and I've always used C libraries to do the image/audio processing.
Coming from a hardware perspective, you want to make sure the memory accesses are sequential, that is, step through the buffer in the order it exists in memory. You may also need to reorder the code so that the compiler can take advantage of available instructions such as SIMD. How to approach this will depend on your compiler, and I can't help with VS.NET.
On an embedded DSP I would break out (AlphaImageData[x, y] * OneMinusAlphaValue) and (CurrentImageData[x, y] * AlphaValue) and use SIMD instructions to calculate whole buffers, possibly in parallel, before performing the addition, perhaps in chunks small enough to keep the buffers in the CPU cache.
I believe anything you do will require more direct access to the memory/CPU than .NET allows.
You may also want to take a look at the Mono runtime and its Mono.Simd extensions. Perhaps some of your calculations can make use of SSE acceleration, as I gather that you basically do vector calculations (I don't know up to which vector size there is acceleration for multiplication, but there is for some sizes).
(Blog post announcing Mono.Simd: http://tirania.org/blog/archive/2008/Nov-03.html)
Of course, that wouldn't work on Microsoft .NET, but maybe you are interested in some experimentation.
Interestingly, image data is frequently fairly uniform, meaning the calculations are likely very repetitive. Have you explored using a lookup table for the calculations? So any time 0.8 is multiplied by 128, you simply look up value[80, 128], which you've precalculated to be 102.4. You're basically trading memory space for CPU speed, but it could work for you.
Of course, if your image data has too high a resolution (and goes to too significant a digit), this may not be practical.
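A minimal sketch of that idea for this particular blend (my addition; it assumes AlphaValue and OneMinusAlphaValue stay fixed while the loop runs): since each input byte can only take 256 values, the two multiplications can be precomputed once.
// Precompute the two per-channel products whenever AlphaValue changes
var currentTable = new double[256];
var alphaTable = new double[256];
for (int v = 0; v < 256; v++)
{
    currentTable[v] = v * AlphaValue;
    alphaTable[v] = v * OneMinusAlphaValue;
}
// The inner loop then becomes two table lookups and an addition
for (int x = 0; x < xSize; x++)
{
    for (int y = 0; y < ySize; y++)
    {
        ResultImageData[x, y] = (byte)(currentTable[CurrentImageData[x, y]] +
                                       alphaTable[AlphaImageData[x, y]]);
    }
}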
