C# AddMemoryPressure memory leak

I wrote the following test and ran it on net5 and net472.
dotMemory shows the unmanaged memory growing.
var size = 1920 * 1080 * 3 / 2;
for (int i = 0; i < int.MaxValue; i++)
{
GC.AddMemoryPressure(size);
GC.RemoveMemoryPressure(size);
Thread.Sleep(1);
}
Is it a memory leak?
Why is dotMemory showing this?

Related

Why is rotation much faster on Image than using BitmapEncoder?

Rotating an Image using
image.RenderTransform = new RotateTransform()...
is almost immediate.
On the other hand, using
bitmapEncoder.BitmapTransform.Rotation = BitmapRotation.Clockwise90Degrees...
is much slower (in the FlushAsync()) - more than half a second.
Why is that? And is there a way to harness the fast rotation in order to rotate bitmaps?
The first one, image.RenderTransform, renders the bitmap using hardware rendering (GPU). The image itself isn't rotated; it is only displayed rotated/scaled. (Only the visible pixels are accessed, directly from/in video memory.)
The second one rotates the image itself on the CPU (all pixels) and allocates new memory for the result (non-video memory).
update:
Is there a way to use the GPU to edit bitmaps?
Depends on what you need:
If you want to use the GPU, you could use a managed wrapper (like SlimDX/SharpDX). This will take a lot of time before you get results. Don't forget that re-rasterizing images via the GPU can lead to quality loss.
If you only want to rotate images by 0, 90, 180 or 270 degrees, you could use the Bitmap class with LockBits and the Scan0 pointer. This preserves quality and size, and you can create a fast implementation.
Look here:
Fast work with Bitmaps in C#
I would create an algorithm for each angle (0, 90, 180, 270), because you don't want to calculate the x, y position for each pixel. Something like below.
Tip:
try to lose the multiplies/divides.
/*This time we convert the IntPtr to a ptr*/
byte* scan0 = (byte*)bData.Scan0.ToPointer();
for (int i = 0; i < bData.Height; ++i)
{
for (int j = 0; j < bData.Width; ++j)
{
byte* data = scan0 + i * bData.Stride + j * bitsPerPixel / 8;
//data is a pointer to the first byte of the 3-byte color data
}
}
Becomes something like:
/*This time we convert the IntPtr to a ptr*/
byte* scan0 = (byte*)bData.Scan0.ToPointer();
byte* data = scan0;
int bytesPerPixel = bitsPerPixel / 8;
for (int i = 0; i < bData.Height; ++i)
{
byte* data2 = data;
for (int j = 0; j < bData.Width; ++j)
{
//data2 is a pointer to the first byte of the 3-byte color data
data2 += bytesPerPixel;
}
data += bData.Stride;
}
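As a rough illustration of the per-angle idea, here is a sketch of a 90-degree clockwise rotation over a raw pixel buffer, using only additions in the inner loop. BufferRotation and RotateClockwise90 are names made up for this example, and it assumes a tightly packed buffer (no row padding), which a buffer obtained through LockBits may not be; treat it as a starting point rather than a finished implementation.

```csharp
using System;

public static class BufferRotation
{
    // Rotates a tightly packed pixel buffer (stride == width * bytesPerPixel)
    // by 90 degrees clockwise. The result is 'height' pixels wide and
    // 'width' pixels high.
    public static byte[] RotateClockwise90(byte[] src, int width, int height, int bytesPerPixel)
    {
        var dst = new byte[src.Length];
        int dstStride = height * bytesPerPixel; // destination rows are 'height' pixels wide
        int srcIndex = 0;
        for (int y = 0; y < height; y++)
        {
            // Source row y becomes destination column (height - 1 - y).
            int dstIndex = (height - 1 - y) * bytesPerPixel;
            for (int x = 0; x < width; x++)
            {
                for (int b = 0; b < bytesPerPixel; b++)
                    dst[dstIndex + b] = src[srcIndex + b];
                srcIndex += bytesPerPixel; // next source pixel in the row
                dstIndex += dstStride;     // one destination row down
            }
        }
        return dst;
    }
}
```

For a 2x2 single-byte image stored as { 1, 2, 3, 4 }, this returns { 3, 1, 4, 2 }.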

Parallel.For statement return "System.InvalidOperationException" with a Bitmap Processing

I have code that applies a rainbow filter to an image. I have to do it in two ways: sequential and parallel. My sequential code works without problems, but the parallel version doesn't, and I have no idea why.
Code
public static Bitmap RainbowFilterParallel(Bitmap bmp)
{
Bitmap temp = new Bitmap(bmp.Width, bmp.Height);
int raz = bmp.Height / 4;
Parallel.For(0, bmp.Width, i =>
{
Parallel.For(0, bmp.Height, x =>
{
if (i < (raz))
{
temp.SetPixel(i, x, Color.FromArgb(bmp.GetPixel(i, x).R / 5, bmp.GetPixel(i, x).G, bmp.GetPixel(i, x).B));
}
else if (i < (raz * 2))
{
temp.SetPixel(i, x, Color.FromArgb(bmp.GetPixel(i, x).R, bmp.GetPixel(i, x).G / 5, bmp.GetPixel(i, x).B));
}
else if (i < (raz * 3))
{
temp.SetPixel(i, x, Color.FromArgb(bmp.GetPixel(i, x).R, bmp.GetPixel(i, x).G, bmp.GetPixel(i, x).B / 5));
}
else if (i < (raz * 4))
{
temp.SetPixel(i, x, Color.FromArgb(bmp.GetPixel(i, x).R / 5, bmp.GetPixel(i, x).G, bmp.GetPixel(i, x).B / 5));
}
else
{
temp.SetPixel(i, x, Color.FromArgb(bmp.GetPixel(i, x).R / 5, bmp.GetPixel(i, x).G / 5, bmp.GetPixel(i, x).B / 5));
}
});
});
return temp;
}
Besides, at times the program throws the same exception but with the message "The object is already in use".
PS. I'm a beginner with C#. I searched for this topic in other posts and found nothing.
Thank you very much in advance
As commenter Ron Beyer points out, using the SetPixel() and GetPixel() methods is very slow. Each call to one of those methods involves a lot of overhead in the transition between your managed code down to the actual binary buffer that the Bitmap object represents. There are a lot of layers there, and the video driver typically gets involved which requires transitions between user and kernel level execution.
But besides being slow, these methods also make the object "busy", and so if an attempt to use the bitmap (including calling one of those methods) is made between the time one of those methods is called and when it returns (i.e. while the call is in progress), an error occurs with the exception you saw.
Since the only way that parallelizing your current code would be helpful is if these method calls could occur concurrently, and since they simply cannot, this approach isn't going to work.
On the other hand, using the LockBits() method is not only guaranteed to work, there's a very good chance that you will find the performance is so much better using LockBits() that you don't even need to parallelize the algorithm. But should you decide you do, because of the way LockBits() works — you gain access to a raw buffer of bytes that represents the bitmap image — you can easily parallelize the algorithm and take advantage of multiple CPU cores (if present).
Note that when using LockBits() you will be working with the Bitmap object at a level that you might not be accustomed to. If you are not already knowledgeable with how bitmaps really work "under the hood", you will have to familiarize yourself with the way that bitmaps are actually stored in memory. This includes understanding what the different pixel formats mean, how to interpret and modify pixels for a given format, and how a bitmap is laid out in memory (e.g. the order of rows, which can vary depending on the bitmap, as well as the "stride" of the bitmap).
These things are not terribly hard to learn, but it will require patience. It is well worth the effort though, if performance is your goal.
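To make the "stride" idea concrete, here is a minimal sketch of how a byte offset into a locked buffer is usually computed. The names BitmapLayout, ComputeStride and PixelOffset are made up for this illustration, and the 4-byte row alignment is an assumption about typical GDI+ bitmaps; in real code, read the value from BitmapData.Stride instead of computing it, and keep in mind that the row order can differ between bitmaps.

```csharp
using System;

public static class BitmapLayout
{
    // Assumed layout: each row is padded up to a multiple of 4 bytes.
    public static int ComputeStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Byte offset of pixel (x, y) in a top-down buffer with the given stride.
    public static int PixelOffset(int x, int y, int stride, int bytesPerPixel)
    {
        return y * stride + x * bytesPerPixel;
    }
}
```

For a 3-pixel-wide 24bpp image, each row holds 9 bytes of pixel data padded to a 12-byte stride, so pixel (1, 2) starts at byte 2 * 12 + 1 * 3 = 27.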
Parallel is hard on the singular mind. And mixing it with legacy GDI+ code can lead to strange results..
Your code has numerous issues:
You call GetPixel three times per pixel instead of once
You are not accessing the pixels horizontally (row by row) as you should
You call y x and x i; the machine won't mind, but we humans do
You are using way too much parallelization. There is no use in having much more of it than you have cores. It creates overhead that is bound to eat up any gains, unless your inner loop has a really hard job to do, like millions of calculations.
But the exception you get has nothing to do with these issues. And one mistake you don't make is to access the same pixel in parallel... So why the crash?
After cleaning up the code I found that the error in the stack trace pointed to SetPixel and there to System.Drawing.Image.get_Width(). The former is obvious, the latter not part of our code..!?
So I dug into the source code at referencesource.microsoft.com and found this:
/// <include file='doc\Bitmap.uex' path='docs/doc[#for="Bitmap.SetPixel"]/*' />
/// <devdoc>
/// <para>
/// Sets the color of the specified pixel in this <see cref='System.Drawing.Bitmap'/> .
/// </para>
/// </devdoc>
public void SetPixel(int x, int y, Color color) {
if ((PixelFormat & PixelFormat.Indexed) != 0) {
throw new InvalidOperationException(SR.GetString(SR.GdiplusCannotSetPixelFromIndexedPixelFormat));
}
if (x < 0 || x >= Width) {
throw new ArgumentOutOfRangeException("x", SR.GetString(SR.ValidRangeX));
}
if (y < 0 || y >= Height) {
throw new ArgumentOutOfRangeException("y", SR.GetString(SR.ValidRangeY));
}
int status = SafeNativeMethods.Gdip.GdipBitmapSetPixel(new HandleRef(this, nativeImage), x, y, color.ToArgb());
if (status != SafeNativeMethods.Gdip.Ok)
throw SafeNativeMethods.Gdip.StatusException(status);
}
The real work is done by SafeNativeMethods.Gdip.GdipBitmapSetPixel, but before that the method does a bounds check on the Bitmap's Width and Height. And while in our case these of course never change, the system still won't allow accessing them in parallel and hence crashes when at some point the checks happen interleaved. Totally unnecessary, of course, but there you go.
So GetPixel (which has the same behaviour) and SetPixel can't safely be used in a parallel processing.
Two ways out of it:
We can add locks to the code and thus make sure the checks won't happen at the 'same' time:
public static Bitmap RainbowFilterParallel(Bitmap bmp)
{
Bitmap temp = new Bitmap(bmp);
int raz = bmp.Height / 4;
int height = bmp.Height;
int width = bmp.Width;
// set a limit to parallelism
int maxCore = 7;
int blockH = height / maxCore + 1;
//lock (temp)
Parallel.For(0, maxCore, cor =>
{
//Parallel.For(0, bmp.Height, x =>
for (int yb = 0; yb < blockH; yb++)
{
int i = cor * blockH + yb;
if (i >= height) continue;
for (int x = 0; x < width; x++)
{
{
Color c;
// lock the Bitmap just for the GetPixel:
lock (temp) c = temp.GetPixel(x, i);
byte R = c.R;
byte G = c.G;
byte B = c.B;
if (i < (raz)) { R = (byte)(c.R / 5); }
else if (i < raz + raz) { G = (byte)(c.G / 5); }
else if (i < raz * 3) { B = (byte)(c.B / 5); }
else if (i < raz * 4) { R = (byte)(c.R / 5); B = (byte)(c.B / 5); }
else { G = (byte)(c.G / 5); R = (byte)(c.R / 5); }
// lock the Bitmap just for the SetPixel:
lock (temp) temp.SetPixel(x, i, Color.FromArgb(R,G,B));
};
}
};
});
return temp;
}
Note that limiting parallelism is so important that there is even a member in the ParallelOptions class (MaxDegreeOfParallelism), which Parallel.For accepts as a parameter, to control it! I have set the maximum core number to 7, but this would be better:
int degreeOfParallelism = Environment.ProcessorCount - 1;
So this should save us some overhead. But still: I'd expect it to be slower than a corrected sequential method!
Going for LockBits instead, as Peter and Ron have suggested, makes things really fast (10x), and adding parallelism potentially makes them faster still.
So finally, to finish up this lengthy answer, here is a LockBits plus limited-parallelism solution:
public static Bitmap RainbowFilterParallelLockbits(Bitmap bmp)
{
Bitmap temp = null;
temp = new Bitmap(bmp);
int raz = bmp.Height / 4;
int height = bmp.Height;
int width = bmp.Width;
Rectangle rect = new Rectangle(Point.Empty, bmp.Size);
// ReadWrite, because the modified bytes are copied back before unlocking
BitmapData bmpData = temp.LockBits(rect, ImageLockMode.ReadWrite, temp.PixelFormat);
int bpp = (temp.PixelFormat == PixelFormat.Format32bppArgb) ? 4 : 3;
int size = bmpData.Stride * bmpData.Height;
byte[] data = new byte[size];
System.Runtime.InteropServices.Marshal.Copy(bmpData.Scan0, data, 0, size);
var options = new ParallelOptions();
int maxCore = Environment.ProcessorCount - 1;
options.MaxDegreeOfParallelism = maxCore > 0 ? maxCore : 1;
Parallel.For(0, height, options, y =>
{
for (int x = 0; x < width; x++)
{
{
int index = y * bmpData.Stride + x * bpp;
if (y < (raz)) data[index + 2] = (byte) (data[index + 2] / 5);
else if (y < (raz * 2)) data[index + 1] = (byte)(data[index + 1] / 5);
else if (y < (raz * 3)) data[index ] = (byte)(data[index ] / 5);
else if (y < (raz * 4))
{ data[index + 2] = (byte)(data[index + 2] / 5);
data[index] = (byte)(data[index] / 5); }
else
{ data[index + 2] = (byte)(data[index + 2] / 5);
data[index + 1] = (byte)(data[index + 1] / 5);
data[index] = (byte)(data[index] / 5); }
};
};
});
System.Runtime.InteropServices.Marshal.Copy(data, 0, bmpData.Scan0, data.Length);
temp.UnlockBits(bmpData);
return temp;
}
While not strictly relevant, I wanted to post a better, faster version than any of the ones I see in the other answers. This is the fastest way I know of to iterate through a bitmap and save the results in C#. In my work we need to go through millions of large images. This is just me grabbing the red channel and saving it for my own purposes, but it should give you an idea of how it works.
//Parallel Unsafe, Corrected Channel, Corrected Standard div 5x faster
private void TakeApart_Much_Faster(Bitmap processedBitmap)
{
_RedMin = byte.MaxValue;
_RedMax = byte.MinValue;
_arr = new byte[BMP.Width, BMP.Height];
long Sum = 0,
SumSq = 0;
BitmapData bitmapData = processedBitmap.LockBits(new Rectangle(0, 0, processedBitmap.Width, processedBitmap.Height), ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);
//this is a much more useful datastructure than the array but it's slower to fill.
points = new ConcurrentDictionary<Point, byte>();
unsafe
{
int bytesPerPixel = Image.GetPixelFormatSize(bitmapData.PixelFormat) / 8;
int heightInPixels = bitmapData.Height;
int widthInBytes = bitmapData.Width * bytesPerPixel;
_RedMin = byte.MaxValue;
_RedMax = byte.MinValue;
byte* PtrFirstPixel = (byte*)bitmapData.Scan0;
Parallel.For(0, heightInPixels, y =>
{
//pointer to the first pixel so we don't lose track of where we are
byte* currentLine = PtrFirstPixel + (y * bitmapData.Stride);
for (int x = 0; x < widthInBytes; x = x + bytesPerPixel)
{
//0+2 is red channel
byte redPixel = currentLine[x + 2];
Interlocked.Add(ref Sum, redPixel);
Interlocked.Add(ref SumSq, redPixel * redPixel);
//divide by bytesPerPixel since we are skipping ahead one pixel at a time.
_arr[x / bytesPerPixel, y] = redPixel;
//note: these min/max updates are not synchronized between threads
_RedMin = redPixel < _RedMin ? redPixel : _RedMin;
_RedMax = redPixel > _RedMax ? redPixel : _RedMax;
}
});
_RedMean = Sum / TotalPixels;
_RedStDev = Math.Sqrt((SumSq / TotalPixels) - (_RedMean * _RedMean));
processedBitmap.UnlockBits(bitmapData);
}
}
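One way to avoid both the per-pixel Interlocked calls and the unsynchronized min/max updates is to let each worker accumulate into its own local state and merge once at the end. The sketch below uses the Parallel.For overload with thread-local state; ChannelStats and Compute are hypothetical names, and it works on a plain byte array of channel values (one byte per pixel, no padding) rather than on a locked Bitmap, so it is an illustration of the pattern, not a drop-in replacement.

```csharp
using System;
using System.Threading.Tasks;

public static class ChannelStats
{
    // Computes sum, sum of squares, min and max of a single-channel image
    // stored row by row in 'values' (width * height bytes).
    public static (long Sum, long SumSq, byte Min, byte Max) Compute(byte[] values, int width, int height)
    {
        long sum = 0, sumSq = 0;
        int min = byte.MaxValue, max = byte.MinValue;
        object gate = new object();

        Parallel.For<(long Sum, long SumSq, int Min, int Max)>(0, height,
            // Each worker starts with its own empty accumulator.
            () => (0L, 0L, (int)byte.MaxValue, (int)byte.MinValue),
            (y, state, local) =>
            {
                int row = y * width;
                for (int x = 0; x < width; x++)
                {
                    int v = values[row + x];
                    local.Sum += v;
                    local.SumSq += v * v;
                    if (v < local.Min) local.Min = v;
                    if (v > local.Max) local.Max = v;
                }
                return local;
            },
            // Merge each worker's result exactly once, under a lock.
            local =>
            {
                lock (gate)
                {
                    sum += local.Sum;
                    sumSq += local.SumSq;
                    if (local.Min < min) min = local.Min;
                    if (local.Max > max) max = local.Max;
                }
            });

        return (sum, sumSq, (byte)min, (byte)max);
    }
}
```

The lock is taken once per worker instead of once per pixel, so contention stays low even for large images.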

How to delete certain pixels in a rectangle

I am designing a Space Invaders game and I need to make the barriers break where they were hit, like in the original game. As of now my barrier just shrinks:
for (int r = 0; r < BarrierXPos.Count; r++) {
if (CollisionCheck("Barrier", BarrierXPos[r], BarrierYPos[r], null) == true)
{
BarrierFactor[r] += 1;
}
}
for (int x = 0; x < BarrierXPos.Count; x++)
{
e.Graphics.FillRectangle(Brushes.White, (int)BarrierXPos[x], (int)BarrierYPos[x], 100 - (10 * BarrierFactor[x]), 50);
}
So as of now the size is determined by the number of times the block has been hit. However, I want it to be more like the original Space Invaders and would like the block to be dynamically broken where it was hit. What I am wondering is how I can determine the pixels at which the bullet hits the barrier bitmap and subsequently remove them.

How to determine edges in an image optimally?

I was recently faced with the problem of cropping and resizing images. I needed to crop the 'main content' of an image. For example, if I had an image similar to this:
(source: msn.com)
the result should be an image with the MSN content without the white margins (left and right).
I search along the X axis for the first and last color change, and do the same along the Y axis. The problem is that traversing the image line by line takes a while: for an image that is 2000x1600px it takes up to 2 seconds to return the CropRect => x1,y1,x2,y2 data.
I tried doing one traversal per coordinate and stopping at the first value found, but it didn't work in all test cases: sometimes the returned data wasn't what was expected, and the duration of the operations was similar.
Any idea how to cut down the traversal time and the discovery of the rectangle around the 'main content'?
public static CropRect EdgeDetection(Bitmap Image, float Threshold)
{
CropRect cropRectangle = new CropRect();
int lowestX = 0;
int lowestY = 0;
int largestX = 0;
int largestY = 0;
lowestX = Image.Width;
lowestY = Image.Height;
//find the lowest X bound;
for (int y = 0; y < Image.Height - 1; ++y)
{
for (int x = 0; x < Image.Width - 1; ++x)
{
Color currentColor = Image.GetPixel(x, y);
Color tempXcolor = Image.GetPixel(x + 1, y);
Color tempYColor = Image.GetPixel(x, y + 1);
if ((Math.Sqrt(((currentColor.R - tempXcolor.R) * (currentColor.R - tempXcolor.R)) +
((currentColor.G - tempXcolor.G) * (currentColor.G - tempXcolor.G)) +
((currentColor.B - tempXcolor.B) * (currentColor.B - tempXcolor.B))) > Threshold))
{
if (lowestX > x)
lowestX = x;
if (largestX < x)
largestX = x;
}
if ((Math.Sqrt(((currentColor.R - tempYColor.R) * (currentColor.R - tempYColor.R)) +
((currentColor.G - tempYColor.G) * (currentColor.G - tempYColor.G)) +
((currentColor.B - tempYColor.B) * (currentColor.B - tempYColor.B))) > Threshold))
{
if (lowestY > y)
lowestY = y;
if (largestY < y)
largestY = y;
}
}
}
if (lowestX < Image.Width / 4)
cropRectangle.X = lowestX - 3 > 0 ? lowestX - 3 : 0;
else
cropRectangle.X = 0;
if (lowestY < Image.Height / 4)
cropRectangle.Y = lowestY - 3 > 0 ? lowestY - 3 : 0;
else
cropRectangle.Y = 0;
cropRectangle.Width = largestX - lowestX + 8 > Image.Width ? Image.Width : largestX - lowestX + 8;
cropRectangle.Height = largestY + 8 > Image.Height ? Image.Height - lowestY : largestY - lowestY + 8;
return cropRectangle;
}
}
One possible optimisation is to use Lockbits to access the color values directly rather than through the much slower GetPixel.
The Bob Powell page on LockBits is a good reference.
On the other hand, my testing has shown that the overhead associated with Lockbits makes that approach slower if you try to write a GetPixelFast equivalent to GetPixel and drop it in as a replacement. Instead you need to ensure that all pixel access is done in one hit rather than multiple hits. This should fit nicely with your code provided you don't lock/unlock on every pixel.
Here is an example
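// Lock once for the whole image, not once per pixel
BitmapData bmd = b.LockBits(new Rectangle(0, 0, b.Width, b.Height), System.Drawing.Imaging.ImageLockMode.ReadOnly, b.PixelFormat);
// Pointer to the start of row y (requires an unsafe context)
byte* row = (byte*)bmd.Scan0 + (y * bmd.Stride);
// Pixel bytes are stored in Blue, Green, Red order
Color c = Color.FromArgb(row[x * pixelSize + 2], row[x * pixelSize + 1], row[x * pixelSize]);
// Unlock once, after all pixel access is done
b.UnlockBits(bmd);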
Two more things to note:
This code is unsafe because it uses pointers
This approach depends on pixel size within Bitmap data, so you will need to derive pixelSize from bitmap.PixelFormat
GetPixel is probably your main culprit (I recommend running some profiling tests to track it down), but you could restructure the algorithm like this:
Scan first row (y = 0) from left-to-right and right-to-left and record the first and last edge location. It's not necessary to check all pixels, as you want the extreme edges.
Scan all subsequent rows, but now we only need to search outward (from center toward edges), starting at our last known minimum edge. We want to find the extreme boundaries, so we only need to search in the region where we could find new extrema.
Repeat the first two steps for the columns, establishing initial extrema and then using those extrema to iteratively bound the search.
This should greatly reduce the number of comparisons if your images are typically mostly content. The worst case is a completely blank image, for which this would probably be less efficient than the exhaustive search.
In extreme cases, image processing can also benefit from parallelism (split up the image and process it in multiple threads on a multi-core CPU), but this is quite a bit of additional work and there are other, simpler changes you can still make. Threading overhead tends to limit the applicability of this technique, which is mainly helpful if you expect to run this thing 'realtime', with dedicated repeated processing of incoming data (to make up for the initial setup costs).
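The shrinking-search idea from the steps above can be sketched as follows for the horizontal bounds. This is a simplification under stated assumptions: it works on a single-channel byte array (one byte per pixel, no padding) and treats any pixel that differs from a given background value as content, instead of comparing neighbouring colors against a threshold. CropSearch and FindHorizontalBounds are hypothetical names; the vertical bounds would be found the same way with rows and columns swapped.

```csharp
using System;

public static class CropSearch
{
    // Returns the leftmost and rightmost column containing a non-background
    // pixel. For a completely blank image it returns (width, -1).
    public static (int MinX, int MaxX) FindHorizontalBounds(byte[] gray, int width, int height, byte background)
    {
        int minX = width; // nothing found yet
        int maxX = -1;
        for (int y = 0; y < height; y++)
        {
            int row = y * width;
            // Only columns left of the current minimum can improve it.
            for (int x = 0; x < minX; x++)
            {
                if (gray[row + x] != background) { minX = x; break; }
            }
            // Only columns right of the current maximum can improve it.
            for (int x = width - 1; x > maxX; x--)
            {
                if (gray[row + x] != background) { maxX = x; break; }
            }
        }
        return (minX, maxX);
    }
}
```

Once the first content rows have been seen, later rows only touch the margins outside the current bounds, which is where the savings come from on images that are mostly content.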
This won't improve the order of complexity, but if you square your threshold, you won't need to take a square root, which is very expensive.
That should give a significant speed increase.
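A small sketch of that comparison, assuming a non-negative threshold (for which comparing squared values gives the same result as comparing the distances themselves). ColorDistance and Exceeds are names invented for this example.

```csharp
using System;

public static class ColorDistance
{
    // True when the Euclidean RGB distance between the two colors is
    // greater than 'threshold', computed without a square root.
    public static bool Exceeds(int r1, int g1, int b1, int r2, int g2, int b2, float threshold)
    {
        int dr = r1 - r2;
        int dg = g1 - g2;
        int db = b1 - b2;
        int distanceSquared = dr * dr + dg * dg + db * db;
        return distanceSquared > threshold * threshold;
    }
}
```

The squared threshold can also be computed once outside the pixel loops, so the inner loop is left with integer multiplications and one comparison.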

Speed up Matrix Addition in C#

I'd like to optimize this piece of code :
public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{
for (int x = 0; x < Width; x++)
{
for (int y = 0; y < Height; y++)
{
Byte pixelValue = image.GetPixel(x, y).B;
this.sumOfPixelValues[x, y] += pixelValue;
this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
}
}
}
This is to be used for image processing, and we're currently running this for about 200 images. We've optimized the GetPixel value to use unsafe code, and we're not using image.Width, or image.Height, as those properties were adding to our runtime costs.
However, we're still stuck at a low speed. The problem is that our images are 640x480, so the middle of the loop is being called about 640x480x200 times.
I'd like to ask if there's a way to speed it up somehow, or convince me that it's fast enough as it is. Perhaps a way is through some fast Matrix Addition, or is Matrix Addition inherently an n^2 operation with no way to speed it up?
Perhaps doing array accesses via unsafe code would speed it up, but I'm not sure how to go about doing it, and whether it would be worth the time. Probably not.
Thanks.
EDIT : Thank you for all your answers.
This is the GetPixel method we're using:
public Color GetPixel(int x, int y)
{
int offsetFromOrigin = (y * this.stride) + (x * 3);
unsafe
{
return Color.FromArgb(this.imagePtr[offsetFromOrigin + 2], this.imagePtr[offsetFromOrigin + 1], this.imagePtr[offsetFromOrigin]);
}
}
Despite using unsafe code, GetPixel may well be the bottleneck here. Have you looked at ways of getting all the pixels in the image in one call rather than once per pixel? For instance, Bitmap.LockBits may be your friend...
On my netbook, a very simple loop iterating 640 * 480 * 200 times only takes about 100 milliseconds, so if you're finding it's all going slowly, you should take another look at the bit inside the loop.
Another optimisation you might want to look at: avoid multi-dimensional arrays. They're significantly slower than single-dimensional arrays.
In particular, you can have a single-dimensional array of size Width * Height and just keep an index:
int index = 0;
for (int x = 0; x < Width; x++)
{
for (int y = 0; y < Height; y++)
{
Byte pixelValue = image.GetPixel(x, y).B;
this.sumOfPixelValues[index] += pixelValue;
this.sumOfPixelValuesSquared[index] += pixelValue * pixelValue;
index++;
}
}
Using the same simple test harness, adding a write to a 2-D rectangular array took the total time of looping over 200 * 640 * 480 up to around 850ms; using a 1-D array took it back down to around 340ms. So it's somewhat significant, and currently you've got two of those per loop iteration.
Read this article, which also has some code and mentions the slowness of GetPixel.
From the article, this is code that simply inverts the pixel values. It shows the usage of LockBits as well.
It is important to note that unsafe code will not allow you to run your code remotely.
public static bool Invert(Bitmap b)
{
BitmapData bmData = b.LockBits(new Rectangle(0, 0, b.Width, b.Height),
ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);
int stride = bmData.Stride;
System.IntPtr Scan0 = bmData.Scan0;
unsafe
{
byte * p = (byte *)(void *)Scan0;
int nOffset = stride - b.Width*3;
int nWidth = b.Width * 3;
for(int y=0;y < b.Height;++y)
{
for(int x=0; x < nWidth; ++x )
{
p[0] = (byte)(255-p[0]);
++p;
}
p += nOffset;
}
}
b.UnlockBits(bmData);
return true;
}
I recommend that you profile this code and find out what's taking the most time.
You may find that it's the subscripting operation, in which case you might want to change your data structures from:
long[,] sumOfPixelValues = new long[n, m];
long[,] sumOfPixelValuesSquared = new long[n, m];
to
struct Sums
{
public long sumOfPixelValues;
public long sumOfPixelValuesSquared;
}
Sums[,] sums = new Sums[n, m];
This would depend on what you find once you profile the code.
Code profiling is the best place to start.
Matrix addition is a highly parallel operation and can be sped up by parallelizing it with multiple threads.
I would recommend using Intel's IPP library, which contains a threaded, highly optimized API for this sort of operation. Perhaps surprisingly it's only about $100, but it would add significant complexity to your project.
If you don't want to trouble yourself with mixed-language programming and IPP, you could try out CenterSpace's C# math libraries. The NMath API contains easy-to-use, forward-scaling matrix operations.
Paul
System.Drawing.Color is a structure, which on current versions of .NET kills most optimizations. Since you're only interested in the blue component anyway, use a method that only gets the data you need.
public byte GetPixelBlue(int x, int y)
{
int offsetFromOrigin = (y * this.stride) + (x * 3);
unsafe
{
return this.imagePtr[offsetFromOrigin];
}
}
Now, exchange the order of iteration of x and y:
public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{
for (int y = 0; y < Height; y++)
{
for (int x = 0; x < Width; x++)
{
Byte pixelValue = image.GetPixelBlue(x, y);
this.sumOfPixelValues[y, x] += pixelValue;
this.sumOfPixelValuesSquared[y, x] += pixelValue * pixelValue;
}
}
}
Now you're accessing all values within a scan line sequentially, which will make much better use of CPU cache for all three matrices involved (image.imagePtr, sumOfPixelValues, and sumOfPixelValuesSquared). [Thanks to Jon for noticing that when I fixed access to image.imagePtr, I broke the other two. Now the output array indexing is swapped to keep it optimal.]
Next, get rid of the member references. Another thread could theoretically be setting sumOfPixelValues to another array midway through, which does horrible horrible things to optimizations.
public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{
uint [,] sums = this.sumOfPixelValues;
ulong [,] squares = this.sumOfPixelValuesSquared;
for (int y = 0; y < Height; y++)
{
for (int x = 0; x < Width; x++)
{
Byte pixelValue = image.GetPixelBlue(x, y);
sums[y, x] += pixelValue;
squares[y, x] += (uint)(pixelValue * pixelValue); // cast needed: ulong += int does not compile
}
}
}
Now the compiler can generate optimal code for moving through the two output arrays, and after inlining and optimization, the inner loop can step through the image.imagePtr array with a stride of 3 instead of recalculating the offset all the time. Now an unsafe version for good measure, doing the optimizations that I think .NET ought to be smart enough to do but probably isn't:
unsafe public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{
byte* scanline = image.imagePtr;
fixed (uint* sumsBase = &this.sumOfPixelValues[0,0])
fixed (ulong* squaresBase = &this.sumOfPixelValuesSquared[0,0])
{
// fixed pointers are read-only, so advance copies of them instead
uint* sums = sumsBase;
ulong* squares = squaresBase;
for (int y = 0; y < Height; y++)
{
byte* blue = scanline;
for (int x = 0; x < Width; x++)
{
byte pixelValue = *blue;
*sums += pixelValue;
*squares += (uint)(pixelValue * pixelValue);
blue += 3;
sums++;
squares++;
}
scanline += image.stride;
}
}
}
Where are the images stored? If each is on disk, then part of your processing time may be spent fetching them from the disk. You might examine this to see if it is an issue, and if so, rewrite to pre-fetch the image data so that the array processing code does not have to wait for the data.
Also consider whether the overall application logic allows it: is each matrix addition independent, or dependent on the output of a previous matrix addition? If they are independent, I'd examine executing them all on separate threads, or in parallel.
The only possible way I can think of to speed it up would be to try doing some of the additions in parallel, which with your size might be beneficial despite the threading overhead.
Matrix addition is of course an n^2 operation, but you can speed it up by using unsafe code or at least using jagged arrays instead of multidimensional ones.
About the only way to effectively speed up your matrix multiplication is to use the right algorithm. There are more efficient algorithms for matrix multiplication; take a look at the Strassen and Coppersmith-Winograd algorithms. It has also been noted in the previous replies that you can parallelize the code, which helps quite a bit.
I'm not sure if it's faster but you may write something like;
public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{
Byte pixelValue;
for (int x = 0; x < Width; x++)
{
for (int y = 0; y < Height; y++)
{
pixelValue = image.GetPixel(x, y).B;
this.sumOfPixelValues[x, y] += pixelValue;
this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
}
}
}
If you only do matrix addition, you'd like to consider using multiple threads to speed up by taking advantage of multi-core processors. Also use one dimensional index instead of two.
If you want to do more complicated operations, you need to use a highly optimized math library, like NMath.Net, which uses native code rather than .net.
Sometimes doing things in native C#, even unsafe calls, is just slower than using methods that have already been optimized.
No results guaranteed, but you may want to investigate the System.Windows.Media.Imaging name space and look at your whole problem in a different way.
Although it's a micro-optimization and thus may not add much you might want to study what the likelihood is of getting a zero when you do
Byte pixelValue = image.GetPixel(x, y).B;
Clearly, if pixelValue = 0 then there's no reason to do the summations so your routine might become
public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{
for (int x = 0; x < Width; x++)
{
for (int y = 0; y < Height; y++)
{
Byte pixelValue = image.GetPixel(x, y).B;
if(pixelValue != 0)
{
this.sumOfPixelValues[x, y] += pixelValue;
this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
}}}}
However, the question is how often you're going to see pixelValue=0, and whether the saving on the compute-and-store will offset the cost of the test.
This is a classic case of micro-optimisation failing horribly. You're not going to get anything from looking at that loop. To get real speed benefits you need to start off by looking at the big picture:-
Can you asynchronously preload image[n+1] whilst processing image[n]?
Can you load just the B channel from the image? This will decrease memory bandwidth.
Can you load the B value and update the sumOfPixelValues(Squared) arrays directly, i.e. read the file and update instead of read file, store, read, update? Again, this decreases memory bandwidth.
Can you use one dimensional arrays instead of two dimensional? Maybe create your own array class that works either way.
Perhaps you could look into using Mono and the SIMD extensions?
Can you process the image in chunks and assign them to idle CPUs in a multi-cpu environment?
EDIT:
Try having specialised image accessors so you're not wasting memory bandwidth:
public byte GetBPixel (int x, int y)
{
int offsetFromOrigin = (y * this.stride) + (x * 3);
unsafe
{
// pixels are stored B, G, R, so blue is the first byte
return this.imagePtr [offsetFromOrigin];
}
}
or, better still:
public byte GetBPixel (int offset)
{
unsafe
{
return this.imagePtr [offset];
}
}
and use the above in a loop like:
for (int start_offset = 0, y = 0 ; y < Height ; start_offset += stride, ++y)
{
for (int x = 0, offset = start_offset ; x < Width ; offset += 3, ++x)
{
pixel = GetBPixel (offset);
// do stuff
}
}
Matrix addition has O(n^2) complexity in the number of additions.
However, since there are no intermediate results, you can parallelize the additions using threads:
it is easy to prove that the resulting algorithm will be lock-free
you can tune the number of threads to find the optimum
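A minimal sketch of such a lock-free parallel addition, assuming the sums are kept in a one-dimensional array laid out row by row. MatrixSums, AddInPlace and the maxThreads parameter are names chosen for this example. Each row is handled by exactly one loop iteration, so no two threads ever write the same element and no locks are needed.

```csharp
using System;
using System.Threading.Tasks;

public static class MatrixSums
{
    // Adds 'frame' element-wise into 'sums'. Both arrays hold
    // width * height elements stored row by row.
    public static void AddInPlace(long[] sums, byte[] frame, int width, int height, int maxThreads)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };
        Parallel.For(0, height, options, y =>
        {
            int index = y * width;
            for (int x = 0; x < width; x++, index++)
            {
                sums[index] += frame[index];
            }
        });
    }
}
```

Whether this beats a plain sequential loop for 640x480 frames depends on the machine, so it is worth measuring with different maxThreads values.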
