I'm experimenting with OpenCL (through Cloo's C# interface), using the customary matrix multiplication on the GPU as a test case. The problem is that the application crashes during my speed tests. I'm trying to be efficient about re-allocating the various OpenCL objects, and I'm wondering if I'm botching something in doing so.
I'll put the code in this question, but for a bigger picture, you can get the code from github here: https://github.com/kwende/ClooMatrixMultiply
My main program does this:
Stopwatch gpuSw = new Stopwatch();
gpuSw.Start();
for (int c = 0; c < NumberOfIterations; c++)
{
float[] result = gpu.MultiplyMatrices(matrix1, matrix2, MatrixHeight, MatrixHeight, MatrixWidth);
}
gpuSw.Stop();
So I'm basically doing the call NumberOfIterations times, and timing the average execution time.
Within the MultiplyMatrices call, the first time through, I call Initialize to set up all the objects I'm going to reuse:
private void Initialize()
{
// get the intel integrated GPU
_integratedIntelGPUPlatform = ComputePlatform.Platforms.Where(n => n.Name.Contains("Intel")).First();
// create the compute context.
_context = new ComputeContext(
ComputeDeviceTypes.Gpu, // use the gpu
new ComputeContextPropertyList(_integratedIntelGPUPlatform), // use the intel openCL platform
null,
IntPtr.Zero);
// the command queue is the, well, queue of commands sent to the "device" (GPU)
_commandQueue = new ComputeCommandQueue(
_context, // the compute context
_context.Devices[0], // first device matching the context specifications
ComputeCommandQueueFlags.None); // no special flags
string kernelSource = null;
using (StreamReader sr = new StreamReader("kernel.cl"))
{
kernelSource = sr.ReadToEnd();
}
// create the "program"
_program = new ComputeProgram(_context, new string[] { kernelSource });
// compile.
_program.Build(null, null, null, IntPtr.Zero);
_kernel = _program.CreateKernel("ComputeMatrix");
}
I then enter the main body of my function (the part that will be executed NumberOfIterations times).
ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
ComputeMemoryFlags.ReadOnly| ComputeMemoryFlags.CopyHostPointer,
matrix1);
_kernel.SetMemoryArgument(0, matrix1Buffer);
ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
matrix2);
_kernel.SetMemoryArgument(1, matrix2Buffer);
float[] ret = new float[matrix1Height * matrix2Width];
ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.CopyHostPointer,
ret);
_kernel.SetMemoryArgument(2, retBuffer);
_kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
_kernel.SetValueArgument<int>(4, matrix2Width);
_commandQueue.Execute(_kernel,
new long[] { 0 },
new long[] { matrix2Width ,matrix1Height },
null, null);
unsafe
{
fixed (float* retPtr = ret)
{
_commandQueue.Read(retBuffer,
false, 0,
ret.Length,
new IntPtr(retPtr),
null);
_commandQueue.Finish();
}
}
The third or fourth time through (it's somewhat random, which hints at memory access issues), the program crashes. Here is my kernel (I'm sure there are faster implementations, but right now my goal is just to get something working without blowing up):
kernel void ComputeMatrix(
global read_only float* matrix1,
global read_only float* matrix2,
global write_only float* output,
int matrix1WidthMatrix2Height,
int matrix2Width)
{
int x = get_global_id(0);
int y = get_global_id(1);
int i = y * matrix2Width + x;
float value = 0.0f;
// row y of matrix1 * column x of matrix2
for (int c = 0; c < matrix1WidthMatrix2Height; c++)
{
int m1Index = y * matrix1WidthMatrix2Height + c;
int m2Index = c * matrix2Width + x;
value += matrix1[m1Index] * matrix2[m2Index];
}
output[i] = value;
}
Ultimately the goal here is to better understand the zero-copy features of OpenCL (since I'm using Intel's integrated GPU). I have been having trouble getting it to work and so wanted to step back a bit to see if I understood even more basic things...apparently I don't as I can't get even this to work without blowing up.
The only other thing I can think of is it's how I'm pinning the pointer to send it to the .Read() function. But I don't know of an alternative.
Edit:
For what it's worth, I updated the last part of code (the read code) to this, and it still crashes:
_commandQueue.ReadFromBuffer(retBuffer, ref ret, false, null);
_commandQueue.Finish();
Edit #2
Solution found by huseyin tugrul buyukisik (see comment below).
Upon placing
matrix1Buffer.Dispose();
matrix2Buffer.Dispose();
retBuffer.Dispose();
At the end, it all worked fine.
OpenCL resources such as buffers, kernels and command queues must be released explicitly, and each should be released only after the resources bound to it have been released. Re-creating them without releasing quickly depletes the available slots.
You were re-creating the buffers inside a method of the gpu object, so that method was their scope. When the method finishes, the GC cannot track OpenCL's unmanaged memory, so the buffers leak, and the leaks lead to the crashes.
Many OpenCL implementations are exposed through C++ bindings, which need explicit release calls from C#, Java and other managed environments.
Also, the set-argument calls do not need to be repeated when successive kernel executions pass the same buffers in the same argument order.
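To make the leak easier to see, here is a minimal sketch of deterministic release. It does not use Cloo at all: NativeBufferStandIn is a hypothetical stand-in for a native-backed object such as ComputeBuffer<float>, and LiveHandles is an invented counter that plays the role of the driver-side slots. The only point is that a using block (or an explicit Dispose() call) releases the handle at the end of every pass, whereas relying on the GC does not.

```csharp
using System;

// Hypothetical stand-in for a native-backed object such as Cloo's ComputeBuffer<float>.
public sealed class NativeBufferStandIn : IDisposable
{
    // Number of handles created but not yet released (stands in for driver-side slots).
    public static int LiveHandles { get; private set; }

    private bool _disposed;

    public NativeBufferStandIn()
    {
        LiveHandles++;
    }

    public void Dispose()
    {
        if (_disposed) return; // releasing twice must be harmless
        _disposed = true;
        LiveHandles--;
    }
}

public static class DisposalDemo
{
    // Creates three buffers per pass and never releases them.
    // Returns how many handles the loop left behind.
    public static int RunLeaky(int iterations)
    {
        int before = NativeBufferStandIn.LiveHandles;
        for (int i = 0; i < iterations; i++)
        {
            var matrix1Buffer = new NativeBufferStandIn();
            var matrix2Buffer = new NativeBufferStandIn();
            var retBuffer = new NativeBufferStandIn();
            // ... enqueue the kernel and read back the result here ...
        }
        return NativeBufferStandIn.LiveHandles - before;
    }

    // Same loop, but every buffer is released at the end of its pass.
    public static int RunWithUsing(int iterations)
    {
        int before = NativeBufferStandIn.LiveHandles;
        for (int i = 0; i < iterations; i++)
        {
            using (var matrix1Buffer = new NativeBufferStandIn())
            using (var matrix2Buffer = new NativeBufferStandIn())
            using (var retBuffer = new NativeBufferStandIn())
            {
                // ... enqueue the kernel and read back the result here ...
            }
        }
        return NativeBufferStandIn.LiveHandles - before;
    }
}
```

With four passes, RunLeaky leaves twelve handles behind while RunWithUsing leaves none. In the real code the same shape applies to the three ComputeBuffer<float> objects; the context, queue, program and kernel created in Initialize can stay alive across passes.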
Currently, I am working with the ILGPU library. As I mentioned briefly in the title, I would like to go through the cameraSpacePoints and check whether each value is valid (i.e. not infinity) and, if it is valid, store the point in the ArrayView, as in the code below.
However, the Kinect APIs (e.g. new CameraSpacePoint[depthWidth * depthHeight]) cannot be used in a kernel. So how can the code below actually work on the GPU?
static void gpu_kernel(Index index,int depthWidth,int depthHeight, ArrayView<CameraSpacePoint> dataView)
{
CameraSpacePoint[] cameraSpacePoints;
cameraSpacePoints = new CameraSpacePoint[depthWidth * depthHeight];
for (int i = 0; i < depthWidth*depthHeight; i++)
{
if (!(XMath.IsInfinity(cameraSpacePoints[i].X)))
{
dataView[i] = cameraSpacePoints[i];
}
}
}
I have some DSP effects coded in the ISampleProvider model. To apply one effect I do this and it works fine.
string filename = @"C:\myaudio.mp3";
MediaFoundationReader mediaFileReader = new MediaFoundationReader(filename);
ISampleProvider sampProvider = mediaFileReader.ToSampleProvider();
ReverbSampleProvider reverbSamplr = new ReverbSampleProvider(sampProvider);
IWavePlayer waveOutDevice = new WaveOutEvent(); // assumed; any IWavePlayer implementation works here
waveOutDevice.Init(reverbSamplr);
waveOutDevice.Play();
How can I apply multiple effects to the same input file simultaneously?
For example, if I have Reverb and Distortion effect providers, how can I chain them together so that both are applied to one input file at the same time?
Effects can be chained together by passing one as the "source" for the next. So if you wanted your audio to go first through a reverb, and then distortion, you might do something like this, passing the original audio into the Reverb effect, the output of the reverb into the distortion effect and then sending the distortion to the waveOut device.
var reverb = new ReverbSampleProvider(sampProvider);
var distortion = new DistortionSampleProvider(reverb);
waveOutDevice.Init(distortion);
(n.b. NAudio does not come with built in reverb/distortion effects - you must make these yourself or source them from elsewhere)
Mark's answer is correct, but that approach is a pain if you're copy and pasting things around in different orders, because you have to change the variables that you're passing through.
For example, if you start with:
var lpf = new LowPassEffectStream(input);
var reverb = new ReverbEffectStream(lpf);
var stereo = new StereoEffectStream(reverb);
var vol = new VolumeSampleProvider(stereo);
waveOutDevice.Init(vol);
And you want to swap reverb and stereo, a quick copy-paste leaves you with the input variables backwards:
var lpf = new LowPassEffectStream(input);
var stereo = new StereoEffectStream(reverb); // <--
var reverb = new ReverbEffectStream(lpf); // <--
var vol = new VolumeSampleProvider(stereo);
waveOutDevice.Init(vol);
It also makes it easy to fix a parameter but forget to fix another, e.g. fixing the stereo effect to have lpf as its input, but forgetting to fix the reverb effect. This often results in skipped effects in the chain leading to frustrated debugging when the effect appears not to work.
To make things easier and less error-prone when I'm stacking effects together and re-ordering them, I created the following helper class:
class EffectChain : ISampleProvider
{
public EffectChain(ISampleProvider source)
{
this._sourceStream = source;
}
private readonly ISampleProvider _sourceStream;
private readonly List<ISampleProvider> _chain = new List<ISampleProvider>();
public ISampleProvider Head
{
get
{
return _chain.LastOrDefault() ?? _sourceStream;
}
}
public WaveFormat WaveFormat
{
get
{
return Head.WaveFormat;
}
}
public void AddEffect(ISampleProvider effect)
{
_chain.Add(effect);
}
public int Read(float[] buffer, int offset, int count)
{
return Head.Read(buffer, offset, count);
}
}
You can use it like this:
var effectChain = new EffectChain(input);
var lpf = new LowPassEffectStream(effectChain.Head);
effectChain.AddEffect(lpf);
var stereo = new StereoEffectStream(effectChain.Head);
effectChain.AddEffect(stereo);
var reverb = new ReverbEffectStream(effectChain.Head);
effectChain.AddEffect(reverb);
var vol = new VolumeSampleProvider(effectChain.Head);
effectChain.AddEffect(vol);
waveOutDevice.Init(effectChain);
This allows you to quickly re-order effects in the chain, as each effect takes the effect chain's head as an input. If you don't add any effects it just acts as a pass-through. You could easily expand this class to have more methods for managing the contained effects if you wanted to, but as it stands it works quite cleanly.
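As one possible way to expand the idea, here is a sketch of a fluent variant in which each effect is added with a single line, so re-ordering means moving one line and nothing else. To keep it runnable without NAudio it uses a cut-down, hypothetical ISampleSource interface in place of ISampleProvider (the real interface also has a WaveFormat property), and ArraySource, GainEffect, ClipEffect and FluentChain are all invented names for illustration.

```csharp
using System;

// Cut-down, hypothetical stand-in for NAudio's ISampleProvider
// (the real interface also exposes a WaveFormat property).
public interface ISampleSource
{
    int Read(float[] buffer, int offset, int count);
}

// Plays back a fixed array once; stands in for a file reader.
public sealed class ArraySource : ISampleSource
{
    private readonly float[] _samples;
    private int _position;

    public ArraySource(float[] samples) { _samples = samples; }

    public int Read(float[] buffer, int offset, int count)
    {
        int n = Math.Min(count, _samples.Length - _position);
        Array.Copy(_samples, _position, buffer, offset, n);
        _position += n;
        return n;
    }
}

// Multiplies every sample by a constant factor.
public sealed class GainEffect : ISampleSource
{
    private readonly ISampleSource _source;
    private readonly float _gain;

    public GainEffect(ISampleSource source, float gain) { _source = source; _gain = gain; }

    public int Read(float[] buffer, int offset, int count)
    {
        int n = _source.Read(buffer, offset, count);
        for (int i = 0; i < n; i++) buffer[offset + i] *= _gain;
        return n;
    }
}

// Hard-clips every sample to [-limit, +limit], a crude distortion.
public sealed class ClipEffect : ISampleSource
{
    private readonly ISampleSource _source;
    private readonly float _limit;

    public ClipEffect(ISampleSource source, float limit) { _source = source; _limit = limit; }

    public int Read(float[] buffer, int offset, int count)
    {
        int n = _source.Read(buffer, offset, count);
        for (int i = 0; i < n; i++)
            buffer[offset + i] = Math.Max(-_limit, Math.Min(_limit, buffer[offset + i]));
        return n;
    }
}

// Each Then(...) call wraps the current head, so the order of the calls is the order of the effects.
public sealed class FluentChain : ISampleSource
{
    private ISampleSource _head;

    public FluentChain(ISampleSource source) { _head = source; }

    public FluentChain Then(Func<ISampleSource, ISampleSource> wrap)
    {
        _head = wrap(_head);
        return this;
    }

    public int Read(float[] buffer, int offset, int count)
    {
        return _head.Read(buffer, offset, count);
    }
}
```

Usage would look like new FluentChain(input).Then(s => new GainEffect(s, 2f)).Then(s => new ClipEffect(s, 1f)). Swapping the two Then calls swaps the effects, and no variable names have to be patched up afterwards.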
I'm trying to integrate DirectX 11 into WPF using SharpDX 2.5.
Sadly http://directx4wpf.codeplex.com/ and http://sharpdxwpf.codeplex.com/ don't work properly with SharpDX 2.5. I was also not able to port the WPFHost DX10 sample to DX11 and the full code package of this example is down: http://www.indiedev.de/wiki/DirectX_in_WPF_integrieren
Can someone suggest another way of implementing this?
SharpDX supports WPF via SharpDXElement.
Take a look in the Samples repository at the Toolkit.sln - all projects that have WPF in their name use SharpDXElement as rendering surface:
MiniCube.WPF - demonstrates basic SharpDX-WPF integration;
MiniCube.SwitchContext.WPF - demonstrates basic scenario when lifetime of the Game instance is different from the lifetime of SharpDXElement (in other words - when there is need to switch game rendering on another surface).
MiniCube.SwitchContext.WPF.MVVM - same as above, but more 'MVVM-way'.
Update: SharpDX.Toolkit has been deprecated and it is not maintained anymore. It is moved to a separate repository. The Toolkit samples were deleted, however I changed the link to a changeset where they are still present.
You can still use http://sharpdxwpf.codeplex.com/.
To make it work properly with SharpDX 2.5.0, you need to make a few modifications.
1) In project Sharp.WPF in class DXUtils.cs in method
Direct3D11.Buffer CreateBuffer<T>(this Direct3D11.Device device, T[] range)
add this line
stream.Position = 0;
just after
stream.WriteRange(range);
So fixed method looks like this:
public static Direct3D11.Buffer CreateBuffer<T>(this Direct3D11.Device device, T[] range)
where T : struct
{
int sizeInBytes = Marshal.SizeOf(typeof(T));
using (var stream = new DataStream(range.Length * sizeInBytes, true, true))
{
stream.WriteRange(range);
stream.Position = 0; // fix
return new Direct3D11.Buffer(device, stream, new Direct3D11.BufferDescription
{
BindFlags = Direct3D11.BindFlags.VertexBuffer,
SizeInBytes = (int)stream.Length,
CpuAccessFlags = Direct3D11.CpuAccessFlags.None,
OptionFlags = Direct3D11.ResourceOptionFlags.None,
StructureByteStride = 0,
Usage = Direct3D11.ResourceUsage.Default,
});
}
}
2) And in class D3D11 in file D3D11.cs
rename this
m_device.ImmediateContext.Rasterizer.SetViewports(new Viewport(0, 0, w, h, 0.0f, 1.0f));
into this
m_device.ImmediateContext.Rasterizer.SetViewport(new Viewport(0, 0, w, h, 0.0f, 1.0f));
i.e. SetViewports into SetViewport.
And it should work now.
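The answer does not say why the stream.Position = 0 line in step 1 matters. A plausible reading, assuming SharpDX's DataStream follows the usual System.IO.Stream convention, is that writing leaves the position at the end of the data, so anything that then reads from the current position finds nothing. The sketch below shows that convention with a plain MemoryStream; RewindDemo is a made-up name and no SharpDX types are involved.

```csharp
using System.IO;

public static class RewindDemo
{
    // Writes payload to a fresh stream, optionally rewinds it, and returns
    // how many bytes a following Read call actually gets back.
    public static int BytesReadAfterWrite(byte[] payload, bool rewind)
    {
        using (var stream = new MemoryStream())
        {
            stream.Write(payload, 0, payload.Length); // position now sits at the end
            if (rewind)
            {
                stream.Position = 0; // same fix as in CreateBuffer<T>
            }
            var readBack = new byte[payload.Length];
            return stream.Read(readBack, 0, readBack.Length);
        }
    }
}
```

Without the rewind the read returns 0 bytes; with it, the whole payload comes back.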
I need to port code from Java to C#. In the Java code, the methods "ByteBuffer.flip()" and "ByteBuffer.slice()" are used, and I don't know how to translate them.
I've read this question (An equivalent of javax.nio.Buffer.flip() in c#), but although an answer is given, I cannot figure out how to apply it. According to Tom Hawtin, I should "Set the limit to the current position and then set the position to zero" in the underlying array. I am unsure as to how to change these values. (If you could explain the underlying logic, it would help me a lot :)
As for the ByteBuffer.slice, I have no clue on how to translate it.
EDIT: If it can be clearer with the actual code, I'll post it:
Java:
ByteBuffer buff;
buff.putShort((short) 0);
buff.put(customArray);
buff.flip();
buff.putShort((short) 0);
ByteBuffer b = buff.slice();
short size = (short) (customFunction(b) + 2);
buff.putShort(0, size);
buff.position(0).limit(size);
So far, my translation in C#.NET:
BinaryWriter b = new BinaryWriter(); //ByteBuffer buff;
b.Write((short)0); // buff.putShort((short) 0);
b.Write(paramStream.ToArray()); // buff.put(customArray);
b.BaseStream.SetLength(b.BaseStream.Position); // buff.flip; (not sure)
b.BaseStream.Position = 0; // buff.flip; too (not sure)
b.Write((short)0); // buff.putShort((short) 0)
??? // ByteBuffer b = buff.slice();
// Not done but I can do it, short size = (short) (customFunction(b) + 2);
??? // How do I write at a particular position?
??? // buff.position(0).limit(size); I don't know how to do this
Thank you!
EDIT: Changed b.BaseStream.SetLength(b.BaseStream.Length); to b.BaseStream.SetLength(b.BaseStream.Position);, based on the Java docs.
(See http://java.sun.com/javase/6/docs/api/java/nio/ByteBuffer.html#slice%28%29 and http://java.sun.com/javase/6/docs/api/java/nio/Buffer.html#flip%28%29 for Java's calls)
Flip is a quick way to reset the buffer. So for example
(pseudocode)
void flip()
{
Length = currentPos;
currentPos = 0;
}
This lets you quickly set up the buffer you presumably just wrote to for reading from the beginning.
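Applied to a MemoryStream, the pseudocode above could look roughly like the sketch below. StreamFlip and both of its methods are hypothetical helpers, not part of .NET. Two caveats: SetLength truncates the stream, which only approximates Java's separate limit and capacity, and BinaryWriter writes little-endian whereas Java's ByteBuffer defaults to big-endian, so the byte order must be handled explicitly if the data goes to a Java peer. WriteInt16At is a rough counterpart to the absolute buff.putShort(0, size) call in the question.

```csharp
using System.IO;

public static class StreamFlip
{
    // Rough equivalent of ByteBuffer.flip(): limit = position, position = 0.
    public static void Flip(MemoryStream stream)
    {
        stream.SetLength(stream.Position);
        stream.Position = 0;
    }

    // Rough equivalent of the absolute putShort(index, value):
    // writes two bytes at index, then restores the previous position.
    public static void WriteInt16At(MemoryStream stream, long index, short value)
    {
        long saved = stream.Position;
        stream.Position = index;
        // Little-endian, matching BinaryWriter; Java's ByteBuffer defaults to big-endian.
        stream.WriteByte((byte)(value & 0xFF));
        stream.WriteByte((byte)((value >> 8) & 0xFF));
        stream.Position = saved;
    }
}
```

After writing with a BinaryWriter and flushing it, StreamFlip.Flip(stream) leaves the stream ready to be read from the start, and StreamFlip.WriteInt16At(stream, 0, size) patches the length prefix without disturbing the current position.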
Update:
Slice is a bit trickier due to the requirement that "Changes to this buffer's content will be visible in the new buffer, and vice versa; the two buffers' position, limit, and mark values will be independent". There is unfortunately no concept of a shared portion of a buffer (that I know of; there's always using arrays, detailed below) without making your own class. The closest thing you could do is this:
Old Code:
ByteBuffer b = buff.slice();
New Code (assuming a List)
List<Byte> b= buff;
int bStart = buffPos; // buffPos is your way of tracking your mark
The downside to the code above is that there is no way for C# to hold the new starting point of the new buffer and still share it. You'll have to manually use the new starting point whenever you do anything, from for loops (for i=bStart;...) to indexing (newList[i + bStart]...)
Your other option is to use Byte[] arrays instead, and do something like this:
unsafe
{
fixed (byte* b = &buff[buffPos])
{
// b is only valid inside this fixed block
}
}
... however, that requires unsafe operations to be enabled, and I cannot vouch for its safety, due to the garbage collector and my avoidance of the "unsafe" features.
Outside of that, there's always making your own ByteBuffer class.
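One built-in type that comes close to a shared view, without unsafe code, is ArraySegment<byte>: it stores a reference to the original array plus an offset and a count, so writes through the array are visible through the segment. A minimal sketch follows; SliceDemo and its Slice method are made-up names, and unlike Java's slice() the segment has no position or limit of its own, so those would still need to be tracked separately.

```csharp
using System;

public static class SliceDemo
{
    // Returns a view over buffer[position .. limit) that shares the same backing array.
    public static ArraySegment<byte> Slice(byte[] buffer, int position, int limit)
    {
        return new ArraySegment<byte>(buffer, position, limit - position);
    }

    // Reads element i of the view by going through the shared array.
    public static byte Get(ArraySegment<byte> view, int i)
    {
        if (i < 0 || i >= view.Count) throw new ArgumentOutOfRangeException("i");
        return view.Array[view.Offset + i];
    }
}
```

For a five-byte array and Slice(data, 2, 5), the view has three elements, and a later write to data[2] shows up as element 0 of the view.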
Untested, but if I understand the java bits correctly, this would give you an idea on how to implement.
public class ByteBuffer {
private int _Position;
private int _Capacity;
private byte[] _Buffer;
private int _Start;
private ByteBuffer(int capacity, int position, int start, byte[] buffer) {
_Capacity = capacity;
_Position = position;
_Start = start;
_Buffer = buffer;
}
public ByteBuffer(int capacity) : this(capacity, 0 , 0, new byte[capacity]) {
}
public void Write(byte item) {
if (_Position >= _Capacity) {
throw new InvalidOperationException();
}
_Buffer[_Start + _Position++] = item;
}
public byte Read() {
if (_Position >= _Capacity) {
throw new InvalidOperationException();
}
return _Buffer[_Start + _Position++];
}
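public void Flip() {
// The limit becomes the current position; the position restarts at 0 (it is relative to _Start).
_Capacity = _Position;
_Position = 0;
}
public ByteBuffer Slice() {
// The slice shares _Buffer and begins at the current absolute position.
return new ByteBuffer(_Capacity - _Position, 0, _Start + _Position, _Buffer);
}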
}
I am trying to optimize my engine (C# + SlimDX) to make as few allocations as possible (to prevent the GC from firing too often), using as a guide a profiler that shows me where the garbage objects are generated. It's going pretty well (down from about 20 MB of garbage every 5 s to 8 MB every minute and a half (yep, it was very poorly optimized XD)).
There is one method in which I can't find any allocation, and I don't know what to do. It seems this method generates 2 garbage objects per execution in its own body (not in a called function):
Can somebody help me understand why this function generates garbage objects? I really don't have a clue.
public override void Update()
{
base.Update();
if (LastCheckInstancesNumber != Instances.Count)
{
LastCheckInstancesNumber = Instances.Count;
_needToRegenerateUpdate = true;
}
// Create the byte array to use in the next draw.
if (_needToRegenerateUpdate)
{
Int32 PrimitivesCount = Instances.Count;
Int32 Size = PrimitivesCount * 80;
if ((ByteUpdateTemp != null) && (ByteUpdateTemp.Length < Size))
ByteUpdateTemp = new byte[Size];
int offset = 0;
PrimitivesCount = 0;
Int32 Count = Instances.Count;
for (int i = 0; i < Count; i++)
{
InstancedBase3DObjectInstanceValues ib = Instances[i];
if (ib.Process)
{
MathHelper.CopyMatrix(ref ib._matrix, ref MatrixTemp);
MathHelper.CopyVector(ref ib._diffuseColor, ref ColorTemp);
ObjectUpdateTemp[0] = MatrixTemp.M11;
ObjectUpdateTemp[1] = MatrixTemp.M12;
ObjectUpdateTemp[2] = MatrixTemp.M13;
ObjectUpdateTemp[3] = MatrixTemp.M14;
ObjectUpdateTemp[4] = MatrixTemp.M21;
ObjectUpdateTemp[5] = MatrixTemp.M22;
ObjectUpdateTemp[6] = MatrixTemp.M23;
ObjectUpdateTemp[7] = MatrixTemp.M24;
ObjectUpdateTemp[8] = MatrixTemp.M31;
ObjectUpdateTemp[9] = MatrixTemp.M32;
ObjectUpdateTemp[10] = MatrixTemp.M33;
ObjectUpdateTemp[11] = MatrixTemp.M34;
ObjectUpdateTemp[12] = MatrixTemp.M41;
ObjectUpdateTemp[13] = MatrixTemp.M42;
ObjectUpdateTemp[14] = MatrixTemp.M43;
ObjectUpdateTemp[15] = MatrixTemp.M44;
ObjectUpdateTemp[16] = ColorTemp.X;
ObjectUpdateTemp[17] = ColorTemp.Y;
ObjectUpdateTemp[18] = ColorTemp.Z;
ObjectUpdateTemp[19] = ColorTemp.W;
ByteConverter.WriteSingleArrayToByte(ref ObjectUpdateTemp, ref ByteUpdateTemp, offset);
offset += 20;
PrimitivesCount++;
}
}
SynchronizedObject so = SynchronizationEventWriter.LockData();
so.Synchronizedobject = ByteUpdateTemp;
SynchronizationEventWriter.Update();
SynchronizationEventWriter.UnlockData();
_needToRegenerateUpdate = false;
so = SynchronizationEventWriterNum.LockData();
so.Synchronizedobject = PrimitivesCount;
SynchronizationEventWriterNum.Update();
SynchronizationEventWriterNum.UnlockData();
}
}
Notes :
The new byte[Size] is NEVER called due to caching.
The MathHelper functions simply copy each element (Single) from one object to another without creating anything.
base.Update() does almost nothing (and anyway it is inherited by ALL objects in my engine, but only here do I get the garbage objects).
Thanks!!!
EDIT:
internal void GetLock()
{
Monitor.Enter(InternalLock);
Value.Locked = true;
Value.LockOwner = Thread.CurrentThread;
}
public SynchronizedObject LockData()
{
Parent.GetLock();
return Parent.Value;
}
Here's the code of the LockData(). I don't think it generates anything :|
I've solved it!
The problem was that so.Synchronizedobject = PrimitivesCount; was assigning an Int32 to an Object. Assigning an Int32 to an Object boxes it, so every assignment allocated a new object and left the previous one as garbage.
I solved it by wrapping the Int32 in a small box class that is allocated once, and simply changing the value inside it.
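For anyone who wants to see the effect in isolation, here is a small sketch. IntBox and BoxingDemo are illustrative names, not the classes from the engine. The first method shows that storing an int in an object-typed slot produces a distinct heap object on every assignment; the second shows that a holder allocated once stays the same object while only its field changes.

```csharp
using System;

// Mutable holder allocated once; only its field changes afterwards.
public sealed class IntBox
{
    public int Value;
}

public static class BoxingDemo
{
    // Storing an int in an object-typed slot boxes it: each assignment
    // creates a separate heap object, even for equal values.
    public static bool EachAssignmentAllocates(int value)
    {
        object first = value;
        object second = value;
        return !ReferenceEquals(first, second);
    }

    // Updating the field of a holder keeps the same heap object in the slot.
    public static bool HolderIsReused(int firstValue, int secondValue)
    {
        object slot = new IntBox();
        object before = slot;
        ((IntBox)slot).Value = firstValue;
        ((IntBox)slot).Value = secondValue;
        return ReferenceEquals(before, slot) && ((IntBox)slot).Value == secondValue;
    }
}
```

Both methods return true. In Update() that corresponds to one allocation per call for the boxed PrimitivesCount versus none once the holder is in place.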
What's in base.Update(), anything?
Can your profiler dump the heap? If so, I'd put a breakpoint just before this method and dump the heap, and then dump it again straight afterwards. That way you'll be able to see what type of object has been created.
Short of that a brute force approach of commenting out line by line is another (horrible) idea.
Do your MathHelper methods create a temp object?
I'm just guessing, but it looks like you're creating two SynchronizedObject objects in the bottom nine lines of that function:
SynchronizedObject so = SynchronizationEventWriter.LockData();
and
so = SynchronizationEventWriterNum.LockData();
No detailed knowledge of SynchronizedObject or whether LockData() actually creates anything, but it's the only choice I can see in your code...