I'm writing a physical memory manager that gets some intervals of memory from the BIOS that are not used by crucial system data. Each interval has 0 <= start <= 2^32 - 1 and 0 <= length <= 2^32. I have already filtered out the zero-length intervals.
Given two intervals S and T, I want to detect how they intersect. For example, does S start before T and end within T (picture a)? Or does S start before T and end after T (picture c)?
You'd think the solution is trivial:
uint s_end = s_start + s_length;
uint t_end = t_start + t_length;
if (s_start < t_start)
// S starts before T
else if (s_start < t_end)
// S starts within T
else
// S starts after T
if (s_end <= t_start)
// S ends before T
else if (s_end <= t_end)
// S ends within T
else
// S ends after T
The problem is overflow: I am technically limited to a 32-bit integer and the intervals can (and often do) use the whole range of available integers. For example, in figure b, t_end equals 0 due to overflow. Or even, as in figure f, t_start = t_end = s_start = 0 while t_length != 0.
How can I make these interval intersection conditions work with overflow taken into account?
The overflow screws up my conditions, but I really can't use a 64-bit integer for this (that would be the easiest fix). I know it must be possible with some clever reshuffling of my conditions using addition and subtraction, but after making endless diagrams and thinking about it for hours, I can't seem to wrap my head around it.
While my problem is with 32-bit integers, in this image I used 4-bit integers just to simplify it. The problem remains the same.
OK, the issue is that if you want your ranges to span the full n-bit range, any calculation based on start/end has the potential to overflow.
So the trick is to do a linear transform to a place where your start/end calculations do not overflow, do your calculations, and then transform back.
NOTES
Below the "we can safely call end() now" line, you can run your ordering checks (your original code) and they will be safe, since ordering is preserved by a linear transform.
Also, as I noted in the previous post, there is a special boundary case where even if you do this transform, you will overflow (where you span the entire line) - but you can code for that special boundary condition.
OUTPUT
5 11
CODE
#include <algorithm> // std::min, std::max
#include <cstdint>   // uint8_t
#include <iostream>
using type = uint8_t;
struct segment
{
type start, length;
type end() const { return start + length; }
};
static segment
intersect( segment s, segment t )
{
type shift = std::min( s.start, t.start );
// transform so we can safely call end()
s.start -= shift; // doesn't affect length
t.start -= shift; // doesn't affect length
// we can safely call end() now ----------------------------------------------
type u_start = std::max( s.start, t.start );
type u_end = std::min( s.end(), t.end() );
type u_length = u_end - u_start;
segment u{ u_start, u_length };
// transform back
u.start += shift;
return u;
}
int main()
{
segment s{ 3, 13 }, t{ 5, 11 };
segment u = intersect( s, t );
std::cerr << uint32_t( u.start ) << " " << uint32_t( u.length ) << std::endl;
return 0;
}
Your example code does not enumerate all the cases. For example, the intervals could also start or end at the same point.
To solve the overflow problem, you could use different math based on the start comparison, so that the ends are never computed at all. Something like:
if (s_start < t_start)
{
// S starts before T
uint start_offset = t_start - s_start;
if (start_offset < s_length)
{
if (s_length - start_offset < t_length)
{
// ...
}
else ...
} else ...
}
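A fuller sketch of that idea (my code, not the answer's; written in C#, where uint arithmetic wraps just like in C). It classifies every case using only start comparisons and length arithmetic, so the ends are never computed and cannot overflow. It assumes the stored lengths are the true lengths; the "length is really 2^32 but stored as 0" case from figure f still has to be special-cased, as other answers discuss.
static void Classify(uint sStart, uint sLength, uint tStart, uint tLength)
{
    if (sStart < tStart)
    {
        // S starts before T
        uint offset = tStart - sStart;        // no overflow: tStart > sStart
        if (sLength <= offset)
        {
            Console.WriteLine("S ends at or before T's start (disjoint)");
        }
        else
        {
            uint reach = sLength - offset;    // how far S extends past T's start
            if (reach < tLength) Console.WriteLine("S ends within T (picture a)");
            else if (reach == tLength) Console.WriteLine("S and T end at the same point");
            else Console.WriteLine("S ends after T (picture c)");
        }
    }
    else
    {
        // S starts at or after T's start
        uint offset = sStart - tStart;        // no overflow: sStart >= tStart
        if (tLength <= offset)
        {
            Console.WriteLine("S starts at or after T's end (disjoint)");
        }
        else
        {
            uint room = tLength - offset;     // distance from S's start to T's end
            if (sLength < room) Console.WriteLine("S lies entirely within T");
            else if (sLength == room) Console.WriteLine("S ends exactly where T ends");
            else Console.WriteLine("S starts within T and ends after it");
        }
    }
}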
One solution is to treat an end of 0 as a special case. Weaving this into the if-statements, it becomes:
uint s_end = s_start + s_length;
uint t_end = t_start + t_length;
if (s_start < t_start)
// S starts before T
else if (t_end == 0 || s_start < t_end)
// S starts within T
else
// S starts after T
if (s_end != 0 && s_end <= t_start)
// S ends before T
else if (t_end == 0 || s_end == t_end
|| (s_end != 0 && s_end <= t_end))
// S ends within T
else
// S ends after T
This looks correct.
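For example (the numbers are mine, chosen so that T wraps), these checks classify a wrapping T correctly:
uint t_start = 0xF0000000, t_length = 0x10000000; // T covers the top 256 MiB and ends exactly at 2^32
uint s_start = 0xF8000000, s_length = 0x04000000; // S lies inside T
uint t_end = t_start + t_length;                  // wraps around to 0
uint s_end = s_start + s_length;                  // 0xFC000000, no wrap
// start check: s_start >= t_start and t_end == 0, so "S starts within T"
// end check:   s_end != 0 but s_end > t_start; then t_end == 0, so "S ends within T"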
I don't know what you do with conditions like (f), since the 32-bit t_length will be 0 there.
Assuming you've handled this case somehow when filtering out length=0 (which can mean either 0 or 2^32), the basic idea is this:
bool s_overflows = false;
if (s_start > 0) // can't have overflow with s_start == 0
{
uint32 s_max_length = _UI32_MAX - s_start + 1;
if (s_length == s_max_length) s_overflows = true;
}
bool t_overflows = false;
if (t_start > 0)
{
uint32 t_max_length = _UI32_MAX - t_start + 1;
if (t_length == t_max_length) t_overflows = true;
}
Then you just do your calculations, but if s_overflows is true, you don't calculate s_end -- you don't need it, since you already know it's 0x100000000. The same for t_overflows. Since these are already special cases, just like start=0, they shouldn't complicate your code much.
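For instance, the end-ordering checks can fold the flags in like this (a minimal sketch of my own, not code from the answer; a set flag means the logical end is exactly 2^32):
uint s_end = s_start + s_length; // only meaningful when !s_overflows
uint t_end = t_start + t_length; // only meaningful when !t_overflows
// Does S end at or before T's start (no overlap on that side)?
bool sEndsBeforeT = !s_overflows && s_end <= t_start;
// Does S end at or before T's end?
bool sEndsWithinT = t_overflows || (!s_overflows && s_end <= t_end);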
Related
I've been trying to write a piece of code that takes a money input and rewrites it with numerals (1000 to 1 thousand, 1,000,000 to 1 million, etc.) So far, I haven't been able to get past Unity telling me that there's a stack overflow on my array before it crashes, but I don't see why it's overflowing. Am I missing something huge or is something just not right here?
Unity has been giving me the error "The requested operation caused a stack overflow, MoneyTruncate() (at Assets/Scripts/Money.cs:60", which is the line pertaining to the array in this void.
void MoneyTruncate()
{
string[] Numerals = new string[]{" ", "thousand", "million", "billion"} ;
int i = 0;
TotalMoneyFloat = (TotalMoney / (10 ^ (i * 3)));
TotalMoneyFloatLimit = (TotalMoney / (10 ^ ((i + 1) * 3)));
//current iteration of Numeral is correct- greater than current numeral, less than next
if(TotalMoneyFloat >= 1 && TotalMoneyFloatLimit < 1)
{
TotalMoneyText.GetComponent<Text>().text = "$" + TotalMoneyFloat.ToString("0.00") + " " + Numerals[i];
}
//current iteration of Numeral is too high- less than current numeral
if(TotalMoneyFloat < 1)
{
i--;
MoneyTruncate();
}
//current iteration of Numeral is too low- greater than current numeral
if(TotalMoneyFloatLimit >= 1)
{
i++;
MoneyTruncate();
}
//i is at its limit for number of numeral available- i has reached max value for the array but money is higher than
if(i > 3 && TotalMoneyFloatLimit >= 1)
{
TotalMoneyText.GetComponent<Text>().text = "$" + TotalMoneyFloat.ToString("0.00") + " " + Numerals[i];
}
}
Which line is "the line pertaining to the array"? What function is this? If I had to guess, you've got a circular call somewhere here, which would happen if this function is itself MoneyTruncate().
The logic is not doing what you think it's doing, and I would urge you to set a breakpoint and step into every function. At some point you'll see that you keep coming through the same point in your code.
I would bet this function is named MoneyTruncate and you're trying to recursively call it, but your recursion is broken - your i variable is LOCAL and any decrement before recursion is not affecting the called child/recurring instance. This means the recurring instances follow the same steps, call the same function in the same way, and this goes on until your stack builds up so many function calls that it overflows.
You're using recursion to solve a problem that doesn't really need recursion. Just check if >= 1e12 and return trillion, 1e9 for billion, etc.
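A minimal non-recursive sketch of that idea (the method and parameter names are mine, not from the original script):
static string FormatMoney(double totalMoney)
{
    // check the largest scale first and format exactly once; no recursion needed
    if (totalMoney >= 1e12) return "$" + (totalMoney / 1e12).ToString("0.00") + " trillion";
    if (totalMoney >= 1e9)  return "$" + (totalMoney / 1e9).ToString("0.00") + " billion";
    if (totalMoney >= 1e6)  return "$" + (totalMoney / 1e6).ToString("0.00") + " million";
    if (totalMoney >= 1e3)  return "$" + (totalMoney / 1e3).ToString("0.00") + " thousand";
    return "$" + totalMoney.ToString("0.00");
}
// usage in the question's script would be something like:
// TotalMoneyText.GetComponent<Text>().text = FormatMoney(TotalMoney);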
I'm using Visual Studio 2017 and C#.
When I run this for loop, it iterates beyond 0, and i becomes 4294967295
//where loActionList.Count starts at 1
// First time through loop works
// Second time I can see i decrement to 4294967295
// I'm declaring i as uint because loActionList requires it
// and because other vars require it (they are based on an external API)
for (uint i = loActionList.Count - 1; i >= 0; i--)
{
....
}
I can do this:
for (int i = (int)loActionList.Count - 1; i >= 0; i--)
{
IPXC_Action_Goto myvar = (IPXC_Action_Goto)loActionList[(uint)i];
}
...but I'd like to know if there is another way to handle this.
Perhaps something that simply prevents the for loop from going beyond 0?
RESOLVED: I ended up using #RonBeyer's suggestion of adding a break with this code if (i == 0) break;.
To understand the issue, let's look at a for loop with a regular int:
int i;
for(i = 5; i >= 0; i--) {
Console.Write(i);
}
Running that, you'd get 543210 as you'd expect.
However, if you output i now, you'd get -1. The loop stopped because after making i = -1, it checked -1 >= 0, saw that was false, and aborted.
The problem with uint (unsigned integer) is that 0 - 1 on a uint gives you its max value, since it wraps back around. After doing 0 - 1, the loop checks BIG_NUMBER_HERE >= 0, sees that it's true, and keeps going.
There are a couple simple ways to avoid this. Which you use depends on your use case / personal tastes:
use an int instead of a uint
increase your start value by 1 and end at > 0 instead
make your condition i >= 0 && i < BIG_NUMBER_HERE
Add if (i == 0) break; at the end of the for loop to force out if you've hit zero (thanks Ron Beyer).
The BIG_NUMBER_HERE would be the uint max value; in C# that constant is uint.MaxValue.
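For example, option 2 from the list above (start one higher, loop while i > 0, index with i - 1) keeps the counter from ever wrapping. A sketch, assuming loActionList.Count is a uint as in the question:
for (uint i = loActionList.Count; i > 0; i--)
{
    // index with i - 1: when i == 1 the body runs for element 0, then the loop exits,
    // so i never underflows to uint.MaxValue
    IPXC_Action_Goto myvar = (IPXC_Action_Goto)loActionList[i - 1];
}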
If you are having issues with sign, but need the cardinality of a uint, I suggest going with long.
If the API requires a particular data type such as uint, I would cast it for each call, to minimize the damage to your code.
for (var i = loActionList.LongCount() - 1; i >= 0; i--)
{
IPXC_Action_Goto myvar = (IPXC_Action_Goto)loActionList[(uint)i];
}
I want to split a large array of UTF-8 encoded data, so that decoding it into chars can be parallelized.
It seems that there's no way to find out how many bytes Encoding.GetCharCount reads. I also can't use GetByteCount(GetChars(...)) since it decodes the entire array anyways, which is what I'm trying to avoid.
UTF-8 has well-defined byte sequences and is considered self-synchronizing, meaning that given any byte position you can find where the character containing that byte begins.
The UTF-8 spec (Wikipedia is the easiest link) defines the following byte sequences:
0_______ : ASCII (0-127) char
10______ : Continuation
110_____ : Two-byte character
1110____ : Three-byte character
11110___ : Four-byte character
So, the following method (or something similar) should get your result:
Get the byte count for bytes (bytes.Length, etc.)
Determine how many sections to split into
Select the byte at index byteCount / sectionCount
Test byte against table:
If byte & 0x80 == 0x00 then you can make this byte part of either section
If byte & 0xE0 == 0xC0 then you need to seek ahead one byte, and keep it with the current section
If byte & 0xF0 == 0xE0 then you need to seek ahead two bytes, and keep it with the current section
If byte & 0xF8 == 0xF0 then you need to seek ahead three bytes, and keep it with the current section
If byte & 0xC0 == 0x80 then you are in a continuation, and should seek ahead until the first byte that does not fit val & 0xC0 == 0x80, then keep up to (but not including) this value in the current section
Select byteStart through byteCount + offset where offset can be defined by the test above
Repeat for each section.
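As a sketch, the test above could look like the following (my code; note that the answer's actual implementation further down takes the simpler backward-scanning approach instead):
// Given a candidate split index, return the exclusive end of the current section,
// following the byte-class table above. Assumes well-formed UTF-8 input.
static int SectionEnd(byte[] data, int candidate)
{
    if (candidate >= data.Length) return data.Length;
    byte b = data[candidate];
    if ((b & 0x80) == 0x00) return candidate;      // ASCII: either section may take it
    if ((b & 0xE0) == 0xC0) return candidate + 2;  // two-byte char stays in this section
    if ((b & 0xF0) == 0xE0) return candidate + 3;  // three-byte char stays in this section
    if ((b & 0xF8) == 0xF0) return candidate + 4;  // four-byte char stays in this section
    while (candidate < data.Length && (data[candidate] & 0xC0) == 0x80)
        candidate++;                               // continuation: advance to the next char start
    return candidate;
}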
Of course, if we redefine our test as returning the current char start position, we have two cases:
1. If (byte[i] & 0xC0) == 0x80 then we need to move backwards through the array
2. Else, return the current i (since it's not a continuation)
This gives us the following method:
public static int GetCharStart(ref byte[] arr, int index) =>
(arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
Next, we want to get each section. The easiest way is to use a state-machine (or abuse, depending on how you look at it) to return the sections:
public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
var sectionStart = 0;
var sectionEnd = 0;
for (var i = 0; i < sectionCount; i++)
{
sectionEnd = i == (sectionCount - 1) ? utf8Array.Length : GetCharStart(ref utf8Array, (int)Math.Round((double)utf8Array.Length / sectionCount * i));
yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
sectionStart = sectionEnd;
}
}
I built it this way because I want to use Parallel.ForEach to demonstrate the result, which is very easy when we have an IEnumerable. It also lets the processing be lazy: we only gather the next section when it is needed, so the work happens on demand.
Lastly, we need to be able to get a section of bytes, so we have the GetSection method:
public static byte[] GetSection(ref byte[] array, int start, int end)
{
var result = new byte[end - start];
for (var i = 0; i < result.Length; i++)
{
result[i] = array[i + start];
}
return result;
}
Finally, the demonstration:
var sourceText = "Some test 平仮名, ひらがな string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.";
var source = Encoding.UTF8.GetBytes(sourceText);
Console.WriteLine(sourceText);
var results = new ConcurrentBag<string>();
Parallel.ForEach(GetByteSections(source, 10),
new ParallelOptions { MaxDegreeOfParallelism = 1 },
x => { Console.WriteLine(Encoding.UTF8.GetString(x)); results.Add(Encoding.UTF8.GetString(x)); });
Console.WriteLine();
Console.WriteLine("Assemble the result: ");
Console.WriteLine(string.Join("", results.Reverse()));
Console.ReadLine();
The result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.
Some test ???, ??
?? string that should b
e decoded in parallel, thi
s demonstrates that we work
flawlessly with Parallel.
ForEach. The only downside
to using `Parallel.ForEach`
the way I demonstrate is
that it doesn't take order into account, but oh-well.
Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.
Not perfect, but it does the job. If we change MaxDegreeOfParallelism to a higher value, our string gets jumbled:
Some test ???, ??
e decoded in parallel, thi
flawlessly with Parallel.
?? string that should b
to using `Parallel.ForEach`
ForEach. The only downside
that it doesn't take order into account, but oh-well.
s demonstrates that we work
the way I demonstrate is
So, as you can see, super easy. You'll want to make modifications to allow for correct order-reassembly, but this should demonstrate the trick.
If we modify the GetByteSections method as follows, the last section is no longer ~2x the size of the remaining ones:
public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
var sectionStart = 0;
var sectionEnd = 0;
var sectionSize = (int)Math.Ceiling((double)utf8Array.Length / sectionCount);
for (var i = 0; i < sectionCount; i++)
{
if (i == (sectionCount - 1))
{
var lengthRem = utf8Array.Length - i * sectionSize;
sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
sectionStart = sectionEnd;
sectionEnd = utf8Array.Length;
yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
}
else
{
sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
sectionStart = sectionEnd;
}
}
}
The result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.
Some test ???, ???? string that should be de
coded in parallel, this demonstrates that we work flawless
ly with Parallel.ForEach. The only downside to using `Para
llel.ForEach` the way I demonstrate is that it doesn't tak
e order into account, but oh-well. We can continue to incr
ease the length of this string to demonstrate that the las
t section is usually about double the size of the other se
ctions, we could fix that if we really wanted to. In fact,
with a small modification it does so, we just have to rem
ember that we'll end up with `sectionCount + 1` results.
Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.
And finally, if for some reason you split into an abnormally large number of sections compared to the input size (my input size of ~578 bytes at 250 chars demonstrates this), you'll hit an IndexOutOfRangeException in GetCharStart; the following version fixes that:
public static int GetCharStart(ref byte[] arr, int index)
{
if (index >= arr.Length)
{
index = arr.Length - 1;
}
return (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
}
Of course this leaves you with a bunch of empty results, but the reassembled string doesn't change, so I'm not even going to bother posting the full scenario test here. (I leave it up to you to experiment.)
Great answer Mathieu and Der. Here is a Python variant, 100% based on your answer, which works great:
def find_utf8_split(data, bytes=None):
    bytes = bytes or len(data)
    while bytes > 0 and (data[bytes - 1] & 0xC0) == 0x80:
        bytes -= 1
    if bytes > 0:
        if (data[bytes - 1] & 0xE0) == 0xC0: bytes = bytes - 1
        if (data[bytes - 1] & 0xF0) == 0xE0: bytes = bytes - 1
        if (data[bytes - 1] & 0xF8) == 0xF0: bytes = bytes - 1
    return bytes
This code finds a UTF-8-compatible split point in a given byte string. It does not perform the split itself, as that would take more memory; that is left to the rest of the code.
For example you could:
position = find_utf8_split(data)
leftovers = data[position:]
text = data[:position].decode('utf-8')
I have the following function:
public static long Fibon(long num)
{
if (num == 1)
{
return 1;
}
else if (num == 2)
{
return 1;
}
return Fibon(num - 1) + Fibon(num - 2);
}
This function uses recursion to calculate a Fibonacci number. How can I calculate the amount of stack memory required to execute this function, before executing it? For example, I want to execute this function in a few separate threads with some big numbers, and before starting the threads I want to know how much stack memory I need to have available.
Just looking at it, the code won't work because when num == 2, the method tries to find fibon(0).
Try
public static long Fibon(long num)
{
if (num == 1)
{
return 1;
}
else if (num == 2)
{
return 1;
}
return Fibon(num - 1) + Fibon(num - 2);
}
will give you 1, 1, 2, 3, 5, ...
Sorry this wasn't an answer, I don't have the reputation to comment.
Edit: You'll also be able to compute greater entries by using ulong.
Since you only have to remember the previous two terms to calculate the current one, you will not face any memory problem if you use a non-recursive procedure:
public static long Fibon(long num)
{
    if (num == 1) { return 1; }
    else if (num == 2) { return 1; }
    long grandfather = 1;
    long father = 1;
    long result = 1;
    for (long i = 3; i <= num; i++)
    {
        result = father + grandfather;
        grandfather = father;
        father = result;
    }
    return result;
}
For the nth Fibonacci term, the amount of stack memory needed by your function is O(n), i.e., linear in the index of the term in the Fibonacci sequence. More precisely, it will be n-1 times the amount of memory needed for each recursive call, which is implementation-dependent (plus some constant).
The amount of memory needed is the amount used by each recursive call times the depth of the recursion. In each recursive call you either terminate or make two new calls, one on the argument n-1 and one on the argument n-2; the chain of pending calls clearly cannot grow deeper than n-1 levels.
If you imagine the whole process as a binary tree with nodes labeled f(k), where the node f(k) has a left child labeled f(k-1) and a right child labeled f(k-2), then the space complexity of f corresponds to the depth of the execution tree.
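In practice, that means you can size the thread stacks yourself instead of relying on the default (typically 1 MB): estimate the depth as n - 1, multiply by an assumed per-frame cost, and hand the result to the Thread constructor that takes a maximum stack size. A sketch; the 128 bytes per frame is an assumption, not a measured value:
// Run Fibon(n) on a thread with an explicitly sized stack.
const int assumedBytesPerFrame = 128;            // guess; the real cost is runtime/JIT dependent
long n = 40;
long depth = n - 1;                              // maximum recursion depth of Fibon(n)
int stackSize = (int)Math.Max(depth * assumedBytesPerFrame, 256 * 1024); // keep a sane floor

var worker = new System.Threading.Thread(() => Console.WriteLine(Fibon(n)), stackSize);
worker.Start();
worker.Join();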
I believe the number of longs needed is actually equal to the returned long.
To return 2, you need to add 2 longs. To return 3, you need to add the number of longs needed to return 2 (which is 2 longs) to 1 which == 3. The pattern continues.
Since a long is 64 bits, the memory needed is equal to the fibonacci value * 64 bits.
I want to perform a double threshold on a volume, using a GPU kernel. I send my volume, per slice, as read_only image2d_t. My output volume is a binary volume, where each bit specifies if its related voxel is enabled or disabled. My kernel checks if the current pixel value is within the lower/upper threshold range, and enables its corresponding bit in the binary volume.
For debugging purposes, I left the actual check commented for now. I simply use the passed slice nr to determine if the binary volume bit should be on or off. The first 14 slices are set to "on", the rest to "off". I have also verified this code on the CPU side, the code I pasted at the bottom of this post. The code shows both paths, the CPU being commented now.
The CPU code works as intended, the following image is returned after rendering the volume with the binary mask applied:
Running the exact same logic using my GPU kernel returns incorrect results (1st 3D, 2nd slice view):
What goes wrong here? I read that OpenCL does not support bit fields, but it does support bitwise operators as far as I could understand from the OpenCL specs. My bit logic, which selects the right bit from the 32-bit word and flips it, is supported, right? Or is my simple flag considered a bit field? What it does is select bit voxel % 32, counted from the left (not the right, hence the subtraction).
Another thing could be that the uint pointer passed to my kernel is different from what I expect. I assumed this would be valid use of pointers and passing data to my kernel. The logic applied to the "uint* word" part in the kernel is due to padding words per row, and paddings rows per slice. The CPU variant confirmed that the pointer calculation logic is valid though.
Below is the code:
uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels);
uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels);
int[] dims = new int[3];
dims[0] = volume.Geometry.NumberOfVoxels.X;
dims[1] = volume.Geometry.NumberOfVoxels.Y;
dims[2] = volume.Geometry.NumberOfVoxels.Z;
uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[];
unsafe {
fixed(int* dimPtr = dims) {
fixed(uint *arrcPtr = arrC) {
// pick Cloo Platform
ComputePlatform platform = ComputePlatform.Platforms[0];
// create context with all gpu devices
ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
new ComputeContextPropertyList(platform), null, IntPtr.Zero);
// load opencl source
StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl");
string clSource = streamReader.ReadToEnd();
streamReader.Close();
// create program with opencl source
ComputeProgram program = new ComputeProgram(context, clSource);
// compile opencl source
program.Build(null, null, null, IntPtr.Zero);
// Create the event wait list. An event list is not really needed for this example but it is important to see how it works.
// Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution.
// For this reason their use should be avoided if possible.
ComputeEventList eventList = new ComputeEventList();
// Create the command queue. This is used to control kernel execution and manage read/write/copy operations.
ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
// Create the kernel function and set its arguments.
ComputeKernel kernel = program.CreateKernel("LowerThreshold");
int slicenr = 0;
foreach (IntPtr ptr in pinnedSlices) {
/*// CPU VARIANT FOR TESTING PURPOSES
for (int y = 0; y < dims[1]; y++) {
for (int x = 0; x < dims[0]; x++) {
long pixelOffset = x + y * dims[0];
ushort* ushortPtr = (ushort*)ptr;
ushort pixel = *(ushortPtr + pixelOffset);
int BinaryWordShift = 5;
int BinaryWordBits = 32;
if (
(0 <= x) &&
(0 <= y) &&
(0 <= slicenr) &&
(x < dims[0]) &&
(y < dims[1]) &&
(slicenr < dims[2])
) {
uint* word =
arrcPtr + 1 + (slicenr * wordsPerPlane) +
(y * wordsPerRow) +
(x >> BinaryWordShift);
uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f)));
//if (pixel > lowerThreshold && pixel < upperThreshold) {
if (slicenr < 15) {
*word |= mask;
} else {
*word &= ~mask;
}
}
}
}*/
ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
3,
new IntPtr(dimPtr));
ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16);
ComputeImage2D image2D = new ComputeImage2D(
context,
ComputeMemoryFlags.ReadOnly,
format,
volume.Geometry.NumberOfVoxels.X,
volume.Geometry.NumberOfVoxels.Y,
0,
ptr
);
// The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length).
ComputeBuffer<uint> c = new ComputeBuffer<uint>(
context, ComputeMemoryFlags.WriteOnly, arrC.Length);
kernel.SetMemoryArgument(0, image2D);
kernel.SetMemoryArgument(1, dimsBuffer);
kernel.SetValueArgument(2, wordsPerRow);
kernel.SetValueArgument(3, wordsPerPlane);
kernel.SetValueArgument(4, slicenr);
kernel.SetValueArgument(5, lowerThreshold);
kernel.SetValueArgument(6, upperThreshold);
kernel.SetMemoryArgument(7, c);
// Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command.
// If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created.
commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList);
// Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer
// will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete
// by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host.
// eventList will contain two events after this method returns.
commands.ReadFromBuffer(c, ref arrC, false, eventList);
// A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands
// in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands
// to finish has to be issued before "arrC" can be used.
// This explicit synchronization can be achieved in two ways:
// 1) Wait for the events in the list to finish,
//eventList.Wait();
//}
// 2) Or simply use
commands.Finish();
slicenr++;
}
}
}
}
And my kernel code:
const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP | CLK_NORMALIZED_COORDS_FALSE;
kernel void LowerThreshold(
read_only image2d_t image,
global int* brickSize,
uint wordsPerRow,
uint wordsPerPlane,
int slicenr,
int lower,
int upper,
global write_only uint* c )
{
int4 coord = (int4)(get_global_id(0),get_global_id(1),slicenr,1);
uint4 pixel = read_imageui(image, smp, coord.xy);
uchar BinaryWordShift = 5;
int BinaryWordBits = 32;
if (
(0 <= coord.x) &&
(0 <= coord.y) &&
(0 <= coord.z) &&
(coord.x < brickSize[0]) &&
(coord.y < brickSize[1]) &&
(coord.z < brickSize[2])
) {
global uint* word =
c + 1 + (coord.z * wordsPerPlane) +
(coord.y * wordsPerRow) +
(coord.x >> BinaryWordShift);
uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f)));
//if (pixel.w > lower && pixel.w < upper) {
if (slicenr < 15) {
*word |= mask;
} else {
*word &= ~mask;
}
}
}
Two issues:
You've declared "c" as "write_only" yet use the "|=" and "&=" operators, which are read-modify-write
As the other posters mentioned, if two work items are accessing the same word, there are race conditions between the read-modify-write that will cause errors. Atomic operations are much slower than non-atomic operations, so while possible, not recommended.
I'd recommend making your output 8x larger and using bytes rather than bits. This would make your output write-only and would also remove contention and therefore race conditions.
Or (if data compactness or format is important) process 8 elements at a time per work item, and write the composite 8-bit output as a single byte. This would be write-only, with no contention, and would still have your data compactness.
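If you still need the packed bitmask on the host, one way to follow that advice (a sketch of mine that mirrors the indexing math from the question's CPU variant, not a drop-in patch) is to let the kernel write one byte per voxel into a write-only buffer and do the bit-packing on the CPU after ReadFromBuffer, where there is no contention:
// One slice's worth of byte-per-voxel output, filled by the kernel and read back
// with ReadFromBuffer into voxelBytes (allocation of the ComputeBuffer<byte> omitted).
byte[] voxelBytes = new byte[dims[0] * dims[1]];
int BinaryWordShift = 5;
int BinaryWordBits = 32;
for (int y = 0; y < dims[1]; y++) {
    for (int x = 0; x < dims[0]; x++) {
        uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (x & 0x1f)));
        long wordIndex = 1 + (slicenr * wordsPerPlane) + (y * wordsPerRow) + (x >> BinaryWordShift);
        if (voxelBytes[x + y * dims[0]] != 0)
            arrC[wordIndex] |= mask;   // set the voxel's bit
        else
            arrC[wordIndex] &= ~mask;  // clear the voxel's bit
    }
}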