I am implementing a library where I am extensively using the .NET BitArray class and need an equivalent to the Java BitSet.cardinality() method, i.e. a method which returns the number of bits set. I was thinking of implementing it as an extension method for the BitArray class. The trivial implementation is to iterate and count the bits set (like below), but I wanted a faster implementation as I would be performing thousands of set operations and counting the answer. Is there a faster way than the example below?
int count = 0;
for (int i = 0; i < mybitarray.Length; i++)
{
    if (mybitarray[i])
        count++;
}
This is my solution based on the "best bit counting method" from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
public static Int32 GetCardinality(BitArray bitArray)
{
    Int32[] ints = new Int32[(bitArray.Count >> 5) + 1];
    bitArray.CopyTo(ints, 0);
    Int32 count = 0;

    // fix for not truncated bits in last integer that may have been set to true with SetAll()
    ints[ints.Length - 1] &= ~(-1 << (bitArray.Count % 32));

    for (Int32 i = 0; i < ints.Length; i++)
    {
        Int32 c = ints[i];

        // magic (http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
        unchecked
        {
            c = c - ((c >> 1) & 0x55555555);
            c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
            c = ((c + (c >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
        }

        count += c;
    }
    return count;
}
According to my tests, this is around 60 times faster than the simple foreach loop and still 30 times faster than the Kernighan approach with around 50% bits set to true in a BitArray with 1000 bits. I also have a VB version of this if needed.
You can accomplish this pretty easily with LINQ:
BitArray ba = new BitArray(new[] { true, false, true, false, false });
var numOnes = (from bool m in ba
               where m
               select m).Count();
BitArray myBitArray = new BitArray(...

int
    bits = myBitArray.Count,
    size = ((bits - 1) >> 3) + 1,
    counter = 0,
    x,
    c;

byte[] buffer = new byte[size];
myBitArray.CopyTo(buffer, 0);

for (x = 0; x < size; x++)
    for (c = 0; buffer[x] > 0; buffer[x] >>= 1)
        counter += buffer[x] & 1;
Taken from "Counting bits set, Brian Kernighan's way" and adapted for bytes. I'm using it for bit arrays of 1 000 000+ bits and it's superb.
If your number of bits is not a multiple of 8, you can handle the partial last byte manually.
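For example, a small sketch of handling that partial byte (my own addition, reusing the bits, size and buffer variables from the code above; depending on the runtime, the unused high bits of the last byte can contain stale values, e.g. after SetAll(true), so it is safest to mask them off before the counting loop):

int tailBits = bits % 8;
if (tailBits != 0)
{
    // keep only the valid low bits of the final byte
    buffer[size - 1] &= (byte)((1 << tailBits) - 1);
}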
I had the same issue, but had more than just the one Cardinality method to convert. So, I opted to port the entire BitSet class. Fortunately it was self-contained.
Here is the Gist of the C# port.
I would appreciate if people would report any bugs that are found - I am not a Java developer, and have limited experience with bit logic, so I might have translated some of it incorrectly.
A faster and simpler version than the accepted answer, thanks to System.Numerics.BitOperations.PopCount:
C#
// requires: using System.Numerics;
Int32[] ints = new Int32[(bitArray.Count >> 5) + 1];
bitArray.CopyTo(ints, 0);
Int32 count = 0;
for (Int32 i = 0; i < ints.Length; i++) {
    count += BitOperations.PopCount((uint)ints[i]);
}
Console.WriteLine(count);
F#
let ints = Array.create ((bitArray.Count >>> 5) + 1) 0
bitArray.CopyTo(ints, 0)
ints
|> Array.sumBy (fun i -> BitOperations.PopCount(uint i))
|> printfn "%d"
See more details in Is BitOperations.PopCount the best way to compute the BitArray cardinality in .NET?
You could use LINQ, but it would be useless and slower:
var sum = mybitarray.OfType<bool>().Count(p => p);
There is no faster way using just BitArray - what it comes down to is that you will have to count the bits. You could use LINQ for that or your own loop, but there is no method offered by BitArray, and the underlying data structure is an int[] array (as seen with Reflector) - so this will always be O(n), n being the number of bits in the array.
The only way I could think of making it faster is using reflection to get a hold of the underlying m_array field, then you can get around the boundary checks that Get() uses on every call (see below) - but this is kinda dirty, and might only be worth it on very large arrays since reflection is expensive.
public bool Get(int index)
{
    if ((index < 0) || (index >= this.Length))
    {
        throw new ArgumentOutOfRangeException("index", Environment.GetResourceString("ArgumentOutOfRange_Index"));
    }
    return ((this.m_array[index / 0x20] & (((int) 1) << (index % 0x20))) != 0);
}
If this optimization is really important to you, you should create your own class for bit manipulation that internally could use BitArray but keeps track of the number of bits set, and offers the appropriate methods (mostly delegating to BitArray, plus methods to get the number of bits currently set) - then of course this would be O(1).
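A minimal sketch of such a wrapper (the class name and shape are my own assumption, not an existing API; requires using System.Collections):

public class CountingBitArray
{
    private readonly BitArray _bits;

    public int Cardinality { get; private set; }   // O(1) read

    public CountingBitArray(int length)
    {
        _bits = new BitArray(length);
    }

    public bool this[int index]
    {
        get { return _bits[index]; }
        set
        {
            // adjust the running count only when the bit actually changes
            if (_bits[index] != value)
            {
                Cardinality += value ? 1 : -1;
                _bits[index] = value;
            }
        }
    }
}

Set-wide operations such as And/Or/SetAll would either have to recount or update the running total as they go, which is the trade-off this approach makes.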
If you really want to maximize the speed, you could pre-compute a lookup table where given a byte-value you have the cardinality, but BitArray is not the most ideal structure for this, since you'd need to use reflection to pull the underlying storage out of it and operate on the integral types - see this question for a better explanation of that technique.
Another, perhaps more useful technique, is to use something like the Kernighan trick, which is O(m) for an n-bit value of cardinality m.
Because there are no operations defined between integer types and BitArrays (there is no tmp &= tmp - 1 to clear the least significant set bit, as you would write in C), the bits first have to be copied into an int[] and the trick applied word by word:

public static int GetCardinality(this BitArray bits)
{
    int[] words = new int[(bits.Count >> 5) + 1];
    bits.CopyTo(words, 0);
    // mask off any stale bits beyond Count in the last word (see the accepted answer)
    words[words.Length - 1] &= ~(-1 << (bits.Count % 32));
    int c = 0;
    foreach (int word in words)
    {
        int v = word;
        while (v != 0)
        {
            v &= v - 1;   // clears the least significant set bit
            c++;
        }
    }
    return c;
}

This is a bit more cumbersome than it would be in, say, C, but the inner loop runs only once per set bit. I have no idea if this ends up being any faster than naively iterating for the case of the BCL BitArray, but algorithmically speaking it should be superior.
EDIT: cited where I discovered the Kernighan trick, with a more in-depth explanation
If you don't mind copying the code of System.Collections.BitArray into your project and editing it, you can write something like the following:
(I think it's the fastest. I also tried using a BitVector32[] to implement my BitArray, but it was still slow.)
public void Set(int index, bool value)
{
    if ((index < 0) || (index >= this.m_length))
    {
        throw new ArgumentOutOfRangeException("index", "Index Out Of Range");
    }
    SetWithOutAuth(index, value);
}

// When setting values in a batch, we need a method that doesn't validate the index range.
private void SetWithOutAuth(int index, bool value)
{
    int v = ((int)1) << (index % 0x20);
    index = index / 0x20;
    bool NotSet = (this.m_array[index] & v) == 0;
    if (value && NotSet)
    {
        CountOfTrue++; // count the true values
        this.m_array[index] |= v;
    }
    else if (!value && !NotSet)
    {
        CountOfTrue--; // count the true values
        this.m_array[index] &= ~v;
    }
    else
        return;
    this._version++;
}

public int CountOfTrue { get; internal set; }

public void BatchSet(int start, int length, bool value)
{
    if (start < 0 || start >= this.m_length || length <= 0)
        return;
    for (int i = start; i < start + length && i < this.m_length; i++)
    {
        SetWithOutAuth(i, value);
    }
}
I wrote my own version after not finding one that uses a look-up table:
private int[] _bitCountLookup;

private void InitLookupTable()
{
    _bitCountLookup = new int[256];
    for (var byteValue = 0; byteValue < 256; byteValue++)
    {
        var count = 0;
        for (var bitIndex = 0; bitIndex < 8; bitIndex++)
        {
            count += (byteValue >> bitIndex) & 1;
        }
        _bitCountLookup[byteValue] = count;
    }
}

private int CountSetBits(BitArray bitArray)
{
    var result = 0;
    var numberOfFullBytes = bitArray.Length / 8;
    var numberOfTailBits = bitArray.Length % 8;
    var tailByte = numberOfTailBits > 0 ? 1 : 0;
    var bitArrayInBytes = new byte[numberOfFullBytes + tailByte];

    bitArray.CopyTo(bitArrayInBytes, 0);

    for (var i = 0; i < numberOfFullBytes; i++)
    {
        result += _bitCountLookup[bitArrayInBytes[i]];
    }

    for (var i = (numberOfFullBytes * 8); i < bitArray.Length; i++)
    {
        if (bitArray[i])
        {
            result++;
        }
    }
    return result;
}
The problem is naturally O(n); as a result, your solution is probably the most efficient.
Since you are trying to count an arbitrary subset of bits, you cannot count the bits as they are set (which would provide a speed boost if you are not setting the bits too often).
You could check to see if the processor you are using has an instruction which will return the number of set bits. For example, a processor with SSE4 could use the POPCNT instruction, according to this post. This would probably not work for you since .NET does not allow inline assembly (because it is platform independent). Also, ARM processors probably do not have an equivalent.
Probably the best solution would be a look-up table (or a switch, if you could guarantee the switch will be compiled to a single jump to currentLocation + byteValue). This would give you the count for the whole byte. Of course BitArray does not give access to the underlying data type, so you would have to make your own BitArray. You would also have to guarantee that all the bits in the byte will always be part of the intersection, which does not sound likely.
Another option would be to use an array of booleans instead of a BitArray. This has the advantage of not needing to extract the bit from the others in the byte. The disadvantage is that the array will take up 8x as much space in memory, meaning not only wasted space but also more data to push through as you iterate over the array to perform your count.
The difference between a standard array look up and a BitArray look up is as follows:
Array:
1. offset = index * indexSize
2. Get memory at location + offset and save to value

BitArray:
1. index = index / indexSize
2. offset = index * indexSize
3. Get memory at location + offset and save to value
4. position = index % indexSize
5. Shift value by position bits
6. value = value and 1
With the exception of #2 for Arrays and #3 most of these commands take 1 processor cycle to complete. Some of the commands can be combined into 1 command using x86/x64 processors, though probably not with ARM since it uses a reduced set of instructions.
Which of the two (array or BitArray) perform better will be specific to your platform (processor speed, processor instructions, processor cache sizes, processor cache speed, amount of system memory (Ram), speed of system memory (CAS), speed of connection between processor and RAM) as well as the spread of indexes you want to count (are the intersections most often clustered or are they randomly distributed).
To summarize: you could probably find a way to make it faster, but your solution is the fastest you will get for your data set using a bit per boolean model in .NET.
Edit: make sure you are accessing the indexes you want to count in order. If you access indexes 200, 5, 150, 151, 311, 6 in that order then you will increase the amount of cache misses resulting in more time spent waiting for values to be retrieved from RAM.
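A tiny sketch of that advice (my own illustration; the index list and BitArray here are made up, and it assumes using System and using System.Collections):

// Query positions in ascending order so reads through the underlying int[] stay sequential.
BitArray bits = new BitArray(1024);
int[] indexesToCount = { 200, 5, 150, 151, 311, 6 };
Array.Sort(indexesToCount);
int hits = 0;
foreach (int idx in indexesToCount)
    if (bits[idx]) hits++;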
I was trying to create my own factorial function when I found that the calculation is twice as fast if it is calculated in pairs. Like this:
Groups of 1: 2*3*4 ... 50000*50001 = 4.1 seconds
Groups of 2: (2*3)*(4*5)*(6*7) ... (50000*50001) = 2.0 seconds
Groups of 3: (2*3*4)*(5*6*7) ... (49999*50000*50001) = 4.8 seconds
Here is the C# I used to test this.
Stopwatch timer = new Stopwatch();
timer.Start();

// Separate the calculation into groups of this size.
int k = 2;

BigInteger total = 1;

// Iterates from 2 to 50002, but instead of incrementing 'i' by one, it increments it 'k' times,
// and the inner loop calculates the product of 'i' to 'i+k', and multiplies 'total' by that result.
for (var i = 2; i < 50000 + 2; i += k)
{
    BigInteger partialTotal = 1;
    for (var j = 0; j < k; j++)
    {
        // Stops if it exceeds 50000.
        if (i + j >= 50000) break;
        partialTotal *= i + j;
    }
    total *= partialTotal;
}

Console.WriteLine(timer.ElapsedMilliseconds / 1000.0 + "s");
I tested this at different levels and put the average times over a few tests in a bar graph. I expected it to become more efficient as I increased the number of groups, but 3 was the least efficient and 4 had no improvement over groups of 1.
Link to First Data
Link to Second Data
What causes this difference, and is there an optimal way to calculate this?
BigInteger has a fast case for numbers of 31 bits or less. When you do a pairwise multiplication, this means a specific fast-path is taken, that multiplies the values into a single ulong and sets the value more explicitly:
public void Mul(ref BigIntegerBuilder reg1, ref BigIntegerBuilder reg2) {
...
if (reg1._iuLast == 0) {
if (reg2._iuLast == 0)
Set((ulong)reg1._uSmall * reg2._uSmall);
else {
...
}
}
else if (reg2._iuLast == 0) {
...
}
else {
...
}
}
public void Set(ulong uu) {
uint uHi = NumericsHelpers.GetHi(uu);
if (uHi == 0) {
_uSmall = NumericsHelpers.GetLo(uu);
_iuLast = 0;
}
else {
SetSizeLazy(2);
_rgu[0] = (uint)uu;
_rgu[1] = uHi;
}
AssertValid(true);
}
A 100% predictable branch like this is perfect for a JIT, and this fast-path should get optimized extremely well. It's possible that _rgu[0] and _rgu[1] are even inlined. This is extremely cheap, so effectively cuts down the number of real operations by a factor of two.
So why is a group of three so much slower? It's obvious that it should be slower than for k = 2; you have far fewer optimized multiplications. More interesting is why it's slower than k = 1. This is easily explained by the fact that the outer multiplication of total now hits the slow path. For k = 2 this impact is mitigated by halving the number of multiplies and the potential inlining of the array.
However, these factors do not help k = 3, and in fact the slow case hurts k = 3 a lot more. The second multiplication in the k = 3 case hits this case
if (reg1._iuLast == 0) {
...
}
else if (reg2._iuLast == 0) {
Load(ref reg1, 1);
Mul(reg2._uSmall);
}
else {
...
}
which allocates
EnsureWritable(1);
uint uCarry = 0;
for (int iu = 0; iu <= _iuLast; iu++)
uCarry = MulCarry(ref _rgu[iu], u, uCarry);
if (uCarry != 0) {
SetSizeKeep(_iuLast + 2, 0);
_rgu[_iuLast] = uCarry;
}
Why does this matter? Well, EnsureWritable(1) causes
uint[] rgu = new uint[_iuLast + 1 + cuExtra];
so rgu becomes length 3. The number of passes in total's code is decided in
public void Mul(ref BigIntegerBuilder reg1, ref BigIntegerBuilder reg2)
as
for (int iu1 = 0; iu1 < cu1; iu1++) {
...
for (int iu2 = 0; iu2 < cu2; iu2++, iuRes++)
uCarry = AddMulCarry(ref _rgu[iuRes], uCur, rgu2[iu2], uCarry);
...
}
which means that we have a total of len(total._rgu) * 3 operations. This hasn't saved us anything! There are only len(total._rgu) * 1 passes for k = 1 - we just do it three times!
There is actually an optimization on the outer loop that reduces this back down to len(total._rgu) * 2:
uint uCur = rgu1[iu1];
if (uCur == 0)
continue;
However, they "optimize" this optimization in a way that hurts far more than before:
if (reg1.CuNonZero <= reg2.CuNonZero) {
rgu1 = reg1._rgu; cu1 = reg1._iuLast + 1;
rgu2 = reg2._rgu; cu2 = reg2._iuLast + 1;
}
else {
rgu1 = reg2._rgu; cu1 = reg2._iuLast + 1;
rgu2 = reg1._rgu; cu2 = reg1._iuLast + 1;
}
For k = 2, that causes the outer loop to be over total, since reg2 contains no zero values with high probability. This is great because total is way longer than partialTotal, so the fewer passes the better. For k = 3, the EnsureWritable(1) will always cause a spare space because the multiplication of three numbers no more than 15 bits long can never exceed 64 bits. This means that, although we still only do one pass over total for k = 2, we do two for k = 3!
This starts to explain why the speed increases again beyond k = 3: the number of passes per addition increases slower than the number of additions decreases, as you're only adding ~15 bits to the inner value each time. The inner multiplications are fast relative to the massive total multiplications, so the more time spent consolidating values, the more time saved in passes over total. Further, the optimization is less frequently a pessimism.
It also explains why odd values take longer: they add an extra 32-bit integer to the _rgu array. This won't happen so cleanly if the ~15 bits wasn't so close to half of 32.
It's worth noting that there are a lot of ways to improve this code; the comments here are about why, not how to fix it. The easiest improvement would be to chuck the values in a heap and multiply only the two smallest values at a time.
The time required to do a BigInteger multiplication depends on the size of the product.
Both methods take the same number of multiplications, but if you multiply the factors in pairs, then the average size of the product is much smaller than it is if you multiply each factor with the product of all the smaller ones.
You can do even better if you always multiply the two smallest factors (original factors or intermediate products) that have yet to be multiplied together, until you get to the complete product.
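A minimal sketch of that smallest-two-first idea (my own illustration, not from the answer; it assumes .NET 6+ for PriorityQueue):

using System.Collections.Generic;
using System.Numerics;

static BigInteger Factorial(int n)
{
    // Keep all pending values in a min-heap and always multiply the two smallest,
    // so intermediate products stay roughly the same size.
    var heap = new PriorityQueue<BigInteger, BigInteger>();
    for (int i = 2; i <= n; i++)
        heap.Enqueue(i, i);
    if (heap.Count == 0)
        return 1;
    while (heap.Count > 1)
    {
        BigInteger a = heap.Dequeue();
        BigInteger b = heap.Dequeue();
        BigInteger p = a * b;
        heap.Enqueue(p, p);
    }
    return heap.Dequeue();
}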
I think you have a bug ('+' instead of '*').
partialTotal *= i + j;
Good to check that you are getting the right answer, not just interesting performance metrics.
But I'm curious what motivated you to try this. If you do find a difference, I would expect it would have to do with optimalities in register and/or memory allocation. And I would expect it would be 0-30% or something like that, not 50%.
I'm trying to calculate: if we generated every possible combination of numbers from 0 to (c-1), with a length of x, what set would occur at point i?
For example:
c = 4
x = 4
i = 3
Would yield:
[0000]
[0001]
[0002]
[0003] <- i
[0010]
....
[3333]
This is very nearly the same problem as in the related question Logic to select a specific set from Cartesian set. However, because x and i are large enough to require the use of BigInteger objects, the code has to be changed to take a BigInteger and return a List<int> instead of a string array:
int PossibleNumbers;

public List<int> Get(BigInteger Address)
{
    List<int> values = new List<int>();
    BigInteger sizes = new BigInteger(1);
    for (int j = 0; j < PixelArrayLength; j++)
    {
        BigInteger index = BigInteger.Divide(Address, sizes);
        index = (index % PossibleNumbers);
        values.Add((int)index);
        sizes *= PossibleNumbers;
    }
    return values;
}
This seems to behave as I'd expect, however, when I start using values like this:
c = 66000
x = 950000
i = (66000^950000)/2
So here, I'm looking for the ith value in the cartesian set of 0 to (c-1) of length 950000, or put another way, the halfway point.
At this point, I just get a list of zeroes returned. How can I solve this problem?
Notes: It's quite a specific problem, and I apologise for the wall-of-text, I do hope it's not too much, I was just hoping to properly explain what I meant. Thanks to you all!
Edit: Here are some more examples: http://pastebin.com/zmSDQEGC
Here is a generic base converter... it takes a decimal for the base10 value to convert into your newBase and returns an array of ints. If you need a BigInteger, this method works perfectly well with just changing the base10Value to BigInteger.
EDIT: Converted method to BigInteger since that's what you need.
EDIT 2: Thanks phoog for pointing out BigInteger is base2 so changing the method signature.
public static int[] ConvertToBase(BigInteger value, int newBase, int length)
{
    var result = new Stack<int>();
    while (value > 0)
    {
        result.Push((int)(value % newBase));
        if (value < newBase)
            value = 0;
        else
            value = value / newBase;
    }
    for (var i = result.Count; i < length; i++)
        result.Push(0);
    return result.ToArray();
}
usage...
int[] a = ConvertToBase(13, 4, 4);      // [0, 0, 3, 1]
int[] b = ConvertToBase(0, 4, 4);       // [0, 0, 0, 0]
int[] c = ConvertToBase(1234, 12, 4);   // [0, 8, 6, 10]
However the problem you specifically state is a bit large to test it on. :)
Just calculating 66000 ^ 950000 / 2 is a good bit of work as Phoog mentioned. Unless of course you meant ^ to be the XOR operator. In which case it's quite fast.
EDIT: From the comments... The largest base10 number that can be represented given a particular newBase and length is...
var largestBase10 = BigInteger.Pow(newBase, length)-1;
The first expression of the problem boils down to "write 3 as a 4-digit base-4 number". So, if the problem is "write i as an x-digit base-c number", or, in this case, "write (66000^950000)/2 as a 950000-digit base 66000 number", then does that make it easier?
If you're specifically looking for the halfway point of the cartesian product, it's not so hard. If you assume that c is even, then the most significant digit is c / 2, and the rest of the digits are zero. If your return value is all zeros, then you may have an off-by-one error, or the like, since actually only one digit is incorrect.
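As a quick illustration of that (my own example, reusing the ConvertToBase helper from the answer above and System.Numerics.BigInteger):

// Halfway point of the question's small example (c = 4, x = 4):
// (4^4) / 2 = 128, which is 2000 in base 4.
var half = BigInteger.Pow(4, 4) / 2;
int[] digits = ConvertToBase(half, 4, 4);   // [2, 0, 0, 0]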
I want to perform a double threshold on a volume, using a GPU kernel. I send my volume, per slice, as read_only image2d_t. My output volume is a binary volume, where each bit specifies if its related voxel is enabled or disabled. My kernel checks if the current pixel value is within the lower/upper threshold range, and enables its corresponding bit in the binary volume.
For debugging purposes, I left the actual check commented out for now. I simply use the passed slice nr to determine if the binary volume bit should be on or off. The first 14 slices are set to "on", the rest to "off". I have also verified this code on the CPU side; that code is pasted at the bottom of this post. The code shows both paths, with the CPU path commented out now.
The CPU code works as intended, the following image is returned after rendering the volume with the binary mask applied:
Running the exact same logic using my GPU kernel returns incorrect results (1st 3D, 2nd slice view):
What goes wrong here? I read that OpenCL does not support bit fields, but it does support bitwise operators as far as I could understand from the OpenCL specs. My bit logic, which selects the right bit from the 32-bit word and flips it, is supported, right? Or is my simple flag considered a bit field? What it does is select the voxel%32 bit from the left (not the right, hence the subtract).
Another thing could be that the uint pointer passed to my kernel is different from what I expect. I assumed this would be valid use of pointers and passing data to my kernel. The logic applied to the "uint* word" part in the kernel is due to padding words per row, and paddings rows per slice. The CPU variant confirmed that the pointer calculation logic is valid though.
Below is the code:
uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels);
uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels);
int[] dims = new int[3];
dims[0] = volume.Geometry.NumberOfVoxels.X;
dims[1] = volume.Geometry.NumberOfVoxels.Y;
dims[2] = volume.Geometry.NumberOfVoxels.Z;
uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[];
unsafe {
fixed(int* dimPtr = dims) {
fixed(uint *arrcPtr = arrC) {
// pick Cloo Platform
ComputePlatform platform = ComputePlatform.Platforms[0];
// create context with all gpu devices
ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
new ComputeContextPropertyList(platform), null, IntPtr.Zero);
// load opencl source
StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl");
string clSource = streamReader.ReadToEnd();
streamReader.Close();
// create program with opencl source
ComputeProgram program = new ComputeProgram(context, clSource);
// compile opencl source
program.Build(null, null, null, IntPtr.Zero);
// Create the event wait list. An event list is not really needed for this example but it is important to see how it works.
// Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution.
// For this reason their use should be avoided if possible.
ComputeEventList eventList = new ComputeEventList();
// Create the command queue. This is used to control kernel execution and manage read/write/copy operations.
ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
// Create the kernel function and set its arguments.
ComputeKernel kernel = program.CreateKernel("LowerThreshold");
int slicenr = 0;
foreach (IntPtr ptr in pinnedSlices) {
/*// CPU VARIANT FOR TESTING PURPOSES
for (int y = 0; y < dims[1]; y++) {
for (int x = 0; x < dims[0]; x++) {
long pixelOffset = x + y * dims[0];
ushort* ushortPtr = (ushort*)ptr;
ushort pixel = *(ushortPtr + pixelOffset);
int BinaryWordShift = 5;
int BinaryWordBits = 32;
if (
(0 <= x) &&
(0 <= y) &&
(0 <= slicenr) &&
(x < dims[0]) &&
(y < dims[1]) &&
(slicenr < dims[2])
) {
uint* word =
arrcPtr + 1 + (slicenr * wordsPerPlane) +
(y * wordsPerRow) +
(x >> BinaryWordShift);
uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f)));
//if (pixel > lowerThreshold && pixel < upperThreshold) {
if (slicenr < 15) {
*word |= mask;
} else {
*word &= ~mask;
}
}
}
}*/
ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
3,
new IntPtr(dimPtr));
ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16);
ComputeImage2D image2D = new ComputeImage2D(
context,
ComputeMemoryFlags.ReadOnly,
format,
volume.Geometry.NumberOfVoxels.X,
volume.Geometry.NumberOfVoxels.Y,
0,
ptr
);
// The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length).
ComputeBuffer<uint> c = new ComputeBuffer<uint>(
context, ComputeMemoryFlags.WriteOnly, arrC.Length);
kernel.SetMemoryArgument(0, image2D);
kernel.SetMemoryArgument(1, dimsBuffer);
kernel.SetValueArgument(2, wordsPerRow);
kernel.SetValueArgument(3, wordsPerPlane);
kernel.SetValueArgument(4, slicenr);
kernel.SetValueArgument(5, lowerThreshold);
kernel.SetValueArgument(6, upperThreshold);
kernel.SetMemoryArgument(7, c);
// Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command.
// If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created.
commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList);
// Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer
// will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete
// by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host.
// eventList will contain two events after this method returns.
commands.ReadFromBuffer(c, ref arrC, false, eventList);
// A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands
// in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands
// to finish has to be issued before "arrC" can be used.
// This explicit synchronization can be achieved in two ways:
// 1) Wait for the events in the list to finish,
//eventList.Wait();
//}
// 2) Or simply use
commands.Finish();
slicenr++;
}
}
}
}
And my kernel code:
const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP | CLK_NORMALIZED_COORDS_FALSE;
kernel void LowerThreshold(
read_only image2d_t image,
global int* brickSize,
uint wordsPerRow,
uint wordsPerPlane,
int slicenr,
int lower,
int upper,
global write_only uint* c )
{
int4 coord = (int4)(get_global_id(0),get_global_id(1),slicenr,1);
uint4 pixel = read_imageui(image, smp, coord.xy);
uchar BinaryWordShift = 5;
int BinaryWordBits = 32;
if (
(0 <= coord.x) &&
(0 <= coord.y) &&
(0 <= coord.z) &&
(coord.x < brickSize[0]) &&
(coord.y < brickSize[1]) &&
(coord.z < brickSize[2])
) {
global uint* word =
c + 1 + (coord.z * wordsPerPlane) +
(coord.y * wordsPerRow) +
(coord.x >> BinaryWordShift);
uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f)));
//if (pixel.w > lower && pixel.w < upper) {
if (slicenr < 15) {
*word |= mask;
} else {
*word &= ~mask;
}
}
}
Two issues:
1. You've declared "c" as "write_only" yet use the "|=" and "&=" operators, which are read-modify-write.
2. As the other posters mentioned, if two work items are accessing the same word, there are race conditions between the read-modify-write that will cause errors. Atomic operations are much slower than non-atomic operations, so while possible, not recommended.
I'd recommend making your output 8x larger and using bytes rather than bits. This would make your output write-only and would also remove contention and therefore race conditions.
Or (if data compactness or format is important) process 8 elements at a time per work item, and write the composite 8-bit output as a single byte. This would be write-only, with no contention, and would still have your data compactness.
I'm writing a C# application in which I need to search a file (could be very big) for a sequence of bytes, and I can't use any libraries to do so. So, I need a function that takes a byte array as an argument and returns the position of the byte following the given sequence. The function doesn't have to be fast, it simply has to work. Any help would be greatly appreciated :)
If it doesn't have to be fast you could use this:
int GetPositionAfterMatch(byte[] data, byte[] pattern)
{
    for (int i = 0; i <= data.Length - pattern.Length; i++)
    {
        bool match = true;
        for (int k = 0; k < pattern.Length; k++)
        {
            if (data[i + k] != pattern[k])
            {
                match = false;
                break;
            }
        }
        if (match)
        {
            return i + pattern.Length;
        }
    }
    return -1; // no match found
}
But I really would recommend you use the Knuth-Morris-Pratt algorithm; it's the algorithm mostly used as the base of IndexOf methods for strings. The algorithm above will perform really slowly, except for small arrays and small patterns.
The straight-forward approach as pointed out by Turrau works, and for your purposes is probably good enough, since you say it doesn't have to be fast - especially since for most practical purposes the algorithm is much faster than the worst case O(n*m). (Depending on your pattern I guess).
For an optimal solution you can also check out the Knuth-Morris-Pratt algorithm, which makes use of partial matches which in the end is O(n+m).
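For reference, here is a minimal C# sketch of the KMP idea (my own illustration, not code from either answer); as the question asks, it returns the position of the byte following the match, or -1 if there is none:

static int IndexAfterMatchKmp(byte[] data, byte[] pattern)
{
    if (pattern.Length == 0) return 0;
    // Build the failure table: length of the longest proper prefix that is also a suffix.
    int[] fail = new int[pattern.Length];
    for (int i = 1, len = 0; i < pattern.Length; )
    {
        if (pattern[i] == pattern[len]) fail[i++] = ++len;
        else if (len > 0) len = fail[len - 1];
        else fail[i++] = 0;
    }
    // Scan the data, reusing partial matches via the failure table.
    for (int i = 0, j = 0; i < data.Length; )
    {
        if (data[i] == pattern[j])
        {
            i++; j++;
            if (j == pattern.Length) return i; // position just after the match
        }
        else if (j > 0) j = fail[j - 1];
        else i++;
    }
    return -1;
}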
Here's an extract of some code I used to do a Boyer-Moore type search. It's meant to work on pcap files, so it operates record by record, but should be easy enough to modify to suit just searching a long binary file. It's sort of extracted from some test code, so I hope I got everything for you to follow along. Also look up Boyer-Moore searching on Wikipedia, since that is what it's based off of.
int[] badMatch = new int[256];
byte[] pattern; //the pattern we are searching for
//badMatch is an array of every possible byte value (defined as static later).
//we use this as a jump table to know how many characters we can skip comparison on
//so first, we prefill every possibility with the length of our search string
for (int i = 0; i < badMatch.Length; i++)
{
badMatch[i] = pattern.Length;
}
//Now we need to calculate the individual maximum jump length for each byte that appears in my search string
for (int i = 0; i < pattern.Length - 1; i++)
{
badMatch[pattern[i] & 0xff] = pattern.Length - i - 1;
}
// Place the bytes you want to run the search against in the payload variable
byte[] payload = <bytes>
// search the packet starting at offset, and try to match the last character
// if we loop, we increment by whatever our jump value is
for (i = offset + pattern.Length - 1; i < end && cont; i += badMatch[payload[i] & 0xff])
{
// if our payload character equals our search string character, continue matching counting backwards
for (j = pattern.Length - 1, k = i; (j >= 0) && (payload[k] == pattern[j]) && cont; j--)
{
k--;
}
// if we matched every character, then we have a match, add it to the packet list, and exit the search (cont = false)
if (j == -1)
{
//we MATCHED!!!
//i = end;
cont = false;
}
}
(C#, prime generator)
Here's some code a friend and I were poking around on:
public List<int> GetListToTop(int top)
{
    top++;
    List<int> result = new List<int>();
    BitArray primes = new BitArray(top / 2);
    int root = (int)Math.Sqrt(top);
    for (int i = 3, count = 3; i <= root; i += 2, count++)
    {
        int n = i - count;
        if (!primes[n])
            for (int j = n + i; j < top / 2; j += i)
            {
                primes[j] = true;
            }
    }
    if (top >= 2)
        result.Add(2);
    for (int i = 0, count = 3; i < primes.Length; i++, count++)
    {
        if (!primes[i])
        {
            int n = i + count;
            result.Add(n);
        }
    }
    return result;
}
On my dorky AMD x64 1800+ (dual core), this finds all primes below 1 billion in 34546.875 ms. The problem seems to be storing more in the bit array: trying to crank it past ~2 billion is more than the BitArray wants to store. Any ideas on how to get around that?
I would "swap" parts of the array out to disk. By that, I mean, divide your bit array into half-billion bit chunks and store them on disk.
Then have only a few chunks in memory at any one time. With C# (or any other OO language), it should be easy to encapsulate the huge array inside this chunking class.
You'll pay for it with a slower generation time but I don't see any way around that until we get larger address spaces and 128-bit compilers.
Or as an alternative approach to the one suggested by Pax, make use of the new Memory-Mapped File classes in .NET 4.0 and let the OS decide which chunks need to be in memory at any given time.
Note however that you'll want to try and optimise the algorithm to increase locality so that you do not needlessly end up swapping pages in and out of memory (trickier than this one sentence makes it sound).
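A rough sketch of what that might look like (my own illustration; the file name, capacity and bit helpers are assumptions, not part of the answer):

using System.IO;
using System.IO.MemoryMappedFiles;

// Back the sieve's bit store with a memory-mapped file so the OS decides
// which pages stay resident; capacity here is ~4 billion bits (500 MB).
long bitCount = 4_000_000_000;
long byteCount = (bitCount + 7) / 8;

using var mmf = MemoryMappedFile.CreateFromFile(
    "sieve.bin", FileMode.Create, null, byteCount);
using var view = mmf.CreateViewAccessor(0, byteCount);

void SetBit(long i)
{
    long byteIndex = i >> 3;
    byte b = view.ReadByte(byteIndex);
    view.Write(byteIndex, (byte)(b | (1 << (int)(i & 7))));
}

bool GetBit(long i)
{
    return (view.ReadByte(i >> 3) & (1 << (int)(i & 7))) != 0;
}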
Use multiple BitArrays to increase the maximum size. If a number is too great, bit-shift it and store the result in a bit array for storing bits 33-64.
BitArray second = new BitArray(int.MaxValue);

long num = 23958923589;

if (num > int.MaxValue)
{
    int shifted = (int)(num >> 32);
    second[shifted] = true;
}

long request = 0902305023;

if (request > int.MaxValue)
{
    int shifted = (int)(request >> 32);
    return second[shifted];
}
else
    return first[(int)request];
Of course it would be nice if BitArray would support size up to System.Numerics.BigInteger.
Swapping to disk will make your code really slow.
I have a 64-bit OS, and my BitArray is also limited to 32-bits.
PS: your prime number calculation looks weird; mine looks like this:
for (int i = 2; i <= number; i++)
    if (primes[i])
        for (int scalar = i + i; scalar <= number; scalar += i)
        {
            primes[scalar] = false;
            yield return scalar;
        }
The Sieve algorithm would perform better. I could determine all the 32-bit primes (about 105 million in total) for the int range in less than 4 minutes with it. Of course, returning the list of primes is a different matter, as the memory requirement there would be a little over 400 MB (1 int = 4 bytes). Using a for loop, the numbers were printed to a file and then imported into a DB for more fun :) However, for the 64-bit primes the program would need several modifications and would perhaps require distributed execution over multiple nodes. Also refer to the following links:
http://www.troubleshooters.com/codecorn/primenumbers/primenumbers.htm
http://en.wikipedia.org/wiki/Prime-counting_function