I am trying to implement a spinlock in GLSL. It will be used in the context of Voxel Cone Tracing. I moved the information that stores the lock state into a separate 3D texture which allows atomic operations. To avoid wasting memory I don't use a full integer to store the lock state but only a single bit. The problem is that without limiting the maximum number of iterations, the loop never terminates. I implemented the exact same mechanism in C#, created a lot of tasks working on shared resources, and there it works perfectly.
The book Euro-Par 2017: Parallel Processing, page 274 (can be found on Google) mentions possible caveats when using locks on SIMT devices. I think the code should bypass those caveats.
Problematic GLSL Code:
void imageAtomicRGBA8Avg(layout(RGBA8) volatile image3D image, layout(r32ui) volatile uimage3D lockImage,
ivec3 coords, vec4 value)
{
ivec3 lockCoords = coords;
uint bit = 1u << (lockCoords.z & 31); //1<<(coord.z % 32)
lockCoords.z = lockCoords.z >> 5; //Division by 32
uint oldValue = 0;
//int counter=0;
bool goOn = true;
while (goOn /*&& counter < 10000*/)
//while(true)
{
uint newValue = oldValue | bit;
uint result = imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
//Writing is allowed if we could write our value and the lock bit was not already set
if (result == oldValue && (result & bit) == 0)
{
vec4 rval = imageLoad(image, coords);
rval.rgb = (rval.rgb * rval.a); // Denormalize
vec4 curValF = rval + value; // Add
curValF.rgb /= curValF.a; // Renormalize
imageStore(image, coords, curValF);
//Release the lock and set the flag such that the loops terminate
bit = ~bit;
oldValue = 0;
while (goOn)
{
newValue = oldValue & bit;
result = imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
if (result == oldValue)
goOn = false; //break;
oldValue = result;
}
//break;
}
oldValue = result;
//++counter;
}
}
Working C# code with identical functionality
public static void Test()
{
int buffer = 0;
int[] resource = new int[2];
Action testA = delegate ()
{
for (int i = 0; i < 100000; ++i)
imageAtomicRGBA8Avg(ref buffer, 1, resource);
};
Action testB = delegate ()
{
for (int i = 0; i < 100000; ++i)
imageAtomicRGBA8Avg(ref buffer, 2, resource);
};
Task[] tA = new Task[100];
Task[] tB = new Task[100];
for (int i = 0; i < tA.Length; ++i)
{
tA[i] = new Task(testA);
tA[i].Start();
tB[i] = new Task(testB);
tB[i].Start();
}
for (int i = 0; i < tA.Length; ++i)
tA[i].Wait();
for (int i = 0; i < tB.Length; ++i)
tB[i].Wait();
}
public static void imageAtomicRGBA8Avg(ref int lockImage, int bit, int[] resource)
{
int oldValue = 0;
int counter = 0;
bool goOn = true;
while (goOn /*&& counter < 10000*/)
{
int newValue = oldValue | bit;
int result = Interlocked.CompareExchange(ref lockImage, newValue, oldValue); //imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
if (result == oldValue && (result & bit) == 0)
{
//Now we hold the lock and can write safely
resource[bit - 1]++;
bit = ~bit;
oldValue = 0;
while (goOn)
{
newValue = oldValue & bit;
result = Interlocked.CompareExchange(ref lockImage, newValue, oldValue); //imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
if (result == oldValue)
goOn = false; //break;
oldValue = result;
}
//break;
}
oldValue = result;
++counter;
}
}
The locking mechanism should work essentially the same way as the one described in OpenGL Insights, Chapter 22, "Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer" by Cyril Crassin and Simon Green. They just use integer textures to store the colors for every voxel, which I would like to avoid because it complicates mip mapping and other things.
I hope the post is understandable; I get the feeling it is already becoming too long...
Why does the GLSL implementation not terminate?
If I understand you correctly, you use lockImage as a lock: a particular value at particular coords means "only this shader instance can do the next operations" (change data in the other image at those coords). Right.
The key is imageAtomicCompSwap. We know it did the job when it was able to store the value we wanted (let's say 0 means "free" and 1 means "locked"). We know it because the returned value (the original value) is "free", i.e. the swap operation happened:
bool goOn = true;
uint oldValue = 0; //free
uint newValue = 1; //locked
//Wait for other shader instance to free the simulated lock
while ( goOn )
{
uint result = imageAtomicCompSwap(lockImage, lockCoords, oldValue, newValue);
if ( result == oldValue ) //it was free, now it's locked
{
//Just this shader instance executes next lines now.
//Other instances will find a "locked" value in 'lockImage' and will wait
...
//release our simulated lock
imageAtomicCompSwap(lockImage, lockCoords, newValue, oldValue);
goOn = false;
}
}
I think your code loops forever because you complicated your life with the bit variable and made wrong use of oldValue and newValue.
EDIT:
Because the 'z' of the lockImage is divided by 32 (roughly speaking; the exact layout doesn't matter for the argument), you try to pack 32 voxel-locks into one integer. Let's call this integer 32C.
A shader instance ("SI") may want to change its bit in 32C, lock or unlock. So you must (A)get the current value and (B)change only your bit.
Other SIs are trying to change their bits. Some with the same bit, others with different bits.
Between two calls to imageAtomicCompSwap in one SI, another SI may have changed not your bit (it's locked, no?) but other bits in the same 32C value. You don't know the current value; you only know your bit. Thus you have nothing (or an old, wrong value) to compare with in the imageAtomicCompSwap call. It likely fails to set the new value. Several SIs failing this way lead to "deadlocks" and the while-loop never ends.
You try to avoid using an old, wrong value by doing oldValue = result and trying imageAtomicCompSwap again. This is the (A)-(B) I wrote before. But between (A) and (B) another SI may still have changed the 32C value in result, ruining the idea.
IDEA:
You can use my simple approach (just 0 or 1 values in lockImage) without the bit packing. The benefit is that lockImage stays small. The cost is that all shader instances trying to update any of the 32 image coords related to one 32C value in lockImage will wait until the one who locked that value frees it.
Using another lockImage2 just to lock/unlock the 32C value for a bit update seems like too much spinning.
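For comparison, here is a minimal C# sketch of that simpler scheme, mirroring your own test harness: one lock word per protected location, only the values 0 (free) and 1 (locked), acquired and released with Interlocked. The method and parameter names are illustrative only.

using System.Threading;

// 0 = free, 1 = locked; lockWord plays the role of one texel of lockImage.
public static void ProtectedIncrement(ref int lockWord, int[] resource, int index)
{
    // Acquire: spin until we manage to swap 0 -> 1.
    while (Interlocked.CompareExchange(ref lockWord, 1, 0) != 0)
    {
        // another thread holds the lock; keep spinning
    }
    // Critical section: only one thread per lock word gets here at a time.
    resource[index]++;
    // Release: publish 0 again so the next waiter can get in.
    Interlocked.Exchange(ref lockWord, 0);
}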
I have written an article about how to implement a per-pixel mutex in a fragment shader, along with code. I think you can refer to that; you are doing something very similar to what I explain there. Here we go:
Getting Over Draw Count and Per Pixel Mutex
What is overdraw count?
On embedded hardware especially, a major cause of performance drops is overdraw: a single pixel on screen is shaded multiple times by the GPU due to the nature of the geometry or scene we are drawing. There are many tools to visualize the overdraw count.
Details about overdraw?
When we draw some vertices, those vertices are transformed to clip space and then to window coordinates. The rasterizer then maps these coordinates to pixels/fragments, and for each fragment the GPU calls the pixel shader. There can be cases where we draw multiple instances of geometry and blend them, which means drawing to the same pixel multiple times. This leads to overdraw and can degrade performance.
Strategies to avoid overdraw?
Consider Frustum culling - Do frustum culling on CPU so that objects out of cameras field of view will not be rendered.
Sort objects based on z - Draw objects from front to back; that way the z test fails for later objects and those fragments won't be written.
Enable back face culling - This avoids rendering the back faces, i.e. the faces pointing away from the camera.
If you look at point 2, for blending we render in exactly the reverse order: from back to front. We need to do this because blending happens after the z test. If a fragment fails the z test it is completely ignored even though blending is on, which gives artifacts, so we must keep the back-to-front order. That is why, with blending enabled, we get a higher overdraw count.
Why do we need a per-pixel mutex?
The GPU is parallel by nature, so pixels can be shaded in parallel and many instances of the pixel shader run at the same time. These instances may be shading, and therefore accessing, the same pixel, which can lead to synchronization issues and unwanted effects. In this application I maintain an overdraw count in an image buffer initialized to 0. The operations I do are in the following order:
Read the i-th pixel's count from the image buffer (which will be zero the first time)
Add 1 to the counter value read in step 1
Store the new counter value at the i-th pixel in the image buffer
As I said, multiple instances of the pixel shader can be working on the same pixel, and because these steps are not atomic this can corrupt the counter variable. I could have used the built-in function imageAtomicAdd(), but since I wanted to show how to implement a per-pixel mutex, I have not used it.
#version 430
layout(binding = 0,r32ui) uniform uimage2D overdraw_count;
layout(binding = 1,r32ui) uniform uimage2D image_lock;
void mutex_lock(ivec2 pos) {
uint lock_available;
do {
lock_available = imageAtomicCompSwap(image_lock, pos, 0, 1);
} while (lock_available != 0); // keep spinning until the swap returns 0, i.e. the lock was free and is now ours
}
void mutex_unlock(ivec2 pos) {
imageStore(image_lock, pos, uvec4(0));
}
out vec4 color;
void main() {
mutex_lock(ivec2(gl_FragCoord.xy));
uint count = imageLoad(overdraw_count, ivec2(gl_FragCoord.xy)).x + 1;
imageStore(overdraw_count, ivec2(gl_FragCoord.xy), uvec4(count));
mutex_unlock(ivec2(gl_FragCoord.xy));
}
Fragment_Shader.fs
About Demo.
In the demo video you can see we are rendering many teapots with blending on, so pixels with higher intensity have a higher overdraw count.
(Demo video on YouTube.)
Note: On android you can see this overdraw count in debug GPU options.
source: Per Pixel Mutex
Related
This function is taken from some image-processing example and returns a value between 0 and 255:
private static byte CalculateColorComponentBlendValue(float source, float overlay)
{
float resultValue = 0;
byte resultByte = 0;
resultValue = source + overlay;
if (resultValue > 255)
{
resultByte = 255;
}
else if (resultValue < 0)
{
resultByte = 0;
}
else
{
resultByte = (byte)resultValue;
}
return resultByte;
}
And it is called in a big loop. Can this somehow be optimized, maybe with some bit manipulation? Right now the whole loop takes 400 ms, and if I remove the calls to this function it drops to 200 ms.
What you're trying to achieve here is called clamping. It has been discussed several times before, so I'd suggest you look at those discussions rather than repeating them here. Instead, I'll pose some questions: Does your implementation need to be purely C#? Does the value need to be a float? Are other optimisation options available to you: SIMD, threads, GPU?
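For example, a branch-reduced clamp in the same shape as your function could look like the sketch below (the method name is mine, and whether it actually beats your version depends on your data and JIT, so measure it):

// Clamp (source + overlay) to [0, 255] with fewer branches; illustrative only.
private static byte CalculateColorComponentBlendValueFast(float source, float overlay)
{
    int v = (int)(source + overlay);
    v &= ~(v >> 31);                   // arithmetic-shift trick: negative values become 0
    return (byte)(v > 255 ? 255 : v);  // upper clamp
}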
I'm trying to add transposition tables to my alpha-beta scout. I do see an incremental speed boost, I think, toward the mid or late game; however, even with a table size of 1-2 GB, it may or may not be slower than not reading from the transposition table at all. I'm also noticing some less-than-efficient moves compared to playing the exact same game without the tables.
I tested my Zobrist key hashing, and they come out properly even after making and undoing moves. I don't believe they are the issue. I tried to follow the advice of these articles in designing the alpha/beta pruning. http://web.archive.org/web/20070809015843/http://www.seanet.com/~brucemo/topics/hashing.htm http://mediocrechess.blogspot.com/2007/01/guide-transposition-tables.html
Can anyone help me identify a mistake? Perhaps I'm not understanding the evaluation of checking alpha vs beta from the hash. Or is 1-2GB too small to make a difference? I can post more of the Transposition table code if need be.
// !!!! With or without this specific section, and any other Transpose.Insert, doesn't make the game play or evaluate any faster.
HashType type = HashType.AlphaPrune;
HashEntry h = Transpose.GetInstance().Get(board.zobristKey);
if (h != null)
{
if (h.depth >= depth)
{
if (h.flag == HashType.ExactPrune)
{
return h.scored;
}
if (h.flag == HashType.BetaPrune)
{
if(h.scored < beta)
{
beta = h.scored;
}
}
if (h.flag == HashType.AlphaPrune)
{
if(h.scored > alpha)
{
alpha = h.scored;
}
}
if (alpha >= beta)
{
return alpha;
}
}
}
if (board.terminal)
{
int scoredState = board.Evaluate(color);
Table.GetInstance().Add(board.zobristKey, depth, Entry.EXACT, scoredState);
return scoredState;
}
//May do quiescence search here if necessary && depth == 0
Stack movesGenerated = GeneratePossibleMoves();
while(!movesGenerated.isEmpty())
{
int scoredState = MAXNEGASCOUT;
board.MakeMove(movesGenerated.pop());
int newAlpha = -(alpha + 1);
scoredState = -alphaBetaScout(board, depth - 1, newAlpha, -alpha, !color, quiscence);
if (scoredState < beta && alpha < scoredState)
{
scoredState = -alphaBetaScout(board, depth - 1, -beta, -scoredState, !color, quiscence);
}
board.UndoMove();
if (scoredState >= beta)
{
Table.GetInstance().Add(board.zobristKey, depth, Entry.BETA, beta);
return scoredState;
}
if (scoredState > alpha)
{
type = HashType.ExactPrune;
alpha = scoredState;
}
}
Table.GetInstance().Add(board.zobristKey, depth, type, alpha);
return alpha;
I believe you need to make a copy of your alpha and beta bounds before you probe the table at the beginning of the node. These copies do not change when you later update the bounds (from the table or by searching).
Then, when you add new entries to the transposition table, compare scoredState against the bounds you saved at the start (the original window) instead of the updated bounds, because the flag you store has to describe the score relative to the window the node was actually entered with, and that is the saved copy, not the updated value.
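In code, that amounts to something like the sketch below. It reuses your own names (HashType, Table, board.zobristKey); bestScore stands for the best value found at this node, and alphaOrig/betaOrig are the copies taken before the table probe.

int alphaOrig = alpha;   // remember the window as it was when we entered the node
int betaOrig = beta;

// ... probe the table (which may narrow alpha/beta), then search the moves ...

// Classify bestScore against the ORIGINAL window, not the updated one.
HashType flag;
if (bestScore <= alphaOrig)
    flag = HashType.AlphaPrune;   // fail-low: bestScore is only an upper bound
else if (bestScore >= betaOrig)
    flag = HashType.BetaPrune;    // fail-high: bestScore is only a lower bound
else
    flag = HashType.ExactPrune;   // exact value inside the original window
Table.GetInstance().Add(board.zobristKey, depth, flag, bestScore);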
This is not a terribly uncommon question, but I still couldn't seem to find an answer that really explained the choice.
I have a very large list of strings (ASCII representations of SHA-256 hashes, to be exact), and I need to query for the presence of a string within that list.
There will be what is likely in excess of 100 million entries in this list, and I will need to repeatably query for the presence of an entry many times.
Given the size, I doubt I can stuff it all into a HashSet<string>. What would be an appropriate retrieval system to maximize performance?
I CAN pre-sort the list, I CAN put it into a SQL table, I CAN put it into a text file, but I'm not sure what really makes the most sense given my application.
Is there a clear winner in terms of performance among these, or other methods of retrieval?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Security.Cryptography;
namespace HashsetTest
{
abstract class HashLookupBase
{
protected const int BucketCount = 16;
private readonly HashAlgorithm _hasher;
protected HashLookupBase()
{
_hasher = SHA256.Create();
}
public abstract void AddHash(byte[] data);
public abstract bool Contains(byte[] data);
private byte[] ComputeHash(byte[] data)
{
return _hasher.ComputeHash(data);
}
protected Data256Bit GetHashObject(byte[] data)
{
var hash = ComputeHash(data);
return Data256Bit.FromBytes(hash);
}
public virtual void CompleteAdding() { }
}
class HashsetHashLookup : HashLookupBase
{
private readonly HashSet<Data256Bit>[] _hashSets;
public HashsetHashLookup()
{
_hashSets = new HashSet<Data256Bit>[BucketCount];
for(int i = 0; i < _hashSets.Length; i++)
_hashSets[i] = new HashSet<Data256Bit>();
}
public override void AddHash(byte[] data)
{
var item = GetHashObject(data);
var offset = item.GetHashCode() & 0xF;
_hashSets[offset].Add(item);
}
public override bool Contains(byte[] data)
{
var target = GetHashObject(data);
var offset = target.GetHashCode() & 0xF;
return _hashSets[offset].Contains(target);
}
}
class ArrayHashLookup : HashLookupBase
{
private Data256Bit[][] _objects;
private int[] _offsets;
private int _bucketCounter;
public ArrayHashLookup(int size)
{
size /= BucketCount;
_objects = new Data256Bit[BucketCount][];
_offsets = new int[BucketCount];
for(var i = 0; i < BucketCount; i++) _objects[i] = new Data256Bit[size + 1];
_bucketCounter = 0;
}
public override void CompleteAdding()
{
for(int i = 0; i < BucketCount; i++) Array.Sort(_objects[i]);
}
public override void AddHash(byte[] data)
{
var hashObject = GetHashObject(data);
_objects[_bucketCounter][_offsets[_bucketCounter]++] = hashObject;
_bucketCounter++;
_bucketCounter %= BucketCount;
}
public override bool Contains(byte[] data)
{
var hashObject = GetHashObject(data);
return _objects.Any(o => Array.BinarySearch(o, hashObject) >= 0);
}
}
struct Data256Bit : IEquatable<Data256Bit>, IComparable<Data256Bit>
{
public bool Equals(Data256Bit other)
{
return _u1 == other._u1 && _u2 == other._u2 && _u3 == other._u3 && _u4 == other._u4;
}
public int CompareTo(Data256Bit other)
{
var rslt = _u1.CompareTo(other._u1); if (rslt != 0) return rslt;
rslt = _u2.CompareTo(other._u2); if (rslt != 0) return rslt;
rslt = _u3.CompareTo(other._u3); if (rslt != 0) return rslt;
return _u4.CompareTo(other._u4);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj))
return false;
return obj is Data256Bit && Equals((Data256Bit) obj);
}
public override int GetHashCode()
{
unchecked
{
var hashCode = _u1.GetHashCode();
hashCode = (hashCode * 397) ^ _u2.GetHashCode();
hashCode = (hashCode * 397) ^ _u3.GetHashCode();
hashCode = (hashCode * 397) ^ _u4.GetHashCode();
return hashCode;
}
}
public static bool operator ==(Data256Bit left, Data256Bit right)
{
return left.Equals(right);
}
public static bool operator !=(Data256Bit left, Data256Bit right)
{
return !left.Equals(right);
}
private readonly long _u1;
private readonly long _u2;
private readonly long _u3;
private readonly long _u4;
private Data256Bit(long u1, long u2, long u3, long u4)
{
_u1 = u1;
_u2 = u2;
_u3 = u3;
_u4 = u4;
}
public static Data256Bit FromBytes(byte[] data)
{
return new Data256Bit(
BitConverter.ToInt64(data, 0),
BitConverter.ToInt64(data, 8),
BitConverter.ToInt64(data, 16),
BitConverter.ToInt64(data, 24)
);
}
}
class Program
{
private const int TestSize = 150000000;
static void Main(string[] args)
{
GC.Collect(3);
GC.WaitForPendingFinalizers();
{
var arrayHashLookup = new ArrayHashLookup(TestSize);
PerformBenchmark(arrayHashLookup, TestSize);
}
GC.Collect(3);
GC.WaitForPendingFinalizers();
{
var hashsetHashLookup = new HashsetHashLookup();
PerformBenchmark(hashsetHashLookup, TestSize);
}
Console.ReadLine();
}
private static void PerformBenchmark(HashLookupBase hashClass, int size)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < size; i++)
hashClass.AddHash(BitConverter.GetBytes(i * 2));
Console.WriteLine("Hashing and addition took " + sw.ElapsedMilliseconds + "ms");
sw.Restart();
hashClass.CompleteAdding();
Console.WriteLine("Hash cleanup (sorting, usually) took " + sw.ElapsedMilliseconds + "ms");
sw.Restart();
var found = 0;
for (int i = 0; i < size * 2; i += 10)
{
found += hashClass.Contains(BitConverter.GetBytes(i)) ? 1 : 0;
}
Console.WriteLine("Found " + found + " elements (expected " + (size / 5) + ") in " + sw.ElapsedMilliseconds + "ms");
}
}
}
Results are pretty promising. They run single-threaded. The hashset version manages about 400k lookups per second at 7.9GB RAM usage. The array-based version uses less RAM (4.6GB). Startup times between the two are nearly identical (388 vs 391 seconds). The hashset trades RAM for lookup performance. Both had to be bucketized because of memory allocation constraints.
Array performance:
Hashing and addition took 307408ms
Hash cleanup (sorting, usually) took 81892ms
Found 30000000 elements (expected 30000000) in 562585ms [53k searches per second]
======================================
Hashset performance:
Hashing and addition took 391105ms
Hash cleanup (sorting, usually) took 0ms
Found 30000000 elements (expected 30000000) in 74864ms [400k searches per second]
If the list changes over time, I would put it in a database.
If the list doesn't change, I would put it in a sorted file and do a binary search for every query.
In both cases, I would use a Bloom filter to minimize I/O. And I would stop using strings and use the binary representation with four ulongs (to avoid the object reference cost).
If you have more than 16 GB (2*64*4/3*100M, assuming Base64 encoding) to spare, an option is to make a Set<string> and be happy. Of course it would fit in less than 7 GB if you use the binary representation.
David Haney's answer shows us that the memory cost is not so easily calculated.
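The "binary representation with four ulongs" part could look like the sketch below (the type name is mine; it assumes the input is the 64-character hex form of the hash):

using System;
using System.Globalization;

// Packs a 64-char hex SHA-256 string into four ulongs (32 bytes, no per-char overhead).
public struct Hash256
{
    public readonly ulong A, B, C, D;

    public Hash256(string hex)
    {
        if (hex == null || hex.Length != 64)
            throw new ArgumentException("Expected 64 hex characters.");
        A = ulong.Parse(hex.Substring(0, 16), NumberStyles.HexNumber);
        B = ulong.Parse(hex.Substring(16, 16), NumberStyles.HexNumber);
        C = ulong.Parse(hex.Substring(32, 16), NumberStyles.HexNumber);
        D = ulong.Parse(hex.Substring(48, 16), NumberStyles.HexNumber);
    }
}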
With <gcAllowVeryLargeObjects>, you can have arrays that are much larger. Why not convert those ASCII representations of 256-bit hash codes to a custom struct that implements IComparable<T>? It would look like this:
struct MyHashCode: IComparable<MyHashCode>
{
// make these readonly and provide a constructor
ulong h1, h2, h3, h4;
public int CompareTo(MyHashCode other)
{
var rslt = h1.CompareTo(other.h1);
if (rslt != 0) return rslt;
rslt = h2.CompareTo(other.h2);
if (rslt != 0) return rslt;
rslt = h3.CompareTo(other.h3);
if (rslt != 0) return rslt;
return h4.CompareTo(other.h4);
}
}
You can then create an array of these, which would occupy approximately 3.2 GB. You can search it easy enough with Array.BinarySearch.
Of course, you'll need to convert the user's input from ASCII to one of those hash code structures, but that's easy enough.
As for performance, this isn't going to be as fast as a hash table, but it's certainly going to be faster than a database lookup or file operations.
Come to think of it, you could create a HashSet<MyHashCode>. You'd have to override the Equals method on MyHashCode, but that's really easy. As I recall, the HashSet costs something like 24 bytes per entry, and you'd have the added cost of the larger struct. Figure five or six gigabytes, total, if you were to use a HashSet. More memory, but still doable, and you get O(1) lookup.
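Putting the sorted-array option together would look roughly like this (LoadAllHashes, ParseUserInput and userHexString are hypothetical placeholders for however you read the data):

// Build once, then every query is a binary search: O(log n), no per-entry overhead.
MyHashCode[] all = LoadAllHashes();        // ~100M entries, roughly 3.2 GB
Array.Sort(all);                           // one-time cost after loading

MyHashCode candidate = ParseUserInput(userHexString);
bool present = Array.BinarySearch(all, candidate) >= 0;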
These answers don't factor the string memory into the application. Strings are not 1 char == 1 byte in .NET. Each string object requires a constant 20 bytes for the object data. And the buffer requires 2 bytes per character. Therefore: the memory usage estimate for a string instance is 20 + (2 * Length) bytes.
Let's do some math.
100,000,000 UNIQUE strings
SHA256 = 32 bytes (256 bits), i.e. 64 characters when hex-encoded as ASCII
size of each string = 20 + (2 * 64) = 148 bytes
Total required memory: 14,800,000,000 bytes, roughly 14 gigabytes
It is possible to do so, but this will not store well in .NET memory. Your goal should be to load all of this data into a form that can be accessed/paged without holding it all in memory at once. For that I'd use Lucene.net, which will store your data on disk and intelligently search it. Write each string as searchable to an index and then search the index for the string. Now you have a scalable app that can handle this problem; your only limitation will be disk space (and it would take a lot of strings to fill up a terabyte drive). Alternatively, put these records in a database and query against it. That's why databases exist: to persist things outside of RAM. :)
For maximum speed, keep them in RAM. It's only ~3GB worth of data, plus whatever overhead your data structure needs. A HashSet<byte[]> works fine as long as you give it a structural IEqualityComparer<byte[]>, since byte[] compares by reference by default (see the sketch after this answer). If you want to lower overhead and GC pressure, turn on <gcAllowVeryLargeObjects>, use a single byte[], and a HashSet<int> with a custom comparer to index into it.
For speed and low memory usage, store them in a disk-based hash table.
For simplicity, store them in a database.
Whatever you do, you should store them as plain binary data, not strings.
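The structural comparer mentioned above is only a few lines (a sketch; the hash-code choice leans on SHA-256 output already being uniformly distributed):

using System;
using System.Collections.Generic;

// byte[] uses reference equality by default, so HashSet<byte[]> needs this.
sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (x[i] != y[i]) return false;
        return true;
    }

    public int GetHashCode(byte[] obj)
    {
        return BitConverter.ToInt32(obj, 0);   // first 4 bytes of the hash
    }
}

// var set = new HashSet<byte[]>(new ByteArrayComparer());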
A hashset splits your data into buckets (arrays). On a 64-bit system, the size limit for an array is 2 GB, which is roughly 2,000,000,000 bytes.
Since a string is a reference type, and since a reference takes eight bytes (assuming a 64-bit system), each bucket can hold approximately 250,000,000 (250 million) references to strings. It seems to be way more than what you need.
That being said, as Tim S. pointed out, it's highly unlikely you'll have the necessary memory to hold the strings themselves, even though the references would fit into the hashset. A database would be a much better fit for this.
You need to be careful in this sort of situation as most collections in most languages are not really designed or optimized for that sort of scale. As you have already identified memory usage will be a problem too.
The clear winner here is to use some form of database. Either a SQL database or there are a number of NoSQL ones that would be appropriate.
The SQL server is already designed and optimized for keeping track of large amounts of data, indexing it and searching and querying across those indexes. It's designed for doing exactly what you are trying to do so really would be the best way to go.
For performance you could consider using an embedded database that runs within your process, saving the communication overhead. For Java I could recommend Derby for that purpose; I don't know the C# equivalents well enough to make a recommendation there, but I imagine suitable databases exist.
It might take a while (1) to dump all the records in a (clustered indexed) table (preferably use their values, not their string representation (2)) and let SQL do the searching. It will handle binary searching for you, it will handle caching for you and it's probably the easiest thing to work with if you need to make changes to the list. And I'm pretty sure that querying things will be just as fast (or faster) than building your own.
(1): For loading the data have a look at the SqlBulkCopy object, things like ADO.NET or Entity Framework are going to be too slow as they load the data row by row.
(2): SHA-256 = 256 bits, so a binary(32) will do; which is only half of the 64 characters you're using now. (Or a quarter of it if you're using Unicode numbers =P) Then again, if you currently have the information in a plain text-file you could still go the char(64) way and simply dump the data in the table using bcp.exe. The database will be bigger, the queries slightly slower (as more I/O is needed + the cache holds only half of the information for the same amount of RAM), etc... But it's quite straightforward to do, and if you're not happy with the result you can still write your own database-loader.
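A sketch of the bulk-load side with SqlBulkCopy (the table and column names are made up, hashBatch and connectionString are placeholders, and it needs System.Data and System.Data.SqlClient; in practice you would stream or batch rather than build one giant DataTable):

// Assumes: CREATE TABLE Hashes (Hash binary(32) NOT NULL PRIMARY KEY)
using (var table = new DataTable())
{
    table.Columns.Add("Hash", typeof(byte[]));
    foreach (byte[] hash in hashBatch)            // hashBatch: one chunk of rows to load
        table.Rows.Add(new object[] { hash });

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "Hashes";
        bulk.WriteToServer(table);
    }
}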
If the set is constant, then just make a big sorted hash list (in raw format, 32 bytes each). Store the hashes aligned to disk sectors (4 KB) so that the beginning of each sector is also the beginning of a hash. Save the first hash of every Nth sector in a special index list, which will easily fit into memory. Use binary search on this index list to determine the starting sector of the sector cluster where the hash should be, and then use another binary search within this sector cluster to find your hash. The value N should be determined by measuring with test data.
EDIT: an alternative would be to implement your own hash table on disk. The table should use an open addressing strategy, and the probe sequence should be restricted to the same disk sector as much as possible. An empty slot has to be marked with a special value (all zeroes, for instance), so this special value needs special handling when querying for existence. To avoid collisions the table should be no more than 80% full, so in your case, with 100 million entries of 32 bytes each, the table should have at least 100M / 80% = 125 million slots and a size of 125M * 32 bytes = 4 GB. You only need a hashing function that converts the 2^256 domain to 125M, and a reasonable probe sequence.
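The hashing function that converts the 2^256 domain to 125M can simply reuse part of the SHA-256 value, since it is already uniformly distributed. A sketch of the slot and sector arithmetic only (no I/O; names are illustrative):

using System;

// 125M slots of 32 bytes each (~4 GB file); 4 KB sectors hold 128 slots.
const long SlotCount = 125000000;
const int SlotsPerSector = 4096 / 32;

static long SlotOf(byte[] hash)
{
    ulong prefix = BitConverter.ToUInt64(hash, 0);    // first 8 bytes are already uniform
    return (long)(prefix % (ulong)SlotCount);
}

static long Probe(long slot, int attempt)
{
    // Linear probing that wraps within the slot's own sector, so one disk read covers
    // the whole probe sequence (a real implementation spills to the next sector after 128 misses).
    long sectorStart = (slot / SlotsPerSector) * SlotsPerSector;
    return sectorStart + ((slot - sectorStart + attempt) % SlotsPerSector);
}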
You can try a Suffix Tree, this question goes over how to do it in C#
Or you can try a search like so
var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();
AsParallel will help speed things up as it creates a parallelization of a query.
1. Store your hashes as UInt32[8].
2a. Use a sorted list. To compare two hashes, first compare their first elements; if they are equal, compare the second ones, and so on.
2b. Use a prefix tree.
First of all, I would really recommend using data compression to minimize resource consumption. Cache and memory bandwidth are usually the most limited resources in a modern computer; no matter how you implement this, the biggest bottleneck will be waiting for data.
I would also recommend using an existing database engine. Many of them have built-in compression, and any database will make use of the RAM you have available. If you have a decent operating system, the system cache will store as much of the file as it can, but most databases have their own caching subsystem as well.
I can't really tell which database engine will be best for you; you have to try them out. Personally I often use H2, which has decent performance, can be used as both an in-memory and a file-based database, and has built-in transparent compression.
I see that some have stated that importing your data into a database and building the search index may take longer than some custom solution. That may be true, but importing is usually something quite rare; I am going to assume that you are more interested in fast searches, as they are probably the most common operation.
Also, while SQL databases are both reliable and quite fast, you may want to consider NoSQL databases too. Try out a few alternatives. The only way to know which solution will give you the best performance is by benchmarking them.
Also you should consider if storing your list as text makes sense. Perhaps you should convert the list to numeric values. That will use less space and therefore give you faster queries. Database import may be significantly slower, but queries may become significantly faster.
If you want really fast, and the elements are more or less immutable and require exact matches, you can build something that operates like a virus scanner: set the scope to collect the minimum number of potential elements using whatever algorithms are relevant to your entries and search criteria, then iterate through those items, testing against the search item using RtlCompareMemory.. You can pull the items from disk if they are fairly contiguous and compare using something like this:
private Boolean CompareRegions(IntPtr hFile, long nPosition, IntPtr pCompare, UInt32 pSize)
{
IntPtr pBuffer = IntPtr.Zero;
UInt32 iRead = 0;
try
{
pBuffer = VirtualAlloc(IntPtr.Zero, pSize, MEM_COMMIT, PAGE_READWRITE);
SetFilePointerEx(hFile, nPosition, IntPtr.Zero, FILE_BEGIN);
if (ReadFile(hFile, pBuffer, pSize, ref iRead, IntPtr.Zero) == 0)
return false;
if (RtlCompareMemory(pCompare, pBuffer, pSize) == pSize)
return true; // equal
return false;
}
finally
{
if (pBuffer != IntPtr.Zero)
VirtualFree(pBuffer, pSize, MEM_RELEASE);
}
}
I would modify this example to grab a large buffer full of entries and loop through those. But managed code may not be the way to go: the fastest path is always closer to the calls that do the actual work, so a driver with kernel-mode access, written in straight C, would be much faster.
Firstly, you say the strings are really SHA256 hashes. Observe that 100 million * 256 bits = 3.2 gigabytes, so it is possible to fit the entire list in memory, assuming you use a memory-efficient data structure.
If you forgive occasional false positives, you can actually use less memory than that. See bloom filters http://billmill.org/bloomfilter-tutorial/
Otherwise, use a sorted data structure to achieve fast querying (time complexity O(log n)).
If you really do want to store the data in memory (because you're querying frequently and need fast results), try Redis. http://redis.io/
Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
It has a set datatype http://redis.io/topics/data-types#sets
Redis Sets are an unordered collection of Strings. It is possible to add, remove, and test for existence of members in O(1) (constant time regardless of the number of elements contained inside the Set).
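With the StackExchange.Redis client (an assumption on my part; any client that exposes SADD/SISMEMBER will do), usage is roughly as follows, where hashBytes and queryBytes stand for the raw 32-byte values:

using StackExchange.Redis;

// Membership test via a Redis set; store the raw 32 bytes, not hex strings.
var redis = ConnectionMultiplexer.Connect("localhost");
IDatabase db = redis.GetDatabase();

db.SetAdd("hashes", hashBytes);                        // load phase (pipeline/batch in practice)
bool present = db.SetContains("hashes", queryBytes);   // SISMEMBER, O(1)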
Otherwise, use a database that saves the data on disk.
A plain vanilla binary search tree will give excellent lookup performance on large lists. However, if you don't really need to store the strings and simple membership is all you want to know, a Bloom filter may be a terrific solution. Bloom filters are a compact data structure that you train with all the strings. Once trained, it can quickly tell you whether it has seen a string before. It rarely reports false positives, but never reports false negatives. Depending on the application, they can produce amazing results quickly and with relatively little memory.
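For this particular data a Bloom filter is especially easy, because SHA-256 output is already uniformly random, so slices of the hash itself can serve as the independent hash functions. A sketch (sizes are illustrative; about 1 billion bits is roughly 120 MB and gives around a 1% false-positive rate for 100M entries with 7 probes):

using System;
using System.Collections;

class Sha256BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _probes;

    public Sha256BloomFilter(int bitCount, int probes)
    {
        _bits = new BitArray(bitCount);
        _probes = probes;   // at most 8, since we take 4 bytes of the hash per probe
    }

    private int Index(byte[] hash, int i)
    {
        uint slice = BitConverter.ToUInt32(hash, i * 4);   // reuse the hash's own randomness
        return (int)(slice % (uint)_bits.Length);
    }

    public void Add(byte[] hash)
    {
        for (int i = 0; i < _probes; i++) _bits[Index(hash, i)] = true;
    }

    public bool MightContain(byte[] hash)
    {
        for (int i = 0; i < _probes; i++)
            if (!_bits[Index(hash, i)]) return false;
        return true;   // "probably yes": false positives possible, false negatives never
    }
}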
I developed a solution similar to Insta's approach, but with some differences. In effect, it looks a lot like his chunked array solution. However, instead of just simply splitting the data, my approach builds an index of chunks and directs the search only to the appropriate chunk.
The way the index is built is very similar to a hashtable, with each bucket being a sorted array that can be searched with a binary search. However, I figured that there's little point in computing a hash of a SHA-256 hash, so instead I simply take a prefix of the value.
The interesting thing about this technique is that you can tune it by extending the length of the index keys. A longer key means a larger index and smaller buckets. My test case of 8 bits is probably on the small side; 10-12 bits would probably be more effective.
I attempted to benchmark this approach, but it quickly ran out of memory so I wasn't able to see anything interesting in terms of performance.
I also wrote a C implementation. The C implementation wasn't able to deal with a data set of the specified size either (the test machine has only 4GB of RAM), but it did manage somewhat more. (The target data set actually wasn't so much of a problem in that case, it was the test data that filled up the RAM.) I wasn't able to figure out a good way to throw data at it fast enough to really see its performance tested.
While I enjoyed writing this, I'd say overall it mostly provides evidence in favor of the argument that you shouldn't be trying to do this in memory with C#.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
public interface IKeyed
{
int ExtractKey();
}
struct Sha256_Long : IComparable<Sha256_Long>, IKeyed
{
private UInt64 _piece1;
private UInt64 _piece2;
private UInt64 _piece3;
private UInt64 _piece4;
public Sha256_Long(string hex)
{
if (hex.Length != 64)
{
throw new ArgumentException("Hex string must contain exactly 64 digits.");
}
UInt64[] pieces = new UInt64[4];
for (int i = 0; i < 4; i++)
{
pieces[i] = UInt64.Parse(hex.Substring(i * 16, 16), NumberStyles.HexNumber);
}
_piece1 = pieces[0];
_piece2 = pieces[1];
_piece3 = pieces[2];
_piece4 = pieces[3];
}
public Sha256_Long(byte[] bytes)
{
if (bytes.Length != 32)
{
throw new ArgumentException("Sha256 values must be exactly 32 bytes.");
}
_piece1 = BitConverter.ToUInt64(bytes, 0);
_piece2 = BitConverter.ToUInt64(bytes, 8);
_piece3 = BitConverter.ToUInt64(bytes, 16);
_piece4 = BitConverter.ToUInt64(bytes, 24);
}
public override string ToString()
{
return String.Format("{0:X}{0:X}{0:X}{0:X}", _piece1, _piece2, _piece3, _piece4);
}
public int CompareTo(Sha256_Long other)
{
if (this._piece1 < other._piece1) return -1;
if (this._piece1 > other._piece1) return 1;
if (this._piece2 < other._piece2) return -1;
if (this._piece2 > other._piece2) return 1;
if (this._piece3 < other._piece3) return -1;
if (this._piece3 > other._piece3) return 1;
if (this._piece4 < other._piece4) return -1;
if (this._piece4 > other._piece4) return 1;
return 0;
}
//-------------------------------------------------------------------
// Implementation of key extraction
public const int KeyBits = 8;
private static UInt64 _keyMask;
private static int _shiftBits;
static Sha256_Long()
{
_keyMask = 0;
for (int i = 0; i < KeyBits; i++)
{
_keyMask |= (UInt64)1 << i;
}
_shiftBits = 64 - KeyBits;
}
public int ExtractKey()
{
// Use the top KeyBits bits of the first piece as the bucket index
return (int)((_piece1 >> _shiftBits) & _keyMask);
}
}
class IndexedSet<T> where T : IComparable<T>, IKeyed
{
private T[][] _keyedSets;
public IndexedSet(IEnumerable<T> source, int keyBits)
{
// Arrange elements into groups by key
var keyedSetsInit = new Dictionary<int, List<T>>();
foreach (T item in source)
{
int key = item.ExtractKey();
List<T> vals;
if (!keyedSetsInit.TryGetValue(key, out vals))
{
vals = new List<T>();
keyedSetsInit.Add(key, vals);
}
vals.Add(item);
}
// Transform the above structure into a more efficient array-based structure
int nKeys = 1 << keyBits;
_keyedSets = new T[nKeys][];
for (int key = 0; key < nKeys; key++)
{
List<T> vals;
if (keyedSetsInit.TryGetValue(key, out vals))
{
_keyedSets[key] = vals.OrderBy(x => x).ToArray();
}
}
}
public bool Contains(T item)
{
int key = item.ExtractKey();
if (_keyedSets[key] == null)
{
return false;
}
else
{
return Search(item, _keyedSets[key]);
}
}
private bool Search(T item, T[] set)
{
int first = 0;
int last = set.Length - 1;
while (first <= last)
{
int midpoint = (first + last) / 2;
int cmp = item.CompareTo(set[midpoint]);
if (cmp == 0)
{
return true;
}
else if (cmp < 0)
{
last = midpoint - 1;
}
else
{
first = midpoint + 1;
}
}
return false;
}
}
class Program
{
//private const int NTestItems = 100 * 1000 * 1000;
private const int NTestItems = 1 * 1000 * 1000;
private static Sha256_Long RandomHash(Random rand)
{
var bytes = new byte[32];
rand.NextBytes(bytes);
return new Sha256_Long(bytes);
}
static IEnumerable<Sha256_Long> GenerateRandomHashes(
Random rand, int nToGenerate)
{
for (int i = 0; i < nToGenerate; i++)
{
yield return RandomHash(rand);
}
}
static void Main(string[] args)
{
Console.WriteLine("Generating test set.");
var rand = new Random();
IndexedSet<Sha256_Long> set =
new IndexedSet<Sha256_Long>(
GenerateRandomHashes(rand, NTestItems),
Sha256_Long.KeyBits);
Console.WriteLine("Testing with random input.");
int nFound = 0;
int nItems = NTestItems;
int waypointDistance = 100000;
int waypoint = 0;
for (int i = 0; i < nItems; i++)
{
if (++waypoint == waypointDistance)
{
Console.WriteLine("Test lookups complete: " + (i + 1));
waypoint = 0;
}
var item = RandomHash(rand);
nFound += set.Contains(item) ? 1 : 0;
}
Console.WriteLine("Testing complete.");
Console.WriteLine(String.Format("Found: {0} / {0}", nFound, nItems));
Console.ReadKey();
}
}
I want to perform a double threshold on a volume, using a GPU kernel. I send my volume, per slice, as read_only image2d_t. My output volume is a binary volume, where each bit specifies if its related voxel is enabled or disabled. My kernel checks if the current pixel value is within the lower/upper threshold range, and enables its corresponding bit in the binary volume.
For debugging purposes I left the actual threshold check commented out for now; I simply use the passed slice number to determine whether the binary volume bit should be on or off. The first 15 slices (slicenr < 15) are set to "on", the rest to "off". I have also verified this code on the CPU side, in the code I pasted at the bottom of this post. The code shows both paths, with the CPU one commented out now.
The CPU code works as intended, the following image is returned after rendering the volume with the binary mask applied:
Running the exact same logic using my GPU kernel returns incorrect results (1st 3D, 2nd slice view):
What goes wrong here? I read that OpenCL does not support bit fields, but as far as I understand the OpenCL specs it does support bitwise operators. My bit logic, which selects the right bit from the 32-bit word and flips it, should be supported, right? Or is my simple flag considered a bit field? What it does is select the (voxel % 32)-th bit from the left (not the right, hence the subtraction).
Another thing could be that the uint pointer passed to my kernel is different from what I expect. I assumed this would be valid use of pointers and passing data to my kernel. The logic applied to the "uint* word" part in the kernel is due to padding words per row, and paddings rows per slice. The CPU variant confirmed that the pointer calculation logic is valid though.
Below; the code
uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels);
uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels);
int[] dims = new int[3];
dims[0] = volume.Geometry.NumberOfVoxels.X;
dims[1] = volume.Geometry.NumberOfVoxels.Y;
dims[2] = volume.Geometry.NumberOfVoxels.Z;
uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[];
unsafe {
fixed(int* dimPtr = dims) {
fixed(uint *arrcPtr = arrC) {
// pick Cloo Platform
ComputePlatform platform = ComputePlatform.Platforms[0];
// create context with all gpu devices
ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
new ComputeContextPropertyList(platform), null, IntPtr.Zero);
// load opencl source
StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl");
string clSource = streamReader.ReadToEnd();
streamReader.Close();
// create program with opencl source
ComputeProgram program = new ComputeProgram(context, clSource);
// compile opencl source
program.Build(null, null, null, IntPtr.Zero);
// Create the event wait list. An event list is not really needed for this example but it is important to see how it works.
// Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution.
// For this reason their use should be avoided if possible.
ComputeEventList eventList = new ComputeEventList();
// Create the command queue. This is used to control kernel execution and manage read/write/copy operations.
ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
// Create the kernel function and set its arguments.
ComputeKernel kernel = program.CreateKernel("LowerThreshold");
int slicenr = 0;
foreach (IntPtr ptr in pinnedSlices) {
/*// CPU VARIANT FOR TESTING PURPOSES
for (int y = 0; y < dims[1]; y++) {
for (int x = 0; x < dims[0]; x++) {
long pixelOffset = x + y * dims[0];
ushort* ushortPtr = (ushort*)ptr;
ushort pixel = *(ushortPtr + pixelOffset);
int BinaryWordShift = 5;
int BinaryWordBits = 32;
if (
(0 <= x) &&
(0 <= y) &&
(0 <= slicenr) &&
(x < dims[0]) &&
(y < dims[1]) &&
(slicenr < dims[2])
) {
uint* word =
arrcPtr + 1 + (slicenr * wordsPerPlane) +
(y * wordsPerRow) +
(x >> BinaryWordShift);
uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f)));
//if (pixel > lowerThreshold && pixel < upperThreshold) {
if (slicenr < 15) {
*word |= mask;
} else {
*word &= ~mask;
}
}
}
}*/
ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
3,
new IntPtr(dimPtr));
ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16);
ComputeImage2D image2D = new ComputeImage2D(
context,
ComputeMemoryFlags.ReadOnly,
format,
volume.Geometry.NumberOfVoxels.X,
volume.Geometry.NumberOfVoxels.Y,
0,
ptr
);
// The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length).
ComputeBuffer<uint> c = new ComputeBuffer<uint>(
context, ComputeMemoryFlags.WriteOnly, arrC.Length);
kernel.SetMemoryArgument(0, image2D);
kernel.SetMemoryArgument(1, dimsBuffer);
kernel.SetValueArgument(2, wordsPerRow);
kernel.SetValueArgument(3, wordsPerPlane);
kernel.SetValueArgument(4, slicenr);
kernel.SetValueArgument(5, lowerThreshold);
kernel.SetValueArgument(6, upperThreshold);
kernel.SetMemoryArgument(7, c);
// Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command.
// If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created.
commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList);
// Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer
// will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete
// by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host.
// eventList will contain two events after this method returns.
commands.ReadFromBuffer(c, ref arrC, false, eventList);
// A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands
// in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands
// to finish has to be issued before "arrC" can be used.
// This explicit synchronization can be achieved in two ways:
// 1) Wait for the events in the list to finish,
//eventList.Wait();
//}
// 2) Or simply use
commands.Finish();
slicenr++;
}
}
}
}
And my kernel code:
const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP | CLK_NORMALIZED_COORDS_FALSE;
kernel void LowerThreshold(
read_only image2d_t image,
global int* brickSize,
uint wordsPerRow,
uint wordsPerPlane,
int slicenr,
int lower,
int upper,
global write_only uint* c )
{
int4 coord = (int4)(get_global_id(0),get_global_id(1),slicenr,1);
uint4 pixel = read_imageui(image, smp, coord.xy);
uchar BinaryWordShift = 5;
int BinaryWordBits = 32;
if (
(0 <= coord.x) &&
(0 <= coord.y) &&
(0 <= coord.z) &&
(coord.x < brickSize[0]) &&
(coord.y < brickSize[1]) &&
(coord.z < brickSize[2])
) {
global uint* word =
c + 1 + (coord.z * wordsPerPlane) +
(coord.y * wordsPerRow) +
(coord.x >> BinaryWordShift);
uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f)));
//if (pixel.w > lower && pixel.w < upper) {
if (slicenr < 15) {
*word |= mask;
} else {
*word &= ~mask;
}
}
}
Two issues:
You've declared "c" as "write_only" yet use the "|=" and "&=" operators, which are read-modify-write
As the other posters mentioned, if two work items access the same word, there are race conditions between the read-modify-write operations that will cause errors. Atomic operations would fix that, but they are much slower than non-atomic operations, so while possible, they are not recommended.
I'd recommend making your output 8x larger and using bytes rather than bits. This would make your output write-only and would also remove contention and therefore race conditions.
Or (if data compactness or format is important) process 8 elements at a time per work item, and write the composite 8-bit output as a single byte. This would be write-only, with no contention, and would still have your data compactness.
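To illustrate that second option on the CPU side (mirroring your commented-out CPU variant; the array and size names here are mine, not from your code), each "work item" handles 8 consecutive x positions and emits one byte with a single store, so the output stays write-only and there is no contention:

// Threshold one slice, packing 8 voxels per byte, one write per byte.
static void ThresholdSlicePacked(ushort[] slice, byte[] output, int width, int height,
    int slicenr, int bytesPerRow, int bytesPerPlane, ushort lower, ushort upper)
{
    for (int y = 0; y < height; y++)
    {
        for (int xByte = 0; xByte < (width + 7) / 8; xByte++)
        {
            byte packed = 0;
            for (int bit = 0; bit < 8 && xByte * 8 + bit < width; bit++)
            {
                ushort pixel = slice[y * width + xByte * 8 + bit];
                if (pixel > lower && pixel < upper)
                    packed |= (byte)(0x80 >> bit);   // MSB-first, like the original mask
            }
            // Single store per 8 voxels: write-only output, no read-modify-write, no races.
            output[slicenr * bytesPerPlane + y * bytesPerRow + xByte] = packed;
        }
    }
}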
I'm working on a simple demo for collision detection, which contains only a bunch of objects bouncing around in the window. (The goal is to see how many objects the game can handle at once without dropping frames.)
There is gravity, so the objects are either moving or else colliding with a wall.
The naive solution was O(n^2):
foreach Collidable c1:
foreach Collidable c2:
checkCollision(c1, c2);
This is pretty bad. So I set up CollisionCell objects, which maintain information about a portion of the screen. The idea is that each Collidable only needs to check for the other objects in its cell. With 60 px by 60 px cells, this yields almost a 10x improvement, but I'd like to push it further.
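To make the scheme concrete, the bucketing boils down to something like the sketch below (CellSize and screenWidth here are stand-ins; my real cells store explicit bounds):

// With fixed-size cells, the containing cell is one integer division away.
const int CellSize = 60;
int columns = screenWidth / CellSize + 1;          // screenWidth is a placeholder

// Bucket all objects by cell index in a single pass.
var cells = new Dictionary<int, List<GameObject>>();
foreach (GameObject obj in engine.GameObjects)
{
    int cellX = (int)(obj.Position.X / CellSize);
    int cellY = (int)(obj.Position.Y / CellSize);
    int key = cellY * columns + cellX;

    List<GameObject> bucket;
    if (!cells.TryGetValue(key, out bucket))
        cells[key] = bucket = new List<GameObject>();
    bucket.Add(obj);
}
// Collision checks then only compare objects that share a bucket (plus neighbouring
// buckets if an object can straddle a cell border).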
A profiler has revealed that the code spends 50% of its time in the function each cell uses to get its contents. Here it is:
// all the objects in this cell
public ICollection<GameObject> Containing
{
get
{
ICollection<GameObject> containing = new HashSet<GameObject>();
foreach (GameObject obj in engine.GameObjects) {
// 20% of processor time spent in this conditional
if (obj.Position.X >= bounds.X &&
obj.Position.X < bounds.X + bounds.Width &&
obj.Position.Y >= bounds.Y &&
obj.Position.Y < bounds.Y + bounds.Height) {
containing.Add(obj);
}
}
return containing;
}
}
20% of the program's total time is spent in that conditional alone.
Here is where the above function gets called:
// Get a list of lists of cell contents
List<List<GameObject>> cellContentsSet = cellManager.getCellContents();
// foreach item, only check items in the same cell
foreach (List<GameObject> cellMembers in cellContentsSet) {
foreach (GameObject item in cellMembers) {
// process collisions
}
}
//...
// Gets a list of list of cell contents (each sub list = 1 cell)
internal List<List<GameObject>> getCellContents() {
List<List<GameObject>> result = new List<List<GameObject>>();
foreach (CollisionCell cell in cellSet) {
result.Add(new List<GameObject>(cell.Containing.ToArray()));
}
return result;
}
Right now I have to iterate over every cell, even empty ones. Perhaps this could be improved on, but I'm not sure how to verify that a cell is empty without looking at it. (Maybe I could implement something like the sleeping objects found in some physics engines: if an object is still for a while, it goes to sleep and is not included in the calculations for every frame.)
What can I do to optimize this? (Also, I'm new to C# - are there any other glaring stylistic errors?)
When the game starts lagging out, the objects tend to be packed fairly tightly, so there's not that much motion going on. Perhaps I can take advantage of this somehow, writing a function to see if, given an object's current velocity, it can possibly leave its current cell before the next call to Update()
UPDATE 1 I decided to maintain a list of the objects that were found to be in the cell at the last update, and check those first to see if they were still in the cell. I also maintained an area variable in CollisionCell so that, once the cell was filled, I could stop looking. Here is my implementation of that; it made the whole demo much slower:
// all the objects in this cell
private ICollection<GameObject> prevContaining;
private ICollection<GameObject> containing;
internal ICollection<GameObject> Containing {
get {
return containing;
}
}
/**
* To ensure that `containing` and `prevContaining` are up to date, this MUST be called once per Update() loop in which it is used.
* What is a good way to enforce this?
*/
public void updateContaining()
{
ICollection<GameObject> result = new HashSet<GameObject>();
uint area = checked((uint) bounds.Width * (uint) bounds.Height); // the area of this cell
// first, try to fill up this cell with objects that were in it previously
ICollection<GameObject>[] toSearch = new ICollection<GameObject>[] { prevContaining, engine.GameObjects };
foreach (ICollection<GameObject> potentiallyContained in toSearch) {
if (area > 0) { // redundant, but faster?
foreach (GameObject obj in potentiallyContained) {
if (obj.Position.X >= bounds.X &&
obj.Position.X < bounds.X + bounds.Width &&
obj.Position.Y >= bounds.Y &&
obj.Position.Y < bounds.Y + bounds.Height) {
result.Add(obj);
area -= checked((uint) Math.Pow(obj.Radius, 2)); // assuming objects are square
if (area <= 0) {
break;
}
}
}
}
}
prevContaining = containing;
containing = result;
}
UPDATE 2 I abandoned that last approach. Now I'm trying to maintain a pool of collidables (orphans), and remove objects from them when I find a cell that contains them:
internal List<List<GameObject>> getCellContents() {
List<GameObject> orphans = new List<GameObject>(engine.GameObjects);
List<List<GameObject>> result = new List<List<GameObject>>();
foreach (CollisionCell cell in cellSet) {
cell.updateContaining(ref orphans); // this call will alter orphans!
result.Add(new List<GameObject>(cell.Containing));
if (orphans.Count == 0) {
break;
}
}
return result;
}
// `orphans` is a list of GameObjects that do not yet have a cell
public void updateContaining(ref List<GameObject> orphans) {
ICollection<GameObject> result = new HashSet<GameObject>();
for (int i = 0; i < orphans.Count; i++) {
// 20% of processor time spent in this conditional
if (orphans[i].Position.X >= bounds.X &&
orphans[i].Position.X < bounds.X + bounds.Width &&
orphans[i].Position.Y >= bounds.Y &&
orphans[i].Position.Y < bounds.Y + bounds.Height) {
result.Add(orphans[i]);
orphans.RemoveAt(i);
i--; // step back so the element that shifted into slot i is not skipped
}
}
containing = result;
}
This only yields a marginal improvement, not the 2x or 3x I'm looking for.
UPDATE 3 Again I abandoned the above approaches, and decided to let each object maintain its current cell:
private CollisionCell currCell;
internal CollisionCell CurrCell {
get {
return currCell;
}
set {
currCell = value;
}
}
This value gets updated:
// Run 1 cycle of this object
public virtual void Run()
{
position += velocity;
parent.CellManager.updateContainingCell(this);
}
CellManager code:
private IDictionary<Vector2, CollisionCell> cellCoords = new Dictionary<Vector2, CollisionCell>();
internal void updateContainingCell(GameObject gameObject) {
CollisionCell currCell = findContainingCell(gameObject);
gameObject.CurrCell = currCell;
if (currCell != null) {
currCell.Containing.Add(gameObject);
}
}
// null if no such cell exists
private CollisionCell findContainingCell(GameObject gameObject) {
if (gameObject.Position.X > GameEngine.GameWidth
|| gameObject.Position.X < 0
|| gameObject.Position.Y > GameEngine.GameHeight
|| gameObject.Position.Y < 0) {
return null;
}
// we'll need to be able to access these outside of the loops
uint minWidth = 0;
uint minHeight = 0;
for (minWidth = 0; minWidth + cellWidth < gameObject.Position.X; minWidth += cellWidth) ;
for (minHeight = 0; minHeight + cellHeight < gameObject.Position.Y; minHeight += cellHeight) ;
CollisionCell currCell = cellCoords[new Vector2(minWidth, minHeight)];
// Make sure `currCell` actually contains gameObject
Debug.Assert(gameObject.Position.X >= currCell.Bounds.X && gameObject.Position.X <= currCell.Bounds.Width + currCell.Bounds.X,
String.Format("{0} should be between lower bound {1} and upper bound {2}", gameObject.Position.X, currCell.Bounds.X, currCell.Bounds.X + currCell.Bounds.Width));
Debug.Assert(gameObject.Position.Y >= currCell.Bounds.Y && gameObject.Position.Y <= currCell.Bounds.Height + currCell.Bounds.Y,
String.Format("{0} should be between lower bound {1} and upper bound {2}", gameObject.Position.Y, currCell.Bounds.Y, currCell.Bounds.Y + currCell.Bounds.Height));
return currCell;
}
I thought this would make it better - now I only have to iterate over collidables, not all collidables * cells. Instead, the game is now hideously slow, delivering only 1/10th of its performance with my above approaches.
The profiler indicates that a different method is now the main hot spot, and the time to get neighbors for an object is trivially short. That method didn't change from before, so perhaps I'm calling it WAY more than I used to...
It spends 50% of its time in that function because you call that function a lot. Optimizing that one function will only yield incremental improvements to performance.
Alternatively, just call the function less!
You've already started down that path by setting up a spatial partitioning scheme (lookup Quadtrees to see a more advanced form of your technique).
A second approach is to break your N*N loop into an incremental form and to use a CPU budget.
You can allocate a CPU budget for each of the modules that want action during frame times (during Updates). Collision is one of these modules, AI might be another.
Let's say you want to run your game at 60 fps. This means you have about 1/60 s = 0.0167 s of CPU time to burn between frames. Now we can split those 0.0167 s between our modules. Let's give collision 30% of the budget: 0.005 s.
Now your collision algorithm knows that it can only spend 0.005 s working. So if it runs out of time, it will need to postpone some tasks for later - you will make the algorithm incremental. Code for achieving this can be as simple as:
const double CollisionBudget = 0.005;
Collision[] _allPossibleCollisions;
int _lastCheckedCollision;
void HandleCollisions() {
var startTime = HighPerformanceCounter.Now;
if (_allPossibleCollisions == null ||
_lastCheckedCollision >= _allPossibleCollisions.Length) {
// Start a new series
_allPossibleCollisions = GenerateAllPossibleCollisions();
_lastCheckedCollision = 0;
}
for (var i=_lastCheckedCollision; i<_allPossibleCollisions.Length; i++) {
// Don't go over the budget
if (HighPerformanceCounter.Now - startTime > CollisionBudget) {
break;
}
_lastCheckedCollision = i;
if (CheckCollision(_allPossibleCollisions[i])) {
HandleCollision(_allPossibleCollisions[i]);
}
}
}
There, now it doesn't matter how fast the collision code is, it will be done as quickly as is possible without affecting the user's perceived performance.
Benefits include:
The algorithm is designed to run out of time, it just resumes on the next frame, so you don't have to worry about this particular edge case.
CPU budgeting becomes more and more important as the number of advanced/time consuming algorithms increases. Think AI. So it's a good idea to implement such a system early on.
Human reaction time is far slower than a single 60 Hz frame, so the algorithm can spread its work across several frames without the player noticing that it hasn't finished yet.
Doing it this way gives stable, data-independent frame rates.
It still benefits from performance optimizations to the collision algorithm itself.
Collision algorithms are designed to track down the "sub frame" in which collisions happened. That is, you will never be so lucky as to catch a collision just as it happens - thinking you're doing so is lying to yourself.
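To illustrate what "sub frame" means, here is a rough sketch of a swept AABB test that reports the fraction of a frame at which two boxes first touch; the Box type and the constant per-frame displacement are simplifying assumptions for the sketch, not something from the code above:
using System;

// Hypothetical axis-aligned box with min/max corners.
public struct Box
{
    public float MinX, MinY, MaxX, MaxY;
}

public static class SweptAabb
{
    // Returns the fraction of the frame (0..1) at which box A first touches box B,
    // given each box's displacement over the frame, or -1 if they don't touch this frame.
    public static float TimeOfImpact(Box a, float adx, float ady, Box b, float bdx, float bdy)
    {
        // Only the relative displacement matters, so treat B as stationary.
        float dx = adx - bdx;
        float dy = ady - bdy;

        float txEntry, txExit, tyEntry, tyExit;
        AxisTimes(a.MinX, a.MaxX, b.MinX, b.MaxX, dx, out txEntry, out txExit);
        AxisTimes(a.MinY, a.MaxY, b.MinY, b.MaxY, dy, out tyEntry, out tyExit);

        float entry = Math.Max(txEntry, tyEntry); // last axis to begin overlapping
        float exit  = Math.Min(txExit, tyExit);   // first axis to stop overlapping

        if (entry > exit || exit <= 0f || entry > 1f)
            return -1f;                 // no contact during this frame
        return Math.Max(entry, 0f);     // 0 means they already touch at the start of the frame
    }

    private static void AxisTimes(float aMin, float aMax, float bMin, float bMax, float d,
                                  out float tEntry, out float tExit)
    {
        if (Math.Abs(d) < 1e-6f)
        {
            // No relative motion on this axis: either always overlapping or never.
            bool overlap = aMax > bMin && bMax > aMin;
            tEntry = overlap ? float.NegativeInfinity : float.PositiveInfinity;
            tExit  = overlap ? float.PositiveInfinity : float.NegativeInfinity;
            return;
        }
        // Times at which the interval edges cross: the smaller is the entry, the larger the exit.
        float t1 = (bMin - aMax) / d;
        float t2 = (bMax - aMin) / d;
        tEntry = Math.Min(t1, t2);
        tExit  = Math.Max(t1, t2);
    }
}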
I can help here; I wrote my own collision detection as an experiment. I can tell you right now that you won't get the performance you need without changing algorithms. The naive way is nice, but it only works for so many items before collapsing. What you need is sweep and prune. The basic idea goes like this (from my collision detection library project):
using System.Collections.Generic;
using AtomPhysics.Interfaces;
namespace AtomPhysics.Collisions
{
public class SweepAndPruneBroadPhase : IBroadPhaseCollider
{
private INarrowPhaseCollider _narrowPhase;
private AtomPhysicsSim _sim;
private List<Extent> _xAxisExtents = new List<Extent>();
private List<Extent> _yAxisExtents = new List<Extent>();
private Extent e1;
public SweepAndPruneBroadPhase(INarrowPhaseCollider narrowPhase)
{
_narrowPhase = narrowPhase;
}
public AtomPhysicsSim Sim
{
get { return _sim; }
set { _sim = value; }
}
public INarrowPhaseCollider NarrowPhase
{
get { return _narrowPhase; }
set { _narrowPhase = value; }
}
public bool NeedsNotification { get { return true; } }
public void Add(Nucleus nucleus)
{
Extent xStartExtent = new Extent(nucleus, ExtentType.Start);
Extent xEndExtent = new Extent(nucleus, ExtentType.End);
_xAxisExtents.Add(xStartExtent);
_xAxisExtents.Add(xEndExtent);
Extent yStartExtent = new Extent(nucleus, ExtentType.Start);
Extent yEndExtent = new Extent(nucleus, ExtentType.End);
_yAxisExtents.Add(yStartExtent);
_yAxisExtents.Add(yEndExtent);
}
public void Remove(Nucleus nucleus)
{
// Removing elements while enumerating a List<T> throws, so use RemoveAll instead;
// this also drops both the start and the end extent of the nucleus in one call.
_xAxisExtents.RemoveAll(e => e.Nucleus == nucleus);
_yAxisExtents.RemoveAll(e => e.Nucleus == nucleus);
}
public void Update()
{
_xAxisExtents.InsertionSort(comparisonMethodX);
_yAxisExtents.InsertionSort(comparisonMethodY);
for (int i = 0; i < _xAxisExtents.Count; i++)
{
e1 = _xAxisExtents[i];
if (e1.Type == ExtentType.Start)
{
HashSet<Extent> potentialCollisionsX = new HashSet<Extent>();
for (int j = i + 1; j < _xAxisExtents.Count && _xAxisExtents[j].Nucleus.ID != e1.Nucleus.ID; j++)
{
potentialCollisionsX.Add(_xAxisExtents[j]);
}
HashSet<Extent> potentialCollisionsY = new HashSet<Extent>();
for (int j = i + 1; j < _yAxisExtents.Count && _yAxisExtents[j].Nucleus.ID != e1.Nucleus.ID; j++)
{
potentialCollisionsY.Add(_yAxisExtents[j]);
}
List<Extent> probableCollisions = new List<Extent>();
foreach (Extent e in potentialCollisionsX)
{
if (potentialCollisionsY.Contains(e) && !probableCollisions.Contains(e) && e.Nucleus.ID != e1.Nucleus.ID)
{
probableCollisions.Add(e);
}
}
foreach (Extent e2 in probableCollisions)
{
if (e1.Nucleus.DNCList.Contains(e2.Nucleus) || e2.Nucleus.DNCList.Contains(e1.Nucleus))
continue;
NarrowPhase.DoCollision(e1.Nucleus, e2.Nucleus);
}
}
}
}
private bool comparisonMethodX(Extent e1, Extent e2)
{
float e1PositionX = e1.Nucleus.NonLinearSpace != null ? e1.Nucleus.NonLinearPosition.X : e1.Nucleus.Position.X;
float e2PositionX = e2.Nucleus.NonLinearSpace != null ? e2.Nucleus.NonLinearPosition.X : e2.Nucleus.Position.X;
e1PositionX += (e1.Type == ExtentType.Start) ? -e1.Nucleus.Radius : e1.Nucleus.Radius;
e2PositionX += (e2.Type == ExtentType.Start) ? -e2.Nucleus.Radius : e2.Nucleus.Radius;
return e1PositionX < e2PositionX;
}
private bool comparisonMethodY(Extent e1, Extent e2)
{
float e1PositionY = e1.Nucleus.NonLinearSpace != null ? e1.Nucleus.NonLinearPosition.Y : e1.Nucleus.Position.Y;
float e2PositionY = e2.Nucleus.NonLinearSpace != null ? e2.Nucleus.NonLinearPosition.Y : e2.Nucleus.Position.Y;
e1PositionY += (e1.Type == ExtentType.Start) ? -e1.Nucleus.Radius : e1.Nucleus.Radius;
e2PositionY += (e2.Type == ExtentType.Start) ? -e2.Nucleus.Radius : e2.Nucleus.Radius;
return e1PositionY < e2PositionY;
}
private enum ExtentType { Start, End }
private sealed class Extent
{
private ExtentType _type;
public ExtentType Type
{
get
{
return _type;
}
set
{
_type = value;
_hashcode = 23;
_hashcode *= 17 + Nucleus.GetHashCode();
}
}
private Nucleus _nucleus;
public Nucleus Nucleus
{
get
{
return _nucleus;
}
set
{
_nucleus = value;
_hashcode = 23;
_hashcode *= 17 + Nucleus.GetHashCode();
}
}
private int _hashcode;
public Extent(Nucleus nucleus, ExtentType type)
{
Nucleus = nucleus;
Type = type;
_hashcode = 23;
_hashcode *= 17 + Nucleus.GetHashCode();
}
public override bool Equals(object obj)
{
return Equals(obj as Extent);
}
public bool Equals(Extent extent)
{
// Guard against null so Equals(object) is safe for non-Extent arguments.
return extent != null && this.Nucleus == extent.Nucleus;
}
public override int GetHashCode()
{
return _hashcode;
}
}
}
}
and here's the code that does the insertion sort (more-or-less a direct translation of the pseudocode here):
/// <summary>
/// Performs an insertion sort on the list.
/// </summary>
/// <typeparam name="T">The type of the list supplied.</typeparam>
/// <param name="list">the list to sort.</param>
/// <param name="comparison">the method for comparison of two elements; returns true when the first argument should come before the second.</param>
public static void InsertionSort<T>(this IList<T> list, Func<T, T, bool> comparison)
{
// The pseudocode is 1-based, so the loop bounds shift down by one for a 0-based list.
for (int i = 1; i < list.Count; i++)
{
for (int j = i; j > 0 && comparison(list[j], list[j - 1]); j--)
{
T tempItem = list[j];
list.RemoveAt(j);
list.Insert(j - 1, tempItem);
}
}
}
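A quick usage example (made-up numbers): the comparison delegate should return true when its first argument belongs before its second.
List<float> values = new List<float> { 3f, 1f, 2f };
values.InsertionSort((a, b) => a < b); // sort ascending
// values is now { 1f, 2f, 3f }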
IIRC, I was able to get an extremely large performance increase with that, especially when dealing with large numbers of colliding bodies. You'll need to adapt it to your code, but that's the basic premise behind sweep and prune.
The other thing I want to remind you of is that you should use a profiler, like the one made by Red Gate. There's a free trial which should last you long enough.
It looks like you are looping through all the game objects just to see what objects are contained in a cell. It seems like a better approach would be to store the list of game objects that are in a cell for each cell. If you do that and each object knows what cells it is in, then moving objects between cells should be easy. This seems like it will yield the biggest performance gain.
Here is another optimization tip for determining what cells an object is in:
If you have already determined what cell(s) an object is in, and you know from the object's velocity that it will not change cells during the current frame, there is no need to rerun the logic that determines which cells the object occupies. You can do a quick check by creating a bounding box that contains all the cells the object is in, and another bounding box that is the size of the object plus its velocity for the current frame. If the cell bounding box contains the object-plus-velocity bounding box, no further checks need to be done. If the object isn't moving, it's even easier: just use the object's bounding box.
Let me know if that makes sense, or google / bing search for "Quad Tree", or if you don't mind using open source code, check out this awesome physics library: http://www.codeplex.com/FarseerPhysics
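A rough sketch of the cell-caching idea above - names like Cells, CurrentCellBounds and GetCellsIntersecting() are assumptions for illustration, not the question's actual API:
// Each cell keeps a list of its occupants, and each object remembers which cells it is in,
// so moving an object only touches the affected cells.
public void UpdateCells(GameObject obj)
{
    // Where the object will be once this frame's velocity is applied.
    Rectangle moved = obj.Bounds;
    moved.Offset((int)obj.Velocity.X, (int)obj.Velocity.Y);
    Rectangle swept = Rectangle.Union(obj.Bounds, moved);

    // If the cells assigned last time still cover the swept bounds, skip re-binning.
    if (obj.CurrentCellBounds.Contains(swept))
        return;

    // Otherwise take the object out of its old cells...
    foreach (CollisionCell cell in obj.Cells)
        cell.Containing.Remove(obj);
    obj.Cells.Clear();

    // ...and put it into every cell the swept bounds touch, remembering their union.
    Rectangle union = Rectangle.Empty;
    foreach (CollisionCell cell in GetCellsIntersecting(swept)) // hypothetical grid lookup
    {
        cell.Containing.Add(obj);
        obj.Cells.Add(cell);
        union = obj.Cells.Count == 1 ? cell.Bounds : Rectangle.Union(union, cell.Bounds);
    }
    obj.CurrentCellBounds = union;
}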
I'm in the exact same boat as you. I'm trying to create an overhead shooter and need to push efficiency to the max so I can have tons of bullets and enemies on screen at once.
I'd put all of my collidable objects in an array with a numbered index. This lets you take advantage of an observation: if you iterate over the full list for each item, you'll be duplicating effort. That is (and note, I'm making up variable names just to make it easier to spit out some pseudo-code):
if (objs[49].Intersects(objs[51]))
is equivalent to:
if (objs[51].Intersects(objs[49]))
So if you use a numbered index you can save some time by not duplicating efforts. Do this instead:
for (int i1 = 0; i1 < collidables.Count; i1++)
{
//By setting i2 = i1 + 1 you ensure an obj isn't checking collision with itself, and that objects already checked against i1 aren't checked again. For instance, collidables[4] doesn't need to check against collidables[0] again since this was checked earlier.
for (int i2 = i1 + 1; i2 < collidables.Count; i2++)
{
//Check collisions here
}
}
Also, I'd have each cell keep either a count or a flag to determine whether you even need to check for collisions. If a certain flag is set, or if the count is less than 2, then there's no need to check for collisions.
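Put together with the loop above, the per-cell early-out might look roughly like this (assuming each cell exposes a Containing list as in the question, and that Bounds has some rectangle-overlap test such as XNA's Rectangle.Intersects):
foreach (CollisionCell cell in cells) // 'cells' is whatever collection of cells your grid keeps
{
    List<GameObject> objs = cell.Containing;

    // Fewer than two occupants means nothing in this cell can collide.
    if (objs.Count < 2)
        continue;

    for (int i1 = 0; i1 < objs.Count; i1++)
    {
        for (int i2 = i1 + 1; i2 < objs.Count; i2++)
        {
            if (objs[i1].Bounds.Intersects(objs[i2].Bounds))
            {
                // Handle the collision between objs[i1] and objs[i2] here.
            }
        }
    }
}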
Just a heads up: some people suggest Farseer, which is a great 2D physics library for use with XNA. If you're in the market for a 3D physics engine for XNA, I've used bulletx (a C# port of Bullet) in XNA projects to great effect.
Note: I have no affiliation to the bullet or bulletx projects.
An idea might be to use a bounding circle. Basically, when a Collidable is created, keep track of its centre point and calculate a radius/diameter that contains the whole object. You can then do a first-pass elimination using something like:
float r = C1.BoundingRadius + C2.BoundingRadius;
if (Math.Abs(C1.X - C2.X) > r || Math.Abs(C1.Y - C2.Y) > r)
// Separated on at least one axis, so the pair can't possibly overlap - skip further checks...
This drops the comparisons to two for most objects, but how much this will gain you I'm not sure...profile!
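In practice this test is often written with squared distances so no square root is needed; a small sketch, assuming a Collidable type with the X, Y and BoundingRadius fields used above:
// Squared-distance version of the bounding-circle early-out: returns true when the
// circles overlap (so a detailed check is still needed), false when the pair can be skipped.
public static bool BoundingCirclesOverlap(Collidable c1, Collidable c2)
{
    float dx = c1.X - c2.X;
    float dy = c1.Y - c2.Y;
    float r = c1.BoundingRadius + c2.BoundingRadius;

    // Compare squared lengths so no Math.Sqrt call is needed.
    return dx * dx + dy * dy <= r * r;
}
If it returns false you can skip the pair entirely; if it returns true you still run the precise rectangle check.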
There are a couple of things that could be done to speed up the process... but as far as I can see your method of checking for simple rectangular collision is just fine.
But I'd replace the check
if (obj.Position.X ....)
With
if (obj.Bounds.IntersectsWith(this.Bounds))
And I'd also replace the line
result.Add(new List<GameObject>(cell.Containing.ToArray()));
For
result.Add(new List<GameObject>(cell.Containing));
The Containing property returns an ICollection<T>, which implements the IEnumerable<T> that the List<T> constructor accepts, so the extra conversion is unnecessary. ToArray() iterates the collection once to build an array, and that work is then repeated when the new list is created from it - you end up copying the data twice.
I know this thread is old, but I would say that the marked answer is wrong: its code contains a fatal error and, instead of improving performance, it actually costs performance!
First, a small note:
The code is written so that you have to call it from your Draw method, but that is the wrong place for collision detection - in Draw you should only draw, nothing else.
You also can't simply call HandleCollisions() in Update, because Update can be called a lot more often than Draw.
If you want to call HandleCollisions(), your code has to look like this. It prevents the collision detection from running more than once per frame:
private bool check = false;
protected override void Update(GameTime gameTime)
{
if(!check)
{
check = true;
HandleCollisions();
}
}
protected override void Draw(GameTime gameTime)
{
check = false;
}
Now let's take a look at what's wrong with HandleCollisions().
Example: we have 500 objects and we check every possible collision without optimizing the detection.
With 500 objects that is 249,500 collision checks (499 x 500, because we don't check an object against itself).
But with Frank's code below, almost all of those checks are silently skipped - after the first complete pass, only the very last pair is ever checked again. That certainly "increases performance", but not in the way you want!
Why? Because _lastCheckedCollision can never become equal to or greater than _allPossibleCollisions.Length, so the pair list is never regenerated and only the last entry of the array keeps being checked:
for (var i=_lastCheckedCollision; i<_allPossibleCollisions.Length; i++)
_lastCheckedCollision = i;
// << this can never reach _allPossibleCollisions.Length,
// because i is always strictly smaller than _allPossibleCollisions.Length
You have to replace this:
if (_allPossibleCollisions == null ||
_lastCheckedCollision >= _allPossibleCollisions.Length)
with this:
if (_allPossibleCollisions == null ||
_lastCheckedCollision >= _allPossibleCollisions.Length - 1)
Or, dropping the budgeting altogether, the whole thing can be replaced by this:
private bool check = false;
protected override void Update(GameTime gameTime)
{
if(!check)
{
check = true;
_allPossibleCollisions = GenerateAllPossibleCollisions();
for(int i=0; i < _allPossibleCollisions.Length; i++)
{
if (CheckCollision(_allPossibleCollisions[i]))
{
//Collision!
}
}
}
}
protected override void Draw(GameTime gameTime)
{
check = false;
}
...this should be a lot faster than your code... and it works :D ...
RCIX's answer should be marked as correct, because Frank's answer is wrong.