C#/.NET bitwise shift left operation over a short[]

Is there a method (in C#/.NET) that would left-shift (bitwise) each short in a short[] and would be faster than doing it in a loop?
I am talking about data coming from a digital camera (16bit gray), the camera only uses the lower 12 bits. So to see something when rendering the data it needs to be shifted left by 4.
This is what I am doing so far:
byte[] RawData; // from camera along with the other info
if (pf == PixelFormats.Gray16)
{
fixed (byte* ptr = RawData)
{
short* wptr = (short*)ptr;
short temp;
for (int line = 0; line < ImageHeight; line++)
{
for (int pix = 0; pix < ImageWidth; pix++)
{
temp = *(wptr + (pix + line * ImageWidth));
*(wptr + (pix + line * ImageWidth)) = (short)(temp << 4);
}
}
}
}
Any ideas?

I don't know of a library method that will do it, but I have some suggestions that might help. This will only work if you know that the upper four bits of the pixel are definitely zero (rather than garbage). (If they are garbage, you'd have to add bitmasks to the below). Basically I would propose:
Using a shift operator on a larger data type (int or long) so that you are shifting more data at once
Getting rid of the multiply operations inside your loop
Doing a little loop unrolling
Here is my code:
using System.Diagnostics;
namespace ConsoleApplication9 {
class Program {
public static void Main() {
Crazy();
}
private static unsafe void Crazy() {
short[] RawData={
0x000, 0x111, 0x222, 0x333, 0x444, 0x555, 0x666, 0x777, 0x888,
0x999, 0xaaa, 0xbbb, 0xccc, 0xddd, 0xeee, 0xfff, 0x123, 0x456,
//extra sentinel value which is just here to demonstrate that the algorithm
//doesn't go too far
0xbad
};
const int ImageHeight=2;
const int ImageWidth=9;
var numShorts=ImageHeight*ImageWidth;
fixed(short* rawDataAsShortPtr=RawData) {
var nextLong=(long*)rawDataAsShortPtr;
//1 chunk of 4 longs
// ==8 ints
// ==16 shorts
while(numShorts>=16) {
*nextLong=*nextLong<<4;
nextLong++;
*nextLong=*nextLong<<4;
nextLong++;
*nextLong=*nextLong<<4;
nextLong++;
*nextLong=*nextLong<<4;
nextLong++;
numShorts-=16;
}
var nextShort=(short*)nextLong;
while(numShorts>0) {
*nextShort=(short)(*nextShort<<4);
nextShort++;
numShorts--;
}
}
foreach(var item in RawData) {
Debug.Print("{0:X4}", item);
}
}
}
}
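The answer notes that garbage in the upper four bits would call for bitmasks. A minimal sketch of that variant follows, using the same wide-shift idea without unsafe code. It is an illustration under assumptions rather than the answerer's code: the pixels are treated as ushort instead of short, MemoryMarshal.Cast (available in .NET Core 2.1 and later) reinterprets the buffer as 64-bit words, and the names ShiftDemo and ShiftLeft4 are made up for this example.

```csharp
using System;
using System.Runtime.InteropServices;

static class ShiftDemo
{
    // Shifts every 16-bit pixel left by 4 and discards whatever was in its top 4 bits.
    public static void ShiftLeft4(Span<ushort> pixels)
    {
        // After the 64-bit shift, the low 4 bits of each 16-bit lane hold bits that
        // leaked in from the neighbouring lane; this mask clears them.
        const ulong Mask = 0xFFF0_FFF0_FFF0_FFF0UL;

        // View the same memory as 64-bit words: four pixels per word.
        Span<ulong> wide = MemoryMarshal.Cast<ushort, ulong>(pixels);
        for (int i = 0; i < wide.Length; i++)
            wide[i] = (wide[i] << 4) & Mask;

        // Handle the 0 to 3 pixels that did not fill a whole 64-bit word.
        for (int i = wide.Length * 4; i < pixels.Length; i++)
            pixels[i] = (ushort)(pixels[i] << 4);
    }

    static void Main()
    {
        // 0xF222 has garbage in its top nibble; 0xABCD lands in the scalar tail loop.
        ushort[] data = { 0x0111, 0xF222, 0x0333, 0x0444, 0xABCD };
        ShiftLeft4(data);
        Console.WriteLine(string.Join(" ", Array.ConvertAll(data, v => v.ToString("X4"))));
        // prints: 1110 2220 3330 4440 BCD0
    }
}
```

Because every 16-bit lane gets the same mask, the result does not depend on byte order. Whether this beats the plain loop on a given runtime is something to measure rather than assume.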

Related

Copy unmanaged System.IntPtr byte vector into GPU row of 2D device byte array

I am using C# and CUDAfy.net (yes, this problem is easier in straight C with pointers, but I have my reasons for using this approach given the larger system).
I have a video frame grabber card that is collecting byte[1024 x 1024] image data at 30 FPS. Every 33.3 ms it fills a slot in a circular buffer and returns a System.IntPtr that points to that un-managed 1D vector of *byte; The Circular buffer has 15 slots.
On the GPU device (Tesla K40) I want to have a global 2D array that is organized as a dense 2D array. That is, I want something like the Circular Queue but on the GPU organized as a dense 2D array.
byte[15, 1024*1024] rawdata;
// if CUDAfy.NET supported jagged arrays I could use byte[15][1024*1024] but it does not
How can I fill in a different row each 33ms? Do I use something like:
gpu.CopyToDevice<byte>(inputPtr, 0, rawdata, offset, length) // length = 1024*1024
//offset is computed by rowID*(1024*1024) where rowID wraps to 0 via modulo 15.
// inputPtr is the System.IntPtr that points to the buffer in the circular queue (un-managed)?
// rawdata is a device buffer allocated gpu.Allocate<byte>(1024*1024);
And in my kernel header is:
[Cudafy]
public static void filter(GThread thread, byte[,] rawdata, int frameSize, byte[] result)
I did try something along these lines. But there is no API pattern in CudaFy for:
GPGPU.CopyToDevice(T) Method (IntPtr, Int32, T[,], Int32, Int32, Int32)
So I used the gpu.Cast Function to change the 2D device array to 1D.
I tried the code below, but I am getting CUDA.net exception: ErrorLaunchFailed
FYI: When I try the CUDA emulator, it aborts on the CopyToDevice
claiming that Data is not host allocated
public static byte[] process(System.IntPtr data, int slot)
{
Stopwatch watch = new Stopwatch();
watch.Start();
byte[] output = new byte[FrameSize];
int offset = slot*FrameSize;
gpu.Lock();
byte[] rawdata = gpu.Cast<byte>(grawdata, FrameSize); // What is the size supposed to be? Documentation lacking
gpu.CopyToDevice<byte>(data, 0, rawdata, offset, FrameSize * frameCount);
byte[] goutput = gpu.Allocate<byte>(output);
gpu.Launch(height, width).filter(rawdata, FrameSize, goutput);
runTime = watch.Elapsed.ToString();
gpu.CopyFromDevice(goutput, output);
gpu.Free(goutput);
gpu.Synchronize();
gpu.Unlock();
watch.Stop();
totalRunTime = watch.Elapsed.ToString();
return output;
}
I propose this "solution", for now, either:
1. Run the program only in native mode (not in emulation mode).
or
2. Do not handle the pinned-memory allocation yourself.
There seems to be an open issue with that now. But this happens only in emulation mode.
see: https://cudafy.codeplex.com/workitem/636
If I understand your question properly I think you are looking to convert the
byte* you get from the cyclic buffer into a multi-dimensional byte array to be sent to
the graphics card API.
int slots = 15;
int rows = 1024;
int columns = 1024;
//Try this
for (int currentSlot = 0; currentSlot < slots; currentSlot++)
{
IntPtr intPtrToUnManagedMemory = CopyContextFrom(currentSlot);
// use Marshal.Copy ?
byte[] byteData = CopyIntPtrToByteArray(intPtrToUnManagedMemory);
int offset =0;
for (int m = 0; m < rows; m++)
for (int n = 0; n < columns; n++)
{
//then send this to your GPU method
rawForGpu[m, n] = ReadByteValue(IntPtr: intPtrToUnManagedMemory,
offset++);
}
}
//or try this
for (int currentSlot = 0; currentSlot < slots; currentSlot++)
{
IntPtr intPtrToUnManagedMemory = CopyContextFrom(currentSlot);
// use Marshal.Copy ?
byte[] byteData = CopyIntPtrToByteArray(intPtrToUnManagedMemory);
byte[,] rawForGpu = ConvertTo2DArray(byteData, rows, columns);
}
}
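private static byte[,] ConvertTo2DArray(byte[] byteArr, int rows, int columns)
{
byte[,] data = new byte[rows, columns];
int totalElements = rows * columns;
// Convert 1D to 2D (rows, columns). A byte[,] is stored row-major,
// so a raw block copy of rows * columns bytes fills it in the right order.
Buffer.BlockCopy(byteArr, 0, data, 0, totalElements);
return data;
}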
private static IntPtr CopyContextFrom(int slotNumber)
{
//code that return byte* from circular buffer.
return IntPtr.Zero;
}
You should consider using the GPGPU Async functionality that's built in for a really efficient way to move data from/to host/device and use the gpuKern.LaunchAsync(...)
Check out http://www.codeproject.com/Articles/276993/Base-Encoding-on-a-GPU for an efficient way to use this. Another great example can be found in CudafyExamples project, look for PinnedAsyncIO.cs. Everything you need to do what you're describing.
This is in CudaGPU.cs in Cudafy.Host project, which matches the method you're looking for (only it's async):
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, DevicePtrEx devArray,
int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[, ,] devArray,
int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[,] devArray,
int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[] devArray,
int devOffset, int count, int streamId = 0) where T : struct;

Converting data from C++ dll in C#

I use C# with a C++ DLL and want to return an array of 512 doubles from C++ to C#. In the C++ code it works perfectly: the double array contains exactly the results I expect.
I then pass the data from C++ to C# and convert it to a double array in C#. The first 450 elements arrive in the C# array without any error, but the remaining doubles are garbage and have nothing in common with the input data.
I don't know why the wrong values start at exactly element 450 and continue to the end.
EDIT.
The same issue occurs when I change the arrays in C++ and C# to float, and also when I convert the data to integers.
C# code.
[DllImport(@"C:\Users\Jaryn\Documents\Visual Studio 2013\Projects\MathFuncDll\Debug\MathFuncDll.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern IntPtr One(string address);
static void Main(string[] args)
{
var pointer = One(@"C:\Users\Jaryn\Documents\101_ObjectCategories\accordion\image_0002.jpg");
var result = new double[512];
Marshal.Copy(pointer, result, 0, 512);
foreach (var x in result)
{
Console.WriteLine(x + " ");
}
}
MathFuncsDll.h
#include <stdexcept>
using namespace std;
namespace MathFuncs
{
extern "C" { __declspec(dllexport) double* One(char* str); }
}
MathFuncsDll.cpp
double* One(char* adress)
{
IplImage* img = cvLoadImage(adress, CV_LOAD_IMAGE_COLOR);
double data[512];
int iteration = 0;
...
for (int h = 0; h <h_bins; h++)
{
for (int s = 0; s < s_bins; s++)
{
double bin_value = 0;
for (int v = 0; v < h_bins; v++)
{
bin_value += cvGetReal3D(hist->bins, h, s, v);
data[iteration] = bin_value;
iteration++;
}
}
}
...
return data;
}
Your problem is that you are returning the address of a local variable. As soon as the function returns, that local variable's life ends. And so the address you return is the address of an object whose life is over.
The clean way to do this is to let the caller allocate the array and have the callee populate it. Have the caller pass the address of the first element of the array, and the length of the array. Then the callee can be sure not to write beyond the end of the array.
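The caller-allocates pattern described above can be sketched from the C# side as follows. So that the example runs without the DLL, a managed method stands in for the native export. The three-parameter signature, the return value, and the placeholder data are assumptions made for illustration, not the asker's actual API.

```csharp
using System;

static class CallerAllocates
{
    // Hypothetical managed stand-in for the native function. With a real DLL the
    // declaration would look roughly like this instead (an assumed signature):
    //   [DllImport("MathFuncDll.dll", CallingConvention = CallingConvention.Cdecl)]
    //   static extern int One(string address, [Out] double[] data, int length);
    // with the C++ side declared as: int One(const char* address, double* data, int length).
    public static int One(string address, double[] data, int length)
    {
        // The callee never writes more than 'length' elements.
        int count = Math.Min(length, 512);
        for (int i = 0; i < count; i++)
            data[i] = i * 0.5; // placeholder values instead of histogram bins
        return count;          // number of elements actually written
    }

    static void Main()
    {
        // The caller owns the buffer, so it is still valid after the call returns.
        var result = new double[512];
        int written = One("image_0002.jpg", result, result.Length);
        Console.WriteLine(written + " values written");
    }
}
```

With this shape no pointer to callee-owned stack memory ever crosses the boundary, and the marshaller pins the array only for the duration of the call.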

Pointer arithmetic with differing types and understanding the results?

Consider the following code:
static void Main(string[] args)
{
int max = 1024;
var lst = new List<int>();
for (int i = 1; i <= max; i *= 2) { lst.Add(i); }
var arr = lst.ToArray();
IterateInt(arr);
Console.WriteLine();
IterateShort(arr);
Console.WriteLine();
IterateLong(arr);
}
static void IterateInt(int[] arr)
{
Console.WriteLine("Iterating as INT ({0})", sizeof(int));
Console.WriteLine();
unsafe
{
fixed (int* src = arr)
{
var ptr = (int*)src;
var len = arr.Length;
while (len > 0)
{
Console.WriteLine(*ptr);
ptr++;
len--;
}
}
}
}
static void IterateShort(int[] arr)
{
Console.WriteLine("Iterating as SHORT ({0})", sizeof(short));
Console.WriteLine();
unsafe
{
fixed (int* src = arr)
{
var ptr = (short*)src;
var len = arr.Length;
while (len > 0)
{
Console.WriteLine(*ptr);
ptr++;
len--;
}
}
}
}
static void IterateLong(int[] arr)
{
Console.WriteLine("Iterating as LONG ({0})", sizeof(long));
Console.WriteLine();
unsafe
{
fixed (int* src = arr)
{
var ptr = (long*)src;
var len = arr.Length;
while (len > 0)
{
Console.WriteLine(*ptr);
ptr++;
len--;
}
}
}
}
Now, by no means do I have a full understanding in this arena. Nor did I have any real expectations. I'm experimenting and trying to learn. However, based off what I've read thus far, I don't understand the results I got for short and long.
It is my understanding that the original int[], when read 1 location at a time (i.e. arr + i), reads 4 bytes at a time because of the data types size and thus the value *ptr is of course the integral value.
However, with short I don't quite understand why every even iteration is 0 (or arguably odd iteration depending on your root reference). I mean I can see the pattern. Every time I iterate 4 bytes I get the real integral value in memory (just like iterating the int*), but why 0 on every other result?
Then the long iterations is even further outside my understanding; I don't even know what to say or assume there.
Results
Iterating as INT (4)
1
2
4
8
16
32
64
128
256
512
1024
Iterating as SHORT (2)
1
0
2
0
4
0
8
0
16
0
32
Iterating as LONG (8)
8589934593
34359738372
137438953488
549755813952
2199023255808
-9223372036854774784
96276819136
32088581144313929
30962698417340513
32370038935650407
23644233055928352
What is actually happening with the short and long iterations?
When you say pointer[index] it gives you sizeof(type) bytes at location pointer + index * sizeof(type). So by changing the type that you "iterate with" you change the stride.
With short you read halves of the original ints. Small ints have all zeros in their upper half.
With long you read two ints at the same time, forced into one long. At ptr[0] you are reading, for example, (2L << 32 | 1L) on a little-endian machine, which is 8589934593, the first value in your output.
You are still using the original Length measured in int units, though, which is a bug. In the long case you are reading outside the bounds of the array; in the short case you are reading only half of it.
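The stride effect can be reproduced without unsafe code. The sketch below is an illustration, not part of the original answer: it assumes .NET Core 2.1 or later and uses MemoryMarshal.Cast, which reinterprets the same bytes and also adjusts the element count, so it cannot run past the end of the array. The class and method names are made up for this example.

```csharp
using System;
using System.Runtime.InteropServices;

static class StrideDemo
{
    // Reinterpret the int array as 16-bit values: twice as many elements.
    public static short[] AsShorts(int[] source) =>
        MemoryMarshal.Cast<int, short>(source.AsSpan()).ToArray();

    // Reinterpret the int array as 64-bit values: half as many elements.
    public static long[] AsLongs(int[] source) =>
        MemoryMarshal.Cast<int, long>(source.AsSpan()).ToArray();

    static void Main()
    {
        int[] arr = { 1, 2, 4, 8 };

        // On a little-endian machine: 1, 0, 2, 0, 4, 0, 8, 0
        Console.WriteLine(string.Join(", ", AsShorts(arr)));

        // On a little-endian machine: 8589934593, 34359738372
        Console.WriteLine(string.Join(", ", AsLongs(arr)));
    }
}
```

The two long values match the first two lines of the LONG output in the question; the later values in that output come from reading memory beyond the array.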

Unsafe auto-fill of structs in .Net, for Network code

The idea: Being able to take the bytes of any struct, send those bytes across a TcpClient (or through my Client wrapper), then have the receiving client load those bytes and use pointers to "paint" them onto a new struct.
The problem: It reads the bytes into the buffer perfectly; it reads the array of bytes on the other end perfectly. The "paint" operation, however, fails miserably. I write a new Vector3(1F, 2F, 3F); I read a Vector3(0F, 0F, 0F)...Obviously, not ideal.
Unfortunately, I don't see the bug - If it works one way, it should work the reverse - And the values are being filled in.
The write/read functions are as follows:
public static unsafe void Write<T>(Client client, T value) where T : struct
{
int n = System.Runtime.InteropServices.Marshal.SizeOf(value);
byte[] buffer = new byte[n];
{
var handle = System.Runtime.InteropServices.GCHandle.Alloc(value, System.Runtime.InteropServices.GCHandleType.Pinned);
void* ptr = handle.AddrOfPinnedObject().ToPointer();
byte* bptr = (byte*)ptr;
for (int t = 0; t < n; ++t)
{
buffer[t] = *(bptr + t);
}
handle.Free();
}
client.Writer.Write(buffer);
}
public static unsafe T Read<T>(Client client) where T : struct
{
T r = new T();
int n = System.Runtime.InteropServices.Marshal.SizeOf(r);
{
byte[] buffer = client.Reader.ReadBytes(n);
var handle = System.Runtime.InteropServices.GCHandle.Alloc(r, System.Runtime.InteropServices.GCHandleType.Pinned);
void* ptr = handle.AddrOfPinnedObject().ToPointer();
byte* bptr = (byte*)ptr;
for (int t = 0; t < n; ++t)
{
*(bptr + t) = buffer[t];
}
handle.Free();
}
return r;
}
Help, please, thanks.
Edit:
Well, one major problem is that I'm getting a handle to a temporary copy created when I passed in the struct value.
Edit2:
Changing "T r = new T();" to "object r = new T();" and "return r" to "return (T)r" boxes and unboxes the struct and, in the meantime, makes it a reference, so the pointer actually points to it.
However, it is slow. I'm getting 13,500 - 14,500 write/reads per second.
Edit3:
OTOH, serializing/deserializing the Vector3 through a BinaryFormatter gets about 750 writes/reads per second. So a Lot faster than what I was using. :)
Edit4:
Sending the floats individually got 8,400 RW/second. Suddenly, I feel much better about this. :)
Edit5:
Tested GCHandle allocation pinning and freeing; 28,000,000 ops per second (Compared to about 1,000,000,000 Int32/int add+assign/second. So compared to integers, it's 35 times slower. However, that's still comparatively fast enough). Note that you don't appear to be able to pin classes, even if GCHandle does auto-boxed structs fine (GCHandle accepts values of type "object").
Now, if the C# guys update constraints to the point where the pointer allocation recognizes that "T" is a struct, I could just assign directly to a pointer, which is...Yep, incredibly fast.
Next up: Probably testing write/read using separate threads. :) See how much the GCHandle actually affects the send/receive delay.
As it turns out:
Edit6:
double start = Timer.Elapsed.TotalSeconds;
for (t = 0; t < count; ++t)
{
Vector3 from = new Vector3(1F, 2F, 3F);
// Vector3* ptr = &test;
// Vector3* ptr2 = &from;
int n = sizeof(Vector3);
if (n / 4 * 4 != n)
{
// This gets 9,000,000 ops/second;
byte* bptr1 = (byte*)&test;
byte* bptr2 = (byte*)&from;
// int n = 12;
for (int t2 = 0; t2 < n; ++t2)
{
*(bptr1 + t2) = *(bptr2 + t2);
}
}
else
{
// This speedup gets 24,000,000 ops/second.
int n2 = n / 4;
int* iptr1 = (int*)&test;
int* iptr2 = (int*)&from;
// int n = 12;
for (int t2 = 0; t2 < n2; ++t2)
{
*(iptr1 + t2) = *(iptr2 + t2);
}
}
}
So, overall, I don't think the GCHandle is really slowing things down. (Those who are thinking this is a slow way of assigning one Vector3 to another, remember that the purpose is to serialize structs into a byte[] buffer to send over a network. And, while that's not what we're doing here, it would be rather easy to do with the first method).
Edit7:
The following got 6,900,000 ops/second:
for (t = 0; t < count; ++t)
{
Vector3 from = new Vector3(1F, 2F, 3F);
int n = sizeof(Vector3);
byte* bptr2 = (byte*)&from;
byte[] buffer = new byte[n];
for (int t2 = 0; t2 < n; ++t2)
{
buffer[t2] = *(bptr2 + t2);
}
}
...Help! I've got IntruigingPuzzlitus! :D
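Edit5 wishes for a generic constraint that lets the compiler know T contains no references. C# 7.3 later added exactly that as the unmanaged constraint. A minimal sketch of the struct-to-bytes round trip using it follows. It assumes .NET Core 2.1 or later and uses System.Numerics.Vector3 as a stand-in for the asker's Vector3; the names StructBytes, ToBytes, and FromBytes are made up for this example.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

static class StructBytes
{
    // Copies the raw bytes of a reference-free struct into a new byte[].
    public static byte[] ToBytes<T>(T value) where T : unmanaged
    {
        // Wrap the value in a one-element span and view that span as bytes.
        ReadOnlySpan<T> one = new[] { value };
        return MemoryMarshal.AsBytes(one).ToArray();
    }

    // Rebuilds the struct from the start of the buffer.
    public static T FromBytes<T>(byte[] bytes) where T : unmanaged
    {
        return MemoryMarshal.Read<T>(bytes);
    }

    static void Main()
    {
        var original = new Vector3(1f, 2f, 3f);
        byte[] wire = ToBytes(original);
        Vector3 copy = FromBytes<Vector3>(wire);

        Console.WriteLine(wire.Length);      // prints: 12
        Console.WriteLine(copy == original); // prints: True
    }
}
```

This avoids both GCHandle and boxing. It still sends the in-memory layout as-is, so both ends must agree on field order, padding, and byte order.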

Mapping Stream data to data structures in C#

Is there a way of mapping data collected on a stream or array to a data structure or vice-versa?
In C++ this would simply be a matter of casting a pointer to the stream as a data type I want to use (or vice-versa for the reverse)
eg: in C++
Mystruct * pMyStrct = (Mystruct*)&SomeDataStream;
pMyStrct->Item1 = 25;
int iReadData = pMyStrct->Item2;
Obviously the C++ way is pretty unsafe unless you are sure of the quality of the stream data when reading incoming data, but for outgoing data it is super quick and easy.
Most people use .NET serialization (there is faster binary and slower XML formatter, they both depend on reflection and are version tolerant to certain degree)
However, if you want the fastest (unsafe) way - why not:
Writing:
YourStruct o = new YourStruct();
byte[] buffer = new byte[Marshal.SizeOf(typeof(YourStruct))];
GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
Marshal.StructureToPtr(o, handle.AddrOfPinnedObject(), false);
handle.Free();
Reading:
handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
o = (YourStruct)Marshal.PtrToStructure(handle.AddrOfPinnedObject(), typeof(YourStruct));
handle.Free();
In case lubos hasko's answer was not unsafe enough, there is also the really unsafe way, using pointers in C#. Here are some tips and pitfalls I've run into:
using System;
using System.Runtime.InteropServices;
using System.IO;
using System.Diagnostics;
// Use LayoutKind.Sequential to prevent the CLR from reordering your fields.
[StructLayout(LayoutKind.Sequential)]
unsafe struct MeshDesc
{
public byte NameLen;
// Here fixed means store the array by value, like in C,
// though C# exposes access to Name as a char*.
// fixed also requires 'unsafe' on the struct definition.
public fixed char Name[16];
// You can include other structs like in C as well.
public Matrix Transform;
public uint VertexCount;
// But not both, you can't store an array of structs.
//public fixed Vector Vertices[512];
}
[StructLayout(LayoutKind.Sequential)]
unsafe struct Matrix
{
public fixed float M[16];
}
// This is how you do unions
[StructLayout(LayoutKind.Explicit)]
unsafe struct Vector
{
[FieldOffset(0)]
public fixed float Items[16];
[FieldOffset(0)]
public float X;
[FieldOffset(4)]
public float Y;
[FieldOffset(8)]
public float Z;
}
class Program
{
unsafe static void Main(string[] args)
{
var mesh = new MeshDesc();
var buffer = new byte[Marshal.SizeOf(mesh)];
// Set where NameLen will be read from.
buffer[0] = 12;
// Use Buffer.BlockCopy to raw copy data across arrays of primitives.
// Note we copy to offset 2 here: char's have alignment of 2, so there is
// a padding byte after NameLen: just like in C.
Buffer.BlockCopy("Hello!".ToCharArray(), 0, buffer, 2, 12);
// Copy data to struct
Read(buffer, out mesh);
// Print the Name we wrote above:
var name = new char[mesh.NameLen];
// Use Marshal.Copy to copy between arrays and pointers to arrays.
unsafe { Marshal.Copy((IntPtr)mesh.Name, name, 0, mesh.NameLen); }
// Note you can also use the String.String(char*) overloads
Console.WriteLine("Name: " + new string(name));
// If Erik Myers likes it...
mesh.VertexCount = 4711;
// Copy data from struct:
// MeshDesc is a struct, and is on the stack, so its
// memory is effectively pinned by the stack pointer.
// This means '&' is sufficient to get a pointer.
Write(&mesh, buffer);
// Watch for alignment again, and note you have endianness to worry about...
int vc = buffer[100] | (buffer[101] << 8) | (buffer[102] << 16) | (buffer[103] << 24);
Console.WriteLine("VertexCount = " + vc);
}
unsafe static void Write(MeshDesc* pMesh, byte[] buffer)
{
// But byte[] is on the heap, and therefore needs
// to be flagged as pinned so the GC won't try to move it
// from under you - this can be done most efficiently with
// 'fixed', but can also be done with GCHandleType.Pinned.
fixed (byte* pBuffer = buffer)
*(MeshDesc*)pBuffer = *pMesh;
}
unsafe static void Read(byte[] buffer, out MeshDesc mesh)
{
fixed (byte* pBuffer = buffer)
mesh = *(MeshDesc*)pBuffer;
}
}
If it's .NET on both sides:
I think you should use binary serialization and send the resulting byte[].
Trusting your struct to be fully blittable can be trouble.
You will pay some overhead (both CPU and network), but it will be safe.
If you need to populate each member variable by hand you can generalize it a bit as far as the primitives are concerned by using FormatterServices to retrieve in order the list of variable types associated with an object. I've had to do this in a project where I had a lot of different message types coming off the stream and I definitely didn't want to write the serializer/deserializer for each message.
Here's the code I used to generalize the deserialization from a byte[].
public virtual bool SetMessageBytes(byte[] message)
{
MemberInfo[] members = FormatterServices.GetSerializableMembers(this.GetType());
object[] values = FormatterServices.GetObjectData(this, members);
int j = 0;
for (int i = 0; i < members.Length; i++)
{
string[] var = members[i].ToString().Split(new char[] { ' ' });
switch (var[0])
{
case "UInt32":
values[i] = (UInt32)((message[j] << 24) + (message[j + 1] << 16) + (message[j + 2] << 8) + message[j + 3]);
j += 4;
break;
case "UInt16":
values[i] = (UInt16)((message[j] << 8) + message[j + 1]);
j += 2;
break;
case "Byte":
values[i] = (byte)message[j++];
break;
case "UInt32[]":
if (values[i] != null)
{
int len = ((UInt32[])values[i]).Length;
byte[] b = new byte[len * 4];
Array.Copy(message, j, b, 0, len * 4);
Array.Copy(Utilities.ByteArrayToUInt32Array(b), (UInt32[])values[i], len);
j += len * 4;
}
break;
case "Byte[]":
if (values[i] != null)
{
int len = ((byte[])values[i]).Length;
Array.Copy(message, j, (byte[])(values[i]), 0, len);
j += len;
}
break;
default:
throw new Exception("ByteExtractable::SetMessageBytes Unsupported Type: " + var[1] + " is of type " + var[0]);
}
}
FormatterServices.PopulateObjectMembers(this, members, values);
return true;
}
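The manual shifts in the UInt32 and UInt16 cases above read big-endian (network order) values. As a hedged aside, the same reads can be written with System.Buffers.Binary.BinaryPrimitives, available in .NET Core 2.1 and later. The message layout and the field names Id, Length, and Flags below are hypothetical, chosen only to illustrate the calls.

```csharp
using System;
using System.Buffers.Binary;

static class BigEndianFields
{
    // Parses a hypothetical 7-byte header: UInt32 id, UInt16 length, Byte flags.
    public static (uint Id, ushort Length, byte Flags) Parse(ReadOnlySpan<byte> message)
    {
        uint id = BinaryPrimitives.ReadUInt32BigEndian(message);                // bytes 0..3
        ushort length = BinaryPrimitives.ReadUInt16BigEndian(message.Slice(4)); // bytes 4..5
        byte flags = message[6];                                                // byte 6
        return (id, length, flags);
    }

    static void Main()
    {
        byte[] message = { 0x12, 0x34, 0x56, 0x78, 0x01, 0x02, 0xFF };
        var (id, length, flags) = Parse(message);
        Console.WriteLine($"{id:X8} {length:X4} {flags:X2}"); // prints: 12345678 0102 FF
    }
}
```

These helpers throw if the span is too short, which gives bounds checking that the hand-written shifts only get indirectly through the array indexer.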
