If I have a byte array and want to convert a contiguous 16-byte block of that array, containing .NET's representation of a Decimal, into a proper Decimal struct, what is the most efficient way to do it?
Here's the code that showed up in my profiler as the biggest CPU consumer in a case that I'm optimizing.
public static decimal ByteArrayToDecimal(byte[] src, int offset)
{
    using (MemoryStream stream = new MemoryStream(src))
    {
        stream.Position = offset;
        using (BinaryReader reader = new BinaryReader(stream))
            return reader.ReadDecimal();
    }
}
To get rid of the MemoryStream and BinaryReader, I thought feeding an array of BitConverter.ToInt32(src, offset + x) values into the Decimal(Int32[]) constructor would be faster than the solution I present below, but the version below is, strangely enough, twice as fast.
const byte DecimalSignBit = 128;
public static decimal ByteArrayToDecimal(byte[] src, int offset)
{
    return new decimal(
        BitConverter.ToInt32(src, offset),
        BitConverter.ToInt32(src, offset + 4),
        BitConverter.ToInt32(src, offset + 8),
        src[offset + 15] == DecimalSignBit,
        src[offset + 14]);
}
This is 10 times as fast as the MemoryStream/BinaryReader combo, and I tested it with a bunch of extreme values to make sure it works, but the decimal representation is not as straightforward as that of other primitive types, so I'm not yet convinced it works for 100% of the possible decimal values.
In theory, however, there could be a way to copy those 16 contiguous bytes to some other place in memory and declare that to be a Decimal, without any checks. Is anyone aware of a method to do this?
(There's only one problem: although decimals are represented as 16 bytes, some of the possible bit patterns do not constitute valid decimals, so doing an unchecked memcpy could potentially break things...)
Or is there any other faster way?
Even though this is an old question, I was a bit intrigued, so I decided to run some experiments. Let's start with the experiment code.
static void Main(string[] args)
{
    byte[] serialized = new byte[16 * 10000000];

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10000000; ++i)
    {
        decimal d = i;

        // Serialize
        using (var ms = new MemoryStream(serialized))
        {
            ms.Position = (i * 16);
            using (var bw = new BinaryWriter(ms))
            {
                bw.Write(d);
            }
        }
    }
    var ser = sw.Elapsed.TotalSeconds;

    sw = Stopwatch.StartNew();
    decimal total = 0;
    for (int i = 0; i < 10000000; ++i)
    {
        // Deserialize
        using (var ms = new MemoryStream(serialized))
        {
            ms.Position = (i * 16);
            using (var br = new BinaryReader(ms))
            {
                total += br.ReadDecimal();
            }
        }
    }
    var dser = sw.Elapsed.TotalSeconds;

    Console.WriteLine("Time: {0:0.00}s serialization, {1:0.00}s deserialization", ser, dser);
    Console.ReadLine();
}
Result: Time: 1.68s serialization, 1.81s deserialization. This is our baseline. I also tried Buffer.BlockCopy to an int[4], which gives us 0.42s for deserialization. Using the method described in the question, deserialization goes down to 0.29s.
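For reference, the Buffer.BlockCopy variant I measured looks roughly like this (a sketch; the int[4] scratch buffer is reused to avoid a per-call allocation, which also means it is not thread-safe as written):

static readonly int[] scratch = new int[4];

static decimal ByteArrayToDecimalBlockCopy(byte[] src, int offset)
{
    // Copy 16 bytes into the scratch buffer, then let the decimal(int[])
    // constructor validate and assemble the value.
    Buffer.BlockCopy(src, offset, scratch, 0, 16);
    return new decimal(scratch);
}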
"In theory however, there could be a way to copy those 16 contiguous bytes to some other place in memory and declare that to be a Decimal, without any checks. Is anyone aware of a method to do this?"
Well yes, the fastest way to do this is to use unsafe code, which is okay here because decimals are value types:
static unsafe void Main(string[] args)
{
    byte[] serialized = new byte[16 * 10000000];

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10000000; ++i)
    {
        decimal d = i;
        fixed (byte* sp = serialized)
        {
            *(decimal*)(sp + i * 16) = d;
        }
    }
    var ser = sw.Elapsed.TotalSeconds;

    sw = Stopwatch.StartNew();
    decimal total = 0;
    for (int i = 0; i < 10000000; ++i)
    {
        // Deserialize
        decimal d;
        fixed (byte* sp = serialized)
        {
            d = *(decimal*)(sp + i * 16);
        }
        total += d;
    }
    var dser = sw.Elapsed.TotalSeconds;

    Console.WriteLine("Time: {0:0.00}s serialization, {1:0.00}s deserialization", ser, dser);
    Console.ReadLine();
}
At this point, our result is: Time: 0.07s serialization, 0.16s deserialization. Pretty sure that's the fastest this is going to get... you still have to accept unsafe here, and I assume data is written the same way it's read.
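Packaged into the shape of the method from the question, the unsafe read would look something like this (a sketch; it performs no bounds or validity checking, so the 16 bytes must come from a real decimal):

public static unsafe decimal ByteArrayToDecimal(byte[] src, int offset)
{
    // Reinterpret 16 bytes of the array as a decimal, with no validation.
    fixed (byte* p = &src[offset])
    {
        return *(decimal*)p;
    }
}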
@Eugene Beresovsky: reading from a stream is very costly. MemoryStream is certainly a powerful and versatile tool, but compared to reading a binary array directly it has a pretty high cost. Perhaps that is why the second method performs better.
I have a third solution for you, but before I write it, I should say that I haven't tested its performance.
public static decimal ByteArrayToDecimal(byte[] src, int offset)
{
    var i1 = BitConverter.ToInt32(src, offset);
    var i2 = BitConverter.ToInt32(src, offset + 4);
    var i3 = BitConverter.ToInt32(src, offset + 8);
    var i4 = BitConverter.ToInt32(src, offset + 12);

    return new decimal(new int[] { i1, i2, i3, i4 });
}
This is a way to build the value from its binary representation without worrying about the canonical form of System.Decimal yourself, because the constructor validates the bits. It is the inverse of .NET's default bit extraction method:
System.Int32[] bits = Decimal.GetBits((decimal)10);
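A quick round trip illustrates the symmetry (sketch):

decimal original = 123.456m;
int[] bits = Decimal.GetBits(original);  // decimal -> int[4]
decimal restored = new decimal(bits);    // int[4] -> decimal
// restored == original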
EDIT:
This solution perhaps doesn't perform better, but it also doesn't have this problem: "(There's only one problem: although decimals are represented as 16 bytes, some of the possible bit patterns do not constitute valid decimals, so doing an unchecked memcpy could potentially break things...)".
Related
I have a video processing application that moves a lot of data.
To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.
However, I'm at the point where the lookups alone now take 30% of the processing time. I'm wondering if it might be slow RAM. Still, I would like to try to optimize it some more.
Currently I have the following:
public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];
I then perform a lookup with an index p (which is equivalent to width * y + x) to fetch the result.
int[] newResults = new int[width * height];
int p = 0;
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++, p++) {
        newResults[p] = largeArray[lookUp[p]];
    }
}
Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.
Some progress came from shortening the call stack: no getters, just a straight retrieval from a readonly array.
I've tried converting to ushort as well, but it seemed to be slower (as I understand it's due to word size).
Would an IntPtr be faster? How would I go about that?
(A screenshot of the profiler's time distribution was attached here.)
It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER**. This is exposed in .NET Core 3, and should work something like the code below, which covers the single-loop scenario (you can probably work from here to get the double-loop version):
results first:
AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms
AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms
code:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P
{
    static int Gather(int[] source, int[] index, int[] results, bool avx)
    {   // normally you wouldn't have avx as a parameter; that is just so
        // I can turn it off and on for the test; likewise the "int" return
        // here is so I can monitor (in the test) how much we did in the "old"
        // loop, vs AVX2; in real code this would be void return
        int y = 0;
        if (Avx2.IsSupported && avx)
        {
            var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
            var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);

            unsafe
            {
                fixed (int* sPtr = source)
                {
                    // note: here I'm assuming we are trying to fill "results" in
                    // a single outer loop; for a double-loop, you'll probably need
                    // to slice the spans
                    for (int i = 0; i < rv.Length; i++)
                    {
                        rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
                    }
                }
            }
            // move past everything we've processed via SIMD
            y += rv.Length * Vector256<int>.Count;
        }
        // now do anything left, which includes anything not aligned to 256 bits,
        // plus the "no AVX2" scenario
        int result = y;
        int end = results.Length; // hoist, since this is not the JIT recognized pattern
        for (; y < end; y++)
        {
            results[y] = source[index[y]];
        }
        return result;
    }

    static void Main()
    {
        // invent some random data
        var rand = new Random(12345);
        int size = 1024 * 512;
        int[] data = new int[size];
        for (int i = 0; i < data.Length; i++)
            data[i] = rand.Next(255);

        // build a fake index
        int[] index = new int[1024];
        for (int i = 0; i < index.Length; i++)
            index[i] = rand.Next(size);

        int[] results = new int[1024];

        void GatherLocal(bool avx)
        {
            // prove that we're getting the same data
            Array.Clear(results, 0, results.Length);
            int from = Gather(data, index, results, avx);
            Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
            for (int i = 0; i < 32; i++)
            {
                Console.Write(results[i].ToString("x2"));
            }
            Console.WriteLine();

            const int TimeLoop = 1024 * 512;
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < TimeLoop; i++)
                Gather(data, index, results, avx);
            watch.Stop();
            Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
            Console.WriteLine();
        }

        GatherLocal(false);
        if (Avx2.IsSupported) GatherLocal(true);
    }
}
RAM is already one of the fastest things available; the only faster memory is the CPU caches. So your code will be memory bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries (at 4 bytes per int, about 24 MB). That will likely not fit in any cache, and it takes a long time to iterate over. It does not matter what the memory speed is; this is simply a lot of data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialization would help? Chances are that you do not truly need every value, only the common ones. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but the result 7 happens in 6 out of 36 cases, while 2 and 12 each happen in only 1 out of 36 cases. So having the 7 stored is a lot more beneficial than the 2 and 12. A sketch of that idea follows below.
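As a minimal sketch of lazy initialization, assuming some expensive ComputeValue(p) method (the name is made up for illustration), entries are computed on first access instead of precomputing all width * height values up front. Note this is not thread-safe as written; since your application is heavily multithreaded, a real version would need synchronization, or the computation must be idempotent so a duplicate write is harmless:

int?[] lazyLookUp = new int?[width * height];

int GetLookUp(int p)
{
    if (lazyLookUp[p] == null)
        lazyLookUp[p] = ComputeValue(p); // expensive calculation, done at most once per entry
    return lazyLookUp[p].Value;
}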
I want to generate one million random bits, but my problem is that the code takes too much time and never finishes. Why does that happen?
string result1 = "";
Random rand = new Random();
for (int i = 0; i < 1000000; i++)
{
    result1 += ((rand.Next() % 2 == 0) ? "0" : "1");
}
textBox1.Text = result1.ToString();
Concatenating strings is an O(N) operation. Strings are immutable, so when you add to a string the new value is copied into a new string, which requires iterating the previous string. Since you're adding a value for each iteration, the amount that has to be read each time grows with each addition, leading to a performance of O(N^2). Since your N is 1,000,000 this takes a very, very long time, and probably is eating all of the memory you have storing these intermediary throw-away strings.
The normal solution when building a string with an arbitrary number of inputs is to instead use a StringBuilder. Although, a 1,000,000 character bit string is still.. unwieldy. Assuming a bitstring is what you want/need, you can change your code to something like the following and have a much more performant solution.
public string GetGiantBitString() {
    var sb = new StringBuilder();
    var rand = new Random();

    for (var i = 0; i < 1_000_000; i++) {
        sb.Append(rand.Next() % 2);
    }

    return sb.ToString();
}
This works for me, it takes about 0.035 seconds on my box:
private static IEnumerable<Byte> MillionBits()
{
    var rand = new RNGCryptoServiceProvider();
    // a million bits is 125,000 bytes, so
    var bytes = new List<byte>(125000);
    for (var i = 0; i < 125; ++i)
    {
        byte[] tempBytes = new byte[1000];
        rand.GetBytes(tempBytes);
        bytes.AddRange(tempBytes);
    }
    return bytes;
}

private static string BytesAsString(IEnumerable<Byte> bytes)
{
    var buffer = new StringBuilder();
    foreach (var byt in bytes)
    {
        buffer.Append(Convert.ToString(byt, 2).PadLeft(8, '0'));
    }
    return buffer.ToString();
}
and then:
var myStopWatch = new Stopwatch();
myStopWatch.Start();
var lotsOfBytes = MillionBits();
var bigString = BytesAsString(lotsOfBytes);
var len = bigString.Length;
var elapsed = myStopWatch.Elapsed;
The len variable was a million, the string looked like it was all 1s and 0s.
If you really want to fill your textbox full of ones and zeros, just set its Text property to bigString.
I have a byte array and I would like to return sequential chunks (in the form of new byte arrays) of a certain size.
I tried:
originalArray = BYTE_ARRAY
var segment = new ArraySegment<byte>(originalArray, 0, 640);
byte[] newArray = new byte[640];
for (int i = segment.Offset; i < segment.Count; i++)
{
    newArray[i] = segment.Array[i];
}
Obviously this only creates an array of the first 640 bytes from the original array. Ultimately, I want a loop that goes through the first 640 bytes and returns an array of those bytes, then goes through the NEXT 640 bytes and returns an array of THOSE bytes. The purpose of this is to send messages to a server, and each message must contain 640 bytes. I cannot guarantee that the original array length is divisible by 640.
Thanks
if speed isn't a concern:

// requires using System.Linq for Skip/Take
var bytes = new byte[640 * 6];
for (var i = 0; i < bytes.Length; i += 640)
{
    var chunk = bytes.Skip(i).Take(640).ToArray();
    ...
}
Alternatively you could use
Span.Slice Method
Buffer.BlockCopy(Array, Int32, Array, Int32, Int32) Method
Span
Span<byte> bytes = arr; // Implicit cast from T[] to Span<T>
...
slicedBytes = bytes.Slice(i, 640);
BlockCopy
Note this will probably be the fastest of the three:
var chunk = new byte[640];
Buffer.BlockCopy(bytes, i, chunk, 0, 640);
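Putting BlockCopy into a complete loop that also handles a final partial chunk might look like this (a sketch; needs System and System.Collections.Generic):

static List<byte[]> ChunkWithBlockCopy(byte[] source, int chunkSize)
{
    var chunks = new List<byte[]>();
    for (int i = 0; i < source.Length; i += chunkSize)
    {
        // The last chunk may be shorter than chunkSize.
        int size = Math.Min(chunkSize, source.Length - i);
        var chunk = new byte[size];
        Buffer.BlockCopy(source, i, chunk, 0, size);
        chunks.Add(chunk);
    }
    return chunks;
}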
If you truly want to make new arrays from each 640-byte chunk, then you're looking for .Skip and .Take
Here's a working example (and a repl of the example) that I hacked together.
using System;
using System.Linq;
using System.Text;
using System.Collections;
using System.Collections.Generic;

class MainClass {
    public static void Main (string[] args) {
        // mock up a byte array from something
        var seedString = String.Join("", Enumerable.Range(0, 1024).Select(x => x.ToString()));
        var byteArrayInput = Encoding.ASCII.GetBytes(seedString);

        var skip = 0;
        var take = 640;
        var total = byteArrayInput.Length;
        var output = new List<byte[]>();

        // use <= so a chunk ending exactly at the end of the input isn't dropped
        while (skip + take <= total) {
            output.Add(byteArrayInput.Skip(skip).Take(take).ToArray());
            skip += take;
        }

        output.ForEach(c => Console.WriteLine($"chunk: {BitConverter.ToString(c)}"));
    }
}
It's really probably better to actually use the ArraySegment properly, unless this is an assignment to learn LINQ extensions.
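For example, ArraySegment<byte> can hand out views over the original buffer without copying anything (a sketch; needs System and System.Collections.Generic):

static IEnumerable<ArraySegment<byte>> Segments(byte[] source, int size)
{
    for (int offset = 0; offset < source.Length; offset += size)
    {
        // Each segment is just a view over the original array; no bytes are copied.
        yield return new ArraySegment<byte>(
            source, offset, Math.Min(size, source.Length - offset));
    }
}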
You can write a generic helper method like this:
public static IEnumerable<T[]> AsBatches<T>(T[] input, int n)
{
    for (int i = 0, r = input.Length; r >= n; r -= n, i += n)
    {
        var result = new T[n];
        Array.Copy(input, i, result, 0, n);
        yield return result;
    }
}
Then you can use it in a foreach loop:
byte[] byteArray = new byte[123456];
foreach (var batch in AsBatches(byteArray, 640))
{
    Console.WriteLine(batch.Length); // Do something with the batch.
}
Or if you want a list of batches just do this:
List<byte[]> listOfBatches = AsBatches(byteArray, 640).ToList();
If you want to get fancy you could make it an extension method, but this is only recommended if you will be using it a lot (don't make an extension method for something you'll only be calling in one place!).
Here I've changed the name to InChunksOf() to make it more readable:
public static class ArrayExt
{
    public static IEnumerable<T[]> InChunksOf<T>(this T[] input, int n)
    {
        for (int i = 0, r = input.Length; r >= n; r -= n, i += n)
        {
            var result = new T[n];
            Array.Copy(input, i, result, 0, n);
            yield return result;
        }
    }
}
Which you could use like this:
byte[] byteArray = new byte[123456];
// ... initialise byteArray[], then:
var listOfChunks = byteArray.InChunksOf(640).ToList();
[EDIT] Corrected loop terminator from r > n to r >= n.
I built a test and got the following results: allocating classes: 15.3260622s, allocating structs: 14.7216018s.
That looks like a 4% advantage when allocating structs instead of classes. That's cool, but is it really enough to justify adding value types to the language? Where can I find an example which shows that structs really beat classes?
void Main()
{
    var stopWatch = new System.Diagnostics.Stopwatch();
    stopWatch.Start();
    for (int i = 0; i < 100000000; i++)
    {
        var foo = new refFoo()
        {
            Str = "Alex" + i
        };
    }
    stopWatch.Stop();
    stopWatch.Dump(); // LINQPad's Dump()

    stopWatch.Restart();
    for (int i = 0; i < 100000000; i++)
    {
        var foo = new valFoo()
        {
            Str = "Alex" + i
        };
    }
    stopWatch.Stop();
    stopWatch.Dump(); // LINQPad's Dump()
}

public struct valFoo
{
    public string Str;
}

public class refFoo
{
    public string Str;
}
Your methodology is wrong. You are mostly measuring string allocations, conversions of integers to strings, and concatenation of strings. This benchmark is not worth the bits it is written on.
In order to see the benefit of structs, compare allocating an array of 1000 objects and an array of 1000 structs. In the case of the array of objects, you will need one allocation for the array itself, and then one allocation for each object in the array. In the case of the array of structs, you have one allocation for the array of structs.
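A sketch of that comparison (the point types are made up for illustration):

public struct PointStruct { public int X, Y; }
public class PointClass { public int X, Y; }

// One allocation: a single contiguous block holding 1000 structs.
var structs = new PointStruct[1000];

// 1001 allocations: the array of references, plus one heap object per element.
var classes = new PointClass[1000];
for (int i = 0; i < classes.Length; i++)
    classes[i] = new PointClass();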
Also, look at the implementation of the Enumerator of the List<T> class in the .NET collections source code. It is declared as a struct because it is small and short-lived, which makes it very inexpensive: getting an enumerator for a foreach loop does not require a heap allocation.
Try some simpler test:
int size = 1000000;
var listA = new List<int>(size);
for (int i = 0; i < size; i++)
listA.Add(i);
var listB = new List<object>(size);
for (int i = 0; i < size; i++)
listB.Add(i);
To store 1000000 integers, the first case makes the system allocate 4000000 bytes (one 4-byte int per element). In the second, each int is boxed, so, if I'm not mistaken, it is about 12000000 bytes. And I suspect the performance difference will be much greater.
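You can get a rough measurement yourself with GC.GetTotalMemory (a sketch; exact numbers vary by runtime and platform):

long before = GC.GetTotalMemory(forceFullCollection: true);
var boxed = new List<object>(size);
for (int i = 0; i < size; i++)
    boxed.Add(i); // each int is boxed: an object header plus the 4 data bytes
long after = GC.GetTotalMemory(forceFullCollection: true);
Console.WriteLine($"~{after - before} bytes for {size} boxed ints");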
I do not know much about compression algorithms. I am looking for a simple compression algorithm (or code snippet) which can reduce the size of a byte[,,] or byte[]. I cannot make use of System.IO.Compression. Also, the data has lots of repetition.
I tried implementing the RLE algorithm (posted below for your inspection). However, it produces arrays 1.2 to 1.8 times larger.
public static class RLE
{
    public static byte[] Encode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;

        for (int i = 0; i < source.Length; i++)
        {
            runLength = 1;
            while (runLength < byte.MaxValue
                && i + 1 < source.Length
                && source[i] == source[i + 1])
            {
                runLength++;
                i++;
            }
            dest.Add(runLength);
            dest.Add(source[i]);
        }

        return dest.ToArray();
    }

    public static byte[] Decode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;

        for (int i = 1; i < source.Length; i += 2)
        {
            runLength = source[i - 1];

            while (runLength > 0)
            {
                dest.Add(source[i]);
                runLength--;
            }
        }

        return dest.ToArray();
    }
}
I have also found a Java LZW implementation, working on strings and integers. I have converted it to C# and the results look good (code posted below). However, I am not sure how it works, nor how to make it work with bytes instead of strings and integers.
public class LZW
{
    /* Compress a string to a list of output symbols. */
    public static int[] compress(string uncompressed)
    {
        // Build the dictionary.
        int dictSize = 256;
        Dictionary<string, int> dictionary = new Dictionary<string, int>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add("" + (char)i, i);

        string w = "";
        List<int> result = new List<int>();

        for (int i = 0; i < uncompressed.Length; i++)
        {
            char c = uncompressed[i];
            string wc = w + c;
            if (dictionary.ContainsKey(wc))
                w = wc;
            else
            {
                result.Add(dictionary[w]);
                // Add wc to the dictionary.
                dictionary.Add(wc, dictSize++);
                w = "" + c;
            }
        }

        // Output the code for w.
        if (w != "")
            result.Add(dictionary[w]);
        return result.ToArray();
    }

    /* Decompress a list of output ks to a string. */
    public static string decompress(int[] compressed)
    {
        // Build the dictionary.
        int dictSize = 256;
        Dictionary<int, string> dictionary = new Dictionary<int, string>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add(i, "" + (char)i);

        string w = "" + (char)compressed[0];
        string result = w;
        for (int i = 1; i < compressed.Length; i++)
        {
            int k = compressed[i];
            string entry = "";
            if (dictionary.ContainsKey(k))
                entry = dictionary[k];
            else if (k == dictSize)
                entry = w + w[0];

            result += entry;

            // Add w+entry[0] to the dictionary.
            dictionary.Add(dictSize++, w + entry[0]);
            w = entry;
        }
        return result;
    }
}
Have a look here. I used this code as a basis for compression in one of my work projects. Not sure how much of the .NET Framework is accessible in the Xbox 360 SDK, so not sure how well this will work for you.
The problem with that RLE algorithm is that it is too simple. It prefixes every byte with how many times it is repeated, but that does mean that in long ranges of non-repeating bytes, each single byte is prefixed with a "1". On data without any repetitions this will double the file size.
This can be avoided by using Code-type RLE instead; the 'Code' (also called 'Token') will be a byte that can have two meanings; either it indicates how many times the single following byte is repeated, or it indicates how many non-repeating bytes follow that should be copied as they are. The difference between those two codes is made by enabling the highest bit, meaning there are still 7 bits available for the value, meaning the amount to copy or repeat per such code can be up to 127.
This means that even in worst-case scenarios, the final size can only be about 1/127th larger than the original file size.
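A minimal sketch of such a code-type encoder (the article linked below has fully working, heavily optimised code; this only illustrates the two token meanings):

public static byte[] EncodeWithTokens(byte[] source)
{
    var dest = new List<byte>();
    int i = 0;
    while (i < source.Length)
    {
        // Count how many times source[i] repeats (127 max per token).
        int run = 1;
        while (run < 127 && i + run < source.Length && source[i + run] == source[i])
            run++;

        if (run >= 3)
        {
            // High bit set: "repeat the next byte (run) times".
            dest.Add((byte)(0x80 | run));
            dest.Add(source[i]);
            i += run;
        }
        else
        {
            // High bit clear: "copy the next (count) bytes verbatim".
            // Collect literals until a run of 3+ starts (127 max per token).
            int start = i;
            while (i < source.Length && i - start < 127)
            {
                if (i + 2 < source.Length
                    && source[i] == source[i + 1]
                    && source[i] == source[i + 2])
                    break;
                i++;
            }
            dest.Add((byte)(i - start));
            for (int j = start; j < i; j++)
                dest.Add(source[j]);
        }
    }
    return dest.ToArray();
}

The decoder simply reads one token at a time and checks the high bit to decide whether to repeat the next byte or copy the following bytes as-is.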
A good explanation of the whole concept, plus full working (and, in fact, heavily optimised) C# code, can be found here:
http://www.shikadi.net/moddingwiki/RLE_Compression
Note that sometimes, the data will end up larger than the original anyway, simply because there are not enough repeating bytes in it for RLE to work. A good way to deal with such compression failures is by adding a header to your final data. If you simply add an extra byte at the start that is 0 for uncompressed data and 1 for RLE-compressed data, then, when RLE fails to give a smaller result, you just save the data uncompressed, with the 0 in front, and your final data will be exactly one byte larger than the original. The system at the other side can then read that starting byte and use it to determine whether the following data should be decompressed or just copied.
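That fallback only takes a few lines (a sketch, reusing the EncodeWithTokens method sketched above):

public static byte[] CompressWithHeader(byte[] source)
{
    byte[] rle = EncodeWithTokens(source);
    byte[] result;
    if (rle.Length < source.Length)
    {
        // Compressed: header byte 1, then the RLE data.
        result = new byte[rle.Length + 1];
        result[0] = 1;
        Buffer.BlockCopy(rle, 0, result, 1, rle.Length);
    }
    else
    {
        // RLE didn't shrink the data: header byte 0, then the raw bytes.
        result = new byte[source.Length + 1];
        result[0] = 0;
        Buffer.BlockCopy(source, 0, result, 1, source.Length);
    }
    return result;
}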
Look into Huffman codes; it's a pretty simple algorithm. Basically, use fewer bits for patterns that show up more often, and keep a table of how each pattern is encoded. Note that the codewords must be prefix-free, since there are no separators in the encoded stream to help you decode.