Why my use of Intrinsics API is slower than naive solution?

Why my use of Intrinsics API is slower than naive solution? - c#

I struggle to realize why my usage of intrinsics API is slower than just sum with foreach loop?
public class ArraySum
{
private double[] data;
public ArraySum()
{
if (!Avx.IsSupported)
{
throw new Exception("Avx is not supported");
}
var rnd = new Random();
var list = new List<double>();
for (int i = 0; i < 100_000; i++)
{
list.Add(rnd.Next(500));
}
data = list.ToArray();
}
[Benchmark]
public void Native()
{
int result = 0;
foreach (int i in data)
{
result += i;
}
Console.WriteLine($"Native: {result}");
}
[Benchmark]
public unsafe void Intrinsics()
{
int vectorSize = 256 / 8 / 4;
var accVector = Vector256<double>.Zero;
int i;
var array = data;
fixed (double* ptr = array)
{
for (i = 0; i <= array.Length - vectorSize; i += vectorSize)
{
var v = Avx.LoadVector256(ptr + i);
accVector = Avx.Add(accVector, v);
}
}
double result = 0;
var temp = stackalloc double[vectorSize];
Avx.Store(temp, accVector);
for (int j = 0; j < vectorSize; j++)
{
result += temp[j];
}
for (; i < array.Length; i++)
{
result += array[i];
}
Console.WriteLine($"Intrinsics: {result}");
}
Result:
.NET SDK=6.0.100-rc.2.21505.57
| Method | Mean | Error | StdDev | Median |
|----------- |---------:|---------:|---------:|---------:|
| Native | 387.6 us | 12.15 us | 35.83 us | 405.8 us |
| Intrinsics | 393.2 us | 9.01 us | 25.70 us | 385.0 us |
what may be causing this?
It's running on Windows and Intel Core i5-3340M CPU 2.70GHz (Ivy Bridge) if it does matter
BenchmarkDotNet warns that ArraySum.Native: Default -> It seems that the distribution is bimodal (mValue = 3.92)

I just realized that
native method should perform it on doubles not ints, opsie
[Benchmark]
public void Native()
{
double result = 0;
foreach (double i in data)
{
result += i;
}
Console.WriteLine($"Native: {result}");
}
| Method | Mean | Error | StdDev | Median |
|----------- |---------:|---------:|---------:|---------:|
| Native | 415.1 us | 25.35 us | 73.95 us | 385.9 us |
| Intrinsics | 388.7 us | 7.58 us | 21.74 us | 384.7 us |
but also:
Console.WriteLine adds probably too much overhead which is way higher than time spent performing sum and skews the results
now the difference is more significant:
[Benchmark]
public double Native()
{
double result = 0;
foreach (double i in data)
{
result += i;
}
return result;
}
[Benchmark]
public unsafe double Intrinsics()
{
int vectorSize = 256 / 8 / 4;
var accVector = Vector256<double>.Zero;
int i;
var array = data;
fixed (double* ptr = array)
{
for (i = 0; i <= array.Length - vectorSize; i += vectorSize)
{
var v = Avx.LoadVector256(ptr + i);
accVector = Avx.Add(accVector, v);
}
}
double result = 0;
var temp = stackalloc double[vectorSize];
Avx.Store(temp, accVector);
for (int j = 0; j < vectorSize; j++)
{
result += temp[j];
}
for (; i < array.Length; i++)
{
result += array[i];
}
return result;
}
| Method | Mean | Error | StdDev |
|----------- |---------:|---------:|---------:|
| Native | 92.92 us | 1.547 us | 1.447 us |
| Intrinsics | 25.06 us | 0.459 us | 1.090 us |

Related

Convert string to number without built in functions

I'm told to implement fucntion which takes string as a parameter and returns int. Here is how i implemented, but the question is that my implementation is ugly and would like to see other implementations of this function. According to conditions you're not allowed to use built in fucntions such tryParse, parse, or something what does all the job for you.
My implementation:
private static int StringToNumber(string str)
{
int number = 0;
if (str.Contains('-'))
{
foreach (var character in str)
{
if (character == '-')
{
continue;
}
number += character - '0';
number *= 10;
}
number *= (-1);
number /= 10;
}
else
{
foreach (var character in str)
{
number *= 10;
number += character - '0';
}
}
return number;
}

Here is a variant combining OP's char subtraction solution and #sitholewb solution, that is somewhat optimized.
public static int StringToIntCharSubtraction(string str)
{
if (string.IsNullOrWhiteSpace(str))
{
//invalid input, do something
return 0;
}
var num = 0;
var sign = 1;
int i = 0;
if (str[0] == '-')
{
sign = -1;
i = 1;
}
while (i < str.Length)
{
int currentNum = (str[i] - '0');
if (currentNum > 9 || currentNum < 0)
{
//do something else or ignore
continue;
}
num = (num * 10) + currentNum;
i++;
}
return num * sign;
}
If you are worried about performance here is benchmark.
| Method | number | Mean | Error | StdDev | Ratio | Rank | Allocated |
|----------------------------- |-------- |----------:|----------:|----------:|------:|-----:|----------:|
| StringToIntCharSubtraction | 220567 | 6.310 ns | 0.0637 ns | 0.0565 ns | 0.44 | 1 | - |
| StringToIntSwitch | 220567 | 13.824 ns | 0.3083 ns | 0.2884 ns | 0.96 | 2 | - |
| int.Parse | 220567 | 14.345 ns | 0.0883 ns | 0.0782 ns | 1.00 | 3 | - |
| | | | | | | | |
| StringToIntCharSubtraction | -829304 | 6.413 ns | 0.0556 ns | 0.0492 ns | 0.45 | 1 | - |
| StringToIntSwitch | -829304 | 12.896 ns | 0.2711 ns | 0.2784 ns | 0.90 | 2 | - |
| int.Parse | -829304 | 14.272 ns | 0.2637 ns | 0.2467 ns | 1.00 | 3 | - |
You can even drop the first one to 3 ns if you remove the validations, but it seems too risky for me.

You can use the approach following approach to solve this as well.
private static int StringToInt(string str)
{
if (string.IsNullOrWhiteSpace(str) || str.Length == 0)
{
//invalid input, do something
return 0;
}
var num = 0;
var sign = 1;
if (str[0] == '-')
{
sign = -1;
str = str.Substring(1);
}
foreach (var c in str)
{
switch (c)
{
case '0':
num = (num * 10);
break;
case '1':
num = (num * 10) + 1;
break;
case '2':
num = (num * 10) + 2;
break;
case '3':
num = (num * 10) + 3;
break;
case '4':
num = (num * 10) + 4;
break;
case '5':
num = (num * 10) + 5;
break;
case '6':
num = (num * 10) + 6;
break;
case '7':
num = (num * 10) + 7;
break;
case '8':
num = (num * 10) + 8;
break;
case '9':
num = (num * 10) + 9;
break;
default:
//do something else or ignore
break;
}
}
return num * sign;
}

Fastest way to multiply and sum/add two arrays (dot product) - unaligned surprisingly faster than FMA

Hi I have the following code:
public unsafe class MultiplyAndAdd : IDisposable
{
float[] rawFirstData = new float[1024];
float[] rawSecondData = new float[1024];
static int alignment = 32;
float[] alignedFirstData = new float[1024 + alignment / sizeof(float)];
int alignedFirstDataOffset;
GCHandle alignedFirstDataHandle;
float* alignedFirstDataPointer;
float[] alignedSecondData = new float[1024 + alignment / sizeof(float)];
int alignedSecondDataOffset;
GCHandle alignedSecondDataHandle;
float* alignedSecondDataPointer;
public IEnumerable<object[]> Data { get; set; }
public void Dispose()
{
this.alignedFirstDataHandle.Free();
this.alignedSecondDataHandle.Free();
}
//Calculate the offset that needs to be applied to ensure that the array is aligned with 32.
private int CalculateAlignmentOffset(GCHandle handle)
{
var handlePointer = handle.AddrOfPinnedObject().ToInt64();
long lPtr2 = (handlePointer + alignment - 1) & ~(alignment - 1);
return (int)(lPtr2 - handlePointer);
}
public MultiplyAndAdd()
{
Random random = new Random(1055);
for (var i = 0; i < 1024; i++)
{
rawFirstData[i] = (float)random.NextDouble() * 4f - 2f;
rawSecondData[i] = (float)random.NextDouble() * 4f - 2f;
}
alignedFirstDataHandle = GCHandle.Alloc(alignedFirstData, GCHandleType.Pinned);
alignedFirstDataOffset = CalculateAlignmentOffset(alignedFirstDataHandle);
alignedFirstDataPointer = (float*)(alignedFirstDataHandle.AddrOfPinnedObject() + alignedFirstDataOffset);
alignedSecondDataHandle = GCHandle.Alloc(alignedSecondData, GCHandleType.Pinned);
alignedSecondDataOffset = CalculateAlignmentOffset(alignedSecondDataHandle);
alignedSecondDataPointer = (float*)(alignedSecondDataHandle.AddrOfPinnedObject() + alignedSecondDataOffset);
for (var i = 0; i < 1024; i++)
{
alignedFirstData[i + alignedFirstDataOffset / sizeof(float)] = rawFirstData[i];
alignedSecondData[i + alignedSecondDataOffset / sizeof(float)] = rawSecondData[i];
}
Data = new[] {
//7,
8,
//11,
//16,
20,
//30,
32,
//40,
50 }.Select(x => new object[] { x }).ToList();
}
public void Validate()
{
for(var i = 0; i < 1024; i++)
{
if (rawFirstData[i] != alignedFirstData[i + alignedFirstDataOffset / sizeof(float)])
{
throw new InvalidOperationException("Diff found!");
}
if (rawFirstData[i] != *(alignedFirstDataPointer + i))
{
throw new InvalidOperationException("Diff found!");
}
if (rawSecondData[i] != alignedSecondData[i + alignedSecondDataOffset / sizeof(float)])
{
throw new InvalidOperationException("Diff found!");
}
if (rawSecondData[i] != *(alignedSecondDataPointer + i))
{
throw new InvalidOperationException("Diff found!");
}
}
Action<string, float, float> ensureAlmostSame = delegate (string name, float normal, float other)
{
var diff = MathF.Abs(normal - other);
if (diff > 0.00001)
{
throw new InvalidOperationException($"The difference between normal and {name} was {diff}");
}
};
foreach (var count in Data.Select(x => (int)x[0]))
{
var normal = Normal(count);
var vectorUnaligned = VectorUnaligned(count);
ensureAlmostSame(nameof(vectorUnaligned), normal, vectorUnaligned);
var vectorAligned = VectorAligned(count);
ensureAlmostSame(nameof(vectorAligned), normal, vectorAligned);
var avx2Aligned = Avx2Aligned(count);
ensureAlmostSame(nameof(avx2Aligned), normal, avx2Aligned);
var fmaAligned = FmaAligned(count);
ensureAlmostSame(nameof(fmaAligned), normal, fmaAligned);
}
}
//[Benchmark(Baseline = true)]
[ArgumentsSource(nameof(Data))]
public float Normal(int count)
{
var result = 0f;
for (var i = 0; i < count; i++)
{
result += rawFirstData[i] * rawSecondData[i];
}
return result;
}
[Benchmark]
[ArgumentsSource(nameof(Data))]
public float VectorUnaligned(int count)
{
int vectorSize = Vector<float>.Count;
var accVector = Vector<float>.Zero;
int i = 0;
for (; i <= count - vectorSize; i += vectorSize)
{
var firstVector = new Vector<float>(rawFirstData, i);
var secondVector = new Vector<float>(rawSecondData, i);
var v = Vector.Multiply(firstVector, secondVector);
accVector = Vector.Add(v, accVector);
}
float result = Vector.Sum(accVector);
for (; i < count; i++)
{
result += rawFirstData[i] * rawSecondData[i];
}
return result;
}
//[Benchmark]
[ArgumentsSource(nameof(Data))]
public float VectorAligned(int count)
{
int vectorSize = Vector<float>.Count;
var accVector = Vector<float>.Zero;
int i = 0;
for (; i <= count - vectorSize; i += vectorSize)
{
var firstVector = new Vector<float>(alignedFirstData, alignedFirstDataOffset / sizeof(float) + i);
var secondVector = new Vector<float>(alignedSecondData, alignedSecondDataOffset / sizeof(float) + i);
var v = Vector.Multiply(firstVector, secondVector);
accVector = Vector.Add(v, accVector);
}
float result = Vector.Sum(accVector);
for (; i < count; i++)
{
result += rawFirstData[i] * rawSecondData[i];
}
return result;
}
[Benchmark]
[ArgumentsSource(nameof(Data))]
public float Avx2Aligned(int count)
{
int vectorSize = Vector256<float>.Count;
var accumulationVector = Vector256<float>.Zero;
var i = 0;
for (;i <= count - vectorSize; i += vectorSize)
{
var firstVector = Avx2.LoadAlignedVector256(alignedFirstDataPointer + i);
var secondVector = Avx2.LoadAlignedVector256(alignedSecondDataPointer + i);
var resultVector = Avx2.Multiply(firstVector, secondVector);
accumulationVector = Avx2.Add(accumulationVector, resultVector);
}
var result = 0f;
var temp = stackalloc float[vectorSize];
Avx2.Store(temp, accumulationVector);
for (int j = 0; j < vectorSize; j++)
{
result += temp[j];
}
for (; i < count; i++)
{
result += *(alignedFirstDataPointer + i) * *(alignedSecondDataPointer + i);
}
return result;
}
[Benchmark]
[ArgumentsSource(nameof(Data))]
public float FmaAligned(int count)
{
int vectorSize = Vector256<float>.Count;
var accumulationVector = Vector256<float>.Zero;
var i = 0;
for (; i <= count - vectorSize; i += vectorSize)
{
var firstVector = Avx2.LoadAlignedVector256(alignedFirstDataPointer + i);
var secondVector = Avx2.LoadAlignedVector256(alignedSecondDataPointer + i);
accumulationVector = Fma.MultiplyAdd(firstVector, secondVector, accumulationVector);
}
var result = 0f;
var temp = stackalloc float[vectorSize];
Avx2.Store(temp, accumulationVector);
for (int j = 0; j < vectorSize; j++)
{
result += temp[j];
}
for (; i < count; i++)
{
result += *(alignedFirstDataPointer + i) * *(alignedSecondDataPointer + i);
}
return result;
}
}
If I run this benchmark on my Zen3 CPU, I get the following result:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1586 (20H2/October2020Update)
AMD Ryzen 5 5600X, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.200
[Host] : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
DefaultJob : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
| Method | count | Mean | Error | StdDev |
|---------------- |------ |---------:|----------:|----------:|
| VectorUnaligned | 8 | 1.231 ns | 0.0093 ns | 0.0082 ns |
| Avx2Aligned | 8 | 3.576 ns | 0.0208 ns | 0.0195 ns |
| FmaAligned | 8 | 3.408 ns | 0.0259 ns | 0.0243 ns |
| VectorUnaligned | 20 | 4.428 ns | 0.0146 ns | 0.0122 ns |
| Avx2Aligned | 20 | 6.321 ns | 0.0578 ns | 0.0541 ns |
| FmaAligned | 20 | 5.845 ns | 0.0121 ns | 0.0113 ns |
| VectorUnaligned | 32 | 4.022 ns | 0.0098 ns | 0.0087 ns |
| Avx2Aligned | 32 | 5.205 ns | 0.0161 ns | 0.0150 ns |
| FmaAligned | 32 | 4.776 ns | 0.0265 ns | 0.0221 ns |
| VectorUnaligned | 50 | 6.901 ns | 0.0337 ns | 0.0315 ns |
| Avx2Aligned | 50 | 7.207 ns | 0.0476 ns | 0.0422 ns |
| FmaAligned | 50 | 7.246 ns | 0.0169 ns | 0.0158 ns |
Why is VectorUnaligned so much faster that the more optimized AVX2 and Fma code?
If I enable VectorAligned its also slower than VectorUnaligned.

Not an answer but a tip for "Fastest way to multiply".
Sorry, I don't know how to deal with alignment but you missed the option of casting the array type. It might be faster than picking floats from source arrays in the loop.
int vectorSize = Vector<float>.Count;
var accVector = Vector<float>.Zero;
Span<Vector<float>> firstVectors = MemoryMarshal.Cast<float, Vector<float>>(rawFirstData);
Span<Vector<float>> secondVectors = MemoryMarshal.Cast<float, Vector<float>>(rawSecondData);
for (int i = 0; i < firstVectors.Length; i++)
{
accVector += Vector.Multiply(firstVectors[i], secondVectors[i]);
}
float result = Vector.Sum(accVector);
for (int i = firstVectors.Length * vectorSize; i < count; i++)
{
result += rawFirstData[i] * rawSecondData[i];
}
It makes a bit more JIT Assembler code than VectorUnaligned method but the first loop looks like twice shorter because if contains only one out-of-range check instead of 4. Give it a chance to test with different types of vectors and alignment.
this one
L0080: movsxd rsi, r11d
L0083: shl rsi, 5
L0087: vmovupd ymm1, [r8+rsi]
L008d: cmp r11d, r9d
L0090: jae short L00ff ; throw out-of-range
L0092: vmovupd ymm2, [r10+rsi]
L0098: vmulps ymm1, ymm1, ymm2
L009c: vaddps ymm0, ymm0, ymm1
L00a0: inc r11d
L00a3: cmp r11d, edx
L00a6: jl short L0080
VectorUnaligned loop, looks like JIT failed to optimize it
L0020: mov r8, rdx
L0023: cmp eax, [r8+8]
L0027: jae L00c3 ; throw out-of-range
L002d: lea r9d, [rax+7]
L0031: cmp r9d, [r8+8]
L0035: jae L00c3 ; throw out-of-range
L003b: vmovupd ymm1, [r8+rax*4+0x10]
L0042: mov r8, [rcx+0x10]
L0046: cmp eax, [r8+8]
L004a: jae L00c3 ; throw out-of-range
L0050: cmp r9d, [r8+8]
L0054: jae short L00c3 ; throw out-of-range
L0056: vmovupd ymm2, [r8+rax*4+0x10]
L005d: vmulps ymm1, ymm1, ymm2
L0061: vaddps ymm0, ymm1, ymm0
L0065: add eax, 8
L0068: mov r8d, [rdx+8]
L006c: sub r8d, 8
L0070: cmp r8d, eax
L0073: jge short L0020
Compiled code got from https://sharplab.io/. Real generated code may vary from CPU to CPU because Vector<T>.Count on certain CPUs may vary.

System.Numerics.Vector<T> Initialization Performance on .NET Framework

System.Numerics.Vector brings SIMD support to .NET Core and .NET Framework. It works on .NET Framework 4.6+ and .NET Core.
// Baseline
public void SimpleSumArray()
{
for (int i = 0; i < left.Length; i++)
results[i] = left[i] + right[i];
}
// Using Vector<T> for SIMD support
public void SimpleSumVectors()
{
int ceiling = left.Length / floatSlots * floatSlots;
for (int i = 0; i < ceiling; i += floatSlots)
{
Vector<float> v1 = new Vector<float>(left, i);
Vector<float> v2 = new Vector<float>(right, i);
(v1 + v2).CopyTo(results, i);
}
for (int i = ceiling; i < left.Length; i++)
{
results[i] = left[i] + right[i];
}
}
Unfortunately, the initialization of the Vector can be the limiting step. To work around this, several sources recommend using MemoryMarshal to transform the source array into an array of Vectors [1][2]. For example:
// Improving Vector<T> Initialization Performance
public void SimpleSumVectorsNoCopy()
{
int numVectors = left.Length / floatSlots;
int ceiling = numVectors * floatSlots;
// leftMemory is simply a ReadOnlyMemory<float> referring to the "left" array
ReadOnlySpan<Vector<float>> leftVecArray = MemoryMarshal.Cast<float, Vector<float>>(leftMemory.Span);
ReadOnlySpan<Vector<float>> rightVecArray = MemoryMarshal.Cast<float, Vector<float>>(rightMemory.Span);
Span<Vector<float>> resultsVecArray = MemoryMarshal.Cast<float, Vector<float>>(resultsMemory.Span);
for (int i = 0; i < numVectors; i++)
resultsVecArray[i] = leftVecArray[i] + rightVecArray[i];
}
This brings a dramatic improvement in performance when running on .NET Core:
| Method | Mean | Error | StdDev |
|----------------------- |----------:|----------:|----------:|
| SimpleSumArray | 165.90 us | 0.1393 us | 0.1303 us |
| SimpleSumVectors | 53.69 us | 0.0473 us | 0.0443 us |
| SimpleSumVectorsNoCopy | 31.65 us | 0.1242 us | 0.1162 us |
Unfortunately, on .NET Framework, this way of initializing the vector has the opposite effect. It actually leads to worse performance:
| Method | Mean | Error | StdDev |
|----------------------- |----------:|---------:|---------:|
| SimpleSumArray | 152.92 us | 0.128 us | 0.114 us |
| SimpleSumVectors | 52.35 us | 0.041 us | 0.038 us |
| SimpleSumVectorsNoCopy | 77.50 us | 0.089 us | 0.084 us |
Is there a way to optimize the initialization of Vector on .NET Framework and get similar performance to .NET Core? Measurements have been performed using this sample application [1].
[1] https://github.com/CBGonzalez/SIMDPerformance
[2] https://stackoverflow.com/a/62702334/430935

As far as I know, the only efficient way to load a vector in .NET Framework 4.6 or 4.7 (presumably this will all change in 5.0) is with unsafe code, for example using Unsafe.Read<Vector<float>> (or its unaliged variant if applicable):
public unsafe void SimpleSumVectors()
{
int ceiling = left.Length / floatSlots * floatSlots;
fixed (float* leftp = left, rightp = right, resultsp = results)
{
for (int i = 0; i < ceiling; i += floatSlots)
{
Unsafe.Write(resultsp + i,
Unsafe.Read<Vector<float>>(leftp + i) + Unsafe.Read<Vector<float>>(rightp + i));
}
}
for (int i = ceiling; i < left.Length; i++)
{
results[i] = left[i] + right[i];
}
}
This uses the System.Runtime.CompilerServices.Unsafe package which you can get via NuGet, but it could be done without that too.

More efficient algorithm for printing numbers that are palindromic and their power to 2 are palindromic too

I am looking for more efficient algorithm for printing numbers that are palindromic (for example 1001) and their power to 2 (1001 * 1001 = 1002001) are palindromic too. In my algorithm I think I make unnecessary checks to determine if number is palindromic. How can I improve it?
In [1000,9999] range I found this kind of 3 numbers: 1001, 1111 and 2002.
This is my algorithm:
for (int i = n; i <= m; i++)
{
if (checkIfPalindromic(i.ToString()))
{
if (checkIfPalindromic((i * i).ToString()))
Console.WriteLine(i);
}
}
this is my method to determine if number is palindromic:
static bool checkIfPalindromic(string A)
{
int n = A.Length - 1;
int i = 0;
bool IsPalindromic = true;
while (i < (n - i))
{
if (A[i] != A[n - i])
{
IsPalindromic = false;
break;
}
i++;
}
return IsPalindromic;
}

Instead of checking very number for "palindromness", it may be better to iterate through palindromes only. For that just iterate over the first halves of the number and then compose palindrome from it.
for(int half=10;half<=99;++half)
{
const int candidate=half*100+Reverse(half);//may need modification for odd number of digits
if(IsPalindrome(candidate*candidate))
Output(candidate);
}
This will make your program O(sqrt(m)) instead of O(m), which will probably beat all improvements of constant factors.

What you have already seems fairly efficient
Scale is checking 1,000,000 integers
Note : i use longs
Disclaimer : I must admit these results are a little sketchy, ive added more scaling so you can see
Results
Mode : Release
Test Framework : .Net 4.7.1
Benchmarks runs : 10 times (averaged)
Scale : 1,000
Name | Average | Fastest | StDv | Cycles | Pass | Gain
-----------------------------------------------------------------
Mine2 | 0.107 ms | 0.102 ms | 0.01 | 358,770 | Yes | 5.83 %
Original | 0.114 ms | 0.098 ms | 0.05 | 361,810 | Base | 0.00 %
Mine | 0.120 ms | 0.100 ms | 0.03 | 399,935 | Yes | -5.36 %
Scale : 10,000
Name | Average | Fastest | StDv | Cycles | Pass | Gain
-------------------------------------------------------------------
Mine2 | 1.042 ms | 0.944 ms | 0.17 | 3,526,050 | Yes | 11.69 %
Mine | 1.073 ms | 0.936 ms | 0.19 | 3,633,369 | Yes | 9.06 %
Original | 1.180 ms | 0.920 ms | 0.29 | 3,964,418 | Base | 0.00 %
Scale : 100,000
Name | Average | Fastest | StDv | Cycles | Pass | Gain
--------------------------------------------------------------------
Mine2 | 10.406 ms | 9.502 ms | 0.91 | 35,341,208 | Yes | 6.59 %
Mine | 10.479 ms | 9.332 ms | 1.09 | 35,592,718 | Yes | 5.93 %
Original | 11.140 ms | 9.272 ms | 1.72 | 37,624,494 | Base | 0.00 %
Scale : 1,000,000
Name | Average | Fastest | StDv | Cycles | Pass | Gain
-------------------------------------------------------------------------
Original | 106.271 ms | 101.662 ms | 3.61 | 360,996,200 | Base | 0.00 %
Mine | 107.559 ms | 102.695 ms | 5.35 | 365,525,239 | Yes | -1.21 %
Mine2 | 108.757 ms | 104.530 ms | 4.81 | 368,939,992 | Yes | -2.34 %
Mode : Release
Test Framework : .Net Core 2.0
Benchmarks runs : 10 times (averaged)
Scale : 1,000,000
Name | Average | Fastest | StDv | Cycles | Pass | Gain
-------------------------------------------------------------------------
Mine2 | 95.054 ms | 87.144 ms | 8.45 | 322,650,489 | Yes | 10.54 %
Mine | 95.849 ms | 89.971 ms | 5.38 | 325,315,589 | Yes | 9.79 %
Original | 106.251 ms | 84.833 ms | 17.97 | 350,106,144 | Base | 0.00 %
Given
protected override List<int> InternalRun()
{
var results = new List<int>();
for (var i = 0; i <= Input; i++)
if (checkIfPalindromic(i) && checkIfPalindromic(i * (long)i))
results.Add(i);
return results;
}
Mine1
private static unsafe bool checkIfPalindromic(long value)
{
var str = value.ToString();
fixed (char* pStr = str)
{
for (char* p = pStr, p2 = pStr + str.Length - 1; p < p2;)
if (*p++ != *p2--)
return false;
}
return true;
}
Mine2
private static bool checkIfPalindromic(long value)
{
var str = value.ToString();
var n = str.Length - 1;
for (var i = 0; i < n - i; i++)
if (str[i] != str[n - i])
return false;
return true;
}

More optimistic way is to use int instead of string. this algorithm is about two time faster:
static int[] pow10 = { 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000 };
static bool checkIfPalindromic(int A)
{
int n = 1;
int i = A;
if (i >= 100000000) { n += 8; i /= 100000000; }
if (i >= 10000) { n += 4; i /= 10000; }
if (i >= 100) { n += 2; i /= 100; }
if (i >= 10) { n++; }
int num = A / pow10[(n+1) / 2];
for (; num % 10 == 0;)
num /= 10;
int reversedNum = 0;
for (int input = A % pow10[ n / 2]; input != 0; input /= 10)
reversedNum = reversedNum * 10 + input % 10;
return num == reversedNum;
}
Usage:
for (int i = n; i <= m; i++)
if (checkIfPalindromic(i) && checkIfPalindromic(i * i))
Console.WriteLine(i);
Benchmark:
Bemchmark in range of [1000, 99999999] on Core2Duo CPU:
This algorithm: 12261ms
Your algorithm: 24181ms
Palindromic Numbers:
1001
1111
2002
10001
10101
10201
11011
11111
11211
20002
20102

you can use Linq to simplify your code
sample:-
static void Main(string[] args)
{
int n = 1000, m = 9999;
for (int i = n; i <= m; i++)
{
if (CheckIfNoAndPowerPalindromic(i))
{
Console.WriteLine(i);
}
}
}
private static bool CheckIfNoAndPowerPalindromic(int number)
{
string numberString = number.ToString();
string numberSquareString = (number * number).ToString();
return (Enumerable.SequenceEqual(numberString.ToCharArray(), numberString.ToCharArray().Reverse()) &&
Enumerable.SequenceEqual(numberSquareString.ToCharArray(), numberSquareString.ToCharArray().Reverse()));
}
output:-
1001
1111
2002.

Loop up to len/2 as follow:
static bool checkIfPalindromic(string A)
{
for (int i = 0; i < A.Length / 2; i++)
if (A[i] != A[A.Length - i - 1])
return false;
return true;
}

We can get an interesting optimisation by changing the palindromic checking method and using a direct integer reversing method instead of converting first to a string then looping in the string.
I used the method in the accepted answer from this question:
static int reverse(int n)
{
int left = n;
int rev = 0;
int r = 0;
while (left > 0)
{
r = left % 10;
rev = rev * 10 + r;
left = left / 10;
}
return rev;
}
I also used the StopWatch from System.Diagnostics to measure the elapsed time.
My function to check if a number is a palindromic number is:
static bool IsPalindromicNumber(int number)
{
return reverse(number) == number;
}
For n value of 1000 and for different values of m I get the following results for the elapsed time in milliseconds:
---------------------------------------------------------
| m | original | mine | optimisation|
---------------------------------------------------------
|9999 |6.3855 |4.2171 | -33.95% |
---------------------------------------------------------
|99999 |71.3961 |42.3399 | -40.69% |
---------------------------------------------------------
|999999 |524.4921 |342.8899 | -34.62% |
---------------------------------------------------------
|9999999 |7016.4050 |4565.4563 | -34.93% |
---------------------------------------------------------
|99999999 |71319.658 |49837.5632 | -30.12% |
---------------------------------------------------------
The measured values are an indicative and not absolute because from one run of the program to another they are different but the pattern stays the same and the second approach appears always faster.
To measure using the StopWatch:
With your method:
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
for (int i = n; i <= m; i++)
{
if (checkIfPalindromic(i.ToString()))
{
if (checkIfPalindromic((i * i).ToString()))
Console.WriteLine(i);
}
}
stopWatch.Stop();
Console.WriteLine("First approach: Elapsed time..." + stopWatch.Elapsed + " which is " + stopWatch.Elapsed.TotalMilliseconds + " miliseconds");
I used of course exact same approach with my changes:
With my method:
Stopwatch stopWatch2 = new Stopwatch();
stopWatch2.Start();
for (int i = n; i <= m; i++)
{
if (IsPalindromicNumber(i) && IsPalindromicNumber(i*i))
{
Console.WriteLine(i);
}
}
stopWatch2.Stop();
Console.WriteLine("Second approach: Elapsed time..." + stopWatch2.Elapsed + " which is " + stopWatch2.Elapsed.TotalMilliseconds + " miliseconds");

Fill a multidimensional array with same values C#

Is there a faster way of doing this using C#?
double[,] myArray = new double[length1, length2];
for(int i=0;i<length1;i++)
for(int j=0;j<length2;j++)
myArray[i,j] = double.PositiveInfinity;
I remember using C++, there was something called memset() for doing these kind of things...

A multi-dimensional array is just a large block of memory, so we can treat it like one, similar to how memset() works. This requires unsafe code. I wouldn't say it's worth doing unless it's really performance critical. This is a fun exercise, though, so here are some benchmarks using BenchmarkDotNet:
public class ArrayFillBenchmark
{
const int length1 = 1000;
const int length2 = 1000;
readonly double[,] _myArray = new double[length1, length2];
[Benchmark]
public void MultidimensionalArrayLoop()
{
for (int i = 0; i < length1; i++)
for (int j = 0; j < length2; j++)
_myArray[i, j] = double.PositiveInfinity;
}
[Benchmark]
public unsafe void MultidimensionalArrayNaiveUnsafeLoop()
{
fixed (double* a = &_myArray[0, 0])
{
double* b = a;
for (int i = 0; i < length1; i++)
for (int j = 0; j < length2; j++)
*b++ = double.PositiveInfinity;
}
}
[Benchmark]
public unsafe void MultidimensionalSpanFill()
{
fixed (double* a = &_myArray[0, 0])
{
double* b = a;
var span = new Span<double>(b, length1 * length2);
span.Fill(double.PositiveInfinity);
}
}
[Benchmark]
public unsafe void MultidimensionalSseFill()
{
var vectorPositiveInfinity = Vector128.Create(double.PositiveInfinity);
fixed (double* a = &_myArray[0, 0])
{
double* b = a;
ulong i = 0;
int size = Vector128<double>.Count;
ulong length = length1 * length2;
for (; i < (length & ~(ulong)15); i += 16)
{
Sse2.Store(b+size*0, vectorPositiveInfinity);
Sse2.Store(b+size*1, vectorPositiveInfinity);
Sse2.Store(b+size*2, vectorPositiveInfinity);
Sse2.Store(b+size*3, vectorPositiveInfinity);
Sse2.Store(b+size*4, vectorPositiveInfinity);
Sse2.Store(b+size*5, vectorPositiveInfinity);
Sse2.Store(b+size*6, vectorPositiveInfinity);
Sse2.Store(b+size*7, vectorPositiveInfinity);
b += size*8;
}
for (; i < (length & ~(ulong)7); i += 8)
{
Sse2.Store(b+size*0, vectorPositiveInfinity);
Sse2.Store(b+size*1, vectorPositiveInfinity);
Sse2.Store(b+size*2, vectorPositiveInfinity);
Sse2.Store(b+size*3, vectorPositiveInfinity);
b += size*4;
}
for (; i < (length & ~(ulong)3); i += 4)
{
Sse2.Store(b+size*0, vectorPositiveInfinity);
Sse2.Store(b+size*1, vectorPositiveInfinity);
b += size*2;
}
for (; i < length; i++)
{
*b++ = double.PositiveInfinity;
}
}
}
}
Results:
| Method | Mean | Error | StdDev | Ratio |
|------------------------------------- |-----------:|----------:|----------:|------:|
| MultidimensionalArrayLoop | 1,083.1 us | 11.797 us | 11.035 us | 1.00 |
| MultidimensionalArrayNaiveUnsafeLoop | 436.2 us | 8.567 us | 8.414 us | 0.40 |
| MultidimensionalSpanFill | 321.2 us | 6.404 us | 10.875 us | 0.30 |
| MultidimensionalSseFill | 231.9 us | 4.616 us | 11.323 us | 0.22 |
MultidimensionalArrayLoop is slow because of bounds checking. The JIT emits code each loop that makes sure that [i, j] is inside the bounds of the array. The JIT can elide bounds checking sometimes, I know it does for single-dimensional arrays. I'm not sure if it does it for multi-dimensional.
MultidimensionalArrayNaiveUnsafeLoop is essentially the same code as MultidimensionalArrayLoop but without bounds checking. It's considerably faster, taking 40% of the time. It's considered 'Naive', though, because the loop could still be improved by unrolling the loop.
MultidimensionalSpanFill also has no bounds check, and is more-or-less the same as MultidimensionalArrayNaiveUnsafeLoop, however, Span.Fill internally does loop unrolling, which is why it's a bit faster than our naive unsafe loop. It only take 30% of the time as our original.
MultidimensionalSseFill improves on our first unsafe loop by doing two things: loop unrolling and vectorizing. This requires a CPU with Sse2 support, but it allows us to write 128-bits (16 bytes) in a single instruction. This gives us an additional speed boost, taking it down to 22% of the original. Interestingly, this same loop with Avx (256-bits) was consistently slower than the Sse2 version, so that benchmark is not included here.
But these numbers only apply to an array that is 1000x1000. As you change the size of the array, the results differ. For example, when we change the array size to 10000x10000, the results for all of the unsafe benchmarks are very close. Probably because there are more memory fetches for the larger array that it tends to equalize the smaller iterative improvements seen in the last three benchmarks.
There's a lesson in there somewhere, but I mostly just wanted to share these results, since it was a pretty fun experiment to do.

I wrote the method that is not faster, but it works with actual multidimensional arrays, not only 2D.
public static class ArrayExtensions
{
public static void Fill(this Array array, object value)
{
var indicies = new int[array.Rank];
Fill(array, 0, indicies, value);
}
public static void Fill(Array array, int dimension, int[] indicies, object value)
{
if (dimension < array.Rank)
{
for (int i = array.GetLowerBound(dimension); i <= array.GetUpperBound(dimension); i++)
{
indicies[dimension] = i;
Fill(array, dimension + 1, indicies, value);
}
}
else
array.SetValue(value, indicies);
}
}

double[,] myArray = new double[x, y];
if( parallel == true )
{
stopWatch.Start();
System.Threading.Tasks.Parallel.For( 0, x, i =>
{
for( int j = 0; j < y; ++j )
myArray[i, j] = double.PositiveInfinity;
});
stopWatch.Stop();
Print( "Elapsed milliseconds: {0}", stopWatch.ElapsedMilliseconds );
}
else
{
stopWatch.Start();
for( int i = 0; i < x; ++i )
for( int j = 0; j < y; ++j )
myArray[i, j] = double.PositiveInfinity;
stopWatch.Stop();
Print("Elapsed milliseconds: {0}", stopWatch.ElapsedMilliseconds);
}
When setting x and y to 10000 I get 553 milliseconds for the single-threaded approach and 170 for the multi-threaded one.

There is a possibility to quickly fill an md-array that does not use the keyword unsafe (see answers for this question)

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Why my use of Intrinsics API is slower than naive solution? - c#

Related

Convert string to number without built in functions

Fastest way to multiply and sum/add two arrays (dot product) - unaligned surprisingly faster than FMA

System.Numerics.Vector<T> Initialization Performance on .NET Framework

More efficient algorithm for printing numbers that are palindromic and their power to 2 are palindromic too

Fill a multidimensional array with same values C#

Categories

Resources