I need the fastest way to process math on numerical array data - c#

I apologize if this is in the incorrect forum. I have found a lot of array-manipulation questions on this site, but most of them average or sum an array of numerics as a set using LINQ, which works well across all values of a single array. I need to process each index over multiple arrays (of the same size).
My routine receives array data from devices, typically double[512] or ushort[512]. A single device always produces arrays of the same size, but the size can range from 256 to 2048 depending on the device. I need to hold CountToAverage arrays at a time for averaging. Each time an array is received, it must push to and pop from the queue to keep the number of arrays in the averaging process constant (this part of the process is fixed in Setup() for this benchmark testing). For comparison purposes, the benchmark results are shown after the code.
What I am looking for is the fastest, most efficient way to average the values at each index across all the arrays and return a new array (of the same size) where each index is the average of that index over the set of arrays. The count of arrays to be averaged can range from 3 to 25 (the code below sets the benchmark parameter to 10). I have two different averaging methods in the test; the second is significantly faster, 6-7 times faster than the first. My first question is: is there any way to achieve this faster, at O(1) or O(log n) time complexity?
Secondarily, I am using a Queue (which may be changed to ConcurrentQueue in the implementation) as a holder for the arrays to be processed. My primary reason for using a queue is that it guarantees FIFO processing of the feed of arrays, which is critical. Also, I can iterate the values in the Queue with a foreach loop (just like a List) without having to dequeue until I am ready. I would be interested to know whether this hinders performance, as I haven't benchmarked it. Keep in mind it must be thread-safe. If you have an alternative way to process multiple sets of array data in a thread-safe manner, I am all ears.
The reason for the performance requirement is that this is not the only processing happening. I have multiple devices streaming array results at roughly one every 1-5 milliseconds each, coming from different threads/processes/connections, and the data still has several other, much more intensive algorithms to go through, so this cannot be a bottleneck.
Any insights on optimizations and performance are appreciated.
using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace ArrayAverage
{
    public class ArrayAverage
    {
        [Params(10)]
        public int CountToAverage;

        [Params(512, 2048)]
        public int PixelSize;

        static Queue<double[]> calcRepo = new Queue<double[]>();

        [Benchmark]
        public double[] CalculateIndexAverages()
        {
            // This is too slow
            var avg = new double[PixelSize];
            for (int i = 0; i < PixelSize; i++)
            {
                foreach (var arrayData in calcRepo)
                {
                    avg[i] += arrayData[i];
                }
                avg[i] /= calcRepo.Count;
            }
            return avg;
        }

        [Benchmark]
        public double[] CalculateIndexAverages2()
        {
            // This is faster, but is it the fastest?
            var sum = new double[PixelSize];
            int cnt = calcRepo.Count;
            foreach (var arrayData in calcRepo)
            {
                for (int i = 0; i < PixelSize; i++)
                {
                    sum[i] += arrayData[i];
                }
            }
            var avg = new double[PixelSize];
            for (int i = 0; i < PixelSize; i++)
            {
                avg[i] = sum[i] / cnt;
            }
            return avg;
        }

        [GlobalSetup]
        public void Setup()
        {
            // Generate some data: a simple triangular curve simulating a range of spectra
            for (double offset = 0; offset < CountToAverage; offset++)
            {
                var values = new double[PixelSize];
                var decrement = 0;
                for (int i = 0; i < PixelSize; i++)
                {
                    if (i > (PixelSize / 2))
                        decrement--;
                    values[i] = (offset / 7) + i + (decrement * 2);
                }
                calcRepo.Enqueue(values);
            }
        }
    }

    public class App
    {
        public static void Main()
        {
            BenchmarkRunner.Run<ArrayAverage>();
        }
    }
}
Benchmark results:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1348 (21H1/May2021Update)
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.7.21379.14
  [Host]     : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT [AttachedDebugger]
  DefaultJob : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT

| Method                  | Arrays To Average | Array Size |       Mean |     Error |    StdDev |
|-------------------------|-------------------|------------|-----------:|----------:|----------:|
| CalculateIndexAverages  |                10 |        512 |  32.164 μs | 0.5485 μs | 0.5130 μs |
| CalculateIndexAverages2 |                10 |        512 |   5.792 μs | 0.1135 μs | 0.2241 μs |
| CalculateIndexAverages  |                10 |       2048 | 123.628 μs | 2.3394 μs | 1.9535 μs |
| CalculateIndexAverages2 |                10 |       2048 |  22.311 μs | 0.4366 μs | 0.8093 μs |

When dealing with simple operations on a large amount of data, you'd be very interested in SIMD:
SIMD stands for "single instruction, multiple data". It's a set of processor instructions that allows mathematical operations to execute over a set of values in parallel.
In your particular case, using the Vector<T> example would give you a quick win. Naively converting your fastest method to use Vector<T> already gives a ~2x speed-up on my PC.
public double[] CalculateIndexAverages4()
{
    // Assumption: PixelSize is a round multiple of Vector<double>.Count.
    // If not, you'll have to add in the 'remainder' handling from the example.
    var batch = Vector<double>.Count;
    var sum = new double[PixelSize];
    foreach (var arrayData in calcRepo)
    {
        // Vectorised summing:
        for (int i = 0; i <= PixelSize - batch; i += batch)
        {
            var vSum = new Vector<double>(sum, i);
            var vData = new Vector<double>(arrayData, i);
            (vSum + vData).CopyTo(sum, i);
        }
    }
    var vCnt = Vector<double>.One * calcRepo.Count;
    // Reuse sum[] for averaging, so we don't incur memory allocation cost
    for (int i = 0; i <= PixelSize - batch; i += batch)
    {
        var vSum = new Vector<double>(sum, i);
        (vSum / vCnt).CopyTo(sum, i);
    }
    return sum;
}
The Vector<T>.Count gives you how many items are being parallelised into one instruction. In the case of double, it's likely to be 4 on most modern CPUs supporting AVX2.
If you're okay with losing precision and can go to float, you'll get a much bigger win by again doubling the amount of data processed in a single CPU op. All of this without even changing your algorithm.
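For illustration, here is a hedged sketch of what that float variant could look like, with the scalar remainder handled so it works for any array length; the class and method names and the driver data are my own, not from the original post:

```csharp
using System;
using System.Numerics;

class FloatAverageDemo
{
    // Sketch: float halves the element size, so Vector<float> packs twice
    // as many lanes as Vector<double> on the same hardware (e.g. 8 vs 4 with AVX2).
    public static float[] AverageVectorized(float[][] arrays, int pixelSize)
    {
        var sum = new float[pixelSize];
        int batch = Vector<float>.Count;
        foreach (var arrayData in arrays)
        {
            int i = 0;
            for (; i <= pixelSize - batch; i += batch)
            {
                var vSum = new Vector<float>(sum, i);
                var vData = new Vector<float>(arrayData, i);
                (vSum + vData).CopyTo(sum, i);
            }
            for (; i < pixelSize; i++)   // scalar remainder
                sum[i] += arrayData[i];
        }
        float cnt = arrays.Length;
        for (int i = 0; i < pixelSize; i++)
            sum[i] /= cnt;
        return sum;
    }

    static void Main()
    {
        var a = new float[] { 1, 2, 3, 4, 5, 6, 7, 8 };
        var b = new float[] { 3, 4, 5, 6, 7, 8, 9, 10 };
        var avg = AverageVectorized(new[] { a, b }, 8);
        Console.WriteLine(avg[0]); // 2
    }
}
```

Whether the precision loss is acceptable depends on the dynamic range of the device data; ushort inputs fit losslessly in float.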

You can further optimize the code by reducing memory allocations. If the method is called frequently, time spent in GC can dominate completely.
// Assuming the data fits on the stack. The default stack is 1 MB,
// so 2048 doubles (16 KB) per array is comfortably safe.
Span<double> sum = stackalloc double[PixelSize];
// ...
Span<double> avg = stackalloc double[PixelSize];
You can also remove the extra stack allocation of avg and simply reuse sum:
for (int i = 0; i < sum.Length; i++)
{
    sum[i] /= cnt;
}
// TODO: Avoid array allocation! Maybe use a pre-allocated array and fill it here.
return sum.ToArray();
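Assembled into one piece, an allocation-free version might look like the sketch below; the AverageInto helper, its caller-supplied result buffer, and the driver code are my own illustration, not the original benchmark code:

```csharp
using System;
using System.Collections.Generic;

class StackallocAverageDemo
{
    // Sketch: sum on the stack, divide in place, then copy into a
    // caller-supplied buffer so the hot path allocates nothing at all.
    public static void AverageInto(Queue<double[]> repo, double[] result)
    {
        int pixelSize = result.Length;
        Span<double> sum = stackalloc double[pixelSize]; // zero-initialized
        foreach (var arrayData in repo)
            for (int i = 0; i < pixelSize; i++)
                sum[i] += arrayData[i];
        int cnt = repo.Count;
        for (int i = 0; i < sum.Length; i++)
            sum[i] /= cnt;
        sum.CopyTo(result); // 'result' is a pre-allocated double[pixelSize]
    }

    static void Main()
    {
        var repo = new Queue<double[]>();
        repo.Enqueue(new double[] { 1, 2, 3 });
        repo.Enqueue(new double[] { 3, 4, 5 });
        var result = new double[3];
        AverageInto(repo, result);
        Console.WriteLine(result[1]); // (2 + 4) / 2 = 3
    }
}
```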

In my opinion this is already fairly well-optimized code. A major reason the second option is faster is that it accesses memory linearly instead of jumping between multiple different arrays. Another factor is that foreach loops have some overhead, so placing the foreach in the outer loop also helps a bit.
You might gain a little performance by switching the queue and foreach loop to a list/array and for loop, but since PixelSize is much larger than CountToAverage I would expect the benefit to be fairly small.
Unrolling the loop to process, say, 4 values at a time might help a bit. The JIT compiler can sometimes apply such optimizations automatically, but it is often difficult to tell which optimizations are actually applied, so it may be easier just to test.
The next step would be to look at parallelization. Simple summing code like this can benefit from SIMD, processing multiple values at a time. The linked article shows that processor-specific intrinsics can have a much larger benefit than the more general Vector<T>, but may require separate code paths for each platform you are targeting. It also has performance examples of summing values at various levels of optimization, with example code, so it is well worth a read.
Another option would be to use multiple threads with Parallel.For/ForEach, but at ~6 μs per call it is likely that the overhead would be larger than any gains unless the data is significantly larger.
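If the arrays ever do grow large enough to justify threads, a chunked Parallel.For could look roughly like the sketch below; the chunk size, helper name, and driver data are assumptions of mine, not from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelAverageDemo
{
    // Sketch: split the index range into chunks; each chunk sums and divides
    // its own slice of 'avg' independently, so no synchronization is needed.
    public static double[] AverageParallel(IReadOnlyList<double[]> arrays, int pixelSize)
    {
        var avg = new double[pixelSize];
        int cnt = arrays.Count;
        const int chunkSize = 4096; // tune; tiny chunks cost more than they save
        int chunks = (pixelSize + chunkSize - 1) / chunkSize;
        Parallel.For(0, chunks, c =>
        {
            int start = c * chunkSize;
            int end = Math.Min(start + chunkSize, pixelSize);
            foreach (var arrayData in arrays)
                for (int i = start; i < end; i++)
                    avg[i] += arrayData[i];
            for (int i = start; i < end; i++)
                avg[i] /= cnt;
        });
        return avg;
    }

    static void Main()
    {
        var data = new List<double[]> { new double[] { 2, 4 }, new double[] { 4, 8 } };
        var avg = AverageParallel(data, 2);
        Console.WriteLine($"{avg[0]} {avg[1]}"); // 3 6
    }
}
```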


Is parallel code supposed to run slower than sequential code, after a certain dataset size?

I'm fairly new to C# and programming in general and I was trying out parallel programming.
I have written this example code that computes the sum of an array first, using multiple threads, and then, using one thread (the main thread).
I've timed both cases.
static long Sum(int[] numbers, int start, int end)
{
    long sum = 0;
    for (int i = start; i < end; i++)
    {
        sum += numbers[i];
    }
    return sum;
}

static async Task Main()
{
    // Arrange data.
    const int COUNT = 100_000_000;
    int[] numbers = new int[COUNT];
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = random.Next(100);
    }

    // Split task into multiple parts.
    int threadCount = Environment.ProcessorCount;
    int taskCount = threadCount - 1;
    int taskSize = numbers.Length / taskCount;

    var start = DateTime.Now;

    // Run individual parts in separate threads.
    List<Task<long>> tasks = new();
    for (int i = 0; i < taskCount; i++)
    {
        int begin = i * taskSize;
        int end = (i == taskCount - 1) ? numbers.Length : (i + 1) * taskSize;
        tasks.Add(Task.Run(() => Sum(numbers, begin, end)));
    }

    // Wait for all threads to finish, as we need the result.
    var partialSums = await Task.WhenAll(tasks);
    long sumAsync = partialSums.Sum();
    var durationAsync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Async sum: {sumAsync}");
    Console.WriteLine($"Async duration: {durationAsync} milliseconds");

    // Sequential
    start = DateTime.Now;
    long sumSync = Sum(numbers, 0, numbers.Length);
    var durationSync = (DateTime.Now - start).TotalMilliseconds;
    Console.WriteLine($"Sync sum: {sumSync}");
    Console.WriteLine($"Sync duration: {durationSync} milliseconds");

    var factor = durationSync / durationAsync;
    Console.WriteLine($"Factor: {factor:0.00}x");
}
When the array size is 100 million, the parallel sum is computed 2x faster. (on average).
But when the array size is 1 billion, it's significantly slower than the sequential sum.
Why is it running slower?
Hardware Information
Environment.ProcessorCount = 4
GC.GetGCMemoryInfo().TotalAvailableMemoryBytes = 8468377600
Timing:
When array size is 100,000,000 (screenshot omitted)
When array size is 1,000,000,000 (screenshot omitted)
New Test:
This time, instead of the separate threads (3 in my case) working on different parts of a single array of 1,000,000,000 integers, I physically divided the dataset into 3 separate arrays of 333,333,333 elements (one third the size). Although I'm still adding up a billion integers on the same machine, my parallel code now runs faster, as expected.
private static void InitArray(int[] numbers)
{
    Random random = new();
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = random.Next(100);
    }
}

// Single-array overload of the Sum method shown earlier.
private static long Sum(int[] numbers) => Sum(numbers, 0, numbers.Length);

public static async Task Main()
{
    Stopwatch stopwatch = new();
    const int SIZE = 333_333_333; // one third of a billion
    List<int[]> listOfArrays = new();
    for (int i = 0; i < Environment.ProcessorCount - 1; i++)
    {
        int[] numbers = new int[SIZE];
        InitArray(numbers);
        listOfArrays.Add(numbers);
    }

    // Sequential.
    stopwatch.Start();
    long syncSum = 0;
    foreach (var array in listOfArrays)
    {
        syncSum += Sum(array);
    }
    stopwatch.Stop();
    var sequentialDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Sequential sum: {syncSum}");
    Console.WriteLine($"Sequential duration: {sequentialDuration} ms");

    // Parallel.
    stopwatch.Restart();
    List<Task<long>> tasks = new();
    foreach (var array in listOfArrays)
    {
        tasks.Add(Task.Run(() => Sum(array)));
    }
    var partialSums = await Task.WhenAll(tasks);
    long parallelSum = partialSums.Sum();
    stopwatch.Stop();
    var parallelDuration = stopwatch.Elapsed.TotalMilliseconds;
    Console.WriteLine($"Parallel sum: {parallelSum}");
    Console.WriteLine($"Parallel duration: {parallelDuration} ms");
    Console.WriteLine($"Factor: {sequentialDuration / parallelDuration:0.00}x");
}
Timing (screenshot omitted).
I don't know if it helps figure out what went wrong in the first approach.
The asynchronous pattern is not the same as running code in parallel; the main reason for asynchronous code is better resource utilization while the computer is waiting for some kind of I/O. Your code would be better described as parallel computing or concurrent computing.
While your example should work fine, it may not be the easiest, nor the optimal, way to do it. The easiest option would probably be Parallel LINQ: numbers.AsParallel().Sum(). There is also a Parallel.For method that should be better suited, including an overload that maintains thread-local state. Note that while Parallel.For will attempt to optimize its partitioning, you probably want to process chunks of data in each iteration to reduce overhead; I would try around 1-10k values or so.
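A minimal sketch of that thread-local-state overload, with made-up data: each worker accumulates into its own local sum and merges it into the shared total exactly once at the end.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ParallelSumDemo
{
    // Parallel.For overload with a thread-local accumulator: localInit
    // creates each worker's starting value, the body folds one element
    // into the local sum, and localFinally merges it into the total.
    public static long SumParallel(int[] numbers)
    {
        long total = 0;
        Parallel.For(0, numbers.Length,
            () => 0L,                                   // init thread-local state
            (i, state, local) => local + numbers[i],    // per-index body
            local => Interlocked.Add(ref total, local)  // merge once per worker
        );
        return total;
    }

    static void Main()
    {
        int[] numbers = new int[1_000_000];
        for (int i = 0; i < numbers.Length; i++) numbers[i] = i % 100;
        Console.WriteLine(SumParallel(numbers)); // 49500000
    }
}
```

This per-index body is deliberately simple; as the answer notes, summing a chunk per iteration would amortize the delegate-call overhead.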
We can only guess why your parallel method is slower. Summing numbers is a really fast operation, so the computation may be limited by memory bandwidth or cache usage. And while you want your work partitions to be fairly large, using too-large partitions may result in less overall parallelism if a thread gets suspended for any reason. You may also want partitions of certain sizes to play well with the caching system; see cache associativity. It is also possible you are measuring things you did not intend to, like JIT compilation or GCs; see BenchmarkDotNet, which takes care of many of the edge cases when measuring performance.
Also, never use DateTime for measuring performance; Stopwatch is both easier to use and much more accurate.
My machine has 4GB RAM, so initializing an int[1_000_000_000] results in memory paging. Going from int[100_000_000] to int[1_000_000_000] results in non-linear performance degradation (100x instead of 10x). Essentially a CPU-bound operation becomes I/O-bound. Instead of adding numbers, the program spends most of its time reading segments of the array from the disk. In these conditions using multiple threads can be detrimental for the overall performance, because the pattern of accessing the storage device becomes more erratic and less streamlined.
Maybe something similar happens on your 8GB RAM machine too, but I can't say for sure.

How does CPU caching work when you access different values in a 'for' loop?

I made some tests of code performance, and I would like to know how the CPU cache works in this kind of situation:
Here is a classic example for a loop:
private static readonly short[] _values;

static MyClass()
{
    var random = new Random();
    _values = Enumerable.Range(0, 100)
        .Select(x => (short)random.Next(5000))
        .ToArray();
}

public static void Run()
{
    short max = 0;
    for (var index = 0; index < _values.Length; index++)
    {
        max = Math.Max(max, _values[index]);
    }
}
Here is a variant that computes the same thing but is much more performant:
private static readonly short[] _values;

static MyClass()
{
    var random = new Random();
    _values = Enumerable.Range(0, 100)
        .Select(x => (short)random.Next(5000))
        .ToArray();
}

public static void Run()
{
    short max1 = 0;
    short max2 = 0;
    for (var index = 0; index < _values.Length; index += 2)
    {
        max1 = Math.Max(max1, _values[index]);
        max2 = Math.Max(max2, _values[index + 1]);
    }
    short max = Math.Max(max1, max2);
}
So I am interested to know why the second is more efficient than the first.
I understand it's a matter of CPU caching, but I don't really get how it happens (e.g. why values are not read twice between loops).
EDIT:
.NET Core 4.6.27617.04
2.1.11
Intel Core i7-7850HQ 2.90GHz 64-bit
Calling 50 million times:
MyClass1: => 00:00:06.0702028
MyClass2: => 00:00:03.8563776 (-36 %)
The last timing (MyClass2) is the one with the loop unrolling.
The difference in performance in this case is not related to caching - you have just 100 values - they fit entirely in the L2 cache already at the time you generated them.
The difference is due to out-of-order execution.
A modern CPU has multiple execution units and can perform more than one operation at the same time even in a single-threaded application.
But your loop is problematic for a modern CPU because it has a dependency:
short max = 0;
for (var index = 0; index < _values.Length; index++)
{
    max = Math.Max(max, _values[index]);
}
Here each subsequent iteration is dependent on the value max from the previous one, so the CPU is forced to compute them sequentially.
Your revised loop adds a degree of freedom for the CPU; since max1 and max2 are independent, they can be computed in parallel.
So essentially the revised loop can run equally fast per iteration as the first one:
short max1 = 0;
short max2 = 0;
for (var index = 0; index < _values.Length; index += 2)
{
    max1 = Math.Max(max1, _values[index]);
    max2 = Math.Max(max2, _values[index + 1]);
}
But it has half the iterations, so in the end you get a significant speedup (not 2x because out-of-order execution is not perfect).
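The same dependency-breaking idea extends to more accumulators. An illustrative sketch with four independent Max chains (my own example; it assumes the array length is a multiple of 4):

```csharp
using System;

class IlpDemo
{
    // Four independent accumulators: the four Math.Max chains have no
    // data dependencies on each other, so an out-of-order CPU can
    // execute them concurrently within a single thread.
    public static short Max4(short[] values) // assumes values.Length % 4 == 0
    {
        short m0 = 0, m1 = 0, m2 = 0, m3 = 0;
        for (int i = 0; i < values.Length; i += 4)
        {
            m0 = Math.Max(m0, values[i]);
            m1 = Math.Max(m1, values[i + 1]);
            m2 = Math.Max(m2, values[i + 2]);
            m3 = Math.Max(m3, values[i + 3]);
        }
        return Math.Max(Math.Max(m0, m1), Math.Max(m2, m3));
    }

    static void Main()
    {
        var v = new short[] { 3, 9, 1, 7, 2, 8, 4, 6 };
        Console.WriteLine(Max4(v)); // 9
    }
}
```

Beyond a handful of accumulators the gain flattens out, because the loop becomes limited by load bandwidth rather than the dependency chain.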
Caching
The CPU cache loads data from memory in units of cache lines, so when one value is accessed, its neighbours are usually pulled into the cache along with it. This applies to data of all kinds: pointers, variable values, etc.
Code Blocks
The difference between your two blocks of code may not be apparent in the C# syntax. Try converting your code to IL (the intermediate language for C#, which is executed by the JIT, the just-in-time compiler); see the refs below for tools and resources.
Or simply decompile your built/compiled code and check how the compiler "optimized it" when producing the dll/exe files, using the decompiler below.
Other performance optimizations to read about:
Loop Unrolling
CPU Caching
Refs:
C# Decompiler
JIT

Shuffling an array bottlenecks on Random.Next(int)

I've been working on a small piece of code that shuffles the provided array. The array should be shuffled as fast as possible; the quality of the randomization is not that important. After profiling the method I found out that the biggest hog is Random.Next, which takes up about 70% of the method's execution time. After searching online for faster random generators I found no plug-and-play libraries that offer any improved performance.
So I was wondering whether there are any ways to improve the performance of this code any more.
private static readonly Random rnd = new Random();

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Shuffle(byte[] chars)
{
    for (var i = 0; i < chars.Length; i++)
    {
        var index = rnd.Next(chars.Length);
        byte tmpStore = chars[index];
        chars[index] = chars[i];
        chars[i] = tmpStore;
    }
}
Alright, this is getting into micro-optimization territory.
Random.Next(int) actually performs some operations internally that we can factor out. Roughly:
int index = (int)(rnd.Next() * (1.0 / int.MaxValue) * chars.Length);
Since you're using the same maxValue over and over in a loop, a trivial optimization would be to precalculate your denominator outside of the loop. This way we get rid of an int-to-double conversion and a multiply:
double d = chars.Length / (double)int.MaxValue;
And then:
int index = (int)(rnd.Next() * d);
On a separate note: your shuffle isn't going to have a uniform distribution. See Jeff Atwood's post The Danger of Naïveté which deals specifically with this subject and shows how to perform a uniform Fisher-Yates shuffle.
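For reference, a sketch of a Fisher-Yates shuffle in the same shape as the original method (the driver code is my own illustration):

```csharp
using System;

class ShuffleDemo
{
    private static readonly Random rnd = new Random();

    // Fisher-Yates: walk from the end, swapping each element with a
    // random element at or before its own position (0..i inclusive).
    // Every permutation is then equally likely, unlike swapping with
    // an index drawn from the whole array each time.
    public static void Shuffle(byte[] chars)
    {
        for (int i = chars.Length - 1; i > 0; i--)
        {
            int index = rnd.Next(i + 1); // 0..i inclusive
            byte tmp = chars[index];
            chars[index] = chars[i];
            chars[i] = tmp;
        }
    }

    static void Main()
    {
        var data = new byte[] { 1, 2, 3, 4, 5 };
        Shuffle(data);
        Console.WriteLine(string.Join(",", data));
    }
}
```

It also makes one fewer Random.Next call than the original (the i == 0 iteration is a no-op), though the distribution fix is the real point.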
If n^n isn't too big for the double range, you could generate one random double, multiply it by n^n, then take modulo n each iteration as the current random number, dividing the running value by n to prepare for the next iteration.

Deleting from array, mirrored (strange) behavior

The title may seem a little odd, because I have no idea how to describe this in one sentence.
For the course Algorithms we have to micro-optimize some things; one task is finding out how deleting from an array works. The assignment is to delete something from an array and re-align the contents so that there are no gaps; I think it is quite similar to how std::vector::erase works in C++.
Because I like the idea of understanding everything low-level, I went a little further and tried to bench my solutions. This presented some weird results.
At first, here is a little code that I used:
class Test
{
    Stopwatch sw;
    Obj[] objs;

    public Test()
    {
        this.sw = new Stopwatch();
        this.objs = new Obj[1000000];
        // Fill objs
        for (int i = 0; i < objs.Length; i++)
        {
            objs[i] = new Obj(i);
        }
    }

    public void test()
    {
        // Time deletion
        sw.Restart();
        deleteValue(400000, objs);
        sw.Stop();
        // Show timings
        Console.WriteLine(sw.Elapsed);
    }

    // Delete function
    // value is the to-search-for item in the list of objects
    private static void deleteValue(int value, Obj[] list)
    {
        for (int i = 0; i < list.Length; i++)
        {
            if (list[i].Value == value)
            {
                for (int j = i; j < list.Length - 1; j++)
                {
                    list[j] = list[j + 1];
                    //if (list[j + 1] == null) {
                    //    break;
                    //}
                }
                list[list.Length - 1] = null;
                break;
            }
        }
    }
}
I would just create this class and call the test() method. I did this in a loop for 25 times.
My findings:
The first round it takes a lot longer than the other 24, I think this is because of caching, but I am not sure.
When I use a value that is in the start of the list, it has to move more items in memory than when I use a value at the end, though it still seems to take less time.
Benchtimes differ quite a bit.
When I enable the commented if, performance goes up (10-20%) even if the value I search for is almost at the end of the list (which means the if goes off a lot of times without actually being useful).
I have no idea why these things happen, is there someone who can explain (some of) them? And maybe if someone sees this who is a pro at this, where can I find more info to do this the most efficient way?
Edit after testing:
I did some testing and found some interesting results. I run the test on an array with a size of a million items, filled with a million objects. I run that 25 times and report the cumulative time in milliseconds. I do that 10 times and take the average of that as a final value.
When I run the test with my function described just above here I get a score of:
362,1
When I run it with the answer of dbc I get a score of:
846,4
So mine was faster, but then I started to experiment with a half-empty array and things started to get weird. To get rid of the inevitable NullReferenceExceptions I added an extra check to the if (thinking it would hurt performance a bit more), like so:
if (fromItem != null && fromItem.Value != value)
list[to++] = fromItem;
This seemed to not only work, but improve performance dramatically! Now I get a score of:
247,9
The weird thing is, the scores seem too low to be true, but they sometimes spike. This is the set I took the average from:
94, 26, 966, 36, 632, 95, 47, 35, 109, 439
So the extra evaluation seems to improve my performance, despite doing an extra check. How is this possible?
You are using Stopwatch to time your method. This measures the total clock time taken during your method call, which can include the time required for .NET to initially JIT your method, interruptions for garbage collection, or slowdowns caused by system load from other processes. Noise from these sources will likely dominate noise due to cache misses.
This answer gives some suggestions as to how you can minimize some of the noise from garbage collection or other processes. To eliminate JIT noise, you should call your method once without timing it -- or show the time taken by the first call in a separate column in your results table since it will be so different. You might also consider using a proper profiler which will report exactly how much time your code used exclusive of "noise" from other threads or processes.
Finally, I'll note that your algorithm to remove matching items from an array and shift everything else down uses a nested loop, which is not necessary and will access items in the array after the matching index twice. The standard algorithm looks like this:
public static void RemoveFromArray(this Obj[] array, int value)
{
    int to = 0;
    for (int from = 0; from < array.Length; from++)
    {
        var fromItem = array[from];
        if (fromItem.Value != value)
            array[to++] = fromItem;
    }
    for (; to < array.Length; to++)
    {
        array[to] = default(Obj);
    }
}
However, instead of the element-by-element shift in your version you might experiment with Array.Copy(), since (I believe) internally it performs the block move in optimized native code.
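As an illustration of that block-move approach (a sketch of my own, not the answerer's code), removing the first matching element with a single Array.Copy:

```csharp
using System;

class RemoveDemo
{
    // Sketch: find the element, then block-copy the tail down one slot,
    // similar to what List<T>.RemoveAt does internally.
    public static void RemoveFirst(int[] array, int value)
    {
        int i = Array.IndexOf(array, value);
        if (i < 0) return; // value not present
        Array.Copy(array, i + 1, array, i, array.Length - i - 1);
        array[array.Length - 1] = default; // clear the vacated last slot
    }

    static void Main()
    {
        var a = new[] { 1, 2, 3, 4, 5 };
        RemoveFirst(a, 3);
        Console.WriteLine(string.Join(",", a)); // 1,2,4,5,0
    }
}
```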

Why is processing a sorted array slower than an unsorted array?

I have a list of 500000 randomly generated Tuple<long,long,string> objects on which I am performing a simple "between" search:
var data = new List<Tuple<long,long,string>>(500000);
...
var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
When I generate my random array and run my search for 100 randomly generated values of x, the searches complete in about four seconds. Knowing of the great wonders that sorting does to searching, however, I decided to sort my data - first by Item1, then by Item2, and finally by Item3 - before running my 100 searches. I expected the sorted version to perform a little faster because of branch prediction: my thinking has been that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would predict the branch correctly as "no take", speeding up the tail portion of the search. Much to my surprise, the searches took twice as long on a sorted array!
I tried switching around the order in which I ran my experiments, and used different seed for the random number generator, but the effect has been the same: searches in an unsorted array ran nearly twice as fast as the searches in the same array, but sorted!
Does anyone have a good explanation of this strange effect? The source code of my tests follows; I am using .NET 4.0.
private const int TotalCount = 500000;
private const int TotalQueries = 100;

private static long NextLong(Random r)
{
    var data = new byte[8];
    r.NextBytes(data);
    return BitConverter.ToInt64(data, 0);
}

private class TupleComparer : IComparer<Tuple<long,long,string>>
{
    public int Compare(Tuple<long,long,string> x, Tuple<long,long,string> y)
    {
        var res = x.Item1.CompareTo(y.Item1);
        if (res != 0) return res;
        res = x.Item2.CompareTo(y.Item2);
        return (res != 0) ? res : String.CompareOrdinal(x.Item3, y.Item3);
    }
}

static void Test(bool doSort)
{
    var data = new List<Tuple<long,long,string>>(TotalCount);
    var random = new Random(1000000007);
    var sw = new Stopwatch();
    sw.Start();
    for (var i = 0; i != TotalCount; i++)
    {
        var a = NextLong(random);
        var b = NextLong(random);
        if (a > b)
        {
            var tmp = a;
            a = b;
            b = tmp;
        }
        var s = string.Format("{0}-{1}", a, b);
        data.Add(Tuple.Create(a, b, s));
    }
    sw.Stop();
    if (doSort)
    {
        data.Sort(new TupleComparer());
    }
    Console.WriteLine("Populated in {0}", sw.Elapsed);

    sw.Reset();
    var total = 0L;
    sw.Start();
    for (var i = 0; i != TotalQueries; i++)
    {
        var x = NextLong(random);
        var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
        total += cnt;
    }
    sw.Stop();
    Console.WriteLine("Found {0} matches in {1} ({2})", total, sw.Elapsed, doSort ? "Sorted" : "Unsorted");
}

static void Main()
{
    Test(false);
    Test(true);
    Test(false);
    Test(true);
}
Populated in 00:00:01.3176257
Found 15614281 matches in 00:00:04.2463478 (Unsorted)
Populated in 00:00:01.3345087
Found 15614281 matches in 00:00:08.5393730 (Sorted)
Populated in 00:00:01.3665681
Found 15614281 matches in 00:00:04.1796578 (Unsorted)
Populated in 00:00:01.3326378
Found 15614281 matches in 00:00:08.6027886 (Sorted)
When you are using the unsorted list all tuples are accessed in memory-order. They have been allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line so it will always be present when needed.
When you are sorting the list you put it into random order because your sort keys are randomly generated. This means that the memory accesses to tuple members are unpredictable. The CPU cannot prefetch memory and almost every access to a tuple is a cache miss.
This is a nice example for a specific advantage of GC memory management: data structures which have been allocated together and are used together perform very nicely. They have great locality of reference.
The penalty from cache misses outweighs the saved branch prediction penalty in this case.
Try switching to a struct-tuple. This will restore performance because no pointer-dereference needs to occur at runtime to access tuple members.
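A minimal sketch of that struct idea (my own illustration; the string member is dropped for brevity, and the names are made up): the entries live inline in the array, so iteration stays sequential even after sorting.

```csharp
using System;
using System.Linq;

class StructTupleDemo
{
    // A readonly struct pair is stored inline in the array: iterating
    // touches one contiguous block of memory with no per-tuple pointer
    // chase, and sorting moves the values themselves, not references.
    public readonly struct Entry
    {
        public readonly long Item1, Item2;
        public Entry(long a, long b) { Item1 = a; Item2 = b; }
    }

    static void Main()
    {
        var data = new Entry[] { new Entry(1, 5), new Entry(2, 9), new Entry(6, 8) };
        long x = 4;
        int cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
        Console.WriteLine(cnt); // 2
    }
}
```

In later .NET versions a ValueTuple (long, long, string) would serve the same purpose, though the string member would still be a reference stored inline.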
Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.
This kind of behavior shows how hard it is to predict performance on modern CPUs. The fact that we are only 2x slower when going from sequential to random memory access tells me how much is going on under the covers to hide memory latency. A memory access can stall the CPU for 50-200 cycles; given that number, one could expect the program to become more than 10x slower when introducing random memory accesses.
LINQ doesn't know whether your list is sorted or not.
Since Count with a predicate parameter is an extension method on all IEnumerables, it doesn't even know whether it's running over a collection with efficient random access. So it simply checks every element, and Usr's answer explains why performance got lower.
To exploit the performance benefits of a sorted array (such as binary search), you'll have to do a little more coding.
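For example, if the data is sorted by Item1, a hand-rolled upper-bound search can limit the scan to the prefix that can possibly match (a sketch with made-up data; parallel arrays stand in for the tuple list):

```csharp
using System;

class SortedCountDemo
{
    // Binary search for the first index whose Item1 exceeds x. Only
    // elements before that index can satisfy Item1 <= x, so the Item2
    // check scans that prefix instead of the whole array.
    public static int UpperBound(long[] item1Sorted, long x)
    {
        int lo = 0, hi = item1Sorted.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (item1Sorted[mid] <= x) lo = mid + 1; else hi = mid;
        }
        return lo; // index of first element with Item1 > x
    }

    static void Main()
    {
        long[] item1 = { 1, 2, 3, 7, 9 }; // sorted by Item1
        long[] item2 = { 5, 2, 8, 9, 9 }; // parallel array of Item2
        long x = 4;
        int end = UpperBound(item1, x);
        int cnt = 0;
        for (int i = 0; i < end; i++)
            if (item2[i] >= x) cnt++;
        Console.WriteLine(cnt); // 2
    }
}
```

This bounds the work by the size of the matching prefix rather than the whole list; an interval tree would be the next step if the prefix is usually large.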
