I've been reading about lock-free techniques, like compare-and-swap (CAS) and leveraging the Interlocked and SpinWait classes to achieve thread synchronization without locking.
I've run a few tests of my own, where I simply have many threads trying to append a character to a string. I tried using regular locks and compare-and-swap. Surprisingly (at least to me), locks showed much better results than CAS.
Here's the CAS version of my code (based on this). It follows a copy->modify->swap pattern:
private string _str = "";

public void Append(char value)
{
    var spin = new SpinWait();
    while (true)
    {
        // Read the current value. (CompareExchange with identical comparand and
        // replacement acts as a volatile read; see the note in the answer below.)
        var original = Interlocked.CompareExchange(ref _str, null, null);
        var newString = original + value;

        // Publish the new string only if nobody changed _str since we read it.
        if (Interlocked.CompareExchange(ref _str, newString, original) == original)
            break;

        spin.SpinOnce();
    }
}
And the simpler (and more efficient) lock version:
private object lk = new object();

public void AppendLock(char value)
{
    lock (lk)
    {
        _str += value;
    }
}
If I try adding 50,000 characters, the CAS version takes 1.2 seconds and the lock version 700ms (average). For 100k characters, they take 7 seconds and 3.8 seconds, respectively.
This was run on a quad-core (i5-2500K).
I suspected the reason CAS was showing these results was that it was failing the final "swap" step a lot. I was right. When adding 50k chars (50k successful swaps), I counted between 70k (best case) and almost 200k (worst case) failed attempts. In the worst case, 4 out of every 5 attempts failed.
So my questions are:
What am I missing? Shouldn't CAS give better results? Where's the benefit?
Why exactly and when is CAS a better option? (I know this has been asked, but I can't find any satisfying answer that also explains my specific scenario).
It is my understanding that solutions employing CAS, although hard to code, scale much better and perform better than locks as contention increases. In my example, the operations are very small and frequent, which means high contention and high frequency. So why do my tests show otherwise?
I assume longer operations would make things even worse: the "swap" failure rate would increase even more.
PS: this is the code I used to run the tests:
Stopwatch watch = Stopwatch.StartNew();
var cl = new Class1();
Parallel.For(0, 50000, i => cl.Append('a'));
var time = watch.Elapsed;
Debug.WriteLine(time.TotalMilliseconds);
The problem is a combination of the failure rate on the loop and the fact that strings are immutable. I did a couple of tests on my own using the following parameters.
Ran 8 different threads (I have an 8 core machine).
Each thread called Append 10,000 times.
What I observed was that the final length of the string was 80,000 (8 x 10,000), so that was perfect. The number of append attempts averaged ~300,000 for me. That is a failure rate of ~73%; only 27% of the CPU time resulted in useful work.

Now, because strings are immutable, every append creates a new instance of the string on the heap, and the original contents plus the one extra character are copied into it. This copy operation is O(n), so it gets longer and longer as the string's length increases. Because of the copy operation, my hypothesis was that the failure rate would increase as the length of the string increases: as the copy takes more and more time, there is a higher probability of a collision, because the threads spend more time competing to finalize the ICX (interlocked compare-exchange). My tests confirmed this. You should try this same test yourself.
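If you want to reproduce the measurement, here is a minimal sketch of a counting version of Append; the _attempts counter and the harness shape are my own additions, not part of the original test:

private string _str = "";
private long _attempts; // counts every CAS attempt, successful or not

public void AppendCounted(char value)
{
    var spin = new SpinWait();
    while (true)
    {
        Interlocked.Increment(ref _attempts);
        var original = _str;
        if (Interlocked.CompareExchange(ref _str, original + value, original) == original)
            break;
        spin.SpinOnce();
    }
}

// After N successful appends across all threads:
// failed attempts = _attempts - N, failure rate = (_attempts - N) / (double)_attempts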
The biggest issue here is that sequential string concatenations do not lend themselves to parallelism very well. Since the result of operation Xn depends on Xn-1, it is going to be faster to take the full lock, especially if it means you avoid all of the failures and retries. A pessimistic strategy wins the battle against an optimistic one in this case. Low lock techniques work better when you can partition the problem into independent chunks that really can run unimpeded in parallel.
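For example (my own sketch, not from the original post), each thread can build its own chunk in a local StringBuilder and the chunks can be combined once at the end, so the threads never touch shared state while appending. threadCount and charsPerThread are assumed inputs:

var parts = new string[threadCount]; // one slot per thread: no shared writes
Parallel.For(0, threadCount, t =>
{
    var sb = new StringBuilder(charsPerThread);
    for (int i = 0; i < charsPerThread; i++)
        sb.Append('a');
    parts[t] = sb.ToString(); // each thread only writes its own slot
});
string result = string.Concat(parts); // single sequential combine at the end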
As a side note, the use of Interlocked.CompareExchange to do the initial read of _str is unnecessary. The reason is that a memory barrier is not required for that read in this case: the Interlocked.CompareExchange call that actually performs work (the second one in your code) creates a full barrier. So the worst case scenario is that the first read is "stale", the ICX operation fails the test, and the loop spins back around to try again. This time, however, the previous ICX forced a "fresh" read.¹
The following code is how I generalize a complex operation using low lock mechanisms. In fact, the code presented below allows you to pass a delegate representing the operation, so it is very generalized. Would you want to use it in production? Probably not, because invoking the delegate is slow, but at least you get the idea. You could always hard-code the operation.
public static class InterlockedEx
{
    public static T Change<T>(ref T destination, Func<T, T> operation) where T : class
    {
        T original, value;
        do
        {
            original = destination;
            value = operation(original);
        }
        while (Interlocked.CompareExchange(ref destination, value, original) != original);
        return original;
    }
}
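Usage for the append scenario would then look something like this (my example); note that the delegate can be invoked several times under contention, so it should be free of side effects:

InterlockedEx.Change(ref _str, s => s + value);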
¹ I actually dislike the terms "stale" and "fresh" when discussing memory barriers, because that is not really what they are about. It is more of a side effect than an actual guarantee. But in this case it illustrates my point better.
I'm just now starting to learn multi-threading and I came across this question:
public class Program1
{
    int variable;
    bool variableValueHasBeenSet = false;

    public void Func1()
    {
        variable = 1;
        variableValueHasBeenSet = true;
    }

    public void Func2()
    {
        if (variableValueHasBeenSet) Console.WriteLine(variable);
    }
}
The question is: determine all possible outputs (in the console) for the following code snippet if Func1() and Func2() are run in parallel on two separate threads. The answer given is nothing, 1, or 0. The first two options are obvious, but the third one surprised me, so I wanted to try and reproduce it. This is what I tried:
for (int i = 0; i < 100; i++)
{
    var prog1 = new Program1();
    List<Task> tasks = new List<Task>();
    tasks.Add(new Task(() => prog1.Func2(), TaskCreationOptions.LongRunning));
    tasks.Add(new Task(() => prog1.Func1(), TaskCreationOptions.LongRunning));
    Parallel.ForEach(tasks, t => t.Start());
}
I couldn't get 0, only nothing and 1, so I was wondering what I'm doing wrong and how I can test this specific problem.
This is the explanation they provided for 0:
0 - This might seem impossible but this is a probable output and an interesting one. .Net runtime, C# and the CPU take the liberty of reordering instructions for optimization. So it is possible that variableValueHasBeenSet is set to true but the value of the variable is still zero. Another reason for such an output is caching. Thread2 might cache the value for the variable as 0 and will not see the updated value when Thread1 updates it in Func1. For a single threaded program this is not an issue as the ordering is guaranteed, but not so in multithreaded code. If the code at both the places is surrounded by locks, this problem can be mitigated. Another advanced way is to use memory barriers.
.Net runtime, C# and the CPU take the liberty of reordering instructions for optimization.
This bit of information is very important, because there is no guarantee the reordering will happen at all.
The optimizer will often reorder the instructions, but usually this is triggered by code complexity and will typically only occur on a release build (the optimizer will look for dependency-chains and may decide to reorder the code if no dependency is broken AND it will result in faster/more compact code). The code complexity of your test is very low and may not trigger the reordering optimization.
The same thing may happen at the CPU level: if no dependency chains are found between CPU instructions, they may be reordered or at least run in parallel by a superscalar CPU, while other, simpler architectures will run the code in order.
Another reason for such an output is caching. Thread2 might cache the value for the variable as 0 and will not see the updated value when Thread1 updates it in Func1
Again, this is only a possibility. This type of optimization is typically triggered when a variable is accessed repeatedly in a loop. The optimizer may decide that it is faster to place the variable in a CPU register instead of accessing it from memory on every iteration.
In any case, the amount of control you have over how the C# compiler emits its code is very limited, and the same goes for how the IL code gets translated to machine code. For these reasons, it would be very difficult for you to produce a test that reproduces the case you intend to prove on every architecture.
What is really important is that you need to be aware that 1) the execution order of the instructions can never be taken for granted, and 2) variables may be temporarily stored in registers as a potential optimization. Once you are aware of this, you should write your code defensively around these possibilities.
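To make the example well behaved regardless of those optimizations, one option is to publish the flag with release/acquire semantics. This is my sketch using the Volatile class; marking variableValueHasBeenSet as volatile would have a similar effect:

public class Program1
{
    int variable;
    bool variableValueHasBeenSet;

    public void Func1()
    {
        variable = 1;
        // Release: the write to 'variable' cannot move past this store.
        Volatile.Write(ref variableValueHasBeenSet, true);
    }

    public void Func2()
    {
        // Acquire: if we see 'true', we also see the write to 'variable'.
        if (Volatile.Read(ref variableValueHasBeenSet))
            Console.WriteLine(variable); // can only print 1 (or nothing)
    }
}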
I have read a lot of contradictory information (MSDN, SO, etc.) about volatile and VolatileRead (ReadAcquireFence).
I understand the memory-access reordering restrictions these imply; what I'm still completely confused about is the freshness guarantee, which is very important for me.
The MSDN doc for volatile mentions:
(...) This ensures that the most up-to-date value is present in the field at all times.
The MSDN doc for volatile fields mentions:
A read of a volatile field is called a volatile read. A volatile read has "acquire semantics"; that is, it is guaranteed to occur prior to any references to memory that occur after it in the instruction sequence.
The .NET code for VolatileRead is:
public static int VolatileRead(ref int address)
{
    int ret = address;
    MemoryBarrier(); // Call MemoryBarrier to ensure the proper semantic in a portable way.
    return ret;
}
According to the MSDN doc for MemoryBarrier, a memory barrier prevents reordering. However, this doesn't seem to have any implications for freshness - correct?
How then can one get a freshness guarantee?
And is there a difference between marking a field volatile and accessing it with VolatileRead and VolatileWrite semantics? I'm currently doing the latter in my performance-critical code that needs to guarantee freshness, yet readers sometimes get a stale value. I'm wondering if marking the state volatile would make the situation different.
EDIT1:
What I'm trying to achieve: get the guarantee that reader threads will get as recent a value of the shared variable (written by multiple writers) as possible - ideally no older than the cost of a context switch or other operations that may postpone the immediate write of state.
If volatile or a higher-level construct (e.g. lock) has this guarantee (do they?), then how do they achieve it?
EDIT2:
The very condensed question should have been: how do I get a guarantee of as fresh a value as possible during reads? Ideally without locking (as exclusive access is not needed and there is potential for high contention).
From what I learned here, I'm wondering if this might be the solution (the solving(?) line is marked with a comment):
private SharedState _sharedState;
private SpinLock _spinLock = new SpinLock(false);

public void Update(SharedState newValue)
{
    bool lockTaken = false;
    _spinLock.Enter(ref lockTaken);
    _sharedState = newValue;
    if (lockTaken)
    {
        _spinLock.Exit();
    }
}

public SharedState GetFreshSharedState
{
    get
    {
        Thread.MemoryBarrier(); // <---- This is added to give readers freshness guarantee
        var value = _sharedState;
        Thread.MemoryBarrier();
        return value;
    }
}
The MemoryBarrier call was added to make sure both reads and writes are wrapped by full fences (the same as lock code, as indicated here http://www.albahari.com/threading/part4.aspx#_The_volatile_keyword in the 'Memory barriers and locking' section).
Does this look correct or is it flawed?
EDIT3:
Thanks to the very interesting discussions here I learned quite a few things, and I was actually able to distill the simplified, unambiguous question that I have about this topic. It's quite different from the original one, so I posted a new one here: Memory barrier vs Interlocked impact on memory caches coherency timing
I think this is a good question. But it is also difficult to answer. I am not sure I can give you a definitive answer to your questions. It is not your fault, really. It is just that the subject matter is complex and really requires knowing details that might not be feasible to enumerate. Honestly, it really seems like you have educated yourself on the subject quite well already. I have spent a lot of time studying the subject myself and I still do not fully understand everything. Nevertheless, I will still attempt some semblance of an answer here anyway.
So what does it mean for a thread to read a fresh value anyway? Does it mean the value returned by the read is guaranteed to be no older than 100ms, 50ms, or 1ms? Or does it mean the value is the absolute latest? Or does it mean that if two reads occur back-to-back then the second is guaranteed to get a newer value assuming the memory address changed after the first read? Or does it mean something else altogether?
I think you are going to have a hard time getting your readers to work correctly if you are thinking about things in terms of time intervals. Instead think of things in terms of what happens when you chain reads together. To illustrate my point consider how you would implement an interlocked-like operation using arbitrarily complex logic.
public static T InterlockedOperation<T>(ref T location, T operand, Func<T, T, T> op)
    where T : class // required by the generic Interlocked.CompareExchange overload
{
    T initial, computed;
    do
    {
        initial = location;
        computed = op(initial, operand); // op supplies the specific implementation
    }
    while (Interlocked.CompareExchange(ref location, computed, initial) != initial);
    return computed;
}
In the code above we can create any interlocked-like operation if we exploit the fact that the second read of location via Interlocked.CompareExchange is guaranteed to return a newer value if the memory address received a write after the first read. This is because the Interlocked.CompareExchange method generates a memory barrier. If the value has changed between reads then the code spins around the loop repeatedly until location stops changing. This pattern does not require that the code use the latest or freshest value; just a newer value. The distinction is crucial.¹
A lot of lock-free code I have seen works on this principle. That is, the operations are usually wrapped in loops such that the operation is continually retried until it succeeds. It does not assume that the first attempt uses the latest value, nor does it assume every use of the value is the latest. It only assumes that the value is newer after each read.
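For instance (my example, using the op delegate from the code above), an atomic string append can be expressed as:

string result = InterlockedOperation(ref _str, "a", (current, suffix) => current + suffix);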
Try to rethink how your readers should behave. Try to make them more agnostic about the age of the value. If that is simply not possible and all writes must be captured and processed then you may be forced into a more deterministic approach like placing all writes into a queue and having the readers dequeue them one-by-one. I am sure the ConcurrentQueue class would help in that situation.
If you can reduce the meaning of "fresh" to only "newer" then placing a call to Thread.MemoryBarrier after each read, using Volatile.Read, using the volatile keyword, etc. will absolutely guarantee that one read in a sequence will return a newer value than a previous read.
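In other words (my sketch), chained reads like the following are monotonic; the second read can never observe an older value than the first:

int first = Volatile.Read(ref _counter);   // _counter is a shared int written by other threads
// ... other threads may write to _counter here ...
int second = Volatile.Read(ref _counter);  // guaranteed to be no older than 'first'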
¹ The ABA problem opens up a new can of worms.
A memory barrier does provide this guarantee. We can derive the "freshness" property that you are looking for from the reordering properties that a barrier guarantees.
By freshness you probably mean that a read returns the value of the most recent write.
Let's say we have these operations, each on a different thread:
x = 1
x = 2
print(x)
How could we possibly print a value other than 2? Without volatile the read can move one slot upwards and return 1. Volatile prevents reorderings, though. The write cannot move backwards in time.
In short, volatile guarantees you to see the most recent value.
Strictly speaking I'd need to differentiate between volatile and a memory barrier here. The latter is a stronger guarantee. I have simplified this discussion because volatile is implemented using memory barriers, at least on x86/x64.
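Rendered as C# (my sketch), the three operations would look like this, with x declared volatile so the reads and writes get the semantics discussed above:

class Example
{
    volatile int x;

    public void ThreadA() { x = 1; }                // volatile write
    public void ThreadB() { x = 2; }                // volatile write
    public void ThreadC() { Console.WriteLine(x); } // volatile read
}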
Recently I was reading the implementation of .NET's Hashtable and encountered a piece of code that I don't understand. Part of the code is:
int num3 = 0;
int num4;
do
{
    num4 = this.version;
    bucket = bucketArray[index];
    if (++num3 % 8 == 0)
        Thread.Sleep(1);
}
while (this.isWriterInProgress || num4 != this.version);
The whole code is within public virtual object this[object key] of System.Collections.Hashtable (mscorlib Version=4.0.0.0).
The question is:
What is the reason of having Thread.Sleep(1) there?
Sleep(1) is a documented way in Windows to yield the processor and allow other threads to run. You can find this code in the Reference Source with comments:
// Our memory model guarantee if we pick up the change in bucket from another processor,
// we will see the 'isWriterProgress' flag to be true or 'version' is changed in the reader.
//
int spinCount = 0;
do {
    // this is violate read, following memory accesses can not be moved ahead of it.
    currentversion = version;
    b = lbuckets[bucketNumber];

    // The contention between reader and writer shouldn't happen frequently.
    // But just in case this will burn CPU, yield the control of CPU if we spinned a few times.
    // 8 is just a random number I pick.
    if( (++spinCount) % 8 == 0 ) {
        Thread.Sleep(1);   // 1 means we are yeilding control to all threads, including low-priority ones.
    }
} while ( isWriterInProgress || (currentversion != version) );
The isWriterInProgress variable is a volatile bool. The author had some trouble with English: "violate read" is "volatile read". The basic idea is to try to avoid yielding, since thread context switches are very expensive, in the hope that the writer gets done quickly. If that doesn't pan out, then explicitly yield to avoid burning CPU. This would probably have been written with SpinLock today, but Hashtable is very old. As are the assumptions about the memory model.
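The pattern generalizes to a version-stamped read, something like this sketch (my rendering of the same idea; the field and type names mirror the quoted source, but this is not the actual Hashtable code):

private volatile int version;              // bumped by the writer around every mutation
private volatile bool isWriterInProgress;  // true while a writer is mutating the table

private bucket ReadConsistent(bucket[] lbuckets, int bucketNumber)
{
    int spinCount = 0;
    int currentversion;
    bucket b;
    do
    {
        currentversion = version;      // volatile read: later reads cannot move ahead of it
        b = lbuckets[bucketNumber];
        if (++spinCount % 8 == 0)
            Thread.Sleep(1);           // yield if a writer is holding us up
    }
    while (isWriterInProgress || currentversion != version); // retry on a torn read
    return b;
}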
Without having access to the rest of the implementation code, I can only make an educated guess based on what you've posted.
That said, it looks like it's trying to update something in the Hashtable, either in memory or on disk, and doing an infinite loop while waiting for it to finish (as seen by checking the isWriterInProgress).
If it's a single-core processor, it can only run one thread at a time. Going into a continuous loop like this could easily mean the other thread doesn't have a chance to run, but the Thread.Sleep(1) gives the processor a chance to give time to the writer. Without the wait, the writer thread might never get a chance to run, and never complete.
I haven't read the source, but it looks like a lockless concurrency thing. You're trying to read from the hashtable, but someone else might be writing to it, so you wait until isWriterInProgress is unset and the version that you've read hasn't changed.
This does not explain why e.g. we always wait at least once. EDIT: that's because we don't - thanks @Maciej for pointing that out. When there's no contention we proceed immediately. I don't know why 8 is the magic number instead of e.g. 4 or 16, though.
The System.Threading.Interlocked class allows for addition (subtraction) and comparison as an atomic operation. It seems that a CompareExchange that does not just do equality but also GreaterThan/LessThan as an atomic comparison would be quite valuable.
Would a hypothetical Interlocked.GreaterThan be a feature of the IL or a CPU-level feature? Both?
Lacking any other option, is it possible to create such a feature in C++ or direct IL code and expose that functionality to C#?
You can build other atomic operations out of Interlocked.CompareExchange.
public static bool InterlockedExchangeIfGreaterThan(ref int location, int comparison, int newValue)
{
    int initialValue;
    do
    {
        initialValue = location;
        if (initialValue >= comparison) return false;
    }
    while (System.Threading.Interlocked.CompareExchange(ref location, newValue, initialValue) != initialValue);
    return true;
}
With these helper methods you can not only exchange the value but also detect whether it was replaced or not.
Usage looks like this:
int currentMin = 10; // can be changed from another thread at any moment
int potentialNewMin = 8;

if (InterlockedExtension.AssignIfNewValueSmaller(ref currentMin, potentialNewMin))
{
    Console.WriteLine("New minimum: " + potentialNewMin);
}
And here are methods:
public static class InterlockedExtension
{
    public static bool AssignIfNewValueSmaller(ref int target, int newValue)
    {
        int snapshot;
        bool stillLess;
        do
        {
            snapshot = target;
            stillLess = newValue < snapshot;
        } while (stillLess && Interlocked.CompareExchange(ref target, newValue, snapshot) != snapshot);

        return stillLess;
    }

    public static bool AssignIfNewValueBigger(ref int target, int newValue)
    {
        int snapshot;
        bool stillMore;
        do
        {
            snapshot = target;
            stillMore = newValue > snapshot;
        } while (stillMore && Interlocked.CompareExchange(ref target, newValue, snapshot) != snapshot);

        return stillMore;
    }
}
Update to the post I made here earlier: we found a better way to do the greater-than comparison by using an additional lock object. We wrote many unit tests to validate that a lock and Interlocked can be used together, but only for some cases.
How the code works: Interlocked uses memory barriers so that a read or write is atomic. The sync-lock is needed to make the greater-than comparison an atomic operation. So the rule now is that inside this class no other operation writes the value without this sync-lock.
What we get with this class is an interlocked value which can be read very fast, but writes take a little longer. Reads are about 2-4 times faster in our application.
Here is the code as an image: http://files.thekieners.com/blogcontent/2012/ExchangeIfGreaterThan2.png
And here as code to copy & paste:
public sealed class InterlockedValue
{
    private long _myValue;
    private readonly object _syncObj = new object();

    public long ReadValue()
    {
        // Reading the value (the 99.9% case in our app) does not use the lock object,
        // since that would be too much overhead in our highly multithreaded app.
        return Interlocked.Read(ref _myValue);
    }

    public bool SetValueIfGreaterThan(long value)
    {
        // Synchronize write access to _myValue, since a safe greater-than comparison is needed.
        lock (_syncObj)
        {
            // greater-than condition
            if (value > Interlocked.Read(ref _myValue))
            {
                // Now we can safely set _myValue.
                Interlocked.Exchange(ref _myValue, value);
                return true;
            }
            return false;
        }
    }
}
This isn't actually true, but it is useful to think of concurrency as coming in 2 forms:
Lock free concurrency
Lock based concurrency
It's not true because software lock based concurrency ends up being implemented using lock free atomic instructions somewhere down the stack (often in the kernel). Lock free atomic instructions, however, all ultimately end up acquiring a hardware lock on the memory bus. So, in reality, lock free concurrency and lock based concurrency are the same.
But conceptually, at the level of a user application, they are 2 distinct ways of doing things.
Lock based concurrency is based on the idea of "locking" access to a critical section of code. When one thread has "locked" a critical section, no other thread may have code running inside that same critical section. This is usually done by the use of "mutexes", which interface with the os scheduler and cause threads to become un-runnable while waiting to enter a locked critical section. The other approach is to use "spin locks" which cause a thread to spin in a loop, doing nothing useful, until the critical section becomes available.
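A spin lock itself reduces to a single atomic instruction in a loop. A minimal sketch (my illustration, not production code):

public sealed class SimpleSpinLock
{
    private int _held; // 0 = free, 1 = held

    public void Enter()
    {
        // Atomically flip 0 -> 1; spin (doing nothing useful) until we win.
        while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
            Thread.SpinWait(1);
    }

    public void Exit()
    {
        Volatile.Write(ref _held, 0); // release so waiters can observe it
    }
}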
Lock free concurrency is based on the idea of using atomic instructions (specially supported by the CPU), that are guaranteed by hardware to run atomically. Interlocked.Increment is a good example of lock free concurrency. It just calls special CPU instructions that do an atomic increment.
Lock free concurrency is hard. It gets particularly hard as the length and complexity of critical sections increase. Any step in a critical section can be simultaneously executed by any number of threads at once, and they can move at wildly different speeds. You have to make sure that despite that, the results of a system as a whole remain correct. For something like an increment, it can be simple (the cs is just one instruction). For more complex critical sections, things can get very complex very quickly.
Lock based concurrency is also hard, but not quite as hard as lock free concurrency. It allows you to create arbitrarily complex regions of code and know that only 1 thread is executing it at any time.
Lock free concurrency has one big advantage, however: speed. When used correctly it can be orders of magnitude faster than lock based concurrency. Spin loops are bad for long running critical sections because they waste CPU resources doing nothing. Mutexes can be bad for small critical sections because they introduce a lot of overhead. They involve a mode switch at a minimum, and multiple context switches in the worst case.
Consider implementing the managed heap. Calling into the OS every time "new" is called would be horrible. It would destroy the performance of your app. However, using lock free concurrency it's possible to implement gen 0 memory allocations using an interlocked increment (I don't know for sure if that's what the CLR does, but I'd be surprised if it wasn't). That can be a HUGE savings.
There are other uses, such as in lock free data structures, like persistent stacks and AVL trees. They usually use "cas" (compare and swap).
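The fast path of such an allocator can be sketched as a single interlocked add over a preallocated region (my illustration of the idea; the real CLR allocator is far more involved):

public sealed class BumpAllocator
{
    private readonly byte[] _region = new byte[16 * 1024 * 1024]; // pretend gen 0 segment
    private int _top; // next free offset

    // Reserves 'size' bytes and returns the start offset, or -1 when the region is exhausted.
    public int Allocate(int size)
    {
        int newTop = Interlocked.Add(ref _top, size); // the whole "allocation" is this one instruction
        return newTop <= _region.Length ? newTop - size : -1;
    }
}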
The reason, however, that locked based concurrency and lock free concurrency are really equivalent is because of the implementation details of each.
Spin locks usually use atomic instructions (typically cas) in their loop condition. Mutexes need to use either spin locks or atomic updates of internal kernel structures in their implementation.
The atomic instructions are in turn implemented using hardware locks.
In any case, they both have their sets of trade offs, usually centering on perf vs complexity. Mutexes can be both faster and slower than lock free code. Lock free code can be both more and less complex than a mutex. The appropriate mechanism to use depends on specific circumstances.
Now, to answer your question:
A method that did an interlocked compare-exchange-if-less-than would imply to callers that it is not using locks. You can't implement it with a single instruction the same way increment or compare exchange can be done. You could simulate it using a subtraction (to compute less than) with an interlocked compare exchange in a loop. You could also do it with a mutex (but that would imply a lock, so using "interlocked" in the name would be misleading). Is it appropriate to build the "simulated interlocked via CAS" version? That depends. If the code is called very frequently and has very little thread contention, then the answer is yes. If not, you can turn an O(1) operation with moderately high constant factors into an infinite (or very long) loop, in which case it would be better to use a mutex.
Most of the time it's not worth it.
What do you think about this implementation:
// this is an Interlocked.ExchangeIfGreaterThan implementation
private static void ExchangeIfGreaterThan(ref long location, long value)
{
    // read
    long current = Interlocked.Read(ref location);

    // compare
    while (current < value)
    {
        // set
        var previous = Interlocked.CompareExchange(ref location, value, current);

        // if another thread has set a greater value, we can break
        // or if the previous value is the current value, then no other thread has changed it in between
        if (previous == current || previous >= value) // note: most common case first
            break;

        // for all other cases, we need another run (read value, compare, set)
        current = Interlocked.Read(ref location);
    }
}
All interlocked operations have direct support in the hardware.
Interlocked operations and atomic data types are different things. An atomic type is a library-level feature. On some platforms and for some data types, atomics are implemented using interlocked instructions. In this case they are very effective.
In other cases, when the platform does not have interlocked operations at all, or they are not available for some particular data type, the library implements these operations using appropriate synchronization (critical section, mutex, etc.).
I am not sure whether Interlocked.GreaterThan is really needed; otherwise it might already be implemented. If you know a good example where it would be useful, I am sure everybody here will be happy to hear it.
Greater-than/less-than and equal-to comparisons are already atomic operations. That doesn't address the safe concurrent behavior of your application, though.
There is no point in making them part of the Interlocked family, so the question is: what are you actually trying to achieve?
I have a stored procedure that returns several result sets, and within (some) result sets thousands of rows. I am executing this stored proc from x threads at the same time, with a maximum of y running concurrently.
When I just run through the data:
using (SqlDataReader reader = command.ExecuteReader(CommandBehavior.CloseConnection))
{
    do
    {
        while (reader.Read())
        {
            for (int i = 0; i < reader.FieldCount; i++)
            {
                var value = reader.GetValue(i);
            }
        }
    }
    while (reader.NextResult());
}
I get reasonable throughput - and CPU usage. Now obviously this is useless, so I need some objects! OK, now I change it to do something like this:
using (SqlDataReader reader = command.ExecuteReader(CommandBehavior.CloseConnection))
{
    while (reader.Read())
    {
        Bob b = new Bob(reader);
        this.bobs.Add(b);
    }

    reader.NextResult();

    while (reader.Read())
    {
        Clarence c = new Clarence(reader);
        this.clarences.Add(c);
    }

    // More fun
}
And my implementation of data classes:
public Bob(SqlDataReader reader)
{
    this.some = reader.GetInt32(0);
    this.parameter = reader.GetInt32(1);
    this.value = reader.GetString(2);
}
And this performs worse, which is not surprising. What is surprising is that the CPU usage drops, by about 20-25% of total (i.e. not 25% of what it was using; it's close to 50% of what it was using)! Why would doing more work drop the CPU? I don't get it... it looks like there's some locking somewhere, but I don't understand where. I want to get the most out of the machine!
EDIT - changed code due to bad demo code example. Oops.
ALSO: to test the constructor theory, I changed the implementation of the one that creates the object. This time it also does a for loop over the fields and does a GetValue, and creates an empty object. So it doesn't do what I want, yes, but I wanted to see if creating the large number of objects is the issue. It isn't.
SECOND EDIT: looks like adding the objects to the list is the issue - CPU drops immediately as soon as I add these objects to the list. Not sure how to improve this... (or if it's worth investigating; the first scenario is obviously stupid)
I can see the performance issue with your methodology, but I am not sure I can pinpoint the exact way to explain the cause. Essentially, you are adding the flow of the cursor to instantiation rather than setting properties. If I were to take a guess, you are causing some context switching when you pull from the reader as part of your constructor.
If you think of a reader as a firehose cursor (it is) and think of the difference between the firehose being controlled by the user holding the hose (normal method) rather than the container being filled, you start to get a picture of the problem.
Not sure how the threading relates? But if you have multiple clients, and you are stopping the flow of the hose a bit more by moving bits over to a constructor rather than setting properties on a constructed object, and then contending for thread time across multiple requests, I can envision an even bigger issue under load.
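One way to test that guess (my sketch) is to drain each row into plain values first, so the firehose is never stalled by constructor work, and build the objects afterwards. The Bob constructor taking plain values instead of the reader is hypothetical:

var rows = new List<object[]>();
while (reader.Read())
{
    var values = new object[reader.FieldCount];
    reader.GetValues(values); // copy the whole row off the wire in one call
    rows.Add(values);
}

// Construct the objects only after the reader has been drained.
foreach (var v in rows)
    this.bobs.Add(new Bob((int)v[0], (int)v[1], (string)v[2]));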
Well, the reason the CPU isn't maxed out is simply that it's waiting on something else. Either the disk or the network.
Considering your first code block simply reads and immediately discards a value there are several possibilities: One is that the compiler could be optimizing out the allocation of the value variable because it is limited in scope and never used. This would mean the processor doesn't have to wait for memory allocation. Instead it would be limited mainly by how fast you can get the data off of the network wire.
Normally memory allocation ought to be super fast; however, if the amount of memory you are allocating causes Windows to push stuff out to disk, then this will be held back by your hard drive's speed.
Looking at your second code block, you are constructing lots of objects (more memory usage) and keeping them around. Neither of which allows the compiler to optimize the calls out; and both of which put increased pressure on local system resources beyond just the processor.
To determine if the above is even close, you need to put watch counters on the amount of memory used by the app vs the amount of available RAM. Also you'll want counters on your disk system to see if it's being heavily accessed under either scenario.
Point is, try and watch EVERYTHING that is going on in the machine while the code blocks are running. That will give you a better idea of why the processor is waiting.
Maybe it's just a matter of the Bob and Clarence construction being inherently less CPU-intensive, and that part of the execution simply takes some time because of I/O bottlenecks or something along those lines.
Your best bet is to run this through a profiler and take a look at the report.