Google is giving me headaches with this search term.
I need a thread-safe mechanism to achieve the following:
A thread-safe list with insert priority over read.
I need to always be able to insert a message (let's say) into the queue (or whatever), and occasionally be able to read.
So reading can never, ever, interfere with inserting.
Thanks.
EDIT: Reading would also mean clearing the part that was read.
EDIT2: Maybe helpful, there is a single reader and a single writer.
EDIT3: Case scenario: 10 inserts per second for a period of 1 minute (or the max possible on the hardware the software runs on). Then an insert pause of 1 minute. Then 20 inserts in 2 seconds (or the max possible on the hardware) for a period of 30 sec. Then a pause of 30 sec. Then the pause is used for the max number of reads. I don't know if I am being clear enough. Obviously not. (PS: I don't know when the pause will occur, that is the problem.) Max acceptable delay for an insert: the time it takes the Enqueue or Add method itself to finish.
ADDITIONAL: Could a ConcurrentDictionary with AddOrUpdate, TryGetValue and TryRemove be used?
Construct your queue as a linked list of objects. Keep a reference to the head and the tail of the queue. The pseudo code below roughly illustrates the idea:
QueueEntity Head;
QueueEntity Tail;

class QueueEntity
{
    QueueEntity Prev;
    QueueEntity Next;
    ... // queue content
}
and then do this:
//Read
lock (Tail)
{
    //get the content
    Tail = Tail.Prev;
}

//Write
lock (Head)
{
    newEntity = new QueueEntity();
    newEntity.Next = Head;
    Head.Prev = newEntity;
    Head = newEntity;
}
Here you have separate locks for reading and writing, and they won't block each other unless there is only one entity in your queue.
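For reference, here is a slightly fuller sketch of the same idea in C# (my own illustration, not tested production code): it follows the classic two-lock queue design, with a sentinel node added so that the reader and the writer never operate on the same entity, and with dedicated lock objects instead of locking Head/Tail directly (locking a field that you then reassign locks the old object, not the field):

using System;

public class TwoLockQueue<T>
{
    private sealed class Node
    {
        public T Value;
        public Node Next;
    }

    // Dedicated lock objects: the Head/Tail references change over time,
    // so locking them directly would lock the wrong object.
    private readonly object _headLock = new object();
    private readonly object _tailLock = new object();

    private Node _head;   // read side (always points at a sentinel)
    private Node _tail;   // insert side

    public TwoLockQueue()
    {
        // Sentinel node: _head and _tail are never null, and the reader
        // and writer never touch the same live entry.
        _head = _tail = new Node();
    }

    // Writer: takes only the tail lock, so a concurrent read can
    // never delay an insert.
    public void Enqueue(T item)
    {
        var node = new Node { Value = item };
        lock (_tailLock)
        {
            _tail.Next = node;
            _tail = node;
        }
    }

    // Reader: takes only the head lock.
    public bool TryDequeue(out T item)
    {
        lock (_headLock)
        {
            Node first = _head.Next;
            if (first == null)
            {
                item = default(T);
                return false;
            }
            item = first.Value;
            _head = first;   // the old sentinel becomes garbage
            return true;
        }
    }
}

Note that the pseudo code above inserts at Head and reads at Tail; the sketch uses the more common enqueue-at-tail/dequeue-at-head direction, but the two-lock separation is the same.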
I have an application that uses Hangfire to run long-running jobs for me (I know how long a job takes, and it is always roughly the same), and in my UI I want to give an estimate of when a certain job will be done. For that I need to query Hangfire for the position of the job in the queue and for the number of servers working on it.
I know I can get the number of enqueued jobs (in the "DEFAULT" queue) by
public long JobsInQueue() {
    var monitor = JobStorage.Current.GetMonitoringApi();
    return monitor.EnqueuedCount("DEFAULT");
}
and the number of servers by
public int HealthyServers() {
    var monitor = JobStorage.Current.GetMonitoringApi();
    return monitor.Servers().Count(n => (n.Heartbeat != null) && (DateTime.Now - n.Heartbeat.Value).TotalMinutes < 5);
}
(BTW: I exclude older heartbeats because, if I turn off servers, they sometimes linger in the Hangfire database. Is there a better way?) But to give a proper estimate I need to know the position of the job in the queue. How do I get that?
The problem you have is that hangfire is asynchronous, queued, parallel, exhibits an at-least-once durability semantic, and basically non-deterministic.
To know with certainty the order in which an item will finish being processed in such a system is impossible. In fact, if the requirement was to enforce strict ordering, then many of the benefits of hangfire would go away.
There is a very good blog post by @odinserj (the author of Hangfire) where he outlines this point: http://odinserj.net/2014/05/10/are-your-methods-ready-to-run-in-background/
That said, it's not impossible to come up with a sensible estimation algorithm, but it would have to be one where the order of execution is approximated in some way. As to how to arrive at such an algorithm, I don't know, but something like this might work (though it probably won't):
Approximate seconds remaining until completion =
    (
        (average duration of job in seconds * queue depth)
        / (the lower of: number of hangfire threads OR queue depth)
    )
    - number of seconds already spent in queue
    + average duration of job in seconds
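For what it's worth, here is a rough translation of that formula into code, using only the monitoring calls already shown in the question. averageJobDurationSeconds, workerThreads and secondsAlreadyQueued are assumed inputs you would have to measure or configure yourself:

public double EstimateSecondsUntilDone(double averageJobDurationSeconds,
                                       int workerThreads,
                                       double secondsAlreadyQueued)
{
    var monitor = JobStorage.Current.GetMonitoringApi();
    long queueDepth = monitor.EnqueuedCount("DEFAULT");

    // the lower of: number of hangfire threads OR queue depth (at least 1)
    long parallelism = Math.Max(1, Math.Min((long)workerThreads, queueDepth));

    return (averageJobDurationSeconds * queueDepth) / parallelism
           - secondsAlreadyQueued
           + averageJobDurationSeconds;
}

Treat the result as the rough guess the answer describes, not as a guarantee.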
I have the following issue:
I am using a Parallel.ForEach iteration for a pretty CPU-intensive workload (applying a method on a number of items) and it works fine for about the first 80% of the items, using all CPU cores very nicely.
As the iteration comes near to the end (around 80%, I would say), I see that the number of threads begins to go down core by core, and at the end the last roughly 5% of the items are processed by only two cores. So instead of using all cores until the end, it slows down pretty hard toward the end of the iteration.
Please note that the workload can differ a lot per item. One item can last 1-2 seconds, another can take 2-3 minutes to finish.
Any idea or suggestion is very welcome.
Code used:
var source = myList.ToArray();
var rangePartitioner = Partitioner.Create(0, source.Length);

using (SqlConnection connection = new SqlConnection(cnStr))
{
    connection.Open();
    try
    {
        Parallel.ForEach(rangePartitioner, (range, loopState) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                CPUIntensiveMethod(source[i]);
            }
        });
    }
    catch (AggregateException ae)
    {
        // exception handling
    }
}
This is an unavoidable consequence of the fact that the parallelism is per computation. Clearly, the whole parallel batch cannot run any quicker than the time taken by the slowest single item in the work-set.
Imagine a batch of 100 items, 8 of which are slow (say 1000 s to run) and the rest quick (say 1 s to run). You kick them off in a random order across 8 threads. It's clear that eventually each thread will be calculating one of your long-running items, and at this point you are seeing full utilisation. Eventually the one(s) that hit their long op(s) first will finish them and quickly complete any remaining short ops. At that time you ONLY have some of the long ops waiting to finish, so you will see the active utilisation drop off; i.e. at some point there are only 3 ops left to finish, so only 3 cores are in use.
Mitigation Tactics
Your long-running items might be amenable to 'internal parallelism', allowing them to have a shorter minimum runtime.
Your long-running items may be able to be identified and prioritised to start first (which will ensure you get full CPU utilisation for as long as possible).
(see update below) DON'T use partitioning in cases where the body can be long-running, as this simply increases the 'hit' of this effect (i.e. get rid of your rangePartitioner entirely; see the sketch after the update below). This will massively reduce the impact of this effect on your particular loop.
Either way, your batch run-time is bound by the run-time of the slowest item in the batch.
Update: I have also noticed that you are using partitioning on your loop, which massively increases the scope of this effect; i.e. you are saying 'break this work-set down into N work-sets' and then parallelising the running of those N work-sets. In the example above this could mean that you get (say) 3 of the long ops into the same work-set, so those are going to be processed on the same thread. As such, you should NOT be using partitioning if the inner body can be long-running. For example, the docs on partitioning at https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx say that it is aimed at short bodies.
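To make that concrete, here is a minimal sketch of the same loop without the range partitioner (myList and CPUIntensiveMethod are the names from the question); each item is now scheduled independently, so one slow item can no longer hold up a whole pre-assigned range:

try
{
    Parallel.ForEach(myList, item =>
    {
        CPUIntensiveMethod(item);
    });
}
catch (AggregateException ae)
{
    // exception handling, as in the original code
}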
If you have multiple threads that each process the same number of items, and each item takes a varying amount of time, then of course some threads will finish earlier than others.
If you use a collection whose size is not known, the items will be taken one by one:
var source = myList.AsEnumerable();
Another approach can be the Producer-Consumer pattern:
https://msdn.microsoft.com/en-us/library/dd997371
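For illustration, here is a minimal producer-consumer sketch along the lines of that link, using BlockingCollection<T> (WorkItem is a placeholder for the question's item type; myList and CPUIntensiveMethod are the names from the question):

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var work = new BlockingCollection<WorkItem>();

// Producer: feed the items and signal that no more are coming.
var producer = Task.Run(() =>
{
    foreach (var item in myList)
        work.Add(item);
    work.CompleteAdding();
});

// Consumers: each one pulls the next item as soon as it is free,
// so a slow item never blocks the others from being picked up.
var consumers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var item in work.GetConsumingEnumerable())
            CPUIntensiveMethod(item);
    }))
    .ToArray();

Task.WaitAll(consumers);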
One day I was trying to get a better understanding of threading concepts, so I wrote a couple of test programs. One of them was:
using System;
using System.Threading.Tasks;

class Program
{
    static volatile int a = 0;

    static void Main(string[] args)
    {
        Task[] tasks = new Task[4];
        for (int h = 0; h < 20; h++)
        {
            a = 0;
            for (int i = 0; i < tasks.Length; i++)
            {
                tasks[i] = new Task(() => DoStuff());
                tasks[i].Start();
            }
            Task.WaitAll(tasks);
            Console.WriteLine(a);
        }
        Console.ReadKey();
    }

    static void DoStuff()
    {
        for (int i = 0; i < 500000; i++)
        {
            a++;
        }
    }
}
I hoped I would be able to see outputs less than 2000000. My mental model was the following: multiple threads read the variable a at the same time, so all local copies of a are the same; the threads increment it, the writes happen, and one or more increments are "lost" this way.
The output, however, contradicts this reasoning. One sample output (from a Core i5 machine):
2000000
1497903
1026329
2000000
1281604
1395634
1417712
1397300
1396031
1285850
1092027
1068205
1091915
1300493
1357077
1133384
1485279
1290272
1048169
704754
If my reasoning were true, I would see 2000000 occasionally and otherwise numbers a bit less. But what I see is 2000000 occasionally and numbers way less than 2000000. This indicates that what happens behind the scenes is not just a couple of "increment losses" but that something more is going on. Could somebody explain the situation to me?
Edit:
When I was writing this test program I was fully aware of how I could make it thread safe, and I was expecting to see numbers less than 2000000. Let me explain why I was surprised by the output. First, let's assume that the reasoning above is correct. Second assumption (this very well may be the source of my confusion): if the conflicts happen (and they do), then these conflicts are random, and I expect a somewhat normal distribution for these random event occurrences. In this case the first line of the output says: out of 500000 experiments the random event never occurred. The second line says: the random event occurred at least 167365 times. The difference between 0 and 167365 is just too big (almost impossible with a normal distribution). So the case boils down to the following:
One of the two assumptions (the "increment loss" model or the "somewhat normally distributed parallel conflicts" model) is incorrect. Which one, and why?
The behavior stems from the combination of using the volatile keyword and not locking access to the variable a when using the increment operator (++). (You still get a random distribution when not using volatile, but using volatile changes the nature of that distribution, which is explored below.)
When using the increment operator, it's the equivalent of:
a = a + 1;
In this case, you're actually performing three operations, not one:
Read the value of a
Add 1 to the value of a
Assign the result of step 2 back to a
While the volatile keyword serializes access, in the above case, it's serializing access to three separate operations, not serializing access to them collectively, as an atomic unit of work.
Because you're performing three operations when incrementing instead of one, you have additions that are being dropped.
Consider this:
Time    Thread 1                  Thread 2
----    --------                  --------
0       read a (1)                read a (1)
1       evaluate a + 1 (2)        evaluate a + 1 (2)
2       write result to a (3)     write result to a (3)
Or even this:
Time    a    Thread 1              Thread 2              Thread 3
----    -    --------              --------              --------
0       1    read a                                      read a
1       1    evaluate a + 1 (2)
2       2    write back to a
3       2                          read a
4       2                          evaluate a + 1 (3)
5       3                          write back to a
6       3                                                evaluate a + 1 (2)
7       2                                                write back to a
Note in particular steps 5-7, thread 2 has written a value back to a, but because thread 3 has an old, stale value, it actually overwrites the results that previous threads have written, essentially wiping out any trace of those increments.
As you can see, as you add more threads, you have a greater potential to mix up the order in which the operations are being performed.
volatile will prevent you from corrupting the value of a due to two writes happening at the same time, or a corrupt read of a due to a write happening during a read, but it doesn't do anything to handle making the operations atomic in this case (since you're performing three operations).
In this case, volatile ensures that the distribution of the value of a is between 0 and 2,000,000 (four threads * 500,000 iterations per thread) because of this serialization of access to a. Without volatile, you run the risk of a being anything as you can run into corruption of the value a when reads and/or writes happen at the same time.
Because you haven't synchronized access to a for the entire increment operation, the results are unpredictable, as you have writes that are being overwritten (as seen in the previous example).
What's going on in your case?
For your specific case you have many writes that are being overwritten, not just a few; with four threads each running a loop of 500,000 increments (two million writes in total), theoretically all of the writes could be overwritten (expand the second example to four threads and then add a few hundred thousand rows for the loop iterations).
While it's not really probable, there shouldn't be an expectation that you wouldn't drop a tremendous number of writes.
Additionally, Task is an abstraction. In reality (assuming you are using the default scheduler), it uses the ThreadPool class to get threads to process your requests. The ThreadPool is ultimately shared with other operations (some internal to the CLR, even in this case), and even then it does things like work-stealing and using the current thread for operations, and ultimately at some point it drops down to the operating system to get a thread to perform the work on.
Because of this, you can't assume that there's a random distribution of overwrites that will be skipped, as there's always a lot more going on that will throw whatever order you expect out the window; the order of processing is undefined, and the allocation of work is never evenly distributed.
If you want to ensure that additions won't be overwritten, then you should use the Interlocked.Increment method in the DoStuff method, like so:
for (int i = 0; i < 500000; i++)
{
    Interlocked.Increment(ref a);
}
This will ensure that all writes will take place, and your output will be 2000000 twenty times (as per your loop).
It also removes the need for the volatile keyword, as you're making the operations you need atomic.
The volatile keyword is good when the operation that you need to make atomic is limited to a single read or write.
If you have to do anything more than a read or a write, then the volatile keyword is too granular, you need a more coarse locking mechanism.
In this case, it's Interlocked.Increment, but if you have more work than that to do, then the lock statement will more than likely be what you rely on.
I don't think anything else is happening - it's just happening a lot. If you add locking or some other sync technique (see: Best thread-safe way to increment an integer up to 65535) you'll reliably get the full 2,000,000 increments.
Each task is calling DoStuff() as you'd expect.
private static object locker = new object();

static void DoStuff()
{
    for (int i = 0; i < 500000; i++)
    {
        lock (locker)
        {
            a++;
        }
    }
}
Try increasing the amounts; the timespan is simply too short to draw any conclusions from. Remember that normal IO is in the range of milliseconds, and just one blocking IO op in this case would render the results useless.
Something along the lines of this is better (or why not int.MaxValue?):
static void DoStuff()
{
    for (int i = 0; i < 50000000; i++) // 50 000 000
        a++;
}
My results ("correct" being 400 000 000):
63838940
60811151
70716761
62101690
61798372
64849158
68786233
67849788
69044365
68621685
86184950
77382352
74374061
58356697
70683366
71841576
62955710
70824563
63564392
71135381
Not really a normal distribution, but we are getting there. Bear in mind that this is roughly 35% of the correct amount.
I can explain my results by the fact that I am running on 2 physical cores (viewed as 4 due to hyper-threading), which means that if it is optimal to do an HT switch during the actual addition, at least 50% of the additions will be "removed" (if I remember the HT implementation correctly, it would be, e.g., modifying one thread's data in the ALU while loading/saving another thread's data). The remaining ~15% of the loss is due to the program actually running on 2 cores in parallel.
My recommendations:
post your hardware
increase the loop count
vary the TaskCount (see the sketch after this list)
hardware matters!
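To make the last two recommendations easy to try, here is a small parameterized version of the test (a sketch meant to slot into the question's Program class; it needs using System.Linq; and keeps the increment unsynchronized on purpose):

static void RunTest(int taskCount, int iterations)
{
    a = 0;
    Task[] tasks = Enumerable.Range(0, taskCount)
        .Select(_ => Task.Run(() =>
        {
            for (int i = 0; i < iterations; i++)
                a++;   // still racy, by design
        }))
        .ToArray();
    Task.WaitAll(tasks);
    Console.WriteLine("{0} tasks x {1}: {2}", taskCount, iterations, a);
}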
OK, so I just started screwing around with threading. It's taking a bit of time to wrap my head around the concepts, so I wrote a pretty simple test to see how much faster, if faster at all, printing out 20000 lines would be (and I figured it would be faster since I have a quad-core processor?).
So first I wrote this (this is how I would normally do it):
System.DateTime startdate = DateTime.Now;
for (int i = 0; i < 10000; ++i)
{
    Console.WriteLine("Producing " + i);
    Console.WriteLine("\t\t\t\tConsuming " + i);
}
System.DateTime endtime = DateTime.Now;
Console.WriteLine(startdate.Second + ":" + startdate.Millisecond + " to " + endtime.Second + ":" + endtime.Millisecond);
And then with threading:
using System;
using System.Collections;
using System.Threading;

public class Test
{
    static ProducerConsumer queue;
    public System.DateTime startdate = DateTime.Now;

    static void Main()
    {
        queue = new ProducerConsumer();
        new Thread(new ThreadStart(ConsumerJob)).Start();
        for (int i = 0; i < 10000; i++)
        {
            Console.WriteLine("Producing {0}", i);
            queue.Produce(i);
        }
        Test a = new Test();
    }

    static void ConsumerJob()
    {
        Test a = new Test();
        for (int i = 0; i < 10000; i++)
        {
            object o = queue.Consume();
            Console.WriteLine("\t\t\t\tConsuming {0}", o);
        }
        System.DateTime endtime = DateTime.Now;
        Console.WriteLine(a.startdate.Second + ":" + a.startdate.Millisecond + " to " + endtime.Second + ":" + endtime.Millisecond);
    }
}

public class ProducerConsumer
{
    readonly object listLock = new object();
    Queue queue = new Queue();

    public void Produce(object o)
    {
        lock (listLock)
        {
            queue.Enqueue(o);
            Monitor.Pulse(listLock);
        }
    }

    public object Consume()
    {
        lock (listLock)
        {
            while (queue.Count == 0)
            {
                Monitor.Wait(listLock);
            }
            return queue.Dequeue();
        }
    }
}
Now, for some reason I assumed this would be faster, but after testing it 15 times, the median of the results is ... a few milliseconds different in favor of non-threading.
Then I figured hey ... maybe I should try it with a million Console.WriteLines, but the results were similar.
Am I doing something wrong?
Writing to the console is internally synchronized. It is not parallel. It also causes cross-process communication.
In short: It is the worst possible benchmark I can think of ;-)
Try benchmarking something real, something that you actually would want to speed up. It needs to be CPU bound and not internally synchronized.
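As an illustration, here is a sketch of what a more meaningful benchmark could look like: purely CPU-bound work, no shared state, and no console output inside the timed region (the square-root loop is just made-up work):

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class Benchmark
{
    static double Work(int n)
    {
        double acc = 0;
        for (int i = 1; i <= n; i++)
            acc += Math.Sqrt(i);   // CPU-bound, no locks, no I/O
        return acc;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 16; i++)
            Work(10000000);
        Console.WriteLine("Sequential: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        Parallel.For(0, 16, _ => Work(10000000));
        Console.WriteLine("Parallel:   {0} ms", sw.ElapsedMilliseconds);
    }
}

On a multi-core machine the parallel loop should now actually be faster, because the work is CPU-bound and nothing inside it is serialized.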
As far as I can see you have only got one thread servicing the queue, so why would this be any quicker?
I have an example for why your expectation of a big speedup through multi-threading is wrong:
Assume you want to upload 100 pictures. The single threaded variant loads the first, uploads it, loads the second, uploads it, etc.
The limiting part here is the bandwidth of your internet connection (assuming that every upload uses up all the upload bandwidth you have).
What happens if you create 100 threads to upload 1 picture only? Well, each thread reads its picture (this is the part that speeds things up a little, because reading the pictures is done in parallel instead of one after the other).
As the currently active thread uses 100% of the internet upload bandwidth to upload its picture, no other thread can upload a single byte in the meantime. As the total amount of bytes that needs to be transmitted is the same, the time that 100 threads need to upload one picture each is the same as the time one thread needs to upload 100 pictures one after the other.
You would only get a speedup if uploading a picture used, let's say, 50% of the available bandwidth. Then, 100 threads would be done in 50% of the time it would take one thread to upload 100 pictures.
"For some reason i assumed this would be faster"
If you don't know why you assumed it would be faster, why are you surprised that it's not? Simply starting up new threads is never guaranteed to make any operation run faster. There has to be some inefficiency in the original algorithm that a new thread can reduce (and that is sufficient to overcome the extra overhead of creating the thread).
All the advice given by others is good advice, especially the mention of the fact that the console is serialized, as well as the fact that adding threads does not guarantee speedup.
What I want to point out and what it seems the others missed is that in your original scenario you are printing everything in the main thread, while in the second scenario you are merely delegating the entire printing task to the secondary worker. This cannot be any faster than your original scenario because you simply traded one worker for another.
A scenario where you might see speedup is this one:
for (int i = 0; i < largeNumber; i++)
{
    // embarrassingly parallel task that takes some time to process
}
and then replacing that with:
Parallel.For(0, largeNumber,
    i =>
    {
        // embarrassingly parallel task that takes some time to process
    });
This will split the loop among the workers such that each worker processes a smaller chunk of the original data. If the task does not need synchronization you should see the expected speedup.
Cool test.
One thing to keep in mind when dealing with threads is bottlenecks. Consider this:
You have a restaurant. Your kitchen can make a new order every 10
minutes (your chef has a bladder problem so he's always in the
bathroom, but he is your girlfriend's cousin), so he produces 6 orders
an hour.
You currently employ only one waiter, who can attend tables
immediately (he's probably on E, but you don't care as long as the
service is good).
During the first week of business everything is fine: you get
customers every ten minutes. Customers still wait exactly ten minutes
for their meal, but that's fine.
However, after that week, you are getting as many as 2 customers every
ten minutes, and they have to wait as much as 20 minutes to get their
meal. They start complaining and making noise. And god, do you have
noise. So what do you do?
Waiters are cheap, so you hire two more. Will the wait time change?
Not at all... the waiters will take orders faster, sure (attending two
customers in parallel), but some customers still wait 20 minutes for
the chef to complete their orders. You need another chef, but as you
search, you discover chefs are lacking! Every one of them is on TV
doing some crazy reality show (except for your girlfriend's cousin
who, you discover, is actually a former drug dealer).
In your case, the waiters are the threads making calls to Console.WriteLine, but your chef is the Console itself. It can only service so many calls a second. Adding some threads might make things a bit faster, but the gains should be minimal.
You have multiple sources but only 1 output. In that case multi-threading will not speed things up. It's like having a road where 4 lanes merge into 1 lane. Having 4 lanes will move traffic faster, but at the end it will slow back down when it merges into 1 lane.
Possible Duplicate:
Threads synchronization. How exactly lock makes access to memory 'correct'?
This question is inspired by this one.
We have the following test class:
class Test
{
    private static object ms_Lock = new object();
    private static int ms_Sum = 0;

    public static void Main()
    {
        Parallel.Invoke(HalfJob, HalfJob);
        Console.WriteLine(ms_Sum);
        Console.ReadLine();
    }

    private static void HalfJob()
    {
        for (int i = 0; i < 50000000; i++)
        {
            lock (ms_Lock) { } // empty lock
            ms_Sum += 1;
        }
    }
}
The actual result is very close to the expected value of 100 000 000 (50 000 000 x 2, since 2 loops are running at the same time), with a difference of around 200-600 (an error of approx. 0.0004% on my machine, which is very low). No other way of synchronization provides that kind of approximation (it's either a much bigger error percentage, or it's 100% correct).
We currently understand that this level of precision is because the program runs in the following way (original diagram omitted): time runs left to right, and the 2 threads are represented by two rows, where a black box represents the process of acquiring, holding and releasing the lock plus the addition operation (to scale on my PC, the lock takes approximately 20 times longer than the add), and a white box represents a period consisting of trying to acquire the lock and then waiting for it to become available.
Also, lock provides a full memory fence.
So the question now is: if the above schema represents what is going on, what is the cause of such a big error? (It counts as big because the schema looks like a very strong synchronization scheme.) We could understand a difference of 1-10 at the boundaries, but that is clearly not the only source of the error. We cannot see when the writes to ms_Sum could happen at the same time and cause it.
EDIT: many people like to jump to quick conclusions. I know what synchronization is, and that the above construct is not a real or even close-to-good way to synchronize threads if we need a correct result. Have some faith in the poster, or maybe read the linked answer first. I don't need a way to synchronize 2 threads to perform additions in parallel; I am exploring this extravagant, and yet efficient compared to any approximate alternative, synchronization construct (it does synchronize to some extent, so it's not meaningless as suggested).
lock(ms_Lock) { } is a meaningless construct: lock guarantees exclusive execution only of the code inside it.
Let me explain why this empty lock decreases (but doesn't eliminate!) the chance of data corruption. Let's simplify the threading model a bit:
A thread executes one line of code per time slice.
Thread scheduling is done in a strict round-robin manner (A-B-A-B).
Monitor.Enter/Exit takes significantly longer to execute than arithmetic (let's say 3 times longer; I stuffed the code with Nops that mean the previous line is still executing).
The real += takes 3 steps; I broke them down into atomic ones.
The left columns show which line each thread (A and B) executes at each time slice; the right column shows the program (according to my model).
 A    B
 1    1     SomeOperation();
 1    2     SomeOperation();
 2    3     Monitor.Enter(ms_Lock);
 2    4     Nop();
 3    5     Nop();
 4    6     Monitor.Exit(ms_Lock);
 5    7     Nop();
 7    8     Nop();
 8    9     int temp = ms_Sum;
 3   10     temp++;
 9   11     ms_Sum = temp;
 4
10
 5
11
 A    B
 1    1     SomeOperation();
 1    2     SomeOperation();
 2    3     int temp = ms_Sum;
 2    4     temp++;
 3    5     ms_Sum = temp;
 3
 4
 4
 5
 5
As you can see, in the first scenario thread B just can't catch thread A, and A has enough time to finish the execution of ms_Sum += 1;. In the second scenario, ms_Sum += 1; is interleaved and causes constant data corruption. In reality thread scheduling is stochastic, but it means thread A has a better chance to finish its increment before the other thread gets there.
This is a very tight loop with not much going on inside it, so ms_Sum += 1 has a reasonable chance of being executed at "just the wrong moment" by the parallel threads.
Why would you ever write code like this in practice?
Why not:
lock(ms_Lock) {
    ms_Sum += 1;
}
or just:
Interlocked.Increment(ref ms_Sum);
?
-- EDIT ---
Some comments on why you would see the error despite the memory-barrier aspect of the lock... Imagine the following scenario:
Thread A enters the lock, leaves the lock and then is pre-empted by the OS scheduler.
Thread B enters and leaves the lock (possibly once, possibly more than once, possibly millions of times).
At that point the thread A is scheduled again.
Both A and B hit the ms_Sum += 1 at the same time, resulting in some increments being lost (because increment = load + add + store).
As noted the statement
lock(ms_Lock) {}
will cause a full memory barrier. In short, this means the value of ms_Sum will be flushed from all caches and updated ("visible") among all threads.
However, ms_Sum += 1 is still not atomic, as it is just short-hand for ms_Sum = ms_Sum + 1: a read, an operation, and an assignment. In this construct there is still a race condition: the count in ms_Sum can be slightly lower than expected. I would also expect the difference to be larger without the memory barrier.
Here is a hypothetical situation of why it might be lower (A and B represent threads and a and b represent thread-local registers):
A: read ms_Sum -> a
B: read ms_Sum -> b
A: write a + 1 -> ms_Sum
B: write b + 1 -> ms_Sum // change from A "discarded"
This depends on a very particular order of interleaving and is dependent upon factors such as thread execution granularity and relative time spent in said non-atomic region. I suspect the lock itself will reduce (but not eliminate) the chance of the interleave above because each thread must wait-in-turn to get through it. The relative time spent in the lock itself to the increment may also play a factor.
Happy coding.
As others have noted, use the critical region established by the lock or one of the atomic increments provided to make it truly thread-safe.
As noted: lock(ms_Lock) { } locks an empty block and so does nothing. You still have a race condition with ms_Sum += 1;. You need:
lock( ms_Lock )
{
ms_Sum += 1 ;
}
[Edited to note:]
Unless you properly serialize access to ms_Sum, you have a race condition. Your code, as written, does the following (assuming the optimizer doesn't just throw away the useless lock statement):
Acquire lock
Release lock
Get value of ms_Sum
Increment value of ms_Sum
Store value of ms_Sum
Each thread may be suspended at any point, even in mid-instruction. Unless it is specifically documented as being atomic, it's a fair bet that any machine instruction that takes more than 1 clock cycle to execute may be interrupted mid-execution.
So let's assume that your lock is actually serializing the two threads. There is still nothing in place to prevent one thread from getting suspended (and thus giving priority to the other) whilst it is somewhere in the middle of executing the last three steps.
So the first thread in locks, releases, gets the value of ms_Sum, and is then suspended. The second thread comes in, locks, releases, gets the [same] value of ms_Sum, increments it and stores the new value back in ms_Sum, then gets suspended. The first thread increments its now-outdated value and stores it.
There's your race condition.
The += operator is not atomic: first it reads, then it writes the new value. In the meantime, between the read and the write, thread A could be switched out in favor of thread B, before having written the value. Thread B then doesn't see the new value, because it was never assigned by thread A, and when execution returns to thread A, it overwrites and discards all of thread B's work.