When going through the CLR/CLI specs and memory models etc, I noticed the wording around atomic reads/writes according to the ECMA CLI spec:
A conforming CLI shall guarantee that read and write access to
properly aligned memory locations no larger than the native word size
(the size of type native int) is atomic when all the write accesses to
a location are the same size.
Specifically the phrase 'properly aligned memory' caught my eye. I wondered if I could somehow get torn reads with a long type on a 64-bit system with some trickery. So I wrote the following test-case:
unsafe class Program {
const int NUM_ITERATIONS = 200000000;
const long STARTING_VALUE = 0x100000000L + 123L;
const int NUM_LONGS = 200;
private static int prevLongWriteIndex = 0;
private static long* misalignedLongPtr = (long*) GetMisalignedHeapLongs(NUM_LONGS);
public static long SharedState {
get {
Thread.MemoryBarrier();
return misalignedLongPtr[prevLongWriteIndex % NUM_LONGS];
}
set {
var myIndex = Interlocked.Increment(ref prevLongWriteIndex) % NUM_LONGS;
misalignedLongPtr[myIndex] = value;
}
}
static unsafe void Main(string[] args) {
Thread writerThread = new Thread(WriterThreadEntry);
Thread readerThread = new Thread(ReaderThreadEntry);
writerThread.Start();
readerThread.Start();
writerThread.Join();
readerThread.Join();
Console.WriteLine("Done");
Console.ReadKey();
}
private static IntPtr GetMisalignedHeapLongs(int count) {
const int ALIGNMENT = 7;
IntPtr reservedMemory = Marshal.AllocHGlobal(new IntPtr(sizeof(long) * count + ALIGNMENT - 1));
long allocationOffset = (long) reservedMemory % ALIGNMENT;
if (allocationOffset == 0L) return reservedMemory;
return reservedMemory + (int) (ALIGNMENT - allocationOffset);
}
private static void WriterThreadEntry() {
for (int i = 0; i < NUM_ITERATIONS; ++i) {
SharedState = STARTING_VALUE + i;
}
}
private static void ReaderThreadEntry() {
for (int i = 0; i < NUM_ITERATIONS; ++i) {
var sharedStateLocal = SharedState;
if (sharedStateLocal < STARTING_VALUE) Console.WriteLine("Torn read detected: " + sharedStateLocal);
}
}
}
However, no matter how many times I run the program I never legitimately see the line "Torn read detected!". So why not?
I allocated multiple longs in a single block in the hopes that at least one of them would spill between two cache lines; and the 'start point' for the first long should be misaligned (unless I'm misunderstanding something).
Also I know that the nature of multithreading errors means they can be hard to force, and that my 'test program' isn't as rigorous as it could be, but I've run the program almost 30 times now with no results- each with 200000000 iterations.
There are a number of flaws in this program that hides torn reads. Reasoning about the behavior of unsynchronized threads is never simple, and hard to explain, the odds for accidental synchronization are always high.
var myIndex = Interlocked.Increment(ref prevLongWriteIndex) % NUM_LONGS;
Nothing very subtle about Interlocked, unfortunately it affects the reader thread a great deal as well. Pretty hard to see, but you can use Stopwatch to time the execution of the threads. You'll see that Interlocked on the writer slows down the reader by a factor of ~2. Enough to affect the timing of the reader and not repro the problem, accidental synchronization.
Simplest way to eliminate the hazard and maximize the odds of detecting a torn read is to just always read and write from the same memory location. Fix:
var myIndex = 0;
if (sharedStateLocal < STARTING_VALUE)
This test doesn't help much to detect torn reads, there are many that simply don't trigger the test. Having too many binary zeros in the STARTING_VALUE make it extra unlikely. A good alternative that maximizes the odds for detection is to alternate between 1 and -1, ensuring the byte values are always different and making the test very simple. Thus:
private static void WriterThreadEntry() {
for (int i = 0; i < NUM_ITERATIONS; ++i) {
SharedState = 1;
SharedState = -1;
}
}
private static void ReaderThreadEntry() {
for (int i = 0; i < NUM_ITERATIONS; ++i) {
var sharedStateLocal = SharedState;
if (Math.Abs(sharedStateLocal) != 1) {
Console.WriteLine("Torn read detected: " + sharedStateLocal);
}
}
}
That quickly gets you several pages of torn reads in the console in 32-bit mode. To get them in 64-bit as well you need to do extra work to get the variable mis-aligned. It needs to straddle the L1 cache-line boundary so the processor has to perform two reads and writes, like it does in 32-bit mode. Fix:
private static IntPtr GetMisalignedHeapLongs(int count) {
const int ALIGNMENT = -1;
IntPtr reservedMemory = Marshal.AllocHGlobal(new IntPtr(sizeof(long) * count + 64 + 15));
long cachelineStart = 64 * (((long)reservedMemory + 63) / 64);
long misalignedAddr = cachelineStart + ALIGNMENT;
if (misalignedAddr < (long)reservedMemory) misalignedAddr += 64;
return new IntPtr(misalignedAddr);
}
Any ALIGNMENT value between -1 and -7 will now produce torn reads in 64-bit mode as well.
Related
I have read another SO question: When is ReaderWriterLockSlim better than a simple lock?
And it does not explain exactly why ReaderWriterLockSlim so slow compared to lock.
My test is yes - testing with zero contention but still it doesnt explain the staggering difference.
Read lock takes 2.7s, Write lock 2.2s, lock 1.0s
This is complete code:
using System;
using System.Diagnostics;
using System.Threading;
namespace test
{
internal class Program
{
static int[] data = new int[100000000];
static object lock1 = new object();
static ReaderWriterLockSlim lock2 = new ReaderWriterLockSlim();
static void Main(string[] args)
{
for (int z = 0; z < 3; z++)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < data.Length; i++)
{
lock (lock1)
{
data[i] = i;
}
}
sw.Stop();
Console.WriteLine("Lock: {0}", sw.Elapsed);
sw.Restart();
for (int i = 0; i < data.Length; i++)
{
try
{
lock2.EnterReadLock();
data[i] = i;
}
finally
{
lock2.ExitReadLock();
}
}
sw.Stop();
Console.WriteLine("Read: {0}", sw.Elapsed);
sw.Restart();
for (int i = 0; i < data.Length; i++)
{
try
{
lock2.EnterWriteLock();
data[i] = i;
}
finally
{
lock2.ExitWriteLock();
}
}
sw.Stop();
Console.WriteLine("Write: {0}\n", sw.Elapsed);
}
Console.ReadKey(false);
}
}
}
You are looking at two devices. At the left is a lock. At the right is a ReaderWriterLockSlim.
The device at the left is used to control a single electric lamp from a single location. The device at the right is used to control two lamps from two different locations.¹ The device at the left is cheaper to buy, it requires less wiring, it is simpler to install and operate, and it loses less energy due to heat than the device at the right.
The analogy with the SPST/DPDT electric switches is probably far from perfect, but my point is that a lock is comparatively a simpler mechanism than the ReaderWriterLockSlim. It is used to enforce a single policy to a homogenous group of worker threads. On the other hand a ReaderWriterLockSlim is used to enforce two different policies to two separate groups of workers (readers and writers), regarding to how they interact with members of the same group and the other group. It should be of no big surprise that the more complex mechanism has a higher operational cost (overhead) than the simpler mechanism. That's the cost that you have to pay in order to get finer control of the worker threads.
¹ Or maybe not. I am not an electrician!
Thanks to canton7 and Kevin Gosse, I found my 2013 question perfectly answered by Hans Passant: When exactly does .NET Monitor go to kernel-mode?
So lock is faster in a no-contention scenario simply because it has lighter logic and kernel mode is not involved.
I've read many articles about GC, and about "do no care about objects" paradigm, but i did a test for proove it.
So idea is: i'm creating a lot of large objects stored in local functions, and I suspect that after all tasks are done it will clean the memory itself. But GC didn't. So test code:
class Program
{
static void Main()
{
var allDone = new ManualResetEvent(false);
int completed = 0;
long sum = 0; //just to prevent optimizer to remove cycle etc.
const int count = int.MaxValue/10000000;
for (int i = 0; i < count; i++)
{
ThreadPool.QueueUserWorkItem(delegate
{
unchecked
{
var dumb = new Dumb();
var localSum = 0;
foreach (int x in dumb.Arr)
{
localSum += x;
}
sum += localSum;
}
if (Interlocked.Increment(ref completed) == count)
allDone.Set();
if (completed%(count/100) == 0)
Console.WriteLine("Progress = {0:N2}%", 100.0*completed/count);
});
}
allDone.WaitOne();
Console.WriteLine("Done. Result : {0}", sum);
Console.ReadKey();
GC.Collect();
Console.WriteLine("GC Collected!");
Console.WriteLine("GC CollectionsCount 0 = {0}, 1 = {1}, 2 = {2}", GC.CollectionCount(0), GC.CollectionCount(1),GC.CollectionCount(2));
Console.ReadKey();
}
}
class Dumb
{
public int[] Arr = Enumerable.Range(1,10*1024*1024).ToArray(); // 50MB
}
so in my case app eat ~2GB of RAM, but when I'm clicking on keyboard and launching GC.Collect it free occuped memory up to normal size of 20mb.
I've read that manual calls of GC etc is bad practice, but i cannot avoid it in this case.
In your example there is no need to explicitly call GC.Collect()
If you bring it up in the task manager or Performance Monitor you will see the GC working as it runs. GC is called when needed by the OS (when it is trying to allocate and doesn't have memory it will call GC to free some up).
That being said since your objects ( greater than 85000 bytes) are going onto the large object heap, LOH, you need to watch out for large object heap fragmentation. I've modified your code so show how you can fragment the LOH. Which will give an out of memory exception even though the memory is available, just not contiguous memory. As of .NET 4.5.1 you can set a flag to request that LOH to be compacted.
I modified your code to show an example of this here:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
namespace GCTesting
{
class Program
{
static int fragLOHbyIncrementing = 1000;
static void Main()
{
var allDone = new ManualResetEvent(false);
int completed = 0;
long sum = 0; //just to prevent optimizer to remove cycle etc.
const int count = 2000;
for (int i = 0; i < count; i++)
{
ThreadPool.QueueUserWorkItem(delegate
{
unchecked
{
var dumb = new Dumb( fragLOHbyIncrementing++ );
var localSum = 0;
foreach (int x in dumb.Arr)
{
localSum += x;
}
sum += localSum;
}
if (Interlocked.Increment(ref completed) == count)
allDone.Set();
if (completed % (count / 100) == 0)
Console.WriteLine("Progress = {0:N2}%", 100.0 * completed / count);
});
}
allDone.WaitOne();
Console.WriteLine("Done. Result : {0}", sum);
Console.ReadKey();
GC.Collect();
Console.WriteLine("GC Collected!");
Console.WriteLine("GC CollectionsCount 0 = {0}, 1 = {1}, 2 = {2}", GC.CollectionCount(0), GC.CollectionCount(1), GC.CollectionCount(2));
Console.ReadKey();
}
}
class Dumb
{
public Dumb(int incr)
{
try
{
DumbAllocation(incr);
}
catch (OutOfMemoryException)
{
Console.WriteLine("Out of memory, trying to compact the LOH.");
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
try // try again
{
DumbAllocation(incr);
Console.WriteLine("compacting the LOH worked to free up memory.");
}
catch (OutOfMemoryException)
{
Console.WriteLine("compaction of LOH failed to free memory.");
throw;
}
}
}
private void DumbAllocation(int incr)
{
Arr = Enumerable.Range(1, (10 * 1024 * 1024) + incr).ToArray();
}
public int[] Arr;
}
}
The .NET runtime will garbage collect without your call to the GC. However, the GC methods are exposed so that GC collections can be timed with the user experience (load screens, waiting for downloads, etc).
Use GC methods isn't always a bad idea, but if you need to ask then it likely is. :-)
I've read that manual calls of GC etc is bad practice, but i cannot avoid it in this case.
You can avoid it. Just don't call it. The next time you try to do an allocation, the GC will likely kick in and take care of this for you.
Few things I can think of that may be influencing this, but none for sure :(
One possible effect is that GC doesn't kick in right away... the large objects are on the collection queue - but haven't been cleaned up yet. Specifically calling GC.Collect forces collection right there and that's where you see the difference. Otherwise it would've just happened at some point later.
Second reason i can think of is that GC may collect objects, but not necessarily release memory to OS. Hence you'd continue seeing high memory usage even though it's free internally and available for allocation.
The garbage collection is clever and decide when the time right to collect your objects. This is done by heuristics and you must read about that. The garbage collection makes his job very good. Are the 2GB a problem for yout system or you just wondering about the behaviour?
Whenever you call GC.Collect() don't forget the call GC.WaitingForPendingFinalizer. This avoids unwanted aging of objects with finalizer.
I want to create a function that makes my application allocate X RAM for Y seconds
(I know theres 1.2GB limit on objects).
Is there better way than this?
[MethodImpl(MethodImplOptions.NoOptimization)]
public void TakeRam(int X, int Y)
{
Byte[] doubleArray = new Byte[X];
System.Threading.Sleep(Y*60)
return
}
I would prefer to use unmanaged memory, like this:
IntPtr p = Marshal.AllocCoTaskMem(X);
Sleep(Y);
Marshal.FreeCoTaskMem(p);
Otherwise the CLR garbage collector may play tricks on you.
You have to KeepAlive your block of memory, otherwise the GC can deallocate it. I would even fill the memory (so that you are sure the memory was really allocated, and not reserved)
[MethodImpl(MethodImplOptions.NoOptimization)]
public void TakeRam(int X, int Y)
{
Byte[] doubleArray = new Byte[X];
for (int i = 0; i < X; i++)
{
doubleArray[i] = 0xFF;
}
System.Threading.Sleep(Y)
GC.KeepAlive(doubleArray);
}
I'll add that, at 64bits, the maximum size of an array is something less than 2GB, not 1.2GB.
Unless you write to memory it will not get allocated. Doing repeated garbage collections does not lower the memory usage significantly. The program grows to 1,176,272K with garbage collection, and 1,176,312 K without garbage collection; with this line commented out: GC.GetTotalMemory(true); When this line is commented out byteArray[i] = 99; the program grows to 4,464K
Changed the name of the array; it doesn't hold doubles.
Changed it to write to the memory
Changed the name of the sleep function.
You can see in Task Manager that the memory gets allocated to the running process:
This works:
using System;
using System.Runtime.CompilerServices;
namespace NewTest
{
class ProgramA
{
static void Main(string[] args)
{
TakeRam(1200000000, 3000000);
}
[MethodImpl(MethodImplOptions.NoOptimization)]
static public void TakeRam(long X, int Y)
{
Byte[] byteArray = new Byte[X];
for (int i = 0; i < X; i += 4096)
{
GC.GetTotalMemory(true);
byteArray[i] = 99;
}
System.Threading.Thread.Sleep(Y);
}
}
}
I am testing spawning off many threads running the same function on a 32 core server for Java and C#. I run the application with 1000 iterations of the function, which is batched across either 1,2,4,8, 16 or 32 threads using a threadpool.
At 1, 2, 4, 8 and 16 concurrent threads Java is at least twice as fast as C#. However, as the number of threads increases, the gap closes and by 32 threads C# has nearly the same average run-time, but Java occasionally takes 2000ms (whereas both languages are usually running about 400ms). Java is starting to get worse with massive spikes in the time taken per thread iteration.
EDIT This is Windows Server 2008
EDIT2 I have changed the code below to show using the Executor Service threadpool. I have also installed Java 7.
I have set the following optimisations in the hotspot VM:
-XX:+UseConcMarkSweepGC -Xmx 6000
but it still hasnt made things any better. The only difference between the code is that im using the below threadpool and for the C# version we use:
http://www.codeproject.com/Articles/7933/Smart-Thread-Pool
Is there a way to make the Java more optimised? Perhaos you could explain why I am seeing this massive degradation in performance?
Is there a more efficient Java threadpool?
(Please note, I do not mean by changing the test function)
import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
public class PoolDemo {
static long FastestMemory = 2000000;
static long SlowestMemory = 0;
static long TotalTime;
static int[] FileArray;
static DataOutputStream outs;
static FileOutputStream fout;
static Byte myByte = 0;
public static void main(String[] args) throws InterruptedException, FileNotFoundException {
int Iterations = Integer.parseInt(args[0]);
int ThreadSize = Integer.parseInt(args[1]);
FileArray = new int[Iterations];
fout = new FileOutputStream("server_testing.csv");
// fixed pool, unlimited queue
ExecutorService service = Executors.newFixedThreadPool(ThreadSize);
ThreadPoolExecutor executor = (ThreadPoolExecutor) service;
for(int i = 0; i<Iterations; i++) {
Task t = new Task(i);
executor.execute(t);
}
for(int j=0; j<FileArray.length; j++){
new PrintStream(fout).println(FileArray[j] + ",");
}
}
private static class Task implements Runnable {
private int ID;
public Task(int index) {
this.ID = index;
}
public void run() {
long Start = System.currentTimeMillis();
int Size1 = 100000;
int Size2 = 2 * Size1;
int Size3 = Size1;
byte[] list1 = new byte[Size1];
byte[] list2 = new byte[Size2];
byte[] list3 = new byte[Size3];
for(int i=0; i<Size1; i++){
list1[i] = myByte;
}
for (int i = 0; i < Size2; i=i+2)
{
list2[i] = myByte;
}
for (int i = 0; i < Size3; i++)
{
byte temp = list1[i];
byte temp2 = list2[i];
list3[i] = temp;
list2[i] = temp;
list1[i] = temp2;
}
long Finish = System.currentTimeMillis();
long Duration = Finish - Start;
TotalTime += Duration;
FileArray[this.ID] = (int)Duration;
System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");
if(Duration < FastestMemory){
FastestMemory = Duration;
}
if (Duration > SlowestMemory)
{
SlowestMemory = Duration;
}
}
}
}
Summary
Below are the original response, update 1, and update 2. Update 1 talks about dealing with the race conditions around the test statistic variables by using concurrency structures. Update 2 is a much simpler way of dealing with the race condition issue. Hopefully no more updates from me - sorry for the length of the response but multithreaded programming is complicated!
Original Response
The only difference between the code is that im using the below
threadpool
I would say that is an absolutely huge difference. It's difficult to compare the performance of the two languages when their thread pool implementations are completely different blocks of code, written in user space. The thread pool implementation could have enormous impact on performance.
You should consider using Java's own built-in thread pools. See ThreadPoolExecutor and the entire java.util.concurrent package of which it is part. The Executors class has convenient static factory methods for pools and is a good higher level interface. All you need is JDK 1.5+, though the newer, the better. The fork/join solutions mentioned by other posters are also part of this package - as mentioned, they require 1.7+.
Update 1 - Addressing race conditions by using concurrency structures
You have race conditions around the setting of FastestMemory, SlowestMemory, and TotalTime. For the first two, you are doing the < and > testing and then the setting in more than one step. This is not atomic; there is certainly the chance that another thread will update these values in between the testing and the setting. The += setting of TotalTime is also non-atomic: a test and set in disguise.
Here are some suggested fixes.
TotalTime
The goal here is a threadsafe, atomic += of TotalTime.
// At the top of everything
import java.util.concurrent.atomic.AtomicLong;
...
// In PoolDemo
static AtomicLong TotalTime = new AtomicLong();
...
// In Task, where you currently do the TotalTime += piece
TotalTime.addAndGet (Duration);
FastestMemory / SlowestMemory
The goal here is testing and updating FastestMemory and SlowestMemory each in an atomic step, so no thread can slip in between the test and update steps to cause a race condition.
Simplest approach:
Protect the testing and setting of the variables using the class itself as a monitor. We need a monitor that contains the variables in order to guarantee synchronized visibility (thanks #A.H. for catching this.) We have to use the class itself because everything is static.
// In Task
synchronized (PoolDemo.class) {
if (Duration < FastestMemory) {
FastestMemory = Duration;
}
if (Duration > SlowestMemory) {
SlowestMemory = Duration;
}
}
Intermediate approach:
You may not like taking the whole class for the monitor, or exposing the monitor by using the class, etc. You could do a separate monitor that does not itself contain FastestMemory and SlowestMemory, but you will then run into synchronization visibility issues. You get around this by using the volatile keyword.
// In PoolDemo
static Integer _monitor = new Integer(1);
static volatile long FastestMemory = 2000000;
static volatile long SlowestMemory = 0;
...
// In Task
synchronized (PoolDemo._monitor) {
if (Duration < FastestMemory) {
FastestMemory = Duration;
}
if (Duration > SlowestMemory) {
SlowestMemory = Duration;
}
}
Advanced approach:
Here we use the java.util.concurrent.atomic classes instead of monitors. Under heavy contention, this should perform better than the synchronized approach. Try it and see.
// At the top of everything
import java.util.concurrent.atomic.AtomicLong;
. . . .
// In PoolDemo
static AtomicLong FastestMemory = new AtomicLong(2000000);
static AtomicLong SlowestMemory = new AtomicLong(0);
. . . . .
// In Task
long temp = FastestMemory.get();
while (Duration < temp) {
if (!FastestMemory.compareAndSet (temp, Duration)) {
temp = FastestMemory.get();
}
}
temp = SlowestMemory.get();
while (Duration > temp) {
if (!SlowestMemory.compareAndSet (temp, Duration)) {
temp = SlowestMemory.get();
}
}
Let me know what happens after this. It may not fix your problem, but the race condition around the very variables that track your performance is too dangerous to ignore.
I originally posted this update as a comment but moved it here so that I would have room to show code. This update has been through a few iterations - thanks to A.H. for catching a bug I had in an earlier version. Anything in this update supersedes anything in the comment.
Last but not least, an excellent source covering all this material is Java Concurrency in Practice, the best book on Java concurrency, and one of the best Java books overall.
Update 2 - Addressing race conditions in a much simpler way
I recently noticed that your current code will never terminate unless you add executorService.shutdown(). That is, the non-daemon threads living in that pool must be terminated or else the main thread will never exit. This got me to thinking that since we have to wait for all threads to exit, why not compare their durations after they finished, and thus bypass the concurrent updating of FastestMemory, etc. altogether? This is simpler and could be faster; there's no more locking or CAS overhead, and you are already doing an iteration of FileArray at the end of things anyway.
The other thing we can take advantage of is that your concurrent updating of FileArray is perfectly safe, since each thread is writing to a separate cell, and since there is no reading of FileArray during the writing of it.
With that, you make the following changes:
// In PoolDemo
// This part is the same, just so you know where we are
for(int i = 0; i<Iterations; i++) {
Task t = new Task(i);
executor.execute(t);
}
// CHANGES BEGIN HERE
// Will block till all tasks finish. Required regardless.
executor.shutdown();
executor.awaitTermination(10, TimeUnit.SECONDS);
for(int j=0; j<FileArray.length; j++){
long duration = FileArray[j];
TotalTime += duration;
if (duration < FastestMemory) {
FastestMemory = duration;
}
if (duration > SlowestMemory) {
SlowestMemory = duration;
}
new PrintStream(fout).println(FileArray[j] + ",");
}
. . .
// In Task
// Ending of Task.run() now looks like this
long Finish = System.currentTimeMillis();
long Duration = Finish - Start;
FileArray[this.ID] = (int)Duration;
System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");
Give this approach a shot as well.
You should definitely be checking your C# code for similar race conditions.
...but Java occasionally takes 2000ms...
And
byte[] list1 = new byte[Size1];
byte[] list2 = new byte[Size2];
byte[] list3 = new byte[Size3];
The hickups will be the garbage collector cleaning up your arrays. If you really want to tune that I suggest you use some kind of cache for the arrays.
Edit
This one
System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");
does one or more synchronized internally. So your highly "concurrent" code will be serialized quite good at this point. Just remove it and retest.
While #sparc_spread's answer is great, another thing I've noticed is this:
I run the application with 1000 iterations of the function
Notice that the HotSpot JVM is working on interpreted mode for the first 1.5k iterations of any function on client mode, and for 10k iterations on server mode. Computers with that many cores are automatically considered "servers" by the HotSpot JVM.
That would mean that C# would do JIT (and run in machine code) before Java does, and has a chance for better performance at the function runtime. Try increasing the iterations to 20,000 and start counting from 10k iteration.
The rationale here is that the JVM collects statistical data for how to do JIT best. It trusts that your function is going to be run a lot through time, so it takes a "slow bootstrapping" mechanism for a faster runtime overall. Or in their words "20% of the functions run 80% of the time", so why JIT them all?
Are you using java6? Java 7 comes with features to improve performance in parallel programing:
http://www.oracle.com/technetwork/articles/java/fork-join-422606.html
I was writing a program to illustrate the effects of cache contention in multithreaded programs. My first cut was to create an array of long and show how modifying adjacent items causes contention. Here's the program.
const long maxCount = 500000000;
const int numThreads = 4;
const int Multiplier = 1;
static void DoIt()
{
long[] c = new long[Multiplier * numThreads];
var threads = new Thread[numThreads];
// Create the threads
for (int i = 0; i < numThreads; ++i)
{
threads[i] = new Thread((s) =>
{
int x = (int)s;
while (c[x] > 0)
{
--c[x];
}
});
}
// start threads
var sw = Stopwatch.StartNew();
for (int i = 0; i < numThreads; ++i)
{
int z = Multiplier * i;
c[z] = maxCount;
threads[i].Start(z);
}
// Wait for 500 ms and then access the counters.
// This just proves that the threads are actually updating the counters.
Thread.Sleep(500);
for (int i = 0; i < numThreads; ++i)
{
Console.WriteLine(c[Multiplier * i]);
}
// Wait for threads to stop
for (int i = 0; i < numThreads; ++i)
{
threads[i].Join();
}
sw.Stop();
Console.WriteLine();
Console.WriteLine("Elapsed time = {0:N0} ms", sw.ElapsedMilliseconds);
}
I'm running Visual Studio 2010, program compiled in Release mode, .NET 4.0 target, "Any CPU", and executed in the 64-bit runtime without the debugger attached (Ctrl+F5).
That program runs in about 1,700 ms on my system, with a single thread. With two threads, it takes over 25 seconds. Figuring that the difference was cache contention, I set Multipler = 8 and ran again. The result is 12 seconds, so contention was at least part of the problem.
Increasing Multiplier beyond 8 doesn't improve performance.
For comparison, a similar program that doesn't use an array takes only about 2,200 ms with two threads when the variables are adjacent. When I separate the variables, the two thread version runs in the same amount of time as the single-threaded version.
If the problem was array indexing overhead, you'd expect it to show up in the single-threaded version. It looks to me like there's some kind of mutual exclusion going on when modifying the array, but I don't know what it is.
Looking at the generated IL isn't very enlightening. Nor was viewing the disassembly. The disassembly does show a couple of calls to (I think) the runtime library, but I wasn't able to step into them.
I'm not proficient with windbg or other low-level debugging tools these days. It's been a really long time since I needed them. So I'm stumped.
My only hypothesis right now is that the runtime code is setting a "dirty" flag on every write. It seems like something like that would be required in order to support throwing an exception if the array is modified while it's being enumerated. But I readily admit that I have no direct evidence to back up that hypothesis.
Can anybody tell me what is causing this big slowdown?
You've got false sharing. I wrote an article about it here