Unreliable multi-threading behaviour
While modifying an existing programme to make use of multiple processors, I was frustrated from the results: The modified programme needed more time than the original one. At first, I assumed that it was my mistake, but then I simplified the programme up to the following one:
//*
// Use this for Java 1.5+
import java.util.concurrent.*;
/*/
// Use backport from http://dcl.mathcs.emory.edu/util/backport-util-concurrent/
// for Java 1.4
import edu.emory.mathcs.backport.java.util.concurrent.*;
//*/
publicclass Concurrentimplements Runnable
{
/**
* Use instance array in run().
*/
staticfinalint INSTANCE_ARRAY = 0;
/**
* Use local array in run().
*/
staticfinalint LOCAL_ARRAY = 1;
/**
* Use two local ints in run().
*/
staticfinalint LOCAL_INTS = 2;
/**
* One of INSTANCE_ARRAY, LOCAL_ARRAY, LOCAL_INTS
*/
staticint test;
/**
* Number of iterations in run()
*/
staticfinalint N = 1000000;
int[] dummy =newint[2];
int out;
publicvoid run ()
{
if (test == INSTANCE_ARRAY)
{
for (int i = N; i > 0; i--)
{
dummy[i & 1] += i;
}
out = dummy[0] + dummy[1];
}
elseif (test == LOCAL_ARRAY)
{
int[] dummy =newint[2];
for (int i = N; i > 0; i--)
{
dummy[i & 1] += i;
}
// prevent optimizer from discarding the loop.
out = dummy[0] + dummy[1];
}
else
// test == LOCAL_INTS
{
int dummy0 = 0;
int dummy1 = 0;
for (int i = N; i > 0; i--)
{
if ((i & 1) == 0)
{
dummy0 += i;
}
else
{
dummy1 += i;
}
}
// prevent optimizer from discarding the loop.
out = dummy0 + dummy1;
}
}
publicstaticvoid main (String[] args)throws Exception
{
System.out.println (System.getProperty ("java.vm.version"));
System.out.println (System.getProperty ("java.vm.name"));
System.out.println (System.getProperty ("os.name"));
System.out.println (System.getProperty ("os.version"));
System.out.println (System.getProperty ("os.arch"));
System.out.println (ExecutorService.class.getPackage ());
for (test = INSTANCE_ARRAY; test <= LOCAL_INTS; test++)
{
switch (test)
{
case INSTANCE_ARRAY:
System.out.println ("INSTANCE_ARRAY");
break;
case LOCAL_ARRAY:
System.out.println ("LOCAL_ARRAY");
break;
case LOCAL_INTS:
System.out.println ("LOCAL_INTS");
break;
}
// at first a single thread, then two threads
for (int n = 1; n <= 2; n++)
{
System.out.println (n +" thread(s)");
ExecutorService executor = Executors.newFixedThreadPool (n);
// five identical rounds
for (int i = 0; i < 5; i++)
{
long time = System.currentTimeMillis ();
// execute Concurrent.run 100 times
for (int j = 0; j < 100; j++)
{
executor.execute (new Concurrent ());
}
// and once again so that we can wait for completion
executor.submit (new Concurrent ()).get ();
System.out.println (System.currentTimeMillis () - time);
}
executor.shutdown ();
}
}
}
}
Thus, the indivual tasks (Concurrent.run) have the following properties: 1) Their computation requires some reasonable amount of time. 2) They are completely independent.
Thus, one should benefit from multiple processors. However, only for this configuration:
1.5.0_10-b03
Java HotSpot(TM) 64-Bit Server VM
Linux
2.6.18-gentoo-r6d
amd64
the execution times using two threads were half of the times using one thread. But a dual-core pentium needs more time for the case INSTANCE_ARRAY when two threads are used:
1.5.0_10-b03
Java HotSpot(TM) Client VM
Linux
2.6.13-15.11-smp
i386
package java.util.concurrent, Java Platform API Specification, version 1.5
INSTANCE_ARRAY
1 thread(s)
332
326
341
309
310
2 thread(s)
470
514
463
479
556
LOCAL_ARRAY
1 thread(s)
282
285
282
286
291
2 thread(s)
163
157
145
139
144
LOCAL_INTS
1 thread(s)
244
320
263
261
258
2 thread(s)
144
141
127
126
130
A similar behaviour was observed for three other 1.5 VMs and the 1.6 VM.
A 1.4.2 VM on the same computer is even worse:
1.4.2_13-b06
Java HotSpot(TM) Client VM
Linux
2.6.13-15.11-smp
i386
package edu.emory.mathcs.backport.java.util.concurrent
INSTANCE_ARRAY
1 thread(s)
343
299
298
296
296
2 thread(s)
483
508
482
516
545
LOCAL_ARRAY
1 thread(s)
272
272
272
273
279
2 thread(s)
881
904
949
976
956
LOCAL_INTS
1 thread(s)
204
204
203
204
203
2 thread(s)
123
116
116
110
104
A similar behaviour is observed on a SPARC:
1.4.2_01-b06
Java HotSpot(TM) Client VM
SunOS
5.8
sparc
package edu.emory.mathcs.backport.java.util.concurrent
INSTANCE_ARRAY
1 thread(s)
2312
2283
2267
2267
2266
2 thread(s)
2296
2387
2335
2396
2300
LOCAL_ARRAY
1 thread(s)
2144
2129
2143
2130
2129
2 thread(s)
3393
3230
3274
3458
3206
LOCAL_INTS
1 thread(s)
879
878
880
889
879
2 thread(s)
444
453
445
444
444
Since it works as expected for the case LOCAL_INTS, I think that there is no problem with java.util.concurrent or its backport, but with the VMs for the tested platforms. By the way, the CPU utilization reached 100% for both CPUs, and there were no major background tasks.
[9942 byte] By [
ole_ka] at [2007-11-26 17:38:22]

# 1
Hi,
Your test it's good, but the number of iteration it's too small.
After somme test with
/**
* Number of iterations in run()
*/
static final int N = 5000000;
the result it's better:
1.6.0-b105
Java HotSpot(TM) Client VM
Windows XP
5.1
x86
package java.util.concurrent, Java Platform API Specification, version 1.6
INSTANCE_ARRAY
1 thread(s)
1343
1407
1359
1312
1329
2 thread(s)
718
719
703
656
688
LOCAL_ARRAY
1 thread(s)
1266
1265
1266
1265
1266
2 thread(s)
672
719
656
672
703
LOCAL_INTS
1 thread(s)
1641
1656
1640
1641
1641
2 thread(s)
843
860
859
860
828
Message was edited by:
lbaudet
Message was edited by:
lbaudet
# 2
Time of thread gestion in Executor shadows the time in run method.
# 3
You do know that hotspot won't even compile java byte code to native code until a method has been called 1,500 times for -client and 10,000 times for -server.The option -XX:-PrintCompilation will let you know when a method has been compiled.
You might want to warm up the test first with 10,000 calls to run.
# 4
Are you sure that HotSpot needs 1000 invocations for compilation? I think HotSpot compiles the main-method of this example, but it is invoked only once!
public class Test
{
public static void main (String[] args)
{
int j = 0;
for (int i = 1000000000; i >= 0; i--)
{
j += i;
}
System.out.println (j);
}
}
Output is as follows:
~/workspace/Test/build> java -XX:+PrintCompilation Test
1bjava.lang.String::charAt (33 bytes)
1% bTest::main @ 8 (27 bytes)
-243309312
Using -Xint, the code runs slower. HotSpot has compiled Test.main.
# 5
Just from what I've read. There may be other options that control compilation. See http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp and -XX:CompileThreshold vm option.
http://www.javaworld.com/javaworld/javaqa/2003-04/01-qa-0411-hotspot.html
You can also google on [url http://www.google.com/search?q=CompileThreshold]CompileThreshold[/url]
# 6
public class TestClassCompiling {
private static int int1;
private static int int2;
private static int int3;
public static void main(String args[]) {
int j = 0;
for (int i=0; i<1400; i++) {
j += method1();
j += method2();
j += method3();
}
// method3 should get compiled here
for (int i=0; i<100; i++) {
j += method3();
}
// method2 is next
for (int i=0; i<100; i++) {
j += method2();
}
for (int i=0; i<10000; i++) {
j += method1();
}
System.out.println(j);
}
private static int method1() {
return int1++;
}
private static int method2() {
return int2++;
}
private static int method3() {
return int3++;
}
}
With Client HotSpot
java -client -XX:+PrintCompilation TestClassCompiling
1java.lang.String::hashCode (60 bytes)
2java.lang.String::charAt (33 bytes)
3TestClassCompiling::method3 (10 bytes)
4TestClassCompiling::method2 (10 bytes)
5TestClassCompiling::method1 (10 bytes)
67222800
Main was never compiled. And as you can see, methods1-3 were compiled in a deterministic way.
With Server HotSpot
java -server -XX:+PrintCompilation TestClassCompiling
1TestClassCompiling::method1 (10 bytes)
67222800
# 7
Run with -Xint to use the interpreter and avoid compilation, or else warmup to ensure pre-compilation.
BTW hotspot (server at least) will compile a method if it contains a loop that iterates beyond a given threshhold - hence main() can be compiled even if only called once. can't recall what the default threshhold is - sorry.
Use System.nanoTime for your time measurements too.
These sorts of microbenchmarks are notoriously bad at demonstrating multiprocessor usage. As was already stated, the pool management might overshadow your workload.
# 8
"BTW hotspot (server at least) will compile a method if it contains a loop that iterates beyond a given threshhold - hence main() can be compiled even if only called once"
Yes, that's right. Thus, Concurrent.run() is compiled at the very beginning of programme execution (which can be verified using -XX:+PrintCompilation). Maybe the first computation round needs a bit more time, but that's not my problem. Even with -Xint I get the strange behaviour.
In order to put the discussion in the right direction, I have to clarify my problem:
1.) I have a relatively small number of relatively time-consuming tasks, represented by the Runnable Concurrent. These tasks are completely indepent, no synchronization is involved in the tasks themselves.
2.) The executor makes use of synchronization, yes, but only 101 synchronization events are necessary per round. So please don't tell us that the problem is the executor, 101 synchronizations should not lead to a waste of some 100 milliseconds of processor time! Or if so, this is a buggy behaviour in the VM at least on the dual-core intel and sparc computers.
3.) While the behaviour is as expected when using local arrays or ints, the usage of arrays stored in instance variables makes the code slower when two threads are used. On the tested 1.4 VM's, even local arrays make the code slower.
So my question is: Do you observe such a strange behaviour on your dual-core or multiprocessor computers, too? Should this be considered as a bug in the VM?
# 9
> So my question is: Do you observe such a strange
> behaviour on your dual-core or multiprocessor
> computers, too?
Yes. I tested on three systems:
1. hyperthreaded P3
Neither local array or local int showed any speedup with 2 threads.
Instance array slowed by approx 2x with 2 threads
2. dual-core athlon
Local array and local int showed 2x speedup with 2 threads
Instance array showed zero speedup with 2 threads
3. dual processor sparc
Local array and local int showed 2x speedup with 2 threads
Instance array showed approx 1.3x speedup with 2 threads
> Should this be considered as a bug in the VM?
That remains to be seen. I don't understand why the instance array is so different in behaviour to local array - both arrays are in the heap afterall and the array reference only has to be loaded once. This has piqued my interest.
# 10
Hi,
I think that it is the management of the memory divided between core which makes the difference.
Your test, is demonstration limit of the massiv multithreading.
The architecture of processeur is very importante.
Thank's for your Interesting question.
Laurent.
# 11
> I think that it is the management of the memory
> divided between core which makes the difference.
>
> Your test, is demonstration limit of the massiv
> multithreading.
> The architecture of processeur is very importante.
But then, why is there a speed-up by a factor of two in case LOCAL_ARRAY on a 1.5 VM, while there is a slow-down by a factor of more than two on a 1.4 VM on the same computer? Is the memory management so poor that it slows down the computation even if the modifications to memory do not overlap (note that every Concurrent has its own dummy-array). I can't believe that.
An explanation for the behaviour might be the following, but since I do not know the details of the VM, the explanation might be wrong.
1.) When local ints are used, the VM sees that they can only be seen from the current thread, so there is no problem when these are modified.
2.) When an array stored in an instance variable is used, the VM cannot proof that this is only modified from the current thread (although this is the case in my example), because in principle, the instance variable can be seen from other threads.
3.) When a local array is used, the 1.5 VM is clever and sees that no reference to this array is passed from the local stack to some place where it can be seen from other threads. So no problem when modifying this array. The 1.4 VM is not as clever and treats every object as potentially visible for other threads. Thus the slow-down observed for 1.4 VMs in this case.
While this might be an explanation, I consider it as a bug: We would be forced to carry out our computations such that the VM can proof that modifications can only be seen by the current thread! Of course, this is not possible in practice.
# 12
For this environment:
1.5.0_07-83
Java HotSpot(TM) Client VM
Mac OS X
10.4.8
i386
package java.util.concurrent, Java Platform API Specification, version 1.5
INSTANCE_ARRAY was 2.5 * slower when 2 threads were used, and twice as fast for the other cases.
I downloaded the debug-version of the 1.7 JVM from http://download.java.net/jdk7/binaries/ . Originally, I wanted to see the generated machine code using -XX:+PrintAssembly, but unfortunately this is not possible because a library has not yeen been made available (see http://forums.java.net/jive/thread.jspa?threadID=21033&tstart=75 ). However, using -XX:+PrintLIR some intermediate code can be seen. I changed the INSTANCE_ARRAY case to this:
if (test == INSTANCE_ARRAY)
{
int[] dummy = this.dummy;
for (int i = N; i > 0; i--)
{
dummy[i & 1] += i;
}
out = dummy[0] + dummy[1];
}
so that the modifications are made on a reference on the stack. The LIR code is equal for LOCAL_ARRAY and INSTANCE_ARRAY:
For INSTANCE_ARRAY:
B1 [6, 14] preds: B0 sux: B3
__id_Instruction_
20 label [label:0x81cc320]
22 move [Base:[ecx|L] Disp: 8|L] [esi|L]
24 move [int:5000000|I] [ebx|I]
26 branch [AL] [B3]
B4 bb le [17, 26] preds: B3 sux: B3
__id_Instruction_
38 label [label:0x81cc698]
40 move [ebx|I] [edx|I]
42 logic_and [edx|I] [int:1|I] [edx|I]
44 cmp [edx|I] [Base:[esi|L] Disp: 8|I] [bci:22]
46 branch [AE] [RangeCheckStub: 0x80a93c8] [bci:22]
48 move [Base:[esi|L] Index:[edx|I] * 4 Disp: 12|I] [edi|I]
52 add [edi|I] [ebx|I] [edi|I]
54 cmp [edx|I] [Base:[esi|L] Disp: 8|I]
56 branch [AE] [RangeCheckStub: 0x80a95d8] [bci:25]
58 move [edi|I] [Base:[esi|L] Index:[edx|I] * 4 Disp: 12|I]
62 add [ebx|I] [int:-1|I] [ebx|I]
B3 lh [29, 30] preds: B1 B4 sux: B4 B5
__id_Instruction_
28 label [label:0x81cc570]
30 safepoint[bci:30]
32 cmp [ebx|I] [int:0|I]
34 branch [GT] [B4]
36 branch [AL] [B5]
For LOCAL_ARRAY:
B7 [54, 61] preds: B2 sux: B9
__id_Instruction_
76 label [label:0x81cca10]
78 move [int:2|I] [ebx|I]
80 move [obj:0x80f1558|L] [edx|L]
82 alloc_array [edx|L] [ebx|I] [eax|L] [ecx|L] [esi|L] [edi|L] [eax|L] [type:0xa] [label:0x80a9b34]
86 move [int:5000000|I] [esi|I]
88 branch [AL] [B9]
B10 bb le [64, 73] preds: B9 sux: B9
__id_Instruction_
100 label [label:0x81ccd88]
102 move [esi|I] [edi|I]
104 logic_and [edi|I] [int:1|I] [edi|I]
106 cmp [edi|I] [Base:[eax|L] Disp: 8|I]
108 branch [AE] [RangeCheckStub: 0x80aa018] [bci:69]
110 move [Base:[eax|L] Index:[edi|I] * 4 Disp: 12|I] [ebx|I]
114 add [ebx|I] [esi|I] [ebx|I]
116 cmp [edi|I] [Base:[eax|L] Disp: 8|I]
118 branch [AE] [RangeCheckStub: 0x80aa228] [bci:72]
120 move [ebx|I] [Base:[eax|L] Index:[edi|I] * 4 Disp: 12|I]
124 add [esi|I] [int:-1|I] [esi|I]
B9 lh [76, 77] preds: B7 B10 sux: B10 B11
__id_Instruction_
90 label [label:0x81ccc60]
92 safepoint[bci:77]
94 cmp [esi|I] [int:0|I]
96 branch [GT] [B10]
98 branch [AL] [B11]
The only difference is the initial loading of the dummy-variable. However, the case INSTANCE_ARRAY is still two times slower than the case LOCAL_ARRAY! Why? Is "safepoint" implemented differently for the two cases?
# 13
The safepoint mechanics are unchanged in the two cases - or should be anyway.
I'm not an expert with the workings at this level of the hardware, but this is my understanding of what seems to be happening. The problem appears to be caused by cache conflicts. This would be most prominent in the hyper-threaded case because (I believe) all caches are shared by the two hyper-threads. For multi-core it depends on which multi-core chip and the architecture, as to the level of cache sharing. For SMP you'll have distinct caches and so no (or very little) conflict.
(Also note that for HT you shouldn't expect 2x speedup - hyperthreads do not run concurrently - one runs when the other is stalled - hence the potential speedup is only a small fraction depending on the stall time, and if your threads induce extra stalls in each other then they actually slow down. I've heard it said that Intel suggest 1.2x as the expected speedup.)
In the instance-array case all construction is done by the main thread and so all allocation is done from that thread's local allocation buffer - TLAB. The Concurrent instances when laid out in memory just happen to result in the array entries you are hitting being in the same cache line. (This is for 1.5 - in 1.4 the allocation strategy was different I believe, hence the different behaviour).
There are a number of things you can do to see how this memory layout comes into play:
1. Change the code so that the "instance array" is allocated in the run method. Now it comes from the new thread's TLAB and so behaves like the local-array case.
2. Change the size of the instance array - this will increase the spacing between dummy[0]/[1] in one Concurrent instance and dummy[0]/[1] in the next. With the right spacing the conflict goes.
3. Add some padding variables to Concurrent so that again the size of a concurrent instances changes, and so the arrays in different instances will be spaced different.
You should be able to induce the same problem in the local array case by doing additional allocation
That said, I was unable to induce the same abberant behaviour in the local array case which makes me suspicious that there is something more subtle going on.
If you can, try to analyze cache conflicts/traffic. I don't have experience at that level of analysis.
