Unreliable multi-threading behaviour

While modifying an existing programme to make use of multiple processors, I was frustrated from the results: The modified programme needed more time than the original one. At first, I assumed that it was my mistake, but then I simplified the programme up to the following one:

//*

// Use this for Java 1.5+

import java.util.concurrent.*;

/*/

// Use backport from http://dcl.mathcs.emory.edu/util/backport-util-concurrent/

// for Java 1.4

import edu.emory.mathcs.backport.java.util.concurrent.*;

//*/

publicclass Concurrentimplements Runnable

{

/**

* Use instance array in run().

*/

staticfinalint INSTANCE_ARRAY = 0;

/**

* Use local array in run().

*/

staticfinalint LOCAL_ARRAY = 1;

/**

* Use two local ints in run().

*/

staticfinalint LOCAL_INTS = 2;

/**

* One of INSTANCE_ARRAY, LOCAL_ARRAY, LOCAL_INTS

*/

staticint test;

/**

* Number of iterations in run()

*/

staticfinalint N = 1000000;

int[] dummy =newint[2];

int out;

publicvoid run ()

{

if (test == INSTANCE_ARRAY)

{

for (int i = N; i > 0; i--)

{

dummy[i & 1] += i;

}

out = dummy[0] + dummy[1];

}

elseif (test == LOCAL_ARRAY)

{

int[] dummy =newint[2];

for (int i = N; i > 0; i--)

{

dummy[i & 1] += i;

}

// prevent optimizer from discarding the loop.

out = dummy[0] + dummy[1];

}

else

// test == LOCAL_INTS

{

int dummy0 = 0;

int dummy1 = 0;

for (int i = N; i > 0; i--)

{

if ((i & 1) == 0)

{

dummy0 += i;

}

else

{

dummy1 += i;

}

}

// prevent optimizer from discarding the loop.

out = dummy0 + dummy1;

}

}

publicstaticvoid main (String[] args)throws Exception

{

System.out.println (System.getProperty ("java.vm.version"));

System.out.println (System.getProperty ("java.vm.name"));

System.out.println (System.getProperty ("os.name"));

System.out.println (System.getProperty ("os.version"));

System.out.println (System.getProperty ("os.arch"));

System.out.println (ExecutorService.class.getPackage ());

for (test = INSTANCE_ARRAY; test <= LOCAL_INTS; test++)

{

switch (test)

{

case INSTANCE_ARRAY:

System.out.println ("INSTANCE_ARRAY");

break;

case LOCAL_ARRAY:

System.out.println ("LOCAL_ARRAY");

break;

case LOCAL_INTS:

System.out.println ("LOCAL_INTS");

break;

}

// at first a single thread, then two threads

for (int n = 1; n <= 2; n++)

{

System.out.println (n +" thread(s)");

ExecutorService executor = Executors.newFixedThreadPool (n);

// five identical rounds

for (int i = 0; i < 5; i++)

{

long time = System.currentTimeMillis ();

// execute Concurrent.run 100 times

for (int j = 0; j < 100; j++)

{

executor.execute (new Concurrent ());

}

// and once again so that we can wait for completion

executor.submit (new Concurrent ()).get ();

System.out.println (System.currentTimeMillis () - time);

}

executor.shutdown ();

}

}

}

}

Thus, the indivual tasks (Concurrent.run) have the following properties: 1) Their computation requires some reasonable amount of time. 2) They are completely independent.

Thus, one should benefit from multiple processors. However, only for this configuration:

1.5.0_10-b03

Java HotSpot(TM) 64-Bit Server VM

Linux

2.6.18-gentoo-r6d

amd64

the execution times using two threads were half of the times using one thread. But a dual-core pentium needs more time for the case INSTANCE_ARRAY when two threads are used:

1.5.0_10-b03

Java HotSpot(TM) Client VM

Linux

2.6.13-15.11-smp

i386

package java.util.concurrent, Java Platform API Specification, version 1.5

INSTANCE_ARRAY

1 thread(s)

332

326

341

309

310

2 thread(s)

470

514

463

479

556

LOCAL_ARRAY

1 thread(s)

282

285

282

286

291

2 thread(s)

163

157

145

139

144

LOCAL_INTS

1 thread(s)

244

320

263

261

258

2 thread(s)

144

141

127

126

130

A similar behaviour was observed for three other 1.5 VMs and the 1.6 VM.

A 1.4.2 VM on the same computer is even worse:

1.4.2_13-b06

Java HotSpot(TM) Client VM

Linux

2.6.13-15.11-smp

i386

package edu.emory.mathcs.backport.java.util.concurrent

INSTANCE_ARRAY

1 thread(s)

343

299

298

296

296

2 thread(s)

483

508

482

516

545

LOCAL_ARRAY

1 thread(s)

272

272

272

273

279

2 thread(s)

881

904

949

976

956

LOCAL_INTS

1 thread(s)

204

204

203

204

203

2 thread(s)

123

116

116

110

104

A similar behaviour is observed on a SPARC:

1.4.2_01-b06

Java HotSpot(TM) Client VM

SunOS

5.8

sparc

package edu.emory.mathcs.backport.java.util.concurrent

INSTANCE_ARRAY

1 thread(s)

2312

2283

2267

2267

2266

2 thread(s)

2296

2387

2335

2396

2300

LOCAL_ARRAY

1 thread(s)

2144

2129

2143

2130

2129

2 thread(s)

3393

3230

3274

3458

3206

LOCAL_INTS

1 thread(s)

879

878

880

889

879

2 thread(s)

444

453

445

444

444

Since it works as expected for the case LOCAL_INTS, I think that there is no problem with java.util.concurrent or its backport, but with the VMs for the tested platforms. By the way, the CPU utilization reached 100% for both CPUs, and there were no major background tasks.

[9942 byte] By [ole_ka] at [2007-11-26 17:38:22]
# 1

Hi,

Your test it's good, but the number of iteration it's too small.

After somme test with

/**

* Number of iterations in run()

*/

static final int N = 5000000;

the result it's better:

1.6.0-b105

Java HotSpot(TM) Client VM

Windows XP

5.1

x86

package java.util.concurrent, Java Platform API Specification, version 1.6

INSTANCE_ARRAY

1 thread(s)

1343

1407

1359

1312

1329

2 thread(s)

718

719

703

656

688

LOCAL_ARRAY

1 thread(s)

1266

1265

1266

1265

1266

2 thread(s)

672

719

656

672

703

LOCAL_INTS

1 thread(s)

1641

1656

1640

1641

1641

2 thread(s)

843

860

859

860

828

Message was edited by:

lbaudet

Message was edited by:

lbaudet

lbaudeta at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 2
Time of thread gestion in Executor shadows the time in run method.
lbaudeta at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 3

You do know that hotspot won't even compile java byte code to native code until a method has been called 1,500 times for -client and 10,000 times for -server.The option -XX:-PrintCompilation will let you know when a method has been compiled.

You might want to warm up the test first with 10,000 calls to run.

Caffeine0001a at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 4

Are you sure that HotSpot needs 1000 invocations for compilation? I think HotSpot compiles the main-method of this example, but it is invoked only once!

public class Test

{

public static void main (String[] args)

{

int j = 0;

for (int i = 1000000000; i >= 0; i--)

{

j += i;

}

System.out.println (j);

}

}

Output is as follows:

~/workspace/Test/build> java -XX:+PrintCompilation Test

1bjava.lang.String::charAt (33 bytes)

1% bTest::main @ 8 (27 bytes)

-243309312

Using -Xint, the code runs slower. HotSpot has compiled Test.main.

ole_ka at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 5

Just from what I've read. There may be other options that control compilation. See http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp and -XX:CompileThreshold vm option.

http://www.javaworld.com/javaworld/javaqa/2003-04/01-qa-0411-hotspot.html

You can also google on [url http://www.google.com/search?q=CompileThreshold]CompileThreshold[/url]

Caffeine0001a at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 6

public class TestClassCompiling {

private static int int1;

private static int int2;

private static int int3;

public static void main(String args[]) {

int j = 0;

for (int i=0; i<1400; i++) {

j += method1();

j += method2();

j += method3();

}

// method3 should get compiled here

for (int i=0; i<100; i++) {

j += method3();

}

// method2 is next

for (int i=0; i<100; i++) {

j += method2();

}

for (int i=0; i<10000; i++) {

j += method1();

}

System.out.println(j);

}

private static int method1() {

return int1++;

}

private static int method2() {

return int2++;

}

private static int method3() {

return int3++;

}

}

With Client HotSpot

java -client -XX:+PrintCompilation TestClassCompiling

1java.lang.String::hashCode (60 bytes)

2java.lang.String::charAt (33 bytes)

3TestClassCompiling::method3 (10 bytes)

4TestClassCompiling::method2 (10 bytes)

5TestClassCompiling::method1 (10 bytes)

67222800

Main was never compiled. And as you can see, methods1-3 were compiled in a deterministic way.

With Server HotSpot

java -server -XX:+PrintCompilation TestClassCompiling

1TestClassCompiling::method1 (10 bytes)

67222800

Caffeine0001a at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 7

Run with -Xint to use the interpreter and avoid compilation, or else warmup to ensure pre-compilation.

BTW hotspot (server at least) will compile a method if it contains a loop that iterates beyond a given threshhold - hence main() can be compiled even if only called once. can't recall what the default threshhold is - sorry.

Use System.nanoTime for your time measurements too.

These sorts of microbenchmarks are notoriously bad at demonstrating multiprocessor usage. As was already stated, the pool management might overshadow your workload.

davidholmesa at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 8

"BTW hotspot (server at least) will compile a method if it contains a loop that iterates beyond a given threshhold - hence main() can be compiled even if only called once"

Yes, that's right. Thus, Concurrent.run() is compiled at the very beginning of programme execution (which can be verified using -XX:+PrintCompilation). Maybe the first computation round needs a bit more time, but that's not my problem. Even with -Xint I get the strange behaviour.

In order to put the discussion in the right direction, I have to clarify my problem:

1.) I have a relatively small number of relatively time-consuming tasks, represented by the Runnable Concurrent. These tasks are completely indepent, no synchronization is involved in the tasks themselves.

2.) The executor makes use of synchronization, yes, but only 101 synchronization events are necessary per round. So please don't tell us that the problem is the executor, 101 synchronizations should not lead to a waste of some 100 milliseconds of processor time! Or if so, this is a buggy behaviour in the VM at least on the dual-core intel and sparc computers.

3.) While the behaviour is as expected when using local arrays or ints, the usage of arrays stored in instance variables makes the code slower when two threads are used. On the tested 1.4 VM's, even local arrays make the code slower.

So my question is: Do you observe such a strange behaviour on your dual-core or multiprocessor computers, too? Should this be considered as a bug in the VM?

ole_ka at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 9

> So my question is: Do you observe such a strange

> behaviour on your dual-core or multiprocessor

> computers, too?

Yes. I tested on three systems:

1. hyperthreaded P3

Neither local array or local int showed any speedup with 2 threads.

Instance array slowed by approx 2x with 2 threads

2. dual-core athlon

Local array and local int showed 2x speedup with 2 threads

Instance array showed zero speedup with 2 threads

3. dual processor sparc

Local array and local int showed 2x speedup with 2 threads

Instance array showed approx 1.3x speedup with 2 threads

> Should this be considered as a bug in the VM?

That remains to be seen. I don't understand why the instance array is so different in behaviour to local array - both arrays are in the heap afterall and the array reference only has to be loaded once. This has piqued my interest.

davidholmesa at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 10

Hi,

I think that it is the management of the memory divided between core which makes the difference.

Your test, is demonstration limit of the massiv multithreading.

The architecture of processeur is very importante.

Thank's for your Interesting question.

Laurent.

lbaudeta at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 11

> I think that it is the management of the memory

> divided between core which makes the difference.

>

> Your test, is demonstration limit of the massiv

> multithreading.

> The architecture of processeur is very importante.

But then, why is there a speed-up by a factor of two in case LOCAL_ARRAY on a 1.5 VM, while there is a slow-down by a factor of more than two on a 1.4 VM on the same computer? Is the memory management so poor that it slows down the computation even if the modifications to memory do not overlap (note that every Concurrent has its own dummy-array). I can't believe that.

An explanation for the behaviour might be the following, but since I do not know the details of the VM, the explanation might be wrong.

1.) When local ints are used, the VM sees that they can only be seen from the current thread, so there is no problem when these are modified.

2.) When an array stored in an instance variable is used, the VM cannot proof that this is only modified from the current thread (although this is the case in my example), because in principle, the instance variable can be seen from other threads.

3.) When a local array is used, the 1.5 VM is clever and sees that no reference to this array is passed from the local stack to some place where it can be seen from other threads. So no problem when modifying this array. The 1.4 VM is not as clever and treats every object as potentially visible for other threads. Thus the slow-down observed for 1.4 VMs in this case.

While this might be an explanation, I consider it as a bug: We would be forced to carry out our computations such that the VM can proof that modifications can only be seen by the current thread! Of course, this is not possible in practice.

ole_ka at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 12

For this environment:

1.5.0_07-83

Java HotSpot(TM) Client VM

Mac OS X

10.4.8

i386

package java.util.concurrent, Java Platform API Specification, version 1.5

INSTANCE_ARRAY was 2.5 * slower when 2 threads were used, and twice as fast for the other cases.

I downloaded the debug-version of the 1.7 JVM from http://download.java.net/jdk7/binaries/ . Originally, I wanted to see the generated machine code using -XX:+PrintAssembly, but unfortunately this is not possible because a library has not yeen been made available (see http://forums.java.net/jive/thread.jspa?threadID=21033&tstart=75 ). However, using -XX:+PrintLIR some intermediate code can be seen. I changed the INSTANCE_ARRAY case to this:

if (test == INSTANCE_ARRAY)

{

int[] dummy = this.dummy;

for (int i = N; i > 0; i--)

{

dummy[i & 1] += i;

}

out = dummy[0] + dummy[1];

}

so that the modifications are made on a reference on the stack. The LIR code is equal for LOCAL_ARRAY and INSTANCE_ARRAY:

For INSTANCE_ARRAY:

B1 [6, 14] preds: B0 sux: B3

__id_Instruction_

20 label [label:0x81cc320]

22 move [Base:[ecx|L] Disp: 8|L] [esi|L]

24 move [int:5000000|I] [ebx|I]

26 branch [AL] [B3]

B4 bb le [17, 26] preds: B3 sux: B3

__id_Instruction_

38 label [label:0x81cc698]

40 move [ebx|I] [edx|I]

42 logic_and [edx|I] [int:1|I] [edx|I]

44 cmp [edx|I] [Base:[esi|L] Disp: 8|I] [bci:22]

46 branch [AE] [RangeCheckStub: 0x80a93c8] [bci:22]

48 move [Base:[esi|L] Index:[edx|I] * 4 Disp: 12|I] [edi|I]

52 add [edi|I] [ebx|I] [edi|I]

54 cmp [edx|I] [Base:[esi|L] Disp: 8|I]

56 branch [AE] [RangeCheckStub: 0x80a95d8] [bci:25]

58 move [edi|I] [Base:[esi|L] Index:[edx|I] * 4 Disp: 12|I]

62 add [ebx|I] [int:-1|I] [ebx|I]

B3 lh [29, 30] preds: B1 B4 sux: B4 B5

__id_Instruction_

28 label [label:0x81cc570]

30 safepoint[bci:30]

32 cmp [ebx|I] [int:0|I]

34 branch [GT] [B4]

36 branch [AL] [B5]

For LOCAL_ARRAY:

B7 [54, 61] preds: B2 sux: B9

__id_Instruction_

76 label [label:0x81cca10]

78 move [int:2|I] [ebx|I]

80 move [obj:0x80f1558|L] [edx|L]

82 alloc_array [edx|L] [ebx|I] [eax|L] [ecx|L] [esi|L] [edi|L] [eax|L] [type:0xa] [label:0x80a9b34]

86 move [int:5000000|I] [esi|I]

88 branch [AL] [B9]

B10 bb le [64, 73] preds: B9 sux: B9

__id_Instruction_

100 label [label:0x81ccd88]

102 move [esi|I] [edi|I]

104 logic_and [edi|I] [int:1|I] [edi|I]

106 cmp [edi|I] [Base:[eax|L] Disp: 8|I]

108 branch [AE] [RangeCheckStub: 0x80aa018] [bci:69]

110 move [Base:[eax|L] Index:[edi|I] * 4 Disp: 12|I] [ebx|I]

114 add [ebx|I] [esi|I] [ebx|I]

116 cmp [edi|I] [Base:[eax|L] Disp: 8|I]

118 branch [AE] [RangeCheckStub: 0x80aa228] [bci:72]

120 move [ebx|I] [Base:[eax|L] Index:[edi|I] * 4 Disp: 12|I]

124 add [esi|I] [int:-1|I] [esi|I]

B9 lh [76, 77] preds: B7 B10 sux: B10 B11

__id_Instruction_

90 label [label:0x81ccc60]

92 safepoint[bci:77]

94 cmp [esi|I] [int:0|I]

96 branch [GT] [B10]

98 branch [AL] [B11]

The only difference is the initial loading of the dummy-variable. However, the case INSTANCE_ARRAY is still two times slower than the case LOCAL_ARRAY! Why? Is "safepoint" implemented differently for the two cases?

ole_ka at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...
# 13

The safepoint mechanics are unchanged in the two cases - or should be anyway.

I'm not an expert with the workings at this level of the hardware, but this is my understanding of what seems to be happening. The problem appears to be caused by cache conflicts. This would be most prominent in the hyper-threaded case because (I believe) all caches are shared by the two hyper-threads. For multi-core it depends on which multi-core chip and the architecture, as to the level of cache sharing. For SMP you'll have distinct caches and so no (or very little) conflict.

(Also note that for HT you shouldn't expect 2x speedup - hyperthreads do not run concurrently - one runs when the other is stalled - hence the potential speedup is only a small fraction depending on the stall time, and if your threads induce extra stalls in each other then they actually slow down. I've heard it said that Intel suggest 1.2x as the expected speedup.)

In the instance-array case all construction is done by the main thread and so all allocation is done from that thread's local allocation buffer - TLAB. The Concurrent instances when laid out in memory just happen to result in the array entries you are hitting being in the same cache line. (This is for 1.5 - in 1.4 the allocation strategy was different I believe, hence the different behaviour).

There are a number of things you can do to see how this memory layout comes into play:

1. Change the code so that the "instance array" is allocated in the run method. Now it comes from the new thread's TLAB and so behaves like the local-array case.

2. Change the size of the instance array - this will increase the spacing between dummy[0]/[1] in one Concurrent instance and dummy[0]/[1] in the next. With the right spacing the conflict goes.

3. Add some padding variables to Concurrent so that again the size of a concurrent instances changes, and so the arrays in different instances will be spaced different.

You should be able to induce the same problem in the local array case by doing additional allocation

That said, I was unable to induce the same abberant behaviour in the local array case which makes me suspicious that there is something more subtle going on.

If you can, try to analyze cache conflicts/traffic. I don't have experience at that level of analysis.

davidholmesa at 2007-7-9 0:06:31 > top of Java-index,Core,Core APIs...