CMS Rescan parallel taking long time

Another chapter of my gc woes...

After somehow managing to reduce my weak ref processing problems (by using IMS and incremental CMS) I'm hitting rescan pauses. It seems that if CMS is running while some more memory is allocated/modified, rescan pause goes over the roof. Problem is that CMS is now running almost continiously due to incremental CMS and there are some places where we have to allocate 100-200MB of semi persistent data over period of minute or two (which is not so bad case IMHO - we are talking about few MB per second). When it happens, we get a quite large rescan pause. Is there any way to avoid this ? Put some more work to be done in concurrent mode ? (we have quite considerable buffer of free memory on top of our server needs so we can accept CMS rescanning stuff concurrently for a longer time without freeing memory, as long as it is not taking so big rescan pauses).

Mandatory gc output follows

(cardTable: 446 cards, re-scanned 52114 cards, 2 iterations)

[1 iterations, 0 waits, 52114 cards)] 30401.844: [CMS-concurrent-abortable-preclean: 0.641/20.023 secs]

(CMS-concurrent-abortable-preclean yielded 3 times)

30401.857: [GC[YG occupancy: 11343 K (24576 K)]30401.857: [Rescan (parallel) Finished root rescan work in 2th thread: 0.011 sec

Finished root rescan work in 1th thread: 0.012 sec

Finished root rescan work in 3th thread: 0.017 sec

Finished root rescan work in 0th thread: 0.056 sec

Finished dirty card rescan work in 3th thread: 1.013 sec

Finished dirty card rescan work in 2th thread: 1.400 sec

Finished dirty card rescan work in 0th thread: 1.405 sec

Finished dirty card rescan work in 1th thread: 2.049 sec

Finished work stealing in 1th thread: 0.000 sec

Finished work stealing in 3th thread: 1.032 sec

Finished work stealing in 0th thread: 0.600 sec

Finished work stealing in 2th thread: 0.651 sec

, 2.0621770 secs]30403.919: [weak refs processing, 0.0035454 secs]30403.923: [class unloading, 0.1012155 secs]30404.024: [scrub sy

mbol & string tables, 0.0123691 secs] [1 CMS-remark: 515043K(2068480K)] 526387K(2093056K), 2.2264839 secs]

[2240 byte] By [abiesa] at [2007-10-2 5:33:37]
# 1

> ... there are

> some places where we have to allocate 100-200MB of

> semi persistent data over period of minute or two

Do you know what the size and type of tis data is?

(e.g. is it large primitive type arrays, or small reference-rich

objects, or ...) Does this data mutate a lot after being

allocated?

Do you have a larger snippet of the GC log surrounding

the remark event, preferably the _entire_ trace between

the start to the end of a cycle (CMS-init-mark to CMS-reset).

If that is too big to include here, you can send the information

to hotspot gc DASH feedback AT Sun DOT COM.

> (which is not so bad case IMHO - we are talking about

> few MB per second). When it happens, we get a quite

> large rescan pause. Is there any way to avoid this ?

> Put some more work to be done in concurrent mode ?

> (we have quite considerable buffer of free memory on

> top of our server needs so we can accept CMS

> rescanning stuff concurrently for a longer time

> without freeing memory, as long as it is not taking

> so big rescan pauses).

>

> Mandatory gc output follows

> > (cardTable: 446 cards, re-scanned 52114 cards, 2

> 2 iterations)

> [1 iterations, 0 waits, 52114 cards)] 30401.844:

> : [CMS-concurrent-abortable-preclean: 0.641/20.023

> secs]

> (CMS-concurrent-abortable-preclean yielded 3 times)

> 30401.857: [GC[YG occupancy: 11343 K (24576

> K)]30401.857: [Rescan (parallel) Finished root rescan

> work in 2th thread: 0.011 sec

> Finished root rescan work in 1th thread: 0.012 sec

> Finished root rescan work in 3th thread: 0.017 sec

> Finished root rescan work in 0th thread: 0.056 sec

> Finished dirty card rescan work in 3th thread: 1.013

> sec

> Finished dirty card rescan work in 2th thread: 1.400

> sec

> Finished dirty card rescan work in 0th thread: 1.405

> sec

> Finished dirty card rescan work in 1th thread: 2.049

> sec

> Finished work stealing in 1th thread: 0.000 sec

> Finished work stealing in 3th thread: 1.032 sec

> Finished work stealing in 0th thread: 0.600 sec

> Finished work stealing in 2th thread: 0.651 sec

> , 2.0621770 secs]30403.919: [weak refs processing,

> 0.0035454 secs]30403.923: [class unloading, 0.1012155

> secs]30404.024: [scrub sy

> mbol & string tables, 0.0123691 secs] [1 CMS-remark:

> 515043K(2068480K)] 526387K(2093056K), 2.2264839

> secs]

>

>

thank you.

ramki_at_jdca at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 2

I'm also having the same problem, with rescan pauses going up to 660 seconds when large number of objects are being created.

Linux AMD64 with 32GB memory - jdk1.5.0

-Xmx20g

-Xms20g

-Xmn7g

-XX:+UseConcMarkSweepGC

-XX:+UseParNewGC

-verbose:gc

-XX:+PrintGCTimeStamps

-XX:+PrintGCDetails

-Dsun.rmi.dgc.client.gcInterval=72000000

-Dsun.rmi.dgc.server.gcInterval=72000000

-d64

0.000: [Full GC 0.000: [CMS: 0K->1558K(20437440K), 0.4600520 secs] 14499K->1558K(20479936K), [CMS Perm : 7530K->7526K(21248K)], 0.4600520 secs]

...

112.933: [Full GC 112.933: [ParNew: 42432K->0K(42496K), 0.0100010 secs] 918297K->876270K(20479936K), 0.0100010 secs]

...

147.557: [Full GC 147.557: [ParNew: 2419K->0K(42496K), 0.0100010 secs] 1056056K->1053940K(20479936K), 0.0100010 secs]

...

784.899: [GC [1 CMS-initial-mark: 10232160K(20437440K)] 10232312K(20479936K), 0.0200030 secs]

784.919: [CMS-concurrent-mark-start]

...

836.094: [CMS-concurrent-mark: 36.854/51.176 secs]

836.094: [CMS-concurrent-preclean-start]

836.094: [CMS-concurrent-preclean: 0.000/0.000 secs]

...

946.747: [CMS-concurrent-abortable-preclean-start]

946.747: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs]

...

948.487: [GC[YG occupancy: 17079 K (42496 K)]948.487: [Rescan (parallel) , 660.5946670 secs]1609.082: [weak refs processing, 1.9702230 secs] [1 CMS-remark: 19386485K(20437440K)] 19403564K(20479936K), 662.5948930 secs]

1611.082: [CMS-concurrent-sweep-start]

...

1633.315: [CMS-concurrent-sweep: 9.551/22.233 secs]

1633.315: [CMS-concurrent-reset-start]

1633.525: [GC 1633.525: [ParNew: 42432K->0K(42496K), 0.1200140 secs] 17092621K->17080486K(20479936K), 0.1200140 secs]

1633.665: [CMS-concurrent-reset: 0.220/0.350 secs]

1634.815: [GC 1634.815: [ParNew: 42432K->0K(42496K), 0.1700190 secs] 17122918K->17116961K(20479936K), 0.1700190 secs]

1634.985: [GC [1 CMS-initial-mark: 17116961K(20437440K)] 17117380K(20479936K), 0.0200020 secs]

1635.005: [CMS-concurrent-mark-start]

...

1668.899: [GC 1668.899: [ParNew (promotion failed): 42429K->42429K(42496K), 0.6900780 secs]1669.589: [CMS1671.809: [CMS-concurrent-mark: 24.813/36.804 secs]

(concurrent mode failure): 19596317K->8668064K(20437440K), 60.9568870 secs] 19630537K->8668064K(20479936K), 61.6469650 secs]

...

1732.186: [GC [1 CMS-initial-mark: 8838364K(20437440K)] 8838830K(20479936K), 0.0200020 secs]

1732.206: [CMS-concurrent-mark-start]

...

1785.102: [CMS-concurrent-mark: 37.134/52.896 secs]

1785.102: [CMS-concurrent-preclean-start]

1785.102: [CMS-concurrent-preclean: 0.000/0.000 secs]

...

1812.915: [CMS-concurrent-abortable-preclean-start]

1812.915: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs]

1813.035: [GC[YG occupancy: 35552 K (42496 K)]1813.035: [Rescan (parallel) , 0.0200030 secs]1813.055: [weak refs processing, 0.1300140 secs] [1 CMS-remark: 13714701K(20437440K)] 13750253K(20479936K), 0.1500170 secs]

1813.185: [CMS-concurrent-sweep-start]

1816.175: [CMS-concurrent-sweep: 2.990/2.990 secs]

1816.175: [CMS-concurrent-reset-start]

1816.365: [CMS-concurrent-reset: 0.190/0.190 secs]

1818.375: [GC [1 CMS-initial-mark: 10458899K(20437440K)] 10494754K(20479936K), 0.0400040 secs]

1818.415: [CMS-concurrent-mark-start]

...

1837.998: [CMS-concurrent-mark: 18.052/19.582 secs]

1837.998: [CMS-concurrent-preclean-start]

1837.998: [CMS-concurrent-preclean: 0.000/0.000 secs]

1838.548: [CMS-concurrent-abortable-preclean-start]

1838.548: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs]

...

1854.690: [GC[YG occupancy: 641 K (42496 K)]1854.690: [Rescan (parallel) , 0.0200020 secs]1854.710: [weak refs processing, 1.7301950 secs] [1 CMS-remark: 10540826K(20437440K)] 10541467K(20479936K), 1.7601990 secs]

1856.450: [CMS-concurrent-sweep-start]

...

1885.113: [CMS-concurrent-sweep: 28.143/28.663 secs]

1885.123: [CMS-concurrent-reset-start]

1885.303: [CMS-concurrent-reset: 0.180/0.180 secs]

1915.997: [GC 1915.997: [ParNew: 42432K->0K(42496K), 0.0500060 secs] 6787450K->6748660K(20479936K), 0.0500060 secs]

adrian.yapa at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 3

There is a bug in the CMS collector when dealing with large arrays (~8M+). Not sure of when a fix will be officially released. The bug will cause very long parallel rescan times.

Playing with a patched version, my GC's have dropped from several hours (3+) down to 6 seconds max.

Once I have the bug number I'll post the bug parade link.

It's also good to see someone else working with large heaps. Is that 7g for your young generation? If so, how are your young gen collection times? With a 28g heap and a 512m young gen, I'm seeing ParNew pauses of 2+ seconds at times every 7 seconds. When trying 1G it's more than twice that. This is on Windows 2003 Server 64-bit.

Until the bug is fixed some options you might help are:

-XX:ParallelGCThreads=# of cpu's - 1

-XX:CMSMarkStackSize=2M

-XX:CMSMarkStackSizeMax=8M

-XX:+DisableExplicitGC

-XX:+ParallelRefProcEnabled

-XX:+CMSClassUnloadingEnabled

-XX:+CMSPermGenSweepingEnabled

Sun recommends that you set the number of GC threads to 1 one less than the number of logical processors so you don't accidentally choke the system with a runaway GC.

I've found that CMSMarkStackSize and CMSMarkStackSizeMax is useful in avoiding those pesky CMS benign failures (which will sometimes cause CMS to revert back to a serial collection).

Disabling explicit GC's is also good, in some cases I've seen calls to System.gc() trigger a serial collection instead of a CMS collection (ouch).

Tthe last three options like to travel together (I've heard of problems with hotspot crashes if you don't enable all three)

If the above settings don't help, try dropping back to just the parallel young collector and serial old collector, if your GC times shrink to about 60 to 90 seconds, you could be suffering from the CMS array rescan bug I mentioned above.

While waiting for the fix for the CMS bug things you can try are:

-XX:UseParallelOldGC (assuming you're on 1.5.0_06) along with -XX:UseParallelGC. In our case, the serial old collector did a better job than ParallelOldGC (somewhat counter-intuitive but it's true); it might work for you since you're running on a different OS.

-XX:+PrintClassHistogram might be useful in determining just what you've got hanging around in memory (i.e. if it's those large arrays). With this option turned on, collecting a thread dump (send a SIGQUIT signal to the jvm) will also print a histogram of what objects are hanging around (it may trigger a full GC when you do this, be warned). This looks to be the same type of output as running JMap -histo in Java 1.6.

And finally, the most invasive thing is to modify your code so it uses smaller arrays, and probably not very realistic.

Hope this helps!!

jscottarmstronga at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 4
Re: rescan of large arrays:CR 6367204 will be fixed in 5.0_0x (x >= 7) and in 6.0.Re: long scavenges:Try reshaping heap to reduce promotion (i.e. use survivor spaces).There's also likely an issue with card scanning with very large objectarrays.
ramki_at_jdca at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 5
Re: jmap -histo vs +PrintClassHistogram/SIGQUITThe former does not precede the histogram census with a full gc,while the latter does. (So the former will also likely find some objectsthat are actually unreachable but not yet garbage-collected.
ramki_at_jdca at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 6

> Re: rescan of large arrays:

> CR 6367204 will be fixed in 5.0_0x (x >= 7) and in

> 6.0.

>

Using Mustang b71 with this bug fixed solved the long Rescan problem. Since it will also be fixed in the supported 5.0_7, can I get a rough estimate of when 5.0_07 will be released? I read that Mustang is targeted for final release in fall 2006, but an earlier supported version would be ideal.

Thanks for your help.

adrian.yapa at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 7

> > Re: rescan of large arrays:

> > CR 6367204 will be fixed in 5.0_0x (x >= 7) and in

> > 6.0.

> >

>

> Using Mustang b71 with this bug fixed solved the long

> Rescan problem. Since it will also be fixed in the

> supported 5.0_7, can I get a rough estimate of when

> 5.0_07 will be released? I read that Mustang is

> targeted for final release in fall 2006, but an

> earlier supported version would be ideal.

CR 6367204 is fixed in 5.0_07, currently

targeted for GA around the last week of April '06.

If you hold a service contract with Sun you

can log a case to get a fix verification binary with the fix, this

could be used to confirm that this issue has been addressed to your

full satisfaction.

The following links detail some of Sun's Java SE related support

offerings:

http://www.sun.com/service/consulting/index.jsp?tsubcat=Get%20Support%20Information%20for%20Sun%20Software&tab=4

http://www.sun.com/service/warrantiescontracts/javamultiplatform.html

http://developers.sun.com/prodtech/support/

ramki_at_jdca at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 8

> Re: long scavenges:

...

> There's also likely an issue with card scanning with

> very large object ...

CR 6298694: bad performance with big object in heap

(will be) fixed in Mustang b75.

Some possibility of performance fix also being available in 5.0_08

depending on customer need.

(See also: CR 6391214.)

See previous message on GA dates and support from Sun

for such bug fixes.

ramki_at_jdca at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 9
> CR 6367204 is fixed in 5.0_07, currently> targeted for GA around the last week of April '06.Is there any current estimation about _07 availability ?
abiesa at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 10
> Is there any current estimation about _07> availability ?It's available for download at: http://java.sun.com/j2se/1.5.0/download.jsp
ramki_at_jdca at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 11

Thanks a lot.

New version doesn't seem to have a problem with too long rescan pauses - we are not breaching 0.5s, while previously we used to see 10-16 seconds breaks every half an hour or so.

Unfortunately, it is still not possible to use parallel new gc together with CMS, but it is not critical issue for us at the moment (as we have reasonably sized new generation and old gen rescans don't have to be parallel due to this fix).

abiesa at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 12

> Thanks a lot.

>

> New version doesn't seem to have a problem with too

> long rescan pauses - we are not breaching 0.5s, while

> previously we used to see 10-16 seconds breaks every

> half an hour or so.

Excellent. Good to know that that particular problem is taken care of.

>

> Unfortunately, it is still not possible to use

> parallel new gc together with CMS, but it is not

> critical issue for us at the moment (as we have

> reasonably sized new generation and old gen rescans

> don't have to be parallel due to this fix).

Could you tell us what the problem is with using ParNew

with CMS?

ramki_at_jdca at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 13

> Could you tell us what the problem is with using

> ParNew with CMS?

Jvm crashes in random way after 1-5 hours. This seems to be some kind of memory corruption problem which has a delayed result, because almost every time it crashes in different place. In most cases it is a SIGSEGV or SIGBUS at random places of code (not even on object allocation), once it went into infinite loop in parallel gc, using 100% of 4 cpus there. It is not possible to reproduce instantly, but we have never managed to get server running for longer than 5 hours. After playing with many the gc options, I have found out that it is a combination of CMS and new parallel gc which causes the crash - changing other options does not make a difference one way or another.

Same thing happens with our client program on windows (75% of code is shared with server). On my computer (2x dual core intel), it crashes within one hour, on my coworker computer (1xdual core intel), it has crashed only twice so far in total during 2 weeks of testing with cms+parallel gc.

I have submitted some crash logs and observations to sun bug channel, but unfortunately, they are not able to help without easy reproductible case - and we don't have one. In fact, crashes were not happening for us for some time too often with small number of objects in the system, only after we started to test real load we observed the problem.

To reassume:

- it is happening only with both CMS and new generation parallel gc enabled

- it happens in completly random place, mostly crashing (SIGSEGV or SIGBUS), but sometimes causing gc to go into infinite loop

- in windows, it is easy to reproduce on 2x2 cpu machine, quite hard to reproduce on 1x2 machine (but still happened twice)

- we never had any jvm crash with CMS and non-parallel gc

abiesa at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 14

If you have a Sun support contract we may be able to expedite the

diagnosis by means of a "debug jvm" that does some consistency

checking.

If not, could you download the latest Mustang beta binaries off

of http://download.java.net/jdk6/binaries/ and see if you can

reproduce the problem with those bits?

If so, we may be able to help you more easily because the Mustang

JVM (currently in beta) allows "heap verification" without needing to use a debug build.

I will take this opportunity to remind the forum users running into

issues to consider one of several support offerings available from

Sun at http://developers.sun.com/services/

ramki_at_jdca at 2007-7-16 1:44:24 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 15
I have tried running it with mustang once, but unfortunately I have hit not very stable version, which was crashing very fast for different reason.I can try to test it again. Should I use normal build or debug build ? Any special parameters to pass there in addition to my gc flags
abiesa at 2007-7-20 18:37:11 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 16

> I have tried running it with mustang once, but

> unfortunately I have hit not very stable version,

> which was crashing very fast for different reason.

>

> I can try to test it again. Should I use normal build

> or debug build ? Any special parameters to pass there

> in addition to my gc flags ?

Please try the debug build first using the same set of parameters that

elicited a crash when you used JDK 5.0. Let us know what you

find out. If this crashes for any reason, please open a bug at

http://bugs.sun.com/services/bugreport/index.jsp

referencing this forum thread, and supplying at least the following information:

(1) the precise version of JDK/JVM used

(2) complete list of command-line parameters used

(3) the hs_err* log file generated from the error

(4) any test case that you can share with us that would allow

us to reproduce the problem

(5) any workarounds you may have found (such as in the above

case -XX:-UseParNewGC as you stated before, for example)

(6) After that you can try running with the additional parameters

-XX:+VerifyBeforeGC -XX:+VerifyAfterGC -XX:+VerifyDuringGC

and include any further data as in (1) through (5) above.

We appreciate your help and look forward to helping fix the issues

you are having.

ramki_at_jdca at 2007-7-20 18:37:11 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 17

First - I would like to that you for your support.

I'm not able to reproduce the crash with client version of jvm on windows, using our swing program - so something has changed with version 6. As we have a release this week, I'll be only able to play with it on the server side next week - I'll first try to reproduce the problem with normal build, then with debug build and then with debug+options you have given.

BTW, I know that it is probably not helping much, but only thing which might be different in our program from most other applications which does not see this problem is very large number of weak references/queues, which are constantly used/emptied. We have probably around 1M weak references, with probably tens cleared every new generation gc (every few seconds) and few-umpteen thousands cleared/enqueued during every CMS (every 5-15 minutes).

I'll post the results of investigation next week. Thanks again for help.

abiesa at 2007-7-20 18:37:11 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 18
Do you have any native code or methods associated with saidweak references that are disposed when these are enqueued?It might also be diagnostic to run with -Xcheck:jnito catch possible issues related to JNI.
ramki_at_jdca at 2007-7-20 18:37:11 > top of Java-index,Java HotSpot Virtual Machine,Specifications...
# 19

Yes, we have some JNI code on server and we have suspected it at the start. We have disabled native library completly and unfortunately we got the same crashes. Additionally, our JNI lib is not there on windows and there the crash was also reproductible.

BTW, we have found another machine (beefy T2000 box) and we run a stress tess with JDK6.84 - and no crash for entire day under extreme load.

With 5.0_06 I was able to get reproduce crash also on T2000 (our normal box is Sun-Fire-880).

I'll do more tests during next week, but currently it seems that problem is fixed, or at least very reduced in JDK 6, as it is not appearing on windows nor on T2000 in during a test which is 100% crashing with version 5.

abiesa at 2007-7-20 18:37:11 > top of Java-index,Java HotSpot Virtual Machine,Specifications...