Consistently high YoungGen (ParScavenge) GC times

Expected YGC times are around 0.1s. "Sometimes" my YGC times jump to 10s level and stay there. Because the YGC is long and frequent JVM spends high portion (10%+) of time in GC. It is not a lonely GC peak, it is a GC duration that is consistenly above some level (3s, 5s, 10s or even 20s).

Our developers have set both the NewSize and MaxNewSize to 512m to make YGC short and predictable. But it does not seem to work as expected. TenuringThreshold is set to 0 to promote young objects to Old Gen after first YGC. OldGeneration GC is CMS.

What can make YGC times consistently high ?

GC related opts

-server -XX:+UseConcMarkSweepGC -XX:NewSize=512m -XX:MaxNewSize=512m -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=128 -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=40 -XX:+DisableExplicitGC

Platform:

Solaris 9, JVM 1.5.0_04, _06, Sparc architectures (4+ CPUs), 16+GB memory.

[922 byte] By [zarnovicana] at [2007-10-3 2:52:10]
# 1

Hi,

We would like to get a bit more info on this. Could you possibly let us know whether this happens with the latest 5.0, which is _08? Have you noticed whether the long GC pauses start appearing before or after a CMS GC cycle? Finally, I noticed that you don't explicitly set the heap size. What's the size of the heap when the long pauses appear? If you could share an extract from a GC log, it would be great.

Tony (HS GC Group)

TonyPrintezisa at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 2

Hi Tony,

thanks for your interest. Below is the example GC log.

We always set start and max heap size for each box in cluster explicitly. For example, the box bellow is SunFire V440, 4xCPU, 16GB mem, Sol8 108528-27. Start heap is 4G and max is 8GB.

My theory is that app is filling up YG faster than is the average lifetime of short-lived objects. YGC happens before objects become a garbage. Therefore most (or all) objects are promoted to Tenured generation every time. High ParScavenge time is then time spend allocating objects in CMS. That's the theory ;-)

I plan to upgrade customer's test cluster to 1.5.0_08 and do some tests in tweaking GC setting. The fist one will be to double the YG size.

Regards,

BranoZ

This is how it started this time. 50s is quite extreme example:

21998.046: [ParNew 2765647K->2264533K(4190272K), 0.1930396 secs]

22001.982: [ParNew 2780757K->2268339K(4190272K), 0.0863709 secs]

22004.203: [GC 2534762K(4190272K), 0.3614055 secs]

22006.736: [ParNew 2783998K->2283574K(4190272K), 0.1430293 secs]

22011.286: [ParNew 2794833K->2302833K(4190272K), 0.2536093 secs]

22015.136: [GC 2748687K(4326184K), 2.0708461 secs]

22017.641: [ParNew 2799778K->2303906K(4326184K), 0.1923121 secs]

22022.036: [ParNew 2820130K->2317597K(4326184K), 50.2958672 secs]

22076.085: [ParNew 2833821K->2321863K(4326184K), 50.1316508 secs]

22130.317: [ParNew 2838087K->2324692K(4326184K), 50.4934666 secs]

22185.006: [ParNew 2840916K->2333814K(4326184K), 50.6857353 secs]

22240.106: [ParNew 2850038K->2359014K(4326184K), 50.3882597 secs]

22292.626: [GC 2618318K(4326184K), 0.3510936 secs]

22295.186: [ParNew 2874430K->2384467K(4326184K), 49.8472176 secs]

22349.375: [ParNew 2891323K->2396182K(4326184K), 50.1110282 secs]

22403.254: [ParNew 2828910K->2316157K(4374736K), 50.4352997 secs]

22453.691: [GC 2316157K(4374736K), 0.0074683 secs]

22458.150: [ParNew 2832381K->2338460K(4374736K), 50.2660043 secs]

22512.901: [ParNew 2854684K->2364763K(4374736K), 50.3661762 secs]

22567.657: [ParNew 2880987K->2382561K(4374736K), 50.5471023 secs]

22622.293: [ParNew 2898785K->2385447K(4374736K), 50.3793910 secs]

22676.788: [ParNew 2901671K->2388276K(4374736K), 49.9774859 secs]

22728.806: [GC 2650214K(4374736K), 0.3244817 secs]

22731.177: [ParNew 2903593K->2390040K(4374736K), 50.0629669 secs]

22785.342: [ParNew 2898932K->2384656K(4374736K), 49.9562792 secs]

22839.320: [ParNew 2788365K->2276543K(4374736K), 3.0625406 secs]

22842.385: [GC 2279399K(4374736K), 0.0079225 secs]

22846.460: [ParNew 2792767K->2281581K(4374736K), 2.9592045 secs]

22853.282: [ParNew 2797805K->2284871K(4374736K), 3.0843182 secs]

22860.492: [ParNew 2801095K->2287568K(4374736K), 2.9546525 secs]

22867.708: [ParNew 2803792K->2308495K(4374736K), 3.1377750 secs]

..and this is from the same log now. It didn't drop below 3s. In HPJtune you can clearly see the terace like graph. Sometimes ParNew makes several switches to different levels (up or down) between two restarts.

140128.154: [ParNew 2565163K->2058622K(5711608K), 3.1290494 secs]

140143.496: [ParNew 2574846K->2072318K(5711608K), 3.2323037 secs]

140284.063: [ParNew 2588542K->2105889K(5711608K), 3.2494839 secs]

140287.315: [GC 2111125K(5711608K), 0.0083492 secs]

140335.187: [GC 2375049K(5711608K), 2.0569228 secs]

140342.039: [ParNew 1669011K->1155124K(5711608K), 3.1821818 secs]

140355.287: [ParNew 1138700K->624353K(5711608K), 3.0024189 secs]

140363.757: [ParNew 1140577K->658753K(5711608K), 3.1779254 secs]

140372.735: [ParNew 1174977K->690875K(5711608K), 3.2534315 secs]

140383.115: [ParNew 1207099K->694233K(5711608K), 3.1505525 secs]

140405.618: [ParNew 1210457K->698244K(5711608K), 2.9981982 secs]

140417.257: [ParNew 1214468K->703503K(5711608K), 3.2022825 secs]

140427.270: [ParNew 1219727K->737817K(5711608K), 3.1956219 secs]

140437.566: [ParNew 1254041K->768785K(5711608K), 3.3210778 secs]

140447.477: [ParNew 1285009K->772447K(5711608K), 3.1232145 secs]

140456.386: [ParNew 1288671K->775552K(5711608K), 3.1227987 secs]

140465.315: [ParNew 1291776K->779204K(5711608K), 3.0075379 secs]

140474.177: [ParNew 1295428K->782192K(5711608K), 3.0002545 secs]

140483.006: [ParNew 1298416K->785233K(5711608K), 3.1808492 secs]

140490.427: [ParNew 1301457K->789180K(5711608K), 2.9973702 secs]

140498.923: [ParNew 1305404K->801785K(5711608K), 3.0341292 secs]

140507.208: [ParNew 1318009K->832232K(5711608K), 3.4033004 secs]

140518.370: [ParNew 1348456K->854803K(5711608K), 3.1722517 secs]

140529.946: [ParNew 1371027K->856848K(5711608K), 2.9901605 secs]

140539.483: [ParNew 1373072K->860134K(5711608K), 3.0173354 secs]

140549.256: [ParNew 1376358K->900292K(5711608K), 3.1819922 secs]

140559.872: [ParNew 1416516K->924135K(5711608K), 3.1670494 secs]

140571.467: [ParNew 1440359K->926205K(5711608K), 3.1295034 secs]

140589.407: [ParNew 1442429K->932570K(5711608K), 3.0187656 secs]

140599.497: [ParNew 1448794K->966578K(5711608K), 3.1905812 secs]

140608.176: [ParNew 1482802K->1000361K(5711608K), 3.3525339 secs]

140617.893: [ParNew 1516585K->1005821K(5711608K), 3.1527976 secs]

140627.198: [ParNew 1522045K->1033823K(5711608K), 3.2412903 secs]

140635.415: [ParNew 1550047K->1061601K(5711608K), 3.2880912 secs]

140650.037: [ParNew 1577825K->1093346K(5711608K), 3.1435323 secs]

140683.236: [ParNew 1609570K->1141199K(5711608K), 3.5401014 secs]

140713.168: [ParNew 1657423K->1194808K(5711608K), 3.3053307 secs]

140722.668: [ParNew 1711032K->1210168K(5711608K), 3.1834288 secs]

140761.051: [ParNew 1726392K->1274228K(5711608K), 3.3631744 secs]

140841.084: [ParNew 1790452K->1340929K(5711608K), 3.4981368 secs]

140881.543: [ParNew 1857153K->1375003K(5711608K), 3.1548356 secs]

140911.008: [ParNew 1891227K->1469072K(5711608K), 3.5447226 secs]

140983.170: [ParNew 1985296K->1543070K(5711608K), 4.1797274 secs]

141012.498: [ParNew 2059294K->1581317K(5711608K), 3.2417649 secs]

141068.673: [ParNew 2097541K->1625653K(5711608K), 3.3013306 secs]

141128.463: [ParNew 2141877K->1671267K(5711608K), 3.3046175 secs]

141178.443: [ParNew 2187491K->1710921K(5711608K), 3.2494128 secs]

141237.067: [ParNew 2227145K->1752891K(5711608K), 3.1965221 secs]

141268.104: [ParNew 2269115K->1799532K(5711608K), 3.3200524 secs]

141331.488: [ParNew 2315756K->1820042K(5711608K), 3.1471623 secs]

141405.321: [ParNew 2336266K->1896598K(5711608K), 4.2776222 secs]

141438.659: [ParNew 2412822K->1930054K(5711608K), 3.1951150 secs]

141621.927: [ParNew 2446278K->1986950K(5711608K), 3.2985364 secs]

142399.516: [ParNew 2503174K->2007449K(5711608K), 3.0788738 secs]

142408.417: [ParNew 2523673K->2034172K(5711608K), 3.4990466 secs]

142418.043: [ParNew 2550396K->2036805K(5711608K), 3.1383824 secs]

142506.132: [ParNew 2553029K->2072454K(5711608K), 3.1656547 secs]

142601.308: [ParNew 2588678K->2102850K(5711608K), 3.1486802 secs]

142604.460: [GC 2109933K(5711608K), 0.0082203 secs]

142679.497: [GC 2366531K(5711608K), 1.3149546 secs]

142754.615: [ParNew 932987K->453486K(5711608K), 3.1585706 secs]

143298.743: [ParNew 969710K->482957K(5711608K), 3.2853786 secs]

143308.759: [ParNew 999181K->487386K(5711608K), 3.1410462 secs]

143318.435: [ParNew 1003610K->489341K(5711608K), 3.1152526 secs]

143326.843: [ParNew 1005565K->516596K(5711608K), 3.1784453 secs]

143335.992: [ParNew 1032820K->520940K(5711608K), 3.1349650 secs]

143345.601: [ParNew 1037164K->522456K(5711608K), 2.9855287 secs]

143355.095: [ParNew 1038680K->523980K(5711608K), 2.9844280 secs]

143364.532: [ParNew 1040204K->525973K(5711608K), 2.9859568 secs]

143373.994: [ParNew 1042197K->528024K(5711608K), 2.9859666 secs]

143380.585: [ParNew 1044248K->553935K(5711608K), 3.0960279 secs]

143387.158: [ParNew 1070159K->560324K(5711608K), 3.2001745 secs]

143393.919: [ParNew 1076548K->565960K(5711608K), 3.0185440 secs]

143401.046: [ParNew 1082184K->591880K(5711608K), 3.2659240 secs]

143409.763: [ParNew 1108104K->596278K(5711608K), 3.0180708 secs]

143418.044: [ParNew 1112502K->630460K(5711608K), 3.3750751 secs]

143426.782: [ParNew 1146684K->641672K(5711608K), 3.2145800 secs]

143438.725: [ParNew 1157896K->673599K(5711608K), 3.2033111 secs]

zarnovicana at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 3

Hi,

Thanks for sharing your GC logs with us.

> We always set start and max heap size for each box in cluster explicitly.

Good. I asked as I didn't see it in the command line parameters you showed.

> My theory is that app is filling up YG faster than is the average lifetime of short-lived objects.

I don't think this is the case. If you actually look at the GC log the young collections reclaim roughly the same amount of space, before and after the times spiking. I take it there's no phase change in your application (i.e., the application starting to do something radically different) around that time, or anything like that?

By looking at the GC log I can see that the problem occurs right after a marking / sweeping cycle. This is very similar to a bug we are currently looking at. The ID is 6433335 if you care to look it up. The young GCs just hit a pathological performance issue, after more free space is made available.

> I plan to upgrade customer's test cluster to 1.5.0_08 and do some tests in tweaking GC setting.

The bug seems to exist in both _08, as well as 6.0. So, I don't think upgrading will help. The workaround we propose is to try disabling parallel young GCs, -XX:-UseParNewGC. This will increase young GC times, but it might prevent the GC from hitting the worst-case that you're seeing.

Tony (HS GC Group)

TonyPrintezisa at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 4

> > My theory is that app is filling up YG faster than

> is the average lifetime of short-lived objects.

>

> I don't think this is the case. If you actually look

> at the GC log the young collections reclaim roughly

> the same amount of space, before and after the times

> spiking. I take it there's no phase change in your

That's what I wasn't sure about. If the reclaimed bytes in ParNew include promoted objects or not.

> application (i.e., the application starting to do

> something radically different) around that time, or

> anything like that?

Yes. The application creates bursts of activity that last for couple of minutes (large DB commits). But after that, ParNew does not return to previous level. No all bursts trigger this problem. Smaller jobs are passed without any effect.

> By looking at the GC log I can see that the problem

> occurs right after a marking / sweeping cycle. This

> is very similar to a bug we are currently looking at.

> The ID is 6433335 if you care to look it up. The

Yes. The bug pretty much describes my problem. Thanks for pointing me to the ID.

> young GCs just hit a pathological performance issue,

> after more free space is made available.

You are right, I had the feeling that it happens around the resize of main heap space. Our NewSize is fixed, though.

> The bug seems to exist in both _08, as well as 6.0.

> So, I don't think upgrading will help. The workaround

> we propose is to try disabling parallel young GCs,

> -XX:-UseParNewGC. This will increase young GC times,

> but it might prevent the GC from hitting the

> worst-case that you're seeing.

Good to know that somebody is looking at it already. Thanks for the hint, I will definitely try that.

So far I haven't been able to consistently reproduce it on test cluster (still on _06). Only one node has jumped from 0.05s to 0.25s level. That's barely recognizable.

Regards,

BranoZ

zarnovicana at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 5

> You are right, I had the feeling that it happens around the resize of main heap space. Our NewSize is fixed, though.

There is a related bug which has to do with resizing... The ID is 6459113.

> Good to know that somebody is looking at it already.

Do you have Sun support BTW? If you do, we might be able to give you a VM to try.

Tony (HS GC Group)

TonyPrintezisa at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 6

Hi,

> Do you have Sun support BTW? If you do, we might be

> able to give you a VM to try.

No, sorry. AFAIK we don't have Sun support.

The proposed workaround (-UseParNewGC) was applied on one of the production systems with great success. In one case the time spent in GC dropped from 35% to 5%.

Thanks a lot,

BranoZ

zarnovicana at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 7

> The proposed workaround (-UseParNewGC) was applied on

> one of the production systems with great success. In

> one case the time spent in GC dropped from 35% to

> 5%.

>

> Thanks a lot,

Hi, thanks for the heads up. We are very glad that the workaround was helpful! Incidentally, were the young GCs affected much by disabling ParNew?

Best regards,

Tony (HS GC Group)

TonyPrintezisa at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 8

> were the young GCs affected much by disabling ParNew?

ParScavenge:

avg interval: 5.71s 12.9s

avg duration: 2.02s 10.0s

peak duration: 17s 50s

Scavenge:

avg interval: 4.46s 4.81s

avg duration: 0.24s 0.37s

peak duration: 1.5s 5.4s

That's for two machines, 4xCPU, 32G of memory each, with NewSize fixed to 512M. Scavenge times are not that great. The time spend in GC is above 5%, but it is a much better than having 30%+.

The results are for one day before and after the change. However, the application load is not constant. We will see in long run..

Regards,

BranoZ

zarnovicana at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 9

It seems as though bugs 6433335 and 6459113 would

both help you. But I also noted these figures with some

interest and a good deal of consternation:

> > were the young GCs affected much by disabling

> ParNew?

>

> ParScavenge:

> avg interval: 5.71s 12.9s

> avg duration: 2.02s 10.0s

> peak duration: 17s 50s

>

> Scavenge:

> avg interval: 4.46s 4.81s

> avg duration: 0.24s 0.37s

> peak duration: 1.5s 5.4s

>

That indicates ParNew performing about 8 times worse

than DefNew on a machine with 4 CPU's. That's an

8X slowdown rather than the 4X theoretical speed-up,

i.e. a performance pathology that is 32-fold.

Now some of that, especially with large objects or blocks,

is fixed in 6433335.

Would you have a test case you might be able to share with

us, or, if that is not possible, would you be able to

test for us the binaries that have the fix for 6433335 and 6459113

so that we can determine whether your scenario has been

adequately addressed or whether further performance pathologies

remain that need fixing? If so please contact us

at hotspotgc DASH feedback AT Sun DOT com or via your

Sun support contact. (For the record, if you do not already

have a Sun support contract, a full complement of support

and service options are available at

http://developers.sun.com/services

)

For the record, you do not need a support contract to receive the test binaries that contain a fix. The test binaries are just test binaries; they are not supported for use in production. The fixes themselves will officially appear in 5u10 or so.

Message was edited by:

ramki_at_jdc

ramki_at_jdca at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...
# 10

Hi,

> Would you have a test case you might be able to share

> with

I was not able to reproduce this problem consistently outside of production environment. Test cluster(s) are usually less powerful and process less data. As said in earlier comment, the high ParNewtime were reproduced on test system, but the difference in average duration was not significant (fraction of second).

I believe it should not be that difficult to reproduce it with generators. One, generating background noise, alloc/dealloc at a stable pace. The other, generating random spikes triggering heap resize.

> For the record, you do not need a support contract to

> receive the test binaries that contain a fix. The

> test binaries are just test binaries; they are not

> supported for use in production. The fixes themselves

> will officially appear in 5u10 or so.

Thanks, but customers would never allow us to try it on production and in test environment, I would not see the difference. For me, the workaround fixed the problem. I'm grateful for the help received in this forum..

I'm sending you some more GC logs to look at, if you are interested.

Regards,

BranoZ

zarnovicana at 2007-7-14 20:41:05 > top of Java-index,Java HotSpot Virtual Machine,HotSpot Internals...