JVM hangs then resumes

***I POSTED THIS BEFORE BUT I THINK I PUT IT IN THE WRONG FORUM***

Hello,

I am running a java process standalone (no EJB container / servlet container or anything), just a standalone JVM. The process is run via a shell script whcih executes it in the background then exits (leaving JAVA running). The java process stays up between 03:00 and 23:00 then is shut down by another shell script. The start / stop scripts are started by a CONTROL-M agent running on the box and the JAVA code uses log4j-1.2.14 to write to STDOUT / STDERR and a rolling log file (max size 100Mb).

The process also uses JDBC with SYBASE jConnect (jconn3.jar) to connect to and read / write a SYBASE AS 12.5 database. The JDBC connection is opened when the process starts and closeed when it shuts down.

Between database writes, it calls shell scripts (using Runtime.getRuntime() / p.exec(...) / p.waitFor() etc) to perform various operations and captures the STDOUT / STDERR (but creates background threads to consume / display those streams so they don't hang the JVM).

Database tramsactions are used but committed every 40 records or so. No transactions are in progress during database reads (so it locks tables for minimum possible time).

The problem I have is that the process hangs then resumes. The hangs can be anywhere from 1min - 1h45min (longest so far). The hangs appear to be completely random, they can happen during any form of processing i.e. during database writes, reads or no database operations at all (i.e. just an iteration through a Vector in memory or something).

This problem appears to be isolated to one HPUX box. The test machine does not get the hangs even if it's loaded with other unrelated processes (since we initially thought it might be a contention problem). I have tried pointing the test machine at the production database and could not replicate the problem that way either (data retrieval is slower but it never hangs).

The network routes between bad-box->database and test-box->database are different and this probably accounts for the fact that traffic between JAVA and the database is approx. 6x slower (though not completely, you'd have thought these corporate WANs would be better than that?) Traffic is (as mentioned) approx. 6x faster between test-box and test-database (similar data volumes) vs between test box and live database.

To give an idea of timings, the application caches data on startup. The cache takes approx. 45min to load on the test box when it's pointing at the test database server and 5.5h to load on the bad-box (live box) pointing at the live database server, mainly owing to the hangs but also general slowness by comparison. The cache takes 3h on the test-box if pointed at the live database.

This application has been live for around 4-5 months and this problem has only manifested itself in the last fortnight or so. The table it caches data from has grown from approx. 4 million rows (when the app was implemented) to around 5 million now. I don't think volume increase is the answer though because it doesn't hang on the test box even when pointed at the live database with full volume.

Following are things we've tried, details of the boxes, traceroutes between them and JAVA versions. I don't have a stack dump 'at the time of hanging' yet as the only box I can reproduce the problem on is the live box so I can't knock over the application as an experiment during the working week (because of the 5.5h cache at startup time). I will post after the weekend and add the stack dump.

I also tried installing the identical JAVA version on the test box but it seems to make no difference. Note that I'm starting up JAVA with -Xms1024m -Xmx2048m. 'top' reports it as using approx. 1,700MB.

1. The OS is HPUX 11.11 on both boxes.

2. JAVA versions (they are different but I've tested using same vesion on each also and it makes no difference)

bad-box -

java version "1.4.2.02"

Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2.02-040225-17:20)

Java HotSpot(TM) Server VM (build 1.4.2 1.4.2.02-040225-19:18-PA_RISC2.0 PA2.0 (aCC_AP), mixed mode)

test box -

java version "1.4.2.09"

Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2.09-050713-03:34)

Java HotSpot(TM) Server VM (build 1.4.2 1.4.2.09-050713-05:46-PA_RISC2.0 PA2.0 (aCC_AP), mixed mode)

3. Trace routes between the various boxes involved (names changed for security :-) ):

Trace-routes between the different combinations:

test box -> SYBASE test (fastest application response / data retrieval time of all three)

traceroute to test_database.domain.name (128.8.107.214), 30 hops max, 40 byte packets

1 host1-a.domain.name (128.8.33.1) 0.477 ms 0.309 ms 0.240 ms

2 host1-b.domain.name (128.88.3.61) 0.300 ms 0.711 ms 0.272 ms

3 host1-c-1.domain.name (128.88.3.78) 0.293 ms 0.368 ms 0.258 ms

4 host1-c-1.domain.name (128.8.0.6) 2.829 ms 4.987 ms 4.109 ms

5 test_datbase_destination.domain.name (128.8.107.214) 0.289 ms 0.317 ms 0.217 ms

test box -> SYBASE live

traceroute to live_database_host.domain.name (128.31.16.194), 30 hops max, 40 byte packets

1 host1-a.domain.name (128.8.33.1) 0.441 ms 0.439 ms 0.231 ms

2 host1-b.domain.name (128.88.3.61) 0.300 ms 0.377 ms 0.249 ms

3 host1-c.domain.name (128.89.3.57) 1.001 ms 1.047 ms 0.921 ms

4 host4.domain.name (128.89.2.34) 7.886 ms 8.010 ms 7.865 ms

5 host5.domain.name (128.250.1.122) 8.274 ms 8.283 ms 8.069 ms

6 host1-d.domain.name (128.2.2.38) 8.251 ms 8.460 ms 8.090 ms

7 host1-e.domain.name (128.30.253.122) 8.205 ms 8.320 ms 8.132 ms

8 host1-f.domain.name (128.31.31.33) 8.344 ms 8.378 ms 8.507 ms

9 live_database_host.domain.name (128.31.16.194) 7.891 ms 7.946 ms 7.940 ms

bad-box -> SYBASE live box

traceroute to live_database_host.domain.name (128.31.16.194), 30 hops max, 40 byte packets

1 host1.domain.name (128.9.48.1) 0.414 ms 0.255 ms 0.219 ms

2 host2.domain.name (128.88.3.33) 0.329 ms 0.266 ms 0.254 ms

3 host3.domain.name (128.89.3.41) 1.074 ms 1.107 ms 1.093 ms

4 host4.domain.name (128.89.2.34) 6.586 ms 6.585 ms 6.577 ms

5 host5.domain.name (128.250.1.126) 6.872 ms 6.689 ms 6.793 ms

6 host6.domain.name (128.2.2.34) 6.824 ms 6.854 ms 6.799 ms

7 host7.domain.name (128.30.253.114) 6.907 ms 6.938 ms 6.799 ms

8 host8.domain.name (128.31.31.29) 6.980 ms 6.897 ms 6.789 ms

9 live_database_host.domain.name (128.31.16.194) 6.727 ms 6.523 ms 6.750 ms

4. software differences between boxes

test box

GOLDAPPS11i B.11.11.0509.429 Applications Patches for HP-UX 11i v1, September 2005

GOLDBASE11i B.11.11.0112.6 Gold Base Patches for HP-UX 11i, December 2001

GOLDBASE11i B.11.11.0509.429 Base Patches for HP-UX 11i v1, September 2005

GigEther-00 B.11.11.19 PCI/HSC GigEther;Supptd HW=A4926A/A4929A/A4924A/A4925A;SW=J1642AA

GigEther-01 B.11.11.20 PCI/PCI-X GigEther;Supptd HW=A6794A/A6825A/A6847A/A9782A/A9784A/A7109A/AB465A

HPUX11i-OE-Ent B.11.11.0509 HP-UX Enterprise Operating Environment Component

HPUXBase64 B.11.11 HP-UX 64-bit Base OS

HPUXBaseAux B.11.11.0509 HP-UX Base OS Auxiliary

HWEnable11i B.11.11.0509.430 Hardware Enablement Patches for HP-UX 11i v1, September 2005

J4269AA B.04.00.02 LDAP-UX Integration

Java15JDK 1.5.0.01.01 Java 1.5 JDK for HP-UX

Java15JDKadd 1.5.0.01.01 Java 1.5 JDK -AA addon for HP-UX

Java15JRE 1.5.0.01.01 Java 1.5 JRE for HP-UX

Java15JREadd 1.5.0.01.01 Java 1.5 JRE -AA addon for HP-UX

NPar B.11.11.01.04.01.01 nPartition Provider - HP-UX

OpenSSL A.00.09.07e.012 Secure Network Communications Protocol

RAID-00 B.11.11.01 PCI RAID; Supptd HW=A5856A

RAID-01 B.11.11.05 RAID SA; Supptd HW=A7143A/A9890A/A9891A

T1456AA 1.4.2.09.02 Java2 1.4 SDK for HP-UX

T1456AAaddon 1.4.2.09.02 Java2 1.4 SDK -AA addon for HP-UX

T1457AA 1.4.2.09.02 Java2 1.4 RTE for HP-UX

T1457AAaddon 1.4.2.09.02 Java2 1.4 RTE -AA addon for HP-UX

T1471AA A.04.00.002 HP-UX Secure Shell

bad-box

GOLDAPPS11i B.11.11.0406.5 Gold Applications Patches for HP-UX 11i v1, June 2004

GOLDBASE11i B.11.11.0406.5 Gold Base Patches for HP-UX 11i v1, June 2004

GOLDBASE11i B.11.11.0112.6 Gold Base Patches for HP-UX 11i, December 2001

GigEther-00 B.11.11.19 PCI/HSC GigEther;Supptd HW=A4926A/A4929A/A4924A/A4925A;SW=J1642AA

GigEther-01 B.11.11.15 PCI/PCI-X GigEther;Supptd HW=A6794A/A6825A/A6847A/A9782A/A9784A/A7109A

HPUX11i-OE-Ent B.11.11.0406 HP-UX Enterprise Operating Environment Component

HPUXBase64 B.11.11 HP-UX 64-bit Base OS

HPUXBaseAux B.11.11.0406 HP-UX Base OS Auxiliary

HWEnable11i B.11.11.0406.4 Hardware Enablement Patches for HP-UX 11i v1, June 2004

IEther-00 B.11.11.05 PCI/PCI-X IEther;Supptd HW=A7011A/A7012A/AB352A

OpenSSL A.00.09.07-d.002 Secure Network Communications Protocol

RAID-00 B.11.11.01 PCI RAID; Supptd HW=A5856A

T1455AA 1.3.1.13.01 Java2 1.3 Netscape Plugin for HP-UX

T1456AA 1.4.2.02.01 Java2 1.4 SDK for HP-UX

T1456AAaddon 1.4.2.02.01 Java2 1.4 SDK -AA addon for HP-UX

T1457AA 1.4.2.02.01 Java2 1.4 RTE for HP-UX

T1457AAaddon 1.4.2.02.01 Java2 1.4 RTE -AA addon for HP-UX

T1458AA 1.4.2.02.01 Java2 1.4 Netscape Plugin for HP-UX

T1471AA A.03.71.000 HP-UX Secure Shell

5. Kernel param differences between boxes:

# Differences in kernel params / settings between bad-box and test box

< ME VALUE BOOT MIN-MAX UNITS FLAGS

< bufpages 52427 - 0- Pages -

< create_fastlinks 1 - - -

< dbc_max_pct 25 - - -

< maxfiles 1024 - 30-2048 -

< maxfiles_lim 2048 - 30-2048 -

< maxswapchunks 4096 - 1-16384 -

< maxuprc 200 - 3- -

< msgmap 1023 - 3- -

< nbuf 190740 - 0- -

< ncallout 2064 - 6- -

< nfile 15594 - 14- -

< ninode 20000 - 14- -

< nproc 2052 - 10- -

< npty 512 - 1- -

< nstrpty 100 - - -

< semmap 1023 - 4- -

< shmmni 1024 - 3-1024 -

< swapmem_on 0 - - -

$

# Differences other way around.

> NAME VALUE BOOT MIN-MAX UNITS FLAGS

> bufpages 250002 - 0- Pages -

> create_fastlinks 0 - - -

> dbc_max_pct 50 - - -

> maxfiles 60 - 30-2048 -

> maxfiles_lim 1024 - 30-2048 -

> maxswapchunks 2048 - 1-16384 -

> maxuprc 75 - 3- -

> msgmap 39 - 3- -

> nbuf 222564 - 0- -

> ncallout 515 - 6- -

> nfile 501700 - 14- -

> ninode 476 - 14- -

> nproc 276 - 10- -

> npty 60 - 1- -

> nstrpty 60 - - -

> semmap 63 - 4- -

> shmmni 200 - 3-1024 -

> swapmem_on 1 - - -

6. truss output on the hanging process shows it issuing many nanosleep calls (never only 1, always multiple, even when it's not hanging). truss output on the running process on the test box shows nanosleeps but only one at a time - never > 1.

7. If the hang has happened while inside a database transaction, SYBASE reports that nothing is happening on its end while the 'client' (JAVA) end reports waits on I/O (using machine monitoring tools).

Any help appreciated!

[11485 byte] By [JNovice.classa] at [2007-11-27 10:20:53]
# 1

This was solved. The HPUX 'had' process (a few patches behind) had gone rogue and was eating up hundreds of Mb of memory as well as upsetting the box in general. Furthermore, the box is low on memory.

The JDBC is still working slowly compared with the UAT box looking at its own database but that's a separate issue.

JNovice.classa at 2007-7-28 17:06:19 > top of Java-index,Core,Monitoring & Management...