autofs freezes nfs

I am running a cluster of 329 x86 systems with Solaris 10 x86 (118855-36 and Generic_125101-06). There are two networks, nge0 and e1000g0: nge0 serves the system services and e1000g0 serves interprocess communication. As of Tuesday of this week, one large job (256 CPUs, with regular input and output of about 300 GBytes every 9 hours) fails because a random node freezes on the NFS mount to the large file store (Solaris 10 x86; NFSv4). All the NFS mounts are governed by autofs. On the node where NFS access to the data system freezes, if I stop autofs, the NFS mount becomes accessible again.

I have searched the databases, installed the latest recommended patchset, and installed the latest e1000g driver patch, 125121-01, but nothing helps.

This job has run for the last 6 weeks without major interruptions. But on Tuesday I installed 125121-01, backed it out again, and have since reinstalled it. The problem started with the installation of 125121-01, although that patch should have had no effect, as autofs traffic goes over the nge0 interface.

Anyone out there who might have an idea?

The load on the NFS file server has not changed. In any case, the NFS file server should not be affected by autofs.

[1315 byte] By [lydia.hecka] at [2007-11-27 3:14:10]
# 1

I have more information: running autofs with verbose set to TRUE for both automount and automountd, I get the following on an affected system:

May 3 21:05:18 m2261 nfs: [ID 559769 kern.info] NOTICE: [NFS4][Server: c-store2][Mntpt: /data/gimic]NFS server c-store2 ok

May 3 21:05:30 m2261 automountd[4271]: [ID 196269 daemon.error] dupreq_nonidemp: duplicate request in progress

May 3 21:08:15 m2261 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: c-store2][Mntpt: /data/gimic]NFS server c-store2 not responding; still trying

May 3 21:08:30 m2261 automountd[4271]: [ID 196269 daemon.error] dupreq_nonidemp: duplicate request in progress

May 3 21:08:48 m22

I searched the web for the keyword dupreq_nonidemp and found it in autofs_main.c in OpenSolaris.
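
For anyone who wants to check how often these messages show up on a node, here is a minimal Python sketch that tallies them per host. It is only an illustration: the sample lines are copied from the log excerpt above, and the category names (`dupreq`, `nfs_not_responding`, `nfs_ok`) and the helper names `classify` and `tally` are my own, not anything from autofs.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above (the truncated last line is omitted).
SAMPLE = """\
May 3 21:05:18 m2261 nfs: [ID 559769 kern.info] NOTICE: [NFS4][Server: c-store2][Mntpt: /data/gimic]NFS server c-store2 ok
May 3 21:05:30 m2261 automountd[4271]: [ID 196269 daemon.error] dupreq_nonidemp: duplicate request in progress
May 3 21:08:15 m2261 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: c-store2][Mntpt: /data/gimic]NFS server c-store2 not responding; still trying
May 3 21:08:30 m2261 automountd[4271]: [ID 196269 daemon.error] dupreq_nonidemp: duplicate request in progress
"""

# Syslog prefix: month, day, time, host, then the rest of the message.
LINE_RE = re.compile(r"^(\w{3})\s+(\d+)\s+(\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")


def classify(message):
    """Map a message to a hypothetical category name, or None if not of interest."""
    if "dupreq_nonidemp" in message:
        return "dupreq"
    if "not responding" in message:
        return "nfs_not_responding"
    if message.rstrip().endswith(" ok"):
        return "nfs_ok"
    return None


def tally(lines):
    """Count interesting messages per (host, category)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not look like syslog entries
        host, message = m.group(4), m.group(5)
        kind = classify(message)
        if kind:
            counts[(host, kind)] += 1
    return counts


counts = tally(SAMPLE.splitlines())
for (host, kind), n in sorted(counts.items()):
    print(host, kind, n)
```

On a real node the same `tally` function could be fed the lines of the system log file instead of `SAMPLE`; the file location depends on the syslog configuration, so treat any path as an assumption.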

This is a very serious issue, and if system developers are watching this, could I please urge them to take some action.

It might not affect most sites, but heavy I/O is causing this in our case: 300 GBytes are read in bunches of 4-8 systems at a time over ~1.5 hours.
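
To put the I/O figures in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes decimal gigabytes and that the load is spread evenly over the ~1.5 hours and over the 4-8 systems reading at the same time; both are simplifications, not measurements.

```python
# Figures from the post: 300 GBytes read over roughly 1.5 hours.
TOTAL_BYTES = 300 * 10**9   # assumption: decimal gigabytes
DURATION_S = 1.5 * 3600     # ~1.5 hours expressed in seconds


def mb_per_s(total_bytes, seconds):
    """Average throughput in decimal megabytes per second."""
    return total_bytes / seconds / 10**6


aggregate = mb_per_s(TOTAL_BYTES, DURATION_S)
print(f"aggregate: {aggregate:.1f} MB/s")

# Assumption: 4 to 8 systems read concurrently and share the load evenly.
for readers in (4, 8):
    print(f"{readers} concurrent readers: {aggregate / readers:.1f} MB/s each")
```

These are averages only; the real traffic is probably bursty, so peak rates on the file server could be considerably higher.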

The NFS server is also Solaris 10 x86-64.

lydia.hecka at 2007-7-12 8:16:40 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2
We are seeing similar issues with autofs on SPARC with kernel patches 125100-05 and 125100-06. 125100-07 is available as of this morning.
QHS_Systema at 2007-7-12 8:16:40
# 3

Thank you for your reply. I have applied 125101-07 to all my cluster nodes. In the meantime I have also updated the fileserver to the latest recommended patchset and installed 125101-07 on the fileserver.

Since I updated the fileserver to the recommended patchset with 125101-06, the problem has disappeared, although the user activity is identical.

I will keep an eye on the situation and will post my experiences here.

lydia.hecka at 2007-7-12 8:16:40