autofs freezes nfs
I am running a cluster of 329 x86 systems with Solaris 10 x86 (118855-36 and
Generic_125101-06). There are two networks, nge0 and e1000g0: nge0 carries
the system services and e1000g0 carries interprocess communication.
Since Tuesday of this week, one large job (256 CPUs, with regular input and output
of about 300 GBytes every 9 hours) has been failing because a random node freezes on the NFS
mount to the large file store (Solaris 10 x86; NFSv4). All the NFS mounts are managed
by autofs. On the node where NFS access to the data system freezes, stopping autofs
makes the NFS mount accessible again.
I have searched the databases, installed the very latest recommended patchset, and installed the latest e1000g driver, 125121-01, but nothing helps.
This job had run for the last 6 weeks without major interruptions. But on Tuesday
I installed 125121-01, backed it out again, and have since re-installed it. The problem
started with the installation of 125121-01, although that patch should have had no effect, as
the autofs traffic goes over the nge0 interface.
Anyone out there who might have an idea?
The load on the NFS file server has not changed. In any case, the NFS file server should not
be affected by autofs.
# 1
I have more information. Running autofs with verbose set to TRUE for both
automount and automountd, I get the following on an affected system:
May 3 21:05:18 m2261 nfs: [ID 559769 kern.info] NOTICE: [NFS4][Server: c-store2][Mntpt: /data/gimic]NFS server c-store2 ok
May 3 21:05:30 m2261 automountd[4271]: [ID 196269 daemon.error] dupreq_nonidemp: duplicate request in progress
May 3 21:08:15 m2261 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: c-store2][Mntpt: /data/gimic]NFS server c-store2 not responding; still trying
May 3 21:08:30 m2261 automountd[4271]: [ID 196269 daemon.error] dupreq_nonidemp: duplicate request in progress
May 3 21:08:48 m22
I searched the web for the keyword dupreq_nonidemp and found it
in autofs_main.c in OpenSolaris.
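For anyone who wants to check their own nodes for the same pattern, here is a minimal Python sketch that counts the messages from the excerpt above per host. It assumes the syslog layout shown in that excerpt (month, day, time, host, message); the function name tally and the event labels are made up for illustration and are not part of autofs or Solaris.

```python
import re
from collections import Counter, defaultdict

# Assumed line layout: "<Mon> <day> <HH:MM:SS> <host> <message>"
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+(\S+)\s+(.*)$")

# Substrings copied from the log excerpt; the labels are hypothetical.
EVENTS = {
    "dupreq_nonidemp: duplicate request in progress": "dupreq",
    "not responding; still trying": "nfs_not_responding",
}
# The "ok" notice ends with "NFS server <name> ok".
OK_RE = re.compile(r"NFS server \S+ ok$")


def tally(lines):
    """Return {host: Counter(label -> count)} for the known messages."""
    counts = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip truncated or unrelated lines
        host, msg = m.groups()
        for needle, label in EVENTS.items():
            if needle in msg:
                counts[host][label] += 1
        if OK_RE.search(msg):
            counts[host]["nfs_ok"] += 1
    return counts


# The complete lines from the excerpt, plus the truncated last one.
SAMPLE = [
    "May 3 21:05:18 m2261 nfs: [ID 559769 kern.info] NOTICE: [NFS4][Server: c-store2][Mntpt: /data/gimic]NFS server c-store2 ok",
    "May 3 21:05:30 m2261 automountd[4271]: [ID 196269 daemon.error] dupreq_nonidemp: duplicate request in progress",
    "May 3 21:08:15 m2261 nfs: [ID 733954 kern.info] NOTICE: [NFS4][Server: c-store2][Mntpt: /data/gimic]NFS server c-store2 not responding; still trying",
    "May 3 21:08:30 m2261 automountd[4271]: [ID 196269 daemon.error] dupreq_nonidemp: duplicate request in progress",
    "May 3 21:08:48 m22",
]

for host, per_host in tally(SAMPLE).items():
    print(host, dict(per_host))
```

To run it against a real log, pass an open file instead of SAMPLE, for example tally(open(path)) where path points at the node's syslog file.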
This is a very serious issue; if any system developers are reading this, could I please urge them to take action.
It might not affect most sites, but heavy I/O seems to trigger it. In our case, 300 GBytes are read by groups of 4-8 systems at a time
over ~1.5 hours.
The NFS server is also Solaris 10 x86-64.
# 3
Thank you for your reply. I have applied 125101-07 to all my cluster nodes.
In the meantime I have also updated the fileserver to the latest recommended
patchset and installed 125101-07 on the fileserver.
Since I updated the fileserver to the recommended patchset with 125101-06,
the problem has disappeared, although the user activity has been identical.
I will keep an eye on the situation and will post my experiences here.