Problem bringing down cluster or node.

Hi All,

I'm having an issue when I try to shut down either the entire cluster or just one node of it. I'm running Solaris 10 with Sun Cluster 3.1 on two V240s. When a node is shutting down, everything seems OK until it prints the messages shown below, and then it hangs completely: no "syncing disks" followed by the OK prompt, nothing. Any clues as to what's occurring?

- --

Oct 9 10:32:48 dione xntpd[482]: [ID 866926 daemon.notice] xntpd exiting on signal 15

Oct 9 10:32:49 dione FIN_SVC_CTRL: [ID 702911 local0.error] Warning:Because one or more of the sun cluster userland cluster services are offline this service goes offline

Oct 9 10:32:52 dione cl_eventlogd[2075]: [ID 247336 daemon.error] Going down on signal 15.

Oct 9 10:32:52 dione INITRGM: [ID 702911 local0.error] Warning: an attempt to stop or disable the svc:/system/cluster/rgm:default service was detected and ignored. A shutdown or reboot in progress is allowed to proceed as normal.

Oct 9 10:32:53 dione Cluster.PNM: [ID 226280 daemon.notice] PNM daemon exiting.

Oct 9 10:32:53 dione INITFED: [ID 702911 local0.error] Warning: an attempt to stop or disable the svc:/system/cluster/rpc-fed:default service was detected and ignored. A shutdown or reboot in progress is allowed to proceed as normal.

Oct 9 10:32:53 dione Cluster.RGM.rgmd: [ID 642220 daemon.error] There is already an instance of this daemon running

Oct 9 10:32:53 dione Cluster.RGM.fed: [ID 642220 daemon.error] There is already an instance of this daemon running

Oct 9 10:33:07 dione Cluster.PMF.pmfd: [ID 615790 daemon.notice] "cacao" Failed to stay up.

- --
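For anyone wanting to sift a longer /var/adm/messages excerpt the same way, here is a minimal sketch that splits lines of the format quoted above into fields and groups the error-level messages by the tag that logged them. The regular expression is only inferred from the lines in this post, and the names LINE_RE, parse_line and errors_by_tag are made up for the example, so treat it as a starting point rather than a complete syslog parser.

```python
import re

# Pattern inferred from the lines quoted in this thread, e.g.
# "Oct 9 10:32:53 dione Cluster.RGM.rgmd: [ID 642220 daemon.error] ..."
# It is an assumption, not an official description of the syslog format.
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<tag>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?:\s+"
    r"\[ID\s+(?P<msgid>\d+)\s+(?P<facility>\w+)\.(?P<level>\w+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def errors_by_tag(lines):
    """Group the messages logged at level 'error' by the tag that emitted them."""
    grouped = {}
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] == "error":
            grouped.setdefault(entry["tag"], []).append(entry["message"])
    return grouped


# Four of the lines from the post above, copied verbatim.
SAMPLE = [
    "Oct 9 10:32:48 dione xntpd[482]: [ID 866926 daemon.notice] xntpd exiting on signal 15",
    "Oct 9 10:32:52 dione cl_eventlogd[2075]: [ID 247336 daemon.error] Going down on signal 15.",
    "Oct 9 10:32:53 dione Cluster.PNM: [ID 226280 daemon.notice] PNM daemon exiting.",
    "Oct 9 10:32:53 dione Cluster.RGM.rgmd: [ID 642220 daemon.error] There is already an instance of this daemon running",
]

if __name__ == "__main__":
    for tag, messages in errors_by_tag(SAMPLE).items():
        print(tag, messages)
```

Grouping by tag makes it easier to see which daemons complain during the shutdown and which lines are only notices.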

I've not had this problem before with another cluster I built using V210's, Solaris 10 and Sun Cluster 3.1. I have applied all the updates, apart from the Java runtime updates - if I do update Java I receive errors about having an incompatible runtime.

I'm assuming that something isn't running, or is killed before it should be.

Thank you in advance,

Pete

[2068 byte] By [Azrael808] at [2007-11-26 10:39:19]
# 1

Looking at the logs for the rgm service reveals:

- -

[ Oct 2 16:09:04 Executing start method ("/usr/cluster/lib/svc/method/svc_rgm start") ]

[ Oct 2 16:09:05 Method "start" exited with status 0 ]

[ Oct 9 10:32:52 Stopping because service disabled. ]

[ Oct 9 10:32:52 Executing stop method ("/usr/cluster/lib/svc/method/svc_rgm stop") ]

[ Oct 9 10:32:53 Method "stop" exited with status 0 ]

[ Oct 9 10:32:53 Enabled. ]

[ Oct 9 10:32:53 Executing start method ("/usr/cluster/lib/svc/method/svc_rgm start") ]

[ Oct 9 10:32:53 Method "start" exited with status 0 ]

[ Oct 9 11:21:26 Executing start method ("/usr/cluster/lib/svc/method/svc_rgm start") ]

[ Oct 9 11:21:27 Method "start" exited with status 0 ]

- -

It looks like something re-enabled the service right after it had stopped successfully, though I'm not sure what or why.
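To spot this pattern in a longer service log, a small sketch along these lines could flag any start method that runs within a few seconds of a stop method. The bracketed line format is taken only from the excerpt above. The names parse_entry and restarts_after_stop, the 5-second window, and the placeholder year are assumptions made for the example (the log lines themselves carry no year).

```python
import re
from datetime import datetime, timedelta

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches entries such as "[ Oct 9 10:32:53 Enabled. ]" (format inferred from the excerpt).
ENTRY_RE = re.compile(
    r"^\[\s*(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\s+(?P<text>.*?)\s*\]$"
)


def parse_entry(line, year=2000):
    """Return (timestamp, text) for one log entry, or None if it does not match.

    The year is a placeholder because the log lines do not include one.
    """
    match = ENTRY_RE.match(line.strip())
    if not match or match["mon"] not in MONTHS:
        return None
    when = datetime(year, MONTHS[match["mon"]], int(match["day"]),
                    int(match["h"]), int(match["m"]), int(match["s"]))
    return when, match["text"]


def restarts_after_stop(lines, window_seconds=5, year=2000):
    """List (stop_time, start_time) pairs where a start follows a stop within the window."""
    window = timedelta(seconds=window_seconds)
    last_stop = None
    hits = []
    for line in lines:
        entry = parse_entry(line, year)
        if entry is None:
            continue
        when, text = entry
        if text.startswith("Executing stop method"):
            last_stop = when
        elif text.startswith("Executing start method") and last_stop is not None:
            if timedelta(0) <= when - last_stop <= window:
                hits.append((last_stop, when))
            last_stop = None
    return hits


# The rgm log entries from the excerpt above, copied verbatim.
SAMPLE = [
    '[ Oct 2 16:09:04 Executing start method ("/usr/cluster/lib/svc/method/svc_rgm start") ]',
    '[ Oct 2 16:09:05 Method "start" exited with status 0 ]',
    '[ Oct 9 10:32:52 Stopping because service disabled. ]',
    '[ Oct 9 10:32:52 Executing stop method ("/usr/cluster/lib/svc/method/svc_rgm stop") ]',
    '[ Oct 9 10:32:53 Method "stop" exited with status 0 ]',
    '[ Oct 9 10:32:53 Enabled. ]',
    '[ Oct 9 10:32:53 Executing start method ("/usr/cluster/lib/svc/method/svc_rgm start") ]',
    '[ Oct 9 10:32:53 Method "start" exited with status 0 ]',
    '[ Oct 9 11:21:26 Executing start method ("/usr/cluster/lib/svc/method/svc_rgm start") ]',
    '[ Oct 9 11:21:27 Method "start" exited with status 0 ]',
]

if __name__ == "__main__":
    for stopped, started in restarts_after_stop(SAMPLE):
        print("stopped at", stopped.time(), "and started again at", started.time())
```

On the entries above, only the 10:32:52 stop followed by the 10:32:53 start is flagged. The 11:21:26 start falls outside the window and so is not reported.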

Just tried rebooting the node into non-cluster mode using "reboot -- -x" and it shut down and rebooted with no problems at all. When I first observed the issue, I had tried to shut down the server using "init 0".

Even more confusing...

Azrael808 at 2007-7-7 2:50:47 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2
There is a bug going around which looks oddly similar to your experience. Did you apply any patches recently, and what happens if you use the supported form of the shutdown command (which requires -i 0, IIRC)?

HTH,

-ashu

ashu15 at 2007-7-7 2:50:47
# 3

Hi,

Sorry I haven't updated this - I've had plenty of other problems to resolve! I have just finished building another two-node cluster using two V210's, Solaris 10 and Sun Cluster 3.1. The same behaviour is exhibited - I updated both nodes to the latest patch level and ran "shutdown -i 0" on one node, and it never completed... It seems something is aware of two cluster services (rgm and rpc-fed) being killed as part of the shutdown, and somehow hangs the box.

If I can provide you more details, just tell me which logs, or which command's output you want me to post.

Even the scshutdown command hangs both nodes when they're shutting down.

Pete

Azrael808 at 2007-7-7 2:50:47
# 4
I have EXACTLY the same problem with a dual-node 3.1 u4 cluster. The Solaris version is 10 u3 BETA (aka 11/06). It never happened with 10 u1 (03/06), so I need to go back to it. What S10 version are you running?

-- leon

napobo3 at 2007-7-7 2:50:48
# 5
The problem was that the critical patches were missing. I fixed it with one command:

pca -i missing

(see http://www.par.univie.ac.at/solaris/pca/ )

Thanks to Tim for the hint.

Regards,

-- leon

ckobopoga at 2007-7-7 2:50:48
# 6

Thanks for that link - that PCA script makes it far easier to keep my servers up to date! I appear to be able to shutdown/reboot/etc either node in the cluster using the "shutdown -i [rc]" command, but the error messages that I detailed above are still output.

Never mind, I feel more comfortable being able to shut down my servers cleanly!

Thanks for the help guys.

Pete

Azrael808 at 2007-7-7 2:50:48