Cluster node panic on booting

Hi

I have setup a two nodes cluster with sun cluster 3.1 u4 on sun v890+StorageTek6140.On the cluster runs oracle RAC with oracle 10g+clusterware.

When all thing finished ,I mirrored the bootdisk with SVM on the nodes,but during boot in solaris,it panic like this:

Jun 23 15:14:37 hisa ID[SUNWudlm.udlm]: [ID 795570 local0.error] Unix DLM version (2) and SUN Unix DLM library version (1): compatible.

Jun 23 15:14:37 hisa Cluster.OPS.UCMMD: [ID 525628 daemon.notice] CMM: Cluster has reached quorum.

Jun 23 15:14:37 hisa Cluster.OPS.UCMMD: [ID 377347 daemon.notice] CMM: Node hisa (nodeid = 1) is up; new incarnation number = 1182582874.

Jun 23 15:14:37 hisa Cluster.OPS.UCMMD: [ID 377347 daemon.notice] CMM: Node hisb (nodeid = 2) is up; new incarnation number = 1182582873.

Jun 23 15:14:38 hisa java[1656]: [ID 807473 user.error] pkcs11_softtoken: Keystore version failure.

Jun 23 15:15:30 hisa cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking

Jun 23 15:15:30 hisa unix: [ID 836849 kern.notice]

Jun 23 15:15:30 hisa ^Mpanic[cpu2]/thread=2a100047cc0:

Jun 23 15:15:30 hisa unix: [ID 213328 kern.notice] kstat_q_exit: qlen == 0

Jun 23 15:15:30 hisa unix: [ID 100000 kern.notice]

Jun 23 15:15:30 hisa genunix: [ID 723222 kern.notice] 000002a100047020 SUNW,UltraSPARC-IV+:kstat_q_panic+8 (300026ab150, 0, ffffffffffffffff, 2200061, 300026ab150, 5800)

Jun 23 15:15:30 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000000002 0000060001815000 0000000000000000 0000030000241b80

Jun 23 15:15:30 hisa%l4-7: 0000030000241b80 0000000000000000 0000000000000000 0000000001297400

Jun 23 15:15:31 hisa genunix: [ID 723222 kern.notice] 000002a1000470d0 md:md_kstat_done+cc (600060dda08, 60001fc5938, 0, 600060dda30, 200, 300026ab040)

Jun 23 15:15:31 hisa genunix: [ID 179002 kern.notice]%l0-3: 00000300026ab040 00000300026ab150 0000000000000009 0000000000000008

Jun 23 15:15:31 hisa%l4-7: 00000300026d5d00 0000000000000002 0000000000000008 0000000000000000

Jun 23 15:15:31 hisa genunix: [ID 723222 kern.notice] 000002a100047180 md_sp:sp_done+114 (0, 600060dda08, 0, 60001fc5938, 6000750ddf0, 704b7800)

Jun 23 15:15:31 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000200061 00000300026aaff0 000000000000000b 000000000000000a

Jun 23 15:15:31 hisa%l4-7: 0000000000004000 0000000000000000 0000000000000001 00000000704b7800

Jun 23 15:15:31 hisa genunix: [ID 723222 kern.notice] 000002a100047230 md_stripe:stripe_done+13c (4, 6000721cb38, 703c1400, 6000750f930, 60007509ce8, 60007509d40)

Jun 23 15:15:31 hisa genunix: [ID 179002 kern.notice]%l0-3: 000006000750b730 0000000000004000 0000000000000000 0000000000000001

Jun 23 15:15:31 hisa%l4-7: 0000000000000000 0000000000000000 00000000703c1400 0000000000000000

Jun 23 15:15:31 hisa genunix: [ID 723222 kern.notice] 000002a1000472e0 did:did_done+3c (60004b8b9c0, 60001236a80, 6000750b770, 6000131d280, 2200061, 0)

Jun 23 15:15:31 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000001202390 00000000018cafd8 0000000001202310 0000000000200061

Jun 23 15:15:31 hisa%l4-7: 0000000002200061 00000000fdffffff 00000000fdfffc00 000000007b666310

Jun 23 15:15:32 hisa genunix: [ID 723222 kern.notice] 000002a100047390 ssd:ssd_return_command+198 (60001236a80, 60004b8b9c0, 4, 6000131d280, 4, 4)

Jun 23 15:15:32 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000000020 00000000018cad68 00000000018cac00 000000000126ced8

Jun 23 15:15:32 hisa%l4-7: 0000000000000020 00000000018caf08 00000000018cac00 0000000000000004

Jun 23 15:15:32 hisa genunix: [ID 723222 kern.notice] 000002a100047440 ssd:ssdintr+268 (60006e0f458, 0, 0, 6000594d680, 60004b8b9c0, 60001236a80)

Jun 23 15:15:32 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000000000 0000000000000000 0000000000004000 0000060006e0f4f8

Jun 23 15:15:32 hisa%l4-7: 0000000000000000 0000000000000000 0000000000000001 0000000000000000

Jun 23 15:15:32 hisa genunix: [ID 723222 kern.notice] 000002a1000474f0 scsi_vhci:vhci_intr+7b0 (600011e8dc0, 60006e0f4b8, 600018b13e0, 0, 60001822388, 60006e0f458)

Jun 23 15:15:32 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000060001863a40 0000000000000000 0000060006e0f4b8 0000000000000000

Jun 23 15:15:32 hisa%l4-7: 0000000000000000 0000060006e0f4f8 00000600018b1284 0000000000000028

Jun 23 15:15:32 hisa genunix: [ID 723222 kern.notice] 000002a1000475d0 fcp:ssfcp_cmd_callback+64 (600018b1438, 0, 1, 813, 600018b1248, 600011c2f40)

Jun 23 15:15:32 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000000002 0000060001815000 0000000000000000 0000030000241b80

Jun 23 15:15:32 hisa%l4-7: 0000030000241b80 0000000000000000 0000000000000000 0000000001297400

Jun 23 15:15:33 hisa genunix: [ID 723222 kern.notice] 000002a100047680 qlc:ql_fast_fcp_post+178 (600018b15d8, 128ae70, 600018b1438, 60001236fc0, 60001237038, 128ae70)

Jun 23 15:15:33 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000400000 00000000018d5148 0000000000000803 0000000000000001

Jun 23 15:15:33 hisa%l4-7: 00000600018b1438 00000600018b1438 00000600018b1438 00000600018b1278

Jun 23 15:15:33 hisa genunix: [ID 723222 kern.notice] 000002a100047730 qlc:ql_24xx_status_entry+1ec (0, 300012008c0, 2a100047958, 2a10004796c, 0, 0)

Jun 23 15:15:33 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000000811 00000600018b15d8 0000000000000000 0000000000080811

Jun 23 15:15:33 hisa%l4-7: 00000000fff7ffff 0000000000000001 0000000000000001 0000000000000000

Jun 23 15:15:33 hisa genunix: [ID 723222 kern.notice] 000002a1000477e0 qlc:ql_response_pkt+248 (60001236fc0, 2a100047958, 2a10004796c, 2a100047968, 20aa, 2840)

Jun 23 15:15:33 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000000000 0000000000004000 0000000000002000 0000000000000000

Jun 23 15:15:33 hisa%l4-7: 0000000000000000 00000300012008c0 0000000000000000 0000000000000000

Jun 23 15:15:33 hisa genunix: [ID 723222 kern.notice] 000002a100047890 qlc:ql_isr+664 (60001236fc0, a2, 8000, a2, ffffffffffffffff, 60001237018)

Jun 23 15:15:34 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000002000 0000000000004000 0000060001236fd8 00000000012db3a8

Jun 23 15:15:34 hisa%l4-7: 0000000000000001 0000000000000000 0000000000000000 0000000000000003

Jun 23 15:15:34 hisa genunix: [ID 723222 kern.notice] 000002a100047970 qlc:const_seg_900000101+db4 (60001236fc0, 0, 60001236fc0, 0, 0, 60001237018)

Jun 23 15:15:34 hisa genunix: [ID 179002 kern.notice]%l0-3: 0000000000002000 0000000000004000 0000060001236fd8 00000000012db3a8

Jun 23 15:15:34 hisa%l4-7: 0000000000000001 0000000000000001 0000000000000000 00000000000001ab

Jun 23 15:15:34 hisa genunix: [ID 723222 kern.notice] 000002a100047a20 pcisch:pci_intr_wrapper+b4 (6000122f7b0, 600010b5230, 0, 0, 0, 6000136c738)

Jun 23 15:15:34 hisa genunix: [ID 179002 kern.notice]%l0-3: 00000000018d5170 00000600010f8c80 00000000018d51b8 0000000000000001

Jun 23 15:15:34 hisa%l4-7: 0000030000220220 0000060001236fc0 0000000000000000 00000000012dc158

Jun 23 15:15:34 hisa unix: [ID 100000 kern.notice]

Jun 23 15:15:34 hisa genunix: [ID 672855 kern.notice] syncing file systems...

Has any one ever met like this?

[7392 byte] By [caiciaa] at [2007-11-27 8:56:46]
# 1
operation system is solaris 10 06/11+EIS patchAfter panic,it will reboot, sometimes once,sometimes twice,the node will boot normal into system.
caiciaa at 2007-7-12 21:20:13 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2
Please log a call with Sun support. There are a couple of mentions of this error string but nothing that I can see that says it's definitely fixed by a particular patch. If you are using QFS, then it might be solved by upgrading to QFS v4.6.Regards,Tim
Tim.Reada at 2007-7-12 21:20:13 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 3

Thank you for your attation!

I would like to add some other information to this issue

Between the hosts and storage,we did not use switch but directly connnect them use fibre cables ,I don't konw if this way could bring problems and we did not use QFS either.Beside panic at booting,sometimes the messages would display this information:

Jun 26 15:32:20 hisb genunix: [ID 454863 kern.info] dump on /dev/dsk/c4t500000E0147BACB0d0s1 size 8198 MB

Jun 26 15:33:18 hisb cacao[978]: [ID 388282 daemon.warning] com.sun.cacao.ModuleManager.garbage : Cannot garbage class loader for module com.sun.cacao.snmpv3_adaptor

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group rac-rg state on node hisb change to RG_PENDING_OFFLINE

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-svm-rs state on node hisb change to R_MON_STOPPING

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-udlm-rs state on node hisb change to R_MON_STOPPING

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-framework-rs state on node hisb change to R_MON_STOPPING

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <bin/rac_framework_monitor_stop> for resource <rac-framework-rs>, resource group <rac-rg>, timeout <3600> seconds

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <bin/rac_udlm_monitor_stop> for resource <rac-udlm-rs>, resource group <rac-rg>, timeout <300> seconds

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <bin/rac_svm_monitor_stop> for resource <rac-svm-rs>, resource group <rac-rg>, timeout <300> seconds

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <bin/rac_framework_monitor_stop> completed successfully for resource <rac-framework-rs>, resource group <rac-rg>, time used: 0% of timeout <3600 seconds>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-framework-rs state on node hisb change to R_ONLINE_UNMON

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <bin/rac_svm_monitor_stop> completed successfully for resource <rac-svm-rs>, resource group <rac-rg>, time used: 0% of timeout <300 seconds>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-svm-rs state on node hisb change to R_ONLINE_UNMON

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-svm-rs state on node hisb change to R_STOPPING

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource rac-svm-rs status on node hisb change to R_FM_UNKNOWN

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource rac-svm-rs status msg on node hisb change to <Stopping>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <bin/rac_svm_stop> for resource <rac-svm-rs>, resource group <rac-rg>, timeout <300> seconds

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <bin/rac_udlm_monitor_stop> completed successfully for resource <rac-udlm-rs>, resource group <rac-rg>, time used: 0% of timeout <300 seconds>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-udlm-rs state on node hisb change to R_ONLINE_UNMON

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-udlm-rs state on node hisb change to R_STOPPING

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <bin/rac_udlm_stop> for resource <rac-udlm-rs>, resource group <rac-rg>, timeout <300> seconds

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource rac-udlm-rs status on node hisb change to R_FM_UNKNOWN

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource rac-udlm-rs status msg on node hisb change to <Stopping>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource rac-svm-rs status msg on node hisb change to <RAC framework is running>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <bin/rac_svm_stop> completed successfully for resource <rac-svm-rs>, resource group <rac-rg>, time used: 0% of timeout <300 seconds>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-svm-rs state on node hisb change to R_OFFLINE

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource rac-udlm-rs status msg on node hisb change to <RAC framework is running>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <bin/rac_udlm_stop> completed successfully for resource <rac-udlm-rs>, resource group <rac-rg>, time used: 0% of timeout <300 seconds>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-udlm-rs state on node hisb change to R_OFFLINE

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-framework-rs state on node hisb change to R_STOPPING

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <bin/rac_framework_stop> for resource <rac-framework-rs>, resource group <rac-rg>, timeout <300> seconds

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource rac-framework-rs status on node hisb change to R_FM_UNKNOWN

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource rac-framework-rs status msg on node hisb change to <Stopping>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource rac-framework-rs status msg on node hisb change to <RAC framework is running>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <bin/rac_framework_stop> completed successfully for resource <rac-framework-rs>, resource group <rac-rg>, time used: 0% of timeout <300 seconds>

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource rac-framework-rs state on node hisb change to R_OFFLINE

Jun 26 15:33:21 hisb Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group rac-rg state on node hisb change to RG_OFFLINE

Jun 26 15:33:21 hisb xntpd[568]: [ID 866926 daemon.notice] xntpd exiting on signal 15

Jun 26 15:33:23 hisb root: [ID 702911 user.error] Oracle CRSD 1099 set to stop

Jun 26 15:33:23 hisb root: [ID 702911 user.error] Oracle CRSD 1099 shutdown completed

Jun 26 15:33:23 hisb root: [ID 702911 user.error] Oracle EVMD set to stop

Jun 26 15:33:23 hisb root: [ID 702911 user.error] Oracle CSSD being stopped

Jun 26 15:33:41 hisb FIN_SVC_CTRL: [ID 702911 local0.error] Warning: Because one or more of the sun cluster userland cluster services are offline this service goes offline

Jun 26 15:33:41 hisb cl_eventlogd[843]: [ID 247336 daemon.error] Going down on signal 15.

Jun 26 15:33:43 hisb root: [ID 702911 user.error] Oracle CSSD graceful shutdown

Jun 26 15:33:44 hisb Cluster.PNM: [ID 226280 daemon.notice] PNM daemon exiting.

Regards,

Caicia

caiciaa at 2007-7-12 21:20:13 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 4

Nothing obvious in this output. I really think you need to log a call with Sun Support. They will be able to get all the information they need to help you if you can provide them with explorer dumps.

I'm wondering if it's something buried down in the SAN that is causing the problem, e.g HBA firmware. It looks like you might be booting from SAN given you dump device name.

Regards,

Tim

Tim.Reada at 2007-7-12 21:20:13 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 5
Thanks !I will do that as you say!By the way,the hosts are booting from local disks which were mirrored by SVM.Regards,Caicia
caiciaa at 2007-7-12 21:20:13 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 6
I think I get the reason.The default gateway of the nodes was 127.0.0.1,not a real gateway.After I change it,they seems works well till now and I will continue pay attention of them.Regards,Caicia
caiciaa at 2007-7-12 21:20:13 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...