General Solaris 10 Discussion - Strange SAN Problem

I have a configuration with two servers and two SAN storage systems. Each server is connected to both storage systems, and we are using host-based mirroring: just a plain two-node cluster setup.

I regularly get warnings in the messages file, and now one disk has been offlined on one path from one node. All other disks are online, and the same disk is also online on the other host.

I have the following entries in messages:

Mar 24 08:53:26 MyHostA scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Mar 24 08:53:26 MyHostA /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aa4 (ssd14): Command Timeout on path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0)
Mar 24 08:53:26 MyHostA scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aa4 (ssd14):
Mar 24 08:53:26 MyHostA SCSI transport failed: reason 'timeout': retrying command
Mar 24 08:54:10 MyHostA scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Mar 24 08:54:10 MyHostA /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aae (ssd10): Command Timeout on path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0)
Mar 24 08:54:41 MyHostA scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Mar 24 08:54:41 MyHostA /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aa4 (ssd14): Command Timeout on path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0)
Mar 24 08:55:22 MyHostA scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Mar 24 08:55:22 MyHostA /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aaa (ssd11): Command Timeout on path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0)
Mar 24 08:57:58 MyHostA scsi: [ID 243001 kern.warning] WARNING: /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fcp0):
Mar 24 08:57:58 MyHostA INQUIRY to D_ID=0x1f1400 lun=0x1 failed: State:Packet Transport error, Reason:Undefined. Giving up
Mar 24 08:57:58 MyHostA scsi: [ID 243001 kern.info] /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fcp0):
Mar 24 08:57:58 MyHostA offlining lun=1 (trace=0), target=1f1400 (trace=b90101)
Mar 24 08:57:58 MyHostA genunix: [ID 834635 kern.info] /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aa6 (ssd13) multipath status: degraded, path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0) to target address: w50060e8004ebc058,1 is offline Load balancing: round-robin
Mar 24 08:57:59 MyHostA genunix: [ID 834635 kern.info] /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aa6 (ssd13) multipath status: optimal, path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0) to target address: w50060e8004ebc058,1 is online Load balancing: round-robin

I suspect a bad element in the transmission path between the servers and the SAN storage, which sometimes causes these timeouts, but I have no idea how to diagnose such a problem.

Thanks for any help

[3016 byte] By [Tom_Tigera] at [2007-11-26 23:12:06]
# 1
What kind of storage are you using? Does it support round-robin (active/active)? There might be some tunables for the ssd driver, like queue depth and so on.
/Ulf
Uffe_ba at 2007-7-10 14:09:39
# 2

Hi Ulf,

it is a Hitachi HDS:

root@NodeA:/var/adm # luxadm display /dev/rdsk/c4t60060E8004EBC0000000EBC000001AAEd0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c4t60060E8004EBC0000000EBC000001AAEd0s2
  Vendor:               HITACHI
  Product ID:           OPEN-V*4-SUN
  Revision:             5007
  Serial Num:           50 0EBC01AAE
  Unformatted capacity: 29748.750 MBytes
  Write Cache:          Enabled
  Read Cache:           Enabled
    Minimum prefetch:   0x0
    Maximum prefetch:   0x0
  Device Type:          Disk device
  Path(s):
  /dev/rdsk/c4t60060E8004EBC0000000EBC000001AAEd0s2
  /devices/scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aae:c,raw
   Controller                 /devices/pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address            50060e8004ebc058,4
    Host controller port WWN  10000000c95035f2
    Class                     primary
    State                     OFFLINE
   Controller                 /devices/pci@780/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address            50060e8004ebc048,4
    Host controller port WWN  10000000c9503557
    Class                     primary
    State                     ONLINE

On the other node, the same disk is online on both paths:

root@NodeB:~ # luxadm display /dev/rdsk/c4t60060E8004EBC0000000EBC000001AAEd0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c4t60060E8004EBC0000000EBC000001AAEd0s2
  Vendor:               HITACHI
  Product ID:           OPEN-V*4-SUN
  Revision:             5007
  Serial Num:           50 0EBC01AAE
  Unformatted capacity: 29748.750 MBytes
  Write Cache:          Enabled
  Read Cache:           Enabled
    Minimum prefetch:   0x0
    Maximum prefetch:   0x0
  Device Type:          Disk device
  Path(s):
  /dev/rdsk/c4t60060E8004EBC0000000EBC000001AAEd0s2
  /devices/scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aae:c,raw
   Controller                 /devices/pci@780/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address            50060e8004ebc048,4
    Host controller port WWN  10000000c95035a2
    Class                     primary
    State                     ONLINE
   Controller                 /devices/pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address            50060e8004ebc058,4
    Host controller port WWN  10000000c9503622
    Class                     primary
    State                     ONLINE

Tom_Tigera at 2007-7-10 14:09:39
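
With many LUNs, reading every luxadm display listing by eye gets tedious. The following is a minimal Python sketch, not part of the original posts, that pulls out the paths that are not ONLINE from one such listing. The function name offline_paths and the embedded SAMPLE text are hypothetical; the sample is an excerpt of the NodeA output above, and the parser assumes the field labels Controller, Device Address and State appear exactly as luxadm printed them here.

```python
import re

def offline_paths(luxadm_text):
    """Return (controller, device_address, state) for every path not ONLINE.

    Assumes the 'Controller', 'Device Address' and 'State' labels used in
    the luxadm display output quoted in this thread. Whitespace after a
    label is optional, so squashed copies of the output parse as well.
    """
    paths = []
    current = None
    for raw in luxadm_text.splitlines():
        line = raw.strip()
        if line.startswith("Controller"):
            # A Controller line starts the description of a new path.
            current = {"Controller": line[len("Controller"):].strip()}
            paths.append(current)
        elif current is not None:
            m = re.match(r"(Device Address|State)\s*(\S+)", line)
            if m:
                current[m.group(1)] = m.group(2)
    return [
        (p["Controller"], p.get("Device Address"), p.get("State"))
        for p in paths
        if p.get("State") != "ONLINE"
    ]

# Excerpt of the NodeA listing from this thread (illustrative input only).
SAMPLE = """\
  /devices/scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aae:c,raw
   Controller                 /devices/pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address            50060e8004ebc058,4
    Host controller port WWN  10000000c95035f2
    Class                     primary
    State                     OFFLINE
   Controller                 /devices/pci@780/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address            50060e8004ebc048,4
    Host controller port WWN  10000000c9503557
    Class                     primary
    State                     ONLINE
"""

if __name__ == "__main__":
    for controller, address, state in offline_paths(SAMPLE):
        print(state, address, controller)
```

For the NodeA excerpt this reports only the path through /pci@7c0 to target port 50060e8004ebc058, which is the same HBA path named in the Command Timeout messages.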
# 3
Hi,
Can you check in the output of
cfgadm -o show_FCP_dev -al
whether 50060e8004ebc058 is configured or not? I guess that scsi_vhci.conf and the HDS host mode are correct. Try checking or replacing the fibre cable on this path.
/BR Ulf
Uffe_ba at 2007-7-10 14:09:39
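
Before swapping the cable, it can help to confirm that the timeouts really cluster on one path. The following is a minimal Python sketch, not part of the original posts, that tallies the "Command Timeout on path" warnings per HBA path and per disk instance. The names tally_timeouts and SAMPLE are hypothetical, the sample lines are copied from the messages quoted in the first post, and the regular expression assumes each syslog record sits on a single line in that exact wording.

```python
import re
from collections import Counter

# Matches the scsi_vhci detail line, for example:
# "... (ssd14): Command Timeout on path /pci@7c0/.../fp@0,0 (fp0)"
TIMEOUT_RE = re.compile(
    r"\((?P<disk>ssd\d+)\): Command Timeout on path (?P<path>\S+) \(fp\d+\)"
)

def tally_timeouts(lines):
    """Count Command Timeout events per HBA path and per ssd instance."""
    per_path = Counter()
    per_disk = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            per_path[m.group("path")] += 1
            per_disk[m.group("disk")] += 1
    return per_path, per_disk

# Lines taken from the messages excerpt in this thread (illustrative input).
SAMPLE = """\
Mar 24 08:53:26 MyHostA /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aa4 (ssd14): Command Timeout on path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0)
Mar 24 08:53:26 MyHostA SCSI transport failed: reason 'timeout': retrying command
Mar 24 08:54:10 MyHostA /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aae (ssd10): Command Timeout on path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0)
Mar 24 08:54:41 MyHostA /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aa4 (ssd14): Command Timeout on path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0)
Mar 24 08:55:22 MyHostA /scsi_vhci/ssd@g60060e8004ebc0000000ebc000001aaa (ssd11): Command Timeout on path /pci@7c0/pci@0/pci@8/SUNW,emlxs@0/fp@0,0 (fp0)
"""

if __name__ == "__main__":
    per_path, per_disk = tally_timeouts(SAMPLE.splitlines())
    for path, count in per_path.most_common():
        print(count, path)
    for disk, count in per_disk.most_common():
        print(count, disk)
```

To run it against a real log, the call could be changed to tally_timeouts(open("/var/adm/messages")). In the excerpt quoted here, all four timeouts land on the /pci@7c0 path while three different disks are affected, which is consistent with a problem in that path (HBA, cable, switch port or array port) rather than in a single LUN.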