Guest domain hangs with SAN devices

I have a T2000 set up so that the primary domain is presented with a number of LUNs from a Clariion, which I use as boot and data devices for the guest domains. This worked for a while, but the guest domains now hang regularly. The primary domain is fine, but a hung guest domain needs 2-3 reboots to become usable again, and then might last only 20 minutes before requiring another reboot.

Sometimes it also seems as if the entire VDS config is stuck, even after rebooting the control/service domain. For example, even after unbinding a guest domain, the disk devices don't reappear in format on the control domain.

Here is a stack trace from a core file I got of one of the guest domain hangs:

> 0t943::pid2proc

300084a53b0

> 300084a53b0::ps

S  PID  PPID  PGID  SID  UID  FLAGS       ADDR              NAME
R  943  641   943   641  101  0x4a004000  00000300084a53b0  vi

> 300084a53b0::print proc_t p_tlist

p_tlist = 0x3000744c6a0

> 0x3000744c6a0::findstack

stack pointer for thread 3000744c6a0: 2a1016f4c41

[ 000002a1016f4c41 cv_wait+0x38() ]

000002a1016f4cf1 vdc_send_request+0x2c()

000002a1016f4dc1 vdc_strategy+0x88()

000002a1016f4e91 vdev_mirror_io_start+0x1b4()

000002a1016f4f71 zil_lwb_write_start+0x20c()

000002a1016f5021 zil_commit+0x21c()

000002a1016f50d1 zfs_fsync+0xa8()

000002a1016f5181 fop_fsync+0x14()

000002a1016f5231 fdsync+0x20()

000002a1016f52e1 syscall_trap32+0xcc()

>

> 0x3000744c6a0::print kthread_t t_lwpchan

{

t_lwpchan.lc_wchan0 = 0

t_lwpchan.lc_wchan = 0x300035d76c8

}

> 0x300035d76c8::wchaninfo -v

ADDR              TYPE  NWAITERS  THREAD            PROC
00000300035d76c8  cond  1:        000003000744c6a0  vi

Here are the bindings for the domains:

solprdinfs001[/]# ldm ls-bindings

Name:primary

State: active

Flags: transition,control,vio service

OS:

Util:0.1%

Uptime: 9m

Vcpu:16

vid  pid  util  strand
0    0    0.9%  100%
1    1    0.0%  100%
2    2    0.1%  100%
3    3    0.0%  100%
4    4    0.2%  100%
5    5    0.1%  100%
6    6    0.5%  100%
7    7    0.2%  100%
8    8    0.1%  100%
9    9    0.1%  100%
10   10   0.1%  100%
11   11   0.1%  100%
12   12   0.1%  100%
13   13   0.2%  100%
14   14   0.3%  100%
15   15   0.2%  100%

Mau:4

mau cpuset (0, 1, 2, 3)

mau cpuset (4, 5, 6, 7)

mau cpuset (8, 9, 10, 11)

mau cpuset (12, 13, 14, 15)

Memory: 3968M

real-addr  phys-addr  size
0x4000000  0x4000000  3968M

Vars:reboot-command=boot

IO:pci@780 (bus_a)

pci@7c0 (bus_b)

Vldc:primary-vldc0

(HV Control channel)]

[LDC: 0x1]

[LDom primary(Domain Services channel)]

[LDC: 0x3]

[LDom primary(FMA Services channel)]

[LDC: 0xb]

[LDom ender-dev (Domain Services channel)]

[LDC: 0x11]

[LDom ipgdrpinfs001(Domain Services channel)]

Vldc:primary-vldc3

(SP channel)]

(SP channel)]

(SP channel)]

(SP channel)]

(SP channel)]

(SP channel)]

(SP channel)]

Vds:san

vdsdev: ender-boot device=/dev/dsk/c6t60060160B944130042C5D7DB5CE2DB11d0s2

vdsdev: ender-data device=/dev/dsk/c6t60060160B944130028DDB6665EE2DB11d0s2

vdsdev: ipgdrp-data device=/dev/dsk/c6t60060160B94413008072F232D4E0DB11d0s2

vdsdev: ipgdrp-boot device=/dev/dsk/c6t60060160B944130043C5D7DB5CE2DB11d0s2

[LDom ender-dev, dev-name: ender-boot]

[LDC: 0xe]

[LDom ender-dev, dev-name: ender-data]

[LDC: 0xf]

[LDom ipgdrpinfs001, dev-name: ipgdrp-boot]

[LDC: 0x18]

[LDom ipgdrpinfs001, dev-name: ipgdrp-data]

[LDC: 0x19]

Vcc:cons

[LDC: 0x10]

[LDom ender-dev, group: ender-dev, port: 2000]

[LDC: 0x1a]

[LDom ipgdrpinfs001, group: ipgdrpinfs001, port: 2001]

port-range=2000-2020

Vsw:admin-a

mac-addr=0:14:4f:fb:5b:ff

net-dev=e1000g0

[LDC: 0xc]

[LDom ender-dev, name: admin-a, mac-addr:0x144ffb8030]

[LDC: 0x12]

[LDom ipgdrpinfs001, name: admin-a, mac-addr:0x144ffa0d2e]

mode=prog,promisc

Vsw:admin-b

mac-addr=0:14:4f:f9:ab:44

net-dev=e1000g1

[LDC: 0x13]

[LDom ipgdrpinfs001, name: admin-b, mac-addr:0x144ffae5b0]

mode=prog,promisc

Vsw:backup

mac-addr=0:14:4f:fb:88:77

net-dev=e1000g2

[LDC: 0x15]

[LDom ipgdrpinfs001, name: backup, mac-addr:0x144ffb4ee2]

mode=prog,promisc

Vldcc: vldcc1 [FMA Services]

service: ldmfma

service: primary-vldc0 @ primary

[LDC: 0x4]

Vldcc: vldcc2 [SP Channel]

service: spfma

Vldcc: vldcc0 [Domain Services]

service: primary-vldc0 @ primary

[LDC: 0x2]

Vldcc: hvctl[Hypervisor Control]

service: primary-vldc0 @ primary

[LDC: 0x0]

Vcons: SP

-

Name:ipgdrpinfs001

State: active

Flags: transition

OS:

Util:27%

Uptime: 1m

Vcpu:8

vid  pid  util  strand
0    24   93%   100%
1    25   92%   100%
2    26   92%   100%
3    27   91%   100%
4    28   92%   100%
5    29   92%   100%
6    30   92%   100%
7    31   92%   100%

Mau:2

mau cpuset (24, 25, 26, 27)

mau cpuset (28, 29, 30, 31)

Memory: 2G

real-addr  phys-addr    size
0xc800000  0x17c800000  2G

Vars:nvramrc=devalias net /virtual-devices@100/channel-devices@200/network@0

boot-device=/virtual-devices@100/channel-devices@200/disk@0:a disk net

use-nvramrc?=true

Vldcc: vldcc0 [Domain Services]

service: primary-vldc0 @ primary

[LDC: 0x0]

Vnet:admin-a [LDC: 0x2]

[Peer LDom: ender-dev, mac-addr 0x144ffb8030]

mac-addr=0:14:4f:fa:d:2e

service: admin-a @ primary

[LDC: 0x1]

Vnet:admin-b

mac-addr=0:14:4f:fa:e5:b0

service: admin-b @ primary

[LDC: 0x3]

Vnet:backup

mac-addr=0:14:4f:fb:4e:e2

service: backup @ primary

[LDC: 0x4]

Vdisk: boot ipgdrp-boot@san

service: san @ primary

[LDC: 0x5]

Vdisk: data ipgdrp-data@san

service: san @ primary

[LDC: 0x6]

Vcons: [via LDC:7]

ipgdrpinfs001@cons [port:2001]

-

Name:ender-dev

State: active

Flags: transition

OS:

Util:0.4%

Uptime: 5m

Vcpu:8

vid  pid  util  strand
0    16   49%   100%
1    17   48%   100%
2    18   48%   100%
3    19   48%   100%
4    20   48%   100%
5    21   48%   100%
6    22   48%   100%
7    23   48%   100%

Mau:2

mau cpuset (16, 17, 18, 19)

mau cpuset (20, 21, 22, 23)

Memory: 2G

real-addr  phys-addr   size
0xc000000  0xfc000000  2G

Vars:nvramrc=devalias net /virtual-devices@100/channel-devices@200/network@0

boot-device=/virtual-devices@100/channel-devices@200/disk@0:a disk net

auto-boot?=false

use-nvramrc?=true

Vldcc: vldcc0 [Domain Services]

service: primary-vldc0 @ primary

[LDC: 0x0]

Vnet:admin-a [LDC: 0x5]

[Peer LDom: ipgdrpinfs001, mac-addr 0x144ffa0d2e]

mac-addr=0:14:4f:fb:80:30

service: admin-a @ primary

[LDC: 0x1]

Vdisk: boot ender-boot@san

service: san @ primary

[LDC: 0x2]

Vdisk: data ender-data@san

service: san @ primary

[LDC: 0x3]

Vcons: [via LDC:4]

ender-dev@cons [port:2000]

[7269 byte] By [amsaula] at [2007-11-27 1:57:56]
# 1
well - nice to see it maintained the formatting....
amsaula at 2007-7-12 1:34:03 > top of Java-index,Administration Tools,Logical Domains for CoolThreads Servers...
# 2
Yeah, it's a constant pain here. Are you using the Leadville drivers built into Solaris or the EMC drivers?
unixconsolea at 2007-7-12 1:34:03
# 3

Leadville drivers on Sun qla2300 cards - patched up to current recommended bundle.

I think the issue might be with the vds driver - sometimes even when you unbind the domain it still hangs onto the devices (as they are not in format and running fuser on them shows they are in use by vds).
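
To run that fuser check against every backend in one go, the device paths can be pulled out of the ldm ls-bindings output. This is only a rough sketch: the function name vdsdev_backends is made up for illustration, and it assumes the vdsdev lines look like the ones in the listing above (a "vdsdev:" prefix, then the volume name, then "=" followed by the device path).

```shell
# Hypothetical helper: read "ldm ls-bindings" output on stdin and print
# only the backend device path from each "vdsdev: ...=<path>" line.
# Assumes the line starts with optional spaces followed by "vdsdev:".
vdsdev_backends() {
    sed -n 's/^ *vdsdev: *[^=]*=//p'
}

# Possible use on the control domain (not run here):
#   ldm ls-bindings primary | vdsdev_backends | while read dev; do
#       fuser "$dev"
#   done
```

Any path that fuser still reports as busy after the owning guest has been unbound would be a candidate for a device that vds has not released.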

amsaula at 2007-7-12 1:34:03
# 4
If you are using Solaris 10 Update 3 with the required patches, it should work. I have the new Emulex PCI-E cards in my box with MPxIO turned on and haven't had any issues so far. I have 10 guest domains using SAN LUNs for boot devices.
unixconsolea at 2007-7-12 1:34:03
# 5
Do you have any vds messages in /var/adm/messages on the service domain? alex.
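
Something like the following would pull them out. It is a sketch only: the helper name vds_messages is made up, and it assumes the driver's messages contain the string "vds" somewhere on the line, which will also match unrelated lines that happen to contain those letters.

```shell
# Hypothetical helper: print every line mentioning "vds" (case-insensitive)
# from a syslog-style file, defaulting to /var/adm/messages.
vds_messages() {
    grep -i 'vds' "${1:-/var/adm/messages}"
}
```

Run it as vds_messages on the service domain, or pass a rotated log such as /var/adm/messages.0 as the argument.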
achartrea at 2007-7-12 1:34:03