Sol10, errors with SAN disks while MPxIO turned on
I've got a pair of T2000s hooked up to an IBM FASTT500 SAN, each with a dual-port Emulex card. One machine was hooked up by the Sun technician... it talks to the SAN without issue, with MPxIO turned on. I can add and remove disks (though not resize existing disks, but that's not a big deal) and the system can read and write to them just fine.
The second system I'm trying to set up myself. It has a working fibre connection, and the system can talk to the SAN-provided disks without any kind of problem, so long as I have MPxIO turned off.
I've done the "stmsboot -e" step, and that went okay. The difference occurs in scsi_vhci.conf, with these lines:
device-type-scsi-options-list =
"IBM3552","symmetric-option";
symmetric-option = 0x1000000;
When commented out, MPxIO is obviously turned off and I can talk to the disks just fine. Uncomment those lines, do a reconfiguration reboot, and the multipath devices show up fine, as you'd expect. Here's my luxadm probe -p and cfgadm -al output:
# luxadm probe -p
No Network Array enclosures found in /dev/es
Found Fibre Channel device(s):
Node WWN:200600a0b80c59c1 Device Type:Disk device
Logical Path:/dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0s2
Physical Path:
/devices/scsi_vhci/ssd@g600a0b80000c59c10000001345212a6f:c,raw
Node WWN:200600a0b80c59c1 Device Type:Disk device
Logical Path:/dev/rdsk/c8t600A0B80000C59C100000012452127CBd0s2
Physical Path:
/devices/scsi_vhci/ssd@g600a0b80000c59c100000012452127cb:c,raw
Node WWN:200600a0b80c59d6 Device Type:Disk device
Logical Path:/dev/rdsk/c8t600A0B80000C59550000001745212E6Ad0s2
Physical Path:
/devices/scsi_vhci/ssd@g600a0b80000c59550000001745212e6a:c,raw
Node WWN:200600a0b80c59d6 Device Type:Disk device
Logical Path:/dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0s2
Physical Path:
/devices/scsi_vhci/ssd@g600a0b80000c59550000000944ec4ec4:c,raw
cfgadm (note the "unusable" disk! How can I diagnose this?):
# cfgadm -al
Ap_Id                 Type       Receptacle  Occupant      Condition
c0                    scsi-bus   connected   configured    unknown
c0::dsk/c0t0d0        disk       connected   configured    unknown
c0::dsk/c0t1d0        disk       connected   configured    unknown
c1                    scsi-bus   connected   configured    unknown
c1::dsk/c1t0d0        CD-ROM     connected   configured    unknown
c6                    fc-fabric  connected   configured    unknown
c6::200700a0b80c59c2  disk       connected   configured    unknown
c6::200700a0b80c59d7  disk       connected   configured    unusable
c7                    fc-fabric  connected   configured    unknown
c7::200600a0b80c59c2  disk       connected   configured    unknown
c7::200600a0b80c59d7  disk       connected   configured    unusable
usb0/1                unknown    empty       unconfigured  ok
usb0/2                unknown    empty       unconfigured  ok
usb1/1.1              unknown    empty       unconfigured  ok
usb1/1.2              unknown    empty       unconfigured  ok
usb1/1.3              unknown    empty       unconfigured  ok
usb1/1.4              unknown    empty       unconfigured  ok
usb1/2                unknown    empty       unconfigured  ok
However, trying to use them at all causes the following:
# format
Searching for disks...done
c8t600A0B80000C59C10000001345212A6Fd0: configured with capacity of 40.00GB
c8t600A0B80000C59C100000012452127CBd0: configured with capacity of 60.00GB
c8t600A0B80000C59550000000944EC4EC4d0: configured with capacity of 49.99GB
AVAILABLE DISK SELECTIONS:
0. c0t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@780/pci@0/pci@9/scsi@0/sd@0,0
1. c0t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@780/pci@0/pci@9/scsi@0/sd@1,0
2. c8t600A0B80000C59C10000001345212A6Fd0 <IBM-3552-0520 cyl 20478 alt 2 hd 64 sec 64>
/scsi_vhci/ssd@g600a0b80000c59c10000001345212a6f
3. c8t600A0B80000C59C100000012452127CBd0 <IBM-3552-0520 cyl 30718 alt 2 hd 64 sec 64>
/scsi_vhci/ssd@g600a0b80000c59c100000012452127cb
4. c8t600A0B80000C59550000000944EC4EC4d0 <IBM-3552-0520 cyl 25597 alt 2 hd 64 sec 64>
/scsi_vhci/ssd@g600a0b80000c59550000000944ec4ec4
Specify disk (enter its number): 4
selecting c8t600A0B80000C59550000000944EC4EC4d0
[disk formatted]
Disk not labeled. Label it now? y
Warning: error writing VTOC.
Illegal request during read
ASC: 0x94   ASCQ: 0x1
Warning: error reading backup label.
Illegal request during read
ASC: 0x94   ASCQ: 0x1
Warning: error reading backup label.
Warning: no backup labels
Write label failed
The disk is unwritable. Comment out the scsi_vhci.conf entry for the SAN, reboot, and I can format/label the same device just fine.
Anyone know where to start looking for this problem? I've compared settings with the *working* T2000, and I can't find any differences.
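As a starting point for the "How can I diagnose this?" question, the flagged attachment points can be pulled out of the cfgadm listing mechanically. This is a minimal Python sketch, not a Sun tool: it assumes whitespace-separated `cfgadm -al` output pasted in as text, and the function name and the trimmed sample are made up for illustration. It treats `failing`, `failed` and `unusable` as the conditions worth flagging.

```python
# Conditions that cfgadm reports for an attachment point that needs attention.
BAD_CONDITIONS = {"failing", "failed", "unusable"}


def flagged_attachment_points(cfgadm_text):
    """Return (Ap_Id, Condition) pairs whose Condition column is a bad one."""
    flagged = []
    for line in cfgadm_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 5 or fields[0] == "Ap_Id":
            continue
        ap_id, condition = fields[0], fields[-1]
        if condition in BAD_CONDITIONS:
            flagged.append((ap_id, condition))
    return flagged


# Trimmed copy of the listing above, with the columns separated by spaces.
SAMPLE = """\
Ap_Id                 Type       Receptacle  Occupant      Condition
c6                    fc-fabric  connected   configured    unknown
c6::200700a0b80c59c2  disk       connected   configured    unknown
c6::200700a0b80c59d7  disk       connected   configured    unusable
c7                    fc-fabric  connected   configured    unknown
c7::200600a0b80c59c2  disk       connected   configured    unknown
c7::200600a0b80c59d7  disk       connected   configured    unusable
usb0/1                unknown    empty       unconfigured  ok
"""

if __name__ == "__main__":
    for ap_id, condition in flagged_attachment_points(SAMPLE):
        print(ap_id, condition)
```

Both flagged entries belong to the same array controller (the ...59d7 ports), once through c6 and once through c7, which suggests looking at that controller rather than at a single HBA port.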
# 1
I'm pretty sure the IBM FastT500 is not a symmetric array. You shouldn't be editing the conf file to specify an array as symmetric when it isn't.
Make sure you have the latest patches for mpxio/scsi_vhci, remove the setting in the conf file, and have IBM tell you what bits you need to set to make it work with mpxio.
# 2
Well, part of my confusion is that the other T2000 is up and running fine, as set up by the Sun guy, without any special tweaks from IBM.
In fact, I can remap a disk from the failing machine to the functional machine and it can be read/written to without an issue. Put it back on the other and it starts failing again.
I should mention that these machines are using RDAC to talk to the array.
Edit: At least, I THOUGHT they used RDAC. It was just pointed out to me that RDAC is IBM's version of multipathing.
I'm getting more and more confused. I've obviously got SOME kind of failover going on on the older T2000. Yet it's probably not RDAC, and might not even be MPxIO.
# 3
When you did a luxadm probe, there appear to be four LUNs. One in particular is:
Node WWN:200600a0b80c59d6 Device Type:Disk device
Logical Path:/dev/rdsk/c8t600A0B80000C59550000001745212E6Ad0s2
Physical Path:
/devices/scsi_vhci/ssd@g600a0b80000c59550000001745212e6a:c,raw
This does not show up when doing a format, which is very odd. Just for interest's sake, can you do the following and send it:
luxadm -e dump_map /dev/cfg/c6
and
luxadm -e dump_map /dev/cfg/c7
Also, do a:
cfgadm -o show_FCP_dev -al c6
and
cfgadm -o show_FCP_dev -al c7
It's like you are missing one path of the multipathing from the array. Would anyone have set up LUN masking on one of the ports so only the other T2000 can see the LUNs?
Are you using a switch in between the array and the T2000? If so, perhaps one of the ports has a wrong setting?
I seriously doubt that mpxio is the cause of this especially if you have the exact same config files on the T2000 including sd.conf.. right?
By changing the scsi_vhci.conf, the system is not trying to use multipathing so the one good lun is there but where is that fourth lun?
Stephen
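The "four LUNs in luxadm, three in format" observation can be checked without eyeballing 32-digit names. A small Python sketch, assuming both command outputs have been saved as text; the function names are invented for this example, and the regular expression only assumes the `cNt<32 hex digits>dN` naming seen in the outputs earlier in the thread.

```python
import re

# MPxIO device names in this thread look like c8t<32 hex digit GUID>d0.
DEVICE_RE = re.compile(r"c\d+t([0-9A-Fa-f]{32})d\d+")


def guids(text):
    """Collect the set of LUN GUIDs (upper-cased) mentioned in some output."""
    return {m.group(1).upper() for m in DEVICE_RE.finditer(text)}


def missing_from_format(probe_text, format_text):
    """GUIDs that luxadm probe reports but format does not list."""
    return sorted(guids(probe_text) - guids(format_text))


# Logical paths copied from the luxadm probe -p output in the first post.
PROBE = """\
Logical Path:/dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0s2
Logical Path:/dev/rdsk/c8t600A0B80000C59C100000012452127CBd0s2
Logical Path:/dev/rdsk/c8t600A0B80000C59550000001745212E6Ad0s2
Logical Path:/dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0s2
"""

# Device lines copied from the format output in the first post.
FORMAT = """\
c8t600A0B80000C59C10000001345212A6Fd0: configured with capacity of 40.00GB
c8t600A0B80000C59C100000012452127CBd0: configured with capacity of 60.00GB
c8t600A0B80000C59550000000944EC4EC4d0: configured with capacity of 49.99GB
"""

if __name__ == "__main__":
    for guid in missing_from_format(PROBE, FORMAT):
        print("in luxadm probe but not in format:", guid)
```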
# 4
> When you did a luxadm probe, there appears to be four
> luns. One in particular is:
>
> Node WWN:200600a0b80c59d6 Device Type:Disk device
> Logical Path:/dev/rdsk/c8t600A0B80000C59550000001745212E6Ad0s2
> Physical Path:
> /devices/scsi_vhci/ssd@g600a0b80000c59550000001745212e6a:c,raw
>
> This does not show up when doing a format which is
> very odd. Just for interests sake, can you do the
> following and send it:
>
> luxadm -e dump_map /dev/cfg/c6
# luxadm -e dump_map /dev/cfg/c6
Pos  Port_ID  Hard_Addr  Port WWN          Node WWN          Type
0    21000    0          200700a0b80c59c2  200600a0b80c59c1  0x0  (Disk device)
1    21100    0          200700a0b80c59d7  200600a0b80c59d6  0x0  (Disk device)
2    41100    0          10000000c955c5f0  20000000c955c5f0  0x1f (Unknown Type,Host Bus Adapter)
> luxadm -e dump_map /dev/cfg/c7
# luxadm -e dump_map /dev/cfg/c7
Pos  Port_ID  Hard_Addr  Port WWN          Node WWN          Type
0    10000    0          200600a0b80c59c2  200600a0b80c59c1  0x0  (Disk device)
1    10100    0          200600a0b80c59d7  200600a0b80c59d6  0x0  (Disk device)
2    20300    0          10000000c955c5ef  20000000c955c5ef  0x1f (Unknown Type,Host Bus Adapter)
> cfgadm -o show_FCP_dev -al c6
# cfgadm -o show_FCP_dev -al c6
Ap_Id                   Type       Receptacle  Occupant    Condition
c6                      fc-fabric  connected   configured  unknown
c6::200700a0b80c59c2,0  disk       connected   configured  unknown
c6::200700a0b80c59c2,1  disk       connected   configured  unknown
c6::200700a0b80c59d7,0  disk       connected   configured  unusable
c6::200700a0b80c59d7,1  disk       connected   configured  unknown
> cfgadm -o show_FCP_dev -al c7
# cfgadm -o show_FCP_dev -al c7
Ap_Id                   Type       Receptacle  Occupant    Condition
c7                      fc-fabric  connected   configured  unknown
c7::200600a0b80c59c2,0  disk       connected   configured  unknown
c7::200600a0b80c59c2,1  disk       connected   configured  unknown
c7::200600a0b80c59d7,1  disk       connected   configured  unknown
I see what you mean... c7::200600a0b80c59d7,0 appears to be missing.
Wouldn't that just affect the one disk, though?
> from the array. Would anyone have setup lun masking
> on one of the ports so only the other T2000 can see
> the luns?
It's possible. I'll ask the SAN administrators to check.
> Are you using a switch in between the array and the
> T2000? If so, perhaps one of the ports has a wrong
Yes, two switches, actually. Originally the old T2000 was configured with two ports on a single switch (since those were all that were free), and we thought that might be a problem... but after we moved its second port to the same switch the new T2000 is configured on, it stayed healthy.
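Since two switches are involved, the Port_ID column of the dump_map output can hint at which switch each port is logged into. In a Fibre Channel fabric the 24-bit port address is made up of a domain byte (normally one per switch), an area byte and a port byte. The Python sketch below decodes that layout. It is only an illustration: the function name is invented, and the sample values assume that the leading run of digits in each dump_map row (for example 0210000) splits into Pos 0, Port_ID 21000 and Hard_Addr 0, which is a reading of the pasted output rather than something the thread confirms.

```python
def decode_port_id(port_id_hex):
    """Split a 24-bit FC fabric address (hex string) into domain/area/port."""
    value = int(port_id_hex, 16)
    return {
        "domain": (value >> 16) & 0xFF,  # usually identifies the switch
        "area": (value >> 8) & 0xFF,     # usually identifies the switch port
        "port": value & 0xFF,            # AL_PA on loops, 0 for fabric ports
    }


# Assumed Port_ID values, read from the dump_map output for c6 and c7.
ASSUMED_PORT_IDS = {
    "c6": ["21000", "21100", "41100"],
    "c7": ["10000", "10100", "20300"],
}

if __name__ == "__main__":
    for controller, port_ids in ASSUMED_PORT_IDS.items():
        for port_id in port_ids:
            print(controller, port_id, decode_port_id(port_id))
```

If that reading of the columns is right, the HBA and the array ports it sees report different domain values on both c6 and c7, which would fit traffic crossing from one switch to another.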
> I seriously doubt that mpxio is the cause of this
> especially if you have the exact same config files on
> the T2000 including sd.conf.. right?
Exact same sd.conf on both machines, yes.
One thing I've noticed... stmsboot -L on the working T2000 produces no output at all. However, on the new machine, I get the following:
# stmsboot -L
non-STMS device name            STMS device name
/dev/rdsk/c6t200700A0B80C59D7d1 /dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0
/dev/rdsk/c6t200700A0B80C59C2d1 /dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0
/dev/rdsk/c6t200700A0B80C59C2d0 /dev/rdsk/c8t600A0B80000C59C100000012452127CBd0
/dev/rdsk/c7t200600A0B80C59D7d1 /dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0
/dev/rdsk/c7t200600A0B80C59C2d1 /dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0
/dev/rdsk/c7t200600A0B80C59C2d0 /dev/rdsk/c8t600A0B80000C59C100000012452127CBd0
... which seems odd.
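To read the stmsboot -L mapping at a glance, the rows can be grouped by STMS device so that the number of paths per LUN is visible. A minimal Python sketch, assuming the two-column output above has been saved as text; the function name is made up for illustration.

```python
from collections import defaultdict


def paths_per_device(stmsboot_text):
    """Map each STMS device name to the list of non-STMS paths behind it."""
    paths = defaultdict(list)
    for line in stmsboot_text.splitlines():
        fields = line.split()
        # Data rows are exactly two /dev/rdsk/... columns; the header is not.
        if len(fields) == 2 and all(f.startswith("/dev/rdsk/") for f in fields):
            non_stms, stms = fields
            paths[stms].append(non_stms)
    return dict(paths)


# The stmsboot -L output from the new T2000, as posted above.
SAMPLE = """\
non-STMS device name            STMS device name
/dev/rdsk/c6t200700A0B80C59D7d1 /dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0
/dev/rdsk/c6t200700A0B80C59C2d1 /dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0
/dev/rdsk/c6t200700A0B80C59C2d0 /dev/rdsk/c8t600A0B80000C59C100000012452127CBd0
/dev/rdsk/c7t200600A0B80C59D7d1 /dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0
/dev/rdsk/c7t200600A0B80C59C2d1 /dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0
/dev/rdsk/c7t200600A0B80C59C2d0 /dev/rdsk/c8t600A0B80000C59C100000012452127CBd0
"""

if __name__ == "__main__":
    for stms, non_stms in paths_per_device(SAMPLE).items():
        print(stms, "has", len(non_stms), "path(s)")
```

In this output three LUNs each have one path through c6 and one through c7, and the fourth LUN from luxadm probe (the one ending in 1745212E6A) has no mapping at all.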
# 5
When I do a luxadm probe -p now, only three LUNs show up.
# luxadm probe -p
No Network Array enclosures found in /dev/es
Found Fibre Channel device(s):
Node WWN:200600a0b80c59c1 Device Type:Disk device
Logical Path:/dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0s2
Physical Path:
/devices/scsi_vhci/ssd@g600a0b80000c59c10000001345212a6f:c,raw
Node WWN:200600a0b80c59c1 Device Type:Disk device
Logical Path:/dev/rdsk/c8t600A0B80000C59C100000012452127CBd0s2
Physical Path:
/devices/scsi_vhci/ssd@g600a0b80000c59c100000012452127cb:c,raw
Node WWN:200600a0b80c59d6 Device Type:Disk device
Logical Path:/dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0s2
Physical Path:
/devices/scsi_vhci/ssd@g600a0b80000c59550000000944ec4ec4:c,raw
... and this might be why:
Oct 3 15:31:44 TST genunix: [ID 408114 kern.info] /scsi_vhci/ssd@g600a0b80000c59550000001745212e6a (ssd3) offline
Oct 3 15:31:44 TST genunix: [ID 834635 kern.info] /scsi_vhci/ssd@g600a0b80000c59550000001745212e6a (ssd3) multipath status: failed, path /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,emlxs@2,1/fp@0,0 (fp1) to target address: w200600a0b80c59d7,0 is offline Load balancing: round-robin
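The second message carries the useful details: which LUN, which HBA port, and which target port and LUN number went offline. A Python sketch for pulling those fields out of such lines follows. The regular expression is written against the one message shown here, so treat the exact message layout as an assumption; the function name is invented for the example.

```python
import re

# Pattern modelled on the single "multipath status" message quoted above.
MULTIPATH_RE = re.compile(
    r"(?P<device>/scsi_vhci/\S+) \((?P<instance>\w+)\) "
    r"multipath status: (?P<status>\w+), "
    r"path (?P<path>\S+) \((?P<port>\w+)\) "
    r"to target address: w(?P<wwn>[0-9a-f]{16}),(?P<lun>[0-9a-f]+) "
    r"is (?P<state>\w+)"
)


def parse_multipath_message(line):
    """Return the fields of a multipath status message, or None if no match."""
    match = MULTIPATH_RE.search(line)
    return match.groupdict() if match else None


SAMPLE_LINE = (
    "Oct 3 15:31:44 TST genunix: [ID 834635 kern.info] "
    "/scsi_vhci/ssd@g600a0b80000c59550000001745212e6a (ssd3) "
    "multipath status: failed, path "
    "/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,emlxs@2,1/fp@0,0 (fp1) "
    "to target address: w200600a0b80c59d7,0 is offline "
    "Load balancing: round-robin"
)

if __name__ == "__main__":
    info = parse_multipath_message(SAMPLE_LINE)
    print(info["instance"], info["status"], info["wwn"], info["lun"], info["state"])
```

The target in this message, port 200600a0b80c59d7 LUN 0, is the same c7 entry that was noted as missing from the cfgadm -o show_FCP_dev output in the previous post.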
# 6
I've managed to reproduce the problem with MPxIO turned *off*, and I never thought I'd be so happy to induce a failure.
The system can talk to disks on either path. However, you CAN'T talk to the same disk on ANOTHER path after having done so. It seems related to whatever path you pick first... there's some kind of persistent binding going on, either within the switches or on the system itself.
I'm still investigating, but at least I've got something less complex to fix now. :)
# 7
I have a rather complex environment at work where I have storage and switches and we have outsourced systems as well. I have no problems doing stuff to what I manage, but we have never-ending issues with LUNs presented from the outsourced storage. It took me over a month to get what I wanted from the outsourced storage. LUN masking and only assigning LUNs to one path caused me major headaches, as I had to prove to the outsourcer that my systems were not at fault and it was their technical people who needed to be more technical.
So, I have learnt an awful lot about troubleshooting. The oddest of all was the Cisco MDS switch mistakenly identifying the port when in auto mode.
Gone are the days of easy storage. At least you have something to work with now.
Stephen
# 8
The SHARK - Last I checked - is an asymmetric array. You're basically seeing the LUN on both paths and when you access the LUN on the alternate path you're causing a LUN failover from one controller to the other. (Depends on how you cable it but that is probably it.)
Really - Take the line out of the conf file. Simply enable mpxio and, if the SHARK is set up correctly per IBM's instructions - mpxio will discover it just fine and dandy. No conf lines required.
# 9
Hi,
the FastT500 (i.e. IBM 3552) is an asymmetric device.
If the file /kernel/drv/scsi_vhci.conf contains the line
"IBM3552", "symmetric-option";
you are telling Solaris to treat it as an active-active device.
=> The above file says how MPxIO works; the file
/kernel/drv/fp.conf
states whether MPxIO is enabled or not. The line
mpxio-disable="no";
says MPxIO is active. The file is altered by the stmsboot -e or stmsboot -d command.
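Both settings can be checked side by side, which also makes it easy to compare the two T2000s. The following Python sketch works on the text of the two files; the function names are made up for illustration, it only looks at a global `mpxio-disable` line (not per-port entries), and it assumes `#` starts a comment in these .conf files.

```python
import re


def strip_comments(conf_text):
    """Drop everything after '#' on each line."""
    return "\n".join(line.split("#", 1)[0] for line in conf_text.splitlines())


def mpxio_disable_value(fp_conf_text):
    """Return 'yes' or 'no' from a global mpxio-disable line, else None."""
    for line in strip_comments(fp_conf_text).splitlines():
        match = re.match(r'\s*mpxio-disable\s*=\s*"(yes|no)"\s*;', line)
        if match:
            return match.group(1)
    return None


def symmetric_overrides(vhci_conf_text):
    """Return device strings tagged 'symmetric-option' in scsi_vhci.conf text."""
    body = strip_comments(vhci_conf_text)
    match = re.search(r"device-type-scsi-options-list\s*=\s*(.*?);", body, re.S)
    if not match:
        return []
    strings = re.findall(r'"([^"]*)"', match.group(1))
    pairs = zip(strings[0::2], strings[1::2])  # (device string, option name)
    return [device for device, option in pairs if option == "symmetric-option"]


# The entry from the first post, once active and once commented out.
VHCI_ACTIVE = '''\
device-type-scsi-options-list =
"IBM3552","symmetric-option";
symmetric-option = 0x1000000;
'''
VHCI_COMMENTED = "\n".join("#" + l for l in VHCI_ACTIVE.splitlines())
FP_CONF = 'mpxio-disable="no";\n'

if __name__ == "__main__":
    print("mpxio-disable:", mpxio_disable_value(FP_CONF))
    print("symmetric overrides:", symmetric_overrides(VHCI_ACTIVE))
```

Running the same check against the files from the working T2000 would show whether it carries the symmetric override at all.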