Sol10, errors with SAN disks while MPxIO turned on

I've got a pair of T2000s hooked up to an IBM FASTT500 SAN, each with a dual-port Emulex card. One machine was hooked up by the Sun technician... it talks to the SAN without issue, with MPxIO turned on. I can add and remove disks (though not resize existing disks, but that's not a big deal) and the system can read and write to them just fine.

The second system I'm trying to set up myself. It has a working fibre connection, and the system can talk to the SAN-provided disks without any kind of problem, so long as I have MPxIO turned off.

I've done the "stmsboot -e" step, and that went okay. The difference occurs in scsi_vhci.conf, with these lines:

device-type-scsi-options-list =

"IBM3552","symmetric-option";

symmetric-option = 0x1000000;

When commented out, MPxIO is obviously turned off and I can talk to the disks just fine. Uncomment those lines, do a reconfiguration reboot, and the multipath devices show up fine, as you'd expect. Here's my luxadm probe -p and cfgadm -al output:

# luxadm probe -p

No Network Array enclosures found in /dev/es

Found Fibre Channel device(s):

Node WWN:200600a0b80c59c1 Device Type:Disk device

Logical Path:/dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0s2

Physical Path:

/devices/scsi_vhci/ssd@g600a0b80000c59c10000001345212a6f:c,raw

Node WWN:200600a0b80c59c1 Device Type:Disk device

Logical Path:/dev/rdsk/c8t600A0B80000C59C100000012452127CBd0s2

Physical Path:

/devices/scsi_vhci/ssd@g600a0b80000c59c100000012452127cb:c,raw

Node WWN:200600a0b80c59d6 Device Type:Disk device

Logical Path:/dev/rdsk/c8t600A0B80000C59550000001745212E6Ad0s2

Physical Path:

/devices/scsi_vhci/ssd@g600a0b80000c59550000001745212e6a:c,raw

Node WWN:200600a0b80c59d6 Device Type:Disk device

Logical Path:/dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0s2

Physical Path:

/devices/scsi_vhci/ssd@g600a0b80000c59550000000944ec4ec4:c,raw

cfgadm (note the "unusable" disk! How can I diagnose this?):

# cfgadm -al

Ap_Id                  Type       Receptacle Occupant     Condition
c0                     scsi-bus   connected  configured   unknown
c0::dsk/c0t0d0         disk       connected  configured   unknown
c0::dsk/c0t1d0         disk       connected  configured   unknown
c1                     scsi-bus   connected  configured   unknown
c1::dsk/c1t0d0         CD-ROM     connected  configured   unknown
c6                     fc-fabric  connected  configured   unknown
c6::200700a0b80c59c2   disk       connected  configured   unknown
c6::200700a0b80c59d7   disk       connected  configured   unusable
c7                     fc-fabric  connected  configured   unknown
c7::200600a0b80c59c2   disk       connected  configured   unknown
c7::200600a0b80c59d7   disk       connected  configured   unusable
usb0/1                 unknown    empty      unconfigured ok
usb0/2                 unknown    empty      unconfigured ok
usb1/1.1               unknown    empty      unconfigured ok
usb1/1.2               unknown    empty      unconfigured ok
usb1/1.3               unknown    empty      unconfigured ok
usb1/1.4               unknown    empty      unconfigured ok
usb1/2                 unknown    empty      unconfigured ok
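
As a rough aid for spotting problem attachment points in a long cfgadm listing, the Condition column can be filtered with a few lines of script. This is only a minimal sketch in Python, assuming cfgadm prints five whitespace-separated columns per row (Ap_Id, Type, Receptacle, Occupant, Condition); the function name unusable_ap_ids and the SAMPLE text are illustrative and not part of any Solaris tool.

```python
def unusable_ap_ids(cfgadm_text):
    """Return the Ap_Ids whose Condition column reads 'unusable'.

    Assumption: every data row has exactly five whitespace-separated
    fields, as in ordinary `cfgadm -al` output.
    """
    flagged = []
    for line in cfgadm_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything malformed.
        if len(fields) != 5 or fields[0] == "Ap_Id":
            continue
        ap_id, _type, _receptacle, _occupant, condition = fields
        if condition == "unusable":
            flagged.append(ap_id)
    return flagged


# Hypothetical sample, abbreviated from the listing in this post.
SAMPLE = """\
Ap_Id                  Type       Receptacle Occupant   Condition
c6                     fc-fabric  connected  configured unknown
c6::200700a0b80c59c2   disk       connected  configured unknown
c6::200700a0b80c59d7   disk       connected  configured unusable
c7::200600a0b80c59d7   disk       connected  configured unusable
"""

for ap_id in unusable_ap_ids(SAMPLE):
    print(ap_id)
```

This only points at which targets to investigate; it says nothing about why the condition is "unusable".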

However, trying to use them at all causes the following:

# format

Searching for disks...done

c8t600A0B80000C59C10000001345212A6Fd0: configured with capacity of 40.00GB

c8t600A0B80000C59C100000012452127CBd0: configured with capacity of 60.00GB

c8t600A0B80000C59550000000944EC4EC4d0: configured with capacity of 49.99GB

AVAILABLE DISK SELECTIONS:

0. c0t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>

/pci@780/pci@0/pci@9/scsi@0/sd@0,0

1. c0t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>

/pci@780/pci@0/pci@9/scsi@0/sd@1,0

2. c8t600A0B80000C59C10000001345212A6Fd0 <IBM-3552-0520 cyl 20478 alt 2 hd 64 sec 64>

/scsi_vhci/ssd@g600a0b80000c59c10000001345212a6f

3. c8t600A0B80000C59C100000012452127CBd0 <IBM-3552-0520 cyl 30718 alt 2 hd 64 sec 64>

/scsi_vhci/ssd@g600a0b80000c59c100000012452127cb

4. c8t600A0B80000C59550000000944EC4EC4d0 <IBM-3552-0520 cyl 25597 alt 2 hd 64 sec 64>

/scsi_vhci/ssd@g600a0b80000c59550000000944ec4ec4

Specify disk (enter its number): 4

selecting c8t600A0B80000C59550000000944EC4EC4d0

[disk formatted]

Disk not labeled. Label it now? y

Warning: error writing VTOC.

Illegal request during read

ASC: 0x94 ASCQ: 0x1

Warning: error reading backup label.

Illegal request during read

ASC: 0x94 ASCQ: 0x1

Warning: error reading backup label.

Warning: no backup labels

Write label failed

The disk is unwritable. Comment out the scsi_vhci.conf entry for the SAN, reboot, and I can format/label the same device just fine.

Anyone know where to start looking for this problem? I've compared settings with the *working* T2000, and I can't find any differences.

[4885 byte] By [Brandon.Hume] at [2007-11-26 10:33:07]
# 1

I'm pretty sure the IBM FastT500 is not a symmetric array. You shouldn't be editing the conf file to specify an array as symmetric when it isn't.

Make sure you have the latest patches for mpxio/scsi_vhci, remove the setting in the conf file, and have IBM tell you what bits you need to set to make it work with mpxio.

torreysun at 2007-7-7 2:41:23 > top of Java-index,Storage Forums,Storage General Discussion...
# 2

Well, part of my confusion is that the other T2000 is up and running fine, as set up by the Sun guy, without any special tweaks from IBM.

In fact, I can remap a disk from the failing machine to the functional machine and it can be read/written to without an issue. Put it back on the other and it starts failing again.

I should mention that these machines are using RDAC to talk to the array.

Edit: At least, I THOUGHT they used RDAC. It was just pointed out to me that RDAC is IBM's version of multipathing.

I'm getting more and more confused. I've obviously got SOME kind of failover going on on the older T2000. Yet it's probably not RDAC, and might not even be MPxIO.

BrandonHume at 2007-7-7 2:41:23
# 3

When you did a luxadm probe, there appear to be four LUNs. One in particular is:

Node WWN:200600a0b80c59d6 Device Type:Disk device

Logical Path:/dev/rdsk/c8t600A0B80000C59550000001745212E6Ad0s2

Physical Path:

/devices/scsi_vhci/ssd@g600a0b80000c59550000001745212e6a:c,raw

This does not show up in the format output, which is very odd. Just for interest's sake, can you run the following and post the output:

luxadm -e dump_map /dev/cfg/c6

and

luxadm -e dump_map /dev/cfg/c7

Also, do a:

cfgadm -o show_FCP_dev -al c6

and

cfgadm -o show_FCP_dev -al c7

It's like you are missing one path of the multipathing from the array. Could someone have set up LUN masking on one of the ports so that only the other T2000 can see the LUNs?

Are you using a switch in between the array and the T2000? If so, perhaps one of the ports has a wrong setting?

I seriously doubt that mpxio is the cause of this, especially if you have the exact same config files on both T2000s, including sd.conf. Right?

With the scsi_vhci.conf entry changed, the system is not trying to use multipathing, so the one good LUN is there. But where is that fourth LUN?

Stephen

stephen2602 at 2007-7-7 2:41:23
# 4

> When you did a luxadm probe, there appears to be four

> luns. One in particular is:

>

> Node WWN:200600a0b80c59d6 Device Type:Disk device

> Logical

> Path:/dev/rdsk/c8t600A0B80000C59550000001745212E6Ad0s

>

>Physical Path:

> /devices/scsi_vhci/ssd@g600a0b80000c59550000001745212

> 6a:c,raw

>

> This does not show up when doing a format which is

> very odd. Just for interests sake, can you do the

> following and send it:

>

> luxadm -e dump_map /dev/cfg/c6

# luxadm -e dump_map /dev/cfg/c6

Pos  Port_ID Hard_Addr Port WWN         Node WWN         Type
0    21000   0         200700a0b80c59c2 200600a0b80c59c1 0x0  (Disk device)
1    21100   0         200700a0b80c59d7 200600a0b80c59d6 0x0  (Disk device)
2    41100   0         10000000c955c5f0 20000000c955c5f0 0x1f (Unknown Type,Host Bus Adapter)

> luxadm -e dump_map /dev/cfg/c7

# luxadm -e dump_map /dev/cfg/c7

Pos  Port_ID Hard_Addr Port WWN         Node WWN         Type
0    10000   0         200600a0b80c59c2 200600a0b80c59c1 0x0  (Disk device)
1    10100   0         200600a0b80c59d7 200600a0b80c59d6 0x0  (Disk device)
2    20300   0         10000000c955c5ef 20000000c955c5ef 0x1f (Unknown Type,Host Bus Adapter)

> cfgadm -o show_FCP_dev -al c6

# cfgadm -o show_FCP_dev -al c6

Ap_Id                    Type       Receptacle Occupant   Condition
c6                       fc-fabric  connected  configured unknown
c6::200700a0b80c59c2,0   disk       connected  configured unknown
c6::200700a0b80c59c2,1   disk       connected  configured unknown
c6::200700a0b80c59d7,0   disk       connected  configured unusable
c6::200700a0b80c59d7,1   disk       connected  configured unknown

> cfgadm -o show_FCP_dev -al c7

# cfgadm -o show_FCP_dev -al c7

Ap_Id                    Type       Receptacle Occupant   Condition
c7                       fc-fabric  connected  configured unknown
c7::200600a0b80c59c2,0   disk       connected  configured unknown
c7::200600a0b80c59c2,1   disk       connected  configured unknown
c7::200600a0b80c59d7,1   disk       connected  configured unknown

I see what you mean... c7::200600a0b80c59d7,0 appears to be missing.

Wouldn't that just affect the one disk, though?
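
The by-eye comparison here (LUN 0 on the ...59d7 controller is listed under c6 but not under c7) can be sketched as a small script. This is a minimal Python example, assuming rows of the form "cN::<port WWN>,<LUN>" as printed by cfgadm -o show_FCP_dev; the function names and the hand-written port pairing (derived from the Node WWN column of the dump_map output) are illustrative assumptions.

```python
def luns_by_target(cfgadm_text):
    """Map each target port WWN to the set of LUN numbers listed for it."""
    result = {}
    for line in cfgadm_text.splitlines():
        fields = line.split()
        # Only rows such as 'c6::200700a0b80c59d7,1 ...' carry a LUN.
        if not fields or "::" not in fields[0] or "," not in fields[0]:
            continue
        target, lun = fields[0].split("::", 1)[1].split(",", 1)
        result.setdefault(target, set()).add(int(lun))
    return result


def missing_luns(text_a, text_b, port_pairs):
    """Report LUNs visible through one port of a controller but not the other.

    port_pairs is a list of (port_on_a, port_on_b) tuples naming the two
    ports of the same array controller; it is supplied by hand here.
    """
    a, b = luns_by_target(text_a), luns_by_target(text_b)
    report = {}
    for port_a, port_b in port_pairs:
        diff = a.get(port_a, set()) ^ b.get(port_b, set())
        if diff:
            report[(port_a, port_b)] = sorted(diff)
    return report


C6 = """\
c6::200700a0b80c59c2,0 disk connected configured unknown
c6::200700a0b80c59c2,1 disk connected configured unknown
c6::200700a0b80c59d7,0 disk connected configured unusable
c6::200700a0b80c59d7,1 disk connected configured unknown
"""
C7 = """\
c7::200600a0b80c59c2,0 disk connected configured unknown
c7::200600a0b80c59c2,1 disk connected configured unknown
c7::200600a0b80c59d7,1 disk connected configured unknown
"""
PAIRS = [("200700a0b80c59c2", "200600a0b80c59c2"),
         ("200700a0b80c59d7", "200600a0b80c59d7")]

print(missing_luns(C6, C7, PAIRS))
```

A non-empty report only shows that the two HBAs see different LUN sets; whether the cause is LUN masking, zoning, or controller ownership still has to be checked on the array and switches.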

> from the array. Would anyone have setup lun masking

> on one of the ports so only the other T2000 can see

> the luns?

It's possible. I'll ask the SAN administrators to check.

> Are you using a switch in between the array and the

> T2000? If so, perhaps one of the ports has a wrong

Yes, two switches, actually. Originally the old T2000 was configured with two ports on a single switch (since those were all that were free), and we thought that might be a problem... but after we moved its second port to the same switch the new T2000 is configured on, it stayed healthy.

> I seriously doubt that mpxio is the cause of this

> especially if you have the exact same config files on

> the T2000 including sd.conf.. right?

Exact same sd.conf on both machines, yes.

One thing I've noticed... stmsboot -L on the working T2000 produces no output at all. However, on the new machine, I get the following:

# stmsboot -L

non-STMS device name                 STMS device name
/dev/rdsk/c6t200700A0B80C59D7d1      /dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0
/dev/rdsk/c6t200700A0B80C59C2d1      /dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0
/dev/rdsk/c6t200700A0B80C59C2d0      /dev/rdsk/c8t600A0B80000C59C100000012452127CBd0
/dev/rdsk/c7t200600A0B80C59D7d1      /dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0
/dev/rdsk/c7t200600A0B80C59C2d1      /dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0
/dev/rdsk/c7t200600A0B80C59C2d0      /dev/rdsk/c8t600A0B80000C59C100000012452127CBd0

... which seems odd.
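
To read a mapping like this more easily, the per-path names can be grouped under each multipath device and counted. The following is a minimal Python sketch, assuming each data row holds two whitespace-separated /dev/ paths (non-STMS name first, STMS name second); paths_per_device and STMSBOOT_L are illustrative names, not part of stmsboot.

```python
from collections import defaultdict


def paths_per_device(stmsboot_text):
    """Group non-STMS device names under their STMS (multipath) device name."""
    groups = defaultdict(list)
    for line in stmsboot_text.splitlines():
        fields = line.split()
        # Data rows are exactly two /dev/ paths; the header row is skipped.
        if len(fields) == 2 and all(f.startswith("/dev/") for f in fields):
            non_stms, stms = fields
            groups[stms].append(non_stms)
    return dict(groups)


STMSBOOT_L = """\
non-STMS device name STMS device name
/dev/rdsk/c6t200700A0B80C59D7d1 /dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0
/dev/rdsk/c6t200700A0B80C59C2d1 /dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0
/dev/rdsk/c6t200700A0B80C59C2d0 /dev/rdsk/c8t600A0B80000C59C100000012452127CBd0
/dev/rdsk/c7t200600A0B80C59D7d1 /dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0
/dev/rdsk/c7t200600A0B80C59C2d1 /dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0
/dev/rdsk/c7t200600A0B80C59C2d0 /dev/rdsk/c8t600A0B80000C59C100000012452127CBd0
"""

for stms, paths in paths_per_device(STMSBOOT_L).items():
    print(stms, len(paths), "path(s)")
```

Grouped this way, the listing covers three multipath devices with two paths each; the fourth GUID from the original luxadm probe output (ending 1745212E6A) has no entry at all.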

BrandonHume at 2007-7-7 2:41:23
# 5

When I do a luxadm probe -p now, only three LUNs show up.

# luxadm probe -p

No Network Array enclosures found in /dev/es

Found Fibre Channel device(s):

Node WWN:200600a0b80c59c1 Device Type:Disk device

Logical Path:/dev/rdsk/c8t600A0B80000C59C10000001345212A6Fd0s2

Physical Path:

/devices/scsi_vhci/ssd@g600a0b80000c59c10000001345212a6f:c,raw

Node WWN:200600a0b80c59c1 Device Type:Disk device

Logical Path:/dev/rdsk/c8t600A0B80000C59C100000012452127CBd0s2

Physical Path:

/devices/scsi_vhci/ssd@g600a0b80000c59c100000012452127cb:c,raw

Node WWN:200600a0b80c59d6 Device Type:Disk device

Logical Path:/dev/rdsk/c8t600A0B80000C59550000000944EC4EC4d0s2

Physical Path:

/devices/scsi_vhci/ssd@g600a0b80000c59550000000944ec4ec4:c,raw

... and this might be why:

Oct 3 15:31:44 TST genunix: [ID 408114 kern.info] /scsi_vhci/ssd@g600a0b80000c59550000001745212e6a (ssd3) offline

Oct 3 15:31:44 TST genunix: [ID 834635 kern.info] /scsi_vhci/ssd@g600a0b80000c59550000001745212e6a (ssd3) multipath status: failed, path /pci@7c0/pci@0/pci@1/pci@0,2/SUNW,emlxs@2,1/fp@0,0 (fp1) to target address: w200600a0b80c59d7,0 is offline Load balancing: round-robin
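
When many of these messages pile up in /var/adm/messages, pulling out the device, path, and target fields makes them easier to correlate with the cfgadm output. Below is a minimal Python sketch; the regular expression is inferred from the single log line above, so its field layout is an assumption rather than a documented format, and parse_multipath_status is a hypothetical helper name.

```python
import re

# Pattern inferred from one sample "multipath status" message;
# treat the layout as an assumption, not a documented format.
MULTIPATH_RE = re.compile(
    r"(?P<device>/scsi_vhci/\S+) \((?P<instance>\w+)\) "
    r"multipath status: (?P<status>\w+), "
    r"path (?P<path>\S+) \((?P<port>\w+)\) "
    r"to target address: (?P<target>\S+) is (?P<state>\w+)"
)


def parse_multipath_status(line):
    """Return the named fields of a multipath status message, or None."""
    match = MULTIPATH_RE.search(line)
    return match.groupdict() if match else None


LOG_LINE = (
    "Oct 3 15:31:44 TST genunix: [ID 834635 kern.info] "
    "/scsi_vhci/ssd@g600a0b80000c59550000001745212e6a (ssd3) "
    "multipath status: failed, path "
    "/pci@7c0/pci@0/pci@1/pci@0,2/SUNW,emlxs@2,1/fp@0,0 (fp1) "
    "to target address: w200600a0b80c59d7,0 is offline "
    "Load balancing: round-robin"
)

info = parse_multipath_status(LOG_LINE)
print(info["device"], info["status"], info["target"], info["state"])
```

Here the target address w200600a0b80c59d7,0 is the same port and LUN that was absent from the c7 listing in the earlier cfgadm -o show_FCP_dev output.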

BrandonHume at 2007-7-7 2:41:23
# 6

I've managed to reproduce the problem with MPxIO turned *off*, and I never thought I'd be so happy to induce a failure.

The system can talk to disks on either path. However, you CAN'T talk to the same disk on ANOTHER path after having done so. It seems related to whatever path you pick first... there's some kind of persistent binding going on, either within the switches or on the system itself.

I'm still investigating, but at least I've got something less complex to fix now. :)

BrandonHume at 2007-7-7 2:41:23
# 7

I have a rather complex environment at work, with storage and switches that I manage plus outsourced systems as well. I have no problems with what I manage, but we have never-ending issues with LUNs presented from the outsourced storage. It took me over a month to get what I wanted from the outsourced storage. LUN masking and assigning LUNs to only one path caused me major headaches, as I had to prove to the outsourcer that my systems were not at fault and that it was their technical people who needed to be more technical.

So, I have learnt an awful lot about troubleshooting. The oddest of all was the Cisco MDS switch mistakenly identifying the port when in auto mode.

Gone are the days of easy storage. At least you have something to work with now.

Stephen

stephen2602 at 2007-7-7 2:41:23
# 8

The SHARK - Last I checked - is an asymmetric array. You're basically seeing the LUN on both paths and when you access the LUN on the alternate path you're causing a LUN failover from one controller to the other. (Depends on how you cable it but that is probably it.)

Really - take the line out of the conf file. Simply enable mpxio and, if the SHARK is set up correctly per IBM's instructions, mpxio will discover it just fine and dandy. No conf lines required.

torreysun at 2007-7-7 2:41:23
# 9

Hi,

The FastT500 (i.e. the IBM 3552) is an asymmetric device.

If the file /kernel/drv/scsi_vhci.conf contains the line

"IBM3552", "symmetric-option";

you are telling Solaris to treat it as an active-active device.

=> The above file controls how MPxIO works; the file

/kernel/drv/fp.conf

determines whether MPxIO is active at all. The line

mpxio-disable="no";

says MPxIO is active. That file is altered by the stmsboot -e or stmsboot -d command.
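
For comparing two hosts, the global setting can be read out of a copy of fp.conf with a short script. This is a minimal Python sketch, assuming the global line has the form mpxio-disable="yes"; or mpxio-disable="no"; on a line of its own and that # starts a comment; mpxio_enabled is a hypothetical helper name, and per-port entries (lines that also carry name= and parent= properties) are deliberately ignored.

```python
import re


def mpxio_enabled(fp_conf_text):
    """Return True/False from the global mpxio-disable line, or None if absent.

    Assumption: the global setting sits alone on its line; per-port
    overrides on longer lines are not examined by this sketch.
    """
    for raw in fp_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        match = re.fullmatch(r'mpxio-disable\s*=\s*"(yes|no)"\s*;', line)
        if match:
            # mpxio-disable="no" means MPxIO is switched on.
            return match.group(1) == "no"
    return None


# Hypothetical file contents for illustration.
FP_CONF = """\
# Global MPxIO setting for all fp ports
mpxio-disable="no";
"""

print(mpxio_enabled(FP_CONF))
```

Running the same check against both T2000s gives a quick way to confirm that the two machines really agree on this setting before comparing scsi_vhci.conf.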

mikeschnibm at 2007-7-7 2:41:23