SunFire V120 with A1000

OK, we've got a SunFire V120 server with an X6541A PCI SCSI card installed. The card is connected to an A1000 RAID array with a good-quality HVD SCSI cable. The other port on the A1000 has an HVD terminator installed. There are eight 73GB drives in the array, and there are two internal drives running on the server's built-in SCSI. The server is running Solaris 8.

So far, I have not been able to get the server to recognize the array at all. If I run a "probe-scsi-all" I get this:

<i>

ok probe-scsi-all

/pci@1f,0/pci@1/scsi@5,1

/pci@1f,0/pci@1/scsi@5

Fatal SCSI error at script address 10 Unexpected disconnect

/pci@1f,0/pci@1/scsi@8,1

/pci@1f,0/pci@1/scsi@8

Target 0

Unit 0 Disk FUJITSU MAP3367N SUN36G 0301

Target 1

Unit 0 Disk SEAGATE ST336704LSUN36G 0326</i>

As you can see, it recognizes the internal drives on the built-in controller (scsi@8) and finds the X6541A card (scsi@5), but reports an error on the port attached to the A1000.
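As a rough illustration of how that reading is made, here is a minimal Python sketch that walks a probe-scsi-all transcript and attaches every message to the device node printed just before it. The helper names (`group_probe_output`, `nodes_with_errors`) and the `SAMPLE` text are assumptions made up for this example; the sketch only understands the simple layout shown above and is not a general OpenBoot parser.

```python
import re

# Device-node lines in the transcript start with "/" and contain no spaces,
# e.g. /pci@1f,0/pci@1/scsi@5
NODE_RE = re.compile(r"^/\S+$")


def group_probe_output(text):
    """Map each device node to the list of lines printed under it."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # the transcript is double-spaced; skip blank lines
        if NODE_RE.match(line):
            current = line
            nodes[current] = []
        elif current is not None:
            nodes[current].append(line)
    return nodes


def nodes_with_errors(nodes):
    """Return the device nodes whose output contains a 'Fatal SCSI error' line."""
    return [
        node
        for node, lines in nodes.items()
        if any("Fatal SCSI error" in line for line in lines)
    ]


# Sample text copied from the probe output in this post.
SAMPLE = """\
/pci@1f,0/pci@1/scsi@5,1

/pci@1f,0/pci@1/scsi@5

Fatal SCSI error at script address 10 Unexpected disconnect

/pci@1f,0/pci@1/scsi@8,1

/pci@1f,0/pci@1/scsi@8

Target 0

Unit 0 Disk FUJITSU MAP3367N SUN36G 0301

Target 1

Unit 0 Disk SEAGATE ST336704LSUN36G 0326
"""

print(nodes_with_errors(group_probe_output(SAMPLE)))
```

Run against the sample, this prints `['/pci@1f,0/pci@1/scsi@5']`, the same node that Solaris later complains about as glm2 during boot.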

I also get the following email to root occasionally:

<i>To: root

Subject: raid Event

Content-Length: 183

An array event has been detected on Controller Unknown

Device Unknown at Host <<our domain name here>> - Time 08/23/2005 21:14:27</i>

And I get errors on boot up:

<i>Sun Fire V120 (UltraSPARC-IIe 648MHz), No Keyboard

OpenBoot 4.0, 1536 MB memory installed, Serial #53835884.

Ethernet address 0:3:ba:35:78:6c, Host ID: 8335786c.

last command: boot

Boot device: disk File and args:

SunOS Release 5.8 Version Generic_108528-17 64-bit

Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.

WARNING: /pci@1f,0/pci@1/scsi@5 (glm2):

unexpected SCSI interrupt while idle

configuring IPv4 interfaces: eri0.

Hostname: <<our domain name here>>

The system is coming up. Please wait.

checking ufs filesystems

/dev/rdsk/c0t1d0s6: is stable.

/dev/rdsk/c0t0d0s7: 52762 files, 11375733 used, 10986210 free

/dev/rdsk/c0t0d0s7: (39522 frags, 1368336 blocks, 0.1% fragmentation)

/dev/rdsk/c0t0d0s4: is stable.

8/24/2005 1:14:21 GMT LOM time reference

starting rpc services: rpcbind done.

Setting netmask of eri0 to 255.255.255.240

Setting default IPv4 interface for multicast: add net 224.0/4: gateway <<our domain name here>>

syslog service starting.

Print services started.

There are no devices (controllers) in the system; nvutil terminated.

There are no devices (controllers) in the system.

fwutil failed!

Array Monitor initiated

Aug 23 21:14:27 /usr/lib/osa/bin/arraymon: No RAID devices found to check.

RDAC daemons initiated

volume management starting.

Wnn6: Key License Server started....

Nihongo Multi Client Server (Wnn6 R2.34)

Finished Reading Files

httpd starting.

Starting nrpe: Starting mysqld daemon with databases from /usr/local/mysql/data started

The system is ready.</i>

The first SCSI error you see in the boot sequence:

<i>WARNING: /pci@1f,0/pci@1/scsi@5 (glm2):

unexpected SCSI interrupt while idle</i>

is sometimes replaced with:

<i>WARNING: invalid vector intr: number 0x7df, pil 0x0</i>

Any ideas at all? Is it the A1000, its RAID controller, the X6541A, PROM settings? Anything?

Thanks!

- Matt

[4765 byte] By [Matt_M] at [2007-11-25 22:59:54]
# 1

Hello Matt,

I would suggest that you search the contract part of SunSolve or open a service case.

The SCSI errors are probably due to a hardware fault, but they may simply be caused by an outdated OBP.

Update your OBP to the current version (4.0.17, Patch # 111991-07). There have been updates to fix problems with this HBA (at least for other systems).

To check the version, enter <b>.version</b> at the ok prompt.

Your Solaris 8 is nearly unpatched. The kernel patch is below Solaris 8 12/02 (which is 108528-18). Apply the latest patch cluster or update to a current version.
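The comparison behind that statement can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the two inputs are the kernel version string from the boot banner above (Generic_108528-17) and the 108528-18 baseline mentioned here.

```python
import re


def kernel_patch(version):
    """Split a string such as 'Generic_108528-17' into (patch_id, revision)."""
    m = re.fullmatch(r"Generic_(\d+)-(\d+)", version.strip())
    if m is None:
        raise ValueError(f"unrecognized kernel version string: {version!r}")
    return m.group(1), int(m.group(2))


def is_below(version, baseline):
    """True if both strings name the same patch ID and `version` has the lower revision."""
    patch, rev = kernel_patch(version)
    base_patch, base_rev = kernel_patch(baseline)
    if patch != base_patch:
        # Revisions are only comparable within the same patch ID.
        raise ValueError("different patch IDs cannot be compared by revision")
    return rev < base_rev


# Boot banner value versus the Solaris 8 12/02 baseline quoted in this reply.
print(is_below("Generic_108528-17", "Generic_108528-18"))
```

This prints `True`: revision 17 is one step below the 12/02 baseline.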

Review the RaidManager / A1000 documentation for minimum patch levels. Is the firmware of your A1000 current?

Michael

maal at 2007-7-5 17:49:01 > top of Java-index,Storage Forums,Storage General Discussion...
# 2
Yeah, I know we are way behind on patches. I will try that first.

Of course, we have no way to check the firmware on the A1000 until we can connect to it. Chicken and egg...

Thanks, will report back any success/failure.

- Matt

Matt_M at 2007-7-5 17:49:01
# 3

I would get the "Fatal SCSI error at script address 10 Unexpected disconnect" fixed first in probe-scsi-all.

Hopefully auto-boot? is false and you issue a reset-all before running probe-scsi-all. You could move the cable to the other port on the HBA (/pci@1f,0/pci@1/scsi@5,1) and try it again. You could try swapping the cable and terminator (150-1890) to see if you can make the problem move or disappear. What happens if you disconnect the A1000 and run the probe? If it doesn't pass at this level you have zero chance in Solaris.

jds2n at 2007-7-5 17:49:01
# 4

>> If it doesn't pass at this level you have zero chance in Solaris.

You're probably right.

I have tried swapping the ports on both the A1000 and the SCSI card. But I can't recall the exact effects -- problem is, I am in New Hampshire and the server is in New York. "Hands-on" debugging is a bit difficult.

I do have auto-boot set to false, so no problem running the SCSI probe. It looks like I will probably have to make one more trip down there. Your suggestion of trying the SCSI probe with the A1000 disconnected is a good one - at least that will isolate the problem to either the A1000 or the SCSI card.

I have a hunch there is a hardware problem on the RAID controller card in the A1000, but no proof yet.

Installed 300 or so patches last night. 50 or so to go. This part needs to get done anyway, even if it doesn't fix the RAID array!

Matt_M at 2007-7-5 17:49:01
# 5
Hi, just to add to this discussion: I was facing the same error a few days back with a Sun 420R and an A1000. The terminator at the A1000 end was causing the problem.

s at 2007-7-5 17:49:01