prtdiag reports SC faulty

Hi,

We have two Sun Fire V440 servers clustered with Sun Cluster 3.1.

The SC isn't used as a remote console, only to monitor temperature, fans, etc.

We use a KVM switch on both systems for the console.

After ~10 days of uptime there is an error in /var/adm/messages:

rmclomv: .... WARNING: SC has stopped responding

prtdiag doesn't report any information about temperature, fans, etc., and instead reports:

SC faulty.

The "Service required" LED on the front panel is on.

The only way to clear the error and restore normal prtdiag output is to reboot the system (and unplug and replug the power cord).

How can I restore prtdiag functionality?

[701 byte] By [adriano] at [2007-11-25 22:47:50]
# 1

How many times has this happened?

Is it happening on both of the V440's or just one?

When this happens, are you able to view any output if you connect to the SC via the serial management port?

Are you able to break to the ALOM using the #. command when connected to the serial management port?

Can you post the following outputs:

1. Summary of the messages you are seeing.

2. The output from the showlogs command from the ALOM.

3. The output from the showplatform command from the ALOM.

4. The output from the showsc command from the ALOM.

5. The output that is shown on the serial management port when you issue the resetsc command from the ALOM.
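For item 1, one rough way to summarize the messages is to count the rmclomv events in the messages file. This is only a minimal sketch, assuming the message texts match the ones quoted in this thread ("SC has stopped responding" and "SC recovered"); the function name sc_event_summary is made up for illustration.

```shell
# Count SC-related rmclomv events in a syslog-style file.
# sc_event_summary is a hypothetical helper, not a Solaris command.
# Usage: sc_event_summary /var/adm/messages
sc_event_summary() {
    awk '
        /rmclomv/ && /SC has stopped responding/ { stopped++ }
        /rmclomv/ && /SC recovered/              { recovered++ }
        END { printf "stopped=%d recovered=%d\n", stopped, recovered }
    ' "$1"
}
```

Running it against /var/adm/messages (and any rotated copies) gives a quick count to post alongside the raw lines.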

stumoor at 2007-7-5 17:03:13 > top of Java-index,Sun Hardware,Servers - General Discussion...
# 2

Thank you for the fast reply.

Q1. How many times has this happened?

A1. The problem has happened several times, I think 4 or 5; the servers have long uptimes, and when the problem happens I can clear it only with a reboot.

Q2. Is it happening on both of the V440's or just one?

A2. Both systems have had this problem, but at the moment only one is affected.

It seems the problem occurs on the server the KVM is switched to:

right now the console is switched to server1 (which has the problem), and the problem isn't present on the second server, where the console and USB devices (keyboard and mouse) aren't detected.

Q3. When you connect to the SC, via the serial management port, when this happens are you able to view any output?

A3. As soon as possible I'll try to connect to the SC serial port and report further details.

adriano at 2007-7-5 17:03:13
# 3

Thanks, the outputs would help.

But from what you said in answer to question 2, it is possible that the server sees the KVM being attached as taking the console away from the SC and treats this as a fault. Is the KVM being switched back and forth between the servers while they are live? And how are they being booted, i.e. is the KVM attached when they boot?

Also, what are the OBP input and output device settings set to?
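From Solaris, the eeprom command prints these settings as name=value pairs. Below is a small sketch that classifies them, assuming the usual OBP values keyboard and screen for a locally attached keyboard and monitor; console_target is a hypothetical helper name, not an existing tool.

```shell
# Classify where the OBP console is directed, given name=value lines
# on stdin, e.g. from: eeprom input-device output-device
# Assumption: "keyboard"/"screen" mean a local keyboard and monitor.
console_target() {
    awk -F= '
        $1 == "input-device"  { in_dev  = $2 }
        $1 == "output-device" { out_dev = $2 }
        END {
            if (in_dev == "keyboard" || out_dev == "screen")
                print "local keyboard/monitor"
            else
                print "other (" in_dev "/" out_dev ")"
        }'
}
```

On the server itself this would be used as: eeprom input-device output-device | console_target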

A side issue here.

If the OBP input and output device settings are set to monitor and keyboard, the output will go to the monitor and keyboard. However, this is not recommended with Sun Clusters, especially when you are moving the monitor and keyboard between servers. The reason for this is best described in a rather long-winded explanation, sorry.

Suppose the output is being redirected to a monitor and keyboard, and that they are being moved between nodes. Should one node crash, the other node will print messages to the console indicating that a node has dropped. If the monitor and keyboard are not attached to the good node at that moment, the good node might crash as well, as it attempts to print this message to the console and can't find the monitor and keyboard it thinks are attached. For this reason it is highly recommended that you administer clusters over a terminal server, which in the case of a V440 means going straight through the Serial Management Port and not through a monitor and keyboard. Just a little heads-up there.

stumoor at 2007-7-5 17:03:13
# 4

Sorry, just reread the post :-).

When you say KVM, I assume that both servers are connected to the KVM and all you are doing with the monitor and keyboard is selecting the relevant input on the KVM, not physically moving the keyboard and monitor cables between them.

Either way, the SC outputs will show you more. If they echo the SC failure messages that you are seeing in prtdiag, then I would 100% open a Sun Support case, as this would need to be looked at more closely.

stumoor at 2007-7-5 17:03:13
# 5

"SC isn't used for remote console but only to monitor temperature, fan, etc.

We use a KVM switch on both systems for console."

Where do you plug in a KVM if not using the SC? Is it one listed here:

http://www.sun.com/io_technologies/Input.html

Have you tried the scadm resetrsc command in Solaris to reset the SC, or sc> resetsc when connected to the serial mgt port, if it is still alive? If you do successfully log in to the SC serial mgt port, do you sc> logout when finished?

Useful ALOM info can be found here:

http://docs.sun.com/source/817-1960-10/

jds2n at 2007-7-5 17:03:13
# 6
That is a good ALOM website, never seen that one before. You learn something new every day, lol
stumoor at 2007-7-5 17:03:13
# 7

Both servers have an XVR-100 and are connected to a Tripp-Lite B006-004-R USB KVM switch (not included in the http://www.sun.com/io_technologies/Input.html list ...).

I sent a "scadm resetrsc" and, incredibly, the SC reset to OK.

Fru Operational Status:
-------------------------
Location        Status
-------------------------
SC              okay

In the messages file there are these lines:

Nov 14 16:04:07 node1 pseudo: [ID 129642 kern.info] pseudo-device: rmcadm0

Nov 14 16:04:07 node1 genunix: [ID 936769 kern.info] rmcadm0 is /pseudo/rmcadm@0

Nov 14 16:05:09 node1 rmclomv: [ID 714237 kern.notice] NOTICE: SC recovered

I'm surprised, because I sent the same command some time ago without success, and it gave this output:

scadm: unable to send data to SC

From now on I'll check whether the SC continues to work.
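A periodic check along these lines could be sketched as below. It assumes prtdiag keeps printing the SC line in the "SC okay" / "SC faulty" form shown in this thread; sc_status is a hypothetical helper name.

```shell
# Print the status reported for the SC in prtdiag-style text on stdin.
# Assumption: the line looks like "SC  okay" or "SC  faulty".
sc_status() {
    awk '
        $1 == "SC" && NF == 2 { print $2; found = 1; exit }
        END { if (!found) print "unknown" }'
}

# Example use on the server itself (not executed here):
#   plat=`uname -i`
#   status=`/usr/platform/$plat/sbin/prtdiag -v | sc_status`
#   if [ "$status" = "faulty" ]; then
#       /usr/platform/$plat/sbin/scadm resetrsc
#   fi
```

As noted above, scadm resetrsc can itself fail with "unable to send data to SC", so a check like this is a convenience, not a guaranteed recovery.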

Thanks for the help.

adriano at 2007-7-5 17:03:13
# 8

"incredibly SC reset to OK"

Incredible? No.

Magic? Most likely.

Since these servers are mostly managed via the Net Mgt or Serial Mgt ports, the SC "seldom" goes to sleep, since it is the system console. You might want to consider upgrading ALOM to 1.5 or 1.5.2, since this may make the SC more robust and able to survive periods of loneliness. "sc> showsc -v version" or "# scadm version -v" will indicate your current revision.
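Once you have read the revision off that output, comparing it against a target release is a simple dotted-number comparison. A minimal sketch follows; version_lt is a hypothetical helper, the version strings are typed in by hand (the exact layout of the scadm output is not assumed here), and it needs a POSIX awk such as nawk.

```shell
# Succeed (exit 0) if dotted version $1 is older than dotted version $2.
# version_lt is an illustrative helper, not part of ALOM or Solaris.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing components count as 0, so 1.5 equals 1.5.0.
            if ((x[i] + 0) < (y[i] + 0)) exit 0
            if ((x[i] + 0) > (y[i] + 0)) exit 1
        }
        exit 1
    }'
}
```

For example, "version_lt 1.4 1.5 && echo upgrade" would print "upgrade" for a controller still at 1.4.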

You can download the new version from:

http://www.sun.com/download/products.xml?id=415453f9

jds2n at 2007-7-5 17:03:13
# 9
I agree that upgrading the servers' ALOM firmware, and possibly the OBP firmware, is a good idea. However, I am still leaning towards the problem here being the KVM, and I would 100% be getting rid of it if I had a cluster.
stumoor at 2007-7-5 17:03:13
# 10
I agree with you. Using a KVM does not allow you to see POST output or provide the SC's ability to power off / on the server remotely, among other things.
jds2n at 2007-7-5 17:03:13
# 11

The latest ALOM version is 1.5.3, and the version currently installed on the V440s is 1.4. I'm now upgrading to the latest version and will check whether the problem is fixed.

About the use of a KVM in a Sun Cluster environment: I know that the KVM is not certified by Sun, but I don't think a terminal concentrator with a Sun Blade used just as a console is a reasonable solution for a two-node cluster. So I prefer to use a simple KVM with its known limitations and set up an Ethernet connection to the SC only for exceptional needs (I prefer to use telnet as little as possible).

adriano at 2007-7-5 17:03:13
# 12
OK, keep us updated. The bit about the KVM and Sun Cluster was just a heads-up, as I have seen these issues, so they do happen. If you review the Cluster documentation, I think it even says the recommendation is for serial terminal server control. But they are your systems.
stumoor at 2007-7-5 17:03:13
# 13

Had the exact same problem. Also have the same configuration with the KVM. Upgraded the ALOM firmware and the problem was resolved.

The firmware package contains two images: the main image and the boot image. You will have to load them both. The package name is: ALOM_1.5.3_fw.tar. Use the scadm command to load them.

First, check if the server keyswitch is in the locked position. You will not be able to update the ALOM unless the keyswitch is set to the normal position. To do this, log on to the system controller and run the showenvironment command to view the keyswitch position.

Sulfur2-sc> showenvironment

Keyswitch position: NORMAL

If it is set to locked, you must physically turn the key to normal.

# /usr/platform/`uname -i`/sbin/scadm download boot alombootfw

# /usr/platform/`uname -i`/sbin/scadm download alommainfw

Wait 60 seconds between each load.

Approximately 120 seconds after scadm completes, ALOM will be available for use.
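The two loads and the pause between them can be wrapped in a small function. This is only a sketch of the sequence described above: alom_load_firmware and the SCADM and WAIT override variables are made up for illustration, and the keyswitch check still has to be done by hand at the sc> prompt first.

```shell
# Load the ALOM boot image, wait, then load the main image.
# SCADM and WAIT are illustrative overrides, not scadm features.
# Usage: alom_load_firmware alombootfw alommainfw
alom_load_firmware() {
    # Default to the platform-specific scadm path used in this thread.
    scadm=${SCADM:-/usr/platform/`uname -i`/sbin/scadm}
    wait_secs=${WAIT:-60}

    "$scadm" download boot "$1" || return 1   # boot image first
    sleep "$wait_secs"                        # pause between the loads
    "$scadm" download "$2" || return 1        # then the main image
}
```

Stopping at the first failed download avoids loading the main image on top of a boot image that did not go in cleanly.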

Tony

cax689 at 2007-7-5 17:03:13
# 14
After the SC firmware upgrade and 5 weeks of uptime, everything works fine. Thanks for the help.
adriano at 2007-7-5 17:03:13