Possible NIC hardware problem?

Hi,

I have two sun servers, both solaris 9, that act as firewalls.

I have installed only OS Core. The firewall software is Checkpoint.

These two servers are in cluster (checkpoint xl).

I have many other sun servers configured in this way that work perfectly.

But with these two servers I have a lot of network problems.

For instance, if I issue netstat -rn I see a lot of entries (about one hundred!). Do you know why this happens?

My security admins think that there could be a nic hardware problem.

How can I be sure that the network cards are ok?

If I run, on both servers, from ok prompt:

test net

I get

rejecting alloc-mem?

What does this mean?

The servers are:

- SUNFIRE V120

- SUNFIRE V240

Can you help me?

Thanks,

Tarek

[1055 byte] By [tarek] at [2007-11-25 22:44:35]
# 1
Can you tell us the OS and firmware patch level? Have you run diagnostics on these servers?
mlennon at 2007-7-5 16:56:30 > top of Java-index,Sun Hardware,Other Sun Hardware...
# 2
Hi,

I ran diagnostics from the ok prompt with the command:

test net

The OS is Solaris 9. I don't know how to see the firmware version.

Thanks,

Tarek

tarek at 2007-7-5 16:56:30
# 3
From the ok prompt you should run POST:

ok setenv diag-level max

ok setenv diag-switch? true

To get the firmware revision use the following command:

ok banner
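
For anyone following along from a running system rather than the ok prompt: the same OBP variables can, as far as I know, also be set from Solaris with eeprom(1M), and prtconf -V prints the OBP version without a reboot. A minimal sketch, assuming a root shell on SPARC Solaris. The function name and the EEPROM_CMD override (there only to allow a dry run) are hypothetical:

```shell
# Set the OBP variables from the post above so the next reset runs full POST.
# EEPROM_CMD is a hypothetical override; set it to "echo" for a dry run.
enable_post_diags() {
    cmd=${EEPROM_CMD:-eeprom}
    $cmd diag-level=max &&
    $cmd 'diag-switch?=true'   # quoted so the shell does not treat ? as a glob
}

# Dry run (prints the two assignments instead of applying them):
#   EEPROM_CMD=echo enable_post_diags
# Firmware version from the running OS:
#   prtconf -V
```

The dry run is worth doing first on a production firewall, since the change only takes effect (and lengthens the boot) at the next reset.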
mlennon at 2007-7-5 16:56:30
# 4

Hi,

the parameter diag-level was already set to max.

I set diag-switch? to true and diag-device to disk.

(Before, diag-device was set to net, and during boot the system stopped with the message "Timeout waiting for ARP/RARP packet".)

Banner command shows:

ok banner

Sun Fire V120 (UltraSPARC-IIe 648MHz), No Keyboard

OpenBoot 4.0, 512 MB memory installed, Serial #51693521.

Ethernet address 0:3:ba:14:c7:d1, Host ID: 8314c7d1.

Do I have to attach all the boot output?

However it doesn't seem to give errors.

Thanks again,

Tarek

tarek at 2007-7-5 16:56:30
# 5

No, don't attach all the boot output. If there is a faulty component, POST will echo FAILED at the end of its run. It would be better to get a more detailed firmware revision:

ok .version

Sorry about the banner command, I forgot about .version!

mlennon at 2007-7-5 16:56:30
# 6

Hi,

the .version shows:

ok .version

Firmware CORE Release 1.0.12 created 2002/1/8 13:0

Release 4.0 Version 12 created 2002/01/08 13:01

cPOST version 1.0.12 created 2002/1/8

CORE 1.0.12 2002/01/08 13:00

Can I be sure that if no errors (FAILED) appear during boot, after setting diag-switch? to true, I don't have any hardware problems?

Thanks again,

Tarek

tarek at 2007-7-5 16:56:30
# 7

You can upgrade your firmware to 111991-07, but I would read the manual and become familiar with diagnostics first.

Use this link and download 816-2090-10:

http://www.sun.com/products-n-solutions/hardware/docs/Servers/Netra_Servers/Netra_120/index.html

This document outlines the diagnostic procedures and gives examples of output from failed components (section 10.1, just so we are reading from the same sheet, if you will). If these tests do not turn up any failed components, the next step I would take is to apply all relevant patches: patch 111991-07 for firmware, the Solaris recommended patch cluster, and the applicable patches for the Checkpoint software. It wouldn't hurt to check the firmware of the layer 1 and 2 equipment as well, if applicable.

mlennon at 2007-7-5 16:56:30
# 8

Hi,

I have many other SunFire V120 and V240 that work perfectly with solaris 9 and checkpoint and without upgrading firmware.

I just want to be sure that from hardware point of view everything is ok.

On SunFire V120 there are no errors.

But today after I ran test scsi I get errors and now the system is not booting, it gives this error:

ok boot

Boot device: disk File and args:

Can't locate boot device

ok test scsi

Device scsi not found

ok

ok

ok .asr

scsi Disabled by FWDIAGS

OBDIAG failure

ok

It seems to be some problem with the SCSI controller. I have not been able to enable the controller again; I'm trying with the command asr-enable, without success.

Do you have any suggestion?

Thanks again,

Tarek

tarek at 2007-7-5 16:56:30
# 9

I forgot to say that the above error is from the Sun Fire V240.

ok banner

Sun Fire V240, No Keyboard

Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.

OpenBoot 4.11.4, 512 MB memory installed, Serial #58636805.

Ethernet address 0:3:ba:7e:ba:5, Host ID: 837eba05.

ok

ok - .version

Release 4.11.4 created 2003/07/23 08:04

OBP 4.11.4 2003/07/23 08:04 Sun Fire V210/V240,Netra 240

OBDIAG 4.11.4 2003/07/23 08:05

POST 4.11.4 2003/07/23 11:42

tarek at 2007-7-5 16:56:30
# 10

Ok, I take it you don't have a root mirror, this is a mistake with a network firewall that provides availability!

Try this command:

ok asr-clear

ok .asr

I am a bit confused now, please tell me what model server you are experiencing the issues with.

mlennon at 2007-7-5 16:56:30
# 11

Hei,

now the system is booting!

I didn't understand what happened. Can you explain in more detail, please?

The system having problem is the SUNFIRE V240.

The SUNFIRE V120 doesn't give any error.

Both systems are a Checkpoint XL Cluster.

I'm trying to understand if these two servers have any hardware error.

This is because I successfully installed Solaris 9 (Core) on both systems. Then Checkpoint was installed.

But when we start using this cluster, Checkpoint starts reporting errors on a NIC. From the OS, the only strange thing I see is that netstat -rn shows a lot of entries (even after uninstalling Checkpoint and rebooting the system).

I'm going crazy trying to understand what the problem is.

Security admins are saying it could be hardware (NIC?), that's why I'm running diag.

The only thing I know is that I have installed Solaris 9 and Checkpoint on many other servers and everything works fine.

Really thanks for your help, it's much appreciated!

Tarek

tarek at 2007-7-5 16:56:30
# 12

I would just like to make the point that even though you may have great stability on a number of systems without applying firmware and OS patches, it does not mean that the system you are currently having issues with will not benefit from a firmware upgrade. Some of the systems may have different component revisions that require a firmware upgrade. Firmware upgrades can improve ASR (automatic system recovery), hardware diagnosability and hardware integrity, and generally improve the chances of discovering and isolating faults. So it would be a good idea to plan an upgrade of all your systems to the latest firmware.

I would put this system on alert and make a service call with Sun. ASR can detect early signs that a component is failing, and just clearing the disk from ASR should not be the only step taken here; the disk may need to be replaced.

I am currently looking through the internal handbook for some references for you. There should be some good spectrum stuff that I can point you to; I'll update shortly.

One more thing: when configuring Solaris for FW-1 there are a number of kernel parameters to change, and misconfiguration of any of them could cause FW-1 to report errors.

On the output from netstat -rn, I wouldn't worry about that too much. If the system is a gateway for a large number of hosts, it will probably have an entry for every network visited by hosts on the client side.
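
To see what those hundred entries actually are, it can help to group the routing table by its Flags column: plain network routes show as U or UG, host routes carry an H, and routes created dynamically by a redirect carry a D. A rough sketch, assuming the usual Solaris column order (Destination, Gateway, Flags, ...) and IPv4 routes only. The name summarize_routes is made up for this example:

```shell
# Count routes per flag combination from `netstat -rn` output on stdin.
summarize_routes() {
    awk '
        # Route lines begin with an IPv4 address or the word "default";
        # headers and separator lines are skipped.
        $1 ~ /^[0-9]+\./ || $1 == "default" { count[$3]++; total++ }
        END {
            for (f in count) printf "%s %d\n", f, count[f]
            printf "total %d\n", total
        }
    ' | LC_ALL=C sort
}

# Usage on the firewall:
#   netstat -rn | summarize_routes
```

If most of the entries turn out to be ordinary network routes, that fits the gateway explanation above. A large number of host routes would be something to raise with the network team rather than a sign of a bad NIC.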

mlennon at 2007-7-5 16:56:30
# 13

Hi,

can you please explain what you said before:

you don't have a root mirror, this is a mistake with a network firewall that provides availability!

My system has software mirroring (Solstice DiskSuite)!

So you think I have a hardware problem (disk)?

About the firmware upgrade I never did it before but I think it's time to learn :-)

On this server (SunFire V240) I will try the upgrade.

Thanks,

Tarek

tarek at 2007-7-5 16:56:30
# 14

... back to the original question on this thread.

Let's pause and take a breath.

If the OBP test of watch-nets is good, then the Ethernet ports of the computers are fine and functional (I may have the spelling wrong -- sift for the command at the ok prompt).

I hope there's absolutely no auto-negotiation in any of the ethernet configurations, because then every port is slamming the other for recognition and you're just plain lucky when a temporary handshake gets established.

This is a networking issue.

Yes, open a service case with Sun and get through to the Network support group, NOT to the Hardware group.

My suspicion is that the issue is with the switch and its setup and not with the computers.

Bill at 2007-7-5 16:56:30
# 15

OK, my mistake. If you have SDS running and the root disk mirrored, all should be well. The reason I made that comment was that when you have a root mirror and the root disk is offline ( disabled by asr ) the system should boot from the mirror disk ( in your post you said it wouldn't boot at all ). It's hard for me to tell how your systems are configured from the information we are working with here, so please forgive any misunderstandings.

Like I said, I can't say 100% whether you have a failing disk from the information you have posted, but if you are getting SCSI-related errors it will probably be best to contact Sun and have an FE take a look at the system. If you have logged any SCSI-related errors, please post them and I'll see if anything points to a bad disk.

at 2007-7-21 14:25:53
# 16

Thanks for coming in on this thread, Bill. There is a lot going on here, but not a great deal of feedback from Tarek on some of my points. I had mentioned layer 1 + 2 earlier, but got no comment. Also, I have been leaning towards network issues throughout, but I don't want to overlook any hardware-related issues that may surface.

at 2007-7-21 14:25:53
# 17
Installing and running Explorer on each will give the Sun support center all the information they should need to investigate (other than the external hardware such as switches, site staff, etc.) :-)
at 2007-7-21 14:25:53
# 18

You said:

when you have a root mirror and the root disk is offline ( disabled by asr ) the system should boot from the mirror disk ( in your post you said it wouldn't boot at all )

Exactly!!

If the first disk has a failure the system should boot from the second disk. The problem is that it isn't booting at all!

And that's because the system doesn't see any scsi device: probe-scsi-all shows nothing.

The test scsi gives:

ok test scsi

Testing scsi

Bus fault

ERROR: Method 'move-memory' failed with a result = fd

DEVICE : /pci@1c,600000/scsi@2

SUBTEST : selftest:scsi-dma-test

MACHINE : Sun Fire V240

SERIAL# : 58636805

DATE: 09/22/2005 13:50:00 GMT

CONTROLS: diag-level=max test-args=

scsi selftest failed, return code = 1

After I issue an asr-clear the system boots fine.

Thanks again,

Tarek

at 2007-7-21 14:25:53
# 19

I will install Explorer and send everything to Sun. If I understood correctly, the possible problems are:

- scsi hardware problem

- disk problem

- obsolete firmware

- network

All the NICs are set to autonegotiation, and so is the switch.

But I'm really confused about the problem. If there's a disk problem, does it have any impact on the network? (netstat -rn gives a lot of strange entries.)

Lennon, what did I miss answering?

Thanks again,

Tarek

at 2007-7-21 14:25:53
# 20
The watch-net gives:

ok watch-net

Timeout waiting for transmit completion

Internal Loopback test -- Cannot send loopback packet

at 2007-7-21 14:25:53
# 21

I can't say for sure on the SCSI end. If you ran diagnostics and the SCSI bus failed, there could be an issue there; the best suggestion I can make is to get your Sun FE to take a look at the server (even if there is no hardware issue, it won't hurt to have it checked). My policy with all the Sun systems I work with is to keep the firmware and OS patches at the latest revisions. This has given me great stability (even from systems that are over 5 years old!).

The network-related stuff is huge. There are so many contributing factors to take into account, and this should involve your security team, systems administrators and network administrators. Some equipment configurations can conflict with Sun and cause problems. One issue (brought up by Bill) is auto-negotiation. Furthermore, options such as Cisco VLAN and EtherChannel can conflict with Sun VLAN and trunking, which are often used to give failover or increased performance on systems running Checkpoint. Then there are the kernel configuration parameters set for FW-1: incorrect settings can drastically decrease network performance. If you want to outline more about the layer 1, 2 and 3 equipment and firmware configuration, it would be great.

All in all though, this goes far beyond the scope of the forum. What I would advise is to get your three departments together (Sec A, Sys A and Net A) to review each component's configuration and look for conflicts. Failing that, the next step is to call in the consultant who designed the system and the various support elements involved (Sun FE, network support and Checkpoint support).

at 2007-7-21 14:25:53
# 22

I already opened a hardware call to sun support.

The big problem here is that security admins and network admins say that the problem is hardware related and since I don't know anything about security and network I'm trying to do my best to understand if it's really a hardware problem or not.

The only thing I added for checkpoint in the /etc/system are:

set nfssrv:nfs_portmon=1

set noexec_user_stack_log=1

set noexec_user_stack=1

set nautopush=64

set md:mirrored_root_flag=1

I also usually install the latest Recommended Patch Cluster, but unfortunately this time those patches cause problems for Checkpoint! There's a conflict between Checkpoint and the latest Recommended cluster. We contacted Checkpoint support and they told us not to install the latest Solaris Recommended Patch Cluster but an older version.

As you see there are a lot of problems.

About the firmware upgrade I will install the latest version.

From my point of view I have to understand whether there's any Solaris problem (OS or hardware). If this is not the case, the network/security admins should do their work better!

But if the problem is mine (os or hardware) I have to solve it as soon as possible!

It's a very simple installation. Just CORE installation, SDS for mirror, commented some services in /etc/rc2.d and rc3.d for security reasons, and four new entries in /etc/system. That's all!

OK, I think that's enough for today! Thanks for your help. I will wait for Sun support and try to understand what's wrong.

Thanks again.

Tarek

at 2007-7-21 14:25:53
# 23
More questions:

Have you used these /etc/system parameters on all systems throughout your network? Are these parameters set in addition to the standard kernel parameters used for Checkpoint?
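
One way to answer that question with evidence rather than memory is to extract the active set lines from /etc/system on a working firewall and on the problem one, then diff the two. A small sketch, assuming a grep and sed that understand POSIX character classes (on Solaris 9 that probably means the /usr/xpg4/bin versions). The function name and the file names in the usage note are illustrative only:

```shell
# Print the active "set" lines of an /etc/system-style file, normalized
# and sorted. Comment lines (which start with '*') are not matched.
system_settings() {
    grep '^[[:space:]]*set[[:space:]]' "$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*=[[:space:]]*/=/' |
        LC_ALL=C sort
}

# Usage, on each host:
#   system_settings /etc/system > /tmp/`hostname`.system
# then copy the files to one place and compare:
#   diff /tmp/goodhost.system /tmp/badhost.system
```

Normalizing the spacing around the equals sign keeps the diff from flagging lines that differ only in how they were typed.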
at 2007-7-21 14:25:53
# 24

Hi,

I have to put this parameter "set nautopush=64" only on the systems with bge nic, otherwise checkpoint gives problems (so this has been already used).

set md:mirrored_root_flag=1 is for SDS quorum and I already used it on other systems.

About these parameters:

set nfssrv:nfs_portmon=1

set noexec_user_stack_log=1

set noexec_user_stack=1

I found them in an article about hardening Solaris, but I'm not sure whether I have already used them elsewhere. I'm checking, and I will let you know within a few minutes.

About the problem of autonegotiation, I'm trying to set the NICs to 100FD.

On the qfe and hme I'm able to do it, but on bge it reports an unknown value.

This is the command I'm using:

ndd -set /dev/qfe instance 0

ndd -set /dev/qfe link_speed 1

ndd -set /dev/qfe link_mode 1

but if I issue the same commands for bge I get an error. Any idea?
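
One possible explanation, offered with some caution since it is not confirmed anywhere in this thread: on hme and qfe the settable ndd parameters are, to my knowledge, the adv_* capabilities, while link_speed and link_mode only report status. The bge driver, as far as I know, has no instance parameter and is addressed through a per-instance node such as /dev/bge0. The sketch below only prints the commands so they can be reviewed first. force_100fd_cmds is a hypothetical helper, and the bge parameter names are an assumption to check against the driver documentation for your patch level:

```shell
# Print (do not run) the ndd commands that would force 100 Mbit/s full duplex.
# Usage: force_100fd_cmds <driver> <instance>
force_100fd_cmds() {
    drv=$1
    inst=$2
    case $drv in
        hme|qfe)
            # One shared device node; the instance is selected first.
            dev=/dev/$drv
            echo "ndd -set $dev instance $inst"
            ;;
        bge)
            # Assumption: per-instance node, and gigabit must be switched off too.
            dev=/dev/$drv$inst
            echo "ndd -set $dev adv_1000fdx_cap 0"
            echo "ndd -set $dev adv_1000hdx_cap 0"
            ;;
        *)
            echo "unsupported driver: $drv" >&2
            return 1
            ;;
    esac
    echo "ndd -set $dev adv_100fdx_cap 1"
    echo "ndd -set $dev adv_100hdx_cap 0"
    echo "ndd -set $dev adv_10fdx_cap 0"
    echo "ndd -set $dev adv_10hdx_cap 0"
    echo "ndd -set $dev adv_autoneg_cap 0"   # last, so the link renegotiates once
}

# Review the output, then apply it as root:
#   force_100fd_cmds bge 0 | sh
```

Settings made with ndd do not survive a reboot, and the switch port has to be forced to the same speed and duplex, otherwise the mismatch tends to be worse than leaving both ends on auto.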

Thanks,

Tarek

at 2007-7-21 14:25:53
# 25

Hi,

I have checked and these parameters aren't set:

set nfssrv:nfs_portmon=1

set noexec_user_stack_log=1

set noexec_user_stack=1

As far as I know, the first parameter is nfs related. I will not use it since I don't use nfs server on the firewall!

The other two parameters are meant to prevent buffer overflow attacks. Do you think they will affect the functionality of the firewall?

I put them for security reason but I can avoid using them.

Tarek

at 2007-7-21 14:25:53
# 26

Tarek, I can only briefly come back to you this afternoon, busy day today! Anyway, I would suggest that you drop all these parameters (not SDS!!). The noexec_user_stack parameter can cause some applications to crash. I am not sure about the nautopush parameter; I have always associated it with SYSV ptys, setting the number of terminals on the system. I wouldn't say it is necessary to change the defaults on a gateway host, so since the default pt_cnt is 48, make nautopush the same: 48. Try to avoid using anything other than the parameters recommended by Checkpoint. Solaris is very secure out of the crate, and once you lock down all the networking services (telnet, ftp etc.) you get a very secure environment to start out with. I would avoid using parameters found in articles; sometimes these articles are written by journalists with limited experience of production environments. I recall seeing some notes on the bge settings somewhere, I'll come back to you on that later.

at 2007-7-21 14:25:53
# 27
Thanks Lennon.

I have to set the nautopush parameter; it's mandatory for Checkpoint running on Sun servers with bge NICs.

I will disable the other parameters.

I'm still waiting for Sun support on the hardware (scsi/disk?) problem.

Thanks

at 2007-7-21 14:25:53
# 28
I'm just going to add some info about the V240. The latest OBP is 4.17.1 (patch 119234-01). ALOM version is 1.5.3.
at 2007-7-21 14:25:53
# 29

If you set up the mirror with Solstice DiskSuite then the problem is a bad boot path.

From the ok prompt, check your boot path:

ok printenv nvramrc

It should give you a path similar to the one below:

nvramrc

nvramrc=devalias rootdisk /pci@1f,4000/scsi@3/disk@0,0:a

devalias rootmirror /pci@1f,4000/scsi@3/disk@1,0:a

Make sure you replace the sd@ with the disk@ near the end of the line.

Disksuite will only work with "disk@" and it will give you that error if you still have the sd@ there.
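
The same check can be made from the running OS, assuming eeprom(1M) is available, since eeprom nvramrc prints the same text as printenv nvramrc. A short sketch. devalias_paths_ok is a hypothetical helper name, and it only looks for the sd@ pattern described above:

```shell
# Read nvramrc text on stdin. Succeeds if no devalias line contains "sd@";
# otherwise prints the offending lines and returns 1.
devalias_paths_ok() {
    if grep 'devalias' | grep '/sd@'; then
        echo "found sd@ in a devalias; expected disk@" >&2
        return 1
    fi
    return 0
}

# Usage as root on the Solaris side:
#   eeprom nvramrc | devalias_paths_ok && echo "boot aliases look fine"
```

This says nothing about whether the aliases point at the right disks, only whether they use the disk@ form.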

Let me know if this fixes your problem.

annguyen251a at 2007-7-21 14:25:53
# 30

Sorry, but this is an old, abandoned, discussion thread.

The five Hardware Forums were incorporated into this Sun site on 30-Mar-06.

Five H/W forums were combined into the three you currently see.

All posting dates were 'renumbered' at that time.

The only thing you can be sure of is that the discussions were originally from

some time between Sept. '05 thru Mar '06.

Four years' worth of older threads were summarily deleted.

The original poster of this particular thread (Tarek) never returned

to tell any of us of the results of their Sun service cases.

rukbata at 2007-7-21 14:25:58