Intermittent network freezes

I have a Sun Fire x2200M2 with 11 zones. I currently have three interfaces configured.

bge0 - Two public IPs and 11 private IPs (for the zones)

bge1 - One public IP, used only for access to the ILOM interface

nge0 - One private IP on a crossover cable to another Sun Fire x2100 M2 server, on a different private IP range than the bge0 addresses.

I am using ipf and ipnat to map ports on the public IPs to the private zones and ports. For example, public IP1 port 53 maps to zone 1 port 53 (master DNS), public IP2 port 53 maps to zone 2 port 53 (secondary DNS), and so on for other services such as MySQL and Apache.
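
For readers less familiar with this kind of setup, a port mapping like the one described above would typically be written as rdr rules in ipnat.conf. This is only a sketch: the addresses are placeholders from documentation ranges, not the actual configuration, which was not posted.

```
# /etc/ipf/ipnat.conf (hypothetical addresses, for illustration only)
# public IP1 port 53 -> zone 1 (master DNS)
rdr bge0 203.0.113.1/32 port 53 -> 192.168.1.1 port 53 tcp/udp
# public IP2 port 53 -> zone 2 (secondary DNS)
rdr bge0 203.0.113.2/32 port 53 -> 192.168.1.2 port 53 tcp/udp
```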

The odd problem I am having is that the network completely stops responding at random for anywhere from 30 seconds to 2 minutes. The server is running, and I can access it via ILOM (this interface ALWAYS works) and touch all the zones; however, I cannot ping out or come in on the public IPs on bge0. Then suddenly it starts responding again.

The last time this happened, I was able to log in via the console (ILOM) and force a core dump; however, I do not have the knowledge to debug it. I have the minimum support that I purchased with the servers for access to SunSolve patches (I think about $360 for three years). Does anyone have any idea what the problem might be, or what to look for in the core dump? If this is a hardware problem, I would like to get the server repaired; if it's software, I need to get it fixed.

This server runs a production webserver, so the intermittent network hangs are causing major problems.
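
Whatever the cause turns out to be, a timestamped record of when each freeze starts and ends may help, since it can be lined up against /var/adm/messages and the time of the core dump. A minimal sketch in Bourne shell follows. TARGET, LOG, and the function names are made up for illustration, the address is a placeholder, and the ping call assumes the Solaris form `ping host timeout`.

```shell
#!/bin/sh
# Hypothetical outage logger: notes when a probe starts failing and when it recovers.
# TARGET and LOG are placeholders, not values taken from this thread.
TARGET="${TARGET:-192.0.2.1}"
LOG="${LOG:-/var/tmp/net-outage.log}"

# probe: succeeds when TARGET answers within 2 seconds (Solaris ping syntax assumed).
probe() {
    ping "$TARGET" 2 >/dev/null 2>&1
}

# log_transition OLD NEW: append a timestamped line only when the state changes,
# then print NEW so the caller can store it as the current state.
log_transition() {
    old=$1
    new=$2
    if [ "$old" != "$new" ]; then
        ts=`date '+%Y-%m-%d %H:%M:%S'`
        echo "$ts $old -> $new" >> "$LOG"
    fi
    echo "$new"
}

# watch_net: poll every 5 seconds until interrupted.
watch_net() {
    state=up
    while :; do
        if probe; then cur=up; else cur=down; fi
        state=`log_transition "$state" "$cur"`
        sleep 5
    done
}
```

After sourcing the file, something like `TARGET=<an address beyond bge0> watch_net &` would leave it running in the background; each freeze should then show up in the log as an "up -> down" line followed by a "down -> up" line.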

[1590 byte] By [ITOa] at [2007-11-27 11:35:13]
# 1

I had this exact problem with my new x2200m2 (without a support contract). However, I was running the network on the nge0 interface. I also could not get ipfilter to work on that interface (in any sane way), so I just switched to the bge0 interface. Now it works perfectly.

Your comments make me believe the stalling had nothing to do with the interface. I have my machine patched up to the most recent level AND!!!! I blanked the disks and installed a fresh copy of Sol10 6/07 (or whatever the latest release is). Come to think of it, I did all of those things at the same time, so the fresh install probably cured the problems. Also, I did not install any of the software specific to the x2200m2 (whatever it is). I just installed Sol10 and ran the update manager until everything was up to date.

kjard_usa at 2007-7-29 17:01:35 > top of Java-index,Solaris Operating System,Solaris 10 Features...
# 2

Well, now that's disturbing. I had the problem with 06/07 fully patched (by running smpatch update under the online support contract) when I was first building the server.

I held off moving it into production because I knew Solaris 10 11/07 U3 was being released, and I had all my jumpstart scripts ready to deploy quickly once I had copied the media.

Anyway, I installed 11/07 and had to deploy to production; unfortunately, the problem has persisted.

Anyone else have any ideas? This is almost on par with the way it blows up the boot archive every time I patch the darn thing. Grrr! I have gotten very proficient at booting failsafe and running 'bootadm update-archive -R /a'.

Solaris is such a great OS, and I have been using it since SunOS 4.1.3; however, the x86 platform still needs more work. I have never had any of these problems on any SPARC platform.

ITOa at 2007-7-29 17:01:35
# 3

Did you install the OS yourself, or was it a factory install? Did you install any drivers? I guess it shouldn't matter, though.

I think you are right about x86 issues. I love the insane speed of the x2200m2 (and my Ultra 20 M2), but next time I will go back to SPARC (some of the new chips look very nice indeed). Something just feels fragile about x86 compared to SPARC.

I have also noticed that ImageMagick, which is a mess to begin with, does not behave nearly as well on x86 as it does on SPARC.

kjard_usa at 2007-7-29 17:01:35
# 4

Yes, I did the install myself both times. I have even downloaded the latest firmware driver CD for the x2200 and updated it.

Still no happiness in my life.

I agree. I purchased the extra proc for the X2200, so I basically have four procs and it is amazingly fast, but you just can't do some of the basics (like patch and reboot when the root drive is mirrored with SDS) without undergoing major work, like you can on the trusty ol' SPARC.

With all the time I have spent on this box, it would have been a better deal just to spend the extra $2500 and go with the T1. :/

ITOa at 2007-7-29 17:01:35
# 5

Ouch. Well, the only suggestion I have is to try using a different Ethernet port.

Also... I noticed when I was working on the LOM (yes, wow it is great) it said something about being able to share the port with the system (never investigated it). Perhaps there is some kind of conflict going on? Perhaps there is some kind of sharing violation occurring? I am really out of my depth here though.

My identical issue did stop, though, so keep the hope alive. Or just call Sun and swap it for a T2000 :)

kjard_usa at 2007-7-29 17:01:35