FC3510 controller failure.
Over the weekend, one of my FCs switched controllers because the primary failed. Looking at the logs, I see a battery charging and finally becoming fully charged. About a second later, there is:
Controller ALERT: Redundant Controller Failure Detected
Looking from the host, I see:
show redundancy
Primary controller location: Lower
Redundancy mode: Active-Active
Redundancy status: Failed
When I look at the back of the FC, the upper controller shows an amber light and little else is lit.
My question is: Can I reseat this controller with little impact, just to see if this will "kickstart" it? (I would like to try this before opening a support case.)
KJ
By EL_kj at 2007-11-26 11:28:51
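
For anyone who wants to watch for this condition from a script, the `show redundancy` output quoted above can be parsed as simple "key: value" lines. This is only a minimal sketch in Python: it assumes the output looks exactly like the text in the post, the function names `parse_redundancy` and `redundancy_failed` are made up for illustration, and how you capture the output from the array is left out.

```python
def parse_redundancy(text):
    """Parse 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # only keep lines that actually contain a colon
            fields[key.strip()] = value.strip()
    return fields


def redundancy_failed(fields):
    """Return True when the parsed status field reads 'Failed' (case-insensitive)."""
    return fields.get("Redundancy status", "").lower() == "failed"


# Sample text copied from the post above; real output may differ by firmware.
sample = """\
show redundancy
Primary controller location: Lower
Redundancy mode: Active-Active
Redundancy status: Failed
"""

fields = parse_redundancy(sample)
print(fields["Primary controller location"])
print(redundancy_failed(fields))
```

A check like this only reports what the array says about itself, so it is worth comparing against the LEDs on the back as well.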

# 2
The 3510 controller failure rate on any firmware is poor, to say the least. At best, I have had at least one controller fail in every three 3510s. Considering that, I would never even think about pulling out a failed controller and shoving it back in to see what happens.
9 times out of 10, Sun will tell you to upgrade your firmware to fix the controller failures, and more often than not the upgrade has caused more problems.
50% of the time the new controller arrives at a lower firmware revision than the current controller, and this adds even more heartache to the whole solution.
So, the moral of this story is: don't put any data on those things that you think is important. That goes especially for databases, because when the 3510 goes down in a screaming heap, your transaction/redo log is bound to be unusable. If you have to use them, ensure everything of importance is mirrored across separate systems.
We have had a number of five-day system outages caused by 351x controller failures in dual-controller 3510s and 3511s. I only use them as paperweights now, and they are good at that.
Stephen
# 3
I meant to reply to this days ago, but got caught up in a couple projects and it slipped my mind, sorry.
This whole process became quite a fiasco in the end. Ultimately I called Sun Support to get their input on the situation, which led to me simply "unfailing" the controller (regretfully, my SAN knowledge is still in its infancy, or I would have tried this on my own). At first, this appeared to have corrected everything. I sat and watched it doing its thing for 30 minutes or so until I went to lunch. When I came back from lunch (about 45 minutes after that), the world had fallen apart and BOTH of the controllers were in a faulted state, leaving a sub-mirror from every partition on that fibre in a "Needs Maintenance" or "Not Available" state.
When I called Sun Support back, they wanted me to do what I had already done by that point, which was to pull out the upper controller (the original failed one) and reboot. After I explained that I had done that without success, they wanted me to pull the lower controller out and try running off the upper. Strangely, even though this eventually ended up showing a blinking green state on the back of the FC, the "show fru" command showed the controller in a faulted state. No matter how I sliced this, both controllers took a dive and needed to be replaced. At least I was able to get limited console access to the failed controller and save all my settings.
Now I have a new question, which involves upgrading all my FCs. Since these devices are part of a cluster, do I ask here or in the cluster section?
KJ
EL_kj at 2007-7-7 3:44:36

# 4
Wow that sounds like a major $$%$@ :)
Touch wood, I haven't had a controller failure in either the 3510 or 3511. I must say learning how to drive the 3510 has been fun; I've been handed down the SAN administration role.
Thanks for the info I'll file that away in my mind for future reference :)
TheEsp
# 5
I express my sympathy at all this. I have often seen the 3510's internals conflict with each other, especially in show fru output. On a number of occasions I have seen show red showing both controllers online and active. However, when I do a show battery, one battery is missing, and no matter what you do, even replacing the battery, it refuses to accept that the battery is there. Doing a show fru also states that the battery module is missing and, to make things worse, that the controller where the battery board resides is missing too. On one occasion, the missing controller in show fru was the active one, both LEDs on the back were green, and the active controller was blinking green. Go figure. It's like the 3510 has no idea what it is doing.
The only way to fix this is to reset the 3510.
So, to your next question: what role will the 3510 play in your cluster? Considering the 3510 is so flaky, do you want it in a cluster environment that is supposed to never go down?
Regards
Stephen