Errors after initial Sun Cluster install
- SunOS conch 5.10 Generic_118833-36 sun4u sparc SUNW,Sun-Fire-V210
- Sun Cluster 3.2
I've gone through the scinstall process using the standard answers to the questions. The only exception is quorum: I answered that I would set it up later, as I want to try the quorum server. There's no shared storage - I'm seeing if it's possible to create a cluster using IP-based replication.
I'm getting these error messages every 30 seconds (looks like a result of:
# svcs lrc:/etc/rc3_d/S91initgchb_resd
STATE          STIME    FMRI
legacy_run     16:19:29 lrc:/etc/rc3_d/S91initgchb_resd
#
)
Feb 8 16:38:59 conch Cluster.GCHB_resd: Unable to open door descriptor /var/run/rgmd_receptionist_door
Feb 8 16:38:59 conch Cluster.GCHB_resd: GCHB system error: scha_cluster_open failed with 18
Feb 8 16:38:59 conch : Bad file number
Feb 8 16:39:29 conch Cluster.GCHB_resd: Unable to open door descriptor /var/run/rgmd_receptionist_door
Feb 8 16:39:29 conch Cluster.GCHB_resd: GCHB system error: scha_cluster_open failed with 18
Feb 8 16:39:29 conch : Bad file number
Feb 8 16:39:59 conch Cluster.GCHB_resd: Unable to open door descriptor /var/run/rgmd_receptionist_door
Feb 8 16:39:59 conch Cluster.GCHB_resd: GCHB system error: scha_cluster_open failed with 18
Feb 8 16:39:59 conch : Bad file number
Feb 8 16:40:29 conch Cluster.GCHB_resd: Unable to open door descriptor /var/run/rgmd_receptionist_door
Feb 8 16:40:29 conch Cluster.GCHB_resd: GCHB system error: scha_cluster_open failed with 18
Feb 8 16:40:29 conch : Bad file number
There are no file system errors, and I'm at a complete loss as to why this problem appears. Can anyone offer any advice?
Cheers,
Iain
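To confirm the 30-second cadence rather than eyeballing it, the gap between successive failures can be computed from the timestamps. This is only a sketch: `gap_seconds` is a made-up helper name, it assumes the standard syslog layout shown above (time in the third field), and it ignores wrap-around at midnight.

```shell
# Print the gap in seconds between successive matching syslog lines on stdin.
# $1: fixed string to look for (defaults to the GCHB failure text quoted above).
gap_seconds() {
  awk -v pat="${1:-scha_cluster_open failed}" '
    index($0, pat) {
      split($3, t, ":")                      # $3 is HH:MM:SS
      now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
      if (seen) print now - prev
      prev = now; seen = 1
    }'
}

# On the affected node (assuming syslog writes to the usual Solaris location):
#   gap_seconds < /var/adm/messages
```

A steady stream of identical gaps points at a timer-driven retry in one daemon rather than at sporadic failures.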
# 1
Just to check, since you only mention Sun Cluster 3.2: the error messages you cite come from components used within Sun Cluster Geo Edition, specifically from the geo heartbeat component.
So are you trying to set up a Geo cluster?
If so, you should first finish the Sun Cluster setup, and then give some details on what you did to set up the Geo Edition.
Greets
Thorsten
# 2
It's the Sun Cluster download from Sun's website, and I suspect the Geo component was installed by virtue of installing the Availability Suite component (I'm now thinking that might not actually be necessary after all, if it's for IP-based disk replication between clusters as opposed to between nodes in a cluster). But in terms of the Sun Cluster install, I've literally run through the scinstall program and done nothing else.
Iain
# 3
Iain,
Why would you need IP-based replication inside a cluster? That doesn't make sense to me. Unless you chose to install Sun Cluster Geo Edition, it shouldn't get installed. You can use the prodreg command to browse the registry of installed programs and see what Solaris thinks is installed from the JES/JAS set.
If Geo Edition is installed and you only need a single cluster, you probably want to remove it.
Regards,
Tim
# 4
Re: IP-based replication inside a cluster - it's an experiment to see whether a cluster can be built without shared storage, using replication to keep the data up to date on the backup node. I'm seeing if you can build a cluster without spending loads of cash, especially since the data to be replicated is only going to be a few megabytes and I don't really want to pay for expensive (as in price per MB used) shared storage.
That said, the lack of shared storage probably breaks basic cluster design (!) and I know there will be other issues to do with cluster resiliency etc. This is all about seeing if it can be done or not, and I'm beginning to think that it *can't* be done ....
Iain
# 5
OK, this now makes sense.
The way to achieve this is to use Sun Cluster 3.2. You will need your two primary cluster nodes and a 3rd node to act as a quorum server. The latter just needs to be a very cheap machine capable of running Solaris 10.
You can then set up Sun Network Data Replicator (SNDR), which is part of Availability Suite, to replicate the data between the cluster nodes. This should work without problems. No Sun Cluster Geo Edition is needed.
This is very much like what Sun's telco HA solution does.
Regards,
Tim
# 6
Hi,
there are 2 issues here.
1. The error messages that you see. I get them on my freshly installed cluster as well. What did I do? I used the JES installer and installed SC 3.2 and SC Geo 3.2, to be configured later. I think that it should only install the packages but not configure any part of them. It seems that it does otherwise. To me GCHB sounds like global cluster heartbeat. I'll follow up with the developers to get this clarified.
2. Replication within a cluster and no shared storage. This has several aspects. I, too, see more and more customer demand for this. If you get it to work, let us know. I am not sure, though, why you installed the SC Geo Edition to achieve this, as I do not think it will help you here.
In any case I can only recommend setting up the quorum server before proceeding; otherwise your whole cluster will panic as soon as you do a single reboot. That is by design.
Regards
Hartmut
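The reasoning behind the quorum advice can be sketched with simple vote arithmetic. This is an illustration only, assuming the usual majority rule (more than half of the configured votes must be present), one vote per node and one vote for the quorum server, and ignoring install mode. The function names are hypothetical and are not Sun Cluster commands.

```shell
# Votes required for quorum: a strict majority of all configured votes.
votes_needed() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum PRESENT TOTAL: succeeds if PRESENT votes form a majority of TOTAL.
has_quorum() {
  [ "$1" -ge "$(votes_needed "$2")" ]
}

# Two nodes, no quorum device: 2 votes configured, 1 present after a node leaves.
has_quorum 1 2 && echo "2 votes total, 1 present: quorum kept" \
               || echo "2 votes total, 1 present: quorum lost"

# Two nodes plus a quorum server: 3 votes configured, 2 present after a node leaves.
has_quorum 2 3 && echo "3 votes total, 2 present: quorum kept" \
               || echo "3 votes total, 2 present: quorum lost"
```

Under these assumptions the first case reports quorum lost and the second reports quorum kept, which is why a two-node cluster wants a third vote before either node is rebooted.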
# 7
Thanks for the comments Tim, I'm glad to hear this idea isn't unreasonable! This raises the question: how do you get Availability Suite? Is it a product in its own right, or is it Sun Java Availability Suite? Linking to Sun Java Availability Suite via http://www.sun.com/software/swportfolio/get.jsp (eventually) leads to a Sun Cluster 3.1 download. At a previous job, I remember having an Availability Suite 3.2 CD, but I'm hoping that it can be downloaded from somewhere. Any ideas?
Iain
# 8
Iain,
I would guess it was in the availability suite bundle somewhere. As far as I can tell you cannot download it separately any more.
Tim
# 9
Hi,
you got confused, which is no surprise when the marketing folks use the same name for different products. What you are looking for is this:
http://www.sun.com/storagetek/management_software/data_protection/availability/
It is the StorageTek Availability Suite, consisting of a snapshot component and a replication component. I did not see it on the external download site, and I know that it is a product that has to be licensed separately. I am pretty sure that it is not part of the Java Availability Suite, which is a subset of the Java Enterprise System and covers Sun Cluster and the Sun Cluster Geographic Edition.
Availability Suite would replicate volumes. If you only have a couple of megabytes to replicate, could you think of another way of doing this?
# 10
Is this Solaris 6/06?
I had the EXACT same message about rgmd_door.
I noticed that the cluster/rgm service wasn't starting (and wouldn't start).
I started it manually and that message went away, but the cluster was still fouled up.
I then noticed that the cluster wasn't booting to its milestone, and that was because system/pool isn't there... it's not available before 11/06.
So now I am screwed... my copy of SC 3.1 only supports Solaris 8 and 9, I have a fouled-up 11/06 image and too slow an internet connection to download another one, and my SC 3.2 won't work with what I have now...
# 11
I checked with engineering and got the explanation of where the GCHB messages are coming from.
1. They are harmless and do NOT indicate any problem with the cluster.
2. They will go away with a patch some time in the future. It is a kind of race condition between various services.
3. If you install the Sun Cluster Geographic Edition packages, which is what you have done and what I did by explicitly checking the box in the JES installer, SC Geo Edition will start its own heartbeat. This is so that another cluster already running SC Geo Edition can contact this cluster without any manual configuration at the beginning.
4. Manually starting any services does not solve this problem.
Hope that helped
Hartmut
# 12
Instead of using Availability Suite, I guess there's rdist or ufsdump/ufsrestore. I think AVS is a nice, neat way of replicating the data for the cluster in real time, but getting hold of it is being a pain in the backside! From that page, there's no link to download the software (as far as I can tell), and even looking at http://www.sun.com/software/downloads/ there's nothing that stands out as being the actual package itself. According to the release notes, I should be getting:
SUNWscmr
SUNWscmu
SUNWspsvr
SUNWspsvu
SUNWiir
SUNWiiu
SUNWrdcr
SUNWrdcu
Bit of a long shot, but can these packages be downloaded individually?
Iain
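One way to see which of the packages listed above are already on a node is to compare the list against `pkginfo` output. The helper below is a minimal sketch: `missing_pkgs` is a made-up name, not a Solaris tool, and it assumes the default `pkginfo` listing with the package name in the second column.

```shell
# Read a `pkginfo` listing on stdin and print each package named in the
# arguments that does not appear in it.
missing_pkgs() {
  installed=$(awk '{ print $2 }')        # second column holds the package name
  for pkg in "$@"; do
    printf '%s\n' "$installed" | grep -qx "$pkg" || echo "$pkg"
  done
}

# On a Solaris host (not run here):
#   pkginfo | missing_pkgs SUNWscmr SUNWscmu SUNWspsvr SUNWspsvu \
#                          SUNWiir SUNWiiu SUNWrdcr SUNWrdcu
```

An empty result would mean every listed package is already installed.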
# 13
I don't think so. They are either in the larger bundles or they aren't there at all. I can't see them in the JAS suite, so maybe they aren't available for download.
Tim
# 14
Hmmm, looks like ?00 to purchase the media and around ?k for 12 months support (up to 1TB). Still cheaper than a shared array or HBAs to connect to the SAN, so it's definitely an option. I'll continue investigating other methods to see if there are other ways.
Thanks for all your comments on this.
Iain
# 15
Thanks to j2k4real and HartmutS for their comments about the messages I've been seeing.
I've made some progress - I've got a 2-node cluster set up with Sun Cluster 3.2 running on Sol 10 03/05, fully patched - and I'm trying to set up a quorum server. According to the Sun Cluster Reference Manual, this is done by using 'clsetup'. However, when I run 'clsetup', I see the following:
# ./clsetup
>>> Initial Cluster Setup <<<
This program has detected that the cluster "installmode" attribute is
still enabled. As such, certain initial cluster setup steps will be
performed at this time. This includes adding any necessary quorum
devices, then resetting both the quorum vote counts and the
"installmode" property.
Please do not proceed if any additional nodes have yet to join the
cluster.
Is it okay to continue (yes/no) [yes]?
Unable to establish the list of cluster nodes.
Press Enter to continue:
#
Every time I run 'clsetup' I see the following on the console:
Feb 15 16:57:41 whelk Cluster.CCR: Unable to open door descriptor /var/run/rgmd_receptionist_door
Feb 15 16:57:41 whelk last message repeated 1 time
Feb 15 16:57:43 whelk Cluster.RGMPMF.lib: Unable to open door descriptor /var/run/rgmd_receptionist_door
I've read a little about the "installmode" attribute but I'm not sure how to change it or if that's even possible. I also noticed the following:
# svcs | grep cluster | grep offline
offline        15:33:39 svc:/system/cluster/scslm:default
offline        15:33:39 svc:/system/cluster/rgm:default
offline        15:33:39 svc:/system/cluster/cl-svc-cluster-milestone:default
offline        15:33:39 svc:/system/cluster/scsymon-srv:default
offline        15:33:39 svc:/system/cluster/scslmclean:default
offline        15:33:39 svc:/system/cluster/rpc-fed:default
offline        15:33:39 svc:/system/cluster/sckeysync:default
#
# ./scconf -p | grep install
Failed to get node zone list
Failed to get node zone list
Cluster install mode: enabled
#
I don't know where the 'Failed to get node zone list' comes from, but it's present on standard error and I sometimes see it when I run clsetup.
So I guess the next question is: given the above, how do I set up a quorum server for the cluster?
Iain
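When several cluster services sit offline like this, it can help to pull out just their FMRIs and ask SMF why each one is not running. The filter below is a sketch that assumes the standard three-column `svcs` listing (STATE, STIME, FMRI); `offline_cluster_services` is a hypothetical helper name.

```shell
# Read `svcs` output on stdin and print the FMRI of every offline service
# under svc:/system/cluster/.
offline_cluster_services() {
  awk '$1 == "offline" && index($3, "svc:/system/cluster/") == 1 { print $3 }'
}

# On a cluster node (not run here):
#   svcs -a | offline_cluster_services
#   svcs -x svc:/system/cluster/rgm:default   # explain why it is not running
#   svcs -d svc:/system/cluster/rgm:default   # list the services it depends on
```

An unsatisfied dependency, such as the missing system/pool service mentioned earlier in the thread, would be the kind of thing the `svcs -x` and `svcs -d` output should surface.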
# 16
Using Live Upgrade, I mounted a Sol 10 11/06 .iso and upgraded my install to the latest release, repatched, re-installed Sun Cluster, and now it's working fine. Since then, I've found documents on the web stating that Sun Cluster 3.2 only works with Sol 10 11/06 - which is probably why the error messages above were coming up!
So now I'm quite happy, and impressed with how easily Live Upgrade worked!
Iain
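A quick pre-install check along these lines might have caught the mismatch earlier. This is a sketch under stated assumptions: it takes the 11/06 minimum from the documents mentioned above, assumes the first line of /etc/release carries the release date as month/two-digit-year (for example "Solaris 10 11/06 ..."), and `release_ok` is a hypothetical helper name.

```shell
# release_ok LINE: succeed if the month/year release date found in LINE is
# 11/06 or later, fail with status 1 if it is earlier, 2 if none is found.
release_ok() {
  printf '%s\n' "$1" | awk '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9][0-9]?\/[0-9][0-9]$/) {
          split($i, d, "/")                 # d[1] = month, d[2] = two-digit year
          ok = (d[2] * 100 + d[1] >= 611)   # 611 encodes 11/06
          exit (ok ? 0 : 1)
        }
      exit 2                                # no release date found on this line
    }'
}

# On the node itself (not run here):
#   release_ok "$(sed -n 1p /etc/release)" || echo "Solaris release older than 11/06"
```

The comparison treats the two-digit year as 20xx, which is adequate for Solaris 10 releases but not a general date parser.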
# 17
Hi,
the error messages you saw seemed to be harmless, but it's good that you solved most of the issues.
More information on the StorageTek Availability Suite for replicating data: I checked, and it cannot be downloaded as a product. You can either buy it (I think you checked that already) or go to the OpenSolaris pages: http://opensolaris.org/os/project/avs/
But the packages available there will not install on S10, for technical reasons. On the other hand, Sun Cluster will not work with OpenSolaris at the moment, only with S10U3.
Hartmut