IPMP failover is slow.
I'm testing the IPMP failover event by unplugging all public interfaces on bunsen. It sits in a transitory state, with the resource group unavailable, for about 3 minutes before actually failing over to beaker. Is there a way to make this process faster?
-- Resource Groups and Resources --
            Group Name      Resources
            ----------      ---------
 Resources: soraya-public   soraya httpd

-- Resource Groups --
            Group Name      Node Name   State
            ----------      ---------   -----
     Group: soraya-public   beaker      Offline
     Group: soraya-public   bunsen      Pending offline

-- Resources --
            Resource Name   Node Name   State                      Status Message
            -------------   ---------   -----                      --------------
  Resource: soraya          beaker      Offline                    Offline
  Resource: soraya          bunsen      Online but not monitored   Degraded - IPMP Failure.
  Resource: httpd           beaker      Offline                    Offline
  Resource: httpd           bunsen      Stopping                   Unknown - Stopping
By robindixon at 2007-11-26 10:33:25

# 1
Never mind, I was being too lazy. Will look at the tunables :-)
# 2
I assume that soraya is the logicalHost resource. If so, it says that it isn't monitored. Did you enable monitoring on the resource? The quickest way would have been to do:
# scswitch -Z -g soraya-public
Then re-try the tests.
Tim
# 3
Yep, monitoring was enabled.... but not "probe based monitoring", as I don't have test IP addresses assigned to the IPMP interfaces in the group.
# 4
Ah, the lack of test addresses will make a difference. The fail-over you are seeing is probably being driven as a response to the application probe failing rather than the network failure being detected.
Try adding test addresses. It should be *very* quick. The failure of the IPMP probe should trigger a give-over (as opposed to a take-over) of the service [IIRC].
Tim
# 5
Hey Tim (or someone?),
I've been stuffing around for ages trying to get in.mpathd to enable IP-based probing. Can't find any examples clear enough for someone as stupid as me in the documentation. Could you help me out with an example working config for my setup?
Below is my current non-working config. I've almost finished work for today :-)
# fgrep "" /etc/host*
/etc/hostname.ipge0:beaker netmask + broadcast + group beaker_public -failover up \
/etc/hostname.ipge0:addif beaker-probe0 depreciated netmask + broadcast up
/etc/hostname.ipge1:beaker netmask + broadcast + group beaker_public up depreciated -failover standby
/etc/hosts:#
/etc/hosts:# Internet host table
/etc/hosts:#
/etc/hosts:127.0.0.1 localhost
/etc/hosts:72.5.124.61 beaker loghost # Cluster Node 1
/etc/hosts:72.5.124.62 beaker-probe0
/etc/hosts:72.5.124.63 beaker-probe1
/etc/hosts:72.5.124.64 bunsen # Cluster Node 2
/etc/hosts:72.5.124.65 bunsen-probe0
/etc/hosts:72.5.124.66 bunsen-probe1
/etc/hosts:72.5.124.67 soraya # Cluster Public
# 6
This doesn't look right to me. The hostname files should be:
hostname.ipge0
beaker netmask + broadcast + group beaker_public up \
addif beaker-probe0 depreciated -failover netmask + broadcast up
/etc/hostname.ipge1:
beaker-probe1 netmask + broadcast + group beaker_public -failover depreciated up
This is copying what I have in our lab.
Tim
# 7
Ok well this can now only be described as weird. I tried your exact config, and still during reboot:
Failed to configure IPv4 interface(s): ipge0
Hostname: beaker
Oct 6 08:14:47 in.mpathd[129]: No test address configured on interface ipge1; disabling probe-based failure detection on it
Oct 6 08:14:47 in.mpathd[129]: No test address configured on interface ipge0; disabling probe-based failure detection on it
ifconfig shows:
ipge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 72.5.124.61 netmask ffffff80 broadcast 72.5.124.255
groupname beaker_public
ipge0:1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 72.5.124.62 netmask ffff0000 broadcast 72.5.255.255
ipge1: flags=9000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER> mtu 1500 index 3
inet 72.5.124.63 netmask ffffff80 broadcast 72.5.124.255
groupname beaker_public
# 8
OK, that's probably because I've assumed you had the right netmask info in /etc/netmasks or NIS. Look at the netmasks you have on these adapters.
ipge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 72.5.124.61 netmask ffffff80 broadcast 72.5.124.255
                         ^^^^^^^^
groupname beaker_public
ipge0:1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 72.5.124.62 netmask ffff0000 broadcast 72.5.255.255
                         ^^^^^^^^
You need to add:
72.5.124.0 255.255.255.0
to /etc/netmasks.
That will probably fix the problem. Let me know how you get on.
Tim
# 9
It is on a VLAN where the netmask *must* be:
72.5.124.0 255.255.255.128
This was already set in /etc/netmasks.
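For anyone wanting to check the arithmetic, here is a minimal sketch using Python's standard ipaddress module (not something from the thread itself). It shows what a given netmask implies for the addresses listed in /etc/hosts above. The helper name describe is made up for illustration.

```python
import ipaddress

def describe(addr, netmask):
    """Return (hex netmask, network, broadcast) for addr under netmask."""
    iface = ipaddress.ip_interface(f"{addr}/{netmask}")
    # ifconfig on Solaris prints the netmask as 8 hex digits
    hex_mask = format(int(iface.netmask), "08x")
    return hex_mask, str(iface.network), str(iface.network.broadcast_address)

if __name__ == "__main__":
    hosts = [
        ("beaker", "72.5.124.61"),
        ("beaker-probe0", "72.5.124.62"),
        ("beaker-probe1", "72.5.124.63"),
    ]
    for name, addr in hosts:
        print(name, *describe(addr, "255.255.255.128"))
```

With 255.255.255.128 every address should come out as ffffff80 on network 72.5.124.0/25, so a logical interface showing ffff0000 did not pick up the /etc/netmasks entry. The broadcast address computed for that /25 is 72.5.124.127, which differs from the 72.5.124.255 that ifconfig reports, so that may be worth a second look as well.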
# 10
OK, I made an incorrect guess at your netmask. So this entry is in /etc/netmasks on all nodes and /etc/nsswitch.conf says:
netmasks: cluster files
If this is all correct, then what rev of the cluster patch have you got installed? It should be
117949-25 on Solaris 9 SPARC
Solaris 8 SPARC as patch 117950
Solaris 9 x86 as patch 117909
Solaris 10 SPARC as patch 120500
Solaris 10 x86 as patch 120501
What happens if you plumb in logical addresses manually? Do they get the right netmasks?
Tim
# 11
The nsswitch.conf has the correct netmasks entry.
The cluster patch rev is correct.
I managed to get the netmask correct by changing /etc/hostname.* files as follows:
# grep "" /etc/hostname.*
/etc/hostname.ipge0:beaker netmask + broadcast + group beaker_public up \
/etc/hostname.ipge0:addif beaker-probe0 netmask + broadcast up depreciated -failover
/etc/hostname.ipge1:beaker-probe1 netmask + broadcast + group beaker_public -failover depreciated up
# ifconfig ipge0
ipge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 72.5.124.61 netmask ffffff80 broadcast 72.5.124.255
groupname beaker_public
ether 0:14:4f:f:13:26
# ifconfig ipge0:1
ipge0:1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 72.5.124.62 netmask ffffff80 broadcast 72.5.255.255
# ifconfig ipge1
ipge1: flags=9000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER> mtu 1500 index 3
inet 72.5.124.63 netmask ffffff80 broadcast 72.5.124.255
groupname beaker_public
ether 0:14:4f:f:13:27
So even with correct netmasks, mpathd still disables probe based failure detection.
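One thing that can be read straight off the ifconfig output is the flag list. As far as I understand it, in.mpathd only treats an address as a test address if it is marked NOFAILOVER and is UP. The sketch below (Python, with made-up helper names parse_flags and usable_test_addresses) pulls the flag names out of the header lines and lists the interfaces that meet both conditions. It is a rough check under that assumption, not a reimplementation of what in.mpathd does.

```python
import re

# Matches header lines such as "ipge0:1: flags=1000842<BROADCAST,...> mtu 1500"
HEADER = re.compile(r"^(\S+): flags=[0-9a-fA-F]+<([^>]*)>")

def parse_flags(ifconfig_text):
    """Map each interface name to the set of flag names ifconfig printed."""
    result = {}
    for line in ifconfig_text.splitlines():
        m = HEADER.match(line.strip())
        if m:
            result[m.group(1)] = set(m.group(2).split(","))
    return result

def usable_test_addresses(flags_by_if):
    """Interfaces that are both UP and NOFAILOVER (assumed test-address criteria)."""
    return sorted(name for name, flags in flags_by_if.items()
                  if {"UP", "NOFAILOVER"} <= flags)

# Header lines copied from the ifconfig output in this post
SAMPLE = """\
ipge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
ipge0:1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
ipge1: flags=9000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER> mtu 1500 index 3
"""

if __name__ == "__main__":
    print(usable_test_addresses(parse_flags(SAMPLE)))
```

On the output above nothing qualifies: ipge0:1 carries neither UP nor NOFAILOVER, and ipge1 has NOFAILOVER but is not UP. That would be consistent with the "No test address configured" messages, and suggests the options in the hostname files are not all being applied.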
# 12
Have a look at SunSolve info docs 70062 and 86869 and make sure you have followed the instructions there. I've not seen these types of problems with IPMP before when the set up is correct.
Tim
# 13
I cut and pasted /etc/hostname.if's from Tim into a v240 single node cluster (bge interfaces), and had the same result:
Oct 11 08:00:53 in.mpathd[133]: No test address configured on interface bge1; disabling probe-based failure detection on it
Oct 11 08:00:53 in.mpathd[133]: No test address configured on interface bge0; disabling probe-based failure detection on it
Also of possible relevance:
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
ip_arp_done: init failed
************ THIS **************************
Failed to configure IPv4 interface(s): bge0
******* DOESNT SEEM RIGHT ************
To recap on what I am trying to do, using one of those ASCII diagrams to illustrate it:
example public address: 10.10.3.3
on the v240
- bge0 - active 10.10.3.3 >
group "hostname_public"
- bge1 - standby 10.10.3.3 -o
or on the t2000
-ipge0 - active 10.10.3.3 >
group "hostname_public"
- ipge1 - standby 10.10.3.3 o
Should this be doable with probe-based failure detection? Example hostname.if files that work would be nice :-)
I have followed all of the procedures in the above-mentioned articles, and even have ipfilter disabled.
# 14
> addif beaker-probe0 depreciated -failover netmask + broadcast up
There should be a + (plus sign) after broadcast here, and deprecated was not spelled correctly.
> /etc/hostname.ipge1:
> beaker-probe1 netmask + broadcast + group
> beaker_public -failover depreciated up
>
> This is copying what I have in our lab.
>
> Tim
>
# 15
Just shows I can't transcribe accurately. Unfortunately, I wasn't able to simply cut and paste, as they were in two different windowing systems. I checked and the original did have the correct spelling and the + (ho hum).
Tim
# 16
> Also of possible relevance:
>
> Copyright 1983-2006 Sun Microsystems, Inc. All
> rights reserved.
> Use is subject to license terms.
> ip_arp_done: init failed
> ************ THIS **************************
> Failed to configure IPv4 interface(s): bge0
> ******* DOESNT SEEM RIGHT ************
Looking for this error message in SunSolve - it appears that others have resolved this by updating their patch levels. Are you running old patch levels?
Your mention of IP filters also worried me - what other stuff did you have configured on the system? I think we need to roll this discussion back and check that you can get IPMP working on base Solaris first.
Thanks,
Tim
# 17
> This doesn't look right to me. The hostname files
> should be:
>
> hostname.ipge0
> beaker netmask + broadcast + group beaker_public up
> \
> addif beaker-probe0 depreciated -failover netmask +
> broadcast up
>
> /etc/hostname.ipge1:
> beaker-probe1 netmask + broadcast + group
> beaker_public -failover depreciated up
>
> This is copying what I have in our lab.
>
> Tim
>
You mean "deprecated". I don't believe "depreciated" is a legal option.
-- leon
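
Since the same two slips came up more than once in this thread, here is a small sketch (Python, not from any Sun tool) that checks an ifconfig argument string for them: the misspelling "depreciated" and a "broadcast" that is not followed by "+". The function name lint_hostname_line is made up, and the "+" rule is only an assumption matching the style used in this thread; an explicit broadcast address would also be flagged.

```python
def lint_hostname_line(text):
    """Return a list of problems found in one ifconfig argument string."""
    # Treat a trailing backslash continuation as plain whitespace
    tokens = text.replace("\\", " ").split()
    problems = []
    if "depreciated" in tokens:
        problems.append("'depreciated' should be 'deprecated'")
    for i, tok in enumerate(tokens):
        # In this thread's style, 'broadcast' is always followed by '+'
        if tok == "broadcast" and (i + 1 >= len(tokens) or tokens[i + 1] != "+"):
            problems.append("'broadcast' is not followed by '+'")
    return problems

if __name__ == "__main__":
    quoted = "addif beaker-probe0 depreciated -failover netmask + broadcast up"
    fixed = "addif beaker-probe0 deprecated -failover netmask + broadcast + up"
    for line in (quoted, fixed):
        print(line, "->", lint_hostname_line(line) or "ok")
```

The first string is the addif line as quoted above and trips both checks. The second applies the two corrections and passes.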