DHCP HA agent fails to start DHCP (Sol 10 11/06 SC3.2)

I've set up a two-node Sol 10 11/06 cluster (fully patched, including 120986-10, 121010-05, 124916-02, and 124918-02) and included the DHCP HA agent. I don't have shared storage, so I'm using a quorum server on another Sol 10 server. I've gone through the cluster configuration process and brought the dhcp resource group (dhcp-rg) under managed control, but there's an error when I run:

clresourcegroup online dhcp-rg

Error:

--
Feb 21 08:55:10 whelk SC[SUNWscdhc.start_dhcp]:dhcp-rg:SUNWfiles: start_dhcp - /usr/lib/inet/in.dhcpd -i failed
/usr/lib/inet/in.dhcpd: option requires an argument -- i
Unknown option: ?
in.dhcpd:
Common: [-d] [-v] [-i interface, ...] [-h hops] [-l local_facility]
Server: [-n] [-t rescan_interval] [-o DHCP_offer_TTL]
[ -b automatic | manual]
Relay Agent: -r IP | hostname, ...
/usr/lib/inet/in.dhcpd: option requires an argument -- i
Unknown option: ?
--

I believe this error is the result of the cluster trying to start DHCP by running "/usr/lib/inet/in.dhcpd -i", instead of the correct way "svcadm enable -t network/dhcp-server:default" (as per /etc/init.d/dhcp). Since the config information lives in /etc/inet/dhcpsvc.conf, there's no reason to set anything on the command line.

Looking at /opt/SUNWscdhc/bin/functions, I think the error comes from line 605, where DHCP is started (incorrectly for Sol 10). I've corrected this by making the following changes to 'functions' (albeit crudely):

605,613c605,606
< # which version of Solaris?
< SOL=`uname -r | cut -d"." -f2`
< if [ ${SOL} -eq "10" ]; then
< /usr/sbin/svcadm enable -t network/dhcp-server:default
< St=$?
< else
< ${INDHCPD} ${USED_ADAPTER}
< St=$?
< fi
---
> ${INDHCPD} ${USED_ADAPTER}
> St=$?

I'm sure there are other cases to consider elsewhere in this file, and a proper fix would need to come from Sun, assuming this is indeed the root cause of the error messages shown above.

Once the above change is made (including the appropriate change to stop_dhcp) and I run 'clresourcegroup online dhcp-rg', I can see that:

- the logical host name resource is started

- the resource group changes to 'Pending online' and then 'Online'

- the resource changes to 'Starting'

The DHCP server starts briefly, but then shuts down and the cluster attempts to fail over to the second node. In /var/adm/messages I see:

Feb 21 12:11:20 whelk Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource dhcp-host-res state on node whelk change to R_ONLINE

Feb 21 12:11:21 whelk Cluster.PMF.pmfd: [ID 887656 daemon.notice] Process: tag="dhcp-rg,dhcp,0.svc", cmd="/bin/sh -c /opt/SUNWscdhc/bin/start_dhcp -R dhcp -G dhcp-rg -N 10.0.1.0@1/10.0.1.0@2 ", Failed to stay up.

I know that if I start DHCP manually by running 'svcadm enable -t network/dhcp-server:default' it works properly, so what I suspect here is that Sun Cluster is failing to detect an error-free start of DHCP, and then shuts it down again. My knowledge of cluster is limited though, so this may not be correct.

Any advice? Should I log this as a bug with Sun?

Iain

[3353 byte] By [iainfirkinsa] at [2007-11-26 19:09:39]
# 1

Iain,

Assuming you've followed the documentation, I would log a bug. I've not tried the HA-DHCP service for a couple of years now so I can't comment on the error. I do vaguely remember seeing something like this at the time and it turned out that I hadn't followed the documentation properly though.

Tim

Tim.Reada at 2007-7-9 21:04:45 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2

The only deviation from the documentation I've made is not registering a shared disk resource (I don't have Availability Suite yet), but I understand this is not a prerequisite for a working cluster. I take on board the point that following the documentation should in most cases give you a working cluster, but fundamentally I think the HA agent is broken. I'm also aware that SC3.2+Sol10 11/06 is barely a couple of months old and there are likely to be some teething problems.

Any advice about the most appropriate category to file this bug under? On http://bugs.sun.com, there is no appropriate 'Product/Category' that SC falls under that I can see!

Iain

iainfirkinsa at 2007-7-9 21:04:45
# 3

I don't think it has anything to do with SC3.2 - it's probably the same agent that was in SC3.1. I don't have a good explanation of why it didn't just work. If you are logging bugs, then it should be logged under suncluster, but as far as I can tell, that web interface doesn't have anything other than Java related categories. You'll need to call Sun service.

Tim

Tim.Reada at 2007-7-9 21:04:45
# 4

Actually I disagree with your analysis. The documentation for the HA DHCP agent (http://docs.sun.com/app/docs/doc/819-3058/6n5a9h34s?a=view) states that on S10 you have to specifically disable the DHCP SMF service via "svcadm disable dhcp-server" on all nodes.

The agent implements its own logic to start, stop, and probe DHCP, so the dhcp-server SMF service is not used when in.dhcpd is under Sun Cluster control.

The message you see, "/usr/lib/inet/in.dhcpd: option requires an argument -- i", indicates that the variable ${USED_ADAPTER} is either not set or set incorrectly.

Could you provide the content of the /opt/SUNWscdhc/util/dhcp_config file that you used to register your specific resource?

The NETWORK setting is of particular interest, because it is used to set up ${USED_ADAPTER}.
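
The failure described above, where an unset adapter leaves a dangling "-i" on the in.dhcpd command line, could be caught before the daemon is launched. The following is a minimal sketch only and is not the agent's actual code: the function name build_interface_option is hypothetical, and it assumes the resolved adapter name is passed in as a single argument.

```shell
#!/bin/sh
# Hypothetical helper (not part of SUNWscdhc): build the "-i <adapter>"
# option for in.dhcpd, refusing to emit a bare "-i" when no adapter
# name was resolved.
build_interface_option() {
    adapter=$1
    if [ -z "${adapter}" ]; then
        echo "no adapter resolved; check NETWORK in dhcp_config" >&2
        return 1
    fi
    echo "-i ${adapter}"
}
```

A caller could then stop early instead of running in.dhcpd with an incomplete option, for example by checking the function's exit status before starting the daemon.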

Of course feel free to go through Sun service support in order to debug this further.

Greets

Thorsten

Thorsten.Frueaufa at 2007-7-9 21:04:45
# 5

Thanks for the comments Thorsten. I was beginning to suspect that SC and SMF could be conflicting with each other in terms of the DHCP service, so your comments reinforce that idea.

I went back to dhcp_config and specifically the NETWORK setting. Looking at some other examples, it does look like I didn't have the correct setting here. So I changed it to '10.0.1.0@sc_ipmp0@1/10.0.1.0@sc_ipmp0@2', re-registered the service, but I was still getting the same error message as before.
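
The NETWORK value above appears to be a "/"-separated list of network@ipmp_group@node_id entries. That layout is inferred from the examples in this thread rather than taken from the agent documentation, so treat the following as a sketch under that assumption. The function name entries_for_node is hypothetical; it simply lists which network and IPMP group a given node id would pick up.

```shell
#!/bin/sh
# Split a NETWORK value into its "/"-separated entries and print the
# network and IPMP group of every entry that belongs to the given node id.
# Assumed entry layout (inferred from the thread): network@group@nodeid
# Note: on Solaris 10 use nawk or /usr/xpg4/bin/awk, since the old
# /usr/bin/awk does not support -v.
entries_for_node() {
    network_value=$1
    node_id=$2
    echo "${network_value}" | tr '/' '\n' |
        awk -F'@' -v node="${node_id}" 'NF == 3 && $3 == node { print $1, $2 }'
}

# Prints: 10.0.1.0 sc_ipmp0
entries_for_node '10.0.1.0@sc_ipmp0@1/10.0.1.0@sc_ipmp0@2' 1
```

An entry written in the older two-field form (such as 10.0.1.0@1) has only two fields and is skipped by the NF == 3 check, which makes a malformed NETWORK value easy to spot.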

Now I know that I'm using 'ipmp', but it seems that the code in functions wasn't detecting that properly. Going through functions, I think the problem is on line 393:

elif [ `pkginfo -l SUNWscgds | grep VERSION | awk '{print $2}' | cut -d'.' -f1-2` = "3.1" ]

In SC 3.2, the VERSION string in SUNWscgds is 3.2. Therefore, this code only works if you're running SC 3.1. I can fix this by changing 3.1 to 3.2 in the line above, but I suspect a better solution would be to cater for both 3.1 and 3.2 in a single statement. With this change, DHCP is now running and I can fail it between cluster nodes.
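
One way to cater for both releases in a single statement is a case pattern instead of a string equality test. This is a sketch, not Sun's fix: the function name is hypothetical, and it assumes the only releases that should take this branch are 3.1 and 3.2.

```shell
#!/bin/sh
# Return success when the given SUNWscgds VERSION string belongs to
# release 3.1 or 3.2 (the only releases this sketch accepts).
# The patterns also match longer strings such as "3.2.0,REV=...",
# so no cut(1) step is needed beforehand.
is_supported_gds_version() {
    case "$1" in
        3.1|3.1.*|3.2|3.2.*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a cluster node the version could be read much as in the original line:
#   ver=`pkginfo -l SUNWscgds | grep VERSION | awk '{print $2}'`
#   is_supported_gds_version "${ver}" && echo "take the same branch as 3.1"
```

A case pattern avoids relying on numeric comparison of a dotted version string, at the cost of needing a new pattern for each future release.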

What doesn't look right to me, though, is that DHCP binds to the physical interface (the code in functions returns 'bge0') rather than the logical interface. A WinXP laptop on the same subnet as the cluster fails to renew its lease after the cluster has failed over to the other node, because the DHCP server IP address is the physical address of the node rather than the logical address of the service. A WinXP laptop on a different subnet (with the helper app set to the logical IP address for DHCP) fails to get an IP address at all. I can see the DHCP DISCOVER packet coming through to the live cluster node, but because DHCP is bound to a different interface it fails to pick it up and send an address back. The '-i' is overriding the INTERFACES setting in /etc/inet/dhcpsvc.conf (which is set to the logical interface). If I manually set ADAPTER in functions to "bge0:1", DHCP then binds to the correct interface and does send out IP addresses.

Iain

iainfirkinsa at 2007-7-9 21:04:45
# 6

Iain

This is a bug, so I've filed one for you against the DHCP SC 3.2 agent. You are correct in your analysis that line 393 is in error. The workaround I've suggested in CR 6527537 is as follows:

bash-3.00# diff -w functions_old functions
393c393
< elif [ `pkginfo -l SUNWscgds | grep VERSION | awk '{print $2}' | cut -d'.' -f1-2` = "3.1" ]
---
> elif [ `pkginfo -l SUNWscgds | grep VERSION | awk '{print $2}' | cut -d'.' -f1-2` -ge "3.1" ]

Please note that the DHCP network table contains the "Server IP" and that after the DHCP service fails over from one cluster node to another the DHCP network table is updated so that the "Server IP" is now the new physical server. You can verify this using

pntadm -L

pntadm -P <network>

While I tend to agree that this will prevent the lease from being renewed after the DHCP server has failed over, it will nevertheless allow allocation of IP addresses.
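
To make the pntadm check above less manual, the output of pntadm -P can be filtered for records whose "Server IP" is not the address you expect. This is a rough sketch: the function name is hypothetical, and it assumes data rows are whitespace-separated with the client IP in column 3 and the server IP in column 4, so check that against the header your pntadm prints.

```shell
#!/bin/sh
# Read `pntadm -P <network>` output on stdin and print "client_ip server_ip"
# for every record whose Server IP differs from the expected address.
# Assumption: Client IP is column 3 and Server IP is column 4 on data rows.
# Header and blank lines are skipped because column 3 is not an IP there.
# Note: on Solaris 10 use nawk or /usr/xpg4/bin/awk for -v support.
records_with_other_server() {
    expected=$1
    awk -v want="${expected}" '
        $3 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $4 != want { print $3, $4 }
    '
}
```

It could be run after a failover as, for example, pntadm -P 10.0.2.0 | records_with_other_server 10.0.1.100, where 10.0.1.100 is the physical address of the node now hosting the service. No output would mean every record points at that node.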

Please note that currently in.dhcpd has to be configured to listen on the physical interface and currently does not support virtual/logical interfaces. Please search on SunSolve for "in.dhcp virtual interface" for more information, although there is an RFE open for this.

Once DHCP server supports logical interfaces it will make the above easier.

Regards

Neil

neil_garthwaitea at 2007-7-9 21:04:45
# 7
Sorry, I had a typo; it should be "in.dhcpd virtual interface", i.e. please search on SunSolve for "in.dhcpd virtual interface" for more information, although there is an RFE open for this.
neil_garthwaitea at 2007-7-9 21:04:45
# 8

Thanks for the info about logging the bug Neil. I'll manually make the change here to 'functions'.

In terms of in.dhcpd binding to a logical interface, what I've got running here contradicts the document on SunSolve. As part of my testing, I've manually set 'ADAPTER' to 'bge0:1', and when I enable the group it does bind to the logical interface and it does respond to DHCP requests. Outside of SC, if I set 'INTERFACES' in dhcpsvc.conf to 'bge0:1', it binds to the logical interface and responds to DHCP requests.

This can be seen with the following:

root@whelk# ps -ef | grep dhcp
root 11129   937  0 11:44:04 ?        0:00 /bin/sh -c /opt/SUNWscgds/bin/gds_probe -R dhcp-res -T SUNW.gds:6 -G dhcp-rg
root 11071     1  0 11:43:57 ?        0:01 /usr/lib/inet/in.dhcpd -i bge0:1
root 11130 11129  0 11:44:04 ?        0:03 /opt/SUNWscgds/bin/gds_probe -R dhcp-res -T SUNW.gds:6 -G dhcp-rg

root@whelk# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
bge0: flags=9000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER> mtu 1500 index 2
        inet 10.0.1.100 netmask ffffff00 broadcast 10.0.1.255
        groupname sc_ipmp0
        ether 0:3:ba:92:2e:55
bge0:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 2
        inet 10.0.1.101 netmask ffffff00 broadcast 10.0.1.255

root@whelk# snoop 10.0.1.101

Using device /dev/bge0 (promiscuous mode)

10.0.2.254 -> 10.0.1.101 DHCP/BOOTP DHCPDISCOVER

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPOFFER

10.0.2.254 -> 10.0.1.101 DHCP/BOOTP DHCPREQUEST

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

10.0.2.181 -> 10.0.1.101 DHCP/BOOTP DHCPINFORM

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

10.0.2.181 -> 10.0.1.101 DHCP/BOOTP DHCPINFORM

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

10.0.2.181 -> 10.0.1.101 DHCP/BOOTP DHCPINFORM

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

10.0.2.181 -> 10.0.1.101 DHCP/BOOTP DHCPINFORM

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

10.0.2.181 -> 10.0.1.101 DHCP/BOOTP DHCPINFORM

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

10.0.2.181 -> 10.0.1.101 DHCP/BOOTP DHCPINFORM

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

10.0.2.181 -> 10.0.1.101 DHCP/BOOTP DHCPINFORM

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

10.0.2.181 -> 10.0.1.101 DHCP/BOOTP DHCPINFORM

10.0.1.101 -> 10.0.2.254DHCP/BOOTP DHCPACK

root@whelk# tail -1 /var/dhcp/SUNWfiles1_10_0_2_0

10.0.2.181|01001560C8B8FF|00|10.0.1.100|1172220588|4503881102347206794|whelk|

The DHCP server address on the lease on the laptop is the logical address (10.0.1.101) rather than the physical address. I've set a short lease for the testing (3 mins), and I can see a steady stream of DHCPINFORM/DHCPACK packets between the laptop and DHCP server.

When the cluster fails over, the DHCP requests start going to the other node, but still on the logical address. The first few attempts at renewing the lease fail, but after about 8 or so DHCPNAK responses from the server, the steady stream of DHCPINFORM/DHCPACK continues. During this time, the laptop doesn't lose connectivity to the network (seen with a constant ping) and continues to renew the lease.

Therefore, as far as I can see, DHCP does work with a logical interface.

The only problem I've run into is the field containing the physical address of the DHCP server in the lease table. I found that the table containing the same address range as the physical server did update properly, but another table didn't (i.e. the 10.0.1.0 table did update but 10.0.2.0 didn't). This led to a stream of DHCPREQUEST packets from the laptop to the DHCP server, which the server ignored. This continued until the lease expired, at which point the laptop disappeared from the network. Running a quick 'sed' to change this field to the correct address immediately fixed the problem.
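
For reference, the kind of edit described above can be sketched as a field-level rewrite. This version uses awk rather than sed so that only the server field is touched; it assumes the server address is the 4th "|"-separated field, as in the record shown earlier. The function name and the replacement address 10.0.1.102 are hypothetical, chosen for illustration only.

```shell
#!/bin/sh
# Rewrite the server address (assumed to be the 4th "|"-separated field,
# as in the record shown earlier) from OLD to NEW. Records whose 4th
# field is not OLD are passed through unchanged.
# Note: on Solaris 10 use nawk or /usr/xpg4/bin/awk for -v support.
rewrite_server_field() {
    old=$1
    new=$2
    awk -F'|' -v OFS='|' -v old="${old}" -v new="${new}" '
        $4 == old { $4 = new }
        { print }
    '
}

# Prints: 10.0.2.181|01001560C8B8FF|00|10.0.1.102|1172220588|4503881102347206794|whelk|
echo '10.0.2.181|01001560C8B8FF|00|10.0.1.100|1172220588|4503881102347206794|whelk|' |
    rewrite_server_field 10.0.1.100 10.0.1.102
```

The function reads from stdin and writes to stdout, so the table file itself is only replaced if you redirect the result to a copy and move it into place; keeping a backup of the original table first would be prudent.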

Iain

iainfirkinsa at 2007-7-9 21:04:45
# 9

Iain,

I understand what you are saying; however, my point was that the DHCP server currently does not support logical interfaces, which is a different matter from whether it happens to work.

I guess the driver for DHCP server to support logical interfaces can be derived from the request to run DHCP in a zone, where we can only use logical interfaces. Currently DHCP is not supported in a non-global zone, although I accept that is not what you are doing.

At least that's my understanding. Sorry it's not much help.

W.r.t the second network (10.0.2.0) not being updated in the DHCP network table after failover, did you configure the 2 networks when registering the SC DHCP agent, i.e.

NETWORK='10.0.1.0@sc_ipmp0@1/10.0.1.0@sc_ipmp0@2/10.0.2.0@sc_ipmp0@1/10.0.2.0@sc_ipmp0@2'
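
A value of this shape gets tedious to type as networks are added, so it could be generated instead. The following is a small sketch with a hypothetical function name; it assumes every network uses the same IPMP group on every node, as in the example above.

```shell
#!/bin/sh
# Print a NETWORK value covering every listed network on every node id.
# Usage: build_network_value GROUP "NODE_IDS" NETWORK...
# Assumption: all networks share one IPMP group name on all nodes.
# Variables are global because plain sh has no "local".
build_network_value() {
    group=$1
    node_ids=$2
    shift 2
    value=""
    for net in "$@"; do
        for id in ${node_ids}; do
            # Prefix with "/" only when value is already non-empty.
            value="${value:+${value}/}${net}@${group}@${id}"
        done
    done
    echo "${value}"
}

# Prints: 10.0.1.0@sc_ipmp0@1/10.0.1.0@sc_ipmp0@2/10.0.2.0@sc_ipmp0@1/10.0.2.0@sc_ipmp0@2
build_network_value sc_ipmp0 "1 2" 10.0.1.0 10.0.2.0
```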

Regards

Neil

neil_garthwaitea at 2007-7-9 21:04:45
# 10

Hi Neil,

That makes sense. Before installing SC, I looked at putting DHCP in a non-global zone and came across the same thing you describe.

As I add more and more networks to the DHCP scope, do I have to keep appending them to 'NETWORK' and re-registering the DHCP agent? Are you aware of any upper limit on the number of networks?

Thanks,

Iain

iainfirkinsa at 2007-7-9 21:04:45