DNS HA agent bugs?
I'm running SC3.2 on Sol 10 11/06 (fully patched). I'm using the DNS and DHCP agents side-by-side for my (test) cluster.
Failing the resource group between nodes works fine, without affecting DNS resolution or DDNS updates from DHCP. However, I have noticed a couple of problems with the agent.
All of my DNS config data lives in /var/named. If I create the resource by running "clresource create -g dns-rg -t SUNW.dns -p Port_list=53/udp -p DNS_mode=conf -p Confdir_list=/var dns-res", I get an error:
# clresource create -g dns-rg -t SUNW.dns -p Port_list=53/udp -p DNS_mode=conf -p Confdir_list=/var dns-res
clresource: whelk - File /var/named.conf is not readable: No such file or directory.
clresource: whelk - Failed to validate configuration.
If I change "-p Confdir_list" to "/var/named", I still get an error:
# clresource create -g dns-rg -t SUNW.dns -p Port_list=53/udp -p DNS_mode=conf -p Confdir_list=/var/named dns-res
clresource: whelk - DNS database directory /var/named/named is not readable: No such file or directory
clresource: whelk - Failed to validate configuration.
If I create a link to named.conf called named in /var/named, the above command works properly. Is this a bug with the agent, or a requirement that something called 'named' must exist in the specified directory? Why did the first command error with "couldn't find /var/named.conf" and the second with "/var/named/named"? When BIND is running, I see:
/usr/sbin/in.named -c /var/named/named.conf
The second problem I have is with running BIND as a different user, i.e. named. This was easy to achieve within the SMF manifest, and if I were running BIND from the command line I would use "-u named" as an option. How can I achieve the same functionality with the DNS HA agent?
Iain
# 1
No, it's not a bug, it's just that the parameter isn't clear. I remember that I ran into the same problem some years ago. I might have even logged a bug. The answer (IIRC) is just to get the directory name right. I think it looks for a file called (effectively): ${Confdir_list}/named.${DNS_mode}
As for your second question, there isn't a good solution. You'd need to write a custom GDS agent to wrap the DNS service start, or hack around by wrapping one of the standard Solaris commands that start it.
Tim
# 2
Hi Tim,
> I think it looks for a file called (effectively): ${Confdir_list}/named.${DNS_mode}
Based purely on observation, I'm finding that when 'Confdir_list' is set to '/var/named' and 'DNS_mode' is set to 'conf', the agent looks for a file called '/var/named/named', i.e. the extension is ignored. Perhaps there's some peculiar behaviour in the code that doesn't like a directory called 'named' to be part of the path?
As for running BIND as a different user, how can I raise an RFE for this?
Iain
# 3
Hi Iain,
The DNS agent has an extension property called DNS_mode, which defaults to conf, so the agent looks for named.conf. The other option is to set it to boot, in which case it looks for named.boot. There is no option for a file with a different extension or no extension.
# 4
Hi Iain,
Just to elaborate: the validation expects to find a file named named.conf or named.boot. If it is not present, the validation fails. This is the intended design.
# 5
Hi Maddy,
Just to confirm the problem I'm seeing here. I *do* have a file called /var/named/named.conf. When I run the 'clresource create', it fails with 'Can't find /var/named/named' (no .conf extension). If I move 'named.conf' to 'named', 'clresource create' succeeds, but named doesn't start because it can't find 'named.conf'. Only if I have a file called 'named.conf' and create a link to it called 'named' in '/var/named' do all the various steps work and named starts properly.
If I move '/var/named' to '/var/dns' and run the 'clresource create' again, it fails with the same error message. Therefore, given that when I specify a path with one level (i.e. '/var') it fails with 'Couldn't find /var/named.conf', it looks like the validation process has a problem with paths containing two (and possibly more) levels.
Iain
# 6
You can either call Sun support and raise the RFE, or you could send me (@sun.com) your details with a more detailed description of what you want and why you want it, and I can log it when I have time.
Tim
# 7
Hi Iain,
I think the problem is not with named.conf; rather, the agent expects the lookup files to be under the directory <confdir_list>/named. Please see the session below:
bash-3.00# ls -l /test/named.conf
-rw-r--r--   1 root     root         305 Mar  2 18:03 /test/named.conf
bash-3.00# scrgadm -at SUNW.dns -x Confdir_list=/test -j dns -g dns-rg -y resource_dependencies=hasp
pjaguar1 - DNS database directory /test/named is not readable: No such file or directory
pjaguar1 - Failed to validate configuration.
(C189917) VALIDATE on resource dns, resource group dns-rg, exited with non-zero exit status.
(C720144) Validation of resource dns in resource group dns-rg on node pjaguar1 failed.
bash-3.00# mkdir /test/named
bash-3.00# scrgadm -at SUNW.dns -x Confdir_list=/test -j dns -g dns-rg -y resource_dependencies=hasp
bash-3.00#
# 8
Hi Maddy,
I'm seeing different behaviour from your example, still relating to the extension of the named configuration file:
root@whelk# ls -ld /var/named
drwxr-xr-x   2 named    named        512 Mar  2 13:35 /var/named
root@whelk# ls -la /var/named/named.conf
-rw-r--r--   1 named    named       2867 Feb 23 11:56 /var/named/named.conf
root@whelk# clresource create -g dns-rg -t SUNW.dns -p Port_list=53/udp -p DNS_mode=conf -p Confdir_list=/var dns-res
clresource: whelk - File /var/named.conf is not readable: No such file or directory.
clresource: whelk - Failed to validate configuration.
clresource: (C189917) VALIDATE on resource dns-res, resource group dns-rg, exited with non-zero exit status.
root@whelk# clresource create -g dns-rg -t SUNW.dns -p Port_list=53/udp -p DNS_mode=conf -p Confdir_list=/var/named dns-res
clresource: whelk - DNS database directory /var/named/named is not readable: No such file or directory
clresource: whelk - Failed to validate configuration.
clresource: (C189917) VALIDATE on resource dns-res, resource group dns-rg, exited with non-zero exit status.
So in the first instance, clresource assumed Confdir_list contained named.conf. In the second example, clresource either failed to append the correct extension to named or assumed Confdir_list contained a directory called named. Either way, I don't think the behaviour is consistent from one example to the next.
I note that you used 'scrgadm' in your example. Could that (help) explain the difference in behaviour?
Iain
# 9
It is not clear to me whether you use a failover filesystem for your DNS configuration.
If you do, then you need to set up a resource dependency on the HAStoragePlus resource by specifying "-p Resource_Dependencies=hasp-rs". Note that this is one thing Maddy did and you didn't (whether you use the old or the new CLI style makes no difference otherwise).
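As a rough sketch of the dependency described above, the create command from earlier in the thread could be extended as follows. The resource name hasp-rs comes from the post; the directory /failover/dns is a made-up placeholder for wherever the HASP-managed filesystem is mounted. The function only prints the command so it can be reviewed before being run on a cluster node.

```shell
# Hypothetical values: adjust to your own cluster.
HASP_RS=hasp-rs          # name of the existing HAStoragePlus resource
CONFDIR=/failover/dns    # placeholder: directory holding named.conf and named/

# Build the clresource command line, adding the HASP resource dependency.
dns_create_cmd() {
    echo clresource create -g dns-rg -t SUNW.dns \
        -p Port_list=53/udp -p DNS_mode=conf \
        -p Confdir_list="$CONFDIR" \
        -p Resource_dependencies="$HASP_RS" dns-res
}

# Dry run: print the command for review instead of executing it.
dns_create_cmd
```

With the dependency in place, validation should only happen on the node where the HASP resource is online, as explained in this post.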
Specifically, the message
"clresource: whelk - File /var/named.conf is not readable: No such file or directory."
is only output in the scenario where /var/named.conf is not present on every node where dns-rg can be hosted _and_ no HASP resource dependency is provided. If one is provided, the HASP resource is expected to be online, and validation happens only on the node where it is online.
If you do not use one, then you need either a global filesystem, or you need to make sure that all needed files are on all nodes.
So in order to understand the clrs error message, you need to check the syslog messages of all nodes to see on which node validation failed and why.
Also note that using Confdir_list=/var is probably not a good idea, since in your setup the agent expects ${Confdir_list}/named.conf as well as ${Confdir_list}/named/, which contains all of the DNS database files (listed in the named.conf file). Making /var either a global or a failover filesystem is not really a good idea, and maintaining the config files on all nodes might also not be desirable.
The documentation uses a global filesystem (and thus does not show the HASP dependency).
You can check http://docs.sun.com/app/docs/doc/819-2977/6n57s95iv?a=view
to see that the named directory is fixed.
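The layout described above can be checked before running clresource. The following is only a sketch that mirrors the two error messages seen in this thread; it is not the agent's actual validation code, and the function name check_dns_layout is made up for illustration.

```shell
# Pre-flight check for the layout the DNS agent appears to expect:
#   <confdir>/named.<mode>   configuration file (mode is conf or boot)
#   <confdir>/named/         directory holding the DNS database files
# Usage: check_dns_layout <confdir> [conf|boot]
check_dns_layout() {
    confdir=$1
    mode=${2:-conf}
    status=0

    # Corresponds to "File <confdir>/named.<mode> is not readable".
    if [ ! -r "$confdir/named.$mode" ]; then
        echo "missing or unreadable file: $confdir/named.$mode"
        status=1
    fi

    # Corresponds to "DNS database directory <confdir>/named is not readable".
    if [ ! -d "$confdir/named" ] || [ ! -r "$confdir/named" ]; then
        echo "missing or unreadable directory: $confdir/named"
        status=1
    fi

    return $status
}
```

Running "check_dns_layout /var/named conf" on each node that can host the resource group would report the same missing paths as the validation errors quoted earlier, without touching the cluster configuration.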
Hope this helps.
Greets
Thorsten
# 10
Just a comment on your second question:
> As for your second question. There isn't a good
> solution. You'd need to write a custom GDS agent to
> wrap the DNS service start in or hack around with
> wrapping one of the standard Solaris commands to
> start it.
Apart from suggesting an RFE for the agent, in this specific case I would perhaps not recommend a GDS agent; instead, you might find the SMF proxy resource useful.
Have a look at
http://docs.sun.com/app/docs/doc/819-2974/6n57pdk2b?a=view
That way you might be able to leverage the existing DNS SMF service in S10 (which you need to disable when using SUNW.dns).
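A tentative sketch of what the SMF proxy approach might look like is shown here. Everything in it should be checked against the linked documentation: the resource type SUNW.Proxy_SMF_failover, the Proxied_service_instances property and the "<fmri>,<manifest>" list-file format are written from memory, and the FMRI, manifest path, list-file location and resource name dns-smf-res are assumptions that do not come from this thread. The cluster commands are printed rather than executed.

```shell
# Assumed values (not from the thread): verify them on your own system.
FMRI=svc:/network/dns/server:default
MANIFEST=/var/svc/manifest/network/dns/server.xml
SVC_LIST=/var/tmp/dns-svc-list.txt

# One line per proxied SMF instance, in the assumed "<fmri>,<manifest>" format.
smf_list_entry() {
    echo "<$FMRI>,<$MANIFEST>"
}

# Write the list file that the proxy resource would read.
smf_list_entry > "$SVC_LIST"

# Dry run: print the cluster commands for review instead of executing them.
echo clresourcetype register SUNW.Proxy_SMF_failover
echo clresource create -g dns-rg -t SUNW.Proxy_SMF_failover \
    -x Proxied_service_instances="$SVC_LIST" dns-smf-res
```

Since the service would still be started through its SMF manifest, running BIND as the named user would be handled there, as in the non-clustered setup mentioned in the original question.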
Greets
Thorsten
# 11
Thanks for your comments Thorsten.
I understand now the behaviour I'm seeing. The DNS HA agent assumes that the directory I specify contains a file called 'named.conf/.boot' and a directory called 'named' containing the DNS datafiles. Since I linked named to named.conf, these checks all pass, but not because I match the assumed directory structure.
You're right that I haven't used a HASP resource, and instead I make use of the global filesystems to replicate data between the nodes. I am interested in how you make proper use of global filesystems between nodes, given that if a node goes down its global device disappears and can't be utilised as the single repository for the DNS data files.
Iain
# 12
While you can globally mount a device that is physically attached to only one node, from an HA perspective that makes only limited sense (it is used, e.g., for the global device tree of each node).
Thus you normally have dual-attached devices (shared devices) that you manage with a volume manager (like SVM or VxVM). If you then globally mount a filesystem within such a volume, the Sun Cluster DCS layer will direct the volume manager to switch primary and secondary. In that case, if the primary node fails, the other node (previously secondary) becomes primary, and I/Os do not get lost (they are just briefly frozen).
With a HASP resource you can even set AffinityOn=true so that the primary fails over together with your resource group, guaranteeing that I/Os always happen on the primary.
I/Os to the secondary are carried through the interconnect.
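For illustration, a HASP resource with the affinity setting mentioned above might be created along these lines. The mount point /global/dns and the resource name hasp-rs are placeholders, and the sketch assumes the SUNW.HAStoragePlus resource type has already been registered. The command is printed for review rather than executed.

```shell
# Placeholder: global mount point of the filesystem holding the DNS data.
MOUNT_POINT=/global/dns

# Build the command that creates a HAStoragePlus resource with AffinityOn,
# so the device primary follows the resource group on failover.
hasp_create_cmd() {
    echo clresource create -g dns-rg -t SUNW.HAStoragePlus \
        -p FilesystemMountPoints="$MOUNT_POINT" \
        -p AffinityOn=true hasp-rs
}

# Dry run: print the command instead of executing it.
hasp_create_cmd
```

The DNS resource would then list hasp-rs in its resource dependencies so that it only starts once the storage is available.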
If that sounds confusing, you might want to read the concepts guide (http://docs.sun.com/app/docs/doc/819-2969) for a more complete explanation.
Greets
Thorsten
