Network Working Group                                          T. Brisco
Request for Comments: 1794                            Rutgers University
Category: Informational                                       April 1995


                     DNS Support for Load Balancing

Status of this Memo

This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.

1. Introduction

This RFC first chronicles a foray into the IETF DNS Working Group,
then discusses possible alternatives for providing or simulating load
balancing support in the DNS, and finally presents an ultimate,
flexible solution for providing DNS support for balancing loads of
many types.

2. History

The history of this probably dates back well before my own time - so
undoubtedly some holes are here. Hopefully they can be filled in by
other authors.

Initially, "load balancing" was intended to permit the Domain Name
System (DNS) [1] agents to support the concept of "clusters" (derived
from the VMS usage) of machines - where all machines were
functionally similar or the same, and it didn't particularly matter
which machine was picked - as long as the processing load was
reasonably well distributed across a series of distinct hosts.
Around 1986 a number of different schemes started surfacing as hacks
to the Berkeley Internet Name Domain server (BIND) distribution.
Probably the most widely distributed of these were the "Shuffle
Address" (SA) modifications by Bryan Beecher, or possibly Marshall
Rose's "Round Robin" code.

The SA records, however, did a round-robin ordering of the Address
resource records, and didn't do much with regard to the particular
loads on the target machines. Matt Madison (of TGV) implemented some
changes that used VMS facilities to review the system loads, and
return A RRs in order from least loaded to most loaded.
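
The difference between the two orderings can be sketched in a few
lines of Python. This is only an illustration, not the SA or TGV
code: the function names, the shape of the load table, and the
192.0.2.x documentation addresses are assumptions made for the
example.

```python
from collections import deque


def round_robin(addresses, step=1):
    """Rotate the address list left by `step` positions, ignoring load."""
    rotated = deque(addresses)
    rotated.rotate(-step)
    return list(rotated)


def by_load(loads):
    """Order addresses from least loaded to most loaded.

    `loads` maps an address to a load figure (hypothetical input; the
    memo does not say how TGV's code obtained its numbers).
    """
    return [addr for addr, _ in sorted(loads.items(), key=lambda kv: kv[1])]


hosts = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
print(round_robin(hosts))
print(by_load({"192.0.2.1": 2.5, "192.0.2.2": 0.3, "192.0.2.3": 1.1}))
```

The rotation never looks at the hosts themselves, which is why it
cannot react to load; the second function reacts to load but is only
as portable as whatever supplies the load table.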

The problem with SAs was that load was not actually a factor, and
TGV's changes relied on VMS-specific facilities to order the records.
The SA RRs also required changes to the DNS specification (in file
syntax and in record processing). These were both viewed as
drawbacks and not as general solutions.

Most of the Internet waited in anticipation of an IETF approved
method for simulating "clusters".

Through a few IETF DNS Working Group sessions (chaired by Rob Austein
of Epilogue), it was collectively agreed that a number of criteria
must be met:

A) Backwards compatibility with the existing DNS RFC.

B) Information changes frequently.

C) Multiple addresses should be sent out.

D) Must interact with other RRs appropriately.

E) Must be able to represent many types of "loads".

F) Must be fast.

(A) would ensure that the installed base of BIND and other DNS
implementations would continue to operate and interoperate properly.

(B) would permit very fast update times - to enable modeling of
real-time data. Five minutes was thought of as a normal interval,
though changes as fast as every sixty seconds could be imagined.

(C) would cover the possibility of a host's address being advertised
as optimal, yet the machine crashing within the TTL of the RR. The
second-most preferable address would be advertised second, the
third-most preferable third, and so on. This would allow a
reasonable stab at recovery during machine failures.

(D) would ensure correct handling of all ancillary information - such
as MX, RP, and TXT information, as well as reverse lookup
information. Processes such as mail handling needed to continue to
work in an unsurprising and predictable manner.

(E) would ensure the flexibility that everyone wished for. Various
members of the DNS Working Group wished to represent a breadth of
"loads". Some "loads" were fairly eclectic - such as address
ordering by the RTT to the host; some were pragmatic - such as
balancing the CPU load evenly across a series of hosts. All
represented valid concerns within their own context, and the idea of
having separate RR types for each was unthinkable (primarily because
it would violate goal A).

(F) needed to ensure a few things. Primarily, the time to calculate
the information used to order the addressing information must not
exceed the TTL of the information distributed - i.e., elements with a
TTL of five minutes must not take six minutes to calculate.
Similarly, it seems a fairly clear goal in the DNS RFC that clients
should not be kept waiting - that request processing should continue
regardless of the state of any other processing occurring.

3. Possible Alternatives

During various discussions with the DNS Working Group and with the
Load Balancing Committee, it was noted that no existing solution
dealt with all wishes appropriately. One of the major successes of
the DNS is its flexibility - and it was felt that this needed to be
retained in all aspects. It was conceived that perhaps not only
address information would need to be changed rapidly, but other
records might also need to change rapidly (at least this could not be
ruled out - who knows what technologies lurk in the future).

Of primary concern to many was the ability to interact with older
implementations of DNS. The DNS is implemented widely now, and
changes to critical portions of the protocol could cause havoc for
years. It rapidly became apparent through conversations with Jon
Postel and Dave Crocker (Area Director) that modifications to the
protocol would be viewed dimly.

4. A Flexible Model

Over many hours of discussion, Rob Austein suggested that the changes
could be implemented without changes to the protocol; if zone
transfer behavior could be subtly changed, then the zone transfer
process could accommodate the changing of various RR information.
What was needed was a smarter program to do the zone transfers.
Pursuant to this, changes were made to BIND that would permit the
specification of the program to do the zone transfers for particular
zones.

There is no specification that a secondary has to receive updates
from its primary server in any specific manner - only that it needs
to check periodically, and obtain new zone copies when changes have
been made. Conceivably the zone transfer agent could obtain the
information from any number of sources (e.g., a load average daemon,
a round-robin sorter) and present the information back to the
nameserver for distribution.
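
One way to picture such an agent is as a small program that asks a
pluggable "source" for an ordering and then writes the address
records out in that order. The Python sketch below assumes exactly
that shape; the OrderingSource interface, the function names, and the
load table are hypothetical and are not part of BIND.

```python
from typing import Callable, Dict, List

# A "source" is any callable that takes the candidate addresses and
# returns them in the order they should be handed out. This interface
# is an assumption made for the sketch.
OrderingSource = Callable[[List[str]], List[str]]


def load_average_source(loads: Dict[str, float]) -> OrderingSource:
    """Build a source that orders addresses by a supplied load table."""
    def order(addresses: List[str]) -> List[str]:
        # Hosts with no reported load sort last, so a missing report
        # never makes a host look like the best choice.
        return sorted(addresses, key=lambda a: loads.get(a, float("inf")))
    return order


def render_a_records(name: str, ttl: int, addresses: List[str],
                     source: OrderingSource) -> List[str]:
    """Return zone-file A record lines in the order chosen by `source`."""
    return [f"{name} {ttl} IN A {addr}" for addr in source(addresses)]


source = load_average_source({"192.0.2.1": 3.0, "192.0.2.2": 0.5})
for line in render_a_records("www", 300, ["192.0.2.1", "192.0.2.2"], source):
    print(line)
```

Keeping the source behind a single callable is what lets the same
agent serve a load average daemon, a round-robin sorter, or any other
notion of "load" without changes to the nameserver.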

A number of questions arose from this concept, and all seem to have
been dealt with accordingly. Primarily, the DNS protocol doesn't
guarantee ordering. While the DNS protocol doesn't guarantee
ordering, it is clear that the ordering is predictable - that
information read in twice in the same order will be presented twice
in the same order to clients. Clients, of course, may reorder this
information, but that is deemed a "local issue" as it is configurable
by the remote systems administrators (e.g., sortlists, etc.). The
zone transfer agent would have to account for any "mis-ordering" that
may occur locally, but remote reordering (e.g., client-side
sortlists) of RRs is impossible to predict. Since local mis-ordering
is consistent, the zone transfer agents could easily account for
this.

Secondarily, but perhaps more subtly, the problem arises that zone
transfers aren't used by primary nameservers, only by secondary
nameservers. To clarify this, the idea of "fast" or "volatile"
subzones must be dealt with. In a volatile environment (where
address or other RR ordering changes rapidly), the refresh rate of a
zone must be set very low, and the TTL of the RRs handed out must
similarly be very low. There is no use in handing out information
with TTLs of an hour when the conditions for ordering the RRs change
minute by minute. There must be a relatively close relationship
between the refresh rates and TTLs of the information. Of course,
with very low refresh rates, zone transfers between the primary and
secondary would have to occur frequently. Given that primary and
secondary nameservers should be topologically and geographically far
apart, moving that much data that frequently is seen as prohibitive.
Also, the longer the propagation time between the primary and
secondary, the larger the window in which circumstances can change -
thus invalidating the secondary's information. It is generally
thought that passing volatile information on to a secondary is fairly
useless - if secondaries want accurate information, then they should
calculate it themselves and not obtain it via zone transfers. This
avoids the problem in which a secondary loses contact with its
primary (while the targets of the volatile domain remain reachable)
and is left with information that is growing stale.

What is essentially necessary is a secondary (with no primary) which
can calculate the necessary ordering of the RR data for itself (which
also avoids the problem of different versions of domain servers
predictably ordering RR information in different ways). For a
volatile zone, there is no primary DNS agent, but rather a series of
autonomous secondary agents. Each autonomous secondary agent is, of
course, capable of calculating the ordering or content of the
volatile RRs itself.

5. Implementation

With some help from Masataka Ohta (Tokyo Institute of Technology), I
implemented modifications to BIND to permit the specification of the
zone transfer program (zone transfer agent) for particular domains:

   transfer <domain-name> <program-name>

Currently I define a separate subdomain that has a few hosts defined
in it - all volatile information. The zone has a refresh rate of 300
seconds and a minimum TTL of 300 seconds. The configuration file is
named "volatile.hosts". Every 300 seconds a program "doAxfer" is run
to do the zone transfer. The program "doAxfer" reads the file
"volatile.hosts.template" and the file "volatile.hosts.list". The
addresses specified in volatile.hosts.list are rotated a random
number of times, and then substituted (in order) into
volatile.hosts.template to generate the file volatile.hosts. The
program "doAxfer" then exits with a value of 1 - to indicate to the
nameserver that the zone transfer was successful, and that the file
should be read in, and the information distributed. This results in
a host having multiple addresses, and the addresses are randomized
every five minutes (300 seconds).
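
The steps performed by "doAxfer" can be sketched in Python as
follows. This is not the original program: the "@ADDR@" placeholder,
the one-address-per-line format of volatile.hosts.list, and the
function names are assumptions, since the memo does not give the file
syntax. Only the three file names and the exit value of 1 come from
the text above.

```python
import random
from collections import deque
from pathlib import Path

PLACEHOLDER = "@ADDR@"  # assumed marker; the template syntax is not specified


def rotate_randomly(addresses, rng=random):
    """Rotate the address list left by a random number of positions."""
    rotated = deque(addresses)
    if rotated:
        rotated.rotate(-rng.randrange(len(rotated)))
    return list(rotated)


def fill_template(template, addresses):
    """Substitute addresses, in order, for successive placeholders."""
    out = template
    for addr in addresses:
        out = out.replace(PLACEHOLDER, addr, 1)
    return out


def do_axfer(directory="."):
    """Regenerate volatile.hosts and return the exit status to report."""
    base = Path(directory)
    template = (base / "volatile.hosts.template").read_text()
    listing = (base / "volatile.hosts.list").read_text()
    # Assumed list format: one address per line, blank lines ignored.
    addresses = [line.strip() for line in listing.splitlines() if line.strip()]
    zone = fill_template(template, rotate_randomly(addresses))
    (base / "volatile.hosts").write_text(zone)
    return 1  # per the memo, 1 tells the nameserver the transfer succeeded
```

A wrapper script would call sys.exit(do_axfer()) so that the
nameserver sees the exit value. Passing a seeded random.Random
instance as rng makes the rotation repeatable for testing.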

Two bugs continue to plague us in this endeavor. BIND currently
considers any TTL under 300 seconds as "irrational", and substitutes
in the value of 300 instead. This greatly hampers the functionality
of volatile zones. In the fastest of all cases - a 0 TTL -
information would be used once, and then thrown away. Presumably the
new RR information could be calculated every 5 seconds, and the RRs
handed out with a TTL of 0. It must be considered that one
limitation on the speed of a zone is going to be the ability of a
machine to calculate new information fast enough.

The other bug that also affects this is that, as with TTLs, BIND
considers any zone refresh rate under 15 minutes to be similarly
irrational. Obviously a zone refresh rate of 15 minutes is
unacceptable for this sort of application.

As a work-around, the current code sets these same hard-coded values
to 60 seconds. Sixty seconds is still large enough to avoid any
residual bugs associated with small timer values, but is also short
enough to allow fast subzones to be of use.
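
The effect of the two floors, before and after the work-around, can
be illustrated with a short Python sketch. The constant and function
names are invented for the example and do not correspond to
identifiers in the BIND source; the sketch also assumes that a too-low
refresh value is raised to the floor in the same way the memo
describes for TTLs.

```python
STOCK_MIN_TTL = 300          # seconds; TTL floor the memo attributes to BIND
STOCK_MIN_REFRESH = 15 * 60  # seconds; refresh floor described in the memo
PATCHED_MIN = 60             # seconds; floor used by the modified code


def clamp(value, floor):
    """Return `value`, raised to `floor` when it falls below it."""
    return max(value, floor)


# A zone asking for a 60 second TTL and a 60 second refresh:
print(clamp(60, STOCK_MIN_TTL), clamp(60, STOCK_MIN_REFRESH))
print(clamp(60, PATCHED_MIN), clamp(60, PATCHED_MIN))
```

With the stock floors the requested sixty seconds is silently
replaced by much larger values; with the patched floor it is honored,
while anything below sixty seconds is still raised to sixty.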

This version of BIND is currently in release within Rutgers
University, operating in both "fast" and normal zones.

6. Performance

While the performance of fast zones isn't exactly stellar, the load
is not much more than the normal CPU load induced by BIND. Testing
was performed on a Sun Sparc-2 being used as a normal workstation,
but no resolvers were using the name server - essentially the
nameserver was idle. For a configuration with no fast subzones, BIND
accrued 11 CPU seconds in 24 hours. For a configuration with one
fast zone, six address records, and a refresh every 300 seconds (5
minutes), BIND accrued 1 minute 4 seconds of CPU time. For the same
configuration, but refreshed every sixty seconds, BIND accrued 5
minutes and 38 seconds of CPU time.

As is no great surprise, the CPU load on the serving machine was
linear in the refresh frequency. The sixty second refresh
configuration used approximately five times as much CPU time as did
the 300 second refresh configuration. One can easily extrapolate
that the overall CPU utilization would be linear in the number of
zones and the frequency of the refresh period. All of this is based
on a shell script that always indicated that a zone update was
necessary; a more intelligent program should realize when reordering
the RRs is unnecessary and avoid such periodic zone reloads.
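
The "approximately five times" figure can be checked directly from
the measurements quoted above. The short Python calculation below
uses only those numbers; the per-refresh estimate additionally
assumes that all three runs covered the same 24 hour period, which
the memo states explicitly only for the baseline.

```python
BASELINE_S = 11              # no fast subzones, 24 hours
FAST_300_S = 1 * 60 + 4      # one fast zone, refreshed every 300 seconds
FAST_60_S = 5 * 60 + 38      # same zone, refreshed every 60 seconds
DAY_S = 24 * 60 * 60

# Ratio of measured CPU time against the ratio of refresh frequencies.
cpu_ratio = FAST_60_S / FAST_300_S
refresh_ratio = 300 / 60


def per_refresh_cost(total_s, refresh_s, baseline_s=BASELINE_S, period_s=DAY_S):
    """CPU seconds per refresh, assuming every run lasted `period_s`."""
    return (total_s - baseline_s) / (period_s / refresh_s)


print(f"CPU ratio {cpu_ratio:.2f} vs refresh ratio {refresh_ratio:.0f}")
print(f"{per_refresh_cost(FAST_300_S, 300):.3f} s per refresh at 300 s")
print(f"{per_refresh_cost(FAST_60_S, 60):.3f} s per refresh at 60 s")
```

The CPU ratio comes out slightly above five, and under the stated
assumption each refresh costs roughly a fifth of a CPU second in both
configurations, which is consistent with the linear relationship
described above.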

7. Acknowledgments

Most of the ideas in this document are the results of conversations
and proposals from many, many people - including, but not limited to,
Robert Austein, Stuart Vance, Masataka Ohta, Marshall Rose, and the
members of the IETF DNS Working Group.

8. References

[1] Mockapetris, P., "Domain Names - Implementation and
    Specification", STD 13, RFC 1035, USC/Information Sciences
    Institute, November 1987.

9. Security Considerations

Security issues are not discussed in this memo.

10. Author's Address

Thomas P. Brisco
Associate Director for Network Operations
Rutgers University
Computing Services, Telecommunications Division
Hill Center for the Mathematical Sciences
Busch Campus
Piscataway, New Jersey 08855-0879
USA

Phone: +1-908-445-2351
EMail: brisco@rutgers.edu