396 lines
15 KiB
Plaintext
396 lines
15 KiB
Plaintext
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Network Working Group T. Brisco
|
|||
|
Request for Comments: 1794 Rutgers University
|
|||
|
Category: Informational April 1995
|
|||
|
|
|||
|
|
|||
|
DNS Support for Load Balancing
|
|||
|
|
|||
|
Status of this Memo
|
|||
|
|
|||
|
This memo provides information for the Internet community. This memo
|
|||
|
does not specify an Internet standard of any kind. Distribution of
|
|||
|
this memo is unlimited.
|
|||
|
|
|||
|
1. Introduction
|
|||
|
|
|||
|
This RFC is meant to first chronicle a foray into the IETF DNS
|
|||
|
Working Group, discuss other possible alternatives to
|
|||
|
provide/simulate load balancing support for DNS, and to provide an
|
|||
|
ultimate, flexible solution for providing DNS support for balancing
|
|||
|
loads of many types.
|
|||
|
|
|||
|
2. History
|
|||
|
|
|||
|
The history of this probably dates back well before my own time - so
|
|||
|
undoubtedly some holes are here. Hopefully they can be filled in by
|
|||
|
other authors.
|
|||
|
|
|||
|
Initially; "load balancing" was intended to permit the Domain Name
|
|||
|
System (DNS) [1] agents to support the concept of "clusters" (derived
|
|||
|
from the VMS usage) of machines - where all machines were
|
|||
|
functionally similar or the same, and it didn't particularly matter
|
|||
|
which machine was picked - as long as the load of the processing was
|
|||
|
reasonably well distributed across a series of actual different
|
|||
|
hosts. Around 1986 a number of different schemes started surfacing
|
|||
|
as hacks to the Berkeley Internet Name Domain server (BIND)
|
|||
|
distribution. Probably the most widely distributed of these were the
|
|||
|
"Shuffle Address" (SA) modifications by Bryan Beecher, or possibly
|
|||
|
Marshall Rose's "Round Robin" code.
|
|||
|
|
|||
|
The SA records, however, did a round-robin ordering of the Address
|
|||
|
resource records, and didn't do much with regard to the particular
|
|||
|
loads on the target machines. Matt Madison (of TGV) implemented some
|
|||
|
changes that used VMS facilities to review the system loads, and
|
|||
|
return A RRs in the order of least-loaded to most loaded.
|
|||
|
|
|||
|
The problem was with SAs was that load was not actually a factor, and
|
|||
|
TGV's relied on VMS specific facilities to order the records. The SA
|
|||
|
RRs required changes to the DNS specification (in file syntax and in
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Brisco [Page 1]
|
|||
|
|
|||
|
RFC 1794 DNS Support for Load Balancing April 1995
|
|||
|
|
|||
|
|
|||
|
record processing). These were both viewed as drawbacks and not as
|
|||
|
general solutions.
|
|||
|
|
|||
|
Most of the Internet waited in anticipation of an IETF approved
|
|||
|
method for simulating "clusters".
|
|||
|
|
|||
|
Through a few IETF DNS Working Group sessions (Chaired by Rob Austein
|
|||
|
of Epilogue), it was collectively agreed upon that a number of
|
|||
|
criteria must be met:
|
|||
|
|
|||
|
A) Backwards compatibility with the existing DNS RFC.
|
|||
|
|
|||
|
B) Information changes frequently.
|
|||
|
|
|||
|
C) Multiple addresses should be sent out.
|
|||
|
|
|||
|
D) Must interact with other RRs appropriately.
|
|||
|
|
|||
|
E) Must be able to represent many types of "loads"
|
|||
|
|
|||
|
F) Must be fast.
|
|||
|
|
|||
|
(A) would ensure that the installed base of BIND and other DNS
|
|||
|
implementations would continue to operate and interoperate properly.
|
|||
|
|
|||
|
(B) would permit very fast update times - to enable modeling of
|
|||
|
real-time data. Five minutes was thought as a normal interval,
|
|||
|
though changes as fast as every sixty seconds could be imagined.
|
|||
|
|
|||
|
(C) would cover the possibility of a host's address being advertised
|
|||
|
as optimal, yet the machine crashed during the period within the TTL
|
|||
|
of the RR. The second-most preferable address would be advertised
|
|||
|
second, the third-most preferable third, and so on. This would allow
|
|||
|
a reasonable stab at recovery during machine failures.
|
|||
|
|
|||
|
(D) would ensure correct handling of all ancillary information - such
|
|||
|
as MX, RP, and TXT information, as well as reverse lookup
|
|||
|
information. It needed to be ensured that such processes as mail
|
|||
|
handling continued to work in an unsurprising and predictable manner.
|
|||
|
|
|||
|
(E) would ensure the flexibility that everyone wished. A breadth of
|
|||
|
"loads" were wished to be represented by various members of the DNS
|
|||
|
Working Group. Some "loads" were fairly eclectic - such as the
|
|||
|
address ordering by the RTT to the host, some were pragmatic - such
|
|||
|
as balancing the CPU load evenly across a series of hosts. All
|
|||
|
represented valid concerns within their own context, and the idea of
|
|||
|
having separate RR types for each was unthinkable (primarily; it
|
|||
|
would violate goal A).
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Brisco [Page 2]
|
|||
|
|
|||
|
RFC 1794 DNS Support for Load Balancing April 1995
|
|||
|
|
|||
|
|
|||
|
(F) needed to ensure a few things. Primarily that the time to
|
|||
|
calculate the information to order the addressing information did not
|
|||
|
exceed the TTL of the information distributed - i.e., that elements
|
|||
|
with a TTL of five minutes didn't take six minutes to calculate.
|
|||
|
Similarly; it seems a fairly clear goal in the DNS RFC that clients
|
|||
|
should not be kept waiting - that request processing should continue
|
|||
|
regardless of the state of any other processing occurring.
|
|||
|
|
|||
|
3. Possible Alternatives
|
|||
|
|
|||
|
During various discussions with the DNS Working Group and with the
|
|||
|
Load Balancing Committee, it was noted that no existing solution
|
|||
|
dealt with all wishes appropriately. One of the major successes of
|
|||
|
the DNS is its flexibility - and it was felt that this needed to be
|
|||
|
retained in all aspects. It was conceived that perhaps not only
|
|||
|
address information would need to be changed rapidly, but other
|
|||
|
records may also need to change rapidly (at least this could not be
|
|||
|
ruled out - who knows what technologies lurk in the future).
|
|||
|
|
|||
|
Of primary concern to many was the ability to interact with older
|
|||
|
implementations of DNS. The DNS is implemented widely now, and
|
|||
|
changes to critical portions of the protocol could cause havoc for
|
|||
|
years. It became rapidly apparent through conversations with Jon
|
|||
|
Postel and Dave Crocker (Area Director) that modifications to the
|
|||
|
protocol would be viewed dimly.
|
|||
|
|
|||
|
4. A Flexible Model
|
|||
|
|
|||
|
During many hours of discussions, it arose upon suggestion from Rob
|
|||
|
Austein that the changes could be implemented without changes to the
|
|||
|
protocol; if zone transfer behavior could be subtly changed, then the
|
|||
|
zone transfer process could accommodate the changing of various RR
|
|||
|
information. What was needed was a smarter program to do the zone
|
|||
|
transfers. Pursuant to this, changes were made to BIND that would
|
|||
|
permit the specification of the program to do the zone transfers for
|
|||
|
particular zones.
|
|||
|
|
|||
|
There is no specification that a secondary has to receive updates
|
|||
|
from its primary server in any specific manner - only that it needs
|
|||
|
to check periodically, and obtain new zone copies when changes have
|
|||
|
been made. Conceivably the zone transfer agent could obtain the
|
|||
|
information from any number of sources (e.g., a load average daemon,
|
|||
|
a round-robin sorter) and present the information back to the
|
|||
|
nameserver for distribution.
|
|||
|
|
|||
|
A number of questions arose from this concept, and all seem to have
|
|||
|
been dealt with accordingly. Primarily, the DNS protocol doesn't
|
|||
|
guarantee ordering. While the DNS protocol doesn't guarantee
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Brisco [Page 3]
|
|||
|
|
|||
|
RFC 1794 DNS Support for Load Balancing April 1995
|
|||
|
|
|||
|
|
|||
|
ordering, it is clear that the ordering is predictive - that
|
|||
|
information read in twice in the same order will be presented twice
|
|||
|
in the same order to clients. Clients, of course, may reorder this
|
|||
|
information, but that is deemed as a "local issue" as it is
|
|||
|
configurable by the remote systems administrators (e.g., sortlists,
|
|||
|
etc). The zone transfer agent would have to account for any "mis-
|
|||
|
ordering" that may occur locally, but remote reordering (e.g., client
|
|||
|
side sortlists) of RRs is is impossible to predict. Since local
|
|||
|
mis-ordering is consistent, the zone transfer agents could easily
|
|||
|
account for this.
|
|||
|
|
|||
|
Secondarily, but perhaps more subtly, the problem arises that zone
|
|||
|
transfers aren't used by primary nameservers, only by secondary
|
|||
|
nameservers. To clarify this, the idea of "fast" or "volatile"
|
|||
|
subzones must be dealt with. In a volatile environment (where
|
|||
|
address or other RR ordering changes rapidly), the refresh rate of a
|
|||
|
zone must be set very low, and the TTL of the RRs handed out must
|
|||
|
similarly be very low. There is no use in handing out information
|
|||
|
with TTLs of an hour, when the conditions for ordering the RRs
|
|||
|
changes minutely. There must be a relatively close relationship
|
|||
|
between the refresh rates and TTLs of the information. Of course,
|
|||
|
with very low refresh rates, zone transfers between the primary and
|
|||
|
secondary would have to occur frequently. Given that primary and
|
|||
|
secondary nameservers should be topologically and geographically far
|
|||
|
apart, moving that much data that frequently is seen as prohibitive.
|
|||
|
Also; the longer the propagation time between the primary and
|
|||
|
secondary, the larger the window in which circumstances can change -
|
|||
|
thus invalidating the secondary's information. It is generally
|
|||
|
thought that passing volatile information on to a secondary is fairly
|
|||
|
useless - if secondaries want accurate information, then they should
|
|||
|
calculate it themselves and not obtain it via zone transfers. This
|
|||
|
avoids the problem with secondaries losing contact with the primaries
|
|||
|
(but access to the targets of the volatile domain are still
|
|||
|
reachable), but the secondary has information that is growing stale.
|
|||
|
|
|||
|
What is essentially necessary is a secondary (with no primary) which
|
|||
|
can calculate the necessary ordering of the RR data for itself (which
|
|||
|
also avoids the problem of different versions of domain servers
|
|||
|
predictively ordering RR information in different predictive
|
|||
|
fashions). For a volatile zone, there is no primary DNS agent, but
|
|||
|
rather a series of autonomous secondary agents. Each autonomous
|
|||
|
secondary agent is, of course, capable of calculating the ordering or
|
|||
|
content of the volatile RRs itself.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Brisco [Page 4]
|
|||
|
|
|||
|
RFC 1794 DNS Support for Load Balancing April 1995
|
|||
|
|
|||
|
|
|||
|
5. Implementation
|
|||
|
|
|||
|
With some help from Masataka Ohta (Tokyo Institute of Technology), I
|
|||
|
implemented modifications to BIND to permit the specification of the
|
|||
|
zone transfer program (zone transfer agent) for particular domains:
|
|||
|
|
|||
|
transfer <domain-name> <program-name>
|
|||
|
|
|||
|
Currently I define a separate subdomain that has a few hosts defined
|
|||
|
in it - all volatile information. The zone has a refresh rate of
|
|||
|
300, and a minimum TTL of 300 indicated. The configuration file is
|
|||
|
indicated as "volatile.hosts". Every 300 seconds a program "doAxfer"
|
|||
|
is run to do the zone transfer. The program "doAxfer" reads the file
|
|||
|
"volatile.hosts.template" and the file "volatile.hosts.list". The
|
|||
|
addresses specified in volatile.hosts.list are rotated a random
|
|||
|
number of times, and then substituted (in order) into
|
|||
|
volatile.hosts.template to generate the file volatile.hosts. The
|
|||
|
program "doAxfer" then exits with a value of 1 - to indicate to the
|
|||
|
nameserver that the zone transfer was successful, and that the file
|
|||
|
should be read in, and the information distributed. This results in
|
|||
|
a host having multiple addresses, and the addresses are randomized
|
|||
|
every five minutes (300 seconds).
|
|||
|
|
|||
|
Two bugs continue to plague us in this endeavor. BIND currently
|
|||
|
considers any TTL under 300 seconds as "irrational", and substitutes
|
|||
|
in the value of 300 instead. This greatly hampers the functionality
|
|||
|
of volatile zones. In the fastest of all cases - a 0 TTL -
|
|||
|
information would be used once, and then thrown away. Presumably the
|
|||
|
new RR information could be calculated every 5 seconds, and the RRs
|
|||
|
handed out with a TTL of 0. It must be considered that one
|
|||
|
limitation of the speed of a zone is going to be the ability of a
|
|||
|
machine to calculate new information fast enough.
|
|||
|
|
|||
|
The other bug that also effects this is that, as with TTLs, BIND
|
|||
|
considers any zone refresh rate under 15 minutes to be similarly
|
|||
|
irrational. Obviously zone refresh rates of 15 minutes is
|
|||
|
unacceptable for this sort of applications.
|
|||
|
|
|||
|
For a work-around, the current code sets these same hard-coded values
|
|||
|
to 60 seconds. Sixty seconds is still large enough to avoid any
|
|||
|
residual bugs associated with small timer values, but is also short
|
|||
|
enough to allow fast subzones to be of use.
|
|||
|
|
|||
|
This version of BIND is currently in release within Rutgers
|
|||
|
University, operating in both "fast" and normal zones.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Brisco [Page 5]
|
|||
|
|
|||
|
RFC 1794 DNS Support for Load Balancing April 1995
|
|||
|
|
|||
|
|
|||
|
6. Performance
|
|||
|
|
|||
|
While the performance of fast zones isn't exactly stellar, it is not
|
|||
|
much more than the normal CPU loads induced by BIND. Testing was
|
|||
|
performed on a Sun Sparc-2 being used as a normal workstation, but no
|
|||
|
resolvers were using the name server - essentially the nameserver was
|
|||
|
idle. For a configuration with no fast subzones, BIND accrued 11 CPU
|
|||
|
seconds in 24 hours. For a configuration with one fast zone, six
|
|||
|
address records, and being refreshed every 300 seconds (5 minutes),
|
|||
|
BIND accrued 1 minute 4 seconds CPU time. For the same previous
|
|||
|
configuration, but being refreshed every sixty seconds, BIND accrued
|
|||
|
5 minutes and 38 seconds of CPU time.
|
|||
|
|
|||
|
As is no great surprise, the CPU load on the serving machine was
|
|||
|
linear to the frequency of the refresh time. The sixty second
|
|||
|
refresh configuration used approximately five times as much CPU time
|
|||
|
as did the 300 second refresh configuration. One can easily
|
|||
|
extrapolate that the overall CPU utilization would be linear to the
|
|||
|
number of zones and the frequency of the refresh period. All of this
|
|||
|
is based on a shell script that always indicated that a zone update
|
|||
|
was necessary, a more intelligent program should realize when the
|
|||
|
reordering of the RRs was unnecessary and avoid such periodic zone
|
|||
|
reloads.
|
|||
|
|
|||
|
7. Acknowledgments
|
|||
|
|
|||
|
Most of the ideas in this document are the results of conversations
|
|||
|
and proposals from many, many people - including, but not limited to,
|
|||
|
Robert Austein, Stuart Vance, Masataka Ohta, Marshall Rose, and the
|
|||
|
members of the IETF DNS Working Group.
|
|||
|
|
|||
|
8. References
|
|||
|
|
|||
|
[1] Mockapetris, P., "Domain Names - Implementation and
|
|||
|
Specification", STD 13, RFC 1035, USC/Information Sciences
|
|||
|
Institute, November 1987.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Brisco [Page 6]
|
|||
|
|
|||
|
RFC 1794 DNS Support for Load Balancing April 1995
|
|||
|
|
|||
|
|
|||
|
9. Security Considerations
|
|||
|
|
|||
|
Security issues are not discussed in this memo.
|
|||
|
|
|||
|
10. Author's Address
|
|||
|
|
|||
|
Thomas P. Brisco
|
|||
|
Associate Director for Network Operations
|
|||
|
Rutgers University
|
|||
|
Computing Services, Telecommunications Division
|
|||
|
Hill Center for the Mathematical Sciences
|
|||
|
Busch Campus
|
|||
|
Piscataway, New Jersey 08855-0879
|
|||
|
USA
|
|||
|
|
|||
|
Phone: +1-908-445-2351
|
|||
|
EMail: brisco@rutgers.edu
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Brisco [Page 7]
|
|||
|
|