280 lines
15 KiB
HTML
280 lines
15 KiB
HTML
<!-- $NetBSD: debug.html,v 1.1 1998/12/30 20:20:33 mcr Exp $ -->
|
|
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Strict//EN">
|
|
<html><head><title>
|
|
NTP Debugging Techniques
|
|
</title></head><body><h3>
|
|
NTP Debugging Techniques
|
|
</h3><hr>
|
|
|
|
<p>Once the NTP software distribution has been compiled and installed
|
|
and the configuration file constructed, the next step is to verify
|
|
correct operation and fix any bugs that may result. Usually, the command
|
|
line that starts the daemon is included in the system startup file, so
|
|
it is executed only at system boot time; however, the daemon can be
|
|
stopped and restarted from root at any time. Usually, no command-line
|
|
arguments are required, unless special actions described in the xntpd.8
|
|
man page are required. Once started, the daemon will begin sending
|
|
messages, as specified in the configuration file, and interpreting
|
|
received messages.
|
|
|
|
<p>The best way to verify correct operation is using the <a href =
|
|
"ntpq.html"><code>ntpq</code></a> and <a href = "xntpdc.html">
|
|
<code>xntpdc</code></a> utility programs, either on the server itself
|
|
or from another machine elsewhere in the network. The <code>ntpq</code>
|
|
program implements the management functions specified in Appendix A of
|
|
the NTP specification <a href =
|
|
"ftp://ftp.udel.edu/pub/people/mills/rfc/rfc1305/rfc1305c.ps"> RFC-1305,
|
|
Appendix A </a>. The <code>xntpdc</code> program implements additional
|
|
functions not provided in the standard. Both programs can be used to
|
|
inspect the state variables defined in the specification and, in the
|
|
case of <code>xntpdc</code>, additional ones of interest. In addition,
|
|
the <code>xntpdc</code> program can be used to selectively enable and
|
|
disable some functions of the daemon while the daemon is running.
|
|
|
|
<p>In extreme cases with elusive bugs, the daemon can operate in two
|
|
modes, depending on the presence of the <code>-d</code> command-line
|
|
debug switch. If not present, the daemon detaches from the controlling
|
|
terminal and proceeds autonomously. If one or more <code>-d</code>
|
|
switches are present, the daemon does not detach and generates special
|
|
output useful for debugging. In general, interpretation of this output
|
|
requires reference to the sources.
|
|
|
|
<p>Some problems are immediately apparent when the daemon first starts
|
|
running. The most common of these are the lack of a ntp (UDP port 123)
|
|
in the host <code>/etc/services</code> file. Note that NTP does not use
|
|
TCP in any form. Other problems are apparent in the system log file. The
|
|
log file should show the startup banner, some cryptic initialization
|
|
data, and the computed precision value. The next most common problem is
|
|
incorrect DNS names. Check that each DNS name used in the configuration
|
|
file responds to the Unix <code>ping</code> command.
|
|
|
|
<p>When first started, the daemon normally polls the servers listed in
|
|
the configuration file at 64-second intervals. In order to allow a
|
|
sufficient number of samples for the NTP algorithms to reliably
|
|
discriminate between correctly operating servers and possible intruders,
|
|
at least four valid messages from at least one server is required before
|
|
the daemon can not set the local clock. However, if the current local
|
|
time is greater than 1000 seconds in error from the server time, the
|
|
daemon will not set the local clock; instead, it will plant a message in
|
|
the system log and shut down. It is necessary to set the local clock to
|
|
within 1000 seconds first, either by a time-of-year hardware clock, by
|
|
first using the <a href = "ntpdate.html"> <code>ntpdate</code> </a>
|
|
program or manually be eyeball and wristwatch.
|
|
|
|
<p>After starting the daemon, run the <code>ntpq</code> program using
|
|
the <code>-n</code> switch, which will avoid possible distractions due
|
|
to name resolution problems. Use the <code>pe</code> command to display
|
|
a billboard showing the status of configured peers and possibly other
|
|
clients poking the daemon. After operating for a few minutes, the
|
|
display should be something like:
|
|
|
|
<pre>
|
|
ntpq>pe
|
|
remote refid st when poll reach delay offset disp
|
|
========================================================================
|
|
+128.4.2.6 132.249.16.1 2 131 256 373 9.89 16.28 23.25
|
|
*128.4.1.20 .WWVB. 1 137 256 377 280.62 21.74 20.23
|
|
-128.8.2.88 128.8.10.1 2 49 128 376 294.14 5.94 17.47
|
|
+128.4.2.17 .WWVB. 1 173 256 377 279.95 20.56 16.40
|
|
</pre>
|
|
|
|
<p>The host addresses shown in the <code>remote</code> column should
|
|
agree with the DNS entries in the configuration file, plus any peers not
|
|
mentioned in the file at the same or lower than your stratum that happen
|
|
to be configured to peer with you. Be prepared for surprises in cases
|
|
where the peer has multiple addresses or multiple names. The
|
|
<code>refid</code> entry shows the current source of synchronization for
|
|
each peer, while the <code>st</code> reveals its stratum and the
|
|
<code>poll</code> entry the polling interval, in seconds. The
|
|
<code>when</code> entry shows the time since the peer was last heard,
|
|
normally in seconds, while the <code>reach</code> entry shows the status
|
|
of the reachability register (see RFC-1305), which is in octal format.
|
|
The remaining entries show the latest delay, offset and dispersion
|
|
computed for the peer, in milliseconds.
|
|
|
|
<p>The tattletale character at the left margin displays the
|
|
synchronization status of each peer. The currently selected peer is
|
|
marked <code>*</code>, while additional peers designated acceptable for
|
|
synchronization, but not currently selected, are marked <code>+</code>.
|
|
Peers marked <code>*</code> and <code>+</code> are included in a
|
|
weighted average computation to set the local clock; the data produced
|
|
by peers marked with other symbols are discarded. See the
|
|
<code>ntpq</code> documentation for the meaning of these symbols.
|
|
|
|
<p>Additional details for each peer separately can be determined by the
|
|
following procedure. First, use the <code>as</code> command to display
|
|
an index of association identifiers, such as
|
|
|
|
<pre>
|
|
ntpq>as
|
|
ind assID status conf reach auth condition last_event cnt
|
|
===========================================================
|
|
1 11670 7414 no yes ok synchr. reachable 1
|
|
2 11673 7614 no yes ok sys.peer reachable 1
|
|
3 11833 7314 no yes ok outlyer reachable 1
|
|
4 11868 7414 no yes ok synchr. reachable 1
|
|
</pre>
|
|
|
|
<p>Each line in this billboard is associated with the corresponding line
|
|
the <code>pe</code> billboard above. Next, use the <code>rv</code>
|
|
command and the respective identifier to display a detailed synopsis of
|
|
the selected peer, such as
|
|
|
|
<pre>
|
|
ntpq>rv 11670
|
|
status=7414 reach, auth, sel_sync, 1 event, event_reach
|
|
srcadr=128.4.2.6, srcport=123, dstadr=128.4.2.7, dstport=123, keyid=1,
|
|
stratum=2, precision=-10, rootdelay=362.00, rootdispersion=21.99,
|
|
refid=132.249.16.1,
|
|
reftime=af00bb44.849b0000 Fri, Jan 15 1993 4:25:40.517,
|
|
delay= 9.89, offset= 16.28, dispersion=23.25, reach=373, valid=8,
|
|
hmode=2, pmode=1, hpoll=8, ppoll=10, leap=00, flash=0x0,
|
|
org=af00bb48.31a90000 Fri, Jan 15 1993 4:25:44.193,
|
|
rec=af00bb48.305e3000 Fri, Jan 15 1993 4:25:44.188,
|
|
xmt=af00bb1e.16689000 Fri, Jan 15 1993 4:25:02.087,
|
|
filtdelay= 16.40 9.89 140.08 9.63 9.72 9.22 10.79 122.99,
|
|
filtoffset= 13.24 16.28 -49.19 16.04 16.83 16.49 16.95 -39.43,
|
|
filterror= 16.27 20.17 27.98 31.89 35.80 39.70 43.61 47.52
|
|
</pre>
|
|
|
|
<p>A detailed explanation of the fields in this billboard are beyond the
|
|
scope of this discussion; however, most variables defined in the
|
|
specification RFC-1305 can be found. The most useful portion for
|
|
debugging is the last three lines, which give the roundtrip delay, clock
|
|
offset and dispersion for each of the last eight measurement rounds, all
|
|
in milliseconds. Note that the dispersion, which is an estimate of the
|
|
error, increases as the age of the sample increases. From these data, it
|
|
is usually possible to determine the incidence of severe packet loss,
|
|
network congestion, and unstable local clock oscillators. There are no
|
|
hard and fast rules here, since every case is unique; however, if one or
|
|
more of the rounds show zeros, or if the clock offset changes
|
|
dramatically in the same direction for each round, cause for alarm
|
|
exists.
|
|
|
|
<p>Finally, the state of the local clock can be determined using the
|
|
<code>rv</code> command (without the argument), such as
|
|
|
|
<pre>
|
|
ntpq>rv
|
|
status=0664 leap_none, sync_ntp, 6 events, event_peer/strat_chg
|
|
system="UNIX", leap=00, stratum=2, rootdelay=280.62,
|
|
rootdispersion=45.26, peer=11673, refid=128.4.1.20,
|
|
reftime=af00bb42.56111000 Fri, Jan 15 1993 4:25:38.336, poll=8,
|
|
clock=af00bbcd.8a5de000 Fri, Jan 15 1993 4:27:57.540, phase=21.147,
|
|
freq=13319.46, compliance=2
|
|
</pre>
|
|
|
|
<p>The most useful data in this billboard show when the clock was last
|
|
adjusted <code>reftime</code>, together with its status and most recent
|
|
exception event. An explanation of these data is in the specification
|
|
RFC-1305.
|
|
|
|
<p>When nothing seems to happen in the <code>pe</code> billboard after
|
|
some minutes, there may be a network problem. The most common network
|
|
problem is an access controlled router on the path to the selected peer.
|
|
No known public NTP time server selectively restricts access at this
|
|
time, although this may change in future; however, many private networks
|
|
do. It also may be the case that the server is down or running in
|
|
unsynchronized mode due to a local problem. Use the <code>ntpq</code>
|
|
program to spy on its own variables in the same way you can spy on your
|
|
own.
|
|
|
|
<p>Once the daemon has set the local clock, it will continuously track
|
|
the discrepancy between local time and NTP time and adjust the local
|
|
clock accordingly. There are two components of this adjustment, time and
|
|
frequency. These adjustments are automatically determined by the clock
|
|
discipline algorithm, which functions as a hybrid phase/frequency
|
|
feedback loop. The behavior of this algorithm is carefully controlled to
|
|
minimize residual errors due to network jitter and frequency variations
|
|
of the local clock hardware oscillator that normally occur in practice.
|
|
However, when started for the first time, the algorithm may take some
|
|
time to converge on the intrinsic frequency error of the host machine.
|
|
|
|
<p>It has sometimes been the experience that the local clock oscillator
|
|
frequency error is too large for the NTP discipline algorithm, which can
|
|
correct frequency errors as large as 30 seconds per day. There are two
|
|
possibilities that may result in this problem. First, the hardware time-
|
|
of-year clock chip must be disabled when using NTP, since this can
|
|
destabilize the discipline process. This is usually done using the <a
|
|
href = "tickadj.html"> <code>tickadj</code></a> program and the
|
|
<code>-s</code> command line argument, but other means may be necessary.
|
|
For instance, in the Sun Solaris kernel, this must be done using a
|
|
command in the system startup file.
|
|
|
|
<p>Normally, the daemon will adjust the local clock in small steps in
|
|
such a way that system and user programs are unaware of its operation.
|
|
The adjustment process operates continuously as long as the apparent
|
|
clock error exceeds 128 milliseconds, which for most Internet paths is a
|
|
quite rare event. If the event is simply an outlyer due to an occasional
|
|
network delay spike, the correction is simply discarded; however, if the
|
|
apparent time error persists for an interval of about 20 minutes, the
|
|
local clock is stepped to the new value (as an option, the daemon can be
|
|
compiled to slew at an accelerated rate to the new value, rather than be
|
|
stepped). This behavior is designed to resist errors due to severely
|
|
congested network paths, as well as errors due to confused radio clocks
|
|
upon the epoch of a leap second.
|
|
|
|
<p><h4>Debugging Checklist</h4>
|
|
|
|
<p>If the <code>ntpq</code> or <code>xntpdc</code> programs do not
|
|
show that messages are being received by the daemon or that received
|
|
messages do not result in correct synchronization, verify the following:
|
|
|
|
<ol>
|
|
|
|
<li>Verify the <code>/etc/services</code> file host machine is
|
|
configured to accept UDP packets on the NTP port 123. NTP is
|
|
specifically designed to use UDP and does not respond to TCP.
|
|
|
|
<p><li>Check the system log for <code>xntpd</code> messages about
|
|
configuration errors, name-lookup failures or initialization problems.
|
|
|
|
<p><li>Using the <code>xntpdc</code> program and <code>iostats</code>
|
|
command, verify that the received packets and packets sent counters are
|
|
incrementing. If the packets send counter does not increment and the
|
|
configuration file includes designated servers, something may be wrong
|
|
in the network configuration of the xntpd host. If this counter does
|
|
increment and packets are actually being sent to the network, but the
|
|
received packets counter does not increment, something may be wrong in
|
|
the network or the server may not be responding.
|
|
|
|
<p><li>If both the packets sent counter and received packets
|
|
counter do increment, but the <code>rec</code> timestamp in the
|
|
<code>pe</code> billboard shows a date in 1972, received packets are
|
|
probably being discarded for some reason. There is a handy, undocumented
|
|
state variable <code>flash</code> visible in the
|
|
<code>pe</code>billboard. The value is in hex and normally has the value
|
|
zero (OK). However, if something is wrong, the bits of this variable,
|
|
reading from the right, correspond to the sanity checks listed in
|
|
Section 3.4.3 of the NTP specification <a href =
|
|
"ftp://ftp.udel.edu/pub/people/mills/rfc/rfc1305/rfc1305b.ps">
|
|
RFC-1305</a>. A bit other than zero indicates the associated sanity
|
|
check failed.
|
|
|
|
<p><li>If the <code>org, rec</code> and <code>xmt</code>
|
|
timestamps in the <code>pe</code> billboard appear current, but the
|
|
local clock is not set, as indicated by a stratum number less than 16 in
|
|
the <code>rv</code> command without arguments, verify that valid clock
|
|
offset, roundtrip delay and dispersion are displayed for at least one
|
|
peer. The clock offset should be less than 1000 seconds, the roundtrip
|
|
delay less than one second and the dispersion less than one second.
|
|
|
|
<p><li>While the algorithm can tolerate a relatively large frequency
|
|
error (over 350 parts per million or 30 seconds per day), various
|
|
configuration errors (and in some cases kernel bugs) can exceed this
|
|
tolerance, leading to erratic behavior. This can result in frequent loss
|
|
of synchronization, together with wildly swinging offsets. Use the
|
|
<code>xntpdc</code> program (or temporary configuration file) and
|
|
<code>disable pll</code> command to prevent the <code>xntpd</code>
|
|
daemon from setting the clock. Using the <code>ntpq</code> or
|
|
<code>xntpdc</code> programs, watch the apparent offset as it varies
|
|
over time to determine the intrinsic frequency error. If the error
|
|
increases by more than 22 milliseconds per 64-second poll interval, the
|
|
intrinsic frequency must be reduced by some means. The easiest way to do
|
|
this is with the <a href="tickadj.html"><code>tickadj</code></a> program
|
|
and the <code>-t</code> command line argument.
|
|
|
|
</ol>
|
|
|
|
<hr><address>David L. Mills (mills@udel.edu)</address></body></html>
|