.\" $NetBSD: trans_design.nr,v 1.3 2000/03/13 23:03:35 soren Exp $
.\"
.NC "The Design of the ARGO Transport Entity"
.sh 1 "Protocol Hooks"
.pp
The design of the AOS kernel IPC support to some
extent mandates the
design of protocols.
Each protocol must provide the following
protocol hooks, which are procedures called through a
protocol switch table
(an array of type \fIprotosw\fR as described in
Chapter Five).
.ip "pr_input()" 5
Called when data are to be passed up from a lower layer.
.ip "pr_output()" 5
Called when data are to be passed down from a higher layer.
.ip "pr_init()" 5
Called when the system is brought up.
.ip "pr_fasttimo()" 5
Called every 200 milliseconds by the clock functional unit.
.ip "pr_slowtimo()" 5
Called every 500 milliseconds by the clock functional unit.
.ip "pr_drain()" 5
This is meant to be called when buffer space is low.
Each protocol is expected to provide this routine to free
non-critical buffer space.
This is not yet called anywhere.
.ip "pr_ctlinput()" 5
Used for exchanging information between
protocols, such as notifying a transport protocol of changes
in routing or configuration information.
.ip "pr_ctloutput()" 5
Supports the protocol-dependent
\fIgetsockopt()\fR
and
\fIsetsockopt()\fR
options.
.ip "pr_usrreq()" 5
Called by the socket code to pass along a \*(lquser request\*(rq -
in other words a service primitive.
This call is also used for other protocol functions.
The functions served by the \fIpr_usrreq()\fR routine are:
.ip " PRU_ATTACH" 10
Creates a protocol control block and attaches it to a given socket.
Called as a result of a \fIsocket()\fR system call.
.ip " PRU_DISCONNECT" 10
Called as a result of a
\fIclose()\fR system call.
Initiates disconnection.
.ip " PRU_DETACH" 10
Disassociates a protocol control block from a socket and recycles
the buffer space used for the protocol control block.
Called after PRU_DISCONNECT.
.ip " PRU_SHUTDOWN" 10
Called as a result of a
\fIshutdown()\fR system call.
If the protocol supports the notion of half-open connections,
this closes the connection in one direction or both directions,
depending on the arguments passed to
\fIshutdown\fR.
.ip " PRU_BIND" 10
Gives an address to a socket.
Called as a result of a
\fIbind()\fR system call, and also
when a
socket without a bound address is used.
In the latter case, an unused transport suffix is located and
bound to the socket.
.ip " PRU_LISTEN" 10
Called as a result of a
\fIlisten()\fR system call.
Marks the socket as willing to queue incoming connection
requests.
.ip " PRU_CONNECT" 10
Called as a result of a
\fIconnect()\fR system call.
Initiates a connection request.
.ip " PRU_ACCEPT" 10
Called as a result of an
\fIaccept()\fR system call.
Dequeues a pending connection request, or blocks waiting for
a connection request to arrive.
In the latter case, it marks the socket as willing to accept
connections.
.ip " PRU_RCVD" 10
The protocol module is expected to have put incoming data
into the socket's receive buffer, \fIso_rcv\fR.
When a receive primitive is used
(\fIrecv(), recvmsg(), recvfrom(),
read(), readv(), \fRand
\fIrecvv()\fR system calls)
the socket code module copies data from
\fIso_rcv\fR to the user's
address space.
The protocol module may arrange to be informed each time the socket code
does this, in which case the socket code calls \fIpr_usrreq\fR(PRU_RCVD)
after the data have been copied to the user.
.ip " PRU_SEND" 10
This performs the protocol-dependent part of a send primitive
(\fIsend(), sendmsg(), sendto(), write(), writev(),
\fRand \fIsendv()\fR system calls).
The socket code
(procedures \fIsendit()\fR and \fIsosend()\fR)
moves outgoing data from the user's
address space into a chain of \fImbufs\fR.
The socket code takes as much data from the user as it
determines will fit into the outgoing socket buffer, \fIso_snd\fR.
It passes this much data in the form of an mbuf chain to the protocol
via \fIpr_usrreq\fR(PRU_SEND).
If there are more data than
\fIso_snd\fR can accommodate,
the socket code, which is running on behalf of a user process,
puts the user process to sleep.
The protocol module is expected to wake up the user process when
more room appears in \fIso_snd\fR.
.ip " PRU_ABORT" 10
Called when a socket is closed and that socket
is accepting connections and has
queued pending
connection requests or
partially open connections.
.ip " PRU_CONTROL" 10
Called as a result of an
\fIioctl()\fR system call.
.ip " PRU_SENSE" 10
Called as a result of an
\fIfstat()\fR system call.
.ip " PRU_RCVOOB" 10
Performs the work of receiving \*(lqout-of-band\*(rq data.
The socket module has already allocated an mbuf into which
the protocol module is expected to put the incoming
\*(lqout-of-band\*(rq data.
The socket code will then move the data from this mbuf
to the user's address space.
.ip " PRU_SENDOOB" 10
Performs the work of sending \*(lqout-of-band\*(rq data.
The socket module has already moved the data
from the user's address space into a chain of mbufs,
which it now passes to the protocol module.
.ip " PRU_SOCKADDR" 10
Supports the system call
\fIgetsockname()\fR.
Puts the socket's bound address into an mbuf.
.ip " PRU_PEERADDR" 10
Supports the system call
\fIgetpeername()\fR.
Puts the peer's address into an mbuf.
.ip " PRU_CONNECT2" 10
This is used in the Unix domain to support pipes.
It is not generally supported by transport protocols.
.ip " PRU_FASTTIMO, PRU_SLOWTIMO" 10
These are superfluous.
None of the transport protocols uses them.
.ip " PRU_PROTORCV, PRU_PROTOSEND" 10
None of the transport protocols uses these.
.ip " PRU_SENDEOT" 10
This was added to support TP.
This indicates that the end of the data sent in this
send primitive should
be marked by the protocol as the end of the TSDU.
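.pp
The calling convention described above can be sketched in C as follows.
This is a minimal illustration, not the AOS source: the structure
mini_protosw, the toy protocol, the helper sock_request(), and the numeric
values of the request codes are all hypothetical.
Only the hook names and the request names are taken from the list above.

```c
#include <stddef.h>

/* Request codes: the names follow the text; the values are illustrative. */
enum pru_req { PRU_ATTACH, PRU_DETACH, PRU_SEND, PRU_NREQ };

struct socket;                       /* opaque in this sketch */

/* Hypothetical, cut-down protocol switch entry. */
struct mini_protosw {
    int (*pr_init)(void);
    int (*pr_usrreq)(struct socket *so, int req, void *data);
};

/* Per-request counters kept by the toy protocol. */
static int toy_calls[PRU_NREQ];

static int toy_init(void)
{
    return 0;
}

static int toy_usrreq(struct socket *so, int req, void *data)
{
    (void)so;
    (void)data;
    if (req < 0 || req >= PRU_NREQ)
        return -1;                   /* request not supported */
    toy_calls[req]++;
    return 0;
}

static struct mini_protosw toy_sw = { toy_init, toy_usrreq };

/* The socket code never names a protocol routine; it goes via the table. */
static int sock_request(struct mini_protosw *pr, struct socket *so,
                        int req, void *data)
{
    return pr->pr_usrreq(so, req, data);
}
```

A send primitive would then reach the protocol as
sock_request(&toy_sw, so, PRU_SEND, chain), with the protocol free to
reject request codes it does not implement.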
.sh 1 "The Interface Between the Transport Entity and Lower Layers"
.pp
The transport layer may run over a network layer such as IP
or the ISO connectionless network layer,
or it may run over a multi-purpose layer such as the service
provided by X.25.
X.25 is viewed as a network layer when
TP runs over X.25, and as a
subnetwork layer
when IP is running over X.25.
The software interface between data link and network layers differs
considerably from the software interface between transport and network
layers in AOS.
For this reason some modification of the transport-to-lower-layer
interface is necessary to support the suite of protocols included in
ARGO.
.pp
In AOS it is assumed that the transport layer will run over one
and only one network layer, and therefore it may call the
network layer output procedure directly.
In order to allow TP to run over a set of lower layers,
all domain-specific functions have been put into a set of routines
that are called indirectly through a domain-specific switch table.
The primary reason for this is that the transport and network
layers share information, mostly information pertaining to addresses.
The protocol control blocks for different network layers
differ, so the transport layer cannot just directly
access the network layer's pcb.
Similarly, a network layer may not directly access the transport
pcb because a multitude of transport protocols can run over each
of the network protocols.
.pp
To permit different network-layer protocol control blocks to coexist
under one transport layer, all transport-dependent control
information was put into a transport-specific protocol control block.
A new field, \fIso_tpcb\fR,
was added to the \fIsocket\fR structure to hold a pointer to
the transport-layer protocol control block.
The existing
field \fCso_pcb\fR is used for the network layer pcb.
.pp
The following structure was added to allow domain-specific
functions to be called indirectly.
All these functions operate on a network-layer pcb.
.pp
.(b
\fC
.TS
tab(+);
l s s s.
struct nl_protosw {
.T&
l l l l.
+int+nlp_afamily;+/* address family */
+int+(*nlp_putnetaddr)();+/* puts addrs in pcb */
+int+(*nlp_getnetaddr)();+/* gets addrs from pcb */
+int+(*nlp_putsufx)();+/* transp suffix -> pcb */
+int+(*nlp_getsufx)();+/* gets t-suffix */
+int+(*nlp_recycle_suffix)();+/* zeroes suffix */
+int+(*nlp_mtu)();+/* get maximum
+++transmission unit size */
+int+(*nlp_pcbbind)();+/* bind to pcb */
+int+(*nlp_pcbconn)();+/* connect */
+int+(*nlp_pcbdisc)();+/* disconnect */
+int+(*nlp_pcbdetach)();+/* detach pcb */
+int+(*nlp_pcballoc)();+/* allocate a pcb */
+int+(*nlp_output)();+/* emit packet */
+int+(*nlp_dgoutput)();+/* emit datagram */
+caddr_t+nlp_pcblist;+/* list of pcbs
+++for management
+++of connections */
};
.TE
\fR
.)b
.lp
The switch is based on the address family chosen when the
\fIsocket()\fR system call is made prior to connection establishment.
This unfortunately ties the address family to the domain,
but the only alternative is to add an argument to the \fIsocket()\fR
system call to let the user specify the desired network layer.
In the case of a connection-oriented environment with no multi-homing,
it would be possible to determine which network layer is to be
used
from routing
information, but to do this requires unrealistic assumptions
about the environment.
For these reasons, linking the address family to the network
layer protocol is seen as the least of the evils.
The transport suffixes are kept in the network layer's pcb
as well as in the transport layer because
full transport address pairs are used to identify a connection
in the Internet domain.
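.pp
The selection of a switch entry by address family might look like the
following sketch.
It is an assumption-laden illustration: the structure is cut down to two
fields, and the family constants, the MTU figures, and every name beginning
with ex_ are hypothetical placeholders rather than values from the ARGO
source.

```c
#include <stddef.h>

/* Cut-down version of the nl_protosw structure shown above. */
struct mini_nl_protosw {
    int nlp_afamily;                 /* address family */
    int (*nlp_mtu)(void *npcb);      /* get maximum transmission unit */
};

/* Hypothetical address family numbers for this sketch only. */
#define EX_AF_INET 2
#define EX_AF_ISO  7

/* Placeholder MTU routines; the numbers are arbitrary. */
static int ex_in_mtu(void *npcb)  { (void)npcb; return 576; }
static int ex_iso_mtu(void *npcb) { (void)npcb; return 512; }

static struct mini_nl_protosw ex_nl_sw[] = {
    { EX_AF_INET, ex_in_mtu },
    { EX_AF_ISO,  ex_iso_mtu },
};

/* Find the switch entry for the family given to socket(), or NULL. */
static struct mini_nl_protosw *ex_nl_lookup(int af)
{
    size_t i;

    for (i = 0; i < sizeof ex_nl_sw / sizeof ex_nl_sw[0]; i++)
        if (ex_nl_sw[i].nlp_afamily == af)
            return &ex_nl_sw[i];
    return NULL;
}
```

The transport code would look the entry up once, when the socket is
created, and afterwards reach the network layer only through the function
pointers in that entry.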
.sh 1 "The Architecture of the Transport Protocol Entity"
.pp
A set of protocol hooks is required
by the AOS IPC architecture.
These hooks are used by the protocol-independent parts of the kernel
to gain entry to protocol-specific code.
The protocol code can be entered in one of the following ways:
.ip "1) " 5
at boot time, when autoconfiguration
initializes each protocol through
the
\fIpr_init()\fR
hook,
.ip "2) " 5
from above, either
a user program making a system call, through
the \fIpr_usrreq()\fR or \fIpr_ctloutput()\fR hooks, or
from a higher layer protocol using the
\fIpr_output()\fR hook,
.ip "3) " 5
from below, a device interrupt servicing an incoming packet
through the \fIpr_input()\fR and \fIpr_ctlinput()\fR hooks, and
.ip "4) " 5
from a clock interrupt through the \fIpr_slowtimo()\fR
or the
\fIpr_fasttimo()\fR hook.
.\" FIGURE
.so figs/trans_flow.nr
.\".so figs/trans_flow.grn
.pp
The protocol code can be divided into
the following modules, which are described in more detail below.
.CF
shows the flow of data and control
among these modules.
.in +5
.ip "Timers and References:" 5
The code executed on behalf of \fIpr_slowtimo()\fR.
The fast timeout is not used by TP.
.ip "Driver:" 5
This is the finite state machine for TP.
.ip "Input: " 5
This is the module that decodes incoming packets,
identifies or creates the pcb for which
the packet is destined, and creates an "event" to
pass to the driver.
.ip "Output:" 5
This is the module that creates a packet header of a given type
with fields containing
values that are appropriate to the connection
on which the packet is being sent, appends data if necessary,
and hands a packet
to the lower layer, according to the transport-to-lower-layer
interface.
.ip "Send: " 5
This module packetizes data from the outbound
socket buffer, \fIso_snd\fR,
handles retransmissions of packetized data, and
drops packetized data from the retransmission queue.
.ip "Receive:" 5
This module reorders packets if necessary,
depacketizes data, passes it to the socket code module,
and determines when acknowledgments should be sent.
.in -5
.sh 1 "Timers and References"
.pp
TP identifies sockets by \fIreference numbers\fR, or
\fIreferences\fR,
which are \*(lqfrozen\*(rq (may not be reassigned)
until some locally defined time after
a connection is broken and its protocol control block
is discarded.
An array of \fIreference blocks\fR is maintained by TP.
The reference number of a reference block is its
offset in the array.
When a reference block is in use it contains
a pointer to the pcb for the socket to which the
reference applies.
.pp
The system clock calls the \fIpr_slowtimo()\fR and
\fIpr_fasttimo()\fR hooks for each protocol in the protocol switch table
every 500 and 200 milliseconds, respectively.
Each protocol handles its own timers in its own way.
The timers in TP take two forms
- those that typically are cancelled and
those that usually expire.
The latter form may have more than one instantiation at any given
time.
The former may not.
The two are implemented slightly
differently for the sake of performance.
.pp
The timers that normally expire
are kept in a queue, their values all relative
to the value of the preceding timer.
Thus all timer values are decremented by a single
operation on the value of the first timer.
The timer is represented by the Ecallout structure:
.(b
\fC
.TS
tab(+);
l s s s.
struct Ecallout {
.T&
l l l l.
+int+c_time;+/* incremental time */
+int+c_func;+/* function to call */
+u_int+c_arg1;+/* argument to routine */
+u_int+c_arg2;+/* argument to routine */
+int+c_arg3;+/* argument to routine */
+struct Ecallout+*c_next;
};
.TE
\fR
.)b
.lp
When an Ecallout structure migrates to the head
of the E timer list, and its \fIc_time\fR
field is decremented to zero,
the function stored in \fIc_func\fR is
called, with \fIc_arg1, c_arg2\fR, and \fIc_arg3\fR
as arguments.
Setting and cancelling these timers
are accomplished by a linear search and one
insertion or deletion from the timer queue.
This queue is linked to the
reference block associated with a communication endpoint.
This form is used for the reference timer
and for the retransmission timers for data TPDUs.
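.pp
The relative-time queue described above is commonly called a delta list.
The following is a minimal sketch of one, written for illustration only:
the structure ex_ecallout is a simplified stand-in for Ecallout with a
single argument and a typed function pointer, and ex_etimer_set(),
ex_etimer_tick(), and the recording callback ex_record() are hypothetical
names, not routines from the TP source.

```c
#include <stddef.h>

struct ex_ecallout {
    int c_time;                      /* ticks after the preceding entry */
    void (*c_func)(int);             /* function to call on expiry */
    int c_arg;                       /* its argument */
    struct ex_ecallout *c_next;
};

/* Expiry log used by the example callback. */
static int ex_fired[8];
static int ex_nfired;

static void ex_record(int arg)
{
    if (ex_nfired < 8)
        ex_fired[ex_nfired++] = arg;
}

/* Set a timer: one linear search and one insertion. */
static void ex_etimer_set(struct ex_ecallout **head,
                          struct ex_ecallout *e, int ticks)
{
    struct ex_ecallout **pp = head;

    /* Skip entries that expire no later, consuming their deltas. */
    while (*pp != NULL && (*pp)->c_time <= ticks) {
        ticks -= (*pp)->c_time;
        pp = &(*pp)->c_next;
    }
    e->c_time = ticks;
    e->c_next = *pp;
    if (*pp != NULL)
        (*pp)->c_time -= ticks;      /* successor stays relative to e */
    *pp = e;
}

/* One clock tick: a single decrement ages every queued timer. */
static void ex_etimer_tick(struct ex_ecallout **head)
{
    if (*head == NULL)
        return;
    (*head)->c_time--;
    while (*head != NULL && (*head)->c_time <= 0) {
        struct ex_ecallout *e = *head;

        *head = e->c_next;           /* unlink before calling out */
        e->c_func(e->c_arg);
    }
}
```

Timers that are due on the same tick carry a delta of zero, so they fire
together when the entry ahead of them expires.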
.pp
The second form of timer, the type that
typically is cancelled, is used for several
timers - the inactivity timer, the sendack timer,
and the retransmission
timer for all types of TPDUs except data TPDUs.
.(b
\fC
.TS
tab(+);
l s s s.
struct Ccallout {
.T&
l l l l.
+int+c_time;+/* incremental time */
+int+c_active;+/* this timer is active? */
};
.TE
\fR
.)b
.lp
All of these timers are stored
directly
in the reference block.
These timers are decremented in one linear scan of
the reference blocks.
Setting, cancelling, or
resetting one of these timers is accomplished by a
single assignment to an array element.
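.pp
A sketch of this second form follows.
It assumes a fixed number of timers per reference block (EX_N_CTIMERS,
a made-up value) and uses hypothetical names throughout; the real code
would raise a timeout event for the driver where this sketch merely counts
expirations.

```c
#define EX_N_CTIMERS 4               /* hypothetical number of C timers */

struct ex_ccallout {
    int c_time;                      /* ticks remaining */
    int c_active;                    /* nonzero while the timer runs */
};

/* Only the timer array of a reference block is modelled here. */
struct ex_ref {
    struct ex_ccallout callout[EX_N_CTIMERS];
};

/* Setting or resetting is one assignment to an array element. */
static void ex_ctimer_set(struct ex_ref *r, int which, int ticks)
{
    r->callout[which] = (struct ex_ccallout){ ticks, 1 };
}

/* So is cancelling. */
static void ex_ctimer_cancel(struct ex_ref *r, int which)
{
    r->callout[which] = (struct ex_ccallout){ 0, 0 };
}

/* One clock tick: scan every reference block, return expirations. */
static int ex_ctimer_scan(struct ex_ref *refs, int nrefs)
{
    int r, t, expired = 0;

    for (r = 0; r < nrefs; r++) {
        for (t = 0; t < EX_N_CTIMERS; t++) {
            struct ex_ccallout *c = &refs[r].callout[t];

            if (c->c_active && --c->c_time <= 0) {
                c->c_active = 0;
                expired++;           /* real code would notify the driver */
            }
        }
    }
    return expired;
}
```

The trade-off against the E timers is visible here: setting and cancelling
cost nothing, but every tick pays for a scan of all reference blocks,
which suits timers that are usually cancelled before they expire.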
.sh 1 "Driver"
.pp
This is the finite state machine for TP.
A connection is managed by the finite state machine (fsm).
All events that pertain to a connection cause the
finite state machine driver to be called.
The driver takes two arguments - the pcb for the connection
and an event structure.
The event structure contains a field that discriminates
the different types of events, and a union of
structures that are specific to the event types.
The driver evaluates a set of predicates based on the current
state of the finite state machine (which is kept in the pcb) and the event type.
The result of the predicate evaluation determines
a set of actions to take and a state transition.
The driver takes the actions and if they complete
without errors, the driver makes the state transition.
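.pp
The shape of such a driver can be sketched by hand, although the real
\fItp_driver()\fR is generated and far larger.
Everything in the sketch below is hypothetical: the two states, the two
event types, the fields of the event union, and the name ex_driver() were
chosen only to show the discriminated event, the predicate test, and the
rule that the state changes only when the actions succeed.

```c
enum ex_state  { EX_CLOSED, EX_OPEN };
enum ex_evtype { EX_EV_CONNECT, EX_EV_DATA };

/* An event: a discriminating field plus a union of per-type data. */
struct ex_event {
    enum ex_evtype ev_type;
    union {
        struct { int peer_ref; } connect;
        struct { int len; } data;
    } ev_u;
};

/* The state of the machine is kept in the pcb. */
struct ex_pcb {
    enum ex_state state;
    int peer_ref;
    int octets;
};

/* Returns 0 on success, -1 if no predicate matched or an action failed. */
static int ex_driver(struct ex_pcb *pcb, const struct ex_event *ev)
{
    enum ex_state next;
    int error = 0;

    if (pcb->state == EX_CLOSED && ev->ev_type == EX_EV_CONNECT) {
        /* Action: remember the peer; fails on a bad reference. */
        if (ev->ev_u.connect.peer_ref <= 0)
            error = -1;
        else
            pcb->peer_ref = ev->ev_u.connect.peer_ref;
        next = EX_OPEN;
    } else if (pcb->state == EX_OPEN && ev->ev_type == EX_EV_DATA) {
        /* Action: account for the received octets. */
        pcb->octets += ev->ev_u.data.len;
        next = EX_OPEN;
    } else {
        return -1;                   /* event not valid in this state */
    }

    if (error)
        return error;                /* actions failed: state unchanged */
    pcb->state = next;
    return 0;
}
```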
.pp
The states, event types, predicates, actions, and state transitions are all
specified in a \fIxebec transition file\fR.
\fIXebec\fR is a utility that takes a human-readable description
of a finite state machine
and produces a set of tables and C source code for the driver.
The driver procedure is called \fItp_driver()\fR.
It is located in a file generated by xebec,
\fCtp_driver.c\fR.
For more details about xebec, see the manual page \fIxebec(1)\fR.
.pp
The transition file for TP is \fCtp.trans\fR,
and it is a good place to begin a perusal of the TP
source code.
.sh 1 "Input"
.pp
This is the module that decodes an incoming packet,
locates or creates the pcb for which
the packet is destined, and creates an event to
pass to the driver.
The network layer passes a packet up to the appropriate
transport layer by indirectly calling a transport input
routine through the protocol switch table for the network
domain.
There is one protocol switch entry for TP for each domain in which
TP will run (Internet, ISO).
In the Internet domain, the protocol switch field \fIpr_input()\fR
takes the value \fItpip_input()\fR.
This procedure accepts a packet from IP, with the IP header
still intact.
It extracts the network addresses from the IP header,
strips the IP header, and calls the domain-independent
input procedure for TP,
\fItp_input()\fR.
\fITp_input()\fR
decodes a TPDU.
The multitude of options, the variable-length
nature of the options, the semantics of the
options, and the possible combinations of concatenated
TPDUs make this a
complex procedure.
It is sensitive to changes, and from
the point of view of software maintenance, it is a
potential hazard.
Because it is in the
critical path of TP, however, some compromise
was made between maintainability and efficiency.
Multiple copies of sections of code were avoided as much as
possible,
not for the sake of saving space, but rather for the sake
of maintainability.
Ironically,
this detracts somewhat from the readability of the code.
.pp
Once a TPDU has been decoded and a pcb has been
identified for the TPDU,
the appropriate fields of the TPDU
are extracted and their values are placed in
an event structure.
Finally, \fItp_driver()\fR is called with
the event structure and the pcb as parameters.
.sh 1 "Output"
.pp
This module creates a TPDU header of a given type
with field values that are appropriate to the connection
on which the TPDU is being sent, appends data if necessary,
and hands a TPDU
to the lower layer according to the transport-to-lower-layer
interface.
Whenever a TPDU is to be sent to the peer or prospective peer,
the function \fItp_emit()\fR
is called, passing as arguments the pcb, a TPDU type, and several miscellaneous
other type-specific arguments, possibly including some data.
The data are in the form of an mbuf chain.
\fITp_emit()\fR prepends to the data an mbuf containing a TP header,
fills in the fields of the header according to the parameters
given, performs the checksum if appropriate, and
calls a domain-specific output routine.
For the Internet domain, this output routine is
\fItpip_output()\fR, which takes
as arguments the mbuf chain representing the TPDU
and a network-level pcb.
Some protocol errors cannot be associated with
a connection
but require that TP issue
an ER TPDU or a DR TPDU.
When these errors occur the routine
\fItp_error_emit()\fR is called.
This procedure creates the appropriate type of TPDU
and passes it to a domain-dependent routine for transmitting datagrams.
In the Internet domain,
\fItpip_output_dg()\fR is called.
This takes as arguments an mbuf chain representing the TPDU,
a source network address, and a destination network address.
.sh 1 "Send"
.\" FIGURE
.so figs/mbufsnd.nr
.\".so figs/mbufsnd.grn
.pp
This module packetizes data from the outbound
socket buffer, \fIso_snd\fR,
handles retransmissions of packetized data, and
drops packetized data from the retransmission queue.
The major routine in this module is \fItp_send()\fR, which
takes a range of sequence numbers as arguments.
For each sequence number in the range,
it packetizes an appropriate amount
of outbound data, and places the resulting TPDU on
a retransmission control queue subject to the
constraints imposed by the rules of expedited data,
maximum packet sizes, and end-of-TSDU markers.
.pp
The most complicating factor is that of managing
expedited data.
A normal datum may not be sent (for its first time) before the
acknowledgment of any expedited datum
that was received from the user after the
normal datum was received.
In order to enforce this rule,
each TPDU must be marked in some way
so that it will be known which expedited datum
must be delivered and acknowledged by the peer before this TPDU may be transmitted
for the first time.
Markers are placed in \fIso_snd\fR
when an
outgoing expedited datum arrives from the user.
A marker is an mbuf structure with an \fIm_len\fR
of zero, but with the data area nevertheless containing
the sequence number of an expedited data TPDU.
The \fIm_type\fR of a marker is a new type, MT_XPD.
.pp
\fITp_send()\fR stops packetizing data when it encounters a marker
for an unacknowledged expedited datum.
If it encounters a marker for an expedited TPDU that has already
been acknowledged, the marker is jettisoned.
.CF
illustrates the structure of the sending socket buffer used
for normal data.
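.pp
The marker rule can be illustrated with a simplified model of the send
buffer.
The sketch below is not \fItp_send()\fR: it uses a hypothetical ex_mbuf
structure in place of a real mbuf, stores the marker's sequence number in
an ordinary field, ignores sequence number wraparound, and only counts the
octets that may be packetized instead of building TPDUs.
The parameter xpd_una is assumed to be the sequence number of the oldest
unacknowledged expedited TPDU.

```c
#include <stddef.h>

enum ex_mtype { EX_MT_DATA, EX_MT_XPD };

/* Simplified stand-in for an mbuf on so_snd. */
struct ex_mbuf {
    enum ex_mtype m_type;
    int m_len;                       /* zero for a marker */
    unsigned xpd_seq;                /* marker only: expedited seq # */
    struct ex_mbuf *m_next;
};

/*
 * Walk the buffer from the front.  Acknowledged markers are unlinked;
 * the walk stops at the first marker whose expedited TPDU is still
 * unacknowledged.  Returns the number of octets that may be packetized.
 */
static int ex_sendable(struct ex_mbuf **head, unsigned xpd_una)
{
    struct ex_mbuf **pp = head;
    int octets = 0;

    while (*pp != NULL) {
        struct ex_mbuf *m = *pp;

        if (m->m_type == EX_MT_XPD) {
            if (m->xpd_seq >= xpd_una)
                break;               /* must wait for the XAK */
            *pp = m->m_next;         /* jettison the stale marker */
            continue;
        }
        octets += m->m_len;
        pp = &m->m_next;
    }
    return octets;
}
```

Normal data queued behind an unacknowledged marker stays in the buffer
until a later call finds that the marker's expedited TPDU has been
acknowledged.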
.pp
When \fItp_send()\fR moves data from mbufs on \fIso_snd\fR to the retransmission
control queue, it needs to know
how many octets of data can be placed in each TPDU.
The appropriate amount depends on, among other things,
the maximum transmission unit of the network layer
on the route the packet will take.
To determine the maximum transmission unit,
TP queries the network layer through
the domain-dependent switch table's field, \fInlp_mtu\fR.
In the Internet domain, this resolves to \fItp_inmtu()\fR.
The header sizes for the network and transport layers
also affect the amount of data that can go into a packet,
and these sizes depend on the connection's characteristics.
.pp
Once the maximum amount of data per TPDU is determined,
\fItp_send()\fR can pull this amount off the \fIso_snd\fR queue to form
a TPDU,
assign a TPDU sequence number,
and place the new TPDU on the
retransmission control queue.
The retransmission control queue is a list of mbuf chains.
Each mbuf chain represents one TPDU, preceded by an
\fIrtc structure\fR:
.(b
\fC
.TS
tab(+);
l s s s.
struct tp_rtc {
.T&
l l l l.
+struct tp_rtc+*tprt_next;+/* next rtc struct in list */
+SeqNum+tprt_seq;+/* seq # of this TPDU */
+int+tprt_eot;+/* end of TSDU? */
+int+tprt_octets;+/* # octets in this TPDU */
+struct mbuf+*tprt_data;+/* ptr to the octets of data */
.\"/* Performance measurment info: */
.\"int tprt_window; /* in which call to tp_send() was
.\" * this TPDU formed?
.\" */
.\"struct timeval tprt_sess_time; /* time session received the
.\" * majority of the data for this packet on send;
.\" * on recv, this is the time it's given to session
.\" */
.\"struct timeval tprt_net_time; /* time first copy was given to net layer
.\" * on send; on receive it's the time received from
.\" * the network
.\" */
};
.TE
\fR
.)b
.lp
Once TPDUs are on the retransmission control queue,
they are retransmitted or dropped by the actions
of timers.
The procedure \fItp_sbdrop()\fR
removes the TPDUs from the retransmission queue.
It takes a sequence number as an argument and drops
all TPDUs up to and including the TPDU with that sequence number.
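.pp
The dropping rule can be sketched on a simplified list.
The code below is an illustration under stated assumptions, not
\fItp_sbdrop()\fR itself: ex_rtc keeps only three of the fields of tp_rtc,
the records are allocated with malloc() rather than held in mbufs, and
sequence numbers are compared directly, so wraparound is ignored.

```c
#include <stdlib.h>

/* Simplified retransmission control record. */
struct ex_rtc {
    struct ex_rtc *next;
    unsigned seq;                    /* sequence number of this TPDU */
    int octets;                      /* octets of data it carries */
};

/* Append a record at the tail; returns 0, or -1 if out of memory. */
static int ex_rtc_append(struct ex_rtc **head, unsigned seq, int octets)
{
    struct ex_rtc *r = malloc(sizeof *r);

    if (r == NULL)
        return -1;
    r->next = NULL;
    r->seq = seq;
    r->octets = octets;
    while (*head != NULL)
        head = &(*head)->next;
    *head = r;
    return 0;
}

/* Drop every record up to and including seq; returns octets freed. */
static int ex_sbdrop(struct ex_rtc **head, unsigned seq)
{
    int dropped = 0;

    while (*head != NULL && (*head)->seq <= seq) {
        struct ex_rtc *r = *head;

        *head = r->next;
        dropped += r->octets;
        free(r);
    }
    return dropped;
}
```

Because the queue is kept in sequence order, the drop never has to look
past the first record that is still unacknowledged.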
.pp
When an AK TPDU arrives, the values from
its credit and sequence number fields
are passed to \fItp_goodack()\fR, which
determines whether or not the AK brought any news with it,
and therefore whether TP can send more data
or expedited data.
If this AK acknowledges something heretofore unacknowledged,
\fItp_goodack()\fR drops the appropriate TPDU(s) from the retransmission
control list, computes the smoothed average round trip time
and standard deviation of the round trip time,
and updates
the retransmission timer based on these statistics.
It sets a flag in the pcb if the TP entity is obliged to
send the flow control confirmation parameter on its next
AK TPDU.
\fITp_goodack()\fR returns true if the AK brought some news with it,
either with respect to a change in credit or with respect to
new acknowledgments.
.pp
The function \fItp_goodXack()\fR is called when an XAK TPDU
arrives.
It takes the XAK sequence number as an argument and
determines if the XAK acknowledges the last XPD TPDU sent.
If so, it drops the expedited data from the outgoing
expedited data buffer.
By its definition in the TP specification,
the expedited data stream has a window
of size 1,
that is,
only one expedited datum (packet) can be buffered
at a time.
\fITp_goodXack()\fR returns true if the XAK acknowledged
the last XPD TPDU sent and the data were dropped,
and it returns false if the acknowledgment caused no action to be taken.
.\" NEXT FIGURE
.so figs/mbufrcv.nr
.\".so figs/mbufrcv.grn
.sh 1 "Receive"
.pp
This module reorders incoming TPDUs if necessary,
depacketizes data, passes it to the socket code module,
and determines when acknowledgments should be sent.
The function
\fItp_stash()\fR
takes a DT TPDU as an argument, and if the TPDU is not in
sequence, it saves the TPDU in a \fItp_rtc\fR structure in
a list, with the TPDUs
kept in order.
When the next expected TPDU arrives, the
list of out-of-order TPDUs is scanned for
more TPDUs in sequence, updating
a field in the pcb, \fItp_rcvnxt\fR, which
always contains the sequence
number of
the next expected TPDU.
If an acknowledgment is to be generated
at any time, the value of \fItp_rcvnxt\fR goes into the
\fIYR-TU-NR\fR\** field of the acknowledgment TPDU.
.(f
\**
This is the name used in ISO 8073 for the field
which indicates the sequence number of the next expected DT TPDU.
.)f
.pp
\fITp_stash()\fR returns true if an acknowledgment needs to be generated
immediately, and false otherwise.
The acknowledgment strategy is therefore implemented in this routine.
Acknowledgments may be generated for one or more of several reasons,
listed below.
\fITp_stash()\fR increments a counter for each of these reasons
for which an acknowledgment is generated, and a counter for TPDUs
that are not acknowledged immediately.
.ip "ACK_STRAT_EACH" 5
The acknowledgment strategy in use calls for acknowledging each
data packet with an AK TPDU.
.ip "ACK_STRAT_FULLWIN" 5
The acknowledgment strategy in use calls for acknowledging
upon receiving the DT TPDU that represents the upper window
edge of the last advertised window.
.ip "ACK_DUP" 5
A duplicate data TPDU was received.
.ip "ACK_REORDER" 5
A DT TPDU arrived in the window but out of order.
.ip "ACK_EOT" 5
A DT TPDU arrived, and it had the end-of-TSDU flag set.
.pp
Upon receipt of a DT TPDU that is in order, and upon reordering
DT TPDUs,
\fItp_stash()\fR
places the TSDUs into the socket's receive
buffer, \fIso->so_rcv\fR, in mbuf chains, with
TSDUs delimited by mbufs of the \fIm_type\fR MT_EOT,
which is a new type in the ARGO kernel.
.CF
illustrates the structure of the receiving socket buffer used
for normal data.
.pp
A separate socket buffer, \fItpcb->tp_Xrcv\fR,
is used for
buffering expedited data.
Only one expedited data packet may reside in this buffer at a time
because the TP standard limits the size of the window on expedited flow
to be 1.
This means the data structures are straightforward;
there is no need to distinguish between separate TSDUs in this socket buffer.
.pp
Credit is determined
by dividing the total amount of available
space in the receive buffer
by the negotiated maximum TPDU size.
TP can often offer a larger credit than this if it uses
an average of the measured actual TPDU sizes.
This strategy was once an option in the ARGO kernel,
but it was removed because unless the actual TPDU size
is constant, it leads to reneging of credit,
retransmissions, and decreased performance.
It does not work well when there is any fluctuation in the sizes
of TPDUs and it carries the penalty of lengthening the critical path
of the TP entity.
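.pp
The credit rule amounts to an integer division, sketched below.
The function name ex_credit() and the treatment of a zero TPDU size are
assumptions made for the example; the figures in the note that follows are
likewise invented.

```c
/* Credit, in TPDUs, that can safely be advertised to the peer. */
static unsigned ex_credit(unsigned space, unsigned max_tpdu)
{
    if (max_tpdu == 0)
        return 0;                    /* no size negotiated yet */
    return space / max_tpdu;         /* rounds down, never over-promises */
}
```

With 4096 octets free and a negotiated maximum of 1024 octets, the credit
is 4.
Had the divisor been a measured average of 256 octets, the credit would be
16, and twelve of those credits would have to be reneged as soon as the
peer began sending full-sized TPDUs.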
.sh 1 "Major Data Structures and Types"
.pp
In addition to the types commonly used in the kernel,
such as
.(b
\fC
.TS
tab(+);
l l l l.
+typedef+unsigned char+u_char;
+typedef+unsigned int+u_int;
+typedef+unsigned short+u_short;
.TE
\fR
.)b
TP uses the following types:
.(b
\fC
.TS
tab(+);
l l l l.
+typedef+unsigned int+SeqNum;
+typedef+unsigned short+RefNum;
+typedef+int+ProtoHook;
.TE
\fR
.)b
.pp
Sequence numbers can be either 7 or 31 bits.
An unsigned integer is used in all cases, and the proper type
of arithmetic is performed with bit masks.
Reference numbers are 16 bits.
ProtoHook is the type of the procedures that are in switch
tables, which,
although they do not return values,
are declared \fIint\fR rather than \fIvoid\fR
to be consistent with the rest of the kernel.
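.pp
Masked sequence arithmetic of this kind can be sketched as follows.
The macro and function names are hypothetical, the code assumes a 32-bit
unsigned int, and the half-window comparison rule in ex_seq_gt() is a
common convention assumed for the example rather than something stated in
this document.

```c
typedef unsigned int SeqNum;         /* as declared above */

#define EX_SEQ_MASK_7  0x7fu         /* normal format: 7-bit numbers */
#define EX_SEQ_MASK_31 0x7fffffffu   /* extended format: 31-bit numbers */

/* Advance a sequence number by n, wrapping within the mask. */
static SeqNum ex_seq_add(SeqNum s, unsigned n, SeqNum mask)
{
    return (s + n) & mask;
}

/* Distance from b forward to a, modulo the sequence space. */
static SeqNum ex_seq_diff(SeqNum a, SeqNum b, SeqNum mask)
{
    return (a - b) & mask;
}

/* True if a is ahead of b by less than half the sequence space. */
static int ex_seq_gt(SeqNum a, SeqNum b, SeqNum mask)
{
    SeqNum d = ex_seq_diff(a, b, mask);

    return d != 0 && d <= (mask >> 1);
}
```

The same three routines serve both formats; only the mask passed in
differs, which is the point of keeping every sequence number in an
unsigned integer.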
.pp
The following structures are fundamental
types used throughout TP,
in addition to those already described in the
section,
"The Design of the Transport Entity".
.(b
\fC
.TS
tab(+);
l s s s.
struct tp_ref {
.T&
l l l l.
+u_char+tpr_state;+/* REF_FROZEN...*/
+struct Ccallout+tpr_callout[N_CTIMERS];+/* C timers */
+struct Ecallout+tpr_calltodo;+/* E timers list */
+struct tp_pcb+*tpr_pcb;+/* --> PCB */
};
.TE
\fR
.)b
.lp
The reference structure is logically a part of the protocol
control block and it is linked to a pcb, but it may outlive
a pcb.
When a connection is dissolved, the pcb may be recycled
but the reference structure must remain until the reference
timer goes off.
The field \fItpr_state\fR takes the values
REF_FROZEN (a reference timer is ticking),
REF_OPEN (in use, has timers and an associated pcb),
REF_OPENING (has a pcb but no timers), and
REF_FREE (free to reallocate).
.pp
|
|
The TP protocol control block is too large to fit into
|
|
one mbuf structure so it comprises two structures
|
|
linked together, the
|
|
\fItp_pcb\fR structure and the
|
|
\fItp_pcb_aux\fR structure.
|
|
The \fItp_pcb_aux\fR structure contains
|
|
items that are used less frequently than those in
|
|
the former structure, since each access to these
|
|
items requires a second pointer dereference.
|
|
.(b
|
|
\fC
|
|
.TS
|
|
tab(+);
|
|
l s s s.
|
|
struct tp_pcb_aux {
|
|
.T&
|
|
l l l s.
|
|
+struct sockbuf+tpa_Xsnd;+/* for expedited data */
|
|
+struct sockbuf+tpa_Xrcv;+/* for expedited data */
|
|
+u_char +tpa_vers;+/* protocol version */
|
|
+u_char +tpa_peer_acktime;+/* to compute DT TPDU
|
|
+++retrans timer value */
|
|
+SeqNum+tpa_Xsndnxt;+/* seq # of
|
|
+++next XPD to send */
|
|
+SeqNum+tpa_Xuna;+/* seq # of
|
|
+++unacked XPD */
|
|
+SeqNum+tpa_Xrcvnxt;+/* next XPD seq #
|
|
+++expect to recv */
|
|
+/* addressing */
|
|
+u_short+tpa_domain;+/* domain AF_ISO,...*/
|
|
+u_short+tpa_fsuffixlen;+/* foreign suffix */
|
|
+u_char+tpa_fsuffix[MAX_TSAP_SEL_LEN];+
|
|
+u_short+tpa_lsuffixlen;+/* local suffix */
|
|
+u_char+tpa_lsuffix[MAX_TSAP_SEL_LEN];+
|
|
.T&
|
|
l s s s.
|
|
+/* AK subsequencing */
|
|
.T&
|
|
l l l s.
|
|
+u_short+tpa_s_subseq;+/* next subseq to send */
|
|
+u_short+tpa_r_subseq;+/* highest recv subseq */
|
|
};
|
|
.TE
|
|
\fR
|
|
.)b
|
|
.pp
|
|
The major portion of the protocol control block is in the
|
|
\fItp_pcb\fR structure:
|
|
.(b
|
|
\fC
|
|
.TS
|
|
tab(%);
|
|
l s s s.
|
|
struct tp_pcb {
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
.\" The next line sets the spacing for the table: 1+3 17+3 17+3 13+3
|
|
% % %
|
|
.\"456789 123456789- 123456789 123456-789 123456789 1234567890
|
|
.\"
|
|
%struct tp_ref%*tp_refp;%
|
|
.T&
|
|
l l l s.
|
|
%%/* reference structure */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%struct tp_pcb_aux%*tp_aux;%
|
|
.T&
|
|
l l l s.
|
|
%%/*rest of tpcb (auxiliary struct)*/%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%caddr_t%tp_npcb;%/* to ll pcb */
|
|
%struct nl_protosw%*tp_nlproto;%
|
|
.T&
|
|
l l l s.
|
|
% %/* domain-dependent routines */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%struct socket%*tp_sock;%/* back ptr */
|
|
.\" ***************************************
|
|
.T&
|
|
l s s s.
|
|
|
|
/* local and foreign reference numbers: */
|
|
.T&
|
|
l l l l.
|
|
%RefNum%tp_lref;%
|
|
%RefNum%tp_fref;%
|
|
.\" ***************************************
|
|
.T&
|
|
l s s s.
|
|
.\"456789 123456789 123456789 123456789 123456789 1234567890
|
|
|
|
/* Stuff for sequence space arithmetic:
|
|
* Maintaining 2 sequence spaces is a pain so we set these
|
|
* values once at connection establishment time. Sequence
|
|
* number arithmetic is a set of macros which uses these.
|
|
* Sequence numbers are stored as 32 bits.
|
|
 * tp_seqmask tells which of the 32 bits are used.
|
|
 * tp_seqbit is the lsb that is not used. When set,
|
|
* it indicates wraparound has occurred.
|
|
* tp_seqhalf is the value that is half the sequence space.
|
|
* (or half plus one).
|
|
*/
|
|
.T&
|
|
l l l l.
|
|
%u_int%tp_seqmask;%/* mask */
|
|
%u_int%tp_seqbit;%/* wraparound */
|
|
%u_int%tp_seqhalf;%/* half space */
|
|
.\" ***************************************
|
|
.T&
|
|
l s s s.
|
|
|
|
/* flags: values are defined in tp_user.h.
|
|
* Here we keep such info as which options
|
|
* are in use: checksum, extended format,
|
|
* flow control in class 2, etc.
|
|
* See tp(4p) man page.
|
|
*/
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%u_short%tp_state;%/* fsm */
|
|
%short%tp_retrans;%
|
|
.T&
|
|
l l l s.
|
|
% % /* # times to retransmit */%
|
|
.\" ***************************************
|
|
.T&
|
|
l s s s.
|
|
|
|
/* credit & sequencing info for SENDING: */
|
|
.T&
|
|
l l l s.
|
|
%u_short%tp_fcredit;%
|
|
% %/* remote real window */%
|
|
%u_short%tp_cong_win;%
|
|
% %/* remote congestion window */%
|
|
.\" ***************************************
|
|
%SeqNum%tp_snduna;%
|
|
.T&
|
|
l l l s.
|
|
% %/* seq # of lowest unacked DT */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%struct tp_rtc %*tp_snduna_rtc;%
|
|
.T&
|
|
l l l s.
|
|
% %/* ptr to mbufs containing lowest%
|
|
%% * unacked TPDUs sent so far%
|
|
%% */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%SeqNum%tp_sndhiwat;%
|
|
.T&
|
|
l l l s.
|
|
% %/* highest DT sent yet */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%struct tp_rtc%*tp_sndhiwat_rtc;%
|
|
.T&
|
|
l l l s.
|
|
% %/* ptr to mbufs containing the last%
|
|
%% * DT sent - this is the last item %
|
|
%% * on the list that starts%
|
|
%% * at tp_snduna_rtc%
|
|
%% */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%int %tp_Nwindow;%/* for perf. measmt */
|
|
.\" ***************************************
|
|
.T&
|
|
l s s s.
|
|
|
|
/* credit & sequencing info for RECEIVING: */
|
|
.\" ***************************************
|
|
.T&
|
|
l l l s.
|
|
%SeqNum%tp_sent_lcdt;%
|
|
%%/* cdt according to last AK sent */%
|
|
%SeqNum%tp_sent_uwe;%
|
|
% %/* upper window edge, according to%
|
|
%% * the last AK sent %
|
|
%% */%
|
|
%SeqNum%tp_sent_rcvnxt;%
|
|
% %/* rcvnxt, according to%
|
|
%% * the last AK sent%
|
|
%% */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%short%tp_lcredit;%/* local */
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%SeqNum%tp_rcvnxt;%
|
|
.T&
|
|
l l l s.
|
|
% %/* next DT seq# we expect to recv */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%struct tp_rtc%*tp_rcvnxt_rtc;%
|
|
.T&
|
|
l l l s.
|
|
% %/* ptr to mbufs containing unacked %
|
|
%% * DTs received out of order, and %
|
|
%% * which we haven't acknowledged%
|
|
%% */%
|
|
.\" ***************************************
|
|
.TE
|
|
.TS
|
|
tab(%);
|
|
l s s s.
|
|
/* Items kept in the aux structure: */
|
|
|
|
.\" ***************************************
|
|
.T&
|
|
l s s l.
|
|
#define tp_vers%tp_aux->tpa_vers
|
|
#define tp_peer_acktime%tp_aux->tpa_peer_acktime
|
|
#define tp_Xsnd%tp_aux->tpa_Xsnd
|
|
#define tp_Xrcv%tp_aux->tpa_Xrcv
|
|
#define tp_Xrcvnxt%tp_aux->tpa_Xrcvnxt
|
|
#define tp_Xsndnxt%tp_aux->tpa_Xsndnxt
|
|
#define tp_Xuna%tp_aux->tpa_Xuna
|
|
#define tp_domain%tp_aux->tpa_domain
|
|
#define tp_fsuffixlen%tp_aux->tpa_fsuffixlen
|
|
#define tp_fsuffix%tp_aux->tpa_fsuffix
|
|
#define tp_lsuffixlen%tp_aux->tpa_lsuffixlen
|
|
#define tp_lsuffix%tp_aux->tpa_lsuffix
|
|
#define tp_s_subseq%tp_aux->tpa_s_subseq
|
|
#define tp_r_subseq%tp_aux->tpa_r_subseq
|
|
.\" ***************************************
|
|
.T&
|
|
l s s s.
|
|
% % %
|
|
/* parameters per-connection controllable by user: */
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%struct%tp_conn_param%_tp_param;
|
|
% % %
|
|
.\" ***************************************
|
|
.T&
|
|
l s s l.
|
|
#define tp_Nretrans%_tp_param.p_Nretrans
|
|
#define tp_dr_ticks%_tp_param.p_dr_ticks
|
|
#define tp_cc_ticks%_tp_param.p_cc_ticks
|
|
#define tp_dt_ticks%_tp_param.p_dt_ticks
|
|
#define tp_xpd_ticks%_tp_param.p_x_ticks
|
|
#define tp_cr_ticks%_tp_param.p_cr_ticks
|
|
#define tp_keepalive_ticks%_tp_param.p_keepalive_ticks
|
|
#define tp_sendack_ticks%_tp_param.p_sendack_ticks
|
|
#define tp_refer_ticks%_tp_param.p_ref_ticks
|
|
#define tp_inact_ticks%_tp_param.p_inact_ticks
|
|
#define tp_xtd_format%_tp_param.p_xtd_format
|
|
#define tp_xpd_service%_tp_param.p_xpd_service
|
|
#define tp_ack_strat%_tp_param.p_ack_strat
|
|
#define tp_rx_strat%_tp_param.p_rx_strat
|
|
#define tp_use_checksum%_tp_param.p_use_checksum
|
|
#define tp_tpdusize%_tp_param.p_tpdusize
|
|
#define tp_class%_tp_param.p_class
|
|
#define tp_winsize%_tp_param.p_winsize
|
|
#define tp_netservice%_tp_param.p_netservice
|
|
#define tp_no_disc_indications%_tp_param.p_no_disc_indications
|
|
#define tp_dont_change_params%_tp_param.p_dont_change_params
|
|
.\" ***************************************
|
|
.TE
|
|
.\" ***************************************
|
|
.\" ***************************************
|
|
.\" ***************************************
|
|
.TS
|
|
tab(%);
|
|
l l l l.
|
|
.\" The next line sets the spacing for the table: 1+3 17+3 17+3 13+3
|
|
.\"456789 123456789- 123456789 123456-789 123456789 1234567890
|
|
.\"
|
|
.T&
|
|
l l l s.
|
|
%%/* log2(the negotiated max size) */%
|
|
.T&
|
|
l l l l.
|
|
%int%tp_l_tpdusize;%/* # bytes */
|
|
.\" ***************************************
|
|
%struct timeval%tp_rtt;%
|
|
.T&
|
|
l l l s.
|
|
% %/* smoothed avg round-trip time */%
|
|
%struct timeval%tp_rtv;%
|
|
% %/* std deviation of round-trip time */%
|
|
%struct timeval%tp_rttemit[ TP_RTT_NUM + 1 ];%
|
|
%%/* times that the last TP_RTT_NUM %
|
|
%% * DT_TPDUs were transmitted %
|
|
%% */%
|
|
.\" ***************************************
|
|
%unsigned % %
|
|
% tp_sendfcc:1,%/* shall next ack %
|
|
% %include flow control conf. param? */%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l s.
|
|
% tp_trace:1,%/* is this pcb being traced?%
|
|
%% * (not used yet) %
|
|
%% */%
|
|
.\" ***************************************
|
|
% tp_perf_on:1,%/* statistics being kept? */%
|
|
.\" ***************************************
|
|
% tp_reneged:1,%/* have we reneged on credit%
|
|
%% * since the last AK TPDU was sent? %
|
|
%% */%
|
|
% tp_decbit:4,%/* congestion experienced? */%
|
|
% tp_flags:8,%/* see #defines below */%
|
|
.\" ***************************************
|
|
% tp_unused:16;%%
|
|
.T&
|
|
l s s l.
|
|
#define TPF_XPD_PRESENT%TPFLAG_XPD_PRESENT
|
|
#define TPF_NLQOS_PDN%TPFLAG_NLQOS_PDN
|
|
#define TPF_PEER_ON_SAMENET%TPFLAG_PEER_ON_SAMENET
|
|
%%%
|
|
.\" ***************************************
|
|
.T&
|
|
l l l l.
|
|
%struct tp_pmeas%*tp_p_meas;%
|
|
.T&
|
|
l l l s.
|
|
% %/* ptr to mbuf to hold the perf.%
|
|
%% * statistics structure %
|
|
%% */%
|
|
.\" ***************************************
|
|
};
|
|
.TE
|
|
\fR
|
|
.\"
|
|
.\" end of tpcb structure (thank you)
|
|
.\"
|
|
.)b
|
|
.fi
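.pp
The effect of the \fItp_aux\fR aliases listed in the table above can be
seen in a cut-down sketch.
The two structures below keep only a few fields each, and the helpers
\fIpcb_set_version()\fR and \fIpcb_next_subseq()\fR are hypothetical;
only the field names and the form of the #define lines are taken from the
structure listings.

```c
/* Cut-down, hypothetical versions of the two structures. */
struct tp_pcb_aux {
    unsigned char  tpa_vers;       /* protocol version */
    unsigned short tpa_s_subseq;   /* next subseq to send */
};

struct tp_pcb {
    struct tp_pcb_aux *tp_aux;     /* rest of the pcb */
    unsigned short     tp_state;   /* kept in the primary structure */
};

/* Aliases in the style of the pcb listing. */
#define tp_vers     tp_aux->tpa_vers
#define tp_s_subseq tp_aux->tpa_s_subseq

/* tpcb->tp_vers expands to tpcb->tp_aux->tpa_vers. */
static void pcb_set_version(struct tp_pcb *tpcb, unsigned char vers)
{
    tpcb->tp_vers = vers;          /* second dereference hidden by the macro */
}

/* Returns the current subsequence number and advances it. */
static unsigned short pcb_next_subseq(struct tp_pcb *tpcb)
{
    return tpcb->tp_s_subseq++;
}
```

Code that uses the pcb never needs to know which of the two structures
holds a field, so a field can be moved between them by changing one
declaration and one #define.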
|
|
.sh 1 "Sequence Number Arithmetic"
|
|
.pp
|
|
Sequence numbers in TP can be either 7 bits
|
|
(\*(lqnormal format\*(rq)
|
|
or 31 bits
|
|
(\*(lqextended format\*(rq).
|
|
Sequence numbers are unsigned integers,
|
|
regardless of their format.
|
|
Three fields are kept in the pcb to manage the sequence
|
|
number arithmetic:
|
|
.(b
|
|
\fC
|
|
.TS
|
|
tab(+);
|
|
l l l l.
|
|
+u_int+tp_seqmask;+/* mask for seq space */
|
|
+u_int+tp_seqbit;+/* bit for seq # wraparound */
|
|
+u_int+tp_seqhalf;+/* half the seq space */
|
|
.TE
|
|
\fR
|
|
.)b
|
|
.lp
|
|
\fITp_seqmask\fR
|
|
is a bit mask indicating which bits are legitimate
|
|
for a sequence number of either format.
|
|
It takes the value 0x7f if 7-bit sequence numbers are in use,
|
|
and 0x7fffffff if 31-bit sequence numbers are in use.
|
|
\fITp_seqbit\fR
|
|
is the bit that becomes set when a sequence number wraps around
|
|
while being incremented.
|
|
Its value is 0x80 for normal format, 0x80000000 for extended format.
|
|
\fITp_seqhalf\fR
|
|
takes the value which is in the middle of the sequence space,
|
|
0x40 for normal format,
|
|
and
|
|
0x40000000 for extended format.
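.pp
The three values are related, so they can all be derived from the width
of the sequence number.
The following sketch shows one way to set them at connection
establishment time.
The structure \fIseq_params\fR and the function \fIseq_params_init()\fR
are hypothetical stand-ins for the pcb fields, and the code assumes that
an unsigned int is at least 32 bits wide.

```c
/* Hypothetical stand-in for the three pcb fields. */
struct seq_params {
    unsigned int tp_seqmask;
    unsigned int tp_seqbit;
    unsigned int tp_seqhalf;
};

/* extended != 0 selects 31-bit sequence numbers, otherwise 7-bit. */
static void seq_params_init(struct seq_params *p, int extended)
{
    int bits = extended ? 31 : 7;

    p->tp_seqbit  = 1u << bits;         /* 0x80 or 0x80000000 */
    p->tp_seqmask = p->tp_seqbit - 1;   /* 0x7f or 0x7fffffff */
    p->tp_seqhalf = p->tp_seqbit >> 1;  /* 0x40 or 0x40000000 */
}
```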
|
|
.(b
|
|
.nf
|
|
The macro
|
|
.fi
|
|
\fC
|
|
.TS
|
|
tab(+);
|
|
l l l l.
|
|
SEQ(tpcb, x)
|
|
.TE
|
|
\fR
|
|
.)b
|
|
.lp
|
|
extracts a sequence number from the location
|
|
in which it is stored.
|
|
.pp
|
|
The macros
|
|
.(b
|
|
\fC
|
|
.TS
|
|
tab(+);
|
|
l l s s l.
|
|
+SEQ_GT(tpcb, seq, t)+is seq > t?
|
|
+SEQ_GEQ(tpcb, seq, t)+is seq >= t?
|
|
+SEQ_LT(tpcb, seq, t)+is seq < t?
|
|
+SEQ_LEQ(tpcb, seq, t)+is seq <= t?
|
|
+SEQ_INC(tpcb, seq)+seq\+\+
|
|
+SEQ_DEC(tpcb, seq)+seq--
|
|
+SEQ_SUB(tpcb, seq, amt)+seq -= amt
|
|
+SEQ_ADD(tpcb, seq, amt)+seq \+= amt
|
|
.TE
|
|
\fR
|
|
.)b
|
|
.lp
|
|
perform the indicated comparisons and arithmetic
|
|
on their arguments.
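.pp
The arithmetic operations reduce to ordinary unsigned arithmetic
followed by a mask.
The functions below are a sketch of that idea, not the ARGO macros
themselves: the names \fIseq_add()\fR, \fIseq_sub()\fR, \fIseq_inc()\fR and
\fIseq_dec()\fR are hypothetical, and the mask is passed explicitly
instead of being read from the pcb.

```c
typedef unsigned int SeqNum;

/* mask is 0x7f or 0x7fffffff, as held in tp_seqmask. */
static SeqNum seq_add(SeqNum mask, SeqNum seq, SeqNum amt)
{
    return (seq + amt) & mask;
}

static SeqNum seq_sub(SeqNum mask, SeqNum seq, SeqNum amt)
{
    return (seq - amt) & mask;   /* unsigned underflow is well defined */
}

static SeqNum seq_inc(SeqNum mask, SeqNum seq)
{
    return seq_add(mask, seq, 1);
}

static SeqNum seq_dec(SeqNum mask, SeqNum seq)
{
    return seq_sub(mask, seq, 1);
}
```

With 7-bit sequence numbers, incrementing 127 gives 0 and decrementing 0
gives 127.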
|
|
.pp
|
|
An example of how these macros
|
|
are used is as follows.
|
|
To determine if a sequence
|
|
number \fIseq\fR is in a receive window
|
|
bounded by
|
|
\fIlwe\fR and \fIuwe\fR,
|
|
we define the
|
|
macro
|
|
.(b
|
|
\fC
|
|
.TS
|
|
tab(+);
|
|
l l.
|
|
#define+IN_RWINDOW(tpcb, seq, lwe, uwe)\\
|
|
+( SEQ_GEQ(tpcb, seq, lwe) && SEQ_LT(tpcb, seq, uwe) )
|
|
.TE
|
|
\fR
|
|
.)b
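.pp
The comparisons must give the right answer when the window straddles the
wraparound point.
One common way to achieve this, sketched below, is to treat \fIa\fR as
greater than \fIb\fR when \fIa\fR is ahead of \fIb\fR by less than half
the sequence space.
This is an assumption about how such macros can be written, not a copy
of the ARGO definitions; the structure \fIseq_space\fR and the lower-case
function names are hypothetical, and all arguments are assumed to be
already masked.

```c
typedef unsigned int SeqNum;

/* Hypothetical stand-in for the pcb fields used by the comparisons. */
struct seq_space {
    SeqNum mask;   /* tp_seqmask */
    SeqNum half;   /* tp_seqhalf */
};

/* a > b if a is ahead of b by less than half the sequence space. */
static int seq_gt(const struct seq_space *s, SeqNum a, SeqNum b)
{
    SeqNum d = (a - b) & s->mask;
    return d != 0 && d < s->half;
}

static int seq_geq(const struct seq_space *s, SeqNum a, SeqNum b)
{
    return a == b || seq_gt(s, a, b);
}

static int seq_lt(const struct seq_space *s, SeqNum a, SeqNum b)
{
    return seq_gt(s, b, a);
}

/* Same shape as IN_RWINDOW: lwe <= seq < uwe in sequence space. */
static int in_rwindow(const struct seq_space *s,
                      SeqNum seq, SeqNum lwe, SeqNum uwe)
{
    return seq_geq(s, seq, lwe) && seq_lt(s, seq, uwe);
}
```

For a 7-bit window with a lower edge of 126 and an upper edge of 4, the
sequence numbers 126, 127, 0, 1, 2 and 3 are inside the window, and 4 and
125 are outside it.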
|
|
.sh 1 "TP Implementation Options"
|
|
.pp
|
|
The transport protocol specification leaves several
|
|
things to the discretion of the implementor,
|
|
some of which may affect the performance
|
|
of individual connections and
|
|
aggregate performance.
|
|
Wherever different strategies are likely to favor
|
|
the performance of
|
|
individual connections to the detriment of aggregate performance
|
|
or vice versa, the
|
|
various strategies are under the control of options via the
|
|
\fIgetsockopt()\fR and
|
|
\fIsetsockopt()\fR system calls (see the manual pages
|
|
\fIgetsockopt(2)\fR,
|
|
\fIsetsockopt(2)\fR
|
|
and
|
|
\fItp(4p)\fR
|
|
for details).
|
|
In some cases the preferred strategies differ for the different
|
|
subnetworks, so the strategies chosen will be determined
|
|
by the subnetwork in use.
|
|
.sh 2 "TPDU size"
|
|
.pp
|
|
The limitation of the maximum TPDU size to a power of two is
|
|
unfortunate in the LAN environment.
|
|
For example, if the maximum NSDU size is around 1500, as in the case of an
|
|
Ethernet,
|
|
using a maximum TPDU size of 1024 reduces
|
|
the possible throughput by approximately 30%.
|
|
TP negotiates a maximum TPDU size of 2048 and
|
|
generates TPDUs of size around 1500.
|
|
Obviously this works well only when the peer is known to be
|
|
using the same scheme (so that the peer
|
|
doesn't send TPDUs of size 2048 and cause its
|
|
network layer to fragment the TPDUs).
|
|
This is likely to be the case in a LAN where
|
|
all protocol entities are under the same administrative
|
|
control.
|
|
The maximum TPDU size negotiated is under the control of the user,
|
|
so
|
|
it is possible to prevent this scheme from being used
|
|
by default
|
|
when the peer is not on the same LAN, by
|
|
setting the \fItp.tpdusize\fR parameter in the ARGO directory service
|
|
file to
|
|
something less than the network's maximum transmission
|
|
unit.
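.pp
The arithmetic behind this scheme can be sketched as follows.
The function names are hypothetical and TPDU and NPDU header overhead is
ignored, so the figures are approximate.

```c
/* Percentage of throughput lost by sending tpdu-byte TPDUs when the
 * network could carry nsdu bytes per packet (headers ignored). */
static unsigned int pct_lost(unsigned int nsdu, unsigned int tpdu)
{
    return tpdu >= nsdu ? 0 : 100u * (nsdu - tpdu) / nsdu;
}

/* log2 of the smallest power of two that is >= size (capped at 31). */
static unsigned int log2_ceil(unsigned int size)
{
    unsigned int n = 0;

    while (n < 31 && (1u << n) < size)
        n++;
    return n;
}

/* Size actually generated: the smaller of the negotiated maximum
 * and the maximum NSDU size. */
static unsigned int tpdu_size_to_send(unsigned int log2_negotiated,
                                      unsigned int nsdu)
{
    unsigned int max = 1u << log2_negotiated;

    return max < nsdu ? max : nsdu;
}
```

For a maximum NSDU size of 1500, \fIpct_lost(1500, 1024)\fR is 31, which is
the loss of roughly 30% mentioned above, and \fIlog2_ceil(1500)\fR is 11,
corresponding to a negotiated maximum of 2048.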
|
|
.\"***********************************************************
|
|
.sh 2 "Congestion Window Strategy"
|
|
.pp
|
|
The congestion window strategy from the
|
|
DoD Internet
|
|
was adapted for use with TP.
|
|
The strategy is intended to minimize the
|
|
adverse effect
|
|
of transport's retransmission on an
|
|
already congested network.
|
|
.pp
|
|
A TP entity keeps two notions of the peer's window:
|
|
the real window, which is that advertised by the peer
|
|
in AK TPDUs, and the congestion window, which is a locally
|
|
controlled window.
|
|
TP uses the smaller of the two windows when transmitting.
|
|
The congestion window starts small, which keeps a
|
|
new connection from overloading the network with a sudden
|
|
burst of packets
|
|
immediately after connection establishment.
|
|
This is called \fIslow start\fR.
|
|
For each successful acknowledgment received, the congestion
|
|
window grows by one, until eventually the real window
|
|
is the one in use.
|
|
If a retransmission timer expires, the congestion window
|
|
is reset to size one.
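.pp
The rules above can be sketched in a few lines of C.
The structure \fIsend_window\fR and the \fIcw_\fR functions are
hypothetical; its two fields correspond to \fItp_fcredit\fR and
\fItp_cong_win\fR in the pcb.
The starting size of one and the cap at the real window are assumptions
made for the example.

```c
/* Hypothetical stand-in for the two pcb fields. */
struct send_window {
    unsigned short fcredit;    /* real window advertised by the peer */
    unsigned short cong_win;   /* locally controlled congestion window */
};

/* Slow start: a new connection begins with a small window (assumed 1). */
static void cw_start(struct send_window *w)
{
    w->cong_win = 1;
}

/* A retransmission timer expired: fall back to size one. */
static void cw_timeout(struct send_window *w)
{
    w->cong_win = 1;
}

/* One successful acknowledgment: grow by one.  Growth stops at the
 * real window, which is then the one in use (an assumption). */
static void cw_ack(struct send_window *w)
{
    if (w->cong_win < w->fcredit)
        w->cong_win++;
}

/* The window used when transmitting is the smaller of the two. */
static unsigned short cw_usable(const struct send_window *w)
{
    return w->cong_win < w->fcredit ? w->cong_win : w->fcredit;
}
```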
|
|
.pp
|
|
The congestion window strategy is used for class 4 unless
|
|
the transport user requests that it not be used.
|
|
The slow start strategy is used for traffic over a PDN
|
|
unless
|
|
the transport user requests that it not be used.
|
|
Slow start is not used for traffic over a LAN unless
|
|
its use is requested by the transport user.
|
|
.\"***********************************************************
|
|
.sh 2 "Retransmission strategies"
|
|
.pp
|
|
A retransmission timer is invoked for each set of DT TPDUs
|
|
sent in one send operation (call to \fItp_send()\fR).
|
|
This set of packets is called the \fIsend window\fR for the purpose
|
|
of this discussion.
|
|
.pp
|
|
The number of TPDUs
|
|
in a send window
|
|
depends on the remote credit and the amount of data
|
|
in the local send buffers.
|
|
When a retransmission timer goes off, the lower
|
|
window edge
|
|
is reevaluated but the upper window edge is not reevaluated.
|
|
.pp
|
|
There are several retransmission strategies implemented in
|
|
ARGO TP.
|
|
The choice of strategies is the user's, and is made with the
|
|
\fIsetsockopt()\fR system call.
|
|
The strategies are summarized here:
|
|
.ip "Retransmit LWE TPDU only:" 5
|
|
Only the TPDU representing the new lower window edge
|
|
is retransmitted.
|
|
This is the default retransmission strategy.
|
|
.ip "Retransmit whole send window:" 5
|
|
Retransmission begins with the new lower window edge
|
|
and continues up to the old upper window edge.
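.pp
The difference between the two strategies is the number of TPDUs sent
when the timer goes off.
The sketch below makes this concrete.
The enumeration names and the function \fIrx_count()\fR are hypothetical,
and the old upper window edge is taken to be the last sequence number in
the send window (inclusive), which is an assumption.

```c
typedef unsigned int SeqNum;

enum rx_strategy { RX_LWE_ONLY, RX_WHOLE_WINDOW };   /* hypothetical names */

/* Number of DT TPDUs to retransmit, starting at new_lwe.
 * mask is 0x7f or 0x7fffffff, as held in tp_seqmask. */
static SeqNum rx_count(enum rx_strategy strat, SeqNum mask,
                       SeqNum new_lwe, SeqNum old_uwe)
{
    if (strat == RX_LWE_ONLY)
        return 1;                              /* just the lower window edge */
    return ((old_uwe - new_lwe) & mask) + 1;   /* new_lwe .. old_uwe */
}
```

With 7-bit sequence numbers, a new lower window edge of 126 and an old
upper window edge of 3, the first strategy retransmits one TPDU and the
second retransmits six (126, 127, 0, 1, 2 and 3).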
|
|
.pp
|
|
The value of the data retransmission timer
|
|
adapts to the average round trip time and the standard deviation of
|
|
the round trip time.
|
|
A round trip time is the time that passes between
|
|
the moment of a packet's first transmission and
|
|
the moment it is first acknowledged.
|
|
The average round trip time
|
|
is kept by the sending side of TP, using
|
|
a formula for
|
|
smoothing the average:
|
|
.(b
|
|
\fC
|
|
.TS
|
|
tab(+);
|
|
l l l l.
|
|
#define+TP_RTT_ALPHA+3
|
|
#define+TP_RTV_ALPHA+2
|
|
+++
|
|
#define+SMOOTH(alpha, old, new) \\
|
|
+((((new) - (old)) >> (alpha)) \+ (old))
|
|
.TE
|
|
\fR
|
|
.)b
|
|
.lp
|
|
The times included in the average are chosen as follows.
|
|
The time of
|
|
each packet's initial transmission is kept (for the last
|
|
\fIN\fR packets, where \fIN\fR is a defined constant).
|
|
When an AK TPDU arrives, ARGO TP subtracts the initial transmission
|
|
time for the lowest unacknowledged sequence number that was
|
|
acknowledged by this AK TPDU from the current time,
|
|
and applies the resulting time to the average.
|
|
Hence, not all packets are included in this average,
|
|
which is as it should be since
|
|
the purpose of this measurement is
|
|
to find a good value for the retransmission timer.
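.pp
A worked sketch of the smoothing step follows.
It keeps times as integer microseconds rather than \fIstruct timeval\fR,
and it divides by a power of two instead of shifting so that a negative
difference is well defined in C (the two differ only in rounding).
The structure \fIrtt_est\fR and the function \fIrtt_update()\fR are
hypothetical, and smoothing the absolute deviation of each sample to
estimate the spread is an assumption, since the text does not say how the
deviation is computed.

```c
#define TP_RTT_ALPHA 3
#define TP_RTV_ALPHA 2

/* old + (new - old) / 2^alpha */
static long smooth(int alpha, long old, long new_sample)
{
    return old + (new_sample - old) / (1L << alpha);
}

/* Hypothetical estimator state, in microseconds. */
struct rtt_est {
    long rtt;   /* smoothed average round trip time */
    long rtv;   /* smoothed deviation of the round trip time */
};

/* emit_time: first transmission time of the lowest unacknowledged
 * sequence number that the arriving AK TPDU acknowledges. */
static void rtt_update(struct rtt_est *e, long emit_time, long now)
{
    long sample = now - emit_time;
    long dev = sample > e->rtt ? sample - e->rtt : e->rtt - sample;

    e->rtv = smooth(TP_RTV_ALPHA, e->rtv, dev);
    e->rtt = smooth(TP_RTT_ALPHA, e->rtt, sample);
}
```

With an average of 800 and a new sample of 1600, the average moves
one eighth of the way to the sample, giving 900.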
|
|
.pp
|
|
Each time part of a window is retransmitted,
|
|
the retransmission timer for that window is increased.
|
|
This does not affect the retransmission timers for other windows.
|
|
.\"***********************************************************
|
|
.sh 2 "Acknowledgment strategies"
|
|
.pp
|
|
The transport protocol specification
|
|
requires acknowledgments to be sent immediately
|
|
upon receipt
|
|
of CC TPDUs (in class 4), XPD TPDUs, and DT TPDUs containing an
|
|
EOT marker, and at other times as required for flow control.
|
|
Otherwise, acknowledgments may be delayed.
|
|
In addition to the times when an acknowledgment is required,
|
|
ARGO TP transmits an AK TPDU whenever the user receives some data,
|
|
thereby increasing the size of the window.
|
|
For those times when
|
|
immediate acknowledgment is optional,
|
|
ARGO TP offers two acknowledgment strategies:
|
|
.ip " Acknowledge each TPDU" 10
|
|
Upon receipt of a DT TPDU an AK TPDU is sent.
|
|
.ip " Acknowledge full window" 10
|
|
Acknowledgment is issued
|
|
upon receipt of enough data to
|
|
consume the last advertised credit.
|
|
.pp
|
|
The latter strategy
|
|
requires a timer to trigger an acknowledgment
|
|
in case the peer doesn't send the entire window
|
|
quickly.
|
|
This timer is called the
|
|
\fIsendack timer\fR.
|
|
The upper bound on the value of this timer
|
|
is called the \fIlocal acknowledgment time\fR.
|
|
The local acknowledgment time may be "advertised" to the
|
|
peer during connection establishment, and the
|
|
peer may choose to use this value to
|
|
adjust its retransmission timers.
|
|
The ARGO TP entity advertises its local acknowledgment time
|
|
on a CR TPDU, but it is not
|
|
constrained by
|
|
the remote acknowledgment time, should the peer
|
|
advertise it.
|
|
Instead,
|
|
ARGO TP adapts its sendack timer
|
|
to the behavior of the connection.
|
|
.pp
|
|
Under the assumption that the round trip time is
|
|
often
|
|
symmetric,
|
|
and lacking
|
|
a method to measure
|
|
the round trip time in the other direction,
|
|
ARGO TP uses the measured average round trip time
|
|
to adjust the sendack timer.
|
|
.pp
|
|
The choice of strategies is made with the
|
|
\fIsetsockopt()\fR system call.
|
|
The default strategy is
|
|
to
|
|
delay acknowledgments until the most recently advertised window is filled.
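.pp
The decision made on receipt of a DT TPDU can be sketched as follows.
The enumeration names and the function \fIack_now()\fR are hypothetical,
and the way the remaining credit is counted is an assumption made for the
example.

```c
enum ack_strategy { ACK_EACH, ACK_WINDOW };   /* hypothetical names */

/* Returns nonzero if an AK TPDU should be sent now.
 * required:    the protocol demands an immediate AK
 *              (for example, the DT TPDU carried an EOT marker)
 * credit_left: TPDUs still receivable under the last advertised
 *              credit, counted after the DT TPDU just received */
static int ack_now(enum ack_strategy strat, int required,
                   unsigned int credit_left)
{
    if (required)
        return 1;
    if (strat == ACK_EACH)
        return 1;              /* acknowledge each TPDU */
    return credit_left == 0;   /* acknowledge only a full window */
}
```

When \fIack_now()\fR returns zero under the full-window strategy, the
sendack timer is what eventually forces the acknowledgment.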
|