%
% Copyright (c) 1997-1998, 2000 University of Utah and the Flux Group.
% All rights reserved.
%
% The University of Utah grants you the right to copy and reproduce this
% document or portions thereof for academic, research, evaluation, and
% personal use only, provided that (1) the title page appears prominently,
% and (2) these copyright and permission notices are retained in all copies.
% To arrange for alternate terms, contact the University of Utah at
% csl-dist@cs.utah.edu or +1-801-585-3271.
%
\section{Introduction}
Because the components of the \oskit{}
are intended to be usable in many different kernel and user-mode environments,
it is important that their requirements be defined fully,
not only in terms of interface dependencies,
but also in terms of \emph{execution environment}.
A component's execution environment
involves many implicit and sometimes subtle factors
such as whether and in what cases the component is reentrant.
A client using a component
must either use an execution environment
directly compatible with the environment expected by the component,
or it must be able to \emph{provide} an environment
in which the component can run
by adding appropriate glue code.
For example, most of the \oskit's components
are not inherently thread- or multiprocessor-safe;
although they can be used in multithreaded or multiprocessor environments,
they rely on the client OS to provide the necessary locking
as part of the ``glue'' the client OS uses to interface with the component.
In order to make it reasonably easy for the client OS
to adapt to the execution environment requirements of \oskit{} components,
the execution models used by the \oskit{} components
are purposely kept as simple and easy to understand as possible
without sacrificing significant functionality or performance.
Another factor driving the \oskit{} components' execution models
is the goal of being able to integrate large amounts of code,
such as device drivers and network protocol stacks,
virtually unmodified from traditional kernels such as BSD and Linux;
this requirement inevitably places some restrictions
on the execution models of the \oskit{} components
derived from these source bases.
However, in general,
even the execution models of these complex \oskit{} components
are considerably simpler and better defined
than the original execution environments
of the legacy systems from which the components were adapted;
this simplification is enabled
by carefully-designed \oskit{} glue code surrounding the legacy code
which emulates and hides from the \oskit{} user
the more subtle aspects of the legacy execution environments.
Since the \oskit{} includes components
with a wide range of size and complexity,
and as a result different components naturally tend to have
different levels of dependence on their execution environment,
the \oskit{} defines a set of standard execution models
arranged on a continuum from simplest and least restrictive
to most complex and demanding on the client OS\@.
This range of execution models allows the client OS
to adopt the simpler \oskit{} components quickly and with minimal fuss,
while still providing all the detailed environment information necessary
for the client OS to incorporate the most demanding components.
For example, the basic memory-management libraries,
LMM and AMM,
use extremely simple execution models with very few restrictions,
allowing them to be used immediately in almost any context.
The device driver libraries, on the other hand,
necessarily place much greater demands on the client
since they must deal with interrupts, DMA,
and other hardware facilities closely tied to the execution environment;
however, these requirements are defined explicitly and generically
so that with a little effort even these components
can be used in many different contexts.
The remaining sections of this chapter
describe each of the execution models used in the \oskit{},
in order from simplest to most complex.
In general, each succeeding execution model
builds on and extends the previous model.
\section{Pure Model}
\label{pure-env}
The \emph{pure} execution model
is the most basic of the \oskit{} execution environments;
it has the following properties:
\begin{itemize}
\item Pure functions and components have \emph{no} implicit global state.
For example, they do not define or use any global or static variables;
they only use and manipulate data passed explicitly by the client.
For instance, many functions in the minimal C library,
such as \texttt{memcpy} and \texttt{strlen},
are pure by nature in that they only touch data areas
passed in parameters by the client
and have no other side-effects.
As a less trivial example,
the LMM and AMM memory manager components
include functions that maintain state across calls,
but only in explicit data structures visible to the client.
\item Pure functions and components are fully reentrant and thread-safe
with respect to disjoint data sets.
For example, if the client OS uses the LMM
to manage several separate and disjoint memory pools,
then the \emph{lmm} functions may be run concurrently
on two different processors in a multiprocessor system
without synchronization,
as long as each processor is manipulating
a different LMM memory pool.
This property is a natural consequence
of the fact that pure \oskit{} components
maintain no implicit global state.
\item Pure functions and components are \emph{not} reentrant or thread-safe
with respect to overlapping data sets.
For example,
just as it is not safe to make several \texttt{memcpy} calls
concurrently with overlapping destination buffers,
it is not safe to call LMM functions concurrently
on the same memory pool.
This is true for interrupt-style as well as thread-style concurrency:
for example, the client OS must not call an LMM function
on a particular memory pool from within an interrupt handler
if the interrupt handler might have interrupted
another LMM function call using the same memory pool.
In order to use these components
in an interruptible or multithreaded/multiprocessor environment,
the client OS must wrap them with appropriate synchronization code,
such as locking or interrupt priority management,
in order to ensure that only one call can be made at a time
for a given data set.
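An example of such a synchronization wrapper is sketched at the end of this section.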
\item Pure functions and components are \emph{not} reentrant
with respect to overlapping data sets
even during callbacks from the component to the client OS\@.
In other words, callbacks are assumed to be atomic
as far as the component is concerned.
For example, the address map manager, AMM,
makes calls back to the client OS
to allocate and free memory
for use in maintaining the address map,
as well as to perform client-defined processing
when address map regions are split or joined together.
During the processing of one of these callbacks,
the client OS must not attempt
to make a reentrant call back into the AMM
operating on the same address map.
\end{itemize}
\psfigure{pure-model}{%
Illustration of the execution model used by pure components.
Separate components, and separate instances of each component,
are fully independent and have no implicit shared global state;
therefore they can be invoked concurrently with no synchronization.
However, each individual instance of a component
(e.g., a particular LMM memory pool)
is single-threaded and non-reentrant;
the client OS must avoid concurrent calls to that instance,
as well as recursive calls to the same instance through callbacks.
}
Figure~\ref{fig-pure-model} illustrates the pure execution environment.
Since pure functions and components contain no implicit global state,
separate ``instances'' or uses of these components by the client
can be treated as completely independent objects:
although each individual instance of the component
is single-threaded and non-reentrant,
the client OS can manage synchronization
between independent instances of the component
in any way it chooses.
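As a concrete illustration of this locking discipline,
the following sketch shows how a multithreaded client OS
might wrap a pure component such as the LMM\@.
The {\tt lmm_alloc} call follows the interface described in the LMM chapter;
the {\tt spin_lock_t} type and the {\tt spin_lock} and {\tt spin_unlock} calls
are placeholders for whatever locking primitives the client OS actually provides.
\begin{verbatim}
/*
 * Sketch: using a pure component (the LMM) from a multithreaded client.
 * Each memory pool carries its own client-provided lock; calls on
 * different pools need no synchronization at all, but calls on the
 * same pool must be serialized.  spin_lock_t, spin_lock, and
 * spin_unlock stand for the client OS's own locking primitives.
 */
#include <oskit/types.h>
#include <oskit/lmm.h>

struct locked_pool {
        lmm_t           lmm;    /* the component's explicit state */
        spin_lock_t     lock;   /* client-supplied synchronization */
};

void *
pool_alloc(struct locked_pool *pool, oskit_size_t size)
{
        void *block;

        spin_lock(&pool->lock);         /* serialize this pool only */
        block = lmm_alloc(&pool->lmm, size, 0);
        spin_unlock(&pool->lock);
        return block;
}
\end{verbatim}
Because the lock is per-pool,
two processors allocating from different pools never contend;
only concurrent calls on the same pool are serialized.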
\section{Impure Model}
Components that use the \emph{impure} execution model
act just like those operating in the pure model,
except that they may contain global shared state
and therefore must be treated as a single ``instance''
for synchronization and reentrancy purposes.
For example,
many of the functions in \texttt{liboskit_kern},
the kernel support library,
set up and access global processor registers and data structures,
and are therefore impure.
Similarly,
some of the functions in the minimal C library,
such as \texttt{malloc} and its relatives,
inherently require the use of global state
and therefore are impure.
The impure execution model has the following properties:
\begin{itemize}
\item Impure functions and components may depend on implicit global state,
such as global or static variables or special processor registers.
\item Impure functions and components are \emph{not} reentrant or thread-safe,
except when explicitly stated otherwise.
In order to use these components
in an interruptible or multithreaded/multiprocessor environment,
the client OS must provide appropriate synchronization code.
Many impure components and functions
provide explicit synchronization hooks
for the convenience of the client OS\@.
For example, the minimal C library's \texttt{malloc} functions
make calls to \texttt{mem_lock} and \texttt{mem_unlock},
which are empty functions by default
but can be overridden by the client to provide real synchronization;
see Section~\ref{memalloc} for more information.
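A sketch of such an override appears after this list.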
\item Impure functions and components are \emph{not} reentrant
even during callbacks from the component to the client OS,
except when explicitly stated otherwise.
In other words, callbacks are assumed to be atomic
as far as the component is concerned.
\end{itemize}
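As a simple illustration of these hooks,
the following sketch shows how a multithreaded client OS
might override {\tt mem_lock} and {\tt mem_unlock}
(their prototypes are assumed here to take no arguments);
the {\tt mutex_t} type and the {\tt mutex_lock} and {\tt mutex_unlock} calls
are placeholders for the client's own primitives.
\begin{verbatim}
/*
 * Sketch: providing real synchronization for the impure malloc code
 * in the minimal C library.  mem_lock/mem_unlock are empty by default;
 * overriding them ensures only one thread at a time runs in the
 * allocator.  mutex_t, mutex_lock, and mutex_unlock are placeholders
 * for the client OS's own primitives.
 */
static mutex_t malloc_mutex;

void
mem_lock(void)
{
        mutex_lock(&malloc_mutex);
}

void
mem_unlock(void)
{
        mutex_unlock(&malloc_mutex);
}
\end{verbatim}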
\section{Blocking Model}
The \emph{blocking} execution model
extends the impure model to support non-preemptive multithreading;
it is essentially the execution model used
in traditional Unix-like kernels such as BSD and Linux.
Components that use the blocking model
have the same properties as those that use the impure model,
except that they are re-entrant
with respect to \emph{some} callback functions;
these functions are known as \emph{blocking} functions.
This means that,
whenever the component makes a call to a blocking function,
the client OS may re-enter the component in a different context,
e.g., in the context of a different thread or processor.
The set of callback functions that are assumed to be blocking
is part of the component's contract with the client OS;
in general, a function is blocking
unless it is explicitly stated to be nonblocking.
In order to use a blocking-model component
in a fully preemptive, interruptible, or multiprocessor environment,
the client OS must do essentially the same thing
to adapt to the component's execution model
as it would for a pure or impure component:
namely, it must surround the component with a lock
which is taken before entry to the component and released on exit.
However, because the component supports re-entrancy
through callbacks that are defined to be blocking functions,
the client OS's implementations of these callback functions
may release the component's lock temporarily
and re-acquire it before returning into the component,
thereby allowing other concurrent uses of the component.
%XXX diagram
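The following sketch illustrates this arrangement
for a hypothetical blocking-model component;
every name in it
({\tt comp_lock}, {\tt client_mutex_lock}, {\tt client_sleep_on}, {\tt comp_read}, and so on)
is a placeholder invented for the example
rather than part of any \oskit{} interface.
\begin{verbatim}
/*
 * Sketch: wrapping a blocking-model component with a single lock.
 * The lock is held whenever any activity is executing inside the
 * component; blocking callbacks release it while the calling activity
 * sleeps, so that other activities may enter the component meanwhile.
 * All names here are hypothetical placeholders, not oskit interfaces.
 */
static client_mutex_t comp_lock;

/* Process-level entry into the component. */
int
client_read(struct device *dev, void *buf, int len)
{
        int rc;

        client_mutex_lock(&comp_lock);
        rc = comp_read(dev, buf, len);  /* may call back and block */
        client_mutex_unlock(&comp_lock);
        return rc;
}

/* A blocking callback function supplied by the client OS. */
void
client_block(void *wait_channel)
{
        client_mutex_unlock(&comp_lock);        /* let others in */
        client_sleep_on(wait_channel);          /* this activity sleeps */
        client_mutex_lock(&comp_lock);          /* reacquire on wakeup */
}
\end{verbatim}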
\section{Interruptible Blocking Model}
\label{intr-blocking-model}
The \emph{interruptible blocking} execution model,
unlike the other models,
allows a component to be re-entered at arbitrary points
under certain conditions.
In this model,
there are two ``levels'' at which a component's code may execute:
interrupt level and process level.
(We use the term ``process'' only because it is the traditional term;
the components in fact have no notion of an actual ``process.'')
The model also assumes a one-bit \emph{interrupt enable} flag,
which the component can control through well-defined callback functions
that must be provided by the client OS\@.
When the component is running at either level and interrupts are enabled,
the component may be re-entered at interrupt level,
typically to execute an \emph{interrupt handler} of some kind.
To be specific,
the properties of the interruptible blocking model are as follows:
\begin{enumerate}
\item Each component is a single-threaded execution domain:
only one (virtual or physical) CPU may execute code
in the component at a given time.
For example, on a multiprocessor,
only one processor at a time may execute at process level
in a given component set;
this can be handled by placing a global lock around the component.
(Different components can be allowed to execute concurrently,
as long as the client OS takes appropriate precautions
to keep them fully independent of each other.)
%XXX the above sentence is obscure: need to explain somewhere.
Similarly, if the host OS is preemptible,
then the OS must ensure that if a component is preempted,
then that component will not be re-entered in another context
before the first context has finished or entered a blocking function.
\item Multiple process-level activities
may be outstanding at a given time in a component,
as long as only one is actually \emph{executing} at a time
(as required by rule 1).
A subset of the callback functions provided by the client OS
are defined as \emph{blocking functions};
whenever one of these functions is called,
the host OS may start a new activity in the component,
or switch back to other previously blocked activities.
\item The host OS supplies each outstanding activity with a separate stack,
which is retained across blocking function calls.
Stacks are only relinquished by a component
when the operation completes and the component's code returns
from the original call that was used to invoke it.
\item Code in a component always runs at one of two levels:
process level or interrupt level.
Whenever the host OS invokes a component
through its interface,
it enters the component at one of these two levels.
Typically, some of the component's exported functions or methods
can be invoked only at process level,
while others can be invoked only at interrupt level,
and there may be a few that can be invoked at either level;
which functions can be invoked at which levels
is part of the component's interface
(its contract with the client OS),
and thus is defined in the component's description.
Typically, most of a component's entrypoints
can only be invoked at process level;
therefore, entrypoints for which no level is explicitly specified
can be entered only at process level.
\item Both process-level and interrupt-level execution in a component
can be interrupted at any time
by interrupt handlers in the component,
except when the code has disabled interrupts
using {\tt osenv_intr_disable} (see Section~\ref{osenv-intr-disable}).
\com{%
(XXX is the host OS \emph{required} to allow interrupt handlers
to interrupt process-level activities?
Yes for drivers that busy-wait for interrupts to occur.)
}
\item When a component is entered at process level,
the component assumes that interrupts are enabled.
The component may temporarily disable interrupts during processing,
but must re-enable them before returning to the client OS\@.
Similarly, when a component is entered at interrupt level
(e.g., a hardware interrupt handler in a device driver),
interrupts are assumed to be initially disabled
as if {\tt osenv_intr_disable} had been called implicitly
before entering the handler.
However, the component may re-enable interrupts,
at which point the client OS is allowed
to interrupt the component again with other interrupt-level activities.
\item Interrupt-level activities must be strictly stacked.
In other words,
when the client OS interrupts a process-level activity in a component,
that interrupt-level activity
must be allowed to run to completion
before the client OS may resume the process-level activity.
Similarly, if an interrupt-level activity is itself interrupted,
then the most recent interrupt-level activity must finish
before the client OS may resume previous interrupt-level activities.
This constraint is generally satisfied automatically
if the client OS is running on a uniprocessor
and uses only a single stack
for both process-level and interrupt-level processing;
however, the \oskit{} components
do not require the client OS to use only a single stack
as long as it meets these re-entrancy requirements.
\item Code in a component that may run at interrupt level
may not call blocking callback functions
provided by the client OS;
only nonblocking callback functions may be called at interrupt level.
\end{enumerate}
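To make properties 5 and 6 concrete,
the following sketch shows the pattern typically used
by process-level code in an interruptible blocking component
to protect data that it shares with its own interrupt handler.
The {\tt osenv_intr_disable} and {\tt osenv_intr_enable} calls
are the client-supplied functions described above;
the {\tt example_softc} structure and the {\tt enqueue} helper
are hypothetical driver state invented for the example.
\begin{verbatim}
/*
 * Sketch: process-level component code protecting data shared with
 * its own interrupt handler.  osenv_intr_disable/osenv_intr_enable
 * are supplied by the client OS; the softc structure and enqueue
 * helper are hypothetical.
 */
#include <oskit/dev/dev.h>

void
example_start_io(struct example_softc *sc, struct io_request *req)
{
        /*
         * Entered at process level, so interrupts are assumed to be
         * enabled.  Disable them around the queue manipulation that
         * the interrupt handler also performs, and re-enable them
         * before returning or doing anything that might block.
         */
        osenv_intr_disable();
        enqueue(&sc->pending, req);
        osenv_intr_enable();
}
\end{verbatim}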
Although on the surface it may appear
that these requirements place severe restrictions on the host OS,
the required execution model can in fact be provided
quite easily even in most kernels supporting other execution models.
The following sections describe some example techniques
for providing this execution model.
\subsection{Use in multiprocessor kernels}
%XXX
\paragraph{Global spin lock:}
The easiest way to provide the required execution model
for interruptible, blocking components
in a nonpreemptive, process-model, multiprocessor kernel such as Mach 3.0
is to place a single global spin lock
around all code running in the device driver framework.
A process must acquire this lock before entering driver code,
and release it after the operation completes.
(This includes both process-level entry through the component's interface,
and interrupt-level entry into the components' interrupt handlers.)
In addition, all blocking callback functions
which the host OS supplies
should release the global lock before blocking
and acquire the lock again after being woken up.
This way, other processors, and other processes on the same processor,
can run code in the same or other drivers while the first operation is blocked.
Note that this global lock must be handled carefully
in order to avoid deadlock situations.
A simple, ``naive'' non-reentrant spin lock will not work,
because if an interrupt occurs on a processor
that is already executing process-level driver code,
and that interrupt tries to lock the global lock again,
it will deadlock because the lock is already held by the process-level code.
The typical solution to this problem
is to implement the lock as a ``reentrant'' lock,
so that the same processor can lock it twice
(once at process level and once at interrupt level)
without deadlocking.
Another strategy for handling the deadlock problem
is for the host OS simply to
disable interrupts before acquiring the global spin lock
and enable interrupts after releasing it,
so that interrupt handlers are only called
while the process-level device driver code is blocked.
(In this case, the {\tt osenv_intr_enable} and {\tt osenv_intr_disable} calls,
provided by the OS to the drivers,
would do nothing because interrupts are always disabled
during process-level execution.)
This strategy is not recommended, however,
because it will increase interrupt latency
and break some existing partially-compliant device drivers
which busy-wait at process level for conditions set by interrupt handlers.
%XXX ref to appropriate place in linux-dev
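The following sketch outlines one possible reentrant lock of this kind;
{\tt cpu_id} and {\tt atomic_compare_and_swap}
are placeholders for whatever primitives the host kernel provides,
and a real implementation would also need
the memory barriers appropriate to its architecture.
\begin{verbatim}
/*
 * Sketch: a reentrant global spin lock for driver code.  The lock
 * excludes other processors, but a processor that already holds it
 * (e.g., process-level driver code that has just been interrupted)
 * may take it again at interrupt level without deadlocking.
 * cpu_id() and atomic_compare_and_swap() are placeholders for the
 * host kernel's own primitives.
 */
struct reentrant_lock {
        volatile int    owner;          /* CPU holding the lock, or -1 */
        volatile int    depth;          /* nesting count on that CPU */
};

static struct reentrant_lock driver_lock = { -1, 0 };

void
rlock_acquire(struct reentrant_lock *l)
{
        int me = cpu_id();

        if (l->owner == me) {           /* already held by this CPU */
                l->depth++;
                return;
        }
        while (!atomic_compare_and_swap(&l->owner, -1, me))
                ;                       /* spin until the lock is free */
        l->depth = 1;
}

void
rlock_release(struct reentrant_lock *l)
{
        if (--l->depth == 0)
                l->owner = -1;          /* fully released */
}
\end{verbatim}
Process-level entry points and interrupt handlers
acquire this lock on entry and release it on exit,
and the blocking callbacks release and reacquire it around the block,
exactly as described above.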
\paragraph{Spin lock per component:}
As a refinement to the approach described above,
to achieve better parallelism,
the host OS kernel may want to maintain
a separate spin lock for each component.
This way, for example,
a network device driver can be run on one processor
while a disk device driver is being run on another.
This parallelism is allowed by the framework
because components are fully independent
and do not share data with each other directly.
Of course, the client OS must be careful to maintain this independence
in the way it uses the components:
for example, if the client OS wants to have one component make calls to another
(e.g., to connect a file system component to a disk device driver),
and it wants the two components
to be separately synchronized and use separate locks,
the client OS must interpose some of its own code
to release one lock and acquire the other
during calls from one component to the other.
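A sketch of such interposed glue code,
reusing the reentrant lock shown above with one instance per component,
might look like the following;
the {\tt disk_strategy} entry point and the request structure
are hypothetical names invented for the example.
\begin{verbatim}
/*
 * Sketch: glue interposed by the client OS when a file system
 * component calls into a separately locked disk driver component.
 * fs_lock and disk_lock are per-component reentrant locks as above;
 * disk_strategy is a hypothetical entry point of the disk driver.
 */
static struct reentrant_lock fs_lock   = { -1, 0 };
static struct reentrant_lock disk_lock = { -1, 0 };

int
glue_disk_request(struct disk_request *req)
{
        int rc;

        /* Called from the file system with fs_lock held. */
        rlock_release(&fs_lock);
        rlock_acquire(&disk_lock);
        rc = disk_strategy(req);
        rlock_release(&disk_lock);
        rlock_acquire(&fs_lock);
        return rc;
}
\end{verbatim}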
\subsection{Use in preemptive kernels}
\label{fdev-in-preemptive}
The issues and solutions for implementing the required execution model
in preemptive kernels are similar to those for multiprocessor kernels:
basically, locks are used to protect the component's code.
Again, the locking granularity can be global or per-component
(or anything in between, as the OS desires).
However, in this case,
a blocking lock must be used rather than a simple spin lock
because the lock must continue to be held
if a process running the component's code is preempted.
(Note the distinction between OS-level ``blocking,''
which can occur at any time during execution of the component's code
but is made invisible to the component's code through the use of locks;
and component-level ``blocking,''
which only occurs when a component calls a blocking function.)
An alternative solution to the preemption problem
is simply to disable preemption while running the component's code.
This solution is likely to be simpler in terms of implementation
and to have less overhead,
but it may greatly increase thread dispatch latency,
possibly defeating the purpose of kernel preemption in the first place.
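The two approaches might be sketched as follows;
{\tt mutex_lock}, {\tt preempt_disable}, {\tt comp_ioctl},
and the {\tt comp} structure
are placeholders for the host kernel's own facilities
and the component's entry point.
\begin{verbatim}
/*
 * Sketch: two ways a preemptive kernel can enter a component.
 * Option (a) holds a blocking lock (mutex) across the call, so the
 * lock remains held even if the calling thread is preempted.
 * Option (b) simply disables preemption for the duration of the call,
 * which is simpler but hurts thread dispatch latency.
 * All names are placeholders for the host kernel's own facilities.
 */
int
call_component_locked(struct comp *c, int cmd, void *arg)      /* (a) */
{
        int rc;

        mutex_lock(&c->lock);           /* held across preemption */
        rc = comp_ioctl(c, cmd, arg);
        mutex_unlock(&c->lock);
        return rc;
}

int
call_component_nopreempt(struct comp *c, int cmd, void *arg)   /* (b) */
{
        int rc;

        preempt_disable();
        rc = comp_ioctl(c, cmd, arg);
        preempt_enable();
        return rc;
}
\end{verbatim}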
\subsection{Use in multiple-interrupt-level kernels}
Many existing kernels, particularly those derived from Unix or BSD,
implement a range of ``interrupt priority levels,''
typically assigning different levels to different classes of devices
such as block, character, or network devices.
In addition, some processor architectures, such as the 680x0,
directly support and require the use of some kind of IPL-based scheme.
Although the \oskit{} device drivers and other \oskit{} components
have no direct notion of interrupt priority levels,
such a scheme can be simulated fairly easily in IPL-based kernels
by assigning a particular IPL to each component used by the kernel.
In this case,
the {\tt osenv_intr_disable} routine provided by the kernel
does not disable \emph{all} interrupts,
but instead only disables interrupts at the interrupt priority level
that the client OS has assigned to the calling component,
and at all lower priority levels.
This way, although the code in each component
is only aware of interrupts being ``enabled'' or ``disabled,''
the host OS can in effect enforce a general IPL-based scheme.
An obvious limitation, of course,
is that all of the device drivers in a particular driver set
must generally have the same IPL\@.
However, this is usually not a problem,
since the drivers in a set are usually closely related anyway.
%XXX driver set -> component set?
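For example, in a BSD-style kernel
a driver set assigned to the block-I/O interrupt priority level
might be given the following (deliberately simplified) implementations
of {\tt osenv_intr_disable} and {\tt osenv_intr_enable};
the choice of {\tt splbio} is only an example,
and nesting and interrupt-level entry are ignored for brevity.
\begin{verbatim}
/*
 * Sketch: osenv_intr_enable/disable for a driver set that the client
 * OS has assigned to the BSD block-I/O interrupt priority level.
 * "Disabled" means the set's own IPL (and everything below it) is
 * masked; "enabled" means the IPL is back at the base process level.
 * This ignores nesting and interrupt-level entry for brevity.
 */
static int base_spl;    /* IPL at which process-level code normally runs */

void
osenv_intr_disable(void)
{
        (void) splbio();        /* mask this set's interrupts and below */
}

void
osenv_intr_enable(void)
{
        splx(base_spl);         /* unmask; this set's interrupts may occur */
}
\end{verbatim}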
\subsection{Use in interrupt-model kernels}
Many small kernels use a pure interrupt model internally
rather than a traditional process model;
this basically means that there is only one kernel stack per \emph{processor}
rather than one kernel stack per process,
and therefore kernel code can't block
without losing all of the state on its stack.
This is probably the most difficult environment
in which to use the framework,
since the framework fundamentally assumes
one stack per outstanding component invocation.
Nevertheless, there are a number of reasonable ways
to work around this mismatch of execution model,
some of which are described briefly below as examples:
\begin{itemize}
\item {\bf Switch stacks while running driver code.}
Before the kernel invokes a component operation
(e.g., makes a {\tt read} or {\tt write} request),
it allocates a special ``alternate'' kernel stack,
possibly from a ``pool'' of stacks reserved for this purpose.
This alternate stack is associated with the outstanding operation
until the operation completes;
the kernel switches to the alternate stack
before executing process-level component code,
and switches back to the per-processor kernel stack
when the driver blocks or returns.
Depending on the details of the kernel's execution model,
the kernel may also have to switch back to the per-processor stack
when the process-level component code is interrupted,
due to an event such as a hardware interrupt
or a page fault occurring while copying data
into or out of a user-mode process's address space.
However, note that stack switching is only required
when running process-level component code;
interrupt handlers in this execution model
are already ``interrupt model'' code and need no special adaptation.
\item {\bf Run process-level device driver code on a separate kernel thread.}
If the kernel supports kernel threads in some form
(threads that run using a traditional process model
but happen to execute in the kernel's address space),
then process-level component code can be run on a kernel thread.
Basically, the kernel creates or otherwise ``fires off''
a new kernel thread for each new component operation invoked,
and the thread terminates when the operation is complete.
(If thread creation and termination are expensive,
then a ``pool'' of available threads can be cached.)
The kernel must ensure that the threads
active in a particular component at a given time
cannot preempt each other arbitrarily except
in the blocking functions defined by this framework;
one way to do this is with locks (see Section~\ref{fdev-in-preemptive}).
Conceptually, this solution is more or less isomorphic
to the stack switching solution described above,
since a context switch basically amounts to a stack switch;
only the low-level details are really different.
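A sketch of this kernel-thread approach appears after this list.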
\item {\bf Run the device drivers in user mode.}
If a process-model environment cannot easily
be provided or simulated within the kernel,
then the best solution may be
to run components in user mode,
as ordinary threads running on top of the kernel.
Of course, this solution brings with it
various potential complications and efficiency problems;
however, in practice they may be fairly easily surmountable,
especially in kernels that already support
other kinds of user-level OS components
such as device drivers, file systems, etc.
\com{XXX specific to device drivers
Also, even if some \oskit{} components can only be run in user mode,
there is nothing to prevent other ``native'' components
designed specifically for the kernel in question
from running in supervisor mode with minimal performance overhead;
this way, extremely popular or performance-critical hardware
can be supported directly in the kernel,
while the drivers from this framework can be run in user mode
to increase the breadth of supported hardware.
}
\item {\bf Run the device drivers at an intermediate privilege level.}
Some processor architectures, such as the x86 and {\sc PA-RISC},
support multiple privilege levels
besides just ``supervisor mode'' and ``user mode.''
Kernels for such architectures
may want to run blocking \oskit{} components under this framework
at an intermediate privilege level,
if this approach results in a net win
in terms of performance or implementation complexity.
Alternatively, on most architectures,
the kernel may be able to run blocking \oskit{} components
in user mode but with an address map identical to the kernel's,
allowing them direct access to physical memory
and other important kernel resources.
\end{itemize}
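As an illustration of the kernel-thread approach described above,
the following sketch shows how an interrupt-model kernel
might hand each component operation to a kernel thread,
which supplies the per-invocation stack that the blocking model requires.
The {\tt kthread_create}, {\tt comp_invoke}, and {\tt request_done} calls,
and the request structure itself,
are hypothetical names invented for the example.
\begin{verbatim}
/*
 * Sketch: running process-level component code on a kernel thread in
 * an otherwise interrupt-model kernel.  Each operation gets a thread
 * (and hence a conventional stack), so the component may block in
 * callbacks without losing state.  All names here are hypothetical.
 */
struct comp_request {
        struct comp     *comp;
        void            *args;
        int             result;
};

static void
comp_request_thread(void *arg)
{
        struct comp_request *req = arg;

        /*
         * Locks (as in the preemptive-kernel case above) keep threads
         * in the same component from preempting each other except in
         * blocking callbacks.
         */
        mutex_lock(&req->comp->lock);
        req->result = comp_invoke(req->comp, req->args);
        mutex_unlock(&req->comp->lock);
        request_done(req);      /* notify or wake up the requester */
        /* the thread terminates, or returns to a cached pool, here */
}

int
comp_start_operation(struct comp *comp, void *args,
                     struct comp_request *req)
{
        req->comp = comp;
        req->args = args;
        return kthread_create(comp_request_thread, req);
}
\end{verbatim}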