scheduler_common.h is now meant for types, variables and functions used
by both core scheduler code and implementations of scheduler modes.
Functions like switch_thread() and update_thread_times() do not belong
there anymore.
* Remove the possibility to temporarily disable small task packing.
* When the small task packing target gets overloaded, continue packing
threads on another core, but avoid migrating the already packed
ones.
The scheduler still tends to needlessly migrate threads to other cores
when under heavier load, but it is now much better than before.
* pin idle threads to their specific CPUs
* allow scheduler to implement SMP_MSG_RESCHEDULE handler
* scheduler_set_thread_priority() reworked
* at reschedule: enqueue old thread after dequeueing the new one
* Thread::scheduler_lock protects thread state, priority, etc.
* sThreadCreationLock protects thread creation and removal, and the
list of threads in a team.
* Team::signal_lock and Team::time_lock protect the list of threads in
a team as well.
* Scheduler uses its own internal locking.
* No need for the atomically changed variables to be declared as
volatile.
* Drop support for atomically getting and setting unaligned data.
* Introduce atomic_get_and_set[64]() which works the same as
atomic_set[64]() used to. atomic_set[64]() does not return the
previous value anymore.
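To illustrate the new semantics, here is a minimal user-space sketch
using std::atomic (the kernel implementation differs; only the two
function names mirror the new API, everything else is illustrative):

    #include <atomic>
    #include <cstdint>

    static std::atomic<int32_t> sValue;

    // atomic_set() is now a plain atomic store with no return value.
    void atomic_set_sketch(int32_t newValue)
    {
        sValue.store(newValue);
    }

    // atomic_get_and_set() atomically swaps in the new value and
    // returns the previous one, as atomic_set() used to.
    int32_t atomic_get_and_set_sketch(int32_t newValue)
    {
        return sValue.exchange(newValue);
    }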
The flag's main purpose is to avoid race conditions between the event
handler and cancel_timer(). However, cancel_timer() is safe even
without using gSchedulerLock.
If the event is scheduled to happen on a CPU other than the one that
invokes cancel_timer(), then cancel_timer() either disables the event
before its handler starts executing or waits until the event handler
is done.
If the event is scheduled on the same CPU that calls cancel_timer(),
then, since cancel_timer() disables interrupts, the event is either
executed before cancel_timer() runs or is already disabled by the time
the timer interrupt handler starts.
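A user-space analogue of that protocol (a sketch only; the flag and
field names are invented, not the kernel's):

    #include <atomic>

    struct Event {
        std::atomic<bool> scheduled{true};  // handler has not claimed it
        std::atomic<bool> running{false};   // handler body is executing
    };

    // Runs in the timer interrupt; does the work only if it manages to
    // claim the event first.
    void handler(Event& event)
    {
        event.running.store(true);
        if (event.scheduled.exchange(false)) {
            // ... perform the event's work ...
        }
        event.running.store(false);
    }

    // Safe without gSchedulerLock: either the event is disabled before
    // its handler starts, or we wait until the handler is done.
    void cancel(Event& event)
    {
        if (event.scheduled.exchange(false))
            return;  // disabled before the handler could claim it
        while (event.running.load())
            ;  // the handler won the race - spin until it finishes
    }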
Reads and writes to uid_t and gid_t are atomic anyway. The only real
problem that may happen here is an inconsistent state of the triples
effective_{u, g}id, saved_set_{u, g}id, real_{u, g}id, but the team
locks protect us against that.
The fact that a thread is waiting doesn't mean that it is being nice
to the others. If the thread indeed waits for a longer time, its
penalty will be cancelled anyway; however, if the thread waits for
only a very short time, do not count that as being nice, since lower
priority threads didn't have much chance to run.
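A sketch of that rule (the threshold and the field names are invented
for illustration):

    #include <cstdint>

    typedef int64_t bigtime_t;  // microseconds, as in the Haiku headers

    // Hypothetical cutoff below which a wait does not count as nice.
    static const bigtime_t kMinimalWaitTime = 2000;

    struct ThreadData {
        bigtime_t wentSleep;      // when the thread went to sleep
        int32_t priorityPenalty;  // accumulated penalty
    };

    void thread_woke_up(ThreadData& thread, bigtime_t now)
    {
        // Only a sufficiently long wait cancels the penalty; a very
        // short one gave lower priority threads no real chance to run.
        if (now - thread.wentSleep >= kMinimalWaitTime)
            thread.priorityPenalty = 0;
    }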
The simple scheduler behaves exactly the same as the affine scheduler
with a single core. Obviously, the affine scheduler is more
complicated and thus introduces greater overhead, but quite a lot of
multicore logic has been disabled on single core systems in the
previous commit.
This is preparation for small task packing. We want to have as many
idle cores as possible. To achieve that we put all threads on the most
heavily loaded core (so the other ones can become idle). However, we
don't really want to do that if there are CPU bound tasks or if any of
the cores becomes overloaded.
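An illustrative sketch of the idea (all names and the threshold are
invented):

    #include <cstdint>
    #include <vector>

    static const int32_t kOverloadThreshold = 90;  // illustrative

    struct Core {
        int32_t load;           // current load estimate, 0-100
        bool hasCPUBoundTasks;  // long-running, CPU hungry threads
    };

    // Returns the core to pack small tasks on, or nullptr to skip
    // packing entirely.
    Core* choose_packing_target(std::vector<Core>& cores)
    {
        Core* target = nullptr;
        for (Core& core : cores) {
            // Don't pack onto overloaded cores or next to CPU bound
            // tasks.
            if (core.hasCPUBoundTasks || core.load >= kOverloadThreshold)
                continue;
            // Prefer the most heavily loaded remaining core, so the
            // other cores can become (and stay) idle.
            if (target == nullptr || core.load > target->load)
                target = &core;
        }
        return target;
    }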
Performance mode:
If there has been a lot of activity on the core since the thread went
to sleep, its data in the cache has probably been overwritten.
Power saving mode:
If the thread went to sleep a long time ago, either there has been a
lot of activity on its core or the core has been idle, and it may
be more efficient to wake another one.
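Both heuristics reduce to a simple comparison; a sketch (the threshold
and the names are invented):

    #include <cstdint>

    typedef int64_t bigtime_t;

    static const bigtime_t kCacheExpireTime = 100000;  // µs, illustrative

    // Performance mode: compare the core's active time now with its
    // active time when the thread went to sleep; a lot of activity in
    // between means the thread's cached data is likely gone.
    bool cache_expired_performance(bigtime_t coreActiveTime,
        bigtime_t activeTimeAtSleep)
    {
        return coreActiveTime - activeTimeAtSleep > kCacheExpireTime;
    }

    // Power saving mode: look at absolute sleep time; a long sleep
    // means the core was either busy or deeply idle, so waking another
    // core may be more efficient.
    bool cache_expired_power_saving(bigtime_t now, bigtime_t wentSleep)
    {
        return now - wentSleep > kCacheExpireTime;
    }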
The longer a core is idle, the deeper the idle state it has entered.
That's why the scheduler should always choose the core that went idle
most recently (both for performance and power saving reasons).
Moreover, if there is more than one package, the scheduler should
minimize the number of packages with at least one active core when
power saving is the priority. Conversely, as many packages as possible
should be used when aiming for high performance.
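A sketch of both selection rules (the structures are invented for
illustration):

    #include <cstdint>
    #include <vector>

    typedef int64_t bigtime_t;

    struct Core {
        bool idle;
        bigtime_t wentIdle;  // when the core entered its idle state
    };

    struct Package {
        int32_t activeCores;  // number of cores that are not idle
    };

    // The core that went idle most recently is in the shallowest idle
    // state and is therefore the cheapest one to wake.
    Core* most_recently_idle(std::vector<Core>& cores)
    {
        Core* best = nullptr;
        for (Core& core : cores) {
            if (!core.idle)
                continue;
            if (best == nullptr || core.wentIdle > best->wentIdle)
                best = &core;
        }
        return best;
    }

    // Power saving: prefer packages that are already partially active,
    // so fully idle packages can stay asleep. High performance: do the
    // opposite and spread work across as many packages as possible.
    Package* choose_package(std::vector<Package>& packages,
        bool powerSaving)
    {
        Package* fallback = nullptr;
        for (Package& package : packages) {
            bool partiallyActive = package.activeCores > 0;
            if (partiallyActive == powerSaving)
                return &package;
            if (fallback == nullptr)
                fallback = &package;
        }
        return fallback;
    }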
There is a global heap of cores, where the key is the highest priority
of the threads running on that core. Moreover, for each core there is
a heap of the logical processors on that core, where the key is the
priority of the currently running thread.
The per-core heap is used for load balancing among the logical
processors on that core. The global heap is used for the initial
decision where to put a thread (note that the algorithm that makes
this decision is not complete yet).
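A sketch of the two-level structure, using std::priority_queue for
illustration (the keys change while threads run, so a real
implementation needs heaps that support updating an element's key):

    #include <cstdint>
    #include <queue>
    #include <vector>

    struct CPUEntry {
        int32_t cpu;
        int32_t runningPriority;  // priority of the running thread
    };

    // Min-heap: the logical processor running the lowest priority
    // thread is on top - the best target when balancing within a core.
    struct CPUCompare {
        bool operator()(const CPUEntry& a, const CPUEntry& b) const
        {
            return a.runningPriority > b.runningPriority;
        }
    };

    struct CoreEntry {
        std::priority_queue<CPUEntry, std::vector<CPUEntry>,
            CPUCompare> cpuHeap;  // per-core heap of logical processors
        int32_t highestPriority;  // key in the global heap of cores
    };

    // Min-heap of cores: the core whose highest running priority is
    // lowest is the best initial target for a new thread.
    struct CoreCompare {
        bool operator()(const CoreEntry* a, const CoreEntry* b) const
        {
            return a->highestPriority > b->highestPriority;
        }
    };

    static std::priority_queue<CoreEntry*, std::vector<CoreEntry*>,
        CoreCompare> sCoreHeap;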
The scheduler is at a very early stage. There is no thread migration
and the algorithms choosing a CPU for a thread are very simple.
Since the affine scheduler is going to use one run queue per core, on
single core machines it will work exactly the same as the simple
scheduler. That would allow us to have only one scheduler
implementation usable on all kinds of machines.
The simple scheduler is used when we do not have to worry about cache
affinity (i.e. single core with or without SMT, multicore with all
cache levels shared).
When we replace gSchedulerLock with more fine-grained locking, the
affine scheduler should also be chosen when the logical CPU count is
high (regardless of cache).
In SMP systems the simple scheduler will be used only when all logical
processors share all levels of cache and the number of CPUs is low.
In such systems we do not have to care about cache affinity, and the
contention on the lock protecting the shared run queue is low. A
single run queue makes load balancing very simple.
Kernel support for yielding to all (including lower priority) threads
has been removed. POSIX sched_yield() remains unchanged.
If a thread really needs to yield to everyone, it can reduce its
priority to the lowest possible and then yield (it will then need to
manually return to its previous priority upon continuing).
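In user space the workaround looks like this (a sketch using the
public API, error handling omitted):

    #include <sched.h>
    #include <OS.h>

    void yield_to_everyone()
    {
        thread_info info;
        get_thread_info(find_thread(NULL), &info);

        // Drop to the lowest active priority so that every other
        // thread, whatever its priority, gets a chance to run...
        set_thread_priority(find_thread(NULL), B_LOWEST_ACTIVE_PRIORITY);
        sched_yield();

        // ...then manually restore the previous priority.
        set_thread_priority(find_thread(NULL), info.priority);
    }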
Each thread has a minimal priority that depends on its static
priority. However, at that floor it is still able to starve threads
with an even lower priority (e.g. CPU bound threads with a lower
static priority). To prevent this, another penalty is introduced. When
the minimal priority is reached, a penalty of (count mod
minimal_priority) is added, where count is the number of time slices
since the thread reached its minimal priority. This prevents
starvation of lower priority threads (since all CPU bound threads may
have their priority temporarily reduced to 1) but preserves the
relation between static priorities: when there are two CPU bound
threads, the one with the higher static priority gets more CPU time.
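A worked sketch of that formula (the field names and the rule for the
floor are invented for illustration):

    #include <cstdint>

    struct ThreadData {
        int32_t staticPriority;   // priority assigned by the user
        int32_t priorityPenalty;  // ordinary CPU usage penalty
        int32_t slicesAtMinimum;  // time slices since the floor was hit
    };

    // Hypothetical floor rule; the real formula may differ.
    static int32_t minimal_priority(const ThreadData& thread)
    {
        int32_t minimal = thread.staticPriority / 2;
        return minimal > 1 ? minimal : 1;
    }

    static int32_t effective_priority(const ThreadData& thread)
    {
        int32_t priority = thread.staticPriority - thread.priorityPenalty;
        int32_t minimal = minimal_priority(thread);
        if (priority <= minimal) {
            // At the floor, subtract (count mod minimal): the result
            // cycles between minimal and 1, so every CPU bound thread
            // periodically drops to 1 and cannot starve anyone forever,
            // yet a higher floor still wins more CPU time on average.
            priority = minimal - thread.slicesAtMinimum % minimal;
        }
        return priority;
    }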