Simple scheduler is used when we do not have to worry about cache affinity
(i.e. single core with or without SMT, multicore with all cache levels
shared).
When we replace gSchedulerLock with more fine grained locking affine
scheduler should also be chosen when logical CPU count is high (regardless
of cache).
In SMP systems simple scheduler will be used only when all logical
processors share all levels of cache and the number of CPUs is low.
In such systems we do not have to care about cache affinity and
the contention on the lock protecting shared run queue is low. Single
run queue makes load balancing very simple.
Kernel support for yielding to all (including lower priority) threads
has been removed. POSIX sched_yield() remains unchanged.
If a thread really needs to yield to everyone it can reduce its priority
to the lowest possible and then yield (it will then need to manually
return to its prvious priority upon continuing).
Each thread has its minimal priority that depends on the static priority.
However, it is still able to starve threads with even lower priority
(e.g. CPU bound threads with lower static priority). To prevent this
another penalty is introduced. When the minimal priority is reached
penalty (count mod minimal_priority) is added, where count is the number
of time slices since the thread reached its minimal priority. This prevents
starvation of lower priorirt threads (since all CPU bound threads may have
their priority temporaily reduced to 1) but preserves relation between
static priorities - when there are two CPU bound threads the one with
higher static priority would get more CPU time.
The maximum penalty the thread can receive is now limited depending on
the real thread priority. However, since it make it possible to starve
threads with priority lower than that limit. To prevent that threads
that have already earned the maximum penalty are periodically forced
to yield CPU to all other threads.
Until now, when the thread has been preempted by higher priority
thread it was then placed at the end of its priority FIFO and given
a new time slice. This patch changes it allowing the thread to
complete its time slice (when the higher priority threads are done),
unless there was very little time left in which case this time is added
to the next time slice.
Apart from making the algorithm more fair this change allows to identify
CPU bound threads more easily. (Earlier they could 'hide' by being
preempted by higher priority thread and consequently never using
their whole time slice).
This patch appears to fix#8007.
Thread that consume its whole quantum has its priority reduced. The penalty
is cancelled when the thread voluntarily gives up CPU. Real-time threads
are not affected.
The problem of thread starvation is not solved completely. The worst case
latency is still unbounded (even in systems with bounded number of threads).
When a middle priority thread is constantly preempted by high priority
threads it would not earn the penalty, thus the lower priority threads
still can be starved. Moreover, the punishment is probably too aggressive
as it reduces priority of virtually all CPU bound threads to 1.