The simple scheduler behaves exactly the same as the affine scheduler on a
single core. Obviously, the affine scheduler is more complicated and thus
introduces greater overhead, but quite a lot of the multicore logic was
disabled on single core systems in the previous commit.
This is preparation for small task packing. We want to have as many idle
cores as possible. To achieve that we put all threads on the most heavily
loaded core (so the other ones can become idle). However, we don't really
want to do that if there are CPU-bound tasks or if any of the cores
becomes overloaded.
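As a rough illustration of that rule (the names and the load threshold
below are invented for this sketch, not the scheduler's actual code), the
packing target could be chosen like this:

    #include <cstdint>

    static const int32_t kLoadThreshold = 80;   // percent; assumed cutoff

    // Pick the most loaded core that is not yet overloaded, so the other
    // cores can stay idle; if every core is overloaded, fall back to the
    // least loaded one, i.e. stop packing and spread instead.
    int32_t
    ChooseCoreForPacking(const int32_t* coreLoad, int32_t coreCount)
    {
        int32_t best = -1;
        int32_t leastLoaded = 0;
        for (int32_t i = 0; i < coreCount; i++) {
            if (coreLoad[i] < kLoadThreshold
                && (best < 0 || coreLoad[i] > coreLoad[best]))
                best = i;
            if (coreLoad[i] < coreLoad[leastLoaded])
                leastLoaded = i;
        }
        return best >= 0 ? best : leastLoaded;
    }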
Performance mode:
If there has been a lot of activity on the core since the thread went
to sleep, its data in the cache has probably been overwritten.
Power saving mode:
If the thread went to sleep a long time ago, either there has been a
lot of activity on its core or the core has been idle, and it may
be more efficient to wake another one.
The longer a core is idle, the deeper the idle state it has entered.
That's why the scheduler should always choose the core that has gone idle
most recently (both for performance and power saving reasons).
Moreover, if there is more than one package the scheduler should
minimize the number of packages with at least one active core when
power saving is the priority. Conversely, as many packages as possible
should be used when aiming for high performance.
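A minimal sketch of the wake-up choice (illustrative only; the kernel does
not use std::stack and all names here are invented): keeping idle cores on
a LIFO stack means the core woken first is the one that went idle last,
i.e. the one in the shallowest idle state.

    #include <cstddef>
    #include <stack>

    struct Core;

    static std::stack<Core*> sIdleCores;

    void
    CoreWentIdle(Core* core)
    {
        sIdleCores.push(core);  // the most recently idled core is on top
    }

    Core*
    PickCoreToWake()
    {
        if (sIdleCores.empty())
            return NULL;
        Core* core = sIdleCores.top();  // shallowest idle state
        sIdleCores.pop();
        return core;
    }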
There is a global heap of cores, where the key is the highest priority
of the threads running on that core. Moreover, for each core there is
a heap of the logical processors on that core, where the key is the
priority of the currently running thread.
The per-core heap is used for load balancing among the logical processors
on that core. The global heap is used in the initial decision where to put
the thread (note that the algorithm that makes this decision is not
complete yet).
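A simplified model of that two-level structure (using std::priority_queue
for brevity; the kernel uses its own updatable heaps, and all names here
are illustrative):

    #include <queue>
    #include <vector>

    struct CPUEntry {
        int fCPUNumber;
        int fRunningPriority;   // priority of the thread currently running
    };

    struct CPUCompare {
        // min-heap: the logical processor running the lowest priority
        // thread is on top, i.e. the best place for a new thread
        bool operator()(const CPUEntry& a, const CPUEntry& b) const
        {
            return a.fRunningPriority > b.fRunningPriority;
        }
    };

    struct CoreEntry {
        int fCoreNumber;
        int fHighestPriority;   // key in the global heap of cores
        std::priority_queue<CPUEntry, std::vector<CPUEntry>,
            CPUCompare> fCPUHeap;   // per-core heap of logical processors
    };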
The scheduler is at a very early stage. There is no thread migration and
the algorithms choosing a CPU for a thread are very simple.
Since the affine scheduler is going to use one run queue per core, on
single core machines it will work exactly the same as the simple scheduler.
That would allow us to have only one scheduler implementation usable
on all kinds of machines.
The simple scheduler is used when we do not have to worry about cache
affinity (i.e. a single core with or without SMT, or a multicore system
with all cache levels shared).
When we replace gSchedulerLock with more fine-grained locking, the affine
scheduler should also be chosen when the logical CPU count is high
(regardless of cache).
In SMP systems the simple scheduler will be used only when all logical
processors share all levels of cache and the number of CPUs is low.
In such systems we do not have to care about cache affinity, and
contention on the lock protecting the shared run queue is low. A single
run queue makes load balancing very simple.
Kernel support for yielding to all (including lower priority) threads
has been removed. POSIX sched_yield() remains unchanged.
If a thread really needs to yield to everyone it can reduce its priority
to the lowest possible, then yield, and manually return to its previous
priority upon continuing.
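In userland that dance could look roughly like this (a sketch using the
public Haiku API; error handling omitted):

    #include <OS.h>
    #include <sched.h>

    // Yield to all other threads, including lower priority ones: drop to
    // the lowest active priority, yield, then restore the old priority.
    void
    yield_to_everyone()
    {
        thread_id self = find_thread(NULL);
        thread_info info;
        get_thread_info(self, &info);
        set_thread_priority(self, B_LOWEST_ACTIVE_PRIORITY);
        sched_yield();
        set_thread_priority(self, info.priority);
    }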
Each thread has a minimal priority that depends on its static priority.
However, it is still able to starve threads with even lower priority
(e.g. CPU-bound threads with a lower static priority). To prevent this,
another penalty is introduced. When the minimal priority is reached, a
penalty of (count mod minimal_priority) is added, where count is the number
of time slices since the thread reached its minimal priority. This prevents
starvation of lower priority threads (since all CPU-bound threads may have
their priority temporarily reduced to 1) but preserves the relation between
static priorities: when there are two CPU-bound threads, the one with the
higher static priority gets more CPU time.
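The effect of that extra penalty can be sketched as follows (illustrative
arithmetic only; the scheduler's actual bookkeeping may differ):

    #include <SupportDefs.h>

    // Once a thread sits at its minimal priority, the extra penalty of
    // (count % minimalPriority) cycles its effective priority through
    // [1, minimalPriority], so lower priority CPU-bound threads get a
    // chance to run while higher static priority still wins on average.
    int32
    effective_priority(int32 minimalPriority, int32 sliceCount)
    {
        int32 penalty = sliceCount % minimalPriority;
        return minimalPriority - penalty;
    }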
The maximum penalty a thread can receive is now limited depending on
its real priority. However, this makes it possible to starve threads
with a priority lower than that limit. To prevent that, threads that
have already earned the maximum penalty are periodically forced to
yield the CPU to all other threads.
Until now, when a thread was preempted by a higher priority thread
it was placed at the end of its priority FIFO and given a new time
slice. This patch changes that, allowing the thread to complete its
time slice (once the higher priority threads are done), unless there
was very little time left, in which case that time is added to the
next time slice.
Apart from making the algorithm more fair, this change makes it easier
to identify CPU-bound threads. (Earlier they could 'hide' by being
preempted by a higher priority thread and consequently never using
their whole time slice.)
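A sketch of that quantum handling (the names and the cutoff value are
assumptions, not the actual kernel code):

    typedef long long bigtime_t;    // microseconds, as in Haiku

    struct Thread {
        bigtime_t quantum;      // full time slice length
        bigtime_t quantumLeft;  // what the thread resumes with
    };

    static const bigtime_t kMinimumRemainder = 1000;    // 1 ms, assumed

    // Called when a thread is preempted by a higher priority thread.
    void
    account_preemption(Thread* thread, bigtime_t timeUsed)
    {
        bigtime_t left = thread->quantum - timeUsed;
        if (left >= kMinimumRemainder) {
            // let the thread finish its slice once the higher priority
            // threads are done
            thread->quantumLeft = left;
        } else {
            // too little left: give a fresh slice next time, with the
            // remainder added so no time is lost
            thread->quantumLeft = thread->quantum + left;
        }
    }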
This patch appears to fix #8007.
A thread that consumes its whole quantum has its priority reduced. The
penalty is cancelled when the thread voluntarily gives up the CPU.
Real-time threads are not affected.
The problem of thread starvation is not solved completely. The worst case
latency is still unbounded (even in systems with a bounded number of
threads). When a middle priority thread is constantly preempted by high
priority threads it does not earn the penalty, so lower priority threads
can still be starved. Moreover, the punishment is probably too aggressive,
as it reduces the priority of virtually all CPU-bound threads to 1.
Since we are using libraries originally intended for user mode in kernel
mode, providing them with some userland functions is inevitable. This
particular patch makes zlib happy and able to call exit() when its debug
assertions fail.
The latest gcc converts the old ones to the new ones anyway,
including when passing to gas, which of course is not new enough,
so we also have to force gcc to pass the old one around in one case.
jam fails in execve() trying to run the command because the argument
list is too large, due to the many objects in libgcc.
We split them into two intermediate objects,
then link those into libroot.
* __pthread_destroy_thread() will in turn free the pthread_thread object.
* This fixes a leak of 2072 bytes on each thread construction/destruction,
  and #9945: MediaExtractor spawns a thread on construction, which leaked
  its pthread_thread object on destruction.
If the alternate signal stack is used, randomize the initial stack
pointer in the same way it is randomized on "normal" thread stacks.
Also, update the MINSIGSTKSZ value so that regardless of where the new
stack pointer points there is at least 4kB of stack left.
Support for 64-bit atomic operations for ARMv7+ is currently stubbed
out in libroot, but our current targets do not use it anyway.
We now select atomics-as-syscalls automatically based on the ARM
architecture we're building for. The intent is to do away with
most of the board specifics (at the very least on the kernel side)
and just specify the lowest ARMvX version you want to build for.
This will give us the flexibility to distribute a single image for a
wide range of devices, or to build a system tuned for one specific
core type.
Now we check whether the virtual address corresponding to the PTE lies
in an allocated virtual address range. This fixes a cause of #8345:
The assertion would trigger when such an entry was encountered. There
might be other causes that trigger the same assertion, though.
This adds the -mapcs-frame compiler flag for ARM to have "stable"
stack frames, adds support to the kernel for dumping stack crawls,
and initial support for iframes. There's much more functionality
to unlock in KDL, but this already makes debugging a lot more
comfortable.
This helps when debugging, since when a driver/module causes a crash
while registering with the device manager, you can actually look at
the device manager state ;-)
The previously used method for programming the timer did not take
into account that our timespec is 64-bit while the register we poke
it into is 32-bit. Since the PXA (the SoC in the Verdex target) has a
limited scale of resolution (microseconds, milliseconds, seconds), we
dynamically determine the resolution we can most closely match, and
set that.
For snooze() to work, for example, we also need system_time to work.
The current implementation uses a system timer at microsecond
resolution to keep track of time.
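The resolution matching amounts to something like this (names and
helpers are invented for illustration, not the actual PXA driver code):

    #include <stdint.h>

    enum resolution { RES_MICROSECOND, RES_MILLISECOND, RES_SECOND };

    // Choose the finest resolution at which the 64-bit timeout (in
    // microseconds) still fits into the 32-bit match register.
    enum resolution
    choose_timer_resolution(uint64_t timeoutMicros, uint32_t* counterValue)
    {
        if (timeoutMicros <= UINT32_MAX) {
            *counterValue = (uint32_t)timeoutMicros;
            return RES_MICROSECOND;
        }
        if (timeoutMicros / 1000 <= UINT32_MAX) {
            *counterValue = (uint32_t)(timeoutMicros / 1000);
            return RES_MILLISECOND;
        }
        *counterValue = (uint32_t)(timeoutMicros / 1000000);
        return RES_SECOND;
    }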
Although the code is far from perfect, I'm committing it now before
it gets lost, since I'm working on the infrastructure code
to properly factor the SoC specific code out of the core
ARM architecture code (so the kernel can support more than
our poor old Verdex QEMU target ;))