A bit hackish implementation of a profiler for the scheduler.
Calls to SCHEDULER_ENTER_FUNCTION() at the beginning of each function
aren't nice, and the usage of __PRETTY_FUNCTION__ isn't any better (both
gcc and clang support it, though), but it was quick to implement and
doesn't lose information on inlined functions. It's just a tool, not an
integral part of the kernel anyway.
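A minimal sketch of how such an entry marker could work (the macro name
matches the commit; the profiler internals shown here are made up):

	// Hypothetical sketch: an RAII guard reports function entry/exit.
	// A real profiler would record timestamps instead of printing.
	#include <cstdio>

	class FunctionProfile {
	public:
		explicit FunctionProfile(const char* name)
			:
			fName(name)
		{
			std::printf("enter %s\n", fName);
		}

		~FunctionProfile()
		{
			std::printf("exit  %s\n", fName);
		}

	private:
		const char*	fName;
	};

	// __PRETTY_FUNCTION__ expands to the full signature at the lexical
	// location of the macro, so the name survives even when the
	// compiler inlines the function.
	#define SCHEDULER_ENTER_FUNCTION() \
		FunctionProfile functionProfile(__PRETTY_FUNCTION__)

	static void
	enqueue_thread()
	{
		SCHEDULER_ENTER_FUNCTION();
		// ... scheduler work ...
	}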
Apart from the refactoring, this commit also takes the opportunity to
remove unnecessary read locks when choosing a package and a core from
the idle lists. The data structures are accessed in a thread-safe way,
and it does not really matter whether the obtained data becomes outdated
just when we release the lock or during our search for the appropriate
package/core.
Some SMT implementations (e.g. recent AMD microarchitectures) have a
separate L1d cache for each SMT thread (which AMD decides to call
"cores"). This means that we shouldn't move threads to another logical
processor too often, even if it belongs to the same core. We aren't very
strict about this, as it would complicate load balancing, but we try to
reduce unnecessary migrations.
atomic_{get, set}64() are problematic on architectures without a 64 bit
compare and swap.
Also, using a sequential lock instead of atomic access ensures that
reads from cpu_ent::active_time won't require any writes to shared
memory.
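A sketch of the sequential lock read pattern (assuming seqlock helpers
along the lines of acquire_read_seqlock()/release_read_seqlock(); the
exact names may differ from the actual implementation):

	// The writer runs only on the CPU owning the entry; it makes the
	// counter odd while updating, even when done. Readers retry until
	// they observe an even, unchanged counter - performing only loads,
	// never stores, on the shared data.
	bigtime_t
	get_active_time(cpu_ent* cpu)
	{
		bigtime_t activeTime;
		uint32 count;
		do {
			count = acquire_read_seqlock(&cpu->active_time_lock);
			activeTime = cpu->active_time;
		} while (!release_read_seqlock(&cpu->active_time_lock, count));
		return activeTime;
	}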
As weak aliases are not supported on OS X, this caused problems when
building Haiku on OS X, as this file is also used for the host tools.
Signed-off-by: Axel Dörfler <axeld@pinc-software.de>
The client code is not supposed to change the topology info.
It would also be nice if cpu_topology_node::children was an array of
pointers to const, but that would require several const_casts in the
topology tree generation code, so it's probably not worth it.
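The resulting shape, roughly (field types assumed for illustration):

	// Callers get a pointer to const and cannot modify the tree;
	// children stays non-const to keep the generation code free of
	// const_casts. Field names follow the commit, types are assumed.
	struct cpu_topology_node {
		int					type;			// package, core, SMT thread
		int					id;
		cpu_topology_node**	children;
		int					children_count;
	};

	const cpu_topology_node* get_cpu_topology(void);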
Apparently, reading from dr3 is slower than reading from memory with a
cache hit.
Also, depending on the hypervisor configuration, accessing dr3 may cause
a VM exit (and, at least on kvm, it does), which makes it much slower
than a memory access even when there is a cache miss.
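For illustration, the debug register read is a privileged instruction,
while its replacement is an ordinary load (sketch; surrounding names are
made up):

	// Reading dr3 needs a privileged mov; under a hypervisor this can
	// trap to the host (a VM exit), which costs far more than even an
	// uncached memory load.
	static inline addr_t
	read_dr3()
	{
		addr_t value;
		asm volatile("mov %%dr3, %0" : "=r"(value));
		return value;
	}

	// The replacement is just a memory load, e.g.:
	//	value = gCPU[currentCPUIndex].some_field;	// names illustrative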
On x86 we mainly want to disable PAE, which is now also used with less
memory as long as NX support is available. Ideally we'd check this
condition as well and only add the menu item if the kernel would enable
PAE.
Add get_safemode_option_early() and get_safemode_boolean_early() to get
safemode options before the kernel heap has been initialized. They use a
simplified parser.
CreateSubRequest() could still return an error and break out of the
while loop without exiting the outer for loop. Instead, we now reset the
error code before entering the for loop.
This reverts the extra for loop condition from
"do_iterative_fd_io_iterate(): Support sparse files".
When reading a file with more than 8 block_runs, get_vecs() would
return B_BUFFER_OVERFLOW, which would never create any subrequest due
to the error == B_OK test in the loop condition, but instead just fail.
Except for the get_vecs() return code, where it is not wanted, the test
made no sense, as all other assignments are either tested directly or
passed around with break.
Works for me but I don't guarantee it's completely correct.
* When exec()'ing we'd otherwise get (harmless but annoying) messages
from vm_page_fault(). With syscall tracing enabled we can get userland
stack traces anyway.
* Simplify by using TRACE_ENTRY_SELECTOR().
The spec explicitly states that pthread_join shall not return EINTR, so
we have to retry the wait when it gets interrupted instead of letting
the error code through.
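A sketch of the retry pattern (wait_for_thread() and B_INTERRUPTED are
real Haiku APIs; pthread_join() is presumably built on a similar
internal wait):

	// Retry instead of surfacing EINTR to the pthread_join() caller.
	status_t returnValue;
	status_t status;
	do {
		status = wait_for_thread(thread, &returnValue);
	} while (status == B_INTERRUPTED);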
* VMTranslationMap:
- Add DebugPrintMappingInfo(): Given a virtual address it is supposed
to print the paging structure information for that address. To be
implemented by derived classes.
- Add DebugGetReverseMappingInfo(): Given a physical address it is
supposed to find all virtual addresses mapped to it. To be
implemented by derived classes (shapes sketched below).
* X86VMTranslationMapPAE: Implement the new methods
DebugPrintMappingInfo() and DebugGetReverseMappingInfo().
* Add KDL command "mapping". It supports both virtual address lookups
and reverse lookups.
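The method shapes might look roughly like this (signatures assumed, not
verified against the tree):

	// In VMTranslationMap (sketch):
	virtual bool DebugPrintMappingInfo(addr_t virtualAddress);
	virtual bool DebugGetReverseMappingInfo(phys_addr_t physicalAddress,
		ReverseMappingInfoCallback& callback);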
__flatten_process_args() now has the executable path as an additional
(optional) parameter. If specified, the function will read the file's
SYS:ENV attribute (if set) and use its value to modify the environment
it is preparing for the new process. Currently supported attribute
values are strings consisting of "<var>=<value>" substrings separated by
"\0" (backslash zero), with '\' being used as an escape character. The
environment will be altered to contain the specified "<var>=<value>"
elements, replacing a preexisting <var> element (if any).
A possible use case would be setting a SYS:ENV attribute with value
"DISABLE_ASLR=1" on an executable that needs ASLR disabled.
* VMAddressSpace: Add randomizingEnabled property.
* VMUserAddressSpace: Randomize addresses only when randomizingEnabled
property is set.
* create_team_arg(): Check if the team's environment contains
"DISABLE_ASLR=1". Set the team's address space property
randomizingEnabled accordingly in load_image_internal() and
exec_team().
In that case the caller ideally wants to obtain an allocation at the
specified address, which was thwarted by using
B_RANDOMIZED_BASE_ADDRESS. Use B_BASE_ADDRESS instead.
This improves the experience with the gcc 4 pre-compiled headers
implementation (which expects to be able to map the PCH file at the same
address where it was located originally when it had been created), but
doesn't fix it completely. As long as ASLR is active, it is always
possible that something else (mapped shared objects, heap, stack) is in
the way.
Unless a free range was found before the first area, a specified base
address was ignored. In the non-randomized case this could result in a
range other than (i.e. starting before) the preferred one being chosen,
although the preferred range was available.
devfs_get_device() returns the device for a given path (if any), also
acquiring a reference to its vnode (thus ensuring the device won't go
away). devfs_put_device() puts the device vnode's reference.
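Typical usage pairs the two calls (sketch; the parameter passing style
and the device path are illustrative):

	// devfs_get_device() acquires a reference to the device's vnode,
	// keeping the device alive while it is in use.
	BaseDevice* device;
	status_t status = devfs_get_device("/dev/disk/scsi/0/0/0/raw",
		&device);
	if (status == B_OK) {
		// ... use the device ...
		devfs_put_device(device);	// put the vnode reference again
	}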
* Create new interface for cpuidle modules (similar to the cpufreq
interface)
* Generic cpuidle module is no longer needed
* Fix and update Intel C-State module
* Determine whether called from userland or kernel.
* Check the buffer address via IS_USER_ADDRESS(), if from userland.
* Simplify things by merging UserRead() with Read() and
UserWrite() with Write().
* Increase FIFO buffer capacity from 32 to 64 KiB and the FIFO atomic
write size ({BUF_SIZE}) from 512 bytes to 4 KiB (both like Linux).
* Fix *pathconf(..., _PC_PIPE_BUF). It was returning 4 KiB although the
implemented atomic write size was 512 bytes only. Now both *pathconf()
and the FIFO implementation refer to the same constant.
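After the change, querying the limit reflects the actual FIFO
implementation, e.g.:

	#include <stdio.h>
	#include <unistd.h>

	int
	main()
	{
		// _PC_PIPE_BUF: the number of bytes that can be written to a
		// pipe/FIFO atomically; the path is illustrative.
		long pipeBuf = pathconf("/tmp/some_fifo", _PC_PIPE_BUF);
		printf("atomic pipe write size: %ld\n", pipeBuf);
		return 0;
	}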
scheduler_common.h is now meant for types, variables and functions used
by both core scheduler code and implementations of scheduler modes.
Functions like switch_thread() and update_thread_times() do not belong
there anymore.
* get_file_attribute(): Use O_NOTRAVERSE, so we correctly read the
attribute from symlinks (see the sketch below).
* internal_path_for_path(): Shuffle things around a bit: The dependency
is resolved before handling B_FIND_PATH_PACKAGE_PATH, now. This adds
support for getting the package file for a dependency. The dependency
was ignored in this case before.
* Use kSystemPackageLinksDirectory instead of hard-coding "/packages".
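A sketch of the symlink-safe attribute read mentioned in the first item
(O_NOTRAVERSE and fs_read_attr() are real Haiku APIs; the attribute name
is illustrative and error handling is trimmed):

	// O_NOTRAVERSE opens the symlink itself instead of its target, so
	// the attribute is read from the link.
	int fd = open(path, O_RDONLY | O_NOTRAVERSE);
	char buffer[256];
	ssize_t bytesRead = fs_read_attr(fd, "some:attribute",
		B_STRING_TYPE, 0, buffer, sizeof(buffer));
	close(fd);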
* Remove the possibility to temporarily disable small task packing.
* When the small task packing target gets overloaded, continue packing
threads on another core, but avoid migrating the already packed
ones.
The scheduler still tends to needlessly migrate threads to other cores
when under heavier load, but it is now much better than before.
It's a browser for the system package content, where entries can be
selected to blacklist them. The selected entries are removed from the
packagefs instance in the boot loader, so that e.g. selected drivers
won't be picked up. The paths are also added to the safe mode driver
settings and will be interpreted when the system packagefs instance is
mounted by the kernel.
* Make Menu and MenuItem polymorphic.
* MenuItem:
- Make SetMarked() virtual, so it can be overridden.
- Add SetSubmenu() and Supermenu().
- Delete the submenu in the destructor.
* Menu:
- Add Entered()/Exited() hooks. They frame the time the user navigates
the menu or any of its submenus. The hooks allow subclasses to
populate their item list dynamically (see the sketch below).
- Add SortItems().
* Update boot loader menu copyright text to include 2013, now that it is
over soon. :-)
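A sketch of how the new Entered()/Exited() hooks could be used (class
and constructor details are hypothetical):

	// Hypothetical boot loader menu that builds its items lazily.
	class PackageContentsMenu : public Menu {
	public:
		virtual void Entered()
		{
			// the user navigated into this menu (or a submenu):
			// populate the item list from the current package data
		}

		virtual void Exited()
		{
			// navigation left the menu: discard the dynamically
			// built items again
		}
	};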
In each installation location, it is now possible to create a settings
file "packages" that allows blacklisting entries contained in packages.
The format is:
Package <package name> {
	EntryBlacklist {
		<entry path>
		...
	}
}
...
<package name> is the base name (no version) of the respective package
(e.g. "haiku"), <entry path> is an installation location relative path
(e.g. "add-ons/Translators/FooTranslator").
Blacklisted entries will be ignored by packagefs, i.e. they won't appear
in the file system. This addresses the issue that it may be necessary to
remove a problematic file (e.g. driver, add-on, or library), which would
otherwise require editing the containing package file.
The settings file is not "live". Changes take effect only after reboot
(respectively when remounting the concerned packagefs volume).
* Move PathBuffer helper class out of find_paths.cpp into its own
header.
* find_directory():
- Make use of MemoryDeleter to simplify things.
- Make use of PathBuffer for a simpler and more correct handling.
- Make B_UTILITIES_DIRECTORY resolve to B_APPS_DIRECTORY.
/boot/utilities doesn't exist anyway.
- Resolve the concerned constants to the architecture specific
subdirectory, when called in a secondary architecture context, just
like find_path*().
* get_architectures() returns the primary and the secondary
architectures in one array. That turned out to be convenient.
* Add C++ versions for get[_secondary]_architectures(), returning a
BStringList.
* Add get_architecture(), get_primary_architecture(),
get_secondary_architectures(), guess_architecture_for_path() to get
the caller's architecture, the primary architecture, all secondary
architectures, or the architecture associated with a specified path
respectively.
* Rename the find_path*() functions to find_path*_etc() and add an
optional architecture parameter. Add simplified find_path*()
functions.
* BPathFinder: Add FindPath[s]() versions with an architecture
parameter.
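A usage sketch (signatures assumed from the description above, not
verified):

	// Which architecture does a given binary belong to?
	char architecture[32];
	if (guess_architecture_for_path("/boot/system/bin/gcc",
			architecture, sizeof(architecture)) == B_OK) {
		// e.g. "x86" or "x86_gcc2"
	}

	// Primary and secondary architectures in one list (C++ version):
	BStringList architectures;
	get_architectures(architectures);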
* pin idle threads to their specific CPUs
* allow scheduler to implement SMP_MSG_RESCHEDULE handler
* scheduler_set_thread_priority() reworked
* at reschedule: enqueue old thread after dequeueing the new one
In case something went wrong, call unlock_memory_etc() with the rounded
base address instead of with the original address. If the original
address wasn't page aligned, unlock_memory_etc() would otherwise try to
unlock an additional page.
* Check whether the vectors we get are sparse file vectors and satisfy
them immediately instead of creating a subrequest. Untested, since the
API isn't used by ext2 as it should be.
* Add error == B_OK to the condition of the outer loop.
Besides that it failed to actually iterate through the vectors, it
shouldn't try to clear physical memory in the first place. The iovecs
refer to virtual address ranges. Rename it to zero_iovecs() to avoid
confusion.
Simplify the code, which also fixes the bug that the I/O context's root
was ignored when it was a mount point, thus resulting in globally rooted
paths in this case.
* Thread::scheduler_lock protects thread state, priority, etc.
* sThreadCreationLock protects thread creation and removal and list of
threads in team.
* Team::signal_lock and Team::time_lock protect list of threads in team
as well.
* Scheduler uses its own internal locking.
* No need for the atomically changed variables to be declared as
volatile.
* Drop support for atomically getting and setting unaligned data.
* Introduce atomic_get_and_set[64]() which works the same as
atomic_set[64]() used to. atomic_set[64]() does not return the
previous value anymore.
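The semantic difference in a nutshell:

	int64 value = 0;

	// atomic_set64() is now a plain atomic store; it no longer
	// returns the previous value...
	atomic_set64(&value, 42);

	// ...while atomic_get_and_set64() provides the old swap semantics:
	int64 previous = atomic_get_and_set64(&value, 7);	// previous == 42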
The new functions are meant to replace many uses of find_directory():
* find_paths() is supposed to be used when the directories of a certain
kind in all installation directories are needed (e.g. font
directories, add-on directory, etc.). Using this API makes code
robust wrt addition or removal of installation locations.
* find_path() is supposed to be used when files/directories associated
with a loaded program, library, or add-on need to be found (e.g. data
files or global settings).
* find_path_for_path() is similar to find_path(), but it starts from a
given path instead of an image.
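A usage sketch for the C API (signatures assumed from the description;
the free()-as-one-block detail is an assumption, error handling is
trimmed):

	#include <stdio.h>
	#include <stdlib.h>
	#include <FindDirectory.h>

	int
	main()
	{
		// all font directories across installation locations
		char** paths;
		size_t pathCount;
		if (find_paths(B_FIND_PATH_FONTS_DIRECTORY, NULL, &paths,
				&pathCount) == B_OK) {
			for (size_t i = 0; i < pathCount; i++)
				puts(paths[i]);
			free(paths);	// assumed: allocated as a single block
		}

		// data directory associated with the calling code
		char pathBuffer[B_PATH_NAME_LENGTH];
		find_path(NULL, B_FIND_PATH_DATA_DIRECTORY, "myapp",
			pathBuffer, sizeof(pathBuffer));
		return 0;
	}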
The flag's main purpose is to avoid race conditions between the event
handler and cancel_timer(). However, cancel_timer() is safe even without
using gSchedulerLock.
If the event is scheduled to happen on a CPU other than the one that
invokes cancel_timer(), then cancel_timer() either disables the event
before its handler starts executing or waits until the event handler is
done.
If the event is scheduled on the same CPU that calls cancel_timer(),
then, since cancel_timer() disables interrupts, the event is either
executed before cancel_timer() or, when the timer interrupt handler
starts running, the event is already disabled.
Reads and writes to uid_t and gid_t are atomic anyway. The only real
problem that may happen here is inconsistent state of triples
effective_{u, g}id, saved_set_{u, g}id, real_{u, g}id, but team locks
protect us against that.
Whatever we read from the drive in the boot loader isn't what we can
read from the device later, so rather skip the check sum test for
identifying the boot device in the kernel when booting off CD. Fixes
#10147.
The fact that a thread is waiting doesn't mean that it is nice to the
others. If the thread indeed waits for a longer time, its penalty will
be cancelled anyway; however, if the thread waits for a very short time,
do not count that as being nice, since lower priority threads didn't
have much of a chance to run.
* Replace ports list mutex with R/W-lock.
* Move team port list protection to separate array of mutexes.
Relieve contention on sPortsLock by removing Team::port_list from its
protected items. With this, set_port_owner() only needs to acquire the
sPortsLock for reading.
* Add another hash table holding the ports by name. Used by find_port()
so it doesn't have to iterate over the list anymore.
* Use slab-based memory allocator for port messages. sPortQuotaLock was
acquired on every message send or receive and was thus another point
of contention. The lock is not necessary anymore.
* Lock for port hashes and Port::lock are no longer locked in a nested
fashion to reduce chances of blocking other threads.
* Make operations concurrency-safe by adding an atomically accessed
Port::state which provides linearization points to port creation and
deletion. Both operations are now divided into logical and physical
parts, the logical part just updating the state and the physical part
adding/removing it to/from the port hash and team port list (see the
sketch below).
* set_port_owner() is the only remaining function which still locks
Port::lock and one or two of sTeamListLock[] in a nested fashion.
Since it needs to move the port from one team list to another and
change Port::owner, there's no way around it.
* Ports are now reference counted to make accesses to already-deleted
ports safe.
* Should fix #8007.
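A sketch of the state-based linearization mentioned above (the state
names and field layout are hypothetical; Haiku's atomic_test_and_set()
is real):

	// The atomic transition is the linearization point of the logical
	// part; the physical part (hash table and team list updates)
	// happens afterwards, with reference counting keeping the object
	// alive for concurrent readers.
	enum {
		PORT_ACTIVE		= 1,
		PORT_DELETED	= 2
	};

	status_t
	delete_port_logically(Port* port)
	{
		if (atomic_test_and_set(&port->state, PORT_DELETED,
				PORT_ACTIVE) != PORT_ACTIVE) {
			// someone else won the race and deleted it first
			return B_BAD_PORT_ID;
		}
		// now remove the port from the hash table and team list
		return B_OK;
	}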
Use platform_{allocate,free}_region() to allocate/free chunks >= 16 KiB.
This reduces the usage of our dedicated (and limited) heap region,
particularly since packagefs makes some larger temporary allocations
while reading the package file. Should fix #10136.
Comparing the complete disk_identifier structure isn't helpful as long
as we don't (can't) compare it in the kernel as well. ATM we only check
the check sums there, so that's what we need to do here as well. This
fixes potential mix-ups when booting off one of multiple equally sized
disks.
The simple scheduler behaves exactly the same as the affine scheduler
with a single core. Obviously, the affine scheduler is more complicated
and thus introduces greater overhead, but quite a lot of the multicore
logic has been disabled on single core systems in the previous commit.
This is preparation for small task packing. We want to have as many idle
cores as possible. To achieve that we put all threads on the most heavily
loaded core (so the other ones can become idle). However, we don't really
want to do that if there are CPU bound tasks and if any of the cores
becomes overloaded.
Performance mode:
If there has been a lot of activity on the core since the thread went
to sleep, its data in the cache has probably been overwritten.
Power saving mode:
If the thread went to sleep a long time ago, either there has been a
lot of activity on its core or the core has been idle, and it may be
more efficient to wake another one.
The longer a core is idle, the deeper the idle state it has entered.
That's why the scheduler should always choose the core that has gone
idle most recently (both for performance and power saving reasons).
Moreover, if there is more than one package, the scheduler should
minimize the number of packages with at least one active core when
power saving is the priority. Conversely, as many packages as possible
should be used when aiming for high performance.
There is a global heap of cores, where the key is the highest priority
of the threads running on that core. Moreover, for each core there is a
heap of the logical processors on that core, where the key is the
priority of the currently running thread.
The per-core heap is used for load balancing among the logical
processors on that core. The global heap is used in the initial decision
where to put the thread (note that the algorithm that makes this
decision is not complete yet).
The scheduler is in a very early stage. There is no thread migration
and the algorithms choosing a CPU for a thread are very simple.
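A simplified sketch of the two-level structure described above
(std::priority_queue is used for brevity; the kernel uses its own heap
implementation):

	#include <queue>
	#include <vector>

	// Sketch only; the keys follow the description above.
	struct CPUEntry {
		int	priority;	// priority of the currently running thread
	};

	struct CompareCPU {
		bool operator()(const CPUEntry* a, const CPUEntry* b) const
		{
			return a->priority < b->priority;
		}
	};

	struct CoreEntry {
		int	highestPriority;	// key in the global core heap

		// per-core heap: load balancing among this core's logical
		// processors
		std::priority_queue<CPUEntry*, std::vector<CPUEntry*>,
			CompareCPU> cpuHeap;
	};

	struct CompareCore {
		bool operator()(const CoreEntry* a, const CoreEntry* b) const
		{
			return a->highestPriority < b->highestPriority;
		}
	};

	// global heap: initial decision where to put a thread
	std::priority_queue<CoreEntry*, std::vector<CoreEntry*>,
		CompareCore> gCoreHeap;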
Since the affine scheduler is going to use one run queue per core, on
single core machines it will work exactly the same as the simple
scheduler. That would allow us to have only one scheduler implementation
usable on all kinds of machines.
The simple scheduler is used when we do not have to worry about cache
affinity (i.e. a single core with or without SMT, or multicore with all
cache levels shared).
When we replace gSchedulerLock with more fine-grained locking, the
affine scheduler should also be chosen when the logical CPU count is
high (regardless of cache).
In SMP systems the simple scheduler will be used only when all logical
processors share all levels of cache and the number of CPUs is low. In
such systems we do not have to care about cache affinity, and the
contention on the lock protecting the shared run queue is low. A single
run queue makes load balancing very simple.
This allows naming the boot loader package canonically. Due to code
size limitations we cannot perform a more correct name check, but there
shouldn't be any other entries in the packages directory with a name
with a "haiku_loader" prefix, anyway.
Kernel support for yielding to all (including lower priority) threads
has been removed. POSIX sched_yield() remains unchanged.
If a thread really needs to yield to everyone, it can reduce its
priority to the lowest possible and then yield (it will then need to
manually return to its previous priority upon continuing).
Each thread has a minimal priority that depends on its static priority.
However, it is still able to starve threads with even lower priority
(e.g. CPU bound threads with a lower static priority). To prevent this,
another penalty is introduced. When the minimal priority is reached, a
penalty of (count mod minimal_priority) is added, where count is the
number of time slices since the thread reached its minimal priority.
This prevents starvation of lower priority threads (since all CPU bound
threads may have their priority temporarily reduced to 1) but preserves
the relation between static priorities: when there are two CPU bound
threads, the one with the higher static priority gets more CPU time.
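A sketch of the described computation (all names are hypothetical, and
the relation between minimal and static priority is assumed):

	// Hypothetical effective-priority computation following the
	// description above.
	int32
	effective_priority(Thread* thread)
	{
		// assumed relation; the actual formula may differ
		int32 minimalPriority = thread->static_priority / 4;
		if (minimalPriority < 1)
			minimalPriority = 1;

		int32 priority = thread->static_priority - thread->penalty;
		if (priority <= minimalPriority) {
			// time_slice_count: slices since the minimal priority was
			// reached; cycles the thread between minimalPriority and 1
			priority = minimalPriority
				- thread->time_slice_count % minimalPriority;
		}
		return priority;
	}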
Instead of listing all the objects we want from the libgcc archive
we just make a copy of it and remove those we don't want, and link
to it.
This should allow returning MAXLINE in jam to a sane value.
* Increase general allocation alignment from 4 to 8 bytes. The old
value was even incorrect.
* Use a splay tree instead of a singly linked list to manage the free
chunks. That increases the size of the per-chunk structure used to
manage the free chunks, i.e. the minimal allocatable memory size (from
align(sizeof(void*)) to align(3 * sizeof(void*))), but makes finding
and inserting chunks much faster.
Fixes #10063, respectively improves the situation significantly.
The maximum penalty a thread can receive is now limited depending on
the real thread priority. However, this makes it possible to starve
threads with a priority lower than that limit. To prevent that, threads
that have already earned the maximum penalty are periodically forced to
yield the CPU to all other threads.