This patch makes it possible to inline the rdmsr and wrmsr instructions. The
performance impact shouldn't be significant since they are used relatively
rarely and wrmsr is usually a serializing instruction, but there is no reason
not to do so.
The goal of this patch is to amortize the cost of a context switch by
making the compiler aware that the context switch clobbers all
registers. Because all registers need to be saved anyway, there is no
additional cost to using callee-saved registers in the function that
does the context switch.
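A minimal sketch of the idea (x86_64; the stub name and exact operands
are illustrative, not the actual Haiku code):

    static inline void
    context_switch(uint64** oldStackPointer, uint64* newStackPointer)
    {
        // The clobber list tells the compiler that every general
        // purpose register may be overwritten, so it spills only what
        // is live at the call site. rbp is omitted, as GCC does not
        // allow clobbering the frame pointer.
        asm volatile("call _context_switch_internal"
            : "+S" (oldStackPointer), "+D" (newStackPointer)
            :
            : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11",
              "r12", "r13", "r14", "r15", "memory", "cc");
    }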
Similarly to the previous patch regarding the GDT, this is mostly a
rewrite of the IDT handling code from C to C++. Thanks to constexpr,
the IDT is now entirely generated at compile time.
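A rough illustration of the technique (not the actual Haiku code;
written against C++14 for brevity): the whole table is produced by
constexpr functions, so it is fully initialized in the binary image
and costs nothing at boot.

    #include <array>
    #include <cstdint>
    #include <utility>

    struct IdtEntry {
        uint64_t low;
        uint64_t high;
    };

    constexpr IdtEntry
    MakeIdtEntry(uint64_t vector)
    {
        // Real code would encode the handler address, code segment
        // selector and gate type; the vector is only a placeholder.
        return IdtEntry{vector, 0};
    }

    template<std::size_t... Index>
    constexpr std::array<IdtEntry, sizeof...(Index)>
    MakeIdt(std::index_sequence<Index...>)
    {
        return {{MakeIdtEntry(Index)...}};
    }

    constexpr auto kIdt = MakeIdt(std::make_index_sequence<256>());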
Virtually no functional change, just rewriting the code from
"C in *.cpp files" to C++. Use of constexpr may be advantageous but
that code is not performance critical anyway.
While resolving TLS-related relocations it is necessary to know the
DSO that defines the symbol. Without proper caching support, that
information is available only when the symbol is resolved for the
first time. That works well for TLS, since the TLS_DTPMOD relocation
is guaranteed to come before the TLS_DTPOFF relocation.
This patch makes the newly introduced parts of the interface work in
the general case.
Previously the TLS_DTPMOD relocation blindly returned the ID of the
current DSO. This patch does a proper symbol lookup if there is a
symbol assigned to the relocation, and uses the ID of the DSO in which
the symbol is defined.
This patch introduces support for ELF-based TLS handling with lazy
allocation and initialization of a TLS block for each DSO and thread.
The implementation generally follows the official ABI, except that the
generation counter in the dtv is in fact a pointer to a Generation
object that contains both the generation counter and the size of the
dtv. That simplified the implementation a bit, but could be changed
later. The ABI requirements regarding the in-memory position of the
TLS block are not honoured, which results in the static TLS model
being unsupported. However, that should not be a problem: as long as
"executables" in Haiku are in fact shared objects, optimizations that
require a specific in-memory TLS block layout are not possible anyway.
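For illustration, the dtv layout described above could look roughly
like this (the names are mine, not necessarily those in the source):

    struct Generation {
        unsigned long counter;  // global TLS generation number
        size_t size;            // number of slots in the dtv
    };

    struct DynamicThreadVector {
        Generation* generation; // instead of the ABI's bare counter
        void* blocks[1];        // lazily allocated TLS block per DSO
    };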
* Fixes missing atomic operations that gcc requires.
* The gcc build still fails further down because of a mix-up of
  VFP/non-VFP objects (at least for the beagle build).
The scheduler expects that all threads except the initial idle threads
have a priority in the range [THREAD_MIN_SET_PRIORITY,
THREAD_MAX_SET_PRIORITY]. If the requested priority is out of range,
the value is clamped. Failing with B_BAD_VALUE is probably overkill,
since there isn't any real change in the guarantees the scheduler
provides about the behavior of such threads. Also, the BeBook suggests
that spawn_thread() can specify priority 0.
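In other words, something along these lines happens on the
set-priority path (a sketch, not the exact code):

    if (priority < THREAD_MIN_SET_PRIORITY)
        priority = THREAD_MIN_SET_PRIORITY;
    else if (priority > THREAD_MAX_SET_PRIORITY)
        priority = THREAD_MAX_SET_PRIORITY;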
For potential boot volumes with older package states, the respective
item in the boot volume menu now has a submenu for selecting a state.
The boot loader functionality for this feature is complete -- i.e. the
respective kernel is loaded and the name of the old state is added to
the kernel args -- but kernel packagefs and package daemon support is
still missing.
... in filenames. Replace the existing Unicode conversion functions
with UTF conversion functions from js, which he relicensed under MIT
for us.
Put the UTF conversion functions in a private but shared code location
so that they can be accessed throughout the kernel.
Right now we only provide functions to convert between UTF-8 and UTF-16.
At some point we should also add functions to convert between UTF-8
and UTF-32, and between UTF-16 and UTF-32, but those aren't needed by
exfat.
Remove the old Unicode conversion functions from exfat, as they
assumed UCS-2 characters and don't work with the UTF-16 used by exfat.
Rename most variables using the term "length" to use "code unit" where
code units are intended. The term "length", when used, means length in
bytes, while a code unit represents either a full 2-byte UTF-16
character or half of a 4-byte surrogate pair.
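For reference, the two halves of a surrogate pair are easy to
recognize by their code unit values:

    #include <cstdint>

    // A UTF-16 code unit is always 2 bytes; characters outside the
    // Basic Multilingual Plane take two code units (a surrogate pair).
    static inline bool
    IsHighSurrogate(uint16_t unit) { return (unit & 0xFC00) == 0xD800; }

    static inline bool
    IsLowSurrogate(uint16_t unit) { return (unit & 0xFC00) == 0xDC00; }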
This patch removes the old thread migration logic, which used a few
special cases and a (broken) general check that attempted to balance
threads.
The new logic is pretty straightforward and seems to perform well
without any additional special cases. The current core is compared
with the least loaded one, and the thread is migrated if that would
bring the estimated loads of both cores (i.e. the current one and the
least loaded one) closer to the average load (i.e. the average of
those two cores).
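A sketch of that check (illustrative names; the real logic lives in
the scheduler's mode-specific code):

    #include <cstdint>
    #include <cstdlib>

    // Would migrating a thread with the given load bring both cores'
    // estimated loads closer to their mutual average?
    static bool
    ShouldMigrate(int32_t currentLoad, int32_t leastLoad,
        int32_t threadLoad)
    {
        int32_t average = (currentLoad + leastLoad) / 2;

        int32_t newCurrent = currentLoad - threadLoad;
        int32_t newLeast = leastLoad + threadLoad;

        int32_t before = std::abs(currentLoad - average)
            + std::abs(leastLoad - average);
        int32_t after = std::abs(newCurrent - average)
            + std::abs(newLeast - average);
        return after < before;
    }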
Currently, ThreadData::ShouldRebalance() (and the mode-specific
functions it calls) only decides whether to migrate a thread to
another core. However, in most cases it actually needs to find the
best candidate for the new core, so it could as well return that
information.
After load_image() the child thread is suspended and the parent is
expected to resume it later. However, it is possible that the parent
attempts to resume its child after it has been notified that the image
has been loaded, but before the child manages to suspend itself. In
such a case the child would suspend itself after that wake-up attempt
and, consequently, would never be resumed.
To mitigate that problem the flag Thread::going_to_suspend has been
added, which helps synchronize thread suspension and continuation in a
way similar to how "traditional" thread blocking is performed. This
means that the child should behave in the following manner: set its
going_to_suspend flag, notify the parent (i.e. any thread that may
want to resume it), acquire its scheduler_lock, and suspend itself if
the going_to_suspend flag is still set. The parent should follow this
pattern: clear the going_to_suspend flag of the thread that is about
to be resumed, acquire that thread's scheduler_lock, and enqueue the
thread in a run queue if it is suspended (sketched below).
Thanks Oliver for reporting the bug and identifying what causes it.
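Illustrative fragments of the two sides of the protocol (not the
literal Haiku code; error handling and interrupt state omitted):

    // Child, after load_image():
    thread->going_to_suspend = true;
    notify_parent(thread);  // hypothetical image-loaded notification
    acquire_spinlock(&thread->scheduler_lock);
    if (thread->going_to_suspend)
        thread->state = B_THREAD_SUSPENDED;  // then reschedule
    release_spinlock(&thread->scheduler_lock);

    // Parent, in resume_thread():
    thread->going_to_suspend = false;
    acquire_spinlock(&thread->scheduler_lock);
    if (thread->state == B_THREAD_SUSPENDED)
        scheduler_enqueue_in_run_queue(thread);
    release_spinlock(&thread->scheduler_lock);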
Most of the actual UserEvent work is done in a DPC, so that we don't
have to care about the limitations of the context in which
UserEvent::Fire() is invoked. This requires appropriate management of
the lifetime of UserEvent instances, to make sure that the DoDPC()
method is always called on a valid object.
- POSIX says the behavior of pthread_equal() is undefined for
  uninitialized arguments.
- However, gcc's C++11 threads support expects C++-compatible
  behavior, that is, two uninitialized pthread_t should compare equal.
Avoids some runtime asserts in the latest WebKit version.
In low latency mode the scheduler would not attempt to balance load on
cores that are not heavily loaded unless the difference in load
exceeded kLoadDifference * 2 (i.e. 40 percentage points), which does
not seem to be good enough.
To make sure that load statistics are accurate on idle cores, each
time an idle thread is scheduled a timer is set to update the load
when the current load measurement interval elapses. However, core load
is defined as the average load during the last measurement interval,
so an idle core may still be considered busy if it was not idle during
the entire measurement interval. Since the load update timer is a
one-shot timer, that information would not be updated until the core
became active again.
To mitigate that issue, the load update timer is now set to fire after
two load measurement intervals have elapsed.
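A sketch of the arming, assuming the kernel's one-shot timer API (the
hook and field names are illustrative):

    // Fire after two intervals, so a core that stayed idle through a
    // whole measurement interval still gets its load recomputed.
    add_timer(&core->fLoadTimer, &update_core_load,
        2 * kLoadMeasureInterval, B_ONE_SHOT_RELATIVE_TIMER);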
Should fix #10628. In case of a race with a writer, getting the
minimum or maximum from the double-ended heap may incorrectly return
NULL, which is not expected by most of the thread migration logic.
Apart from that, because of the race condition the heap state may be
observed as inconsistent, thus failing assertions.
* Align all allocations of more than 8 bytes to an 8-byte boundary.
* Avoids hitting ASSERTs in WebKit when built in debug mode (it
  assumes at least 8-byte alignment).
The main purpose of using atomic_get() was the necessity of a compiler
barrier to prevent the compiler from optimizing busy loops. However,
each such loop contains in its body at least one statement that acts
as a compiler barrier (namely, cpu_wait() or cpu_pause()), making
atomic_get() redundant. (atomic_get() is actually stronger - it also
issues a load barrier - but in these particular cases we do not need
that.)
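To illustrate, the "memory" clobber in cpu_pause() is what keeps such
a loop honest (sketch):

    static inline void
    cpu_pause()
    {
        // The "memory" clobber acts as a compiler barrier: the loop
        // condition must be re-read from memory on every iteration.
        asm volatile("pause" : : : "memory");
    }

    while (lock->value != 0)
        cpu_pause();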
If the initial attempt to acquire a read spinlock fails, we use a more
relaxed loop (which doesn't require the CPU to lock the bus). However,
the check in that loop incorrectly didn't allow the lock to be
acquired when there was at least one other reader.
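Conceptually, the relaxed wait should only spin while a writer holds
the lock; something like this (field and constant names are made up):

    // Other readers holding the lock must not block a new reader.
    while ((atomic_get(&lock->count) & kWriteLocked) != 0)
        cpu_pause();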
* Add isb just because.
* pdziepak pointed out that ARMv5 and before
had different barrier support.
* pdziepak also mentioned that dsb was too strong
for __sync_synchronize
* On ARMv6 or older, we do a simulated dsb.
* Move __sync_synchronize into thread.c in libroot
and use the new arch_atomic.h dsb/dmb defines.
* Gets arm @bootstrap-raw to end of bootstrap.
GCC doesn't provide an ARM implementation of it. It's easy to write
one for ARMv6 and above, while older archs will need this implemented
as a syscall, just like the other atomics.
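A possible implementation for ARMv6 and ARMv7 (a sketch; the syscall
fallback for older architectures is omitted):

    void
    __sync_synchronize(void)
    {
    #if __ARM_ARCH >= 7
        // dmb is sufficient here; dsb would be stronger than needed.
        asm volatile("dmb" : : : "memory");
    #else
        // ARMv6: the equivalent data memory barrier via CP15.
        asm volatile("mcr p15, 0, %0, c7, c10, 5" : : "r" (0) : "memory");
    #endif
    }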
We have the same problem as on x86_64: position-dependent code isn't
allowed in shared libraries. Since Kernel.so is not used at runtime,
we can use the same hack as on x86_64 and use elfedit to make the
linker think our kernel is a shared library.
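The edit itself is a one-liner with binutils elfedit, e.g.
(illustrative file name):

    elfedit --output-type dyn kernel_arm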
* As per the ML discussions. Bumps MIPS to tier 3.
* We've reached a unanimous decision that MIPS doesn't
  target any real / valid hardware Haiku wants to pursue
  at the moment. In the event that anyone wants to pursue
  MIPS, feel free to fork Haiku into your own repository
  (and we'll even link to it on the website ports page).
* If someone develops a viable plan for MIPS (and gets the
  port working), it can be re-added at a later date.
This is required to use some SSE instructions, which are generated by
gcc 4.8, most notably when compiling WebKit code (but it may happen
elsewhere as well).
Fixes about 900 crashes and 10000 test failures in WebKit, so this must
be working. Fixes #10509 for x86.
* Use atomic_get_and_set for return value
* Atomics are no longer volatile
* Add missing arch_cpu_pause stub
* Move arch_cpu_idle to arch_cpu header to match
other architectures
* This will be used to implement compressed http streams
* Remove the custom BDataOutput class, and use BDataIO instead, for
easier integration with existing code.
The initial core assignment has to be done without any knowledge of
the thread's behavior. Moreover, short-lived tasks may spend most of
their time executing on that initial core. This patch attempts to
improve the quality of that initial decision.
The main purpose of this patch is to eliminate the delay between thread
migration and result of that migration being visible in load statistics.
Such delay, in certain circumstances, may cause some cores to become
overloaded because the scheduler migrates too many threads to them before
the effect of migration becomes apparent.
In order to keep the scheduler tickless, core load is computed and
updated only during various scheduler events (i.e. thread enqueue,
reschedule, etc.). The problem this creates is that if a core becomes
idle, its load may remain outdated for an extended period of time,
resulting in suboptimal thread migration decisions.
The solution to this problem is to add a timer each time an idle
thread is scheduled which, after kLoadMeasureInterval, fires and
forces a load update.
Priority penalties were made more strict in order to prevent a
situation where two or more high priority threads use up all available
CPU time in such a manner that they do not receive a penalty, but
starve low priority threads.
However, a significant change to thread priorities has been made since
then: the priority of all non-real-time threads now varies in a range
from 1 to the static priority minus the penalty. This means that the
scheduler is able to prevent thread starvation without any complex
penalty policies.
Originally, core load was a sum of the estimated loads of all
currently running or ready threads on a given core. Such a value
changes very rapidly, preventing the thread migration logic from
making any reasonable decisions.
This patch changes the way core load is computed to make it more
stable, thus improving the quality of the decisions made by the thread
migration logic. Core load is now the sum of the estimated loads of
all threads that have been ready during the last load measurement
interval and haven't been migrated or killed.
The main reason for this patch is to fix a gcc 4.8.2 warning about
hierarchyLevels possibly being used uninitialized. Such a thing
actually cannot happen, since all x2APIC CPUs are aware of at least
3 topology levels. However, once more topology levels are introduced,
we will have to deal with CPUs that do not report information about
all of them.
User timers may cause another thread to become ready, in which case we
would like this to happen before scheduler_reschedule() chooses the
next thread to be executed.
UserEvent can be fired from scheduler_reschedule(), i.e. while holding
the current thread's scheduler_lock. If the current thread goes to
sleep and during reschedule one of its timers sends a signal to it,
then scheduler_enqueue_in_run_queue() attempts to acquire its
scheduler_lock again, resulting in a deadlock.
There was also a minor issue with both scheduler_reschedule() and
scheduler_enqueue_in_run_queue() acquiring the current CPU's scheduler
mode lock.
* Fix incorrect CPU vendor name mapping
* Add additional CPU architectures
* Add additional CPU vendors
* Rework PowerPC arch_system_info to pass the PVR back for
  the cpu model
On multi-socket systems, as well as under virtual machines, logical
CPUs may use separate TSCs. We could attempt to synchronize them,
which would probably solve the problems on multi-socket systems.
Unfortunately, when running under a hypervisor there is still a chance
that the TSCs will get out of sync again (e.g. cpufreq enabled on the
host when there is no invariant TSC). As long as we use RDTSC as our
main time source, the scheduler must accept the fact that time may go
backwards (which isn't really a serious problem).
Add boot loader debug menu option "Save syslog from previous session
during boot". If enabled (defaults to true), the previous session's
debug syslog data is copied to a separate buffer and passed to the
kernel, which writes it back to the file /var/log/previous_syslog.
As long as Haiku still boots, this should now be the most convenient way
to retrieve the output from a kernel crash.
This reverts commit 667617ad043a4587d8d366d5192d9ad291cfa37a.
The scheduler profiler uses CPU-local data to store function
information, hence arch_thread_context_switch() usually is not a
problem. However, when we switch to a new thread we end up in
scheduler_new_thread_entry() instead of scheduler_reschedule(), which
may corrupt the data collected by the profiler.
The symbol is needed for global objects. Usually, GCC also requires
this, but for some reason, the linking error only occurs when using
Clang.
Signed-off-by: Jérôme Duval <jerome.duval@gmail.com>
* Displays the standard CPUID, and shows what the
  internal CPUID used by OS.h *should* be.
* Should help in identifying new CPUs, as all end users
  have to do is run sysinfo to get the CPU info + value
  for OS.h.
Nested functions are a (again, broken) GNU extension which is not
supported by Clang. The nested function has been replaced by a bunch
of gotos and a variable that works as a return address.
The previous implementation, based on the actual load of each core and
the share each thread has in that load, turned out to be very
problematic when balancing load on very heavily loaded systems (i.e.
more threads consuming all available CPU time than there are logical
CPUs).
The new approach is to estimate how much load a thread would produce
if it had all CPU time to itself. Summing such load estimations of
each thread assigned to a given core, we get a rank that contains much
more information than the simple actual core load.
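A hedged sketch of such an estimation (names are illustrative): the
load of a thread is the share of wall-clock time it spends running,
measured against time it does not want a CPU at all, so that waiting
in the run queue does not dilute the estimate:

    int32 load = kMaxLoad * activeTime / (activeTime + sleepTime);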
* Previously PE binaries would trigger the "incorrectly
  executable" dialog. Now we get a special message for
  B_LEGACY_EXECUTABLE and B_UNKNOWN_EXECUTABLE.
* Legacy at the moment means an R3 x86 PE binary. This could
  be extended to gcc2 binaries someday far, far down the
  road though.
* The check for legacy is based on a PE flag I see
  set on every R3 binary (that isn't set on DOS ones).
* Unknown is something we know *is* an executable, but
  can't do anything with (such as an MS-DOS or Windows
  application).
* No performance drop, as we do the PE scan last.
* Tested on x86 and x86_gcc2.
This field forces the kernel to track each CPU's load all the time. It
is not a problem with the current scheduler on multicore systems, but
on single-core machines, or with any other future scheduler, this
field may become just an unnecessary burden. It isn't difficult for an
application to compute the CPU load by itself when it needs it.
A bit hackish implementation of a profiler for the scheduler.
SCHEDULER_ENTER_FUNCTION at the beginning of each function isn't nice,
and the usage of __PRETTY_FUNCTION__ isn't any better (both gcc and
clang support it, though), but it was quick to implement and doesn't
lose information on inlined functions. It's just a tool, not an
integral part of the kernel anyway.
Apart from the refactoring, this commit takes the opportunity to
remove unnecessary read locks when choosing a package and a core from
the idle lists. The data structures are accessed in a thread-safe way,
and it does not really matter whether the obtained data becomes
outdated just when we release the lock or during our search for the
appropriate package/core.
Some SMT implementations (e.g. recent AMD microarchitectures) have a
separate L1d cache for each SMT thread (which AMD decides to call a
"core"). This means that we shouldn't move threads between logical
processors too often, even if they belong to the same core. We aren't
very strict about this, as it would complicate load balancing, but we
try to reduce unnecessary migrations.
atomic_{get, set}64() are problematic on architectures without 64-bit
compare and swap.
Also, using a sequential lock instead of atomic access ensures that
reads from cpu_ent::active_time won't require any writes to shared
memory.
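A sketch of the reader side of such a sequential lock (assuming
seqlock helpers along these lines; the exact names may differ):

    uint32 count;
    bigtime_t activeTime;
    do {
        count = acquire_read_seqlock(&cpu->active_time_lock);
        activeTime = cpu->active_time;
    } while (!release_read_seqlock(&cpu->active_time_lock, count));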
As weak aliases are not supported on OS X, this caused problems when
building Haiku on OS X, as this file is also used for the host tools.
Signed-off-by: Axel Dörfler <axeld@pinc-software.de>
The client code is not supposed to change the topology info.
It would also be nice if cpu_topology_node::children were an array of
pointers to const, but that would require several const_casts in the
topology tree generation code, so it's probably not worth it.
Apparently, reading from dr3 is slower than reading from memory with a
cache hit.
Also, depending on the hypervisor configuration, accessing dr3 may
cause a VM exit (and, at least on kvm, it does), which makes it much
slower than a memory access even when there is a cache miss.
On x86 we mainly want to disable PAE, which is now also used with less
memory as long as NX support is available. Ideally we'd check this
condition as well and only add the menu item, if the kernel would
enable PAE.
Add get_safemode_option_early() and get_safemode_boolean_early() to get
safemode options before the kernel heap has been initialized. They use a
simplified parser.
CreateSubRequest() could still return an error and break out of the
while loop without exiting the outer for loop.
Instead, we reset the error code before entering the for loop.
This reverts the extra for-loop condition from
"do_iterative_fd_io_iterate(): Support sparse files".
When reading a file with more than 8 block_runs, get_vecs() would
return B_BUFFER_OVERFLOW, which would never create any subrequest due
to the test on error == B_OK in the loop, but would instead just fail.
Except for the get_vecs() return code, where it is not wanted, the
test made no sense, as all other assignments are tested directly or
passed around with break.
Works for me but I don't guarantee it's completely correct.
* When exec()'ing we'd otherwise get (harmless but annoying) messages
from vm_page_fault(). With syscall tracing enabled we can get userland
stack traces anyway.
* Simplify by using TRACE_ENTRY_SELECTOR().
The spec explicitly states that pthread_join shall not return EINTR, so
we have to retry the wait when it gets interrupted instead of letting
the error code through.
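The retry is straightforward (a sketch, assuming the underlying wait
reports interruption as B_INTERRUPTED):

    status_t result;
    do {
        result = wait_for_thread(thread, &returnValue);
    } while (result == B_INTERRUPTED);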
* VMTranslationMap:
- Add DebugPrintMappingInfo(): Given a virtual address it is supposed
to print the paging structure information for that address. To be
implemented by derived classes.
- Add DebugGetReverseMappingInfo(): Given a physical address it is
supposed to find all virtual addresses mapped to it. To be
implemented by derived classes.
* X86VMTranslationMapPAE: Implement the new methods
DebugPrintMappingInfo() and DebugGetReverseMappingInfo().
* Add KDL command "mapping". It supports both virtual address lookups
and reverse lookups.
__flatten_process_args() now has the executable path as an additional
(optional) parameter. If specified, the function will read the file's
SYS:ENV attribute (if set) and use its value to modify the environment
it is preparing for the new process. Currently supported attribute
values are strings consisting of "<var>=<value>" substrings separated
by "\0" (backslash zero), with '\' being used as an escape character.
The environment will be altered to contain the specified
"<var>=<value>" elements, replacing a preexisting <var> element (if
any).
A possible use case would be setting a SYS:ENV attribute with value
"DISABLE_ASLR=1" on an executable that needs ASLR disabled.
* VMAddressSpace: Add randomizingEnabled property.
* VMUserAddressSpace: Randomize addresses only when randomizingEnabled
property is set.
* create_team_arg(): Check whether the team's environment contains
"DISABLE_ASLR=1". Set the team's address space property
randomizingEnabled accordingly in load_image_internal() and
exec_team().
In that case the caller ideally wants to obtain an allocation at the
specified address, which was thwarted by using
B_RANDOMIZED_BASE_ADDRESS. Use B_BASE_ADDRESS instead.
This improves the experience with the gcc 4 pre-compiled headers
implementation (which expects to be able to map the PCH file at the same
address where it was located originally when it had been created), but
doesn't fix it completely. As long as ASLR is active, it is always
possible that something else (mapped shared objects, heap, stack) is in
the way.
Unless a free range was found before the first area, a specified base
address was ignored. In the non-randomized case this could result in a
range other than (i.e. starting before) the preferred one being
chosen, although the preferred range was available.
devfs_get_device() returns the device for a given path (if any), also
acquiring a reference to its vnode (thus ensuring the device won't go
away). devfs_put_device() puts the device vnode's reference.
* Create new interface for cpuidle modules (similar to the cpufreq
interface)
* Generic cpuidle module is no longer needed
* Fix and update Intel C-State module
* Determine whether called from userland or kernel.
* Check the buffer address via IS_USER_ADDRESS(), if from userland.
* Simplify things by merging UserRead() with Read() and
UserWrite() with Write().
* Increase FIFO buffer capacity from 32 to 64 KiB and the FIFO atomic
write size ({BUF_SIZE}) from 512 bytes to 4 KiB (both like Linux).
* Fix *pathconf(..., _PC_PIPE_BUF). It was returning 4 KiB although the
implemented atomic write size was 512 bytes only. Now both *pathconf()
and the FIFO implementation refer to the same constant.
scheduler_common.h is now meant for types, variables and functions used
by both core scheduler code and implementations of scheduler modes.
Functions like switch_thread() and update_thread_times() do not belong
there anymore.
* get_file_attribute(): Use O_NOTRAVERSE, so we correctly read the
attribute from symlinks.
* internal_path_for_path(): Shuffle things around a bit: The dependency
is resolved before handling B_FIND_PATH_PACKAGE_PATH, now. This adds
support for getting the package file for a dependency. The dependency
was ignored in this case before.
* Use kSystemPackageLinksDirectory instead of hard-coding "/packages".
* Remove the possibility to temporarily disable small task packing.
* When the small task packing target gets overloaded, continue
  packing threads on another core, but avoid migrating the already
  packed ones.
The scheduler still tends to needlessly migrate threads to other cores
when under heavier load, but it is now much better than before.
It's a browser for the system package content, where entries can be
selected to blacklist them. The selected entries are removed from the
packagefs instance in the boot loader, so that e.g. selected drivers
won't be picked up. The paths are also added to the safe mode driver
settings and will be interpreted when the system packagefs instance is
mounted by the kernel.
* Make Menu and MenuItem polymorphic.
* MenuItem:
- Make SetMarked() virtual, so it can be overridden.
- Add SetSubmenu() and Supermenu().
- Delete the submenu in the destructor.
* Menu:
- Add Entered()/Exited() hooks. They frame the time the user navigates
the menu or any of its submenus. The hooks allow for subclasses
populating their item list dynamically.
- Add SortItems().
* Update boot loader menu copyright text to include 2013, now that it is
over soon. :-)
In each installation location, it is now possible to create a settings
file "packages" that allows blacklisting entries contained in
packages.
The format is:
Package <package name> {
EntryBlacklist {
<entry path>
...
}
}
...
<package name> is the base name (no version) of the respective package
(e.g. "haiku"), <entry path> is an installation location relative path
(e.g. "add-ons/Translators/FooTranslator").
Blacklisted entries will be ignored by packagefs, i.e. they won't appear
in the file system. This addresses the issue that it may be necessary to
remove a problematic file (e.g. driver, add-on, or library), which would
otherwise require editing the containing package file.
The settings file is not "live". Changes take effect only after
reboot (respectively, when remounting the concerned packagefs volume).
* Move PathBuffer helper class out of find_paths.cpp into its own
header.
* find_directory():
- Make use of MemoryDeleter to simplify things.
- Make use of PathBuffer for a simpler and more correct handling.
- Map B_UTILITIES_DIRECTORY to B_APPS_DIRECTORY. /boot/utilities
  doesn't exist anyway.
- Resolve the concerned constants to the architecture specific
subdirectory, when called in a secondary architecture context, just
like find_path*().
* get_architectures() returns the primary and the secondary
architectures in one array. That turned out to be convenient.
* Add C++ versions for get[_secondary]_architectures(), returning a
BStringList.
* Add get_architecture(), get_primary_architecture(),
get_secondary_architectures(), guess_architecture_for_path() to get
the caller's architecture, the primary architecture, all secondary
architectures, or the architecture associated with a specified path
respectively.
* Rename the find_path*() functions to find_path*_etc() and add an
optional architecture parameter. Add simplified find_path*()
functions.
* BPathFinder: Add FindPath[s]() versions with an architecture
parameter.
* pin idle threads to their specific CPUs
* allow scheduler to implement SMP_MSG_RESCHEDULE handler
* scheduler_set_thread_priority() reworked
* at reschedule: enqueue old thread after dequeueing the new one
In case something went wrong, call unlock_memory_etc() with the
rounded base address instead of the original address. If the original
address wasn't page aligned, unlock_memory_etc() would otherwise try
to unlock an additional page.
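That is, the error path should mirror the rounding done when locking
(a sketch; variable names are illustrative):

    addr_t base = (addr_t)address & ~((addr_t)B_PAGE_SIZE - 1);
    unlock_memory_etc(team, (void*)base, lockedBytes, flags);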