1097 lines
45 KiB
C++
1097 lines
45 KiB
C++
/* ----------------------------------------------------------------------------
|
||
Copyright (c) 2018, Microsoft Research, Daan Leijen
|
||
This is free software; you can redistribute it and/or modify it under the
|
||
terms of the MIT license. A copy of the license can be found in the file
|
||
"license.txt" at the root of this distribution.
|
||
-----------------------------------------------------------------------------*/
|
||
|
||
#error "documentation file only!"
|
||
|
||
|
||
/*! \mainpage
|
||
|
||
This is the API documentation of the
|
||
[mimalloc](https://github.com/koka-lang/mimalloc) allocator
|
||
(pronounced "me-malloc") -- a
|
||
general purpose allocator with excellent [performance](bench.html)
|
||
characteristics. Initially
|
||
developed by Daan Leijen for the run-time systems of the
|
||
[Koka](https://github.com/koka-lang/koka) and [Lean](https://github.com/leanprover/lean) languages.
|
||
|
||
Notable aspects of the design include:
|
||
|
||
- __small and consistent__: the library is less than 3500 LOC using simple and
|
||
consistent data structures. This makes it very suitable
|
||
to integrate and adapt in other projects. For runtime systems it
|
||
provides hooks for a monotonic _heartbeat_ and deferred freeing (for
|
||
bounded worst-case times with reference counting).
|
||
- __free list sharding__: "the big idea": instead of one big free list (per size class) we have
|
||
many smaller lists per memory "page" which both reduces fragmentation
|
||
and increases locality --
|
||
things that are allocated close in time get allocated close in memory.
|
||
(A memory "page" in mimalloc contains blocks of one size class and is
|
||
usually 64KB on a 64-bit system).
|
||
- __eager page reset__: when a "page" becomes empty (with increased chance
|
||
due to free list sharding) the memory is marked to the OS as unused ("reset" or "purged")
|
||
reducing (real) memory pressure and fragmentation, especially in long running
|
||
programs.
|
||
- __lazy initialization__: pages in a segment are lazily initialized so
|
||
no memory is touched until it becomes allocated, reducing the resident
|
||
memory and potential page faults.
|
||
- __bounded__: it does not suffer from _blowup_ \[1\], has bounded worst-case allocation
|
||
times (_wcat_), bounded space overhead (~0.2% meta-data, with at most 16.7% waste in allocation sizes),
|
||
and has no internal points of contention using atomic operations almost
|
||
everywhere.
|
||
|
||
Further information:
|
||
|
||
- \ref build
|
||
- \ref using
|
||
- \ref overrides
|
||
- \ref bench
|
||
- \ref malloc
|
||
- \ref extended
|
||
- \ref aligned
|
||
- \ref heap
|
||
- \ref typed
|
||
- \ref analysis
|
||
- \ref options
|
||
|
||
*/
|
||
|
||
|
||
/// \defgroup malloc Basic Allocation
|
||
/// The basic allocation interface.
|
||
/// \{
|
||
|
||
|
||
/// Free previously allocated memory.
|
||
/// The pointer `p` must have been allocated before (or be \a NULL).
|
||
/// @param p pointer to free, or \a NULL.
|
||
void mi_free(void* p);
|
||
|
||
/// Allocate \a size bytes.
|
||
/// @param size number of bytes to allocate.
|
||
/// @returns pointer to the allocated memory or \a NULL if out of memory.
|
||
/// Returns a unique pointer if called with \a size 0.
|
||
void* mi_malloc(size_t size);
|
||
|
||
/// Allocate zero-initialized `size` bytes.
|
||
/// @param size The size in bytes.
|
||
/// @returns Pointer to newly allocated zero initialized memory,
|
||
/// or \a NULL if out of memory.
|
||
void* mi_zalloc(size_t size);
|
||
|
||
/// Allocate zero-initialized \a count elements of \a size bytes.
|
||
/// @param count number of elements.
|
||
/// @param size size of each element.
|
||
/// @returns pointer to the allocated memory
|
||
/// of \a size*\a count bytes, or \a NULL if either out of memory
|
||
/// or when `count*size` overflows.
|
||
///
|
||
/// Returns a unique pointer if called with either \a size or \a count of 0.
|
||
/// @see mi_zalloc()
|
||
void* mi_calloc(size_t count, size_t size);
|
||
|
||
/// Re-allocate memory to \a newsize bytes.
|
||
/// @param p pointer to previously allocated memory (or \a NULL).
|
||
/// @param newsize the new required size in bytes.
|
||
/// @returns pointer to the re-allocated memory
|
||
/// of \a newsize bytes, or \a NULL if out of memory.
|
||
/// If \a NULL is returned, the pointer \a p is not freed.
|
||
/// Otherwise the original pointer is either freed or returned
|
||
/// as the reallocated result (in case it fits in-place with the
|
||
/// new size). If the pointer \a p is \a NULL, it behaves as
|
||
/// \a mi_malloc(\a newsize). If \a newsize is larger than the
|
||
/// original \a size allocated for \a p, the bytes after \a size
|
||
/// are uninitialized.
|
||
void* mi_realloc(void* p, size_t newsize);
|
||
|
||
/// Reallocate memory to \a newsize bytes, with extra memory initialized to zero.
|
||
/// @param p Pointer to a previously allocated block (or \a NULL).
|
||
/// @param newsize The new required size in bytes.
|
||
/// @returns A pointer to a re-allocated block of \a newsize bytes, or \a NULL
|
||
/// if out of memory.
|
||
///
|
||
/// If the \a newsize is larger than the original allocated size of \a p,
|
||
/// the extra bytes are initialized to zero.
|
||
void* mi_rezalloc(void* p, size_t newsize);
|
||
|
||
/// Re-allocate memory to \a count elements of \a size bytes, with extra memory initialized to zero.
|
||
/// @param p Pointer to a previously allocated block (or \a NULL).
|
||
/// @param count The number of elements.
|
||
/// @param size The size of each element.
|
||
/// @returns A pointer to a re-allocated block of \a count * \a size bytes, or \a NULL
|
||
/// if out of memory or if \a count * \a size overflows.
|
||
///
|
||
/// If there is no overflow, it behaves exactly like `mi_rezalloc(p,count*size)`.
|
||
/// @see mi_reallocn()
|
||
/// @see mi_rezalloc()
|
||
/// @see [recallocarray()](http://man.openbsd.org/reallocarray) (on BSD).
|
||
void* mi_recalloc(void* p, size_t count, size_t size);
|
||
|
||
|
||
/// Try to re-allocate memory to \a newsize bytes _in place_.
|
||
/// @param p pointer to previously allocated memory (or \a NULL).
|
||
/// @param newsize the new required size in bytes.
|
||
/// @returns pointer to the re-allocated memory
|
||
/// of \a newsize bytes (always equal to \a p),
|
||
/// or \a NULL if either out of memory or if
|
||
/// the memory could not be expanded in place.
|
||
/// If \a NULL is returned, the pointer \a p is not freed.
|
||
/// Otherwise the original pointer is returned
|
||
/// as the reallocated result since it fits in-place with the
|
||
/// new size. If \a newsize is larger than the
|
||
/// original \a size allocated for \a p, the bytes after \a size
|
||
/// are uninitialized.
|
||
void* mi_expand(void* p, size_t newsize);
|
||
|
||
/// Allocate \a count elements of \a size bytes.
|
||
/// @param count The number of elements.
|
||
/// @param size The size of each element.
|
||
/// @returns A pointer to a block of \a count * \a size bytes, or \a NULL
|
||
/// if out of memory or if \a count * \a size overflows.
|
||
///
|
||
/// If there is no overflow, it behaves exactly like `mi_malloc(p,count*size)`.
|
||
/// @see mi_calloc()
|
||
/// @see mi_zallocn()
|
||
void* mi_mallocn(size_t count, size_t size);
|
||
|
||
/// Re-allocate memory to \a count elements of \a size bytes.
|
||
/// @param p Pointer to a previously allocated block (or \a NULL).
|
||
/// @param count The number of elements.
|
||
/// @param size The size of each element.
|
||
/// @returns A pointer to a re-allocated block of \a count * \a size bytes, or \a NULL
|
||
/// if out of memory or if \a count * \a size overflows.
|
||
///
|
||
/// If there is no overflow, it behaves exactly like `mi_realloc(p,count*size)`.
|
||
/// @see mi_recalloc()
|
||
/// @see [reallocarray()](<http://man.openbsd.org/reallocarray>) (on BSD)
|
||
void* mi_reallocn(void* p, size_t count, size_t size);
|
||
|
||
/// Re-allocate memory to \a newsize bytes,
|
||
/// @param p pointer to previously allocated memory (or \a NULL).
|
||
/// @param newsize the new required size in bytes.
|
||
/// @returns pointer to the re-allocated memory
|
||
/// of \a newsize bytes, or \a NULL if out of memory.
|
||
///
|
||
/// In contrast to mi_realloc(), if \a NULL is returned, the original pointer
|
||
/// \a p is freed (if it was not \a NULL itself).
|
||
/// Otherwise the original pointer is either freed or returned
|
||
/// as the reallocated result (in case it fits in-place with the
|
||
/// new size). If the pointer \a p is \a NULL, it behaves as
|
||
/// \a mi_malloc(\a newsize). If \a newsize is larger than the
|
||
/// original \a size allocated for \a p, the bytes after \a size
|
||
/// are uninitialized.
|
||
///
|
||
/// @see [reallocf](https://www.freebsd.org/cgi/man.cgi?query=reallocf) (on BSD)
|
||
void* mi_reallocf(void* p, size_t newsize);
|
||
|
||
|
||
/// Allocate and duplicate a string.
|
||
/// @param s string to duplicate (or \a NULL).
|
||
/// @returns a pointer to newly allocated memory initialized
|
||
/// to string \a s, or \a NULL if either out of memory or if
|
||
/// \a s is \a NULL.
|
||
///
|
||
/// Replacement for the standard [strdup()](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strdup.html)
|
||
/// such that mi_free() can be used on the returned result.
|
||
char* mi_strdup(const char* s);
|
||
|
||
/// Allocate and duplicate a string up to \a n bytes.
|
||
/// @param s string to duplicate (or \a NULL).
|
||
/// @param n maximum number of bytes to copy (excluding the terminating zero).
|
||
/// @returns a pointer to newly allocated memory initialized
|
||
/// to string \a s up to the first \a n bytes (and always zero terminated),
|
||
/// or \a NULL if either out of memory or if \a s is \a NULL.
|
||
///
|
||
/// Replacement for the standard [strndup()](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strndup.html)
|
||
/// such that mi_free() can be used on the returned result.
|
||
char* mi_strndup(const char* s, size_t n);
|
||
|
||
/// Resolve a file path name.
|
||
/// @param fname File name.
|
||
/// @param resolved_name Should be \a NULL (but can also point to a buffer
|
||
/// of at least \a PATH_MAX bytes).
|
||
/// @returns If successful a pointer to the resolved absolute file name, or
|
||
/// \a NULL on failure (with \a errno set to the error code).
|
||
///
|
||
/// If \a resolved_name was \a NULL, the returned result should be freed with
|
||
/// mi_free().
|
||
///
|
||
/// Replacement for the standard [realpath()](http://pubs.opengroup.org/onlinepubs/9699919799/functions/realpath.html)
|
||
/// such that mi_free() can be used on the returned result (if \a resolved_name was \a NULL).
|
||
char* mi_realpath(const char* fname, char* resolved_name);
|
||
|
||
/// \}
|
||
|
||
// ------------------------------------------------------
|
||
// Extended functionality
|
||
// ------------------------------------------------------
|
||
|
||
/// \defgroup extended Extended Functions
|
||
/// Extended functionality.
|
||
/// \{
|
||
|
||
/// Maximum size allowed for small allocations in
|
||
/// #mi_malloc_small and #mi_zalloc_small (usually `128*sizeof(void*)` (= 1KB on 64-bit systems))
|
||
#define MI_SMALL_SIZE_MAX (128*sizeof(void*))
|
||
|
||
/// Allocate a small object.
|
||
/// @param size The size in bytes, can be at most #MI_SMALL_SIZE_MAX.
|
||
/// @returns a pointer to newly allocated memory of at least \a size
|
||
/// bytes, or \a NULL if out of memory.
|
||
/// This function is meant for use in run-time systems for best
|
||
/// performance and does not check if \a size was indeed small -- use
|
||
/// with care!
|
||
void* mi_malloc_small(size_t size);
|
||
|
||
/// Allocate a zero initialized small object.
|
||
/// @param size The size in bytes, can be at most #MI_SMALL_SIZE_MAX.
|
||
/// @returns a pointer to newly allocated zero-initialized memory of at
|
||
/// least \a size bytes, or \a NULL if out of memory.
|
||
/// This function is meant for use in run-time systems for best
|
||
/// performance and does not check if \a size was indeed small -- use
|
||
/// with care!
|
||
void* mi_zalloc_small(size_t size);
|
||
|
||
/// Return the available bytes in a memory block.
|
||
/// @param p Pointer to previously allocated memory (or \a NULL)
|
||
/// @returns Returns the available bytes in the memory block, or
|
||
/// 0 if \a p was \a NULL.
|
||
///
|
||
/// The returned size can be
|
||
/// used to call \a mi_expand successfully.
|
||
/// The returned size is always at least equal to the
|
||
/// allocated size of \a p, and, in the current design,
|
||
/// should be less than 16.7% more.
|
||
///
|
||
/// @see [_msize](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/msize?view=vs-2017) (Windows)
|
||
/// @see [malloc_usable_size](http://man7.org/linux/man-pages/man3/malloc_usable_size.3.html) (Linux)
|
||
/// @see mi_good_size()
|
||
size_t mi_usable_size(void* p);
|
||
|
||
/// Return the used allocation size.
|
||
/// @param size The minimal required size in bytes.
|
||
/// @returns the size `n` that will be allocated, where `n >= size`.
|
||
///
|
||
/// Generally, `mi_usable_size(mi_malloc(size)) == mi_good_size(size)`.
|
||
/// This can be used to reduce internal wasted space when
|
||
/// allocating buffers for example.
|
||
///
|
||
/// @see mi_usable_size()
|
||
size_t mi_good_size(size_t size);
|
||
|
||
/// Eagerly free memory.
|
||
/// @param force If \a true, aggressively return memory to the OS (can be expensive!)
|
||
///
|
||
/// Regular code should not have to call this function. It can be beneficial
|
||
/// in very narrow circumstances; in particular, when a long running thread
|
||
/// allocates a lot of blocks that are freed by other threads it may improve
|
||
/// resource usage by calling this every once in a while.
|
||
void mi_collect(bool force);
|
||
|
||
/// Print statistics.
|
||
/// @param out Output file. Use \a NULL for \a stderr.
|
||
///
|
||
/// Most detailed when using a debug build.
|
||
void mi_stats_print(FILE* out);
|
||
|
||
/// Reset statistics.
|
||
void mi_stats_reset();
|
||
|
||
/// Initialize mimalloc on a thread.
|
||
/// Should not be used as on most systems (pthreads, windows) this is done
|
||
/// automatically.
|
||
void mi_thread_init();
|
||
|
||
/// Uninitialize mimalloc on a thread.
|
||
/// Should not be used as on most systems (pthreads, windows) this is done
|
||
/// automatically. Ensures that any memory that is not freed yet (but will
|
||
/// be freed by other threads in the future) is properly handled.
|
||
void mi_thread_done();
|
||
|
||
/// Print out heap statistics for this thread.
|
||
/// @param out Output file. Use \a NULL for \a stderr.
|
||
///
|
||
/// Most detailed when using a debug build.
|
||
void mi_thread_stats_print(FILE* out);
|
||
|
||
/// Type of deferred free functions.
|
||
/// @param force If \a true all outstanding items should be freed.
|
||
/// @param heartbeat A monotonically increasing count.
|
||
///
|
||
/// @see mi_register_deferred_free
|
||
typedef void (mi_deferred_free_fun)(bool force, unsigned long long heartbeat);
|
||
|
||
/// Register a deferred free function.
|
||
/// @param deferred_free Address of a deferred free-ing function or \a NULL to unregister.
|
||
///
|
||
/// Some runtime systems use deferred free-ing, for example when using
|
||
/// reference counting to limit the worst case free time.
|
||
/// Such systems can register (re-entrant) deferred free function
|
||
/// to free more memory on demand. When the \a force parameter is
|
||
/// \a true all possible memory should be freed.
|
||
/// The per-thread \a heartbeat parameter is monotonically increasing
|
||
/// and guaranteed to be deterministic if the program allocates
|
||
/// deterministically. The \a deferred_free function is guaranteed
|
||
/// to be called deterministically after some number of allocations
|
||
/// (regardless of freeing or available free memory).
|
||
/// At most one \a deferred_free function can be active.
|
||
void mi_register_deferred_free(mi_deferred_free_fun* deferred_free);
|
||
|
||
/// \}
|
||
|
||
// ------------------------------------------------------
|
||
// Aligned allocation
|
||
// ------------------------------------------------------
|
||
|
||
/// \defgroup aligned Aligned Allocation
|
||
///
|
||
/// Allocating aligned memory blocks.
|
||
///
|
||
/// \{
|
||
|
||
/// Allocate \a size bytes aligned by \a alignment.
|
||
/// @param size number of bytes to allocate.
|
||
/// @param alignment the minimal alignment of the allocated memory.
|
||
/// @returns pointer to the allocated memory or \a NULL if out of memory.
|
||
/// The returned pointer is aligned by \a alignment, i.e.
|
||
/// `(uintptr_t)p % alignment == 0`.
|
||
///
|
||
/// Returns a unique pointer if called with \a size 0.
|
||
/// @see [_aligned_malloc](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc?view=vs-2017) (on Windows)
|
||
/// @see [aligned_alloc](http://man.openbsd.org/reallocarray) (on BSD, with switched arguments!)
|
||
/// @see [posix_memalign](https://linux.die.net/man/3/posix_memalign) (on Posix, with switched arguments!)
|
||
/// @see [memalign](https://linux.die.net/man/3/posix_memalign) (on Linux, with switched arguments!)
|
||
void* mi_malloc_aligned(size_t size, size_t alignment);
|
||
void* mi_zalloc_aligned(size_t size, size_t alignment);
|
||
void* mi_calloc_aligned(size_t count, size_t size, size_t alignment);
|
||
void* mi_realloc_aligned(void* p, size_t newsize, size_t alignment);
|
||
void* mi_rezalloc_aligned(void* p, size_t newsize, size_t alignment);
|
||
void* mi_recalloc_aligned(void* p, size_t count, size_t size, size_t alignment);
|
||
|
||
/// Allocate \a size bytes aligned by \a alignment at a specified \a offset.
|
||
/// @param size number of bytes to allocate.
|
||
/// @param alignment the minimal alignment of the allocated memory at \a offset.
|
||
/// @param offset the offset that should be aligned.
|
||
/// @returns pointer to the allocated memory or \a NULL if out of memory.
|
||
/// The returned pointer is aligned by \a alignment at \a offset, i.e.
|
||
/// `((uintptr_t)p + offset) % alignment == 0`.
|
||
///
|
||
/// Returns a unique pointer if called with \a size 0.
|
||
/// @see [_aligned_offset_malloc](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-offset-malloc?view=vs-2017) (on Windows)
|
||
void* mi_malloc_aligned_at(size_t size, size_t alignment, size_t offset);
|
||
void* mi_zalloc_aligned_at(size_t size, size_t alignment, size_t offset);
|
||
void* mi_calloc_aligned_at(size_t count, size_t size, size_t alignment, size_t offset);
|
||
void* mi_realloc_aligned_at(void* p, size_t newsize, size_t alignment, size_t offset);
|
||
void* mi_rezalloc_aligned_at(void* p, size_t newsize, size_t alignment, size_t offset);
|
||
void* mi_recalloc_aligned_at(void* p, size_t count, size_t size, size_t alignment, size_t offset);
|
||
|
||
/// \}
|
||
|
||
/// \defgroup heap Heap Allocation
|
||
///
|
||
/// First-class heaps that can be destroyed in one go.
|
||
///
|
||
/// \{
|
||
|
||
/// Type of first-class heaps.
|
||
/// A heap can only be used for allocation in
|
||
/// the thread that created this heap! Any allocated
|
||
/// blocks can be freed or reallocated by any other thread though.
|
||
struct mi_heap_s;
|
||
|
||
/// Type of first-class heaps.
|
||
/// A heap can only be used for (re)allocation in
|
||
/// the thread that created this heap! Any allocated
|
||
/// blocks can be freed by any other thread though.
|
||
typedef struct mi_heap_s mi_heap_t;
|
||
|
||
/// Create a new heap that can be used for allocation.
|
||
mi_heap_t* mi_heap_new();
|
||
|
||
/// Delete a previously allocated heap.
|
||
/// This will release resources and migrate any
|
||
/// still allocated blocks in this heap (efficienty)
|
||
/// to the default heap.
|
||
///
|
||
/// If \a heap is the default heap, the default
|
||
/// heap is set to the backing heap.
|
||
void mi_heap_delete(mi_heap_t* heap);
|
||
|
||
/// Destroy a heap, freeing all its still allocated blocks.
|
||
/// Use with care as this will free all blocks still
|
||
/// allocated in the heap. However, this can be a very
|
||
/// efficient way to free all heap memory in one go.
|
||
///
|
||
/// If \a heap is the default heap, the default
|
||
/// heap is set to the backing heap.
|
||
void mi_heap_destroy(mi_heap_t* heap);
|
||
|
||
/// Set the default heap to use for mi_malloc() et al.
|
||
/// @param heap The new default heap.
|
||
/// @returns The previous default heap.
|
||
mi_heap_t* mi_heap_set_default(mi_heap_t* heap);
|
||
|
||
/// Get the default heap that is used for mi_malloc() et al.
|
||
/// @returns The current default heap.
|
||
mi_heap_t* mi_heap_get_default();
|
||
|
||
/// Get the backing heap.
|
||
/// The _backing_ heap is the initial default heap for
|
||
/// a thread and always available for allocations.
|
||
/// It cannot be destroyed or deleted
|
||
/// except by exiting the thread.
|
||
mi_heap_t* mi_heap_get_backing();
|
||
|
||
/// Allocate in a specific heap.
|
||
/// @see mi_malloc()
|
||
void* mi_heap_malloc(mi_heap_t* heap, size_t size);
|
||
|
||
/// Allocate zero-initialized in a specific heap.
|
||
/// @see mi_zalloc()
|
||
void* mi_heap_zalloc(mi_heap_t* heap, size_t size);
|
||
|
||
/// Allocate \a count zero-initialized elements in a specific heap.
|
||
/// @see mi_calloc()
|
||
void* mi_heap_calloc(mi_heap_t* heap, size_t count, size_t size);
|
||
|
||
/// Allocate \a count elements in a specific heap.
|
||
/// @see mi_mallocn()
|
||
void* mi_heap_mallocn(mi_heap_t* heap, size_t count, size_t size);
|
||
|
||
/// Duplicate a string in a specific heap.
|
||
/// @see mi_strdup()
|
||
char* mi_heap_strdup(mi_heap_t* heap, const char* s);
|
||
|
||
/// Duplicate a string of at most length \a n in a specific heap.
|
||
/// @see mi_strndup()
|
||
char* mi_heap_strndup(mi_heap_t* heap, const char* s, size_t n);
|
||
|
||
/// Resolve a file path name using a specific \a heap to allocate the result.
|
||
/// @see mi_realpath()
|
||
char* mi_heap_realpath(mi_heap_t* heap, const char* fname, char* resolved_name);
|
||
|
||
|
||
/// \}
|
||
|
||
/// \defgroup typed Typed Macros
|
||
///
|
||
/// Typed allocation macros
|
||
///
|
||
/// \{
|
||
|
||
/// Allocate a block of type \a tp.
|
||
/// @param tp The type of the block to allocate.
|
||
/// @returns A pointer to an object of type \a tp, or
|
||
/// \a NULL if out of memory.
|
||
///
|
||
/// **Example:**
|
||
/// ```
|
||
/// int* p = mi_malloc_tp(int)
|
||
/// ```
|
||
///
|
||
/// @see mi_malloc()
|
||
#define mi_malloc_tp(tp) ((tp*)mi_malloc(sizeof(tp)))
|
||
|
||
/// Allocate a zero-initialized block of type \a tp.
|
||
#define mi_zalloc_tp(tp) ((tp*)mi_zalloc(sizeof(tp)))
|
||
|
||
/// Allocate \a count zero-initialized blocks of type \a tp.
|
||
#define mi_calloc_tp(tp,count) ((tp*)mi_calloc(count,sizeof(tp)))
|
||
|
||
/// Allocate \a count blocks of type \a tp.
|
||
#define mi_mallocn_tp(tp,count) ((tp*)mi_mallocn(count,sizeof(tp)))
|
||
|
||
/// Re-allocate to \a count blocks of type \a tp.
|
||
#define mi_reallocn_tp(p,tp,count) ((tp*)mi_reallocn(p,count,sizeof(tp)))
|
||
|
||
/// Re-allocate to \a count zero-initialized blocks of type \a tp.
|
||
#define mi_recalloc_tp(p,tp,count) ((tp*)mi_recalloc(p,count,sizeof(tp)))
|
||
|
||
/// Allocate a block of type \a tp in a heap \a hp.
|
||
#define mi_heap_malloc_tp(hp,tp) ((tp*)mi_malloc(hp,sizeof(tp)))
|
||
|
||
/// Allocate a zero-initialized block of type \a tp in a heap \a hp.
|
||
#define mi_heap_zalloc_tp(hp,tp) ((tp*)mi_zalloc(hp,sizeof(tp)))
|
||
|
||
/// Allocate \a count zero-initialized blocks of type \a tp in a heap \a hp.
|
||
#define mi_heap_calloc_tp(hp,tp,count) ((tp*)mi_calloc(hp,count,sizeof(tp)))
|
||
|
||
/// Allocate \a count blocks of type \a tp in a heap \a hp.
|
||
#define mi_heap_mallocn_tp(hp,tp,count) ((tp*)mi_mallocn(hp,count,sizeof(tp)))
|
||
|
||
/// \}
|
||
|
||
/// \defgroup analysis Heap Introspection
|
||
///
|
||
/// Inspect the heap at runtime.
|
||
///
|
||
/// \{
|
||
|
||
/// Does a heap contain a pointer to a previously allocated block?
|
||
/// @param heap The heap.
|
||
/// @param p Pointer to a previously allocated block (in any heap)-- cannot be some
|
||
/// random pointer!
|
||
/// @returns \a true if the block pointed to by \a p is in the \a heap.
|
||
/// @see mi_heap_check_owned()
|
||
bool mi_heap_contains_block(mi_heap_t* heap, const void* p);
|
||
|
||
/// Check safely if any pointer is part of a heap.
|
||
/// @param heap The heap.
|
||
/// @param p Any pointer -- not required to be previously allocated by us.
|
||
/// @returns \a true if \a p points to a block in \a heap.
|
||
///
|
||
/// Note: expensive function, linear in the pages in the heap.
|
||
/// @see mi_heap_contains_block()
|
||
/// @see mi_heap_get_default()
|
||
bool mi_heap_check_owned(mi_heap_t* heap, const void* p);
|
||
|
||
/// Check safely if any pointer is part of the default heap of this thread.
|
||
/// @param p Any pointer -- not required to be previously allocated by us.
|
||
/// @returns \a true if \a p points to a block in default heap of this thread.
|
||
///
|
||
/// Note: expensive function, linear in the pages in the heap.
|
||
/// @see mi_heap_contains_block()
|
||
/// @see mi_heap_get_default()
|
||
bool mi_check_owned(const void* p);
|
||
|
||
/// An area of heap space contains blocks of a single size.
|
||
/// The bytes in freed blocks are `committed - used`.
|
||
typedef struct mi_heap_area_s {
|
||
void* blocks; ///< start of the area containing heap blocks
|
||
size_t reserved; ///< bytes reserved for this area
|
||
size_t committed; ///< current committed bytes of this area
|
||
size_t used; ///< bytes in use by allocated blocks
|
||
size_t block_size; ///< size in bytes of one block
|
||
} mi_heap_area_t;
|
||
|
||
/// Visitor function passed to mi_heap_visit_blocks()
|
||
/// @returns \a true if ok, \a false to stop visiting (i.e. break)
|
||
///
|
||
/// This function is always first called for every \a area
|
||
/// with \a block as a \a NULL pointer. If \a visit_all_blocks
|
||
/// was \a true, the function is then called for every allocated
|
||
/// block in that area.
|
||
typedef bool (mi_block_visit_fun)(const mi_heap_t* heap, const mi_heap_area_t* area, void* block, size_t block_size, void* arg);
|
||
|
||
/// Visit all areas and blocks in a heap.
|
||
/// @param heap The heap to visit.
|
||
/// @param visit_all_blocks If \a true visits all allocated blocks, otherwise
|
||
/// \a visitor is only called for every heap area.
|
||
/// @param visitor This function is called for every area in the heap
|
||
/// (with \a block as \a NULL). If \a visit_all_blocks is
|
||
/// \a true, \a visitor is also called for every allocated
|
||
/// block in every area (with `block!=NULL`).
|
||
/// return \a false from this function to stop visiting early.
|
||
/// @param arg Extra argument passed to \a visitor.
|
||
/// @returns \a true if all areas and blocks were visited.
|
||
bool mi_heap_visit_blocks(const mi_heap_t* heap, bool visit_all_blocks, mi_block_visit_fun* visitor, void* arg);
|
||
|
||
/// \}
|
||
|
||
/// \defgroup options Runtime Options
|
||
///
|
||
/// Set runtime behavior.
|
||
///
|
||
/// \{
|
||
|
||
/// Runtime options.
|
||
typedef enum mi_option_e {
|
||
mi_option_page_reset, ///< Reset page memory when it becomes free.
|
||
mi_option_cache_reset, ///< Reset segment memory when a segment is cached.
|
||
mi_option_pool_commit, ///< Commit segments in large pools.
|
||
mi_option_show_stats, ///< Print statistics to `stderr` when the program is done.
|
||
mi_option_show_errors, ///< Print error messages to `stderr`.
|
||
mi_option_verbose, ///< Print verbose messages to `stderr`.
|
||
_mi_option_last
|
||
} mi_option_t;
|
||
|
||
bool mi_option_enabled(mi_option_t option);
|
||
void mi_option_enable(mi_option_t option, bool enable);
|
||
void mi_option_enable_default(mi_option_t option, bool enable);
|
||
|
||
long mi_option_get(mi_option_t option);
|
||
void mi_option_set(mi_option_t option, long value);
|
||
void mi_option_set_default(mi_option_t option, long value);
|
||
|
||
|
||
/// \}
|
||
|
||
/*! \page build Building
|
||
|
||
Checkout the sources from Github:
|
||
```
|
||
git clone https://github.com/koka-lang/mimalloc.git
|
||
```
|
||
|
||
## Windows
|
||
|
||
Open `ide/vs2017/mimalloc.sln` in Visual Studio 2017 and build.
|
||
The `mimalloc` project builds a static library, while the
|
||
`mimalloc-override` project builds a DLL for overriding malloc
|
||
in the entire program.
|
||
|
||
## MacOSX, Linux, BSD, etc.
|
||
|
||
We use [`cmake`](https://cmake.org)<sup>1</sup> as the build system:
|
||
|
||
- `mkdir -p out/release` (create a build directory)
|
||
- `cd out/release` (go to it)
|
||
- `cmake ../..` (generate the make file)
|
||
- `make` (and build)
|
||
|
||
This will build the library as a shared (dynamic)
|
||
library (`.so` or `.dylib`), a static library (`.a`), and
|
||
as a single object file (`.o`).
|
||
|
||
- `sudo make install` (install the library and header files in `/usr/lib` and `/usr/include`)
|
||
|
||
Use the option `-DCMAKE_INSTALL_PREFIX=../local` (for example) to the `ccmake`
|
||
command to install to a local directory to see what gets installed.
|
||
|
||
You can build the debug version which does many internal checks and
|
||
maintains detailed statistics as:
|
||
|
||
- `mkdir -p out/debug`
|
||
- `cd out/debug`
|
||
- `cmake -DCMAKE_BUILD_TYPE=Debug ../..`
|
||
- `make`
|
||
|
||
This will name the shared library as `libmimalloc-debug.so`.
|
||
|
||
Or build with `clang`:
|
||
|
||
- `CC=clang cmake ../..`
|
||
|
||
Use `ccmake`<sup>2</sup> instead of `cmake`
|
||
to see and customize all the available build options.
|
||
|
||
Notes:
|
||
1. Install CMake: `sudo apt-get install cmake`
|
||
2. Install CCMake: `sudo apt-get install cmake-curses-gui`
|
||
*/
|
||
|
||
/*! \page using Using the library
|
||
|
||
The preferred usage is including `<mimalloc.h>`, linking with
|
||
the shared- or static library, and using the `mi_malloc` API exclusively for allocation. For example,
|
||
```
|
||
gcc -o myprogram -lmimalloc myfile.c
|
||
```
|
||
|
||
mimalloc uses only safe OS calls (`mmap` and `VirtualAlloc`) and can co-exist
|
||
with other allocators linked to the same program.
|
||
If you use `cmake`, you can simply use:
|
||
```
|
||
find_package(mimalloc 1.0 REQUIRED)
|
||
```
|
||
in your `CMakeLists.txt` to find a locally installed mimalloc. Then use either:
|
||
```
|
||
target_link_libraries(myapp PUBLIC mimalloc)
|
||
```
|
||
to link with the shared (dynamic) library, or:
|
||
```
|
||
target_link_libraries(myapp PUBLIC mimalloc-static)
|
||
```
|
||
to link with the static library. See `test\CMakeLists.txt` for an example.
|
||
|
||
|
||
You can pass environment variables to print verbose messages (`MIMALLOC_VERBOSE=1`)
|
||
and statistics (`MIMALLOC_STATS=1`) (in the debug version):
|
||
```
|
||
> env MIMALLOC_STATS=1 ./cfrac 175451865205073170563711388363
|
||
|
||
175451865205073170563711388363 = 374456281610909315237213 * 468551
|
||
|
||
heap stats: peak total freed unit
|
||
normal 2: 16.4 kb 17.5 mb 17.5 mb 16 b ok
|
||
normal 3: 16.3 kb 15.2 mb 15.2 mb 24 b ok
|
||
normal 4: 64 b 4.6 kb 4.6 kb 32 b ok
|
||
normal 5: 80 b 118.4 kb 118.4 kb 40 b ok
|
||
normal 6: 48 b 48 b 48 b 48 b ok
|
||
normal 17: 960 b 960 b 960 b 320 b ok
|
||
|
||
heap stats: peak total freed unit
|
||
normal: 33.9 kb 32.8 mb 32.8 mb 1 b ok
|
||
huge: 0 b 0 b 0 b 1 b ok
|
||
total: 33.9 kb 32.8 mb 32.8 mb 1 b ok
|
||
malloc requested: 32.8 mb
|
||
|
||
committed: 58.2 kb 58.2 kb 58.2 kb 1 b ok
|
||
reserved: 2.0 mb 2.0 mb 2.0 mb 1 b ok
|
||
reset: 0 b 0 b 0 b 1 b ok
|
||
segments: 1 1 1
|
||
-abandoned: 0
|
||
pages: 6 6 6
|
||
-abandoned: 0
|
||
mmaps: 3
|
||
mmap fast: 0
|
||
mmap slow: 1
|
||
threads: 0
|
||
elapsed: 2.022s
|
||
process: user: 1.781s, system: 0.016s, faults: 756, reclaims: 0, rss: 2.7 mb
|
||
```
|
||
|
||
The above model of using the `mi_` prefixed API is not always possible
|
||
though in existing programs that already use the standard malloc interface,
|
||
and another option is to override the standard malloc interface
|
||
completely and redirect all calls to the _mimalloc_ library instead.
|
||
|
||
See \ref overrides for more info.
|
||
|
||
*/
|
||
|
||
/*! \page overrides Overriding Malloc
|
||
|
||
Overriding standard malloc can be done either _dynamically_ or _statically_.
|
||
|
||
## Dynamic override
|
||
|
||
This is the recommended way to override the standard malloc interface.
|
||
|
||
### Unix, BSD, MacOSX
|
||
|
||
On these system we preload the mimalloc shared
|
||
library so all calls to the standard `malloc` interface are
|
||
resolved to the _mimalloc_ library.
|
||
|
||
- `env LD_PRELOAD=/usr/lib/libmimalloc.so myprogram` (on Linux, BSD, etc.)
|
||
- `env DYLD_INSERT_LIBRARIES=usr/lib/libmimalloc.dylib myprogram` (On MacOSX)
|
||
|
||
Note certain security restrictions may apply when doing this from
|
||
the [shell](https://stackoverflow.com/questions/43941322/dyld-insert-libraries-ignored-when-calling-application-through-bash).
|
||
|
||
You can set extra environment variables to check that mimalloc is running,
|
||
like:
|
||
```
|
||
env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/libmimalloc.so myprogram
|
||
```
|
||
or run with the debug version to get detailed statistics:
|
||
```
|
||
env MIMALLOC_STATS=1 LD_PRELOAD=/usr/lib/libmimallocd.so myprogram
|
||
```
|
||
|
||
### Windows
|
||
|
||
On Windows you need to link your program explicitly with the mimalloc
|
||
DLL, and use the C-runtime library as a DLL (the `/MD` or `/MDd` switch).
|
||
To ensure the mimalloc DLL gets loaded it is easiest to insert some
|
||
call to the mimalloc API in the `main` function, like `mi_version()`.
|
||
|
||
Due to the way mimalloc overrides the standard malloc at runtime, it is best
|
||
to link to the mimalloc import library first on the command line so it gets
|
||
loaded right after the universal C runtime DLL (`ucrtbase`). See
|
||
the `mimalloc-override-test` project for an example.
|
||
|
||
|
||
## Static override
|
||
|
||
You can also statically link with _mimalloc_ to override the standard
|
||
malloc interface. The recommended way is to link the final program with the
|
||
_mimalloc_ single object file (`mimalloc-override.o` (or `.obj`)). We use
|
||
an object file instead of a library file as linkers give preference to
|
||
that over archives to resolve symbols. To ensure that the standard
|
||
malloc interface resolves to the _mimalloc_ library, link it as the first
|
||
object file. For example:
|
||
|
||
```
|
||
gcc -o myprogram mimalloc-override.o myfile1.c ...
|
||
```
|
||
|
||
|
||
|
||
## List of Overrides:
|
||
|
||
The specific functions that get redirected to the _mimalloc_ library are:
|
||
|
||
```
|
||
// C
|
||
void* malloc(size_t size);
|
||
void* calloc(size_t size, size_t n);
|
||
void* realloc(void* p, size_t newsize);
|
||
void free(void* p);
|
||
|
||
// C++
|
||
void operator delete(void* p);
|
||
void operator delete[](void* p);
|
||
|
||
void* operator new(std::size_t n) noexcept(false);
|
||
void* operator new[](std::size_t n) noexcept(false);
|
||
void* operator new( std::size_t n, std::align_val_t align) noexcept(false);
|
||
void* operator new[]( std::size_t n, std::align_val_t align) noexcept(false);
|
||
|
||
void* operator new ( std::size_t count, const std::nothrow_t& tag);
|
||
void* operator new[]( std::size_t count, const std::nothrow_t& tag);
|
||
void* operator new ( std::size_t count, std::align_val_t al, const std::nothrow_t&);
|
||
void* operator new[]( std::size_t count, std::align_val_t al, const std::nothrow_t&);
|
||
|
||
// Posix
|
||
int posix_memalign(void** p, size_t alignment, size_t size);
|
||
|
||
// Linux
|
||
void* memalign(size_t alignment, size_t size);
|
||
void* aligned_alloc(size_t alignment, size_t size);
|
||
void* valloc(size_t size);
|
||
void* pvalloc(size_t size);
|
||
size_t malloc_usable_size(void *p);
|
||
|
||
// BSD
|
||
void* reallocarray( void* p, size_t count, size_t size );
|
||
void* reallocf(void* p, size_t newsize);
|
||
void cfree(void* p);
|
||
|
||
// Windows
|
||
void* _expand(void* p, size_t newsize);
|
||
size_t _msize(void* p);
|
||
|
||
void* _malloc_dbg(size_t size, int block_type, const char* fname, int line);
|
||
void* _realloc_dbg(void* p, size_t newsize, int block_type, const char* fname, int line);
|
||
void* _calloc_dbg(size_t count, size_t size, int block_type, const char* fname, int line);
|
||
void* _expand_dbg(void* p, size_t size, int block_type, const char* fname, int line);
|
||
size_t _msize_dbg(void* p, int block_type);
|
||
void _free_dbg(void* p, int block_type);
|
||
```
|
||
|
||
*/
|
||
|
||
/*! \page bench Performance
|
||
|
||
|
||
tldr: In our benchmarks, mimalloc always outperforms
|
||
all other leading allocators (jemalloc, tcmalloc, hoard, and glibc), and usually
|
||
uses less memory (with less then 25% more in the worst case) (as of Jan 2019).
|
||
A nice property is that it does consistently well over a wide range of benchmarks.
|
||
|
||
Disclaimer: allocators are interesting as there is no optimal algorithm -- for
|
||
a given allocator one can always construct a workload where it does not do so well.
|
||
The goal is thus to find an allocation strategy that performs well over a wide
|
||
range of benchmarks without suffering from underperformance in less
|
||
common situations (which is what our second benchmark set tests for).
|
||
|
||
|
||
## Benchmarking
|
||
|
||
We tested _mimalloc_ with 5 other allocators over 11 benchmarks.
|
||
The tested allocators are:
|
||
|
||
- **mi**: The mimalloc allocator (version tag `v1.0.0`).
|
||
- **je**: [jemalloc](https://github.com/jemalloc/jemalloc), by [Jason Evans](https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919) (Facebook);
|
||
currently (2018) one of the leading allocators and is widely used, for example
|
||
in BSD, Firefox, and at Facebook. Installed as package `libjemalloc-dev:amd64/bionic 3.6.0-11`.
|
||
- **tc**: [tcmalloc](https://github.com/gperftools/gperftools), by Google as part of the performance tools.
|
||
Highly performant and used in the Chrome browser. Installed as package `libgoogle-perftools-dev:amd64/bionic 2.5-2.2ubuntu3`.
|
||
- **jx**: A compiled version of a more recent instance of [jemalloc](https://github.com/jemalloc/jemalloc).
|
||
Using commit ` 7a815c1b` ([dev](https://github.com/jemalloc/jemalloc/tree/dev), 2019-01-15).
|
||
- **hd**: [Hoard](https://github.com/emeryberger/Hoard), by Emery Berger \[1].
|
||
One of the first multi-thread scalable allocators.
|
||
([master](https://github.com/emeryberger/Hoard), 2019-01-01, version tag `3.13`)
|
||
- **mc**: The system allocator. Here we use the LibC allocator (which is originally based on
|
||
PtMalloc). Using version 2.27. (Note that version 2.26 significantly improved scalability over
|
||
earlier versions).
|
||
|
||
All allocators run exactly the same benchmark programs and use `LD_PRELOAD` to override the system allocator.
|
||
The wall-clock elapsed time and peak resident memory (_rss_) are
|
||
measured with the `time` program. The best scores over 5 runs are used.
|
||
Performance is reported relative to mimalloc, e.g. a time of 66% means that
|
||
mimalloc ran 1.5× faster (i.e. that mimalloc finished in 66% of the time
|
||
that the other allocator needed).
|
||
|
||
## On a 16-core AMD EPYC running Linux
|
||
|
||
Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
|
||
consisting of a 16-core AMD EPYC 7000 at 2.5GHz
|
||
with 128GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
|
||
|
||
|
||
The first benchmark set consists of programs that allocate a lot. Relative
|
||
elapsed time:
|
||
|
||
![bench-r5a-4xlarge-t1](bench-r5a-4xlarge-t1.png)
|
||
|
||
and memory usage:
|
||
|
||
![bench-r5a-4xlarge-m1](bench-r5a-4xlarge-m1.png)
|
||
|
||
The benchmarks above are (with N=16 in our case):
|
||
|
||
- __cfrac__: by Dave Barrett, implementation of continued fraction factorization:
|
||
uses many small short-lived allocations. Factorizes as `./cfrac 175451865205073170563711388363274837927895`.
|
||
- __espresso__: a programmable logic array analyzer \[3].
|
||
- __barnes__: a hierarchical n-body particle solver \[4]. Simulates 163840 particles.
|
||
- __leanN__: by Leonardo de Moura _et al_, the [lean](https://github.com/leanprover/lean)
|
||
compiler, version 3.4.1, compiling its own standard library concurrently using N cores (`./lean --make -j N`).
|
||
Big real-world workload with intensive allocation, takes about 1:40s when running on a
|
||
single high-end core.
|
||
- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
|
||
1 million requests pushing 10 new list elements and then requesting the
|
||
head 10 elements. Measures the requests handled per second.
|
||
- __alloc-test__: a modern [allocator test](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/)
|
||
developed by by OLogN Technologies AG at [ITHare.com](http://ithare.com). Simulates intensive allocation workloads with a Pareto
|
||
size distribution. The `alloc-testN` benchmark runs on N cores doing 100×10<sup>6</sup>
|
||
allocations per thread with objects up to 1KB in size.
|
||
Using commit `94f6cb` ([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04)
|
||
|
||
We can see mimalloc outperforms the other allocators moderately but all
|
||
these modern allocators perform well.
|
||
In `cfrac`, mimalloc is about 13%
|
||
faster than jemalloc for many small and short-lived allocations.
|
||
The `cfrac` and `espresso` programs do not use much
|
||
memory (~1.5MB) so it does not matter too much, but still mimalloc uses about half the resident
|
||
memory of tcmalloc (and almost 5× less than Hoard on `espresso`).
|
||
|
||
_The `leanN` program is most interesting as a large realistic and concurrent
|
||
workload and there is a 6% speedup over both tcmalloc and jemalloc. This is
|
||
quite significant: if Lean spends (optimistically) 20% of its time in the allocator
|
||
that means that mimalloc is 1.5× faster than the others._
|
||
|
||
The `alloc-test` is very allocation intensive and we see the larger
|
||
diffrerences here. Since all allocators perform almost identical on `alloc-test1`
|
||
as `alloc-testN`, we can see that they are all excellent and scale (almost) linearly.
|
||
|
||
The second benchmark set test specific aspects of the allocators and
|
||
shows more extreme differences between allocators:
|
||
|
||
![bench-r5a-4xlarge-t2](bench-r5a-4xlarge-t2.png)
|
||
|
||
|
||
|
||
![bench-r5a-4xlarge-m2](bench-r5a-4xlarge-m2.png)
|
||
|
||
The benchmarks in the second set are (again with N=16):
|
||
|
||
- __larson__: by Larson and Krishnan \[2]. Simulates a server workload using 100
|
||
separate threads where
|
||
they allocate and free many objects but leave some objects to
|
||
be freed by other threads. Larson and Krishnan observe this behavior
|
||
(which they call _bleeding_) in actual server applications, and the
|
||
benchmark simulates this.
|
||
- __sh6bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
|
||
single-threaded allocation where some of the objects are freed
|
||
in a usual last-allocated, first-freed (LIFO) order, but others
|
||
are freed in reverse order. Using the public [source](http://www.microquill.com/smartheap/shbench/bench.zip) (retrieved 2019-01-02)
|
||
- __sh8bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
|
||
multithreaded allocation (with N threads) where, just as in `larson`, some objects are freed
|
||
by other threads, and some objects freed in reverse (as in `sh6bench`).
|
||
Using the public [source](http://www.microquill.com/smartheap/SH8BENCH.zip) (retrieved 2019-01-02)
|
||
- __cache-scratch__: by Emery Berger _et al_ \[1]. Introduced with the Hoard
|
||
allocator to test for _passive-false_ sharing of cache lines: first some
|
||
small objects are allocated and given to each thread; the threads free that
|
||
object and allocate another one and access that repeatedly. If an allocator
|
||
allocates objects from different threads close to each other this will
|
||
lead to cache-line contention.
|
||
|
||
In the `larson` server workload mimalloc is 2.5× faster than
|
||
tcmalloc and jemalloc which is quite surprising -- probably due to the object
|
||
migration between different threads. Also in `sh6bench` mimalloc does much
|
||
better than the others (more than 4× faster than jemalloc). a
|
||
We cannot explain this well but believe it may be
|
||
caused in part by the "reverse" free-ing in `sh6bench`. Again in `sh8bench`
|
||
the mimalloc allocator handles object migration between threads much better .
|
||
|
||
The `cache-scratch` benchmark also demonstrates the different architectures
|
||
of the allocators nicely. With a single thread they all perform the same, but when
|
||
running with multiple threads the allocator induced false sharing of the
|
||
cache lines causes large run-time differences, where mimalloc is up to
|
||
20× faster than tcmalloc here. Only the original jemalloc does almost
|
||
as well (but the most recent version, jxmalloc, regresses). The
|
||
Hoard allocator is specifically designed to avoid this false sharing and we
|
||
are not sure why it is not doing well here (although it runs still 5× as
|
||
fast as tcmalloc and jxmalloc).
|
||
|
||
|
||
## On a 8-core Intel Xeon running Linux
|
||
|
||
Testing on a compute optimized Amazon EC2 instance ([c5d.2xlarge](https://aws.amazon.com/ec2/instance-types/))
|
||
consisting of a 8-core Intel Xeon Platinum at 3GHz (up to 3.5GHz turbo boost)
|
||
with 16GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
|
||
|
||
First the regular workload benchmarks (with N=8):
|
||
|
||
![bench-c5d-2xlarge-t1](bench-c5d-2xlarge-t1.png)
|
||
|
||
|
||
|
||
![bench-c5d-2xlarge-m1](bench-c5d-2xlarge-m1.png)
|
||
|
||
Most results are quite similar to the 16-core AMD machine except the
|
||
the differences are less pronounced with all a bit closer to mimalloc performance.
|
||
|
||
This is shown too in the second set of benchmarks:
|
||
|
||
![bench-c5d-2xlarge-t2](bench-c5d-2xlarge-t2.png)
|
||
|
||
|
||
|
||
![bench-c5d-2xlarge-m2](bench-c5d-2xlarge-m2.png)
|
||
|
||
On the server workload of `larson` everyone does a bit better on the 8-cores
|
||
than on 16. On the other benchmarks the performance does not improve though.
|
||
|
||
|
||
## On Windows (4-core Intel Xeon)
|
||
|
||
Testing on a HP Z4 G4 Workstation with a 4-core Intel® Xeon® W2123 at 3.6 GHz
|
||
with 16GB ECC memory, running Windows 10 Pro (version 10.0.17134 Build 17134)
|
||
with Visual Studio 2017 (version 15.8.9).
|
||
|
||
Since we cannot use `LD_PRELOAD` on Windows we compiled a subset of our
|
||
allocators and benchmarks and linked them statically. The **je** benchmark
|
||
is therefore equivalent to the **jx** benchmark in the previous graphs.
|
||
The **mc** allocator now refers to the standard Microsoft allocator.
|
||
Unfortunately we could not get Hoard to work on Windows at this time.
|
||
|
||
We used the Windows call `QueryPerformanceCounter` to measure elapsed wall-clock
|
||
times, and `GetProcessMemoryInfo` to measure the peak working set (rss).
|
||
|
||
First the regular workload benchmarks:
|
||
|
||
![bench-z4-win-t1](bench-z4-win-t1.png)
|
||
|
||
|
||
|
||
![bench-z4-win-m1](bench-z4-win-m1.png)
|
||
|
||
Here mimalloc and tcmalloc perform very similar, and outperform the system
|
||
allocator by a significant margin. Somehow jemalloc does much worse than
|
||
running on Linux. It it not clear why yet, but it might be a compilation issue:
|
||
when running through the profiler the `__chkstk` routine takes
|
||
quite some time. This is a compiler inserted runtime function to check for enough
|
||
stack space if there are many local variables or when the compiler cannot make
|
||
a static estimate. Perhaps this is the culprit but it needs more investigation.
|
||
|
||
The second set of benchmarks shows again more pronounced differences:
|
||
|
||
![bench-z4-win-t2](bench-z4-win-t2.png)
|
||
|
||
|
||
|
||
![bench-z4-win-m2](bench-z4-win-m2.png)
|
||
|
||
In the `larson` server workload mimalloc is 25% faster than
|
||
tcmalloc, and both significantly outperform the system allocator.
|
||
(again probably due to the object
|
||
migration between different threads).
|
||
Also in `sh6bench` and `sh8bench`, mimalloc scales much
|
||
better than the others.
|
||
|
||
## References
|
||
|
||
- \[1] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson.
|
||
_Hoard: A Scalable Memory Allocator for Multithreaded Applications_
|
||
the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000.
|
||
[pdf](http://www.cs.utexas.edu/users/mckinley/papers/asplos-2000.pdf)
|
||
|
||
- \[2] P. Larson and M. Krishnan. _Memory allocation for long-running server applications_. In ISMM, Vancouver, B.C., Canada, 1998.
|
||
[pdf](http://citeseemi.ist.psu.edu/viewdoc/download;jsessionid=5F0BFB4F57832AEB6C11BF8257271088?doi=10.1.1.45.1947&rep=rep1&type=pdf)
|
||
|
||
- \[3] D. Grunwald, B. Zorn, and R. Henderson.
|
||
_Improving the cache locality of memory allocation_. In R. Cartwright, editor,
|
||
Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993.
|
||
[pdf](http://citeseemi.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf)
|
||
|
||
- \[4] J. Barnes and P. Hut. _A hierarchical O(n*log(n)) force-calculation algorithm_. Nature, 324:446-449, 1986.
|
||
|
||
*/
|