- __leanN__: by Leonardo de Moura _et al_, the [lean](https://github.com/leanprover/lean)
compiler, version 3.4.1, compiling its own standard library concurrently using N cores (`./lean --make -j N`).
A big real-world workload with intensive allocation; it takes about 1m40s when running on a
single high-end core.
- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
1 million requests pushing 10 new list elements and then requesting the
head 10 elements. Measures the requests handled per second.
- __alloc-test__: a modern [allocator test](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/)
developed by OLogN Technologies AG at [ITHare.com](http://ithare.com). Simulates intensive allocation workloads with a Pareto
size distribution. The `alloc-testN` benchmark runs on N cores doing 100×10<sup>6</sup>
allocations per thread with objects up to 1KB in size.
Using commit `94f6cb` ([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04).
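
As a rough illustration of what a Pareto size distribution means for an allocator, the following Python sketch draws request sizes with a heavy tail and clamps them to 1KB. It is not the `alloc-test` code: the function name `pareto_sizes`, the shape parameter, and the 8-byte minimum are assumptions chosen only for illustration.

```python
import random

def pareto_sizes(count, min_size=8, max_size=1024, shape=1.16, seed=0):
    """Draw `count` request sizes from a Pareto distribution, clamped to max_size.

    All parameter defaults are illustrative assumptions, not alloc-test settings.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(count):
        # paretovariate(shape) returns values >= 1.0 with a heavy tail.
        raw = min_size * rng.paretovariate(shape)
        sizes.append(min(int(raw), max_size))
    return sizes

if __name__ == "__main__":
    sizes = pareto_sizes(100_000)
    small = sum(1 for s in sizes if s <= 64)
    print(f"{small / len(sizes):.0%} of requests are 64 bytes or smaller")
```

Most requests come out small while a few reach the 1KB cap, which is why such a workload mostly stresses the small-object paths of an allocator.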
We can see mimalloc outperforms the other allocators moderately but all
these modern allocators perform well.
In `cfrac`, mimalloc is about 13%
faster than jemalloc for many small and short-lived allocations.
The `cfrac` and `espresso` programs do not use much
memory (~1.5MB) so it does not matter too much, but still mimalloc uses about half the resident
memory of tcmalloc (and 4× less than Hoard on `espresso`).
_The `leanN` program is most interesting as a large realistic and concurrent
workload and there is a 6% speedup over both tcmalloc and jemalloc._ This is
quite significant: if Lean spends (optimistically) 20% of its time in the allocator,
a 6% overall speedup implies that the allocator itself is roughly 1.4× faster with mimalloc.
The large `redis` benchmark shows a similar speedup.
The `alloc-test` is very allocation intensive and we see the largest
differences here when running with 16 cores in parallel.
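
The allocator-level estimate for `leanN` follows from an Amdahl-style calculation: if only the allocator share of the run time shrinks, the overall speedup fixes how much faster the allocator must be. The sketch below assumes that "6% speedup" means the total run time is divided by 1.06 and that the rest of the program is unaffected; the function name `implied_allocator_speedup` is ours.

```python
def implied_allocator_speedup(overall_speedup, allocator_fraction):
    """Return how much faster the allocator must be to explain an overall speedup.

    Assumes the non-allocator part of the run time is unchanged.
    """
    new_total = 1.0 / overall_speedup                      # run time relative to baseline
    new_allocator = new_total - (1.0 - allocator_fraction)  # what is left for the allocator
    if new_allocator <= 0:
        raise ValueError("speedup is larger than the allocator share can explain")
    return allocator_fraction / new_allocator

if __name__ == "__main__":
    for s in (1.06, 1.07):
        k = implied_allocator_speedup(s, 0.20)
        print(f"overall {s:.2f}x -> allocator {k:.2f}x")
```

With a 20% allocator share, an overall speedup of 1.06 corresponds to an allocator that is about 1.4× faster, and 1.07 to about 1.5×, so the estimate is sensitive to small changes in the measured figure.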
The second benchmark set tests specific aspects of the allocators and
shows more extreme differences between them.
The benchmarks in the second set are (again with N=16):
- __larson__: by Larson and Krishnan \[2]. Simulates a server workload using 100
separate threads that each
allocate and free many objects but leave some objects to
be freed by other threads. Larson and Krishnan observe this behavior
(which they call _bleeding_) in actual server applications, and the
benchmark simulates this.
- __sh6bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
single-threaded allocation where some of the objects are freed
in a usual last-allocated, first-freed (LIFO) order, but others
are freed in reverse order. Using the public [source](http://www.microquill.com/smartheap/shbench/bench.zip) (retrieved 2019-01-02).
- __sh8bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
multithreaded allocation (with N threads) where, just as in `larson`, some objects are freed
by other threads, and some objects are freed in reverse order (as in `sh6bench`).
Using the public [source](http://www.microquill.com/smartheap/SH8BENCH.zip) (retrieved 2019-01-02).
- __cache-scratch__: by Emery Berger _et al_ \[1]. Introduced with the Hoard
allocator to test for _passive-false_ sharing of cache lines: first a
small object is allocated and given to each thread; each thread then frees that
object, allocates another one, and accesses it repeatedly. If an allocator
places objects from different threads close to each other, this
leads to cache-line contention.
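
To make the false-sharing condition in `cache-scratch` concrete, the Python sketch below checks which cache lines hold objects belonging to more than one thread. The 64-byte line size, the function name `contended_lines`, and the example addresses are assumptions for illustration; they are not taken from the benchmark.

```python
CACHE_LINE = 64  # bytes; a common line size, assumed here

def contended_lines(addresses_by_thread, line_size=CACHE_LINE):
    """Return the cache-line indices that hold objects of more than one thread."""
    owners = {}
    for thread_id, addresses in addresses_by_thread.items():
        for addr in addresses:
            owners.setdefault(addr // line_size, set()).add(thread_id)
    return sorted(line for line, threads in owners.items() if len(threads) > 1)

if __name__ == "__main__":
    # Hypothetical layouts for two threads that each own two 8-byte objects.
    packed = {0: [0x1000, 0x1010], 1: [0x1008, 0x1018]}     # interleaved in one line
    separated = {0: [0x1000, 0x1008], 1: [0x2000, 0x2008]}  # per-thread regions
    print(contended_lines(packed), contended_lines(separated))
```

In the interleaved layout every write by one thread invalidates the line for the other, even though the threads never touch the same object; a layout with per-thread regions has no such shared lines.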
In the `larson` server workload mimalloc is 2.5× faster than
tcmalloc and jemalloc which is quite surprising -- probably due to the object
migration between different threads. Also in `sh6bench` mimalloc does much
better than the others (more than 4× faster than jemalloc).
We cannot explain this well but believe it may be
caused in part by the "reverse" freeing in `sh6bench`. Again in `sh8bench`
the mimalloc allocator handles object migration between threads much better.
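
The mixed free order that `sh6bench` is described as using can be pictured with a small Python sketch. It only models the order in which allocation indices are released, not the benchmark itself; the function name `mixed_free_order` and the even split between the two orders are assumptions.

```python
def mixed_free_order(num_objects, lifo_fraction=0.5):
    """Return allocation indices in the order they would be freed.

    The most recently allocated share is freed newest-first (LIFO); the
    remaining objects are freed oldest-first, the reverse of LIFO order.
    """
    split = int(num_objects * (1.0 - lifo_fraction))
    allocated = list(range(num_objects))   # index equals allocation order
    oldest_first = allocated[:split]       # freed in allocation order
    newest_first = allocated[split:][::-1]  # freed in LIFO order
    return newest_first + oldest_first

if __name__ == "__main__":
    print(mixed_free_order(6))
```

An allocator whose free lists favor one of the two orders may behave quite differently on the other, which is one possible reason for the spread seen in this benchmark.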
The `cache-scratch` benchmark also demonstrates the different architectures
of the allocators nicely. With a single thread they all perform the same, but when
running with multiple threads the allocator-induced false sharing of the
cache lines causes large run-time differences: mimalloc is
20× faster than tcmalloc here. Only the original jemalloc does almost
as well (but the most recent version, jxmalloc, regresses). The
Hoard allocator is specifically designed to avoid this false sharing and we
are not sure why it is not doing well here (although it still runs almost 5×