add updated benchmarks

2020-01-22 15:05:02 -08:00 · 2020-01-22 15:05:02 -08:00 · af2cfe255a
commit af2cfe255a
parent 5bc1c52ae6
1 changed files with 117 additions and 91 deletions
--- a/readme.md
+++ b/readme.md
@ -313,68 +313,71 @@ under your control or otherwise mixing of pointers from different heaps may occu

 # Performance

+Last update: 2020-01-20
+
 We tested _mimalloc_ against many other top allocators over a wide
 range of benchmarks, ranging from various real world programs to
 synthetic benchmarks that see how the allocator behaves under more
-extreme circumstances.
+extreme circumstances. In our benchmark suite, _mimalloc_ outperforms other leading
+allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc), and has a similar memory footprint. A nice property is that it
+does consistently well over the wide range of benchmarks.

-In our benchmarks, _mimalloc_ always outperforms all other leading
-allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc), and usually uses less
-memory (up to 25% more in the worst case). A nice property is that it
-does *consistently* well over the wide range of benchmarks.
-
-Allocators are interesting as there exists no algorithm that is generally
+General memory allocators are interesting as there exists no algorithm that is
 optimal -- for a given allocator one can usually construct a workload
 where it does not do so well. The goal is thus to find an allocation
 strategy that performs well over a wide range of benchmarks without
-suffering from underperformance in less common situations (which is what
-the second half of our benchmark set tests for).
+suffering from (too much) underperformance in less common situations.

-We show here only the results on an AMD EPYC system (Apr 2019) -- for
-specific details and further benchmarks we refer to the [technical report](https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action).
+As always, interpret these results with care since some benchmarks test synthetic
+or uncommon situations that may never apply to your workloads. For example, most
+allocators do not do well on `xmalloc-testN` but that includes the best
+industrial allocators like _jemalloc_ and _tcmalloc_ that are used in some of
+the world's largest systems (like Chrome or FreeBSD).

-The benchmark suite is scripted and available separately
+We show here only an overview -- for
+more specific details and further benchmarks we refer to the
+[technical report](https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action).
+The benchmark suite is automated and available separately
 as [mimalloc-bench](https://github.com/daanx/mimalloc-bench).


-## Benchmark Results
+## Benchmark Results on 36-core Intel

-Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
-consisting of a 16-core AMD EPYC 7000 at 2.5GHz
-with 128GB ECC memory, running	Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
-The measured allocators are _mimalloc_ (mi),
-Google's [_tcmalloc_](https://github.com/gperftools/gperftools) (tc) used in Chrome,
-[_jemalloc_](https://github.com/jemalloc/jemalloc) (je) by Jason Evans used in Firefox and FreeBSD,
-[_snmalloc_](https://github.com/microsoft/snmalloc) (sn) by Liétar et al. \[8], [_rpmalloc_](https://github.com/rampantpixels/rpmalloc) (rp) by Mattias Jansson at Rampant Pixels,
-[_Hoard_](https://github.com/emeryberger/Hoard) by Emery Berger \[1],
-the system allocator (glibc) (based on _PtMalloc2_), and the Intel thread
-building blocks [allocator](https://github.com/intel/tbb) (tbb).
+Testing on a big Amazon EC2 compute instance
+([c5.18xlarge](https://aws.amazon.com/ec2/instance-types/#Compute_Optimized))
+consisting of a 72 processor Intel Xeon at 3GHz
+with 144GiB ECC memory, running	Ubuntu 18.04.1 with LibC 2.27 and GCC 7.4.0.
+The measured allocators are _mimalloc_ (mi, tag:v1.4.0, page reset enabled)
+and its secure build as _smi_,
+Google's [_tcmalloc_](https://github.com/gperftools/gperftools) (tc, tag:gperftools-2.7) used in Chrome,
+Facebook's [_jemalloc_](https://github.com/jemalloc/jemalloc) (je, tag:5.2.1) by Jason Evans used in Firefox and FreeBSD,
+the Intel thread building blocks [allocator](https://github.com/intel/tbb) (tbb, tag:2020),
+[rpmalloc](https://github.com/mjansson/rpmalloc) (rp,tag:1.4.0) by Mattias Jansson,
+the original scalable [_Hoard_](https://github.com/emeryberger/Hoard) (tag:3.13) allocator by Emery Berger \[1],
+the memory compacting [_Mesh_](https://github.com/plasma-umass/Mesh) (git:51222e7) allocator by
+Bobby Powers _et al_ \[8],
+and finally the default system allocator (glibc, 2.7.0) (based on _PtMalloc2_).

-![bench-r5a-1](doc/bench-r5a-1.svg)
-![bench-r5a-2](doc/bench-r5a-2.svg)
+![bench-c5-18xlarge-a](doc/bench-c5-18xlarge-2020-01-20-a.svq)
+![bench-c5-18xlarge-b](doc/bench-c5-18xlarge-2020-01-20-b.svq)

-Memory usage:
+Any benchmarks ending in `N` run on all processors in parallel.
+Results are averaged over 10 runs and reported relative
+to mimalloc (where 1.2 means it took 1.2&times; longer to run).
+The legend also contains the _overall relative score_ between the
+allocators where 100 points is the maximum if an allocator is fastest on
+all benchmarks.

-![bench-r5a-rss-1](doc/bench-r5a-rss-1.svg)
-![bench-r5a-rss-1](doc/bench-r5a-rss-2.svg)
+The single threaded _cfrac_ benchmark by Dave Barrett is an implementation of
+continued fraction factorization which uses many small short-lived allocations.
+All allocators do well on such common usage, where _mimalloc_ is just a tad
+faster than _tcmalloc_ and
+_jemalloc_.

-(note: the _xmalloc-testN_ memory usage should be disregarded as it
-allocates more the faster the program runs).
-
-In the first five benchmarks we can see _mimalloc_ outperforms the other
-allocators moderately, but we also see that all these modern allocators
-perform well -- the times of large performance differences in regular
-workloads are over :-).
-In _cfrac_ and _espresso_, _mimalloc_ is a tad faster than _tcmalloc_ and
-_jemalloc_, but a solid 10\% faster than all other allocators on
-_espresso_. The _tbb_ allocator does not do so well here and lags more than
-20\% behind _mimalloc_. The _cfrac_ and _espresso_ programs do not use much
-memory (~1.5MB) so it does not matter too much, but still _mimalloc_ uses
-about half the resident memory of _tcmalloc_.
-
-The _leanN_ program is most interesting as a large realistic and
-concurrent workload of the [Lean](https://github.com/leanprover/lean) theorem prover
-compiling its own standard library, and there is a 8% speedup over _tcmalloc_. This is
+The _leanN_ program is interesting as a large realistic and
+concurrent workload of the [Lean](https://github.com/leanprover/lean)
+theorem prover compiling its own standard library, and there is a 7%
+speedup over _tcmalloc_. This is
 quite significant: if Lean spends 20% of its time in the
 allocator that means that _mimalloc_ is 1.3&times; faster than _tcmalloc_
 here. (This is surprising as that is not measured in a pure
@ -383,19 +386,23 @@ outsized improvement here because _mimalloc_ has better locality in
 the allocation which improves performance for the *other* computations
 in a program as well).

-The _redis_ benchmark shows more differences between the allocators where
-_mimalloc_ is 14\% faster than _jemalloc_. On this benchmark _tbb_ (and _Hoard_) do
-not do well and are over 40\% slower.
+The single threaded _redis_ benchmark again show that most allocators do well on such workloads where _tcmalloc_
+did best this time.

-The _larson_ server workload allocates and frees objects between
-many threads. Larson and Krishnan \[2] observe this
-behavior (which they call _bleeding_) in actual server applications, and the
-benchmark simulates this.
-Here, _mimalloc_ is more than 2.5&times; faster than _tcmalloc_ and _jemalloc_
-due to the object migration between different threads. This is a difficult
-benchmark for other allocators too where _mimalloc_ is still 48% faster than the next
-fastest (_snmalloc_).
+The _larsonN_ server benchmark by Larson and Krishnan \[2] allocates and frees between threads. They observed this
+behavior (which they call _bleeding_) in actual server applications, and the benchmark simulates this.
+Here, _mimalloc_ is quite a bit faster than _tcmalloc_ and _jemalloc_ probably due to the object migration between different threads.

+The _mstressN_ workload performs many allocations and re-allocations,
+and migrates objects between threads (as in _larsonN_). However, it also
+creates  and destroys the _N_ worker threads a few times keeping some objects
+alive beyond the life time of the allocating thread. We observed this
+behavior in many larger server applications.
+
+The [_rptestN_](https://github.com/mjansson/rpmalloc-benchmark) benchmark
+by Mattias Jansson is a allocator test originally designed
+for _rpmalloc_, and tries to simulate realistic allocation patterns over
+multiple threads. Here the differences between allocators become more apparent.

 The second benchmark set tests specific aspects of the allocators and
 shows even more extreme differences between them.
@ -404,46 +411,62 @@ The _alloc-test_, by
 [OLogN Technologies AG](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/), is a very allocation intensive benchmark doing millions of
 allocations in various size classes. The test is scaled such that when an
 allocator performs almost identically on _alloc-test1_ as _alloc-testN_ it
-means that it scales linearly. Here, _tcmalloc_, _snmalloc_, and
-_Hoard_ seem to scale less well and do more than 10% worse on the
-multi-core version. Even the best allocators (_tcmalloc_ and _jemalloc_) are
-more than 10% slower as _mimalloc_ here.
+means that it scales linearly. Here, _tcmalloc_, and
+_Hoard_ seem to scale less well and do more than 10% worse on the multi-core version. Even the best industrial
+allocators (_tcmalloc_, _jemalloc_, and _tbb_) are more than 10% slower as _mimalloc_ here.

 The _sh6bench_ and _sh8bench_ benchmarks are
 developed by [MicroQuill](http://www.microquill.com/) as part of SmartHeap.
 In _sh6bench_ _mimalloc_ does much
-better than the others (more than 2&times; faster than _jemalloc_).
+better than the others (more than 1.5&times; faster than _jemalloc_).
 We cannot explain this well but believe it is
 caused in part by the "reverse" free-ing pattern in _sh6bench_.
-Again in _sh8bench_ the _mimalloc_ allocator handles object migration
-between threads much better and is over 36% faster than the next best
-allocator, _snmalloc_. Whereas _tcmalloc_ did well on _sh6bench_, the
-addition of object migration caused it to be almost 3 times slower
-than before.
+The _sh8bench_ is a variation with object migration
+between threads; whereas _tcmalloc_ did well on _sh6bench_, the addition of object migration causes it to be 10&times; slower than before.

-The _xmalloc-testN_ benchmark by Lever and Boreham \[5] and Christian Eder,
-simulates an asymmetric workload where
-some threads only allocate, and others only free. The _snmalloc_
-allocator was especially developed to handle this case well as it
-often occurs in concurrent message passing systems (like the [Pony] language
-for which _snmalloc_ was initially developed). Here we see that
+The _xmalloc-testN_ benchmark by Lever and Boreham \[5] and Christian Eder, simulates an asymmetric workload where
+some threads only allocate, and others only free -- they observed this pattern in
+larger server applications. Here we see that
 the _mimalloc_ technique of having non-contended sharded thread free
-lists pays off as it even outperforms _snmalloc_ here.
-Only _jemalloc_ also handles this reasonably well, while the
-others underperform by a large margin.
+lists pays off as it outperforms others by a very large margin. Only _rpmalloc_ and _tbb_ also scale well on this benchmark.

-The _cache-scratch_ benchmark by Emery Berger \[1], and introduced with the Hoard
-allocator to test for _passive-false_ sharing of cache lines. With a single thread they all
+The _cache-scratch_ benchmark by Emery Berger \[1], and introduced with
+the Hoard allocator to test for _passive-false_ sharing of cache lines.
+With a single thread they all
 perform the same, but when running with multiple threads the potential allocator
-induced false sharing of the cache lines causes large run-time
-differences, where _mimalloc_ is more than 18&times; faster than _jemalloc_ and
-_tcmalloc_! Crundal \[6] describes in detail why the false cache line
-sharing occurs in the _tcmalloc_ design, and also discusses how this
+induced false sharing of the cache lines can cause large run-time differences.
+Crundal \[6] describes in detail why the false cache line sharing occurs in the _tcmalloc_ design, and also discusses how this
 can be avoided with some small implementation changes.
-Only _snmalloc_ and _tbb_ also avoid the
-cache line sharing like _mimalloc_. Kukanov and Voss \[7] describe in detail
+Only the _tbb_, _rpmalloc_ and _mesh_ allocators also avoid the
+cache line sharing completely, while _Hoard_ and _glibc_ seem to mitigate
+the effects. Kukanov and Voss \[7] describe in detail
 how the design of _tbb_ avoids the false cache line sharing.

+## On 24-core AMD Epyc
+
+For completeness, here are the results on a
+[r5a.12xlarge](https://aws.amazon.com/ec2/instance-types/#Memory_Optimized) instance
+having a 48 processor AMD Epyc 7000 at 2.5GHz with 384GiB of memory.
+The results are similar to the Intel results but it is interesting to
+see the differences in the _larsonN_, _mstressN_, and _xmalloc-testN_ benchmarks.
+
+![bench-r5a-12xlarge-a](doc/bench-r5a-12xlarge-2020-01-16-a.svq)
+![bench-r5a-12xlarge-b](doc/bench-r5a-12xlarge-2020-01-16-b.svq)
+
+
+## Peak Working Set
+
+The following figure shows the peak working set (rss) of the allocators
+on the benchmarks (on the c5.18xlarge instance).
+
+![bench-c5-18xlarge-rss-a](doc/bench-c5-18xlarge-2020-01-20-rss-a.svq)
+![bench-c5-18xlarge-rss-b](doc/bench-c5-18xlarge-2020-01-20-rss-b.svq)
+
+Note that the _xmalloc-testN_ memory usage should be disregarded as it
+allocates more the faster the program runs. Similarly, memory usage of
+_mstressN_, _rptestN_ and _sh8bench_ can vary depending on scheduling and
+speed. Nevertheless, even though _mimalloc_ is fast on these benchmarks we
+believe the memory usage is too high and hope to improve.


 # References
@ -453,14 +476,12 @@ how the design of _tbb_ avoids the false cache line sharing.
   the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000.
   [pdf](http://www.cs.utexas.edu/users/mckinley/papers/asplos-2000.pdf)

-
- \[2] P. Larson and M. Krishnan. _Memory allocation for long-running server applications_. In ISMM, Vancouver, B.C., Canada, 1998.
-      [pdf](http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=5F0BFB4F57832AEB6C11BF8257271088?doi=10.1.1.45.1947&rep=rep1&type=pdf)
+- \[2] P. Larson and M. Krishnan. _Memory allocation for long-running server applications_.
+  In ISMM, Vancouver, B.C., Canada, 1998. [pdf](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.45.1947&rep=rep1&type=pdf)

 - \[3] D. Grunwald, B. Zorn, and R. Henderson.
  _Improving the cache locality of memory allocation_. In R. Cartwright, editor,
-  Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993.
-  [pdf](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf)
+  Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993. [pdf](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf)

 - \[4] J. Barnes and P. Hut. _A hierarchical O(n*log(n)) force-calculation algorithm_. Nature, 324:446-449, 1986.

@ -468,17 +489,22 @@ how the design of _tbb_ avoids the false cache line sharing.
  In USENIX Annual Technical Conference, Freenix Session. San Diego, CA. Jun. 2000.
  Available at <https://github.com/kuszmaul/SuperMalloc/tree/master/tests>

- \[6] Timothy Crundal. _Reducing Active-False Sharing in TCMalloc._
-   2016. <http://courses.cecs.anu.edu.au/courses/CSPROJECTS/16S1/Reports/Timothy_Crundal_Report.pdf>. CS16S1 project at the Australian National University.
+- \[6] Timothy Crundal. _Reducing Active-False Sharing in TCMalloc_. 2016. CS16S1 project at the Australian National University. [pdf](http://courses.cecs.anu.edu.au/courses/CSPROJECTS/16S1/Reports/Timothy_Crundal_Report.pdf)

 - \[7] Alexey Kukanov, and Michael J Voss.
   _The Foundations for Scalable Multi-Core Software in Intel Threading Building Blocks._
   Intel Technology Journal 11 (4). 2007

- \[8] Paul Liétar, Theodore Butler, Sylvan Clebsch, Sophia Drossopoulou, Juliana Franco, Matthew J Parkinson,
+- \[8] Bobby Powers, David Tench, Emery D. Berger, and Andrew McGregor.
+ _Mesh: Compacting Memory Management for C/C++_
+ In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'19), June 2019, pages 333-–346.
+
+<!--
+- \[9] Paul Liétar, Theodore Butler, Sylvan Clebsch, Sophia Drossopoulou, Juliana Franco, Matthew J Parkinson,
  Alex Shamis, Christoph M Wintersteiger, and David Chisnall.
  _Snmalloc: A Message Passing Allocator._
  In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 122–135. ACM. 2019.
+-->


 # Contributing