Update documentation generation
This commit is contained in:
parent
c91bed99e1
commit
644f59fad7
337
.gitignore
vendored
@ -1,330 +1,7 @@
|
||||
## Ignore Visual Studio temporary files, build results, and
|
||||
## files generated by popular Visual Studio add-ons.
|
||||
##
|
||||
## Get latest from https://github.com/github/gitignore/blob/master/VisualStudio.gitignore
|
||||
|
||||
# User-specific files
|
||||
*.suo
|
||||
*.user
|
||||
*.userosscache
|
||||
*.sln.docstates
|
||||
|
||||
# User-specific files (MonoDevelop/Xamarin Studio)
|
||||
*.userprefs
|
||||
|
||||
# Build results
|
||||
[Dd]ebug/
|
||||
[Dd]ebugPublic/
|
||||
[Rr]elease/
|
||||
[Rr]eleases/
|
||||
x64/
|
||||
x86/
|
||||
bld/
|
||||
[Bb]in/
|
||||
[Oo]bj/
|
||||
[Ll]og/
|
||||
|
||||
# Visual Studio 2015/2017 cache/options directory
|
||||
.vs/
|
||||
# Uncomment if you have tasks that create the project's static files in wwwroot
|
||||
#wwwroot/
|
||||
|
||||
# Visual Studio 2017 auto generated files
|
||||
Generated\ Files/
|
||||
|
||||
# MSTest test Results
|
||||
[Tt]est[Rr]esult*/
|
||||
[Bb]uild[Ll]og.*
|
||||
|
||||
# NUNIT
|
||||
*.VisualState.xml
|
||||
TestResult.xml
|
||||
|
||||
# Build Results of an ATL Project
|
||||
[Dd]ebugPS/
|
||||
[Rr]eleasePS/
|
||||
dlldata.c
|
||||
|
||||
# Benchmark Results
|
||||
BenchmarkDotNet.Artifacts/
|
||||
|
||||
# .NET Core
|
||||
project.lock.json
|
||||
project.fragment.lock.json
|
||||
artifacts/
|
||||
**/Properties/launchSettings.json
|
||||
|
||||
# StyleCop
|
||||
StyleCopReport.xml
|
||||
|
||||
# Files built by Visual Studio
|
||||
*_i.c
|
||||
*_p.c
|
||||
*_i.h
|
||||
*.ilk
|
||||
*.meta
|
||||
*.obj
|
||||
*.iobj
|
||||
*.pch
|
||||
*.pdb
|
||||
*.ipdb
|
||||
*.pgc
|
||||
*.pgd
|
||||
*.rsp
|
||||
*.sbr
|
||||
*.tlb
|
||||
*.tli
|
||||
*.tlh
|
||||
*.tmp
|
||||
*.tmp_proj
|
||||
*.log
|
||||
*.vspscc
|
||||
*.vssscc
|
||||
.builds
|
||||
*.pidb
|
||||
*.svclog
|
||||
*.scc
|
||||
|
||||
# Chutzpah Test files
|
||||
_Chutzpah*
|
||||
|
||||
# Visual C++ cache files
|
||||
ipch/
|
||||
*.aps
|
||||
*.ncb
|
||||
*.opendb
|
||||
*.opensdf
|
||||
*.sdf
|
||||
*.cachefile
|
||||
*.VC.db
|
||||
*.VC.VC.opendb
|
||||
|
||||
# Visual Studio profiler
|
||||
*.psess
|
||||
*.vsp
|
||||
*.vspx
|
||||
*.sap
|
||||
|
||||
# Visual Studio Trace Files
|
||||
*.e2e
|
||||
|
||||
# TFS 2012 Local Workspace
|
||||
$tf/
|
||||
|
||||
# Guidance Automation Toolkit
|
||||
*.gpState
|
||||
|
||||
# ReSharper is a .NET coding add-in
|
||||
_ReSharper*/
|
||||
*.[Rr]e[Ss]harper
|
||||
*.DotSettings.user
|
||||
|
||||
# JustCode is a .NET coding add-in
|
||||
.JustCode
|
||||
|
||||
# TeamCity is a build add-in
|
||||
_TeamCity*
|
||||
|
||||
# DotCover is a Code Coverage Tool
|
||||
*.dotCover
|
||||
|
||||
# AxoCover is a Code Coverage Tool
|
||||
.axoCover/*
|
||||
!.axoCover/settings.json
|
||||
|
||||
# Visual Studio code coverage results
|
||||
*.coverage
|
||||
*.coveragexml
|
||||
|
||||
# NCrunch
|
||||
_NCrunch_*
|
||||
.*crunch*.local.xml
|
||||
nCrunchTemp_*
|
||||
|
||||
# MightyMoose
|
||||
*.mm.*
|
||||
AutoTest.Net/
|
||||
|
||||
# Web workbench (sass)
|
||||
.sass-cache/
|
||||
|
||||
# Installshield output folder
|
||||
[Ee]xpress/
|
||||
|
||||
# DocProject is a documentation generator add-in
|
||||
DocProject/buildhelp/
|
||||
DocProject/Help/*.HxT
|
||||
DocProject/Help/*.HxC
|
||||
DocProject/Help/*.hhc
|
||||
DocProject/Help/*.hhk
|
||||
DocProject/Help/*.hhp
|
||||
DocProject/Help/Html2
|
||||
DocProject/Help/html
|
||||
|
||||
# Click-Once directory
|
||||
publish/
|
||||
|
||||
# Publish Web Output
|
||||
*.[Pp]ublish.xml
|
||||
*.azurePubxml
|
||||
# Note: Comment the next line if you want to checkin your web deploy settings,
|
||||
# but database connection strings (with potential passwords) will be unencrypted
|
||||
*.pubxml
|
||||
*.publishproj
|
||||
|
||||
# Microsoft Azure Web App publish settings. Comment the next line if you want to
|
||||
# checkin your Azure Web App publish settings, but sensitive information contained
|
||||
# in these scripts will be unencrypted
|
||||
PublishScripts/
|
||||
|
||||
# NuGet Packages
|
||||
*.nupkg
|
||||
# The packages folder can be ignored because of Package Restore
|
||||
**/[Pp]ackages/*
|
||||
# except build/, which is used as an MSBuild target.
|
||||
!**/[Pp]ackages/build/
|
||||
# Uncomment if necessary however generally it will be regenerated when needed
|
||||
#!**/[Pp]ackages/repositories.config
|
||||
# NuGet v3's project.json files produces more ignorable files
|
||||
*.nuget.props
|
||||
*.nuget.targets
|
||||
|
||||
# Microsoft Azure Build Output
|
||||
csx/
|
||||
*.build.csdef
|
||||
|
||||
# Microsoft Azure Emulator
|
||||
ecf/
|
||||
rcf/
|
||||
|
||||
# Windows Store app package directories and files
|
||||
AppPackages/
|
||||
BundleArtifacts/
|
||||
Package.StoreAssociation.xml
|
||||
_pkginfo.txt
|
||||
*.appx
|
||||
|
||||
# Visual Studio cache files
|
||||
# files ending in .cache can be ignored
|
||||
*.[Cc]ache
|
||||
# but keep track of directories ending in .cache
|
||||
!*.[Cc]ache/
|
||||
|
||||
# Others
|
||||
ClientBin/
|
||||
~$*
|
||||
*~
|
||||
*.dbmdl
|
||||
*.dbproj.schemaview
|
||||
*.jfm
|
||||
*.pfx
|
||||
*.publishsettings
|
||||
orleans.codegen.cs
|
||||
|
||||
# Including strong name files can present a security risk
|
||||
# (https://github.com/github/gitignore/pull/2483#issue-259490424)
|
||||
#*.snk
|
||||
|
||||
# Since there are multiple workflows, uncomment next line to ignore bower_components
|
||||
# (https://github.com/github/gitignore/pull/1529#issuecomment-104372622)
|
||||
#bower_components/
|
||||
|
||||
# RIA/Silverlight projects
|
||||
Generated_Code/
|
||||
|
||||
# Backup & report files from converting an old project file
|
||||
# to a newer Visual Studio version. Backup files are not needed,
|
||||
# because we have git ;-)
|
||||
_UpgradeReport_Files/
|
||||
Backup*/
|
||||
UpgradeLog*.XML
|
||||
UpgradeLog*.htm
|
||||
ServiceFabricBackup/
|
||||
*.rptproj.bak
|
||||
|
||||
# SQL Server files
|
||||
*.mdf
|
||||
*.ldf
|
||||
*.ndf
|
||||
|
||||
# Business Intelligence projects
|
||||
*.rdl.data
|
||||
*.bim.layout
|
||||
*.bim_*.settings
|
||||
*.rptproj.rsuser
|
||||
|
||||
# Microsoft Fakes
|
||||
FakesAssemblies/
|
||||
|
||||
# GhostDoc plugin setting file
|
||||
*.GhostDoc.xml
|
||||
|
||||
# Node.js Tools for Visual Studio
|
||||
.ntvs_analysis.dat
|
||||
node_modules/
|
||||
|
||||
# Visual Studio 6 build log
|
||||
*.plg
|
||||
|
||||
# Visual Studio 6 workspace options file
|
||||
*.opt
|
||||
|
||||
# Visual Studio 6 auto-generated workspace file (contains which files were open etc.)
|
||||
*.vbw
|
||||
|
||||
# Visual Studio LightSwitch build output
|
||||
**/*.HTMLClient/GeneratedArtifacts
|
||||
**/*.DesktopClient/GeneratedArtifacts
|
||||
**/*.DesktopClient/ModelManifest.xml
|
||||
**/*.Server/GeneratedArtifacts
|
||||
**/*.Server/ModelManifest.xml
|
||||
_Pvt_Extensions
|
||||
|
||||
# Paket dependency manager
|
||||
.paket/paket.exe
|
||||
paket-files/
|
||||
|
||||
# FAKE - F# Make
|
||||
.fake/
|
||||
|
||||
# JetBrains Rider
|
||||
.idea/
|
||||
*.sln.iml
|
||||
|
||||
# CodeRush
|
||||
.cr/
|
||||
|
||||
# Python Tools for Visual Studio (PTVS)
|
||||
__pycache__/
|
||||
*.pyc
|
||||
|
||||
# Cake - Uncomment if you are using it
|
||||
# tools/**
|
||||
# !tools/packages.config
|
||||
|
||||
# Tabs Studio
|
||||
*.tss
|
||||
|
||||
# Telerik's JustMock configuration file
|
||||
*.jmconfig
|
||||
|
||||
# BizTalk build output
|
||||
*.btp.cs
|
||||
*.btm.cs
|
||||
*.odx.cs
|
||||
*.xsd.cs
|
||||
|
||||
# OpenCover UI analysis results
|
||||
OpenCover/
|
||||
|
||||
# Azure Stream Analytics local run output
|
||||
ASALocalRun/
|
||||
|
||||
# MSBuild Binary and Structured Log
|
||||
*.binlog
|
||||
|
||||
# NVidia Nsight GPU debugger configuration file
|
||||
*.nvuser
|
||||
|
||||
# MFractors (Xamarin productivity tool) working folder
|
||||
.mfractor/
|
||||
ide/vs2017/*.db
|
||||
ide/vs2017/*.opendb
|
||||
ide/vs2017/*.user
|
||||
ide/vs2017/.vs
|
||||
out/
|
||||
docs/
|
||||
*.zip
|
||||
|
13
doc/doxyfile
@ -1235,18 +1235,7 @@ HTML_EXTRA_STYLESHEET = mimalloc-doxygen.css
|
||||
# files will be copied as-is; there are no commands or markers available.
|
||||
# This tag requires that the tag GENERATE_HTML is set to YES.
|
||||
|
||||
HTML_EXTRA_FILES = bench-r5a-4xlarge-t1.png \
|
||||
bench-r5a-4xlarge-t2.png \
|
||||
bench-r5a-4xlarge-m1.png \
|
||||
bench-r5a-4xlarge-m2.png \
|
||||
bench-c5d-2xlarge-t1.png \
|
||||
bench-c5d-2xlarge-t2.png \
|
||||
bench-c5d-2xlarge-m1.png \
|
||||
bench-c5d-2xlarge-m2.png \
|
||||
bench-z4-win-t1.png \
|
||||
bench-z4-win-t2.png \
|
||||
bench-z4-win-m1.png \
|
||||
bench-z4-win-m2.png
|
||||
HTML_EXTRA_FILES =
|
||||
|
||||
# The HTML_COLORSTYLE_HUE tag controls the color of the HTML output. Doxygen
|
||||
# will adjust the colors in the style sheet and background images according to
|
||||
|
@ -11,13 +11,19 @@ terms of the MIT license. A copy of the license can be found in the file
|
||||
/*! \mainpage
|
||||
|
||||
This is the API documentation of the
|
||||
[mimalloc](https://github.com/koka-lang/mimalloc) allocator
|
||||
[mimalloc](https://github.com/microsoft/mimalloc) allocator
|
||||
(pronounced "me-malloc") -- a
|
||||
general purpose allocator with excellent [performance](bench.html)
|
||||
characteristics. Initially
|
||||
developed by Daan Leijen for the run-time systems of the
|
||||
[Koka](https://github.com/koka-lang/koka) and [Lean](https://github.com/leanprover/lean) languages.
|
||||
|
||||
It is a drop-in replacement for `malloc` and can be used in other programs
|
||||
without code changes, for example, on Unix you can use it as:
|
||||
```
|
||||
> LD_PRELOAD=/usr/bin/libmimalloc.so myprogram
|
||||
```
|
||||
|
||||
Notable aspects of the design include:
|
||||
|
||||
- __small and consistent__: the library is less than 3500 LOC using simple and
|
||||
@ -25,23 +31,32 @@ Notable aspects of the design include:
|
||||
to integrate and adapt in other projects. For runtime systems it
|
||||
provides hooks for a monotonic _heartbeat_ and deferred freeing (for
|
||||
bounded worst-case times with reference counting).
|
||||
- __free list sharding__: "the big idea": instead of one big free list (per size class) we have
|
||||
- __free list sharding__: the big idea: instead of one big free list (per size class) we have
|
||||
many smaller lists per memory "page" which both reduces fragmentation
|
||||
and increases locality --
|
||||
things that are allocated close in time get allocated close in memory.
|
||||
(A memory "page" in mimalloc contains blocks of one size class and is
|
||||
usually 64KB on a 64-bit system).
|
||||
(A memory "page" in _mimalloc_ contains blocks of one size class and is
|
||||
usually 64KiB on a 64-bit system).
|
||||
- __eager page reset__: when a "page" becomes empty (with increased chance
|
||||
due to free list sharding) the memory is marked to the OS as unused ("reset" or "purged")
|
||||
reducing (real) memory pressure and fragmentation, especially in long running
|
||||
programs.
|
||||
- __lazy initialization__: pages in a segment are lazily initialized so
|
||||
no memory is touched until it becomes allocated, reducing the resident
|
||||
memory and potential page faults.
|
||||
- __secure__: _mimalloc_ can be built in secure mode, adding guard pages,
|
||||
randomized allocation, encrypted free lists, etc. to protect against various
|
||||
heap vulnerabilities. The performance penalty is only around 3% on average
|
||||
over our benchmarks.
|
||||
- __first-class heaps__: efficiently create and use multiple heaps to allocate across different regions.
|
||||
A heap can be destroyed at once instead of deallocating each object separately.
|
||||
- __bounded__: it does not suffer from _blowup_ \[1\], has bounded worst-case allocation
|
||||
times (_wcat_), bounded space overhead (~0.2% meta-data, with at most 16.7% waste in allocation sizes),
|
||||
and has no internal points of contention using atomic operations almost
|
||||
everywhere.
|
||||
and has no internal points of contention using only atomic operations.
|
||||
- __fast__: In our benchmarks (see [below](#performance)),
|
||||
_mimalloc_ always outperforms all other leading allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc),
|
||||
and usually uses less memory (up to 25% more in the worst case). A nice property
|
||||
is that it does consistently well over a wide range of benchmarks.
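
The free list sharding idea in the list above can be illustrated with a toy model. This is only a sketch under stated assumptions, not mimalloc's implementation: `slot_t`, `page_t`, `page_init`, `page_alloc`, `page_free`, and the tiny sizes are hypothetical names chosen for illustration. Each page holds blocks of one size class and owns its own free list, so blocks handed out close in time come from the same page.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: names and sizes are illustrative assumptions,
   not mimalloc's actual data structures. */
#define BLOCKS_PER_PAGE 4
#define BLOCK_SIZE 16

/* A slot is either a link in the free list or user payload. */
typedef union slot_u {
  union slot_u *next;                 /* valid while the slot is free */
  unsigned char payload[BLOCK_SIZE];  /* valid while the slot is allocated */
} slot_t;

/* One page: blocks of a single size class plus its own free list. */
typedef struct page_s {
  slot_t *free;                       /* head of this page's free list */
  size_t used;                        /* number of blocks handed out */
  slot_t slots[BLOCKS_PER_PAGE];
} page_t;

/* Thread all slots onto the page-local free list, lowest address first. */
void page_init(page_t *page) {
  page->free = NULL;
  page->used = 0;
  for (size_t i = BLOCKS_PER_PAGE; i > 0; i--) {
    page->slots[i - 1].next = page->free;
    page->free = &page->slots[i - 1];
  }
}

/* Pop a block from this page's list; returns NULL when the page is full. */
void *page_alloc(page_t *page) {
  slot_t *s = page->free;
  if (s == NULL) return NULL;
  page->free = s->next;
  page->used++;
  return s;
}

/* Push a block back onto the list of the page that owns it. */
void page_free(page_t *page, void *p) {
  slot_t *s = (slot_t *)p;
  assert(s >= page->slots && s < page->slots + BLOCKS_PER_PAGE);
  s->next = page->free;
  page->free = s;
  page->used--;
}
```

In this model a page whose `used` count drops to zero is entirely free, which is the condition under which an allocator could hand the memory back to the OS as described under eager page reset.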
|
||||
|
||||
You can read more on the design of _mimalloc_ in the upcoming technical report
|
||||
which also has detailed benchmark results.
|
||||
|
||||
Further information:
|
||||
|
||||
@ -623,13 +638,13 @@ void mi_option_set_default(mi_option_t option, long value);
|
||||
|
||||
Check out the sources from GitHub:
|
||||
```
|
||||
git clone https://github.com/koka-lang/mimalloc.git
|
||||
git clone https://github.com/microsoft/mimalloc
|
||||
```
|
||||
|
||||
## Windows
|
||||
|
||||
Open `ide/vs2017/mimalloc.sln` in Visual Studio 2017 and build.
|
||||
The `mimalloc` project builds a static library, while the
|
||||
The `mimalloc` project builds a static library (in `out/msvc-x64`), while the
|
||||
`mimalloc-override` project builds a DLL for overriding malloc
|
||||
in the entire program.
|
||||
|
||||
@ -637,44 +652,50 @@ in the entire program.
|
||||
|
||||
We use [`cmake`](https://cmake.org)<sup>1</sup> as the build system:
|
||||
|
||||
- `mkdir -p out/release` (create a build directory)
|
||||
- `cd out/release` (go to it)
|
||||
- `cmake ../..` (generate the make file)
|
||||
- `make` (and build)
|
||||
```
|
||||
> mkdir -p out/release
|
||||
> cd out/release
|
||||
> cmake ../..
|
||||
> make
|
||||
```
|
||||
This builds the library as a shared (dynamic)
|
||||
library (`.so` or `.dylib`), a static library (`.a`), and
|
||||
as a single object file (`.o`).
|
||||
|
||||
This will build the library as a shared (dynamic)
|
||||
library (`.so` or `.dylib`), a static library (`.a`), and
|
||||
as a single object file (`.o`).
|
||||
|
||||
- `sudo make install` (install the library and header files in `/usr/lib` and `/usr/include`)
|
||||
|
||||
Use the option `-DCMAKE_INSTALL_PREFIX=../local` (for example) to the `ccmake`
|
||||
command to install to a local directory to see what gets installed.
|
||||
`> sudo make install` (install the library and header files in `/usr/local/lib` and `/usr/local/include`)
|
||||
|
||||
You can build the debug version which does many internal checks and
|
||||
maintains detailed statistics as:
|
||||
|
||||
- `mkdir -p out/debug`
|
||||
- `cd out/debug`
|
||||
- `cmake -DCMAKE_BUILD_TYPE=Debug ../..`
|
||||
- `make`
|
||||
|
||||
This will name the shared library as `libmimalloc-debug.so`.
|
||||
|
||||
Or build with `clang`:
|
||||
|
||||
- `CC=clang cmake ../..`
|
||||
```
|
||||
> mkdir -p out/debug
|
||||
> cd out/debug
|
||||
> cmake -DCMAKE_BUILD_TYPE=Debug ../..
|
||||
> make
|
||||
```
|
||||
This will name the shared library as `libmimalloc-debug.so`.
|
||||
|
||||
Finally, you can build a _secure_ version that uses guard pages, encrypted
|
||||
free lists, etc, as:
|
||||
```
|
||||
> mkdir -p out/secure
|
||||
> cd out/secure
|
||||
> cmake -DSECURE=ON ../..
|
||||
> make
|
||||
```
|
||||
This will name the shared library as `libmimalloc-secure.so`.
|
||||
Use `ccmake`<sup>2</sup> instead of `cmake`
|
||||
to see and customize all the available build options.
|
||||
|
||||
Notes:
|
||||
1. Install CMake: `sudo apt-get install cmake`
|
||||
2. Install CCMake: `sudo apt-get install cmake-curses-gui`
|
||||
|
||||
*/
|
||||
|
||||
/*! \page using Using the library
|
||||
|
||||
|
||||
The preferred usage is including `<mimalloc.h>`, linking with
|
||||
the shared- or static library, and using the `mi_malloc` API exclusively for allocation. For example,
|
||||
```
|
||||
@ -745,7 +766,7 @@ See \ref overrides for more info.
|
||||
|
||||
/*! \page overrides Overriding Malloc
|
||||
|
||||
Overriding standard malloc can be done either _dynamically_ or _statically_.
|
||||
Overriding the standard `malloc` can be done either _dynamically_ or _statically_.
|
||||
|
||||
## Dynamic override
|
||||
|
||||
@ -753,7 +774,7 @@ This is the recommended way to override the standard malloc interface.
|
||||
|
||||
### Unix, BSD, MacOSX
|
||||
|
||||
On these system we preload the mimalloc shared
|
||||
On these systems we preload the mimalloc shared
|
||||
library so all calls to the standard `malloc` interface are
|
||||
resolved to the _mimalloc_ library.
|
||||
|
||||
@ -770,7 +791,7 @@ env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/libmimalloc.so myprogram
|
||||
```
|
||||
or run with the debug version to get detailed statistics:
|
||||
```
|
||||
env MIMALLOC_STATS=1 LD_PRELOAD=/usr/lib/libmimallocd.so myprogram
|
||||
env MIMALLOC_STATS=1 LD_PRELOAD=/usr/lib/libmimalloc-debug.so myprogram
|
||||
```
|
||||
|
||||
### Windows
|
||||
@ -780,7 +801,7 @@ DLL, and use the C-runtime library as a DLL (the `/MD` or `/MDd` switch).
|
||||
To ensure the mimalloc DLL gets loaded it is easiest to insert some
|
||||
call to the mimalloc API in the `main` function, like `mi_version()`.
|
||||
|
||||
Due to the way mimalloc overrides the standard malloc at runtime, it is best
|
||||
Due to the way mimalloc intercepts the standard malloc at runtime, it is best
|
||||
to link to the mimalloc import library first on the command line so it gets
|
||||
loaded right after the universal C runtime DLL (`ucrtbase`). See
|
||||
the `mimalloc-override-test` project for an example.
|
||||
@ -788,9 +809,9 @@ the `mimalloc-override-test` project for an example.
|
||||
|
||||
## Static override
|
||||
|
||||
You can also statically link with _mimalloc_ to override the standard
|
||||
On Unix systems, you can also statically link with _mimalloc_ to override the standard
|
||||
malloc interface. The recommended way is to link the final program with the
|
||||
_mimalloc_ single object file (`mimalloc-override.o` (or `.obj`)). We use
|
||||
_mimalloc_ single object file (`mimalloc-override.o`). We use
|
||||
an object file instead of a library file as linkers give preference to
|
||||
that over archives to resolve symbols. To ensure that the standard
|
||||
malloc interface resolves to the _mimalloc_ library, link it as the first
|
||||
@ -858,239 +879,19 @@ void _free_dbg(void* p, int block_type);
|
||||
|
||||
/*! \page bench Performance
|
||||
|
||||
We tested _mimalloc_ against many other top allocators over a wide
|
||||
range of benchmarks, ranging from various real world programs to
|
||||
synthetic benchmarks that see how the allocator behaves under more
|
||||
extreme circumstances.
|
||||
|
||||
tldr: In our benchmarks, mimalloc always outperforms
|
||||
all other leading allocators (jemalloc, tcmalloc, hoard, and glibc), and usually
|
||||
uses less memory (with less than 25% more in the worst case) (as of Jan 2019).
|
||||
A nice property is that it does consistently well over a wide range of benchmarks.
|
||||
In our benchmarks, _mimalloc_ always outperforms all other leading
|
||||
allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc) (Apr 2019),
|
||||
and usually uses less memory (up to 25% more in the worst case).
|
||||
A nice property is that it does *consistently* well over the wide
|
||||
range of benchmarks.
|
||||
|
||||
Disclaimer: allocators are interesting as there is no optimal algorithm -- for
|
||||
a given allocator one can always construct a workload where it does not do so well.
|
||||
The goal is thus to find an allocation strategy that performs well over a wide
|
||||
range of benchmarks without suffering from underperformance in less
|
||||
common situations (which is what our second benchmark set tests for).
|
||||
|
||||
|
||||
## Benchmarking
|
||||
|
||||
We tested _mimalloc_ with 5 other allocators over 11 benchmarks.
|
||||
The tested allocators are:
|
||||
|
||||
- **mi**: The mimalloc allocator (version tag `v1.0.0`).
|
||||
- **je**: [jemalloc](https://github.com/jemalloc/jemalloc), by [Jason Evans](https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919) (Facebook);
|
||||
currently (2018) one of the leading allocators and is widely used, for example
|
||||
in BSD, Firefox, and at Facebook. Installed as package `libjemalloc-dev:amd64/bionic 3.6.0-11`.
|
||||
- **tc**: [tcmalloc](https://github.com/gperftools/gperftools), by Google as part of the performance tools.
|
||||
Highly performant and used in the Chrome browser. Installed as package `libgoogle-perftools-dev:amd64/bionic 2.5-2.2ubuntu3`.
|
||||
- **jx**: A compiled version of a more recent instance of [jemalloc](https://github.com/jemalloc/jemalloc).
|
||||
Using commit ` 7a815c1b` ([dev](https://github.com/jemalloc/jemalloc/tree/dev), 2019-01-15).
|
||||
- **hd**: [Hoard](https://github.com/emeryberger/Hoard), by Emery Berger \[1].
|
||||
One of the first multi-thread scalable allocators.
|
||||
([master](https://github.com/emeryberger/Hoard), 2019-01-01, version tag `3.13`)
|
||||
- **mc**: The system allocator. Here we use the LibC allocator (which is originally based on
|
||||
PtMalloc). Using version 2.27. (Note that version 2.26 significantly improved scalability over
|
||||
earlier versions).
|
||||
|
||||
All allocators run exactly the same benchmark programs and use `LD_PRELOAD` to override the system allocator.
|
||||
The wall-clock elapsed time and peak resident memory (_rss_) are
|
||||
measured with the `time` program. The best scores over 5 runs are used.
|
||||
Performance is reported relative to mimalloc, e.g. a time of 66% means that
|
||||
mimalloc ran 1.5× faster (i.e. that mimalloc finished in 66% of the time
|
||||
that the other allocator needed).
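
The reporting convention above can be written out as a few lines of arithmetic. This is a minimal sketch; the helper names `best_of`, `relative_percent`, and `speedup_factor` are hypothetical and not part of any benchmark tooling mentioned here.

```c
#include <assert.h>
#include <stddef.h>

/* Best (smallest) elapsed time over n runs; n must be at least 1. */
double best_of(const double *runs, size_t n) {
  assert(n > 0);
  double best = runs[0];
  for (size_t i = 1; i < n; i++) {
    if (runs[i] < best) best = runs[i];
  }
  return best;
}

/* mimalloc's time as a percentage of the other allocator's time. */
double relative_percent(double mimalloc_secs, double other_secs) {
  return 100.0 * mimalloc_secs / other_secs;
}

/* How many times faster mimalloc ran, given the relative percentage. */
double speedup_factor(double relative_pct) {
  return 100.0 / relative_pct;
}
```

With a relative time of 66%, `speedup_factor(66.0)` gives 100/66, about 1.5, which matches the example in the text.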
|
||||
|
||||
## On a 16-core AMD EPYC running Linux
|
||||
|
||||
Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
|
||||
consisting of a 16-core AMD EPYC 7000 at 2.5GHz
|
||||
with 128GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
|
||||
|
||||
|
||||
The first benchmark set consists of programs that allocate a lot. Relative
|
||||
elapsed time:
|
||||
|
||||
![bench-r5a-4xlarge-t1](bench-r5a-4xlarge-t1.png)
|
||||
|
||||
and memory usage:
|
||||
|
||||
![bench-r5a-4xlarge-m1](bench-r5a-4xlarge-m1.png)
|
||||
|
||||
The benchmarks above are (with N=16 in our case):
|
||||
|
||||
- __cfrac__: by Dave Barrett, implementation of continued fraction factorization:
|
||||
uses many small short-lived allocations. Factorizes as `./cfrac 175451865205073170563711388363274837927895`.
|
||||
- __espresso__: a programmable logic array analyzer \[3].
|
||||
- __barnes__: a hierarchical n-body particle solver \[4]. Simulates 163840 particles.
|
||||
- __leanN__: by Leonardo de Moura _et al_, the [lean](https://github.com/leanprover/lean)
|
||||
compiler, version 3.4.1, compiling its own standard library concurrently using N cores (`./lean --make -j N`).
|
||||
Big real-world workload with intensive allocation, takes about 1:40s when running on a
|
||||
single high-end core.
|
||||
- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
|
||||
1 million requests pushing 10 new list elements and then requesting the
|
||||
head 10 elements. Measures the requests handled per second.
|
||||
- __alloc-test__: a modern [allocator test](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/)
|
||||
developed by OLogN Technologies AG at [ITHare.com](http://ithare.com). Simulates intensive allocation workloads with a Pareto
|
||||
size distribution. The `alloc-testN` benchmark runs on N cores doing 100×10<sup>6</sup>
|
||||
allocations per thread with objects up to 1KB in size.
|
||||
Using commit `94f6cb` ([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04)
|
||||
|
||||
We can see mimalloc outperforms the other allocators moderately but all
|
||||
these modern allocators perform well.
|
||||
In `cfrac`, mimalloc is about 13%
|
||||
faster than jemalloc for many small and short-lived allocations.
|
||||
The `cfrac` and `espresso` programs do not use much
|
||||
memory (~1.5MB) so it does not matter too much, but still mimalloc uses about half the resident
|
||||
memory of tcmalloc (and almost 5× less than Hoard on `espresso`).
|
||||
|
||||
_The `leanN` program is most interesting as a large realistic and concurrent
|
||||
workload and there is a 6% speedup over both tcmalloc and jemalloc. This is
|
||||
quite significant: if Lean spends (optimistically) 20% of its time in the allocator
|
||||
that means that mimalloc is 1.5× faster than the others._
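
The estimate above follows from attributing the whole saving to the allocator. The sketch below works through that arithmetic under two assumptions that the text does not state explicitly: the 6% is read as a reduction in total running time, and none of the saving comes from non-allocator code. The function name `allocator_speedup` is hypothetical.

```c
#include <assert.h>

/* Baseline total time is normalized to 1.0.
   alloc_fraction: share of baseline time spent in the allocator.
   total_saved:    share of baseline time saved overall.
   Returns how many times faster the allocator part must have become. */
double allocator_speedup(double alloc_fraction, double total_saved) {
  assert(total_saved >= 0.0 && total_saved < alloc_fraction);
  double new_alloc_time = alloc_fraction - total_saved;
  return alloc_fraction / new_alloc_time;
}
```

For a 20% allocator share and a 6% overall saving this gives 0.20 / 0.14, roughly 1.4, which is in the neighborhood of the factor quoted in the text. A smaller allocator share would imply a larger factor.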
|
||||
|
||||
The `alloc-test` is very allocation intensive and we see the larger
|
||||
differences here. Since all allocators perform almost identically on `alloc-test1`
|
||||
as `alloc-testN`, we can see that they are all excellent and scale (almost) linearly.
|
||||
|
||||
The second benchmark set tests specific aspects of the allocators and
|
||||
shows more extreme differences between allocators:
|
||||
|
||||
![bench-r5a-4xlarge-t2](bench-r5a-4xlarge-t2.png)
|
||||
|
||||
|
||||
|
||||
![bench-r5a-4xlarge-m2](bench-r5a-4xlarge-m2.png)
|
||||
|
||||
The benchmarks in the second set are (again with N=16):
|
||||
|
||||
- __larson__: by Larson and Krishnan \[2]. Simulates a server workload using 100
|
||||
separate threads where
|
||||
they allocate and free many objects but leave some objects to
|
||||
be freed by other threads. Larson and Krishnan observe this behavior
|
||||
(which they call _bleeding_) in actual server applications, and the
|
||||
benchmark simulates this.
|
||||
- __sh6bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
|
||||
single-threaded allocation where some of the objects are freed
|
||||
in a usual last-allocated, first-freed (LIFO) order, but others
|
||||
are freed in reverse order. Using the public [source](http://www.microquill.com/smartheap/shbench/bench.zip) (retrieved 2019-01-02)
|
||||
- __sh8bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
|
||||
multithreaded allocation (with N threads) where, just as in `larson`, some objects are freed
|
||||
by other threads, and some objects freed in reverse (as in `sh6bench`).
|
||||
Using the public [source](http://www.microquill.com/smartheap/SH8BENCH.zip) (retrieved 2019-01-02)
|
||||
- __cache-scratch__: by Emery Berger _et al_ \[1]. Introduced with the Hoard
|
||||
allocator to test for _passive-false_ sharing of cache lines: first some
|
||||
small objects are allocated and given to each thread; the threads free that
|
||||
object and allocate another one and access that repeatedly. If an allocator
|
||||
allocates objects from different threads close to each other this will
|
||||
lead to cache-line contention.
|
||||
|
||||
In the `larson` server workload mimalloc is 2.5× faster than
|
||||
tcmalloc and jemalloc which is quite surprising -- probably due to the object
|
||||
migration between different threads. Also in `sh6bench` mimalloc does much
|
||||
better than the others (more than 4× faster than jemalloc).
|
||||
We cannot explain this well but believe it may be
|
||||
caused in part by the "reverse" free-ing in `sh6bench`. Again in `sh8bench`
|
||||
the mimalloc allocator handles object migration between threads much better.
|
||||
|
||||
The `cache-scratch` benchmark also demonstrates the different architectures
|
||||
of the allocators nicely. With a single thread they all perform the same, but when
|
||||
running with multiple threads the allocator induced false sharing of the
|
||||
cache lines causes large run-time differences, where mimalloc is up to
|
||||
20× faster than tcmalloc here. Only the original jemalloc does almost
|
||||
as well (but the most recent version, jxmalloc, regresses). The
|
||||
Hoard allocator is specifically designed to avoid this false sharing and we
|
||||
are not sure why it is not doing well here (although it runs still 5× as
|
||||
fast as tcmalloc and jxmalloc).
|
||||
|
||||
|
||||
## On an 8-core Intel Xeon running Linux
|
||||
|
||||
Testing on a compute optimized Amazon EC2 instance ([c5d.2xlarge](https://aws.amazon.com/ec2/instance-types/))
|
||||
consisting of an 8-core Intel Xeon Platinum at 3GHz (up to 3.5GHz turbo boost)
|
||||
with 16GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
|
||||
|
||||
First the regular workload benchmarks (with N=8):
|
||||
|
||||
![bench-c5d-2xlarge-t1](bench-c5d-2xlarge-t1.png)
|
||||
|
||||
|
||||
|
||||
![bench-c5d-2xlarge-m1](bench-c5d-2xlarge-m1.png)
|
||||
|
||||
Most results are quite similar to the 16-core AMD machine except the
|
||||
differences are less pronounced with all a bit closer to mimalloc performance.
|
||||
|
||||
This is shown too in the second set of benchmarks:
|
||||
|
||||
![bench-c5d-2xlarge-t2](bench-c5d-2xlarge-t2.png)
|
||||
|
||||
|
||||
|
||||
![bench-c5d-2xlarge-m2](bench-c5d-2xlarge-m2.png)
|
||||
|
||||
On the server workload of `larson` everyone does a bit better on the 8-cores
|
||||
than on 16. On the other benchmarks the performance does not improve though.
|
||||
|
||||
|
||||
## On Windows (4-core Intel Xeon)
|
||||
|
||||
Testing on an HP Z4 G4 Workstation with a 4-core Intel® Xeon® W2123 at 3.6 GHz
|
||||
with 16GB ECC memory, running Windows 10 Pro (version 10.0.17134 Build 17134)
|
||||
with Visual Studio 2017 (version 15.8.9).
|
||||
|
||||
Since we cannot use `LD_PRELOAD` on Windows we compiled a subset of our
|
||||
allocators and benchmarks and linked them statically. The **je** benchmark
|
||||
is therefore equivalent to the **jx** benchmark in the previous graphs.
|
||||
The **mc** allocator now refers to the standard Microsoft allocator.
|
||||
Unfortunately we could not get Hoard to work on Windows at this time.
|
||||
|
||||
We used the Windows call `QueryPerformanceCounter` to measure elapsed wall-clock
|
||||
times, and `GetProcessMemoryInfo` to measure the peak working set (rss).
|
||||
|
||||
First the regular workload benchmarks:
|
||||
|
||||
![bench-z4-win-t1](bench-z4-win-t1.png)
|
||||
|
||||
|
||||
|
||||
![bench-z4-win-m1](bench-z4-win-m1.png)
|
||||
|
||||
Here mimalloc and tcmalloc perform very similarly, and outperform the system
|
||||
allocator by a significant margin. Somehow jemalloc does much worse than
|
||||
running on Linux. It is not clear why yet, but it might be a compilation issue:
|
||||
when running through the profiler the `__chkstk` routine takes
|
||||
quite some time. This is a compiler inserted runtime function to check for enough
|
||||
stack space if there are many local variables or when the compiler cannot make
|
||||
a static estimate. Perhaps this is the culprit but it needs more investigation.
|
||||
|
||||
The second set of benchmarks shows again more pronounced differences:
|
||||
|
||||
![bench-z4-win-t2](bench-z4-win-t2.png)
|
||||
|
||||
|
||||
|
||||
![bench-z4-win-m2](bench-z4-win-m2.png)
|
||||
|
||||
In the `larson` server workload mimalloc is 25% faster than
|
||||
tcmalloc, and both significantly outperform the system allocator.
|
||||
(again probably due to the object
|
||||
migration between different threads).
|
||||
Also in `sh6bench` and `sh8bench`, mimalloc scales much
|
||||
better than the others.
|
||||
|
||||
## References
|
||||
|
||||
- \[1] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson.
|
||||
_Hoard: A Scalable Memory Allocator for Multithreaded Applications_
|
||||
the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000.
|
||||
[pdf](http://www.cs.utexas.edu/users/mckinley/papers/asplos-2000.pdf)
|
||||
|
||||
- \[2] P. Larson and M. Krishnan. _Memory allocation for long-running server applications_. In ISMM, Vancouver, B.C., Canada, 1998.
|
||||
[pdf](http://citeseemi.ist.psu.edu/viewdoc/download;jsessionid=5F0BFB4F57832AEB6C11BF8257271088?doi=10.1.1.45.1947&rep=rep1&type=pdf)
|
||||
|
||||
- \[3] D. Grunwald, B. Zorn, and R. Henderson.
|
||||
_Improving the cache locality of memory allocation_. In R. Cartwright, editor,
|
||||
Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993.
|
||||
[pdf](http://citeseemi.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf)
|
||||
|
||||
- \[4] J. Barnes and P. Hut. _A hierarchical O(n*log(n)) force-calculation algorithm_. Nature, 324:446-449, 1986.
|
||||
See the [Performance](https://github.com/microsoft/mimalloc#Performance)
|
||||
section in the _mimalloc_ repository for benchmark results,
|
||||
or the technical report for detailed benchmark results.
|
||||
|
||||
*/
|
||||
|