2020-09-30 11:47:33 +03:00
|
|
|
# chibicc: A Small C Compiler
|
|
|
|
|
|
|
|
(The old master has moved to
|
|
|
|
[historical/old](https://github.com/rui314/chibicc/tree/historical/old)
|
|
|
|
branch. This is a new one uploaded in September 2020.)
|
|
|
|
|
|
|
|
chibicc is yet another small C compiler that implements most C11
|
|
|
|
features. Even though it still probably falls into the "toy compilers"
|
|
|
|
category just like other small compilers do, chibicc can compile
|
|
|
|
several real-world programs, including [Git](https://git-scm.com/),
|
|
|
|
[SQLite](https://sqlite.org),
|
|
|
|
[libpng](http://www.libpng.org/pub/png/libpng.html) and chibicc
|
|
|
|
itself, without making modifications to the compiled programs.
|
|
|
|
Generated executables of these programs pass their corresponding test
|
|
|
|
suites. So, chibicc actually supports a wide variety of C11 features
|
|
|
|
and is able to compile hundreds of thousands of lines of real-world C
|
|
|
|
code correctly.
|
|
|
|
|
|
|
|
chibicc is developed as the reference implementation for a book I'm
|
|
|
|
currently writing about the C compiler and the low-level programming.
|
|
|
|
The book covers the vast topic with an incremental approach; in the first
|
|
|
|
chapter, readers will implement a "compiler" that accepts just a single
|
|
|
|
number as a "language", which will then gain one feature at a time in each
|
|
|
|
section of the book until the language that the compiler accepts matches
|
|
|
|
what the C11 spec specifies. I took this incremental approach from [the
|
|
|
|
paper](http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf) by Abdulaziz
|
|
|
|
Ghuloum.
|
|
|
|
|
|
|
|
Each commit of this project corresponds to a section of the book. For this
|
|
|
|
purpose, not only the final state of the project but each commit was
|
|
|
|
carefully written with readability in mind. Readers should be able to learn
|
|
|
|
how a C language feature can be implemented just by reading one or a few
|
|
|
|
commits of this project. For example, this is how
|
|
|
|
[while](https://github.com/rui314/chibicc/commit/773115ab2a9c4b96f804311b95b20e9771f0190a),
|
|
|
|
[[]](https://github.com/rui314/chibicc/commit/75fbd3dd6efde12eac8225d8b5723093836170a5),
|
|
|
|
[?:](https://github.com/rui314/chibicc/commit/1d0e942fd567a35d296d0f10b7693e98b3dd037c),
|
|
|
|
and [thread-local
|
|
|
|
variable](https://github.com/rui314/chibicc/commit/79644e54cc1805e54428cde68b20d6d493b76d34)
|
|
|
|
are implemented. If you have plenty of spare time, it might be fun to read
|
|
|
|
it from the [first
|
|
|
|
commit](https://github.com/rui314/chibicc/commit/0522e2d77e3ab82d3b80a5be8dbbdc8d4180561c).
|
|
|
|
|
|
|
|
If you like this project, please consider purchasing a copy of the book
|
|
|
|
when it becomes available! 😀 I publish the source code here to give people
|
|
|
|
early access to it, because I was planing to do that anyway with a
|
|
|
|
permissive open-source license after publishing the book. If I don't charge
|
|
|
|
for the source code, it doesn't make much sense to me to keep it private. I
|
|
|
|
hope to publish the book in 2021.
|
|
|
|
You can sign up [here](https://forms.gle/sgrMWHGeGjeeEJcX7) to receive a
|
|
|
|
notification when a free chapter is available online or the book is published.
|
|
|
|
|
|
|
|
I pronounce chibicc as _chee bee cee cee_. "chibi" means "mini" or
|
|
|
|
"small" in Japanese. "cc" stands for C compiler.
|
|
|
|
|
|
|
|
## Status
|
|
|
|
|
|
|
|
chibicc supports almost all mandatory features and most optional
|
|
|
|
features of C11 as well as a few GCC language extensions.
|
|
|
|
|
|
|
|
Features that are often missing in a small compiler but supported by
|
|
|
|
chibicc include (but not limited to):
|
|
|
|
|
|
|
|
- Preprocessor
|
|
|
|
- float, double and long double (x87 80-bit floating point numbers)
|
|
|
|
- Bit-fields
|
|
|
|
- alloca()
|
|
|
|
- Variable-length arrays
|
|
|
|
- Compound literals
|
|
|
|
- Thread-local variables
|
|
|
|
- Atomic variables
|
|
|
|
- Common symbols
|
|
|
|
- Designated initializers
|
|
|
|
- L, u, U and u8 string literals
|
|
|
|
- Functions that take or return structs as values, as specified by the
|
|
|
|
x86-64 SystemV ABI
|
|
|
|
|
|
|
|
chibicc does not support complex numbers, K&R-style function prototypes
|
|
|
|
and GCC-style inline assembly. Digraphs and trigraphs are intentionally
|
|
|
|
left out.
|
|
|
|
|
|
|
|
chibicc outputs a simple but nice error message when it finds an error in
|
|
|
|
source code.
|
|
|
|
|
|
|
|
There's no optimization pass. chibicc emits terrible code which is probably
|
|
|
|
twice or more slower than GCC's output. I have a plan to add an
|
|
|
|
optimization pass once the frontend is done.
|
|
|
|
|
|
|
|
I'm using Ubuntu 20.04 for x86-64 as a development platform. I made a
|
|
|
|
few small changes so that chibicc works on Ubuntu 18.04, Fedora 32 and
|
|
|
|
Gentoo 2.6, but portability is not my goal at this moment. It may or
|
|
|
|
may not work on systems other than Ubuntu 20.04.
|
|
|
|
|
|
|
|
## Internals
|
|
|
|
|
|
|
|
chibicc consists of the following stages:
|
|
|
|
|
|
|
|
- Tokenize: A tokenizer takes a string as an input, breaks it into a list
|
|
|
|
of tokens and returns them.
|
|
|
|
|
|
|
|
- Preprocess: A preprocessor takes as an input a list of tokens and output
|
|
|
|
a new list of macro-expanded tokens. It interprets preprocessor
|
|
|
|
directives while expanding macros.
|
|
|
|
|
|
|
|
- Parse: A recursive descendent parser constructs abstract syntax trees
|
|
|
|
from the output of the preprocessor. It also adds a type to each AST
|
|
|
|
node.
|
|
|
|
|
|
|
|
- Codegen: A code generator emits an assembly text for given AST nodes.
|
|
|
|
|
|
|
|
## Contributing
|
|
|
|
|
|
|
|
When I find a bug in this compiler, I go back to the original commit that
|
|
|
|
introduced the bug and rewrite the commit history as if there were no such
|
|
|
|
bug from the beginning. This is an unusual way of fixing bugs, but as a
|
|
|
|
part of a book, it is important to keep every commit bug-free.
|
|
|
|
|
|
|
|
Thus, I do not take pull requests in this repo. You can send me a pull
|
|
|
|
request if you find a bug, but it is very likely that I will read your
|
|
|
|
patch and then apply that to my previous commits by rewriting history. I'll
|
|
|
|
credit your name somewhere, but your changes will be rewritten by me before
|
|
|
|
submitted to this repository.
|
|
|
|
|
|
|
|
Also, please assume that I will occasionally force-push my local repository
|
|
|
|
to this public one to rewrite history. If you clone this project and make
|
|
|
|
local commits on top of it, your changes will have to be rebased by hand
|
|
|
|
when I force-push new commits.
|
|
|
|
|
|
|
|
## Design principles
|
|
|
|
|
|
|
|
chibicc's core value is its simplicity and the reability of its source
|
|
|
|
code. To achieve this goal, I was careful not to be too clever when
|
|
|
|
writing code. Let me explain what that means.
|
|
|
|
|
|
|
|
Oftentimes, as you get used to the code base, you are tempted to
|
|
|
|
_improve_ the code using more abstractions and clever tricks.
|
|
|
|
But that kind of _improvements_ don't always improve readability for
|
|
|
|
first-time readers and can actually hurts it. I tried to avoid the
|
|
|
|
pitfall as much as possible. I wrote this code not for me but for
|
|
|
|
first-time readers.
|
|
|
|
|
|
|
|
If you take a look at the source code, you'll find a couple of
|
|
|
|
dumb-looking pieces of code. These are written intentionally that way
|
|
|
|
(but at some places I might be actually missing something,
|
|
|
|
though). Here is a few notable examples:
|
|
|
|
|
|
|
|
- The recursive descendent parser contains many similar-looking functions
|
|
|
|
for similar-looking generative grammar rules. You might be tempted
|
|
|
|
to _improve_ it to reduce the duplication using higher-order functions
|
|
|
|
or macros, but I thought that that's too complicated. It's better to
|
|
|
|
allow small duplications instead.
|
|
|
|
|
|
|
|
- chibicc doesn't try too hard to save memory. An entire input source
|
|
|
|
file is read to memory first before the tokenizer kicks in, for example.
|
|
|
|
|
|
|
|
- Slow algorithms are fine if we know that n isn't too big.
|
|
|
|
For example, we use a linked list as a set in the preprocessor, so
|
|
|
|
the membership check takes O(n) where n is the size of the set. But
|
|
|
|
that's fine because we know n is usually very small.
|
|
|
|
And even if n can be very big, I stick with a simple slow algorithm
|
|
|
|
until it is proved by benchmarks that that's a bottleneck.
|
|
|
|
|
|
|
|
- Each AST node type uses only a few members of the `Node` struct members.
|
|
|
|
Other unused `Node` members are just a waste of memory at runtime.
|
|
|
|
We could save memory using unions, but I decided to simply put everything
|
|
|
|
in the same struct instead. I believe the inefficiency is negligible.
|
|
|
|
Even if it matters, we can always change the code to use unions
|
|
|
|
at any time. I wanted to avoid premature optimization.
|
|
|
|
|
|
|
|
- chibicc always allocates heap memory using `calloc`, which is a
|
|
|
|
variant of `malloc` that clears memory with zero. `calloc` is
|
|
|
|
slightly slower than `malloc`, but that should be neligible.
|
|
|
|
|
|
|
|
- Last but not least, chibicc allocates memory using `calloc` but never
|
|
|
|
calls `free`. Allocated heap memory is not freed until the process exits.
|
|
|
|
I'm sure that this memory management policy (or lack thereof) looks
|
|
|
|
very odd, but it makes sense for short-lived programs such as compilers.
|
|
|
|
DMD, a compiler for the D programming language, uses the same memory
|
|
|
|
management scheme for the same reason, for example [1].
|
|
|
|
|
|
|
|
## About the Author
|
|
|
|
|
|
|
|
I'm Rui Ueyama. I'm the creator of [8cc](https://github.com/rui314/8cc),
|
|
|
|
which is a hobby C compiler, and also the original creator of the current
|
|
|
|
version of [LLVM lld](https://lld.llvm.org) linker, which is a
|
|
|
|
production-quality linker used by various operating systems and large-scale
|
|
|
|
build systems.
|
|
|
|
|
|
|
|
## References
|
|
|
|
|
|
|
|
- [tcc](https://bellard.org/tcc/): A small C compiler written by Fabrice
|
|
|
|
Bellard. I learned a lot from this compiler, but the design of tcc and
|
|
|
|
chibicc are different. In particular, tcc is a one-pass compiler, while
|
|
|
|
chibicc is a multi-pass one.
|
|
|
|
|
|
|
|
- [lcc](https://github.com/drh/lcc): Another small C compiler. The creators
|
|
|
|
wrote a [book](https://sites.google.com/site/lccretargetablecompiler/)
|
|
|
|
about the internals of lcc, which I found a good resource to see how a
|
|
|
|
compiler is implemented.
|
|
|
|
|
|
|
|
- [An Incremental Approach to Compiler
|
|
|
|
Construction](http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf)
|
|
|
|
|
|
|
|
- [Rob Pike's 5 Rules of Programming](https://users.ece.utexas.edu/~adnan/pike.html)
|
|
|
|
|
|
|
|
[1] https://www.drdobbs.com/cpp/increasing-compiler-speed-by-over-75/240158941
|
|
|
|
|
|
|
|
> DMD does memory allocation in a bit of a sneaky way. Since compilers
|
|
|
|
> are short-lived programs, and speed is of the essence, DMD just
|
|
|
|
> mallocs away, and never frees.
|