The STATIC macro was introduced a very long time ago in commit
d5df6cd44a. The original reason for this was
to have the option to define it to nothing so that all static functions
become global functions and therefore visible to certain debug tools, so
one could do function size comparison and other things.
This STATIC feature is rarely (if ever) used. And with the use of LTO and
heavy inline optimisation, analysing the size of individual functions when
they are not static is not a good representation of the size of code when
fully optimised.
So the macro does not have much use and it's simpler to just remove it.
Then you know exactly what it's doing. For example, newcomers don't have
to learn what the STATIC macro is and why it exists. Reading the code is
also less "loud" with a lowercase static.
One other minor point in favour of removing it, is that it stops bugs with
`STATIC inline`, which should always be `static inline`.
Methodology for this commit was:
1) git ls-files | egrep '\.[ch]$' | \
xargs sed -Ei "s/(^| )STATIC($| )/\1static\2/"
2) Do some manual cleanup in the diff by searching for the word STATIC in
comments and changing those back.
3) "git-grep STATIC docs/", manually fixed those cases.
4) "rg -t python STATIC", manually fixed codegen lines that used STATIC.
This work was funded through GitHub Sponsors.
Signed-off-by: Angus Gratton <angus@redyak.com.au>
There are two main changes here to improve the calculation of the size of
the next heap area when automatically expanding the heap:
- Compute the existing total size by counting the total number of GC
blocks, and then using that to compute the corresponding number of bytes.
- Round the bytes value up to the nearest multiple of BYTES_PER_BLOCK.
This makes the calculation slightly simpler and more accurate, and makes
sure that, in the case of growing from one area to two areas, the number
of bytes allocated from the system for the second area is the same as the
first. For example on esp32 with an initial area size of 65536 bytes, the
subsequent allocation is also 65536 bytes. Previously it was a number that
was not even a multiple of 2.
Signed-off-by: Damien George <damien@micropython.org>
When set, the split heap is automatically extended with new areas on
demand, and shrunk if a heap area becomes empty during a GC pass or soft
reset.
To save code size the size allocation for a new heap block (including
metadata) is estimated at 103% of the failed allocation, rather than
working from the more complex algorithm in gc_try_add_heap(). This appears
to work well except in the extreme limit case when almost all RAM is
exhausted (~last few hundred bytes). However in this case some allocation
is likely to fail soon anyhow.
Currently there is no API to manually add a block of a given size to the
heap, although that could easily be added if necessary.
This work was funded through GitHub Sponsors.
Signed-off-by: Angus Gratton <angus@redyak.com.au>
This commit:
- Breaks up some long lines for readability.
- Fixes a potential macro argument expansion issue.
This work was funded through GitHub Sponsors.
Signed-off-by: Angus Gratton <angus@redyak.com.au>
In applications that use little memory and run GC regularly, the cost of
the sweep phase quickly becomes prohibitives as the amount of RAM
increases.
On an ESP32-S3 with 2 MB of external SPIRAM, for example, a trivial GC
cycle takes a minimum of 40ms, virtually all of it in the sweep phase.
Similarly, on the UNIX port with 1 GB of heap, a trivial GC takes 47 ms,
again virtually all of it in the sweep phase.
This commit speeds up the sweep phase in the case most of the heap is empty
by keeping track of the ID of the highest block we allocated in an area
since the last GC.
The performance benchmark run on PYBV10 shows between +0 and +2%
improvement across the existing performance tests. These tests don't
really stress the GC, so they were also run with gc.threshold(30000) and
gc.threshold(10000). For the 30000 case, performance improved by up to
+10% with this commit. For the 10000 case, performance improved by at
least +10% on 6 tests, and up to +25%.
Signed-off-by: Damien George <damien@micropython.org>
Changes in this commit:
- Add MICROPY_GC_HOOK_LOOP to gc_info() and gc_alloc(). Both of these can
be long running (many milliseconds) which is too long to be blocking in
some applications.
- Pass loop variable to MICROPY_GC_HOOK_LOOP(i) macro so that implementers
can use it, e.g. to improve performance by only calling a function every
X number of iterations.
- Drop outer call to MICROPY_GC_HOOK_LOOP in gc_mark_subtree().
This adds a mechanism to track a pending notify/indicate operation that
is deferred due to the send buffer being full. This uses a tracked alloc
that is passed as the content arg to the callback.
This replaces the previous mechanism that did this via the global pending
op queue, shared with client read/write ops.
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Showing 8 digits instead of 5, supporting devices with more than 1 MByte of
RAM (which is common these days). The masking was never needed, and the
related commented-out line can go.
The assertion that is added here (to gc.c) fails when running this new test
if ALLOC_TABLE_GAP_BYTE is set to 0.
Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Damien George <damien@micropython.org>
Prior to this fix the follow crash occurred. With a GC layout of:
GC layout:
alloc table at 0x3fd80428, length 32001 bytes, 128004 blocks
finaliser table at 0x3fd88129, length 16001 bytes, 128008 blocks
pool at 0x3fd8bfc0, length 2048064 bytes, 128004 blocks
Block 128003 is an AT_HEAD and eventually is passed to gc_mark_subtree.
This causes gc_mark_subtree to call ATB_GET_KIND(128004). When block 1 is
created with a finaliser, the first byte of the finaliser table becomes
0x2, but ATB_GET_KIND(128004) reads these bits as AT_TAIL, and then
gc_mark_subtree references past the end of the heap, which happened to be
past the end of PSRAM on the esp32-s2.
The fix in this commit is to ensure there is a one-byte gap after the ATB
filled permanently with AT_FREE.
Fixes issue #7116.
See also https://github.com/adafruit/circuitpython/issues/5021
Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Damien George <damien@micropython.org>
When you want to use the valgrind memory analysis tool on MicroPython, you
can arrange to define MICROPY_DEBUG_VALGRIND to enable use of special
valgrind macros. For now, this only fixes `gc_get_ptr` so that it never
emits the diagnostic "Conditional jump or move depends on uninitialised
value(s)".
Signed-off-by: Jeff Epler <jepler@gmail.com>
Use C macros to reduce the size of firmware images when the GC split-heap
feature is disabled.
The code size difference of this commit versus HEAD~2 (ie the commit prior
to MICROPY_GC_SPLIT_HEAP being introduced) when split-heap is disabled is:
bare-arm: +0 +0.000%
minimal x86: +0 +0.000%
unix x64: -16 -0.003%
unix nanbox: -20 -0.004%
stm32: -8 -0.002% PYBV10
cc3200: +0 +0.000%
esp8266: +8 +0.001% GENERIC
esp32: +0 +0.000% GENERIC
nrf: -20 -0.011% pca10040
rp2: +0 +0.000% PICO
samd: -4 -0.003% ADAFRUIT_ITSYBITSY_M4_EXPRESS
The code size difference of this commit versus HEAD~2 split-heap is enabled
with MICROPY_GC_MULTIHEAP=1 (but no extra code to add more heaps):
unix x64: +1032 +0.197% [incl +544(bss)]
esp32: +592 +0.039% GENERIC[incl +16(data) +264(bss)]
This commit adds a new option MICROPY_GC_SPLIT_HEAP (disabled by default)
which, when enabled, allows the GC heap to be split over multiple memory
areas/regions. The first area is added with gc_init() and subsequent areas
can be added with gc_add(). New areas can be added at runtime. Areas are
stored internally as a linked list, and calls to gc_alloc() can be
satisfied from any area.
This feature has the following use-cases (among others):
- The ESP32 has a fragmented OS heap, so to use all (or more) of it the
GC heap must be split.
- Other MCUs may have disjoint RAM regions and are now able to use them
all for the GC heap.
- The user could explicitly increase the size of the GC heap.
- Support a dynamic heap while running on an OS, adding more heap when
necessary.
This makes it possible for cooperative multitasking systems to keep running
event loops during garbage collector operations.
For example, this can be used to ensure that a motor control loop runs
approximately each 5 ms. Without this hook, the loop time can jump to
about 15 ms.
Addresses #3475.
Signed-off-by: Laurens Valk <laurens@pybricks.com>
This commit makes gc_lock_depth have one counter per thread, instead of one
global counter. This makes threads properly independent with respect to
the GC, in particular threads can now independently lock the GC for
themselves without locking it for other threads. It also means a given
thread can run a hard IRQ without temporarily locking the GC for all other
threads and potentially making them have MemoryError exceptions at random
locations (this really only occurs on MCUs with multiple cores and no GIL,
eg on the rp2 port).
The commit also removes protection of the GC lock/unlock functions, which
is no longer needed when the counter is per thread (and this also fixes the
cas where a hard IRQ calling gc_lock() may stall waiting for the mutex).
It also puts the check for `gc_lock_depth > 0` outside the GC mutex in
gc_alloc, gc_realloc and gc_free, to potentially prevent a hard IRQ from
waiting on a mutex if it does attempt to allocate heap memory (and putting
the check outside the GC mutex is now safe now that there is a
gc_lock_depth per thread).
Signed-off-by: Damien George <damien@micropython.org>
The "word" referred to by BYTES_PER_WORD is actually the size of mp_obj_t
which is not always the same as the size of a pointer on the target
architecture. So rename this config value to better reflect what it
measures, and also prefix it with MP_.
For uses of BYTES_PER_WORD in setting the stack limit this has been
changed to sizeof(void *), because the stack usually grows with
machine-word sized values (eg an nlr_buf_t has many machine words in it).
Signed-off-by: Damien George <damien@micropython.org>
Newer GCC versions are able to warn about switch cases that fall
through. This is usually a sign of a forgotten break statement, but in
the few cases where a fall through is intended we annotate it with this
macro to avoid the warning.
Note: the uncrustify configuration is explicitly set to 'add' instead of
'force' in order not to alter the comments which use extra spaces after //
as a means of indenting text for clarity.
This string is recognised by uncrustify, to disable formatting in the
region marked by these comments. This is necessary in the qstrdef*.h files
to prevent modification of the strings within the Q(...). In other places
it is used to prevent excessive reformatting that would make the code less
readable.
When threads and the GIL are enabled, then the GC mutex is not needed. The
gc_mutex field is never used in this case because of:
#if MICROPY_PY_THREAD && !MICROPY_PY_THREAD_GIL
#define GC_ENTER() mp_thread_mutex_lock(&MP_STATE_MEM(gc_mutex), 1)
#define GC_EXIT() mp_thread_mutex_unlock(&MP_STATE_MEM(gc_mutex))
#else
#define GC_ENTER()
#define GC_EXIT()
#endif
So, we can completely remove gc_mutex everywhere when MICROPY_PY_THREAD
&& !MICROPY_PY_THREAD_GIL.
The older "bool has_finaliser" gets recast as GC_ALLOC_FLAG_HAS_FINALISER=1
so this is a backwards compatible change to the signature. Since bool gets
implicitly converted to 1 this patch doesn't include conversion of all
calls.
Otherwise there is the possibility that n_free starts out non-zero from the
previous iteration, which may have found a few (but not enough) free blocks
at the end of the heap. If this is the case, and if the very first blocks
that are scanned the second time around (starting at
gc_last_free_atb_index) are found to give enough memory (including the
blocks at the end of the heap from the previous iteration that left n_free
non-zero) then memory will be allocated starting before the location that
gc_last_free_atb_index points to, most likely leading to corruption.
This serious bug did not manifest itself in the past because a gc_collect
always resets gc_last_free_atb_index to point to the start of the GC heap,
and the first block there is almost always allocated to a long-lived
object (eg entries from sys.path, or mounted filesystem objects), which
means that n_free would be reset at the start of the search loop.
But with threading enabled with the GIL disabled it is possible to trigger
the bug via the following sequence of events:
1. Thread A runs gc_alloc, fails to find enough memory, and has a non-zero
n_free at the end of the search.
2. Thread A calls gc_collect and frees a bunch of blocks on the GC heap.
3. Just after gc_collect finishes in thread A, thread B takes gc_mutex and
does an allocation, moving gc_last_free_atb_index to point to the
interior of the heap, to a place where there is most likely a run of
available blocks.
4. Thread A regains gc_mutex and does its second search for free memory,
starting with a non-zero n_free. Since it's likely that the first block
it searches is available it will allocate memory which overlaps with the
memory before gc_last_free_atb_index.
DEBUG_printf and MICROPY_DEBUG_PRINTER is now used instead of normal
printf, and a fault is fixed in mp_obj_class_lookup with debugging enabled;
see issue #3999. Debugging can now be enabled on all ports including when
nan-boxing is used.
This patch adds the gc_sweep_all() function which does a garbage collection
without tracing any root pointers, so frees all the memory, and most
importantly runs any remaining finalisers.
This helps primarily for soft reset: it will close any open files, any open
sockets, and help to get the system back to a clean state upon soft reset.
Without this, if GC threshold is hit and there is not enough memory left to
satisfy the request, gc_collect() will run a second time and the search for
memory will happen again and will fail again.
Thanks to @adritium for pointing out this issue, see #3786.
This patch moves the start of the root pointer section in mp_state_ctx_t
so that it skips entries that are not pointers and don't need scanning.
Previously, the start of the root pointer section was at the very beginning
of the mp_state_ctx_t struct (which is the beginning of mp_state_thread_t).
This was the original assembler version of the NLR code was hard-coded to
have the nlr_top pointer at the start of this state structure. But now
that the NLR code is partially written in C there is no longer this
restriction on the location of nlr_top (and a comment to this effect has
been removed in this patch).
So now the root pointer section starts part way through the
mp_state_thread_t structure, after the entries which are not root pointers.
This patch also moves the non-pointer entries for MICROPY_ENABLE_SCHEDULER
outside the root pointer section.
Moving non-pointer entries out of the root pointer section helps to make
the GC more precise and should help to prevent some cases of collectable
garbage being kept.
This patch also has a measurable improvement in performance of the
pystone.py benchmark: on unix x86-64 and stm32 there was an improvement of
roughly 0.6% (tested with both gcc 7.3 and gcc 8.1).
This macro is written out explicitly in the two locations that it is used
and then the code is optimised, opening possibilities for further
optimisations and reducing code size:
unix: -48
minimal CROSS=1: -32
stm32: -32
This patch introduces the MICROPY_ENABLE_PYSTACK option (disabled by
default) which enables a "Python stack" that allows to allocate and free
memory in a scoped, or Last-In-First-Out (LIFO) way, similar to alloca().
A new memory allocation API is introduced along with this Py-stack. It
includes both "local" and "nonlocal" LIFO allocation. Local allocation is
intended to be equivalent to using alloca(), whereby the same function must
free the memory. Nonlocal allocation is where another function may free
the memory, so long as it's still LIFO.
Follow-up patches will convert all uses of alloca() and VLA to the new
scoped allocation API. The old behaviour (using alloca()) will still be
available, but when MICROPY_ENABLE_PYSTACK is enabled then alloca() is no
longer required or used.
The benefits of enabling this option are (or will be once subsequent
patches are made to convert alloca()/VLA):
- Toolchains without alloca() can use this feature to obtain correct and
efficient scoped memory allocation (compared to using the heap instead
of alloca(), which is slower).
- Even if alloca() is available, enabling the Py-stack gives slightly more
efficient use of stack space when calling nested Python functions, due to
the way that compilers implement alloca().
- Enabling the Py-stack with the stackless mode allows for even more
efficient stack usage, as well as retaining high performance (because the
heap is no longer used to build and destroy stackless code states).
- With Py-stack and stackless enabled, Python-calling-Python is no longer
recursive in the C mp_execute_bytecode function.
The micropython.pystack_use() function is included to measure usage of the
Python stack.
Accessing them will crash immediately instead still working for some time,
until overwritten by some other data, leading to much less deterministic
crashes.
These checks are assumed to be true in all cases where gc_realloc is
called with a valid pointer, so no need to waste code space and time
checking them in a non-debug build.
Header files that are considered internal to the py core and should not
normally be included directly are:
py/nlr.h - internal nlr configuration and declarations
py/bc0.h - contains bytecode macro definitions
py/runtime0.h - contains basic runtime enums
Instead, the top-level header files to include are one of:
py/obj.h - includes runtime0.h and defines everything to use the
mp_obj_t type
py/runtime.h - includes mpstate.h and hence nlr.h, obj.h, runtime0.h,
and defines everything to use the general runtime support functions
Additional, specific headers (eg py/objlist.h) can be included if needed.