507f1ca139
-falign-functions=32, since these two really get hammered on. Making them faster would need a thread register or TLS, unless there is a way to tell gcc that a library-local variable (pthread__threadmask) does not need PIC access.
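
A possible direction for the open question above, as a sketch only: GCC's visibility("hidden") attribute marks a symbol as not exported from the shared object, which generally lets PIC code reference it without a GOT load (on some targets the PIC base register is still needed). The aligned(32) function attribute is the per-function counterpart of -falign-functions=32. The names example_threadmask and example_mask_sp are hypothetical stand-ins, not the library's actual symbols, and the mask value is an assumption for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for a library-local variable such as
 * pthread__threadmask. Hidden visibility tells GCC the symbol is
 * never resolved outside this shared object, so PIC code can
 * usually address it directly instead of going through the GOT.
 * The value (clear the low 12 bits) is an assumption.
 */
__attribute__((visibility("hidden")))
uintptr_t example_threadmask = ~(uintptr_t)0xfff;

/*
 * Hypothetical hot function. aligned(32) requests 32-byte alignment
 * for this one function, similar to building the whole file with
 * -falign-functions=32.
 */
__attribute__((aligned(32)))
uintptr_t example_mask_sp(uintptr_t sp)
{
	/* Mask an address with the library-local mask. */
	return sp & example_threadmask;
}
```

Whether this helps would have to be checked in the generated assembly (gcc -fPIC -O2 -S) with and without the visibility attribute.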