Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.
Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.
Musl sources were used as a reference [1].
Fixes: https://github.com/rust-lang/rust/issues/140452 once synced
[1]: c47ad25ea3/src/math/x32/fma.c
Update `run.sh` to start testing `libm`. Currently this is somewhat
inefficient because `builtins-test` gets run more than once on some
targets; this can be cleaned up later.
Distribute everything from `libm/` to better locations in the repo.
`libm/libm/*` has not moved yet to avoid Git seeing the move as an edit
to `Cargo.toml`.
Files that remain to be merged somehow are in `etc/libm`.