Expose algebraic floating point intrinsics
# Problem
A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.
See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H.
### C++: 10us ✅
With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
<th>C++</th>
<th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
#pragma clang fp reassociate(on)
float sum = 0.0;
for (size_t i = 0; i < len; ++i) {
sum += a[i] * b[i];
}
return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>
### Nightly Rust: 10us ✅
With rustc 1.86.0-nightly (8239a37f9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
<th>Rust</th>
<th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
let mut sum = 0.0;
for i in 0..a.len() {
sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
}
sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>
### Stable Rust: 84us ❌
With rustc 1.84.1 (e71f9a9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
<th>Rust</th>
<th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
let mut sum = 0.0;
for i in 0..a.len() {
sum += a[i] * b[i];
}
sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>
# Proposed Change
Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.
# Alternatives Considered
https://github.com/rust-lang/rust/issues/21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.
In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.
# References
* https://github.com/rust-lang/rust/issues/21690
* https://github.com/rust-lang/libs-team/issues/532
* https://github.com/rust-lang/rust/issues/136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux
By marking `__errno_location` as `#[ffi_const]` and `std::sys::os::errno` as `#[inline]`, this PR allows merging multiple calls to `io::Error::last_os_error()` into one.
Add `MAX_LEN_UTF8` and `MAX_LEN_UTF16` Constants
This pull request adds the `MAX_LEN_UTF8` and `MAX_LEN_UTF16` constants as per #45795, gated behind the `char_max_len` feature.
The constants are currently applied in the `alloc`, `core` and `std` libraries.
* Renames the methods:
* `get_many_mut` -> `get_disjoint_mut`
* `get_many_unchecked_mut` -> `get_disjoint_unchecked_mut`
* Does not rename the feature flag: `get_many_mut`
* Marks the feature as stable
* Renames some helper stuff:
* `GetManyMutError` -> `GetDisjointMutError`
* `GetManyMutIndex` -> `GetDisjointMutIndex`
* `get_many_mut_helpers` -> `get_disjoint_mut_helpers`
* `get_many_check_valid` -> `get_disjoint_check_valid`
This only touches slice methods.
HashMap's methods and feature gates are not renamed here
(nor are they stabilized).
Implement `ByteStr` and `ByteString` types
Approved ACP: https://github.com/rust-lang/libs-team/issues/502
Tracking issue: https://github.com/rust-lang/rust/issues/134915
These types represent human-readable strings that are conventionally,
but not always, UTF-8. The `Debug` impl prints non-UTF-8 bytes using
escape sequences, and the `Display` impl uses the Unicode replacement
character.
This is a minimal implementation of these types and associated trait
impls. It does not add any helper methods to other types such as `[u8]`
or `Vec<u8>`.
I've omitted a few implementations of `AsRef`, `AsMut`, and `Borrow`,
when those would be the second implementation for a type (counting the
`T` impl), to avoid potential inference failures. We can attempt to add
more impls later in standalone commits, and run them through crater.
In addition to the `bstr` feature, I've added a `bstr_internals` feature
for APIs provided by `core` for use by `alloc` but not currently
intended for stabilization.
This API and its implementation are based *heavily* on the `bstr` crate
by Andrew Gallant (`@BurntSushi).`
r? `@BurntSushi`
Use `NonNull::without_provenance` within the standard library
This API removes the need for several `unsafe` blocks, and leads to clearer code. It uses feature `nonnull_provenance` (#135243).
Close#135343
Approved ACP: https://github.com/rust-lang/libs-team/issues/502
Tracking issue: https://github.com/rust-lang/rust/issues/134915
These types represent human-readable strings that are conventionally,
but not always, UTF-8. The `Debug` impl prints non-UTF-8 bytes using
escape sequences, and the `Display` impl uses the Unicode replacement
character.
This is a minimal implementation of these types and associated trait
impls. It does not add any helper methods to other types such as `[u8]`
or `Vec<u8>`.
I've omitted a few implementations of `AsRef`, `AsMut`, `Borrow`,
`From`, and `PartialOrd`, when those would be the second implementation
for a type (counting the `T` impl) or otherwise may cause inference
failures. These impls are important, but we can attempt to add them
later in standalone commits, and run them through crater.
In addition to the `bstr` feature, I've added a `bstr_internals` feature
for APIs provided by `core` for use by `alloc` but not currently
intended for stabilization.
This API and its implementation are based *heavily* on the `bstr` crate
by Andrew Gallant (@BurntSushi).
This greatly reduces the number of places that actually use the `rustc_layout_scalar_valid_range_*` attributes down to just 3:
```
library/core\src\ptr\non_null.rs
68:#[rustc_layout_scalar_valid_range_start(1)]
library/core\src\num\niche_types.rs
19: #[rustc_layout_scalar_valid_range_start($low)]
20: #[rustc_layout_scalar_valid_range_end($high)]
```
Everything else -- PAL Nanoseconds, alloc's `Cap`, niched FDs, etc -- all just wrap those `niche_types` types.
`CheckAttrVisitor::check_doc_keyword` checks `#[doc(keyword = "..")]`
attributes to ensure they are on an empty module, and that the value is
a non-empty identifier.
The `rustc::existing_doc_keyword` lint checks these attributes to ensure
that the value is the name of a keyword.
It's silly to have two different checking mechanisms for these
attributes. This commit does the following.
- Changes `check_doc_keyword` to check that the value is the name of a
keyword (avoiding the need for the identifier check, which removes a
dependency on `rustc_lexer`).
- Removes the lint.
- Updates tests accordingly.
There is one hack: the `SelfTy` FIXME case used to used to be handled by
disabling the lint, but now is handled with a special case in
`is_doc_keyword`. That hack will go away if/when the FIXME is fixed.
Co-Authored-By: Guillaume Gomez <guillaume1.gomez@gmail.com>
`UniqueRc` trait impls
UniqueRc tracking Issue: #112566
Stable traits: (i.e. impls behind only the `unique_rc_arc` feature gate)
* Support the same formatting as `Rc`:
* `fmt::Debug` and `fmt::Display` delegate to the pointee.
* `fmt::Pointer` prints the address of the pointee.
* Add explicit `!Send` and `!Sync` impls, to mirror `Rc`.
* Borrowing traits: `Borrow`, `BorrowMut`, `AsRef`, `AsMut`
* `Rc` does not implement `BorrowMut` and `AsMut`, but `UniqueRc` can.
* Unconditional `Unpin`, like other heap-allocated types.
* Comparison traits `(Partial)Ord` and `(Partial)Eq` delegate to the pointees.
* `PartialEq for UniqueRc` does not do `Rc`'s specialization shortcut for pointer equality when `T: Eq`, since by definition two `UniqueRc`s cannot share an allocation.
* `Hash` delegates to the pointee.
* `AsRawFd`, `AsFd`, `AsHandle`, `AsSocket` delegate to the pointee like `Rc`.
* Sidenote: The bounds on `T` for the existing `Pointer<T>` impls for specifically `AsRawFd` and `AsSocket` do not allow `T: ?Sized`. For the added `UniqueRc` impls I allowed `T: ?Sized` for all four traits, but I did not change the existing (stable) impls.
Unstable traits:
* `DispatchFromDyn`, allows using `UniqueRc<Self>` as a method receiver under `feature(arbitrary_self_types)`.
* Existing `PinCoerceUnsized for UniqueRc` is generalized to allow non-`Global` allocators, like `Rc`.
* `DerefPure`, allows using `UniqueRc` in deref-patterns under `feature(deref_patterns)`, like `Rc`.
For documentation, `Rc` only has documentation on the comparison traits' methods, so I copied/adapted the documentation for those, and left the rest without impl-specific docs.
~~Edit: Marked as draft while I figure out `UnwindSafe`.~~
Edit: Ignoring `UnwindSafe` for this PR
Stabilize `extended_varargs_abi_support`
I think that is everything? If there is any documentation regarding `extern` and/or varargs to correct, let me know, some quick greps suggest that there might be none.
Tracking issue: https://github.com/rust-lang/rust/issues/100189
Bump boostrap compiler to new beta
Currently failing due to something about the const stability checks and `panic!`. I'm not sure why though since I wasn't able to see any PRs merged in the past few days that would result in a `cfg(bootstrap)` that shouldn't be removed. cc `@RalfJung` #131349
This reduces code sizes and better respects programmer intent when
marking inline(never). Previously such a marking was essentially ignored
for generic functions, as we'd still inline them in remote crates.
std: allow after-main use of synchronization primitives
By creating an unnamed thread handle when the actual one has already been destroyed, synchronization primitives using thread parking can be used even outside the Rust runtime.
This also fixes an inefficiency in the queue-based `RwLock`: if `thread::current` was not initialized yet, it will create a new handle on every parking attempt without initializing `thread::current`. The private `current_or_unnamed` function introduced here fixes this.
By creating an unnamed thread handle when the actual one has already been destroyed, synchronization primitives using thread parking can be used even outside the Rust runtime.
This also fixes an inefficiency in the queue-based `RwLock`: if `thread::current` was not initialized yet, it will create a new handle on every parking attempt without initializing `thread::current`. The private `current_or_unnamed` function introduced here fixes this.
better test for const HashMap; remove const_hash leftovers
The existing `const_with_hasher` test is kind of silly since the HashMap it constructs can never contain any elements. So this adjusts the test to construct a usable HashMap, which is a bit non-trivial since the default hash builder cannot be built in `const`. `BuildHasherDefault::new()` helps but is unstable (https://github.com/rust-lang/rust/issues/123197), so we also have a test that does not involve that type.
The second commit removes the last remnants of https://github.com/rust-lang/rust/issues/104061, since they aren't actually useful -- without const traits, you can't do any hashing in `const`.
Cc ``@rust-lang/libs-api`` ``@rust-lang/wg-const-eval``
Closes#104061
Related to https://github.com/rust-lang/rust/issues/102575