Initial `UnsafePinned` implementation [Part 1: Libs]
Initial libs changes necessary to unblock further work on [RFC 3467](https://rust-lang.github.io/rfcs/3467-unsafe-pinned.html).
Tracking issue: #125735
This PR is split off from #136964, and includes just the libs changes:
- `UnsafePinned` struct
- private `UnsafeUnpin` structural auto trait
- Lang items for both
- Compiler changes necessary to block niches on `UnsafePinned`
This PR does not change codegen, miri, the existing `!Unpin` hack, or anything else. That work is to be split into later PRs.
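For orientation, this is roughly the shape RFC 3467 proposes (illustrative only; method names and signatures are not final):

```rust
// Illustrative sketch of the RFC 3467 shape, not the final API.
// `UnsafePinned<T>` opts its contents out of the aliasing/immutability
// guarantees that normally follow from being pinned, replacing the
// current `!Unpin` hack.
pub struct UnsafePinned<T: ?Sized> {
    value: T,
}

impl<T> UnsafePinned<T> {
    pub const fn new(value: T) -> Self {
        UnsafePinned { value }
    }

    pub const fn into_inner(self) -> T {
        self.value
    }
}

impl<T: ?Sized> UnsafePinned<T> {
    /// Raw access to the contents; mutating through this pointer while
    /// pinned is exactly what this type exists to make sound.
    pub const fn get(&mut self) -> *mut T {
        &raw mut self.value
    }
}
```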
---
cc ``@RalfJung`` ``@Noratrieb``
``@rustbot`` label F-unsafe_pinned T-libs-api
Move `select_unpredictable` to the `hint` module
There has been considerable discussion in both the ACP (rust-lang/libs-team#468) and tracking issue (#133962) about whether the `bool::select_unpredictable` method should be in `core::hint` instead.
I believe this is the right move for the following reasons:
- The documentation explicitly says that it is a hint, not a codegen guarantee.
- `bool` doesn't have a corresponding `select` method, and I don't think we should be adding one.
- This shouldn't be something that people reach for with auto-completion unless they specifically understand the interactions with branch prediction. Using conditional moves can easily make code *slower* by preventing the CPU from speculating past the condition due to the data dependency.
- Although currently `core::hint` only contains no-ops, this isn't a hard rule (for example `unreachable_unchecked` is a bit of a gray area). The documentation only states that the module contains "hints to compiler that affects how code should be emitted or optimized". This is consistent with what `select_unpredictable` does.
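For illustration, after the move a caller looks like this (minimal sketch; the function is still unstable under the `select_unpredictable` feature):

```rust
#![feature(select_unpredictable)]

use std::hint;

/// Pick the smaller value when the comparison is data-dependent and
/// hard to predict: the hint nudges codegen toward a conditional move
/// instead of a branch.
fn min_unpredictable(a: u64, b: u64) -> u64 {
    hint::select_unpredictable(a < b, a, b)
}

fn main() {
    assert_eq!(min_unpredictable(3, 7), 3);
}
```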
Use the chaining methods on PartialOrd for slices too
#138135 added these doc-hidden trait methods to improve the tuple codegen. This PR adds more implementations and callers so that the codegen for slice (and array) comparisons also improves.
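The idea behind the chaining methods, sketched with a standalone helper (the real methods are the doc-hidden `__chaining_lt` and friends from #138135; this only illustrates the control flow):

```rust
use std::cmp::Ordering;
use std::ops::ControlFlow;

// Break(result) as soon as one element pair decides the comparison;
// Continue(()) means "equal so far, consult the next pair".
fn chaining_lt<T: PartialOrd>(a: &T, b: &T) -> ControlFlow<bool> {
    match a.partial_cmp(b) {
        Some(Ordering::Equal) => ControlFlow::Continue(()),
        Some(Ordering::Less) => ControlFlow::Break(true),
        _ => ControlFlow::Break(false),
    }
}

// Lexicographic `<` for slices built on the chaining helper; this
// short-circuiting structure is where the improved codegen comes from.
fn slice_lt<T: PartialOrd>(a: &[T], b: &[T]) -> bool {
    let n = a.len().min(b.len());
    for i in 0..n {
        if let ControlFlow::Break(decided) = chaining_lt(&a[i], &b[i]) {
            return decided;
        }
    }
    a.len() < b.len()
}
```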
docs: clarify uint exponent for `is_power_of_two`
This makes the documentation more explicit for that method. I know this might seem "nit-picky", but `k` could otherwise be interpreted as any real or complex number. A trivial example would be $`3 = 2^{\log_2 3}`$, which "proves that three is a power of two" under that vague definition.
BTW, when I read the implementation, I was surprised to see that `1` is considered a power of 2 despite being odd (it does make sense in some contexts, but it is still not intuitive). That is why I first wrote "positive int" before correcting it to "unsigned int".
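Concretely, with `k` restricted to unsigned integers:

```rust
fn main() {
    assert!(1u32.is_power_of_two());  // 1 == 2^0, and k = 0 is a valid unsigned int
    assert!(2u32.is_power_of_two());  // 2 == 2^1
    assert!(!3u32.is_power_of_two()); // 2^k == 3 has no unsigned-integer solution
}
```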
Polymorphize `array::IntoIter`'s iterator impl
Today we emit all the iterator methods for every different array width. That's wasteful since the actual array length never even comes into it -- the indices used are from the separate `alive: IndexRange` field, not even the `N` const param.
This PR switches things so that an `array::IntoIter<T, N>` stores a `PolymorphicIter<[MaybeUninit<T>; N]>`, which we *unsize* to `PolymorphicIter<[MaybeUninit<T>]>` and call methods on that non-`Sized` type for all the iterator methods.
That also necessarily makes the layout consistent between the different lengths of arrays, because of the unsizing. Compare that to today <https://rust.godbolt.org/z/Prb4xMPrb>, where different widths can't even be deduped because the offset to the indices is different for different array widths.
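A simplified sketch of the trick (field layout and names approximate the PR; `Range<usize>` stands in for the internal `IndexRange`):

```rust
use std::mem::MaybeUninit;
use std::ops::Range;

// The length-dependent buffer goes last, so
// `PolymorphicIter<[MaybeUninit<T>; N]>` can unsize to
// `PolymorphicIter<[MaybeUninit<T>]>`. All iterator logic is written
// against the unsized type, giving one monomorphization per element
// type `T` instead of one per `(T, N)` pair.
struct PolymorphicIter<Buf: ?Sized> {
    alive: Range<usize>,
    data: Buf,
}

impl<T> PolymorphicIter<[MaybeUninit<T>]> {
    fn next(&mut self) -> Option<T> {
        let i = self.alive.next()?;
        // SAFETY (in the real code): `alive` only covers initialized,
        // not-yet-yielded elements.
        Some(unsafe { self.data[i].assume_init_read() })
    }
}
```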
add `core::intrinsics::simd::{simd_extract_dyn, simd_insert_dyn}`
fixes https://github.com/rust-lang/rust/issues/137372
adds `core::intrinsics::simd::{simd_extract_dyn, simd_insert_dyn}`, which, unlike their non-`dyn` counterparts, allow a non-const index. Many platforms (though notably not x86_64 or aarch64) have dedicated instructions for this operation, which stdarch can emit with this change.
Future work is to also make the `Index` operation on the `Simd` type emit this operation, but the intrinsic can't be used directly. We'll need some MIR shenanigans for that.
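For reference, the shapes mirror the existing `simd_extract`/`simd_insert` intrinsics, just with a runtime index (a declaration sketch, with parameter types assumed from the non-`dyn` versions; as with those, an out-of-bounds index is UB):

```rust
// Sketch of the declarations in core::intrinsics::simd, where `T` is a
// SIMD vector type and `U` its element type:
pub unsafe fn simd_extract_dyn<T, U>(x: T, idx: u32) -> U;
pub unsafe fn simd_insert_dyn<T, U>(x: T, idx: u32, val: U) -> T;
```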
r? `@ghost`
Ensure `swap_nonoverlapping` is really always untyped
This replaces #134954, which was arguably overcomplicated.
## Fixes #134713
Actually using the type passed to `ptr::swap_nonoverlapping` for anything other than its size + align turns out to not work, so this goes back to always erasing the types down to just bytes.
(Except in `const`, which keeps doing the same thing as before to preserve `@RalfJung's` fix from #134689)
## Fixes #134946
I'd previously moved the swapping to use auto-vectorization *on bytes*, but someone pointed out on Discord that the tail loop handling from that left a whole bunch of byte-by-byte swapping around. This goes back to manual tail handling to avoid that, then still triggers auto-vectorization on pointer-width values. (So you'll see `<4 x i64>` on `x86-64-v3` for example.)
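The strategy, in sketch form (simplified; not the exact library code):

```rust
use std::mem::size_of;
use std::ptr;

/// Simplified sketch: swap pointer-width chunks in a loop that LLVM can
/// auto-vectorize, then swap the remaining tail bytes manually so no
/// byte-by-byte loop survives for the bulk of the data.
unsafe fn swap_untyped(x: *mut u8, y: *mut u8, len: usize) {
    let words = len / size_of::<usize>();
    let (xw, yw) = (x.cast::<usize>(), y.cast::<usize>());
    for i in 0..words {
        let a = ptr::read_unaligned(xw.add(i));
        let b = ptr::read_unaligned(yw.add(i));
        ptr::write_unaligned(xw.add(i), b);
        ptr::write_unaligned(yw.add(i), a);
    }
    // Manual tail: at most size_of::<usize>() - 1 bytes remain.
    for i in (words * size_of::<usize>())..len {
        let a = ptr::read(x.add(i));
        ptr::write(x.add(i), ptr::read(y.add(i)));
        ptr::write(y.add(i), a);
    }
}
```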
Update `u8`-to-and-from-`i8` suggestions.
`u8::cast_signed` and `i8::cast_unsigned` have been stabilised, but `i8::from_ne_bytes` et al. still suggest using `as i8` or `as u8`.
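I.e., the suggestions now point at the stabilized methods:

```rust
fn main() {
    let b: u8 = 200;
    let s: i8 = b.cast_signed();        // suggested instead of `b as i8`
    assert_eq!(s, -56);
    assert_eq!(s.cast_unsigned(), 200); // suggested instead of `s as u8`
}
```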
speed up `String::push` and `String::insert`
Addresses the concerns described in #116235.
The performance gain comes mainly from avoiding temporary buffers.
Complex pattern matching in `encode_utf8` (introduced in #67569) has been simplified to a comparison and an exhaustive `match` in the `encode_utf8_raw_unchecked` helper function. It takes a slice of `MaybeUninit<u8>` because otherwise we'd have to construct a normal slice to uninitialized data, which is not desirable, I guess.
Several functions still have that [unneeded zeroing](https://rust.godbolt.org/z/5oKfMPo7j), but a single instruction is not that important, I guess.
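A sketch of the helper's shape (illustrative, not the exact library code; the real function is `unsafe` and relies on the caller to guarantee the buffer length):

```rust
use std::mem::MaybeUninit;

// One comparison picks the encoded length, then an exhaustive match
// writes directly into possibly-uninitialized memory, avoiding any
// temporary buffer.
fn encode_utf8_sketch(c: char, dst: &mut [MaybeUninit<u8>]) -> usize {
    let code = c as u32;
    let len = c.len_utf8();
    assert!(dst.len() >= len, "buffer too small");
    match (len, &mut dst[..len]) {
        (1, [b0]) => {
            b0.write(code as u8);
        }
        (2, [b0, b1]) => {
            b0.write((code >> 6 & 0x1F) as u8 | 0xC0);
            b1.write((code & 0x3F) as u8 | 0x80);
        }
        (3, [b0, b1, b2]) => {
            b0.write((code >> 12 & 0x0F) as u8 | 0xE0);
            b1.write((code >> 6 & 0x3F) as u8 | 0x80);
            b2.write((code & 0x3F) as u8 | 0x80);
        }
        (4, [b0, b1, b2, b3]) => {
            b0.write((code >> 18 & 0x07) as u8 | 0xF0);
            b1.write((code >> 12 & 0x3F) as u8 | 0x80);
            b2.write((code >> 6 & 0x3F) as u8 | 0x80);
            b3.write((code & 0x3F) as u8 | 0x80);
        }
        _ => unreachable!(),
    }
    len
}
```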
`@rustbot` label T-libs C-optimization A-str
Promise `array::from_fn` is generated in order of increasing indices
Fixes #139061
I agree this needs to be documented because of the `FnMut`, either with a guarantee or to explicitly disclaim one.
I'm pretty sure this will be non-controversial (like the other "well sure you *could* do it in a different order, but why?" things were), but I couldn't find any previous libs-api decision on it so it's seemingly a new promise that will need FCP.
Basically, yes, it would be plausible to fill in the reverse order, but there's no obvious way we could ever know that that might even be a good idea, so forward seems like an easy thing to promise. We could always add a `from_fn_rev` or something later if there's ever a strong enough need, but it seems unlikely.
Let's just do the obvious thing so it matches what `[gen(0), gen(1), …, gen(N-1)]` does.
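I.e., the promise makes the following observable ordering reliable:

```rust
fn main() {
    let mut calls = Vec::new();
    let squares: [usize; 4] = std::array::from_fn(|i| {
        calls.push(i); // FnMut side effect observes the call order
        i * i
    });
    assert_eq!(squares, [0, 1, 4, 9]);
    assert_eq!(calls, [0, 1, 2, 3]); // now promised: increasing indices
}
```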
Make `cfg_match!` a semitransparent macro
IIUC this is preferred when (potentially) stabilizing `macro` items, so that we don't accidentally commit to def-site hygiene when mixed-site hygiene is what we want.
Tracking issue: #115585
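For illustration, the mechanism is the internal transparency attribute on a `macro` item (a sketch with a toy macro; `rustc_macro_transparency` is a perma-unstable rustc attribute):

```rust
#![feature(decl_macro, rustc_attrs)]

// `macro` items default to opaque (def-site) hygiene; this attribute
// switches to the mixed-site hygiene that `macro_rules!` already has.
#[rustc_macro_transparency = "semitransparent"]
pub macro add_one($x:expr) {
    $x + 1
}

fn main() {
    assert_eq!(add_one!(41), 42);
}
```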
Remove support for `extern "rust-intrinsic"` blocks
Part of rust-lang/rust#132735
This looked manageable, and there didn't appear to have been progress in the last two weeks, so I decided to give it a try.
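For context, the replacement style declares intrinsics as ordinary items (a sketch of the pattern used throughout `core` after the migration):

```rust
// Old style, removed by this PR:
//
//     extern "rust-intrinsic" {
//         fn size_of<T>() -> usize;
//     }
//
// New style: a plain (bodyless) item flagged with an attribute.
#[rustc_intrinsic]
pub const fn size_of<T>() -> usize;
```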
Replace last `usize` -> `ptr` transmute in `alloc` with strict provenance API
This replaces the `usize -> ptr` transmute in `RawVecInner::new_in` with a strict provenance API (`NonNull::without_provenance`).
The API is changed to take an `Alignment` which encodes the non-null constraint needed for `Unique` and allows us to do the construction safely.
Two internal-only APIs were added to let us avoid UB-checking in this hot code: `Layout::alignment` to get the `Alignment` type directly rather than as a `usize`, and `Unique::from_non_null` to create `Unique` in const context without a transmute.
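The construction then looks roughly like this (a sketch; `Layout::alignment` and `Unique::from_non_null` are the internal APIs named above, so the `NonNull`-only equivalent is shown):

```rust
use std::alloc::Layout;
use std::num::NonZero;
use std::ptr::NonNull;

// A dangling, well-aligned pointer with no usize -> ptr transmute:
// the alignment is a power of two, hence nonzero, so non-nullness is
// known statically and no UB check is needed.
fn dangling_for(layout: Layout) -> NonNull<u8> {
    // Real code: `layout.alignment().as_nonzero()` avoids this unwrap.
    let align = NonZero::new(layout.align()).unwrap();
    NonNull::without_provenance(align)
}
```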
Optimize slice `{Chunks,Windows}::nth`
I noticed that the `nth` functions on the slice iterators had bounds checks that were not optimized out.
The new implementation even generates branchless code.
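A sketch of the `Windows::nth` shape (simplified to a free function over `&[u8]`; the real code is generic and lives on the iterator type):

```rust
// Advance by computing the new window bounds once, instead of stepping
// `n` times: the only remaining check is a single comparison, which
// LLVM can compile to branchless code.
fn windows_nth<'a>(v: &mut &'a [u8], size: usize, n: usize) -> Option<&'a [u8]> {
    let s: &'a [u8] = *v;
    let (end, overflow) = size.overflowing_add(n);
    if end > s.len() || overflow {
        *v = &[];
        return None;
    }
    *v = &s[n + 1..]; // remaining windows start one past the returned one
    Some(&s[n..end])
}
```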
tidy: Fix paths to `coretests` and `alloctests`
Following `#135937` and `#136642`, tests for core and alloc are in coretests and alloctests. Fix tidy to lint for the new paths. Also, update comments referring to the old locations.
Some context for changes which don't match that pattern:
- `library/std/src/thread/local/dynamic_tests.rs` and `library/std/src/sync/mpsc/sync_tests.rs` were moved under `library/std/tests/` in 332fb7e6f1d (Move std::thread_local unit tests to integration tests, 2025-01-17) and b8ae372e483 (Move std::sync unit tests to integration tests, 2025-01-17), respectively, so are no longer special cases.
- There never was a `library/core/tests/fmt.rs` file. That comment previously referred to `src/test/ui/ifmt.rs`, which was folded into `library/alloc/tests/fmt.rs` in 949c96660c3 (move format! interface tests, 2020-09-08).
Now, the only matches for `(alloc|core)/tests` are in `compiler/rustc_codegen_{cranelift,gcc}/patches`. I don't know why CI hasn't broken, given that those patches can't apply. Or maybe they somehow still can?
r? `@bjorn3`
Fix missing const for inherent pointer `replace` methods
`ptr::replace` (the free fn) is already const stable. However, there are inherent convenience methods on `*mut T` and `NonNull<T>`, allowing you to write e.g. `unsafe { foo.replace(bar) }` where `foo` is a `*mut T` or `NonNull<T>`.
It seems const was never added to the inherent method (likely oversight), so this PR adds it.
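For example, after this change the following compiles (a minimal sketch):

```rust
// The inherent `replace` on `*mut T` (and `NonNull<T>`) is now callable
// in const context, matching the const-stable `ptr::replace` free fn.
const fn reset(slot: &mut i32) -> i32 {
    unsafe { (slot as *mut i32).replace(0) }
}

fn main() {
    let mut x = 7;
    assert_eq!(reset(&mut x), 7);
    assert_eq!(x, 0);
}
```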
I don't believe this needs another[^1] FCP as the inherent methods are already stable and `ptr::replace` is already const stable, so this adds no new API.
Original tracking issue: #83164
`ptr::replace` constified in #83091
`ptr::replace` const stabilized in #130954
[^1]: `const_replace` FCP completed: https://github.com/rust-lang/rust/issues/83164#issuecomment-2385670050
Implement `SliceIndex` for `ByteStr`
Implement `Index` and `IndexMut` for `ByteStr` in terms of `SliceIndex`. Implement it for the same types that `&[u8]` supports (a superset of those supported for `&str`, which does not have `usize` and `ops::IndexRange`).
At the same time, move compare and index traits to a separate file in the `bstr` module, to give it more space to grow as more functionality is added (e.g., iterators and string-like ops). Order the items in `bstr/traits.rs` similarly to `str/traits.rs`.
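For example (a sketch; `ByteStr` is still unstable under the `bstr` feature):

```rust
#![feature(bstr)]

use std::bstr::ByteStr;

fn main() {
    let s = ByteStr::new("hello");
    assert_eq!(&s[1..3], ByteStr::new("el")); // range indexing, like &[u8]
    assert_eq!(s[0], b'h');                   // usize indexing yields a byte
}
```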
cc `@joshtriplett`
`ByteStr`/`ByteString` tracking issue: https://github.com/rust-lang/rust/issues/134915
Expose algebraic floating point intrinsics
# Problem
A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.
See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.
### C++: 10us ✅
With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
<th>C++</th>
<th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>
### Nightly Rust: 10us ✅
With rustc 1.86.0-nightly (8239a37f9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
<th>Rust</th>
<th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
#![feature(core_intrinsics)]
use std::intrinsics::{fadd_algebraic, fmul_algebraic};

fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>
### Stable Rust: 84us ❌
With rustc 1.84.1 (e71f9a9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
<th>Rust</th>
<th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>
# Proposed Change
Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.
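With the wrappers, the nightly example above becomes (a sketch; method names per the `float_algebraic` tracking issue #136469):

```rust
#![feature(float_algebraic)]

// Same dot product, but the algebraic operations permit reassociation,
// letting LLVM vectorize the reduction.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for i in 0..a.len() {
        sum = sum.algebraic_add(a[i].algebraic_mul(b[i]));
    }
    sum
}
```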
# Alternatives Considered
https://github.com/rust-lang/rust/issues/21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.
In the meantime, processors have evolved, and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.
# References
* https://github.com/rust-lang/rust/issues/21690
* https://github.com/rust-lang/libs-team/issues/532
* https://github.com/rust-lang/rust/issues/136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux