cg_llvm: Replace some DIBuilder wrappers with LLVM-C API bindings (part 3)
- Part of rust-lang/rust#134001
- Follow-up to rust-lang/rust#136375
- Follow-up to rust-lang/rust#136632
---
This is another batch of LLVMDIBuilder binding migrations, replacing some our own LLVMRust bindings with bindings to upstream LLVM-C APIs.
This PR migrates all of the bindings that were touched by rust-lang/rust#136632, plus `LLVMDIBuilderCreateStructType`.
Use captures(address) instead of captures(none) for indirect args
While provenance cannot be captured through these arguments, the address / object identity can.
Fixes https://github.com/rust-lang/rust/issues/137668.
r? `@ghost`
The underlying implementation of `LLVMCreateConstantRangeAttribute` assumes
that each of `LowerWords` and `UpperWords` points to enough u64 values to
define an integer of the specified bit-length, and will encounter UB if that is
not the case.
Our safe wrapper function always passes pointers to `[u64; 2]` arrays,
regardless of the bit-length specified. That's fine in practice, because scalar
primitives never exceed 128 bits, but it is technically a soundness hole in a
safe function.
We can close the soundness hole by explicitly asserting `size_bits <= 128`.
This is effectively just a stricter version of the existing check that the
value must be small enough to fit in `c_uint`.
`&Freeze` parameters are not only `readonly` within the function,
but any captures of the pointer can also only be used for reads.
This can now be encoded using the `captures(address, read_provenance)`
attribute.
Remove `LlvmArchiveBuilder` and supporting code/bindings
Switching over to the newer Rust-based `ArArchiveBuilder` happened in rust-lang/rust#128936, a year ago.
Per the comment in `new_archive_builder`, that seems like enough time to justify removing the older, unused `LlvmArchiveBuilder` implementation and its associated bindings.
Fixesrust-lang/rust#128955.
Set the dead_on_return attribute (added in LLVM 21) for arguments
that are passed indirectly, but not byval.
This indicates that the value of the argument on return does not
matter, enabling additional dead store elimination.
Implement support for `become` and explicit tail call codegen for the LLVM backend
This PR implements codegen of explicit tail calls via `become` in `rustc_codegen_ssa` and support within the LLVM backend. Completes a task on (https://github.com/rust-lang/rust/issues/112788). This PR implements all the necessary bits to make explicit tail calls usable, other backends have received stubs for now and will ICE if you use `become` on them. I suspect there is some bikeshedding to be done on how we should go about implementing this for other backends, but it should be relatively straightforward for GCC after this is merged.
During development I also put together a POC bytecode VM based on tail call dispatch to test these changes out and analyze the codegen to make sure it generates expected assembly. That is available [here](https://github.com/xacrimon/tcvm).
Various refactors to the codegen coordinator code (part 3)
Continuing from https://github.com/rust-lang/rust/pull/144062 this removes an option without any known users, uses the object crate in favor of LLVM for getting the LTO bitcode and improves the coordinator channel handling.
gpu offload host code generation
r? ghost
This will generate most of the host side code to use llvm's offload feature.
The first PR will only handle automatic mem-transfers to and from the device.
So if a user calls a kernel, we will copy inputs back and forth, but we won't do the actual kernel launch.
Before merging, we will use LLVM's Info infrastructure to verify that the memcopies match what openmp offloa generates in C++. `LIBOMPTARGET_INFO=-1 ./my_rust_binary` should print that a memcpy to and later from the device is happening.
A follow-up PR will generate the actual device-side kernel which will then do computations on the GPU.
A third PR will implement manual host2device and device2host functionality, but the goal is to minimize cases where a user has to overwrite our default handling due to performance issues.
I'm trying to get a full MVP out first, so this just recognizes GPU functions based on magic names. The final frontend will obviously move this over to use proper macros, like I'm already doing it for the autodiff work.
This work will also be compatible with std::autodiff, so one can differentiate GPU kernels.
Tracking:
- https://github.com/rust-lang/rust/issues/131513
Fixes for LLVM 21
This fixes compatibility issues with LLVM 21 without performing the actual upgrade. Split out from https://github.com/rust-lang/rust/pull/143684.
This fixes three issues:
* Updates the AMDGPU data layout for address space 8.
* Makes emit-arity-indicator.rs a no_core test, so it doesn't fail on non-x86 hosts.
* Explicitly sets the exception model for wasm, as this is no longer implied by `-wasm-enable-eh`.
In LLVM 21 PR https://github.com/llvm/llvm-project/pull/130940
`TargetRegistry::createTargetMachine` was changed to take a `const
Triple&` and has deprecated the old `StringRef` method.
@rustbot label llvm-main
PassWrapper: adapt for llvm/llvm-project@d3d856ad84
LLVM 21 moves to making it more explicit what this function call is doing, but nothing has changed behaviorally, so for now we just adjust to using the new name of the function.
`@rustbot` label llvm-main
LLVM 21 moves to making it more explicit what this function call is
doing, but nothing has changed behaviorally, so for now we just adjust
to using the new name of the function.
@rustbot label llvm-main
Autodiff batching
Enzyme supports batching, which is especially known from the ML side when training neural networks.
There we would normally have a training loop, where in each iteration we would pass in some data (e.g. an image), and a target vector. Based on how close we are with our prediction we compute our loss, and then use backpropagation to compute the gradients and update our weights.
That's quite inefficient, so what you normally do is passing in a batch of 8/16/.. images and targets, and compute the gradients for those all at once, allowing better optimizations.
Enzyme supports batching in two ways, the first one (which I implemented here) just accepts a Batch size,
and then each Dual/Duplicated argument has not one, but N shadow arguments. So instead of
```rs
for i in 0..100 {
df(x[i], y[i], 1234);
}
```
You can now do
```rs
for i in 0..100.step_by(4) {
df(x[i+0],x[i+1],x[i+2],x[i+3], y[i+0], y[i+1], y[i+2], y[i+3], 1234);
}
```
which will give the same results, but allows better compiler optimizations. See the testcase for details.
There is a second variant, where we can mark certain arguments and instead of having to pass in N shadow arguments, Enzyme assumes that the argument is N times longer. I.e. instead of accepting 4 slices with 12 floats each, we would accept one slice with 48 floats. I'll implement this over the next days.
I will also add more tests for both modes.
For any one preferring some more interactive explanation, here's a video of Tim's llvm dev talk, where he presents his work. https://www.youtube.com/watch?v=edvaLAL5RqU
I'll also add some other docs to the dev guide and user docs in another PR.
r? ghost
Tracking:
- https://github.com/rust-lang/rust/issues/124509
- https://github.com/rust-lang/rust/issues/135283
We also have to remove the LLVM argument in cast-target-abi.rs for LLVM
21. I'm not really sure what the best approach here is since that test
already uses revisions. We could also fork the test into a copy for LLVM
19-20 and another for LLVM 21, but what I did for now was drop the
lint-abort-on-error flag to LLVM figuring that some coverage was better
than none, but I'm happy to change this if that was a bad direction.
The above also applies for ffi-out-of-bounds-loads.rs.
r? dianqk
@rustbot label llvm-main