otg: Improve IN write performance

chunks_exact() can be handled by the compiler more efficiently. Previous code was making a memcpy call for each 4 byte chunk slice. Hoisting the fifo out of the loop avoids recalculating the pointer each time. In my benchmark I see a jump from ~13 megabyte/sec to ~25MB/sec after this change (opt-level=3). opt-level = "z" goes 9MB/s to 18MB/s. The benchmark was on a stm32h7s3l8, 600mhz clock, 512 byte bulk writes, data in DTCM. The benchmark isn't just USB writes, also has some unrelated memcpys for packet construction.
2025-10-01 06:10:41 +00:00 · 2025-07-11 17:38:42 +08:00 · 2025-07-11 17:38:42 +08:00 · e2ceb2b1f7
commit e2ceb2b1f7
parent f53b6649dd
1 changed files with 16 additions and 3 deletions
--- a/embassy-usb-synopsys-otg/src/lib.rs
+++ b/embassy-usb-synopsys-otg/src/lib.rs
@ -1210,10 +1210,23 @@ impl<'d> embassy_usb_driver::EndpointIn for Endpoint<'d, In> {
            });

            // Write data to FIFO
-            for chunk in buf.chunks(4) {
+            let chunks = buf.chunks_exact(4);
+            // Stash the last partial chunk
+            let rem = chunks.remainder();
+            let last_chunk = (!rem.is_empty()).then(|| {
                let mut tmp = [0u8; 4];
-                tmp[0..chunk.len()].copy_from_slice(chunk);
-                self.regs.fifo(index).write_value(regs::Fifo(u32::from_ne_bytes(tmp)));
+                tmp[0..rem.len()].copy_from_slice(rem);
+                u32::from_ne_bytes(tmp)
+            });
+
+            let fifo = self.regs.fifo(index);
+            for chunk in chunks {
+                let val = u32::from_ne_bytes(chunk.try_into().unwrap());
+                fifo.write_value(regs::Fifo(val));
+            }
+            // Write any last chunk
+            if let Some(val) = last_chunk {
+                fifo.write_value(regs::Fifo(val));
            }
        });