Conversation

JohannesGaessler (Collaborator):
See https://github.com/iacopPBK/llama.cpp-gfx906 . AMD GPUs support reads of up to 16 bytes from SRAM. This PR extends the tile FlashAttention CUDA kernel with support for reads of 8 or 16 bytes. The FP32 -> FP16 type conversion is also done prior to writing the data to SRAM to reduce I/O further.

I also checked the AMD ISA documentation for v_dot2_f32_f16 support and adjusted the code paths accordingly; it seems to be available everywhere except for RDNA 1.

Performance changes
| GPU | Model | Microbatch size | Test | t/s master | t/s 5bae9f9 | Speedup |
|---|---|---|---|---|---|---|
| MI50 | gemma 2B Q4_0 | 16 | pp16384 | 629.54 | 716.54 | 1.14 |
| MI50 | gemma 2B Q4_0 | 32 | pp16384 | 728.28 | 911.94 | 1.25 |
| MI50 | gemma 2B Q4_0 | 512 | pp16384 | 1412.40 | 2141.40 | 1.52 |
| MI50 | llama 1B Q4_0 | 16 | pp16384 | 927.48 | 998.83 | 1.08 |
| MI50 | llama 1B Q4_0 | 32 | pp16384 | 1189.48 | 1338.73 | 1.13 |
| MI50 | llama 1B Q4_0 | 512 | pp16384 | 2278.02 | 2898.30 | 1.27 |
| MI50 | llama 8B Q4_0 | 16 | pp16384 | 277.56 | 294.46 | 1.06 |
| MI50 | llama 8B Q4_0 | 32 | pp16384 | 334.59 | 366.55 | 1.10 |
| MI50 | llama 8B Q4_0 | 512 | pp16384 | 508.27 | 606.21 | 1.19 |
| RX 6800 | gemma 2B Q4_0 | 16 | pp16384 | 399.91 | 636.85 | 1.59 |
| RX 6800 | gemma 2B Q4_0 | 32 | pp16384 | 313.91 | 992.28 | 3.16 |
| RX 6800 | gemma 2B Q4_0 | 512 | pp16384 | 568.27 | 1897.49 | 3.34 |
| RX 6800 | llama 1B Q4_0 | 16 | pp16384 | 644.87 | 903.19 | 1.40 |
| RX 6800 | llama 1B Q4_0 | 32 | pp16384 | 658.72 | 1336.48 | 2.03 |
| RX 6800 | llama 1B Q4_0 | 512 | pp16384 | 897.77 | 2301.67 | 2.56 |
| RX 6800 | llama 8B Q4_0 | 16 | pp16384 | 172.06 | 234.54 | 1.36 |
| RX 6800 | llama 8B Q4_0 | 32 | pp16384 | 174.27 | 328.20 | 1.88 |
| RX 6800 | llama 8B Q4_0 | 512 | pp16384 | 231.13 | 530.74 | 2.30 |

github-actions bot added the Nvidia GPU and ggml labels on Sep 10, 2025.
JohannesGaessler force-pushed the cuda-fa-tile-mem-pattern-4 branch from 4ff6731 to fe4eb4f on September 11, 2025.
JohannesGaessler linked an issue on Sep 11, 2025 that may be closed by this pull request.
Review comment on this snippet:

```cpp
} else if constexpr (nbytes == 16) {
    *(int4 *) dst = *(const int4 *) src;
} else {
    static_assert(nbytes == 0 && nbytes == -1, "bad nbytes");
```
A member commented:
Wouldn't this work?

Suggested change:

```diff
-    static_assert(nbytes == 0 && nbytes == -1, "bad nbytes");
+    static_assert(false, "bad nbytes");
```

JohannesGaessler (author) replied:
I tried this first; it failed during the host pass.
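
Likely context (an assumption, not spelled out in the thread): before C++23, a `static_assert(false, ...)` whose condition does not depend on the template parameters can be diagnosed at template definition time even though the `if constexpr` branch is discarded, which is consistent with the failure during the host pass. Making the assertion dependent on `nbytes` defers the check until the invalid branch is actually instantiated. A minimal sketch:

```cpp
// Sketch only, not the PR's code: why the dependent condition is needed pre-C++23.
#include <cstring>

template <int nbytes>
void copy_bytes(void * dst, const void * src) {
    if constexpr (nbytes == 4 || nbytes == 8 || nbytes == 16) {
        std::memcpy(dst, src, nbytes);
    } else {
        // static_assert(false, "bad nbytes");      // not dependent on nbytes: older compilers
                                                    // may reject it even if never instantiated
        static_assert(nbytes == 0 && nbytes == -1,  // dependent on nbytes and always false,
                      "bad nbytes");                // but only evaluated on instantiation
    }
}

int main() {
    int src = 42, dst = 0;
    copy_bytes<4>(&dst, &src);    // fine: the else branch is discarded and never instantiated
    // copy_bytes<3>(&dst, &src); // would trigger the static_assert
    return dst == 42 ? 0 : 1;
}
```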

JohannesGaessler merged commit 0e6ff00 into ggml-org:master on Sep 11, 2025; 47 of 48 checks passed.
Nexesenex added commits to Nexesenex/croco.cpp that referenced this pull request on Sep 13 and Sep 14, 2025.
LunNova added a commit to LunNova/nixpkgs that referenced this pull request on Sep 16, 2025:
Includes fix for v_dot2_f32_f16 being used on ISAs without that instruction.

ggml-org/llama.cpp#15927
Successfully merging this pull request may close the following issue: Compile bug: Failing to compile with hipblas and gfx803.