Conversation

JohannesGaessler (Collaborator):
See https://github.com/iacopPBK/llama.cpp-gfx906 . AMD GPUs support reads of up to 16 bytes from SRAM. This PR extends the tile FlashAttention CUDA kernel with support for reads of 8 or 16 bytes. The FP32 -> FP16 type conversion is also done prior to writing the data to SRAM to reduce I/O further.

I also checked the AMD ISA documentation for v_dot2_f32_f16 support and adjusted the code paths accordingly; it seems to be available everywhere except for RDNA 1.

Performance changes
| GPU | Model | Microbatch size | Test | t/s master | t/s 5bae9f9 | Speedup |
|---|---|---|---|---|---|---|
| MI50 | gemma 2B Q4_0 | 16 | pp16384 | 629.54 | 716.54 | 1.14 |
| MI50 | gemma 2B Q4_0 | 32 | pp16384 | 728.28 | 911.94 | 1.25 |
| MI50 | gemma 2B Q4_0 | 512 | pp16384 | 1412.40 | 2141.40 | 1.52 |
| MI50 | llama 1B Q4_0 | 16 | pp16384 | 927.48 | 998.83 | 1.08 |
| MI50 | llama 1B Q4_0 | 32 | pp16384 | 1189.48 | 1338.73 | 1.13 |
| MI50 | llama 1B Q4_0 | 512 | pp16384 | 2278.02 | 2898.30 | 1.27 |
| MI50 | llama 8B Q4_0 | 16 | pp16384 | 277.56 | 294.46 | 1.06 |
| MI50 | llama 8B Q4_0 | 32 | pp16384 | 334.59 | 366.55 | 1.10 |
| MI50 | llama 8B Q4_0 | 512 | pp16384 | 508.27 | 606.21 | 1.19 |
| RX 6800 | gemma 2B Q4_0 | 16 | pp16384 | 399.91 | 636.85 | 1.59 |
| RX 6800 | gemma 2B Q4_0 | 32 | pp16384 | 313.91 | 992.28 | 3.16 |
| RX 6800 | gemma 2B Q4_0 | 512 | pp16384 | 568.27 | 1897.49 | 3.34 |
| RX 6800 | llama 1B Q4_0 | 16 | pp16384 | 644.87 | 903.19 | 1.40 |
| RX 6800 | llama 1B Q4_0 | 32 | pp16384 | 658.72 | 1336.48 | 2.03 |
| RX 6800 | llama 1B Q4_0 | 512 | pp16384 | 897.77 | 2301.67 | 2.56 |
| RX 6800 | llama 8B Q4_0 | 16 | pp16384 | 172.06 | 234.54 | 1.36 |
| RX 6800 | llama 8B Q4_0 | 32 | pp16384 | 174.27 | 328.20 | 1.88 |
| RX 6800 | llama 8B Q4_0 | 512 | pp16384 | 231.13 | 530.74 | 2.30 |

github-actions bot added the Nvidia GPU and ggml labels on Sep 10, 2025.
JohannesGaessler force-pushed the cuda-fa-tile-mem-pattern-4 branch from 4ff6731 to fe4eb4f on September 11, 2025.
JohannesGaessler linked an issue on Sep 11, 2025 that may be closed by this pull request.
Review comment on this snippet:

```cpp
} else if constexpr (nbytes == 16) {
    *(int4 *) dst = *(const int4 *) src;
} else {
    static_assert(nbytes == 0 && nbytes == -1, "bad nbytes");
```
A member commented:
Wouldn't this work?

Suggested change:

```diff
-    static_assert(nbytes == 0 && nbytes == -1, "bad nbytes");
+    static_assert(false, "bad nbytes");
```

JohannesGaessler (author) replied:
I tried this first; it failed during the host pass.
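
Likely context (an assumption, not spelled out in the thread): before C++23, a `static_assert(false, ...)` whose condition does not depend on the template parameters can be diagnosed at template definition time even though the `if constexpr` branch is discarded, which is consistent with the failure during the host pass. Making the assertion dependent on `nbytes` defers the check until the invalid branch is actually instantiated. A minimal sketch:

```cpp
// Sketch only, not the PR's code: why the dependent condition is needed pre-C++23.
#include <cstring>

template <int nbytes>
void copy_bytes(void * dst, const void * src) {
    if constexpr (nbytes == 4 || nbytes == 8 || nbytes == 16) {
        std::memcpy(dst, src, nbytes);
    } else {
        // static_assert(false, "bad nbytes");      // not dependent on nbytes: older compilers
                                                    // may reject it even if never instantiated
        static_assert(nbytes == 0 && nbytes == -1,  // dependent on nbytes and always false,
                      "bad nbytes");                // but only evaluated on instantiation
    }
}

int main() {
    int src = 42, dst = 0;
    copy_bytes<4>(&dst, &src);    // fine: the else branch is discarded and never instantiated
    // copy_bytes<3>(&dst, &src); // would trigger the static_assert
    return dst == 42 ? 0 : 1;
}
```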

JohannesGaessler merged commit 0e6ff00 into ggml-org:master on Sep 11, 2025; 47 of 48 checks passed.
Nexesenex added commits to Nexesenex/croco.cpp that referenced this pull request on Sep 13 and Sep 14, 2025.
LunNova added a commit to LunNova/nixpkgs that referenced this pull request on Sep 16, 2025:
Includes fix for v_dot2_f32_f16 being used on ISAs without that instruction.

ggml-org/llama.cpp#15927
Successfully merging this pull request may close the following issue: Compile bug: Failing to compile with hipblas and gfx803.