Conversation

ikawrakow (Owner)

It ends up between IQ2_XXS and IQ2_XS in terms of quantized model size and quantization accuracy. This graph shows quantization error vs bpw for LLaMA-3.1-8B-Instruct:
[Graph: quantization error vs bpw for LLaMA-3.1-8B-Instruct]

What is the point, then? Two points:

  • Another proof that one can extend quantization to very low bpw without using a codebook (see the bit-accounting sketch after this list). My previous attempts to do that have not been successful, so I'm quite pleased with this outcome.
  • Much better CPU performance compared to IQ2_XXS or IQ2_XS (or any of the i-quants that use a codebook); see the tables below.
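
For reference, here is how the quoted bpw figures decompose. This is a back-of-the-envelope accounting that assumes the usual 256-weight super-blocks; the exact split of the metadata bits is an illustrative guess, not the actual struct layouts.

```python
# Bit accounting behind the quoted bpw figures (illustrative, not the real layouts).
QK_K = 256  # assumed super-block size in weights

def bpw(payload_bits_per_weight: int, metadata_bits_per_superblock: int) -> float:
    """Effective bits per weight for one super-block."""
    total_bits = payload_bits_per_weight * QK_K + metadata_bits_per_superblock
    return total_bits / QK_K

# IQ2_XS: 2-bit payload plus 80 bits of scales/metadata per super-block
print(bpw(2, 80))  # 2.3125
# IQ2_KS: 2-bit payload plus 48 bits of scales/metadata per super-block
print(bpw(2, 48))  # 2.1875
```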

M2-Max CPU

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 46.86 ± 0.05 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 72.27 ± 0.19 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 18.83 ± 0.06 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 34.50 ± 0.30 |

Ryzen-7950X CPU

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 16 | pp512 | 128.88 ± 0.21 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 16 | pp512 | 187.56 ± 1.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 4 | tg128 | 11.91 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 4 | tg128 | 21.05 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 8 | tg128 | 20.55 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 8 | tg128 | 23.61 ± 0.20 |
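
A couple of sanity checks can be derived purely from these tables. Single-stream tg is typically memory-bandwidth bound, so model size times tg t/s gives a rough lower bound on the achieved bandwidth; and the overall file bpw comes out above the nominal per-tensor bpw because a few tensors (e.g. output and token embeddings) are normally kept at higher precision. The sketch below is just arithmetic on the table entries:

```python
# Arithmetic derived from the benchmark tables above (illustrative only).
GIB = 1024 ** 3
PARAMS = 8.03e9  # parameter count from the tables

def overall_bpw(size_gib: float) -> float:
    """Average bits per weight over the whole model file."""
    return size_gib * GIB * 8 / PARAMS

def implied_bandwidth(size_gib: float, tg_tokens_per_s: float) -> float:
    """GiB/s read if each generated token streams the whole model once."""
    return size_gib * tg_tokens_per_s

print(f"IQ2_KS overall bpw: {overall_bpw(2.30):.2f}")   # ~2.46
print(f"IQ2_XS overall bpw: {overall_bpw(2.42):.2f}")   # ~2.59
print(f"M2-Max tg128, IQ2_KS: ~{implied_bandwidth(2.30, 34.50):.0f} GiB/s")
print(f"M2-Max tg128, IQ2_XS: ~{implied_bandwidth(2.42, 18.83):.0f} GiB/s")
```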

The only caveat: quantization is really slow. It takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.

ikawrakow merged commit 910a134 into main on Oct 13, 2024.