Conversation

ikawrakow (Owner)

It ends up between IQ2_XXS and IQ2_XS in terms of quantized model size and quantization accuracy. This graph shows quantization error vs bpw for LLaMA-3.1-8B-Instruct:
[Graph: quantization error vs bpw for LLaMA-3.1-8B-Instruct]

What is the point, then? Two points:

  • Another proof that one can extend quantization to very low bpw without using a codebook (see the bit-accounting sketch after this list). My previous attempts to do that have not been successful, so I'm quite pleased with this outcome.
  • Much better CPU performance compared to IQ2_XXS or IQ2_XS (or any of the i-quants that use a codebook); see the tables below.
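
For reference, here is how the quoted bpw figures decompose. This is a back-of-the-envelope accounting that assumes the usual 256-weight super-blocks; the exact split of the metadata bits is an illustrative guess, not the actual struct layouts.

```python
# Bit accounting behind the quoted bpw figures (illustrative, not the real layouts).
QK_K = 256  # assumed super-block size in weights

def bpw(payload_bits_per_weight: int, metadata_bits_per_superblock: int) -> float:
    """Effective bits per weight for one super-block."""
    total_bits = payload_bits_per_weight * QK_K + metadata_bits_per_superblock
    return total_bits / QK_K

# IQ2_XS: 2-bit payload plus 80 bits of scales/metadata per super-block
print(bpw(2, 80))  # 2.3125
# IQ2_KS: 2-bit payload plus 48 bits of scales/metadata per super-block
print(bpw(2, 48))  # 2.1875
```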

M2-Max CPU

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 46.86 ± 0.05 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 72.27 ± 0.19 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 18.83 ± 0.06 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 34.50 ± 0.30 |

Ryzen-7950X CPU

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 16 | pp512 | 128.88 ± 0.21 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 16 | pp512 | 187.56 ± 1.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 4 | tg128 | 11.91 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 4 | tg128 | 21.05 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 8 | tg128 | 20.55 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 8 | tg128 | 23.61 ± 0.20 |
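
A couple of sanity checks can be derived purely from these tables. Single-stream tg is typically memory-bandwidth bound, so model size times tg t/s gives a rough lower bound on the achieved bandwidth; and the overall file bpw comes out above the nominal per-tensor bpw because a few tensors (e.g. output and token embeddings) are normally kept at higher precision. The sketch below is just arithmetic on the table entries:

```python
# Arithmetic derived from the benchmark tables above (illustrative only).
GIB = 1024 ** 3
PARAMS = 8.03e9  # parameter count from the tables

def overall_bpw(size_gib: float) -> float:
    """Average bits per weight over the whole model file."""
    return size_gib * GIB * 8 / PARAMS

def implied_bandwidth(size_gib: float, tg_tokens_per_s: float) -> float:
    """GiB/s read if each generated token streams the whole model once."""
    return size_gib * tg_tokens_per_s

print(f"IQ2_KS overall bpw: {overall_bpw(2.30):.2f}")   # ~2.46
print(f"IQ2_XS overall bpw: {overall_bpw(2.42):.2f}")   # ~2.59
print(f"M2-Max tg128, IQ2_KS: ~{implied_bandwidth(2.30, 34.50):.0f} GiB/s")
print(f"M2-Max tg128, IQ2_XS: ~{implied_bandwidth(2.42, 18.83):.0f} GiB/s")
```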

The only caveat: quantization is really slow. It takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.

ikawrakow merged commit 910a134 into main on Oct 13, 2024.