[Transform] Attention/Cache transforms #436
Conversation
This looks good, though I have a number of questions and minor suggestions.
def __init__(self, attn_module: Module):
    super().__init__()
    self.attn_module_container = [attn_module]  # avoid circular reference
Avoid circular reference by placing it in a list? Can a weakref be used here?
Sure, but to be clear, the circular reference is a module circular reference (i.e., a module cannot be the child of its own child), not a garbage collection issue.
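For context, a minimal sketch of the pattern (the Wrapper name is illustrative): PyTorch's Module.__setattr__ registers any Module-valued attribute as a child, so hiding the reference in a plain list keeps it out of the module tree.

from torch.nn import Linear, Module

class Wrapper(Module):
    # illustrative stand-in for QuantizedAttentionImpl / QuantizedKVCache
    def __init__(self, attn_module: Module):
        super().__init__()
        # self.attn_module = attn_module would register attn_module as a child
        # of this wrapper; since the wrapper is later attached onto attn_module
        # itself, that would create a cycle in the module tree. A plain list
        # holds the reference without registering it.
        self.attn_module_container = [attn_module]

attn = Linear(8, 8)        # stand-in for an attention module
attn.impl = Wrapper(attn)  # attach the wrapper onto its own target
assert list(attn.impl.children()) == []  # attn is not a submodule of the wrapper
assert "impl" in dict(attn.named_children())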
quant_args = getattr_chain(module, quant_args_attr, None)
quant_enabled = getattr(module, "quantization_enabled", True)
if quant_args is not None and quant_enabled and self._qparams_initialized:
    query = forward_quantize(module, query, "q", quant_args)
Why is only the query quantized, and not key & value?
Never mind, I see the KV quantization implementation below. So QuantizedAttentionImpl only refers to the quantized query, and QuantizedKVCache always refers to the key/value states?
Yep! I can make a note, but this guards against misuse. I don't think there's any reason to add a quantization hook to attention but not to the KV cache.
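To make the division of labor concrete, a rough sketch (not the PR's code; fake_quantize is a placeholder for forward_quantize with the module's quantization args) of how the wrapped cache could own key/value quantization while the attention impl only touches the query:

from typing import Any
from torch import Tensor

def fake_quantize(x: Tensor, scale: float = 0.1) -> Tensor:
    # placeholder for forward_quantize(module, x, "k"/"v", quant_args)
    return (x / scale).round().clamp(-128, 127) * scale

class WrappedKVCache:
    # illustrative stand-in for QuantizedKVCache: wraps the original cache
    # passed through the past_key_values kwarg
    def __init__(self, original_cache: Any):
        self.original_cache = original_cache

    def update(self, key_states: Tensor, value_states: Tensor, *args, **kwargs):
        # key/value are quantized here, so the attention impl never needs to
        key_states = fake_quantize(key_states)
        value_states = fake_quantize(value_states)
        return self.original_cache.update(key_states, value_states, *args, **kwargs)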
def initialize_qparams_once(self, model: PreTrainedModel, module: Module):
    """
    Initialize attention quantization parameters if they have not already been
    intialized. KV cache quantization parameters are initialized by the
Suggested change:
- intialized. KV cache quantization parameters are initialized by the
+ initialized. KV cache quantization parameters are initialized by the
# assumes only one model at a time
global _original_impl
😬 I don't want to delay things, but we should briefly consider whether there are alternative solutions.
I spent 20 minutes exploring this; it requires creating specialized _ct_hooked_attention functions and a specialized QuantizedAttentionImpl, which is more complexity than value added, IMHO.
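For reference, the single-model pattern under discussion looks roughly like this sketch (the enable/disable function names are illustrative; ct_hooked_attention and model.config._attn_implementation come from the PR):

from typing import Optional
from transformers import PreTrainedModel

_original_impl: Optional[str] = None  # assumes only one model at a time

def enable_hooked_attention(model: PreTrainedModel) -> None:
    # remember the original implementation so the hooked one can dispatch to it
    global _original_impl
    _original_impl = model.config._attn_implementation
    model.config._attn_implementation = "ct_hooked_attention"

def disable_hooked_attention(model: PreTrainedModel) -> None:
    global _original_impl
    if _original_impl is not None:
        model.config._attn_implementation = _original_impl
        _original_impl = None

A per-model variant would stash the original implementation name on the model or its config rather than in a global, but, per the comment above, the shared hooked-attention function would then need to be specialized per instance.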
def __init__(self, attn_module: Module):
    super().__init__()
    self.attn_module_container = [attn_module]  # avoid circular reference
Same question here: weakref?
return kv_cache.register_forward_pre_hook(_hook, with_kwargs=True)


def register_value_hook(
There's a lot of equivalent code in register_key_hook and register_value_hook; can they call into the same logic with an id string that is either value_states or key_states?
The logic of "create a kwarg hook which uses the signature of the module child, but calls the hook with the parent module" is pretty specific to these use cases, and is a hard function to name 🙃.
I think it's better to be explicit in this case.
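For concreteness, the suggested factoring might look something like this sketch (the helper name and hook plumbing are illustrative; only the keyword path is handled here, whereas the real code works with the child's signature):

from typing import Callable, Optional
from torch import Tensor
from torch.nn import Module
from torch.utils.hooks import RemovableHandle

def _register_state_hook(
    module: Module,   # parent attention module the user hook receives
    kv_cache: Module,
    user_hook: Callable[[Module, Tensor], Optional[Tensor]],
    kwarg_name: str,  # "key_states" or "value_states"
) -> RemovableHandle:
    def _hook(cache: Module, args, kwargs):
        # assumes the state arrives by keyword; a real implementation would
        # also resolve positional arguments from the child's signature
        if kwarg_name in kwargs:
            new_state = user_hook(module, kwargs[kwarg_name])
            if new_state is not None:
                kwargs[kwarg_name] = new_state
        return args, kwargs

    return kv_cache.register_forward_pre_hook(_hook, with_kwargs=True)

def register_key_hook(module: Module, kv_cache: Module, hook) -> RemovableHandle:
    return _register_state_hook(module, kv_cache, hook, "key_states")

def register_value_hook(module: Module, kv_cache: Module, hook) -> RemovableHandle:
    return _register_state_hook(module, kv_cache, hook, "value_states")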
If the goal is to use this generally for kv_cache and attention quantization, can we move initialize_hooked_attention and initialize_hooked_kv_cache to initialize.py? I understand we haven't hooked them in yet for those workflows, but I think they belong there.
Do a pass through for any missing docstrings; otherwise LGTM. Nice work!
The base branch was changed.
Purpose

Prerequisites

Changes
- QuantizedAttentionImpl injects itself into the model by registering a new attention implementation called ct_hooked_attention and overriding model.config._attn_implementation to be the new implementation name (see the sketch below)
- QuantizedKVCache injects itself into the model by overriding the past_key_values input kwarg to attention and wrapping the functionality of the original cache
- register_query_hook, register_key_hook, and register_value_hook
- Q_ATTN and K_CACHE locations

Testing
- test_correctness_attention_heads test, which simulates R3
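A rough sketch of the registration mechanism described in the first Changes bullet, assuming the dict-style attention registry in recent transformers releases ("sdpa" stands in for the saved original implementation, and the hook body is a placeholder):

from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

def ct_hooked_attention(module, query, key, value, *args, **kwargs):
    # placeholder body: quantize/transform the query here, then dispatch to
    # the original implementation remembered at injection time
    original = ALL_ATTENTION_FUNCTIONS["sdpa"]  # stand-in for _original_impl
    return original(module, query, key, value, *args, **kwargs)

# register the new implementation and point the model at it
ALL_ATTENTION_FUNCTIONS["ct_hooked_attention"] = ct_hooked_attention
# model.config._attn_implementation = "ct_hooked_attention"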