@kylesayrs commented Aug 1, 2025

Purpose

  • Enable saving models with applied transforms
    • Transform config encodes both online and offline (fused) rotations
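Concretely (and up to transpose conventions), a fused (offline) rotation is multiplied into the checkpoint weights before quantization, while an online rotation is applied to activations at runtime by a hook. For the two config groups shown below, a Linear layer computes roughly y = U⁻¹ (U W R⁻¹) (R x) = W x: the rotated weight U W R⁻¹ is what gets quantized and serialized, and the R (input) and U⁻¹ (output) factors run online, so the model's function is unchanged in exact arithmetic.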
config.json:

```json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 128,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "group",
          "symmetric": true,
          "type": "int"
        }
      }
    },
    "global_compression_ratio": null,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {},
    "transform_config": {
      "config_groups": {
        "u": {
          "apply": [
            {
              "ignore": [
                "lm_head"
              ],
              "inverse": false,
              "location": "weight_output",
              "targets": [
                "Linear"
              ]
            },
            {
              "ignore": [
                "lm_head"
              ],
              "inverse": true,
              "location": "output",
              "targets": [
                "Linear"
              ]
            }
          ],
          "head_dim": null,
          "randomize": false,
          "requires_grad": false,
          "type": "random-hadamard"
        },
        "v": {
          "apply": [
            {
              "ignore": [
                "lm_head"
              ],
              "inverse": false,
              "location": "input",
              "targets": [
                "Linear"
              ]
            },
            {
              "ignore": [
                "lm_head"
              ],
              "inverse": true,
              "location": "weight_input",
              "targets": [
                "Linear"
              ]
            }
          ],
          "head_dim": null,
          "randomize": false,
          "requires_grad": false,
          "type": "random-hadamard"
        }
      }
    },
    "version": "0.10.3.dev146+ga3cd59d"
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.55.0.dev0",
  "use_cache": true,
  "vocab_size": 128256
}
```
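For reference, the same transform config could be built programmatically. This is a minimal sketch assuming the `TransformConfig` / `TransformScheme` / `TransformArgs` models exported by `compressed_tensors.transform`; the field names mirror the JSON above, but exact constructor signatures may differ.

```python
from compressed_tensors.transform import TransformArgs, TransformConfig, TransformScheme

# "u" rotation: fused into the weight's output dim offline, inverted online at the output
u = TransformScheme(
    type="random-hadamard",
    apply=[
        TransformArgs(targets=["Linear"], location="weight_output", ignore=["lm_head"]),
        TransformArgs(targets=["Linear"], location="output", inverse=True, ignore=["lm_head"]),
    ],
)

# "v" rotation: applied online at the input, inverse fused into the weight's input dim
v = TransformScheme(
    type="random-hadamard",
    apply=[
        TransformArgs(targets=["Linear"], location="input", ignore=["lm_head"]),
        TransformArgs(targets=["Linear"], location="weight_input", inverse=True, ignore=["lm_head"]),
    ],
)

transform_config = TransformConfig(config_groups={"u": u, "v": v})
```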

Prerequisites

Changes

  • Implement transform_config, similar to the sparsity config, as a sub-config of the quantization config
    • This aligns with HF's pattern of treating a "quantization config" as a general compression/optimization config
  • The transform config is passed to serialization by attaching it to the model when transforms are applied
  • Refactor ModelCompressor.update_config to support writing quantization, sparsity, and transform (q/s/t) configs (see the sketch below)
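As a quick illustration of the resulting on-disk layout (not the `ModelCompressor` API itself), here is a minimal sketch reading the nested sub-configs back out of a serialized `config.json`:

```python
import json

with open("config.json") as f:
    config = json.load(f)

# compression config written under HF's "quantization_config" key
quant_config = config["quantization_config"]
sparsity_config = quant_config.get("sparsity_config")    # existing sub-config
transform_config = quant_config.get("transform_config")  # sub-config added by this PR

if transform_config:
    for name, group in transform_config["config_groups"].items():
        locations = [args["location"] for args in group["apply"]]
        print(name, group["type"], locations)
```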

Follow ups

  • Some work will be needed if we want to support users passing a CompressedTensorsConfig directly
  • Right now there are three ways we pass configs; some work could be done to consolidate these methods ([WIP] Refactor serialization of qconfig #410):
    • qconfig is reconstructed from the attached quantization schemes
    • sconfig is inferred from the model in llm-compressor (LC) and passed as an argument
    • tconfig is attached to the model directly

Testing

kylesayrs added 30 commits May 30, 2025 13:40
kylesayrs and others added 4 commits July 31, 2025 10:59
@kylesayrs changed the base branch from main to kylesayrs/transform_save August 1, 2025 23:39
@kylesayrs marked this pull request as ready for review August 1, 2025 23:53
@brian-dellabetta left a comment

Changes to the transform config look good to me, so approving, but we definitely need to confirm the quantization lifecycle changes with @dsikka and @rahul-tuli.

Base automatically changed from kylesayrs/transform_save to main August 7, 2025 01:12
@dsikka dismissed brian-dellabetta's stale review August 7, 2025 01:12

The base branch was changed.

dsikka previously approved these changes Aug 7, 2025
@dsikka left a comment

LGTM but needs rebase

@brian-dellabetta left a comment

one nit question, otherwise LGTM

@dsikka merged commit 0731aa5 into main Aug 11, 2025
1 check passed
@dsikka deleted the kylesayrs/serialize-tconfig branch August 11, 2025 18:13
dsikka added a commit that referenced this pull request Aug 12, 2025
@dsikka restored the kylesayrs/serialize-tconfig branch August 12, 2025 01:34
dsikka added a commit that referenced this pull request Aug 12, 2025
brian-dellabetta added a commit to vllm-project/llm-compressor that referenced this pull request Aug 13, 2025
## Purpose ##
* Enable offline spinquant-style transforms

## Prerequisites ##
* neuralmagic/compressed-tensors#370
* neuralmagic/compressed-tensors#412
* neuralmagic/compressed-tensors#414

## Changes ##
* Added `spinquant_example.py` to examples folder
* Added `SpinQuantModifier` which handles the construction of a
spinquant-style transform config
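For background: per the SpinQuant paper, R1 is a hidden-size rotation folded into the residual-stream weights (embeddings, attention/MLP input and output projections, and `lm_head`), while R2 is a per-head (head-dim) rotation folded into the value and output projections. Each rotation is paired with its inverse (R Rᵀ = I for Hadamard/orthogonal R), so the network's function is unchanged while weight and activation outliers are reduced ahead of quantization.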

## Testing ##
* Added modifier serialization and correctness tests

## Evaluation ##
Using this branch, and [the original SpinQuant
code](https://github.com/facebookresearch/SpinQuant), we see very
similar results for `meta-llama/Llama-3.2-1B-Instruct` with W4A16
quantization. Results are equivalent in HF (in-memory vs. serialized and re-loaded), and very similar in vLLM. The symmetric scales calculation in `llm-compressor` differs slightly from the original SpinQuant paper, which uses the original GPTQ implementation. When that implementation's scales are swapped in, results are consistent, with Hadamard improving results on `gsm8k_llama` and `arc_challenge_llama`:

Scheme | Impl | gsm8k | gsm8k_llama | arc_challenge_llama
-- | -- | -- | -- | --
Hadamard+W4A16 | LC | 0.2403 | 0.2835 | 0.5262
W4A16 | LC | 0.1964 | 0.1933 | 0.4781
Hadamard+W4A16 | LC+SQscales | 0.1721 | 0.2183 | 0.485
W4A16 | LC+SQscales | 0.207 | 0.1706 | 0.4498
Hadamard+W4A16 | SQ | 0.1736 | 0.2282 | 0.4807
W4A16 | SQ | 0.1986 | 0.1774 | 0.4489

To run LC+SQScales, change [this line in
CT](https://github.com/neuralmagic/compressed-tensors/blob/b2df366797b00330ec765f5891dde14e4cc74c9d/src/compressed_tensors/quantization/utils/helpers.py#L111)
from

```python
scales = max_val_pos / (float(bit_range) / 2)
```
to
```python
scales = max_val_pos / (float(bit_max))
```
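For context on the two formulas: `bit_max` is the largest positive quantized value (2^(b−1) − 1, i.e. 7 for 4-bit signed weights), while `bit_range / 2` is half the full quantized range, so the two denominators differ slightly. Dividing by `bit_max` reproduces the scale convention used by the original SpinQuant/GPTQ code and yields marginally larger scales.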

<details>
<summary>The following python script was used to generate these
results</summary>

Clone SpinQuant repo and paste this in the top-level directory:
```python
# coding=utf-8
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import torch
from typing import Literal
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from torch import nn
import lm_eval

from transformers import LlamaForCausalLM, AutoTokenizer
import transformers
from train_utils.main import prepare_model
from train_utils.modeling_llama_quant import LlamaForCausalLM as LlamaForCausalLMQuant
from utils.hadamard_utils import random_hadamard_matrix, hadamard_matrix
from utils.process_args import process_args_ptq

# model_id = "meta-llama/Llama-3.1-8B-Instruct"
# model_id = "meta-llama/Llama-3.2-3B-Instruct"
model_id = "meta-llama/Llama-3.2-1B-Instruct"
dtype = torch.bfloat16


class RotateModule(nn.Module):
    def __init__(self, R_init):
        super(RotateModule, self).__init__()
        self.weight = nn.Parameter(R_init.to(torch.float32).to(torch.device("cuda")))

    def forward(self, x, transpose=False):
        if transpose:
            return x @ self.weight
        else:
            return self.weight @ x


def get_sq_model(
    r1r2: Literal["eye", "random-hadamard", "hadamard"],
    w_bits: Literal[4, 16],
    w_clip: bool = False,
) -> LlamaForCausalLMQuant:
    model_args, training_args, ptq_args = process_args_ptq()
    model_args.input_model = model_id
    if w_bits == 4:
        ptq_args.w_bits = 4
        ptq_args.w_groupsize = 128
        ptq_args.w_rtn = True  # if False, GPTQ is used
        ptq_args.w_clip = w_clip
    ptq_args.a_bits = 16
    ptq_args.k_bits = 16
    ptq_args.v_bits = 16

    print("=======ARGS=======", ptq_args)

    config = transformers.AutoConfig.from_pretrained(model_args.input_model)

    # Llama v3.2 specific: SpinQuant is not compatible with tie_word_embeddings, clone lm_head from embed_tokens
    process_word_embeddings = False
    if config.tie_word_embeddings:
        config.tie_word_embeddings = False
        process_word_embeddings = True

    model = LlamaForCausalLMQuant.from_pretrained(
        pretrained_model_name_or_path=model_args.input_model,
        config=config,
        torch_dtype=dtype,
        device_map="cuda",
    )

    if process_word_embeddings:
        model.lm_head.weight.data = model.model.embed_tokens.weight.data.clone()

    model = prepare_model(ptq_args, model)
    for param in model.parameters():
        param.requires_grad = False
    match r1r2:
        case "eye":
            R1 = torch.eye(model.config.hidden_size, device="cuda")
        case "random-hadamard":
            R1 = random_hadamard_matrix(model.config.hidden_size, "cuda")
        case _:
            R1 = hadamard_matrix(model.config.hidden_size, "cuda")
    model.R1 = RotateModule(R1)
    for i in range(model.config.num_hidden_layers):
        # Each head dim = 128 for Llama model
        match r1r2:
            case "eye":
                R2 = torch.eye(
                    model.config.hidden_size // model.config.num_attention_heads,
                    device="cuda",
                )
            case "random-hadamard":
                R2 = random_hadamard_matrix(
                    model.config.hidden_size // model.config.num_attention_heads, "cuda"
                )
            case _:
                R2 = hadamard_matrix(
                    model.config.hidden_size // model.config.num_attention_heads, "cuda"
                )
        model.model.layers[i].self_attn.R2 = RotateModule(R2)

    model.config.use_cache = False

    return model


def get_lc_model(
    r1r2: Literal["eye", "random-hadamard", "hadamard"], w_bits: Literal[4, 16]
) -> LlamaForCausalLM:
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.modifiers.transform import SpinQuantModifier

    model = LlamaForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_id,
        torch_dtype=dtype,
        device_map="cuda",
    )

    recipe = [
        SpinQuantModifier(
            rotations=[] if r1r2 == "eye" else ["R1", "R2"],
            transform_type="hadamard",
        )
    ]
    if w_bits == 4:
        recipe.append(
            QuantizationModifier(
                targets="Linear",
                scheme="W4A16",
                ignore=["lm_head"],
            )
        )

    oneshot(
        model=model,
        recipe=recipe,
        pipeline="datafree",
        log_dir=None,
    )

    return model


if __name__ == "__main__":
    for scales_impl in ["sq_min_hack", "lc_min_hack"]:
        for r1r2 in ["eye", "hadamard"]:
            for sq_lc in ["sq", "lc"]:
                w_bits = 4

                os.environ["SCALES_IMPL"] = scales_impl
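                # (assumption) this env var is read by the locally patched compressed-tensors
                # helpers to toggle between the two scale formulas shown above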

                model = (
                    get_sq_model(r1r2=r1r2, w_bits=w_bits)
                    if sq_lc == "sq"
                    else get_lc_model(r1r2=r1r2, w_bits=w_bits)
                ).to("cuda")

                SAVE_DIR = model_id.split("/")[1] + f"-{scales_impl}-{r1r2}-w4a16"
                model.save_pretrained(SAVE_DIR, save_compressed=True)
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id, trust_remote_code=True
                )
                tokenizer.save_pretrained(SAVE_DIR)

                del model
                del tokenizer
                torch.cuda.empty_cache()

                results = lm_eval.simple_evaluate(
                    # 1) hf in-memory
                    # model=lm_eval.models.huggingface.HFLM(
                    #     pretrained=model,
                    #     batch_size=32,
                    #     add_bos_token=False,
                    # ),
                    # 1/)
                    # 2) vllm serialized
                    model="vllm",
                    model_args={
                        "pretrained": SAVE_DIR,
                        "add_bos_token": False,
                        "dtype": "auto",
                        "max_model_len": 4096,
                        "gpu_memory_utilization": 0.5,
                        "enable_chunked_prefill": True,
                    },
                    # 2/)
                    # 3) hf serialized
                    # model="hf",
                    # model_args={
                    #     "pretrained": SAVE_DIR,
                    #     "add_bos_token": False,
                    #     "dtype": "auto",
                    # },
                    # device="cuda",
                    # 3/)
                    tasks=["gsm8k_llama", "gsm8k", "arc_challenge_llama"],
                    num_fewshot=8,
                    batch_size=32,
                    apply_chat_template=True,
                    fewshot_as_multiturn=True,
                )
                print(
                    f"RESULTS, {model_id} {sq_lc} R1R2 {r1r2} W_BITS {w_bits} SCALEIMPL {scales_impl}"
                )
                print(lm_eval.utils.make_table(results))
```
</details>


## Follow Ups ##
* Infer data free pipeline, even if a transform modifier is included
* Rotations R3 and R4
* Modify example to use GPTQ once basic evaluation has been performed

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Aug 14, 2025
## Purpose ##
* Enable quip-style transforms

## Prerequisites ##
* neuralmagic/compressed-tensors#370
* neuralmagic/compressed-tensors#412
* neuralmagic/compressed-tensors#414

## Changes ##
* Added `quip_example.py` to examples folder
* As made clear in the disclaimer, this example requires minimum
versions of compressed-tensors and transformers to run
* Added `QuIPModifier` which handles the construction of a quip-style
transform config
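As a quick orientation, here is a minimal usage sketch paralleling the SpinQuant recipe above; the exact `QuIPModifier` arguments are assumed here (`transform_type`) and may differ from `quip_example.py`:

```python
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype="auto", device_map="cuda"
)

recipe = [
    QuIPModifier(transform_type="random-hadamard"),  # quip-style U/V rotations (assumed arg name)
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
oneshot(model=model, recipe=recipe, pipeline="datafree")

model.save_pretrained("Llama-3.2-1B-Instruct-quip-w4a16", save_compressed=True)
```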

## Testing ##
* Added modifier serialization and correctness tests

## Evaluation ##
Evaluation performed by @brian-dellabetta 

Evals on Llama 3.2 1B with Quip (num_fewshot 8, limit 1000, to be compatible with the results [here](https://github.com/vllm-project/llm-compressor/pull/1243/files#diff-bdc27f23c0dc2da352d5c83abdc0f267873edf4d36f88474038b975df75bd8c3R38-R64)):

| Strat | gsm8k,strict | gsm8k_llama,strict |
|-|-|-|
| FP16 | .352 | .323 |
| Quip | .348 | .322 |
| W4A16 | .180 | .017 |
| Quip+W4A16 | .213 | .141 |

## Follow Ups ##
* Infer data free pipeline, even if a transform modifier is included
* Modify example to use GPTQ once basic evaluation has been performed

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Etelis added a commit to Etelis/compressed-tensors that referenced this pull request Sep 11, 2025
* add utilities
* add tests
* add additional tests
* add utils and tests
* Implement transform factories
* add permutations
* add delete_offload_module
* key inverses by weight
* fix tests
* standardize random hadamard
* prepend input hooks
* apply sqrt division first
* use divided hadamards
* fix typo
* add random option
* use random seeds, rename matrix multiply
* add deterministic generation to random matrix
* fix perm math
* update docstrings
* update docstrings
* cleanup
* cleanup 2
* make seed optional
* remove iterable check and missing return value
* Remove unrelated changes
* simplify code
* implement apply, use in tests
* use hadamards database file
* try manifest
* try setup, update hadamards list
* fix setup
* add docstrings, cleanup
* fix setup, thank you @dbarbuzzi
* remove numpy, add tests
* solidify dtype, add gpu tests
* fix docstring
* add device option
* construct on execution device, cache on offload device
* save construction device changes for later
* construct on execution device, cache on offload device
* cite nja sloane
* remove dreg
* put on device via safe_open
* nits and docstrings
* update docstring
* Merge
* merge with construct: construct in float32
* construct with same dtype, constructing on fp32 found no difference
* remove unnecessary imports
* bugfixes (neuralmagic#375)
* use factory_kwargs
* add frozen dict to deps
* fix style
* merge
* use delete_offload_module
* add docstrign
* use parametrize
* populate _dynamic_tied_weights_keys
* ensure serializable
* remove extra space
* apply style
* merge dregs
* skip offloading tests until transformers changes land
* use set
* [Quantization][Decompression] Fix QDQ for dynamic quant; Update NVFP4 Compression Params (neuralmagic#407)
* add compression param; update qdq for batch greater than 1
* make generic
* fix tests
* remove incorrect line change; make generic
* update
* serialize
* fix typo, comment

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Etelis added a commit to Etelis/compressed-tensors that referenced this pull request Sep 11, 2025