[quantization] Increase precision #704

Merged: mhs4670go merged 3 commits into Samsung:main from stamalakhov:tune_gptq on May 13, 2026

Conversation

@stamalakhov
Contributor

This PR increases precision for GPTQ quantization:

  1. replaces the running moving average used to build the Hessian with a plain sum (see the sketch below)
  2. increases precision to double (fp64) in the Cholesky decomposition and its inversion to improve reproducibility.

Draft: #670
Related: #656
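For context, a minimal sketch of change 1, assuming a GPTQ-style add_batch step that accumulates the Hessian H from calibration inputs inp of shape (n_samples, columns); the function names and normalization constants below are illustrative, not the TICO code.

import torch

def add_batch_moving_average(H, inp, nsamples):
    # Previous style: rescale the running estimate before adding a normalized
    # term, i.e. a moving average of inp^T @ inp over the samples seen so far.
    n = inp.shape[0]
    inp = inp.float()
    H = H * (nsamples / (nsamples + n))
    H = H + (2.0 / (nsamples + n)) * (inp.t() @ inp)
    return H, nsamples + n

def add_batch_sum(H, inp):
    # This PR: accumulate a plain sum of inp^T @ inp; any normalization can be
    # applied once at the end, so precision is not lost to repeated rescaling.
    inp = inp.float()
    return H + inp.t() @ inp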

GPTQ quantization of HuggingFaceTB/SmolLM2-135M-Instruct gives the same Wikitext-2 test perplexity as in #702 (comment):

    int16 : 22.95

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov self-assigned this May 13, 2026
This PR increases precision for GPTQ quantization:
1. replaces moving average in Hessian with just a sum
2. increases precision to double in Cholesky and its inversion
to improve reproducibility.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
tico/quantization/algorithm/gptq/gptq.py:

  diag = torch.arange(self.columns, device=self.dev)
  H[diag, diag] += damp
- H = torch.linalg.cholesky(H)
+ H = torch.linalg.cholesky(H.double()).float()
Contributor

IIUC, currently the code casts each intermediate result back to float.

Since the goal of this PR is to increase precision in Cholesky and its inversion, it would be better to keep H as fp64 across the full linear algebra path and cast back to fp32 only once before entering the quantization loop.

H = H.double()
damp = percdamp * torch.mean(torch.diag(H))
diag = torch.arange(self.columns, device=self.dev)
H[diag, diag] += damp

H = torch.linalg.cholesky(H)
H = torch.cholesky_inverse(H)
Hinv = torch.linalg.cholesky(H, upper=True).float()

This preserves the intended double precision during Cholesky -> inverse -> Cholesky, while still keeping the existing fp32 quantization loop mostly unchanged.
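As a rough illustration (not part of the PR) of why keeping the whole path in fp64 matters, one can compare the fp32 and fp64 round trips of the same Cholesky -> inverse -> Cholesky chain on a random symmetric positive definite matrix; the matrix size and seed below are arbitrary.

import torch

torch.manual_seed(0)
A = torch.randn(512, 512)
H = A @ A.t() + 512 * torch.eye(512)  # symmetric positive definite test matrix

def roundtrip(H):
    # Cholesky -> inverse -> upper Cholesky, as in the suggestion above.
    L = torch.linalg.cholesky(H)
    Hinv = torch.cholesky_inverse(L)
    return torch.linalg.cholesky(Hinv, upper=True)

out32 = roundtrip(H.float()).double()
out64 = roundtrip(H.double())
print((out32 - out64).abs().max())  # the gap is the precision lost on the fp32 path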

Contributor Author


@mhs4670go
Fixed.
Please take another look.

Comment thread on tico/quantization/algorithm/gptq/gptq.py (outdated)
Co-authored-by: seongwoo chae <mhs4670go@naver.com>
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov requested a review from mhs4670go May 13, 2026 17:05
Contributor

@mhs4670go left a comment


LGTM

@mhs4670go mhs4670go merged commit 5bdfe4c into Samsung:main May 13, 2026
7 checks passed
@stamalakhov stamalakhov deleted the tune_gptq branch May 14, 2026 04:46