[quantization] Increase precision #704

Merged: mhs4670go merged 3 commits into Samsung:main from stamalakhov:tune_gptq on May 13, 2026

Conversation

@stamalakhov
Contributor

This PR increases precision for GPTQ quantization:

  1. replaces the running moving average used to build the Hessian with a plain sum (see the sketch below)
  2. increases precision to double (fp64) in the Cholesky decomposition and its inversion to improve reproducibility.

Draft: #670
Related: #656
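For context, a minimal sketch of change 1, assuming a GPTQ-style add_batch step that accumulates the Hessian H from calibration inputs inp of shape (n_samples, columns); the function names and normalization constants below are illustrative, not the TICO code.

import torch

def add_batch_moving_average(H, inp, nsamples):
    # Previous style: rescale the running estimate before adding a normalized
    # term, i.e. a moving average of inp^T @ inp over the samples seen so far.
    n = inp.shape[0]
    inp = inp.float()
    H = H * (nsamples / (nsamples + n))
    H = H + (2.0 / (nsamples + n)) * (inp.t() @ inp)
    return H, nsamples + n

def add_batch_sum(H, inp):
    # This PR: accumulate a plain sum of inp^T @ inp; any normalization can be
    # applied once at the end, so precision is not lost to repeated rescaling.
    inp = inp.float()
    return H + inp.t() @ inp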

GPTQ quantization of HuggingFaceTB/SmolLM2-135M-Instruct gives the same Wikitext-2 test perplexity as in #702 (comment):

    int16 : 22.95

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov self-assigned this May 13, 2026
This PR increases precision for GPTQ quantization:
1. replaces moving average in Hessian with just a sum
2. increases precision to double in Cholesky and its inversion
to improve reproducibility.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
tico/quantization/algorithm/gptq/gptq.py:

  diag = torch.arange(self.columns, device=self.dev)
  H[diag, diag] += damp
- H = torch.linalg.cholesky(H)
+ H = torch.linalg.cholesky(H.double()).float()
Contributor

IIUC, currently the code casts each intermediate result back to float.

Since the goal of this PR is to increase precision in Cholesky and its inversion, it would be better to keep H as fp64 across the full linear algebra path and cast back to fp32 only once before entering the quantization loop.

H = H.double()
damp = percdamp * torch.mean(torch.diag(H))
diag = torch.arange(self.columns, device=self.dev)
H[diag, diag] += damp

H = torch.linalg.cholesky(H)
H = torch.cholesky_inverse(H)
Hinv = torch.linalg.cholesky(H, upper=True).float()

This preserves the intended double precision during Cholesky -> inverse -> Cholesky, while still keeping the existing fp32 quantization loop mostly unchanged.
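As a rough illustration (not part of the PR) of why keeping the whole path in fp64 matters, one can compare the fp32 and fp64 round trips of the same Cholesky -> inverse -> Cholesky chain on a random symmetric positive definite matrix; the matrix size and seed below are arbitrary.

import torch

torch.manual_seed(0)
A = torch.randn(512, 512)
H = A @ A.t() + 512 * torch.eye(512)  # symmetric positive definite test matrix

def roundtrip(H):
    # Cholesky -> inverse -> upper Cholesky, as in the suggestion above.
    L = torch.linalg.cholesky(H)
    Hinv = torch.cholesky_inverse(L)
    return torch.linalg.cholesky(Hinv, upper=True)

out32 = roundtrip(H.float()).double()
out64 = roundtrip(H.double())
print((out32 - out64).abs().max())  # the gap is the precision lost on the fp32 path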

Contributor Author


@mhs4670go
Fixed.
Please take another look.

Comment thread on tico/quantization/algorithm/gptq/gptq.py (outdated)
Co-authored-by: seongwoo chae <mhs4670go@naver.com>
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov requested a review from mhs4670go May 13, 2026 17:05
Contributor

@mhs4670go left a comment


LGTM

@mhs4670go mhs4670go merged commit 5bdfe4c into Samsung:main May 13, 2026
7 checks passed
@stamalakhov stamalakhov deleted the tune_gptq branch May 14, 2026 04:46