test: add tokenizer regression test for granite-4.0-micro #948
kndtran wants to merge 2 commits into generative-computing:main from …
Conversation
…-computing#947) Assert golden token IDs and tokenizer class to catch transformers upgrades that change AutoTokenizer resolution for granite-4.0-micro. Motivated by transformers v5 returning GPT2Tokenizer (constructs BPE from vocab.json/merges.txt) instead of GPT2TokenizerFast (loads tokenizer.json), which produced different token IDs and broke 3/6 RAG intrinsics.

Closes generative-computing#947
Assisted-by: Claude Code
Signed-off-by: Khoi-Nguyen Tran <kndtran@ibm.com>
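For context, a minimal sketch of the kind of regression check this commit describes, assuming pytest, an assumed Hugging Face repo id, and placeholder golden data (the real test pins token IDs captured from a known-good transformers version):

```python
import pytest
import transformers

# Assumed repo id for granite-4.0-micro; the id used by the actual test may differ.
MODEL_ID = "ibm-granite/granite-4.0-micro"

# Placeholder golden data for illustration only; the real entries hold token IDs
# captured from a known-good transformers version.
GOLDEN = {
    "What is the DUNS number for Acme Corp?": [101, 102, 103, 104, 105],
}


@pytest.fixture(scope="module")
def tokenizer():
    return transformers.AutoTokenizer.from_pretrained(MODEL_ID)


def test_tokenizer_class(tokenizer):
    # A tokenizer.json-backed fast tokenizer is expected; a slow BPE tokenizer
    # built from vocab.json/merges.txt indicates AutoTokenizer resolution changed.
    assert tokenizer.is_fast


@pytest.mark.parametrize("text,expected_ids", list(GOLDEN.items()))
def test_golden_token_ids(tokenizer, text, expected_ids):
    assert tokenizer(text, add_special_tokens=False)["input_ids"] == expected_ids
```

The class check is what surfaces the v5 behavior described above: a slow GPT2Tokenizer reports `is_fast == False` even before any golden-ID drift is inspected.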
Original body:
Summary
Closes #947
Test plan
The PR description has been updated. Please fill out the template for your PR to be reviewed.
…enerative-computing#947) Replace full-query golden entries and intrinsic output labels with 20 diverse corrupted snippets extracted from context_relevance and query_rewrite eval data. Coverage now spans pure numbers, DUNS, company suffixes, fiscal year compounds, industry codes, standards, and mixed text+number phrases, all confirmed to diverge under transformers v5.

Assisted-by: Claude Code
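As an illustration of the snippet-level coverage this commit describes, a hypothetical sketch of how such golden entries might be organized; the snippets and token IDs below are invented stand-ins for the listed categories, not the actual eval data:

```python
# Hypothetical golden entries keyed by corrupted snippet; the real entries come
# from context_relevance / query_rewrite eval data, with IDs captured from a
# known-good tokenizer. All values here are placeholders.
GOLDEN_SNIPPETS: dict[str, list[int]] = {
    "804735132": [201, 202],                # pure number
    "DUNS 06-989-4512": [203, 204, 205],    # DUNS identifier
    "Acme Holdings LLC": [206, 207, 208],   # company suffix
    "FY2023 revenue": [209, 210],           # fiscal year compound
    "NAICS 541511": [211, 212],             # industry code
    "ISO 27001": [213, 214],                # standard
    "grew 12% in Q3": [215, 216, 217],      # mixed text + number phrase
}


def diverging_snippets(tokenizer) -> list[str]:
    """Return snippets whose current token IDs no longer match the golden IDs."""
    return [
        text
        for text, expected in GOLDEN_SNIPPETS.items()
        if tokenizer(text, add_special_tokens=False)["input_ids"] != expected
    ]
```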
@psschwei I updated the PR template with the details.
@kndtran I recommend that we come up with an explanation for this problem that makes sense before checking a workaround into Mellea. The current analysis from Claude produces more questions than answers. Specific important questions:
@frreiss I got Claude to refine the details and evidence in the referenced issue #947. This PR is just a simple regression test for the expected input IDs, specifically for our intrinsics use case, so there are no workarounds yet (only suggested), and it's not a general test in … Let me try to address those questions in the referenced issue. The release notes are light on the details of the tokenizer consolidation, so I (with Claude's help) had to discover the diff changes in the HF code.
@pytest.fixture(scope="module")
def tokenizer():
    return transformers.AutoTokenizer.from_pretrained(MODEL_ID)
This line makes no sense to me. Why are we testing the correctness of loading the tokenizer the wrong way? Per the comment above, this line should be calling PreTrainedTokenizerFast.from_pretrained(). Why do these tests not fail?
mellea is still currently on transformers < 5 (line 47 in 7912a1d).
This test will fail on migration to TF5 and will need to be updated with the workaround, if that's the choice, or with an official fix in transformers.
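If the workaround route is taken on a TF5 migration, one minimal sketch (assuming the model still ships a tokenizer.json, that the regression is limited to AutoTokenizer resolution, and using an assumed repo id) would be to load the fast tokenizer directly, along the lines the review comment above suggests:

```python
from transformers import PreTrainedTokenizerFast

# Assumed repo id; adjust to the actual granite-4.0-micro checkpoint in use.
MODEL_ID = "ibm-granite/granite-4.0-micro"


def load_granite_tokenizer(model_id: str = MODEL_ID) -> PreTrainedTokenizerFast:
    # Load the tokenizer.json-backed fast tokenizer directly instead of going
    # through AutoTokenizer, so a change in AutoTokenizer resolution (e.g. to a
    # slow GPT2Tokenizer built from vocab.json/merges.txt) cannot alter the
    # token IDs produced for the intrinsics.
    return PreTrainedTokenizerFast.from_pretrained(model_id)
```

This is only the suggested direction from the thread, not a committed fix.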
Hmm ok, a workaround will have to be implemented if there was a migration to TF5 in …
planetf1 left a comment
Just added a note on test gating -- rest will wait for a v5 upgrade
Type of PR
Misc PR

Description

Testing
uv run pytest test/formatters/granite/test_tokenizer_regression.py -v: 26 passed
uv run ruff check and uv run ruff format --check: clean

Additional

Attribution