
AttributeError: Qwen2Tokenizer has no attribute batch_encode_plus. Did you mean: '_encode_plus'? #870

Open
jiyzhang wants to merge 1 commit into NVIDIA:main from jiyzhang:patch-1

Conversation

jiyzhang commented Feb 9, 2026

What does this PR do?

Type of change: Bug fix

Overview:

The error below occurred when trying to quantize Qwen3 models (Qwen/Qwen3-Code-Next):

  File "/app/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 146, in make_calib_dataloader
    calib_dataloader = get_dataset_dataloader(
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/dataset_utils.py", line 217, in get_dataset_dataloader
    batch_encoded = tokenizer.batch_encode_plus(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1291, in __getattr__
    raise AttributeError(f"{self.__class__.__name__} has no attribute {key}")
AttributeError: Qwen2Tokenizer has no attribute batch_encode_plus. Did you mean: '_encode_plus'?

`batch_encode_plus` was deprecated; it is recommended to use `tokenizer(...)` instead.

File changed: `modelopt/torch/utils/dataset_utils.py`

From:

```python
    batch_encoded = tokenizer.batch_encode_plus(
        all_samples,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_sample_length,
    )
```

To:

```python
    batch_encoded = tokenizer(
        all_samples,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_sample_length,
    )
```
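For context, here is a minimal, self-contained sketch of the same replacement pattern; the checkpoint name and sample strings are illustrative, not taken from the PR:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any Hugging Face tokenizer with a pad token
# behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

samples = ["first calibration sample", "a second, slightly longer calibration sample"]

# Calling the tokenizer object directly returns the same BatchEncoding
# (input_ids, attention_mask) that batch_encode_plus used to return.
batch_encoded = tokenizer(
    samples,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512,
)
print(batch_encoded["input_ids"].shape)  # torch.Size([2, <padded length>])
```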

Usage

There is no change to the usage.

Testing

After this change, quantizing Qwen3 models completes successfully.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

  • Refactor
    • Updated the tokenization interface to use a more modern and streamlined approach while preserving all existing functionality and output compatibility.


Signed-off-by: jiyzhang <jiyongzhang@gmail.com>
@jiyzhang jiyzhang requested a review from a team as a code owner February 9, 2026 07:08
@jiyzhang jiyzhang requested a review from realAsma February 9, 2026 07:08
copy-pr-bot bot commented Feb 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Feb 9, 2026

📝 Walkthrough

The change replaces tokenizer.batch_encode_plus() with a direct tokenizer() call in the dataset utilities module, passing equivalent parameters including return_tensors="pt", padding=True, truncation=True, and max_length to maintain the same encoding behavior.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Tokenizer API Migration<br>`modelopt/torch/utils/dataset_utils.py` | Replaced deprecated `batch_encode_plus()` method with direct tokenizer call interface while preserving all encoding parameters and output structure. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title describes the specific error message encountered and serves as a bug report title rather than summarizing the solution. While it accurately reflects the problem being fixed, it highlights the error symptom rather than the main change. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
modelopt/torch/utils/dataset_utils.py (1)

227-227: Stale comment: still references batch_encode_plus.

Now that the explicit batch_encode_plus call is gone, this comment is misleading. Consider updating it to reflect the actual reason for the deep copy.

Suggested fix:

```diff
-    # batch_encode_plus will modify the tokenizer in place, so we need to clone it.
+    # Tokenizer encoding may modify internal state in place, so we need to clone it.
```
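For illustration, a minimal sketch of the clone-before-encode pattern the nitpick refers to; the helper name is hypothetical, and `copy.deepcopy` is assumed to be valid for the tokenizer in use, as it is for standard Hugging Face tokenizers:

```python
import copy

def encode_batch(tokenizer, samples, max_length):
    """Encode a batch without mutating the caller's tokenizer."""
    # Clone first so any in-place state changes during encoding
    # do not leak back to the caller's tokenizer object.
    tokenizer = copy.deepcopy(tokenizer)
    return tokenizer(
        samples,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
```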


cjluo-nv (Collaborator) commented Feb 9, 2026

Do you know why Qwen3-Code-Next uses Qwen2 tokenizer?

jiyzhang (Author) commented:

> Do you know why Qwen3-Code-Next uses Qwen2 tokenizer?

  1. There is no Qwen3 tokenizer released.
  2. The vocabulary didn't change between Qwen2 and Qwen3.
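A quick way to see which tokenizer class a checkpoint resolves to is to load it and inspect the class; this is illustrative, and the repo id is taken from the discussion above, so it may not match an actual Hub id:

```python
from transformers import AutoTokenizer

# Repo id as written in this PR; treat it as illustrative.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Code-Next", trust_remote_code=True)
print(type(tok).__name__)  # prints e.g. "Qwen2Tokenizer" or "Qwen2TokenizerFast"
```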
