Fix num_logits_to_keep default in decoder_forward and add get_available_devices() #1669

Closed
Suh0161 wants to merge 2 commits into huggingface:main from Suh0161:feat/num-logits-and-available-devices
Suh0161 commented May 1, 2026

Closes #1666, closes #1643.

Fix: num_logits_to_keep defaults to 0n instead of 1n (#1666)

decoder_forward() had a comment correctly explaining that num_logits_to_keep=1 reduces memory during generation, but the code set 0n, so the ONNX model computed logits for the entire prompt instead of just the last token. For Gemma 4 with a 20k-token prompt and a 262k-entry vocabulary, this wastes ~20 GB of memory and can cause OOM crashes.
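
As a rough check on the ~20 GB figure, the size of a dense logits tensor is positions × vocabulary × bytes per value. The sketch below assumes float32 logits (4 bytes each) and a vocabulary of exactly 262,144 entries; neither number is stated in this PR, and `logitsBytes` is a hypothetical helper, not a library function.

```javascript
// Rough size of a dense logits tensor: positions x vocabulary x bytes per value.
// Assumption: float32 logits (4 bytes); the PR does not state the dtype.
function logitsBytes(numPositions, vocabSize, bytesPerValue = 4) {
  return numPositions * vocabSize * bytesPerValue;
}

const promptTokens = 20_000; // "20k token prompt" from the description
const vocabSize = 262_144;   // assumed exact size behind "262k vocabulary"

// Logits for every prompt position versus the last position only.
const allPositions = logitsBytes(promptTokens, vocabSize);
const lastPosition = logitsBytes(1, vocabSize);

console.log((allPositions / 1e9).toFixed(2), 'GB for every prompt position');
console.log((lastPosition / 1e6).toFixed(2), 'MB for the last position only');
```

Under these assumptions the full-prompt tensor comes to about 21 GB, while keeping only the last position needs about 1 MB.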

Fix: Change [0n] to [1n] to match both the comment and the behavior of decoder_prepare_inputs_for_generation().
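
To illustrate the difference between the two values, here is a toy model of which sequence positions receive logits. It assumes, consistent with the description above, that 0 means "keep every position" and N > 0 means "keep the last N". `positionsWithLogits` is a hypothetical helper for illustration only, not code from this PR; the `n` suffix is a JavaScript BigInt literal.

```javascript
// Hypothetical helper (not part of the library): list the sequence positions
// that get logits for a given num_logits_to_keep value.
// Assumption: 0n means "keep all positions", N > 0n means "keep the last N".
function positionsWithLogits(seqLen, numLogitsToKeep) {
  const keep = numLogitsToKeep === 0n
    ? seqLen
    : Math.min(seqLen, Number(numLogitsToKeep));
  const start = seqLen - keep;
  return Array.from({ length: keep }, (_, i) => start + i);
}

console.log(positionsWithLogits(5, 0n)); // every position of a 5-token prompt
console.log(positionsWithLogits(5, 1n)); // only the final position
```

During generation only the final position is needed to sample the next token, which is why a default of 1n is the cheaper choice.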

Feature: get_available_devices() (#1643)

The supportedDevices list already existed inside the ONNX backend but was never exposed to users. This adds a public get_available_devices() function:

import { get_available_devices } from '@huggingface/transformers';

const devices = get_available_devices();
// Node.js (Windows):   ['dml', 'webgpu', 'cpu']
// Node.js (Linux x64): ['cuda', 'webgpu', 'cpu']
// Browser (WebGPU):    ['webgpu', 'wasm']
// Browser (no WebGPU): ['wasm']
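
One way a caller might use such a list is to pick the first entry from their own preference order. The sketch below is self-contained: `pickDevice` is a hypothetical helper, and the arrays passed in stand in for the return value of get_available_devices() as described above.

```javascript
// Hypothetical helper: return the first preferred device that is available.
// `available` stands in for the array returned by get_available_devices().
function pickDevice(available, preferred = ['webgpu', 'cuda', 'dml', 'wasm', 'cpu']) {
  const found = preferred.find((device) => available.includes(device));
  if (found === undefined) {
    throw new Error(`No preferred device in [${available.join(', ')}]`);
  }
  return found;
}

// Device lists copied from the examples in this PR description.
console.log(pickDevice(['dml', 'webgpu', 'cpu'], ['cuda', 'dml', 'cpu'])); // 'dml'
console.log(pickDevice(['wasm']));                                         // 'wasm'
```

Throwing when nothing matches is a design choice for the sketch; falling back to the first available entry would be an equally reasonable policy.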


---

Also note: **#1182 already has open PR #1190**, so I skipped that one to avoid duplicating work.

Suh0161 added 2 commits May 1, 2026 18:16
…ing to tokenizer

- TokenClassificationPipeline now populates start/end character offsets on
  every raw token result by scanning forward through the original text.
  Grouped results (aggregation_strategy='simple') carry the span of the
  first-to-last token in the group.

- PreTrainedTokenizer._call now accepts return_offsets_mapping: true, which
  adds an offset_mapping field ([start, end) per token) to the encoding.
  Works for single strings and batched input; handles padding with [0,0] and
  strips the field before tensor conversion so it is never tensorized.

- Adds computeOffsets() helper with case-insensitive fallback for uncased
  tokenizers (e.g. bert-base-uncased).

Closes huggingface#425, closes huggingface#633.
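
The forward-scanning approach with a case-insensitive fallback can be sketched roughly as follows. This is not the PR's computeOffsets(): `sketchOffsets` is a hypothetical stand-in, and it assumes WordPiece-style "##" continuation markers and that lowercasing does not change the text length.

```javascript
// Hypothetical sketch, not the PR's computeOffsets(): locate each token in the
// text by scanning forward from the end of the previous match.
function sketchOffsets(text, tokens) {
  // Assumption: lowercasing preserves string length (true for ASCII text).
  const lowered = text.toLowerCase();
  const offsets = [];
  let cursor = 0;
  for (const token of tokens) {
    // Assumption: continuation pieces are marked with a leading "##".
    const piece = token.startsWith('##') ? token.slice(2) : token;
    let start = piece.length === 0 ? -1 : text.indexOf(piece, cursor);
    // Case-insensitive fallback for uncased tokenizers.
    if (start === -1 && piece.length > 0) {
      start = lowered.indexOf(piece.toLowerCase(), cursor);
    }
    if (start === -1) {
      offsets.push([0, 0]); // unmatched tokens (e.g. special tokens) get an empty span
      continue;
    }
    offsets.push([start, start + piece.length]); // half-open [start, end)
    cursor = start + piece.length;
  }
  return offsets;
}

console.log(sketchOffsets('Hello World', ['[CLS]', 'hello', 'wor', '##ld', '[SEP]']));
```

Here the lowercase tokens only match through the fallback, and the two special tokens map to [0, 0].
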
- Fix decoder_forward() defaulting num_logits_to_keep to 0n instead of 1n.
  The comment correctly stated the value should be 1 to avoid computing logits
  for the entire prompt sequence, but the code contradicted it. For models like
  Gemma 4 with long contexts and large vocabularies this caused ~20 GB of
  unnecessary memory allocation during generation.
  Closes huggingface#1666.

- Add get_available_devices() to the public API. The underlying supportedDevices
  list already existed in the ONNX backend but was not accessible to users.
  Returns a copy of the device list sorted by priority/performance for the
  current environment (Node.js, browser, Electron).
  Closes huggingface#1643.

Suh0161 (Author) commented May 1, 2026

Closing in favor of #1670, which contains this work in the same commit stack.

Development

Successfully merging this pull request may close these issues:

- get_available_devices()
- Gemma 4 generation passes num_logits_to_keep=0 in decoder_forward, causing full-prompt logits memory blowup
