
Key Feature Decisions: Special Tokens (BOS/EOS), Whitespace, Reversibility, and Byte Fallback #148

@joenaess

Description

Special Tokens:
Beginning-of-Sentence (BOS) and End-of-Sentence (EOS) Tokens: A clear strategy is needed for how and when to insert BOS (e.g., [BOS] or <s>) and EOS (e.g., [EOS] or </s>) tokens. These markers are crucial for sequence demarcation during training and inference, helping the model identify the start and end of a complete input unit.
Padding Token ([PAD]): Define the token used to ensure all sequences in a batch have the same length. This is essential for efficient batch processing on accelerated hardware.
Unknown/Out-of-Vocabulary Token ([UNK]): Establish the designated token for handling characters or subwords that are not present in the tokenizer's learned vocabulary. The behavior of the UNK token significantly impacts the model's ability to generalize to novel inputs.
Mask Token ([MASK]): If the tokenizer will serve models trained with masked language modeling (e.g., BERT), a dedicated mask token is required. (A registration sketch follows this list.)
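
To make the decision concrete, here is a minimal sketch using the Hugging Face tokenizers library. The token strings, the 32k vocabulary size, and the BOS/EOS template are placeholder assumptions for illustration, not settled choices.

```python
from tokenizers import Tokenizer, models, trainers
from tokenizers.processors import TemplateProcessing

# Reserve special tokens up front so they receive fixed, low ids.
SPECIALS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]", "[MASK]"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=SPECIALS)
# tokenizer.train(files, trainer)  # training corpus omitted in this sketch
tokenizer.add_special_tokens(SPECIALS)  # guarantees the ids exist

# Automatically wrap every encoded sequence in [BOS] ... [EOS].
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $A [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)
```

Registering [PAD] first gives it id 0, which many training stacks assume by default.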
Whitespace Management:
Preservation vs. Stripping: Determine whether whitespace should be preserved as distinct tokens/subwords or stripped/collapsed. Many LLM tokenizers handle whitespace implicitly by marking word-initial subwords with a leading-space symbol (e.g., SentencePiece's ▁ or GPT-2 BPE's Ġ).
Normalization: Decide on a standard for normalizing different forms of whitespace (e.g., tabs, multiple spaces, non-breaking spaces) to a single canonical representation (e.g., a single ASCII space) before tokenization; a sketch follows this section.
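
One way to realize the normalization rule is a pre-tokenization pass in plain Python; the exact character set being canonicalized here is an assumption, not a decided policy.

```python
import re

# Hypothetical policy: tabs, NBSP, thin/narrow no-break spaces,
# and runs of spaces all collapse to a single ASCII space.
_WHITESPACE = re.compile(r"[\t\u00a0\u2009\u202f ]+")

def normalize_whitespace(text: str) -> str:
    """Canonicalize whitespace variants to a single ASCII space."""
    return _WHITESPACE.sub(" ", text)

assert normalize_whitespace("a\t b\u00a0 c") == "a b c"
```

Note that collapsing is inherently lossy, so this choice trades directly against the reversibility requirement below; if exact round-tripping is mandatory, normalization must be skipped or recorded.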
Reversibility (Decoding Accuracy):
Lossless Conversion: A critical feature for any production-grade tokenizer is the ability to losslessly convert a sequence of tokens back into the exact original text string, including all original spacing and punctuation. This ensures consistency and prevents unintended data corruption or manipulation between the tokenized and detokenized states.
Handling of Joiners: Ensure that the decoding process correctly joins subwords, particularly in models that rely on leading-space markers to denote the start of a new word. (A round-trip check is sketched below.)
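
Lossless decoding is easiest to enforce as a round-trip test. This sketch assumes a tokenizers.Tokenizer-style object with encode/decode; the sample strings are arbitrary edge cases, not a fixed test suite.

```python
def assert_roundtrip(tokenizer, texts):
    """Fail loudly if encode -> decode does not reproduce the input exactly."""
    for text in texts:
        ids = tokenizer.encode(text).ids
        decoded = tokenizer.decode(ids, skip_special_tokens=True)
        assert decoded == text, f"lossy round-trip: {text!r} -> {decoded!r}"

# Edge cases worth covering: doubled/leading/trailing spaces, tabs, emoji, CJK.
# assert_roundtrip(tokenizer, ["Hello  world", " leading", "trailing ", "naïve ☕"])
```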
Byte Fallback Mechanism:
Robustness for OOV: Implement a strategy for characters that no learned subword covers, so they are not silently collapsed to [UNK]. A byte-level fallback (e.g., one token per UTF-8 byte, or individual character tokens) guarantees that any input string, regardless of language or script, can be tokenized. This is essential for full language and character-set coverage and prevents errors on rare or exotic input characters.
Efficiency Trade-offs: While byte fallback is robust, it can produce very long token sequences for heavily out-of-vocabulary inputs, with performance costs at both training and inference time. The design must balance robustness against efficiency; the sketch below shows the blow-up.
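
For illustration, here is a toy sketch of SentencePiece's byte-fallback convention (one <0xNN> token per UTF-8 byte); the two-word vocabulary is a stand-in, not a real proposal.

```python
def encode_piece(piece: str, vocab: set[str]) -> list[str]:
    """Emit the piece itself if known, else one <0xNN> token per UTF-8 byte."""
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]

vocab = {"hello", "world"}
print(encode_piece("hello", vocab))  # ['hello']
print(encode_piece("☃", vocab))      # ['<0xE2>', '<0x98>', '<0x83>']
```

The snowman example shows the efficiency cost directly: a single unseen character expands to three tokens.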
