# Token Budget

Back to README | All docs

Automatically find the least compression needed to fit a target token count.

## How it works

When `tokenBudget` is set, the engine binary-searches `recencyWindow` to find the largest recency window whose compressed result fits within the budget. This maximizes preserved recent context while still hitting the target.

### Binary search algorithm

1. Fast path: if total tokens <= budget, return immediately (no compression needed).
2. Set `lo = minRecencyWindow` (default 0), `hi = messages.length - 1`.
3. Binary search:
   a. `mid = ceil((lo + hi) / 2)`
   b. Compress with `recencyWindow = mid`.
   c. If the result fits the budget: `lo = mid` (try a larger window).
   d. If over budget: `hi = mid - 1` (try a smaller window).
4. Final compress at `recencyWindow = lo`.
5. If still over budget and `forceConverge` is enabled: hard-truncate pass.

The binary search runs compression at each iteration. When a summarizer is provided, each iteration calls the LLM — so budget + LLM is slower than budget alone.
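The search loop above can be sketched as follows. This is a hypothetical helper, not the engine's actual internals; the `fits` predicate stands in for "compress at this window and count the resulting tokens":

```ts
type FitsFn = (recencyWindow: number) => boolean;

// Returns the largest recencyWindow in [lo, hi] for which fits() is true,
// assuming fits is monotone (smaller windows compress more, so once a
// window fits, every smaller window fits too).
function findLargestFittingWindow(lo: number, hi: number, fits: FitsFn): number {
  while (lo < hi) {
    // ceil guarantees mid > lo whenever hi > lo, so lo = mid always makes progress.
    const mid = Math.ceil((lo + hi) / 2);
    if (fits(mid)) {
      lo = mid; // fits: try a larger window
    } else {
      hi = mid - 1; // over budget: shrink the window
    }
  }
  // Note: if nothing in [lo, hi] fits, this returns the floor lo anyway,
  // mirroring steps 4-5 (final compress at lo, then forceConverge if needed).
  return lo;
}
```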

## Basic usage

```ts
import { compress } from 'context-compression-engine';

const result = compress(messages, {
  tokenBudget: 4000,
  minRecencyWindow: 2,
});

result.fits; // true if the result fits within the budget
result.tokenCount; // token count (via tokenCounter)
result.recencyWindow; // the recencyWindow the binary search settled on
```

## `defaultTokenCounter`

The built-in estimator:

```ts
function defaultTokenCounter(msg: Message): number {
  return Math.ceil(msg.content.length / 3.5);
}
```

~3.5 characters per token is derived from empirical measurements of GPT-family BPE tokenizers (cl100k_base, o200k_base) on mixed English text. We pick the lower end of the observed range so estimates are conservative — slightly over-counting tokens is safer than under-counting and blowing the budget. It's fast and works for ballpark estimates, but real tokenizers vary:

| Tokenizer | Typical chars/token |
| --- | --- |
| GPT-4/4o | ~3.5-4.0 |
| Claude | ~3.5-4.0 |
| Llama 3 | ~3.0-3.5 |

For accurate budgeting, replace it with a real tokenizer via `tokenCounter`.

## Custom `tokenCounter`

The `tokenCounter` function is called for all budget decisions: binary search iterations, force-converge deltas, `token_ratio` stats, and the final `tokenCount`/`fits` fields.

### With `gpt-tokenizer`

```ts
import { compress } from 'context-compression-engine';
import { encode } from 'gpt-tokenizer';

const result = compress(messages, {
  tokenBudget: 4000,
  tokenCounter: (msg) => {
    const text = typeof msg.content === 'string' ? msg.content : '';
    return encode(text).length;
  },
});
```

### With `tiktoken`

```ts
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');

const result = compress(messages, {
  tokenBudget: 4000,
  tokenCounter: (msg) => {
    const text = typeof msg.content === 'string' ? msg.content : '';
    return enc.encode(text).length;
  },
});

enc.free(); // tiktoken uses WASM — free when done
```

## `minRecencyWindow`

Floor for `recencyWindow` during the binary search. Guarantees that at least N recent messages are always preserved, even under tight budgets.

```ts
const result = compress(messages, {
  tokenBudget: 2000,
  minRecencyWindow: 4, // always keep at least 4 recent messages
});
```

Default: 0 (no floor).

## `forceConverge`

When the binary search bottoms out (reaches `minRecencyWindow`) and the result still exceeds the budget, `forceConverge` runs a hard-truncation pass.

### How it works

1. Collect eligible messages: before the recency cutoff, not in preserve roles, content > 512 chars.
2. Sort by content length, descending (biggest savings first).
3. Truncate each to 512 chars: `[truncated — {original_length} chars: {first 512 chars}]`.
4. Stop once the budget is satisfied.

```ts
const result = compress(messages, {
  tokenBudget: 4000,
  forceConverge: true,
});
// result.fits is guaranteed true (unless only system/recency messages remain)
```

Truncated messages get `_cce_original` provenance metadata, so `uncompress()` restores the full content. Messages that were already compressed (have `_cce_original`) get their content replaced in place without double-wrapping.
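The truncation pass above can be sketched as follows. This is a simplified shape with hypothetical helper names, not the engine's implementation; eligibility filtering and the `_cce_original` provenance bookkeeping are omitted:

```ts
interface Msg { role: string; content: string }

const LIMIT = 512;

// Hard-truncation pass: given messages already filtered for eligibility
// (before the recency cutoff, not preserved, > 512 chars) and a callback
// that re-checks the token budget, truncate biggest-first until it fits.
function hardTruncate(eligible: Msg[], overBudget: () => boolean): void {
  // Biggest messages first: largest savings per truncation.
  eligible.sort((a, b) => b.content.length - a.content.length);
  for (const msg of eligible) {
    if (!overBudget()) break; // stop once the budget is satisfied
    msg.content =
      `[truncated — ${msg.content.length} chars: ${msg.content.slice(0, LIMIT)}]`;
  }
}
```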

### When to use it

- CI/CD pipelines where you need a hard guarantee that context fits
- Streaming applications where exceeding the context window is a crash
- Agentic loops where the budget must be respected on each iteration

Without `forceConverge`, the result may exceed the budget when a conversation is dominated by system messages or short messages, since those are preserved.

## Tiered budget strategy

An alternative to binary search that keeps the recency window fixed. Instead of shrinking `recencyWindow` to fit, it progressively compresses older messages through tightening passes.

```ts
const result = compress(messages, {
  tokenBudget: 4000,
  budgetStrategy: 'tiered',
  forceConverge: true,
});
```

See V2 features — Tiered budget for the full algorithm and tradeoff comparison.

## Compression depth with budget

When `compressionDepth: 'auto'` is combined with `tokenBudget`, the engine progressively tries `gentle` → `moderate` → `aggressive` until the budget fits:

```ts
const result = compress(messages, {
  tokenBudget: 2000,
  compressionDepth: 'auto',
  forceConverge: true,
});
```

This is the most adaptive budget mode — it finds the minimum aggressiveness needed. See V2 features — Compression depth.

## Budget with LLM summarizer

```ts
const result = await compress(messages, {
  tokenBudget: 4000,
  summarizer: mySummarizer,
});
```

The binary search calls the LLM at each iteration, so cost and latency scale with `log2(messages.length)` iterations. The LLM path still has the three-level fallback (LLM → deterministic → size guard) at each step.
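As a rough rule of thumb, the number of search iterations, and therefore the upper bound on extra LLM-backed compression passes, grows logarithmically with conversation length. A small hypothetical helper (not part of the library) illustrates the bound:

```ts
// Worst-case number of binary search iterations over [0, messageCount - 1].
// Each iteration runs one compression pass, which may invoke the summarizer.
function maxSearchIterations(messageCount: number): number {
  return Math.ceil(Math.log2(Math.max(messageCount, 2)));
}
```

So a 100-message conversation costs at most around 7 compression passes, versus 4 for a 16-message one; doubling the history adds roughly one more pass.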


## See also