# Token Budget

Back to README | All docs

Automatically find the least compression needed to fit a target token count.

## How it works

When `tokenBudget` is set, the engine binary-searches `recencyWindow` to find the largest recency window whose compressed result fits within the budget. This maximizes preserved recent context while still hitting the target.

### Binary search algorithm

1. Fast path: if total tokens <= budget, return immediately (no compression needed).
2. Set `lo = minRecencyWindow` (default 0), `hi = messages.length - 1`.
3. Binary search:
   a. `mid = ceil((lo + hi) / 2)`
   b. Compress with `recencyWindow = mid`.
   c. If the result fits the budget: `lo = mid` (try a larger window).
   d. If over budget: `hi = mid - 1` (try a smaller window).
4. Final compress at `recencyWindow = lo`.
5. If still over budget and `forceConverge` is enabled: hard-truncate pass.

The binary search runs compression at each iteration. When a summarizer is provided, each iteration calls the LLM — so budget + LLM is slower than budget alone.
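The search loop above can be sketched as follows. This is a hypothetical helper, not the engine's actual internals; the `fits` predicate stands in for "compress at this window and count the resulting tokens":

```ts
type FitsFn = (recencyWindow: number) => boolean;

// Returns the largest recencyWindow in [lo, hi] for which fits() is true,
// assuming fits is monotone (smaller windows compress more, so once a
// window fits, every smaller window fits too).
function findLargestFittingWindow(lo: number, hi: number, fits: FitsFn): number {
  while (lo < hi) {
    // ceil guarantees mid > lo whenever hi > lo, so lo = mid always makes progress.
    const mid = Math.ceil((lo + hi) / 2);
    if (fits(mid)) {
      lo = mid; // fits: try a larger window
    } else {
      hi = mid - 1; // over budget: shrink the window
    }
  }
  // Note: if nothing in [lo, hi] fits, this returns the floor lo anyway,
  // mirroring steps 4-5 (final compress at lo, then forceConverge if needed).
  return lo;
}
```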

## Basic usage

```ts
import { compress } from 'context-compression-engine';

const result = compress(messages, {
  tokenBudget: 4000,
  minRecencyWindow: 2,
});

result.fits; // true if the result fits within the budget
result.tokenCount; // token count (via tokenCounter)
result.recencyWindow; // the recencyWindow the binary search settled on
```

## `defaultTokenCounter`

The built-in estimator:

```ts
function defaultTokenCounter(msg: Message): number {
  return Math.ceil(msg.content.length / 3.5);
}
```

~3.5 characters per token is derived from empirical measurements of GPT-family BPE tokenizers (cl100k_base, o200k_base) on mixed English text. We pick the lower end of the observed range so estimates are conservative — slightly over-counting tokens is safer than under-counting and blowing the budget. It's fast and works for ballpark estimates, but real tokenizers vary:

| Tokenizer | Typical chars/token |
| --- | --- |
| GPT-4/4o | ~3.5-4.0 |
| Claude | ~3.5-4.0 |
| Llama 3 | ~3.0-3.5 |

For accurate budgeting, replace it with a real tokenizer via `tokenCounter`.

## Custom `tokenCounter`

The `tokenCounter` function is called for all budget decisions: binary search iterations, force-converge deltas, `token_ratio` stats, and the final `tokenCount`/`fits` fields.

### With `gpt-tokenizer`

```ts
import { compress } from 'context-compression-engine';
import { encode } from 'gpt-tokenizer';

const result = compress(messages, {
  tokenBudget: 4000,
  tokenCounter: (msg) => {
    const text = typeof msg.content === 'string' ? msg.content : '';
    return encode(text).length;
  },
});
```

### With `tiktoken`

```ts
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');

const result = compress(messages, {
  tokenBudget: 4000,
  tokenCounter: (msg) => {
    const text = typeof msg.content === 'string' ? msg.content : '';
    return enc.encode(text).length;
  },
});

enc.free(); // tiktoken uses WASM — free when done
```

## `minRecencyWindow`

Floor for `recencyWindow` during the binary search. Guarantees that at least N recent messages are always preserved, even under tight budgets.

```ts
const result = compress(messages, {
  tokenBudget: 2000,
  minRecencyWindow: 4, // always keep at least 4 recent messages
});
```

Default: 0 (no floor).

## `forceConverge`

When the binary search bottoms out (reaches `minRecencyWindow`) and the result still exceeds the budget, `forceConverge` runs a hard-truncation pass.

### How it works

1. Collect eligible messages: before the recency cutoff, not in preserve roles, content > 512 chars.
2. Sort by content length, descending (biggest savings first).
3. Truncate each to 512 chars: `[truncated — {original_length} chars: {first 512 chars}]`.
4. Stop once the budget is satisfied.

```ts
const result = compress(messages, {
  tokenBudget: 4000,
  forceConverge: true,
});
// result.fits is guaranteed true (unless only system/recency messages remain)
```

Truncated messages get `_cce_original` provenance metadata, so `uncompress()` restores the full content. Messages that were already compressed (have `_cce_original`) get their content replaced in place without double-wrapping.
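The truncation pass above can be sketched as follows. This is a simplified shape with hypothetical helper names, not the engine's implementation; eligibility filtering and the `_cce_original` provenance bookkeeping are omitted:

```ts
interface Msg { role: string; content: string }

const LIMIT = 512;

// Hard-truncation pass: given messages already filtered for eligibility
// (before the recency cutoff, not preserved, > 512 chars) and a callback
// that re-checks the token budget, truncate biggest-first until it fits.
function hardTruncate(eligible: Msg[], overBudget: () => boolean): void {
  // Biggest messages first: largest savings per truncation.
  eligible.sort((a, b) => b.content.length - a.content.length);
  for (const msg of eligible) {
    if (!overBudget()) break; // stop once the budget is satisfied
    msg.content =
      `[truncated — ${msg.content.length} chars: ${msg.content.slice(0, LIMIT)}]`;
  }
}
```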

### When to use it

- CI/CD pipelines where you need a hard guarantee that context fits
- Streaming applications where exceeding the context window is a crash
- Agentic loops where the budget must be respected on each iteration

Without `forceConverge`, the result may exceed the budget when a conversation is dominated by system messages or short messages, since those are preserved.

## Tiered budget strategy

An alternative to binary search that keeps the recency window fixed. Instead of shrinking `recencyWindow` to fit, it progressively compresses older messages through tightening passes.

```ts
const result = compress(messages, {
  tokenBudget: 4000,
  budgetStrategy: 'tiered',
  forceConverge: true,
});
```

See V2 features — Tiered budget for the full algorithm and tradeoff comparison.

## Compression depth with budget

When `compressionDepth: 'auto'` is combined with `tokenBudget`, the engine progressively tries `gentle` → `moderate` → `aggressive` until the budget fits:

```ts
const result = compress(messages, {
  tokenBudget: 2000,
  compressionDepth: 'auto',
  forceConverge: true,
});
```

This is the most adaptive budget mode — it finds the minimum aggressiveness needed. See V2 features — Compression depth.

## Budget with LLM summarizer

```ts
const result = await compress(messages, {
  tokenBudget: 4000,
  summarizer: mySummarizer,
});
```

The binary search calls the LLM at each iteration, so cost and latency scale with `log2(messages.length)` iterations. The LLM path still has the three-level fallback (LLM → deterministic → size guard) at each step.
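As a rough rule of thumb, the number of search iterations, and therefore the upper bound on extra LLM-backed compression passes, grows logarithmically with conversation length. A small hypothetical helper (not part of the library) illustrates the bound:

```ts
// Worst-case number of binary search iterations over [0, messageCount - 1].
// Each iteration runs one compression pass, which may invoke the summarizer.
function maxSearchIterations(messageCount: number): number {
  return Math.ceil(Math.log2(Math.max(messageCount, 2)));
}
```

So a 100-message conversation costs at most around 7 compression passes, versus 4 for a 16-message one; doubling the history adds roughly one more pass.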


## See also