fix(recall): preserve non-BMP characters when truncating recalled context#109

Open
akhilesharora wants to merge 1 commit into
Tencent:mainfrom
akhilesharora:fix/recall-truncate-non-bmp

@akhilesharora
Contributor

Description

truncateRecallLine (src/core/hooks/auto-recall.ts) truncated recalled-memory lines by UTF-16 code unit, using line.length for the count and line.slice for the cut. When the cap landed between the two halves of a surrogate pair, the truncated line ended in a lone surrogate. That surrogate becomes U+FFFD when the line is UTF-8 encoded for the request, so a non-BMP character at the cut (emoji, CJK Ext-B) is corrupted in the injected context. This PR switches both the count and the slice to code points via Array.from. Both recall call sites (per-memory and total-budget) route through this one function, so the single change covers both. This is the same defect class as #30/#31 ("preserve non-BMP characters in sanitizeText"), reintroduced by the #71 budget path.
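The difference between the two truncation strategies can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names truncateByCodeUnit and truncateByCodePoint are hypothetical, and the real truncateRecallLine may do more (for example, trimming or appending a marker).

```typescript
// Hypothetical sketch; the actual truncateRecallLine in
// src/core/hooks/auto-recall.ts may differ in details.

// Old approach: length and slice both operate on UTF-16 code units.
function truncateByCodeUnit(line: string, maxChars: number): string {
  return line.length <= maxChars ? line : line.slice(0, maxChars);
}

// New approach: Array.from iterates the string by code point,
// so a surrogate pair stays together as a single array element.
function truncateByCodePoint(line: string, maxChars: number): string {
  const codePoints = Array.from(line);
  return codePoints.length <= maxChars
    ? line
    : codePoints.slice(0, maxChars).join("");
}

// "ab" + U+1F600 (one emoji, two UTF-16 code units) + "cd"
const sample = "ab\uD83D\uDE00cd";

// Cap of 3 lands between the two halves of the surrogate pair.
const broken = truncateByCodeUnit(sample, 3); // "ab" + lone high surrogate
const intact = truncateByCodePoint(sample, 3); // "ab" + the whole emoji

console.log(broken.charCodeAt(2).toString(16)); // d83d
console.log(intact === "ab\uD83D\uDE00"); // true
```

One consequence of this design is that the cap is now measured in code points, so a truncated line containing non-BMP characters can be longer than the cap when measured with line.length.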

Related Issue

Fix #108

Change Type

  • Bug fix
  • New feature
  • Documentation update
  • Code optimization

Self-test Checklist

  • Verified locally
  • No existing features affected

Additional Notes

Verified with Node on both branches of the function: a cap landing mid-pair produces U+FFFD before the change and an intact character after; ASCII/BMP input is byte-for-byte unchanged. CI has no test step and the repo currently ships no test files, so no test was added.
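A check along these lines can be reproduced in Node with a short script. This is a sketch of one possible verification, not the author's actual script: the helper roundTripUtf8 and the sample strings are assumptions, and the UTF-8 step is simulated with the standard TextEncoder and TextDecoder globals rather than the project's real request path.

```typescript
// Hypothetical reproduction script; names and inputs are illustrative.
// TextEncoder replaces a lone surrogate with U+FFFD when encoding to UTF-8.
const roundTripUtf8 = (s: string): string =>
  new TextDecoder().decode(new TextEncoder().encode(s));

const recalled = "ab\uD83D\uDE00cd"; // "ab", U+1F600, "cd"
const cap = 3; // lands between the halves of the surrogate pair

// Before the change: cut by UTF-16 code unit.
const beforeFix = roundTripUtf8(recalled.slice(0, cap));

// After the change: cut by code point.
const afterFix = roundTripUtf8(Array.from(recalled).slice(0, cap).join(""));

// ASCII input: both strategies should give identical output.
const asciiLine = "hello world";
const asciiOld = roundTripUtf8(asciiLine.slice(0, 5));
const asciiNew = roundTripUtf8(Array.from(asciiLine).slice(0, 5).join(""));

console.log(beforeFix.includes("\uFFFD")); // true: the emoji was corrupted
console.log(afterFix === "ab\uD83D\uDE00"); // true: the emoji survived
console.log(asciiOld === asciiNew); // true: ASCII is unchanged
```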

truncateRecallLine capped lines with line.length and line.slice, which
count and index by UTF-16 code unit. When recall.maxCharsPerMemory or
recall.maxTotalRecallChars is set and the cut lands between the halves of
a surrogate pair, the line keeps a lone surrogate that becomes U+FFFD once
UTF-8 encoded for the request, corrupting any non-BMP character (emoji,
CJK Ext-B) in the injected context. Slice on Array.from(line) code points
instead. Both recall call sites route through this function. Same class as
the sanitizeText fix in Tencent#31, reintroduced by the Tencent#71 budget path.

Signed-off-by: Akhilesh Arora <akhildawra@gmail.com>
@Maxwell-Code07
Collaborator

Thank you for your contribution. We will review it and provide timely feedback. Thank you for your attention to our project!

Development

Successfully merging this pull request may close these issues.

[Bug] truncateRecallLine splits surrogate pairs, corrupting non-BMP chars in recalled context
