fix(recall): preserve non-BMP characters when truncating recalled context #109
Open
akhilesharora wants to merge 1 commit into
truncateRecallLine capped lines with line.length and line.slice, which count and index by UTF-16 code unit. When recall.maxCharsPerMemory or recall.maxTotalRecallChars is set and the cut lands between the halves of a surrogate pair, the line keeps a lone surrogate that becomes U+FFFD once UTF-8 encoded for the request, corrupting any non-BMP character (emoji, CJK Ext-B) in the injected context.

Slice on Array.from(line) code points instead. Both recall call sites route through this function. Same class as the sanitizeText fix in Tencent#31, reintroduced by the Tencent#71 budget path.

Signed-off-by: Akhilesh Arora <akhildawra@gmail.com>
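For readers without the diff open, the change described above can be sketched roughly as follows. This is a minimal illustration, not the code from src/core/hooks/auto-recall.ts: the function name `truncateByCodePoints`, its signature, and the handling of a non-positive cap are all assumptions made for the example.

```typescript
// Hypothetical stand-in for truncateRecallLine; the real function's
// signature and edge-case handling may differ.
function truncateByCodePoints(line: string, maxChars: number): string {
  // Assumption: a non-positive cap yields an empty string.
  if (maxChars <= 0) return "";

  // Array.from walks the string with its iterator, which yields whole
  // code points, so a surrogate pair is kept together as one element.
  const codePoints = Array.from(line);
  if (codePoints.length <= maxChars) return line;

  return codePoints.slice(0, maxChars).join("");
}

// U+1F600 is outside the BMP and occupies two UTF-16 code units.
const sample = "ab\u{1F600}cd";
console.log(sample.length);                   // 6 (code units)
console.log(Array.from(sample).length);       // 5 (code points)
console.log(truncateByCodePoints(sample, 3)); // "ab" followed by U+1F600, intact
```

One consequence of this approach is that the cap now counts code points rather than code units, so a line containing non-BMP characters may be slightly longer in UTF-16 terms than the same cap allowed before.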
Collaborator
Thank you for your contribution. We will review it and provide timely feedback. Thank you for your attention to our project!
Description
truncateRecallLine (src/core/hooks/auto-recall.ts) truncated recalled-memory lines by UTF-16 code unit (line.length / line.slice). A cap landing between the halves of a surrogate pair left a lone surrogate, which becomes U+FFFD when the line is UTF-8 encoded for the request, corrupting the non-BMP character at the cut (emoji, CJK Ext-B) in the injected context. The count and slice now operate on code points via Array.from. Both recall call sites (per-memory and total-budget) route through this one function. Same defect class as #30/#31 ("preserve non-BMP characters in sanitizeText"), reintroduced by the #71 budget path.
Related Issue
Fix #108
Change Type
Self-test Checklist
Additional Notes
Verified with Node on both branches of the function: a cap landing mid-pair produces U+FFFD before the change and an intact character after; ASCII/BMP input is byte-for-byte unchanged. CI has no test step and the repo currently ships no test files, so no test was added.
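A check along these lines can be reproduced with a short script. The sketch below is an assumption about how such a check might look, not the script actually used: `truncateByCodeUnits` and `truncateByCodePoints` are hypothetical stand-ins for the before and after versions of the function, and `utf8RoundTrip` uses the standard `TextEncoder`/`TextDecoder` globals to imitate UTF-8 encoding for a request.

```typescript
// Hypothetical "before": count and cut by UTF-16 code unit.
function truncateByCodeUnits(line: string, cap: number): string {
  return line.length <= cap ? line : line.slice(0, cap);
}

// Hypothetical "after": count and cut by code point.
function truncateByCodePoints(line: string, cap: number): string {
  const codePoints = Array.from(line);
  return codePoints.length <= cap ? line : codePoints.slice(0, cap).join("");
}

// Encode to UTF-8 and decode again. TextEncoder replaces a lone
// surrogate with U+FFFD, which is where the corruption becomes visible.
function utf8RoundTrip(text: string): string {
  return new TextDecoder().decode(new TextEncoder().encode(text));
}

const input = "ab\u{1F600}cd"; // U+1F600 spans code units 2 and 3
const cap = 3;                 // lands between the two halves of the pair

const before = utf8RoundTrip(truncateByCodeUnits(input, cap));
const after = utf8RoundTrip(truncateByCodePoints(input, cap));

console.log(before === "ab\uFFFD");    // true: lone surrogate became U+FFFD
console.log(after === "ab\u{1F600}");  // true: character survives intact

// ASCII input gives the same result on both paths.
const ascii = "hello world";
console.log(truncateByCodeUnits(ascii, 5) === truncateByCodePoints(ascii, 5)); // true
```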