Fixes in unicode helper with convert to/from UTF8 #3247

josesimoes · 2025-12-09T18:43:01Z

Description

Issue with null terminator buffer size.
Invalid surrogate pair handling where the second character is put back for reprocessing.
Proper advancement of input pointers for invalid sequences.
Edge cases with partial multi-byte sequences.
Overlong encoding detection.
Boundary conditions and valid/invalid transitions.

Motivation and Context

Found unit test failing at Update 1 NuGet dependencies System.IO.Streams#85.
Debugging those ended up on the fixes here.
Further fixes for Encoding.UTF8.GetString cause an exception Home#1648.

How Has This Been Tested?

New unit tests added at Improvements in unit tests System.Text#244.

Screenshots

Types of changes

Improvement (non-breaking change that improves a feature, code or algorithm)
Bug fix (non-breaking change which fixes an issue with code or algorithm)
New feature (non-breaking change which adds functionality to code)
Breaking change (fix or feature that would cause existing functionality to change)
Config and build (change in the configuration and build system, has no impact on code or features)
Dev Containers (changes related with Dev Containers, has no impact on code or features)
Dependencies/declarations (update dependencies or assembly declarations and changes associated, has no impact on code or features)
Documentation (changes or updates in the documentation, has no impact on code or features)

Checklist

My code follows the code style of this project (only if there are changes in source code).
My changes require an update to the documentation (there are changes that require the docs website to be updated).
I have updated the documentation accordingly (the changes require an update on the docs in this repo).
I have read the CONTRIBUTING document.
I have tested everything locally and all new and existing tests passed (only if there are changes in source code).

Summary by CodeRabbit

Bug Fixes
- Improved handling of invalid UTF-8 and UTF-16 sequences with proper replacement character substitution for better robustness
- Enhanced validation of continuation bytes and surrogate pairs to prevent data corruption
- Fixed edge cases with overlong encodings and malformed character sequences
- Strengthened boundary condition handling and output capacity checks for reliable Unicode conversion

_{✏️ Tip: You can customize this high-level summary in your review settings.}

- Issue with null terminator buffer size. - Invalid surrogate pair handling where the second character is put back for reprocessing. - Proper advancement of input pointers for invalid sequences. - Edge cases with partial multi-byte sequences. - Overlong encoding detection. - Boundary conditions and valid/invalid transitions.

coderabbitai · 2025-12-09T18:43:27Z

Walkthrough

Modified Unicode conversion functions in the CLR runtime helper to improve handling of invalid UTF-8 and UTF-16 sequences. Changes include more aggressive input pointer advancement on errors, explicit validation of continuation bytes and surrogates, substitution of invalid sequences with U+FFFD replacement characters, and boundary condition refinements in finalization logic.

Changes

Cohort / File(s)	Summary
Unicode conversion error handling improvements `src/CLR/Core/CLR_RT_UnicodeHelper.cpp`	Enhanced ConvertFromUTF8, ConvertToUTF8, and CountNumberOfCharacters functions to advance input pointers more aggressively on invalid sequences; added explicit lookahead validation for continuation bytes; implemented overlong encoding detection; added surrogate pair validation with proper error recovery; ensured replacement character (U+FFFD) output in invalid conditions; refined output capacity checks and pointer state management during error conditions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

Pointer arithmetic correctness across all error paths, especially advancement of iMaxBytes and iMaxChars in tandem with input pointer updates
Verification that overlong 2-byte and surrogate pair detection logic conforms to UTF-8/UTF-16 standards
Confirmation that U+FFFD replacement characters are consistently output in all invalid sequence branches
Boundary condition handling in UTF-16 output finalization logic to prevent premature false returns or incomplete termination
State consistency when reverting or adjusting pointers for reprocessing invalid units

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: UTF-8/UTF-16 conversion fixes in the unicode helper, which aligns with the substantial modifications to CLR_RT_UnicodeHelper.cpp detailed in the changeset.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2139b and 2b1d616.

📒 Files selected for processing (1)

src/CLR/Core/CLR_RT_UnicodeHelper.cpp (18 hunks)

🔇 Additional comments (5)

src/CLR/Core/CLR_RT_UnicodeHelper.cpp (5)

279-302: Improved overlong encoding detection and pointer management.

The 2-byte UTF-8 sequence handling now correctly:

Reverts the pointer when there aren't enough bytes remaining

Detects overlong encodings (values < 0x80 encoded as 2 bytes)

Advances past the continuation byte for overlong sequences before emitting the replacement character

The pointer and iMaxBytes adjustments appear consistent.

329-368: Correct handling of 3-byte UTF-8 sequences with partial validation.

The implementation properly:

Validates continuation bytes individually and only advances over valid ones

Rejects overlong encodings (fullCh < 0x0800) and surrogate code points (0xD800-0xDFFF)

Maintains consistent pointer and iMaxBytes state before jumping to invalid_sequence

392-476: Correct handling of 4-byte UTF-8 sequences (supplementary characters).

The implementation correctly:

Validates all three continuation bytes individually

Checks the strict Unicode range (0x10000-0x10FFFF)

Handles the case where iMaxChars == 1 (insufficient space for the surrogate pair output)

Properly reverts the pointer by 4 bytes when output buffer is full

506-512: Correct null terminator handling.

The finalization logic properly checks for available space (outputUTF16_size >= 1) before writing the null terminator. This prevents buffer overflow when the output buffer is exactly filled by converted characters.

689-709: Correct handling of invalid surrogate pairs.

When a high surrogate is followed by a non-low-surrogate character:

U+FFFD replacement is output for the invalid high surrogate

The second character is correctly put back (inputUTF16--, iMaxChars++) for reprocessing

This ensures the second character is not lost and is processed as a standalone character in the next iteration.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

nfbot added the Type: bug label Dec 9, 2025

josesimoes added the Area: Common libs Everything related with common libraries label Dec 9, 2025

josesimoes merged commit 573c265 into nanoframework:main Dec 9, 2025
27 checks passed

josesimoes deleted the fix-unicode-helper branch December 9, 2025 19:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fixes in unicode helper with convert to/from UTF8 #3247

Fixes in unicode helper with convert to/from UTF8 #3247

Uh oh!

josesimoes commented Dec 9, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 9, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Fixes in unicode helper with convert to/from UTF8 #3247

Fixes in unicode helper with convert to/from UTF8 #3247

Uh oh!

Conversation

josesimoes commented Dec 9, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How Has This Been Tested?

Screenshots

Types of changes

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

josesimoes commented Dec 9, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 9, 2025 •

edited

Loading