Skip to content

Conversation

@josesimoes
Copy link
Member

@josesimoes josesimoes commented Dec 9, 2025

Description

  • Issue with null terminator buffer size.
  • Invalid surrogate pair handling where the second character is put back for reprocessing.
  • Proper advancement of input pointers for invalid sequences.
  • Edge cases with partial multi-byte sequences.
  • Overlong encoding detection.
  • Boundary conditions and valid/invalid transitions.

Motivation and Context

How Has This Been Tested?

Screenshots

Types of changes

  • Improvement (non-breaking change that improves a feature, code or algorithm)
  • Bug fix (non-breaking change which fixes an issue with code or algorithm)
  • New feature (non-breaking change which adds functionality to code)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Config and build (change in the configuration and build system, has no impact on code or features)
  • Dev Containers (changes related with Dev Containers, has no impact on code or features)
  • Dependencies/declarations (update dependencies or assembly declarations and changes associated, has no impact on code or features)
  • Documentation (changes or updates in the documentation, has no impact on code or features)

Checklist

  • My code follows the code style of this project (only if there are changes in source code).
  • My changes require an update to the documentation (there are changes that require the docs website to be updated).
  • I have updated the documentation accordingly (the changes require an update on the docs in this repo).
  • I have read the CONTRIBUTING document.
  • I have tested everything locally and all new and existing tests passed (only if there are changes in source code).

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of invalid UTF-8 and UTF-16 sequences with proper replacement character substitution for better robustness
    • Enhanced validation of continuation bytes and surrogate pairs to prevent data corruption
    • Fixed edge cases with overlong encodings and malformed character sequences
    • Strengthened boundary condition handling and output capacity checks for reliable Unicode conversion

✏️ Tip: You can customize this high-level summary in your review settings.

- Issue with null terminator buffer size.
- Invalid surrogate pair handling where the second character is put back for reprocessing.
- Proper advancement of input pointers for invalid sequences.
- Edge cases with partial multi-byte sequences.
- Overlong encoding detection.
- Boundary conditions and valid/invalid transitions.
@nfbot nfbot added the Type: bug label Dec 9, 2025
@josesimoes josesimoes added the Area: Common libs Everything related with common libraries label Dec 9, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 9, 2025

Walkthrough

Modified Unicode conversion functions in the CLR runtime helper to improve handling of invalid UTF-8 and UTF-16 sequences. Changes include more aggressive input pointer advancement on errors, explicit validation of continuation bytes and surrogates, substitution of invalid sequences with U+FFFD replacement characters, and boundary condition refinements in finalization logic.

Changes

Cohort / File(s) Summary
Unicode conversion error handling improvements
src/CLR/Core/CLR_RT_UnicodeHelper.cpp
Enhanced ConvertFromUTF8, ConvertToUTF8, and CountNumberOfCharacters functions to advance input pointers more aggressively on invalid sequences; added explicit lookahead validation for continuation bytes; implemented overlong encoding detection; added surrogate pair validation with proper error recovery; ensured replacement character (U+FFFD) output in invalid conditions; refined output capacity checks and pointer state management during error conditions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • Pointer arithmetic correctness across all error paths, especially advancement of iMaxBytes and iMaxChars in tandem with input pointer updates
  • Verification that overlong 2-byte and surrogate pair detection logic conforms to UTF-8/UTF-16 standards
  • Confirmation that U+FFFD replacement characters are consistently output in all invalid sequence branches
  • Boundary condition handling in UTF-16 output finalization logic to prevent premature false returns or incomplete termination
  • State consistency when reverting or adjusting pointers for reprocessing invalid units

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: UTF-8/UTF-16 conversion fixes in the unicode helper, which aligns with the substantial modifications to CLR_RT_UnicodeHelper.cpp detailed in the changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2139b and 2b1d616.

📒 Files selected for processing (1)
  • src/CLR/Core/CLR_RT_UnicodeHelper.cpp (18 hunks)
🔇 Additional comments (5)
src/CLR/Core/CLR_RT_UnicodeHelper.cpp (5)

279-302: Improved overlong encoding detection and pointer management.

The 2-byte UTF-8 sequence handling now correctly:

  • Reverts the pointer when there aren't enough bytes remaining
  • Detects overlong encodings (values < 0x80 encoded as 2 bytes)
  • Advances past the continuation byte for overlong sequences before emitting the replacement character

The pointer and iMaxBytes adjustments appear consistent.


329-368: Correct handling of 3-byte UTF-8 sequences with partial validation.

The implementation properly:

  • Validates continuation bytes individually and only advances over valid ones
  • Rejects overlong encodings (fullCh < 0x0800) and surrogate code points (0xD800-0xDFFF)
  • Maintains consistent pointer and iMaxBytes state before jumping to invalid_sequence

392-476: Correct handling of 4-byte UTF-8 sequences (supplementary characters).

The implementation correctly:

  • Validates all three continuation bytes individually
  • Checks the strict Unicode range (0x10000-0x10FFFF)
  • Handles the case where iMaxChars == 1 (insufficient space for the surrogate pair output)
  • Properly reverts the pointer by 4 bytes when output buffer is full

506-512: Correct null terminator handling.

The finalization logic properly checks for available space (outputUTF16_size >= 1) before writing the null terminator. This prevents buffer overflow when the output buffer is exactly filled by converted characters.


689-709: Correct handling of invalid surrogate pairs.

When a high surrogate is followed by a non-low-surrogate character:

  1. U+FFFD replacement is output for the invalid high surrogate
  2. The second character is correctly put back (inputUTF16--, iMaxChars++) for reprocessing

This ensures the second character is not lost and is processed as a standalone character in the next iteration.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@josesimoes josesimoes merged commit 573c265 into nanoframework:main Dec 9, 2025
27 checks passed
@josesimoes josesimoes deleted the fix-unicode-helper branch December 9, 2025 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Common libs Everything related with common libraries Type: bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants