[BugFix] fix multimodal hasher hash collision risk when ndarray shape or dtype differs #7185
Open
3em0 wants to merge 2 commits into PaddlePaddle:develop
Thanks for your contribution!
… or dtype differs
numpy tobytes() only serializes raw element bytes without encoding shape
or dtype metadata. This means arrays with identical raw bytes but
different shapes (e.g. (6,4) vs (4,6)) or different dtypes (e.g.
float32 vs uint8 reinterpretation of same memory) produce the same
SHA-256 digest, leading to silent cache collisions in
ProcessorCacheManager / EncoderCacheManager / PrefixCacheManager.
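The collision described above can be reproduced outside the project. The following is a minimal sketch, assuming only NumPy and hashlib; `raw_digest` is a hypothetical stand-in for hashing `tobytes()` directly and is not the actual FastDeploy code.

```python
import hashlib

import numpy as np


def raw_digest(arr: np.ndarray) -> str:
    """SHA-256 over the raw element bytes only (hypothetical helper).

    Shape and dtype are not part of the hashed payload.
    """
    return hashlib.sha256(arr.tobytes()).hexdigest()


a = np.arange(24, dtype=np.float32).reshape(6, 4)
b = a.reshape(4, 6)   # same elements, different shape
c = a.view(np.uint8)  # same memory reinterpreted as uint8, shape (6, 16)

# All three arrays serialize to the same 96 bytes, so the digests match.
shape_collision = raw_digest(a) == raw_digest(b)
dtype_collision = raw_digest(a) == raw_digest(c)
print(shape_collision, dtype_collision)  # True True
```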
Prepend a "{shape}|{dtype}|" header to the byte payload before hashing
so that shape and dtype participate in the digest.
Added test cases for shape and dtype sensitivity.
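As an illustration of the header approach in this commit message, here is a minimal sketch. The function name `hash_ndarray` and the exact string formatting of the header are assumptions; the real change in fastdeploy/multimodal/hasher.py may differ in detail.

```python
import hashlib

import numpy as np


def hash_ndarray(arr: np.ndarray) -> str:
    """Hash an array with a "{shape}|{dtype}|" header (hypothetical helper)."""
    # Assumption: shape and dtype are rendered with their default str() forms,
    # e.g. "(6, 4)|float32|". Neither form contains "|", so the header is
    # unambiguous.
    header = f"{arr.shape}|{arr.dtype}|".encode("utf-8")
    return hashlib.sha256(header + arr.tobytes()).hexdigest()


a = np.arange(24, dtype=np.float32).reshape(6, 4)
b = a.reshape(4, 6)   # same bytes, different shape
c = a.view(np.uint8)  # same bytes, different dtype (and shape)

digests = {hash_ndarray(a), hash_ndarray(b), hash_ndarray(c)}
print(len(digests))  # 3
```

Because the header is part of the hashed payload, any existing cache keys computed from raw bytes alone would no longer match after such a change.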
0da4249 to
7cf4c3c
fastdeploy-bot
left a comment
🤖 AI Code Review |
2026-04-03 17:28 CST
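📋 Review Summary
PR overview: Fixes the hash collision risk in MultimodalHasher when ndarray shape or dtype differs
Scope of changes: fastdeploy/multimodal/hasher.py, tests/multimodal/test_hasher.py
Impact tag: BugFix
Issues
No blocking issues found.
Overall assessment
This is a high-quality BugFix PR. The fix prepends a {shape}|{dtype}| header before hashing to distinguish arrays with different layouts, and the implementation is simple and effective. Test coverage is sufficient: new test cases cover shape sensitivity and dtype sensitivity. The PR description explains the root cause and the fix in detail and follows the project conventions.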
Summary
Fixes #7196
MultimodalHasher.hash_features() uses np.ndarray.tobytes() to compute SHA-256 digests for multimodal cache keys. However, tobytes() only serializes raw element bytes; it does not encode shape or dtype metadata. This means:
- Arrays with identical raw bytes but different shapes (e.g. (6,4) vs (4,6)) produce the same hash
- Arrays with identical raw bytes but different dtypes (e.g. float32 vs uint8 reinterpretation) also collide
While the current inference pipeline uses a fixed dtype (float32) and deterministic reshape paths, making a real-world collision extremely unlikely, the hash function itself is fundamentally unsafe and could silently return wrong cached results if any upstream change alters the preprocessing path.
Fix
Prepend a "{shape}|{dtype}|" header to the byte payload before hashing, so that shape and dtype participate in the digest.
Test plan
- test_hash_features_ndarray: updated to match the new hash format
- test_hash_features_ndarray_shape_sensitivity: verifies that (6,4) vs (4,6) produce different hashes
- test_hash_features_ndarray_dtype_sensitivity: verifies that float32 vs float64 produce different hashes
- test_hash_features_object: unchanged and passing