[BugFix] fix multimodal hasher hash collision risk when ndarray shape or dtype differs #7185

Open
3em0 wants to merge 2 commits into PaddlePaddle:develop from 3em0:fix/multimodal-hasher-shape-dtype-collision

3em0 commented Apr 3, 2026

Summary

Fixes #7196

MultimodalHasher.hash_features() uses np.ndarray.tobytes() to compute SHA-256 digests for multimodal cache keys. However, tobytes() serializes only the raw element bytes; it does not encode shape or dtype metadata.

This means:

  • Arrays with different shapes but identical flattened bytes (e.g. (6,4) vs (4,6)) produce the same hash.
  • Arrays with different dtypes whose raw bytes happen to match (e.g. a float32 array and a uint8 reinterpretation of the same memory) also collide.

The current inference pipeline uses a fixed dtype (float32) and deterministic reshape paths, so a real-world collision is extremely unlikely. The hash function itself is still unsafe: if any upstream change alters the preprocessing path, it could silently return wrong cached results.
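
The collision can be reproduced in a few lines. This is a minimal sketch, not the FastDeploy code: `raw_digest` is a hypothetical stand-in that hashes only `tobytes()`, which is the behavior this summary attributes to `hash_features()` before the fix.

```python
import hashlib

import numpy as np


def raw_digest(arr: np.ndarray) -> str:
    # Hypothetical stand-in: hash only the raw element bytes.
    return hashlib.sha256(arr.tobytes()).hexdigest()


a = np.arange(24, dtype=np.float32).reshape(6, 4)
b = a.reshape(4, 6)    # same bytes, different shape
c = a.view(np.uint8)   # same memory reinterpreted as uint8, shape (6, 16)

shape_collision = raw_digest(a) == raw_digest(b)
dtype_collision = raw_digest(a) == raw_digest(c)
print(shape_collision, dtype_collision)
```

Both comparisons come out equal, because all three arrays serialize to the same 96 bytes.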

Fix

Prepend a "{shape}|{dtype}|" header to the byte payload before hashing:

header = f"{obj.shape}|{obj.dtype}|".encode()
return hashlib.sha256(header + obj.tobytes()).hexdigest()
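
A self-contained sketch of the same idea, for trying the fix outside the repository. `hash_ndarray` is a hypothetical name used only for illustration; in the PR the two lines above sit inside `MultimodalHasher.hash_features()`, whose full body is not shown here.

```python
import hashlib

import numpy as np


def hash_ndarray(obj: np.ndarray) -> str:
    # Shape and dtype are hashed ahead of the raw element bytes.
    header = f"{obj.shape}|{obj.dtype}|".encode()
    return hashlib.sha256(header + obj.tobytes()).hexdigest()


a = np.arange(24, dtype=np.float32).reshape(6, 4)  # header "(6, 4)|float32|"
b = a.reshape(4, 6)                                # header "(4, 6)|float32|"
c = a.view(np.uint8)                               # header "(6, 16)|uint8|"

digests = {hash_ndarray(a), hash_ndarray(b), hash_ndarray(c)}
print(len(digests))
```

The three arrays share the same raw bytes, yet each gets its own digest because the headers differ. An equal copy of `a` still hashes to the same value, so cache hits for identical inputs are preserved.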

Test plan

  • Updated test_hash_features_ndarray to match new hash format
  • Added test_hash_features_ndarray_shape_sensitivity — verifies (6,4) vs (4,6) produce different hashes
  • Added test_hash_features_ndarray_dtype_sensitivity — verifies float32 vs float64 produce different hashes
  • Existing test_hash_features_object unchanged and passing
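
A sensitivity test only exercises the header if its two inputs share the same raw bytes; otherwise the digests already differ without the fix. A float32 and a float64 array of the same shape have different byte lengths, so a pairing such as float32 vs int32 zeros isolates the dtype more strictly. The sketch below is an assumption about how such checks could be written, not the PR's test code; `old_hash` and `new_hash` are hypothetical stand-ins for the behavior before and after the fix.

```python
import hashlib

import numpy as np


def old_hash(arr: np.ndarray) -> str:
    # Hypothetical pre-fix behavior: raw bytes only.
    return hashlib.sha256(arr.tobytes()).hexdigest()


def new_hash(arr: np.ndarray) -> str:
    # Hypothetical post-fix behavior: shape and dtype header, then raw bytes.
    header = f"{arr.shape}|{arr.dtype}|".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()


def test_shape_sensitivity() -> None:
    a = np.zeros((6, 4), dtype=np.float32)
    b = np.zeros((4, 6), dtype=np.float32)
    assert old_hash(a) == old_hash(b)   # collides without the header
    assert new_hash(a) != new_hash(b)


def test_dtype_sensitivity() -> None:
    a = np.zeros((6, 4), dtype=np.float32)
    b = np.zeros((6, 4), dtype=np.int32)  # same shape, same 96 zero bytes
    assert old_hash(a) == old_hash(b)   # collides without the header
    assert new_hash(a) != new_hash(b)


def test_float32_vs_float64_differs_either_way() -> None:
    a = np.zeros((6, 4), dtype=np.float32)  # 96 bytes
    b = np.zeros((6, 4), dtype=np.float64)  # 192 bytes
    assert old_hash(a) != old_hash(b)   # byte lengths already differ
    assert new_hash(a) != new_hash(b)
```

The first two tests fail against a bytes-only hash and pass once the header is included, which is what makes them regression tests for this bug.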

paddle-bot bot commented Apr 3, 2026

Thanks for your contribution!

CLAassistant commented Apr 3, 2026

CLA assistant check
All committers have signed the CLA.

@paddle-bot paddle-bot bot added the contributor External developers label Apr 3, 2026
… or dtype differs

numpy tobytes() only serializes raw element bytes without encoding shape
or dtype metadata. This means arrays with identical raw bytes but
different shapes (e.g. (6,4) vs (4,6)) or different dtypes (e.g.
float32 vs uint8 reinterpretation of same memory) produce the same
SHA-256 digest, leading to silent cache collisions in
ProcessorCacheManager / EncoderCacheManager / PrefixCacheManager.

Prepend a "{shape}|{dtype}|" header to the byte payload before hashing
so that shape and dtype participate in the digest.

Added test cases for shape and dtype sensitivity.
3em0 force-pushed the fix/multimodal-hasher-shape-dtype-collision branch from 0da4249 to 7cf4c3c, April 3, 2026 08:35

@fastdeploy-bot fastdeploy-bot left a comment

🤖 AI Code Review | 2026-04-03 17:28 CST

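📋 Review Summary

PR overview: Fixes the hash collision risk in MultimodalHasher when ndarray shape or dtype differs
Scope of changes: fastdeploy/multimodal/hasher.py, tests/multimodal/test_hasher.py
Impact tag: BugFix

Issues

No blocking issues found.

Overall assessment

This is a high-quality BugFix PR. The fix prepends a {shape}|{dtype}| header before hashing so that arrays with different layouts are distinguished; the implementation is concise and effective. Test coverage is sufficient, with new test cases for shape sensitivity and dtype sensitivity. The PR description explains the root cause and the approach to the fix in detail and follows the contribution conventions.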

Labels

contributor External developers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BugFix] MultimodalHasher.hash_features has an ndarray shape/dtype hash collision risk

3 participants