Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions:
Walkthrough

Adds bounded, fail-closed literal and nested-pickle scanning to the standalone pickle engine, refactors nested/policy/stream helpers into new modules, extends ScanOptions with three resource limits, makes report mappings immutable, strengthens compressed-wrapper magic-byte routing, and updates the adapter, scanner merge behavior, and tests.
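The three ScanOptions resource limits mentioned above might look like the following minimal sketch. The field names are taken from the sequence diagram below; the defaults and the frozen-dataclass shape are illustrative assumptions, not the package's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanOptions:
    """Illustrative sketch of the three new resource limits.

    Field names follow the sequence diagram; defaults are assumptions.
    """

    # Longest prefix of a STRING literal inspected for suspicious content.
    max_string_literal_scan_chars: int = 65536
    # Decoded-byte budget for base64/hex-encoded nested pickle candidates.
    max_nested_pickle_bytes: int = 1 << 20
    # Recursion cap for nested pickle-in-pickle scanning.
    max_nested_depth: int = 4
```

Freezing the dataclass mirrors the PR's broader "immutable report/options" direction: callers cannot silently widen a limit mid-scan.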
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Entry as Scanner Entry
    participant Stream as _BoundedPickleStream
    participant Policy as policy module
    participant Nested as nested module
    participant Engine as Standalone/Package Engine
    Entry->>Stream: open with byte_limit / deadline
    Entry->>Stream: read()/readline() for opcodes
    Stream-->>Entry: bytes or short_read / _StreamReadError
    alt STRING literal opcode
        Entry->>Entry: apply max_string_literal_scan_chars window
        Entry->>Policy: suspicious_string_matches(window)
        Policy-->>Entry: match / no match
        Entry->>Entry: emit SUSPICIOUS_STRING finding if match
        alt truncated window
            Entry->>Entry: emit literal_scan_truncated notice (analysis_incomplete)
        end
    end
    alt encoded/hex nested candidate
        Entry->>Nested: _decode_possible_encoded_pickle(value, max_nested_pickle_bytes)
        Nested-->>Entry: decoded bytes OR oversized-prefix signal
        alt decoded && looks_like_pickle
            Entry->>Engine: if nested_depth < max_nested_depth then recursive scan (nested_depth+1, deadline)
            Engine-->>Entry: nested PickleReport
            Entry->>Entry: surface nested findings/notices (remap)
        else oversized prefix
            Entry->>Entry: emit encoded_nested_payload_truncated notice (analysis_incomplete)
        end
    end
    Entry->>Entry: finalize PickleReport with immutable metadata
```
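The bounded read behavior in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions: the class and exception names here are hypothetical stand-ins, not the actual `_BoundedPickleStream` / `_StreamReadError` API.

```python
import io
import time


class BoundedStreamError(Exception):
    """Raised when the byte budget or deadline is exhausted (fail closed)."""


class BoundedStream:
    """Illustrative byte- and deadline-limited reader."""

    def __init__(self, raw: io.BufferedIOBase, byte_limit: int, deadline: float):
        self._raw = raw
        self._remaining = byte_limit
        self._deadline = deadline  # absolute time.monotonic() timestamp

    def read(self, n: int) -> bytes:
        # Fail closed: refuse the read rather than return partial results
        # once either resource limit is hit.
        if time.monotonic() > self._deadline:
            raise BoundedStreamError("scan deadline exceeded")
        if n > self._remaining:
            raise BoundedStreamError("byte limit exceeded")
        data = self._raw.read(n)
        self._remaining -= len(data)
        return data
```

The caller (the opcode loop) treats `BoundedStreamError` as an analysis-incomplete signal, which matches the diagram's `short_read / _StreamReadError` branch.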
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
…scan-audit-hardening # Conflicts: # CHANGELOG.md
Performance Benchmarks: Compared
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a209603a7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py (outdated)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 734fd9c53b
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py`:
- Around line 22-23: The hex truncate budget is wrong:
max_hex_nested_pickle_chars = max_nested_pickle_bytes * 2 only covers plain hex,
but escaped hex literals use "\xNN" (4 chars per byte), so update the logic
around max_hex_nested_pickle_chars (and the truncation that runs later) to
account for escaped form—either set max_hex_nested_pickle_chars =
max_nested_pickle_bytes * 4 so truncation on the escaped string inspects the
intended byte budget, or change the flow to unescape the string (remove "\x")
before applying a bytes-based truncation; adjust any code that slices/decodes
between lines 35-45 accordingly to ensure the byte budget applies to decoded
bytes, referencing max_hex_nested_pickle_chars and the truncation/unescape steps
in nested.py.
In `@packages/modelaudit-picklescan/src/modelaudit_picklescan/report.py`:
- Around line 39-44: The current _immutable_mapping and _mapping_to_dict
implementations only wrap the top-level dict (MappingProxyType(dict(value)))
leaving nested dicts/lists mutable; change _immutable_mapping to deep-freeze the
input before wrapping by recursively converting nested dicts to
MappingProxyType, lists/tuples to tuples, and sets to frozenset (start from a
deepcopy of value to avoid aliasing), and adjust _mapping_to_dict to return a
deep copy that does not share mutable nested objects; operate on the Mapping
passed to _immutable_mapping and _mapping_to_dict so Finding.details and
PickleReport.metadata become fully immutable even for nested contents.
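The deep-freeze the comment asks for could look like the following sketch. The helper name is illustrative; the real code would live in `_immutable_mapping`, and the recursion rules (dicts to `MappingProxyType`, lists/tuples to tuples, sets to `frozenset`) follow the comment's wording.

```python
from types import MappingProxyType
from typing import Any


def deep_freeze(value: Any) -> Any:
    """Recursively convert mutable containers into read-only forms.

    Illustrative sketch: dicts become MappingProxyType, lists/tuples
    become tuples, sets become frozensets; leaves pass through.
    """
    if isinstance(value, dict):
        return MappingProxyType({k: deep_freeze(v) for k, v in value.items()})
    if isinstance(value, (list, tuple)):
        return tuple(deep_freeze(v) for v in value)
    if isinstance(value, (set, frozenset)):
        return frozenset(deep_freeze(v) for v in value)
    return value
```

Starting from a fresh dict comprehension (rather than aliasing the input) also addresses the comment's point about avoiding shared mutable state between the caller's mapping and the frozen report.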
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d54c1056-287a-4ebc-92cd-b1ac8c85707c
📒 Files selected for processing (14)
- CHANGELOG.md
- docs/agents/picklescan-package-split.md
- modelaudit/scanners/picklescan_adapter.py
- packages/modelaudit-picklescan/README.md
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/policy.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/stream.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/options.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/report.py
- packages/modelaudit-picklescan/tests/test_api.py
- packages/modelaudit-picklescan/tests/test_options.py
- packages/modelaudit-picklescan/tests/test_report.py
- tests/scanners/test_picklescan_adapter.py
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py
packages/modelaudit-picklescan/src/modelaudit_picklescan/report.py (outdated)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd7c6f6d99
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py (outdated)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
CHANGELOG.md (1)
34-41: ⚠️ Potential issue | 🟡 Minor

Move this entry into the canonical `[Unreleased]` block. The file now has two `[Unreleased]` sections. Adding new bullets here fragments the changelog and makes it ambiguous which unreleased block release tooling/readers should trust. As per coding guidelines: CHANGELOG.md: "Add CHANGELOG entry under [Unreleased] section (Keep a Changelog format)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CHANGELOG.md` around lines 34 - 41, The changelog contains duplicate "[Unreleased]" sections: consolidate the new bullet lines ("avoid CoreML nested parse...", "harden standalone pickle scanner...", "fail closed when standalone pickle...") into the canonical [Unreleased] block by removing the extra [Unreleased] header and appending these bullets under the existing [Unreleased] section, ensuring there is only one "[Unreleased]" header and that all new entries appear under that single block.
♻️ Duplicate comments (1)
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py (1)
22-23: ⚠️ Potential issue | 🟠 Major

Apply the hex byte budget after accounting for the escaped `\xNN` form. This still under-scans escaped hex literals. `max_nested_pickle_bytes * 2` is correct for plain hex, but here the slice happens before `replace("\\x", "")`, so `\xNN` payloads only get about half the configured decoded-byte budget. That creates a real false-negative path for escaped-hex nested pickles that are below the configured limit.

Suggested fix:

```diff
-    max_hex_nested_pickle_chars = max_nested_pickle_bytes * 2
+    max_hex_nested_pickle_chars = max_nested_pickle_bytes * 4
@@
-    hex_candidate = stripped[:max_hex_nested_pickle_chars].replace("\\x", "")
+    hex_candidate = stripped[:max_hex_nested_pickle_chars].replace("\\x", "")[
+        : max_nested_pickle_bytes * 2
+    ]
@@
-    bounded_hex_candidate = hex_candidate[:max_hex_nested_pickle_chars]
+    bounded_hex_candidate = hex_candidate
```

Also applies to: 35-45
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py` around lines 22 - 23, The hex char budget is computed as max_nested_pickle_bytes * 2 which only fits plain "NN" pairs, but the code slices the input before replace("\\x", "") so escaped '\xNN' sequences consume 4 chars each and will be under-scanned; change the pre-replace budget to account for escaped form (e.g. set max_hex_nested_pickle_chars = max_nested_pickle_bytes * 4 or compute length = max_nested_pickle_bytes*2 + count_of_escaped_sequences*2) and apply the same fix to the other occurrence mentioned (lines 35-45) so the slice length is sized for '\xNN' payloads before calling replace("\\x",""); reference variables: max_base64_nested_pickle_chars, max_hex_nested_pickle_chars, max_nested_pickle_bytes and the replace("\\x","") call.
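A standalone sketch of the budget logic this comment asks for, with illustrative names (this is not the actual `nested.py` helper): size the char window for the worst-case escaped form, strip the escapes, then apply the byte budget to the decoded form.

```python
def bounded_hex_bytes(literal: str, max_nested_pickle_bytes: int) -> bytes:
    """Decode a (possibly escaped) hex literal under a decoded-byte budget."""
    # "\xNN" escapes cost 4 chars per byte, so the pre-strip char window
    # must assume the worst case of 4 chars per byte.
    window = literal[: max_nested_pickle_bytes * 4]
    plain_hex = window.replace("\\x", "")
    # Now the budget applies to decoded bytes: 2 hex chars per byte.
    plain_hex = plain_hex[: max_nested_pickle_bytes * 2]
    # Guard against a window cut mid-escape leaving an odd trailing nibble.
    plain_hex = plain_hex[: len(plain_hex) - (len(plain_hex) % 2)]
    return bytes.fromhex(plain_hex)
```

With the original `* 2` window, an escaped payload would be cut at half the configured byte budget before the escapes were stripped; the `* 4` window plus the post-strip slice applies the full budget in both the plain and escaped forms.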
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py`:
- Around line 866-909: The nested-analysis incomplete branch in
_surface_nested_pickle_findings currently only adds a Notice but leaves the
parent _ScanState.status as COMPLETE; update that branch so when
nested_report.status != ScanStatus.COMPLETE you also set the parent scan status
(self.status) to ScanStatus.INCONCLUSIVE (or the appropriate "inconclusive" enum
variant used by the project) to mark the parent scan as not fully completed,
preserving the existing Notice creation and include reference to
nested_report.status and nested_report.errors/notices in the details.
In `@tests/scanners/test_picklescan_adapter.py`:
- Around line 280-309: The test
test_pickle_report_to_scan_result_fails_closed_for_truncated_literal_scan_notice
should assert the operator-facing fail-closed notice text in addition to
existing success/metadata checks: after calling
pickle_report_to_scan_result(report), locate the emitted check with name
"Standalone Pickle Notice" (or inspect result.checks for check.rule_code ==
"S902" and check.details["pickle_notice_code"] == "literal_scan_truncated") and
add an assertion that its message/text equals the expected fail-closed message
string (the human-facing notice used when analysis_incomplete is True), ensuring
the test locks the exact notice content so the fail-closed message cannot
regress; apply the same pattern to the similar tests around lines 312-350.
---
Outside diff comments:
In `@CHANGELOG.md`:
- Around line 34-41: The changelog contains duplicate "[Unreleased]" sections:
consolidate the new bullet lines ("avoid CoreML nested parse...", "harden
standalone pickle scanner...", "fail closed when standalone pickle...") into the
canonical [Unreleased] block by removing the extra [Unreleased] header and
appending these bullets under the existing [Unreleased] section, ensuring there
is only one "[Unreleased]" header and that all new entries appear under that
single block.
---
Duplicate comments:
In `@packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py`:
- Around line 22-23: The hex char budget is computed as max_nested_pickle_bytes
* 2 which only fits plain "NN" pairs, but the code slices the input before
replace("\\x", "") so escaped '\xNN' sequences consume 4 chars each and will be
under-scanned; change the pre-replace budget to account for escaped form (e.g.
set max_hex_nested_pickle_chars = max_nested_pickle_bytes * 4 or compute length
= max_nested_pickle_bytes*2 + count_of_escaped_sequences*2) and apply the same
fix to the other occurrence mentioned (lines 35-45) so the slice length is sized
for '\xNN' payloads before calling replace("\\x",""); reference variables:
max_base64_nested_pickle_chars, max_hex_nested_pickle_chars,
max_nested_pickle_bytes and the replace("\\x","") call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: fdd0de2a-12d4-4f9f-8dd3-c63e4056c330
📒 Files selected for processing (6)
- CHANGELOG.md
- modelaudit/scanners/picklescan_adapter.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
- packages/modelaudit-picklescan/tests/test_api.py
- tests/scanners/test_picklescan_adapter.py
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 792b3962b8
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/policy.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08d8855f76
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py (outdated)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py`:
- Around line 853-864: The current _bounded_encoded_nested_windows computes
max_chars from the worst-case escaped-hex size and eagerly slices two large
windows for every long literal, which can allocate attacker-controlled memory;
instead, perform a cheap per-encoding candidate check and only create windows
sized for that encoding. In _bounded_encoded_nested_windows, keep
max_base64_chars and max_hex_chars calculations but first cheaply test the value
for likely encoding (e.g., quick membership checks: if value contains only hex
digits and escape patterns treat as hex, else if characters are within base64
alphabet treat as base64, otherwise handle as plain), then compute max_chars
from the chosen encoding (or 16 minimum) and only slice prefix/suffix using that
smaller max_chars; preserve the existing prefix==suffix dedup logic and return
types unchanged.
- Around line 866-876: The recursive call in _surface_nested_pickle_findings
currently launches scan_pickle_payload which creates a fresh _ScanState and
gives nested scans a full timeout; instead capture the remaining wall-clock
budget from the current scan state and pass it into the nested scan so recursion
consumes the same timeout budget. Modify the call in
_surface_nested_pickle_findings (and/or scan_pickle_payload invocation) to
compute remaining_timeout from the current _ScanState (e.g., via a method or
attribute on self._scan_state) and pass that remaining timeout into the nested
scan (either by forwarding a timeout_s/remaining_timeout parameter or by
constructing the nested scan state with the remaining value) so nested payloads
do not receive a full new timeout.
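The remaining-budget computation the second comment asks for can be sketched as follows. Names are illustrative; the real code would read an absolute deadline from the current `_ScanState` and forward the leftover budget into `scan_pickle_payload`.

```python
import time


def remaining_timeout(deadline: float, minimum: float = 0.0) -> float:
    """Wall-clock budget left until an absolute time.monotonic() deadline.

    Illustrative sketch: the nested scan is started with this value
    instead of a fresh full timeout, so recursion shares one budget.
    """
    return max(minimum, deadline - time.monotonic())


# Hypothetical call site, assuming a timeout_s parameter exists:
#   nested = scan_pickle_payload(payload, timeout_s=remaining_timeout(self._deadline))
```

Sharing one deadline across recursion levels is what prevents an attacker from multiplying the total scan time by nesting payloads: each nested scan can only spend what the parent has left.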
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f2c2c7be-593b-4f5a-93e8-367e89aa9302
📒 Files selected for processing (8)
- modelaudit/scanners/picklescan_adapter.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/nested.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/policy.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/report.py
- packages/modelaudit-picklescan/tests/test_api.py
- packages/modelaudit-picklescan/tests/test_report.py
- tests/scanners/test_picklescan_adapter.py
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
* fix: add standalone-primary pickle migration mode
* fix: isolate standalone pickle primary fallback state
* fix: narrow legacy pickle stdlib policy (#903)
  * Remove broad CRITICAL treatment for noisy stdlib modules in the root pickle scanner while preserving exact dangerous helper coverage and warning-level suspicious refs.
  * fix: preserve exact stdlib helper detections — Keep logging configuration loaders and uuid subprocess getnode helpers as exact dangerous callables in both root and standalone pickle policies.
  * fix: flag uuid getnode pickle calls — Keep uuid.getnode in the exact dangerous-call policy after narrowing broad uuid module severity, since it can dispatch to platform helper probes that invoke subprocess-backed paths. Cover both root PickleScanner and standalone picklescan policy regressions.
* fix: route misnamed compressed wrappers by header (#904)
  * Detect gzip, bzip2, xz, lz4, and zlib wrappers from magic bytes even when the filename uses a misleading extension. Preserve existing joblib and R serialized scanner precedence, and add positive/negative routing regressions.
* fix: reject compressed wrapper raw trailers (#905)
  * Use bounded low-level bzip2 and xz decompression so valid first members cannot hide raw unscanned trailing bytes. Preserve concatenated valid-member support and add fail-closed regressions.
  * fix: bound compressed decompression probes — Cap bzip2/xz per-call decompression max_length to a chunk-sized probe so a single compressed chunk cannot allocate up to the full remaining scan budget before limit checks run.
  * fix: bound zlib decompression probes — Use the same chunk-sized decompression probe for zlib streams so per-call output is bounded before size and ratio limit checks run.
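The chunk-sized decompression probe described in the bzip2/xz/zlib commits can be sketched with the stdlib `bz2` incremental API. The helper name is illustrative; the real `_read_bzip2_stream_with_limits` also enforces ratio limits and supports concatenated members, which this single-member sketch omits.

```python
import bz2


def read_bzip2_bounded(data: bytes, chunk_size: int, max_out: int) -> bytes:
    """Decompress one bzip2 member with chunk-sized output probes.

    max_length caps each decompress() call so one compressed chunk can
    never expand past chunk_size before the budget check runs; raw bytes
    after the member are rejected (fail closed).
    """
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    buf = data
    while not decomp.eof:
        chunk = decomp.decompress(buf, max_length=chunk_size)
        buf = b""  # later calls drain the decompressor's buffered input
        out += chunk
        if len(out) > max_out:
            raise ValueError("decompressed output exceeds budget")
        if not chunk and decomp.needs_input:
            raise ValueError("truncated bzip2 stream")
    if decomp.unused_data:
        raise ValueError("raw trailing bytes after bzip2 member")
    return bytes(out)
```

The key property is that the budget check runs after every bounded probe, so memory use is `O(chunk_size)` regardless of the stream's compression ratio.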
* fix: harden lz4 wrapper decoding (#906)
  * Use the lz4 frame decompressor API through the bounded concatenated-stream loop so raw trailers fail closed and per-call output probes stay chunk-sized. Add fake-frame regressions for optional lz4 coverage.
  * fix: support legacy lz4 chunk decoding — Fall back to lz4.frame decompression contexts when LZ4FrameDecompressor is unavailable while keeping bounded output probes and raw-trailer rejection. Add fallback-specific lz4 regressions for bounded reads, raw trailers, and concatenated frames.
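The header-based routing from the #904 commit can be sketched as a magic-byte lookup. The prefixes below come from the public format specifications; zlib is omitted here because it has no unambiguous magic (the real scanner needs a heuristic for it), and the function name is illustrative.

```python
from typing import Optional

# (magic prefix, codec) pairs from the respective format specs.
_MAGIC_PREFIXES = [
    (b"\x1f\x8b", "gzip"),            # RFC 1952
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x04\x22\x4d\x18", "lz4"),     # lz4 frame magic, little-endian
]


def sniff_compression(header: bytes) -> Optional[str]:
    """Route by magic bytes so a misleading extension cannot skip the scanner."""
    for prefix, codec in _MAGIC_PREFIXES:
        if header.startswith(prefix):
            return codec
    return None
```

Routing on content rather than filename is what closes the "misnamed wrapper" gap: a `.bin` file that is really gzip still reaches the compressed-wrapper scanner.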
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelaudit/scanners/compressed_scanner.py (1)
493-555: ⚠️ Potential issue | 🟠 Major

Delete the temp file when decompression fails. If any `_read_*_stream_with_limits()` call raises here, `_decompress_to_tempfile()` exits before returning and the caller never learns the `temp_path`. Because the file was created with `delete=False`, the partially written temp file is leaked on every corrupt/oversized input.

Suggested fix:
```diff
     with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as temp_file:
         temp_path = temp_file.name
-
-        with open(path, "rb") as source:
-            if codec == "gzip":
-                try:
-                    with gzip.GzipFile(fileobj=source, mode="rb") as reader:
-                        total_out = self._copy_stream_with_limits(
-                            source=reader,
-                            destination=temp_file,
-                            max_decompressed_bytes=self.max_decompressed_bytes,
-                            max_ratio=self.max_decompression_ratio,
-                            compressed_size=compressed_size,
-                            chunk_size=self.chunk_size,
-                        )
-                except (OSError, EOFError, gzip.BadGzipFile) as exc:
-                    raise _CorruptStreamError(f"Invalid gzip stream: {exc}") from exc
-            elif codec == "bzip2":
-                total_out = self._read_bzip2_stream_with_limits(
-                    source=source,
-                    destination=temp_file,
-                    max_decompressed_bytes=self.max_decompressed_bytes,
-                    max_ratio=self.max_decompression_ratio,
-                    compressed_size=compressed_size,
-                    chunk_size=self.chunk_size,
-                )
-            elif codec == "xz":
-                total_out = self._read_xz_stream_with_limits(
-                    source=source,
-                    destination=temp_file,
-                    max_decompressed_bytes=self.max_decompressed_bytes,
-                    max_ratio=self.max_decompression_ratio,
-                    compressed_size=compressed_size,
-                    chunk_size=self.chunk_size,
-                )
-            elif codec == "lz4":
-                lz4_frame = self._get_lz4_frame_module()
-                total_out = self._read_lz4_stream_with_limits(
-                    source=source,
-                    destination=temp_file,
-                    lz4_frame=lz4_frame,
-                    max_decompressed_bytes=self.max_decompressed_bytes,
-                    max_ratio=self.max_decompression_ratio,
-                    compressed_size=compressed_size,
-                    chunk_size=self.chunk_size,
-                )
-            elif codec == "zlib":
-                total_out = self._read_zlib_stream_with_limits(
-                    source=source,
-                    destination=temp_file,
-                    max_decompressed_bytes=self.max_decompressed_bytes,
-                    max_ratio=self.max_decompression_ratio,
-                    compressed_size=compressed_size,
-                    chunk_size=self.chunk_size,
-                )
-            else:
-                raise _CorruptStreamError(f"Unsupported compression codec: {codec}")
+        try:
+            with open(path, "rb") as source:
+                if codec == "gzip":
+                    try:
+                        with gzip.GzipFile(fileobj=source, mode="rb") as reader:
+                            total_out = self._copy_stream_with_limits(
+                                source=reader,
+                                destination=temp_file,
+                                max_decompressed_bytes=self.max_decompressed_bytes,
+                                max_ratio=self.max_decompression_ratio,
+                                compressed_size=compressed_size,
+                                chunk_size=self.chunk_size,
+                            )
+                    except (OSError, EOFError, gzip.BadGzipFile) as exc:
+                        raise _CorruptStreamError(f"Invalid gzip stream: {exc}") from exc
+                elif codec == "bzip2":
+                    total_out = self._read_bzip2_stream_with_limits(...)
+                elif codec == "xz":
+                    total_out = self._read_xz_stream_with_limits(...)
+                elif codec == "lz4":
+                    total_out = self._read_lz4_stream_with_limits(...)
+                elif codec == "zlib":
+                    total_out = self._read_zlib_stream_with_limits(...)
+                else:
+                    raise _CorruptStreamError(f"Unsupported compression codec: {codec}")
+        except Exception:
+            if os.path.exists(temp_path):
+                os.unlink(temp_path)
+            raise

     return temp_path, total_out
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/compressed_scanner.py` around lines 493 - 555, The temporary file created in _decompress_to_tempfile using tempfile.NamedTemporaryFile(suffix=suffix, delete=False) can be leaked if any of the codec handlers (_copy_stream_with_limits, _read_bzip2_stream_with_limits, _read_xz_stream_with_limits, _read_lz4_stream_with_limits, _read_zlib_stream_with_limits) raises an exception; wrap the decompression logic in a try/except/finally (or try/finally) that removes/unlinks temp_path on any exception before re-raising so the partially written file is always deleted, ensuring temp_file.name (temp_path) is cleaned up when errors occur.
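A condensed sketch of the cleanup pattern this comment describes, with the per-codec stream readers collapsed into a single `decompress` callable (illustrative names; not the scanner's actual `_decompress_to_tempfile` signature):

```python
import os
import tempfile


def decompress_to_tempfile(path: str, decompress) -> str:
    """Decompress into a delete=False temp file, unlinking it on failure.

    Because delete=False disables automatic cleanup, any exception from
    the codec handler must remove the partial file before re-raising,
    or corrupt/oversized inputs leak a temp file per scan.
    """
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_path = temp_file.name
        try:
            with open(path, "rb") as source:
                decompress(source, temp_file)
        except Exception:
            temp_file.close()
            os.unlink(temp_path)  # remove the partial output
            raise
    return temp_path
```

On success the caller owns `temp_path` and is responsible for deleting it after scanning; on failure nothing is left behind.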
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 349-350: The early-return that checks propagate_fallback_state and
_is_operational_pickle_issue(check) is too narrow: when propagate_fallback_state
is False you must also suppress legacy scanner-limitation findings and avoid
copying fallback metadata; update the conditional(s) that currently use
_is_operational_pickle_issue(check) (and the similar checks at the other
early-return sites) to also detect/skip checks with legacy states like
"scanner_limitation", "recursion_limit_exceeded",
"memory_limit_on_legitimate_model" (or add a helper like
_is_legacy_scanner_limitation(check) and use it alongside
_is_operational_pickle_issue), and remove or sanitize fallback keys
("parsing_failed", "failure_reason", "scanner_limitation") from the merged
metadata before returning so standalone-primary mode does not leak legacy
incomplete-state noise.
- Around line 4232-4238: The legacy-pass exceptions (RecursionError and those
matching _is_pickle_parse_failure) are being re-raised before the
use_standalone_pickle_primary branch, so when standalone produced a usable
package_result we still fail; change the control flow in the scanner method so
exceptions from the legacy/fallback parse are caught and not re-raised if
self.use_standalone_pickle_primary is True and the standalone package_result is
valid—specifically, wrap the legacy parse/merge block to catch RecursionError
and exceptions where _is_pickle_parse_failure(error) is True, defer or suppress
re-raising, and let the existing standalone branch that calls
_merge_missing_pickle_checks(package_result, legacy_result, ...) run and return
package_result (setting
package_result.metadata["pickle_primary_engine"]="standalone") when appropriate.
Ensure you reference the same symbols: use_standalone_pickle_primary,
package_result, legacy_result, _merge_missing_pickle_checks, RecursionError, and
_is_pickle_parse_failure.
In `@packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/policy.py`:
- Around line 24-167: The new top-level policy constants lack type annotations;
add explicit types (and necessary typing imports) so mypy can validate this
security surface: annotate _BUILTIN_MODULES and _BUILTIN_DANGEROUS_NAMES and
_DANGEROUS_WILDCARD_MODULES as FrozenSet[str] (or frozenset[str] if using Python
3.9+), and annotate _DANGEROUS_GLOBALS as FrozenSet[Tuple[str, str]] (or
frozenset[tuple[str, str]]), and import the corresponding typing names
(FrozenSet, Tuple) if needed; keep the existing values but add the type hints
next to each constant name.
- Around line 48-167: The policy lists in _DANGEROUS_GLOBALS and
_DANGEROUS_WILDCARD_MODULES are missing entries present in the legacy canonical
table (e.g., base64.b64decode and codecs.decode/encode), causing
global_severity() to return None for those symbols; update policy.py so its
_DANGEROUS_GLOBALS and/or _DANGEROUS_WILDCARD_MODULES are derived from or
synchronized with the canonical table used by
modelaudit/detectors/suspicious_symbols.py (or explicitly add the missing tuples
such as ("base64","b64decode"), ("codecs","decode"), ("codecs","encode") to
_DANGEROUS_GLOBALS) and ensure global_severity() behavior remains parity with
the legacy detector (run/change tests to validate standalone-primary parity).
In `@packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py`:
- Around line 588-603: The oversized-raw branch in _scan_raw_nested_pickle_bytes
currently records a nested finding but does not mark the overall scan/report as
incomplete or emit the user-facing notice; update that branch to mirror the
encoded-path behavior by (1) marking the parent scan/report as incomplete using
the same helper the encoded path uses (e.g., call the scan-incomplete helper the
codebase uses) and (2) emit the same notice/message about analysis being
incomplete so callers see the skipped deep-scan; keep the existing
_add_nested_payload_finding call (with analysis_incomplete=True) and ensure any
surface logic in _surface_nested_pickle_findings is not invoked when you
early-return.
In `@tests/scanners/test_compressed_scanner.py`:
- Around line 512-530: The test
test_compressed_scanner_rejects_raw_trailer_after_bzip2_or_xz_member currently
asserts result.success is False and no CRITICAL issues but misses the repo
contract around warning-only failures; update this test (and the other similar
tests around the 580-623 range) to also assert result.has_warnings is True and
result.has_errors is False so that CompressedScanner scan results preserve
WARNING vs CRITICAL semantics (refer to the result variable,
CompressedScanner.scan, and ScanResult.has_warnings/has_errors to locate and
implement the assertions).
In `@tests/scanners/test_pickle_scanner.py`:
- Around line 1021-1029: The test should additionally assert that the merged
result remains warning-only and that any post-merge scan outcome reasons are
removed: after calling _merge_missing_pickle_checks(target, fallback,
propagate_fallback_state=False) add assertions that target.has_warnings is True,
target.has_errors is False, and "scan_outcome_reasons" not in target.metadata;
keep the existing checks for the S777 issue and absence of the "Legacy Pickle
Parse Failure" check to ensure the fallback/merge preserved only the
warning-level evidence.
- Around line 5033-5109: The tests
test_benign_stdlib_neighbors_do_not_inherit_broad_module_criticality and
test_uuid_object_pickle_does_not_inherit_uuid_module_criticality currently only
assert absence of CRITICAL findings; update them to assert there are no failing
checks at all for the benign references. Concretely, in
test_benign_stdlib_neighbors_do_not_inherit_broad_module_criticality replace the
any(...) predicate that checks for check.status == CheckStatus.FAILED and
check.severity == IssueSeverity.CRITICAL with a predicate that checks only for
check.status == CheckStatus.FAILED (and full_ref in check.message) so any
WARNING or CRITICAL failure will fail the test; similarly in
test_uuid_object_pickle_does_not_inherit_uuid_module_criticality change the
any(...) that filters by severity to check only for check.status ==
CheckStatus.FAILED and "uuid.UUID" in check.message.
In `@tests/scripts/test_compare_pickle_scanners.py`:
- Around line 155-158: The test assumes a stable ordering by using
report["comparisons"][0], which can be flaky; update the assertion to select the
desired comparison deterministically (e.g., locate the comparison whose "root"
has the expected path or sort report["comparisons"] by a stable key such as
"path" before indexing). In the test_compare_pickle_scanners.py test, replace
the use of first_comparison = report["comparisons"][0] with logic that finds the
comparison whose comparison["root"]["path"] (or another invariant) matches the
expected root, then assert on that comparison's "root_delta",
["root"]["engine"], and ["metadata"]["pickle_primary_engine"] values to avoid
order coupling.
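A minimal sketch of the deterministic selection, with a hand-built `report` dict standing in for the comparison script's JSON output:

```python
# Stand-in for the JSON emitted by scripts/compare_pickle_scanners.py.
report = {
    "comparisons": [
        {"root": {"path": "other.pkl", "engine": "legacy"}},
        {"root": {"path": "fixture.pkl", "engine": "standalone"}},
    ]
}
expected_path = "fixture.pkl"

# Flaky: report["comparisons"][0] depends on emission order.
# Deterministic: find the entry whose root path matches the expected fixture.
target_comparison = next(
    c for c in report["comparisons"] if c["root"]["path"] == expected_path
)
assert target_comparison["root"]["engine"] == "standalone"
```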
---
Outside diff comments:
In `@modelaudit/scanners/compressed_scanner.py`:
- Around line 493-555: The temporary file created in _decompress_to_tempfile
using tempfile.NamedTemporaryFile(suffix=suffix, delete=False) can be leaked if
any of the codec handlers (_copy_stream_with_limits,
_read_bzip2_stream_with_limits, _read_xz_stream_with_limits,
_read_lz4_stream_with_limits, _read_zlib_stream_with_limits) raises an
exception; wrap the decompression logic in a try/except/finally (or try/finally)
that removes/unlinks temp_path on any exception before re-raising so the
partially written file is always deleted, ensuring temp_file.name (temp_path) is
cleaned up when errors occur.
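A sketch of the requested try/finally cleanup pattern, assuming a helper shaped like `_decompress_to_tempfile`; the handler callable stands in for the codec-specific readers named above:

```python
import os
import tempfile


def decompress_to_tempfile(decompress, suffix=".bin"):
    # decompress stands in for _copy_stream_with_limits and friends; on any
    # failure the partially written temp file is unlinked before re-raising.
    temp_file = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    temp_path = temp_file.name
    try:
        with temp_file:
            decompress(temp_file)
    except BaseException:
        try:
            os.unlink(temp_path)
        except OSError:
            pass
        raise
    return temp_path


# Failure case: a handler that dies mid-write must not leak the file.
seen = []


def failing_handler(fh):
    seen.append(fh.name)
    fh.write(b"partial")
    raise ValueError("corrupt stream")


try:
    decompress_to_tempfile(failing_handler)
except ValueError:
    pass
assert not os.path.exists(seen[0])
```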
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e0e96e2b-1628-44b0-b787-d45ab30a62bf
📒 Files selected for processing (17)
- CHANGELOG.md
- docs/agents/picklescan-package-split.md
- modelaudit/core.py
- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanner_registry_metadata.py
- modelaudit/scanners/compressed_scanner.py
- modelaudit/scanners/pickle_scanner.py
- modelaudit/utils/file/detection.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/policy.py
- packages/modelaudit-picklescan/src/modelaudit_picklescan/engine/scanner.py
- packages/modelaudit-picklescan/tests/test_api.py
- scripts/README.md
- scripts/compare_pickle_scanners.py
- tests/scanners/test_compressed_scanner.py
- tests/scanners/test_pickle_scanner.py
- tests/scripts/test_compare_pickle_scanners.py
- tests/test_core.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 217-368: The Unreleased compare target in the CHANGELOG footer is
still pointing to v0.2.28...HEAD after adding releases through 0.2.33; update
the footer link (the `[Unreleased]` compare target) to use v0.2.33...HEAD (or
equivalent latest tag) so the Unreleased diff excludes already-released commits
and restores Keep-a-Changelog navigation semantics.
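The fix amounts to one line in the footer; `OWNER/REPO` below is a placeholder for the real repository path:

```markdown
[Unreleased]: https://github.com/OWNER/REPO/compare/v0.2.33...HEAD
```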
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: db85fec9b6
Summary
This hardens the standalone `modelaudit-picklescan` package after the audit.
The scanner previously kept policy tables, stream handling, nested pickle
helpers, and result construction close together in one engine module, while
some resource limits were fixed constants rather than caller-controlled
options. That made the package harder to reason about and left gaps in
high-risk callable coverage, suspicious string detection, nested payload
surfacing, and result immutability.
Changes
- Extended high-risk callable coverage to include `pip.main`, `numpy.load`,
  `shutil.rmtree`, `webbrowser.open`, and `builtins.globals`.
- Made the `details` and `metadata` top-level mappings read-only after
  construction while preserving mutable `to_dict()` output for downstream
  consumers.
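The read-only mapping behavior can be sketched with the stdlib's `types.MappingProxyType`; `PickleReport` here is a hypothetical stand-in, not the package's actual class:

```python
from types import MappingProxyType


class PickleReport:
    # Hypothetical stand-in for the real report type: top-level mappings are
    # immutable views, while to_dict() hands out plain mutable copies.
    def __init__(self, details, metadata):
        self._details = dict(details)
        self._metadata = dict(metadata)
        self.details = MappingProxyType(self._details)
        self.metadata = MappingProxyType(self._metadata)

    def to_dict(self):
        # Downstream consumers still receive mutable dicts.
        return {"details": dict(self._details), "metadata": dict(self._metadata)}


report = PickleReport({"opcode_count": 12}, {"engine": "standalone"})
try:
    report.metadata["engine"] = "patched"  # mappingproxy rejects item assignment
except TypeError:
    blocked = True
else:
    blocked = False
assert blocked
assert report.to_dict()["metadata"]["engine"] == "standalone"
```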
Validation
- `uv run --with pytest pytest tests/test_api.py -q` from
  `packages/modelaudit-picklescan`: 48 passed.
- `uv run --with pytest pytest tests/test_api.py tests/test_options.py -q` from
  `packages/modelaudit-picklescan`: 69 passed.
- `uv run --with pytest pytest tests/test_report.py tests/test_api.py -q` from
  `packages/modelaudit-picklescan`: 58 passed.
- `uv run --with ruff ruff check src tests` from
  `packages/modelaudit-picklescan`: passed.
- `uv run --with ruff ruff format --check src tests` from
  `packages/modelaudit-picklescan`: passed.
- `uv run --with mypy mypy src tests` from `packages/modelaudit-picklescan`:
  passed.
- `uv run --with pytest --with pytest-xdist pytest -n auto tests --tb=short`
  from `packages/modelaudit-picklescan`: 78 passed.
- `env PYTHONPATH=packages/modelaudit-picklescan/src uv run python
  scripts/compare_pickle_scanners.py --json`: passed with no safe fixture drift
  and no status/verdict drift.
- `uv run ruff check modelaudit/ packages/modelaudit-picklescan/src
  packages/modelaudit-picklescan/tests tests/`: passed.
- `uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src
  packages/modelaudit-picklescan/tests tests/`: passed.
- `uv run mypy modelaudit/ packages/modelaudit-picklescan/src
  packages/modelaudit-picklescan/tests tests/`: passed.
- `uv run pytest -n auto -m "not slow and not integration" --maxfail=1`: 3333
  passed, 75 skipped, 16 warnings after syncing with `origin/main`.
- `uv build --out-dir /tmp/modelaudit-picklescan-dist
  packages/modelaudit-picklescan`: passed.
- `uv run --with twine twine check /tmp/modelaudit-picklescan-dist/*`: passed.
Notes
The parity harness still reports expected rule-code drift while the root
`PickleScanner` compatibility fallback remains in place, but the drift did not
change safe fixture verdicts/statuses.