Add pre-tool verifier defense middleware for input instruction violation detection #1605
Note: Reviews paused. This branch is under active development; to avoid overwhelming the author with review comments from an influx of new commits, CodeRabbit automatically paused this review.
Walkthrough

Adds a Pre-Tool Verifier middleware and supporting pieces to perform LLM-driven input-side verification and sanitization before tool invocation, plus config, data model, registration, and README updates to integrate pre-tool verification alongside existing output-side defenses.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant PreToolVerifier as Pre-Tool Verifier Middleware
    participant GuardLLM as Guard LLM
    participant Next as Next Middleware / Tool
    Client->>PreToolVerifier: Submit request / tool input
    PreToolVerifier->>PreToolVerifier: Extract input content
    PreToolVerifier->>GuardLLM: Send content + system_instructions for verification
    GuardLLM-->>PreToolVerifier: Return JSON verification result
    PreToolVerifier->>PreToolVerifier: Parse result, evaluate confidence vs threshold
    alt Violation detected (confidence >= threshold)
        PreToolVerifier->>PreToolVerifier: Apply policy (refuse / sanitize / redirect)
        PreToolVerifier-->>Client: Return refusal or sanitized input
    else No violation
        PreToolVerifier->>Next: Forward original input
        Next-->>PreToolVerifier: Return response
        PreToolVerifier-->>Client: Return response
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 3 of 3 passed
…ion detection Adds a new PreToolVerifierMiddleware that uses an LLM to analyze function inputs before a tool is called, detecting prompt injection, jailbreak attempts, and instruction override attacks. Updates the retail agent example config and documentation with the new defense. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: lidan-capsule <lidan@capsule.security>
Switch the pre-tool verifier to a smaller, dedicated guard model (Qwen/Qwen3Guard-Gen-8B) run locally via HuggingFace instead of the large meta/llama-3.3-70b-instruct NIM model. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: lidan-capsule <lidan@capsule.security>
Force-pushed from 2610550 to ed40d35.
Actionable comments posted: 4
🧹 Nitpick comments (5)
packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_data_models.py (1)

15-15: Module docstring is now slightly inaccurate. The docstring reads "Data models for defense middleware output" but the file now also contains `PreToolVerificationResult`, which is an input-level data model.

Suggested fix

```diff
-"""Data models for defense middleware output."""
+"""Data models for defense middleware."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_data_models.py` at line 15, update the module docstring to accurately reflect the contents: this file contains both input and output data models (e.g., PreToolVerificationResult is an input-level model and other classes are defense middleware output models). Replace the current docstring "Data models for defense middleware output." with a brief description that mentions both input and output data models and references PreToolVerificationResult and the defense middleware output models to make intent clear.

examples/safety_and_security/retail_agent/src/nat_retail_agent/configs/config-with-defenses.yml (1)
42-48: Consider documenting the resource requirements for the 8B guard model. The `pre_tool_guard_llm` uses `Qwen/Qwen3Guard-Gen-8B` (8B params), which is significantly larger than the existing `guard_llm` at 0.6B. With `device: auto`, this will attempt to load onto GPU. Users running on resource-constrained environments may be surprised by the VRAM requirement. A comment noting the approximate memory footprint would be helpful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/safety_and_security/retail_agent/src/nat_retail_agent/configs/config-with-defenses.yml` around lines 42 - 48, add a short comment next to the pre_tool_guard_llm entry documenting that Qwen/Qwen3Guard-Gen-8B is an ~8B-parameter model with high VRAM requirements (expect ~20–40GB on GPU depending on precision and sharding), note that device: auto will load it on GPU if available, and suggest alternatives or mitigations (use guard_llm 0.6B, set device: cpu, or use quantized/8-bit variants) so users on constrained hardware are warned and have options.

examples/safety_and_security/retail_agent/README.md (1)
96-97: Defense flow description could be updated to reflect input-side verification. Line 97 currently reads: "The defense middleware inspects tool outputs, sanitizes or blocks unsafe content, and returns safe data to the agent." With the addition of the Pre-Tool Verifier, the defense middleware now also inspects inputs. Consider updating this sentence to cover both directions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/safety_and_security/retail_agent/README.md` around lines 96 - 97, update the sentence that currently reads "The defense middleware inspects tool outputs, sanitizes or blocks unsafe content, and returns safe data to the agent." to reflect two-way verification by mentioning both inputs and outputs (e.g., "inspects tool inputs and outputs" or "inspects inputs and outputs, including Pre-Tool Verifier checks"), so the README's defense middleware description and reference to the Pre-Tool Verifier accurately convey that the middleware sanitizes/blocks unsafe content in both directions.

packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py (2)
107-111: Missing return type hint on `_get_llm`. Per coding guidelines, all methods should have type hints on parameters and return values.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 107 - 111, Add a return type hint to the async method _get_llm by annotating it with an appropriate type (e.g., -> Any or the concrete LLM type used by _get_llm_for_defense) and ensure the corresponding import (from typing import Any or the specific LLM class) is present; update the signature of _get_llm and keep the body unchanged so it remains async def _get_llm(self) -> Any: and still calls await self._get_llm_for_defense(self.config.llm_name).
145-174: User content is interpolated directly into the LLM prompt without any escaping or delimiters. The analyzed content is inserted into the user prompt at line 174 as `f"Input to verify: {content_str}"`. An adversary who crafts input mimicking the expected JSON response format could trick the verifier into returning `"violation_detected": false`. Consider wrapping the user content in clear delimiters (e.g., XML tags or triple-backtick fencing) to make it harder for injection content to blend with prompt instructions.

Suggested improvement

```diff
- user_prompt_parts.append(f"Input to verify: {content_str}")
+ user_prompt_parts.append(f"Input to verify:\n<user_input>\n{content_str}\n</user_input>")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 145 - 174, The user input (content_str) is interpolated directly into the LLM user prompt via user_prompt_parts.append(f"Input to verify: {content_str}") which allows prompt-injection by mirroring the expected JSON form; fix by surrounding the user content with an unambiguous delimiter and/or explicit escaping before appending (e.g., wrap content_str in triple-backticks or XML-like tags and/or apply a sanitizer/escape function), update places that build system_prompt/user_prompt_parts (references: content_str, user_prompt_parts, system_prompt, function_name) to use the delimited/escaped value so the verifier always treats user data as opaque input rather than executable prompt text.
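To make the delimiter suggestion concrete, here is a minimal sketch; the helper name `build_user_prompt` and the tag-escaping rule are illustrative assumptions, not the PR's actual code:

```python
def build_user_prompt(content_str: str, function_name: str) -> str:
    """Wrap untrusted input in explicit delimiters so the guard LLM
    treats it as opaque data rather than as instructions."""
    # Neutralize any closing tag the attacker may have embedded, so the
    # content cannot break out of the <user_input> envelope.
    safe = content_str.replace("</user_input>", "<\\/user_input>")
    return (
        f"Tool being called: {function_name}\n"
        "Input to verify:\n"
        f"<user_input>\n{safe}\n</user_input>"
    )
```

The verifier's system prompt would then instruct the guard model to treat everything inside `<user_input>` as data to classify, never as directives.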
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/safety_and_security/retail_agent/README.md`:
- Around line 286-302: Update the README example so the pre-tool verifier uses
the correct guard LLM name: change the llm_name value in the
pre_tool_verifier_workflow block from "nim_llm" to "pre_tool_guard_llm" to match
the actual configuration; ensure any textual references in the example mention
pre_tool_guard_llm (not nim_llm) so users don't accidentally route sensitive
inputs to the remote NIM model.
In
`@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py`:
- Around line 338-344: The except blocks in apply_pre_tool_verification and
function_middleware_stream are logging with exc_info=True and then re-raising,
which causes duplicate tracebacks; remove the exc_info=True argument from the
logger.error(...) calls in those except handlers so the error is logged without
the traceback before the bare raise (do not replace with logger.exception since
we still re-raise). Locate the logger.error calls inside
apply_pre_tool_verification and the corresponding except block in
function_middleware_stream and delete the exc_info=True parameter, keeping the
same message and context.name usage.
- Around line 325-329: The current PreToolVerifierMiddleware sets value =
args[0] if args else None and then always calls call_next(value, *args[1:],
**kwargs), which passes a spurious positional None when args is empty; change
the call to conditionally pass the first positional only when args is non-empty
(e.g., if args: return await call_next(value, *args[1:], **kwargs) else: return
await call_next(*args, **kwargs)); apply the same fix to the
function_middleware_stream handler so neither path ever injects a None
positional argument; keep the checks using _should_apply_defense and the
existing logger.debug call intact.
- Around line 211-224: The middleware currently fails open on exceptions in
_analyze_content by returning
PreToolVerificationResult(violation_detected=False, should_refuse=False); add a
boolean config field fail_closed (e.g., on PreToolVerifierMiddlewareConfig,
default False) and update the exception handler in
defense_middleware_pre_tool_verifier.py to consult this flag: if
config.fail_closed is True return a result that refuses input
(violation_detected=True, should_refuse=True, error=True and an explanatory
reason) otherwise preserve the current fail-open behavior; ensure the logger
still records the exception and response length for diagnostics and that the new
behavior is covered by any existing callers of _analyze_content or middleware
initialization.
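A minimal sketch of that fail-open/fail-closed switch, using dataclass stand-ins for the PR's Pydantic models (the names are assumptions):

```python
from dataclasses import dataclass


@dataclass
class PreToolVerificationResult:
    violation_detected: bool
    should_refuse: bool
    reason: str = ""


def on_analysis_error(fail_closed: bool) -> PreToolVerificationResult:
    """What the verifier returns when the guard LLM call itself fails."""
    if fail_closed:
        # Block the input; keep the reason generic so internals don't leak.
        return PreToolVerificationResult(
            violation_detected=True,
            should_refuse=True,
            reason="Security verification unavailable",
        )
    # Fail open: preserve the original behavior and let the input through.
    return PreToolVerificationResult(violation_detected=False, should_refuse=False)
```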
- Fix README example to use pre_tool_guard_llm instead of nim_llm
- Add fail_closed config field for configurable fail-open/fail-closed behavior
- Fix spurious None positional arg when args is empty in call_next
- Remove exc_info=True from logger.error before re-raise to avoid duplicate tracebacks
- Add return type hint on _get_llm, wrap user input in XML delimiters
- Update data_models docstring, README defense flow description, VRAM comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: lidan-capsule <lidan@capsule.security>
Force-pushed from f33b017 to 96f5ea5.
@ericevans-nv, happy to get your review on this one. I love your work on the defense module!
Actionable comments posted: 1
🧹 Nitpick comments (2)
packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py (2)
63-63: Add validation bounds for `threshold`. The description states 0.0-1.0, but no Pydantic validation enforces it. A misconfigured threshold (e.g., negative or >1.0) would silently cause unexpected refuse/allow behavior.
Suggested fix
```diff
- threshold: float = Field(default=0.7, description="Confidence threshold for violation detection (0.0-1.0)")
+ threshold: float = Field(default=0.7, ge=0.0, le=1.0, description="Confidence threshold for violation detection (0.0-1.0)")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` at line 63, The threshold Field currently documents 0.0–1.0 but lacks enforcement; update the Pydantic field definition for threshold to enforce bounds (e.g., use Field(default=0.7, ge=0.0, le=1.0, description="...")) or replace the annotation with pydantic.confloat(ge=0.0, le=1.0) to ensure values are validated; reference the threshold symbol in the model (the Field call that sets default=0.7) and add the ge/le constraints so out-of-range values raise validation errors.
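The `ge`/`le` constraint behaves as follows, sketched with a stripped-down standalone model (the real config class has more fields; assumes Pydantic is installed):

```python
from pydantic import BaseModel, Field, ValidationError


class VerifierConfig(BaseModel):
    threshold: float = Field(
        default=0.7, ge=0.0, le=1.0,
        description="Confidence threshold for violation detection (0.0-1.0)")


VerifierConfig(threshold=0.9)  # accepted
try:
    VerifierConfig(threshold=1.5)  # rejected: violates le=1.0
except ValidationError:
    print("out-of-range threshold rejected at construction time")
```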
216-217: Redundant exception object in `logger.exception` call. `logger.exception()` already captures and logs the exception info from `sys.exc_info()`. Passing `e` as a format argument duplicates the exception representation in the log output.

Suggested fix
```diff
- logger.exception("Pre-Tool Verifier analysis failed: %s", e)
+ logger.exception("Pre-Tool Verifier analysis failed")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 216 - 217, The except block is passing the caught exception object `e` into `logger.exception`, which is redundant because `logger.exception` already logs exception info; update the except handler in defense_middleware_pre_tool_verifier.py (the `except Exception as e` block) to call `logger.exception("Pre-Tool Verifier analysis failed")` without the `e` format argument so the exception is logged once via the logger's built-in exception handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py`:
- Around line 222-238: The current exception message `e` is being interpolated
into the PreToolVerificationResult.reason when `fail_closed` is true, which can
leak internal details; change the `PreToolVerificationResult` construction in
the `if self.config.fail_closed` branch to use a generic reason string like
"Security verification unavailable" (or "Input blocked: security verification
unavailable") without including `({e})`, and rely on the existing
`logger.exception` call (referenced near the catch where `e` is defined) to
record the actual error; ensure only the `fail_closed` path is sanitized so
`_handle_threat` and its raise (which uses `analysis_result.reason`) won't
expose internal exception details.
- Add ge/le validation bounds on threshold field (0.0-1.0)
- Remove redundant exception object from logger.exception call
- Use generic reason strings in error results to avoid leaking internal details

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: lidan-capsule <lidan@capsule.security>
🧹 Nitpick comments (5)
packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py (5)
216-217: Remove unused exception variable `e`. Ruff F841 flags this correctly: `e` is bound but never referenced. Since `logger.exception()` already captures the exception info, the binding is unnecessary.

Proposed fix
```diff
-        except Exception as e:
+        except Exception:
             logger.exception("Pre-Tool Verifier analysis failed")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 216 - 217, Remove the unused exception binding in the except block that wraps the logger.exception call: change "except Exception as e:" to "except Exception:" so the exception is not bound to the unused variable; keep the existing logger.exception("Pre-Tool Verifier analysis failed") call intact (the logger will still capture the exception info).
260-264: Consider that `analysis_result.reason` may contain attacker-influenced content. The LLM's `reason` field could echo fragments of the malicious input. While it's used here only in a `ValueError` message (low direct risk), this string propagates to callers and may end up in logs, HTTP responses, or UI. If the caller surfaces error messages to end users, this becomes a secondary injection vector.

A static refusal message (e.g., "Input blocked by security policy") would be safer; the detailed reason is already logged on line 254.

Proposed fix
```diff
  if action == "refusal":
      logger.error("Pre-Tool Verifier refusing input to %s: %s", context.name, analysis_result.reason)
-     raise ValueError(f"Input blocked by security policy: {analysis_result.reason}")
+     raise ValueError("Input blocked by security policy")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 260 - 264, The ValueError raised in the pre-tool verifier includes attacker-influenced analysis_result.reason; change the code in the pre-tool verifier (where action == "refusal" in DefenseMiddlewarePreToolVerifier / the refusal branch) to raise a static, non-user-controlled message like "Input blocked by security policy" instead of interpolating analysis_result.reason, while keeping the existing logger.error call that records the detailed reason for internal diagnostics; do not remove the internal logging of analysis_result.reason, only stop propagating it in the raised error or any value returned to callers.
387-399: Same unnecessary LLM call when `args` is empty; apply consistent fix. Same issue as in `function_middleware_invoke`: when `args` is empty, the verification runs against `None` but the result is discarded. Apply the same guard here for consistency.

Proposed fix
```diff
  value = args[0] if args else None
  try:
-     # Verify input BEFORE calling the tool
-     verified_value = await self._process_input_verification(value, context)
-
-     # Stream the actual function with the (potentially sanitized) input
-     if args:
+     if args:
+         # Verify input BEFORE calling the tool
+         verified_value = await self._process_input_verification(value, context)
+         # Stream the actual function with the (potentially sanitized) input
          async for chunk in call_next(verified_value, *args[1:], **kwargs):
              yield chunk
      else:
          async for chunk in call_next(**kwargs):
              yield chunk
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 387 - 399, The code currently always calls self._process_input_verification(value, context) even when args is empty (verifying None and discarding results); change the logic to mirror function_middleware_invoke by only invoking self._process_input_verification when args is non-empty and using its returned verified_value as the first positional argument to call_next; if args is empty, skip the verification call entirely and call call_next(**kwargs) to stream results unchanged. Refer to the symbols _process_input_verification, call_next, and function_middleware_invoke to locate and apply the guard consistently.
127-136: JSON extraction is adequate for the expected schema but fragile for edge cases. The ````split("```")```` approach on line 130 will misbehave if the LLM wraps its response in a non-JSON fenced block (e.g., a block tagged `text`), since `[1]` grabs the language tag along with the content. Consider stripping the first line after the opening fence. The regex on line 132 handles only one level of brace nesting, which is fine for the current response schema but worth noting if the schema evolves.
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 127 - 136, The current extraction logic that uses response_text.split("```json") / split("```") can capture the language tag line as content; update the fenced-block handling in the function that processes response_text to, after splitting on the opening fence, split into lines and drop the first line if it looks like a language tag (e.g., matches r'^\s*[a-zA-Z0-9_+-]+\s*$' or if it starts with a non-{ character), then rejoin the remaining lines before running the existing json_match regex; keep the json_match usage (and response_text fallback) but ensure you trim whitespace and handle empty content after stripping the tag so the function still falls back to the original response_text when no JSON is found.
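One way to implement the language-tag stripping described above (a hypothetical helper, not the PR's code):

```python
import re


def extract_fenced_block(response_text: str) -> str:
    """Return the contents of the first ``` fenced block, dropping a
    leading language tag line such as 'json' or 'text' if present."""
    if "```" not in response_text:
        return response_text.strip()
    inner = response_text.split("```", 2)[1]
    lines = inner.splitlines()
    # Drop the first line when it is a bare language tag rather than content.
    if lines and re.fullmatch(r"\s*[a-zA-Z0-9_+-]+\s*", lines[0]):
        lines = lines[1:]
    return "\n".join(lines).strip()
```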
343-352: Unnecessary LLM call when `args` is empty. When `args` is empty, `value` is `None`, so `_process_input_verification` sends the string `"None"` to the LLM for analysis, but the result is discarded on line 352 since `call_next(**kwargs)` is called without the verified value. Consider skipping verification when there's no positional input to analyze.

Proposed fix
```diff
  value = args[0] if args else None
  try:
-     # Verify input BEFORE calling the tool
-     verified_value = await self._process_input_verification(value, context)
-
-     # Call the actual function with the (potentially sanitized) input
-     if args:
+     if args:
+         # Verify input BEFORE calling the tool
+         verified_value = await self._process_input_verification(value, context)
+         # Call the actual function with the (potentially sanitized) input
          return await call_next(verified_value, *args[1:], **kwargs)
-     return await call_next(**kwargs)
+     else:
+         return await call_next(**kwargs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 343 - 352, The code currently calls self._process_input_verification(value, context) even when args is empty (value is None) which triggers an unnecessary LLM call; change the logic in the wrapper handling function so that you only call await self._process_input_verification(...) when args is non-empty, and otherwise skip verification and directly return await call_next(**kwargs). Locate the block using variables value, args, _process_input_verification, call_next and implement the conditional: if args: set value=args[0], run verification to get verified_value and call_next(verified_value, *args[1:], **kwargs); else simply return await call_next(**kwargs) without invoking _process_input_verification.
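The conditional described above can be sketched with simplified async stand-ins (the names mirror the review comment, but the bodies are placeholders, not the PR's implementation):

```python
import asyncio


async def _process_input_verification(value):
    """Stand-in for the LLM-backed verifier; here it just passes through."""
    return value


async def middleware_invoke(call_next, *args, **kwargs):
    """Only verify when there is a positional input to analyze."""
    if args:
        verified = await _process_input_verification(args[0])
        return await call_next(verified, *args[1:], **kwargs)
    # No positional input: skip verification, avoiding an LLM call on None.
    return await call_next(**kwargs)


async def tool(*args, **kwargs):
    return ("args", args, "kwargs", kwargs)
```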
- Remove unused exception binding (except Exception as e -> except Exception)
- Use static message in ValueError to avoid leaking attacker-influenced content
- Skip verification when args is empty to avoid unnecessary LLM call on None
- Improve fenced code block extraction to handle language tag lines

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: lidan-capsule <lidan@capsule.security>
Actionable comments posted: 1
🧹 Nitpick comments (3)
packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py (3)
139-143: JSON extraction regex only handles one level of brace nesting. The regex `r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}'` will fail if the LLM returns JSON with more than one level of nested braces. While the current expected schema is flat, LLM responses are unpredictable; a model might wrap the result in an extra object or add nested metadata.

A more robust fallback approach:
Suggested alternative
```diff
- json_match = re.search(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', response_text, re.DOTALL)
- if json_match:
-     return json_match.group(0)
-
- return response_text
+ # Try to find balanced JSON by locating the first '{' and parsing from there
+ start = response_text.find('{')
+ if start != -1:
+     for end in range(len(response_text), start, -1):
+         try:
+             json.loads(response_text[start:end])
+             return response_text[start:end]
+         except json.JSONDecodeError:
+             continue
+
+ return response_text
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 139 - 143, The current JSON extraction uses a regex that only handles one level of brace nesting and can fail on deeper nested objects; in the function defense_middleware_pre_tool_verifier (where json_match is computed from response_text) replace the regex fallback with a robust brace-matching extractor: locate the first '{' in response_text and iterate characters maintaining a nesting counter to find the matching closing '}', extract that substring, and then attempt json.loads on it (fall back to returning response_text if parsing fails); ensure you reference the existing json_match logic and replace it with this iterative balanced-brace approach to reliably handle arbitrary nesting.
157-157: No size limit on content sent to the verifier LLM.

`str(content)` could produce an arbitrarily large string (e.g., a large document or base64 blob), leading to LLM token-limit errors or unexpectedly high costs. Consider truncating to a reasonable maximum before sending to the verifier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` at line 157, The code converts unbounded input to a string with content_str = str(content), risking huge payloads to the verifier LLM; replace this with a safe-truncation step (e.g., define MAX_CONTENT_LENGTH constant) and produce a truncated_content_str from content that enforces the limit (preferably keeping head+tail or truncating in the middle), include a clear truncation marker so the verifier knows content was shortened, and use that truncated string wherever content_str is currently used (reference variable names content, content_str and the verifier call in defense_middleware_pre_tool_verifier).
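A minimal sketch of the head+tail truncation suggested here; `MAX_CONTENT_LENGTH`, `truncate_for_verifier`, and the marker text are hypothetical names, not part of the actual middleware:

```python
MAX_CONTENT_LENGTH = 32_000  # illustrative cap on characters sent to the guard LLM

def truncate_for_verifier(content: object, limit: int = MAX_CONTENT_LENGTH) -> str:
    """Stringify content, keeping head and tail around a truncation marker."""
    text = str(content)
    if len(text) <= limit:
        return text
    marker = "\n...[content truncated for verification]...\n"
    keep = limit - len(marker)  # budget left for original characters
    tail = keep // 2
    head = keep - tail
    # The marker tells the verifier that content was shortened in the middle.
    return text[:head] + marker + text[-tail:]
```

Keeping both the head and the tail matters here: injection payloads are often appended to the end of otherwise benign content, so tail-only or head-only truncation could hide exactly the part the verifier needs to see.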
95-110: Missing type hint on `builder` parameter.

The coding guidelines require type hints on all public API parameters. The `builder` parameter lacks a type annotation.

Also, the check at line 106 is redundant — `target_location` is typed as `Literal["input"]`, so Pydantic validation will reject any other value before `__init__` runs.

Suggested fix
```diff
- def __init__(self, config: PreToolVerifierMiddlewareConfig, builder):
+ def __init__(self, config: PreToolVerifierMiddlewareConfig, builder: Any):
```

If a more specific type is available for `builder`, prefer that over `Any`.

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 95 - 110, The __init__ method of PreToolVerifierMiddleware is missing a type annotation for the builder parameter and contains a redundant runtime check for config.target_location; update the signature of __init__ to add a precise type hint for builder (prefer the concrete Builder class if available, otherwise use typing.Any or a Protocol) and remove the redundant if block that raises ValueError since PreToolVerifierMiddlewareConfig declares target_location as Literal["input"] and Pydantic will enforce it; adjust imports if you add Any/Protocol and ensure the attribute assignment self.config: PreToolVerifierMiddlewareConfig remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py`:
- Around line 269-287: In the action == "redirection" branch, preserve the
original content type instead of always returning a string: if
isinstance(content, str) return analysis_result.sanitized_input as today;
otherwise attempt to deserialize analysis_result.sanitized_input back into the
original type (e.g., if original content is dict use
json.loads(sanitized_input); if it is a Pydantic model class use Model.parse_raw
or parse_obj on the parsed JSON), and return the reconstructed object; if
deserialization fails, log a warning including context.name and
analysis_result.reason and fall back to a safe behavior (e.g., return the
original content or raise a clear ValueError) so downstream validation doesn’t
break. Use the existing symbols analysis_result.sanitized_input, context.name
and content to locate and implement the logic.
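The type-preserving restore described in this comment could look roughly like the following; `restore_type` is a hypothetical helper (for Pydantic-model inputs one would additionally validate the parsed JSON against the model class, e.g. via `model_validate`):

```python
import json
from typing import Any

def restore_type(sanitized: str, original: Any) -> Any:
    """Coerce the verifier's sanitized string back to the original input's type.

    Strings pass through unchanged; for structured inputs the sanitized
    text is parsed as JSON, and on any mismatch or parse failure we fall
    back to the untouched original so downstream validation does not break.
    """
    if isinstance(original, str):
        return sanitized
    try:
        parsed = json.loads(sanitized)
    except json.JSONDecodeError:
        # Deserialization failed: safer to keep the original content
        # (a real implementation would also log a warning here).
        return original
    if isinstance(parsed, type(original)):
        return parsed
    return original
```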
---
Nitpick comments:
In
`@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py`:
- Around line 139-143: The current JSON extraction uses a regex that only
handles one level of brace nesting and can fail on deeper nested objects; in the
function defense_middleware_pre_tool_verifier (where json_match is computed from
response_text) replace the regex fallback with a robust brace-matching
extractor: locate the first '{' in response_text and iterate characters
maintaining a nesting counter to find the matching closing '}', extract that
substring, and then attempt json.loads on it (fall back to returning
response_text if parsing fails); ensure you reference the existing json_match
logic and replace it with this iterative balanced-brace approach to reliably
handle arbitrary nesting.
- Line 157: The code converts unbounded input to a string with content_str =
str(content), risking huge payloads to the verifier LLM; replace this with a
safe-truncation step (e.g., define MAX_CONTENT_LENGTH constant) and produce a
truncated_content_str from content that enforces the limit (preferably keeping
head+tail or truncating in the middle), include a clear truncation marker so the
verifier knows content was shortened, and use that truncated string wherever
content_str is currently used (reference variable names content, content_str and
the verifier call in defense_middleware_pre_tool_verifier).
- Around line 95-110: The __init__ method of PreToolVerifierMiddleware is
missing a type annotation for the builder parameter and contains a redundant
runtime check for config.target_location; update the signature of __init__ to
add a precise type hint for builder (prefer the concrete Builder class if
available, otherwise use typing.Any or a Protocol) and remove the redundant if
block that raises ValueError since PreToolVerifierMiddlewareConfig declares
target_location as Literal["input"] and Pydantic will enforce it; adjust imports
if you add Any/Protocol and ensure the attribute assignment self.config:
PreToolVerifierMiddlewareConfig remains.
- Replace single-level nesting regex with robust balanced-brace JSON extraction
- Add content size limit (32K) with truncation before sending to verifier LLM
- Remove redundant target_location runtime check (enforced by Pydantic Literal)
- Attempt to deserialize sanitized input back to original type in redirection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: lidan-capsule <lidan@capsule.security>
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py (1)
134-144: Quadratic worst-case in JSON extraction.

The reverse scan (`for end in range(len(response_text), start, -1)`) calls `json.loads` on every suffix, making this O(n²) in the worst case. Since inputs are capped at 32K and LLM responses are typically short, this is unlikely to be a problem in practice—but worth noting for future reference.

A brace-counting approach would be O(n), though the added complexity may not be warranted today.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py` around lines 134 - 144, The current extraction loop that tries every end index (the for end in range(...): json.loads(response_text[start:end])) is quadratic; replace it with a linear brace-counting scan: locate the first '{' in response_text (start), iterate forward maintaining a brace counter (increment on '{', decrement on '}'), stop when counter returns to zero and extract response_text[start:pos+1]; validate the slice with json.loads once and fall back to returning full response_text on failure — update the code around the json.loads call and the loop in defense_middleware_pre_tool_verifier.py accordingly.
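A sketch of the linear brace-counting scan this prompt describes; `extract_first_json_object` is an illustrative name. Braces inside JSON string literals can throw off the count, which the final `json.loads` validation catches by falling back to the full text:

```python
import json

def extract_first_json_object(response_text: str) -> str:
    """Extract the first balanced {...} block in O(n) via brace counting.

    Returns the full input unchanged when no valid JSON object is found.
    """
    start = response_text.find("{")
    if start == -1:
        return response_text
    depth = 0
    for pos in range(start, len(response_text)):
        ch = response_text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                candidate = response_text[start:pos + 1]
                try:
                    json.loads(candidate)  # validate once on the balanced slice
                    return candidate
                except json.JSONDecodeError:
                    break
    return response_text
```

Python's `json.JSONDecoder.raw_decode` offers a similar capability (parse one value and report where it ends), which would avoid hand-rolled brace counting entirely.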
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py`:
- Around line 362-375: The broad except in defense_middleware_pre_tool_verifier
around the call to _process_input_verification/call_next is catching intentional
refusal ValueError from _handle_threat and logging it as an internal failure;
change the except to inspect the exception and re-raise ValueError unlogged
(e.g. catch Exception as e: if isinstance(e, ValueError): raise;
logger.error(..., e); raise) so only real errors are logged, and apply the same
change to function_middleware_stream to keep behavior consistent.
---
Duplicate comments:
In
`@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py`:
- Around line 397-418: The broad "except Exception" in the pre-tool verification
block is catching intentional refusal ValueErrors and mis-logging them; update
the exception handling so ValueError raised by _process_input_verification (or
other refusal flows) is re-raised without logging, and only other exceptions are
logged and re-raised. Concretely, in the method that calls
_should_apply_defense, _process_input_verification and call_next (the shown
middleware invocation), replace the single "except Exception" with two handlers:
"except ValueError: raise" and "except Exception as e: logger.error(...,
context.name); raise" so refusal errors bypass the error log while unexpected
errors are still recorded.
---
Nitpick comments:
In
`@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py`:
- Around line 134-144: The current extraction loop that tries every end index
(the for end in range(...): json.loads(response_text[start:end])) is quadratic;
replace it with a linear brace-counting scan: locate the first '{' in
response_text (start), iterate forward maintaining a brace counter (increment on
'{', decrement on '}'), stop when counter returns to zero and extract
response_text[start:pos+1]; validate the slice with json.loads once and fall
back to returning full response_text on failure — update the code around the
json.loads call and the loop in defense_middleware_pre_tool_verifier.py
accordingly.
```python
try:
    if args:
        # Verify input BEFORE calling the tool
        verified_value = await self._process_input_verification(args[0], context)
        return await call_next(verified_value, *args[1:], **kwargs)
    else:
        return await call_next(**kwargs)

except Exception:
    logger.error(
        "Failed to apply pre-tool verification to function %s",
        context.name,
    )
    raise
```
Broad except catches intentional refusal ValueError, logging a misleading error.

When `action="refusal"` and a violation is detected, `_handle_threat` (line 276) intentionally raises `ValueError`. This is caught by the `except Exception` at line 370, which logs "Failed to apply pre-tool verification" — making a successful refusal look like an internal error.

Either narrow the catch or exclude the intentional refusal:
Suggested fix
```diff
  try:
      if args:
          # Verify input BEFORE calling the tool
          verified_value = await self._process_input_verification(args[0], context)
          return await call_next(verified_value, *args[1:], **kwargs)
      else:
          return await call_next(**kwargs)
+ except ValueError:
+     raise
  except Exception:
      logger.error(
          "Failed to apply pre-tool verification to function %s",
          context.name,
      )
      raise
```

Apply the same pattern to `function_middleware_stream`.
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 371-374: Use `logging.exception` instead of `logging.error` (TRY400)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_core/src/nat/middleware/defense/defense_middleware_pre_tool_verifier.py`
around lines 362 - 375, The broad except in defense_middleware_pre_tool_verifier
around the call to _process_input_verification/call_next is catching intentional
refusal ValueError from _handle_threat and logging it as an internal failure;
change the except to inspect the exception and re-raise ValueError unlogged
(e.g. catch Exception as e: if isinstance(e, ValueError): raise;
logger.error(..., e); raise) so only real errors are logged, and apply the same
change to function_middleware_stream to keep behavior consistent.
Description
Adds a new `PreToolVerifierMiddleware` defense that uses an LLM to analyze function inputs before a tool is called. Unlike the existing defense middlewares (`pii_defense`, `content_safety_guard`, `output_verifier`), which analyze outputs after tool execution, this middleware intercepts inputs to detect:

- Prompt injection
- Jailbreak attempts
- Instruction override attacks

Changes
New files:

- `defense_middleware_pre_tool_verifier.py` — `PreToolVerifierMiddleware` and `PreToolVerifierMiddlewareConfig` (config name: `pre_tool_verifier`)

Modified files:

- `defense_middleware.py` — Expanded `target_location` from `Literal["output"]` to `Literal["input", "output"]` to enable input-analyzing defenses
- `defense_middleware_data_models.py` — Added `PreToolVerificationResult` data model
- `defense/register.py` — Registered the new middleware
- `config-with-defenses.yml` — Added `pre_tool_verifier_workflow` config targeting workflow input; added to workflow middleware chain
- `README.md` — Added config example, updated defense table and scenario descriptions

Configuration
By Submitting this PR I confirm: