Description:
In agentless/util/preprocess_data.py, the function filter_none_python relies on a heuristic to distinguish between a directory and a processed file node. Specifically, it assumes that if a dictionary contains exactly three keys (functions, classes, and text), it must be a file node.
However, a directory can have exactly the same shape (e.g., a directory containing exactly three sub-directories named functions, classes, and text). In that case the heuristic misclassifies the directory as a file node.
Root Cause:
The current logic uses an "exclusion" approach:
    if (
        not "functions" in value.keys()
        and not "classes" in value.keys()
        and not "text" in value.keys()
    ) or not len(value.keys()) == 3:
        # Treat as directory -> Recurse
If a directory node does contain these keys and has exactly 3 children, it falls into the else block:
    else:
        if not key.endswith(".py"):
            del structure[key]  # Potential data loss!
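The ambiguity is easier to see with the condition isolated. The following is a minimal sketch, not code from the repository: looks_like_file_node is a hypothetical helper that negates the quoted if condition, and the list-valued shape of file_node is an assumption about what a processed file looks like. Following the boolean logic as quoted, a three-key directory is also misclassified when only one of the three names is present.

```python
def looks_like_file_node(value: dict) -> bool:
    """Hypothetical helper: True when the quoted condition takes the else branch."""
    is_directory = (
        "functions" not in value
        and "classes" not in value
        and "text" not in value
    ) or len(value) != 3
    return not is_directory


# Assumed shape of a processed file node (lists, not nested dicts).
file_node = {"functions": [], "classes": [], "text": ["import os"]}
# A directory whose three children happen to share those names.
dir_node = {"functions": {}, "classes": {}, "text": {}}
# A directory with three children, only one of which is named "text".
partial_dir_node = {"text": {}, "a.py": {}, "b.py": {}}

print(looks_like_file_node(file_node))         # True
print(looks_like_file_node(dir_node))          # True
print(looks_like_file_node(partial_dir_node))  # True
```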
Steps to Reproduce:
Suppose we have a project structure like this:
    structure = {
        "utils": {              # This is a directory
            "functions": {},    # Sub-directory
            "classes": {},      # Sub-directory
            "text": {}          # Sub-directory
        }
    }
Run filter_none_python(structure).
When processing key="utils", the code sees len(value.keys()) == 3 and all three keys exist.
It skips the if block (no recursion into utils).
It enters the else block, sees "utils" does not end with .py, and executes del structure["utils"].
Result: The entire utils/ directory and its contents are deleted, even if they contained relevant .py files deeper down.
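The steps above can be reproduced with a self-contained script. This is a reconstruction based on the two excerpts quoted in this issue; the body of the if branch (the recursive call and the pruning of empty directories) is an assumption, since the issue only describes it as "Recurse". The file helpers.py and the list-valued file node are hypothetical, added to show that a Python file nested under the misclassified directory is lost.

```python
def filter_none_python(structure: dict) -> None:
    """Reconstruction of the function as described in this issue."""
    for key, value in list(structure.items()):
        if (
            not "functions" in value.keys()
            and not "classes" in value.keys()
            and not "text" in value.keys()
        ) or not len(value.keys()) == 3:
            # Treat as directory -> recurse (assumed body, not quoted in the issue).
            filter_none_python(value)
            if structure[key] == {}:
                del structure[key]
        else:
            if not key.endswith(".py"):
                del structure[key]  # Potential data loss!


structure = {
    "utils": {
        # Hypothetical Python file nested under the "functions" sub-directory.
        "functions": {"helpers.py": {"functions": [], "classes": [], "text": []}},
        "classes": {},
        "text": {},
    }
}

filter_none_python(structure)
print(structure)  # {}
```

The "utils" entry is removed without ever being descended into, so helpers.py disappears with it.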
Impact:
Data Loss: Valid directories can be deleted if their structure matches the heuristic.
Incomplete Analysis: The recursion stops prematurely, meaning Python files inside such directories are never processed.
Suggested Fix:
Instead of relying on key names and counts to guess if a node is a file, we should explicitly check if the node represents a file or use a more robust marker. A safer approach is to always attempt recursion if the value is a dictionary, or strictly verify the leaf node structure.
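One possible shape for the "strictly verify the leaf node" option is sketched below. It is not a patch against the repository: is_file_node and filter_none_python_safe are hypothetical names, and the check rests on the assumption that a file node's three values are never dictionaries while every child of a directory is one. If that assumption does not hold for the real data, an explicit marker set when the structure is built would be the safer choice.

```python
FILE_KEYS = {"functions", "classes", "text"}


def is_file_node(value: dict) -> bool:
    """Assumption: file nodes have exactly FILE_KEYS and no dict-valued entries."""
    return set(value) == FILE_KEYS and not any(
        isinstance(child, dict) for child in value.values()
    )


def filter_none_python_safe(structure: dict) -> None:
    """Drop non-Python files and empty directories, recursing into every directory."""
    for key, value in list(structure.items()):
        if is_file_node(value):
            if not key.endswith(".py"):
                del structure[key]
        else:
            # Anything that is not a verified file node is treated as a directory.
            filter_none_python_safe(value)
            if not value:
                del structure[key]


structure = {
    "utils": {
        "functions": {"helpers.py": {"functions": [], "classes": [], "text": []}},
        "classes": {},
        "text": {},
    }
}

filter_none_python_safe(structure)
print(structure)
# {'utils': {'functions': {'helpers.py': {'functions': [], 'classes': [], 'text': []}}}}
```

With this check the utils directory is recursed into, its empty sub-directories are pruned, and the nested Python file is kept.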