[Bug] filter_none_python logic can misidentify directories as file nodes and skip recursion #85

@annoymous3

Description:
In agentless/util/preprocess_data.py, the function filter_none_python relies on a heuristic to distinguish between a directory and a processed file node. Specifically, it assumes that if a dictionary contains exactly three keys (functions, classes, and text), it must be a file node.

However, if a directory happens to mirror this shape (e.g., a directory containing exactly three sub-directories named functions, classes, and text), the heuristic misclassifies it as a file node.

Root Cause:
The current logic uses an "exclusion" approach:

    if (
        not "functions" in value.keys()
        and not "classes" in value.keys()
        and not "text" in value.keys()
    ) or not len(value.keys()) == 3:
        # Treat as directory -> Recurse

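If a directory node contains these keys and has exactly three children, it falls into the else block. As quoted, the condition reaches the else block whenever a node has exactly three keys and at least one of them is functions, classes, or text, so a single matching name is enough:

    else:
        if not key.endswith(".py"):
            del structure[key]  # Potential data loss!

Steps to Reproduce: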
Suppose we have a project structure like this:

    structure = {
        "utils": {              # This is a directory
            "functions": {},    # Sub-directory
            "classes": {},      # Sub-directory
            "text": {}          # Sub-directory
        }
    }

Run filter_none_python(structure).
When processing key="utils", the code sees len(value.keys()) == 3 and all three keys exist.
It skips the if block (no recursion into utils).
It enters the else block, sees "utils" does not end with .py, and executes del structure["utils"].
Result: The entire utils/ directory and its contents are deleted, even if they contained relevant .py files deeper down.
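
The steps above can be checked with a small standalone script. This is only a sketch: `filter_none_python_as_quoted` is a hypothetical reconstruction assembled from the two snippets quoted in this issue, not a copy of the function in agentless/util/preprocess_data.py, and the shape of `file_node` (three list-valued fields) is an assumption made for illustration.

```python
def filter_none_python_as_quoted(structure):
    """Hypothetical reconstruction built only from the snippets quoted above."""
    for key, value in list(structure.items()):
        if (
            "functions" not in value.keys()
            and "classes" not in value.keys()
            and "text" not in value.keys()
        ) or not len(value.keys()) == 3:
            # Treated as a directory: recurse into it.
            filter_none_python_as_quoted(value)
        else:
            # Treated as a file node: drop it unless it is a .py file.
            if not key.endswith(".py"):
                del structure[key]


# Assumed shape of a processed file node (illustrative only).
file_node = {"functions": [], "classes": [], "text": []}

structure = {
    "utils": {  # a directory whose three children share the file-node key names
        "functions": {"helpers.py": dict(file_node)},
        "classes": {},
        "text": {},
    }
}

filter_none_python_as_quoted(structure)
print(structure)  # {} : "utils" was removed together with helpers.py
```

Under these assumptions, the Python file at utils/functions/helpers.py is lost because the recursion never reaches it.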

Impact:

Data Loss: Valid directories can be deleted if their structure matches the heuristic.
Incomplete Analysis: The recursion stops prematurely, meaning Python files inside such directories are never processed.

Suggested Fix:

Instead of relying on key names and counts to guess if a node is a file, we should explicitly check if the node represents a file or use a more robust marker. A safer approach is to always attempt recursion if the value is a dictionary, or strictly verify the leaf node structure.
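
One possible shape for the "strictly verify the leaf node structure" option is sketched below. It is not a tested patch: the names `is_file_node` and `filter_non_python` are hypothetical, and the check assumes that in a real file node the functions, classes, and text fields hold lists or strings, while every child of a directory is a dict. If that assumption does not hold for the actual data, an explicit marker on file nodes would be needed instead.

```python
FILE_KEYS = {"functions", "classes", "text"}


def is_file_node(value):
    # Assumption: file-node fields are never dicts, while directory
    # children always are. That is what separates the two cases here.
    return (
        isinstance(value, dict)
        and set(value.keys()) == FILE_KEYS
        and not any(isinstance(v, dict) for v in value.values())
    )


def filter_non_python(structure):
    """Remove non-.py file nodes in place, recursing into every directory."""
    for key, value in list(structure.items()):
        if is_file_node(value):
            if not key.endswith(".py"):
                del structure[key]
        elif isinstance(value, dict):
            filter_non_python(value)
            # Design choice in this sketch: prune directories left empty.
            if not value:
                del structure[key]


file_node = {"functions": [], "classes": [], "text": []}
fixed = {
    "utils": {
        "functions": {
            "helpers.py": dict(file_node),
            "notes.md": dict(file_node),
        },
        "classes": {},
        "text": {},
    },
    "README.md": dict(file_node),
}
filter_non_python(fixed)
print(fixed)
```

With this check, the utils directory from the reproduction is recursed into rather than deleted, so helpers.py survives while notes.md and README.md are removed.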
