fix(realtime): handle non-code files and filter spurious events#405

Open
bhargavchippada wants to merge 5 commits intovitali87:mainfrom
bhargavchippada:fix/realtime-updater-event-handling

Conversation

@bhargavchippada

Summary

This PR fixes three issues in the real-time file watcher (realtime_updater.py):

  1. Filter spurious file system events: Only process MODIFIED, CREATED, and deleted events

    • Previously, read-only events like opened and closed_no_write (triggered by IDEs accessing files) would cause files to be deleted from the graph but not recreated
    • This happened because Step 1 (deletion) ran for all events, but Step 3 (recreation) only ran for MODIFIED/CREATED events
  2. Delete File nodes for non-code files: Added query to delete File nodes

    • The existing CYPHER_DELETE_MODULE query only deletes Module nodes (for code files)
    • Non-code files like .md, .json, etc. were never removed from the graph when deleted from the filesystem
  3. Create File nodes for ALL file types: Added process_generic_file() call for all files

    • Previously, only code files with recognized language configs were indexed in real-time
    • Non-code files were only indexed during the initial full scan, not during real-time updates
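
The three fixes above can be sketched as a reduced model of the handler. This is only an illustration under stated assumptions: `FakeGraph`, `handle_event`, and the `is_code` flag are hypothetical names, not the project's API, and the graph database is replaced by two in-memory sets.

```python
from dataclasses import dataclass, field

# Event types taken from the PR description; the set names are hypothetical.
RELEVANT_EVENTS = {"modified", "created", "deleted"}
REINDEX_EVENTS = {"modified", "created"}


@dataclass
class FakeGraph:
    """Hypothetical in-memory stand-in for the graph database."""

    modules: set[str] = field(default_factory=set)
    files: set[str] = field(default_factory=set)


def handle_event(graph: FakeGraph, event_type: str, path: str, is_code: bool) -> bool:
    """Return True if the event was processed, False if it was filtered out."""
    # Fix 1: ignore read-only events such as "opened" or "closed_no_write".
    if event_type not in RELEVANT_EVENTS:
        return False
    # Step 1: delete existing nodes. Fix 2 adds the File-node deletion.
    graph.modules.discard(path)
    graph.files.discard(path)
    # Step 3: recreate nodes only for events where the file still exists.
    if event_type in REINDEX_EVENTS:
        if is_code:
            graph.modules.add(path)
        # Fix 3: every file type gets a File node, not just code files.
        graph.files.add(path)
    return True
```

With the filter in place, an `opened` event returns before Step 1 runs, so nothing is deleted; without it, the deletion would run and Step 3 would never restore the node.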

Test plan

  • Created a .md file → verified it appears in the graph
  • Deleted the .md file → verified it's removed from the graph
  • Opened a file in IDE without modifying → verified no false deletions occur
  • Modified a Python file → verified code structure is updated correctly

🤖 Generated with Claude Code

This commit fixes three issues in the real-time file watcher:

1. Filter spurious file system events: Only process MODIFIED, CREATED,
   and deleted events. Previously, read-only events like "opened" and
   "closed_no_write" (triggered by IDEs accessing files) would cause
   files to be deleted from the graph but not recreated, since Step 3
   only runs for modification events.

2. Delete File nodes for non-code files: The existing CYPHER_DELETE_MODULE
   query only deletes Module nodes (for code files). Added a separate
   query to delete File nodes, ensuring non-code files like .md, .json,
   etc. are properly removed when deleted from the filesystem.

3. Create File nodes for ALL file types: Added process_generic_file()
   call for all files during MODIFIED/CREATED events, not just code
   files with recognized language configs. This ensures non-code files
   are indexed in real-time.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the real-time file watcher's accuracy and completeness. It resolves issues where the graph could become inconsistent due to unhandled file types or irrelevant file system events, ensuring that the knowledge graph accurately reflects the repository's state for all files, not just code.

Highlights

  • Event Filtering: Implemented filtering for file system events, ensuring only MODIFIED, CREATED, and deleted events are processed, preventing spurious deletions caused by read-only IDE actions.
  • Non-Code File Deletion: Added a Cypher query to explicitly delete File nodes for non-code files (e.g., .md, .json) when they are removed from the filesystem, addressing a previous oversight where only Module nodes were deleted.
  • Universal File Node Creation: Ensured that File nodes are created for all file types, including non-code files, during real-time updates, aligning real-time indexing with the initial full scan.
Changelog
  • realtime_updater.py
    • Added a check to filter out file system events that do not modify file content, such as 'opened' or 'closed_no_write' events.
    • Introduced a new Cypher query to delete generic 'File' nodes for any file type, complementing the existing 'Module' node deletion for code files.
    • Modified the event dispatch logic to call process_generic_file() for all file types (code and non-code) when a file is modified or created, ensuring comprehensive real-time indexing.
Activity
  • The author created a test plan and verified the fixes by creating, deleting, and modifying various file types, and by simulating IDE read-only access without modifications.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several valuable fixes to the real-time file watcher. It correctly filters spurious file system events, preventing incorrect deletions from the graph. Additionally, it adds logic to properly handle non-code files during creation and deletion, ensuring they are accurately represented and removed. The changes are logical and effectively address the described issues. I have a couple of suggestions to improve maintainability by moving hardcoded values to constants and using enums for consistency.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 1, 2026

Greptile Summary

This PR fixes three related bugs in realtime_updater.py's CodeChangeEventHandler.dispatch(): it filters spurious read-only watchdog events (e.g. opened, closed_no_write) that were incorrectly triggering node deletions, adds deletion of File nodes for non-code files, and calls process_generic_file() during real-time updates so non-code files are re-indexed on MODIFIED/CREATED events — mirroring the behaviour of the initial full scan in _process_single_file().

  • Test breakage: test_realtime_updater.py is not updated. All four tests asserting execute_write.call_count == 2 will now fail because the handler makes three execute_write calls: DELETE_MODULE and the new DELETE_FILE in Step 1, plus DELETE_CALLS in Step 4. The expected count must be updated to 3.
  • Hardcoded string: "deleted" is used as a raw string literal in relevant_events; EventType should gain a DELETED = "deleted" member and be referenced as EventType.DELETED.
  • Inline Cypher query: "MATCH (f:File {path: $path}) DETACH DELETE f" is a raw string literal; per project convention it should be a named constant in constants.py (e.g. CYPHER_DELETE_FILE) and imported alongside the other Cypher constants.
  • Comment policy: Four newly-added comments (realtime_updater.py:77, :81, :91, :93) are missing the required (H) prefix.

Confidence Score: 2/5

  • Not safe to merge — the PR breaks the existing test suite and contains multiple standards violations.
  • The functional logic of the fix is sound and well-reasoned, but the test file was not updated to account for the new execute_write call, which means CI will fail. Additional issues include a hardcoded string literal that should be a StrEnum member, a raw Cypher query that should be a named constant, and several comments missing the required (H) prefix.
  • realtime_updater.py — and the unmodified codebase_rag/tests/test_realtime_updater.py which will break

Important Files Changed

Filename Overview
realtime_updater.py Fixes three real-time watcher bugs (spurious event filtering, File node deletion, and File node creation for non-code files), but introduces test breakage (execute_write call count now 3 vs expected 2), a hardcoded "deleted" string instead of EventType.DELETED, a raw Cypher query string instead of a named constant, and several comments missing the (H) prefix.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[FileSystemEvent received] --> B{is_directory\nor not relevant?}
    B -- Yes --> Z[return early]
    B -- No --> C{event_type in\nMODIFIED / CREATED / DELETED?}
    C -- No --> Z
    C -- Yes --> D[Step 1: Delete Module node\nCYPHER_DELETE_MODULE]
    D --> E[Step 1: Delete File node\nnew MATCH f:File DETACH DELETE]
    E --> F[Step 2: remove_file_from_state]
    F --> G{event_type ==\nMODIFIED or CREATED?}
    G -- No / deleted --> I
    G -- Yes --> H{lang_config exists\nand language supported?}
    H -- Yes --> H2[process_file — build AST\nand Module/code nodes]
    H -- No --> H3[skip code parsing]
    H2 --> H4[process_generic_file — create File node]
    H3 --> H4
    H4 --> I[Step 4: CYPHER_DELETE_CALLS\n+ _process_function_calls]
    I --> J[Step 5: flush_all]

Last reviewed commit: 53faff4

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 4 comments

Comment on lines 84 to 96
return

logger.warning(
logs.CHANGE_DETECTED.format(event_type=event.event_type, path=path)
)

# (H) Step 1
# (H) Step 1: Delete existing nodes for this file path
# Delete Module node and its children (for code files)
ingestor.execute_write(CYPHER_DELETE_MODULE, {KEY_PATH: relative_path_str})
# Delete File node (for all files including non-code like .md, .json)
ingestor.execute_write(
CYPHER_DELETE_FILE, {KEY_PATH: relative_path_str}
)
Tests not updated for new execute_write call

The PR adds a second execute_write call in Step 1 (deleting the File node), but test_realtime_updater.py was not updated. All four test assertions still check execute_write.call_count == 2 (see tests at lines 45, 65, 84, 130), but the flow now has three calls:

  1. execute_write(CYPHER_DELETE_MODULE, ...) — delete Module node
  2. execute_write("MATCH (f:File …) DETACH DELETE f", ...) — new File node deletion
  3. execute_write(CYPHER_DELETE_CALLS) — Step 4

These tests will fail as-is. The expected count in each of those assertions should be updated to 3, and a new assertion verifying process_generic_file is called for MODIFIED/CREATED events would be valuable too.
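
A minimal sketch of the updated assertions, assuming the handler's writes for a MODIFIED event reduce to the sequence listed above. `dispatch_modified` is a hypothetical stand-in for the real handler, the two placeholder query strings are not the project's actual Cypher, and the argument passed to `process_generic_file` is assumed.

```python
from unittest.mock import MagicMock, call

# Only CYPHER_DELETE_FILE's text appears in this PR; the other two values
# are placeholders, not the project's real queries.
CYPHER_DELETE_MODULE = "<delete-module query>"
CYPHER_DELETE_FILE = "MATCH (f:File {path: $path}) DETACH DELETE f"
CYPHER_DELETE_CALLS = "<delete-calls query>"


def dispatch_modified(ingestor, updater, path: str) -> None:
    """Hypothetical reduction of the handler's write sequence for MODIFIED."""
    ingestor.execute_write(CYPHER_DELETE_MODULE, {"path": path})  # Step 1
    ingestor.execute_write(CYPHER_DELETE_FILE, {"path": path})    # Step 1 (new)
    updater.process_generic_file(path)                            # Step 3
    ingestor.execute_write(CYPHER_DELETE_CALLS)                   # Step 4


def test_modified_event_issues_three_writes() -> None:
    ingestor, updater = MagicMock(), MagicMock()
    dispatch_modified(ingestor, updater, "docs/README.md")
    # Two deletions in Step 1 plus the call cleanup in Step 4.
    assert ingestor.execute_write.call_count == 3
    assert ingestor.execute_write.call_args_list[1] == call(
        CYPHER_DELETE_FILE, {"path": "docs/README.md"}
    )
    updater.process_generic_file.assert_called_once_with("docs/README.md")
```

Checking the second call's arguments, and not only the count, would also catch a regression where the File deletion is dropped and some other write is added.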

Comment on lines +94 to +96
ingestor.execute_write(
CYPHER_DELETE_FILE, {KEY_PATH: relative_path_str}
)
Inline Cypher query should be a named constant

The project keeps all Cypher queries as named constants in constants.py (e.g. CYPHER_DELETE_MODULE, CYPHER_DELETE_CALLS). The new query is a raw string literal directly in the handler, which violates the "No Hardcoded Strings" rule.

Move it to constants.py:

CYPHER_DELETE_FILE = "MATCH (f:File {path: $path}) DETACH DELETE f"

Then reference it here:

Suggested change
        ingestor.execute_write(
            CYPHER_DELETE_FILE, {KEY_PATH: relative_path_str}
        )

Context Used: Rule from dashboard - ## Technical Requirements

Agentic Framework

  • PydanticAI Only: This project uses PydanticAI... (source)
Comment on lines 78 to 83
relevant_events = {
EventType.MODIFIED,
EventType.CREATED,
EventType.DELETED, # watchdog deletion event
}
if event.event_type not in relevant_events:
"deleted" should be a StrEnum member, not a raw string

EventType already has MODIFIED and CREATED as StrEnum members, but "deleted" is added as a bare string literal. Per the "No Hardcoded Strings" rule, it should be added to the enum in constants.py:

class EventType(StrEnum):
    MODIFIED = "modified"
    CREATED = "created"
    DELETED = "deleted"

Then reference it here:

Suggested change
relevant_events = {
EventType.MODIFIED,
EventType.CREATED,
EventType.DELETED, # watchdog deletion event
}
if event.event_type not in relevant_events:
relevant_events = {
EventType.MODIFIED,
EventType.CREATED,
EventType.DELETED,
}
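
Swapping the raw string for an enum member should not change how the filter behaves, assuming watchdog reports `event_type` as a plain string: `StrEnum` members compare and hash like their string values, so a set of members still matches raw strings. A small sketch follows; `RELEVANT_EVENTS` and `is_relevant` are hypothetical names, and the fallback class only exists so the example runs on Python versions before 3.11.

```python
try:
    from enum import StrEnum
except ImportError:  # Python < 3.11: minimal stand-in, not the stdlib class
    from enum import Enum

    class StrEnum(str, Enum):
        pass


class EventType(StrEnum):
    MODIFIED = "modified"
    CREATED = "created"
    DELETED = "deleted"


# Hypothetical module-level set; the PR builds the set inside the handler.
RELEVANT_EVENTS = frozenset(
    {EventType.MODIFIED, EventType.CREATED, EventType.DELETED}
)


def is_relevant(event_type: str) -> bool:
    """True when a raw event-type string matches one of the enum members."""
    return event_type in RELEVANT_EVENTS
```

Here `is_relevant("deleted")` is true while `is_relevant("opened")` and `is_relevant("closed_no_write")` are false.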

Context Used: Rule from dashboard - ## Technical Requirements

Agentic Framework

  • PydanticAI Only: This project uses PydanticAI... (source)
relative_path_str = str(path.relative_to(self.updater.repo_path))

# (H) Only process events that actually change file content
# Skip read-only events like "opened", "closed_no_write" that don't modify the file
Comments missing (H) prefix

Per the project's comment policy, all inline comments must be prefixed with (H). The following new comments added by this PR are missing the prefix:

  • realtime_updater.py:77: # Skip read-only events like "opened", "closed_no_write"...
  • realtime_updater.py:81: # watchdog deletion event
  • realtime_updater.py:91: # Delete Module node and its children (for code files)
  • realtime_updater.py:93: # Delete File node (for all files including non-code like .md, .json)

Each should be prefixed with (H), e.g.:

Suggested change
# Skip read-only events like "opened", "closed_no_write" that don't modify the file
# (H) Skip read-only events like "opened", "closed_no_write" that don't modify the file

Context Used: Rule from dashboard - ## Technical Requirements

Agentic Framework

  • PydanticAI Only: This project uses PydanticAI... (source)
bhargavchippada and others added 3 commits March 1, 2026 01:14
Add missing constants required by the code review suggestions:
- EventType.DELETED = "deleted" for watchdog deletion events
- CYPHER_DELETE_FILE query for deleting File nodes
- Update import in realtime_updater.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…prefixes

- Update test assertions to expect 3 execute_write calls (was 2):
  DELETE_MODULE + DELETE_FILE + DELETE_CALLS
- Rename test_unsupported_file_types_are_ignored to
  test_non_code_files_create_file_nodes to reflect new behavior
- Add assertion for process_generic_file being called for non-code files
- Add (H) prefix to all new comments per project convention
- Add pytest as dev dependency

All 6 tests pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GraphUpdater._process_files only cleared in-memory state for deleted
files but never issued Cypher DELETE to Memgraph. Files/folders deleted
before the hash cache existed were invisible to the diff logic entirely.

- Add _prune_orphan_nodes() to GraphUpdater that queries all File,
  Module, and Folder paths from the graph, checks filesystem existence,
  and deletes stale nodes via CYPHER_DELETE_* queries
- Fix _process_files to issue CYPHER_DELETE_MODULE + CYPHER_DELETE_FILE
  for hash-cache-detected deletions (not just in-memory cleanup)
- Add CYPHER_DELETE_FOLDER and CYPHER_ALL_*_PATHS query constants
- Add PRUNE_* log message constants
- Add 10 unit tests covering pruning logic, edge cases, and integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
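
The pruning idea in the last commit can be sketched roughly as follows. This is not the PR's `_prune_orphan_nodes()` implementation: `find_orphan_paths`, `prune_orphans`, and the `delete_node` callback are hypothetical, and the graph queries are replaced by a plain list of repository-relative paths.

```python
from pathlib import Path
from typing import Callable, Iterable


def find_orphan_paths(repo_root: Path, graph_paths: Iterable[str]) -> list[str]:
    """Return graph paths (relative to repo_root) that no longer exist on disk."""
    return sorted(p for p in set(graph_paths) if not (repo_root / p).exists())


def prune_orphans(
    repo_root: Path,
    graph_paths: Iterable[str],
    delete_node: Callable[[str], None],
) -> int:
    """Call delete_node for every stale path and return how many were pruned."""
    orphans = find_orphan_paths(repo_root, graph_paths)
    for rel_path in orphans:
        # In the real updater this would issue a delete query for the node.
        delete_node(rel_path)
    return len(orphans)
```

Because the check compares the graph against the filesystem directly, it does not depend on a hash cache having existed when the file was deleted.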