
feat: add filter-based index cleanup for L2 data items#658

Open
vilenarios wants to merge 4 commits into develop from PE-9016-add-filter-based-index-cleanup

Conversation

@vilenarios
Contributor

Gateway operators indexing data via ArDrive Turbo for optimistic 30-day retention currently have no way to selectively clean up expired or unwanted data items. The only existing mechanism (prune-stable-data-items) supports filtering by height range and indexed_at threshold, but not by tag or wallet.

This adds filter-based index cleanup that deletes L2 data items from SQLite and ClickHouse based on owner wallet address, tags, block height range, and indexing age.

Three interfaces are provided:

  • Admin endpoint: POST /ar-io/admin/index-cleanup
  • Background worker: configurable via ENABLE_INDEX_CLEANUP_WORKER
  • Both share the same core cleanup engine
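For illustration, a request body for the admin endpoint can be built and guarded like this. The field names (filter, owners, tags, dryRun, batchSize) follow this description and are assumptions, not verified against the route code:

```typescript
// Hypothetical request-body builder for POST /ar-io/admin/index-cleanup.
// Filter fields mirror the PR description; exact names are assumptions.
type CleanupFilter = {
  owners?: string[];
  tags?: { name: string; value?: string }[];
  minHeight?: number;
  maxHeight?: number;
  maxIndexedAt?: number;
};

function buildCleanupRequest(
  filter: CleanupFilter,
  dryRun = true, // dry-run defaults ON, matching the design decision
  batchSize = 1000,
) {
  // Empty-filter guard: refuse a request that would match every data item.
  if (Object.keys(filter).length === 0) {
    throw new Error('refusing index cleanup with an empty filter');
  }
  return { filter, dryRun, batchSize };
}

console.log(
  JSON.stringify(buildCleanupRequest({ owners: ['<owner-address>'] })),
);
```

Serializing this object and POSTing it with dryRun left at its default returns a candidate count without deleting anything.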

Key design decisions:

  • Dry-run mode defaults ON (must explicitly set dryRun: false)
  • Leaves bundles/bundle_data_items tables intact to prevent unbundler re-import loops (follows existing prune pattern)
  • Leaves contiguous_data table intact (content-addressed, may be shared)
  • Empty filter guard prevents accidental mass deletion
  • Batch processing with cursor-based pagination (configurable batch size)
  • ClickHouse cleanup via ALTER TABLE DELETE with is_data_item guard
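The ClickHouse side of the cleanup can be sketched as follows: data item IDs arrive as base64url strings, get converted to hex, and are deleted with an ALTER TABLE ... DELETE mutation guarded by is_data_item. Table and column names here are illustrative assumptions, not the gateway's actual schema:

```typescript
// base64url -> hex conversion for ClickHouse ID matching.
function b64UrlToHex(id: string): string {
  return Buffer.from(id, 'base64url').toString('hex');
}

// Build a per-table delete mutation; the is_data_item guard keeps
// L1 transactions out of the delete.
function buildDeleteMutation(table: string, ids: string[]): string {
  const hexIds = ids.map((id) => `unhex('${b64UrlToHex(id)}')`).join(', ');
  return `ALTER TABLE ${table} DELETE WHERE id IN (${hexIds}) AND is_data_item = true`;
}

console.log(buildDeleteMutation('transactions', ['AAAAAAAA']));
```

Note that ALTER TABLE ... DELETE is an asynchronous mutation in ClickHouse, so "submitted" does not mean the rows are gone yet.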

New environment variables:

  • ENABLE_INDEX_CLEANUP_WORKER (default: false)
  • INDEX_CLEANUP_INTERVAL_SECONDS (default: 86400)
  • INDEX_CLEANUP_BATCH_SIZE (default: 1000)
  • INDEX_CLEANUP_FILTER (JSON filter spec)
  • INDEX_CLEANUP_MIN_AGE_SECONDS (default: 2592000 / 30 days)
  • INDEX_CLEANUP_DRY_RUN (default: true)

PE-9016

vilenarios and others added 2 commits March 24, 2026 03:23
The prod tsconfig does not narrow the `hasOwners` boolean check to prove `filter.owners` is defined. Add an explicit guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 24, 2026

Warning

Rate limit exceeded

@vilenarios has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 3 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c201a2f4-4098-4d66-97b5-d4869492c770

📥 Commits

Reviewing files that changed from the base of the PR and between b06a792 and 6499c41.

📒 Files selected for processing (9)
  • docker-compose.yaml
  • src/config.ts
  • src/database/composite-clickhouse.ts
  • src/database/index-cleanup.test.ts
  • src/database/standalone-sqlite.ts
  • src/routes/ar-io.ts
  • src/system.ts
  • src/workers/index-cleanup-worker.test.ts
  • src/workers/index-cleanup-worker.ts
📝 Walkthrough


Adds an index-cleanup feature: config flags, SQLite candidate discovery/counting and batch deletions, ClickHouse deletion support, a periodic IndexCleanupWorker, an admin API endpoint, new types, and tests.

Changes

  • Configuration (src/config.ts): Added environment-driven exports for index cleanup: ENABLE_INDEX_CLEANUP_WORKER, INDEX_CLEANUP_INTERVAL_SECONDS, INDEX_CLEANUP_BATCH_SIZE, INDEX_CLEANUP_FILTER, INDEX_CLEANUP_MIN_AGE_SECONDS, INDEX_CLEANUP_DRY_RUN.
  • Types (src/types.d.ts): Added IndexCleanupFilter and ClickHouseIndexCleanup interfaces.
  • DB / ClickHouse (src/database/composite-clickhouse.ts): Implemented ClickHouseIndexCleanup on CompositeClickHouseDatabase; added deleteDataItemsByIds(ids: string[]), which batches IDs, converts base64url to hex, and issues ALTER TABLE ... DELETE mutations per table with per-table error logging.
  • DB / SQLite (src/database/standalone-sqlite.ts): Added a candidate query builder, getIndexCleanupCandidateIds, countIndexCleanupCandidates, and batched deletion handlers (deleteIndexCleanupBundlesBatch, deleteIndexCleanupDataBatch) with chunking and worker-thread message handling.
  • API Route (src/routes/ar-io.ts): Added a POST /ar-io/admin/index-cleanup route with request validation, dry-run support, paginated processing, batch deletions, optional ClickHouse mirroring, and progress logging.
  • Worker (src/workers/index-cleanup-worker.ts): Added the IndexCleanupWorker class: periodic scheduling, a concurrency guard, an effective maxIndexedAt derived from minAgeSeconds, dry-run counting, paginated deletions, ClickHouse integration, and error logging.
  • System wiring (src/system.ts): Wired optional IndexCleanupWorker startup when enabled; exported a clickHouseCleanup reference; registered a shutdown handler.
  • Tests (src/database/index-cleanup.test.ts, src/workers/index-cleanup-worker.test.ts): Added tests for candidate discovery/counting, pagination, batch deletions, dry-run behavior, ClickHouse call paths, the concurrency guard, and lifecycle.

Sequence Diagram

sequenceDiagram
    participant Admin
    participant Route as API Route
    participant SQLiteDB as SQLite Database
    participant ClickHouseDB as ClickHouse Database
    participant Worker as IndexCleanupWorker

    Admin->>Route: POST /ar-io/admin/index-cleanup (filter, dryRun, batchSize)
    alt dryRun = true
        Route->>SQLiteDB: countIndexCleanupCandidates(filter)
        SQLiteDB-->>Route: count
        Route-->>Admin: {count, totalDeleted:0}
    else dryRun = false
        loop per batch
            Route->>SQLiteDB: getIndexCleanupCandidateIds(filter, limit, afterId)
            SQLiteDB-->>Route: {ids[], hasMore}
            Route->>SQLiteDB: deleteIndexCleanupBundlesBatch(ids)
            SQLiteDB-->>Route: deletedCounts
            Route->>SQLiteDB: deleteIndexCleanupDataBatch(ids)
            SQLiteDB-->>Route: deletedCounts
            alt ClickHouse enabled
                Route->>ClickHouseDB: deleteDataItemsByIds(ids)
                ClickHouseDB-->>Route: ack
            end
        end
        Route-->>Admin: {totalDeleted}
    end

    Note over Worker: Periodic background run (same flow as non-dry-run)
    Worker->>SQLiteDB: countIndexCleanupCandidates(filter)
    SQLiteDB-->>Worker: count
    loop per batch
        Worker->>SQLiteDB: getIndexCleanupCandidateIds(filter, limit, afterId)
        SQLiteDB-->>Worker: {ids[], hasMore}
        Worker->>SQLiteDB: deleteIndexCleanupBundlesBatch(ids)
        Worker->>SQLiteDB: deleteIndexCleanupDataBatch(ids)
        alt ClickHouse enabled
            Worker->>ClickHouseDB: deleteDataItemsByIds(ids)
            ClickHouseDB-->>Worker: ack
        end
    end
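The route and worker share the same batch loop. A minimal sketch of that cursor-paginated loop, with method names taken from the walkthrough above and signatures inferred from the sequence diagram rather than the actual code:

```typescript
// Cursor-paginated cleanup loop; db methods are assumptions based on the
// walkthrough, and the filter type is left opaque for brevity.
async function runCleanup(
  db: {
    getIndexCleanupCandidateIds(p: {
      filter: unknown;
      limit: number;
      afterId?: string;
    }): Promise<{ ids: string[]; hasMore: boolean }>;
    deleteIndexCleanupBundlesBatch(ids: string[]): Promise<void>;
    deleteIndexCleanupDataBatch(ids: string[]): Promise<void>;
  },
  filter: unknown,
  batchSize: number,
): Promise<number> {
  let totalDeleted = 0;
  let afterId: string | undefined;
  for (;;) {
    const { ids, hasMore } = await db.getIndexCleanupCandidateIds({
      filter,
      limit: batchSize,
      afterId,
    });
    if (ids.length > 0) {
      await db.deleteIndexCleanupBundlesBatch(ids);
      await db.deleteIndexCleanupDataBatch(ids);
      totalDeleted += ids.length;
      afterId = ids[ids.length - 1]; // cursor for the next page
    }
    if (!hasMore) return totalDeleted;
  }
}

// Demo with a stubbed db that returns two pages of candidates.
runCleanup(
  {
    getIndexCleanupCandidateIds: async ({ afterId }) =>
      afterId === undefined
        ? { ids: ['id1', 'id2'], hasMore: true }
        : { ids: ['id3'], hasMore: false },
    deleteIndexCleanupBundlesBatch: async () => {},
    deleteIndexCleanupDataBatch: async () => {},
  },
  { owners: ['<owner-address>'] },
  2,
).then((n) => console.log('deleted', n)); // deleted 3
```

Using the last ID of each batch as the cursor keeps the query cheap even when earlier rows are deleted mid-run.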

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • djwhitt
  • karlprieb
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title 'feat: add filter-based index cleanup for L2 data items' clearly summarizes the main feature addition, matching the core functionality added across multiple files.
  • Description check ✅ Passed: the description comprehensively explains the purpose, design decisions, and implementation details of the index cleanup feature, directly addressing the changeset and including relevant environment variables.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (6)
src/workers/index-cleanup-worker.ts (2)

15-32: Consider extracting the db type to a named interface.

The database dependency type is defined identically in both the class property (lines 15-32) and constructor parameter (lines 53-70). Extracting this to a named interface would reduce duplication and improve maintainability.

♻️ Suggested refactor
+interface IndexCleanupDb {
+  getIndexCleanupCandidateIds(params: {
+    filter: IndexCleanupFilter;
+    limit: number;
+    afterId?: Buffer;
+  }): Promise<{ ids: Buffer[]; hasMore: boolean }>;
+  countIndexCleanupCandidates(filter: IndexCleanupFilter): Promise<number>;
+  deleteIndexCleanupBundlesBatch(ids: Buffer[]): Promise<{
+    stableDataItemTagsDeleted: number;
+    stableDataItemsDeleted: number;
+    newDataItemTagsDeleted: number;
+    newDataItemsDeleted: number;
+  }>;
+  deleteIndexCleanupDataBatch(ids: Buffer[]): Promise<{
+    contiguousDataIdParentsDeleted: number;
+    contiguousDataIdsDeleted: number;
+  }>;
+}

 export class IndexCleanupWorker {
   private log: winston.Logger;
-  private db: {
-    getIndexCleanupCandidateIds(params: {
-      ...
-    }): Promise<...>;
-    ...
-  };
+  private db: IndexCleanupDb;

Also applies to: 53-70

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workers/index-cleanup-worker.ts` around lines 15 - 32, Extract the inline
db type into a named interface (e.g., IndexCleanupDb) containing the methods
getIndexCleanupCandidateIds, countIndexCleanupCandidates,
deleteIndexCleanupBundlesBatch, and deleteIndexCleanupDataBatch with their
existing signatures, then replace the anonymous type on the class property db
and the constructor parameter with this new interface; update any
imports/exports if needed and ensure the constructor parameter and the private
db property both use the same IndexCleanupDb type to remove duplication.

88-96: Consider running an initial cleanup on start.

The worker starts an interval but doesn't run an immediate cleanup. Depending on the use case, operators may want cleanup to begin immediately rather than waiting for intervalMs to elapse. This could be made configurable or documented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workers/index-cleanup-worker.ts` around lines 88 - 96, The start() method
in index-cleanup-worker.ts currently sets up a periodic timer but never performs
an immediate cleanup; update start() to optionally run an initial cleanup by
calling this.cleanup() right after setting the interval (or before, if you
prefer) controlled by a new boolean option like runOnStart or immediateCleanup
on the worker instance (add the option to the constructor/defaults), and emit a
log entry from start() indicating that an initial cleanup was executed (or
skipped) so operators can see behavior; keep existing fields (intervalMs,
dryRun, minAgeSeconds, filter) intact and ensure cleanup() handles concurrent
invocations safely if needed.
src/workers/index-cleanup-worker.test.ts (3)

156-193: Consider verifying the actual IDs passed to ClickHouse cleanup.

The test verifies deleteDataItemsByIds is called once but doesn't assert that the IDs are correctly converted to base64url format. Adding an assertion on the call arguments would strengthen the test.

💡 Suggested enhancement
     assert.equal(mockClickHouse.deleteDataItemsByIds.mock.callCount(), 1);
+    const passedIds = mockClickHouse.deleteDataItemsByIds.mock.calls[0].arguments[0];
+    assert.equal(passedIds.length, 1);
+    // Verify base64url conversion occurred
+    assert.equal(typeof passedIds[0], 'string');
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workers/index-cleanup-worker.test.ts` around lines 156 - 193, The test
currently only asserts deleteDataItemsByIds was called once; update the test
around IndexCleanupWorker.cleanup to also assert the actual IDs passed to
mockClickHouse.deleteDataItemsByIds are the expected base64url-encoded values
(convert the Buffer id1 to base64url and compare to the argument passed). Locate
the mockClickHouse.deleteDataItemsByIds mock used in the test and add an
assertion on its call arguments (the first call's payload) to verify correct
conversion/formatting of IDs.

215-230: Consider adding an assertion that start() actually schedules the interval.

The test calls start() and stop() but doesn't verify that the interval was actually created. While this is a basic smoke test, you could enhance it by checking that cleanup() would eventually be called or that stop() clears the interval.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workers/index-cleanup-worker.test.ts` around lines 215 - 230, Add an
assertion that start() actually schedules the periodic work: in the test for
IndexCleanupWorker, use fake timers (or spy on the instance method) to call
worker.start(), advance timers and assert that cleanup() was invoked (or inspect
the worker's timer/interval property exists), then call worker.stop() and assert
the timer/interval was cleared (or that cleanup is no longer called after
advancing timers). Reference the IndexCleanupWorker class and its start(),
stop(), and cleanup() methods to locate where to add the fake timers/spy and the
assertions.

61-96: Test assertion may be fragile if batch iteration logic changes.

The test expects deleteIndexCleanupBundlesBatch and deleteIndexCleanupDataBatch to each be called exactly once. This works because the mock returns hasMore: true on the first call (with IDs) and hasMore: false on the second call (with empty IDs), so only one batch has actual work.

Consider making the test more explicit by adding a comment explaining this expected behavior, or adjusting the mock to return two non-empty batches to verify the loop iterates correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workers/index-cleanup-worker.test.ts` around lines 61 - 96, The test's
assumptions about batch iteration are fragile because
getIndexCleanupCandidateIds returns one non-empty batch then an empty batch;
update the test for clarity by either adding a short comment explaining why
deleteIndexCleanupBundlesBatch and deleteIndexCleanupDataBatch are expected to
be called once, or better: change the mock for getIndexCleanupCandidateIds (used
by IndexCleanupWorker.cleanup) to return two non-empty batches (e.g., first {
ids: [id1, id2], hasMore: true } and second { ids: [id3], hasMore: false }) and
then update the assertions to expect the corresponding increased call counts for
deleteIndexCleanupBundlesBatch and deleteIndexCleanupDataBatch to explicitly
verify the loop iterates over multiple batches.
src/routes/ar-io.ts (1)

756-759: Consider returning more detailed error information.

The error handler logs the full error but only returns error?.message to the client. For admin endpoints, providing more context (like a stack trace in non-production environments) could aid debugging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routes/ar-io.ts` around lines 756 - 759, The catch block in
src/routes/ar-io.ts currently logs the full error via log.error but only returns
error?.message to clients; update the error response to include additional
context (e.g., error.name, error.code, and the stack) for non-production
environments and keep production responses minimal. Specifically, in the catch
handling where log.error('Index cleanup error', { error: error?.message }) and
res.status(500).send(error?.message) are used, change the response to a JSON
payload that includes message and conditionally includes stack
(process.env.NODE_ENV !== 'production') and any other safe diagnostic fields
like error.name or error.code, while ensuring stack traces or sensitive details
are not exposed in production.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/config.ts`:
- Around line 1371-1381: INDEX_CLEANUP_INTERVAL_SECONDS and
INDEX_CLEANUP_BATCH_SIZE (and the other similar vars around 1390-1395) currently
coerce env values with unary + which produces NaN/0 for malformed or empty
input; change them to parse and validate integers explicitly: read the env
string via env.varOrDefault, parse with Number.parseInt (or a small helper like
parseEnvInt), ensure the result is a finite integer and within acceptable bounds
(>0), and throw or exit with a clear error message during startup if validation
fails so the process fails fast instead of running with NaN/0 values.
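The fail-fast parsing this comment describes can be sketched as follows. parseEnvInt is the hypothetical helper the comment itself suggests; it rejects malformed, empty, and non-positive values instead of coercing them to NaN/0 with unary +:

```typescript
// Hypothetical fail-fast env integer parser (parseEnvInt is not an
// existing function in the repo).
function parseEnvInt(name: string, defaultValue: number): number {
  const raw = process.env[name];
  if (raw === undefined || raw.trim() === '') return defaultValue;
  const value = Number.parseInt(raw, 10);
  // Round-trip check rejects trailing garbage like '100abc'.
  if (!Number.isFinite(value) || String(value) !== raw.trim() || value <= 0) {
    throw new Error(`${name} must be a positive integer, got: ${raw}`);
  }
  return value;
}

process.env.INDEX_CLEANUP_BATCH_SIZE = '1000';
console.log(parseEnvInt('INDEX_CLEANUP_BATCH_SIZE', 500)); // 1000
```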

In `@src/database/composite-clickhouse.ts`:
- Around line 534-550: The current loop over `tables` swallows individual `ALTER
TABLE ... DELETE` errors (caught in the `catch` that calls `this.log.error`) but
still emits `this.log.info('ClickHouse index cleanup mutations submitted', {
totalIds: ids.length, ... })`, so partial failures are reported as success;
modify the method around the `for (const table of tables)` loop to collect
failures (e.g., an array of failure metadata from the `catch` that includes
`table`, `batchSize`, and `error.message`), do not suppress them, and after the
loop if any failures were recorded either throw a new Error (or rethrow the
first error) containing the failure summary or mark the operation as failed so
the subsequent `this.log.info('ClickHouse index cleanup mutations
submitted'...)` is not executed; reference `this.clickhouseClient.command`, the
`catch` block currently logging via `this.log.error`, the
`tables`/`batch`/`idList` loop, and the final `this.log.info` when implementing
this change.
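The error-collection pattern this comment asks for can be sketched like so: record per-table failures instead of swallowing them, then throw a summary afterwards so a partial failure is never logged as success. Here runMutation stands in for this.clickhouseClient.command and is an assumption:

```typescript
// Collect per-table failures and surface them after the loop.
async function deleteFromTables(
  tables: string[],
  runMutation: (table: string) => Promise<void>,
): Promise<void> {
  const failures: { table: string; error: string }[] = [];
  for (const table of tables) {
    try {
      await runMutation(table);
    } catch (e: any) {
      // Record the failure instead of suppressing it.
      failures.push({ table, error: e?.message ?? String(e) });
    }
  }
  if (failures.length > 0) {
    throw new Error(
      `ClickHouse cleanup failed for ${failures.length} table(s): ` +
        failures.map((f) => `${f.table}: ${f.error}`).join('; '),
    );
  }
}

// Demo: one table succeeds, one fails; the error names the failed table.
deleteFromTables(['a', 'b'], async (t) => {
  if (t === 'b') throw new Error('mutation rejected');
}).catch((e) => console.log(e.message));
```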

In `@src/database/index-cleanup.test.ts`:
- Line 21: The test imports the production logger via "import log from
'../log.js';"—replace that import with createTestLogger from
"test/test-logger.ts" and instantiate a test logger (e.g., const log =
createTestLogger()) so the test uses the test-only silent logger; update any
references to the existing "log" identifier to use the created test logger and
remove the production logger import.
- Around line 134-135: The "Index cleanup" test suite (describe 'Index cleanup'
/ tests for getIndexCleanupCandidateIds) lacks per-test teardown so inserted
rows (IDs like ID1-ID6) can leak between tests; add an afterEach hook that
truncates or deletes from the tables used by these tests (e.g.,
documents/embeddings or whichever test helpers insert rows) or reset the test DB
state via the same DB client used in tests, ensuring each test uses a clean
table state (alternatively generate unique IDs per test if truncation is not
possible).

In `@src/database/standalone-sqlite.ts`:
- Line 2850: buildCleanupCandidateQuery() currently only selects from
stable_data_items (the line creating `query` using
`sql.select(selectClause).from('stable_data_items sdi')`), so rows that exist
only in new_data_items are never found; update the query to include both tables
(either by unioning a second select from `new_data_items ndi` with the same
selectClause or by selecting from a derived table that UNIONS stable_data_items
and new_data_items) so the resulting candidate set matches the delete path that
removes both stable_* and new_* rows.
- Around line 4142-4161: The switch cases that declare consts are not
block-scoped and violate noSwitchDeclarations: wrap the bodies of the
'getIndexCleanupCandidateIds', 'countIndexCleanupCandidates',
'deleteIndexCleanupBundlesBatch', and 'deleteIndexCleanupDataBatch' cases in
braces so each const (cleanupCandidates, cleanupCount, bundlesCleanupResult,
dataCleanupResult) is scoped to its own block; e.g., enclose the statements that
call worker.getIndexCleanupCandidateIds(args[0]) /
worker.countIndexCleanupCandidates(args[0]) /
worker.deleteIndexCleanupBundlesBatch(args[0]) /
worker.deleteIndexCleanupDataBatch(args[0]) and the corresponding
parentPort?.postMessage(...) inside { ... } for each case.
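A minimal illustration of the block-scoped switch cases the comment asks for: wrapping each case body in braces scopes its const declarations. The worker object below is a stand-in, not the repo's actual worker:

```typescript
// Stand-in worker with one of the methods named in the comment.
const worker = {
  countIndexCleanupCandidates: (_filter: unknown): number => 42,
};

function handleMessage(method: string, args: unknown[]): unknown {
  switch (method) {
    case 'countIndexCleanupCandidates': {
      // Braces keep this const local to the case (satisfies noSwitchDeclarations).
      const cleanupCount = worker.countIndexCleanupCandidates(args[0]);
      return cleanupCount;
    }
    default:
      return undefined;
  }
}

console.log(handleMessage('countIndexCleanupCandidates', [{}])); // 42
```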

In `@src/routes/ar-io.ts`:
- Around line 626-640: The route handler registered with
arIoRouter.post('/ar-io/admin/index-cleanup') reads batchSize from req.body and
defaults to 1000 but does not validate it; add validation in that handler to
coerce and validate batchSize (e.g., parseInt) and reject or normalize invalid
values: enforce it is an integer > 0 (and optionally cap to a sane maximum),
return a 400 error when the client provides a non-numeric, non-integer, zero, or
negative batchSize, and otherwise use the validated value in the rest of the
function; update any variable references to batchSize inside the handler to use
the validated value.
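A sketch of the batchSize validation the comment requests, written as a standalone helper; the 1000 default comes from the PR description, while the 10,000 cap is an illustrative assumption:

```typescript
// Validate and normalize a client-supplied batchSize.
function validateBatchSize(raw: unknown, fallback = 1000, max = 10_000): number {
  if (raw === undefined) return fallback;
  const value = typeof raw === 'number' ? raw : Number.parseInt(String(raw), 10);
  if (!Number.isInteger(value) || value <= 0) {
    // The route would translate this into a 400 response.
    throw new Error('batchSize must be a positive integer');
  }
  return Math.min(value, max); // cap to a sane maximum
}

console.log(validateBatchSize(undefined)); // 1000
console.log(validateBatchSize('250')); // 250
console.log(validateBatchSize(50_000)); // 10000
```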

In `@src/system.ts`:
- Around line 1402-1413: The code currently instantiates IndexCleanupWorker with
JSON.parse(config.INDEX_CLEANUP_FILTER) even when it's the default '{}' which
causes the cleanup query builder to fail; update the startup logic that
constructs IndexCleanupWorker to validate the parsed filter (from
config.INDEX_CLEANUP_FILTER) and ensure it contains at least one criterion
(e.g., non-empty object or required keys) before creating the worker. If
validation fails, do not start the worker—log a clear warning/error including
the invalid filter value—and keep indexCleanupWorker undefined (or bail out) so
the worker doesn't repeatedly wake and fail; adjust the code around the
IndexCleanupWorker construction and related variables to perform this check.
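The startup guard this comment describes can be sketched as a small filter parser: only construct the worker when the parsed INDEX_CLEANUP_FILTER contains at least one criterion. The function name and log wording are illustrative assumptions:

```typescript
// Parse INDEX_CLEANUP_FILTER and reject empty or invalid filters so the
// worker is never started with a filter that matches everything.
function parseCleanupFilter(raw: string): Record<string, unknown> | undefined {
  let filter: unknown;
  try {
    filter = JSON.parse(raw);
  } catch {
    console.warn(`Invalid INDEX_CLEANUP_FILTER JSON: ${raw}`);
    return undefined;
  }
  if (
    typeof filter !== 'object' ||
    filter === null ||
    Object.keys(filter).length === 0
  ) {
    console.warn('INDEX_CLEANUP_FILTER is empty; index cleanup worker not started');
    return undefined;
  }
  return filter as Record<string, unknown>;
}

console.log(parseCleanupFilter('{}')); // returns undefined (worker skipped)
console.log(parseCleanupFilter('{"owners":["abc"]}'));
```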


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: adfae3c3-5ac7-4d3f-916a-5c3806bc24e6

📥 Commits

Reviewing files that changed from the base of the PR and between fc32edf and 1d81bbf.

📒 Files selected for processing (9)
  • src/config.ts
  • src/database/composite-clickhouse.ts
  • src/database/index-cleanup.test.ts
  • src/database/standalone-sqlite.ts
  • src/routes/ar-io.ts
  • src/system.ts
  • src/types.d.ts
  • src/workers/index-cleanup-worker.test.ts
  • src/workers/index-cleanup-worker.ts

);
}

const query = sql.select(selectClause).from('stable_data_items sdi');

@coderabbitai coderabbitai bot Mar 24, 2026


⚠️ Potential issue | 🟠 Major

This selector never sees new_data_items.

buildCleanupCandidateQuery() is pinned to stable_data_items, but the delete path later removes both stable_* and new_* rows. Any matching item that still exists only in new_data_items cannot be counted or deleted by this feature.


@vilenarios (Author)

Fixed.

@coderabbitai

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current implementation of buildCleanupCandidateQuery in the PR branch
rg -n "buildCleanupCandidateQuery|new_data_items|stable_data_items" src/database/standalone-sqlite.ts -A 3 -B 1 | head -100

Length of output: 1472


@vilenarios, thanks for the update! However, after checking the current branch, Line 2850 in src/database/standalone-sqlite.ts still reads:

const query = sql.select(selectClause).from('stable_data_items sdi');

There is no UNION with new_data_items, so the fix doesn't appear to be reflected yet. Could you double-check that the change was pushed? The delete path in deleteIndexCleanupBundlesBatch still removes rows from both stable_data_items and new_data_items, so the candidate query needs to cover both tables to ensure consistency.

Here's a sketch of what the fix would look like:

private buildCleanupCandidateQuery(
  filter: IndexCleanupFilter,
  selectClause: string,
): sql.SelectStatement {
  // ... validation unchanged ...

  // Build filter conditions for reuse
  const buildFiltered = (tableAlias: string, tableName: string) => {
    const q = sql.select(selectClause.replace(/\bsdi\b/g, tableAlias))
      .from(`${tableName} ${tableAlias}`);

    if (hasOwners && filter.owners) {
      q.where(sql.in(`${tableAlias}.owner_address`, filter.owners.map((o) => fromB64Url(o))));
    }
    // ... tags, height, indexedAt filters ...
    return q;
  };

  // UNION stable + new items so both are included in candidate set
  const stableQuery = buildFiltered('sdi', 'stable_data_items');
  const newQuery = buildFiltered('ndi', 'new_data_items');

  // Return as a derived UNION query
  // (sql-bricks may require raw SQL here for the UNION)
  ...
}

The exact approach depends on how sql-bricks handles UNIONs in your codebase; you may need a raw SQL string for the union. Let me know if you'd like a more complete implementation.


🧠 Learnings used
Learnt from: CR
Repo: ar-io/ar-io-node PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-20T15:05:19.465Z
Learning: Applies to src/database/**/*.{ts,tsx} : When adding database methods, implement in worker class, add queue wrapper in main database class, add case handler in worker message handler, and add signature to appropriate interface in types.d.ts

Learnt from: djwhitt
Repo: ar-io/ar-io-node PR: 635
File: src/routes/data/handlers.ts:677-686
Timestamp: 2026-03-10T20:58:16.097Z
Learning: In Node.js streams, attaching an error listener immediately after stream.pipe(res) is safe because pipe is synchronous and error events are emitted asynchronously, so there is no meaningful race window. Do not flag this ordering as a bug in code reviews. This guidance applies when reviewing Node.js stream usage across TypeScript files; if you see a pattern involving attaching an 'error' listener after pipe(), assume it is safe and focus on real race conditions or error handling concerns elsewhere.

Learnt from: djwhitt
Repo: ar-io/ar-io-node PR: 637
File: src/system.ts:547-555
Timestamp: 2026-03-11T13:40:11.246Z
Learning: In src/system.ts, ensure loop protection is activated without depending on ownOrigin/ARNS_ROOT_HOST config. The detection should use detectLoopInViaChain to inspect the via-chain header and skip gateways already present in the chain. There should be no requirement for local configuration, and missing environment variables should not disable loop protection.

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 89.89691% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.35%. Comparing base (fc32edf) to head (6499c41).

Files with missing lines              Patch %   Lines
src/database/standalone-sqlite.ts     82.97%    46 Missing and 1 partial ⚠️
src/workers/index-cleanup-worker.ts   98.83%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #658      +/-   ##
===========================================
+ Coverage    77.19%   77.35%   +0.16%     
===========================================
  Files          111      112       +1     
  Lines        37227    37712     +485     
  Branches      2787     2837      +50     
===========================================
+ Hits         28736    29172     +436     
- Misses        8455     8503      +48     
- Partials        36       37       +1     

☔ View full report in Codecov by Sentry.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/database/standalone-sqlite.ts (1)

2834-2906: Consider adding TSDoc comments to document the cleanup query building.

Per coding guidelines, adding TSDoc comments when modifying code helps with documentation. The buildCleanupCandidateQuery method and related public methods (getIndexCleanupCandidateIds, countIndexCleanupCandidates, deleteIndexCleanupBundlesBatch, deleteIndexCleanupDataBatch) would benefit from documentation explaining their purpose, parameters, and the filter criteria supported.

Example TSDoc for buildCleanupCandidateQuery
/**
 * Builds a SQL query to find data item candidates for index cleanup.
 * Requires at least one filter criterion to prevent accidental mass deletion.
 *
 * @param filter - Cleanup filter criteria (owners, tags, height range, maxIndexedAt)
 * @param selectClause - SQL select clause (e.g., 'sdi.id' or 'COUNT(DISTINCT sdi.id)')
 * @returns A sql-bricks SelectStatement for querying stable_data_items
 * @throws Error if filter specifies no criteria
 */

As per coding guidelines: "**/*.{ts,tsx,js}: Add or improve TSDoc comments when modifying code to enhance documentation"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database/standalone-sqlite.ts` around lines 2834 - 2906, Add TSDoc
comments for buildCleanupCandidateQuery and the related public methods
getIndexCleanupCandidateIds, countIndexCleanupCandidates,
deleteIndexCleanupBundlesBatch, and deleteIndexCleanupDataBatch: describe each
method's purpose, parameters (explain IndexCleanupFilter fields: owners, tags,
minHeight, maxHeight, maxIndexedAt), the selectClause and return type
(sql.SelectStatement or relevant return), and thrown errors (e.g., when no
filter criteria are provided); keep the summary, `@param`, `@returns` and `@throws`
tags concise and include the note that at least one filter criterion is required
to avoid accidental mass deletions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 65e983be-6978-4849-b857-dfb7df644e7d

📥 Commits

Reviewing files that changed from the base of the PR and between 1d81bbf and b06a792.

📒 Files selected for processing (1)
  • src/database/standalone-sqlite.ts

vilenarios and others added 2 commits March 24, 2026 03:48
Without these entries in the environment section, the INDEX_CLEANUP_*
variables from .env are never forwarded to the core container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
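The forwarding described in this commit would look roughly like the following in the compose file. This is a sketch, not the repo's actual docker-compose.yaml; the service name is assumed, and the fallback defaults mirror the ones listed in the PR description:

```yaml
# Sketch: forward INDEX_CLEANUP_* from .env into the core container.
# Service name and exact file layout are assumptions.
services:
  core:
    environment:
      - ENABLE_INDEX_CLEANUP_WORKER=${ENABLE_INDEX_CLEANUP_WORKER:-false}
      - INDEX_CLEANUP_INTERVAL_SECONDS=${INDEX_CLEANUP_INTERVAL_SECONDS:-86400}
      - INDEX_CLEANUP_BATCH_SIZE=${INDEX_CLEANUP_BATCH_SIZE:-1000}
      - INDEX_CLEANUP_FILTER=${INDEX_CLEANUP_FILTER:-}
      - INDEX_CLEANUP_MIN_AGE_SECONDS=${INDEX_CLEANUP_MIN_AGE_SECONDS:-2592000}
      - INDEX_CLEANUP_DRY_RUN=${INDEX_CLEANUP_DRY_RUN:-true}
```

Without entries like these, values set in .env stay on the host side and `process.env` inside the container never sees them.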
- Use positiveIntOrDefault for config env vars instead of unary +
  to fail fast on malformed input
- Collect ClickHouse deletion failures and throw after loop instead
  of silently logging partial failures as success
- Block-scope const declarations in switch/case to fix
  noSwitchDeclarations
- Add batchSize validation in admin endpoint (must be positive int)
- Validate INDEX_CLEANUP_FILTER JSON in system.ts with try/catch,
  log error and skip worker startup on malformed config
- Replace production logger with createTestLogger in test file
- Extract IndexCleanupDb interface from inline type duplication
- Run initial cleanup on worker start instead of waiting for first
  interval
- Improve batch iteration test to verify multi-batch loop with two
  non-empty batches
- Assert actual IDs passed to ClickHouse mock in worker test
- Add toB64Url static import in admin endpoint (was already fixed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
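For the first item above, a `positiveIntOrDefault` helper along these lines fails fast on malformed input instead of letting unary `+` silently coerce it to `NaN`. The signature and error wording here are assumptions; the repo's actual helper may differ:

```typescript
// Sketch of a fail-fast env-var parser (assumed shape, not the repo's
// actual implementation). Unary + would turn 'abc' into NaN silently;
// this throws instead so misconfiguration is caught at startup.
function positiveIntOrDefault(
  value: string | undefined,
  defaultValue: number,
): number {
  if (value === undefined || value === '') {
    return defaultValue;
  }
  const parsed = Number(value);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`Expected a positive integer, got: ${value}`);
  }
  return parsed;
}

// Hypothetical usage with one of the new env vars:
// const batchSize = positiveIntOrDefault(process.env.INDEX_CLEANUP_BATCH_SIZE, 1000);
```

Throwing at config-load time moves the failure from a confusing mid-cleanup error to an immediate, attributable startup error.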