Fix 2 #194

Open
abishekve wants to merge 11 commits into master from fix-2

@abishekve abishekve commented May 12, 2026

PR Title

Fix SQLite lock contention, thread pool starvation, and unbounded raw data queries


PR Description

This PR resolves major SQLite performance and stability issues affecting live monitoring, status page responsiveness, rollup processing, and long-term database growth.

The primary root causes were:

  • Concurrent SQLite writes from probe tasks

  • Full-table scans against check_result_raw

  • Sync-over-async thread pool blocking

  • Excessive per-probe DB access

  • Unbounded historical queries loading millions of rows into memory

  • Missing automated pruning

The changes introduce WAL mode, a centralized async write queue, query optimizations, endpoint caching, scheduled pruning, and multiple background processing improvements.


Changes Included

1. Added SqliteWalInterceptor

Introduced a connection interceptor that applies SQLite PRAGMA optimizations on every new DB connection:

  • journal_mode=WAL

  • busy_timeout=5000

  • synchronous=NORMAL

  • cache_size=-8000

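WAL mode lets readers proceed while a write is in progress (SQLite still permits only one writer at a time), which reduces lock contention and improves overall DB responsiveness. busy_timeout=5000 makes a blocked connection wait up to 5 seconds for a lock instead of failing immediately, and cache_size=-8000 sets the page cache to roughly 8 MB per connection.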
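For reviewers, a minimal sketch of what applying these PRAGMAs on an open connection could look like. It is not the code in this PR: the SqlitePragmas class and its Apply method are hypothetical names, and the sketch uses only System.Data.Common so it stays provider-agnostic. It assumes the provider accepts several statements in one command, as Microsoft.Data.Sqlite does.

```csharp
using System;
using System.Data;
using System.Data.Common;

// Hypothetical helper (not from this PR): applies the four PRAGMAs listed above.
public static class SqlitePragmas
{
    // Values are taken from the PR description.
    public const string Sql =
        "PRAGMA journal_mode=WAL;" +
        "PRAGMA busy_timeout=5000;" +
        "PRAGMA synchronous=NORMAL;" +
        "PRAGMA cache_size=-8000;";

    // Expects an already-open connection, e.g. when called from an EF Core
    // DbConnectionInterceptor's ConnectionOpened / ConnectionOpenedAsync hook.
    public static void Apply(DbConnection connection)
    {
        if (connection.State != ConnectionState.Open)
            throw new InvalidOperationException("Connection must be open before applying PRAGMAs.");

        using DbCommand command = connection.CreateCommand();
        command.CommandText = Sql;
        command.ExecuteNonQuery();
    }
}
```

journal_mode=WAL is stored in the database file, so re-applying it is harmless; the other three settings are per-connection, which is why they need to run on every new connection.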


2. Added CheckResultWriteQueue

Implemented a centralized background write queue for probe results.

Previous behavior

Each probe executed its own SaveChangesAsync() call, causing massive SQLite write lock contention when many probes completed simultaneously.

New behavior

  • Probe results are queued via Channel<T>

  • Single background writer batches up to 100 items per save

  • Queue bounded at 10,000 entries with oldest-item drop strategy

  • RTT updates use ExecuteUpdateAsync

  • Remaining items are flushed during shutdown

Result:

  • Eliminates concurrent SQLite writers

  • Reduces write amplification

  • Improves monitoring stability under load
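The queue behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the CheckResultWriteQueue in this PR: BatchingWriteQueue<T> and its member names are hypothetical, and the actual persistence step is abstracted behind a saveBatch delegate.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for the PR's write queue; names and shape are assumptions.
public sealed class BatchingWriteQueue<T>
{
    private readonly Channel<T> _channel;
    private readonly int _maxBatchSize;
    private readonly Func<IReadOnlyList<T>, CancellationToken, Task> _saveBatch;

    public BatchingWriteQueue(
        Func<IReadOnlyList<T>, CancellationToken, Task> saveBatch,
        int capacity = 10_000,
        int maxBatchSize = 100)
    {
        _saveBatch = saveBatch;
        _maxBatchSize = maxBatchSize;
        _channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.DropOldest, // when full, discard the oldest queued item
            SingleReader = true                           // exactly one background writer
        });
    }

    // Called from probe code; does not block and does not touch the database.
    public bool Enqueue(T item) => _channel.Writer.TryWrite(item);

    // Signals that no more items will arrive so RunAsync can drain and exit.
    public void Complete() => _channel.Writer.TryComplete();

    // Single writer loop: wait for data, drain up to one batch, save it, repeat.
    public async Task RunAsync(CancellationToken ct = default)
    {
        var batch = new List<T>(_maxBatchSize);
        while (await _channel.Reader.WaitToReadAsync(ct))
        {
            while (batch.Count < _maxBatchSize && _channel.Reader.TryRead(out var item))
                batch.Add(item);

            if (batch.Count == 0) continue;

            // Pass a copy so the callee may keep the batch after we clear the buffer.
            await _saveBatch(batch.ToArray(), ct);
            batch.Clear();
        }
    }
}
```

In this sketch, calling Complete() at shutdown (rather than cancelling the token) lets RunAsync flush whatever is still queued before it returns. The saveBatch delegate is where a single SaveChangesAsync() per batch would go.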


3. Updated OutageDetectionService

Replaced inline async DB writes with fire-and-forget queue enqueueing:

  • SaveCheckResultAsync() → SaveCheckResult()

  • Actual persistence handled asynchronously by CheckResultWriteQueue

This removes DB latency from probe execution paths.


4. Updated MonitoringBackgroundService

Fixed timer restart thundering herd

Previously all probe timers restarted every 15 seconds, causing probes to execute simultaneously.

Now:

  • Timers restart only when intervals actually change

  • Uses ConcurrentDictionary<Guid, int> to track intervals

Removed per-probe DB reads

Previously each probe executed FindAsync(endpointId).

Now:

  • Endpoints cached in-memory via ConcurrentDictionary<Guid, Endpoint>

  • Refreshed periodically

  • Probe execution performs zero DB reads
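The "restart only when the interval changes" check can be sketched as below. IntervalTracker and ShouldRestart are hypothetical names used for illustration; only the ConcurrentDictionary<Guid, int> comes from the description above.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper (not from this PR): remembers the last interval seen per endpoint.
public sealed class IntervalTracker
{
    private readonly ConcurrentDictionary<Guid, int> _intervals = new();

    // Returns true only for a new endpoint or a changed interval,
    // i.e. the only cases where a probe timer needs to be (re)started.
    // Assumes it is called from a single refresh loop, so the
    // read-then-write below does not need to be atomic.
    public bool ShouldRestart(Guid endpointId, int intervalSeconds)
    {
        if (_intervals.TryGetValue(endpointId, out int current) && current == intervalSeconds)
            return false;

        _intervals[endpointId] = intervalSeconds;
        return true;
    }

    // Forget an endpoint that was deleted or disabled.
    public void Remove(Guid endpointId) => _intervals.TryRemove(endpointId, out _);
}
```

With a check like this, the periodic refresh can still run every 15 seconds without resetting timers whose interval is unchanged, so probes keep their original phase instead of all firing together.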


5. Optimized StatusService

Resolved multiple severe performance issues.

Removed expensive full-table scans

Old query pattern:

  • GroupBy + OrderByDescending + FirstOrDefault

  • Triggered scans across millions of check_result_raw rows

Status now uses:

  • endpoint.LastStatus

  • endpoint.LastRttMs

Fixed sync-over-async thread pool starvation

Previous implementation called:

IsFlapping(endpoint.Id).Result

inside loops for every endpoint, blocking ASP.NET Core thread pool threads.

Replaced with:

  • GetFlappingEndpointIdsAsync(endpointIds)

  • Single batched query

  • HashSet<Guid>.Contains() lookups
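A rough sketch of the batched pattern, for comparison with the per-endpoint .Result call. StatusBuilder and EndpointStatus are hypothetical names, and the batched query is passed in as a delegate because the real GetFlappingEndpointIdsAsync implementation is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical result shape for illustration only.
public sealed record EndpointStatus(Guid Id, bool IsFlapping);

public static class StatusBuilder
{
    public static async Task<List<EndpointStatus>> BuildAsync(
        IReadOnlyCollection<Guid> endpointIds,
        Func<IReadOnlyCollection<Guid>, Task<HashSet<Guid>>> getFlappingEndpointIdsAsync)
    {
        // One awaited call for all endpoints, instead of blocking on
        // IsFlapping(id).Result once per endpoint inside the loop.
        HashSet<Guid> flapping = await getFlappingEndpointIdsAsync(endpointIds);

        var result = new List<EndpointStatus>(endpointIds.Count);
        foreach (Guid id in endpointIds)
            result.Add(new EndpointStatus(id, flapping.Contains(id))); // O(1) in-memory lookup
        return result;
    }
}
```

The request thread is released while the single query runs, and the loop itself does no I/O.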

Removed unused CountAsync()

Deleted a CountAsync() query whose result was never used.


6. Optimized RollupService

Fixed unbounded memory usage in rollup processing.

Previous behavior

Loaded entire tables into memory:

await _context.CheckResultsRaw.ToListAsync()

then filtered in C#.

New behavior

  • Date filtering moved into SQL WHERE

  • Added AsNoTracking()

  • Removed blocking .ToList()

  • Uses fully async queries

Only rows inside the rollup window are now read from SQLite, and they are not change-tracked, which reduces both memory usage and DB pressure.
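The key point is that the date filter is composed on the IQueryable before anything is materialized. A minimal sketch, assuming a Timestamp column (the entity shape and the InWindow helper are hypothetical, not taken from this PR):

```csharp
using System;
using System.Linq;

// Hypothetical entity shape; the real check_result_raw columns may differ.
public sealed class CheckResultRaw
{
    public Guid EndpointId { get; set; }
    public DateTime Timestamp { get; set; }
    public double? RttMs { get; set; }
}

public static class CheckResultQueries
{
    // Composes the date filter on the query itself. On an EF Core DbSet this
    // predicate is translated into a SQL WHERE clause, so only matching rows
    // leave the database. The window is half-open: [fromUtc, toUtc).
    public static IQueryable<CheckResultRaw> InWindow(
        this IQueryable<CheckResultRaw> source, DateTime fromUtc, DateTime toUtc)
        => source.Where(r => r.Timestamp >= fromUtc && r.Timestamp < toUtc);
}

// With EF Core the call site would then look something like:
//   await _context.CheckResultsRaw.AsNoTracking().InWindow(from, to).ToListAsync();
```

By contrast, calling ToListAsync() first and filtering afterwards runs the predicate in C# over every row in the table.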


7. Optimized HistoryService

Fixed all historical query methods, which previously loaded an endpoint's entire history before filtering.

Updated methods:

  • GetRawDataAsync

  • GetRollup15mDataAsync

  • GetRollupDailyDataAsync

  • GetOutagesAsync

Changes:

  • Date filtering moved into SQL

  • Added AsNoTracking()

  • Replaced synchronous list materialization with async equivalents


8. Added PruneBackgroundService

Previously, pruning logic existed but required manual invocation.

Added automated scheduled pruning service:

  • Runs daily at 02:00

  • Calls IPruneService.PruneRawDataAsync()

  • Reads config from Data:Pruning

  • Uses scoped service pattern for DbContext access

This prevents uncontrolled growth of check_result_raw.
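One way the "runs daily at 02:00" scheduling could be computed is sketched below. PruneSchedule and DelayUntilNextRun are hypothetical names, and whether 02:00 means local time or UTC is an assumption left to the caller, since the description does not say.

```csharp
using System;

// Hypothetical helper (not from this PR): computes how long to sleep until the next run.
public static class PruneSchedule
{
    // 'now' and 'runAt' must be in the same time zone (local or UTC; caller's choice).
    public static TimeSpan DelayUntilNextRun(DateTime now, TimeSpan runAt)
    {
        DateTime next = now.Date + runAt;
        if (next <= now)
            next = next.AddDays(1); // today's slot has passed (or is exactly now): use tomorrow's
        return next - now;
    }
}

// Inside a BackgroundService.ExecuteAsync loop this might be used roughly as:
//   await Task.Delay(PruneSchedule.DelayUntilNextRun(DateTime.Now, TimeSpan.FromHours(2)), stoppingToken);
//   using var scope = scopeFactory.CreateScope();   // fresh scope, and so a fresh DbContext, per run
//   var pruner = scope.ServiceProvider.GetRequiredService<IPruneService>();
//   await pruner.PruneRawDataAsync();               // arguments, if any, are not shown in the description
```

Creating a scope per run matters because a BackgroundService is a singleton, while DbContext is normally registered as scoped.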


9. Updated appsettings.json

| Setting | Before | After | Reason |
| -- | -- | -- | -- |
| Data.Retention.RawDays | 60 | 30 | Rollups preserve long-term history; old raw data is redundant |
| Data.Pruning.IntervalHours | 168 | 24 | Continuous pruning prevents runaway table growth |
| EF Core SQL logging | Enabled | Removed | Reduced memory/log overhead |

These should ideally be separated into independent PRs.

@abishekve abishekve changed the base branch from fix-1 to master May 12, 2026 11:37