[P1] Make notification scheduling and Firebase delivery reliable and observable #281

@jjoonleo

Description

Problem

Notification delivery and scheduled reminders are fragile. Reminder tasks exist only in process memory and are rebuilt only during boot recovery, Firebase failures are swallowed, and there is no retry or alerting strategy.

Why this is not production ready

A real reminder service must tolerate restarts, transient Firebase/provider failures, clock issues, and duplicate scheduling. Silent failures mean users can miss alarms without operators knowing.

Evidence

  • NotificationService.scheduledTasks is a ConcurrentHashMap<Long, ScheduledFuture<?>> in process memory.
  • NotificationRecoveryService rebuilds pending reminders only at application startup.
  • NotificationService.sendNotificationToUser catches all exceptions and calls e.printStackTrace() without marking failure or retrying.
  • sendReminder marks a notification as sent after calling sendNotificationToUser, but the called method does not report success/failure to the caller.

Required work

  • Persist delivery attempts, success/failure status, and provider response metadata.
  • Make sending return an explicit result and only mark sent on confirmed success.
  • Add retry/backoff for transient Firebase failures and cleanup for invalid tokens.
  • Add metrics/logging/alerts for failed or delayed notifications.
  • Consider a durable scheduler/queue for reminders instead of only in-memory ScheduledFuture tasks.
  • Define idempotency to avoid duplicate notifications after restart or retry.
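A minimal sketch of the "explicit result, mark sent only on confirmed success, retry with backoff" items above, using only the JDK. All names here (ReminderDeliverySketch, SendResult, DeliveryAttempt, deliverWithRetry) are hypothetical and not taken from the codebase. The mapping from Firebase errors to the Status values is an assumption, and the in-memory attempt list stands in for a persisted table.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

class ReminderDeliverySketch {

    /** Outcome categories. How provider errors map onto these is an assumption. */
    enum Status { SUCCESS, TRANSIENT_FAILURE, INVALID_TOKEN, PERMANENT_FAILURE }

    /** Explicit result handed back to the caller instead of a swallowed exception. */
    static final class SendResult {
        final Status status;
        final String providerDetail;
        SendResult(Status status, String providerDetail) {
            this.status = status;
            this.providerDetail = providerDetail;
        }
    }

    /** One record per attempt. A real version would persist this row. */
    static final class DeliveryAttempt {
        final long notificationId;
        final int attemptNumber;
        final SendResult result;
        DeliveryAttempt(long notificationId, int attemptNumber, SendResult result) {
            this.notificationId = notificationId;
            this.attemptNumber = attemptNumber;
            this.result = result;
        }
    }

    /** Exponential backoff: base * 2^(attempt-1), capped at max. */
    static Duration backoff(int attempt, Duration base, Duration max) {
        long factor = 1L << Math.min(Math.max(attempt - 1, 0), 20);
        Duration delay = base.multipliedBy(factor);
        return delay.compareTo(max) > 0 ? max : delay;
    }

    /**
     * Calls the sender until it returns anything other than TRANSIENT_FAILURE
     * or maxAttempts is reached. Every attempt is appended to attemptLog.
     * The caller marks the notification as sent only if status == SUCCESS.
     */
    static SendResult deliverWithRetry(long notificationId,
                                       Supplier<SendResult> sender,
                                       int maxAttempts,
                                       Duration base,
                                       Duration max,
                                       List<DeliveryAttempt> attemptLog) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SendResult last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            last = sender.get();
            attemptLog.add(new DeliveryAttempt(notificationId, attempt, last));
            if (last.status != Status.TRANSIENT_FAILURE) {
                return last; // success, invalid token, or permanent failure: stop retrying
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoff(attempt, base, max).toMillis());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt, give up early
                    return last;
                }
            }
        }
        return last;
    }
}
```

An INVALID_TOKEN result would be the hook for token cleanup. Blocking with Thread.sleep keeps the sketch short; with a durable scheduler or queue, the next attempt would more likely be re-enqueued with the computed delay.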

Acceptance criteria

  • Firebase send failures are visible, persisted, and retried when appropriate.
  • Invalid Firebase tokens are handled explicitly.
  • A process restart does not permanently lose future reminders.
  • Metrics or logs can answer: scheduled count, sent count, failed count, retry count, and stale pending count.
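One possible shape for the counters behind the last criterion, sketched with JDK classes only. NotificationCountersSketch and its members are hypothetical names. A real deployment would more likely register these with whatever metrics library the service already uses, and the "stale" threshold (the grace parameter) is an assumption to be decided.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collection;
import java.util.concurrent.atomic.LongAdder;

class NotificationCountersSketch {
    // Incremented at the corresponding points in the scheduling and sending code.
    final LongAdder scheduled = new LongAdder();
    final LongAdder sent = new LongAdder();
    final LongAdder failed = new LongAdder();
    final LongAdder retried = new LongAdder();

    /** Counts pending reminders whose due time is more than `grace` before `now`. */
    static long stalePending(Collection<Instant> pendingDueTimes, Instant now, Duration grace) {
        Instant cutoff = now.minus(grace);
        return pendingDueTimes.stream().filter(due -> due.isBefore(cutoff)).count();
    }

    /** Single log line answering the five questions in the acceptance criteria. */
    String snapshot(long stalePendingCount) {
        return String.format(
            "scheduled=%d sent=%d failed=%d retried=%d stalePending=%d",
            scheduled.sum(), sent.sum(), failed.sum(), retried.sum(), stalePendingCount);
    }
}
```

A non-zero stale pending count is the signal that most directly corresponds to missed alarms, so it is a natural candidate for an alert threshold.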

Metadata

Assignees

No one assigned

    Labels

    • area:notifications (Push notifications and scheduling)
    • area:stability (Reliability and runtime stability)
    • priority:P1 (High: should be resolved before production launch)
    • production-readiness (Production readiness audit item)
    • type:hardening (Security/stability hardening task)

