Skip to content

Add resilience metrics and improved retry/fallback logging#150

Open
dougborg wants to merge 6 commits intoFoggedLens:mainfrom
dougborg:feat/resilience-metrics
Open

Add resilience metrics and improved retry/fallback logging#150
dougborg wants to merge 6 commits intoFoggedLens:mainfrom
dougborg:feat/resilience-metrics

Conversation

@dougborg
Copy link
Collaborator

Summary

  • Add ResilienceMetrics singleton for tracking retry/fallback behavior (request counts, success rates, error types, latency)
  • Instrument executeWithEndpointList and _executeWithRetries with structured debug logging (service name, attempt count, error type, disposition)
  • Introduce _ClassifiedError wrapper so callers receive disposition without re-classifying errors
  • Remove deprecated executeWithFallback (no remaining callers)

Stacked on #146 — will rebase after that merges.

Test plan

  • flutter analyze passes
  • All service_policy_test.dart tests pass (67 tests)
  • All overpass_service_test.dart tests pass (22 tests)
  • New ResilienceMetrics test group covers: primary success, fallback success, retries, abort, total failure, error types, summary format

🤖 Generated with Claude Code

dougborg and others added 6 commits March 8, 2026 22:10
Extract duplicated retry logic into shared executeWithFallback() with
ErrorDisposition-based classification (abort/fallback/retry). Services
use hard-coded primary+fallback endpoint pairs:
- OverpassService: overpass.deflock.org + overpass-api.de
- RoutingService: dontgetflocked.com + alprwatch.org

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split the combined null/abort guard so Dart's flow analysis promotes
fallbackUrl to non-null, eliminating the unnecessary_non_null_assertion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evolve from hard-coded primary/fallback endpoint pairs to a user-configurable
service registry. Each endpoint carries its own name, URL, enabled state, and
optional retry/timeout overrides.

- ServiceEndpoint model with JSON serialization and sensible defaults
- ServiceRegistry<T> generic ordered list with SharedPreferences persistence
- executeWithEndpointList() iterates enabled endpoints with per-endpoint policy
- OverpassService and RoutingService accept List<ServiceEndpoint>
- SettingsState uses ServiceRegistry instances with migration from old format
- API Endpoints UI section for managing endpoint registries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ResilienceMetrics singleton to track request counts, retry/fallback
invocations, error types, and latency. Instrument executeWithEndpointList
and _executeWithRetries with structured debug logging that includes
service name, attempt count, error type, and disposition.

Remove deprecated executeWithFallback (no remaining callers).
Introduce _ClassifiedError wrapper so callers receive the disposition
from _executeWithRetries without re-classifying errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 12, 2026 06:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a generalized multi-endpoint resilience layer (retry + fallback) with in-memory metrics, and wires endpoint registries through settings/AppState plus a new Advanced Settings UI for managing API endpoints.

Changes:

  • Introduce executeWithEndpointList + ResiliencePolicy/ErrorDisposition + ResilienceMetrics instrumentation for retry/fallback flows.
  • Add ServiceRegistry<T> + ServiceEndpoint models, settings migration from legacy single-string endpoint prefs, and AppState getters.
  • Add an “API Endpoints” settings section UI and expand tests to cover registries, migrations, and fallback behavior for routing/overpass.

Reviewed changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/state/settings_state_test.dart Adds coverage for endpoint registries + migration behavior in SettingsState.
test/state/service_registry_test.dart New test suite covering persistence/CRUD/reorder/reset semantics for ServiceRegistry.
test/services/service_policy_test.dart Adds tests for ResiliencePolicy, executeWithEndpointList, and ResilienceMetrics.
test/services/routing_service_test.dart Updates routing tests for explicit endpoints and adds fallback behavior coverage.
test/services/overpass_service_test.dart Updates overpass tests for explicit endpoints and adds fallback behavior coverage.
test/models/service_endpoint_test.dart Adds JSON/copyWith/equality/default-endpoints tests for ServiceEndpoint.
pubspec.lock Updates locked transitive dependency versions/hashes.
lib/state/settings_state.dart Adds endpoint registries, migration from legacy keys, and new getters for endpoints/registries.
lib/state/service_registry.dart Introduces generic persisted ordered-list registry abstraction with mutation helpers.
lib/state/navigation_state.dart Updates routing log messaging to be service-agnostic.
lib/services/service_policy.dart Adds resilience primitives + execution engine + metrics tracking + structured debug logging.
lib/services/routing_service.dart Switches routing to use multi-endpoint retry/fallback via executeWithEndpointList.
lib/services/overpass_service.dart Switches overpass to use multi-endpoint retry/fallback via executeWithEndpointList.
lib/screens/settings/sections/api_endpoints_section.dart New UI for viewing/reordering/toggling/adding/resetting endpoint registries.
lib/screens/advanced_settings_screen.dart Adds the new API Endpoints section to advanced settings UI.
lib/models/service_registry_entry.dart Defines shared interface for registry-managed entries.
lib/models/service_endpoint.dart Adds endpoint model + built-in defaults for routing/overpass.
lib/app_state.dart Exposes endpoint registries/endpoints from SettingsState through AppState.
lib/dev_config.dart Marks old routing timeout as legacy and aligns timeout value with policy defaults.
lib/localizations/en.json Adds strings for API endpoint settings UI.
lib/localizations/de.json Adds strings for API endpoint settings UI.
lib/localizations/es.json Adds strings for API endpoint settings UI.
lib/localizations/fr.json Adds strings for API endpoint settings UI.
lib/localizations/it.json Adds strings for API endpoint settings UI.
lib/localizations/nl.json Adds strings for API endpoint settings UI.
lib/localizations/pl.json Adds strings for API endpoint settings UI.
lib/localizations/pt.json Adds strings for API endpoint settings UI.
lib/localizations/tr.json Adds strings for API endpoint settings UI.
lib/localizations/uk.json Adds strings for API endpoint settings UI.
lib/localizations/zh.json Adds strings for API endpoint settings UI.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +441 to +446
final effectivePolicy = ResiliencePolicy(
maxRetries: endpoint.maxRetries ?? defaultPolicy.maxRetries,
httpTimeout: endpoint.timeoutSeconds != null
? Duration(seconds: endpoint.timeoutSeconds!)
: defaultPolicy.httpTimeout,
retryBackoffBase: defaultPolicy.retryBackoffBase,
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResiliencePolicy.httpTimeout is computed per-endpoint (including timeoutSeconds overrides) but is never enforced by executeWithEndpointList/_executeWithRetries. As a result, endpoint timeout overrides (and even defaultPolicy.httpTimeout) have no effect unless each execute callback manually applies the timeout, which contradicts the function docs. Consider wrapping execute(url) with .timeout(policy.httpTimeout) inside _executeWithRetries, or change the API/docs so the execute callback receives the effective policy and is responsible for applying timeouts consistently.

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +83
onReorder: (oldIndex, newIndex) {
registry.reorder(oldIndex, newIndex);
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

registry.reorder(oldIndex, newIndex) returns a Future, but it’s called without await/error handling. If persistence fails or an exception is thrown, it will become an unhandled async error and the UI will not be able to report it. Make this callback async and await the reorder (or use unawaited(...).catchError(...) and surface a SnackBar on failure).

Suggested change
onReorder: (oldIndex, newIndex) {
registry.reorder(oldIndex, newIndex);
onReorder: (oldIndex, newIndex) async {
try {
await registry.reorder(oldIndex, newIndex);
} catch (error) {
ScaffoldMessenger.of(context).showSnackBar(
const SnackBar(
content: Text('Failed to reorder endpoints'),
),
);
}

Copilot uses AI. Check for mistakes.
Comment on lines +265 to +273
onPressed: () {
try {
registry.delete(endpoint.id);
} on StateError catch (e) {
ScaffoldMessenger.of(context).showSnackBar(
SnackBar(content: Text(e.message)),
);
}
},
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These registry mutations are async but not awaited. In particular, the try { registry.delete(...) } on StateError block won’t catch the StateError because it will be raised on the returned Future, leading to unhandled async exceptions and no SnackBar. Make the handler async, await registry.delete(...), and catch errors from the awaited call (same applies to addOrUpdate/resetToDefaults).

Copilot uses AI. Check for mistakes.
Comment on lines +58 to +74
ServiceEndpoint copyWith({
String? id,
String? name,
String? url,
bool? enabled,
bool? isBuiltIn,
int? maxRetries,
int? timeoutSeconds,
}) => ServiceEndpoint(
id: id ?? this.id,
name: name ?? this.name,
url: url ?? this.url,
enabled: enabled ?? this.enabled,
isBuiltIn: isBuiltIn ?? this.isBuiltIn,
maxRetries: maxRetries ?? this.maxRetries,
timeoutSeconds: timeoutSeconds ?? this.timeoutSeconds,
);
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServiceEndpoint.copyWith can’t clear nullable override fields (maxRetries, timeoutSeconds) back to null because it uses maxRetries ?? this.maxRetries (same for timeoutSeconds). The codebase uses a sentinel _notProvided pattern to distinguish “not provided” vs “explicitly set to null” (e.g., NodeProfile.copyWith). Consider adopting the same pattern here so UI can remove an override once set.

Copilot uses AI. Check for mistakes.
@dougborg dougborg added this to the 3.0 milestone Mar 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants