An educational example demonstrating how to design evaluation, observability, and feedback loops for production AI systems in a clean, minimal C# console app.
This project intentionally keeps infrastructure simple while modeling the key production concepts:
- Offline evaluation and quality gates
- Online observability (latency, tokens, error rate)
- Feedback loops that update the prompt policy based on failures
Production AI systems are not just "prompt in, answer out". They need:
- Evaluation to measure quality before and after changes
- Observability to see latency, safety, and drift in real time
- Feedback loops to convert failures into improvements
This example shows a tiny, local pipeline that covers all three, with a two-pass run (baseline -> apply feedback -> re-run).
- A simple evaluation suite with relevance (keyword) and safety (forbidden term) scoring
- A minimal telemetry sink that records spans, wall latency, model latency, and token counts
- Quality gates that block releases when the pass rate or safety score drops
- A feedback processor that updates a prompt policy when failures occur
- Optional integration with a local Ollama model via OllamaSharp (falls back to a mock model)
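The two scoring rules above (keyword relevance and forbidden-term safety) can be sketched in a few lines. This is a minimal illustration; the method names and signatures are assumptions, not the project's exact code:

```csharp
using System;
using System.Linq;

// Minimal sketch of keyword-based relevance and forbidden-term safety scoring.
public static class Scoring
{
    // Relevance: fraction of expected keywords found in the answer (case-insensitive).
    public static double Relevance(string answer, string[] expectedKeywords) =>
        expectedKeywords.Length == 0
            ? 1.0
            : expectedKeywords.Count(k =>
                  answer.Contains(k, StringComparison.OrdinalIgnoreCase))
              / (double)expectedKeywords.Length;

    // Safety: 1.0 if no forbidden term appears in the answer, otherwise 0.0.
    public static double Safety(string answer, string[] forbiddenTerms) =>
        forbiddenTerms.Any(t =>
            answer.Contains(t, StringComparison.OrdinalIgnoreCase)) ? 0.0 : 1.0;
}
```

A case would then pass when relevance clears some threshold and safety stays at 1.0; the thresholds themselves live in the evaluation configuration.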
- .NET 10 SDK or later https://dotnet.microsoft.com/
Optional:
- Ollama installed and running locally https://ollama.ai/
Run the app from the repo root (uses a deterministic mock model by default):
dotnet run --project EvalObservabilityFeedbackLoops

To run against a local Ollama model instead, set these environment variables and run the app:
set USE_OLLAMA=true
set OLLAMA_URL=http://localhost:11434
set OLLAMA_MODEL=llama3.2:3b
dotnet run --project EvalObservabilityFeedbackLoops

If Ollama is unavailable, the app falls back to the mock model. The Ollama client streams responses using OllamaSharp and aggregates them into a single response.
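The fallback behavior can be modeled as a small decorator over a model-client interface: try the primary (Ollama-backed) client, and hand the prompt to the mock on any failure. The interface and class names below are illustrative assumptions, not the project's exact code:

```csharp
using System;
using System.Threading.Tasks;

// Shared abstraction over real and mock models (names are assumptions).
public interface IModelClient
{
    Task<string> CompleteAsync(string prompt);
}

// Deterministic mock used when no real model is available.
public sealed class MockModelClient : IModelClient
{
    public Task<string> CompleteAsync(string prompt) =>
        Task.FromResult($"[mock] deterministic answer for: {prompt}");
}

// Decorator: try the primary client, fall back to the mock on failure.
public sealed class FallbackModelClient : IModelClient
{
    private readonly IModelClient _primary;   // e.g. an OllamaSharp-backed client
    private readonly IModelClient _fallback;  // the mock model

    public FallbackModelClient(IModelClient primary, IModelClient fallback) =>
        (_primary, _fallback) = (primary, fallback);

    public async Task<string> CompleteAsync(string prompt)
    {
        try { return await _primary.CompleteAsync(prompt); }
        catch (Exception) { return await _fallback.CompleteAsync(prompt); }
    }
}
```

Keeping both models behind one interface is what lets the evaluation suite and telemetry code stay identical regardless of which backend answered.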
- Per-case evaluation scores and notes
- Telemetry snapshot (avg latency, p95 latency, tokens)
- Quality gate decision (deploy or block)
- Prompt policy updates from feedback events
- Before vs after summary (PASS 1 vs PASS 2)
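A telemetry sink producing that snapshot can be as small as an in-memory list of spans plus an aggregation method. A sketch, with field and method names that are assumptions rather than the project's exact code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One span per model call: wall-clock latency, model latency, token count.
public sealed record Span(string Name, double WallMs, double ModelMs, int Tokens);

// Minimal in-memory telemetry sink with an aggregate snapshot.
public sealed class TelemetrySink
{
    private readonly List<Span> _spans = new();

    public void Record(Span span) => _spans.Add(span);

    // Average latency, nearest-rank p95 latency, and total tokens.
    public (double AvgMs, double P95Ms, int TotalTokens) Snapshot()
    {
        var sorted = _spans.Select(s => s.WallMs).OrderBy(x => x).ToArray();
        int rank = Math.Max((int)Math.Ceiling(0.95 * sorted.Length) - 1, 0);
        return (sorted.Average(), sorted[rank], _spans.Sum(s => s.Tokens));
    }
}
```

In production the `Record` call would forward spans to a real observability backend instead of a list, but the shape of the data stays the same.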
The evaluation uses simple demo thresholds:
- Pass rate >= 90%
- Average safety score >= 1.0
- P95 latency < 1500 ms
These are intentionally conservative for demonstration and can be tuned per environment.
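The gate decision itself reduces to a few comparisons over aggregated results. A sketch using the thresholds above (the record shape and names are assumptions):

```csharp
using System;
using System.Linq;

// Aggregated evaluation results fed into the gate (names are assumptions).
public sealed record GateInput(double PassRate, double AvgSafety, double[] LatenciesMs);

public static class QualityGate
{
    // Nearest-rank p95: the latency at or above 95% of sorted samples.
    public static double P95(double[] latenciesMs)
    {
        var sorted = latenciesMs.OrderBy(x => x).ToArray();
        int rank = Math.Max((int)Math.Ceiling(0.95 * sorted.Length) - 1, 0);
        return sorted[rank];
    }

    // Deploy only if every demo threshold from the README holds.
    public static bool ShouldDeploy(GateInput input) =>
        input.PassRate >= 0.90
        && input.AvgSafety >= 1.0
        && P95(input.LatenciesMs) < 1500.0;
}
```

Because all three conditions are AND-ed, a single regression (e.g. one unsafe answer dragging the average safety below 1.0) is enough to block a release.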
- PASS 1 runs the evaluation suite with the initial policy.
- Failures produce feedback events that add constraints to the policy.
- PASS 2 re-runs the suite with the updated policy.
This is a simple stand-in for continuous improvement loops in production.
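The policy update between the two passes can be sketched as appending de-duplicated constraints derived from failure notes, then rendering them into the prompt. The `PromptPolicy` shape below is an illustrative assumption, not the project's exact code:

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative prompt policy: a base prompt plus accumulated constraints.
public sealed class PromptPolicy
{
    public List<string> Constraints { get; } = new();

    public string Render(string basePrompt) =>
        Constraints.Count == 0
            ? basePrompt
            : basePrompt + "\nConstraints:\n- " + string.Join("\n- ", Constraints);
}

public static class FeedbackLoop
{
    // Each failing case contributes a constraint (e.g. "Avoid the term 'X'").
    // De-duplicate so re-running the suite does not grow the prompt unboundedly.
    public static void Apply(PromptPolicy policy, IEnumerable<string> failureNotes)
    {
        foreach (var note in failureNotes.Distinct())
            if (!policy.Constraints.Contains(note))
                policy.Constraints.Add(note);
    }
}
```

PASS 2 then renders every prompt through the updated policy, which is why the before/after summary can attribute any score change to the feedback step alone.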
.
+-- EvalObservabilityFeedbackLoops.slnx
+-- EvalObservabilityFeedbackLoops/
| +-- EvalObservabilityFeedbackLoops.csproj
| +-- Program.cs
| +-- EvaluationCase.cs
| +-- EvaluationResult.cs
| +-- EvaluationSuite.cs
| +-- FeedbackProcessor.cs
| +-- ModelClients.cs
| +-- PromptPolicy.cs
| +-- TelemetrySink.cs
+-- LICENSE
+-- README.md
The example is intentionally small but maps cleanly to real systems:
- Swap the mock model for a real one
- Stream telemetry to your observability stack
- Store feedback events and label them for retraining
See the LICENSE file for details.
Contributions are welcome. If you'd like to extend the demo, consider:
- Adding new evaluation cases or scoring rules
- Wiring telemetry into your preferred observability stack
- Expanding feedback signals (human review, regression labels)
Open a PR or issue with a clear description of the change.