Observability & SLI/SLO Strategy
Principle: Observe everything, alert on what matters. No alert fatigue.
Stack: OpenTelemetry (instrumentation) + Serilog (logging) + Azure Monitor/App Insights (backend)
Constraint: 5 engineers → no dedicated SRE. Observability must be self-service.
Source: HA.md, Deployment.md, Architect.md, Deliverable 4.3 - Failure Modeling.md
1. Observability Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY STACK │
│ │
│ ┌─ INSTRUMENTATION ────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Travel │ │ Event │ │ Workforce│ │ Comms │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ OTel SDK │ │ OTel SDK │ │ OTel SDK │ │ OTel SDK │ │ │
│ │ │ Serilog │ │ Serilog │ │ Serilog │ │ Serilog │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ │ └──────────────┴──────────────┴──────────────┘ │ │
│ │ │ │ │
│ └──────────────────────────┼────────────────────────────────────┘ │
│ │ OTLP (gRPC) │
│ ▼ │
│ ┌─ COLLECTION & STORAGE ───────────────────────────────────────┐ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ Azure Monitor │ │ Application │ │ │
│ │ │ (Metrics) │ │ Insights │ │ │
│ │ │ │ │ (Logs + Traces) │ │ │
│ │ └────────┬─────────┘ └────────┬─────────┘ │ │
│ │ └──────────┬──────────┘ │ │
│ └──────────────────────┼────────────────────────────────────────┘ │
│ ▼ │
│ ┌─ VISUALIZATION & ALERTING ───────────────────────────────────┐ │
│ │ │ │
│ │ Azure Dashboard (operational) │ │
│ │ Azure Workbooks (detailed analysis) │ │
│ │ Alerts → Slack #incidents + PagerDuty (on-call) │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
2. Three Pillars
2.1 Logs (Serilog → App Insights)
// Structured logging standard — ALL services
Log.Information(
"Booking {BookingId} created by {UserId} for {FlightId}, amount {Amount}",
booking.Id, userId, flightId, amount);
// RULES:
// ✓ Structured message templates (not string interpolation)
// ✓ Context-rich (IDs, amounts, status)
// ✗ NEVER log PII (email, phone, passport, card numbers)
// ✗ NEVER log secrets (tokens, connection strings)
| Log Level | Usage | Example |
|---|---|---|
| Fatal | Service cannot start / unrecoverable | DB connection failed on startup |
| Error | Request failed, needs attention | Payment ACL returned 500 |
| Warning | Degraded but functional | Circuit breaker tripped, using cache |
| Information | Business events | Booking created, event published |
| Debug | Developer troubleshooting (off in prod) | Query parameters, cache hit/miss |
PII Masking Policy (Serilog Destructuring):
// Registered at startup — automatic masking
.Destructure.ByTransforming<UserDto>(u => new {
u.Id,
Email = MaskEmail(u.Email), // j***@example.com
Phone = "***MASKED***",
u.Department, u.Role // Safe to log
})
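The MaskEmail helper referenced above is not defined in this document; a minimal sketch of the intended masking logic (shown in Python for illustration, the exact rules are an assumption):

```python
def mask_email(email: str) -> str:
    """Mask the local part of an email, keeping only its first character."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a well-formed address: mask everything rather than risk leaking it.
        return "***MASKED***"
    return f"{local[0]}***@{domain}"
```

For example, mask_email("john@example.com") yields "j***@example.com", matching the comment in the snippet above.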
2.2 Metrics (OpenTelemetry → Azure Monitor)
Standard Metrics per Service:
| Metric | Type | Labels | Alert Threshold |
|---|---|---|---|
| http_request_duration_seconds | Histogram | method, path, status_code | P95 > 500ms |
| http_requests_total | Counter | method, path, status_code | Error rate > 5% |
| db_query_duration_seconds | Histogram | query_type, table | P95 > 100ms |
| servicebus_messages_processed | Counter | topic, status | Dead-letter > 10/hour |
| servicebus_message_lag | Gauge | subscription | Lag > 1000 messages |
| circuit_breaker_state | Gauge | target_service | State = Open > 5 min |
| active_connections | Gauge | service | > 80% of pool |
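The error-rate threshold is a ratio derived from the http_requests_total counter; a small sketch of that derivation (illustrative logic, not the actual Azure Monitor query):

```python
def error_rate(status_counts: dict[int, int]) -> float:
    """Fraction of requests that returned a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if code >= 500)
    return errors / total

# 50 errors out of 1000 requests is a 5% error rate, right at the alert threshold.
```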
Custom Business Metrics:
| Metric | Service | Purpose |
|---|---|---|
| bookings_created_total | Travel | Business throughput |
| bookings_cancelled_total | Travel | Cancellation rate monitoring |
| events_capacity_utilization | Event | Capacity planning |
| notifications_sent_total | Comms | Communication volume |
| acl_latency_seconds | ACL | Legacy bridge health |
| report_generation_duration | Reporting | Performance monitoring |
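The two Travel counters combine into the cancellation rate the table mentions; a sketch of that derived metric:

```python
def cancellation_rate(bookings_cancelled_total: int,
                      bookings_created_total: int) -> float:
    """Cancellations as a fraction of bookings created in the same window."""
    if bookings_created_total == 0:
        return 0.0
    return bookings_cancelled_total / bookings_created_total
```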
2.3 Traces (OpenTelemetry → App Insights)
Distributed Trace Example: Create Booking
TraceId: abc123...
│
├─ [YARP Gateway] POST /api/bookings 12ms
│ ├─ JWT validation 2ms
│ └─ Route to Travel Service
│
├─ [Travel Service] CreateBookingHandler 145ms
│ ├─ Validate request (FluentValidation) 5ms
│ ├─ Check availability (DB query) 35ms
│ ├─ Create Booking aggregate 3ms
│ ├─ Save to DB (EF Core) 42ms
│ ├─ Publish BookingCreated event 15ms
│ └─ Call Payment ACL 45ms
│ ├─ [ACL] AuthorizePayment 40ms
│ │ └─ [Legacy] POST /payment/auth 35ms
│ └─ Return result 5ms
│
├─ [Comms Service] Handle BookingCreated 25ms (async)
│ ├─ Resolve notification template 5ms
│ ├─ Send email 18ms
│ └─ Log notification sent 2ms
│
Total: 182ms (sync path: 157ms)
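The totals line can be reproduced by summing the top-level spans, treating the Comms handler as asynchronous (a sketch of the arithmetic, not trace tooling):

```python
def trace_totals(spans: list[tuple[str, int, bool]]) -> tuple[int, int]:
    """Return (sync_path_ms, total_ms) from (name, duration_ms, is_async) spans."""
    sync = sum(ms for _, ms, is_async in spans if not is_async)
    total = sync + sum(ms for _, ms, is_async in spans if is_async)
    return sync, total

# Gateway 12ms + Travel 145ms on the sync path; Comms 25ms handled async.
```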
Trace Configuration:
// Program.cs — OpenTelemetry setup (every service)
builder.Services.AddOpenTelemetry()
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddEntityFrameworkCoreInstrumentation()
.AddSource("ServiceBus.Consumer")
.AddOtlpExporter())
.WithMetrics(metrics => metrics
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddOtlpExporter());
3. SLI/SLO Definitions
3.1 Service-Level Indicators (SLIs)
| SLI | Definition | Measurement |
|---|---|---|
| Availability | % of successful HTTP responses (non-5xx) | 1 - (5xx responses / total responses) |
| Latency | P95 response time for API requests | histogram_quantile(0.95, http_request_duration) |
| Throughput | Requests processed per second | rate(http_requests_total[5m]) |
| Error Rate | % of requests resulting in error | 5xx responses / total responses |
| Freshness | Data sync lag (CDC → new service DB) | now() - last_cdc_event_timestamp |
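The availability SLI in the table is computed directly from response counts; a sketch:

```python
def availability_sli(total_responses: int, responses_5xx: int) -> float:
    """Availability = 1 - (5xx responses / total responses)."""
    if total_responses == 0:
        return 1.0  # no traffic counts as available (a policy assumption)
    return 1 - responses_5xx / total_responses
```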
3.2 Service-Level Objectives (SLOs)
| Service | Availability SLO | Latency SLO (P95) | Error Budget (30-day) |
|---|---|---|---|
| Travel | 99.9% | < 200ms | 43.2 min downtime |
| Event | 99.9% | < 200ms | 43.2 min downtime |
| Workforce | 99.5% | < 300ms | 3.6 hours downtime |
| Comms | 99.5% | < 500ms (async OK) | 3.6 hours downtime |
| Reporting | 99.0% | < 1000ms (complex queries) | 7.2 hours downtime |
| Payment ACL | 99.95% | < 150ms | 21.6 min downtime |
| YARP Gateway | 99.99% | < 50ms (routing only) | 4.3 min downtime |
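The error-budget column follows directly from the SLO over a 30-day window; a quick check of the arithmetic:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% -> 43.2 min, 99.5% -> 216 min (3.6 h), 99.99% -> ~4.3 min
```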
Why are SLOs different per service?
Travel + Event: Core business operations. User-facing, real-time.
→ 99.9% — every minute of downtime means lost bookings and lost trust.
Workforce + Comms: Important but not real-time critical.
→ 99.5% — a 5-minute notification delay is acceptable; so is a delayed workforce report.
Reporting: Batch-oriented, complex queries, non-blocking.
→ 99.0% — a report delayed by 30 minutes does not impact operations.
Payment ACL: Strictest of the backend services — a payment failure is direct revenue loss.
→ 99.95% — tighter than Travel because payments are money.
Gateway: Highest of all — single point of entry.
→ 99.99% — gateway down means everything is down.
3.3 Error Budget Policy
Error Budget = 100% - SLO
Example: Travel Service (SLO = 99.9%)
Error budget = 0.1% = 43.2 minutes per 30-day window
POLICY:
Budget > 50% remaining: Normal development, deploy freely
Budget 20–50% remaining: Caution — reduce deploy frequency, review changes
Budget < 20% remaining: FREEZE — no new features, reliability fixes only
Budget = 0 (exhausted): Full freeze. Incident review. Resume when budget renews.
WHO DECIDES: D1 (Tech Lead) reviews error budget weekly.
D5 monitors dashboard daily.
4. Alerting Strategy
4.1 Alert Severity & Response
| Severity | Criteria | Notification | Response Time | Example |
|---|---|---|---|---|
| P1 Critical | Service down, data loss, payment impact | PagerDuty (phone call) + Slack | < 15 min | Gateway 5xx > 50%, Payment ACL timeout |
| P2 High | SLO at risk, degraded performance | Slack #incidents + mention on-call | < 1 hour | Travel P95 > 500ms for 10 min |
| P3 Medium | Anomaly, approaching limits | Slack #monitoring | < 4 hours | DB connection pool > 70%, disk > 80% |
| P4 Low | Informational, optimization opportunity | Slack #monitoring (no mention) | Next business day | Cache hit ratio < 60%, slow query detected |
4.2 Alert Rules
# Alert definitions (Azure Monitor / Bicep)
alerts:
# P1: Service availability
- name: "Gateway Down"
severity: P1
condition: availability(yarp-gateway) < 99% over 5min
action: pagerduty + slack-critical
- name: "Payment ACL Failure"
severity: P1
condition: error_rate(payment-acl) > 10% over 2min
action: pagerduty + slack-critical
# P2: SLO at risk
- name: "Travel Latency Spike"
severity: P2
condition: p95_latency(travel-service) > 500ms over 10min
action: slack-incidents
- name: "Service Bus Dead Letters Rising"
severity: P2
condition: dead_letter_count(any-topic) > 50 over 15min
action: slack-incidents
# P3: Resource warnings
- name: "DB Connection Pool High"
severity: P3
condition: active_connections(any-sql) > 80% of max
action: slack-monitoring
- name: "Container CPU High"
severity: P3
condition: cpu_usage(any-container) > 80% over 15min
action: slack-monitoring
# P4: Informational
- name: "CDC Lag Detected"
severity: P4
condition: cdc_lag(any-service) > 30sec
action: slack-monitoring (no mention)
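Conditions like "error_rate > 10% over 2min" mean the threshold is breached for the entire duration, not just once; a sketch of that sustained-breach check (illustrative, not the Azure Monitor evaluation engine):

```python
def sustained_breach(samples: list[tuple[float, float]], threshold: float,
                     duration_seconds: float, now: float) -> bool:
    """True if every (timestamp, value) sample inside the look-back window exceeds the threshold."""
    window = [value for ts, value in samples if now - ts <= duration_seconds]
    return bool(window) and all(value > threshold for value in window)
```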
4.3 Anti-Alert-Fatigue Rules
RULE 1: Every alert MUST have a documented action.
"What do I DO when this fires?" → if no answer → delete alert.
RULE 2: Flapping alerts (fire+resolve > 3 times in 1 hour) → auto-suppress.
Review threshold. Likely too sensitive.
RULE 3: Maximum 5 P1/P2 alerts per service.
More = noise. Consolidate or downgrade.
RULE 4: P3/P4 alerts suppressed outside business hours (8AM–8PM ICT).
P1/P2: 24/7.
RULE 5: Weekly alert review: how many fired? How many were actionable?
Target: >80% of alerts require action. <80% = tuning needed.
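Rule 2's flapping check can be expressed as a small sliding-window test (a sketch; it counts fire events as a proxy for fire+resolve cycles, and the parameter names are illustrative):

```python
def should_suppress(fire_timestamps: list[float], now: float,
                    window_seconds: float = 3600, max_fires: int = 3) -> bool:
    """Auto-suppress an alert that fired more than max_fires times in the window."""
    recent = [t for t in fire_timestamps if now - t <= window_seconds]
    return len(recent) > max_fires
```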
5. Dashboards
5.1 Dashboard Hierarchy
┌───────────────────────────────────────────────────────┐
│ LEVEL 1: Executive Dashboard (Azure Portal) │
│ Audience: Management, D1 │
│ Refresh: Real-time │
│ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Overall │ │Error │ │Active │ │SLO │ │
│ │Avail. │ │Rate │ │Users │ │Status │ │
│ │ 99.97% │ │ 0.03% │ │ 1,247 │ │ 6/7 ✅ │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
└───────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────┐
│ LEVEL 2: Service Dashboard (per service) │
│ Audience: D1–D5 (service owners) │
│ Refresh: Real-time │
│ │
│ Request rate │ Latency (P50/P95/P99) │ Error rate │
│ DB queries │ Cache hit ratio │ Queue depth │
│ CPU / Memory │ Connection pool │ Active traces │
└───────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────┐
│ LEVEL 3: Debug Dashboard (Azure Workbooks) │
│ Audience: On-call engineer during incident │
│ Content: Distributed traces, slow queries, │
│ Service Bus dead letters, dependency map │
└───────────────────────────────────────────────────────┘
5.2 SLO Burn Rate Dashboard
Travel Service — 30-Day SLO Window
SLO: 99.9% availability
Error Budget: 43.2 minutes
Used: 12.5 minutes (29%)
Remaining: 30.7 minutes (71%)
┌──────────────────────────────────────────┐
│ ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 29% used
│ ◄──── used ────►◄──── remaining ────────►│
└──────────────────────────────────────────┘
Burn Rate: 0.97x (on track — budget lasts full window)
Status: ✅ HEALTHY
Action: None required
Events this window:
Mar 3 — 5min outage (deployment issue, rolled back)
Mar 8 — 7.5min degraded (DB connection pool spike)
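Burn rate compares actual budget consumption to the ideal, even pace across the window; a sketch of the calculation (the elapsed-days figure in the example is an assumption, not stated above):

```python
def burn_rate(used_minutes: float, budget_minutes: float,
              elapsed_days: float, window_days: int = 30) -> float:
    """>1.0 means the budget will be exhausted before the window ends."""
    ideal_used = budget_minutes * elapsed_days / window_days
    return used_minutes / ideal_used

# e.g. 12.5 min used of a 43.2 min budget, roughly 9 days into the window
```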
6. Correlation & Troubleshooting
6.1 Correlation ID Flow
Browser → YARP → Service → Service Bus → Consumer Service → DB
│ │ │ │ │ │
└─────────┴────────┴──────────┴──────────────┴───────────┘
Same CorrelationId / TraceId
Implementation:
1. YARP generates TraceId if not present (X-Trace-Id header)
2. Each service propagates via OpenTelemetry context
3. Service Bus messages carry TraceId in properties
4. Logs include TraceId in every entry → searchable in App Insights
Query: traces | where customDimensions.TraceId == "abc123"
→ Shows full request journey across all services
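Step 1 above (generate a trace ID only when the incoming request lacks one) is the key to end-to-end correlation; a language-agnostic sketch in Python (the header name comes from the text, the generation scheme is an assumption):

```python
import uuid

def ensure_trace_id(headers: dict[str, str]) -> dict[str, str]:
    """Propagate an existing X-Trace-Id, or mint a new one at the edge."""
    headers = dict(headers)  # do not mutate the caller's dict
    if "X-Trace-Id" not in headers:
        headers["X-Trace-Id"] = uuid.uuid4().hex
    return headers
```

Every downstream hop (service, Service Bus message, log entry) then reuses this value unchanged.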
6.2 Troubleshooting Runbook
SYMPTOM: Travel Service P95 latency > 500ms
STEP 1: Check Dashboard Level 2 (Travel)
→ Is it ALL endpoints or specific one?
STEP 2: Check Dependencies
→ DB query time normal? (db_query_duration metric)
→ Payment ACL responding? (acl_latency metric)
→ Service Bus publishing OK? (publish_duration metric)
STEP 3: Check Infrastructure
→ CPU > 80%? → Scale up (auto-scale should handle)
→ Memory > 80%? → Possible leak, check container restart count
→ DB connection pool exhausted? → Connection leak
STEP 4: Distributed Trace
→ Find slow trace in App Insights
→ Identify which span is slow
→ Fix or escalate
STEP 5: If ACL is slow
→ Legacy monolith issue → cannot fix directly
→ Enable circuit breaker (return cached/degraded response)
→ Notify legacy team
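The steps above are mechanical enough to sketch as a first-pass triage function (thresholds and return messages are illustrative, taken loosely from the runbook and the SLO table):

```python
def triage(p95_db_ms: float, p95_acl_ms: float, cpu_pct: float) -> str:
    """First-pass diagnosis for a Travel Service latency spike."""
    if p95_acl_ms > 150:  # Payment ACL SLO is P95 < 150ms
        return "ACL slow: enable circuit breaker, notify legacy team"
    if p95_db_ms > 100:   # db_query_duration alert threshold
        return "DB slow: inspect slow queries and connection pool"
    if cpu_pct > 80:      # container CPU alert threshold
        return "CPU saturated: verify auto-scale kicked in"
    return "Inspect distributed trace for the slow span"
```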
7. Observability per Phase
Phase 0 (Month 1)
- OpenTelemetry + Serilog NuGet packages in shared template
- App Insights workspace provisioned (Bicep)
- Basic dashboard: availability + error rate + latency per service
- Slack webhook for alerts
Phase 1 (Month 2–4)
- Travel + Event: full instrumentation (traces + metrics + logs)
- ACL: latency + error rate dashboard (critical)
- SLO definitions published for Travel + Event + ACL
- P1/P2 alert rules active
- Correlation ID flowing through YARP → services → Service Bus
Phase 2 (Month 5–7)
- All 5 services: full observability
- SLO burn rate dashboard operational
- Error budget policy enforced
- Custom business metrics (bookings/sec, events created/day)
- CDC lag monitoring
Phase 3 (Month 8–9)
- All alerts tuned (< 20% false positive rate)
- Troubleshooting runbooks per service
- On-call rotation established (D1 backup, D2–D5 primary rotation)
- Monthly SLO review process documented
- Load test baselines captured as performance benchmarks
8. Cost of Observability
| Component | Monthly Cost | Notes |
|---|---|---|
| App Insights ingestion (20 GB) | ~$45 | First 5 GB free. 15 GB × $2.99 |
| Azure Monitor metrics | Included | Free with Container Apps + SQL |
| Log retention (90 days) | ~$10 | Standard retention |
| Alert rules (20 rules) | ~$3 | $0.15/rule/month |
| TOTAL | ~$58/month | < 0.3% of total project cost |
Observability is the cheapest investment with the highest ROI: $58/month to prevent a single hour of undetected outage is a no-brainer.