Observability & SLI/SLO Strategy

Principle: Observe everything, alert on what matters. No alert fatigue.
Stack: OpenTelemetry (instrumentation) + Serilog (logging) + Azure Monitor/App Insights (backend)
Constraint: 5 engineers → no dedicated SRE. Observability must be self-service.
Source: HA.md, Deployment.md, Architect.md, Deliverable 4.3 - Failure Modeling.md


1. Observability Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  OBSERVABILITY STACK                                                 │
│                                                                      │
│  ┌─ INSTRUMENTATION ────────────────────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │   │
│  │  │ Travel   │  │ Event    │  │ Workforce│  │  Comms   │    │   │
│  │  │          │  │          │  │          │  │          │    │   │
│  │  │ OTel SDK │  │ OTel SDK │  │ OTel SDK │  │ OTel SDK │    │   │
│  │  │ Serilog  │  │ Serilog  │  │ Serilog  │  │ Serilog  │    │   │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │   │
│  │       └──────────────┴──────────────┴──────────────┘         │   │
│  │                          │                                    │   │
│  └──────────────────────────┼────────────────────────────────────┘   │
│                             │ OTLP (gRPC)                            │
│                             ▼                                        │
│  ┌─ COLLECTION & STORAGE ───────────────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌──────────────────┐  ┌──────────────────┐                  │   │
│  │  │ Azure Monitor    │  │ Application      │                  │   │
│  │  │ (Metrics)        │  │ Insights         │                  │   │
│  │  │                  │  │ (Logs + Traces)  │                  │   │
│  │  └────────┬─────────┘  └────────┬─────────┘                  │   │
│  │           └──────────┬──────────┘                             │   │
│  └──────────────────────┼────────────────────────────────────────┘   │
│                         ▼                                            │
│  ┌─ VISUALIZATION & ALERTING ───────────────────────────────────┐   │
│  │                                                               │   │
│  │  Azure Dashboard (operational)                                │   │
│  │  Azure Workbooks (detailed analysis)                          │   │
│  │  Alerts → Slack #incidents + PagerDuty (on-call)             │   │
│  │                                                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

2. Three Pillars

2.1 Logs (Serilog → App Insights)

// Structured logging standard — ALL services
Log.Information(
    "Booking {BookingId} created by {UserId} for {FlightId}, amount {Amount}",
    booking.Id, userId, flightId, amount);

// RULES:
// ✓ Structured (not string interpolation)
// ✓ Context-rich (IDs, amounts, status)
// ✗ NEVER log PII (email, phone, passport, card numbers)
// ✗ NEVER log secrets (tokens, connection strings)
Level       | Usage                                   | Example
------------|-----------------------------------------|-------------------------------------
Fatal       | Service cannot start / unrecoverable    | DB connection failed on startup
Error       | Request failed, needs attention         | Payment ACL returned 500
Warning     | Degraded but functional                 | Circuit breaker tripped, using cache
Information | Business events                         | Booking created, event published
Debug       | Developer troubleshooting (off in prod) | Query parameters, cache hit/miss

PII Masking Policy (Serilog Destructuring):

// Registered at startup — automatic masking
.Destructure.ByTransforming<UserDto>(u => new {
    u.Id,
    Email = MaskEmail(u.Email),      // j***@example.com
    Phone = "***MASKED***",
    u.Department, u.Role             // Safe to log
})

// Minimal helper: keep the first character, mask the rest of the local part
static string MaskEmail(string email) =>
    string.IsNullOrEmpty(email) || !email.Contains('@')
        ? "***MASKED***"
        : email[0] + "***" + email[email.IndexOf('@')..];

2.2 Metrics (OpenTelemetry → Azure Monitor)

Standard Metrics per Service:

Metric                        | Type      | Labels                    | Alert Threshold
------------------------------|-----------|---------------------------|-----------------------
http_request_duration_seconds | Histogram | method, path, status_code | P95 > 500ms
http_requests_total           | Counter   | method, path, status_code | Error rate > 5%
db_query_duration_seconds     | Histogram | query_type, table         | P95 > 100ms
servicebus_messages_processed | Counter   | topic, status             | Dead-letter > 10/hour
servicebus_message_lag        | Gauge     | subscription              | Lag > 1000 messages
circuit_breaker_state         | Gauge     | target_service            | State = Open > 5 min
active_connections            | Gauge     | service                   | > 80% of pool

Custom Business Metrics:

Metric                      | Service   | Purpose
----------------------------|-----------|------------------------------
bookings_created_total      | Travel    | Business throughput
bookings_cancelled_total    | Travel    | Cancellation rate monitoring
events_capacity_utilization | Event     | Capacity planning
notifications_sent_total    | Comms     | Communication volume
acl_latency_seconds         | ACL       | Legacy bridge health
report_generation_duration  | Reporting | Performance monitoring

2.3 Traces (OpenTelemetry → App Insights)

Distributed Trace Example: Create Booking

TraceId: abc123...
│
├─ [YARP Gateway] POST /api/bookings          12ms
│  ├─ JWT validation                            2ms
│  └─ Route to Travel Service
│
├─ [Travel Service] CreateBookingHandler       145ms
│  ├─ Validate request (FluentValidation)       5ms
│  ├─ Check availability (DB query)            35ms
│  ├─ Create Booking aggregate                  3ms
│  ├─ Save to DB (EF Core)                    42ms
│  ├─ Publish BookingCreated event             15ms
│  └─ Call Payment ACL                         45ms
│     ├─ [ACL] AuthorizePayment               40ms
│     │  └─ [Legacy] POST /payment/auth        35ms
│     └─ Return result                          5ms
│
├─ [Comms Service] Handle BookingCreated        25ms  (async)
│  ├─ Resolve notification template             5ms
│  ├─ Send email                               18ms
│  └─ Log notification sent                     2ms
│
Total: 182ms (sync path: 157ms)

Trace Configuration:

// Program.cs — OpenTelemetry setup (every service)
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("ServiceBus.Consumer")
        .AddOtlpExporter())
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter());

3. SLI/SLO Definitions

3.1 Service-Level Indicators (SLIs)

SLI          | Definition                               | Measurement
-------------|------------------------------------------|------------------------------------------------
Availability | % of successful HTTP responses (non-5xx) | 1 - (5xx responses / total responses)
Latency      | P95 response time for API requests       | histogram_quantile(0.95, http_request_duration)
Throughput   | Requests processed per second            | rate(http_requests_total[5m])
Error Rate   | % of requests resulting in error         | 5xx responses / total responses
Freshness    | Data sync lag (CDC → new service DB)     | now() - last_cdc_event_timestamp
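The first SLIs in the table reduce to simple arithmetic over request counts and latency samples. A minimal sketch (function and field names are illustrative, not the App Insights schema):

```python
def availability(total: int, errors_5xx: int) -> float:
    """Availability SLI: 1 - (5xx responses / total responses)."""
    return 1.0 - errors_5xx / total if total else 1.0

def error_rate(total: int, errors_5xx: int) -> float:
    """Error-rate SLI: share of requests that returned 5xx."""
    return errors_5xx / total if total else 0.0

def p95_latency(samples_ms: list[float]) -> float:
    """P95 over raw samples (the real pipeline uses histogram buckets)."""
    ordered = sorted(samples_ms)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]

# Example: 10,000 requests with 5 server errors
print(availability(10_000, 5))   # 0.9995 → meets a 99.9% availability SLO
```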

3.2 Service-Level Objectives (SLOs)

Service      | Availability SLO | Latency SLO (P95)          | Error Budget (30-day)
-------------|------------------|----------------------------|----------------------
Travel       | 99.9%            | < 200ms                    | 43.2 min downtime
Event        | 99.9%            | < 200ms                    | 43.2 min downtime
Workforce    | 99.5%            | < 300ms                    | 3.6 hours downtime
Comms        | 99.5%            | < 500ms (async OK)         | 3.6 hours downtime
Reporting    | 99.0%            | < 1000ms (complex queries) | 7.2 hours downtime
Payment ACL  | 99.95%           | < 150ms                    | 21.6 min downtime
YARP Gateway | 99.99%           | < 50ms (routing only)      | 4.3 min downtime
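The error-budget column follows directly from the SLO: budget = (100% - SLO) × window. A quick check of the table's numbers (helper name is illustrative):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime per window: (100% - SLO) of the window, in minutes."""
    return (1 - slo_percent / 100) * window_days * 24 * 60

for service, slo in [("Travel", 99.9), ("Workforce", 99.5),
                     ("Reporting", 99.0), ("Payment ACL", 99.95),
                     ("YARP Gateway", 99.99)]:
    print(f"{service}: {error_budget_minutes(slo):.1f} min")
# Travel 43.2 min, Workforce 216.0 min (3.6 h), Reporting 432.0 min (7.2 h),
# Payment ACL 21.6 min, Gateway 4.3 min — matching the table.
```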

Why are SLOs different per service?

Travel + Event: Core business operations. User-facing, real-time.
  → 99.9% — every minute of downtime = lost bookings, lost trust.

Workforce + Comms: Important but not real-time critical.
  → 99.5% — notification delay 5 min OK. Workforce report delay OK.

Reporting: Batch-oriented, complex queries, non-blocking.
  → 99.0% — report delayed 30 min does not impact operations.

Payment ACL: HIGHEST — payment failure = revenue loss.
  → 99.95% — tighter than Travel because payment = money.

Gateway: MUST be highest — single point of entry.
  → 99.99% — gateway down = everything down.

3.3 Error Budget Policy

Error Budget = 100% - SLO

Example: Travel Service (SLO = 99.9%)
  Error budget = 0.1% = 43.2 minutes per 30-day window

POLICY:
  Budget > 50% remaining:   Normal development, deploy freely
  Budget 20–50% remaining:  Caution — reduce deploy frequency, review changes
  Budget < 20% remaining:   FREEZE — no new features, reliability fixes only
  Budget = 0 (exhausted):   Full freeze. Incident review. Resume when budget renews.

WHO DECIDES: D1 (Tech Lead) reviews error budget weekly.
             D5 monitors dashboard daily.
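The four policy tiers above can be encoded as a simple lookup, e.g. inside the weekly review script. A sketch (function name and return strings are illustrative):

```python
def deploy_policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget (%) to the deployment policy tier."""
    if budget_remaining_pct <= 0:
        return "FULL FREEZE — incident review, resume when budget renews"
    if budget_remaining_pct < 20:
        return "FREEZE — no new features, reliability fixes only"
    if budget_remaining_pct <= 50:
        return "CAUTION — reduce deploy frequency, review changes"
    return "NORMAL — deploy freely"

print(deploy_policy(71))  # NORMAL — deploy freely
```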

4. Alerting Strategy

4.1 Alert Severity & Response

Severity    | Criteria                                | Notification                       | Response Time     | Example
------------|-----------------------------------------|------------------------------------|-------------------|-------------------------------------------
P1 Critical | Service down, data loss, payment impact | PagerDuty (phone call) + Slack     | < 15 min          | Gateway 5xx > 50%, Payment ACL timeout
P2 High     | SLO at risk, degraded performance       | Slack #incidents + mention on-call | < 1 hour          | Travel P95 > 500ms for 10 min
P3 Medium   | Anomaly, approaching limits             | Slack #monitoring                  | < 4 hours         | DB connection pool > 70%, disk > 80%
P4 Low      | Informational, optimization opportunity | Slack #monitoring (no mention)     | Next business day | Cache hit ratio < 60%, slow query detected

4.2 Alert Rules

# Alert definitions (Azure Monitor / Bicep)

alerts:
  # P1: Service availability
  - name: "Gateway Down"
    severity: P1
    condition: availability(yarp-gateway) < 99% over 5min
    action: pagerduty + slack-critical

  - name: "Payment ACL Failure"
    severity: P1
    condition: error_rate(payment-acl) > 10% over 2min
    action: pagerduty + slack-critical

  # P2: SLO at risk
  - name: "Travel Latency Spike"
    severity: P2
    condition: p95_latency(travel-service) > 500ms over 10min
    action: slack-incidents

  - name: "Service Bus Dead Letters Rising"
    severity: P2
    condition: dead_letter_count(any-topic) > 50 over 15min
    action: slack-incidents

  # P3: Resource warnings
  - name: "DB Connection Pool High"
    severity: P3
    condition: active_connections(any-sql) > 80% of max
    action: slack-monitoring

  - name: "Container CPU High"
    severity: P3
    condition: cpu_usage(any-container) > 80% over 15min
    action: slack-monitoring

  # P4: Informational
  - name: "CDC Lag Detected"
    severity: P4
    condition: cdc_lag(any-service) > 30sec
    action: slack-monitoring (no mention)

4.3 Anti-Alert-Fatigue Rules

RULE 1: Every alert MUST have a documented action.
        "What do I DO when this fires?" → if no answer → delete alert.

RULE 2: Flapping alerts (fire+resolve > 3 times in 1 hour) → auto-suppress.
        Review threshold. Likely too sensitive.

RULE 3: Maximum 5 P1/P2 alerts per service.
        More = noise. Consolidate or downgrade.

RULE 4: P3/P4 alerts suppressed outside business hours (8AM–8PM ICT).
        P1/P2: 24/7.

RULE 5: Weekly alert review: how many fired? How many were actionable?
        Target: >80% of alerts require action. <80% = tuning needed.
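Rule 2's flapping check is easy to automate. A sketch (fire timestamps as epoch seconds; names and thresholds are illustrative):

```python
def is_flapping(fire_times: list[float],
                window_s: float = 3600, limit: int = 3) -> bool:
    """True if the alert fired more than `limit` times in any rolling window."""
    fire_times = sorted(fire_times)
    for i, start in enumerate(fire_times):
        if sum(1 for t in fire_times[i:] if t - start <= window_s) > limit:
            return True
    return False

# Four firings inside one hour → auto-suppress, review the threshold
print(is_flapping([0, 600, 1200, 1800]))  # True
print(is_flapping([0, 7200, 14400]))      # False
```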

5. Dashboards

5.1 Dashboard Hierarchy

┌───────────────────────────────────────────────────────┐
│  LEVEL 1: Executive Dashboard (Azure Portal)          │
│  Audience: Management, D1                             │
│  Refresh: Real-time                                    │
│                                                        │
│  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐         │
│  │Overall │ │Error   │ │Active  │ │SLO     │         │
│  │Avail.  │ │Rate    │ │Users   │ │Status  │         │
│  │ 99.97% │ │ 0.03%  │ │ 1,247  │ │ 6/7 ✅ │         │
│  └────────┘ └────────┘ └────────┘ └────────┘         │
└───────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────┐
│  LEVEL 2: Service Dashboard (per service)             │
│  Audience: D1–D5 (service owners)                     │
│  Refresh: Real-time                                    │
│                                                        │
│  Request rate │ Latency (P50/P95/P99) │ Error rate    │
│  DB queries   │ Cache hit ratio       │ Queue depth   │
│  CPU / Memory │ Connection pool       │ Active traces │
└───────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────┐
│  LEVEL 3: Debug Dashboard (Azure Workbooks)           │
│  Audience: On-call engineer during incident           │
│  Content: Distributed traces, slow queries,           │
│           Service Bus dead letters, dependency map    │
└───────────────────────────────────────────────────────┘

5.2 SLO Burn Rate Dashboard

Travel Service — 30-Day SLO Window

  SLO: 99.9% availability
  Error Budget: 43.2 minutes
  Used: 12.5 minutes (29%)
  Remaining: 30.7 minutes (71%)

  ┌──────────────────────────────────────────┐
  │ ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░ │  29% used
  │ ◄──── used ────►◄──── remaining ────────►│
  └──────────────────────────────────────────┘

  Burn Rate: 0.97x (on track — budget lasts full window)
  
  Status: ✅ HEALTHY
  Action: None required
  
  Events this window:
    Mar 3 — 5min outage (deployment issue, rolled back)
    Mar 8 — 7.5min degraded (DB connection pool spike)
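The burn rate shown above compares the fraction of budget consumed to the fraction of the window elapsed; 1.0x means the budget lasts exactly the full window. A sketch (function name is illustrative):

```python
def burn_rate(used_min: float, budget_min: float,
              elapsed_days: float, window_days: float = 30) -> float:
    """Budget consumption rate: 1.0 = on track to exhaust exactly at window end."""
    return (used_min / budget_min) / (elapsed_days / window_days)

# Half the budget gone at the window midpoint → exactly on track
print(round(burn_rate(21.6, 43.2, 15), 2))  # 1.0
# Travel's 12.5 min used after ~9 days ≈ 0.96x, consistent with the ~0.97x above
print(round(burn_rate(12.5, 43.2, 9), 2))   # 0.96
```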

6. Correlation & Troubleshooting

6.1 Correlation ID Flow

Browser → YARP → Service → Service Bus → Consumer Service → DB
   │         │        │          │              │           │
   └─────────┴────────┴──────────┴──────────────┴───────────┘
                    Same CorrelationId / TraceId

Implementation:
  1. YARP generates TraceId if not present (X-Trace-Id header)
  2. Each service propagates via OpenTelemetry context
  3. Service Bus messages carry TraceId in properties
  4. Logs include TraceId in every entry → searchable in App Insights
  
Query: traces | where operation_Id == "abc123"
  → Shows full request journey across all services
    (App Insights stores the trace ID as operation_Id; join requests and
     dependencies on the same field for the full span view)

6.2 Troubleshooting Runbook

SYMPTOM: Travel Service P95 latency > 500ms

STEP 1: Check Dashboard Level 2 (Travel)
  → Is it ALL endpoints or specific one?
  
STEP 2: Check Dependencies
  → DB query time normal? (db_query_duration metric)
  → Payment ACL responding? (acl_latency metric)
  → Service Bus publishing OK? (publish_duration metric)
  
STEP 3: Check Infrastructure
  → CPU > 80%? → Scale up (auto-scale should handle)
  → Memory > 80%? → Possible leak, check container restart count
  → DB connection pool exhausted? → Connection leak
  
STEP 4: Distributed Trace
  → Find slow trace in App Insights
  → Identify which span is slow
  → Fix or escalate

STEP 5: If ACL is slow
  → Legacy monolith issue → cannot fix directly
  → Enable circuit breaker (return cached/degraded response)
  → Notify legacy team

7. Observability per Phase

Phase 0 (Month 1)

  • OpenTelemetry + Serilog NuGet packages in shared template
  • App Insights workspace provisioned (Bicep)
  • Basic dashboard: availability + error rate + latency per service
  • Slack webhook for alerts

Phase 1 (Month 2–4)

  • Travel + Event: full instrumentation (traces + metrics + logs)
  • ACL: latency + error rate dashboard (critical)
  • SLO definitions published for Travel + Event + ACL
  • P1/P2 alert rules active
  • Correlation ID flowing through YARP → services → Service Bus

Phase 2 (Month 5–7)

  • All 5 services: full observability
  • SLO burn rate dashboard operational
  • Error budget policy enforced
  • Custom business metrics (bookings/sec, events created/day)
  • CDC lag monitoring

Phase 3 (Month 8–9)

  • All alerts tuned (< 20% false positive rate)
  • Troubleshooting runbooks per service
  • On-call rotation established (D1 backup, D2–D5 primary rotation)
  • Monthly SLO review process documented
  • Load test baselines captured as performance benchmarks

8. Cost of Observability

Component                      | Monthly Cost | Notes
-------------------------------|--------------|--------------------------------
App Insights ingestion (20 GB) | ~$45         | First 5 GB free; 15 GB × $2.99
Azure Monitor metrics          | Included     | Free with Container Apps + SQL
Log retention (90 days)        | ~$10         | Standard retention
Alert rules (20 rules)         | ~$3          | $0.15/rule/month
TOTAL                          | ~$58/month   | < 0.3% of total project cost

Observability is the cheapest investment with the highest ROI: ~$58/month to prevent even a single hour of undetected outage is a no-brainer.
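The ingestion figure can be sanity-checked directly (prices as quoted in the table; verify against current Azure pricing before budgeting):

```python
def app_insights_ingestion_cost(gb_per_month: float,
                                free_gb: float = 5.0,
                                price_per_gb: float = 2.99) -> float:
    """Monthly ingestion cost: first `free_gb` free, the rest billed per GB."""
    return max(0.0, gb_per_month - free_gb) * price_per_gb

ingestion = app_insights_ingestion_cost(20)  # 15 GB × $2.99 = $44.85
total = ingestion + 10 + 20 * 0.15           # + retention + 20 alert rules
print(f"~${total:.0f}/month")                # ~$58/month
```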