Observability & SLI/SLO Strategy

Principle: Observe everything, alert on what matters. No alert fatigue.
Stack: OpenTelemetry (instrumentation) + Serilog (logging) + Azure Monitor/App Insights (backend)
Constraint: 5 engineers → no dedicated SRE. Observability must be self-service.
Source: HA.md, Deployment.md, Architect.md, Deliverable 4.3 - Failure Modeling.md


1. Observability Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  OBSERVABILITY STACK                                                 │
│                                                                      │
│  ┌─ INSTRUMENTATION ────────────────────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │   │
│  │  │ Travel   │  │ Event    │  │ Workforce│  │  Comms   │    │   │
│  │  │          │  │          │  │          │  │          │    │   │
│  │  │ OTel SDK │  │ OTel SDK │  │ OTel SDK │  │ OTel SDK │    │   │
│  │  │ Serilog  │  │ Serilog  │  │ Serilog  │  │ Serilog  │    │   │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │   │
│  │       └──────────────┴──────────────┴──────────────┘         │   │
│  │                          │                                    │   │
│  └──────────────────────────┼────────────────────────────────────┘   │
│                             │ OTLP (gRPC)                            │
│                             ▼                                        │
│  ┌─ COLLECTION & STORAGE ───────────────────────────────────────┐   │
│  │                                                               │   │
│  │  ┌──────────────────┐  ┌──────────────────┐                  │   │
│  │  │ Azure Monitor    │  │ Application      │                  │   │
│  │  │ (Metrics)        │  │ Insights         │                  │   │
│  │  │                  │  │ (Logs + Traces)  │                  │   │
│  │  └────────┬─────────┘  └────────┬─────────┘                  │   │
│  │           └──────────┬──────────┘                             │   │
│  └──────────────────────┼────────────────────────────────────────┘   │
│                         ▼                                            │
│  ┌─ VISUALIZATION & ALERTING ───────────────────────────────────┐   │
│  │                                                               │   │
│  │  Azure Dashboard (operational)                                │   │
│  │  Azure Workbooks (detailed analysis)                          │   │
│  │  Alerts → Slack #incidents + PagerDuty (on-call)             │   │
│  │                                                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

2. Three Pillars

2.1 Logs (Serilog → App Insights)

// Structured logging standard — ALL services
Log.Information(
    "Booking {BookingId} created by {UserId} for {FlightId}, amount {Amount}",
    booking.Id, userId, flightId, amount);

// RULES:
// ✓ Structured (not string interpolation)
// ✓ Context-rich (IDs, amounts, status)
// ✗ NEVER log PII (email, phone, passport, card numbers)
// ✗ NEVER log secrets (tokens, connection strings)
Level       | Usage                                   | Example
------------|-----------------------------------------|-------------------------------------
Fatal       | Service cannot start / unrecoverable    | DB connection failed on startup
Error       | Request failed, needs attention         | Payment ACL returned 500
Warning     | Degraded but functional                 | Circuit breaker tripped, using cache
Information | Business events                         | Booking created, event published
Debug       | Developer troubleshooting (off in prod) | Query parameters, cache hit/miss

PII Masking Policy (Serilog Destructuring):

// Registered at startup — automatic masking
.Destructure.ByTransforming<UserDto>(u => new {
    u.Id,
    Email = MaskEmail(u.Email),      // j***@example.com
    Phone = "***MASKED***",
    u.Department, u.Role             // Safe to log
})

// Minimal helper: keep the first character, mask the rest of the local part
static string MaskEmail(string email) =>
    string.IsNullOrEmpty(email) || !email.Contains('@')
        ? "***MASKED***"
        : email[0] + "***" + email[email.IndexOf('@')..];

2.2 Metrics (OpenTelemetry → Azure Monitor)

Standard Metrics per Service:

Metric                        | Type      | Labels                    | Alert Threshold
------------------------------|-----------|---------------------------|-----------------------
http_request_duration_seconds | Histogram | method, path, status_code | P95 > 500ms
http_requests_total           | Counter   | method, path, status_code | Error rate > 5%
db_query_duration_seconds     | Histogram | query_type, table         | P95 > 100ms
servicebus_messages_processed | Counter   | topic, status             | Dead-letter > 10/hour
servicebus_message_lag        | Gauge     | subscription              | Lag > 1000 messages
circuit_breaker_state         | Gauge     | target_service            | State = Open > 5 min
active_connections            | Gauge     | service                   | > 80% of pool

Custom Business Metrics:

Metric                      | Service   | Purpose
----------------------------|-----------|------------------------------
bookings_created_total      | Travel    | Business throughput
bookings_cancelled_total    | Travel    | Cancellation rate monitoring
events_capacity_utilization | Event     | Capacity planning
notifications_sent_total    | Comms     | Communication volume
acl_latency_seconds         | ACL       | Legacy bridge health
report_generation_duration  | Reporting | Performance monitoring

2.3 Traces (OpenTelemetry → App Insights)

Distributed Trace Example: Create Booking

TraceId: abc123...
│
├─ [YARP Gateway] POST /api/bookings          12ms
│  ├─ JWT validation                            2ms
│  └─ Route to Travel Service
│
├─ [Travel Service] CreateBookingHandler       145ms
│  ├─ Validate request (FluentValidation)       5ms
│  ├─ Check availability (DB query)            35ms
│  ├─ Create Booking aggregate                  3ms
│  ├─ Save to DB (EF Core)                    42ms
│  ├─ Publish BookingCreated event             15ms
│  └─ Call Payment ACL                         45ms
│     ├─ [ACL] AuthorizePayment               40ms
│     │  └─ [Legacy] POST /payment/auth        35ms
│     └─ Return result                          5ms
│
├─ [Comms Service] Handle BookingCreated        25ms  (async)
│  ├─ Resolve notification template             5ms
│  ├─ Send email                               18ms
│  └─ Log notification sent                     2ms
│
Total: 182ms (sync path: 157ms)

Trace Configuration:

// Program.cs — OpenTelemetry setup (every service)
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("ServiceBus.Consumer")
        .AddOtlpExporter())
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter());

3. SLI/SLO Definitions

3.1 Service-Level Indicators (SLIs)

SLI          | Definition                               | Measurement
-------------|------------------------------------------|------------------------------------------------
Availability | % of successful HTTP responses (non-5xx) | 1 - (5xx responses / total responses)
Latency      | P95 response time for API requests       | histogram_quantile(0.95, http_request_duration)
Throughput   | Requests processed per second            | rate(http_requests_total[5m])
Error Rate   | % of requests resulting in error         | 5xx responses / total responses
Freshness    | Data sync lag (CDC → new service DB)     | now() - last_cdc_event_timestamp
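The first SLIs in the table reduce to simple arithmetic over request counts and latency samples. A minimal sketch (function and field names are illustrative, not the App Insights schema):

```python
def availability(total: int, errors_5xx: int) -> float:
    """Availability SLI: 1 - (5xx responses / total responses)."""
    return 1.0 - errors_5xx / total if total else 1.0

def error_rate(total: int, errors_5xx: int) -> float:
    """Error-rate SLI: share of requests that returned 5xx."""
    return errors_5xx / total if total else 0.0

def p95_latency(samples_ms: list[float]) -> float:
    """P95 over raw samples (the real pipeline uses histogram buckets)."""
    ordered = sorted(samples_ms)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]

# Example: 10,000 requests with 5 server errors
print(availability(10_000, 5))   # 0.9995 → meets a 99.9% availability SLO
```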

3.2 Service-Level Objectives (SLOs)

Service      | Availability SLO | Latency SLO (P95)          | Error Budget (30-day)
-------------|------------------|----------------------------|----------------------
Travel       | 99.9%            | < 200ms                    | 43.2 min downtime
Event        | 99.9%            | < 200ms                    | 43.2 min downtime
Workforce    | 99.5%            | < 300ms                    | 3.6 hours downtime
Comms        | 99.5%            | < 500ms (async OK)         | 3.6 hours downtime
Reporting    | 99.0%            | < 1000ms (complex queries) | 7.2 hours downtime
Payment ACL  | 99.95%           | < 150ms                    | 21.6 min downtime
YARP Gateway | 99.99%           | < 50ms (routing only)      | 4.3 min downtime
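The error-budget column follows directly from the SLO: budget = (100% - SLO) × window. A quick check of the table's numbers (helper name is illustrative):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime per window: (100% - SLO) of the window, in minutes."""
    return (1 - slo_percent / 100) * window_days * 24 * 60

for service, slo in [("Travel", 99.9), ("Workforce", 99.5),
                     ("Reporting", 99.0), ("Payment ACL", 99.95),
                     ("YARP Gateway", 99.99)]:
    print(f"{service}: {error_budget_minutes(slo):.1f} min")
# Travel 43.2 min, Workforce 216.0 min (3.6 h), Reporting 432.0 min (7.2 h),
# Payment ACL 21.6 min, Gateway 4.3 min — matching the table.
```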

Why are SLOs different per service?

Travel + Event: Core business operations. User-facing, real-time.
  → 99.9% — every minute of downtime = lost bookings, lost trust.

Workforce + Comms: Important but not real-time critical.
  → 99.5% — notification delay 5 min OK. Workforce report delay OK.

Reporting: Batch-oriented, complex queries, non-blocking.
  → 99.0% — report delayed 30 min does not impact operations.

Payment ACL: HIGHEST — payment failure = revenue loss.
  → 99.95% — tighter than Travel because payment = money.

Gateway: MUST be highest — single point of entry.
  → 99.99% — gateway down = everything down.

3.3 Error Budget Policy

Error Budget = 100% - SLO

Example: Travel Service (SLO = 99.9%)
  Error budget = 0.1% = 43.2 minutes per 30-day window

POLICY:
  Budget > 50% remaining:   Normal development, deploy freely
  Budget 20–50% remaining:  Caution — reduce deploy frequency, review changes
  Budget < 20% remaining:   FREEZE — no new features, reliability fixes only
  Budget = 0 (exhausted):   Full freeze. Incident review. Resume when budget renews.

WHO DECIDES: D1 (Tech Lead) reviews error budget weekly.
             D5 monitors dashboard daily.
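The four policy tiers above can be encoded as a simple lookup, e.g. inside the weekly review script. A sketch (function name and return strings are illustrative):

```python
def deploy_policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget (%) to the deployment policy tier."""
    if budget_remaining_pct <= 0:
        return "FULL FREEZE — incident review, resume when budget renews"
    if budget_remaining_pct < 20:
        return "FREEZE — no new features, reliability fixes only"
    if budget_remaining_pct <= 50:
        return "CAUTION — reduce deploy frequency, review changes"
    return "NORMAL — deploy freely"

print(deploy_policy(71))  # NORMAL — deploy freely
```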

4. Alerting Strategy

4.1 Alert Severity & Response

Severity    | Criteria                                | Notification                       | Response Time     | Example
------------|-----------------------------------------|------------------------------------|-------------------|-------------------------------------------
P1 Critical | Service down, data loss, payment impact | PagerDuty (phone call) + Slack     | < 15 min          | Gateway 5xx > 50%, Payment ACL timeout
P2 High     | SLO at risk, degraded performance       | Slack #incidents + mention on-call | < 1 hour          | Travel P95 > 500ms for 10 min
P3 Medium   | Anomaly, approaching limits             | Slack #monitoring                  | < 4 hours         | DB connection pool > 70%, disk > 80%
P4 Low      | Informational, optimization opportunity | Slack #monitoring (no mention)     | Next business day | Cache hit ratio < 60%, slow query detected

4.2 Alert Rules

# Alert definitions (Azure Monitor / Bicep)

alerts:
  # P1: Service availability
  - name: "Gateway Down"
    severity: P1
    condition: availability(yarp-gateway) < 99% over 5min
    action: pagerduty + slack-critical

  - name: "Payment ACL Failure"
    severity: P1
    condition: error_rate(payment-acl) > 10% over 2min
    action: pagerduty + slack-critical

  # P2: SLO at risk
  - name: "Travel Latency Spike"
    severity: P2
    condition: p95_latency(travel-service) > 500ms over 10min
    action: slack-incidents

  - name: "Service Bus Dead Letters Rising"
    severity: P2
    condition: dead_letter_count(any-topic) > 50 over 15min
    action: slack-incidents

  # P3: Resource warnings
  - name: "DB Connection Pool High"
    severity: P3
    condition: active_connections(any-sql) > 80% of max
    action: slack-monitoring

  - name: "Container CPU High"
    severity: P3
    condition: cpu_usage(any-container) > 80% over 15min
    action: slack-monitoring

  # P4: Informational
  - name: "CDC Lag Detected"
    severity: P4
    condition: cdc_lag(any-service) > 30sec
    action: slack-monitoring (no mention)

4.3 Anti-Alert-Fatigue Rules

RULE 1: Every alert MUST have a documented action.
        "What do I DO when this fires?" → if no answer → delete alert.

RULE 2: Flapping alerts (fire+resolve > 3 times in 1 hour) → auto-suppress.
        Review threshold. Likely too sensitive.

RULE 3: Maximum 5 P1/P2 alerts per service.
        More = noise. Consolidate or downgrade.

RULE 4: P3/P4 alerts suppressed outside business hours (8AM–8PM ICT).
        P1/P2: 24/7.

RULE 5: Weekly alert review: how many fired? How many were actionable?
        Target: >80% of alerts require action. <80% = tuning needed.
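Rule 2's flapping check is easy to automate. A sketch (fire timestamps as epoch seconds; names and thresholds are illustrative):

```python
def is_flapping(fire_times: list[float],
                window_s: float = 3600, limit: int = 3) -> bool:
    """True if the alert fired more than `limit` times in any rolling window."""
    fire_times = sorted(fire_times)
    for i, start in enumerate(fire_times):
        if sum(1 for t in fire_times[i:] if t - start <= window_s) > limit:
            return True
    return False

# Four firings inside one hour → auto-suppress, review the threshold
print(is_flapping([0, 600, 1200, 1800]))  # True
print(is_flapping([0, 7200, 14400]))      # False
```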

5. Dashboards

5.1 Dashboard Hierarchy

┌───────────────────────────────────────────────────────┐
│  LEVEL 1: Executive Dashboard (Azure Portal)          │
│  Audience: Management, D1                             │
│  Refresh: Real-time                                    │
│                                                        │
│  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐         │
│  │Overall │ │Error   │ │Active  │ │SLO     │         │
│  │Avail.  │ │Rate    │ │Users   │ │Status  │         │
│  │ 99.97% │ │ 0.03%  │ │ 1,247  │ │ 6/7 ✅ │         │
│  └────────┘ └────────┘ └────────┘ └────────┘         │
└───────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────┐
│  LEVEL 2: Service Dashboard (per service)             │
│  Audience: D1–D5 (service owners)                     │
│  Refresh: Real-time                                    │
│                                                        │
│  Request rate │ Latency (P50/P95/P99) │ Error rate    │
│  DB queries   │ Cache hit ratio       │ Queue depth   │
│  CPU / Memory │ Connection pool       │ Active traces │
└───────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────┐
│  LEVEL 3: Debug Dashboard (Azure Workbooks)           │
│  Audience: On-call engineer during incident           │
│  Content: Distributed traces, slow queries,           │
│           Service Bus dead letters, dependency map    │
└───────────────────────────────────────────────────────┘

5.2 SLO Burn Rate Dashboard

Travel Service — 30-Day SLO Window

  SLO: 99.9% availability
  Error Budget: 43.2 minutes
  Used: 12.5 minutes (29%)
  Remaining: 30.7 minutes (71%)

  ┌──────────────────────────────────────────┐
  │ ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░ │  29% used
  │ ◄──── used ────►◄──── remaining ────────►│
  └──────────────────────────────────────────┘

  Burn Rate: 0.97x (on track — budget lasts full window)
  
  Status: ✅ HEALTHY
  Action: None required
  
  Events this window:
    Mar 3 — 5min outage (deployment issue, rolled back)
    Mar 8 — 7.5min degraded (DB connection pool spike)
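The burn rate shown above compares the fraction of budget consumed to the fraction of the window elapsed; 1.0x means the budget lasts exactly the full window. A sketch (function name is illustrative):

```python
def burn_rate(used_min: float, budget_min: float,
              elapsed_days: float, window_days: float = 30) -> float:
    """Budget consumption rate: 1.0 = on track to exhaust exactly at window end."""
    return (used_min / budget_min) / (elapsed_days / window_days)

# Half the budget gone at the window midpoint → exactly on track
print(round(burn_rate(21.6, 43.2, 15), 2))  # 1.0
# Travel's 12.5 min used after ~9 days ≈ 0.96x, consistent with the ~0.97x above
print(round(burn_rate(12.5, 43.2, 9), 2))   # 0.96
```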

6. Correlation & Troubleshooting

6.1 Correlation ID Flow

Browser → YARP → Service → Service Bus → Consumer Service → DB
   │         │        │          │              │           │
   └─────────┴────────┴──────────┴──────────────┴───────────┘
                    Same CorrelationId / TraceId

Implementation:
  1. YARP generates TraceId if not present (X-Trace-Id header)
  2. Each service propagates via OpenTelemetry context
  3. Service Bus messages carry TraceId in properties
  4. Logs include TraceId in every entry → searchable in App Insights
  
Query: traces | where operation_Id == "abc123"
  → Shows full request journey across all services
    (App Insights stores the trace ID as operation_Id; join requests and
     dependencies on the same field for the full span view)

6.2 Troubleshooting Runbook

SYMPTOM: Travel Service P95 latency > 500ms

STEP 1: Check Dashboard Level 2 (Travel)
  → Is it ALL endpoints or specific one?
  
STEP 2: Check Dependencies
  → DB query time normal? (db_query_duration metric)
  → Payment ACL responding? (acl_latency metric)
  → Service Bus publishing OK? (publish_duration metric)
  
STEP 3: Check Infrastructure
  → CPU > 80%? → Scale up (auto-scale should handle)
  → Memory > 80%? → Possible leak, check container restart count
  → DB connection pool exhausted? → Connection leak
  
STEP 4: Distributed Trace
  → Find slow trace in App Insights
  → Identify which span is slow
  → Fix or escalate

STEP 5: If ACL is slow
  → Legacy monolith issue → cannot fix directly
  → Enable circuit breaker (return cached/degraded response)
  → Notify legacy team

7. Observability per Phase

Phase 0 (Month 1)

  • OpenTelemetry + Serilog NuGet packages in shared template
  • App Insights workspace provisioned (Bicep)
  • Basic dashboard: availability + error rate + latency per service
  • Slack webhook for alerts

Phase 1 (Month 2–4)

  • Travel + Event: full instrumentation (traces + metrics + logs)
  • ACL: latency + error rate dashboard (critical)
  • SLO definitions published for Travel + Event + ACL
  • P1/P2 alert rules active
  • Correlation ID flowing through YARP → services → Service Bus

Phase 2 (Month 5–7)

  • All 5 services: full observability
  • SLO burn rate dashboard operational
  • Error budget policy enforced
  • Custom business metrics (bookings/sec, events created/day)
  • CDC lag monitoring

Phase 3 (Month 8–9)

  • All alerts tuned (< 20% false positive rate)
  • Troubleshooting runbooks per service
  • On-call rotation established (D1 backup, D2–D5 primary rotation)
  • Monthly SLO review process documented
  • Load test baselines captured as performance benchmarks

8. Cost of Observability

Component                      | Monthly Cost | Notes
-------------------------------|--------------|--------------------------------
App Insights ingestion (20 GB) | ~$45         | First 5 GB free; 15 GB × $2.99
Azure Monitor metrics          | Included     | Free with Container Apps + SQL
Log retention (90 days)        | ~$10         | Standard retention
Alert rules (20 rules)         | ~$3          | $0.15/rule/month
TOTAL                          | ~$58/month   | < 0.3% of total project cost

Observability is the cheapest investment with the highest ROI: ~$58/month to prevent even a single hour of undetected outage is a no-brainer.
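The ingestion figure can be sanity-checked directly (prices as quoted in the table; verify against current Azure pricing before budgeting):

```python
def app_insights_ingestion_cost(gb_per_month: float,
                                free_gb: float = 5.0,
                                price_per_gb: float = 2.99) -> float:
    """Monthly ingestion cost: first `free_gb` free, the rest billed per GB."""
    return max(0.0, gb_per_month - free_gb) * price_per_gb

ingestion = app_insights_ingestion_cost(20)  # 15 GB × $2.99 = $44.85
total = ingestion + 10 + 20 * 0.15           # + retention + 20 alert rules
print(f"~${total:.0f}/month")                # ~$58/month
```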