

High Availability & Zero Downtime Strategy

Solving two core problems: zero-downtime migration + serving 40K users globally
Platform: Azure Container Apps | YARP Gateway | Azure Service Bus


1. Two Problems to Solve

Problem 1: ZERO DOWNTIME MIGRATION
  Legacy monolith is running live, 40K users actively using it
  → Migrate to microservices WITHOUT shutting down the system
  → No data loss, no lost transactions

Problem 2: GLOBALLY DISTRIBUTED USERS
  40K users spread across multiple timezones
  → No safe "maintenance window" available
  → Latency must be acceptable for all regions
  → System must survive regional failure

2. Zero Downtime — Approach

2.1 Strangler Fig + YARP = Zero Downtime Migration

                    BEFORE MIGRATION
                    ────────────────
    Users ──────► Legacy Monolith (all modules)


                    DURING MIGRATION (Phase 1–2, Month 2–7)
                    ──────────────────────────────────────
                         ┌──────────────────────┐
                         │     YARP Gateway      │
                         │   (traffic router)    │
                         └──────┬───────┬────────┘
                                │       │
                    ┌───────────┘       └───────────┐
                    ▼                               ▼
         ┌──────────────────┐            ┌──────────────────┐
         │  New Services    │            │  Legacy Monolith  │
         │  (migrated)      │            │  (remaining)      │
         │                  │            │                   │
         │  • Travel ────── │ ◄──ACL──── │  • Payment       │
         │  • Event         │            │  (frozen Phase 1) │
         │  • Workforce     │            │                   │
         │  • Comms         │            │                   │
         │  • Reporting     │            │                   │
         └──────────────────┘            └──────────────────┘


                    AFTER MIGRATION (Phase 3, Month 8–9+)
                    ──────────────────────────────────────
                         ┌──────────────────────┐
                         │     YARP Gateway      │
                         └──────┬───────┬────────┘
                                │       │
                    ┌───────────┘       └──── ACL ────┐
                    ▼                                  ▼
         ┌──────────────────┐              ┌──────────────┐
         │  All Services    │              │  Legacy       │
         │  (100% traffic)  │              │  Payment only │
         └──────────────────┘              └──────────────┘

2.2 Traffic Cutover Procedure (per Module)

Each module follows a 7-step migration procedure — NO BIG BANG:

Step 1: SHADOW MODE (Day 1-3)
  ┌────────────┐     ┌──────────┐
  │   YARP     │────►│ Legacy   │ ← serves response
  │   Gateway  │     └──────────┘
  │            │     ┌──────────┐
  │            │────►│ New Svc  │ ← receives copy, response discarded
  └────────────┘     └──────────┘
  
  Purpose: Verify new service handles same requests without errors
  Risk: Zero — legacy still serves 100% traffic

Step 2: COMPARE MODE (Day 4-5)
  YARP sends to both, COMPARE responses
  Log mismatches → fix business logic differences
  Risk: Zero — legacy still serves
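Compare mode reduces to: normalize both responses, diff them, log every mismatch. A minimal language-neutral sketch of that diff (Python here; the field names and the volatile-field list are illustrative, not the real implementation):

```python
import json

# Fields expected to legitimately differ between legacy and new
# (timestamps, trace ids) are dropped before comparing. Illustrative list.
VOLATILE_FIELDS = {"servedBy", "timestamp", "traceId"}

def normalize(body: str) -> dict:
    """Parse a JSON response body and drop fields that may legitimately differ."""
    data = json.loads(body)
    return {k: v for k, v in data.items() if k not in VOLATILE_FIELDS}

def compare_responses(legacy_body: str, new_body: str) -> list:
    """Return the sorted list of field names whose values differ (empty = match)."""
    legacy, new = normalize(legacy_body), normalize(new_body)
    keys = set(legacy) | set(new)
    return sorted(k for k in keys if legacy.get(k) != new.get(k))
```

Each non-empty result feeds the "fix business logic differences" loop before the module moves to canary.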

Step 3: CANARY 5% (Day 6-7)
  ┌────────────┐  95%  ┌──────────┐
  │   YARP     │──────►│ Legacy   │
  │   Gateway  │       └──────────┘
  │            │  5%   ┌──────────┐
  │            │──────►│ New Svc  │
  └────────────┘       └──────────┘

  Monitor: error rate, latency p95, business metrics
  Rollback: 1 config change = 100% back to legacy (< 30 seconds)

Step 4: CANARY 25% (Day 8-9)
  If Step 3 clean ≥ 48h → increase to 25%
  Same monitoring, same instant rollback

Step 5: CANARY 50% (Day 10)
  Half traffic on new service
  Run for 24h minimum

Step 6: FULL CUTOVER 100% (Day 11)
  All traffic to new service
  Legacy still running (hot standby)

Step 7: DECOMMISSION LEGACY MODULE (Day 18+)
  After 7-day soak at 100%
  Legacy module turned off (not deleted — archived)
  Legacy DB tables retained read-only for 30 days

2.3 YARP Routing Config (Zero Downtime Switch)

// appsettings.json — YARP reloads this on change, no gateway redeploy.
// Note: YARP's built-in policies ignore weights, so "WeightedRoundRobin"
// is a small custom ILoadBalancingPolicy reading Weight from Metadata.
{
  "ReverseProxy": {
    "Routes": {
      "travel-route": {
        "ClusterId": "travel-cluster",
        "Match": { "Path": "/api/travel/{**catch-all}" }
      }
    },
    "Clusters": {
      "travel-cluster": {
        "LoadBalancingPolicy": "WeightedRoundRobin",
        "Destinations": {
          "legacy": {
            "Address": "https://legacy-monolith.internal",
            "Metadata": { "Weight": "0" }       // ← Post-migration: 0%
          },
          "new-service": {
            "Address": "https://travel-service.internal",
            "Metadata": { "Weight": "100" }     // ← Post-migration: 100%
          }
        }
      }
    }
  }
}

Key insight: changing a destination's weight in config changes the traffic split.
YARP hot-reloads configuration → zero downtime, zero redeploy.
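Since YARP's built-in load-balancing policies do not consume per-destination weights, a weighted policy implies a small custom ILoadBalancingPolicy. Its core is just a weighted random pick; an illustrative sketch (Python, not YARP code):

```python
import random

def pick_destination(destinations):
    """Pick a destination name with probability proportional to its weight.

    destinations: list of (name, weight), e.g. [("legacy", 0), ("new-service", 100)].
    """
    total = sum(weight for _, weight in destinations)
    r = random.uniform(0, total)
    upto = 0.0
    for name, weight in destinations:
        upto += weight
        if weight > 0 and r <= upto:
            return name
    # Fallback: last destination with a positive weight
    return next(name for name, weight in reversed(destinations) if weight > 0)
```

With weights 0/100 every request goes to the new service; flipping the numbers flips the traffic, which is exactly the cutover switch.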


3. Data Migration — Zero Data Loss

3.1 CDC (Change Data Capture) Pipeline

┌──────────────┐     CDC Stream      ┌──────────────┐
│  Legacy DB   │ ──────────────────► │  New Service  │
│  (source of  │                     │  Database     │
│   truth)     │                     │  (replica)    │
└──────────────┘                     └──────────────┘
       │                                    │
       │         Verification               │
       │  ◄──────────────────────────────── │
       │     Hash compare every 6h          │
       │                                    │
       
Phase A: Legacy = write, New = read-only replica
Phase B: Both write (dual-write via Service Bus events)  
Phase C: New = write, Legacy = read-only (deprecated)
Phase D: Legacy tables archived

3.2 Dual-Write Transition

                    Phase A                    Phase B                      Phase C
                    ───────                    ───────                      ───────
User Writes:        → Legacy DB                → New Service DB             → New Service DB
                    → CDC → New DB (async)     → Event → Legacy DB            (Legacy read-only,
                                                 (for backward compat)        deprecated)

Read Traffic:       Legacy DB                  New Service DB               New Service DB
Data Authority:     Legacy                     New Service                  New Service
Rollback:           N/A (not switched yet)     Switch reads back            Switch writes back

Duration:           Week 1–2                   Week 3                       Week 4+
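The phase table can be captured as a tiny routing map, so moving to the next phase is one config change rather than a code change. A hypothetical sketch (Python; "legacy"/"new" are stand-ins for the real DB connections):

```python
# Maps each migration phase to its write targets, read target, and authority.
# Phase A additionally syncs legacy → new asynchronously via CDC (not shown).
PHASES = {
    "A": {"writes": ["legacy"], "reads": "legacy", "authority": "legacy"},
    "B": {"writes": ["new", "legacy"], "reads": "new", "authority": "new"},
    "C": {"writes": ["new"], "reads": "new", "authority": "new"},
}

def write_targets(phase):
    """DBs that must receive this write in the given phase, in order."""
    return PHASES[phase]["writes"]

def read_target(phase):
    """DB that serves reads in the given phase."""
    return PHASES[phase]["reads"]
```

Rollback in Phase B is then literally `read_target("A")`: point reads back at legacy, which never stopped receiving writes.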

3.3 Data Integrity Verification

# Run every 6 hours during migration
# Compare record count + checksum between legacy and new DB

SELECT COUNT(*), CHECKSUM_AGG(CHECKSUM(*)) 
FROM LegacyDB.dbo.Bookings 
WHERE ModifiedDate > @LastSync

vs

SELECT COUNT(*), CHECKSUM_AGG(CHECKSUM(*)) 
FROM TravelDB.dbo.Bookings 
WHERE ModifiedDate > @LastSync

# Alert if mismatch > 0
# Auto-pause migration if mismatch > threshold
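The orchestration around those two queries is a simple scheduled job: run both, compare, alert on any mismatch, pause CDC above a threshold. A hedged sketch (Python; `run_query`, the threshold value, and the pause/alert hooks are assumptions, not the real pipeline):

```python
MISMATCH_PAUSE_THRESHOLD = 100  # assumed threshold: auto-pause CDC above this

def verify_sync(run_query, pause_cdc, alert):
    """Compare row count + checksum between legacy and new DB since last sync.

    run_query(sql) -> (count, checksum); pause_cdc/alert are side-effect hooks.
    Returns True when the two sides match.
    """
    legacy = run_query("SELECT COUNT(*), CHECKSUM_AGG(CHECKSUM(*)) "
                       "FROM LegacyDB.dbo.Bookings WHERE ModifiedDate > @LastSync")
    new = run_query("SELECT COUNT(*), CHECKSUM_AGG(CHECKSUM(*)) "
                    "FROM TravelDB.dbo.Bookings WHERE ModifiedDate > @LastSync")
    if legacy != new:
        alert(f"sync mismatch: legacy={legacy} new={new}")
    if abs(legacy[0] - new[0]) > MISMATCH_PAUSE_THRESHOLD:
        pause_cdc()
    return legacy == new
```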

4. Globally Distributed Users — Architecture

4.1 Global Topology

                         ┌────────────────────────────┐
                         │      Azure Front Door       │
                         │   (Global Load Balancer)    │
                         │   + WAF + DDoS Protection   │
                         └──────────┬─────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    │               │               │
                    ▼               ▼               ▼
          ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
          │ Southeast   │ │  Australia  │ │   Europe    │
          │ Asia        │ │  East       │ │   West      │
          │ (Primary)   │ │ (Secondary) │ │ (Secondary) │
          └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
                 │               │               │
          ┌──────┴──────┐       │               │
          │ Container   │  Read Replicas    Read Replicas
          │ Apps Env    │  (geo-replicated) (geo-replicated)
          │             │       
          │ ┌─────────┐ │
          │ │ Gateway │ │
          │ ├─────────┤ │
          │ │ Travel  │ │
          │ │ Event   │ │
          │ │ Workfrc │ │
          │ │ Comms   │ │
          │ │ Report  │ │
          │ └─────────┘ │
          │             │
          │ ┌─────────┐ │
          │ │ SQL DBs │ │ ← Primary writes
          │ │ SvcBus  │ │
          │ │ KeyVault│ │
          │ └─────────┘ │
          └─────────────┘

4.2 Multi-Region Strategy

Concern          Solution                           Detail
──────────────   ────────────────────────────────   ───────────────────────────────────────────────
Routing          Azure Front Door                   Route user to nearest healthy region,
                                                    < 5ms routing decision
Static assets    Azure CDN (React bundle)           Edge cache across 100+ PoPs, cache hit > 95%
API latency      Primary region + read replicas     Writes → SEA primary, reads → nearest replica
Database         Azure SQL geo-replication          Async replication < 5s lag, auto-failover group
Failover         Active-passive (Phase 1–2),        Phase 1–2: SEA primary, AU failover.
                 active-active (Phase 3+)           Phase 3+: both regions active
Health probes    Azure Front Door health probes     Unhealthy backend detected → traffic rerouted
                                                    to next region (< 60s)
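Front Door's decision here reduces to: send the user to the lowest-latency backend whose health probe passes. Sketch (Python; the latency figures and health flags are illustrative inputs, not an Azure API):

```python
def pick_region(latencies_ms, healthy):
    """Return the lowest-latency region whose health probe currently passes.

    latencies_ms: {region: measured latency}; healthy: {region: bool}.
    """
    candidates = {r: ms for r, ms in latencies_ms.items() if healthy.get(r)}
    if not candidates:
        raise RuntimeError("no healthy region")
    return min(candidates, key=candidates.get)
```

With SEA marked unhealthy, a user in Vietnam falls over to Australia East, which is the Section 5 failover story in one line.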

4.3 Latency Budget

                          Target: < 500ms end-to-end for 95th percentile

User (browser)
  │
  ├── DNS resolution ──────────── < 5ms   (Azure Front Door anycast)
  ├── TLS handshake ───────────── < 20ms  (CDN edge termination)
  ├── Static assets (React) ───── < 50ms  (CDN edge cache)
  ├── API call (network) ─────── < 80ms  (SEA: 20ms, AU: 50ms, EU: 80ms)
  ├── Gateway routing ──────────── < 5ms  (YARP in-memory)
  ├── Service processing ──────── < 100ms (business logic + DB query)
  ├── Database query ───────────── < 50ms (indexed queries, connection pool)
  └── Response serialization ──── < 10ms
                                  ───────
                          Total:   < 320ms typical
                                   < 500ms p95 (worst case EU user)
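The budget is purely additive, so it is worth keeping as data and re-checking whenever a per-hop target changes. Sketch (Python; numbers copied from the table above):

```python
# Worst-case per-hop targets from the latency budget table (ms)
LATENCY_BUDGET_MS = {
    "dns_resolution": 5,
    "tls_handshake": 20,
    "static_assets": 50,
    "api_network": 80,        # EU worst case; SEA ~20, AU ~50
    "gateway_routing": 5,
    "service_processing": 100,
    "database_query": 50,
    "response_serialization": 10,
}

total = sum(LATENCY_BUDGET_MS.values())
assert total == 320, "budget drifted from the documented 320ms typical total"
```

The gap between 320ms and the 500ms p95 target is the headroom for retries, cold caches, and tail latency.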

4.4 User Distribution & Region Mapping

User Region        Est. Users    Nearest Azure Region    Latency (API)
──────────────     ──────────    ────────────────────    ─────────────
Vietnam            15,000        Southeast Asia           ~20ms
Singapore          5,000         Southeast Asia           ~10ms
Australia          8,000         Australia East           ~30ms
India              4,000         Southeast Asia           ~60ms
UK/Europe          3,000         West Europe              ~40ms
US (West Coast)    2,000         West Europe*             ~120ms
Others             3,000         Nearest PoP              varies

* Phase 1–2: No US region. Phase 3+: Consider US West if user growth justifies.

5. Failure Scenarios & Recovery

5.1 Failure Matrix

┌────────────────────────────┬────────────┬────────────┬─────────────────────────────────┐
│ Failure Scenario           │ Impact     │ Detection  │ Recovery                        │
├────────────────────────────┼────────────┼────────────┼─────────────────────────────────┤
│ Single container crash     │ None       │ Health     │ Auto-restart (< 10s)            │
│                            │            │ check      │ Other replicas serve traffic    │
├────────────────────────────┼────────────┼────────────┼─────────────────────────────────┤
│ Service fully down         │ Service    │ YARP       │ Circuit breaker opens           │
│ (all replicas)             │ degraded   │ health     │ Return cached/fallback response │
│                            │            │ probe      │ Alert on-call → manual fix      │
├────────────────────────────┼────────────┼────────────┼─────────────────────────────────┤
│ Database failure           │ Service    │ Connection │ Failover to geo-replica (< 30s) │
│                            │ down       │ timeout    │ Azure SQL auto-failover group   │
├────────────────────────────┼────────────┼────────────┼─────────────────────────────────┤
│ Service Bus failure        │ Events     │ DLQ        │ Messages retry from DLQ         │
│                            │ delayed    │ monitor    │ Service Bus geo-DR (< 60s)      │
├────────────────────────────┼────────────┼────────────┼─────────────────────────────────┤
│ Primary region failure     │ Major      │ Front Door │ Front Door routes to AU region  │
│ (SEA down)                 │            │ health     │ Read from replica, queue writes │
│                            │            │ probe      │ Auto-failover (< 60s)           │
├────────────────────────────┼────────────┼────────────┼─────────────────────────────────┤
│ Bad deployment             │ Service    │ Error rate │ Auto-rollback (error > 5%)      │
│                            │ errors     │ spike      │ Or manual: revert container rev │
├────────────────────────────┼────────────┼────────────┼─────────────────────────────────┤
│ Legacy monolith crash      │ Payment    │ ACL health │ Circuit breaker on ACL          │
│ (during migration)         │ down       │ check      │ Queue payment requests          │
│                            │            │            │ Process when legacy recovers    │
├────────────────────────────┼────────────┼────────────┼─────────────────────────────────┤
│ Migration data mismatch    │ Data       │ Checksum   │ Auto-pause CDC                  │
│                            │ integrity  │ job (6h)   │ Alert team → manual reconcile   │
└────────────────────────────┴────────────┴────────────┴─────────────────────────────────┘

5.2 Recovery Architecture

                    ┌──────────────────────────────┐
                    │     RESILIENCE PATTERNS       │
                    └──────────────────────────────┘

Level 1: CONTAINER (seconds)
  ┌──────────┐  crash  ┌──────────┐
  │Container │ ──────► │ Auto     │ ──► New container spun up
  │ App      │         │ Restart  │     Other replicas handle traffic
  └──────────┘         └──────────┘     MinReplicas ≥ 2 (production)

Level 2: SERVICE (seconds)
  ┌──────────┐  fail   ┌──────────┐
  │ Service  │ ──────► │ Circuit  │ ──► Fallback response / cached data
  │ Call     │         │ Breaker  │     Retry after cooldown (30s)
  └──────────┘         │ (Polly)  │     Alert fires
                       └──────────┘

Level 3: DATABASE (< 30 seconds)
  ┌──────────┐  fail   ┌──────────┐
  │ Primary  │ ──────► │ Auto     │ ──► Geo-replica promoted to primary
  │ SQL DB   │         │ Failover │     Connection string auto-updated
  └──────────┘         │ Group    │     < 30s data loss (async replication)
                       └──────────┘

Level 4: REGION (< 60 seconds)
  ┌──────────┐  fail   ┌──────────┐
  │ SEA      │ ──────► │ Front    │ ──► Traffic routes to AU region
  │ Region   │         │ Door     │     Read from replica
  └──────────┘         │ Failover │     Queue writes for reconciliation
                       └──────────┘

Level 5: MIGRATION ROLLBACK (< 5 minutes)
  ┌──────────┐  fail   ┌──────────┐
  │ New Svc  │ ──────► │ YARP     │ ──► 100% traffic back to legacy
  │ Issues   │         │ Route    │     No data loss (CDC still syncing)
  └──────────┘         │ Switch   │     Investigate, fix, retry
                       └──────────┘
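Of these, the Level 2 circuit breaker (Polly in the real stack) is the one with non-obvious mechanics: closed → open after N consecutive failures, half-open after a cooldown, closed again once a probe succeeds. A minimal state-machine sketch (Python; thresholds illustrative, not Polly's API):

```python
import time

class CircuitBreaker:
    """closed → open after `threshold` consecutive failures;
    half-open after `cooldown_s`; a single success closes it again."""

    def __init__(self, threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures = 0
        self.opened_at = None  # None = circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def allow_request(self):
        # half-open lets one probe request through; open rejects fast
        return self.state != "open"

    def record_success(self):
        self.failures, self.opened_at = 0, None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

While open, callers fail fast and serve the cached/fallback response from the Level 2 diagram instead of piling load onto the broken dependency.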

6. Deployment — Zero Downtime Techniques

6.1 Rolling Update (regular deployments)

Time T0: [Pod-A v1] [Pod-B v1] [Pod-C v1]    ← 3 replicas running v1
              │
Time T1: [Pod-A v2] [Pod-B v1] [Pod-C v1]    ← Pod-A updated, traffic shifted
              │
Time T2: [Pod-A v2] [Pod-B v2] [Pod-C v1]    ← Pod-B updated
              │
Time T3: [Pod-A v2] [Pod-B v2] [Pod-C v2]    ← All running v2

At no point is capacity < 2 pods → zero downtime
Container Apps handles this automatically with minReplicas ≥ 2
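In Container Apps this guarantee is configuration, not orchestration code. A hedged fragment of the app definition (YAML shape accepted by `az containerapp create --yaml`; names are placeholders):

```yaml
# travel-service.yaml (fragment) — names are placeholders
properties:
  configuration:
    activeRevisionsMode: Single   # new revision replaces old via rolling update
  template:
    scale:
      minReplicas: 2              # never drop below 2 → no single point of failure
      maxReplicas: 10
```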

6.2 Blue-Green (module cutover)

                    YARP Gateway
                    ┌───────────────────┐
                    │  Route: /travel/* │
                    │                   │
                    │    ── 100% ──────►│────► BLUE (legacy)
                    │                   │      travel module
                    │                   │
                    │    ──   0% ──────►│────► GREEN (new service)
                    │                   │      travel-service v1
                    └───────────────────┘

Switch: Change weight 100/0 → 0/100
Rollback: Change weight back 0/100 → 100/0
Time to switch: < 30 seconds (YARP hot reload)

6.3 Canary (production validation)

Traffic Distribution Over Time:

Day 1:  ████████████████████████████████████████████████████ Legacy 95%
        ███                                                  New 5%

Day 3:  ████████████████████████████████████████             Legacy 75%
        █████████████                                        New 25%

Day 5:  ██████████████████████████                           Legacy 50%
        ██████████████████████████                           New 50%

Day 7:  █████                                                Legacy 5%
        ███████████████████████████████████████████████      New 95%

Day 9:                                                       Legacy 0%
        ████████████████████████████████████████████████████ New 100%

Monitoring at each step:
  □ Error rate < 0.5%
  □ Latency p95 within 10% of baseline
  □ No increase in support tickets
  □ Business metrics (bookings/events created) within normal range
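The checklist is mechanical enough to encode as a promotion gate that CI (or a human with a script) runs before each weight increase. Sketch (Python; the limits mirror the checklist, the metric names and metrics source are stand-ins):

```python
def canary_gate(metrics, baseline_p95_ms):
    """Return the list of failed checks; an empty list means safe to promote."""
    failures = []
    if metrics["error_rate"] >= 0.005:                      # error rate < 0.5%
        failures.append("error rate")
    if metrics["latency_p95_ms"] > baseline_p95_ms * 1.10:  # p95 within 10% of baseline
        failures.append("latency p95")
    if metrics["support_tickets"] > metrics["baseline_tickets"]:
        failures.append("support tickets")
    if not (0.9 <= metrics["bookings_ratio"] <= 1.1):       # business metrics ~normal
        failures.append("business metrics")
    return failures
```

Any non-empty result blocks promotion and, per Section 2.2, the weight rolls back to legacy in one config change.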

7. Observability for HA

7.1 Health Check Hierarchy

// Each service exposes 3 health endpoints:
//
// /health/live   — container is alive (Container Apps liveness probe)
// /health/ready  — ready to accept traffic (readiness probe)
// /health/full   — deep check (DB, Service Bus, dependencies)

builder.Services.AddHealthChecks()
    .AddSqlServer(connectionString, name: "database", tags: new[] { "ready", "full" })
    .AddAzureServiceBusTopic(sbConnection, "events", name: "servicebus", tags: new[] { "full" })
    .AddUrlGroup(new Uri("http://legacy/api/health"), name: "legacy-acl", tags: new[] { "full" });

var app = builder.Build();

// Liveness runs no registered checks: it only proves the process responds.
app.MapHealthChecks("/health/live",  new HealthCheckOptions { Predicate = _ => false });
app.MapHealthChecks("/health/ready", new HealthCheckOptions { Predicate = c => c.Tags.Contains("ready") });
app.MapHealthChecks("/health/full");

7.2 Dashboard (Key Metrics)

┌──────────────────────────────────────────────────────────────────┐
│                    HA DASHBOARD                                   │
├──────────────────┬───────────────────┬───────────────────────────┤
│ AVAILABILITY     │ LATENCY           │ TRAFFIC                   │
│                  │                   │                           │
│ Travel:  99.97%  │ Travel p95: 120ms │ New Service:  78% ████▓  │
│ Event:   99.95%  │ Event p95:  95ms  │ Legacy:       22% ██     │
│ Workfrc: 99.98%  │ Workfrc p95: 80ms │                          │
│ Comms:   99.96%  │ Comms p95: 110ms  │ Requests/sec: 450        │
│ Report:  99.92%  │ Report p95: 200ms │ Errors/min:   2.1        │
│ Legacy:  99.90%  │ Legacy p95: 350ms │                          │
├──────────────────┴───────────────────┴───────────────────────────┤
│ REGIONS                                                          │
│                                                                  │
│ SEA (Primary):  ✅ Healthy    AU (Secondary): ✅ Healthy         │
│ EU  (Secondary): ✅ Healthy   CDN PoPs: 112 active               │
│                                                                  │
│ Front Door: All backends healthy, 0 failovers last 24h          │
├──────────────────────────────────────────────────────────────────┤
│ DATA SYNC (CDC)                                                  │
│                                                                  │
│ Travel DB:  ✅ In sync (lag: 2s)    Last verified: 5 min ago    │
│ Event DB:   ✅ In sync (lag: 1s)    Last verified: 5 min ago    │
│ Workforce:  ⏳ Syncing (lag: 12s)   Last verified: 5 min ago    │
│ Comms DB:   ✅ In sync (lag: 3s)    Last verified: 5 min ago    │
└──────────────────────────────────────────────────────────────────┘

8. SLA Targets

Metric                 Target                               Measurement            Consequence if missed
────────────────────   ──────────────────────────────────   ────────────────────   ──────────────────────────
Uptime                 99.95% (~22 min downtime/month)      Azure Monitor          Escalate to management
API latency (global)   p95 < 500ms                          Application Insights   Performance sprint
API latency (region)   p95 < 200ms same-region              Application Insights   Investigate query/code
Recovery Time (RTO)    < 5 min (service), < 60s (region)    Incident log           Improve automation
Data Loss (RPO)        < 30s (async replication)            Failover test          Switch to sync replication
Migration rollback     < 5 min from detection to legacy     Runbook timer          Simplify rollback process
Deploy frequency       2–3 per week per service             CI/CD metrics          Fix pipeline bottleneck
Error budget           0.05% per month (SLO)                Error rate tracking    Feature freeze → fix

Error Budget Policy

Monthly error budget: 0.05% = ~22 minutes downtime OR ~1,500 failed requests (at 3M requests/month)

Budget tracking:
  ┌──────────────────────────────────────────────────┐
  │ March budget: 22 min                             │
  │ Used so far:  8 min (deploy incident Mar 5)      │
  │ Remaining:    14 min                             │
  │ Status:       ✅ Healthy                         │
  │                                                  │
  │ ███████████░░░░░░░░░░░░░░░░░░░  36% used         │
  └──────────────────────────────────────────────────┘

If budget exhausted:
  → Feature freeze
  → All engineering effort → reliability
  → No deployments except hotfixes
  → Postmortem required
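The budget numbers fall straight out of the SLO, so they are worth computing rather than quoting from memory. Sketch (Python):

```python
SLO = 0.9995                      # 99.95% monthly availability
MONTH_MINUTES = 30 * 24 * 60      # 43,200 minutes in a 30-day month
MONTHLY_REQUESTS = 3_000_000      # traffic assumption from the policy above

downtime_budget_min = MONTH_MINUTES * (1 - SLO)
failed_request_budget = MONTHLY_REQUESTS * (1 - SLO)

print(f"{downtime_budget_min:.1f} min downtime, {failed_request_budget:.0f} failed requests")
# → 21.6 min downtime, 1500 failed requests
```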

9. Summary — How We Solve Both Problems

┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  PROBLEM 1: ZERO DOWNTIME MIGRATION                                │
│  ──────────────────────────────────                                │
│  ✓ Strangler Fig pattern      — migrate module by module,          │
│                                 no big bang                        │
│  ✓ YARP weighted routing      — shift traffic 5% → 100%,           │
│                                 rollback < 30s                     │
│  ✓ CDC data sync              — legacy and new DB always in sync   │
│  ✓ Shadow + Compare mode      — validate before switching          │
│  ✓ Rolling updates            — deploy updates with no downtime    │
│  ✓ Feature flags              — instant kill switch                │
│                                                                    │
│  PROBLEM 2: GLOBALLY DISTRIBUTED USERS                             │
│  ─────────────────────────────────────                             │
│  ✓ Azure Front Door           — global LB, auto-failover < 60s     │
│  ✓ Azure CDN                  — static assets at edge,             │
│                                 112 PoPs worldwide                 │
│  ✓ Geo-replicated SQL         — read replicas near users,          │
│                                 < 5s lag, RPO < 30s                │
│  ✓ Multi-region Container Apps — SEA primary, AU secondary         │
│  ✓ 4-level resilience         — container → service → DB           │
│                                 → region auto-recovery             │
│  ✓ SLA 99.95%                 — error budget tracking,             │
│                                 feature freeze if exceeded         │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘