High Availability & Zero Downtime Strategy
Solving two core problems: zero-downtime migration + serving 40K users globally
Platform: Azure Container Apps | YARP Gateway | Azure Service Bus
1. Two Problems to Solve
Problem 1: ZERO DOWNTIME MIGRATION
Legacy monolith is running live, 40K users actively using it
→ Migrate to microservices WITHOUT shutting down the system
→ No data loss, no lost transactions
Problem 2: GLOBALLY DISTRIBUTED USERS
40K users spread across multiple timezones
→ No safe "maintenance window" available
→ Latency must be acceptable for all regions
→ System must survive regional failure
2. Zero Downtime — Approach
2.1 Strangler Fig + YARP = Zero Downtime Migration
BEFORE MIGRATION
────────────────
Users ──────► Legacy Monolith (all modules)
DURING MIGRATION (Phase 1–2, Month 2–7)
──────────────────────────────────────
┌──────────────────────┐
│ YARP Gateway │
│ (traffic router) │
└──────┬───────┬────────┘
│ │
┌───────────┘ └───────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ New Services │ │ Legacy Monolith │
│ (migrated) │ │ (remaining) │
│ │ │ │
│ • Travel ────── │ ◄──ACL──── │ • Payment │
│ • Event │ │ (frozen Phase 1) │
│ • Workforce │ │ │
│ • Comms │ │ │
│ • Reporting │ │ │
└──────────────────┘ └──────────────────┘
AFTER MIGRATION (Phase 3, Month 8–9+)
──────────────────────────────────────
┌──────────────────────┐
│ YARP Gateway │
└──────┬───────┬────────┘
│ │
┌───────────┘ └──── ACL ────┐
▼ ▼
┌──────────────────┐ ┌──────────────┐
│ All Services │ │ Legacy │
│ (100% traffic) │ │ Payment only │
└──────────────────┘ └──────────────┘
2.2 Traffic Cutover Procedure (per Module)
Each module follows a 7-step migration procedure — NO BIG BANG:
Step 1: SHADOW MODE (Day 1-3)
┌────────────┐ ┌──────────┐
│ YARP │────►│ Legacy │ ← serves response
│ Gateway │ └──────────┘
│ │────►│ New Svc │ ← receives copy, response discarded
│ │ └──────────┘
Purpose: Verify new service handles same requests without errors
Risk: Zero — legacy still serves 100% traffic
Step 2: COMPARE MODE (Day 4-5)
YARP sends to both, COMPARE responses
Log mismatches → fix business logic differences
Risk: Zero — legacy still serves
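The mismatch logging in compare mode can be sketched as follows — a minimal Python model, not the actual gateway code; the function name and the ignored fields are illustrative assumptions:

```python
def compare_responses(legacy_resp: dict, new_resp: dict,
                      ignore_fields=("timestamp", "traceId")) -> list[str]:
    """Return field-level mismatches between the legacy and new responses.

    Fields that legitimately differ per request (timestamps, trace IDs)
    are excluded, so only real business-logic differences get logged.
    """
    mismatches = []
    for key in sorted(set(legacy_resp) | set(new_resp)):
        if key in ignore_fields:
            continue
        if legacy_resp.get(key) != new_resp.get(key):
            mismatches.append(
                f"{key}: legacy={legacy_resp.get(key)!r} new={new_resp.get(key)!r}")
    return mismatches

# Identical business fields -> no mismatch, safe to proceed toward canary.
legacy = {"bookingId": 42, "total": 199.0, "timestamp": "2024-03-01T10:00:00Z"}
new    = {"bookingId": 42, "total": 199.0, "timestamp": "2024-03-01T10:00:03Z"}
assert compare_responses(legacy, new) == []
```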
Step 3: CANARY 5% (Day 6-7)
┌────────────┐ 95% ┌──────────┐
│ YARP │──────►│ Legacy │
│ Gateway │ └──────────┘
│ │ 5% ┌──────────┐
│ │──────►│ New Svc │
│ │ └──────────┘
Monitor: error rate, latency p95, business metrics
Rollback: 1 config change = 100% back to legacy (< 30 seconds)
Step 4: CANARY 25% (Day 8-9)
If Step 3 clean ≥ 48h → increase to 25%
Same monitoring, same instant rollback
Step 5: CANARY 50% (Day 10)
Half traffic on new service
Run for 24h minimum
Step 6: FULL CUTOVER 100% (Day 11)
All traffic to new service
Legacy still running (hot standby)
Step 7: DECOMMISSION LEGACY MODULE (Day 18+)
After 7-day soak at 100%
Legacy module turned off (not deleted — archived)
Legacy DB tables retained read-only for 30 days
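The promotion gate used at each canary step (Steps 3–6) can be sketched as a single decision function. Thresholds mirror the monitoring gates above (error rate < 0.5%, p95 within 10% of baseline, ≥ 48h soak); the function itself is an illustrative sketch, not production tooling:

```python
def canary_decision(error_rate: float, p95_ms: float, baseline_p95_ms: float,
                    hours_at_current_weight: float) -> str:
    """Decide whether to promote, hold, or roll back the canary."""
    if error_rate > 0.005 or p95_ms > baseline_p95_ms * 1.10:
        return "rollback"   # one YARP config change, < 30 seconds
    if hours_at_current_weight < 48:
        return "hold"       # keep soaking at the current traffic split
    return "promote"        # advance to the next canary step

# Clean metrics after a 50h soak -> promote to the next weight.
assert canary_decision(0.001, 105, 100, 50) == "promote"
```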
2.3 YARP Routing Config (Zero Downtime Switch)
// yarp.json — change routing WITHOUT redeploying gateway
{
"Routes": {
"travel-route": {
"ClusterId": "travel-cluster",
"Match": { "Path": "/api/travel/{**catch-all}" }
}
},
"Clusters": {
"travel-cluster": {
"Destinations": {
"legacy": {
"Address": "https://legacy-monolith.internal",
"Weight": 0 // ← Post-migration phase: 0%
},
"new-service": {
"Address": "https://travel-service.internal",
"Weight": 100 // ← Post-migration phase: 100%
}
},
"LoadBalancingPolicy": "WeightedRoundRobin"
}
}
}
Key insight: changing a weight in config changes the traffic split.
YARP hot-reloads its config → zero downtime, zero redeploy.
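The weighted split can be modeled in a few lines of Python — a toy model of what a weighted load-balancing policy does with the configured weights (`pick_destination` is illustrative, not YARP internals):

```python
import random

def pick_destination(destinations: dict[str, int], rng=random.random) -> str:
    """Pick a destination with probability proportional to its weight.

    Weight 0 means a destination receives no traffic, so flipping
    0/100 -> 100/0 is an instant, config-only cutover or rollback.
    """
    total = sum(destinations.values())
    point = rng() * total
    cumulative = 0
    for name, weight in destinations.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # floating-point edge fallback

# Post-migration weights: legacy gets nothing, new service gets everything.
weights = {"legacy": 0, "new-service": 100}
assert all(pick_destination(weights) == "new-service" for _ in range(1000))
```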
3. Data Migration — Zero Data Loss
3.1 CDC (Change Data Capture) Pipeline
┌──────────────┐ CDC Stream ┌──────────────┐
│ Legacy DB │ ──────────────────► │ New Service │
│ (source of │ │ Database │
│ truth) │ │ (replica) │
└──────────────┘ └──────────────┘
│ │
│ Verification │
│ ◄──────────────────────────────── │
│ Hash compare every 6h │
│ │
Phase A: Legacy = write, New = read-only replica
Phase B: Both write (dual-write via Service Bus events)
Phase C: New = write, Legacy = read-only (deprecated)
Phase D: Legacy tables archived
3.2 Dual-Write Transition
| | Phase A | Phase B | Phase C |
|---|---|---|---|
| Write path | Legacy DB → CDC → New DB (async) | New Service DB → event → Legacy DB (for backward compat) | New Service DB (Legacy DB deprecated) |
| Read traffic | Legacy DB | New Service DB | New Service DB |
| Data authority | Legacy | New Service | New Service |
| Rollback | N/A (haven't switched) | Switch reads back | Switch writes back |
| Duration | Week 1-2 | Week 3 | Week 4+ |
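The phase table above can be encoded as a tiny routing map — useful as a sanity check when wiring up the transition; the function and phase labels are illustrative, not actual migration tooling:

```python
def route_io(phase: str) -> dict:
    """Where reads, writes, and sync traffic go in each dual-write phase."""
    phases = {
        "A": {"write": "legacy",      "read": "legacy",
              "sync": "CDC -> new DB (async)"},
        "B": {"write": "new-service", "read": "new-service",
              "sync": "event -> legacy DB (backward compat)"},
        "C": {"write": "new-service", "read": "new-service",
              "sync": "legacy read-only (deprecated)"},
    }
    return phases[phase]

# Phase B: the new service is the authority, legacy follows via events.
assert route_io("B")["write"] == "new-service"
```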
3.3 Data Integrity Verification
-- Run every 6 hours during migration
-- Compare record count + checksum between legacy and new DB
SELECT COUNT(*), CHECKSUM_AGG(CHECKSUM(*))
FROM LegacyDB.dbo.Bookings
WHERE ModifiedDate > @LastSync

-- vs

SELECT COUNT(*), CHECKSUM_AGG(CHECKSUM(*))
FROM TravelDB.dbo.Bookings
WHERE ModifiedDate > @LastSync

-- Alert if mismatch > 0
-- Auto-pause migration if mismatch > threshold
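The alert/auto-pause logic around that query can be sketched in Python — an order-independent digest stands in for the T-SQL `COUNT(*)`/`CHECKSUM_AGG` pair; function names and the default threshold are illustrative assumptions:

```python
import hashlib

def table_digest(rows) -> tuple[int, str]:
    """Row count plus an order-independent digest of the rows.

    Sorting the serialized rows first makes the digest independent
    of scan order, like an aggregate checksum."""
    serialized = sorted(repr(r) for r in rows)
    return len(rows), hashlib.sha256("".join(serialized).encode()).hexdigest()

def verify(legacy_rows, new_rows, mismatch_threshold=0):
    """Return (in_sync, action). Auto-pause CDC past the threshold."""
    in_sync = table_digest(legacy_rows) == table_digest(new_rows)
    if in_sync:
        return True, "continue"
    drift = abs(len(legacy_rows) - len(new_rows))
    return False, "pause-cdc" if drift > mismatch_threshold else "alert"

assert verify([(1, "booked")], [(1, "booked")]) == (True, "continue")
```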
4. Globally Distributed Users — Architecture
4.1 Global Topology
┌────────────────────────────┐
│ Azure Front Door │
│ (Global Load Balancer) │
│ + WAF + DDoS Protection │
└──────────┬─────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Southeast │ │ Australia │ │ Europe │
│ Asia │ │ East │ │ West │
│ (Primary) │ │ (Secondary) │ │ (Secondary) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
┌──────┴──────┐ │ │
│ Container │ Read Replicas Read Replicas
│ Apps Env │ (geo-replicated) (geo-replicated)
│ │
│ ┌─────────┐ │
│ │ Gateway │ │
│ ├─────────┤ │
│ │ Travel │ │
│ │ Event │ │
│ │ Workfrc │ │
│ │ Comms │ │
│ │ Report │ │
│ └─────────┘ │
│ │
│ ┌─────────┐ │
│ │ SQL DBs │ │ ← Primary writes
│ │ SvcBus │ │
│ │ KeyVault│ │
│ └─────────┘ │
└─────────────┘
4.2 Multi-Region Strategy
| Concern | Solution | Detail |
|---|---|---|
| Routing | Azure Front Door | Route user to nearest healthy region, < 5ms routing decision |
| Static assets | Azure CDN (React bundle) | Edge cache across 100+ PoPs, cache hit > 95% |
| API latency | Primary region + read replicas | Writes → SEA primary, Reads → nearest replica |
| Database | Azure SQL Geo-Replication | Async replication < 5s lag, auto-failover group |
| Failover | Active-Passive (Phase 1–2), Active-Active (Phase 3+) | Phase 1–2: SEA primary, AU failover. Phase 3+: both active |
| DNS | Azure Front Door health probes | Auto-switch DNS if primary region down (< 60s) |
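The routing + failover rows above amount to "lowest-latency healthy region wins". A toy Python model of that Front Door behavior (region names and latencies are illustrative):

```python
def route_user(user_latencies_ms: dict[str, int], healthy: set[str]) -> str:
    """Pick the lowest-latency region that is currently healthy.

    When the primary fails its health probes, traffic automatically
    shifts to the next-nearest healthy region."""
    candidates = {r: ms for r, ms in user_latencies_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region")
    return min(candidates, key=candidates.get)

# A Vietnam user normally lands on SEA...
latencies = {"southeastasia": 20, "australiaeast": 60, "westeurope": 150}
assert route_user(latencies, {"southeastasia", "australiaeast", "westeurope"}) == "southeastasia"
# ...but fails over to AU when SEA is down.
assert route_user(latencies, {"australiaeast", "westeurope"}) == "australiaeast"
```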
4.3 Latency Budget
Target: < 500ms end-to-end for 95th percentile
User (browser)
│
├── DNS resolution ──────────── < 5ms (Azure Front Door anycast)
├── TLS handshake ───────────── < 20ms (CDN edge termination)
├── Static assets (React) ───── < 50ms (CDN edge cache)
├── API call (network) ─────── < 80ms (SEA: 20ms, AU: 50ms, EU: 80ms)
├── Gateway routing ──────────── < 5ms (YARP in-memory)
├── Service processing ──────── < 100ms (business logic + DB query)
├── Database query ───────────── < 50ms (indexed queries, connection pool)
└── Response serialization ──── < 10ms
───────
Total: < 320ms typical
< 500ms p95 (worst case EU user)
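As an arithmetic check, the per-hop budgets above do sum to the quoted typical total (dict keys are shorthand for the hops listed):

```python
# Per-hop latency budgets from the breakdown above, in milliseconds.
budget_ms = {
    "dns": 5, "tls": 20, "static_assets": 50, "api_network": 80,
    "gateway": 5, "service": 100, "db_query": 50, "serialization": 10,
}
assert sum(budget_ms.values()) == 320  # matches the "< 320ms typical" total
```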
4.4 User Distribution & Region Mapping
| User Region | Est. Users | Nearest Azure Region | Latency (API) |
|---|---|---|---|
| Vietnam | 15,000 | Southeast Asia | ~20ms |
| Singapore | 5,000 | Southeast Asia | ~10ms |
| Australia | 8,000 | Australia East | ~30ms |
| India | 4,000 | Southeast Asia | ~60ms |
| UK/Europe | 3,000 | West Europe | ~40ms |
| US (West Coast) | 2,000 | West Europe* | ~120ms |
| Others | 3,000 | Nearest PoP | varies |

* Phase 1–2: No US region. Phase 3+: Consider US West if user growth justifies.
5. Failure Scenarios & Recovery
5.1 Failure Matrix
┌────────────────────────────┬────────────┬────────────┬────────────────────────────────┐
│ Failure Scenario │ Impact │ Detection │ Recovery │
├────────────────────────────┼────────────┼────────────┼────────────────────────────────┤
│ Single container crash │ None │ Health │ Auto-restart (< 10s) │
│ │ │ check │ Other replicas serve traffic │
├────────────────────────────┼────────────┼────────────┼────────────────────────────────┤
│ Service fully down │ Service │ YARP │ Circuit breaker opens │
│ (all replicas) │ degraded │ health │ Return cached/fallback response │
│ │ │ probe │ Alert on-call → manual fix │
├────────────────────────────┼────────────┼────────────┼────────────────────────────────┤
│ Database failure │ Service │ Connection │ Failover to geo-replica (< 30s)│
│ │ down │ timeout │ Azure SQL auto-failover group │
├────────────────────────────┼────────────┼────────────┼────────────────────────────────┤
│ Service Bus failure │ Events │ DLQ │ Messages retry from DLQ │
│ │ delayed │ monitor │ Service Bus geo-DR (< 60s) │
├────────────────────────────┼────────────┼────────────┼────────────────────────────────┤
│ Primary region failure │ Major │ Front Door │ Front Door routes to AU region │
│ (SEA down) │ │ health │ Read from replica, queue writes │
│ │ │ probe │ Auto-failover (< 60s) │
├────────────────────────────┼────────────┼────────────┼────────────────────────────────┤
│ Bad deployment │ Service │ Error rate │ Auto-rollback (error > 5%) │
│ │ errors │ spike │ Or manual: revert container rev │
├────────────────────────────┼────────────┼────────────┼────────────────────────────────┤
│ Legacy monolith crash │ Payment │ ACL health │ Circuit breaker on ACL │
│ (during migration) │ down │ check │ Queue payment requests │
│ │ │ │ Process when legacy recovers │
├────────────────────────────┼────────────┼────────────┼────────────────────────────────┤
│ Migration data mismatch │ Data │ Checksum │ Auto-pause CDC │
│ │ integrity │ job (6h) │ Alert team → manual reconcile │
└────────────────────────────┴────────────┴────────────┴────────────────────────────────┘
5.2 Recovery Architecture
┌──────────────────────────────┐
│ RESILIENCE PATTERNS │
└──────────────────────────────┘
Level 1: CONTAINER (seconds)
┌──────────┐ crash ┌──────────┐
│Container │ ──────► │ Auto │ ──► New container spun up
│ App │ │ Restart │ Other replicas handle traffic
└──────────┘ └──────────┘ MinReplicas ≥ 2 (production)
Level 2: SERVICE (seconds)
┌──────────┐ fail ┌──────────┐
│ Service │ ──────► │ Circuit │ ──► Fallback response / cached data
│ Call │ │ Breaker │ Retry after cooldown (30s)
└──────────┘ │ (Polly) │ Alert fires
└──────────┘
Level 3: DATABASE (< 30 seconds)
┌──────────┐ fail ┌──────────┐
│ Primary │ ──────► │ Auto │ ──► Geo-replica promoted to primary
│ SQL DB │ │ Failover │ Connection string auto-updated
└──────────┘ │ Group │ < 30s data loss (async replication)
└──────────┘
Level 4: REGION (< 60 seconds)
┌──────────┐ fail ┌──────────┐
│ SEA │ ──────► │ Front │ ──► Traffic routes to AU region
│ Region │ │ Door │ Read from replica
└──────────┘ │ Failover │ Queue writes for reconciliation
└──────────┘
Level 5: MIGRATION ROLLBACK (< 5 minutes)
┌──────────┐ fail ┌──────────┐
│ New Svc │ ──────► │ YARP │ ──► 100% traffic back to legacy
│ Issues │ │ Route │ No data loss (CDC still syncing)
└──────────┘ │ Switch │ Investigate, fix, retry
└──────────┘
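The Level-2 pattern deserves a closer look, since it fires most often. A minimal sketch of a circuit breaker (Polly provides this in the real .NET stack; the class below is an illustrative model, not Polly's API):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open, calls
    get the fallback instead of hitting the failing service; after
    `cooldown` seconds, one trial call is let through (half-open)."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback()      # open: fail fast with cached/fallback data
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            return fallback()
        self.failures = 0              # success closes the breaker
        return result
```

Once open, every caller gets the cached response immediately instead of queueing behind timeouts — that is what keeps one failing service from dragging down the gateway.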
6. Deployment — Zero Downtime Techniques
6.1 Rolling Update (regular deployments)
Time T0: [Pod-A v1] [Pod-B v1] [Pod-C v1] ← 3 replicas running v1
│
Time T1: [Pod-A v2] [Pod-B v1] [Pod-C v1] ← Pod-A updated, traffic shifted
│
Time T2: [Pod-A v2] [Pod-B v2] [Pod-C v1] ← Pod-B updated
│
Time T3: [Pod-A v2] [Pod-B v2] [Pod-C v2] ← All running v2
At no point is capacity < 2 pods → zero downtime
Container Apps handles this automatically with minReplicas ≥ 2
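A toy model of the capacity math, under the assumption of strict one-at-a-time replacement with no surge capacity (Container Apps can also surge a new revision before draining, which only improves on this floor):

```python
def rolling_update(replicas: int, max_unavailable: int = 1):
    """Yield available capacity at each step of a one-at-a-time rollout."""
    updated = 0
    while updated < replicas:
        yield replicas - max_unavailable  # one pod draining/restarting
        updated += 1
    yield replicas                        # rollout complete

# With 3 replicas, serving capacity never drops below 2 pods.
assert min(rolling_update(3)) >= 2
```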
6.2 Blue-Green (module cutover)
YARP Gateway
┌──────────────────┐
│ Route: /travel/* │
│ │
│ ┌──── 100% ────►│──── BLUE (legacy)
│ │ │ travel module
│ │ │
│ └──── 0% ──────►│──── GREEN (new service)
│ │ travel-service v1
└──────────────────┘
Switch: Change weight 100/0 → 0/100
Rollback: Change weight back 0/100 → 100/0
Time to switch: < 30 seconds (YARP hot reload)
6.3 Canary (production validation)
Traffic distribution over time (illustrative; days counted from the start of each module's canary):
Day 1: ████████████████████████████████████████████████████ Legacy 95%
███ New 5%
Day 3: ████████████████████████████████████████ Legacy 75%
█████████████ New 25%
Day 5: ██████████████████████████ Legacy 50%
██████████████████████████ New 50%
Day 7: █████ Legacy 5%
███████████████████████████████████████████████ New 95%
Day 9: Legacy 0%
████████████████████████████████████████████████████ New 100%
Monitoring at each step:
□ Error rate < 0.5%
□ Latency p95 within 10% of baseline
□ No increase in support tickets
□ Business metrics (bookings/events created) within normal range
7. Observability for HA
7.1 Health Check Hierarchy
// Each service exposes 3 health endpoints:
//
// /health/live  — container is alive (Container Apps liveness probe)
// /health/ready — ready to accept traffic (readiness probe)
// /health/full  — deep check (DB, Service Bus, dependencies)
builder.Services.AddHealthChecks()
    .AddSqlServer(connectionString, name: "database", tags: new[] { "full" })
    .AddAzureServiceBusTopic(sbConnection, "events", name: "servicebus", tags: new[] { "full" })
    .AddUrlGroup(new Uri("http://legacy/api/health"), name: "legacy-acl", tags: new[] { "full" });

var app = builder.Build();

// Liveness runs no checks (process up = alive); readiness skips the deep
// "full" checks; /health/full runs everything, including dependencies.
app.MapHealthChecks("/health/live",  new HealthCheckOptions { Predicate = _ => false });
app.MapHealthChecks("/health/ready", new HealthCheckOptions { Predicate = c => !c.Tags.Contains("full") });
app.MapHealthChecks("/health/full");
7.2 Dashboard (Key Metrics)
┌──────────────────────────────────────────────────────────────────┐
│ HA DASHBOARD │
├──────────────────┬───────────────────┬───────────────────────────┤
│ AVAILABILITY │ LATENCY │ TRAFFIC │
│ │ │ │
│ Travel: 99.97% │ Travel p95: 120ms │ New Service: 78% ████▓ │
│ Event: 99.95% │ Event p95: 95ms │ Legacy: 22% ██ │
│ Workfrc: 99.98% │ Workfrc p95: 80ms │ │
│ Comms: 99.96% │ Comms p95: 110ms │ Requests/sec: 450 │
│ Report: 99.92% │ Report p95: 200ms │ Errors/min: 2.1 │
│ Legacy: 99.90% │ Legacy p95: 350ms │ │
├──────────────────┴───────────────────┴───────────────────────────┤
│ REGIONS │
│ │
│ SEA (Primary): ✅ Healthy AU (Secondary): ✅ Healthy │
│ EU (Secondary): ✅ Healthy CDN PoPs: 112 active │
│ │
│ Front Door: All backends healthy, 0 failovers last 24h │
├──────────────────────────────────────────────────────────────────┤
│ DATA SYNC (CDC) │
│ │
│ Travel DB: ✅ In sync (lag: 2s) Last verified: 5 min ago │
│ Event DB: ✅ In sync (lag: 1s) Last verified: 5 min ago │
│ Workforce: ⏳ Syncing (lag: 12s) Last verified: 5 min ago │
│ Comms DB: ✅ In sync (lag: 3s) Last verified: 5 min ago │
└──────────────────────────────────────────────────────────────────┘
8. SLA Targets
| Metric | Target | Measurement | Consequence if missed |
|---|---|---|---|
| Uptime | 99.95% (~22 min downtime/month) | Azure Monitor | Escalate to management |
| API latency p95 | < 500ms global | Application Insights | Performance sprint |
| API latency p95 | < 200ms same-region | Application Insights | Investigate query/code |
| Recovery Time (RTO) | < 5 min (service), < 60s (region) | Incident log | Improve automation |
| Data Loss (RPO) | < 30s (async replication) | Failover test | Switch to sync replication |
| Migration rollback | < 5 min from detection to legacy | Runbook timer | Simplify rollback process |
| Deploy frequency | 2-3 per week per service | CI/CD metrics | Fix pipeline bottleneck |
| Error budget | 0.05% per month (SLO) | Error rate tracking | Feature freeze → fix |
Error Budget Policy
Monthly error budget: 0.05% = ~22 minutes downtime OR ~1,500 failed requests (at 3M requests/month)
Budget tracking:
┌──────────────────────────────────────────────────┐
│ March budget: 22 min                             │
│ Used so far:   8 min (Deploy incident Mar 5)     │
│ Remaining:    14 min                             │
│ Status: ✅ Healthy                               │
│                                                  │
│ ███████████░░░░░░░░░░░░░░░░░░░ 36% used          │
└──────────────────────────────────────────────────┘
If budget exhausted:
→ Feature freeze
→ All engineering effort → reliability
→ No deployments except hotfixes
→ Postmortem required
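The budget numbers fall out of simple arithmetic on the SLO — a quick check (function name is illustrative):

```python
def monthly_error_budget(slo: float, minutes_per_month: int = 30 * 24 * 60,
                         requests_per_month: int = 3_000_000):
    """Translate an availability SLO into downtime and failed-request budgets."""
    budget = 1.0 - slo
    return round(minutes_per_month * budget, 1), int(round(requests_per_month * budget))

downtime_min, failed_requests = monthly_error_budget(0.9995)
assert downtime_min == 21.6        # ~22 minutes of downtime per month
assert failed_requests == 1500     # at 3M requests/month
```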
9. Summary — How We Solve Both Problems
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ PROBLEM 1: ZERO DOWNTIME MIGRATION │
│ ─────────────────────────────────────── │
│ ✓ Strangler Fig pattern — migrate module by module, no │
│ big bang │
│ ✓ YARP weighted routing — shift traffic 5% → 100%, │
│ rollback < 30s │
│ ✓ CDC data sync — legacy and new DB always in │
│ sync │
│ ✓ Shadow + Compare mode — validate before switching │
│ ✓ Rolling updates — deploy updates with no downtime │
│ ✓ Feature flags — instant kill switch │
│ │
│ PROBLEM 2: GLOBALLY DISTRIBUTED USERS │
│ ──────────────────────────────────────── │
│ ✓ Azure Front Door — global LB, auto-failover │
│ < 60s │
│ ✓ Azure CDN — static assets at edge, │
│ 112 PoPs worldwide │
│ ✓ Geo-replicated SQL — read replicas near users, │
│ < 30s replication lag │
│ ✓ Multi-region Container Apps — SEA primary, AU secondary │
│ ✓ 4-level resilience — container → service → DB │
│ → region auto-recovery │
│ ✓ SLA 99.95% — error budget tracking, │
│ feature freeze if exceeded │
│ │
└─────────────────────────────────────────────────────────────────────┘