
Deliverable 4.3 — Failure Modeling

Requirement: 5 high-risk failure scenarios, Likelihood/Impact, mitigation strategy
Source: HA.md, Constraints Analysis.md, Submission.md, Architect.md


5 High-Risk Failure Scenarios

#    Scenario                                                    Likelihood  Impact  Category
F1   Data inconsistency during CDC migration                     Medium      High    Data
F2   Legacy Payment outage blocks new services                   Medium      High    Integration
F3   AI-generated code has business logic errors in production   High        High    AI/Quality
F4   Cascading failure during canary release                     Low         High    Infrastructure
F5   Key engineer leaves mid-migration                           Medium      Medium  Team

F1: Data Inconsistency During CDC Migration

Scenario:
  CDC sync from legacy DB → new service DB is delayed or misses records.
  New service serves stale/incorrect data to users.
  E.g.: Travel Booking shows a booking that was already cancelled in legacy.

Root Causes:
  • CDC connector crash (Debezium/SQL CDC agent down)
  • Network partition between legacy DB and new DB
  • Schema change in legacy that CDC fails to pick up
  • High volume writes overwhelm CDC consumer

Timeline: Highest risk Phase 1 (M2-4) when Travel + Event first cut over
Dimension   Assessment
Likelihood  Medium — CDC is proven technology, but schema complexity + volume create edge cases
Impact      High — Users see incorrect data = trust erosion. Incorrect financial data = critical

Mitigation Strategy:

Layer       Action
Prevention  Automated checksum verification: compare record count + hash between legacy and new DB every 6 hours
Detection   Real-time CDC lag monitoring; alert if lag > 30 seconds. Schema change detection hooks
Response    Auto-pause CDC on any checksum mismatch. Alert on-call engineer. Serve data from legacy (fallback)
Validation  Dual-read validation: before switching writes to the new DB, run a 7-day parallel read comparing responses
Recovery    Re-sync from legacy DB (full snapshot → incremental CDC). Legacy DB remains the source of truth until Phase C
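The Prevention-layer checksum can be sketched as an order-independent table fingerprint. This is a minimal illustration, not the production job: the row shape, the XOR-combine scheme, and the 6-hour scheduling are all assumptions.

```python
import hashlib

def table_fingerprint(rows):
    """Return (row_count, content_digest) for a table snapshot.

    Each row is serialized deterministically, and per-row digests are
    XOR-combined so the result is independent of row order (CDC may
    apply rows out of order).
    """
    combined = 0
    for row in rows:
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        combined ^= int.from_bytes(digest[:16], "big")
    return len(rows), combined

def verify_sync(legacy_rows, new_rows):
    """True only if both row count and content digest match."""
    return table_fingerprint(legacy_rows) == table_fingerprint(new_rows)

# Example: a missed cancellation shows up as a mismatch (illustrative rows)
legacy  = [{"id": 1, "status": "cancelled"}, {"id": 2, "status": "confirmed"}]
replica = [{"id": 1, "status": "active"},    {"id": 2, "status": "confirmed"}]
assert not verify_sync(legacy, replica)   # drift detected → auto-pause CDC
assert verify_sync(legacy, legacy)
```

In practice the counts and hashes would be computed inside each database (e.g. per-partition aggregate queries) rather than by pulling rows out, but the comparison logic is the same.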

F2: Legacy Payment Outage Blocks New Services

Scenario:
  Legacy monolith crashes or overloads → Payment API unavailable.
  Travel + Event services call Payment ACL → timeout → booking flow blocked.
  E.g.: User cannot complete booking because "payment processing failed".

Root Causes:
  • Legacy monolith memory leak / crash (aging .NET Framework)
  • Database deadlock on legacy Payment tables
  • Deployment to legacy breaks Payment module
  • Network issue between Azure Container Apps and legacy infrastructure

Timeline: Risk present from Phase 1 (M2) when the ACL goes live; decreases once Payment is modernized
Dimension   Assessment
Likelihood  Medium — The aging legacy monolith means incidents will increase, but it has served 40K users and been fairly stable
Impact      High — Payment blocked = revenue blocked. Travel + Event bookings fail completely

Mitigation Strategy:

Layer                 Action
Prevention            Circuit breaker (Polly) on ACL: fail fast after 3 timeout attempts. Bulkhead: isolate Payment calls from other logic
Graceful degradation  Booking created as pending_payment. User sees "payment processing — we'll confirm shortly"
Async fallback        Queue payment request into Azure Service Bus → process when legacy recovers. Idempotency key prevents double-charge
Monitoring            ACL health check dashboard. Alert if circuit breaker OPEN > 5 minutes. Legacy monolith health endpoint
Recovery              When legacy recovers → DLQ/queued payments drain automatically. Reconciliation job verifies all queued payments processed
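The fail-fast plus queue-and-drain behavior combines the layers above. The sketch below is language-agnostic Python, not the actual Polly / Azure Service Bus wiring; the thresholds, the in-memory queue, and the idempotency-key format are assumptions.

```python
import time

class CircuitBreaker:
    """Fail fast after `max_failures` consecutive failures (sketch of the
    circuit-breaker policy on the ACL; thresholds are illustrative)."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at, self.failures = None, 0   # half-open: try again
        try:
            result = fn(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()     # open the circuit
            raise

payment_queue = []  # stand-in for the Service Bus queue

def charge_or_queue(breaker, charge_fn, booking_id, amount):
    """Try the legacy Payment API; on any failure (including an open
    circuit), queue the request with a stable idempotency key so the
    drain job can deduplicate and never double-charge."""
    try:
        return breaker.call(charge_fn, booking_id, amount)
    except Exception:
        payment_queue.append({
            "idempotency_key": f"{booking_id}:{amount}",  # stable, not random
            "booking_id": booking_id,
            "amount": amount,
            "status": "pending_payment",
        })
        return {"status": "pending_payment"}
```

The key design point is that the user-facing path never blocks on the legacy monolith: either the charge succeeds quickly, or the booking lands in `pending_payment` and drains later.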

F3: AI-Generated Code Contains Business Logic Errors

Scenario:
  AI agent migrates Travel pricing rules from legacy → .NET 8 service.
  AI misses edge case: promotional discount stacking logic.
  Code passes CI (tests don’t cover edge case). Deployed to production.
  Users charged wrong prices. Financial impact + refund overhead.

Root Causes:
  • AI hallucination: confident-looking but incorrect logic translation
  • Legacy business rules embedded in stored procedures (not obvious to AI)
  • Insufficient test coverage for edge cases
  • Team trusts AI output without deep validation (blind AI usage)

Timeline: Highest risk Phase 1-2 (M2-7), while bulk migration is underway
Dimension   Assessment
Likelihood  High — AI hallucination on business logic is well documented. Legacy code is messy
Impact      High — Wrong pricing = financial loss. Wrong bookings = trust erosion. Could require manual reconciliation

Mitigation Strategy:

Layer                  Action
Prevention             Mandatory human review for ALL business logic migrations. Checklist: trace every business rule from legacy → new code
Shadow + Compare       Before canary: mirror traffic to new service, compare responses vs legacy. Catch mismatches before any real traffic
Contract tests (Pact)  API behavior must match legacy exactly. Consumer expectations codified as tests
Payment rule           Payment-related code: 2 human reviewers, zero AI-only merges. 100% human validated
Detection              Business metrics monitoring: compare booking value, discount rates, conversion rates vs legacy baseline. Alert on deviation > 5%
Recovery               Rollback via YARP weight → 100% legacy (< 30 seconds). Reconciliation for affected transactions
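The Shadow + Compare layer amounts to a response diff that never affects the user. A minimal sketch, assuming dict-shaped responses and an in-memory mismatch log (`record_mismatch` and the field names are hypothetical):

```python
mismatches = []

def record_mismatch(request, detail):
    """Hypothetical hook: in production this would feed the mismatch dashboard."""
    mismatches.append((request, detail))

def shadow_compare(request, legacy_handler, new_handler,
                   ignore_fields=("trace_id",)):
    """Serve the legacy response; run the new service in shadow and log any
    field-level difference. The user always gets the legacy answer."""
    legacy_resp = legacy_handler(request)
    try:
        new_resp = new_handler(request)
    except Exception as exc:
        record_mismatch(request, f"new service raised: {exc}")
        return legacy_resp
    diffs = {
        k: (legacy_resp.get(k), new_resp.get(k))
        for k in set(legacy_resp) | set(new_resp)
        if k not in ignore_fields and legacy_resp.get(k) != new_resp.get(k)
    }
    if diffs:
        record_mismatch(request, diffs)   # e.g. missed discount-stacking rule
    return legacy_resp

# Example: the migrated pricing logic drops a promotional discount
resp = shadow_compare({"booking_id": 42},
                      lambda r: {"total": 90.0},    # legacy price
                      lambda r: {"total": 100.0})   # AI-migrated price
assert resp == {"total": 90.0}                      # user unaffected
assert mismatches[-1][1] == {"total": (90.0, 100.0)}
```

This is exactly the failure class in the F3 scenario: a mismatch the test suite missed surfaces here, on real traffic shapes, before any user sees the new service.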

F4: Cascading Failure During Canary Release

Scenario:
  Event Service at 25% canary traffic → latency spike (bug or DB issue).
  Legacy receives retry storm from YARP timeout → overloaded.
  Both new service AND legacy degrade → full module outage.

Root Causes:
  • New service has unoptimized query → slow response
  • YARP retries to legacy when new service slow → doubles load
  • No backpressure mechanism → retry storm compounds
  • DB connection pool exhaustion (new service steals from legacy pool)

Timeline: Highest risk during canary Steps 3-5 (first traffic to new service)
Dimension   Assessment
Likelihood  Low — Shadow + Compare mode catches most issues before canary. Canary starts at 5%
Impact      High — If both legacy AND the new service go down → full module outage for all users

Mitigation Strategy:

Layer           Action
Prevention      Bulkhead pattern: new service failures physically isolated from legacy traffic (separate connection pools, separate pods)
Auto-rollback   Canary with automatic rollback: error rate > 0.5% → instant 100% legacy. YARP monitors health probe
Rate limiting   Gateway applies rate limit per destination. New service gets max 25% capacity during canary
No retry storm  YARP retry policy: max 1 retry, 2s timeout. No exponential retry to legacy
Kill switch     Feature flag → 100% legacy in < 30 seconds. One button, zero deploy
Detection       p95 latency + error rate dashboard, real-time. PagerDuty alert if anomaly detected
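The auto-rollback rule reduces to a pure decision function over live metrics. The 0.5% error-rate threshold is from the table above; the 800 ms p95 budget and the 5→10→25→50→100 progression are assumed values for illustration:

```python
def canary_decision(error_rate, p95_ms, weight, *,
                    max_error_rate=0.005, max_p95_ms=800):
    """Return the next traffic weight (%) for the new service.

    Any SLO breach rolls straight back to 0% (100% legacy) — there is
    no gradual backoff, so a retry storm never gets time to build.
    """
    if error_rate > max_error_rate or p95_ms > max_p95_ms:
        return 0                          # kill switch: all traffic to legacy
    steps = [5, 10, 25, 50, 100]          # assumed canary progression
    larger = [s for s in steps if s > weight]
    return larger[0] if larger else 100   # healthy: advance one step

assert canary_decision(0.001, 300, 5) == 10    # healthy → advance
assert canary_decision(0.020, 300, 25) == 0    # error spike → full rollback
```

Keeping the decision pure (metrics in, weight out) makes it trivially testable and lets the same rule drive both the automated gate and the manual kill switch.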

F5: Key Engineer Leaves Mid-Migration

Scenario:
  D2 (Senior Backend, Travel Booking lead) exits at Month 4.
  Travel Booking is 80% done but has complex pricing logic only D2 understands.
  Remaining 4 engineers scramble to cover Travel + continue Event extraction.
  Timeline slips. Quality drops. Team morale decreases.

Root Causes:
  • Bus factor = 1 for critical module (only D2 deeply knows Travel domain)
  • High stress (tight timeline + zero downtime pressure)
  • External opportunity (competing offers in market)
  • Burnout from AI-heavy + migration-heavy workload

Timeline: Risk highest Month 3-6 (deep in migration, most knowledge accumulated)
Dimension   Assessment
Likelihood  Medium — The Vietnam tech market is competitive; losing 1 engineer from a 5-person team removes 20% of capacity
Impact      Medium — With AI docs + cross-training, knowledge loss is reduced, but the velocity hit is real

Mitigation Strategy:

Layer              Action
Prevention         Cross-training: every service has a primary + secondary owner. Weekly code walkthroughs. Rotate PR review assignments
Documentation      AI-generated docs from legacy code (Phase 0). All decisions in ADRs. Architecture docs maintained
AI buffer          AI multiplier reduces individual dependency. New engineer onboards faster with AI codebase Q&A
Scope flexibility  If an engineer leaves → immediately defer the lowest-priority service (Reporting) to post-9-months
Recovery           With 2x AI: remaining 4 engineers ≈ 8 traditional. Not ideal, but survivable if scope is tightened

Risk Heat Map

                    LOW IMPACT        MEDIUM IMPACT      HIGH IMPACT
                    ──────────        ─────────────      ───────────
HIGH LIKELIHOOD     │                 │                  │ F3: AI code
                    │                 │                  │ errors
                    │                 │                  │
MEDIUM LIKELIHOOD   │                 │ F5: Key eng      │ F1: Data
                    │                 │ leaves           │ inconsistency
                    │                 │                  │ F2: Payment
                    │                 │                  │ outage
LOW LIKELIHOOD      │                 │                  │ F4: Cascade
                    │                 │                  │ failure

Priority: F3 > F1 = F2 > F5 > F4

Summary — What Each Risk Tests

Risk          What the Assessor Tests                       Your Signal
F1 (Data)     Can you handle distributed data?              CDC + checksum + dual-read = proven approach
F2 (Payment)  Can you decouple from legacy?                 Circuit breaker + async queue = graceful degradation
F3 (AI)       Do you understand AI limitations?             Shadow + Compare + human review = AI-aware leadership
F4 (Cascade)  Do you understand distributed failure modes?  Bulkhead + auto-rollback + rate limit = resilience patterns
F5 (Team)     Can you manage people risk?                   Cross-training + AI buffer + scope flex = pragmatic leadership