
Deliverable 4.3 — Failure Modeling

Requirement: 5 high-risk failure scenarios, Likelihood/Impact, mitigation strategy
Source: HA.md, Constraints Analysis.md, Submission.md, Architect.md


5 High-Risk Failure Scenarios

#    Scenario                                                    Likelihood  Impact  Category
F1   Data inconsistency during CDC migration                     Medium      High    Data
F2   Legacy Payment outage blocks new services                   Medium      High    Integration
F3   AI-generated code has business logic errors in production   High        High    AI/Quality
F4   Cascading failure during canary release                     Low         High    Infrastructure
F5   Key engineer leaves mid-migration                           Medium      Medium  Team

F1: Data Inconsistency During CDC Migration

Scenario:
  CDC sync from legacy DB → new service DB is delayed or misses records.
  New service serves stale/incorrect data to users.
  E.g.: Travel Booking shows a booking that was already cancelled in legacy.

Root Causes:
  • CDC connector crash (Debezium/SQL CDC agent down)
  • Network partition between legacy DB and new DB
  • Schema change in legacy that CDC fails to pick up
  • High volume writes overwhelm CDC consumer

Timeline: Highest risk Phase 1 (M2-4) when Travel + Event first cut over
Dimension   Assessment
Likelihood  Medium — CDC is proven technology, but schema complexity + volume create edge cases
Impact      High — Users see incorrect data = trust erosion. Incorrect financial data = critical

Mitigation Strategy:

Layer       Action
Prevention  Automated checksum verification: compare record count + hash between legacy and new DB every 6 hours
Detection   Real-time CDC lag monitoring; alert if lag > 30 seconds. Schema change detection hooks
Response    Auto-pause CDC on any checksum mismatch. Alert on-call engineer. Serve data from legacy (fallback)
Validation  Dual-read validation: before switching writes to the new DB, run a 7-day parallel read comparing responses
Recovery    Re-sync from legacy DB (full snapshot → incremental CDC). Legacy DB remains the source of truth until Phase C
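The Prevention-layer checksum can be sketched as an order-independent table fingerprint. This is a minimal illustration, not the production job: the row shape, the XOR-combine scheme, and the 6-hour scheduling are all assumptions.

```python
import hashlib

def table_fingerprint(rows):
    """Return (row_count, content_digest) for a table snapshot.

    Each row is serialized deterministically, and per-row digests are
    XOR-combined so the result is independent of row order (CDC may
    apply rows out of order).
    """
    combined = 0
    for row in rows:
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        combined ^= int.from_bytes(digest[:16], "big")
    return len(rows), combined

def verify_sync(legacy_rows, new_rows):
    """True only if both row count and content digest match."""
    return table_fingerprint(legacy_rows) == table_fingerprint(new_rows)

# Example: a missed cancellation shows up as a mismatch (illustrative rows)
legacy  = [{"id": 1, "status": "cancelled"}, {"id": 2, "status": "confirmed"}]
replica = [{"id": 1, "status": "active"},    {"id": 2, "status": "confirmed"}]
assert not verify_sync(legacy, replica)   # drift detected → auto-pause CDC
assert verify_sync(legacy, legacy)
```

In practice the counts and hashes would be computed inside each database (e.g. per-partition aggregate queries) rather than by pulling rows out, but the comparison logic is the same.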

F2: Legacy Payment Outage Blocks New Services

Scenario:
  Legacy monolith crashes or overloads → Payment API unavailable.
  Travel + Event services call Payment ACL → timeout → booking flow blocked.
  E.g.: User cannot complete booking because "payment processing failed".

Root Causes:
  • Legacy monolith memory leak / crash (aging .NET Framework)
  • Database deadlock on legacy Payment tables
  • Deployment to legacy breaks Payment module
  • Network issue between Azure Container Apps and legacy infrastructure

Timeline: Risk present from Phase 1 (M2) when the ACL goes live; decreases once Payment is modernized
Dimension   Assessment
Likelihood  Medium — The aging legacy monolith means incidents will increase, but it has served 40K users and been fairly stable
Impact      High — Payment blocked = revenue blocked. Travel + Event bookings fail completely

Mitigation Strategy:

Layer                 Action
Prevention            Circuit breaker (Polly) on ACL: fail fast after 3 timeout attempts. Bulkhead: isolate Payment calls from other logic
Graceful degradation  Booking created as pending_payment. User sees "payment processing — we'll confirm shortly"
Async fallback        Queue payment request into Azure Service Bus → process when legacy recovers. Idempotency key prevents double-charge
Monitoring            ACL health check dashboard. Alert if circuit breaker OPEN > 5 minutes. Legacy monolith health endpoint
Recovery              When legacy recovers → DLQ/queued payments drain automatically. Reconciliation job verifies all queued payments processed
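The fail-fast plus queue-and-drain behavior combines the layers above. The sketch below is language-agnostic Python, not the actual Polly / Azure Service Bus wiring; the thresholds, the in-memory queue, and the idempotency-key format are assumptions.

```python
import time

class CircuitBreaker:
    """Fail fast after `max_failures` consecutive failures (sketch of the
    circuit-breaker policy on the ACL; thresholds are illustrative)."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at, self.failures = None, 0   # half-open: try again
        try:
            result = fn(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()     # open the circuit
            raise

payment_queue = []  # stand-in for the Service Bus queue

def charge_or_queue(breaker, charge_fn, booking_id, amount):
    """Try the legacy Payment API; on any failure (including an open
    circuit), queue the request with a stable idempotency key so the
    drain job can deduplicate and never double-charge."""
    try:
        return breaker.call(charge_fn, booking_id, amount)
    except Exception:
        payment_queue.append({
            "idempotency_key": f"{booking_id}:{amount}",  # stable, not random
            "booking_id": booking_id,
            "amount": amount,
            "status": "pending_payment",
        })
        return {"status": "pending_payment"}
```

The key design point is that the user-facing path never blocks on the legacy monolith: either the charge succeeds quickly, or the booking lands in `pending_payment` and drains later.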

F3: AI-Generated Code Contains Business Logic Errors

Scenario:
  AI agent migrates Travel pricing rules from legacy → .NET 8 service.
  AI misses edge case: promotional discount stacking logic.
  Code passes CI (tests don’t cover edge case). Deployed to production.
  Users charged wrong prices. Financial impact + refund overhead.

Root Causes:
  • AI hallucination: confident-looking but incorrect logic translation
  • Legacy business rules embedded in stored procedures (not obvious to AI)
  • Insufficient test coverage for edge cases
  • Team trusts AI output without deep validation (blind AI usage)

Timeline: Highest risk Phase 1-2 (M2-7), while bulk migration is underway
Dimension   Assessment
Likelihood  High — AI hallucination on business logic is well documented. Legacy code is messy
Impact      High — Wrong pricing = financial loss. Wrong bookings = trust erosion. Could require manual reconciliation

Mitigation Strategy:

Layer                  Action
Prevention             Mandatory human review for ALL business logic migrations. Checklist: trace every business rule from legacy → new code
Shadow + Compare       Before canary: mirror traffic to new service, compare responses vs legacy. Catch mismatches before any real traffic
Contract tests (Pact)  API behavior must match legacy exactly. Consumer expectations codified as tests
Payment rule           Payment-related code: 2 human reviewers, zero AI-only merges. 100% human validated
Detection              Business metrics monitoring: compare booking value, discount rates, conversion rates vs legacy baseline. Alert on deviation > 5%
Recovery               Rollback via YARP weight → 100% legacy (< 30 seconds). Reconciliation for affected transactions
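The Shadow + Compare layer amounts to a response diff that never affects the user. A minimal sketch, assuming dict-shaped responses and an in-memory mismatch log (`record_mismatch` and the field names are hypothetical):

```python
mismatches = []

def record_mismatch(request, detail):
    """Hypothetical hook: in production this would feed the mismatch dashboard."""
    mismatches.append((request, detail))

def shadow_compare(request, legacy_handler, new_handler,
                   ignore_fields=("trace_id",)):
    """Serve the legacy response; run the new service in shadow and log any
    field-level difference. The user always gets the legacy answer."""
    legacy_resp = legacy_handler(request)
    try:
        new_resp = new_handler(request)
    except Exception as exc:
        record_mismatch(request, f"new service raised: {exc}")
        return legacy_resp
    diffs = {
        k: (legacy_resp.get(k), new_resp.get(k))
        for k in set(legacy_resp) | set(new_resp)
        if k not in ignore_fields and legacy_resp.get(k) != new_resp.get(k)
    }
    if diffs:
        record_mismatch(request, diffs)   # e.g. missed discount-stacking rule
    return legacy_resp

# Example: the migrated pricing logic drops a promotional discount
resp = shadow_compare({"booking_id": 42},
                      lambda r: {"total": 90.0},    # legacy price
                      lambda r: {"total": 100.0})   # AI-migrated price
assert resp == {"total": 90.0}                      # user unaffected
assert mismatches[-1][1] == {"total": (90.0, 100.0)}
```

This is exactly the failure class in the F3 scenario: a mismatch the test suite missed surfaces here, on real traffic shapes, before any user sees the new service.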

F4: Cascading Failure During Canary Release

Scenario:
  Event Service at 25% canary traffic → latency spike (bug or DB issue).
  Legacy receives retry storm from YARP timeout → overloaded.
  Both new service AND legacy degrade → full module outage.

Root Causes:
  • New service has unoptimized query → slow response
  • YARP retries to legacy when new service slow → doubles load
  • No backpressure mechanism → retry storm compounds
  • DB connection pool exhaustion (new service steals from legacy pool)

Timeline: Highest risk during canary Steps 3-5 (first traffic to new service)
Dimension   Assessment
Likelihood  Low — Shadow + Compare mode catches most issues before canary. Canary starts at 5%
Impact      High — If both legacy AND the new service go down → full module outage for all users

Mitigation Strategy:

Layer           Action
Prevention      Bulkhead pattern: new service failures physically isolated from legacy traffic (separate connection pools, separate pods)
Auto-rollback   Canary with automatic rollback: error rate > 0.5% → instant 100% legacy. YARP monitors health probe
Rate limiting   Gateway applies rate limit per destination. New service gets max 25% capacity during canary
No retry storm  YARP retry policy: max 1 retry, 2s timeout. No exponential retry to legacy
Kill switch     Feature flag → 100% legacy in < 30 seconds. One button, zero deploy
Detection       p95 latency + error rate dashboard, real-time. PagerDuty alert if anomaly detected
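The auto-rollback rule reduces to a pure decision function over live metrics. The 0.5% error-rate threshold is from the table above; the 800 ms p95 budget and the 5→10→25→50→100 progression are assumed values for illustration:

```python
def canary_decision(error_rate, p95_ms, weight, *,
                    max_error_rate=0.005, max_p95_ms=800):
    """Return the next traffic weight (%) for the new service.

    Any SLO breach rolls straight back to 0% (100% legacy) — there is
    no gradual backoff, so a retry storm never gets time to build.
    """
    if error_rate > max_error_rate or p95_ms > max_p95_ms:
        return 0                          # kill switch: all traffic to legacy
    steps = [5, 10, 25, 50, 100]          # assumed canary progression
    larger = [s for s in steps if s > weight]
    return larger[0] if larger else 100   # healthy: advance one step

assert canary_decision(0.001, 300, 5) == 10    # healthy → advance
assert canary_decision(0.020, 300, 25) == 0    # error spike → full rollback
```

Keeping the decision pure (metrics in, weight out) makes it trivially testable and lets the same rule drive both the automated gate and the manual kill switch.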

F5: Key Engineer Leaves Mid-Migration

Scenario:
  D2 (Senior Backend, Travel Booking lead) exits at Month 4.
  Travel Booking is 80% done but has complex pricing logic only D2 understands.
  Remaining 4 engineers scramble to cover Travel + continue Event extraction.
  Timeline slips. Quality drops. Team morale decreases.

Root Causes:
  • Bus factor = 1 for critical module (only D2 deeply knows Travel domain)
  • High stress (tight timeline + zero downtime pressure)
  • External opportunity (competing offers in market)
  • Burnout from AI-heavy + migration-heavy workload

Timeline: Risk highest Month 3-6 (deep in migration, most knowledge accumulated)
Dimension   Assessment
Likelihood  Medium — The Vietnam tech market is competitive; losing 1 engineer from a 5-person team removes 20% of capacity
Impact      Medium — With AI docs + cross-training, knowledge loss is reduced, but the velocity hit is real

Mitigation Strategy:

Layer              Action
Prevention         Cross-training: every service has a primary + secondary owner. Weekly code walkthroughs. Rotate PR review assignments
Documentation      AI-generated docs from legacy code (Phase 0). All decisions in ADRs. Architecture docs maintained
AI buffer          AI multiplier reduces individual dependency. New engineer onboards faster with AI codebase Q&A
Scope flexibility  If an engineer leaves → immediately defer the lowest-priority service (Reporting) to post-9-months
Recovery           With 2x AI: remaining 4 engineers ≈ 8 traditional. Not ideal, but survivable if scope is tightened

Risk Heat Map

                    LOW IMPACT        MEDIUM IMPACT      HIGH IMPACT
                    ──────────        ─────────────      ───────────
HIGH LIKELIHOOD     │                 │                  │ F3: AI code
                    │                 │                  │ errors
                    │                 │                  │
MEDIUM LIKELIHOOD   │                 │ F5: Key eng      │ F1: Data
                    │                 │ leaves           │ inconsistency
                    │                 │                  │ F2: Payment
                    │                 │                  │ outage
LOW LIKELIHOOD      │                 │                  │ F4: Cascade
                    │                 │                  │ failure

Priority: F3 > F1 = F2 > F5 > F4

Summary — What Each Risk Tests

Risk          What the Assessor Tests                       Your Signal
F1 (Data)     Can you handle distributed data?              CDC + checksum + dual-read = proven approach
F2 (Payment)  Can you decouple from legacy?                 Circuit breaker + async queue = graceful degradation
F3 (AI)       Do you understand AI limitations?             Shadow + Compare + human review = AI-aware leadership
F4 (Cascade)  Do you understand distributed failure modes?  Bulkhead + auto-rollback + rate limit = resilience patterns
F5 (Team)     Can you manage people risk?                   Cross-training + AI buffer + scope flex = pragmatic leadership