Deliverable 4.3 — Failure Modeling
Requirement: 5 high-risk failure scenarios, Likelihood/Impact, mitigation strategy
Source: HA.md, Constraints Analysis.md, Submission.md, Architect.md
5 High-Risk Failure Scenarios
| # | Scenario | Likelihood | Impact | Category |
|---|----------|------------|--------|----------|
| F1 | Data inconsistency during CDC migration | Medium | High | Data |
| F2 | Legacy Payment outage blocks new services | Medium | High | Integration |
| F3 | AI-generated code has business logic errors in production | High | High | AI/Quality |
| F4 | Cascading failure during canary release | Low | High | Infrastructure |
| F5 | Key engineer leaves mid-migration | Medium | Medium | Team |
F1: Data Inconsistency During CDC Migration
Scenario:
CDC sync from legacy DB → new service DB is delayed or misses records.
New service serves stale/incorrect data to users.
E.g.: Travel Booking shows a booking that was already cancelled in legacy.
Root Causes:
• CDC connector crash (Debezium/SQL CDC agent down)
• Network partition between legacy DB and new DB
• Schema change in legacy that CDC fails to pick up
• High volume writes overwhelm CDC consumer
Timeline: Highest risk Phase 1 (M2-4) when Travel + Event first cut over
| Dimension | Assessment |
|-----------|------------|
| Likelihood | Medium — CDC is proven technology, but schema complexity + volume creates edge cases |
| Impact | High — Users see incorrect data = trust erosion. Incorrect financial data = critical |
Mitigation Strategy:
| Layer | Action |
|-------|--------|
| Prevention | Automated checksum verification: compare record count + hash between legacy and new DB every 6 hours |
| Detection | Real-time CDC lag monitoring, alert if lag > 30 seconds. Schema change detection hooks |
| Response | Auto-pause CDC if checksum mismatch > 0. Alert on-call engineer. Serve data from legacy (fallback) |
| Validation | Dual-read validation: before switching writes to new DB, 7-day parallel read comparing responses |
| Recovery | Re-sync from legacy DB (full snapshot → incremental CDC). Legacy DB is source of truth until Phase C |
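The count + hash check in the Prevention row can be sketched as a small comparison job. This is a minimal Python illustration, not the production implementation: the function names are hypothetical, rows are assumed to arrive as tuples, and an XOR-of-row-hashes is one possible order-independent fingerprint (the real job would stream chunks from both databases every 6 hours).

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent fingerprint: (row count, XOR of per-row hashes).
    XOR lets chunks be fingerprinted in any order and still compare equal."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).digest()
        digest ^= int.from_bytes(h[:8], "big")
    return len(rows), digest

def verify_sync(legacy_rows, new_rows):
    """True when count and hash both match; a mismatch should auto-pause CDC
    and page the on-call engineer, per the Response row above."""
    return table_fingerprint(legacy_rows) == table_fingerprint(new_rows)
```

A dropped record changes the count; a corrupted record changes the hash, so either failure mode trips the auto-pause.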
F2: Legacy Payment Outage Blocks New Services
Scenario:
Legacy monolith crashes or overloads → Payment API unavailable.
Travel + Event services call Payment ACL → timeout → booking flow blocked.
E.g.: User cannot complete booking because "payment processing failed".
Root Causes:
• Legacy monolith memory leak / crash (aging .NET Framework)
• Database deadlock on legacy Payment tables
• Deployment to legacy breaks Payment module
• Network issue between Azure Container Apps and legacy infrastructure
Timeline: Risk present from Phase 1 (M2) when ACL goes live, decreases if Payment modernized
| Dimension | Assessment |
|-----------|------------|
| Likelihood | Medium — aging monolith means incidents will increase, but it has served 40K users and is reasonably stable |
| Impact | High — Payment blocked = revenue blocked. Travel + Event bookings fail completely |
Mitigation Strategy:
| Layer | Action |
|-------|--------|
| Prevention | Circuit breaker (Polly) on ACL: fail fast after 3 timeout attempts. Bulkhead: isolate Payment calls from other logic |
| Graceful degradation | Booking created as pending_payment. User sees "payment processing — we'll confirm shortly" |
| Async fallback | Queue payment request into Azure Service Bus → process when legacy recovers. Idempotency key prevents double-charge |
| Monitoring | ACL health check dashboard. Alert if circuit breaker OPEN > 5 minutes. Legacy monolith health endpoint |
| Recovery | When legacy recovers → DLQ/queued payments drain automatically. Reconciliation job verifies all queued payments processed |
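In .NET the circuit breaker would come from Polly, as the Prevention row says; the sketch below only illustrates the intended behavior (trip after 3 consecutive failures, fail fast while open, retry after a cooldown). Class and parameter names are ours, and the fallback stands in for "enqueue to Service Bus, return pending_payment".

```python
import time

class CircuitBreaker:
    """Fails fast after `threshold` consecutive failures; half-opens after `cooldown` s."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None and self.clock() - self.opened_at < self.cooldown:
            return fallback()                    # OPEN: don't touch the legacy Payment API
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()    # trip the breaker
            return fallback()
        self.failures, self.opened_at = 0, None  # success closes the breaker
        return result
```

The key property: once open, the legacy monolith receives zero traffic from the ACL, so a struggling Payment module is not hammered by the very retries that would keep it down.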
F3: AI-Generated Code Contains Business Logic Errors
Scenario:
AI agent migrates Travel pricing rules from legacy → .NET 8 service.
AI misses edge case: promotional discount stacking logic.
Code passes CI (tests don’t cover edge case). Deployed to production.
Users charged wrong prices. Financial impact + refund overhead.
Root Causes:
• AI hallucination: confident-looking but incorrect logic translation
• Legacy business rules embedded in stored procedures (not obvious to AI)
• Insufficient test coverage for edge cases
• Team trusts AI output without deep validation (blind AI usage)
Timeline: Highest risk Phase 1-2 (M2-7) when bulk migration happening
| Dimension | Assessment |
|-----------|------------|
| Likelihood | High — AI hallucination on business logic is well-documented. Legacy code is messy |
| Impact | High — Wrong pricing = financial loss. Wrong booking = trust erosion. Could require manual reconciliation |
Mitigation Strategy:
| Layer | Action |
|-------|--------|
| Prevention | Mandatory human review for ALL business logic migrations. Checklist: trace every business rule from legacy → new code |
| Shadow + Compare | Before canary: mirror traffic to new service, compare responses vs legacy. Catch mismatches before any real traffic |
| Contract tests (Pact) | API behavior must match legacy exactly. Consumer expectations codified as tests |
| Payment rule | Payment-related code: 2 human reviewers, zero AI-only merge. 100% human validated |
| Detection | Business metrics monitoring: compare booking value, discount rates, conversion rates vs legacy baseline. Alert on deviation > 5% |
| Recovery | Rollback via YARP weight → 100% legacy (< 30 seconds). Reconciliation for affected transactions |
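The Shadow + Compare row reduces to one invariant: only the legacy response ever reaches the user, while the new service's answer is computed out-of-band and any divergence is logged. A minimal sketch (handler signatures and the mismatch log are illustrative assumptions, not the real gateway code):

```python
def shadow_compare(request, legacy_handler, new_handler, mismatches):
    """Serve legacy; call the new service in shadow mode and record any diff.
    A new-service crash is a mismatch too — it must never affect the user."""
    legacy_resp = legacy_handler(request)
    try:
        new_resp = new_handler(request)
        if new_resp != legacy_resp:
            mismatches.append({"request": request, "legacy": legacy_resp, "new": new_resp})
    except Exception as exc:
        mismatches.append({"request": request, "error": repr(exc)})
    return legacy_resp
```

This is exactly the net that catches the F3 scenario: an AI-migrated pricing service that drops promotional discount stacking produces response diffs on promo requests during shadow mode, before any real traffic is at stake.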
F4: Cascading Failure During Canary Release
Scenario:
Event Service at 25% canary traffic → latency spike (bug or DB issue).
Legacy receives retry storm from YARP timeout → overloaded.
Both new service AND legacy degrade → full module outage.
Root Causes:
• New service has unoptimized query → slow response
• YARP retries to legacy when new service slow → doubles load
• No backpressure mechanism → retry storm compounds
• DB connection pool exhaustion (new service steals from legacy pool)
Timeline: Highest risk during canary Steps 3-5 (first traffic to new service)
| Dimension | Assessment |
|-----------|------------|
| Likelihood | Low — Shadow + Compare mode catches most issues before canary. Canary starts at 5% |
| Impact | High — If both legacy AND new service down → full module outage for all users |
Mitigation Strategy:
| Layer | Action |
|-------|--------|
| Prevention | Bulkhead pattern: new service failures physically isolated from legacy traffic (separate connection pools, separate pods) |
| Auto-rollback | Canary with automatic rollback: error rate > 0.5% → instant 100% legacy. YARP monitors health probe |
| Rate limiting | Gateway applies rate limit per destination. New service gets max 25% capacity during canary |
| No retry storm | YARP retry policy: max 1 retry, 2s timeout. No exponential retry to legacy |
| Kill switch | Feature flag → 100% legacy in < 30 seconds. One button, zero deploy |
| Detection | p95 latency + error rate dashboard, real-time. PagerDuty alert if anomaly detected |
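The auto-rollback rule (error rate > 0.5% → instant 100% legacy) is deliberately a one-way decision: there is no partial step-down, because a degrading canary plus retries is what creates the storm. A sketch of the decision function that would drive the gateway weight (`canary_weight` and its inputs are hypothetical names; in practice the error window would come from real-time metrics):

```python
def canary_weight(current_weight, window_total, window_errors, max_error_rate=0.005):
    """Return the canary traffic share for the next interval.
    Any breach of the error budget sends 100% of traffic back to legacy."""
    if window_total and window_errors / window_total > max_error_rate:
        return 0  # 0% to the new service = instant full rollback
    return current_weight
```

Keeping the rule this simple is intentional: at the moment of a cascading failure, the last thing the system needs is a clever controller making gradual adjustments while both sides degrade.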
F5: Key Engineer Leaves Mid-Migration
Scenario:
D2 (Senior Backend, Travel Booking lead) exits at Month 4.
Travel Booking is 80% done but has complex pricing logic only D2 understands.
Remaining 4 engineers scramble to cover Travel + continue Event extraction.
Timeline slips. Quality drops. Team morale decreases.
Root Causes:
• Bus factor = 1 for critical module (only D2 deeply knows Travel domain)
• High stress (tight timeline + zero downtime pressure)
• External opportunity (competing offers in market)
• Burnout from AI-heavy + migration-heavy workload
Timeline: Risk highest Month 3-6 (deep in migration, most knowledge accumulated)
| Dimension | Assessment |
|-----------|------------|
| Likelihood | Medium — Vietnam tech market competitive. 5-person team = 20% lost if 1 leaves |
| Impact | Medium — With AI docs + cross-training, knowledge loss is reduced. But velocity hit is real |
Mitigation Strategy:
| Layer | Action |
|-------|--------|
| Prevention | Cross-training: every service has primary + secondary owner. Weekly code walkthroughs. Rotate PR review assignments |
| Documentation | AI-generated docs from legacy code (Phase 0). All decisions in ADRs. Architecture docs maintained |
| AI buffer | AI multiplier reduces individual dependency. New engineer onboards faster with AI codebase Q&A |
| Scope flexibility | If engineer leaves → immediately defer lowest-priority service (Reporting) to post-9-months |
| Recovery | With 2x AI: remaining 4 engineers ≈ 8 traditional. Not ideal but survivable if scope tightened |
Risk Heat Map
| Likelihood ↓ / Impact → | Low | Medium | High |
|---|---|---|---|
| High | | | F3: AI code errors |
| Medium | | F5: Key engineer leaves | F1: Data inconsistency; F2: Payment outage |
| Low | | | F4: Cascading failure |
Priority: F3 > F1 = F2 > F5 > F4
Summary — What Each Risk Tests
| Risk | Assessor Tests | Your Signal |
|------|----------------|-------------|
| F1 (Data) | Can you handle distributed data? | CDC + checksum + dual-read = proven approach |
| F2 (Payment) | Can you decouple from legacy? | Circuit breaker + async queue = graceful degradation |
| F3 (AI) | Do you understand AI limitations? | Shadow+Compare + human review = AI-aware leadership |
| F4 (Cascade) | Do you understand distributed failure modes? | Bulkhead + auto-rollback + rate limit = resilience patterns |
| F5 (Team) | Can you manage people risk? | Cross-training + AI buffer + scope flex = pragmatic leadership |