Risk Management
Failure scenarios, architectural trade-offs, and key assumptions tracked across the migration.
Risk Heat Map
Failure Scenarios (5)
F1 · CDC Data Inconsistency (Likelihood: Medium · Impact: High)
Scenario: CDC sync from legacy DB to new service DB is delayed or misses records. New service serves stale/incorrect data. E.g., Travel shows a booking already cancelled in legacy.
Mitigation: Automated checksum verification every 6h. Dual-read validation before switching writes. Auto-pause CDC on mismatch. 7-day parallel soak at 100% before decommission.
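The checksum-verification step above can be sketched as follows. This is an illustrative Python sketch, not the actual tooling: `table_checksum` and `verify_sync` are hypothetical names, and a production job would compute the hashes inside SQL Server rather than in application code.

```python
import hashlib

def table_checksum(rows):
    """Order-insensitive checksum over (key, payload) tuples.

    XOR-combining per-row digests makes row order irrelevant, so the
    legacy DB and the new service DB can be scanned independently.
    """
    acc = 0
    for key, payload in rows:
        digest = hashlib.sha256(f"{key}|{payload}".encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return acc

def verify_sync(legacy_rows, new_rows):
    """True when the two snapshots match; a mismatch is what would
    trigger the auto-pause on the CDC pipeline."""
    return table_checksum(legacy_rows) == table_checksum(new_rows)
```

A booking that is cancelled in legacy but still confirmed in the new DB produces different per-row digests, so the combined checksums diverge and the sync job flags it.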
F2 · Legacy Payment Outage (Likelihood: Medium · Impact: High)
Scenario: Legacy monolith crashes → Payment API unavailable. Travel + Event services call ACL → timeout → booking flow blocked entirely.
Mitigation: Circuit breaker (Polly): fail fast after 3 retries. Queue the payment in Service Bus and process it when legacy recovers. Graceful degradation: mark the booking as 'pending payment'.
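A minimal sketch of the circuit-breaker pattern behind this mitigation. The real implementation would use Polly in .NET; the Python below is purely illustrative, and the `fallback` hook standing in for the Service Bus enqueue is a hypothetical simplification.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then fails
    fast (no call to the legacy API) for `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, don't hit legacy
            self.opened_at = None      # half-open: allow one probe through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()          # e.g. enqueue payment to Service Bus
        self.failures = 0
        return result
```

In the booking flow, `fallback` would enqueue the payment request and return a booking in the 'pending payment' state, so the user-facing flow is never blocked on the legacy monolith.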
F3 · AI Business Logic Errors (Likelihood: High · Impact: High)
Scenario: AI agent migrates Travel pricing rules but misses an edge case (promo discount stacking). The code passes CI; users are charged wrong prices in production.
Mitigation: Human review mandatory for ALL business logic. Contract tests (Pact) verify the API matches legacy. Shadow + Compare before any traffic switch. Payment: zero AI-only merges.
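The Shadow + Compare step can be illustrated with a small sketch (Python for illustration; `shadow_compare` and the pricing functions are hypothetical, not the actual services):

```python
def shadow_compare(request, legacy_price, new_price, divergences):
    """Serve the legacy result; run the new service in the shadow and
    record any divergence for human review before the traffic switch."""
    legacy = legacy_price(request)
    try:
        candidate = new_price(request)
        if candidate != legacy:
            divergences.append((request, legacy, candidate))
    except Exception as exc:
        divergences.append((request, legacy, f"error: {exc}"))
    return legacy  # users always see the legacy answer during shadow
```

The promo-stacking scenario is exactly what this catches: if legacy applies two stacked discounts and the migrated code applies only one, every affected request lands in `divergences` while users still get the correct legacy price.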
F4 · Cascading Canary Failure (Likelihood: Low · Impact: High)
Scenario: New Event Service at 25% traffic starts timing out. Legacy is overloaded by the resulting retry storm; both old and new systems degrade.
Mitigation: Auto-rollback if error > 0.5%. Bulkhead isolation. Rate limiting at Gateway. Kill switch → 100% legacy in < 30s.
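The auto-rollback trigger can be sketched as a sliding-window error-rate check (Python for illustration; `CanaryGuard`, the 200-sample warm-up, and the window size are assumptions, not the actual gateway logic):

```python
from collections import deque

class CanaryGuard:
    """Sliding-window error-rate monitor: trips the kill switch and
    routes 100% of traffic back to legacy once errors exceed budget."""

    def __init__(self, threshold=0.005, window=1000):
        self.threshold = threshold          # 0.5% error budget
        self.outcomes = deque(maxlen=window)
        self.routed_to_legacy = False

    def record(self, ok):
        self.outcomes.append(ok)
        errors = self.outcomes.count(False)
        # Warm-up guard: don't trip on the first handful of requests.
        if len(self.outcomes) >= 200 and errors / len(self.outcomes) > self.threshold:
            self.routed_to_legacy = True    # kill switch: 0% canary
        return self.routed_to_legacy
```

Because the decision is a local counter rather than a dashboard query, the switch back to legacy can happen well inside the < 30 s budget.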
F5 · Key Engineer Departure (Likelihood: Medium · Impact: Medium)
Scenario: Bus factor = 1 for a service. The engineer leaves mid-migration, taking domain knowledge with them.
Mitigation: Primary + secondary engineer per service. AI-generated docs from legacy code (Phase 0). Weekly walkthroughs. All decisions in ADRs.
Architectural Trade-offs (7)
| # | Decision | Alternative | Reason | Revisit |
|---|---|---|---|---|
| T1 | Payment stays in monolith | Migrate Payment early | Constraint (frozen in Phase 1) + highest risk (PCI, financial). ACL provides a clean bridge. | Month 9+, when all other services are stable |
| T2 | Azure Container Apps over AKS | Kubernetes (AKS) | Five engineers can't manage a K8s cluster. Container Apps = managed, auto-scale, zero ops. | If services > 15 or team > 10 engineers |
| T3 | Azure SQL everywhere | Cosmos DB, Redis, etc. | One DB technology = one skill to maintain. Polyglot = multiple ops burdens for 5 engineers. | If a specific service needs a document store or cache |
| T4 | Incremental React (3-4 modules) | Full React rewrite | Five engineers can't rewrite the entire frontend and backend simultaneously. | Month 10+, or after hiring a frontend engineer |
| T5 | Single region (active-passive) | Multi-region active-active | Active-active = double infra complexity. 40K users served well with SEA primary + CDN. | If user growth justifies multi-region |
| T6 | Contract tests over heavy E2E | Comprehensive E2E suite (Playwright) | E2E = slow, flaky, high maintenance. Contract tests verify boundaries efficiently. | Phase 3+ expand E2E coverage |
| T7 | Shared DB views during CDC transition | Full data decomposition from Day 1 | Per-service DBs come at each service's go-live. During transition, CDC bridges the gap. | Month 7+ when all services own their data |
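To illustrate T6, a consumer-driven contract test can be as small as a field-and-type check against what legacy consumers rely on. The real suite uses Pact; the Python sketch below, including `check_contract` and the `BOOKING_CONTRACT` example, is hypothetical:

```python
def check_contract(response, contract):
    """Compare a service response against the expected contract:
    every field a consumer depends on must exist with the right type."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"wrong type for {field}: {type(response[field]).__name__}")
    return problems

# Hypothetical contract pinned from the legacy booking API.
BOOKING_CONTRACT = {"bookingId": str, "status": str, "totalAmount": float}
```

Unlike an end-to-end run, this kind of check executes in milliseconds and fails with an exact diff of the boundary, which is why T6 favors it over a heavy Playwright suite.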
Key Assumptions (12)
A1 · Team
Team has senior .NET experience — no major ramp-up needed
If wrong: Phase 0 extends 2-4 weeks for training
Validate: Week 2: pair programming assessment
A2 · Codebase
Legacy codebase has some documentation or discoverable APIs
If wrong: AI analysis takes longer, with a risk of missed business rules
Validate: Week 2: AI codebase ingestion results
A3 · Infrastructure
Azure is the approved cloud provider
If wrong: Complete architecture rework if AWS/GCP is mandated
Validate: Week 1: stakeholder confirmation
A4 · Constraint
'Payment frozen' = code stays in monolith, API still callable via ACL
If wrong: API frozen too → bookings blocked entirely
Validate: Week 1: product owner clarification
A5 · AI Strategy
AI tools (Cursor Pro, Claude Code) can be purchased — no procurement blocker
If wrong: Multiplier drops from 2x to 1.2x; capacity ~32 MM
Validate: Week 1: budget approval
A6 · Timeline
Legacy monolith continues running during the full 9-month migration
If wrong: Forced shutdown → scope shrinks dramatically
Validate: Week 1: ops team confirmation
A7 · Availability
40K users across timezones — no safe maintenance window
If wrong: Single timezone → cutover could be simplified
Validate: Week 2: analytics data
A8 · Team
Team is co-located or in the same timezone (Vietnam)
If wrong: Async overhead added (~10% capacity loss)
Validate: Ongoing: retro feedback
A9 · Architecture
Azure Service Bus is acceptable for messaging
If wrong: Minor: swap the messaging broker; patterns stay the same
Validate: Week 1: infra preference check
A10 · Compliance
No regulatory requirements beyond standard enterprise security
If wrong: Add 2-4 weeks of compliance work per service
Validate: Week 1: legal/security review
A11 · Data
Legacy database is SQL Server (CDC compatible)
If wrong: Different CDC tooling needed
Validate: Week 2: DBA verification
A12 · Scope
No mobile app in scope — web-only modernization
If wrong: React Native track + additional frontend engineer needed
Validate: Week 1: product owner scope confirmation
See also: Failure Modeling Document · Trade-off Matrix