Risk Management

Failure scenarios, architectural trade-offs, and key assumptions tracked across the migration.

Failure Scenarios (5)

F1 · CDC Data Inconsistency
Likelihood: Medium · Impact: High
Scenario: CDC sync from the legacy DB to a new service DB is delayed or misses records, so the new service serves stale or incorrect data; for example, Travel still shows as active a booking already cancelled in legacy.
Mitigation: Automated checksum verification every 6h. Dual-read validation before switching writes. Auto-pause CDC on mismatch. 7-day parallel soak at 100% before decommission.
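The checksum verification and auto-pause step above can be sketched as follows. This is a minimal illustration, not the production job: `pause_cdc` and the row sources are hypothetical stand-ins for the real CDC control plane and DB cursors, and the team's actual stack is .NET.

```python
import hashlib

def table_checksum(rows):
    """Order-insensitive checksum: XOR of per-row SHA-256 digests.

    `rows` is an iterable of tuples as returned by a DB cursor; both
    sides must project the same columns in the same order.
    """
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest, "big")
    return acc

def verify_sync(legacy_rows, replica_rows, pause_cdc):
    """Compare legacy vs. replica checksums; on mismatch, invoke the
    CDC kill switch (hypothetical `pause_cdc` callback) and report failure."""
    if table_checksum(legacy_rows) != table_checksum(replica_rows):
        pause_cdc()
        return False
    return True
```

Because the checksum is order-insensitive, the two sides can be read without an ORDER BY; a scheduled job would run this per table every 6 hours.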
F2 · Legacy Payment Outage
Likelihood: Medium · Impact: High
Scenario: Legacy monolith crashes → Payment API unavailable. Travel + Event services call ACL → timeout → booking flow blocked entirely.
Mitigation: Circuit breaker (Polly): fail fast after 3 retries. Queue payment in Service Bus → process when legacy recovers. Graceful degradation: booking as 'pending payment'.
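The fail-fast behavior above comes from Polly in the .NET stack; the mechanism can be sketched language-agnostically as below. The `fallback` here would stand in for queueing the payment to Service Bus and marking the booking 'pending payment' (both hypothetical callbacks in this sketch).

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors, then
    fail fast; allow one probe call after `reset_after` seconds (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, don't hit legacy
            self.opened_at = None      # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit
        return result
```

The key property is that while the circuit is open, Travel and Event never wait on the legacy Payment API, so the booking flow degrades to 'pending payment' instead of blocking.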
F3 · AI Business Logic Errors
Likelihood: High · Impact: High
Scenario: AI agent migrates Travel pricing rules — misses edge case (promo discount stacking). Code passes CI. Users charged wrong prices in production.
Mitigation: Human review mandatory for ALL business logic. Contract tests (Pact) verify the API matches legacy. Shadow + compare before the traffic switch. Payment: no AI-only merges.
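The shadow + compare step can be sketched as follows: serve the legacy answer, run the candidate in shadow, and report any divergence before cutover. This is a minimal sketch; `legacy_fn`, `candidate_fn`, and the `report` sink are hypothetical stand-ins for the real pricing endpoints and observability pipeline.

```python
import logging

log = logging.getLogger("shadow")

def shadow_compare(legacy_fn, candidate_fn, request, report=log.warning):
    """Serve the legacy result; run the candidate in shadow and report
    any mismatch (e.g., a missed promo-stacking rule) without ever
    exposing the candidate's answer to users."""
    expected = legacy_fn(request)
    try:
        actual = candidate_fn(request)
    except Exception as exc:
        report("candidate raised %r for %r", exc, request)
        return expected
    if actual != expected:
        report("mismatch for %r: legacy=%r candidate=%r", request, expected, actual)
    return expected
```

A week of zero reported mismatches on production traffic is what justifies flipping the switch; a single mismatch is a blocked cutover and a human review.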
F4 · Cascading Canary Failure
Likelihood: Low · Impact: High
Scenario: New Event Service at 25% traffic starts timing out; a retry storm overloads legacy. Both old and new systems degrade.
Mitigation: Auto-rollback if error rate > 0.5%. Bulkhead isolation. Rate limiting at the Gateway. Kill switch → 100% legacy in < 30s.
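The auto-rollback rule is simple enough to sketch directly. The function below is a hypothetical decision step that a monitoring loop would run each evaluation window; the real implementation would set the traffic split at the gateway.

```python
def canary_decision(errors, total, canary_weight, threshold=0.005):
    """Return the new canary traffic weight: drop to 0.0 (100% legacy)
    the moment the observed error rate exceeds the 0.5% threshold,
    otherwise keep the current split."""
    if total == 0:
        return canary_weight       # no traffic yet, nothing to judge
    if errors / total > threshold:
        return 0.0                 # kill switch: all traffic back to legacy
    return canary_weight
```

Because the decision is stateless and monotone (rollback only, never auto-ramp-up), re-increasing canary traffic after an incident stays a deliberate human action.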
F5 · Key Engineer Departure
Likelihood: Medium · Impact: Medium
Scenario: Bus factor = 1 for a service. An engineer leaves mid-migration, taking domain knowledge with them.
Mitigation: Primary + secondary engineer per service. AI-generated docs from legacy code (Phase 0). Weekly walkthroughs. All decisions in ADRs.

Architectural Trade-offs (7)

T1 · Payment stays in monolith
Alternative: Migrate Payment early
Reason: Constraint (frozen in Phase 1) plus highest risk (PCI, financial). The ACL provides a clean bridge.
Revisit: After month 9, when all other services are stable

T2 · Azure Container Apps over AKS
Alternative: Kubernetes (AKS)
Reason: 5 engineers can't manage a K8s cluster. Container Apps = managed, auto-scale, zero ops.
Revisit: If services > 15 or team > 10 engineers

T3 · Azure SQL everywhere
Alternative: Cosmos DB, Redis, etc.
Reason: One DB technology = one skill to maintain. Polyglot persistence = multiple ops burdens for 5 engineers.
Revisit: If a specific service needs a document store or cache

T4 · Incremental React (3-4 modules)
Alternative: Full React rewrite
Reason: 5 engineers can't rewrite the full frontend and backend simultaneously.
Revisit: Month 10+, or if a frontend engineer is hired

T5 · Single region (active-passive)
Alternative: Multi-region active-active
Reason: Active-active = double the infra complexity. 40K users are served well with SEA primary + CDN.
Revisit: If user growth justifies multi-region

T6 · Contract tests over heavy E2E
Alternative: Comprehensive E2E suite (Playwright)
Reason: E2E = slow, flaky, high maintenance. Contract tests verify boundaries efficiently.
Revisit: Phase 3+, expand E2E coverage

T7 · Shared DB views during CDC transition
Alternative: Full data decomposition from Day 1
Reason: Per-service DBs come at each service's go-live. During the transition, CDC bridges the gap.
Revisit: Month 7+, when all services own their data

Key Assumptions (12)

A1 · Team
Team has senior .NET experience — no major ramp-up needed
If wrong: Phase 0 extends 2-4 weeks for training
Validate: Week 2, pair programming assessment

A2 · Codebase
Legacy codebase has some documentation or discoverable APIs
If wrong: AI analysis takes longer; risk of missed business rules
Validate: Week 2, AI codebase ingestion results

A3 · Infrastructure
Azure is the approved cloud provider
If wrong: Complete architecture rework if AWS/GCP is mandated
Validate: Week 1, stakeholder confirmation

A4 · Constraint
'Payment frozen' = code stays in monolith, API still callable via ACL
If wrong: API frozen too → bookings blocked entirely
Validate: Week 1, product owner clarification

A5 · AI Strategy
AI tools (Cursor Pro, Claude Code) can be purchased — no procurement blocker
If wrong: Multiplier drops from 2x to 1.2x; capacity ~32 MM
Validate: Week 1, budget approval

A6 · Timeline
Legacy monolith continues running during the full 9-month migration
If wrong: Forced shutdown → scope shrinks dramatically
Validate: Week 1, ops team confirmation

A7 · Availability
40K users across timezones — no safe maintenance window
If wrong: A single timezone would simplify cutover
Validate: Week 2, analytics data

A8 · Team
Team co-located or in the same timezone (Vietnam)
If wrong: Async overhead (~10% capacity loss)
Validate: Ongoing, retro feedback

A9 · Architecture
Azure Service Bus acceptable for messaging
If wrong: Minor impact; swap the messaging broker, patterns stay the same
Validate: Week 1, infra preference check

A10 · Compliance
No regulatory requirements beyond standard enterprise security
If wrong: Add 2-4 weeks of compliance work per service
Validate: Week 1, legal/security review

A11 · Data
Legacy database is SQL Server (CDC compatible)
If wrong: Different CDC tooling needed
Validate: Week 2, DBA verification

A12 · Scope
No mobile app in scope — web-only modernization
If wrong: Need a React Native track plus an additional frontend engineer
Validate: Week 1, product owner scope confirmation