Deliverable 4.2 — Migration Strategy
Requirement: Phased approach (min 4 phases), zero-downtime strategy, backward compatibility plan
Source: Planning.md, HA.md, Analysis v2.md, Architect - High Level.md
1. Phased Approach — 4 Phases
1.1 Timeline Overview
Month: 1 2 3 4 5 6 7 8 9
├───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
Phase 0 │███████│
│ AI Foundation + Infrastructure
│ Comms pilot (staging), CI/CD, IaC, YARP, AI toolchain
│
Phase 1 │ │███████████████████████│
│ │ Core Service Extraction
│ │ Travel go-live (M3), Event go-live (M4)
│ │ Payment ACL, React 18 (Travel + Event)
│
Phase 2 │ │███████████████████████│
│ │ Scale Out
│ │ Workforce (M6), Comms+Reporting (M7)
│ │ React 18 (Workforce + Reporting)
│
Phase 3 │ │████████│
│ │ Harden
│ │ Perf, DR, docs, security audit
1.2 Phase Details
| Phase |
Duration |
Focus |
Deliverables |
Services Live |
Team |
| 0: AI Foundation |
Month 1 |
Setup |
AI toolchain (Cursor Pro, Claude Code, CodeRabbit), CI/CD (GitHub Actions), IaC (Bicep), YARP Gateway, Comms pilot migration (staging), shared kernel |
0 |
All 5 |
| 1: Core |
Month 2–4 |
Extract |
Travel Booking + Event Management extracted, per-service DBs (Travel M3, Event M4), Payment ACL live, CDC data sync, React 18 for Travel/Event |
2 (+ACL) |
D1/D2: Travel+ACL, D3: Event, D4: React, D5: Infra |
| 2: Scale |
Month 5–7 |
Extract |
Workforce (M6), Communications production (M7), Reporting CQRS (M7), React 18 dashboards |
5 |
D4: Workforce+Comms, D5+D3: Reporting, D2: support |
| 3: Harden |
Month 8–9 |
Stabilize |
Load testing (40K users sim), security audit, DR validation, operational docs, Payment migration plan for next phase |
5 (hardened) |
All 5 focus on quality |
1.3 Service Go-Live Order (Justification)
| Order |
Service |
Month |
Why This Order |
| 1 |
Communications (pilot) |
M1 (staging) |
Simplest → validate AI migration pipeline + team patterns before complex modules |
| 2 |
Payment ACL |
M2 |
Travel + Event NEED payment → ACL must exist before they go live |
| 3 |
Travel Booking |
M3 (canary→100%) |
Core domain, highest business value, hardest → start early = more time for complexity |
| 4 |
Event Management |
M4 |
Core domain, reuses patterns proven in Travel extraction |
| 5 |
Workforce + Allocation |
M6 |
Supporting domain, depends on Event (staff for events) → extract after Event stable |
| 6 |
Communications (prod) |
M7 |
Pilot to production. Already validated in Phase 0 staging |
| 7 |
Reporting (CQRS) |
M7 |
Read-only, depends on events from ALL services → extract last = more event sources |
1.4 Capacity per Phase
Raw Overhead Net AI Multiplier Effective
──── ──────── ─── ───────────── ─────────
Phase 0 (Month 1): 5.0 -3.0 2.0 ×1.0* 2.0
Phase 1 (Month 2–4): 15.0 -6.0 9.0 ×2.0 18.0
Phase 2 (Month 5–7): 15.0 -5.5 9.5 ×2.0 19.0
Phase 3 (Month 8–9): 10.0 -3.5 6.5 ×1.0** 6.5
──── ────── ─────
TOTAL: 45.0 -18.0 45.5 ≈ 44
* Phase 0: AI not yet set up → multiplier = 1.0
** Phase 3: Performance/docs = less AI leverage → 1.0x
2. Zero-Downtime Strategy
2.1 Core Pattern: Strangler Fig + YARP
Principle: NEVER cut over all at once. Route traffic incrementally per module.
BEFORE (all traffic → legacy):
Users ──► Legacy Monolith (100%)
DURING (Phase 1-2, gradual shift):
Users ──► YARP Gateway ──┬──► Travel Service (NEW, 100%)
├──► Event Service (NEW, 100%)
├──► /payments/* → Legacy (ACL)
└──► /* → Legacy (remaining)
AFTER (Phase 3, only Payment in legacy):
Users ──► YARP Gateway ──┬──► 5 New Services (100%)
└──► Legacy (Payment only, via ACL)
2.2 Per-Module Cutover Procedure (7 Steps)
Each module goes through this process — NO BIG BANG:
Step 1: SHADOW MODE (Day 1-3)
YARP sends traffic to BOTH legacy (serves response) + new service (discards response)
→ Verify new service handles requests without errors
→ Risk: ZERO — legacy still serving
Step 2: COMPARE MODE (Day 4-5)
YARP sends to both, COMPARE responses, log mismatches
→ Fix business logic differences before any real traffic
→ Risk: ZERO — legacy still serving
Step 3: CANARY 5% (Day 6-7)
5% traffic → new service, 95% → legacy
Monitor: error rate, latency p95, business metrics
→ Rollback: 1 config change = 100% back to legacy (< 30 seconds)
Step 4: CANARY 25% (Day 8-9)
If Step 3 clean ≥ 48h → increase to 25%
Step 5: CANARY 50% (Day 10)
Half traffic on new service. Run 24h minimum.
Step 6: FULL CUTOVER 100% (Day 11)
All traffic → new service. Legacy = hot standby.
Step 7: DECOMMISSION LEGACY MODULE (Day 18+)
After 7-day soak at 100% → archive legacy module
Legacy DB tables retained read-only 30 days
2.3 YARP Weighted Routing (Zero Downtime Switch)
{
"Clusters": {
"travel-cluster": {
"Destinations": {
"legacy": { "Address": "https://legacy.internal", "Weight": 0 },
"new-service": { "Address": "https://travel-svc.internal", "Weight": 100 }
},
"LoadBalancingPolicy": "WeightedRoundRobin"
}
}
}
Key: Changing Weight = changing traffic split
YARP hot-reload config → zero downtime, zero redeploy
Rollback = set legacy Weight=100, new Weight=0 (< 30 seconds)
2.4 Rollback Guarantees
| Layer |
Rollback Method |
Time |
| Traffic routing |
YARP weight → 100% legacy |
< 30 seconds |
| Container |
Revert Container Apps revision |
< 2 minutes |
| Database |
CDC still syncing → data consistent |
Immediate (read from legacy) |
| Feature flag |
Kill switch → disable new module |
< 10 seconds |
| Full phase |
All routes back to legacy |
< 5 minutes |
3. Backward Compatibility Plan (Legacy ↔ New Coexistence)
3.1 Coexistence Architecture
┌─────────────────────────────────────────────────────────────────┐
│ During migration, BOTH systems run simultaneously: │
│ │
│ • YARP routes /api/travel/* → new Travel Service │
│ • YARP routes /api/payments/* → legacy monolith │
│ • New services call legacy Payment via ACL (adapter pattern) │
│ • ACL translates new contracts ↔ legacy API formats │
│ • CDC keeps data in sync: legacy DB → new service DBs │
│ • Event schema versioned (v1.0+) — backward compatible changes │
│ • React shell loads: new React modules + legacy UI (iframe) │
│ │
│ When Payment modernized → swap ACL target, │
│ zero changes to consuming services. │
└─────────────────────────────────────────────────────────────────┘
3.2 Data Coexistence (CDC Pipeline)
Phase A: Legacy = write authority → CDC → New DB (read replica)
Phase B: New service = write authority → Event → Legacy (backward compat)
Phase C: Legacy module decommissioned → New service fully autonomous
┌──────────┐ CDC Stream ┌──────────┐
│ Legacy DB│ ──────────────────► │ New Svc │
│ (source) │ │ DB │
└──────────┘ └──────────┘
│ Verification │
│ ◄──── Hash compare every 6h ──── │
│ Auto-pause CDC if mismatch │
3.3 API Versioning Strategy
New services expose v1 APIs from Day 1:
/api/v1/travel/bookings
/api/v1/events
When breaking changes needed:
/api/v2/travel/bookings (new schema)
/api/v1/travel/bookings (still works, deprecated)
Rule: Support N-1 version minimum (2 active versions)
Rule: Deprecated version = 6 month sunset with notifications
3.4 Frontend Coexistence
┌──────────────────────────────────────────────────┐
│ React Shell (SPA) │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Travel Module│ │ Event Module │ ← React 18 │
│ │ (new React) │ │ (new React) │ components │
│ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Payment Module (legacy iframe) │ ← Legacy │
│ │ Untouched — frozen Phase 1 │ UI stays │
│ └──────────────────────────────────┘ │
│ │
│ Route-based: new React modules for migrated │
│ services, iframe for legacy modules │
└──────────────────────────────────────────────────┘
4. Migration Risk Mitigations
| Risk |
Mitigation |
| Data loss during cutover |
CDC + 6h checksum verification + 7-day soak period |
| New service performance degradation |
Shadow mode + canary ramp (5% → 100% over 11 days) |
| Legacy breaks during migration |
Legacy untouched — YARP routes around it, not through it |
| Team unable to maintain both systems |
AI 2x multiplier frees capacity. Monolith shrinks as modules extract |
| Payment integration breaks |
ACL isolation + circuit breaker + retry + DLQ queue |
5. Summary — What Assessor Should See
✅ 4 phases clearly defined with months + milestones
✅ Zero downtime via Strangler Fig + YARP (not aspirational — concrete 7-step procedure)
✅ Rollback at every stage (< 30 seconds routing, < 5 min full phase)
✅ Backward compatibility: CDC, ACL, API versioning, frontend iframe
✅ Data coexistence: 3-phase CDC transition (legacy → dual → new authority)
✅ Go-live order justified by domain dependency + complexity analysis