Deliverable 4.2 — Migration Strategy

Requirement: Phased approach (min 4 phases), zero-downtime strategy, backward compatibility plan
Source: Planning.md, HA.md, Analysis v2.md, Architect - High Level.md


1. Phased Approach — 4 Phases

1.1 Timeline Overview

Month:    1       2       3       4       5       6       7       8       9
        ├───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
Phase 0 │███████│                                                        
        │ AI Foundation + Infrastructure                                  
        │ Comms pilot (staging), CI/CD, IaC, YARP, AI toolchain           
        │                                                                 
Phase 1 │       │███████████████████████│                                 
        │       │ Core Service Extraction                                 
        │       │ Travel go-live (M3), Event go-live (M4)                 
        │       │ Payment ACL, React 18 (Travel + Event)                  
        │                                                                 
Phase 2 │                               │███████████████████████│         
        │                               │ Scale Out                       
        │                               │ Workforce (M6), Comms+Reporting (M7)
        │                               │ React 18 (Workforce + Reporting) 
        │                                                                 
Phase 3 │                                                       │████████│
        │                                                       │ Harden  
        │                                                       │ Perf, DR, docs, security audit

1.2 Phase Details

Phase | Duration | Focus | Deliverables | Services Live | Team
----- | -------- | ----- | ------------ | ------------- | ----
0: AI Foundation | Month 1 | Setup | AI toolchain (Cursor Pro, Claude Code, CodeRabbit), CI/CD (GitHub Actions), IaC (Bicep), YARP Gateway, Comms pilot migration (staging), shared kernel | 0 | All 5
1: Core | Month 2–4 | Extract | Travel Booking + Event Management extracted, per-service DBs (Travel M3, Event M4), Payment ACL live, CDC data sync, React 18 for Travel/Event | 2 (+ACL) | D1/D2: Travel+ACL, D3: Event, D4: React, D5: Infra
2: Scale | Month 5–7 | Extract | Workforce (M6), Communications production (M7), Reporting CQRS (M7), React 18 dashboards | 5 | D4: Workforce+Comms, D5+D3: Reporting, D2: support
3: Harden | Month 8–9 | Stabilize | Load testing (40K users sim), security audit, DR validation, operational docs, Payment migration plan for next phase | 5 (hardened) | All 5 focus on quality

1.3 Service Go-Live Order (Justification)

Order | Service | Month | Why This Order
----- | ------- | ----- | --------------
1 | Communications (pilot) | M1 (staging) | Simplest → validate AI migration pipeline + team patterns before complex modules
2 | Payment ACL | M2 | Travel + Event NEED payment → ACL must exist before they go live
3 | Travel Booking | M3 (canary→100%) | Core domain, highest business value, hardest → start early = more time for complexity
4 | Event Management | M4 | Core domain, reuses patterns proven in Travel extraction
5 | Workforce + Allocation | M6 | Supporting domain, depends on Event (staff for events) → extract after Event stable
6 | Communications (prod) | M7 | Pilot to production. Already validated in Phase 0 staging
7 | Reporting (CQRS) | M7 | Read-only, depends on events from ALL services → extract last = more event sources

1.4 Capacity per Phase

                          Raw    Overhead   Net    AI Multiplier  Effective
                          ────   ────────   ───    ───────────── ─────────
Phase 0 (Month 1):        5.0    -3.0       2.0    ×1.0*          2.0
Phase 1 (Month 2–4):     15.0    -6.0       9.0    ×2.0          18.0
Phase 2 (Month 5–7):     15.0    -5.5       9.5    ×2.0          19.0
Phase 3 (Month 8–9):     10.0    -3.5       6.5    ×1.0**         6.5
                          ────   ──────                           ─────
TOTAL:                    45.0   -18.0      27.0                      45.5

* Phase 0: AI not yet set up → multiplier = 1.0
** Phase 3: Performance/docs = less AI leverage → multiplier = 1.0
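
The arithmetic behind the capacity table can be checked with a short sketch. It assumes, as the column layout implies, that Effective = (Raw - Overhead) × AI multiplier; the function and variable names are illustrative and not part of the plan.

```python
# Capacity check: effective = (raw - overhead) * AI multiplier.
# Figures are copied from the table above; names are illustrative only.
PHASES = [
    # (phase, raw, overhead, ai_multiplier)
    ("Phase 0", 5.0, 3.0, 1.0),
    ("Phase 1", 15.0, 6.0, 2.0),
    ("Phase 2", 15.0, 5.5, 2.0),
    ("Phase 3", 10.0, 3.5, 1.0),
]


def effective_capacity(raw: float, overhead: float, multiplier: float) -> float:
    """Net capacity after overhead, scaled by the assumed AI multiplier."""
    return (raw - overhead) * multiplier


def totals(phases):
    """Return (raw, overhead, net, effective) summed over all phases."""
    raw = sum(p[1] for p in phases)
    overhead = sum(p[2] for p in phases)
    net = raw - overhead
    effective = sum(effective_capacity(r, o, m) for _, r, o, m in phases)
    return raw, overhead, net, effective


if __name__ == "__main__":
    for name, raw, overhead, mult in PHASES:
        print(f"{name}: {effective_capacity(raw, overhead, mult):.1f}")
    print("totals (raw, overhead, net, effective):", totals(PHASES))
```

Keeping the multiplier as an explicit input makes it easy to re-run the numbers if the 2.0 assumption for Phases 1 and 2 turns out to be optimistic.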

2. Zero-Downtime Strategy

2.1 Core Pattern: Strangler Fig + YARP

Principle: NEVER cut over all at once. Route traffic incrementally per module.

BEFORE (all traffic → legacy):
  Users ──► Legacy Monolith (100%)

DURING (Phase 1-2, gradual shift):
  Users ──► YARP Gateway ──┬──► Travel Service (NEW, 100%)
                           ├──► Event Service (NEW, 100%)
                           ├──► /api/payments/* → Legacy (ACL)
                           └──► /* → Legacy (remaining)

AFTER (Phase 3, only Payment in legacy):
  Users ──► YARP Gateway ──┬──► 5 New Services (100%)
                           └──► Legacy (Payment only, via ACL)

2.2 Per-Module Cutover Procedure (7 Steps)

Each module goes through this process — NO BIG BANG:

Step 1: SHADOW MODE (Day 1-3)
  YARP sends traffic to BOTH legacy (serves response) + new service (discards response)
  → Verify new service handles requests without errors
  → Risk: ZERO — legacy still serving

Step 2: COMPARE MODE (Day 4-5)
  YARP sends to both, COMPARE responses, log mismatches
  → Fix business logic differences before any real traffic
  → Risk: ZERO — legacy still serving

Step 3: CANARY 5% (Day 6-7)
  5% traffic → new service, 95% → legacy
  Monitor: error rate, latency p95, business metrics
  → Rollback: 1 config change = 100% back to legacy (< 30 seconds)

Step 4: CANARY 25% (Day 8-9)
  If Step 3 clean ≥ 48h → increase to 25%

Step 5: CANARY 50% (Day 10)
  Half traffic on new service. Run 24h minimum.

Step 6: FULL CUTOVER 100% (Day 11)
  All traffic → new service. Legacy = hot standby.

Step 7: DECOMMISSION LEGACY MODULE (Day 18+)
  After 7-day soak at 100% → archive legacy module
  Legacy DB tables retained read-only 30 days
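
The ramp above can be expressed as data plus a single gate function, which keeps the promotion and rollback rules testable. This is a minimal sketch, not the actual cutover tooling: the stage names, the `healthy` flag (standing in for the error-rate, p95 latency and business-metric checks) and the soak hours for the shadow, compare and 25% stages are assumptions derived from the day ranges; only the 48h gate after 5%, the 24h minimum at 50% and the 7-day soak are stated in the steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    new_weight: int       # percent of live traffic served by the new service
    min_clean_hours: int  # hours without alerts before advancing


# Weights follow Steps 1-6. Hours marked "assumed" are inferred from day ranges.
RAMP = [
    Stage("shadow", 0, 72),      # Day 1-3, legacy serves all responses (assumed)
    Stage("compare", 0, 48),     # Day 4-5 (assumed)
    Stage("canary-5", 5, 48),    # Step 4 requires >= 48h clean at 5%
    Stage("canary-25", 25, 48),  # Day 8-9 (assumed)
    Stage("canary-50", 50, 24),  # Step 5: run 24h minimum
    Stage("full", 100, 168),     # 7-day soak before decommission
]


def next_weights(stage_index: int, clean_hours: float, healthy: bool) -> tuple[int, int]:
    """Return (legacy_weight, new_weight) after evaluating the gate."""
    if not healthy:
        return 100, 0  # rollback: all traffic back to legacy
    stage = RAMP[stage_index]
    if clean_hours >= stage.min_clean_hours and stage_index + 1 < len(RAMP):
        stage = RAMP[stage_index + 1]  # promote to the next stage
    return 100 - stage.new_weight, stage.new_weight
```

For example, `next_weights(2, 48, True)` promotes from 5% to 25%, while any unhealthy reading returns `(100, 0)` regardless of the current stage.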

2.3 YARP Weighted Routing (Zero Downtime Switch)

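{
  "Clusters": {
    "travel-cluster": {
      "Destinations": {
        "legacy":      { "Address": "https://legacy.internal",     "Metadata": { "Weight": "0"   } },
        "new-service": { "Address": "https://travel-svc.internal", "Metadata": { "Weight": "100" } }
      },
      "LoadBalancingPolicy": "Weighted"
    }
  }
}

Key: Changing Weight = changing traffic split
     YARP's built-in load-balancing policies (e.g. RoundRobin, Random,
     PowerOfTwoChoices, LeastRequests) are unweighted, so "Weighted" here
     names a custom policy that reads Weight from destination Metadata
     YARP hot-reload config → zero downtime, zero redeploy
     Rollback = set legacy Weight=100, new Weight=0 (< 30 seconds)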
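
To make the effect of the weights concrete, the sketch below simulates weighted destination selection. It illustrates the selection logic a weighted policy would need and is not YARP code; in the gateway this logic would sit in a load-balancing policy, and the function names here are hypothetical.

```python
import random


def pick_destination(weights: dict[str, int], rng: random.Random) -> str:
    """Pick a destination name with probability proportional to its weight."""
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise ValueError("at least one destination needs a positive weight")
    names = list(live)
    return rng.choices(names, weights=[live[n] for n in names], k=1)[0]


def observed_split(weights: dict[str, int], requests: int, seed: int = 0) -> dict[str, float]:
    """Simulate `requests` picks and return the observed share per destination."""
    rng = random.Random(seed)
    counts = {name: 0 for name in weights}
    for _ in range(requests):
        counts[pick_destination(weights, rng)] += 1
    return {name: count / requests for name, count in counts.items()}


if __name__ == "__main__":
    # Canary at 5%: roughly one request in twenty reaches the new service.
    print(observed_split({"legacy": 95, "new-service": 5}, requests=20_000))
    # Rollback: a zero weight removes the destination from selection entirely.
    print(observed_split({"legacy": 100, "new-service": 0}, requests=1_000))
```

Random selection only approximates the configured split over many requests, so canary metrics are best read as rates rather than absolute counts.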

2.4 Rollback Guarantees

Layer | Rollback Method | Time
----- | --------------- | ----
Traffic routing | YARP weight → 100% legacy | < 30 seconds
Container | Revert Container Apps revision | < 2 minutes
Database | CDC still syncing → data consistent | Immediate (read from legacy)
Feature flag | Kill switch → disable new module | < 10 seconds
Full phase | All routes back to legacy | < 5 minutes

3. Backward Compatibility Plan (Legacy ↔ New Coexistence)

3.1 Coexistence Architecture

┌─────────────────────────────────────────────────────────────────┐
│ During migration, BOTH systems run simultaneously:              │
│                                                                 │
│ • YARP routes /api/travel/* → new Travel Service                │
│ • YARP routes /api/payments/* → legacy monolith                 │
│ • New services call legacy Payment via ACL (adapter pattern)    │
│ • ACL translates new contracts ↔ legacy API formats             │
│ • CDC keeps data in sync: legacy DB → new service DBs           │
│ • Event schema versioned (v1.0+) — backward compatible changes  │
│ • React shell loads: new React modules + legacy UI (iframe)     │
│                                                                 │
│ When Payment modernized → swap ACL target,                      │
│ zero changes to consuming services.                             │
└─────────────────────────────────────────────────────────────────┘
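
The ACL's role as a translator can be sketched as a small adapter. Everything concrete below is hypothetical: the contract types, the legacy field names (`REF_NO`, `AMT_CENTS`, `CUR`, `STATUS`) and the `send_to_legacy` callable are placeholders, since the real legacy Payment format is not described here.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentRequest:  # hypothetical new-side contract
    booking_id: str
    amount: Decimal
    currency: str


@dataclass(frozen=True)
class PaymentResult:  # hypothetical new-side contract
    booking_id: str
    approved: bool


class LegacyPaymentAcl:
    """Translates between the new contract and an assumed legacy payload."""

    def __init__(self, send_to_legacy):
        # send_to_legacy: callable taking and returning a legacy-style dict.
        self._send = send_to_legacy

    def charge(self, request: PaymentRequest) -> PaymentResult:
        # New contract -> legacy format (field names are assumptions).
        legacy_payload = {
            "REF_NO": request.booking_id,
            "AMT_CENTS": int(request.amount * 100),
            "CUR": request.currency.upper(),
        }
        legacy_response = self._send(legacy_payload)
        # Legacy format -> new contract.
        return PaymentResult(
            booking_id=legacy_response["REF_NO"],
            approved=legacy_response["STATUS"] == "OK",
        )
```

Because consumers depend only on `PaymentRequest` and `PaymentResult`, replacing `send_to_legacy` with a client for a modernized Payment service would leave them unchanged, which is the property the coexistence plan relies on.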

3.2 Data Coexistence (CDC Pipeline)

Phase A: Legacy = write authority → CDC → New DB (read replica)
Phase B: New service = write authority → Event → Legacy (backward compat)
Phase C: Legacy module decommissioned → New service fully autonomous

┌──────────┐     CDC Stream      ┌──────────┐
│ Legacy DB│ ──────────────────► │ New Svc  │
│ (source) │                     │ DB       │
└──────────┘                     └──────────┘
     │         Verification              │
     │  ◄──── Hash compare every 6h ──── │
     │    Auto-pause CDC if mismatch     │
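
The periodic hash comparison could look roughly like the sketch below. It assumes rows can be read from both databases as (primary key, payload) pairs with string payloads, and `CdcPipeline` is a stand-in for whatever actually controls the CDC stream; scheduling the check every 6 hours is left to the surrounding job runner.

```python
import hashlib


def table_checksum(rows) -> str:
    """Order-independent SHA-256 over rows given as (primary_key, payload) pairs."""
    digest = hashlib.sha256()
    for key, payload in sorted(rows):
        digest.update(repr((key, payload)).encode("utf-8"))
    return digest.hexdigest()


def verify_sync(legacy_rows, new_rows) -> bool:
    """True when both sides hold the same set of rows."""
    return table_checksum(legacy_rows) == table_checksum(new_rows)


class CdcPipeline:
    """Stand-in for the real CDC controller; only tracks a paused flag."""

    def __init__(self):
        self.paused = False

    def pause(self):
        self.paused = True


def run_verification(pipeline: CdcPipeline, legacy_rows, new_rows) -> bool:
    """Compare both sides and pause the pipeline on any mismatch."""
    in_sync = verify_sync(legacy_rows, new_rows)
    if not in_sync:
        pipeline.pause()
    return in_sync
```

A whole-table hash only says that something differs; in practice the comparison would likely be run per key range so that a mismatch points to the affected rows.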

3.3 API Versioning Strategy

New services expose v1 APIs from Day 1:
  /api/v1/travel/bookings
  /api/v1/events

When breaking changes needed:
  /api/v2/travel/bookings (new schema)
  /api/v1/travel/bookings (still works, deprecated)

Rule: Support N-1 version minimum (2 active versions)
Rule: Deprecated version = 6-month sunset with notifications
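
The two rules can be captured in a few lines. This is a sketch under assumptions: versions are plain integers taken from the second path segment, and the helper names are illustrative rather than part of any gateway or framework.

```python
import calendar
from datetime import date

SUNSET_MONTHS = 6  # from the sunset rule above


def supported_versions(latest: int) -> list[int]:
    """N and N-1 stay active; v1 alone is active until v2 ships."""
    return [v for v in (latest - 1, latest) if v >= 1]


def sunset_date(deprecated_on: date, months: int = SUNSET_MONTHS) -> date:
    """Date on which a deprecated version may be switched off."""
    month_index = deprecated_on.month - 1 + months
    year = deprecated_on.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day for shorter target months (e.g. 31 Aug -> 28 Feb).
    day = min(deprecated_on.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def resolve(path: str, latest: int) -> str:
    """Classify a request path such as /api/v1/travel/bookings."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or not parts[1].startswith("v") or not parts[1][1:].isdigit():
        return "unversioned"
    version = int(parts[1][1:])
    if version not in supported_versions(latest):
        return "gone"
    return "current" if version == latest else "deprecated"
```

A gateway or middleware could use the `"deprecated"` result to attach a deprecation notice to the response and the `"gone"` result to reject the call once the sunset date has passed.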

3.4 Frontend Coexistence

┌──────────────────────────────────────────────────┐
│              React Shell (SPA)                    │
│                                                   │
│  ┌──────────────┐  ┌──────────────┐              │
│  │ Travel Module│  │ Event Module │  ← React 18  │
│  │ (new React)  │  │ (new React)  │    components │
│  └──────────────┘  └──────────────┘              │
│                                                   │
│  ┌──────────────────────────────────┐            │
│  │ Payment Module (legacy iframe)   │ ← Legacy   │
│  │ Untouched — frozen Phase 1       │   UI stays  │
│  └──────────────────────────────────┘            │
│                                                   │
│  Route-based: new React modules for migrated      │
│  services, iframe for legacy modules              │
└──────────────────────────────────────────────────┘

4. Migration Risk Mitigations

Risk | Mitigation
---- | ----------
Data loss during cutover | CDC + 6h checksum verification + 7-day soak period
New service performance degradation | Shadow and compare modes (Day 1-5) + canary ramp (5% on Day 6 → 100% on Day 11)
Legacy breaks during migration | Legacy untouched; YARP routes around it, not through it
Team unable to maintain both systems | AI 2x multiplier frees capacity. Monolith shrinks as modules are extracted
Payment integration breaks | ACL isolation + circuit breaker + retry + dead-letter queue (DLQ)
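
The Payment mitigation combines three mechanisms, which the sketch below wires together in their simplest form. It is an illustration only: the class name, thresholds and the in-memory `dead_letters` list are assumptions, there is no back-off between retries, and a production breaker would also need a timed half-open state to recover.

```python
class CircuitOpenError(Exception):
    """Raised when calls are rejected because the circuit is open."""


class PaymentGuard:
    """Retry, then dead-letter; open the circuit after repeated failures."""

    def __init__(self, call_legacy, max_retries=2, failure_threshold=3):
        self._call = call_legacy
        self._max_retries = max_retries
        self._failure_threshold = failure_threshold
        self._consecutive_failures = 0
        self.dead_letters = []  # stand-in for a real DLQ

    @property
    def is_open(self) -> bool:
        return self._consecutive_failures >= self._failure_threshold

    def submit(self, message):
        if self.is_open:
            # Fail fast: park the message instead of hitting a failing backend.
            self.dead_letters.append(message)
            raise CircuitOpenError("payment circuit is open")
        for _attempt in range(self._max_retries + 1):
            try:
                result = self._call(message)
            except Exception:
                continue  # retry immediately; real code would back off
            self._consecutive_failures = 0
            return result
        # All attempts failed: count the failure and dead-letter the message.
        self._consecutive_failures += 1
        self.dead_letters.append(message)
        return None
```

Messages in the DLQ can be replayed once the legacy Payment module is reachable again, so a temporary outage delays payments rather than losing them.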

5. Summary — What Assessor Should See

✅ 4 phases clearly defined with months + milestones
✅ Zero downtime via Strangler Fig + YARP (not aspirational — concrete 7-step procedure)
✅ Rollback at every stage (< 30 seconds routing, < 5 min full phase)
✅ Backward compatibility: CDC, ACL, API versioning, frontend iframe
✅ Data coexistence: 3-phase CDC transition (legacy write authority → new service write authority with sync back to legacy → legacy module decommissioned)
✅ Go-live order justified by domain dependency + complexity analysis