Constraints Analysis

Deep-dive into Section 3 of the brief — 5 constraints, interactions between them, and capacity math


1. Original Brief (Verbatim)

| Constraint   | Detail                                |
|--------------|---------------------------------------|
| Users        | 40,000 active global users            |
| Availability | Zero downtime during migration        |
| Payment      | Payment flow cannot change in Phase 1 |
| Team Size    | 5 engineers only                      |
| Timeline     | 9 months total                        |

2. Constraint-by-Constraint Analysis — What They Really Mean

2.1 "40,000 active global users"

Surface: User count.
Real signal:

| Aspect   | Analysis |
|----------|----------|
| "40,000" | Not a startup (100 users) but not Facebook either (1B). Medium-scale enterprise. What matters more than the absolute number is the behavior pattern |
| "active" | These are active users, not registered. 40K active = potentially 200K+ registered. Daily active likely 5K-15K |
| "global" | Multiple timezones → no maintenance window. "There is always someone using the system." Traffic follows the sun |

Traffic estimation:

40K active users
├── Peak concurrent: ~10% = 4,000 concurrent sessions
├── Requests per session: ~20 pages/actions per session
├── Peak RPS: 4,000 × 20 / 3600 ≈ 22 req/s sustained
│   └── Burst: 5-10x → 110-220 req/s peak
├── Daily API calls: ~2-5 million
└── Database transactions: ~500K-1M/day

Scale verdict: MODERATE
  • No need for Kafka-level streaming
  • Azure Service Bus Standard tier is sufficient
  • Per-service Azure SQL handles this comfortably  
  • Azure Container Apps auto-scale handles burst
  • CDN + cache reduces 60-80% of raw traffic
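
The traffic estimate above is simple enough to recompute. The sketch below does so in Python, keeping the document's ratios as named constants. The one-hour session length is an assumption implied by the division by 3600, and the function and key names are purely illustrative.

```python
# Back-of-envelope traffic estimate. All ratios are the document's assumptions,
# not measurements.
ACTIVE_USERS = 40_000
PEAK_CONCURRENCY_PCT = 10      # ~10% of active users online at peak
ACTIONS_PER_SESSION = 20       # pages/actions per session
SESSION_SECONDS = 3600         # assumption: a session spans about one hour


def estimate_traffic(burst_factors=(5, 10)):
    """Return peak concurrency, sustained req/s, burst range and a daily upper bound."""
    concurrent = ACTIVE_USERS * PEAK_CONCURRENCY_PCT // 100
    sustained_rps = concurrent * ACTIONS_PER_SESSION / SESSION_SECONDS
    burst_rps = tuple(sustained_rps * factor for factor in burst_factors)
    # Upper bound on daily page actions if the peak rate were held for 24 hours.
    daily_upper_bound = sustained_rps * 86_400
    return {
        "concurrent": concurrent,
        "sustained_rps": sustained_rps,
        "burst_rps": burst_rps,
        "daily_upper_bound": daily_upper_bound,
    }


estimate = estimate_traffic()
for name, value in estimate.items():
    print(name, value)
```

With these inputs the sustained rate comes out at roughly 22 req/s and the 5-10x burst at roughly 111-222 req/s. Holding the peak rate for a full day gives about 1.9 million page actions, so the 2-5 million daily API calls figure presumably counts several API calls per action.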

"Global" implications for architecture:

Global users = multi-region consideration

Option A: Single region + CDN (recommended Phase 1-2)
  Azure Southeast Asia (Singapore) — nearest to Vietnam
  Azure Front Door → cache static + route to nearest edge
  Latency: ~50-200ms globally (acceptable for enterprise app)

Option B: Multi-region active-active (Phase 3+, if needed)
  Primary: Southeast Asia
  Secondary: West Europe or East US
  Data replication: Azure SQL geo-replication
  
Trade-off: Option A is sufficient for 40K users. Multi-region is only 
  justified if SLA requires <100ms globally or regulatory compliance.
  With 5 engineers + 9 months → Option A is the correct choice.

2.2 "Zero downtime during migration"

This is the HARDEST constraint.

Zero downtime does NOT mean "deploy fast then restart"

Zero downtime MEANS:
  ┌─────────────────────────────────────────────────┐
  │ 1. At every point during 9 months, 40K users     │
  │    MUST be able to access the system normally      │
  │                                                    │
  │ 2. No "scheduled maintenance window"               │
  │    (global users = no safe window)                 │
  │                                                    │
  │ 3. Migration happens invisibly to users            │
  │    Today: request → monolith                       │
  │    Tomorrow: request → new service (user unaware)  │
  │                                                    │
  │ 4. Rollback MUST be instant                        │
  │    If new service fails → route back to monolith   │
  │    in seconds, not minutes                         │
  └─────────────────────────────────────────────────┘

Patterns required by this constraint:

| Pattern           | Purpose | Implementation |
|-------------------|---------|----------------|
| Strangler Fig     | Shift traffic per module, no big bang | YARP proxy routing: path-based → old or new |
| Feature Flags     | Toggle new vs old per module, per user | LaunchDarkly or Azure App Configuration |
| Blue-Green Deploy | 2 versions running in parallel, switch routing | Container Apps revisions, traffic splitting |
| Canary Release    | Route 5% of traffic → new service, monitor, then ramp up | YARP weighted routing rules |
| Shadow Mode       | New service receives a copy of traffic, outputs are compared | Traffic mirroring: monolith serves the response, new service only validates |
| Circuit Breaker   | Auto-fallback if new service unhealthy | Polly library (.NET) |
| Database CDC      | Sync data between old DB and new DBs | Debezium or Azure SQL CDC → Service Bus |

Interaction with other constraints:

Zero downtime + 5 engineers = MUST simplify

If 50 engineers: parallel migration, complex infra is OK
If 5 engineers: one module at a time, simple patterns, automated rollback

→ Strangler Fig + YARP + Feature Flags is the minimum viable set
→ Blue-Green with Container Apps (built-in), not custom
→ Shadow Mode only for critical paths (Travel booking)
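
To make the minimum viable set concrete, here is a small Python sketch of per-module routing with a canary weight and an instant rollback. It is an illustration of the idea only, not YARP configuration: in the real system this logic would live in the proxy's route and weight settings. The class names, the `/travel` prefix and the backend labels are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Route:
    prefix: str              # path prefix owned by one module, e.g. "/travel"
    new_weight: float = 0.0  # fraction of traffic sent to the new service (0.0 to 1.0)


@dataclass
class StranglerRouter:
    routes: dict = field(default_factory=dict)

    def register(self, prefix, new_weight=0.0):
        self.routes[prefix] = Route(prefix, new_weight)

    def set_weight(self, prefix, new_weight):
        if not 0.0 <= new_weight <= 1.0:
            raise ValueError("new_weight must be between 0.0 and 1.0")
        self.routes[prefix].new_weight = new_weight

    def rollback(self, prefix):
        # Rollback is a single in-memory weight change, hence "seconds, not minutes".
        self.set_weight(prefix, 0.0)

    def pick_backend(self, path, rng=random.random):
        for prefix, route in self.routes.items():
            if path.startswith(prefix):
                return "new-service" if rng() < route.new_weight else "monolith"
        # Paths that no module has claimed yet always stay on the monolith.
        return "monolith"


router = StranglerRouter()
router.register("/travel", new_weight=0.05)   # canary: 5% of Travel traffic
print(router.pick_backend("/travel/bookings/42"))
router.rollback("/travel")
print(router.pick_backend("/travel/bookings/42"))  # always "monolith" after rollback
```

The point of the sketch is that one module migrates at a time and that rollback needs no redeploy, which is what keeps the approach manageable for a team of five.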

2.3 "Payment flow cannot change in Phase 1"

Why freeze Payment?

Payment = HIGHEST RISK module

Risks if you touch Payment:
  1. PCI DSS compliance → audit, certification
  2. Financial transactions → money loss if bugs slip through
  3. Regulatory → legal liability
  4. User trust → payment failure = user churn
  5. Complexity → payment gateway integrations, reconciliation

Assessment signal: "Do you understand that NOT doing something is also an engineering decision?"

"Phase 1" definition:

| Scope   | Timeline  | Meaning |
|---------|-----------|---------|
| Phase 0 | Month 1   | AI setup, infra foundation |
| Phase 1 | Month 2-4 | Payment FROZEN here |
| Phase 2 | Month 5-7 | Payment stays in monolith, but planning can begin |
| Phase 3 | Month 8-9 | Payment migration CAN start if team is confident |

ACL pattern for Payment:

┌──────────────────────────────────────────────────────────┐
│                                                           │
│  New Travel Service ──→ Anti-Corruption Layer ──→ Legacy  │
│  (needs payment)         (adapter/facade)       Monolith  │
│                                                 Payment   │
│                                                           │
│  ACL does:                                                │
│  1. Translate new service's PaymentRequest                │
│     → legacy Payment API format                           │
│  2. Handle legacy exceptions → standard errors            │
│  3. Log/trace calls for observability                     │
│  4. Rate limit to protect legacy system                   │
│  5. Circuit break if legacy is slow                       │
│                                                           │
│  ACL does NOT:                                            │
│  • Change payment logic                                   │
│  • Store payment data in new DB                           │
│  • Process payments differently                           │
│                                                           │
│  This is a BRIDGE, not a migration.                       │
└──────────────────────────────────────────────────────────┘

Hidden implication: Every other service (Travel, Event) that needs payment must go through the ACL. This is one additional component to build and maintain. Effort for ACL ≈ 1-2 weeks.
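
A minimal Python sketch of the first two ACL responsibilities (request translation and error mapping) could look as follows. The brief does not describe the legacy Payment API, so every field name (`REF_NO`, `AMT`, `CCY`), the `TIMEOUT` code and all class names here are hypothetical placeholders. Logging, rate limiting and circuit breaking are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentRequest:
    """Hypothetical request shape used by the new services."""
    booking_id: str
    amount_minor: int   # amount in minor units, e.g. cents
    currency: str


class LegacyPaymentError(Exception):
    """Stand-in for whatever the monolith raises; the real type is unknown."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code


class PaymentUnavailable(Exception):
    """Standard error: the legacy system could not be reached in time."""


class PaymentRejected(Exception):
    """Standard error: the legacy system refused the payment."""


class PaymentAcl:
    def __init__(self, legacy_call):
        # legacy_call is an injected callable wrapping the unchanged legacy API.
        self._legacy_call = legacy_call

    def to_legacy(self, req):
        # Translate the new model into the (assumed) legacy payload format.
        major, minor = divmod(req.amount_minor, 100)
        return {
            "REF_NO": req.booking_id,
            "AMT": f"{major}.{minor:02d}",
            "CCY": req.currency.upper(),
        }

    def charge(self, req):
        try:
            return self._legacy_call(self.to_legacy(req))
        except LegacyPaymentError as exc:
            # Map legacy failures onto errors the new services understand.
            if exc.code == "TIMEOUT":
                raise PaymentUnavailable("legacy payment timed out") from exc
            raise PaymentRejected(f"legacy payment failed: {exc.code}") from exc


acl = PaymentAcl(legacy_call=lambda payload: {"status": "OK", "echo": payload})
print(acl.charge(PaymentRequest("B-1", 12550, "usd")))
```

Note that the adapter only translates and forwards: it holds no payment state and contains no payment logic, which matches the "bridge, not a migration" rule in the diagram.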

2.4 "5 engineers only"

This is the MOST LIMITING constraint.

5 engineers × 9 months = 45 man-months RAW

Overhead deduction:
  - Meetings, planning, reviews: ~15%
  - Learning new tech/patterns: ~10% (higher in Phase 0-1)
  - Sick leave, vacation: ~5%
  - Context switching: ~5%
  
Effective: 45 × 0.65 ≈ 29 man-months TRADITIONAL

With AI multiplier 2x (AI-heavy agentic):
  Effective: 29 × 1.5 (conservative) ≈ 44 man-months
  
  Explanation:
  • 2x does NOT mean 2 × 45 = 90
  • 2x applies to coding tasks (~60% of time)
  • Non-coding tasks (meetings, design) do not get the 2x multiplier
  • Realistic: ~44-50 effective man-months
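
One way to read the explanation above is as an Amdahl-style blend: only the coding share of the work gets the 2x speedup. The Python sketch below works that out from the overhead percentages listed earlier. The function name is illustrative, and treating the speedup this way is an assumption about how the multiplier is meant to combine.

```python
def blended_speedup(coding_share, coding_speedup):
    """Overall speedup when only `coding_share` of the work gets faster."""
    return 1.0 / ((1.0 - coding_share) + coding_share / coding_speedup)


RAW_MM = 5 * 9                              # 5 engineers x 9 months
OVERHEAD = 0.15 + 0.10 + 0.05 + 0.05        # meetings, learning, leave, context switching
net_mm = RAW_MM * (1 - OVERHEAD)            # "traditional" capacity
speedup = blended_speedup(coding_share=0.60, coding_speedup=2.0)
effective_mm = net_mm * speedup

print(round(net_mm, 2), round(speedup, 2), round(effective_mm, 1))
```

Under these exact inputs the blended factor is about 1.43 and the effective capacity about 42 man-months, slightly below the 1.5 factor and the ~44 figure used above. The two agree closely enough for planning, but the gap is a reminder that the multiplier should be measured rather than assumed.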

Team allocation (recommended):

| Engineer            | Role         | Focus |
|---------------------|--------------|-------|
| D1 (Lead/Architect) | Tech Lead    | Architecture, AI pipeline setup, code review, cross-cutting |
| D2 (Senior)         | Backend Lead | Travel Service (hardest module), mentoring D4-D5 |
| D3 (Senior)         | Platform     | Infra (Bicep), CI/CD, API Gateway, observability, shared libraries |
| D4 (Mid)            | Full-stack   | Event Service, then Workforce, frontend |
| D5 (Mid)            | Full-stack   | Communications (pilot), then Reporting, frontend |

Brooks's Law warning:

"Adding manpower to a late software project makes it later." (Fred Brooks, The Mythical Man-Month)

Implication: 5 engineers is FIXED. Cannot add people at Month 6.
Every decision must pass the "Feasible for 5?" test:

  ✅ Azure Container Apps (not AKS) — less ops
  ✅ Bicep over Terraform — simpler, Azure-only
  ✅ Single SPA not micro-frontends — less infra
  ✅ Azure SQL everywhere — one DB technology
  ✅ MassTransit over raw SDK — less boilerplate
  
  ❌ Kubernetes — too much ops
  ❌ Kafka — too much ops  
  ❌ Micro-frontends — too much infra
  ❌ Polyglot databases — too much expertise spread
  ❌ Custom service mesh — unnecessary at this scale

2.5 "9 months total"

Timeline analysis:

9 months = 39 working weeks ≈ 195 working days

Phase breakdown:
  Phase 0 (M1):     4 weeks  — AI setup, infra, pilot
  Phase 1 (M2-4):  13 weeks  — Travel + Event extraction
  Phase 2 (M5-7):  13 weeks  — Workforce + Comms + Reporting  
  Phase 3 (M8-9):   9 weeks  — Stabilize, optimize, handover

Key milestones:
  M1 end:  AI pipeline working, infra ready, Comms pilot extracted
  M3 end:  Travel Service live (canary 10%)
  M4 end:  Event Service live, Travel 100%
  M6 end:  Workforce live
  M7 end:  Comms + Reporting live
  M9 end:  Monolith reduced to Payment + legacy shell

What does NOT fit within 9 months?

| Item                       | Status          | Reason |
|----------------------------|-----------------|--------|
| Full Kubernetes migration  | ❌ Defer        | Ops overhead, Container Apps is sufficient |
| Payment extraction         | ❌ Defer        | Frozen in Phase 1, planning only in Phase 3. Execute after month 9 |
| Multi-region active-active | ❌ Defer        | Single region + CDN is sufficient for 40K |
| ML models in production    | ❌ Defer        | AI-ready data foundation: YES. Production ML: NO |
| Mobile app                 | ❌ Out of scope | Not mentioned in the brief |
| Full legacy decommission   | ❌ Defer        | Monolith will keep running for Payment. Decommission after the Payment migration |

3. Constraint Interaction Matrix — How They Affect Each Other

        40K Users  Zero DT   Payment    5 Eng     9 Months
        ─────────  ────────  ─────────  ────────  ────────
40K     ·          AMPLIFY   neutral    PRESSURE  neutral
Zero DT AMPLIFY    ·         SIMPLIFY   PRESSURE  PRESSURE
Payment neutral    SIMPLIFY  ·          RELIEF    RELIEF
5 Eng   PRESSURE   PRESSURE  RELIEF     ·         AMPLIFY
9 Mon   neutral    PRESSURE  RELIEF     AMPLIFY   ·

Key interaction explanations:

| Interaction | Meaning |
|-------------|---------|
| 40K × Zero DT = AMPLIFY | 40K global users → NO maintenance window. Zero downtime must be absolute, not "around 2am should be fine" |
| Zero DT × 5 Eng = PRESSURE | Strangler Fig + canary + rollback requires ops investment. 5 engineers must automate everything, no manual rollback |
| Payment frozen × 5 Eng = RELIEF | 1 fewer module → 5 engineers can focus on remaining 5 modules. Frozen = scope reduction |
| Payment frozen × 9 Months = RELIEF | Reduced scope → more breathing room on timeline. This is a trade-off the brief gifts you: leverage it! |
| 5 Eng × 9 Months = AMPLIFY | Not enough people + not enough time = must sacrifice scope or quality. Choose to sacrifice scope (defer features) |
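
The matrix is symmetric, so it can be stored as unordered pairs and queried in either direction. The Python sketch below encodes the ten pairs from the matrix and counts, per constraint, how many interactions add stress (AMPLIFY or PRESSURE). The "stress count" is an illustrative summary introduced here, not a metric from the brief.

```python
from itertools import combinations

CONSTRAINTS = ["40K", "Zero DT", "Payment", "5 Eng", "9 Mon"]

# Upper triangle of the interaction matrix, copied from the table.
_PAIRS = {
    ("40K", "Zero DT"): "AMPLIFY",
    ("40K", "Payment"): "neutral",
    ("40K", "5 Eng"): "PRESSURE",
    ("40K", "9 Mon"): "neutral",
    ("Zero DT", "Payment"): "SIMPLIFY",
    ("Zero DT", "5 Eng"): "PRESSURE",
    ("Zero DT", "9 Mon"): "PRESSURE",
    ("Payment", "5 Eng"): "RELIEF",
    ("Payment", "9 Mon"): "RELIEF",
    ("5 Eng", "9 Mon"): "AMPLIFY",
}
# frozenset keys make the lookup order-independent, mirroring the symmetry.
INTERACTIONS = {frozenset(pair): effect for pair, effect in _PAIRS.items()}

# Completeness check: every unordered pair of constraints has an entry.
assert len(INTERACTIONS) == len(list(combinations(CONSTRAINTS, 2)))


def interaction(a, b):
    return INTERACTIONS[frozenset((a, b))]


def stress_count(constraint):
    """Number of other constraints that amplify or pressure this one."""
    return sum(
        1
        for other in CONSTRAINTS
        if other != constraint
        and interaction(constraint, other) in {"AMPLIFY", "PRESSURE"}
    )


for name in CONSTRAINTS:
    print(name, stress_count(name))
```

Read this way, Zero DT and 5 Eng each carry three stressing interactions while Payment carries none, which matches the document's view that the frozen Payment flow is the main source of relief.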

4. Capacity Math — Man-Month Breakdown

Total raw: 5 engineers × 9 months = 45 man-months

Overhead deduction (~40%):
  Sprint ceremonies + reviews:    -7  MM
  Learning curve (Phase 0-1):     -4  MM
  Context switching + meetings:   -5  MM
  Leave + buffer:                 -2  MM
  ─────────────────────────────────────
  Net available:                  27  MM traditional

Phase-by-phase (variable AI multiplier):
  P0 (M1):    5.0 raw - 3.0 overhead = 2.0 × 1.0 =  2.0 MM
  P1 (M2-4): 15.0 raw - 6.0 overhead = 9.0 × 2.0 = 18.0 MM
  P2 (M5-7): 15.0 raw - 5.5 overhead = 9.5 × 2.0 = 19.0 MM
  P3 (M8-9): 10.0 raw - 3.5 overhead = 6.5 × 1.0 =  6.5 MM
  ─────────────────────────────────────────────────────────
  Total effective: 45.5, rounded down to ~44 man-months (conservative)
  Equivalent to: ~7.5 traditional engineers for 9 months

(Methodology aligned with Analysis v2.md & Planning.md)

Allocation per phase (variable multiplier):
  Phase 0 (M1):    5 MM raw →  2 MM effective (AI not yet set up, ×1.0)
  Phase 1 (M2-4): 15 MM raw → 18 MM effective (AI kicking in, ×2.0)
  Phase 2 (M5-7): 15 MM raw → 19 MM effective (full AI velocity, ×2.0)
  Phase 3 (M8-9): 10 MM raw →  6.5 MM effective (perf/docs, ×1.0)
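
The phase-by-phase figures can be recomputed with a few lines of Python, which also makes it easy to test other multipliers. The numbers are taken from the breakdown above; the tuple layout and function name are illustrative.

```python
# (phase, raw MM, overhead MM, AI multiplier), as listed in the breakdown.
PHASES = [
    ("P0 (M1)", 5.0, 3.0, 1.0),
    ("P1 (M2-4)", 15.0, 6.0, 2.0),
    ("P2 (M5-7)", 15.0, 5.5, 2.0),
    ("P3 (M8-9)", 10.0, 3.5, 1.0),
]


def effective_mm(raw, overhead, multiplier):
    """Net capacity after overhead, scaled by the phase's AI multiplier."""
    return (raw - overhead) * multiplier


per_phase = {name: effective_mm(raw, oh, mult) for name, raw, oh, mult in PHASES}
total_raw = sum(raw for _, raw, _, _ in PHASES)
total_overhead = sum(oh for _, _, oh, _ in PHASES)
total_effective = sum(per_phase.values())

for name, value in per_phase.items():
    print(f"{name}: {value} MM")
print(f"raw={total_raw} overhead={total_overhead} effective={total_effective}")
```

The per-phase overheads add up to the same 18 MM as the overall deduction, and the effective total is 45.5 MM before it is rounded down to the conservative ~44.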

"Is it enough?"

| Module                    | Estimated Effort | Feasible? |
|---------------------------|------------------|-----------|
| Travel Booking (hardest)  | 10-12 MM         | ✅ Phase 1 (3 months, 2 senior engineers + AI) |
| Event Management          | 6-8 MM           | ✅ Phase 1-2 (overlap with Travel tail) |
| Workforce + Allocation    | 5-7 MM           | ✅ Phase 2 |
| Communications (simplest) | 3-4 MM           | ✅ Phase 0 pilot + Phase 2 complete |
| Reporting (read-only)     | 3-4 MM           | ✅ Phase 2 |
| Infra + Platform          | 6-8 MM           | ✅ D3 full-time + shared effort |
| ACL for Payment           | 2-3 MM           | ✅ Phase 1 |
| Total                     | 35-46 MM         | ⚠️ Tight fit at 44 effective |

Conclusion: Feasible but no room for error. All scope creep must be blocked aggressively.
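
The "tight fit" can be made explicit by summing the low and high ends of the module estimates and comparing them with the ~44 effective man-months. A short Python sketch, using the estimates from the module table (the variable names are illustrative):

```python
# (low, high) effort estimates in man-months, from the module table.
ESTIMATES = {
    "Travel Booking": (10, 12),
    "Event Management": (6, 8),
    "Workforce + Allocation": (5, 7),
    "Communications": (3, 4),
    "Reporting": (3, 4),
    "Infra + Platform": (6, 8),
    "ACL for Payment": (2, 3),
}
CAPACITY_MM = 44  # conservative effective capacity

low_total = sum(low for low, _ in ESTIMATES.values())
high_total = sum(high for _, high in ESTIMATES.values())
headroom_best = CAPACITY_MM - low_total    # if every estimate hits its low end
headroom_worst = CAPACITY_MM - high_total  # if every estimate hits its high end

print(f"total: {low_total}-{high_total} MM")
print(f"headroom: {headroom_worst:+d} to {headroom_best:+d} MM")
```

The headroom ranges from +9 MM in the best case to -2 MM in the worst case, so the plan only holds if estimates do not all land at their upper end and no scope is added.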


5. Risk Matrix Derived From Constraints

| Risk | Likelihood | Impact | Constraint Source | Mitigation |
|------|------------|--------|-------------------|------------|
| Migration causes outage | Medium | Critical | Zero DT × 40K | Strangler Fig, canary, instant rollback |
| Team burnout | High | High | 5 Eng × 9 Months | Aggressive scope control, AI automation, sustainable sprint pace |
| Payment integration breaks | Low | Critical | Payment frozen | ACL isolation, extensive integration tests |
| Underestimate Travel complexity | Medium | High | 9 Months tight | Start Travel first (hardest), AI legacy analysis |
| AI tooling doesn't deliver 2x | Medium | High | Capacity dependent | Measure velocity weekly, fallback plan = reduce scope |
| Key engineer leaves | Low | Critical | 5 Eng | Cross-training, documentation, no single-person dependency |
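
If a rough ordering of these risks is useful, the qualitative ratings can be turned into a score. The Python sketch below uses a simple likelihood × impact product. The numeric scales (Low=1 to High=3, High=3 and Critical=4 for impact) are assumptions chosen for illustration and are not part of the analysis above.

```python
# Assumed numeric scales; the document only gives qualitative ratings.
LIKELIHOOD = {"Low": 1, "Medium": 2, "High": 3}
IMPACT = {"High": 3, "Critical": 4}

# (risk, likelihood, impact), copied from the risk matrix.
RISKS = [
    ("Migration causes outage", "Medium", "Critical"),
    ("Team burnout", "High", "High"),
    ("Payment integration breaks", "Low", "Critical"),
    ("Underestimate Travel complexity", "Medium", "High"),
    ("AI tooling doesn't deliver 2x", "Medium", "High"),
    ("Key engineer leaves", "Low", "Critical"),
]


def score(likelihood, impact):
    return LIKELIHOOD[likelihood] * IMPACT[impact]


# sorted() is stable, so equally scored risks keep their table order.
ranked = sorted(RISKS, key=lambda r: score(r[1], r[2]), reverse=True)

for name, likelihood, impact in ranked:
    print(f"{score(likelihood, impact):>2}  {name}")
```

On this scale team burnout and a migration outage rank highest, but the ordering is sensitive to the chosen weights and should be treated as a conversation starter rather than a result.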

6. Assessor Perspective

✅ WANT TO SEE:
  • Clear capacity math (man-months, AI multiplier, overhead)
  • Constraint interactions (40K + zero DT = no maintenance window) 
  • Explicit defer list (doesn't fit 9 months? SAY SO)
  • Payment frozen = scope gift, leverage it
  • Risk-aware: zero downtime with 5 people is hard, acknowledge it

❌ DO NOT WANT TO SEE:
  • "5 engineers is enough because we use AI" — need math, not faith
  • Ignoring zero downtime complexity
  • Promising to deliver all 5 modules + Payment in 9 months
  • Not acknowledging team burnout risk
  • Analyzing constraints in isolation without discussing interactions

7. Summary — What The Constraints Tell Us About The Playing Field

This is an OPTIMIZATION problem under CONSTRAINTS:

  Maximize: number of modules extracted into microservices
  Subject to:
    - Downtime = 0
    - Payment = frozen Phase 1
    - Engineers ≤ 5
    - Time ≤ 9 months
    - Quality ≥ production-grade

Optimal strategy:
  1. Payment frozen → reduced scope → leverage it
  2. AI 2x → increased capacity → exploit fully
  3. Simplest first (Comms) → build the pattern → apply to complex (Travel)
  4. Per-module extraction → Strangler Fig → zero downtime
  5. Defer everything that doesn't fit → say it directly

Constraints are NOT obstacles. Constraints are BOUNDARIES for engineering judgment.
Assessor test: Do you know how to play within boundaries, or do you try to break them?