Constraints Analysis
Deep-dive into Section 3 of the brief — 5 constraints, interactions between them, and capacity math
1. Original Brief (Verbatim)
| Constraint | Detail |
|---|---|
| Users | 40,000 active global users |
| Availability | Zero downtime during migration |
| Payment | Payment flow cannot change in Phase 1 |
| Team Size | 5 engineers only |
| Timeline | 9 months total |
2. Constraint-by-Constraint Analysis — What They Really Mean
2.1 "40,000 active global users"
Surface: User count.
Real signal:
| Aspect | Analysis |
|---|---|
| "40,000" | Not a startup (100 users) but not Facebook either (1B). Medium-scale enterprise. What matters more than the absolute number is the behavior pattern |
| "active" | These are active users, not registered. 40K active = potentially 200K+ registered. Daily active likely 5K-15K |
| "global" | Multiple timezones → no maintenance window. "There is always someone using the system." Traffic follows the sun |
Traffic estimation:
40K active users
├── Peak concurrent: ~10% = 4,000 concurrent sessions
├── Requests per session: ~20 pages/actions per session
├── Peak RPS: 4,000 × 20 / 3600 ≈ 22 req/s sustained
│ └── Burst: 5-10x → 110-220 req/s peak
├── Daily API calls: ~2-5 million
└── Database transactions: ~500K-1M/day
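The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the constant names are illustrative, and the one-hour session window is an assumption read off the division by 3600 in the tree.

```python
# Back-of-envelope traffic estimate from the figures quoted above.
ACTIVE_USERS = 40_000
PEAK_CONCURRENCY_PERCENT = 10    # ~10% of active users online at peak
REQUESTS_PER_SESSION = 20        # pages/actions per session
SESSION_WINDOW_SECONDS = 3600    # assumption: a session's requests spread over one hour
BURST_FACTORS = (5, 10)          # burst multipliers over the sustained rate


def estimate_traffic() -> dict:
    """Return peak concurrency, sustained RPS, and the burst RPS range."""
    concurrent = ACTIVE_USERS * PEAK_CONCURRENCY_PERCENT // 100
    sustained_rps = concurrent * REQUESTS_PER_SESSION / SESSION_WINDOW_SECONDS
    burst_rps = tuple(sustained_rps * factor for factor in BURST_FACTORS)
    return {"concurrent": concurrent, "sustained_rps": sustained_rps, "burst_rps": burst_rps}


if __name__ == "__main__":
    est = estimate_traffic()
    low, high = est["burst_rps"]
    print(f"Peak concurrent sessions: {est['concurrent']}")
    print(f"Sustained: {est['sustained_rps']:.1f} req/s")
    print(f"Burst: {low:.0f}-{high:.0f} req/s")
```

Changing the concurrency ratio or the session window is the quickest way to stress-test the "MODERATE" verdict: even a 3x error in either input keeps the peak in the hundreds of requests per second.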
Scale verdict: MODERATE
• No need for Kafka-level streaming
• Azure Service Bus Standard tier is sufficient
• Per-service Azure SQL handles this comfortably
• Azure Container Apps auto-scale handles burst
• CDN + cache reduces 60-80% of raw traffic
"Global" implications for architecture:
Global users = multi-region consideration
Option A: Single region + CDN (recommended Phase 1-2)
Azure Southeast Asia (Singapore) — nearest to Vietnam
Azure Front Door → cache static + route to nearest edge
Latency: ~50-200ms globally (acceptable for enterprise app)
Option B: Multi-region active-active (Phase 3+, if needed)
Primary: Southeast Asia
Secondary: West Europe or East US
Data replication: Azure SQL geo-replication
Trade-off: Option A is sufficient for 40K users. Multi-region is only
justified if SLA requires <100ms globally or regulatory compliance.
With 5 engineers + 9 months → Option A is the correct choice.
2.2 "Zero downtime during migration"
This is the HARDEST constraint.
Zero downtime does NOT mean "deploy fast then restart"
Zero downtime MEANS:
┌─────────────────────────────────────────────────┐
│ 1. At every point during 9 months, 40K users │
│ MUST be able to access the system normally │
│ │
│ 2. No "scheduled maintenance window" │
│ (global users = no safe window) │
│ │
│ 3. Migration is invisible to users │
│ Today: request → monolith │
│ Tomorrow: request → new service (user unaware) │
│ │
│ 4. Rollback MUST be instant │
│ If new service fails → route back to monolith │
│ in seconds, not minutes │
└─────────────────────────────────────────────────┘
Patterns required by this constraint:
| Pattern | Purpose | Implementation |
|---|---|---|
| Strangler Fig | Shift traffic per module, no big bang | YARP proxy routing: path-based → old or new |
| Feature Flags | Toggle new vs old per module, per user | LaunchDarkly or Azure App Configuration |
| Blue-Green Deploy | 2 versions running in parallel, switch routing | Container Apps revisions, traffic splitting |
| Canary Release | Route 5% traffic → new service, monitor, scale | YARP weighted routing rules |
| Shadow Mode | New service receives copy of traffic, compare output | Dual-write: monolith processes, new service validates |
| Circuit Breaker | Auto-fallback if new service unhealthy | Polly library (.NET) |
| Database CDC | Sync data between old DB and new DBs | Debezium or Azure SQL CDC → Service Bus |
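The routing decision behind Strangler Fig, feature flags, and canary release can be sketched in a language-neutral way. The Python below is an illustration only: `RouteRule`, `bucket_for`, and `choose_backend` are hypothetical names, not YARP or Azure App Configuration APIs, and hashing the user ID into a sticky bucket is one possible approach rather than the one the brief prescribes.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RouteRule:
    """Hypothetical per-module routing rule (illustrative, not a YARP type)."""
    module: str
    canary_percent: int = 0   # 0-100: share of users routed to the new service
    enabled: bool = True      # feature flag doubling as a kill switch


def bucket_for(user_id: str, module: str) -> int:
    """Map a user to a stable bucket in [0, 100) so routing is sticky per user."""
    digest = hashlib.sha256(f"{module}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_backend(rule: RouteRule, user_id: str) -> str:
    """Return 'new-service' for users inside the canary share, else 'monolith'."""
    if not rule.enabled:
        return "monolith"     # rollback path: one flag flip sends everyone back
    in_canary = bucket_for(user_id, rule.module) < rule.canary_percent
    return "new-service" if in_canary else "monolith"


if __name__ == "__main__":
    rule = RouteRule(module="travel", canary_percent=5)
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(choose_backend(rule, u) == "new-service" for u in users) / len(users)
    print(f"Canary share: {share:.1%}")
    rule.enabled = False
    print(all(choose_backend(rule, u) == "monolith" for u in users))
```

Sticky bucketing keeps a given user on one side of the split for the whole canary, which makes bug reports reproducible. Setting `enabled` to `False` is the "rollback in seconds" path: no redeploy, only a flag change.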
Interaction with other constraints:
Zero downtime + 5 engineers = MUST simplify
If 50 engineers: parallel migration, complex infra is OK
If 5 engineers: one module at a time, simple patterns, automated rollback
→ Strangler Fig + YARP + Feature Flags is the minimum viable set
→ Blue-Green with Container Apps (built-in), not custom
→ Shadow Mode only for critical paths (Travel booking)
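Shadow Mode for a critical path can be sketched as follows. This is a simplified, synchronous illustration: `handle_with_shadow` and the callables passed to it are hypothetical, and the sketch assumes the shadow call has no side effects (it must never trigger a real booking or payment).

```python
from typing import Any, Callable, Dict, List


def handle_with_shadow(
    request: Any,
    primary: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
    mismatches: List[Dict[str, Any]],
) -> Any:
    """Serve from the primary (monolith); run the shadow (new service) and log differences."""
    response = primary(request)
    try:
        shadow_response = shadow(request)
        if shadow_response != response:
            mismatches.append(
                {"request": request, "primary": response, "shadow": shadow_response}
            )
    except Exception as exc:
        # A failing shadow must never affect what the user receives.
        mismatches.append(
            {"request": request, "primary": response, "shadow_error": repr(exc)}
        )
    return response


if __name__ == "__main__":
    log: List[Dict[str, Any]] = []
    monolith = lambda req: {"price": 100}
    new_service = lambda req: {"price": 90}
    print(handle_with_shadow({"trip": "SGN-SIN"}, monolith, new_service, log))
    print(len(log))
```

The user always receives the monolith's answer; the mismatch log is what tells the team when the new service is ready for a canary. In a real deployment the shadow call would usually run off the request path so it adds no latency.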
2.3 "Payment flow cannot change in Phase 1"
Why freeze Payment?
Payment = HIGHEST RISK module
Risks if you touch Payment:
1. PCI DSS compliance → audit, certification
2. Financial transactions → money loss if bugs slip through
3. Regulatory → legal liability
4. User trust → payment failure = user churn
5. Complexity → payment gateway integrations, reconciliation
Assessment signal: "Do you understand that NOT doing something is also an engineering decision?"
"Phase 1" definition:
| Scope | Timeline | Meaning |
|---|---|---|
| Phase 0 | Month 1 | AI setup, infra foundation |
| Phase 1 | Month 2-4 | Payment FROZEN here |
| Phase 2 | Month 5-7 | Payment stays in monolith, but planning can begin |
| Phase 3 | Month 8-9 | Payment migration CAN start if team is confident |
ACL Pattern for Payment:
┌──────────────────────────────────────────────────────────┐
│ │
│ New Travel Service ──→ Anti-Corruption Layer ──→ Legacy │
│ (needs payment) (adapter/facade) Monolith │
│ Payment │
│ │
│ ACL does: │
│ 1. Translate new service's PaymentRequest │
│ → legacy Payment API format │
│ 2. Handle legacy exceptions → standard errors │
│ 3. Log/trace calls for observability │
│ 4. Rate limit to protect legacy system │
│ 5. Circuit break if legacy is slow │
│ │
│ ACL does NOT: │
│ • Change payment logic │
│ • Store payment data in new DB │
│ • Process payments differently │
│ │
│ This is a BRIDGE, not a migration. │
└──────────────────────────────────────────────────────────┘
Hidden implication: Every other service (Travel, Event) that needs payment must go through the ACL. This is one additional component to build and maintain. Effort for ACL ≈ 1-2 weeks.
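A minimal sketch of the ACL's translate, error-mapping, and circuit-breaking duties is shown below. Everything here is hypothetical: the `PaymentRequest` shape, the legacy field names (`BookingRef`, `Amount`, `Ccy`), and the thresholds are assumptions for illustration, since the brief does not describe the legacy Payment API. Logging and rate limiting are omitted for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PaymentRequest:
    """Hypothetical request shape used by the new services."""
    booking_id: str
    amount_minor: int   # amount in minor units (assumes a 2-decimal currency)
    currency: str


class PaymentUnavailable(Exception):
    """Standard error the new services see, whatever the legacy failure was."""


class LegacyPaymentAcl:
    def __init__(self, legacy_call: Callable[[dict], str], failure_threshold: int = 3,
                 reset_after_s: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self._legacy_call = legacy_call
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def charge(self, req: PaymentRequest) -> dict:
        # Circuit breaker: fail fast while the legacy system is known to be unhealthy.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise PaymentUnavailable("circuit open: legacy payment is failing")
            self._opened_at = None   # half-open: let one trial call through

        # Translate the new model into the (assumed) legacy payload format.
        payload = {
            "BookingRef": req.booking_id,
            "Amount": f"{req.amount_minor // 100}.{req.amount_minor % 100:02d}",
            "Ccy": req.currency.upper(),
        }
        try:
            legacy_ref = self._legacy_call(payload)
        except Exception as exc:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            # Map any legacy exception onto one standard error type.
            raise PaymentUnavailable(str(exc)) from exc
        self._failures = 0
        return {"status": "ok", "legacy_ref": legacy_ref}
```

Note what the sketch does not do: it holds no payment data and contains no payment logic. It only reshapes the call and shields both sides from each other, which is what keeps it a bridge rather than a migration.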
2.4 "5 engineers only"
This is the MOST LIMITING constraint.
5 engineers × 9 months = 45 man-months RAW
Overhead deduction:
- Meetings, planning, reviews: ~15%
- Learning new tech/patterns: ~10% (higher in Phase 0-1)
- Sick leave, vacation: ~5%
- Context switching: ~5%
Effective: 45 × 0.65 ≈ 29 man-months TRADITIONAL
With a 2x AI multiplier (AI-heavy, agentic workflow):
Effective: 29 × 1.5 (conservative) ≈ 44 man-months
Explanation:
• 2x does NOT mean 2 × 45 = 90
• 2x applies to coding tasks (~60% of time)
• Non-coding tasks (meetings, design) do not get the 2x multiplier
• Realistic: ~44-50 effective man-months
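The reasoning above can be checked numerically. The sketch below uses only figures from this section; the function names are illustrative, and the blended multiplier is one way to read "2x applies to ~60% of time", not a formula stated in the brief.

```python
# Capacity arithmetic for 5 engineers over 9 months.
RAW_MM = 5 * 9
OVERHEAD_PERCENT = {
    "meetings_planning_reviews": 15,
    "learning": 10,
    "leave": 5,
    "context_switching": 5,
}
CODING_SHARE = 0.60           # share of time spent on coding tasks
AI_CODING_MULTIPLIER = 2.0    # speed-up applied to coding tasks only
CONSERVATIVE_MULTIPLIER = 1.5


def traditional_capacity(raw_mm: float = RAW_MM) -> float:
    """Man-months left after overhead, with no AI assistance."""
    return raw_mm * (100 - sum(OVERHEAD_PERCENT.values())) / 100


def blended_multiplier() -> float:
    """Overall multiplier when only the coding share is accelerated."""
    return CODING_SHARE * AI_CODING_MULTIPLIER + (1 - CODING_SHARE) * 1.0


if __name__ == "__main__":
    base = traditional_capacity()
    print(f"Traditional: {base:.2f} MM")
    print(f"Blended multiplier: {blended_multiplier():.2f}")
    print(f"Conservative effective: {base * CONSERVATIVE_MULTIPLIER:.1f} MM")
    print(f"Blended effective: {base * blended_multiplier():.1f} MM")
```

The blended multiplier comes out at 1.6, slightly above the conservative 1.5 used above, which is why the realistic range starts at about 44 man-months rather than ending there.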
Team allocation (recommended):
| Engineer | Role | Focus |
|---|---|---|
| D1 (Lead/Architect) | Tech Lead | Architecture, AI pipeline setup, code review, cross-cutting |
| D2 (Senior) | Backend Lead | Travel Service (hardest module), mentoring D4-D5 |
| D3 (Senior) | Platform | Infra (Bicep), CI/CD, API Gateway, observability, shared libraries |
| D4 (Mid) | Full-stack | Event Service, then Workforce, frontend |
| D5 (Mid) | Full-stack | Communications (pilot), then Reporting, frontend |
Brooks's Law warning:
"Adding people to a late project makes it later" — Fred Brooks
Implication: 5 engineers is FIXED. Cannot add people at Month 6.
Every decision must pass the "Feasible for 5?" test:
✅ Azure Container Apps (not AKS) — less ops
✅ Bicep over Terraform — simpler, Azure-only
✅ Single SPA not micro-frontends — less infra
✅ Azure SQL everywhere — one DB technology
✅ MassTransit over raw SDK — less boilerplate
❌ Kubernetes — too much ops
❌ Kafka — too much ops
❌ Micro-frontends — too much infra
❌ Polyglot databases — too much expertise spread
❌ Custom service mesh — unnecessary at this scale
2.5 "9 months total"
Timeline analysis:
9 months = 39 working weeks ≈ 195 working days
Phase breakdown:
Phase 0 (M1): 4 weeks — AI setup, infra, pilot
Phase 1 (M2-4): 13 weeks — Travel + Event extraction
Phase 2 (M5-7): 13 weeks — Workforce + Comms + Reporting
Phase 3 (M8-9): 9 weeks — Stabilize, optimize, handover
Key milestones:
M1 end: AI pipeline working, infra ready, Comms pilot extracted
M3 end: Travel Service live (canary 10%)
M4 end: Event Service live, Travel 100%
M6 end: Workforce live
M7 end: Comms + Reporting live
M9 end: Monolith reduced to Payment + legacy shell
What does NOT fit within 9 months?
| Item | Status | Reason |
|---|---|---|
| Full Kubernetes migration | ❌ Defer | Ops overhead, Container Apps is sufficient |
| Payment extraction | ❌ Defer | Frozen Phase 1, only plan in Phase 3. Execute post-9-months |
| Multi-region active-active | ❌ Defer | Single region + CDN is sufficient for 40K |
| ML models in production | ❌ Defer | AI-ready data foundation: YES. Production ML: NO |
| Mobile app | ❌ Out of scope | Not mentioned in the brief |
| Full legacy decommission | ❌ Defer | Monolith will keep running for Payment. Decommission after the Payment migration |
3. Constraint Interaction Matrix — How They Affect Each Other
| | 40K Users | Zero DT | Payment | 5 Eng | 9 Months |
|---|---|---|---|---|---|
| 40K Users | · | AMPLIFY | neutral | PRESSURE | neutral |
| Zero DT | AMPLIFY | · | SIMPLIFY | PRESSURE | PRESSURE |
| Payment | neutral | SIMPLIFY | · | RELIEF | RELIEF |
| 5 Eng | PRESSURE | PRESSURE | RELIEF | · | AMPLIFY |
| 9 Months | neutral | PRESSURE | RELIEF | AMPLIFY | · |
Key interaction explanations:
| Interaction | Meaning |
|---|---|
| 40K × Zero DT = AMPLIFY | 40K global users → NO maintenance window. Zero downtime must be absolute, not "around 2am should be fine" |
| Zero DT × 5 Eng = PRESSURE | Strangler Fig + canary + rollback requires ops investment. 5 engineers must automate everything, no manual rollback |
| Payment frozen × 5 Eng = RELIEF | 1 fewer module → 5 engineers can focus on remaining 5 modules. Frozen = scope reduction |
| Payment frozen × 9 Months = RELIEF | Reduced scope → more breathing room on timeline. This is a trade-off the brief gifts you — leverage it! |
| 5 Eng × 9 Months = AMPLIFY | Not enough people + not enough time = must sacrifice scope or quality. Choose to sacrifice scope (defer features) |
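The matrix is symmetric, so it can be stored as unordered pairs and queried in either direction. The sketch below is illustrative: the grouping of AMPLIFY and PRESSURE as "stress" is an interpretation added here, not a classification from the brief.

```python
# Constraint interaction matrix stored as unordered pairs.
CONSTRAINTS = ["40K Users", "Zero DT", "Payment", "5 Eng", "9 Months"]

INTERACTIONS = {
    ("40K Users", "Zero DT"): "AMPLIFY",
    ("40K Users", "Payment"): "neutral",
    ("40K Users", "5 Eng"): "PRESSURE",
    ("40K Users", "9 Months"): "neutral",
    ("Zero DT", "Payment"): "SIMPLIFY",
    ("Zero DT", "5 Eng"): "PRESSURE",
    ("Zero DT", "9 Months"): "PRESSURE",
    ("Payment", "5 Eng"): "RELIEF",
    ("Payment", "9 Months"): "RELIEF",
    ("5 Eng", "9 Months"): "AMPLIFY",
}
STRESS = {"AMPLIFY", "PRESSURE"}   # assumption: both labels make a constraint harder


def interaction(a: str, b: str) -> str:
    """Look up the effect between two constraints, in either order."""
    if a == b:
        return "·"
    return INTERACTIONS[(a, b)] if (a, b) in INTERACTIONS else INTERACTIONS[(b, a)]


def stress_count(constraint: str) -> int:
    """Count how many other constraints make this one harder."""
    return sum(
        interaction(constraint, other) in STRESS
        for other in CONSTRAINTS
        if other != constraint
    )


if __name__ == "__main__":
    for name in CONSTRAINTS:
        print(f"{name}: {stress_count(name)} stressing interactions")
```

Counting this way, Zero DT and 5 Eng are each stressed by three other constraints while Payment is stressed by none, which matches the reading that the Payment freeze is the one constraint working in the team's favor.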
4. Capacity Math — Man-Month Breakdown
Total raw: 5 engineers × 9 months = 45 man-months
Overhead deduction (~40%):
Sprint ceremonies + reviews: -7 MM
Learning curve (Phase 0-1): -4 MM
Context switching + meetings: -5 MM
Leave + buffer: -2 MM
─────────────────────────────────────
Net available: 27 MM traditional
Phase-by-phase (variable AI multiplier):
P0 (M1): 5.0 raw - 3.0 overhead = 2.0 × 1.0 = 2.0 MM
P1 (M2-4): 15.0 raw - 6.0 overhead = 9.0 × 2.0 = 18.0 MM
P2 (M5-7): 15.0 raw - 5.5 overhead = 9.5 × 2.0 = 19.0 MM
P3 (M8-9): 10.0 raw - 3.5 overhead = 6.5 × 1.0 = 6.5 MM
─────────────────────────────────────────────────────────
Total effective: 45.5, rounded down to ~44 man-months (conservative)
Equivalent to: ~7.5 traditional engineers for 9 months
(Methodology aligned with Analysis v2.md & Planning.md)
Allocation per phase (variable multiplier):
Phase 0 (M1): 5 MM raw → 2 MM effective (AI not yet set up, ×1.0)
Phase 1 (M2-4): 15 MM raw → 18 MM effective (AI kicking in, ×2.0)
Phase 2 (M5-7): 15 MM raw → 19 MM effective (full AI velocity, ×2.0)
Phase 3 (M8-9): 10 MM raw → 6.5 MM effective (perf/docs, ×1.0)
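The phase-by-phase figures can be recomputed from raw capacity, overhead, and the per-phase multiplier. This is a small sketch using only the numbers in this section; the tuple layout and function names are illustrative.

```python
# (phase, raw MM, overhead MM, AI multiplier) as listed in the breakdown above.
PHASES = [
    ("P0 (M1)", 5.0, 3.0, 1.0),
    ("P1 (M2-4)", 15.0, 6.0, 2.0),
    ("P2 (M5-7)", 15.0, 5.5, 2.0),
    ("P3 (M8-9)", 10.0, 3.5, 1.0),
]


def effective_mm(raw: float, overhead: float, multiplier: float) -> float:
    """Effective man-months: net capacity scaled by the phase's AI multiplier."""
    return (raw - overhead) * multiplier


def summarize(phases=PHASES) -> dict:
    """Total raw, overhead, net, and effective man-months across all phases."""
    raw = sum(p[1] for p in phases)
    overhead = sum(p[2] for p in phases)
    effective = sum(effective_mm(r, o, m) for _, r, o, m in phases)
    return {"raw": raw, "overhead": overhead, "net": raw - overhead, "effective": effective}


if __name__ == "__main__":
    for name, raw, overhead, mult in PHASES:
        print(f"{name}: {effective_mm(raw, overhead, mult):.1f} MM effective")
    print(summarize())
```

Because the 2.0 multiplier applies only in Phases 1 and 2, a shortfall there is costly: dropping it to 1.5 in both phases removes 9.25 effective man-months from the total.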
"Is it enough?"
| Module | Estimated Effort | Feasible? |
|---|---|---|
| Travel Booking (hardest) | 10-12 MM | ✅ Phase 1 (3 months, 2 senior engineers + AI) |
| Event Management | 6-8 MM | ✅ Phase 1-2 (overlap with Travel tail) |
| Workforce + Allocation | 5-7 MM | ✅ Phase 2 |
| Communications (simplest) | 3-4 MM | ✅ Phase 0 pilot + Phase 2 complete |
| Reporting (read-only) | 3-4 MM | ✅ Phase 2 |
| Infra + Platform | 6-8 MM | ✅ D3 full-time + shared effort |
| ACL for Payment | 2-3 MM | ✅ Phase 1 |
| Total | 35-46 MM | ⚠️ Tight fit at 44 effective |
Conclusion: Feasible but no room for error. All scope creep must be blocked aggressively.
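The "tight fit" can be made explicit by summing the low and high estimates and comparing them with the 44 effective man-months. A minimal sketch using the table's figures; the names are illustrative.

```python
# (low, high) effort estimates in man-months, taken from the table above.
MODULE_ESTIMATES_MM = {
    "Travel Booking": (10, 12),
    "Event Management": (6, 8),
    "Workforce + Allocation": (5, 7),
    "Communications": (3, 4),
    "Reporting": (3, 4),
    "Infra + Platform": (6, 8),
    "ACL for Payment": (2, 3),
}
EFFECTIVE_CAPACITY_MM = 44


def totals(estimates=MODULE_ESTIMATES_MM) -> tuple:
    """Sum of all low estimates and sum of all high estimates."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


def headroom(capacity: int = EFFECTIVE_CAPACITY_MM) -> tuple:
    """Remaining capacity in the best case and in the worst case."""
    low, high = totals()
    return capacity - low, capacity - high


if __name__ == "__main__":
    print(f"Estimated effort: {totals()[0]}-{totals()[1]} MM")
    best, worst = headroom()
    print(f"Headroom: best case {best} MM, worst case {worst} MM")
```

The worst case overshoots capacity by 2 man-months, so if every module lands at its high estimate something has to be deferred. That is the quantitative basis for blocking scope creep.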
5. Risk Matrix Derived From Constraints
| Risk | Likelihood | Impact | Constraint Source | Mitigation |
|---|---|---|---|---|
| Migration causes outage | Medium | Critical | Zero DT × 40K | Strangler Fig, canary, instant rollback |
| Team burnout | High | High | 5 Eng × 9 Months | Aggressive scope control, AI automation, sustainable sprint pace |
| Payment integration breaks | Low | Critical | Payment frozen | ACL isolation, extensive integration tests |
| Underestimate Travel complexity | Medium | High | 9 Months tight | Start Travel first (hardest), AI legacy analysis |
| AI tooling doesn't deliver 2x | Medium | High | Capacity dependent | Measure velocity weekly, fallback plan = reduce scope |
| Key engineer leaves | Low | Critical | 5 Eng | Cross-training, documentation, no single-person dependency |
6. Assessor Perspective
✅ WANT TO SEE:
• Clear capacity math (man-months, AI multiplier, overhead)
• Constraint interactions (40K + zero DT = no maintenance window)
• Explicit defer list (doesn't fit 9 months? SAY SO)
• Payment frozen = scope gift, leverage it
• Risk-aware: zero downtime with 5 people is hard, acknowledge it
❌ DO NOT WANT TO SEE:
• "5 engineers is enough because we use AI" — need math, not faith
• Ignoring zero downtime complexity
• Promising to deliver all 5 modules + Payment in 9 months
• Not acknowledging team burnout risk
• Analyzing constraints in isolation without discussing interactions
7. Summary — What The Constraints Tell Us About The Playing Field
This is an OPTIMIZATION problem under CONSTRAINTS:
Maximize: number of modules extracted into microservices
Subject to:
- Downtime = 0
- Payment = frozen Phase 1
- Engineers ≤ 5
- Time ≤ 9 months
- Quality ≥ production-grade
Optimal strategy:
1. Payment frozen → reduced scope → leverage it
2. AI 2x → increased capacity → exploit fully
3. Simplest first (Comms) → build the pattern → apply to complex (Travel)
4. Per-module extraction → Strangler Fig → zero downtime
5. Defer everything that doesn't fit → say it directly
Constraints are NOT obstacles. Constraints are BOUNDARIES for engineering judgment.
Assessor test: Do you know how to play within boundaries, or do you try to break them?