
Most teams asking about AI in software architecture design want a shortcut: “Can we feed requirements into an LLM and get a clean architecture out?”
We use AI in architecture too, but with a hard rule:
Humans own architecture. AI validates it.
That single constraint is the difference between “fast diagrams” and an architecture you can actually operate at scale.
If you read the previous article in this series, you’ve already seen our stance: AI is an accelerator inside a quality-first SDLC, not the driver of system-level decisions. This second piece goes deeper into the architectural layer: how we use AI to pressure-test scalability and resilience before implementation, especially for real-time and high-load systems.
Can AI Design Software Architecture?
AI can propose architectures. That is not the same as designing architecture.
In enterprise systems, architecture is a set of trade-offs under constraints: latency budgets, failure domains, cost ceilings, compliance rules, team skills, operational maturity. Those constraints are often implicit, political, or only visible through scar tissue.
Also, LLMs have well-documented limitations with long contexts: performance can degrade when critical information is buried mid-context. Architecture problems are exactly the kind of “long context, many constraints” situation where that failure mode hurts.
And when tasks become long-horizon and multi-step, reliability drops sharply. METR’s work on long tasks shows success rates fall dramatically as tasks extend into multi-hour human-equivalent complexity. Architecture is a long-horizon task.
So yes, AI can help. But if AI “designs” your architecture, you’ve outsourced accountability to a tool that cannot carry it.
The Rule: Humans Design, AI Validates
We treat AI as a validator with three jobs:
- Surface risks you forgot to ask about (edge cases, failure modes, hidden couplings).
- Stress-test assumptions (capacity, latency, fanout, contention, queue growth).
- Force clarity (turn fuzzy requirements into measurable quality attributes).
This aligns well with established architecture and reliability practices:
- Architecture evaluation methods like ATAM exist precisely to make trade-offs explicit and testable.
- Reliability engineering emphasizes testing and evidence over belief (see Google’s “Testing for Reliability” for practical patterns and mindsets).
- Resilience practice (including chaos engineering) operationalizes the uncomfortable truth: you don’t “argue” your way into reliability – you prove it by experimentation.
AI is useful here because validation is mostly disciplined questioning, and AI doesn’t get tired of asking “what happens if…?”
Our Architecture Validation Workflow (Before Code)
Step 1: Define quality attributes as measurable targets
Architecture debates get stuck when everything is subjective. We start by turning “scalable” into numbers.
Examples:
- Latency: p95 and p99 targets by endpoint and region
- Availability: SLOs by feature (not just “the service”)
- Throughput: steady-state and burst requirements
- Cost: max cost per 1,000 sessions / per 1M events
- Data integrity: acceptable loss window, idempotency expectations
AI helps by converting vague goals into candidate metrics and by generating “did you mean…” questions:
- “When you say ‘real-time’, do you mean <200ms glass-to-glass, or <1s end-to-end?”
- “Do you need global low-latency or regional?”
This is also where you define what failure is allowed. If you use error budgets, write them down now (Google’s SRE material is a good starting point for the framing, even if you adapt it).
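Targets like these are cheap to write down as structured data, which makes them reviewable and checkable later. A minimal sketch, with illustrative attribute names and numbers (not recommendations):

```python
from dataclasses import dataclass

# "Scalable" turned into numbers. All names and targets are illustrative.

@dataclass(frozen=True)
class SloTarget:
    name: str            # e.g. "join_call"
    p95_ms: float        # 95th-percentile latency budget
    p99_ms: float        # 99th-percentile latency budget
    availability: float  # e.g. 0.999 for "three nines"

    def error_budget_minutes_per_month(self) -> float:
        """Downtime allowed per 30-day month under this availability SLO."""
        return (1 - self.availability) * 30 * 24 * 60

join_call = SloTarget("join_call", p95_ms=200, p99_ms=500, availability=0.999)
print(round(join_call.error_budget_minutes_per_month(), 1))  # 43.2
```

Writing the error budget as a derived number, rather than a slogan, is what later makes "we have budget left for this risky deploy" a factual statement.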
Step 2: Build a back-of-envelope capacity model (then challenge it)
Before you draw microservices boxes, you should be able to answer:
- What’s the expected QPS / CPS (queries and connections per second)?
- What’s the fanout per request?
- What’s the payload size and frequency?
- What’s the concurrency model: threads, async, actors, processes?
- Where does state live, and how is it partitioned?
AI is useful for sanity-checking math and for enumerating “hidden multipliers”:
- retries
- reconnect storms
- cache misses
- “thundering herd” after deploys/incidents
- background jobs that scale with user actions
You’re not asking AI to invent your numbers. You’re asking it to attack your model.
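The hidden multipliers above can be folded straight into the model. A toy sketch, assuming illustrative numbers; the point is that multipliers compound, so effective backend load can diverge sharply from nominal QPS:

```python
# Back-of-envelope capacity model. All inputs are illustrative.

def effective_qps(nominal_qps: float,
                  fanout: float,          # downstream calls per user request
                  retry_rate: float,      # fraction of calls that get retried
                  cache_hit_rate: float) -> float:
    """Requests per second that actually reach the backing store."""
    downstream = nominal_qps * fanout
    with_retries = downstream * (1 + retry_rate)
    return with_retries * (1 - cache_hit_rate)

# 1,000 user requests/s, 3x fanout, 5% retries, 90% cache hit rate:
print(round(effective_qps(1000, fanout=3, retry_rate=0.05, cache_hit_rate=0.90), 1))
# -> 315.0 requests/s hit the database
```

Now re-run the same model with the cache cold (hit rate 0) after a deploy: the database sees 3,150 requests/s, a 10x jump. That is the "thundering herd" scenario in one line of arithmetic.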
Step 3: Identify bottlenecks and the “one thing that will page you”
We look for:
- single-threaded chokepoints
- shared locks/contention
- unbounded queues
- cross-region chatty calls
- “database as message bus” patterns
- poorly bounded cardinality (e.g., per-user, per-room, per-device)
AI helps by producing structured checklists, but humans must apply domain judgment.
If you want a formal discipline for this evaluation conversation, ATAM is a proven framing tool: quality attributes, scenarios, sensitivity points, and trade-offs.
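One way to make "unbounded queues" and "chokepoints" concrete is a back-of-envelope queueing check. Under an M/M/1 approximation (a deliberate simplification, not a production model), mean queueing delay explodes as utilization approaches 1, which is why a chokepoint running at 85% in steady state has no headroom for bursts:

```python
# M/M/1 mean-wait estimate: a sanity check, not a capacity plan.

def mm1_mean_wait_ms(arrival_per_s: float, service_per_s: float) -> float:
    rho = arrival_per_s / service_per_s        # utilization
    if rho >= 1:
        raise ValueError("queue grows without bound (arrival >= service rate)")
    service_time_s = 1 / service_per_s
    return 1000 * service_time_s * rho / (1 - rho)

print(round(mm1_mean_wait_ms(800, 1000), 2))   # 4.0 ms wait at 80% utilization
print(round(mm1_mean_wait_ms(950, 1000), 2))   # 19.0 ms wait at 95% utilization
```

The nonlinearity is the lesson: going from 80% to 95% utilization roughly quintuples the wait, and at 100% the queue never drains.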
Step 4: Map failure modes and blast radius (design for containment)
We want to answer:
- What fails first under overload?
- What fails second?
- Can one tenant take down others?
- What happens when a dependency slows down (not just “goes down”)?
- What is the degradation mode: shed load, reduce quality, disable features?
AI is strong at enumerating failure modes, especially when you constrain it to a component (“Only focus on signaling”, “Only focus on media plane”, “Only focus on DB and cache”).
Then humans decide the invariants:
- what must never happen (data corruption, security boundary break)
- what is acceptable (dropping non-critical analytics events)
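Those invariants can be encoded as a load-shedding policy: under overload, the lowest-priority work is dropped first so the "must never happen" class is protected. A minimal sketch; the priority names and thresholds are illustrative:

```python
# Priority-based load shedding: lower number = more critical.
PRIORITY = {"billing": 0, "signaling": 1, "analytics": 2}

def admit(event_type: str, load_factor: float) -> bool:
    """Shed load top-down: analytics drops first, then signaling;
    billing (the invariant) is never shed."""
    prio = PRIORITY[event_type]
    if prio == 0:
        return True                       # invariant: always admitted
    if prio == 2 and load_factor >= 1.0:
        return False                      # acceptable loss under overload
    if prio == 1 and load_factor >= 1.2:
        return False                      # shed only under severe overload
    return True

assert admit("billing", 1.5)              # never dropped
assert not admit("analytics", 1.1)        # dropped first under pressure
assert admit("signaling", 1.1)            # still admitted at moderate overload
```

The value is not the ten lines of code; it is that "what is acceptable" now has an executable definition a reviewer can argue with.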
Step 5: Write the validation plan (tests you can run later)
Before coding, we define what evidence will “prove” the design:
- load test types (steady, burst, soak)
- dependency degradation tests (latency injection)
- reconnect storm simulations
- rollback tests and deploy safety
- observability requirements (what you must measure to know you’re safe)
Google’s SRE guidance is blunt about this: reliability comes from testable claims and engineering discipline, not optimism.
And if your system is truly high-stakes, you eventually want controlled failure injection in production-like environments (Netflix’s chaos engineering write-ups are a useful reference for the philosophy and evolution of tooling).
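A dependency-degradation test from the plan above can be sketched very simply: inject latency into a dependency call and assert that the degradation path actually engages. Everything here is illustrative, and a real client would enforce the timeout itself (cancelling the call) rather than measuring after the fact as this toy does:

```python
import time

def slow_dependency(injected_delay_s: float) -> str:
    """Stand-in for a downstream call, with latency injected for the test."""
    time.sleep(injected_delay_s)
    return "fresh"

def fetch_with_fallback(timeout_s: float, injected_delay_s: float) -> str:
    start = time.monotonic()
    result = slow_dependency(injected_delay_s)
    elapsed = time.monotonic() - start
    if elapsed > timeout_s:
        return "cached"   # degradation mode: serve stale data instead of failing
    return result

# The test's claim: when the dependency is slow, users get stale-but-fast.
assert fetch_with_fallback(timeout_s=0.05, injected_delay_s=0.2) == "cached"
assert fetch_with_fallback(timeout_s=0.05, injected_delay_s=0.0) == "fresh"
```

Note what this validates: not that the dependency is reliable, but that your system behaves as designed when it isn't. That is the difference between "goes down" testing and "slows down" testing.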
Techniques We Reuse (And Where AI Fits)
1) ATAM-style trade-off evaluation
ATAM’s value is forcing trade-offs into daylight. AI helps by:
- generating scenario lists per quality attribute
- asking “what changes in the design if this scenario doubles?”
- summarizing risks and sensitivity points for review
2) SRE-style reliability testing mindset
AI helps you produce:
- coverage matrices (feature × failure type)
- test case catalogs
- observability checklists (metrics/logs/traces you’ll need)
3) Chaos / failure injection (as a maturity step)
AI can help propose experiments, but humans own:
- safety constraints
- rollback conditions
- blast radius controls
Example: validating a real-time WebRTC system before you build it
Real-time systems are a good stress test for “AI in enterprise architecture” because the constraints bite early: latency, jitter, bandwidth, NAT traversal, reconnect storms, and unpredictable client networks.
A common scalability decision is the media topology:
- Mesh / P2P: simplest, but quickly limited by client uplink as participants grow
- SFU (Selective Forwarding Unit): forwards streams; scales better for conferences
- MCU: mixes/transcodes centrally; can simplify client load but increases server CPU cost and complexity
If you’re building conferencing, you almost always end up in SFU territory. Good public references:
- Jitsi’s architecture overview (signaling + videobridge components).
- LiveKit’s SFU internals (useful for understanding how modern SFUs organize routing and layers).
How we use AI in this phase (validation-only):
- Enumerate scaling questions:
- “What happens to CPU when you enable simulcast?”
- “What happens to egress cost when average subscriber count increases?”
- “What happens during mass reconnect after a brief outage?”
- Generate a test matrix:
- participants: 2/4/10/30/100
- network: good/congested/high-loss/mobile
- behaviors: screen share on/off, camera toggles, rapid join/leave, device rotation
- Identify hidden coupling risks:
- signaling reliability affects perceived media reliability
- TURN cost explosion under certain NAT conditions
- regional placement decisions change p99 latency
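The test matrix above is cheap to generate mechanically, so no combination gets silently skipped. Dimension values mirror the bullets and are illustrative:

```python
from itertools import product

participants = [2, 4, 10, 30, 100]
networks = ["good", "congested", "high-loss", "mobile"]
behaviors = ["screen-share", "camera-toggle", "rapid-join-leave", "device-rotation"]

# Full cross-product: every (room size, network, behavior) combination.
matrix = list(product(participants, networks, behaviors))
print(len(matrix))  # 5 * 4 * 4 = 80 scenarios
```

Eighty scenarios is more than you will run manually, which is the point: the matrix makes the pruning decision explicit ("we skip 100-participant device-rotation on mobile because...") instead of accidental.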
What we do not do: ask AI to choose SFU vs MCU as “the architect.” We choose, then AI tries to break our choice.
What To Ask AI (Prompts That Actually Work)
The best prompts are narrow, bounded, and validation-oriented.
Prompt patterns:
- Assumption attack
- “Here is our capacity model. List the top 15 assumptions that could be wrong, and how each would fail in production.”
- Failure-mode enumeration (bounded component)
- “For the signaling service only: list failure modes under burst joins, and propose containment strategies.”
- Test plan generation
- “Generate a load + resilience test plan that would validate these SLOs. Include steady, burst, and soak tests.”
- Observability checklist
- “Given these failure modes, what metrics and alerts are required to detect them before users complain?”
Why this works: it keeps AI doing what it’s good at—structured enumeration—without asking it to own the system.
What Not To Ask AI (Where Teams Get Burned)
- “Design the architecture for our system.”
- “Pick our database/messaging stack.”
- “Decide our boundaries and service ownership.”
- “Optimize for cost/performance without constraints.”
Those prompts produce confident outputs that often ignore the constraints that actually matter, and when the output is “almost right,” the cost is paid later in rework (a theme you already established in Article 1).
If you want a quantitative reminder that AI output quality still needs hard controls, you can reference:
- Code quality deltas found in real PR analysis (AI PRs showing more issues on average): CodeRabbit report (press summary: Business Wire)
- Duplication/refactoring trends in large-scale code change datasets: GitClear 2025 research
(These aren’t “architecture papers,” but they support the governance posture: AI output increases the need for disciplined validation.)
The “Before Code” Architecture Validation Checklist
Use this as your pre-implementation gate:
- Quality attributes defined (latency, availability, cost, security) with measurable targets
- Workload model written (steady + burst + growth curve)
- Capacity model sanity-checked (including retries, reconnect storms, cache miss paths)
- Bottlenecks identified (contention points, single-thread chokepoints, fanout hotspots)
- State strategy clear (partitioning, consistency model, data lifecycle)
- Failure modes mapped (dependency slowness, partial outage, regional impairment)
- Blast radius bounded (bulkheads, rate limits, circuit breakers, backpressure)
- Degradation modes designed (feature flags, quality reduction, load shedding)
- Test plan defined (load, stress, soak, fault injection)
- Observability requirements defined (golden signals + component-specific metrics)
- Operational readiness (runbooks, rollback, incident roles)
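One containment primitive from the checklist, sketched concretely: a token-bucket rate limiter that absorbs a burst up to its capacity and then rejects until tokens refill. This is a deterministic teaching sketch (time is advanced explicitly); production systems normally reach for a battle-tested library rather than hand-rolling this:

```python
class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, sustained rate `refill_per_s`."""

    def __init__(self, capacity: float, refill_per_s: float):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = capacity          # start full

    def advance(self, elapsed_s: float) -> None:
        """Refill tokens for `elapsed_s` seconds, capped at capacity."""
        self.tokens = min(self.capacity,
                          self.tokens + elapsed_s * self.refill_per_s)

    def allow(self) -> bool:
        """Admit one request if a token is available."""
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_s=1)
burst = [bucket.allow() for _ in range(7)]
print(burst)       # [True, True, True, True, True, False, False]
bucket.advance(2)  # two seconds pass: two tokens refill
print(bucket.allow())  # True
```

The checklist item "blast radius bounded" is satisfied not by having this code, but by knowing, per entry point, what its capacity and refill numbers are and what the rejected caller experiences.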
AI can help you generate the first version of this list, but a human needs to make it true.
FAQ
Can AI design software architecture end-to-end?
AI can generate architecture proposals, but end-to-end architecture requires accountable trade-offs across latency, cost, security, and operations. We treat AI as a validator that stress-tests a human-owned design, because long-context constraints and long-horizon task reliability remain real limits.
What is AI architecture validation?
AI architecture validation means using AI to pressure-test decisions: enumerate failure modes, challenge capacity assumptions, generate test matrices, and identify missing observability. The goal isn’t “AI-designed architecture,” but evidence-backed architecture that can be validated via reliability testing and experiments.
What should we validate before writing any code?
Validate measurable quality attributes, workload/capacity assumptions, bottlenecks, failure modes, blast radius, degradation behavior, and a load + resilience test plan. Structured methods like ATAM help formalize trade-offs early.
How does chaos engineering relate to architecture validation?
Chaos engineering validates that your system’s architecture can survive real failures by injecting controlled faults and observing behavior. It’s a maturity step: once you have basic reliability tests, you evolve toward safe, repeatable experiments that confirm resilience assumptions.
What are common AI-driven architecture mistakes?
The most common mistakes are letting AI choose boundaries and infrastructure without real constraints, missing operational failure modes (slowness hurts more than a clean outage), and ignoring hidden multipliers like retries and reconnect storms. We mitigate this by bounding AI to validation tasks and requiring measurable targets plus testable claims.
Conclusion: The Fastest Way To “Use AI In Architecture” Is To Stop Asking It To Be The Architect
If you’re serious about AI in software architecture design, the win is not an AI-generated diagram. The win is a validated architecture: measurable targets, explicit trade-offs, and a test plan that proves you can scale.
AI’s best architectural value is relentless questioning:
- “What breaks first?”
- “What happens under burst?”
- “What’s the containment plan?”
- “What evidence will convince us we’re safe?”
That’s how you get speed and quality.
If you’re building a real-time or high-load system and want a second set of eyes on your architecture – this is exactly where we engage: human-led design, AI-accelerated validation, and a plan you can execute.
Read the workflow context here: AI in the Software Development Process

