Inside AWS Security Agent: How Multi-Agent Systems Automate Penetration Testing
A deep dive into the multi-agent architecture behind AWS Security Agent's automated penetration testing — from specialized agent swarms to assertion-based validation.
Table of Contents
- The Problem
- The Solution
- How It Works
- Phase 1: Authentication & Initial Access
- Phase 2: Baseline Scanning
- Phase 3: Multi-Phased Exploration
- Phase 4: Specialized Agent Swarm
- Phase 5: Validation & Report Generation
- Benchmarking: How Good Is It?
- The Exploration vs. Exploitation Trade-off
- What I Learned
- What’s Next
AWS just published a detailed look at how they built automated penetration testing into AWS Security Agent — and the architecture is a textbook example of multi-agent collaboration done right. Here’s a breakdown of what makes it work, why it matters, and what you can learn from it.
The Problem
Penetration testing is one of the most valuable security practices — and one of the hardest to scale. A manual pentest on a web application typically takes weeks, requires specialized human expertise, and produces results that are point-in-time snapshots. By the time the report is delivered, the codebase has already changed.
Traditional automated tools (vulnerability scanners, DAST tools) help but fall short in critical ways:
- No reasoning — they follow predefined rules, missing business logic flaws and complex attack chains
- No adaptation — they can’t adjust strategy based on what they discover mid-test
- High false positives — without contextual understanding, they flag everything that matches a pattern
- No chained attacks — they test individual vulnerabilities in isolation, missing combinations like IDOR + privilege escalation
The gap between what a skilled human pentester can find and what automated tools catch is still enormous. Closing that gap is exactly what multi-agent systems are designed for.
The Solution
AWS Security Agent uses a multi-agent architecture where specialized agents collaborate across five distinct phases: authentication, baseline scanning, multi-phased exploration, swarm execution, and validated reporting.
The key insight is the combination of breadth (systematic scanning across known vulnerability categories) and depth (adaptive, intelligence-driven exploration that reasons about what it finds). Instead of a single agent trying to do everything, specialized workers handle specific risk types while an orchestration layer manages task generation, dispatch, and validation.
How It Works
Phase 1: Authentication & Initial Access
Before testing begins, the system needs to get in. An intelligent sign-in component combines LLM-based reasoning with deterministic mechanisms to:
- Locate sign-in pages across diverse application architectures
- Attempt provided credentials
- Maintain authenticated sessions for subsequent testing phases
This isn’t just “fill in a form” — the agent adapts to different app structures automatically. Developers can optionally provide a custom sign-in prompt for tricky authentication flows.
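AWS hasn't published the sign-in component's implementation, but the "deterministic mechanisms first, LLM reasoning as fallback" split can be sketched. The sketch below shows only the deterministic half: a heuristic that locates a sign-in form by finding a password input. All names (`SignInCandidate`, `find_sign_in_form`) are hypothetical; a real system would hand the page to an LLM when this heuristic returns nothing.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SignInCandidate:
    action: str          # form submission target
    username_field: str  # name of the username/email input
    password_field: str  # name of the password input

def find_sign_in_form(html: str) -> Optional[SignInCandidate]:
    """Deterministic first pass: find a form that contains a password input.
    Returns None when the heuristic fails -- the cue to fall back to
    LLM-based reasoning (or a developer-supplied custom sign-in prompt)."""
    form = re.search(r'<form[^>]*action="([^"]+)"[^>]*>(.*?)</form>',
                     html, re.S | re.I)
    if not form:
        return None
    action, body = form.group(1), form.group(2)
    pwd = re.search(r'<input[^>]*type="password"[^>]*name="([^"]+)"', body, re.I)
    user = re.search(r'<input[^>]*type="(?:text|email)"[^>]*name="([^"]+)"', body, re.I)
    if not (pwd and user):
        return None
    return SignInCandidate(action, user.group(1), pwd.group(1))
```

The point of the split is cost: cheap regex heuristics handle the common case, and the expensive LLM call only fires on unusual application structures.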
Phase 2: Baseline Scanning
Once authenticated, parallel scanners establish initial coverage:
| Scanner | Mode | Output |
|---|---|---|
| Network scanner | Black-box | Raw traffic interactions, candidate vulnerable endpoints |
| Code scanner | White-box (when source available) | Descriptive documentation across vulnerability categories |
| Specialized scanners | Both | Multi-dimensional vulnerability identification |
These scanners run in parallel — this is the breadth-first foundation that maps the attack surface before the agents start reasoning about what to do with it.
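The fan-out-and-merge shape of this phase is straightforward to sketch. The scanner stubs below are hypothetical stand-ins (real ones would drive network traffic or parse source code); the structural point is running them concurrently and unioning their endpoint candidates into one attack-surface map.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scanner stubs standing in for the real black-box and
# white-box scanners described above.
def network_scan(target):
    return {"endpoints": [f"{target}/api/users", f"{target}/api/orders"]}

def code_scan(target):
    return {"endpoints": [f"{target}/api/users"],
            "notes": ["raw SQL in orders handler"]}

def run_baseline(target, scanners):
    """Fan the scanners out in parallel, then union their results."""
    with ThreadPoolExecutor(max_workers=len(scanners)) as pool:
        results = list(pool.map(lambda scan: scan(target), scanners))
    endpoints = sorted({e for r in results for e in r.get("endpoints", [])})
    notes = [n for r in results for n in r.get("notes", [])]
    return {"endpoints": endpoints, "notes": notes}
```

Because the scanners are independent, this phase parallelizes trivially; the merged output becomes the shared context the exploration phases reason over.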
Phase 3: Multi-Phased Exploration
This is where the architecture gets interesting. Two exploration strategies work in concert:
Managed execution runs predefined, static tasks across major risk categories — XSS, IDOR, privilege escalation, injection, and more. Think of this as the “checklist” approach: systematic, comprehensive, and predictable.
Guided exploration is the adaptive layer. It ingests everything discovered so far — endpoints, validated findings, code analysis — and reasons about application-specific attack opportunities. It operates in two stages:
- Planning — generates a contextual penetration testing plan by identifying unexplored resources and potential vulnerability chains
- Execution — programmatically manages the dynamically generated tasks, which evolve based on application responses
This managed-then-guided approach is smart. You get guaranteed coverage from the managed tasks, then targeted depth from the guided explorer. The guided explorer can spot things like: “This endpoint returns user data without checking ownership — let me test if I can chain this with the session token from endpoint X.”
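The planning/execution loop can be made concrete with a toy sketch. Everything here is an assumption about the shape of the loop, not AWS's implementation: the planner is a stub with one hard-coded chaining rule (an IDOR finding triggers a privilege-escalation follow-up), where the real system would use an LLM to generate tasks from context.

```python
def plan_tasks(context):
    """Planning stage: turn unexplored endpoints and prior findings into
    tasks. A production planner would be LLM-driven; this stub encodes
    a single chaining rule for illustration."""
    tasks = [{"endpoint": ep, "check": "ownership"}
             for ep in context["endpoints"] if ep not in context["explored"]]
    for f in context["findings"]:
        if f["type"] == "idor":  # confirmed IDOR suggests a chained follow-up
            tasks.append({"endpoint": f["endpoint"],
                          "check": "privilege_escalation"})
    return tasks

def guided_explore(context, execute, rounds=3):
    """Execution stage: run dynamically generated tasks, feeding each
    round's findings back into the next round's plan."""
    done = set()
    for _ in range(rounds):
        tasks = [t for t in plan_tasks(context)
                 if (t["endpoint"], t["check"]) not in done]
        if not tasks:
            break
        for task in tasks:
            done.add((task["endpoint"], task["check"]))
            context["explored"].add(task["endpoint"])
            finding = execute(task)
            if finding:
                context["findings"].append(finding)
    return context["findings"]
```

The feedback edge is the important part: round two's plan is computed from round one's findings, which is what lets the explorer chain an IDOR into an escalation rather than testing each in isolation.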
Phase 4: Specialized Agent Swarm
Both exploration approaches dispatch work to swarm worker agents — each configured for a specific risk type. Every worker comes equipped with a full penetration testing toolkit:
- Code executors — run exploit code against targets
- Web fuzzers — test input handling with malformed data
- NVD search — query the National Vulnerability Database for known CVEs
- Vulnerability-specific tools — tailored instruments per risk category
Workers operate with timeout management and produce structured reports. This is the “embarrassingly parallel” part of the architecture — dozens of specialized agents hammering away at their specific domains simultaneously.
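A minimal sketch of that dispatch layer, under assumed interfaces (a `workers` registry keyed by risk type, dict-shaped tasks and reports): each task goes to the worker built for its risk type, with a per-task timeout and a structured report either way.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def dispatch_swarm(tasks, workers, timeout_s=5.0):
    """Send each task to the worker registered for its risk type,
    enforcing a per-task timeout and collecting structured reports.
    Note: timed-out worker threads cannot be force-killed; a real
    system would run workers in separate processes or sandboxes."""
    reports = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(workers[t["risk"]], t): t
                   for t in tasks if t["risk"] in workers}
        for fut, task in futures.items():
            try:
                reports.append({"task": task, "status": "done",
                                "result": fut.result(timeout=timeout_s)})
            except FutureTimeout:
                reports.append({"task": task, "status": "timeout",
                                "result": None})
            except Exception as exc:
                reports.append({"task": task, "status": "error",
                                "result": str(exc)})
    return reports
```

The structured report per task (including timeouts and errors) is what makes the downstream validation phase possible: nothing disappears silently.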
Phase 5: Validation & Report Generation
Here’s what separates this from a glorified scanner: rigorous, multi-layered validation.
LLM agents can produce plausible-sounding findings that aren’t actually exploitable. The AWS team addresses this with:
- Deterministic validators — programmatic checks that verify exploit prerequisites
- LLM-based validators — specialized agents that attempt independent re-exploitation
- Assertion-based validation — natural language assertions written by security experts that encode deep knowledge about real attack behaviors
The assertion-based approach is clever. Instead of narrow regex checks that are easy to game, these assertions require explicit, structured proof of exploitation. They’re harder to satisfy accidentally and encode the kind of judgment a human reviewer would apply.
Validated findings then go through CVSS scoring for severity assessment, combining the agent’s actual exploit evidence with standardized vulnerability metrics.
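The three-layer gate composes naturally as a short-circuiting pipeline. The sketch below is illustrative only: `reexploit` stands in for an independent LLM agent, and the assertion table maps an expert-written natural-language assertion to a predicate demanding structured proof (here, leaked rows or a database error for SQL injection).

```python
def deterministic_check(finding):
    # Layer 1: programmatic prerequisite -- exploit evidence must
    # include a concrete response body.
    return bool(finding.get("evidence", {}).get("response_body"))

def reexploit(finding):
    # Layer 2: stand-in for an independent LLM agent re-running the
    # exploit from scratch.
    return finding.get("reproducible", False)

ASSERTIONS = {
    # Layer 3: expert-written assertion, e.g. "the response must show
    # leaked rows or a database error", encoded here as a predicate.
    "sql_injection": lambda ev: "syntax error" in ev["response_body"].lower()
                                or ev.get("rows_leaked", 0) > 0,
}

def validate(finding):
    """A finding survives only if all three layers agree."""
    if not deterministic_check(finding):
        return False
    if not reexploit(finding):
        return False
    assertion = ASSERTIONS.get(finding["type"])
    return assertion is not None and assertion(finding["evidence"])
```

The layers are ordered cheapest-first, so most false positives die before the expensive re-exploitation step runs.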
Benchmarking: How Good Is It?
The team evaluated against CVE Bench — 40 real-world critical CVEs from the National Vulnerability Database:
| Configuration | Attack Success Rate | Notes |
|---|---|---|
| With CTF instructions + grader feedback | 92.5% | Best case — agent gets exploit verification |
| Without CTF instructions or grader | 80% | More realistic — agent must self-validate |
| LLM with pre-CVE-Bench knowledge cutoff | 65% | Tests novel vulnerability discovery |
The 80% figure is the most meaningful for real-world use — no hints, no external oracle. The drop to 65% when the LLM can’t rely on parametric knowledge of known CVEs shows that the system genuinely discovers vulnerabilities, not just pattern-matches against training data.
One interesting finding: the agent sometimes demonstrates parametric knowledge of specific CVEs:
```bash
# HT Mega 2.2.0 has a known vulnerability – CVE-2023-37999
# It has an unauthenticated privilege escalation via the REST API settings endpoint
# Let's check if registration is enabled
curl -s http://target:9090/wp-login.php?action=register -I | head -10
```

This is both a strength (leveraging known vulnerability intelligence) and something to watch (could mask the system’s ability to find truly novel issues).
The Exploration vs. Exploitation Trade-off
The team explicitly calls out a fundamental challenge: balancing depth and breadth under a fixed compute budget.
- Go too deep (depth-first) on one attack vector and you burn compute without covering the full surface
- Go too broad (breadth-first) and you miss complex, multi-step vulnerabilities
Their hybrid approach — managed execution for breadth, guided exploration for depth — is a practical solution, but they acknowledge that a fully dynamic allocation strategy remains an open research question.
Another challenge: non-determinism. LLM-based agents produce different results across runs. The mitigation? Run multiple passes and consolidate findings — trading compute for consistency.
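That multi-pass consolidation step is simple to sketch: a finding counts only if enough independent runs rediscover it. The dedup key and vote threshold below are assumptions for illustration, not the published mechanism.

```python
from collections import Counter

def consolidate(runs, min_votes=2):
    """Merge findings across repeated passes: keep a finding only if it
    appears in at least min_votes runs -- trading compute for consistency."""
    votes = Counter()
    for findings in runs:
        # Deduplicate within a run so one run can't vote twice.
        for key in {(f["type"], f["endpoint"]) for f in findings}:
            votes[key] += 1
    return sorted(key for key, v in votes.items() if v >= min_votes)
```

Majority voting filters the one-off hallucinated findings that a single non-deterministic run can produce, at the cost of running the whole pipeline N times.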
What I Learned
- Specialization beats generalization in agent architectures — instead of one monolithic agent, AWS uses a swarm of purpose-built workers with specific toolkits. Each agent is an expert in its domain, and orchestration handles the coordination. This pattern applies far beyond security.
- Validation is the hardest part of agentic systems — LLMs generate confident outputs that may be wrong. The three-layer validation approach (deterministic + LLM re-validation + assertion-based) is a reusable pattern for any agentic workflow where output correctness matters.
- The managed-then-guided exploration pattern is widely applicable — start with systematic, predictable coverage (managed), then use intelligence-driven exploration to go deeper (guided). This balances thoroughness with adaptiveness, and you could apply the same pattern to code review, compliance checking, or incident response.
What’s Next
- Try AWS Security Agent in public preview on a test application to see the multi-agent system in action
- Explore how the assertion-based validation pattern could apply to other agentic workflows (code review, infrastructure compliance)
- Monitor CVE Bench and newer benchmarks to track how frontier agent performance evolves
- Investigate the exploration vs. exploitation trade-off for compute-bounded agentic systems — this is an open research question worth watching