Inside AWS Security Agent: How Multi-Agent Systems Automate Penetration Testing
A deep dive into the multi-agent architecture behind AWS Security Agent's automated penetration testing — from specialized agent swarms to assertion-based validation.
Table of Contents
- The Problem
- The Solution
- How It Works
- Phase 1: Authentication & Initial Access
- Phase 2: Baseline Scanning
- Phase 3: Multi-Phased Exploration
- Phase 4: Specialized Agent Swarm
- Phase 5: Validation & Report Generation
- Benchmarking: How Good Is It?
- The Exploration vs. Exploitation Trade-off
- What I Learned
- What’s Next
AWS just published a detailed look at how they built automated penetration testing into AWS Security Agent — and the architecture is a textbook example of multi-agent collaboration done right. Here’s a breakdown of what makes it work, why it matters, and what you can learn from it.
The Problem
Penetration testing is one of the most valuable security practices — and one of the hardest to scale. A manual pentest on a web application typically takes weeks, requires specialized human expertise, and produces results that are point-in-time snapshots. By the time the report is delivered, the codebase has already changed.
Traditional automated tools (vulnerability scanners, DAST tools) help but fall short in critical ways:
- No reasoning — they follow predefined rules, missing business logic flaws and complex attack chains
- No adaptation — they can’t adjust strategy based on what they discover mid-test
- High false positives — without contextual understanding, they flag everything that matches a pattern
- No chained attacks — they test individual vulnerabilities in isolation, missing combinations like IDOR + privilege escalation
The gap between what a skilled human pentester can find and what automated tools catch is still enormous. Closing that gap is exactly what multi-agent systems are designed for.
The Solution
AWS Security Agent uses a multi-agent architecture where specialized agents collaborate across five distinct phases: authentication, baseline scanning, multi-phased exploration, swarm execution, and validated reporting.
The key insight is the combination of breadth (systematic scanning across known vulnerability categories) and depth (adaptive, intelligence-driven exploration that reasons about what it finds). Instead of a single agent trying to do everything, specialized workers handle specific risk types while an orchestration layer manages task generation, dispatch, and validation.
How It Works
Phase 1: Authentication & Initial Access
Before testing begins, the system needs to get in. An intelligent sign-in component combines LLM-based reasoning with deterministic mechanisms to:
- Locate sign-in pages across diverse application architectures
- Attempt provided credentials
- Maintain authenticated sessions for subsequent testing phases
This isn’t just “fill in a form” — the agent adapts to different app structures automatically. Developers can optionally provide a custom sign-in prompt for tricky authentication flows.
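AWS hasn't published the sign-in component's implementation, but the "deterministic mechanisms first, LLM reasoning as fallback" split can be sketched. The sketch below shows only the deterministic half: a heuristic that locates a sign-in form by finding a password input. All names (`SignInCandidate`, `find_sign_in_form`) are hypothetical; a real system would hand the page to an LLM when this heuristic returns nothing.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SignInCandidate:
    action: str          # form submission target
    username_field: str  # name of the username/email input
    password_field: str  # name of the password input

def find_sign_in_form(html: str) -> Optional[SignInCandidate]:
    """Deterministic first pass: find a form that contains a password input.
    Returns None when the heuristic fails -- the cue to fall back to
    LLM-based reasoning (or a developer-supplied custom sign-in prompt)."""
    form = re.search(r'<form[^>]*action="([^"]+)"[^>]*>(.*?)</form>',
                     html, re.S | re.I)
    if not form:
        return None
    action, body = form.group(1), form.group(2)
    pwd = re.search(r'<input[^>]*type="password"[^>]*name="([^"]+)"', body, re.I)
    user = re.search(r'<input[^>]*type="(?:text|email)"[^>]*name="([^"]+)"', body, re.I)
    if not (pwd and user):
        return None
    return SignInCandidate(action, user.group(1), pwd.group(1))
```

The point of the split is cost: cheap regex heuristics handle the common case, and the expensive LLM call only fires on unusual application structures.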
Phase 2: Baseline Scanning
Once authenticated, parallel scanners establish initial coverage:
| Scanner | Mode | Output |
|---|---|---|
| Network scanner | Black-box | Raw traffic interactions, candidate vulnerable endpoints |
| Code scanner | White-box (when source available) | Descriptive documentation across vulnerability categories |
| Specialized scanners | Both | Multi-dimensional vulnerability identification |
These scanners run in parallel — this is the breadth-first foundation that maps the attack surface before the agents start reasoning about what to do with it.
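The fan-out-and-merge shape of this phase is straightforward to sketch. The scanner stubs below are hypothetical stand-ins (real ones would drive network traffic or parse source code); the structural point is running them concurrently and unioning their endpoint candidates into one attack-surface map.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scanner stubs standing in for the real black-box and
# white-box scanners described above.
def network_scan(target):
    return {"endpoints": [f"{target}/api/users", f"{target}/api/orders"]}

def code_scan(target):
    return {"endpoints": [f"{target}/api/users"],
            "notes": ["raw SQL in orders handler"]}

def run_baseline(target, scanners):
    """Fan the scanners out in parallel, then union their results."""
    with ThreadPoolExecutor(max_workers=len(scanners)) as pool:
        results = list(pool.map(lambda scan: scan(target), scanners))
    endpoints = sorted({e for r in results for e in r.get("endpoints", [])})
    notes = [n for r in results for n in r.get("notes", [])]
    return {"endpoints": endpoints, "notes": notes}
```

Because the scanners are independent, this phase parallelizes trivially; the merged output becomes the shared context the exploration phases reason over.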
Phase 3: Multi-Phased Exploration
This is where the architecture gets interesting. Two exploration strategies work in concert:
Managed execution runs predefined, static tasks across major risk categories — XSS, IDOR, privilege escalation, injection, and more. Think of this as the “checklist” approach: systematic, comprehensive, and predictable.
Guided exploration is the adaptive layer. It ingests everything discovered so far — endpoints, validated findings, code analysis — and reasons about application-specific attack opportunities. It operates in two stages:
- Planning — generates a contextual penetration testing plan by identifying unexplored resources and potential vulnerability chains
- Execution — programmatically manages the dynamically generated tasks, which evolve based on application responses
This managed-then-guided approach is smart. You get guaranteed coverage from the managed tasks, then targeted depth from the guided explorer. The guided explorer can spot things like: “This endpoint returns user data without checking ownership — let me test if I can chain this with the session token from endpoint X.”
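The planning/execution loop can be made concrete with a toy sketch. Everything here is an assumption about the shape of the loop, not AWS's implementation: the planner is a stub with one hard-coded chaining rule (an IDOR finding triggers a privilege-escalation follow-up), where the real system would use an LLM to generate tasks from context.

```python
def plan_tasks(context):
    """Planning stage: turn unexplored endpoints and prior findings into
    tasks. A production planner would be LLM-driven; this stub encodes
    a single chaining rule for illustration."""
    tasks = [{"endpoint": ep, "check": "ownership"}
             for ep in context["endpoints"] if ep not in context["explored"]]
    for f in context["findings"]:
        if f["type"] == "idor":  # confirmed IDOR suggests a chained follow-up
            tasks.append({"endpoint": f["endpoint"],
                          "check": "privilege_escalation"})
    return tasks

def guided_explore(context, execute, rounds=3):
    """Execution stage: run dynamically generated tasks, feeding each
    round's findings back into the next round's plan."""
    done = set()
    for _ in range(rounds):
        tasks = [t for t in plan_tasks(context)
                 if (t["endpoint"], t["check"]) not in done]
        if not tasks:
            break
        for task in tasks:
            done.add((task["endpoint"], task["check"]))
            context["explored"].add(task["endpoint"])
            finding = execute(task)
            if finding:
                context["findings"].append(finding)
    return context["findings"]
```

The feedback edge is the important part: round two's plan is computed from round one's findings, which is what lets the explorer chain an IDOR into an escalation rather than testing each in isolation.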
Phase 4: Specialized Agent Swarm
Both exploration approaches dispatch work to swarm worker agents — each configured for a specific risk type. Every worker comes equipped with a full penetration testing toolkit:
- Code executors — run exploit code against targets
- Web fuzzers — test input handling with malformed data
- NVD search — query the National Vulnerability Database for known CVEs
- Vulnerability-specific tools — tailored instruments per risk category
Workers operate with timeout management and produce structured reports. This is the “embarrassingly parallel” part of the architecture — dozens of specialized agents hammering away at their specific domains simultaneously.
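A minimal sketch of that dispatch layer, under assumed interfaces (a `workers` registry keyed by risk type, dict-shaped tasks and reports): each task goes to the worker built for its risk type, with a per-task timeout and a structured report either way.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def dispatch_swarm(tasks, workers, timeout_s=5.0):
    """Send each task to the worker registered for its risk type,
    enforcing a per-task timeout and collecting structured reports.
    Note: timed-out worker threads cannot be force-killed; a real
    system would run workers in separate processes or sandboxes."""
    reports = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(workers[t["risk"]], t): t
                   for t in tasks if t["risk"] in workers}
        for fut, task in futures.items():
            try:
                reports.append({"task": task, "status": "done",
                                "result": fut.result(timeout=timeout_s)})
            except FutureTimeout:
                reports.append({"task": task, "status": "timeout",
                                "result": None})
            except Exception as exc:
                reports.append({"task": task, "status": "error",
                                "result": str(exc)})
    return reports
```

The structured report per task (including timeouts and errors) is what makes the downstream validation phase possible: nothing disappears silently.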
Phase 5: Validation & Report Generation
Here’s what separates this from a glorified scanner: rigorous, multi-layered validation.
LLM agents can produce plausible-sounding findings that aren’t actually exploitable. The AWS team addresses this with:
- Deterministic validators — programmatic checks that verify exploit prerequisites
- LLM-based validators — specialized agents that attempt independent re-exploitation
- Assertion-based validation — natural language assertions written by security experts that encode deep knowledge about real attack behaviors
The assertion-based approach is clever. Instead of narrow regex checks that are easy to game, these assertions require explicit, structured proof of exploitation. They’re harder to satisfy accidentally and encode the kind of judgment a human reviewer would apply.
Validated findings then go through CVSS scoring for severity assessment, combining the agent’s actual exploit evidence with standardized vulnerability metrics.
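The three-layer gate composes naturally as a short-circuiting pipeline. The sketch below is illustrative only: `reexploit` stands in for an independent LLM agent, and the assertion table maps an expert-written natural-language assertion to a predicate demanding structured proof (here, leaked rows or a database error for SQL injection).

```python
def deterministic_check(finding):
    # Layer 1: programmatic prerequisite -- exploit evidence must
    # include a concrete response body.
    return bool(finding.get("evidence", {}).get("response_body"))

def reexploit(finding):
    # Layer 2: stand-in for an independent LLM agent re-running the
    # exploit from scratch.
    return finding.get("reproducible", False)

ASSERTIONS = {
    # Layer 3: expert-written assertion, e.g. "the response must show
    # leaked rows or a database error", encoded here as a predicate.
    "sql_injection": lambda ev: "syntax error" in ev["response_body"].lower()
                                or ev.get("rows_leaked", 0) > 0,
}

def validate(finding):
    """A finding survives only if all three layers agree."""
    if not deterministic_check(finding):
        return False
    if not reexploit(finding):
        return False
    assertion = ASSERTIONS.get(finding["type"])
    return assertion is not None and assertion(finding["evidence"])
```

The layers are ordered cheapest-first, so most false positives die before the expensive re-exploitation step runs.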
Benchmarking: How Good Is It?
The team evaluated against CVE Bench — 40 real-world critical CVEs from the National Vulnerability Database:
| Configuration | Attack Success Rate | Notes |
|---|---|---|
| With CTF instructions + grader feedback | 92.5% | Best case — agent gets exploit verification |
| Without CTF instructions or grader | 80% | More realistic — agent must self-validate |
| LLM with pre-CVE-Bench knowledge cutoff | 65% | Tests novel vulnerability discovery |
The 80% figure is the most meaningful for real-world use — no hints, no external oracle. The drop to 65% when the LLM can’t rely on parametric knowledge of known CVEs shows that the system genuinely discovers vulnerabilities, not just pattern-matches against training data.
One interesting finding: the agent sometimes demonstrates parametric knowledge of specific CVEs:
```bash
# HT Mega 2.2.0 has a known vulnerability – CVE-2023-37999
# It has an unauthenticated privilege escalation via the REST API settings endpoint
# Let's check if registration is enabled
curl -s http://target:9090/wp-login.php?action=register -I | head -10
```

This is both a strength (leveraging known vulnerability intelligence) and something to watch (could mask the system’s ability to find truly novel issues).
The Exploration vs. Exploitation Trade-off
The team explicitly calls out a fundamental challenge: balancing depth and breadth under a fixed compute budget.
- Go too deep (depth-first) on one attack vector and you burn compute without covering the full surface
- Go too broad (breadth-first) and you miss complex, multi-step vulnerabilities
Their hybrid approach — managed execution for breadth, guided exploration for depth — is a practical solution, but they acknowledge that a fully dynamic allocation strategy remains an open research question.
Another challenge: non-determinism. LLM-based agents produce different results across runs. The mitigation? Run multiple passes and consolidate findings — trading compute for consistency.
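That multi-pass consolidation step is simple to sketch: a finding counts only if enough independent runs rediscover it. The dedup key and vote threshold below are assumptions for illustration, not the published mechanism.

```python
from collections import Counter

def consolidate(runs, min_votes=2):
    """Merge findings across repeated passes: keep a finding only if it
    appears in at least min_votes runs -- trading compute for consistency."""
    votes = Counter()
    for findings in runs:
        # Deduplicate within a run so one run can't vote twice.
        for key in {(f["type"], f["endpoint"]) for f in findings}:
            votes[key] += 1
    return sorted(key for key, v in votes.items() if v >= min_votes)
```

Majority voting filters the one-off hallucinated findings that a single non-deterministic run can produce, at the cost of running the whole pipeline N times.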
What I Learned
- Specialization beats generalization in agent architectures — instead of one monolithic agent, AWS uses a swarm of purpose-built workers with specific toolkits. Each agent is an expert in its domain, and orchestration handles the coordination. This pattern applies far beyond security.
- Validation is the hardest part of agentic systems — LLMs generate confident outputs that may be wrong. The three-layer validation approach (deterministic + LLM re-validation + assertion-based) is a reusable pattern for any agentic workflow where output correctness matters.
- The managed-then-guided exploration pattern is widely applicable — start with systematic, predictable coverage (managed), then use intelligence-driven exploration to go deeper (guided). This balances thoroughness with adaptiveness, and you could apply the same pattern to code review, compliance checking, or incident response.
What’s Next
- Try AWS Security Agent in public preview on a test application to see the multi-agent system in action
- Explore how the assertion-based validation pattern could apply to other agentic workflows (code review, infrastructure compliance)
- Monitor CVE Bench and newer benchmarks to track how frontier agent performance evolves
- Investigate the exploration vs. exploitation trade-off for compute-bounded agentic systems — this is an open research question worth watching