Inside AWS Security Agent: How Multi-Agent Systems Automate Penetration Testing

A deep dive into the multi-agent architecture behind AWS Security Agent's automated penetration testing — from specialized agent swarms to assertion-based validation.

Alexandre Agius

AWS Solutions Architect

AWS just published a detailed look at how they built automated penetration testing into AWS Security Agent — and the architecture is a textbook example of multi-agent collaboration done right. Here’s a breakdown of what makes it work, why it matters, and what you can learn from it.

The Problem

Penetration testing is one of the most valuable security practices — and one of the hardest to scale. A manual pentest on a web application typically takes weeks, requires specialized human expertise, and produces results that are point-in-time snapshots. By the time the report is delivered, the codebase has already changed.

Traditional automated tools (vulnerability scanners, DAST tools) help but fall short in critical ways:

  • No reasoning — they follow predefined rules, missing business logic flaws and complex attack chains
  • No adaptation — they can’t adjust strategy based on what they discover mid-test
  • High false positives — without contextual understanding, they flag everything that matches a pattern
  • No chained attacks — they test individual vulnerabilities in isolation, missing combinations like IDOR + privilege escalation

The gap between what a skilled human pentester can find and what automated tools catch is still enormous. Closing that gap is exactly what multi-agent systems are designed for.

The Solution

AWS Security Agent uses a multi-agent architecture where specialized agents collaborate across five distinct phases: authentication, baseline scanning, multi-phased exploration, swarm execution, and validated reporting.

The key insight is the combination of breadth (systematic scanning across known vulnerability categories) and depth (adaptive, intelligence-driven exploration that reasons about what it finds). Instead of a single agent trying to do everything, specialized workers handle specific risk types while an orchestration layer manages task generation, dispatch, and validation.

[Figure: AWS Security Agent — Multi-Agent Architecture]

How It Works

Phase 1: Authentication & Initial Access

Before testing begins, the system needs to get in. An intelligent sign-in component combines LLM-based reasoning with deterministic mechanisms to:

  • Locate sign-in pages across diverse application architectures
  • Attempt provided credentials
  • Maintain authenticated sessions for subsequent testing phases

This isn’t just “fill in a form” — the agent adapts to different app structures automatically. Developers can optionally provide a custom sign-in prompt for tricky authentication flows.
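To make the hybrid idea concrete, here is a minimal sketch of a sign-in component that tries a deterministic heuristic first and only falls back to LLM reasoning for atypical page structures. All function names, the regexes, and the fallback hook are illustrative assumptions, not AWS's actual implementation.

```python
# Hypothetical hybrid sign-in flow: deterministic detection first, LLM fallback second.
import re
from typing import Optional

def find_login_form(html: str) -> Optional[dict]:
    """Deterministic pass: require a password input, then look for a username/email field."""
    if not re.search(r'<input[^>]+type=["\']password["\']', html, re.I):
        return None
    user_field = re.search(r'<input[^>]+name=["\'](user(?:name)?|email)["\']', html, re.I)
    return {"user_field": user_field.group(1) if user_field else None,
            "password_field": "password"}

def locate_sign_in(html: str, llm_fallback=None) -> Optional[dict]:
    form = find_login_form(html)
    if form is not None:
        return form                  # deterministic mechanism succeeded
    if llm_fallback is not None:
        return llm_fallback(html)    # LLM reasons about unusual app structures
    return None

page = '<form><input name="email"><input type="password" name="pw"></form>'
print(locate_sign_in(page))
```

The custom sign-in prompt mentioned above would feed into the `llm_fallback` path, steering the reasoning step for non-standard flows.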

Phase 2: Baseline Scanning

Once authenticated, parallel scanners establish initial coverage:

| Scanner | Mode | Output |
| --- | --- | --- |
| Network scanner | Black-box | Raw traffic interactions, candidate vulnerable endpoints |
| Code scanner | White-box (when source available) | Descriptive documentation across vulnerability categories |
| Specialized scanners | Both | Multi-dimensional vulnerability identification |

These scanners run in parallel — this is the breadth-first foundation that maps the attack surface before the agents start reasoning about what to do with it.
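A fan-out like this is straightforward to sketch. The scanner functions and their return shapes below are invented stand-ins; the point is the pattern of running independent scanners concurrently and merging their candidate endpoints into a single attack-surface map.

```python
# Illustrative parallel baseline scan: each scanner runs concurrently,
# and their candidate endpoints are merged into one attack-surface set.
from concurrent.futures import ThreadPoolExecutor

def network_scan(target):
    # Black-box: probe the running app for candidate endpoints (toy result)
    return {"scanner": "network", "candidates": [f"{target}/api/users", f"{target}/admin"]}

def code_scan(target):
    # White-box: static analysis when source is available (toy result)
    return {"scanner": "code", "candidates": [f"{target}/api/export"]}

def run_baseline(target):
    scanners = [network_scan, code_scan]
    with ThreadPoolExecutor(max_workers=len(scanners)) as pool:
        results = list(pool.map(lambda scan: scan(target), scanners))
    # Deduplicate and merge every scanner's candidates
    return sorted({ep for r in results for ep in r["candidates"]})

print(run_baseline("http://testapp.local"))
```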

Phase 3: Multi-Phased Exploration

This is where the architecture gets interesting. Two exploration strategies work in concert:

Managed execution runs predefined, static tasks across major risk categories — XSS, IDOR, privilege escalation, injection, and more. Think of this as the “checklist” approach: systematic, comprehensive, and predictable.

Guided exploration is the adaptive layer. It ingests everything discovered so far — endpoints, validated findings, code analysis — and reasons about application-specific attack opportunities. It operates in two stages:

  1. Planning — generates a contextual penetration testing plan by identifying unexplored resources and potential vulnerability chains
  2. Execution — programmatically manages the dynamically generated tasks, which evolve based on application responses

This managed-then-guided approach is smart. You get guaranteed coverage from the managed tasks, then targeted depth from the guided explorer. The guided explorer can spot things like: “This endpoint returns user data without checking ownership — let me test if I can chain this with the session token from endpoint X.”
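The managed-then-guided loop can be sketched as a task queue: a fixed checklist runs first, then a planner inspects the findings and enqueues follow-up tasks such as attack chains. The task names and the planner heuristic below are invented; in the real system the planning stage is LLM-driven.

```python
# Minimal sketch of managed-then-guided exploration over a task queue.
from collections import deque

MANAGED_TASKS = ["test_xss", "test_idor", "test_privilege_escalation", "test_injection"]

def guided_planner(findings):
    """Generate follow-up tasks from earlier findings (stand-in for LLM planning)."""
    tasks = []
    for f in findings:
        if f["task"] == "test_idor" and f["vulnerable"]:
            # Hypothesize an attack chain worth testing in depth
            tasks.append("chain_idor_with_session_token")
    return tasks

def explore(execute):
    queue, findings = deque(MANAGED_TASKS), []
    while queue:                                  # managed stage: fixed checklist
        findings.append(execute(queue.popleft()))
    queue.extend(guided_planner(findings))        # guided stage: plan from context
    while queue:                                  # execute dynamically generated tasks
        findings.append(execute(queue.popleft()))
    return findings

# Toy executor in which only the IDOR check "succeeds"
results = explore(lambda t: {"task": t, "vulnerable": t == "test_idor"})
print([f["task"] for f in results])
```

In the real system the guided stage would also re-plan as new responses arrive, rather than planning exactly once as this sketch does.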

Phase 4: Specialized Agent Swarm

Both exploration approaches dispatch work to swarm worker agents — each configured for a specific risk type. Every worker comes equipped with a full penetration testing toolkit:

  • Code executors — run exploit code against targets
  • Web fuzzers — test input handling with malformed data
  • NVD search — query the National Vulnerability Database for known CVEs
  • Vulnerability-specific tools — tailored instruments per risk category

Workers operate with timeout management and produce structured reports. This is the “embarrassingly parallel” part of the architecture — dozens of specialized agents hammering away at their specific domains simultaneously.
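A swarm worker reduces to a small contract: run a risk-specific probe under a timeout and always emit a structured report. The sketch below shows that contract with an invented report shape; the actual tools (code executors, fuzzers, NVD lookups) would sit behind the `probe` callable.

```python
# Sketch of a risk-specific swarm worker with timeout management
# and a structured report, using a single-thread executor per probe.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def make_worker(risk_type, probe):
    def worker(target, timeout_s=5.0):
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(probe, target)
            try:
                evidence = future.result(timeout=timeout_s)
                status = "completed"
            except FuturesTimeout:
                evidence, status = None, "timed_out"
        # Structured report, whether or not the probe finished
        return {"risk_type": risk_type, "target": target,
                "status": status, "evidence": evidence}
    return worker

xss_worker = make_worker("xss", lambda t: f"reflected payload at {t}/search")
print(xss_worker("http://testapp.local"))
```

Because each worker is independent and self-reporting, an orchestrator can dispatch dozens of them concurrently, which is what makes this phase embarrassingly parallel.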

Phase 5: Validation & Report Generation

Here’s what separates this from a glorified scanner: rigorous, multi-layered validation.

LLM agents can produce plausible-sounding findings that aren’t actually exploitable. The AWS team addresses this with:

  1. Deterministic validators — programmatic checks that verify exploit prerequisites
  2. LLM-based validators — specialized agents that attempt independent re-exploitation
  3. Assertion-based validation — natural language assertions written by security experts that encode deep knowledge about real attack behaviors

The assertion-based approach is clever. Instead of narrow regex checks that are easy to game, these assertions require explicit, structured proof of exploitation. They’re harder to satisfy accidentally and encode the kind of judgment a human reviewer would apply.

Validated findings then go through CVSS scoring for severity assessment, combining the agent’s actual exploit evidence with standardized vulnerability metrics.
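The three layers compose naturally as a pipeline: a finding survives only if every layer passes. The checks below are toy stand-ins (the real LLM validator and expert-written assertions are far richer), but the structure, deterministic gate first, independent re-exploitation second, proof-of-exploitation assertion last, mirrors the description above.

```python
# Sketch of layered validation: a finding must pass all three checks.
def deterministic_check(finding):
    # Programmatic prerequisite check (e.g. the claimed endpoint actually exists)
    return finding.get("endpoint") is not None

def revalidate(finding, re_exploit):
    # Independent re-exploitation attempt (stand-in for an LLM validator agent)
    return re_exploit(finding["endpoint"]) == finding["evidence"]

def assertion_check(finding):
    # Assertion encoding expert judgment: evidence must show data crossing a
    # privilege boundary, not merely a pattern match in a response body
    return "other_user_record" in finding.get("evidence", "")

def validate(finding, re_exploit):
    return (deterministic_check(finding)
            and revalidate(finding, re_exploit)
            and assertion_check(finding))

finding = {"endpoint": "/api/users/2", "evidence": "other_user_record:alice"}
print(validate(finding, lambda ep: "other_user_record:alice"))
```

Only findings that clear all three layers would proceed to the CVSS scoring step.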

Benchmarking: How Good Is It?

The team evaluated against CVE Bench — 40 real-world critical CVEs from the National Vulnerability Database:

| Configuration | Attack Success Rate | Notes |
| --- | --- | --- |
| With CTF instructions + grader feedback | 92.5% | Best case — agent gets exploit verification |
| Without CTF instructions or grader | 80% | More realistic — agent must self-validate |
| LLM with pre-CVE-Bench knowledge cutoff | 65% | Tests novel vulnerability discovery |

The 80% figure is the most meaningful for real-world use — no hints, no external oracle. The drop to 65% when the LLM can’t rely on parametric knowledge of known CVEs shows that the system genuinely discovers vulnerabilities, not just pattern-matches against training data.

One interesting finding: the agent sometimes demonstrates parametric knowledge of specific CVEs:

```bash
# HT Mega 2.2.0 has a known vulnerability – CVE-2023-37999
# It has an unauthenticated privilege escalation via the REST API settings endpoint
# Let's check if registration is enabled
curl -s http://target:9090/wp-login.php?action=register -I | head -10
```

This is both a strength (leveraging known vulnerability intelligence) and something to watch (could mask the system’s ability to find truly novel issues).

The Exploration vs. Exploitation Trade-off

The team explicitly calls out a fundamental challenge: balancing depth and breadth under a fixed compute budget.

  • Go too deep (depth-first) on one attack vector and you burn compute without covering the full surface
  • Go too broad (breadth-first) and you miss complex, multi-step vulnerabilities

Their hybrid approach — managed execution for breadth, guided exploration for depth — is a practical solution, but they acknowledge that a fully dynamic allocation strategy remains an open research question.

Another challenge: non-determinism. LLM-based agents produce different results across runs. The mitigation? Run multiple passes and consolidate findings — trading compute for consistency.
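Consolidation across passes can be as simple as majority voting on a stable finding key. The sketch below keeps a finding only if it appears in at least `min_votes` independent runs; the key shape and threshold are illustrative assumptions.

```python
# Sketch of multi-pass consolidation: vote on (endpoint, vuln) keys across runs
# and keep findings that recur, trading compute for run-to-run consistency.
from collections import Counter

def consolidate(runs, min_votes=2):
    """Keep findings reported in at least min_votes independent passes."""
    votes = Counter((f["endpoint"], f["vuln"]) for run in runs for f in run)
    return sorted(key for key, n in votes.items() if n >= min_votes)

runs = [
    [{"endpoint": "/api/users/2", "vuln": "idor"}, {"endpoint": "/search", "vuln": "xss"}],
    [{"endpoint": "/api/users/2", "vuln": "idor"}],
    [{"endpoint": "/api/users/2", "vuln": "idor"}, {"endpoint": "/search", "vuln": "xss"}],
]
print(consolidate(runs))
```

A one-off finding that appears in a single pass is dropped as likely noise, while anything reproduced across passes survives, which is exactly the consistency-for-compute trade described above.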

What I Learned

  • Specialization beats generalization in agent architectures — instead of one monolithic agent, AWS uses a swarm of purpose-built workers with specific toolkits. Each agent is an expert in its domain, and orchestration handles the coordination. This pattern applies far beyond security.
  • Validation is the hardest part of agentic systems — LLMs generate confident outputs that may be wrong. The three-layer validation approach (deterministic + LLM re-validation + assertion-based) is a reusable pattern for any agentic workflow where output correctness matters.
  • The managed-then-guided exploration pattern is widely applicable — start with systematic, predictable coverage (managed), then use intelligence-driven exploration to go deeper (guided). This balances thoroughness with adaptiveness, and you could apply the same pattern to code review, compliance checking, or incident response.

What’s Next

  • Try AWS Security Agent in public preview on a test application to see the multi-agent system in action
  • Explore how the assertion-based validation pattern could apply to other agentic workflows (code review, infrastructure compliance)
  • Monitor CVE Bench and newer benchmarks to track how frontier agent performance evolves
  • Investigate the exploration vs. exploitation trade-off for compute-bounded agentic systems — this is an open research question worth watching