AI Engineering

AWS Agent Toolkit GA: How I Gave an Agent 15,000 AWS APIs Without Losing Sleep

AWS released the Agent Toolkit for AWS on May 6, 2026 -- a managed MCP server exposing the full AWS API surface to autonomous agents. I shipped an infrastructure agent the same week. Here's the two-phase safety pattern that lets you hand an agent the keys to your account without waking up to a $10K bill.

14 May 2026 · 9 MIN READ

Alexandre Agius

AWS SOLUTIONS ARCHITECT

AI Agents AWS AgentCore Strands Security Architecture Cost Control

Part of: Cloud Security Part of: Agentic AI

CONTENTS

The Problem: 80/20 Coverage Gap
What the Agent Toolkit Actually Exposes
The Iron Rule
Phase 1: The Read-Only Advisor
Safety Gates: Six Layers Deep
Layer 1: SSM Kill Switch
Layer 2: Production Write-Intent Detection
Layer 3: MCP Tool Allowlist
Layer 4: Daily Invocation Cap
Layer 5: Per-Invocation Cost Cap
Layer 6: Retry Cap
The Test That Matters Most
What Phase 2 Looks Like (Not Shipped Yet)
The Orchestrator Routing Discovery
Results After One Week
The Honest Recommendation

On May 6, 2026, AWS quietly shipped one of the most dangerous tools in the agentic AI ecosystem.

The Agent Toolkit for AWS went GA. A managed MCP server exposing 11 tools, including aws___call_aws — a single function that can invoke any of the 15,000+ AWS API actions across every service. Create an EC2 fleet. Delete a production database. Modify IAM policies. All available to any agent that can authenticate via SigV4.

I had it wired into a new agent within 72 hours. And I shipped it to production the same week — safely.

This is the pattern I used: a two-phase approach where Phase 1 is strictly read-only and advisory, gated behind code-enforced safety breakers that exist before the agent ever loads. If you’ve read my previous post on runaway agents, you know why I’m paranoid about this. $900 in burned tokens taught me that prompt-only safeguards don’t survive contact with reality.

The Problem: 80/20 Coverage Gap

Boulder — my personal AI build factory running on Bedrock AgentCore — builds web applications and deploys them via AWS Amplify Gen 2. Amplify handles the backend cleanly: Cognito, AppSync, Lambda, DynamoDB, S3. For the 80% of app patterns that fit this mold, it works beautifully.

The remaining 20% sits outside Amplify’s abstraction. Apps that need VPC + ECS + RDS. ML pipelines with SageMaker and Bedrock Knowledge Bases. Data pipelines with Glue and Kinesis. Multi-region architectures. Pure infrastructure provisioning.

Before the Agent Toolkit, Boulder had no path for these. The agent would correctly identify that a user’s request needed infrastructure beyond Amplify, and then… stop. Return a polite “this is outside my scope” message.

The Agent Toolkit changes that equation entirely.

What the Agent Toolkit Actually Exposes

The toolkit ships as a managed MCP server with a Python client (mcp-proxy-for-aws). You wire it into a Strands agent via MCPClient(factory) + Agent(tools=mcp_tools). Authentication is SigV4 using the caller’s IAM identity — in my case, the AgentCore runtime role.

The 11 tools break into two categories that matter for safety:

Read-only (safe to expose):

aws___search_documentation — search AWS docs
aws___read_documentation — read specific doc pages
aws___retrieve_skill — get pre-built patterns
aws___recommend — architecture recommendations
aws___list_regions — list available regions
aws___get_regional_availability — service availability checks
aws___suggest_aws_commands — CLI command suggestions

Write-capable (dangerous):

aws___call_aws — invoke any AWS API (15,000+ actions)
aws___run_script — execute arbitrary Python in a sandbox
aws___get_presigned_url — generate signed URLs
aws___get_tasks — manage async tasks

The distinction is binary. The read-only tools can’t break anything. The write-capable tools can delete your production database in a single call.

The Iron Rule

After the runaway agent incident, I codified a principle into an architecture decision record: every new autonomous agent MUST have code-enforced breakers BEFORE production enablement. No exceptions. No “we’ll add safety later.” No “the prompt says not to.”

Prompts are suggestions. Code is law.

This means safety gates live in Python — not in system prompts, not in model instructions, not in “guidelines.” They execute before the agent even loads. If the gate says no, the LLM never sees the tools.
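A minimal sketch of that ordering, assuming each gate is a plain function that returns a refusal payload or None. The names `run_gates`, `handle`, and `build_agent` are hypothetical illustrations, not Boulder's actual code.

```python
from typing import Callable, List, Optional

# Hypothetical gate signature: return a refusal payload, or None to pass.
Gate = Callable[[str], Optional[dict]]


def run_gates(prompt: str, gates: List[Gate]) -> Optional[dict]:
    """Return the first refusal, or None if every gate passes."""
    for gate in gates:
        refusal = gate(prompt)
        if refusal is not None:
            return refusal
    return None


def handle(prompt: str, gates: List[Gate],
           build_agent: Callable[[], Callable[[str], dict]]) -> dict:
    """Run every gate first; construct the agent only if all of them pass."""
    refusal = run_gates(prompt, gates)
    if refusal is not None:
        return refusal  # the agent and its tools were never constructed
    agent = build_agent()
    return agent(prompt)
```

The point of passing `build_agent` as a factory is that a refused request never pays for, or even touches, agent construction.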

Phase 1: The Read-Only Advisor

The key insight is that most of the value comes from the advisory step, not the execution step. When someone says “I need a Kinesis data pipeline with Glue ETL into Redshift,” what they actually need first is:

A validated architecture pattern
Regional service availability confirmation
A CDK stack they can review
A cost estimate
A next-step checklist

None of that requires write access. All of it can be generated using only the read-only tools plus the LLM’s code generation capability.

So Phase 1 is purely advisory. The agent:

Accepts infrastructure prompts
Searches AWS docs and patterns via MCP
Checks regional availability
Generates a complete CDK stack as text
Returns the architecture + code + cost estimate + checklist
Hands it to the human for review and deployment

The human runs cdk deploy. Not the agent. Not yet.

Safety Gates: Six Layers Deep

Here’s what sits between the user’s prompt and the agent’s execution, enforced in code:

Layer 1: SSM Kill Switch

# Checked in main.py BEFORE the agent module loads
try:
    kill_switch = ssm.get_parameter(Name="/boulder/aws-builder/enabled")["Parameter"]["Value"]
except Exception:
    kill_switch = "false"  # fail closed: treat any SSM error as "disabled"
if kill_switch == "false":
    return {"status": "disabled", "message": "aws-builder is disabled"}

Independent from the global autopilot kill switch. I can kill this agent without touching anything else. Fail-closed on SSM errors: if SSM is down, the agent refuses to load rather than running ungated.
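As a standalone sketch of the refuse-on-error behavior, the check could be wrapped in a helper like the one below. It assumes a boto3 SSM client (whose `get_parameter` takes `Name` and returns the value under `["Parameter"]["Value"]`). The helper name `agent_enabled` and the stricter rule that anything other than "true" counts as disabled are my assumptions.

```python
def agent_enabled(ssm_client, name: str = "/boulder/aws-builder/enabled") -> bool:
    """Return True only when the parameter is readable and explicitly 'true'."""
    try:
        response = ssm_client.get_parameter(Name=name)
        value = response["Parameter"]["Value"]
    except Exception:
        # A throttled, missing, or unreachable parameter disables the agent.
        return False
    return value.strip().lower() == "true"
```

Because the client is passed in, the gate can be exercised in tests with a stub that raises, which is the case that matters most.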

Layer 2: Production Write-Intent Detection

if os.environ.get("BOULDER_ENV") == "prod":
    write_patterns = re.compile(r"deploy|create|delete|terminate|modify", re.I)
    if write_patterns.search(user_prompt):
        return advisory_only_response(user_prompt)

In production, if the prompt contains any write-intent keyword, the agent switches to advisory-only mode. It generates the CDK code, explains what it would do, and explicitly states “this is a draft for human review.” Hardcoded. Not a prompt instruction — a code branch.
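To see what this pattern does and does not catch, here is a small self-contained check using the same keywords. The function name `has_write_intent` and the sample prompts are illustrative assumptions.

```python
import re

# Same keywords as the gate above; no word boundaries, so substrings match too.
WRITE_PATTERNS = re.compile(r"deploy|create|delete|terminate|modify", re.I)


def has_write_intent(prompt: str) -> bool:
    """True if any write-intent keyword appears anywhere in the prompt."""
    return WRITE_PATTERNS.search(prompt) is not None


examples = [
    "Delete the staging bucket",                # obvious write intent
    "What does a blue/green deployment cost?",  # "deploy" matches inside "deployment"
    "Which regions support Kinesis?",           # no keyword
    "Spin up an RDS instance",                  # write intent phrased without a keyword
]
for text in examples:
    print(has_write_intent(text), text)
```

Over-matching is harmless here because the fallback is an advisory answer. Under-matching (the last prompt) is why a keyword gate can only be one layer and not the one that removes capability.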

Layer 3: MCP Tool Allowlist

This is the most important gate. The MCP client returns all 11 tools from the server. My wrapper filters them before they reach the Strands Agent():

ALLOWED_TOOLS_PHASE_1 = {
    "aws___search_documentation",
    "aws___read_documentation",
    "aws___retrieve_skill",
    "aws___recommend",
    "aws___list_regions",
    "aws___get_regional_availability",
    "aws___suggest_aws_commands",
}

def filter_tools(mcp_tools):
    return [t for t in mcp_tools if t.name in ALLOWED_TOOLS_PHASE_1]

aws___call_aws never reaches the agent. aws___run_script never reaches the agent. The LLM cannot invoke what it cannot see. This is the ulimit equivalent — the process doesn’t have the capability, period.

Layer 4: Daily Invocation Cap

Ten invocations per app per day, enforced atomically via a DynamoDB ADD operation on a daily-scoped counter. Even if someone finds a way to spam the agent, it stops after 10 calls.
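A minimal sketch of such a counter, assuming a boto3 DynamoDB `Table` resource is passed in. The key schema (`pk` of app id plus date), the attribute name `invocations`, and the function name are hypothetical.

```python
import datetime
from typing import Optional


def allow_invocation(table, app_id: str, cap: int = 10,
                     today: Optional[str] = None) -> bool:
    """Atomically bump today's counter and report whether this call is within the cap."""
    day = today or datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%d")
    response = table.update_item(
        Key={"pk": f"{app_id}#{day}"},            # hypothetical key schema
        UpdateExpression="ADD invocations :one",  # creates the attribute if missing
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    # The returned value is the post-increment count for this app and day.
    return int(response["Attributes"]["invocations"]) <= cap
```

Incrementing first and comparing afterwards means two concurrent callers can never both observe the same count. Rejected calls still bump the counter, which is harmless for a key that is scoped to a single day.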

Layer 5: Per-Invocation Cost Cap

$3 USD maximum token spend per single invocation. The Strands conversation soft-aborts if the running cost exceeds this threshold. This prevents the specific failure mode from my runaway incident where a single agent session burned $900 over 12 hours.
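One way to picture the running-cost check is a small meter that accumulates spend per model call and raises once the cap is passed. This is a sketch under assumptions: the class names are hypothetical, the per-1K-token prices are parameters rather than real rates, and how token counts are read out of the Strands conversation is not shown.

```python
class CostCapExceeded(Exception):
    """Raised when an invocation's running token cost passes its cap."""


class CostMeter:
    def __init__(self, cap_usd: float, input_price_per_1k: float,
                 output_price_per_1k: float):
        self.cap_usd = cap_usd
        self.input_price_per_1k = input_price_per_1k
        self.output_price_per_1k = output_price_per_1k
        self.spent_usd = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one model call's cost; raise once the running total exceeds the cap."""
        self.spent_usd += (
            input_tokens / 1000 * self.input_price_per_1k
            + output_tokens / 1000 * self.output_price_per_1k
        )
        if self.spent_usd > self.cap_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f} cap"
            )
        return self.spent_usd
```

The caller would catch `CostCapExceeded` and end the session with whatever partial result exists, so the loop cannot keep spending unattended.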

Layer 6: Retry Cap

Two attempts per task. Not three (which is what my autopilot uses). Infrastructure questions should either work on the first or second try, or they need a human. An agent that retries an infra question five times is an agent that’s hallucinating patterns.
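The cap itself is a few lines of code. In this sketch the function name and the shape of the "needs_human" payload are hypothetical.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 2  # the cap described above


def run_with_retry_cap(task: Callable[[], dict],
                       max_attempts: int = MAX_ATTEMPTS) -> dict:
    """Try the task up to max_attempts times, then hand off to a human."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            last_error = exc
    return {
        "status": "needs_human",
        "attempts": max_attempts,
        "error": str(last_error),
    }
```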

The Test That Matters Most

Unit tests cover each gate in isolation. But the test that gives me confidence is the integration test for the tool allowlist:

def test_mcp_tool_allowlist():
    """Only read-only tools reach Agent(). Write tools are filtered."""
    all_tools = mock_mcp_list_tools()  # Returns all 11 tools
    filtered = filter_tools(all_tools)
    
    allowed_names = {t.name for t in filtered}
    assert "aws___call_aws" not in allowed_names
    assert "aws___run_script" not in allowed_names
    assert "aws___get_presigned_url" not in allowed_names
    assert len(filtered) == 7  # Exactly the read-only set

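This is an explicit allowlist with a count assertion, not a denylist. If AWS adds a new tool to the toolkit, the allowlist drops it by default, where a denylist would let it through silently. The count assertion fails loudly if the allowlist is ever widened or if one of the seven read-only tools is renamed, provided the mock mirrors the server's real tool list.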
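To make a newly added server tool visible the moment it appears, one complementary check is to classify every name the server reports. This is a sketch, not part of the suite above: `KNOWN_BLOCKED` and `unclassified_tools` are hypothetical names, and the tool names are the eleven listed earlier in this post.

```python
ALLOWED = {
    "aws___search_documentation", "aws___read_documentation",
    "aws___retrieve_skill", "aws___recommend", "aws___list_regions",
    "aws___get_regional_availability", "aws___suggest_aws_commands",
}
KNOWN_BLOCKED = {
    "aws___call_aws", "aws___run_script",
    "aws___get_presigned_url", "aws___get_tasks",
}


def unclassified_tools(server_tool_names) -> set:
    """Names the server reports that are in neither the allowed nor the blocked set."""
    return set(server_tool_names) - ALLOWED - KNOWN_BLOCKED
```

Asserting that `unclassified_tools(...)` is empty against the live tool list turns "AWS shipped a twelfth tool" into a failing check that asks for a human decision, while the allowlist keeps the tool away from the agent in the meantime.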

What Phase 2 Looks Like (Not Shipped Yet)

Phase 2 adds actual deployment capability, gated behind additional safety:

Mandatory cdk diff review gate — human approval or per-app pre-approval policy
Hard dollar cost cap per stack — abort if estimated cost exceeds threshold
Distributed lock in DynamoDB per account + region — prevent concurrent deployments
Automatic rollback on any CREATE_FAILED CloudFormation status
Independent ADR with its own review cycle

The explicit rule: Phase 2 does NOT ship until Phase 1 has run for at least one week in production with zero safety-breach incidents AND an independent review.

This is the chmod 444 to chmod 755 progression. You start read-only. You observe. You build confidence. Then you grant write access with additional gates.

The Orchestrator Routing Discovery

An interesting finding from shipping this: when I wired the new agent into Boulder’s orchestrator, the LLM Planner consistently preferred the existing builder agent for infrastructure questions, even with explicit routing rules.

Root cause: the builder agent already owns tools like deploy_stack and search_github_stacks. The Planner has strong priors toward it for anything containing “build” or “infrastructure.” My routing rule was one of ten in the system prompt — not prominent enough to override learned associations.

The fix: don’t fight it. The aws-builder agent is invoked via direct dispatch, not orchestrator routing. It’s complementary to the existing builder, filling gaps where no pre-built stack template matches the user’s need.

This reinforced a principle I keep rediscovering: LLM routing is probabilistic. If you need deterministic dispatch, use code, not prompts.
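A sketch of what deterministic dispatch can look like, assuming callers can name a target agent explicitly. `make_dispatcher` and the handler shapes are hypothetical, not Boulder's orchestrator API.

```python
from typing import Callable, Dict

Handler = Callable[[str], dict]


def make_dispatcher(direct: Dict[str, Handler],
                    planner: Handler) -> Callable[..., dict]:
    """Route explicitly named targets in code; fall back to the LLM planner otherwise."""
    def dispatch(prompt: str, target: str = "") -> dict:
        handler = direct.get(target)
        if handler is not None:
            return handler(prompt)  # deterministic: no model decides the route
        return planner(prompt)      # probabilistic: the planner picks an agent
    return dispatch
```

With this shape, a call such as `dispatch(prompt, target="aws-builder")` reaches the agent regardless of what the planner's priors would have chosen.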

Results After One Week

47 invocations across 6 apps
Zero safety breaches
Zero write operations attempted (the filtered-out write tools are invisible to the LLM, so it never even tries)
Response quality: generated CDK stacks pass cdk synth on the first try ~70% of the time
Total token cost: $11.40, about $1.63/day (well within the $5/day budget)
Kill switch triggered manually once (testing), recovered cleanly

The agent produces CDK drafts for patterns I would have spent 30-60 minutes assembling from docs. The architecture suggestions catch edge cases I miss — like suggesting cross-region inference profiles when deploying Bedrock workloads in eu-west-1 where model availability is limited.

The Honest Recommendation

If you’re building agents that touch AWS infrastructure, don’t start with write access. Don’t start with “the prompt says to ask permission first.” Don’t start with an IAM policy as your only safety net.

Start with a tool allowlist that physically removes dangerous capabilities from the agent’s view. Start with a kill switch that lives outside the agent’s process. Start with cost caps that abort execution before the bill arrives.

The Agent Toolkit is genuinely powerful. aws___call_aws with 15,000+ API actions is the closest thing to handing an agent root access to your cloud account. Treat it like you’d treat sudo — not with a policy document, but with sudoers rules that constrain exactly which commands the process can run.

Phase 1 read-only, Phase 2 write with gates. Ship the boring version first. Graduate to the dangerous version only after you’ve watched the boring version run without incident for long enough to trust it.

I burned $900 learning this lesson the hard way. You don’t have to.

ABOUT THE AUTHOR

Alexandre Agius

AWS Solutions Architect

Passionate about AI & Security. Building scalable cloud solutions and helping organizations leverage AWS services to innovate faster. Specialized in Generative AI, serverless architectures, and security best practices.

Related dispatches

MCP Gateway as Policy Enforcement Point: RBAC for Your Agent's Tool Access

Browser Automation Agents - Amazon Bedrock AgentCore

When Your AI Agent Runs Away: 204 PRs, $900 Wasted, and the 3-Layer Fix
