AWS Agent Toolkit GA: How I Gave an Agent 15,000 AWS APIs Without Losing Sleep
AWS released the Agent Toolkit for AWS on May 6, 2026 -- a managed MCP server exposing the full AWS API surface to autonomous agents. I shipped an infrastructure agent the same week. Here's the two-phase safety pattern that lets you hand an agent the keys to your account without waking up to a $10K bill.
Table of Contents
- The Problem: 80/20 Coverage Gap
- What the Agent Toolkit Actually Exposes
- The Iron Rule
- Phase 1: The Read-Only Advisor
- Safety Gates: Six Layers Deep
- Layer 1: SSM Kill Switch
- Layer 2: Production Write-Intent Detection
- Layer 3: MCP Tool Allowlist
- Layer 4: Daily Invocation Cap
- Layer 5: Per-Invocation Cost Cap
- Layer 6: Retry Cap
- The Test That Matters Most
- What Phase 2 Looks Like (Not Shipped Yet)
- The Orchestrator Routing Discovery
- Results After One Week
- The Honest Recommendation
On May 6, 2026, AWS quietly shipped one of the most dangerous tools in the agentic AI ecosystem.
The Agent Toolkit for AWS went GA. A managed MCP server exposing 11 tools, including aws___call_aws — a single function that can invoke any of the 15,000+ AWS API actions across every service. Create an EC2 fleet. Delete a production database. Modify IAM policies. All available to any agent that can authenticate via SigV4.
I had it wired into a new agent within 72 hours. And I shipped it to production the same week — safely.
This is the pattern I used: a two-phase approach where Phase 1 is strictly read-only and advisory, gated behind code-enforced safety breakers that exist before the agent ever loads. If you’ve read my previous post on runaway agents, you know why I’m paranoid about this. $900 in burned tokens taught me that prompt-only safeguards don’t survive contact with reality.
The Problem: 80/20 Coverage Gap
Boulder — my personal AI build factory running on Bedrock AgentCore — builds web applications and deploys them via AWS Amplify Gen 2. Amplify handles the backend cleanly: Cognito, AppSync, Lambda, DynamoDB, S3. For the 80% of app patterns that fit this mold, it works beautifully.
The remaining 20% sits outside Amplify’s abstraction. Apps that need VPC + ECS + RDS. ML pipelines with SageMaker and Bedrock Knowledge Bases. Data pipelines with Glue and Kinesis. Multi-region architectures. Pure infrastructure provisioning.
Before the Agent Toolkit, Boulder had no path for these. The agent would correctly identify that a user’s request needed infrastructure beyond Amplify, and then… stop. Return a polite “this is outside my scope” message.
The Agent Toolkit changes that equation entirely.
What the Agent Toolkit Actually Exposes
The toolkit ships as a managed MCP server with a Python client (mcp-proxy-for-aws). You wire it into a Strands agent via MCPClient(factory) + Agent(tools=mcp_tools). Authentication is SigV4 using the caller’s IAM identity — in my case, the AgentCore runtime role.
The 11 tools break into two categories that matter for safety:
Read-only (safe to expose):
- aws___search_documentation: search AWS docs
- aws___read_documentation: read specific doc pages
- aws___retrieve_skill: get pre-built patterns
- aws___recommend: architecture recommendations
- aws___list_regions: list available regions
- aws___get_regional_availability: service availability checks
- aws___suggest_aws_commands: CLI command suggestions
Write-capable (dangerous):
- aws___call_aws: invoke any AWS API (15,000+ actions)
- aws___run_script: execute arbitrary Python in a sandbox
- aws___get_presigned_url: generate signed URLs
- aws___get_tasks: manage async tasks
The distinction is binary. The read-only tools can’t break anything. The write-capable tools can delete your production database in a single call.
The Iron Rule
After the runaway agent incident, I codified a principle into an architecture decision record: every new autonomous agent MUST have code-enforced breakers BEFORE production enablement. No exceptions. No “we’ll add safety later.” No “the prompt says not to.”
Prompts are suggestions. Code is law.
This means safety gates live in Python — not in system prompts, not in model instructions, not in “guidelines.” They execute before the agent even loads. If the gate says no, the LLM never sees the tools.
Phase 1: The Read-Only Advisor
The key insight is that most of the value comes from the advisory step, not the execution step. When someone says “I need a Kinesis data pipeline with Glue ETL into Redshift,” what they actually need first is:
- A validated architecture pattern
- Regional service availability confirmation
- A CDK stack they can review
- A cost estimate
- A next-step checklist
None of that requires write access. All of it can be generated using only the read-only tools plus the LLM’s code generation capability.
So Phase 1 is purely advisory. The agent:
- Accepts infrastructure prompts
- Searches AWS docs and patterns via MCP
- Checks regional availability
- Generates a complete CDK stack as text
- Returns the architecture + code + cost estimate + checklist
- Hands it to the human for review and deployment
The human runs cdk deploy. Not the agent. Not yet.
Safety Gates: Six Layers Deep
Here’s what sits between the user’s prompt and the agent’s execution, enforced in code:
Layer 1: SSM Kill Switch
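# Checked in main.py BEFORE the agent module loads (assumes ssm is a boto3 SSM client)
def check_kill_switch(ssm):
    param = ssm.get_parameter(Name="/boulder/aws-builder/enabled")
    if param["Parameter"]["Value"] == "false":
        return {"status": "disabled", "message": "aws-builder is disabled"}
    return None  # enabled: continue loading the agent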
Independent of the global autopilot kill switch, so I can kill this agent without touching anything else. Fail-closed on SSM errors: if SSM is down, the agent refuses to load rather than running ungated.
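The refuse-to-load-on-error behavior can be sketched as follows. This is a minimal illustration rather than Boulder's actual code: `fetch_param` is a hypothetical callable standing in for the SSM lookup, so the logic can be exercised without AWS credentials.

```python
from typing import Callable, Optional

PARAM_NAME = "/boulder/aws-builder/enabled"


def kill_switch_response(fetch_param: Callable[[str], str]) -> Optional[dict]:
    """Return a refusal dict if the agent must not load, else None.

    fetch_param is a hypothetical stand-in for the SSM lookup: it takes a
    parameter name and returns its string value, or raises on failure.
    """
    try:
        value = fetch_param(PARAM_NAME)
    except Exception as exc:  # SSM down, throttled, access denied, ...
        # An unreadable switch is treated as "disabled": never run ungated.
        return {"status": "disabled", "message": f"kill switch unreadable: {exc}"}
    if value.strip().lower() != "true":
        # Anything other than an explicit "true" keeps the agent off.
        return {"status": "disabled", "message": "aws-builder is disabled"}
    return None
```

Treating anything other than an explicit "true" as disabled is stricter than the `== "false"` comparison in the snippet above; that choice is an assumption of this sketch, not something the original gate specifies.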
Layer 2: Production Write-Intent Detection
if os.environ.get("BOULDER_ENV") == "prod":
    write_patterns = re.compile(r"deploy|create|delete|terminate|modify", re.I)
    if write_patterns.search(user_prompt):
        return advisory_only_response(user_prompt)
In production, if the prompt contains any write-intent keyword, the agent switches to advisory-only mode. It generates the CDK code, explains what it would do, and explicitly states “this is a draft for human review.” Hardcoded. Not a prompt instruction — a code branch.
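One property of this pattern worth seeing concretely is that it matches substrings, case-insensitively, so words like "deployment" or "CreateBucket" also trip it. A small sketch using the same regular expression; the helper name `has_write_intent` is hypothetical.

```python
import re

# Same pattern as the gate: unanchored, case-insensitive substring match.
WRITE_PATTERNS = re.compile(r"deploy|create|delete|terminate|modify", re.I)


def has_write_intent(prompt: str) -> bool:
    """True if the prompt contains any write-intent keyword as a substring."""
    return WRITE_PATTERNS.search(prompt) is not None


if __name__ == "__main__":
    for prompt in (
        "Show me the deployment options",     # matches "deploy"
        "Explain what CreateBucket does",     # matches "create"
        "Which regions support Kinesis?",     # no keyword present
    ):
        print(has_write_intent(prompt), prompt)
```

The false positives land on the safe side: a harmless question that happens to contain a keyword is simply answered in advisory-only mode.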
Layer 3: MCP Tool Allowlist
This is the most important gate. The MCP client returns all 11 tools from the server. My wrapper filters them before they reach the Strands Agent():
ALLOWED_TOOLS_PHASE_1 = {
    "aws___search_documentation",
    "aws___read_documentation",
    "aws___retrieve_skill",
    "aws___recommend",
    "aws___list_regions",
    "aws___get_regional_availability",
    "aws___suggest_aws_commands",
}

def filter_tools(mcp_tools):
    return [t for t in mcp_tools if t.name in ALLOWED_TOOLS_PHASE_1]
aws___call_aws never reaches the agent. aws___run_script never reaches the agent. The LLM cannot invoke what it cannot see. This is the ulimit equivalent — the process doesn’t have the capability, period.
Layer 4: Daily Invocation Cap
Ten invocations per app per day. Enforced atomically via DynamoDB ADD operation on a daily-scoped counter. Even if someone finds a way to spam the agent, it stops after 10 calls.
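A conditional ADD is one way to get that atomicity: the increment and the cap check happen in a single write, so concurrent callers cannot overshoot. The sketch below only builds the request arguments; the table name, key layout, and attribute name are hypothetical.

```python
from datetime import date

DAILY_CAP = 10


def counter_update_kwargs(app_id: str, day: date, cap: int = DAILY_CAP) -> dict:
    """Build keyword arguments for a DynamoDB UpdateItem call.

    Table, key, and attribute names are hypothetical placeholders.
    """
    return {
        "TableName": "boulder-agent-invocations",  # hypothetical table
        # One item per app per day, so the counter resets with the date.
        "Key": {"pk": {"S": f"{app_id}#{day.isoformat()}"}},
        # ADD creates the attribute on first use, then increments it.
        "UpdateExpression": "ADD invocations :one",
        # The write is rejected once the counter has reached the cap.
        "ConditionExpression": (
            "attribute_not_exists(invocations) OR invocations < :cap"
        ),
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":cap": {"N": str(cap)},
        },
    }
```

With a boto3 DynamoDB client this would be passed as `client.update_item(**kwargs)`; a `ConditionalCheckFailedException` then means the app has used up its invocations for the day.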
Layer 5: Per-Invocation Cost Cap
$3 USD maximum token spend per single invocation. The Strands conversation soft-aborts if the running cost exceeds this threshold. This prevents the specific failure mode from my runaway incident where a single agent session burned $900 over 12 hours.
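The accounting side of such a cap can be sketched as a small meter that is updated after every model call. How it hooks into the Strands conversation loop is not shown here, and the per-token prices are hypothetical placeholders.

```python
class CostCapExceeded(RuntimeError):
    """Raised when a single invocation spends more than its budget."""


class CostMeter:
    """Tracks running token cost for one invocation.

    Prices are hypothetical placeholders in USD per 1,000 tokens.
    """

    def __init__(self, cap_usd: float = 3.0,
                 input_price: float = 0.003,
                 output_price: float = 0.015) -> None:
        self.cap_usd = cap_usd
        self.input_price = input_price
        self.output_price = output_price
        self.spent_usd = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one model call's usage; raise once the cap is exceeded."""
        self.spent_usd += (input_tokens / 1000) * self.input_price
        self.spent_usd += (output_tokens / 1000) * self.output_price
        if self.spent_usd > self.cap_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:.2f} > cap ${self.cap_usd:.2f}"
            )
        return self.spent_usd
```

The caller would catch `CostCapExceeded`, stop issuing further model calls, and return whatever partial answer it has.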
Layer 6: Retry Cap
Two attempts per task. Not three (which is what my autopilot uses). Infrastructure questions should either work on the first or second try, or they need a human. An agent that retries an infra question five times is an agent that’s hallucinating patterns.
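A retry cap like this fits in a few lines. In the sketch below, `run_with_retry_cap` and `NeedsHuman` are hypothetical names; the point is that exhausting the budget escalates instead of looping.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
MAX_ATTEMPTS = 2


class NeedsHuman(RuntimeError):
    """Raised when the attempt budget is exhausted."""


def run_with_retry_cap(task: Callable[[], T],
                       max_attempts: int = MAX_ATTEMPTS) -> T:
    """Call task up to max_attempts times, then escalate to a human."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            last_error = exc  # remember the failure, try again if budget allows
    raise NeedsHuman(f"gave up after {max_attempts} attempts") from last_error
```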
The Test That Matters Most
Unit tests cover each gate in isolation. But the test that gives me confidence is the integration test for the tool allowlist:
def test_mcp_tool_allowlist():
    """Only read-only tools reach Agent(). Write tools are filtered."""
    all_tools = mock_mcp_list_tools()  # Returns all 11 tools
    filtered = filter_tools(all_tools)
    allowed_names = {t.name for t in filtered}
    assert "aws___call_aws" not in allowed_names
    assert "aws___run_script" not in allowed_names
    assert "aws___get_presigned_url" not in allowed_names
    assert len(filtered) == 7  # Exactly the read-only set
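Because the filter is an explicit allowlist, any new tool AWS adds to the toolkit is excluded by default until I review it and add it by name. A denylist would do the opposite and silently let new additions through. The count assertion adds a second guard: it fails loudly if the allowlist itself is edited, or if one of the seven read-only tools is renamed or removed.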
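A complementary check is to compare the server's tool list against everything that has been reviewed, so that a newly advertised tool surfaces as a failure instead of being dropped quietly by the filter. A sketch using the eleven tool names listed earlier; `KNOWN_BLOCKED` and `unreviewed_tools` are hypothetical names.

```python
from typing import Iterable, Set

ALLOWED: Set[str] = {
    "aws___search_documentation",
    "aws___read_documentation",
    "aws___retrieve_skill",
    "aws___recommend",
    "aws___list_regions",
    "aws___get_regional_availability",
    "aws___suggest_aws_commands",
}

# Tools that were reviewed and deliberately kept away from the agent.
KNOWN_BLOCKED: Set[str] = {
    "aws___call_aws",
    "aws___run_script",
    "aws___get_presigned_url",
    "aws___get_tasks",
}


def unreviewed_tools(server_tool_names: Iterable[str]) -> Set[str]:
    """Names the server advertises that appear in neither reviewed set."""
    return set(server_tool_names) - ALLOWED - KNOWN_BLOCKED
```

Run against the live tool list in CI or at startup, a non-empty result means the toolkit has grown and a human needs to classify the new tool before anything changes.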
What Phase 2 Looks Like (Not Shipped Yet)
Phase 2 adds actual deployment capability, gated behind additional safety:
- Mandatory cdk diff review gate: human approval or per-app pre-approval policy
- Hard dollar cost cap per stack: abort if estimated cost exceeds threshold
- Distributed lock in DynamoDB per account + region: prevent concurrent deployments
- Automatic rollback on any CREATE_FAILED CloudFormation status
- Independent ADR with its own review cycle
The explicit rule: Phase 2 does NOT ship until Phase 1 has run for at least one week in production with zero safety-breach incidents AND an independent review.
This is the chmod 444 to chmod 755 progression. You start read-only. You observe. You build confidence. Then you grant write access with additional gates.
The Orchestrator Routing Discovery
An interesting finding from shipping this: when I wired the new agent into Boulder’s orchestrator, the LLM Planner consistently preferred the existing builder agent for infrastructure questions, even with explicit routing rules.
Root cause: the builder agent already owns tools like deploy_stack and search_github_stacks. The Planner has strong priors toward it for anything containing “build” or “infrastructure.” My routing rule was one of ten in the system prompt — not prominent enough to override learned associations.
The fix: don’t fight it. The aws-builder agent is invoked via direct dispatch, not orchestrator routing. It’s complementary to the existing builder, filling gaps where no pre-built stack template matches the user’s need.
This reinforced a principle I keep rediscovering: LLM routing is probabilistic. If you need deterministic dispatch, use code, not prompts.
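Deterministic dispatch in code can be as simple as a lookup table where the caller names the target explicitly. The handlers and target names below are hypothetical placeholders, not Boulder's real entry points.

```python
from typing import Callable, Dict


def run_template_builder(prompt: str) -> str:
    """Placeholder for the existing builder agent."""
    return f"builder handled: {prompt}"


def run_aws_builder(prompt: str) -> str:
    """Placeholder for the read-only aws-builder advisor."""
    return f"aws-builder handled: {prompt}"


# The caller picks the target by name; no LLM is involved in the routing.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "builder": run_template_builder,
    "aws-builder": run_aws_builder,
}


def dispatch(target: str, prompt: str) -> str:
    """Route to exactly one handler, or fail loudly on an unknown target."""
    if target not in HANDLERS:
        raise ValueError(f"unknown dispatch target: {target!r}")
    return HANDLERS[target](prompt)
```

The same input always reaches the same agent, which is the property the Planner's routing rules could not guarantee.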
Results After One Week
- 47 invocations across 6 apps
- Zero safety breaches
- Zero write operations attempted (the filtered-out tools are invisible to the LLM, so it never even tries)
- Average response quality: CDK stacks that cdk synth validates on first try ~70% of the time
- Total token cost: $11.40 (well within the $5/day budget)
- Kill switch triggered manually once (testing), recovered cleanly
The agent produces CDK drafts for patterns I would have spent 30-60 minutes assembling from docs. The architecture suggestions catch edge cases I miss — like suggesting cross-region inference profiles when deploying Bedrock workloads in eu-west-1 where model availability is limited.
The Honest Recommendation
If you’re building agents that touch AWS infrastructure, don’t start with write access. Don’t start with “the prompt says to ask permission first.” Don’t start with an IAM policy as your only safety net.
Start with a tool allowlist that physically removes dangerous capabilities from the agent’s view. Start with a kill switch that lives outside the agent’s process. Start with cost caps that abort execution before the bill arrives.
The Agent Toolkit is genuinely powerful. aws___call_aws with 15,000+ API actions is the closest thing to handing an agent root access to your cloud account. Treat it like you’d treat sudo — not with a policy document, but with sudoers rules that constrain exactly which commands the process can run.
Phase 1 read-only, Phase 2 write with gates. Ship the boring version first. Graduate to the dangerous version only after you’ve watched the boring version run without incident for long enough to trust it.
I burned $900 learning this lesson the hard way. You don’t have to.