Boulder — An AI Build Factory on AWS That Generates, Deploys, and Maintains Apps on Its Own
Boulder uses 9 Strands agents on Bedrock AgentCore to generate, deploy, and maintain full-stack apps on AWS Amplify — with self-healing builds and self-improving prompts.
Table of Contents
- Architecture: 9 agents, one runtime
- The 9 agents, each with its own job
- Builder — the core of the system
- Orchestrator — the conductor
- The 3 Scouts — idea → validate → monetize
- Backlog Pipeline — plan → implement → review
- Autopilot — the infinite loop
- Portfolio Manager & Post-Deploy
- Self-healing builds: the correction loop
- Self-improving prompts: error memory
- Tech stack: CDK, all in one stack
- What I learned
- Conclusion
We’ve all been there: you have an app idea, spend 3 days on Next.js scaffolding, 2 more debugging the Amplify deployment, and your MVP ships 2 weeks later. Boulder is the side project I built to short-circuit all of that. One CLI command, a plain-English description, and 5 minutes later the app is live on Amplify with its GitHub repo, Cognito/DynamoDB backend, and a backlog ready to be iterated on — by the AI itself.
Not a template generator. Not a copilot. An autonomous factory that generates code, pushes it, deploys it, catches build errors, fixes them, and learns from its failures so it doesn’t repeat them.
Architecture: 9 agents, one runtime
Boulder runs on Strands Agents SDK deployed via Bedrock AgentCore. A single main.py acts as a multiplexer: it receives a {agent, prompt} payload, resolves the agent’s canonical name, lazy-loads the corresponding Python module, and invokes it.
┌──────────────┐
│ Web Dashboard│
│ (Amplify SSR)│
└──────┬───────┘
│
┌─────────┐ AgentCore ┌──────┴────────┐ Bedrock ┌──────────┐
│ Boulder │────────────▶│ Strands Agents │───────────▶│ Claude │
│ CLI │◀────────────│ (9 agents) │ │ Opus 4.7 │
└─────────┘ └──────┬────────┘ └──────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌──────────┐
│ GitHub │ │Amplify │ │ DynamoDB │
│ Repo │ │ Deploy │ │ Tables │
└────────┘ └────────┘ └──────────┘
│
┌───────────────────────────────┘
▼ ▼
┌──────────┐ ┌──────────────┐
│Learnings │ │ Schedules │
└──────────┘ └──────┬───────┘
│
┌────────┴────────┐
│ EventBridge │──▶ Lambda
│ (every 5 min) │ (schedule-processor)
└─────────────────┘
The key idea: all agents share the same AgentCore runtime which scales to zero automatically. The CLI (TypeScript) calls InvokeAgentRuntimeCommand, the Python runtime routes to the right agent. Zero servers to manage.
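To give a sense of how small that routing layer is, here is a hypothetical sketch of the main.py multiplexer; the alias table, module paths, and run() convention are illustrative, not Boulder's actual code:

# Minimal sketch of the main.py dispatcher: resolve the agent's canonical name,
# lazy-load its module, invoke it. Names below are illustrative placeholders.
import importlib

# Maps payload aliases to canonical agent module names (hypothetical).
AGENT_ALIASES = {
    "build": "builder",
    "builder": "builder",
    "orchestrate": "orchestrator",
    "orchestrator": "orchestrator",
    "autopilot": "autopilot",
}

_loaded = {}  # module cache so each agent is imported only once

def handle(payload: dict) -> dict:
    """Entrypoint invoked by the AgentCore runtime with an {agent, prompt} payload."""
    name = AGENT_ALIASES.get(payload["agent"].strip().lower())
    if name is None:
        return {"error": f"unknown agent '{payload['agent']}'"}

    # Lazy-load the agent module on first use (e.g. agents/builder.py).
    if name not in _loaded:
        _loaded[name] = importlib.import_module(f"agents.{name}")

    # Each agent module exposes a run(prompt) function in this sketch.
    return {"agent": name, "result": _loaded[name].run(payload["prompt"])}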
The 9 agents, each with its own job
Builder — the core of the system
The Builder takes a project name + description and chains:
- generate_app → calls Bedrock with a large prompt (~64K tokens max) tailored to the framework (Next.js, React SPA, Vue, Astro). The model generates all source files in a single shot.
  - Files are stored in S3 via a handle system — a 32-character opaque UUID. The LLM never touches files directly, only the handle. This prevents context blowup on apps with 50+ files.
- push_to_github → creates the repo (idempotent), pushes via the Git Tree API.
- deploy_amplify → creates the Amplify app, kicks off the build, polls every 10s for up to 10 minutes.
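A minimal sketch of that handle system, assuming a boto3 S3 client and the handles/ prefix described later in the stack section; the function names and bucket placeholder are illustrative:

# Generated files go to S3 as one JSON blob; only an opaque 32-character handle
# circulates in the agent conversation.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "boulder-tmp-ACCOUNT-REGION"  # placeholder for boulder-tmp-{account}-{region}

def store_files(files: dict[str, str]) -> str:
    """Persist {path: content} files and return a 32-char hex handle."""
    handle = uuid.uuid4().hex  # 32 characters, no dashes
    s3.put_object(
        Bucket=BUCKET,
        Key=f"handles/{handle}.json",
        Body=json.dumps(files).encode("utf-8"),
    )
    return handle  # the LLM only ever sees this string

def load_files(handle: str) -> dict[str, str]:
    """Resolve a handle back into the full file map (tool-side only)."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"handles/{handle}.json")
    return json.loads(obj["Body"].read())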
The Builder auto-detects the framework from the description. A game? → React SPA. A SaaS with auth? → Next.js + Amplify Gen 2 backend. A blog? → Astro. It also injects a starter BACKLOG.md into every project so the Autopilot can start iterating immediately.
Orchestrator — the conductor
The Orchestrator works in 2 phases:
Phase 1 — Planning: A lightweight agent (4K tokens, temperature 0.1) analyzes the user prompt and produces a structured JSON plan in phases. Example for “Find a profitable SaaS idea in fitness and build it”:
{
"phases": [
{ "agents": ["idea_generator", "market_validator", "monetization_strategist"] },
{ "agents": ["builder"] }
]
}
Phase 2 — Execution: The plan is converted into a DAG via Strands’ GraphBuilder. Agents in the same phase run in parallel. A terminal Summarizer node synthesizes all results. Each agent is instantiated through a factory (no deepcopy — boto3 clients with their SSLContext and locks aren’t copyable).
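A condensed sketch of that factory pattern, based on the Strands GraphBuilder docs (exact signatures may differ; the model id, prompts, and temperatures here are placeholders):

# Build a fresh Agent (with its own BedrockModel and boto3 client) per graph node
# instead of deepcopy-ing a shared instance.
from strands import Agent
from strands.models import BedrockModel
from strands.multiagent import GraphBuilder

def make_agent(system_prompt: str, temperature: float) -> Agent:
    """Factory: every call returns a brand-new agent with its own boto3 client."""
    model = BedrockModel(
        model_id="anthropic.claude-sonnet-4-5-20250929-v1:0",  # placeholder model id
        temperature=temperature,
    )
    return Agent(model=model, system_prompt=system_prompt)

builder = GraphBuilder()
# Nodes with no incoming edges run in parallel (same phase).
builder.add_node(make_agent("Generate SaaS ideas...", 0.7), "idea_generator")
builder.add_node(make_agent("Validate the market...", 0.1), "market_validator")
builder.add_node(make_agent("Synthesize all results...", 0.1), "summarizer")
builder.add_edge("idea_generator", "summarizer")
builder.add_edge("market_validator", "summarizer")
graph = builder.build()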
The 3 Scouts — idea → validate → monetize
Three pure-LLM agents (no tools, just structured JSON output):
- Idea Generator (temperature 0.7): generates 5 monetizable SaaS ideas with TAM, technical difficulty, time-to-MVP, and revenue potential.
- Market Validator: analyzes competition, market sizing, and viability.
- Monetization Strategist: proposes a business model with pricing tiers and revenue projections.
They can run in parallel within an Orchestrator phase, then the Builder picks up the winning idea.
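For illustration, a tool-less scout could be as small as the sketch below, assuming the Strands Agent/BedrockModel API; the system prompt and JSON shape are made up for the example:

# A pure-LLM scout: no tools, higher temperature, strict JSON output.
import json
from strands import Agent
from strands.models import BedrockModel

idea_generator = Agent(
    model=BedrockModel(
        model_id="anthropic.claude-sonnet-4-5-20250929-v1:0",  # placeholder model id
        temperature=0.7,
    ),
    system_prompt=(
        "You generate monetizable SaaS ideas. Respond with JSON only: a list of 5 "
        "objects with keys name, tam, technical_difficulty, time_to_mvp, revenue_potential."
    ),
)

result = idea_generator("Find a profitable SaaS idea in fitness")
ideas = json.loads(str(result))  # assumes the model returned strict JSON, nothing else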
Backlog Pipeline — plan → implement → review
The Pipeline orchestrates 3 sub-agents to process a backlog item:
- Planner: reads the backlog + codebase structure, splits the feature into 1-3 independent tasks with the files to modify/create.
- Implementer: for each task, reads existing files, writes the code, creates a backlog/{N}-task-{id} branch, and opens a PR.
- Reviewer: reads the diffs, reviews for correctness/patterns/bugs/security, comments, and labels ready-to-merge if it passes.
Each sub-agent has its own calibrated system prompt and dedicated tools (read_backlog, create_branch, write_files_to_branch, get_pr_diff, label_pr…).
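As an example of what one of those tools could look like, here is a hypothetical create_branch sketch using Strands' @tool decorator and PyGithub (not Boulder's actual implementation; the token handling and branch naming are assumptions):

# One Pipeline tool: create a backlog/{N}-task-{id} branch off the default branch.
import os
from strands import tool
from github import Github  # PyGithub

@tool
def create_branch(repo_full_name: str, task_id: str, item_number: int) -> str:
    """Create a backlog branch for a task and return its name."""
    gh = Github(os.environ["GITHUB_TOKEN"])  # token assumed to be in the environment
    repo = gh.get_repo(repo_full_name)
    base = repo.get_branch(repo.default_branch)
    branch = f"backlog/{item_number}-task-{task_id}"
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base.commit.sha)
    return branch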
Autopilot — the infinite loop
The Autopilot is the “hands off” mode: it picks the next TODO item from the backlog, sends it through the full Pipeline, merges if the review passes, checks the Amplify build, and starts over. Budget cap: 10 tasks per cycle, 30-second pause between each. If a build fails, it automatically creates a fix task in the backlog.
In AgentCore mode, it processes one task per invocation (stateless). In local mode (run_autopilot()), it’s a true infinite loop with sleep and error handling.
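A local-mode sketch of that loop, with the budget and pacing from above; every helper (next_todo_item, run_pipeline, merge_pr, and so on) is a hypothetical placeholder, not Boulder's real API:

# Local-mode autopilot: pick a TODO, run the Pipeline, merge on green review,
# turn failed builds into new backlog tasks. Helpers are placeholders.
import time

MAX_TASKS_PER_CYCLE = 10
PAUSE_SECONDS = 30

def run_autopilot(repo: str) -> None:
    while True:
        for _ in range(MAX_TASKS_PER_CYCLE):
            item = next_todo_item(repo)               # read BACKLOG.md, pick the next TODO
            if item is None:
                break
            try:
                review = run_pipeline(repo, item)     # plan -> implement -> review
                if review.ready_to_merge:
                    merge_pr(repo, review.pr_number)
                    if not amplify_build_succeeded(repo):
                        add_fix_task(repo, item)      # failed build becomes a new backlog task
            except Exception as exc:
                log_error(item, exc)                  # one bad task must not kill the loop
            time.sleep(PAUSE_SECONDS)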
Portfolio Manager & Post-Deploy
- Portfolio Manager: global monitoring — lists all deployed apps, health checks, build history, cost summary.
- Post-Deploy: post-deployment maintenance — health checks, stale dependency detection, automatic updates, SEO audits, alerts.
Self-healing builds: the correction loop
This is the feature that makes the difference. When an Amplify build fails (and it happens a lot with generated code):
- deploy_amplify returns the appId + jobId of the failure
- fetch_build_logs pulls the logs from each build step via Amplify’s pre-signed URLs
- analyze_and_learn sends the logs to Bedrock to extract: error type, root cause, fix rule
- The learning is stored in DynamoDB (boulder-learnings, PK = error pattern)
- fetch_existing_files retrieves the current repo state → S3 handle
- modify_app sends the handle + the error as a change_request to Bedrock → new corrected handle
- push_to_github + deploy_amplify → retry
Budget: max 2 retries (3 total attempts). After that, the build is marked FAILED. No infinite loops.
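Put together, the correction loop looks roughly like the sketch below; the tool names come from the list above, but the control flow, return values, and helper signatures are assumptions:

# Self-healing build loop: at most 2 retries (3 attempts), then mark FAILED.
MAX_RETRIES = 2

def build_with_self_healing(project: str, handle: str) -> str:
    repo = push_to_github(project, handle)
    for attempt in range(MAX_RETRIES + 1):            # 3 attempts total
        app_id, job_id, ok = deploy_amplify(repo)
        if ok:
            return "SUCCEEDED"
        logs = fetch_build_logs(app_id, job_id)        # per-step logs via pre-signed URLs
        analyze_and_learn(logs)                        # store error pattern + fix rule in DynamoDB
        if attempt == MAX_RETRIES:
            break
        current = fetch_existing_files(repo)           # current repo state as an S3 handle
        handle = modify_app(current, change_request=logs)
        push_to_github(project, handle)                # push the corrected files, then retry
    return "FAILED"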
Self-improving prompts: error memory
Each analyze_and_learn stores a pattern in boulder-learnings. On the next generate_app, the _get_learnings() function scans the table and injects a KNOWN ISSUES section into the prompt:
## KNOWN ISSUES — Learn from past failures
- Pattern: missing_tsconfig_paths
Rule: Always configure paths alias in tsconfig.json for @/ imports
- Pattern: amplify_backend_type_error
Rule: Use a.string().required() not a.required().string()
The cache TTL is 5 minutes to avoid scanning DynamoDB on every build. The result: the more Boulder fails, the more reliable it gets. The first build for a given framework often fails. The tenth one almost never does.
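A sketch of what that lookup could look like, assuming a pattern/rule item layout in boulder-learnings; the 5-minute cache and section header come from above, the rest is illustrative:

# Build a KNOWN ISSUES prompt section from past build failures, with a 5-minute cache.
import time
import boto3

_cache: dict = {"at": 0.0, "text": ""}
CACHE_TTL_SECONDS = 300  # 5 minutes

def _get_learnings() -> str:
    """Return the KNOWN ISSUES section injected into generate_app prompts."""
    if time.time() - _cache["at"] < CACHE_TTL_SECONDS:
        return _cache["text"]

    table = boto3.resource("dynamodb").Table("boulder-learnings")
    items = table.scan().get("Items", [])
    lines = ["## KNOWN ISSUES — Learn from past failures"]
    for item in items:
        # Attribute names "pattern" and "rule" are assumed for this sketch.
        lines.append(f"- Pattern: {item['pattern']}\n  Rule: {item['rule']}")

    _cache.update(at=time.time(), text="\n".join(lines))
    return _cache["text"]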
Tech stack: CDK, all in one stack
All the infrastructure lives in a single BoulderStack CDK construct:
- 7 DynamoDB tables (PAY_PER_REQUEST): builds, agents, agent-executions, learnings, schedules, agent-activity (TTL 7d), sessions (TTL 24h)
- 1 S3 bucket (boulder-tmp-{account}-{region}) with a 24h lifecycle on handles/ — generated files are ephemeral
- 1 Lambda (boulder-schedule-processor) on Node.js 22, inline in the CDK — handles cron schedules AND reconciles builds stuck in BUILDING for >30 min
- EventBridge rate(5 minutes) → triggers the Lambda
- IAM: 3 separate roles — Amplify service role, SSR compute role (for the Next.js dashboard), AgentCore runtime role (created by deploy.py, enriched by CDK for S3 access)
- SSM Parameter Store: /boulder/github-token (SecureString, created manually)
Agent deployment is separate from CDK: deploy.py zips the Python code, uploads it to S3, and creates/updates the AgentCore runtime. First-time deploy order: deploy.py first (creates the IAM role), cdk deploy second (attaches the S3 policy).
What I learned
S3 handles are essential. Early on, the Builder passed generated files inline in the agent conversation. On a 40-file app, that blew up the context and token costs. The handle system (an opaque UUID pointing to a JSON blob in S3) changed everything: the LLM never sees file contents, just a 32-character identifier.
boto3 clients don’t copy. I lost a full day debugging SSLContext errors in the Orchestrator’s graph executor. copy.deepcopy on a BedrockModel containing a boto3 client with an SSLContext and threading.Lock → crash. Solution: a factory that instantiates a fresh agent at each graph node.
Self-healing has immediate ROI. 60% of first builds fail — missing imports, incorrect TypeScript config, badly versioned dependencies. The correction loop fixes ~80% of those failures in 1-2 retries. Without it, Boulder would be unusable.
Prompt engineering is infrastructure. The Builder’s system prompt is ~200 lines. It encodes framework-specific rules (when to use React SPA vs Next.js), Amplify Gen 2 patterns (a.model(), defineAuth(), generateClient<Schema>()), and code conventions. Every line was added after a concrete failure. It’s code, not prose.
Temperature per agent, not global. The Orchestrator’s Planner runs at 0.1 (you want deterministic JSON), the Idea Generator at 0.7 (you want creativity), the Builder at 0.3 (a tradeoff between code creativity and reliability). One parameter, big impact.
Conclusion
Boulder is ~2500 lines of Python (agents) + ~300 lines of TypeScript (CDK) + a CLI. The whole thing runs on managed AWS services: Bedrock AgentCore, Amplify, DynamoDB, EventBridge, Lambda, S3. No servers, no containers, scale-to-zero everywhere.
It’s not production-ready for critical apps — generated code needs human review. But for rapid prototyping, MVPs, or idea exploration, it’s a serious accelerator. boulder build my-idea -d "..." and you grab your coffee while 9 agents do the work.
The repo is open source: github.com/agiusalexandre/boulder