Boulder — An AI Build Factory on AWS That Generates, Deploys, and Maintains Apps on Its Own
Boulder uses 9 Strands agents on Bedrock AgentCore to generate, deploy, and maintain full-stack apps on AWS Amplify — with self-healing builds and self-improving prompts.
Table of Contents
- Architecture: 9 agents, one runtime
- The 9 agents, each with its own job
- Builder — the core of the system
- Orchestrator — the conductor
- The 3 Scouts — idea → validate → monetize
- Backlog Pipeline — plan → implement → review
- Autopilot — the infinite loop
- Portfolio Manager & Post-Deploy
- Self-healing builds: the correction loop
- Self-improving prompts: error memory
- Tech stack: CDK, all in one stack
- What I learned
- Conclusion
We’ve all been there: you have an app idea, spend 3 days on Next.js scaffolding, 2 more debugging the Amplify deployment, and your MVP ships 2 weeks later. Boulder is the side project I built to short-circuit all of that. One CLI command, a plain-English description, and 5 minutes later the app is live on Amplify with its GitHub repo, Cognito/DynamoDB backend, and a backlog ready to be iterated on — by the AI itself.
Not a template generator. Not a copilot. An autonomous factory that generates code, pushes it, deploys it, catches build errors, fixes them, and learns from its failures so it doesn’t repeat them.
Architecture: 9 agents, one runtime
Boulder runs on Strands Agents SDK deployed via Bedrock AgentCore. A single main.py acts as a multiplexer: it receives a {agent, prompt} payload, resolves the agent’s canonical name, lazy-loads the corresponding Python module, and invokes it.
┌──────────────┐
│ Web Dashboard│
│ (Amplify SSR)│
└──────┬───────┘
│
┌─────────┐ AgentCore ┌──────┴────────┐ Bedrock ┌──────────┐
│ Boulder │────────────▶│ Strands Agents │───────────▶│ Claude │
│ CLI │◀────────────│ (9 agents) │ │ Opus 4.7 │
└─────────┘ └──────┬────────┘ └──────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌──────────┐
│ GitHub │ │Amplify │ │ DynamoDB │
│ Repo │ │ Deploy │ │ Tables │
└────────┘ └────────┘ └──────────┘
│
┌───────────────────────────────┘
▼ ▼
┌──────────┐ ┌──────────────┐
│Learnings │ │ Schedules │
└──────────┘ └──────┬───────┘
│
┌────────┴────────┐
│ EventBridge │──▶ Lambda
│ (every 5 min) │ (schedule-processor)
└─────────────────┘
The key idea: all agents share the same AgentCore runtime which scales to zero automatically. The CLI (TypeScript) calls InvokeAgentRuntimeCommand, the Python runtime routes to the right agent. Zero servers to manage.
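To give a sense of how small that routing layer is, here is a hypothetical sketch of the main.py multiplexer; the alias table, module paths, and run() convention are illustrative, not Boulder's actual code:

# Minimal sketch of the main.py dispatcher: resolve the agent's canonical name,
# lazy-load its module, invoke it. Names below are illustrative placeholders.
import importlib

# Maps payload aliases to canonical agent module names (hypothetical).
AGENT_ALIASES = {
    "build": "builder",
    "builder": "builder",
    "orchestrate": "orchestrator",
    "orchestrator": "orchestrator",
    "autopilot": "autopilot",
}

_loaded = {}  # module cache so each agent is imported only once

def handle(payload: dict) -> dict:
    """Entrypoint invoked by the AgentCore runtime with an {agent, prompt} payload."""
    name = AGENT_ALIASES.get(payload["agent"].strip().lower())
    if name is None:
        return {"error": f"unknown agent '{payload['agent']}'"}

    # Lazy-load the agent module on first use (e.g. agents/builder.py).
    if name not in _loaded:
        _loaded[name] = importlib.import_module(f"agents.{name}")

    # Each agent module exposes a run(prompt) function in this sketch.
    return {"agent": name, "result": _loaded[name].run(payload["prompt"])}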
The 9 agents, each with its own job
Builder — the core of the system
The Builder takes a project name + description and chains:
- generate_app → calls Bedrock with a large prompt (~64K tokens max) tailored to the framework (Next.js, React SPA, Vue, Astro). The model generates all source files in a single shot.
  - Files are stored in S3 via a handle system — a 32-character opaque UUID. The LLM never touches files directly, only the handle. This prevents context blowup on apps with 50+ files.
- push_to_github → creates the repo (idempotent), pushes via the Git Tree API.
- deploy_amplify → creates the Amplify app, kicks off the build, polls every 10s for up to 10 minutes.
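A minimal sketch of that handle system, assuming a boto3 S3 client and the handles/ prefix described later in the stack section; the function names and bucket placeholder are illustrative:

# Generated files go to S3 as one JSON blob; only an opaque 32-character handle
# circulates in the agent conversation.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "boulder-tmp-ACCOUNT-REGION"  # placeholder for boulder-tmp-{account}-{region}

def store_files(files: dict[str, str]) -> str:
    """Persist {path: content} files and return a 32-char hex handle."""
    handle = uuid.uuid4().hex  # 32 characters, no dashes
    s3.put_object(
        Bucket=BUCKET,
        Key=f"handles/{handle}.json",
        Body=json.dumps(files).encode("utf-8"),
    )
    return handle  # the LLM only ever sees this string

def load_files(handle: str) -> dict[str, str]:
    """Resolve a handle back into the full file map (tool-side only)."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"handles/{handle}.json")
    return json.loads(obj["Body"].read())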
The Builder auto-detects the framework from the description. A game? → React SPA. A SaaS with auth? → Next.js + Amplify Gen 2 backend. A blog? → Astro. It also injects a starter BACKLOG.md into every project so the Autopilot can start iterating immediately.
Orchestrator — the conductor
The Orchestrator works in 2 phases:
Phase 1 — Planning: A lightweight agent (4K tokens, temperature 0.1) analyzes the user prompt and produces a structured JSON plan in phases. Example for “Find a profitable SaaS idea in fitness and build it”:
{
"phases": [
{ "agents": ["idea_generator", "market_validator", "monetization_strategist"] },
{ "agents": ["builder"] }
]
}
Phase 2 — Execution: The plan is converted into a DAG via Strands’ GraphBuilder. Agents in the same phase run in parallel. A terminal Summarizer node synthesizes all results. Each agent is instantiated through a factory (no deepcopy — boto3 clients with their SSLContext and locks aren’t copyable).
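A condensed sketch of that factory pattern, based on the Strands GraphBuilder docs (exact signatures may differ; the model id, prompts, and temperatures here are placeholders):

# Build a fresh Agent (with its own BedrockModel and boto3 client) per graph node
# instead of deepcopy-ing a shared instance.
from strands import Agent
from strands.models import BedrockModel
from strands.multiagent import GraphBuilder

def make_agent(system_prompt: str, temperature: float) -> Agent:
    """Factory: every call returns a brand-new agent with its own boto3 client."""
    model = BedrockModel(
        model_id="anthropic.claude-sonnet-4-5-20250929-v1:0",  # placeholder model id
        temperature=temperature,
    )
    return Agent(model=model, system_prompt=system_prompt)

builder = GraphBuilder()
# Nodes with no incoming edges run in parallel (same phase).
builder.add_node(make_agent("Generate SaaS ideas...", 0.7), "idea_generator")
builder.add_node(make_agent("Validate the market...", 0.1), "market_validator")
builder.add_node(make_agent("Synthesize all results...", 0.1), "summarizer")
builder.add_edge("idea_generator", "summarizer")
builder.add_edge("market_validator", "summarizer")
graph = builder.build()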
The 3 Scouts — idea → validate → monetize
Three pure-LLM agents (no tools, just structured JSON output):
- Idea Generator (temperature 0.7): generates 5 monetizable SaaS ideas with TAM, technical difficulty, time-to-MVP, and revenue potential.
- Market Validator: analyzes competition, market sizing, and viability.
- Monetization Strategist: proposes a business model with pricing tiers and revenue projections.
They can run in parallel within an Orchestrator phase, then the Builder picks up the winning idea.
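For illustration, a tool-less scout could be as small as the sketch below, assuming the Strands Agent/BedrockModel API; the system prompt and JSON shape are made up for the example:

# A pure-LLM scout: no tools, higher temperature, strict JSON output.
import json
from strands import Agent
from strands.models import BedrockModel

idea_generator = Agent(
    model=BedrockModel(
        model_id="anthropic.claude-sonnet-4-5-20250929-v1:0",  # placeholder model id
        temperature=0.7,
    ),
    system_prompt=(
        "You generate monetizable SaaS ideas. Respond with JSON only: a list of 5 "
        "objects with keys name, tam, technical_difficulty, time_to_mvp, revenue_potential."
    ),
)

result = idea_generator("Find a profitable SaaS idea in fitness")
ideas = json.loads(str(result))  # assumes the model returned strict JSON, nothing else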
Backlog Pipeline — plan → implement → review
The Pipeline orchestrates 3 sub-agents to process a backlog item:
- Planner: reads the backlog + codebase structure, splits the feature into 1-3 independent tasks with the files to modify/create.
- Implementer: for each task, reads existing files, writes the code, creates a backlog/{N}-task-{id} branch, and opens a PR.
- Reviewer: reads the diffs, reviews for correctness/patterns/bugs/security, comments, and labels ready-to-merge if it passes.
Each sub-agent has its own calibrated system prompt and dedicated tools (read_backlog, create_branch, write_files_to_branch, get_pr_diff, label_pr…).
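As an example of what one of those tools could look like, here is a hypothetical create_branch sketch using Strands' @tool decorator and PyGithub (not Boulder's actual implementation; the token handling and branch naming are assumptions):

# One Pipeline tool: create a backlog/{N}-task-{id} branch off the default branch.
import os
from strands import tool
from github import Github  # PyGithub

@tool
def create_branch(repo_full_name: str, task_id: str, item_number: int) -> str:
    """Create a backlog branch for a task and return its name."""
    gh = Github(os.environ["GITHUB_TOKEN"])  # token assumed to be in the environment
    repo = gh.get_repo(repo_full_name)
    base = repo.get_branch(repo.default_branch)
    branch = f"backlog/{item_number}-task-{task_id}"
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base.commit.sha)
    return branch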
Autopilot — the infinite loop
The Autopilot is the “hands off” mode: it picks the next TODO item from the backlog, sends it through the full Pipeline, merges if the review passes, checks the Amplify build, and starts over. Budget cap: 10 tasks per cycle, 30-second pause between each. If a build fails, it automatically creates a fix task in the backlog.
In AgentCore mode, it processes one task per invocation (stateless). In local mode (run_autopilot()), it’s a true infinite loop with sleep and error handling.
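A local-mode sketch of that loop, with the budget and pacing from above; every helper (next_todo_item, run_pipeline, merge_pr, and so on) is a hypothetical placeholder, not Boulder's real API:

# Local-mode autopilot: pick a TODO, run the Pipeline, merge on green review,
# turn failed builds into new backlog tasks. Helpers are placeholders.
import time

MAX_TASKS_PER_CYCLE = 10
PAUSE_SECONDS = 30

def run_autopilot(repo: str) -> None:
    while True:
        for _ in range(MAX_TASKS_PER_CYCLE):
            item = next_todo_item(repo)               # read BACKLOG.md, pick the next TODO
            if item is None:
                break
            try:
                review = run_pipeline(repo, item)     # plan -> implement -> review
                if review.ready_to_merge:
                    merge_pr(repo, review.pr_number)
                    if not amplify_build_succeeded(repo):
                        add_fix_task(repo, item)      # failed build becomes a new backlog task
            except Exception as exc:
                log_error(item, exc)                  # one bad task must not kill the loop
            time.sleep(PAUSE_SECONDS)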
Portfolio Manager & Post-Deploy
- Portfolio Manager: global monitoring — lists all deployed apps, health checks, build history, cost summary.
- Post-Deploy: post-deployment maintenance — health checks, stale dependency detection, automatic updates, SEO audits, alerts.
Self-healing builds: the correction loop
This is the feature that makes the difference. When an Amplify build fails (and it happens a lot with generated code):
- deploy_amplify returns the appId + jobId of the failure
- fetch_build_logs pulls the logs from each build step via Amplify’s pre-signed URLs
- analyze_and_learn sends the logs to Bedrock to extract: error type, root cause, fix rule
- The learning is stored in DynamoDB (boulder-learnings, PK = error pattern)
- fetch_existing_files retrieves the current repo state → S3 handle
- modify_app sends the handle + the error as a change_request to Bedrock → new corrected handle
- push_to_github + deploy_amplify → retry
Budget: max 2 retries (3 total attempts). After that, the build is marked FAILED. No infinite loops.
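Put together, the correction loop looks roughly like the sketch below; the tool names come from the list above, but the control flow, return values, and helper signatures are assumptions:

# Self-healing build loop: at most 2 retries (3 attempts), then mark FAILED.
MAX_RETRIES = 2

def build_with_self_healing(project: str, handle: str) -> str:
    repo = push_to_github(project, handle)
    for attempt in range(MAX_RETRIES + 1):            # 3 attempts total
        app_id, job_id, ok = deploy_amplify(repo)
        if ok:
            return "SUCCEEDED"
        logs = fetch_build_logs(app_id, job_id)        # per-step logs via pre-signed URLs
        analyze_and_learn(logs)                        # store error pattern + fix rule in DynamoDB
        if attempt == MAX_RETRIES:
            break
        current = fetch_existing_files(repo)           # current repo state as an S3 handle
        handle = modify_app(current, change_request=logs)
        push_to_github(project, handle)                # push the corrected files, then retry
    return "FAILED"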
Self-improving prompts: error memory
Each analyze_and_learn stores a pattern in boulder-learnings. On the next generate_app, the _get_learnings() function scans the table and injects a KNOWN ISSUES section into the prompt:
## KNOWN ISSUES — Learn from past failures
- Pattern: missing_tsconfig_paths
Rule: Always configure paths alias in tsconfig.json for @/ imports
- Pattern: amplify_backend_type_error
Rule: Use a.string().required() not a.required().string()
The cache TTL is 5 minutes to avoid scanning DynamoDB on every build. The result: the more Boulder fails, the more reliable it gets. The first build for a given framework often fails. The tenth one almost never does.
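A sketch of what that lookup could look like, assuming a pattern/rule item layout in boulder-learnings; the 5-minute cache and section header come from above, the rest is illustrative:

# Build a KNOWN ISSUES prompt section from past build failures, with a 5-minute cache.
import time
import boto3

_cache: dict = {"at": 0.0, "text": ""}
CACHE_TTL_SECONDS = 300  # 5 minutes

def _get_learnings() -> str:
    """Return the KNOWN ISSUES section injected into generate_app prompts."""
    if time.time() - _cache["at"] < CACHE_TTL_SECONDS:
        return _cache["text"]

    table = boto3.resource("dynamodb").Table("boulder-learnings")
    items = table.scan().get("Items", [])
    lines = ["## KNOWN ISSUES — Learn from past failures"]
    for item in items:
        # Attribute names "pattern" and "rule" are assumed for this sketch.
        lines.append(f"- Pattern: {item['pattern']}\n  Rule: {item['rule']}")

    _cache.update(at=time.time(), text="\n".join(lines))
    return _cache["text"]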
Tech stack: CDK, all in one stack
All the infrastructure lives in a single BoulderStack CDK construct:
- 7 DynamoDB tables (PAY_PER_REQUEST): builds, agents, agent-executions, learnings, schedules, agent-activity (TTL 7d), sessions (TTL 24h)
- 1 S3 bucket (boulder-tmp-{account}-{region}) with a 24h lifecycle on handles/ — generated files are ephemeral
- 1 Lambda (boulder-schedule-processor) on Node.js 22, inline in the CDK — handles cron schedules AND reconciles builds stuck in BUILDING for >30 min
- EventBridge rate(5 minutes) → triggers the Lambda
- IAM: 3 separate roles — Amplify service role, SSR compute role (for the Next.js dashboard), AgentCore runtime role (created by deploy.py, enriched by CDK for S3 access)
- SSM Parameter Store: /boulder/github-token (SecureString, created manually)
Agent deployment is separate from CDK: deploy.py zips the Python code, uploads it to S3, and creates/updates the AgentCore runtime. First-time deploy order: deploy.py first (creates the IAM role), cdk deploy second (attaches the S3 policy).
What I learned
S3 handles are essential. Early on, the Builder passed generated files inline in the agent conversation. On a 40-file app, that blew up the context and token costs. The handle system (an opaque UUID pointing to a JSON blob in S3) changed everything: the LLM never sees file contents, just a 32-character identifier.
boto3 clients don’t copy. I lost a full day debugging SSLContext errors in the Orchestrator’s graph executor. copy.deepcopy on a BedrockModel containing a boto3 client with an SSLContext and threading.Lock → crash. Solution: a factory that instantiates a fresh agent at each graph node.
Self-healing has immediate ROI. 60% of first builds fail — missing imports, incorrect TypeScript config, badly versioned dependencies. The correction loop fixes ~80% of those failures in 1-2 retries. Without it, Boulder would be unusable.
Prompt engineering is infrastructure. The Builder’s system prompt is ~200 lines. It encodes framework-specific rules (when to use React SPA vs Next.js), Amplify Gen 2 patterns (a.model(), defineAuth(), generateClient<Schema>()), and code conventions. Every line was added after a concrete failure. It’s code, not prose.
Temperature per agent, not global. The Orchestrator’s Planner runs at 0.1 (you want deterministic JSON), the Idea Generator at 0.7 (you want creativity), the Builder at 0.3 (a tradeoff between code creativity and reliability). One parameter, big impact.
Conclusion
Boulder is ~2500 lines of Python (agents) + ~300 lines of TypeScript (CDK) + a CLI. The whole thing runs on managed AWS services: Bedrock AgentCore, Amplify, DynamoDB, EventBridge, Lambda, S3. No servers, no containers, scale-to-zero everywhere.
It’s not production-ready for critical apps — generated code needs human review. But for rapid prototyping, MVPs, or idea exploration, it’s a serious accelerator. boulder build my-idea -d "..." and you grab your coffee while 9 agents do the work.
The repo is open source: github.com/agiusalexandre/boulder