AWS DevOps Agent: Build vs Buy for Enterprise AIOps
AWS DevOps Agent is GA and included with Support plans. But it doesn't replace your custom agents -- it complements them. Here's the hybrid pattern: what to buy, what to build, and how MCP bridges the gap.
Every operations team I talk to is asking the same question: should we build our own AIOps agents with Bedrock and AgentCore, or should we adopt AWS DevOps Agent?
The answer is both. And the architecture for combining them is more interesting than either approach alone.
The PID 1 Analogy
Think of AWS DevOps Agent as init — the PID 1 of your operations stack. It coordinates incident response, routes alerts, and maintains the system map. But just like PID 1 doesn’t replace nginx or postgres, DevOps Agent doesn’t replace your specialized operational tooling. It orchestrates.
The mistake I see teams making: treating this as an either/or decision. Build everything custom with Strands and AgentCore, or adopt the managed service wholesale. Both extremes fail at enterprise scale.
What You Get Out of the Box
DevOps Agent went GA on March 31, 2026. Here’s what you get without writing a single line of code:
Incident investigation. The moment a CloudWatch alarm or Datadog alert fires, the agent starts investigating. It correlates telemetry across your observability stack, traces dependencies through your resource topology, and delivers root cause analysis. Customers in preview reported 94% root cause accuracy and 3-5x faster resolution.
Proactive prevention. Pattern analysis across historical incidents produces recommendations in four areas: observability gaps, infrastructure optimization, deployment pipeline hardening, and application resilience.
On-demand SRE tasks. Natural language queries against your operational data. “Show me all services that degraded after last Thursday’s deployment.” “Which Lambda functions hit concurrency limits this month?” Create charts, save reports, share with the team.
Built-in integrations. CloudWatch, Dynatrace, Datadog, Grafana, New Relic, Splunk for observability. GitHub, GitLab, Azure DevOps for code and CI/CD. ServiceNow, PagerDuty, Slack for coordination.
Topology understanding. This is the part people underestimate. DevOps Agent builds and maintains a resource graph of your environment — services, their dependencies, network paths, deployment pipelines. When an alert fires on Service A, it already knows that Service A depends on DynamoDB table X and is fronted by ALB Y. It doesn’t start from scratch every investigation.
Think of it as maintaining a living /etc/hosts for your entire infrastructure — except it updates itself by observing deployments, API calls, and telemetry patterns. When a new service appears or a dependency changes, the topology adapts.
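To make the resource-graph idea concrete, here is a minimal sketch of a dependency graph and the lookup an investigation would start from. It is not DevOps Agent's internal representation; the resource names, `TOPOLOGY`, `dependencies_of`, and `record_dependency` are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical resource graph: each key depends on the resources in its list.
TOPOLOGY = {
    "alb-y": ["service-a"],
    "service-a": ["dynamodb-table-x", "sqs-orders"],
    "dynamodb-table-x": [],
    "sqs-orders": [],
}


def dependencies_of(resource, graph):
    """Return every resource reachable from `resource`, nearest first (BFS)."""
    seen = {resource}
    order = []
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def record_dependency(graph, source, target):
    """Add an edge when a new dependency is observed (e.g. after a deployment)."""
    graph.setdefault(source, [])
    graph.setdefault(target, [])
    if target not in graph[source]:
        graph[source].append(target)
```

When an alert fires on `service-a`, `dependencies_of("service-a", TOPOLOGY)` gives the candidate list to check first, and `record_dependency` is the hook that keeps the graph current as the environment changes.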
EKS-native intelligence. For teams running Kubernetes, the agent understands pod scheduling, node pressure, HPA behavior, and service mesh routing. It can trace an intermittent 503 from the ingress controller through the service mesh to a pod eviction caused by node memory pressure — a chain that typically takes a human engineer 45 minutes to piece together.
Coordinated response. When the agent identifies a root cause, it doesn’t just write a report. It routes findings to your Slack channel, creates or updates a ServiceNow incident, pages the relevant team via PagerDuty, and populates the war room with the full investigation timeline. It is the “single pane of glass” cliché, except it actually works because the agent understands the incident, not just the alert.
United Airlines — one of the public reference customers — runs 38,000 Dynatrace agents across a hybrid environment with 500+ AWS accounts and 20,000 Lambda functions. Their principal engineer reported that DevOps Agent provides “a single pane of glass” for investigations that previously required switching between multiple tools at 3 AM.
Western Governors University, serving 191,000+ students, measured a 77% improvement in MTTR during production use — from an estimated two hours to 28 minutes for a Lambda configuration issue that the agent traced through undiscovered internal documentation.
That covers roughly 60% of what a mature operations team needs. The remaining 40% is where custom agents come in.
What You Still Need to Build
DevOps Agent investigates and recommends. By default, it does not act on your infrastructure autonomously. It tells you “the root cause is this Lambda’s memory configuration”; it doesn’t change the configuration for you.
Here’s what a custom agent layer adds:
Automated remediation. Restart the ECS service. Scale the Auto Scaling Group. Execute the SSM runbook. Roll back the CodeDeploy deployment. These actions require IAM-scoped permissions and company-specific approval workflows. No managed service can safely assume your blast radius tolerance.
Custom integrations. Your CMDB isn’t ServiceNow. Your monitoring stack includes CheckMK and SolarWinds alongside Datadog. Your ticketing system has custom fields for change control. Your security tool is Wiz, not GuardDuty. Custom MCP servers bridge these gaps.
Domain-specific workflows. “When a P1 fires on the payment service, page the payments team lead, create a war room in Slack, and auto-populate it with the last 3 deployments to that service.” This logic is unique to your organization.
Cross-system orchestration. ServiceNow write-back with enriched context. Dynatrace problem correlation fed back into your internal knowledge base. Automated post-incident reviews that pull data from five different systems.
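The P1 payment-service workflow quoted above is a good example of logic that only you can encode. A minimal sketch follows, assuming a hypothetical `Alert` type, a `team_leads` lookup, and deployment records with `version` and `deployed_at` fields. It returns a plan as plain data rather than calling PagerDuty or Slack, so the plan can be reviewed or approved before anything runs.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    priority: str


def plan_p1_response(alert, deployments, team_leads):
    """Return the ordered actions for a P1, or [] if the rule does not apply.

    `deployments` maps service -> list of {"version", "deployed_at"} dicts;
    `team_leads` maps service -> who to page. Both are hypothetical inputs.
    """
    if alert.priority != "P1" or alert.service not in team_leads:
        return []
    history = deployments.get(alert.service, [])
    # ISO-8601 timestamps sort correctly as strings; keep the 3 most recent.
    recent = sorted(history, key=lambda d: d["deployed_at"], reverse=True)[:3]
    return [
        {"action": "page", "target": team_leads[alert.service]},
        {"action": "create_war_room", "channel": f"inc-{alert.service}"},
        {"action": "post_deployments", "versions": [d["version"] for d in recent]},
    ]
```

Keeping the rule as a pure function makes it easy to unit-test and to version alongside your runbooks.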
The MCP Bridge
This is where the architecture gets elegant. DevOps Agent supports custom MCP servers — the same Model Context Protocol that Bedrock AgentCore uses for tool integration.
The pattern:
- DevOps Agent handles detection, investigation, and root cause analysis
- Custom MCP servers expose your proprietary systems (CMDB, custom runbooks, internal APIs) to the agent
- AWS API MCP Server (GA since May 2026) lets the agent interact with 15,000+ AWS APIs via IAM-scoped permissions
- Custom Bedrock/AgentCore agents handle the remediation actions that require human-in-the-loop approval or company-specific safety gates
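To give a feel for the second item, here is a rough sketch of what a read-only tool on a custom MCP server boils down to: a tool definition (name, description, JSON Schema for the input) plus a handler that returns text content. The `CMDB` dict, the `cmdb_lookup` tool name, and the handler are hypothetical, and the wiring into an actual MCP server SDK and transport is deliberately left out.

```python
import json

# Hypothetical in-memory stand-in for a proprietary CMDB.
CMDB = {
    "payments-api": {"owner": "payments-team", "tier": 1, "runbook": "RB-0042"},
}

# Tool definition in the shape MCP tools use: name, description, input schema.
LOOKUP_TOOL = {
    "name": "cmdb_lookup",
    "description": "Read-only lookup of a service record in the CMDB.",
    "inputSchema": {
        "type": "object",
        "properties": {"service": {"type": "string"}},
        "required": ["service"],
    },
}


def _text_result(text, is_error):
    """Build a tool result carrying a single text content item."""
    return {"isError": is_error, "content": [{"type": "text", "text": text}]}


def handle_cmdb_lookup(arguments):
    """Handle one tool call; never mutates the CMDB."""
    service = arguments.get("service")
    if not isinstance(service, str) or not service:
        return _text_result("argument 'service' must be a non-empty string", True)
    record = CMDB.get(service)
    if record is None:
        return _text_result(f"unknown service: {service}", True)
    return _text_result(json.dumps(record, sort_keys=True), False)
```

Note that the tool only reads. Anything that changes state belongs behind the approval path described later in this section.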
The MCP bridge means DevOps Agent isn’t a walled garden. It can reach into your Salesforce instance (there’s a published integration blog from April 2026), query your Elasticsearch cluster alongside Datadog (published May 2026), or invoke your internal compliance API.
But critically: write actions (anything that changes state) should flow through your own authorization layer. DevOps Agent identifies “restart this ECS service” as the mitigation. Your custom agent, with proper IAM scoping and approval gates, executes it.
Here’s a concrete example of the architecture:
[CloudWatch Alarm] → [DevOps Agent investigates]
↓
Root cause: "ECS task OOM due to memory leak
in build v2.14.3, deployed 2h ago"
↓
[Custom MCP Server: your-remediation-api]
↓
[Approval gate: Slack confirmation from on-call]
↓
[AWS API MCP Server: CodeDeploy rollback to v2.14.2]
↓
[ServiceNow: auto-close incident with RCA]
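The approval-gate step in that flow can be sketched as a small wrapper around any write action. This is an illustration under assumptions, not a prescribed implementation: `RemediationRequest`, `ALLOWED_ACTIONS`, and `run_with_approval` are hypothetical names, and `ask_approval` and `execute` are injected callables standing in for the Slack confirmation and the CodeDeploy rollback.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationRequest:
    action: str                      # e.g. "codedeploy_rollback"
    target: str                      # e.g. the affected service
    params: dict = field(default_factory=dict)


# Hypothetical allow-list: anything else is refused before approval is asked.
ALLOWED_ACTIONS = {"codedeploy_rollback", "ecs_restart"}


def run_with_approval(request, ask_approval, execute):
    """Execute a write action only if it is allow-listed and approved.

    `ask_approval(request) -> bool` represents the on-call confirmation;
    `execute(request)` performs the change. Both are supplied by the caller.
    """
    if request.action not in ALLOWED_ACTIONS:
        return {"status": "rejected", "reason": "action not allow-listed"}
    if not ask_approval(request):
        return {"status": "denied", "reason": "on-call declined"}
    return {"status": "executed", "result": execute(request)}
```

Because the approver and the executor are passed in, the same gate works for a Slack button in production and a stub in tests.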
The Economics
Pricing is $0.0083 per agent-second. That sounds abstract, so let me make it concrete.
A typical investigation takes 8 minutes. That’s $3.98 per investigation.
For a team running 80 investigations and 100 chat queries per month: roughly $344/month.
For an enterprise with 10 agent spaces and 500 incidents/month: roughly $2,291/month.
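The per-investigation figure is straightforward arithmetic on the per-second rate, and a small helper makes it easy to plug in your own volumes. The function names are mine, and the 30 seconds per chat query in the usage note is an assumption, not a published number.

```python
RATE_PER_AGENT_SECOND = 0.0083  # USD, the rate quoted above


def investigation_cost(minutes):
    """Cost in USD of one investigation lasting `minutes`."""
    return minutes * 60 * RATE_PER_AGENT_SECOND


def monthly_cost(investigations, minutes_each, chat_queries=0, seconds_per_chat=0):
    """Monthly cost in USD for a given investigation and chat-query volume."""
    investigation_total = investigations * investigation_cost(minutes_each)
    chat_total = chat_queries * seconds_per_chat * RATE_PER_AGENT_SECOND
    return investigation_total + chat_total


print(round(investigation_cost(8), 2))  # 3.98
```

With 80 eight-minute investigations and 100 chat queries at an assumed 30 seconds each, `monthly_cost(80, 8, 100, 30)` lands close to the team figure above.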
But here’s the key: AWS Support credits offset most or all of this cost.
- Enterprise Support: 75% of your monthly Support charge becomes DevOps Agent credits
- Unified Operations: 100% credit
- Business Support+: 30% credit
The math for a typical enterprise:
An enterprise paying $10,000/month for Enterprise Support gets $7,500 in monthly DevOps Agent credits. At $0.0083 per agent-second ($29.88 per agent-hour), that covers roughly 250 hours of agent time per month. Most enterprises won’t exceed that.
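To run the same calculation for your own Support spend, a short sketch is enough. The credit percentages and the rate come from the figures in this section; the dictionary and function names are hypothetical.

```python
RATE_PER_AGENT_SECOND = 0.0083  # USD, as quoted in this section

# Share of the monthly Support charge returned as DevOps Agent credits.
CREDIT_SHARE = {
    "Enterprise Support": 0.75,
    "Unified Operations": 1.00,
    "Business Support+": 0.30,
}


def monthly_credits(plan, support_charge):
    """Credits in USD for a given plan and monthly Support charge."""
    return support_charge * CREDIT_SHARE[plan]


def covered_agent_hours(credits):
    """Agent-hours that a credit amount pays for at the per-second rate."""
    return credits / (RATE_PER_AGENT_SECOND * 3600)


credits = monthly_credits("Enterprise Support", 10_000)
print(credits, round(covered_agent_hours(credits)))
```

Swapping in your actual Support charge and plan shows quickly whether your expected investigation volume fits inside the credit.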
For many customers, DevOps Agent is effectively free.
Compare that to building equivalent functionality from scratch with Bedrock and AgentCore: you’re looking at 3-6 months of engineering time for the investigation engine alone, plus ongoing model costs, plus the operational overhead of maintaining the system. The managed service handles model updates, integration maintenance, and the topology graph.
The hybrid economics are compelling: DevOps Agent (effectively free via credits) handles investigation, while your custom agents (Bedrock/AgentCore, pay-per-invocation) handle only the remediation actions that fire maybe 10-20 times per day. Your custom agent costs stay minimal because it only activates on confirmed root causes, not on every alert.
A 2-month free trial is available: 10 agent spaces, 20 hours of investigations, 15 hours of evaluations, and 20 hours of on-demand SRE tasks per month. Enough to validate the approach without commitment.
Credits expire monthly — they don’t accumulate. Use them or lose them. This incentivizes consistent adoption rather than burst usage.
Agent Spaces: Team Isolation Done Right
One feature that matters at enterprise scale: Agent Spaces. Each space is an isolated environment with its own resource topology, integrations, and investigation history.
This maps to how real enterprises organize operations:
- Production platform team gets one space
- Application team A gets another
- Security operations gets a third
Each space learns the relationships and patterns of its domain. The payment service team’s agent knows that their Lambda connects to DynamoDB and SQS. The platform team’s agent knows the VPC topology and Transit Gateway routes.
This is the right isolation model. A single omniscient agent across 500 services creates noise. Domain-scoped agents create signal.
The Two-Month Validation Pattern
The 2-month free trial is long enough to run a structured validation.
Here’s the validation pattern I recommend:
Month 1: shadow mode. Point DevOps Agent at your production environment. Let it investigate every alert alongside your human on-call engineers. Compare its root cause analysis against what your team finds manually. Measure accuracy. Don’t act on its recommendations yet.
Month 2: selective delegation. For the categories where Month 1 showed high accuracy (typically: resource exhaustion, deployment regressions, dependency failures), start delegating initial triage to the agent. A human reviews each finding before any action is taken. Measure the time-to-resolution delta.
If accuracy exceeds 90% on your workload and MTTR improves by 50%+, you have a business case. If not, you’ve spent zero dollars learning that.
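The go/no-go check can be written down before the trial starts so nobody argues about it afterwards. Here is a minimal sketch, assuming you log one record per incident with a hypothetical `agent_correct` flag (did the agent's root cause match what the team found) and `resolution_minutes`, and that you know your pre-trial baseline MTTR.

```python
def evaluate_trial(records, baseline_mttr_minutes,
                   min_accuracy=0.90, min_mttr_gain=0.50):
    """Summarise shadow-mode results against the thresholds in the text.

    `records` is a list of dicts with 'agent_correct' (bool) and
    'resolution_minutes' (float). Field names are illustrative.
    """
    if not records:
        raise ValueError("no incidents recorded")
    if baseline_mttr_minutes <= 0:
        raise ValueError("baseline MTTR must be positive")
    accuracy = sum(1 for r in records if r["agent_correct"]) / len(records)
    mean_mttr = sum(r["resolution_minutes"] for r in records) / len(records)
    mttr_gain = (baseline_mttr_minutes - mean_mttr) / baseline_mttr_minutes
    return {
        "accuracy": accuracy,
        "mttr_gain": mttr_gain,
        # "exceeds 90%" is strict; "improves by 50%+" includes exactly 50%.
        "business_case": accuracy > min_accuracy and mttr_gain >= min_mttr_gain,
    }
```

Running it per incident category, rather than on the whole month, also tells you which categories are ready for delegation in Month 2.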
The “IT for IT” Starting Point
One framing I find useful: start with “IT for IT.” Use agentic AI to improve your IT operations before tackling business use cases.
Why this works:
- The feedback loop is tight (alert fires, agent investigates, you verify immediately)
- The blast radius is contained (worst case: a bad recommendation that you catch in review)
- The ROI is measurable (MTTR reduction, on-call burden reduction, incident recurrence rate)
- The data is already instrumented (CloudWatch, Datadog, your existing observability stack)
Business use cases (customer-facing chatbots, document processing, code generation) are harder to validate because the feedback loops are longer and the success criteria are fuzzier. Operations is concrete: the service was down, now it’s up, and here’s how long that took.
The Maturity Ladder
I see enterprises adopting this in three phases. Each phase is stable on its own — you don’t need to commit to phase 3 from day one.
Phase 1: Passive intelligence. DevOps Agent investigates alerts and delivers root cause findings to Slack/ServiceNow. Humans review and act. No automation, no custom agents. This is the 2-month trial phase. Value: faster MTTR, reduced context-switching during incidents, knowledge capture.
Phase 2: Guided remediation. DevOps Agent investigates and produces a remediation spec (e.g., “roll back deployment X, scale ASG Y to 4 instances”). A coding agent (Kiro, Claude Code, or a custom Strands agent) generates the implementation. Human approves with one click. Value: consistent remediation quality, reduced human error during 3 AM incidents.
Phase 3: Autonomous ops with guardrails. For well-understood incident categories (resource exhaustion, deployment regressions, certificate expiry), the system investigates and remediates without human intervention. Circuit breakers prevent cascading actions. Humans are notified post-facto. This is the “self-healing infrastructure” end state — but only for patterns where accuracy is proven above 95% and blast radius is bounded.
Most enterprises I work with are targeting Phase 2 by end of 2026. Phase 3 is a 2027 conversation for most. And that’s fine. Phase 1 alone delivers measurable ROI.
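The Phase 3 guardrails (proven accuracy above 95% and a circuit breaker against cascading actions) can be sketched in a few lines. This is one possible shape, not how any AWS service implements it: `RemediationBreaker` and `may_auto_remediate` are hypothetical, and the breaker here is a simple sliding-window limit on how many automated actions may fire.

```python
import time


class RemediationBreaker:
    """Blocks automated actions once too many have fired inside a time window."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock            # injectable so tests can control time
        self._timestamps = []

    def allow(self):
        """Return True and record the action if the window still has room."""
        now = self.clock()
        self._timestamps = [
            t for t in self._timestamps if now - t < self.window_seconds
        ]
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True


def may_auto_remediate(category, accuracy_by_category, breaker, threshold=0.95):
    """Allow autonomous action only for proven categories and an open window."""
    proven = accuracy_by_category.get(category, 0.0) > threshold
    # The breaker is consulted (and charged) only when the category qualifies.
    return proven and breaker.allow()
```

When `may_auto_remediate` returns False, the incident falls back to the Phase 2 path: a human approves the action.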
The 6 Use Cases You Don’t Build
Based on what’s available out-of-the-box, here are the six investigation patterns you can stop building custom solutions for:
- Root cause analysis. Alert fires, agent traces the chain through your dependency graph, identifies the actual root cause (not the symptom). Previously: 45-90 minutes of human investigation. Now: 8 minutes average.
- Alert correlation. Three alerts fire within 5 minutes. Are they the same incident? The agent correlates based on resource dependencies, timing, and historical patterns. Previously: experienced engineer’s intuition. Now: automated and consistent.
- P1 prevention. Pattern recognition across historical incidents identifies conditions that precede outages. “Last 3 times DynamoDB consumed capacity exceeded 80%, the service degraded within 2 hours.” Previously: post-mortem action items that nobody implements. Now: proactive alerts before the incident.
- Health checks. On-demand assessment of application health across all layers: compute, network, storage, dependencies. Previously: custom dashboards that drift out of date. Now: real-time queries in natural language.
- War-room coordination. Auto-populate a Slack channel with investigation timeline, affected services, deployment history, and mitigation options. Previously: someone manually copying links from 5 different consoles. Now: automatic context assembly.
- Post-incident review. Generate structured incident reports with timeline, root cause, impact, and prevention recommendations. Previously: a 2-hour meeting where everyone tries to remember what happened. Now: auto-generated draft that the team refines.
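Of the six, alert correlation is the easiest to illustrate in code. The sketch below groups alerts using only two of the signals mentioned in that item, timing and resource dependencies; historical patterns are left out. The alert fields (`ts` in seconds, `resource`) and the `depends_on` map are assumptions for the example, and this is a simplified heuristic rather than the agent's actual method.

```python
def correlate(alerts, depends_on, window_seconds=300):
    """Group alerts into incidents.

    An alert joins an existing incident when it fires within `window_seconds`
    of that incident's latest alert and its resource is the same as, or
    directly linked to, a resource already in the incident.
    """

    def linked(a, b):
        return a == b or b in depends_on.get(a, ()) or a in depends_on.get(b, ())

    incidents = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        for incident in incidents:
            recent = alert["ts"] - incident[-1]["ts"] <= window_seconds
            related = any(
                linked(alert["resource"], other["resource"]) for other in incident
            )
            if recent and related:
                incident.append(alert)
                break
        else:
            # No existing incident matched, so this alert starts a new one.
            incidents.append([alert])
    return incidents
```

Three alerts inside five minutes collapse into one incident only if the dependency map connects them. An unrelated service alerting at the same time stays separate.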
What Doesn’t Work Yet
Honest assessment of current limitations:
Multi-cloud depth. DevOps Agent works across AWS, other clouds, and on-prem environments, but its topology understanding is deepest on AWS. If your critical path runs through Azure Kubernetes Service, the investigation quality will be lower than for EKS.
Automated remediation guardrails. The agent can recommend actions and generate implementation specs, but the actual execution of write operations (restart services, modify configurations) requires either custom MCP server integration or handing the spec to a coding agent. The safety model is conservative by design — and that’s probably correct for now.
Custom runbook depth. If your team has 200 internal runbooks in Confluence that aren’t connected as an MCP server, the agent can’t reference them. Integration effort is non-zero.
Alert fatigue baseline. If your monitoring is already noisy (300 alerts/day, 90% false positives), the agent inherits that noise. Clean up your alerting hygiene first. DevOps Agent amplifies signal — it doesn’t create it from nothing.
Observability tool access. The agent needs credentials to your Dynatrace, Datadog, or Splunk instance. For enterprises with strict network segmentation (observability in a separate VPC, no internet egress), the plumbing to grant access can be non-trivial. Plan for this in your PoC.
Honest Recommendation
Don’t build what DevOps Agent gives you for free. Investigation, correlation, root cause analysis, proactive recommendations — these are commodity capabilities that a managed service handles better than your team’s custom implementation. The managed service has seen thousands of incident patterns across AWS customers. Your custom build has seen only yours.
Build the layer that’s unique to you: remediation automation with company-specific safety gates, custom integrations with proprietary systems, and domain workflows that encode your organization’s operational runbooks.
The architecture is: managed detection and investigation (DevOps Agent) feeding into custom remediation and orchestration (your Bedrock/AgentCore agents). MCP is the glue.
Start with the free trial. Shadow your on-call for one month. Let the data decide.
