RAG on AWS: Which Vector Store Is Right for You?
AWS now offers 9 different ways to store and search vectors for RAG workloads. This guide compares every option through the Well-Architected Framework to help you pick the right one.
Table of Contents
- The Problem
- The Solution
- How It Works
- The 9 Vector Storage Options at a Glance
- 1. Amazon S3 Vectors (Preview)
- 2. Amazon Aurora PostgreSQL with pgvector
- 3. Amazon OpenSearch Service
- 4. Amazon Bedrock Knowledge Bases (Managed RAG)
- 5. Amazon Neptune Analytics (GraphRAG)
- 6. Amazon MemoryDB
- 7. Amazon ElastiCache (Valkey 8.2+)
- 8. Amazon DocumentDB
- 9. Amazon Kendra (Enterprise Search)
- Choosing Your Vector Store: The Decision Tree
- Advanced Patterns
- What I Learned
- What’s Next
Your team has decided to implement RAG. You’ve picked a foundation model on Bedrock. Now comes the question that stalls most projects: where do you store your vectors?
AWS now offers at least 9 services with vector search capabilities — from a new S3-native option to graph-based retrieval with Neptune. Each has different trade-offs in cost, latency, operational complexity, and scalability. Picking the wrong one means rework later. This guide walks through every option and helps you choose.
The Problem
The RAG pattern is straightforward: embed your documents as vectors, store them, retrieve the most relevant chunks at query time, and inject them into your prompt. The hard part isn’t the pattern — it’s the infrastructure decision.
Here’s what makes this choice difficult:
- Too many options — S3 Vectors, Aurora pgvector, OpenSearch, Bedrock Knowledge Bases, Neptune, MemoryDB, ElastiCache, DocumentDB, Kendra. Each marketed as “the best” for GenAI.
- Different trade-offs — Some optimize for latency, others for cost. Some are fully managed, others give you full control. Some support hybrid search, others don’t.
- No single winner — The right answer depends on your existing stack, query patterns, scale, and team expertise.
- Well-Architected implications — Your vector store choice affects all six pillars: operational overhead, security posture, reliability, performance, cost trajectory, and resource efficiency.
Without a structured comparison, teams either default to whatever the first tutorial used (usually OpenSearch) or spend weeks evaluating options that weren’t right for them in the first place.
The Solution
We’ll evaluate every AWS vector storage option through the Well-Architected Generative AI Lens — the framework AWS published specifically for GenAI workloads. For each service, we assess operational excellence, security, reliability, performance, cost, and sustainability. Then we provide a decision tree to shortcut the choice.
How It Works
The 9 Vector Storage Options at a Glance
Before diving deep, here’s the landscape:
| Service | Max Dimensions | Latency | Hybrid Search | Managed Level | Pricing Model |
|---|---|---|---|---|---|
| S3 Vectors | 4,096 | ~100ms | No | Serverless | Pay-per-request |
| Aurora PostgreSQL (pgvector) | 2,000 / 4,000 (halfvec) | Single-digit ms | Yes (SQL + GIN) | Semi-managed | Instance-based |
| OpenSearch Service | 16,000 | Sub-100ms | Yes (BM25 + k-NN) | Managed / Serverless | Instance or OCU-based |
| Bedrock Knowledge Bases | Per embedding model | Medium | Yes (built-in) | Fully managed | Pay-per-query |
| Neptune Analytics | Per embedding model | Low ms | Graph + Vector | Managed | NCU-hour |
| MemoryDB | 32,768 | Sub-millisecond | No | Managed | Node + data written |
| ElastiCache (Valkey 8.2+) | ~32,768 | Microsecond | No | Managed | Node-based |
| DocumentDB | 2,000 (indexed) / 16,000 | Low ms | No | Managed | Instance-based |
| Kendra | N/A (managed) | Sub-second | Yes (NL + keyword) | Fully managed | Index + connector |
1. Amazon S3 Vectors (Preview)
The newest option, launched in preview in July 2025. S3 Vectors lets you store and query embeddings directly in S3 — no separate vector database needed.
How it works: You create a “vector bucket” in S3, push vectors with metadata, and query by similarity. S3 manages the indexing automatically. Each index can hold up to 2 billion vectors, and each bucket supports up to 10,000 indexes. Distance metrics: cosine and Euclidean. Metadata limited to 1KB per vector (35 keys max).
Best for: Teams with existing S3 data pipelines who want the simplest possible RAG setup. Cost-sensitive workloads where you’re optimizing for dollars-per-query, not milliseconds-per-query. AWS claims up to 90% lower cost than traditional vector databases.
```python
import boto3

s3vectors = boto3.client('s3vectors')

# Create a vector bucket, then an index inside it
s3vectors.create_vector_bucket(vectorBucketName='my-rag-vectors')
s3vectors.create_index(
    vectorBucketName='my-rag-vectors',
    indexName='product-docs',
    dataType='float32',
    dimension=1536,  # must match your embedding model
    distanceMetric='cosine'
)

# Put vectors
s3vectors.put_vectors(
    vectorBucketName='my-rag-vectors',
    indexName='product-docs',
    vectors=[
        {
            'key': 'doc-001',
            'data': {'float32': embedding_vector},
            'metadata': {'source': 'product-manual.pdf', 'page': 12}
        }
    ]
)

# Query
results = s3vectors.query_vectors(
    vectorBucketName='my-rag-vectors',
    indexName='product-docs',
    queryVector={'float32': query_embedding},
    topK=5,
    returnMetadata=True
)
```
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | High | Zero infrastructure to manage. Serverless. |
| Security | High | Inherits S3 security model (IAM, bucket policies, encryption at rest/transit). |
| Reliability | High | S3 durability (11 9’s). |
| Performance | Medium | Not designed for sub-millisecond queries. Adequate for most RAG workloads. |
| Cost | High | Pay-per-request. No idle cost. Ideal for variable or low-volume workloads. |
| Sustainability | High | No over-provisioned infrastructure. |
Trade-off: Simplest option with lowest operational burden, but limited query sophistication — no hybrid search, no complex filtering at query time. Still in preview.
2. Amazon Aurora PostgreSQL with pgvector
If your team already runs PostgreSQL, this is the path of least resistance. The pgvector extension (v0.8.1) adds vector columns, indexing, and similarity search to your existing database.
How it works: Add a vector column to your table, create an HNSW or IVFFlat index, and query using distance operators.
Key specs:
- Dimensions: Up to 2,000 (vector type), 4,000 (halfvec), 64,000 (bit)
- Index types: HNSW (better recall, slower to build) and IVFFlat (faster build, lower recall)
- Distance functions: L2 (<->), inner product (<#>), cosine (<=>), L1 (<+>), Hamming (<~>), Jaccard (<%>)
```sql
-- Enable pgvector
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding vector(1536)  -- dimension matches your embedding model
);

-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Query: find 5 most similar documents
SELECT id, title, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
```
Best for: Teams already on PostgreSQL who want vectors alongside relational data. Use cases where you need JOINs between vector results and business data (e.g., filter by tenant, date range, product category).
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | You manage the database (even if Aurora handles patching). Need to tune index parameters. |
| Security | High | VPC, IAM, encryption, audit logging — full Aurora security model. |
| Reliability | High | Aurora’s multi-AZ, automated backups, point-in-time recovery. |
| Performance | Medium-High | HNSW gives good recall/latency trade-off. But vector workloads compete with relational queries for resources. |
| Cost | Medium | Instance-based pricing. You’re paying for the database whether you query vectors or not. Good if already running Aurora. Expensive if deployed just for vectors. |
| Sustainability | Medium | May over-provision to handle mixed workloads. |
Trade-off: Minimal new infrastructure if you’re already on PostgreSQL. But vector-heavy workloads can starve relational queries — consider read replicas dedicated to vector search.
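To illustrate the JOIN advantage, here is a sketch of a tenant-scoped similarity query. The tenants table, column names, and 90-day window are illustrative, not a prescribed schema:

```sql
-- Relational predicates narrow the candidate set first;
-- vector distance then orders the survivors.
SELECT d.id, d.title, 1 - (d.embedding <=> $1) AS similarity
FROM documents d
JOIN tenants t ON t.id = d.tenant_id
WHERE t.slug = 'acme-corp'
  AND d.created_at > now() - interval '90 days'
ORDER BY d.embedding <=> $1
LIMIT 5;
```

One caveat: heavily filtered HNSW queries can return fewer rows than requested, since the index scan may exhaust its candidates before the filter is satisfied. pgvector 0.8 adds iterative index scans to mitigate this.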
3. Amazon OpenSearch Service
OpenSearch is the strongest option for hybrid search — combining traditional keyword matching (BM25) with semantic vector search (k-NN) in a single query. This is a significant advantage for RAG quality.
How it works: OpenSearch stores documents with both text fields and vector fields. At query time, you can run BM25 and k-NN simultaneously, then combine scores.
Key specs:
- Dimensions: Up to 16,000
- Algorithms: HNSW (via the nmslib, Faiss, and Lucene engines) and IVF (Faiss only)
- Hybrid search: Native support for combining keyword + semantic results
- Serverless option: OpenSearch Serverless with OCU-based auto-scaling
```python
from opensearchpy import OpenSearch, RequestsHttpConnection

client = OpenSearch(
    hosts=[{'host': 'your-domain.us-east-1.es.amazonaws.com', 'port': 443}],
    use_ssl=True,
    connection_class=RequestsHttpConnection
)

# Create index with k-NN enabled
index_body = {
    "settings": {
        "index.knn": True,
        "index.knn.algo_param.ef_search": 512
    },
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 512, "m": 16}
                }
            }
        }
    }
}
client.indices.create(index='documents', body=index_body)

# Hybrid search: keyword + vector. Requires a search pipeline with a
# normalization processor to combine the two score distributions.
hybrid_query = {
    "size": 5,
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"content": "how to configure VPC peering"}},
                {"knn": {"embedding": {"vector": query_embedding, "k": 5}}}
            ]
        }
    }
}
results = client.search(
    index='documents',
    body=hybrid_query,
    params={'search_pipeline': 'hybrid-pipeline'}
)
```
Best for: RAG workloads where retrieval quality matters most. Hybrid search consistently outperforms pure vector search because it catches both semantic matches and exact keyword matches. Also ideal if you’re already using OpenSearch for log analytics.
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Managed service, but index management, shard tuning, and capacity planning require expertise. Serverless simplifies this. |
| Security | High | VPC, fine-grained access control, encryption, SAML/OIDC. |
| Reliability | High | Multi-AZ, automated snapshots, blue/green deployments. |
| Performance | High | Hybrid search gives best retrieval quality. Sub-100ms latency typical. |
| Cost | Medium-Low | Instance-based can be expensive at scale. Serverless (OCU-based) carries a minimum charge of a few hundred dollars per month even when idle. |
| Sustainability | Medium | Over-provisioning common with instance-based. Serverless improves this. |
Trade-off: Best retrieval quality through hybrid search, but highest operational complexity and cost floor. OpenSearch Serverless reduces ops but has a minimum cost that hurts small workloads.
4. Amazon Bedrock Knowledge Bases (Managed RAG)
If you want RAG without managing any vector infrastructure, this is it. Bedrock Knowledge Bases handles the entire pipeline: data ingestion, chunking, embedding, storage, and retrieval.
How it works: Point it at your data sources (S3, web pages, Confluence, SharePoint, Salesforce), pick a chunking strategy, and Bedrock does the rest. It creates and manages the vector store behind the scenes.
Key features:
- Supported vector stores (8 total): OpenSearch Serverless, OpenSearch Managed, S3 Vectors, Aurora PostgreSQL, Neptune Analytics, Pinecone, Redis Enterprise, MongoDB Atlas
- Data sources: S3, Web Crawler, Confluence, SharePoint, Salesforce, and more
- Chunking: Fixed-size, semantic, hierarchical
- Reranking: Built-in with Cohere Rerank and Amazon reranker
- File types: PDF, DOCX, HTML, CSV, MD, TXT, XLS (max 50MB per file)
```python
import boto3

bedrock_agent = boto3.client('bedrock-agent')

# Create a knowledge base
kb = bedrock_agent.create_knowledge_base(
    name='product-documentation',
    roleArn='arn:aws:iam::123456789012:role/BedrockKBRole',
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0'
        }
    },
    storageConfiguration={
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': 'arn:aws:aoss:us-east-1:123456789012:collection/xyz',
            'fieldMapping': {
                'vectorField': 'embedding',
                'textField': 'text',
                'metadataField': 'metadata'
            }
        }
    }
)

# Query the knowledge base with hybrid search
bedrock_runtime = boto3.client('bedrock-agent-runtime')
response = bedrock_runtime.retrieve(
    knowledgeBaseId=kb['knowledgeBase']['knowledgeBaseId'],
    retrievalQuery={'text': 'How do I configure cross-account access?'},
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 10,
            'overrideSearchType': 'HYBRID'
        }
    }
)
```
Best for: Teams that want the fastest path to production RAG. Non-expert teams. Prototyping. Any workload where you’d rather not manage chunking pipelines and vector databases yourself.
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Highest | Fully managed. No infrastructure, no index tuning, no chunking code. |
| Security | High | IAM, encryption, data source permissions. Guardrails integration for content filtering. |
| Reliability | High | Depends on underlying vector store. Bedrock handles sync and retries. |
| Performance | Medium | Adds a layer of abstraction. Reranking improves quality but adds latency. |
| Cost | Medium | Pay per retrieval query + underlying vector store costs. Predictable but not the cheapest at scale. |
| Sustainability | High | No over-provisioned infrastructure. Scales with demand. |
Trade-off: Fastest time-to-value, lowest operational burden — but you trade control. Custom chunking logic, specialized filtering, or non-standard retrieval patterns may push you toward a self-managed approach.
5. Amazon Neptune Analytics (GraphRAG)
GraphRAG is the approach for domains where relationships between entities matter as much as the content itself. Neptune Analytics combines knowledge graphs with vector similarity search.
How it works: Store entities as graph nodes with vector embeddings. Query using openCypher with vector extensions. Retrieval follows graph relationships (e.g., “find documents related to Entity X and similar to this query vector”).
Best for: Entity-rich domains — healthcare (drug interactions, patient records), finance (transaction networks, regulatory entities), legal (case law citations, precedent chains). Any domain where “connected to” matters as much as “similar to.”
```cypher
// Top-K similarity search by embedding, then constrain the hits
// to documents connected to a specific entity in the graph
CALL neptune.algo.vectors.topKByEmbedding($queryVector, {topK: 25})
YIELD node, score
MATCH (node:Document)-[:REFERENCES]->(:Regulation {name: 'GDPR'})
RETURN node.title AS title, node.content AS content, score
ORDER BY score
LIMIT 5
```
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Graph modeling requires specialized expertise. Not a simple “dump vectors” approach. |
| Security | High | VPC, IAM, encryption. |
| Reliability | High | Managed service with automated backups. |
| Performance | Medium-High | Graph traversal + vector search is powerful but adds query complexity. |
| Cost | Medium | Memory-hour pricing. Costs scale with graph size. |
| Sustainability | Medium | Right-size memory allocation for your graph. |
Trade-off: Highest retrieval quality for relationship-heavy domains, but highest learning curve. You need to model a knowledge graph, not just embed documents.
6. Amazon MemoryDB
MemoryDB brings vector search with sub-millisecond latency and durability guarantees — the key differentiator from ElastiCache. Every write is persisted to a transaction log, so your vectors survive restarts and failures.
How it works: Redis-compatible API with HNSW vector indexing. Store vectors as Redis data structures, query with FT.SEARCH. Supports up to 32,768 dimensions, max 10 indexes per cluster, 50 fields per index. HNSW parameters: M up to 512, EF_CONSTRUCTION up to 4,096.
Limitation: Vector search is currently single-shard only — no horizontal scaling. Vertical and replica scaling are supported. Must be enabled at cluster creation (R6g, R7g, T4g node types only).
Best for: Real-time RAG where latency is critical — chatbots, live recommendations, session-based context retrieval. Workloads that need both speed and durability.
```python
import numpy as np
import redis

r = redis.Redis(host='your-memorydb-endpoint', port=6379, ssl=True)

# Create an HNSW vector index over hashes with the 'doc:' key prefix
r.execute_command(
    'FT.CREATE', 'doc_index',
    'ON', 'HASH',
    'PREFIX', '1', 'doc:',
    'SCHEMA',
    'content', 'TEXT',
    'embedding', 'VECTOR', 'HNSW', '6',
    'TYPE', 'FLOAT32',
    'DIM', '1536',
    'DISTANCE_METRIC', 'COSINE'
)

# Vectors are passed to FT.SEARCH as packed float32 bytes
query_embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()

# KNN query: 5 nearest neighbors, lowest distance first
results = r.execute_command(
    'FT.SEARCH', 'doc_index',
    '*=>[KNN 5 @embedding $query_vec AS score]',
    'PARAMS', '2', 'query_vec', query_embedding_bytes,
    'RETURN', '2', 'content', 'score',
    'SORTBY', 'score',
    'DIALECT', '2'
)
```
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Node management, cluster sizing. Redis expertise helpful. |
| Security | High | VPC, TLS, IAM, encryption at rest. |
| Reliability | High | Durable — transaction log persists every write. Multi-AZ. |
| Performance | High | Sub-millisecond latency. |
| Cost | Medium-High | Node-based + $0.20/GB data written (first 10TB/mo free). Memory-intensive. |
| Sustainability | Medium | Must provision for peak. Data tiering with R6gd nodes helps. |
Trade-off: Fastest durable vector store on AWS. But you’re paying for memory-optimized instances, which gets expensive at scale.
7. Amazon ElastiCache (Valkey 8.2+)
ElastiCache with Valkey 8.2+ offers vector search at microsecond latency — the absolute lowest on AWS. The catch: it’s an in-memory cache, not a durable store.
How it works: Same Redis-compatible API as MemoryDB, but without transaction log durability. If a node restarts, you lose data unless you’ve configured snapshots.
Best for: Semantic caching (cache frequent RAG queries to avoid re-embedding and re-retrieval), ultra-low-latency lookups, and real-time recommendation layers where vectors can be rebuilt from a source of truth.
Key differentiator from MemoryDB:
- ElastiCache = speed over durability (microsecond, but ephemeral)
- MemoryDB = speed with durability (sub-millisecond, persistent)
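The semantic-caching pattern can be sketched as follows. The index name, key scheme, and 0.95 threshold are illustrative, and the client argument is any Redis-compatible client (redis-py against a Valkey 8.2+ endpoint is assumed):

```python
import hashlib
import struct

SIM_THRESHOLD = 0.95  # treat queries above this cosine similarity as "the same"

def pack_vector(embedding):
    """Serialize a float list as packed float32 bytes, the format FT.SEARCH expects."""
    return struct.pack(f'<{len(embedding)}f', *embedding)

def is_cache_hit(cosine_distance, threshold=SIM_THRESHOLD):
    """FT.SEARCH with COSINE returns a distance; similarity = 1 - distance."""
    return (1.0 - cosine_distance) >= threshold

def cache_key(embedding):
    """Deterministic cache key derived from the query embedding."""
    return 'q:' + hashlib.sha256(pack_vector(embedding)).hexdigest()[:16]

def lookup(client, embedding):
    """Return a cached answer if a semantically similar query was seen before."""
    res = client.execute_command(
        'FT.SEARCH', 'query_cache',
        '*=>[KNN 1 @embedding $v AS dist]',
        'PARAMS', '2', 'v', pack_vector(embedding),
        'RETURN', '2', 'answer', 'dist',
        'DIALECT', '2'
    )
    if res[0] == 0:
        return None
    fields = dict(zip(res[2][::2], res[2][1::2]))
    dist = float(fields[b'dist'])
    return fields[b'answer'].decode() if is_cache_hit(dist) else None

def store(client, embedding, answer, ttl=3600):
    """Cache an answer; a TTL is fine since vectors are rebuildable from source."""
    key = cache_key(embedding)
    client.hset(key, mapping={'embedding': pack_vector(embedding), 'answer': answer})
    client.expire(key, ttl)
```

Because everything in the cache can be regenerated, losing a node costs you latency, not data.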
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Similar to MemoryDB. Valkey 8.2+ required for vector search. |
| Security | High | VPC, TLS, IAM, encryption. |
| Reliability | Low-Medium | Not durable by default. Snapshot-based recovery. Data loss on node failure. |
| Performance | Highest | Microsecond latency. Up to 99% recall. |
| Cost | Medium | Node-based. No additional vector search cost. Cheaper than MemoryDB. |
| Sustainability | Medium | Must provision for peak. |
Trade-off: If you can tolerate data loss (because vectors are derivable from source documents), this gives you the fastest possible retrieval. Use it as a caching layer in front of a durable vector store.
8. Amazon DocumentDB
DocumentDB added vector search for teams already running MongoDB-compatible workloads on AWS.
How it works: HNSW or IVFFlat indexing on vector fields within documents. MongoDB-compatible query syntax with $vectorSearch aggregation stage (DocumentDB 8.0+) or $search (5.0+).
Key specs:
- Indexed vectors: Up to 2,000 dimensions
- Stored (unindexed) vectors: Up to 16,000 dimensions
- Index types: HNSW (M: max 100, efConstruction: max 1,000) and IVFFlat
- Distance metrics: Euclidean, Cosine, Dot Product
Best for: Teams already on DocumentDB or migrating from MongoDB who want to add RAG without introducing a new database.
```javascript
// Create vector index
db.documents.createIndex(
  { "embedding": "vectorSearch" },
  {
    "name": "vector_index",
    "vectorOptions": {
      "type": "hnsw",
      "dimensions": 1536,
      "similarity": "cosine",
      "m": 16,
      "efConstruction": 200
    }
  }
);

// Query
db.documents.aggregate([
  {
    "$vectorSearch": {
      "queryVector": queryEmbedding,
      "path": "embedding",
      "numCandidates": 100,
      "limit": 5,
      "index": "vector_index"
    }
  }
]);
```
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Managed service but requires index management. |
| Security | High | VPC, TLS, KMS encryption, IAM. |
| Reliability | High | Multi-AZ, automated backups. |
| Performance | Medium | Good for moderate scale. Not optimized for vector-heavy workloads. |
| Cost | Medium | Instance-based. Cost-effective if already running DocumentDB. |
| Sustainability | Medium | Shared instance with document workloads. |
Trade-off: Convenient if you’re already on DocumentDB. But it’s not a vector-first database — if vectors are your primary workload, dedicated options perform better.
9. Amazon Kendra (Enterprise Search)
Kendra isn’t a vector database — it’s an ML-powered enterprise search service. But it solves the same problem for a different audience: retrieval from enterprise knowledge for RAG.
How it works: Kendra indexes documents from 40+ connectors (S3, SharePoint, Confluence, ServiceNow, Salesforce, databases, web crawlers) and provides natural language query capabilities. No embedding pipeline needed.
Best for: Enterprise knowledge retrieval where the data lives across many SaaS tools and you don’t want to build custom embedding pipelines. Compliance-sensitive environments where Kendra’s built-in access control (respecting source system permissions) matters.
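A minimal retrieval call might look like the sketch below. The index ID and region are placeholders; the Retrieve API returns passage-sized excerpts, which suit RAG prompts better than the Query API's short answer extracts:

```python
def retrieve_passages(index_id, question, top_k=5, region='us-east-1'):
    """Fetch passage-level results from a Kendra index for use as RAG context."""
    import boto3  # imported here so the sketch stays self-contained
    kendra = boto3.client('kendra', region_name=region)
    resp = kendra.retrieve(
        IndexId=index_id,
        QueryText=question,
        PageSize=top_k,
    )
    return [(item['DocumentTitle'], item['Content'])
            for item in resp['ResultItems']]
```

Note there is no embedding step anywhere: Kendra handles relevance ranking internally.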
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | High | Fully managed. Connectors handle sync automatically. |
| Security | High | Respects source system ACLs. Built-in access control per user/group. |
| Reliability | High | Managed, multi-AZ. |
| Performance | Medium | Higher latency than purpose-built vector stores. Optimized for precision, not speed. |
| Cost | Low | Index + connector pricing can be expensive. Starts at ~$810/month for Developer edition. |
| Sustainability | High | Serverless, scales with usage. |
Trade-off: Best out-of-box enterprise search with built-in security and connectors. But expensive, and less flexible than building your own retrieval pipeline.
Choosing Your Vector Store: The Decision Tree
Here’s how to shortcut the decision:
Start here: Do you want to manage vector infrastructure?
- No → Bedrock Knowledge Bases. It handles everything. Start here, graduate later if needed.
Yes — then what’s your priority?
- Hybrid search quality → OpenSearch. Combine keyword + semantic for best retrieval accuracy.
- Lowest latency (cache) → ElastiCache. Microsecond retrieval for semantic caching layers.
- Lowest latency (durable) → MemoryDB. Sub-millisecond with durability guarantees.
- Simplest, cheapest → S3 Vectors. Pay-per-query, zero infra. Good for getting started.
- Already on PostgreSQL → Aurora pgvector. Vectors alongside relational data.
- Already on MongoDB → DocumentDB. Add vectors without a new database.
- Entity relationships matter → Neptune Analytics. GraphRAG for complex domains.
- Enterprise data across SaaS tools → Kendra. 40+ connectors, built-in ACL.
Advanced Patterns
Hybrid Search (Keyword + Semantic)
Pure vector search misses exact matches. If a user asks about “error code E-4012”, a semantic search might return results about generic error handling instead of the specific code. Hybrid search solves this by combining BM25 keyword matching with k-NN vector similarity.
OpenSearch is the only AWS-native service with built-in hybrid search. If you’re using another vector store, you can achieve hybrid search by:
- Running a keyword search in your primary database
- Running a vector search in your vector store
- Merging and reranking results using Bedrock’s reranker
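Before (or instead of) a model-based reranker, a rank-level merge such as reciprocal rank fusion (RRF) is a cheap way to combine the two lists. A minimal sketch, with made-up document IDs:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc IDs (best first) by summing 1/(k + rank) scores."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ['doc-7', 'doc-2', 'doc-9']    # e.g. from a SQL full-text search
semantic_hits = ['doc-2', 'doc-4', 'doc-7']   # e.g. from a vector store
merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# doc-2 and doc-7 appear in both lists, so they rank first
```

RRF needs no score normalization, which is exactly the hard part of merging BM25 and cosine scores by hand.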
Chunking Strategies
How you chunk your documents impacts retrieval quality more than which vector store you use.
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple documents, consistent structure |
| Semantic | Split at topic boundaries using NLP | Long-form content with varying topics |
| Hierarchical | Parent chunks (summaries) + child chunks (details) | Technical documentation, manuals |
| Sentence-window | Index individual sentences, retrieve surrounding context | Precise retrieval from dense text |
Bedrock Knowledge Bases supports fixed-size, semantic, and hierarchical chunking natively. For other vector stores, you’ll build this in your ingestion pipeline.
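For a self-built pipeline, fixed-size chunking with overlap is the usual starting point. A token-level sketch, where the sizes are typical defaults rather than prescriptions:

```python
def chunk_fixed(tokens, size=200, overlap=40):
    """Split a token list into chunks of `size`, each sharing `overlap` tokens
    with its predecessor so sentences straddling a boundary keep their context."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(500)]
chunks = chunk_fixed(tokens)
# 3 chunks covering [0:200], [160:360], [320:500]
```

The same skeleton extends to semantic chunking by replacing the fixed step with split points from a sentence or topic segmenter.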
Reranking
First-pass retrieval (vector similarity) gets you candidates. Reranking uses a cross-encoder model to score each candidate against the actual query — dramatically improving relevance.
On AWS, you have two options:
| Reranker | Model ID | Pricing | Regions |
|---|---|---|---|
| Amazon Rerank 1.0 | amazon.rerank-v1:0 | $1.00 / 1K queries | ap-northeast-1, ca-central-1, eu-central-1, us-west-2 |
| Cohere Rerank 3.5 | cohere.rerank-v3-5:0 | $2.00 / 1K queries | ap-northeast-1, ca-central-1, eu-central-1, us-east-1, us-west-2 |
Watch out: Amazon Rerank 1.0 is not available in us-east-1. Use Cohere Rerank 3.5 if you’re in that region.
Always retrieve more candidates than you need (e.g., top 20), rerank, then use the top 5 in your prompt.
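The retrieve-then-rerank flow can be sketched with the Rerank API in bedrock-agent-runtime. The model ARN and region here are assumptions; check availability in your region against the table above:

```python
def rerank_candidates(query, candidate_texts, top_n=5, region='us-west-2'):
    """Score retrieved candidates against the query with a cross-encoder reranker."""
    import boto3  # imported here so the sketch stays self-contained
    client = boto3.client('bedrock-agent-runtime', region_name=region)
    response = client.rerank(
        queries=[{'type': 'TEXT', 'textQuery': {'text': query}}],
        sources=[
            {'type': 'INLINE',
             'inlineDocumentSource': {'type': 'TEXT',
                                      'textDocument': {'text': text}}}
            for text in candidate_texts
        ],
        rerankingConfiguration={
            'type': 'BEDROCK_RERANKING_MODEL',
            'bedrockRerankingConfiguration': {
                'modelConfiguration': {
                    'modelArn': f'arn:aws:bedrock:{region}::foundation-model/amazon.rerank-v1:0'
                },
                'numberOfResults': top_n,
            },
        },
    )
    # Each result carries the original candidate index plus a relevance score
    return [(r['index'], r['relevanceScore']) for r in response['results']]
```

Feed it the top 20 retrieval hits and keep the indices it returns.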
Prompt Caching on Bedrock
If your RAG context is large and repeated across queries (e.g., a system prompt with 50 pages of product documentation), Bedrock’s prompt caching reduces both latency and cost by caching the processed context prefix across invocations.
Add cachePoint objects in your Converse API calls (or cache_control for Claude via InvokeModel). Cached tokens are read at a reduced rate, and cache hits don’t count against rate limits.
| Model | Min Tokens per Checkpoint | Max Checkpoints | TTL |
|---|---|---|---|
| Claude Sonnet 4.5 | 1,024 | 4 | 5m or 1h |
| Claude Sonnet 4 | 1,024 | 4 | 5m |
| Claude 3.7 Sonnet | 1,024 | 4 | 5m |
| Amazon Nova (all) | 1,000 | 4 | 5m |
This is particularly effective for multi-turn RAG conversations where the system prompt + document context stays constant across turns.
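With the Converse API, the cache checkpoint is just an extra block in the system prompt. A sketch, where `client` is a bedrock-runtime boto3 client and the model ID is whichever cache-capable model from the table you use:

```python
def ask_with_cached_context(client, model_id, system_text, doc_context, question):
    """Place the large, repeated context before a cachePoint so subsequent
    calls with the same prefix read it from cache at a reduced token rate."""
    return client.converse(
        modelId=model_id,
        system=[
            {'text': system_text},
            {'text': doc_context},                # the big, stable part
            {'cachePoint': {'type': 'default'}},  # everything above is cached
        ],
        messages=[{'role': 'user', 'content': [{'text': question}]}],
    )
```

Only the prefix up to the cachePoint must be byte-identical across calls, so keep per-turn content after it.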
Multi-Service Architecture
Production RAG systems often combine multiple services:
User Query
↓
[Bedrock KB] → Retrieves top 20 chunks from [OpenSearch]
↓
[Bedrock Reranker] → Reranks to top 5
↓
[ElastiCache] → Cache check (have we answered this before?)
↓
[Bedrock FM] → Generate answer with reranked context
↓
[ElastiCache] → Cache the response for future queries
This pattern uses OpenSearch for hybrid retrieval quality, ElastiCache for response caching, and Bedrock for orchestration and generation.
What I Learned
- Hybrid search is underrated — Most RAG tutorials skip it, but combining keyword + semantic retrieval consistently produces better results than vector search alone. OpenSearch is the only AWS-native service that does this in a single query.
- Start managed, graduate to custom — Bedrock Knowledge Bases gets you to production fastest. You can always migrate to a self-managed vector store later once you understand your query patterns and scale requirements.
- Your chunking strategy matters more than your vector store — Teams spend weeks debating Aurora vs OpenSearch, then use naive fixed-size chunking and wonder why retrieval quality is poor. Invest in chunking first.
- Caching is the hidden cost saver — Adding ElastiCache as a semantic cache in front of your vector store can reduce retrieval costs by 60-80% for workloads with repetitive queries (support chatbots, FAQ assistants).
- The Well-Architected GenAI Lens is worth reading — It provides specific guidance on vector store optimization and data retrieval performance that goes beyond what any single blog post can cover.
What’s Next
- Benchmark latency and cost across S3 Vectors, OpenSearch Serverless, and Aurora pgvector with a real dataset
- Test Bedrock Knowledge Bases hierarchical chunking vs custom semantic chunking pipeline
- Build a reference architecture combining OpenSearch (retrieval) + ElastiCache (caching) + Bedrock (orchestration)
- Evaluate DynamoDB vector search when it moves to GA — single-table design with vectors could simplify many architectures
References:
- AWS Well-Architected Generative AI Lens
- Amazon Bedrock Knowledge Bases
- Bedrock Knowledge Bases — Vector Store Setup
- Bedrock Reranking
- Bedrock Prompt Caching
- OpenSearch k-NN Search
- OpenSearch Serverless Vector Search
- pgvector on GitHub
- ElastiCache Vector Search
- MemoryDB Vector Search
- MemoryDB Vector Search Limits
- Neptune Analytics — Vector Similarity
- Amazon S3 Vectors
- Amazon DocumentDB Vector Search
- Amazon Kendra
- The Role of Vector Datastores in GenAI Applications
