RAG on AWS: Which Vector Store Is Right for You?
AWS now offers 9 different ways to store and search vectors for RAG workloads. This guide compares every option through the Well-Architected Framework to help you pick the right one.
Table of Contents
- The Problem
- The Solution
- How It Works
- The 9 Vector Storage Options at a Glance
- 1. Amazon S3 Vectors (Preview)
- 2. Amazon Aurora PostgreSQL with pgvector
- 3. Amazon OpenSearch Service
- 4. Amazon Bedrock Knowledge Bases (Managed RAG)
- 5. Amazon Neptune Analytics (GraphRAG)
- 6. Amazon MemoryDB
- 7. Amazon ElastiCache (Valkey 8.2+)
- 8. Amazon DocumentDB
- 9. Amazon Kendra (Enterprise Search)
- Choosing Your Vector Store: The Decision Tree
- Advanced Patterns
- What I Learned
- What’s Next
Your team has decided to implement RAG. You’ve picked a foundation model on Bedrock. Now comes the question that stalls most projects: where do you store your vectors?
AWS now offers at least 9 services with vector search capabilities — from a new S3-native option to graph-based retrieval with Neptune. Each has different trade-offs in cost, latency, operational complexity, and scalability. Picking the wrong one means rework later. This guide walks through every option and helps you choose.
The Problem
The RAG pattern is straightforward: embed your documents as vectors, store them, retrieve the most relevant chunks at query time, and inject them into your prompt. The hard part isn’t the pattern — it’s the infrastructure decision.
Here’s what makes this choice difficult:
- Too many options — S3 Vectors, Aurora pgvector, OpenSearch, Bedrock Knowledge Bases, Neptune, MemoryDB, ElastiCache, DocumentDB, Kendra. Each marketed as “the best” for GenAI.
- Different trade-offs — Some optimize for latency, others for cost. Some are fully managed, others give you full control. Some support hybrid search, others don’t.
- No single winner — The right answer depends on your existing stack, query patterns, scale, and team expertise.
- Well-Architected implications — Your vector store choice affects all six pillars: operational overhead, security posture, reliability, performance, cost trajectory, and resource efficiency.
Without a structured comparison, teams either default to whatever the first tutorial used (usually OpenSearch) or spend weeks evaluating options that weren’t right for them in the first place.
The Solution
We’ll evaluate every AWS vector storage option through the Well-Architected Generative AI Lens — the framework AWS published specifically for GenAI workloads. For each service, we assess operational excellence, security, reliability, performance, cost, and sustainability. Then we provide a decision tree to shortcut the choice.
How It Works
The 9 Vector Storage Options at a Glance
Before diving deep, here’s the landscape:
| Service | Max Dimensions | Latency | Hybrid Search | Managed Level | Pricing Model |
|---|---|---|---|---|---|
| S3 Vectors | 4,096 | ~100ms | No | Serverless | Pay-per-request |
| Aurora PostgreSQL (pgvector) | 2,000 / 4,000 (halfvec) | Single-digit ms | Yes (SQL + GIN) | Semi-managed | Instance-based |
| OpenSearch Service | 16,000 | Sub-100ms | Yes (BM25 + k-NN) | Managed / Serverless | Instance or OCU-based |
| Bedrock Knowledge Bases | Per embedding model | Medium | Yes (built-in) | Fully managed | Pay-per-query |
| Neptune Analytics | Per embedding model | Low ms | Graph + Vector | Managed | NCU-hour |
| MemoryDB | 32,768 | Sub-millisecond | No | Managed | Node + data written |
| ElastiCache (Valkey 8.2+) | ~32,768 | Microsecond | No | Managed | Node-based |
| DocumentDB | 2,000 (indexed) / 16,000 | Low ms | No | Managed | Instance-based |
| Kendra | N/A (managed) | Sub-second | Yes (NL + keyword) | Fully managed | Index + connector |
1. Amazon S3 Vectors (Preview)
The newest option, launched in preview in July 2025. S3 Vectors lets you store and query embeddings directly in S3 — no separate vector database needed.
How it works: You create a “vector bucket” in S3, push vectors with metadata, and query by similarity. S3 manages the indexing automatically. Each index can hold up to 2 billion vectors, and each bucket supports up to 10,000 indexes. Distance metrics: cosine and Euclidean. Metadata limited to 1KB per vector (35 keys max).
Best for: Teams with existing S3 data pipelines who want the simplest possible RAG setup. Cost-sensitive workloads where you’re optimizing for dollars-per-query, not milliseconds-per-query. AWS claims up to 90% lower cost than traditional vector databases.
```python
import boto3

s3vectors = boto3.client('s3vectors')

# Create a vector bucket, then an index inside it
s3vectors.create_vector_bucket(vectorBucketName='my-rag-vectors')
s3vectors.create_index(
    vectorBucketName='my-rag-vectors',
    indexName='product-docs',
    dataType='float32',
    dimension=1536,  # must match your embedding model
    distanceMetric='cosine'
)

# Put vectors
s3vectors.put_vectors(
    vectorBucketName='my-rag-vectors',
    indexName='product-docs',
    vectors=[
        {
            'key': 'doc-001',
            'data': {'float32': embedding_vector},
            'metadata': {'source': 'product-manual.pdf', 'page': 12}
        }
    ]
)

# Query
results = s3vectors.query_vectors(
    vectorBucketName='my-rag-vectors',
    indexName='product-docs',
    queryVector={'float32': query_embedding},
    topK=5,
    returnMetadata=True
)
```
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | High | Zero infrastructure to manage. Serverless. |
| Security | High | Inherits S3 security model (IAM, bucket policies, encryption at rest/transit). |
| Reliability | High | S3 durability (11 9’s). |
| Performance | Medium | Not designed for sub-millisecond queries. Adequate for most RAG workloads. |
| Cost | High | Pay-per-request. No idle cost. Ideal for variable or low-volume workloads. |
| Sustainability | High | No over-provisioned infrastructure. |
Trade-off: Simplest option with lowest operational burden, but limited query sophistication — no hybrid search, no complex filtering at query time. Still in preview.
2. Amazon Aurora PostgreSQL with pgvector
If your team already runs PostgreSQL, this is the path of least resistance. The pgvector extension (v0.8.1) adds vector columns, indexing, and similarity search to your existing database.
How it works: Add a vector column to your table, create an HNSW or IVFFlat index, and query using distance operators.
Key specs:
- Dimensions: Up to 2,000 (vector type), 4,000 (halfvec), 64,000 (bit)
- Index types: HNSW (better recall, slower to build) and IVFFlat (faster build, lower recall)
- Distance functions: L2 (<->), inner product (<#>), cosine (<=>), L1 (<+>), Hamming (<~>), Jaccard (<%>)
```sql
-- Enable pgvector
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding vector(1536)  -- dimension matches your embedding model
);

-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Query: find 5 most similar documents
SELECT id, title, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
```
Best for: Teams already on PostgreSQL who want vectors alongside relational data. Use cases where you need JOINs between vector results and business data (e.g., filter by tenant, date range, product category).
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | You manage the database (even if Aurora handles patching). Need to tune index parameters. |
| Security | High | VPC, IAM, encryption, audit logging — full Aurora security model. |
| Reliability | High | Aurora’s multi-AZ, automated backups, point-in-time recovery. |
| Performance | Medium-High | HNSW gives good recall/latency trade-off. But vector workloads compete with relational queries for resources. |
| Cost | Medium | Instance-based pricing. You’re paying for the database whether you query vectors or not. Good if already running Aurora. Expensive if deployed just for vectors. |
| Sustainability | Medium | May over-provision to handle mixed workloads. |
Trade-off: Minimal new infrastructure if you’re already on PostgreSQL. But vector-heavy workloads can starve relational queries — consider read replicas dedicated to vector search.
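To illustrate the JOIN advantage, here is a sketch of a tenant-scoped similarity query. The tenants table, column names, and 90-day window are illustrative, not a prescribed schema:

```sql
-- Relational predicates narrow the candidate set first;
-- vector distance then orders the survivors.
SELECT d.id, d.title, 1 - (d.embedding <=> $1) AS similarity
FROM documents d
JOIN tenants t ON t.id = d.tenant_id
WHERE t.slug = 'acme-corp'
  AND d.created_at > now() - interval '90 days'
ORDER BY d.embedding <=> $1
LIMIT 5;
```

One caveat: heavily filtered HNSW queries can return fewer rows than requested, since the index scan may exhaust its candidates before the filter is satisfied. pgvector 0.8 adds iterative index scans to mitigate this.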
3. Amazon OpenSearch Service
OpenSearch is the strongest option for hybrid search — combining traditional keyword matching (BM25) with semantic vector search (k-NN) in a single query. This is a significant advantage for RAG quality.
How it works: OpenSearch stores documents with both text fields and vector fields. At query time, you can run BM25 and k-NN simultaneously, then combine scores.
Key specs:
- Dimensions: Up to 16,000
- Algorithms: HNSW (via the nmslib, Faiss, and Lucene engines) and IVF (Faiss only)
- Hybrid search: Native support for combining keyword + semantic results
- Serverless option: OpenSearch Serverless with OCU-based auto-scaling
```python
from opensearchpy import OpenSearch, RequestsHttpConnection

client = OpenSearch(
    hosts=[{'host': 'your-domain.us-east-1.es.amazonaws.com', 'port': 443}],
    use_ssl=True,
    connection_class=RequestsHttpConnection
)

# Create index with k-NN enabled
index_body = {
    "settings": {
        "index.knn": True,
        "index.knn.algo_param.ef_search": 512
    },
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 512, "m": 16}
                }
            }
        }
    }
}
client.indices.create(index='documents', body=index_body)

# Hybrid search: keyword + vector. Requires a search pipeline with a
# normalization processor to combine the two score distributions.
hybrid_query = {
    "size": 5,
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"content": "how to configure VPC peering"}},
                {"knn": {"embedding": {"vector": query_embedding, "k": 5}}}
            ]
        }
    }
}
results = client.search(
    index='documents',
    body=hybrid_query,
    params={'search_pipeline': 'hybrid-pipeline'}
)
```
Best for: RAG workloads where retrieval quality matters most. Hybrid search consistently outperforms pure vector search because it catches both semantic matches and exact keyword matches. Also ideal if you’re already using OpenSearch for log analytics.
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Managed service, but index management, shard tuning, and capacity planning require expertise. Serverless simplifies this. |
| Security | High | VPC, fine-grained access control, encryption, SAML/OIDC. |
| Reliability | High | Multi-AZ, automated snapshots, blue/green deployments. |
| Performance | High | Hybrid search gives best retrieval quality. Sub-100ms latency typical. |
| Cost | Medium-Low | Instance-based can be expensive at scale. Serverless (OCU-based) carries a minimum charge of a few hundred dollars per month even when idle. |
| Sustainability | Medium | Over-provisioning common with instance-based. Serverless improves this. |
Trade-off: Best retrieval quality through hybrid search, but highest operational complexity and cost floor. OpenSearch Serverless reduces ops but has a minimum cost that hurts small workloads.
4. Amazon Bedrock Knowledge Bases (Managed RAG)
If you want RAG without managing any vector infrastructure, this is it. Bedrock Knowledge Bases handles the entire pipeline: data ingestion, chunking, embedding, storage, and retrieval.
How it works: Point it at your data sources (S3, web pages, Confluence, SharePoint, Salesforce), pick a chunking strategy, and Bedrock does the rest. It creates and manages the vector store behind the scenes.
Key features:
- Supported vector stores (8 total): OpenSearch Serverless, OpenSearch Managed, S3 Vectors, Aurora PostgreSQL, Neptune Analytics, Pinecone, Redis Enterprise, MongoDB Atlas
- Data sources: S3, Web Crawler, Confluence, SharePoint, Salesforce, and more
- Chunking: Fixed-size, semantic, hierarchical
- Reranking: Built-in with Cohere Rerank and Amazon reranker
- File types: PDF, DOCX, HTML, CSV, MD, TXT, XLS (max 50MB per file)
```python
import boto3

bedrock_agent = boto3.client('bedrock-agent')

# Create a knowledge base
kb = bedrock_agent.create_knowledge_base(
    name='product-documentation',
    roleArn='arn:aws:iam::123456789012:role/BedrockKBRole',
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0'
        }
    },
    storageConfiguration={
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': 'arn:aws:aoss:us-east-1:123456789012:collection/xyz',
            'fieldMapping': {
                'vectorField': 'embedding',
                'textField': 'text',
                'metadataField': 'metadata'
            }
        }
    }
)

# Query the knowledge base with hybrid search
bedrock_runtime = boto3.client('bedrock-agent-runtime')
response = bedrock_runtime.retrieve(
    knowledgeBaseId=kb['knowledgeBase']['knowledgeBaseId'],
    retrievalQuery={'text': 'How do I configure cross-account access?'},
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 10,
            'overrideSearchType': 'HYBRID'
        }
    }
)
```
Best for: Teams that want the fastest path to production RAG. Non-expert teams. Prototyping. Any workload where you’d rather not manage chunking pipelines and vector databases yourself.
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Highest | Fully managed. No infrastructure, no index tuning, no chunking code. |
| Security | High | IAM, encryption, data source permissions. Guardrails integration for content filtering. |
| Reliability | High | Depends on underlying vector store. Bedrock handles sync and retries. |
| Performance | Medium | Adds a layer of abstraction. Reranking improves quality but adds latency. |
| Cost | Medium | Pay per retrieval query + underlying vector store costs. Predictable but not the cheapest at scale. |
| Sustainability | High | No over-provisioned infrastructure. Scales with demand. |
Trade-off: Fastest time-to-value, lowest operational burden — but you trade control. Custom chunking logic, specialized filtering, or non-standard retrieval patterns may push you toward a self-managed approach.
5. Amazon Neptune Analytics (GraphRAG)
GraphRAG is the approach for domains where relationships between entities matter as much as the content itself. Neptune Analytics combines knowledge graphs with vector similarity search.
How it works: Store entities as graph nodes with vector embeddings. Query using openCypher with vector extensions. Retrieval follows graph relationships (e.g., “find documents related to Entity X and similar to this query vector”).
Best for: Entity-rich domains — healthcare (drug interactions, patient records), finance (transaction networks, regulatory entities), legal (case law citations, precedent chains). Any domain where “connected to” matters as much as “similar to.”
```cypher
// Top-K similarity search by embedding, then constrain the hits
// to documents connected to a specific entity in the graph
CALL neptune.algo.vectors.topKByEmbedding($queryVector, {topK: 25})
YIELD node, score
MATCH (node:Document)-[:REFERENCES]->(:Regulation {name: 'GDPR'})
RETURN node.title AS title, node.content AS content, score
ORDER BY score
LIMIT 5
```
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Graph modeling requires specialized expertise. Not a simple “dump vectors” approach. |
| Security | High | VPC, IAM, encryption. |
| Reliability | High | Managed service with automated backups. |
| Performance | Medium-High | Graph traversal + vector search is powerful but adds query complexity. |
| Cost | Medium | Memory-hour pricing. Costs scale with graph size. |
| Sustainability | Medium | Right-size memory allocation for your graph. |
Trade-off: Highest retrieval quality for relationship-heavy domains, but highest learning curve. You need to model a knowledge graph, not just embed documents.
6. Amazon MemoryDB
MemoryDB brings vector search with sub-millisecond latency and durability guarantees — the key differentiator from ElastiCache. Every write is persisted to a transaction log, so your vectors survive restarts and failures.
How it works: Redis-compatible API with HNSW vector indexing. Store vectors as Redis data structures, query with FT.SEARCH. Supports up to 32,768 dimensions, max 10 indexes per cluster, 50 fields per index. HNSW parameters: M up to 512, EF_CONSTRUCTION up to 4,096.
Limitation: Vector search is currently single-shard only — no horizontal scaling. Vertical and replica scaling are supported. Must be enabled at cluster creation (R6g, R7g, T4g node types only).
Best for: Real-time RAG where latency is critical — chatbots, live recommendations, session-based context retrieval. Workloads that need both speed and durability.
```python
import numpy as np
import redis

r = redis.Redis(host='your-memorydb-endpoint', port=6379, ssl=True)

# Create an HNSW vector index over hashes with the 'doc:' key prefix
r.execute_command(
    'FT.CREATE', 'doc_index',
    'ON', 'HASH',
    'PREFIX', '1', 'doc:',
    'SCHEMA',
    'content', 'TEXT',
    'embedding', 'VECTOR', 'HNSW', '6',
    'TYPE', 'FLOAT32',
    'DIM', '1536',
    'DISTANCE_METRIC', 'COSINE'
)

# Vectors are passed to FT.SEARCH as packed float32 bytes
query_embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()

# KNN query: 5 nearest neighbors, lowest distance first
results = r.execute_command(
    'FT.SEARCH', 'doc_index',
    '*=>[KNN 5 @embedding $query_vec AS score]',
    'PARAMS', '2', 'query_vec', query_embedding_bytes,
    'RETURN', '2', 'content', 'score',
    'SORTBY', 'score',
    'DIALECT', '2'
)
```
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Node management, cluster sizing. Redis expertise helpful. |
| Security | High | VPC, TLS, IAM, encryption at rest. |
| Reliability | High | Durable — transaction log persists every write. Multi-AZ. |
| Performance | High | Sub-millisecond latency. |
| Cost | Medium-High | Node-based + $0.20/GB data written (first 10TB/mo free). Memory-intensive. |
| Sustainability | Medium | Must provision for peak. Data tiering with R6gd nodes helps. |
Trade-off: Fastest durable vector store on AWS. But you’re paying for memory-optimized instances, which gets expensive at scale.
7. Amazon ElastiCache (Valkey 8.2+)
ElastiCache with Valkey 8.2+ offers vector search at microsecond latency — the absolute lowest on AWS. The catch: it’s an in-memory cache, not a durable store.
How it works: Same Redis-compatible API as MemoryDB, but without transaction log durability. If a node restarts, you lose data unless you’ve configured snapshots.
Best for: Semantic caching (cache frequent RAG queries to avoid re-embedding and re-retrieval), ultra-low-latency lookups, and real-time recommendation layers where vectors can be rebuilt from a source of truth.
Key differentiator from MemoryDB:
- ElastiCache = speed over durability (microsecond, but ephemeral)
- MemoryDB = speed with durability (sub-millisecond, persistent)
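The semantic-caching pattern can be sketched as follows. The index name, key scheme, and 0.95 threshold are illustrative, and the client argument is any Redis-compatible client (redis-py against a Valkey 8.2+ endpoint is assumed):

```python
import hashlib
import struct

SIM_THRESHOLD = 0.95  # treat queries above this cosine similarity as "the same"

def pack_vector(embedding):
    """Serialize a float list as packed float32 bytes, the format FT.SEARCH expects."""
    return struct.pack(f'<{len(embedding)}f', *embedding)

def is_cache_hit(cosine_distance, threshold=SIM_THRESHOLD):
    """FT.SEARCH with COSINE returns a distance; similarity = 1 - distance."""
    return (1.0 - cosine_distance) >= threshold

def cache_key(embedding):
    """Deterministic cache key derived from the query embedding."""
    return 'q:' + hashlib.sha256(pack_vector(embedding)).hexdigest()[:16]

def lookup(client, embedding):
    """Return a cached answer if a semantically similar query was seen before."""
    res = client.execute_command(
        'FT.SEARCH', 'query_cache',
        '*=>[KNN 1 @embedding $v AS dist]',
        'PARAMS', '2', 'v', pack_vector(embedding),
        'RETURN', '2', 'answer', 'dist',
        'DIALECT', '2'
    )
    if res[0] == 0:
        return None
    fields = dict(zip(res[2][::2], res[2][1::2]))
    dist = float(fields[b'dist'])
    return fields[b'answer'].decode() if is_cache_hit(dist) else None

def store(client, embedding, answer, ttl=3600):
    """Cache an answer; a TTL is fine since vectors are rebuildable from source."""
    key = cache_key(embedding)
    client.hset(key, mapping={'embedding': pack_vector(embedding), 'answer': answer})
    client.expire(key, ttl)
```

Because everything in the cache can be regenerated, losing a node costs you latency, not data.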
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Similar to MemoryDB. Valkey 8.2+ required for vector search. |
| Security | High | VPC, TLS, IAM, encryption. |
| Reliability | Low-Medium | Not durable by default. Snapshot-based recovery. Data loss on node failure. |
| Performance | Highest | Microsecond latency. Up to 99% recall. |
| Cost | Medium | Node-based. No additional vector search cost. Cheaper than MemoryDB. |
| Sustainability | Medium | Must provision for peak. |
Trade-off: If you can tolerate data loss (because vectors are derivable from source documents), this gives you the fastest possible retrieval. Use it as a caching layer in front of a durable vector store.
8. Amazon DocumentDB
DocumentDB added vector search for teams already running MongoDB-compatible workloads on AWS.
How it works: HNSW or IVFFlat indexing on vector fields within documents. MongoDB-compatible query syntax with $vectorSearch aggregation stage (DocumentDB 8.0+) or $search (5.0+).
Key specs:
- Indexed vectors: Up to 2,000 dimensions
- Stored (unindexed) vectors: Up to 16,000 dimensions
- Index types: HNSW (M: max 100, efConstruction: max 1,000) and IVFFlat
- Distance metrics: Euclidean, Cosine, Dot Product
Best for: Teams already on DocumentDB or migrating from MongoDB who want to add RAG without introducing a new database.
```javascript
// Create vector index
db.documents.createIndex(
  { "embedding": "vectorSearch" },
  {
    "name": "vector_index",
    "vectorOptions": {
      "type": "hnsw",
      "dimensions": 1536,
      "similarity": "cosine",
      "m": 16,
      "efConstruction": 200
    }
  }
);

// Query
db.documents.aggregate([
  {
    "$vectorSearch": {
      "queryVector": queryEmbedding,
      "path": "embedding",
      "numCandidates": 100,
      "limit": 5,
      "index": "vector_index"
    }
  }
]);
```
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Managed service but requires index management. |
| Security | High | VPC, TLS, KMS encryption, IAM. |
| Reliability | High | Multi-AZ, automated backups. |
| Performance | Medium | Good for moderate scale. Not optimized for vector-heavy workloads. |
| Cost | Medium | Instance-based. Cost-effective if already running DocumentDB. |
| Sustainability | Medium | Shared instance with document workloads. |
Trade-off: Convenient if you’re already on DocumentDB. But it’s not a vector-first database — if vectors are your primary workload, dedicated options perform better.
9. Amazon Kendra (Enterprise Search)
Kendra isn’t a vector database — it’s an ML-powered enterprise search service. But it solves the same problem for a different audience: retrieval from enterprise knowledge for RAG.
How it works: Kendra indexes documents from 40+ connectors (S3, SharePoint, Confluence, ServiceNow, Salesforce, databases, web crawlers) and provides natural language query capabilities. No embedding pipeline needed.
Best for: Enterprise knowledge retrieval where the data lives across many SaaS tools and you don’t want to build custom embedding pipelines. Compliance-sensitive environments where Kendra’s built-in access control (respecting source system permissions) matters.
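A minimal retrieval call might look like the sketch below. The index ID and region are placeholders; the Retrieve API returns passage-sized excerpts, which suit RAG prompts better than the Query API's short answer extracts:

```python
def retrieve_passages(index_id, question, top_k=5, region='us-east-1'):
    """Fetch passage-level results from a Kendra index for use as RAG context."""
    import boto3  # imported here so the sketch stays self-contained
    kendra = boto3.client('kendra', region_name=region)
    resp = kendra.retrieve(
        IndexId=index_id,
        QueryText=question,
        PageSize=top_k,
    )
    return [(item['DocumentTitle'], item['Content'])
            for item in resp['ResultItems']]
```

Note there is no embedding step anywhere: Kendra handles relevance ranking internally.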
Well-Architected assessment:
| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | High | Fully managed. Connectors handle sync automatically. |
| Security | High | Respects source system ACLs. Built-in access control per user/group. |
| Reliability | High | Managed, multi-AZ. |
| Performance | Medium | Higher latency than purpose-built vector stores. Optimized for precision, not speed. |
| Cost | Low | Index + connector pricing can be expensive. Starts at ~$810/month for Developer edition. |
| Sustainability | High | Serverless, scales with usage. |
Trade-off: Best out-of-box enterprise search with built-in security and connectors. But expensive, and less flexible than building your own retrieval pipeline.
Choosing Your Vector Store: The Decision Tree
Here’s how to shortcut the decision:
Start here: Do you want to manage vector infrastructure?
- No → Bedrock Knowledge Bases. It handles everything. Start here, graduate later if needed.
Yes — then what’s your priority?
- Hybrid search quality → OpenSearch. Combine keyword + semantic for best retrieval accuracy.
- Lowest latency (cache) → ElastiCache. Microsecond retrieval for semantic caching layers.
- Lowest latency (durable) → MemoryDB. Sub-millisecond with durability guarantees.
- Simplest, cheapest → S3 Vectors. Pay-per-query, zero infra. Good for getting started.
- Already on PostgreSQL → Aurora pgvector. Vectors alongside relational data.
- Already on MongoDB → DocumentDB. Add vectors without a new database.
- Entity relationships matter → Neptune Analytics. GraphRAG for complex domains.
- Enterprise data across SaaS tools → Kendra. 40+ connectors, built-in ACL.
Advanced Patterns
Hybrid Search (Keyword + Semantic)
Pure vector search misses exact matches. If a user asks about “error code E-4012”, a semantic search might return results about generic error handling instead of the specific code. Hybrid search solves this by combining BM25 keyword matching with k-NN vector similarity.
OpenSearch is the only AWS-native service with built-in hybrid search. If you’re using another vector store, you can achieve hybrid search by:
- Running a keyword search in your primary database
- Running a vector search in your vector store
- Merging and reranking results using Bedrock’s reranker
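Before (or instead of) a model-based reranker, a rank-level merge such as reciprocal rank fusion (RRF) is a cheap way to combine the two lists. A minimal sketch, with made-up document IDs:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc IDs (best first) by summing 1/(k + rank) scores."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ['doc-7', 'doc-2', 'doc-9']    # e.g. from a SQL full-text search
semantic_hits = ['doc-2', 'doc-4', 'doc-7']   # e.g. from a vector store
merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# doc-2 and doc-7 appear in both lists, so they rank first
```

RRF needs no score normalization, which is exactly the hard part of merging BM25 and cosine scores by hand.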
Chunking Strategies
How you chunk your documents impacts retrieval quality more than which vector store you use.
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple documents, consistent structure |
| Semantic | Split at topic boundaries using NLP | Long-form content with varying topics |
| Hierarchical | Parent chunks (summaries) + child chunks (details) | Technical documentation, manuals |
| Sentence-window | Index individual sentences, retrieve surrounding context | Precise retrieval from dense text |
Bedrock Knowledge Bases supports fixed-size, semantic, and hierarchical chunking natively. For other vector stores, you’ll build this in your ingestion pipeline.
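For a self-built pipeline, fixed-size chunking with overlap is the usual starting point. A token-level sketch, where the sizes are typical defaults rather than prescriptions:

```python
def chunk_fixed(tokens, size=200, overlap=40):
    """Split a token list into chunks of `size`, each sharing `overlap` tokens
    with its predecessor so sentences straddling a boundary keep their context."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(500)]
chunks = chunk_fixed(tokens)
# 3 chunks covering [0:200], [160:360], [320:500]
```

The same skeleton extends to semantic chunking by replacing the fixed step with split points from a sentence or topic segmenter.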
Reranking
First-pass retrieval (vector similarity) gets you candidates. Reranking uses a cross-encoder model to score each candidate against the actual query — dramatically improving relevance.
On AWS, you have two options:
| Reranker | Model ID | Pricing | Regions |
|---|---|---|---|
| Amazon Rerank 1.0 | amazon.rerank-v1:0 | $1.00 / 1K queries | ap-northeast-1, ca-central-1, eu-central-1, us-west-2 |
| Cohere Rerank 3.5 | cohere.rerank-v3-5:0 | $2.00 / 1K queries | ap-northeast-1, ca-central-1, eu-central-1, us-east-1, us-west-2 |
Watch out: Amazon Rerank 1.0 is not available in us-east-1. Use Cohere Rerank 3.5 if you’re in that region.
Always retrieve more candidates than you need (e.g., top 20), rerank, then use the top 5 in your prompt.
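The retrieve-then-rerank flow can be sketched with the Rerank API in bedrock-agent-runtime. The model ARN and region here are assumptions; check availability in your region against the table above:

```python
def rerank_candidates(query, candidate_texts, top_n=5, region='us-west-2'):
    """Score retrieved candidates against the query with a cross-encoder reranker."""
    import boto3  # imported here so the sketch stays self-contained
    client = boto3.client('bedrock-agent-runtime', region_name=region)
    response = client.rerank(
        queries=[{'type': 'TEXT', 'textQuery': {'text': query}}],
        sources=[
            {'type': 'INLINE',
             'inlineDocumentSource': {'type': 'TEXT',
                                      'textDocument': {'text': text}}}
            for text in candidate_texts
        ],
        rerankingConfiguration={
            'type': 'BEDROCK_RERANKING_MODEL',
            'bedrockRerankingConfiguration': {
                'modelConfiguration': {
                    'modelArn': f'arn:aws:bedrock:{region}::foundation-model/amazon.rerank-v1:0'
                },
                'numberOfResults': top_n,
            },
        },
    )
    # Each result carries the original candidate index plus a relevance score
    return [(r['index'], r['relevanceScore']) for r in response['results']]
```

Feed it the top 20 retrieval hits and keep the indices it returns.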
Prompt Caching on Bedrock
If your RAG context is large and repeated across queries (e.g., a system prompt with 50 pages of product documentation), Bedrock’s prompt caching reduces both latency and cost by caching the processed context prefix across invocations.
Add cachePoint objects in your Converse API calls (or cache_control for Claude via InvokeModel). Cached tokens are read at a reduced rate, and cache hits don’t count against rate limits.
| Model | Min Tokens per Checkpoint | Max Checkpoints | TTL |
|---|---|---|---|
| Claude Sonnet 4.5 | 1,024 | 4 | 5m or 1h |
| Claude Sonnet 4 | 1,024 | 4 | 5m |
| Claude 3.7 Sonnet | 1,024 | 4 | 5m |
| Amazon Nova (all) | 1,000 | 4 | 5m |
This is particularly effective for multi-turn RAG conversations where the system prompt + document context stays constant across turns.
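With the Converse API, the cache checkpoint is just an extra block in the system prompt. A sketch, where `client` is a bedrock-runtime boto3 client and the model ID is whichever cache-capable model from the table you use:

```python
def ask_with_cached_context(client, model_id, system_text, doc_context, question):
    """Place the large, repeated context before a cachePoint so subsequent
    calls with the same prefix read it from cache at a reduced token rate."""
    return client.converse(
        modelId=model_id,
        system=[
            {'text': system_text},
            {'text': doc_context},                # the big, stable part
            {'cachePoint': {'type': 'default'}},  # everything above is cached
        ],
        messages=[{'role': 'user', 'content': [{'text': question}]}],
    )
```

Only the prefix up to the cachePoint must be byte-identical across calls, so keep per-turn content after it.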
Multi-Service Architecture
Production RAG systems often combine multiple services:
User Query
↓
[Bedrock KB] → Retrieves top 20 chunks from [OpenSearch]
↓
[Bedrock Reranker] → Reranks to top 5
↓
[ElastiCache] → Cache check (have we answered this before?)
↓
[Bedrock FM] → Generate answer with reranked context
↓
[ElastiCache] → Cache the response for future queries
This pattern uses OpenSearch for hybrid retrieval quality, ElastiCache for response caching, and Bedrock for orchestration and generation.
What I Learned
- Hybrid search is underrated — Most RAG tutorials skip it, but combining keyword + semantic retrieval consistently produces better results than vector search alone. OpenSearch is the only AWS-native service that does this in a single query.
- Start managed, graduate to custom — Bedrock Knowledge Bases gets you to production fastest. You can always migrate to a self-managed vector store later once you understand your query patterns and scale requirements.
- Your chunking strategy matters more than your vector store — Teams spend weeks debating Aurora vs OpenSearch, then use naive fixed-size chunking and wonder why retrieval quality is poor. Invest in chunking first.
- Caching is the hidden cost saver — Adding ElastiCache as a semantic cache in front of your vector store can reduce retrieval costs by 60-80% for workloads with repetitive queries (support chatbots, FAQ assistants).
- The Well-Architected GenAI Lens is worth reading — It provides specific guidance on vector store optimization and data retrieval performance that goes beyond what any single blog post can cover.
What’s Next
- Benchmark latency and cost across S3 Vectors, OpenSearch Serverless, and Aurora pgvector with a real dataset
- Test Bedrock Knowledge Bases hierarchical chunking vs custom semantic chunking pipeline
- Build a reference architecture combining OpenSearch (retrieval) + ElastiCache (caching) + Bedrock (orchestration)
- Evaluate DynamoDB vector search when it moves to GA — single-table design with vectors could simplify many architectures
References:
- AWS Well-Architected Generative AI Lens
- Amazon Bedrock Knowledge Bases
- Bedrock Knowledge Bases — Vector Store Setup
- Bedrock Reranking
- Bedrock Prompt Caching
- OpenSearch k-NN Search
- OpenSearch Serverless Vector Search
- pgvector on GitHub
- ElastiCache Vector Search
- MemoryDB Vector Search
- MemoryDB Vector Search Limits
- Neptune Analytics — Vector Similarity
- Amazon S3 Vectors
- Amazon DocumentDB Vector Search
- Amazon Kendra
- The Role of Vector Datastores in GenAI Applications
