RAG on AWS: Which Vector Store Is Right for You?

AWS now offers 9 different ways to store and search vectors for RAG workloads. This guide compares every option through the Well-Architected Framework to help you pick the right one.

Alexandre Agius

AWS Solutions Architect

21 min read

Your team has decided to implement RAG. You’ve picked a foundation model on Bedrock. Now comes the question that stalls most projects: where do you store your vectors?

AWS now offers at least 9 services with vector search capabilities — from a new S3-native option to graph-based retrieval with Neptune. Each has different trade-offs in cost, latency, operational complexity, and scalability. Picking the wrong one means rework later. This guide walks through every option and helps you choose.

The Problem

The RAG pattern is straightforward: embed your documents as vectors, store them, retrieve the most relevant chunks at query time, and inject them into your prompt. The hard part isn’t the pattern — it’s the infrastructure decision.
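Concretely, the whole loop fits in a few lines. The sketch below is a toy in-memory version — the hand-written three-dimensional vectors stand in for real model embeddings, and `retrieve` and `build_prompt` are illustrative names, not an AWS API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, store, top_k=3):
    """Return the top_k most similar (chunk, score) pairs from the store."""
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in store]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

def build_prompt(question, chunks):
    """Inject retrieved chunks into the prompt sent to the foundation model."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy store of (chunk_text, embedding) pairs — real embeddings come from a model
store = [
    ("VPC peering connects two VPCs.", [0.9, 0.1, 0.0]),
    ("S3 buckets store objects.",      [0.1, 0.9, 0.0]),
    ("IAM roles grant permissions.",   [0.0, 0.2, 0.9]),
]

hits = retrieve([0.8, 0.2, 0.1], store, top_k=2)
prompt = build_prompt("How do I connect two VPCs?", [c for c, _ in hits])
```

Every option in this guide is, at its core, a managed replacement for the `store` list and the `retrieve` function above.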

Here’s what makes this choice difficult:

  • Too many options — S3 Vectors, Aurora pgvector, OpenSearch, Bedrock Knowledge Bases, Neptune, MemoryDB, ElastiCache, DocumentDB, Kendra. Each marketed as “the best” for GenAI.
  • Different trade-offs — Some optimize for latency, others for cost. Some are fully managed, others give you full control. Some support hybrid search, others don’t.
  • No single winner — The right answer depends on your existing stack, query patterns, scale, and team expertise.
  • Well-Architected implications — Your vector store choice affects all six pillars: operational overhead, security posture, reliability, performance, cost trajectory, and resource efficiency.

Without a structured comparison, teams either default to whatever the first tutorial used (usually OpenSearch) or spend weeks evaluating options that weren’t right for them in the first place.

The Solution

We’ll evaluate every AWS vector storage option through the Well-Architected Generative AI Lens — the framework AWS published specifically for GenAI workloads. For each service, we assess operational excellence, security, reliability, performance, cost, and sustainability. Then we provide a decision tree to shortcut the choice.

RAG Vector Store Decision Tree on AWS

How It Works

The 9 Vector Storage Options at a Glance

Before diving deep, here’s the landscape:

| Service | Max Dimensions | Latency | Hybrid Search | Managed Level | Pricing Model |
|---|---|---|---|---|---|
| S3 Vectors | Up to 2B vectors/index | ~100ms | No | Serverless | Pay-per-request |
| Aurora PostgreSQL (pgvector) | 2,000 / 4,000 (halfvec) | Single-digit ms | Yes (SQL + GIN) | Semi-managed | Instance-based |
| OpenSearch Service | 16,000 | Sub-100ms | Yes (BM25 + k-NN) | Managed / Serverless | Instance or OCU-based |
| Bedrock Knowledge Bases | Per embedding model | Medium | Yes (built-in) | Fully managed | Pay-per-query |
| Neptune Analytics | Per embedding model | Low ms | Graph + Vector | Managed | NCU-hour |
| MemoryDB | 32,768 | Sub-millisecond | No | Managed | Node + data written |
| ElastiCache (Valkey 8.2+) | ~32,768 | Microsecond | No | Managed | Node-based |
| DocumentDB | 2,000 (indexed) / 16,000 | Low ms | No | Managed | Instance-based |
| Kendra | N/A (managed) | Sub-second | Yes (NL + keyword) | Fully managed | Index + connector |

1. Amazon S3 Vectors (Preview)

The newest vector option on AWS, currently in preview. S3 Vectors lets you store and query embeddings directly in S3 — no separate database needed.

How it works: You create a “vector bucket” in S3, push vectors with metadata, and query by similarity. S3 manages the indexing automatically. Each index can hold up to 2 billion vectors, and each bucket supports up to 10,000 indexes. Distance metrics: cosine and Euclidean. Metadata limited to 1KB per vector (35 keys max).

Best for: Teams with existing S3 data pipelines who want the simplest possible RAG setup. Cost-sensitive workloads where you’re optimizing for dollars-per-query, not milliseconds-per-query. AWS claims up to 90% lower cost than traditional vector databases.

import boto3

s3vectors = boto3.client('s3vectors')

# Create a vector index in S3
s3vectors.create_vector_bucket(
    vectorBucketName='my-rag-vectors'
)

# Put vectors
s3vectors.put_vectors(
    vectorBucketName='my-rag-vectors',
    indexName='product-docs',
    vectors=[
        {
            'key': 'doc-001',
            'data': {'float32': embedding_vector},
            'metadata': {'source': 'product-manual.pdf', 'page': 12}
        }
    ]
)

# Query
results = s3vectors.query_vectors(
    vectorBucketName='my-rag-vectors',
    indexName='product-docs',
    queryVector={'float32': query_embedding},
    topK=5
)

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | High | Zero infrastructure to manage. Serverless. |
| Security | High | Inherits S3 security model (IAM, bucket policies, encryption at rest/transit). |
| Reliability | High | S3 durability (11 9's). |
| Performance | Medium | Not designed for sub-millisecond queries. Adequate for most RAG workloads. |
| Cost | High | Pay-per-request. No idle cost. Ideal for variable or low-volume workloads. |
| Sustainability | High | No over-provisioned infrastructure. |

Trade-off: Simplest option with lowest operational burden, but limited query sophistication — no hybrid search, no complex filtering at query time. Still in preview.


2. Amazon Aurora PostgreSQL with pgvector

If your team already runs PostgreSQL, this is the path of least resistance. The pgvector extension (v0.8.1) adds vector columns, indexing, and similarity search to your existing database.

How it works: Add a vector column to your table, create an HNSW or IVFFlat index, and query using distance operators.

Key specs:

  • Dimensions: Up to 2,000 (vector type), 4,000 (halfvec), 64,000 (bit)
  • Index types: HNSW (better recall, slower to build) and IVFFlat (faster build, lower recall)
  • Distance functions: L2 (<->), inner product (<#>), cosine (<=>), L1 (<+>), Hamming (<~>), Jaccard (<%>)
-- Enable pgvector
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding vector(1536)  -- dimension matches your embedding model
);

-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Query: find 5 most similar documents
SELECT id, title, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;

Best for: Teams already on PostgreSQL who want vectors alongside relational data. Use cases where you need JOINs between vector results and business data (e.g., filter by tenant, date range, product category).

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | You manage the database (even if Aurora handles patching). Need to tune index parameters. |
| Security | High | VPC, IAM, encryption, audit logging — full Aurora security model. |
| Reliability | High | Aurora's multi-AZ, automated backups, point-in-time recovery. |
| Performance | Medium-High | HNSW gives good recall/latency trade-off. But vector workloads compete with relational queries for resources. |
| Cost | Medium | Instance-based pricing. You're paying for the database whether you query vectors or not. Good if already running Aurora. Expensive if deployed just for vectors. |
| Sustainability | Medium | May over-provision to handle mixed workloads. |

Trade-off: Minimal new infrastructure if you’re already on PostgreSQL. But vector-heavy workloads can starve relational queries — consider read replicas dedicated to vector search.


3. Amazon OpenSearch Service

OpenSearch is the strongest option for hybrid search — combining traditional keyword matching (BM25) with semantic vector search (k-NN) in a single query. This is a significant advantage for RAG quality.

How it works: OpenSearch stores documents with both text fields and vector fields. At query time, you can run BM25 and k-NN simultaneously, then combine scores.

Key specs:

  • Dimensions: Up to 16,000
  • Algorithms: HNSW, IVFFlat (via Faiss and Lucene engines)
  • Hybrid search: Native support for combining keyword + semantic results
  • Serverless option: OpenSearch Serverless with OCU-based auto-scaling
from opensearchpy import OpenSearch, RequestsHttpConnection

# For IAM-protected domains, also pass SigV4 credentials via http_auth
client = OpenSearch(
    hosts=[{'host': 'your-domain.us-east-1.es.amazonaws.com', 'port': 443}],
    use_ssl=True,
    connection_class=RequestsHttpConnection
)

# Create index with k-NN enabled
index_body = {
    "settings": {
        "index.knn": True,
        "index.knn.algo_param.ef_search": 512
    },
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 512, "m": 16}
                }
            }
        }
    }
}

# Hybrid search: keyword + vector
hybrid_query = {
    "size": 5,
    "query": {
        "hybrid": {
            "queries": [
                {
                    "match": {
                        "content": "how to configure VPC peering"
                    }
                },
                {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": 5
                        }
                    }
                }
            ]
        }
    }
}

Best for: RAG workloads where retrieval quality matters most. Hybrid search consistently outperforms pure vector search because it catches both semantic matches and exact keyword matches. Also ideal if you’re already using OpenSearch for log analytics.

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Managed service, but index management, shard tuning, and capacity planning require expertise. Serverless simplifies this. |
| Security | High | VPC, fine-grained access control, encryption, SAML/OIDC. |
| Reliability | High | Multi-AZ, automated snapshots, blue/green deployments. |
| Performance | High | Hybrid search gives best retrieval quality. Sub-100ms latency typical. |
| Cost | Medium-Low | Instance-based can be expensive at scale. Serverless (OCU) has minimum costs (~$700/mo for 2 OCUs). |
| Sustainability | Medium | Over-provisioning common with instance-based. Serverless improves this. |

Trade-off: Best retrieval quality through hybrid search, but highest operational complexity and cost floor. OpenSearch Serverless reduces ops but has a minimum cost that hurts small workloads.


4. Amazon Bedrock Knowledge Bases (Managed RAG)

If you want RAG without managing any vector infrastructure, this is it. Bedrock Knowledge Bases handles the entire pipeline: data ingestion, chunking, embedding, storage, and retrieval.

How it works: Point it at your data sources (S3, web pages, Confluence, SharePoint, Salesforce), pick a chunking strategy, and Bedrock does the rest. It creates and manages the vector store behind the scenes.

Key features:

  • Supported vector stores (8 total): OpenSearch Serverless, OpenSearch Managed, S3 Vectors, Aurora PostgreSQL, Neptune Analytics, Pinecone, Redis Enterprise, MongoDB Atlas
  • Data sources: S3, Web Crawler, Confluence, SharePoint, Salesforce, and more
  • Chunking: Fixed-size, semantic, hierarchical
  • Reranking: Built-in with Cohere Rerank and Amazon reranker
  • File types: PDF, DOCX, HTML, CSV, MD, TXT, XLS (max 50MB per file)
import boto3

bedrock_agent = boto3.client('bedrock-agent')

# Create a knowledge base
kb = bedrock_agent.create_knowledge_base(
    name='product-documentation',
    roleArn='arn:aws:iam::123456789012:role/BedrockKBRole',
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0'
        }
    },
    storageConfiguration={
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': 'arn:aws:aoss:us-east-1:123456789012:collection/xyz',
            'fieldMapping': {
                'vectorField': 'embedding',
                'textField': 'text',
                'metadataField': 'metadata'
            }
        }
    }
)

# Query with reranking
bedrock_runtime = boto3.client('bedrock-agent-runtime')

response = bedrock_runtime.retrieve(
    knowledgeBaseId=kb['knowledgeBase']['knowledgeBaseId'],
    retrievalQuery={'text': 'How do I configure cross-account access?'},
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 10,
            'overrideSearchType': 'HYBRID'
        }
    }
)

Best for: Teams that want the fastest path to production RAG. Non-expert teams. Prototyping. Any workload where you’d rather not manage chunking pipelines and vector databases yourself.

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Highest | Fully managed. No infrastructure, no index tuning, no chunking code. |
| Security | High | IAM, encryption, data source permissions. Guardrails integration for content filtering. |
| Reliability | High | Depends on underlying vector store. Bedrock handles sync and retries. |
| Performance | Medium | Adds a layer of abstraction. Reranking improves quality but adds latency. |
| Cost | Medium | Pay per retrieval query + underlying vector store costs. Predictable but not the cheapest at scale. |
| Sustainability | High | No over-provisioned infrastructure. Scales with demand. |

Trade-off: Fastest time-to-value, lowest operational burden — but you trade control. Custom chunking logic, specialized filtering, or non-standard retrieval patterns may push you toward a self-managed approach.


5. Amazon Neptune Analytics (GraphRAG)

GraphRAG is the approach for domains where relationships between entities matter as much as the content itself. Neptune Analytics combines knowledge graphs with vector similarity search.

How it works: Store entities as graph nodes with vector embeddings. Query using openCypher with vector extensions. Retrieval follows graph relationships (e.g., “find documents related to Entity X and similar to this query vector”).

Best for: Entity-rich domains — healthcare (drug interactions, patient records), finance (transaction networks, regulatory entities), legal (case law citations, precedent chains). Any domain where “connected to” matters as much as “similar to.”

// Find nodes similar to a query vector that are connected to a specific entity
MATCH (doc:Document)-[:REFERENCES]->(entity:Regulation {name: 'GDPR'})
WHERE doc.embedding IS NOT NULL
WITH doc, vector.similarity.cosine(doc.embedding, $queryVector) AS score
WHERE score > 0.7
RETURN doc.title, doc.content, score
ORDER BY score DESC
LIMIT 5

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Graph modeling requires specialized expertise. Not a simple "dump vectors" approach. |
| Security | High | VPC, IAM, encryption. |
| Reliability | High | Managed service with automated backups. |
| Performance | Medium-High | Graph traversal + vector search is powerful but adds query complexity. |
| Cost | Medium | Memory-hour pricing. Costs scale with graph size. |
| Sustainability | Medium | Right-size memory allocation for your graph. |

Trade-off: Highest retrieval quality for relationship-heavy domains, but highest learning curve. You need to model a knowledge graph, not just embed documents.


6. Amazon MemoryDB

MemoryDB brings vector search with sub-millisecond latency and durability guarantees — the key differentiator from ElastiCache. Every write is persisted to a transaction log, so your vectors survive restarts and failures.

How it works: Redis-compatible API with HNSW vector indexing. Store vectors as Redis data structures, query with FT.SEARCH. Supports up to 32,768 dimensions, max 10 indexes per cluster, 50 fields per index. HNSW parameters: M up to 512, EF_CONSTRUCTION up to 4,096.

Limitation: Vector search is currently single-shard only — no horizontal scaling. Vertical and replica scaling are supported. Must be enabled at cluster creation (R6g, R7g, T4g node types only).

Best for: Real-time RAG where latency is critical — chatbots, live recommendations, session-based context retrieval. Workloads that need both speed and durability.

import redis

r = redis.Redis(host='your-memorydb-endpoint', port=6379, ssl=True)

# Create vector index
r.execute_command(
    'FT.CREATE', 'doc_index',
    'ON', 'HASH',
    'PREFIX', '1', 'doc:',
    'SCHEMA',
    'content', 'TEXT',
    'embedding', 'VECTOR', 'HNSW', '6',
    'TYPE', 'FLOAT32',
    'DIM', '1536',
    'DISTANCE_METRIC', 'COSINE'
)

# Query
results = r.execute_command(
    'FT.SEARCH', 'doc_index',
    '*=>[KNN 5 @embedding $query_vec AS score]',
    'PARAMS', '2', 'query_vec', query_embedding_bytes,
    'RETURN', '2', 'content', 'score',
    'SORTBY', 'score',
    'DIALECT', '2'
)

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Node management, cluster sizing. Redis expertise helpful. |
| Security | High | VPC, TLS, IAM, encryption at rest. |
| Reliability | High | Durable — transaction log persists every write. Multi-AZ. |
| Performance | High | Sub-millisecond latency. |
| Cost | Medium-High | Node-based + $0.20/GB data written (first 10TB/mo free). Memory-intensive. |
| Sustainability | Medium | Must provision for peak. Data tiering with R6gd nodes helps. |

Trade-off: Fastest durable vector store on AWS. But you’re paying for memory-optimized instances, which gets expensive at scale.


7. Amazon ElastiCache (Valkey 8.2+)

ElastiCache with Valkey 8.2+ offers vector search at microsecond latency — the absolute lowest on AWS. The catch: it’s an in-memory cache, not a durable store.

How it works: Same Redis-compatible API as MemoryDB, but without transaction log durability. If a node restarts, you lose data unless you’ve configured snapshots.

Best for: Semantic caching (cache frequent RAG queries to avoid re-embedding and re-retrieval), ultra-low-latency lookups, and real-time recommendation layers where vectors can be rebuilt from a source of truth.

Key differentiator from MemoryDB:

  • ElastiCache = speed over durability (microsecond, but ephemeral)
  • MemoryDB = speed with durability (sub-millisecond, persistent)

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Similar to MemoryDB. Valkey 8.2+ required for vector search. |
| Security | High | VPC, TLS, IAM, encryption. |
| Reliability | Low-Medium | Not durable by default. Snapshot-based recovery. Data loss on node failure. |
| Performance | Highest | Microsecond latency. Up to 99% recall. |
| Cost | Medium | Node-based. No additional vector search cost. Cheaper than MemoryDB. |
| Sustainability | Medium | Must provision for peak. |

Trade-off: If you can tolerate data loss (because vectors are derivable from source documents), this gives you the fastest possible retrieval. Use it as a caching layer in front of a durable vector store.
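The semantic-caching pattern mentioned above can be sketched in a few lines. Here a plain Python class stands in for ElastiCache, exact cosine similarity over cached query embeddings decides whether a stored answer is "close enough" to reuse, and the 0.95 threshold is an illustrative assumption, not a recommendation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Stand-in for ElastiCache: maps query embeddings to cached answers."""
    def __init__(self, threshold=0.95):
        self.entries = []          # list of (embedding, answer) pairs
        self.threshold = threshold

    def get(self, query_vec):
        """Return a cached answer if any stored query is similar enough."""
        best, best_score = None, 0.0
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best, best_score = answer, score
        return best if best_score >= self.threshold else None

    def put(self, query_vec, answer):
        self.entries.append((query_vec, answer))

cache = SemanticCache()
cache.put([1.0, 0.0], "Use a peering connection.")

hit  = cache.get([0.99, 0.05])   # near-duplicate query: served from cache
miss = cache.get([0.0, 1.0])     # unrelated query: falls through to the vector store
```

In production the similarity search itself runs inside Valkey (FT.SEARCH), so the cache lookup is a single round trip rather than a client-side scan.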


8. Amazon DocumentDB

DocumentDB added vector search for teams already running MongoDB-compatible workloads on AWS.

How it works: HNSW or IVFFlat indexing on vector fields within documents. MongoDB-compatible query syntax with $vectorSearch aggregation stage (DocumentDB 8.0+) or $search (5.0+).

Key specs:

  • Indexed vectors: Up to 2,000 dimensions
  • Stored (unindexed) vectors: Up to 16,000 dimensions
  • Index types: HNSW (M: max 100, efConstruction: max 1,000) and IVFFlat
  • Distance metrics: Euclidean, Cosine, Dot Product

Best for: Teams already on DocumentDB or migrating from MongoDB who want to add RAG without introducing a new database.

// Create vector index
db.documents.createIndex(
  { "embedding": "vectorSearch" },
  {
    "name": "vector_index",
    "vectorOptions": {
      "type": "hnsw",
      "dimensions": 1536,
      "similarity": "cosine",
      "m": 16,
      "efConstruction": 200
    }
  }
);

// Query
db.documents.aggregate([
  {
    "$vectorSearch": {
      "queryVector": queryEmbedding,
      "path": "embedding",
      "numCandidates": 100,
      "limit": 5,
      "index": "vector_index"
    }
  }
]);

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | Medium | Managed service but requires index management. |
| Security | High | VPC, TLS, KMS encryption, IAM. |
| Reliability | High | Multi-AZ, automated backups. |
| Performance | Medium | Good for moderate scale. Not optimized for vector-heavy workloads. |
| Cost | Medium | Instance-based. Cost-effective if already running DocumentDB. |
| Sustainability | Medium | Shared instance with document workloads. |

Trade-off: Convenient if you’re already on DocumentDB. But it’s not a vector-first database — if vectors are your primary workload, dedicated options perform better.


9. Amazon Kendra

Kendra isn’t a vector database — it’s an ML-powered enterprise search service. But it solves the same problem for a different audience: retrieval from enterprise knowledge for RAG.

How it works: Kendra indexes documents from 40+ connectors (S3, SharePoint, Confluence, ServiceNow, Salesforce, databases, web crawlers) and provides natural language query capabilities. No embedding pipeline needed.
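In code, pulling RAG context from Kendra is a single Retrieve call. The sketch below only builds the request parameters so it runs without an AWS account — the index ID is a placeholder, and you would pass the resulting dict to `boto3.client('kendra').retrieve(**params)`:

```python
def build_retrieve_params(index_id, query, user_id=None, page_size=5):
    """Build kwargs for the Kendra Retrieve API (passage retrieval for RAG)."""
    params = {
        "IndexId": index_id,
        "QueryText": query,
        "PageSize": page_size,
    }
    if user_id:
        # Kendra enforces source-system ACLs per user/group via UserContext
        params["UserContext"] = {"UserId": user_id}
    return params

params = build_retrieve_params(
    "00000000-0000-0000-0000-000000000000",  # placeholder index ID
    "How do I configure cross-account access?",
    user_id="alice@example.com",
)
# response = boto3.client("kendra").retrieve(**params)
# chunks = [item["Content"] for item in response["ResultItems"]]
```

Because Kendra returns ranked text passages directly, there is no embedding step on your side — the passages go straight into your prompt.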

Best for: Enterprise knowledge retrieval where the data lives across many SaaS tools and you don’t want to build custom embedding pipelines. Compliance-sensitive environments where Kendra’s built-in access control (respecting source system permissions) matters.

Well-Architected assessment:

| Pillar | Rating | Notes |
|---|---|---|
| Operational Excellence | High | Fully managed. Connectors handle sync automatically. |
| Security | High | Respects source system ACLs. Built-in access control per user/group. |
| Reliability | High | Managed, multi-AZ. |
| Performance | Medium | Higher latency than purpose-built vector stores. Optimized for precision, not speed. |
| Cost | Low | Index + connector pricing can be expensive. Starts at ~$810/month for Developer edition. |
| Sustainability | High | Serverless, scales with usage. |

Trade-off: Best out-of-box enterprise search with built-in security and connectors. But expensive, and less flexible than building your own retrieval pipeline.


Choosing Your Vector Store: The Decision Tree

Here’s how to shortcut the decision:

Start here: Do you want to manage vector infrastructure?

  • No → Bedrock Knowledge Bases. It handles everything. Start here, graduate later if needed.

Yes — then what’s your priority?

  • Hybrid search quality → OpenSearch. Combine keyword + semantic for best retrieval accuracy.
  • Lowest latency (cache) → ElastiCache. Microsecond retrieval for semantic caching layers.
  • Lowest latency (durable) → MemoryDB. Sub-millisecond with durability guarantees.
  • Simplest, cheapest → S3 Vectors. Pay-per-query, zero infra. Good for getting started.
  • Already on PostgreSQL → Aurora pgvector. Vectors alongside relational data.
  • Already on MongoDB → DocumentDB. Add vectors without a new database.
  • Entity relationships matter → Neptune Analytics. GraphRAG for complex domains.
  • Enterprise data across SaaS tools → Kendra. 40+ connectors, built-in ACL.

Advanced Patterns

Hybrid Search (Keyword + Semantic)

Pure vector search misses exact matches. If a user asks about “error code E-4012”, a semantic search might return results about generic error handling instead of the specific code. Hybrid search solves this by combining BM25 keyword matching with k-NN vector similarity.

OpenSearch is the only AWS-native service with built-in hybrid search. If you’re using another vector store, you can achieve hybrid search by:

  1. Running a keyword search in your primary database
  2. Running a vector search in your vector store
  3. Merging and reranking results using Bedrock’s reranker
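Step 3's merge can also be done without a reranker model. A common technique is Reciprocal Rank Fusion (RRF), which needs only the two ranked ID lists — no score normalization across systems. A minimal sketch, with the conventional k = 60 constant:

```python
def rrf_merge(keyword_ids, vector_ids, k=60, top_n=5):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in BOTH lists accumulate the largest scores
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

keyword_hits = ["doc-7", "doc-2", "doc-9"]   # BM25 results, best first
vector_hits  = ["doc-2", "doc-4", "doc-7"]   # k-NN results, best first
merged = rrf_merge(keyword_hits, vector_hits)
```

Here `doc-2` wins because it ranks near the top of both lists, which is exactly the behavior you want from hybrid retrieval.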

Chunking Strategies

How you chunk your documents impacts retrieval quality more than which vector store you use.

| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple documents, consistent structure |
| Semantic | Split at topic boundaries using NLP | Long-form content with varying topics |
| Hierarchical | Parent chunks (summaries) + child chunks (details) | Technical documentation, manuals |
| Sentence-window | Index individual sentences, retrieve surrounding context | Precise retrieval from dense text |

Bedrock Knowledge Bases supports fixed-size, semantic, and hierarchical chunking natively. For other vector stores, you’ll build this in your ingestion pipeline.
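For reference, fixed-size chunking with overlap — the simplest strategy in the table above — is only a few lines. Token counting is approximated here by whitespace splitting; a real pipeline would use the embedding model's own tokenizer:

```python
def chunk_fixed(text, chunk_size=200, overlap=50):
    """Split text into chunks of chunk_size tokens, sharing `overlap` tokens."""
    assert overlap < chunk_size
    tokens = text.split()  # crude stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final window already covers the tail of the document
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_fixed(doc, chunk_size=200, overlap=50)
# 500 tokens with step 150 -> chunks starting at token 0, 150, 300
```

The overlap means a sentence split by a chunk boundary still appears whole in at least one chunk, which is why even this naive strategy beats non-overlapping splits.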

Reranking

First-pass retrieval (vector similarity) gets you candidates. Reranking uses a cross-encoder model to score each candidate against the actual query — dramatically improving relevance.

On AWS, you have two options:

| Reranker | Model ID | Pricing | Regions |
|---|---|---|---|
| Amazon Rerank 1.0 | amazon.rerank-v1:0 | Included | ap-northeast-1, ca-central-1, eu-central-1, us-west-2 |
| Cohere Rerank 3.5 | cohere.rerank-v3-5:0 | $2.00 / 1K queries | ap-northeast-1, ca-central-1, eu-central-1, us-east-1, us-west-2 |

Watch out: Amazon Rerank 1.0 is not available in us-east-1. Use Cohere Rerank 3.5 if you’re in that region.

Always retrieve more candidates than you need (e.g., top 20), rerank, then use the top 5 in your prompt.
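That overfetch-then-rerank flow looks like this in outline — `cross_encoder_score` below is a deliberately crude token-overlap stand-in for a call to a real reranker model such as those in the table above:

```python
def rerank(query, candidates, score_fn, keep=5):
    """Score every candidate against the query, keep the best `keep`."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in scored[:keep]]

def cross_encoder_score(query, doc):
    # Stand-in scorer: fraction of query tokens found in the candidate.
    # A real system calls a cross-encoder reranker model here instead.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

# Overfetch: top 20 candidates from first-pass vector retrieval
candidates = [f"candidate passage {i} about topic" for i in range(20)]
candidates[13] = "exact answer to the question about vpc peering"

top5 = rerank("question about vpc peering", candidates, cross_encoder_score, keep=5)
```

The buried relevant passage surfaces to position one after reranking, even though first-pass retrieval ranked it 14th.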

Prompt Caching on Bedrock

If your RAG context is large and repeated across queries (e.g., a system prompt with 50 pages of product documentation), Bedrock’s prompt caching reduces both latency and cost by caching the processed context prefix across invocations.

Add cachePoint objects in your Converse API calls (or cache_control for Claude via InvokeModel). Cached tokens are read at a reduced rate, and cache hits don’t count against rate limits.
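A sketch of what a cached Converse request might look like. This builds only the request body so it runs without AWS credentials — the model ID and document text are placeholders, and you would pass the dict to `bedrock_runtime.converse(**request)`:

```python
def build_cached_request(model_id, doc_context, question):
    """Converse request with a cache checkpoint after the large static context."""
    return {
        "modelId": model_id,
        "system": [
            {"text": "Answer from the provided documentation."},
            {"text": doc_context},
            # Everything above this checkpoint is cached across invocations
            {"cachePoint": {"type": "default"}},
        ],
        "messages": [
            {"role": "user", "content": [{"text": question}]},
        ],
    }

request = build_cached_request(
    "anthropic.claude-sonnet-4-5-20250929-v1:0",   # assumed model ID
    "<50 pages of product documentation>",          # placeholder context
    "How do I configure cross-account access?",
)
# response = boto3.client("bedrock-runtime").converse(**request)
```

Only the per-turn user message changes between invocations; the system prompt and document context before the checkpoint are read from cache at the reduced token rate.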

| Model | Min Tokens per Checkpoint | Max Checkpoints | TTL |
|---|---|---|---|
| Claude Sonnet 4.5 | 1,024 | 4 | 5m or 1h |
| Claude Sonnet 4 | 1,024 | 4 | 5m |
| Claude 3.7 Sonnet | 1,024 | 4 | 5m |
| Amazon Nova (all) | 1,000 | 4 | 5m |

This is particularly effective for multi-turn RAG conversations where the system prompt + document context stays constant across turns.

Multi-Service Architecture

Production RAG systems often combine multiple services:

User Query
    ↓
[Bedrock KB] → Retrieves top 20 chunks from [OpenSearch]
    ↓
[Bedrock Reranker] → Reranks to top 5
    ↓
[ElastiCache] → Cache check (have we answered this before?)
    ↓
[Bedrock FM] → Generate answer with reranked context
    ↓
[ElastiCache] → Cache the response for future queries

This pattern uses OpenSearch for hybrid retrieval quality, ElastiCache for response caching, and Bedrock for orchestration and generation.

What I Learned

  • Hybrid search is underrated — Most RAG tutorials skip it, but combining keyword + semantic retrieval consistently produces better results than vector search alone. OpenSearch is the only AWS-native service that does this in a single query.
  • Start managed, graduate to custom — Bedrock Knowledge Bases gets you to production fastest. You can always migrate to a self-managed vector store later once you understand your query patterns and scale requirements.
  • Your chunking strategy matters more than your vector store — Teams spend weeks debating Aurora vs OpenSearch, then use naive fixed-size chunking and wonder why retrieval quality is poor. Invest in chunking first.
  • Caching is the hidden cost saver — Adding ElastiCache as a semantic cache in front of your vector store can reduce retrieval costs by 60-80% for workloads with repetitive queries (support chatbots, FAQ assistants).
  • The Well-Architected GenAI Lens is worth reading — It provides specific guidance on vector store optimization and data retrieval performance that goes beyond what any single blog post can cover.

What’s Next

  • Benchmark latency and cost across S3 Vectors, OpenSearch Serverless, and Aurora pgvector with a real dataset
  • Test Bedrock Knowledge Bases hierarchical chunking vs custom semantic chunking pipeline
  • Build a reference architecture combining OpenSearch (retrieval) + ElastiCache (caching) + Bedrock (orchestration)
  • Evaluate DynamoDB vector search when it moves to GA — single-table design with vectors could simplify many architectures
