Claude Prompt for RAG Pipelines
Hybrid BM25 + dense retrieval architecture with Cohere Rerank 3.5 cross-encoder reranking, tuned for product manuals.
You are a search and retrieval engineer. Design a hybrid retrieval stack over OpenSearch with Cohere Rerank 3.5 reranking for a corpus of product manuals.
## Why Hybrid
Dense retrieval nails semantic meaning but misses rare/exact terms (product codes, error strings, legal citations). BM25 nails lexical matches but misses paraphrase. Hybrid + rerank gets the best of both at the cost of one extra rerank call.
## Index Design in OpenSearch
- **Dense field:** nomic-embed-text-v1.5 embeddings (768-dim, L2)
- **Sparse field:** BM25 (or SPLADE if OpenSearch supports learned sparse) with analyzer suited to product manuals
- **Metadata fields:** doc_id, source, section_path, created_at, author, lang, tenant_id, access_level, doc_type
- **Text field:** raw chunk text for rerank input and citation quotes
### Tokenizer / Analyzer Config
For product manuals, configure the BM25 analyzer with:
- Lowercase filter
- Stopword list: standard English ("the", "a", ...) plus manual-specific noise ("page", "figure")
- Stemmer: Porter for English, language-specific otherwise
- Synonym filter: maintain a curated synonyms.txt with domain terms
- Preserve alphanumerics (product codes, versions like "v2.3.1")
- n-gram on exact-match fields (error codes, SKUs)
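A minimal index-body sketch of these settings, assuming opensearch-py; the index, field, and analyzer names are illustrative, not prescribed:

```python
# Sketch: hybrid index with a custom BM25 analyzer + HNSW vector field.
# Assumes opensearch-py; all names here are illustrative.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

index_body = {
    "settings": {
        "index": {"knn": True},  # enable k-NN search on this index
        "analysis": {
            "filter": {
                "manual_stopwords": {
                    "type": "stop",
                    "stopwords": ["the", "a", "page", "figure"],  # extend with the full English list
                },
                "manual_synonyms": {"type": "synonym", "synonyms_path": "synonyms.txt"},
                "english_stemmer": {"type": "stemmer", "language": "porter2"},
            },
            "analyzer": {
                "product_manuals_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",  # verify it preserves tokens like "v2.3.1"
                    "filter": ["lowercase", "manual_stopwords", "manual_synonyms", "english_stemmer"],
                }
            },
        },
    },
    "mappings": {
        "properties": {
            "chunk_text": {"type": "text", "analyzer": "product_manuals_analyzer"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,  # nomic-embed-text-v1.5 output size
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss",
                    "parameters": {"m": 16, "ef_construction": 200, "ef_search": 256},
                },
            },
            "error_code": {"type": "keyword"},  # exact-match field for codes / SKUs
            "tenant_id": {"type": "keyword"},
            "access_level": {"type": "keyword"},
            "doc_type": {"type": "keyword"},
            "created_at": {"type": "date"},
        }
    },
}
client.indices.create(index="manual-chunks", body=index_body)
```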
## Retrieval Stage 1: Parallel First-Stage
Run in parallel (not sequentially):
- **Dense:** top-200 by L2 via HNSW (efSearch=256)
- **Sparse:** top-200 by BM25
- Apply metadata filters in BOTH (tenant_id, access_level, date range)
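A sketch of the concurrent first-stage calls, assuming opensearch-py and the index layout above; the filter list is deliberately identical in both branches:

```python
# Sketch: dense (k-NN) and sparse (BM25) queries issued concurrently.
from concurrent.futures import ThreadPoolExecutor

def first_stage(client, index, query_text, query_vec, tenant_id, k=200):
    # Metadata filters applied identically to both branches.
    filters = [{"term": {"tenant_id": tenant_id}}]  # add access_level, date range, ...

    dense_body = {
        "size": k,
        "query": {"bool": {
            "filter": filters,
            "must": [{"knn": {"embedding": {"vector": query_vec, "k": k}}}],
        }},
    }
    sparse_body = {
        "size": k,
        "query": {"bool": {
            "filter": filters,
            "must": [{"match": {"chunk_text": query_text}}],
        }},
    }
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense = pool.submit(client.search, index=index, body=dense_body)
        sparse = pool.submit(client.search, index=index, body=sparse_body)
        return dense.result()["hits"]["hits"], sparse.result()["hits"]["hits"]
```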
## Retrieval Stage 2: Fusion
Combine with **Reciprocal Rank Fusion** (RRF):
```
score(doc) = Σ 1 / (k + rank_i(doc)) where k = 60
```
- Deduplicate by chunk_id (keep highest RRF score)
- Truncate to top-50 before rerank
Why RRF over weighted linear combination: scores from BM25 and cosine are NOT comparable without careful calibration. RRF uses only ranks, so it is robust to score drift.
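A minimal RRF implementation over the two hit lists, including the dedup and truncation steps above:

```python
# Sketch: Reciprocal Rank Fusion with dedup by chunk id.
def rrf_fuse(result_lists, k=60, top_n=50):
    """result_lists: iterable of ranked lists of OpenSearch hits."""
    scores, docs = {}, {}
    for hits in result_lists:
        for rank, hit in enumerate(hits, start=1):
            chunk_id = hit["_id"]
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
            docs.setdefault(chunk_id, hit)  # dedupe: keep one payload per chunk
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [(docs[cid], scores[cid]) for cid in ranked]
```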
## Retrieval Stage 3: Rerank with Cohere Rerank 3.5
- Send top-50 (chunk_text, query) pairs to Cohere Rerank 3.5
- Use batching (typically 32-64 pairs per API call) to stay within provider rate limits
- Rerank returns relevance scores in [0, 1]
- Truncate to top-3 for the LLM context window
- **Confidence gate:** if top-1 rerank score < 0.30, return "no confident answer" refusal rather than synthesizing
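A sketch of the rerank stage, assuming the cohere Python SDK (`rerank-v3.5` is the API model id corresponding to Rerank 3.5):

```python
# Sketch: batched Cohere rerank over the RRF candidates, plus the confidence gate.
import cohere

co = cohere.ClientV2()  # API key read from the CO_API_KEY environment variable

def rerank(query, candidates, batch_size=32, top_n_out=3, min_score=0.30):
    """candidates: list of (chunk_id, chunk_text) tuples from RRF fusion."""
    scored = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        resp = co.rerank(
            model="rerank-v3.5",
            query=query,
            documents=[text for _, text in batch],
        )
        for r in resp.results:  # each result carries .index (into batch) and .relevance_score
            scored.append((batch[r.index][0], r.relevance_score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if not scored or scored[0][1] < min_score:
        return None  # caller returns the "no confident answer" refusal
    return scored[:top_n_out]
```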
## Context Packing
- Sort final chunks by relevance descending
- Inject with clear boundaries:
```
<chunk id="c-{i}" source="{url}" score="{score}">
{chunk_text}
</chunk>
```
- Token budget: leave 800 tokens headroom for the answer
- If chunks would overflow, DROP from the bottom (least relevant), never truncate mid-chunk
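A packing sketch under these rules; `count_tokens` is a stand-in for your model's actual tokenizer (the chars/4 default is only a rough heuristic):

```python
# Sketch: pack reranked chunks into the prompt under a token budget.
def pack_context(chunks, budget_tokens, answer_headroom=800,
                 count_tokens=lambda s: len(s) // 4):
    """chunks: dicts with source, score, text, already sorted by relevance desc."""
    remaining = budget_tokens - answer_headroom
    packed = []
    for i, c in enumerate(chunks):
        block = (
            f'<chunk id="c-{i}" source="{c["source"]}" score="{c["score"]:.2f}">\n'
            f'{c["text"]}\n</chunk>'
        )
        cost = count_tokens(block)
        if cost > remaining:
            break  # drop from the bottom; never truncate mid-chunk
        packed.append(block)
        remaining -= cost
    return "\n".join(packed)
```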
## Latency Budget (p95)
- Query embed: ≤ 80ms
- Dense search: ≤ 150ms
- Sparse search: ≤ 100ms (parallel with dense)
- Fusion: ≤ 5ms
- Rerank 50 pairs: ≤ 300ms
- LLM synthesis: ≤ 1500ms
- **Total p95:** ≤ 2.0s
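The budget is only enforceable if each stage is measured. A small timing sketch; `record` is a hypothetical hook into your metrics backend (StatsD, Prometheus, ...):

```python
# Sketch: per-stage latency instrumentation for the p95 budget above.
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name, record):
    start = time.perf_counter()
    try:
        yield
    finally:
        record(f"retrieval.{name}.latency_ms", (time.perf_counter() - start) * 1000)

# usage:
# with stage_timer("rerank", record=metrics.timing):
#     results = rerank(query, candidates)
```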
## Evaluation: Ablation Matrix
Build a golden set from real user queries tagged with expected doc_ids. Run these variants:
1. Dense only (top-3)
2. Sparse only (top-3)
3. Hybrid no rerank (top-3 from RRF)
4. Hybrid + rerank (top-3)
Metrics: hit@1, hit@3, hit@10, MRR, NDCG@10. Report deltas vs variant 1.
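A sketch of the hit@k and MRR computation over the golden set; NDCG@10 is easiest via an off-the-shelf library such as ir_measures rather than hand-rolling:

```python
# Sketch: hit@k and MRR for one retrieval variant against the golden set.
def evaluate(run_variant, golden_set, ks=(1, 3, 10)):
    """run_variant(query) -> ranked doc_ids; golden_set: list of (query, set of expected doc_ids)."""
    hits = {k: 0 for k in ks}
    rr_sum = 0.0
    for query, expected in golden_set:
        ranked = run_variant(query)
        for k in ks:
            if expected & set(ranked[:k]):
                hits[k] += 1
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in expected:
                rr_sum += 1.0 / rank
                break
    n = len(golden_set)
    return {**{f"hit@{k}": hits[k] / n for k in ks}, "mrr": rr_sum / n}
```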
**Expected findings on product manuals:**
- Hybrid beats dense by 3-8 pts on queries with product codes, citations, exact phrases
- Rerank adds 5-15 pts hit@3 over hybrid alone
- Latency cost of rerank: ~200-400ms
## Configuration File
Produce a `retrieval.yaml` that captures ALL knobs:
```yaml
dense:
  model: nomic-embed-text-v1.5
  dim: 768
  similarity: L2
  hnsw: { M: 16, efConstruction: 200, efSearch: 256 }
  top_k: 200
sparse:
  analyzer: product_manuals_analyzer
  top_k: 200
fusion:
  method: rrf
  k: 60
rerank:
  model: rerank-v3.5        # API id for Cohere Rerank 3.5
  top_n_in: 50
  top_n_out: 3
  min_score: 0.30
  batch_size: 32
```
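A minimal loader sketch, assuming PyYAML; key names mirror the file above:

```python
# Sketch: load retrieval.yaml so every knob lives in one reviewed file.
import yaml  # PyYAML

with open("retrieval.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["fusion"]["method"] == "rrf"
dense_top_k = cfg["dense"]["top_k"]            # 200
min_rerank_score = cfg["rerank"]["min_score"]  # 0.30
```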
## Edge Cases
- Empty BM25 result (query has only stopwords) → dense-only
- Empty dense result (index cold start) → BM25-only, log alert
- Rerank API 5xx → use RRF ranking directly, log degradation
- Query contains a product code matching `[A-Z]{2,4}-\d{3,6}` → boost exact match in sparse via should-clause (see the fallback sketch below)
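A sketch wiring these degradation paths together; it reuses `rrf_fuse` from the fusion sketch, and the search/rerank callables (including the hypothetical `boost_codes` parameter) are illustrative:

```python
# Sketch: fallback logic for the edge cases above.
import logging
import re

log = logging.getLogger("retrieval")
PRODUCT_CODE = re.compile(r"[A-Z]{2,4}-\d{3,6}")

def retrieve(query, dense_search, sparse_search, rerank_fn, rrf_fuse):
    # Detected product codes become an exact-match should-clause in the sparse query.
    codes = PRODUCT_CODE.findall(query)
    sparse_hits = sparse_search(query, boost_codes=codes)
    dense_hits = dense_search(query)

    if not sparse_hits:            # stopword-only query
        fused = [(h, 0.0) for h in dense_hits[:50]]
    elif not dense_hits:           # index cold start
        log.warning("dense index empty; serving BM25 only")
        fused = [(h, 0.0) for h in sparse_hits[:50]]
    else:
        fused = rrf_fuse([dense_hits, sparse_hits])

    try:
        return rerank_fn(query, fused)
    except Exception:              # rerank API 5xx, timeout, ...
        log.warning("rerank unavailable; falling back to RRF order", exc_info=True)
        return fused[:3]
```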
Produce code in Python wiring this end to end, with metrics, tracing, and tests.
Present as numbered steps. Each step should have: a clear action title, detailed instructions, expected outcome, and common pitfalls to avoid.