Claude Prompt for RAG Pipelines
Production RAG recipe: recursive character chunking, voyage-3 embeddings, Azure AI Search storage, Voyage rerank-2 reranking. Includes retrieval evals.
You are a senior AI engineer designing a production RAG pipeline. Produce a complete, implementation-ready spec that a mid-level engineer can ship in one sprint.
## System Context
- **Document corpus:** scanned PDFs with OCR artifacts (~20M documents)
- **Query volume:** 50 queries/second peak (~3M daily)
- **Latency SLO:** end-to-end p95 under 3s
- **Language coverage:** English only
- **Freshness requirement:** hourly
## Pipeline Architecture
### 1. Ingestion & Parsing
- Parser choice for scanned PDFs with OCR artifacts (Unstructured, LlamaParse, pdfplumber, or custom): justify the tradeoffs
- How to preserve tables, code blocks, and hierarchical structure
- OCR fallback strategy for scanned pages (Tesseract vs GPT-4o vision vs Textract)
- Metadata to extract per document: source_url, created_at, author, section_path, doc_type, access_level
- Deduplication: content hash + near-duplicate detection via MinHash/SimHash
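The dedup step above can be sketched as follows. This is a minimal illustration, not a production implementation: the 3-gram shingling, the 64-bit hash width, and the Hamming-distance threshold are placeholder defaults to tune against your corpus.

```python
import hashlib

def exact_hash(text: str) -> str:
    """Content hash for exact-duplicate detection."""
    return hashlib.sha256(text.encode()).hexdigest()

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over word 3-grams (a sketch; production code
    would tokenize more carefully and tune the shingle size)."""
    tokens = text.lower().split()
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    counts = [0] * bits
    for sh in shingles:
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def is_near_duplicate(a: str, b: str, max_hamming: int = 3) -> bool:
    """Near-duplicate if the SimHash fingerprints differ in <= max_hamming bits."""
    return bin(simhash(a) ^ simhash(b)).count("1") <= max_hamming
```

At scale, pairwise comparison is infeasible; bucket fingerprints by bit-bands (the standard SimHash indexing trick) so only candidates sharing a band are compared.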
### 2. Chunking: recursive character
- Target chunk size: 512 tokens with 0 token overlap
- Boundary rules (where you're allowed and NOT allowed to split)
- How to handle chunks that are too small (merge) and too large (recursive split)
- Special handling for tables, code blocks, lists (do NOT split these)
- Contextual header: prepend document title + section breadcrumb to each chunk
- If chunking strategy is 'contextual retrieval', include the Claude Haiku prompt to generate the 50-100 token contextual prefix per chunk
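A minimal sketch of the recursive character splitter described above. It works in characters with a ~4 chars/token approximation; a production version would count real tokens and apply the table/code/list boundary rules before splitting.

```python
def recursive_split(text, max_chars=2048, separators=("\n\n", "\n", ". ", " ")):
    """Recursive character splitter (sketch). max_chars ~= 512 tokens
    at ~4 chars/token. Separators at flush boundaries are dropped."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        if sep in text:
            chunks, buf = [], ""
            for part in text.split(sep):
                candidate = buf + sep + part if buf else part
                if len(candidate) <= max_chars:
                    buf = candidate
                elif len(part) > max_chars:
                    # a single part still exceeds the limit: recurse with finer separators
                    if buf:
                        chunks.append(buf)
                    chunks.extend(recursive_split(part, max_chars, separators))
                    buf = ""
                else:
                    chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            return chunks
    # no separator found anywhere: hard split
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```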
### 3. Embedding: voyage-3
- Batch size, rate-limit handling, retry policy (exponential backoff + jitter)
- Cost estimate at corpus size (model price × tokens)
- Embedding dimensionality and storage implications
- When to re-embed (model upgrade, chunking change) and migration path
- Normalization: L2-normalize if using dot product; skip if cosine
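The retry policy can be sketched as a generic wrapper. The callable passed in would wrap the actual embedding API call; the delay constants below are illustrative defaults, not vendor-recommended values.

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Exponential backoff with full jitter (sketch). `fn` is any
    zero-arg callable that raises on transient failure, e.g. a
    hypothetical embed_batch() wrapper around the embedding client."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0.0, delay))  # full jitter
```

In practice, catch only the client's rate-limit/transient error types rather than bare `Exception`, so permanent errors (bad input, auth) fail fast.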
### 4. Storage: Azure AI Search
- Index configuration (HNSW M, efConstruction, efSearch OR IVF nlist, nprobe)
- Payload schema (which metadata is filterable, which is projected)
- Sharding / namespace strategy for multi-tenancy
- Backup and point-in-time recovery
- Estimated storage cost per 1M chunks
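A sketch of the index definition, shaped after the Azure AI Search REST schema. Verify field names and the HNSW parameter block against the API version you deploy; the 1024-dim value assumes voyage-3's published dimensionality (confirm against the model card), and the HNSW values are common starting points, not tuned numbers.

```python
# Sketch of an Azure AI Search index definition (REST-style shape; verify
# against the API version you target). All parameter values are starting
# points to tune: m, efConstruction, efSearch, and the embedding dimension.
index_definition = {
    "name": "rag-chunks",
    "fields": [
        {"name": "chunk_id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "tenant_id", "type": "Edm.String", "filterable": True},
        {"name": "access_level", "type": "Edm.String", "filterable": True},
        {"name": "created_at", "type": "Edm.DateTimeOffset", "filterable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {
            "name": "embedding",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": 1024,  # assumed voyage-3 dimensionality
            "vectorSearchProfile": "hnsw-profile",
        },
    ],
    "vectorSearch": {
        "algorithms": [
            {
                "name": "hnsw-default",
                "kind": "hnsw",
                "hnswParameters": {"m": 16, "efConstruction": 400, "efSearch": 100, "metric": "cosine"},
            }
        ],
        "profiles": [{"name": "hnsw-profile", "algorithm": "hnsw-default"}],
    },
}
```

Note that `tenant_id` and `access_level` are filterable so ACL enforcement happens inside the vector query, matching the edge-case requirement below.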
### 5. Retrieval: hybrid (BM25 + dense, RRF fusion)
- Top-k at retrieval: 20
- Metadata filters (tenant_id, access_level, date range)
- Hybrid search weights and Reciprocal Rank Fusion k=60
- Query preprocessing: lowercase, stopword handling, acronym expansion
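Reciprocal Rank Fusion is small enough to show in full. With the k=60 specified above:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion. `rankings` is a list of ranked id lists
    (e.g. [bm25_ids, dense_ids]); returns ids sorted by fused score,
    where each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between BM25 and dense scores, which is why it is a safer default than weighted score interpolation.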
### 6. Reranking: Voyage rerank-2
- Rerank top-20 down to top-10
- Latency budget for rerank step
- When to skip rerank (very high-confidence top result, latency pressure)
- Fallback if reranker API fails
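The fallback logic can be sketched as follows. `rerank_fn` stands in for a wrapper around the Voyage rerank-2 call and is an assumption, not the real client API; the key property is that any reranker failure degrades gracefully to the original retrieval order.

```python
def rerank_with_fallback(query, chunks, rerank_fn, top_n=10):
    """Rerank `chunks` (list of dicts with a "text" field) for `query`.
    `rerank_fn(query, texts)` is a hypothetical wrapper returning indices
    sorted by relevance. On any failure, fall back to retrieval order."""
    try:
        order = rerank_fn(query, [c["text"] for c in chunks])
        return [chunks[i] for i in order[:top_n]]
    except Exception:
        # reranker timeout/outage: serve the fused retrieval order instead
        return chunks[:top_n]
```

Wrap the call in a hard timeout sized to the rerank latency budget so a slow reranker triggers the fallback rather than blowing the end-to-end SLO.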
### 7. Answer Synthesis
- Model: Claude Sonnet 4.5 or GPT-4.1 (justify)
- System prompt template with citation requirements
- Output format: answer + citations array [{chunk_id, quote, doc_url}]
- Refusal rules: when retrieval returns low-relevance chunks, refuse rather than hallucinate
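The refusal rule can be gated on reranker scores before the LLM is ever called. The threshold below is a placeholder to tune against the golden set, not a recommended value.

```python
def should_refuse(reranked, min_score=0.3, min_chunks=1):
    """Refusal gate (sketch): refuse when fewer than `min_chunks`
    chunks clear the relevance threshold. `reranked` is a list of
    dicts with an optional "score" field from the reranker."""
    relevant = [c for c in reranked if c.get("score", 0.0) >= min_score]
    return len(relevant) < min_chunks
```

Refusing here is cheaper and safer than letting the model synthesize from irrelevant context, and it keeps the faithfulness metric honest.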
## Prompt Template
```
You are a helpful assistant answering questions about scanned PDFs with OCR artifacts.
RULES:
1. Use ONLY the context below. If the answer is not present, say "I don't have enough information."
2. Cite every factual claim with [doc_id] markers.
3. Quote short spans verbatim for critical facts.
4. If context conflicts, note the conflict.
CONTEXT:
{retrieved_chunks_with_ids}
QUESTION: {user_question}
ANSWER:
```
## Evaluation Plan
Build a golden set of 100 question+expected-sources pairs from real user queries. Score:
- **Retrieval hit@k** (k = 20, 10, 5, 1): did the correct chunk appear in the top k?
- **Context precision** (Ragas): are retrieved chunks relevant?
- **Context recall**: did we retrieve everything needed?
- **Faithfulness**: does the answer stay grounded in context?
- **Answer relevance**: does the answer address the question?
- **Citation accuracy**: do cited chunks actually support the claim?
Target thresholds: hit@10 ≥ 0.90, faithfulness ≥ 0.95, citation accuracy ≥ 0.98.
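The hit@k metric is straightforward to wire into the CI gate; a sketch assuming golden-set examples carry `retrieved` and `gold` chunk-id lists:

```python
def hit_at_k(retrieved_ids, gold_ids, k):
    """1.0 if any gold chunk appears in the top-k retrieved ids, else 0.0."""
    return 1.0 if set(retrieved_ids[:k]) & set(gold_ids) else 0.0

def mean_hit_at_k(examples, k):
    """Average hit@k over golden-set examples, each a dict with
    'retrieved' (ranked chunk ids) and 'gold' (expected chunk ids)."""
    return sum(hit_at_k(e["retrieved"], e["gold"], k) for e in examples) / len(examples)
```

The CI gate then fails the build when `mean_hit_at_k(golden, 10) < 0.90`, mirroring the thresholds above.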
## Edge Cases to Handle
- Very long documents (> 1M tokens): hierarchical summarization or map-reduce
- Tables with numeric data: extract to markdown tables, preserve headers
- Code chunks: chunk by function/class, never mid-function
- Non-English docs: language detection, multilingual embedding, answer in query language
- PII redaction before storage (emails, SSNs, phone numbers)
- Access-controlled docs: enforce ACL filters at query time, NEVER post-filter after retrieval
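A minimal redaction pass for the PII formats listed above. These regexes catch common US-style formats only; a dedicated PII-detection service is the safer choice in production.

```python
import re

# Minimal PII redaction (sketch): patterns cover common formats only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each PII match with its bracketed label, e.g. [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Run this before embedding and storage so raw PII never lands in the vector index or chunk payloads.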
## Cost & Latency Estimate
| Component | Latency (p95) | Cost per query |
|-----------|---------------|----------------|
| Query embed | — | — |
| Vector search | — | — |
| Rerank | — | — |
| LLM synthesis | — | — |
| **Total** | — | — |
Fill these in with specific numbers for voyage-3 + Azure AI Search + Voyage rerank-2 at 50 QPS.
## Deliverables
1. Code scaffold (Python or TypeScript): ingest.py, chunk.py, embed.py, retrieve.py, synthesize.py, evaluate.py
2. Chunking parameter table (chunk_size, overlap, separators) tuned for scanned PDFs with OCR artifacts
3. Prompt template (above) with variables called out
4. Eval harness wired to the golden set with CI gate on the three target thresholds
5. Runbook: what to do when retrieval quality drops, what to do when latency spikes
6. Cost model spreadsheet with ingestion cost, per-query cost, storage cost
- Use precise technical terminology appropriate for the audience
- Include code examples, configurations, or specifications where relevant
- Document assumptions, prerequisites, and dependencies
- Provide error handling and edge case considerations
Present your output in a clear, organized structure with headers (##), subheaders (###), and bullet points. Use bold for key terms.