Claude Prompt for RAG Pipelines
Production RAG recipe: token-based sliding window chunking, jina-embeddings-v3 embeddings, Vectorize (Cloudflare) storage, mxbai-rerank-large reranking. Includes retrieval evals.
You are a senior AI engineer designing a production RAG pipeline. Produce a complete, implementation-ready spec that a mid-level engineer can ship in one sprint.
## System Context
- **Document corpus:** code repositories (~5M documents)
- **Query volume:** 200 queries/second peak, 200k daily
- **Latency SLO:** end-to-end p95 under 5s
- **Language coverage:** 40+ languages
- **Freshness requirement:** weekly
## Pipeline Architecture
### 1. Ingestion & Parsing
- Parser choice for code repositories (Unstructured, LlamaParse, pdfplumber, custom): justify the tradeoffs
- How to preserve tables, code blocks, and hierarchical structure
- OCR fallback strategy for scanned pages (Tesseract vs GPT-4o vision vs Textract)
- Metadata to extract per document: source_url, created_at, author, section_path, doc_type, access_level
- Deduplication: content hash + near-duplicate detection via MinHash/SimHash
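A minimal deduplication sketch in Python, assuming the `datasketch` package for MinHash LSH; the 0.9 Jaccard threshold and whitespace shingling are placeholder choices to tune per corpus:

```python
# Dedup sketch: exact duplicates via content hash, near-duplicates via MinHash LSH.
# Assumes the `datasketch` package; threshold and shingling are tunable placeholders.
import hashlib
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128
lsh = MinHashLSH(threshold=0.9, num_perm=NUM_PERM)  # Jaccard threshold for "near duplicate"
seen_hashes: set[str] = set()

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for token in text.split():            # whitespace shingles; use n-grams for code
        m.update(token.encode("utf-8"))
    return m

def is_duplicate(doc_id: str, text: str) -> bool:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:             # exact duplicate
        return True
    mh = minhash(text)
    if lsh.query(mh):                     # near duplicate of something already indexed
        return True
    seen_hashes.add(digest)
    lsh.insert(doc_id, mh)
    return False
```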
### 2. Chunking: token-based sliding window
- Target chunk size: 512 tokens with 0 token overlap
- Boundary rules (where you're allowed and NOT allowed to split)
- How to handle chunks that are too small (merge) and too large (recursive split)
- Special handling for tables, code blocks, lists (do NOT split these)
- Contextual header: prepend document title + section breadcrumb to each chunk
- If chunking strategy is 'contextual retrieval', include the Claude Haiku prompt to generate the 50-100 token contextual prefix per chunk
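A minimal sliding-window chunker sketch with the contextual header prepended to each chunk. It uses `tiktoken` as a stand-in tokenizer (an assumption; swap in the embedding model's own tokenizer for exact counts) and omits the boundary rules and table/code-block handling called out above:

```python
# Token-based sliding-window chunking sketch with contextual header.
# tiktoken is a stand-in tokenizer; boundary rules are not implemented here.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk(text: str, title: str, breadcrumb: str,
          chunk_size: int = 512, overlap: int = 0) -> list[str]:
    header = f"{title} > {breadcrumb}\n\n"        # contextual header per chunk
    body_budget = chunk_size - len(enc.encode(header))
    tokens = enc.encode(text)
    step = max(body_budget - overlap, 1)          # slide by budget minus overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + body_budget]
        if not window:
            break
        chunks.append(header + enc.decode(window))
    return chunks
```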
### 3. Embedding: jina-embeddings-v3
- Batch size, rate-limit handling, retry policy (exponential backoff + jitter)
- Cost estimate at corpus size (model price × tokens)
- Embedding dimensionality and storage implications
- When to re-embed (model upgrade, chunking change) and migration path
- Normalization: L2-normalize if using dot product; skip if cosine
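A sketch of the batching and retry policy above, assuming the hosted Jina embeddings endpoint at `https://api.jina.ai/v1/embeddings`; verify the current request schema and `task` values before relying on it:

```python
# Batched embedding sketch with exponential backoff + full jitter.
# Endpoint, payload fields, and response shape are assumptions to verify.
import os, random, time, requests

JINA_URL = "https://api.jina.ai/v1/embeddings"
HEADERS = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}

def embed_batch(texts: list[str], max_retries: int = 5) -> list[list[float]]:
    payload = {"model": "jina-embeddings-v3", "task": "retrieval.passage", "input": texts}
    for attempt in range(max_retries):
        resp = requests.post(JINA_URL, json=payload, headers=HEADERS, timeout=30)
        if resp.status_code == 200:
            return [item["embedding"] for item in resp.json()["data"]]
        if resp.status_code in (429, 500, 502, 503):
            # retryable: back off exponentially with jitter, capped at 30s
            time.sleep(min(2 ** attempt, 30) * random.random() + 0.1)
            continue
        resp.raise_for_status()               # non-retryable client error
    raise RuntimeError("embedding batch failed after retries")
```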
### 4. Storage: Vectorize (Cloudflare)
- Index configuration (HNSW M, efConstruction, efSearch OR IVF nlist, nprobe)
- Payload schema (which metadata is filterable, which is projected)
- Sharding / namespace strategy for multi-tenancy
- Backup and point-in-time recovery
- Estimated storage cost per 1M chunks
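An upsert sketch against the Cloudflare Vectorize REST API, assuming the v2 index endpoints and an NDJSON request body; confirm paths, payload limits, and which metadata fields must be declared filterable against the current docs:

```python
# Vectorize upsert sketch via the Cloudflare REST API (assumed v2 endpoints).
# Metadata keys here are the filterable fields named in the payload schema above.
import json, os, requests

ACCOUNT = os.environ["CF_ACCOUNT_ID"]
TOKEN = os.environ["CF_API_TOKEN"]
INDEX = "rag-chunks"
BASE = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT}/vectorize/v2/indexes"

def upsert(chunks: list[dict]) -> None:
    # One NDJSON line per vector: id, values, filterable metadata.
    lines = "\n".join(json.dumps({
        "id": c["chunk_id"],
        "values": c["embedding"],   # 1024-dim for the jina-embeddings-v3 default
        "metadata": {k: c[k] for k in ("tenant_id", "access_level", "doc_type", "created_at")},
    }) for c in chunks)
    resp = requests.post(
        f"{BASE}/{INDEX}/upsert",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/x-ndjson"},
        data=lines.encode("utf-8"),
        timeout=60,
    )
    resp.raise_for_status()
```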
### 5. Retrieval: BM25
- Top-k at retrieval: 30
- Metadata filters (tenant_id, access_level, date range)
- Hybrid search weights and Reciprocal Rank Fusion k=60
- Query preprocessing: lowercase, stopword handling, acronym expansion
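A minimal Reciprocal Rank Fusion sketch (k=60) merging BM25 and dense result lists; it assumes metadata and ACL filters were already applied inside both retrievers:

```python
# Reciprocal Rank Fusion: score each chunk by 1/(k + rank) per ranking, then merge.
def rrf(bm25_ids: list[str], dense_ids: list[str], k: int = 60, top_k: int = 30) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, dense_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```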
### 6. Reranking: mxbai-rerank-large
- Rerank top-30 down to top-5
- Latency budget for rerank step
- When to skip rerank (very high-confidence top result, latency pressure)
- Fallback if reranker API fails
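A reranking sketch with the fallback behavior above, assuming a self-hosted cross-encoder loaded through `sentence-transformers` (`mixedbread-ai/mxbai-rerank-large-v1`); a hosted rerank API would slot in the same way:

```python
# Cross-encoder reranking with fallback to the original retrieval order on failure.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1")

def rerank(query: str, chunks: list[dict], top_n: int = 5) -> list[dict]:
    try:
        scores = reranker.predict([(query, c["text"]) for c in chunks])
        ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
        return [c for _, c in ranked[:top_n]]
    except Exception:
        return chunks[:top_n]   # fallback: keep the vector-search order
```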
### 7. Answer Synthesis
- Model: Claude Sonnet 4.5 or GPT-4.1 (justify)
- System prompt template with citation requirements
- Output format: answer + citations array [{chunk_id, quote, doc_url}]
- Refusal rules: when retrieval returns low-relevance chunks, refuse rather than hallucinate
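A synthesis sketch using the Anthropic SDK with a simple refusal gate; the model alias, the 0.2 relevance floor, and the JSON response contract are assumptions to adjust for your stack:

```python
# Answer synthesis sketch: refuse on low-relevance context, otherwise call Claude
# and expect {"answer": ..., "citations": [{chunk_id, quote, doc_url}]} back.
import json
import anthropic

client = anthropic.Anthropic()
RELEVANCE_FLOOR = 0.2   # placeholder rerank-score threshold for refusal

def synthesize(question: str, chunks: list[dict], system_prompt: str) -> dict:
    if not chunks or max(c.get("rerank_score", 0.0) for c in chunks) < RELEVANCE_FLOOR:
        return {"answer": "I don't have enough information.", "citations": []}
    context = "\n\n".join(f"[{c['chunk_id']}] {c['text']}" for c in chunks)
    msg = client.messages.create(
        model="claude-sonnet-4-5",          # verify the current model alias
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION: {question}"}],
    )
    # Assumes the model honors a JSON-only output contract defined in the system prompt.
    return json.loads(msg.content[0].text)
```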
## Prompt Template
```
You are a helpful assistant answering questions about code repositories.
RULES:
1. Use ONLY the context below. If the answer is not present, say "I don't have enough information."
2. Cite every factual claim with [doc_id] markers.
3. Quote short spans verbatim for critical facts.
4. If context conflicts, note the conflict.
CONTEXT:
{retrieved_chunks_with_ids}
QUESTION: {user_question}
ANSWER:
```
## Evaluation Plan
Build a golden set of 500 question+expected-sources pairs from real user queries. Score:
- **Retrieval hit@k** (30, 5, 1): did the correct chunk appear?
- **Context precision** (Ragas): are retrieved chunks relevant?
- **Context recall**: did we retrieve everything needed?
- **Faithfulness**: does the answer stay grounded in context?
- **Answer relevance**: does the answer address the question?
- **Citation accuracy**: do cited chunks actually support the claim?
Target thresholds: hit@5 ≥ 0.90, faithfulness ≥ 0.95, citation accuracy ≥ 0.98.
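A golden-set hit@k harness sketch in plain Python; faithfulness, context precision/recall, and citation accuracy would come from Ragas or an LLM judge rather than this loop:

```python
# hit@k over (question, expected chunk ids) pairs from the golden set.
# `retrieve` is your pipeline's retrieval function returning ranked chunk ids.
def hit_at_k(golden: list[dict], retrieve, ks=(1, 5, 30)) -> dict[int, float]:
    hits = {k: 0 for k in ks}
    for example in golden:                  # {"question": ..., "expected_ids": [...]}
        retrieved = retrieve(example["question"], top_k=max(ks))
        for k in ks:
            if set(retrieved[:k]) & set(example["expected_ids"]):
                hits[k] += 1
    return {k: hits[k] / len(golden) for k in ks}
```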
## Edge Cases to Handle
- Very long documents (> 1M tokens): hierarchical summarization or map-reduce
- Tables with numeric data: extract to markdown tables, preserve headers
- Code chunks: chunk by function/class, never mid-function
- Non-English docs: language detection, multilingual embedding, answer in query language
- PII redaction before storage (emails, SSNs, phone numbers)
- Access-controlled docs: enforce ACL filters at query time, NEVER post-filter after retrieval
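A PII redaction sketch run before storage; the regexes are illustrative (US-centric SSN and phone patterns) and not a complete PII solution:

```python
# Redact emails, SSNs, and phone numbers before chunks are embedded or stored.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```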
## Cost & Latency Estimate
| Component | Latency (p95) | Cost per query |
|-----------|---------------|----------------|
| Query embed | — | — |
| Vector search | — | — |
| Rerank | — | — |
| LLM synthesis | — | — |
| **Total** | — | — |
Fill these in with specific numbers for jina-embeddings-v3 + Vectorize (Cloudflare) + mxbai-rerank-large at 200 QPS.
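A per-query cost skeleton to back the table; the price variables are placeholders rather than quoted rates, and the token counts are assumptions to replace with measured values:

```python
# Back-of-envelope per-query cost model; plug in current list prices before use.
EMBED_PRICE_PER_MTOK = 0.0    # jina-embeddings-v3 $/1M tokens (placeholder)
RERANK_PRICE_PER_1K = 0.0     # reranker $/1K requests; 0 if self-hosted (placeholder)
LLM_IN_PER_MTOK = 0.0         # synthesis model $/1M input tokens (placeholder)
LLM_OUT_PER_MTOK = 0.0        # synthesis model $/1M output tokens (placeholder)

def cost_per_query(query_tokens=30, context_tokens=2500, answer_tokens=400) -> float:
    embed = query_tokens / 1e6 * EMBED_PRICE_PER_MTOK
    rerank = RERANK_PRICE_PER_1K / 1000
    llm = context_tokens / 1e6 * LLM_IN_PER_MTOK + answer_tokens / 1e6 * LLM_OUT_PER_MTOK
    return embed + rerank + llm
```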
## Deliverables
1. Code scaffold (Python or TypeScript): ingest.py, chunk.py, embed.py, retrieve.py, synthesize.py, evaluate.py
2. Chunking parameter table (chunk_size, overlap, separators) tuned for code repositories
3. Prompt template (above) with variables called out
4. Eval harness wired to the golden set with CI gate on the three target thresholds
5. Runbook: what to do when retrieval quality drops, what to do when latency spikes
6. Cost model spreadsheet with ingestion cost, per-query cost, storage cost
## General Requirements
- Use precise technical terminology appropriate for the audience
- Include code examples, configurations, or specifications where relevant
- Document assumptions, prerequisites, and dependencies
- Provide error handling and edge case considerations
Present your output in a clear, organized structure with headers (##), subheaders (###), and bullet points. Use bold for key terms.

Replace the bracketed placeholders with your own context before running the prompt:
- `[{chunk_id, quote, doc_url}]`: fill in your specific citation fields (chunk_id, quote, doc_url).
- `[doc_id]`: fill in your specific doc_id.