# ChatGPT Prompt for RAG Pipelines
Production RAG recipe: token-based sliding window chunking, mxbai-embed-large embeddings, OpenSearch storage, mxbai-rerank-large reranking. Includes retrieval evals.
You are a senior AI engineer designing a production RAG pipeline. Produce a complete, implementation-ready spec that a mid-level engineer can ship in one sprint.
## System Context
- **Document corpus:** PDFs with tables (~100k documents)
- **Query volume:** 50 queries/second peak, 200k daily
- **Latency SLO:** end-to-end p95 under 3s
- **Language coverage:** English only
- **Freshness requirement:** weekly
## Pipeline Architecture
### 1. Ingestion & Parsing
- Parser choice for PDFs with tables (Unstructured, LlamaParse, pdfplumber, custom): justify tradeoff
- How to preserve tables, code blocks, and hierarchical structure
- OCR fallback strategy for scanned pages (Tesseract vs GPT-4o vision vs Textract)
- Metadata to extract per document: source_url, created_at, author, section_path, doc_type, access_level
- Deduplication: content hash + near-duplicate detection via MinHash/SimHash
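A minimal dedup sketch for the last bullet above, assuming the `datasketch` library for MinHash LSH (SimHash would work just as well); the Jaccard threshold and shingle size are illustrative starting points, not tuned values:

```python
import hashlib
from datasketch import MinHash, MinHashLSH  # assumed near-duplicate library

def content_hash(text: str) -> str:
    """Exact-duplicate key: SHA-256 over whitespace-normalized, lowercased text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over word 5-gram shingles for near-duplicate detection."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

# LSH index: anything above ~0.9 estimated Jaccard similarity is flagged at ingest time.
lsh = MinHashLSH(threshold=0.9, num_perm=128)

def is_near_duplicate(doc_id: str, text: str) -> bool:
    m = minhash_of(text)
    if lsh.query(m):          # any previously ingested doc above the threshold?
        return True
    lsh.insert(doc_id, m)     # otherwise register this doc and keep it
    return False
```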
### 2. Chunking: token-based sliding window
- Target chunk size: 1024 tokens with 128-token overlap (see the sliding-window sketch after this list)
- Boundary rules (where you're allowed and NOT allowed to split)
- How to handle chunks that are too small (merge) and too large (recursive split)
- Special handling for tables, code blocks, lists (do NOT split these)
- Contextual header: prepend document title + section breadcrumb to each chunk
- If chunking strategy is 'contextual retrieval', include the Claude Haiku prompt to generate the 50-100 token contextual prefix per chunk
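A minimal sketch of the sliding-window chunker described above, assuming `tiktoken`'s `cl100k_base` encoding as a stand-in tokenizer (the embedding model's own tokenizer may count tokens slightly differently, so treat the budget as approximate):

```python
import tiktoken  # assumed tokenizer; swap in the embedding model's tokenizer if it differs

def sliding_window_chunks(text: str, header: str,
                          chunk_size: int = 1024, overlap: int = 128) -> list[str]:
    """Token-based sliding window: fixed-size chunks with fixed overlap,
    each prefixed with the document title + section breadcrumb."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    stride = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        # Note: the prepended header adds tokens on top of chunk_size; budget for it.
        chunks.append(f"{header}\n\n{enc.decode(window)}")
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Example: sliding_window_chunks(body_text, "Annual Report 2024 > 3.2 Revenue")
```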
### 3. Embedding: mxbai-embed-large
- Batch size, rate-limit handling, retry policy (exponential backoff + jitter; see the sketch after this list)
- Cost estimate at corpus size (model price × tokens)
- Embedding dimensionality and storage implications
- When to re-embed (model upgrade, chunking change) and migration path
- Normalization: L2-normalize if using dot product; skip if cosine
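A sketch of the batching and retry loop; `embed_fn` is a placeholder for whatever client actually serves mxbai-embed-large (self-hosted sentence-transformers, a TEI endpoint, or a hosted API), so only the backoff logic is shown:

```python
import random
import time

def embed_with_retries(texts: list[str], embed_fn, batch_size: int = 64,
                       max_retries: int = 5) -> list[list[float]]:
    """Batch embedding with exponential backoff + jitter.
    `embed_fn(batch) -> list[vector]` is a placeholder for the real client call."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                # 1s, 2s, 4s, ... capped at 30s, plus up to 1s of jitter
                time.sleep(min(2 ** attempt, 30) + random.random())
    return vectors
```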
### 4. Storage: OpenSearch
- Index configuration (HNSW m, ef_construction, ef_search OR IVF nlist, nprobe); an example mapping follows this list
- Payload schema (which metadata is filterable, which is projected)
- Sharding / namespace strategy for multi-tenancy
- Backup and point-in-time recovery
- Estimated storage cost per 1M chunks
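An illustrative index mapping via the `opensearch-py` client; the HNSW parameters, shard counts, and the 1024-dim vector size (taken from the mxbai-embed-large model card) are starting points to tune, not production-validated values:

```python
from opensearchpy import OpenSearch

index_body = {
    "settings": {
        "index.knn": True,
        "index.knn.algo_param.ef_search": 128,   # query-time beam width
        "number_of_shards": 3,
        "number_of_replicas": 1,
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,               # mxbai-embed-large output size (per model card)
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "innerproduct",  # vectors L2-normalized upstream
                    "parameters": {"m": 16, "ef_construction": 256},
                },
            },
            "text": {"type": "text"},
            "tenant_id": {"type": "keyword"},     # filterable metadata
            "access_level": {"type": "keyword"},
            "created_at": {"type": "date"},
            "section_path": {"type": "keyword"},
            "doc_url": {"type": "keyword"},
        }
    },
}

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
client.indices.create(index="rag-chunks", body=index_body)
```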
### 5. Retrieval: hybrid (weighted linear combination)
- Top-k at retrieval: 30
- Metadata filters (tenant_id, access_level, date range)
- Score fusion: weights for the linear combination and/or Reciprocal Rank Fusion with k=60 (RRF sketch after this list)
- Query preprocessing: lowercase, stopword handling, acronym expansion
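A minimal Reciprocal Rank Fusion sketch over the two ranked ID lists (BM25 and dense); a weighted linear combination of normalized scores can be swapped in if that variant is preferred:

```python
def reciprocal_rank_fusion(bm25_ids: list[str], dense_ids: list[str],
                           k: int = 60, top_k: int = 30) -> list[str]:
    """RRF: score(d) = sum over lists of 1 / (k + rank_of_d_in_that_list)."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; keep the retrieval top-k for the reranker.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```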
### 6. Reranking: mxbai-rerank-large
- Rerank top-30 down to top-8
- Latency budget for rerank step
- When to skip rerank (very high-confidence top result, latency pressure)
- Fallback if reranker API fails
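A sketch of the rerank step with the fallback behavior above; the model name and the `sentence-transformers` CrossEncoder usage follow the Mixedbread model card but should be verified against current docs before shipping:

```python
from sentence_transformers import CrossEncoder  # assumes a locally served cross-encoder

reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1")  # model name: assumption to verify

def rerank(query: str, chunks: list[dict], top_n: int = 8) -> list[dict]:
    """Score (query, chunk) pairs with the cross-encoder and keep the top_n.
    On any failure, fall back to the fused retrieval order so the request still completes."""
    try:
        scores = reranker.predict([(query, c["text"]) for c in chunks])
        ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
        return [c for c, _ in ranked[:top_n]]
    except Exception:
        return chunks[:top_n]  # fallback: trust the hybrid-retrieval ranking
```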
### 7. Answer Synthesis
- Model: Claude Sonnet 4.5 or GPT-4.1 (justify)
- System prompt template with citation requirements
- Output format: answer + citations array [{chunk_id, quote, doc_url}]
- Refusal rules: when retrieval returns low-relevance chunks, refuse rather than hallucinate
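One possible response schema matching the citations array above (Pydantic assumed; the `refused` flag is an added field to carry the refusal rule explicitly, not part of the original spec):

```python
from pydantic import BaseModel

class Citation(BaseModel):
    chunk_id: str
    quote: str       # short verbatim span supporting the claim
    doc_url: str

class RagAnswer(BaseModel):
    answer: str
    citations: list[Citation]
    refused: bool = False   # True when retrieval was too weak to answer safely
```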
## Prompt Template
```
You are a helpful assistant answering questions about PDFs with tables.
RULES:
1. Use ONLY the context below. If the answer is not present, say "I don't have enough information."
2. Cite every factual claim with [doc_id] markers.
3. Quote short spans verbatim for critical facts.
4. If context conflicts, note the conflict.
CONTEXT:
{retrieved_chunks_with_ids}
QUESTION: {user_question}
ANSWER:
```
## Evaluation Plan
Build a golden set of 1000 question+expected-sources pairs from real user queries. Score:
- **Retrieval hit@k** (k = 1, 5, 10, 30): did a correct chunk appear in the top k? (scored by the harness sketched after this list)
- **Context precision** (Ragas): are retrieved chunks relevant?
- **Context recall**: did we retrieve everything needed?
- **Faithfulness**: does the answer stay grounded in context?
- **Answer relevance**: does the answer address the question?
- **Citation accuracy**: do cited chunks actually support the claim?
Target thresholds: hit@10 ≥ 0.90, faithfulness ≥ 0.95, citation accuracy ≥ 0.98.
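A minimal hit@k harness sketch; `retrieve_fn` and the golden-set row shape are assumptions about how the pipeline and dataset are exposed, and the Ragas-based metrics would sit alongside this:

```python
def hit_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    """1.0 if any expected (golden) chunk appears in the top-k retrieved, else 0.0."""
    return 1.0 if expected_ids & set(retrieved_ids[:k]) else 0.0

def eval_retrieval(golden_set: list[dict], retrieve_fn, ks=(1, 5, 10, 30)) -> dict[int, float]:
    """golden_set rows: {"question": str, "expected_chunk_ids": set[str]}.
    `retrieve_fn(question) -> ranked list of chunk_ids` is whatever the pipeline exposes."""
    totals = {k: 0.0 for k in ks}
    for row in golden_set:
        retrieved = retrieve_fn(row["question"])
        for k in ks:
            totals[k] += hit_at_k(retrieved, row["expected_chunk_ids"], k)
    return {k: totals[k] / len(golden_set) for k in ks}

# Example CI gate: assert eval_retrieval(golden, retrieve)[10] >= 0.90
```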
## Edge Cases to Handle
- Very long documents (> 1M tokens): hierarchical summarization or map-reduce
- Tables with numeric data: extract to markdown tables, preserve headers
- Code chunks: chunk by function/class, never mid-function
- Non-English docs: language detection, multilingual embedding, answer in query language
- PII redaction before storage (emails, SSNs, phone numbers)
- Access-controlled docs: enforce ACL filters at query time, NEVER post-filter after retrieval
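Illustrative regex-based redaction for the three PII types named above, applied before chunks are embedded or stored; a production system would likely use a vetted PII/NER library rather than these minimal patterns:

```python
import re

# Minimal illustrative patterns only; not exhaustive PII coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before storage/embedding."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```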
## Cost & Latency Estimate
| Component | Latency (p95) | Cost per query |
|-----------|---------------|----------------|
| Query embed | — | — |
| Vector search | — | — |
| Rerank | — | — |
| LLM synthesis | — | — |
| **Total** | — | — |
Fill these in with specific numbers for mxbai-embed-large + OpenSearch + mxbai-rerank-large at 50 QPS.
## Deliverables
1. Code scaffold (Python or TypeScript): ingest.py, chunk.py, embed.py, retrieve.py, synthesize.py, evaluate.py
2. Chunking parameter table (chunk_size, overlap, separators) tuned for PDFs with tables
3. Prompt template (above) with variables called out
4. Eval harness wired to the golden set with CI gate on the three target thresholds
5. Runbook: what to do when retrieval quality drops, what to do when latency spikes
6. Cost model spreadsheet with ingestion cost, per-query cost, storage cost
- Use precise technical terminology appropriate for the audience
- Include code examples, configurations, or specifications where relevant
- Document assumptions, prerequisites, and dependencies
- Provide error handling and edge case considerations
Present your output in a clear, organized structure with headers (##), subheaders (###), and bullet points. Use bold for key terms.

Before running the prompt, replace the template variables {retrieved_chunks_with_ids} and {user_question} with your own retrieved context and question. The [doc_id] markers and the citations array [{chunk_id, quote, doc_url}] are output formats the model should emit, not placeholders to fill in.