ChatGPT Prompt for Fine-tuning & Model Adaptation
End-to-end fine-tuning recipe: QLoRA (4-bit) on Llama 3.1 70B via LitGPT, targeting AWS g5.12xlarge, with data mix and eval plan.
You are a senior ML engineer shipping a fine-tuned open-weight LLM to production. Produce an end-to-end training recipe that an engineer can run on AWS g5.12xlarge.
## Objective
Fine-tune Llama 3.1 70B via QLoRA (4-bit) to excel at SQL-from-text generation. Target: match or exceed GPT-4o-mini zero-shot on our internal benchmark while keeping general capability degradation (MMLU delta) within 2 points.
## Why QLoRA (4-bit)
Briefly justify QLoRA (4-bit) vs alternatives for this task+budget. Cover: VRAM, training-time, alignment-vs-capability tradeoff, and data requirements.
## Dataset Construction
### Sources
- **Primary:** GitHub PR discussions (10k examples)
- **Secondary:** Stack Overflow Q&A (10k examples)
- **Synthetic augmentation:** generated via DeepSeek-V3 with rejection sampling
### Data Schema
Each example is a JSON object:
```json
{
  "id": "uuid",
  "source": "string",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "task_type": "SQL-from-text generation",
  "quality_score": 0.0,
  "metadata": {"lang": "en", "difficulty": "medium"}
}
```
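A lightweight validator catches malformed records before they reach the trainer. The check below is a stdlib-only sketch; the specific expectations (quality_score in [0, 1], last turn is a non-empty assistant message) are assumptions about this schema, not requirements stated above.

```python
# Sketch: validate one JSONL record against the schema above.
# Assumptions: quality_score lies in [0, 1] and the final message is a
# non-empty assistant turn; adjust if your pipeline differs.
import json

REQUIRED_KEYS = {"id", "source", "messages", "task_type", "quality_score", "metadata"}

def validate_example(line: str) -> dict:
    ex = json.loads(line)
    missing = REQUIRED_KEYS - ex.keys()
    assert not missing, f"missing keys: {missing}"
    assert ex["messages"], "empty messages list"
    assert all(m["role"] in {"system", "user", "assistant"} for m in ex["messages"])
    assert ex["messages"][-1]["role"] == "assistant" and ex["messages"][-1]["content"].strip()
    assert 0.0 <= ex["quality_score"] <= 1.0
    return ex
```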
### Cleaning Pipeline
1. **Deduplication:** MinHash (128 hashes, Jaccard ≥ 0.85) on the concatenated messages (see the dedup sketch after this list)
2. **Language filter:** fastText langid, keep target language(s)
3. **Quality filter:** length bounds, non-empty assistant, no boilerplate refusals
4. **PII scrub:** regex + Presidio for emails, phones, SSNs
5. **Contamination check:** exact-match and 13-gram overlap vs eval sets (MMLU, HumanEval, GSM8K, our internal eval)
6. **Toxicity filter:** Azure Content Safety score < 0.3
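For step 1, a minimal near-duplicate filter can look like the sketch below. It assumes the `datasketch` package; the 128-permutation / 0.85-threshold parameters are the ones specified in the pipeline, and the policy of keeping the first example seen in each duplicate cluster is an assumption.

```python
# Sketch: MinHash-LSH near-duplicate filtering (assumes the datasketch package).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def dedup(examples: list[dict]) -> list[dict]:
    # threshold matches the Jaccard >= 0.85 rule above
    lsh = MinHashLSH(threshold=0.85, num_perm=128)
    kept = []
    for ex in examples:
        text = " ".join(msg["content"] for msg in ex["messages"])
        mh = minhash_of(text)
        if lsh.query(mh):  # near-duplicate of an already-kept example
            continue
        lsh.insert(ex["id"], mh)
        kept.append(ex)
    return kept
```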
### Data Mix
Target 25k total examples, weighted (a materialization sketch follows the splits below):
- 70% core task (SQL-from-text generation)
- 20% instruction diversity (to preserve general capability)
- 5% safety/refusal calibration
- 5% format consistency (JSON, markdown, code)
### Splits
- Train: 90%
- Dev (for early stopping, hyperparam): 5%
- Holdout (held until final eval): 5%
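One way to materialize the mix and splits above is sketched here. The bucket names and the `build_splits` helper are hypothetical (keyed by however your pipeline tags each example); only the 25k total, the mix weights, and the 90/5/5 split come from this plan.

```python
# Sketch: sample to the target mix, then split 90/5/5 with a fixed seed.
import random

MIX = {
    "core_task": 0.70,
    "instruction_diversity": 0.20,
    "safety_refusal": 0.05,
    "format_consistency": 0.05,
}
TOTAL = 25_000

def build_splits(buckets: dict[str, list[dict]], seed: int = 42):
    rng = random.Random(seed)
    pool = []
    for name, weight in MIX.items():
        pool.extend(rng.sample(buckets[name], int(TOTAL * weight)))
    rng.shuffle(pool)
    n_eval = int(len(pool) * 0.05)  # 5% dev, 5% holdout, rest train
    dev, holdout, train = pool[:n_eval], pool[n_eval:2 * n_eval], pool[2 * n_eval:]
    return train, dev, holdout
```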
## Prompt Template / Chat Format
Match Llama 3.1 70B's native chat template (critical: a mismatched template tanks performance):
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant}<|eot_id|>
```
(Use the EXACT tokens from the Llama 3.1 70B tokenizer config; verify with tokenizer.apply_chat_template.)
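A quick way to run that verification (a sketch assuming the `transformers` tokenizer for the `meta-llama/Llama-3.1-70B-Instruct` repo; point at your local checkpoint if weights are mirrored internally):

```python
# Sketch: render the chat template and compare it against your preprocessing output.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
messages = [
    {"role": "system", "content": "You translate questions into SQL."},
    {"role": "user", "content": "How many orders shipped in March?"},
]
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)  # diff this against what your preprocessing emits
```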
## Training Configuration (LitGPT)
```yaml
base_model: Llama 3.1 70B
adapter: qlora
load_in_4bit: true
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 5e-5
warmup_ratio: 0.03
weight_decay: 0.0
bf16: true
flash_attention: true
gradient_checkpointing: true
eval_steps: 200
save_steps: 500
save_total_limit: 3
early_stopping_patience: 3
```
### Why these hyperparameters
- LR for LoRA is typically 10-100× higher than a full fine-tune; start at 5e-5 and decay with the cosine schedule
- r=64 is a solid default; go higher only if you see underfitting on train loss
- sequence_len=8192 covers 99% of examples; longer ones are truncated
- Pack samples aggressively to maximize GPU utilization
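For cross-checking the adapter settings outside the training framework, a rough Hugging Face PEFT equivalent looks like this (a sketch; these are `peft` field names, not LitGPT's internal config keys):

```python
# Sketch: the same adapter settings expressed as a Hugging Face peft LoraConfig.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```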
## Hardware: AWS g5.12xlarge
- GPU layout: 4x NVIDIA A10G, 24 GB each (96 GB total); the 4-bit base weights and adapters must be sharded across all four
- Estimated VRAM footprint: ~48 GB
- Estimated train time: ~24 hours
- Estimated cost: ~$120
If you hit OOM: reduce micro_batch_size, enable CPU offload (DeepSpeed ZeRO-3), or drop sequence_len.
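A back-of-envelope memory estimate behind that figure (rough arithmetic, not a measurement; layer dimensions are Llama 3.1 70B's published shape, and the bytes-per-parameter assumptions are approximations):

```python
# Rough VRAM arithmetic only. Llama 3.1 70B: hidden 8192, 80 layers,
# GQA KV projection dim 1024, MLP intermediate dim 28672.
r, d, d_kv, d_mlp, layers = 64, 8192, 1024, 28672, 80

base_gb = 70e9 * 0.5 / 1e9  # 4-bit (NF4) base weights ~= 35 GB

lora_per_layer = r * (
    (d + d)            # q_proj
    + 2 * (d + d_kv)   # k_proj, v_proj
    + (d + d)          # o_proj
    + 3 * (d + d_mlp)  # gate_proj, up_proj, down_proj
)
lora_params = lora_per_layer * layers
# bf16 weights + bf16 grads + fp32 Adam moments (m, v) per trainable param
trainable_gb = lora_params * (2 + 2 + 4 + 4) / 1e9

print(f"base ~= {base_gb:.0f} GB, adapters + optimizer ~= {trainable_gb:.1f} GB")
# Activations at sequence_len=8192 (even with gradient checkpointing) add a few
# GB per GPU, which is how a ~48 GB budget sharded over 4x 24 GB A10Gs arises.
```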
## Eval Plan
Run BEFORE and AFTER fine-tuning:
1. **Task-specific eval** (SQL-from-text generation): golden set of 2000 held-out examples, scored by a Claude Sonnet 4.5 rubric scorer (an execution-accuracy sketch follows this list)
2. **General capability** (regression gate):
- MMLU 5-shot
- HumanEval (if code) or MT-Bench (if chat)
- TruthfulQA
3. **Format compliance:** % of outputs that match expected schema
4. **Safety:** refusal rate on a held-out harmful-prompt set (should stay near baseline)
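To complement the rubric scorer in item 1, an execution-accuracy check is cheap to add. The sketch below assumes each golden example ships with a SQLite fixture, which is an assumption about your benchmark format rather than something specified above.

```python
# Sketch: execution-accuracy check for SQL-from-text outputs (stdlib sqlite3).
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """True if predicted and gold SQL return the same (order-insensitive) rows."""
    con = sqlite3.connect(db_path)
    try:
        gold = sorted(map(tuple, con.execute(gold_sql).fetchall()))
        try:
            pred = sorted(map(tuple, con.execute(pred_sql).fetchall()))
        except sqlite3.Error:
            return False  # predicted SQL fails to execute at all
        return gold == pred
    finally:
        con.close()
```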
Pass criteria:
- Task metric ≥ GPT-4o-mini zero-shot
- MMLU delta ≥ -2 points
- Refusal rate within ±3% of baseline
- No regression on format compliance
## Failure Modes to Watch
- **Mode collapse:** outputs become repetitive or always the same short form → reduce LR, check data diversity
- **Catastrophic forgetting of format:** loses ability to output markdown tables → add format-preservation examples
- **Sycophancy spike:** agrees with user mistakes → verify preference pairs are not rewarding agreement
- **Tokenizer drift:** custom special tokens not properly added to the tokenizer → verify tokenizer.get_vocab() before training
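A pre-flight check for the tokenizer-drift failure mode might look like this (the token strings assume the stock Llama 3.1 chat template; extend the list with any custom special tokens your pipeline introduces):

```python
# Sketch: confirm chat-template special tokens exist in the vocab and encode
# to single token IDs before any training run.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
vocab = tok.get_vocab()
for t in ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]:
    assert t in vocab, f"{t} missing from tokenizer vocab"
    assert len(tok.encode(t, add_special_tokens=False)) == 1, f"{t} is not a single token"
```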
## Deliverables
1. `data/`: train.jsonl, dev.jsonl, holdout.jsonl with data card (source breakdown, quality stats, contamination report)
2. `configs/train.yaml`: the LitGPT config above
3. `scripts/preprocess.py`: the cleaning pipeline
4. `scripts/eval.py`: task eval + regression gates
5. `README.md` with reproducibility info (seeds, env versions, command lines)
6. Model card template for HuggingFace upload
Include the exact CLI commands (e.g. `litgpt finetune --config configs/train.yaml`, adjusted to your installed LitGPT version), the eval commands, and a short runbook for the on-call engineer.
Structure as a playbook with: Overview, Prerequisites, Step-by-step Plays, Metrics to Track, and Troubleshooting Guide.