AI Prompt for Fine-tuning & Model Adaptation
Complete fine-tuning recipe: DPO on Qwen 2.5 32B via Megatron-LM, targeting 8x H100, with data mix and eval plan.
You are a senior ML engineer shipping a fine-tuned open-weight LLM to production. Produce an end-to-end training recipe that an engineer can run on 8x H100.
## Objective
Fine-tune Qwen 2.5 32B via DPO to excel at customer support classification. Target: match or exceed the base model + 5-shot ICL on our internal benchmark while keeping general capability degradation (MMLU delta) within 2 points.
## Why DPO
Briefly justify DPO vs alternatives for this task+budget. Cover: VRAM, training-time, alignment-vs-capability tradeoff, and data requirements.
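As an anchor for that justification, the standard DPO objective (Rafailov et al.) that the recipe optimizes, where $\pi_{\mathrm{ref}}$ is the frozen base model, $\beta$ the KL-strength coefficient, and $y_w$/$y_l$ the chosen/rejected completions:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

Note the data requirement this implies: every training example needs a preference pair, not just a single gold completion.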
## Dataset Construction
### Sources
- **Primary:** GitHub PR discussions (50k examples)
- **Secondary:** customer-labeled examples (10k examples)
- **Synthetic augmentation:** generated via Claude Opus 4.5 with rejection sampling
### Data Schema
Each example is a JSON object. DPO trains on preference pairs, so each record carries a shared prompt plus a chosen and a rejected completion:
```json
{
  "id": "uuid",
  "source": "string",
  "prompt": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "chosen": {"role": "assistant", "content": "..."},
  "rejected": {"role": "assistant", "content": "..."},
  "task_type": "customer support classification",
  "quality_score": 0.0,
  "metadata": {"lang": "en", "difficulty": "medium"}
}
```
### Cleaning Pipeline
1. **Deduplication:** MinHash (128 hashes, Jaccard ≥ 0.85) on the concatenated message contents
2. **Language filter:** fastText langid, keep target language(s)
3. **Quality filter:** length bounds, non-empty completions, no boilerplate refusals
4. **PII scrub:** regex + Presidio for emails, phones, SSNs
5. **Contamination check:** exact-match and 13-gram overlap vs eval sets (MMLU, HumanEval, GSM8K, our internal eval)
6. **Toxicity filter:** drop examples above a low moderation-severity threshold (e.g. via Azure AI Content Safety)
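Step 1 can be sketched with a stdlib-only MinHash. This is illustrative: a production run would use an LSH index (e.g. the `datasketch` library) rather than the greedy pairwise comparison below.

```python
import hashlib

NUM_HASHES = 128  # matches the 128-permutation setting above

def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles of the concatenated message text."""
    toks = text.lower().split()
    if len(toks) <= n:
        return {" ".join(toks)}
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash_signature(text: str) -> list[int]:
    """One minimum per salted hash function approximates a random permutation."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of agreeing hash slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(texts: list[str], threshold: float = 0.85) -> list[str]:
    """Greedy O(n^2) dedup: keep a text only if no kept text is too similar."""
    kept, sigs = [], []
    for t in texts:
        sig = minhash_signature(t)
        if all(estimated_jaccard(sig, s) < threshold for s in sigs):
            kept.append(t)
            sigs.append(sig)
    return kept
```

The threshold of 0.85 is aggressive enough to catch near-verbatim templated tickets while leaving genuinely distinct examples alone.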
### Data Mix
Target 50k total examples, weighted (shares sum to 100%):
- 70% core task (customer support classification)
- 15% instruction diversity (to preserve general capability)
- 10% safety/refusal calibration
- 5% format consistency (JSON, markdown, code)
### Splits
- Train: 90%
- Dev (for early stopping, hyperparam): 5%
- Holdout (held until final eval): 5%
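The splits should be deterministic so pipeline reruns never leak dev/holdout examples into train. One way (a sketch, assuming the `id` field from the schema above) is to hash the example id into a bucket:

```python
import hashlib

def assign_split(example_id: str, dev: float = 0.05, holdout: float = 0.05) -> str:
    """Stable 90/5/5 assignment: the same id always lands in the same split."""
    bucket = int(hashlib.sha256(example_id.encode()).hexdigest(), 16) % 10_000
    frac = bucket / 10_000
    if frac < holdout:
        return "holdout"
    if frac < holdout + dev:
        return "dev"
    return "train"
```

Unlike a random shuffle, this survives re-scraping and re-cleaning: adding new examples never moves old ones between splits.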
## Prompt Template / Chat Format
Match Qwen 2.5 32B's native chat template (critical: a mismatched template tanks performance):
```
<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{user}<|im_end|>
<|im_start|>assistant
{assistant}<|im_end|>
```
(Use the EXACT tokens from the Qwen 2.5 32B tokenizer config; verify with tokenizer.apply_chat_template.)
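A quick sanity check before training: render a record through a hand-rolled version of the template and diff it against `tokenizer.apply_chat_template(messages, tokenize=False)`. The renderer below is a sketch of ChatML for eyeballing only; the tokenizer config remains the authoritative source.

```python
def render_chatml(messages: list[dict]) -> str:
    """Hand-rolled ChatML rendering, to compare against apply_chat_template."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
```

If the two renderings differ by even one token (a stray newline, a missing `<|im_end|>`), fix the data pipeline before spending GPU hours.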
## Training Configuration (Megatron-LM)
```yaml
base_model: Qwen/Qwen2.5-32B-Instruct
# Key names shown generically; map to your launcher's flags. DPO additionally
# needs the trainer's preference-loss switch and a beta (KL) coefficient,
# typically 0.1; the frozen base model serves as the reference policy.
adapter: lora
load_in_4bit: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 8
micro_batch_size: 8
num_epochs: 3
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-4  # high end; drop toward 5e-5 if the DPO loss is unstable
warmup_ratio: 0.03
weight_decay: 0.0
bf16: true
flash_attention: true
gradient_checkpointing: true
eval_steps: 200
save_steps: 500
save_total_limit: 3
early_stopping_patience: 3
```
### Why these hyperparameters
- LoRA LRs run 10-100× higher than full fine-tune LRs; 2e-4 with cosine decay is a common starting point, though preference losses are often only stable lower (~5e-5), so watch the dev loss curve
- r=64 is a solid default; go to 128 only if you see underfitting on train loss
- seq_len=4096 covers 99% of examples; longer ones are truncated
- Pack samples to maximize GPU utilization (first confirm your trainer supports packing with a preference loss)
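Sanity arithmetic for the throughput these settings imply (assuming all 8 GPUs do pure data parallelism; with packing, "sequences" here are packed 4096-token blocks, so real steps per epoch will be lower):

```python
micro_batch_size = 8
grad_accum = 8
num_gpus = 8
seq_len = 4096

# Effective batch: sequences consumed per optimizer step across the cluster.
sequences_per_step = micro_batch_size * grad_accum * num_gpus

# Token throughput per optimizer step.
tokens_per_step = sequences_per_step * seq_len

# Optimizer steps per epoch on the 90% train split of 50k examples.
train_examples = int(50_000 * 0.90)
steps_per_epoch = train_examples // sequences_per_step
```

An effective batch of 512 sequences (~2.1M tokens) per step means an epoch is under a hundred optimizer steps, so eval_steps=200 fires roughly every two epochs; tighten it if you want mid-epoch signal.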
## Hardware: 8x H100
- Estimated VRAM: ~72 GB per GPU (of 80 GB on an H100)
- Estimated train time: 6 hours
- Estimated cost: 8 GPUs × 6 h = 48 GPU-hours, roughly $150-$300 at typical cloud H100 rates ($3-6/GPU-hour)
If you hit OOM: reduce micro_batch_size, enable CPU offload (DeepSpeed ZeRO-3), or drop sequence_len.
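Back-of-envelope for the per-GPU estimate (rough, assuming NF4 quantization at ~0.5 bytes/param for the base weights):

```python
params = 32e9                 # Qwen 2.5 32B
nf4_bytes_per_param = 0.5     # 4-bit quantized base weights

# Frozen base weights in GiB: the fixed floor of the per-GPU budget.
weights_gib = params * nf4_bytes_per_param / 2**30  # ~15 GiB

# The remainder of the ~72 GB goes to: bf16 LoRA adapters and their optimizer
# state (small), activations at seq_len 4096 x micro batch 8 (large, even with
# gradient checkpointing), and the DPO reference forward pass.
```

The takeaway: weights are not the bottleneck here, activations are, which is why the OOM remedies above target micro_batch_size and sequence_len first.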
## Eval Plan
Run BEFORE and AFTER fine-tune:
1. **Task-specific eval** (customer support classification): golden set of 500 held-out examples, scored by G-Eval with Gemini 2.5 Pro
2. **General capability** (regression gate):
- MMLU 5-shot
- HumanEval (if code) or MT-Bench (if chat)
- TruthfulQA
3. **Format compliance:** % of outputs that match expected schema
4. **Safety:** refusal rate on a held-out harmful-prompt set (should stay near baseline)
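Metric 3 is cheap to automate. A sketch, assuming the expected output is a JSON object; the `category` and `confidence` keys are hypothetical placeholders for the real schema:

```python
import json

REQUIRED_KEYS = {"category", "confidence"}  # hypothetical output schema

def format_compliant(output: str) -> bool:
    """True if the model output parses as JSON and carries the required keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs matching the expected schema."""
    return sum(map(format_compliant, outputs)) / len(outputs)
```

Run this over the same 500-example golden set used for the task metric so the two numbers are directly comparable across checkpoints.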
Pass criteria:
- Task metric ≥ the base model + 5-shot ICL
- MMLU delta ≥ −2 points
- Refusal rate within ±3% of baseline
- No regression on format compliance
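The pass criteria can be encoded as a single gate so `scripts/eval.py` can exit nonzero on any regression (metric names below are illustrative):

```python
def passes_gates(ft: dict, base: dict) -> tuple[bool, dict]:
    """Apply the four pass criteria; return overall verdict plus per-gate detail."""
    checks = {
        "task":    ft["task_score"] >= base["task_icl_score"],
        "mmlu":    ft["mmlu"] - base["mmlu"] >= -2.0,
        "refusal": abs(ft["refusal_rate"] - base["refusal_rate"]) <= 0.03,
        "format":  ft["format_compliance"] >= base["format_compliance"],
    }
    return all(checks.values()), checks
```

Returning the per-gate dict (not just the verdict) matters in practice: the on-call engineer needs to know which gate failed, not only that one did.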
## Failure Modes to Watch
- **Mode collapse:** outputs become repetitive or always the same short form → reduce LR, check data diversity
- **Catastrophic forgetting of format:** loses the ability to output markdown tables → add format-preservation examples
- **Sycophancy spike:** agrees with user mistakes → verify preference pairs are not rewarding agreement
- **Tokenizer drift:** custom special tokens not properly added to the tokenizer → verify tokenizer.get_vocab() before training
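Mode collapse is easy to catch automatically with a distinct-n ratio over a batch of sampled outputs (a simple sketch; the alert threshold is a per-task judgment call, so calibrate it against the base model's ratio):

```python
from collections import Counter

def distinct_n(outputs: list[str], n: int = 2) -> float:
    """Unique n-grams / total n-grams across outputs; near 0 signals collapse."""
    grams, total = Counter(), 0
    for out in outputs:
        toks = out.split()
        for i in range(len(toks) - n + 1):
            grams[tuple(toks[i:i + n])] += 1
            total += 1
    return len(grams) / total if total else 0.0
```

Log this at every eval_steps checkpoint; a sudden drop between checkpoints is a much earlier warning than a failed task eval.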
## Deliverables
1. `data/`: train.jsonl, dev.jsonl, holdout.jsonl with data card (source breakdown, quality stats, contamination report)
2. `configs/train.yaml`: the Megatron-LM config above
3. `scripts/preprocess.py`: the cleaning pipeline
4. `scripts/eval.py`: task eval + regression gates
5. `README.md` with reproducibility info (seeds, env versions, command lines)
6. Model card template for HuggingFace upload
Include the exact CLI commands for launching training from `configs/train.yaml`, the eval commands, and a short runbook for the on-call engineer.
Structure as a playbook with: Overview, Prerequisites, Step-by-step Plays, Metrics to Track, and Troubleshooting Guide.