AI Prompt for Fine-tuning & Model Adaptation
Complete fine-tuning recipe: DPO on Qwen 2.5 32B via Megatron-LM, targeting 8x H100, with data mix and eval plan.
You are a senior ML engineer shipping a fine-tuned open-weight LLM to production. Produce an end-to-end training recipe that an engineer can run on 8x H100.
## Objective
Fine-tune Qwen 2.5 32B via DPO to excel at customer support classification. Target: match or exceed the base model + 5-shot ICL on our internal benchmark while keeping general capability degradation (MMLU delta) within 2 points.
## Why DPO
Briefly justify DPO vs alternatives for this task+budget. Cover: VRAM, training-time, alignment-vs-capability tradeoff, and data requirements.
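As an anchor for that justification, the standard DPO objective (Rafailov et al.) that the recipe optimizes, where $\pi_{\mathrm{ref}}$ is the frozen base model, $\beta$ the KL-strength coefficient, and $y_w$/$y_l$ the chosen/rejected completions:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

Note the data requirement this implies: every training example needs a preference pair, not just a single gold completion.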
## Dataset Construction
### Sources
- **Primary:** GitHub PR discussions (50k examples)
- **Secondary:** customer-labeled examples (10k examples)
- **Synthetic augmentation:** generated via Claude Opus 4.5 with rejection sampling
### Data Schema
Each example is a JSON object. DPO trains on preference pairs, so each record carries a shared prompt plus a chosen and a rejected completion:
```json
{
  "id": "uuid",
  "source": "string",
  "prompt": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "chosen": {"role": "assistant", "content": "..."},
  "rejected": {"role": "assistant", "content": "..."},
  "task_type": "customer support classification",
  "quality_score": 0.0,
  "metadata": {"lang": "en", "difficulty": "medium"}
}
```
### Cleaning Pipeline
1. **Deduplication:** MinHash (128 hashes, Jaccard ≥ 0.85) on the concatenated message contents
2. **Language filter:** fastText langid, keep target language(s)
3. **Quality filter:** length bounds, non-empty completions, no boilerplate refusals
4. **PII scrub:** regex + Presidio for emails, phones, SSNs
5. **Contamination check:** exact-match and 13-gram overlap vs eval sets (MMLU, HumanEval, GSM8K, our internal eval)
6. **Toxicity filter:** drop examples above a low moderation-severity threshold (e.g. via Azure AI Content Safety)
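Step 1 can be sketched with a stdlib-only MinHash. This is illustrative: a production run would use an LSH index (e.g. the `datasketch` library) rather than the greedy pairwise comparison below.

```python
import hashlib

NUM_HASHES = 128  # matches the 128-permutation setting above

def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles of the concatenated message text."""
    toks = text.lower().split()
    if len(toks) <= n:
        return {" ".join(toks)}
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash_signature(text: str) -> list[int]:
    """One minimum per salted hash function approximates a random permutation."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of agreeing hash slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(texts: list[str], threshold: float = 0.85) -> list[str]:
    """Greedy O(n^2) dedup: keep a text only if no kept text is too similar."""
    kept, sigs = [], []
    for t in texts:
        sig = minhash_signature(t)
        if all(estimated_jaccard(sig, s) < threshold for s in sigs):
            kept.append(t)
            sigs.append(sig)
    return kept
```

The threshold of 0.85 is aggressive enough to catch near-verbatim templated tickets while leaving genuinely distinct examples alone.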
### Data Mix
Target 50k total examples, weighted (shares sum to 100%):
- 70% core task (customer support classification)
- 15% instruction diversity (to preserve general capability)
- 10% safety/refusal calibration
- 5% format consistency (JSON, markdown, code)
### Splits
- Train: 90%
- Dev (for early stopping, hyperparam): 5%
- Holdout (held until final eval): 5%
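The splits should be deterministic so pipeline reruns never leak dev/holdout examples into train. One way (a sketch, assuming the `id` field from the schema above) is to hash the example id into a bucket:

```python
import hashlib

def assign_split(example_id: str, dev: float = 0.05, holdout: float = 0.05) -> str:
    """Stable 90/5/5 assignment: the same id always lands in the same split."""
    bucket = int(hashlib.sha256(example_id.encode()).hexdigest(), 16) % 10_000
    frac = bucket / 10_000
    if frac < holdout:
        return "holdout"
    if frac < holdout + dev:
        return "dev"
    return "train"
```

Unlike a random shuffle, this survives re-scraping and re-cleaning: adding new examples never moves old ones between splits.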
## Prompt Template / Chat Format
Match Qwen 2.5 32B's native chat template (critical: a mismatched template tanks performance):
```
<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{user}<|im_end|>
<|im_start|>assistant
{assistant}<|im_end|>
```
(Use the EXACT tokens from the Qwen 2.5 32B tokenizer config; verify with tokenizer.apply_chat_template.)
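A quick sanity check before training: render a record through a hand-rolled version of the template and diff it against `tokenizer.apply_chat_template(messages, tokenize=False)`. The renderer below is a sketch of ChatML for eyeballing only; the tokenizer config remains the authoritative source.

```python
def render_chatml(messages: list[dict]) -> str:
    """Hand-rolled ChatML rendering, to compare against apply_chat_template."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
```

If the two renderings differ by even one token (a stray newline, a missing `<|im_end|>`), fix the data pipeline before spending GPU hours.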
## Training Configuration (Megatron-LM)
```yaml
base_model: Qwen/Qwen2.5-32B-Instruct
# Key names shown generically; map to your launcher's flags. DPO additionally
# needs the trainer's preference-loss switch and a beta (KL) coefficient,
# typically 0.1; the frozen base model serves as the reference policy.
adapter: lora
load_in_4bit: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 8
micro_batch_size: 8
num_epochs: 3
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-4  # high end; drop toward 5e-5 if the DPO loss is unstable
warmup_ratio: 0.03
weight_decay: 0.0
bf16: true
flash_attention: true
gradient_checkpointing: true
eval_steps: 200
save_steps: 500
save_total_limit: 3
early_stopping_patience: 3
```
### Why these hyperparameters
- LoRA LRs run 10-100× higher than full fine-tune LRs; 2e-4 with cosine decay is a common starting point, though preference losses are often only stable lower (~5e-5), so watch the dev loss curve
- r=64 is a solid default; go to 128 only if you see underfitting on train loss
- seq_len=4096 covers 99% of examples; longer ones are truncated
- Pack samples to maximize GPU utilization (first confirm your trainer supports packing with a preference loss)
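Sanity arithmetic for the throughput these settings imply (assuming all 8 GPUs do pure data parallelism; with packing, "sequences" here are packed 4096-token blocks, so real steps per epoch will be lower):

```python
micro_batch_size = 8
grad_accum = 8
num_gpus = 8
seq_len = 4096

# Effective batch: sequences consumed per optimizer step across the cluster.
sequences_per_step = micro_batch_size * grad_accum * num_gpus

# Token throughput per optimizer step.
tokens_per_step = sequences_per_step * seq_len

# Optimizer steps per epoch on the 90% train split of 50k examples.
train_examples = int(50_000 * 0.90)
steps_per_epoch = train_examples // sequences_per_step
```

An effective batch of 512 sequences (~2.1M tokens) per step means an epoch is under a hundred optimizer steps, so eval_steps=200 fires roughly every two epochs; tighten it if you want mid-epoch signal.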
## Hardware: 8x H100
- Estimated VRAM: ~72 GB per GPU (of 80 GB on an H100)
- Estimated train time: 6 hours
- Estimated cost: 8 GPUs × 6 h = 48 GPU-hours, roughly $150-$300 at typical cloud H100 rates ($3-6/GPU-hour)
If you hit OOM: reduce micro_batch_size, enable CPU offload (DeepSpeed ZeRO-3), or drop sequence_len.
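Back-of-envelope for the per-GPU estimate (rough, assuming NF4 quantization at ~0.5 bytes/param for the base weights):

```python
params = 32e9                 # Qwen 2.5 32B
nf4_bytes_per_param = 0.5     # 4-bit quantized base weights

# Frozen base weights in GiB: the fixed floor of the per-GPU budget.
weights_gib = params * nf4_bytes_per_param / 2**30  # ~15 GiB

# The remainder of the ~72 GB goes to: bf16 LoRA adapters and their optimizer
# state (small), activations at seq_len 4096 x micro batch 8 (large, even with
# gradient checkpointing), and the DPO reference forward pass.
```

The takeaway: weights are not the bottleneck here, activations are, which is why the OOM remedies above target micro_batch_size and sequence_len first.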
## Eval Plan
Run BEFORE and AFTER fine-tune:
1. **Task-specific eval** (customer support classification): golden set of 500 held-out examples, scored by G-Eval with Gemini 2.5 Pro
2. **General capability** (regression gate):
- MMLU 5-shot
- HumanEval (if code) or MT-Bench (if chat)
- TruthfulQA
3. **Format compliance:** % of outputs that match expected schema
4. **Safety:** refusal rate on a held-out harmful-prompt set (should stay near baseline)
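Metric 3 is cheap to automate. A sketch, assuming the expected output is a JSON object; the `category` and `confidence` keys are hypothetical placeholders for the real schema:

```python
import json

REQUIRED_KEYS = {"category", "confidence"}  # hypothetical output schema

def format_compliant(output: str) -> bool:
    """True if the model output parses as JSON and carries the required keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs matching the expected schema."""
    return sum(map(format_compliant, outputs)) / len(outputs)
```

Run this over the same 500-example golden set used for the task metric so the two numbers are directly comparable across checkpoints.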
Pass criteria:
- Task metric ≥ the base model + 5-shot ICL
- MMLU delta ≥ −2 points
- Refusal rate within ±3% of baseline
- No regression on format compliance
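The pass criteria can be encoded as a single gate so `scripts/eval.py` can exit nonzero on any regression (metric names below are illustrative):

```python
def passes_gates(ft: dict, base: dict) -> tuple[bool, dict]:
    """Apply the four pass criteria; return overall verdict plus per-gate detail."""
    checks = {
        "task":    ft["task_score"] >= base["task_icl_score"],
        "mmlu":    ft["mmlu"] - base["mmlu"] >= -2.0,
        "refusal": abs(ft["refusal_rate"] - base["refusal_rate"]) <= 0.03,
        "format":  ft["format_compliance"] >= base["format_compliance"],
    }
    return all(checks.values()), checks
```

Returning the per-gate dict (not just the verdict) matters in practice: the on-call engineer needs to know which gate failed, not only that one did.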
## Failure Modes to Watch
- **Mode collapse:** outputs become repetitive or always the same short form → reduce LR, check data diversity
- **Catastrophic forgetting of format:** loses the ability to output markdown tables → add format-preservation examples
- **Sycophancy spike:** agrees with user mistakes → verify preference pairs are not rewarding agreement
- **Tokenizer drift:** custom special tokens not properly added to the tokenizer → verify tokenizer.get_vocab() before training
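Mode collapse is easy to catch automatically with a distinct-n ratio over a batch of sampled outputs (a simple sketch; the alert threshold is a per-task judgment call, so calibrate it against the base model's ratio):

```python
from collections import Counter

def distinct_n(outputs: list[str], n: int = 2) -> float:
    """Unique n-grams / total n-grams across outputs; near 0 signals collapse."""
    grams, total = Counter(), 0
    for out in outputs:
        toks = out.split()
        for i in range(len(toks) - n + 1):
            grams[tuple(toks[i:i + n])] += 1
            total += 1
    return len(grams) / total if total else 0.0
```

Log this at every eval_steps checkpoint; a sudden drop between checkpoints is a much earlier warning than a failed task eval.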
## Deliverables
1. `data/`: train.jsonl, dev.jsonl, holdout.jsonl with data card (source breakdown, quality stats, contamination report)
2. `configs/train.yaml`: the Megatron-LM config above
3. `scripts/preprocess.py`: the cleaning pipeline
4. `scripts/eval.py`: task eval + regression gates
5. `README.md` with reproducibility info (seeds, env versions, command lines)
6. Model card template for HuggingFace upload
Include the exact CLI commands for launching training from `configs/train.yaml`, the eval commands, and a short runbook for the on-call engineer.
Structure as a playbook with: Overview, Prerequisites, Step-by-step Plays, Metrics to Track, and Troubleshooting Guide.