ChatGPT Prompt for Fine-tuning & Model Adaptation
Generate high-quality DPO preference pairs for code generation using GPT-4.1 with a robust chosen/rejected protocol.
You are an alignment data engineer. Build a DPO (or DPO with length normalization) preference dataset of 25k pairs for code generation. Quality of pairs matters more than quantity — bad pairs teach the model the wrong signal.
## Preference Pair Structure
Each row:
```json
{
"id": "pair_001",
"prompt": "...",
"chosen": "...", // preferred completion
"rejected": "...", // worse completion
"chosen_source": "human | strong_model | reward_model",
"rejected_source": "weak_model | perturbed_chosen | human_bad_example",
"preference_strength": 0.0, // 0-1, how clear the preference
"rationale": "why chosen > rejected"
}
```
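A minimal validator sketch for this schema follows; the `PreferencePair` dataclass and `validate_pair` helper are illustrative names under the assumptions of the JSON example above, not part of any library.
```python
from dataclasses import dataclass

VALID_CHOSEN_SOURCES = {"human", "strong_model", "reward_model"}
VALID_REJECTED_SOURCES = {"weak_model", "perturbed_chosen", "human_bad_example"}

@dataclass
class PreferencePair:
    id: str
    prompt: str
    chosen: str
    rejected: str
    chosen_source: str
    rejected_source: str
    preference_strength: float  # 0-1, how clear the preference is
    rationale: str

def validate_pair(pair: PreferencePair) -> list[str]:
    """Return a list of validation errors; an empty list means the pair is usable."""
    errors = []
    if not pair.prompt.strip() or not pair.chosen.strip() or not pair.rejected.strip():
        errors.append("empty prompt/chosen/rejected")
    if pair.chosen == pair.rejected:
        errors.append("chosen and rejected are identical")
    if pair.chosen_source not in VALID_CHOSEN_SOURCES:
        errors.append(f"unknown chosen_source: {pair.chosen_source}")
    if pair.rejected_source not in VALID_REJECTED_SOURCES:
        errors.append(f"unknown rejected_source: {pair.rejected_source}")
    if not 0.0 <= pair.preference_strength <= 1.0:
        errors.append("preference_strength out of [0, 1]")
    return errors
```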
## Generation Strategies
### Strategy 1: Strong-Model vs Weak-Model
- chosen = GPT-4.1 response
- rejected = Phi-3.5-mini response on the same prompt
- Rationale: exploits capability gap for free preference signal
- Risk: may learn stylistic preferences rather than correctness. Mitigate with human review on 20% sample.
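A sketch of Strategy 1, assuming `strong_model` and `weak_model` are placeholder callables (prompt in, completion text out) wired to your own inference endpoints:
```python
def strong_vs_weak_pair(prompt: str, strong_model, weak_model) -> dict:
    """Build one preference pair by exploiting the capability gap between two models."""
    return {
        "prompt": prompt,
        "chosen": strong_model(prompt),    # e.g. GPT-4.1
        "rejected": weak_model(prompt),    # e.g. Phi-3.5-mini
        "chosen_source": "strong_model",
        "rejected_source": "weak_model",
        "rationale": "capability gap between strong and weak model",
    }
```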
### Strategy 2: Best-of-N vs Worst-of-N
- Sample N=4 completions from the SFT model at temperature=0.9
- Score with GPT-4.1 rubric scorer or a reward model
- chosen = highest scored, rejected = lowest scored
- Rationale: stays on-policy, captures fine-grained preference
- Filter: require score gap ≥ 0.3 to include the pair
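A sketch of Strategy 2, assuming `sample_sft` draws one completion from the SFT model at temperature 0.9 and `score` returns a rubric score in [0, 1]; both are placeholders for your own inference and scoring code.
```python
def best_of_n_pair(prompt: str, sample_sft, score, n: int = 4, min_gap: float = 0.3):
    """Return a chosen/rejected pair from N on-policy samples, or None if the gap is too small."""
    completions = [sample_sft(prompt) for _ in range(n)]
    scored = sorted((score(c), c) for c in completions)
    (lo, worst), (hi, best) = scored[0], scored[-1]
    if hi - lo < min_gap:
        return None  # preference too weak to be a clean training signal
    return {
        "prompt": prompt,
        "chosen": best,
        "rejected": worst,
        "preference_strength": hi - lo,
        "rationale": f"best-of-{n} vs worst-of-{n}, rubric score gap {hi - lo:.2f}",
    }
```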
### Strategy 3: Perturbation
- Start with a correct response (chosen)
- Perturb to create rejected:
- Introduce factual error (swap an entity)
- Remove a step in a multi-step answer
- Change tone / format to be off-spec
- Truncate mid-answer
- Rationale: precise control over what's being penalized
- Use for: format compliance, refusal calibration, tool-use schemas
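Two of the perturbations above as a sketch; the one-step-per-line heuristic in `drop_random_step` is an assumption about your answer format, not a general rule.
```python
import random

def truncate_mid_answer(chosen: str, keep_fraction: float = 0.6) -> str:
    """Cut the correct answer off partway through to create a controlled rejected."""
    cut = max(1, int(len(chosen) * keep_fraction))
    return chosen[:cut]

def drop_random_step(chosen: str) -> str:
    """Remove one inner line of a multi-step answer (assumes one step per line)."""
    lines = [line for line in chosen.splitlines() if line.strip()]
    if len(lines) < 3:
        return truncate_mid_answer(chosen)  # too short to drop a step cleanly
    lines.pop(random.randrange(1, len(lines) - 1))  # keep the first and last lines
    return "\n".join(lines)
```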
### Strategy 4: Human Pairs
- Expert ranks 2-4 completions per prompt
- Take all (chosen_i > rejected_j) pairs
- Rationale: gold standard, expensive
- Use for: 30% of the dataset — the most important / most nuanced prompts
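A sketch of expanding one expert ranking (best first) into all pairwise preferences; a ranking of 4 completions yields 6 pairs, so annotation effort is amortized. The source labels below reuse the schema's enum and are a judgment call for lower-ranked model outputs.
```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranking: list[str]) -> list[dict]:
    """ranking is ordered best-to-worst by the expert; emit every (higher, lower) pair."""
    pairs = []
    for i, j in combinations(range(len(ranking)), 2):  # i < j, so ranking[i] is preferred
        pairs.append({
            "prompt": prompt,
            "chosen": ranking[i],
            "rejected": ranking[j],
            "chosen_source": "human",
            "rejected_source": "human_bad_example",
            "rationale": f"expert ranked position {i + 1} above position {j + 1}",
        })
    return pairs
```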
## Prompt Distribution
25k pairs spread across:
- 60% real user prompts from production logs (sampled to cover diversity)
- 20% synthetic prompts from GPT-4.1 targeting known weak spots
- 20% adversarial prompts (jailbreaks, ambiguous requests, edge cases)
### Synthetic Prompt Generator
```
Generate prompts that would test code generation capability. Focus on:
1. The target failure mode: markdown code fences wrapping JSON
2. A range of difficulties: easy, medium, hard
3. Realism: prompts a user might actually send
Output JSON: {"prompts": ["...", "...", ...]}
```
## Quality Control
### Automated Filters
- Drop pairs where chosen == rejected, or where string- or embedding-level similarity > 0.95
- Drop pairs where rejected is actually better than chosen per GPT-4.1 rubric scorer
- Drop pairs where preference_strength < 0.3
- Drop pairs with PII in either side
- Length balance: |len(chosen) - len(rejected)| should NOT be the only differentiator. Monitor length-win rate (fraction of pairs where chosen is longer). If > 0.65, you have length bias leaking in — use SimPO or DPO with length normalization.
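A sketch of the cheap, local parts of this filter pass (the rubric-scorer recheck and PII scan need external services and are omitted); the character-level ratio from `difflib` stands in for an embedding similarity.
```python
from difflib import SequenceMatcher

def passes_filters(pair: dict, min_strength: float = 0.3, max_sim: float = 0.95) -> bool:
    """Apply the local automated filters to one preference pair."""
    if pair["chosen"] == pair["rejected"]:
        return False
    if SequenceMatcher(None, pair["chosen"], pair["rejected"]).ratio() > max_sim:
        return False  # near-duplicate chosen/rejected carries almost no preference signal
    if pair.get("preference_strength", 1.0) < min_strength:
        return False
    return True

def length_win_rate(pairs: list[dict]) -> float:
    """Fraction of pairs where chosen is longer; > 0.65 suggests length bias is leaking in."""
    longer = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
    return longer / max(1, len(pairs))
```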
### Human Review Sample
Sample 20% of auto-generated pairs. For each, a human annotator rates:
- Is chosen genuinely better? (yes/no/unclear)
- Preference strength (1-5)
- Any issues (length bias, factual error in chosen, etc.)
If < 95% pass, halt and re-examine generation pipeline before continuing.
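A sketch of this review gate; `verdicts` are the yes/no/unclear answers to "is chosen genuinely better?", and the 20% sample and 95% threshold come from the numbers above.
```python
import random

def review_sample(pairs: list[dict], fraction: float = 0.2, seed: int = 0) -> list[dict]:
    """Draw a reproducible sample of auto-generated pairs for human annotation."""
    rng = random.Random(seed)
    return rng.sample(pairs, max(1, int(len(pairs) * fraction)))

def review_gate_passes(verdicts: list[str], threshold: float = 0.95) -> bool:
    """False means: halt and re-examine the generation pipeline before continuing."""
    pass_rate = sum(v == "yes" for v in verdicts) / max(1, len(verdicts))
    return pass_rate >= threshold
```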
### Leakage Check
Run contamination check vs eval sets (same as SFT dataset work).
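A minimal contamination-check sketch; the 8-gram overlap heuristic is an assumption rather than a rule from this document, and any flagged prompt should be dropped or at least reviewed.
```python
def ngrams(text: str, n: int = 8) -> set:
    """Lower-cased whitespace-token n-grams of a string."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(prompt: str, eval_prompts: list[str], n: int = 8) -> bool:
    """True if the training prompt shares any n-gram with an eval prompt."""
    grams = ngrams(prompt, n)
    return any(grams & ngrams(e, n) for e in eval_prompts)
```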
## Calibrating for DPO Beta / IPO Tau / KTO
- DPO `beta` typical range: 0.1 - 0.5. Higher = closer to reference, safer but less impact.
- For noisy preferences, prefer IPO or DPO with label smoothing
- For unpaired chosen/rejected (can't always get both), use KTO
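One way to express these choices is TRL's `DPOConfig`; this is a sketch assuming a recent TRL release, so verify the exact argument names against the version you install. KTO lives in its own `KTOTrainer`/`KTOConfig` rather than behind a `DPOConfig` flag.
```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-codegen",
    beta=0.1,              # 0.1-0.5: higher keeps the policy closer to the reference
    loss_type="sigmoid",   # standard DPO; "ipo" switches to the IPO objective
    label_smoothing=0.1,   # > 0 adds robustness to noisy preference labels
)
```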
## Splits
- Train: 90%
- Dev: 10% (used for `beta` tuning, early stopping)
- Keep the SFT holdout split clean
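A minimal seeded 90/10 split sketch; the fixed seed keeps the dev set stable across reruns so `beta` tuning and early-stopping curves stay comparable.
```python
import random

def split_pairs(pairs: list[dict], dev_fraction: float = 0.1, seed: int = 0):
    """Return (train, dev) using a deterministic shuffle."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]
```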
## Failure Modes in DPO Data
1. **Length bias** → model learns "longer = better". Fix: length-balanced pairs, SimPO.
2. **Style over substance** → model learns markdown/emoji habits. Fix: diversify chosen sources.
3. **Reference model drift** → chosen from GPT-4.1, rejected from base model: preferences reflect style gap, not task quality. Fix: mix multiple strong sources.
4. **Sycophancy reinforcement** → chosen agrees with user assumption even when user is wrong. Fix: include explicit counter-examples where chosen respectfully corrects.
## Deliverables
1. `pairs.jsonl` with schema above
2. Data card covering sources, strategies, and known biases
3. QC report: pass rate per filter, human review concordance, length statistics
4. Contamination report
5. Notebook with sample pairs for each strategy
Structure the output as a professional report with: Executive Summary, Key Findings, Detailed Analysis, Recommendations, and Next Steps.