ChatGPT Prompt for Fine-tuning & Model Adaptation
Generate high-quality DPO preference pairs for code generation using GPT-4.1 with a robust chosen/rejected protocol.
You are an alignment data engineer. Build a DPO (or DPO with length normalization) preference dataset of 25k pairs for code generation. Quality of pairs matters more than quantity — bad pairs teach the model the wrong signal.
## Preference Pair Structure
Each row:
```json
{
"id": "pair_001",
"prompt": "...",
"chosen": "...", // preferred completion
"rejected": "...", // worse completion
"chosen_source": "human | strong_model | reward_model",
"rejected_source": "weak_model | perturbed_chosen | human_bad_example",
"preference_strength": 0.0, // 0-1, how clear the preference
"rationale": "why chosen > rejected"
}
```
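A minimal validator sketch for this schema follows; the `PreferencePair` dataclass and `validate_pair` helper are illustrative names under the assumptions of the JSON example above, not part of any library.
```python
from dataclasses import dataclass

VALID_CHOSEN_SOURCES = {"human", "strong_model", "reward_model"}
VALID_REJECTED_SOURCES = {"weak_model", "perturbed_chosen", "human_bad_example"}

@dataclass
class PreferencePair:
    id: str
    prompt: str
    chosen: str
    rejected: str
    chosen_source: str
    rejected_source: str
    preference_strength: float  # 0-1, how clear the preference is
    rationale: str

def validate_pair(pair: PreferencePair) -> list[str]:
    """Return a list of validation errors; an empty list means the pair is usable."""
    errors = []
    if not pair.prompt.strip() or not pair.chosen.strip() or not pair.rejected.strip():
        errors.append("empty prompt/chosen/rejected")
    if pair.chosen == pair.rejected:
        errors.append("chosen and rejected are identical")
    if pair.chosen_source not in VALID_CHOSEN_SOURCES:
        errors.append(f"unknown chosen_source: {pair.chosen_source}")
    if pair.rejected_source not in VALID_REJECTED_SOURCES:
        errors.append(f"unknown rejected_source: {pair.rejected_source}")
    if not 0.0 <= pair.preference_strength <= 1.0:
        errors.append("preference_strength out of [0, 1]")
    return errors
```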
## Generation Strategies
### Strategy 1: Strong-Model vs Weak-Model
- chosen = GPT-4.1 response
- rejected = Phi-3.5-mini response on the same prompt
- Rationale: exploits capability gap for free preference signal
- Risk: may learn stylistic preferences rather than correctness. Mitigate with human review on 20% sample.
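A sketch of Strategy 1, assuming `strong_model` and `weak_model` are placeholder callables (prompt in, completion text out) wired to your own inference endpoints:
```python
def strong_vs_weak_pair(prompt: str, strong_model, weak_model) -> dict:
    """Build one preference pair by exploiting the capability gap between two models."""
    return {
        "prompt": prompt,
        "chosen": strong_model(prompt),    # e.g. GPT-4.1
        "rejected": weak_model(prompt),    # e.g. Phi-3.5-mini
        "chosen_source": "strong_model",
        "rejected_source": "weak_model",
        "rationale": "capability gap between strong and weak model",
    }
```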
### Strategy 2: Best-of-N vs Worst-of-N
- Sample N=4 completions from the SFT model at temperature=0.9
- Score with GPT-4.1 rubric scorer or a reward model
- chosen = highest scored, rejected = lowest scored
- Rationale: stays on-policy, captures fine-grained preference
- Filter: require score gap ≥ 0.3 to include the pair
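A sketch of Strategy 2, assuming `sample_sft` draws one completion from the SFT model at temperature 0.9 and `score` returns a rubric score in [0, 1]; both are placeholders for your own inference and scoring code.
```python
def best_of_n_pair(prompt: str, sample_sft, score, n: int = 4, min_gap: float = 0.3):
    """Return a chosen/rejected pair from N on-policy samples, or None if the gap is too small."""
    completions = [sample_sft(prompt) for _ in range(n)]
    scored = sorted((score(c), c) for c in completions)
    (lo, worst), (hi, best) = scored[0], scored[-1]
    if hi - lo < min_gap:
        return None  # preference too weak to be a clean training signal
    return {
        "prompt": prompt,
        "chosen": best,
        "rejected": worst,
        "preference_strength": hi - lo,
        "rationale": f"best-of-{n} vs worst-of-{n}, rubric score gap {hi - lo:.2f}",
    }
```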
### Strategy 3: Perturbation
- Start with a correct response (chosen)
- Perturb to create rejected:
- Introduce factual error (swap an entity)
- Remove a step in a multi-step answer
- Change tone / format to be off-spec
- Truncate mid-answer
- Rationale: precise control over what's being penalized
- Use for: format compliance, refusal calibration, tool-use schemas
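Two of the perturbations above as a sketch; the one-step-per-line heuristic in `drop_random_step` is an assumption about your answer format, not a general rule.
```python
import random

def truncate_mid_answer(chosen: str, keep_fraction: float = 0.6) -> str:
    """Cut the correct answer off partway through to create a controlled rejected."""
    cut = max(1, int(len(chosen) * keep_fraction))
    return chosen[:cut]

def drop_random_step(chosen: str) -> str:
    """Remove one inner line of a multi-step answer (assumes one step per line)."""
    lines = [line for line in chosen.splitlines() if line.strip()]
    if len(lines) < 3:
        return truncate_mid_answer(chosen)  # too short to drop a step cleanly
    lines.pop(random.randrange(1, len(lines) - 1))  # keep the first and last lines
    return "\n".join(lines)
```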
### Strategy 4: Human Pairs
- Expert ranks 2-4 completions per prompt
- Take all (chosen_i > rejected_j) pairs
- Rationale: gold standard, expensive
- Use for: 30% of the dataset — the most important / most nuanced prompts
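A sketch of expanding one expert ranking (best first) into all pairwise preferences; a ranking of 4 completions yields 6 pairs, so annotation effort is amortized. The source labels below reuse the schema's enum and are a judgment call for lower-ranked model outputs.
```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranking: list[str]) -> list[dict]:
    """ranking is ordered best-to-worst by the expert; emit every (higher, lower) pair."""
    pairs = []
    for i, j in combinations(range(len(ranking)), 2):  # i < j, so ranking[i] is preferred
        pairs.append({
            "prompt": prompt,
            "chosen": ranking[i],
            "rejected": ranking[j],
            "chosen_source": "human",
            "rejected_source": "human_bad_example",
            "rationale": f"expert ranked position {i + 1} above position {j + 1}",
        })
    return pairs
```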
## Prompt Distribution
25k pairs spread across:
- 60% real user prompts from production logs (sampled to cover diversity)
- 20% synthetic prompts from GPT-4.1 targeting known weak spots
- 20% adversarial prompts (jailbreaks, ambiguous requests, edge cases)
### Synthetic Prompt Generator
```
Generate prompts that would test code generation capability. Focus on:
1. The target failure mode: markdown code fences wrapping JSON
2. A range of difficulties: easy, medium, hard
3. Realism: prompts a user might actually send
Output JSON: {"prompts": ["...", "...", ...]}
```
## Quality Control
### Automated Filters
- Drop pairs where chosen == rejected, or where string- or embedding-level similarity > 0.95
- Drop pairs where rejected is actually better than chosen per GPT-4.1 rubric scorer
- Drop pairs where preference_strength < 0.3
- Drop pairs with PII in either side
- Length balance: |len(chosen) - len(rejected)| should NOT be the only differentiator. Monitor length-win rate (fraction of pairs where chosen is longer). If > 0.65, you have length bias leaking in — use SimPO or DPO with length normalization.
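A sketch of the cheap, local parts of this filter pass (the rubric-scorer recheck and PII scan need external services and are omitted); the character-level ratio from `difflib` stands in for an embedding similarity.
```python
from difflib import SequenceMatcher

def passes_filters(pair: dict, min_strength: float = 0.3, max_sim: float = 0.95) -> bool:
    """Apply the local automated filters to one preference pair."""
    if pair["chosen"] == pair["rejected"]:
        return False
    if SequenceMatcher(None, pair["chosen"], pair["rejected"]).ratio() > max_sim:
        return False  # near-duplicate chosen/rejected carries almost no preference signal
    if pair.get("preference_strength", 1.0) < min_strength:
        return False
    return True

def length_win_rate(pairs: list[dict]) -> float:
    """Fraction of pairs where chosen is longer; > 0.65 suggests length bias is leaking in."""
    longer = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
    return longer / max(1, len(pairs))
```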
### Human Review Sample
Sample 20% of auto-generated pairs. For each, a human annotator rates:
- Is chosen genuinely better? (yes/no/unclear)
- Preference strength (1-5)
- Any issues (length bias, factual error in chosen, etc.)
If < 95% pass, halt and re-examine generation pipeline before continuing.
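A sketch of this review gate; `verdicts` are the yes/no/unclear answers to "is chosen genuinely better?", and the 20% sample and 95% threshold come from the numbers above.
```python
import random

def review_sample(pairs: list[dict], fraction: float = 0.2, seed: int = 0) -> list[dict]:
    """Draw a reproducible sample of auto-generated pairs for human annotation."""
    rng = random.Random(seed)
    return rng.sample(pairs, max(1, int(len(pairs) * fraction)))

def review_gate_passes(verdicts: list[str], threshold: float = 0.95) -> bool:
    """False means: halt and re-examine the generation pipeline before continuing."""
    pass_rate = sum(v == "yes" for v in verdicts) / max(1, len(verdicts))
    return pass_rate >= threshold
```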
### Leakage Check
Run contamination check vs eval sets (same as SFT dataset work).
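A minimal contamination-check sketch; the 8-gram overlap heuristic is an assumption rather than a rule from this document, and any flagged prompt should be dropped or at least reviewed.
```python
def ngrams(text: str, n: int = 8) -> set:
    """Lower-cased whitespace-token n-grams of a string."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(prompt: str, eval_prompts: list[str], n: int = 8) -> bool:
    """True if the training prompt shares any n-gram with an eval prompt."""
    grams = ngrams(prompt, n)
    return any(grams & ngrams(e, n) for e in eval_prompts)
```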
## Calibrating for DPO Beta / IPO Tau / KTO
- DPO `beta` typical range: 0.1 - 0.5. Higher = closer to reference, safer but less impact.
- For noisy preferences, prefer IPO or DPO with label smoothing
- For unpaired chosen/rejected (can't always get both), use KTO
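One way to express these choices is TRL's `DPOConfig`; this is a sketch assuming a recent TRL release, so verify the exact argument names against the version you install. KTO lives in its own `KTOTrainer`/`KTOConfig` rather than behind a `DPOConfig` flag.
```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-codegen",
    beta=0.1,              # 0.1-0.5: higher keeps the policy closer to the reference
    loss_type="sigmoid",   # standard DPO; "ipo" switches to the IPO objective
    label_smoothing=0.1,   # > 0 adds robustness to noisy preference labels
)
```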
## Splits
- Train: 90%
- Dev: 10% (used for `beta` tuning, early stopping)
- Keep the SFT holdout split clean
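A minimal seeded 90/10 split sketch; the fixed seed keeps the dev set stable across reruns so `beta` tuning and early-stopping curves stay comparable.
```python
import random

def split_pairs(pairs: list[dict], dev_fraction: float = 0.1, seed: int = 0):
    """Return (train, dev) using a deterministic shuffle."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]
```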
## Failure Modes in DPO Data
1. **Length bias** → model learns "longer = better". Fix: length-balanced pairs, SimPO.
2. **Style over substance** → model learns markdown/emoji habits. Fix: diversify chosen sources.
3. **Reference model drift** → chosen from GPT-4.1, rejected from base model: preferences reflect style gap, not task quality. Fix: mix multiple strong sources.
4. **Sycophancy reinforcement** → chosen agrees with user assumption even when user is wrong. Fix: include explicit counter-examples where chosen respectfully corrects.
## Deliverables
1. `pairs.jsonl` with schema above
2. Data card covering sources, strategies, and known biases
3. QC report: pass rate per filter, human review concordance, length statistics
4. Contamination report
5. Notebook with sample pairs for each strategy
Structure the output as a professional report with: Executive Summary, Key Findings, Detailed Analysis, Recommendations, and Next Steps.