ChatGPT Prompt for Fine-tuning & Model Adaptation
End-to-end fine-tuning recipe: QLoRA (4-bit) on Llama 3.1 70B via LitGPT, targeting AWS g5.12xlarge, with data mix and eval plan.
You are a senior ML engineer shipping a fine-tuned open-weight LLM to production. Produce an end-to-end training recipe that an engineer can run on AWS g5.12xlarge.
## Objective
Fine-tune Llama 3.1 70B via QLoRA (4-bit) to excel at SQL-from-text generation. Target: match or exceed GPT-4o-mini zero-shot on our internal benchmark while keeping general capability degradation (MMLU delta) within 2 points.
## Why QLoRA (4-bit)
Briefly justify QLoRA (4-bit) vs alternatives for this task+budget. Cover: VRAM, training-time, alignment-vs-capability tradeoff, and data requirements.
## Dataset Construction
### Sources
- **Primary:** GitHub PR discussions (10k examples)
- **Secondary:** Stack Overflow Q&A (10k examples)
- **Synthetic augmentation:** generated via DeepSeek-V3 with rejection sampling
### Data Schema
Each example is a JSON object:
```json
{
  "id": "uuid",
  "source": "string",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "task_type": "SQL-from-text generation",
  "quality_score": 0.0,
  "metadata": {"lang": "en", "difficulty": "medium"}
}
```
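A lightweight validator catches malformed records before they reach the trainer. The check below is a stdlib-only sketch; the specific expectations (quality_score in [0, 1], last turn is a non-empty assistant message) are assumptions about this schema, not requirements stated above.

```python
# Sketch: validate one JSONL record against the schema above.
# Assumptions: quality_score lies in [0, 1] and the final message is a
# non-empty assistant turn; adjust if your pipeline differs.
import json

REQUIRED_KEYS = {"id", "source", "messages", "task_type", "quality_score", "metadata"}

def validate_example(line: str) -> dict:
    ex = json.loads(line)
    missing = REQUIRED_KEYS - ex.keys()
    assert not missing, f"missing keys: {missing}"
    assert ex["messages"], "empty messages list"
    assert all(m["role"] in {"system", "user", "assistant"} for m in ex["messages"])
    assert ex["messages"][-1]["role"] == "assistant" and ex["messages"][-1]["content"].strip()
    assert 0.0 <= ex["quality_score"] <= 1.0
    return ex
```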
### Cleaning Pipeline
1. **Deduplication:** MinHash (128 hashes, Jaccard ≥ 0.85) on the concatenated messages (see the dedup sketch after this list)
2. **Language filter:** fastText langid, keep target language(s)
3. **Quality filter:** length bounds, non-empty assistant, no boilerplate refusals
4. **PII scrub:** regex + Presidio for emails, phones, SSNs
5. **Contamination check:** exact-match and 13-gram overlap vs eval sets (MMLU, HumanEval, GSM8K, our internal eval)
6. **Toxicity filter:** Azure Content Safety score < 0.3
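For step 1, a minimal near-duplicate filter can look like the sketch below. It assumes the `datasketch` package; the 128-permutation / 0.85-threshold parameters are the ones specified in the pipeline, and the policy of keeping the first example seen in each duplicate cluster is an assumption.

```python
# Sketch: MinHash-LSH near-duplicate filtering (assumes the datasketch package).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def dedup(examples: list[dict]) -> list[dict]:
    # threshold matches the Jaccard >= 0.85 rule above
    lsh = MinHashLSH(threshold=0.85, num_perm=128)
    kept = []
    for ex in examples:
        text = " ".join(msg["content"] for msg in ex["messages"])
        mh = minhash_of(text)
        if lsh.query(mh):  # near-duplicate of an already-kept example
            continue
        lsh.insert(ex["id"], mh)
        kept.append(ex)
    return kept
```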
### Data Mix
Target 25k total examples, weighted (a materialization sketch follows the splits below):
- 70% core task (SQL-from-text generation)
- 20% instruction diversity (to preserve general capability)
- 5% safety/refusal calibration
- 5% format consistency (JSON, markdown, code)
### Splits
- Train: 90%
- Dev (for early stopping, hyperparam): 5%
- Holdout (held until final eval): 5%
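One way to materialize the mix and splits above is sketched here. The bucket names and the `build_splits` helper are hypothetical (keyed by however your pipeline tags each example); only the 25k total, the mix weights, and the 90/5/5 split come from this plan.

```python
# Sketch: sample to the target mix, then split 90/5/5 with a fixed seed.
import random

MIX = {
    "core_task": 0.70,
    "instruction_diversity": 0.20,
    "safety_refusal": 0.05,
    "format_consistency": 0.05,
}
TOTAL = 25_000

def build_splits(buckets: dict[str, list[dict]], seed: int = 42):
    rng = random.Random(seed)
    pool = []
    for name, weight in MIX.items():
        pool.extend(rng.sample(buckets[name], int(TOTAL * weight)))
    rng.shuffle(pool)
    n_eval = int(len(pool) * 0.05)  # 5% dev, 5% holdout, rest train
    dev, holdout, train = pool[:n_eval], pool[n_eval:2 * n_eval], pool[2 * n_eval:]
    return train, dev, holdout
```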
## Prompt Template / Chat Format
Match Llama 3.1 70B's native chat template (critical: a mismatched template tanks performance):
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant}<|eot_id|>
```
(Use the EXACT tokens from the Llama 3.1 70B tokenizer config; verify with tokenizer.apply_chat_template.)
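A quick way to run that verification (a sketch assuming the `transformers` tokenizer for the `meta-llama/Llama-3.1-70B-Instruct` repo; point at your local checkpoint if weights are mirrored internally):

```python
# Sketch: render the chat template and compare it against your preprocessing output.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
messages = [
    {"role": "system", "content": "You translate questions into SQL."},
    {"role": "user", "content": "How many orders shipped in March?"},
]
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)  # diff this against what your preprocessing emits
```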
## Training Configuration (LitGPT)
```yaml
base_model: Llama 3.1 70B
adapter: qlora
load_in_4bit: true
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 5e-5
warmup_ratio: 0.03
weight_decay: 0.0
bf16: true
flash_attention: true
gradient_checkpointing: true
eval_steps: 200
save_steps: 500
save_total_limit: 3
early_stopping_patience: 3
```
### Why these hyperparameters
- LR for LoRA is typically 10-100× higher than a full fine-tune; start at 5e-5 and decay with the cosine schedule
- r=64 is a solid default; go higher only if you see underfitting on train loss
- sequence_len=8192 covers 99% of examples; longer ones are truncated
- Pack samples aggressively to maximize GPU utilization
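For cross-checking the adapter settings outside the training framework, a rough Hugging Face PEFT equivalent looks like this (a sketch; these are `peft` field names, not LitGPT's internal config keys):

```python
# Sketch: the same adapter settings expressed as a Hugging Face peft LoraConfig.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```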
## Hardware: AWS g5.12xlarge
- GPU layout: 4x NVIDIA A10G, 24 GB each (96 GB total); the 4-bit base weights and adapters must be sharded across all four
- Estimated VRAM footprint: ~48 GB
- Estimated train time: ~24 hours
- Estimated cost: ~$120
If you hit OOM: reduce micro_batch_size, enable CPU offload (DeepSpeed ZeRO-3), or drop sequence_len.
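A back-of-envelope memory estimate behind that figure (rough arithmetic, not a measurement; layer dimensions are Llama 3.1 70B's published shape, and the bytes-per-parameter assumptions are approximations):

```python
# Rough VRAM arithmetic only. Llama 3.1 70B: hidden 8192, 80 layers,
# GQA KV projection dim 1024, MLP intermediate dim 28672.
r, d, d_kv, d_mlp, layers = 64, 8192, 1024, 28672, 80

base_gb = 70e9 * 0.5 / 1e9  # 4-bit (NF4) base weights ~= 35 GB

lora_per_layer = r * (
    (d + d)            # q_proj
    + 2 * (d + d_kv)   # k_proj, v_proj
    + (d + d)          # o_proj
    + 3 * (d + d_mlp)  # gate_proj, up_proj, down_proj
)
lora_params = lora_per_layer * layers
# bf16 weights + bf16 grads + fp32 Adam moments (m, v) per trainable param
trainable_gb = lora_params * (2 + 2 + 4 + 4) / 1e9

print(f"base ~= {base_gb:.0f} GB, adapters + optimizer ~= {trainable_gb:.1f} GB")
# Activations at sequence_len=8192 (even with gradient checkpointing) add a few
# GB per GPU, which is how a ~48 GB budget sharded over 4x 24 GB A10Gs arises.
```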
## Eval Plan
Run BEFORE and AFTER fine-tuning:
1. **Task-specific eval** (SQL-from-text generation): golden set of 2000 held-out examples, scored by a Claude Sonnet 4.5 rubric scorer (an execution-accuracy sketch follows this list)
2. **General capability** (regression gate):
- MMLU 5-shot
- HumanEval (if code) or MT-Bench (if chat)
- TruthfulQA
3. **Format compliance:** % of outputs that match expected schema
4. **Safety:** refusal rate on a held-out harmful-prompt set (should stay near baseline)
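To complement the rubric scorer in item 1, an execution-accuracy check is cheap to add. The sketch below assumes each golden example ships with a SQLite fixture, which is an assumption about your benchmark format rather than something specified above.

```python
# Sketch: execution-accuracy check for SQL-from-text outputs (stdlib sqlite3).
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """True if predicted and gold SQL return the same (order-insensitive) rows."""
    con = sqlite3.connect(db_path)
    try:
        gold = sorted(map(tuple, con.execute(gold_sql).fetchall()))
        try:
            pred = sorted(map(tuple, con.execute(pred_sql).fetchall()))
        except sqlite3.Error:
            return False  # predicted SQL fails to execute at all
        return gold == pred
    finally:
        con.close()
```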
Pass criteria:
- Task metric ≥ GPT-4o-mini zero-shot
- MMLU delta ≥ -2 points
- Refusal rate within ±3% of baseline
- No regression on format compliance
## Failure Modes to Watch
- **Mode collapse:** outputs become repetitive or always the same short form → reduce LR, check data diversity
- **Catastrophic forgetting of format:** loses ability to output markdown tables → add format-preservation examples
- **Sycophancy spike:** agrees with user mistakes → verify preference pairs are not rewarding agreement
- **Tokenizer drift:** custom special tokens not properly added to the tokenizer → verify tokenizer.get_vocab() before training
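A pre-flight check for the tokenizer-drift failure mode might look like this (the token strings assume the stock Llama 3.1 chat template; extend the list with any custom special tokens your pipeline introduces):

```python
# Sketch: confirm chat-template special tokens exist in the vocab and encode
# to single token IDs before any training run.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
vocab = tok.get_vocab()
for t in ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]:
    assert t in vocab, f"{t} missing from tokenizer vocab"
    assert len(tok.encode(t, add_special_tokens=False)) == 1, f"{t} is not a single token"
```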
## Deliverables
1. `data/`: train.jsonl, dev.jsonl, holdout.jsonl with data card (source breakdown, quality stats, contamination report)
2. `configs/train.yaml`: the LitGPT config above
3. `scripts/preprocess.py`: the cleaning pipeline
4. `scripts/eval.py`: task eval + regression gates
5. `README.md` with reproducibility info (seeds, env versions, command lines)
6. Model card template for HuggingFace upload
Include the exact CLI commands (e.g. `litgpt finetune --config configs/train.yaml`, adjusted to your installed LitGPT version), the eval commands, and a short runbook for the on-call engineer.
Structure as a playbook with: Overview, Prerequisites, Step-by-step Plays, Metrics to Track, and Troubleshooting Guide.