LLM Fine-Tuning Guide: When to Fine-Tune, LoRA, Data Prep, and Evaluation
Fine-tuning teaches an LLM new behavior through supervised examples. Done well, it lets a smaller, faster, cheaper model outperform prompting on specific tasks. Done badly, it wastes compute and degrades general capability.
Fine-tuning vs RAG vs prompting
| Technique | Use when | Limitations |
|---|---|---|
| Prompting | General tasks, fast iteration | Inconsistent output, token cost |
| RAG | Dynamic factual knowledge | Retrieval quality ceiling |
| Fine-tuning | Consistent format, style, domain vocabulary | Cannot inject new facts post-training |
| Fine-tuning + RAG | Domain style + dynamic facts | Higher complexity and cost |
When fine-tuning actually helps
- You need consistent output format that few-shot examples cannot reliably achieve.
- Your domain has specialized vocabulary that the base model mangles.
- You need to reduce latency and cost by using a smaller model with better task-specific performance.
- You want to instill a brand voice or tone across thousands of generations.
- You have at least 50–100 high-quality labeled examples.
Training data preparation
Quality beats quantity: 200 excellent examples typically outperform 2,000 mediocre ones.
// OpenAI fine-tuning format (JSONL): one example per line. This comment is for the reader and must not appear in the file.
{"messages": [{"role": "system", "content": "You are a JSON extractor. Return only valid JSON."}, {"role": "user", "content": "Extract the order details: John bought 3 apples for $4.50 on June 10, 2026."}, {"role": "assistant", "content": "{\"customer\":\"John\",\"items\":[{\"product\":\"apples\",\"quantity\":3}],\"total\":4.50,\"date\":\"2026-06-10\"}"}]}
{"messages": [{"role": "system", "content": "You are a JSON extractor. Return only valid JSON."}, {"role": "user", "content": "Extract: Alice ordered 1 laptop for $999.00 on June 5, 2026."}, {"role": "assistant", "content": "{\"customer\":\"Alice\",\"items\":[{\"product\":\"laptop\",\"quantity\":1}],\"total\":999.00,\"date\":\"2026-06-05\"}"}]}
- Minimum: 10 examples (OpenAI requirement); practical minimum: 50–100.
- Split 80/10/10 train/validation/test — never evaluate on training data.
- Cover edge cases and failures, not just happy paths.
- Use the same system prompt you intend to use in production.
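Parts of this checklist can be automated before uploading. The following is a minimal Python sketch, not an official validator: the function names `validate_example` and `split_dataset` and the fixed shuffle seed are illustrative assumptions, and it assumes text-only chat examples. It checks that each line has the `messages` structure shown earlier and ends with an assistant turn, then shuffles and splits the data 80/10/10.

```python
import json
import random

VALID_ROLES = {"system", "user", "assistant"}


def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown or missing role")
        elif not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message must be the assistant's target reply")
    return problems


def split_dataset(examples: list, seed: int = 0) -> tuple[list, list, list]:
    """Shuffle a copy of the examples, then split 80/10/10 into train/validation/test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
print(validate_example(good))  # []
```

Shuffling before splitting matters when the source file is ordered (for example by date or customer), because otherwise the test set would cover only the tail of the data.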
OpenAI fine-tuning API
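import OpenAI from 'openai';
import fs from 'fs';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// 1. Upload training file
const file = await openai.files.create({
  file: fs.createReadStream('training_data.jsonl'),
  purpose: 'fine-tune',
});
// 2. Start fine-tuning job
const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-mini-2024-07-18',
  hyperparameters: {
    n_epochs: 3,
  },
  suffix: 'json-extractor-v1',
});
console.log('Job ID:', job.id);
// 3. Monitor job: poll until it reaches a terminal state
let status = await openai.fineTuning.jobs.retrieve(job.id);
while (!['succeeded', 'failed', 'cancelled'].includes(status.status)) {
  await new Promise((resolve) => setTimeout(resolve, 30_000)); // wait 30 s between polls
  status = await openai.fineTuning.jobs.retrieve(job.id);
}
console.log('Status:', status.status);
console.log('Fine-tuned model:', status.fine_tuned_model); // null unless the job succeeded
LoRA and QLoRA (open-source models)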
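LoRA (Low-Rank Adaptation) freezes the original model weights and trains small low-rank adapter matrices added to selected layers. Because gradients and optimizer state are kept only for the adapters, this can reduce training GPU memory requirements by roughly 3–10×, depending on model size and configuration.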
# Using Hugging Face PEFT + transformers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
lora_config = LoraConfig(
    r=16,  # rank: higher = more trainable parameters and capacity, but more memory and overfitting risk
    lora_alpha=32,  # scaling factor: the adapter update is scaled by lora_alpha / r
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with this config the adapters are about 6.8M of 8.0B parameters (under 0.1%)
QLoRA adds 4-bit quantization of the frozen base weights on top of LoRA, making it possible to fine-tune a 70B model on a single A100 80GB GPU.
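To see where numbers like these come from, here is a back-of-the-envelope sketch in plain Python. It assumes the published Llama 3 8B shapes (hidden size 4096, 32 layers, `q_proj` mapping 4096 to 4096, and `v_proj` mapping 4096 to 1024 because of grouped-query attention); the helper names are hypothetical. A LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` trainable parameters, and the weights of a quantized model take roughly parameters × bits / 8 bytes. The memory estimate covers the frozen weights only and ignores activations, adapter gradients, optimizer state, and quantization metadata.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumed Llama 3 8B shapes (not taken from this guide)
HIDDEN, KV_DIM, LAYERS, RANK = 4096, 1024, 32, 16

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total_trainable = per_layer * LAYERS
print(f"LoRA trainable parameters: {total_trainable:,}")  # 6,815,744

print(weight_memory_gb(70e9, 16))  # 140.0 -> weights alone exceed one 80 GB GPU
print(weight_memory_gb(70e9, 4))   # 35.0  -> leaves headroom for adapters and activations
```

Doubling the rank doubles the adapter parameter count, while the quantized base weights stay the same size, which is why rank is a comparatively cheap knob to turn.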
Hyperparameter guidance
| Parameter | Default | When to change |
|---|---|---|
| Epochs | 3–5 | Increase for small datasets (<100 examples) |
| Learning rate | 1e-5 (full fine-tuning); LoRA often uses ~1e-4 | Decrease if you see catastrophic forgetting |
| Batch size | 4–16 | Larger = more stable gradient, needs more VRAM |
| LoRA rank (r) | 8–64 | Higher rank = more capacity, more memory |
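Epochs and batch size together determine how many optimizer updates the model actually receives, which helps explain the advice to raise epochs for small datasets. A small illustrative calculation follows; the function name `optimizer_steps` and the example sizes are assumptions, and gradient accumulation is ignored.

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer updates, assuming the last partial batch is kept."""
    return math.ceil(n_examples / batch_size) * epochs


print(optimizer_steps(80, 8, 3))    # 30
print(optimizer_steps(80, 8, 5))    # 50
print(optimizer_steps(2000, 8, 3))  # 750
```

With 80 training examples, three epochs give only 30 updates, so a couple of extra epochs change the amount of training substantially; with 2,000 examples the same three epochs already give 750.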
Evaluation after fine-tuning
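// Compare fine-tuned vs base model on held-out test set
import OpenAI from 'openai';
import fs from 'fs';
import { isDeepStrictEqual } from 'util';
const openai = new OpenAI();
// Each line of test_data.jsonl is one {"messages": [...]} example; the last message is the expected answer
const testCases = fs.readFileSync('test_data.jsonl', 'utf8')
  .split('\n').filter((line) => line.trim()).map((line) => JSON.parse(line));
// One possible correctness check for this JSON task: the reply parses and deep-equals the expected JSON
function isCorrect(response, expected) {
  try {
    return isDeepStrictEqual(JSON.parse(response.choices[0].message.content), JSON.parse(expected));
  } catch {
    return false; // invalid JSON counts as a failure
  }
}
let ftCorrect = 0, baseCorrect = 0;
for (const tc of testCases) {
  const prompt = tc.messages.slice(0, -1); // exclude assistant turn
  const expected = tc.messages.at(-1).content;
  const ftResponse = await openai.chat.completions.create({
    model: 'ft:gpt-4o-mini:your-org:json-extractor-v1:abc123',
    messages: prompt,
  });
  const baseResponse = await openai.chat.completions.create({
    model: 'gpt-4o-mini-2024-07-18', // same base snapshot the fine-tune started from
    messages: prompt,
  });
  if (isCorrect(ftResponse, expected)) ftCorrect++;
  if (isCorrect(baseResponse, expected)) baseCorrect++;
}
console.log('Fine-tuned accuracy:', ftCorrect / testCases.length);
console.log('Base model accuracy:', baseCorrect / testCases.length);
Common pitfalls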
- Catastrophic forgetting: fine-tuning can degrade general capability. Mitigate it with a lower learning rate, fewer epochs, or a lower LoRA rank.
- Data contamination: never include test examples in training data.
- Overfitting: watch validation loss, and stop training once it starts rising while training loss keeps falling.
- Format inconsistency in training data: malformed examples teach the model the wrong format, and in a small dataset even one can have a visible effect.
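The overfitting check can be sketched as a patience-based early-stopping rule. This is a generic illustration, not the behavior of any particular training API: the function name `early_stop_epoch`, the `patience` parameter, and the loss values are all made up for the example.

```python
def early_stop_epoch(val_losses: list[float], patience: int = 2) -> int:
    """Return the index of the best epoch, scanning until validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss has stopped improving
    return best_epoch


# Hypothetical per-epoch validation losses: improving at first, then rising
val_losses = [1.20, 0.90, 0.75, 0.78, 0.85, 0.95]
print(early_stop_epoch(val_losses))  # 2 -> keep the checkpoint from epoch index 2
```

A patience of more than one epoch avoids stopping on a single noisy validation reading, at the cost of a little extra compute.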
Takeaway
Fine-tune for format and style consistency, not factual knowledge. Treat data quality as your most important investment: 100 clean examples typically beat 1,000 noisy ones.