LLM Fine-Tuning Guide: When to Fine-Tune, LoRA, Data Prep, and Evaluation

Fine-tuning teaches an LLM new behavior through supervised examples. Done correctly, it lets a smaller, faster, cheaper model outperform prompting on specific tasks. Done incorrectly, it wastes compute and degrades general capability.

Fine-tuning vs RAG vs prompting

Technique	Use when	Limitations
Prompting	General tasks, fast iteration	Inconsistent output, token cost
RAG	Dynamic factual knowledge	Retrieval quality ceiling
Fine-tuning	Consistent format, style, domain vocabulary	Unreliable for adding facts; knowledge is frozen at training time
Fine-tuning + RAG	Domain style + dynamic facts	Higher complexity and cost

When fine-tuning actually helps

You need consistent output format that few-shot examples cannot reliably achieve.
Your domain has specialized vocabulary that the base model mangles.
You need to reduce latency and cost by using a smaller model with better task-specific performance.
You want to instill a brand voice or tone across thousands of generations.
You have at least 50–100 high-quality labeled examples.

Training data preparation

Quality beats quantity: 200 excellent examples often outperform 2,000 mediocre ones.

// OpenAI fine-tuning format (JSONL). Shown pretty-printed for readability;
// in the actual file, each example must be a single line with no comments.
{"messages": [
  {"role": "system", "content": "You are a JSON extractor. Return only valid JSON."},
  {"role": "user",   "content": "Extract the order details: John bought 3 apples for $4.50 on June 10, 2026."},
  {"role": "assistant", "content": "{\"customer\":\"John\",\"items\":[{\"product\":\"apples\",\"quantity\":3}],\"total\":4.50,\"date\":\"2026-06-10\"}"}
]}
{"messages": [
  {"role": "system", "content": "You are a JSON extractor. Return only valid JSON."},
  {"role": "user",   "content": "Extract: Alice ordered 1 laptop for $999.00 on June 5, 2026."},
  {"role": "assistant", "content": "{\"customer\":\"Alice\",\"items\":[{\"product\":\"laptop\",\"quantity\":1}],\"total\":999.00,\"date\":\"2026-06-05\"}"}
]}

Minimum: 10 examples (OpenAI requirement); practical minimum: 50–100.
Split 80/10/10 train/validation/test — never evaluate on training data.
Cover edge cases and failures, not just happy paths.
Use the same system prompt you intend to use in production.
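
The checklist above can be partly automated. The following is a minimal sketch, not an official validator: it assumes every example carries a system, user, and assistant turn (as in the samples shown), and the helper names `validate_example`, `split_examples`, and `write_jsonl` are hypothetical. It flags structurally broken examples, removes exact duplicates so they cannot leak across splits, and produces an 80/10/10 split.

```python
import json
import random

# Assumption: every example has all three turns, as in the samples above.
REQUIRED_ROLES = ("system", "user", "assistant")


def validate_example(example: dict) -> list:
    """Return a list of problems found in one chat-format example (empty if clean)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    roles = [m.get("role") for m in messages]
    for role in REQUIRED_ROLES:
        if role not in roles:
            problems.append(f"no {role} message")
    if roles[-1] != "assistant":
        problems.append("last message is not the assistant answer")
    if any(not isinstance(m.get("content"), str) or not m["content"].strip() for m in messages):
        problems.append("empty or non-string content")
    return problems


def split_examples(examples: list, seed: int = 0) -> tuple:
    """Deduplicate, shuffle, and split 80/10/10 into (train, validation, test)."""
    # Exact duplicates would otherwise be able to land in both train and test.
    unique = {json.dumps(ex, sort_keys=True): ex for ex in examples}
    items = list(unique.values())
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def write_jsonl(path: str, examples: list) -> None:
    """Write one compact JSON object per line, as JSONL requires."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

A typical flow would be to drop or fix every example for which `validate_example` returns a non-empty list, call `split_examples` on the rest, and write each split to its own file with `write_jsonl`. This only catches structural problems; whether the assistant answers are actually correct still needs human review.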

OpenAI fine-tuning API

import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI();

// 1. Upload training file
const file = await openai.files.create({
  file: fs.createReadStream('training_data.jsonl'),
  purpose: 'fine-tune',
});

// 2. Start fine-tuning job
const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-mini-2024-07-18',
  hyperparameters: {
    n_epochs: 3,
  },
  suffix: 'json-extractor-v1',
});

console.log('Job ID:', job.id);

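// 3. Poll until the job reaches a terminal state (a new job starts in 'validating_files')
let status = await openai.fineTuning.jobs.retrieve(job.id);
while (!['succeeded', 'failed', 'cancelled'].includes(status.status)) {
  await new Promise((resolve) => setTimeout(resolve, 60_000));  // check once a minute
  status = await openai.fineTuning.jobs.retrieve(job.id);
}
console.log('Status:', status.status);
console.log('Fine-tuned model:', status.fine_tuned_model);  // null unless the job succeeded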

LoRA and QLoRA (open-source models)

LoRA (Low-Rank Adaptation) freezes the original model weights and trains small adapter matrices. Because gradients and optimizer state are kept only for the adapters, training typically needs 3–10× less GPU memory than full fine-tuning.

# Using HuggingFace PEFT + transformers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Gated model: requires accepting the license on the Hugging Face Hub and logging in
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')

lora_config = LoraConfig(
    r=16,                    # rank: higher = more trainable parameters and capacity, not always better results
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For Llama 3 8B with this config, prints approximately (formatting varies by PEFT version):
# trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848
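
To see where a trainable-parameter count comes from: for each adapted layer, LoRA adds a matrix A of shape (r × d_in) and a matrix B of shape (d_out × r), so each adapter contributes r · (d_in + d_out) parameters. The sketch below works through that arithmetic. It assumes the published Llama 3 8B dimensions (hidden size 4096, 32 layers, and a 1024-wide value projection because of grouped-query attention); the helper name `lora_params` is hypothetical.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Llama 3 8B shapes
HIDDEN = 4096   # model hidden size; q_proj maps 4096 -> 4096
KV_DIM = 1024   # 8 key/value heads x 128 head dim; v_proj maps 4096 -> 1024
LAYERS = 32
RANK = 16

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * LAYERS
print(f"{total:,}")  # 6,815,744
```

Doubling the rank doubles this number, and adding more target modules (for example the key and output projections) grows it further, which is why rank and `target_modules` are the main levers on adapter size.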

QLoRA quantizes the frozen base weights to 4 bits and trains LoRA adapters on top of them, which makes it possible to fine-tune a 70B model on a single A100 80GB GPU.
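
A rough back-of-the-envelope calculation shows why 4-bit quantization matters at this scale. This sketch counts only the memory needed to hold the base weights; it deliberately ignores activations, adapter weights, optimizer state, and quantization metadata, all of which add overhead in practice. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed just to store the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB
# 8-bit: 70 GB
# 4-bit: 35 GB
```

At 16 bits the weights of a 70B model alone exceed an 80 GB card; at 4 bits they take about 35 GB, leaving headroom for the adapters and activations.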

Hyperparameter guidance

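Parameter	Typical value	When to change
Epochs	3–5	Increase for small datasets (<100 examples); decrease if validation loss starts rising
Learning rate	1e-5 (full fine-tuning); 1e-4 to 2e-4 is common for LoRA	Decrease if you see catastrophic forgetting or unstable loss
Batch size	4–16	Larger = more stable gradients, but needs more VRAM
LoRA rank (r)	8–64	Higher rank = more capacity and more memory; raise it only if the model underfits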
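
One way to reason about the epochs and batch-size rows together is to count how many weight updates a run actually performs. The sketch below is simple arithmetic under the assumption of one update per batch (no gradient accumulation); `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Number of weight updates in a run; a final partial batch still counts as one step."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    return batches_per_epoch * epochs


print(optimizer_steps(80, 8, 3))    # 30
print(optimizer_steps(800, 8, 3))   # 300
```

With 80 training examples, three epochs at batch size 8 give the model only 30 updates, which is why small datasets often need more epochs, while a dataset ten times larger gets 300 updates from the same settings.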

Evaluation after fine-tuning

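// Compare fine-tuned vs base model on held-out test set
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI();

// One example per line; the last message of each example is the reference answer
const testCases = fs.readFileSync('test_data.jsonl', 'utf8')
  .split('\n')
  .filter((line) => line.trim())
  .map((line) => JSON.parse(line));

// Exact match on parsed JSON; assumes both sides emit keys in the same order
function isCorrect(output, expected) {
  try {
    return JSON.stringify(JSON.parse(output)) === JSON.stringify(JSON.parse(expected));
  } catch {
    return false;  // invalid JSON counts as a failure
  }
}

async function complete(model, messages) {
  const res = await openai.chat.completions.create({ model, messages, temperature: 0 });
  return res.choices[0].message.content;
}

let ftCorrect = 0, baseCorrect = 0;

for (const tc of testCases) {
  const prompt = tc.messages.slice(0, -1);      // exclude assistant turn
  const expected = tc.messages.at(-1).content;  // reference answer

  // Replace with your own fine-tuned model ID; compare against the same base snapshot
  const ftOutput = await complete('ft:gpt-4o-mini-2024-07-18:your-org:json-extractor-v1:abc123', prompt);
  const baseOutput = await complete('gpt-4o-mini-2024-07-18', prompt);

  if (isCorrect(ftOutput, expected))   ftCorrect++;
  if (isCorrect(baseOutput, expected)) baseCorrect++;
}

console.log('Fine-tuned accuracy:', ftCorrect / testCases.length);
console.log('Base model accuracy:', baseCorrect / testCases.length);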
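
How a response is judged correct matters as much as the loop itself. For the JSON-extraction task used in this guide, a plain string comparison would penalize harmless differences in whitespace or key order. The following is one possible scoring sketch, assuming the reference answer is a non-empty JSON object; the function names are hypothetical. It offers an all-or-nothing exact match plus a softer per-field score.

```python
import json


def parse_json(text):
    """Return the parsed value, or None if the text is missing or not valid JSON."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None


def exact_match(output: str, expected: str) -> bool:
    """True when both strings parse to equal JSON values (key order and whitespace ignored)."""
    parsed = parse_json(output)
    return parsed is not None and parsed == json.loads(expected)


def field_accuracy(output: str, expected: str) -> float:
    """Fraction of expected top-level fields whose values the output reproduces exactly."""
    parsed = parse_json(output)
    reference = json.loads(expected)  # assumption: a non-empty JSON object
    if not isinstance(parsed, dict):
        return 0.0
    hits = sum(1 for key, value in reference.items() if parsed.get(key) == value)
    return hits / len(reference)
```

Reporting both numbers is useful: exact match shows how often the output is usable as-is, while field accuracy shows whether failures are near misses or complete breakdowns.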

Common pitfalls

Catastrophic forgetting: fine-tuning can degrade general capability. Mitigate it with a lower learning rate, fewer epochs, or a lower LoRA rank.
Data contamination: never include test examples in training data.
Overfitting: watch validation loss, and stop training once it starts rising while training loss keeps falling.
Format inconsistency in training data: with small datasets, even a few malformed examples can teach the model to reproduce the malformed output.
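
The overfitting check can be made mechanical with a patience-based early-stopping rule. This is a minimal sketch of one common variant, assuming you record one validation loss per epoch; the function name and the default `patience` of 2 are illustrative choices, not recommendations from any particular library.

```python
from typing import List, Optional


def early_stop_epoch(val_losses: List[float], patience: int = 2, min_delta: float = 0.0) -> Optional[int]:
    """Return the index of the epoch at which to stop, or None if training should continue.

    Stops once validation loss has failed to improve on its best value by more
    than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


print(early_stop_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9]))  # 4
print(early_stop_epoch([1.0, 0.9, 0.8]))                   # None
```

In the first run the best loss occurs at epoch 2 and the rule fires at epoch 4, so the checkpoint to keep is the one from epoch 2, not the last one trained.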

Takeaway

Fine-tune for format and style consistency, not factual knowledge. Treat data quality as your most important investment: 100 clean examples will often beat 1,000 noisy ones.