TL;DR
Fine-tuning an open-weights LLM is now cheaper and more reliable than ever — but the failure modes (forgetting, overfitting, broken instruction following, mis-targeted compute) have not gone away. Here is the short version of what works in 2026:
- Try RAG and prompt engineering before fine-tuning. 80% of "we need fine-tuning" requests are solved by better retrieval.
- QLoRA is the right default — 4-bit base, LoRA adapters, single A100 80GB fine-tunes Llama 3 8B in 6 hours on 50k examples for ~USD 12.
- Data quality > data quantity. 1,000 hand-curated examples often beat 50,000 noisy ones (the LIMA result still holds).
- DPO has displaced RLHF for alignment in most production teams. Cheaper, more stable, comparable quality.
- Always benchmark before and after on MMLU, GSM8K, HumanEval, and your domain eval. Catastrophic forgetting is real.
Fine-tuning a large language model used to require a six-figure budget and a team of ML PhDs. In 2026 you can fine-tune Llama 3 8B Instruct on a domain task on a single rented A100 80GB for under USD 15, ship the resulting LoRA adapter as a 30 MB file, and serve it with hot-swappable adapters on vLLM. The economics have shifted so radically that the question is no longer "can we fine-tune?" but "should we, and if so, how?" The bad news is that the failure modes (catastrophic forgetting, overfitting to small datasets, broken instruction following, distribution shift between training and deployment) have not improved at the same rate. Fine-tuning still goes badly more often than it goes well.
This guide is the playbook we use at hjLabs.in when clients ask us to fine-tune. It covers the decision of whether to fine-tune, data preparation, the LoRA/QLoRA/full-FT trade-off, the alignment step (RLHF vs DPO vs ORPO), evaluation discipline, deployment with vLLM and adapter hot-swap, and the cost numbers on Indian and global GPU pricing as of May 2026.
"Fine-tuning is the answer when prompt engineering is too verbose, RAG is too slow, and the task is narrow enough that a 1,000-example dataset can specify it. Otherwise it is overkill."
1. Should You Fine-Tune At All?
Before you spend a single GPU-hour, run this decision tree honestly.
- Is the task mainly about domain-specific terminology, format, or style? Fine-tuning helps. Few-shot prompting is also enough when fewer than 3 in-context examples capture the pattern.
- Is the task knowledge-recall over a private corpus? Use RAG, not fine-tuning. Fine-tuning is bad at injecting new facts; it is good at injecting new behaviors.
- Is the task reasoning? Try a stronger base model first. Fine-tuning a 7B does not turn it into a GPT-4o; chain-of-thought distillation can close some of the gap.
- Do you have at least 500 high-quality labeled examples? If no, do not fine-tune; iterate on prompts or generate synthetic data with a stronger model.
- Can you live with a model under 30B parameters? If you must run a 70B, fine-tuning cost rises roughly 10x, so be sure the lift is worth it.
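The checklist above can be sketched as a small function. This is a minimal illustration rather than a substitute for judgment: the `TaskProfile` fields and the returned labels are hypothetical names, and the checks are reordered so that the disqualifying ones (private facts, broad reasoning, too little data) run first.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical summary of a candidate task (field names are illustrative)."""
    needs_private_facts: bool     # knowledge recall over a private corpus
    needs_broad_reasoning: bool   # general reasoning rather than a narrow skill
    is_style_or_format: bool      # domain terminology, format, or style
    labeled_examples: int         # high-quality labeled examples available


def recommend_approach(task: TaskProfile) -> str:
    """Return the first recommendation from the checklist that applies."""
    if task.needs_private_facts:
        return "rag"
    if task.needs_broad_reasoning:
        return "stronger-base-model"
    if task.labeled_examples < 500:      # threshold taken from the checklist
        return "prompting-or-synthetic-data"
    if task.is_style_or_format:
        return "fine-tune"
    return "prompting-first"


json_extraction = TaskProfile(False, False, True, 3200)
print(recommend_approach(json_extraction))
```

Encoding the questions this way mostly helps to force an explicit answer to each one before any GPU time is booked.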
Concrete cases where we recommend fine-tuning: structured output adherence (JSON schemas, function calls), tone and style alignment (legal Hindi summarization, medical record cleanup), classification with custom taxonomies, code generation in a niche framework, and on-prem latency-bound deployments where a 7B fine-tune outperforms a 70B base. Cases where we do not: factual recall over enterprise docs (use RAG), broad reasoning (use a bigger base), or "make it smarter generally" (this is not how fine-tuning works).
2. Method Selection: LoRA vs QLoRA vs Full FT
The three methods are distinguished by which weights are updated and at what precision.
| Method | What it updates | VRAM (Llama 3 8B) | VRAM (Llama 3 70B) | Speed | Quality ceiling | Storage / artifact |
|---|---|---|---|---|---|---|
| Full Fine-Tune | All weights, bf16 | ~160 GB (multi-GPU) | ~1.4 TB (8x H100) | 1.0x | Highest | Full new checkpoint (~16 GB / 140 GB) |
| LoRA | Low-rank adapters (r=16–64) on q,k,v,o,gate,up,down | ~28 GB (1x A100 40GB) | ~200 GB (4x A100) | 1.5–2x | Within 1–2 pts of full FT on narrow tasks | ~30 MB adapter |
| QLoRA | LoRA adapters on 4-bit NF4 base | ~18 GB (1x A100 40GB or 1x A10G 24GB) | ~48 GB (1x A100 80GB or 1x H100) | 1.2–1.8x | Within 1 pt of LoRA on most tasks | ~30 MB adapter |
| DoRA (decomposed LoRA) | Magnitude + direction LoRA | ~20 GB | ~52 GB | 0.95x | Slightly above LoRA on some tasks | ~30 MB adapter |
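
As a sanity check on the table, the weights-only footprint follows from parameter count times bytes per parameter. The sketch below covers weights only, with parameter counts rounded to 8B and 70B; gradients, optimizer state, activations, and quantization constants make up the rest of the VRAM figures and are not modeled here.

```python
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes); ignores optimizer state and activations."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


for name, params in [("8B", 8), ("70B", 70)]:
    bf16 = weight_gb(params, 16)   # bf16 checkpoint, as in the full fine-tune row
    nf4 = weight_gb(params, 4)     # 4-bit base, as used by QLoRA
    print(f"{name}: bf16 ~{bf16:.0f} GB, 4-bit ~{nf4:.0f} GB")
```

The bf16 results (16 GB and 140 GB) line up with the checkpoint sizes in the last column, and the 4-bit results show why a quantized base leaves room for adapters and activations on a single card.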
The default we recommend in 90% of cases is QLoRA on Llama 3.1 8B Instruct or Mistral 7B v0.3. It fits on a single rented A100 80GB (USD 1.80/hr on Lambda, RunPod, or Modal), trains 50k examples in 4–6 hours, and produces a 30 MB adapter you can serve alongside the base model with no perf hit. Quality is within a point of full fine-tuning on every benchmark we have measured. Only go to full fine-tuning when (a) you have >500k high-quality examples, (b) you need maximum quality on a high-stakes task, and (c) you have the multi-GPU budget.
# QLoRA fine-tune skeleton with TRL + PEFT
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

# Placeholder dataset: a JSONL file with one formatted example per row in a "text" field.
ds = load_dataset("json", data_files="train.jsonl", split="train")

# 4-bit NF4 base with double quantization; compute runs in bf16.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="bfloat16",
                         bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct",
                                             quantization_config=bnb, device_map="auto")
# LoRA adapters on every attention and MLP projection.
lora = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"], lora_dropout=0.05, bias="none",
                  task_type="CAUSAL_LM")
# Newer TRL releases take the sequence length via SFTConfig instead; adjust to your version.
trainer = SFTTrainer(model=model, train_dataset=ds, peft_config=lora, max_seq_length=2048,
                     args=TrainingArguments(output_dir="qlora-out", per_device_train_batch_size=4,
                                            gradient_accumulation_steps=4, learning_rate=2e-4,
                                            num_train_epochs=3, bf16=True,
                                            optim="paged_adamw_8bit", lr_scheduler_type="cosine"))
trainer.train()
3. Data Preparation: The Step That Actually Determines Quality
Every team underestimates data prep. Fine-tuning quality is bounded by data quality far more than by hyperparameters. The LIMA paper showed that 1,000 carefully curated examples can outperform 50,000 noisy ones on instruction following — this still holds for narrow tasks in 2026. Treat dataset construction as a first-class engineering deliverable.
Data prep checklist
- Define the exact task in one sentence and write 10 gold examples by hand. If you cannot, your task is under-specified.
- Choose the right format. Chat templates differ across models: Llama 3 uses <|begin_of_text|><|start_header_id|>...; Mistral uses [INST]...[/INST]. Get this wrong and the model learns to emit the wrong markers.
- Deduplicate aggressively with MinHash or simple n-gram similarity. Duplicates inflate apparent quality and cause overfitting.
- Length-balance — pad to a max length that covers the 95th percentile of real inputs, not the longest outlier.
- Stratify your eval split by difficulty, category, and any protected attributes (see our ethical AI guide).
- Augment with synthetic data carefully. Using GPT-4o or Claude 3.7 Sonnet to generate examples works well for format and badly for factual content. Always have a human review a sample.
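The deduplication step in the checklist can be approximated without extra dependencies. The sketch below uses word-trigram Jaccard similarity with a greedy filter; the 0.8 threshold and the function names are illustrative assumptions. The pairwise comparison is quadratic, so MinHash is the better fit beyond a few thousand examples.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of word n-grams for a lowercased, whitespace-tokenized string."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(examples: list, threshold: float = 0.8) -> list:
    """Keep an example only if it stays below the threshold against everything kept so far."""
    kept, kept_grams = [], []
    for text in examples:
        grams = ngrams(text)
        if all(jaccard(grams, seen) < threshold for seen in kept_grams):
            kept.append(text)
            kept_grams.append(grams)
    return kept


raw = [
    "Summarize the contract clause in plain Hindi",
    "summarize the contract clause in plain hindi",
    "Extract the invoice total as JSON",
]
print(dedupe(raw))
```

Run the same filter across the train and eval splits together, otherwise near-duplicates that straddle the split will inflate eval scores.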
4. Hyperparameters That Matter
Hyperparameter sweeps are mostly cargo-culted. For QLoRA with 7–13B models, the following defaults work the vast majority of the time:
- LoRA rank r: 16 for simple style tasks, 32 for general SFT, 64 for complex multi-turn or coding. Higher rank costs marginal extra VRAM and rarely hurts.
- LoRA alpha: 2 * r (so 32, 64, 128).
- Target modules: attention + MLP (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). Attention-only LoRA underperforms.
- Learning rate: 2e-4 with cosine schedule and 3% warmup. For full FT, 1e-5 to 5e-5.
- Epochs: 2–3 for instruction tuning on 5–50k examples. More epochs overfit fast on small datasets — watch eval loss.
- Batch size: effective batch 16–64 via gradient accumulation. Smaller batches generalize better with LoRA than you might expect.
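The defaults above fit in a few lines of configuration code. In this sketch the task labels (`style`, `general_sft`, `complex`) and the `QLoRADefaults` class are hypothetical names; the numbers are the ones listed above.

```python
from dataclasses import dataclass

# Rank per task type, taken from the list above; the keys are illustrative labels.
RANK_BY_TASK = {"style": 16, "general_sft": 32, "complex": 64}


@dataclass
class QLoRADefaults:
    rank: int
    alpha: int
    learning_rate: float = 2e-4
    warmup_ratio: float = 0.03
    epochs: int = 3


def defaults_for(task_type: str) -> QLoRADefaults:
    """Return starting hyperparameters with alpha fixed at 2 * r."""
    rank = RANK_BY_TASK[task_type]
    return QLoRADefaults(rank=rank, alpha=2 * rank)


def effective_batch(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


print(defaults_for("general_sft"))
print(effective_batch(per_device=4, grad_accum=4))  # 4 * 4 * 1 = 16
```

Treat these as a starting point and change one value at a time while watching eval loss.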
5. Alignment: SFT → DPO (mostly skip RLHF in 2026)
RLHF (Reinforcement Learning from Human Feedback) was the headline alignment technique through 2023. In 2026 most production teams use DPO (Direct Preference Optimization, Rafailov et al., 2023) or its descendants ORPO and KTO. DPO trains directly on preference pairs (chosen, rejected) with a closed-form contrastive loss: no separate reward model, no PPO instability, no GPU-hungry rollout loop. Quality is comparable to PPO-RLHF for the vast majority of tasks at a fraction of the engineering cost.
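To make the closed-form loss concrete, here is a minimal sketch of the DPO objective for a single preference pair. It assumes you already have the summed log-probability of each response under the policy and under the frozen reference model; the function name and `beta=0.1` are illustrative choices, and in practice a library trainer such as TRL's `DPOTrainer` computes this over batches for you.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more (or less) likely each response became relative to the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed as softplus(-x) for numerical stability.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has moved toward the chosen response: loss falls below log(2).
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss only depends on how far the policy has moved relative to the reference, which is why no separate reward model is needed.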
SFT → DPO is the standard alignment pipeline. First, supervised fine-tune on (prompt, ideal_response) pairs to teach format and base behavior. Then run DPO on (prompt, chosen, rejected) pairs to teach preference. Sample the rejected responses from the SFT model itself, and build 1–5k preference pairs either by hand or with an LLM-as-judge (Claude or GPT-4o). Cost on top of SFT: another USD 4–8 of GPU for an 8B model. Quality lift: typically 5–15 Elo points on internal preference evals.
Use RLHF only if you have (a) a strong existing reward model, (b) a use case that requires careful credit assignment over long rollouts (e.g., agentic), and (c) the engineering capacity for PPO debugging. For 95% of fine-tuning projects, DPO is enough.
6. Evaluation: Benchmarks + Domain Evals + Vibes
You must evaluate your fine-tune on three things, in order: standard benchmarks (to detect catastrophic forgetting), your domain eval set (to confirm the fine-tune worked), and user feedback in production (to confirm nothing weird is happening).
Standard benchmarks to run before and after:
- MMLU (5-shot) — general knowledge. Forgetting shows up here first.
- GSM8K (8-shot, CoT) — grade-school math reasoning. Fine-tuning often degrades this.
- HumanEval (pass@1) — code generation. Only relevant if your fine-tune touches code.
- HellaSwag, ARC, TruthfulQA — broader generalization.
- Domain eval: 100–500 hand-written prompts with ideal answers, scored by LLM-as-judge or human.
Use the lm-evaluation-harness (EleutherAI) for the standard suite. Budget ~USD 3–8 to run the full suite on one model. If MMLU drops more than 2 points or GSM8K drops more than 3 points, your fine-tune is forgetting too much: reduce the learning rate, train for fewer epochs, or mix general instruction data into your training set.
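The before/after thresholds can be wired into a release gate. The sketch below assumes you have already collected benchmark scores into plain dictionaries (for example from lm-evaluation-harness output); the function name and the scores shown are illustrative, not measured results.

```python
# Maximum tolerated drop in points, per the thresholds above.
MAX_DROP = {"mmlu": 2.0, "gsm8k": 3.0}


def forgetting_report(before: dict, after: dict, max_drop: dict = MAX_DROP) -> dict:
    """Return {benchmark: drop} for every benchmark whose score fell past its threshold."""
    flagged = {}
    for name, limit in max_drop.items():
        drop = before[name] - after[name]
        if drop > limit:
            flagged[name] = round(drop, 2)
    return flagged


# Illustrative scores only.
baseline = {"mmlu": 66.0, "gsm8k": 78.0}
tuned = {"mmlu": 65.1, "gsm8k": 73.5}
print(forgetting_report(baseline, tuned))
```

An empty result means the fine-tune stayed within tolerance; anything else should block the deploy until the training mix or learning rate is adjusted.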
7. Deployment with vLLM and Hot-Swap Adapters
The deployment side of fine-tuning has gotten much better. With vLLM 0.7+ you can serve a base model and dynamically load LoRA adapters per request — meaning a single GPU can serve dozens of fine-tuned variants by hot-swapping 30 MB adapters. For high-throughput on an A100 80GB: Llama 3.1 8B with 4 LoRAs hot-swappable, 3000+ tokens/sec aggregate, p50 TTFT 180 ms.
# vLLM with LoRA hot-swap
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora --max-lora-rank 64 \
  --lora-modules legal-lora=/models/legal-lora hr-lora=/models/hr-lora \
  --max-loras 4 --gpu-memory-utilization 0.92
# Request hits the right adapter via the "model" field
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"legal-lora","prompt":"..."}'
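The same request can be issued from application code. This is a minimal standard-library sketch, assuming the server above is listening on localhost:8000; the helper names and the `max_tokens` default are illustrative.

```python
import json
import urllib.request


def build_completion_request(adapter: str, prompt: str,
                             base_url: str = "http://localhost:8000",
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST to /v1/completions; the "model" field selects the adapter."""
    body = json.dumps({"model": adapter, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(adapter: str, prompt: str) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_completion_request(adapter, prompt)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Calling `complete("legal-lora", "...")` and `complete("hr-lora", "...")` against the same server routes each request to its own adapter without reloading the base model.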
For Indian deployments with DPDP residency requirements, this stack runs cleanly on Yotta Mumbai, CtrlS, or AWS Mumbai A100/H100 instances. Self-hosting an 8B QLoRA fine-tune amortizes to roughly USD 0.0006 per request at 100 QPS — about 20x cheaper than equivalent OpenAI API calls.
8. Indian-Language Fine-Tuning: A Note on Indic Models
For Hindi, Marathi, Gujarati, Tamil, Telugu, Bengali, and other Indic-language tasks, the right base model in 2026 is not always a Western-trained LLM. Sarvam-1, Krutrim-2, Pragna-1B, Airavata, OpenHathi, Tamil Llama, and Gemma-3-Sahyadri are open-weights models pre-trained or continued-pre-trained on Indic corpora and substantially outperform vanilla Llama 3 on Indic tasks (especially below the 8B parameter scale). For a Mumbai-area customer-support transcript summarization task in Marathi we benchmarked Llama 3.1 8B (BLEU 22.4 after QLoRA SFT) vs Sarvam-1 7B (BLEU 31.8 after the same SFT recipe) — a substantial win for the Indic-pretrained base. The trade-off is a smaller ecosystem and fewer eval harnesses; you will spend more time building your own eval and serving pipeline.
Tokenization matters disproportionately for Indic languages. Llama's tokenizer fragments Devanagari into 2-3 tokens per character; Sarvam and Krutrim ship Indic-aware tokenizers that emit roughly one token per Devanagari character, which is a 2-3x throughput and cost win on inference. For latency-sensitive Indic deployments this alone is reason enough to choose an Indic-native base.
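Tokens per character is easy to measure on your own text before committing to a base model. The sketch below takes any tokenizer as a callable; the two stand-in tokenizers are toys for illustration only, and the measure counts Unicode code points, which is an assumption, since a user-perceived Devanagari character can span several code points.

```python
from typing import Callable, List


def tokens_per_char(tokenize: Callable[[str], List], text: str) -> float:
    """Average tokens per non-space code point; lower means cheaper inference."""
    chars = sum(1 for ch in text if not ch.isspace())
    return len(tokenize(text)) / chars if chars else 0.0


# Stand-in tokenizers for illustration only. Swap in a real one, for example
# lambda s: hf_tokenizer.encode(s, add_special_tokens=False).
def byte_level(s: str) -> List[int]:
    return list(s.replace(" ", "").encode("utf-8"))   # 3 bytes per Devanagari code point


def char_level(s: str) -> List[str]:
    return [ch for ch in s if not ch.isspace()]


sample = "नमस्ते दुनिया"
print(tokens_per_char(byte_level, sample), tokens_per_char(char_level, sample))
```

Run it over a few hundred lines of production text for each candidate model; the ratio between the two results approximates the difference in inference cost per request.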
9. Continual Learning, Catastrophic Forgetting, and Model Merging
If you fine-tune the same base for multiple tasks (legal-summarization adapter, customer-support adapter, code-review adapter), you have two architectural choices: separate adapters served via hot-swap (recommended) or model merging.
Model merging — combining multiple LoRAs or full fine-tunes into one — is a 2024-2025 development that has matured into a viable pattern. mergekit supports linear, SLERP, DARE, TIES, and Model Soup methods. Merging works best for similar tasks and breaks for very different tasks; benchmark before deploying.
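The simplest of these methods, linear merging, is a weighted average of parameters. The sketch below shows the idea on plain Python lists; it is not mergekit's API, and it assumes the inputs are effective weight deltas (or full weights) with identical keys and shapes, because averaging the LoRA A and B factors separately does not give the same result.

```python
def linear_merge(adapters: list, weights: list) -> dict:
    """Weighted average of parameter dicts that share the same keys and shapes."""
    if not adapters or len(adapters) != len(weights):
        raise ValueError("need one weight per adapter")
    total = sum(weights)
    merged = {}
    for key in adapters[0]:
        # Pair up the i-th value of this parameter across all adapters.
        columns = zip(*(adapter[key] for adapter in adapters))
        merged[key] = [sum(w * v for w, v in zip(weights, col)) / total for col in columns]
    return merged


legal = {"layer0.delta": [1.0, 2.0]}
support = {"layer0.delta": [3.0, 4.0]}
print(linear_merge([legal, support], [1, 1]))
```

SLERP, TIES, and DARE refine this by interpolating on a sphere or by pruning and sign-resolving conflicting deltas before averaging, which is why they tend to hold up better when the tasks differ.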
Catastrophic forgetting happens when fine-tuning erases base-model capabilities. Mitigations: (1) include a small fraction (5-15%) of general instruction data (e.g., OpenHermes-2.5 sample) in your training mix, (2) use lower learning rates, (3) check MMLU/GSM8K before and after, (4) consider LoRA over full FT (less likely to forget). For long-running iterative fine-tunes (re-fine-tune on a new batch every quarter), keep the previous adapter as a fallback and run side-by-side evals.
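Mitigation (1) comes down to sampling the right number of general examples. A minimal sketch, assuming both datasets are plain Python lists; the function name and the 10% default are illustrative.

```python
import random


def mix_in_general(domain: list, general: list, general_fraction: float = 0.1,
                   seed: int = 0) -> list:
    """Return a shuffled mix in which general examples make up general_fraction of the total."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (len(domain) + n_general) = general_fraction.
    n_general = round(len(domain) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, n_general)
    rng.shuffle(mixed)
    return mixed


domain_data = [f"domain-{i}" for i in range(900)]
general_data = [f"general-{i}" for i in range(5000)]
train = mix_in_general(domain_data, general_data, general_fraction=0.1)
print(len(train), sum(x.startswith("general-") for x in train))
```

Fixing the seed keeps the mix reproducible across the quarterly re-fine-tunes, so side-by-side evals compare models rather than sampling noise.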
How We Apply This at hjLabs.in
At hjLabs.in we fine-tune for narrow, valuable tasks where prompt engineering and RAG have plateaued. Recent engagements: a Gujarat pharma client needed structured JSON extraction from regulatory PDFs with 99%+ schema adherence — we QLoRA-tuned Llama 3.1 8B on 3,200 hand-labeled examples in 5 hours on a Modal A100, landed at 99.4% valid JSON and shipped the 32 MB adapter to their on-prem vLLM. A Mumbai fintech needed Hindi-Marathi summarization of customer call transcripts with strict tone constraints — we used a two-stage SFT → DPO pipeline on Llama 3.1 8B with 6,800 SFT pairs and 1,200 preference pairs, total compute cost USD 38, deployed on Yotta Mumbai for DPDP compliance.
The pattern is consistent: scoping, data prep, and evaluation eat 70% of the engagement; the actual training run is hours. We always benchmark on MMLU and GSM8K before and after; twice in the last twelve months we caught catastrophic forgetting on a client's fine-tune that they would have shipped blind. Agentic AI systems built on fine-tuned tool-use models are a growing line of work; see also our deep-dives on MLOps for serving and monitoring patterns.
Common Pitfalls
- Fine-tuning to inject knowledge. Fine-tuning teaches behaviors, not facts. For private knowledge, use RAG.
- Wrong chat template. Train with the wrong template and your model fails silently at inference time. Always print 5 training examples to verify formatting.
- No eval before and after. You cannot diagnose forgetting without baselines.
- Too many epochs. On small datasets, 5+ epochs almost always overfits. Watch eval loss, not train loss.
- Adapter rank too low. Rank 4 saves no real memory and hurts quality. r=32 is a fine default.
- Training in fp16 on bfloat16-supporting hardware. Use bf16 on A100/H100 — fp16 causes loss-scale headaches.
- Skipping DPO. SFT-only models often produce technically correct but stylistically wrong outputs. DPO is cheap; add it.
- Serving in HuggingFace transformers. Production serving belongs on vLLM, TGI, or Triton. Plain transformers leaves 5–10x throughput on the table.
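The chat-template pitfall is cheap to guard against. The sketch below renders messages in the Llama 3 layout by hand and reports missing markers; it approximates the official template for inspection purposes (the real one may add a default system block), so for actual training data prefer the tokenizer's own `apply_chat_template`. The function names are illustrative.

```python
LLAMA3_MARKERS = ("<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>")


def format_llama3(messages: list) -> str:
    """Render [{"role": ..., "content": ...}] messages in the Llama 3 chat layout."""
    text = "<|begin_of_text|>"
    for msg in messages:
        text += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n{msg['content']}<|eot_id|>"
    return text


def check_template(rendered: str) -> list:
    """Return the expected markers that are missing from a rendered example."""
    return [marker for marker in LLAMA3_MARKERS if marker not in rendered]


examples = [
    format_llama3([{"role": "user", "content": "Summarize this clause."},
                   {"role": "assistant", "content": "The clause limits liability."}]),
    "[INST] Summarize this clause. [/INST] The clause limits liability.",  # Mistral-style
]
for rendered in examples[:5]:
    print(repr(rendered[:60]), "missing:", check_template(rendered))
```

Running a check like this over the first few rows of the formatted dataset catches a Mistral-style template fed to a Llama base before any GPU time is spent.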
10. Synthetic Data, Distillation, and Self-Improving Pipelines
Three advanced techniques have become standard practice in 2026 for teams who cannot or do not want to hand-label thousands of examples:
- Synthetic data from a strong teacher: use GPT-4o or Claude 3.7 Sonnet to generate (prompt, response) pairs for your task, then SFT a smaller open-weights model on them. Works great for format and style, less well for facts (the teacher's hallucinations propagate). Budget USD 50-300 to generate 10k high-quality examples.
- Distillation: train a small model to mimic a large model's outputs (not just the final answer but logits or full traces). DistilBERT-style techniques apply to LLMs via tools like distilkit. A distilled Llama 3.1 8B from Llama 3.1 70B routinely captures 90-95% of teacher quality at 10% of the inference cost.
- Self-improving pipelines (STaR / RFT): have the model generate multiple answers, keep the correct ones, retrain on them. Reinforcement Fine-Tuning (RFT), popularized by OpenAI in late 2024, automates this loop with verifiable rewards. Works exceptionally well for math, code, and other tasks with checkable answers.
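The generate, filter, retrain loop behind the self-improving approach can be sketched in a few lines. Here `generate` and `is_correct` are hypothetical callables standing in for your model's sampling call and your task's verifier (a unit test, an exact-match check, a schema validator); only the filtering step is shown, not the retraining.

```python
from typing import Callable, List, Tuple


def collect_verified(prompts: List[str],
                     generate: Callable[[str], str],
                     is_correct: Callable[[str, str], bool],
                     samples_per_prompt: int = 4) -> List[Tuple[str, str]]:
    """Sample several answers per prompt and keep the distinct ones that pass the verifier."""
    kept = []
    for prompt in prompts:
        seen = set()
        for _ in range(samples_per_prompt):
            answer = generate(prompt)
            if answer not in seen and is_correct(prompt, answer):
                kept.append((prompt, answer))
            seen.add(answer)
    return kept


# Toy stand-ins: a fixed stream of "model" answers and an exact-match verifier.
fake_answers = iter(["4", "5", "4", "4"])
pairs = collect_verified(["What is 2 + 2?"],
                         generate=lambda prompt: next(fake_answers),
                         is_correct=lambda prompt, answer: answer == "4")
print(pairs)
```

The kept pairs become the next round's SFT data; the quality of the verifier bounds the quality of the loop, which is why the approach works best where answers are mechanically checkable.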
A practical pipeline we have used: teacher-generated SFT data → DPO on hand-curated preference pairs → optional RFT on a verifiable subset. Total compute under USD 60 for an 8B model. Quality competitive with much more expensive approaches.
11. Open-Source Tooling Stack We Recommend in 2026
The tooling has consolidated. Our default stack:
- Data prep: pandas, datasets (HuggingFace), Argilla for label review, LlamaIndex for synthetic data generation.
- Training: trl (HuggingFace) for SFT/DPO/ORPO, peft for LoRA/QLoRA, axolotl for opinionated end-to-end YAML configs, unsloth for 2-5x faster QLoRA on consumer GPUs.
- Evaluation: lm-evaluation-harness for standard benchmarks, promptfoo for prompt regression, custom LLM-as-judge harness for domain evals.
- Tracking: MLflow or Weights & Biases for runs and registry.
- Serving: vLLM 0.7+ for throughput-optimized inference with LoRA hot-swap, TGI (Text Generation Inference) as alternative, Triton Inference Server for multi-model orchestration on bare metal.
- Quantization: llama.cpp (GGUF) for CPU/laptop inference, AutoAWQ or GPTQ for 4-bit GPU inference at scale.
- Cloud: Modal or RunPod for ad-hoc A100/H100, Lambda Labs for longer reservations, AWS Mumbai p5 for production Indian deployments under DPDP.
This stack covers 95% of our client engagements with no proprietary lock-in.
FAQ: Fine-Tuning Questions We Field Constantly
How many examples do I really need? For style/format tasks: 500-2,000 high-quality. For classification: 2,000-10,000 balanced. For complex multi-turn behaviors: 10,000-50,000. Quality matters more than quantity past 1,000.
Which base model for a generic English task? Llama 3.1 8B Instruct (best general-purpose 8B), Mistral 7B v0.3 (slightly faster, slightly weaker), Qwen 2.5 7B (strong on reasoning and code), Gemma 2 9B (Google's open release). For 70B-class, Llama 3.1 70B Instruct or Qwen 2.5 72B.
Should I do continued pre-training before SFT? Only if you have hundreds of millions of tokens of in-domain unlabeled text AND the domain is genuinely far from the base model's distribution (e.g., niche scientific literature, low-resource language). For most enterprise tasks, skip it.
How do I avoid catastrophic forgetting? Mix 5-15% general instruction data into training, use lower learning rates (5e-5 to 1e-4 for SFT), prefer LoRA over full FT, and benchmark on MMLU/GSM8K before and after every iteration.
Can I fine-tune to inject new factual knowledge? Poorly. Fine-tuning teaches behaviors and formats reliably; it teaches facts unreliably and risks hallucination. For facts, use RAG.
How do I serve multiple LoRA fine-tunes economically? vLLM with --enable-lora hot-swaps adapters on a single base model load. Dozens of fine-tunes on one A100 with negligible per-adapter overhead.
Is GPT-4 / Claude / Gemini fine-tuning worth it vs open-source? Sometimes — for tasks needing maximum quality and where the cost per request is acceptable. Open-source fine-tuning wins on data sovereignty, cost-at-scale, and customization depth. Most of our BFSI clients land on open-source self-hosted for DPDP reasons.
Conclusion: A Disciplined, Cheap Capability
LLM fine-tuning in 2026 is a disciplined, cheap capability when you wield it for the right problems. QLoRA on an 8B base, a small high-quality dataset, SFT then DPO, evals on MMLU/GSM8K/your domain set, and vLLM serving — that pipeline gets a competent fine-tune to production for under USD 50 in compute and one week of engineering. Treat it as a precision tool, not a magic upgrade.
At hjLabs.in we have shipped this pipeline for clients in pharma, fintech, BFSI, and SaaS. If you have a narrow task that prompt engineering and RAG have not cracked, we can help you decide if fine-tuning is the right answer — and if it is, build it.
Further reading and related work at hjLabs.in
- AI & ML Services at hjLabs.in — overview
- Agentic AI Development Services
- Computer Vision Services
- AIML Pricing & Engagement Models
- Advanced RAG Systems 2026 (hybrid retrieval, reranking)
- The Future of Agentic AI (tool use, MCP)
- MLOps Production Lessons (serving, monitoring, drift)