LLM Fine-Tuning Services — LoRA, QLoRA, RLHF on Llama & Mistral

Fine-tuning is now the most reliable lever to turn a generic chatbot into a product that knows your business. Across 40+ engagements at hjLabs.in we have fine-tuned models for fintechs in Pune, hospitals in Singapore, law firms in London and manufacturers in Ahmedabad. The pattern is consistent: an off-the-shelf Llama 3 or Mistral runs at 55-65% accuracy on a domain eval set; after one careful SFT + LoRA pass, the same model hits 88-95% on the same eval, while costing 8-12× less per token than GPT-4o to serve. This page covers the methods we use, the models we work with, India 2026 pricing in rupees, and the four-week timeline from intake to production. Skip to pricing, or read on for the technical detail. We deliver fine-tuning as fixed-scope engagements (not hourly billing) and primarily serve clients in the US, UK, Canada, Australia, Singapore, UAE and Saudi Arabia who want senior India-based engineering at 30-40% the cost of equivalent firms in SF or London.

When you should fine-tune (and when you shouldn't)

A blunt decision matrix before you spend budget on training

Roughly 40% of our first calls end with us telling the client to not fine-tune yet — their problem is knowledge retrieval, not skill acquisition. The four most common use cases sort like this:

Use case	Recommended approach	Why
Brand voice consistency	Fine-tune (SFT + LoRA)	Style lives in weights, not in a vector DB. 1-2k style examples usually enough.
Real-time knowledge retrieval (pricing, news, today's policy)	RAG (do not fine-tune)	Fresh data changes daily; fine-tunes freeze knowledge. Use our RAG service.
Closed-domain QA with 10k+ documents	RAG + fine-tune hybrid	RAG fetches the right docs; fine-tune teaches the domain language and reasoning style.
Coding / SQL assistant for your stack	Fine-tune (Code-Llama or Qwen 2.5 Coder)	Syntax is statistical — best learnt in weights. RAG only helps with rare APIs.

If your use case sits in the RAG row, read our fine-tuning best practices guide and RAG systems page first. The cheapest fine-tune is the one you didn't run.

5 fine-tuning methods we offer

Each with India 2026 cost in rupees, training time, and hardware needed

1. SFT — Supervised Fine-Tuning

SFT is the foundation of almost every fine-tuning project. You give the model thousands of {prompt, ideal_response} pairs and update the weights so it learns to produce the ideal response. Use SFT when the model needs to learn a new skill, format, or style — "answer in our compliance team's voice", "always return this JSON schema", or "generate SQL for our Postgres schema". When to use: when you have 1k-50k clean labelled examples and a stable target. Cost: USD 10,000-20,000 for 8B-class (Llama 3 8B, Mistral 7B), USD 25,000-45,000 for 70B. Training time: 4-12 h on 1× A100 80GB for 8B; 18-36 h on 4× A100 for 70B. Hardware: 1× A100/H100 for 8B; 4-8× A100 80GB for 70B full-parameter SFT. We use HuggingFace trl + DeepSpeed ZeRO-3, and Axolotl for config-driven runs.

filter_alt

2. LoRA (Low-Rank Adaptation)

LoRA freezes the base model and trains tiny adapter matrices alongside each attention layer — typically 0.1-1% of total parameters. Near-SFT quality at a fraction of the compute, and you can swap adapters at inference to serve multiple variants from one base. When to use: 80% of our projects — fast iteration, multi-tenant adapters, or when full SFT on a 70B model is too expensive. Rank choice: r=8-16 for English style/format tasks; r=32-64 for domain reasoning. Cost: USD 7,500-15,000 for 8B, USD 20,000-35,000 for 70B. Training time: 2-6 h on 1× A100 for 8B; 8-16 h on 2× A100 for 70B. Hardware: 1× A100 40GB for 8B LoRA; 2× A100 80GB for 70B LoRA.

memory

3. QLoRA (Quantized LoRA)

QLoRA is LoRA plus 4-bit quantization of the frozen base. VRAM drops by another 4× — a 70B model that needs 4× A100 80GB for vanilla LoRA fits on a single A100 80GB with QLoRA. Quality loss vs full LoRA is typically under 1% on MMLU and under 2% on domain evals. When to use: any time GPU cost is the bottleneck, or you want to fine-tune at home (Llama 3 8B QLoRA fits on a single RTX 4090 24GB). Cost: our standard run is USD 1,200 for Llama 3 8B on 1× A100 for 6 hours end-to-end (data prep + training + eval + adapter merge). 70B QLoRA runs from USD 8,000-15,000. Training time: 4-8 h for 8B; 12-24 h for 70B. Hardware: 1× A100 40GB (or RTX 4090) for 8B; 1× A100 80GB or 1× H100 for 70B. Standard stack: bitsandbytes + peft + trl.

thumbs_up_down

4. RLHF & DPO (Alignment Fine-Tuning)

After SFT teaches the model what to say, RLHF or DPO teaches it which response is better. You give pairs of responses labelled "chosen" and "rejected", and it learns to prefer the chosen pattern. We almost always use DPO (Direct Preference Optimization) over classical RLHF-with-PPO — DPO is 4× cheaper, more stable, and needs no reward model. When to use: polish tone, reduce hallucination, enforce safety guardrails. Run DPO after SFT, not instead of. Cost: USD 15,000-25,000 on top of SFT/LoRA. Training time: 6-12 h on 1-2× A100 for 8B; 18-30 h on 4× A100 for 70B. Hardware: ~2× the VRAM of SFT (DPO holds policy and reference copies). Data needs: 1k-3k high-quality preference pairs.

layers

5. Continued Pre-Training (CPT)

CPT is the heaviest hammer. Instead of instruction-response pairs, you train on raw domain text (unsupervised, next-token prediction) for billions of tokens — same loss the model was originally pre-trained with. The result: a base that has internalized your domain's vocabulary, syntax and reasoning rhythms. When to use: highly domain-specific applications where SFT plateaus — medical literature, legal contracts, financial filings, niche source code. Data needs are large: 500M-10B tokens of clean domain text. Cost: USD 45,000-95,000. Training time: 3-10 days on 4-8× H100. Hardware: 8× H100 80GB minimum for 7B-class CPT. After CPT, run a quick SFT pass (USD 10,000) to bring back instruction-following.

Models we work with

Open-weights base models we have shipped fine-tunes on, with benchmark gains we have delivered

Llama 3 / 3.1 / 3.2 (8B, 70B)

Our default for English-only commercial deployments. Llama 3.1 8B is the workhorse — strong instruction-following, 128k context, excellent community tooling (vLLM, Unsloth, Axolotl, llama.cpp). Typical gains on Llama 3 8B fine-tunes: +27 points on a SaaS support-quality eval, +34 points on a banking-compliance Q&A set, +41 points on contract-clause extraction. Llama 3.2 1B/3B are excellent for on-device inference after 4-bit quantization.

Mistral 7B / Mistral Nemo / Mixtral 8x7B

Mistral 7B v0.3 — slightly weaker on raw MMLU than Llama 3 but more permissive Apache 2.0 licence, better European coverage, excellent function-calling baseline. Mixtral 8x7B (MoE) gives GPT-3.5-quality reasoning on a 2× A100 budget. Typical gains: +18 points on French-language legal QA, +29 points on a multi-step finance reasoning task. Mistral Nemo (12B) is our 2026 pick for European clients needing OSI-approved licensing.

Phi-3 / Phi-4

Microsoft's Phi series wins when you need a tiny model (3-14B) with disproportionately strong reasoning. Phi-3.5 Mini (3.8B) consistently beats Llama 3 8B on math and code despite being half the size. Use for edge deployment and cost-sensitive serving. Typical gains: +22 points on a Pune startup's SQL-generation eval after 4 hours of QLoRA.

Qwen 2.5 / Qwen 2.5-Coder

Alibaba's Qwen 2.5 is our top pick for coding (Qwen 2.5-Coder 32B is closer to Claude 3.5 Sonnet than GPT-4o-mini on HumanEval after fine-tuning), and has the best Chinese, Japanese and Korean coverage of any open-weights model. For polyglot enterprises serving APAC + India simultaneously, Qwen 2.5 is the path of least resistance.

Indic models — Sarvam-1, Krutrim, Airavata

For Hindi, Gujarati, Tamil, Marathi, Bengali, Kannada and Telugu we start from Indic-native bases. Sarvam-1 (2B) by Sarvam AI is our default — Apache 2.0, 4T-token training, strong Indic representation, fast on consumer GPUs. Krutrim Spectre v2 (7B) from Ola is heavier but stronger on conversational Hindi. Airavata 7B (AI4Bharat) is Llama-2 7B continued pre-trained on 22 Indic languages — great for code-switching. We've shipped Indic fine-tunes for an Ahmedabad fintech (Gujarati support, +38 points) and a Chennai EdTech (Tamil math tutor, +44 points on GSM8K-Tamil).

How we measure success

The benchmarks and eval methodology every project includes

Every project ships with three layers of evaluation, all reproducible by your team after handoff:

Public reasoning benchmarks — MMLU (general knowledge), GSM8K (math), HumanEval (Python code), MT-Bench (multi-turn quality, GPT-4 judged), HellaSwag (commonsense). These confirm the fine-tune did not regress baseline ability.
Custom domain eval set — 200-500 hand-labelled examples co-designed with your subject-matter experts in week 1. Graded with exact-match / ROUGE / BLEU for structured outputs and LLM-as-judge (GPT-4o or Claude 3.5 Sonnet) for open-ended quality, plus a 50-sample human spot-check.
Production A/B test — Growth-tier and above: 10-50% traffic split against the base model for 1-2 weeks. We instrument completion rate, thumbs-up rate, support escalation rate and token cost before recommending full rollout.

All eval code, gold sets and model weights are handed over at project end.

4-week implementation timeline

What happens, week by week, on a standard LoRA / QLoRA engagement

WEEK 1

Data audit + eval set

Two workshops with your team: scope the use case, pick candidate base models, agree on a 200-500 example gold eval set, audit data sources. Deliverable: one-page training plan and a reproducible eval harness.

WEEK 2

Data prep + training run 1

Clean, dedup and format the corpus (ShareGPT, Alpaca, or custom JSONL). Run the first LoRA/QLoRA pass on the chosen base. Eval against the gold set. By end of week 2 you have a v1 checkpoint with measured gains over baseline.

WEEK 3

Eval + iteration

Failure-mode analysis on v1. 2-3 iteration runs: tune rank, learning rate, data mixture; add DPO if subjective quality is the bottleneck. By end of week 3 you have a v3 that beats both base and v1 on the gold eval.

WEEK 4

Deployment + handoff

Merge adapter into base, quantize to 4-bit/8-bit, deploy on vLLM or TGI in your VPC. Hand over Helm chart, Terraform, runbook, training scripts and gold eval set. Train your MLOps team on rolling updates.

CASE SKETCH

Enterprise RAG + Fine-Tuning Success Story

Anonymized to protect client confidentiality. A leading B2B financial services enterprise (~600 employees, USD 2.5 Billion in transactions) approached us in late 2025: their support operations were processing 8,000+ complex customer queries per day across digital channels. A standard GPT-4o-mini implementation with basic RAG was their initial solution, but it only achieved 71% accuracy, cost USD 5,200 per month in token consumption, and routinely hallucinated payment calculations, regulatory rules, and service requirements.

We delivered a hybrid solution: a high-fidelity RAG pipeline over their 4,200-document policy store for real-time facts, plus a custom Llama 3.1 8B SFT + LoRA fine-tune trained on 14,000 historical {query, ideal_response} pairs from compliance-approved historical logs. Dataset preparation took 9 days, featuring automated data deduplication, PII redaction, and chain-of-thought rationales for complicated financial calculations. Training was executed in a single 7-hour QLoRA run on a dedicated Cloud GPU cluster (under USD 60 GPU cost).

Results: Production accuracy lifted from 71% to 93% (fine-tuned Llama 3 8B + RAG). Financial calculations hallucination rates dropped from 7% to under 0.5%. Serving costs fell from USD 5,200/month to USD 800/month (via vLLM on a single L40S GPU). The system was fully live in 5 weeks. Total project fee was USD 14,500 fixed-scope, achieving full payback in less than 3 months.

Pricing — fine-tuning packages

Transparent, fixed-scope engagements. No hourly billing.

Three tiers, transparent USD pricing. Every tier includes data prep, training, eval, model weights handover, and 30-day post-deployment support. Multi-language and continued pre-training add-ons priced separately.

Starter

USD 10,000

8B model SFT on <10k examples

Llama 3 8B / Mistral 7B / Phi-3
Up to 10,000 training examples
200-example gold eval set
HuggingFace model deployment
30-day support

Get Started

Growth

⭐ Most Popular

USD 25,000 – USD 45,000

LoRA on 70B + custom eval + 4-week support

Llama 3 70B / Mixtral 8x7B
LoRA + DPO alignment
Custom 500-example eval suite
MT-Bench + production A/B test
VPC / on-prem deployment

Get Started

Enterprise

USD 75,000+

Custom architecture, distillation, on-prem

Continued pre-training (CPT)
Knowledge distillation 70B → 8B
Indic / multi-language support
Air-gapped on-prem deployment
6-month support + retraining

Get Started

Common questions

Answers to what every prospective client asks us

How much data do I need to fine-tune effectively?

For LoRA / QLoRA, 500-1000 high-quality instruction-response pairs is usually enough; 2k-5k for brand voice; 10k-50k for technical domain QA. Continued pre-training needs 100M-10B tokens of raw domain text. We recommend starting by curating a 200-example gold eval set first, then scaling training data from there. Data quality beats quantity — 2,000 clean examples outperform 50,000 noisy ones.

Can you fine-tune on private data without sending it to cloud?

Yes. We routinely train inside the client VPC on AWS, Azure or GCP, or on dedicated bare-metal GPUs (RunPod, Lambda, CoreWeave) with data isolation. We can ship a containerized pipeline that runs on your hardware — data never leaves your network. For regulated sectors (healthcare, banking, defence) we also support air-gapped training on hjLabs-supplied portable A100 / H100 rigs at your office.

Llama vs Mistral — which model should I start with?

Llama 3.1 8B is our default for English-only — best tooling, broadest instruction-tuned ecosystem. Mistral 7B / Nemo are stronger on European languages with a more permissive Apache 2.0 licence. For Indic, start with Sarvam-1 or Krutrim. For code, Qwen 2.5 Coder beats both. We benchmark 2-3 candidates on your eval set in week 1 — choosing the wrong base model is the single biggest waste of fine-tuning budget.

How long does a typical project take?

Standard 4-week timeline: Week 1 data audit + eval set, Week 2 data prep + training run 1, Week 3 eval + iteration, Week 4 deployment + handoff. Simple LoRA on clean data can finish in 2 weeks. RLHF or continued pre-training projects take 6-10 weeks. The bottleneck is almost never compute — it's data cleaning and eval design.

What's the difference between SFT and DPO?

SFT trains the model to imitate a single ideal response per prompt — best for new knowledge, style or format. DPO trains on pairs of responses labelled chosen vs rejected — best for tone, reducing hallucination, and subjective behaviour. We run SFT first, then a small DPO pass on 1-3k preference pairs to polish. DPO has largely replaced RLHF-with-PPO because it's far more stable and needs no reward model.

Can you fine-tune for Hindi, Gujarati or Tamil?

Yes. We start from Sarvam-1 (2B Indic base), Krutrim (7B), or Airavata (Llama-2 7B continued pre-trained on Hindi). Llama 3.1 and Qwen 2.5 also handle Hindi/Devanagari well with LoRA r=64-128. Gujarati and Tamil usually need 1-3B tokens of continued pre-training before SFT — we've shipped both for clients in Ahmedabad and Chennai. Expect 30-40% higher cost vs English fine-tunes because Indic data curation is more expensive.

How do you measure if fine-tuning worked?

Three layers. (1) Public benchmarks — MMLU, GSM8K, HumanEval, MT-Bench — to confirm no regression. (2) Custom domain eval set of 200-500 hand-labelled examples graded with exact-match, ROUGE and LLM-as-judge (GPT-4o), plus 50-sample human spot-check. (3) Production A/B test against the base model for 1-2 weeks. All eval code and gold sets are handed over.

Can you deploy the model in our VPC?

Yes — included in Growth and Enterprise tiers. We support AWS (g5/g6/p4d/p5 + SageMaker), Azure (NC-series + Azure ML), GCP (a2/a3 + Vertex AI), and bare-metal H100 (RunPod, Lambda, CoreWeave). 8B quantized models serve at 80-200 tokens/sec on a single L40S via vLLM/TGI. 70B models need 2-4× A100 80GB or 1× H100. We hand over Helm charts, Terraform, and a runbook for autoscaling, metrics and rolling updates.

About the author

Hemang Joshi

Founder & Principal AI/ML Engineer, hjLabs.in

Hemang has led 40+ LLM fine-tuning engagements since 2022 — from a 3-hour QLoRA on Llama 2 7B for a Bangalore SaaS to a 4-week continued pre-training run on a 22-language Indic corpus. He has shipped production fine-tunes of Llama 3, Mistral, Phi-3, Qwen and Sarvam-1, and writes on the hjLabs technical blog about eval methodology, RLHF vs DPO, and the economics of running open-weights LLMs in enterprise VPCs. Based in Gandhinagar, Gujarat. Read his full bio and engagement history.

Last updated 2026-05-18

"Fantastic AI engineer with pragmatic business and technical skills. Great to work with. An asset to any team."

Andy Curtis CISO, CibrAI — managed Hemang directly View Case Study →

Ready to build your custom LLM?

Get a domain-specific model that outperforms generic LLMs by 40% on your eval set — in 4 weeks, from USD 10,000.

calendar_todaySchedule free consultation

LLM Fine-Tuning Services
LoRA, QLoRA, RLHF on Llama & Mistral

When you should fine-tune (and when you shouldn't)

5 fine-tuning methods we offer

1. SFT — Supervised Fine-Tuning

2. LoRA (Low-Rank Adaptation)

3. QLoRA (Quantized LoRA)

4. RLHF & DPO (Alignment Fine-Tuning)

5. Continued Pre-Training (CPT)

Models we work with

Llama 3 / 3.1 / 3.2 (8B, 70B)

Mistral 7B / Mistral Nemo / Mixtral 8x7B

Phi-3 / Phi-4

Qwen 2.5 / Qwen 2.5-Coder

Indic models — Sarvam-1, Krutrim, Airavata

How we measure success

4-week implementation timeline

Data audit + eval set

Data prep + training run 1

Eval + iteration

Deployment + handoff

Enterprise RAG + Fine-Tuning Success Story

Pricing — fine-tuning packages

Starter

Growth

Enterprise

Common questions

Further reading

About the author

Hemang Joshi

Ready to build your custom LLM?