MLOps Production Lessons 2026 | CI/CD, Drift, Observability, DPDP India

TL;DR

The hardest part of ML is the seven months after launch. Production MLOps in 2026, distilled:

Model CI/CD must test the model, not just the code — golden eval sets that block merges are non-negotiable.
Drift detection needs three layers: feature drift (PSI/KS), prediction drift, and concept drift (delayed-label calibration).
Observability for ML is logs + metrics + traces + prediction logs + ground-truth feedback loop. Most teams have only the first two.
On-prem in India is now competitive with cloud for steady-state ML workloads — DPDP-compliant, often 40–60% cheaper at scale.
The on-call rotation, not the data science team, owns the model in production. Without paging, drift is invisible.

Putting an ML model into production is the easy part. Keeping it useful, accurate, and compliant for three years is the hard part — and the part no MOOC teaches. In 2026 the production-ML stack has matured around a handful of well-understood patterns, but the failure modes have not gone away: silent drift, untested model upgrades that break downstream consumers, prediction-time bugs invisible to feature-level monitoring, on-call rotations that page humans for symptoms but never for causes, and compliance teams who find out about ML systems six months after they shipped.

This guide collects the lessons we have learned at hjLabs.in shipping ML systems for Indian manufacturing, BFSI, and SaaS clients — including some scars from systems that failed silently for weeks before anyone noticed. It covers CI/CD for ML (which is different from CI/CD for code), the three layers of drift detection, the observability stack that actually catches bugs, the deployment patterns that minimize blast radius, the on-prem-vs-cloud calculus under DPDP, and the organizational practices (on-call ownership, model registry hygiene, post-mortems) without which the technical stack is theater.

The same MLOps discipline applies to physical-systems deployment. When a vision model gates a robotic auto-soldering production line or steers a CNC machine, a silent drift event does not just degrade a metric; it scraps inventory. Treat the model registry, drift detector, and rollback path as line-side machinery, not laptops.

"MLOps maturity is not measured by the number of tools deployed. It is measured by how fast the team detects, diagnoses, and rolls back a bad model. In our experience, three days is the threshold between 'good MLOps' and 'we are flying blind'."

1. The Five MLOps Maturity Levels (Honest Version)

Most MLOps maturity ladders are vendor marketing. Here is the honest version we use for client assessments.

Level	Description	Time to detect a bad model	Time to rollback
0 — Notebook in production	One person runs a notebook on a schedule, scp's outputs to a server	Weeks (when a stakeholder complains)	Days (debug + re-run by hand)
1 — Scripted training + manual deploy	Training is reproducible; deploy is a Docker push and PR	1–2 weeks	Hours
2 — CI/CD with eval gates	Pipeline trains, evals against gold set, blocks bad merges	Days (offline drift checks)	Minutes (canary rollback)
3 — Continuous monitoring + drift alerts	Feature/prediction/concept drift alerts, on-call rotation	Hours	Minutes
4 — Closed-loop retraining	Auto-retrain on drift; shadow-eval before promotion; A/B in production	Minutes	Minutes (auto-rollback)

Most production ML systems we audit at clients sit between Level 1 and Level 2. Getting to Level 3 is where MLOps pays for itself; Level 4 is genuine maturity and only worth pursuing for high-value, high-volume systems. Do not skip levels.

2. CI/CD For Models, Not Just Code

Software CI tests code. Model CI tests model behavior. A green pipeline that says "all unit tests pass" tells you nothing about whether your new model regresses on the customer-churn cohort. The minimal model CI we install at every client engagement:

Reproducibility check — pin random seeds, deterministic data ordering, version pinning. Two runs of the same commit must produce identical metrics within tolerance.
Schema validation — input/output schema with great_expectations or pandera. Reject training and inference data that violates types/ranges/categories.
Eval on golden set — 200–500 hand-curated examples with known ideal outputs. The new model must beat or match the production model within statistical noise.
Slice tests — performance on critical sub-populations (e.g., new-customer cohort, Tier-3 city loans, weekend predictions). Drops >3 points block merge.
Fairness gates — demographic-parity ratio, equal-opportunity gap. See our ethical AI guide.
Latency and memory budget tests — p95 inference latency under load, peak GPU/CPU memory. Regressions fail the build.
Model card auto-update — every release re-renders the model card with new numbers.
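The golden-set and slice checks in the list above can be sketched as a single gate function. This is a minimal illustration, not a reference implementation: the `Example` record, the slice names, and the 1-point overall tolerance are assumptions made for the sketch; only the 3-point slice threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Example:
    slice_name: str  # hypothetical slice label, e.g. "new_customer"
    label: int       # known ideal output from the golden set
    prod_pred: int   # prediction from the current production model
    cand_pred: int   # prediction from the candidate model


def accuracy(examples, attr):
    """Share of examples where the chosen model's prediction equals the label."""
    if not examples:
        return 0.0
    hits = sum(1 for e in examples if getattr(e, attr) == e.label)
    return hits / len(examples)


def gate(examples, overall_tolerance=0.01, max_slice_drop=0.03):
    """Return failure messages; an empty list means the merge may proceed."""
    failures = []
    prod = accuracy(examples, "prod_pred")
    cand = accuracy(examples, "cand_pred")
    # overall_tolerance stands in for "statistical noise" (an assumed value).
    if cand < prod - overall_tolerance:
        failures.append(f"overall: {cand:.3f} vs production {prod:.3f}")
    for name in sorted({e.slice_name for e in examples}):
        subset = [e for e in examples if e.slice_name == name]
        drop = accuracy(subset, "prod_pred") - accuracy(subset, "cand_pred")
        if drop > max_slice_drop:
            failures.append(f"slice {name}: dropped {drop * 100:.1f} points")
    return failures
```

In CI, a non-empty return value would fail the job. A real gate would likely use a task-appropriate metric and a proper significance test instead of a fixed tolerance.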

Stack we recommend: GitHub Actions or GitLab CI for orchestration, DVC or LakeFS for data versioning, MLflow or Weights & Biases for experiment tracking and the model registry, BentoML or KServe for serving, and Evidently or WhyLabs for evaluation and drift.

3. Drift Detection: Three Layers

Drift is the slow-motion failure mode that destroys ML systems. The three layers, in increasing order of how long they take to detect:

Feature drift

The distribution of incoming features changes. Detect with Population Stability Index (PSI) or Kolmogorov-Smirnov on each feature, computed daily on a rolling window. PSI > 0.2 is "investigate"; > 0.25 is "page." Watch for sudden spikes (data pipeline bug) and slow drift (genuine population change). Latency: hours.
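For readers who want to see what the PSI number actually measures, here is a dependency-free sketch for one numeric feature. The choice of 10 quantile bins and the `eps` floor for empty bins are common conventions assumed here, not something the text prescribes.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`.

    Cut points are taken from reference quantiles, so each reference bin
    holds roughly the same share of data.
    """
    ref = sorted(reference)
    # Interior cut points at the 1/bins, 2/bins, ... quantiles of the reference.
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Bin index = number of cut points at or below v.
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p_ref, p_cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))
```

Identical samples give a PSI of zero; the further the current window's bin shares move from the reference shares, the larger the value.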

Prediction drift

The distribution of model outputs changes. PSI on predicted probabilities or class distribution. Useful when ground truth is delayed — it tells you something is off before you can measure accuracy. Latency: hours.

Concept drift

The relationship between features and labels changes — your model is now miscalibrated even with the same inputs. Detect by tracking calibration on delayed-label samples (e.g., 30-day loan default actuals vs predictions). This is the most damaging and slowest-to-detect form of drift. Latency: days to months.

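# Targets the legacy Evidently Report API (0.4.x); newer releases restructured it.
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report
# Request PSI explicitly: the default tests report p-values, not PSI.
report = Report(metrics=[DataDriftPreset(stattest="psi", stattest_threshold=0.25)])
report.run(reference_data=df_train_ref, current_data=df_prod_last_24h)
# The preset emits several metrics; pick the one that carries per-column results.
table = next(m["result"] for m in report.as_dict()["metrics"] if "drift_by_columns" in m["result"])
for feature, stats in table["drift_by_columns"].items():
    if stats["drift_score"] > 0.25:
        page_oncall(feature, stats)  # page_oncall: your own alerting hook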

Run drift checks in a scheduled job (Airflow, Prefect, or a Kubernetes CronJob) every 1–6 hours. Pipe alerts to PagerDuty or Slack with the on-call rotation owning the response. Without paging, drift detection is reporting, not operations.
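The concept-drift layer described earlier in this section, calibration on delayed labels, can be tracked with a simple expected-calibration-error style metric. The sketch below assumes binary outcomes (for example, default or no default) and five equal-width probability buckets; both are illustrative choices.

```python
def calibration_gap(predicted_probs, actuals, buckets=5):
    """Weighted mean of |average predicted probability - observed rate| per bucket."""
    if not predicted_probs or len(predicted_probs) != len(actuals):
        raise ValueError("need equally sized, non-empty inputs")
    grouped = [[] for _ in range(buckets)]
    for p, y in zip(predicted_probs, actuals):
        idx = min(int(p * buckets), buckets - 1)  # p == 1.0 falls in the last bucket
        grouped[idx].append((p, y))
    total = len(predicted_probs)
    gap = 0.0
    for rows in grouped:
        if not rows:
            continue
        avg_pred = sum(p for p, _ in rows) / len(rows)
        observed = sum(y for _, y in rows) / len(rows)
        gap += (len(rows) / total) * abs(avg_pred - observed)
    return gap
```

Computed on each cohort once its labels mature (the 30-day default actuals in the example above), a rising gap signals that the feature-to-label relationship has moved even when feature distributions look stable.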

4. Observability: Logs, Metrics, Traces, Predictions, Feedback

Standard observability (logs, metrics, traces) is necessary but not sufficient for ML. You also need prediction logs (every prediction with input features hashed and model version) and a ground-truth feedback loop (a path for actual outcomes to flow back and be joined with predictions for delayed evaluation).

The five-pillar ML observability stack

┌──────────────────────────────────────────────────────────────────┐
│ Logs        — Loki / CloudWatch — exceptions, warnings, info       │
│ Metrics     — Prometheus / Grafana — p50/p95/p99 latency, QPS      │
│ Traces      — OpenTelemetry / Tempo — request path through service │
│ Predictions — Parquet/Iceberg in S3/R2 — input hash, output, ver   │
│ Feedback    — Kafka or batch ETL — actuals joined to predictions   │
└──────────────────────────────────────────────────────────────────┘
       │
       └─→ Drift detection + delayed-label calibration + retraining

Prediction logs are non-negotiable. Without them you cannot debug "why did the model predict X for this customer six weeks ago." Hash PII inputs (HMAC-SHA256 with a stable key) before logging: this keeps raw identifiers out of the logs, which helps with DPDP compliance, while leaving records joinable for debugging. Retention: 90 days hot in object storage, 18 months cold for audit and retraining.
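A minimal sketch of such a prediction-log record, using only the Python standard library. The field names (`pan`, `income`), the record layout, and the helper names are assumptions for illustration; in practice the key would come from a secrets manager and never sit next to the logs.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash: the same input and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def prediction_record(features: dict, pii_fields: set, output,
                      model_version: str, key: bytes) -> str:
    """Build one JSON log line with PII fields replaced by HMAC tokens."""
    safe = {
        name: pseudonymize(str(val), key) if name in pii_fields else val
        for name, val in features.items()
    }
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": safe,
        "output": output,
    }, sort_keys=True)
```

Because the key is stable, the same customer maps to the same token across days, so an engineer holding the key can look up a specific customer's history without the log ever containing the raw identifier.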

5. Deployment Patterns: Shadow, Canary, A/B, Champion-Challenger

Big-bang deploys ruin weekends. Use one of four patterns depending on risk tolerance.

Shadow deployment — new model runs alongside old, predictions logged but not served. Zero user impact, maximum confidence before promotion. Default for the first deploy of any new model class.
Canary rollout — new model serves 1% → 5% → 25% → 100% over hours/days, with auto-rollback on metric regression. Default for model upgrades.
A/B test — old and new each serve 50% of users, statistical test for lift. Use when you need business-metric proof, not just accuracy proof.
Champion-challenger — multiple challengers run in shadow; the one that beats champion on golden metrics for N days gets promoted automatically. Used by mature teams.

Auto-rollback is the feature most teams skip and then regret. Wire latency, error-rate, and at least one business metric (e.g., approval rate for credit, click-through for ranking) to a rollback trigger that flips traffic back to the previous model in under five minutes.
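The canary steps and the rollback trigger can be sketched as one decision function. The 10% regression allowance and the metric names are assumptions; the sketch also assumes every watched metric is expressed so that higher is worse (for example, rejection rate instead of approval rate).

```python
CANARY_STEPS = [1, 5, 25, 100]  # percent of traffic, matching the canary pattern above


def next_traffic_share(current_pct: int, baseline: dict, canary: dict,
                       max_relative_regression: float = 0.10) -> int:
    """Return the next canary traffic percentage, or 0 to roll back.

    `baseline` and `canary` map metric name -> non-negative value where
    higher is worse (p95 latency, error rate, ...).
    """
    for name, base_value in baseline.items():
        allowed = base_value * (1 + max_relative_regression)
        if canary[name] > allowed:
            return 0  # a regression on any watched metric flips traffic back
    later = [s for s in CANARY_STEPS if s > current_pct]
    return later[0] if later else 100
```

A scheduler would call this after each observation window and apply the returned share at the load balancer or service mesh; returning 0 corresponds to sending all traffic back to the previous model.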

6. On-Prem vs Cloud in the DPDP Era

For Indian deployments, the on-prem-vs-cloud decision in 2026 is shaped heavily by the DPDP Act 2023 and sectoral regulators (RBI, SEBI, IRDAI). Three considerations dominate.

Data residency. DPDP gives the central government power to restrict transfers of personal data to specified countries. RBI's payment data localization mandate already requires payment data to be stored in India. For BFSI ML systems, AWS Mumbai, Azure India, GCP Mumbai, Yotta, and CtrlS are the credible cloud options. Training on personal data with US-region GPUs is a cross-border transfer, and a compliance risk most CISOs will not sign off on.

Cost. At steady state, on-prem GPU is dramatically cheaper than rented cloud. An H100 80GB SXM5 server (8 GPUs) lists at roughly USD 280k–320k capex; the equivalent on AWS p5.48xlarge is ~USD 98/hr on-demand, or about USD 70k per month of continuous use. Hardware capex alone is therefore recovered in roughly four to five months; once colocation, power, and operations staff are counted, a break-even of around 8 months is the safer planning figure. For variable training workloads, cloud is still right; for 24/7 inference at scale, on-prem in a Yotta or CtrlS colo wins decisively after year one.
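The break-even arithmetic is worth checking against your own numbers. The sketch below uses the figures quoted in the text (USD 300k as the capex midpoint, USD 98/hr on-demand); the USD 30k monthly on-prem running cost is a purely hypothetical value, included to show how colocation, power, and staffing stretch the payback period.

```python
HOURS_PER_MONTH = 24 * 30  # continuous use, 30-day month


def breakeven_months(capex_usd: float, cloud_usd_per_hour: float,
                     onprem_opex_usd_per_month: float = 0.0) -> float:
    """Months of 24/7 use after which owning is cheaper than renting."""
    cloud_monthly = cloud_usd_per_hour * HOURS_PER_MONTH
    monthly_saving = cloud_monthly - onprem_opex_usd_per_month
    if monthly_saving <= 0:
        return float("inf")  # on-prem never pays back at these rates
    return capex_usd / monthly_saving


# Figures from the text: capex only.
print(round(breakeven_months(300_000, 98), 1))          # prints 4.3
# Hypothetical USD 30k/month for colo, power, and staff (an assumption).
print(round(breakeven_months(300_000, 98, 30_000), 1))  # prints 7.4
```

On list prices alone the hardware pays back in a little over four months; a figure near eight months follows once on-prem running costs or discounted cloud rates enter the calculation.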

Operational maturity. On-prem demands DevOps and DC ops capability most mid-size Indian companies do not have. Hybrid is often the answer: training on AWS Mumbai with auto-scaled spot fleets, inference on-prem in Yotta on dedicated reserved GPUs. We have shipped this exact split for two Mumbai NBFCs.

7. Organizational Practices That Make MLOps Work

The technical stack matters less than the organizational practices around it.

The on-call rotation owns the model. If "the model is broken" pages data scientists who have no on-call training, it does not get fixed. Models are infrastructure; treat them like infrastructure.
Model registry as source of truth. Production model = "the model in the registry tagged production." Period. No "well, Rahul has a different version on his laptop."
Post-mortems for model incidents. Same template as software incidents. Publish them internally. Track action items.
Per-model SLOs. p95 latency, error rate, accuracy on a rolling cohort. Tie to a runbook.
Quarterly model review board. Stakeholders + product + ML + compliance review every production model: still useful? Still fair? Drift trajectory? Cost?
Sunset policy. Models that are no longer used get archived. Live models that have not been evaluated in 12 months get re-evaluated or retired.

8. LLMOps: The New Wrinkle

LLMs in production introduce challenges classical MLOps frameworks were not designed for. Five LLMOps-specific patterns:

Prompt versioning — system prompts are model artifacts. Store them in git, tag releases, evaluate on every change. Treat system_prompt_v17.md the way you treat model_v17.pkl.
Cost observability — every LLM call logs tokens-in, tokens-out, model, USD cost. Aggregate per-feature, per-user, per-tenant. Without this you cannot understand or control spend.
LLM-as-judge evals in CI — for tasks without deterministic outputs, use Claude or GPT-4o as a judge with a structured rubric. Run on every prompt change. Set regression thresholds.
Hallucination monitoring — RAG faithfulness, citation rate, refusal rate. Spikes in any of these are leading indicators of a broken pipeline.
Multi-model fallback — when your primary model rate-limits or errors, fall back to a secondary. Track which path served each request. This is the LLM equivalent of database failover and is now table stakes for serious production LLM apps.
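Cost observability and multi-model fallback from the list above fit naturally into one wrapper. In this sketch each provider is a hypothetical callable wrapping whatever SDK you use; the single blended price per 1k tokens is a simplification, since real pricing usually separates input and output tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CallLog:
    model: str
    tokens_in: int
    tokens_out: int
    usd_cost: float
    served_by_fallback: bool


Provider = Tuple[str, Callable[[str], Tuple[str, int, int]], float]


def call_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, CallLog]:
    """Try providers in order; return the first successful answer plus its log row.

    Each provider is (name, call_fn, usd_per_1k_tokens). `call_fn` is a
    hypothetical wrapper returning (text, tokens_in, tokens_out) and raising
    on rate limits or errors.
    """
    last_error = None
    for position, (name, call_fn, usd_per_1k) in enumerate(providers):
        try:
            text, t_in, t_out = call_fn(prompt)
        except Exception as exc:  # in real code, catch the SDK's specific errors
            last_error = exc
            continue
        cost = (t_in + t_out) / 1000 * usd_per_1k
        return text, CallLog(name, t_in, t_out, cost, served_by_fallback=position > 0)
    raise RuntimeError("all providers failed") from last_error
```

Writing the `CallLog` row to the prediction log gives per-feature and per-tenant spend, and the `served_by_fallback` flag shows how often the primary model is failing over.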

Tools: Langfuse, LangSmith, Helicone, Phoenix (Arize), Braintrust, Weave (W&B) all serve various subsets of these needs. We use Langfuse for self-hosted DPDP-compliant deployments and LangSmith when the client is already in the LangChain ecosystem.

9. Edge ML and Industrial Deployments

For manufacturing, retail, and IoT clients, ML does not always run in the cloud or a central data center; it often runs on edge devices (NVIDIA Jetson Orin, Coral TPU, Hailo-8, Intel NUC, or even ARM SBCs). Edge MLOps adds three concerns: model size and quantization, OTA deployment, and fleet observability.

Quantization: INT8 or INT4 quantization with TensorRT, ONNX Runtime, or Apache TVM compresses models 2-8x with 0-2 points accuracy drop. For YOLO-family detection models we routinely deploy at 30+ FPS on Jetson Orin Nano (USD 250) for industrial inspection.
OTA deployment: Balena, AWS IoT, or a hand-rolled rsync-over-VPN pipeline with rollback. Never push a model to a fleet without a canary.
Fleet observability: prediction logs uploaded over a metered link need rate limiting and selective sampling.

Edge MLOps is where our predictive maintenance and industrial automation deployments live. The discipline is the same as cloud MLOps, but the failure modes (network drops, GPU thermal throttling, SD-card corruption) are richer.

How We Apply This at hjLabs.in

For a Mumbai NBFC client, we built a Level-3 MLOps stack around their credit-scoring pipeline: GitHub Actions for CI/CD with golden-set evals as a merge gate, Evidently for daily drift checks on 38 features with PagerDuty alerts, MLflow Model Registry as the single source of truth, BentoML serving on a Yotta-Mumbai Kubernetes cluster with shadow + canary rollout, and a Kafka-based feedback loop that joins 30-day default actuals to predictions for monthly recalibration. Mean time to detect a bad model dropped from "three weeks if a stakeholder noticed" to four hours.

For a Gujarat-based predictive-maintenance deployment (see our predictive maintenance service), we built drift detection on vibration-sensor feature distributions running on industrial edge boxes, with predictions logged to a central InfluxDB and weekly drift reports surfaced in Grafana. The same client uses our industrial automation playbook for orchestration. Fine-tuned LLM systems we ship include the same observability pillars — prediction logging, drift on input distributions, evaluation on a rolling golden set.

Common Pitfalls

No golden eval set. Every model upgrade is a coin flip without one.
Feature drift only. Prediction drift and concept drift catch the bugs feature drift misses.
Training-serving skew. The number-one cause of "works in dev, fails in prod." Use the same feature pipeline code in both paths.
No prediction logs. You cannot debug what you cannot reconstruct.
No rollback drill. If you have never practiced a rollback, you will fumble it during a real incident.
Data scientist on-call without runbooks. Cruel and unproductive. Build runbooks for every alert.
One Jupyter notebook = one model. Notebooks are exploration, not production artifacts.
Ignoring DPDP from day one. Retrofitting data-residency into a production system costs 10x more than designing it in.

10. Reproducibility and the Audit Trail

For regulated industries — BFSI, healthcare, pharma — the question "show us how this model was trained" is not hypothetical. It is asked in audits, in regulatory inspections, and in litigation. Three artifacts must be reproducible on demand:

The training data at the exact version used. DVC, LakeFS, or hash-anchored S3 prefixes are credible.
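The training code and dependencies at a tagged git commit, with a Dockerfile and a fully pinned requirements.txt so the training image can be rebuilt with identical package versions. Bitwise-identical images additionally need a digest-pinned base image.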
The training environment: GPU type, CUDA version, random seeds, Python version. Stored in the model registry alongside the artifact.

For models trained on personal data under DPDP, the audit trail must also include the legal basis (consent or specified legitimate use), the purpose statement, and the data retention policy. Wire this into the model registry — do not rely on a separate Excel file maintained by someone who left.
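One way to keep the technical and legal parts of the audit trail together is a single immutable record stored with the model artifact. The field names below are assumptions for illustration, not a registry schema; most registries (MLflow included) can hold such values as tags or a JSON attachment.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingAuditRecord:
    model_name: str
    git_commit: str
    data_version: str     # e.g. a DVC or LakeFS revision
    image_digest: str     # container image digest used for training
    gpu_type: str
    cuda_version: str
    python_version: str
    random_seed: int
    legal_basis: str      # "consent" or "legitimate use"
    purpose: str
    retention_days: int

    def fingerprint(self) -> str:
        """Stable SHA-256 over the record, usable as a tamper-evident tag."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Storing the fingerprint next to the record means any later edit to the metadata is detectable, which is the property an auditor tends to care about.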

The audit-trail discipline pays off in unexpected ways. When a new engineer asks "why did we choose threshold 0.62?" or "what data was in the v3 training run?", a few git tags and an MLflow run page are the answer. Without this, the answer is "we are not sure, and we cannot re-run."

11. Cost Discipline: The MLOps Practice That Pays For Itself

ML costs spiral. Five disciplines that keep them under control:

Per-model cost attribution: tag every inference and training run with a model ID + cost center. Aggregate monthly. Surface the biggest spenders to engineering management. Without this, costs are invisible and uncontrollable.
Idle GPU detection: a long-running A100 at <30% utilization is wasted money. Datadog, Grafana with DCGM exporter, or cloud-native tools (CloudWatch GPU metrics) catch this. Right-size GPU SKUs aggressively — many "we need an A100" jobs run fine on an A10G at one-third the cost.
Spot vs reserved vs on-demand: training workloads belong on spot/preemptible (60-80% cost saving with checkpointing). Production inference belongs on reserved capacity. On-demand is the worst-case default — use it only for development.
Cache aggressively: embedding cache, prompt cache, retrieval cache. For chat applications, semantic caching alone often cuts LLM cost 20-40%.
Right-size context: many production LLM calls pass 8k tokens of context when 2k would do. Trim aggressively; reranking helps.
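Per-model cost attribution from the list above needs very little machinery once every run is tagged. A minimal sketch, assuming usage rows have already been exported with hypothetical `month`, `model_id`, `cost_center`, and `usd` fields:

```python
from collections import defaultdict


def monthly_cost_by_model(usage_rows):
    """Sum USD cost per (month, model_id).

    Expects rows like:
    {"month": "2026-01", "model_id": "churn-v4", "cost_center": "growth", "usd": 12.5}
    """
    totals = defaultdict(float)
    for row in usage_rows:
        totals[(row["month"], row["model_id"])] += row["usd"]
    # Biggest spenders first, so the top of the report is what management sees.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The same grouping by `cost_center` instead of `model_id` gives the chargeback view.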

FAQ: MLOps Questions Production Teams Ask

How do we know if our current MLOps is actually working? Three signal questions. (1) How long would it take to roll back a bad model — under 15 minutes is healthy. (2) When was the last time you detected a regression from monitoring (not user complaint)? Within the last quarter is healthy. (3) Can a new engineer reproduce a year-old model from the registry? If yes, congratulations; this is rare.

What is the right team structure? The "ML platform team owns infra, product teams own models" pattern works at >15 engineers. Below that, a single team owns both. Avoid the trap of a "central data science team" that hands off models to engineering; the handoff is where 80% of production bugs live.

Open-source vs commercial MLOps tools? For small/mid teams, MLflow + DVC + Evidently + BentoML covers 80% of needs free. Pay for vendor tools (W&B, Tecton, Vertex AI, Databricks) when team scale or compliance burden justifies it. Do not pay for tools your team will not adopt — a shiny dashboard nobody reads is worse than a CSV in S3 your team actually reads.

How do we handle a model that needs to be retrained weekly? Build retraining as a Prefect/Airflow DAG with the same eval gates as initial deployment. New model goes to shadow first, canary second, full deploy third. Manual promotion step until you trust the pipeline.

What is the minimum monitoring we should have on day one? Latency p50/p95/p99, error rate, request volume, prediction logs to object storage, and one drift check per feature. Without these you are flying blind.

Do we need a feature store? Below ~5 models or ~50 features, probably no. Above that, the training-serving skew prevention and feature reuse justify the operational burden. Feast (open-source), Tecton, and Databricks Feature Store are the credible options.

Conclusion: MLOps Is a Discipline of Boring Things Done Well

The exciting parts of ML are training and architecture. The valuable parts are the boring things — reproducible pipelines, eval gates, drift dashboards, runbooks, on-call rotations, and post-mortems. MLOps maturity is measured by how quickly your team can detect, diagnose, and recover from a bad model. Teams that take this discipline seriously ship more models, take less downtime, and survive regulatory scrutiny without scrambling.

At hjLabs.in we have built this MLOps muscle into our standard delivery. If you have ML in production and want a Level-3 assessment or a roadmap to get there, we would love to talk.

See also: when a deployed model has to surface an inference result on an embedded display next to the equipment — a defect tag, an OK/NG badge, an icon glyph — the renderer choice between Adafruit GFX's two bitmap calls genuinely affects refresh latency on the line. Our companion benchmark on drawBitmap vs drawXBitmap performance covers the timing and memory-layout differences.
