MLOps & AI Infrastructure¶
MLOps (Machine Learning Operations) is the discipline of deploying, monitoring, and maintaining ML models in production. It extends DevOps and SRE principles to machine learning systems, addressing unique challenges like model versioning, data drift, reproducibility, and the need for continuous retraining. As AI systems move from research to production, MLOps becomes the bridge between data science experimentation and reliable, scalable AI services.
With the rise of LLMs, a new sub-discipline—LLMOps—has emerged, focusing on the specific operational challenges of serving, monitoring, and optimizing large language model applications. This chapter covers the complete MLOps lifecycle, from experiment tracking to production monitoring, infrastructure, and cost optimization.
1. The MLOps Lifecycle¶
Traditional Software vs. ML Systems¶
ML systems are fundamentally different from traditional software:
| Aspect | Traditional Software | ML Systems |
|---|---|---|
| Logic | Explicitly coded | Learned from data |
| Testing | Deterministic tests | Statistical validation |
| Versioning | Code only | Code + data + model + config |
| Debugging | Stack traces, logs | Data analysis, model inspection |
| Failure Modes | Crashes, errors | Silent degradation, drift |
| Dependencies | Libraries, services | + Training data, feature pipelines |
| Deployment | Code deploy | Model deploy + data pipeline deploy |
| Monitoring | Uptime, latency, errors | + Data drift, model performance, fairness |
MLOps Maturity Levels¶
| Level | Description | Characteristics |
|---|---|---|
| Level 0 | Manual process | Jupyter notebooks, manual deployment, no monitoring |
| Level 1 | ML pipeline automation | Automated training, basic CI/CD, simple monitoring |
| Level 2 | CI/CD for ML | Automated testing, model validation, A/B testing, full monitoring |
| Level 3 | Full MLOps | Automated retraining, feature stores, model governance, self-healing |
The ML Lifecycle¶
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Data Mgmt │───>│ Training │───>│ Evaluation │
│ │ │ │ │ │
│ - Collection │ │ - Feature │ │ - Metrics │
│ - Cleaning │ │ engineering│ │ - Validation │
│ - Versioning │ │ - Training │ │ - Comparison │
│ - Labeling │ │ - HPO tuning │ │ - Approval │
└──────────────┘ └──────────────┘ └──────┬───────┘
│
┌──────────────┐ ┌──────────────┐ ┌──────▼───────┐
│ Monitoring │<───│ Serving │<───│ Deployment │
│ │ │ │ │ │
│ - Data drift │ │ - API/Batch │ │ - Packaging │
│ - Model perf │ │ - Scaling │ │ - Staging │
│ - Cost │ │ - Caching │ │ - Rollout │
│ - Alerts │ │ - A/B test │ │ - Rollback │
└──────┬───────┘ └──────────────┘ └──────────────┘
│
└── Triggers retraining when drift detected
2. Data Management¶
Data is the foundation of ML systems. "Garbage in, garbage out" applies more strongly to ML than to any other software paradigm.
Data Versioning¶
Track datasets alongside code to ensure reproducibility:
| Tool | Approach | Key Features |
|---|---|---|
| DVC (Data Version Control) | Git-like for data | Works with Git, supports remote storage (S3, GCS) |
| LakeFS | Git-like branching for data lakes | Branch, merge, rollback for data |
| Delta Lake | ACID transactions for data lakes | Time travel, schema enforcement |
| Pachyderm | Data-driven pipelines | Automatic versioning, lineage tracking |
Data Quality¶
Automated data quality checks should run as part of the ML pipeline:
class DataQualityChecker:
    function validate(dataset: DataFrame) -> QualityReport
        checks = []
        // Schema validation
        checks.append(self.check_schema(dataset))
        // Completeness (missing values)
        checks.append(self.check_completeness(dataset, max_null_pct=0.05))
        // Distribution checks (detect drift from reference)
        checks.append(self.check_distributions(dataset, reference_stats))
        // Range validation
        checks.append(self.check_ranges(dataset, expected_ranges))
        // Uniqueness (check for duplicates)
        checks.append(self.check_uniqueness(dataset, key_columns))
        // Freshness (data isn't stale)
        checks.append(self.check_freshness(dataset, max_age_hours=24))
        return QualityReport(checks=checks, passed=all(c.passed for c in checks))
Tools: Great Expectations, Pandera, Deequ, Soda.
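Two of the checks above (completeness and range validation) can be sketched in plain Python; the `QualityCheck` type and function names here are illustrative, not from any of the libraries listed:

```python
# Minimal data quality checks over rows represented as dicts.
from dataclasses import dataclass

@dataclass
class QualityCheck:
    name: str
    passed: bool
    detail: str = ""

def check_completeness(rows, column, max_null_pct=0.05):
    """Fail if more than max_null_pct of values in `column` are missing."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    pct = nulls / len(rows) if rows else 0.0
    return QualityCheck("completeness", pct <= max_null_pct, f"{pct:.1%} null")

def check_range(rows, column, lo, hi):
    """Fail if any non-null value in `column` falls outside [lo, hi]."""
    bad = [r[column] for r in rows
           if r.get(column) is not None and not (lo <= r[column] <= hi)]
    return QualityCheck("range", not bad, f"{len(bad)} out-of-range")
```

In a real pipeline, a library like Great Expectations or Pandera replaces these hand-rolled checks with declarative expectations and richer reporting.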
Feature Stores¶
A centralized repository for storing, managing, and serving ML features:
// Feature Store concept
class FeatureStore:
    // Define features
    function register_feature(
        name: str,
        description: str,
        entity: str,
        value_type: Type,
        computation: Function,
        freshness: Duration
    )

    // Get features for training (batch)
    function get_training_features(
        entity_ids: list[str],
        feature_names: list[str],
        timestamp: DateTime  // Point-in-time correct!
    ) -> DataFrame

    // Get features for inference (real-time)
    function get_online_features(
        entity_id: str,
        feature_names: list[str]
    ) -> dict
Why feature stores matter:
- Consistency: Same feature definition for training and inference (prevents training-serving skew).
- Reusability: Features computed once, reused across models.
- Point-in-time correctness: Prevent data leakage during training by fetching features as they existed at each training example's timestamp.
- Real-time serving: Pre-computed features available with low latency for inference.
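The core of point-in-time correctness is a lookup that only ever sees feature values recorded at or before the training example's timestamp. A minimal sketch, with an illustrative list-of-pairs data layout rather than a real feature-store API:

```python
# Point-in-time lookup: return the latest feature value at or before
# query_ts, never a value from the future (which would leak labels).
import bisect

def point_in_time_lookup(feature_history, query_ts):
    """feature_history: list of (timestamp, value) pairs sorted by timestamp."""
    timestamps = [ts for ts, _ in feature_history]
    i = bisect.bisect_right(timestamps, query_ts)
    if i == 0:
        return None  # no feature value existed yet at query_ts
    return feature_history[i - 1][1]
```

Feature stores apply this same "as-of" join at scale when materializing training sets, so offline training and online serving see identical feature values.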
| Feature Store | Type | Key Features |
|---|---|---|
| Feast | Open-source | Lightweight, works with existing infra |
| Tecton | Managed | Real-time features, streaming support |
| Databricks Feature Store | Managed | Integrated with Databricks/Spark |
| SageMaker Feature Store | Managed | AWS native, online + offline |
| Hopsworks | Open-source + managed | Python-centric, great docs |
Data Labeling¶
For supervised learning, labeled data is essential:
| Tool | Features | Best For |
|---|---|---|
| Label Studio | Open-source, multi-modal | General labeling, self-hosted |
| Labelbox | Managed, collaborative | Enterprise, team labeling |
| Scale AI | Managed + workforce | Large-scale, high-quality labels |
| Prodigy | Active learning, efficient | NLP tasks, small teams |
| Argilla | Open-source, LLM-focused | LLM evaluation, RLHF data |
3. Experiment Tracking¶
Experiment tracking records every training run's parameters, metrics, artifacts, and environment to ensure reproducibility and enable comparison.
What to Track¶
| Category | Examples |
|---|---|
| Parameters | Learning rate, batch size, epochs, model architecture |
| Metrics | Loss, accuracy, F1, BLEU, latency, throughput |
| Artifacts | Model weights, plots, predictions, confusion matrices |
| Environment | Python version, library versions, GPU type, OS |
| Data | Dataset version, preprocessing steps, train/val/test splits |
| Code | Git commit hash, diff, branch |
Experiment Tracking Tools¶
| Tool | Type | Key Features |
|---|---|---|
| MLflow | Open-source | Model registry, tracking, deployment, widely adopted |
| Weights & Biases (W&B) | Managed | Beautiful UI, hyperparameter sweeps, artifact tracking |
| Neptune | Managed | Flexible metadata, comparison tools |
| CometML | Managed | Experiment comparison, model production monitoring |
| TensorBoard | Open-source | Training visualization, integrated with TensorFlow/PyTorch |
| Aim | Open-source | Fast, local-first, beautiful visualizations |
Pseudocode (Experiment Tracking)¶
class ExperimentTracker:
    function start_run(name: str, params: dict) -> Run
        run = Run(
            id=generate_id(),
            name=name,
            params=params,
            git_commit=get_git_commit(),
            environment=capture_environment(),
            start_time=now()
        )
        return run

    function log_metric(run: Run, name: str, value: float, step: int = None)
        run.metrics.append(Metric(name=name, value=value, step=step, timestamp=now()))

    function log_artifact(run: Run, path: str, artifact_type: str)
        run.artifacts.append(Artifact(path=path, type=artifact_type, hash=file_hash(path)))

    function end_run(run: Run, status: str = "completed")
        run.end_time = now()
        run.status = status
        run.duration = run.end_time - run.start_time
        self.store.save(run)
4. Model Training Infrastructure¶
Hyperparameter Optimization (HPO)¶
Finding optimal hyperparameters is critical for model performance:
| Method | Approach | Efficiency | When to Use |
|---|---|---|---|
| Grid Search | Try all combinations | O(k^n) - exhaustive | Few hyperparameters, small search space |
| Random Search | Random sampling | Better than grid for high-dim | Moderate search spaces |
| Bayesian Optimization | Model the objective function | Very efficient | Expensive training runs |
| Hyperband / ASHA | Early stopping of bad runs | Very efficient | Large search spaces |
| Population-Based Training | Evolutionary approach | Parallel, adaptive | Distributed training |
Tools: Optuna, Ray Tune, W&B Sweeps, SigOpt.
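Random search, the simplest method that works well in practice, fits in a few lines. This sketch uses a toy objective; in a real HPO run, `objective` would launch a full training job and return a validation metric:

```python
# Random search over uniform ranges, seeded for reproducibility.
import random

def random_search(objective, space, n_trials=50, seed=0):
    """space: dict mapping param name -> (low, high) sampled uniformly."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = objective(params)  # higher is better here
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Libraries like Optuna add the pieces this sketch omits: smarter samplers (Bayesian/TPE), pruning of bad trials, and parallel execution.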
Distributed Training¶
For models too large for a single GPU:
| Strategy | Description | Communication | Use Case |
|---|---|---|---|
| Data Parallel (DP) | Same model on each GPU, different data | All-reduce gradients | Models that fit in 1 GPU |
| Distributed Data Parallel (DDP) | DP with better multi-node support | NCCL all-reduce | Standard distributed training |
| Fully Sharded DP (FSDP) | Shard params + gradients + optimizer states | All-gather when needed | Large models (10B+) |
| Tensor Parallel (TP) | Split layers across GPUs | Point-to-point | Very large layers |
| Pipeline Parallel (PP) | Split model layers across GPUs | Forward/backward between stages | Very deep models |
| 3D Parallelism | DP + TP + PP combined | All of the above | Frontier models (100B+) |
Training Frameworks¶
| Framework | Key Features | Best For |
|---|---|---|
| PyTorch | Dynamic graphs, research-friendly | Most common, flexible |
| PyTorch Lightning | Structured PyTorch, less boilerplate | Production training |
| DeepSpeed (Microsoft) | ZeRO optimizer, mixed precision | Large model training |
| Megatron-LM (NVIDIA) | Tensor/pipeline parallelism | LLM pre-training |
| JAX | Functional, XLA compilation | TPU training, research |
| HuggingFace Transformers | Pre-trained models, Trainer API | Fine-tuning, transfer learning |
| Axolotl | Fine-tuning framework | LLM fine-tuning (LoRA, QLoRA) |
| Unsloth | Optimized fine-tuning | Fast LoRA/QLoRA fine-tuning |
5. Model Registry¶
A model registry is a centralized store for model versions, metadata, and lifecycle states.
Model Registry Operations¶
class ModelRegistry:
    function register_model(
        name: str,
        version: str,
        artifact_path: str,
        metrics: dict,
        parameters: dict,
        training_run_id: str,
        tags: dict = None
    ) -> ModelVersion
        // Store model artifact
        artifact_id = self.artifact_store.upload(artifact_path)
        // Create version entry
        model_version = ModelVersion(
            name=name,
            version=version,
            artifact_id=artifact_id,
            metrics=metrics,
            parameters=parameters,
            training_run_id=training_run_id,
            tags=tags,
            stage="staging",  // Start in staging
            created_at=now()
        )
        self.store.save(model_version)
        return model_version

    function promote_model(name: str, version: str, target_stage: str)
        // Transition: staging -> production (with validation)
        model = self.store.get(name, version)
        if target_stage == "production":
            // Run validation checks
            validation = self.validate_for_production(model)
            if not validation.passed:
                raise ValidationError(validation.failures)
            // Archive current production model
            current_prod = self.get_production_model(name)
            if current_prod:
                current_prod.stage = "archived"
                self.store.save(current_prod)
        model.stage = target_stage
        self.store.save(model)

    function get_production_model(name: str) -> ModelVersion
        return self.store.query(name=name, stage="production")
Model Lifecycle Stages¶
Development → Staging → Production → Archived
│ │ │
└── Failed └── Failed └── Rolled back
6. Model Serving¶
Model serving is the infrastructure for making predictions available to applications.
Serving Patterns¶
Batch Inference¶
Process large datasets offline on a schedule:
class BatchInferenceJob:
    model: Model
    input_source: DataSource  // S3, database, etc.
    output_sink: DataSink

    function run(job_config: dict)
        // Load data
        data = self.input_source.read(job_config["input_path"])
        // Preprocess
        features = self.preprocess(data)
        // Batch predict, tracking per-batch latency and failures
        predictions = []
        latencies = []
        errors = 0
        for batch in features.batches(size=1000):
            batch_start = now()
            try:
                preds = self.model.predict(batch)
                predictions.extend(preds)
            except PredictionError:
                errors += len(batch)
            latencies.append(now() - batch_start)
        // Write results
        self.output_sink.write(predictions, job_config["output_path"])
        // Log metrics
        log_metrics({
            "total_predictions": len(predictions),
            "latency_p99": compute_p99(latencies),
            "error_rate": errors / len(data)
        })
Use cases: Recommendation systems, risk scoring, report generation, bulk email classification.
Real-Time Inference (Online Serving)¶
Low-latency API endpoints for synchronous predictions:
class ModelServer:
    model: Model
    preprocessor: Preprocessor
    postprocessor: Postprocessor
    cache: Cache

    function predict(request: PredictionRequest) -> PredictionResponse
        start_time = now()
        try:
            // Check cache
            cache_key = hash(request)
            cached = self.cache.get(cache_key)
            if cached:
                return cached
            // Preprocess
            features = self.preprocessor.transform(request.data)
            // Validate features
            if not self.validate_features(features):
                return PredictionResponse(error="Invalid input features")
            // Predict
            raw_prediction = self.model.predict(features)
            // Postprocess
            response = self.postprocessor.transform(raw_prediction)
            // Cache result
            self.cache.set(cache_key, response, ttl=300)
            // Log for monitoring
            latency = now() - start_time
            self.log_prediction(request, response, latency)
            return response
        except Exception as e:
            self.log_error(e, request)
            return PredictionResponse(error="Prediction failed", fallback=self.get_fallback())
Key metrics: Latency (p50, p95, p99), throughput (requests/second), error rate, GPU utilization.
Streaming Inference¶
Process continuous data streams (e.g., Kafka, Kinesis):
- Real-time fraud detection on transaction streams.
- Continuous anomaly detection on sensor data.
- Live content moderation on social media posts.
Edge Inference¶
Run models on edge devices (mobile, IoT, embedded):
| Framework | Platforms | Model Formats | Key Features |
|---|---|---|---|
| TensorFlow Lite | Android, iOS, embedded | .tflite | Quantization, delegates |
| ONNX Runtime | Cross-platform | .onnx | Universal format, optimized |
| Core ML | Apple ecosystem | .mlmodel | Hardware-accelerated on Apple devices |
| llama.cpp | Desktop, mobile | GGUF | LLM inference on CPU |
| MediaPipe | Mobile, web | Various | Google's ML pipeline framework |
LLM Serving Infrastructure¶
LLM serving has unique requirements compared to traditional ML:
| System | Key Innovation | Best For |
|---|---|---|
| vLLM | PagedAttention, continuous batching | High-throughput production serving |
| TGI (HuggingFace) | Tensor parallelism, Flash Attention | HuggingFace model ecosystem |
| TensorRT-LLM (NVIDIA) | FP8, in-flight batching | Maximum GPU performance |
| Ollama | Simple local serving | Development, testing |
| llama.cpp | CPU/GPU mixed inference, GGUF | Edge, desktop, cost-sensitive |
| SGLang | RadixAttention, constrained decoding | Structured output generation |
| Ray Serve | Distributed serving, model composition | Multi-model pipelines |
7. Monitoring and Observability¶
ML monitoring goes beyond traditional software monitoring — you must also monitor data quality, model performance, and business impact.
What to Monitor¶
Infrastructure Metrics¶
| Metric | Description | Alert Threshold |
|---|---|---|
| Latency (p50, p95, p99) | Response time distribution | p99 > SLA |
| Throughput | Requests per second | Below expected load |
| Error Rate | Failed predictions | > 1% |
| GPU Utilization | GPU compute usage | < 30% (waste) or > 95% (overloaded) |
| Memory Usage | RAM and GPU memory | > 90% |
| Queue Depth | Pending requests | Growing unboundedly |
Data Drift¶
Data drift occurs when the distribution of input data changes from what the model was trained on:
class DriftDetector:
    reference_stats: DataStatistics  // From training data

    function detect_drift(current_data: DataFrame) -> DriftReport
        drift_scores = {}
        for feature in current_data.columns:
            // Statistical tests
            if feature.is_numerical:
                // Kolmogorov-Smirnov test
                ks_stat, p_value = ks_test(
                    self.reference_stats[feature],
                    current_data[feature]
                )
                drift_scores[feature] = {
                    "test": "KS",
                    "statistic": ks_stat,
                    "p_value": p_value,
                    "drifted": p_value < 0.05
                }
            else:
                // Chi-squared test for categorical
                chi2, p_value = chi2_test(
                    self.reference_stats[feature],
                    current_data[feature]
                )
                drift_scores[feature] = {
                    "test": "chi2",
                    "statistic": chi2,
                    "p_value": p_value,
                    "drifted": p_value < 0.05
                }
            // Population Stability Index (PSI)
            psi = compute_psi(
                self.reference_stats[feature],
                current_data[feature]
            )
            drift_scores[feature]["psi"] = psi
            // PSI > 0.2 indicates significant drift
        return DriftReport(
            features=drift_scores,
            overall_drift=any(d["drifted"] for d in drift_scores.values()),
            recommendation=self.get_recommendation(drift_scores)
        )
Drift Types:
- Data Drift (Covariate Shift): Input distribution changes (e.g., new user demographics).
- Concept Drift: The relationship between input and output changes (e.g., user preferences shift).
- Label Drift: The distribution of target labels changes (e.g., fraud rate increases).
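The PSI referenced above can be computed in a few lines: bin both samples using the reference sample's bin edges, then sum (p_i - q_i) * ln(p_i / q_i) over the bins. A minimal sketch (a small epsilon avoids division by zero in empty bins; real implementations often use quantile-based bins instead):

```python
# Population Stability Index between a reference and a current sample.
import math

def compute_psi(reference, current, n_bins=10, eps=1e-6):
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            i = min(int((x - lo) / width), n_bins - 1)  # clamp overflow bin
            counts[max(i, 0)] += 1
        return [c / len(sample) + eps for c in counts]

    p, q = proportions(reference), proportions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

By the usual rule of thumb, PSI < 0.1 means no significant shift, 0.1-0.2 a moderate shift, and > 0.2 a significant shift that should trigger investigation or retraining.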
Model Performance¶
Monitor prediction quality over time:
class ModelPerformanceMonitor:
    function monitor(
        predictions: list[Prediction],
        ground_truth: list[Label] = None  // May be delayed
    ) -> PerformanceReport
        metrics = {}
        // Prediction distribution
        metrics["prediction_distribution"] = compute_distribution(predictions)
        metrics["prediction_entropy"] = compute_entropy(predictions)
        // If ground truth available (possibly delayed)
        if ground_truth:
            metrics["accuracy"] = compute_accuracy(predictions, ground_truth)
            metrics["f1"] = compute_f1(predictions, ground_truth)
            metrics["calibration"] = compute_calibration(predictions, ground_truth)
        // Detect anomalies
        metrics["anomaly_score"] = self.detect_performance_anomaly(metrics)
        // Compare to baseline
        metrics["degradation"] = self.compare_to_baseline(metrics)
        return PerformanceReport(metrics=metrics)
Monitoring Tools¶
| Tool | Type | Key Features |
|---|---|---|
| Evidently AI | Open-source | Data drift, model performance reports |
| Arize Phoenix | Open-source | LLM traces, embeddings analysis |
| WhyLabs | Managed | Data profiling, drift detection |
| Fiddler | Managed | Explainability, fairness monitoring |
| NannyML | Open-source | Performance estimation without labels |
| Prometheus + Grafana | Open-source | Infrastructure metrics, custom dashboards |
| LangSmith | Managed | LLM-specific tracing and evaluation |
| Langfuse | Open-source | LLM observability, prompt management |
Alerting Strategy¶
// Tiered alerting based on severity
alerts = {
    "critical": {
        "conditions": [
            "error_rate > 5%",
            "latency_p99 > 10s",
            "service_down"
        ],
        "action": "page_on_call",
        "response_time": "5 minutes"
    },
    "warning": {
        "conditions": [
            "data_drift_detected",
            "model_accuracy < baseline - 5%",
            "gpu_utilization < 20%"
        ],
        "action": "notify_team_channel",
        "response_time": "1 hour"
    },
    "info": {
        "conditions": [
            "new_model_version_deployed",
            "retraining_triggered",
            "cost_threshold_approaching"
        ],
        "action": "log_and_dashboard",
        "response_time": "next_business_day"
    }
}
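One way to make such a config executable is to express each condition as a named predicate over current metrics and report the highest-severity tier that fires. The predicate names and thresholds below mirror the config above but are illustrative:

```python
# Tiered alert evaluation: highest severity wins.
SEVERITY_ORDER = ["critical", "warning", "info"]

CONDITIONS = {
    "error_rate_high": lambda m: m.get("error_rate", 0) > 0.05,
    "latency_p99_high": lambda m: m.get("latency_p99_s", 0) > 10,
    "data_drift_detected": lambda m: m.get("drift_psi", 0) > 0.2,
    "gpu_underutilized": lambda m: m.get("gpu_util", 1.0) < 0.20,
}

TIERS = {
    "critical": ["error_rate_high", "latency_p99_high"],
    "warning": ["data_drift_detected", "gpu_underutilized"],
}

def evaluate_alerts(metrics):
    """Return (severity, fired_conditions) for the highest tier that fires."""
    for severity in SEVERITY_ORDER:
        fired = [c for c in TIERS.get(severity, []) if CONDITIONS[c](metrics)]
        if fired:
            return severity, fired
    return None, []
```

In production this evaluation usually lives in the alerting system itself (e.g., Prometheus alert rules) rather than application code, but the tiering logic is the same.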
8. CI/CD for Machine Learning¶
ML Pipeline Stages¶
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Code │──>│ Data │──>│ Train │──>│ Validate│──>│ Deploy │
│ Tests │ │ Tests │ │ Model │ │ Model │ │ Model │
└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
Lint, Schema, Training, Performance Canary,
Unit tests, Quality, Experiment comparison, Blue-green,
Type checks Distribution tracking Fairness, Shadow
checks Safety tests deploy
Model Validation Gates¶
Before promoting a model to production, validate:
class ModelValidationGate:
    function validate(candidate: ModelVersion, current_prod: ModelVersion) -> ValidationResult
        checks = []
        // 1. Performance comparison
        checks.append(self.check_performance(
            candidate, current_prod,
            min_improvement=0.01  // Must be at least 1% better
        ))
        // 2. Latency check
        checks.append(self.check_latency(
            candidate,
            max_p99_ms=100  // Must serve in < 100 ms
        ))
        // 3. Fairness check
        checks.append(self.check_fairness(
            candidate,
            protected_attributes=["gender", "race", "age"],
            max_disparity=0.1  // Max 10% performance gap across groups
        ))
        // 4. Safety check (for LLMs)
        checks.append(self.check_safety(
            candidate,
            test_suite="safety_benchmark",
            max_unsafe_rate=0.001  // < 0.1% unsafe outputs
        ))
        // 5. Resource usage
        checks.append(self.check_resources(
            candidate,
            max_memory_gb=16,
            max_gpu_memory_gb=24
        ))
        return ValidationResult(
            passed=all(c.passed for c in checks),
            checks=checks
        )
Deployment Strategies for Models¶
| Strategy | Risk | Complexity | When to Use |
|---|---|---|---|
| Direct Replacement | High | Low | Non-critical models, dev/staging |
| Blue-Green | Medium | Medium | Quick rollback needed |
| Canary | Low | Medium | Gradual rollout, risk mitigation |
| Shadow | None | High | High-stakes models, new architectures |
| A/B Testing | Low | High | Comparing model variants |
| Multi-Armed Bandit | Low | High | Continuous optimization |
Shadow Deployment: Run the new model alongside production, compare outputs, but only serve the old model's predictions. Validate the new model with real traffic before switching.
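The routing logic for a shadow deployment is small: always serve production, run the candidate on the same input, and record agreement for offline analysis. A minimal sketch, where the two models are any callables and the in-memory list stands in for a real metrics sink:

```python
# Shadow deployment: the shadow model's output is logged, never served.
class ShadowRouter:
    def __init__(self, prod_model, shadow_model):
        self.prod_model = prod_model
        self.shadow_model = shadow_model
        self.comparisons = []  # stand-in for a real metrics/logging sink

    def predict(self, x):
        prod_out = self.prod_model(x)
        try:
            shadow_out = self.shadow_model(x)  # never returned to the caller
            self.comparisons.append({"input": x, "prod": prod_out,
                                     "shadow": shadow_out,
                                     "agree": prod_out == shadow_out})
        except Exception:
            # A shadow failure must never affect the served response.
            self.comparisons.append({"input": x, "shadow_error": True})
        return prod_out

    def agreement_rate(self):
        scored = [c for c in self.comparisons if "agree" in c]
        return sum(c["agree"] for c in scored) / len(scored) if scored else None
```

In practice the shadow call runs asynchronously so it adds no latency, and agreement is measured with task-appropriate metrics (exact match, score deltas) rather than simple equality.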
Pipeline Orchestration Tools¶
| Tool | Type | Key Features |
|---|---|---|
| Kubeflow Pipelines | Open-source | Kubernetes-native, ML-specific |
| Apache Airflow | Open-source | General workflow, widely adopted |
| Prefect | Open-source + managed | Modern Python-native workflows |
| Dagster | Open-source | Data-aware orchestration |
| ZenML | Open-source | ML-specific, stack-agnostic |
| Metaflow (Netflix) | Open-source | Data science workflows, AWS integration |
9. LLMOps¶
LLMOps extends traditional MLOps for the unique challenges of operating LLM-based systems.
LLMOps vs. Traditional MLOps¶
| Aspect | Traditional MLOps | LLMOps |
|---|---|---|
| Model Training | Train custom models | Fine-tune or use API (pre-trained) |
| Primary Tuning | Hyperparameters, features | Prompts, context, fine-tuning |
| Evaluation | Metrics (accuracy, F1) | Metrics + human evaluation + LLM-as-judge |
| Versioning | Model weights + data | Prompt templates + model version + RAG index |
| Cost Drivers | Training compute | Inference tokens (input + output) |
| Failure Modes | Wrong predictions | Hallucination, prompt injection, safety |
| Update Cycle | Retrain periodically | Update prompts, RAG index, or model version |
LLMOps Components¶
┌──────────────────────────────────────────────────────────┐
│ LLMOps Stack │
├──────────────────────────────────────────────────────────┤
│ Prompt Management │ Model Gateway │ Evaluation │
│ - Version control │ - Routing │ - Auto eval │
│ - A/B testing │ - Load balancing │ - Human eval│
│ - Template engine │ - Failover │ - Benchmarks│
├──────────────────────────────────────────────────────────┤
│ RAG Pipeline │ Caching │ Cost Mgmt │
│ - Index management │ - Semantic cache │ - Token │
│ - Embedding updates │ - Prompt cache │ tracking │
│ - Quality monitoring │ - KV cache │ - Budgets │
├──────────────────────────────────────────────────────────┤
│ Observability │ Safety │ Fine-tuning │
│ - Traces │ - Content filter │ - Data prep │
│ - Logs │ - PII detection │ - Training │
│ - Metrics │ - Injection guard │ - Evaluation│
└──────────────────────────────────────────────────────────┘
Prompt Management¶
class PromptManager:
    store: PromptStore  // Database of prompt versions

    function create_version(
        prompt_name: str,
        template: str,
        model: str,
        parameters: dict,
        description: str
    ) -> PromptVersion
        version = PromptVersion(
            name=prompt_name,
            version=self.get_next_version(prompt_name),
            template=template,
            model=model,
            parameters=parameters,
            description=description,
            created_at=now()
        )
        self.store.save(version)
        return version

    function get_active_prompt(prompt_name: str, environment: str) -> PromptVersion
        // Get the currently active version for this environment
        return self.store.get_active(prompt_name, environment)

    function ab_test(
        prompt_name: str,
        version_a: str,
        version_b: str,
        traffic_split: float = 0.5
    )
        // Route traffic between two prompt versions
        self.store.set_ab_test(prompt_name, version_a, version_b, traffic_split)
Model Gateway¶
A model gateway abstracts the LLM provider, enabling provider switching, fallback, and load balancing:
class ModelGateway:
    providers: dict[str, LLMProvider]  // openai, anthropic, local, etc.
    routing_config: RoutingConfig

    function call(
        messages: list[dict],
        model: str = None,
        **kwargs
    ) -> Response
        // Determine provider and model
        provider_name, model_name = self.route(model, messages)
        provider = self.providers[provider_name]
        try:
            response = provider.call(model_name, messages, **kwargs)
            self.log_success(provider_name, model_name, response)
            return response
        except (RateLimitError, ServiceUnavailable):
            // Fall back to an alternative provider, mapping to that provider's equivalent model
            fallback_name, fallback_model = self.routing_config.get_fallback(provider_name, model_name)
            response = self.providers[fallback_name].call(fallback_model, messages, **kwargs)
            self.log_fallback(provider_name, fallback_name, response)
            return response

    function route(model: str, messages: list) -> tuple[str, str]
        // Simple routing: match model to provider
        // Advanced: cost-based, latency-based, or complexity-based routing
        return self.routing_config.resolve(model)
Cost Optimization¶
LLM costs can grow quickly. Key optimization strategies:
| Strategy | Savings | Complexity | Description |
|---|---|---|---|
| Prompt optimization | 20-50% | Low | Reduce prompt length, remove redundancy |
| Caching | 30-80% | Medium | Cache identical and semantically similar queries |
| Model routing | 40-70% | Medium | Use cheaper models for simple queries |
| Batching | 10-30% | Low | Batch multiple requests |
| Quantized models | 50-75% | Medium | Use quantized models for appropriate tasks |
| Context pruning | 20-40% | Medium | Only include necessary context |
| Self-hosted models | 50-90% (at scale) | High | Run open-source models on own infrastructure |
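The savings from model routing follow directly from token pricing arithmetic. A back-of-envelope sketch, with placeholder model names and prices (not current vendor rates):

```python
# Cost of a token-priced workload with and without routing to a cheap model.
PRICES_PER_1K = {                      # (input, output) USD per 1K tokens -- assumed
    "cheap-model": (0.0002, 0.0006),
    "premium-model": (0.01, 0.03),
}

def request_cost(model, input_tokens, output_tokens):
    pin, pout = PRICES_PER_1K[model]
    return (input_tokens / 1000) * pin + (output_tokens / 1000) * pout

def routed_cost(requests, simple_fraction):
    """Total cost if `simple_fraction` of requests go to the cheap model.

    requests: list of (input_tokens, output_tokens) pairs.
    """
    n_cheap = int(len(requests) * simple_fraction)
    total = 0.0
    for i, (tin, tout) in enumerate(requests):
        model = "cheap-model" if i < n_cheap else "premium-model"
        total += request_cost(model, tin, tout)
    return total
```

With prices like these, routing even half of traffic to the cheap model cuts spend by nearly half, which is why a classifier that labels queries "simple" vs. "complex" often pays for itself quickly.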
10. GPU Infrastructure¶
GPU Types for AI Workloads¶
| GPU | Memory | FP16 TFLOPS | Use Case | Approximate Cost |
|---|---|---|---|---|
| NVIDIA A100 | 40/80 GB HBM2e | 312 | Training + inference | ~$2/hr (cloud) |
| NVIDIA H100 | 80 GB HBM3 | 989 | Frontier training | ~$4/hr (cloud) |
| NVIDIA H200 | 141 GB HBM3e | 989 | Large model inference | ~$5/hr (cloud) |
| NVIDIA L4 | 24 GB GDDR6 | 121 | Cost-effective inference | ~$0.5/hr (cloud) |
| NVIDIA T4 | 16 GB GDDR6 | 65 | Budget inference | ~$0.3/hr (cloud) |
| AMD MI300X | 192 GB HBM3 | 1307 | Training + inference | ~$3/hr (cloud) |
| Google TPU v5e | 16 GB HBM | N/A | JAX/TF training | ~$1.2/hr (cloud) |
Cloud AI Platforms¶
| Platform | Key Services | Strengths |
|---|---|---|
| AWS | SageMaker, Bedrock, EC2 (P5, Inf2) | Broadest ecosystem, Inferentia chips |
| GCP | Vertex AI, TPUs, GKE | TPU access, Gemini integration |
| Azure | Azure ML, OpenAI Service | OpenAI partnership, enterprise focus |
| Lambda Labs | GPU cloud | Simple, GPU-focused, competitive pricing |
| Together AI | Inference API + fine-tuning | Open-source model hosting |
| Replicate | Model hosting API | Simple deployment, pay-per-prediction |
| Modal | Serverless GPU | Serverless functions with GPU access |
Infrastructure Sizing¶
// Estimating GPU requirements for LLM inference
function estimate_gpu_requirements(
    model_params_billions: float,
    precision: str = "fp16",  // fp32, fp16, int8, int4
    max_batch_size: int = 32,
    max_sequence_length: int = 4096
) -> dict
    // Model weight memory
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
    weight_memory_gb = model_params_billions * bytes_per_param[precision]
    // KV cache memory (approximate)
    // KV cache per token ≈ 2 (K and V) * num_layers * hidden_size * 2 bytes (fp16)
    // Rough estimate: on the order of 0.5-1 MB per token for a 7B model
    kv_cache_per_token_mb = model_params_billions * 0.15  // Rough scaling
    kv_cache_gb = (kv_cache_per_token_mb * max_sequence_length * max_batch_size) / 1024
    // Total memory needed
    total_memory_gb = weight_memory_gb + kv_cache_gb + 2  // +2 GB overhead
    // Determine GPU configuration
    if total_memory_gb <= 24:
        return {"gpus": "1x L4/T4 (24 GB)", "memory_gb": total_memory_gb}
    elif total_memory_gb <= 80:
        return {"gpus": "1x A100/H100 (80 GB)", "memory_gb": total_memory_gb}
    else:
        num_gpus = ceil(total_memory_gb / 80)
        return {"gpus": f"{num_gpus}x A100/H100", "memory_gb": total_memory_gb}
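The estimator above runs as ordinary Python with the same rough constants, so treat its output as order-of-magnitude guidance only (real KV cache size depends on layer count, hidden size, attention variant, and serving engine):

```python
# Runnable version of the GPU sizing sketch above.
from math import ceil

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def estimate_gpu_requirements(model_params_billions, precision="fp16",
                              max_batch_size=32, max_sequence_length=4096):
    weight_gb = model_params_billions * BYTES_PER_PARAM[precision]
    kv_per_token_mb = model_params_billions * 0.15   # rough scaling constant
    kv_gb = kv_per_token_mb * max_sequence_length * max_batch_size / 1024
    total_gb = weight_gb + kv_gb + 2                 # +2 GB runtime overhead
    if total_gb <= 24:
        gpus = "1x L4/T4 (24 GB)"
    elif total_gb <= 80:
        gpus = "1x A100/H100 (80 GB)"
    else:
        gpus = f"{ceil(total_gb / 80)}x A100/H100"
    return {"gpus": gpus, "memory_gb": round(total_gb, 1)}
```

For example, a 7B model quantized to int4 serving a single 2K-token sequence needs roughly 3.5 GB of weights plus ~2 GB of KV cache, so it fits comfortably on a 24 GB card.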
11. ML System Architecture Patterns¶
Online Prediction Service¶
Client → API Gateway → Load Balancer → Model Server → Model
│
Feature Store (online)
│
Monitoring / Logging
Offline Batch Pipeline¶
Scheduler → Data Pipeline → Feature Pipeline → Training Pipeline
│
Model Registry
│
Validation Gate
│
Deployment Pipeline
LLM Application Architecture¶
User → Application → Prompt Manager → Model Gateway → LLM API
│ │
RAG Pipeline Response Handler
│ │
Vector DB Output Validation
│ │
Embedding Model Monitoring / Tracing
12. Best Practices Summary¶
MLOps Best Practices¶
- Automate everything: Manual steps are error-prone and don't scale. Automate data pipelines, training, validation, and deployment.
- Version all artifacts: Code, data, models, configs, and prompts should all be versioned and reproducible.
- Monitor beyond uptime: Data drift, model performance, and business metrics are as important as infrastructure health.
- Test models like software: Unit tests for data transformations, integration tests for pipelines, performance tests for models.
- Plan for failure: Models degrade silently. Have automated alerts, fallbacks, and rollback procedures.
- Start simple, iterate: Begin with a simple pipeline and add complexity (feature stores, advanced monitoring) as needed.
LLMOps Best Practices¶
- Treat prompts as code: Version control, review, test, and deploy prompts through a CI/CD pipeline.
- Implement model gateways: Abstract the LLM provider to enable switching, fallback, and A/B testing.
- Monitor token costs: Track costs per user, per feature, and per model. Set budgets and alerts.
- Cache aggressively: Semantic caching for similar queries, exact caching for identical requests, prefix caching for shared prompt templates.
- Evaluate continuously: Run automated evaluation (LLM-as-judge) on a sample of production traffic regularly.
- Defense in depth: Implement input validation, output filtering, rate limiting, and content safety at every layer.
- Use the right model for the task: Not every query needs GPT-4. Route simple queries to cheaper, faster models.
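The "cache aggressively" practice can be sketched for the semantic case: reuse a cached answer when a new query's embedding is close enough (by cosine similarity) to a previously answered one. The `embed` callable stands in for a real embedding model; the threshold and linear scan are illustrative (production systems use a vector index):

```python
# Semantic cache: similarity lookup over stored (embedding, response) pairs.
import math

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(qv, e[0]),
                   default=None)
        if best and self._cosine(qv, best[0]) >= self.threshold:
            return best[1]          # cache hit: skip the LLM call entirely
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

The threshold is the key tuning knob: too low and users get answers to someone else's question; too high and the cache rarely hits. Tools like GPTCache package this pattern with real embedding models and vector stores.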