MLOps & AI Infrastructure¶
MLOps (Machine Learning Operations) is the discipline of deploying, monitoring, and maintaining ML models in production. It extends DevOps and SRE principles to machine learning systems, addressing unique challenges like model versioning, data drift, reproducibility, and the need for continuous retraining. As AI systems move from research to production, MLOps becomes the bridge between data science experimentation and reliable, scalable AI services.
With the rise of LLMs, a new sub-discipline—LLMOps—has emerged, focusing on the specific operational challenges of serving, monitoring, and optimizing large language model applications. This chapter covers the complete MLOps lifecycle, from experiment tracking to production monitoring, infrastructure, and cost optimization.
1. The MLOps Lifecycle¶
Traditional Software vs. ML Systems¶
ML systems are fundamentally different from traditional software:
| Aspect | Traditional Software | ML Systems |
|---|---|---|
| Logic | Explicitly coded | Learned from data |
| Testing | Deterministic tests | Statistical validation |
| Versioning | Code only | Code + data + model + config |
| Debugging | Stack traces, logs | Data analysis, model inspection |
| Failure Modes | Crashes, errors | Silent degradation, drift |
| Dependencies | Libraries, services | + Training data, feature pipelines |
| Deployment | Code deploy | Model deploy + data pipeline deploy |
| Monitoring | Uptime, latency, errors | + Data drift, model performance, fairness |
MLOps Maturity Levels¶
| Level | Description | Characteristics |
|---|---|---|
| Level 0 | Manual process | Jupyter notebooks, manual deployment, no monitoring |
| Level 1 | ML pipeline automation | Automated training, basic CI/CD, simple monitoring |
| Level 2 | CI/CD for ML | Automated testing, model validation, A/B testing, full monitoring |
| Level 3 | Full MLOps | Automated retraining, feature stores, model governance, self-healing |
The ML Lifecycle¶
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Data Mgmt │───>│ Training │───>│ Evaluation │
│ │ │ │ │ │
│ - Collection │ │ - Feature │ │ - Metrics │
│ - Cleaning │ │ engineering│ │ - Validation │
│ - Versioning │ │ - Training │ │ - Comparison │
│ - Labeling │ │ - HPO tuning │ │ - Approval │
└──────────────┘ └──────────────┘ └──────┬───────┘
│
┌──────────────┐ ┌──────────────┐ ┌──────▼───────┐
│ Monitoring │<───│ Serving │<───│ Deployment │
│ │ │ │ │ │
│ - Data drift │ │ - API/Batch │ │ - Packaging │
│ - Model perf │ │ - Scaling │ │ - Staging │
│ - Cost │ │ - Caching │ │ - Rollout │
│ - Alerts │ │ - A/B test │ │ - Rollback │
└──────┬───────┘ └──────────────┘ └──────────────┘
│
└── Triggers retraining when drift detected
2. Data Management¶
Data is the foundation of ML systems. "Garbage in, garbage out" applies more strongly to ML than to any other software paradigm.
Data Versioning¶
Track datasets alongside code to ensure reproducibility:
| Tool | Approach | Key Features |
|---|---|---|
| DVC (Data Version Control) | Git-like for data | Works with Git, supports remote storage (S3, GCS) |
| LakeFS | Git-like branching for data lakes | Branch, merge, rollback for data |
| Delta Lake | ACID transactions for data lakes | Time travel, schema enforcement |
| Pachyderm | Data-driven pipelines | Automatic versioning, lineage tracking |
Data Quality¶
Automated data quality checks should run as part of the ML pipeline:
class DataQualityChecker:
    function validate(dataset: DataFrame) -> QualityReport
        checks = []
        // Schema validation
        checks.append(self.check_schema(dataset))
        // Completeness (missing values)
        checks.append(self.check_completeness(dataset, max_null_pct=0.05))
        // Distribution checks (detect drift from reference)
        checks.append(self.check_distributions(dataset, reference_stats))
        // Range validation
        checks.append(self.check_ranges(dataset, expected_ranges))
        // Uniqueness (check for duplicates)
        checks.append(self.check_uniqueness(dataset, key_columns))
        // Freshness (data isn't stale)
        checks.append(self.check_freshness(dataset, max_age_hours=24))
        return QualityReport(checks=checks, passed=all(c.passed for c in checks))
Tools: Great Expectations, Pandera, Deequ, Soda.
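Two of the checks above (completeness and range validation) can be sketched in plain Python; the `QualityCheck` type and function names here are illustrative, not from any of the libraries listed:

```python
# Minimal data quality checks over rows represented as dicts.
from dataclasses import dataclass

@dataclass
class QualityCheck:
    name: str
    passed: bool
    detail: str = ""

def check_completeness(rows, column, max_null_pct=0.05):
    """Fail if more than max_null_pct of values in `column` are missing."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    pct = nulls / len(rows) if rows else 0.0
    return QualityCheck("completeness", pct <= max_null_pct, f"{pct:.1%} null")

def check_range(rows, column, lo, hi):
    """Fail if any non-null value in `column` falls outside [lo, hi]."""
    bad = [r[column] for r in rows
           if r.get(column) is not None and not (lo <= r[column] <= hi)]
    return QualityCheck("range", not bad, f"{len(bad)} out-of-range")
```

In a real pipeline, a library like Great Expectations or Pandera replaces these hand-rolled checks with declarative expectations and richer reporting.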
Feature Stores¶
A centralized repository for storing, managing, and serving ML features:
// Feature Store concept
class FeatureStore:
    // Define features
    function register_feature(
        name: str,
        description: str,
        entity: str,
        value_type: Type,
        computation: Function,
        freshness: Duration
    )

    // Get features for training (batch)
    function get_training_features(
        entity_ids: list[str],
        feature_names: list[str],
        timestamp: DateTime  // Point-in-time correct!
    ) -> DataFrame

    // Get features for inference (real-time)
    function get_online_features(
        entity_id: str,
        feature_names: list[str]
    ) -> dict
Why feature stores matter:
- Consistency: Same feature definition for training and inference (prevents training-serving skew).
- Reusability: Features computed once, reused across models.
- Point-in-time correctness: Prevent data leakage during training by fetching features as they existed at each training example's timestamp.
- Real-time serving: Pre-computed features available with low latency for inference.
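The core of point-in-time correctness is a lookup that only ever sees feature values recorded at or before the training example's timestamp. A minimal sketch, with an illustrative list-of-pairs data layout rather than a real feature-store API:

```python
# Point-in-time lookup: return the latest feature value at or before
# query_ts, never a value from the future (which would leak labels).
import bisect

def point_in_time_lookup(feature_history, query_ts):
    """feature_history: list of (timestamp, value) pairs sorted by timestamp."""
    timestamps = [ts for ts, _ in feature_history]
    i = bisect.bisect_right(timestamps, query_ts)
    if i == 0:
        return None  # no feature value existed yet at query_ts
    return feature_history[i - 1][1]
```

Feature stores apply this same "as-of" join at scale when materializing training sets, so offline training and online serving see identical feature values.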
| Feature Store | Type | Key Features |
|---|---|---|
| Feast | Open-source | Lightweight, works with existing infra |
| Tecton | Managed | Real-time features, streaming support |
| Databricks Feature Store | Managed | Integrated with Databricks/Spark |
| SageMaker Feature Store | Managed | AWS native, online + offline |
| Hopsworks | Open-source + managed | Python-centric, great docs |
Data Labeling¶
For supervised learning, labeled data is essential:
| Tool | Features | Best For |
|---|---|---|
| Label Studio | Open-source, multi-modal | General labeling, self-hosted |
| Labelbox | Managed, collaborative | Enterprise, team labeling |
| Scale AI | Managed + workforce | Large-scale, high-quality labels |
| Prodigy | Active learning, efficient | NLP tasks, small teams |
| Argilla | Open-source, LLM-focused | LLM evaluation, RLHF data |
3. Experiment Tracking¶
Experiment tracking records every training run's parameters, metrics, artifacts, and environment to ensure reproducibility and enable comparison.
What to Track¶
| Category | Examples |
|---|---|
| Parameters | Learning rate, batch size, epochs, model architecture |
| Metrics | Loss, accuracy, F1, BLEU, latency, throughput |
| Artifacts | Model weights, plots, predictions, confusion matrices |
| Environment | Python version, library versions, GPU type, OS |
| Data | Dataset version, preprocessing steps, train/val/test splits |
| Code | Git commit hash, diff, branch |
Experiment Tracking Tools¶
| Tool | Type | Key Features |
|---|---|---|
| MLflow | Open-source | Model registry, tracking, deployment, widely adopted |
| Weights & Biases (W&B) | Managed | Beautiful UI, hyperparameter sweeps, artifact tracking |
| Neptune | Managed | Flexible metadata, comparison tools |
| CometML | Managed | Experiment comparison, model production monitoring |
| TensorBoard | Open-source | Training visualization, integrated with TensorFlow/PyTorch |
| Aim | Open-source | Fast, local-first, beautiful visualizations |
Pseudocode (Experiment Tracking)¶
class ExperimentTracker:
    function start_run(name: str, params: dict) -> Run
        run = Run(
            id=generate_id(),
            name=name,
            params=params,
            git_commit=get_git_commit(),
            environment=capture_environment(),
            start_time=now()
        )
        return run

    function log_metric(run: Run, name: str, value: float, step: int = None)
        run.metrics.append(Metric(name=name, value=value, step=step, timestamp=now()))

    function log_artifact(run: Run, path: str, artifact_type: str)
        run.artifacts.append(Artifact(path=path, type=artifact_type, hash=file_hash(path)))

    function end_run(run: Run, status: str = "completed")
        run.end_time = now()
        run.status = status
        run.duration = run.end_time - run.start_time
        self.store.save(run)
4. Model Training Infrastructure¶
Hyperparameter Optimization (HPO)¶
Finding optimal hyperparameters is critical for model performance:
| Method | Approach | Efficiency | When to Use |
|---|---|---|---|
| Grid Search | Try all combinations | O(k^n) - exhaustive | Few hyperparameters, small search space |
| Random Search | Random sampling | Better than grid for high-dim | Moderate search spaces |
| Bayesian Optimization | Model the objective function | Very efficient | Expensive training runs |
| Hyperband / ASHA | Early stopping of bad runs | Very efficient | Large search spaces |
| Population-Based Training | Evolutionary approach | Parallel, adaptive | Distributed training |
Tools: Optuna, Ray Tune, W&B Sweeps, SigOpt.
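Random search, the simplest method that works well in practice, fits in a few lines. This sketch uses a toy objective; in a real HPO run, `objective` would launch a full training job and return a validation metric:

```python
# Random search over uniform ranges, seeded for reproducibility.
import random

def random_search(objective, space, n_trials=50, seed=0):
    """space: dict mapping param name -> (low, high) sampled uniformly."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = objective(params)  # higher is better here
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Libraries like Optuna add the pieces this sketch omits: smarter samplers (Bayesian/TPE), pruning of bad trials, and parallel execution.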
Distributed Training¶
For models too large for a single GPU:
| Strategy | Description | Communication | Use Case |
|---|---|---|---|
| Data Parallel (DP) | Same model on each GPU, different data | All-reduce gradients | Models that fit in 1 GPU |
| Distributed Data Parallel (DDP) | DP with better multi-node support | NCCL all-reduce | Standard distributed training |
| Fully Sharded DP (FSDP) | Shard params + gradients + optimizer states | All-gather when needed | Large models (10B+) |
| Tensor Parallel (TP) | Split layers across GPUs | Point-to-point | Very large layers |
| Pipeline Parallel (PP) | Split model layers across GPUs | Forward/backward between stages | Very deep models |
| 3D Parallelism | DP + TP + PP combined | All of the above | Frontier models (100B+) |
Training Frameworks¶
| Framework | Key Features | Best For |
|---|---|---|
| PyTorch | Dynamic graphs, research-friendly | Most common, flexible |
| PyTorch Lightning | Structured PyTorch, less boilerplate | Production training |
| DeepSpeed (Microsoft) | ZeRO optimizer, mixed precision | Large model training |
| Megatron-LM (NVIDIA) | Tensor/pipeline parallelism | LLM pre-training |
| JAX | Functional, XLA compilation | TPU training, research |
| HuggingFace Transformers | Pre-trained models, Trainer API | Fine-tuning, transfer learning |
| Axolotl | Fine-tuning framework | LLM fine-tuning (LoRA, QLoRA) |
| Unsloth | Optimized fine-tuning | Fast LoRA/QLoRA fine-tuning |
5. Model Registry¶
A model registry is a centralized store for model versions, metadata, and lifecycle states.
Model Registry Operations¶
class ModelRegistry:
    function register_model(
        name: str,
        version: str,
        artifact_path: str,
        metrics: dict,
        parameters: dict,
        training_run_id: str,
        tags: dict = None
    ) -> ModelVersion
        // Store model artifact
        artifact_id = self.artifact_store.upload(artifact_path)
        // Create version entry
        model_version = ModelVersion(
            name=name,
            version=version,
            artifact_id=artifact_id,
            metrics=metrics,
            parameters=parameters,
            training_run_id=training_run_id,
            tags=tags,
            stage="staging",  // Start in staging
            created_at=now()
        )
        self.store.save(model_version)
        return model_version

    function promote_model(name: str, version: str, target_stage: str)
        // Transition: staging -> production (with validation)
        model = self.store.get(name, version)
        if target_stage == "production":
            // Run validation checks
            validation = self.validate_for_production(model)
            if not validation.passed:
                raise ValidationError(validation.failures)
            // Archive current production model
            current_prod = self.get_production_model(name)
            if current_prod:
                current_prod.stage = "archived"
                self.store.save(current_prod)
        model.stage = target_stage
        self.store.save(model)

    function get_production_model(name: str) -> ModelVersion
        return self.store.query(name=name, stage="production")
Model Lifecycle Stages¶
Development → Staging → Production → Archived
│ │ │
└── Failed └── Failed └── Rolled back
6. Model Serving¶
Model serving is the infrastructure for making predictions available to applications.
Serving Patterns¶
Batch Inference¶
Process large datasets offline on a schedule:
class BatchInferenceJob:
    model: Model
    input_source: DataSource  // S3, database, etc.
    output_sink: DataSink

    function run(job_config: dict)
        // Load data
        data = self.input_source.read(job_config["input_path"])
        // Preprocess
        features = self.preprocess(data)
        // Batch predict, tracking per-batch latency and failures
        predictions = []
        latencies = []
        errors = 0
        for batch in features.batches(size=1000):
            batch_start = now()
            try:
                preds = self.model.predict(batch)
                predictions.extend(preds)
            except PredictionError:
                errors += len(batch)
            latencies.append(now() - batch_start)
        // Write results
        self.output_sink.write(predictions, job_config["output_path"])
        // Log metrics
        log_metrics({
            "total_predictions": len(predictions),
            "latency_p99": compute_p99(latencies),
            "error_rate": errors / len(data)
        })
Use cases: Recommendation systems, risk scoring, report generation, bulk email classification.
Real-Time Inference (Online Serving)¶
Low-latency API endpoints for synchronous predictions:
class ModelServer:
    model: Model
    preprocessor: Preprocessor
    postprocessor: Postprocessor
    cache: Cache

    function predict(request: PredictionRequest) -> PredictionResponse
        start_time = now()
        try:
            // Check cache
            cache_key = hash(request)
            cached = self.cache.get(cache_key)
            if cached:
                return cached
            // Preprocess
            features = self.preprocessor.transform(request.data)
            // Validate features
            if not self.validate_features(features):
                return PredictionResponse(error="Invalid input features")
            // Predict
            raw_prediction = self.model.predict(features)
            // Postprocess
            response = self.postprocessor.transform(raw_prediction)
            // Cache result
            self.cache.set(cache_key, response, ttl=300)
            // Log for monitoring
            latency = now() - start_time
            self.log_prediction(request, response, latency)
            return response
        except Exception as e:
            self.log_error(e, request)
            return PredictionResponse(error="Prediction failed", fallback=self.get_fallback())
Key metrics: Latency (p50, p95, p99), throughput (requests/second), error rate, GPU utilization.
Streaming Inference¶
Process continuous data streams (e.g., Kafka, Kinesis):
- Real-time fraud detection on transaction streams.
- Continuous anomaly detection on sensor data.
- Live content moderation on social media posts.
Edge Inference¶
Run models on edge devices (mobile, IoT, embedded):
| Framework | Platforms | Model Formats | Key Features |
|---|---|---|---|
| TensorFlow Lite | Android, iOS, embedded | .tflite | Quantization, delegates |
| ONNX Runtime | Cross-platform | .onnx | Universal format, optimized |
| Core ML | Apple ecosystem | .mlmodel | Hardware-accelerated on Apple devices |
| llama.cpp | Desktop, mobile | GGUF | LLM inference on CPU |
| MediaPipe | Mobile, web | Various | Google's ML pipeline framework |
LLM Serving Infrastructure¶
LLM serving has unique requirements compared to traditional ML:
| System | Key Innovation | Best For |
|---|---|---|
| vLLM | PagedAttention, continuous batching | High-throughput production serving |
| TGI (HuggingFace) | Tensor parallelism, Flash Attention | HuggingFace model ecosystem |
| TensorRT-LLM (NVIDIA) | FP8, in-flight batching | Maximum GPU performance |
| Ollama | Simple local serving | Development, testing |
| llama.cpp | CPU/GPU mixed inference, GGUF | Edge, desktop, cost-sensitive |
| SGLang | RadixAttention, constrained decoding | Structured output generation |
| Ray Serve | Distributed serving, model composition | Multi-model pipelines |
7. Monitoring and Observability¶
ML monitoring goes beyond traditional software monitoring — you must also monitor data quality, model performance, and business impact.
What to Monitor¶
Infrastructure Metrics¶
| Metric | Description | Alert Threshold |
|---|---|---|
| Latency (p50, p95, p99) | Response time distribution | p99 > SLA |
| Throughput | Requests per second | Below expected load |
| Error Rate | Failed predictions | > 1% |
| GPU Utilization | GPU compute usage | < 30% (waste) or > 95% (overloaded) |
| Memory Usage | RAM and GPU memory | > 90% |
| Queue Depth | Pending requests | Growing unboundedly |
Data Drift¶
Data drift occurs when the distribution of input data changes from what the model was trained on:
class DriftDetector:
    reference_stats: DataStatistics  // From training data

    function detect_drift(current_data: DataFrame) -> DriftReport
        drift_scores = {}
        for feature in current_data.columns:
            // Statistical tests
            if feature.is_numerical:
                // Kolmogorov-Smirnov test
                ks_stat, p_value = ks_test(
                    self.reference_stats[feature],
                    current_data[feature]
                )
                drift_scores[feature] = {
                    "test": "KS",
                    "statistic": ks_stat,
                    "p_value": p_value,
                    "drifted": p_value < 0.05
                }
            else:
                // Chi-squared test for categorical
                chi2, p_value = chi2_test(
                    self.reference_stats[feature],
                    current_data[feature]
                )
                drift_scores[feature] = {
                    "test": "chi2",
                    "statistic": chi2,
                    "p_value": p_value,
                    "drifted": p_value < 0.05
                }
            // Population Stability Index (PSI)
            psi = compute_psi(
                self.reference_stats[feature],
                current_data[feature]
            )
            drift_scores[feature]["psi"] = psi
            // PSI > 0.2 indicates significant drift
        return DriftReport(
            features=drift_scores,
            overall_drift=any(d["drifted"] for d in drift_scores.values()),
            recommendation=self.get_recommendation(drift_scores)
        )
Drift Types:
- Data Drift (Covariate Shift): Input distribution changes (e.g., new user demographics).
- Concept Drift: The relationship between input and output changes (e.g., user preferences shift).
- Label Drift: The distribution of target labels changes (e.g., fraud rate increases).
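The PSI referenced above can be computed in a few lines: bin both samples using the reference sample's bin edges, then sum (p_i - q_i) * ln(p_i / q_i) over the bins. A minimal sketch (a small epsilon avoids division by zero in empty bins; real implementations often use quantile-based bins instead):

```python
# Population Stability Index between a reference and a current sample.
import math

def compute_psi(reference, current, n_bins=10, eps=1e-6):
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            i = min(int((x - lo) / width), n_bins - 1)  # clamp overflow bin
            counts[max(i, 0)] += 1
        return [c / len(sample) + eps for c in counts]

    p, q = proportions(reference), proportions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

By the usual rule of thumb, PSI < 0.1 means no significant shift, 0.1-0.2 a moderate shift, and > 0.2 a significant shift that should trigger investigation or retraining.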
Model Performance¶
Monitor prediction quality over time:
class ModelPerformanceMonitor:
    function monitor(
        predictions: list[Prediction],
        ground_truth: list[Label] = None  // May be delayed
    ) -> PerformanceReport
        metrics = {}
        // Prediction distribution
        metrics["prediction_distribution"] = compute_distribution(predictions)
        metrics["prediction_entropy"] = compute_entropy(predictions)
        // If ground truth available (possibly delayed)
        if ground_truth:
            metrics["accuracy"] = compute_accuracy(predictions, ground_truth)
            metrics["f1"] = compute_f1(predictions, ground_truth)
            metrics["calibration"] = compute_calibration(predictions, ground_truth)
        // Detect anomalies
        metrics["anomaly_score"] = self.detect_performance_anomaly(metrics)
        // Compare to baseline
        metrics["degradation"] = self.compare_to_baseline(metrics)
        return PerformanceReport(metrics=metrics)
Monitoring Tools¶
| Tool | Type | Key Features |
|---|---|---|
| Evidently AI | Open-source | Data drift, model performance reports |
| Arize Phoenix | Open-source | LLM traces, embeddings analysis |
| WhyLabs | Managed | Data profiling, drift detection |
| Fiddler | Managed | Explainability, fairness monitoring |
| NannyML | Open-source | Performance estimation without labels |
| Prometheus + Grafana | Open-source | Infrastructure metrics, custom dashboards |
| LangSmith | Managed | LLM-specific tracing and evaluation |
| Langfuse | Open-source | LLM observability, prompt management |
Alerting Strategy¶
// Tiered alerting based on severity
alerts = {
    "critical": {
        "conditions": [
            "error_rate > 5%",
            "latency_p99 > 10s",
            "service_down"
        ],
        "action": "page_on_call",
        "response_time": "5 minutes"
    },
    "warning": {
        "conditions": [
            "data_drift_detected",
            "model_accuracy < baseline - 5%",
            "gpu_utilization < 20%"
        ],
        "action": "notify_team_channel",
        "response_time": "1 hour"
    },
    "info": {
        "conditions": [
            "new_model_version_deployed",
            "retraining_triggered",
            "cost_threshold_approaching"
        ],
        "action": "log_and_dashboard",
        "response_time": "next_business_day"
    }
}
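One way to make such a config executable is to express each condition as a named predicate over current metrics and report the highest-severity tier that fires. The predicate names and thresholds below mirror the config above but are illustrative:

```python
# Tiered alert evaluation: highest severity wins.
SEVERITY_ORDER = ["critical", "warning", "info"]

CONDITIONS = {
    "error_rate_high": lambda m: m.get("error_rate", 0) > 0.05,
    "latency_p99_high": lambda m: m.get("latency_p99_s", 0) > 10,
    "data_drift_detected": lambda m: m.get("drift_psi", 0) > 0.2,
    "gpu_underutilized": lambda m: m.get("gpu_util", 1.0) < 0.20,
}

TIERS = {
    "critical": ["error_rate_high", "latency_p99_high"],
    "warning": ["data_drift_detected", "gpu_underutilized"],
}

def evaluate_alerts(metrics):
    """Return (severity, fired_conditions) for the highest tier that fires."""
    for severity in SEVERITY_ORDER:
        fired = [c for c in TIERS.get(severity, []) if CONDITIONS[c](metrics)]
        if fired:
            return severity, fired
    return None, []
```

In production this evaluation usually lives in the alerting system itself (e.g., Prometheus alert rules) rather than application code, but the tiering logic is the same.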
8. CI/CD for Machine Learning¶
ML Pipeline Stages¶
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Code │──>│ Data │──>│ Train │──>│ Validate│──>│ Deploy │
│ Tests │ │ Tests │ │ Model │ │ Model │ │ Model │
└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
Lint, Schema, Training, Performance Canary,
Unit tests, Quality, Experiment comparison, Blue-green,
Type checks Distribution tracking Fairness, Shadow
checks Safety tests deploy
Model Validation Gates¶
Before promoting a model to production, validate:
class ModelValidationGate:
    function validate(candidate: ModelVersion, current_prod: ModelVersion) -> ValidationResult
        checks = []
        // 1. Performance comparison
        checks.append(self.check_performance(
            candidate, current_prod,
            min_improvement=0.01  // Must be at least 1% better
        ))
        // 2. Latency check
        checks.append(self.check_latency(
            candidate,
            max_p99_ms=100  // Must serve in < 100 ms
        ))
        // 3. Fairness check
        checks.append(self.check_fairness(
            candidate,
            protected_attributes=["gender", "race", "age"],
            max_disparity=0.1  // Max 10% performance gap across groups
        ))
        // 4. Safety check (for LLMs)
        checks.append(self.check_safety(
            candidate,
            test_suite="safety_benchmark",
            max_unsafe_rate=0.001  // < 0.1% unsafe outputs
        ))
        // 5. Resource usage
        checks.append(self.check_resources(
            candidate,
            max_memory_gb=16,
            max_gpu_memory_gb=24
        ))
        return ValidationResult(
            passed=all(c.passed for c in checks),
            checks=checks
        )
Deployment Strategies for Models¶
| Strategy | Risk | Complexity | When to Use |
|---|---|---|---|
| Direct Replacement | High | Low | Non-critical models, dev/staging |
| Blue-Green | Medium | Medium | Quick rollback needed |
| Canary | Low | Medium | Gradual rollout, risk mitigation |
| Shadow | None | High | High-stakes models, new architectures |
| A/B Testing | Low | High | Comparing model variants |
| Multi-Armed Bandit | Low | High | Continuous optimization |
Shadow Deployment: Run the new model alongside production, compare outputs, but only serve the old model's predictions. Validate the new model with real traffic before switching.
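The routing logic for a shadow deployment is small: always serve production, run the candidate on the same input, and record agreement for offline analysis. A minimal sketch, where the two models are any callables and the in-memory list stands in for a real metrics sink:

```python
# Shadow deployment: the shadow model's output is logged, never served.
class ShadowRouter:
    def __init__(self, prod_model, shadow_model):
        self.prod_model = prod_model
        self.shadow_model = shadow_model
        self.comparisons = []  # stand-in for a real metrics/logging sink

    def predict(self, x):
        prod_out = self.prod_model(x)
        try:
            shadow_out = self.shadow_model(x)  # never returned to the caller
            self.comparisons.append({"input": x, "prod": prod_out,
                                     "shadow": shadow_out,
                                     "agree": prod_out == shadow_out})
        except Exception:
            # A shadow failure must never affect the served response.
            self.comparisons.append({"input": x, "shadow_error": True})
        return prod_out

    def agreement_rate(self):
        scored = [c for c in self.comparisons if "agree" in c]
        return sum(c["agree"] for c in scored) / len(scored) if scored else None
```

In practice the shadow call runs asynchronously so it adds no latency, and agreement is measured with task-appropriate metrics (exact match, score deltas) rather than simple equality.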
Pipeline Orchestration Tools¶
| Tool | Type | Key Features |
|---|---|---|
| Kubeflow Pipelines | Open-source | Kubernetes-native, ML-specific |
| Apache Airflow | Open-source | General workflow, widely adopted |
| Prefect | Open-source + managed | Modern Python-native workflows |
| Dagster | Open-source | Data-aware orchestration |
| ZenML | Open-source | ML-specific, stack-agnostic |
| Metaflow (Netflix) | Open-source | Data science workflows, AWS integration |
9. LLMOps¶
LLMOps extends traditional MLOps for the unique challenges of operating LLM-based systems.
LLMOps vs. Traditional MLOps¶
| Aspect | Traditional MLOps | LLMOps |
|---|---|---|
| Model Training | Train custom models | Fine-tune or use API (pre-trained) |
| Primary Tuning | Hyperparameters, features | Prompts, context, fine-tuning |
| Evaluation | Metrics (accuracy, F1) | Metrics + human evaluation + LLM-as-judge |
| Versioning | Model weights + data | Prompt templates + model version + RAG index |
| Cost Drivers | Training compute | Inference tokens (input + output) |
| Failure Modes | Wrong predictions | Hallucination, prompt injection, safety |
| Update Cycle | Retrain periodically | Update prompts, RAG index, or model version |
LLMOps Components¶
┌──────────────────────────────────────────────────────────┐
│ LLMOps Stack │
├──────────────────────────────────────────────────────────┤
│ Prompt Management │ Model Gateway │ Evaluation │
│ - Version control │ - Routing │ - Auto eval │
│ - A/B testing │ - Load balancing │ - Human eval│
│ - Template engine │ - Failover │ - Benchmarks│
├──────────────────────────────────────────────────────────┤
│ RAG Pipeline │ Caching │ Cost Mgmt │
│ - Index management │ - Semantic cache │ - Token │
│ - Embedding updates │ - Prompt cache │ tracking │
│ - Quality monitoring │ - KV cache │ - Budgets │
├──────────────────────────────────────────────────────────┤
│ Observability │ Safety │ Fine-tuning │
│ - Traces │ - Content filter │ - Data prep │
│ - Logs │ - PII detection │ - Training │
│ - Metrics │ - Injection guard │ - Evaluation│
└──────────────────────────────────────────────────────────┘
Prompt Management¶
class PromptManager:
    store: PromptStore  // Database of prompt versions

    function create_version(
        prompt_name: str,
        template: str,
        model: str,
        parameters: dict,
        description: str
    ) -> PromptVersion
        version = PromptVersion(
            name=prompt_name,
            version=self.get_next_version(prompt_name),
            template=template,
            model=model,
            parameters=parameters,
            description=description,
            created_at=now()
        )
        self.store.save(version)
        return version

    function get_active_prompt(prompt_name: str, environment: str) -> PromptVersion
        // Get the currently active version for this environment
        return self.store.get_active(prompt_name, environment)

    function ab_test(
        prompt_name: str,
        version_a: str,
        version_b: str,
        traffic_split: float = 0.5
    )
        // Route traffic between two prompt versions
        self.store.set_ab_test(prompt_name, version_a, version_b, traffic_split)
Model Gateway¶
A model gateway abstracts the LLM provider, enabling provider switching, fallback, and load balancing:
class ModelGateway:
    providers: dict[str, LLMProvider]  // openai, anthropic, local, etc.
    routing_config: RoutingConfig

    function call(
        messages: list[dict],
        model: str = None,
        **kwargs
    ) -> Response
        // Determine provider and model
        provider_name, model_name = self.route(model, messages)
        provider = self.providers[provider_name]
        try:
            response = provider.call(model_name, messages, **kwargs)
            self.log_success(provider_name, model_name, response)
            return response
        except (RateLimitError, ServiceUnavailable):
            // Fall back to an alternative provider, mapping to that provider's equivalent model
            fallback_name, fallback_model = self.routing_config.get_fallback(provider_name, model_name)
            response = self.providers[fallback_name].call(fallback_model, messages, **kwargs)
            self.log_fallback(provider_name, fallback_name, response)
            return response

    function route(model: str, messages: list) -> tuple[str, str]
        // Simple routing: match model to provider
        // Advanced: cost-based, latency-based, or complexity-based routing
        return self.routing_config.resolve(model)
Cost Optimization¶
LLM costs can grow quickly. Key optimization strategies:
| Strategy | Savings | Complexity | Description |
|---|---|---|---|
| Prompt optimization | 20-50% | Low | Reduce prompt length, remove redundancy |
| Caching | 30-80% | Medium | Cache identical and semantically similar queries |
| Model routing | 40-70% | Medium | Use cheaper models for simple queries |
| Batching | 10-30% | Low | Batch multiple requests |
| Quantized models | 50-75% | Medium | Use quantized models for appropriate tasks |
| Context pruning | 20-40% | Medium | Only include necessary context |
| Self-hosted models | 50-90% (at scale) | High | Run open-source models on own infrastructure |
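The savings from model routing follow directly from token pricing arithmetic. A back-of-envelope sketch, with placeholder model names and prices (not current vendor rates):

```python
# Cost of a token-priced workload with and without routing to a cheap model.
PRICES_PER_1K = {                      # (input, output) USD per 1K tokens -- assumed
    "cheap-model": (0.0002, 0.0006),
    "premium-model": (0.01, 0.03),
}

def request_cost(model, input_tokens, output_tokens):
    pin, pout = PRICES_PER_1K[model]
    return (input_tokens / 1000) * pin + (output_tokens / 1000) * pout

def routed_cost(requests, simple_fraction):
    """Total cost if `simple_fraction` of requests go to the cheap model.

    requests: list of (input_tokens, output_tokens) pairs.
    """
    n_cheap = int(len(requests) * simple_fraction)
    total = 0.0
    for i, (tin, tout) in enumerate(requests):
        model = "cheap-model" if i < n_cheap else "premium-model"
        total += request_cost(model, tin, tout)
    return total
```

With prices like these, routing even half of traffic to the cheap model cuts spend by nearly half, which is why a classifier that labels queries "simple" vs. "complex" often pays for itself quickly.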
10. GPU Infrastructure¶
GPU Types for AI Workloads¶
| GPU | Memory | FP16 TFLOPS | Use Case | Approximate Cost |
|---|---|---|---|---|
| NVIDIA A100 | 40/80 GB HBM2e | 312 | Training + inference | ~$2/hr (cloud) |
| NVIDIA H100 | 80 GB HBM3 | 989 | Frontier training | ~$4/hr (cloud) |
| NVIDIA H200 | 141 GB HBM3e | 989 | Large model inference | ~$5/hr (cloud) |
| NVIDIA L4 | 24 GB GDDR6 | 121 | Cost-effective inference | ~$0.5/hr (cloud) |
| NVIDIA T4 | 16 GB GDDR6 | 65 | Budget inference | ~$0.3/hr (cloud) |
| AMD MI300X | 192 GB HBM3 | 1307 | Training + inference | ~$3/hr (cloud) |
| Google TPU v5e | 16 GB HBM | N/A | JAX/TF training | ~$1.2/hr (cloud) |
Cloud AI Platforms¶
| Platform | Key Services | Strengths |
|---|---|---|
| AWS | SageMaker, Bedrock, EC2 (P5, Inf2) | Broadest ecosystem, Inferentia chips |
| GCP | Vertex AI, TPUs, GKE | TPU access, Gemini integration |
| Azure | Azure ML, OpenAI Service | OpenAI partnership, enterprise focus |
| Lambda Labs | GPU cloud | Simple, GPU-focused, competitive pricing |
| Together AI | Inference API + fine-tuning | Open-source model hosting |
| Replicate | Model hosting API | Simple deployment, pay-per-prediction |
| Modal | Serverless GPU | Serverless functions with GPU access |
Infrastructure Sizing¶
// Estimating GPU requirements for LLM inference
function estimate_gpu_requirements(
    model_params_billions: float,
    precision: str = "fp16",  // fp32, fp16, int8, int4
    max_batch_size: int = 32,
    max_sequence_length: int = 4096
) -> dict
    // Model weight memory
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
    weight_memory_gb = model_params_billions * bytes_per_param[precision]
    // KV cache memory (approximate)
    // KV cache per token ≈ 2 (K and V) * num_layers * hidden_size * 2 bytes (fp16)
    // Rough estimate: on the order of 0.5-1 MB per token for a 7B model
    kv_cache_per_token_mb = model_params_billions * 0.15  // Rough scaling
    kv_cache_gb = (kv_cache_per_token_mb * max_sequence_length * max_batch_size) / 1024
    // Total memory needed
    total_memory_gb = weight_memory_gb + kv_cache_gb + 2  // +2 GB overhead
    // Determine GPU configuration
    if total_memory_gb <= 24:
        return {"gpus": "1x L4/T4 (24 GB)", "memory_gb": total_memory_gb}
    elif total_memory_gb <= 80:
        return {"gpus": "1x A100/H100 (80 GB)", "memory_gb": total_memory_gb}
    else:
        num_gpus = ceil(total_memory_gb / 80)
        return {"gpus": f"{num_gpus}x A100/H100", "memory_gb": total_memory_gb}
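The estimator above runs as ordinary Python with the same rough constants, so treat its output as order-of-magnitude guidance only (real KV cache size depends on layer count, hidden size, attention variant, and serving engine):

```python
# Runnable version of the GPU sizing sketch above.
from math import ceil

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def estimate_gpu_requirements(model_params_billions, precision="fp16",
                              max_batch_size=32, max_sequence_length=4096):
    weight_gb = model_params_billions * BYTES_PER_PARAM[precision]
    kv_per_token_mb = model_params_billions * 0.15   # rough scaling constant
    kv_gb = kv_per_token_mb * max_sequence_length * max_batch_size / 1024
    total_gb = weight_gb + kv_gb + 2                 # +2 GB runtime overhead
    if total_gb <= 24:
        gpus = "1x L4/T4 (24 GB)"
    elif total_gb <= 80:
        gpus = "1x A100/H100 (80 GB)"
    else:
        gpus = f"{ceil(total_gb / 80)}x A100/H100"
    return {"gpus": gpus, "memory_gb": round(total_gb, 1)}
```

For example, a 7B model quantized to int4 serving a single 2K-token sequence needs roughly 3.5 GB of weights plus ~2 GB of KV cache, so it fits comfortably on a 24 GB card.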
11. ML System Architecture Patterns¶
Online Prediction Service¶
Client → API Gateway → Load Balancer → Model Server → Model
│
Feature Store (online)
│
Monitoring / Logging
Offline Batch Pipeline¶
Scheduler → Data Pipeline → Feature Pipeline → Training Pipeline
│
Model Registry
│
Validation Gate
│
Deployment Pipeline
LLM Application Architecture¶
User → Application → Prompt Manager → Model Gateway → LLM API
│ │
RAG Pipeline Response Handler
│ │
Vector DB Output Validation
│ │
Embedding Model Monitoring / Tracing
12. Best Practices Summary¶
MLOps Best Practices¶
- Automate everything: Manual steps are error-prone and don't scale. Automate data pipelines, training, validation, and deployment.
- Version all artifacts: Code, data, models, configs, and prompts should all be versioned and reproducible.
- Monitor beyond uptime: Data drift, model performance, and business metrics are as important as infrastructure health.
- Test models like software: Unit tests for data transformations, integration tests for pipelines, performance tests for models.
- Plan for failure: Models degrade silently. Have automated alerts, fallbacks, and rollback procedures.
- Start simple, iterate: Begin with a simple pipeline and add complexity (feature stores, advanced monitoring) as needed.
LLMOps Best Practices¶
- Treat prompts as code: Version control, review, test, and deploy prompts through a CI/CD pipeline.
- Implement model gateways: Abstract the LLM provider to enable switching, fallback, and A/B testing.
- Monitor token costs: Track costs per user, per feature, and per model. Set budgets and alerts.
- Cache aggressively: Semantic caching for similar queries, exact caching for identical requests, prefix caching for shared prompt templates.
- Evaluate continuously: Run automated evaluation (LLM-as-judge) on a sample of production traffic regularly.
- Defense in depth: Implement input validation, output filtering, rate limiting, and content safety at every layer.
- Use the right model for the task: Not every query needs GPT-4. Route simple queries to cheaper, faster models.
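The "cache aggressively" practice can be sketched for the semantic case: reuse a cached answer when a new query's embedding is close enough (by cosine similarity) to a previously answered one. The `embed` callable stands in for a real embedding model; the threshold and linear scan are illustrative (production systems use a vector index):

```python
# Semantic cache: similarity lookup over stored (embedding, response) pairs.
import math

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(qv, e[0]),
                   default=None)
        if best and self._cosine(qv, best[0]) >= self.threshold:
            return best[1]          # cache hit: skip the LLM call entirely
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

The threshold is the key tuning knob: too low and users get answers to someone else's question; too high and the cache rarely hits. Tools like GPTCache package this pattern with real embedding models and vector stores.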