MLOps & AI Infrastructure

MLOps (Machine Learning Operations) is the discipline of deploying, monitoring, and maintaining ML models in production. It extends DevOps and SRE principles to machine learning systems, addressing unique challenges like model versioning, data drift, reproducibility, and the need for continuous retraining. As AI systems move from research to production, MLOps becomes the bridge between data science experimentation and reliable, scalable AI services.

With the rise of LLMs, a new sub-discipline—LLMOps—has emerged, focusing on the specific operational challenges of serving, monitoring, and optimizing large language model applications. This chapter covers the complete MLOps lifecycle, from experiment tracking to production monitoring, infrastructure, and cost optimization.


1. The MLOps Lifecycle

Traditional Software vs. ML Systems

ML systems are fundamentally different from traditional software:

Aspect Traditional Software ML Systems
Logic Explicitly coded Learned from data
Testing Deterministic tests Statistical validation
Versioning Code only Code + data + model + config
Debugging Stack traces, logs Data analysis, model inspection
Failure Modes Crashes, errors Silent degradation, drift
Dependencies Libraries, services + Training data, feature pipelines
Deployment Code deploy Model deploy + data pipeline deploy
Monitoring Uptime, latency, errors + Data drift, model performance, fairness

MLOps Maturity Levels

Level Description Characteristics
Level 0 Manual process Jupyter notebooks, manual deployment, no monitoring
Level 1 ML pipeline automation Automated training, basic CI/CD, simple monitoring
Level 2 CI/CD for ML Automated testing, model validation, A/B testing, full monitoring
Level 3 Full MLOps Automated retraining, feature stores, model governance, self-healing

The ML Lifecycle

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Data Mgmt   │───>│   Training   │───>│  Evaluation  │
│              │    │              │    │              │
│ - Collection │    │ - Feature    │    │ - Metrics    │
│ - Cleaning   │    │   engineering│    │ - Validation │
│ - Versioning │    │ - Training   │    │ - Comparison │
│ - Labeling   │    │ - HPO tuning │    │ - Approval   │
└──────────────┘    └──────────────┘    └──────┬───────┘
                                               │
┌──────────────┐    ┌──────────────┐    ┌──────▼───────┐
│  Monitoring  │<───│   Serving    │<───│  Deployment  │
│              │    │              │    │              │
│ - Data drift │    │ - API/Batch  │    │ - Packaging  │
│ - Model perf │    │ - Scaling    │    │ - Staging    │
│ - Cost       │    │ - Caching    │    │ - Rollout    │
│ - Alerts     │    │ - A/B test   │    │ - Rollback   │
└──────┬───────┘    └──────────────┘    └──────────────┘
       │
       └── Triggers retraining when drift detected

2. Data Management

Data is the foundation of ML systems. "Garbage in, garbage out" applies more strongly to ML than to any other software paradigm.

Data Versioning

Track datasets alongside code to ensure reproducibility:

Tool Approach Key Features
DVC (Data Version Control) Git-like for data Works with Git, supports remote storage (S3, GCS)
LakeFS Git-like branching for data lakes Branch, merge, rollback for data
Delta Lake ACID transactions for data lakes Time travel, schema enforcement
Pachyderm Data-driven pipelines Automatic versioning, lineage tracking

Data Quality

Automated data quality checks should run as part of the ML pipeline:

class DataQualityChecker:
    function validate(dataset: DataFrame) -> QualityReport
        checks = []

        // Schema validation
        checks.append(self.check_schema(dataset))

        // Completeness (missing values)
        checks.append(self.check_completeness(dataset, max_null_pct=0.05))

        // Distribution checks (detect drift from reference)
        checks.append(self.check_distributions(dataset, self.reference_stats))

        // Range validation
        checks.append(self.check_ranges(dataset, self.expected_ranges))

        // Uniqueness (check for duplicates)
        checks.append(self.check_uniqueness(dataset, self.key_columns))

        // Freshness (data isn't stale)
        checks.append(self.check_freshness(dataset, max_age_hours=24))

        return QualityReport(checks=checks, passed=all(c.passed for c in checks))

Tools: Great Expectations, Pandera, Deequ, Soda.
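
To make the checks above concrete, here is a minimal runnable sketch of two of them (completeness and range validation) over plain Python rows; the column name and thresholds are illustrative:

```python
def check_completeness(rows, column, max_null_pct=0.05):
    """Fail if the fraction of missing values in `column` exceeds the threshold."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    null_pct = nulls / len(rows)
    return {"check": f"completeness:{column}", "null_pct": null_pct,
            "passed": null_pct <= max_null_pct}

def check_range(rows, column, lo, hi):
    """Fail if any non-null value in `column` falls outside [lo, hi]."""
    violations = [r[column] for r in rows
                  if r.get(column) is not None and not (lo <= r[column] <= hi)]
    return {"check": f"range:{column}", "violations": len(violations),
            "passed": not violations}

rows = [{"age": 34}, {"age": 29}, {"age": None}, {"age": 41}]
report = [check_completeness(rows, "age", max_null_pct=0.30),
          check_range(rows, "age", 0, 120)]
all_passed = all(c["passed"] for c in report)
```

In a pipeline, a failing report would block the downstream training step rather than merely log a warning.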

Feature Stores

A centralized repository for storing, managing, and serving ML features:

// Feature Store concept
class FeatureStore:
    // Define features
    function register_feature(
        name: str,
        description: str,
        entity: str,
        value_type: Type,
        computation: Function,
        freshness: Duration
    )

    // Get features for training (batch)
    function get_training_features(
        entity_ids: list[str],
        feature_names: list[str],
        timestamp: DateTime  // Point-in-time correct!
    ) -> DataFrame

    // Get features for inference (real-time)
    function get_online_features(
        entity_id: str,
        feature_names: list[str]
    ) -> dict

Why feature stores matter:

  • Consistency: Same feature definition for training and inference (prevents training-serving skew).
  • Reusability: Features computed once, reused across models.
  • Point-in-time correctness: Prevents data leakage during training by fetching features as they existed at each training example's timestamp.
  • Real-time serving: Pre-computed features available with low latency for inference.
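
The point-in-time lookup can be sketched in a few lines: for each training example, take the latest feature value whose timestamp is at or before the example's timestamp, never a later one. This is a minimal sketch; real feature stores perform this as a point-in-time join over many entities, and the integer timestamps here are illustrative:

```python
import bisect

def point_in_time_lookup(feature_history, as_of):
    """Return the latest feature value with timestamp <= as_of.

    feature_history: list of (timestamp, value) sorted by timestamp.
    Returns None if no value existed yet (prevents leaking future data).
    """
    timestamps = [ts for ts, _ in feature_history]
    i = bisect.bisect_right(timestamps, as_of)
    return feature_history[i - 1][1] if i > 0 else None

# Feature values for one entity over time (illustrative).
history = [(100, 0.2), (200, 0.5), (300, 0.9)]

point_in_time_lookup(history, as_of=250)  # -> 0.5 (value as of t=250)
point_in_time_lookup(history, as_of=50)   # -> None (no value existed yet)
```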

Feature Store Type Key Features
Feast Open-source Lightweight, works with existing infra
Tecton Managed Real-time features, streaming support
Databricks Feature Store Managed Integrated with Databricks/Spark
SageMaker Feature Store Managed AWS native, online + offline
Hopsworks Open-source + managed Python-centric, great docs

Data Labeling

For supervised learning, labeled data is essential:

Tool Features Best For
Label Studio Open-source, multi-modal General labeling, self-hosted
Labelbox Managed, collaborative Enterprise, team labeling
Scale AI Managed + workforce Large-scale, high-quality labels
Prodigy Active learning, efficient NLP tasks, small teams
Argilla Open-source, LLM-focused LLM evaluation, RLHF data

3. Experiment Tracking

Experiment tracking records every training run's parameters, metrics, artifacts, and environment to ensure reproducibility and enable comparison.

What to Track

Category Examples
Parameters Learning rate, batch size, epochs, model architecture
Metrics Loss, accuracy, F1, BLEU, latency, throughput
Artifacts Model weights, plots, predictions, confusion matrices
Environment Python version, library versions, GPU type, OS
Data Dataset version, preprocessing steps, train/val/test splits
Code Git commit hash, diff, branch
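
A small stdlib sketch of capturing some of the environment fields above. Git commit and library versions are omitted here, since capturing them would assume a git checkout and an installed-package inventory:

```python
import platform
import sys

def capture_environment():
    """Snapshot of the runtime environment to store alongside each run."""
    return {
        "python_version": sys.version.split()[0],  # e.g. "3.11.4"
        "platform": platform.platform(),           # OS + kernel string
        "machine": platform.machine(),             # e.g. "x86_64", "arm64"
    }

env = capture_environment()
```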

Experiment Tracking Tools

Tool Type Key Features
MLflow Open-source Model registry, tracking, deployment, widely adopted
Weights & Biases (W&B) Managed Beautiful UI, hyperparameter sweeps, artifact tracking
Neptune Managed Flexible metadata, comparison tools
Comet ML Managed Experiment comparison, production model monitoring
TensorBoard Open-source Training visualization, integrated with TensorFlow/PyTorch
Aim Open-source Fast, local-first, beautiful visualizations

Pseudocode (Experiment Tracking)

class ExperimentTracker:
    function start_run(name: str, params: dict) -> Run
        run = Run(
            id=generate_id(),
            name=name,
            params=params,
            git_commit=get_git_commit(),
            environment=capture_environment(),
            start_time=now()
        )
        return run

    function log_metric(run: Run, name: str, value: float, step: int = None)
        run.metrics.append(Metric(name=name, value=value, step=step, timestamp=now()))

    function log_artifact(run: Run, path: str, artifact_type: str)
        run.artifacts.append(Artifact(path=path, type=artifact_type, hash=file_hash(path)))

    function end_run(run: Run, status: str = "completed")
        run.end_time = now()
        run.status = status
        run.duration = run.end_time - run.start_time
        self.store.save(run)

4. Model Training Infrastructure

Hyperparameter Optimization (HPO)

Finding optimal hyperparameters is critical for model performance:

Method Approach Efficiency When to Use
Grid Search Try all combinations O(k^n) - exhaustive Few hyperparameters, small search space
Random Search Random sampling Better than grid for high-dim Moderate search spaces
Bayesian Optimization Model the objective function Very efficient Expensive training runs
Hyperband / ASHA Early stopping of bad runs Very efficient Large search spaces
Population-Based Training Evolutionary approach Parallel, adaptive Distributed training

Tools: Optuna, Ray Tune, W&B Sweeps, SigOpt.
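
Random search is simple enough to sketch in a few lines. The search space and the toy objective below are illustrative stand-ins for a real validation metric:

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Sample configs uniformly from `space`; return the best (config, score)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy objective standing in for a validation metric (higher is better).
def objective(cfg):
    return -abs(cfg["lr"] - 0.01) - 0.1 * abs(cfg["batch_size"] - 64) / 64

space = {"lr": [0.1, 0.03, 0.01, 0.003], "batch_size": [16, 32, 64, 128]}
best_cfg, best_score = random_search(objective, space, n_trials=30)
```

Bayesian optimization and Hyperband follow the same outer loop but replace the uniform sampling with a surrogate model or early stopping of poor trials, respectively.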

Distributed Training

For models too large for a single GPU:

Strategy Description Communication Use Case
Data Parallel (DP) Same model on each GPU, different data All-reduce gradients Models that fit in 1 GPU
Distributed Data Parallel (DDP) DP with better multi-node support NCCL all-reduce Standard distributed training
Fully Sharded DP (FSDP) Shard params + gradients + optimizer states All-gather when needed Large models (10B+)
Tensor Parallel (TP) Split layers across GPUs Point-to-point Very large layers
Pipeline Parallel (PP) Split model layers across GPUs Forward/backward between stages Very deep models
3D Parallelism DP + TP + PP combined All of the above Frontier models (100B+)
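
The data-parallel row above can be illustrated with a pure-Python simulation: each "worker" computes a gradient on its own data shard, the gradients are averaged (the all-reduce step), and every replica applies the identical update. This is a conceptual sketch, not a real multi-GPU setup:

```python
def local_gradient(weights, shard):
    """Gradient of mean squared error for y = w*x on one data shard."""
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]

def all_reduce_mean(grads):
    """Average gradients element-wise across workers (the all-reduce step)."""
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(grads[0]))]

# Two workers, each holding its own shard of (x, y) pairs from y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0]
for _ in range(200):                      # synchronous SGD steps
    grads = [local_gradient(weights, s) for s in shards]
    avg = all_reduce_mean(grads)          # every replica sees the same gradient
    weights = [w - 0.01 * g for w, g in zip(weights, avg)]
# weights converges toward [2.0] on every replica simultaneously
```

FSDP, tensor, and pipeline parallelism change *what* is sharded (parameters, layers, stages) but keep the same principle of synchronized collective communication.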

Training Frameworks

Framework Key Features Best For
PyTorch Dynamic graphs, research-friendly Most common, flexible
PyTorch Lightning Structured PyTorch, less boilerplate Production training
DeepSpeed (Microsoft) ZeRO optimizer, mixed precision Large model training
Megatron-LM (NVIDIA) Tensor/pipeline parallelism LLM pre-training
JAX Functional, XLA compilation TPU training, research
HuggingFace Transformers Pre-trained models, Trainer API Fine-tuning, transfer learning
Axolotl Fine-tuning framework LLM fine-tuning (LoRA, QLoRA)
Unsloth Optimized fine-tuning Fast LoRA/QLoRA fine-tuning

5. Model Registry

A model registry is a centralized store for model versions, metadata, and lifecycle states.

Model Registry Operations

class ModelRegistry:
    function register_model(
        name: str,
        version: str,
        artifact_path: str,
        metrics: dict,
        parameters: dict,
        training_run_id: str,
        tags: dict = None
    ) -> ModelVersion
        // Store model artifact
        artifact_id = self.artifact_store.upload(artifact_path)

        // Create version entry
        model_version = ModelVersion(
            name=name,
            version=version,
            artifact_id=artifact_id,
            metrics=metrics,
            parameters=parameters,
            training_run_id=training_run_id,
            tags=tags,
            stage="staging",  // Start in staging
            created_at=now()
        )

        self.store.save(model_version)
        return model_version

    function promote_model(name: str, version: str, target_stage: str)
        // Transition: staging -> production (with validation)
        model = self.store.get(name, version)

        if target_stage == "production":
            // Run validation checks
            validation = self.validate_for_production(model)
            if not validation.passed:
                raise ValidationError(validation.failures)

            // Archive current production model
            current_prod = self.get_production_model(name)
            if current_prod:
                current_prod.stage = "archived"
                self.store.save(current_prod)

        model.stage = target_stage
        self.store.save(model)

    function get_production_model(name: str) -> ModelVersion
        return self.store.query(name=name, stage="production")

Model Lifecycle Stages

Development → Staging → Production → Archived
     │           │          │
     └── Failed  └── Failed └── Rolled back

6. Model Serving

Model serving is the infrastructure for making predictions available to applications.

Serving Patterns

Batch Inference

Process large datasets offline on a schedule:

class BatchInferenceJob:
    model: Model
    input_source: DataSource   // S3, database, etc.
    output_sink: DataSink

    function run(job_config: dict)
        // Load data
        data = self.input_source.read(job_config["input_path"])

        // Preprocess
        features = self.preprocess(data)

        // Batch predict (tracking per-batch latency and failures)
        predictions = []
        latencies = []
        errors = 0
        for batch in features.batches(size=1000):
            batch_start = now()
            try:
                preds = self.model.predict(batch)
                predictions.extend(preds)
            except PredictionError:
                errors += len(batch)
            latencies.append(now() - batch_start)

        // Write results
        self.output_sink.write(predictions, job_config["output_path"])

        // Log metrics
        log_metrics({
            "total_predictions": len(predictions),
            "latency_p99": compute_p99(latencies),
            "error_rate": errors / len(data)
        })

Use cases: Recommendation systems, risk scoring, report generation, bulk email classification.

Real-Time Inference (Online Serving)

Low-latency API endpoints for synchronous predictions:

class ModelServer:
    model: Model
    preprocessor: Preprocessor
    postprocessor: Postprocessor
    cache: Cache

    function predict(request: PredictionRequest) -> PredictionResponse
        start_time = now()

        try:
            // Check cache
            cache_key = hash(request)
            cached = self.cache.get(cache_key)
            if cached:
                return cached

            // Preprocess
            features = self.preprocessor.transform(request.data)

            // Validate features
            if not self.validate_features(features):
                return PredictionResponse(error="Invalid input features")

            // Predict
            raw_prediction = self.model.predict(features)

            // Postprocess
            response = self.postprocessor.transform(raw_prediction)

            // Cache result
            self.cache.set(cache_key, response, ttl=300)

            // Log for monitoring
            latency = now() - start_time
            self.log_prediction(request, response, latency)

            return response

        except Exception as e:
            self.log_error(e, request)
            return PredictionResponse(error="Prediction failed", fallback=self.get_fallback())

Key metrics: Latency (p50, p95, p99), throughput (requests/second), error rate, GPU utilization.
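
A minimal sketch of the TTL cache the server above relies on. The injectable clock is an assumption added here so expiry can be exercised deterministically:

```python
import time

class TTLCache:
    """Minimal in-process prediction cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._entries = {}
        self._clock = clock  # injectable for deterministic testing

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazily evict stale entries
            return None
        return value

    def set(self, key, value, ttl=300):
        self._entries[key] = (value, self._clock() + ttl)

# Usage with a fake clock so expiry is deterministic.
t = [0.0]
cache = TTLCache(clock=lambda: t[0])
cache.set("req-1", {"score": 0.9}, ttl=300)
hit = cache.get("req-1")      # within TTL
t[0] = 301.0
miss = cache.get("req-1")     # expired
```

A production cache would typically be external (e.g., Redis) so that all server replicas share hits.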

Streaming Inference

Process continuous data streams (e.g., Kafka, Kinesis):

  • Real-time fraud detection on transaction streams.
  • Continuous anomaly detection on sensor data.
  • Live content moderation on social media posts.

Edge Inference

Run models on edge devices (mobile, IoT, embedded):

Framework Platforms Model Formats Key Features
TensorFlow Lite Android, iOS, embedded .tflite Quantization, delegates
ONNX Runtime Cross-platform .onnx Universal format, optimized
Core ML Apple ecosystem .mlmodel Hardware-accelerated on Apple devices
llama.cpp Desktop, mobile GGUF LLM inference on CPU
MediaPipe Mobile, web Various Google's ML pipeline framework

LLM Serving Infrastructure

LLM serving has unique requirements compared to traditional ML:

System Key Innovation Best For
vLLM PagedAttention, continuous batching High-throughput production serving
TGI (HuggingFace) Tensor parallelism, Flash Attention HuggingFace model ecosystem
TensorRT-LLM (NVIDIA) FP8, in-flight batching Maximum GPU performance
Ollama Simple local serving Development, testing
llama.cpp CPU/GPU mixed inference, GGUF Edge, desktop, cost-sensitive
SGLang RadixAttention, constrained decoding Structured output generation
Ray Serve Distributed serving, model composition Multi-model pipelines

7. Monitoring and Observability

ML monitoring goes beyond traditional software monitoring — you must also monitor data quality, model performance, and business impact.

What to Monitor

Infrastructure Metrics

Metric Description Alert Threshold
Latency (p50, p95, p99) Response time distribution p99 > SLA
Throughput Requests per second Below expected load
Error Rate Failed predictions > 1%
GPU Utilization GPU compute usage < 30% (waste) or > 95% (overloaded)
Memory Usage RAM and GPU memory > 90%
Queue Depth Pending requests Growing unboundedly

Data Drift

Data drift occurs when the distribution of input data changes from what the model was trained on:

class DriftDetector:
    reference_stats: DataStatistics  // From training data

    function detect_drift(current_data: DataFrame) -> DriftReport
        drift_scores = {}

        for feature in current_data.columns:
            // Statistical tests
            if feature.is_numerical:
                // Kolmogorov-Smirnov test
                ks_stat, p_value = ks_test(
                    self.reference_stats[feature],
                    current_data[feature]
                )
                drift_scores[feature] = {
                    "test": "KS",
                    "statistic": ks_stat,
                    "p_value": p_value,
                    "drifted": p_value < 0.05
                }
            else:
                // Chi-squared test for categorical
                chi2, p_value = chi2_test(
                    self.reference_stats[feature],
                    current_data[feature]
                )
                drift_scores[feature] = {
                    "test": "chi2",
                    "statistic": chi2,
                    "p_value": p_value,
                    "drifted": p_value < 0.05
                }

            // Population Stability Index (PSI)
            psi = compute_psi(
                self.reference_stats[feature],
                current_data[feature]
            )
            drift_scores[feature]["psi"] = psi
            // PSI > 0.2 indicates significant drift

        return DriftReport(
            features=drift_scores,
            overall_drift=any(d["drifted"] for d in drift_scores.values()),
            recommendation=self.get_recommendation(drift_scores)
        )

Drift Types:

  • Data Drift (Covariate Shift): The input distribution changes (e.g., new user demographics).
  • Concept Drift: The relationship between input and output changes (e.g., user preferences shift).
  • Label Drift: The distribution of target labels changes (e.g., fraud rate increases).
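
The PSI check in the pseudocode can be made concrete with a small stdlib implementation; the bin edges and the epsilon guard against log(0) are illustrative choices:

```python
import math

def compute_psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two numeric samples.

    PSI = sum_i (a_i - e_i) * ln(a_i / e_i) over bin proportions.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift.
    """
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            i = sum(v > edge for edge in bin_edges)  # index of containing bin
            counts[i] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
same = compute_psi(reference, reference, bin_edges=[3, 6])           # ~0: no drift
shifted = compute_psi(reference, [v + 6 for v in reference], bin_edges=[3, 6])
```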

Model Performance

Monitor prediction quality over time:

class ModelPerformanceMonitor:
    function monitor(
        predictions: list[Prediction],
        ground_truth: list[Label] = None  // May be delayed
    ) -> PerformanceReport

        metrics = {}

        // Prediction distribution
        metrics["prediction_distribution"] = compute_distribution(predictions)
        metrics["prediction_entropy"] = compute_entropy(predictions)

        // If ground truth available (possibly delayed)
        if ground_truth:
            metrics["accuracy"] = compute_accuracy(predictions, ground_truth)
            metrics["f1"] = compute_f1(predictions, ground_truth)
            metrics["calibration"] = compute_calibration(predictions, ground_truth)

        // Detect anomalies
        metrics["anomaly_score"] = self.detect_performance_anomaly(metrics)

        // Compare to baseline
        metrics["degradation"] = self.compare_to_baseline(metrics)

        return PerformanceReport(metrics=metrics)

Monitoring Tools

Tool Type Key Features
Evidently AI Open-source Data drift, model performance reports
Arize Phoenix Open-source LLM traces, embeddings analysis
WhyLabs Managed Data profiling, drift detection
Fiddler Managed Explainability, fairness monitoring
NannyML Open-source Performance estimation without labels
Prometheus + Grafana Open-source Infrastructure metrics, custom dashboards
LangSmith Managed LLM-specific tracing and evaluation
Langfuse Open-source LLM observability, prompt management

Alerting Strategy

// Tiered alerting based on severity
alerts = {
    "critical": {
        "conditions": [
            "error_rate > 5%",
            "latency_p99 > 10s",
            "service_down"
        ],
        "action": "page_on_call",
        "response_time": "5 minutes"
    },
    "warning": {
        "conditions": [
            "data_drift_detected",
            "model_accuracy < baseline - 5%",
            "gpu_utilization < 20%"
        ],
        "action": "notify_team_channel",
        "response_time": "1 hour"
    },
    "info": {
        "conditions": [
            "new_model_version_deployed",
            "retraining_triggered",
            "cost_threshold_approaching"
        ],
        "action": "log_and_dashboard",
        "response_time": "next_business_day"
    }
}

8. CI/CD for Machine Learning

ML Pipeline Stages

┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  Code   │──>│  Data   │──>│  Train  │──>│ Validate│──>│ Deploy  │
│  Tests  │   │  Tests  │   │  Model  │   │  Model  │   │  Model  │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
    │              │              │              │              │
    ▼              ▼              ▼              ▼              ▼
  Lint,         Schema,       Training,     Performance    Canary,
  Unit tests,   Quality,      Experiment    comparison,    Blue-green,
  Type checks   Distribution  tracking      Fairness,      Shadow
                checks                      Safety tests    deploy

Model Validation Gates

Before promoting a model to production, validate:

class ModelValidationGate:
    function validate(candidate: ModelVersion, current_prod: ModelVersion) -> ValidationResult
        checks = []

        // 1. Performance comparison
        checks.append(self.check_performance(
            candidate, current_prod,
            min_improvement=0.01  // Must be at least 1% better
        ))

        // 2. Latency check
        checks.append(self.check_latency(
            candidate,
            max_p99_ms=100  // Must serve in < 100ms
        ))

        // 3. Fairness check
        checks.append(self.check_fairness(
            candidate,
            protected_attributes=["gender", "race", "age"],
            max_disparity=0.1  // Max 10% performance gap across groups
        ))

        // 4. Safety check (for LLMs)
        checks.append(self.check_safety(
            candidate,
            test_suite="safety_benchmark",
            max_unsafe_rate=0.001  // < 0.1% unsafe outputs
        ))

        // 5. Resource usage
        checks.append(self.check_resources(
            candidate,
            max_memory_gb=16,
            max_gpu_memory_gb=24
        ))

        return ValidationResult(
            passed=all(c.passed for c in checks),
            checks=checks
        )

Deployment Strategies for Models

Strategy Risk Complexity When to Use
Direct Replacement High Low Non-critical models, dev/staging
Blue-Green Medium Medium Quick rollback needed
Canary Low Medium Gradual rollout, risk mitigation
Shadow None High High-stakes models, new architectures
A/B Testing Low High Comparing model variants
Multi-Armed Bandit Low High Continuous optimization

Shadow Deployment: Run the new model alongside production, compare outputs, but only serve the old model's predictions. Validate the new model with real traffic before switching.
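
Canary rollouts are often implemented by hashing a stable request attribute, so each user consistently hits the same model variant rather than flapping between them. A hedged sketch (the 5% split and the user-id scheme are illustrative):

```python
import hashlib

def route_variant(user_id, canary_pct=5):
    """Deterministically route a user to 'canary' or 'stable'.

    Hashing the user id (rather than random sampling) keeps assignment
    sticky: the same user always sees the same model version.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_pct else "stable"

assignments = [route_variant(f"user-{i}", canary_pct=5) for i in range(1000)]
canary_share = assignments.count("canary") / len(assignments)
```

Increasing `canary_pct` in steps (5% -> 25% -> 100%) while watching the monitoring dashboards is the gradual-rollout pattern from the table above.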

Pipeline Orchestration Tools

Tool Type Key Features
Kubeflow Pipelines Open-source Kubernetes-native, ML-specific
Apache Airflow Open-source General workflow, widely adopted
Prefect Open-source + managed Modern Python-native workflows
Dagster Open-source Data-aware orchestration
ZenML Open-source ML-specific, stack-agnostic
Metaflow (Netflix) Open-source Data science workflows, AWS integration

9. LLMOps

LLMOps extends traditional MLOps for the unique challenges of operating LLM-based systems.

LLMOps vs. Traditional MLOps

Aspect Traditional MLOps LLMOps
Model Training Train custom models Fine-tune or use API (pre-trained)
Primary Tuning Hyperparameters, features Prompts, context, fine-tuning
Evaluation Metrics (accuracy, F1) Metrics + human evaluation + LLM-as-judge
Versioning Model weights + data Prompt templates + model version + RAG index
Cost Drivers Training compute Inference tokens (input + output)
Failure Modes Wrong predictions Hallucination, prompt injection, safety
Update Cycle Retrain periodically Update prompts, RAG index, or model version

LLMOps Components

┌──────────────────────────────────────────────────────────┐
│                    LLMOps Stack                           │
├──────────────────────────────────────────────────────────┤
│  Prompt Management    │  Model Gateway     │  Evaluation  │
│  - Version control    │  - Routing         │  - Auto eval │
│  - A/B testing        │  - Load balancing  │  - Human eval│
│  - Template engine    │  - Failover        │  - Benchmarks│
├──────────────────────────────────────────────────────────┤
│  RAG Pipeline         │  Caching           │  Cost Mgmt   │
│  - Index management   │  - Semantic cache  │  - Token     │
│  - Embedding updates  │  - Prompt cache    │    tracking  │
│  - Quality monitoring │  - KV cache        │  - Budgets   │
├──────────────────────────────────────────────────────────┤
│  Observability        │  Safety            │  Fine-tuning │
│  - Traces             │  - Content filter  │  - Data prep │
│  - Logs               │  - PII detection   │  - Training  │
│  - Metrics            │  - Injection guard │  - Evaluation│
└──────────────────────────────────────────────────────────┘

Prompt Management

class PromptManager:
    store: PromptStore  // Database of prompt versions

    function create_version(
        prompt_name: str,
        template: str,
        model: str,
        parameters: dict,
        description: str
    ) -> PromptVersion
        version = PromptVersion(
            name=prompt_name,
            version=self.get_next_version(prompt_name),
            template=template,
            model=model,
            parameters=parameters,
            description=description,
            created_at=now()
        )
        self.store.save(version)
        return version

    function get_active_prompt(prompt_name: str, environment: str) -> PromptVersion
        // Get the currently active version for this environment
        return self.store.get_active(prompt_name, environment)

    function ab_test(
        prompt_name: str,
        version_a: str,
        version_b: str,
        traffic_split: float = 0.5
    )
        // Route traffic between two prompt versions
        self.store.set_ab_test(prompt_name, version_a, version_b, traffic_split)
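
Rendering a stored prompt version can be sketched with the stdlib string.Template; the template text and model identifier below are illustrative, not any provider's real values:

```python
from string import Template

# A prompt version as the PromptManager above might store it (illustrative).
prompt_version = {
    "name": "summarize",
    "version": 3,
    "template": "Summarize the following text in $style style:\n\n$text",
    "model": "example-model",  # placeholder model identifier
}

def render(prompt_version, **variables):
    """Fill the stored template; substitute() fails loudly on missing variables."""
    return Template(prompt_version["template"]).substitute(**variables)

rendered = render(prompt_version, style="bullet-point",
                  text="MLOps extends DevOps to ML systems.")
```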

Model Gateway

A model gateway abstracts the LLM provider, enabling provider switching, fallback, and load balancing:

class ModelGateway:
    providers: dict[str, LLMProvider]  // openai, anthropic, local, etc.
    routing_config: RoutingConfig

    function call(
        messages: list[dict],
        model: str = None,
        **kwargs
    ) -> Response
        // Determine provider and model
        provider_name, model_name = self.route(model, messages)
        provider = self.providers[provider_name]

        try:
            response = provider.call(model_name, messages, **kwargs)
            self.log_success(provider_name, model_name, response)
            return response

        except (RateLimitError, ServiceUnavailable):
            // Fallback to alternative provider
            fallback = self.routing_config.get_fallback(provider_name)
            response = self.providers[fallback].call(model_name, messages, **kwargs)
            self.log_fallback(provider_name, fallback, response)
            return response

    function route(model: str, messages: list) -> tuple[str, str]
        // Simple routing: match model to provider
        // Advanced: cost-based, latency-based, or complexity-based routing
        return self.routing_config.resolve(model)

Cost Optimization

LLM costs can grow quickly. Key optimization strategies:

Strategy Savings Complexity Description
Prompt optimization 20-50% Low Reduce prompt length, remove redundancy
Caching 30-80% Medium Cache identical and semantically similar queries
Model routing 40-70% Medium Use cheaper models for simple queries
Batching 10-30% Low Batch multiple requests
Quantized models 50-75% Medium Use quantized models for appropriate tasks
Context pruning 20-40% Medium Only include necessary context
Self-hosted models 50-90% (at scale) High Run open-source models on own infrastructure
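
Model-routing savings come straight from per-token arithmetic. A small sketch (the per-million-token prices are placeholders, not any provider's actual rates):

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1e6

# Illustrative comparison: routing a simple query to a cheaper model.
# Prices below are placeholders, not real provider pricing.
big = estimate_cost(2000, 500, price_in_per_m=10.0, price_out_per_m=30.0)
small = estimate_cost(2000, 500, price_in_per_m=0.5, price_out_per_m=1.5)
savings_pct = 100 * (1 - small / big)  # -> 95.0 for these placeholder prices
```

Multiplied across millions of requests, routing even a fraction of traffic to the smaller model dominates the other line items in the table.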

10. GPU Infrastructure

GPU Types for AI Workloads

GPU Memory FP16 TFLOPS Use Case Approximate Cost
NVIDIA A100 40/80 GB HBM2e 312 Training + inference ~$2/hr (cloud)
NVIDIA H100 80 GB HBM3 989 Frontier training ~$4/hr (cloud)
NVIDIA H200 141 GB HBM3e 989 Large model inference ~$5/hr (cloud)
NVIDIA L4 24 GB GDDR6 121 Cost-effective inference ~$0.5/hr (cloud)
NVIDIA T4 16 GB GDDR6 65 Budget inference ~$0.3/hr (cloud)
AMD MI300X 192 GB HBM3 1307 Training + inference ~$3/hr (cloud)
Google TPU v5e 16 GB HBM N/A JAX/TF training ~$1.2/hr (cloud)

Cloud AI Platforms

Platform Key Services Strengths
AWS SageMaker, Bedrock, EC2 (P5, Inf2) Broadest ecosystem, Inferentia chips
GCP Vertex AI, TPUs, GKE TPU access, Gemini integration
Azure Azure ML, OpenAI Service OpenAI partnership, enterprise focus
Lambda Labs GPU cloud Simple, GPU-focused, competitive pricing
Together AI Inference API + fine-tuning Open-source model hosting
Replicate Model hosting API Simple deployment, pay-per-prediction
Modal Serverless GPU Serverless functions with GPU access

Infrastructure Sizing

# Estimating GPU memory requirements for LLM inference

import math

def estimate_gpu_requirements(
    model_params_billions: float,
    precision: str = "fp16",  # fp32, fp16, int8, int4
    max_batch_size: int = 32,
    max_sequence_length: int = 4096,
) -> dict:
    # Model weight memory
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
    weight_memory_gb = model_params_billions * bytes_per_param[precision]

    # KV cache memory (approximate)
    # KV cache per token ≈ 2 * num_layers * hidden_size * 2 bytes (fp16);
    # rough rule of thumb: ~1 MB per token for a 7B model
    kv_cache_per_token_mb = model_params_billions * 0.15  # rough scaling
    kv_cache_gb = (kv_cache_per_token_mb * max_sequence_length * max_batch_size) / 1024

    # Total memory needed (+2 GB runtime overhead)
    total_memory_gb = weight_memory_gb + kv_cache_gb + 2

    # Determine GPU configuration
    if total_memory_gb <= 24:
        return {"gpus": "1x L4/T4 (24 GB)", "memory_gb": total_memory_gb}
    elif total_memory_gb <= 80:
        return {"gpus": "1x A100/H100 (80 GB)", "memory_gb": total_memory_gb}
    else:
        num_gpus = math.ceil(total_memory_gb / 80)
        return {"gpus": f"{num_gpus}x A100/H100", "memory_gb": total_memory_gb}

11. ML System Architecture Patterns

Online Prediction Service

Client → API Gateway → Load Balancer → Model Server → Model
                                           │
                                    Feature Store (online)
                                           │
                                    Monitoring / Logging

Offline Batch Pipeline

Scheduler → Data Pipeline → Feature Pipeline → Training Pipeline
                                                      │
                                               Model Registry
                                                      │
                                               Validation Gate
                                                      │
                                               Deployment Pipeline

LLM Application Architecture

User → Application → Prompt Manager → Model Gateway → LLM API
                         │                                │
                    RAG Pipeline                    Response Handler
                         │                                │
                    Vector DB                    Output Validation
                         │                                │
                    Embedding Model               Monitoring / Tracing

12. Best Practices Summary

MLOps Best Practices

  1. Automate everything: Manual steps are error-prone and don't scale. Automate data pipelines, training, validation, and deployment.
  2. Version all artifacts: Code, data, models, configs, and prompts should all be versioned and reproducible.
  3. Monitor beyond uptime: Data drift, model performance, and business metrics are as important as infrastructure health.
  4. Test models like software: Unit tests for data transformations, integration tests for pipelines, performance tests for models.
  5. Plan for failure: Models degrade silently. Have automated alerts, fallbacks, and rollback procedures.
  6. Start simple, iterate: Begin with a simple pipeline and add complexity (feature stores, advanced monitoring) as needed.

LLMOps Best Practices

  1. Treat prompts as code: Version control, review, test, and deploy prompts through a CI/CD pipeline.
  2. Implement model gateways: Abstract the LLM provider to enable switching, fallback, and A/B testing.
  3. Monitor token costs: Track costs per user, per feature, and per model. Set budgets and alerts.
  4. Cache aggressively: Semantic caching for similar queries, exact caching for identical requests, prefix caching for shared prompt templates.
  5. Evaluate continuously: Run automated evaluation (LLM-as-judge) on a sample of production traffic regularly.
  6. Defense in depth: Implement input validation, output filtering, rate limiting, and content safety at every layer.
  7. Use the right model for the task: Not every query needs GPT-4. Route simple queries to cheaper, faster models.