ML Systems + Inference Optimization

High-Performance Inference and MLOps for Real-Time AI

Built GPU-accelerated inference and deployment pipelines that doubled throughput, cut latency by 60%, and made deployments reproducible and continuously validated.

2x throughput improvement

60% latency reduction

50% IPC latency reduction

30% faster deployments

The problem

Production AI workloads need predictable latency, efficient memory usage, reproducible deployment, and continuous validation.

What I built

Engineered real-time and batch inference pipelines using ONNX, TensorRT, CUDA, GStreamer, MLflow, Docker, AWS, CI/CD, and edge-device runners.
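
As an illustration of how one pipeline can target both GPU servers and edge-device runners, here is a minimal sketch of execution-provider fallback for ONNX Runtime. It is not the production code: `PREFERRED_PROVIDERS` and `select_providers` are hypothetical names, and the priority order is an assumption. The provider strings themselves are ONNX Runtime's own identifiers.

```python
from typing import Iterable, List

# Assumed priority: fastest backend first, CPU as the universal fallback.
# The strings are ONNX Runtime execution-provider identifiers.
PREFERRED_PROVIDERS = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def select_providers(available: Iterable[str]) -> List[str]:
    """Return the preferred providers present on this machine, in priority order."""
    present = set(available)
    chosen = [p for p in PREFERRED_PROVIDERS if p in present]
    if not chosen:
        raise RuntimeError("no supported execution provider available")
    return chosen


if __name__ == "__main__":
    # On a real host the list would come from onnxruntime.get_available_providers().
    print(select_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

The returned list can be passed as the `providers` argument of `onnxruntime.InferenceSession`, so the same model artifact runs on a TensorRT-capable server, a CUDA-only box, or a CPU-only runner without code changes.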

Technical approach

  • TensorRT and ONNX Runtime optimization
  • CUDA-aware scheduling
  • GStreamer pipelines with GPU-accelerated decode/encode
  • MLflow deployment tracking
  • Dockerized CI/CD workflows
  • Hardware-in-the-loop validation
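
The validation step can be sketched as a latency gate that a CI job runs against timings collected on target hardware. This is a hedged example, not the actual harness: `percentile`, `passes_latency_gate`, the nearest-rank percentile method, and the 10% tolerance are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def passes_latency_gate(
    samples_ms: Sequence[float],
    baseline_p95_ms: float,
    tolerance: float = 0.10,  # assumed: allow up to 10% regression
) -> bool:
    """True if the measured p95 latency stays within tolerance of the baseline."""
    return percentile(samples_ms, 95) <= baseline_p95_ms * (1 + tolerance)


if __name__ == "__main__":
    # Synthetic timings for illustration only; real samples come from the device.
    timings = [float(t) for t in range(1, 101)]
    print(percentile(timings, 95), passes_latency_gate(timings, baseline_p95_ms=90.0))
```

Gating on a tail percentile rather than the mean keeps occasional slow frames from hiding behind a good average, which matters when the requirement is predictable latency.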

Visuals

  • Latency before/after chart
  • Deployment pipeline diagram
  • Edge-to-cloud architecture