ML Systems + Inference Optimization
High-Performance Inference and MLOps for Real-Time AI
Built GPU-accelerated inference and deployment pipelines that doubled throughput, cut latency by 60%, and made deployments reproducible and continuously validated.
2x throughput improvement
60% latency reduction
50% IPC latency reduction
30% faster deployments
The problem
Production AI workloads need predictable latency, efficient memory usage, reproducible deployment, and continuous validation.
What I built
Engineered real-time and batch inference pipelines using ONNX, TensorRT, CUDA, GStreamer, MLflow, Docker, AWS, CI/CD, and edge-device runners.
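One common way to serve real-time and batch traffic from the same model is to group incoming requests into small batches under a wait budget. The sketch below illustrates that general idea only; the source does not say whether this project used dynamic batching, and `collect_batch`, `max_batch`, and `max_wait_s` are hypothetical names chosen for illustration.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: "queue.Queue[Any]", max_batch: int, max_wait_s: float) -> List[Any]:
    """Hypothetical helper: block for one request, then gather more
    until the batch is full or the wait budget runs out."""
    batch = [requests.get()]  # blocks until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait budget spent: send a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch
```

A larger `max_wait_s` tends to favor throughput (fuller batches), while a smaller one bounds the extra latency any single request can pick up while waiting.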
Technical approach
- TensorRT and ONNX Runtime optimization
- CUDA-aware scheduling
- GStreamer pipelines with GPU-accelerated decode/encode
- MLflow deployment tracking
- Dockerized CI/CD workflows
- Hardware-in-the-loop validation
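Validation of the kind listed above usually rests on a repeatable latency measurement. The following is a minimal sketch, not the project's actual harness: `measure_latency`, `summarize`, and `passes_gate` are hypothetical names, the 5% regression tolerance is an assumed default, and `infer` stands in for whatever callable runs one inference.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Reduce per-call latencies (milliseconds) to a few summary numbers."""
    ordered = sorted(samples_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile: smallest sample with at least p% at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    total_s = sum(ordered) / 1000.0
    return {
        "p50_ms": pct(50),
        "p99_ms": pct(99),
        "mean_ms": statistics.fmean(ordered),
        # Sequential calls per second; assumes one request in flight at a time.
        "throughput_per_s": len(ordered) / total_s if total_s > 0 else math.inf,
    }


def measure_latency(infer: Callable[[], object], warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Time `infer` after a warmup phase. For GPU work, `infer` must
    synchronize before returning, or only the launch cost is measured."""
    for _ in range(warmup):
        infer()  # warmup calls are not timed
    samples_ms: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize(samples_ms)


def passes_gate(baseline: Dict[str, float], candidate: Dict[str, float], max_regression: float = 0.05) -> bool:
    """Assumed policy: fail if candidate p99 exceeds baseline p99 by more than max_regression."""
    return candidate["p99_ms"] <= baseline["p99_ms"] * (1 + max_regression)
```

In a CI job, a stored baseline summary could be compared against a fresh `measure_latency` run on the target device, failing the build when `passes_gate` returns `False`. Gating on p99 rather than the mean reflects the stated need for predictable latency.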
Visuals
Latency before/after chart
Deployment pipeline diagram
Edge-to-cloud architecture