Multimodal RAG + LLM Systems

Video-RAG for Continuous Video Search

Architected retrieval-oriented video understanding pipelines for natural-language search over live and long-horizon video streams.

Sub-second search latency over multi-day indexed video

Contextual querying and decision support over live video streams

Improved explainability through retrieval-backed reasoning

The problem

Long video streams are difficult to search, reason over, and summarize using traditional indexing approaches.

What I built

Built a Video-RAG pipeline that combines compressed spatio-temporal embeddings, CLIP/VLM-derived representations, and semantic text embeddings in a multimodal index, with LLM reasoning over the retrieved segments.

Technical approach

  • Compressed spatio-temporal embeddings
  • Multimodal segment construction
  • Bi-encoder candidate generation
  • Cross-encoder refinement
  • Confidence-weighted multimodal fusion
  • LLM-driven contextual reasoning
  • TensorRT / FP16 optimization on Jetson AGX Orin
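
A minimal sketch of how the retrieval stages listed above might fit together. It is an illustration under stated assumptions, not the production implementation. The `Segment` fields, the normalized-weight fusion formula, and the `toy_rerank` function are all hypothetical, and the two-dimensional vectors stand in for real embeddings. A real cross-encoder would score the query and the segment content jointly with a learned model; here a plain callable takes its place.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class Segment:
    """One indexed video segment (hypothetical schema)."""
    segment_id: str
    start_s: float
    end_s: float
    visual_vec: list[float]  # stand-in for a CLIP/VLM-derived embedding
    text_vec: list[float]    # stand-in for a caption/transcript embedding
    visual_conf: float       # assumed per-modality confidence in [0, 1]
    text_conf: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fused_score(query_vec: list[float], seg: Segment) -> float:
    """Confidence-weighted average of the per-modality similarities."""
    total = seg.visual_conf + seg.text_conf
    if total == 0:
        return 0.0
    visual = seg.visual_conf * cosine(query_vec, seg.visual_vec)
    text = seg.text_conf * cosine(query_vec, seg.text_vec)
    return (visual + text) / total


def search(
    query_vec: list[float],
    segments: list[Segment],
    rerank_fn: Callable[[list[float], Segment], float],
    k_candidates: int = 50,
    k_final: int = 5,
) -> list[Segment]:
    """Stage 1: cheap bi-encoder-style scoring over all segments.
    Stage 2: a more expensive scorer applied only to the short list."""
    candidates = sorted(
        segments, key=lambda s: fused_score(query_vec, s), reverse=True
    )[:k_candidates]
    reranked = sorted(
        candidates, key=lambda s: rerank_fn(query_vec, s), reverse=True
    )
    return reranked[:k_final]


def toy_rerank(query_vec: list[float], seg: Segment) -> float:
    """Placeholder for a cross-encoder: scores on the text channel only."""
    return cosine(query_vec, seg.text_vec)


segments = [
    Segment("a", 0.0, 10.0, [1.0, 0.0], [0.0, 1.0], 0.9, 0.1),
    Segment("b", 10.0, 20.0, [0.0, 1.0], [1.0, 0.0], 0.9, 0.1),
    Segment("c", 20.0, 30.0, [1.0, 1.0], [1.0, 1.0], 0.5, 0.5),
]
query = [1.0, 0.0]

top = search(query, segments, toy_rerank, k_candidates=2, k_final=2)
print([(s.segment_id, s.start_s, s.end_s) for s in top])
```

In this toy data the first stage shortlists segments `a` and `c`, and the second stage then reorders them. This shows why the expensive scorer only needs to run on a handful of candidates rather than on the whole index.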

Visuals

Retrieval architecture
Search demo screenshots
Ranking/fusion diagram