Multimodal RAG + LLM Systems
Video-RAG for Continuous Video Search
Architected retrieval-oriented video understanding pipelines for natural-language search over live and long-horizon video streams.
- Sub-second search latency over multi-day indexed video
- Contextual querying and decision support over live video streams
- Improved explainability through retrieval-backed reasoning
The problem
Long video streams are difficult to search, reason over, and summarize using traditional indexing approaches.
What I built
A Video-RAG pipeline that indexes video segments using compressed spatio-temporal embeddings, CLIP/VLM-derived representations, and semantic text embeddings in a multimodal index, with an LLM reasoning over the retrieved segments.
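A pipeline like this needs a unit to index. As a rough illustration only, the sketch below cuts a stream into overlapping time windows, each of which would later receive its embeddings and text description. The `Segment` record, the `window_segments` helper, and the 10-second window with 5-second stride are all hypothetical choices for this sketch, not details of the actual system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical index unit: a time window plus the frames it covers."""
    start_s: float
    end_s: float
    frame_indices: list[int] = field(default_factory=list)


def window_segments(num_frames: int, fps: float,
                    window_s: float = 10.0, stride_s: float = 5.0) -> list[Segment]:
    """Split a stream of `num_frames` frames into overlapping windows.

    Window and stride lengths are illustrative assumptions.
    """
    if num_frames <= 0 or fps <= 0 or window_s <= 0 or stride_s <= 0:
        raise ValueError("num_frames, fps, window_s and stride_s must be positive")

    duration_s = num_frames / fps
    segments: list[Segment] = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        first = int(start * fps)
        last = min(int(end * fps), num_frames)
        segments.append(Segment(start, end, list(range(first, last))))
        if end >= duration_s:
            break  # the final window already reaches the end of the stream
        start += stride_s
    return segments
```

Overlapping windows are one way to reduce the chance that an event is split across a segment boundary, at the cost of a larger index.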
Technical approach
- Compressed spatio-temporal embeddings
- Multimodal segment construction
- Bi-encoder candidate generation
- Cross-encoder refinement
- Confidence-weighted multimodal fusion
- LLM-driven contextual reasoning
- TensorRT / FP16 optimization on Jetson AGX Orin
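
The retrieval steps in the list above (bi-encoder candidate generation, cross-encoder refinement, confidence-weighted fusion) can be sketched as a two-stage search. This is a minimal NumPy illustration under stated assumptions, not the production code: the function names are hypothetical, `toy_rerank` is a lookup-table stand-in for a real cross-encoder, and the fusion rule (a confidence-normalized weighted mean of per-modality scores) is one plausible reading of "confidence-weighted fusion".

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def top_k_candidates(query_vec: np.ndarray, segment_vecs: np.ndarray, k: int) -> np.ndarray:
    """Stage 1 (bi-encoder): rank all segments by cosine similarity, keep the top k."""
    sims = l2_normalize(segment_vecs) @ l2_normalize(query_vec)
    k = min(k, sims.shape[0])
    return np.argsort(-sims, kind="stable")[:k]


def fuse_scores(scores: np.ndarray, confidences: np.ndarray) -> np.ndarray:
    """Confidence-weighted fusion (assumed form): weighted mean across modalities.

    Both inputs have shape (n_candidates, n_modalities).
    """
    totals = np.maximum(confidences.sum(axis=1, keepdims=True), 1e-12)
    return ((confidences / totals) * scores).sum(axis=1)


def search(query_vec, segment_vecs, rerank_fn, k=3):
    """Stage 1 shortlist, then stage 2 rescoring and fusion of the shortlist only."""
    idx = top_k_candidates(query_vec, segment_vecs, k)
    scores, confidences = rerank_fn(idx)  # stand-in for a cross-encoder pass
    fused = fuse_scores(scores, confidences)
    order = np.argsort(-fused, kind="stable")
    return [(int(idx[i]), float(fused[i])) for i in order]


# Toy data: four segments with 2-D embeddings (real embeddings are much larger).
segment_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
query_vec = np.array([1.0, 0.0])

# Hypothetical per-segment (visual, text) relevance scores and confidences.
TOY_SCORES = np.array([[0.6, 0.2], [0.5, 0.9], [0.1, 0.1], [0.8, 0.4]])
TOY_CONF = np.array([[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])


def toy_rerank(idx: np.ndarray):
    """Look up precomputed scores; a real system would run a cross-encoder here."""
    return TOY_SCORES[idx], TOY_CONF[idx]


results = search(query_vec, segment_vecs, toy_rerank, k=3)
for segment_id, score in results:
    print(segment_id, round(score, 3))
```

The split keeps the expensive scorer off the full index: the bi-encoder stage is a single matrix product over precomputed embeddings, and only the k shortlisted segments reach the reranker.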
Visuals
Retrieval architecture
Search demo screenshots
Ranking/fusion diagram