Posts | Steve Bottos

Some RAG Systems and Their Problems

Text RAG is manageable. Document RAG means bolting on OCR and layout models that work well but add real overhead. Visual RAG skips the extraction step entirely, but trades it for a cost and precision problem of its own.

read more →

vision text RAG engineering

July 11, 2026

Teaching a CNN to Play Lunar Lander with Visual Features Only and Reinforcement Learning

A chronicle of a reinforcement learning project, from initial struggles with exploding losses to a successful pixel-based agent for LunarLander.

read more →

reinforcement learning proximal policy optimization computer vision deep learning tutorial

December 04, 2025

Fast, Simple, Fun - Video Understanding with <40M Parameters

A very approachable jumping-off point for video captioning. If you're GPU-poor (<24GB VRAM), this is for you.

read more →

video nlp multi-modal foundational models tutorial

November 16, 2024

Something-Something-V2-Paraphrased

A small contribution to the community. Adds varied, caption-like samples to the SSV2 dataset.

read more →

video captioning novel dataset LLMs VLMs

November 10, 2024

V-JEPA 2 Isn't Getting the Hype It Deserves

Some thoughts on the potential of Meta's V-JEPA 2.

read more →

video multi-modal foundational models

November 09, 2024