
Steve Bottos

Lead Machine Learning Engineer | Victoria, BC

Something-Something-V2, Paraphrased

Video captioning datasets are usually frustrating to work with. Many rely on YouTube links that frequently go dead, and some struggle to capture both spatial AND temporal nuances. This makes robust, readily available datasets incredibly valuable for research and development.

Lately I’ve been turning to the Something-Something-V2 (SSv2) dataset. While it’s primarily designed with action recognition in mind, its strong emphasis on both spatial and temporal details makes it an excellent resource for video captioning tasks.

To enhance its utility for this purpose, I’ve put together a complete set of synthetic paraphrased captions for the SSv2 training set. This effort introduces significant linguistic diversity, which is key for training more robust and generalized captioning models.
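For illustration, here's a minimal sketch of how synthetic paraphrases like these might be generated with an off-the-shelf seq2seq model from Hugging Face. The model name (tuner007/pegasus_paraphrase) and decoding settings are assumptions for the sake of the example, not necessarily what produced this dataset:

```python
# A minimal sketch of caption paraphrasing with an off-the-shelf
# seq2seq model. Model name and decoding settings are assumptions,
# not necessarily what produced this dataset.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "tuner007/pegasus_paraphrase"  # assumed paraphrase model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase(caption: str, n: int = 3) -> list[str]:
    """Return n candidate paraphrases of a single caption."""
    inputs = tokenizer(caption, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        num_beams=max(n, 5),      # beam search; beams >= return sequences
        num_return_sequences=n,
        max_length=60,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Example with an SSv2-style caption:
print(paraphrase("Pushing a pen from left to right"))
```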

The key stats on the paraphrased dataset are listed in the repository linked below.

This enrichment helps models learn a wider range of descriptive language for the same visual concepts. I’m finding this approach helpful for iterating on video captioning tasks that require a nuanced understanding of actions in space and time.
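As a usage sketch, one simple way to fold the paraphrases into a captioning training loop is to sample uniformly between the original caption and its paraphrases each time a clip is loaded. The file name and JSON schema below are hypothetical; check the repo for the actual format:

```python
import json
import random

# Hypothetical file name and schema: a JSON object mapping each
# video id to a list of paraphrased captions. Check the repo for
# the actual format.
with open("ssv2_paraphrases.json") as f:
    paraphrases: dict[str, list[str]] = json.load(f)

def sample_caption(video_id: str, original: str) -> str:
    """Draw the training caption uniformly from the original plus
    its paraphrases, so the model sees varied phrasings of the
    same visual concept across epochs."""
    candidates = [original] + paraphrases.get(video_id, [])
    return random.choice(candidates)
```

Sampling at load time keeps the augmentation cheap and stateless, so it can drop into an existing dataloader without changes elsewhere.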

You can find the full dataset and details here: https://github.com/stevebottos/somethingsomethingv2-paraphrase