Google AI researchers today said they used 2,000 “mannequin challenge” YouTube videos as a training data set to create an AI model capable of depth prediction from videos in motion. That kind of depth understanding could help developers craft augmented reality experiences and 3D video from scenes shot with hand-held cameras.
The mannequin challenge asked groups of people to act as if time stood still while one person shot video. In a paper titled “Learning the Depths of Moving People by Watching Frozen People,” researchers said these videos provide a dataset that helps predict depth in videos where both the camera and the people in the scene are moving.
“While there is a recent surge in using machine learning for depth prediction, this work is the first to tailor a learning-based approach to the case of simultaneous camera and human motion,” research scientist Tali Dekel and engineer Forrester Cole said in a blog post today.
The approach outperforms state-of-the-art tools for making depth maps, Google researchers said.
“To the extent that people succeed in staying still during the videos, we can assume the scenes are static and obtain accurate camera poses and depth information by processing them with structure-from-motion (SfM) and multi-view stereo (MVS) algorithms,” the paper reads. “Because the entire scene, including the people, is stationary, we estimate camera poses and depth using SfM and MVS, and use this derived 3D data as supervision for training.”
To make the model, researchers trained a neural network that takes as input an RGB image, a mask of the human regions, and an initial depth estimate for the non-human parts of the scene, and that produces a depth map and makes human shape and pose predictions.
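As a rough illustration of those three inputs, the sketch below stacks an RGB frame, a human mask, and an initial depth estimate into a single five-channel array. The function name `build_network_input`, the channel order, and the choice to zero out depth inside human regions are assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np


def build_network_input(rgb, human_mask, initial_depth):
    """Stack the three inputs into one H x W x 5 array (hypothetical layout).

    rgb:           H x W x 3 float array
    human_mask:    H x W boolean array, True where a person is present
    initial_depth: H x W float array of initial depth estimates
    """
    if rgb.shape[:2] != human_mask.shape or human_mask.shape != initial_depth.shape:
        raise ValueError("all inputs must share the same height and width")

    mask = human_mask.astype(np.float32)

    # Assumption: keep initial depth only for the non-human environment
    # and set it to zero wherever the mask marks a person.
    env_depth = np.where(human_mask, 0.0, initial_depth).astype(np.float32)

    # Channels 0-2: RGB, channel 3: human mask, channel 4: environment depth.
    return np.concatenate(
        [rgb.astype(np.float32), mask[..., None], env_depth[..., None]],
        axis=-1,
    )
```

A network trained on such an array would then have to fill in depth for the masked human regions, which is where the supervision from the frozen mannequin challenge scenes comes in.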
University of California, Berkeley AI researchers also used YouTube videos as a dataset last year to train AI models to dance Gangnam Style and perform acrobatic feats like backflips.