Abstract:
Within months of birth, children develop meaningful expectations about the
world around them. How much of this early knowledge can be explained through
generic learning mechanisms applied to sensory data, and how much of it
requires more substantive innate inductive biases? Addressing this fundamental
question in its full generality is currently infeasible, but we can hope to
make real progress in more narrowly defined domains, such as the development of
high-level visual categories, thanks to improvements in data-collection
technology and recent progress in deep learning. In this paper, we pursue
such progress by applying modern self-supervised deep
learning methods to a recent longitudinal, egocentric video dataset recorded
from the perspective of three young children (Sullivan et al., 2020). Our
results demonstrate the emergence of powerful, high-level visual
representations from developmentally realistic natural videos using generic
self-supervised learning objectives.