What AI taught us about children’s language learning abilities!
Researchers have built a machine learning model that learns to recognize words from footage recorded by a toddler’s head camera. The research, published this week in Science, could offer new insights into how children learn language, and might even help researchers build machine learning models that learn more like humans do.
Children start to acquire their first words at around 6 to 9 months of age. By age two, the average child already knows around 300 words. However, exactly how children connect meaning to words is still not well understood and remains a subject of scientific debate. Researchers from New York University’s Center for Data Science set out to explore this question by building an AI model that tries to learn the way a child does.
The model was trained on over 60 hours of footage captured from a toddler’s perspective by a head camera, along with over 37,500 transcribed utterances. The camera recorded the child’s day-to-day experiences (playing, eating, and soaking in the world around them), offering a view into a child’s development.
With access to this unique look at the world, the researchers built a neural network model that takes the camera footage and the transcribed speech directed at the child as input. The self-supervised model learned much as a child might, associating words with the objects and visuals that co-occurred with them.
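To make the idea of learning from co-occurrence concrete, here is a minimal Python sketch. It is not the researchers’ neural network: it swaps in simple counting as a stand-in, and the function names and toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def learn_associations(episodes):
    """Count how often each heard word co-occurs with each visible object."""
    counts = defaultdict(Counter)
    for words, objects in episodes:
        for word in words:
            for obj in objects:
                counts[word][obj] += 1
    return counts


def best_referent(counts, word):
    """Return the object seen most often with the word, or None if unheard."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


# Hypothetical toy data: (words heard, objects in view) for each moment.
episodes = [
    (["ball"], ["ball", "floor"]),
    (["ball"], ["ball", "chair"]),
    (["cat"], ["cat", "floor"]),
    (["cat"], ["cat", "chair"]),
]

counts = learn_associations(episodes)
print(best_referent(counts, "ball"))
print(best_referent(counts, "cat"))
```

No single moment reveals what a word refers to, since several objects are in view at once. Across many moments, though, the right object is the one that keeps showing up alongside the word.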

Testing procedure in models and children. Credit: Wai Keen Vong
“By using AI models to study the real language-learning problem faced by children, we can address classic debates about what ingredients children need to learn words—whether they need language-specific biases, innate knowledge, or just associative learning to get going,” paper co-author and NYU Center for Data Science Professor Brenden Lake said in a statement. “It seems we can get more with just learning than commonly thought.”
In testing, the model was shown four images from the training set and asked to match them to specific words. The model was successful 61.6% of the time. Impressively, the baby-cam-trained model reached accuracy similar to that of a pair of separate AI models that were trained with many more language inputs.
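A four-image test like this can be sketched in Python as follows. The sketch assumes each trial pairs one target word with four candidate images, and that words and images are compared as vectors using cosine similarity. Those details, the function names, and the toy vectors are all assumptions for illustration, not the study’s actual code or data. Under this setup, random guessing would be right 25% of the time.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_image(word_vec, image_vecs):
    """Return the index of the candidate image most similar to the word."""
    scores = [cosine(word_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(trials):
    """Fraction of trials where the picked image is the target."""
    correct = sum(pick_image(w, imgs) == target for w, imgs, target in trials)
    return correct / len(trials)


# Hypothetical toy trials: (word vector, four image vectors, target index).
trials = [
    ([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.2, 1.0]], 0),
    ([0.0, 1.0], [[1.0, 0.0], [0.1, 1.0], [-1.0, 0.0], [0.0, -1.0]], 1),
    ([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]], 0),
]

print(f"accuracy: {accuracy(trials):.3f}")
```

In the third toy trial a distractor image scores higher than the target, so the trial counts as a miss.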

