Can AI learn language like we do?
Thank you to baby Sam from Australia for being part of this story. 👶🏻
Image: A baby wearing the helmet-mounted camera researchers used to collect a baby's-eye view dataset.
Hey Everyone,
This week there was a fascinating story about AI learning language the way we do from a young age.
It is one of the most viral stories about AI learning language like humans that I’ve seen in a while. A group of scientists at New York University used 61 hours of footage from a camera attached to an Australian baby named Sam to gather experiences from the infant’s perspective.
Video from New York University
To check whether AI could learn the words and ideas a child encounters every day, researchers from New York University decided that the best way forward was to record what a toddler sees and hears throughout the day, and then train an AI model on it.
The new study, published in the journal Science on February 1st, 2024, demonstrated that an AI system could be developed to learn a substantial number of words and concepts using the limited experiences of a child.
Researchers from New York University showed that video captured during only about one percent of the child’s waking hours was sufficient for genuine language learning.
“We show, for the first time, that a neural network trained on this developmentally realistic input from a single child can learn to link words to their visual counterparts,” says Wai Keen Vong, a research scientist at NYU’s Center for Data Science and the paper’s first author. “Our results demonstrate how recent algorithmic advances paired with one child’s naturalistic experience has the potential to reshape our understanding of early language and concept acquisition.”
The idea of training AIs on how babies learn is of course not new. Still, a neural network that taught itself to recognize objects using the filmed experiences of a single infant could offer new insights into how humans learn. LLMs are also going to get a lot better at learning from video soon.
How do young children learn to associate new words with specific objects or visually represented concepts? This hotly debated question in early language acquisition has been traditionally examined in laboratories, limiting generalizability to real-world settings. Vong et al. investigated the question in an unprecedented, longitudinal manner using head-mounted video recordings from a single child’s first-person experiences in naturalistic settings.
If this study were repeated at scale, it could yield some interesting findings. In the study, Sam, a baby boy living near Adelaide in Australia, wore a head-mounted camera for around one hour twice each week from the age of six months to around two years, gathering experiences from the infant’s perspective.
Abstract
Starting around 6 to 9 months of age, children begin acquiring their first words, linking spoken words to their visual counterparts. How much of this knowledge is learnable from sensory input with relatively generic learning mechanisms, and how much requires stronger inductive biases? Using longitudinal head-mounted camera recordings from one child aged 6 to 25 months, we trained a relatively generic neural network on 61 hours of correlated visual-linguistic data streams, learning feature-based representations and cross-modal associations. Our model acquires many word-referent mappings present in the child’s everyday experience, enables zero-shot generalization to new visual referents, and aligns its visual and linguistic conceptual systems. These results show how critical aspects of grounded word meaning are learnable through joint representation and associative learning from one child’s input.
The training material included frames from the video recordings of the baby’s environment and transcribed words spoken to Sam. In total, the model was exposed to 250,000 words and the corresponding images, spanning a range of everyday activities such as mealtimes, reading books, and playing.
This is also really valuable for education, language acquisition, and early childhood development. What if it could be replicated for all the other ways we learn?
AI Might Also Help Us Understand Child Development Better
“Our results demonstrate how recent algorithmic advances paired with one child’s naturalistic experience has the potential to reshape our understanding of early language and concept acquisition,” Dr Vong said.
CVCL
The system, dubbed the Child's View for Contrastive Learning (CVCL) model, was fed 37,500 utterances captured from 61 hours of first-person video recorded by a baby named Sam from ages 6 months to 2 years old.
Researchers from New York University’s Center for Data Science and Department of Psychology fed this dataset into a multimodal AI system—one that could ingest both text and images.
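For readers who want a feel for the mechanics, contrastive learning of this kind works much like CLIP-style image-text training: frames and the utterances heard alongside them are pulled together in a shared embedding space, while mismatched frame-utterance pairs are pushed apart. The sketch below is not the NYU team’s code; the encoder architectures, vocabulary size, and hyperparameters are illustrative assumptions I’ve made to keep it minimal.

```python
# A minimal sketch (not the authors' code) of CLIP-style contrastive learning
# on paired (video frame, transcribed utterance) data, the general technique
# behind CVCL. Encoders, vocabulary, and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Tiny CNN mapping a frame into the joint embedding space (stand-in for a real backbone)."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        return self.proj(self.conv(x).flatten(1))

class TextEncoder(nn.Module):
    """Averages word embeddings of an utterance into the same joint space."""
    def __init__(self, vocab_size=10_000, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")

    def forward(self, token_ids, offsets):     # flat token ids + per-utterance offsets
        return self.embed(token_ids, offsets)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matching frame/utterance pairs attract, mismatched pairs repel."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))            # the diagonal holds the true pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# One toy training step on random stand-in data.
image_encoder, text_encoder = ImageEncoder(), TextEncoder()
optimizer = torch.optim.Adam(
    list(image_encoder.parameters()) + list(text_encoder.parameters()), lr=1e-4)

frames = torch.randn(8, 3, 224, 224)            # batch of 8 video frames
token_ids = torch.randint(0, 10_000, (40,))     # 40 tokens across 8 utterances
offsets = torch.arange(0, 40, 5)                # each toy utterance is 5 tokens long

loss = contrastive_loss(image_encoder(frames), text_encoder(token_ids, offsets))
optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The real CVCL model pairs a far more capable vision backbone with the full transcribed utterances from the recordings, but this attract-and-repel objective is the basic ingredient.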
They found that the AI model could identify many different objects, both in tests using data from the head-mounted camera and in tests using a dataset of idealized images of various objects, although its accuracy was somewhat limited.
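How does a contrastive model like this “name” an object in a test image? You embed the image and each candidate word, then pick the word whose embedding sits closest to the image embedding. Continuing the toy sketch above (again, an illustrative assumption rather than the paper’s actual evaluation code), it looks roughly like this:

```python
# Continuing the sketch above: zero-shot object naming works by embedding each
# candidate word and the test frame, then picking the closest word. The word
# list and the word-to-id mapping here are made up for illustration.
import torch
import torch.nn.functional as F

candidate_words = ["ball", "car", "cat", "crib", "apple"]
word_to_id = {w: i for i, w in enumerate(candidate_words)}   # toy vocabulary mapping

@torch.no_grad()
def name_object(frame, words):
    """Return the candidate word whose text embedding best matches the frame."""
    img = F.normalize(image_encoder(frame.unsqueeze(0)), dim=-1)   # (1, D)
    ids = torch.tensor([word_to_id[w] for w in words])
    offsets = torch.arange(len(words))                              # one word per "utterance"
    txt = F.normalize(text_encoder(ids, offsets), dim=-1)           # (len(words), D)
    scores = (img @ txt.t()).squeeze(0)                             # cosine similarities
    return words[scores.argmax().item()]

test_frame = torch.randn(3, 224, 224)        # stand-in for a held-out camera frame
print(name_object(test_frame, candidate_words))
```

This is the same mechanism that lets the model generalize zero-shot to idealized images it never saw in the head-camera footage: as long as a word and an image land near each other in the shared space, the pairing can be recognized.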
The AI system learned from the video in a surprisingly effective way. It was better at naming objects it had seen more frequently, including apples (which are ubiquitous in children’s books) and cribs, and it was better able to pick out objects that weren’t obscured in the head-camera images. It was particularly poor at recognizing knives, says Wai Keen Vong, one of the paper’s authors. Curiously, the work was supported by the U.S. Department of Defense’s Defense Advanced Research Projects Agency, or DARPA (N6600119C4030), and the National Science Foundation (1922658).
This is also about child development science and psychology. Some psychologists and linguists believe that children would not be able to form associations between words and objects without some innate language ability. But the fact that the AI model, which is relatively simple, could even begin to learn word associations from such a small dataset challenges this view, says Vong.
Since writing up their results, the NYU researchers have transcribed four times more data from the head camera footage, which they intend to feed into their model. They hope to examine how much more the AI model learns when it's given more data, Vong says. They also hope to test whether the model can start to learn more challenging words and linguistic behaviors that tend to develop later in life.
❄️ Special Offer to Go Paid 🌨️🧊☕️⛄️
This special 2-year Anniversary offer expires after February 2024. It is only available for yearly plans. This amounts to $53.60 a year, or about $4.47 a month, or $0.26 an article.
Ask your manager and get this covered for multiple team members by your learning and development fund.