
Certificate of Completion
THIS ACKNOWLEDGES THAT
Sam Schiavone
HAS COMPLETED THE FALL 2025 DATA SCIENCE BOOT CAMP
Roman Holowinsky, PhD
DIRECTOR
NOVEMBER 13, 2025
DATE

TEAM
LingPredict Project: Do Developmental Norms Predict L2 Difficulty? Modeling Duolingo Learners with WordBank Features
Sara Sanchez-Alonso, Benard Haugen, Vikram Jambulapati, Manjeet Kaur, Sam Schiavone

Field: Language Learning
Research Question:
Do words acquired later in first-language (L1) development show higher error rates in second-language (L2) practice on Duolingo?
Duolingo SLAM Dataset:
• Large corpus of data from over 6,000 Duolingo users, collected during their first 30 days of learning a language.
• Released in 2018 as part of the Duolingo Shared Task on Second Language Acquisition Modeling (SLAM).
• Publicly available via Harvard Dataverse and linked from Duolingo Research.
• Freely usable for research/educational purposes (requires agreeing to terms of use).
WordBank Dataset:
• Open database of children’s vocabulary growth.
• Publicly available at wordbank.stanford.edu.
• Aggregates data from the MacArthur–Bates Communicative Development Inventories (CDIs).
• Open access under a permissive license (for research/educational use).
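One way the research question could be tested against the two datasets is sketched below; it is a minimal illustration under stated assumptions, not the team's actual pipeline. It assumes the Duolingo data can be reduced to (word, label) pairs where a label of 1 marks a learner error, and that a per-word age of acquisition in months has been derived from WordBank. The function names, the toy records, and the toy ages are all hypothetical. The sketch computes a per-word error rate and then a Spearman rank correlation between age of acquisition and error rate.

```python
from collections import defaultdict


def per_word_error_rate(records):
    """Return {word: share of attempts marked as errors}.

    records: iterable of (word, label) pairs; label 1 = error, 0 = correct
    (an assumed encoding for this sketch).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for word, label in records:
        totals[word] += 1
        errors[word] += label
    return {word: errors[word] / totals[word] for word in totals}


def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy inputs, invented for illustration only (not taken from either dataset).
toy_records = [
    ("milk", 0), ("milk", 0), ("milk", 0), ("milk", 0),
    ("dog", 0), ("dog", 0), ("dog", 1), ("dog", 0),
    ("because", 1), ("because", 1), ("because", 0), ("because", 1),
]
toy_aoa_months = {"milk": 14, "dog": 16, "because": 30}

rates = per_word_error_rate(toy_records)
shared = sorted(set(rates) & set(toy_aoa_months))  # words present in both sources
rho = spearman(
    [toy_aoa_months[w] for w in shared],
    [rates[w] for w in shared],
)
print(f"Spearman rho over {len(shared)} shared words: {rho:.2f}")
```

A rank correlation is used here because it only assumes a monotonic relationship between age of acquisition and error rate. A fuller analysis would likely also control for factors such as word frequency and length.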
