Data Science Boot Camp

Spring 2026

Jan 26, 2026 - May 1, 2026

To access the program content, you must first create an account and member profile and be logged in.

Registration Deadlines

Jan 21, 2026

-

All Erdős Spring 2026 Career Launch Cohort or Alumni Club members who are not participating in another Launch bootcamp

-

-

Category

Launch, Core Program, Boot Camp, Projects, Certificates

Overview

In this bootcamp, we will develop the skills needed to complete a data science project from start to finish. This includes defining a problem in quantitative terms, identifying key performance indicators (KPIs), acquiring and cleaning data, exploring patterns and trends, and transforming raw data into meaningful variables. We will then build models for prediction and inference, focusing primarily on supervised learning methods for regression and classification.

Slack

Click here to be invited to the Slack organization: The Erdős Institute

Click here to access the Slack cohort channel: #slack-cohort-channel

Click here to access the Slack program channel: #slack-program-channel

Click here to download the Events & Deadlines .ics calendar file

Organizers, Instructors, and Advisors

Steven Gubkin, PhD

Lead Instructor

Office Hours:

By appt. only

Email:

Preferred Contact:

Slack

Please feel free to message me on Slack with any questions!

Alec Clott, PhD

Head of Data Science Projects

Office Hours:

By appt. only

Email:

Preferred Contact:

Slack

Participants are welcome to reach out to me via Slack or email. I normally work standard EST hours (9am-5pm), but I can also find time to meet via Zoom after work. Let me know how I can help!

Objectives

The goal of our Data Science Boot Camp is to provide you with the skills and mentorship necessary to produce a portfolio-worthy data science/machine learning project, while also providing valuable career development support and connecting you with potential employers.

Project Examples

TEAM 33

Tuning Up Music Highway

James O'Quinn, Yang Mo, John Hurtado Cadavid, Ruixuan Ding, Chilambwe Natasha Wapamenshi

Known as the most dangerous highway in Tennessee, Music Highway, the stretch of Interstate 40 between Memphis and Nashville, could use a serious tuning up. This project investigates the effectiveness and cost-efficiency of potential physical safety interventions along its Madison and Henderson County segments, with the goal of reducing crash severity. We used a data-driven geospatial modeling approach to assess whether adding specific safety features to targeted segments predicts statistically significant changes in crash injury outcomes.

TEAM 29

Who Regulates the Regulators?

Jared Able, Joshua Jackson, Zachary Brennan, Alexandria Wheeler, Nicholas Geiser

With recent major cuts to government regulatory agencies in the US, we investigate whether those cuts are justified. In particular, we analyze the efficacy of RGGI, a state-level cap-and-trade program designed to regulate CO2 emissions from power plants. Using synthetic controls, we answer the counterfactual question: "How would CO2 emissions look in a world where RGGI was never enacted?"

First Steps/Prerequisites

Course Orientation / Computer Setup Day
Our first meeting is Thursday, January 29th, from 1:30 PM - 3:00 PM ET. I will give a brief orientation to the course. The remainder of the time will be spent on the following very simple goal: to clone the repo, install the conda environment, and use that conda environment to run a Jupyter notebook. It is impossible to participate in the course without these abilities, so it is important to attend this session. If you can do these things, please show up to help the other participants!
 
Detailed instructions (created by teaching assistant Ness Mayker Chen) can be found at this link.
 
We will test your ability to do these things by having you submit a "secret code". You will obtain this code by successfully running the notebook
 
lectures/00_orientation/computer_setup_day/find_secret_code.ipynb
 
When you have obtained the code, enter it in the text box at https://www.erdosinstitute.org/ds-boot-camp-prep
 
If you can do these things independently, please show up to help your colleagues!
If you cannot do these things independently, please show up to get help from your colleagues!
 
Prerequisites
 
In addition to these computer setup steps there are also some content prerequisites:
  1. Base level familiarity with Python
  2. Differential calculus. Ideally you also know some multivariate differential calculus and linear algebra.
  3. Basic statistics and probability

Program Content

Course materials are available on GitHub through the following link:

Request Access to GitHub

Textbook/Notes

Note: our video player does not support playback speed options. You can find a third-party browser extension that lets you modify video playback speed. For example, this one works for Chrome: video-speed-controller. If you would prefer to avoid a browser extension, you can also change the playback speed manually in the JavaScript console: Speed up any HTML5 video player!

(Prerecorded) L3E2: Regularization

Lecture 03: Complexity Control (prerecorded)

Regularization adds a term to our loss function which penalizes large parameters. This can help prevent overfitting.
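
As a rough illustration of the penalty idea, here is a minimal sketch of ridge (L2) regularization using its closed-form solution. The function name `ridge_fit`, the synthetic data, and the choice of an L2 penalty without an intercept are assumptions made for this example, not necessarily what the lecture uses.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 via the closed-form solution."""
    n_features = X.shape[1]
    # Normal equations with the penalty strength added to the diagonal.
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data (hypothetical): 30 samples, 5 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=30)

w_ols = ridge_fit(X, y, alpha=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, alpha=10.0)  # penalized: coefficients shrink toward zero

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

Increasing `alpha` trades a slightly worse fit on the training data for smaller parameters, which is the mechanism that can reduce overfitting.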

Slides
Transcript
Code

(Prerecorded) L4E1: Simple Linear Regression

Lecture 04: Regression I (prerecorded)

We introduce the simple linear regression model.

Slides
Code

(Prerecorded) L4E1: Categorical Variables and Interactions

Lecture 04: Regression I (prerecorded)

Adjusting our regression set up to accommodate categorical variables.

Slides
Code

(Prerecorded) L4E1: Basic Pipelines

Lecture 04: Regression I (prerecorded)

Pipelines are a nice way to put all modeling steps into one neat package. Here we introduce the most basic pipeline creation method.

Slides
Code

(Prerecorded) L8E3: Time and Dates in Python

Lecture 08: Time Series I (prerecorded)

A brief aside on how to handle time and dates in Python.
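
A small sketch of the kind of date handling this covers, using only the standard library `datetime` module. It parses a timestamp formatted like the ones in the schedule on this page and converts it from UTC to US Eastern Standard Time. The fixed UTC-5 offset is an assumption that holds in January; a real application would more likely use a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Parse a timestamp written like the schedule entries, then mark it as UTC.
start_utc = datetime.strptime("Jan 29, 2026 at 06:30 PM", "%b %d, %Y at %I:%M %p")
start_utc = start_utc.replace(tzinfo=timezone.utc)

# Fixed UTC-5 offset for Eastern Standard Time (assumption: no daylight saving in January).
eastern_standard = timezone(timedelta(hours=-5), "EST")
start_local = start_utc.astimezone(eastern_standard)

# Date arithmetic is done with timedelta objects.
session_end = start_local + timedelta(minutes=90)

print(start_local.strftime("%Y-%m-%d %H:%M %Z"))
print(session_end.isoformat())
```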

Slides
Code

(Prerecorded) L8E6: Exponential Smoothing

Lecture 08: Time Series I (prerecorded)

Exponential smoothing forecasts are weighted averages of past observations whose weights decrease exponentially going backwards through time. We introduce four such methods.
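
To make the weighting concrete, here is a minimal sketch of the simplest variant only (simple exponential smoothing, with no trend or seasonality). The function name, the initialization with the first observation, and the toy series are assumptions for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels; the last level is the one-step-ahead forecast."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]  # initialize the level with the first observation
    for y in series[1:]:
        # New level = alpha * newest observation + (1 - alpha) * previous level.
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels

levels = simple_exponential_smoothing([10.0, 12.0, 14.0], alpha=0.5)
forecast = levels[-1]
print(levels, forecast)
```

Unrolling the recursion shows that an observation k steps in the past gets weight `alpha * (1 - alpha) ** k`, which is the exponential decay described above.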

Slides
Code

(Prerecorded) L9E1: Moving Average (MA(q)) Models

Lecture 09: Time Series II (prerecorded)

A Moving Average model expresses the value of a time series as a linear combination of lagged independent normally distributed errors.
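
The definition can be sketched directly in code. The helper below builds an MA(q) series from a given error sequence; the name `simulate_ma`, the coefficient values, and the convention of dropping the first q time steps are assumptions made for this example.

```python
import numpy as np

def simulate_ma(errors, thetas, mean=0.0):
    """Build y_t = mean + e_t + sum_i thetas[i-1] * e_{t-i} from given errors."""
    errors = np.asarray(errors, dtype=float)
    q = len(thetas)
    y = np.full(len(errors) - q, mean)
    for t in range(q, len(errors)):
        lagged = errors[t - q:t][::-1]  # e_{t-1}, ..., e_{t-q}
        y[t - q] += errors[t] + np.dot(thetas, lagged)
    return y

rng = np.random.default_rng(42)
errors = rng.normal(size=500)               # independent standard normal errors
y = simulate_ma(errors, thetas=[0.6, 0.3])  # an MA(2) series of length 498
print(y[:5])
```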

Slides
Transcript
Code

Data Source Websites

Data Collection (prerecorded)

We cover a plethora of data source websites you can use.

Slides
Transcript
Code

Data in Databases

Data Collection (prerecorded)

Your data is stuck in a database. Can you get it out? Learn how in this video.

Slides
Transcript
Code

What is Clustering?

Bonus content (prerecorded)

We take a moment to define clustering problems.

Slides
Code

Imputation

Bonus content (prerecorded)

When you are missing data, try imputing!

Slides
Code

Gradient Descent

Bonus content (prerecorded)

Taking advantage of the gradient to minimize cost functions.
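
A minimal sketch of the update rule on a one-dimensional toy problem. The function name, learning rate, step count, and the quadratic cost are all assumptions chosen to keep the example short.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=200):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

With this cost, each step multiplies the distance to the minimum by `1 - 2 * learning_rate`, so a learning rate that is too large (above 1.0 here) would make the iterates diverge.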

Slides
Code

(Prerecorded) L2E5: Classification Plots

Lecture 02: Model Evaluation (Prerecorded)

Plots related to threshold tuning, namely the precision-recall (PR) curve and the ROC curve. The area under the ROC curve (AUC-ROC) equals the probability that a randomly chosen class 1 example receives a higher score than a randomly chosen class 0 example.
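
The ranking interpretation of AUC-ROC can be checked by brute force over all (positive, negative) pairs. This is a sketch for small inputs; the function name and toy scores are assumptions, and ties are counted as half a win, which is the usual convention.

```python
def pairwise_auc(labels, scores):
    """Fraction of (class 1, class 0) pairs in which the class 1 example scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Here three of the four pairs are ordered correctly (only 0.35 versus 0.4 is not), so the result is 0.75.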

Slides
Transcript
Code

(Prerecorded) L2E2: Regression Metrics

Lecture 02: Model Evaluation (Prerecorded)

We discuss different loss functions (MSE, MAE, Huber) and evaluation metrics (MAPE, R^2, etc.) for regression problems.
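
For reference, several of these quantities can be computed by hand in a few lines. This is a plain-Python sketch with made-up numbers; the function name is an assumption, and Huber loss is left out for brevity.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, MAPE, and R^2 for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r ** 2 for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    # MAPE is undefined when any true value is zero.
    mape = sum(abs(r / t) for r, t in zip(residuals, y_true)) / n
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(r ** 2 for r in residuals) / ss_total
    return {"mse": mse, "mae": mae, "mape": mape, "r2": r2}

metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```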

Slides
Transcript
Code

(Prerecorded) L1E5: Scikit-learn Supervised API

Lecture 01: Supervised Learning (Prerecorded)

Scikit-learn models follow a simple model.fit(X,y), model.predict(X) API. This allows you to use many different models as black boxes so you can start using them right away.
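
A short sketch of what the shared interface buys you: the same loop fits and queries two unrelated models. The toy data and the particular pair of estimators are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # generated by y = 2x + 1
X_new = np.array([[4.0]])

predictions = {}
for model in (LinearRegression(), KNeighborsRegressor(n_neighbors=2)):
    model.fit(X, y)                  # same call for every estimator
    predictions[type(model).__name__] = model.predict(X_new)[0]

print(predictions)
```

The linear model extrapolates the line, while the nearest-neighbor model averages the two closest training targets, so the predictions differ even though the calling code is identical.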

Slides
Transcript
Code

(Prerecorded) L1E2: Data Collection

Lecture 01: Supervised Learning (Prerecorded)

A very short video which gives you pointers on where to find additional content on data collection. We do not cover this content in depth in the synchronous lectures.

Slides
Transcript
Code

(Prerecorded) L3E3: PCA

Lecture 03: Complexity Control (prerecorded)

We introduce the theory behind principal components analysis and demonstrate how it is implemented in Python.

Slides
Transcript
Code

(Prerecorded) L4E1: A First Predictive Modeling Project

Lecture 04: Regression I (prerecorded)

We review a typical workflow for a predictive modeling project with some baseball data.

Slides
Code

(Prerecorded) L4E1: Polynomial Regression and Nonlinear Transformations

Lecture 04: Regression I (prerecorded)

We can make polynomial and nonlinear transformations of our features to make additional regression model types.

Slides
Code

(Prerecorded) L8E1: What are Time Series and Forecasting

Lecture 08: Time Series I (prerecorded)

We introduce the notions of time series data and forecasting.

Slides
Code

(Prerecorded) L8E4: Baseline Forecasts

Lecture 08: Time Series I (prerecorded)

Some good baseline models for various flavors of time series data.

Slides
Code

(Prerecorded) L9E1: Stationarity and Autocorrelation

Lecture 09: Time Series II (prerecorded)

We learn what a stationary time series is and see how we can assess whether a time series is clearly non-stationary.

Slides
Code

(Prerecorded) L9E1: Next Steps

Lecture 09: Time Series II (prerecorded)

Where to go in order to keep learning about time series and forecasting.

Slides
Code

Web Scraping with BeautifulSoup

Data Collection (prerecorded)

We give a brief introduction to web scraping with BeautifulSoup.

Slides
Transcript
Code

PCA and Basketball

Bonus content (prerecorded)

A cool PCA example where the components have an interesting interpretation.

Slides
Code

k Means Clustering

Bonus content (prerecorded)

Our first clustering algorithm.

Slides
Code

More Advanced Pipelines

Bonus content (prerecorded)

We demonstrate some more advanced pipeline techniques in sklearn.

Slides
Code

GridSearchCV

Bonus content (prerecorded)

We introduce an sklearn object that makes hyperparameter tuning a tad easier.
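
A minimal sketch of how `GridSearchCV` is typically used. The estimator (`Ridge`), the candidate `alpha` values, the scoring choice, and the synthetic dataset are assumptions for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Every value in the grid is evaluated with 5-fold cross-validation.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MSE of the best setting
```

After fitting, `search.best_estimator_` is the model refit on all of the data with the winning setting, and `search` itself can be used for prediction.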

Slides
Code

(Prerecorded) L2E4: Classification Metrics

Lecture 02: Model Evaluation (Prerecorded)

Derivation of cross-entropy loss via negative log-likelihood. Confusion-matrix-based classification metrics (accuracy, precision, recall, etc.).
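
The confusion-matrix metrics reduce to four counts. The sketch below computes them by hand for binary labels; the function name and the toy labels are assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tp, fp, fn, tn, accuracy, precision, recall)
```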

Slides
Transcript
Code

(Prerecorded) L2E1: Data Splits

Lecture 02: Model Evaluation (Prerecorded)

Train/test splits, cross-validation, and the frequent need for custom splits to give an "honest assessment" of model performance. Always think about what you want your model to generalize to!
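
One common kind of custom split keeps all rows from the same group (for example, the same patient or the same site) on one side of the split, so the model is evaluated on groups it has never seen. A sketch using scikit-learn's `GroupKFold`; the group labels and data here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
groups = np.repeat(["a", "b", "c", "d"], 3)  # hypothetical group ids, three rows each

splits = list(GroupKFold(n_splits=4).split(X, groups=groups))
for train_idx, test_idx in splits:
    held_out = sorted(set(groups[test_idx]))
    shared = set(groups[train_idx]) & set(groups[test_idx])
    print("held out:", held_out, "groups on both sides:", len(shared))
```

A plain random split on the same data would scatter each group across train and test, which tends to give an overly optimistic score when rows within a group are correlated.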

Slides
Transcript
Code

(Prerecorded) L1E4: Supervised Learning Framework

Lecture 01: Supervised Learning (Prerecorded)

Supervised learning models are trained on a dataset that includes both inputs (features) and the correct outputs (labels or target values). We discuss what they are and how they are trained.

Slides
Transcript
Code

(Prerecorded) L1E1: Workflow Overview

Lecture 01: Supervised Learning (Prerecorded)

Overview of the Data Science workflow (asking the right question, collecting/cleaning/exploring data, modeling, and deployment/reporting). A minimal example using concrete compressive strength.

Slides
Transcript
Code

(Prerecorded) L3E4: Hyperparameter Tuning

Lecture 03: Complexity Control (prerecorded)

Hyperparameter tuning and nested cross-validation.

Slides
Transcript
Code

(Prerecorded) L4E1: Multiple Linear Regression

Lecture 04: Regression I (prerecorded)

We introduce multiple linear regression.

Slides
Code

(Prerecorded) L4E1: Scaling Data

Lecture 04: Regression I (prerecorded)

Having features with very different scales can impact the performance of some models. In this video we show how to use sklearn's StandardScaler object to standardize our features.
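
A minimal sketch of the usual `StandardScaler` pattern: learn the mean and standard deviation on the training data, then reuse them on new data. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values).
X_train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_test = np.array([[2.0, 4000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Fitting the scaler on the training set alone avoids leaking information from the test set into preprocessing.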

Slides
Code

(Prerecorded) L8E2: Adjustments for Time Series Data

Lecture 08: Time Series I (prerecorded)

Since time series are ordered, we need a slight adjustment to our cross-validation strategy.
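
One way to respect the ordering is to always train on the past and validate on the future. The sketch below uses scikit-learn's `TimeSeriesSplit` as an example of such a scheme; whether the lecture uses this exact class is an assumption.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(8).reshape(-1, 1)  # eight consecutive time steps

splits = list(TimeSeriesSplit(n_splits=3).split(y))
for train_idx, test_idx in splits:
    # Every training index comes before every test index.
    print("train:", train_idx, "test:", test_idx)
```

Unlike shuffled k-fold, no fold ever trains on observations that come after the ones it is tested on.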

Slides
Code

(Prerecorded) L8E5: Rolling Averages

Lecture 08: Time Series I (prerecorded)

Our first time series forecast works by taking local averages of previous observations.
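
In its simplest form, such a forecast is just the mean of the last few observations. A minimal sketch, with a hypothetical function name and toy data:

```python
def rolling_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    recent = series[-window:]
    return sum(recent) / window

forecast = rolling_average_forecast([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], window=3)
print(forecast)
```

The window size controls the trade-off: a small window reacts quickly to changes, while a large one smooths out more noise.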

Slides
Code

(Prerecorded) L9E1: Autoregressive (AR(p)) Models

Lecture 09: Time Series II (prerecorded)

An autoregressive model expresses the value of a time series as a linear combination of lags plus an error term.
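
Because the model is linear in its lags, one simple way to estimate the coefficients is ordinary least squares on a matrix of lagged values. This is a sketch under that assumption; the function name is hypothetical, and dedicated time series libraries use more careful estimators.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of y_t = c + phi_1 * y_{t-1} + ... + phi_p * y_{t-p}."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    # Column k holds the lag-(k+1) values aligned with the targets series[p:].
    lags = np.column_stack([series[p - k - 1:n - k - 1] for k in range(p)])
    design = np.column_stack([np.ones(n - p), lags])
    coef, *_ = np.linalg.lstsq(design, series[p:], rcond=None)
    return coef[0], coef[1:]

# Noise-free toy series generated by y_t = 1 + 0.5 * y_{t-1}, starting at 0.
series = [0.0]
for _ in range(19):
    series.append(1.0 + 0.5 * series[-1])

intercept, phi = fit_ar(series, p=1)
print(intercept, phi)
```

On this noise-free series the fit recovers the generating intercept and coefficient; with real data the error term means the estimates are only approximate.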

Slides
Transcript
Code

(Prerecorded) L9E1: SARIMA

Lecture 09: Time Series II (prerecorded)

SARIMA stands for Seasonal Autoregressive Integrated Moving Average. We will learn how these models are defined and see an example of how to use such a model in practice.

Slides
Code

Python and APIs

Data Collection (prerecorded)

How can we use Python to collect data from APIs?

Slides
Transcript
Code

tSNE

Bonus content (prerecorded)

Another dimension reduction technique, used primarily for data visualization.

Slides
Code

Hierarchical Clustering

Bonus content (prerecorded)

Our second clustering algorithm.

Slides
Code

Regression Version of Classification Algorithms

Bonus content (prerecorded)

All of your favorite classification algorithms back for regression purposes.

Slides
Code

(Prerecorded) L3E1: Bias-Variance Trade-Off

Lecture 03: Complexity Control (prerecorded)

For squared-error loss, the expected generalization error of a learning algorithm can be decomposed into three terms: the squared bias, the variance, and the irreducible noise.
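
For reference, the standard form of this decomposition under squared-error loss is written out below. The notation is an assumption rather than the lecture's own: data are generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the model fit on a random training set, and expectations are taken over training sets and noise. Alongside bias and variance, it contains a noise term that no model can remove.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```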

Slides
Transcript
Code

(Prerecorded) L2E3: Regression Plots

Lecture 02: Model Evaluation (Prerecorded)

Diagnostic plots for regression analysis, including residual vs. feature plots, residual vs. predicted value plots, and QQ-plots of residuals.

Slides
Transcript
Code

(Prerecorded) L1E6: Pipelines and Transformations

Lecture 01: Supervised Learning (Prerecorded)

We often want to ingest raw data and apply a sequence of transformations (imputation, scaling, and feature engineering) to that data before modeling. Pipelines let us do that in a clean way.
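
A sketch of such a sequence as a single scikit-learn `Pipeline`: imputation, then scaling, then a model. The step names, the median imputation strategy, and the toy data are assumptions for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value; the target equals the first feature.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardize each feature
    ("regress", LinearRegression()),               # final estimator
])
model.fit(X, y)

# New raw data passes through the same fitted transformations before prediction.
pred = model.predict(np.array([[5.0, np.nan]]))
print(pred)
```

Because every step is fit inside the pipeline, the whole object can be passed to cross-validation tools without the preprocessing seeing the held-out data.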

Slides
Transcript
Code

(Prerecorded) L1E3: Data Cleaning and EDA

Lecture 01: Supervised Learning (Prerecorded)

Overview of the basics of data cleaning and exploratory data analysis. Showcases some fundamental tools (pandas, matplotlib, plotly) as well as some more specialized tools (pandera, missingno).

Slides
Transcript
Code

Project/Homework Instructions

Project/Team Formation
Project Submission
Projects README

Schedule

Click on any date for more details

Phase 1 - Instruction and Project Completion: Feb 02 - Mar 20, 2026
Project Review & Judging: Mar 23 - Mar 26, 2026
Phase 2 - Intense Interview Prep & Career Connections: Mar 27 - May 1, 2026

Lecture 00: Orientation / Computer Setup Day

Jan 29, 2026 at 06:30 PM UTC

EVENT

Lecture 01: Supervised Learning

Feb 3, 2026 at 06:30 PM UTC

EVENT

Lecture 02: Model Evaluation

Feb 5, 2026 at 06:30 PM UTC

EVENT

Problem Session 02

Feb 9, 2026 at 06:30 PM UTC

EVENT

Problem Session 03

Feb 11, 2026 at 06:30 PM UTC

EVENT

Problem Session 04

Feb 16, 2026 at 06:30 PM UTC

EVENT

Problem Session 05

Feb 18, 2026 at 06:30 PM UTC

EVENT

Problem Session 06

Feb 23, 2026 at 06:30 PM UTC

EVENT

Problem Session 07

Feb 25, 2026 at 06:30 PM UTC

EVENT

Problem Session 08

Mar 2, 2026 at 06:30 PM UTC

EVENT

Problem Session 09

Mar 4, 2026 at 06:30 PM UTC

EVENT

Problem Session 10

Mar 9, 2026 at 05:30 PM UTC

EVENT

Problem Session 11

Mar 11, 2026 at 05:30 PM UTC

EVENT

Math Hour 00

Feb 2, 2026 at 03:00 PM UTC

EVENT

Math Hour 01

Feb 4, 2026 at 03:00 PM UTC

EVENT

Project Pitch Hour

Feb 6, 2026 at 09:00 PM UTC

EVENT

Lecture 03: Complexity Control

Feb 10, 2026 at 06:30 PM UTC

EVENT

Lecture 04: Linear Regression

Feb 12, 2026 at 06:30 PM UTC

EVENT

Lecture 05: Generalized Linear Models and Generalized Additive Models

Feb 17, 2026 at 06:30 PM UTC

EVENT

Lecture 06: Inference I

Feb 19, 2026 at 06:30 PM UTC

EVENT

Lecture 07: Inference II

Feb 24, 2026 at 06:30 PM UTC

EVENT

Lecture 08: Time Series I

Feb 26, 2026 at 06:30 PM UTC

EVENT

Lecture 09: Time Series II

Mar 3, 2026 at 06:30 PM UTC

EVENT

Lecture 10: Ensemble Learning I

Mar 5, 2026 at 06:30 PM UTC

EVENT

Lecture 11: Ensemble Learning II

Mar 10, 2026 at 05:30 PM UTC

EVENT

Lecture 12: Introduction to Neural Networks

Mar 12, 2026 at 05:30 PM UTC

EVENT

Problem Session 00

Feb 2, 2026 at 06:30 PM UTC

EVENT

Problem Session 01

Feb 4, 2026 at 06:30 PM UTC

EVENT

Math Hour 02

Feb 9, 2026 at 03:00 PM UTC

EVENT

Math Hour 03

Feb 11, 2026 at 03:00 PM UTC

EVENT

Math Hour 04

Feb 16, 2026 at 03:00 PM UTC

EVENT

Math Hour 05

Feb 18, 2026 at 03:00 PM UTC

EVENT

Math Hour 06

Feb 23, 2026 at 03:00 PM UTC

EVENT

Math Hour 07

Feb 25, 2026 at 03:00 PM UTC

EVENT

Math Hour 08

Mar 2, 2026 at 03:00 PM UTC

EVENT

Math Hour 09

Mar 4, 2026 at 03:00 PM UTC

EVENT

Math Hour 10

Mar 9, 2026 at 02:00 PM UTC

EVENT

Math Hour 11

Mar 11, 2026 at 02:00 PM UTC

EVENT

Project/Homework Deadlines

Jan 31, 2026

04:59 AM UTC

Last chance to switch bootcamps

Email Amalya Lehmann at amalya@erdosinstitute.org if you would like to switch to a different bootcamp.

Feb 5, 2026

04:59 AM UTC

Watch video about Project Formation

This should help answer any questions you may have going into project formation.

Feb 5, 2026

04:59 AM UTC

Watch 3 Previous Top Projects

Consult the project database and watch at least 3 previous top projects from Erdős alumni.

Feb 6, 2026

09:00 PM UTC

Project Pitch Hour

Opportunity to meet with other Erdős Fellows and form teams and propose topics.

Feb 12, 2026

04:59 AM UTC

Last day to defer enrollment to a future cohort

Contact Amalya Lehmann (amalya@erdosinstitute.org) if you would like to unenroll from this cohort and defer to a future cohort.

Feb 12, 2026

04:59 AM UTC

Finalized Teams with Preliminary Project Ideas

Teams need to be finalized by this point. If you proposed or created a project, you must have others in your group. If you did not propose or create a project, you must join an open group.

Feb 14, 2026

04:59 AM UTC

Data gathering and defining stakeholders + KPIs

Find the dataset you will be working with. Describe the dataset and the problem you are looking to solve (1 page max). List the stakeholders of the project and company key performance indicators (KPIs) (bullet points).

Feb 21, 2026

04:59 AM UTC

Data cleaning + preprocessing + EDA

Look for missing values and duplicates. Basic data manipulation & preliminary feature engineering. Exploratory data analysis.

Feb 28, 2026

04:59 AM UTC

Written proposal of modeling approach [Checkpoint]

Describe your planned modeling approach, based on the exploratory data analysis from the last two weeks (< 1 page, bullet points).

Mar 7, 2026

04:59 AM UTC

Modeling and Preliminary Results

Results with visualizations and/or metrics. List of successes and pitfalls.

Mar 14, 2026

03:59 AM UTC

Clean your repository and start working on final presentation

Clean up your repository so that an outsider can easily follow your work. Convert notebooks into scripts where possible. Confirm that the whole pipeline from data ingestion all the way to prediction or inference works without fuss.

Mar 21, 2026

03:59 AM UTC

Final Project Deadline

Submit your final project by this time.

©2017-2025 by The Erdős Institute.
