Data Science Boot Camp

May-Summer 2024

May 6, 2024 - Jun 5, 2024



To access the program content, you must first create an account and member profile and be logged in.


Problem Solving Session 1

Next Event


Registration Deadlines

May 7, 2024: All interested participants

Category

Launch, Core Program, Boot Camp, Projects, Certificates

Overview

The Erdős Institute's signature Data Science Boot Camp has been running since May 2018 thanks to the generous support of our sponsors, members, and partners. Due to its popularity, we now offer the boot camp online three times per year in two formats: a month-long intensive boot camp each May and a semester-long version each Spring and Fall.

Slack

Click here to be invited to the Slack organization: The Erdős Institute

Click here to access the Slack cohort channel: #slack-cohort-channel

Click here to access the Slack program channel: #slack-program-channel


Click here to download the Events & Deadlines .ics calendar file

Organizers, Instructors, and Advisors


Steven Gubkin, PhD

Lead Instructor

Office Hours:

MTWRF 12pm - 1pm ET, and by appt.

Email:

Preferred Contact:

Slack

Please feel free to message me on Slack with any questions!


Alec Clott, PhD

Head of Data Science Projects

Office Hours:

By appt. only

Email:

Preferred Contact:

Slack

Participants are welcome to reach out to me via Slack or email. I normally work standard business hours (9am-5pm ET), but I can also find time to meet via Zoom after work. Let me know how I can help!

Objectives

The goal of our Data Science Boot Camp is to provide you with the skills and mentorship necessary to produce a portfolio-worthy data science or machine learning project, while also providing valuable career development support and connecting you with potential employers.

Project Examples

TEAM 13

Hitmakers vs. One-Hit Wonders: Predicting Sustained Success in the Music Industry

James McNally, Yundi Kong, Guillermo Sanmarco, Vishal Gupta


Question:
What early signals predict sustained success in the music industry?

Objective:
Many musicians produce hit songs, but not all are able to do so more than once. This project builds a machine learning classifier to distinguish hitmakers (artists with multiple top 20 Billboard Hot 100 hits) from one-hit wonders, using only information available at the moment of a musician’s first top 20 hit song.

Conclusions:
Our model reveals that prior charting experience, collaboration network position, chart longevity, genre breadth, and dominant genre affiliations are the strongest predictors of sustained success.

Data sources:
- MusicBrainz (artist metadata, genre tags, collaboration graph)
- Billboard Hot 100 & 200 chart data
- Spotify (artist and song metadata)
- Google Trends (relative search volume at time of first hit song)

TEAM 16

Predicting Lead Contamination in NY School Drinking Water

Ranadeep Roy, Cami Goray, Hana Lang


Lead is a toxic metal, and in children especially, lead exposure can have severe health consequences: even small amounts of lead have the potential to affect memory, behavior, and learning ability. Despite this, numerous schools across New York State have at least one drinking water outlet with lead levels testing above 5 ppb. In this project, we aim to predict the presence of lead contamination in school drinking water and to better understand the role of demographic, socioeconomic, infrastructural, and geographic features in elevated lead levels.

First Steps/Prerequisites

Participants should have a base-level familiarity with Python. Participants should also be familiar with some basic math concepts. Finally, you will also need to have your laptop or desktop computer set up for the course. If you are new to Python, need a quick math refresher, or if you need help setting up your computer, then please follow the link below.

Program Content


Course materials are available on GitHub through the following link:

Request Access to GitHub



Textbook/Notes

Note: Our video player does not support playback speed options. A third-party browser extension can change the playback speed; for example, video-speed-controller works in Chrome. If you would prefer to avoid a browser extension, you can also change the playback speed manually in the JavaScript console: Speed up any HTML5 video player!

Lecture 11: Ensemble Learning II

Live Lectures

Voter Models, AdaBoost, Gradient Boosting, XGBoost

Slides
Transcript
Code

Math Hour 3 (from Spring 2024)

Math Hour (Supplemental Content)

We give a geometric account of the Singular Value Decomposition (SVD) of a matrix. The SVD of the centered design matrix gives Principal Component Analysis (PCA).
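The link between the SVD and PCA can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the session: the function name pca_via_svd and the synthetic data are assumptions made for the example. It centers the design matrix, takes its SVD, reads the principal directions off the right singular vectors, and cross-checks the variances against the eigenvalues of the sample covariance matrix.

```python
import numpy as np


def pca_via_svd(X, n_components):
    """Hypothetical helper: PCA from the SVD of the centered design matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each column (feature)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]  # rows are principal directions
    # Squared singular values, scaled by n - 1, are the component variances.
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    scores = Xc @ components.T  # data expressed in the principal directions
    return components, explained_variance, scores


# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

components, explained_variance, scores = pca_via_svd(X, 2)

# Cross-check: the same variances appear as the largest eigenvalues
# of the sample covariance matrix.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
```

The singular values carry the same information as the covariance eigenvalues, so the SVD route avoids forming the covariance matrix explicitly.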

Transcript
Code

Math Hour 9 (from Spring 2024)

Math Hour (Supplemental Content)

We discuss linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).

Slides
Transcript
Code

Web Scraping with BeautifulSoup

02 Week: Data Collection (prerecorded)

We give a brief introduction to web scraping with BeautifulSoup.

Slides
Code

Live Lecture 12: Neural Networks

Live Lectures

Feed Forward Neural Networks, Convolutional NNs, Recurrent NNs

Slides
Transcript
Code

Math Hour 4 (from Spring 2024)

Math Hour (Supplemental Content)

We give several perspectives on regularization techniques including:
1. Ridge and Lasso as MAP estimators.
2. Ridge as OLS with "pseudo-observations".
3. Ridge as a "smooth" version of PCA regression.
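The "pseudo-observations" view in item 2 can be checked numerically. The sketch below is an illustration under stated assumptions, not the session's code: it assumes no intercept term, a penalty lam > 0, and synthetic data, and the function names are hypothetical. Appending sqrt(lam) times the identity to the design matrix, with zeros appended to the response, and then running ordinary least squares reproduces the ridge closed form.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Ridge solution (X^T X + lam * I)^{-1} X^T y, with no intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def ridge_via_pseudo_observations(X, y, lam):
    """OLS on the data augmented with p extra rows sqrt(lam) * I, response 0."""
    p = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    beta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return beta


# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta_closed = ridge_closed_form(X, y, lam=3.0)
beta_pseudo = ridge_via_pseudo_observations(X, y, lam=3.0)
```

Each pseudo-observation adds a squared residual of lam times one coefficient squared, so the augmented least-squares objective equals the ridge objective.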

Slides
Transcript
Code

A Broad Overview

02 Week: Data Collection (prerecorded)

In this video, we give a bird's-eye view of what we will cover in our data science content.

Slides
Code

Python and APIs

02 Week: Data Collection (prerecorded)

How can we use Python to collect data from APIs?

Slides

Math Hour 2 (from Spring 2024)

Math Hour (Supplemental Content)

1. We give a geometric interpretation of Bessel's correction.
2. We derive MLE estimates for simple linear regression.
3. We interpret multiple linear regression as orthogonal projection.
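Item 3 can be illustrated with a short NumPy sketch. It is an example under assumptions (synthetic data, a full-rank design matrix with an intercept column), not the session's code. The fitted values are the orthogonal projection of y onto the column space of X, so the residual is orthogonal to every column of X and the projection matrix is idempotent.

```python
import numpy as np

# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
y = rng.normal(size=n)

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residual = y - y_hat

# Projection ("hat") matrix onto the column space of X, built from a
# reduced QR factorization: the columns of Q are an orthonormal basis.
Q, _ = np.linalg.qr(X)
H = Q @ Q.T
```

Here H @ y reproduces y_hat, X.T @ residual is zero up to rounding, and H @ H equals H, which are the defining properties of an orthogonal projection.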

Slides
Transcript
Code

Math Hour 7 (from Spring 2024)

Math Hour (Supplemental Content)

We discuss how to fit logistic regression models.

Slides
Transcript
Code

Data Source Websites

02 Week: Data Collection (prerecorded)

We cover a plethora of data source websites you can use.

Slides
Code

Data in Databases

02 Week: Data Collection (prerecorded)

Your data is stuck in a database. Can you get it out? Learn how in this video.

Slides
Code

Project/Homework Instructions


Project/Team Formation
Project Submission
Projects README

How To Form Projects

Presentation Tips and Tricks (prerecorded)

This video shows you how to navigate the team formation process on the Erdős website.

Slides
Transcript

Project Pitch Hour

Participant project pitches (live)

This is a recording of the May 19th Project Pitch Hour

Slides
Transcript
Code

Schedule


Orientation & Setup

Phase 1: Instruction and Project Completion

Project Review & Judging

Phase 2: Intense Interview Prep & Career Connections

Problem Solving Session 1
May 6, 2024 at 03:00 PM UTC

Office Hours
May 6, 2024 at 04:00 PM UTC

Lecture 1: Introduction
May 6, 2024 at 07:00 PM UTC

Extra Help with Setting Up
May 6, 2024 at 08:30 PM UTC

Problem Solving Session 2
May 7, 2024 at 03:00 PM UTC

Office Hours
May 7, 2024 at 04:00 PM UTC

Lecture 2: Data Collection
May 7, 2024 at 07:00 PM UTC

Problem Solving Session 3
May 8, 2024 at 03:00 PM UTC

Office Hours
May 8, 2024 at 04:00 PM UTC

Lecture 3: Regression I
May 8, 2024 at 07:00 PM UTC

Alternate Problem Session 3
May 8, 2024 at 11:00 PM UTC

Problem Solving Session 4
May 9, 2024 at 03:00 PM UTC

Office Hours
May 9, 2024 at 04:00 PM UTC

Lecture 4: Regression II
May 9, 2024 at 07:00 PM UTC

Alternate Problem Session 4
May 9, 2024 at 11:00 PM UTC

Office Hours
May 10, 2024 at 04:00 PM UTC

Project Pitch Hour
May 10, 2024 at 08:00 PM UTC

Problem Solving Session 5
May 13, 2024 at 03:00 PM UTC

Office Hours
May 13, 2024 at 04:00 PM UTC

Lecture 5: Regression III
May 13, 2024 at 07:00 PM UTC

Alternate Problem Session 5
May 13, 2024 at 11:00 PM UTC

Problem Solving Session 6
May 14, 2024 at 03:00 PM UTC

Office Hours
May 14, 2024 at 04:00 PM UTC

Lecture 6: Time Series I
May 14, 2024 at 07:00 PM UTC

Alternate Problem Session 6
May 14, 2024 at 11:00 PM UTC

Problem Solving Session 7
May 15, 2024 at 03:00 PM UTC

Office Hours
May 15, 2024 at 04:00 PM UTC

Lecture 7: Time Series II
May 15, 2024 at 07:00 PM UTC

Alternate Problem Session 7
May 15, 2024 at 11:00 PM UTC

Problem Solving Session 8
May 16, 2024 at 03:00 PM UTC

Office Hours
May 16, 2024 at 04:00 PM UTC

Lecture 8: Classification I
May 16, 2024 at 07:00 PM UTC

Alternate Problem Session 8
May 16, 2024 at 11:00 PM UTC

Office Hours
May 17, 2024 at 04:00 PM UTC

Problem Solving Session 9
May 20, 2024 at 03:00 PM UTC

Office Hours
May 20, 2024 at 04:00 PM UTC

Lecture 9: Classification II
May 20, 2024 at 07:00 PM UTC

Alternate Problem Session 9
May 20, 2024 at 11:00 PM UTC

Problem Solving Session 10
May 21, 2024 at 03:00 PM UTC

Office Hours
May 21, 2024 at 04:00 PM UTC

Lecture 10: Ensemble Learning I
May 21, 2024 at 07:00 PM UTC

Alternate Problem Session 10
May 21, 2024 at 11:00 PM UTC

Problem Solving Session 11
May 22, 2024 at 03:00 PM UTC

Office Hours
May 22, 2024 at 04:00 PM UTC

Lecture 11: Ensemble Learning II
May 22, 2024 at 07:00 PM UTC

Problem Solving Session 12
May 23, 2024 at 03:00 PM UTC

Office Hours
May 23, 2024 at 04:00 PM UTC

Lecture 12: Neural Networks
May 23, 2024 at 07:00 PM UTC

Alternate Problem Session 11
May 23, 2024 at 11:00 PM UTC

Office Hours
May 24, 2024 at 04:00 PM UTC

Office Hours
May 27, 2024 at 04:00 PM UTC

Office Hours
May 28, 2024 at 04:00 PM UTC

Office Hours
May 29, 2024 at 04:00 PM UTC

Office Hours
May 30, 2024 at 04:00 PM UTC

Office Hours
May 31, 2024 at 04:00 PM UTC

Erdős May 2024 Final Project Showcase
Jun 5, 2024 at 04:00 PM UTC

Project/Homework Deadlines

May 9, 2024

03:59 AM UTC

Watch 5 Previous Distinguished Projects

Click the "only show projects with distinction or higher" check box, watch five previous projects, and explore their GitHub repositories.

May 10, 2024

08:00 PM UTC

Project Pitch Hour

An opportunity to meet other Erdős Fellows, form teams, and propose topics.

May 14, 2024

03:59 AM UTC

Submit Team Proposal to Project Formation Page

If you want to propose a project, or have an idea for a project, submit it by this date.

May 15, 2024

03:59 AM UTC

Finalized Teams with Preliminary Project Idea

Teams need to be finalized by this point. If you proposed or created a project, you must have others in your group. If you did not propose or create a project, you must join an open group.

May 17, 2024

02:06 PM UTC

Data gathering and defining stakeholders + KPIs

Find the dataset you will be working with. Describe the dataset and the problem you are looking to solve (1 page max). List the stakeholders of the project and company key performance indicators (KPIs) (bullet points).

May 18, 2024

03:59 AM UTC

Data cleaning + preprocessing

Look for missing values and duplicates. Basic data manipulation & preliminary feature engineering.
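A first pass at this checkpoint might look like the following pandas sketch. The DataFrame, its column names, and the choice to drop rather than impute missing values are all assumptions made for illustration; the right handling depends on your dataset.

```python
import numpy as np
import pandas as pd

# Toy data, made up purely for illustration.
df = pd.DataFrame(
    {
        "id": ["A", "B", "B", "C"],
        "value": [3.0, 4.0, 4.0, np.nan],
    }
)

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Count rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# One possible cleaning step: drop exact duplicates, then drop rows
# where "value" is missing. Imputation is an alternative to dropping.
cleaned = (
    df.drop_duplicates()
    .dropna(subset=["value"])
    .reset_index(drop=True)
)
```

Reporting the counts before changing anything makes it easier to describe what the cleaning step removed.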

May 25, 2024

03:59 AM UTC

Written proposal of modeling approach [Checkpoint]

Test linearity assumptions. Dimensionality reduction (if necessary). Describe your planned modeling approach, based on the exploratory data analysis from the last two weeks (< 1 page, bullet points).

May 25, 2024

03:59 AM UTC

Exploratory data analysis + visualizations [Checkpoint]

Distributions of variables, looking for outliers, etc. Descriptive statistics.

Jun 1, 2024

03:59 AM UTC

Machine learning models or equivalent [Checkpoint]

Results with visualizations and/or metrics. List of successes and pitfalls.

Jun 2, 2024

03:59 AM UTC

Final project due

Please read the submission instructions at the link below.

©2017-2026 by The Erdős Institute.
