Photo taken by Stephen Takacs during the May 2019 Cőde boot camp.

ONLINE: FALL 2021 & MAY 2022

We are piloting a semester-long version of our signature Data Science Boot Camp this fall.


The Erdős Institute Data Science Boot Camps have been running for 4 years now thanks to the generous support of our academic and corporate sponsors and members. This year, we had a record number of participants complete the boot camp:

240 Participants on Day 1 (May 3, 2021)
190 Participants Completed Projects

Here are some statistics on those 190

Educational and Professional Aim

The goal of our Data Science Boot Camp is to provide you with the skills and mentorship necessary to produce a portfolio-worthy data science/machine learning project while also providing you with valuable career development support and connecting you with potential employers.


Those who successfully complete our program will also receive our certificate of completion. Participants also have the option of signing up for our professional development services, which include career coaching, mock interviews, resume and LinkedIn profile review, corporate connections, and more!


In keeping with our educational aim, this course will walk you through the steps involved in a typical data science/machine learning project. While covering a variety of important techniques and algorithms, we will illuminate themes and motivations that can be adapted to many data science settings.

Below we outline some of the specific topics we will be covering in this boot camp.

  • Data Gathering Techniques

    • Searching common online sources for data

    • Basic web scraping with BeautifulSoup

  • Data Cleaning

    • Data Types

    • Basic data exploration with pandas and numpy

    • Basic plotting with matplotlib

    • Handling Missing Data

    • Common Data Transformations

  • Supervised Learning

    • Regression

      • Simple Linear Regression

      • Multiple Linear Regression

      • Polynomial Regression

      • Ridge Regression

      • LASSO

      • Kernel Regression (if time permits)

      • Local Regression (if time permits)

    • Classification

      • Nearest Neighbor Methods

      • Naive Bayes

      • Logistic Regression

      • Decision Trees

      • Random Forests

      • Support Vector Machines

    • Ensemble Learning

  • Forecasting for Time Series Data

    • Handling and cleaning time series data

    • Simple forecasting methods

    • Time series regression models

    • Smoothing

    • Exponential Smoothing

  • Unsupervised Learning

    • Dimensionality Reduction

      • Principal Components Analysis

    • Clustering

  • Neural Networks

    • Perceptrons

    • Shallow Networks

  • Presenting Results

    • Pandas for presentation

    • Advanced matplotlib

    • Plotting in seaborn

  • Machine Learning Concepts

    • Train-Test Split

    • Loss Functions

    • Gradient Descent

    • Model Validation

    • Bias-Variance Trade-Off

    • Cross Validation

  • And MORE!

Corporate Members and Sponsors

Get in touch with us today to learn how you and your organization can join as members!


What people are saying

The Erdős Institute's signature Data Science Boot Camp has grown rapidly each year since our May 2018 pilot:

In May 2020, our graduating boot camp class gave us a Net Promoter Score (NPS) of


Comments from our May 2020 Exit Survey:

The boot camp was excellent! As a PhD student in pure mathematics, I appreciate that the boot camp gave me an idea of what people are actually doing when they work in data science. Also, I got the introductory knowledge and technical tools needed to start learning about data science more deeply and trying out basic data science projects on my own.

The Erdős Institute's boot camp was a phenomenal experience. Not only was the information provided in a digestible format, it also empowered me to explore further topics. The best part of the course was interacting with graduate students interested in data science and developing a project aimed at generating novel insights around a specific issue. I couldn't be more grateful for the opportunity to grow as a data scientist and researcher.

It was an amazing boot camp. I think the online environments helped a lot.

The boot-camp was my first foray into machine-learning, and I am delighted with what I have learned from the camp. It was a great introduction to the basics of ML, and by the end of the camp, I gained a lot of experience in the field. The collaborative project also helped hone several interpersonal skills, alongside the experience of building a final product for deployment.

This program was a great opportunity for me to get hands-on experience on Python programming as well as Machine Learning. I personally enjoyed the thoughtful design of the Boot Camp which made it conducive to learning and the informative and well-prepared lectures that we had throughout the program.

An incredible program that catalyzed my machine learning and data analysis skills

I really loved the way in which the bootcamp was organized despite having to go online this year. The well thought-out notebooks will remain a valuable resource in my arsenal as I continue my data science journey. The final project was my first collaborative coding experience and taught me many valuable skills such as the effective use of version control, pull requests etc as well as presenting the results of the project to a diverse audience in a concise yet clear manner.

Very well organized and structured course.

I really enjoyed this camp. During this COVID situation, it was a great opportunity for me to acquire an industrial exposure in Data science field. Besides learning different machine learning algorithms, I learn how to apply those algorithms in order to obtain better results for any given requirement. Thank you, Erdős.

Awesome course, I definitely recommend if you want a jump-start in machine learning.

Loved the program. Thank you guys for the great work.

I learned a lot; the bootcamp has given me a really good basic primer and the motivation to explore data science topics on my own in the future.

The Data Science Boot Camp rapidly expanded what I could do. It was helpful even as an absolute beginner in data science. I'm grateful for this opportunity.


Some winning projects from previous years

bookend: an authorship attribution classifier

Kyle Dettman, Elaad Applebaum, Diptanil Roy, Nikhil Tilak


Goal: Given some small snippet from a book, can we devise a classifier that can predict the author with some accuracy?

Data Source:

  • E-books from the open-source Project Gutenberg by authors such as Jane Austen, Mark Twain, and Jack London

  • A purpose-built class in Python to handle the reading in, cleaning, and general analysis of these e-books.


  • A series of Natural Language Processing techniques to build features: n-grams, syntactic tagging, bag-of-words, and lexical features.

  • The predictions from each of these models were fed to a soft-voting classifier to obtain the results.
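The soft-voting step described above can be sketched as follows. This is a minimal illustration, not the team's code: the data is synthetic, the two base models are hypothetical stand-ins for the project's feature-specific models, and the only technique taken from the summary is soft voting, which averages the class probabilities predicted by each base model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for snippet features: 600 snippets, 20 features, 3 "authors".
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical base models (assumption): the project's own models are not shown.
base_models = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("naive_bayes", GaussianNB()),
]

# voting="soft" averages the class probabilities predicted by each base model.
voter = VotingClassifier(estimators=base_models, voting="soft")
voter.fit(X_train, y_train)

ensemble_proba = voter.predict_proba(X_test)
predictions = voter.predict(X_test)

# The same average computed by hand from the fitted base models.
manual_proba = np.mean(
    [model.predict_proba(X_test) for model in voter.estimators_], axis=0
)

print("accuracy on held-out snippets:", voter.score(X_test, y_test))
```

The predicted author is simply the class with the highest averaged probability, so a model that is confident can outvote one that is unsure.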

Packages: scikit-learn, NLTK, Ngram-graphs



  • Classifier achieves >90% accuracy on texts it had never seen before.

  • Algorithm correctly classifies books written by J.K. Rowling under her pseudonym Robert Galbraith, demonstrating cross-genre success in identifying author-level characteristics.

  • Classifier correctly attributes the disputed Federalist papers to Madison.

Birds in Random Forests

Dananjaya Liyanage, Caleb Dilsavor, Hiran Wijesinghe



The following is a brief summary of classifiers generated along with their accuracy. Cross validation was performed using random 1:5 testing to training splits of the data set.


Extracted ~23k feature vectors from ~8k recordings and tested various classifiers using labels provided in metadata.

  • Scraping and preprocessing data using R

  • Fourier spectral analysis using tuneR (R library)

  • warbleR (R library) for feature extraction

  • Python with Jupyter Notebook

  • Matplotlib (Python library) for visualization

  • Scikit-learn (Python library) for machine learning
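The random 1:5 testing-to-training splits mentioned in the summary could be set up roughly as below. This is a sketch under stated assumptions: the feature vectors are synthetic, the random forest and the number of repeated splits (five) are illustrative choices, and only the 1:5 ratio comes from the project description.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the extracted feature vectors and their labels.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)

# One part testing to five parts training: hold out one sixth of the rows.
n_test = len(X) // 6
n_train = len(X) - n_test

# Each split is an independent random shuffle (the number of splits is an assumption).
splitter = ShuffleSplit(
    n_splits=5, test_size=n_test, train_size=n_train, random_state=0
)
split_sizes = [
    (len(train_idx), len(test_idx)) for train_idx, test_idx in splitter.split(X)
]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=splitter)

print("train/test sizes per split:", split_sizes)
print("mean accuracy over splits:", scores.mean())
```

Unlike k-fold cross validation, these random splits may reuse the same rows for testing in more than one repeat, which is fine when the goal is an averaged accuracy estimate.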

Two Tweet Too Furious

Matthew Osborne, Austin Antoniou, Dan McGregor, Luke Andrejek

Brief Description

In a set of tweets containing #Charlottesville from the week of the Charlottesville Unite the Right rally, can we gain any insight on who is actually tweeting these tweets?


Thanks to the Center for the Study of Networks and Society for the excellent data!


Data and Analysis

We had tweets from over 1000 unique Twitter users to analyze. Each user in the data set tweeted #Charlottesville at least once in the week following the Unite the Right rally. Other than that, we knew nothing about the accounts.


We arranged these accounts into what we called a mutual retweet network (see Figure 1). Each node was one of the unique accounts from the data set and two accounts were connected with an edge if they retweeted the same accounts.


Our reasoning was that people who are similar will retweet the same content.


Analysis was done using the Python packages: pandas, numpy, matplotlib, and networkx.
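A mutual retweet network of the kind described above can be sketched in a few lines. This is an illustration only: the account names are made up, the rule "connect two accounts if they share at least one retweeted account" is one reading of the description, and the edge weight (the number of shared retweeted accounts) is an added assumption. The sketch uses only the standard library; the resulting pairs could be loaded into networkx, for example with `Graph.add_edges_from`.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (retweeter, retweeted_account) pairs; not the project's data.
retweets = [
    ("user_a", "news_1"),
    ("user_b", "news_1"),
    ("user_b", "news_2"),
    ("user_c", "news_2"),
    ("user_d", "blog_9"),
]


def mutual_retweet_edges(retweets):
    """Return {(u, v): weight} for accounts that retweeted a common account."""
    # Group retweeters by the account they retweeted.
    retweeters_of = defaultdict(set)
    for retweeter, source in retweets:
        retweeters_of[source].add(retweeter)

    # Every pair of accounts sharing a retweeted account gets an edge;
    # the weight counts how many retweeted accounts the pair has in common.
    weights = defaultdict(int)
    for users in retweeters_of.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)


edges = mutual_retweet_edges(retweets)
print(edges)
```

In this toy data, user_d retweeted an account nobody else did, so it ends up as an isolated node with no edges.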

Figure 1: Retweet graph


Our approach resulted in clear communities that describe the accounts quite well. We’ve highlighted those communities along with pictures of the top accounts that the encircled users retweeted (see Figure 2). A quick breakdown of the communities is provided below.


We expected to get two communities corresponding to Republicans and Democrats; however, we were surprised to see two outlier communities that are seemingly unrelated to the politics of the Charlottesville incident.


Perhaps this technique of Twitter data analysis could be a fruitful way to identify who is talking about or leveraging an event/issue.

Community Breakdown

Green – Possibly followers of a religious group from India

Yellow – Media accounts with a similar political skew

Blue – Democrat-leaning accounts

Red – Republican-leaning accounts

Figure 2: Retweet graph, labelled

Identifying Leaves with Python

Alex Beckwith, Jason Bello, Sheng Guo, Kiwon Lee, Francisco Martínez Figueroa


Photos courtesy of Stephen Takacs Photography.


Goal: Classify species of leaves from a database of leaf images using techniques from image processing and computer vision.


Successive approximations of leaf outlines using elliptic Fourier analysis.

Principal component analysis to identify leaf species.


Some eigen-“leaves”
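The idea behind eigen-"leaves" can be sketched with a plain singular value decomposition. This is a minimal example under stated assumptions: the images are random arrays standing in for a leaf database, the 16x16 size and the choice of five components are arbitrary, and NumPy's SVD is used here in place of whatever PCA routine the team actually ran.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a leaf database: 40 grayscale images of 16x16 pixels.
images = rng.random((40, 16, 16))

# Flatten each image into one row and subtract the "mean leaf".
X = images.reshape(len(images), -1)
mean_leaf = X.mean(axis=0)
X_centered = X - mean_leaf

# Rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 5
components = Vt[:n_components]
eigen_leaves = components.reshape(n_components, 16, 16)  # viewable as images

# Low-dimensional coordinates of each leaf, usable as classifier features.
scores = X_centered @ components.T

# Fraction of total variance captured by each principal direction.
explained = S**2 / np.sum(S**2)
print("variance captured by 5 components:", explained[:n_components].sum())
```

Each leaf is then summarized by five numbers instead of 256 pixels, and species can be compared in that smaller space.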

Primary packages used:

  • cv2 (for computer vision)

  • sklearn (for statistical analysis)

Examples of techniques used:

  • Principal component analysis

  • Elliptic Fourier analysis

  • ~FAST corner detection algorithm

  • Scale-invariant feature transform (SIFT)

  • Template matching via Hausdorff distance

  • Canny edge detection

  • Misc. statistics on leaves (aspect ratio, solidity, isoperimetric factor)

  • Distance-to-mean comparison
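As one example from the list above, template matching via Hausdorff distance can be sketched as follows. The outlines here are toy ellipses sampled at 60 points, the template names are hypothetical, and the function computes the symmetric Hausdorff distance between two point sets directly with NumPy; it is not the team's implementation.

```python
import numpy as np


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between point sets a (n, 2) and b (m, 2)."""
    # pairwise[i, j] is the Euclidean distance from a[i] to b[j].
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    a_to_b = pairwise.min(axis=1).max()  # farthest any point of a is from b
    b_to_a = pairwise.min(axis=0).max()  # farthest any point of b is from a
    return max(a_to_b, b_to_a)


def closest_template(outline, templates):
    """Return the name of the template outline nearest to the given outline."""
    return min(
        templates, key=lambda name: hausdorff_distance(outline, templates[name])
    )


# Toy outlines (assumption): ellipses standing in for leaf contours.
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
templates = {
    "round_leaf": np.column_stack([np.cos(angles), np.sin(angles)]),
    "narrow_leaf": np.column_stack([2.0 * np.cos(angles), 0.4 * np.sin(angles)]),
}
query = np.column_stack([1.05 * np.cos(angles), 0.95 * np.sin(angles)])

best_match = closest_template(query, templates)
print("best matching template:", best_match)
```

In practice the outlines would first need to be aligned and scaled to a common size, since the Hausdorff distance is sensitive to translation, rotation, and scale.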

Leaf Analysis

~FAST corner detection

Template matching via Hausdorff distance