©2018 by The Erdős Institute.

CŐDE

Train our PhDs with us through our Data Science Boot Camps

Designed to train PhDs with the skill sets necessary for successful employment in your organization.

MAY 2020 DATA SCIENCE BOOT CAMPS

PRE-REGISTRATION NOW OPEN

Corporate Sponsorship and Participation

The Erdős Institute is now accepting new corporate sponsors and participants for its May 2020 Data Science Boot Camps. Over 90 PhD students and postdocs from our member universities have already pre-registered.

Educational Aim

The goal of our Python Boot Camp is to provide you with the skills needed to produce a portfolio-worthy data science/machine learning project.

Topics to be Covered
In keeping with our educational aim, this course will walk you through the steps involved in a typical data science/machine learning project. While covering a variety of important techniques and algorithms, we will highlight themes and motivations that can be adapted to many data science settings.


Below we outline the specific topics we will be covering in this boot camp.

 

  • Data Gathering Techniques

    • Searching common online sources for data

    • Basic web scraping with BeautifulSoup

    • Interacting with APIs
       

  • Data Cleaning

    • Data Types

    • Basic data exploration with pandas and numpy

    • Basic plotting with matplotlib

    • Handling Missing Data

    • Common Data Transformations
       

  • Supervised Learning

    • Regression

      • Simple Linear Regression

      • Multiple Linear Regression

      • Polynomial Regression

      • Ridge Regression

      • LASSO

      • Kernel Regression (if time permits)

      • Local Regression (if time permits)

    • Classification

      • Nearest Neighbor Methods

      • Naive Bayes

      • Logistic Regression

      • Decision Trees

      • Random Forests

      • Support Vector Machines
         

  • Unsupervised Learning

    • Dimensionality Reduction

      • Principal Components Analysis

      • t-SNE

    • Clustering

      • k-Means

      • Hierarchical

      • DBSCAN
         

  • Forecasting for Time Series Data

    • Handling and cleaning time series data

    • Simple forecasting methods

    • Time series regression models

    • Smoothing

    • Exponential Smoothing
       

  • Neural Networks

    • Perceptrons

    • Shallow Networks
       

  • Presenting Results

    • Pandas for presentation

    • Advanced matplotlib

    • Plotting in seaborn

    • Introduction to Interactive Plotting With Python
       

  • Machine Learning Concepts

    • Train-Test Split

    • Loss Functions

    • Gradient Descent

    • Model Validation

    • Bias-Variance Trade-Off

    • Cross-Validation
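
As a rough illustration of how a few of the concepts above fit together (train-test split, loss functions, gradient descent, and validation on held-out data), here is a minimal sketch using only the Python standard library. The function names, the synthetic data, and the learning-rate and step-count values are all assumptions made for this example; they are not taken from the boot camp materials.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mse(rows, slope, intercept):
    """Mean squared error loss of the line y = slope * x + intercept."""
    return sum((slope * x + intercept - y) ** 2 for x, y in rows) / len(rows)


def fit_line(rows, learning_rate=0.01, steps=5000):
    """Fit slope and intercept by batch gradient descent on the MSE loss."""
    slope, intercept = 0.0, 0.0
    n = len(rows)
    for _ in range(steps):
        # Partial derivatives of the MSE with respect to each parameter
        grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in rows) / n
        grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in rows) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Synthetic (made-up) data: y = 3x + 2 plus a little Gaussian noise
noise = random.Random(42)
data = [(i / 10, 3 * (i / 10) + 2 + noise.gauss(0, 0.1)) for i in range(40)]

train, test = train_test_split(data)
slope, intercept = fit_line(train)

# Validate on data the model never saw during fitting
test_loss = mse(test, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_loss:.4f}")
```

In practice a library such as scikit-learn would handle the split and the fit; writing them out by hand is only meant to show what each step does.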

SAMPLE PROJECTS

Some winning projects from previous years

PHOTO GALLERY

Photos courtesy of Stephen Takacs Photography.

Two Tweet Too Furious

Matthew Osborne, Austin Antoniou, Dan McGregor, Luke Andrejek

Brief Description

Given a set of tweets containing #Charlottesville from the week of the Charlottesville Unite the Right rally, can we gain any insight into who is actually posting them?

 

Thanks to the Center for the Study of Networks and Society for the excellent data!

 

Data and Analysis

We had tweets from over 1,000 unique Twitter users to analyze. Each user in the data set tweeted #Charlottesville at least once in the week following the Unite the Right rally. Other than that, we knew nothing about the accounts.

 

We arranged these accounts into what we called a mutual retweet network (see Figure 1). Each node was one of the unique accounts from the data set and two accounts were connected with an edge if they retweeted the same accounts.

 

Our reasoning was that people who are similar will retweet the same content.

 

Analysis was done using the Python packages: pandas, numpy, matplotlib, and networkx.
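
The mutual retweet network described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's actual code: the input format (a list of `(user, retweeted_account)` pairs), the function name `mutual_retweet_edges`, the example accounts, and the choice to weight each edge by the number of shared retweeted accounts are all hypothetical. It uses only the standard library rather than networkx so that it stays self-contained.

```python
from collections import defaultdict
from itertools import combinations


def mutual_retweet_edges(retweets):
    """Return {(user_a, user_b): count of accounts both users retweeted}."""
    # Map each retweeted account to the set of users who retweeted it
    retweeters = defaultdict(set)
    for user, source in retweets:
        retweeters[source].add(user)

    # Every pair of users sharing a retweeted account gets an edge;
    # the weight counts how many such accounts they share
    edges = defaultdict(int)
    for users in retweeters.values():
        for a, b in combinations(sorted(users), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical (user, retweeted account) pairs
retweets = [
    ("alice", "news_a"), ("bob", "news_a"),
    ("alice", "news_b"), ("bob", "news_b"),
    ("carol", "news_c"), ("dave", "news_c"),
    ("bob", "news_c"),
]

edges = mutual_retweet_edges(retweets)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The resulting edge dictionary could then be loaded into a graph library for layout and community analysis.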

Figure 1

Outcome

Our approach resulted in clear communities that describe the accounts quite well. We’ve highlighted those communities along with pictures of the top accounts that the encircled users retweeted (see Figure 2). A quick breakdown of the communities is provided below.

 

We expected to get two communities corresponding to Republicans and Democrats. However, we were surprised to see two outlier communities that are seemingly unrelated to the politics of the Charlottesville incident.

 

Perhaps this technique of Twitter data analysis could be a fruitful way to identify who is talking about or leveraging an event/issue.

Community Breakdown

Green – Possibly followers of a religious group from India

Yellow – Media accounts with a similar political skew

Blue – Democratic-leaning accounts

Red – Republican-leaning accounts

Figure 2

Identifying Leaves with Python

Alex Beckwith, Jason Bello, Sheng Guo, Kiwon Lee, Francisco Martínez Figueroa

Goal: Classify species of leaves from a database of leaf images using techniques from image processing and computer vision.

Successive approximations of leaf outlines using elliptic Fourier analysis.

Principal component analysis to identify leaf species.

Some eigen-“leaves”

Primary packages used:

  • cv2 (for computer vision)

  • sklearn (for statistical analysis)

Examples of techniques used:

  • Principal component analysis

  • Elliptic Fourier analysis

  • FAST corner detection algorithm

  • Scale-invariant feature transform (SIFT)

  • Template matching via Hausdorff distance

  • Canny edge detection

  • Misc. statistics on leaves (aspect ratio, solidity, isoperimetric factor)

  • Distance-to-mean comparison
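
To make the "misc. statistics on leaves" item concrete, here is a minimal sketch that computes aspect ratio, solidity, and an isoperimetric factor for a leaf outline given as a polygon. The definitions are assumptions chosen for illustration: aspect ratio is taken from the axis-aligned bounding box, solidity is area divided by convex hull area, and the isoperimetric factor is 4πA/P². The project itself used cv2, and its exact definitions may differ. The outline coordinates are made up.

```python
import math


def polygon_area(points):
    """Shoelace formula; points are (x, y) vertices listed in order."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def polygon_perimeter(points):
    """Sum of edge lengths around the closed polygon."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def shape_stats(outline):
    """Return simple shape descriptors for a polygonal outline."""
    xs = [x for x, _ in outline]
    ys = [y for _, y in outline]
    area = polygon_area(outline)
    perimeter = polygon_perimeter(outline)
    return {
        "aspect_ratio": (max(xs) - min(xs)) / (max(ys) - min(ys)),
        "solidity": area / polygon_area(convex_hull(outline)),
        "isoperimetric_factor": 4 * math.pi * area / perimeter ** 2,
    }


# Hypothetical outline: a 4-by-2 rectangle with a notch cut into the top
outline = [(0, 0), (4, 0), (4, 2), (2, 1), (0, 2)]
stats = shape_stats(outline)
print(stats)
```

Descriptors like these are cheap to compute and can serve as features for a classifier alongside the contour-based methods listed above.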

FAST corner detection

Template matching via Hausdorff distance
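
As a rough illustration of template matching via Hausdorff distance, the sketch below compares an outline, given as a list of points, against a small set of template outlines and picks the closest one. The template names, the point coordinates, and the helper `best_template` are hypothetical. The brute-force distance computation is chosen for clarity, not speed, and it is not the team's implementation.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def best_template(outline, templates):
    """Return the name of the template closest to outline."""
    return min(templates, key=lambda name: hausdorff(outline, templates[name]))


# Hypothetical template outlines (corner points only, for brevity)
templates = {
    "square": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "wide": [(0, 0), (3, 0), (3, 1), (0, 1)],
}

# A slightly distorted version of the "wide" template
outline = [(0, 0), (2.8, 0.1), (3.1, 1), (0, 0.9)]

match = best_template(outline, templates)
print(match, hausdorff(outline, templates[match]))
```

This assumes the outline and the templates have already been aligned and scaled to a common frame; real leaf images would need that normalization step first.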