CŐDE (ONLINE)
Train our PhDs with us through our Data Science Boot Camps
Designed to train PhDs with the skill sets necessary for successful employment in your organization.
MAY 2020 DATA SCIENCE BOOT CAMPS
ONLINE: May 4-31, 2020
Corporate Sponsorship and Participation
The Erdős Institute is now accepting new corporate sponsors and data sets/challenges for its May 2020 Data Science Boot Camps. Over 140 PhD students and postdocs from 6 universities have registered. Join these world-class organizations in supporting the Erdős Institute's Data Science Boot Camp:
Educational Aim
The goal of our Python Boot Camp is to provide you with the skills needed to produce a portfolio-worthy data science/machine learning project.
Topics to be Covered
In keeping with our educational aim, this course will walk you through the steps involved in a typical data science/machine learning project. While covering a variety of important techniques and algorithms, we will illuminate themes and motivations that can be adapted to many data science settings.
Below we outline the specific topics we will be covering in this boot camp.

Data Gathering Techniques

Searching common online sources for data

Basic web scraping with BeautifulSoup

Interacting with APIs


Data Cleaning

Data Types

Basic data exploration with pandas and numpy

Basic plotting with matplotlib

Handling Missing Data

Common Data Transformations


Supervised Learning

Regression

Simple Linear Regression

Multiple Linear Regression

Polynomial Regression

Ridge Regression

LASSO

Kernel Regression (if time permits)

Local Regression (if time permits)


Classification

Nearest Neighbor Methods

Naive Bayes

Logistic Regression

Decision Trees

Random Forests

Support Vector Machines



Unsupervised Learning

Dimensionality Reduction

Principal Components Analysis

t-SNE


Clustering

k-Means

Hierarchical

DBSCAN



Forecasting for Time Series Data

Handling and cleaning time series data

Simple forecasting methods

Time series regression models

Smoothing

Exponential Smoothing


Neural Networks

Perceptrons

Shallow Networks


Presenting Results

Pandas for presentation

Advanced matplotlib

Plotting in seaborn

Introduction to Interactive Plotting With Python


Machine Learning Concepts

Training-Test Split

Loss Functions

Gradient Descent

Model Validation

Bias-Variance Trade-Off

Cross-Validation

SAMPLE PROJECTS
Some winning projects from previous years
Birds in Random Forests
Dananjaya Liyanage, Caleb Dilsavor, Hiran Wijesinghe
Results:
The following is a brief summary of the classifiers generated, along with their accuracy. Cross-validation was performed using random 1:5 testing-to-training splits of the data set.
Summary:
Extracted ~23k feature vectors from ~8k recordings and tested various classifiers using labels provided in metadata.

Scraping and preprocessing data from xeno-canto.org using R

Fourier spectral analysis using tuneR (R library)

warbleR (R library) for feature extraction

Python with Jupyter Notebook

Matplotlib (Python library) for visualization

scikit-learn (Python library) for machine learning
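The validation scheme described under Results (random 1:5 testing-to-training splits) could be reproduced along the following lines. This is only a minimal sketch: the data is synthetic, produced by `make_classification` as a stand-in for the extracted feature vectors, and the number of repeated splits, the forest size, and the random seeds are assumptions rather than the team's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the extracted feature vectors (NOT the bird data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# A 1:5 testing-to-training ratio means 1/6 of the samples are held out
# in each random split. The number of splits (10) is an assumption.
splitter = ShuffleSplit(n_splits=10, test_size=1 / 6, random_state=0)

# Random forest with assumed hyperparameters.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=splitter, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Swapping `clf` for another scikit-learn estimator while reusing the same `splitter` would allow several classifiers to be compared on identical splits.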
Two Tweet Too Furious
Matthew Osborne, Austin Antoniou, Dan McGregor, Luke Andrejek
Brief Description
In a set of tweets containing #Charlottesville from the week of the Charlottesville Unite the Right rally, can we gain any insight on who is actually tweeting these tweets?
Thanks to the Center for the Study of Networks and Society for the excellent data!
Data and Analysis
We had tweets from over 1000 unique Twitter users to analyze. Each user in the data set tweeted #Charlottesville at least once in the week following the Unite the Right rally. Other than that, we knew nothing about the accounts.
We arranged these accounts into what we called a mutual retweet network (see Figure 1). Each node was one of the unique accounts from the data set and two accounts were connected with an edge if they retweeted the same accounts.
Our reasoning was that people who are similar will retweet the same content.
Analysis was done using the Python packages: pandas, numpy, matplotlib, and networkx.
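The mutual retweet network described above can be sketched with pandas and networkx as follows. The `retweets` table, its column names, the `mutual_retweet_network` helper, and the `min_shared` threshold are all hypothetical; the project does not state how its data was laid out or how many shared accounts were required for an edge.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Hypothetical retweet log: one row per (user, account they retweeted).
retweets = pd.DataFrame(
    {
        "user": ["a", "a", "b", "b", "c", "d", "d", "e"],
        "retweeted": ["news1", "news2", "news1", "news3",
                      "news2", "blog1", "blog2", "blog1"],
    }
)


def mutual_retweet_network(df, min_shared=1):
    """Connect two users when they retweeted at least `min_shared` common accounts."""
    retweeted_by_user = df.groupby("user")["retweeted"].agg(set)
    graph = nx.Graph()
    graph.add_nodes_from(retweeted_by_user.index)
    # Compare every pair of users and count the accounts both retweeted.
    for (u, u_set), (v, v_set) in combinations(retweeted_by_user.items(), 2):
        shared = len(u_set & v_set)
        if shared >= min_shared:
            graph.add_edge(u, v, weight=shared)
    return graph


G = mutual_retweet_network(retweets)
print(sorted(G.edges(data="weight")))
```

The pairwise loop is quadratic in the number of users, which is reasonable for roughly a thousand accounts but would need a smarter approach for much larger data sets.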
Figure 1
Outcome
Our approach resulted in clear communities that describe the accounts quite well. We’ve highlighted those communities along with pictures of the top accounts that the encircled users retweeted (see Figure 2). A quick breakdown of the communities is provided below.
We expected to get two communities corresponding to Republicans and Democrats. However, we were surprised to see two outlier communities that are seemingly unrelated to the politics of the Charlottesville incident.
Perhaps this technique of Twitter data analysis could be a fruitful way to identify who is talking about or leveraging an event/issue.
Community Breakdown
Green – Possibly followers of a religious group from India
Yellow – Media accounts with a similar political skew
Blue – Democratic-leaning accounts
Red – Republican-leaning accounts
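The write-up does not say how the communities were extracted from the network. One common option, shown here purely as an assumption, is greedy modularity maximization as implemented in networkx. The toy graph below is made up for illustration and is unrelated to the Twitter data.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two tightly knit groups joined by a single bridge edge.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("a", "c"), ("b", "c")])  # group 1
G.add_edges_from([("x", "y"), ("x", "z"), ("y", "z")])  # group 2
G.add_edge("c", "x")                                    # bridge

# Each community is returned as a frozenset of node labels.
communities = greedy_modularity_communities(G)
for i, members in enumerate(communities):
    print(i, sorted(members))
```

On a real retweet network, the resulting node sets could then be used to color nodes in a plot, similar to the colored groups listed above.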
Figure 2
Identifying Leaves with Python
Alex Beckwith, Jason Bello, Sheng Guo, Kiwon Lee, Francisco Martínez Figueroa
PHOTO GALLERY
Photos courtesy of Stephen Takacs Photography.
Goal: Classify species of leaves from a database of leaf images using techniques from image processing and computer vision.
Successive approximations of leaf outlines using elliptic Fourier analysis.
Principal component analysis to identify leaf species.
Some eigen“leaves”
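As a rough illustration of using principal component analysis to separate species, the sketch below runs PCA on synthetic outlines and classifies the projected points with a nearest-neighbor model. Everything here is an assumption made for the example: the outlines are noisy ellipses rather than real leaves, the two "species" are invented, and the nearest-neighbor step is one possible classifier, not necessarily what the team used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)


def synthetic_outline(aspect):
    """Flattened (x, y) samples of a noisy ellipse standing in for a leaf outline."""
    x = np.cos(t) + rng.normal(scale=0.02, size=t.size)
    y = aspect * np.sin(t) + rng.normal(scale=0.02, size=t.size)
    return np.concatenate([x, y])


# Two made-up "species": narrow outlines (label 0) and wide outlines (label 1).
aspects = np.concatenate([rng.normal(0.3, 0.03, 50), rng.normal(0.7, 0.03, 50)])
X = np.array([synthetic_outline(a) for a in aspects])
y = np.repeat([0, 1], 50)

# Shuffle, then hold out the last 20 outlines for testing.
order = rng.permutation(len(y))
X, y = X[order], y[order]

model = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=3))
model.fit(X[:80], y[:80])
print("held-out accuracy:", model.score(X[80:], y[80:]))

# Each principal component has the same layout as an outline: an "eigenleaf".
eigenleaves = model.named_steps["pca"].components_.reshape(5, 2, t.size)
```

Plotting `eigenleaves[k, 0]` against `eigenleaves[k, 1]` would draw the k-th component as a curve in the plane.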
Primary packages used:

cv2 (for computer vision)

sklearn (for statistical analysis)
Examples of techniques used:

Principal component analysis

Elliptic Fourier analysis

FAST corner detection algorithm

Scale-invariant feature transform (SIFT)

Template matching via Hausdorff distance

Canny edge detection

Misc. statistics on leaves (aspect ratio, solidity, isoperimetric factor)

Distance-to-mean comparison
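Of the techniques listed, template matching via Hausdorff distance is easy to sketch with numpy alone. The version below treats each outline as a set of 2D points and picks the template with the smallest symmetric Hausdorff distance. The function names, the elliptical templates, and the query outline are all hypothetical, and the project may well have computed the distance differently (for example on edge images produced by cv2).

```python
import numpy as np


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between point sets a (n, 2) and b (m, 2)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (n, m).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Farthest that any point of one set lies from its nearest point in the other.
    return max(d.min(axis=1).max(), d.min(axis=0).max())


def best_template(outline, templates):
    """Return the name of the template whose outline is closest to `outline`."""
    return min(templates, key=lambda name: hausdorff_distance(outline, templates[name]))


# Made-up templates: a round outline and a narrow one.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
templates = {
    "round": np.column_stack([np.cos(t), np.sin(t)]),
    "narrow": np.column_stack([np.cos(t), 0.3 * np.sin(t)]),
}

# A query outline that is slightly wider than the narrow template.
query = np.column_stack([np.cos(t), 0.35 * np.sin(t)])
print(best_template(query, templates))
```

The full pairwise distance matrix costs O(n·m) memory, which is fine for outlines with a few hundred points; real leaf outlines would also need to be aligned and scaled before comparison.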