©2018 by The Erdős Institute.

CŐDE

Project-based coding boot camps for industry and entrepreneurship

MAY 2019 DATA SCIENCE BOOT CAMPS

Sections:

The aim of the Python boot camp is to give you the background, experience, and tools that you need to analyze data and to build and use models and simulations. In the first week you will learn the basics of the Python language, applying what you learn to interesting problems, e.g., analysis of a financial model or of a data set (historical temperature data, stock prices, etc.). The second week will be devoted exclusively to data analysis, and in the third week you will learn about models and simulations, with some examples drawn from https://www.inferentialthinking.com/

There will be regular homework assignments and a final project. The latter is intended to be part of your portfolio: something you can show to prospective employers.

Sections:

Over the course of this three-week boot camp, participants with little to no exposure to computer science or analytics will be introduced, through practical, hands-on exercises, to the various ways of thinking and solving problems that define the field of data science, including:

 

  • The basics of computer science in R, a modern statistical programming language that is widely used in academia and business

  • The principles of tidy data, or, ensuring that when data are generated or imported, they are organized in a manner conducive to further analysis

  • Using problem solving, graphics, and exploratory data analysis to glean high-level insights from novel datasets

  • A practical primer on probability and statistics, with a focus on hypothesis testing for different types of data

  • Machine learning as a predictive analogue to the more traditional retrospective statistical analyses

  • Collaboration and teamwork, in the pursuit of designing and implementing a group project as the course's capstone

Instructors

Jim Fowler, PhD

Jim Fowler is a mathematician at The Ohio State University. His research interests broadly include geometry and topology, and more specifically focus on the topology of high-dimensional manifolds and geometric group theory. He's fond of using computational techniques to attack problems in pure mathematics.

Prior to working at The Ohio State University, he received an undergraduate degree from Harvard University and received a Ph.D. from the University of Chicago.

Jim Carlson, PhD

Jim received his PhD in mathematics from Princeton University in 1971. Since that time he has served as a postdoc at Stanford University and a professor of mathematics at both Brandeis University and the University of Utah. Jim served as president of the prestigious Clay Mathematics Institute, managing the process that led to the award of the first $1,000,000 Millennium Prizes in Mathematics. He is currently a Visiting Professor of Mathematics at the Ohio State University.

Jim has years of experience working with a variety of programming languages including both Python and Elm. He is quite active in the development of these open source languages.

Tom Needham, PhD

Tom is a Ross Assistant Professor at The Ohio State University. He graduated in Spring 2016 with a PhD in Mathematics from the University of Georgia under the direction of Professor Jason Cantarella. He currently works with Facundo Mémoli and the Topology, Geometry, and Data Analysis Research Group at Ohio State.

 

Tom's research is in applications of geometry and topology to applied problems in shape analysis and signal processing. His work uses tools from Riemannian, symplectic and metric geometry, topological data analysis and optimal transport.

Tom will be joining Florida State University in the Fall of 2019 as a tenure track Assistant Professor.

Collin McCabe, PhD

Collin received his PhD in human evolutionary biology from Harvard University in 2017. He was a National Science Foundation Graduate Research Fellow, and returned to Columbus in 2018 following a postdoctoral position at Duke University. 

Collin has extensive experience with the programming language R. He has developed a number of R packages and is proficient in various R data science methods.

Matt Osborne

Matthew Osborne is a PhD candidate in Mathematics at Ohio State with a graduate minor in data analysis. He works with his advisor, Joseph Tien, at the intersection of applied math and public health on modeling the dynamics of contagious processes including behavior and disease. Outside of his research he enjoys learning about machine learning, data science, and sports analytics.

 

He is expected to graduate by summer 2020.

Supporters

The May 2019 Cőde program is made possible in part due to the generous support from the following:

COLLEGE OF ARTS AND SCIENCES

DEPARTMENT OF MATHEMATICS

DEPARTMENT OF PHYSICS

DEPARTMENT OF ASTRONOMY

CENTER FOR COSMOLOGY AND ASTROPARTICLE PHYSICS (CCAPP)

May 2018 Pilot Boot Camp

The Erdős Institute piloted its first Cőde program in May 2018 at The Ohio State University in collaboration with CoverMyMeds. Boot camp participants included:

  • 20 math graduate students,

  • 10 math undergraduate students, and

  • 10 CoverMyMeds employees.

The first two weeks were dedicated to intensive instruction in Python and R. The third week allowed participants to develop and demonstrate their skills through individual projects. Those projects were then presented at the CoverMyMeds headquarters, and teams were formed. The program culminated in a week-long team coding sprint and hackathon designed to tackle real-world challenges.

The May 2018 Cőde program was made possible in part due to the generous support from the following:

Example Projects

The May 2018 Boot Camps culminated in individual and team projects.

Here are some examples of what the participants created.

Two Tweet Too Furious

Matthew Osborne, Austin Antoniou, Dan McGregor, Luke Andrejek

Brief Description

In a set of tweets containing #Charlottesville from the week of the Charlottesville Unite the Right rally, can we gain any insight into who is actually posting them?

 

Thanks to the Center for the Study of Networks and Society for the excellent data!

 

Data and Analysis

We had tweets from over 1000 unique Twitter users to analyze. Each user in the data set tweeted #Charlottesville at least once in the week following the Unite the Right rally. Other than that, we knew nothing about the accounts.

 

We arranged these accounts into what we called a mutual retweet network (see Figure 1). Each node was one of the unique accounts from the data set, and two accounts were connected by an edge if they retweeted the same accounts.

 

Our reasoning was that people who are similar will retweet the same content.

 

Analysis was done using the Python packages: pandas, numpy, matplotlib, and networkx.
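The edge rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the team's code: the function name `mutual_retweet_edges`, the input format of (retweeter, retweeted account) pairs, the edge weights, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mutual_retweet_edges(retweets):
    """Return {(user_a, user_b): number of accounts both users retweeted}.

    `retweets` is an iterable of (retweeter, retweeted_account) pairs.
    The weighting is an assumption; the write-up only says that two
    accounts are connected if they retweeted the same accounts.
    """
    # Map each retweeted account to the set of users who retweeted it.
    retweeters_of = defaultdict(set)
    for user, source in retweets:
        retweeters_of[source].add(user)

    # Every pair of users who share a retweeted account gets an edge.
    weights = defaultdict(int)
    for users in retweeters_of.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy data, not drawn from the project's data set.
toy_retweets = [
    ("alice", "news_1"), ("bob", "news_1"),
    ("alice", "news_2"), ("bob", "news_2"),
    ("carol", "news_2"), ("dave", "blog_9"),
]
edges = mutual_retweet_edges(toy_retweets)
```

In this toy example, alice and bob share two retweeted accounts, carol shares one with each of them, and dave has no edges. An edge dictionary like this could then be loaded into networkx, for instance with `Graph.add_weighted_edges_from`, for drawing and further analysis.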

Figure 1

Outcome

Our approach resulted in clear communities that describe the accounts quite well. We’ve highlighted those communities along with pictures of the top accounts that the encircled users retweeted (see Figure 2). A quick breakdown of the communities is provided below.

 

We expected to get two communities corresponding to Republicans and Democrats. However, we were surprised to see two outlier communities that are seemingly unrelated to the politics of the Charlottesville incident.

 

Perhaps this technique of Twitter data analysis could be a fruitful way to identify who is talking about or leveraging an event/issue.

Community Breakdown

Green – Possibly followers of a religious group from India

Yellow – Media accounts with a similar political skew

Blue – Democrat-leaning accounts

Red – Republican-leaning accounts

Figure 2

Identifying Leaves with Python

Alex Beckwith, Jason Bello, Sheng Guo, Kiwon Lee, Francisco Martínez Figueroa

Goal: Classify species of leaves from a database of leaf images using techniques from image processing and computer vision.

Successive approximations of leaf outlines using elliptic Fourier analysis.

Principal component analysis to identify leaf species.

Some eigen-“leaves”
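As a rough illustration of where eigen-"leaves" come from, the sketch below runs principal component analysis on flattened images and reshapes the leading components back into image form. It is a minimal sketch under assumptions: the project lists sklearn, whereas this version swaps in a NumPy SVD to stay self-contained, and the random array is a hypothetical stand-in for real leaf images.

```python
import numpy as np


def pca(X, n_components):
    """PCA via SVD of the mean-centered data matrix (rows are samples)."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (len(X) - 1)
    # Coordinates of each sample in the reduced space.
    scores = centered @ components.T
    return mean, components, scores, explained_variance


rng = np.random.default_rng(0)
# Hypothetical stand-in for leaf data: 20 "images" of 8 x 8 pixels, flattened.
X = rng.random((20, 64))
mean, components, scores, variance = pca(X, n_components=3)

# Each principal direction can be viewed as an image: an eigen-"leaf".
eigen_leaves = components.reshape(3, 8, 8)
```

The low-dimensional `scores` are what a classifier would typically consume. With scikit-learn, `sklearn.decomposition.PCA` offers the same decomposition through `fit_transform` and the `components_` attribute.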

Primary packages used:

  • cv2 (for computer vision)

  • sklearn (for statistical analysis)

Examples of techniques used:

  • Principal component analysis

  • Elliptic Fourier analysis

  • ~FAST corner detection algorithm

  • Scale-invariant feature transform (SIFT)

  • Template matching via Hausdorff distance

  • Canny edge detection

  • Misc. statistics on leaves (aspect ratio, solidity, isoperimetric factor)

  • Distance-to-mean comparison
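To make the shape statistics in the list above concrete, here is a small sketch that computes them for a polygonal outline. The definitions are assumptions, since the write-up does not spell them out: aspect ratio as bounding-box width over height, solidity as area divided by convex-hull area, and isoperimetric factor as 4πA/P². The helper names and the L-shaped test outline are hypothetical.

```python
import math


def polygon_area(points):
    """Shoelace formula; points are (x, y) vertices listed in order."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Sum of edge lengths around the closed outline."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += math.hypot(x2 - x1, y2 - y1)
    return total


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        hull = []
        for p in sequence:
            # Drop points that would make a clockwise or straight turn.
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]

    return half_hull(pts) + half_hull(reversed(pts))


def shape_stats(outline):
    xs = [x for x, _ in outline]
    ys = [y for _, y in outline]
    area = polygon_area(outline)
    perimeter = polygon_perimeter(outline)
    return {
        "aspect_ratio": (max(xs) - min(xs)) / (max(ys) - min(ys)),
        "solidity": area / polygon_area(convex_hull(outline)),
        "isoperimetric_factor": 4 * math.pi * area / perimeter ** 2,
    }


# Hypothetical L-shaped outline standing in for a lobed leaf.
outline = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
stats = shape_stats(outline)
```

A circle has isoperimetric factor 1 and a convex shape has solidity 1, so deeply lobed or elongated leaves score lower on both, which is what makes these numbers usable as features.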

~FAST corner detection

Template matching via Hausdorff distance
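Template matching via Hausdorff distance can be sketched as follows: compare an outline against each stored template and pick the template with the smallest distance. This is an assumption-laden illustration rather than the team's implementation. The point sets, the template names, and the `best_match` helper are hypothetical, and a real pipeline would first align and rescale the outlines.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(
        min(math.hypot(ax - bx, ay - by) for bx, by in B)
        for ax, ay in A
    )


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


def best_match(outline, templates):
    """Return the name of the template closest to `outline`."""
    return min(templates, key=lambda name: hausdorff(outline, templates[name]))


# Hypothetical templates and query outline (not real leaf data).
templates = {
    "square": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "triangle": [(0, 0), (1, 0), (0.5, 3)],
}
query = [(0.1, 0), (1.1, 0), (1.1, 1), (0.1, 1)]  # the square, shifted by 0.1
match = best_match(query, templates)
distance = hausdorff(query, templates["square"])
```

Here the shifted square matches the "square" template at a distance of about 0.1. For larger point sets, SciPy's `scipy.spatial.distance.directed_hausdorff` computes the directed distance more efficiently.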