CŐDE
Project-based coding boot camps for industry and entrepreneurship
MAY 2019 DATA SCIENCE BOOT CAMPS
Sections:
The aim of the Python boot camp is to give you the background, experience, and tools that you need to analyze data and to build and use models and simulations. In the first week you will learn the basics of the Python language, applying what you know to interesting problems, e.g., analysis of a financial model or of a data set (historical temperature data, stock prices, etc.). The second week will be devoted exclusively to data analysis, and in the third week you will learn about models and simulations, with some examples drawn from https://www.inferentialthinking.com/
There will be regular homework assignments and a final project. The latter is intended to be part of your portfolio: something you can show to prospective employers.
Over the course of this three-week boot camp, participants with little to no exposure to computer science or analytics will be introduced, through practical, hands-on exercises, to the various ways of thinking and solving problems that define the field of data science, including:

The basics of computer science in R, a modern statistical programming language that is widely used in academia and business

The principles of tidy data, or, ensuring that when data are generated or imported, they are organized in a manner conducive to further analysis

Using problem solving, graphics, and exploratory data analysis to glean high-level insights from novel datasets

A practical primer on probability and statistics, with a focus on hypothesis testing for different types of data

Machine learning as a predictive analogue to the more traditional retrospective statistical analyses

Collaboration and teamwork, in the pursuit of designing and implementing a group project as the course's capstone
Instructors
Jim Fowler, PhD
Jim Fowler is a mathematician at The Ohio State University. His research interests broadly include geometry and topology, and more specifically focus on the topology of high-dimensional manifolds and geometric group theory. He's fond of using computational techniques to attack problems in pure mathematics.
Prior to working at The Ohio State University, he received an undergraduate degree from Harvard University and received a Ph.D. from the University of Chicago.
Jim Carlson, PhD
Jim received his PhD in mathematics from Princeton University in 1971. Since that time he has served as a postdoc at Stanford University and a professor of mathematics at both Brandeis University and the University of Utah. Jim served as president of the prestigious Clay Mathematics Institute, managing the process that led to the award of the first $1,000,000 Millennium Prizes in Mathematics. He is currently a Visiting Professor of Mathematics at the Ohio State University.
Jim has years of experience working with a variety of programming languages including both Python and Elm. He is quite active in the development of these open source languages.
Tom Needham, PhD
Tom is a Ross Assistant Professor at The Ohio State University. He graduated in Spring 2016 with a PhD in Mathematics from the University of Georgia under the direction of Professor Jason Cantarella. He currently works with Facundo Mémoli and the Topology, Geometry, and Data Analysis Research Group at Ohio State.
Tom's research is in applications of geometry and topology to applied problems in shape analysis and signal processing. His work uses tools from Riemannian, symplectic and metric geometry, topological data analysis and optimal transport.
Tom will be joining Florida State University in the Fall of 2019 as a tenure track Assistant Professor.
Collin McCabe, PhD
Collin received his PhD in human evolutionary biology from Harvard University in 2017. He was a National Science Foundation Graduate Research Fellow, and returned to Columbus in 2018 following a postdoctoral position at Duke University.
Collin has extensive experience with the programming language R. He has developed a number of R packages and is proficient in various R data science methods.
Matt Osborne
Matthew Osborne is a PhD candidate in Mathematics at Ohio State with a graduate minor in data analysis. He works with his advisor, Joseph Tien, at the intersection of applied math and public health on modeling the dynamics of contagious processes including behavior and disease. Outside of his research he enjoys learning about machine learning, data science, and sports analytics.
He is expected to graduate by summer 2020.
Supporters
The May 2019 Cőde program is made possible in part due to the generous support from the following:
COLLEGE OF ARTS AND SCIENCES
DEPARTMENT OF MATHEMATICS
DEPARTMENT OF PHYSICS
DEPARTMENT OF ASTRONOMY
CENTER FOR COSMOLOGY AND ASTROPARTICLE PHYSICS (CCAPP)
May 2018 Pilot Boot Camp
The Erdős Institute piloted its first Cőde program in May 2018 at The Ohio State University in collaboration with CoverMyMeds. Boot camp participants included:

20 math graduate students,

10 math undergraduate students, and

10 CoverMyMeds employees.
The first two weeks were dedicated to intense instruction in Python and R. The third week allowed participants to develop and demonstrate their skills through individual projects. Those projects were then presented at the CoverMyMeds headquarters and teams were formed. The program culminated in a week-long team coding sprint and hackathon designed to tackle real-world challenges.
The May 2018 Cőde program was made possible in part due to the generous support from the following:
Example Projects
The May 2018 Boot Camps culminated in individual and team projects.
Here are some examples of what the participants created.
Two Tweet Too Furious
Matthew Osborne, Austin Antoniou, Dan McGregor, Luke Andrejek
Brief Description
In a set of tweets containing #Charlottesville from the week of the Charlottesville Unite the Right rally, can we gain any insight on who is actually tweeting these tweets?
Thanks to the Center for the Study of Networks and Society for the excellent data!
Data and Analysis
We had tweets from over 1000 unique Twitter users to analyze. Each user in the data set tweeted #Charlottesville at least once in the week following the Unite the Right rally. Other than that, we knew nothing about the accounts.
We arranged these accounts into what we called a mutual retweet network (see Figure 1). Each node was one of the unique accounts from the data set and two accounts were connected with an edge if they retweeted the same accounts.
Our reasoning was that people who are similar will tend to retweet the same content.
Analysis was done using the Python packages pandas, numpy, matplotlib, and networkx.
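As a rough illustration of the mutual retweet network described above, the sketch below connects two accounts whenever they retweeted at least one account in common. This is only one reading of the description; the account names, the toy records, and the function name mutual_retweet_edges are hypothetical, and the team's actual code and data are not shown here. It uses only the Python standard library.

```python
from collections import defaultdict
from itertools import combinations


def mutual_retweet_edges(retweets):
    """Return edges between accounts that retweeted a common account.

    `retweets` is an iterable of (retweeter, retweeted_account) pairs.
    The result maps each sorted (account_a, account_b) pair to the
    number of accounts that both of them retweeted.
    """
    # Map each retweeted account to the set of accounts that retweeted it.
    retweeters_of = defaultdict(set)
    for retweeter, source in retweets:
        retweeters_of[source].add(retweeter)

    # Every pair of accounts sharing a retweeted source gets an edge.
    edges = defaultdict(int)
    for accounts in retweeters_of.values():
        for a, b in combinations(sorted(accounts), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical toy data, not taken from the project.
toy_retweets = [
    ("alice", "news_a"),
    ("bob", "news_a"),
    ("carol", "news_b"),
    ("bob", "news_b"),
    ("dave", "news_c"),
]
edges = mutual_retweet_edges(toy_retweets)
print(edges)
```

The resulting account pairs could then be loaded into a networkx graph, for example with Graph.add_edges_from, before drawing the network or looking for communities.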
Figure 1
Outcome
Our approach resulted in clear communities that describe the accounts quite well. We’ve highlighted those communities along with pictures of the top accounts that the encircled users retweeted (see Figure 2). A quick breakdown of the communities is provided below.
We expected to get two communities corresponding to Republicans and Democrats. However, we were surprised to see two outlier communities that are seemingly unrelated to the politics of the Charlottesville incident.
Perhaps this technique of Twitter data analysis could be a fruitful way to identify who is talking about or leveraging an event/issue.
Community Breakdown
Green – Possibly followers of a religious group from India
Yellow – Media accounts with a similar political skew
Blue – Democrat-leaning accounts
Red – Republican-leaning accounts
Figure 2
Identifying Leaves with Python
Alex Beckwith, Jason Bello, Sheng Guo, Kiwon Lee, Francisco Martínez Figueroa
Goal: Classify species of leaves from a database of leaf images using techniques from image processing and computer vision.
Successive approximations of leaf outlines using elliptic Fourier analysis.
Principal component analysis to identify leaf species.
Some eigen“leaves”
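The idea behind the eigen“leaves” can be sketched as follows: flatten each leaf image into a vector, center the data, and take the leading principal directions, each of which can be reshaped back into an image. The project lists sklearn for its analysis; this sketch swaps in a plain numpy SVD instead, and the random array standing in for leaf images, the 8x8 image size, and the function name principal_components are all assumptions made for illustration.

```python
import numpy as np


def principal_components(samples, n_components):
    """Return (mean, components, scores) for row-vector samples via SVD."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are orthonormal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Coordinates of each sample in the reduced space.
    scores = centered @ components.T
    return mean, components, scores


rng = np.random.default_rng(0)
# Hypothetical stand-in for a leaf data set: 12 flattened 8x8 "images".
fake_leaves = rng.random((12, 64))
mean, components, scores = principal_components(fake_leaves, n_components=3)

# Each component, reshaped to image size, plays the role of an "eigenleaf".
eigenleaves = components.reshape(3, 8, 8)
print(eigenleaves.shape, scores.shape)
```

The low-dimensional scores, rather than the raw pixels, would then be the natural input to a species classifier.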
Primary packages used:

cv2 (for computer vision)

sklearn (for statistical analysis)
Examples of techniques used:

Principal component analysis

Elliptic Fourier analysis

FAST corner detection algorithm

Scale-invariant feature transform (SIFT)

Template matching via Hausdorff distance

Canny edge detection

Misc. statistics on leaves (aspect ratio, solidity, isoperimetric factor)

Distance-to-mean comparison
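The shape statistics named in the list above can be sketched for a leaf outline given as a polygon. The definitions used here are common ones but are assumptions about what the team computed: aspect ratio as bounding-box width over height, solidity as area over convex-hull area, and isoperimetric factor as 4*pi*area / perimeter^2 (equal to 1 for a circle). The L-shaped outline and all function names are hypothetical, and the code uses only the standard library rather than cv2.

```python
import math


def polygon_area(points):
    """Unsigned area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Total length of the closed outline."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:] + points[:1]))


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def shape_statistics(points):
    """Aspect ratio, solidity, and isoperimetric factor of an outline."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    area = polygon_area(points)
    perimeter = polygon_perimeter(points)
    return {
        "aspect_ratio": (max(xs) - min(xs)) / (max(ys) - min(ys)),
        "solidity": area / polygon_area(convex_hull(points)),
        "isoperimetric_factor": 4 * math.pi * area / perimeter ** 2,
    }


# Hypothetical outline: an L-shaped polygon given as a list of (x, y) tuples.
outline = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 2), (0, 2)]
stats = shape_statistics(outline)
print(stats)
```

Statistics like these are cheap to compute and give a small feature vector per leaf, which is one reason they pair well with heavier techniques such as SIFT or PCA.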
FAST corner detection
Template matching via Hausdorff distance
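A minimal sketch of template matching via Hausdorff distance, assuming each leaf outline and each template is already available as an aligned set of 2D points. The Hausdorff distance between two point sets is the larger of the two directed distances, each being the farthest any point of one set lies from its nearest neighbor in the other. The template names, the point sets, and the helper best_template are hypothetical. A real pipeline would likely also search over translations, rotations, and scales.

```python
import numpy as np


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two sets of 2D points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Directed distances: worst-case nearest-neighbor distance each way.
    a_to_b = dists.min(axis=1).max()
    b_to_a = dists.min(axis=0).max()
    return float(max(a_to_b, b_to_a))


def best_template(outline, templates):
    """Return the name of the template closest to the outline."""
    return min(templates, key=lambda name: hausdorff_distance(outline, templates[name]))


# Hypothetical templates and query outline, already aligned.
templates = {
    "square": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "wide": [(0, 0), (3, 0), (3, 1), (0, 1)],
}
query = [(0.1, 0.0), (1.0, 0.1), (0.9, 1.0), (0.0, 0.9)]
match = best_template(query, templates)
print(match)
```

Because the distance is driven by the single worst-matched point, it is sensitive to outliers along the outline. That is worth keeping in mind when the outlines come from noisy edge detection.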