PARTICIPANT DEMOGRAPHICS

2500+ participants with PhDs from 200+ universities

Candidate Profiles: 3042
Seeking Internships: 967
Seeking Part-Time: 559
Seeking Full-Time: 1494
Seeking Senior/Managerial: 296
Seeking DS, ML, AI: 1815
Seeking Quant Research/Finance: 1177
Seeking Software Engineering: 667
Seeking Quantum Computing: 475
Seeking UX Research: 365
Seeking Prof/Sci Writing: 499


Geographic Preferences

The map below shows participants who expressed specific geographic preferences.

An additional 391 are open to working anywhere in the US.

PARTICIPANT PROJECTS

Some projects from prior cohorts

FALL 2023

TEAM

Data Science Boot Camp

AI-generated Image Detection

Amanda Pan, Hasan Saad, Alina Al Beaini, Cemile Kurkoglu

AI-generated images have become increasingly realistic, prompting a variety of malicious uses. We plan to develop a model for detecting AI-generated images, ideally addressing some of the current difficulties: generalization to different methods of image generation, robustness to image resizing and compression, and interpretability of results.


FALL 2023

TEAM

Data Science Boot Camp

Groundwater Forecasting

Riti Bahl, Meredith Sargent, Marcos Ortiz, Chelsea Gary, Anireju Dudun

Groundwater is a critical source of water for human survival. A significant percentage of both drinking water and crop irrigation water is drawn from groundwater sources through wells. In the US, overuse of groundwater could have major implications for the future, and forecasting groundwater levels can be useful in understanding its impact. Building on historical data for four wells in Spokane, WA, together with surface water and weather data, we construct and evaluate machine learning models that forecast groundwater levels in the area.


FALL 2023

TEAM

Data Science Boot Camp

Funk

Aydin Ozbek, Dane Miyata, Kristina Knowles, Mario Gomez, Kashish Mehta

Most existing music recommendation systems rely on listeners to provide seed tracks, and then utilize a variety of different approaches to recommend additional tracks in either a playlist-like listening session or as sequential track recommendations based on user feedback.

We built a playlist recommendation engine that takes a different approach, allowing listeners to generate a novel playlist based on a semantic string, such as the title of the desired playlist, a specific mood (happy, relaxed), an atmosphere (tropical vibe), or a function (party music, focus). Using a publicly available dataset of existing playlists (https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), we combine a semantic similarity vector model with a matrix factorization model to allow users to quickly and easily generate playlists to fit any occasion.
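The description does not say how the two models are combined, so the following is only a minimal sketch of one plausible arrangement: a semantic vector model maps the query string to the closest existing playlist, and matrix factorization latent factors then rank tracks for that playlist. All vectors, playlist titles, and track names below are hypothetical toy values, not data from the project.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical semantic embeddings of existing playlist titles.
PLAYLIST_TITLE_VECS = {"beach party": [0.9, 0.1], "deep focus": [0.1, 0.9]}

# Hypothetical latent factors, as a matrix factorization model might learn them.
PLAYLIST_FACTORS = {"beach party": [1.0, 0.0], "deep focus": [0.0, 1.0]}
TRACK_FACTORS = {
    "track_a": [0.8, 0.1],
    "track_b": [0.2, 0.9],
    "track_c": [0.5, 0.5],
}

def recommend(query_vec, top_n=2):
    """Pick the semantically nearest playlist, then rank tracks by factor score."""
    nearest = max(
        PLAYLIST_TITLE_VECS,
        key=lambda name: cosine(query_vec, PLAYLIST_TITLE_VECS[name]),
    )
    scores = {t: dot(PLAYLIST_FACTORS[nearest], f) for t, f in TRACK_FACTORS.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A query embedding assumed to come from a string such as "tropical vibe".
recommended = recommend([1.0, 0.2])
print(recommended)
```

In a real system the query vector would come from a sentence embedding model and the factors from training on the playlist dataset; both are stubbed out here.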


FALL 2023

TEAM

Data Science Boot Camp

The Silent Emergency - Predicting Preterm Birth

Katherine Grillaert, Divya Joshi, Alexander Sutherland, Kristina Zvolanek, Noah Rahman

Preterm birth is a primary cause of infant mortality and morbidity in the United States, affecting approximately 1 in 10 births. The rates are notably higher among Black women (14.6%), compared to White (9.4%) and Hispanic women (10.1%). Despite its prevalence, predicting preterm birth remains challenging due to its multifaceted etiology rooted in environmental, biological, genetic, and behavioral interactions. Our project harnesses machine learning techniques to predict preterm birth using electronic health records. This data intersects with social determinants of health, reflecting some of the interactions contributing to preterm birth. Recognizing that under-representation in healthcare research perpetuates racial and ethnic health disparities, we take care to use diverse data to ensure equitable model performance across underrepresented populations.


FALL 2023

TEAM

Data Science Boot Camp

DDTs: Dementia Detection Tool

Himanshu Khanchandani, Clark Butler, Cisil Karaguzel, Selman Ipek, Shreya Shukla

Alzheimer’s disease (AD) is one of the most common types of dementia and frequently affects the elderly. Electroencephalography (EEG) is a non-invasive technique that measures brain activity using external electrodes and may help improve the diagnosis of AD. In this project we use the power spectrum of EEG signals to build a robust machine learning classifier that predicts whether a patient has Alzheimer's or is healthy. We improve substantially upon existing models in the literature by using features modified from those used previously.
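As a rough illustration of what a power-spectrum feature can look like, the sketch below computes the share of signal power in one frequency band using a plain discrete Fourier transform. The band edges (8 to 13 Hz, often called the alpha band), the sampling rate, and the synthetic sine-wave signal are assumptions made for the example; the project's actual features are not described here.

```python
import cmath
import math

def power_spectrum(signal, fs):
    """Return (frequencies, power) for the non-negative DFT bins of a real signal."""
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power

def band_power(freqs, power, low, high):
    """Sum the power of all bins with low <= frequency < high."""
    return sum(p for f, p in zip(freqs, power) if low <= f < high)

# Synthetic stand-in for one EEG channel: a 10 Hz sine sampled at 100 Hz for 1 s.
fs = 100
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]

freqs, power = power_spectrum(signal, fs)
alpha = band_power(freqs, power, 8, 13)    # assumed alpha band
total = band_power(freqs, power, 0.5, 45)  # assumed broadband range
relative_alpha = alpha / total
print(round(relative_alpha, 3))
```

Relative band powers like this one, computed per channel, could then serve as inputs to a classifier. For real recordings a library FFT and a windowed estimate would be preferable to the naive transform used here.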


SPRING 2023

TEAM

Data Science Boot Camp

Correcting Racial Bias in Measurement of Blood Oxygen Saturation

Rohan Myers, Saad Khalid, Woojeong Kim, Brooks Miner, Jaychandran Padayasi

Fingertip pulse oximeters are the current standard for estimating blood oxygen saturation without a blood draw, both at home and in healthcare settings. However, pulse oximeters overestimate oxygen saturation, often resulting in ‘hidden hypoxemia’: a patient has hypoxemia (dangerously low oxygen saturation), but the oximeter returns a healthy oxygen value. Unfortunately, this overestimation is worse for patients with darker skin tones because oximeters rely on light-based technology. As a result, Black patients experience hidden hypoxemia at twice the rate of white patients. By combining pulse oximeter readings (SpO2) with additional patient data, we develop improved methods for estimating arterial blood oxygen saturation (SaO2) and identifying hidden hypoxemia. The predictions of our models are more accurate than pulse oximeter readings alone, and remove the systematic racial inequity inherent in the current medical practice of relying on oximeter readings alone.


SPRING 2021

TEAM

Data Science Boot Camp

MaizeFinder

Tuguldur Sukhbold, Michael Darcy, Pol Arranz-Gibert, AJ Adejare

Our problem is to accurately predict maize field centers in Africa using very low resolution satellite images. The dataset contains many disparate entries, including two sets of satellite imagery, one with higher spatial resolution (Planet) and one with higher temporal resolution and wavelength coverage (Sentinel-2), as well as partial metadata about the crop fields, including estimated yield, size, and subjective quality of the measurement. We are employing CNN-based image segmentation models to compute the displacement vector from the image center.


SPRING 2021

TEAM

Data Science Boot Camp

NLPs

Frank Hidalgo, Joseph Szabo, Christopher Zhang, Sean Perez, Kun Jin

Disambiguating acronyms and abbreviations (short forms) is one of the main challenges when using NLP methods to understand medical records. While this topic has long been studied, it is still a work in progress. Current strategies often involve training classifiers on manually curated datasets of abbreviations. The main problem with that approach is that curated datasets are sparse and do not include all short forms. In December 2020, a paper was published whose authors created a large dataset of short forms as one step in their pipeline for pre-training models. The goal of our project is to build upon their short-form disambiguation piece and create a tool that disambiguates a medical short form using its context.

Example of the usage of our tool:

original_sentence = "The patient states that she has had dizziness, nausea, some heartburn, and some change in her vision. She is gravida 6, para 4, AB 2. She has no history of adverse reaction to anesthesia."

AB could stand for "abortion", "ankle-brachial", "blood group in ABO system", or "A, B lines in Kerley lines".

disambiguated_sentence = "The patient states that she has had dizziness, nausea, some heartburn, and some change in her vision. She is gravida 6, para 4, abortion 2. She has no history of adverse reaction to anesthesia."
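To make the idea of using context concrete, here is a deliberately simple sketch that picks the long form whose keyword list overlaps most with the words of the sentence. It is not the team's method, which builds on pre-trained models; the keyword lists and the `disambiguate` helper are hypothetical and exist only for illustration.

```python
import re

# Hypothetical context keywords per candidate long form of "AB".
# These lists are illustrative assumptions, not a curated medical resource.
CANDIDATES = {
    "abortion": {"gravida", "para", "pregnancy", "obstetric"},
    "ankle-brachial": {"index", "artery", "vascular", "leg"},
    "blood group in ABO system": {"transfusion", "blood", "type", "donor"},
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def disambiguate(sentence, short_form, candidates):
    """Return the best-matching long form and the sentence with it substituted."""
    context = tokenize(sentence)
    best = max(candidates, key=lambda long_form: len(candidates[long_form] & context))
    pattern = r"\b" + re.escape(short_form) + r"\b"
    return best, re.sub(pattern, best, sentence)

original_sentence = "She is gravida 6, para 4, AB 2."
long_form, disambiguated_sentence = disambiguate(original_sentence, "AB", CANDIDATES)
print(long_form)
print(disambiguated_sentence)
```

Keyword overlap breaks down quickly on real clinical text, which is why learned contextual models are the more realistic choice; the sketch only shows the input and output shape of the task.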


FALL 2022

TEAM

Data Science Boot Camp

Lime

Yuchen Luo, Ritika Khurana, Aditya Chander, Taylor Mahler

We built a podcast recommendation engine that suggests episodes to a listener based on either a previous episode that they've heard or an episode description that they can input with freeform text entry.


SPRING 2022

TEAM

Data Science Boot Camp

Erdio

Matthew Frick, Paul Jreidini, Matthew Heffernan

Timely identification of safety-critical events, such as gunshots, is of great importance to public safety stakeholders. However, existing systems deliver only limited value because they do not classify additional urban sounds. We perform classification of environmental sounds to detect safety-critical events, in particular gunshots, and provide information on first response via siren detection. We also engineer general features for offline classification tasks and demonstrate how this system can provide value to additional stakeholders in the film and television industry.


SPRING 2022

TEAM

Data Science Boot Camp

SKYLAB

Chenyi Gu, Briana Stanfield, Dylan Bates, Kanishk Jain

The NHL Stanley Cup is the oldest existing trophy to be awarded to a professional sports franchise in North America, and often considered “the hardest trophy to win in professional sport.” Using just regular season data, we want to know, can we predict who is going to win the Stanley Cup?
We collected data from each team, as well as data from every player in over 20,000 games going back to 2005. Using this data, we built an ensemble model from logistic regression, AdaBoost, random forests, and a neural network, which was able to predict playoff data with up to 70% accuracy, above the theoretical threshold of 62% reported in the literature.
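The description does not state how the four models' outputs were combined. One common option is soft voting, where each model's predicted win probability is averaged and the average is thresholded. The sketch below assumes that scheme, and the probabilities in it are made-up placeholders rather than outputs of the actual models.

```python
def soft_vote(probabilities):
    """Average the win probabilities predicted by the individual models."""
    return sum(probabilities) / len(probabilities)

def predict_win(probabilities, threshold=0.5):
    """Predict a win when the averaged probability reaches the threshold."""
    return soft_vote(probabilities) >= threshold

# Hypothetical per-model probabilities that a given team wins a playoff series,
# in the order: logistic regression, AdaBoost, random forest, neural network.
model_probs = [0.6, 0.7, 0.4, 0.5]

ensemble_prob = soft_vote(model_probs)
ensemble_pick = predict_win(model_probs)
print(round(ensemble_prob, 2), ensemble_pick)
```

Other combinations, such as majority voting or a stacked meta-model, would fit the same description equally well.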
