OUR COMMUNITY
5000+ participants with PhDs from 300+ universities
Candidate Profiles: 6575
Seeking Internships: 1377
Seeking Part-Time: 843
Seeking Full-Time: 2249
Seeking Senior/Managerial: 434
Seeking DS, ML, AI: 2581
Seeking Quant Research/Finance: 1719
Seeking Software Engineering: 919
Seeking Quantum Computing: 660
Seeking UX Research: 466
Seeking Prof/Sci Writing: 667
EDUCATION & REPORTED DEMOGRAPHIC DATA
University, Area of Study, Year of PhD, Reported Ethnicity, Reported Gender, Reported First Gen
PARTICIPANT PROJECTS
Examples of projects from prior cohorts
SPRING 2026
TEAM
Quantum Computing Boot Camp
Ege Aktener - Quantum Computing - Final Project
Ege Aktener
Project 1: An Oracle for Shor's Algorithm (https://github.com/egeaktener/Oracle-for-Shor-s-Algorithm)
Takes two co-prime integers a and N, and outputs a quantum gate for modular multiplication.
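As a rough illustration of what such a gate does, the sketch below builds the classical permutation of basis states that a modular multiplication gate would implement: |x> maps to |a*x mod N> for x < N, and the unused states are left fixed. This is not the code from the linked repository; the function names and the choice to leave states x >= N unchanged are assumptions made for this example.

```python
from math import gcd


def modmul_permutation(a: int, n: int, num_qubits: int) -> list[int]:
    """Basis-state permutation for |x> -> |a*x mod n> (identity for x >= n)."""
    if gcd(a, n) != 1:
        # Without co-primality the map is not invertible, so it is not a valid gate.
        raise ValueError("a and n must be co-prime")
    dim = 2 ** num_qubits
    if n > dim:
        raise ValueError("not enough qubits to represent all residues mod n")
    return [(a * x) % n if x < n else x for x in range(dim)]


def is_permutation(perm: list[int]) -> bool:
    """A unitary permutation gate must hit every basis state exactly once."""
    return sorted(perm) == list(range(len(perm)))


# Hypothetical example values: a = 7, N = 15 on 4 qubits.
perm = modmul_permutation(7, 15, 4)
```

Because a and N are co-prime, the map is a bijection on the residues, which is what allows it to be realized as a unitary permutation matrix.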
Project 2: Quantum PageRank (https://github.com/egeaktener/Quantum-PageRank)
Given a directed graph, computes classical and quantum PageRank by simulating classical and quantum random walks.
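For the classical half only, a minimal power-iteration sketch of PageRank on a directed graph might look as follows. It is not taken from the linked repository: the damping factor of 0.85, the uniform handling of dangling nodes, and the toy edge list are all assumptions for illustration, and the quantum random walk is not sketched here.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=100):
    """Classical PageRank by repeatedly applying the random-surfer transition."""
    out_links = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out_links[src].append(dst)

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Teleportation term: with probability (1 - damping) jump to a random node.
        new_rank = [(1.0 - damping) / num_nodes] * num_nodes
        for src, targets in enumerate(out_links):
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling node (no out-links): spread its rank uniformly.
                share = damping * rank[src] / num_nodes
                for dst in range(num_nodes):
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): a 3-cycle 0 -> 1 -> 2 -> 0 plus node 3 pointing at 0.
ranks = pagerank([(0, 1), (1, 2), (2, 0), (3, 0)], num_nodes=4)
```

Node 3 has no incoming links, so it keeps only the teleportation mass, while node 0 collects rank from both the cycle and node 3.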
SPRING 2026
TEAM
Quantum Computing Boot Camp
Gautham (Sid) Meka
Gautham Meka
Quantum Computing Mini Projects
First Project: https://github.com/TripleH372/MultiControlledUGate
Second Project: https://github.com/TripleH372/QuantumFinanceProject
SPRING 2026
TEAM
UX Research Boot Camp
2026 Cohort Team
Tamera Jones, David Stifler, Ibrahim Odugbemi, Dejan Duric
Project 1: Competitive Analysis
We conducted market research on behalf of Kairos, a remote patient monitoring company focused on postpartum health. Through background research and data synthesis, we strategically redesigned the company's marketing position.
Project 2: Finance App Usability
The goal-tracking app, Stack Save, was tested for feature usability with 4 users, focusing on areas for improvement in ease of use and effectiveness. We conducted a usability testing session via a survey and redesigned the interface based on the suggestions.
Project 3: Customer Loyalty
TableGram, a popular food service app, has seen changes in customer loyalty. To understand the key drivers and barriers in customer loyalty, we designed and administered a survey and conducted A/B testing to propose an intervention to keep customers.
SPRING 2026
TEAM
Quant Finance Boot Camp
Gautam Hegde: Kou Jump-Diffusion Model
Gautam Hegde
Kou Jump-Diffusion Model
An extension of the Black-Scholes framework incorporating sudden, large discontinuities in asset prices via a double-exponential jump process. The project consists of two modules.
The first is a pricing and calibration engine for European options, combining the Carr-Madan FFT method with 2D interpolation for efficient price computation, and L-BFGS-B optimization to fit model parameters to market data.
The second is a Monte Carlo simulation analyzing delta-hedging performance under jump risk. By evolving a hedged European call option portfolio under the Kou model, the simulation demonstrates that jumps induce fat tails in the P&L distribution that delta-hedging alone cannot eliminate.
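To make the jump process concrete, here is a minimal sketch of sampling terminal prices under a risk-neutral Kou model with double-exponential log-jumps. It only draws terminal prices and does not reproduce the project's calibration engine or hedging study; every parameter value below is hypothetical, and the sketch assumes an upward jump rate parameter greater than 1 so that the jump compensator is finite.

```python
import math
import random


def simulate_kou_terminal_price(s0, r, sigma, lam, p_up, eta_up, eta_down, horizon, rng):
    """Draw one terminal price S_T under a risk-neutral Kou jump-diffusion."""
    # Compensator kappa = E[e^Y] - 1 for a double-exponential log-jump Y
    # (requires eta_up > 1); subtracting lam * kappa keeps E[S_T] = s0 * e^{rT}.
    kappa = (p_up * eta_up / (eta_up - 1.0)
             + (1.0 - p_up) * eta_down / (eta_down + 1.0) - 1.0)
    drift = (r - 0.5 * sigma ** 2 - lam * kappa) * horizon
    diffusion = sigma * math.sqrt(horizon) * rng.gauss(0.0, 1.0)

    # Poisson jump arrivals on [0, horizon] via exponential inter-arrival times.
    jump_sum = 0.0
    t = rng.expovariate(lam)
    while t < horizon:
        if rng.random() < p_up:
            jump_sum += rng.expovariate(eta_up)    # upward jump, mean 1 / eta_up
        else:
            jump_sum -= rng.expovariate(eta_down)  # downward jump, mean 1 / eta_down
        t += rng.expovariate(lam)
    return s0 * math.exp(drift + diffusion + jump_sum)


# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
prices = [
    simulate_kou_terminal_price(100.0, 0.05, 0.2, 1.0, 0.4, 10.0, 5.0, 1.0, rng)
    for _ in range(20000)
]
mean_price = sum(prices) / len(prices)
```

A quick sanity check on a simulator of this kind is that the sample mean of the terminal price stays close to the forward price s0 * exp(r * T).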
SPRING 2026
TEAM
Quant Finance Boot Camp
The Volatility Complex: Topological Analysis of Sector Stock Volatility via Vietoris-Rips Complexes
Andrew Tawfeek
We study how the volatility of ~50 S&P 500 tech stocks co-moves over time (2007–2024) using tools from topological data analysis. From rolling volatility correlations, we build a distance metric between stocks and construct Vietoris-Rips simplicial complexes — geometric objects that capture not just pairwise relationships but higher-order group behavior. Tracking the complex over time reveals sharp structural shifts during crises (2008, COVID, 2022) when all stocks move as one, and fragmentation into sub-sector clusters during calm periods. Persistent homology provides a multi-scale view of this structure, identifying robust features across different correlation thresholds.
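The construction can be sketched in a few lines. The description does not say which distance was used, so the mapping d = sqrt(2 * (1 - rho)) below is an assumed, commonly used choice, and the correlation matrix is made up for illustration; a real analysis would likely use a dedicated TDA library rather than this brute-force enumeration.

```python
from itertools import combinations
from math import sqrt


def correlation_distance(corr):
    """Map correlations to distances with d = sqrt(2 * (1 - rho)) (assumed choice)."""
    n = len(corr)
    return [[sqrt(max(0.0, 2.0 * (1.0 - corr[i][j]))) for j in range(n)]
            for i in range(n)]


def rips_complex(dist, epsilon, max_dim=2):
    """Vietoris-Rips complex: a simplex is included when all its pairwise
    distances are at most epsilon. Brute force, fine for small examples."""
    n = len(dist)
    simplices = [(i,) for i in range(n)]
    for size in range(2, max_dim + 2):
        for combo in combinations(range(n), size):
            if all(dist[i][j] <= epsilon for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices


# Hypothetical volatility correlations: stocks 0-2 co-move, stock 3 does not.
corr = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.85, 0.0],
    [0.8, 0.85, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]
dist = correlation_distance(corr)
complex_at_07 = rips_complex(dist, epsilon=0.7)
```

At this threshold the three co-moving stocks fill in a triangle while the fourth remains an isolated vertex; sweeping epsilon and recording when such features appear and disappear is the idea behind persistent homology.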
SPRING 2026
TEAM
Deep Learning Boot Camp
Cross-Dataset Generalization of Underwater Instance Segmentation Models
Carsten Sprunger
Underwater instance segmentation models are typically trained and evaluated on a single dataset, leaving cross-dataset generalization unstudied. This project measures the cross-dataset domain gap between TrashCan (7,212 deep-sea ROV images, 22 classes) and SeaClear (8,610 shallow-water images, 40 classes) using Mask R-CNN with a COCO-pretrained ResNet-50 FPN backbone. Models trained on one dataset fail catastrophically on the other despite overlapping object categories. We show this gap is visual, not semantic: silhouette analysis of backbone features reveals strong dataset clustering even at the per-class level. To mitigate the gap, we pool both datasets into common category spaces via a generic coarsening hierarchy and show that a single pooled model recovers or exceeds in-domain performance on both test sets. We also re-split TrashCan using frame-chunking to fix data leakage in the original split. All results are explorable via an interactive Streamlit dashboard.
SPRING 2026
TEAM
UX Research Boot Camp
Team JXY!
Yuxian Lin, Xinyue Wu, Jessie Cordwell
Project 1 – Market Research & Strategy
Conducted market research and competitive analysis to understand the industry landscape, followed by a SWOT analysis to identify strategic strengths and weaknesses, and designed a case study to apply these insights in a real-world context.
Project 2 – User Research & Design
Developed user personas to define target users, conducted usability testing based on personas to evaluate design effectiveness, and created wireframes to translate research findings into early-stage interface concepts.
Project 3 – Quantitative Research
Designed a survey to collect user data, ran an A/B test to evaluate a new product feature, and performed quantitative analysis to draw data-driven conclusions from the results.
SPRING 2026
TEAM
Deep Learning Boot Camp
Spatiotemporal Modeling of Pose Estimation in Wearables
Sero Parel, Kristin Dona, Dayoung Lee, Brian Mullen
This project aims to build a deep learning pipeline for hand pose estimation from surface electromyography (sEMG) signals recorded by a smart wristband equipped with muscle activity sensors. We used the emg2pose dataset, which includes 370 hours of 16-channel sEMG signals sampled at 2 kHz from 193 users (Salter, Warren, Schlager, et al. 2024). This publicly available dataset can be found in the GitHub repository: https://github.com/facebookresearch/emg2pose.
We focused on the core deployment plan, generalization to new users/poses, sensor placements, and trajectory quality. We established a baseline LSTM model and added a small, well-ablated improvement through a spatiotemporal learning approach. This project is packaged as a reproducible PyTorch pipeline that can be run in Google Colab. Additionally, we supported deployment by publishing our trained model checkpoints and inference code to Hugging Face.
SPRING 2026
TEAM
Deep Learning Boot Camp
Deep Learning Song Recommender
Nick Geis, Mitch Hamidi-Ismert, Juan Salinas
This project develops a content-based music recommender that predicts song relationships from audio, using listener-generated tags as supervision during training. From 10-second clips, stem separation and mel spectrograms are used to represent each track, and a late-fusion ResNet18 learns embeddings that capture genre, mood, and musical structure. At inference time, the system recommends songs from audio alone through an interactive web app, showing how deep learning can support music discovery without relying solely on user behavior.
SPRING 2026
TEAM
Data Science Boot Camp
A Solar-to-Ground Proxy Model for Ground-Level Electromagnetic (EM) Risk Prediction
George Seelinger
Geomagnetic storms driven by solar activity pose risks to power grids, satellites, and communication systems. In this project, we developed a data-driven solar-to-ground proxy model that predicts near-term geomagnetic activity using solar wind data. The final model is an XGBoost classifier tuned to maximize the fraction of storms it correctly predicts, subject to keeping the false positive rate at an acceptable level.
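One way to read this tuning objective is as a decision-threshold search: among thresholds whose false positive rate stays under a cap, pick the one with the highest recall. The sketch below illustrates that idea on made-up scores; it is not the project's tuning procedure, and the function name, the cap of 0.25, and the data are assumptions.

```python
def pick_threshold(scores, labels, max_fpr):
    """Return (threshold, recall, fpr) maximizing recall subject to fpr <= max_fpr.

    Ties in recall are broken in favor of the lower false positive rate.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    best = None  # tuple (recall, -fpr, threshold) compared lexicographically
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        recall = tp / positives
        fpr = fp / negatives
        if fpr <= max_fpr:
            candidate = (recall, -fpr, threshold)
            if best is None or candidate > best:
                best = candidate
    if best is None:
        raise ValueError("no threshold satisfies the false positive rate cap")
    recall, neg_fpr, threshold = best
    return threshold, recall, -neg_fpr


# Hypothetical predicted storm probabilities and true storm labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
threshold, recall, fpr = pick_threshold(scores, labels, max_fpr=0.25)
```

In this toy example the threshold needed to catch the low-scoring storm at 0.35 would admit two false alarms out of four quiet periods, so the constrained optimum accepts missing that one storm.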
SPRING 2026
TEAM
Deep Learning Boot Camp
EllipticGuard: Graph Deep Learning for Bitcoin Illicit Activity Detection
Ran Li, Shaoyang Zhou, Rafael Miksian Magaldi, Prakash Singh, Tinghao Huang
This project studies illicit Bitcoin transaction detection on the Elliptic dataset under a stable pre-shutdown split (train 1–32, val 33–37, test 38–42). We compared strong tabular baselines, GNNs, graph-aware non-neural models, compressed graph–tree hybrids, directed residual GNNs, and combination models. Generic GNNs improved over weaker graph baselines but remained below the best tabular model. A graph-aware ET stack using directed neighbor-risk aggregates reached 0.905 test PR-AUC, while compressed hybrid models showed that GNN embeddings help more when constrained through low-dimensional bottlenecks, including Matryoshka-style designs, before integration into trees. The best standalone graph models were directed residual GNNs (up to 0.916), and the top result, 0.9187, came from a preserved-head combination model integrating GraphAgg ET with SIGN/stack components. Overall, graph information helps most when integrated with tabular models rather than used through a standalone GNN.
SPRING 2026
TEAM
Deep Learning Boot Camp
Fragmented ID Resolution
Noimot Bakare Ayoub, Dharineesh Somisetty, Arpith Shanbhag, Pedro Fontanarrosa
Scope: Detect duplicate identities across noisy, fragmented datasets (fraud, patient mismatch, citizen records)
Architecture: CNN Embeddings + Siamese Network
Problem: Real-world identity data is messy; small inconsistencies cause one person to appear as multiple records, creating operational risk and inefficiency.
Approach: We learn record similarity using CNN embeddings and a Siamese network. LinkID detects, ranks, and resolves duplicate identities, auto-linking high-confidence matches and routing borderline cases for review.
Data: HPI snapshot of North Carolina voter records with labeled duplicate and non-duplicate pairs.
Results: Strong performance overall, with ~25-point improvement on hard cases where traditional models struggle.
Conclusion: Learned similarity models significantly outperform traditional approaches in complex identity resolution tasks.
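The routing step described in the approach above can be sketched as a two-threshold rule on embedding similarity. The cosine similarity measure, the threshold values, and the function names here are assumptions for illustration only; the project's actual Siamese scoring and cutoffs are not described in the summary.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def route_pair(similarity, link_threshold=0.9, review_threshold=0.6):
    """Route a record pair; both thresholds are hypothetical placeholders."""
    if similarity >= link_threshold:
        return "auto_link"   # high-confidence duplicate
    if similarity >= review_threshold:
        return "review"      # borderline case for a human reviewer
    return "distinct"


# Toy 2-D embeddings standing in for learned record embeddings.
same = route_pair(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
borderline = route_pair(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
different = route_pair(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Keeping a middle band between the two thresholds is what lets high-confidence matches be linked automatically while uncertain pairs still get human attention.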
SPRING 2026
TEAM
Data Science Boot Camp
LLM Hallucinations Detector
Helmut Wahanik, Guoqin Liu, Santanil Jana, AJ Vargas, Debanjan Sarkar
In this project, we develop methods for detecting hallucinations in Large Language Models (LLMs) to flag risky outputs prior to expensive downstream validation. We propose two complementary detection strategies evaluated on 2,500 questions across five benchmark datasets using Llama-3.2-3B. The first approach is a white-box method that extracts spectral features from attention-head Laplacians. This method demonstrates that the hallucination signal is low-dimensional and largely linearly separable. The second approach is a black-box method that computes semantic and geometric statistics from a cloud of sampled responses. We find that an ElasticNet logistic model trained on six baseline features achieves an AUROC of approximately 0.91.
Ultimately, we demonstrate that hallucinations leave measurable signatures in both internal transformer activations and the geometry of sampled outputs. Our approach serves as a cost-effective filter for organizations deploying LLMs at scale.
SPRING 2026
TEAM
Data Science Boot Camp
Hitmakers vs. One-Hit Wonders: Predicting Sustained Success in the Music Industry
James McNally, Yundi Kong, Guillermo Sanmarco, Vishal Gupta
Question:
What early signals predict sustained success in the music industry?
Objective:
Many musicians produce hit songs, but not all are able to do so more than once. This project builds a machine learning classifier to distinguish hitmakers (artists with multiple top 20 Billboard Hot 100 hits) from one-hit wonders, using only information available at the moment of a musician’s first top 20 hit song.
Conclusions:
Our model reveals that prior charting experience, collaboration network position, chart longevity, genre breadth, and dominant genre affiliations are the strongest predictors of sustained success.
Data sources:
- MusicBrainz (artist metadata, genre tags, collaboration graph)
- Billboard Hot 100 & 200 chart data
- Spotify (artist and song metadata)
- Google Trends (relative search volume at time of first hit song)
SPRING 2026
TEAM
Data Science Boot Camp
Mapping Radon Risk in Canada
Huiyao Kuang, Manimugdha Saikia, Emmanuel Asante, John Berezney
Radon is a naturally occurring radioactive gas, and long-term exposure to elevated indoor radon is a major public health concern in Canada. Because radon is colorless and odorless, exposure often goes unnoticed, while risk varies substantially across regions due to differences in geology, climate, housing, and socioeconomic context. This project developed an FSA-level radon risk screening framework for Canada by integrating household radon survey data with public geological, climatic, housing, socioeconomic, and uranium-related datasets, and identified the regional factors most strongly associated with elevated radon risk.
SPRING 2026
TEAM
Data Science Boot Camp
Towards Automated Sleep Analysis: Stage Classification and Apnea Prediction
Ye Hong, Vaibhav Thakur, Aiqi Cheng
Accurate sleep monitoring remains a challenge for consumer wearables, and conditions like Obstructive Sleep Apnea are widely underdiagnosed due to the lack of accessible, reliable automated tools. Using the DREAMT dataset — combining overnight PSG signals, Empatica E4 smartwatch data, and subject metadata from 100 participants — we built two models: one to classify sleep stages in real time, and one to predict apnea events 10 seconds in advance.
For sleep stage classification, we benchmarked Logistic Regression, XGBoost, LightGBM, and LSTM; gradient boosting methods performed best, with an XGBoost and LightGBM ensemble further improved by majority-vote smoothing. For apnea prediction, a longer lag window of LightGBM features yielded the strongest results, highlighting the importance of temporal context.
These models enabled real-time stage tracking and proactive apnea alerts with potential for earlier clinical interventions.
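Majority-vote smoothing of per-epoch stage predictions can be sketched as follows. The window size, the tie-breaking rule, and the centered window are assumptions for this example; a strictly real-time variant would use a trailing window that looks only at past epochs.

```python
from collections import Counter


def majority_vote_smooth(labels, window=3):
    """Replace each label with the most common label in a centered window.

    On ties, the original label is kept if it is among the most common;
    otherwise the first tied label encountered in the window is used.
    """
    half = window // 2
    smoothed = []
    for i, label in enumerate(labels):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        counts = Counter(labels[lo:hi])
        top = max(counts.values())
        winners = [lab for lab, c in counts.items() if c == top]
        smoothed.append(label if label in winners else winners[0])
    return smoothed


# Hypothetical per-epoch predictions with two isolated misclassifications.
raw = ["W", "W", "N2", "W", "W", "N2", "N2", "N2", "REM", "N2", "N2"]
smoothed = majority_vote_smooth(raw, window=3)
```

The isolated "N2" inside the wake run and the isolated "REM" inside the N2 run are both overwritten by their neighbors, which is the intended effect of the smoothing.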
FALL 2025
TEAM
Quant Finance Boot Camp
Implied Volatility vs. Realized Volatility for an Africa-Exposure ETF
Chidubem Umeh
As a Nigerian-American, I have a personal interest in understanding African financial markets—particularly those in West Africa, where local economic factors often differ significantly from continental trends.
This project will focus on volatility modeling and forecasting using real market data. The “West Africa Regime” component refers specifically to periods of heightened Nigerian Naira (NGN) volatility, which can be used as a proxy for broader West African macroeconomic uncertainty.
FALL 2025
TEAM
Deep Learning Boot Camp
Deep Learning Models for Colorectal Polyp Detection
Ruibo Zhang, Rebekah Eichberg, Betul Senay Aras, Kevin Specht, Arthur Diep-Nguyen
A polyp is an abnormal tissue growth in the large intestine that is typically benign but can develop into malignant colorectal cancer. Colonoscopy enables endoscopists to identify and assess these polyps for potential removal. However, the accuracy of this procedure depends heavily on the clinician’s expertise, making it prone to human error and variability. Our goal is to build a deep-learning model that detects colorectal polyps in images from colonoscopies to minimize missed lesions and improve patient outcomes.
FALL 2025
TEAM
Data Science Boot Camp
Identifying Early Risk Factors for Students in Online Courses
James McNally, James Caramanico, Arina Favilla, Feng Zhu
Research Question: What early engagement patterns in virtual learning environments predict negative course outcomes?
Context: It is well known that performance on assessments and in-class attendance are predictive of final course results. Yet grades often come too late in a class term for early interventions and attendance is difficult to measure in online learning environments. To address this gap, we developed a model for identifying early risk factors in online courses based on student interaction patterns in a virtual learning environment (VLE).
Data source: Open University Learning Analytics Dataset (OULAD), which includes daily logs of UK student VLE interactions and grades in 7 science and social science online courses occurring in 2013-14.
Goal: Develop a model for identifying early risk factors based on student interaction patterns that predict negative course outcomes (i.e., failure or withdrawal) in a VLE.
FALL 2025
TEAM
Data Science Boot Camp
Personalized Gesture Recognition
Sero Parel, Carrie Clark, Brian Mullen, Philip Nelson, Revati Jadhav
Smart wristbands enable users to control technology through subtle hand gestures by decoding muscle signals. However, each individual's muscle signals are unique, making personalization a critical challenge. Leveraging a publicly available dataset (Kaifosh et al. 2025), our team developed a personalized gesture recognition model using surface electromyography (sEMG) data from 100 participants performing 9 gestures. Addressing inter-user variability through within-user training, we engineered 160 features and selected 37 via random forest ranking and correlation pruning. Logistic regression with L2 regularization achieved strong cross-validation performance (F1 Macro = 0.7164), but holdout testing revealed a generalization gap (F1 Macro = 0.3977). Per-user performance varied widely, confirming heterogeneity across diverse users. Future work could explore adaptive time windows and fine-tuning pre-trained models to enable more robust commercial neuromotor interfaces.
FALL 2025
TEAM
Data Science Boot Camp
Predicting Lead Contamination in NY School Drinking Water
Ranadeep Roy, Cami Goray, Hana Lang
Lead is a toxic metal, and in children especially, lead exposure can have severe health consequences -- even small amounts of lead have the potential to affect memory, behavior, and learning ability. Despite this, numerous schools across New York State have at least one drinking water outlet with lead levels testing above 5 ppb. In this project, we aim to predict the presence of lead contamination in school drinking water, and better understand the role of demographic, socioeconomic, infrastructural, and geographic features in elevated lead levels.
SUMMER 2025
TEAM
Deep Learning Boot Camp
Going Off-Grid: A Computer Vision Approach for Grid Integration and Reconstruction in Post-War Syria
Al Baraa Abd Aldaim, Suman Bhandari, Nicholas Geiser
Decentralized solar electricity production has become common in Syria due to unreliability and inconsistent delivery from the national electric grid. Estimating output from decentralized solar is vital for grid integration and reconstruction efforts currently underway in Syria. Our goal in this project was to develop a deep learning model capable of detecting panels (bounding box), estimating their area (segmentation), and predicting the bottom corners of the panel assembly (corner prediction).
SUMMER 2025
TEAM
Deep Learning Boot Camp
Fraud Detection with Deep Learning
Jude Pereira, Yang Yang, Adrian Wong, Sara Edelman-Munoz, Mary Reith
Fraud detection is a critical area where deep learning has been effectively applied to identify and prevent unauthorized transactions, money laundering, and other financial crimes. Traditional rule-based systems and statistical models often struggle to detect sophisticated fraud patterns, particularly when dealing with large volumes of data and rapidly evolving fraud techniques. In contrast, deep learning models, such as CNNs, RNNs, and autoencoders, have proven highly effective in analyzing complex, high-dimensional transaction data and detecting subtle, non-linear patterns indicative of fraudulent activity.
In this project, we build a User ID-based fraud detection model using autoencoders, trained on unlabelled real-world credit card transaction data, capable of detecting fraud with a precision of up to 35% and a recall of up to 72%, performing significantly better than traditional ML/statistical baseline models.
SUMMER 2025
TEAM
Deep Learning Boot Camp
FSP Finder
Duncan Clark, Elzbieta Polak, Jared Able, Shuo Yan
NOTE: Available at www.fspfinder.com (HF link is deactivated)
FSP (Foul Speech Pattern) Finder is a tool for preparing music files for radio airplay by detecting and automatically censoring explicit content. We use a custom version of OpenAI's automatic speech recognition model Whisper, which we fine-tuned on over 100 hours of music vocals, to transcribe uploaded music files (with timestamps for each word). We then search for common explicit words (e.g., curse words, racial slurs, etc.) in the transcript. The vocals stem of the track is separated using demucs, then muted at the identified times, to produce a high-quality radio-friendly version of the uploaded track(s).
Our tool comes with an easy-to-use web interface built in Gradio. The tool can process files one at a time or in batches, and the web interface allows the user to view the full transcript of each track along with the words that will be censored, before downloading the edited files.
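The search-and-mute step can be sketched independently of Whisper and demucs. The code below assumes, hypothetically, that the transcript is a list of dictionaries with "word", "start", and "end" keys (times in seconds) and that the vocals stem is a plain list of samples; the real tool's data formats and blocklist are not described here, and a mild placeholder word stands in for explicit content.

```python
import string


def find_flagged_intervals(words, blocklist):
    """Return (start, end) times of transcript words that appear in the blocklist."""
    flagged = []
    for entry in words:
        # Normalize: strip surrounding punctuation/whitespace and lowercase.
        token = entry["word"].strip(string.punctuation + " ").lower()
        if token in blocklist:
            flagged.append((entry["start"], entry["end"]))
    return flagged


def mute_intervals(samples, sample_rate, intervals):
    """Return a copy of the samples with each (start, end) interval zeroed."""
    muted = list(samples)
    for start_s, end_s in intervals:
        lo = max(0, int(start_s * sample_rate))
        hi = min(len(muted), int(end_s * sample_rate))
        for i in range(lo, hi):
            muted[i] = 0.0
    return muted


# Hypothetical word-level transcript and a 3-second vocals stem at 10 Hz.
words = [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "Darn,", "start": 1.0, "end": 1.5},
    {"word": "world", "start": 2.0, "end": 2.5},
]
vocals = [1.0] * 30
intervals = find_flagged_intervals(words, {"darn"})
clean_vocals = mute_intervals(vocals, 10, intervals)
```

Muting only the separated vocals stem, rather than the full mix, is what lets the instrumental keep playing through the censored word.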
SUMMER 2025
TEAM
Data Science Boot Camp
Machine Learning Magnetism
Ahmed Abdelazim, Murod Mirzhalilov, Brandon Abrego, Sayok Chakravarty
Strong electron correlations often lead to emergent magnetic behavior in materials. Predicting such magnetic properties is essential for advancing technologies in spintronics, data storage, and quantum computing. However, traditional methods - whether experimental techniques or density functional theory (DFT) calculations - are often complex, time-consuming, or unreliable in strongly correlated systems. This project aims to build machine learning models that predict the magnetic ordering of inorganic compounds using chemical, structural, electronic, and thermodynamic descriptors. By leveraging existing materials databases (The Materials Project + Bilbao Crystallographic Server MAGNDATA), our goal is to build an ML model that offers a faster, data-driven alternative for accelerating the discovery and design of novel magnetic materials. Our results represent a step forward in tackling the grand challenge of magnetism.
SUMMER 2025
TEAM
Data Science Boot Camp
WikiShield: Guarding against vandalism on Wikipedia
Samarth Chawla, Daniel Milanes Perez, Paul Spears, Zijian Rong, Zihao Fang
Despite being open for editing by anyone, Wikipedia tends to be fairly reliable for a first pass on many topics. Its openness to editors is its greatest strength, but also its greatest vulnerability. Intentionally disruptive and malicious edits can be a nuisance for unsuspecting readers who may come across nonsensical sentences inserted by bad actors. These edits can also pollute downstream platforms (such as search engines) that may rely on Wikipedia to generate short summaries of relevant information.
Reverting these edits often falls to volunteer (human) editors. The aim of our project, "WikiShield," was to produce a machine learning model designed to quickly detect vandalism edits on Wikipedia so that the effort of human volunteer editors is not wasted reverting low-effort attempts at vandalism.
SUMMER 2025
TEAM
Data Science Boot Camp
Predicting Yearly Science Fiction/Fantasy Awards
Zach Raines, Rohan Nair
A number of major awards are given to the ‘best’ new science fiction and fantasy novels each year, such as the Hugo, Nebula, and World Fantasy awards. Predicting which books might win is made especially difficult by the paucity of publicly available sales and review data as a function of time.
In this project we constructed a predictive model to select award winners from a pool of nominees based on publicly available information about the books and a proxy topicality score describing world state and zeitgeist for the year of publication.
SUMMER 2025
TEAM
Data Science Boot Camp
Safeify: A Quality and Safety Metric
Emelie Arvidsson, Alex Margolis, Rebekah Eichberg, Betul Senay Aras
The Safeify project was motivated by the need for a better quality and safety metric for online consumers, as product ratings are often unreliable due to bots and paid reviewers. In addition, there is no known metric or model that flags safety concerns such as recalls and multiple incident reports of products. The goal of the Safeify model is to help consumers by predicting unsafe and poor-quality products that could lead to dissatisfaction.
SUMMER 2025
TEAM
Data Science Boot Camp
Tuning Up Music Highway
James O'Quinn, Yang Mo, John Hurtado Cadavid, Ruixuan Ding, Chilambwe Natasha Wapamenshi
Known as the most dangerous highway in Tennessee, Music Highway, the stretch of Interstate 40 between Memphis and Nashville, could use a serious tuning up. This project investigates the effectiveness and cost-efficiency of potential physical safety interventions along its Madison and Henderson County segments, with the goal of reducing crash severity. We used a data-driven geospatial modeling approach to assess whether adding specific safety features to targeted segments predicts statistically significant changes in crash injury outcomes.
SPRING 2025
TEAM
Deep Learning Boot Camp
Deep Learning - Audio Project (VocalCycleGAN)
Gregory Taylor, Jaspar Wiart, Chutian Ma
In this project, we trained a CycleGAN on speech data and singing data to create a voice synthesizer that takes speech and outputs a synthesized voice to play over a given song.
SPRING 2025
TEAM
Data Science Boot Camp
Today's Texas Might be Tomorrow's Ohio: Building a Geographic Climate Change Predictor
David Pochik, Alison Duck, Tawny Sit, Jack Neustadt
From the dawn of industrialization to today, the average global temperature has shifted upward by ~2.7 degrees Fahrenheit (~1.5 degrees Celsius) due to increased greenhouse gas emissions. If emissions are left unchecked and temperatures continue to rise at their current (or projected) rate, then this will lead to drastic shifts in regional climate. For example, today's annual average temperature in Ohio will increase to that of today's annual average temperature in Texas in Y years.
This project explores and analyzes geographical climate change data in the contiguous United States from 1950 to the current year. The objective is to predict regional features, e.g., temperature, precipitation, snowfall, etc., for a given year based on historical data, i.e., if I want to live in an area Y years from now that has roughly the same temperature or climate as region X today, where would I go?
SPRING 2025
TEAM
Data Science Boot Camp
Who Regulates the Regulators?
Jared Able, Joshua Jackson, Zachary Brennan, Alexandria Wheeler, Nicholas Geiser
With recent major cuts to government regulatory agencies in the US, we investigate whether those cuts are justified. In particular, we analyze the efficacy of RGGI, a state-level cap-and-trade program designed to regulate CO2 emissions in power plants. By using synthetic controls, we answer the counterfactual question: "How would CO2 emissions look in a world where RGGI was never enacted?"
SPRING 2025
TEAM
Data Science Boot Camp
Discovering Next-Gen Battery Materials
Dorisa Tabaku, Avinash Karamchandani, Qinying Chen, Sadisha Nanayakkara, FNU Simran
Building the next generation of batteries—efficient, compact, and sustainable—relies on discovering new materials with the right set of properties. Metal-organic frameworks (MOFs), a class of crystalline and porous materials, have emerged as promising candidates for battery electrodes due to their potential for electrical conductivity. One key property that influences a MOF’s conductivity is its band gap. However, state-of-the-art density functional theory (DFT) calculations used to compute band gaps are computationally expensive. In this project, our goal was to develop a machine learning model to predict the band gaps of MOFs, helping to rapidly identify promising candidates for future energy storage technologies such as next-generation batteries.
SPRING 2025
TEAM
Data Science Boot Camp
Predicting Power Outages
Aaron Weinberg, Evan Morris, Anna Zuckerman, Julio Caceres Gonzales
From ThinkOnward (https://thinkonward.com/app/c/challenges/dynamic-rhythms):
"In this challenge, you'll be tasked with developing a model to predict power outages and how they correlate with extreme, rare weather events (e.g. storms). Your goal is to create a reliable system that can accurately predict these outages. You'll have access to a dataset containing historical weather data and relevant power outages. Your task is to design a model that can effectively forecast future weather impacts on power outages. You're free to explore and experiment with various algorithms, techniques, and models to achieve accurate results. To make things more interesting, we've identified two primary datasets: a storm event dataset and a power outage dataset. These dual datasets will require you to develop a robust model that can adapt to different scenarios and provide accurate forecasts."
SPRING 2025
TEAM
Data Science Boot Camp
Predicting Survival Time After Bone Marrow Transplant
Ruibo Zhang, Chi-Hao Wu, Yang Li, Ray Karpman, Elzbieta Polak
A blood and marrow transplant is a procedure that replaces unhealthy blood-forming cells with healthy ones. It typically involves using blood-forming cells donated by someone else instead of one's own blood-forming cells. The goal of this project is to predict survival rates for patients after a bone marrow transplant.
We implemented and fine-tuned four models: Cox Proportional Hazards Model, XGBoost AFT, Survival Random Forest, and CatBoost AFT. To improve model performance, we hybridized each of the four models with an extra logistic or random forest stratification.
Our dataset comes from a Kaggle competition: https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/.
FALL 2024
TEAM
Data Science Boot Camp
Predicting Problematic Internet Use
Daniel Visscher, Emilie Wiesner, Aaron Weinberg
Internet use has been identified by researchers as having the potential to rise to the level of addiction, with associated increased rates of anxiety and depression. Identifying cases of problematic internet usage currently requires evaluation by an expert, however, which is a significant impediment to screening children and adolescents across society. One potential solution is to rely on data that is more easily and uniformly collected: the kind collected by a family physician, a simple survey, or by a smartwatch. The research question this project sets out to answer is: “Can we predict the level of problematic internet usage exhibited by children and adolescents, based on their physical activity and survey responses?”
FALL 2024
TEAM
Data Science Boot Camp
AP Outcomes to university metrics
Shannon McElhenney, Raymond Tana, Shrabana Hazra, Prabhat Devkota, Jung-Tsung Li
This project was designed to investigate the potential relationship between AP exam performance and the presence of nearby universities. It was initially hypothesized that local (especially R1/R2 or public) universities would contribute to better pass rates for AP exams in their vicinities as a result of their various outreach, dual-enrollment, tutoring, and similar programs for high schoolers. We produce a predictive model that uses a few features related to university presence, personal income, and population to predict AP exam performance.
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
Wunderpus Octopus (New Atlantis)
Ingrida Semenec, Kshitiz Parihar, Nadir Hajouji, Saswat Mishra, Deniz Olgu Devecioglu
Modeling the relationship between biogeochemical layers and chlorophyll density
The distribution and density of chlorophyll in the ocean are critical indicators of marine primary productivity, which influences the global carbon cycle, marine food webs, and climate regulation. Biogeochemical and physical ocean properties, including nutrient availability, light penetration, water temperature, salinity, and ocean currents influence chlorophyll density. Understanding and accurately modeling these relationships is essential for predicting the impacts of environmental changes on marine ecosystems and for managing oceanic resources effectively. We plan to combine multiple Copernicus Marine Datasets to model the chlorophyll density based on the biochemical and physical properties of the ocean.
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
RivusVox Editor
Zachary Bezemek, Francesca Balestrieri
RivusVox Editor: the world's first near-live zero-shot adaptive speech editing system
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
A Vocal-Cue Interpreter for Minimally-Verbal Individuals
Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara
The ReCANVo dataset consists of ~7k audio recordings of vocalizations from 8 minimally-verbal individuals (mostly people with developmental disabilities). The recordings were made in a real-world setting, and were categorized on the spot by the speaker's caregiver based on context, non-verbal cues, and familiarity with the speaker. There are several pre-defined categories such as selftalk, frustrated, delighted, request, etc., and caregivers could also specify custom categories. Our goal was to train a model, per individual, that accurately predicts labels and improves upon previous work.
We train several combinations of models of the form “Feature Extractor + Classifier”. To extract features from the audio data, we use two deep models (HuBERT and AST), each with pre-trained weights, as well as mel spectrograms. As classifiers, we use a 4-layer CNN-based neural network (for mel spectrograms), neural networks with fully connected layers (for features coming from the deep models), and more.
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
“Good composers borrow, Great ones steal!”
Emelie Curl, Tong Shan, Glenn Young, Larsen Linov, Reginald Bain
Throughout history, composers and musicians have borrowed musical elements like chord progressions, rhythms, lyrics, and melodies from each other. Our motivation for this project is born of a fascination with this phenomenon, which of course extends to less legal examples like unconsciously or intentionally copying the work of another. Even famed and highly regarded composers like Bach, Vivaldi, Mozart, and Haydn are not innocent of borrowing from their contemporaries or even recycling their own works. More recently, in a high-profile 2015 court case, artists Robin Thicke and Pharrell Williams were ordered to pay millions of dollars in damages to Marvin Gaye's estate for copyright infringement, on the grounds that they borrowed from Gaye's "Got to Give It Up" when writing their hit "Blurred Lines." Our project aimed to use deep learning to assess the similarity between musical clips to potentially establish a more robust and empirical way to detect music plagiarism.
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
Taxi Demand Forecasting
Ngoc Nguyen, Li Meng, Sriram Raghunath, Nazanin Komeilizadeh, Noah Gillespie, Edward Ramirez
Knowing where to go to find customers is the most important question for taxi drivers and ride-hailing networks. If demand for taxis can be reliably predicted in real time, taxi companies can dispatch drivers in a timely manner and drivers can optimize their route decisions to maximize their earnings in a given day. Consequently, customers will likely receive more reliable service with shorter wait times. This project aims to use rich trip-level data from the NYC Taxi and Limousine Commission to construct time series of taxi rides for 63 taxi zones in Manhattan and forecast demand for rides. We will explore deep learning models for time series, including Multilayer Perceptrons, LSTMs, and Temporal Graph-based Neural Networks, and compare them with a baseline statistical model, ARIMAX.
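As a rough illustration of the data preparation step described above, the sketch below aggregates trip-level records into hourly pickup counts per zone and adds a seasonal-naive forecast as a trivial reference point. The record layout, the zone IDs, and the function names `hourly_demand` and `seasonal_naive` are assumptions made for this example; this is not the team's pipeline, and the seasonal-naive forecast is a stand-in rather than the ARIMAX baseline.

```python
from collections import Counter
from datetime import datetime


def hourly_demand(trips):
    """Count pickups per (zone, hour) from (pickup_time_iso, zone_id) records."""
    counts = Counter()
    for pickup_time, zone in trips:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(pickup_time).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(zone, hour)] += 1
    return counts


def seasonal_naive(series, season=24):
    """Forecast the next value as the value observed one season earlier."""
    if len(series) < season:
        raise ValueError("need at least one full season of history")
    return series[-season]


# Hypothetical trip records: (pickup timestamp, taxi zone ID).
trips = [
    ("2024-01-01T08:05:00", 161),
    ("2024-01-01T08:40:00", 161),
    ("2024-01-01T09:10:00", 161),
    ("2024-01-01T08:15:00", 237),
]
demand = hourly_demand(trips)
```

Each zone's counts, ordered by hour, form one of the per-zone time series that a forecasting model would consume.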
MAY-SUMMER 2024
TEAM
Deep Learning Boot Camp
arXiv Chatbot
Xiaoyu Wang, Ketan Sand, Guoqing Zhang, Tajudeen Mamadou Yacoubou, Tantrik Mukerji
arXiv is the largest open database available, containing nearly 2.4 million research papers spanning 8 major domains and covering everything there is to understand from the tiniest of atoms to the entire cosmos. A large language model (LLM) with access to such a dataset could generate up-to-date, relevant, and, more importantly, precise information with citable sources.
This is exactly what we have done in this project. We have refined the capabilities of Google’s Gemini 1.5 Pro LLM by building a customized Retrieval-Augmented Generation (RAG) pipeline that has access to the entire arXiv database. We then deployed the entire package as an app that mimics a chatbot to make the experience user-friendly.
MAY-SUMMER 2024
TEAM
Data Science Boot Camp
Continuous Glucose Monitoring
Daniel Visscher, Margaret Swerdloff, Noah Gillespie, S. C. Park, Oladimeji Olaluwoye
The idea of the project is to predict high glucose spikes from continuous glucose data, smartwatch data, food logs, and glycemic index. The dataset consists of the following:
1) Tri-axial accelerometer data (movement in subject)
2) Blood volume pulse
3) Interstitial glucose concentration
4) Electrodermal activity
5) Heart rate
6) IBI (interbeat interval)
7) Skin temperature
8) Food log
The data is publicly available at: https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/#files-panel
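To make the prediction target concrete, here is a minimal sketch of how spike labels might be derived from a glucose series. The 180 mg/dL threshold, the two-step look-ahead horizon, and the function names are assumptions chosen for illustration; the project description does not say how a "high glucose spike" was defined.

```python
def label_spikes(glucose_mg_dl, threshold=180.0):
    """Mark each reading as a spike (1) if it exceeds the assumed threshold."""
    return [1 if g > threshold else 0 for g in glucose_mg_dl]


def spike_within_horizon(labels, horizon):
    """Target at step i is 1 if any of the next `horizon` readings is a spike."""
    return [
        int(any(labels[i + 1 : i + 1 + horizon]))
        for i in range(len(labels))
    ]


# Hypothetical glucose readings in mg/dL, evenly spaced in time.
readings = [110, 125, 150, 185, 190, 160, 120]
labels = label_spikes(readings)
targets = spike_within_horizon(labels, horizon=2)
```

A classifier would then be trained to predict `targets` from the wearable and food-log features available at each time step.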
MAY-SUMMER 2024
TEAM
Data Science Boot Camp
Chirp Checker
Andrew Merwin, Caleb Fong, B Mede, Yang Yang, Robert Cass, Calvin Yost-Wolff
The nocturnal soundscapes of late summer and autumn are replete with the familiar chirps, trills, and buzzes of singing insects. But these cryptic performers often remain anonymous and underappreciated.
The goal of this project was to build machine learning models to identify the presence of insects in sound files and to coarsely categorize the sounds as crickets, katydids, or cicadas.
Both Support Vector Classifiers and Convolutional Neural Networks were able to classify insect songs into the broad categories of cricket, katydid, and cicada with 90% accuracy or higher.
In the future, similar, more sophisticated models could be applied to filtering large volumes of passively recorded audio from ecological studies of insects and could power apps that identify insect songs to the species level.
SPRING 2024
TEAM
Data Science Boot Camp
Aware NLP Project III
Mohammad Nooranidoost, Baian Liu, Craig Franze, Mustafa Anıl Tokmak, Himanshu Raj, Peter Williams
This project involves the investigation and evaluation of different methodologies for retrieval for use in RAG (Retrieval-Augmented Generation) systems. In particular, this project investigates retrieval quality for information downloaded from employee subreddits. We investigated the impacts of using clustering, multi-vector indexing, and multi-querying in advanced retrieval methodologies against baseline naive retrieval.
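The difference between naive retrieval and multi-querying can be sketched in a few lines. The example below is only an illustration: the bag-of-words `embed` function stands in for a real embedding model, the documents are invented, and scoring each document by its best match across query variants is one assumed way to merge multi-query results, not necessarily the one the team used.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Naive retrieval: rank documents by similarity to a single query."""
    q = embed(query)
    ranked = sorted(
        range(len(docs)), key=lambda i: cosine(q, embed(docs[i])), reverse=True
    )
    return ranked[:k]


def multi_query_retrieve(queries, docs, k=2):
    """Multi-querying: score each document by its best match over all variants."""
    best = [max(cosine(embed(q), embed(d)) for q in queries) for d in docs]
    return sorted(range(len(docs)), key=lambda i: best[i], reverse=True)[:k]


# Hypothetical posts standing in for employee subreddit content.
docs = [
    "the schedule changes every week without notice",
    "management ignores overtime pay requests",
    "the break room coffee machine is broken",
]
single = retrieve("overtime pay", docs, k=1)
multi = multi_query_retrieve(["overtime pay", "schedule changes"], docs, k=2)
```

With a single query only the overtime post is matched, while the second query variant also surfaces the scheduling post.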
FALL 2023
TEAM
Data Science Boot Camp
Groundwater Forecasting
Riti Bahl, Meredith Sargent, Marcos Ortiz, Chelsea Gary, Anireju Dudun
Groundwater is a critical source of water for human survival. A significant percentage of both drinking and crop irrigation water is drawn from groundwater sources through wells. In the US, overuse of groundwater could have major implications for the future, and forecasting groundwater levels can be useful in understanding its impact. Building on historical data for four wells in Spokane, WA, together with surface water and weather data, we construct and evaluate machine learning models that forecast groundwater levels in the area.
FALL 2023
TEAM
Data Science Boot Camp
Funk
Aydin Ozbek, Dane Miyata, Kristina Knowles, Mario Gomez, Kashish Mehta
Most existing music recommendation systems rely on listeners to provide seed tracks, and then utilize a variety of different approaches to recommend additional tracks in either a playlist-like listening session or as sequential track recommendations based on user feedback.
We built a playlist recommendation engine that takes a different approach, allowing listeners to generate a novel playlist based on a semantic string, such as the title of the desired playlist, a specific mood (happy, relaxed), an atmosphere (tropical vibe), or a function (party music, focus). Using a publicly available dataset of existing playlists (https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), we combine a semantic similarity vector model with a matrix factorization model to allow users to quickly and easily generate playlists to fit any occasion.
FALL 2023
TEAM
Data Science Boot Camp
The Silent Emergency - Predicting Preterm Birth
Katherine Grillaert, Divya Joshi, Alexander Sutherland, Kristina Zvolanek, Noah Rahman
Preterm birth is a primary cause of infant mortality and morbidity in the United States, affecting approximately 1 in 10 births. The rates are notably higher among Black women (14.6%), compared to White (9.4%) and Hispanic women (10.1%). Despite its prevalence, predicting preterm birth remains challenging due to its multifaceted etiology rooted in environmental, biological, genetic, and behavioral interactions. Our project harnesses machine learning techniques to predict preterm birth using electronic health records. This data intersects with social determinants of health, reflecting some of the interactions contributing to preterm birth. Recognizing that under-representation in healthcare research perpetuates racial and ethnic health disparities, we take care to use diverse data to ensure equitable model performance across underrepresented populations.
FALL 2023
TEAM
Data Science Boot Camp
DDTs: Dementia Detection Tool
Himanshu Khanchandani, Clark Butler, Cisil Karaguzel, Selman Ipek, Shreya Shukla
Alzheimer’s disease (AD) is one of the most common types of dementia and frequently affects the elderly. Electroencephalography (EEG) is a non-invasive technique that measures brain activity using external electrodes and may help improve the diagnosis of AD. In this project we use the power spectrum of EEG signals to build a robust machine learning classifier that predicts whether a patient has Alzheimer's or is healthy. We vastly improve upon existing models in the literature by using modified versions of the features they rely on.
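As an illustration of what a power-spectrum feature can look like, the sketch below computes the relative power in a frequency band from a single-channel signal using a plain discrete Fourier transform. The 8-13 Hz alpha band, the 128 Hz sampling rate, the synthetic sine-wave signal, and the function names are assumptions for this example; the project's actual modified features are not specified in the description.

```python
import cmath
import math


def power_spectrum(signal, fs):
    """Return (frequencies, power) for the non-negative DFT bins of a real signal."""
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def relative_band_power(freqs, power, low, high):
    """Fraction of non-DC power that falls in the band [low, high) Hz."""
    total = sum(power[1:])  # skip the DC bin
    band = sum(p for f, p in zip(freqs, power) if f > 0 and low <= f < high)
    return band / total if total else 0.0


# Synthetic 2-second, 10 Hz sine wave sampled at an assumed 128 Hz.
fs = 128
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(2 * fs)]
freqs, power = power_spectrum(signal, fs)
alpha = relative_band_power(freqs, power, 8.0, 13.0)
theta = relative_band_power(freqs, power, 4.0, 8.0)
```

Relative band powers like these, computed per channel, are a common way to turn an EEG power spectrum into classifier inputs.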
