Project Database
View Team Project Submissions for various cohorts and programs below:
77 results were found.
FALL 2024
TEAM 29
Course Assistant Bot
Data Science Boot Camp
Reginald Bain
This project is a proof of concept for what we envision as a broad class of AI-based assistants for course development. It focuses on a tool for building effective course syllabi, which serve a variety of key functions in course development. The syllabus outlines key components of the course and serves as a "contract" between the students and the instructional staff. Course policies must be enforced equitably and worded clearly, particularly in cases where litigation could be involved. This project is an example of a simple tool that helps instructors quickly assess how effectively a syllabus answers student questions by simulating those questions using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG).
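As a rough illustration of the retrieval step in RAG (not the team's implementation), the sketch below scores syllabus passages against a student question using bag-of-words cosine similarity and builds a prompt from the best match. The passages, the `retrieve` and `build_prompt` helpers, and the prompt wording are all hypothetical; a real system would likely use learned embeddings and an actual LLM call.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine_similarity(q_vec, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(question, context):
    """Assemble a prompt that asks the model to answer from the excerpt only."""
    return f"Answer using only this syllabus excerpt:\n{context}\n\nQuestion: {question}"


# Hypothetical syllabus passages, invented for this example.
passages = [
    "Late homework loses 10 percent per day and is not accepted after one week.",
    "Office hours are held on Tuesdays from 2 to 4 pm in Room 101.",
    "The final exam counts for 30 percent of the course grade.",
]
question = "When are office hours held?"
top = retrieve(question, passages, k=1)
prompt = build_prompt(question, top[0])
```

If a simulated question retrieves no relevant passage, or the answer cannot be supported by the retrieved excerpt, that may point to a gap or unclear wording in the syllabus.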
FALL 2024
TEAM 2
Health Insights
Data Science Boot Camp
Chiara Mattamira, Neal Edgren, Shravan Patankar, Leyda Almodóvar Velázquez
In 2021, 11.6% of Americans were diagnosed with diabetes, and every year over 2 million Americans are diagnosed with the disease. What particularly interested us is that the percentage of people living with diabetes is distributed highly unequally across counties in the US. With this project, we aim to identify key risk predictors for diabetes, understand the role of demographic and socioeconomic factors in diabetes prevalence, and provide insights to help make informed policy decisions at the local level.
FALL 2024
TEAM 35
ML Based QSAR On TRPM8
Data Science Boot Camp
Jaehyoun Seiler, Adedolapo Ojoawo, Carmen Al Masri, Jessica Pan
Our goal for this project was to use quantitative structure-activity relationship (QSAR) modeling to predict inhibitors of transient receptor potential cation channel subfamily M member 8 (TRPM8), an ion channel that mediates both cold and pain sensation. QSAR enables the prediction of a molecule's potency based on its physical, chemical, and structural properties. In this project, we developed a QSAR workflow and then chose TRPM8 as a representative protein to test our process. We chose TRPM8 because there are no approved drugs for it. A model that lets us screen thousands of molecules and computationally score their inhibitory ability would be immensely useful, as it cuts down on the costly and lengthy process of screening these molecules in the lab.
FALL 2024
TEAM 31
Occupancy modeling of birds in the Amazon rainforest
Data Science Boot Camp
Jeremy Borden, Sriram Raghunath, Dawit Mengesha, Yusupujiang Aimaiti, Chelsey Hunts
Our goals for this project were to test different occupancy modeling strategies to explore if and how climate change or forest loss has affected bird populations in the Amazonas region of Brazil over the time period of 2012 – 2021, and subsequently evaluate which models performed the best. We tested this for two species – a generalist species, Black vulture (Coragyps atratus) and a forest specialist, Screaming piha (Lipaugus vociferans). We used three different modeling approaches, two standard machine learning classification models – balanced random forest and binary logistic regression – and one modern occupancy modeling approach using the R package spOccupancy.
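For readers unfamiliar with occupancy models: unlike a plain classifier, they separate the probability that a site is occupied from the probability of detecting the species on a given visit. Below is a minimal sketch of the standard single-season site likelihood with a constant occupancy probability `psi` and detection probability `p`. It is not the spOccupancy implementation, and the function name and example numbers are made up for illustration.

```python
def site_likelihood(history, psi, p):
    """Likelihood of one site's detection history.

    history: sequence of 0/1 values, one per visit (1 = species detected).
    psi: probability that the site is occupied.
    p: probability of detecting the species on a visit, given occupancy.
    """
    # Probability of this exact detection history if the site is occupied.
    detect = 1.0
    for y in history:
        detect *= p if y == 1 else (1.0 - p)
    if any(history):
        # At least one detection: the site must be occupied.
        return psi * detect
    # No detections: occupied but missed on every visit, or truly unoccupied.
    return psi * detect + (1.0 - psi)


seen_once = site_likelihood([0, 1, 0], psi=0.6, p=0.5)   # 0.6 * 0.125 = 0.075
never_seen = site_likelihood([0, 0, 0], psi=0.6, p=0.5)  # 0.075 + 0.4 = 0.475
```

The second case shows why a non-detection is not treated as an absence: most of its likelihood comes from the site being unoccupied, but some comes from an occupied site where the bird was missed.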
FALL 2024
TEAM 5
Predicting Problematic Internet Use
Data Science Boot Camp
Daniel Visscher, Emilie Wiesner, Aaron Weinberg
Internet use has been identified by researchers as having the potential to rise to the level of addiction, with associated increased rates of anxiety and depression. Identifying cases of problematic internet usage currently requires evaluation by an expert, however, which is a significant impediment to screening children and adolescents across society. One potential solution is to rely on data that is more easily and uniformly collected: the kind collected by a family physician, a simple survey, or by a smartwatch. The research question this project sets out to answer is: “Can we predict the level of problematic internet usage exhibited by children and adolescents, based on their physical activity and survey responses?”
FALL 2024
TEAM 6
Guess the Elo (chess)
Data Science Boot Camp
Foivos Chnaras, Dorian Soergel, Lang Song
This project explores the relationship between chess performance metrics and Elo ratings, aiming to determine if game performance can reliably predict player ratings or detect cheating.
Using a dataset of 1 million games from The Week in Chess, we leveraged Stockfish, a state-of-the-art chess engine, to analyze moves and compute metrics like Centipawn Loss and Winning Chance Loss.
Key questions include whether single-game metrics can effectively predict Elo and if aggregated metrics across multiple games serve as stronger predictors.
Our results showed weak correlation for single games but revealed stronger patterns when averaging metrics over multiple games. These findings offer valuable insights into Elo prediction and provide a foundation for developing improved anti-cheating tools and user-friendly applications.
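To make the centipawn-loss metric concrete, here is a minimal sketch of one common formulation: for each move, take the engine evaluation before and after the move (both from the mover's point of view, in centipawns) and average the drops. Exact definitions vary between tools, the evaluations below are invented numbers rather than Stockfish output, and the function name is hypothetical.

```python
from statistics import mean


def average_centipawn_loss(evals_before, evals_after):
    """Average centipawn loss over one player's moves in a single game.

    Both lists hold engine evaluations in centipawns from the mover's
    perspective: before the move was played and after it was played.
    A move that improves the evaluation counts as zero loss.
    """
    if len(evals_before) != len(evals_after):
        raise ValueError("evaluation lists must have the same length")
    if not evals_before:
        return 0.0
    losses = [max(0, before - after) for before, after in zip(evals_before, evals_after)]
    return mean(losses)


# One game: per-move losses are 20, 0, and 100 centipawns.
single_game = average_centipawn_loss([30, 20, 50], [10, 25, -50])  # 40

# Aggregating over several games of the same player (hypothetical values).
per_game_acpl = [40, 20, 30]
multi_game = mean(per_game_acpl)  # 30
```

A single blunder can dominate one game's value, which is one plausible reason per-game metrics are noisy while multi-game averages are steadier.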
FALL 2024
TEAM 28
Emergency Medical Services Predictions
Data Science Boot Camp
Jessica Liu, Darius Alizadeh, Jonathan Miller, Karina Cho
The goal of this project is to produce useful predictions using the database published by the National EMS Information Services organization. Given that EMS is underfunded and understaffed across the country, it may be useful to be able to predict how many calls will be in a given region at a given time. If we can produce predictions about the volume of specific types of calls, even better.
A first step would be to do time series modeling using just the data provided in the public NEMSIS database. A second step would be to apply to include data with more location identifiers (specific location identifiers don't automatically come with the default dataset from NEMSIS for medical privacy reasons). A third step could be to try to include other types of data in the model, such as weather, flu/COVID rates, unemployment rates, etc.
FALL 2024
TEAM 9
Thrive or Survive: Predicting the Health of Trees following Forest Fires in Washington
Data Science Boot Camp
Ella Palacios, Henry Cladouhos, Christina Duffield, Allison Cruikshank
The motivation behind this project is to build a model that can accurately predict the survival and health of trees following a forest fire, given previous health data. We picked the state of Washington to begin developing the model due to its diverse boreal & arboreal ecosystems at varying elevations that fall victim to yearly forest fires. Preserving Washington forests is a passion of ours.
Question:
Can we predict tree survival and health following a fire, using data about the tree’s past health and the fire severity?
Stakeholders:
Disaster Mitigation Groups
Commercial Logging
Forestry Researchers
KPI:
Accuracy of tree survival predictions post-fire when compared with actual historic outcomes
FALL 2024
TEAM 26
Facial Emotion Recognition
Data Science Boot Camp
Rui Shi, Menglei Wang, Jiayi Wang, Yuting Ma
Facial Emotion Recognition Project Outline:
• Objective:
o Build a model to classify facial expressions in images into different emotions (e.g., happy, sad, angry, surprised).
o Explore techniques for handling variations in lighting, pose, and image quality.
1. Setup:
o Python environment with TensorFlow/PyTorch, etc.
o Download FER2013 or similar dataset.
2. Data Prep:
o Load and explore data.
o Preprocess: Resize, normalize, augment, split.
3. Model:
o Traditional ML models.
o CNN architecture, etc.
4. Evaluation and Refinement:
o Evaluate on test set, generate confusion matrix.
o Fine-tune hyperparameters, optimize models.
5. Handling Variations:
o Augment for lighting, pose, quality.
o Consider attention mechanisms, robust feature extraction.
6. Conclusion and Future Work:
o Summarize findings, best model, techniques.
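Step 4 of the outline mentions a confusion matrix. As a small, library-free sketch of what that evaluation computes (rows are true emotions, columns are predicted emotions), consider the following. The labels and predictions are invented for illustration and do not come from FER2013 or from the team's models.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count predictions per (true label, predicted label) pair.

    Returns a list of rows; row i is the true label labels[i] and
    column j is the predicted label labels[j].
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


labels = ["happy", "sad", "angry"]
y_true = ["happy", "happy", "sad", "angry", "sad"]
y_pred = ["happy", "sad", "sad", "angry", "angry"]

cm = confusion_matrix(y_true, y_pred, labels)
# cm == [[1, 1, 0],
#        [0, 1, 1],
#        [0, 0, 1]]
acc = accuracy(cm)  # 3 correct out of 5 = 0.6
```

Off-diagonal cells show which emotions get confused with each other, which overall accuracy alone hides.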
FALL 2024
TEAM 10
Predicting the outcome of esports tournaments (Super Smash Bros. Melee)
Data Science Boot Camp
Jaspar Wiart, Dan Ursu
In this project, we focus on predicting the winners of esports tournaments, specifically for Super Smash Bros. Melee. While this is an older game released in 2001, it still sees a vibrant esports scene, with tournaments regularly offering tens of thousands of dollars in prize money, and viewership sometimes in the hundreds of thousands.
Our goal was the same as any other data scientist analyzing sports data - predict the winner as accurately as possible. In particular, we wanted to develop a machine learning model that could predict the outcome of individual matches between players, and develop a second model that could predict the winner out of the top 8 finalists in a tournament. We placed a heavy emphasis on engineering predictive features beyond standard Elo scores.
Ultimately, our single-match predictor was able to outperform the baseline of "predicting whoever has the highest Elo", with a statistically significant improvement in accuracy.
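For context, the "highest Elo" baseline can be sketched in a few lines. The example below uses the standard Elo expected-score formula with the conventional 400-point scale; the match records are invented, and the rating variant the team actually used may differ.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def baseline_accuracy(matches):
    """Accuracy of always predicting the higher-rated player.

    matches: list of (rating_a, rating_b, a_won) tuples.
    """
    correct = sum((rating_a > rating_b) == a_won for rating_a, rating_b, a_won in matches)
    return correct / len(matches)


even = elo_expected_score(1500, 1500)                   # 0.5
favorite = round(elo_expected_score(1600, 1200), 3)     # 0.909

# Hypothetical match records: (rating of A, rating of B, did A win?)
matches = [
    (1600, 1500, True),
    (1500, 1700, False),
    (1550, 1500, False),  # upset: the lower-rated player won
    (1400, 1450, True),   # upset
]
acc = baseline_accuracy(matches)  # 2 of 4 predicted correctly = 0.5
```

Any learned model would need to beat `acc` computed on the same held-out matches for the extra features to be worth it.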
FALL 2024
TEAM 13
Predicting NBA Player Retention
Data Science Boot Camp
Alexander Pandya, Peter Johnson, Andrew Newman, Ryan Moruzzi, Collin Litterell
The NBA is widely considered to be the best basketball league in the world, and has grown over its seven-decade existence into a multibillion-dollar industry. Central to this industry is the problem of roster construction, as team performance depends critically on selecting productive players for all 15 roster spots.
In this project, we aim to perform a novel analysis of NBA statistics, salary, and transaction data in order to determine whether or not a given player will be in the NBA in the next season (i.e., we predict NBA player retention). The resulting model has the potential to aid in the selection of players toward the end of the roster, which has long been one of the most challenging aspects of team construction.
FALL 2024
TEAM 15
AP Outcomes to university metrics
Data Science Boot Camp
Shannon McElhenney, Raymond Tana, Shrabana Hazra, Prabhat Devkota, Jung-Tsung Li
This project was designed to investigate the potential relationship between AP exam performance and the presence of nearby universities. It was initially hypothesized that local (especially R1/R2 or public) universities would contribute to better pass rates for AP exams in their vicinities as a result of their various outreach, dual-enrollment, tutoring, and similar programs for high schoolers. We produce a predictive model that uses a few features related to university presence, personal income, and population to predict AP exam performance.
MAY-SUMMER 2024
TEAM 22
Wunderpus Octopus (New Atlantis)
Deep Learning Boot Camp
Ingrida Semenec, Kshitiz Parihar, Nadir Hajouji, Saswat Mishra, Deniz Olgu Devecioglu
Modeling the relationship between biogeochemical layers and chlorophyll density
The distribution and density of chlorophyll in the ocean are critical indicators of marine primary productivity, which influences the global carbon cycle, marine food webs, and climate regulation. Biogeochemical and physical ocean properties, including nutrient availability, light penetration, water temperature, salinity, and ocean currents influence chlorophyll density. Understanding and accurately modeling these relationships is essential for predicting the impacts of environmental changes on marine ecosystems and for managing oceanic resources effectively. We plan to combine multiple Copernicus Marine Datasets to model the chlorophyll density based on the biochemical and physical properties of the ocean.
MAY-SUMMER 2024
TEAM 5
A Vocal-Cue Interpreter for Minimally-Verbal Individuals
Deep Learning Boot Camp
Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara
The ReCANVo dataset consists of ~7k audio recordings of vocalizations from 8 minimally-verbal individuals (mostly people with developmental disabilities). The recordings were made in a real-world setting, and were categorized on the spot by the speaker's caregiver based on context, non-verbal cues, and familiarity with the speaker. There are several pre-defined categories such as selftalk, frustrated, delighted, request, etc., and caregivers could also specify custom categories. Our goal was to train a model, per individual, that accurately predicts labels and improves upon previous work.
We train several different combinations of models of the form “Feature Extractor + Classifier”. For extracting features from audio data, we use two deep models (HuBERT and AST), each with pre-trained weights, as well as mel spectrograms. As classifiers, we use a 4-layer CNN-based neural network (for mel spectrograms), NNs with fully-connected layers (for features coming from deep models), and more.
MAY-SUMMER 2024
TEAM 20
“Good composers borrow, Great ones steal!”
Deep Learning Boot Camp
Emelie Curl, Tong Shan, Glenn Young, Larsen Linov, Reginald Bain
Throughout history, composers and musicians have borrowed musical elements like chord progressions, rhythms, lyrics, and melodies from each other. Our motivation for this project is born of a fascination with this phenomenon, which of course extends to less legal examples like unconsciously or intentionally copying the work of another. Even famed and highly regarded composers like Bach, Vivaldi, Mozart, and Haydn are not innocent of borrowing from their contemporaries or even recycling their own works. Similarly, in 2015, in a high-profile court case, defendants and artists Robin Thicke and Pharrell Williams were ordered to pay millions of dollars in damages for copyright infringement to Marvin Gaye's estate, considering they borrowed from Gaye’s "Got to Give it Up" when writing their hit "Blurred Lines." Our project aimed to use deep learning to assess the similarity between musical clips to potentially establish a more robust and empirical way to detect music plagiarism.
MAY-SUMMER 2024
TEAM 3
Taxi Demand Forecasting
Deep Learning Boot Camp
Ngoc Nguyen, Li Meng, Sriram Raghunath, Nazanin Komeilizadeh, Noah Gillespie, Edward Ramirez
Knowing where to go to find customers is the most important question for taxi drivers and ride hailing networks. If demand for taxis can be reliably predicted in real-time, taxi companies can dispatch drivers in a timely manner and drivers can optimize their route decision to maximize their earnings in a given day. Consequently, customers will likely receive more reliable service with shorter wait time. This project aims to use rich trip-level data from the NYC Taxi and Limousine Commission to construct time-series taxi rides data for 63 taxi zones in Manhattan and forecast demand for rides. We will explore deep learning models for time series, including Multilayer Perceptrons, LSTM, Temporal Graph-based Neural Networks, and compare them with a baseline statistical model ARIMAX.
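Models such as MLPs and LSTMs are usually trained on fixed-length windows of past values. The sketch below shows one way to turn a demand series for a single zone into (lagged inputs, next value) pairs. The hourly counts and the `make_windows` helper are hypothetical and are not taken from the team's pipeline.

```python
def make_windows(series, n_lags):
    """Split a series into supervised pairs for one-step-ahead forecasting.

    Each input is the n_lags values preceding a target, and each target
    is the value that immediately follows that window.
    """
    inputs, targets = [], []
    for i in range(n_lags, len(series)):
        inputs.append(series[i - n_lags:i])
        targets.append(series[i])
    return inputs, targets


# Hypothetical hourly ride counts for one taxi zone.
rides = [12, 15, 11, 18, 20]
X, y = make_windows(rides, n_lags=3)
# X == [[12, 15, 11], [15, 11, 18]]
# y == [18, 20]
```

For time series, the train/test split should respect time order (train on earlier windows, test on later ones) so that the model is never evaluated on data from before its training period.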
MAY-SUMMER 2024
TEAM 14
arXiv Chatbot
Deep Learning Boot Camp
Xiaoyu Wang, Ketan Sand, Guoqing Zhang, Tajudeen Mamadou Yacoubou, Tantrik Mukerji
arXiv is the largest open database available, containing nearly 2.4 million research papers that span 8 major domains and cover everything from the tiniest atoms to the entire cosmos. A large language model (LLM) with access to such a dataset could generate up-to-date, relevant, and, more importantly, precise information with citable sources.
This is exactly what we have done in this project. We have refined the capabilities of Google’s Gemini 1.5 Pro LLM by building a customized Retrieval-Augmented Generation (RAG) pipeline that has access to the entire arXiv database. We then deployed the entire package as an app that mimics a chatbot to make the experience user-friendly.
MAY-SUMMER 2024
TEAM 7
Geo-locator
Data Science Boot Camp
Aashraya Jha, Dante Bonolis, Zachary Bezemek, Leonhard Hochfilzer, Francesca Balestrieri
In the popular online game Geoguessr, the player is shown a random image from Google Street View and is tasked with guessing its location on the globe as accurately as possible. In this project, we seek to solve a simplified version of this problem using a strategy often employed by professional Geoguessr players: using man-made features (for example, traffic lights) to accurately guess a city.
We use the publicly available GSV-Cities Dataset, which consists of around 500k street-view images taken in 23 different cities. We then train a CNN on the images and on features extracted from them to build our model. The backbone of this CNN is a pre-trained model named MobileNetV2.
MAY-SUMMER 2024
TEAM 12
Jimmy's and Joes vs X's and O's: Predicting results in college sports analyzing talent accumulation and on-field success
Data Science Boot Camp
Reginald Bain, Tung Nguyen, Reid Harris
Recent legislation has changed the landscape of college sports, a multi-billion dollar enterprise with deep roots in American sports culture. With the recent legalization of sports betting in many states and the SCOTUS O’Bannon ruling that allows athletes to be paid through so-called “Name-Image-Likeness (NIL)” deals, evaluating talent and projecting results in college sports is an increasingly interesting problem. By considering both talent accumulation and recent on-field results, our models aim to predict relevant results for sports betting/team construction. In this iteration of the project, our targets are regular season win percentage (using a season level model that we’ll call Model 1) and individual game results (with a game by game model we’ll call Model 2) in the regular season. Our datasets come from a variety of sources including On3, ESPN, 24/7 Sports, The College Football Database, and SportsReference.com.
MAY-SUMMER 2024
TEAM 19
MoonBoard Grade Classification
Data Science Boot Camp
Gautam Prakriya, Adrian Batista Planas, Larsen Linov, Prabhjot Singh
A MoonBoard is a standardized rock climbing wall: fixed holds on a wall of fixed dimensions. This climbing wall comes with an app that generates routes/problems for climbers to move up. Rock climbing routes are assigned subjective grades to represent difficulty, but because the MoonBoard is widely used, there is often a consensus around the grades assigned to MoonBoard problems, which makes them somewhat objective. The goal of the project is to build a model that identifies the grade of a given route.
MAY-SUMMER 2024
TEAM 21
BirdCLEF
Data Science Boot Camp
Junichi Koganemaru, Robert Jeffs, Ashwin Tarikere Ashok Kumar Nag, Salil Singh
This project addresses the BirdCLEF 2024 research code competition hosted on Kaggle by the Cornell Lab of Ornithology. Participants are provided with a dataset containing labeled audio clips of bird calls recorded at various locations in the world. The competition seeks reliable machine learning models for automatic detection and classification of bird species from soundscapes recorded in the Western Ghats of India, a Global Biodiversity Hotspot. The broader goal of this endeavor is to leverage Passive Acoustic Monitoring (PAM) and machine learning techniques to enable conservationists to study bird biodiversity at much greater scales than is possible with observer-based surveys.
MAY-SUMMER 2024
TEAM 24
Company Discourse: How are people talking about my company online?
Data Science Boot Camp
Hannah Lloyd, Vinicius Ambrosi, Gilyoung Cheong, Dohoon Kim
In the age of digital communication, a wealth of information exists in the discourse surrounding companies and their products on social media platforms and online forums. This project utilizes natural language processing (NLP) and machine learning (ML) techniques to construct predictive models capable of assessing and rating comments provided by consumers. By employing these advanced analytical methods, we aim to enhance the correctness and effectiveness of sentiment analysis in understanding and forecasting consumer behavior. This approach is computationally efficient, while maintaining contextual integrity in the data and leveraging complex analytical techniques to gauge audience sentiment through online discourse.
MAY-SUMMER 2024
TEAM 25
AI-powered solutions for the restaurant industry
Data Science Boot Camp
Evaristo Villaseco, Davood Dar
In the US alone, restaurants waste 25bn pounds of food every year before it reaches the consumer's plate, and independent restaurants are a large driver of this. This is crucial for an industry that operates with very low profit margins of 3% to 6% on average. In this project we have partnered with Burnt (https://burnt.squarespace.com), whose mission is to help restaurants automate their back-of-house operational flow: recipe management, inventory forecasting, and analysis and optimization of costs. We will use time series data from restaurants to forecast menu item sales based on factors such as day of the week, weather, and holidays, which will help to optimize ordering decisions for maximum efficiency.
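As a point of reference for this kind of forecasting, a simple baseline is to predict each day's sales of an item as the historical average for that day of the week. The sketch below is an assumption about what such a baseline could look like, not Burnt's or the team's method; the sales figures and function name are invented.

```python
from collections import defaultdict
from datetime import date


def weekday_average_forecast(history):
    """Average units sold per weekday (0 = Monday ... 6 = Sunday).

    history: list of (date, units_sold) pairs for a single menu item.
    """
    by_weekday = defaultdict(list)
    for day, units in history:
        by_weekday[day.weekday()].append(units)
    return {weekday: sum(units) / len(units) for weekday, units in by_weekday.items()}


# Hypothetical sales of one menu item.
history = [
    (date(2024, 6, 3), 40),   # Monday
    (date(2024, 6, 4), 30),   # Tuesday
    (date(2024, 6, 10), 50),  # Monday
]
forecast = weekday_average_forecast(history)

next_monday = date(2024, 6, 17)
predicted_units = forecast[next_monday.weekday()]  # (40 + 50) / 2 = 45.0
```

Models that add weather or holiday features can then be judged by how much they improve on this baseline.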
MAY-SUMMER 2024
TEAM 26
Continuous Glucose Monitoring
Data Science Boot Camp
Daniel Visscher, Margaret Swerdloff, Noah Gillespie, S. C. Park, oladimeji olaluwoye
The idea of the project is to predict high glucose spikes from continuous glucose data, smartwatch data, food logs, and glycemic index. The dataset consists of the following:
1) Tri-axial accelerometer data (movement in subject)
2) Blood volume pulse
3) Interstitial glucose concentration
4) Electrodermal activity
5) Heart rate
6) IBI (interbeat interval)
7) Skin temperature
8) Food log
The data is publicly available at: https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/#files-panel
MAY-SUMMER 2024
TEAM 28
Chirp Checker
Data Science Boot Camp
Andrew Merwin, Caleb Fong, B Mede, Yang Yang, Robert Cass, Calvin Yost-Wolff
The nocturnal soundscapes of late summer and autumn are replete with the familiar chirps, trills, and buzzes of singing insects. But these cryptic performers often remain anonymous and underappreciated.
The goal of this project was to build machine learning models to identify the presence of insects in sound files and to coarsely categorize the sounds as crickets, katydids, or cicadas.
Both Support Vector Classifiers and Convolutional Neural Networks were able to assign insect songs to the broad categories of cricket, katydid, and cicada with 90% accuracy or higher.
In the future, similar, more sophisticated models could be applied to filtering large volumes of passively recorded audio from ecological studies of insects and could power apps that identify insect songs to the species level.
MAY-SUMMER 2024
TEAM 30
Headlines and Market Trends: A Sentiment Analysis Approach to Stock Prediction
Data Science Boot Camp
Jem Guhit, Sarasi Jayasekara, Nawaz Sultani, Timothy Alland, Ogonnaya Romanus, Kenneth Anderson
Financial markets are often affected by sentiment conveyed in news headlines. As major news events can drive significant fluctuations in stock prices, understanding these sentiment trends can provide important insights into market movements. This project aims to answer the question of whether the sentiments extracted from financial news headlines can predict stock movements.
We use five years' worth of data extracted from Yahoo Finance and Stock News API, obtain sentiment scores using FinVader, and train four models (Logistic Regression, Gradient Boosted Trees, XGBoost, and LSTM) to predict whether the next day's stock price will rise or fall. We use a simulated stock portfolio to evaluate the effectiveness of the models.
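As a rough illustration of the portfolio-based evaluation mentioned above, the sketch below holds the stock only on days a model predicts a rise and compares the result with buy-and-hold. This is not the team's code: the function names, the trading rule, and the toy numbers are all assumptions made for the example.

```python
def simulate_portfolio(predictions, daily_returns, starting_cash=1000.0):
    """Hold the stock on days predicted to rise, stay in cash otherwise.

    predictions   -- list of 1 (predicted rise) or 0 (predicted fall), one per day
    daily_returns -- realized fractional returns for the same days (0.01 == +1%)
    Returns the portfolio value at the end of each day.
    """
    if len(predictions) != len(daily_returns):
        raise ValueError("predictions and daily_returns must have equal length")
    value = starting_cash
    history = []
    for predicted_up, realized in zip(predictions, daily_returns):
        if predicted_up:
            # Invested for the day, so the portfolio moves with the stock.
            value *= 1.0 + realized
        history.append(value)
    return history


def buy_and_hold(daily_returns, starting_cash=1000.0):
    """Baseline: stay invested every day regardless of any prediction."""
    return simulate_portfolio([1] * len(daily_returns), daily_returns, starting_cash)


# Toy, made-up numbers purely for illustration (not results from the project).
toy_returns = [0.01, -0.02, 0.015, -0.005]
toy_predictions = [1, 0, 1, 0]
strategy = simulate_portfolio(toy_predictions, toy_returns)
baseline = buy_and_hold(toy_returns)
print(f"strategy: {strategy[-1]:.2f}  buy-and-hold: {baseline[-1]:.2f}")
```

Comparing against buy-and-hold gives a simple reference point; a fuller evaluation would also need to account for transaction costs, which this sketch ignores.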
MAY-SUMMER 2024
TEAM 40
Flavor Finder
Data Science Boot Camp
Zhihan Li, Xue Xiao, Daniel Colon Amill, Andres Martinez, William Porteous, Michael Shteyn
Flavor Finder is a chat client that generates query-specific menu-item recommendations using Retrieval-Augmented Generation (RAG) to comb thousands of Google reviews. This process results in a natural language dish recommendation which is responsive to a user's unique dietary needs and preferences.
SPRING 2024
TEAM 1
Counting Crossings - Team 2
Data Science Boot Camp
Jared Able
Roads and bridges are essential to civilian, commercial, and government transport across the world. One facet of roads that is often ill-captured by GPS navigation systems is that of road overpasses, and this can have severe consequences for drivers of large vehicles. We aim to rectify this lack by predicting the presence of road overpasses in a given satellite image. To do so, we apply a convolutional neural network to a dataset of satellite images that we labeled and assembled.
SPRING 2024
TEAM 8
Analysing Road Safety
Data Science Boot Camp
Pedro Lemos, Jesse Frohlich, Jacob Van Hook, Assaf Bar-Natan
The purpose of this project is to predict collision rates in New York City using public road data and other geological features, and to identify key road infrastructures that can improve safety as economically as possible.
SPRING 2024
TEAM 17
Activity Detection using Biosignals from Wearable Devices
Data Science Boot Camp
Tong Shan, Dushyanth Sirivolu, Fulya Tastan, Philip Barron, Larsen Linov, Ming Li
Our goal is to study the biosignal patterns of everyday activities such as walking, running, and lifting chairs, and to create machine learning models that recognize these activities from biosignals recorded by wearable devices. The signals include electrocardiography (ECG), electrodermal activity (EDA), photoplethysmography (PPG), electromyography (EMG), wrist temperature (TEMP), and chest and wrist actigraphy (ACC). The algorithms can be used to detect a user's daily activities and to monitor the user's health condition.
SPRING 2024
TEAM 18
D&D Combat Length Predictions
Data Science Boot Camp
Alec Traaseth, Deewang Bhamidipati, David Rubinstein, Emre Akaturk, Jeremy Schwend
Dungeons and Dragons (D&D) is one of the most popular tabletop role-playing games, but there are few tools and little data available to answer questions about party optimization, resource management, combat difficulty, etc. Enter FIREBALL, a dataset that logs individual actions and character information from 25,000 unique combat sessions, giving a lot of information to leverage for answering a variety of questions.
This group will be answering the question "How many rounds should a combat take?" This tool could help dungeon masters who are crafting encounters and want to ensure their big perfect monster survives long enough to be a threat, or to prevent a large combat from becoming a boring slog.
SPRING 2024
TEAM 34
Aware NLP Project III
Data Science Boot Camp
Mohammad Nooranidoost, Baian Liu, Craig Franze, Mustafa Anıl Tokmak, Himanshu Raj, Peter Williams
This project involves the investigation and evaluation of different methodologies for retrieval for use in RAG (Retrieval-Augmented Generation) systems. In particular, this project investigates retrieval quality for information downloaded from employee subreddits. We investigated the impacts of using clustering, multi-vector indexing, and multi-querying in advanced retrieval methodologies against baseline naive retrieval.
SPRING 2024
TEAM 38
Recipe Recommender
Data Science Boot Camp
Nadir Hajouji, Felix Almendra Hernandez, Nathan Schley, Ali Arslanhan, Katherine Martin
We designed a recipe recommendation engine that suggests recipes based on a user query and a user's review history.
Our modeling focused mainly on trying to predict recipes that a user was likely to review. We tried some intuitive approaches that didn't work as well as we thought they would, but using singular value decomposition we obtained models that did a surprisingly good job of predicting which reviews were left out of the training set.
We also created a user interface that allows a user to enter a freeform query, and that returns a list of recipes that not only match the query but also take the user's review history into account. We did this by combining our model (which quantified how well a recipe matches up with the user history) with a pretrained sentence transformer (which quantified how well a recipe matches the query).
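One simple way to combine a history-based score with a query-based score, as described above, is a weighted sum after rescaling both to a common range. The sketch below is an assumption about how such a blend could look, not the team's implementation: the `weight` parameter, the min-max scaling, and the toy scores are all hypothetical.

```python
def min_max_scale(scores):
    """Rescale a {recipe: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def blend_rankings(history_scores, query_scores, weight=0.5, top_k=3):
    """Rank recipes by weight * history score + (1 - weight) * query score.

    history_scores -- how well each recipe matches the user's review history
    query_scores   -- how well each recipe matches the free-form query
    """
    h = min_max_scale(history_scores)
    q = min_max_scale(query_scores)
    shared = h.keys() & q.keys()  # only recipes scored by both models
    blended = {r: weight * h[r] + (1.0 - weight) * q[r] for r in shared}
    # Sort by descending blended score; break ties alphabetically.
    return sorted(blended, key=lambda r: (-blended[r], r))[:top_k]


# Toy, made-up scores purely for illustration.
history_scores = {"chili": 0.9, "pancakes": 0.2, "lentil soup": 0.6}
query_scores = {"chili": 0.3, "pancakes": 0.1, "lentil soup": 0.8}
balanced = blend_rankings(history_scores, query_scores, weight=0.5)
history_only = blend_rankings(history_scores, query_scores, weight=1.0)
print(balanced, history_only)
```

Rescaling first keeps one model from dominating simply because its raw scores live on a larger scale.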
SPRING 2024
TEAM 40
Harmful Brain Activity Classification
Data Science Boot Camp
Jianing Yang, Souparna Purohit, Kshitiz Parihar, Evgeniya Lagoda
This project aims to use EEG recordings of critically ill patients provided by Harvard Medical School (https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification) to classify seizures and other harmful brain activities.
SPRING 2024
TEAM 41
Nuclear Localization Signal (NLS) Prediction - NLSeer
Data Science Boot Camp
Scott Auerbach, Ukamaka Nnyaba, Ming Zhang, Yingyi Guo, Hemaa Selvakumar, Cisil Karaguzel
The purpose of the project is to build a prediction tool that estimates the likelihood that a protein's sequence contains a nuclear localization signal, based on the significance of each amino acid. Nuclear localization signals (NLSs) are segments of a protein sequence that direct it towards the nucleus; they have been implicated in human diseases and play an important role in many biological pathways. We employed datasets of whole protein sequences with and without nuclear localization signals and trained both classifiers and neural networks to predict whether or not a protein contained an NLS. Using a random forest classifier, we developed a web app through Flask that can predict whether or not a given protein is likely to have an NLS and, if so, also estimate the likelihood of each amino acid contributing to an NLS.
SPRING 2024
TEAM 1
Pawsitive Retrieval 1
Deep Learning Boot Camp
Marcos Ortiz, Kristina Knowles, Diptanil Roy, Karthik Prabhu Palimar, Sayantan Roy
This project aims to build a model to efficiently identify and rank relevant content from a large dataset of human-generated Reddit posts (5.5 million posts from 34 different subreddits), given an arbitrary user query. The key objectives were to retrieve highly relevant results for queries while keeping retrieval times under 1 second. The long-term application is to use this capability as part of a Retrieval-Augmented Generation (RAG) pipeline for Aware clients.
We focus on systematically varying parameters of our embedding model, as well as applying different filters (before retrieval) and rerankings (after retrieval) that leverage the relationships inherent in the structure of the data.
Using these strategies, we successfully improve the placement of relevant results, as measured by several modified recommender system metrics. These metrics were implemented using a set of over 1000 human-labeled query-result pairs establishing a set of known relevant results for 25 queries.
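The abstract does not say which metrics were used or how they were modified. As one assumed stand-in, the sketch below computes mean reciprocal rank (MRR) from labeled query-result pairs, which rewards placing a known relevant result closer to the top. The query IDs, post IDs, and rankings are made up for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / position of the first relevant result, or 0.0 if none appears."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results_by_query, labels_by_query):
    """Average the reciprocal rank over every labeled query."""
    ranks = [
        reciprocal_rank(results_by_query[query], labels_by_query[query])
        for query in labels_by_query
    ]
    return sum(ranks) / len(ranks)


# Hypothetical human labels: which post IDs are relevant to each query.
labels = {"q1": {"c"}, "q2": {"d"}}
# Hypothetical retrieval output before and after a reranking step.
before = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
after = {"q1": ["c", "a", "b"], "q2": ["d", "e", "f"]}

mrr_before = mean_reciprocal_rank(before, labels)
mrr_after = mean_reciprocal_rank(after, labels)
print(f"MRR before: {mrr_before:.3f}  after: {mrr_after:.3f}")
```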
SPRING 2024
TEAM 6
Medical Image Classification
Deep Learning Boot Camp
Tristan Freiberg, Bailey Forster, Henri Antikainen
Melanoma can affect anyone and early detection is a crucial factor affecting survival rates. Machine learning models could assist trained healthcare professionals in screening for skin cancer. The Human Against Machine 10000 (HAM10000) dataset contains images of 7,470 distinct skin lesions, each belonging to one of seven mutually-exclusive classes of skin lesion, including melanoma, as well as other cancerous types and benign types such as nevi (moles). Our goal was to train a convolutional neural network (CNN) to accurately classify images of skin lesions using the HAM10000 dataset. The project includes a Streamlit app to test users’ classification ability against our fine-tuned models.
SPRING 2024
TEAM 3
Audiobots: Transformers in Disguise
Deep Learning Boot Camp
Dylan Bates, Soheil Anbouhi, Aycan Gamache, Johann Thiel, Paul VanKoughnett, Muhammed Cifci
Historically, songs have been categorized into genres not just for commercial purposes but also to enhance the listening experience and foster cultural exchange through music. Our primary goal was to compare the performance of traditional machine learning models with more advanced deep learning models like transformers, thereby evaluating the effectiveness of these newer neural network architectures for music genre classification.
We used two datasets, GTZAN and Free Music Archive, training a variety of models on the smaller dataset, and choosing the best performing models to train on the larger. Although it is easy enough to overfit on the training data, creating a model that generalizes well is difficult, especially in the case of imbalanced data.
We further tested our best models in a practical scenario by addressing a contemporary debate: whether Beyoncé’s new album Cowboy Carter is country.
FALL 2023
TEAM 45
Biomedical Categorization
Data Science Boot Camp
shayne plourde, Gary Hu, Michelle Lobb, Donna Chen
Diabetes is a major issue in the world, impacting 8.5% of adults and killing 1.5 million people in 2019 according to the World Health Organization. Diabetes is a chronic disease that affects how the body regulates blood glucose levels. Over time, having raised blood glucose levels may lead to serious damage to the nerves and blood vessels, leading to further complications.
The goal of this project is to better understand the relationship between lifestyle factors and diabetes and subsequently predict whether an individual has diabetes or not, based on a survey questionnaire.
FALL 2023
TEAM 37
BrewSavvy
Data Science Boot Camp
Timothy Alland, Brandon Butler, Phuc Nguyen, Aidan Lorenz
We built a beer recommender app that recommends beers to a user based on a list of beers that the user likes. The underlying model uses matrix factorization trained on a data set of ~1.5 million reviews with ~65,000 different beers and ~33,000 users.
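For readers unfamiliar with matrix factorization, the sketch below fits a tiny model with stochastic gradient descent in plain Python. It is a minimal illustration under assumed choices (a global mean offset, L2 regularization, the hyperparameters shown) and uses made-up ratings; the abstract does not specify the team's actual formulation or library.

```python
import random


def factorize(ratings, n_factors=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Learn user and item factor vectors from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mean + sum(pu * qi for pu, qi in zip(P[u], Q[i])))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mean, P, Q


def predict(model, user, item):
    """Predicted rating: global mean plus the user-item factor dot product."""
    mean, P, Q = model
    return mean + sum(pu * qi for pu, qi in zip(P[user], Q[item]))


def mean_squared_error(model, ratings):
    return sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings)


# Toy, made-up ratings purely for illustration.
toy_ratings = [
    ("A", "ipa", 5), ("A", "stout", 1), ("A", "lager", 4),
    ("B", "ipa", 4), ("B", "stout", 2),
    ("C", "ipa", 1), ("C", "stout", 5), ("C", "lager", 2),
]
untrained = factorize(toy_ratings, epochs=0)
trained = factorize(toy_ratings)
baseline_mse = mean_squared_error(untrained, toy_ratings)
trained_mse = mean_squared_error(trained, toy_ratings)
# User B never rated the lager; the model can still score it.
unseen_score = predict(trained, "B", "lager")
```

Recommendations then amount to scoring the beers a user has not rated and returning the highest-scoring ones.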
FALL 2023
TEAM 18
Will my flight be late?
Data Science Boot Camp
Simon Guichandut, Ketan Sand, Tim Hallatt
Flight delays are not only bothersome but also widespread, causing over 200,000 hours of combined delay annually in just 20 of the busiest airports in the United States. This results in a staggering $32.9 billion annual economic loss for the US. The ability to understand the contributing factors and predict delays is crucial for better preparation and minimizing the impact. To address this issue, we utilized 12 years of data from the Bureau of Transportation Statistics in the US (https://www.transtats.bts.gov/HomeDrillChart.asp). The dataset was refined to focus on flights between the top 20 busiest airports, operated by the top 8 airline carriers in the US. We employed a random forest model for training, predicting both the likelihood of delay and quantifying the delay duration. A user-friendly website (https://willmyflightbelate.streamlit.app/) was developed to enhance the overall experience.
FALL 2023
TEAM 33
Mu 'n I: Direction Detection
Data Science Boot Camp
Christopher Stith, Katja Vassilev, Benjamin Riley, Lukas Scheiwiller, Chinmaya Kausik
The goal of this project is to determine the direction of incoming neutrinos detected by the IceCube neutrino observatory, using data posted on Kaggle. The IceCube detector indirectly observes high-energy neutrinos from incoming cosmic radiation. IceCube wants to use data science to estimate the direction, which is then fed into their software that calculates the precise direction. We used several linear regression models, including TensorFlow implementations, before training a convolutional and a fully connected neural network in PyTorch. These networks were trained using features provided by IceCube and additional features used in the regression.
FALL 2023
TEAM 9
Meow-by-Meow
Data Science Boot Camp
Jinjing Yi, Zach Hafen-Saavedra, William Craig, Tantrik Mukerji, Brady Ali Medina
Cat vocalizations (“meows”) are typically directed at humans, rather than other cats. Cat meows therefore present an opportunity for computational audio analysis to improve relationships between cats and their owners. In our analysis we developed an interface for users to upload audio recordings and have them interpreted as “comfortable”, “uncomfortable”, or “hungry”. Our classification leverages machine learning models trained on preprocessed and augmented data from the CatMeows dataset.
FALL 2023
TEAM 27
Somm
Data Science Boot Camp
Ngoc Nguyen, veronica miatto, Pavel Kovalev, Palak Arora, Vishal Kumar
We build a wine recommendation engine using wine reviews. Choosing a wine can be overwhelming due to the sheer variety and lack of consistent categorization. This tool can help buyers navigate many wine options and help sellers procure wines that fit consumer demand. Existing solutions (Vivino, Delectable) are restrictive, as they only allow users to search by wine name, grape type, and price. Our wine recommender fills the gap by letting users enter free-form queries to search for wines that fit their tastes.
FALL 2023
TEAM 20
AI-generated Image Detection
Data Science Boot Camp
Amanda Pan, Hasan Saad, Alina Al Beaini, Cemile Kurkoglu
AI-generated images have become increasingly realistic, prompting a variety of malicious uses. We plan to develop a model for detecting AI-generated images, ideally improving upon some of the current difficulties: generalization to different methods of image generation, robustness to image resizing and compression, and interpretability of results.
FALL 2023
TEAM 8
Climate Risk in Marginalized Communities
Data Science Boot Camp
Zoe Kearney, Bailey Forster, Viraj Meruliya, Braeden Reinoso, Reeya Kumbhojkar
The Environmental Protection Agency (EPA) monitors concentrations of air toxins in the US, including particulate matter smaller than 2.5 micrometers (PM2.5). These fine particles are able to enter the lungs and bloodstream, posing a significant health risk. There is also evidence that people of color in the US are at increased risk for adverse health effects due to climate change and pollution (see review article: Berberian et al. 2022). Our goal is to create a model that uses 2021 ACS 5-Year Estimates Data Profiles and EPA data to identify tracts likely to be at risk of high PM2.5 levels. This is intended as a screening tool to inform further research on the climate-related health risks and pollution sources that are affecting marginalized communities.
FALL 2023
TEAM 6
Groundwater Forecasting
Data Science Boot Camp
Riti Bahl, Meredith Sargent, Marcos Ortiz, Chelsea Gary, Anireju Dudun
Groundwater is a critical source of water for human survival. A significant percentage of both drinking and crop irrigation water is drawn from groundwater sources through wells. In the US, overuse of groundwater could have major implications for the future, and forecasting groundwater levels can be useful in understanding its impact. Building on historical data for four wells, together with surface water and weather data, in Spokane, WA, we construct and evaluate machine learning models that forecast groundwater levels in the area.
FALL 2023
TEAM 7
Funk
Data Science Boot Camp
aydin ozbek, Dane Miyata, Kristina Knowles, Mario Gomez, Kashish Mehta
Most existing music recommendation systems rely on listeners to provide seed tracks, and then utilize a variety of different approaches to recommend additional tracks in either a playlist-like listening session or as sequential track recommendations based on user feedback.
We built a playlist recommendation engine that takes a different approach, allowing listeners to generate a novel playlist based on a semantic string, such as the title of the desired playlist, a specific mood (happy, relaxed), an atmosphere (tropical vibe), or a function (party music, focus). Using a publicly available dataset of existing playlists (https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), we combine a semantic similarity vector model with a matrix factorization model to allow users to quickly and easily generate playlists to fit any occasion.
FALL 2023
TEAM 25
The Silent Emergency - Predicting Preterm Birth
Data Science Boot Camp
Katherine Grillaert, Divya Joshi, Alexander Sutherland, Kristina Zvolanek, Noah Rahman
Preterm birth is a primary cause of infant mortality and morbidity in the United States, affecting approximately 1 in 10 births. The rates are notably higher among Black women (14.6%), compared to White (9.4%) and Hispanic women (10.1%). Despite its prevalence, predicting preterm birth remains challenging due to its multifaceted etiology rooted in environmental, biological, genetic, and behavioral interactions. Our project harnesses machine learning techniques to predict preterm birth using electronic health records. This data intersects with social determinants of health, reflecting some of the interactions contributing to preterm birth. Recognizing that under-representation in healthcare research perpetuates racial and ethnic health disparities, we take care to use diverse data to ensure equitable model performance across underrepresented populations.
FALL 2023
TEAM 31
DDTs: Dementia Detection Tool
Data Science Boot Camp
Himanshu Khanchandani, Clark Butler, Cisil Karaguzel, Selman Ipek, Shreya Shukla
Alzheimer’s disease (AD) is one of the most common types of dementia and frequently affects the elderly. Electroencephalography (EEG) is a non-invasive technique that measures brain activity using external electrodes and may help improve the diagnosis of AD. In this project we use the power spectrum of EEG signals to build a robust machine learning classifier that predicts whether a patient has Alzheimer’s or is healthy. We improve substantially on existing models in the literature by using features modified from those used in prior work.
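As a rough illustration of what power-spectrum features can look like, the sketch below computes relative band power from a single signal using a direct discrete Fourier transform. The band boundaries, the 128 Hz sampling rate, and the synthetic 10 Hz test signal are assumptions chosen for the example; the project's actual features and preprocessing are not described here.

```python
import cmath
import math

def power_spectrum(signal):
    """Return |X_k|^2 for k = 0..N/2 using a direct O(N^2) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def relative_band_power(signal, fs, bands):
    """Fraction of non-DC spectral power that falls in each [low, high) band."""
    spectrum = power_spectrum(signal)
    n = len(signal)
    freqs = [k * fs / n for k in range(len(spectrum))]
    total = sum(spectrum[1:])  # skip the DC component at index 0
    features = {}
    for name, (low, high) in bands.items():
        band = sum(p for f, p in zip(freqs, spectrum) if f > 0 and low <= f < high)
        features[name] = band / total if total else 0.0
    return features

# Approximate conventional EEG bands in Hz (assumed, not taken from the project).
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

fs = 128  # assumed sampling rate in Hz
# One second of a synthetic 10 Hz sine wave, which lies in the alpha band.
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]

features = relative_band_power(signal, fs, BANDS)
print(max(features, key=features.get))
```

A vector of such band powers per electrode could then be passed to any standard classifier. For real recordings, a fast Fourier transform with windowing would be a more practical choice than the direct transform used here for self-containment.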
FALL 2021
TEAM 13
Narwhal
Data Science Boot Camp
Alana Huszar, Sanal Shivaprasad, Yueqiao Wu, Yili Zhang
When a prior authorization (PA) form is submitted to an insurer, it’s important to know whether it will be approved. In this project, we use data analysis techniques to build a model that predicts, from the information contained in a PA form, whether the form will be approved.
![github URL](https://static.wixstatic.com/media/55f531_912a737dafa247788d855b76b7b02446~mv2.png/v1/fill/w_45,h_45,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/25231-github-cat-in-a-circle-icon-vector.png)
FALL 2021
TEAM 7
Hedgehog
Data Science Boot Camp
Andrew McMillan, Shashank G. Markande, Jithin Madhusudanan Sreekala, Josh Tawabutr
With the increased popularity of zero-commission investing and trading apps such as Robinhood and the rise of retail traders, the influence of social media on financial markets has grown. Twitter was one of the most used social media platforms for this purpose. First, we used a machine learning classifier to predict the popularity of a tweet. Following that, we created a scenario in which the stocks of S&P 500 companies mentioned in popular tweets were bought and held over a certain period of time. The returns from each stock purchase were then used to train a machine learning model that produces buy recommendations. Traders could potentially use the stock recommendations from our model to plan investment strategies and build stock portfolios.
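The buy-and-hold scenario can be sketched as a simple holding-period return calculation. The function name, the toy price series, and the choice to turn a positive return into a binary label are assumptions for illustration; the summary does not specify the holding period or how returns were encoded for training.

```python
def holding_period_return(prices, buy_day, hold_days):
    """Simple return from buying at prices[buy_day] and selling hold_days later."""
    sell_day = buy_day + hold_days
    if sell_day >= len(prices):
        raise ValueError("holding period runs past the available price history")
    buy_price = prices[buy_day]
    return (prices[sell_day] - buy_price) / buy_price

# Invented daily closing prices for one stock mentioned in a popular tweet.
prices = [100.0, 102.0, 101.0, 105.0, 110.0]

# Buy on day 1 (the hypothetical tweet date) and hold for 3 trading days.
ret = holding_period_return(prices, buy_day=1, hold_days=3)

# One possible training target: 1 if the purchase made money, else 0.
label = int(ret > 0)
print(round(ret, 4), label)
```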
FALL 2021
TEAM 10
Koala