Project Database
View Team Project Submissions for various cohorts and programs below:
73 results were found.
MAY-SUMMER 2024
TEAM
AI-powered solutions for the restaurant industry
Deep Learning Boot Camp
Evaristo Villaseco, Davood Dar, Amir Kazemi-Moridani
In the US alone, restaurants waste 25bn pounds of food every year before it reaches the consumer's plate, and independent restaurants are a large driver of this. In this project we have partnered with Burnt (https://burnt.squarespace.com), whose mission is to help restaurants automate their back-of-house operational flow: recipe management, inventory forecasting, analysis and optimization of costs. Restaurants on average use 5-10 different suppliers with varying ways of describing their supplies. To tackle this, we will employ a large language model (LLM) and fine-tune it to process textual descriptions of the supplies and categorize them into predefined classes. We will train our model on collected and unlabelled real-world data provided by Burnt. This will not only aid in streamlining inventory management and procurement, but will also impact management of food costs, menu item gross product, and budget forecasts.
MAY-SUMMER 2024
TEAM
Forecast Direct Normal Irradiance of Solar Energy using Deep Learning Model
Deep Learning Boot Camp
Md Mehedi Hasan
The goal of this project is to forecast Direct Normal Irradiance (DNI) using the Temporal Fusion Transformer, a deep learning model. DNI accounts for a large portion of photovoltaic solar energy, so forecasting it accurately has become essential for the effective operation and maintenance of power systems.
MAY-SUMMER 2024
TEAM
Advancing Cardiac Diagnostics: Machine Learning Approaches for ECG-Based Heart Condition Analysis and Reconstruction
Deep Learning Boot Camp
Gbocho Masato Terasaki
The project aims to advance cardiac health diagnostics through two critical tasks: irregular heartbeat classification and activation map reconstruction using ECG signals. Utilizing the ECG Heartbeat Categorization Dataset, the project focuses on developing a robust multiclass classification model to accurately diagnose various irregular heartbeats, facilitating early detection and timely treatment of cardiac conditions. In parallel, the project employs a neural network to transform ECG sequences into activation maps, using simulated intracardiac transmembrane voltage recordings. This reconstruction task is designed to deepen our understanding of the heart’s electrical activity, offering detailed insights that can lead to more precise and effective clinical interventions.
MAY-SUMMER 2024
TEAM
What the Text?
Deep Learning Boot Camp
Robert Jeffs, Salil Singh, Ashwin Tarikere Ashok Kumar Nag, Junichi Koganemaru
With the explosion of LLMs and NLP methods, AI-generated text has become ubiquitous on the internet. This presents several challenges across many contexts, ranging from plagiarism in the academic setting to misinformation on social media and its consequences in electoral politics. With this in mind, we explore a range of classical statistical learning classifiers as well as deep learning based transformers to detect AI-generated text.
MAY-SUMMER 2024
TEAM
Lung Cancer detection with Histopathological Images
Deep Learning Boot Camp
Abuduaini Niyazi, Sujoy Upadhyay, Rexiati Dilimulati, Ronak Desai, Derek Kielty
One important application of deep neural networks in the healthcare industry is detecting cancerous cells at an early stage with the help of computer vision. In this project we will train convolutional neural networks on 15,000 histopathological images to distinguish normal lung tissue from cancerous tissue.
MAY-SUMMER 2024
TEAM
Trick Taker
Deep Learning Boot Camp
Shin Kim, Sixuan Lou, Juergen Kritschgau, Edward Varvak, Yizhen Zhao
In reinforcement learning problems, the agent learns how to maximize a numerical reward signal through direct interactions with the environment and without relying on a complete model of the environment. In fact, agents using model free methods learn from raw experience and without any inferences about how the environment will behave. An important model free method is the use of the Q-Learning algorithm to approximate the optimal action value function. However, it can be impractical to estimate the optimal action value function for every possible state-action pair. Deep Q-Learning uses a neural network trained with a variant of Q-Learning as a nonlinear function approximator of the optimal action value function. Our objective is to use Deep Q-Learning to train an agent to make legal moves and/or win tricks while playing the card game Spades (without bidding).
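For reference, the tabular Q-Learning update that Deep Q-Learning approximates with a neural network can be sketched as follows. This is a minimal illustration and not the team's code: the state and action names, the learning rate `alpha`, and the discount factor `gamma` are assumptions chosen for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """Apply one tabular Q-Learning update and return the new Q-value.

    q maps (state, action) pairs to estimated action values.
    next_actions lists the legal actions in next_state; an empty list
    is treated as a terminal state with value 0.
    """
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical transitions, for illustration only.
q = defaultdict(float)
q_update(q, "s0", "play_high", 1.0, "s1", ["play_high", "play_low"])
q_update(q, "s1", "play_low", 0.0, "s0", ["play_high"])
```

A table like `q` becomes impractical when the number of state-action pairs is large, which is the motivation for replacing it with a network that maps a state encoding to action values.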
MAY-SUMMER 2024
TEAM
ISIC 2024 - Skin Cancer Detection with 3D-TBP
Deep Learning Boot Camp
Madelyn Esther Cruz, Maksim Kosmakov
The ISIC2024 Skin Cancer Detection project aims to develop machine learning algorithms that identify skin cancer from 3D total body photographs and patient metadata. The goal is to support early diagnosis of melanoma, basal cell carcinoma, and squamous cell carcinoma, improving patient outcomes.
Inspired by the ISIC2024 Kaggle competition, we used a dataset of skin lesion images and related clinical information from thousands of patients. We trained deep learning models, including ResNet50 and EfficientNetV2, to predict malignancy, with a focus on high sensitivity. The primary evaluation metric was the Partial Area Under the ROC Curve (pAUC) above an 80% True Positive Rate (TPR).
Our best model achieved a pAUC of 0.140, demonstrating the potential of AI in skin cancer detection. Future work will focus on refining the models, improving image preprocessing, and exploring new ensemble techniques.
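As a rough illustration of the metric, the sketch below computes the area of the region that lies under the ROC curve and above the 80% TPR line, so that a perfect classifier scores 0.2 and a constant-score classifier scores 0.02. It is one possible pure-Python implementation written for this summary, not the competition's scoring code; the function names and toy labels are assumptions, and both classes must be present in `labels`.

```python
def roc_points(labels, scores):
    """Return ROC curve points (FPR, TPR), grouping tied scores together."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def pauc_above_tpr(labels, scores, min_tpr=0.8):
    """Area between the ROC curve and the horizontal line TPR = min_tpr."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        h0, h1 = y0 - min_tpr, y1 - min_tpr
        if h1 <= 0:
            continue  # segment lies entirely at or below the line
        if h0 >= 0:
            area += (x1 - x0) * (h0 + h1) / 2  # trapezoid above the line
        else:
            # The segment crosses the line: keep only the triangle above it.
            x_cross = x0 + (x1 - x0) * (min_tpr - y0) / (y1 - y0)
            area += (x1 - x_cross) * h1 / 2
    return area


# Toy data for illustration only.
labels = [0, 0, 1, 1]
perfect = pauc_above_tpr(labels, [0.1, 0.2, 0.8, 0.9])
chance = pauc_above_tpr(labels, [0.5, 0.5, 0.5, 0.5])
```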
MAY-SUMMER 2024
TEAM
A Vocal-Cue Interpreter for Minimally-Verbal Individuals
Deep Learning Boot Camp
Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara
The ReCANVo dataset consists of ~7k audio recordings of vocalizations from 8 minimally-verbal individuals (mostly people with developmental disabilities). The recordings were made in a real-world setting, and were categorized on the spot by the speaker's caregiver based on context, non-verbal cues, and familiarity with the speaker. There are several pre-defined categories such as selftalk, frustrated, delighted, request, etc., and caregivers could also specify custom categories. Our goal was to train a model, per individual, that accurately predicts labels and improves upon previous work.
We train several different combinations of models of the form “Feature Extractor + Classifier”. For extracting features from audio data, we use two deep models (HuBERT and AST), each with pre-trained weights, as well as mel spectrograms. As classifiers, we use a 4-layer CNN-based neural network (for mel spectrograms), NNs with fully-connected layers (for features coming from deep models), and more.
MAY-SUMMER 2024
TEAM
“Good composers borrow, Great ones steal!”
Deep Learning Boot Camp
Emelie Curl, Tong Shan, Glenn Young, Larsen Linov, Reginald Bain
Throughout history, composers and musicians have borrowed musical elements like chord progressions, rhythms, lyrics, and melodies from each other. Our motivation for this project is born of a fascination with this phenomenon, which of course extends to less legal examples like unconsciously or intentionally copying the work of another. Even famed and highly regarded composers like Bach, Vivaldi, Mozart, and Haydn are not innocent of borrowing from their contemporaries or even recycling their own works. Similarly, in 2015, in a high-profile court case, defendants and artists Robin Thicke and Pharrell Williams were ordered to pay millions of dollars in damages for copyright infringement to Marvin Gaye's estate, considering they borrowed from Gaye’s "Got to Give it Up" when writing their hit "Blurred Lines." Our project aimed to use deep learning to assess the similarity between musical clips to potentially establish a more robust and empirical way to detect music plagiarism.
MAY-SUMMER 2024
TEAM
Proof Truth
Deep Learning Boot Camp
Jared Able, Hongyi Shen, Dennis Nguyen, Evgeniya Lagoda, Zhihan Li
Can AI learn to do logic? We'll explore this question by teaching an AI to fill in gaps in mathematical proofs. Working with a database of over 40,000 proofs, we apply three different models to various prediction scenarios: a graph isomorphism network to predict a logical justification; an LSTM recurrent neural network to predict a logical statement; and an attention-based model to predict distance between statements within a proof.
MAY-SUMMER 2024
TEAM
Taxi Demand Forecasting
Deep Learning Boot Camp
Ngoc Nguyen, Li Meng, Sriram Raghunath, Nazanin Komeilizadeh, Noah Gillespie, Edward Ramirez
Knowing where to go to find customers is the most important question for taxi drivers and ride hailing networks. If demand for taxis can be reliably predicted in real-time, taxi companies can dispatch drivers in a timely manner and drivers can optimize their route decisions to maximize their earnings in a given day. Consequently, customers will likely receive more reliable service with shorter wait times. This project aims to use rich trip-level data from the NYC Taxi and Limousine Commission to construct time-series taxi rides data for 63 taxi zones in Manhattan and forecast demand for rides. We will explore deep learning models for time series, including Multilayer Perceptrons, LSTM, and Temporal Graph-based Neural Networks, and compare them with a baseline statistical model, ARIMAX.
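Models such as Multilayer Perceptrons and LSTMs are usually trained on fixed-length windows of past observations. The sketch below shows one common way to turn a demand series into such input/target pairs; the function name, the window length, and the pickup counts are assumptions made up for illustration, not the project's data or code.

```python
def make_windows(series, n_lags, horizon=1):
    """Split a series into (past window, future value) training pairs.

    Each input holds n_lags consecutive observations; the target is the
    value `horizon` steps after the end of that window.
    """
    inputs, targets = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        inputs.append(series[t - n_lags:t])
        targets.append(series[t + horizon - 1])
    return inputs, targets


# Made-up hourly pickup counts for a single taxi zone.
hourly_pickups = [12, 15, 9, 20, 18, 25]
X, y = make_windows(hourly_pickups, n_lags=3)
```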
MAY-SUMMER 2024
TEAM
Sarcasm Detection in Memes
Deep Learning Boot Camp
Yiyang Liu, Eunbin Kim
Sarcasm detection has been an interesting and difficult task in NLP. Sarcasm in memes is special in that it is usually conveyed by the contrast between the caption and the image. For our project, we use a dataset of 7000 memes labeled as "sarcastic" or "not sarcastic". We took inspiration from OpenAI's CLIP model and implemented a multi-modal binary classifier for sarcasm detection. Our model achieved 74% accuracy with a 0.80 AUC-ROC score.
MAY-SUMMER 2024
TEAM
AntiBERTotics
Deep Learning Boot Camp
Scott Auerbach, Craig Corsi, Samuel Ogunfuye, Hatice Mutlu
The rise in bacterial pathogens that are resistant to current antibiotics due to misuse has the potential to escalate into a health catastrophe. The main idea is to use optimized large language models intended for small-molecule drugs, as well as those for parsing DNA and other genetic information, to construct a model based on structural correlations that can predict whether or not known pathogens are resistant to a given antibiotic. Genes coding for antimicrobial resistance are represented via letters corresponding to nucleotides in the DNA sequence (for example a, c, t, or g), while antibiotics are represented through SMILES (Simplified Molecular Input Line Entry System), a 2-d representation of the 3-d structure using an essentially extended Latin alphabet.
MAY-SUMMER 2024
TEAM
Detecting Engine Usage in Chess
Deep Learning Boot Camp
Calvin Pozderac, Philip Barron, Nathaniel Tamminga, Dushyanth Sirivolu
When then-world champion Magnus Carlsen lost to Niemann in a surprise upset, Magnus accused his opponent of cheating. Were such accusations well-founded?
The project: train a neural network to distinguish between human chess and engine chess.
For data sets: we use in-person, high-level tournament play for the human moves, since it is harder to cheat there, and an online database of engine evaluations from Lichess.
MAY-SUMMER 2024
TEAM
Exploring Causality between News Sentiment and Stock Movement Prediction
Deep Learning Boot Camp
Jem Guhit, Nawaz Sultani, Saeid Hajizadeh, Samson Johnson
Motivation: Financial markets are often affected by sentiment conveyed in news headlines. Understanding the causal relationship between news sentiment and stock price movements can provide deeper insight into market dynamics.
Goal: Investigate the causal effects between news sentiment and stock price movements. This includes predicting stock movement trends based on news sentiment analysis and understanding how stock movement changes based on future news sentiment. This project aims to study these effects to improve stock movement predictions and optimize portfolio performance.
Project Proposal Structure:
- Improve on the sentiment analysis tool used by exploring transformer models to get more accurate sentiment scores
- Explore Bi-directional Models and CNN
- Refine Baseline Model (used ARIMA in the last project)
- Refine simulation of trading strategy that is used to calculate average percentage of portfolio growth – did our models make profit?
MAY-SUMMER 2024
TEAM
Music Subgenre Classification
Deep Learning Boot Camp
Anthony Kling, Ramachandra Rahul Taduri, Reid Harris
Music genres are essential for organizing and categorizing music, making it easier for listeners to discover, enjoy, and connect with styles that resonate with them. Genres also carry historical, cultural, and sonic significance. Playlists, which often focus on a single subgenre, have become an increasingly popular way to discover new music.
We address the multi-label classification problem of identifying a song's genre(s) using acoustic features extracted from audio files. We train a variety of supervised learning models to determine genre. Rather than focusing on broad genres (e.g., jazz, hip hop, electronic), we concentrate on four subgenres of electronic music: techno, house, trance, and drum and bass. While these subgenres are distinct and well-defined, they can be challenging to differentiate. We train various models, including XGBoost and neural networks, on data obtained from AcousticBrainz.
MAY-SUMMER 2024
TEAM
Deep Learning for Portfolio Optimization
Deep Learning Boot Camp
Arvind Suresh, Li Zhu, Jingheng Wang
This project develops advanced models for stock allocation to maximize returns using both short-term and long-term portfolio optimization strategies. For short-term optimization, we combine sentiment analysis via the BERT language model with the Black-Litterman model for dynamic 10-day portfolio adjustments. For long-term optimization, we utilize a Long Short-Term Memory (LSTM) network to predict stock performance and compare it against the Markowitz Mean-Variance Model and Genetic Algorithm. These approaches aim to create a versatile toolkit for optimizing portfolios under varying market conditions and investment horizons.
MAY-SUMMER 2024
TEAM
arXiv Chatbot
Deep Learning Boot Camp
Xiaoyu Wang, Ketan Sand, Guoqing Zhang, Tajudeen Mamadou Yacoubou, Tantrik Mukerji
arXiv is the largest open database available, containing nearly 2.4 million research papers and spanning 8 major domains that cover everything there is to understand from the tiniest of atoms to the entire cosmos. A large language model (LLM) with access to such a dataset would be unprecedented in its ability to generate up-to-date, relevant, and, more importantly, precise information with citable sources.
This is exactly what we have done in this project. We have refined the capabilities of Google’s Gemini 1.5 Pro LLM by building a customized Retrieval-Augmented Generation (RAG) pipeline that has access to the entire arXiv database. We then deployed the entire package into an app that mimics a chatbot to make the experience user-friendly.
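The retrieval step of a RAG pipeline can be sketched in a few lines: rank stored documents by similarity to the question, then place the best matches in the prompt sent to the LLM. The example below is a toy illustration under loud assumptions: the bag-of-words `embed` function stands in for a learned embedding model, the three documents and their ids are invented, and the final call to the LLM is omitted. It does not describe the team's actual pipeline.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: real systems use learned embeddings)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents whose abstracts are most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d["abstract"])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, hits):
    """Assemble the retrieved sources and the question into one prompt string."""
    context = "\n".join(f"[{d['id']}] {d['abstract']}" for d in hits)
    return ("Answer using only the sources below and cite their ids.\n"
            f"{context}\nQuestion: {query}")


# Invented documents with placeholder ids, for illustration only.
docs = [
    {"id": "doc-1", "abstract": "fast radio bursts from magnetars"},
    {"id": "doc-2", "abstract": "graph neural networks for molecules"},
    {"id": "doc-3", "abstract": "radio telescope survey design"},
]
question = "origin of fast radio bursts"
hits = retrieve(question, docs, k=2)
prompt = build_prompt(question, hits)
```

Because the prompt carries the source ids alongside the text, the model's answer can cite the documents it drew on.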
MAY-SUMMER 2024
TEAM
NLP stock prediction
Data Science Boot Camp
Jingheng Wang, Joseph Schmidt, Aoran Wu, Alborz Ranjbar
We design a bot trading technique based on machine learning and Twitter sentiment analysis. We compare sentiment models like VADER, Naive Bayes, and BERT to see which performs best on tweet sentiment analysis. We then use these tweets, their sentiment, and the popularity of the user to assign a modified sentiment score. This modified score is one of several input features for several models that aim to calculate the best action, buying or selling a stock, to maximize profit. In the end, we gain a 5% advantage over the baseline model using an LSTM model.
MAY-SUMMER 2024
TEAM
Climate-Based Forecasting of Dengue Dynamics
Deep Learning Boot Camp
Abdullah Al Helal, Haridas Kumar Das, Chun-hao Chen, Feng Zhu
Dengue outbreaks have become a global concern, affecting many regions such as the Americas, Africa, the Middle East, Asia, and the Pacific Islands. Over the past two decades, there has been a notable rise in dengue cases worldwide, with significant impacts observed in countries like Brazil and Bangladesh. Moreover, in the United States, local dengue transmission has been reported in a few states, including Florida, Hawaii, Texas, Arizona, and California. Numerous studies have demonstrated the correlation between climate factors—such as temperature and rainfall—and dengue, Zika, chikungunya, and yellow fever transmission. Specifically, elevated temperatures have been linked to an increased dengue infection risk, while extreme rainfall events have been shown to decrease this risk. In this project, we deploy deep learning time series analysis to analyze climate and epidemiological data in order to forecast dengue dynamics.
MAY-SUMMER 2024
TEAM
Stanford Sentiment Treebank with 5 labels (SST-5)
Deep Learning Boot Camp
Gilyoung Cheong, Dohoon Kim, Vinicius Ambrosi
The SST-5, or Stanford Sentiment Treebank with 5 labels, is a dataset utilized for sentiment analysis. It contains 11,855 individual sentences sourced from movie reviews, along with 215,154 unique phrases from parse trees. These phrases are annotated by three human judges and are categorized as negative, somewhat negative, neutral, somewhat positive, or positive. This fine-grained labeling is what gives the dataset its name, SST-5. According to the leaderboard, the highest accuracy on the test set is 59.8, but more interestingly, the model ranked 5th, with an accuracy of 55.5, used only the BERT Large model with dropout. The purpose of our project is to see whether we can reach the top 5 of the leaderboard through hyperparameter tuning (of the learning rate and the hyperparameters of the Adam optimizer) and fine-tuning.
MAY-SUMMER 2024
TEAM
Forecast Direct Normal Irradiance of Solar Energy
Data Science Boot Camp
Md Mehedi Hasan, Kamlesh Sarkar
This project aims to forecast Direct Normal Irradiance (DNI) a day and a week ahead, which is crucial in solar energy. The adoption of solar energy into the power grid has increased, and DNI is particularly important in forecasting the performance of concentrating solar power (CSP) systems. Photovoltaic panels track the sun to receive more DNI. DNI accounts for a large portion of PV solar energy. So it has become essential to forecast DNI accurately for the effective operation and maintenance of power systems, ensuring their ability.
MAY-SUMMER 2024
TEAM
Studying Data from the Food Environment Atlas - GROUP 2
Data Science Boot Camp
Cyril Enyi, Mercy Amankwah, Danielle Brager, Nicole Bruce, Monalisa Dutta
Is it possible to utilize the data from the Food Environment Atlas (https://www.ers.usda.gov/data-products/food-environment-atlas/) to examine the determinants of a community's access to affordable and healthy food?
Exploring the connections between specific factors and their impact on typical food habits in communities could yield fascinating insights.
MAY-SUMMER 2024
TEAM
NSPP: News-Based Stock Price Prediction
Data Science Boot Camp
Nasimeh Heydaribeni, Mahdi Soleymani
In this project, we intend to investigate whether or not the news headlines and abstracts are good predictors of stock prices. We intend to use open large language models to extract the useful features of the news to then apply various regression methods on them.
MAY-SUMMER 2024
TEAM
Geo-locator
Data Science Boot Camp
Aashraya Jha, Dante Bonolis, Zachary Bezemek, Leonhard Hochfilzer, Francesca Balestrieri
In the popular online game GeoGuessr, the player is shown a random image from Google Street View and is tasked with guessing its location on the globe as accurately as possible. In this project, we seek to solve a simplified version of this problem using a strategy often used by professional GeoGuessr players: using man-made features (for example, traffic lights) to accurately guess a city.
We use the publicly available GSV-Cities Dataset, which consists of around 500k street-view images taken in 23 different cities. We then use a CNN trained on the images and on features extracted from the images to make our model. The backbone of this CNN is a pre-trained model named MobileNetV2.
MAY-SUMMER 2024
TEAM
NFL Combine Analysis
Data Science Boot Camp
Dennis Nguyen, Brett Lambert
The National Football League (NFL) is one of the largest professional sports organizations in the United States. Currently, there are 32 NFL teams, each attempting every year to maximize performance to win the Super Bowl. Because of this, the annual NFL draft is highly anticipated as it allows teams to select players eligible to leave college football in the hopes of adding talented, young individuals on team-friendly (cheap) contracts. However, there is a great deal of uncertainty in predicting professional performance.
We modeled some of the variation in prospect success as well as draft position by using eight statistics from the NFL combine.
We additionally explored the relationship between draft position and player performance and found the relative value of various player positions at different draft positions.
MAY-SUMMER 2024
TEAM
SpotPOP
Data Science Boot Camp
Melika Shahhosseini, Ali Asghari Adib
The main objective of this project is to develop a predictive classification model that can classify the popularity of Spotify tracks based on their audio features. By analyzing a dataset containing various attributes of Spotify tracks, we aim to do extensive data exploration to identify key factors that contribute to a track's popularity and create a reliable predictive system.
MAY-SUMMER 2024
TEAM
Predicting Missed Payments from Credit Card Clients
Data Science Boot Camp
Song Gao, Juergen Kritschgau
Credit card clients miss payments on their credit card debt for a variety of reasons. Being able to predict missed payments would allow banks, credit raters, and debt collectors to forecast their own operations, target interventions or financial products, and accurately appraise the value of credit card debt. In this project, we attempt to use a client’s payment history over a 6-month period and demographic information to predict whether the client will miss a credit card payment next month. We used data obtained from a Kaggle competition page to train different classification models, including logistic regressions, Bayesian models, Support Vector Classifiers, K-nearest neighbor models, and decision trees. We used cross-validation and classification accuracy to compare different classification models. Our primary finding is that no model is able to accurately predict whether or not a client will miss a payment.
MAY-SUMMER 2024
TEAM
Stock price modeling and forecasting
Data Science Boot Camp
SIU CHEUNG LAM, Suman Aich, Xiaoyu Wang, Nafis Fuad
We perform stock market analysis using data from multiple stocks (including index funds and companies). Our approach is based on both statistical modeling and LSTM neural networks. For statistical modeling we use autocorrelation plots to examine trends in data and root mean squared error (RMSE) as our key performance indicator. Using the LSTM neural network we design a regression model for forecasting and a classifier to predict whether to buy, hold or sell stocks on any given day. Finally, we explore the LSTM regression model’s ability to generalize to multiple stocks, as well as its usage for multi-day forecasting.
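RMSE is the square root of the mean squared difference between actual and forecast values, so it is expressed in the same units as the price. A minimal sketch follows; the prices and forecasts are made-up numbers for illustration, not results from the project.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up closing prices and forecasts.
prices = [100.0, 102.0, 101.0, 105.0]
forecast = [101.0, 100.0, 101.0, 103.0]
score = rmse(prices, forecast)
```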
MAY-SUMMER 2024
TEAM
Jimmy's and Joes vs X's and O's: Predicting results in college sports analyzing talent accumulation and on-field success
Data Science Boot Camp
Reginald Bain, Tung Nguyen, Reid Harris
Recent legislation has changed the landscape of college sports, a multi-billion dollar enterprise with deep roots in American sports culture. With the recent legalization of sports betting in many states and the SCOTUS O’Bannon ruling that allows athletes to be paid through so-called “Name-Image-Likeness (NIL)” deals, evaluating talent and projecting results in college sports is an increasingly interesting problem. By considering both talent accumulation and recent on-field results, our models aim to predict relevant results for sports betting/team construction. In this iteration of the project, our targets are regular season win percentage (using a season level model that we’ll call Model 1) and individual game results (with a game by game model we’ll call Model 2) in the regular season. Our datasets come from a variety of sources including On3, ESPN, 24/7 Sports, The College Football Database, and SportsReference.com.
MAY-SUMMER 2024
TEAM
Climate Predictions Using Machine Learning Approaches
Data Science Boot Camp
Maitituerdi Aihemaiti, Abuduaini Niyazi, Rexiati Dilimulati
In contrast to modern climate models, which predict that precipitation will increase as temperatures rise, the Horn of Africa has experienced severe and recurring droughts over the past few decades. The region's agriculture-based economies have suffered greatly as a result of these droughts. Therefore, the quality of long-term weather prediction has become fundamentally important. In this project, we use multiple past climate proxy records to build a machine learning model to determine whether we can predict the future climate of the Horn of Africa.
MAY-SUMMER 2024
TEAM
Applied Neural Networks
Data Science Boot Camp
Deven Gill, Ajay Aryan, Dionel Jaime
Stock Price Prediction:
Objective: To build a neural network model that predicts stock prices based on historical data.
Dataset: Historical stock price data including trading volume, company financials, and macroeconomic indicators for a specific company (AAP was used but any company could be used).
MAY-SUMMER 2024
TEAM
QED
Data Science Boot Camp
Cisil Karaguzel, Ming Zhang, Hatice Mutlu, Adnan Cihan Cakar, Matthew Gelvin
The state-of-the-art language models have achieved human-level performance on many tasks but still face significant challenges in multi-step mathematical reasoning. Recent advancements in large language models (LLMs) have demonstrated exceptional capabilities across diverse tasks, including common-sense reasoning, question answering, and summarization. However, they struggle with tasks requiring quantitative reasoning, such as solving complex mathematical problems. Mathematics serves as a valuable testbed in machine learning for problem-solving abilities, highlighting the need for more robust models capable of multi-step reasoning. The primary goal of this project is to develop a customized LLM that can provide step-by-step solutions to math problems by fine-tuning a base LLM using a large mathematical dataset.
MAY-SUMMER 2024
TEAM
Influential Actors in Communication Networks
Data Science Boot Camp
Adam Perhala, Jungbae An
An influential actor can spread information to others in a communication network, and thus change the attitudes of others by doing so. By identifying influential actors, we can track the flow of information about a policy or product and the resulting attitudinal changes, and utilize this influence to intervene in people's attitudes or to undermine abusive interventions. In this project, we show and test a framework for detecting influential actors in the standing committee hearings in the U.S. House of Representatives–one of the communication networks of policymakers utilizing policy-relevant expertise. The influential actors identified by our framework are consistent with the relevant literature. Our detection framework can be used to optimize decision-making that leverages communication networks such as disinformation and mobilizing attention.
MAY-SUMMER 2024
TEAM
MoonBoard Grade Classification
Data Science Boot Camp
Gautam Prakriya, Adrian Batista Planas, Larsen Linov, Prabhjot Singh
A MoonBoard is a standardized rock climbing wall: fixed holds on a wall of fixed dimensions. This climbing wall comes with an app that generates routes/problems for climbers to move up. Rock climbing routes are assigned subjective grades to represent difficulty, but because the MoonBoard is widely used, there is often a consensus around the grades assigned to MoonBoard problems, making them somewhat objective. The goal of the project will be to build a model that identifies the grade of a given route.
MAY-SUMMER 2024
TEAM
Topic recognition on NYT articles
Data Science Boot Camp
Ravi Tripathi, Touseef Haider, Ping Wan, Schinella D'Souza, Alessandro Malusà, Craig Franze
The project proposes to study metadata of New York Times articles to detect the most relevant topics and build a recommendation system based on topic similarity.
We plan to do the following:
1) Apply methods like Latent Dirichlet Allocation (LDA) and Bidirectional Encoder Representations from Transformers (BERT) to identify the most relevant topics from a corpus of about 42,000 articles published over the last year
2) Draw insightful visuals to highlight topic and word distribution as well as popular trends
3) Use Neural Networks to assign significant labels to topics
4) Create a recommender system based on topic similarity
MAY-SUMMER 2024
TEAM
BirdCLEF
Data Science Boot Camp
Junichi Koganemaru, Robert Jeffs, Ashwin Tarikere Ashok Kumar Nag, Salil Singh
This project addresses the BirdCLEF 2024 research code competition hosted on Kaggle by the Cornell Lab of Ornithology. Participants are provided with a dataset containing labeled audio clips of bird calls recorded at various locations in the world. The competition seeks reliable machine learning models for automatic detection and classification of bird species from soundscapes recorded in the Western Ghats of India, a Global Biodiversity Hotspot. The broader goal of this endeavor is to leverage Passive Acoustic Monitoring (PAM) and machine learning techniques to enable conservationists to study bird biodiversity at much greater scales than is possible with observer-based surveys.
MAY-SUMMER 2024
TEAM
Temporal Graphs for Music recommendation systems
Data Science Boot Camp
Abhinav Chand, Tristan Freiberg, Astrid Olave Herrera
Music streaming companies seek to enhance the user experience by offering personalized music recommendations. Moreover, users value personalization as a top feature on a streaming service. Music preferences can be represented as a dynamic graph of users interacting with music genres over time. Our goal is to predict the music preference of a user using classical graph algorithms, statistical inference and Temporal Graph Neural Networks. We will work with the Temporal Graph Benchmark for our study and, if possible, we will apply our models to other real-world networks.
MAY-SUMMER 2024
TEAM
Cancer Survivability
Data Science Boot Camp
Dilruba Sofia, Funmilola Mary Taiwo, Enayon Sunday Taiwo, Samuel Ogunfuye, Karla Paulette Flores Silva, Ray Lee
Unfortunately, each of us has a 1/4 chance of getting cancer. Although advances in treatment technologies have increased the survival rate of cancer patients, cancer still kills many people. Breast cancer is the second most diagnosed cancer and most fatal in women. The goal of this project is to develop models that can accurately classify breast cancer patient outcomes as either "alive" or "dead", based on demographic data and clinical data at the time of diagnosis.
Data: The Cancer Genome Atlas Breast Cancer (TCGA-BRCA) project through the National Cancer Institute - GDC Data Portal.
Method: We extract the patients' clinical information and engineer the features as necessary. Then we apply a few classification algorithms such as random forest, AdaBoost, SVC, logistic regression, K-nearest neighbor, and MLP, while keeping the decision tree algorithm as our base model, to predict patients' vital status.
MAY-SUMMER 2024
TEAM
Company Discourse: How are people talking about my company online?
Data Science Boot Camp
Hannah Lloyd, Vinicius Ambrosi, Gilyoung Cheong, Dohoon Kim
In the age of digital communication, a wealth of information exists in the discourse surrounding companies and their products on social media platforms and online forums. This project utilizes natural language processing (NLP) and machine learning (ML) techniques to construct predictive models capable of assessing and rating comments provided by consumers. By employing these advanced analytical methods, we aim to enhance the correctness and effectiveness of sentiment analysis in understanding and forecasting consumer behavior. This approach is computationally efficient, while maintaining contextual integrity in the data and leveraging complex analytical techniques to gauge audience sentiment through online discourse.
MAY-SUMMER 2024
TEAM
AI-powered solutions for the restaurant industry
Data Science Boot Camp
Evaristo Villaseco, Davood Dar
In the US alone, restaurants waste 25bn pounds of food every year before it reaches the consumer's plate, and independent restaurants are a large driver of this. This is crucial for an industry that operates with very low profit margins of 3% to 6% on average. In this project we have partnered with Burnt (https://burnt.squarespace.com), whose mission is to help restaurants automate their back-of-house operational flow: recipe management, inventory forecasting, analysis and optimization of costs. We will use time series data from restaurants to forecast menu item sales based on factors such as day of the week, weather, and holidays, which will help to optimize ordering decisions for maximum efficiency.
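As a rough illustration of forecasting from the day of the week alone, the sketch below predicts a menu item's sales as the historical mean for that weekday. It is not the team's model: the function names and the sales figures are hypothetical, and weather and holidays are ignored. A baseline of this kind is mainly useful as a reference point for richer models.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def weekday_baseline(history):
    """Return mean units sold per weekday (0 = Monday) from (date, units) pairs."""
    by_weekday = defaultdict(list)
    for day, units in history:
        by_weekday[day.weekday()].append(units)
    return {wd: mean(vals) for wd, vals in by_weekday.items()}


def forecast(history, target_day):
    """Forecast sales on target_day as the historical mean for that weekday."""
    averages = weekday_baseline(history)
    # Fall back to the overall mean if that weekday never appears in the history.
    overall = mean(units for _, units in history)
    return averages.get(target_day.weekday(), overall)


# Hypothetical data: two weeks of daily sales for one menu item.
start = date(2024, 5, 6)  # a Monday
units_sold = [20, 22, 25, 30, 45, 60, 50, 18, 24, 27, 28, 47, 64, 54]
history = [(start + timedelta(days=i), u) for i, u in enumerate(units_sold)]

next_friday = date(2024, 5, 24)
print(forecast(history, next_friday))
```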
MAY-SUMMER 2024
TEAM
Continuous Glucose Monitoring
Data Science Boot Camp
Daniel Visscher, Margaret Swerdloff, Noah Gillespie, S. C. Park, oladimeji olaluwoye
The idea of the project is to predict high glucose spikes from continuous glucose data, smartwatch data, food logs, and glycemic index. The dataset consists of the following:
1) Tri-axial accelerometer data (movement in subject)
2) Blood volume pulse
3) Interstitial glucose concentration
4) Electrodermal activity
5) Heart rate
6) IBI (interbeat interval)
7) Skin temperature
8) Food log
The data is publicly available at: https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/#files-panel
MAY-SUMMER 2024
TEAM
If You’re Single, You’re Probably a Democrat... (and other insights into US demographics and voting inclination)
Data Science Boot Camp
Fernando Liu Lopez, Arvind Suresh
Voting behaviors depend, to a significant degree, on news and events leading up to the election; these are often unpredictable and introduce variance that undermines the accuracy of election forecasts. Yet, it is common knowledge that certain demographic characteristics are strong predictors of voting tendencies (e.g., rural areas tend to vote Republican). In this project, we employ machine-learning methods to measure the predictive power of demographic characteristics (race, gender, education, socio-economic status, marital status) in determining voting outcomes, focusing in particular on the US popular presidential vote.
MAY-SUMMER 2024
TEAM
Chirp Checker
Data Science Boot Camp
Andrew Merwin, Caleb Fong, B Mede, Yang Yang, Robert Cass, Calvin Yost-Wolff
The nocturnal soundscapes of late summer and autumn are replete with the familiar chirps, trills, and buzzes of singing insects. But these cryptic performers often remain anonymous and underappreciated.
The goal of this project was to build machine learning models to identify the presence of insects in sound files and to coarsely categorize the sounds as crickets, katydids, or cicadas.
Both Support Vector Classifiers and Convolutional Neural Networks were able to classify insect songs into the broad categories of cricket, katydid, and cicada with 90% accuracy or higher.
In the future, similar, more sophisticated models could be applied to filtering large volumes of passively recorded audio from ecological studies of insects and could power apps that identify insect songs to the species level.
MAY-SUMMER 2024
TEAM
Educational outcomes for children as a function of healthcare access
Data Science Boot Camp
Nicholas Castillo, Glenn Young, Anthony Kling, Ayomikun Adeniran, Edward Varvak, samara chamoun
According to the CDC, around 5.8% of grade school students missed at least 15 days of school in 2022 due to health-related reasons. Chronic absenteeism results in students missing milestones in reading and math, and consequently falling behind their peers, possibly putting educational success out of reach. The goal of this project is to determine whether there is a relationship between children's ease of access to healthcare and their educational outcomes.
Using data from the National Survey of Children's Health, we identified 88 relevant features, later refined to 10 key predictors through selection methods. We looked at various models and ultimately chose logistic regression for its interpretability and performance.
MAY-SUMMER 2024
TEAM
Headlines and Market Trends: A Sentiment Analysis Approach to Stock Prediction
Data Science Boot Camp
Jem Guhit, Sarasi Jayasekara, Nawaz Sultani, Timothy Alland, Ogonnaya Romanus, Kenneth Anderson
Financial markets are often affected by sentiment conveyed in news headlines. As major news events can drive significant fluctuations in stock prices, understanding these sentiment trends can provide important insights into market movements. This project asks whether the sentiment extracted from financial news headlines can predict stock movements.
We use 5 years' worth of data extracted from Yahoo Finance and Stock News API, obtain sentiment scores using FinVader, and use four models (Logistic Regression, Gradient Boosted Trees, XGBoost, and LSTM) to predict whether the next day's stock price will rise or fall. We use a simulated stock portfolio to evaluate the effectiveness of the models.
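One simple way to evaluate rise/fall predictions with a simulated portfolio is sketched below. It assumes a long-only rule (hold the stock on days predicted to rise, hold cash otherwise) and ignores transaction costs. The rule, the function name, and the prices are illustrative assumptions, not the team's actual setup.

```python
def simulate_portfolio(prices, predictions, cash=1000.0):
    """Hold the stock on days predicted to rise; hold cash otherwise.

    prices[i] is the closing price on day i; predictions[i] is True when the
    model expects prices[i + 1] > prices[i]. Transaction costs are ignored.
    """
    value = cash
    for i, goes_up in enumerate(predictions):
        if goes_up:
            # Invested for this step, so the value tracks the day-over-day return.
            value *= prices[i + 1] / prices[i]
    return value


# Hypothetical closing prices and model predictions (one per day-to-day step).
prices = [100.0, 102.0, 101.0, 105.0, 103.0]
predictions = [True, False, True, True]

final = simulate_portfolio(prices, predictions)
buy_and_hold = 1000.0 * prices[-1] / prices[0]
print(f"model-driven: {final:.2f}, buy-and-hold: {buy_and_hold:.2f}")
```

Comparing the final value against buy-and-hold shows whether the predictions add anything beyond simply owning the stock.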
MAY-SUMMER 2024
TEAM
Climate-Based Forecasting of Dengue Epidemic Months: A Case Study of Bangladesh
Data Science Boot Camp
Haridas Kumar Das, Abdullah Al Helal
Dengue outbreaks have become a global concern, affecting many regions such as the Americas, Africa, the Middle East, Asia, and the Pacific Islands. Over the past two decades, there has been a notable rise in dengue cases worldwide, with significant impacts observed in countries like Brazil and Bangladesh. Moreover, in the United States, local dengue transmission has been reported in a few states, including Florida, Hawaii, Texas, Arizona, and California. Numerous studies have demonstrated the correlation between climate factors—such as temperature and rainfall—and dengue, Zika, chikungunya, and yellow fever transmission. Specifically, elevated temperatures have been linked to an increased dengue infection risk, while extreme rainfall events have been shown to decrease this risk. In this project, we develop machine learning algorithms to analyze climate and epidemiological data in order to forecast dengue epidemic months, focusing on the analysis of Bangladesh.
MAY-SUMMER 2024
TEAM
Imputing missing data from stock time series
Data Science Boot Camp
Khanh Nguyen, Yizhen Zhao, Evgeniya Lagoda, Himanshu Raj, Carlos Owusu-Ansah, Sergei Neznanov
Missing data is a typical problem in scientific research. For example, in clinical trials, wearable sensors might lose signal due to battery issues, and errors in measuring instruments often lead to gaps in time series. Naively dropping missing data can remove important information. In this project, we investigate imputation of missing financial time series data, in particular stock time series. We analyze a toy problem where we delete a few data points by hand and attempt to impute them through various methods. The goal is to see which methods and which market indicators work best for such a dataset. The completeness of stock data allows us to test how well a model predicts missing data. Analyzing imputation for such time series could therefore yield insight into correlations in international markets and into the relevant models and market predictors to use for the more practical problem of forecasting price movements.
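The toy experiment described above (hide known points, impute them, compare against the truth) can be sketched as follows. The two methods shown, forward fill and linear interpolation, are common baselines chosen for illustration; the source does not say which methods the team used, and the price series is made up.

```python
from math import sqrt


def forward_fill(series):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for x in series:
        if x is not None:
            last = x
        filled.append(last)
    return filled


def linear_interpolate(series):
    """Fill interior runs of None linearly between neighbouring observed values."""
    filled = list(series)
    known = [i for i, x in enumerate(series) if x is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def rmse(truth, imputed, hidden):
    """Root-mean-square error over the deliberately hidden positions only."""
    return sqrt(sum((truth[i] - imputed[i]) ** 2 for i in hidden) / len(hidden))


# Hypothetical complete price series; indices in `hidden` are deleted by hand.
truth = [10.0, 10.5, 11.0, 10.8, 11.4, 12.0, 11.7]
hidden = [2, 4, 5]
masked = [None if i in hidden else x for i, x in enumerate(truth)]

for name, method in [("forward fill", forward_fill), ("linear", linear_interpolate)]:
    print(name, round(rmse(truth, method(masked), hidden), 3))
```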
MAY-SUMMER 2024
TEAM
Doggy Doggy What Now?: Using Machine Learning to Predict Animal Shelter Intakes and Outcomes
Data Science Boot Camp
John Harden, Claire Merriman, Angela Kubena, Jun Lau, Robert Young
The Humane Society states that over 3 million dogs enter animal shelters around the United States each year, and around 2 million dogs are adopted each year. Shelters are understandably busy, noisy, and fast-moving places where many challenges present themselves. The ability to correctly anticipate how the coming days, weeks, and months will go would allow shelter managers to allocate resources more effectively. Our group sought to leverage machine learning tools and hundreds of thousands of observations from the last decade to predict animal shelter intakes, outcomes, and adoptions. We developed time series models that include macro-level features and can predict the number of intakes and outcomes per day, week, and month with over 90% accuracy. Additionally, we achieved over 70% accuracy exploring how random forest can be used to get a paw up on predicting adoption rates with shelter-level features.
MAY-SUMMER 2024
TEAM
Predicting mental health treatment decisions from social media
Data Science Boot Camp
Eunbin Kim, Alejandra Dashe, Emelie Curl, Mitch Hamidi, Gabriel Khan
Using classification and Natural Language Processing (NLP) to analyze web-scraped data from Reddit, can we (1) identify who is undergoing or interested in mental health treatment, and (2) predict preference for treatment?
Data Source: A Kaggle dataset of scraped Reddit posts and a team-created dataset of scraped comments from eight BPD-relevant subreddits using 89 keywords of interest.
Method: First, within the BPD community, can we classify treatment-relevant content based on the text data?
Second, can we identify predictors of BPD individuals' preference for specific treatment plans/outcomes (e.g., demographic information, comorbidity)? Classification, predictive modeling, NLP,