Project Database

View Team Project Submissions for various cohorts and programs below:

73 results were found.

MAY-SUMMER 2024

TEAM

AI-powered solutions for the restaurant industry

Deep Learning Boot Camp

Evaristo Villaseco, Davood Dar, Amir Kazemi-Moridani

In the US alone, restaurants waste 25 billion pounds of food every year before it reaches the consumer's plate, and independent restaurants are a large driver of this. In this project we have partnered with Burnt (https://burnt.squarespace.com), whose mission is to help restaurants automate their back-of-house operational flow: recipe management, inventory forecasting, and the analysis and optimization of costs. Restaurants use on average 5-10 different suppliers, each with its own way of describing its supplies. To tackle this, we will fine-tune a large language model (LLM) to process textual descriptions of the supplies and categorize them into predefined classes. We will train our model on collected, unlabelled real-world data provided by Burnt. This will not only help streamline inventory management and procurement, but will also inform the management of food costs, menu item gross product, and budget forecasts.

MAY-SUMMER 2024

TEAM

Forecast Direct Normal Irradiance of Solar Energy using Deep Learning Model

Deep Learning Boot Camp

Md Mehedi Hasan

The goal of this project is to forecast Direct Normal Irradiance (DNI) using a deep learning model, the Temporal Fusion Transformer. DNI accounts for a large portion of photovoltaic solar energy, so forecasting it accurately has become essential for the effective operation and maintenance of power systems.

MAY-SUMMER 2024

TEAM

Advancing Cardiac Diagnostics: Machine Learning Approaches for ECG-Based Heart Condition Analysis and Reconstruction

Deep Learning Boot Camp

Gbocho Masato Terasaki

The project aims to advance cardiac health diagnostics through two critical tasks: irregular heartbeat classification and activation map reconstruction using ECG signals. Utilizing the ECG Heartbeat Categorization Dataset, the project focuses on developing a robust multiclass classification model to accurately diagnose various irregular heartbeats, facilitating early detection and timely treatment of cardiac conditions. In parallel, the project employs a neural network to transform ECG sequences into activation maps, using simulated intracardiac transmembrane voltage recordings. This reconstruction task is designed to deepen our understanding of the heart’s electrical activity, offering detailed insights that can lead to more precise and effective clinical interventions.

MAY-SUMMER 2024

TEAM

RivusVox Editor

Deep Learning Boot Camp

Zachary Bezemek, Francesca Balestrieri

RivusVox Editor: the world's first near-live zero-shot adaptive speech editing system

MAY-SUMMER 2024

TEAM

What the Text?

Deep Learning Boot Camp

Robert Jeffs, Salil Singh, Ashwin Tarikere Ashok Kumar Nag, Junichi Koganemaru

With the explosion of LLMs and NLP methods, AI-generated text has become ubiquitous on the internet. This presents several challenges across many contexts, ranging from plagiarism in the academic setting to misinformation on social media and its consequences in electoral politics. With this in mind, we explore a range of classical statistical learning classifiers as well as deep learning based transformers to detect AI-generated text.

MAY-SUMMER 2024

TEAM

Lung Cancer detection with Histopathological Images

Deep Learning Boot Camp

Abuduaini Niyazi, Sujoy Upadhyay, Rexiati Dilimulati, Ronak Desai, Derek Kielty

One important application of deep neural networks in healthcare is the early detection of cancerous cells with the help of computer vision. In this project we will train convolutional neural networks on 15,000 histopathological images to distinguish normal lung tissue from cancerous tissue.

MAY-SUMMER 2024

TEAM

Trick Taker

Deep Learning Boot Camp

Shin Kim, Sixuan Lou, Juergen Kritschgau, Edward Varvak, Yizhen Zhao

In reinforcement learning problems, the agent learns how to maximize a numerical reward signal through direct interactions with the environment, without relying on a complete model of that environment. In fact, agents using model-free methods learn from raw experience, without any inferences about how the environment will behave. An important model-free method is the Q-Learning algorithm, which approximates the optimal action-value function. However, it can be impractical to estimate the optimal action-value function for every possible state-action pair. Deep Q-Learning uses a neural network trained with a variant of Q-Learning as a nonlinear function approximator of the optimal action-value function. Our objective is to use Deep Q-Learning to train an agent to make legal moves and/or win tricks while playing the card game Spades (without bidding).
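
As a rough illustration of the Q-Learning update described above, here is a minimal tabular sketch in Python. It is not the team's implementation: the state and action names, the learning rate `alpha`, and the discount factor `gamma` are all hypothetical, and a Deep Q-Learning agent would replace the lookup table with a neural network.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-Learning update and return the new estimate.

    q maps (state, action) pairs to estimated action values.
    alpha and gamma are illustrative defaults, not values from the project.
    """
    # Value of the best action available in the next state
    # (0.0 if the next state is terminal and has no actions).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    # Temporal-difference target: immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical example: the agent wins a trick (reward 1.0) by playing a high card.
q_table = defaultdict(float)
new_value = q_update(q_table, "s0", "play_high", 1.0, "s1",
                     ["play_high", "play_low"])
print(new_value)
```

Because the table starts at zero, the first update moves the estimate only a fraction `alpha` of the way toward the observed reward; repeated play is what lets the estimates converge.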

MAY-SUMMER 2024

TEAM

ISIC 2024 - Skin Cancer Detection with 3D-TBP

Deep Learning Boot Camp

Madelyn Esther Cruz, Maksim Kosmakov

The ISIC2024 Skin Cancer Detection project aims to develop machine learning algorithms that identify skin cancer from 3D total body photographs and patient metadata. The goal is to support early diagnosis of melanoma, basal cell carcinoma, and squamous cell carcinoma, improving patient outcomes.

Inspired by the ISIC2024 Kaggle competition, we used a dataset of skin lesion images and related clinical information from thousands of patients. We trained deep learning models, including ResNet50 and EfficientNetV2, to predict malignancy, with a focus on high sensitivity. The primary evaluation metric was the Partial Area Under the ROC Curve (pAUC) above an 80% True Positive Rate (TPR).

Our best model achieved a pAUC of 0.140, demonstrating the potential of AI in skin cancer detection. Future work will focus on refining the models, improving image preprocessing, and exploring new ensemble techniques.
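
For readers unfamiliar with the metric, the sketch below computes the area under the ROC curve that lies above a TPR floor of 0.8, so the score ranges from 0 to 0.2. It is an illustrative pure-Python version, not the competition's scoring code; the function names are hypothetical, and it assumes binary 0/1 labels with at least one positive and one negative example.

```python
def roc_points(labels, scores):
    """Return ROC curve points (FPR, TPR), sweeping the threshold downward."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        j = i
        # Examples with tied scores are admitted together at one threshold.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def pauc_above_tpr(labels, scores, min_tpr=0.8):
    """Area between the ROC curve and the horizontal line TPR = min_tpr."""
    pts = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if y1 <= min_tpr:
            continue  # segment lies entirely at or below the floor
        if y0 < min_tpr:
            # Segment crosses the floor: start integrating at the crossing.
            t = (min_tpr - y0) / (y1 - y0)
            x0 = x0 + t * (x1 - x0)
            y0 = min_tpr
        # Trapezoid rule on the part of the segment above the floor.
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2
    return area


# Toy data: a perfect ranking reaches the maximum (0.2, up to rounding).
print(round(pauc_above_tpr([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]), 6))
```

Restricting the area to the region above 80% TPR rewards models only for how they behave at high sensitivity, which matches the emphasis on not missing malignant lesions.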

MAY-SUMMER 2024

TEAM

A Vocal-Cue Interpreter for Minimally-Verbal Individuals

Deep Learning Boot Camp

Julian Rosen, Alessandro Malusà, Rahul Krishna, Atharva Patil, Monalisa Dutta, Sarasi Jayasekara

The ReCANVo dataset consists of ~7k audio recordings of vocalizations from 8 minimally-verbal individuals (mostly people with developmental disabilities). The recordings were made in a real-world setting, and were categorized on the spot by the speaker's caregiver based on context, non-verbal cues, and familiarity with the speaker. There are several pre-defined categories such as selftalk, frustrated, delighted, request, etc., and caregivers could also specify custom categories. Our goal was to train a model, per individual, that accurately predicts labels and improves upon previous work.

We train several different combinations of models of the form “Feature Extractor + Classifier”. For extracting features from audio data, we use two deep models (HuBERT and AST), each with pre-trained weights, as well as mel spectrograms. As classifiers, we use a 4-layer CNN-based neural network (for mel spectrograms), NNs with fully-connected layers (for features coming from deep models), and more.

MAY-SUMMER 2024

TEAM

“Good composers borrow, Great ones steal!”

Deep Learning Boot Camp

Emelie Curl, Tong Shan, Glenn Young, Larsen Linov, Reginald Bain

Throughout history, composers and musicians have borrowed musical elements like chord progressions, rhythms, lyrics, and melodies from each other. Our motivation for this project is born of a fascination with this phenomenon, which of course extends to less legal examples like unconsciously or intentionally copying the work of another. Even famed and highly regarded composers like Bach, Vivaldi, Mozart, and Haydn are not innocent of borrowing from their contemporaries or even recycling their own works. More recently, in a high-profile 2015 court case, artists Robin Thicke and Pharrell Williams were ordered to pay millions of dollars in damages to Marvin Gaye's estate for copyright infringement, on the grounds that they had borrowed from Gaye’s "Got to Give It Up" when writing their hit "Blurred Lines." Our project aimed to use deep learning to assess the similarity between musical clips, with a view to establishing a more robust and empirical way to detect music plagiarism.

MAY-SUMMER 2024

TEAM

Proof Truth

Deep Learning Boot Camp

Jared Able, Hongyi Shen, Dennis Nguyen, Evgeniya Lagoda, Zhihan Li

Can AI learn to do logic? We'll explore this question by teaching an AI to fill in gaps in mathematical proofs. Working with a database of over 40,000 proofs, we apply three different models to various prediction scenarios: a graph isomorphism network to predict a logical justification; an LSTM recurrent neural network to predict a logical statement; and an attention-based model to predict distance between statements within a proof.

MAY-SUMMER 2024

TEAM

Taxi Demand Forecasting

Deep Learning Boot Camp

Ngoc Nguyen, Li Meng, Sriram Raghunath, Nazanin Komeilizadeh, Noah Gillespie, Edward Ramirez

Knowing where to find customers is the most important question for taxi drivers and ride-hailing networks. If demand for taxis can be reliably predicted in real time, taxi companies can dispatch drivers in a timely manner and drivers can optimize their route decisions to maximize their earnings in a given day. Consequently, customers will likely receive more reliable service with shorter wait times. This project aims to use rich trip-level data from the NYC Taxi and Limousine Commission to construct time-series ride data for 63 taxi zones in Manhattan and forecast demand for rides. We will explore deep learning models for time series, including Multilayer Perceptrons, LSTMs, and Temporal Graph-based Neural Networks, and compare them with a baseline statistical model, ARIMAX.

MAY-SUMMER 2024

TEAM

Sarcasm Detection in Memes

Deep Learning Boot Camp

Yiyang Liu, Eunbin Kim

Sarcasm detection has long been an interesting and difficult task in NLP. Sarcasm in memes is special in that it is usually conveyed by the contrast between the caption and the image. For our project, we use a dataset of 7000 memes labeled as "sarcastic" or "not sarcastic". We took inspiration from OpenAI's CLIP model and implemented a multi-modal binary classifier for sarcasm detection. Our model achieved 74% accuracy with a 0.80 AUC-ROC score.

MAY-SUMMER 2024

TEAM

AntiBERTotics

Deep Learning Boot Camp

Scott Auerbach, Craig Corsi, Samuel Ogunfuye, Hatice Mutlu

The rise in bacterial pathogens that are resistant to current antibiotics, driven by antibiotic misuse, has the potential to escalate into a health catastrophe. The main idea is to use optimized large language models intended for small-molecule drugs, as well as those for parsing DNA and other genetic information, to construct a model based on structural correlations that can predict whether or not known pathogens are resistant to a given antibiotic. Genes coding for antimicrobial resistance are represented via letters corresponding to nucleotides in the DNA sequence (for example a, c, t, or g), while antibiotics are represented through SMILES (Simplified Molecular Input Line Entry System), a 2-d representation of the 3-d structure using an essentially extended Latin alphabet.

MAY-SUMMER 2024

TEAM

Detecting Engine Usage in Chess

Deep Learning Boot Camp

Calvin Pozderac, Philip Barron, Nathaniel Tamminga, Dushyanth Sirivolu

When then-world champion Magnus Carlsen lost to Niemann in a surprise upset, Carlsen accused his opponent of cheating. Were such accusations well-founded?
The project: train a neural network to distinguish between human chess and engine chess.
Data sets: for the human moves, we use in-person, high-level tournament play, where it is harder to cheat; for the engine moves, we use an online database of engine evaluations from Lichess.

MAY-SUMMER 2024

TEAM

Exploring Causality between News Sentiment and Stock Movement Prediction

Deep Learning Boot Camp

Jem Guhit, Nawaz Sultani, Saeid Hajizadeh, Samson Johnson

Motivation: Financial markets are often affected by sentiment conveyed in news headlines. Understanding the causal relationship between news sentiment and stock price movements can provide deeper insight into market dynamics.
Goal: Investigate the causal effects between news sentiment and stock price movements. This includes predicting stock movement trends based on news sentiment analysis and understanding how stock movement changes based on future news sentiment. This project aims to study these effects to improve stock movement predictions and optimize portfolio performance.

Project Proposal Structure:
- Improve on the sentiment analysis tool used by exploring transformer models to get more accurate sentiment scores
- Explore Bi-directional Models and CNN
- Refine Baseline Model (used ARIMA in the last project)
- Refine the simulation of the trading strategy used to calculate the average percentage of portfolio growth: did our models make a profit?


MAY-SUMMER 2024

TEAM

Music Subgenre Classification

Deep Learning Boot Camp

Anthony Kling, Ramachandra Rahul Taduri, Reid Harris

Music genres are essential for organizing and categorizing music, making it easier for listeners to discover, enjoy, and connect with styles that resonate with them. Genres also carry historical, cultural, and sonic significance. Playlists, which often focus on a single subgenre, have become an increasingly popular way to discover new music.

We address the multi-label classification problem of identifying a song's genre(s) using acoustic features extracted from audio files. Rather than focusing on broad genres (e.g., jazz, hip hop, electronic), we concentrate on four subgenres of electronic music: techno, house, trance, and drum and bass. While these subgenres are distinct and well-defined, they can be challenging to differentiate. We train a variety of supervised learning models, including XGBoost and neural networks, on data obtained from AcousticBrainz.

MAY-SUMMER 2024

TEAM

Deep Learning for Portfolio Optimization

Deep Learning Boot Camp

Arvind Suresh, Li Zhu, Jingheng Wang

This project develops advanced models for stock allocation to maximize returns using both short-term and long-term portfolio optimization strategies. For short-term optimization, we combine sentiment analysis via the BERT language model with the Black-Litterman model for dynamic 10-day portfolio adjustments. For long-term optimization, we utilize a Long Short-Term Memory (LSTM) network to predict stock performance and compare it against the Markowitz Mean-Variance Model and Genetic Algorithm. These approaches aim to create a versatile toolkit for optimizing portfolios under varying market conditions and investment horizons.

MAY-SUMMER 2024

TEAM

arXiv Chatbot

Deep Learning Boot Camp

Xiaoyu Wang, Ketan Sand, Guoqing Zhang, Tajudeen Mamadou Yacoubou, Tantrik Mukerji

arXiv is the largest open database available, containing nearly 2.4 million research papers that span 8 major domains and cover everything from the tiniest of atoms to the entire cosmos. A large language model (LLM) with access to such a dataset can generate up-to-date, relevant, and, more importantly, precise information with citable sources.

This is exactly what we have done in this project. We have refined the capabilities of Google’s Gemini 1.5 Pro LLM by building a customized Retrieval-Augmented Generation (RAG) pipeline that has access to the entire arXiv database. We then deployed the entire package as a chatbot-style app to make the experience user-friendly.

MAY-SUMMER 2024

TEAM

NLP stock prediction

Data Science Boot Camp

Jingheng Wang, Joseph Schmidt, Aoran Wu, Alborz Ranjbar

We design a trading bot based on machine learning applied to Twitter sentiment analysis. We compare sentiment models such as VADER, Naive Bayes, and BERT to see which performs best on tweet sentiment analysis. We then use these tweets, their sentiment, and the popularity of the user to assign a modified sentiment score. This modified score is one of several input features for several models that aim to determine the best action, buying or selling a stock, to maximize profit. In the end, using an LSTM model, we gain a 5% advantage over the baseline model.

MAY-SUMMER 2024

TEAM

Climate-Based Forecasting of Dengue Dynamics

Deep Learning Boot Camp

Abdullah Al Helal, Haridas Kumar Das, Chun-hao Chen, Feng Zhu

Dengue outbreaks have become a global concern, affecting many regions such as the Americas, Africa, the Middle East, Asia, and the Pacific Islands. Over the past two decades, there has been a notable rise in dengue cases worldwide, with significant impacts observed in countries like Brazil and Bangladesh. Moreover, in the United States, local dengue transmission has been reported in a few states, including Florida, Hawaii, Texas, Arizona, and California. Numerous studies have demonstrated the correlation between climate factors—such as temperature and rainfall—and dengue, Zika, chikungunya, and yellow fever transmission. Specifically, elevated temperatures have been linked to an increased dengue infection risk, while extreme rainfall events have been shown to decrease this risk. In this project, we deploy deep learning time series analysis to analyze climate and epidemiological data in order to forecast dengue dynamics.

MAY-SUMMER 2024

TEAM

Stanford Sentiment Treebank with 5 labels (SST-5)

Deep Learning Boot Camp

Gilyoung Cheong, Dohoon Kim, Vinicius Ambrosi

SST-5, or Stanford Sentiment Treebank with 5 labels, is a dataset used for sentiment analysis. It contains 11,855 individual sentences sourced from movie reviews, along with 215,154 unique phrases from parse trees. These phrases are annotated by three human judges and are categorized as negative, somewhat negative, neutral, somewhat positive, or positive. This fine-grained labeling is what gives the dataset its name, SST-5. According to the leaderboard, the highest accuracy on the test set is 59.8, but more interestingly, the fifth-ranked model, with an accuracy of 55.5, used only the BERT Large model with dropout. The purpose of our project is to see whether we can reach the top 5 of the leaderboard through fine-tuning and hyperparameter tuning (of the learning rate and the hyperparameters of the Adam optimizer).

MAY-SUMMER 2024

TEAM

A. K. W. Warren

Data Science Boot Camp

Ashley Wheeler

The 118th Congress has been described as a "do-nothing" congress, but they've managed to pass a few laws! When a bill is introduced, can we predict whether it will become a law?

MAY-SUMMER 2024

TEAM

Forecast Direct Normal Irradiance of Solar Energy

Data Science Boot Camp

Md Mehedi Hasan, Kamlesh Sarkar

This project aims to forecast Direct Normal Irradiance (DNI), a crucial quantity in solar energy, a day and a week ahead. The adoption of solar energy into the power grid has increased, and DNI is particularly important in forecasting the performance of concentrating solar power (CSP) systems. Photovoltaic panels track the sun to receive more DNI, and DNI accounts for a large portion of PV solar energy. It has therefore become essential to forecast DNI accurately for the effective operation and maintenance of power systems.

MAY-SUMMER 2024

TEAM

Studying Data from the Food Environment Atlas - - GROUP 2

Data Science Boot Camp

Cyril Enyi, Mercy Amankwah, Danielle Brager, Nicole Bruce, Monalisa Dutta

Is it possible to utilize the data from the Food Environment Atlas (https://www.ers.usda.gov/data-products/food-environment-atlas/) to examine the determinants of a community's access to affordable and healthy food?
Exploring the connections between specific factors and their impact on typical food habits in communities could yield fascinating insights.

MAY-SUMMER 2024

TEAM

NSPP: News-Based Stock Price Prediction

Data Science Boot Camp

Nasimeh Heydaribeni, Mahdi Soleymani

In this project, we investigate whether news headlines and abstracts are good predictors of stock prices. We intend to use open large language models to extract useful features from the news and then apply various regression methods to them.

MAY-SUMMER 2024

TEAM

Geo-locator

Data Science Boot Camp

Aashraya Jha, Dante Bonolis, Zachary Bezemek, Leonhard Hochfilzer, Francesca Balestrieri

In the popular online game Geoguessr, the player is shown a random image from Google Street View and is tasked with guessing its location on the globe as accurately as possible. In this project, we seek to solve a simplified version of this problem using a strategy often employed by professional Geoguessr players: using man-made features (for example, traffic lights) to accurately guess a city.

We use the publicly available GSV-Cities Dataset, which consists of around 500k street-view images taken in 23 different cities. We then build our model from a CNN trained on the images and on features extracted from them. The backbone of this CNN is a pre-trained model named MobileNetV2.

MAY-SUMMER 2024

TEAM

NFL Combine Analysis

Data Science Boot Camp

Dennis Nguyen, Brett Lambert

The National Football League (NFL) is one of the largest professional sports organizations in the United States. There are currently 32 NFL teams, each attempting every year to maximize performance and win the Super Bowl. Because of this, the annual NFL draft is highly anticipated, as it allows teams to select players eligible to leave college football in the hopes of adding talented, young individuals on team-friendly (cheap) contracts. However, there is a great deal of uncertainty in predicting professional performance.

We modeled some of the variation in prospect success as well as draft position by using eight statistics from the NFL combine.
We additionally explored the relationship between draft position and player performance and found the relative value of various player positions at different draft positions.

MAY-SUMMER 2024

TEAM

SpotPOP

Data Science Boot Camp

Melika Shahhosseini, Ali Asghari Adib

The main objective of this project is to develop a predictive classification model that can classify the popularity of Spotify tracks based on their audio features. By analyzing a dataset containing various attributes of Spotify tracks, we aim to do extensive data exploration to identify key factors that contribute to a track's popularity and create a reliable predictive system.

MAY-SUMMER 2024

TEAM

Predicting Missed Payments from Credit Card Clients

Data Science Boot Camp

Song Gao, Juergen Kritschgau

Credit card clients miss payments on their credit card debt for a variety of reasons. Being able to predict missed payments would allow banks, credit raters, and debt collectors to forecast their own operations, target interventions or financial products, and accurately appraise the value of credit card debt. In this project, we attempt to use a client’s payment history over a six-month period, together with demographic information, to predict whether the client will miss a credit card payment next month. We used data obtained from a Kaggle competition page to train different classification models, including logistic regressions, Bayesian models, Support Vector Classifiers, K-nearest neighbor models, and decision trees. We used cross-validation and classification accuracy to compare the models. Our primary finding is that no model is able to accurately predict whether or not a client will miss a payment.

MAY-SUMMER 2024

TEAM

Stock price modeling and forecasting

Data Science Boot Camp

SIU CHEUNG LAM, Suman Aich, Xiaoyu Wang, Nafis Fuad

We perform stock market analysis using data from multiple stocks (including index funds and companies). Our approach is based on both statistical modeling and LSTM neural networks. For statistical modeling, we use autocorrelation plots to examine trends in the data and root mean squared error (RMSE) as our key performance indicator. Using the LSTM neural network, we design a regression model for forecasting and a classifier to predict whether to buy, hold, or sell stocks on any given day. Finally, we explore the LSTM regression model’s ability to generalize to multiple stocks, as well as its use for multi-day forecasting.

MAY-SUMMER 2024

TEAM

Jimmy's and Joes vs X's and O's: Predicting results in college sports analyzing talent accumulation and on-field success

Data Science Boot Camp

Reginald Bain, Tung Nguyen, Reid Harris

Recent legislation has changed the landscape of college sports, a multi-billion dollar enterprise with deep roots in American sports culture. With the recent legalization of sports betting in many states and the SCOTUS O’Bannon ruling that allows athletes to be paid through so-called “Name-Image-Likeness (NIL)” deals, evaluating talent and projecting results in college sports is an increasingly interesting problem. By considering both talent accumulation and recent on-field results, our models aim to predict relevant results for sports betting/team construction. In this iteration of the project, our targets are regular season win percentage (using a season level model that we’ll call Model 1) and individual game results (with a game by game model we’ll call Model 2) in the regular season. Our datasets come from a variety of sources including On3, ESPN, 24/7 Sports, The College Football Database, and SportsReference.com.

MAY-SUMMER 2024

TEAM

Short-Term Volatility Prediction for stocks

Data Science Boot Camp

Li Zhu

This project aims to build an efficient model to predict short-term volatility for hundreds of stocks across different sectors. It is based on the Kaggle competition - Optiver Realized Volatility Prediction.

MAY-SUMMER 2024

TEAM

Climate Predictions Using Machine Learning Approaches

Data Science Boot Camp

Maitituerdi Aihemaiti, Abuduaini Niyazi, Rexiati Dilimulati

In contrast to modern climate models, which predict that precipitation will increase as temperatures rise, the Horn of Africa has experienced severe and recurring droughts over the past few decades. The region's agriculture-based economies have suffered greatly as a result of these droughts. Therefore, the quality of long-term weather prediction has become fundamentally important. In this project, we use multiple past climate proxy records to build a machine learning model to determine whether we can predict the future climate of the Horn of Africa.

MAY-SUMMER 2024

TEAM

Applied Neural Networks

Data Science Boot Camp

Deven Gill, Ajay Aryan, Dionel Jaime

Stock Price Prediction:
Objective: To build a neural network model that predicts stock prices based on historical data.
Dataset: Historical stock price data including trading volume, company financials, and macroeconomic indicators for a specific company (AAP was used but any company could be used).

MAY-SUMMER 2024

TEAM

QED

Data Science Boot Camp

Cisil Karaguzel, Ming Zhang, Hatice Mutlu, Adnan Cihan Cakar, Matthew Gelvin

The state-of-the-art language models have achieved human-level performance on many tasks but still face significant challenges in multi-step mathematical reasoning. Recent advancements in large language models (LLMs) have demonstrated exceptional capabilities across diverse tasks, including common-sense reasoning, question answering, and summarization. However, they struggle with tasks requiring quantitative reasoning, such as solving complex mathematical problems. Mathematics serves as a valuable testbed in machine learning for problem-solving abilities, highlighting the need for more robust models capable of multi-step reasoning. The primary goal of this project is to develop a customized LLM that can provide step-by-step solutions to math problems by fine-tuning a base LLM using a large mathematical dataset.

MAY-SUMMER 2024

TEAM

Influential Actors in Communication Networks

Data Science Boot Camp

Adam Perhala, Jungbae An

An influential actor can spread information to others in a communication network, and thus change the attitudes of others by doing so. By identifying influential actors, we can track the flow of information about a policy or product and the resulting attitudinal changes, and utilize this influence to intervene in people's attitudes or to undermine abusive interventions. In this project, we show and test a framework for detecting influential actors in the standing committee hearings in the U.S. House of Representatives–one of the communication networks of policymakers utilizing policy-relevant expertise. The influential actors identified by our framework are consistent with the relevant literature. Our detection framework can be used to optimize decision-making that leverages communication networks such as disinformation and mobilizing attention.

MAY-SUMMER 2024

TEAM

MoonBoard Grade Classification

Data Science Boot Camp

Gautam Prakriya, Adrian Batista Planas, Larsen Linov, Prabhjot Singh

A MoonBoard is a standardized rock climbing wall: fixed holds on a wall of fixed dimensions. The wall comes with an app that generates routes/problems for climbers to move up. Rock climbing routes are assigned subjective grades to represent difficulty, but because the MoonBoard is widely used, there is often a consensus around the grades assigned to MoonBoard problems, making them somewhat objective. The goal of the project is to build a model that identifies the grade of a given route.

MAY-SUMMER 2024

TEAM

Topic recognition on NYT articles

Data Science Boot Camp

Ravi Tripathi, Touseef Haider, Ping Wan, Schinella D'Souza, Alessandro Malusà, Craig Franze

The project proposes to study the metadata of New York Times articles to detect the most relevant topics and build a recommendation system based on topic similarity.

We plan to do the following:
1) Apply methods like Latent Dirichlet Allocation (LDA) and Bidirectional Encoder Representations from Transformers (BERT) to identify the most relevant topics from a corpus of about 42,000 articles published over the last year
2) Draw insightful visuals to highlight topic and word distribution as well as popular trends
3) Use Neural Networks to assign significant labels to topics
4) Create a recommender system based on topic similarity
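
Step 4 can be illustrated with a minimal sketch. It assumes each article has already been reduced to a vector of topic weights (for example, the per-document topic distribution from LDA) and ranks other articles by cosine similarity. The article IDs, the vectors, and the `recommend` helper are hypothetical and only illustrate the idea; they are not the team's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(article_id, topic_vectors, k=2):
    """Return the k articles whose topic vectors are most similar to the query."""
    query = topic_vectors[article_id]
    scored = [
        (cosine(query, vec), other)
        for other, vec in topic_vectors.items()
        if other != article_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

# Hypothetical topic distributions over three topics (assumed, not real data).
topic_vectors = {
    "a1": [0.8, 0.1, 0.1],
    "a2": [0.7, 0.2, 0.1],
    "a3": [0.1, 0.1, 0.8],
    "a4": [0.1, 0.8, 0.1],
}

print(recommend("a1", topic_vectors, k=1))
```

Here "a2" is returned for "a1" because both articles put most of their weight on the first topic.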

MAY-SUMMER 2024

TEAM

BirdCLEF

Data Science Boot Camp

Junichi Koganemaru, Robert Jeffs, Ashwin Tarikere Ashok Kumar Nag, Salil Singh

This project addresses the BirdCLEF 2024 research code competition hosted on Kaggle by the Cornell Lab of Ornithology. Participants are provided with a dataset containing labeled audio clips of bird calls recorded at various locations in the world. The competition seeks reliable machine learning models for automatic detection and classification of bird species from soundscapes recorded in the Western Ghats of India, a Global Biodiversity Hotspot. The broader goal of this endeavor is to leverage Passive Acoustic Monitoring (PAM) and machine learning techniques to enable conservationists to study bird biodiversity at much greater scales than is possible with observer-based surveys.

MAY-SUMMER 2024

TEAM

Temporal Graphs for Music recommendation systems

Data Science Boot Camp

Abhinav Chand,Tristan Freiberg,Astrid Olave Herrera

Music streaming companies seek to enhance the user experience by offering personalized music recommendations, and users value personalization as a top feature of a streaming service. Music preferences can be represented as a dynamic graph of users interacting with music genres over time. Our goal is to predict the music preferences of a user using classical graph algorithms, statistical inference, and Temporal Graph Neural Networks. We will work with the Temporal Graph Benchmark for our study and, if possible, apply our models to other real-world networks.

MAY-SUMMER 2024

TEAM

Cancer Survivability

Data Science Boot Camp

Dilruba Sofia, Funmilola Mary Taiwo, Enayon Sunday Taiwo, Samuel Ogunfuye, Karla Paulette Flores Silva, Ray Lee

Unfortunately, each of us has a 1/4 chance of getting cancer. Although advances in treatment technologies have increased the survival rate of cancer patients, cancer still kills many people. Breast cancer is the second most diagnosed cancer and the most fatal in women. The goal of this project is to develop models that can accurately classify breast cancer patient outcomes as either "alive" or "dead", based on demographic and clinical data at the time of diagnosis.
Data: The Cancer Genome Atlas Breast Cancer (TCGA-BRCA) project, accessed through the National Cancer Institute GDC Data Portal.
Method: We extract the patients' clinical information and engineer features as necessary. We then apply several classification algorithms (random forest, AdaBoost, SVC, logistic regression, K-nearest neighbors, and MLP), keeping the decision tree algorithm as our baseline model, to predict patients' vital status.

MAY-SUMMER 2024

TEAM

Company Discourse: How are people talking about my company online?

Data Science Boot Camp

Hannah Lloyd, Vinicius Ambrosi, Gilyoung Cheong, Dohoon Kim

In the age of digital communication, a wealth of information exists in the discourse surrounding companies and their products on social media platforms and online forums. This project utilizes natural language processing (NLP) and machine learning (ML) techniques to construct predictive models capable of assessing and rating comments provided by consumers. By employing these advanced analytical methods, we aim to enhance the correctness and effectiveness of sentiment analysis in understanding and forecasting consumer behavior. This approach is computationally efficient, while maintaining contextual integrity in the data and leveraging complex analytical techniques to gauge audience sentiment through online discourse.

MAY-SUMMER 2024

TEAM

AI-powered solutions for the restaurant industry

Data Science Boot Camp

Evaristo Villaseco, Davood Dar

In the US alone, restaurants waste 25bn pounds of food every year before it reaches the consumer's plate, and independent restaurants are a large driver of this. This is crucial for an industry that operates with very low profit margins of 3% to 6% on average. In this project we have partnered with Burnt (https://burnt.squarespace.com), whose mission is to help restaurants automate their back-of-house operational flow: recipe management, inventory forecasting, analysis and optimization of costs. We will use time series data from restaurants to forecast menu item sales based on factors such as day of the week, weather, and holidays, which will help to optimize ordering decisions for maximum efficiency.

MAY-SUMMER 2024

TEAM

Continuous Glucose Monitoring

Data Science Boot Camp

Daniel Visscher,Margaret Swerdloff,Noah Gillespie,S. C. Park,oladimeji olaluwoye

The idea of the project is to predict high glucose spikes from continuous glucose data, smartwatch data, food logs, and glycemic index. The dataset consists of the following:
1) Tri-axial accelerometer data (movement in subject)
2) Blood volume pulse
3) Interstitial glucose concentration
4) Electrodermal activity
5) Heart rate
6) IBI (interbeat interval)
7) Skin temperature
8) Food log
The data is publicly available at: https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/#files-panel

MAY-SUMMER 2024

TEAM

If You’re Single, You’re Probably a Democrat... (and other insights into US demographics and voting inclination)

Data Science Boot Camp

Fernando Liu Lopez,Arvind Suresh

Voting behaviors depend, to a significant degree, on news and events leading up to the election; these are often unpredictable and introduce variance that undermines the accuracy of election forecasts. Yet, it is common knowledge that certain demographic characteristics are strong predictors of voting tendencies (e.g., rural areas tend to vote Republican). In this project, we employ machine-learning methods to measure the predictive power of demographic characteristics (race, gender, education, socio-economic status, marital status) in determining voting outcomes, focusing in particular on the US popular presidential vote.

MAY-SUMMER 2024

TEAM

Chirp Checker

Data Science Boot Camp

Andrew Merwin, Caleb Fong, B Mede, Yang Yang, Robert Cass, Calvin Yost-Wolff

The nocturnal soundscapes of late summer and autumn are replete with the familiar chirps, trills, and buzzes of singing insects. But these cryptic performers often remain anonymous and underappreciated.

The goal of this project was to build machine learning models to identify the presence of insects in sound files and to coarsely categorize the sounds as crickets, katydids, or cicadas.

Both Support Vector Classifiers and Convolutional Neural Networks were able to assign insect songs to the broad categories of cricket, katydid, and cicada with 90% accuracy or higher.

In the future, similar, more sophisticated models could be applied to filtering large volumes of passively recorded audio from ecological studies of insects and could power apps that identify insect songs to the species level.

MAY-SUMMER 2024

TEAM

Educational outcomes for children as a function of healthcare access

Data Science Boot Camp

Nicholas Castillo, Glenn Young, Anthony Kling, Ayomikun Adeniran, Edward Varvak, samara chamoun

According to the CDC, around 5.8% of grade school students missed at least 15 days of school in 2022 for health-related reasons. Chronic absenteeism results in students missing milestones in reading and math and consequently falling behind their peers, possibly putting educational success out of reach. The goal of this project is to determine whether there is a relationship between children's ease of access to healthcare and their educational outcomes.

Using data from the National Survey of Children's Health, we identified 88 relevant features, later refined to 10 key predictors through selection methods. We looked at various models and ultimately chose logistic regression for its interpretability and performance.

MAY-SUMMER 2024

TEAM

Headlines and Market Trends: A Sentiment Analysis Approach to Stock Prediction

Data Science Boot Camp

Jem Guhit, Sarasi Jayasekara, Nawaz Sultani, Timothy Alland, Ogonnaya Romanus, Kenneth Anderson

Financial markets are often affected by sentiment conveyed in news headlines. As major news events can drive significant fluctuations in stock prices, understanding these sentiment trends can provide important insights into market movements. This project aims to answer the question whether the sentiments extracted from financial news headlines can predict stock movements.

We use 5 years' worth of data extracted from Yahoo Finance and Stock News API, obtain sentiment scores using FinVader, and train four models (Logistic Regression, Gradient Boosted Trees, XGBoost, and LSTM) to predict whether the next day's stock price will rise or fall. We use a simulated stock portfolio to evaluate the effectiveness of the models.
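
As a rough illustration of the portfolio-based evaluation, the sketch below assumes a very simple trading rule: hold the stock over the next day only when the model predicts a rise, and otherwise stay in cash. The rule, the `simulate` helper, and the toy prices are assumptions made for illustration; the project's actual simulation may differ.

```python
def simulate(prices, predictions, cash=1000.0):
    """Grow `cash` by following daily up/down predictions.

    prices[t] is the closing price on day t; predictions[t] is True when the
    model predicts a rise from day t to day t + 1 (assumed convention).
    """
    value = cash
    for t in range(len(prices) - 1):
        if predictions[t]:
            # Invested over day t -> t + 1, so the value follows the price ratio.
            value *= prices[t + 1] / prices[t]
        # Otherwise stay in cash: the value is unchanged.
    return value

# Toy closing prices and hypothetical model outputs (not real market data).
prices = [100.0, 110.0, 99.0, 108.9]
predictions = [True, False, True]

strategy_value = simulate(prices, predictions)
buy_and_hold_value = simulate(prices, [True] * (len(prices) - 1))
print(round(strategy_value, 2), round(buy_and_hold_value, 2))
```

Comparing the strategy's final value against buy-and-hold gives a baseline that classification accuracy alone does not provide.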

MAY-SUMMER 2024

TEAM

Climate-Based Forecasting of Dengue Epidemic Months: A Case Study of Bangladesh

Data Science Boot Camp

Haridas Kumar Das, Abdullah Al Helal

Dengue outbreaks have become a global concern, affecting many regions such as the Americas, Africa, the Middle East, Asia, and the Pacific Islands. Over the past two decades, there has been a notable rise in dengue cases worldwide, with significant impacts observed in countries like Brazil and Bangladesh. Moreover, in the United States, local dengue transmission has been reported in a few states, including Florida, Hawaii, Texas, Arizona, and California. Numerous studies have demonstrated the correlation between climate factors—such as temperature and rainfall—and dengue, Zika, chikungunya, and yellow fever transmission. Specifically, elevated temperatures have been linked to an increased dengue infection risk, while extreme rainfall events have been shown to decrease this risk. In this project, we develop machine learning algorithms to analyze climate and epidemiological data in order to forecast dengue epidemic months, focusing on the analysis of Bangladesh.

MAY-SUMMER 2024

TEAM

Imputing missing data from stock time series

Data Science Boot Camp

Khanh Nguyen, Yizhen Zhao, Evgeniya Lagoda, Himanshu Raj, Carlos Owusu-Ansah, Sergei Neznanov

Missing data is a common problem in scientific research. For example, in clinical trials, wearable sensors might lose signal when the battery runs out, and errors in measuring instruments often lead to gaps in a time series. Naively dropping missing data can remove important information. In this project, we investigate the imputation of missing financial time series data, in particular stock time series. We analyze a toy problem in which we delete a few data points by hand and attempt to impute them with various methods. The goal is to see which methods and which market indicators work best for such a dataset. The completeness of stock data allows us to test how well a model predicts missing data. Analyzing imputation for such time series could therefore yield insight into correlations in international markets and into the relevant models and market predictors to use for the more practical problem of forecasting price movements.
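
The toy experiment described above can be sketched as follows: take a complete series, hide a few points, impute them, and score the imputation against the known true values. The two methods shown (linear interpolation and forward fill) and the sample prices are illustrative assumptions; the project does not state which methods it compared.

```python
def linear_interpolate(series):
    """Fill None gaps by interpolating between the nearest known neighbours."""
    known = [i for i, v in enumerate(series) if v is not None]
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # gap at the start: copy the first known value
            filled[i] = series[right]
        elif right is None:     # gap at the end: copy the last known value
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled

def forward_fill(series):
    """Fill None gaps with the last observed value (first point must be known)."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def mean_abs_error(truth, imputed, masked):
    """Average absolute error over the positions that were hidden."""
    return sum(abs(truth[i] - imputed[i]) for i in masked) / len(masked)

# Complete toy series; the points at indices 2 and 4 are hidden by hand.
prices = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0]
masked = [2, 4]
observed = [None if i in masked else p for i, p in enumerate(prices)]

mae_linear = mean_abs_error(prices, linear_interpolate(observed), masked)
mae_ffill = mean_abs_error(prices, forward_fill(observed), masked)
print(mae_linear, mae_ffill)
```

Because the true values are known, any imputation method can be swapped in and compared using the same error measure.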

MAY-SUMMER 2024

TEAM

Doggy Doggy What Now?: Using Machine Learning to Predict Animal Shelter Intakes and Outcomes

Data Science Boot Camp

John Harden, Claire Merriman, Angela Kubena, Jun Lau, Robert Young

The Humane Society states that over 3 million dogs enter animal shelters around the United States each year, and around 2 million dogs are adopted each year. Shelters are understandably busy, noisy, and fast-moving places where many challenges present themselves. The ability to correctly anticipate how the coming days, weeks, and months will go would allow shelter managers to allocate resources more effectively. Our group sought to leverage machine learning tools and hundreds of thousands of observations from the last decade to predict animal shelter intakes, outcomes, and adoptions. We developed time series models that include macro-level features and can predict the number of intakes and outcomes per day, week, and month with over 90% accuracy. Additionally, we achieved over 70% accuracy exploring how random forest can be used to get a paw up on predicting adoption rates with shelter-level features.

MAY-SUMMER 2024

TEAM

Predicting mental health treatment decisions from social media

Data Science Boot Camp

Eunbin Kim, Alejandra Dashe, Emelie Curl, Mitch Hamidi, Gabriel Khan

Using classification and Natural Language Processing (NLP) to analyze web-scraped data from Reddit, can we (1) identify who is undergoing or interested in mental health treatment, and (2) predict preference for treatment?
Data Source: A Kaggle dataset of scraped Reddit posts and a team-created dataset of comments scraped from eight BPD-relevant subreddits using 89 keywords of interest.
Method: First, within the BPD community, can we classify treatment-relevant content based on the text data?
Second, can we identify predictors of BPD individuals' preference for specific treatment plans/outcomes (e.g., demographic information, comorbidity)? Techniques: classification, predictive modeling, NLP.
