View all Team Project Submissions for the May 2022 Data Science Boot Camp below:
Jack Carlisle, Mohammed Karaki, Cristian Rodriguez
We use tools in NLP, specifically sentiment analysis, to understand the 2019 Australian election. We train five models that predict sentiment on a Twitter dataset unrelated to the election. Using the highest-quality model, we label Australian Election tweets as positive or negative. Using these sentiment predictions, we successfully predict the political affiliation of Twitter users based on how their sentiment shifts over time as the election results become available.
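As an illustrative sketch only (the team's five models and dataset are not shown here), one of the simplest ways to label tweets positive or negative is a bag-of-words Naive Bayes classifier; the toy training data below stands in for a real labeled corpus:

```python
import math
from collections import Counter

def train_nb(docs):
    """Train a multinomial Naive Bayes sentiment model.

    docs: list of (text, label) pairs with label in {"pos", "neg"}.
    Returns per-class word counts, totals, priors, and the vocabulary.
    """
    counts = {"pos": Counter(), "neg": Counter()}
    class_docs = Counter()
    for text, label in docs:
        class_docs[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts["pos"]) | set(counts["neg"])
    priors = {c: class_docs[c] / len(docs) for c in counts}
    totals = {c: sum(counts[c].values()) for c in counts}
    return counts, totals, priors, vocab

def predict_nb(model, text):
    counts, totals, priors, vocab = model
    scores = {}
    for c in counts:
        # Log prior plus Laplace-smoothed log likelihood of each word.
        score = math.log(priors[c])
        for w in text.lower().split():
            score += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

toy = [
    ("great result love it", "pos"),
    ("happy with the outcome", "pos"),
    ("terrible result hate it", "neg"),
    ("angry about the outcome", "neg"),
]
model = train_nb(toy)
print(predict_nb(model, "love the great outcome"))  # prints "pos"
```

A real pipeline would swap the toy corpus for the unrelated Twitter dataset the abstract mentions and compare several such models before picking one.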
In this project, we attempt to rank the cells in Jupyter notebooks submitted to the Kaggle site, as part of the Google AI4Code Challenge. The problem is first transformed into one of classification. We then use natural language processing methods, along with code-parsing tricks, to design a machine learning model for the problem. As computers understand our code better, they will become better at helping us be more productive programmers.
Deepti Jain, Amartya Singh, Jay Vora
We built a classifier that identifies letters of the American Sign Language alphabet. Computer vision has made huge strides in the past few years, yet we still don't have a reliable real-time sign language translator. In this project, we identify letters of the American Sign Language alphabet from still images.
We also have an accompanying web app hosted on the Google Cloud Platform which can take in inputs from a user's webcam and give live predictions for the sign the user is showing.
Aidan Zabalo, Gleb Zhelezov, Martin Molina, Jaychandran Padayasi
The United States is just a few months away from the crucial 2022 mid-term election. Campaign strategists know the news cycle and media consumption strongly influence election results and are interested in learning:
- What are the important issues in the media around election time?
- When is a good time to start playing up these issues?
- Which news outlets could give maximum exposure to your ideas?
Knowing the answers to these questions would help shape public opinion, election results, and the country's direction.
To answer these questions, we found the dominant political stories of 2018, quantified when they first began to gain prominence, and measured which publications are most likely to spend a disproportionate amount of time covering a single topic.
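The "when did a story gain prominence" question can be sketched as a share-of-coverage threshold. This is illustrative only; the team's actual quantification method is not detailed in the abstract, and the 10% threshold is an assumption:

```python
def first_prominent_week(topic_counts, total_counts, threshold=0.10):
    """Index of the first week a topic's share of coverage crosses threshold.

    topic_counts: articles mentioning the topic, per week.
    total_counts: total articles published, per week.
    Returns None if the topic never crosses the threshold.
    """
    for week, (topic, total) in enumerate(zip(topic_counts, total_counts)):
        if total and topic / total >= threshold:
            return week
    return None

# Toy series: a story that breaks out in week 2.
mentions = [2, 5, 12, 30]
totals = [100, 100, 100, 100]
print(first_prominent_week(mentions, totals))  # prints 2
```

The same share-of-coverage series, computed per outlet, would also surface which publications concentrate on single topics.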
Jason Lee, Jimin Kim, Arpan Pal, Zhi Jiang, Yumeng Li
Cuisines vary significantly across countries. One of the key characteristics that distinguish different types of cuisines is the ingredients used in each dish. The goal of the project is to identify cuisine origin using only the ingredients. Accurately associating ingredients with cuisines would be valuable both for understanding the characteristics of each type of cuisine and for building an algorithm that can recommend recipes using different combinations of raw ingredients.
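A minimal sketch of ingredient-based cuisine identification (not the team's model; the cuisines and ingredients below are toy stand-ins) is to profile how often each cuisine's recipes use each ingredient, then score a query against those profiles:

```python
from collections import Counter, defaultdict

def ingredient_profiles(recipes):
    """Per cuisine, the fraction of its recipes containing each ingredient.

    recipes: list of (cuisine, set_of_ingredients) pairs.
    """
    recipe_counts = Counter()
    ingredient_counts = defaultdict(Counter)
    for cuisine, ingredients in recipes:
        recipe_counts[cuisine] += 1
        for ing in ingredients:
            ingredient_counts[cuisine][ing] += 1
    return {
        c: {ing: n / recipe_counts[c] for ing, n in ingredient_counts[c].items()}
        for c in recipe_counts
    }

def guess_cuisine(profiles, ingredients):
    """Score each cuisine by summed ingredient frequencies; return the best."""
    scores = {
        c: sum(profile.get(ing, 0.0) for ing in ingredients)
        for c, profile in profiles.items()
    }
    return max(scores, key=scores.get)

toy = [
    ("italian", {"tomato", "basil", "olive oil", "parmesan"}),
    ("italian", {"tomato", "olive oil", "garlic"}),
    ("japanese", {"soy sauce", "mirin", "rice"}),
    ("japanese", {"soy sauce", "rice", "nori"}),
]
profiles = ingredient_profiles(toy)
print(guess_cuisine(profiles, {"tomato", "basil"}))  # prints "italian"
```

Running the profiles in reverse — which ingredient combinations score highly for a cuisine — points toward the recipe-recommendation use the abstract mentions.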
Cindy Zhang, Moyi Tian, Songhao Zhu, Yulin Guo
Chocolate's popularity around the world never wanes, and more and more fancy and novel chocolates are coming to market. We want to find out what factors affect the rating of a given chocolate and provide chocolate manufacturers with this information to help them determine what type of chocolate to produce in the future; we also want to predict chocolate ratings using these features and help consumers pick their favorite chocolates.
The Hopf Bundle
Halley Fritze, Jay Hathaway, Max Vargas
Our problem is to accurately deduplicate commercial location data for points of interest. In other words, given two location observations, we must determine whether they describe the same point of interest. Our data contains many features, such as name and address, which required vectorization and text processing to train our models. Moreover, the data had errors and many missing values, which made careful data cleaning and smart implementation of text features especially valuable. We performed a broad analysis, training many different models to see which would give us the highest accuracy for identifying observations that match the same point of interest. Further directions for model improvement are also mentioned.
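A baseline for this kind of matching (a sketch under assumptions, not the team's models) is token-set Jaccard similarity over the normalized name and address, with a tuned decision threshold; the 0.5 threshold and records below are illustrative:

```python
def tokens(record):
    """Lowercase name + address tokens, with commas and apostrophes stripped."""
    name, address = record
    text = (name + " " + address).lower().replace(",", " ").replace("'", "")
    return set(text.split())

def jaccard(a, b):
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def same_poi(rec1, rec2, threshold=0.5):
    """Flag two records as duplicates when token overlap is high enough.

    In practice the threshold would be tuned on labeled match/non-match pairs.
    """
    return jaccard(tokens(rec1), tokens(rec2)) >= threshold

a = ("Joe's Coffee", "12 Main St, Springfield")
b = ("Joes Coffee Shop", "12 Main Street, Springfield")
c = ("City Library", "400 Oak Ave, Springfield")
print(same_poi(a, b), same_poi(a, c))  # prints "True False"
```

Learned models improve on this baseline chiefly by weighting tokens (a shared street number matters more than a shared city name) and by handling missing fields.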
Adam Kawash, Moeka Ono, Soumen Deb, Allison Londerée
The DaVinci Team of the Erdős Institute utilized advances in computer vision to train a machine learning model that classifies species of birds. We then applied this model in a prototype app, ChickID. In doing so, our project addresses two primary goals:
1) Generate an algorithm that could take images of birds to identify the species.
2) Ensure the model remains highly accurate even on amateur-level images, keeping identification accessible.
Our product can be applied for both private and public settings to allow for fast and accurate identification.
Rose Weisshaar, Elif Poyraz, Mario Gomez Flores, Mohammad Nooranidoost, Tajudeen Mamadou
Our project used machine learning to classify speech data by emotion. We use the CREMA-D dataset, which contains sentences spoken in different emotional tones by actors, and we explore how well we can classify emotion using features extracted from the audio data. One of our classifiers was 5% more accurate than human raters of the same audio clips. The results of our project can help autistic children with emotional processing deficits perceive different emotional categories.
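Audio-emotion pipelines typically start from frame-level features (in practice often MFCCs via a library like librosa, which the abstract does not specify). As a self-contained, stdlib-only sketch, two of the simplest such features are RMS energy and zero-crossing rate, computed here on a synthesized tone rather than a CREMA-D clip:

```python
import math

def rms_energy(frame):
    """Root-mean-square amplitude of one audio frame (loudness proxy)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (noisiness/pitch proxy)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Synthesize one frame: a 440 Hz tone at a 16 kHz sample rate.
sr, f0, n = 16000, 440.0, 512
frame = [0.3 * math.sin(2 * math.pi * f0 * t / sr) for t in range(n)]

print(round(rms_energy(frame), 3), round(zero_crossing_rate(frame), 3))
```

Stacking such per-frame features over a whole utterance gives the fixed-length vectors a classifier can be trained on.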
Xiang Ren, Xiaoran Hao, Jun Li, Zahra Adahman
COVID-19 adverse health outcomes such as mortality rate are associated with multiple Demographic, Environmental and Socioeconomic (DES) factors. Precise estimation of association patterns can help improve the understanding of social and environmental justice issues related to the impact of the pandemic. In this project, we extracted a subset of the COVID-19 socioexposomic data and developed Interpretable Machine Learning (IML) methods to identify important nonlinear health effects and interactions at local (municipality) scale across New Jersey. Our results show that IML can be an effective supplement to traditional statistical and geospatial models for uncovering underlying complex patterns even for small sample sets.
Cassidy Madison, Ethan Semrad, Ching-Lung Hsu
Coffee is consumed daily by 30-40% of the world's population, and produced in over 70 countries worldwide. Though coffee drinkers have their own individual preferences, we wanted to see if we could find relationships between how a coffee rates in taste tests and its features, such as country or region of origin, roast, or type of preparation method.
Amin Idelhaj, Hannah Alpert, Soumya Sankar
In this project, we study methods for predicting credit default risk from past credit history and minimal demographic data. We analyze the following data set: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. We compare the area ratios of several models and make recommendations for models and model parameters with the best predictive power.
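Assuming "area ratio" here refers to the cumulative-accuracy-profile (CAP / lift-chart) area ratio used with this UCI dataset, it can be computed as a sketch like the following, where 1.0 is a perfect ranking of defaulters and about 0.0 is random:

```python
def area_ratio(scores, labels):
    """CAP-curve area ratio of a scoring model.

    Sort cases by model score (descending), track the cumulative fraction
    of true positives captured, and compare the area under that curve with
    the areas under the random and perfect rankings.
    """
    n, pos = len(labels), sum(labels)
    order = sorted(range(n), key=lambda i: -scores[i])
    captured, area = 0, 0.0
    for i in order:
        captured += labels[i]
        area += captured / pos
    model_area = area / n
    random_area = (n + 1) / (2 * n)            # diagonal, same discretization
    perfect_area = ((pos + 1) / 2 + (n - pos)) / n
    return (model_area - random_area) / (perfect_area - random_area)

# Toy check: a model that ranks both defaulters first is perfect.
print(round(area_ratio([0.9, 0.8, 0.2, 0.1], [True, True, False, False]), 2))
```

The scores would come from each fitted model's predicted default probabilities, making the ratio directly comparable across models.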
Asia Wyatt, Kriti Sehgal, Aziz Burak Guelen
Clinical trials are a multi-year, often decade-long process involving rigorous research and testing, from the petri dish to double-blind human trials. Bringing a single intervention to market can cost anywhere from hundreds of millions of dollars to well over a billion. Once a drug reaches the clinical trial stage, many factors are taken into consideration in structuring the trial: the age and number of participants, levels of masking, the location of the trial, and so on.
While the goal is always to reach completion, there are myriad reasons why a clinical trial may be terminated or suspended, ranging from adverse side effects and poor safety to funding setbacks and lack of efficacy. The goal of our project is to determine which of these design factors matter most for the completion of cancer intervention trials, and to predict whether a held-out set of trials will be completed.
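A first exploratory step for "which design factors matter" is simply comparing completion rates across values of one factor. The sketch below is illustrative; the field names and toy records are assumptions, not the project's actual schema:

```python
from collections import defaultdict

def completion_rate_by(trials, field):
    """Completion rate grouped by one design factor (e.g. masking level)."""
    totals = defaultdict(lambda: [0, 0])  # field value -> [completed, total]
    for trial in trials:
        bucket = totals[trial[field]]
        bucket[0] += trial["completed"]
        bucket[1] += 1
    return {value: done / total for value, (done, total) in totals.items()}

toy = [
    {"masking": "double", "completed": True},
    {"masking": "double", "completed": True},
    {"masking": "none", "completed": True},
    {"masking": "none", "completed": False},
]
print(completion_rate_by(toy, "masking"))  # prints {'double': 1.0, 'none': 0.5}
```

Factors whose buckets show large rate gaps are natural candidates for the predictive model's feature set.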
Karan Srivastava, Benjamin Sheller, Anya Michaelsen, Alejandra Castillo
Our project seeks to aid our food industry stakeholders, including online recipe repositories and platforms seeking to produce food or restaurant recommendations, in understanding the relationship between the ingredients of a recipe and the type of cuisine to which it corresponds, as well as the relationships between different cuisines. To study this, our team used both classification and clustering models. Our best classification model used logistic regression to achieve a predictive accuracy of 77.9%, with linear SVC a close second at 76.8%. The logistic model also gives insight into the top ingredients for each cuisine. Clustering algorithms, k-means and hierarchical, showed natural groupings of cuisines, as well as insights into which cuisines were most similar within those groups.
Hakan Doga, Siying Li, Erika Ordog, Akarsh Mohan Konaje
Brain tumor diagnosis requires manual examination of MRI images by a radiologist. This process can be error-prone due to the complexity of brain structure, and time-consuming in developing regions with limited access to medical experts. In this project, we applied transfer learning to build a convolutional neural network based on the MobileNetV2 model. This model can accurately and efficiently classify common brain tumor MRI images. Furthermore, it is portable to mobile devices, which could help healthcare providers automate the diagnostic process and patients receive faster treatment.
Reza Averly, Adnan Mahmood, Nikhil Ajgaonkar, Aniket Joshi, Lisa Berger
We analyze written transcriptions of presidential and vice presidential debates from 1992 to 2020. Our aim is to determine parlance specific to the U.S. Democratic and Republican parties. Using a machine learning classification model, we classify words and phrases according to party affinity based on the predicted probability. Given an out-of-sample set of words and phrases, our algorithm can classify whether the words or phrases are favored by the Democratic or the Republican Party. Our algorithm shows an accuracy of 70% and above for all election terms since 1992. To address the class imbalance in the dataset, we optimize our algorithm to derive custom thresholds by maximizing the F1 score. This algorithm can be used by politicians constructing and promoting campaign platforms, as well as by independent lobbyists targeting proposals to either party.
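The custom-threshold idea the abstract mentions — picking the decision cutoff that maximizes F1 rather than defaulting to 0.5 — can be sketched as a sweep over observed scores (toy scores and labels below, not the debate data):

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the positive class when score >= threshold predicts positive."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try each observed score as a cutoff and keep the one maximizing F1."""
    return max(sorted(set(scores)),
               key=lambda t: f1_at_threshold(scores, labels, t))

# Toy predicted probabilities and true labels for an imbalanced set.
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, True, False, True, False, False, False]
t = best_threshold(scores, labels)
print(t, round(f1_at_threshold(scores, labels, t), 3))  # prints "0.4 0.889"
```

On imbalanced data this routinely moves the cutoff away from 0.5, which is exactly why thresholds are tuned per class rather than assumed.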
Shreeya Behera, Karthik Prabhu, Adam Broussard
To respond to the rapid increase in demand for radiologists in the United States, we perform an exploratory analysis using machine-learning algorithms to identify the presence of pneumonia in chest x-rays. We train a k-Nearest Neighbor (KNN), Convolutional Neural Network (CNN), and ResNet-152-based transfer learning model with the goal of improving healthcare KPIs such as the average treatment charge and patient wait time, while reducing mistakes in treatment by accurately differentiating nominal chest x-rays from those that exhibit pneumonia. We identify the F1 Score as the best metric for this use case, as it balances recall (the fraction of pneumonia cases we correctly identify) against precision (the fraction of cases identified as pneumonia that are pneumonia). We find that our CNN model performs best, with a validation set F1 Score of 0.964 and a test set score of 0.966. This model has potential for generalized high-impact applications in biomedical imaging and diagnosis.
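The reason F1 beats raw accuracy here is easy to demonstrate: on an imbalanced toy set, a model that never flags pneumonia still looks accurate. A small sketch of the metrics involved (illustrative numbers, not the project's data):

```python
def metrics(preds, labels):
    """Accuracy, precision, recall, and F1 for binary predictions."""
    tp = sum(p and y for p, y in zip(preds, labels))
    tn = sum((not p) and (not y) for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# 10 x-rays, only 2 with pneumonia: a model that never flags pneumonia
# scores 80% accuracy yet has F1 = 0 -- hence F1 as the chosen metric.
labels = [True, True] + [False] * 8
never_flag = [False] * 10
print(metrics(never_flag, labels))  # prints (0.8, 0.0, 0.0, 0.0)
```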
Gautam Prakriya, Moran Shechnick, Michael Awuah, Praveen Shahani
Broadway refers to the theatrical performances presented in the 41 professional theaters, each with 500 or more seats, located in the Theatre District and at Lincoln Center along Broadway in Midtown Manhattan, New York City. Broadway is one of the most commercially successful venues for live theatre in the English-speaking world. According to The Broadway League, for the 2018–2019 season (which ended May 26, 2019), total attendance was 14,768,254 and Broadway shows grossed US$1,829,312,140, with attendance up 9.5%, grosses up 10.3%, and playing weeks up 9.3%. Most Broadway shows are musicals.
Our project models Broadway ticket sales and grosses to predict how the industry can maximize profit.
Irati Hurtado, Konstantinos Karatapanis, Sammy Sbiti
In this project, we develop a model that ascribes prespecified annotations to student essays. For this categorical classification problem, we used a pretrained word-embedding neural network (Longformer) and trained a two-layer neural network on top of it via supervised learning. Our network was trained by optimizing categorical cross-entropy, a very popular choice for this kind of task. Then, to smooth the prediction output, we weighted the probability distributions of the tokens to optimize predictions across the essay at the sentence level. The model has an app version so users can interact with it and annotate their own essays.
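The sentence-level smoothing step can be sketched as averaging token-level class probabilities within each sentence and assigning the argmax label. The class names and probabilities below are illustrative stand-ins, not the project's actual annotation scheme:

```python
def sentence_labels(token_probs, sentence_spans, classes):
    """Smooth token-level class probabilities to one label per sentence.

    token_probs: one probability list per token (aligned with `classes`).
    sentence_spans: (start, end) token index pairs, end exclusive.
    """
    labels = []
    for start, end in sentence_spans:
        span = token_probs[start:end]
        # Average each class's probability over the sentence's tokens.
        avg = [sum(p[i] for p in span) / len(span) for i in range(len(classes))]
        labels.append(classes[max(range(len(classes)), key=lambda i: avg[i])])
    return labels

classes = ["Claim", "Evidence"]
token_probs = [
    [0.9, 0.1], [0.6, 0.4], [0.4, 0.6],   # sentence 1: mostly "Claim"
    [0.2, 0.8], [0.3, 0.7],               # sentence 2: mostly "Evidence"
]
print(sentence_labels(token_probs, [(0, 3), (3, 5)], classes))
```

Averaging over a sentence suppresses isolated token-level flips, which is the point of smoothing predictions at that granularity.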
Danny Wan, Nicole Basinski, Jason Xing, Nydia Chang
Gathering data about video game players' decision-making process is difficult. Available data is currently limited to inferences made after players have entered their actions into a controller. Using the Atari-HEAD dataset, our team used a multilayer perceptron (MLP) to create a model that predicted player actions from eye-tracking data and video game frames.
Yaming Cao, Jingzhen Hu, Qingzhong Liang, Arafatur Rahman, A K M Rokonuzzaman Sonet
The motivation of our project is to build a model that predicts the future return/trend of a basket of stocks, which financial companies can use to optimize portfolios (e.g., balancing return against risk) and even develop trading strategies. We used the opening-price data between 06/02/2012 and 06/02/2022 for five stocks (AAPL, TSLA, AMD, SBUX, FB) from Yahoo to tune the LSTM model parameters and tested on the most recent data. The five stocks we picked are from different sectors, which allows us to train a general model that fits a large variety of stocks. The testing data consists of the last 90 days, while prices from the earlier dates form the training set. The resulting plots show that our predictions generally capture the correct trend, with some lag. For relatively new stocks like TSLA, the volatility (beta) is relatively high, and for stocks with historical beta above the basket mean, the prediction over the next 14 days (the time-step) is less reliable.
Stefanie Wang, Hongyi Chen, Tyler Ellis, Sahinde Dogruer
Our objective in this project was to predict rental prices in Indianapolis, IN, to help prospective tenants determine whether a rental listing is fairly priced based on its features and to help landlords set fair prices for their units. We scraped data from www.apartments.com, cleaned the data, and looked at four models: CatBoost, gradient boosting, linear regression, and random forest regression. All models shared the key features of the apartment: number of bedrooms, bathrooms, square footage, and neighborhood. CatBoost was the most effective in terms of lowest MSE and can predict whether a rental listing is fairly priced to within 10%.
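The tenant-facing fairness check reduces to comparing the asking price against the model's prediction within a tolerance band. A sketch, assuming the 10% figure in the abstract is used as that band (the prices below are toy values):

```python
def fairly_priced(listing_price, predicted_price, tolerance=0.10):
    """True if the asking price is within ±tolerance of the model's prediction."""
    return abs(listing_price - predicted_price) / predicted_price <= tolerance

# A model (e.g. the fitted CatBoost regressor) would supply predicted_price.
print(fairly_priced(1050, 1000), fairly_priced(1200, 1000))  # prints "True False"
```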
Berna Basar, Ness Mayker, Mario Mendez
Machine learning algorithms have been employed to produce predictive models that would enable the proactive diagnosis of heart disease, such as the Heroku app. We use the Heroku app model as our baseline logistic regression model. Our baseline model has a true positive rate of only 12%. With the objective of developing a better-performing model, we developed two new models: a challenger logistic regression model and a random forest model.
Adriana Morales Miranda, Devashi Gulati, Meghan Peltier
The purpose of our project was to analyze clinical trial data and determine the most important features contributing to clinical trial phase failure. After modeling the data from the NIH U.S. National Library of Medicine's website (https://clinicaltrials.gov/), we concluded that the enrollment fraction, i.e., the actual enrollment divided by the desired enrollment, is the major predictor of successful completion of clinical trials. Our logistic regression model, if updated and implemented, would save 24% of the money spent on clinical trials.