Description
The Erdős Institute is a multi-university collaboration focused on helping PhDs get and create jobs they love.
Erdős Connect is our one-day conference bringing together startups, investors, partners, and PhD researchers through rapid-fire demos, pitches, and project presentations.
This event serves as a connective launchpad for innovative ventures and leaders, aiming to harness world-class data science and subject matter expertise to find solutions to common problems. As a participant, you're entering an environment geared towards catalyzing the spin-out of startups, facilitating the acquisition of project teams, and driving innovation and growth in the data science ecosystem.
To learn more about our community of over 3000 PhDs, please visit: https://www.erdosinstitute.org/dashboard
Schedule (all times ET)
3:15 PM
Investors
Jill Raderstorf (introduction)
3:45 PM
Startups
Jim Schwoebel (introduction)
Project Abstracts
The idea of the project is to predict high glucose spikes from continuous glucose data, smartwatch data, food logs, and glycemic index. The dataset consists of the following:
1) Tri-axial accelerometer data (movement in subject)
2) Blood volume pulse
3) Interstitial glucose concentration
4) Electrodermal activity
5) Heart rate
6) IBI (interbeat interval)
7) Skin temperature
8) Food log
The data is publicly available at: https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/#files-panel
Our team’s goal was to design a recommendation engine for recipes. We achieved this by combining an SVD model, trained on data from food.com, with a sentence transformer. This allowed us to generate recommendations based on a combination of a user query and a user’s history.
Today, we will explain how to interpret certain geometric features of the SVD model underlying our recommendation engine. We can use these to generate recommendations that not only take into account a user’s query and preference history, but can be tailored to their mood at the time they are making the query.
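As a rough illustration of how a history-based score and a query-based score might be combined, here is a minimal sketch. It is not the team's implementation: the function names, the `alpha` blending weight, and all vectors are hypothetical, and the latent factors and text embeddings are assumed to have been computed already (for example by an SVD model and a sentence transformer).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def blended_scores(user_factors, item_factors, query_emb, item_embs, alpha=0.5):
    """Score every item as alpha * history term + (1 - alpha) * query term.

    history term: dot product of the user's and the item's latent factors,
                  the kind of score an SVD-style model derives from past ratings.
    query term:   cosine similarity between the query and item text embeddings.
    alpha is a hypothetical blending weight, not something from the abstract.
    """
    scores = {}
    for item, factors in item_factors.items():
        history = sum(a * b for a, b in zip(user_factors, factors))
        query = cosine(query_emb, item_embs[item])
        scores[item] = alpha * history + (1 - alpha) * query
    return scores

# Toy, made-up numbers purely for illustration.
item_factors = {"chili": [0.9, 0.1], "salad": [0.1, 0.9], "stew": [0.8, 0.3]}
item_embs = {"chili": [1.0, 0.0], "salad": [0.0, 1.0], "stew": [0.7, 0.7]}
user_factors = [1.0, 0.0]  # history leans toward the first latent dimension
query_emb = [0.0, 1.0]     # query text is closest to "salad"

history_only = blended_scores(user_factors, item_factors, query_emb, item_embs, alpha=1.0)
query_only = blended_scores(user_factors, item_factors, query_emb, item_embs, alpha=0.0)
```

With `alpha=1.0` the ranking follows the user's history alone, and with `alpha=0.0` it follows the query alone. In practice the two terms sit on different scales, so some normalization would likely be needed before blending.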
We use computer vision techniques to identify bird calls from audio for the BirdCLEF 2024 Kaggle competition, hosted by the Cornell Lab of Ornithology. The competition seeks reliable machine learning models for automatic detection and classification of bird species from soundscapes recorded in the Western Ghats of India, a Global Biodiversity Hotspot. The broader goal of this endeavor is to leverage Passive Acoustic Monitoring (PAM) and machine learning techniques to enable conservationists to study bird biodiversity at much greater scales than is possible with observer-based surveys.
In this project, we employ Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) models to forecast groundwater levels for four wells in Spokane, Washington. Using historical data on well levels, weather, and surface water, we build models that predict future groundwater conditions. We found that across all wells, our models outperform the baseline. More specifically, LSTM performs the best, achieving RMSE values between 0.34 and 2.3 feet. We also developed a Streamlit web app for users to explore and visualize these predictions, and we ultimately hope the results can inform water management policies.
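For readers unfamiliar with the metric, the sketch below shows how RMSE can be computed and compared against a baseline. The abstract does not say which baseline was used; a persistence forecast (tomorrow equals today) is assumed here purely for illustration, and all numbers are invented.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def persistence_forecast(series):
    """Assumed baseline: predict each value as the previous observation."""
    return series[:-1]

# Invented groundwater levels in feet, for illustration only.
levels = [10.0, 10.5, 11.0, 10.0]
targets = levels[1:]                     # values to be predicted
baseline = persistence_forecast(levels)  # previous-day values
model_preds = [10.4, 10.9, 10.2]         # hypothetical model output

baseline_rmse = rmse(targets, baseline)
model_rmse = rmse(targets, model_preds)
```

A model "outperforms the baseline" in this sense when `model_rmse` is smaller than `baseline_rmse` on held-out data.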
In this project, our objective was to improve the retrieval stage of a retrieval-augmented generation (RAG) pipeline for use with a large language model (LLM). This model is used to provide specific and relevant information to leadership teams interested in operational intelligence regarding employee experiences. Given a corpus of 5.5M+ documents, we achieved sub-second retrieval of results using text embeddings (HuggingFace, gte-base) and vector database indexing methods (LanceDB). By engineering metadata and applying filtering and re-ranking techniques before and after retrieval, we significantly improved the placement of relevant results across several metrics (Mean Reciprocal Rank, Extended Mean Reciprocal Rank, and Normalized Discounted Cumulative Gain).
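Two of the ranking metrics named above have standard definitions, sketched below in plain Python. This is a generic illustration with made-up document IDs, not the project's evaluation code; Extended Mean Reciprocal Rank is omitted because the abstract does not define it.

```python
import math

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def dcg(gains):
    """Discounted cumulative gain: gain at position i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranked, relevance, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering.

    relevance maps document ID -> graded gain; unlisted documents count as 0.
    """
    k = len(ranked) if k is None else k
    gains = [relevance.get(doc, 0.0) for doc in ranked[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0

# Made-up rankings for two queries.
rankings = [["d3", "d1", "d2"], ["d5", "d6", "d4"]]
relevant = [{"d1"}, {"d4"}]
mrr = mean_reciprocal_rank(rankings, relevant)

grades = {"d1": 1.0, "d2": 0.0}
ndcg_perfect = ndcg(["d1", "d2"], grades)
ndcg_swapped = ndcg(["d2", "d1"], grades)
```

Both metrics reward placing relevant documents near the top, which is why re-ranking after retrieval can raise them even when the retrieved set itself is unchanged.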
A trove of videos of cats “conversing” with humans is available at r/catswhoyell and across the internet. However, despite the sheer volume of cat meows an owner experiences, getting useful information from those meows is much harder. Yes, we can probably guess from a cat's meow when a cat wants food and when a cat is in distress. But what if we could do more? What if we could use cat meows to determine if a cat is sick? Meow x Meow explores this possibility. We built both traditional ML and deep learning models to predict the context of a cat meow, using augmented data from the CatMeows dataset. Our models enable us to do a meow-by-meow classification of “uncomfortable”, “hungry”, and “comfortable” with 90% validation accuracy. In the future, cat owners can upload their recordings to our online UI, and discover if their cat is hurt or just reminding us to feed. the. cat. now.
Financial markets are often affected by sentiment conveyed in news headlines. As major news events can drive significant fluctuations in stock prices, understanding these sentiment trends can provide important insights into market movements. This project asks whether sentiment extracted from financial news headlines can predict stock movements.
We use five years' worth of data extracted from Yahoo Finance and Stock News API, obtain sentiment scores using FinVader, and train four models (Logistic Regression, Gradient Boosted Trees, XGBoost, and LSTM) to predict whether the next day's stock price will rise or fall. We use a simulated stock portfolio to evaluate the effectiveness of the models.
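One simple way such a portfolio simulation could work is sketched below. The trading rule (hold the stock on days predicted to rise, stay in cash otherwise), the absence of transaction costs, and all numbers are assumptions made for illustration; the abstract does not describe the actual simulation.

```python
def simulate_portfolio(daily_returns, predicted_up, start_value=1.0):
    """Compound returns only on days the model predicts a rise.

    daily_returns: fractional next-day returns, e.g. 0.02 for +2%.
    predicted_up:  booleans, True if the model predicts the price will rise.
    Assumed rule: fully invested on predicted-up days, in cash otherwise,
    with no transaction costs.
    """
    value = start_value
    for r, up in zip(daily_returns, predicted_up):
        if up:
            value *= 1.0 + r
    return value

# Invented returns and predictions, for illustration only.
returns = [0.10, -0.10, 0.10]
predictions = [True, False, True]

strategy_value = simulate_portfolio(returns, predictions)
buy_and_hold_value = simulate_portfolio(returns, [True] * len(returns))
```

Comparing `strategy_value` with `buy_and_hold_value` gives a sense of whether the predictions add anything beyond simply holding the stock.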
Most existing music recommendation systems rely on listeners to provide seed tracks, and then utilize a variety of different approaches to recommend additional tracks in either a playlist-like listening session or as sequential track recommendations based on user feedback.
We built a playlist recommendation engine that takes a different approach, allowing listeners to generate a novel playlist based on a semantic string, such as the title of the desired playlist, a specific mood (happy, relaxed), an atmosphere (tropical vibe), or a function (party music, focus). Using a publicly available dataset of existing playlists (https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), we combine a semantic similarity vector model with a matrix factorization model to allow users to quickly and easily generate playlists to fit any occasion.