TEAM

Pawsitive Retrieval 1

Marcos Ortiz, Kristina Knowles, Diptanil Roy, Karthik Prabhu Palimar, Sayantan Roy

This project builds a model that efficiently identifies and ranks relevant content from a large dataset of human-generated Reddit posts (5.5 million posts from 34 different subreddits), given an arbitrary user query. The key objectives are to retrieve highly relevant results for each query while keeping retrieval time under 1 second. The long-term application is to use this capability as part of a Retrieval-Augmented Generation (RAG) pipeline for Aware clients.

We focus on systematically varying parameters of our embedding model, as well as applying different filters (before retrieval) and rerankings (after retrieval) that leverage the relationships inherent in the structure of the data.
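The filter, retrieve, and rerank stages described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the field names (`subreddit`, `vec`, `boost`), the toy two-dimensional embeddings, the cosine-similarity scoring, and the additive reranking rule are hypothetical and are not taken from the team's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, posts, allowed_subreddits=None, k=3, boost_weight=0.0):
    """Filter before retrieval, take the top k by similarity, then rerank.

    `allowed_subreddits` and `boost_weight` are hypothetical knobs standing in
    for whatever filters and reranking signals a real pipeline would use.
    """
    # Filter (before retrieval): keep only posts from the allowed subreddits.
    candidates = [
        p for p in posts
        if allowed_subreddits is None or p["subreddit"] in allowed_subreddits
    ]
    # Retrieval: rank candidates by embedding similarity and keep the top k.
    top_k = sorted(
        candidates, key=lambda p: cosine(query_vec, p["vec"]), reverse=True
    )[:k]
    # Reranking (after retrieval): blend similarity with a metadata signal.
    reranked = sorted(
        top_k,
        key=lambda p: cosine(query_vec, p["vec"]) + boost_weight * p["boost"],
        reverse=True,
    )
    return [p["id"] for p in reranked]

# Toy data: three posts with made-up embeddings and metadata.
posts = [
    {"id": "a", "subreddit": "dogs", "vec": [1.0, 0.0], "boost": 0.0},
    {"id": "b", "subreddit": "dogs", "vec": [0.8, 0.6], "boost": 1.0},
    {"id": "c", "subreddit": "cats", "vec": [0.9, 0.1], "boost": 0.0},
]
query = [1.0, 0.0]

plain = retrieve(query, posts)
filtered = retrieve(query, posts, allowed_subreddits={"dogs"})
boosted = retrieve(query, posts, allowed_subreddits={"dogs"}, boost_weight=0.5)
```

In this toy setup the filter removes post `c` before any similarity is computed, and the reranking weight can move post `b` ahead of post `a` even though `a` is the closer embedding match. At the scale of millions of posts, the brute-force sort would likely be replaced by an approximate nearest-neighbor index to stay within a sub-second budget.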

Using these strategies, we improve the ranking positions of relevant results, as measured by several modified recommender-system metrics. We implemented these metrics using a set of over 1,000 human-labeled query-result pairs, which establishes the known relevant results for 25 queries.
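As a rough illustration of how labeled query-result pairs can score a ranking, the sketch below computes two standard, unmodified recommender-system metrics: precision at k and mean reciprocal rank. The summary does not say which metrics the team used or how they were modified, so the metric choice, the function names, and the toy labels here are all assumptions.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    top = retrieved[:k]
    return sum(1 for r in top if r in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, r in enumerate(retrieved, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs, labels):
    """Average reciprocal rank over all labeled queries."""
    scores = [reciprocal_rank(runs[q], labels[q]) for q in labels]
    return sum(scores) / len(scores)

# Hypothetical labels: the known relevant post ids for each query.
labels = {"q1": {"a", "b"}, "q2": {"x"}}
# Hypothetical ranked retrieval output for each query.
runs = {"q1": ["c", "a", "b"], "q2": ["x", "y", "z"]}

p_at_2 = precision_at_k(runs["q1"], labels["q1"], k=2)
mrr = mean_reciprocal_rank(runs, labels)
```

Both metrics reward placing relevant results earlier in the list, which is the kind of improvement in placement the paragraph above refers to.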

THE ERDŐS INSTITUTE

Helping PhDs get and create jobs they love at every stage of their career.
