
Certificate of Completion


THIS ACKNOWLEDGES THAT

Jay Hathaway

HAS COMPLETED THE SPRING 2022 DATA SCIENCE BOOT CAMP


Roman Holowinsky, PhD

DIRECTOR

JUNE 08, 2022

DATE

TEAM

The Hopf Bundle

Halley Fritze, Jay Hathaway, Max Vargas


Our problem is to accurately deduplicate commercial location data for points of interest. In other words, given two location observations, we must determine whether they describe the same point of interest. The data contains many features, such as name and address, which required text processing and vectorization before we could train our models. Moreover, the data contained errors and many missing values, which made careful data cleaning and thoughtful handling of text features especially valuable. We performed a broad analysis, training many different models to see which achieved the highest accuracy in identifying pairs of observations that refer to the same point of interest. We also outline further directions for model improvement.
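To make the pairwise setup concrete, here is a minimal sketch of how two observations might be turned into a numeric feature vector that a classifier could consume. It is an illustration only, not the team's actual pipeline: the helper names (`char_ngrams`, `jaccard`, `pair_features`), the choice of character trigrams with Jaccard similarity, and the dictionary layout of an observation are all assumptions made for this example.

```python
from typing import Optional


def char_ngrams(text: Optional[str], n: int = 3) -> set:
    """Lowercase, collapse whitespace, and return the set of character n-grams.

    Missing or empty values yield an empty set (hypothetical cleaning rule).
    """
    if not text:
        return set()
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        return {cleaned}
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_features(obs_a: dict, obs_b: dict, fields=("name", "address")) -> list:
    """Build [similarity, missing_flag] per field for a pair of observations.

    The missing flag lets a downstream model tell "no overlap" apart from
    "value absent", since both produce a similarity of 0.0.
    """
    features = []
    for field in fields:
        value_a, value_b = obs_a.get(field), obs_b.get(field)
        missing = 1.0 if not value_a or not value_b else 0.0
        features.append(jaccard(char_ngrams(value_a), char_ngrams(value_b)))
        features.append(missing)
    return features


# Two made-up observations: same name up to case and spacing, one address missing.
a = {"name": "Joe's Coffee", "address": "12 Main St"}
b = {"name": "joe's  coffee", "address": None}
print(pair_features(a, b))  # [1.0, 0.0, 0.0, 1.0]
```

Vectors like this, computed for labeled pairs, could then be passed to any standard binary classifier to predict whether the two observations match.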

github URL