Certificate of Completion
THIS ACKNOWLEDGES THAT
HAS COMPLETED THE SPRING 2022 DATA SCIENCE BOOT CAMP
Roman Holowinsky, PhD
JUNE 08, 2022
The Hopf Bundle
Halley Fritze, Jay Hathaway, Max Vargas
Our problem is to accurately deduplicate commercial location data for points of interest: given two location observations, we must determine whether they describe the same point of interest. Our data contains many features, such as name and address, which required text processing and vectorization before we could train our models. Moreover, the data contains errors and many missing values, which makes careful data cleaning and thoughtful construction of text features especially valuable. We performed a broad analysis, training many different models to see which would achieve the highest accuracy in identifying pairs of observations that describe the same point of interest. We also mention further directions for model improvement.
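As a rough illustration of how a pair of observations might be turned into numeric text features, here is a minimal sketch. It is not the team's actual pipeline. It assumes each observation is a dictionary with optional "name" and "address" fields, and it uses character-trigram Jaccard similarity as a stand-in for whatever vectorization was really used. The helper names (normalize, char_ngrams, jaccard, pair_features) are hypothetical.

```python
from typing import Optional


def normalize(text: Optional[str]) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    Missing values (None) become the empty string.
    """
    if text is None:
        return ""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of an already normalized string."""
    if len(text) < n:
        # Short strings are kept whole; empty strings yield no n-grams.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Jaccard similarity of trigram sets, or None if either field is missing."""
    grams_a = char_ngrams(normalize(a))
    grams_b = char_ngrams(normalize(b))
    if not grams_a or not grams_b:
        return None  # leave missing comparisons for a later imputation step
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def pair_features(obs_a: dict, obs_b: dict) -> dict:
    """Build one feature row for a candidate pair (assumed field names)."""
    return {
        f"{field}_sim": jaccard(obs_a.get(field), obs_b.get(field))
        for field in ("name", "address")
    }
```

Rows built this way, one per candidate pair and labeled as match or non-match, could then be passed to any binary classifier. Returning None for missing fields keeps "unknown" distinct from "completely different", which matters when many values are absent.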