The Hopf Bundle

Halley Fritze, Jay Hathaway, Max Vargas


Our problem is to accurately deduplicate commercial location data for points of interest: given two location observations, we must determine whether they describe the same point of interest. The data contains many text features, such as name and address, which required text processing and vectorization before we could train our models. The data also contained errors and many missing values, which made careful data cleaning and thoughtful handling of text features especially valuable. We performed a broad analysis, training many different models to see which achieved the highest accuracy in identifying pairs of observations that refer to the same point of interest. We also discuss directions for further model improvement.
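To make the pairwise setup concrete, here is a minimal sketch of how two observations might be turned into a numeric feature vector that a classifier could consume. It is an illustration only, not the pipeline we used: the field names `name` and `address`, the helper names `normalize`, `similarity`, and `pair_features`, the character-level similarity from Python's standard `difflib`, and the `-1.0` sentinel for missing values are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def normalize(text: Optional[str]) -> Optional[str]:
    """Lowercase and collapse whitespace; treat empty or missing text as None."""
    if text is None:
        return None
    cleaned = " ".join(text.lower().split())
    return cleaned or None


def similarity(a: Optional[str], b: Optional[str]) -> float:
    """Return a similarity in [0, 1], or -1.0 if either value is missing."""
    a, b = normalize(a), normalize(b)
    if a is None or b is None:
        # Hypothetical sentinel: lets a downstream model distinguish
        # "missing" from "present but dissimilar".
        return -1.0
    return SequenceMatcher(None, a, b).ratio()


def pair_features(obs_a: dict, obs_b: dict) -> List[float]:
    """Build one feature per text field for a pair of observations."""
    fields = ("name", "address")  # assumed field names
    return [similarity(obs_a.get(f), obs_b.get(f)) for f in fields]


if __name__ == "__main__":
    a = {"name": "Joe's Cafe", "address": None}
    b = {"name": "JOE'S  CAFE", "address": "12 Main St"}
    print(pair_features(a, b))
```

Each pair of observations yields one such vector, and a labeled set of these vectors (match or non-match) is what a model would be trained on. Keeping missing values as an explicit sentinel rather than dropping the pair is one possible design choice; imputation or a separate indicator feature would be reasonable alternatives.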

