Duplicate listings have proven to be a significant challenge for Yelp, the company revealed in a post on its engineering blog. It’s not an issue Yelp takes lightly, and the company is putting new measures in place to deal with it.
Under its current system, Yelp has merged more than 500,000 duplicate listings so far. The post provides some insight into how the company reached that number.
“We constantly receive new business information from a variety of sources including external partners, business owners, and Yelp users,” writes software engineer Tobi Owoputi. “It isn’t always easy to tie different updates from different sources to the same business listing, so we sometimes mistakenly generate duplicates. Duplicates are especially bad when both listings have user-generated content as they lead to user confusion over which page is the ‘right’ one to add a review or check-in to.”
“The problem of detecting and merging duplicates isn’t trivial,” he continues. “Merging two businesses involves moving and destroying information from multiple tables which is difficult for us to undo without significant manual effort. A pair of businesses can have slightly different names, categories, and addresses while still being duplicates, so trying to be safe by only merging exact matches isn’t good enough. On the other hand, using simple text similarity measures generates a lot of false positives by misclassifying cases like: two businesses that are part of the same chain and are located close to one another; one business that is a sub-business of another (e.g. Monterey Bay Aquarium and Jellies Experience at the Monterey Bay Aquarium); a professional and the practice that they work at (e.g. Coldwell Banker and Rose Parmelee – Coldwell Banker).”
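To see why simple text similarity falls short, consider a minimal sketch that scores two business names by how many words they share. This is purely illustrative and is not Yelp’s classifier: the function names, the token-overlap measure, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def tokens(name: str) -> set:
    """Lowercase a business name and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def overlap_score(a: str, b: str) -> float:
    """Share of the shorter name's tokens that also appear in the other name.

    A hypothetical stand-in for a 'simple text similarity measure'.
    """
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def naive_is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicates on name overlap alone (assumed threshold)."""
    return overlap_score(a, b) >= threshold


# Both pairs come from the examples quoted above; neither is a true duplicate.
pairs = [
    ("Monterey Bay Aquarium",
     "Jellies Experience at the Monterey Bay Aquarium"),
    ("Coldwell Banker", "Rose Parmelee – Coldwell Banker"),
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: flagged={naive_is_duplicate(a, b)}")
```

Because every word of the shorter name appears in the longer one, both pairs get the maximum score and are flagged, even though one is a sub-business and the other is a professional at a practice. Telling such cases apart requires signals beyond raw name overlap.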
The post walks through the process in detail. Going forward, Yelp plans to add language- and geographical area-specific features, focus on high-impact duplicates (based on number of search result impressions), and extract its named entity and discriminative word classifiers into libraries for use in other projects.
Yelp hopes to merge all high-confidence duplicate business listings and, through improvements to its classifier, minimize the amount of human intervention required.