Google was granted a patent today for something called “Feature selection for large scale models”. It sounds pretty vague. The patent’s abstract says:
Disclosed are a method and system for receiving a plurality of potential features to be added to a model having existing features. For each of the potential features, an approximate model is learned by holding values of the existing features in the model constant. The approximate model includes the model having existing features and at least the potential feature. A performance metric is computed for evaluating performance of the approximate model. The performance metric is used to rank the potential feature based on a predetermined criterion.
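The abstract is dense, but the procedure it describes can be sketched in a few lines. The following is a hypothetical illustration only, not Google's actual implementation: the existing model's predictions are frozen, a single new weight is learned for each candidate feature, and candidates are ranked by how much the performance metric (here, squared error) improves. All function and variable names are invented for this sketch.

```python
def rank_candidate_features(frozen_predictions, candidates, targets):
    """Hypothetical sketch of the patented idea: hold the existing model's
    output constant, learn only one weight per candidate feature, and rank
    candidates by the drop in error. Uses closed-form 1-D least squares."""
    base_error = sum((p - t) ** 2 for p, t in zip(frozen_predictions, targets))
    # Residuals the existing (frozen) model fails to explain.
    residuals = [t - p for p, t in zip(frozen_predictions, targets)]
    results = []
    for name, feature in candidates.items():
        # Best single weight for this feature against the residuals.
        denom = sum(f * f for f in feature)
        w = (sum(f * r for f, r in zip(feature, residuals)) / denom
             if denom else 0.0)
        # Performance metric for the "approximate model":
        # frozen predictions plus the new feature's contribution.
        err = sum((p + w * f - t) ** 2
                  for p, f, t in zip(frozen_predictions, feature, targets))
        results.append((name, base_error - err))  # improvement over base
    # Rank by improvement, best candidate first.
    return sorted(results, key=lambda item: item[1], reverse=True)
```

For example, given a frozen model that always predicts 0.5, a feature that tracks the targets ranks above a constant feature that explains nothing. This one-weight-at-a-time approach is what makes the method cheap enough for "large scale models": each candidate is evaluated without retraining the full model.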
That doesn’t exactly jump out at you as an obvious patent on the Panda update. However, Bill Slawski at SEO By The Sea, who spends a fair amount of time analyzing Google patents, sees a connection, and wonders if this is indeed the Panda patent.
“I have been keeping a careful eye out for a patent that would describe the process behind Google’s Panda updates, and based upon the nature of those updates, my expectation was that I might not necessarily recognize it once I came across it,” writes Slawski. “I didn’t expect it to provide details upon specific features that might be seen as positive or negative when it comes to determining the quality of web pages. I didn’t expect it to provide hints about what a webmaster might do if he or she was impacted by it.”
“I did expect that a patent about the Panda update would involve very large data sets, that it would include a machine learning approach that might determine positive features from known websites considered to be high quality, and that it could expand upon the features being used during the process of classifying a large set of pages,” he adds. “The process described in this patent does seem to fit those expectations.”
Indeed. In the background section of the patent, it says:
In recent years, machine-learning approaches for data analysis have been widely explored for recognizing patterns which, in turn, allow extraction of significant information contained within large datasets. Learning algorithms include models that may be trained to generalize using data with known outcomes. Trained learning machine algorithms may then be applied to predict the outcome in cases of unknown outcome, i.e., to classify the data according to learned patterns.
Could this be machines learning to assess page quality based on what they have already deemed to be quality? Slawski refers back to a famous Wired interview with Google’s Matt Cutts and Amit Singhal from last year. That was the interview where the update was actually revealed to be named “Panda,” after one of Google’s engineers. In that interview, Cutts talked about how Google came up with a classifier to look at sites like the IRS, Wikipedia or the New York Times on one side, and low-quality sites on the other, with there being “mathematical reasons” you “can really see”.
As Slawski notes, this new patent illustrates a way to examine features on a seed set of known pages, and compare them with features on other pages, to determine a classification for those pages.
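To make that seed-set idea concrete, here is a toy sketch, entirely hypothetical and not drawn from the patent's claims: known high-quality and low-quality pages define two centroids in feature space, and an unknown page takes the label of the nearer one. The feature values (one might imagine signals like ad density or text depth) are made up for illustration.

```python
def classify_by_seed_sets(high_quality, low_quality, page):
    """Toy nearest-centroid classifier: feature vectors from two seed sets
    of known pages define centroids; an unknown page gets the label of
    whichever centroid is closer (squared Euclidean distance)."""
    def centroid(pages):
        n = len(pages)
        return [sum(p[i] for p in pages) / n for i in range(len(pages[0]))]

    def dist_sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hi_center = centroid(high_quality)
    lo_center = centroid(low_quality)
    return "high" if dist_sq(page, hi_center) <= dist_sq(page, lo_center) else "low"
```

Whatever classifier Google actually uses is surely far more sophisticated, but the shape of the process, learn from labeled seed pages, then score everything else, matches what Cutts described in the Wired interview.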
He’s clear that he’s not certain this is indeed Google’s Panda patent, but it’s interesting nonetheless, and could still provide clues about Google’s background processes.
While Google’s Panda update is still something webmasters must contend with, it’s the Penguin update that has hogged the spotlight lately. This past weekend, Google pushed out its first data refresh for Penguin, and at least one site has shown that a full recovery is possible.
Slawski has recently pointed to other Google patents which might be directly related to Penguin as well.
Google filed for the “Feature Selection For Large Scale Models” patent on October 31, 2008. You can read the full filing here.