Google’s Matt Cutts has put out a new Webmaster Help video. This one is particularly interesting and, at nearly 8 minutes, much longer than the norm. It goes fairly in depth on how Google crawls content and attempts to rank it by relevance. PageRank, you’ll find, is still the key ingredient.
He starts off by talking about how far Google has come in terms of crawling. When Cutts started at Google, they were only crawling every three or four months.
“We basically take page rank as the primary determinant,” says Cutts. “And the more page rank you have – that is, the more people who link to you and the more reputable those people are – the more likely it is we’re going to discover your page relatively early in the crawl. In fact, you could imagine crawling in strict page rank order, and you’d get the CNNs of the world and The New York Times of the world and really very high page rank sites. And if you think about how things used to be, we used to crawl for 30 days. So we’d crawl for several weeks. And then we would index for about a week. And then we would push that data out. And that would take about a week.”
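To make the “strict page rank order” idea concrete, here’s a minimal sketch of a crawl scheduler built around a priority queue. The `pagerank` and `fetch_links` inputs are assumptions standing in for Google’s actual link graph and fetching infrastructure, which the video doesn’t describe:

```python
import heapq

def crawl_in_pagerank_order(seed_urls, pagerank, fetch_links, budget=1000):
    """Always fetch the highest-PageRank URL discovered so far.
    `pagerank` (URL -> score) and `fetch_links` (URL -> outlinks) are
    hypothetical stand-ins for a real link graph and fetcher."""
    # heapq is a min-heap, so scores are negated to pop the largest first
    frontier = [(-pagerank.get(u, 0.0), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    crawl_order = []
    while frontier and len(crawl_order) < budget:
        _, url = heapq.heappop(frontier)
        crawl_order.append(url)  # high-PageRank sites surface early
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-pagerank.get(link, 0.0), link))
    return crawl_order
```

Under this scheme, the CNNs and New York Timeses of the world naturally come out of the queue first, which is exactly the behavior Cutts describes.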
He continues on with the history lesson, talking about the Google Dance, Update Fritz and things, and eventually gets to the present.
“So at this point, we can get very, very fresh,” he says. “Any time we see updates, we can usually find them very quickly. And in the old days, you would have not just a main or a base index, but you could have what were called supplemental results, or the supplemental index. And that was something that we wouldn’t crawl and refresh quite as often. But it was a lot more documents. And so you could almost imagine having really fresh content, a layer of our main index, and then more documents that are not refreshed quite as often, but there’s a lot more of them.”
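One way to picture that layering is as a set of index tiers with very different recrawl budgets. Everything in this sketch is invented for illustration – Cutts gives no actual numbers or interfaces – but it captures the shape of a small, constantly refreshed layer sitting on top of a much larger, rarely refreshed one:

```python
import time

# Invented capacities and intervals, purely to illustrate the layering
TIERS = [
    {"name": "fresh",        "capacity": 10**6,  "recrawl_seconds": 60},
    {"name": "main",         "capacity": 10**9,  "recrawl_seconds": 86_400},
    {"name": "supplemental", "capacity": 10**10, "recrawl_seconds": 86_400 * 30},
]

def is_due_for_recrawl(doc, now=None):
    """A document is revisited once its tier's refresh interval elapses,
    so the fresh layer churns constantly while the huge supplemental
    layer is refreshed far less often."""
    now = now if now is not None else time.time()
    tier = next(t for t in TIERS if t["name"] == doc["tier"])
    return now - doc["last_crawled"] >= tier["recrawl_seconds"]
```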
Google continues to emphasize freshness, as we’ve seen in the company’s monthly lists of algorithm changes the last several months.
“What you do then is you pass things around,” Cutts continues. “And you basically say, OK, I have crawled a large fraction of the web. And within that web you have, for example, one document. And indexing is basically taking things in word order. Well, let’s just work through an example. Suppose you say Katy Perry. In a document, Katy Perry appears right next to each other. But what you want in an index is which documents does the word Katy appear in, and which documents does the word Perry appear in? So you might say Katy appears in documents 1, and 2, and 89, and 555, and 789. And Perry might appear in documents number 2, and 8, and 73, and 555, and 1,000. And so the whole process of doing the index is reversing, so that instead of having the documents in word order, you have the words, and they have it in document order. So it’s, OK, these are all the documents that a word appears in.”
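That “reversal” is the classic inverted index. Here’s a minimal Python sketch, leaving out everything a production indexer would add (stemming, word positions, compression):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Turn {doc_id: text} into {word: sorted doc_id list}: documents
    in word order become words in document order, as Cutts puts it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

# With the right toy corpus this reproduces Cutts' example postings:
# index["katy"]  == [1, 2, 89, 555, 789]
# index["perry"] == [2, 8, 73, 555, 1000]
```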
“Now when someone comes to Google and they type in Katy Perry, you want to say, OK, what documents might match Katy Perry?” he continues. “Well, document one has Katy, but it doesn’t have Perry. So it’s out. Document number two has both Katy and Perry, so that’s a possibility. Document eight has Perry but not Katy. 89 and 73 are out because they don’t have the right combination of words. 555 has both Katy and Perry. And then these two are also out. And so when someone comes to Google and they type in Chicken Little, Britney Spears, Matt Cutts, Katy Perry, whatever it is, we find the documents that we believe have those words, either on the page or maybe in back links, in anchor text pointing to that document.”
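The elimination he walks through is just a linear merge of two sorted posting lists. A sketch, using the exact document numbers from his example:

```python
def intersect(postings_a, postings_b):
    """Walk two sorted posting lists in step; only doc IDs present
    in both lists survive as candidate matches."""
    matches, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            matches.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return matches

katy  = [1, 2, 89, 555, 789]
perry = [2, 8, 73, 555, 1000]
print(intersect(katy, perry))  # [2, 555] -- the candidates Cutts keeps
```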
“Once you’ve done what’s called document selection, you try to figure out, how should you rank those?” he explains. “And that’s really tricky. We use page rank as well as over 200 other factors in our rankings to try to say, OK, maybe this document is really authoritative. It has a lot of reputation because it has a lot of page rank. But it only has the word Perry once. And it just happens to have the word Katy somewhere else on the page. Whereas here is a document that has the word Katy and Perry right next to each other, so there’s proximity. And it’s got a lot of reputation. It’s got a lot of links pointing to it.”
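A toy scorer can illustrate the trade-off he’s describing, with reputation and proximity as the only two signals. The weights here are made up, and the real system combines over 200 signals in ways Google doesn’t disclose:

```python
def toy_score(text, pagerank, term_a="katy", term_b="perry"):
    """Toy two-signal ranker: reputation (a hypothetical PageRank
    value) plus a bonus when the query terms sit next to each other."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return 0.0  # fails document selection; never reaches ranking
    adjacent = any(i + 1 in pos_b for i in pos_a)
    return pagerank + (2.0 if adjacent else 0.0)  # invented weights

# A reputable page with the terms scattered can lose to a less
# reputable page containing the exact phrase:
toy_score("perry reviews concerts while katy sings", pagerank=2.5)  # 2.5
toy_score("katy perry announces tour",               pagerank=1.0)  # 3.0
```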
He doesn’t really talk about Search Plus Your World, which clearly has a great deal of influence on how users see content these days. And while he does talk about freshness, he doesn’t really talk about how it seems to drive rankings either. Freshness is great as a measure of how quickly Google can crawl, but sometimes it feels like how fresh something is carries a little too much weight in Google’s results. Sometimes the more relevant content is older, and I’ve seen plenty of SERPs that lean toward freshness, making it particularly hard to find the specific things I’m looking for. What do you think?
“You want to find reputable documents that are also about what the user typed in,” continues Cutts in the video. “And that’s kind of the secret sauce, trying to figure out a way to combine those 200 different ranking signals in order to find the most relevant document. So at any given time, hundreds of millions of times a day, someone comes to Google. We try to find the closest data center to them.”
“They type in something like Katy Perry,” he says. “We send that query out to hundreds of different machines all at once, which look through their little tiny fraction of the web that we’ve indexed. And we find, OK, these are the documents that we think best match. All those machines return their matches. And we say, OK, what’s the creme de la creme? What’s the needle in the haystack? What’s the best page that matches this query across our entire index? And then we take that page and we try to show it with a useful snippet. So you show the key words in the context of the document. And you get it all back in under half a second.”
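Mechanically, that fan-out-and-merge step is a scatter-gather: broadcast the query to every shard, then keep the global best. A sketch, where `shard.search` is an assumed interface returning (score, doc_id) pairs rather than anything Google has documented:

```python
import heapq

def scatter_gather(query, shards, top_k=10):
    """Each shard scores only its small slice of the index; the
    coordinator merges every shard's hits into one ranked list."""
    hits = []
    for shard in shards:  # in reality: hundreds of machines, in parallel
        hits.extend(shard.search(query))  # hypothetical: [(score, doc_id), ...]
    return heapq.nlargest(top_k, hits)  # the best pages across the whole index
```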
As Cutts notes in the intro to the video, he could talk for hours about all of this stuff. I’m sure you didn’t expect him to reveal Google’s 200 signals in the video, but it does provide some interesting commentary from the inside on how Google approaches ranking, even if it leaves those signals out as a whole.
Google, as Cutts also recently explained, runs 20,000 search experiments a year.