Pandu Nayak testified at the U.S. vs. Google antitrust trial back in October. All I remember seeing at the time was what felt like a PR puff piece published by the New York Times.
Then AJ Kohn published What Pandu Nayak taught me about SEO on Nov. 16 – which contained a link to a PDF of Nayak’s testimony. This is a fascinating read for SEOs.
Read on for my summary of what Nayak revealed about how Google Search and ranking works – including indexing, retrieval, algorithms, ranking systems, clicks, human raters and much more – plus some additional context from other antitrust trial exhibits that hadn’t been released when I published 7 must-see Google Search ranking documents in antitrust trial exhibits.
Some parts may not be new to you, and this isn’t the full picture of Google Search – much has been redacted during the trial, so we are likely missing some context and other key details. However, what is here is worth digging into.
Google crawls the web and makes a copy of it. This is called an index.
Think of an index you might find at the end of a book. Traditional information retrieval systems (search engines) work similarly when they look up web documents.
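To make the book-index analogy concrete, here is a minimal sketch of the inverted index structure traditional retrieval systems are built on (purely illustrative, not Google's actual implementation):

```python
from collections import defaultdict

# A toy inverted index: each term maps to the set of document IDs
# containing it, much like a book index maps terms to page numbers.
documents = {
    1: "google crawls the web",
    2: "the web is ever changing",
    3: "google builds a comprehensive index",
}

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Looking up a query term returns every document that contains it.
print(inverted_index["web"])     # {1, 2}
print(inverted_index["google"])  # {1, 3}
```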
But the web is ever-changing. Size isn’t everything, Nayak explained, and there’s a lot of duplication on the web. Google’s goal is to create a “comprehensive index.”
In 2020, the index was “maybe” about 400 billion documents, Nayak said. (We learned that there was a period of time when that number came down, though exactly when was unclear.)
“You can keep the size of the index the same if you decrease the amount of junk in it,” Nayak said. “Removing stuff that is not good information” is one way to “improve the quality of the index.”
Nayak also explained the role of the index in information retrieval:
We know Google uses the index to retrieve pages matching the query. The problem? Millions of documents could “match” many queries.
This is why Google uses “hundreds of algorithms and machine learning models, none of which are wholly reliant on any singular, large model,” according to a blog post Nayak wrote in 2021.
These algorithms and machine learning models essentially “cull” the index to the most relevant documents, Nayak explained.
Google’s A guide to Google Search ranking systems contains many ranking systems you’re probably well familiar with by now (e.g., BERT, helpful content system, PageRank, reviews system).
But Nayak (and other antitrust trial exhibits) revealed new, previously unknown systems for us to dig deeper into.
Many years ago, Google used to say it used more than 200 signals to rank pages. That number ballooned briefly to 10,000 ranking factors in 2010 (Google’s Matt Cutts explained at one point that many of Google’s 200+ signals had more than 50 variations within a single factor) – a stat most people have forgotten.
Well, now the number of Google signals is down to “maybe over a hundred,” according to Nayak’s testimony.
The document itself is “perhaps” the most important signal for retrieving documents, Nayak said (which matches what Google’s Gary Illyes said at Pubcon this year).
The key signals, according to Nayak, are:
Here’s the full quote from the trial:
Google uses core algorithms to reduce the number of matches for a query down to “several hundred” documents. Those core algorithms give the documents initial rankings or scores.
Each page that matches a query gets a score. Google then sorts those scores, which determine, in part, what is presented to the user.
Web results are scored using an IR score (IR stands for information retrieval).
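To picture that scoring-and-sorting step, here is a rough sketch. The actual IR score formula was not disclosed, so the scorer below is a stand-in:

```python
def ir_score(document: str, query: str) -> float:
    """Stand-in relevance scorer (fraction of query terms present).
    Google's real IR score is not public."""
    terms = query.lower().split()
    doc = document.lower()
    return sum(term in doc for term in terms) / len(terms)


def cull(matches: dict, query: str, k: int = 300) -> list:
    """Score every matching document, then keep only the top k
    ("several hundred") for the more expensive systems downstream."""
    scored = [(doc_id, ir_score(text, query)) for doc_id, text in matches.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```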
Navboost “is one of the important signals” that Google has, Nayak said. This “core system” is focused on web results and is one you won’t find on Google’s guide to ranking systems. It is also referred to as a memorization system.
The Navboost system is trained on user data. It memorizes all the clicks on queries from the past 13 months. (Before 2017, Navboost memorized historical user clicks on queries for 18 months.)
The system dates back at least to 2005, if not earlier, Nayak said. Navboost has been updated over the years – it is not the same as it was when it was first introduced.
Not to minimize the importance of Navboost, but Nayak also made it clear that it is just one of many signals Google uses. Nayak was asked whether Navboost is “the only core algorithm that Google uses to retrieve results,” and he said “no, absolutely not.”
Navboost helps reduce documents to a smaller set for Google’s machine learning systems – but it can’t help with ranking for any “documents that don’t have clicks.”
Navboost slices
Navboost can “slice” the data it contains by locale information (i.e., the origin location of a query).
When discussing “the first culling” of “local documents,” Nayak noted the importance of retrieving businesses close to a searcher’s particular location (e.g., Rochester, N.Y.) and presenting them to the user “so they can interact with it and create Navboost and so forth.”
In other words, Navboost is a ranking signal that can only exist for results users have already clicked on.
Navboost can also create different datasets (slices) for mobile vs. desktop searches. For each query, Google tracks what kind of device it is made on. Location matters whether the search is conducted via desktop or mobile – and Google has a specific Navboost for mobile.
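Piecing the testimony together, a memorization system like Navboost can be pictured as a giant lookup table of aggregated clicks, sliced by locale and device. The sketch below is purely illustrative; the keys and the boost logic are assumptions, not Google's code:

```python
from collections import Counter

# Aggregated click counts keyed by (query, locale, device, doc_id),
# covering roughly 13 months of logs per the testimony.
navboost = Counter()

def record_click(query, locale, device, doc_id):
    navboost[(query, locale, device, doc_id)] += 1

def click_boost(query, locale, device, doc_id):
    """A document with no historical clicks gets no boost, matching
    Nayak's point that Navboost can't help click-less documents."""
    return navboost[(query, locale, device, doc_id)]

record_click("pizza", "rochester-ny", "mobile", doc_id=42)
record_click("pizza", "rochester-ny", "mobile", doc_id=42)
print(click_boost("pizza", "rochester-ny", "mobile", 42))   # 2
print(click_boost("pizza", "rochester-ny", "desktop", 42))  # 0 (separate slice)
```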
Glue
What is Glue?
“Glue is just another name for Navboost that includes all of the other features on the page,” according to Nayak, confirming that Glue does everything else on the Google SERP that’s not web results.
Glue was also explained in a different exhibit (Prof. Douglas Oard Presentation, Nov. 15, 2023):
Also, as of 2016, Glue was important to Whole-Page Ranking at Google:
We also learned about something called Instant Glue, described in 2021 as a “realtime pipeline aggregating the same fractions of user-interaction signals as Glue, but only from the last 24 hours of logs, with a latency of ~10 minutes.”
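That description suggests Instant Glue differs from Glue mainly in its aggregation window. Here is a hedged sketch of a rolling 24-hour aggregator; the data structures are invented for illustration:

```python
import time
from collections import deque, Counter

class InstantGlueWindow:
    """Toy rolling aggregator that keeps only the last 24 hours of
    interaction events, analogous to Instant Glue's 24-hour window."""

    def __init__(self, window_seconds=24 * 3600):
        self.window = window_seconds
        self.events = deque()    # (timestamp, feature_id)
        self.counts = Counter()  # feature_id -> recent interactions

    def add(self, feature_id, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self.events.append((ts, feature_id))
        self.counts[feature_id] += 1
        self._expire(ts)

    def _expire(self, now):
        # Drop anything older than the window so counts stay fresh.
        while self.events and self.events[0][0] < now - self.window:
            _, feature_id = self.events.popleft()
            self.counts[feature_id] -= 1
```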
Navboost and Glue are two signals that help Google find and rank what ultimately appears on the SERP.
Google “started using deep learning in 2015,” according to Nayak (the year RankBrain launched).
Once Google has a smaller set of documents, deep learning can be used to adjust document scores.
Some deep learning systems are also involved in the retrieval process (e.g., RankEmbed). Most of the retrieval process happens under the core system.
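In other words, the expensive models only see the survivors of the earlier culling. A minimal sketch of that two-stage shape (the adjustment model here is a placeholder, not RankBrain or DeepRank):

```python
def rerank(candidates, query, model):
    """Second stage: adjust the initial scores of a few hundred
    survivors with a learned model, instead of scoring the whole index."""
    reranked = []
    for doc_id, base_score in candidates:
        # model() stands in for a deep system; its output nudges
        # the base score up or down.
        reranked.append((doc_id, base_score + model(query, doc_id)))
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked

# Usage with a dummy model that favors even doc IDs (illustration only):
print(rerank([(1, 0.9), (2, 0.8)], "example query",
             model=lambda q, d: 0.15 if d % 2 == 0 else 0.0))
# [(2, 0.95), (1, 0.9)]
```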
Will Google Search ever trust its deep learning systems entirely for ranking? Nayak said no:
Nayak discussed three main deep learning models Google uses in ranking, as well as how MUM is used.
RankBrain:
DeepRank:
DeepRank needs both language understanding and world knowledge to rank documents, Nayak confirmed. (“The understanding language leads to ranking. So DeepRank does ranking also.”) However, he indicated DeepRank is a bit of a “black box”:
What exactly is world knowledge and where does DeepRank get it? Nayak explained:
RankEmbed BERT:
MUM:
MUM is another expensive Google model so it doesn’t run for every query at “run time,” Nayak explained:
QBST (Query Based Salient Terms) and term weighting are two other “ranking components” Nayak was not asked about. But these appeared in two slides of the Oard exhibit linked earlier.
These two ranking components are trained on rating data. QBST, like Navboost, was referred to as a memorization system (meaning it most likely uses query and click data). Beyond their existence, we learned little about how they work.
The term “memorization systems” is also mentioned in an Eric Lehman email. It may just be another term for Google’s deep learning systems:
Search features are all the other elements that appear on the SERP that are not web results. These features also “get a score.” It was unclear from the testimony whether it’s an IR score or a different kind of score.
We learned a little about Google’s Tangram system, which used to be called Tetris.
The Tangram system adds search features that aren’t retrieved through the web, based on other inputs and signals, Nayak said. Glue is one of those signals.
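A rough sketch of what “adding search features based on other signals” could look like; the threshold logic below is invented for illustration, not Tangram itself:

```python
def assemble_serp(web_results, candidate_features, glue_scores, threshold=0.5):
    """Toy whole-page assembly: start with web results, then add any
    search feature (maps, images, etc.) whose Glue-style interaction
    score clears a threshold."""
    page = list(web_results)
    for feature in candidate_features:
        if glue_scores.get(feature, 0.0) >= threshold:
            page.append(feature)
    return page

print(assemble_serp(["result1", "result2"],
                    ["maps_pack", "image_carousel"],
                    {"maps_pack": 0.8, "image_carousel": 0.2}))
# ['result1', 'result2', 'maps_pack']
```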
You can see a high-level overview of how Freshness in Tetris worked in 2018, in a slide from the Oard trial exhibit:
The IS score is Google’s primary top-level metric of Search quality. That score is computed from search quality rater ratings. It is “an approximation of user utility.”
IS is always a human metric. The score comes from 16,000 human raters around the world.
“…One thing that Google might do is look at queries for inspiration on what it might need to improve on. … So we create samples of queries that – on which we evaluate how well we are doing overall using the IS metric, and we look at – often we look at queries that have low IS to try and understand what is going on, what are we missing here…So that’s a way of figuring out how we can improve our algorithms.”
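That triage workflow (score a sample of queries, then dig into the lowest-IS ones) is easy to picture in miniature. The queries and numbers here are invented:

```python
def lowest_is_queries(is_by_query, n=2):
    """Return the n queries with the lowest IS scores - the ones the
    quote says Google examines to find ranking gaps."""
    return sorted(is_by_query, key=is_by_query.get)[:n]

sample = {
    "best pizza near me": 81.2,
    "obscure error code 0x80042": 54.9,
    "weather": 93.4,
    "niche hobby forum": 60.1,
}
print(lowest_is_queries(sample))
# ['obscure error code 0x80042', 'niche hobby forum']
```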
Nayak provided some context to give you a sense of what a point of IS is:
“Wikipedia is a really important source on the web, lots of great information. People like it a lot. If we took Wikipedia out of our index, completely out of our index, then that would lead to an IS loss of roughly about a half point. … A half point is a pretty significant difference if it represents the whole Wikipedia wealth of information there…”
Sometimes, IS-scored documents are used to train the different models in the Google search stack. As noted in the Ranking section, IS rater data helps train multiple deep learning systems Google uses.
While an IS improvement may not satisfy every specific user, “[Across the corpus of Google users] it appears that IS is well correlated with helpfulness to users at large,” Nayak said.
Google can use human raters to “rapidly” experiment with any ranking change, Nayak said in his testimony.
Nayak also provided some more insights into how raters assign scores to query sets:
Another interesting discovery: Google decided to do all rater experiments with mobile, according to this slide:
Problems with raters
Human raters are asked to “put themselves in the shoes of the typical user that might be there.” Raters are supposed to represent what a general user is looking for. But “every user clearly comes with an intent, which you can only hope to guess,” Nayak said.
Documents from 2018 and 2021 highlight a few issues with human raters:
A slide from a presentation (Unified Click Prediction) indicates that one million IS ratings are “more than sufficient to superbly tune curves via RankLab and human judgment” but give “only a low-resolution picture of how people interact with search results.”
A slide from 2016 revealed that Google Search Quality uses four other main metrics to capture user intent, in addition to IS:
On Live Experiments:
On Freshness:
“One important aspect of freshness is ensuring that our ranking signals reflect the current state of the world.” (2021)
All of these metrics are used for signal development, launches and tracking.
So, if IS only provides a “low-resolution picture of how people interact with search results,” what provides a clearer picture?
Clicks.
No, not individual clicks. We’re talking about ~100 billion examples of clicks, according to the Unified Click Prediction presentation.
As the slide indicates:
“~100,000,000,000 clicks
provide a vastly clearer picture of how people interact with search results.
A behavior pattern apparent in just a few IS ratings may be reflected in hundreds of thousands of clicks, allowing us to learn second and third order effects.”
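The “resolution” framing is just sampling noise: the standard error of an estimated behavior pattern shrinks with the square root of the sample size, so ~100 billion clicks resolve patterns far more finely than one million ratings. A quick check of the arithmetic:

```python
import math

ratings = 1_000_000
clicks = 100_000_000_000

# Standard error of an estimate scales as 1/sqrt(n), so the relative
# gain in resolution is sqrt(clicks / ratings).
gain = math.sqrt(clicks / ratings)
print(f"~{gain:.0f}x finer resolution")  # ~316x
```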
Google illustrates an example with a slide:
Google seems to equate using clicks with memorizing rather than understanding the material. Like how you can read a whole bunch of articles about SEO but not really understand how to do SEO. Or how reading a medical book doesn’t make you a doctor.
Let’s dig deeper into what the Unified Click Prediction presentation has to say about clicks in ranking:
Google’s goal is to figure out what users will click on. But, as this slide shows, clicks are a proxy objective:
The next three slides dive into click prediction, all titled “Life Inside the Red Triangle.” Here’s what Google’s slides tell us:
Were your click predictions better or worse than the baseline?
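Comparing “better or worse than the baseline” typically means comparing a loss such as log-loss on held-out clicks. The slides don’t name a metric, so this minimal sketch is an assumption:

```python
import math

def log_loss(predicted_probs, clicked):
    """Average negative log-likelihood of observed clicks (1) and
    skips (0) under the model's predicted click probabilities."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(predicted_probs, clicked):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(clicked)

clicks = [1, 0, 1, 1]
baseline = log_loss([0.5, 0.5, 0.5, 0.5], clicks)  # uninformed baseline
model = log_loss([0.9, 0.2, 0.7, 0.8], clicks)     # candidate predictor
print(model < baseline)  # True -> better than the baseline
```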
Whenever Google talks about collecting user data for X number of months, that’s all “the queries and the clicks that occurred over that period of time,” from all users, Nayak said.
If Google were launching just a U.S. model, it would train its model on a subset of U.S. users, for example, Nayak said. But for a global model, it will look at the queries and clicks of all users.
Not every click in Google’s collection of session logs has the same value. Also, fresher user, click and query data is not better in all cases.
Nayak said there is a point of diminishing returns:
“…And so there is this trade-off in terms of amount of data that you use, the diminishing returns of the data, and the cost of processing the data. And so usually there’s a sweet spot along the way where the value has started diminishing, the costs have gone up, and that’s where you would stop.”
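That trade-off is the classic concave-value-versus-linear-cost picture. A tiny illustration with made-up curves (the constants mean nothing; only the shape matters):

```python
import math

def net_value(months_of_data, cost_per_month=1.0):
    """Made-up curves: value grows with diminishing returns (sqrt),
    processing cost grows linearly; net value peaks at a sweet spot."""
    return 10 * math.sqrt(months_of_data) - cost_per_month * months_of_data

best = max(range(1, 37), key=net_value)
print(best)  # 25 months with these toy constants
```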
No, the Priors algorithm is not an algorithm update, like a helpful content, spam or core update. In these two slides, Google highlighted its take on “the choice problem.”
“The idea is to score the doors based on how many people took it.
In other words, you rank the choices based on how popular it is.
This is simple, yet very powerful. It is one of the strongest signals for much of Google’s search and ads ranking! If we know nothing about the user, this is probably the best thing we can do.”
Google explains its personalized “twist” – looking at who went through each door and what actions describe them – in the next slide:
“We bring two twists to the traditional heuristic.
Instead of attempting to describe – through a noisy process – what each door is about, we describe it based on the people who took it.
We can do this at Google, because at our scale, even the most obscure choice would have been exercised by thousands of people.
When a new user walks in, we measure their similarity to the people behind each door.
This brings us to the second twist, which is that while describing a user, we don’t not [sic] use demographics or other stereotypical attributes.
We simply use a user’s past actions to describe them and match users based on their behavioral similarity.”
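Putting the two twists together in a toy model (every name and formula below is hypothetical; the slides give the idea, not an implementation):

```python
from collections import Counter

def cosine(a, b):
    """Similarity between two users' behavioral profiles."""
    dot = sum(a[k] * b[k] for k in a)
    norm = (sum(v * v for v in a.values()) ** 0.5 *
            sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

def score_door(door_takers, new_user):
    """Twist 1: a popularity prior (how many people took the door).
    Twist 2: weight by behavioral similarity of past takers to the
    new user - no demographics, only past actions."""
    prior = len(door_takers)
    similarity = sum(cosine(new_user, taker) for taker in door_takers)
    return prior + similarity  # toy combination

door_a = [Counter({"sports": 3, "news": 1}), Counter({"sports": 2})]
door_b = [Counter({"cooking": 4})]
user = Counter({"sports": 5})
print(score_door(door_a, user) > score_door(door_b, user))  # True
```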
One final tidbit comes from a Hal Varian email that was released as a trial exhibit.
The Google we know today is a result of a combination of countless algorithm tweaks, millions of experiments and invaluable learnings from end-user data. Or, as Varian wrote:
“One of the topics that comes up constantly is the ‘data network effect’ which argues that
High quality => more users => more analysis => high quality
Though this is more or less right, 1) it applies to every business, 2) the ‘more analysis’ should really be ‘more and better analysis.’
Much of Google’s improvement over the years has been due to thousands of people … identifying tweaks that have added up to Google as it is today.
This is a little too sophisticated for journalists and regulators to recognize. They believe that if we just handed Bing a billion long-tail queries, they would magically become a lot better.”