User-sensitive PageRank and Prabhakar Raghavan
Google made Prabhakar Raghavan, inventor of user-sensitive PageRank, Head of Google Search.
From Search Engine Land (highlight mine):
Prabhakar Raghavan, who was running Ads and Commerce (since 2018), will replace Ben Gomes as the new head of Search and Assistant. Search encompasses News, Discover, Podcasts and Google Assistant. Raghavan’s got a long history in search, having worked on it at IBM in 1995, followed by a position at Stanford where he taught the first course in its computer science department on search. He also authored a foundational text on the subject.
Google announced a reorganization on Thursday in which execs that lead its Search, Assistant and Ads business segments will report to new Search head Prabhakar Raghavan.
Raghavan will report directly to CEO Sundar Pichai.
The move comes as Google expects a potential hit from antitrust regulators who have targeted its Search, Ads and Android business units. The U.S. Department of Justice and nearly 50 Attorneys are expected to bring a lawsuit in the next few months and have discussed actions of potentially breaking up the company.
There are many, many interesting points to mention about this move.
First, the statement that Search includes News, Discover, Podcasts, and Google Assistant speaks loudly about the coming priorities for SEOs (don’t overvalue Assistant).
Second, Raghavan is incredibly accomplished and one of the most important people in information retrieval and Search in the world. He co-authored one of the most fundamental books in information retrieval, “Introduction to Information Retrieval” (link leads to the full book), in 2008. It’s the type of book with a table of contents that spans 5 pages. He has also published tons of papers and registered many patents.
One of the most fundamental patents he registered is for an innovation called User-Sensitive PageRank. Yes, we have to be cautious with patents because we don’t know for sure whether Google uses the described technology or not. At the same time, they can expand our understanding of SEO.
After working through the patent for many hours, I think it describes many trends we see in search nowadays. I think it’s crucial.
Because when we look at the patent, we see that it was filed in 2009… and not granted before November 2016.
Towards the end of 2016 is around the time we saw RankBrain coming out and many updates rolling out, as I described in the Impact Analysis of the May Core Update 2020 (or episode #89).
So, let’s tear this patent apart (figuratively).
The core idea of user-sensitive PageRank is that classic PageRank is flawed. It sounds nice in theory but is not practical.
The patent calls out 4 reasons:
- The Random Surfer Model suggests that all links on a page are clicked with equal probability. But that’s not the case in real life: some links, like “disclaimer” or “terms of service”, are rarely clicked.
- It assumes users start a new session on all pages with equal probability. In reality, some sites have a higher likelihood of being called up for information because they’re trusted more, e.g. Wikipedia.
- The idea of TrustRank was predicated on fighting link spam but not document ranking or monitoring user behavior (more on that in a second).
- PageRank is aggregated on a site level (“PageRank Blocks”), but the size and dynamics of the web demand that PageRank needs to also be measured on a document (page) level.
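To make the flaws the patent calls out concrete, here is a minimal sketch of classic PageRank. Note the two assumptions the patent criticizes baked right in: every outgoing link gets an equal share of a page’s rank, and teleportation jumps to every page with equal probability. The graph and parameters are illustrative, not from the patent.

```python
def classic_pagerank(outlinks, damping=0.85, iterations=50):
    """Classic PageRank sketch: uniform link weighting and
    uniform teleportation, the two flaws the patent targets."""
    n = len(outlinks)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1 - damping) / n] * n  # uniform teleportation
        for page, links in enumerate(outlinks):
            if links:
                # Spread this page's rank EQUALLY over its outgoing links,
                # whether they point at content or at "terms of service".
                share = rank[page] / len(links)
                for target in links:
                    new_rank[target] += damping * share
            else:
                # Dangling page: redistribute its rank uniformly.
                for i in range(n):
                    new_rank[i] += damping * rank[page] / n
        rank = new_rank
    return rank

# Page 0 links to pages 1 and 2; page 1 links to 2; page 2 links back to 0.
ranks = classic_pagerank([[1, 2], [2], [0]])
```

Page 2 ends up with more rank than page 1 because it receives links from both other pages — but the model has no idea whether anyone actually clicks those links.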
The latter is especially interesting because that’s the idea of TIPR:
PageRank exists between and within sites. The PR value of a page changes completely when we add backlinks to the internal PR calculation.
Note that incoming and outgoing links can be internal (same site) or external (from other sites). When it comes to outgoing links, it doesn’t matter much whether the link points at a page on the same domain or on another one. With incoming links, however, the difference can be big.
To get more accurate values, we need to merge backlink data with internal PR data. That’s what “TIPR” is about.
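One way to sketch the TIPR merge — and this is my own illustrative interpretation, not a formula from the patent — is to replace uniform teleportation with a teleportation vector seeded from external backlink strength, so backlink-rich pages lift the pages they link to internally:

```python
def tipr(internal_links, external_pr, damping=0.85, iterations=50):
    """Illustrative TIPR-style sketch: internal PageRank where
    teleportation is biased toward pages with external backlinks."""
    n = len(internal_links)
    total = sum(external_pr)
    seed = [v / total for v in external_pr]  # normalized backlink signal
    rank = seed[:]
    for _ in range(iterations):
        # Teleport to backlink-rich pages instead of uniformly.
        new_rank = [(1 - damping) * s for s in seed]
        for page, links in enumerate(internal_links):
            if links:
                share = rank[page] / len(links)
                for target in links:
                    new_rank[target] += damping * share
            else:
                for i in range(n):
                    new_rank[i] += damping * rank[page] / n
        rank = new_rank
    return rank

# Page 0 attracts strong external backlinks; pages 1 and 2 barely any.
ranks = tipr([[1, 2], [2], [0]], external_pr=[5.0, 1.0, 0.1])
```

Compared to the uniform model, the internal PR values shift toward pages reachable from the backlink-rich page 0 — exactly the “PR value changes completely when we add backlinks” effect described above.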
We can summarize the problems with classic PageRank as follows: its core flaws are uniform link weighting and uniform teleportation. Not every link is the same, and not every follow-up session has the same probability.
An example of the latter: someone reads a definition on a page and then clicks on a link to Wikipedia on that page to learn more. That would give Wikipedia a higher Authority Value.
The idea of Authority value is to blend user behavior data with classic PageRank models.
Yes, you read that right.
Authority Value is a sum of three components. I tried to visualize the concept below:
Everything starts with a document (“1. Doc”, in the upper left corner).
The first component is measuring the chance of a user staying on this document or clicking on an outgoing link to the first or second subset. Google weighs each outgoing link by how often it’s clicked and how topically relevant it is to the link target.
The second component is the chance of a user starting a new session to go to a second or third or nth subset of pages. Imagine this like searching for something, clicking on a result, and then leaving that page for another one (either by clicking on the back button in the browser or typing in a new URL).
The third component is the authority of the first document. This is close to classic PageRank.
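Putting the three components together, Authority Value might look like the sketch below. The weights and function names are my own assumptions for illustration — the patent describes the components, not this exact arithmetic:

```python
def authority_value(click_weighted_link_value, session_start_prob,
                    base_pagerank, w_links=0.5, w_sessions=0.3, w_base=0.2):
    """Hypothetical blend of the three components:
    1. click-weighted value flowing in through links,
    2. the chance of users starting a new session on the document,
    3. a classic PageRank-style base authority."""
    return (w_links * click_weighted_link_value
            + w_sessions * session_start_prob
            + w_base * base_pagerank)

# Two pages with identical classic PageRank (0.4): one whose inbound
# links get clicked and that users open directly, one that's ignored.
popular = authority_value(0.8, 0.6, 0.4)
ignored = authority_value(0.1, 0.05, 0.4)
```

Under this blend, the page users actually engage with scores far higher than the one with the same raw link graph — which is the whole point of mixing user behavior into PageRank.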
The whole predicament is whether a user would click a link or not, and measuring user behavior is key for that. However, that’s only reliable on pages with high traffic. The patent specifies a confidence factor that tunes user behavior signals depending on how much traffic the document gets. Those signals could be link clicks, time on page, recency of visits, and tenure.
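A simple way to picture the confidence factor — again a sketch under my own assumptions (the traffic threshold and blend formula are illustrative, not from the patent): on low-traffic pages, observed click data is too sparse to trust, so it gets blended back toward a uniform prior.

```python
def blended_click_weights(observed_clicks, visits,
                          full_confidence_visits=1000):
    """Blend observed per-link click shares with a uniform prior,
    weighted by a traffic-based confidence factor in [0, 1]."""
    confidence = min(1.0, visits / full_confidence_visits)
    n = len(observed_clicks)
    total = sum(observed_clicks) or 1
    uniform = 1.0 / n
    return [confidence * (c / total) + (1 - confidence) * uniform
            for c in observed_clicks]

# Same click pattern (one dominant link), very different traffic levels.
high = blended_click_weights([90, 5, 5], visits=5000)
low = blended_click_weights([90, 5, 5], visits=50)
```

With heavy traffic the observed click distribution dominates; with thin traffic the weights stay close to uniform — i.e. close to classic PageRank’s equal-weight assumption.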
It even goes as far as stating that some users might have a higher weight than others based on their demographic segment (referred to as “user segment personalized PageRank”). Data could be aggregated based on demographics, profile, characteristics, device, etc.
The other thing that stood out to me was the idea of “Teleportation”, which describes users visiting a website without following a link but as part of the same thought process. In plain words: if there is no link between pages A and B, but a user visits page B right after page A, and pages A and B are typically related to each other, page B would receive Authority Value.
Another part of Authority Value is time. Pages or sites with old links and no recent visits decay over time.
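The patent doesn’t spell out a decay formula, so here is a minimal sketch assuming exponential decay with a hypothetical one-year half-life — the shape and the half-life value are my assumptions:

```python
def decayed_link_value(base_value, days_since_last_visit,
                       half_life_days=365):
    # Exponential decay: the value halves every `half_life_days`
    # without fresh visits (half-life chosen for illustration only).
    return base_value * 0.5 ** (days_since_last_visit / half_life_days)

fresh = decayed_link_value(1.0, 30)    # visited a month ago
stale = decayed_link_value(1.0, 1460)  # ~4 years without visits
```

Under these assumptions, a page untouched for four years retains only about 6% of its original link value — which is the intuition behind “don’t let a page with good backlinks dry out” below.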
It also mentions that Authority Value could even be a major driver of the web crawler, which is in line with the idea that crawl budget is largely impacted by PageRank.
All Authority Value factors:
- # of outgoing links from a page
- Likelihood of each of those links being clicked based on user behavior
- Aggregation based on page, document, site or host
- Topical relevance between document and (external) linked pages (further specified in the patent on Topical PageRank)
- Whether the outgoing link is a search result or not
- Age of user
- Gender of user
So, what do we make of this?
Again, we need to be a bit careful with patents but I see some practical applications:
- PR sculpting = dead. This is nothing new but I still see internal nofollow on many sites and this patent is another nail in the coffin for me.
- Measure link clicks on high-trafficked pages. I will pay more attention to what links I have on the most trafficked pages of a domain and which links are actually being clicked. Heatmaps or referral data, e.g. from GA, could be the most helpful here.
- Get fresh backlinks to your pages. Don’t let a page that has good backlinks dry out.
- Measure direct traffic to non-homepage pages. Direct traffic could come from bookmarks, saved URLs in the browser bar (you know, they show up when you start typing something) or other sources (dark traffic), and be a potential indicator of Teleportation.