Internal Link Optimization with TIPR

Internal link optimization is incomplete without factoring in backlinks. In this article, I introduce a model called TIPR that helps you to optimize the internal link graph of your site.

Dec 04, 2018

Once you understand the concept of power laws, you see them everywhere. A few people own most of the wealth. A few startup investments yield the highest returns. Power laws are also known as the Pareto principle or the 80/20 rule.

How is that relevant for SEO?

Power laws also exist in SEO. A few short-head keywords have as much search volume as thousands of long-tail keywords. A few pages bring in the most traffic and get the most backlinks. And a few pages are crawled the most often.

That imbalance is something we should seek to optimize, especially when it comes to the internal link graph.

We know PageRank is one of the biggest ranking factors in SEO, but we often forget that it flows between and within websites! The common practice is to optimize internal linking without taking backlinks into account. That, however, can lead to making the wrong decisions.

In this article, I present a model called “TIPR” that seeks to fill that gap.

It’s a follow-up to a presentation I held at Tech SEO Boost in Boston in November 2018.

Internal link graph optimization is one of the strongest SEO levers

SEO has become so much more complex in the last 5 years. To be successful in complex environments, you need to focus on things that you know to be impactful. Internal linking is one of those needle movers. It’s completely controllable and measurable and yields compounding results over time.

The key to optimizing internal linking effectively lies in understanding how PageRank flows through a site and then skewing it towards pages on which users convert.

Once a site reaches a certain size, say >1,000 pages, internal linking has an even higher impact. Bigger sites tend to have more backlinks, and the large number of anchor texts from internal links gives Google more anchor points to understand the relevance of a page for a certain topic. Conversely, if your site has only a few pages, say 100-200, the tactics recommended in this article have a limited impact.

The forgotten concept of CheiRank

Like a network of connected sites, pages within a site form a link graph. On that graph, some pages are stronger, and some are weaker, depending on how much PageRank they receive and how many outgoing links they have. The latter is known as CheiRank or inverse PageRank and describes how much a page links out in relation to the whole link graph of a site. (Thanks to JR Oakes for making me aware of a better explanation here)

The CheiRank is an eigenvector with a maximal real eigenvalue of the Google matrix constructed for a directed network with the inverted directions of links. It is similar to the PageRank vector, which ranks the network nodes in average proportionally to a number of incoming links being the maximal eigenvector of the Google matrix G with a given initial direction of links. Due to inversion of link directions the CheiRank ranks the network nodes in average proportionally to a number of outgoing links. Since each node belongs both to CheiRank and PageRank vectors the ranking of information flow on a directed network becomes two-dimensional. (Source: https://en.wikipedia.org/wiki/CheiRank, bolding by author)

If you want to dive deeper into the math, check out this paper. If you want to stay off the math (no judgment), what you need to keep in mind is the higher a page’s CheiRank, the more PageRank it gives away. A high CheiRank often indicates hub pages. That can be good or bad, depending on how much PageRank they receive. Pages can also give away too much PageRank, and then it’s a problem. Generally, you want PageRank (PR) and CheiRank (CR) to be somewhat balanced for an optimal internal link structure.
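To make the relationship between the two metrics concrete, here is a minimal sketch in plain Python. It assumes the textbook power-iteration form of PageRank with a damping factor of 0.85 and computes CheiRank by running the same function on the graph with all links reversed. The function names and the five-page example site are made up for illustration, and crawling tools may calculate their values differently.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # page -> pages it links to


def all_pages(links: Graph) -> List[str]:
    """Every page that appears as a link source or a link target."""
    return sorted(set(links) | {t for targets in links.values() for t in targets})


def pagerank(links: Graph, damping: float = 0.85, iterations: int = 100) -> Dict[str, float]:
    """Textbook power-iteration PageRank; the values sum to 1."""
    pages = all_pages(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank sitting on pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping + damping * dangling) / n for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


def cheirank(links: Graph, damping: float = 0.85, iterations: int = 100) -> Dict[str, float]:
    """CheiRank is PageRank computed on the graph with every link reversed."""
    reversed_links: Graph = {page: [] for page in all_pages(links)}
    for source, targets in links.items():
        for target in targets:
            reversed_links[target].append(source)
    return pagerank(reversed_links, damping, iterations)


# Hypothetical five-page site, not real data.
site = {
    "/": ["/hub", "/product"],
    "/hub": ["/", "/product", "/blog-a", "/blog-b"],
    "/product": ["/"],
    "/blog-a": ["/hub"],
    "/blog-b": ["/hub"],
}
pr = pagerank(site)
cr = cheirank(site)
```

In this toy graph, "/hub" links out to every other page and ends up with the highest CheiRank, which matches the idea that a high CR often indicates a hub page.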

The 3 challenges of internal link graph optimization

When we add a layer of page types, like homepage, category pages, or product pages, to the graph, we notice that some page types have a higher PageRank on average than others. Often the homepage has the highest PR, though not in every case. On www.atlassian.com, the Jira product landing page has a higher PR than the homepage, for example.

An optimized internal link graph distributes PageRank towards conversion pages. For that to happen, we need to know and tweak the PR and CR of each page. The link graph is in constant flux because a single link can change the values of every page.

That puts us in front of three challenges:

Internal PageRank is inaccurate without factoring in backlinks
Conversion pages are not always the ones that need the most PageRank
The optimal internal PageRank distribution depends on your business model

Let me explain each of the challenges in detail and how we could solve them.

Internal PageRank is incomplete without Backlinks

The first challenge, inaccurate internal PR values, is what I want to solve with the TIPR model. The currently available tools and approaches to optimize internal PageRank flow are helpful but miss one crucial component: backlinks.

Let’s take a step back.

You probably know graphs like this one (screenshot).

It shows the concept of internal linking. Each page has a certain amount of incoming and outgoing links. We can calculate the PR of each page from such a model by applying the PageRank algorithm. We can also crawl a site with a tool like Screaming Frog, Botify, Audisto, Searchmetrics, and others to calculate the internal PR of each URL.

However, that value is not accurate because the PR (and CR) of a page is determined by both its internal and its external links.

Again, PageRank exists between and within sites. The PR value of a page changes completely when we add backlinks to the internal PR calculation.

Note that incoming and outgoing links can be internal (same site) or external (from other sites). When it comes to outgoing links, it doesn’t matter much whether the link points to a page on the same domain or to another one. With incoming links, however, the difference can be big.

To get more accurate values, we need to merge backlink data with internal PR data. That’s what “TIPR” is about.
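To illustrate why backlinks change the picture, here is a small sketch. It is not how TIPR merges the data (TIPR ranks and averages the metrics, as described in the steps further down). Instead it uses a swapped-in technique, personalized PageRank, where the teleport vector is weighted by external links. The weighting of 1 plus the number of referring domains, the function name, and the example site are all assumptions made for illustration.

```python
from typing import Dict, List


def pagerank_with_teleport(
    links: Dict[str, List[str]],
    teleport: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 100,
) -> Dict[str, float]:
    """Personalized PageRank: random jumps land on pages in proportion to `teleport`."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    total = sum(teleport.get(p, 0.0) for p in pages)
    if total <= 0:
        raise ValueError("teleport weights must sum to a positive number")
    weight = {p: teleport.get(p, 0.0) / total for p in pages}
    rank = dict(weight)
    for _ in range(iterations):
        # Rank on pages without outgoing links is redistributed via the teleport vector.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping + damping * dangling) * weight[p] for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical site and backlink counts, not real data.
site = {
    "/": ["/category"],
    "/category": ["/", "/product", "/deep-article"],
    "/product": ["/"],
    "/deep-article": ["/category"],
}
referring_domains = {"/": 5, "/deep-article": 40}

# Uniform teleport: only internal links matter.
internal_only = pagerank_with_teleport(site, {page: 1.0 for page in site})
# Assumption: every page gets a baseline of 1 plus its referring domains.
with_backlinks = pagerank_with_teleport(
    site, {page: 1.0 + referring_domains.get(page, 0) for page in site}
)
```

With these made-up numbers, "/deep-article" scores higher once its backlinks are included than it does in the internal-only calculation, even though no internal link changed.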

TIPR – True Internal PageRank

TIPR, “True Internal PageRank”, is a concept that helps us to get closer to the real PageRank value of a page. It combines four data streams: PageRank, CheiRank, backlinks, and log files.

A more accurate internal link graph is the result of combining PR, CR, and backlinks. The log files are there to measure progress over time because Googlebot crawls correlate highly with PageRank. Google has said many times that it uses PR as a gauge for crawl budget.

A Web crawler may use PageRank as one of a number of importance metrics it uses to determine which URL to visit during a crawl of the web. One of the early working papers that were used in the creation of Google is Efficient crawling through URL ordering, which discusses the use of a number of different importance metrics to determine how deeply, and how much of a site Google will crawl. PageRank is presented as one of a number of these importance metrics, though there are others listed such as the number of inbound and outbound links for a URL, and the distance from the root directory on a site to the URL.
(Source: https://en.wikipedia.org/wiki/PageRank)

However, it’s not the only factor.

It is worth noting that crawl frequency is not just based on PageRank, either. Both Google’s Andrey Lipattsev and Gary Illyes have remarked in separate webinars recently that PageRank is not the sole driver for crawling or ranking, with Lipattsev saying, “This (PageRank) has become just one thing among very many things.” (source: https://searchengineland.com/crawl-budget-url-scheduling-might-impact-rankings-website-migrations-255624)

To build a TIPR model for your site, follow these 5 steps:

Site crawl
Calculate internal PageRank and CheiRank
Pull backlinks per URL
Use the crawl rate per URL to monitor the impact over time
Sort and rank metrics

The basis of TIPR is a crawl that gets you a list of pages, in the best case with an internal PR and CR value for each. You can either use Paul Shapiro’s approach with Screaming Frog or one of the tools I mentioned earlier (Botify, Searchmetrics, ContentKing, Audisto, Ryte etc.) for this step.

Pull backlinks from a tool of your choice. You can export a couple of values here:

All backlinks of a domain
The number of backlinks per URL
A proprietary metric per URL (like Page Authority from Moz)

The problem with exporting all backlinks of a domain or a proprietary metric is that the data volume for large sites quickly becomes impractical. Large sites simply have too many backlinks to export them all and map them to each URL. Thus, I only recommend this approach for smaller sites. For the big ones, export the DomainPop per URL. I found Domain Popularity, the number of referring domains to a URL, to be sufficient for the model.
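For smaller sites where a raw backlink export is feasible, Domain Popularity per URL can be derived from it. The sketch below assumes a hypothetical export of (source URL, target URL) pairs and counts distinct referring hostnames per target. Treating the hostname (minus a leading "www.") as the domain is a simplification, and backlink tools may count referring domains differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple
from urllib.parse import urlsplit


def domain_pop(backlinks: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct referring hostnames per target URL."""
    referring = defaultdict(set)
    for source_url, target_url in backlinks:
        host = (urlsplit(source_url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.org and example.org as one domain
        if host:
            referring[target_url].add(host)
    return {url: len(hosts) for url, hosts in referring.items()}


# Hypothetical export rows, not real data.
export = [
    ("https://www.example.org/post-1", "/product"),
    ("https://example.org/post-2", "/product"),
    ("https://blog.example.net/review", "/product"),
    ("https://example.com/list", "/pricing"),
]
pop = domain_pop(export)
```

Here the two links from example.org count once, so "/product" ends up with two referring domains and "/pricing" with one.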

Export your server log files and extract Googlebot’s crawl rate per URL. You want to get the weekly or monthly crawl rate per URL. Export this data several times to monitor progress: first before you make changes, as a benchmark value; then right after; and then again after 2-4 weeks. The idea is to continuously monitor the crawl rate throughout the changes you make based on the TIPR model. The times are not set in stone and rather drawn from my personal experience.
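Extracting the crawl rate can be sketched in a few lines. The example below assumes access logs in the common Apache/Nginx "combined" format and identifies Googlebot by its user-agent string only. Both are assumptions: your log format may differ, and user agents can be spoofed, so a production setup would also verify the requesting IP. The function name and sample lines are made up.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable

# Assumed "combined" log format: ip, identity, user, [time], "request", status, bytes, "referrer", "agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def weekly_googlebot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per (path, ISO week) whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        # Month abbreviations like "Dec" are parsed using the current locale.
        timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        year, week, _ = timestamp.isocalendar()
        hits[(match.group("path"), f"{year}-W{week:02d}")] += 1
    return hits


# Hypothetical log lines, not real data.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [04/Dec/2018:10:15:32 +0000] "GET /product HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [05/Dec/2018:11:02:10 +0000] "GET /product HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    '203.0.113.7 - - [05/Dec/2018:11:05:00 +0000] "GET /product HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
hits = weekly_googlebot_hits(sample)
```

Running the same function on exports taken before and after your changes gives you comparable weekly numbers per URL.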

When we made changes on the mentioned Atlassian site, for example, we noticed an increase in crawl rate across the site (shown) and of the pages that we optimized (not shown) with TIPR.

The last step is sorting and ranking the data you pulled. You should now have a spreadsheet with several columns: URL, PageRank, CheiRank, DomainPop or # of backlinks, and crawl frequency. For each column, you want to give a rank from highest to lowest, meaning the highest PageRank value is ranked 1st. Then, average the ranks in each row and sort the list from strong to weak.

When you want to factor several values on different scales together, say Googlebot hits and PageRank, ranking is a helpful tool. Be aware that we could apply a different weight to each factor, and I’m still toying around with the right balance.
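The ranking step can be sketched in plain Python instead of a spreadsheet. The example below ranks each metric from highest to lowest, averages the ranks per URL, and sorts from strong to weak. The optional weights parameter is an assumption reflecting the note on weighting above (equal weights by default), and the function names and values are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def rank_descending(values: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 goes to the highest value; tied values share the best rank."""
    first_position: Dict[float, int] = {}
    for position, value in enumerate(sorted(values.values(), reverse=True), start=1):
        first_position.setdefault(value, position)
    return {key: first_position[value] for key, value in values.items()}


def tipr_order(
    rows: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[Tuple[str, float]]:
    """Average the per-metric ranks of each URL and sort from strongest to weakest."""
    metrics = list(next(iter(rows.values())))
    weights = weights or {metric: 1.0 for metric in metrics}
    ranks = {
        metric: rank_descending({url: row[metric] for url, row in rows.items()})
        for metric in metrics
    }
    total_weight = sum(weights[metric] for metric in metrics)
    scores = {
        url: sum(weights[metric] * ranks[metric][url] for metric in metrics) / total_weight
        for url in rows
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical values, not real data.
rows = {
    "/": {"pagerank": 0.30, "cheirank": 0.20, "domain_pop": 120, "crawls": 900},
    "/product": {"pagerank": 0.25, "cheirank": 0.05, "domain_pop": 40, "crawls": 300},
    "/blog/post": {"pagerank": 0.05, "cheirank": 0.10, "domain_pop": 5, "crawls": 20},
}
ordered = tipr_order(rows)
```

A lower average rank means a stronger page, so the first entry in the result is the strongest URL in the list.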

Now it’s time to take action on your data.

Optimizing your internal link structure for conversion pages

After going through the previous chapter, you should have a list of URLs ranked from strongest to weakest. So, what do you do with that list?

An optimal link structure strives for a state defined by three factors:

No page hoards PageRank (many incoming links, few outgoing links)
No page gives away too much PageRank (few incoming links, many outgoing links)
Conversion pages are preferred

Conversion pages are pages on which users convert in the form of a sign-up, lead form or purchase.

The first two factors hint at a general problem with internal link graphs: they’re inherently imbalanced. I mentioned this earlier in conjunction with SEO power laws.

Broadly, you should strive for a more equal distribution of PageRank, with a slight preference for conversion pages. I call that the “Robin Hood principle”: you take PageRank from the strong and give it to the poor.

Compare the incoming and outgoing links in the screenshot below. Applying TIPR to an Atlassian site helped us identify that category pages hoard way too much PageRank, while other pages give away too much PageRank, which they don’t have in the first place.

This is what I meant earlier by adding a layer of page types over the internal link graph. Look for URL patterns in your ranked and sorted list to figure out which page types, often identified by subdirectories, are hoarders and which ones overspend.
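Looking for URL patterns can be partly automated. The sketch below groups URLs by their first subdirectory and sums PR and CR per group. It assumes that the first path segment is a usable proxy for page type, which will not hold for every site. The ratio of PR to CR is a heuristic I made up for this example: well above 1 suggests a hoarder, well below 1 suggests an overspender.

```python
from collections import defaultdict
from typing import Dict, Tuple
from urllib.parse import urlsplit


def page_type(url: str) -> str:
    """Use the first subdirectory as page type; other URLs count as top level."""
    segments = [segment for segment in urlsplit(url).path.split("/") if segment]
    return f"/{segments[0]}/" if len(segments) > 1 else "(top level)"


def summarize_by_page_type(pages: Dict[str, Tuple[float, float]]) -> Dict[str, Dict[str, float]]:
    """Sum PageRank and CheiRank per page type and compare the two."""
    totals = defaultdict(lambda: {"pages": 0, "pagerank": 0.0, "cheirank": 0.0})
    for url, (pr, cr) in pages.items():
        bucket = totals[page_type(url)]
        bucket["pages"] += 1
        bucket["pagerank"] += pr
        bucket["cheirank"] += cr
    summary = {}
    for name, bucket in totals.items():
        # Heuristic: ratio > 1 hints at hoarding, ratio < 1 at overspending.
        ratio = bucket["pagerank"] / bucket["cheirank"] if bucket["cheirank"] else float("inf")
        summary[name] = {**bucket, "pr_to_cr": ratio}
    return summary


# Hypothetical (PageRank, CheiRank) values per URL, not real data.
pages = {
    "/": (0.35, 0.25),
    "/category/a": (0.30, 0.05),
    "/category/b": (0.25, 0.05),
    "/blog/x": (0.05, 0.30),
    "/blog/y": (0.05, 0.35),
}
summary = summarize_by_page_type(pages)
```

In this made-up data set, the "/category/" group receives far more PR than it passes on, while the "/blog/" group does the opposite.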

The optimal PageRank distribution for distributed vs. centralized conversion pages

Depending on the business of a site, it has one of two types of conversion pages:

Centralized: A few product landing pages, e.g. Salesforce or Atlassian
Distributed: Many pages, or instances, as part of the product, e.g. Pinterest or Trello

Of course, almost every site has dedicated sign-up landing pages, but let’s leave those out for a moment because they usually rank for queries like “{brand} signup”.

We often see the centralized approach at SaaS or enterprise companies and the distributed approach at consumer-facing companies like social networks or e-commerce sites. However, neither approach is exclusive to those industries.

The centralized structure has a few dedicated conversion pages on which users can sign up for a lead form or product trial. The conversion pages target brand (e.g. “Jira”) and transactional generic queries (e.g. “project management tool”). The latter type of query is highly competitive; hence, those pages need a lot of PR to rank.

The power law of internal linking, meaning that a few pages on a domain receive the most incoming links and therefore have the highest PageRank, still has to be optimized for centralized conversion pages. Often it’s not the conversion pages that hold the highest PR.

The distributed structure is followed by sites with lots of pages on which users can convert. Think of Pinterest, on which every board is a potential sign-up page. Or an e-commerce site with hundreds of thousands of products. The optimal approach to internal link optimization is different here: we want a more equal distribution of PR.

It’s important to be aware of this difference when optimizing internal linking to make the right choices.
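To put a number on how equal or unequal a PageRank distribution is, one option is the Gini coefficient, the same measure used for wealth inequality. This is my own suggestion for illustration, not part of the TIPR model: 0 means every page holds the same PR, and values close to 1 mean a few pages hold almost all of it. The example values are made up.

```python
from typing import Iterable


def gini(values: Iterable[float]) -> float:
    """Gini coefficient of non-negative values: 0 = perfectly equal, near 1 = concentrated."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values with 1-based positions.
    weighted = sum((position + 1) * value for position, value in enumerate(ordered))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


# Hypothetical PageRank distributions, not real data.
centralized = [0.70, 0.10, 0.10, 0.05, 0.05]  # a few landing pages hold most PR
distributed = [0.22, 0.20, 0.20, 0.19, 0.19]  # PR spread across many pages
gini_centralized = gini(centralized)
gini_distributed = gini(distributed)
```

Tracking this value before and after internal linking changes shows whether a distributed site is actually moving towards a more equal PR distribution.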

Optimal internal linking depends on the right data and business model

I mentioned three challenges of internal link optimization earlier:

Internal PageRank is inaccurate without factoring in backlinks
Conversion pages are not always the ones that need the most PageRank
The optimal internal PageRank distribution depends on your business model

The TIPR model, together with the awareness of power laws and different conversion page structures, should help you make the right choices and solve these challenges.

To summarize this article broadly, it doesn’t make a lot of sense to apply a cookie-cutter approach to internal linking. We need to collect the most accurate data and apply it to the business model. Our existing models of PR are based on internal link graph calculations that are incomplete without factoring backlinks in. Once an accurate model is established and changes are made, log files are a great tool to monitor success.
