How to run Bayesian SEO tests

Two weeks ago, when I wrote about Bayesian Thinking in Growth, I committed a cardinal sin: being too conceptual without practical application. See, models help you understand a problem and apply the right approach to it. But when you’re too conceptual, the model stays abstract and you can’t put it to work. Let me correct that.

The essential idea behind Bayesian experimentation

To quickly recap, the groundbreaking concept of Bayesian testing, as opposed to Frequentist testing, is to use the information you already have to test more accurately. Imagine you want to score a three-pointer in basketball with your back facing the rim. The Frequentist thinks, “if I throw the ball at an angle of X, the chance of scoring is Y.” The Bayesian thinks, “based on my previous throws, my angle likely has to be between Y and Z to score.” Frequentists look at experiments in an isolated fashion; Bayesians take context into account.

Bayes’ Theorem is P(A|B) = P(B|A) × P(A) / P(B). You want to find the likelihood of A (ball hits the basket), given B (what you know from previous throws that hit the basket). That’s the same as the likelihood of B given A (the angle that led previous balls to hit the basket), multiplied by the likelihood of A (the ball will hit the basket), divided by the likelihood of B (previous balls hit the basket).
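To make the formula tangible, here is a minimal sketch with made-up numbers. For simplicity, I assume A means "the ball goes in" and B means "the throw was released at a steep angle"; both the definitions and the probabilities are illustrative assumptions, not measurements.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0 < p_b <= 1:
        raise ValueError("P(B) must be in (0, 1]")
    return p_b_given_a * p_a / p_b


# Hypothetical inputs for the basketball example (all assumed).
p_a = 0.10          # P(A): share of all backwards throws that go in
p_b = 0.30          # P(B): share of all throws released at a steep angle
p_b_given_a = 0.60  # P(B|A): share of successful throws that were steep

p_a_given_b = bayes_posterior(p_b_given_a, p_a, p_b)
print(f"P(scores | steep angle) = {p_a_given_b:.2f}")  # prints 0.20
```

With these assumed inputs, knowing that the throw was steep doubles the chance of scoring from 10% to 20%. That is the whole trick: what you already know shifts the estimate.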

We humans often do this intuitively: we collect new evidence that helps us iterate toward “the truth.” But we don’t define an exact number; we think in ranges. The same applies to Bayesian testing: it looks for a probability distribution.
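Thinking in ranges can be sketched in a few lines. The example below is a rough illustration, assuming a flat prior and a simple grid approximation: it turns made and missed throws into a probability distribution over your "true" scoring rate and reads a 90% range off it. The function names and the throw counts are hypothetical.

```python
def grid_posterior(made: int, attempts: int, grid_size: int = 1001):
    """Posterior over the scoring probability under a flat prior, on a grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Binomial likelihood up to a constant; a flat prior leaves it unchanged.
    weights = [p ** made * (1 - p) ** (attempts - made) for p in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


def credible_interval(grid, posterior, mass: float = 0.90):
    """Central interval that contains `mass` of the posterior probability."""
    tail = (1 - mass) / 2
    cumulative = 0.0
    low = high = None
    for p, w in zip(grid, posterior):
        cumulative += w
        if low is None and cumulative >= tail:
            low = p
        if high is None and cumulative >= 1 - tail:
            high = p
    return low, high


# Assumed data: 3 of 20 throws go in, later 15 of 100 in total.
few_low, few_high = credible_interval(*grid_posterior(3, 20))
many_low, many_high = credible_interval(*grid_posterior(15, 100))
print(f"After 20 throws:  {few_low:.2f} to {few_high:.2f}")
print(f"After 100 throws: {many_low:.2f} to {many_high:.2f}")
```

The second range is narrower than the first: more evidence does not give you a single number, it gives you a tighter range.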

Frequentist A/B testing tries to reject the “null hypothesis,” the assumption that a change (to a title or CTA) performs no better than the current version. In other words, you try to find a “treatment” (or change) that works better than what you currently have. That also means you look for a single number.

Bayesian A/B testing for SEO

A/B tests in SEO are quasi-tests because they have a sample size of 1: Google. Without randomized users, it’s not a “truly statistical test”. But that’s okay. Instead of randomizing users, we randomize which pages get the treatment and which serve as the control for a single user.

You can run quasi A/B tests with a spreadsheet, Python, R, or third-party tools like Clickflow, Metaclickpro, Splitsignal, or SEOtesting. But for Bayesian SEO A/B testing, we need to run a so-called “diff-in-diff” test (difference in differences) that compares the change between control and treatment over time. I’m not yet aware of an out-of-the-box tool that does that for you, except for Searchpilot (I geeked out about that with Will Critchlow on the Tech Bound podcast).
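The core diff-in-diff arithmetic is simple enough to sketch. The version below is a bare-bones illustration, assuming you have daily organic clicks for both groups before and after the change; the function name and all numbers are made up, and a real analysis would also model noise and seasonality.

```python
from statistics import mean


def diff_in_diff(control_pre, control_post, treatment_pre, treatment_post):
    """Return the diff-in-diff estimate as (absolute, relative) lift."""
    control_change = mean(control_post) - mean(control_pre)
    treatment_change = mean(treatment_post) - mean(treatment_pre)
    # The control's change is what we assume would have happened anyway.
    absolute = treatment_change - control_change
    # Counterfactual: the treatment group if it had moved like the control.
    counterfactual = mean(treatment_pre) + control_change
    return absolute, absolute / counterfactual


# Hypothetical daily organic clicks (assumed data).
control_pre = [100, 110, 90, 100]
control_post = [105, 115, 95, 105]
treatment_pre = [200, 210, 190, 200]
treatment_post = [230, 240, 220, 230]

absolute, relative = diff_in_diff(
    control_pre, control_post, treatment_pre, treatment_post
)
print(f"Lift: {absolute:.0f} clicks/day ({relative:.1%})")
```

Here the treatment pages gained 30 clicks per day, but the control gained 5 without any change, so only 25 clicks per day are attributed to the treatment.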

Let’s run through an example of what that looks like.

First, you need to develop a hypothesis. Let’s say, adding your brand twice to the title leads to an increase in organic traffic (not a real example).

Second, you need a “prior”, a data point indicating where to start, that you can derive either from a previous test you ran or a test someone else ran. Let’s take Searchpilot’s “brand-in-title” test, which saw a +15% uplift in organic traffic.

The thinking is, “I know adding the brand to my title has a likelihood of showing a +15% uplift, so adding it twice should result in at least +15%, but probably not 100%.” I made the range up in this case, but in an actual experiment you calculate it.

Third, you select the URL(s) to test on and define a control group. For the diff-in-diff to be valid, you need to pick a control group with a strong traffic correlation with the variant group.

Fourth, you run the test for the calculated number of days (use this or this calculator).

Fifth, revert the treatment to see if it falls back to baseline and validate the hypothesis. If it doesn’t, you have a problem. The treatment wasn’t the cause of the change. This is also the most forgotten step in SEO testing.

Sixth, look at the marginal organic traffic impact of your treatment URL(s) and define the lift (use this or this calculator). That’s where you find out whether the treatment resulted in a traffic uplift and how large that uplift is.
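The control-group requirement from the third step can be checked before the test starts. Here is a small sketch, assuming you have daily clicks for the variant group and for each candidate control group over the same pre-test period. The 0.8 cut-off and the data are my assumptions for illustration, not an established threshold.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


MIN_CORRELATION = 0.8  # assumed cut-off, tune it to your own data

# Hypothetical pre-test daily clicks (assumed data).
variant = [200, 220, 180, 210, 240, 190, 205]
candidates = {
    "category_a": [100, 112, 88, 104, 121, 96, 101],
    "category_b": [150, 140, 160, 155, 135, 150, 165],
}

usable = {
    name: pearson(variant, series)
    for name, series in candidates.items()
    if pearson(variant, series) >= MIN_CORRELATION
}
print(usable)
```

Only candidates whose traffic moves in step with the variant pages survive the filter. A control that moves on its own would make the diff-in-diff comparison meaningless.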

Getting the parameters of a test right is one of the hardest parts of testing

Et voilà! You can keep testing based on new evidence and get closer to the true impact. If, for example, your test shows that the traffic uplift is lower than +15%, you know it’s between the number you found and +15%.
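One common way to combine a prior with a new test result is a normal-normal update, sketched below. This is an illustration under assumptions, not the method any specific tool uses: I treat the +15% prior and the observed uplift as normal distributions, and the standard deviations (how much I trust each) are made up.

```python
def update_normal(prior_mean, prior_sd, observed, observed_sd):
    """Combine a normal prior with a normal observation (conjugate update)."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / observed_sd ** 2
    posterior_variance = 1 / (prior_precision + data_precision)
    # Precision-weighted average: the more certain source pulls harder.
    posterior_mean = posterior_variance * (
        prior_mean * prior_precision + observed * data_precision
    )
    return posterior_mean, posterior_variance ** 0.5


# Uplift in percent. The 15 comes from the prior above; the rest is assumed.
prior_mean, prior_sd = 15.0, 10.0  # loose trust in someone else's test
observed, observed_sd = 8.0, 5.0   # hypothetical result of your own test

posterior_mean, posterior_sd = update_normal(
    prior_mean, prior_sd, observed, observed_sd
)
print(f"Updated uplift estimate: {posterior_mean:.1f}% (sd {posterior_sd:.1f})")
```

The updated estimate lands between your own result and the prior, closer to whichever one you trust more, and its uncertainty is smaller than either input. Feed it in as the prior for the next test and repeat.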

That’s the basic outline of the test. For the technicalities, check out the “Dive deeper” section.

The higher the impact, the faster you see results

The challenge with testing is the sample size. If you test a page with 1,000 weekly visitors and a 1% conversion rate with two variants that get 500 visitors each, you need to run the test for 6 weeks to detect a change with 72% reliability. In other words, you need a lot of traffic and time to test low-impact treatments. That’s why I suggest testing high-impact factors like titles or rich snippets to see quick results and iterate from there.
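To see why impact size matters so much, here is a rough sketch using the standard Frequentist sample-size approximation for comparing two conversion rates. It is a swapped-in textbook formula, not the calculation behind the numbers above, and the significance level, power, and lifts are assumptions.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_needed(base_rate, relative_lift, weekly_visitors_per_variant):
    """Weeks of traffic required at the given weekly volume per variant."""
    needed = visitors_per_variant(base_rate, relative_lift)
    return ceil(needed / weekly_visitors_per_variant)


# 1% conversion rate and 500 visitors per variant per week, as in the text.
for lift in (0.10, 0.50, 1.00):  # assumed relative lifts
    weeks = weeks_needed(0.01, lift, 500)
    print(f"+{lift:.0%} lift: about {weeks} weeks")
```

The required duration shrinks quickly as the expected lift grows, because the sample size scales with the inverse square of the difference you want to detect.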

Dive deeper