A/B testing has been around for so long that it’s hard to remember what digital marketing was like before we could test anything to find out what design, messaging, offers, etc. produce the best results. When you conduct an A/B test, you compare two or more versions of an experience against each other to see which performs best. Typically, one of those experiences acts as a “control” — an unchanged experience that allows you to see how much better or worse your test experience performs compared to making no change at all.

In the world of personalization, your control is often a generic experience (one that remains the same for everyone) and your test experience is a personalized one. For this reason, testing is an essential element of any successful personalization strategy. Even if you can sometimes assume it, you can’t know for sure that a personalized experience performs better than a generic one if you don’t test it.

We often argue that A/B testing alone is an outdated approach. We believe that it should almost always be combined with personalization to help you find the experiences that work best for each of your key segments — or even each individual — rather than the one experience that produces the best results for the average person. But whether you’re combining testing and personalization or just doing testing alone, we have some tips that can help you grow from amateur tester to pro.

Decide what you want to measure in advance

Before you start a test, consider the results you expect to see from it. Obviously, you hope for positive results, but which metrics, specifically, do you want to affect? It’s important to have a solid idea of what you’re trying to improve, rather than testing just for the sake of testing. Once you’ve decided which metrics you’d like to positively affect and, ideally, by how much, consider which other metrics could be negatively impacted and how much of a negative impact you’re willing to accept.

For example, let’s assume you’re testing an email capture pop-up message that invites visitors to sign up for your newsletter.

Naturally, you want to increase newsletter sign-ups, maybe by 3%. That’s your main goal for this campaign. But your metrics don’t exist in isolation. By introducing that pop-up, you may affect other metrics you care about as well. Visitors may sign up for your emails and then leave without continuing to shop, meaning visitors who might have made a purchase never do. Or they may be annoyed by the pop-up and promptly leave your site as a result.

Gaining more email addresses for your list has long-term value for your business, but how much, if any, of a bounce rate increase is acceptable to you?

It’s worth thinking through the potential negative impacts of any campaign in advance (and getting any necessary approvals when appropriate). That way, if you see negative effects during the test, you’ll know when to stop it before too much damage is done.
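To make this concrete, here’s a minimal sketch in Python of what it can look like to write down a primary goal and its guardrails before a test launches, and then check both once results come in. All metric names, thresholds and numbers are hypothetical examples, not a prescription for how your tool reports data.

```python
# Minimal sketch: a primary goal plus guardrail metrics, defined before launch.
# All metric names, thresholds and numbers below are hypothetical examples.

PRIMARY_GOAL = {"metric": "newsletter_signup_rate", "min_relative_lift": 0.03}
GUARDRAILS = [
    {"metric": "bounce_rate", "max_relative_increase": 0.02},
    {"metric": "purchase_conversion_rate", "max_relative_decrease": 0.01},
]

def evaluate(control: dict, test: dict) -> dict:
    """Compare test vs. control results and flag any guardrail breaches."""
    goal_metric = PRIMARY_GOAL["metric"]
    lift = (test[goal_metric] - control[goal_metric]) / control[goal_metric]

    breaches = []
    for g in GUARDRAILS:
        delta = (test[g["metric"]] - control[g["metric"]]) / control[g["metric"]]
        if "max_relative_increase" in g and delta > g["max_relative_increase"]:
            breaches.append((g["metric"], round(delta, 4)))
        if "max_relative_decrease" in g and -delta > g["max_relative_decrease"]:
            breaches.append((g["metric"], round(delta, 4)))

    return {
        "primary_lift": round(lift, 4),
        "goal_met": lift >= PRIMARY_GOAL["min_relative_lift"],
        "guardrail_breaches": breaches,
    }

# Example with made-up results: sign-ups beat the goal, but the bounce-rate guardrail is breached.
control = {"newsletter_signup_rate": 0.020, "bounce_rate": 0.40, "purchase_conversion_rate": 0.0300}
test = {"newsletter_signup_rate": 0.024, "bounce_rate": 0.43, "purchase_conversion_rate": 0.0298}
print(evaluate(control, test))
```

Writing the thresholds down up front, in whatever form your team uses, makes the “when do we stop?” conversation much easier once real data starts arriving.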

Determine your test duration by your business cycle

Most businesses see dramatic shifts in site activity depending on the day. At Evergage, for example, we see the bulk of our traffic during the week, with a drop-off on the weekends. With that in mind, it doesn’t make sense for us to run a test from Thursday to Sunday in any given week, because the results wouldn’t control for changes in behavior from the beginning of the week to the end. It also wouldn’t make sense to run a test for a week and a half (for example, beginning a test on Monday and ending it on Wednesday of the following week), because the results would reflect two Mondays, two Tuesdays and two Wednesdays of data but only one Thursday, one Friday, etc. That imbalance will skew the results.

A good rule of thumb is never to run a test for less than a week — to ensure you see each day of the week reflected in the data — and to run tests for multiples of a full week.

However, some businesses may want to run longer tests if they have highly variable monthly activity. For example, if you have a subscription-based business where the subscriptions all renew at the end of the month, you may want to run a test for at least a full month to ensure you control for behavioral patterns unique to your business.
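As a quick sanity check, the sketch below (hypothetical traffic numbers) rounds a planned test duration up to whole business cycles once you have a rough sample-size requirement. The cycle length is a parameter, so a weekly-cycle business and a monthly-renewal business can use the same logic.

```python
import math

def test_duration_days(required_visitors_per_variation: int,
                       daily_visitors: int,
                       num_variations: int = 2,
                       cycle_days: int = 7) -> int:
    """Round a planned test length up to full business cycles (weeks by default)."""
    total_needed = required_visitors_per_variation * num_variations
    days_needed = math.ceil(total_needed / daily_visitors)
    # Never run for less than one full cycle, and always run in whole cycles.
    full_cycles = max(1, math.ceil(days_needed / cycle_days))
    return full_cycles * cycle_days

# Hypothetical example: ~20,000 visitors per variation, ~5,000 visitors a day, two variations.
print(test_duration_days(20_000, 5_000))                 # -> 14 days (two full weeks)
# A subscription business with monthly renewals might use a 30-day cycle instead:
print(test_duration_days(20_000, 5_000, cycle_days=30))  # -> 30 days
```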

Avoid ending a test too early

It can be tempting to end a test early if it’s not showing the results you expected (or to declare victory too soon), but some campaigns take longer to take effect. In one example we saw with an Evergage client, a retailer offered a promotion for $15 off a minimum purchase of $100. That $100 minimum was higher than the site’s average order value (AOV) of $75, so in this case the retailer was trying to increase basket size.

The retailer found that it took some time from when a person saw the offer to when he or she actually acted on it, and the results reflected that lag.

Looking at users who acted within the first hour of seeing the message, the control outperformed the experiment. In other words, those who didn’t see the message converted at a higher rate than those who saw the message. But over time, conversions increased among those who saw the offer.

We can theorize that the offer took some time to take effect because the minimum order value was much higher than the AOV on the site. Shoppers needed some time to think about the purchase and identify how they would spend more to reach the minimum. In other words, the offer didn’t produce a conversion in the same session, but it affected conversions going forward.

In cases like this one, it’s important to keep the test running until those who see the promotion have had time to act on it. A good understanding of how your A/B testing solution works — in conjunction with your understanding of your own site — should help you apply the best judgment in this area.
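One simple way to spot this kind of delayed effect is to look at conversion rates by how much time has passed since a visitor first saw the experience. The sketch below uses hypothetical, simplified data (user IDs and hour timestamps); it isn’t how any particular testing tool reports results, just an illustration of the analysis.

```python
def cumulative_conversion_rate(exposures: dict, conversions: dict,
                               windows_hours=(1, 24, 72, 168)) -> dict:
    """
    exposures:   user_id -> hour the user first saw the experience
    conversions: user_id -> hour the user converted (absent if they never converted)
    Returns, for each window, the share of exposed users who converted within
    that many hours of first seeing the experience.
    """
    rates = {}
    for window in windows_hours:
        converted = sum(
            1 for user, seen_at in exposures.items()
            if user in conversions and 0 <= conversions[user] - seen_at <= window
        )
        rates[window] = converted / len(exposures) if exposures else 0.0
    return rates

# Hypothetical data: the control wins in the first hour, but the offer group catches up
# and passes it once shoppers have had time to build a qualifying basket.
offer_group = cumulative_conversion_rate(
    exposures={"a": 0, "b": 0, "c": 0, "d": 0},
    conversions={"a": 5, "c": 30},      # hours after exposure
)
control_group = cumulative_conversion_rate(
    exposures={"e": 0, "f": 0, "g": 0, "h": 0},
    conversions={"e": 0.4},
)
print(offer_group)    # {1: 0.0, 24: 0.25, 72: 0.5, 168: 0.5}
print(control_group)  # {1: 0.25, 24: 0.25, 72: 0.25, 168: 0.25}
```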

Understand what the results are saying

Advanced testers are probably familiar with what the results of an A/B test mean, but the rest of us can struggle to understand what the numbers are really saying. It’s important to know what your results mean so you can make the right decision about how to proceed after the test.

Each A/B testing tool is different, so you’ll want to understand the basics of your specific solution. Most solutions do not predict the lift you’ll see once you send all of your traffic to the test experience; the actual result will ultimately be higher or lower than what you saw during the test. Instead, the results tell you that you can be X% confident (typically 95%) that the test experience will beat the control experience. Some solutions can estimate a range for the lift you can expect when the test ends, but keep in mind that a 5% lift in conversion rate during the test does not mean you will see that exact same lift going forward.

For example, consider a test of a redesigned experience against a control. At the current point in the test, the results show the redesign delivering a lift in revenue per user (RPU) of 32.4% over the control experience at 95% confidence.

This does not mean that we can be 95% sure that we’ll receive a 32.4% lift in RPU when we end the test and allow 100% of our traffic to see the redesigned experience. It just means that we can be confident that the redesigned experience beats the control experience to a meaningful extent.
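For readers who want to see the underlying math, here’s an illustrative Python sketch using a standard two-proportion z-test on conversion rates (hypothetical counts; your testing solution’s exact statistics may differ). It separates the two numbers that are easy to conflate: the observed lift and the confidence that the test beats the control.

```python
from math import sqrt
from statistics import NormalDist

def compare_conversion_rates(control_conversions: int, control_visitors: int,
                             test_conversions: int, test_visitors: int):
    """
    Standard one-sided two-proportion z-test: how confident can we be that the
    test experience's conversion rate is genuinely higher than the control's?
    Returns (observed relative lift, confidence that test beats control).
    """
    p_control = control_conversions / control_visitors
    p_test = test_conversions / test_visitors
    p_pooled = (control_conversions + test_conversions) / (control_visitors + test_visitors)
    std_err = sqrt(p_pooled * (1 - p_pooled) * (1 / control_visitors + 1 / test_visitors))
    z = (p_test - p_control) / std_err
    confidence = NormalDist().cdf(z)
    observed_lift = (p_test - p_control) / p_control
    return observed_lift, confidence

# Hypothetical counts: roughly a 6% observed lift at about 97% confidence.
lift, confidence = compare_conversion_rates(2_000, 50_000, 2_120, 50_000)
print(f"observed lift: {lift:.1%}, confidence test beats control: {confidence:.1%}")
```

Notice that the confidence figure says nothing about whether the 6% lift itself will hold; it only says the test experience is very likely better than the control by some amount.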

Check for novelty effect

We all know that in life, something new can grab our attention simply because it’s new, not because it’s better than the old. And with limited attention spans, that excitement over the “new thing” wears off fast. The same can be true with digital experiences. A new experience can produce impressive results over an old experience just because visitors are unused to seeing it. When reviewing test results, make sure you identify whether any positive results could simply be due to novelty.

One easy way to check this is to segment the results by new vs. returning visitors. If a campaign is producing a strong positive effect with returning visitors but not with new visitors, it’s likely that returning visitors are responding to the novelty of the new experience and that the results you’re seeing won’t last.
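Here’s a minimal sketch of that segmentation in Python, using a hypothetical record format. It computes the relative lift of the test experience over the control separately for new and returning visitors.

```python
from collections import defaultdict

def lift_by_segment(records) -> dict:
    """
    records: an iterable of dicts like
        {"segment": "new" or "returning", "variation": "control" or "test", "converted": bool}
    Returns the relative lift of the test experience over the control within each
    segment, so a lift that only shows up for returning visitors is easy to spot.
    """
    counts = defaultdict(lambda: [0, 0])  # (segment, variation) -> [conversions, visitors]
    for r in records:
        key = (r["segment"], r["variation"])
        counts[key][1] += 1
        counts[key][0] += int(r["converted"])

    lifts = {}
    for segment in {seg for seg, _ in counts}:
        control_conv, control_n = counts.get((segment, "control"), (0, 0))
        test_conv, test_n = counts.get((segment, "test"), (0, 0))
        if control_n and test_n and control_conv:
            control_rate = control_conv / control_n
            test_rate = test_conv / test_n
            lifts[segment] = (test_rate - control_rate) / control_rate
    return lifts

# Hypothetical usage: a big lift for returning visitors but none for new visitors
# is a hint that novelty, not a genuinely better experience, is driving the result.
records = (
    [{"segment": "returning", "variation": "test", "converted": i < 30} for i in range(200)]
    + [{"segment": "returning", "variation": "control", "converted": i < 20} for i in range(200)]
    + [{"segment": "new", "variation": "test", "converted": i < 20} for i in range(200)]
    + [{"segment": "new", "variation": "control", "converted": i < 20} for i in range(200)]
)
print(lift_by_segment(records))  # e.g. {'returning': 0.5, 'new': 0.0}
```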

Of course, a campaign performing well because of the novelty effect isn’t necessarily a reason to abandon it. It just means you should recognize that while you may see an initial bump, it isn’t a long-term effect. Whether that aligns with your goals is a judgment call. If you’re testing a promotion on a retail site that will soon be replaced with a new promotion, novelty works in your favor. If you’re testing a redesigned site experience intended to boost your metrics over the long term, a short-lived bump may not be what you’re looking for.

Use machine-learning algorithms to elevate your testing

There are certainly many occasions to continue using traditional A/B testing in 2019 and beyond (such as when you’re testing new brand colors or a new site design), but in many cases, we can now turn to machine learning instead. For example, if you’re trying to find the one offer or message to display on your homepage (or any other prominent place on your website) that produces the best result for the majority of your site traffic, A/B testing will tell you that. But wouldn’t you rather find the one message that is most likely to appeal to each individual on your site? A/B testing can’t quite do that, but machine-learning algorithms can.

Instead of setting up a test between two or more experiences, letting it run, and then picking the best performing option, an algorithm (such as Evergage Decisions Contextual Bandit) uses predictive machine learning to evaluate the probability of engagement with a promotion, image, offer or experience — and compares it with its potential business value — to ultimately select the optimal content for each person in the moment. Essentially, you give the algorithm several different experiences to display in a specific area of your website, and it selects the ideal one for each individual visitor.

It continuously learns and automatically applies this learning, so there is no need for you to wait until a test is complete to pick a winner. It picks a winner automatically for each person and continues to improve over time. For some FAQs on this type of testing and learning using Contextual Bandit, check out this blog post and accompanying on-demand webinar.
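To give a feel for the general idea (and only the general idea; this is a toy illustration, not Evergage’s actual Decisions algorithm), here’s a simple epsilon-greedy bandit in Python that learns which experience performs best for each visitor context and keeps updating as new outcomes are recorded.

```python
import random
from collections import defaultdict

class SimpleContextualBandit:
    """
    A toy epsilon-greedy bandit that learns, per visitor context (e.g. a segment),
    which experience tends to earn the best reward, and keeps learning as results
    come in. A generic illustration only, not any vendor's production algorithm.
    """

    def __init__(self, experiences, epsilon: float = 0.1):
        self.experiences = list(experiences)
        self.epsilon = epsilon
        # (context, experience) -> [total_reward, times_shown]
        self.stats = defaultdict(lambda: [0.0, 0])

    def choose(self, context: str) -> str:
        # Occasionally explore a random experience so every option keeps getting a chance.
        if random.random() < self.epsilon:
            return random.choice(self.experiences)

        # Otherwise exploit: pick the experience with the best average reward so far.
        def average_reward(experience: str) -> float:
            total, shown = self.stats[(context, experience)]
            return total / shown if shown else 0.0

        return max(self.experiences, key=average_reward)

    def record(self, context: str, experience: str, reward: float) -> None:
        entry = self.stats[(context, experience)]
        entry[0] += reward
        entry[1] += 1

# Hypothetical usage: pick a homepage offer for each visitor, then log the outcome
# (e.g. reward=1.0 if the visitor converted, 0.0 if not) so future picks improve.
bandit = SimpleContextualBandit(["free_shipping", "15_off_100", "new_arrivals"])
offer = bandit.choose(context="returning_visitor")
bandit.record("returning_visitor", offer, reward=1.0)
```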

Final Thoughts

Whether you’re testing to find an experience that produces great results for everyone, for specific segments, or for each individual, use these tips to help you produce more successful tests going forward.

Also, note that the content of this blog post comes from a session presented by Cliff Lyon (Evergage’s VP of Engineering) and Meera Murthy (Evergage’s VP of Strategy) at last year’s Personalization Summit. Save the date for this year’s event, September 18-20 in Boston!