So make sure you’re running an actual A/B test where you’re splitting the traffic between your 2 versions and testing them at the exact same time.
All the users that were previously bucketed into the removed variant will need to be re-allocated to another variant, and they’ll suddenly be seeing a different page, which could affect their behavior and subsequent choices.
So there you have it. The 57 common and uncommon A/B testing mistakes that we see and how you can avoid them.
Simply re-run the test, set a high confidence level, and make sure you run it for long enough.
- Mistakes before you start testing,
- Issues that can happen during the test,
- And errors you can make once the test is finished.
Getting a win but not implementing it! They have the data and just do nothing with it. No change, no insight, and no new tests.
This is ok. We keep testing and we keep improving, because even a 1% increase compounds over time. Improve on it and get it to 2% and you’ve now doubled the effectiveness.
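To make that compounding point concrete, here’s a quick back-of-the-envelope sketch in Python. The 2% baseline, 1% monthly lift, and twelve-month window are all made-up numbers, purely for illustration:

```python
# Hypothetical illustration: how small monthly lifts compound over a year.
baseline_rate = 0.020          # assumed starting conversion rate (2%)
monthly_lift = 0.01            # assumed 1% relative improvement per month
months = 12

rate = baseline_rate
for _ in range(months):
    rate *= (1 + monthly_lift)  # each win builds on the last one

print(f"After {months} months: {rate:.4%}")                       # ~2.25%
print(f"Total relative lift: {(rate / baseline_rate - 1):.1%}")   # ~12.7%
```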
Some people just test anything without really thinking it through.
What worked and what didn’t? Why did it happen?
Common A/B Testing Mistakes That Can Be Made Before You Even Run Your Test
#1. Pushing something live before testing!
It doesn’t matter if you’re running an A/B, an A/B/n, or a Multivariate test. You need to allocate equal traffic volume to each version so that you can get an accurate measurement.
Hold off!
Here’s another potential sample pollution issue.
So what can we do?
#2. Not running an actual A/B test
An A/B test works by sending a single traffic source to a control page and a variation of that page. The goal is to find out whether the change you implemented makes the audience convert better and take action.
As testers, we need to be impartial. Sometimes, however, you might have a particular design or idea that you just love and are convinced that it should have won so you keep extending the test out longer and longer to see if it pulls ahead.
Peeking is a term used to describe when a tester has a look at their test to see how it’s performing.
These results are not entirely accurate, as many things could have happened during those test windows. You could get a burst of new traffic or run an event, causing the two pages to see wildly different audiences and results.
Always be ready to go back into an old campaign and retest. (Another reason why having a testing repository works great.)
#3. Not testing to see if the tool works
It could be that something broke midway through testing. It never hurts to check.
Just be aware of the significance of your segment size. You might not have had enough traffic to each segment to trust it fully, but you can always run a mobile-only test (or whichever channel it was) and see how it performs. If in doubt, find the next most important test on your list and start improving there. You may even find it helps conversion on that stuck page anyway, simply by feeding better prospects to it.
This is less about a certain page or test mistake, but more about testing philosophy.
Why?
Apple tested its website and improved on it, but it’s the product iterations and improvements that continue to drive even more lift.
#33. Not stopping a test when you have accurate results
No testing tool is 100% accurate. The best thing you can do when starting out is to run an A/A test to see how precise your tool is.
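If you want a feel for how often an A/A test will “find” a difference that isn’t really there, you can simulate one. Here’s a rough sketch (the 5,000 visitors per bucket and 3% conversion rate are invented assumptions, and this is no substitute for your tool’s own reporting):

```python
# Rough A/A simulation: both buckets get the SAME true conversion rate,
# so any "significant" result is a false positive.
import random
from statistics import NormalDist

def simulate_aa(visitors_per_bucket=5_000, true_rate=0.03, alpha=0.05, runs=1_000):
    false_positives = 0
    for _ in range(runs):
        a = sum(random.random() < true_rate for _ in range(visitors_per_bucket))
        b = sum(random.random() < true_rate for _ in range(visitors_per_bucket))
        p_a, p_b = a / visitors_per_bucket, b / visitors_per_bucket
        pooled = (a + b) / (2 * visitors_per_bucket)
        se = (2 * pooled * (1 - pooled) / visitors_per_bucket) ** 0.5
        z = abs(p_a - p_b) / se if se else 0
        p_value = 2 * (1 - NormalDist().cdf(z))   # two-sided test
        if p_value < alpha:
            false_positives += 1
    return false_positives / runs

print(f"False positive rate: {simulate_aa():.1%}")  # should hover around 5%
```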
It won’t always stay like this, though. It could be that the test launched on payday and you got a burst of sales that day.
A super simple mistake, but it happens. You mislabel the tests and then read the wrong results. The variation wins but it’s named the control, so you never implement the win and stay with the loser!
#34. Being emotionally invested in losing variations
Ideally, we never want to look at our test once running, and we never make a decision on it until it’s finished a full cycle, with the right sample size and it’s hit statistical significance.
Sure, something might break, but that’s the only change we should ever make. We don’t change the design, or copy, or anything.
Now it would be tempting to turn off the ‘losing’ variation and redistribute the traffic among the other variations, right? Heck… you might even want to take that extra 25% of the traffic and just send it to the top performer, but don’t do it.
#35. Running tests for too long and tracking drops off
However, you wouldn’t want to be testing lead pages, sales pages, and checkout pages all at once as this can introduce so many different elements into your testing process, requiring massive volumes of traffic and conversions to get any useful insight.
Netflix does this with the thumbnails of all of their shows, testing different elements that may appeal to different audiences (featuring specific actors who are famous in that country, for instance).
#36. Not using a tool that allows you to stop/implement the test!
Sometimes you can’t help it. You’ll have a test running and Google implements a new core update, messing with your traffic sources mid-campaign *cough*.
So as a rule of thumb,
Again, what works for one is not what always works for another.
Common A/B Testing Mistakes You Can Make After Your Test Is Finished
#37. Giving up after one test!
Sometimes, you might even get more clicks because the layout has changed and they’re exploring the design.
Another rare issue.
#38. Giving up on a good hypothesis before you test all versions of it
Why?
Are you running A/B tests but not sure if they’re working properly?
But you’ll need a hypothesis that is testable, meaning it can be proven or disproven through testing. Testable hypotheses put innovation into motion and promote active experimentation. They can result either in success (your hunch was correct) or in failure (you were wrong all along), but either way they will give you insights. A failure may mean your test needs to be executed better, your data was incorrect or misread, or you found something that didn’t work, which often points to a new test that might work far better.
#39. Expecting huge wins all the time
If I can see that they are both receiving traffic and getting clicks/conversions, then I walk off and let it do its thing. I make NO decisions until the test has run its course.
Another simple mistake. Either the page URL has been entered incorrectly, or the test is running to a ‘test site’ where you made your changes and not to the live version.
And that’s the key here. Even if you share insights with other departments, you should still test to see how it works.
What if something is broken?
It took CXL 21 iterations to improve their client’s page, but those iterations took it from a 12.1% to a 79.3% conversion rate.
#40. Not checking validity after the test
Set them up to be equal from the start. Most tools will allow you to do this.
Run a quick test to see how it works first. You don’t want to push a radical change live without getting some data, or you could lose sales and conversions.
#41. Not reading the results correctly
If you run a test for longer than 4 weeks, there is a chance you’ll see users’ cookies expire. This can cause events to go untracked, and those users may even return later as “new” visitors and pollute the sample data.
- Dive deep into your analytics.
- Look at any qualitative data you have.
(This is such an important user experience factor that Google is now adjusting its rankings in favor of sites that don’t have flickering or shifting elements.)
That means you need to run 10 tests to get that winner. It takes effort but it’s always worth it, so don’t stop after one campaign!
#42. Not looking at the results by segment
You might have a goal for a page, but are also running a global campaign with multiple variations showing in different languages and different countries.
We can know from our data that X amount of people didn’t click, but we might not know why.
Think of the iPhone.
The key when doing a single element test is just that though. Keep your test to just ONE element change so that you can see what is making the difference and learn from it. Too many changes and you don’t know what worked.
#43. Not learning from results
Which types of tests yield the best results?
#44. Taking the losers
Do you want to learn the common mistakes when A/B testing so that you don’t lose valuable time on a broken campaign?
Your test should always be tied to guardrail metrics or some element that directly affects your sales. If the goal is more leads, then you should know down to the dollar what a lead is worth and the value of raising that conversion rate.
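As a quick illustration of what “down to the dollar” can look like, here’s a hypothetical sketch; every number in it is an assumption you’d swap for your own:

```python
# Hypothetical lead-value math: translate a conversion-rate lift into dollars.
monthly_visitors = 20_000       # assumed traffic to the lead page
baseline_cr = 0.040             # assumed 4% visitor-to-lead rate
lift = 0.10                     # assumed 10% relative lift from the winning variant
lead_to_sale = 0.15             # assumed 15% of leads become customers
avg_sale_value = 500            # assumed average sale value in dollars

extra_leads = monthly_visitors * baseline_cr * lift
extra_revenue = extra_leads * lead_to_sale * avg_sale_value
print(f"Extra leads/month: {extra_leads:.0f}")        # 80
print(f"Extra revenue/month: ${extra_revenue:,.0f}")  # $6,000
```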
#45. Not taking action on the results
Finally, you have the sample size.
Here’s an example.
#46. Not iterating and improving on wins
Sometimes you can get a lift but there’s more to be had. Like we said earlier, it’s very rare that every win will give you a double-digit lift.
You can use this guide to help you sidestep these issues for all future campaigns.
Complete a test, analyze the result, and either iterate or run a different test. (Ideally, have them queued up and ready to go).
#47. Not sharing winning findings in other areas or departments
A super simple mistake, but have you checked that everything works?
The thing is, in a few years you may need to overhaul that entire page again. Environments change, the language and terms people use shift, and the product itself gets tweaked.
- Find some winning sales page copy? Preframe it in your adverts that get them to the page!
- Find a style of lead magnet that works great? Test it across the entire site.
#48. Not testing those changes in other departments
If the test is working, let it run and let the data decide what works.
The thing is, it causes your data to become polluted and less accurate. Ideally, you want to use a tool that randomizes which page they see but then always shows them that same version until the test is over.
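Most tools handle this for you, but if you were wiring it up yourself, the usual trick is deterministic “sticky” bucketing: hash a stable user ID together with the test name so the same visitor always lands in the same variant. A minimal sketch (the test name and the 50/50 split are hypothetical):

```python
# Deterministic "sticky" bucketing: the same user_id always gets the same variant.
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("control", "variation")) -> str:
    # Hash user + experiment so assignments are stable within a test,
    # but re-shuffled for each new experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)   # effectively a 50/50 split for two variants
    return variants[bucket]

# The same visitor sees the same page on every visit until the test ends.
print(assign_variant("user-123", "pricing-page-headline"))
print(assign_variant("user-123", "pricing-page-headline"))  # identical result
```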
#49. Too much iteration on a single page
Do the math! Make sure you have enough traffic before running a test – otherwise it’s just wasted time and money. Many tests fail because of insufficient traffic or poor sensitivity (or both).
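“Do the math” can be as simple as a standard two-proportion sample size estimate. The sketch below uses the common normal-approximation formula at 95% confidence and 80% power; the 3% baseline and 10% minimum detectable lift are assumptions you’d replace with your own numbers:

```python
# Rough per-variant sample size for detecting a given lift
# (normal approximation, 95% confidence / 80% power).
from statistics import NormalDist

def sample_size_per_variant(baseline_cr, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

n = sample_size_per_variant(baseline_cr=0.03, relative_lift=0.10)
print(f"~{n:,} visitors per variant")  # ≈ 53,000 at these assumptions
```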
The best testers also listen to their audience. They find out what they need, what moves them forward, what holds them back, and then use that to formulate new ideas, tests, and written copy.
Some people make the mistake of running a test in sequence. They run their current page for X amount of time, then the new version for X time after that, and then measure the difference.
9 out of 10 tests are usually failures.
Unless you’re testing for a seasonal event, you never want to run a test campaign during the holidays or any other major event, such as a special sale or world event happening.
Sometimes that new change can be a substantial dip in performance. So give it a quick test first.
#50. Not testing enough!
A failure can simply mean your hypothesis is correct but needs to be executed better.
Some testing programs insist on creating hard-coded tests, i.e. a developer or engineer builds the campaign from scratch.
Always double-check!
If you’re running your test for a month, then you’re probably going to get enough traffic to get accurate results. Too little traffic and the test just won’t be able to give you the confidence level you need to trust that it will perform as it should.
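A related sanity check is turning the required sample size into a duration before you launch. A tiny hypothetical sketch (the traffic figure and the 53,000-per-variant number are assumptions, the latter carried over from the earlier sample size example):

```python
# How long will the test need to run? (illustrative numbers only)
import math

required_per_variant = 53_000    # e.g. from a sample size calculation
num_variants = 2                 # control + one variation
daily_visitors = 4_000           # assumed eligible traffic per day

days_needed = math.ceil(required_per_variant * num_variants / daily_visitors)
print(f"Plan for at least {days_needed} days (~{days_needed / 7:.1f} weeks)")
```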
The fact of the matter is, you might only get a huge win in 1 out of every 10 or more winning campaigns.
#51. Not documenting tests
Well, then your first thought should be that something is broken.
The key when running your test is to segment the audience after and see if the new visitors are responding as well as the old ones.
#52. Forgetting about false positives and not double-checking huge lift campaigns
It may be an overlooked part of QA testing, but campaigns often run with broken buttons, old links, and more. Check first, then test.
So the test is finished. You ran for long enough, saw results, and got stat sig but can you trust the accuracy of the data?
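One common validity check at this point is a sample ratio mismatch (SRM) test: if you intended a 50/50 split but the observed visitor counts are way off, something in the assignment or tracking is probably broken and the “win” can’t be trusted. A hedged sketch with made-up counts:

```python
# Sample ratio mismatch (SRM) check: did the traffic split match what we intended?
from statistics import NormalDist

def srm_p_value(control_visitors, variant_visitors, expected_split=0.5):
    # Two-sided test of the observed split against the intended split,
    # using a normal approximation to the binomial.
    total = control_visitors + variant_visitors
    expected = total * expected_split
    se = (total * expected_split * (1 - expected_split)) ** 0.5
    z = abs(control_visitors - expected) / se
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical counts: intended 50/50, observed 50,400 vs 49,600.
p = srm_p_value(50_400, 49_600)
print(f"SRM p-value: {p:.4f}")  # ~0.011 here => the split looks off, investigate first
```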
The thing is, there are probably way more important pages for you to be testing right now.
#53. Not tracking downline results
A 1% lift on a sales page is great, but a 20% lift on the page that gets them there could be far more important. (Especially if that particular page is where you are losing most of your audience.)
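To see why the upstream lift can matter more, here’s a tiny funnel model with made-up numbers:

```python
# Illustrative funnel: compare a lift on the sales page vs. a lift on the page before it.
visitors = 100_000
to_sales_page = 0.10      # assumed 10% of visitors reach the sales page
sales_page_cr = 0.05      # assumed 5% of those buy

baseline_sales = visitors * to_sales_page * sales_page_cr            # 500 sales

# Option A: +1% relative lift on the sales page itself
option_a = visitors * to_sales_page * (sales_page_cr * 1.01)         # 505 sales

# Option B: +20% relative lift on the page that feeds the sales page
option_b = visitors * (to_sales_page * 1.20) * sales_page_cr         # 600 sales

print(f"baseline={baseline_sales:.0f}  option A={option_a:.0f}  option B={option_b:.0f}")
```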
Sidenote:
That’s why you never change traffic or turn off variations mid-way through. (And also why you shouldn’t be peeking!)
#54. Failing to account for primacy and novelty effects, which may bias the treatment results
Ideally, when running a test, you want to make sure you’re only testing a single segment of your audience. Usually, it’s new organic visitors, to see how they respond their first time on your site.
You could quite easily run tests on every lead generation page you have, all at the same time.
Sometimes people just run a campaign and see what changes, but you will definitely get more leads/conversions or sales if you have clarity on which specific element you want to see a lift on.
Let’s say you have a platform where your audience can communicate. Perhaps a Facebook page or comments section, but EVERYONE can access it.
Because both sets of your audience are seeing the exact same page, the conversion results should be identical on both sides of the test, right?
So what can you do?
Prioritize impact most of all:
It keeps on running and feeding 50% of your audience to a weaker page and 50% to the winner. Oops!
You can only find that out by segmenting your results. Look at the devices used and the results for each. You might find some valuable insights!
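If your tool lets you export raw numbers, a per-device breakdown can be as simple as the sketch below (the data is invented for illustration):

```python
# Hypothetical per-device breakdown of a test's results.
results = [
    # (segment, variant, visitors, conversions) -- made-up numbers
    ("desktop", "control",   8_000, 400),
    ("desktop", "variation", 8_000, 480),
    ("mobile",  "control",  12_000, 360),
    ("mobile",  "variation",12_000, 300),
]

for segment in {row[0] for row in results}:
    rates = {v: c / n for s, v, n, c in results if s == segment}
    lift = rates["variation"] / rates["control"] - 1
    print(f"{segment:8s} control={rates['control']:.2%} "
          f"variation={rates['variation']:.2%} lift={lift:+.1%}")
```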
Just like we don’t change the pages being tested, we don’t remove any variations or change the traffic distribution mid-test either.
#55. Running consideration period changes
But bear in mind, not every A/B test should be for a radical change like this. 99% of the time we’re just testing a change of a single thing, like
Why?
Even worse again?
It might look ok for you, but it won’t actually load for your audience. Well, in that case, you don’t really want to wait a month to find out it’s broken, right? This is why I always check whether a test is getting results in both the control and the variation 24 hours after I set it to run.
This is all worth checking out BEFORE you start running traffic to any campaign.
#56. Not retesting after X time
Let’s say the test is getting clicks and the traffic is distributed, so it *looks* like it’s working, but suddenly you start getting reports that people can’t fill out the sales form. (Or better still, you got an automated alert that a guardrail metric has dropped way below acceptable levels.)
Test and improve the biggest impact, lowest hanging fruit first. That’s what agencies do and it’s why they perform the same number of tests as in-house teams, but with a higher ROI. Agencies get 21% more wins for the same volume of tests!
If it’s broken then fix it and restart.
This can be distracting and cause trust issues, lowering your conversion rate.
#57. Only testing the path and not the product
When you make a new change, it can actually have a novelty effect on your past audience.
In this instance, this page would actually be more profitable to run, assuming the traffic that clicks continues to convert as well…
Every failure is a valuable lesson, both in testing and in setup mistakes. The key is to learn from them!
The more you understand your results, the better.
Conclusion
The page you’re running tests on has plateaued and you just can’t seem to get any more lift from it.
In this situation, you might have some people seeing one page and others seeing a variation, but all of them are on the same social network. This can actually skew your data, as they can affect each other’s choices and interactions with the page. LinkedIn has been segmenting its audience when testing new features to prevent network effect issues.