A/B testing

A/B testing (also known as bucket testing, split-run testing or split testing) is a user-experience research method.[1] A/B tests consist of a randomized experiment that usually involves two variants (A and B),[2][3][4] although the concept can also be extended to multiple variants of the same variable. It applies statistical hypothesis testing, specifically "two-sample hypothesis testing" as used in the field of statistics. A/B testing is employed to compare multiple versions of a single variable, for example by testing a subject's response to variant A against variant B, and to determine which of the variants is more effective.[5]
Multivariate testing or multinomial testing is similar to A/B testing but may test more than two versions at the same time or use more controls. Simple A/B tests are not valid for observational, quasi-experimental or other non-experimental situations—commonplace with survey data, offline data, and other, more complex phenomena.
Definition
"A/B testing" is a shorthand for a simple randomized controlled experiment, in which a number of samples (e.g. A and B) of a single vector-variable are compared.[1] A/B tests are widely considered the simplest form of controlled experiment, especially when they only involve two variants. However, by adding more variants to the test, its complexity grows.[6]
The following example illustrates an A/B test with a single variable:
A company has a customer database of 2,000 people and launches an email campaign with a discount code in order to generate sales through its website. The company creates two versions of the email with different calls to action (the part of the copy that encourages customers to act—in the case of a sales campaign, make a purchase) and identifying promotional codes.
- To 1,000 people, the company sends an email with the call to action stating "Offer ends this Saturday! Use code A1",
- To the remaining 1,000 people, it sends an email with the call to action stating "Offer ends soon! Use code B1".
- All other elements of the emails' copy and layout are identical.
The company then monitors which campaign has the higher success rate by analyzing the use of the promotional codes. The email using the code A1 has a 5% response rate (50 of the 1,000 people emailed used the code to buy a product), and the email using the code B1 has a 3% response rate (30 of the recipients used the code to buy a product). The company therefore determines that in this instance, the first call to action is more effective and will use it in future sales. A more nuanced approach would involve applying statistical testing to determine whether the difference in response rates between A1 and B1 was statistically significant (that is, highly likely that the difference is real and repeatable, and not the result of random chance).[7]
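As an illustration of such a significance check, the following is a minimal sketch of a pooled two-proportion Z-test applied to the response counts in the example (50 of 1,000 versus 30 of 1,000). The function name `two_proportion_z_test` is chosen here for illustration, and the normal approximation it relies on is an assumption that is reasonable only for samples of roughly this size or larger.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one response rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled response rate under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z_test(50, 1000, 30, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

For these counts the statistic is about 2.28 and the two-sided p-value is roughly 0.02, so the difference would be judged significant at the conventional 5% level, though not at the 1% level.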
In the previous example, the purpose of the test is to determine the more effective strategy to encourage customers to make a purchase. If, however, the aim of the test had been to determine which email would generate the higher clickthrough rate (the percentage of people who actually click the link after receiving the email), the results might have been different.
For example, even though more of the customers receiving the code B1 accessed the website, because the call to action did not state the end date of the promotion, many recipients may feel no urgency to make an immediate purchase. Consequently, if the purpose of the test had been simply to determine which email would bring more traffic to the website, the email containing code B1 might well have been more successful. An A/B test should have a defined, measurable outcome, such as sales converted, clickthrough rate or registration rate.[8]
Common test statistics
Two-sample hypothesis tests are appropriate for comparing the two samples that result from the two variants in the experiment. Z-tests are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation. Student's t-tests are appropriate for comparing means under relaxed conditions when less is assumed. Welch's t-test assumes the least and is therefore the most commonly used two-sample hypothesis test when the mean of a metric is to be optimized. While the mean of the variable to be optimized is the most common choice of estimator, others are regularly used.
Fisher's exact test can be employed to compare two binomial distributions, such as a click-through rate.
| Assumed distribution | Example case | Standard test | Alternative test | 
|---|---|---|---|
| Gaussian | Average revenue per user | Welch's t-test (Unpaired t-test) | Student's t-test | 
| Binomial | Click-through rate | Fisher's exact test | Barnard's test | 
| Poisson | Transactions per paying user | E-test[9] | C-test | 
| Multinomial | Number of each product purchased | Chi-squared test | G-test | 
| Unknown | | Mann–Whitney U test | Gibbs sampling | 
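To make the first row of the table concrete, the following sketch computes the Welch's t-test statistic and its Welch–Satterthwaite degrees of freedom using only the standard library. The helper name `welch_t` and the revenue figures are invented for illustration. Converting the statistic into a p-value requires the Student's t distribution, which is not in the standard library; a routine such as SciPy's `ttest_ind` with `equal_var=False` could be used for that step.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean; variances are not pooled.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical revenue-per-user samples, invented for illustration only.
revenue_a = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7]
revenue_b = [10.1, 9.4, 12.3, 8.8, 10.9, 9.9]
t, df = welch_t(revenue_a, revenue_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Because the two group variances are kept separate, the degrees of freedom are generally not an integer and fall between the smaller group's size minus one and the combined size minus two.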
Segmentation and targeting
A/B tests most commonly apply the same variant (e.g., user interface element) with equal probability to all users. However, in some circumstances, responses to variants may be heterogeneous. While a variant A might have a higher response rate overall, variant B may have an even higher response rate within a specific segment of the customer base.[10]
For instance, in the above example, the breakdown of the response rates by gender could have been:
| Gender | Overall | Men | Women | 
|---|---|---|---|
| Total sends | 2,000 | 1,000 | 1,000 | 
| Total responses | 80 | 35 | 45 | 
| Variant A | 50/ 1,000 (5%) | 10/ 500 (2%) | 40/ 500 (8%) | 
| Variant B | 30/ 1,000 (3%) | 25/ 500 (5%) | 5/ 500 (1%) | 
In this case, while variant A attracted a higher response rate overall, variant B actually elicited a higher response rate with men.
The company might therefore select a segmented strategy as a result of the A/B test, sending variant B to men and variant A to women in the future. In this example, a segmented strategy would yield a 30% increase in expected response rates, from 5% to 6.5%.
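The arithmetic behind the segmented strategy can be reproduced from the table: men respond best to variant B (5%) and women to variant A (8%), so with equally sized segments the expected rate is 6.5%, against 5% for sending variant A to everyone. The sketch below works through that calculation; the dictionary layout and helper names are illustrative choices, and equal segment sizes are assumed as in the table.

```python
# (responses, sends) per variant and segment, taken from the table above.
results = {
    "A": {"men": (10, 500), "women": (40, 500)},
    "B": {"men": (25, 500), "women": (5, 500)},
}


def rate(responses, sends):
    return responses / sends


def overall_rate(variant):
    responses = sum(r for r, _ in results[variant].values())
    sends = sum(s for _, s in results[variant].values())
    return responses / sends


# Choose the better-performing variant for each segment.
best = {
    segment: max(results, key=lambda v: rate(*results[v][segment]))
    for segment in ("men", "women")
}

# Expected rate when each (equally sized) segment receives its best variant.
segmented = sum(rate(*results[best[s]][s]) for s in best) / len(best)
# Baseline: the best single variant sent to everyone.
baseline = max(overall_rate(v) for v in results)
lift = (segmented - baseline) / baseline
print(best, f"{segmented:.1%}", f"{lift:.0%}")
```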
If segmented results are expected from the A/B test, the test should be properly designed at the outset to be evenly distributed across key customer attributes, such as gender. The test should contain a representative sample of men vs. women and assign men and women randomly to each “variant” (variant A vs. variant B). Failure to do so could lead to experiment bias and inaccurate conclusions.[11]
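One way to obtain such a balanced design is stratified random assignment: users are grouped by the attribute of interest, shuffled within each group, and then dealt out to the variants in turn. The following is a minimal sketch under that assumption; `stratified_assign` and the customer list are hypothetical and are not taken from any particular testing tool.

```python
import random
from collections import Counter


def stratified_assign(users, stratum_of, variants=("A", "B"), seed=0):
    """Randomly assign users to variants, balanced within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for user in users:
        strata.setdefault(stratum_of(user), []).append(user)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)  # random order within the stratum
        for i, user in enumerate(members):
            # Dealing shuffled users out in turn keeps variant counts
            # within one of each other inside every stratum.
            assignment[user] = variants[i % len(variants)]
    return assignment


# Hypothetical customer list of (customer id, gender) pairs.
customers = [(i, "men" if i % 2 else "women") for i in range(2000)]
assignment = stratified_assign(customers, stratum_of=lambda c: c[1])
counts = Counter((c[1], v) for c, v in assignment.items())
print(counts)
```

With 1,000 men and 1,000 women, each combination of gender and variant receives exactly 500 customers, matching the layout of the table above.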
This segmentation and targeting approach can be further generalized to include multiple customer attributes rather than a single customer attribute—for example, customers' age and gender—to identify more nuanced patterns that may exist in the test results.
Tradeoffs
Positives
The results of A/B tests are simple to interpret and give a clear picture of real user preferences, as they directly test one option against another. A/B tests can also provide answers to highly specific design questions. One example of this is Google's A/B testing with hyperlink colors. In order to optimize revenue, Google tested dozens of hyperlink hues to determine which colors attract the most clicks.[12]
Negatives
A/B tests are sensitive to variance; they require a large sample size in order to reduce standard error and produce a statistically significant result. In applications in which active users are abundant, such as with popular online social-media platforms, obtaining a large sample size is trivial. In other cases, large sample sizes are obtained by increasing the experiment enrollment period. However, using a technique that Microsoft named Controlled Experiment Using Pre-Experiment Data (CUPED), variance from before the experiment start can be taken into account so that fewer samples are required to produce a statistically significant result.[13][14]
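The core idea of CUPED can be sketched as a control-variate adjustment: each user's in-experiment metric y is replaced by y − θ(x − mean(x)), where x is the same user's pre-experiment metric and θ = cov(x, y) / var(x). The example below is a simplified illustration on simulated data, not the implementation described in the cited sources; the function name `cuped_adjust` and the simulated distributions are assumptions, and in practice θ would normally be estimated on the pooled data from all variants.

```python
import random
from statistics import mean, variance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x))."""
    x_bar, y_bar = mean(pre_metric), mean(metric)
    # Sample covariance between the pre-experiment and in-experiment metric.
    cov_xy = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(pre_metric, metric)
    ) / (len(metric) - 1)
    # theta is the slope of a least-squares regression of y on x.
    theta = cov_xy / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


# Simulated users whose in-experiment metric tracks their pre-experiment metric.
rng = random.Random(42)
pre = [rng.gauss(10, 3) for _ in range(5000)]
during = [x + rng.gauss(0, 1) for x in pre]

adjusted = cuped_adjust(during, pre)
print(f"variance before: {variance(during):.2f}, after: {variance(adjusted):.2f}")
```

The adjustment leaves the mean of the metric unchanged while removing the share of variance that the pre-experiment data explains, which is why fewer samples are needed for the same statistical power.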
Because of its nature as an experiment, running an A/B test introduces the risk of wasted time and resources if the test produces unwanted or unhelpful results.
In December 2018, representatives with experience in large-scale A/B testing from 13 organizations (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, Yandex and Stanford University) summarized the top challenges in a paper.[15] The challenges were grouped into four areas: analysis, engineering and culture, deviations from traditional A/B tests and data quality.
History
It is difficult to definitively establish when A/B testing was first used. The first randomized double-blind trial to assess the effectiveness of a homeopathic drug occurred in 1835.[16] Experimentation with advertising campaigns, which has been compared to modern A/B testing, began in the early 20th century.[17] The advertising pioneer Claude Hopkins used promotional coupons to test the effectiveness of his campaigns. However, this process, which Hopkins described in his 1923 book Scientific Advertising, did not incorporate concepts such as statistical significance and the null hypothesis, which are used in statistical hypothesis testing.[18] Modern statistical methods for assessing the significance of sample data were developed separately in the same period. This work was conducted in 1908 by William Sealy Gosset when he altered the Z-test to create Student's t-test.[19][20]
With the growth of the internet, new ways to sample populations have become available. Google engineers ran their first A/B test in 2000 to determine the optimum number of results to display in its search-engine results.[5] The first test was unsuccessful because of glitches that resulted from slow loading times. Later A/B testing research was more advanced, but the foundation and underlying principles generally remain the same, and in 2011, Google ran more than 7,000 different A/B tests.[5]
In 2012, a Microsoft employee working on the search engine Bing created an experiment to test different methods of displaying advertising headlines. Within hours, the alternative format produced a revenue increase of 12% with no impact on user-experience metrics.[4] Today, major software companies such as Microsoft and Google each conduct over 10,000 A/B tests annually.[4]
A/B testing has been claimed by some to be a change in philosophy and business-strategy in certain niches, although the approach is identical to a between-subjects design, which is commonly used in a variety of research traditions.[21][22][23] A/B testing as a philosophy of web development brings the field into line with a broader movement toward evidence-based practice.
Many companies now use the "designed experiment" approach to making marketing decisions, with the expectation that relevant sample results can improve positive conversion results. It is an increasingly common practice as the tools and expertise grow in this area.[24]
Applications
Online social media
A/B tests have been used by large social-media sites such as LinkedIn, Facebook and Instagram to understand user engagement and satisfaction with online features, such as a new feature or product. A/B tests have also been used to conduct complex experiments on subjects such as network effects when users are offline, how online services affect user actions and how users influence one another.[25]
E-commerce
On an e-commerce website, the purchase funnel is typically a helpful candidate for A/B testing, as even marginal decreases in drop-off rates can represent a significant gain in sales. Significant improvements can sometimes be seen through testing elements such as copy text, layouts, images and colors.[26] In these tests, users only see one of two versions, as the goal is to discover which of the two versions is preferable.[27]
Product pricing
A/B testing can be used to determine the right price for a product, which is one of the most difficult challenges faced when a new product or service is launched. A/B testing (especially valid for digital goods) is an effective mechanism to identify the price point that maximizes the total revenue.
Political A/B testing
A/B tests have also been used by political campaigns. In 2007, Barack Obama's presidential campaign used A/B testing to garner online attraction and understand what voters wanted to see from Obama.[28] For example, Obama's team tested four distinct buttons on their website that led users to register for newsletters. Additionally, the team used six different accompanying images to attract users.[28]
HTTP routing and API feature testing

A/B testing is commonly employed when deploying a newer version of an API.[29] For real-time user experience testing, an HTTP layer 7 reverse proxy is configured in such a way that n% of the HTTP traffic is routed to the newer version of the backend instance, while the remaining (100 − n)% of HTTP traffic hits the (stable) older version of the backend HTTP application service.[29] This is usually done to limit the exposure of customers to a newer backend instance such that, if there is a bug in the newer version, only n% of the total user agents or clients are affected while the others are routed to a stable backend, which is a common ingress control mechanism.[29]
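The n% split can be illustrated at the application level with deterministic hash-based bucketing, so that a given client is always sent to the same backend. This is a sketch of the idea only, not a reverse-proxy configuration; the function `route`, the salt value and the client identifiers are hypothetical.

```python
import hashlib


def route(client_id, canary_percent, salt="api-v2-rollout"):
    """Return "new" for roughly canary_percent% of clients, otherwise "stable"."""
    digest = hashlib.sha256(f"{salt}:{client_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "new" if bucket < canary_percent * 100 else "stable"


# Hypothetical client identifiers; 10% of traffic goes to the newer backend.
decisions = [route(f"client-{i}", canary_percent=10) for i in range(100_000)]
share_new = decisions.count("new") / len(decisions)
print(f"{share_new:.1%} of clients routed to the new version")
```

Hashing a stable identifier rather than drawing a random number per request keeps each client's experience consistent across requests, and changing the salt reshuffles the assignment for a new experiment.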
See also
- Adaptive control
- Between-group design experiment
- Choice modelling
- Multi-armed bandit
- Multivariate testing
- Randomized controlled trial
- Scientific control
- Stochastic dominance
- Test statistic
- Two-proportion Z-test
References
- ^ a b Young, Scott W. H. (August 2014). "Improving Library User Experience with A/B Testing: Principles and Process". Weave: Journal of Library User Experience. 1 (1). doi:10.3998/weave.12535642.0001.101. hdl:2027/spo.12535642.0001.101.
- ^ Kohavi, Ron; Xu, Ya; Tang, Diane (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. Archived from the original on 22 October 2021. Retrieved 22 October 2021.
- ^ Kohavi, Ron; Longbotham, Roger (2023). "Online Controlled Experiments and A/B Tests". In Phung, Dinh; Webb, Geoff; Sammut, Claude (eds.). Encyclopedia of Machine Learning and Data Science. Springer. pp. 891–892. doi:10.1007/978-1-4899-7502-7_891-2. ISBN 978-1-4899-7502-7. Archived from the original on 21 April 2023. Retrieved 21 April 2023.
- ^ a b c Kohavi, Ron; Thomke, Stefan (September–October 2017). "The Surprising Power of Online Experiments". Harvard Business Review. pp. 74–82. Archived from the original on 14 August 2021. Retrieved 27 January 2020.
- ^ a b c Hanington, Jenna (12 July 2012). "The ABCs of A/B Testing". Pardot. Archived from the original on 24 December 2015. Retrieved 21 February 2016.
- ^ Kohavi, Ron; Longbotham, Roger (2017). "Online Controlled Experiments and A/B Testing". Encyclopedia of Machine Learning and Data Mining. pp. 922–929. doi:10.1007/978-1-4899-7687-1_891. ISBN 978-1-4899-7685-7.
- ^ "The Math Behind A/B Testing". developer.amazon.com. Archived from the original on 21 September 2015. Retrieved 12 April 2015.
- ^ Kohavi, Ron; Longbotham, Roger; Sommerfield, Dan; Henne, Randal M. (February 2009). "Controlled experiments on the web: survey and practical guide". Data Mining and Knowledge Discovery. 18 (1): 140–181. doi:10.1007/s10618-008-0114-1. S2CID 17165746.
- ^ Krishnamoorthy, K.; Thomson, Jessica (2004). "A more powerful test for comparing two Poisson means". Journal of Statistical Planning and Inference. 119: 23–35. doi:10.1016/S0378-3758(02)00408-1. S2CID 26753532.
- ^ "Advanced A/B Testing Tactics That You Should Know | Testing & Usability". Online-behavior.com. Archived from the original on 19 March 2014. Retrieved 18 March 2014.
- ^ "Eight Ways You've Misconfigured Your A/B Test". Dr. Jason Davis. 12 September 2013. Archived from the original on 18 March 2014. Retrieved 18 March 2014.
- ^ Statt, Nick (9 May 2016). "Google is experimenting with turning search results from blue to black". The Verge. Retrieved 25 September 2024.
- ^ Deng, Alex (February 2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining. doi:10.1145/2433396.2433413.
- ^ Sexauer, Craig (18 May 2023). "CUPED Explained". Blog. Archived from the original on 4 September 2024. Retrieved 11 September 2024.
- ^ Gupta, Somit; Kohavi, Ronny; Tang, Diane; Xu, Ya; Andersen, Reid; Bakshy, Eytan; Cardin, Niall; Chandran, Sumitha; Chen, Nanyu; Coey, Dominic; Curtis, Mike; Deng, Alex; Duan, Weitao; Forbes, Peter; Frasca, Brian; Guy, Tommy; Imbens, Guido W.; Saint Jacques, Guillaume; Kantawala, Pranav; Katsev, Ilya; Katzwer, Moshe; Konutgan, Mikael; Kunakova, Elena; Lee, Minyong; Lee, MJ; Liu, Joseph; McQueen, James; Najmi, Amir; Smith, Brent; Trehan, Vivek; Vermeer, Lukas; Walker, Toby; Wong, Jeffrey; Yashkov, Igor (June 2019). "Top Challenges from the first Practical Online Controlled Experiments Summit". SIGKDD Explorations. 21 (1): 20–35. doi:10.1145/3331651.3331655. S2CID 153314606. Archived from the original on 13 October 2021. Retrieved 24 October 2021.
- ^ Stolberg, M (December 2006). "Inventing the randomized double-blind trial: the Nuremberg salt test of 1835". Journal of the Royal Society of Medicine. 99 (12): 642–643. doi:10.1177/014107680609901216. PMC 1676327. PMID 17139070.
- ^ "What is A/B Testing". Convertize. Archived from the original on 17 August 2020. Retrieved 28 January 2020.
- ^ "Claude Hopkins Turned Advertising Into A Science". Investor's Business Daily. 20 December 2018. Archived from the original on 10 August 2021. Retrieved 1 November 2019.
- ^ Pereira, Ron (20 June 2007). "How beer influenced statistics". Blog. Gemba Academy. Archived from the original on 5 January 2015. Retrieved 22 July 2014.
- ^ Box, Joan Fisher (1987). "Guinness, Gosset, Fisher, and Small Samples". Statistical Science. 2 (1): 45–52. doi:10.1214/ss/1177013437.
- ^ Christian, Brian (27 February 2000). "The A/B Test: Inside the Technology That's Changing the Rules of Business". Wired Business. Archived from the original on 17 March 2014. Retrieved 18 March 2014.
- ^ Christian, Brian. "Test Everything: Notes on the A/B Revolution | Wired Enterprise". Wired. Archived from the original on 16 March 2014. Retrieved 18 March 2014.
- ^ Cory Doctorow (26 April 2012). "A/B testing: the secret engine of creation and refinement for the 21st century". Boing Boing. Archived from the original on 9 February 2014. Retrieved 18 March 2014.
- ^ "A/B Testing: The ABCs of Paid Social Media". Anyword. 17 January 2020. Archived from the original on 31 March 2022. Retrieved 8 April 2022.
- ^ Xu, Ya; Chen, Nanyu; Fernandez, Addrian; Sinno, Omar; Bhasin, Anmol (10 August 2015). "From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks". Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 2227–2236. doi:10.1145/2783258.2788602. ISBN 9781450336642. S2CID 15847833.
- ^ "Split Testing Guide for Online Stores". webics.com.au. 27 August 2012. Archived from the original on 3 March 2021. Retrieved 28 August 2012.
- ^ Kaufman, Emilie; Cappé, Olivier; Garivier, Aurélien (2014). "On the Complexity of A/B Testing" (PDF). Proceedings of The 27th Conference on Learning Theory. Vol. 35. pp. 461–481. arXiv:1405.3224. Bibcode:2014arXiv1405.3224K. Archived (PDF) from the original on 7 July 2021. Retrieved 27 February 2020.
- ^ a b Siroker, Dan; Koomen, Pete (7 August 2013). A / B Testing: The Most Powerful Way to Turn Clicks Into Customers. John Wiley & Sons. ISBN 978-1-118-65920-5. Archived from the original on 17 August 2021. Retrieved 15 October 2020.
- ^ a b c Szucs, Sandor (2018). Modern HTTP Routing (PDF). LISA 2018. Usenix.org. Archived (PDF) from the original on 1 September 2021. Retrieved 1 September 2021.



