Experimentation reading list

Resources for learning more about controlled experimentation

Jan 24, 2023


During my time on the Growth team at Noom, I've had to learn a lot about online controlled experimentation. I've put together most of the public resources I've come across into this document.

If you're brand new to online controlled experimentation, I highly recommend starting with Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. It's a detailed guide to the experimentation landscape. It was a must-read for every member of my team, where it was known as the "Experimentation Bible".

  • Stanford Seminar: Peeking at A/B Tests - Why It Matters and What to Do About It
    • Notes:
      • Frequentist
        • Popular 100 years ago
          • Data was expensive to gather back then.
        • Fixed sample size testing
      • Peeking problem
        • Continuously monitor results dashboard
        • Adjust test length in real time (Adaptive sample size testing)
        • Peeking can dramatically inflate false positives.
        • Intuition
          • Sample size specifies a point in time; it says nothing about what happens between the start of the experiment and the moment you reach that sample size.
          • A high chance of seeing a significant result on the way to the target sample size can bias whether we keep the test running.
          • If you wait long enough, there is a high chance that an eventually inconclusive result will look "significant" at some point along the way!
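The inflation is easy to demonstrate with a small simulation (my own illustration, not from the talk): run many A/A tests with no true effect and "peek" at a t-test after every batch of observations, stopping as soon as any peek crosses p < 0.05.

```python
# Simulate the peeking problem: with repeated looks at an A/A test,
# the chance of *some* look crossing p < 0.05 far exceeds the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_peeks, batch = 1000, 20, 100

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=n_peeks * batch)  # control; no real difference
    b = rng.normal(size=n_peeks * batch)  # "treatment"; same distribution
    for k in range(1, n_peeks + 1):
        n = k * batch
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:                      # peek: stop early, declare a winner
            false_positives += 1
            break

print(f"false-positive rate with peeking: {false_positives / n_tests:.2f}")
```

With 20 peeks per test, the rate typically lands well above 0.05, which is exactly the problem adaptive/sequential methods are designed to fix.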

Experimentation Platform

  • Experimentation Platform - Microsoft Research

  • Building our Centralized Experimental Platform | Stitch Fix Technology - Multithreaded

    • Notes:
      • They have analytics automated
      • Not a lot of details in this blog post
  • Meet Wasabi, an Open Source A/B Testing Platform

    • Wasabi is not under active development
  • Scaling Airbnb's Experimentation Platform

    • Notes:
      • Uses Airflow for metrics
        • Cut ERF runtime from 24 hours to 45 minutes
      • Useful article to learn about how they do data processing
  • It's All A/Bout Testing

    • Notes:
      • Targeting and allocation
        • There is one more topic to address before we dive further into details, and that is how members get allocated to a given test. We support two primary forms of allocation: batch and real-time.
        • Batch allocations give analysts the ultimate flexibility, allowing them to populate tests using custom queries as simple or complex as required. These queries resolve to a fixed and known set of members which are then added to the test. The main cons of this approach are that it lacks the ability to allocate brand new customers and cannot allocate based on real-time user behavior. And while the number of members allocated is known, one cannot guarantee that all allocated members will experience the test (e.g. if we’re testing a new feature on an iPhone, we cannot be certain that each allocated member will access Netflix on an iPhone while the test is active).
        • Real-Time allocations provide analysts with the ability to configure rules which are evaluated as the user interacts with Netflix. Eligible users are allocated to the test in real-time if they meet the criteria specified in the rules and are not currently in a conflicting test. As a result, this approach tackles the weaknesses inherent with the batch approach. The primary downside to real-time allocation, however, is that the calling app incurs additional latencies waiting for allocation results. Fortunately we can often run this call in parallel while the app is waiting on other information. A secondary issue with real-time allocation is that it is difficult to estimate how long it will take for the desired number of members to get allocated to a test, information which analysts need in order to determine how soon they can evaluate the results of a test.
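Real-time allocation like this is commonly implemented with deterministic hashing, so assignment is stateless and needs no lookup table. A generic sketch (my assumption about the mechanism, not Netflix's actual implementation; the function and test names are hypothetical):

```python
# Deterministic hash-based allocation: hashing member id together with the
# test name makes assignment stable per test but independent across tests.
import hashlib

def assign(member_id: str, test_name: str, arms=("control", "treatment")) -> str:
    """Deterministically map a member to an arm for a given test."""
    digest = hashlib.sha256(f"{test_name}:{member_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(arms)
    return arms[bucket]

# The same member always lands in the same arm for a given test.
assert assign("member-42", "new-player-ui") == assign("member-42", "new-player-ui")
```

Because the mapping is pure computation, the latency cost mentioned above comes from rule evaluation and eligibility checks, not from the bucketing itself.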
  • Twitter experimentation: technical overview

    • Notes:
      • Uses feature flags
      • Pretty unremarkable
  • Zalando Engineering Blog - Experimentation Platform at Zalando: Part 1 - Evolution

    • Notes:
      • Uses crawl/walk/run phrasing à la Ronnie K
      • Their system allows for staged rollouts
  • Under the Hood of Uber's Experimentation Platform

    • Notes:

      • Good details, screenshots of their XP.

      Broadly, we use four types of statistical methodologies: fixed horizon A/B/N tests (t-test, chi-squared, and rank-sum tests), sequential probability ratio tests (SPRT), causal inference tests (synthetic control and diff-in-diff tests), and continuous A/B/N tests using bandit algorithms (Thompson sampling, upper confidence bounds, and Bayesian optimization with contextual multi-armed-bandit tests, to name a few). We also apply block bootstrap and delta methods to estimate standard errors, as well as regression-based methods to measure bias correction when calculating the probability of type I and type II errors in our statistical analyses.

    • The XP detects major issues during analysis:

      • Sample-size imbalance (the ratio of sample sizes between the control and treatment groups)
      • Flickers (Users that have switched between control and treatment groups)
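A sample-ratio-mismatch (SRM) check like this is usually a chi-squared goodness-of-fit test against the intended split; a tiny p-value means assignment is likely broken and results shouldn't be trusted. A generic sketch (Uber's actual check may differ):

```python
# SRM check: compare observed group counts against the intended split.
from scipy import stats

def srm_pvalue(n_control: int, n_treatment: int, expected_ratio: float = 0.5) -> float:
    """p-value for the hypothesis that the observed counts match the intended split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p

print(srm_pvalue(50_000, 50_400))  # small imbalance: large p, no alarm
print(srm_pvalue(50_000, 52_000))  # ~2% skew at this scale: tiny p, flag it
```

Note that at large sample sizes even a 2% skew is wildly improbable under a correct 50/50 split, which is why SRM is such a sensitive tripwire.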
    • They do data pre-processing:

      • Outlier detection and removal using a clustering algorithm
      • Variance reduction using the CUPED method
      • Correction of pre-experiment bias between groups using difference-in-differences
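CUPED in its standard textbook form (Uber's production details may differ) adjusts the experiment metric Y using a pre-experiment covariate X that cannot be affected by treatment: Y_cuped = Y - theta * (X - mean(X)), with theta = cov(Y, X) / var(X).

```python
# CUPED variance reduction: the adjusted metric keeps the same mean but has
# lower variance whenever Y correlates with the pre-experiment covariate.
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(1)
x_pre = rng.normal(10, 2, 10_000)         # pre-experiment metric
y = x_pre + rng.normal(0, 1, 10_000)      # experiment metric, correlated with x_pre
y_adj = cuped_adjust(y, x_pre)
print(y.var(), y_adj.var())               # adjusted variance is much smaller
```

Lower variance means smaller required sample sizes (or shorter experiments) for the same statistical power.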
    • Sequential testing

      This is very similar to the circuit breaker system we were interested in building.

      One use case where a sequential test comes in handy for our team is when identifying outages caused by the experiments running on our platform. We cannot wait until a traditional A/B test collects sufficient sample sizes to determine the cause of an outage; we want to make sure experiments are not introducing key degradations of business metrics as soon as possible, in this case, during the experimentation period. Therefore, we built a monitoring system powered by a sequential testing algorithm to adjust the confidence intervals accordingly without inflating Type-I error.

      Using our XP, we conduct periodic comparisons about these business metrics, such as app crash rates and trip frequency rates, between treatment and control groups for ongoing experiments. Experiments continue if there are no significant degradations, otherwise they will be given an alert or even paused. The workflow for this monitoring system is shown in Figure 6, below:

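The classic building block for this kind of monitor is Wald's sequential probability ratio test; a textbook sketch in the spirit of the circuit-breaker use case (a generic illustration, not Uber's actual system; the crash rates and error levels are made up):

```python
# SPRT: after each observation, update the log-likelihood ratio between
# H1 (degraded crash rate p1) and H0 (baseline crash rate p0), and stop
# as soon as it crosses a decision boundary. Peeking at every observation
# is safe here because the boundaries control the error rates by design.
import math

def sprt(observations, p0=0.01, p1=0.02, alpha=0.05, beta=0.05):
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1 (degraded)
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0 (healthy)
    llr = 0.0
    for i, crashed in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if crashed else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "degraded", i
        if llr <= lower:
            return "healthy", i
    return "undecided", len(observations)
```

A stream crashing at the elevated 2% rate trips the "degraded" boundary after a few hundred observations, long before a fixed-horizon test would have collected its full sample.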
    • Bandit aka Continuous Experiments

      • Bandit is used at Uber for:
        • Content optimization
        • Spend optimization
        • Hyper-parameter tuning for learning models.
        • Automated Rollouts
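The simplest of the bandit algorithms listed, Thompson sampling, is easy to sketch for a Bernoulli metric like click-through rate (textbook version; Uber's production algorithms are more elaborate, and the rates below are hypothetical):

```python
# Thompson sampling for a two-armed Bernoulli bandit: keep a Beta posterior
# per arm, draw one sample from each posterior, and play the arm with the
# highest draw. Traffic shifts toward the better arm as evidence accumulates.
import numpy as np

rng = np.random.default_rng(2)
true_ctr = [0.04, 0.06]                   # hypothetical click-through rates
alpha = np.ones(2)                        # Beta(1, 1) priors
beta = np.ones(2)

for _ in range(20_000):
    arm = int(np.argmax(rng.beta(alpha, beta)))  # sample posteriors, pick best
    reward = rng.random() < true_ctr[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

pulls = alpha + beta - 2
print(pulls)  # most pulls end up on the better arm
```

This is why bandits suit the use cases above: unlike a fixed A/B split, they shift traffic toward winners automatically, trading clean inference for lower regret.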
  • Building an Intelligent Experimentation Platform with Uber Engineering

    • Notes:

      • Uber uses their XP to do staged rollouts.
      • On metric computation

      The new analysis tool does not pre-compute the data of the metrics, which cut down on our data storage spend and reduced our analysis generation time. Now, when the data is ready for analysis, we use a SQL query file to generate metrics on the fly whenever people make a request on the WebUI. After that, we use Scala as our service engine to compute the probability (p-value) that the treatment group mean is significantly different than the control group mean, determining if the experiment reached the target sample size.
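The comparison described boils down to a two-sample test on the metric values returned by the query. A minimal sketch with Welch's t-test (Uber computes this in Scala; the numbers here are made up):

```python
# Two-sample comparison of treatment vs. control means, as an on-the-fly
# analysis step would compute it from queried metric values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(10.0, 3.0, 5_000)    # e.g. per-user metric values
treatment = rng.normal(11.0, 3.0, 5_000)  # simulated lift

t, p = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
print(f"t={t:.2f}, p={p:.4g}")
```

Welch's variant is the usual default since it doesn't assume equal variances between groups.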

  • In-house experimentation platforms

    • Cool list of various companies' experimentation platforms.
  • ExPlat: Automattic's Experimentation Platform