Philosophy · 16 June 2014 · Ian Malpass

A/B testing complexity

We tend to think of A/B testing as a tool for testing the validity of product decisions, or for empirically determining them. Should this button be blue, or red? Does moving more pictures above the fold increase conversion? These are good and useful questions that A/B testing can help answer.

However, A/B tests are typically carried out within the bounds of a complex system. For web sites, that may include the programming language and its underlying libraries, templating systems, web servers, databases, networks, client devices, web browsers, operating systems, and those lovely agents of chaos: human users. The thing about complex systems is that they give rise to completely unexpected interactions (and, unfortunately, failure modes). The good news is that your A/B test is here to help.

The example that got me thinking about this is fairly simple. At Etsy, we (unsurprisingly) want the mobile web experience to be as smooth as possible. One area of concern is the difficulty people have typing passwords on (typically virtual) mobile keyboards. Usually, as you type, you're shown the last character you entered, which then turns into an asterisk or other such symbol as you move on. Anyone who has tried to sign in on a mobile device has probably experienced the frustration of mis-typing that arises from this process. The aim of hiding the password is to prevent someone from looking over your shoulder at your screen and memorising it, but a mobile device (being smaller and held closer to the user) is typically harder to snoop on. Our hypothesis was that members signing in on mobile might prefer to have their password shown in clear text so that they could get it right first time.
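To make the treatment concrete: revealing a password mostly amounts to flipping the input's `type` attribute. Here's a rough sketch in TypeScript (illustrative only, with hypothetical element IDs, not our production code):

```typescript
// Minimal show/hide password toggle. Switching the input's `type`
// between "password" and "text" masks or reveals what the member typed.
const passwordInput = document.querySelector<HTMLInputElement>("#password")!;
const toggleButton = document.querySelector<HTMLButtonElement>("#toggle-password")!;

toggleButton.addEventListener("click", () => {
  const reveal = passwordInput.type === "password";
  passwordInput.type = reveal ? "text" : "password";
  toggleButton.textContent = reveal ? "Hide password" : "Show password";
});
```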

The experiment had three groups: the control (hidden as usual), one where the password was shown on the screen in clear text with an option to hide it if the member wished, and a third where it was hidden with the option to reveal it. We could measure the rate of login failure, and also the frequency with which members in the two experimental groups chose to toggle the feature on or off. A solid hypothesis, a fairly classic experimental setup. So far so good.
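In outline, the bucketing for a test like this might look something like the sketch below. The experiment name, variant labels, and logging call are hypothetical stand-ins, not our actual experiment framework:

```typescript
import { createHash } from "node:crypto";

type Variant = "control" | "visible_by_default" | "hidden_by_default";
const VARIANTS: Variant[] = ["control", "visible_by_default", "hidden_by_default"];

// Hash the experiment name together with the member ID so each member
// lands in the same group on every visit, independently of other tests.
function assignVariant(memberId: string, experiment: string): Variant {
  const digest = createHash("sha256").update(`${experiment}:${memberId}`).digest();
  return VARIANTS[digest.readUInt32BE(0) % VARIANTS.length];
}

// The two metrics we cared about, recorded per group.
function logEvent(variant: Variant, event: "login_failure" | "visibility_toggled"): void {
  console.log(JSON.stringify({ experiment: "mobile_password_visibility", variant, event }));
}
```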

The results were awful.

Login failures went up dramatically in the group with visible passwords, completely contrary to our expectations. We weren’t sure if the change would make things better, but we were unprepared for it making things worse. Why was our product sense, on the face of it entirely reasonable, so wrong?

The answer, after much head-scratching, was that it wasn't. Instead, the culprit was a confounding factor from a completely different part of the complex system: clear text inputs on most mobile OSes get autocorrect. Passwords rarely look like dictionary words, so as members typed, their phones were quietly "correcting" their now-visible passwords into something else entirely, and the logins failed.
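Once you know the cause, the fix is easy enough. A sketch of the sort of thing required (the `autocorrect` attribute is non-standard WebKit, hence `setAttribute`):

```typescript
// When revealing a password by switching the input to type="text",
// also switch off the mobile keyboard "help" that mangles it.
function revealPassword(input: HTMLInputElement): void {
  input.type = "text";
  input.setAttribute("autocorrect", "off");     // non-standard; mobile Safari/WebKit
  input.setAttribute("autocapitalize", "off");  // widely supported on mobile browsers
  input.spellcheck = false;                     // suppress spelling suggestions
}
```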

Had we not taken the time to do the experiment, we would have had no way to know that this failure mode was occurring. Chances are it would have made only a slight change in the overall rate of login failures, probably not something we would have noticed in our day-to-day scanning of our dashboards. The problem would probably have persisted until eventually some frustrated member complained about it on the forums, by which time the damage would have been all the greater.

So, when planning your A/B tests, always be aware that they may end up telling you about something quite unexpected, and entirely unrelated to the question you intended to ask.