Ah, the one-way mirror. Also, confusingly, called a two-way mirror.
Iconic in traditional usability testing practice and imagery. But do we really need two rooms to conduct valid product testing anymore? Not just as a convenient room for observers, but to actually moderate from ‘behind the glass’, separated from the participant, communicating via mics and speakers?
Was this merely a quaint effort to legitimize usability testing as a ‘scientific’ practice and is now laughably passé? Or are there logical and enduring reasons why moderating from the ‘control room’ is best practice for summative / benchmark types of tests?
Let’s listen in on a prototypical conversation between two UX Testing pros – often, but not always, representing different generations (such as X vs Y/Z) or different program backgrounds (such as Cog Psy vs. Design Thinking)….
Behind the Glass
by: Prakhantree
SCENE: A chance meeting at a Palo Alto coffee shop
… with Kevin! The highly-skilled, well-respected UX veteran
…and Katya! The sharp, young UX researcher
Link in case the above embed doesn’t work: http://www.xtranormal.com/watch/14382433/behind-the-glass
I firmly believe either physical setup is suitable, and will yield completely valid and useful results, assuming the moderator has appropriately established rapport and set expectations to offset the potential negatives of each approach.
So what matters then?
Beyond the most obvious factors of type of test and actual availability of a fancy lab, it comes down mostly to preference. I’ve summarized my learnings below. You’re welcome.
Ask yourself these simple questions to gauge if moderating from ‘behind the glass’ could be right for you!
Might I want or need to have discreet communication with other observers, such as a notetaker, client, or product experts during testing?
- to clarify a path or feature
- to acknowledge an issue together
- to ensure an error, comment, etc. is captured (if a notetaker is assisting)
- to troubleshoot a system issue
- to gather and plan for any follow-up questions in real time
Is the lab setup adjustable and ergonomic enough to sit with the participant?
- sitting right next to the user can strain the neck (especially with multiple tests in a day)
- might have to balance your laptop on your lap
- have to ensure participant doesn’t view your note-taking screen
- sitting behind the user might feel creepy
- can be hard to get an unobstructed view of the screen without getting ‘too close’
- hard to change positions mid-test if you’re uncomfortable
- small rooms can become uncomfortably warm with two people and the heat of equipment
Do I like to have more privacy while I test (especially if there are no other observers with me)?
- to munch that Clif Bar
- scratch your nose
- take off your shoes
- yawn and stretch
- freely face-palm and head-desk
- stress less about bodily noises
Am I sensitive to hygiene variability?
- the user that just ‘came from a run!’
- came from the bar
- came from the burrito bar
- came from the cigar bar
- isn’t quite over that phlegmatic cough (!)
- all of the above at once in a perfect storm
If few or none of the above are super-important to you, or you have work-arounds/solutions for them, then you will probably not perceive any benefit to moderating in a separate control room. Or perhaps you discovered a few benefits that you hadn’t thought about. Either way, happy testing!
This post originally appeared on Oracle’s Usable Apps blog. Mick is a member of the Oracle Usability Advisory Board. Thanks to Oracle for allowing EchoUser to repost it here.
Being part of a user experience design firm, we have the luxury of working with a lot of great people across many great companies. We get to help people solve their problems. At least we used to. The basic design challenge is still the same; however, the goal is not necessarily to solve “problems” anymore; it is, “I want our products to delight and excite!” The question for us as UX professionals is how to design to those goals, and then how to assess them from a usability perspective.
I’m not sure where I first heard “delight and excite” (A book? blog post? Facebook status? Steve Jobs quote?), but now I hear these listed as user experience goals all the time. In particular, somewhat paradoxically, I routinely hear them in enterprise software conversations. And when asking these same enterprise companies what will make the project successful, we very often hear, “Make it like Apple.” In past days, it was, “Make it like Yahoo (or Amazon or Google),” but now Apple is the common benchmark.
Steve Jobs and Apple were not secrets, but with Jobs’ passing and Apple becoming the world’s most valuable company in the last year, the impact of great design and experience is suddenly very widespread. In particular, users’ expectations have gone way up. Being an enterprise company is no shield to the general expectations that users now have, for all products.
The user experience challenge has historically been, to echo the words of Eric Ries (author of The Lean Startup), to create a “minimum viable product”: the proverbial “make it good enough.” But, in our profession, the “minimum viable” part of that phrase has oftentimes, unfortunately, referred to the design and user experience. Technology typically dominated the focus of the biggest, most successful companies. Few have had the laser focus of Apple to also create and sell design and user experience alongside great technology.
But now that Apple is the most valuable company in the world, copying their success is a common undertaking. Great design is now a premium offering that everyone wants, from the one-person startup to the largest companies, consumer and enterprise. This emerging business paradigm will have significant impact across the user experience design process and profession. One area that particularly interests me is, how are we going to evaluate these new emerging “delight and excite” experiences, which are further customized to each particular domain?
Traditional usability measures of task completion rate, assists, time, and errors are still extremely useful in many situations; however, they are too blunt to offer much insight into emerging experiences. “Satisfaction” is usually assessed in user testing and given roughly the same importance as the objective metrics above. Various surveys and scales have provided ways to measure satisfying UX, with whatever questions they include. However, to meet the demands of new business goals and keep users at the center of design and development processes, we have to explore new methods to better capture custom-experience goals and emotion-driven user responses.
We have had success assessing custom experiences, including “delight and excite,” by employing a variety of user testing methods that tend to combine formative and summative techniques (formative being focused more on identifying usability issues and ways to improve design, and summative focused more on metrics). Our most successful tool has been one we’ve been using for a long time, Magnitude Estimation Technique (MET). But it’s not necessarily about MET as a measure, rather how it is created.

For one client, EchoUser did two rounds of testing. Each test was a mix of performing representative tasks and gathering qualitative impressions. Each user participated in an in-person moderated 1-on-1 session for 1 hour, using a testing set-up where they held the phone. The primary goal was to identify usability issues and recommend design improvements.
MET is based on a definition of the desired experience, which users will then use to rate items of interest (usually tasks in a usability test). In other words, a custom experience definition needs to be created. This can then be used to measure satisfaction in accomplishing tasks; “delight and excite”; or anything else from strategic goals, user demands, or elsewhere. For reference, our standard MET definition in usability testing is:
“User experience is your perception of how easy to use, well designed and productive an interface is to complete tasks.”
We’ve helped construct experience definitions for several clients to better match their business goals. One example is a modification of the above that was needed for a company that makes medical-related products:
“User experience is your perception of how easy to use, well-designed, productive and safe an interface is for conducting tasks. ‘Safe’ is how free an environment (including devices, software, facilities, people, etc.) is from danger, risk, and injury.”
Another example is from a company that is pushing hard to incorporate “delight” into their enterprise business line:
“User experience is your perception of a product’s ease of use and learning, satisfaction and delight in design, and ability to accomplish objectives.”
I find the last one particularly compelling in that there is little that identifies the experience as being for a highly technical enterprise application. That definition could easily be applied to any number of consumer products.
We have gone further than the above, including “sexy” and “cool” where decision-makers insisted they were part of the desired experience. We also applied it to completely different experiences where the “interface” was, for example, riding public transit, the “tasks” were train rides, and we followed the participants through the train-riding journey and rated various aspects accordingly:
“A good public transportation experience is a cost-effective way of reliably, conveniently, and safely getting me to my intended destination on time.”
To construct these definitions, we’ve employed both bottom-up and top-down approaches, depending on circumstances. For bottom-up, user inputs help dictate the terms that best fit the desired experience (usually by way of cluster and factor analysis). Top-down depends on strategic, visionary goals expressed by upper management that we then attempt to integrate into product development (e.g., “delight and excite”). We like a combination of both approaches to push the innovation envelope but still be mindful of current user concerns.
Hopefully the idea of crafting your own custom experience definition, and a way to measure it, gives you some ideas for how to adapt your user experience goals to whatever company you are in. Whether product-development or service-oriented, nearly every company is ultimately providing a user experience.
Creating great experiences may have been popularized by Steve Jobs and Apple, but I’ll be honest, it’s a good feeling to be moving from “good enough” to “delight and excite,” despite the challenge that entails. In fact, it’s because of that challenge that we will expand what we do as UX professionals to help deliver and assess those experiences. I’m excited to see how we, Oracle, and the rest of the industry will live up to that challenge.
The EchoUser research team had quite a busy December. Our schedules were filled with recruiting users, drafting test plans, moderating usability sessions, writing reports, and, last but not least, arranging check-in meetings with clients throughout the project cycle.
Clients — regardless of their UX background — would raise questions and concerns about UX methodology in those meetings to make sure that their studies were on the right track and that they would get valuable and defensible data from the projects.
In the two usability projects I am on (both benchmark studies), I came across the following two interesting questions from our clients. Though the two questions seemed to have come from two different angles, they both point to one of the key issues in doing usability studies: how to interpret usability data with a small number of users. I thought I’d share the two client questions and hope to elicit some extended discussions here.
Client Question 1: How many participants is enough for a benchmark usability study? Eight, 10, or 12?
A lot of times, the question actually becomes, “Do we need a single-digit participant number or a double-digit one?” Clients want the usability study results to be defensible both from a statistical and a PR standpoint. When time and resources allow and it’s easy to recruit target participants, the question of “Should we get two more participants for the study?” has an easy solution: Let’s just do two more sessions. However, in a scenario in which qualified participants are very difficult to find or recruit (for instance, the study requires a highly specific user profile) or time and resources are limited, how many participants are needed? Is it worthwhile to spend two more weeks on the study just to make it to a total of 10 participants?
The bigger issue: What is the rationale we should use to validate the number of participants for a usability study?
If we go back to the classic model from Nielsen, five users are enough to uncover 85% of usability issues. That has been the UX industry ‘standard’ for a long time, as Jakob Nielsen and his colleagues were among the first UX professionals to calculate the relationship between the number of UX issues uncovered and the number of participants involved. The mathematical model is derived from their years of experience conducting usability studies.
Faulkner challenged Nielsen’s model in 2003 with a paper named “Beyond the five-user assumption: Benefits of increased sample sizes in usability testing.” She carefully designed and conducted a few studies with different sample sizes (5, 10, 20, 30, 40, 50, and 60 participants). What she learned from the follow-up data simulation and analysis is that 10 participants are enough to identify at least 82% of the usability issues, whereas a sample size of 15 can help to identify at least 90% of the issues.
I even came across a sample size calculator on Jeff Sauro’s Measuring Usability site. Based on the binomial probability formula, it allows you to calculate, for instance, how many users are needed to discover 80% of the usability issues when every issue’s probability of occurrence is above 30%.
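For readers who want to see the arithmetic, here is a minimal sketch of the binomial reasoning behind these numbers. It assumes every issue has the same chance p of showing up for any one user, so the share of issues seen at least once by n users is 1 - (1 - p)^n. The function names are made up for illustration, and the 31% figure is the per-user detection rate commonly cited alongside Nielsen's five-user rule rather than a number from this post. This is not the code behind Jeff Sauro's calculator.

```python
import math


def proportion_found(p: float, n: int) -> float:
    """Expected share of issues seen at least once by n users,
    assuming each issue shows up for any one user with probability p."""
    return 1 - (1 - p) ** n


def users_needed(p: float, target: float) -> int:
    """Smallest number of users n such that 1 - (1 - p)**n >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


if __name__ == "__main__":
    # Five users with an assumed 31% per-user detection rate
    print(f"{proportion_found(0.31, 5):.1%} of issues expected with 5 users")
    # The example from the text: 80% of issues, each at least 30% likely
    print(users_needed(0.30, 0.80), "users needed")
```

With p = 0.30 and a target of 80%, the sketch lands on five users, which is in line with the five-user rule of thumb. The equal-probability assumption is the weak point: rarer issues need far more sessions to surface.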
All of the above can be used as reference rationales to validate using a certain number of participants for a study. However, as specifically mentioned in Faulkner’s paper, having a highly representative user sample is crucial in uncovering the priority usability issues. Indeed, beyond all those statistical models, getting the right users is sometimes as important as (if not more important than) getting enough users.
Client Question 2: Are we telling the product team that 80% of our customers will fail to use this functionality because 8 out of 10 users failed in the usability study?
Well, the primary purpose of usability studies is to discover qualitative usability issues with an interface, as opposed to predicting the probability of those issues’ occurrence. However, the task completion rate is one of the key metrics we use to evaluate the usability of different UI features, and it is our responsibility to give clients and the product team a clear idea of how to interpret the completion rate.
The confidence level of the results is, again, closely related to the number of users included in the study. From a statistical standpoint, it’s not difficult to understand that the more users in the study, the more confident we can be in the results. However, with only 10 participants, how confident can we say we are in our results?
John Sorflaten has an interesting article discussing this topic. He put forward the limitation of using task success data to predict customer behavior on a larger scale. He recommended using the Adjusted Wald Interval calculator coded by Jeff Sauro to generate the lower and upper bounds of the task success data.
For instance, if 8 out of 10 participants succeed in a task, how could this data be used to predict 1,000 or 10,000 users’ behavior? At a confidence level of 95% (meaning that if you ran the same test 100 times and built an interval each time, about 95 of those intervals would contain the true success rate), Jeff’s calculator generates a lower bound of 48% success and an upper bound of 96% success based on the 80% task success rate from the usability study, accounting for the small sample size. The same is true if 8 out of 10 participants fail a task: the calculator puts the share of users failing the task once the UI is released at anywhere from 48% to 96%.
In that sense, as opposed to using the 80% task success rate to predict broader user behavior, we as usability professionals can show the range between 48% and 96% as a reference range for the product manager or marketing team to make further interpretations or decisions.
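If you don't have the calculator handy, the interval is easy to approximate yourself. The sketch below uses the textbook adjusted Wald formula: add z²/2 to the successes and z² to the sample size, then compute an ordinary Wald interval on the adjusted proportion. The function name is hypothetical and this is not Jeff Sauro's own code, so treat it as an approximation of what his calculator does.

```python
import math
from typing import Tuple


def adjusted_wald(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Adjusted Wald confidence interval for a task success rate.

    z = 1.96 corresponds to a 95% confidence level.
    """
    n_adj = n + z ** 2                       # adjusted sample size
    p_adj = (successes + z ** 2 / 2) / n_adj  # adjusted success proportion
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to [0, 1], since a success rate cannot fall outside that range
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


if __name__ == "__main__":
    low, high = adjusted_wald(8, 10)
    print(f"8 of 10 succeeded: between {low:.1%} and {high:.1%}")
```

For 8 successes out of 10, this gives a range from roughly 48% to roughly 95%, close to the figures quoted above. The small gap at the top end may come down to rounding or to the exact variant the calculator uses.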
Next time, when clients are debating between 8 and 10 participants, or the product manager is asking why the task completion rate does not match large-scale user data, these basic stats will help to answer the questions.
*Moral not Morale, but hey, hope your Morale improved watching our skit. Cheers! Vel
Vel Prakhantree