Voice interaction design - four key lessons from psychology

By 2025, the global smart speaker market is projected to grow to $35 billion. Voice interaction assistants like Siri, Microsoft Cortana and Amazon Alexa have gone from interesting experiments to a key part of billions of people’s daily digital lives.


By any standards, the growth of voice interaction is spectacular. It took about 30 years for mobile phones to outnumber humans. Alexa and her ilk may get there in less than half that time.

That means good voice interaction design is becoming increasingly important in a huge number of fields. From web design and user experience to app design, search engines and digital marketing, voice interaction is rapidly going from ‘nice to have’ to ‘must have’.

Hey Siri! Tell me about the psychology of voice interaction

For designers and other creatives, this is both a challenge and an opportunity. People speak differently from the way they express themselves in writing, and there are other important factors to take into account as well. That means if you want to design great voice experiences, you need to understand some key psychological principles. Let’s look at four key areas:

Accessibility
Emotion
Unintended bias
Cognitive overload

Accessibility and voice user interface design

Research has shown that when interacting with voice technology, adults instinctively use very specific short, focused commands. It’s much more akin to how you would interact with a pet than with another human being.

Children, on the other hand, interact differently. Firstly, their instructions tend to be less precise as they stumble on words and make mistakes. And even more significantly, young children do not have the ability to change conversational requests into short imperatives.


This was clearly shown when researchers at the University of Washington asked a group of three-to-five-year-olds to give the instruction “say quack like a duck” when they saw a cartoon duck on a screen. The children were told that the duck would quack back at them, but by accident, the system wasn’t working.

What the researchers saw was that the children carried on repeating the same phrase until they were told by an adult that it was broken. They rarely raised their voices or changed the way they formulated the instruction.

This has big implications for the UX design of voice interaction systems, which need to be accessible to all. Take home automation systems like Google Home: it’s easy to imagine a scenario where an inability to respond to a small child’s commands might have upsetting consequences.

Smart home speakers that are not accessible to everyone in the home are not smart enough. They need to be designed to respond in a way that builds on the parts of the instruction that have been understood and prompts young children to complete the thought.
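To make that idea concrete, here is a minimal sketch of a response that builds on whatever was understood rather than failing outright. It is only an illustration: the intent names, slot names and prompts are hypothetical and do not come from any real assistant's API.

```python
# Hypothetical sketch: intents, slots and wording are invented for
# illustration and are not taken from any vendor's platform.
REQUIRED_SLOTS = {
    "play_animal_sound": ["animal"],
    "set_timer": ["duration"],
}

FOLLOW_UP_PROMPTS = {
    "animal": "Which animal should I sound like?",
    "duration": "How long should the timer run?",
}


def respond(intent, slots):
    """Return a reply that builds on the parts of a request that were understood."""
    if intent is None:
        # Nothing was recognised, so invite another attempt in plain words.
        return "I didn't catch that. What would you like me to do?"
    missing = [name for name in REQUIRED_SLOTS.get(intent, []) if name not in slots]
    if missing:
        # Part of the request was understood: ask only for the missing piece.
        return FOLLOW_UP_PROMPTS[missing[0]]
    return f"OK, running {intent}."


print(respond("play_animal_sound", {}))
print(respond("play_animal_sound", {"animal": "duck"}))
```

The design choice is that a partly understood request never ends in silence or a generic error. The system asks one short follow-up question, which a young child can answer without having to rephrase the whole instruction.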

Emotion -- it’s not you, it’s me

When we speak, our meaning is more than the sum of our words. The way we phrase what we say can transform an innocent ‘good morning’ into a sarcastic jibe. It can tell our listeners whether we’re happy, sad or angry. It allows other humans to tell if we mean what we say, or the exact opposite.

These nuances and subtleties are extremely hard for artificial intelligences to pick up. Until they can, your Echo Dots and Apple devices will be able to do no more than follow the literal meaning of your speech.

This inability of AI to understand the emotional ‘metadata’ of human conversation is also a potential threat to the quality of human-to-human interactions. At the moment, you can bark an order at your Alexa app or ask her politely to tell you the weather, and you’ll get an identical response.

As we’ve seen, adults tend to talk to voice interfaces in short commands. If we don’t develop voice technology which responds according to the way it is spoken to, we could sow the seeds of long-term social problems.

If we get used to demanding information, will this seep into our human interactions? Will it undermine the foundations of politeness humanity has built up to ensure civilised social interactions?

Race and gender biases

Another major problem is that voice recognition artificial intelligence isn’t equally receptive to everyone’s voice. AI voice recognition tends to perform worse with female and minority voices.

Part of the reason for this is that audio analysis struggles with breathier and higher-pitched voices. However, there’s a clear need to train speech recognition algorithms on a far wider variety of voices. One reason for bias towards white male voices in speech recognition is that speech scientists frequently use TED Talks to train AI models, and 70% of TED speakers are male.

These biases are not just of academic interest. They can have serious consequences in people’s lives. For example, in 2017 an Irish veterinarian failed an Australian immigration spoken English proficiency test.

Despite being a highly-educated native English speaker, she failed to meet the requirements for oral fluency. Clearly, this was a fault with the test rather than the applicant. In practice, the test was measuring applicants’ ability to conform to a certain voice type rather than their spoken English proficiency.
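One way to check whether a recognizer treats voices unequally is to compare its word error rate across groups of speakers. The sketch below is a simplified illustration, not a description of how any particular product or test is evaluated. The group labels and transcripts are made up, and a real audit would need a large, carefully labelled sample.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word dropped by the recognizer
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, what was said, what the system heard)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g]}


# Made-up transcripts, purely to show the shape of the comparison.
samples = [
    ("group_a", "turn on the light", "turn on the light"),
    ("group_b", "turn on the light", "turn off the night"),
]
print(wer_by_group(samples))  # {'group_a': 0.0, 'group_b': 0.5}
```

A large gap between groups on the same set of phrases suggests the system is measuring how closely a voice matches its training data rather than what was actually said.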

Why isn’t Alexa Alexander?

It’s no coincidence that most voice assistants have women’s voices. The creators of Amazon Alexa carried out research and found that a woman’s voice is perceived as more sympathetic.

This supports broader findings that women’s voices are better received by consumers, and that from an early age we prefer listening to female voices.

However, we need to think carefully about this seemingly innocent design psychology. UNESCO research claims that the default use of female-sounding voice assistants signals to users that women are "obliging, docile and eager-to-please helpers, available at the touch of a button or with a blunt voice command like ‘hey’ or ‘OK’".

Because assistants respond to commands regardless of tone or hostility, there’s a concern that the way we talk to Alexa and Siri could perpetuate gender stereotypes and influence the way we talk to women in our everyday lives.

One interesting solution to this issue is Q, a genderless voice pitched between typical male and female voices and showing characteristics of both. It offers a different approach to voice interface design.

Just as in visual design, voice assistant design needs to be accessible and relevant to its audience. So as the technology matures, it’s likely that technology companies, online marketers and traditional marketers will all embrace greater voice diversity, including regional accents, different age groups and slang.

Cognitive overload?

Many of us use voice technology to multitask, for example by having our email read out while making breakfast. But does this help us complete more tasks in a shorter amount of time, or do we end up doing both tasks badly?

Cognitive psychology research like Christopher Chabris and Daniel Simons’ work on the invisible gorilla shows that there’s almost always a performance cost from multi-tasking. When we do two tasks at once, we’re likely only shifting the focus of our attention rapidly back and forth between the two tasks, and that comes at a cost to speed and accuracy.

This is a large part of the reason why many magic tricks work. But when tasks need to be performed accurately, technologies designed to facilitate multi-tasking may actually be counterproductive.

Let’s talk

Whether you’re a technologist, a social media marketer, a content marketer or something else entirely, good voice user interaction is going to become increasingly important to us all. If you’d like to start a conversation on its role in digital marketing, let’s talk.