Thinking Out Loud: Understanding Voice UI, and How To Build for It

At work, we talk a lot about ‘voice’: what is it good for? Is it the post-mobile platform? Our clients also ask us a lot about ‘voice’, and how to build a branded app. But I’m not sure everyone is talking about the same thing, and I’m just as unsure that anyone knows what makes a really good branded ‘voice’ app. I mean, I’m fairly sure I don’t.

This article is my attempt at defining what we’re talking about when we talk about ‘voice’; and, based on my experience as a user and developer of ‘voice’, trying to nail down some of the opportunities for branded third-party apps.

Defining ‘voice’

I think what we mean when we talk about ‘voice’ is: interaction with a digital assistant (Alexa, Assistant, Cortana, Siri, etc.) through an interface that converts voice to text. Sometimes there’s a screen that’s interactive (Assistant on Android), sometimes a screen that’s mostly passive (Echo Show), but most often there’s no screen at all (Home Mini, Echo Dot, etc.).

To be even more precise, I suspect that we’re mostly talking about Alexa and Assistant—the two biggest players in the market that are extensible through a development platform. A voice app, in this case, is a web app that manifests itself not through a browser, but through Alexa or Assistant compatible devices: Alexa calls them Skills, and Assistant, Actions.
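To make the ‘web app without a browser’ idea concrete, here is a minimal sketch of the kind of handler that sits behind a voice app: the platform turns speech into a structured request, and the app returns text to be spoken. The field names (`intent`, `user_id`, `speech`, `end_session`), the `GetPointsBalance` intent, and the `POINTS` lookup are all assumptions made up for illustration; the real Alexa and Assistant payloads have their own schemas.

```python
# Stand-in for a real account service (hypothetical data, illustration only).
POINTS = {"user-1": 120}


def handle_request(request):
    """Map a parsed voice request to a spoken response.

    `request` is assumed to be a dict the platform has already built
    from the user's speech; it is not a real Alexa or Assistant payload.
    """
    intent = request.get("intent")
    if intent == "GetPointsBalance":
        points = POINTS.get(request.get("user_id"), 0)
        return {"speech": f"You have {points} points.", "end_session": True}
    # Unknown intent: say so briefly and close the session.
    return {"speech": "Sorry, I can't help with that yet.", "end_session": True}
```

In practice a function like this would be exposed over HTTPS and registered with the platform, which handles the speech recognition and the text-to-speech at either end.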

I also think that the single ‘voice’ label is slightly confusing, because really it’s not one thing; there are probably four components to it:

  1. Input: message transcription, such as composing texts.
  2. Command: when we expect an action to be executed: ‘call my wife’, ‘play Stranger Things’.
  3. Enquiry: when we expect an answer: ‘will I need an umbrella tomorrow?’, ‘how do I start exercising?’
  4. Conversation: a multi-step engagement with a digital assistant application—either free natural language, or guided through prompts.

Brand Opportunities

Each of the components of voice offers some opportunities to businesses, although Input perhaps offers the least, as this largely takes place at the OS level (in the input boxes of messaging apps, for example). Unless the service offering is for taking notes or lists, there’s probably not much scope here.

Conversation is, I think, the area that most brands think of when they consider a voice app, because it offers the longest engagement (‘dwell time’, in the standard metrics). But it’s unproven that users want to spend more than a handful of seconds at a time talking to a voice-only device; personally, I definitely don’t want to. I think too many branded voice apps are trying to maximise engagement and, as a result, missing out on what makes voice different from, say, the web.

This component also has other drawbacks: the more ‘natural’ and comprehensive the conversation aims to be, the more resource-intensive it becomes to build; and, while responses on voice devices with screens are easy to skim, voice-only responses can be an inefficient delivery method for information.

In my opinion, Enquiry is probably the area that offers most value to customers; as the name implies, digital assistants are there to help people get to something (usually an answer) quickly, with as little friction in the process as possible. There might be great value in, for example, ‘ask Brand X to recommend a thing for me’ and getting back an appropriate answer; although dwell time would be low, satisfaction would be high.

There are, perhaps, two big challenges in this approach. The first is knowing enough about the customer and their intent to give them a response that satisfies them. The second is competing against what the digital assistant already knows; if a service provides, say, cocktail recipes, it would need to be sufficiently distinctive to offer more than “OK Google, how do I make a martini?”.

Finally, Command is similarly interesting for brands, but probably only useful if you offer some kind of media, product, or service endpoints: ‘ask Brand X to tell me my points balance’, or ‘tell Brand X to cancel my last order’ would be short but satisfying interactions.


One drawback of brand apps on digital assistants is that you have to include the invocation phrase in your request; ‘play some soothing sounds’ might be memorable, but ‘ask Brand X to play some soothing sounds’ is harder to remember. This is a known and continuing problem in getting people to discover and, especially, make repeated use of a voice app.

On Alexa and Assistant, each voice app is registered with a handful of explicit invocations (‘talk to Brand X’; ‘ask Brand X how to start running’) so that people can invoke them directly if they know the right phrase; this is usually what brands will promote to drive traffic.

But, knowing that discovery on voice platforms tends to be intent-based, Assistant also offers implicit invocations, such as ‘how to start running’, so when a user asks for ‘advice on how to start running’, Assistant’s recommendation algorithm could match their intent with a branded app. (If you think this sounds like keyword advertising, I agree; and I expect it to be monetised that way in the future.)
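The difference between the two kinds of invocation can be sketched with a toy resolver. This is only an illustration under my own assumptions: the `VoiceApp` class, the `resolve` function, and the simple prefix and substring matching are all hypothetical, and the platforms’ real matching and recommendation logic is far more sophisticated than this.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VoiceApp:
    name: str
    # Phrases the app has registered for implicit (intent-based) discovery.
    implicit_phrases: list[str] = field(default_factory=list)


# Invocation verbs that can precede an app name in an explicit request.
EXPLICIT_PREFIXES = ("talk to ", "ask ", "tell ")


def resolve(utterance: str, apps: list[VoiceApp]) -> tuple[str, str] | None:
    """Return (app name, invocation type), or None if nothing matches."""
    text = utterance.lower().strip()
    # Explicit invocation: the user names the app after an invocation verb.
    for app in apps:
        for prefix in EXPLICIT_PREFIXES:
            if text.startswith(prefix + app.name.lower()):
                return (app.name, "explicit")
    # Implicit invocation: the utterance contains a registered phrase.
    for app in apps:
        for phrase in app.implicit_phrases:
            if phrase.lower() in text:
                return (app.name, "implicit")
    return None
```

With one app registered as `VoiceApp("Brand X", ["how to start running"])`, the request ‘ask Brand X how to start running’ resolves as explicit, while ‘advice on how to start running’ resolves as implicit, and an unrelated request resolves to nothing.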

The Right Approach?

My hunch is that the right approach for most brands right now would be a combination of Enquiry and Conversation: heavy on the former, light on the latter. That means considering the intents that users might have towards the brand, optimising for discovery around those, and trying to answer them in the best way. Responses would be personalised to deliver the best results, using conversation to disambiguate or steer back towards the intents, but essentially trying to deliver a response and get out of the way as quickly as possible.
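As a rough sketch of what ‘answer, and only converse to disambiguate’ might look like in code: the handler below replies straight away when it has what it needs, and asks a single clarifying question when it doesn’t. The `StartRunning` intent, the `level` parameter, the `expect_reply` flag, and the placeholder advice strings are all hypothetical and not taken from any real platform or brand.

```python
# Placeholder answers keyed by what we know about the user (illustrative only).
RECOMMENDATIONS = {
    "beginner": "Try a walk-run plan, three times a week.",
    "experienced": "Add one interval session to your week.",
}


def respond(intent, level=None):
    """Answer at once if possible; otherwise ask one clarifying question."""
    if intent != "StartRunning":
        # Steer back towards an intent the app supports, then get out of the way.
        return {
            "speech": "I can help with advice on how to start running.",
            "expect_reply": False,
        }
    if level not in RECOMMENDATIONS:
        # Not enough context: a single prompt to disambiguate.
        return {
            "speech": "Are you a beginner, or have you run before?",
            "expect_reply": True,
        }
    # Enough context: deliver the answer and end the exchange.
    return {"speech": RECOMMENDATIONS[level], "expect_reply": False}
```

The design choice here is that the conversation only continues when the app genuinely cannot answer; every other path ends the exchange after one response.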

This would be affected by a lot of variables, and I’d prefer my recommendation to be based on data rather than experience and intuition, so I’ll be looking for opportunities to measure and test my hypotheses.

Also published on Medium.