Zhengyi Cao

Several Principles in Voice-User Interface Design

“The great myth of our times is that technology is communication.” —Libby Larsen, American composer


Voice user interfaces (VUIs) allow the user to interact with a system through speech or voice commands. With more and more voice-controlled devices entering the market, VUIs are becoming part of everyday life. They will be a major part of the user experience in future apps and devices.


However, the guidelines for graphical user interfaces (GUIs) cannot simply be applied to VUIs. On the one hand, VUIs usually offer fewer visual affordances, and most information can only be conveyed by speech. On the other hand, users normally associate voice with interpersonal communication rather than with person-technology communication, so they get frustrated when the computer can't respond in the way they expected.


Unfortunately, today's technology still can't make the computer communicate as human beings do, so it becomes even more important to design the voice user interface so that it contains the right amount of information and handles the user's expectations elegantly. To this end, I've drawn some principles for VUI design from the design guidelines of Alexa and Google and from other resources, and written examples for them. Before the principles, let's take a general look at the design of VUIs.



The implementations of a Voice User Interface


As with GUIs, the design of VUIs usually concentrates on how to respond to the user's commands and how to convey information to the user. There are two ways to implement a VUI. The first builds a traditional, massive flowchart and a complicated phone tree; the second understands and constructs meaning based on shared context. In the first approach, which is common in chatbots, the designer needs to write down nearly all possible inputs in the scenario together with the responses, and the result is often frustrating to use. The second approach builds on machine learning, feels more natural and enables the design of real conversations. Most of today's VUI systems are implemented the second way, and in many situations designing a VUI amounts to designing conversations.

General Modules and Flow in a VUI System

As shown in the image, a VUI system is much more complicated than a GUI system. The key technologies in a modern VUI system are:


Speech Signal Processing (SSP): Technology that covers the acquisition, manipulation, storage, transfer and output of speech signals. It converts the original speech signal into a digital format that can be easily transmitted, stored, recognized and processed by the computer.


Automatic Speech Recognition (ASR): Technology that takes an audio input and maps it to a word or an utterance.


Natural-language Processing (NLP): Technology that extracts the “meaning” of a user’s utterance or typed text. Natural-language Understanding (NLU) is the post-processing step that follows the NLP algorithms: it discerns the meaning of fragmented and run-on sentences so that an intent, typically from a voice command, can be executed.


Natural-language Generation (NLG): Technology that transforms structured data into natural language.


Text to Speech (TTS): Technology that converts text to audio that is spoken by the system.
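
To make the flow between these modules concrete, here is a minimal sketch in Python of one conversational turn passing through each stage. Everything in it is an assumption made for illustration: the function names, the fake "audio" (plain UTF-8 bytes) and the hard-coded weather answer are placeholders, not how any real assistant implements these stages.

```python
def ssp(raw_signal: bytes) -> bytes:
    """Stand-in for speech signal processing: passes the signal through unchanged."""
    return raw_signal


def asr(audio: bytes) -> str:
    """Stand-in for speech recognition: the fake 'audio' is just UTF-8 text."""
    return audio.decode("utf-8")


def nlu(text: str) -> dict:
    """Stand-in for language understanding: a keyword check instead of a model."""
    intent = "get_weather" if "weather" in text.lower() else "unknown"
    return {"intent": intent}


def fulfil(meaning: dict) -> dict:
    """Hypothetical business logic that returns structured data."""
    if meaning["intent"] == "get_weather":
        return {"status": "ok", "forecast": "sunny"}  # hard-coded for the sketch
    return {"status": "unsupported"}


def nlg(result: dict) -> str:
    """Stand-in for language generation: structured data to a sentence."""
    if result["status"] == "ok":
        return f"It's {result['forecast']} today"
    return "Sorry, I can't help with that yet"


def tts(text: str) -> bytes:
    """Stand-in for speech synthesis: encodes the text instead of producing audio."""
    return text.encode("utf-8")


def handle_turn(raw_signal: bytes) -> bytes:
    """Chain the stages in the order listed above."""
    return tts(nlg(fulfil(nlu(asr(ssp(raw_signal))))))
```

Calling `handle_turn(b"What's the weather today")` walks the request through every stage and returns the encoded reply. In a real system each stand-in would be a separate component or service.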



The Anatomy of a Voice Command


When designing VUIs, designers need to keep the objective of the voice interaction in mind at all times (except in free-chat situations). A voice command from a user usually contains three key parameters: the intent, the utterance and the slots.


An intent is the user's broad objective behind the command. The utterance is how the user phrases the request; one request can be phrased with many different utterances. Slots are the required or optional variables needed to carry out the task, and some commands have none. For example, in the command "Play Joker on Amazon", the intent is to play a movie, the utterance is the phrasing "Play [movie] on [service]", and the slots are "Joker" and "Amazon". By breaking a command down into intent, utterance and slots, the AI understands the command.
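
As a rough illustration of this anatomy, the sketch below represents a command as a small Python data structure and parses the "Play Joker on Amazon" example with a single regular expression. The class name, the intent label `play_movie`, the slot names `title` and `service`, and the pattern itself are all assumptions made for this example; production systems usually rely on trained NLU models rather than hand-written patterns.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VoiceCommand:
    utterance: str                              # how the user phrased the request
    intent: str                                 # the broad objective
    slots: dict = field(default_factory=dict)   # variables needed to carry out the task


# One hypothetical utterance pattern for the "play a movie" intent.
PLAY_PATTERN = re.compile(r"play (?P<title>.+) on (?P<service>.+)", re.IGNORECASE)


def parse(utterance: str) -> Optional[VoiceCommand]:
    """Return a VoiceCommand if the utterance matches the pattern, else None."""
    match = PLAY_PATTERN.fullmatch(utterance.strip())
    if match is None:
        return None
    return VoiceCommand(utterance=utterance, intent="play_movie", slots=match.groupdict())


command = parse("Play Joker on Amazon")
```

Here `command.intent` is `"play_movie"` and `command.slots` holds `"Joker"` and `"Amazon"`, while an unrelated utterance such as "Open the door" yields `None`.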



The Objectives of VUI Design


Generally, the objective of VUI design is to provide an intuitive, effective and frictionless experience. Specifically, the objectives can be summarized as:

  • To make users cooperate with the AI effectively

  • To make users cooperate with the AI delightfully

The effectiveness of the VUI should take priority in the design, while the communication is kept natural and delightful.


The Process of VUI Design


Flowchart of the conversation design process, by Google

The image shows the steps of VUI design as a waterfall process. In practice, the design should be more "agile" and less "waterfall". The basic process of VUI design can be divided into four steps.


First, judge whether your application works well as a VUI and whether voice helps the user experience. You can take a quick quiz here to find the answer.


Second, gather requirements through user research. Common methods include the User Persona, the User Journey Map and the Storyboard. Then use templates or create a persona for the application that represents your brand and mission. A good persona evokes a distinct tone and personality.


At this point, prototypes of the VUI conversations can be designed with sample dialogs and dialog flows. A sample dialog is like a short story set in a certain scenario. A dialog flow is a little like a traditional decision tree and should contain:

  • Keywords that lead to the interaction

  • Branches that represent where the conversation could lead to


After the prototypes, quick testing and iteration can improve the main dialog flow, while the detailed design should cover the long tail of ways a conversation can deviate from the most common paths. Additionally, a good VUI design should also work in a multimodal environment; that is, design matching visual components and adapt your design to other devices.


After the design, UX analysis is also necessary to get feedback from users and prepare further iterations. Some key metrics for measuring a VUI are:

  • Engagement metrics, such as sessions per user or messages per session

  • Languages used

  • Behavior flows

  • Messages, intents, and utterances
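
The engagement metrics in the first bullet can be computed from a simple message log. The sketch below assumes a hypothetical log format, one `(user_id, session_id)` pair per message with globally unique session IDs; real analytics tools will have their own schemas.

```python
from collections import defaultdict


def engagement_metrics(message_log):
    """Compute sessions per user and messages per session.

    message_log: iterable of (user_id, session_id) pairs, one per message.
    Assumes session IDs are unique across users (an assumption of this sketch).
    """
    sessions_by_user = defaultdict(set)
    messages_by_session = defaultdict(int)
    total_messages = 0

    for user_id, session_id in message_log:
        sessions_by_user[user_id].add(session_id)
        messages_by_session[session_id] += 1
        total_messages += 1

    if total_messages == 0:
        return {"sessions_per_user": 0.0, "messages_per_session": 0.0}

    return {
        "sessions_per_user": len(messages_by_session) / len(sessions_by_user),
        "messages_per_session": total_messages / len(messages_by_session),
    }


log = [("u1", "s1"), ("u1", "s1"), ("u1", "s2"), ("u2", "s3")]
metrics = engagement_metrics(log)
```

With this log there are two users, three sessions and four messages, so the sketch reports 1.5 sessions per user and about 1.33 messages per session.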



Principles for VUI Design


To design VUIs with a better user experience, I divide the principles into two groups that apply at different levels: one for simple sentences and another for conversations. In the examples, Vivi is the name of the virtual assistant.


Let's start with the principles for simple sentences.



For Simple Sentences


Grammar


Obviously, every sentence in a VUI needs to follow the rules of grammar. Three grammar principles deserve particular attention:


1. Personal Pronouns

The personal pronouns within a single sentence need to stay consistent.

Counterexample: Vivi has found three results and shall I report them?


2. Reply & Execution

The order of the AI's reply and the task execution can influence the tense used in the reply. There are four situations:


Reply first and then execute. This situation is not common, as the reply can delay the execution. The AI replies in the future tense immediately after understanding the command.

Example: Sure, I will wake you up at 7 tomorrow morning.


Reply after executing. This is mostly seen in services like search, where the execution results determine the content of the reply. The reply often uses the tense of the intent, but if it describes an action of the bot itself, like 'find', it should use the simple past, like 'found'. The bot should reply only after the execution has loaded its results.

Example 1:

-The user: Read my messages

-After loading the messages,

-Vivi: You have a text from Anne "What time is dinner tonight?"


Example 2:

-The user: What's next on my calendar?

-After loading the calendar

-Vivi: I found this in your Calendar for today

-Display the item on the screen


Reply while executing. This is the most common situation. The reply is sent immediately after the command is understood. Present participles are often used.

Example: Playing your favorite list


Execute without replying. This happens only when the result of the execution is evident to the user without any notice.

Example: the user says, "Stop the music" and the music is stopped.


3. Punctuation

This principle governs the punctuation marks in a sentence. It matters most alongside GUIs, where the text can be shown on screen.


If it's a question, use a question mark.

Example: Can you say that again?


Use commas and quotation marks where necessary.

Example: Alright, turning the temperature down


Avoid exclamation points, as they can be perceived as shouting.


Be careful of other punctuation marks.

Example: Here's a Spotify playlist called Intense Studying: Piano Concertos Edition


Cooperative Principle


Previous studies on voice interaction revealed that spoken interaction is so fundamental in life that we cannot completely disregard our expectations of how speech communication normally takes place. This means that some principles of interpersonal communication should be applied to VUIs too. Specifically, the cooperative principle, which describes how people achieve effective conversational communication in common social situations, can be helpful in the design of VUIs, especially in task-oriented situations. The cooperative principle was formulated by Paul Grice. Several of its maxims can apply to one sentence at the same time.



1. Maxim of quality

The replies of the AI should contain information of good quality, in other words, the truth.


Do not say what you believe is false.

Counterexample: Sorry, it's too noisy, can you say it again?


Do not say that for which you lack adequate evidence.

Counterexample: Sorry, I don't know where your pencil is, maybe it's on your desk?


2. Maxim of quantity

The replies should offer enough information to the user.


Make your reply as informative as is required (for the current purposes of the exchange). The reply should contain enough information for the cooperation.

Examples:

-Bad: Action failed

-Good: Sorry, no Internet connection

-Better: Sorry, no Internet connection, please try again later


Do not make your reply more informative than is required. The reply should be concise and should not offer so much information that it overwhelms the user.

Examples:

-For a non-technical user

-Good: Sorry, no Internet connection

-Bad: Sorry, no Internet connection, the destination IP address is 123.123.123.123 and the packet loss rate is 100% right now


3. Maxim of relation

The reply should be relevant to the objective. Until technology enables the AI to change the topic naturally in a conversation, only the user should take the initiative.

Counterexample:

-The user: Vivi, raise the temperature to 26

-Vivi: Ok, by the way, 26 is my lucky number


4. Maxim of manner

The contents of the reply should be perspicuous.


Avoid obscurity of expression. Avoid terminology, jargon, legalese and complex clauses; use simple words and colloquial speech, such as contractions and common phrases, instead. Following the user's habits is a safe choice. If a term is necessary, explain it at first use.

Example 1:

-Bad: Sorry, I cannot understand

-Good: Sorry, I can't understand

-Better: Pardon?

Example 2:

-Bad: Do you want to enable streamlined identity experience?

-Good: Do you want to sign up as a member?


Avoid ambiguity. Use necessary referents or conditions to avoid ambiguity.

Example:

-The term "platform" can stand for either the device or the app

-Good: The platform doesn't support this action

-Better: Your phone doesn't support this action


Be brief. There are four basic ways to avoid unnecessary prolixity while keeping the required information. Note, however, that overly brief sentences may come across as indifferent.


Avoid unnecessary explanations.

Counterexample: The password is wrong. I can't open it with a wrong password


Avoid niceties. Niceties make responses feel distant and formal. Ditch them to keep the conversation friendly and informal.

Example:

-Good: Before I can help with that, please accept the terms of service

-Better: Please accept the terms of service before using


State only one action or status per reply if possible.

Example:

-Bad: To play in Bluetooth mode needs you to enable Bluetooth on the phone first and then play the music

-Good: To play in Bluetooth mode, please enable Bluetooth on the phone first


Try shorter expressions.

Example:

-The user: How many dollars is 3 euros?

-Good: The stormy weather makes it harder to walk safely, be careful

-Better: It's hard to walk in stormy weather, be careful


Be ordered. If there is more than one action or object in the reply, sort them by time, space or importance. Notably, if you want the user to do something, give them the reason first, because a user who hears the action first easily forgets it.


Example:

-Good: You should get up earlier tomorrow, as you may get in a jam tomorrow

-Better: You may get in a jam tomorrow, so if you don't want to be late, get up earlier.


Politeness Principle



As in the normal conversations of everyday life, the expressions and sentences should never annoy the user. The politeness principle was formulated by Geoffrey Leech. However, politeness is built on effective cooperation, so the politeness maxims should not be used in ways that violate the cooperative principle. Additionally, not all of the maxims are equally important. For instance, "Tact" involves a more powerful constraint on conversational behavior than does "Generosity", and "Approbation" is more important than "Modesty". Note also that speakers may adhere to more than one maxim of politeness at the same time.


1. The tact maxim

The tact maxim states: "Minimize the expression of beliefs which imply cost to other; maximize the expression of beliefs which imply benefit to other." The first part of this maxim can minimize the imposition, and the second part can attend to the hearer's interests, wants, and needs.

Example 1:

-Bad: Sorry, I didn't hear you, please speak to me louder

-Good: Sorry, I didn't hear you, please say that again


Example 2:

-Bad: I can't purchase that because you haven't chosen the size

-Good: Sure, but before that, let's choose the size you like


2. The generosity maxim

The maxim states: "Minimize the expression of beliefs that express or imply benefit to self; maximize the expression of beliefs that express or imply cost to self". The maxim of generosity focuses on the speaker and says that others should be put first instead of the self. In other words, if a task can be completed by the system, don't leave it to the user. Also notify the user that the system has done something for them.

Example:

-The user is reading on the bed

-The user: Hey, I want to go to sleep

-Better: Sure, I'm turning off the lights and closing the windows for you, good night


3. The approbation maxim

Minimize the expression of beliefs which express dispraise of other; maximize the expression of beliefs which express approval of other. It is preferred to praise others and, if this is impossible, to sidestep the issue, to give some sort of minimal response, or to remain silent. The first part of the maxim avoids disagreement; the second part intends to make other people feel good by showing solidarity. However, the degree of approbation should fit the persona of the assistant: you don't expect a cult-style bot with a sharp tongue to praise you. This maxim also means that the bot should never blame the user and never blame another party. People don't like individuals who blame others for failure.

Example 1:

-The user: I want the first one

-Vivi: Sure, good taste

Example 2:

-Bad: I can't play the music because you haven't linked me with your Spotify account

-Good: I can't play that right now, because I need to link with your Spotify account, should I do it now?


4. The modesty maxim

The original maxim reads: "Minimize the expression of praise of self; maximize the expression of dispraise of self". The bot should narrate what it has done in an objective tone and apologise moderately for failures or errors. Words like "sorry" can also serve a social or phatic function without being a full-fledged, heartfelt apology. However, maximizing the expression of dispraise of self too frequently, for example by repeatedly saying "sorry", sounds annoying and risks undermining the user’s confidence in the system. Keep the modesty moderate and remember that offering solutions is always a better way to show empathy and emotional intelligence than apologizing.

Example:

-Good: I'm so sorry, I can't do it without network

-Better: Oops, I can't do it without network, I'm checking the Internet connection

5. The agreement maxim

Minimize the expression of disagreement between self and other; maximize the expression of agreement between self and other. However, it doesn't mean that we should totally avoid disagreement. It is simply observed that people are much more direct in expressing agreement, rather than disagreement.

Example:

-The user has added one task for 10 a.m., now he wants to add another one at 10 a.m.

-The user: Vivi, I'll have a meeting with Mr. Wong at 10 a.m.

-Vivi: I see, but you have an appointment with your wife at 10 too, do you want to cancel it?


6. The sympathy maxim

Minimize antipathy between self and other; maximize sympathy between self and other. It includes congratulation, commiseration, condolences and other speech acts that attend to the hearer's interests, wants and needs.

Example:

-The user: Vivi, call Doctor Smith, I have a headache now.

-Vivi: Calling Doctor Smith, don't worry


For Conversations



The principles above all apply to a single reply from a virtual assistant. The following principles should be taken into consideration when designing a whole conversation.


Status Principle


In GUIs, users can see when they enter a new section or interface. In VUIs, however, they need to be told, because VUIs don't have enough markers for them to judge where they are: specifically, which functionality they are using and how much longer the conversation will go on. Being confused about the status of the conversation makes users feel upset and bothered. I have summarized four methods for telling the user the status of the conversation.


1. Flags

Use flag words or phrases to mark the progress of the conversation. Two types of flag can help users locate their position. Avoid overusing these flag words, or the experience can become monotonous and robotic (see the variation principle).


Timeline. Use words and phrases like "first", "half done" and "last" to show the progress of the conversation.

Example:

-The user: I want to buy an insurance product; can you recommend one?

-Vivi: Sure, but before it, I need to ask you three questions, are you ready?

-The user: Ok

-Vivi: First, do you have any preference for insurance companies?


Acknowledgement. Flag words and phrases, such as "sure", "of course", "sorry" and so on, can remind the user that the bot acknowledges commands. These words can make the conversation more fluid and natural.

Example:

-The user: Dim the living room lights

-Vivi: Sure



2. Confirmations

In VUIs, confirmations are good indicators that help users find their position. They also enable users to correct mistakes immediately and tell them that the bot has not only received the command but also understood it. There are three types of confirmation.


Explicit confirmation. Explicit confirmations are used for relatively important commands. They repeat the information gained from the command and ask the user to confirm. The user has a chance to reconsider or undo the command before confirming it.

Example: Do you confirm your order of an economy-class seat on CA1111, from Beijing to Shanghai Pudong, taking off at 6:05 a.m., May 1, 2019?


Implicit confirmation. Implicit confirmations contain the information gained from the command in a more implicit way. Implicit confirmations are contextual, and the design should combine the confirmations with the scenarios.

Example:

-The user: Text Julia that I'm on the way

-Vivi: Sure, I've texted Julia, anything else?


No overt confirmation. This was mentioned above under "Execute without replying". Use it when the input is simple and typically recognized with high confidence, as with "yes" or "no", or when the result of the command is self-evident, as with "play" and "pause". Acknowledgements can also be regarded as non-overt confirmations.

Example:

-A post app: Can you tell me the address you want to send it to?

-The user: Room 303, #3 Green Street

-The app: Ok, and what's the recipient's phone number?
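
The choice among the three confirmation types can be sketched as a small decision function. The inputs (`high_stakes`, `asr_confidence`, `result_is_evident`) and the threshold of 0.9 are assumptions made for this illustration; the guidance above does not prescribe specific values, and a real design would tune them per use case.

```python
from enum import Enum


class Confirmation(Enum):
    EXPLICIT = "explicit"
    IMPLICIT = "implicit"
    NONE = "no overt confirmation"


def choose_confirmation(high_stakes: bool,
                        asr_confidence: float,
                        result_is_evident: bool,
                        threshold: float = 0.9) -> Confirmation:
    """Pick a confirmation style for one command (hypothetical rule of thumb)."""
    # Big commands, such as booking a flight, get an explicit confirmation.
    if high_stakes:
        return Confirmation.EXPLICIT
    # Simple, confidently recognized input or a self-evident result needs none.
    if result_is_evident or asr_confidence >= threshold:
        return Confirmation.NONE
    # Everything else is confirmed implicitly inside the reply.
    return Confirmation.IMPLICIT
```

For example, booking a flight would pass `high_stakes=True` and always be confirmed explicitly, while "pause" would pass `result_is_evident=True` and need no overt confirmation.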


3. Wait indicators

Wait indicators show the system status to users while something is happening or loading. Without wait indicators, users may wonder whether the system is stuck. VUIs use speech as a wait indicator.

Example: Processing your request, please wait about 1 minute


4. GUIs

Though this article concentrates on VUIs, we shouldn't forget that a GUI is also available to the virtual assistant on many devices, such as smartphones. The GUI can still convey certain system statuses, like listening, loading and displaying the result.


Guide Principle



In a GUI, it's easy to show users the options they can choose, and there's no need to tell them to click on an option. A voice interface, by contrast, typically has no way of showing the user what options are available, and new users may be confused about how to interact with this artificial intelligence, which differs from their previous experience of communication. In this case, designers need to guide users on the functionality, the commands and the other actions they should take.


1. Guides of functionality.

The voice interface should tell the user what they can do. This kind of guide sometimes consists of the function and a guide of interaction to call this function.

Example 1:

-A weather app: You can ask me for today's weather, by saying "what's the weather today"

Example 2:

-A study app: I can look up words, translate sentences, search for answers and contact your teacher, what can I help you with?


2. Guides of commands

When users give a command, they often don't give it completely, and some necessary slots are missing. Guides of commands help the user complete the command so that the task can be carried out correctly.


Parameters. By asking users for the necessary information, the bot can fill the slots. Offering options is a good way to show what the user can choose from. If the options are too general or too numerous, they can be omitted.

Example 1:

-The user: Send a message to Mary that I will drop by at 10 tomorrow morning

-Vivi: Sure, but you have two Marys in your recents, Mary Collins and Mary Smith, which Mary?

-The user: Mary Collins

Example 2:

-The user: Book a flight from Paris to New York tomorrow

-Vivi: What time would you like to fly?

-The user: 5 p.m.
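
The flight example above is a case of slot filling: the bot checks which required slots are still missing and asks for the next one. Below is a minimal sketch of that loop in Python. The intent name `book_flight`, the slot names and the prompt texts other than "What time would you like to fly?" are hypothetical.

```python
from typing import Optional

# Hypothetical required slots per intent, in the order they should be asked.
REQUIRED_SLOTS = {
    "book_flight": ["origin", "destination", "date", "time"],
}

# Hypothetical prompts used to ask for each missing slot.
PROMPTS = {
    "origin": "Where are you flying from?",
    "destination": "Where would you like to go?",
    "date": "Which day would you like to fly?",
    "time": "What time would you like to fly?",
}


def next_prompt(intent: str, slots: dict) -> Optional[str]:
    """Return the question for the first missing required slot, or None if complete."""
    for name in REQUIRED_SLOTS.get(intent, []):
        if name not in slots:
            return PROMPTS[name]
    return None


# "Book a flight from Paris to New York tomorrow" fills three of the four slots.
slots = {"origin": "Paris", "destination": "New York", "date": "tomorrow"}
question = next_prompt("book_flight", slots)   # asks for the time

slots["time"] = "5 p.m."
done = next_prompt("book_flight", slots)       # nothing left to ask
```

Once `next_prompt` returns `None`, every required slot is filled and the task can be carried out.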


Actions. Sometimes users don't know how to give a command, how to answer a question or what to do. Guides such as notices help resolve their confusion. However, don’t expect users to be happy about being asked to do something. Only guide them to do something when the system can't handle it itself.

Example 1:

-The user: How do I play a song?

-Vivi: Well, you can specify the song by saying something like "Play Masterpiece", or specify the singer, like "Play Madonna's songs", or specify both the song and the singer, like "Play Madonna's Masterpiece"

Example 2:

-The user: I want to check my courier status

-A courier app: Sure, please give me your tracking number, it starts with the letter C and has 8 digits

Example 3:

-Vivi: Please place the item under the scanner



Variation Principle


For a voice interface that is used in one-time or relatively low-frequency scenarios, like buying coffee, the user doesn't expect much diversity in the conversation. But for a virtual assistant, diversity in replies is very important. One-size-fits-all replies bore the user.

Example 1:

-Sure/Of course/Certainly/...

Example 2:

-Sorry, I didn't hear you, please say it again

-Sorry, I missed it, can you say it again?

-Pardon me, please say it again
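
One simple way to get this variety is to pick each reply at random from a pool of variants while avoiding an immediate repeat. The sketch below is only an illustration; the class name is hypothetical and the pool reuses the reprompts from Example 2.

```python
import random


class VariedReply:
    """Pick a reply from a pool of variants without repeating the previous one."""

    def __init__(self, variants, rng=None):
        if not variants:
            raise ValueError("at least one variant is required")
        self._variants = list(variants)
        self._rng = rng or random.Random()
        self._last = None

    def next(self) -> str:
        # Exclude the previous reply; fall back to the full pool if only one exists.
        candidates = [v for v in self._variants if v != self._last] or self._variants
        self._last = self._rng.choice(candidates)
        return self._last


reprompt = VariedReply([
    "Sorry, I didn't hear you, please say it again",
    "Sorry, I missed it, can you say it again?",
    "Pardon me, please say it again",
])
replies = [reprompt.next() for _ in range(20)]
```

Each call to `reprompt.next()` returns one of the three reprompts, and no two consecutive replies are identical.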


Repair Principle


As in GUIs, errors are unavoidable in VUIs. When the system finds that the user's input doesn't meet its expectations, it shouldn't blame the user but should focus on repairing the conversation and getting the dialog back on track. There are three common types of error in VUIs.


1. No input

This error occurs when the ASR component doesn't detect any valid input within a tolerable time. It can happen because there is too much noise, or because the user can't speak right now, spoke before the microphone opened, or has even left.

Example:

-A user: Hi, I want an Americano

-A coffee vending bot: Sure, which size? We have middle, big and super big

-The user left

-The bot: Maybe you would like to try our middle size

-Still no input from the user

-The bot: Fine, see you next time


2. Not match

When the result of NLU doesn't match the expectations for the intent, utterance or slots, the VUI should find a way to fix the mismatch, typically by asking or repeating a question. There are three possible types of mismatch.


Intent mismatch. This situation is often caused by the user trying to change the topic of the conversation temporarily. Humans handle this kind of mismatch naturally, but current technology is not good at dealing with it. Depending on the context and on how relevant the two topics are to each other, the design should either jump to the new topic with the user or ask the user to stay. For example, if the user's intent is out of scope, notify the user. Also mark some intents as high-priority intents: when the user shows such an intent, change the topic anyway.

Example:

-A coffee vending bot: Hi, what do you want?

-A user: Open the door

-The bot: Sorry, I can't do that, I can make coffee for you and please choose one on the screen (stay)

-The user: Americano

-The bot: Sure, which size? We have middle, big and super big

-The user: Wait, I want one middle cup of iced cappuccino.

-The bot: Ok, one middle cup of iced cappuccino, do you need anything else? (jump with the user)

-The user wants to end the conversation

-The user: Thanks

-The bot has set phrases like 'Thanks' and 'That's it' as high-priority intents that end the conversation

-The bot: You are welcome. Just wait for a minute, your order comes soon



Utterance mismatch. In some languages with flexible word order, like Chinese, the utterances for the same intent can be especially diverse. Designers therefore need to cover as many utterance formats for a given intent as possible; learning from real usage also helps. But when the user uses an utterance that can't be processed, the system unfortunately needs to repeat the question and, if necessary, add the expected utterance format.

Example:

-The coffee vending bot: Hi, what would you like to drink?

-A user: with milk, Americano, iced, middle cup

-The bot can't process this utterance

-The bot: Sorry, what would you like?

-A user: fine, a middle cup of iced Americano, with milk


Slots mismatch. If a slot is invalid, the VUI should ask for a correction instead of shutting down the conversation. When necessary slots are missing, use the guide principle to guide the user.

Example:

-The coffee vending bot: Hi, what would you like?

-A user: A Big Mac

-The bot: Sorry, but I can't do that. You can choose a coffee from the screen


3. System error

System errors happen when the user's command was understood correctly, but the system can't complete the task or fails due to technical issues. Two common situations are:


System failure. The system is unresponsive or returns a general error.

Example: Sorry, but I can't play the song, the database is unavailable now. I'll check it later


Invalid request. The user's command can't be completed because it's invalid. When reporting the invalid request, avoid using "sorry", just take responsibility and briefly notify the user.

Example: I can't call Sarah. There is no Sarah in the contacts, would you like to add Sarah as a contact?


When system errors happen, check every system or environment and account for all possible reasons. Where possible, provide the reason and possible next steps to the user.


4. Three levels of response

When no-input or mismatch errors happen, the system typically uses up to three turns of response to help the user get back on track. The responses fall into three levels.


Level 1: Rapid reprompt. Quickly and concisely collect the information again or in another way. It typically combines an apology with a condensed repetition of the original question.

Example:

-A travel agent app: Sure, how many days would you like to spend?

-The user: Let me see, um...

-The agent: Sorry, how many days?


Level 2: Escalating detail. Consider why the user might be having difficulty. Then, in the reprompt, include additional support in the spirit of the guide principle, using examples or options to guide the user.

Example:

-The coffee vending bot: Hey, what would you like to drink?

-The system doesn't have the option of Frappuccino

-The user: A Frappuccino

-The bot: Sorry, we don't have Frappuccino here. We have Americano, Cappuccino, Mocha and other coffees you can see on the screen. Now, what would you like?


Either the 1st or the 2nd level can be used for the first response. The choice depends on whether the user really needs the support of more information. If the 1st level is used first, the second response should use the 2nd level. If the conversation started with the 2nd level, keep using it for the second response.


Level 3: Decent ending. End the conversation in a friendly and decent way after 2 turns of attempts to avoid frustrating or pestering the user. If the user wants to complete the task, the system should provide other ways or let the user try it later.

Example 1: This seems beyond my expertise, maybe next time

Example 2: Sorry, I still can't understand that, please try it again after a few minutes

Example 3: Guess you are busy now, see you next time
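
The escalation rules described above can be written down as a small function that maps the number of consecutive failed turns to a response level. This is only a sketch: the function name and the `needs_support` flag, which stands for the designer's judgment of whether the user needs more information, are assumptions for illustration.

```python
def repair_level(failed_attempts: int, needs_support: bool) -> int:
    """Return the response level (1, 2 or 3) for the current failed turn.

    failed_attempts: consecutive no-input or mismatch turns so far,
                     counting the current one (so the first failure is 1).
    needs_support:   whether the user seems to need examples or options.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts starts at 1")
    if failed_attempts >= 3:
        return 3          # decent ending after two repair attempts
    if failed_attempts == 2:
        return 2          # the second response always uses escalating detail
    return 2 if needs_support else 1   # first response: rapid reprompt or detail
```

A dialog manager could call this after every failed turn and pick the matching reply template, ending the conversation politely once level 3 is reached.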



Interaction Principle


Whether in interpersonal or person-technology conversations, interaction is critical. A lack of interaction can bore and annoy the hearer and make them forget easily. Just recall how annoyed you are when a chatbot rattles off a long list of options, telling you to press '1' for this or '2' for that. Designers should therefore limit the information in a turn. Note that this principle differs from the maxim of quantity and from being brief: here the information is relevant but should be presented in a more interactive way. At the same time, some functions, like reading the news, are not interactive by nature, and it's unnecessary to add interactions to them.


There are three methods to enhance interactions.


1. Go-ahead

Notifying the user that the next turn will be longer can keep the user from interrupting.

Example: It will take 10 minutes to read the following article, would you like to hear that?


2. Dispatch & Pause

Split the whole information into smaller pieces and list only a few items per turn. Amazon thinks the upper limit is 3. Also use pauses and ask whether to continue.

Example: Ten professors will show up in that meeting, including Professor Arthur, Professor Guinevere, Professor Lancelot, do you want to hear more?
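
Splitting a long list into turns of at most three items is easy to sketch in code. The helper names below are hypothetical, and the wording of the follow-up question is taken from the example above.

```python
def chunk(items, size=3):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def spoken_turns(intro, items, size=3):
    """Build one spoken turn per group, pausing with a question between groups."""
    groups = chunk(items, size)
    turns = []
    for index, group in enumerate(groups):
        text = ", ".join(group)
        if index == 0:
            text = f"{intro}, including {text}"
        if index < len(groups) - 1:
            text += ", do you want to hear more?"
        turns.append(text)
    return turns


names = ["A", "B", "C", "D", "E", "F", "G"]
turns = spoken_turns("Seven professors will show up", names)
```

Here the seven names become three turns of three, three and one items, and every turn except the last ends by asking whether the user wants to hear more.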


3. GUI

GUIs are better at displaying large amounts of information. On devices with a GUI, some information can be displayed on the screen and omitted from the speech.

Example: I've found 3 shops nearby. You can check them on the screen


Learning Principle


This principle is optional because it only works when the AI is able to learn the user's behaviors and preferences. There are two typical things the AI can learn and use to offer a more customized experience.


1. The expertise of the user

Novice users are unfamiliar with the options and parameters, so the VUI should explain them where necessary, guide the user through multiple turns and confirm the result. Expert users need neither the explanations nor the guidance; one-shot commands and simple questions are enough for them. The design should therefore accommodate a range of user expertise and offer a different experience to each.

Example 1:

-This is a novice user

-A coffee vending bot: Welcome, I can make coffee for you, what can I do for you?

-The user: I want a cup of coffee

-The bot: Sure, what kind? We have Americano, Cappuccino, Mocha and more. You can check them on the screen

-The user: I want Americano

-The bot: Ok, iced or not iced

-The user: Iced

-The bot: Got it, which size would you like? We have middle, large and super large

-The user: Middle cup

-The bot: A middle cup of iced Americano, anything else?


Example 2:

-This is an expert user

-The bot: Welcome back. What would you like today?

-The user: A middle cup of iced Americano please

-The bot: Got it.
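
The difference between the two dialogues can be sketched as a slot-filling prompt that changes with the user's expertise. The slot names, the option lists and the `is_expert` flag are hypothetical; how a real system would detect expertise is left open here.

```python
from typing import Dict, Optional

# Hypothetical slots for the coffee bot and the options read out to novices.
SLOT_OPTIONS = {
    "kind": ["Americano", "Cappuccino", "Mocha"],
    "temperature": ["iced", "not iced"],
    "size": ["middle", "large", "super large"],
}


def next_prompt(filled: Dict[str, str], is_expert: bool) -> Optional[str]:
    """Return the next question, or None when every slot is filled."""
    for slot, options in SLOT_OPTIONS.items():
        if slot in filled:
            continue  # already provided, e.g. by a one-shot command
        if is_expert:
            return f"Which {slot}?"  # short question, no explanation
        return f"Which {slot} would you like? We have " + ", ".join(options) + "."
    return None
```

An expert who fills all three slots in one command gets no follow-up questions at all, which matches the second dialogue.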


2. The data of the user

When the user permits the use of their data, the AI system can make things more convenient, since the user does not need to repeat the same parameters every day. What's more, if the AI can read the user's emotion from historical data, the VUI can adjust its voice and offer help. The premise of all such usage, however, is the user's permission. The user can also set some parameters actively, such as a home address.

Example:

-The user usually orders a middle cup of iced Americano

-The coffee vending bot: Hi again. The same as usual?

-The user: Yeah
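
A minimal sketch of this behavior, assuming the order history is a plain list of strings and that an order counts as "usual" when it makes up at least 60% of the history. The threshold and the function names are illustrative assumptions. The permission check comes first, so no history is read without consent.

```python
from collections import Counter
from typing import List, Optional


def usual_order(history: List[str], min_share: float = 0.6) -> Optional[str]:
    """Return the dominant past order, or None if no order is frequent enough."""
    if not history:
        return None
    order, count = Counter(history).most_common(1)[0]
    return order if count / len(history) >= min_share else None


def greeting(history: List[str], has_permission: bool) -> str:
    """Pick a greeting; the history is only consulted with the user's permission."""
    if not has_permission:
        return "Welcome. What would you like today?"
    if usual_order(history) is not None:
        return "Hi again. The same as usual?"
    return "Welcome back. What would you like today?"
```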


Context Principle


The context principle means the AI should integrate the conversation with its context. This also requires the AI to have a strong ability to learn and remember.


Context can be used to predict intents and slots. Once the context has established a certain time, place or person, the user might give a short, direct command that relies on it. The AI needs to use the context to predict the intent and fill the slots for this command.

Example:

-The user: Vivi, is the Taco-taco on Downmill Road open today?

-Vivi: Let me see... No, it's closed today, but it'll open tomorrow

-The user: Ok, book a place for two, arrive around 6 p.m. (for that Taco-taco)

-Vivi: Got it. I'll book it tomorrow morning
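
The slot carry-over in this dialogue can be sketched as a merge of the new command with what the conversation has already established. The slot names and the dictionary representation are assumptions made for illustration; explicit values in the command always win over the context.

```python
from typing import Dict


def resolve_slots(command_slots: Dict[str, str], context: Dict[str, str]) -> Dict[str, str]:
    """Fill slots the user left out with values remembered from earlier turns."""
    resolved = dict(command_slots)
    for name, value in context.items():
        # Only fill gaps; never overwrite what the user just said.
        resolved.setdefault(name, value)
    return resolved


# Context left behind by the question about opening hours.
context = {"restaurant": "Taco-taco on Downmill Road", "date": "tomorrow"}
# Slots extracted from "book a place for two, arrive around 6 p.m."
command = {"intent": "book_table", "party_size": "2", "time": "6 p.m."}
booking = resolve_slots(command, context)
```

Here `booking` contains the restaurant and the date even though the user mentioned neither in the second command.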


Context can be used to understand polysemous words. Words can have different meanings in different contexts.

Example:

-There is a tennis match on TV

-The user: Vivi, who is on the court?

-Vivi: I see, this is a match between Novak Djokovic and Roger Federer
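
As a toy illustration of the idea, the sketch below picks the sense of "court" whose cue words overlap most with the current context. The sense inventory and the cue words are invented for this example; a real system would more likely rely on a trained language model than on a hand-written table.

```python
from typing import Iterable, Optional

# Hypothetical sense inventory: each sense lists cue words that hint at it.
SENSES = {
    "court": {
        "sports venue": {"tennis", "match", "basketball", "player"},
        "law court": {"judge", "trial", "lawyer", "verdict"},
    }
}


def pick_sense(word: str, context_words: Iterable[str]) -> Optional[str]:
    """Return the sense with the most cue words in the context, or None if none match."""
    context = {w.lower() for w in context_words}
    best, best_score = None, 0
    for sense, cues in SENSES.get(word, {}).items():
        score = len(cues & context)
        if score > best_score:
            best, best_score = sense, score
    return best
```

With a tennis match on TV, `pick_sense("court", ["tennis", "match", "TV"])` selects the sports venue, so the question is answered with the players' names.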



The End


These are the principles I have drawn up for the design of VUIs. I summarized them from the Google Conversation Design guidelines, the Amazon Alexa guidelines and some Internet resources. Though I've tried to make sure that every VUI design tip I could find is explained by these principles, I believe there are still some I've missed. What's more, I am still interested in finding out whether they can be supported by psychological or linguistic theories and by empirical data. With the increasing popularity of VUIs, I believe research in this area is well worth the time and could significantly improve the user experience of future applications.



References & Materials


Amazon’s guidelines for Voice Interaction:


Google's guidelines for Conversation Design:


Pearl, Cathy, Designing Voice User Interfaces: Principles of Conversational Experiences, O’Reilly Media, 2016


Meisel, William, VUI Visions: Expert Views on Effective Voice User Interface Design, TMA Associates, 2006


Bouzid, Ahmed and Ma, Weiye, The Elements of VUI Style, Dakota Press, 2012

