HAI-Lecture18
User Testing for VUIs
Human-AI Interaction
Lecture 18
§Based on Chapters 6 and 7 in Cathy Pearl’s book
§User Testing – Basics
§Early Stage Testing
§Wizard of Oz
§Usability Testing
§Prerelease Testing
§Pilot Testing
This lecture
Part 1. User Testing
Chapter 6.
Pearl, C. (2017). Designing Voice User Interfaces
Think, Discuss, Share (3 mins) – For user testing for VUIs, what are
some of the things you may want to test?
§Special considerations for VUIs
§Do users understand that they can talk to the system? Do they
know how (what they have to say), or when?
§ The discoverability problem; prompt design
§Does your VUI understand the way people actually talk to it?
§What are the kinds of things people say/ask, and the words
people use → your VUI needs to recognise them,
e.g., “set us a family quiz” keyword not matched to the expected intent (see the sketch below)
§ Is the Dialog Management in your VUI effective?
§Do people get the things done with it that they’re supposed to? How
well do your implemented strategies (error recovery,
confirmation, disambiguation, etc.) work?
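§ A minimal sketch of why narrow keyword-based intent matching misses utterances like “set us a family quiz”; the intent names and trigger phrases below are made up for illustration, not taken from any real system:

# Hypothetical keyword-based intent matcher; intents and phrases are illustrative only.
INTENTS = {
    "start_quiz": ["start a quiz", "play a quiz", "quiz me"],
}

def match_intent(utterance):
    text = utterance.lower()
    for intent, phrases in INTENTS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return None  # no match: the VUI falls back to an error prompt

print(match_intent("play a quiz"))           # 'start_quiz'
print(match_intent("set us a family quiz"))  # None: real users' wording isn't covered

§ User testing surfaces exactly this mismatch between the phrases you anticipated and the words people actually use.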
User testing for VUIs
§Why testing with real users?
§ Testing… for what purpose?
§Designing a study
§ Task definition and order
§Participants
§Data collection
§Data analysis
User testing
§Part and parcel of human-centred / user-centred design
§ Find problems early on in the process, fix them (cost!)
§Most technology is designed to be used by and useful for
people, so it should be tested with them too (effectiveness,
efficiency)
§People draw on prior experience when interacting with
technology, which can help or hinder – you can only find out by testing
§ Testing helps you improve your product, and therefore make more
money, have more satisfied customers, etc.
§People can have expert / local knowledge that helps you design
a better VUI, especially if they are the users you are designing for
Why testing with real users?
§What is the user testing for?
§ To ‘measure’ user experience or task performance (response
times, accuracy etc.), e.g., successes and pain points
§Subjective (e.g., ratings) / objective measurement (elapsed
time, etc.)
§Self-reported vs. behavioural/observational data (e.g., number of errors)
§ To inform design decisions (formative evaluation)
§Early on, or mid stage
§ To evaluate final prototype (summative evaluation)
§ To report ‘success measures’ to clients, write a research paper,
or have other ‘extrinsic’ reasons
Testing… for what purpose
§ What is the goal/purpose for the study?
§ Can be driven by your research questions or hypotheses, or be exploratory
§ Study user experience, performance, etc.
§ What are you asking the user to do?
§ Participants
§ Ethics, recruitment, instructions, reimbursement, sample size etc.
§ Data collection and analysis
§ How and what data are you collecting? (interviews, surveys, measurement..)
§ Does it measure what you want it to measure? (Validity)
§ How are you analysing the data (e.g., quantitative or qualitative)
Designing a user study
§Designed to exercise the parts of the system you wish to test
§ focused on primary dialog paths
§ features that are likely to be used frequently,
§ tasks in areas of high risk,
§ tasks that address the major goals and design criteria identified
during requirements definition
§write the task definitions carefully to avoid biasing the participant
§ describe the goal of the task without mentioning command words
or strategies for completing the task
§E.g., “Please use this VUI to order two dishes from a restaurant.”
Task definition
§ To avoid order effects (e.g., primacy and recency effects),
randomize task order if possible, e.g., using a Latin Square design so that each
task appears in every position (see the sketch below)
§ If you’re using conditions, counterbalance their order across your participants
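§ A minimal sketch of generating a task order with a simple cyclic Latin square, so each task appears once in every serial position across participants; the task names are placeholders (a Williams design would additionally balance carryover between adjacent tasks):

# Cyclic Latin square: row i is the task list rotated by i positions.
tasks = ["book_table", "order_food", "set_reminder", "play_quiz"]  # placeholder task names

def latin_square_orders(tasks):
    n = len(tasks)
    return [[tasks[(i + j) % n] for j in range(n)] for i in range(n)]

for participant, order in enumerate(latin_square_orders(tasks), start=1):
    print(f"P{participant}: {order}")
# Assign participants to rows in turn; P5 starts again from row 1, and so on.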
Task order
§ Characteristics
§ Demographics, experience of use, practicality (how far away) etc.
§ Sample and population
§ Stratified sample
§ Sample with certain characteristics, e.g., by age: 5 in their 20s, 5 in their 30s…
§ Representative sample
§ Allow generalizable statements for a population, need large N
§ How many is enough? (for Usability Testing)
§ A common rule of thumb: test with 5 users or fewer (2 users for low-fidelity
prototypes) and do many iterations of testing (at least 3 rounds of testing).
§ For fewer iterations, test with 8-10 users for prototypes, and 15-20 users for
finished products; if you can, or need to, iterate, then a second round of
testing should suffice. Source: https://www.experiencedynamics.com/blog/2019/03/5-user-sample-size-myth-how-many-users-should-you-really-test-your-ux#:~:text=If%20your%20prototype%20is%20higher,10%20(Agile%20User%20Testing).
Participants
§Questionnaires/surveys/interviews to gather self-report data
§ Lots of techniques (e.g., semi-structured interviews)
§Standardised instruments (e.g., the System Usability Scale, SUS – see the scoring sketch below)
§ Tools to make your own (Qualtrics, SurveyMonkey, etc.)
§Open/closed-ended questions, Likert scales, rankings, scores, etc.
§Measurements, video/audio to gather observational data
§Quantitative: Errors, time, completions, number of words etc.
§Qualitative: Audio recordings of dialogs / interactions with VUIs
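§ A minimal sketch of scoring the System Usability Scale (SUS) from the ten 1-5 item responses; the example responses are made up:

# SUS: odd (positively worded) items contribute (response - 1), even (negatively
# worded) items contribute (5 - response); the sum is scaled by 2.5 to give 0-100.
def sus_score(responses):
    assert len(responses) == 10, "SUS has exactly 10 items"
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0 for these made-up answers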
Data collection
§Myriad approaches
§Quantitative
§Hypothesis testing
§Descriptive and inferential statistics
§Depends on the types of data (nominal (categorical), ordinal,
interval, ratio) and its distribution (e.g., normal or log normal)
§Algorithmic/mathematical, e.g., precision and recall (see the sketch below)
§Qualitative
§ Thematic Analysis (finding the ‘themes’ in data)
§Conversation Analysis (focused on talk-in-interaction)
§ Interaction Analysis (multimodal)….and many more!
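§ As a reminder of the precision/recall measures mentioned above, a minimal sketch for a single intent; the counts are made up:

# Precision = TP / (TP + FP): of the utterances matched to the intent, how many were right.
# Recall    = TP / (TP + FN): of the utterances that truly belonged to the intent, how many were caught.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall(tp=42, fp=8, fn=10))  # (0.84, ~0.81) with these made-up counts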
Data Analysis
Part 2. Early Stage and Usability Testing
Chapter 6.
Pearl, C. (2017). Designing Voice User Interfaces
§ Testing concepts and dialog flow early on in the design process
§ Table reads with sample dialogs
§How does it sound? Repetitive and stilted? Or concise and
natural? Etc.
§Good for spotting too many transitions with the same wording
(conversational markers), e.g., “thank you” or “got it”
§ Initial reactions to mock-ups
§Wizard of Oz testing
§ “Human behind the curtain” simulates fully working system
§Realistic but much cheaper/quicker to implement
§Elicitation study to learn what language/terms people use
Early-stage testing
§How do you create a realistic simulation of features of an NLU
system, like recognition accuracy?
§ The human wizard’s own understanding makes the simulation of NLU
unrealistic; you need a strict protocol (the wizard’s rules), e.g.:
§ IF the user says the right ‘keywords’ AND the request is
complete, produce success response
§ELSE produce an error response
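§ A minimal sketch of such a protocol as executable rules the wizard follows instead of their own judgement; the keywords and slot names are illustrative, not from any particular study:

# Hypothetical wizard rules for a table-booking task.
REQUIRED_KEYWORDS = {"book", "table"}      # illustrative trigger words
REQUIRED_SLOTS = {"time", "party_size"}    # the request is 'complete' only with these

def wizard_response(utterance, filled_slots):
    has_keywords = all(k in utterance.lower() for k in REQUIRED_KEYWORDS)
    is_complete = REQUIRED_SLOTS.issubset(filled_slots)
    if has_keywords and is_complete:
        return "SUCCESS: play the confirmation prompt"
    return "ERROR: play the error/recovery prompt"

print(wizard_response("book a table for four at seven", {"time", "party_size"}))  # SUCCESS
print(wizard_response("can we eat at yours tonight", {"time"}))                    # ERROR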
§Researchers doing Wizard of Oz studies need to produce a
believable fiction; this is basically a performance
§Example from our paper
§ Martin Porcheron, Joel E. Fischer, and Stuart Reeves. 2020. Pulling Back the Curtain on the
Wizards of Oz. Proc. ACM Hum.-Comput. Interact. 4, CSCW3, Article 243 (December 2020), 22
pages. https://doi.org/10.1145/3432942
Wizard of Oz
§ When you have a working app
§ Should have the features you want to test in a fully working state
§ May need to create a fake user account for this
§ Parts may still be hardcoded (e.g., if the backend is not hooked up)
§ To test the dialog flow and ease of use
§ Not generally aimed at testing recognition accuracy
§ Usually in a ‘lab setting’ (a ‘controlled’ environment)
§ But can be done remotely
§ Many more potential users can take part online
§ People don’t have to travel, and may be less self-conscious
§ Better for testing ‘in the wild’ / real-world scenarios
§ Works in a pandemic too!
Usability Testing
§Measure both performance (objective) and experience (subjective)
§ Likes and dislikes, but also task completion, errors, recovery etc.
§ Key measurements for testing VUI systems (Larsen, 2003):
Larsen, L.B. (2003). “Assessment of spoken dialogue system usability: What are we really measuring?” Eurospeech, Geneva.
§ accuracy and speed, cognitive effort, transparency/confusion,
friendliness, and voice
§ Identify pain points: Where did the users struggle? Did they
know when it was OK to speak? When things went wrong, were
they able to recover successfully?
§Recommendations: rank the issues by severity, create a plan on
when and how you’ll be able to fix them
Usability Testing cont’d
§ Lab vs. ‘in the wild’ / ‘real world’
§ In-car testing. Simulators or the real
thing – risks!
§ In the home. Need to balance
potential intrusions with benefits
§Generally, field trial in the setting that
your technology is going into!
§Shows the ‘messiness’ of technology in
context – think of our Alexa family
§Ecological validity is a key quality of
research studies
Part 3. Prerelease and Pilot Testing
Chapter 7.
Pearl, C. (2017). Designing Voice User Interfaces
§VUI-specific prerelease tests
§Dialog traversal testing
§Recognition testing
§ Load testing
§Pilot testing
§Success criteria
§ Transcription and analysis
Prerelease and Pilot Testing
§Dialog traversal testing
§ The purpose is to make sure that the system accurately
implements the dialog specification in complete detail
§ Tests all transitions, error prompts, help prompts, or anything else
that could happen at any given state in your dialog
§ every universal (e.g., help, repeat) and every error condition must be tested
§ For example:
try an out-of-grammar utterance to test the behavior in response to a
recognition reject,
try silence to test no-speech timeouts,
trigger multiple successive errors within dialog states, etc.
(see the traversal sketch below)
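§ A minimal sketch of how traversal cases might be scripted; the dialog manager API (goto/handle/prompt_id) and state names are hypothetical, so adapt them to your own framework:

# Each case pins one dialog state to the prompt/transition the spec requires.
TRAVERSAL_CASES = [
    # (state, simulated user input(s), expected prompt)
    ("ask_party_size", ["purple monkey dishwasher"], "recognition_reject_prompt"),  # out-of-grammar
    ("ask_party_size", [""],                         "no_speech_timeout_prompt"),   # silence
    ("ask_party_size", ["help"],                     "help_prompt"),                # universal
    ("ask_party_size", ["", ""],                     "escalated_error_prompt"),     # two errors in a row
]

def run_traversal(dialog_manager):
    for state, turns, expected in TRAVERSAL_CASES:
        dialog_manager.goto(state)                  # hypothetical API
        for turn in turns:
            response = dialog_manager.handle(turn)  # hypothetical API
        assert response.prompt_id == expected, (state, turns, response.prompt_id)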
Prerelease Testing
§ Load testing
§ verifying that the system will perform under the stress of many
concurrent user sessions
§Will it crash or slow down to a crawl?
§Can be automated with third-party services (see the sketch below)
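§ A minimal sketch of a home-grown concurrency check; in practice a dedicated load-testing tool or service does this better, and the endpoint URL and payload here are placeholders:

# Fire N concurrent requests at the VUI backend and report a latency percentile.
import asyncio, time
import aiohttp  # third-party HTTP client, assumed to be installed

async def one_session(session, url):
    start = time.perf_counter()
    async with session.post(url, json={"utterance": "book a table for two"}) as resp:
        await resp.read()
        return resp.status, time.perf_counter() - start

async def load_test(url="https://example.com/vui/api", concurrent=100):
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*[one_session(session, url) for _ in range(concurrent)])
    latencies = sorted(t for _, t in results)
    print(f"p95 latency: {latencies[int(0.95 * len(latencies)) - 1]:.2f}s")

# asyncio.run(load_test())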
Prerelease Testing
§Define success criteria
§Not just recognition accuracy!
§ Talk to marketing, sales, support and other stakeholders, e.g.,
§Sixty percent of users who start to make a hotel reservation
complete it.
§Eighty-five percent of users complete a daily wellness check-in
at least 20 days out of a month.
§ The error rate for playing songs is less than 15 percent.
§ Five hundred users download the app in the first month.
§ The user satisfaction survey gives the app an average rating of
at least four stars.
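§ A minimal sketch of checking measured pilot metrics against criteria like those above; the metric names, targets, and measured values are illustrative and would normally come from your logs and surveys:

# Success criteria as (metric, comparison, target).
CRITERIA = [
    ("hotel_booking_completion_rate", ">=", 0.60),
    ("song_error_rate",               "<=", 0.15),
    ("avg_satisfaction_stars",        ">=", 4.0),
]
measured = {"hotel_booking_completion_rate": 0.57,   # made-up pilot results
            "song_error_rate": 0.12,
            "avg_satisfaction_stars": 4.2}

for metric, op, target in CRITERIA:
    value = measured[metric]
    met = value >= target if op == ">=" else value <= target
    print(f"{metric}: {value} (target {op} {target}) -> {'PASS' if met else 'FAIL'}")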
Pilot Testing
§Other key measures of success
§ Task completion rates. % of tasks completed (e.g., user books X).
§ Dropout rates (inverse of above). % and position of dropouts
§ Time spent. Average time per task or in overall app
§ Barge-in. Position and frequencies.
§ Speech vs. GUI. Rate and position(s)
§ No-speech timeouts. Position, rate, etc.
§ No intent matches. Correct rejects (out-of-domain “favourite color is
spaghetti”) vs. false rejects (in-domain, e.g., “set us a quiz”)
§ Errors. Types of errors, position, recovery etc.
§ Navigation. Back button use, menu items etc.
§ Latency. End-of-speech timeout, return of result
Pilot Testing
§ Make sure you log these measures of success (see the example record after this list), e.g.,
§ Recognition result (what the recognizer heard when the user spoke,
including confidence scores)
§ N-best list, if available (list of possible hypotheses)
§ Audio of user’s utterance for each state, including pre- and post-
endpointed utterances (for transcriptions, since the recognition result is not
100 percent accurate)
§ If recognition did match to something, what it matched to
§ Errors: no-speech timeouts (including timing information), no match,
recognition errors
§ State names (or another way to track where in the app the user traversed)
§ Barge-in information, if barge-in is enabled
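§ A minimal sketch of one per-turn log record covering the fields above, written as JSON lines for later analysis; the field names are illustrative rather than a prescribed schema:

from dataclasses import dataclass, asdict
from typing import List, Optional
import json

@dataclass
class TurnLog:
    state_name: str                 # where in the app the user was
    recognition_result: str         # what the recogniser heard
    confidence: float
    n_best: List[str]               # alternative hypotheses, if available
    matched_intent: Optional[str]   # what it matched to, if anything
    error_type: Optional[str]       # e.g., "no_speech_timeout", "no_match"
    audio_path: str                 # utterance audio kept for transcription
    barge_in: bool

log = TurnLog("ask_colour", "my favourite colour is red", 0.92,
              ["red", "bread"], "set_colour", None, "audio/turn_0042.wav", False)
print(json.dumps(asdict(log)))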
§ Manually transcribe the user’s utterances from your log data
§ Examine “in-domain” vs. “out-of-domain” data, e.g., for a “what’s your favourite colour?” intent:
In-domain
§ Correct accept: The recognizer returned the correct answer. [“My favorite color is red.” → “red”] ✅
§ False accept: The recognizer returned an incorrect answer. [“My favorite’s teal” → a different colour] ❌
§ False reject: The recognizer could not find a good match to any path in the grammar and therefore rejected rather than returned an answer. [“I think magenta is nice.” → no match] ❌
Out-of-domain
§ Correct reject: The recognizer correctly rejected the input. [“I think I would like to book a hotel” → no match] ✅
§ False accept: The recognizer returned an answer that is, by definition, wrong because the input was not in the grammar. [“I read the paper” → “red”] ❌
Transcription and analysis
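§ A minimal sketch of tallying transcribed turns into the categories above; each row pairs the transcriber’s judgement of the true intent (None if out-of-domain) with what the recogniser returned (None if rejected), and the data is made up:

from collections import Counter

rows = [
    {"true_intent": "set_colour", "recognised": "set_colour"},  # correct accept
    {"true_intent": "set_colour", "recognised": "book_hotel"},  # false accept (in-domain)
    {"true_intent": "set_colour", "recognised": None},          # false reject
    {"true_intent": None,         "recognised": None},          # correct reject
    {"true_intent": None,         "recognised": "set_colour"},  # false accept (out-of-domain)
]

def classify(row):
    true, rec = row["true_intent"], row["recognised"]
    if true is not None:
        if rec is None:
            return "false reject"
        return "correct accept" if rec == true else "false accept (in-domain)"
    return "correct reject" if rec is None else "false accept (out-of-domain)"

print(Counter(classify(r) for r in rows))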
§User testing all the way from early concept to release, and beyond
§ to get the user experience right (language, flow, error recovery…)
§ to optimise system performance (recognition accuracy, latency, …)
§ to monitor how your app is doing and allow the designer to
improve performance
§Ensure that
§ you’ve talked to stakeholders to define success measures (task
completion rates, etc.)
§ and that you are logging the right data to measure these.
§ Tracking speech failure points and having a way to quickly make
improvements is essential for successful VUIs.
Conclusion
Quiz 3 next week!
CW1 due next week!
Create your Voiceflow