HAI-Lecture18
User Testing for VUIs
Human-AI Interaction
Lecture 18
§Based on Chapters 6 and 7 in Cathy Pearl’s book
§User Testing – Basics
§Early Stage Testing
§Wizard of Oz
§Usability Testing
§Prerelease Testing
§Pilot Testing
This lecture
Part 1. User Testing
Chapter 6.
Pearl, C. (2017). Designing Voice User Interfaces
Think, Discuss, Share (3 mins) – For user testing for VUIs, what are
some of the things you may want to test?
§Special considerations for VUIs
§Do users understand that they can talk to the system? Do they
know how (what they have to say), or when?
§ The discoverability problem; prompt design
§Does your VUI understand the way people actually talk to it?
§What are the kinds of things people say/ask, and the words
people use → your VUI needs to recognise them,
e.g., “set us a family quiz” keyword not matched to the expected intent (see the sketch below)
§ Is the Dialog Management in your VUI effective?
§Do people get the things done with it that they’re supposed to? How
well do your implemented strategies (error recovery,
confirmation, disambiguation, etc.) work?
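§ A minimal sketch of why narrow keyword-based intent matching misses utterances like “set us a family quiz”; the intent names and trigger phrases below are made up for illustration, not taken from any real system:

# Hypothetical keyword-based intent matcher; intents and phrases are illustrative only.
INTENTS = {
    "start_quiz": ["start a quiz", "play a quiz", "quiz me"],
}

def match_intent(utterance):
    text = utterance.lower()
    for intent, phrases in INTENTS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return None  # no match: the VUI falls back to an error prompt

print(match_intent("play a quiz"))           # 'start_quiz'
print(match_intent("set us a family quiz"))  # None: real users' wording isn't covered

§ User testing surfaces exactly this mismatch between the phrases you anticipated and the words people actually use.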
User testing for VUIs
§Why testing with real users?
§ Testing… for what purpose?
§Designing a study
§ Task definition and order
§Participants
§Data collection
§Data analysis
User testing
§Part and parcel of human-centred / user-centred design
§ Find problems early on in the process, fix them (cost!)
§Most technology is designed to be used by and useful for
people, so it should be tested with them too (effectiveness,
efficiency)
§People draw on prior experience when interacting with
technology, which can help or hinder – you can only find out by testing
§ Testing helps you improve your product, and therefore make more
money, have more satisfied customers, etc.
§People can have expert / local knowledge that helps you design
a better VUI, especially if they are the users you are designing for
Why testing with real users?
§What is the user testing for?
§ To ‘measure’ user experience or task performance (response
times, accuracy etc.), e.g., successes and pain points
§Subjective (e.g., ratings) / objective measurement (elapsed
time, etc.)
§Self-reported vs. behavioural/observational data (e.g., number of errors)
§ To inform design decisions (formative evaluation)
§Early on, or mid stage
§ To evaluate final prototype (summative evaluation)
§ To report ‘success measures’ to clients, write a research paper,
or have other ‘extrinsic’ reasons
Testing… for what purpose
§ What is the goal/purpose for the study?
§ Can be driven by your research questions or hypotheses, or be exploratory
§ Study user experience, performance, etc.
§ What are you asking the user to do?
§ Participants
§ Ethics, recruitment, instructions, reimbursement, sample size etc.
§ Data collection and analysis
§ How and what data are you collecting? (interviews, surveys, measurement..)
§ Does it measure what you want it to measure? (Validity)
§ How are you analysing the data (e.g., quantitative or qualitative)
Designing a user study
§Designed to exercise the parts of the system you wish to test
§ focused on primary dialog paths
§ features that are likely to be used frequently,
§ tasks in areas of high risk,
§ tasks that address the major goals and design criteria identified
during requirements definition
§write the task definitions carefully to avoid biasing the participant
§ describe the goal of the task without mentioning command words
or strategies for completing the task
§E.g., “Please use this VUI to order two dishes from a restaurant.”
Task definition
§ To avoid order effects (e.g., primacy and recency effects),
randomize task order if possible, e.g., using a Latin Square design so that each
task appears in every position (see the sketch below)
§ If you’re using conditions, counterbalance their order across your participants
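§ A minimal sketch of generating a task order with a simple cyclic Latin square, so each task appears once in every serial position across participants; the task names are placeholders (a Williams design would additionally balance carryover between adjacent tasks):

# Cyclic Latin square: row i is the task list rotated by i positions.
tasks = ["book_table", "order_food", "set_reminder", "play_quiz"]  # placeholder task names

def latin_square_orders(tasks):
    n = len(tasks)
    return [[tasks[(i + j) % n] for j in range(n)] for i in range(n)]

for participant, order in enumerate(latin_square_orders(tasks), start=1):
    print(f"P{participant}: {order}")
# Assign participants to rows in turn; P5 starts again from row 1, and so on.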
Task order
§ Characteristics
§ Demographics, experience of use, practicality (how far away) etc.
§ Sample and population
§ Stratified sample
§ Sample with certain characteristics, e.g., by age: 5 in their 20s, 5 in their 30s…
§ Representative sample
§ Allow generalizable statements for a population, need large N
§ How many is enough? (for Usability Testing)
§ A common rule of thumb: test with 5 users or fewer (2 users for low-fidelity
prototypes) and do many iterations of testing (at least 3 rounds of testing).
§ For fewer iterations, test with 8-10 users for prototypes, and 15-20 users for
finished products; if you can, or need to, iterate, then a second round of
testing should suffice. Source: https://www.experiencedynamics.com/blog/2019/03/5-user-sample-size-myth-how-many-users-should-you-really-test-your-ux#:~:text=If%20your%20prototype%20is%20higher,10%20(Agile%20User%20Testing).
Participants
§Questionnaires/surveys/interviews to gather self-report data
§ Lots of techniques (e.g., semi-structured interviews)
§Standardised instruments (e.g., the System Usability Scale, SUS – see the scoring sketch below)
§ Tools to make your own (Qualtrics, SurveyMonkey, etc.)
§Open/closed-ended questions, Likert scales, rankings, scores, etc.
§Measurements, video/audio to gather observational data
§Quantitative: Errors, time, completions, number of words etc.
§Qualitative: Audio recordings of dialogs / interactions with VUIs
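§ A minimal sketch of scoring the System Usability Scale (SUS) from the ten 1-5 item responses; the example responses are made up:

# SUS: odd (positively worded) items contribute (response - 1), even (negatively
# worded) items contribute (5 - response); the sum is scaled by 2.5 to give 0-100.
def sus_score(responses):
    assert len(responses) == 10, "SUS has exactly 10 items"
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0 for these made-up answers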
Data collection
§Myriad approaches
§Quantitative
§Hypothesis testing
§Descriptive and inferential statistics
§Depends on the types of data (nominal (categorical), ordinal,
interval, ratio) and its distribution (e.g., normal or log normal)
§Algorithmic/mathematical, e.g., precision and recall (see the sketch below)
§Qualitative
§ Thematic Analysis (finding the ‘themes’ in data)
§Conversation Analysis (focused on talk-in-interaction)
§ Interaction Analysis (multimodal)….and many more!
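§ As a reminder of the precision/recall measures mentioned above, a minimal sketch for a single intent; the counts are made up:

# Precision = TP / (TP + FP): of the utterances matched to the intent, how many were right.
# Recall    = TP / (TP + FN): of the utterances that truly belonged to the intent, how many were caught.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall(tp=42, fp=8, fn=10))  # (0.84, ~0.81) with these made-up counts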
Data Analysis
Part 2. Early Stage and Usability Testing
Chapter 6.
Pearl, C. (2017). Designing Voice User Interfaces
§ Testing concepts and dialog flow early on in the design process
§ Table reads with sample dialogs
§How does it sound? Repetitive and stilted? Or concise and
natural? Etc.
§Good for spotting too many transitions with the same wording
(conversational markers), e.g., “thank you” or “got it”
§ Initial reactions to mock-ups
§Wizard of Oz testing
§ “Human behind the curtain” simulates fully working system
§Realistic but much cheaper/quicker to implement
§Elicitation study to learn what language/terms people use
Early-stage testing
§How do you create a realistic simulation of features of an NLU
system, like recognition accuracy?
§ The human wizard’s own understanding makes the simulation of NLU
unrealistic; you need a strict protocol (the wizard’s rules), e.g.:
§ IF the user says the right ‘keywords’ AND the request is
complete, produce success response
§ELSE produce an error response
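§ A minimal sketch of such a protocol as executable rules the wizard follows instead of their own judgement; the keywords and slot names are illustrative, not from any particular study:

# Hypothetical wizard rules for a table-booking task.
REQUIRED_KEYWORDS = {"book", "table"}      # illustrative trigger words
REQUIRED_SLOTS = {"time", "party_size"}    # the request is 'complete' only with these

def wizard_response(utterance, filled_slots):
    has_keywords = all(k in utterance.lower() for k in REQUIRED_KEYWORDS)
    is_complete = REQUIRED_SLOTS.issubset(filled_slots)
    if has_keywords and is_complete:
        return "SUCCESS: play the confirmation prompt"
    return "ERROR: play the error/recovery prompt"

print(wizard_response("book a table for four at seven", {"time", "party_size"}))  # SUCCESS
print(wizard_response("can we eat at yours tonight", {"time"}))                    # ERROR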
§Researchers doing Wizard of Oz studies need to produce a
believable fiction; this is basically a performance
§Example from our paper
§ Martin Porcheron, Joel E. Fischer, and Stuart Reeves. 2020. Pulling Back the Curtain on the
Wizards of Oz. Proc. ACM Hum.-Comput. Interact. 4, CSCW3, Article 243 (December 2020), 22
pages. https://doi.org/10.1145/3432942
Wizard of Oz
§ When you have a working app
§ Should have the features you want to test in a fully working state
§ May need to create a fake user account for this
§ Parts may still be hardcoded (e.g., if the backend is not hooked up)
§ To test the dialog flow and ease of use
§ Not generally aimed at testing recognition accuracy
§ Usually in a ‘lab setting’ (a ‘controlled’ environment)
§ But can be done remotely
§ Many more potential users can take part online
§ People don’t have to travel, and may be less self-conscious
§ Better for testing ‘in the wild’ / real-world scenarios
§ Works in a pandemic too!
Usability Testing
§Measure both performance (objective) and experience (subjective)
§ Likes and dislikes, but also task completion, errors, recovery etc.
§ Key measurements for testing VUI systems (Larsen, 2003):
Larsen, L.B. (2003). “Assessment of spoken dialogue system usability: What are we really measuring?” Eurospeech, Geneva.
§ accuracy and speed, cognitive effort, transparency/confusion,
friendliness, and voice
§ Identify pain points: Where did the users struggle? Did they
know when it was OK to speak? When things went wrong, were
they able to recover successfully?
§Recommendations: rank the issues by severity, create a plan on
when and how you’ll be able to fix them
Usability Testing cont’d
§ Lab vs. ‘in the wild’ / ‘real world’
§ In-car testing. Simulators or the real
thing – risks!
§ In the home. Need to balance
potential intrusions with benefits
§Generally, field trial in the setting that
your technology is going into!
§Shows the ‘messiness’ of technology in
context – think of our Alexa family
§Ecological validity is a key quality of
research studies
Part 3. Prerelease and Pilot Testing
Chapter 7.
Pearl, C. (2017). Designing Voice User Interfaces
§VUI-specific prerelease tests
§Dialog traversal testing
§Recognition testing
§ Load testing
§Pilot testing
§Success criteria
§ Transcription and analysis
Prerelease and Pilot Testing
§Dialog traversal testing
§ The purpose is to make sure that the system accurately
implements the dialog specification in complete detail
§ Tests all transitions, error prompts, help prompts, or anything else
that could happen at any given state in your dialog
§ every universal (e.g., help, repeat) and every error condition must be tested
§ For example:
try an out-of-grammar utterance to test the behavior in response to a
recognition reject,
try silence to test no-speech timeouts,
trigger multiple successive errors within dialog states, etc.
(see the traversal sketch below)
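§ A minimal sketch of how traversal cases might be scripted; the dialog manager API (goto/handle/prompt_id) and state names are hypothetical, so adapt them to your own framework:

# Each case pins one dialog state to the prompt/transition the spec requires.
TRAVERSAL_CASES = [
    # (state, simulated user input(s), expected prompt)
    ("ask_party_size", ["purple monkey dishwasher"], "recognition_reject_prompt"),  # out-of-grammar
    ("ask_party_size", [""],                         "no_speech_timeout_prompt"),   # silence
    ("ask_party_size", ["help"],                     "help_prompt"),                # universal
    ("ask_party_size", ["", ""],                     "escalated_error_prompt"),     # two errors in a row
]

def run_traversal(dialog_manager):
    for state, turns, expected in TRAVERSAL_CASES:
        dialog_manager.goto(state)                  # hypothetical API
        for turn in turns:
            response = dialog_manager.handle(turn)  # hypothetical API
        assert response.prompt_id == expected, (state, turns, response.prompt_id)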
Prerelease Testing
§ Load testing
§ verifying that the system will perform under the stress of many
concurrent user sessions
§Will it crash or slow down to a crawl?
§Can be automated with third-party services (see the sketch below)
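§ A minimal sketch of a home-grown concurrency check; in practice a dedicated load-testing tool or service does this better, and the endpoint URL and payload here are placeholders:

# Fire N concurrent requests at the VUI backend and report a latency percentile.
import asyncio, time
import aiohttp  # third-party HTTP client, assumed to be installed

async def one_session(session, url):
    start = time.perf_counter()
    async with session.post(url, json={"utterance": "book a table for two"}) as resp:
        await resp.read()
        return resp.status, time.perf_counter() - start

async def load_test(url="https://example.com/vui/api", concurrent=100):
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*[one_session(session, url) for _ in range(concurrent)])
    latencies = sorted(t for _, t in results)
    print(f"p95 latency: {latencies[int(0.95 * len(latencies)) - 1]:.2f}s")

# asyncio.run(load_test())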
Prerelease Testing
§Define success criteria
§Not just recognition accuracy!
§ Talk to marketing, sales, support and other stakeholders, e.g.,
§Sixty percent of users who start to make a hotel reservation
complete it.
§Eighty-five percent of users complete a daily wellness check-in
at least 20 days out of a month.
§ The error rate for playing songs is less than 15 percent.
§ Five hundred users download the app in the first month.
§ The user satisfaction survey gives the app an average rating of
at least four stars.
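§ A minimal sketch of checking measured pilot metrics against criteria like those above; the metric names, targets, and measured values are illustrative and would normally come from your logs and surveys:

# Success criteria as (metric, comparison, target).
CRITERIA = [
    ("hotel_booking_completion_rate", ">=", 0.60),
    ("song_error_rate",               "<=", 0.15),
    ("avg_satisfaction_stars",        ">=", 4.0),
]
measured = {"hotel_booking_completion_rate": 0.57,   # made-up pilot results
            "song_error_rate": 0.12,
            "avg_satisfaction_stars": 4.2}

for metric, op, target in CRITERIA:
    value = measured[metric]
    met = value >= target if op == ">=" else value <= target
    print(f"{metric}: {value} (target {op} {target}) -> {'PASS' if met else 'FAIL'}")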
Pilot Testing
§Other key measures of success
§ Task completion rates. % of tasks completed (e.g., user books X).
§ Dropout rates (inverse of above). % and position of dropouts
§ Time spent. Average time per task or in overall app
§ Barge-in. Position and frequencies.
§ Speech vs. GUI. Rate and position(s)
§ No-speech timeouts. Position, rate, etc.
§ No intent matches. Correct rejects (out-of-domain “favourite color is
spaghetti”) vs. false rejects (in-domain, e.g., “set us a quiz”)
§ Errors. Types of errors, position, recovery etc.
§ Navigation. Back button use, menu items etc.
§ Latency. End-of-speech timeout, return of result
Pilot Testing
§ Make sure you log these measures of success (see the example record after this list), e.g.,
§ Recognition result (what the recognizer heard when the user spoke,
including confidence scores)
§ N-best list, if available (list of possible hypotheses)
§ Audio of user’s utterance for each state, including pre- and post-
endpointed utterances (for transcriptions, since the recognition result is not
100 percent accurate)
§ If recognition did match to something, what it matched to
§ Errors: no-speech timeouts (including timing information), no match,
recognition errors
§ State names (or another way to track where in the app the user traversed)
§ Barge-in information, if barge-in is enabled
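§ A minimal sketch of one per-turn log record covering the fields above, written as JSON lines for later analysis; the field names are illustrative rather than a prescribed schema:

from dataclasses import dataclass, asdict
from typing import List, Optional
import json

@dataclass
class TurnLog:
    state_name: str                 # where in the app the user was
    recognition_result: str         # what the recogniser heard
    confidence: float
    n_best: List[str]               # alternative hypotheses, if available
    matched_intent: Optional[str]   # what it matched to, if anything
    error_type: Optional[str]       # e.g., "no_speech_timeout", "no_match"
    audio_path: str                 # utterance audio kept for transcription
    barge_in: bool

log = TurnLog("ask_colour", "my favourite colour is red", 0.92,
              ["red", "bread"], "set_colour", None, "audio/turn_0042.wav", False)
print(json.dumps(asdict(log)))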
§ Manually transcribe the user’s utterances from your log data
§ Examine “in-domain” vs. “out-of-domain” data, e.g., for a “what’s your favourite colour?” intent:
In-domain
§ Correct accept: The recognizer returned the correct answer. [“My favorite color is red.” → “red”] ✅
§ False accept: The recognizer returned an incorrect answer. [“My favorite’s teal” → a different colour] ❌
§ False reject: The recognizer could not find a good match to any path in the grammar and therefore rejected rather than returned an answer. [“I think magenta is nice.” → no match] ❌
Out-of-domain
§ Correct reject: The recognizer correctly rejected the input. [“I think I would like to book a hotel” → no match] ✅
§ False accept: The recognizer returned an answer that is, by definition, wrong because the input was not in the grammar. [“I read the paper” → “red”] ❌
Transcription and analysis
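§ A minimal sketch of tallying transcribed turns into the categories above; each row pairs the transcriber’s judgement of the true intent (None if out-of-domain) with what the recogniser returned (None if rejected), and the data is made up:

from collections import Counter

rows = [
    {"true_intent": "set_colour", "recognised": "set_colour"},  # correct accept
    {"true_intent": "set_colour", "recognised": "book_hotel"},  # false accept (in-domain)
    {"true_intent": "set_colour", "recognised": None},          # false reject
    {"true_intent": None,         "recognised": None},          # correct reject
    {"true_intent": None,         "recognised": "set_colour"},  # false accept (out-of-domain)
]

def classify(row):
    true, rec = row["true_intent"], row["recognised"]
    if true is not None:
        if rec is None:
            return "false reject"
        return "correct accept" if rec == true else "false accept (in-domain)"
    return "correct reject" if rec is None else "false accept (out-of-domain)"

print(Counter(classify(r) for r in rows))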
§User testing all the way from early concept to release, and beyond
§ to get the user experience right (language, flow, error recovery…)
§ to optimise system performance (recognition accuracy, latency, …)
§ to monitor how your app is doing and allow the designer to
improve performance
§Ensure that
§ you’ve talked to stakeholders to define success measures (task
completion rates, etc.)
§ and that you are logging the right data to measure these.
§ Tracking speech failure points and having a way to quickly make
improvements is essential for successful VUIs.
Conclusion
Quiz 3 next week!
CW1 due next week!
Create your Voiceflow