CMP2019M Human-Computer Interaction
Week 9 – Quantitative Evaluation
Last Week
• Inspecting and testing
• Planning a user study
– Validity
– Confounds
– Ethics
– Measurement
This Week
• Quantitative Evaluation in HCI
– Part 1: Research methods
– Part 2: Data types
• Next weeks:
– Part 3: Data analysis I
– Part 4: Data analysis II
Reading
• Interaction Design: Beyond Human-Computer Interaction, by Sharp, Rogers, & Preece
• Chapter 8.3: Simple Quantitative Analysis
• Chapter 14: Evaluation Studies
Remember Week 8: Evaluation in HCI – Testing
2.1 Testing: Overview
• Alternative evaluation approach in Human-Computer Interaction that involves users in testing a system and gathers their feedback to identify areas for improvement, validate features, or gain insights into how the system affects them.
• Think: Baking cookies. When you’re done, you might wonder whether they’re any good, and you offer some to your roommates. One loves them, another thinks they’re okay, so you’re probably good.
2.1 Testing: Procedure
Research question → Study design → Data collection → Data analysis
2.1 Testing: Example
• A/B testing: Testing two different versions of a system in a controlled environment
2.1 Testing: Example
• Study design:
– What instruments do you use to gather information?
• Data collection:
– During interaction: Observations and metrics
– Post-interaction: Gather feedback with standardized questionnaires (e.g., NASA-TLX, SUS, ISO)
• Data analysis and interpretation:
– Statistical methods!
This is where quantitative research methods come into play.
Quantitative research = research based on data that can be quantified (=counted).
Quantitative research = observations + mathematical data analysis (statistics).
Part 1: Making Observations
Making Observations
• Observations provide empirical foundation for quantitative research – not just observations in the sense of ‘staring at another person’
• “Observable phenomena” – attempt at gathering objective information
• Mathematical analysis requires structured, computable input
If you were to find out how tired or awake everyone in this class was, where would you start?
Are you sure the person is tired? How do you know she’s not just bored instead?
Making Observations
• Combining multiple sources of data:
– Direct feedback from people, “I’m not sleeping!”
– People’s behaviour, e.g., bigger scale: participation in class activity, smaller scale: eyes slowly closing, head dropping down, drooling
– Verbal expressions, e.g., laughing, snoring
– Non-verbal expressions, e.g., blank stare
– Also worth considering: contextual cues, e.g., Monday morning, 8:00 AM?
How can we ensure that observations are structured and objective, leading to valid conclusions?
Qualitative approach:
“The person was observed sleeping, with their eyes lightly closed, breathing regularly and quietly.”
Quantitative approach:
The person was asleep with their eyes closed for five minutes, and took 50 breaths.
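The quantitative description above can be reduced to a single comparable number. A minimal sketch, assuming the observer recorded a breath count and a duration; the function name is illustrative and not part of the lecture:

```python
def breaths_per_minute(breath_count: int, duration_minutes: float) -> float:
    """Turn a raw count and an observation duration into a rate."""
    if duration_minutes <= 0:
        raise ValueError("duration must be positive")
    return breath_count / duration_minutes


# The observation from the slide: 50 breaths in five minutes.
rate = breaths_per_minute(50, 5)
print(f"{rate} breaths per minute")
```

A rate, unlike the raw count, can be compared across people who were observed for different lengths of time.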
Making Observations
• In order to validly turn observations into numbers, you must recognise the distinction between
– what you are measuring,
– the scale on which it is measured,
– and your instrument of measurement.
• Example: Temperature
– The observed temperature
– The degrees Celsius or Fahrenheit
– The thermometer
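The same observed temperature yields different numbers depending on the scale. A small sketch of the standard Celsius-to-Fahrenheit conversion; the function name is an illustrative assumption:

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Re-express one measurement on a different scale."""
    return celsius * 9 / 5 + 32


reading_c = 20.0  # hypothetical thermometer reading
reading_f = celsius_to_fahrenheit(reading_c)
print(f"{reading_c} degrees Celsius = {reading_f} degrees Fahrenheit")
```

What is measured has not changed, only the scale used to report it, which is why a number is meaningless without its scale.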
Operationalizing
• Finding constructs that accurately model the phenomenon that we are trying to observe – simplifying reality!
• Operationalizing sleep: Eyes shut and slower breathing rate (possibly many other things)
• Operationalizing whether somebody is tired or not: a bit more difficult
• Gathering feedback from person probably helpful – how can we do this in a standardized way?
There are instruments to help you with that!
Do you feel tired?
Are you bored?
Do you feel awake?
On a scale from 1 (tired) to 5 (awake), how would you say you feel?
Observations in HCI
• Interaction with system can be examined through observations like any other human behaviour
• What aspects of a user’s interaction with a system are observable?
• Functionality
• Accessibility
• Usability
• User Experience
How can these aspects be assessed?
Observations in HCI
• User interaction with a system can be quantified through different measures.
1. Questionnaires
2. Observation (of user interaction with system)
3. Performance metrics
• More complex approaches, e.g., biometrics – eye tracking, skin conductance, …
Measures (1): Questionnaires
• Instrument to gather structured feedback from people – usually consists of a range of items (questions) and is focused on specific topic.
• Different question/answer formats with implications for analysis of questionnaire data:
– Closed vs. open questions
• For now, we only care about closed questions that dictate answer format – easier to quantify!
Measures (1): Questionnaires
• Dichotomous questions – binary
“Do you like the interface?” Yes No
• Easy to analyse, but pushes respondent to choose one of two categories – always applicable?
• What about neutral option?
• What insights can we gain from Y/N answers?
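One way to see what Y/N answers offer: they can be counted and expressed as proportions, but not much more. A minimal sketch with made-up responses (the data is hypothetical):

```python
from collections import Counter

# Hypothetical answers to "Do you like the interface?"
answers = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"]

counts = Counter(answers)
share_yes = counts["yes"] / len(answers)

print(counts)
print(f"{share_yes:.1%} answered yes")
```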
Measures (1): Questionnaires
• Categorical questions
“What interface colour do you prefer?”
Blue Yellow Pink
• Still pushes respondent to choose category
• Can you think of a situation where this approach would work well?
Measures (1): Questionnaires
• Ranking questions
“Please rank interface colour by preference.”
Blue: 2   Yellow: 3   Pink: 1
• Adds information – creates relationship within data
• What kind of aspects could you explore through ranking questions?
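Rankings from several respondents can be combined, for instance by summing the ranks each option received (lowest sum = most preferred). This is one simple aggregation among several possible ones, sketched here with hypothetical data:

```python
# Each dict is one respondent's ranking (1 = most preferred). Hypothetical data.
rankings = [
    {"Blue": 2, "Yellow": 3, "Pink": 1},
    {"Blue": 1, "Yellow": 3, "Pink": 2},
    {"Blue": 2, "Yellow": 3, "Pink": 1},
]


def rank_sums(rankings: list[dict[str, int]]) -> dict[str, int]:
    """Add up the rank each option received across all respondents."""
    totals: dict[str, int] = {}
    for ranking in rankings:
        for option, rank in ranking.items():
            totals[option] = totals.get(option, 0) + rank
    return totals


totals = rank_sums(rankings)
overall_order = sorted(totals, key=totals.get)  # lowest rank sum first
print(totals, overall_order)
```

Note that the sums only express order: they do not say how much more one colour is preferred than another.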
Measures (1): Questionnaires
• Scalar: Semantic Differential
Using the interface was…
fun 1 2 3 4 5 boring
easy 1 2 3 4 5 hard
• Explores a range of bipolar attitudes about a particular item
• Attitudes are represented as a pair of adjectives
• Choice of adjectives can be tricky
Measures (1): Questionnaires
• Scalar: Likert Scale
I enjoyed using the interface.
strongly agree  1  2  3  4  5  strongly disagree
• Used to measure opinions, attitudes, and beliefs
• Asks user to rate their agreement or disagreement with a specific statement on a numeric scale
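Responses to a single Likert item are commonly summarised as a frequency distribution and a median. A minimal sketch, assuming answers are coded 1 (strongly agree) to 5 (strongly disagree) as on the slide; the responses are hypothetical:

```python
from collections import Counter
from statistics import median

# Hypothetical answers to "I enjoyed using the interface."
responses = [1, 2, 2, 3, 1, 2, 4, 2, 5, 1]

distribution = Counter(responses)
central = median(responses)

for value in range(1, 6):
    print(value, "#" * distribution[value])
print("median:", central)
```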
Measures (1): Questionnaires
• How do you know that your questionnaire produces a valid, reliable representation of reality?
• You go through a lengthy, detailed validation process
• Luckily, there are many validated instruments to assess aspects of usability and user experience
• Do not presume to create your own usability measurement questionnaire – choose an existing, validated option!
Measures (1): Questionnaires
• Some validated options to assess different aspects of usability and user experience:
– NASA-Task Load Index (NASA-TLX)
– ISO 9241-9 Questionnaire: Device Comfort
– System Usability Scale (SUS)
• Other research areas…
– Games: PENS and GEQ
– PANAS for affective state
Measures (1): Questionnaires
• NASA-Task Load Index: Cognitive Load
– Six items (subscales) to be rated on 20-point scale
– Analysis for each of the subscales or as composite score (“overall cognitive load”)
• Insights into how cognitively challenging an application is – focus on users’ cognition
• Full questionnaire and more information at http://humansystems.arc.nasa.gov/groups/TLX/
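A sketch of the composite score in its unweighted form ("Raw TLX"), i.e. the mean of the six subscale ratings: mental demand, physical demand, temporal demand, performance, effort, and frustration. The original procedure additionally weights subscales via pairwise comparisons, which is left out here. The 0–100 rating range and the dictionary keys are assumptions made for illustration:

```python
SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings: dict[str, float]) -> float:
    """Unweighted composite: the mean of the six subscale ratings."""
    missing = [name for name in SUBSCALES if name not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return sum(ratings[name] for name in SUBSCALES) / len(SUBSCALES)


# Hypothetical ratings from one participant (assumed 0-100 range).
participant = {
    "mental_demand": 70,
    "physical_demand": 10,
    "temporal_demand": 55,
    "performance": 30,
    "effort": 65,
    "frustration": 40,
}
print(raw_tlx(participant))
```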
Measures (1): Questionnaires
• ISO 9241-9 Questionnaire: Device Comfort
– 13 items to be rated on 5-point scale
– Analysis for each of the subscales or as composite score (“overall device comfort”)
• Insights into how physically demanding interaction is – useful when designing new hardware or interaction techniques
• Full questionnaire and research paper at http://www.yorku.ca/mack/CHI99b.html
Measures (1): Questionnaires
• System Usability Scale (SUS)
– 10 items to be rated on 5-point Likert scale
– Composite score of 68 (out of 100) considered ‘average’
• Insights into general usability of a system – good starting point for you as designers!
• Full questionnaire and more information at http://www.usability.gov/how-to-and-tools/methods/system-usability-scale.html
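The SUS composite score follows the standard scoring rule: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is multiplied by 2.5 to give a value between 0 and 100. A minimal sketch; the function name and example responses are illustrative:

```python
def sus_score(responses: list[int]) -> float:
    """Compute the SUS composite from ten responses coded 1-5, in item order."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("responses must be integers from 1 to 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:   # odd items are positively worded
            total += response - 1
        else:                      # even items are negatively worded
            total += 5 - response
    return total * 2.5


# Hypothetical participant: agrees with positive items, disagrees with negative ones.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

Because the result is scaled to 0–100 it is easily mistaken for a percentage; it is a score, to be compared against the ‘average’ of 68.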
Measures (1): Questionnaires
• How to decide which one to use?
– Hardware vs. software
– Usability vs. experience
– Participants – children, older adults – might need different questionnaires
– Type of application – games, mobile apps might require different questionnaires
• If you have a well-defined research question, choice of questionnaire is usually straightforward
Exercise:
Which questionnaire (TLX, ISO, SUS) would you apply to evaluate the following:
1. Usability of Blackboard.
2. Input gestures for Kinect game.
3. Nuclear power plant controls.
4. Dev UI for database system.
5. Sat-nav voice interface.
Remember: Sometimes there’ll be overlap between questionnaires – consider using multiple sources for richer dataset.
Measures (1): Questionnaires
• Common issues with questionnaires
– Paper versions: watch out for missed items
– (Likert) scales: people tend to stay in ‘neutral area’
– Anchoring effects and social acceptability: do people really report what they think?
– Does questionnaire match problem at hand?
• For your project: Consider asking your own questions IN ADDITION to standardized questionnaire.
Despite shortcomings:
Best approximation of people’s opinions – survey enough people, take findings with grain of salt, …
…and combine with other instruments, for example, observations, or metrics.
Measures (2): Observation
• Observing how a person interacts with your system
– Facial expressions
– Verbal comments etc.
– More specific aspects, e.g., how user performs a gesture on a tablet, or how player interacts with motion-based game interface
• Challenge: We are no longer interested in quality of experience, but in quantifiable results – how can we record data appropriately?
Measures (2): Observation
• Step 1: Defining quantifiable observations
• Example: Observing how a player interacts with a gesture-based Kinect game
– Define characteristics of expected behaviour and/or events
– Turn into observable chunks
• Basically, operationalizing observable aspects of interaction
Measures (2): Observation
• Step 2: Ensuring that observations are consistent across many participants
– Develop coding scheme for observations based on previously defined observations
Event | Occurred? | Comment
User recognition failed | Yes | User stood too far from camera
Wrong gesture detected | Yes | User waved hand, but system recognized raised arm instead
Measures (2): Observation
• Step 2: Ensuring that observations are consistent across many participants
– Develop coding scheme for observations based on previously defined observations
Event | # of occurrences
Player smiled at partner | 10
Player swore at screen | 6
Player moved, but system didn’t react | 3
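A coding scheme like the one above can be kept consistent across participants by fixing the set of allowed codes in advance and tallying only those. A minimal sketch; the code names, participant IDs and log entries are hypothetical:

```python
from collections import Counter

# Fixed in advance so that every observer codes the same events.
CODES = {"smiled_at_partner", "swore_at_screen", "moved_no_reaction"}

# Hypothetical observation log: (participant, code) in order of occurrence.
observations = [
    ("P1", "smiled_at_partner"),
    ("P1", "swore_at_screen"),
    ("P2", "smiled_at_partner"),
    ("P1", "moved_no_reaction"),
    ("P2", "smiled_at_partner"),
]


def tally(observations: list[tuple[str, str]]) -> Counter:
    """Count occurrences per code, rejecting anything outside the scheme."""
    counts: Counter = Counter()
    for participant, code in observations:
        if code not in CODES:
            raise ValueError(f"{participant}: unknown code {code!r}")
        counts[code] += 1
    return counts


event_counts = tally(observations)
print(event_counts)
```

Rejecting unknown codes forces ad-hoc observations to be added to the scheme explicitly rather than slipping into the data unnoticed.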
Measures (2): Observation
• You need to decide how/when you will make observations:
– Option 1: Observe person while she is interacting with system, ‘live’
– Option 2: Video record interaction and analyze recorded material
• Option 1 takes less time, but requires you to develop coding scheme in advance; option 2 allows you to explore recordings first
Can you see any challenges when trying to make quantifiable observations?
Measures (3): Metrics
• Performance metrics offer objective insights into how user interacted with system:
– Time taken to complete task(s)
– Number of tasks completed within a set time
– Number of errors made in completing a task
– Number of times website visited
– Number of times help consulted
– Number of successful completions
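Several of these metrics can be derived from one simple record per participant and task. A minimal sketch; the record structure, field names and values are assumptions for illustration, not a prescribed logging format:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    participant: str
    seconds: float    # time taken for the task
    errors: int       # errors made during the task
    completed: bool   # whether the task was finished successfully


# Hypothetical records for one task.
records = [
    TaskRecord("P1", 42.0, 1, True),
    TaskRecord("P2", 65.5, 3, False),
    TaskRecord("P3", 38.5, 0, True),
    TaskRecord("P4", 50.0, 2, True),
]


def summarise_metrics(records: list[TaskRecord]) -> dict[str, float]:
    """Completion rate, mean time of successful attempts, and total errors."""
    completed = [r for r in records if r.completed]
    return {
        "completion_rate": len(completed) / len(records),
        "mean_seconds_completed": mean(r.seconds for r in completed),
        "total_errors": sum(r.errors for r in records),
    }


summary = summarise_metrics(records)
print(summary)
```

Reporting mean time only for completed attempts is a design choice: including abandoned attempts would mix two different outcomes into one number.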
Measures (3): Metrics
• Relatively simple to present and analyse this type of data, but recording requires preparation
• However, metrics do not provide sufficient insights if they are the sole measure – they can only tell you what happened, not why
• Good way of backing up questionnaire results and observations – especially if findings from questionnaires and observations contradict
Exercise:
Choosing the right (amount of) measures – imagine you are the CEO of a company and need to pay for whatever you decide to add.
1. Online shop that sells groceries.
2. New Airbus cockpit.
3. University website.
4. Gaze control for wheelchair.
5. Dating app.
Questionnaires, observations and metrics are instruments that can help you quantify how users interact with the systems that you build.
By making the interaction process countable, you can gain valuable insights into the experience of average users.
Be aware of limitations that are introduced when phenomena are operationalized…
(We create abstract models of reality.)
…but understand commercial and academic value of quantitative evaluations.
(Think big data and metrics!)
Part 2: Data Types
Data Types
• Collecting, presenting and analysing quantitative data are closely linked, interdependent tasks
• The type of data you collect will determine how you can represent it
• There are distinct types of quantitative data:
1. Categorical
2. Ordinal
3. Continuous
Data Types
• Types of data go along with kinds of questions that you asked:
– Dichotomous questions and lists – categorical data
– Ranking questions – ordinal data
– Semantic differential and Likert scales – usually treated as continuous data
• Depending on type of data, different mathematical operations can be applied (more in week 11)
Categorical Data
• Data is categorical if values can be sorted according to category, and each value is chosen from a set of non-overlapping categories
• For example, a man can have the characteristic of ‘father’ with categories ‘yes’ and ‘no’
• Cars can be sorted by colour into discrete categories: ‘red’, ‘black’, ‘white’, ‘yellow’
• Important: No ranking implied!
Ordinal Data
• Data is considered ordinal if values can be ranked (put in order) naturally, or have a rating scale attached
• Examples: your top five pubs in Lincoln
• Careful: numbers don’t always imply rank, e.g., in football
Continuous Data
• Data is continuous if the values may take on any value within a finite or infinite interval
• You can count, order and measure continuous data
• For example height, temperature, or the time required to run a mile
• Many analysis approaches require this level of data quality – be careful when you choose statistical tests etc. (more next week)
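The three data types differ in which summaries are meaningful: categories can only be counted (mode), ordered values additionally have a middle (median), and continuous values also support arithmetic (mean). A minimal sketch of that rule; the function name and example data are hypothetical:

```python
from statistics import mean, median, median_low, mode


def describe(values: list, data_type: str) -> dict:
    """Return only the summaries that are meaningful for the given data type."""
    if data_type == "categorical":
        return {"mode": mode(values)}
    if data_type == "ordinal":
        # median_low always returns a value that was actually observed.
        return {"mode": mode(values), "median": median_low(values)}
    if data_type == "continuous":
        return {"median": median(values), "mean": mean(values)}
    raise ValueError(f"unknown data type: {data_type}")


car_colours = ["red", "black", "red", "white"]   # categorical
pub_ranks = [1, 2, 2, 3, 4]                      # ordinal
mile_times = [6.0, 7.0, 8.0]                     # continuous, in minutes

print(describe(car_colours, "categorical"))
print(describe(pub_ranks, "ordinal"))
print(describe(mile_times, "continuous"))
```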
Exercise:
Can you identify the correct type of data?
1. License plates of cars.
2. Your weight.
3. Grades (1st, 2:1, 2:2, Pass).
4. People’s gender.
5. Task completion time.
Quantitative evaluation in HCI combines observation with mathematical analysis.
Remember:
Collecting, presenting and analysing quantitative data are linked, interdependent tasks.
Next week we’ll talk more about analysis approaches.