COMP3074-HAI Lecture 15, ASR

Automatic Speech Recognition

Human-AI Interaction

Lecture 15

§Key features of ASR

§ Technical Challenges of ASR

This lecture

Part 1. Key features of ASR

§AKA machine transcription, Speech To Text (STT)
§Recognition and transcription of spoken language into text
§Crucial component for turning a chatbot into a voice user interface (VUI)
§Popular engines

§Closed source: Google Cloud Speech API, IBM Watson Speech to Text, Amazon Transcribe, Alexa Skills Kit

§Open source: Kaldi, VOSK, Sphinx
§Python libraries

§SpeechRecognition, google-cloud-speech, watson-developer-cloud, etc.

§If you had to choose between a closed-source and an open-source ASR engine, which would you choose and why? What trade-offs does each choice involve?

§Popular engines
§Closed source: Google Cloud Speech API, IBM Watson Speech to Text, Amazon Transcribe, Alexa Skills Kit
§Open source: Kaldi, VOSK, Sphinx

Think, Discuss, Share – 3 mins

§Key considerations
§Robustness of dataset / accuracy

§Size matters: large companies with huge datasets (e.g., Google) have an advantage

§Endpoint detection performance
§How the computer knows when you begin and finish speaking

§Advanced features may be desirable, but not all engines have them
§N-best lists, settable parameters such as end-of-speech timeouts, and customized vocabularies

Choosing an ASR engine

§Barge-in detection
§Detecting when the user starts speaking

§ Timeouts
§End-of-speech timeout
§No speech timeout
§ Too much speech

Key ASR features

§Allowing users to interrupt while the system is generating output (‘talking’)
§Option 1. Immediately stop when detecting speech

§Makes sense in IVR (telephone systems)

§Avoids long menus or lists of options
§BUT a lot can go wrong

Barge-in detection

BANKING IVR
You can transfer money, check your account balance, pay a…
USER
[interrupting] Check my account balance

§Barge-in gone wrong

§A question followed by silence encourages the user to speak ‘prematurely’
§Better to ask the question last
Barge-ins cont’d

VUI SYSTEM
What would you like to do? [1-second silence] You…
USER
I would…
VUI SYSTEM
[system continues] can. [then stops because user has barged in]

VUI SYSTEM
You can check your balance, transfer funds, or speak to an agent. What would you like to do?

§Option 2. Stop when detecting a keyword
§The wake word, e.g., “Alexa stop”, or app-specific keywords, e.g., “next” to skip

§Makes sense for long-running actions typical on smart speakers
§Playing the radio, a song, an audiobook, etc.
§But don’t use the availability of barge-in as an excuse for overly long prompts; remember to be concise, because speech is ephemeral

Barge-ins cont’d
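
The two barge-in options can be compared in a minimal Python sketch. It assumes the ASR engine hands back a partial transcript while the system is still talking. The function names, the mode names, and the keyword list are hypothetical and do not belong to any particular engine's API.

```python
import re

# Hypothetical keywords: a wake-word command plus one app-specific keyword.
STOP_KEYWORDS = ("alexa stop", "next")


def normalise(transcript: str) -> str:
    """Lower-case and drop punctuation so 'Alexa, stop!' matches 'alexa stop'."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", transcript.lower())
    return " ".join(cleaned.split())


def should_stop_output(transcript: str, mode: str, keywords=STOP_KEYWORDS) -> bool:
    """Decide whether the system should stop talking.

    mode "any_speech": Option 1, stop as soon as any speech is detected.
    mode "keyword":    Option 2, stop only when a known keyword is heard.
    """
    heard = normalise(transcript)
    if not heard:
        return False  # nothing recognised, keep talking
    if mode == "any_speech":
        return True
    if mode == "keyword":
        padded = f" {heard} "
        # Whole-phrase match, so "next" does not fire on "nextdoor".
        return any(f" {kw} " in padded for kw in keywords)
    raise ValueError(f"unknown barge-in mode: {mode}")
```

In this sketch the keyword mode ignores ordinary speech such as "I like this song", which is what makes it suitable for long-running playback.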

§AKA endpoint detection
§Detecting when the user stops speaking
§Some ASR engines let you (the developer) adjust this

§1.5 seconds is a rule of thumb
§Can be shorter for user-initiated interaction and for prompts that ask for a simple yes/no response

§May need to be longer for prompts that require the user to think for longer

End-of-speech timeout
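
A minimal Python sketch of an adjustable end-of-speech timeout. It assumes the audio has already been split into fixed-length frames, each flagged as speech or silence by some voice activity detector. Only the 1.5 s default comes from the rule of thumb above; the prompt categories and the shorter and longer values are illustrative assumptions.

```python
DEFAULT_TIMEOUT_MS = 1500  # rule of thumb from the slide

# Assumed values, for illustration only.
TIMEOUT_BY_PROMPT_MS = {
    "yes_no": 1000,          # simple yes/no answers
    "user_initiated": 1000,  # the user started the interaction
    "open_ended": 2500,      # the user may need time to think
}


def end_of_speech_timeout_ms(prompt_kind: str) -> int:
    """Look up the timeout for a prompt kind, falling back to the default."""
    return TIMEOUT_BY_PROMPT_MS.get(prompt_kind, DEFAULT_TIMEOUT_MS)


def find_endpoint(is_speech, frame_ms: int, timeout_ms: int):
    """Return the index of the frame where end of speech is declared.

    The endpoint is declared once the silence that follows some speech has
    lasted at least timeout_ms. Returns None if that never happens.
    """
    heard_speech = False
    silence_ms = 0
    for index, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= timeout_ms:
                return index
    return None
```

A short mid-sentence pause resets the timer, so only a pause at least as long as the timeout ends the turn.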

§No speech detected
§Usually longer than the end-of-speech timeout, around 10 seconds
§Results in different actions by the VUI

§“Do nothing” is probably the most common action on smart speakers and phones
§A reprompt may be needed
§Helpful for system analysis, because it shows where there are problems
§Could be down to accidental triggering or to the user struggling to find the right response to the prompt

No speech timeout

VUI
What’s your account number?
VUI
Sorry, I didn’t get that. Your account number can be found at the top of your statement. Please say or type it in, or say, “I don’t know.”
USER
I don’t know it.
VUI
No problem. We can look it up with your phone number and address instead…
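
A minimal Python sketch of handling a no-speech timeout: log the event for later analysis, then either reprompt or do nothing. The prompt identifier, the function name, and the reprompt table are hypothetical. The reprompt wording is shortened from the dialogue above.

```python
from collections import Counter

# Hypothetical table: prompt id -> reprompts to try, in order.
REPROMPTS = {
    "account_number": [
        "Sorry, I didn't get that. Your account number can be found "
        "at the top of your statement.",
    ],
}


def on_no_speech(prompt_id: str, attempt: int, log: Counter, reprompts=REPROMPTS):
    """Handle one no-speech timeout for the given prompt.

    attempt is 0 for the first timeout on this prompt, 1 for the second, etc.
    Returns the reprompt text, or None when the VUI should do nothing.
    """
    log[prompt_id] += 1  # counts per prompt show where users get stuck
    options = reprompts.get(prompt_id, [])
    if attempt < len(options):
        return options[attempt]
    return None
```

Prompts with unusually high counts in the log are candidates for rewording, or may point to accidental triggering.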

§Triggered when the user talks for a very long time without pauses
§Rare, because people don’t normally speak like that
§May be useful in applications/skills in which users are encouraged to talk for a long time

Too much speech
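
The three timeouts can be viewed as one small state machine. The Python sketch below assumes fixed-length frames flagged as speech or silence. The 10 s and 1.5 s values follow the figures given in this lecture; the 60 s cap on speech length is an assumption, as are all the names.

```python
from enum import Enum


class TurnOutcome(Enum):
    NO_SPEECH = "no_speech"
    END_OF_SPEECH = "end_of_speech"
    TOO_MUCH_SPEECH = "too_much_speech"
    STILL_LISTENING = "still_listening"


def classify_turn(is_speech, frame_ms, no_speech_ms=10_000,
                  end_of_speech_ms=1_500, max_speech_ms=60_000):
    """Walk through the frames and report which timeout fires first."""
    elapsed_ms = 0       # time since listening began
    since_start_ms = 0   # time since the user first spoke
    silence_ms = 0       # length of the current pause
    started = False
    for speech in is_speech:
        elapsed_ms += frame_ms
        if speech:
            started = True
            silence_ms = 0
        elif started:
            silence_ms += frame_ms
        if started:
            since_start_ms += frame_ms
            if silence_ms >= end_of_speech_ms:
                return TurnOutcome.END_OF_SPEECH
            if since_start_ms >= max_speech_ms:
                return TurnOutcome.TOO_MUCH_SPEECH
        elif elapsed_ms >= no_speech_ms:
            return TurnOutcome.NO_SPEECH
    return TurnOutcome.STILL_LISTENING
```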

§ List with the N most likely queries the user might have said

§What strategy would you implement for your VUI to pick from the N-best list?

ACTUAL DIALOG

MY FAVORITE ANIMAL VUI
So, I really want to know more about what animals you love. What’s your favorite?
USER
Well, I think at the moment my favorite’s gotta be…kitty cats!

5-BEST LIST returned by the ASR (ordered by confidence)

1. WELL I THINK AT THE MOMENT MY FAVORITES GOT TO BE FIT AND FAT

2. WELL I THINK AT THE MOMENT BY FAVORITES GOTTA BE KITTY CATS

3. WELL I HAVE AT THE MOMENT MY FAN IS OF THE

4. WELL I HAVE AT THE MOMENT MY FAN IS OF THE

5. WELL THAT THE MOMENT MY FAVORITE IS GOT TO BE KIT AND CAT

§ List with the N most likely queries the user might have said

§As your app expects an animal name, it can look for one in the N-best list rather than simply taking the hypothesis with the highest confidence
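
A minimal Python sketch of this strategy, using the 5-best list from the example dialogue (entries 3 and 4 are reproduced as they appear on the slide). The animal vocabulary and the function name are hypothetical.

```python
# Hypothetical vocabulary of words the app expects to hear.
ANIMAL_WORDS = {"cat", "cats", "kitty", "dog", "dogs"}

# 5-best list from the example dialogue, highest confidence first.
N_BEST = [
    "WELL I THINK AT THE MOMENT MY FAVORITES GOT TO BE FIT AND FAT",
    "WELL I THINK AT THE MOMENT BY FAVORITES GOTTA BE KITTY CATS",
    "WELL I HAVE AT THE MOMENT MY FAN IS OF THE",
    "WELL I HAVE AT THE MOMENT MY FAN IS OF THE",
    "WELL THAT THE MOMENT MY FAVORITE IS GOT TO BE KIT AND CAT",
]


def pick_by_vocabulary(n_best, vocabulary):
    """Return the first hypothesis containing an expected word.

    Hypotheses are scanned in confidence order, so ties go to the more
    confident one. Returns (hypothesis, matched_words), or None when no
    hypothesis matches and the app should reprompt instead.
    """
    for hypothesis in n_best:
        matches = [w for w in hypothesis.lower().split() if w in vocabulary]
        if matches:
            return hypothesis, matches
    return None
```

Here the top hypothesis contains no animal word, so the sketch skips it and selects the second entry.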

§Also helps with course correction
§Avoids suggesting the same incorrect option over and over
§Should set a “reject” flag when the user says “No…” to a prompt
§Move to the next item in the list

N-best lists cont’d

TRAVEL VUI
What city are you starting from?
TRAVEL VUI
Was that Austin?
USER
No, Boston.
TRAVEL VUI
Was that Austin?
USER
No, Boston!
TRAVEL VUI
…Austin?
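
A minimal Python sketch of the “reject” flag, which would avoid the loop in the dialogue above. The N-best list `["Austin", "Boston"]` is assumed for illustration, since the slide does not show what the ASR returned. The function names are hypothetical.

```python
def next_candidate(n_best, rejected):
    """Return the most confident hypothesis the user has not rejected yet."""
    for hypothesis in n_best:
        if hypothesis.lower() not in rejected:
            return hypothesis
    return None


def confirm_city(n_best, user_accepts):
    """Offer candidates in confidence order, never repeating a rejected one.

    user_accepts is a callable that stands in for asking "Was that X?".
    Returns the accepted city, or None when the list is exhausted.
    """
    rejected = set()
    while True:
        candidate = next_candidate(n_best, rejected)
        if candidate is None:
            return None  # list exhausted, ask the question again
        if user_accepts(candidate):
            return candidate
        rejected.add(candidate.lower())  # set the "reject" flag
```

With the assumed list, a “No” to Austin moves the VUI on to Boston instead of offering Austin again.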

Part 2. Technical Challenges of ASR

§Where ASR technology still struggles
§Multiple speakers
§Names, spelling, and alphanumeric
§Data privacy

Technical challenges of ASR

Background noise, multiple speakers, music, dogs etc.
§Side speech

§When the user addresses another person in the middle of the VUI interaction

§There is not a lot you (a developer using an ASR engine) can do about the noise itself
§ASR training datasets can be enriched with noisy examples, but this is a job for those developing the ASR engine
§What you can do is reprompt, and perhaps hint that it may be too noisy to understand (if the ASR returns “noise” or low confidence)
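
A minimal Python sketch of this reprompt strategy. It assumes the engine returns a transcript plus a confidence score between 0 and 1, and signals noise either with no text or with a literal "noise" marker. The marker, the 0.5 threshold, the prompt wording, and the function name are all assumptions.

```python
from typing import Optional, Tuple

NOISE_MARKER = "noise"  # assumed marker; engines differ in how they report noise
MIN_CONFIDENCE = 0.5    # assumed threshold; tune it for your engine

NOISY_HINT = "Sorry, it seems noisy where you are. Could you say that again?"
REPEAT_HINT = "Sorry, I didn't catch that. Could you repeat it?"


def next_action(text: Optional[str], confidence: float) -> Tuple[str, str]:
    """Return ("accept", transcript) or ("reprompt", message)."""
    if text is None or text.strip().lower() == NOISE_MARKER:
        return ("reprompt", NOISY_HINT)
    if confidence < MIN_CONFIDENCE:
        return ("reprompt", REPEAT_HINT)
    return ("accept", text)
```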

§Multiple devices triggered by the same phrase
§E.g., a wake word said in an office triggers everyone’s phone
§Nowadays, when you first activate your phone, it asks you to say the phrase so that the ASR can be trained to your voice only (but it still goes wrong for similar-sounding voices)

§Multiple speakers talking to the same device
§ASR is not well equipped to handle overlapping talk (overtalking), people finishing each other’s sentences, etc.
§Which device should respond?

§For example, when I say ‘Hey Siri’ it seems random whether my iPhone or my iPad responds, and I can’t control which one…

Multiple speakers / devices

§ASR struggles with children’s voices
§Shorter vocal cords mean higher-pitched voices, for which there is less training data, so accuracy is lower

§Children’s talk: young children are also more likely to meander, stutter, pause for a long time, repeat themselves, etc.

§When designing for children specifically
§Confidence in the transcription is lower, so don’t base progress on recognition/intent matching
§When a response is required, ask for a simple yes/no response or provide a graphical alternative, like an image they can point to

HELLO BARBIE
What would you like to be when
you grow up?

HELLO BARBIE
Sounds good. I want to be a
space horticulturalist!

§Context can help with names, e.g., a “latest album” concept associated with artist names

§Known data can help, e.g.,
§Credit card checksums, registered user names, post codes, cities closest to the current location, etc.

Names, spelling, and alphanumeric
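
As one instance of using known data, a credit card checksum can filter an N-best list of digit strings. The Python sketch below uses the Luhn algorithm, the check-digit scheme used for payment card numbers. The function names and the example hypotheses are illustrative, and the numbers are test values rather than real cards.

```python
def luhn_valid(number: str) -> bool:
    """Check a digit string against the Luhn checksum, ignoring spaces."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def first_valid_card(n_best):
    """Return the most confident hypothesis that passes the checksum."""
    for hypothesis in n_best:
        if luhn_valid(hypothesis):
            return hypothesis
    return None
```

If the top hypothesis misrecognises a single digit, the checksum fails and the next hypothesis is tried.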

§Don’t store/upload anything said before the wake word
§Don’t store data (e.g., transcriptions of user queries) longer than necessary

§Allow users to control what is kept and for how long
§Anonymise data before storing it (e.g., strip out user information)
§Store data securely
§Adhere to relevant data protection laws, e.g.,

§GDPR (EU General Data Protection Regulation)

Data privacy
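
A minimal Python sketch of two of these points: stripping user information before storage and deleting records once a retention period has passed. The record layout, the names, and the 30-day period in the usage note are hypothetical. Removing the user identifier is only a first step, because the transcript itself may still contain personal details.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class StoredQuery:
    user_id: Optional[str]  # None once the record has been anonymised
    text: str               # transcription of the user query
    stored_at: datetime


def anonymise(record: StoredQuery) -> StoredQuery:
    """Return a copy of the record without the user identifier."""
    return StoredQuery(user_id=None, text=record.text, stored_at=record.stored_at)


def purge_expired(records: List[StoredQuery], now: datetime,
                  retention: timedelta) -> List[StoredQuery]:
    """Keep only records that are no older than the retention period."""
    return [r for r in records if now - r.stored_at <= retention]
```

A retention value such as `timedelta(days=30)` could come from a per-user setting, which would give users control over how long their data is kept.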

§ GDPR (EU General Data Protection Regulation)
§ Regulates how people can access information about them and limits what organisations can do with personal data
§ 7 principles

§ Lawfulness, fairness and transparency
§ Purpose limitation
§ Data minimization
§ Accuracy
§ Storage limitation
§ Integrity and confidentiality (security)
§ Accountability

§ Right to be Forgotten (EU)

§ 1. Head to https://creator.voiceflow.com/signup/promo
§ 2. Create an account using your

account and input coupon code: EDU_2022
§ 3. Done!

Set up your Voiceflow account

https://creator.voiceflow.com/signup/promo

Set up your Voiceflow

Starting Voiceflow in
labs this week

CW1 – new deadline is
