HAI-Lecture15
Automatic Speech Recognition
Human-AI Interaction
Lecture 15
§Key features of ASR
§ Technical Challenges of ASR
This lecture
COMP3074-HAI Lecture 15, ASR
Part 1. Key features of ASR
§AKA machine transcription, Speech To Text (STT)
§Recognition and transcription of spoken language into text
§Crucial component to turn a chatbot into a VUI
§Popular engines
§Closed source: Google Cloud Speech API, IBM Watson Speech to Text, Amazon Transcribe, Alexa Skills Kit
§Open source: Kaldi, VOSK, Sphinx
§Python libraries
§SpeechRecognition, google-cloud-speech, watson-developer-cloud, etc.
§ If you had to choose between a closed source and open source ASR
engine, which would you choose and why? Can you think of some of
the trade-offs either choice has?
Think, Discuss, Share – 3 mins
§Key considerations
§Robustness of dataset / accuracy
§SIZE matters – hence why large companies with HUGE
datasets (e.g., Google) have an advantage
§Endpoint detection performance
§ how the computer knows when you begin and finish speaking
§Advanced features may be desirable, not all engines have them
§N-best lists, settable parameters like end-of-speech time-outs,
and customized vocabularies
Choosing an ASR engine
§Barge-in detection
§Detecting when the user starts speaking
§ Timeouts
§End-of-speech timeout
§No speech timeout
§ Too much speech
Key ASR features
§Allowing users to interrupt while system generates output (‘talking’)
§Option 1. Immediately stop when detecting speech
§Makes sense in IVR (telephone systems)
§Avoids long menus or lists of options
§BUT a lot can go wrong
Barge-in detection
BANKING IVR
You can transfer money, check your account balance, pay a…
USER
[interrupting] Check my account balance
§Barge-in gone wrong
§Asking the question and then pausing has encouraged the user to start speaking ‘prematurely’
§Better to ask question last
Barge-ins cont’d
VUI SYSTEM
What would you like to do? [1-second silence] You…
USER
I would…
VUI SYSTEM
[system continues] can. [then stops because user has barged in]
VUI SYSTEM
You can check your balance, transfer funds, or speak to an agent. What would you like to do?
§Option 2. Stop when detecting a keyword
§ The wake word, e.g., “Alexa stop”, or app-specific keywords, e.g., “next” to skip
§Makes sense for long-running actions typical on smart speakers
§ E.g., playing the radio, a song, or an audiobook
§But don’t use the availability of barge-in as an excuse for overly long prompts; be concise, because speech is ephemeral
Barge-ins cont’d
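The two barge-in options can be sketched as a small decision function. This is a minimal illustration, not how any particular engine implements barge-in: `BargeInMode`, `should_stop_playback`, and the default keyword list are hypothetical names, and the sketch assumes the engine reports whether speech was detected plus a partial transcript without punctuation.

```python
from enum import Enum


class BargeInMode(Enum):
    ANY_SPEECH = "any_speech"  # Option 1: stop as soon as speech is detected
    KEYWORD = "keyword"        # Option 2: stop only on a wake word or keyword


def should_stop_playback(mode, speech_detected, partial_transcript="",
                         keywords=("alexa stop", "next")):
    """Decide whether the system should stop talking (hypothetical helper)."""
    if not speech_detected:
        return False
    if mode is BargeInMode.ANY_SPEECH:
        return True
    # Keyword mode: ignore ordinary speech and react only to known phrases.
    # Padding with spaces makes the match word-based, so "nextdoor" does not
    # count as "next".
    text = " ".join(partial_transcript.lower().split())
    padded = f" {text} "
    return any(f" {keyword} " in padded for keyword in keywords)
```

An IVR menu would typically use `ANY_SPEECH`, while a smart speaker playing music would use `KEYWORD` so that background conversation does not interrupt playback.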
§AKA endpoint detection
§Detecting when the user stops speaking
§Some ASR engines let you (the developer) adjust this
§ Rule of thumb: 1.5 seconds
§Can be shorter for
§ User-initiated interaction
§ Prompts that ask for a simple yes/no response
§May need to be longer for prompts that require the user to think for longer
End-of-speech timeout
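One common way to implement an end-of-speech timeout is to count trailing silence once the user has started speaking. The following is a minimal sketch under stated assumptions: the audio has already been split into fixed-length frames and labelled speech or non-speech by some voice activity detector (not shown), and `find_end_of_speech` with its parameters is a hypothetical helper, using the 1.5-second rule of thumb as its default.

```python
def find_end_of_speech(frames, frame_ms=100, end_of_speech_ms=1500):
    """Return the time in ms at which the utterance is considered finished.

    `frames` is a sequence of booleans, one per audio frame (True = speech).
    Returns None if the silence after speech never reaches the timeout.
    """
    silence_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= end_of_speech_ms:
                return (index + 1) * frame_ms
    return None
```

Lowering `end_of_speech_ms` makes the system feel more responsive for yes/no prompts, at the cost of cutting off users who pause mid-sentence.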
§No speech detected
§ Longer than the end-of-speech timeout, usually around 10 seconds
§Results in different actions by the VUI
§ “Do nothing” seems most common on smart speakers and phones
§Reprompt may be needed
§Helpful for system analysis: shows where there are problems
§Could be down to accidental triggering, or to the user having trouble finding the right response to the prompt
No speech timeout
VUI SYSTEM
What’s your account number?
USER
[no speech]
VUI SYSTEM
Sorry, I didn’t get that. Your account number can be found at the top of your statement. Please say or type it in, or say, “I don’t know it.”
USER
I don’t know it.
VUI SYSTEM
No problem. We can look it up with your phone number and address instead…
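The no-speech handling in the dialog above can be sketched as a lookup of per-prompt reprompts, with "do nothing" as the fallback. This is an illustrative sketch only: `REPROMPTS`, `on_no_speech`, and the prompt identifiers are hypothetical, and a real VUI platform would normally provide its own event for no-speech timeouts.

```python
# Hypothetical table of reprompts, keyed by the prompt that timed out.
REPROMPTS = {
    "account_number": [
        "Sorry, I didn't get that. Your account number can be found at the "
        "top of your statement.",
    ],
}


def on_no_speech(prompt_id, attempt, timeout_log):
    """Pick the next action after a no-speech timeout.

    `attempt` is 0 for the first timeout on this prompt, 1 for the second, etc.
    """
    # Log every timeout so later analysis can show which prompts cause trouble.
    timeout_log.append(prompt_id)
    reprompts = REPROMPTS.get(prompt_id, [])
    if attempt < len(reprompts):
        return ("reprompt", reprompts[attempt])
    # No reprompt defined or none left: stay silent rather than nag the user.
    return ("do_nothing", None)
```

Counting entries in `timeout_log` per prompt gives a simple view of where users get stuck or where the device was triggered accidentally.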
§ Triggered when the user talks for a very long time without pauses
§Rare, people don’t normally speak like that
§May be useful in applications/skills in which users are encouraged to
talk for a long time
Too much speech
§ List of the N most likely utterances the user might have said
§ What strategy would you implement for your VUI to pick from the N-best list?
ACTUAL DIALOG
MY FAVORITE ANIMAL VUI
So, I really want to know more about what animals you love. What’s your favorite?
USER
Well, I think at the moment my favorite’s gotta be…kitty cats!
5-BEST LIST returned by the ASR (ordered by confidence)
1. WELL I THINK AT THE MOMENT MY FAVORITES GOT TO BE FIT AND FAT
2. WELL I THINK AT THE MOMENT BY FAVORITES GOTTA BE KITTY CATS
3. WELL I HAVE AT THE MOMENT MY FAN IS OF THE
4. WELL I HAVE AT THE MOMENT MY FAN IS OF THE
5. WELL THAT THE MOMENT MY FAVORITE IS GOT TO BE KIT AND CAT
§ As your app expects an animal name, it can look for one in the N-best list, rather than simply taking the hypothesis with the highest confidence level
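The strategy of looking for an expected animal name in the N-best list can be sketched as follows. This is a minimal illustration: `KNOWN_ANIMALS` is a hypothetical vocabulary, `pick_from_n_best` is a hypothetical helper, and the hypotheses are copied from the slide (items 3 and 4 are cut off there and are kept as they appear).

```python
# Hypothetical vocabulary of animal names the app expects.
KNOWN_ANIMALS = ("kitty cats", "cats", "cat", "dogs", "dog", "horses", "horse")

# 5-best list from the slide, ordered by confidence.
N_BEST = [
    "WELL I THINK AT THE MOMENT MY FAVORITES GOT TO BE FIT AND FAT",
    "WELL I THINK AT THE MOMENT BY FAVORITES GOTTA BE KITTY CATS",
    "WELL I HAVE AT THE MOMENT MY FAN IS OF THE",  # cut off in the slide
    "WELL I HAVE AT THE MOMENT MY FAN IS OF THE",  # cut off in the slide
    "WELL THAT THE MOMENT MY FAVORITE IS GOT TO BE KIT AND CAT",
]


def pick_from_n_best(n_best, vocabulary):
    """Return (hypothesis, matched_phrase) for the best-ranked hypothesis
    that contains an expected phrase; fall back to the top hypothesis."""
    # Try longer phrases first so "kitty cats" wins over "cats".
    phrases = sorted(vocabulary, key=len, reverse=True)
    for hypothesis in n_best:
        padded = f" {hypothesis.lower()} "
        for phrase in phrases:
            if f" {phrase} " in padded:
                return hypothesis, phrase
    return (n_best[0], None) if n_best else (None, None)


hypothesis, animal = pick_from_n_best(N_BEST, KNOWN_ANIMALS)
print(animal)
```

Here the top hypothesis contains no animal name, so the function skips it and returns the second hypothesis with the match "kitty cats".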
§N-best lists also help with course correction
§Avoids suggesting the same incorrect option over and over
§Should set a “reject” flag when the user says “No…” to a prompt
§Move to the next item in the list
N-best lists cont’d
TRAVEL VUI
What city are you starting from?
TRAVEL VUI
Was that Austin?
No, Boston.
TRAVEL VUI
Was that Austin?
No, Boston!
TRAVEL VUI
….Austin?
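The "reject" flag idea can be sketched by remembering which hypotheses the user has already turned down and skipping them on the next attempt. This is a minimal sketch: `next_candidate` is a hypothetical helper and the N-best list is invented for illustration, since the slide does not show what the ASR actually returned.

```python
def next_candidate(n_best, rejected):
    """Return the highest-ranked hypothesis that has not been rejected."""
    for hypothesis in n_best:
        if hypothesis not in rejected:
            return hypothesis
    return None  # everything was rejected: time to ask in a different way


n_best = ["Austin", "Boston", "Houston"]  # hypothetical ASR output
rejected = set()

first = next_candidate(n_best, rejected)   # system asks "Was that Austin?"
rejected.add(first)                        # user says "No, Boston."
second = next_candidate(n_best, rejected)  # system now asks about Boston
```

The `rejected` set should persist across turns for the same question, so that a fresh N-best list with "Austin" on top again does not lead to the same wrong confirmation.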
Part 2. Technical Challenges of ASR
§Where ASR technology still struggles
§Multiple speakers
§Names, spelling, and alphanumeric
§Data privacy
Technical challenges of ASR
Background noise, multiple speakers, music, dogs etc.
§Side speech
§When the user addresses another person while talking to the VUI
§Not a lot you (a developer using an ASR engine) can do about the noise itself
§ASR training datasets can be enriched with noisy examples, but this is a job for those developing the ASR engine
§What you can do is reprompt, and perhaps hint that it may be too noisy to understand (if the ASR returns “noise” or low confidence)
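Reprompting on noise or low confidence can be sketched as a simple threshold check. This is an illustration under stated assumptions: the engine returns a transcript and a confidence score between 0 and 1, a transcript of "noise" marks unintelligible audio, and both `choose_action` and the 0.5 threshold are hypothetical choices that would need tuning per engine.

```python
def choose_action(transcript, confidence, min_confidence=0.5):
    """Decide what the VUI does with an ASR result.

    Returns ("accept", transcript) or ("reprompt", message).
    """
    heard_nothing_useful = (
        transcript is None or transcript.strip().lower() in ("", "noise")
    )
    if heard_nothing_useful or confidence < min_confidence:
        return ("reprompt",
                "Sorry, it may be too noisy for me to understand. "
                "Could you say that again?")
    return ("accept", transcript)
```

Keeping the hint in the reprompt tells the user why recognition failed, which gives them a chance to move somewhere quieter or speak closer to the device.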
§Multiple devices triggered by the same phrase
§E.g., wake word said in an office triggers everyone’s phone
§Phones now ask you to say the phrase during first-time setup so that the ASR can be trained on your voice only (but it still goes wrong for similar-sounding voices)
§Multiple speakers talking to the same device
§ASR not well equipped to handle overlapping talk (overtalking), people finishing each other’s sentences, etc.
§Which device should respond?
§ For example, when I say ‘Hey Siri’ it seems random whether my
iPhone or my iPad responds, and I can’t control which one…
Multiple speakers / devices
§ ASR struggles with children’s voices
§ Shorter vocal cords mean higher-pitched voices, for which there is less training data, so accuracy is lower
§ Children’s speech: young children are also more likely to meander, stutter, pause for a long time, and repeat themselves
§ When designing for children specifically
§ Confidence in the transcription is lower, so don’t make progress depend on recognition/intent matching
§ When a response is required, ask for a simple yes/no response or provide a graphical alternative, like an image they can point to
HELLO BARBIE
What would you like to be when
you grow up?
HELLO BARBIE
Sounds good. I want to be a
space horticulturalist!
§Context can help with names, e.g., a “latest album” concept associated with artist names
§Known data can help, e.g.,
§Credit card checksums, registered user names, post codes, cities closest to the current location, etc.
Names, spelling, and alphanumeric
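As an example of known data helping with alphanumerics, card numbers carry a Luhn checksum, so hypotheses that fail the check can be discarded. This is a minimal sketch: `luhn_valid` implements the standard Luhn check, while `pick_valid_card_number` and the sample hypotheses are hypothetical (the valid one is a widely used test number, not a real card).

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def pick_valid_card_number(n_best):
    """Return the best-ranked hypothesis that passes the checksum, if any."""
    for hypothesis in n_best:
        if luhn_valid(hypothesis):
            return hypothesis
    return None


# Hypothetical N-best list: the top hypothesis has one misheard digit.
candidates = ["4111111111111112", "4111111111111111"]
print(pick_valid_card_number(candidates))
```

The same pattern applies to the other known data on the slide: filter the N-best list against registered user names, valid post codes, or nearby cities before falling back to the top hypothesis.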
§Don’t store/upload anything said before the wake word
§Don’t store data (e.g., transcriptions of user queries) longer than necessary
§Allow users to control what is kept, and for how long
§Anonymise data before storing it (e.g., strip out user information)
§Store data securely
§Adhere to relevant data protection laws, e.g.,
§GDPR (EU General Data Protection Regulation)
Data privacy
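Two of the points above, anonymising before storing and limiting how long data is kept, can be sketched in a few lines. This is a deliberately crude illustration and not a compliance solution: the regular expressions, the placeholder tokens, the record layout, and the function names are all hypothetical, and real anonymisation needs far more than pattern matching.

```python
import re
from datetime import datetime, timedelta, timezone

# Very rough patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_NUMBER = re.compile(r"\b\d{6,}\b")  # e.g. account or phone numbers


def anonymise(transcript):
    """Replace obvious identifiers in a transcript before it is stored."""
    transcript = EMAIL.sub("[EMAIL]", transcript)
    return LONG_NUMBER.sub("[NUMBER]", transcript)


def purge_expired(records, retention_days, now):
    """Keep only records stored within the retention period.

    Each record is a dict with a timezone-aware "stored_at" datetime.
    """
    cutoff = now - timedelta(days=retention_days)
    return [record for record in records if record["stored_at"] >= cutoff]
```

Making `retention_days` a per-user setting is one way to let users control how long their data is kept.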
§ GDPR (EU General Data Protection Regulation)
§ Regulates how people can access information about them and limits what organisations can do with personal data
§ 7 principles
§ Lawfulness, fairness, and transparency
§ Purpose limitation
§ Data minimization
§ Accuracy
§ Storage limitation
§ Integrity and confidentiality (security)
§ Accountability
§ Right to be Forgotten (EU)
§ 1. Head to https://creator.voiceflow.com/signup/promo
§ 2. Create an account using your
account and input coupon code: EDU_2022
§ 3. Done!
Set up your Voiceflow account
Set up your Voiceflow
Starting Voiceflow in
labs this week
CW1 – new deadline is