
AUDIO COMPRESSION

TOPICS TO BE COVERED
Introduction
• Characteristics of Sound and Audio Signals
• Applications
• Waveform Sampling and Compression
Types of Sound Compression techniques
• Sound is a Waveform – generic schemes
• Sound is Perceived – psychoacoustics
• Sound is Produced – sound sources
• Sound is Performed – structured audio
Sound Synthesis
Audio standards
• ITU – G.711, G.722, G.727, G.723, G.728
• ISO – MPEG1, MPEG2, MPEG4

INTRODUCTION
Physics Introduction – Sound is a Waveform
Recording instruments convert it to an electrical waveform signal, which is then sampled and quantized to get a digital signal
Quantization introduces error! – Listen to 16-, 12-, 8- and 4-bit music and hear the difference (a requantization sketch follows below).
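
To hear this effect in code, here is a minimal requantization sketch (assuming 16-bit PCM samples in a NumPy array; the test tone is synthetic): it keeps only the top b bits of each sample.

```python
import numpy as np

def requantize(samples_16bit, bits):
    """Requantize 16-bit PCM samples to `bits` bits of resolution.

    Zeroes out the low-order (16 - bits) bits, which is equivalent to
    using a coarser uniform quantizer; the discarded detail becomes
    audible quantization noise as `bits` shrinks.
    """
    shift = 16 - bits
    return (samples_16bit.astype(np.int32) >> shift) << shift

# Example with a synthetic 440 Hz tone sampled at 8 KHz:
t = np.arange(8000) / 8000.0
tone = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
for b in (16, 12, 8, 4):
    noisy = requantize(tone, b)
    err = np.mean((tone - noisy) ** 2)   # quantization noise power
    print(f"{b:2d} bits -> mean squared error {err:.1f}")
```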

A COMPARISON TO THE VISUAL DOMAIN
Sound is a 1D signal with amplitude at time t – which leads us to believe it should be simpler to compress than 2D image signals and 3D video signals
Is this true?
• Compression ratios attained for video and images are far greater than those attained for audio
• Consider human perception factors – the human auditory system is more sensitive to quality degradation than the visual system. As a result, humans notice compressed-audio errors more readily than compressed image and video errors

APPLICATIONS
Telephone-quality speech
• Sampling rate = 8 KHz
• Bit rate = 8,000 samples/s × 16 bits = 128 Kbps
CDs (stereo channels)
• Sampling rate = 44.1 KHz
• Bit rate = 2 × 16 × 44,100 ≈ 1.4 Mbps!
• CD storage = 10.5 Megabytes / minute
• A CD can hold about 70 minutes of audio
Surround Sound Systems with 5 channels (the bit-rate arithmetic is sketched below)
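
These figures are just channels × bits per sample × sampling rate; a small sketch of the arithmetic (the 5-channel line assumes CD parameters, which the slides do not specify):

```python
def pcm_bitrate(channels, bits_per_sample, sampling_rate_hz):
    """Uncompressed PCM bit rate in bits per second."""
    return channels * bits_per_sample * sampling_rate_hz

# Telephone-quality speech: 1 channel, 16 bits, 8 KHz
print(pcm_bitrate(1, 16, 8_000))        # 128000 -> 128 Kbps
# CD stereo: 2 channels, 16 bits, 44.1 KHz
cd = pcm_bitrate(2, 16, 44_100)
print(cd)                                # 1411200 -> ~1.4 Mbps
print(cd * 60 / 8 / 1e6)                 # ~10.6 Megabytes per minute
# Hypothetical 5-channel surround at CD parameters:
print(pcm_bitrate(5, 16, 44_100))        # 3528000 -> ~3.5 Mbps
```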

NEED FOR COMPRESSION
We need to take advantage of redundancy/correlation in the signal by statistically studying the signal – but just that is not enough!
The amount of redundancy that can be removed throughout the signal is small, hence audio coding methods based only on statistics generally give a lower compression ratio than those for images or video
Apart from Statistical Study, more compression can be achieved based on
• Study of how sound is perceived
• Study of how it is produced

TYPES OF AUDIO COMPRESSION TECHNIQUES
Audio Compression techniques can be broadly classified into different categories depending on how sound is “understood”
Sound is a Waveform
• Use the statistical distribution of the samples (entropy coding, prediction, etc.)
• Not a good idea in general by itself
Sound is Perceived – Perception-Based
• Psychoacoustically motivated
• Need to understand the human auditory system
Sound is Produced – Production-Based
• Physics/source-model motivated
Music (Sound) is Performed/Published/Represented
• Event-Based Compression

SOUND AS A WAVEFORM
Uses variants of PCM techniques. PCM produces high data rates, which can be reduced by exploiting statistical redundancy (information theory)
Differential Pulse Code Modulation (DPCM)
• Take differences of successive PCM samples
• Entropy-code the differences
Delta Modulation
• Like DPCM, but encodes each difference using a single bit indicating a delta increase or a delta decrease (see the sketch below)
• Good for signals that don't change rapidly
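
A minimal sketch of a 1-bit delta modulator and its decoder (the fixed step size here is an arbitrary choice):

```python
import numpy as np

def delta_encode(x, step=0.05):
    """1-bit delta modulation: emit +1/-1 depending on whether the
    input is above or below the running staircase approximation."""
    approx, bits = 0.0, []
    for sample in x:
        bit = 1 if sample > approx else -1
        bits.append(bit)
        approx += bit * step          # staircase tracks the input
    return bits

def delta_decode(bits, step=0.05):
    """Rebuild the staircase approximation from the bit stream."""
    approx, out = 0.0, []
    for bit in bits:
        approx += bit * step
        out.append(approx)
    return np.array(out)

# A slowly varying signal is tracked well; a fast one would show
# "slope overload" since the staircase climbs only one step per sample.
t = np.linspace(0, 1, 200)
x = np.sin(2 * np.pi * 2 * t)
y = delta_decode(delta_encode(x))
print(f"max error: {np.max(np.abs(x - y)):.3f}")
```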

Adaptive Differential Pulse Code Modulation (ADPCM)
• A more sophisticated version of DPCM
• Codes the differences between the quantized audio samples using only a small number of bits, which adaptively vary with the signal (see the sketch below)
• Normally operates in one of two modes – high-frequency mode or low-frequency mode (why?)
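
A toy sketch of the adaptive idea – a 2-bit quantizer whose step size grows on outer (clipped) codes and shrinks on inner ones; the adaptation constants are illustrative, not from any standard:

```python
import numpy as np

def adapt(step, code):
    """Grow the step on outer (clipped) codes, shrink it on inner ones."""
    return max(step * (1.5 if code in (0, 3) else 0.9), 1e-4)

def adpcm_encode(x, step0=0.05):
    """Toy 2-bit ADPCM encoder: quantize the prediction error to one of
    four levels (-1.5, -0.5, +0.5, +1.5) * step, adapting the step size."""
    predicted, step, codes = 0.0, step0, []
    for sample in x:
        error = sample - predicted
        code = int(np.clip(np.floor(error / step) + 2, 0, 3))  # 0..3
        codes.append(code)
        predicted += (code - 1.5) * step   # decoder mirrors this state
        step = adapt(step, code)
    return codes

def adpcm_decode(codes, step0=0.05):
    """Decoder: replay the same prediction and step-size updates."""
    predicted, step, out = 0.0, step0, []
    for code in codes:
        predicted += (code - 1.5) * step
        out.append(predicted)
        step = adapt(step, code)
    return np.array(out)

t = np.linspace(0, 1, 500)
x = 0.5 * np.sin(2 * np.pi * 5 * t)
y = adpcm_decode(adpcm_encode(x))
print(f"max reconstruction error: {np.max(np.abs(x - y)):.3f}")
```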

Logarithmic Quantization
A different type of waveform-based coding scheme
• A-law (Europe, ISDN 8 KHz) – maps 13 bits to 8 log bits
• µ-law (America, Japan) – maps 14 bits to 8 log bits (see the sketch below)
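
A minimal sketch of µ-law companding using the continuous µ = 255 formula (G.711 itself uses a piecewise-linear approximation of this curve), showing that a quiet passage is quantized far more accurately than with uniform PCM at the same bit budget:

```python
import numpy as np

MU = 255.0  # µ-law constant used in North America / Japan

def mu_law_compress(x):
    """Map linear samples in [-1, 1] through the logarithmic µ-law curve,
    giving small amplitudes more resolution than large ones."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse mapping back to the linear domain."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

levels = 2 ** 7   # 7 magnitude bits + sign, roughly G.711's 8 bits
x = 0.01 * np.sin(np.linspace(0, 2 * np.pi, 100))   # quiet passage
q_mu = mu_law_expand(np.round(mu_law_compress(x) * levels) / levels)
q_uniform = np.round(x * levels) / levels
print(f"mu-law error:  {np.max(np.abs(x - q_mu)):.6f}")
print(f"uniform error: {np.max(np.abs(x - q_uniform)):.6f}")
```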

SOUND IS PERCEIVED
The compression attained by variations of PCM coding alone is not sufficient to reach the data rates required by modern applications (CD, Surround Sound, etc.)
Studying the perception of sound can additionally help compression:
• What frequencies we hear (20 Hz – 20 KHz)
• When we hear them
• When we do not hear them
This branch of study, psychoacoustics, is the science of sound perception. "Auditory masking" is a perceptual weakness of the ear that can be exploited for compression without compromising quality

STRUCTURE OF HUMAN EAR
[Figure: structure of the human ear – outer ear (ear canal), middle ear (ear drum, middle-ear bones), inner ear (cochlea)]

LIMITS OF HUMAN HEARING
The human auditory system, although very sensitive to quality, has a few limitations, which can be analyzed by considering
• Time Domain Considerations
• Frequency Domain (Spectral) Considerations
• Masking or hiding – which can happen in the Amplitude, Time and Frequency Domains
Time Domain
Events longer than 0.03 seconds are resolvable in time. Shorter events are perceived as features in frequency
Frequency Domain
20 Hz < human hearing < 20 KHz. "Pitch" is the perception related to frequency. Human pitch resolution is good from about 40 Hz to 4000 Hz.

Masking
• Masking, as defined by the American Standards Association (ASA), is the amount (or the process) by which the threshold of audibility for one sound is raised by the presence of another (masking) sound
• Masking Threshold Curve – a tone is audible only if its power is above the absolute threshold level
• If a tone of a certain frequency and amplitude is present, the audibility threshold curve changes. Other tones or noise of similar frequency, but of much lower amplitude, are not audible – loud stereo sound in a car masks the engine noise
[Figure: masking effect – single masker]
[Figure: masking effect – multiple maskers]

Masking in Amplitude
• Loud sounds "mask" soft ones – e.g., quantization noise. Intuitively, a soft sound will not be heard if there is a competing loud sound
• This happens because of gain controls within the ear – the stapedius reflex, interaction (inhibition) in the cochlea, and other mechanisms at higher levels

Masking in Time
• A soft sound just before a louder sound is more likely to be heard than if it is just after
• Operates in the time range of a few milliseconds
• A soft event following a louder event tends to be grouped perceptually as part of that louder event
• If the soft event precedes the louder event, it might be heard as a separate event

Masking in Frequency
• A loud "neighbor" frequency masks soft spectral components. Low sounds mask higher ones more than high sounds mask low ones

PERCEPTUAL CODING
Perceptual coding tries to minimize the perceptual distortion in a transform coding scheme.
Basic concept: allocate more bits (more quantization levels, less error) to the channels that are most audible, and fewer bits (more error) to the channels that are least audible. A bit-allocation sketch follows below.
Needs to continuously analyze the signal to determine the current audibility threshold curve using a perceptual model.
[Figure: perceptual codec block diagram. Encoder: PCM signal → frequency filtering and transform → quantization and coding of each band, with a perceptual analysis module (psychoacoustics) driving the bit allocation for each band → bit stream formatting → compressed bit stream. Decoder: compressed bit stream → bit stream parsing and unpacking → frequency domain reconstruction → adding all channels to reconstruct the signal → PCM signal.]

PERCEPTUAL CODING – EXAMPLE

SOUND IS PERCEIVED – REVISITED
The auditory system does not hear everything. The perception of sound is limited by the properties discussed above. There is room to cut without us knowing about it – by exploiting perceptual redundancy. To summarize:
• Bandwidth is limited – discard using filters
• Time resolution is limited – we can't hear oversampled signals
• Masking in all domains – psychoacoustics is used to discard perceptually irrelevant information. Generally requires the use of a perceptual model.
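
To make the bit-allocation idea concrete, here is a toy greedy allocator (hypothetical band powers and masking thresholds; the 6 dB-of-noise-per-bit rule of thumb is assumed): bands whose quantization noise is most audible above the mask receive bits first, and fully masked bands receive none.

```python
import numpy as np

def allocate_bits(signal_db, mask_db, bit_pool):
    """Greedy perceptual bit allocation (toy version).

    signal_db: per-band signal power (dB)
    mask_db:   per-band masking threshold (dB) from a perceptual model
    bit_pool:  total bits available for this block
    """
    smr = signal_db - mask_db            # signal-to-mask ratio per band
    bits = np.zeros_like(smr, dtype=int)
    while bit_pool > 0:
        # noise-to-mask ratio after current allocation (~6 dB per bit)
        nmr = smr - 6.0 * bits
        band = int(np.argmax(nmr))       # band with most audible noise
        if nmr[band] <= 0:               # all quantization noise masked
            break
        bits[band] += 1
        bit_pool -= 1
    return bits

signal = np.array([60.0, 55.0, 40.0, 20.0])  # band powers (dB)
mask   = np.array([30.0, 45.0, 42.0, 35.0])  # masking thresholds (dB)
print(allocate_bits(signal, mask, bit_pool=12))  # [5 2 0 0]
```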

SOUND IS PRODUCED
This is based on the assumption that a "perfect" model could provide perfect compression. In other words, an analysis of the frequencies (and variations) produced by a sound source yields properties of the signal it produces. A model of this sound production source is then built, and the model parameters are adjusted according to the sound it produces. For example, if the sound is human speech, then a well-parameterized vocal model can yield high-quality compression.
Advantage – great at compression, and possibly quality
Drawbacks – signal sources must be assumed, known a priori, or identified. Complex when a sound scene has one or more widely different sources.

LINEAR PREDICTIVE CODING (LPC)
• Stationary vs non-stationary
• Human speech is highly non-stationary
• It changes dynamically over time, and the change is quick – approximate it as stationary via frame blocking
• Each window may be categorized as "voiced" (vowels, consonants) or "unvoiced"

LPC DECODER / SYNTHESIZER
CELP – Code Excited Linear Prediction (MPEG-4)

SOUND IS PERFORMED OR PUBLISHED
This is also known as event-based audio or structured audio.
A description format made up of semantic information about the sounds it represents, and that makes use of high-level (algorithmic) models.
Event-list representation: a sequence of control parameters that, taken alone, do not define the quality of a sound but instead specify the ordering and characteristics of parts of a sound with regard to some external model.

EVENT-LIST REPRESENTATION
Event-list representations are appropriate for soundtracks, piano and percussive instruments. Not good for violin, speech and singing.
Sequencers: allow the specification and modification of event sequences.

MIDI
MIDI (Musical Instrument Digital Interface) is a system specification consisting of both hardware and software components that define interconnectivity and a communication protocol for electronic synthesizers, sequencers, rhythm machines, personal computers and other musical instruments.
Interconnectivity defines a standard cabling scheme, connectors and input/output circuitry.
The communication protocol defines standard multibyte messages to control the instrument's voice and to send responses and status.

MIDI COMMUNICATION PROTOCOL
The MIDI communication protocol uses multibyte messages of two kinds: channel messages and system messages. Channel messages address one of the 16 possible channels.
Voice Messages: used to control the voice of the instrument
• Switch notes on/off
• Send key-pressure messages
• Send control messages to control effects like vibrato, sustain and tremolo
• Pitch-wheel messages are used to change the pitch of all notes
• Channel key pressure provides a measure of force for the keys related to a specific channel (instrument)
A sketch of raw channel messages follows below.
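
As an illustration, channel voice messages are just a status byte (high nibble = message type, low nibble = channel) followed by data bytes; a sketch constructing a note-on/note-off pair for middle C:

```python
def note_on(channel, key, velocity):
    """Build a 3-byte MIDI note-on channel message.
    Status byte 0x90 | channel (0-15), then key (0-127), velocity (0-127)."""
    return bytes([0x90 | (channel & 0x0F), key & 0x7F, velocity & 0x7F])

def note_off(channel, key, velocity=0):
    """Build a 3-byte MIDI note-off channel message (status 0x80 | channel)."""
    return bytes([0x80 | (channel & 0x0F), key & 0x7F, velocity & 0x7F])

# Middle C (key 60) played at moderate velocity on channel 0:
print(note_on(0, 60, 64).hex())   # '903c40'
print(note_off(0, 60).hex())      # '803c00'
```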

MIDI FILES
MIDI messages are received and processed by a MIDI sequencer asynchronously (in real time)
• When the synthesizer receives a "note on" message it plays the note
• When it receives the corresponding "note off" it turns the note off
If MIDI data is stored as a data file, and/or edited using a sequencer, then some form of "time stamping" for the MIDI messages is required; this is specified by the Standard MIDI File specification.

SOUND REPRESENTATION AND SYNTHESIS
Sampling – individual instrument sounds (notes) are digitally recorded and stored in memory in the instrument. When the instrument is played, the note recordings are reproduced and mixed to produce the output sound.
Takes a lot of memory! To reduce storage:
• Transpose the pitch of a sample during playback
• Quasi-periodic sounds can be "looped" after the attack transient has died away
Used for creating sound effects for film (Foley)

SOUND REPRESENTATION AND SYNTHESIS (2)
Additive and subtractive synthesis
• Synthesize sound from the superposition of sinusoidal components (additive) or from the filtering of a harmonically rich source sound (subtractive)
• Very compact, but with an "analog synthesizer" feel
Frequency modulation synthesis
• Can synthesize a variety of sounds such as brass-like and woodwind-like timbres, percussive sounds, bowed strings and piano tones
• No straightforward method is available to determine an FM synthesis algorithm from an analysis of a desired sound

AUDIO CODING: MAIN STANDARDS
MPEG (Moving Picture Expert Group) family
• MPEG1 – Layer 1, Layer 2, Layer 3 (MP3)
• MPEG2 – backward compatible with MPEG1, plus AAC (non-backward-compatible)
• MPEG4 – CELP and AAC
Dolby
ITU Speech Coding Standards
• ITU G.711
• ITU G.722
• ITU G.726, G.727
• ITU G.729, G.723
• ITU G.728

MPEG-1 AUDIO CODER
A layered audio compression scheme, with each layer backward compatible
Layer 1
• Transparent at 384 Kbps
• Subband coding with 32 channels (12 samples/band)
• Coefficients are normalized (extracts a scale factor)
• For each block, chooses among 15 quantizers for perceptual quantization
• No entropy coding after transform coding
• Decoder is much simpler than the encoder
Layer 2
• Transparent at 256 Kbps
• Improved perceptual model
• Finer-resolution quantizers
Layer 3
• Transparent at 96 Kb/s per channel
• Applies a variable-size modified DCT (MDCT) to the samples of each subband channel – see the sketch below
• Uses non-uniform quantizers
• Has an entropy coder (Huffman) – requires buffering!
• Much more complex than Layers 1 and 2

MPEG-1 LAYERS 1 AND 2 AUDIO CODEC

MPEG-1 LAYER 3 (MP3) AUDIO CODEC
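
Since Layer 3's MDCT is the step that turns subband samples into frequency coefficients, here is a minimal direct-form sketch of the MDCT and its inverse with 50% overlap and a sine window (real MP3 adds window switching and a hybrid filter bank, omitted here); time-domain aliasing cancels in the overlap-add:

```python
import numpy as np

def mdct(x):
    """Direct-form MDCT: 2N windowed samples -> N coefficients."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ x

def imdct(X):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples."""
    N = len(X)
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (basis @ X)

N = 8                                    # half block size
win = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # sine window
x = np.random.randn(4 * N)               # three overlapping 2N blocks

# analysis: 50%-overlapped, windowed blocks
coeffs = [mdct(x[i:i + 2 * N] * win) for i in range(0, 2 * N + 1, N)]

# synthesis: windowed IMDCT + overlap-add (aliasing cancels)
y = np.zeros_like(x)
for i, C in zip(range(0, 2 * N + 1, N), coeffs):
    y[i:i + 2 * N] += imdct(C) * win

# the fully overlapped middle section is perfectly reconstructed
print(np.allclose(x[N:3 * N], y[N:3 * N]))   # True
```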

MPEG-2 AUDIO CODEC
Designed with the goal of providing theater-style surround-sound capabilities and backward compatibility.
Has various modes of surround-sound operation:
• Monaural
• Stereo
• Three channel (left, right and center)
• Four channel (left, right, center, rear surround)
• Five channel (left, right, center and two surround) at 640 Kb/s
Non-backward compatible (AAC):
• At 320 Kb/s, judged to be equivalent to MPEG-2 at 640 Kb/s for five-channel surround sound
• Can operate with any number of channels (between 1 and 48) and output bit rates from 8 Kb/s per channel to 182 Kb/s per channel
• Sampling rates between 8 KHz and 96 KHz per channel

DOLBY AC-3
Used in movie theaters as part of the Dolby digital film system. Selected for USA Digital TV (DTV) and DVD.
Bit rate: 320 Kb/s for 5.1-channel audio
Uses a 512-point modified DCT (can be switched to 256-point)
Floating-point conversion into exponent-mantissa pairs (mantissas are quantized with a variable number of bits)
Does not transmit the bit allocation, but rather the perceptual model parameters

DOLBY AC-3 ENCODER

ITU SPEECH COMPRESSION STANDARDS

ITU G.711
Designed for telephone-bandwidth speech (3 KHz). Does direct sample-by-sample non-uniform quantization (PCM). Provides the lowest delay possible (1 sample) and the lowest complexity. Employs µ-law and A-law encoding schemes. High rate and no recovery mechanism; used as the default coder for ISDN video telephony.

ITU G.722
Designed to transmit 7-KHz-bandwidth voice or music. Divides the signal into two bands (high-pass and low-pass), which are then encoded with different modalities. G.722 is preferred over G.711 PCM because of the increased bandwidth for teleconference-type applications. Music quality is not perfectly transparent.

ITU G.726, G.727
ADPCM (Adaptive Differential PCM) codecs for telephone-bandwidth speech. Can operate using 2, 3, 4 or 5 bits per sample.

ITU G.729, G.723
Model-based coders: use special models of the production (synthesis) of speech
• Linear synthesis
• Analysis by synthesis: the optimal "input noise" is computed and coded into a multipulse excitation
• LPC parameter coding and pitch prediction
Have provisions for dealing with frame erasure and packet-loss concealment (good on the Internet)
G.723 is part of the H.324 standard for communication over POTS with a modem

ITU G.728
A hybrid between the lower-bit-rate model-based coders (G.723 and G.729) and the ADPCM coders. Low delay but fairly high complexity. Considered equivalent in performance to 32 Kb/s G.726 and G.727. The suggested speech coder for low-bit-rate (64-128 Kb/s) ISDN video telephony. Remarkably robust to random bit errors.

QUESTION
Both the visual image/video encoders and the psychoacoustic audio encoders work by converting the input spatial or time domain samples to the frequency domain and quantizing the frequency coefficients.
• How is the conversion to the frequency domain different for visual encoders compared to audio encoders? Why is this difference made for audio encoders?
• How does the quantization of frequency coefficients in the psychoacoustic encoders differ from that used in the visual media types?
• Why is it necessary to transmit the bit rate allocation in the bit stream for audio encoders? Does this have to be done at the beginning, towards the end, or often – explain! How do the visual encoders convey the bit rate allocation?