Techniques for Noise Robustness in Automatic Speech Recognition
Wiley
Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the environments in which these systems are used are noisy, for example when a user calls a voice search system from a busy cafeteria or a street. Such conditions degrade the recorded speech and adversely affect the performance of speech recognition systems. As the use of ASR systems increases, knowledge of the state of the art in techniques for dealing with such problems becomes critical for the system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of the state-of-the-art techniques used to improve the robustness of speech recognition systems to these degrading external influences.
Key features:
- Reviews all the main noise robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing data techniques and recognition of reverberant speech.
- Acts as a timely exposition of the topic in light of the increasingly widespread use of ASR technology in challenging environments.
- Addresses robustness and signal degradation, both key concerns for practitioners of ASR.
- Includes contributions from top ASR researchers at leading research units in the field.
Tuomas Virtanen, Tampere University of Technology, Finland
Dr. Virtanen is a senior researcher at Tampere University of Technology. Previously, he worked as a research associate at Cambridge University, UK. His main research contributions are in sound source separation and its application to robust speech recognition, audio content analysis, and music information retrieval. He is well known for his work on non-negative matrix factorization-based source separation, which is now widely used in the field. He has published numerous journal and conference articles on these topics.
Rita Singh, Carnegie Mellon University, USA
Dr. Singh is the CEO of a speech-technology startup and remains an adjunct faculty member of the Language Technologies Institute at Carnegie Mellon University. She has been a major contributor to the open-source CMU Sphinx project and is one of the main architects of the popular Java-based open-source Sphinx-4 speech recognition system. In addition to her work on core speech recognition technology, she has developed several algorithms for noise compensation, and was the prime architect of CMU's award-winning submission to the Naval Research Laboratory's 2001 Speech in Noisy Environments (SPINE) evaluation of automatic recognition of noisy speech.
Bhiksha Raj, Carnegie Mellon University, USA
Dr. Raj is an associate professor in the Language Technologies Institute and in Electrical and Computer Engineering at Carnegie Mellon University. He has worked extensively on robustness algorithms for speech recognition and is well known for his contributions to the highly popular vector Taylor series (VTS) approach to noise compensation, as well as to missing-feature-based techniques for noise compensation. He has published extensively on, and holds patents for, algorithms for microphone array processing and signal separation.
List of Contributors xv
Acknowledgments xvii
1 Introduction 1
Tuomas Virtanen, Rita Singh, Bhiksha Raj
1.1 Scope of the Book 1
1.2 Outline 2
1.3 Notation 4
Part One FOUNDATIONS
2 The Basics of Automatic Speech Recognition 9
Rita Singh, Bhiksha Raj, Tuomas Virtanen
2.1 Introduction 9
2.2 Speech Recognition Viewed as Bayes Classification 10
2.3 Hidden Markov Models 11
2.3.1 Computing Probabilities with HMMs 12
2.3.2 Determining the State Sequence 17
2.3.3 Learning HMM Parameters 19
2.3.4 Additional Issues Relating to Speech Recognition Systems 20
2.4 HMM-Based Speech Recognition 24
2.4.1 Representing the Signal 24
2.4.2 The HMM for a Word Sequence 25
2.4.3 Searching through all Word Sequences 26
References 29
3 The Problem of Robustness in Automatic Speech Recognition 31
Bhiksha Raj, Tuomas Virtanen, Rita Singh
3.1 Errors in Bayes Classification 31
3.1.1 Type 1 Condition: Mismatch Error 33
3.1.2 Type 2 Condition: Increased Bayes Error 34
3.2 Bayes Classification and ASR 35
3.2.1 All We Have is a Model: A Type 1 Condition 35
3.2.2 Intrinsic Interferences—Signal Components that are Unrelated to the Message: A Type 2 Condition 36
3.2.3 External Interferences—The Data are Noisy: Type 1 and Type 2 Conditions 36
3.3 External Influences on Speech Recordings 36
3.3.1 Signal Capture 37
3.3.2 Additive Corruptions 41
3.3.3 Reverberation 42
3.3.4 A Simplified Model of Signal Capture 43
3.4 The Effect of External Influences on Recognition 44
3.5 Improving Recognition under Adverse Conditions 46
3.5.1 Handling the Model Mismatch Error 46
3.5.2 Dealing with Intrinsic Variations in the Data 47
3.5.3 Dealing with Extrinsic Variations 47
References 50
Part Two SIGNAL ENHANCEMENT
4 Voice Activity Detection, Noise Estimation, and Adaptive Filters for Acoustic Signal Enhancement 53
Rainer Martin, Dorothea Kolossa
4.1 Introduction 53
4.2 Signal Analysis and Synthesis 55
4.2.1 DFT-Based Analysis Synthesis with Perfect Reconstruction 55
4.2.2 Probability Distributions for Speech and Noise DFT Coefficients 57
4.3 Voice Activity Detection 58
4.3.1 VAD Design Principles 58
4.3.2 Evaluation of VAD Performance 62
4.3.3 Evaluation in the Context of ASR 62
4.4 Noise Power Spectrum Estimation 65
4.4.1 Smoothing Techniques 65
4.4.2 Histogram and GMM Noise Estimation Methods 67
4.4.3 Minimum Statistics Noise Power Estimation 67
4.4.4 MMSE Noise Power Estimation 68
4.4.5 Estimation of the A Priori Signal-to-Noise Ratio 69
4.5 Adaptive Filters for Signal Enhancement 71
4.5.1 Spectral Subtraction 71
4.5.2 Nonlinear Spectral Subtraction 73
4.5.3 Wiener Filtering 74
4.5.4 The ETSI Advanced Front End 75
4.5.5 Nonlinear MMSE Estimators 75
4.6 ASR Performance 80
4.7 Conclusions 81
References 82
5 Extraction of Speech from Mixture Signals 87
Paris Smaragdis
5.1 The Problem with Mixtures 87
5.2 Multichannel Mixtures 88
5.2.1 Basic Problem Formulation 88
5.2.2 Convolutive Mixtures 92
5.3 Single-Channel Mixtures 98
5.3.1 Problem Formulation 98
5.3.2 Learning Sound Models 100
5.3.3 Separation by Spectrogram Factorization 101
5.3.4 Dealing with Unknown Sounds 105
5.4 Variations and Extensions 107
5.5 Conclusions 107
References 107
6 Microphone Arrays 109
John McDonough, Kenichi Kumatani
6.1 Speaker Tracking 110
6.2 Conventional Microphone Arrays 113
6.3 Conventional Adaptive Beamforming Algorithms 120
6.3.1 Minimum Variance Distortionless Response Beamformer 120
6.3.2 Noise Field Models 122
6.3.3 Subband Analysis and Synthesis 123
6.3.4 Beamforming Performance Criteria 126
6.3.5 Generalized Sidelobe Canceller Implementation 129
6.3.6 Recursive Implementation of the GSC 130
6.3.7 Other Conventional GSC Beamformers 131
6.3.8 Beamforming based on Higher Order Statistics 132
6.3.9 Online Implementation 136
6.3.10 Speech-Recognition Experiments 140
6.4 Spherical Microphone Arrays 142
6.5 Spherical Adaptive Algorithms 148
6.6 Comparative Studies 149
6.7 Comparison of Linear and Spherical Arrays for DSR 152
6.8 Conclusions and Further Reading 154
References 155
Part Three FEATURE ENHANCEMENT
7 From Signals to Speech Features by Digital Signal Processing 161
Matthias Wölfel
7.1 Introduction 161
7.1.1 About this Chapter 162
7.2 The Speech Signal 162
7.3 Spectral Processing 163
7.3.1 Windowing 163
7.3.2 Power Spectrum 165
7.3.3 Spectral Envelopes 166
7.3.4 LP Envelope 166
7.3.5 MVDR Envelope 169
7.3.6 Warping the Frequency Axis 171
7.3.7 Warped LP Envelope 175
7.3.8 Warped MVDR Envelope 176
7.3.9 Comparison of Spectral Estimates 177
7.3.10 The Spectrogram 179
7.4 Cepstral Processing 179
7.4.1 Definition and Calculation of Cepstral Coefficients 180
7.4.2 Characteristics of Cepstral Sequences 181
7.5 Influence of Distortions on Different Speech Features 182
7.5.1 Objective Functions 182
7.5.2 Robustness against Noise 185
7.5.3 Robustness against Echo and Reverberation 187
7.5.4 Robustness against Changes in Fundamental Frequency 189
7.6 Summary and Further Reading 191
References 191
8 Features Based on Auditory Physiology and Perception 193
Richard M. Stern, Nelson Morgan
8.1 Introduction 193
8.2 Some Attributes of Auditory Physiology and Perception 194
8.2.1 Peripheral Processing 194
8.2.2 Processing at more Central Levels 200
8.2.3 Psychoacoustical Correlates of Physiological Observations 202
8.2.4 The Impact of Auditory Processing on Conventional Feature Extraction 206
8.2.5 Summary 208
8.3 “Classic” Auditory Representations 208
8.4 Current Trends in Auditory Feature Analysis 213
8.5 Summary 221
Acknowledgments 222
References 222
9 Feature Compensation 229
Jasha Droppo
9.1 Life in an Ideal World 229
9.1.1 Noise Robustness Tasks 229
9.1.2 Probabilistic Feature Enhancement 230
9.1.3 Gaussian Mixture Models 231
9.2 MMSE-SPLICE 232
9.2.1 Parameter Estimation 233
9.2.2 Results 236
9.3 Discriminative SPLICE 237
9.3.1 The MMI Objective Function 238
9.3.2 Training the Front-End Parameters 239
9.3.3 The Rprop Algorithm 240
9.3.4 Results 241
9.4 Model-Based Feature Enhancement 242
9.4.1 The Additive Noise-Mixing Equation 243
9.4.2 The Joint Probability Model 244
9.4.3 Vector Taylor Series Approximation 246
9.4.4 Estimating Clean Speech 247
9.4.5 Results 247
9.5 Switching Linear Dynamic System 248
9.6 Conclusion 249
References 249
10 Reverberant Speech Recognition 251
Reinhold Haeb-Umbach, Alexander Krueger
10.1 Introduction 251
10.2 The Effect of Reverberation 252
10.2.1 What is Reverberation? 252
10.2.2 The Relationship between Clean and Reverberant Speech Features 254
10.2.3 The Effect of Reverberation on ASR Performance 258
10.3 Approaches to Reverberant Speech Recognition 258
10.3.1 Signal-Based Techniques 259
10.3.2 Front-End Techniques 260
10.3.3 Back-End Techniques 262
10.3.4 Concluding Remarks 265
10.4 Feature Domain Model of the Acoustic Impulse Response 265
10.5 Bayesian Feature Enhancement 267
10.5.1 Basic Approach 268
10.5.2 Measurement Update 269
10.5.3 Time Update 270
10.5.4 Inference 271
10.6 Experimental Results 272
10.6.1 Databases 272
10.6.2 Overview of the Tested Methods 273
10.6.3 Recognition Results on Reverberant Speech 274
10.6.4 Recognition Results on Noisy Reverberant Speech 276
10.7 Conclusions 277
Acknowledgment 278
References 278
Part Four MODEL ENHANCEMENT
11 Adaptation and Discriminative Training of Acoustic Models 285
Yannick Estève, Paul Deléglise
11.1 Introduction 285
11.1.1 Acoustic Models 286
11.1.2 Maximum Likelihood Estimation 287
11.2 Acoustic Model Adaptation and Noise Robustness 288
11.2.1 Static (or Offline) Adaptation 289
11.2.2 Dynamic (or Online) Adaptation 289
11.3 Maximum A Posteriori Reestimation 290
11.4 Maximum Likelihood Linear Regression 293
11.4.1 Class Regression Tree 294
11.4.2 Constrained Maximum Likelihood Linear Regression 297
11.4.3 CMLLR Implementation 297
11.4.4 Speaker Adaptive Training 298
11.5 Discriminative Training 299
11.5.1 MMI Discriminative Training Criterion 301
11.5.2 MPE Discriminative Training Criterion 302
11.5.3 I-smoothing 303
11.5.4 MPE Implementation 304
11.6 Conclusion 307
References 308
12 Factorial Models for Noise Robust Speech Recognition 311
John R. Hershey, Steven J. Rennie, Jonathan Le Roux
12.1 Introduction 311
12.2 The Model-Based Approach 313
12.3 Signal Feature Domains 314
12.4 Interaction Models 317
12.4.1 Exact Interaction Model 318
12.4.2 Max Model 320
12.4.3 Log-Sum Model 321
12.4.4 Mel Interaction Model 321
12.5 Inference Methods 322
12.5.1 Max Model Inference 322
12.5.2 Parallel Model Combination 324
12.5.3 Vector Taylor Series Approaches 326
12.5.4 SNR-Dependent Approaches 331
12.6 Efficient Likelihood Evaluation in Factorial Models 332
12.6.1 Efficient Inference using the Max Model 332
12.6.2 Efficient Vector-Taylor Series Approaches 334
12.6.3 Band Quantization 335
12.7 Current Directions 337
12.7.1 Dynamic Noise Models for Robust ASR 338
12.7.2 Multi-Talker Speech Recognition using Graphical Models 339
12.7.3 Noise Robust ASR using Non-Negative Basis Representations 340
References 341
13 Acoustic Model Training for Robust Speech Recognition 347
Michael L. Seltzer
13.1 Introduction 347
13.2 Traditional Training Methods for Robust Speech Recognition 348
13.3 A Brief Overview of Speaker Adaptive Training 349
13.4 Feature-Space Noise Adaptive Training 351
13.4.1 Experiments using fNAT 352
13.5 Model-Space Noise Adaptive Training 353
13.6 Noise Adaptive Training using VTS Adaptation 355
13.6.1 Vector Taylor Series HMM Adaptation 355
13.6.2 Updating the Acoustic Model Parameters 357
13.6.3 Updating the Environmental Parameters 360
13.6.4 Implementation Details 360
13.6.5 Experiments using NAT 361
13.7 Discussion 364
13.7.1 Comparison of Training Algorithms 364
13.7.2 Comparison to Speaker Adaptive Training 364
13.7.3 Related Adaptive Training Methods 365
13.8 Conclusion 366
References 366
Part Five COMPENSATION FOR INFORMATION LOSS
14 Missing-Data Techniques: Recognition with Incomplete Spectrograms 371
Jon Barker
14.1 Introduction 371
14.2 Classification with Incomplete Data 373
14.2.1 A Simple Missing Data Scenario 374
14.2.2 Missing Data Theory 376
14.2.3 Validity of the MAR Assumption 378
14.2.4 Marginalising Acoustic Models 379
14.3 Energetic Masking 381
14.3.1 The Max Approximation 381
14.3.2 Bounded Marginalisation 382
14.3.3 Missing Data ASR in the Cepstral Domain 384
14.3.4 Missing Data ASR with Dynamic Features 386
14.4 Meta-Missing Data: Dealing with Mask Uncertainty 388
14.4.1 Missing Data with Soft Masks 388
14.4.2 Sub-band Combination Approaches 391
14.4.3 Speech Fragment Decoding 393
14.5 Some Perspectives on Performance 395
References 396
15 Missing-Data Techniques: Feature Reconstruction 399
Jort Florent Gemmeke, Ulpu Remes
15.1 Introduction 399
15.2 Missing-Data Techniques 401
15.3 Correlation-Based Imputation 402
15.3.1 Fundamentals 402
15.3.2 Implementation 404
15.4 Cluster-Based Imputation 406
15.4.1 Fundamentals 406
15.4.2 Implementation 408
15.4.3 Advances 409
15.5 Class-Conditioned Imputation 411
15.5.1 Fundamentals 411
15.5.2 Implementation 412
15.5.3 Advances 413
15.6 Sparse Imputation 414
15.6.1 Fundamentals 414
15.6.2 Implementation 416
15.6.3 Advances 418
15.7 Other Feature-Reconstruction Methods 420
15.7.1 Parametric Approaches 420
15.7.2 Nonparametric Approaches 421
15.8 Experimental Results 421
15.8.1 Feature-Reconstruction Methods 422
15.8.2 Comparison with Other Methods 424
15.8.3 Advances 426
15.8.4 Combination with Other Methods 427
15.9 Discussion and Conclusion 428
Acknowledgments 429
References 430
16 Computational Auditory Scene Analysis and Automatic Speech Recognition 433
Arun Narayanan, DeLiang Wang
16.1 Introduction 433
16.2 Auditory Scene Analysis 434
16.3 Computational Auditory Scene Analysis 435
16.3.1 Ideal Binary Mask 435
16.3.2 Typical CASA Architecture 438
16.4 CASA Strategies 440
16.4.1 IBM Estimation Based on Local SNR Estimates 440
16.4.2 IBM Estimation using ASA Cues 442
16.4.3 IBM Estimation as Binary Classification 448
16.4.4 Binaural Mask Estimation Strategies 451
16.5 Integrating CASA with ASR 452
16.5.1 Uncertainty Transform Model 454
16.6 Concluding Remarks 458
Acknowledgment 458
References 458
17 Uncertainty Decoding 463
Hank Liao
17.1 Introduction 463
17.2 Observation Uncertainty 465
17.3 Uncertainty Decoding 466
17.4 Feature-Based Uncertainty Decoding 468
17.4.1 SPLICE with Uncertainty 470
17.4.2 Front-End Joint Uncertainty Decoding 471
17.4.3 Issues with Feature-Based Uncertainty Decoding 472
17.5 Model-Based Joint Uncertainty Decoding 473
17.5.1 Parameter Estimation 475
17.5.2 Comparisons with Other Methods 476
17.6 Noisy CMLLR 477
17.7 Uncertainty and Adaptive Training 480
17.7.1 Gradient-Based Methods 481
17.7.2 Factor Analysis Approaches 482
17.8 In Combination with Other Techniques 483
17.9 Conclusions 484
References 485
Index 487