
Speech processing

This article is about electronic speech processing. For speech processing in the human brain, see Language processing in the brain.

Speech processing is the study of speech signals and the processing methods of signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Typical speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, speaker recognition, etc.[1]

History

Early attempts at speech processing and recognition were primarily focused on understanding a handful of simple phonetic elements such as vowels. In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.[2] Pioneering work in the field of speech recognition based on spectrum analysis had been reported in the 1940s.[3]

Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966.[4] Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s.[4] LPC was the basis for voice-over-IP (VoIP) technology,[4] as well as speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978.[5]

One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.[6]

By the early 2000s, the dominant speech processing strategy started to shift away from Hidden Markov Models towards more modern neural networks and deep learning.[citation needed]

Techniques

Dynamic time warping

Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) subject to certain restrictions and rules. The optimal match is the one that satisfies all the restrictions and rules and has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values.[citation needed]
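A minimal sketch of this idea in Python (illustrative only; the absolute-difference cost and the standard step pattern are common textbook choices, not taken from any particular speech system):

```python
import numpy as np

def dtw_distance(x, y):
    """Cost of the optimal DTW alignment between two 1-D sequences,
    using the absolute difference as the local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # allowed steps: diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# Two renditions of the "same" contour, one spoken more slowly
fast = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
slow = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 1.5, 1.0, 0.5, 0.0])
print(dtw_distance(fast, slow))
```

Despite the difference in length and speed, the alignment cost stays small because DTW stretches one sequence to match the other.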

Hidden Markov models

A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal of the algorithm is to estimate a hidden variable x(t) given a list of observations y(t). By applying the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) only depends on the value of the hidden variable x(t) (both at time t).[citation needed]
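A small illustrative sketch of this dependency structure, assuming a toy two-state model with made-up transition and emission probabilities (not from the article); it runs the standard forward recursion to estimate the hidden state from observations:

```python
import numpy as np

# Toy HMM: two hidden states, two observation symbols (illustrative values).
A = np.array([[0.7, 0.3],    # P(x_t | x_{t-1}): state transition matrix
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # P(y_t | x_t): emission probabilities
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial state distribution

def forward(observations):
    """Forward algorithm: P(x_t | y_1..y_t), exploiting the Markov property."""
    alpha = pi * B[:, observations[0]]
    alpha /= alpha.sum()
    for y in observations[1:]:
        alpha = (alpha @ A) * B[:, y]   # predict with A, update with the emission B
        alpha /= alpha.sum()            # normalize to a probability distribution
    return alpha

print(forward([0, 0, 1, 1]))  # posterior over hidden states after four observations
```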

Artificial neural networks

An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.[citation needed]
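As a minimal sketch of the last point, a single artificial neuron computing a non-linear function of the weighted sum of its inputs (the weights, bias and logistic non-linearity are illustrative choices, not prescribed by the article):

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: non-linear function of the weighted sum of inputs."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))   # logistic (sigmoid) non-linearity

x = np.array([0.5, -1.2, 3.0])        # incoming signals (real numbers)
w = np.array([0.8, 0.1, -0.4])        # connection weights (illustrative values)
print(neuron(x, w, bias=0.2))
```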

Phase-aware processing

Phase is usually assumed to be a uniformly distributed random variable and thus useless. This is due to the wrapping of phase:[7] the result of the arctangent function is not continuous because of periodic jumps of $2\pi$. After phase unwrapping (see,[8] Chapter 2.3; Instantaneous phase and frequency), it can be expressed as:[7][9] $\phi(h,l) = \phi_{\mathrm{lin}}(h,l) + \Psi(h,l)$, where $\phi_{\mathrm{lin}}(h,l) = \omega_0 l \Delta t$ is the linear phase ($\Delta t$ is the temporal shift at each analysis frame) and $\Psi(h,l)$ is the phase contribution of the vocal tract and the phase source.[9] The obtained phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase[10] and of its derivatives with respect to time (instantaneous frequency) and frequency (group delay),[11] and smoothing of phase across frequency.[11] Joint amplitude and phase estimators can recover speech more accurately based on the assumption of a von Mises distribution of phase.[9]
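A minimal sketch of the unwrapping and linear-phase removal step, assuming a single synthetic harmonic; it is not the estimators from the cited papers, and the sample rate, frequency and phase offset are made-up values:

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000                                # sample rate (Hz), illustrative
f0 = 200.0                               # assumed harmonic frequency (Hz)
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * f0 * t + 0.3)     # toy harmonic with a 0.3 rad phase offset

wrapped = np.angle(hilbert(x))           # arctangent output, wrapped to (-pi, pi]
unwrapped = np.unwrap(wrapped)           # remove the 2*pi discontinuities
linear = 2 * np.pi * f0 * t              # linear phase term phi_lin
residual = unwrapped - linear            # remaining vocal-tract/source contribution
print(residual[100:105])                 # approximately 0.3 rad away from the edges
```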

Applications

Interactive voice response
Virtual assistants
Voice identification
Emotion recognition
Call center automation
Robotics

See also

Computational audiology
Neurocomputational speech processing
Speech coding
Speech technology
Natural language processing

References

  1. ^ Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiking; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].
  2. ^ Juang, B.-H.; Rabiner, L.R. (2006), "Speech Recognition, Automatic: History", Encyclopedia of Language & Linguistics, Elsevier, pp. 806–819, doi:10.1016/b0-08-044854-2/00906-8, ISBN 9780080448541
  3. ^ Myasnikov, L. L.; Myasnikova, Ye. N. (1970). Automatic recognition of sound pattern (in Russian). Leningrad: Energiya.
  4. ^ a b c Gray, Robert M. (2010). "A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol" (PDF). Found. Trends Signal Process. 3 (4): 203–303. doi:10.1561/2000000036. ISSN 1932-8346.
  5. ^ "VC&G - VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development".
  6. ^ Huang, Xuedong; Baker, James; Reddy, Raj (2014-01-01). "A historical perspective of speech recognition". Communications of the ACM. 57 (1): 94–103. doi:10.1145/2500887. ISSN 0001-0782. S2CID 6175701.
  7. ^ a b Mowlaee, Pejman; Kulmer, Josef (August 2015). "Phase Estimation in Single-Channel Speech Enhancement: Limits-Potential". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 23 (8): 1283–1294. doi:10.1109/TASLP.2015.2430820. ISSN 2329-9290. S2CID 13058142. Retrieved 2017-12-03.
  8. ^ Mowlaee, Pejman; Kulmer, Josef; Stahl, Johannes; Mayer, Florian (2017). Single channel phase-aware signal processing in speech communication: theory and practice. Chichester: Wiley. ISBN 978-1-119-23882-9.
  9. ^ a b c Kulmer, Josef; Mowlaee, Pejman (April 2015). "Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR". Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE. pp. 5063–5067.
  10. ^ Kulmer, Josef; Mowlaee, Pejman (May 2015). "Phase Estimation in Single Channel Speech Enhancement Using Phase Decomposition". IEEE Signal Processing Letters. 22 (5): 598–602. doi:10.1109/LSP.2014.2365040. ISSN 1070-9908. S2CID 15503015. Retrieved 2017-12-03.
  11. ^ a b Mowlaee, Pejman; Saeidi, Rahim; Stylianou, Yannis (July 2016). "Advances in phase-aware signal processing in speech communication". Speech Communication. 81: 1–29. doi:10.1016/j.specom.2016.04.002. ISSN 0167-6393. S2CID 17409161. Retrieved 2017-12-03.
