Speech Recognition Systems: Courseware

Speech Recognition Systems
Wu Jiayao
July 22, 2024

What is speech recognition?
Input: speech. Output: a symbol sequence.
Example: 普通话 (Putonghua) -> p, u, t, o, ng, h, u, a

Analyzing the speech recognition problem
Data type:
- Dimension: 1D / 2D / 3D
- Channel: single / multiple
The "spatial" dimension of speech is tied to the frequency distribution and to the mathematical transforms used for feature extraction. It carries several sources of acoustic variability, such as the environment, the speaker, the accent, the speaking style, and the speaking rate. Environmental factors include microphone characteristics, the speech transmission channel, ambient noise, and room reverberation; the latter factors involve correlations along both the spatial and the temporal dimensions. (语音识别实践, p. 35)

Basic steps of speech recognition
- Variable-length input
- Input and output lengths differ
- Alignment
- State recognition -> phonemes -> words -> sentences
- Temporal order
- Context dependence

Evolution of the basic speech recognition framework

Basic structure
- MFCC features
- GMM+HMM
- N-gram

Tandem structure
- DNN, followed by GMM+HMM and N-gram
- Input: raw waveform
- Output: distribution over context-dependent phonemes (a discriminative problem)
- An intermediate layer is used as the feature: a bottleneck layer (a few dozen dimensions)

Hybrid structure
- DNN+HMM, trigram
- Input: speech signal
- Output: which state each frame belongs to; the DNN takes over the role of the GMM
- Training still needs a conventional GMM+HMM system to provide the alignment

Why do we need HMM?
- The neural network only classifies frame by frame
- Training needs an HMM system to provide the start and end time of each phoneme
- Decoding needs to take state transition probabilities into account
But what if we do not classify frame by frame?

Can we abandon HMM?
- Limited context-modeling ability

CTC structure
- RNN (CTC)
- Trigram or NN

Attention structure

Speech recognition system performance

Is the ASR system good enough?
- Far-field microphone speech recognition (Alexa)
- Speech recognition in high-noise environments (in-vehicle)
- Accented speech recognition (dialects)
- Recognition of disfluent natural speech, and of variable-rate or emotional speech

Commercial speech recognition systems
ASR system
  End2End
    CTC
      Baidu: DeepSpeech1, DeepSpeech2
      Facebook
    Attention-based
      Google: 2016_ICASSP, 2018_ICASSP
      XiaoMi: 2018_ICASSP
      Baidu: 2018_ICASSP (NoiseRobust)
    RNN-T
      Google: 2017_ICASSP
  Tradition
    FSMN
      Alibaba: 2018_ICASSP
    CNN
      SPEECH: 2018_ICASSP (NoiseRobust)

CTC Loss Function
1. K-category softmax over the vocabulary
2. Score(k, t) = log P(k, t | X)
3. Total score of one path: the sum of the scores at the different time steps
4. The probability of any transcript: the sum of the probabilities of all paths that yield it
- No more frame-by-frame classification
- Add a blank symbol; the output only has to collapse to the reference transcript (the actual output positions are close to the true positions)
- Conventional acoustic model: n n i i i i h h h ao ao ao ao ao
- CTC: - n - i - h - ao -

CTC Loss Function (continued)
Characteristics of CTC
- Frame-independence assumption
- Assumes the context has already been handled by the RNN
Training
- Total probability of all paths that collapse to the reference transcript
- Dynamic programming (forward-backward algorithm)
Decoding
- Beam search
Advantages of CTC
- Simple
- Needs no linguistic knowledge (lexicon and language model)
- OOV problem (a lexicon is both armor and a weak spot)
- End-to-end training (training the modules separately does not necessarily give the best overall system)
Disadvantages of CTC
- Needs a large amount of training data (one model does several jobs)
- Speech data carries little context information (attach an external LM)
- Frame-independence assumption: /greit/ can be spelled "great" or "grate", but CTC may spell it "grete"
Graves, Alex, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.

DeepSpeech 1.0
Hannun, Awni, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014).

Facebook ConvNet CTC
Collobert, Ronan, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193 (2016).
Liptchinsky, V., G. Synnaeve, and R. Collobert. Letter based speech recognition with gated convnets. CoRR, vol. abs/1712.09444 (2017).

Seq2Seq with attention
Encoder
- Feature extraction and information compression over the sequence (acoustic model)
Decoder
- At each step, actively looks for the input frames it needs (language model)

Seq2Seq with attention (continued)
Characteristics of Seq2Seq
- Encoder + decoder
- Emits an output at every step; no blank
Training
Decoding
- Beam search
Advantages of Seq2Seq
- No frame-independence assumption
- End-to-end training
Disadvantages of Seq2Seq
- Cannot do online recognition

Google LAS
- Pyramidal RNN: information compression, feature extraction
- Attention RNN: similar to an alignment operation
1. Chan, William, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016.
2. Chiu, Chung-Cheng, et al. State-of-the-art speech recognition with sequence-to-sequence models. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

XiaoMi
Shan, Changhao, et al. Attention-based end-to-end speech recognition on voice search. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

Language   Character   Word
English    a, b, c     hello
Chinese    你, 好      你好

From English to Mandarin on the voice search task
1. Structure
- Character embedding
2. Training
- L2 regularization
- Gaussian weight noise
- Frame skipping
- Attention smoothing
(Figure: XiaoMi encoder and decoder)

Baidu ASR GAN
- Clean audio -> encoder -> clean embedding
- Noisy audio -> encoder -> noisy embedding
- Discriminator loss
- Attention decoder
- CE loss
- Augmentation
Sriram, Anuroop, et al. Robust speech recognition using generative adversarial networks. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

Think
- The encoding result
- The match function (how well two vectors match)
- The encoder and decoder networks

RNN-T
Rao, Kanishka, Haim Sak, and Rohit Prabhavalkar. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017.
- Training: dynamic programming
- Decoding: beam search

End2End comparison

                        CTC            Transducer                     Attention
Output language model   No             Yes                            Yes
Alignment               Monotonic      Monotonic                      Non-monotonic
                        Hard           Hard                           Soft
Decoding steps needed   Input length   Input length + output length   Output length

CTC assumes the outputs are independent of one another; Transducer and Attention do not.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
Graves, Alex, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
Graves, Alex. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711 (2012).

Better with LM!
- End-to-end learning contains a language model, but a weak one
- Text data is easier to obtain than speech data

Alibaba-FSMN
Signal processing has two kinds of filters, IIR and FIR, and they correspond to two kinds of neural networks (recurrent networks resemble IIR filters; the FSMN memory block resembles an FIR filter). The proposed FSMN is inspired by filter design in digital signal processing: any infinite impulse response (IIR) filter can be well approximated by a high-order finite impulse response (FIR) filter.
Zhang, Shiliang, et al. Feedforward sequential memory networks: A new structure to learn long-term dependency. arXiv preprint arXiv:1512.08301 (2015).
Zhang, Shiliang, et al. Compact Feedforward Sequential Memory Networks for Large Vocabulary Continuous Speech Recognition. INTERSPEECH. 2016.
Zhang, Shiliang, et al. Deep-FSMN for Large Vocabulary Continuous Speech Recognition. arXiv preprint arXiv:1803.05030 (2018).

VDCRN
Tan, Tian, et al. Adaptive very deep convolutional residual network for noise robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26.8 (2018): 1393-1405.
Noise robustness
- Front-end: denoising, dereverberation
- Back-end: model adaptation
- Based on VDCNN: more noise-robust than other models
- VDCNN + BN + residual learning
- FAT (factor aware training) & CAT (cluster adaptive training)

Thank you!
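Supplementary sketch for the CTC loss function slides. The slides say that a CTC path only has to collapse to the reference transcript, and that training sums the probability of every such path by dynamic programming. The Python below is a minimal illustration of those two ideas, not the implementation of any system named in the deck. The function names, the "-" blank symbol, and the per-frame probability dictionaries are assumptions made for this example; a real system would work in log space on softmax outputs.

```python
import itertools

BLANK = "-"  # blank symbol, written "-" as on the slides (an assumption of this sketch)


def ctc_collapse(path):
    """Collapse a frame-level path: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out


def ctc_forward(probs, target):
    """Total probability of all paths that collapse to `target` (forward pass).

    probs: list of T dicts, probs[t][symbol] = probability of symbol at frame t.
    """
    # Extended label sequence: a blank before, between and after every label.
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]
    size = len(ext)

    # alpha[s]: probability of all path prefixes that end in ext[s] at frame t.
    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]         # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_brute_force(probs, symbols, target):
    """Same quantity by enumerating every path; only feasible for tiny inputs."""
    total = 0.0
    for path in itertools.product(symbols, repeat=len(probs)):
        if ctc_collapse(path) == list(target):
            p = 1.0
            for t, sym in enumerate(path):
                p *= probs[t][sym]
            total += p
    return total


# The two frame-level outputs from the slides collapse to the same transcript.
frames = ["n", "n", "i", "i", "i", "i", "h", "h", "h",
          "ao", "ao", "ao", "ao", "ao"]
print(ctc_collapse(frames))                        # ['n', 'i', 'h', 'ao']
print(ctc_collapse("- n - i - h - ao -".split()))  # ['n', 'i', 'h', 'ao']
```

The brute-force function is included only as a cross-check on the dynamic program: it enumerates every path, which grows exponentially with the number of frames, whereas the forward pass costs on the order of (number of frames) x (transcript length).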
