
1 Language Models For Speech Recognition

2 Speech Recognition
- A = a_1, a_2, ..., a_T: sequence of acoustic vectors
- Find the word sequence W so that: W* = argmax_W p(W | A) = argmax_W p(A | W) p(W)
- The task of a language model is to make available to the recognizer adequate estimates of the probabilities p(W)

3 Language Models
- A language model assigns a probability p(W) to every word sequence W = w_1, w_2, ..., w_n
- By the chain rule: p(w_1, ..., w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_1, ..., w_{n-1})
- Each word is predicted from its full history h_i = w_1, ..., w_{i-1}

4 N-gram models
- Make the Markov assumption that only the prior local context – the last (N-1) words – affects the next word
- N=3: trigrams
- N=2: bigrams
- N=1: unigrams

5 Parameter estimation
- Maximum Likelihood Estimator
- N=3, trigrams: p(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2} w_{i-1} w_i) / c(w_{i-2} w_{i-1})
- N=2, bigrams: p(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
- N=1, unigrams: p(w_i) = c(w_i) / N
- This will assign zero probabilities to unseen events
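
The counts in the bullets above can be gathered directly from a tokenized corpus. A minimal Python sketch (not the presentation's code; the toy corpus and function names are illustrative):

```python
from collections import Counter

def mle_ngram_probs(tokens, n):
    """Maximum-likelihood estimates: p(w | history) = c(history, w) / c(history)."""
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    histories = Counter()
    for gram, count in ngrams.items():
        histories[gram[:-1]] += count
    return {gram: count / histories[gram[:-1]] for gram, count in ngrams.items()}

# Toy corpus; any n-gram absent from it gets probability zero under MLE.
tokens = "the dog on the hill barked at the dog".split()
bigram_p = mle_ngram_probs(tokens, 2)
print(bigram_p[("the", "dog")])  # 2 occurrences of "the dog" / 3 of "the" ≈ 0.667
```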

6 Number of Parameters
- For a vocabulary of size V, a 1-gram model has V - 1 independent parameters
- A 2-gram model has V^2 - 1 independent parameters
- In general, an n-gram model has V^n - 1 independent parameters
- Typical values for a moderate size vocabulary of 20,000 words:

  Model    Parameters
  1-gram   20,000
  2-gram   20,000^2 = 400 million
  3-gram   20,000^3 = 8 trillion

7 Number of Parameters
- |V| = 60,000, N = 35M (Eleftherotypia daily newspaper)

  Count   1-grams   2-grams     3-grams
  1       160,273   3,877,976   13,128,073
  2       51,725    784,012     1,802,348
  3       27,171    314,114     562,264
  >0      390,796   5,834,632   16,515,051
  >=0     390,796   36x10^8     216x10^12

- In a typical training text, roughly 80% of trigrams occur only once
- Good-Turing estimate: ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams

8 Problems
- Data sparseness: we do not have enough data to train the model parameters
Solutions
- Smoothing techniques: accurately estimate probabilities in the presence of sparse data
  - Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off)
- Build compact models: they have fewer parameters to train and thus require less data
  - equivalence classification of words (e.g. grammatical categories (noun, verb, adjective, preposition), semantic labels (city, name, date))

9 Smoothing
- Make distributions more uniform
- Redistribute probability mass from higher to lower probabilities

10 Additive Smoothing
- For each n-gram that occurs r times, pretend that it occurs r+1 times
- e.g. bigrams: p(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / (c(w_{i-1}) + V)
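
A small sketch of the add-one estimate for bigrams, following the formula above; the toy corpus and names are illustrative, not from the presentation:

```python
from collections import Counter

def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (additive) estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = "the dog on the hill barked at the dog".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
V = len(unigrams)
# An unseen bigram now gets a small nonzero probability instead of zero.
print(add_one_bigram_prob("dog", "barked", bigrams, unigrams, V))  # (0 + 1) / (2 + 6) = 0.125
```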

11 Good-Turing Smoothing
- For any n-gram that occurs r times, pretend that it occurs r* times: r* = (r + 1) n_{r+1} / n_r, where n_r is the number of n-grams which occur r times
- To convert this count to a probability we just normalize: p = r* / N
- Total probability of unseen n-grams: n_1 / N

12 Example

  r (=MLE)   n_r             r* (=GT)
  0          3,594,165,368   0.001078
  1          3,877,976       0.404
  2          784,012         1.202
  3          314,114         2.238
  4          175,720         3.187
  5          112,006         4.199
  6          78,391          5.238
  7          58,661          6.270
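
The r* column above follows directly from the Good-Turing formula; a small illustrative check in Python, with the count-of-counts copied from the trigram table:

```python
def good_turing_adjusted_count(r, count_of_counts):
    """Good-Turing: r* = (r + 1) * n_{r+1} / n_r, with n_r = number of n-grams seen r times."""
    return (r + 1) * count_of_counts[r + 1] / count_of_counts[r]

# Count-of-counts from the table (n_0 = number of unseen trigrams).
n = {0: 3_594_165_368, 1: 3_877_976, 2: 784_012, 3: 314_114, 4: 175_720}
print(good_turing_adjusted_count(0, n))  # ~0.001078
print(good_turing_adjusted_count(1, n))  # ~0.404
```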

13 Jelinek-Mercer Smoothing (linear interpolation)
- Interpolate a higher-order model with a lower-order model: p_interp(w_i | w_{i-1}) = λ p_ML(w_i | w_{i-1}) + (1 - λ) p_ML(w_i)
- Given fixed p_ML, it is possible to search efficiently for the λ that maximizes the probability of some data using the Baum-Welch algorithm
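
A minimal sketch of the interpolation and of a one-parameter EM re-estimation of λ on held-out data (the degenerate case of Baum-Welch the slide refers to); all function and variable names are illustrative assumptions:

```python
def jelinek_mercer_prob(w_prev, w, lam, p_ml_bigram, p_ml_unigram):
    """Linear interpolation of a higher-order and a lower-order ML estimate."""
    return lam * p_ml_bigram.get((w_prev, w), 0.0) + (1 - lam) * p_ml_unigram.get(w, 0.0)

def reestimate_lambda(held_out_bigrams, lam, p_ml_bigram, p_ml_unigram, iters=20):
    """EM for the mixture weight: average posterior weight of the higher-order model.

    Assumes 0 < lam < 1 initially and that every held-out word has nonzero unigram probability.
    """
    for _ in range(iters):
        posterior_sum = 0.0
        for w_prev, w in held_out_bigrams:
            hi = lam * p_ml_bigram.get((w_prev, w), 0.0)
            lo = (1 - lam) * p_ml_unigram[w]
            posterior_sum += hi / (hi + lo)
        lam = posterior_sum / len(held_out_bigrams)
    return lam
```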

14 Katz Smoothing (backing-off)
- For those events which have been observed in the training data we assume some reliable estimate of the probability
- For the remaining unseen events we back off to some less specific distribution
- The back-off weight α is chosen so that the total probability sums to 1
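
A simplified back-off sketch in the spirit of Katz, for bigrams backing off to unigrams; it assumes discounted probabilities for the observed bigrams are already available (e.g. from Good-Turing), and all names are illustrative:

```python
def katz_bigram_prob(w_prev, w, discounted_p, unigram_p, seen_after):
    """Back off to the unigram distribution for bigrams unseen in training.

    discounted_p[(w_prev, w)] : discounted estimate for each observed bigram
    seen_after[w_prev]        : set of words observed after w_prev in training
    """
    if (w_prev, w) in discounted_p:
        return discounted_p[(w_prev, w)]
    # Probability mass left over after discounting the observed bigrams...
    left_over = 1.0 - sum(discounted_p[(w_prev, v)] for v in seen_after[w_prev])
    # ...spread over the unigram mass of the words never seen after w_prev,
    # so that the whole distribution sums to 1.
    unseen_mass = 1.0 - sum(unigram_p[v] for v in seen_after[w_prev])
    alpha = left_over / unseen_mass
    return alpha * unigram_p[w]
```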

15 Witten-Bell Smoothing
- Model the probability of new events by estimating how often such a new event is seen as we proceed through the training corpus (i.e. from the total number of word types in the corpus)

16 Absolute Discounting
- Subtract a constant D from each nonzero count
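
A sketch of the discounted bigram estimate, with the mass freed by the D-subtraction redistributed through a lower-order distribution; the exact interpolation form and the names are assumptions for illustration:

```python
def abs_discount_bigram_prob(w_prev, w, D, bigram_counts, unigram_counts, seen_after, p_lower):
    """Subtract D (0 < D <= 1) from each nonzero bigram count and give the freed mass to p_lower."""
    c_hist = unigram_counts[w_prev]
    higher = max(bigram_counts[(w_prev, w)] - D, 0) / c_hist
    # Mass freed by discounting the distinct bigrams that start with w_prev.
    backoff_weight = D * len(seen_after[w_prev]) / c_hist
    return higher + backoff_weight * p_lower(w)
```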

17 Kneser-Ney
- Lower-order distribution not proportional to the number of occurrences of a word, but to the number of different words that it follows
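
A sketch of the continuation counts behind this lower-order distribution: each word is weighted by the number of distinct words it follows, normalized by the number of distinct bigram types (illustrative code, not the presentation's):

```python
from collections import Counter

def continuation_unigram_probs(bigram_counts):
    """Kneser-Ney lower-order distribution, built from bigram *types* rather than tokens."""
    histories_per_word = Counter(w for (_, w) in bigram_counts)  # distinct left contexts of w
    total_bigram_types = len(bigram_counts)
    return {w: n / total_bigram_types for w, n in histories_per_word.items()}
```

A frequent word that follows only a few distinct words (the usual "Francisco" after "San" example) gets a low continuation probability, which is exactly the behaviour the bullet describes.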

18 Modified Kneser-Ney

19 Measuring Model Quality
- Consider the language as an information source L, which emits a sequence of symbols w_i from a finite alphabet (the vocabulary)
- The quality of a language model M can be judged by its cross entropy with regard to the distribution P_T(x) of some hitherto unseen text T: H(P_T; M) = -Σ_x P_T(x) log_2 P_M(x)
- Intuitively speaking, cross entropy is the entropy of T as "perceived" by the model M

20 Perplexity
- Perplexity: PP = 2^H, where H is the cross entropy of the model on the text
- In a language with perplexity X, every word can be followed by X different words with equal probabilities
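
A minimal sketch tying the two slides together: per-word cross entropy on a test sequence and PP = 2^H; the uniform model in the example just illustrates the "X equally likely successors" reading, and the names are illustrative:

```python
import math

def perplexity(test_tokens, prob):
    """PP = 2^H, where H is the per-word cross entropy of the model on the test text."""
    log_prob = sum(math.log2(prob(test_tokens[i], test_tokens[i - 1]))
                   for i in range(1, len(test_tokens)))
    H = -log_prob / (len(test_tokens) - 1)
    return 2 ** H

# A model that gives every word probability 1/60000 has perplexity 60000.
print(perplexity(["a", "b", "c"], lambda w, h: 1 / 60000))  # ≈ 60000
```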

21 Elements of Information Theory
- Entropy: H(X) = -Σ_x p(x) log_2 p(x)
- Mutual Information: I(X; Y) = Σ_{x,y} p(x, y) log_2 [p(x, y) / (p(x) p(y))]; pointwise: log_2 [p(x, y) / (p(x) p(y))]
- Kullback-Leibler (KL) divergence: D(p || q) = Σ_x p(x) log_2 [p(x) / q(x)]

22 The Greek Language
- Highly inflectional language
- A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage

                      English               French     Greek            German
  Source              Wall Street Journal   Le Monde   Eleytherotypia   Frankfurter Rundschau
  Corpus size         37.2 M                37.7 M     35 M             31.5 M
  Distinct words      165 K                 280 K      410 K            500 K
  Vocabulary size     60 K                  60 K       60 K             60 K
  Lexical coverage    99.6 %                98.3 %     96.5 %           95.1 %

23 Perplexity

                      English   French   Greek   German
  Vocabulary size          20 K               64 K
  2-gram PP           198       178      232     430
  3-gram PP           135       119      163     336

24 Experimental Results

                           1M             5M             35M
  Smoothing                PP     WER     PP     WER     PP     WER
  Good-Turing              341    27.71   248    23.48   163    19.59
  Witten-Bell              354    27.42   251    24.17   163    19.84
  Absolute Discounting     344    28.47   256    24.25   169    20.78
  Modified Kneser-Ney      328    26.78   237    21.91   156    18.57

         1M      5M      35M
  OOV    4.75%   3.46%   3.17%

25 Hit Rate

            hit rate % (1M)   hit rate % (5M)   hit rate % (35M)
  1-gram    27.3              16.4              7.4
  2-gram    52.5              49.9              40
  3-gram    20.2              33.7              52.6

26 Class-based Models
- Some words are similar to other words in their meaning and syntactic function
- Group words into classes
  - Fewer parameters
  - Better estimates

27 Class-based n-gram models
- Suppose that we partition the vocabulary into G classes
- This model produces text by first generating a string of classes g_1, g_2, ..., g_n and then converting them into the words w_i, i = 1, 2, ..., n with probability p(w_i | g_i)
- An n-gram model has V^n - 1 independent parameters (216x10^12)
- A class-based model has G^n - 1 + V - G parameters (~10^9)
  - G^n - 1 of an n-gram model for a vocabulary of size G
  - V - G of the form p(w_i | g_i)
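
The generative story in the second bullet corresponds to the following probability computation for a class-based bigram (a sketch; the lookup-table representation and names are assumptions):

```python
def class_bigram_prob(w_prev, w, word_class, p_word_given_class, p_class_bigram):
    """Class-based bigram: p(w | w_prev) = p(w | g(w)) * p(g(w) | g(w_prev))."""
    g_prev, g = word_class[w_prev], word_class[w]
    return p_word_given_class[(w, g)] * p_class_bigram[(g_prev, g)]
```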

28 Relation to n-grams

29 Defining Classes
- Manually
  - Use part-of-speech labels assigned by linguistic experts or a tagger
  - Use stem information
- Automatically
  - Cluster words as part of an optimization method, e.g. maximize the log-likelihood of test text

30 Agglomerative Clustering
- Bottom-up clustering
- Start with a separate cluster for each word
- Merge that pair for which the loss in average mutual information (MI) is least
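
A naive sketch of this merging loop: it re-evaluates the average mutual information of the class bigram distribution for every candidate merge, which is far slower than the incremental updates used in practice (e.g. Brown clustering), but it shows the criterion; all names are illustrative:

```python
import math
from collections import Counter
from itertools import combinations

def average_mi(class_bigrams):
    """Average mutual information I(g1; g2) of adjacent class pairs."""
    total = sum(class_bigrams.values())
    left, right = Counter(), Counter()
    for (g1, g2), c in class_bigrams.items():
        left[g1] += c
        right[g2] += c
    return sum((c / total) * math.log2(c * total / (left[g1] * right[g2]))
               for (g1, g2), c in class_bigrams.items())

def merge_counts(class_bigrams, a, b):
    """Relabel cluster b as a in the class-bigram counts."""
    merged = Counter()
    for (g1, g2), c in class_bigrams.items():
        merged[(a if g1 == b else g1, a if g2 == b else g2)] += c
    return merged

def agglomerative_clustering(tokens, target_clusters):
    cluster_of = {w: w for w in set(tokens)}        # start: one cluster per word
    bigrams = Counter(zip(tokens, tokens[1:]))      # class bigrams == word bigrams at first
    while len(set(cluster_of.values())) > target_clusters:
        clusters = sorted(set(cluster_of.values()))
        # Pick the merge whose resulting counts keep average MI highest (least MI loss).
        _, a, b = max((average_mi(merge_counts(bigrams, a, b)), a, b)
                      for a, b in combinations(clusters, 2))
        bigrams = merge_counts(bigrams, a, b)
        cluster_of = {w: (a if g == b else g) for w, g in cluster_of.items()}
    return cluster_of
```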

31 Example
- Syntactic classes
  - verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν
  - nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα
  - adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος
- Semantic classes
  - last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης
  - countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία
  - numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο
- Some not so well defined classes
  - ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε
  - εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος

32 Stem-based Classes
- άγνωστ: άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα
- βλέπ: βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν
- εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται
- εξοχικ: εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές
- ιστορικ: ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά
- καθηγητ: καθηγητής, καθηγητή, καθηγητές, καθηγητών
- μαχητικ: μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά

33 Experimental Results

  G              PP (1M)   PP (5M)   PP (35M)
  1              1309      1461      1503
  133 (POS)      1047      1143      1167
  500            -         -         314
  1000           -         -         266
  2000           -         -         224
  30000 (stem)   383       299       215
  60000          328       237       156

34 Example
- Interpolate class-based and word-based models

35 Experimental Results

                    1M              5M              35M
  G                 PP     WER      PP     WER      PP     WER
  133 (POS)         325    27.11    236    22.00    156    18.52
  500               -      -        -      -        151    18.63
  1000              -      -        -      -        150    18.61
  2000              -      -        -      -        149    18.65
  30000 (stem)      319    26.99    232    22.04    154    18.44
  60000             328    26.78    237    21.91    156    18.57

36 Hit Rate

            hit rate % (1M)   hit rate % (5M)   hit rate % (35M)
  1-gram    21.3              12.1              5.1
  2-gram    56                50.4              37.6
  3-gram    22.7              37.6              57.4

            hit rate % (1M)   hit rate % (5M)   hit rate % (35M)
  1-gram    27.3              16.4              7.4
  2-gram    52.5              49.9              40
  3-gram    20.2              33.7              52.6

37 Experimental Results

                         1M              5M              35M
  Model                  PP     WER      PP     WER      PP     WER
  ME 3gram               331    26.83    239    21.94    158    18.60
  ME 3gram+stem          320    26.54    227    21.66    143    18.29

                         1M              5M              35M
  Model                  PP     WER      PP     WER      PP     WER
  BO 3gram               328    26.78    237    21.91    156    18.57
  Interp. 3gram+stem     319    26.99    232    22.04    154    18.44

38 Where do we go from here?
- Use syntactic information: "The dog on the hill barked"
- Constraints

