Language Models For Speech Recognition
Speech Recognition
• $A$: sequence of acoustic vectors
• Find the word sequence $\hat{W}$ so that:
  $\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W P(A \mid W)\,P(W)$
• The task of a language model is to make available to the recognizer adequate estimates of the word-sequence probabilities $P(W)$
Language Models
• A language model assigns a probability $P(W)$ to every word sequence $W = w_1, w_2, \ldots, w_n$
• By the chain rule of probability:
  $P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$
• The history $w_1, \ldots, w_{i-1}$ grows with $i$, so these conditional probabilities cannot be estimated directly
N-gram models
• Make the Markov assumption that only the prior local context – the last $(N-1)$ words – affects the next word:
  $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$
• $N=3$: trigrams $P(w_i \mid w_{i-2}, w_{i-1})$
• $N=2$: bigrams $P(w_i \mid w_{i-1})$
• $N=1$: unigrams $P(w_i)$
Parameter estimation
• Maximum likelihood estimator: relative frequencies of the observed n-grams
• $N=3$: trigrams $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})}$
• $N=2$: bigrams $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}$
• $N=1$: unigrams $P(w_i) = \frac{c(w_i)}{N}$
• This will assign zero probability to unseen events
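A minimal sketch of the maximum-likelihood bigram estimator above; the toy corpus and whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def mle_bigram_model(tokens):
    """P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(u, w): c / unigrams[u] for (u, w), c in bigrams.items()}

tokens = "the dog on the hill barked at the dog".split()
model = mle_bigram_model(tokens)
print(model[("the", "dog")])   # c("the dog") / c("the") = 2/3
# Any bigram absent from the training data gets probability zero.
```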
Number of Parameters
• For a vocabulary of size V, a 1-gram model has $V-1$ independent parameters
• A 2-gram model has $V^2-1$ independent parameters
• In general, an n-gram model has $V^n-1$ independent parameters
• Typical values for a moderate-size vocabulary of 20,000 words:

  Model    Parameters
  1-gram   20 thousand
  2-gram   400 million
  3-gram   8 trillion
Number of Parameters
• $|V| = 60{,}000$, $N = 35$M words (Eleftherotypia daily newspaper)
• [Table: number of distinct 1-grams, 2-grams, and 3-grams by occurrence count]
• In a typical training text, roughly 80% of trigrams occur only once
• Good-Turing estimate: ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams appearing in new text
Problems
• Data sparseness: we do not have enough data to train the model parameters
Solutions
• Smoothing techniques: accurately estimate probabilities in the presence of sparse data
  – Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off)
• Build compact models: they have fewer parameters to train and thus require less data
  – equivalence classification of words (e.g. grammatical categories (noun, verb, adjective, preposition), semantic labels (city, name, date))
Smoothing
• Make distributions more uniform
• Redistribute probability mass from higher to lower probabilities
Additive Smoothing
• For each n-gram that occurs r times, pretend that it occurs r+1 times
• e.g. for bigrams:
  $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + V}$
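A minimal sketch of add-one smoothing for bigrams, matching the formula above; counts are taken from the training tokens and V is the observed vocabulary size.

```python
from collections import Counter

def add_one_bigram_prob(tokens, prev, word):
    """P(word | prev) = (c(prev word) + 1) / (c(prev) + V)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
```

Unseen bigrams now receive the small but nonzero probability $1 / (c(w_{i-1}) + V)$ instead of zero.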
Good-Turing Smoothing
• For any n-gram that occurs r times, pretend that it occurs r* times:
  $r^* = (r+1)\,\frac{n_{r+1}}{n_r}$
  where $n_r$ is the number of n-grams which occur r times
• To convert this count to a probability we just normalize:
  $P(\text{n-gram seen } r \text{ times}) = \frac{r^*}{N}$, with $N = \sum_r n_r\, r$
• Total probability of unseen n-grams: $\frac{n_1}{N}$
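A minimal sketch of the adjusted-count computation; the toy corpus is an assumption, and real implementations smooth the $n_r$ values first, since $n_{r+1}$ can be zero for large r.

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Return {r: r*} with r* = (r + 1) * n_{r+1} / n_r, where n_r is the
    number of distinct n-grams that occur exactly r times."""
    n = Counter(counts.values())
    return {r: (r + 1) * n.get(r + 1, 0) / n[r] for r in sorted(n)}

tokens = "the dog saw the cat and the cat saw the dog".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
N = sum(bigram_counts.values())
r_star = good_turing_adjusted(bigram_counts)
p_unseen = Counter(bigram_counts.values())[1] / N   # n_1 / N
```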
Example
[Table: worked Good-Turing example with columns r (ML count), $n_r$, and r* (GT count)]
Jelinek-Mercer Smoothing (linear interpolation)
• Interpolate a higher-order model with a lower-order model:
  $P_{\text{interp}}(w_i \mid w_{i-1}) = \lambda\, P_{\text{ML}}(w_i \mid w_{i-1}) + (1-\lambda)\, P_{\text{ML}}(w_i)$
• Given fixed $P_{\text{ML}}$, it is possible to search efficiently for the $\lambda$ that maximizes the probability of some data using the Baum-Welch algorithm
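A minimal sketch of two-way interpolation with a fixed λ; fitting λ with Baum-Welch on held-out data is omitted, and the count dictionaries follow the shapes used in the earlier sketches.

```python
def interp_bigram_prob(prev, word, unigrams, bigrams, N, lam=0.7):
    """lam * P_ML(word | prev) + (1 - lam) * P_ML(word)."""
    p_bi = bigrams.get((prev, word), 0) / unigrams[prev] if prev in unigrams else 0.0
    p_uni = unigrams.get(word, 0) / N
    return lam * p_bi + (1 - lam) * p_uni
```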
Katz Smoothing (backing-off)
• For those events which have been observed in the training data we assume some reliable (discounted) estimate of the probability
• For the remaining unseen events we back off to some less specific distribution:
  $P_{\text{Katz}}(w_i \mid w_{i-1}) = \begin{cases} P^*(w_i \mid w_{i-1}) & \text{if } c(w_{i-1} w_i) > 0 \\ \alpha(w_{i-1})\, P(w_i) & \text{otherwise} \end{cases}$
• $\alpha(w_{i-1})$ is chosen so that the total probability sums to 1
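A minimal sketch of backing-off. Katz proper uses Good-Turing discounted counts for the observed events; here a single absolute discount stands in for them, to keep the normalization logic visible.

```python
def backoff_bigram_prob(prev, word, unigrams, bigrams, N, discount=0.5):
    """Discounted bigram estimate if seen, else alpha(prev) * P_ML(word).
    Assumes prev occurred in the training data."""
    if (prev, word) in bigrams:
        return (bigrams[(prev, word)] - discount) / unigrams[prev]
    seen = [w for (u, w) in bigrams if u == prev]
    freed = discount * len(seen) / unigrams[prev]           # mass freed by discounting
    unseen_mass = 1.0 - sum(unigrams[w] for w in seen) / N  # unigram mass of unseen words
    alpha = freed / unseen_mass                             # makes the total sum to 1
    return alpha * unigrams.get(word, 0) / N
```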
Witten-Bell Smoothing
• Model the probability of new events by counting how often we saw a new event as we proceeded through the training corpus (i.e. the total number of word types in the corpus)
• For a bigram context this gives the interpolation:
  $P_{\text{WB}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + T(w_{i-1})\, P(w_i)}{c(w_{i-1}) + T(w_{i-1})}$
  where $T(w_{i-1})$ is the number of distinct word types observed after $w_{i-1}$
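A minimal sketch of the interpolated Witten-Bell estimate above, with the same count-dictionary shapes as the earlier sketches.

```python
def witten_bell_prob(prev, word, unigrams, bigrams, N):
    """(c(prev word) + T(prev) * P(word)) / (c(prev) + T(prev)),
    where T(prev) is the number of distinct types seen after prev."""
    T = len({w for (u, w) in bigrams if u == prev})
    p_uni = unigrams.get(word, 0) / N
    return (bigrams.get((prev, word), 0) + T * p_uni) / (unigrams[prev] + T)
```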
Absolute Discounting
• Subtract a constant D from each nonzero count:
  $P_{\text{abs}}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - D,\, 0)}{c(w_{i-1})} + \frac{D\, T(w_{i-1})}{c(w_{i-1})}\, P(w_i)$
  where $T(w_{i-1})$ is the number of distinct word types observed after $w_{i-1}$
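A minimal sketch of the discounting formula above; D = 0.75 is a conventional default, not a value from these experiments.

```python
def abs_discount_prob(prev, word, unigrams, bigrams, N, D=0.75):
    """max(c(prev word) - D, 0)/c(prev) + D * T(prev)/c(prev) * P(word)."""
    c_prev = unigrams[prev]
    T = len({w for (u, w) in bigrams if u == prev})
    p_hi = max(bigrams.get((prev, word), 0) - D, 0) / c_prev
    return p_hi + (D * T / c_prev) * unigrams.get(word, 0) / N
```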
Kneser-Ney Smoothing
• Make the lower-order distribution proportional not to the number of occurrences of a word, but to the number of different words that it follows:
  $P_{\text{KN}}(w_i) = \frac{|\{w : c(w\, w_i) > 0\}|}{\sum_{w'} |\{w : c(w\, w') > 0\}|}$
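A minimal sketch of the continuation-count distribution; in a full Kneser-Ney model it replaces the unigram model inside an absolute-discounting scheme like the one above.

```python
from collections import Counter

def kn_continuation_probs(bigrams):
    """P_KN(w) proportional to the number of distinct words that precede w."""
    contexts = Counter()
    for (u, w) in bigrams:       # each distinct bigram type counts once
        contexts[w] += 1
    total = sum(contexts.values())
    return {w: c / total for w, c in contexts.items()}
```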
Modified Kneser-Ney
• Instead of a single discount D, use three discounts $D_1$, $D_2$, $D_{3+}$ for n-grams with counts of one, two, and three or more (Chen & Goodman)
Measuring Model Quality
• Consider the language as an information source L, which emits a sequence of symbols $w_i$ from a finite alphabet (the vocabulary)
• The quality of a language model M can be judged by its cross entropy with regard to the distribution $P_T(x)$ of some hitherto unseen text T:
  $H(P_T; M) = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P_M(w_i \mid w_1, \ldots, w_{i-1})$
• Intuitively speaking, cross entropy is the entropy of T as "perceived" by the model M
Perplexity
• Perplexity: $PP = 2^{H(P_T; M)}$
• In a language with perplexity X, every word can be followed by X different words with equal probabilities
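A minimal sketch of cross entropy and perplexity over a test text; `model_logprob(history, word)` is a hypothetical interface returning $\log_2 P_M(w \mid \text{history})$.

```python
def perplexity(model_logprob, tokens):
    """PP = 2^H, where H is the average negative log2-probability
    the model assigns to each test token given its history."""
    H = -sum(model_logprob(tokens[:i], w) for i, w in enumerate(tokens)) / len(tokens)
    return 2 ** H
```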
Elements of Information Theory
• Entropy: $H(X) = -\sum_x p(x) \log_2 p(x)$
• Mutual information: $I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$
  – pointwise: $I(x;y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}$
• Kullback-Leibler (KL) divergence: $D(p \,\|\, q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$
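Minimal sketches of the first and last definitions, with distributions represented as plain dictionaries mapping outcomes to probabilities.

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x)."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 where p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)
```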
The Greek Language
• Highly inflectional language
• A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage

                    English              French    Greek           German
  Source            Wall Street Journal  Le Monde  Eleftherotypia  Frankfurter Rundschau
  Corpus size       37.2 M               37.7 M    35 M            31.5 M
  Distinct words    165 K                280 K     410 K           500 K
  Vocabulary size   60 K                 60 K      60 K            60 K
  Lexical coverage  99.6 %               98.3 %    96.5 %          95.1 %
Perplexity
[Table: 2-gram and 3-gram perplexities for English, French, Greek, and German at vocabulary sizes of 20 K and 64 K]
Experimental Results
[Table: perplexity (PP) and word error rate (WER) for Good-Turing, Witten-Bell, Absolute Discounting, and Modified Kneser-Ney smoothing, trained on 1M, 5M, and 35M words]

         1M      5M      35M
  OOV    4.75%   3.46%   3.17%
Hit Rate
[Table: 1-gram, 2-gram, and 3-gram hit rates (%) for training sets of 1M, 5M, and 35M words]
Class-based Models
• Some words are similar to other words in their meaning and syntactic function
• Group words into classes
  – Fewer parameters
  – Better estimates
Class-based n-gram models
• Suppose that we partition the vocabulary into G classes:
  $P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid g_i)\, P(g_i \mid g_{i-n+1}, \ldots, g_{i-1})$
• This model produces text by first generating a string of classes $g_1, g_2, \ldots, g_n$ and then converting them into the words $w_i$, $i = 1, 2, \ldots, n$ with probability $P(w_i \mid g_i)$
• An n-gram model has $V^n - 1$ independent parameters ($216 \times 10^{12}$ for $V = 60{,}000$, $n = 3$)
• A class-based model has $G^n - 1 + V - G$ parameters ($\approx 10^9$ for $G = 1000$)
  – $G^n - 1$ of an n-gram model over a vocabulary of size G
  – $V - G$ of the form $P(w_i \mid g_i)$
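A minimal sketch of the class-based bigram computation; the three dictionaries (word-to-class map, class transition probabilities, emission probabilities) are assumed to be estimated already.

```python
def class_bigram_prob(prev, word, word2class, p_class_trans, p_word_given_class):
    """P(w_i | w_{i-1}) = P(w_i | g_i) * P(g_i | g_{i-1})."""
    g_prev, g = word2class[prev], word2class[word]
    return p_word_given_class[(word, g)] * p_class_trans.get((g_prev, g), 0.0)
```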
Relation to n-grams
• A class-based model is an n-gram model with tied parameters: all words of the same class share one transition distribution, so $P(w_i \mid w_{i-1}) = P(w_i \mid g_i)\, P(g_i \mid g_{i-1})$ is a constrained special case of the word n-gram model
Defining Classes
• Manually
  – Use part-of-speech labels assigned by linguistic experts or a tagger
  – Use stem information
• Automatically
  – Cluster words as part of an optimization method, e.g. maximize the log-likelihood of test text
Agglomerative Clustering
• Bottom-up clustering
• Start with a separate cluster for each word
• Merge the pair of clusters for which the loss in average mutual information is least
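A minimal sketch of the greedy criterion; this naive version recomputes the class-bigram mutual information from scratch for every candidate merge, so it is only practical for toy vocabularies (efficient implementations update the objective incrementally).

```python
from collections import Counter
from itertools import combinations
import math

def avg_mutual_info(bigrams, word2class):
    """I = sum over class pairs of p(g1,g2) * log2(p(g1,g2) / (p(g1) p(g2)))."""
    pair, left, right, N = Counter(), Counter(), Counter(), 0
    for (u, w), c in bigrams.items():
        g1, g2 = word2class[u], word2class[w]
        pair[(g1, g2)] += c; left[g1] += c; right[g2] += c; N += c
    return sum(c / N * math.log2((c / N) / (left[g1] / N * right[g2] / N))
               for (g1, g2), c in pair.items())

def cluster(tokens, num_classes):
    bigrams = Counter(zip(tokens, tokens[1:]))
    word2class = {w: w for w in set(tokens)}     # one class per word
    while len(set(word2class.values())) > num_classes:
        classes = sorted(set(word2class.values()))
        # choose the merge that keeps average mutual information highest,
        # i.e. the pair whose merge loses the least information
        a, b = max(combinations(classes, 2),
                   key=lambda ab: avg_mutual_info(
                       bigrams, {w: ab[0] if g == ab[1] else g
                                 for w, g in word2class.items()}))
        word2class = {w: a if g == b else g for w, g in word2class.items()}
    return word2class
```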
Example
• Syntactic classes
  – verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν
  – nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα
  – adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος
• Semantic classes
  – last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης
  – countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία
  – numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο
• Some not so well defined classes
  – ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε
  – εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος
Stem-based Classes u άγνωστ:άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα, u βλέπ:βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν u εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται u εξοχικ:εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές u ιστορικ:ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά u καθηγητ:καθηγητής, καθηγητή, καθηγητές, καθηγητών u μαχητικ:μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά
Experimental Results
[Table: class-based model perplexity (PP) on 1M, 5M, and 35M words for varying numbers of classes G, including POS-based and stem-based classes]
Example
• Interpolate class-based and word-based models:
  $P(w_i \mid w_{i-1}) = \lambda\, P_{\text{word}}(w_i \mid w_{i-1}) + (1-\lambda)\, P(w_i \mid g_i)\, P(g_i \mid g_{i-1})$
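A minimal sketch of the interpolation; both component models are assumed to be callables returning conditional probabilities, e.g. the earlier word-bigram and class-bigram sketches wrapped as functions, and λ = 0.7 is an illustrative default.

```python
def interp_word_class_prob(prev, word, word_model, class_model, lam=0.7):
    """lam * P_word(word | prev) + (1 - lam) * P_class(word | prev)."""
    return lam * word_model(prev, word) + (1 - lam) * class_model(prev, word)
```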
Experimental Results
[Table: perplexity (PP) and word error rate (WER) of the interpolated models on 1M, 5M, and 35M words, for G = 133, POS-based, and stem-based classes]
Hit Rate
[Tables: 1-gram, 2-gram, and 3-gram hit rates (%) on 1M, 5M, and 35M words, for the word-based and the class-based models]
Experimental Results
[Table: perplexity (PP) and word error rate (WER) on 1M, 5M, and 35M words for the maximum-entropy models ME 3gram and ME 3gram+stem]
[Table: perplexity (PP) and word error rate (WER) on 1M, 5M, and 35M words for the back-off 3-gram (BO 3gram) and the interpolated 3gram+stem model]
Where do we go from here?
• Use syntactic information to capture long-distance dependencies:
  "The dog on the hill barked" – "barked" depends on the head word "dog", which lies outside the trigram window
• Constraints