
1 Language Models For Speech Recognition

2 Speech Recognition
- A = a_1, a_2, ..., a_T: sequence of acoustic vectors
- Find the word sequence W so that: W* = argmax_W p(W | A) = argmax_W p(A | W) p(W)
- The task of a language model is to make available to the recognizer adequate estimates of the probabilities p(W)

3 Language Models
- A language model assigns a probability p(W) to every word sequence W = w_1, w_2, ..., w_n
- By the chain rule: p(w_1, ..., w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_1, ..., w_{n-1})
- Each word is predicted from its full history h_i = w_1, ..., w_{i-1}

4 N-gram models
- Make the Markov assumption that only the prior local context – the last (N-1) words – affects the next word
- N=3: trigrams
- N=2: bigrams
- N=1: unigrams

5 Parameter estimation
- Maximum Likelihood Estimator
- N=3, trigrams: p(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2} w_{i-1} w_i) / c(w_{i-2} w_{i-1})
- N=2, bigrams: p(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
- N=1, unigrams: p(w_i) = c(w_i) / N
- This will assign zero probabilities to unseen events
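
The counts in the bullets above can be gathered directly from a tokenized corpus. A minimal Python sketch (not the presentation's code; the toy corpus and function names are illustrative):

```python
from collections import Counter

def mle_ngram_probs(tokens, n):
    """Maximum-likelihood estimates: p(w | history) = c(history, w) / c(history)."""
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    histories = Counter()
    for gram, count in ngrams.items():
        histories[gram[:-1]] += count
    return {gram: count / histories[gram[:-1]] for gram, count in ngrams.items()}

# Toy corpus; any n-gram absent from it gets probability zero under MLE.
tokens = "the dog on the hill barked at the dog".split()
bigram_p = mle_ngram_probs(tokens, 2)
print(bigram_p[("the", "dog")])  # 2 occurrences of "the dog" / 3 of "the" ≈ 0.667
```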

6 Number of Parameters
- For a vocabulary of size V, a 1-gram model has V - 1 independent parameters
- A 2-gram model has V^2 - 1 independent parameters
- In general, an n-gram model has V^n - 1 independent parameters
- Typical values for a moderate size vocabulary of 20,000 words:

  Model    Parameters
  1-gram   20,000
  2-gram   20,000^2 = 400 million
  3-gram   20,000^3 = 8 trillion

7 Number of Parameters
- |V| = 60,000, N = 35M (Eleftherotypia daily newspaper)

  Count   1-grams   2-grams     3-grams
  1       160,273   3,877,976   13,128,073
  2       51,725    784,012     1,802,348
  3       27,171    314,114     562,264
  >0      390,796   5,834,632   16,515,051
  >=0     390,796   36x10^8     216x10^12

- In a typical training text, roughly 80% of trigrams occur only once
- Good-Turing estimate: ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams

8 Problems
- Data sparseness: we do not have enough data to train the model parameters
Solutions
- Smoothing techniques: accurately estimate probabilities in the presence of sparse data
  - Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off)
- Build compact models: they have fewer parameters to train and thus require less data
  - equivalence classification of words (e.g. grammatical categories (noun, verb, adjective, preposition), semantic labels (city, name, date))

9 Smoothing
- Make distributions more uniform
- Redistribute probability mass from higher to lower probabilities

10 Additive Smoothing
- For each n-gram that occurs r times, pretend that it occurs r+1 times
- e.g. bigrams: p(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / (c(w_{i-1}) + V)
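
A small sketch of the add-one estimate for bigrams, following the formula above; the toy corpus and names are illustrative, not from the presentation:

```python
from collections import Counter

def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (additive) estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = "the dog on the hill barked at the dog".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
V = len(unigrams)
# An unseen bigram now gets a small nonzero probability instead of zero.
print(add_one_bigram_prob("dog", "barked", bigrams, unigrams, V))  # (0 + 1) / (2 + 6) = 0.125
```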

11 Good-Turing Smoothing
- For any n-gram that occurs r times, pretend that it occurs r* times: r* = (r + 1) n_{r+1} / n_r, where n_r is the number of n-grams which occur r times
- To convert this count to a probability we just normalize: p = r* / N
- Total probability of unseen n-grams: n_1 / N

12 Example

  r (=MLE)   n_r             r* (=GT)
  0          3,594,165,368   0.001078
  1          3,877,976       0.404
  2          784,012         1.202
  3          314,114         2.238
  4          175,720         3.187
  5          112,006         4.199
  6          78,391          5.238
  7          58,661          6.270
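
The r* column above follows directly from the Good-Turing formula; a small illustrative check in Python, with the count-of-counts copied from the trigram table:

```python
def good_turing_adjusted_count(r, count_of_counts):
    """Good-Turing: r* = (r + 1) * n_{r+1} / n_r, with n_r = number of n-grams seen r times."""
    return (r + 1) * count_of_counts[r + 1] / count_of_counts[r]

# Count-of-counts from the table (n_0 = number of unseen trigrams).
n = {0: 3_594_165_368, 1: 3_877_976, 2: 784_012, 3: 314_114, 4: 175_720}
print(good_turing_adjusted_count(0, n))  # ~0.001078
print(good_turing_adjusted_count(1, n))  # ~0.404
```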

13 Jelinek-Mercer Smoothing (linear interpolation)
- Interpolate a higher-order model with a lower-order model: p_interp(w_i | w_{i-1}) = λ p_ML(w_i | w_{i-1}) + (1 - λ) p_ML(w_i)
- Given fixed p_ML, it is possible to search efficiently for the λ that maximizes the probability of some data using the Baum-Welch algorithm
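
A minimal sketch of the interpolation and of a one-parameter EM re-estimation of λ on held-out data (the degenerate case of Baum-Welch the slide refers to); all function and variable names are illustrative assumptions:

```python
def jelinek_mercer_prob(w_prev, w, lam, p_ml_bigram, p_ml_unigram):
    """Linear interpolation of a higher-order and a lower-order ML estimate."""
    return lam * p_ml_bigram.get((w_prev, w), 0.0) + (1 - lam) * p_ml_unigram.get(w, 0.0)

def reestimate_lambda(held_out_bigrams, lam, p_ml_bigram, p_ml_unigram, iters=20):
    """EM for the mixture weight: average posterior weight of the higher-order model.

    Assumes 0 < lam < 1 initially and that every held-out word has nonzero unigram probability.
    """
    for _ in range(iters):
        posterior_sum = 0.0
        for w_prev, w in held_out_bigrams:
            hi = lam * p_ml_bigram.get((w_prev, w), 0.0)
            lo = (1 - lam) * p_ml_unigram[w]
            posterior_sum += hi / (hi + lo)
        lam = posterior_sum / len(held_out_bigrams)
    return lam
```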

14 Katz Smoothing (backing-off)
- For those events which have been observed in the training data we assume some reliable estimate of the probability
- For the remaining unseen events we back off to some less specific distribution
- The back-off weight α is chosen so that the total probability sums to 1
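
A simplified back-off sketch in the spirit of Katz, for bigrams backing off to unigrams; it assumes discounted probabilities for the observed bigrams are already available (e.g. from Good-Turing), and all names are illustrative:

```python
def katz_bigram_prob(w_prev, w, discounted_p, unigram_p, seen_after):
    """Back off to the unigram distribution for bigrams unseen in training.

    discounted_p[(w_prev, w)] : discounted estimate for each observed bigram
    seen_after[w_prev]        : set of words observed after w_prev in training
    """
    if (w_prev, w) in discounted_p:
        return discounted_p[(w_prev, w)]
    # Probability mass left over after discounting the observed bigrams...
    left_over = 1.0 - sum(discounted_p[(w_prev, v)] for v in seen_after[w_prev])
    # ...spread over the unigram mass of the words never seen after w_prev,
    # so that the whole distribution sums to 1.
    unseen_mass = 1.0 - sum(unigram_p[v] for v in seen_after[w_prev])
    alpha = left_over / unseen_mass
    return alpha * unigram_p[w]
```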

15 Witten-Bell Smoothing
- Model the probability of new events by estimating how often such a new event is seen as we proceed through the training corpus (i.e. from the total number of word types in the corpus)

16 Absolute Discounting
- Subtract a constant D from each nonzero count
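
A sketch of the discounted bigram estimate, with the mass freed by the D-subtraction redistributed through a lower-order distribution; the exact interpolation form and the names are assumptions for illustration:

```python
def abs_discount_bigram_prob(w_prev, w, D, bigram_counts, unigram_counts, seen_after, p_lower):
    """Subtract D (0 < D <= 1) from each nonzero bigram count and give the freed mass to p_lower."""
    c_hist = unigram_counts[w_prev]
    higher = max(bigram_counts[(w_prev, w)] - D, 0) / c_hist
    # Mass freed by discounting the distinct bigrams that start with w_prev.
    backoff_weight = D * len(seen_after[w_prev]) / c_hist
    return higher + backoff_weight * p_lower(w)
```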

17 Kneser-Ney
- Lower-order distribution not proportional to the number of occurrences of a word, but to the number of different words that it follows
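
A sketch of the continuation counts behind this lower-order distribution: each word is weighted by the number of distinct words it follows, normalized by the number of distinct bigram types (illustrative code, not the presentation's):

```python
from collections import Counter

def continuation_unigram_probs(bigram_counts):
    """Kneser-Ney lower-order distribution, built from bigram *types* rather than tokens."""
    histories_per_word = Counter(w for (_, w) in bigram_counts)  # distinct left contexts of w
    total_bigram_types = len(bigram_counts)
    return {w: n / total_bigram_types for w, n in histories_per_word.items()}
```

A frequent word that follows only a few distinct words (the usual "Francisco" after "San" example) gets a low continuation probability, which is exactly the behaviour the bullet describes.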

18 Modified Kneser-Ney

19 Measuring Model Quality
- Consider the language as an information source L, which emits a sequence of symbols w_i from a finite alphabet (the vocabulary)
- The quality of a language model M can be judged by its cross entropy with regard to the distribution P_T(x) of some hitherto unseen text T: H(P_T; M) = -Σ_x P_T(x) log_2 P_M(x)
- Intuitively speaking, cross entropy is the entropy of T as "perceived" by the model M

20 Perplexity
- Perplexity: PP = 2^H, where H is the cross entropy of the model on the text
- In a language with perplexity X, every word can be followed by X different words with equal probabilities
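
A minimal sketch tying the two slides together: per-word cross entropy on a test sequence and PP = 2^H; the uniform model in the example just illustrates the "X equally likely successors" reading, and the names are illustrative:

```python
import math

def perplexity(test_tokens, prob):
    """PP = 2^H, where H is the per-word cross entropy of the model on the test text."""
    log_prob = sum(math.log2(prob(test_tokens[i], test_tokens[i - 1]))
                   for i in range(1, len(test_tokens)))
    H = -log_prob / (len(test_tokens) - 1)
    return 2 ** H

# A model that gives every word probability 1/60000 has perplexity 60000.
print(perplexity(["a", "b", "c"], lambda w, h: 1 / 60000))  # ≈ 60000
```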

21 Elements of Information Theory
- Entropy: H(X) = -Σ_x p(x) log_2 p(x)
- Mutual Information: I(X; Y) = Σ_{x,y} p(x, y) log_2 [p(x, y) / (p(x) p(y))]; pointwise: log_2 [p(x, y) / (p(x) p(y))]
- Kullback-Leibler (KL) divergence: D(p || q) = Σ_x p(x) log_2 [p(x) / q(x)]

22 The Greek Language
- Highly inflectional language
- A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage

                      English               French     Greek            German
  Source              Wall Street Journal   Le Monde   Eleytherotypia   Frankfurter Rundschau
  Corpus size         37.2 M                37.7 M     35 M             31.5 M
  Distinct words      165 K                 280 K      410 K            500 K
  Vocabulary size     60 K                  60 K       60 K             60 K
  Lexical coverage    99.6 %                98.3 %     96.5 %           95.1 %

23 Perplexity

                      English   French   Greek   German
  Vocabulary size          20 K               64 K
  2-gram PP           198       178      232     430
  3-gram PP           135       119      163     336

24 Experimental Results

                           1M             5M             35M
  Smoothing                PP     WER     PP     WER     PP     WER
  Good-Turing              341    27.71   248    23.48   163    19.59
  Witten-Bell              354    27.42   251    24.17   163    19.84
  Absolute Discounting     344    28.47   256    24.25   169    20.78
  Modified Kneser-Ney      328    26.78   237    21.91   156    18.57

         1M      5M      35M
  OOV    4.75%   3.46%   3.17%

25 Hit Rate

            hit rate % (1M)   hit rate % (5M)   hit rate % (35M)
  1-gram    27.3              16.4              7.4
  2-gram    52.5              49.9              40
  3-gram    20.2              33.7              52.6

26 Class-based Models
- Some words are similar to other words in their meaning and syntactic function
- Group words into classes
  - Fewer parameters
  - Better estimates

27 Class-based n-gram models
- Suppose that we partition the vocabulary into G classes
- This model produces text by first generating a string of classes g_1, g_2, ..., g_n and then converting them into the words w_i, i = 1, 2, ..., n with probability p(w_i | g_i)
- An n-gram model has V^n - 1 independent parameters (216x10^12)
- A class-based model has G^n - 1 + V - G parameters (~10^9)
  - G^n - 1 of an n-gram model for a vocabulary of size G
  - V - G of the form p(w_i | g_i)
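
The generative story in the second bullet corresponds to the following probability computation for a class-based bigram (a sketch; the lookup-table representation and names are assumptions):

```python
def class_bigram_prob(w_prev, w, word_class, p_word_given_class, p_class_bigram):
    """Class-based bigram: p(w | w_prev) = p(w | g(w)) * p(g(w) | g(w_prev))."""
    g_prev, g = word_class[w_prev], word_class[w]
    return p_word_given_class[(w, g)] * p_class_bigram[(g_prev, g)]
```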

28 Relation to n-grams

29 Defining Classes
- Manually
  - Use part-of-speech labels assigned by linguistic experts or a tagger
  - Use stem information
- Automatically
  - Cluster words as part of an optimization method, e.g. maximize the log-likelihood of test text

30 Agglomerative Clustering
- Bottom-up clustering
- Start with a separate cluster for each word
- Merge that pair for which the loss in average mutual information (MI) is least
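
A naive sketch of this merging loop: it re-evaluates the average mutual information of the class bigram distribution for every candidate merge, which is far slower than the incremental updates used in practice (e.g. Brown clustering), but it shows the criterion; all names are illustrative:

```python
import math
from collections import Counter
from itertools import combinations

def average_mi(class_bigrams):
    """Average mutual information I(g1; g2) of adjacent class pairs."""
    total = sum(class_bigrams.values())
    left, right = Counter(), Counter()
    for (g1, g2), c in class_bigrams.items():
        left[g1] += c
        right[g2] += c
    return sum((c / total) * math.log2(c * total / (left[g1] * right[g2]))
               for (g1, g2), c in class_bigrams.items())

def merge_counts(class_bigrams, a, b):
    """Relabel cluster b as a in the class-bigram counts."""
    merged = Counter()
    for (g1, g2), c in class_bigrams.items():
        merged[(a if g1 == b else g1, a if g2 == b else g2)] += c
    return merged

def agglomerative_clustering(tokens, target_clusters):
    cluster_of = {w: w for w in set(tokens)}        # start: one cluster per word
    bigrams = Counter(zip(tokens, tokens[1:]))      # class bigrams == word bigrams at first
    while len(set(cluster_of.values())) > target_clusters:
        clusters = sorted(set(cluster_of.values()))
        # Pick the merge whose resulting counts keep average MI highest (least MI loss).
        _, a, b = max((average_mi(merge_counts(bigrams, a, b)), a, b)
                      for a, b in combinations(clusters, 2))
        bigrams = merge_counts(bigrams, a, b)
        cluster_of = {w: (a if g == b else g) for w, g in cluster_of.items()}
    return cluster_of
```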

31 Example
- Syntactic classes
  - verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν
  - nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα
  - adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος
- Semantic classes
  - last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης
  - countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία
  - numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο
- Some not so well defined classes
  - ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε
  - εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος

32 Stem-based Classes
- άγνωστ: άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα
- βλέπ: βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν
- εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται
- εξοχικ: εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές
- ιστορικ: ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά
- καθηγητ: καθηγητής, καθηγητή, καθηγητές, καθηγητών
- μαχητικ: μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά

33 Experimental Results

  G              PP (1M)   PP (5M)   PP (35M)
  1              1309      1461      1503
  133 (POS)      1047      1143      1167
  500            -         -         314
  1000           -         -         266
  2000           -         -         224
  30000 (stem)   383       299       215
  60000          328       237       156

34 Example
- Interpolate class-based and word-based models

35 Experimental Results

                    1M              5M              35M
  G                 PP     WER      PP     WER      PP     WER
  133 (POS)         325    27.11    236    22.00    156    18.52
  500               -      -        -      -        151    18.63
  1000              -      -        -      -        150    18.61
  2000              -      -        -      -        149    18.65
  30000 (stem)      319    26.99    232    22.04    154    18.44
  60000             328    26.78    237    21.91    156    18.57

36 Hit Rate

            hit rate % (1M)   hit rate % (5M)   hit rate % (35M)
  1-gram    21.3              12.1              5.1
  2-gram    56                50.4              37.6
  3-gram    22.7              37.6              57.4

            hit rate % (1M)   hit rate % (5M)   hit rate % (35M)
  1-gram    27.3              16.4              7.4
  2-gram    52.5              49.9              40
  3-gram    20.2              33.7              52.6

37 Experimental Results

                         1M              5M              35M
  Model                  PP     WER      PP     WER      PP     WER
  ME 3gram               331    26.83    239    21.94    158    18.60
  ME 3gram+stem          320    26.54    227    21.66    143    18.29

                         1M              5M              35M
  Model                  PP     WER      PP     WER      PP     WER
  BO 3gram               328    26.78    237    21.91    156    18.57
  Interp. 3gram+stem     319    26.99    232    22.04    154    18.44

38 Where do we go from here?
- Use syntactic information: "The dog on the hill barked"
- Constraints

