Building Prosodic Structures in a Concept-to-Speech System
Gerasimos Xydas, Dimitris Spiliotopoulos & Georgios Kouroupetroglou
Speech Group, Dep. of Informatics and Telecommunications, University of Athens
{gxydas, dspiliot, koupe}@di.uoa.gr
1st Balkan Conference on Informatics, Thessaloniki, 21-23 November, 2003

Outline
– Prosody
– Concept-to-Speech system
– Concept-to-Speech system (SOLE-ML)
– Corpus
– Training the prosodic models
– Prosody prediction
– Conclusions

Prosody
Prosodic events
– Position and type of:
   Phrase breaks
   Pitch accents
   Phrase accents & Boundary tones
– Prediction of the type and placement of the above.

Concept-to-Speech system: Prosody generation
Traditional Text-to-Speech systems handle plain text.
Difficulties:
– Statistical components (POS tagging, etc.) fail on a percentage of the input
– Lack of underlying focus information
– Only a subset of intonation events is identified and used

Concept-to-Speech system
Concept-to-Speech systems handle an abundance of information:
– Authoring component: MPIRO Authoring tool
– Natural language generator: EXPRIMO
– Speech synthesizer: DEMOSTHeNES
– Format passed from the generator to the synthesizer: SOLE-ML (XML)

Concept-to-Speech system: Prosody generation
Advantages:
– Limited domain, leading to a concrete set of data
– The NLG produces linguistically enriched texts (as opposed to plain text)
– Error-free phrase and part-of-speech tagging
– Use: derive intonational focus points
– Most importantly: explore rhetorical relations in terms of prosody
But: NLG systems usually deal with written text and fail to represent spoken language.

C-t-S system (SOLE-ML)
EXPRIMO → SOLE-ML → DEMOSTHeNES
Represents:
– Enumerated word lists
– Syntactic structure
   Phrase level (phrase type: sentence, NP, PP, etc.)
   Word level (part-of-speech: determiner, noun, verb, preposition, etc.)
   Punctuation, parentheses, etc.
– Canned text (portions of plain text, no extra information)

C-t-S system (SOLE-ML) cont.
Error-free syntax information helps identify intonational focus, but semantic and pragmatic factors also affect it.
The EXPRIMO NLG holds valuable features inside its language generation stages that the initial SOLE-ML specification does not support.
Thus, SOLE-ML needs to be extended.

C-t-S system (SOLE-ML extended)
New specification (noun phrases). The added attributes directly or indirectly imply emphasis:
– Newness / given information: newness [new/old]
– Number of times mentioned before: mentioned-count [integer]
– Whether it is the second argument to the verb: arg2 [true/false]
– Whether there is deixis: genitive-deixis, accusative-deixis [true/false]
– Whether it is a proper noun phrase: proper-group [true/false]

C-t-S system (SOLE-ML example)
… που δημιουργήθηκε κατά τη διάρκεια της αρχαϊκής περιόδου … … …
(English: "… that was created during the archaic period …")

Corpus (general)
– Training: 516 utterances, 5380 words, 13214 syllables
– Test: 1509 syllables
– The test data contains a fair distribution of the features of interest
– Male and female professional speakers

Corpus (focus)
FOCUS LEVELS:
– Strong focus: [newness=new] AND [arg2=true] AND [proper-group=true] AND [(genitive-deixis) OR (accusative-deixis)]
– Normal focus: [newness=old] AND [arg2=true] AND [proper-group=true] AND [(genitive-deixis) OR (accusative-deixis)]
– Weak focus: [newness=old]
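The focus rules above can be read as a small decision function. The following is a minimal sketch, not the authors' implementation: the `NounPhrase` class and its field names are hypothetical stand-ins for the extended SOLE-ML attributes, and the rules are tested from strongest to weakest. Combinations the slide does not cover (for example a new noun phrase without deixis) return `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NounPhrase:
    """Hypothetical container for the extended SOLE-ML noun-phrase attributes."""
    newness: str                     # "new" or "old"
    arg2: bool = False               # second argument to the verb
    proper_group: bool = False       # proper noun phrase
    genitive_deixis: bool = False
    accusative_deixis: bool = False


def focus_level(np: NounPhrase) -> Optional[str]:
    """Map attribute values to a focus level, following the slide's rules."""
    deixis = np.genitive_deixis or np.accusative_deixis
    all_cues = np.arg2 and np.proper_group and deixis
    if np.newness == "new" and all_cues:
        return "strong"
    if np.newness == "old" and all_cues:
        return "normal"
    if np.newness == "old":
        return "weak"
    return None  # combination not covered by the slide


if __name__ == "__main__":
    phrase = NounPhrase("new", arg2=True, proper_group=True, genitive_deixis=True)
    print(focus_level(phrase))
```

Checking the strong rule first matters only for readability here, since the strong and normal rules differ in `newness` and cannot both match.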

Corpus (procedure)
– Text corpora annotated by the DEMOSTHeNES XML export component; visualization in RTF format.
– Voice corpora segmented and hand-annotated using GRToBI by 3 expert linguists.
– Post-processing (groupings):
   Pitch accents grouped (to eliminate errors from low-frequency occurrences)
   Phrase accents and boundary tones (they co-occur in GRToBI)
   Break indices (sandhi, mismatch, pause marks)
– RESULT: a. more robust results, b. far fewer mismatches when evaluating the human annotations

Corpus (pitch accents)
Feature          accent 1   accent 2   accent 3   accent 4   accent 5
Main accent      L*         H*         L*+H       L+H*       H*+L
Diacritics:
  downstep                  !H*        L*+!H      L+!H*      !H*+L
  weak                                 wL*+H
  early                                >L*+H
  late                                 <L*+H
  low point      wL*
Occurrences %    9.23       12.20      32.65      27.13      18.79
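The grouping of pitch accents can be illustrated by folding each diacritic variant into its main accent. This is a sketch under an assumption: the slides do not spell out the grouping procedure, so stripping the downstep (`!`), weak or low-point (`w`), early (`>`) and late (`<`) marks is only one plausible reading of the table. The function name `main_accent` is hypothetical.

```python
# The five main accent classes listed in the table.
MAIN_ACCENTS = {"L*", "H*", "L*+H", "L+H*", "H*+L"}


def main_accent(label: str) -> str:
    """Reduce a GRToBI pitch-accent label to its main accent class.

    Assumption: a variant is grouped with the accent obtained by removing
    the downstep mark '!' anywhere in the label and a leading 'w', '>' or '<'.
    """
    stripped = label.replace("!", "").lstrip("w<>")
    if stripped not in MAIN_ACCENTS:
        raise ValueError(f"unknown pitch accent label: {label!r}")
    return stripped


if __name__ == "__main__":
    for variant in ["!H*", "L*+!H", "L+!H*", "!H*+L", "wL*+H", ">L*+H", "<L*+H", "wL*"]:
        print(variant, "->", main_accent(variant))
```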

Corpus (endtones)
Feature               endtone 1  endtone 2  endtone 3  endtone 4  endtone 5  endtone 6  endtone 7  endtone 8
Main tone             L-         H-         L%         H%         L-L%       L-H%       H-L%       H-H%
Downstep diacritics              !H-                   !H%                   L-!H%      !H-L%      !H-H%, H-!H%, !H-!H%
Occurrences %         4.88       48.7       0          0          45.19      0.47       0          0.88

Corpus (break indices)
Break index   Occurrences (%)
0             32.1
1             47.23
2             11.08
3             9.59

Training prosodic models
Used Classification and Regression Trees (CART), built with the wagon software.
Built 3 prosodic models:
– Phrase break model (break indices assigned to syllables and placed at word boundaries)
– Accent model (pitch accents assigned to stressed syllables)
– Endtone model (end tones assigned to syllables and placed at phrase boundaries)

Training (features) 1/2
Features are taken for each item plus the two items before it (p, pp) and the two items after it (n, nn), in the Syllable, Word, and Phrase relations (40 parameters).
Features (generic):
– R:SylStructure.parent.gpos (part-of-speech of word)
– stress (lexical stress)
– Syl_in (number of syllables since last phrase break)
– Syl_out (number of syllables until next phrase break)
– Ssyl_in (number of stressed syllables since last phrase break)
– Ssyl_out (number of stressed syllables until next phrase break)
– R:SylStructure.parent.R:Phrase.parent.punc (phrase punctuation)
Features (SOLE specific):
– R:SylStructure.parent.R:Phrase.parent.newness
– R:SylStructure.parent.R:Phrase.parent.arg2
– R:SylStructure.parent.R:Phrase.parent.deixis
Additional features:
– R:SylStructure.parent.bi (break index) [Accent & Endtone models only]
– accent [Endtone model only]
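The context window described above (the current item plus pp, p, n, nn) can be sketched as follows. This is an illustration only: `windowed_features`, the dictionary-based item representation, and the `"0"` padding value for positions that fall outside the utterance are assumptions, not details taken from the slides.

```python
from typing import Dict, List

# Relative positions of the five-item window: two before, current, two after.
OFFSETS = {"pp": -2, "p": -1, "": 0, "n": 1, "nn": 2}


def windowed_features(items: List[Dict[str, object]], index: int,
                      names: List[str], pad: object = "0") -> Dict[str, object]:
    """Collect the named features for items[index] and its four neighbours.

    Neighbour features are keyed with a position prefix, e.g. 'p.stress'.
    Positions outside the list, and missing features, get the pad value.
    """
    row: Dict[str, object] = {}
    for prefix, offset in OFFSETS.items():
        j = index + offset
        neighbour = items[j] if 0 <= j < len(items) else None
        for name in names:
            key = f"{prefix}.{name}" if prefix else name
            row[key] = neighbour.get(name, pad) if neighbour is not None else pad
    return row


if __name__ == "__main__":
    syllables = [
        {"stress": 1, "gpos": "At"},
        {"stress": 0, "gpos": "No"},
        {"stress": 1, "gpos": "Vb"},
    ]
    print(windowed_features(syllables, 0, ["stress", "gpos"]))
```

Each call yields one training row with five values per feature name, which is the shape a tree learner expects for a single item.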

Training (features) 2/2
R:SylStructure.parent.gpos tagset:
– Vb: VerB
– Aj: AdJective
– No: Noun
– At: ArTicle
– Cj: ConJunction
– Pn: ProNoun
– Pp: PrePosition
– Ad: Adverb
– Pt: Particle

Training (phrase break)
train / test   0      1      2     3     Score        Cor. (%)
0              1715   11     0     1     1715/1727    99.305
1              71     2461   7     1     2461/2540    96.890
2              11     26     555   4     555/596      93.121
3              2      3      8     503   503/516      97.481
Selected features: gpos, syl_in, syl_out, newness, deixis, stress, punc.
Overall: 97.304%
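The Score and Cor. columns, and the overall figure, follow directly from the confusion matrix: each row's diagonal count divided by the row total, and the sum of the diagonal divided by the grand total. A short sketch that recomputes them from the phrase break counts (the function name `per_class_and_overall` is hypothetical):

```python
from typing import List, Tuple


def per_class_and_overall(matrix: List[List[int]]) -> Tuple[List[float], float]:
    """Return per-row accuracy (%) and overall accuracy (%) of a square confusion matrix."""
    per_class = []
    correct = 0
    total = 0
    for i, row in enumerate(matrix):
        row_total = sum(row)
        per_class.append(100.0 * row[i] / row_total)
        correct += row[i]
        total += row_total
    return per_class, 100.0 * correct / total


# Counts from the phrase break table, rows and columns in break index order 0..3.
PHRASE_BREAK = [
    [1715, 11, 0, 1],
    [71, 2461, 7, 1],
    [11, 26, 555, 4],
    [2, 3, 8, 503],
]

if __name__ == "__main__":
    per_class, overall = per_class_and_overall(PHRASE_BREAK)
    for index, value in enumerate(per_class):
        print(f"break index {index}: {value:.3f}%")
    print(f"overall: {overall:.3f}%")
```

The row totals add up to 5379 items, of which 5234 lie on the diagonal, which reproduces the 97.304% reported on the slide.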

Training (accent tone model)
train / test   NONE   L+H*   L*+H   H*+L   H*    L*    Score        Cor. (%)
NONE           9612   2      1      1      0     0     9612/9616    99.958
L+H*           6      964    2      1      3     0     964/976      98.770
L*+H           8      13     1151   0      3     0     1151/1175    97.957
H*+L           4      4      0      667    1     0     667/676      98.669
H*             8      4      1      12     414   0     414/439      94.305
L*             6      2      0      3      1     320   320/332      96.386
Selected features: gpos, syl_in, syl_out, bi, newness, arg2, deixis.
Overall: 99.349%

Training (endtone model)
train / test   NONE    L-L%   L-H%   H-H%   H-    L-    Score          Cor. (%)
NONE           12293   0      0      0      1     0     12293/12294    99.992
L-L%           0       417    0      0      0     0     417/417        100.000
L-H%           0       0      4      0      0     0     4/4            100.000
H-H%           0       0      0      5      0     0     5/5            100.000
H-             0       0      0      0      449   0     449/449        100.000
L-             0       0      0      0      0     45    45/45          100.000
Selected features: syl_in, bi, punc.
Overall: 99.992%

Prosody prediction - example
“Αυτό το έκθεμα είναι ένας στατήρας που δημιουργήθηκε κατά την διάρκεια της ελληνιστικής περιόδου.”
(English: "This exhibit is a stater that was created during the Hellenistic period.")
[a - fto L+H* ] 1 – [to] 0 – [e H*+L - kTe – ma] 1
[i L*+H – ne] 1 – [e – nas] 0 – [sta - ti H*+L - ras] 2 - [H-]
[pu] 0 – [Di - mi - u - rji L*+H - Ti – ce] 1 – [ka – ta] 0 – [ti] 0 – [Dja H* - rci – a] 1
[tis] 0 – [e - li - ni - sti - cis L*+H ] 1 – [pe - ri - o H*+L – Du] 3 - [L-L%]
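The bracket notation in the example can be read mechanically: a bracket followed by a digit is a word with its break index, syllables inside it are separated by dashes, a label after a syllable is its pitch accent, and a bracket with no digit after it is an endtone for the preceding word. The parser below is a sketch based only on that reading of the slide; `parse_prosody` and the dictionary layout are hypothetical.

```python
import re
from typing import Dict, List

# A bracketed group, optionally followed by a one-digit break index.
BRACKET = re.compile(r"\[([^\]]*)\]\s*(\d)?")
# Syllables are separated by a hyphen or en dash with spaces on both sides.
SYLLABLE_SEP = re.compile(r"\s+[-–]\s+")


def parse_prosody(line: str) -> List[Dict[str, object]]:
    """Parse the slide's notation into words with syllables, accents, breaks and endtones."""
    words: List[Dict[str, object]] = []
    for content, break_index in BRACKET.findall(line):
        content = content.strip()
        if break_index == "":
            # No break index after the bracket: treat it as an endtone label.
            if words:
                words[-1]["endtone"] = content
            continue
        syllables = []
        for chunk in SYLLABLE_SEP.split(content):
            parts = chunk.split()
            accent = parts[1] if len(parts) > 1 else None
            syllables.append((parts[0], accent))
        words.append({"syllables": syllables, "break": int(break_index), "endtone": None})
    return words


if __name__ == "__main__":
    excerpt = "[a - fto L+H* ] 1 – [to] 0 – [sta - ti H*+L - ras] 2 - [H-]"
    for word in parse_prosody(excerpt):
        print(word)
```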

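Sample 1 (exhibit11-1.xml)
Αυτό το έκθεμα είναι ένας στατήρας, που δημιουργήθηκε κατά τη διάρκεια της ελληνιστικής περιόδου. Χρονολογείται ανάμεσα στο 220 και το 189 π.Χ.. Στον εμπροσθότυπο του νομίσματος, κεφάλι Αθηνάς, μια δημοφιλής απεικόνιση στα νομίσματα του αρχαίου ελληνικού κόσμου, με κορινθιακό κράνος και στον οπισθότυπο, θηλυκή μορφή, προσωποποίηση της Αιτωλίας, καθισμένη σε μακεδονικές και γαλατικές ασπίδες. Η σκηνή αναφέρεται στη μάχη των Αιτωλών ενάντια στους Μακεδόνες και στους Γαλάτες. Αυτός ο στατήρας έχει φτιαχτεί από χρυσό και προέρχεται από την Αιτωλική Συμπολιτεία
English translation: This exhibit is a stater, created during the Hellenistic period. It dates from between 220 and 189 BC. On the obverse of the coin, a head of Athena, a popular depiction on the coins of the ancient Greek world, wearing a Corinthian helmet; on the reverse, a female figure, a personification of Aetolia, seated on Macedonian and Gallic shields. The scene refers to the battle of the Aetolians against the Macedonians and the Gauls. This stater is made of gold and comes from the Aetolian League.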

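Sample 2 (exhibit25-1.xml)
Αυτό το έκθεμα είναι ένα ανάγλυφο, που δημιουργήθηκε κατά τη διάρκεια της ελληνιστικής περιόδου. Η ελληνιστική περίοδος καλύπτει το χρονικό διάστημα ανάμεσα στο 323 και το 31 π.Χ.. Αυτό το ανάγλυφο χρονολογείται ανάμεσα στο 313 και το 312 π.Χ.. Απεικονίζει το δαφνοστεφανωμένο Διόνυσο καθιστό και απέναντί του ένα σάτυρο να στέκεται κρατώντας οινοχόη. Στο επιστύλιο κρέμονται πέντε προσωπεία. Από αριστερά: το προσωπείο του δύστροπου πατέρα, της γριάς γυναίκας, του πονηρού δούλου, του αγένιου νέου και της νέας με την κοντή κόμη. Σήμερα αυτό το ανάγλυφο βρίσκεται στο Επιγραφικό Μουσείο Αθηνών.
English translation: This exhibit is a relief, created during the Hellenistic period. The Hellenistic period covers the time span between 323 and 31 BC. This relief dates from between 313 and 312 BC. It depicts the laurel-crowned Dionysus seated and, opposite him, a satyr standing and holding an oinochoe. Five masks hang from the epistyle. From the left: the mask of the cantankerous father, of the old woman, of the cunning slave, of the beardless youth, and of the young woman with the short hair. Today this relief is in the Epigraphical Museum of Athens.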

Technology overview
CONCEPT-TO-SPEECH SYSTEM:
– EXPRIMO natural language generator, based on ILEX
– Domain data entered and updated through the MPIRO Authoring Tool
– SOLE markup language facilitates linguistically enriched texts
– DEMOSTHeNES Speech Composer system

Conclusion
Concept-to-Speech system:
– Enriched linguistic meta-information (XML, SOLE-ML)
– Evidence of stress and intonational focus
Corpus, CART training.
Prosody models:
– phrase breaks,
– accent tone,
– endtone.
Results:
– A large set of features and parameters improves prediction.
– Results on the restricted-domain text suggest that specific features may also be useful for other texts.
– The prosody models can be applied to plain text (large amounts of untagged data).

University of Athens Speech group www.di.uoa.gr/speech
