Incorporating prosody into neural speech processing pipelines

  1. Öktem, Hamdi Alp
Supervised by:
  1. Antonio Bonafonte Cávez Supervisor
  2. Mireia Farrús Cabeceran Co-supervisor

University of defense: Universitat Pompeu Fabra

Defense date: 25 February 2019

Jury:
  1. David Escudero Mancebo President
  2. Francesc Alías Pujol Secretary
  3. Jordi Adell Mercado Rapporteur

Type: Thesis

Teseo: 582833

Abstract

In this dissertation, I study the inclusion of prosody in two applications that involve speech understanding: automatic speech transcription and spoken language translation. For the former, I propose a punctuation generation method that uses an attention mechanism over parallel sequences of prosodic and morphosyntactic features. The method achieves an overall F1 score of 70.3% for punctuation generation. For the latter, I enhance spoken language translation with prosody. A neural machine translation system trained on movie-domain data is adapted with pause features using a prosodically annotated bilingual dataset. Results show that prosodic punctuation generation as a preliminary step to translation increases translation accuracy by 1% in terms of BLEU score. Encoding pauses as an additional input feature yields a further 1% increase. The system is then extended to jointly predict pause features, so that they can serve as input to a text-to-speech system.
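
As an illustration of how a punctuation generation F1 score such as the one quoted above can be computed, the following is a minimal sketch and not the evaluation code used in the dissertation. It assumes one punctuation label per word, a hypothetical label "O" for "no punctuation", and micro-averaging over all punctuation classes; the exact definition of the "overall" score in the dissertation may differ.

```python
NO_PUNCT = "O"  # hypothetical label meaning "no punctuation after this word"


def punctuation_f1(reference, predicted, ignore=NO_PUNCT):
    """Micro-averaged precision, recall and F1 over punctuation labels.

    Both arguments are sequences with one label per word. Positions where
    both sequences carry the `ignore` label do not affect the score.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = 0
    for ref, hyp in zip(reference, predicted):
        if hyp != ignore and hyp == ref:
            tp += 1  # correct punctuation mark in the correct slot
        else:
            if hyp != ignore:
                fp += 1  # predicted a mark that is absent or of another type
            if ref != ignore:
                fn += 1  # missed a reference mark, or predicted the wrong type

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Under this convention, a mark of the wrong type counts both as a false positive and as a false negative, so confusing a comma with a period is penalized on both precision and recall.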