Speech synthesis applied to speech-to-speech translation

  1. Agüero, Pablo Daniel
Supervised by:
  1. Antonio Bonafonte Cávez (Supervisor)

Defending university: Universitat Politècnica de Catalunya (UPC)

Defense date: 23 October 2012

Examination committee:
  1. María Asunción Moreno Bilbao (Chair)
  2. Francesc Alías Pujol (Secretary)
  3. David Escudero Mancebo (Member)

Type: Thesis

Teseo: 114684

Abstract

In the field of speech technologies, text-to-speech conversion is the automatic generation of artificial voices that sound like a human reading a text aloud. Inside a text-to-speech system, the prosody module produces the prosodic information needed to generate a natural voice: intonational phrases, sentence intonation, phoneme duration and energy, etc. The correct generation of this information directly affects the naturalness and expressiveness of the system. The main goals of this thesis are the development of new algorithms to train prosody generation models that may be used in a text-to-speech system, and their use in the framework of speech-to-speech translation. Several alternatives were studied for intonation modeling. They combine parameterization and intonation model generation into an integrated process. This approach was rated favorably in both objective and subjective evaluations. The influence of segmental and suprasegmental factors on duration modeling was also studied. Based on the results of these studies, several algorithms were proposed that combine segmental and suprasegmental information, as in other publications in this field. Finally, various phrase break models were analyzed, using both words and accent groups: classification trees (CART), language modeling (LM) and finite state transducers (FST). Using the same data set across the experiments made it possible to draw relevant conclusions about the differences between these models. One of the main goals of this thesis was to improve naturalness, expressiveness and consistency with the style of the source speaker in text-to-speech systems. This may be done by using the prosody of the source speaker as an additional information source in the framework of speech-to-speech translation.
Several prosody generation algorithms were developed that can integrate this additional information for the prediction of intonation, phoneme duration and phrase breaks. To that end, several approaches were studied to transfer intonation from one language to the other. The chosen approach was an automatic clustering algorithm that finds a set of tonal movements that are related between languages, with no limit imposed on their number. This coding can then be used for intonation modeling of the target language. Experimental results show an improvement that is larger for closely related languages, such as Spanish and Catalan. Although no segmental duration transfer was performed between languages, this thesis proposes transferring rhythm from one language to the other. For that purpose, a method that combines rhythm transfer and audio synchronization was proposed. Synchronization is included because of its importance for speech-to-speech translation technology when video is also used. Lastly, this thesis also proposes a pause transfer technique in the framework of speech-to-speech translation, based on alignment information. Studies on the training data showed the advantage of using tuples for this task. To predict any pause that cannot be transferred with the aforementioned method, conventional pause prediction algorithms are used (CART, CART+LM, FST), taking into account the pauses already transferred.
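The idea of transferring pauses through alignment information can be illustrated with a minimal sketch. It is not the method of the thesis, which reports working with tuples. It assumes plain word-level alignment links given as (source index, target index) pairs, and a hypothetical projection rule: a pause after a source word is placed after the last target word aligned to it. The function name `transfer_pauses` and the data format are assumptions made for illustration only.

```python
def transfer_pauses(source_pauses, alignment, target_len):
    """Project pause positions from source words onto target words.

    source_pauses: set of source word indices that are followed by a pause.
    alignment: iterable of (source_index, target_index) links
               (hypothetical format, assumed for this sketch).
    target_len: number of words in the target sentence.
    Returns the sorted list of target word indices to be followed by a pause.
    """
    # Collect the target words linked to each source word.
    links = {}
    for src, tgt in alignment:
        links.setdefault(src, []).append(tgt)

    transferred = set()
    for src in source_pauses:
        targets = links.get(src)
        if not targets:
            # Unaligned source word: nothing to transfer here. Such cases
            # would be left to a conventional pause predictor.
            continue
        tgt = max(targets)  # last target word aligned to the source word
        if tgt < target_len - 1:
            # Skip the sentence-final position, which needs no transfer.
            transferred.add(tgt)
    return sorted(transferred)


# Toy example: 4 source words, 5 target words, monotone alignment in which
# source word 1 is aligned to target words 1 and 2.
toy_alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4)]
print(transfer_pauses({1, 3}, toy_alignment, target_len=5))
```

In this toy case the pause after source word 1 is projected after target word 2, and the sentence-final pause is skipped. Pauses that cannot be projected this way are the ones the abstract leaves to conventional predictors (CART, CART+LM, FST).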