KRISTINA goes to the Opera

July 29, 2017, by Mireia Farrús

The era of the castrati in opera, male singers who kept a vocal range equivalent to that of a soprano, mezzo-soprano, or contralto as a result of castration, is long gone. No one could imagine such a practice being carried out nowadays. Fortunately, it was made illegal in the 19th century, long after the Baroque heyday of the castrati, by which time women had taken their place on the opera stage.

At this point, you might be wondering what the Baroque has to do with KRISTINA. Well, technology has evolved so much that we are now able to convert a male voice into a female one. Over the last decade, statistical parametric speech synthesis has become highly competitive with concatenative synthesis, the most commonly used technique. Although concatenative synthesis has usually outperformed statistical synthesis in terms of voice quality and naturalness [1], statistical parametric synthesis requires much less stored speech data, and its statistical modelling makes it much more flexible and adaptable. Statistical voices can therefore be modified much more easily to meet the needs of each context.

In KRISTINA, we have an Arabic male voice for use in the MaryTTS open-source speech synthesis system [2]. Since one of our use cases also required a female Arabic voice, we first increased the fundamental frequency. Fundamental frequency is the main speech characteristic that differentiates men's and women's voices, being much higher in women than in men. However, many other features make male and female voices sound different, owing to physiological differences, mainly in the vocal tract. To keep the resulting voice from sounding artificial, we also modified the vocal tract linear scaling effect accordingly. You can listen to the original male voice here, and the resulting converted female voice here.
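
For readers who like to see the idea in code, here is a minimal, purely illustrative sketch of the two adjustments described above: raising the fundamental frequency and linearly rescaling the frequency axis of a spectral envelope, which moves the formants the way a shorter vocal tract would. It does not use MaryTTS and is not the KRISTINA implementation. The function names, the toy data, and the factors 1.5 and 1.25 are assumptions chosen only for illustration.

```python
def scale_f0(f0_hz, factor):
    """Multiply voiced F0 values by `factor`.

    Frames with F0 == 0.0 are treated as unvoiced and left unchanged
    (a common convention, assumed here).
    """
    return [f * factor if f > 0.0 else 0.0 for f in f0_hz]


def warp_envelope(envelope, alpha):
    """Linearly rescale the frequency axis of a spectral envelope.

    `envelope[k]` is the magnitude at frequency bin k (bins uniformly spaced).
    Output bin k takes the value the input had at position k / alpha, so
    alpha > 1 moves spectral peaks (formants) towards higher frequencies.
    Values between bins are linearly interpolated.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    last = len(envelope) - 1
    warped = []
    for k in range(len(envelope)):
        src = min(k / alpha, last)   # position in the original envelope
        lo = int(src)                # bin just below that position
        hi = min(lo + 1, last)       # bin just above, clamped at the top
        frac = src - lo
        warped.append((1.0 - frac) * envelope[lo] + frac * envelope[hi])
    return warped


# Toy data: a short F0 contour (Hz, 0.0 = unvoiced) and a triangular
# "formant" peak centred on bin 8 of a 32-bin envelope.
male_f0 = [0.0, 110.0, 120.0, 0.0]
male_envelope = [max(0.0, 1.0 - abs(k - 8) / 4.0) for k in range(32)]

higher_f0 = scale_f0(male_f0, 1.5)
shifted_envelope = warp_envelope(male_envelope, 1.25)

print("F0 contour:", male_f0, "->", higher_f0)
print("Formant peak bin:",
      male_envelope.index(max(male_envelope)), "->",
      shifted_envelope.index(max(shifted_envelope)))
```

In a real system, both operations would be applied frame by frame to the parameters the synthesiser generates, and the factors would be tuned by ear until the converted voice sounds natural.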


[1] H. Zen, K. Tokuda, A. W. Black (2009), “Statistical parametric speech synthesis”. Speech Communication, vol. 51(11), pp. 1039–1064.
