Evaluating text-to-speech (TTS) engines
The widespread availability of open source and commercial text-to-speech (TTS)
engines allows for the rapid creation of telephony services that require a
TTS component. However, there exists neither a standard corpus nor common
metrics to objectively evaluate TTS engines. Listening tests are a
prominent evaluation method in this domain, where the primary goal
is to produce speech targeted at human listeners. Nonetheless, subjective
evaluation can be problematic and expensive. Objective evaluation metrics,
such as word accuracy and contextual disambiguation (is “Dr.”
rendered as “Doctor” or “Drive”?), have the benefit of being both inexpensive
and unbiased. In this paper, we study seven TTS engines: four open
source and three commercial. We systematically evaluate each engine
on two axes: (1) contextual word accuracy (including support for
numbers, homographs, foreign words, acronyms, and directional
abbreviations); and (2) naturalness (how natural the synthesized
speech sounds to human listeners). Our results indicate that commercial engines
may have an edge over open source TTS engines.
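Axis (1) lends itself to automation. As a rough illustration (not the harness used in the paper), the sketch below scores an engine's contextual word accuracy by checking whether a transcript of the synthesized audio contains the expected expansion of each ambiguous token. Here `render` is a hypothetical stand-in for a TTS-plus-ASR pipeline, and the test cases are illustrative, not drawn from the paper's corpus.

```python
# Minimal sketch of a contextual word accuracy check, assuming a generic
# "synthesize, then transcribe" pipeline. Nothing here is from the paper's
# published code; `render` and the test cases are hypothetical.

from typing import Callable, List, Tuple

# Each case: (input text, expansion a correct engine should produce).
CASES: List[Tuple[str, str]] = [
    ("Dr. Smith's office is at 42 Ocean Dr.", "doctor"),
    ("Dr. Smith's office is at 42 Ocean Dr.", "drive"),
    ("The record was set in 1998.", "nineteen ninety eight"),
    ("Take I-90 W to the airport.", "west"),
]


def contextual_word_accuracy(render: Callable[[str], str],
                             cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases whose transcript contains the expected expansion."""
    hits = sum(1 for text, expected in cases
               if expected in render(text).lower())
    return hits / len(cases)


def stub_render(text: str) -> str:
    """Stub for illustration only; a real harness would invoke a TTS
    engine and then an ASR system on the resulting audio."""
    return (text.replace("Dr. Smith", "doctor smith")
                .replace("Ocean Dr.", "ocean drive")
                .lower())


if __name__ == "__main__":
    score = contextual_word_accuracy(stub_render, CASES)
    print(f"contextual word accuracy: {score:.2f}")
```

With the stub renderer this reports 0.50, since only the “Dr.” expansions are handled; a real run would substitute each engine under test for `stub_render` and a curated test corpus for `CASES`.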
This project was investigated by Jordan Kalfen (BS CS), Vijay K. Gurbani, and
additional collaborators from Vail
Systems, Inc. The project resulted in the following publication:
Jordan Hosier, Jordan Kalfen, Nikhita Sharma, and Vijay K. Gurbani,
"A systematic study of open source and commercial text-to-speech (TTS)
engines," Proceedings of the 23rd International Conference on Text,
Speech and Dialogue (TSD 2020), Springer Lecture Notes in Artificial
Intelligence (LNAI),
Volume 12284, September 2020.