Fundamental Frequency Generation for Whisper-to-Audible Speech Conversion
by Matthias Janke, Kishore Prahallad, Michael Wand, Till Heistermann, Tanja Schultz
Abstract:
In this work, we address the issues involved in whisper-to-audible speech conversion. Spectral mapping techniques using Gaussian mixture models or Artificial Neural Networks borrowed from voice conversion have been applied to transform whisper spectral features to normally phonated audible speech. However, the modeling and generation of fundamental frequency ($F_0$) and its contour in the converted speech is a major issue. Whispered speech does not contain explicit voicing characteristics and hence it is hard to derive a suitable $F_0$, making it difficult to generate a natural prosody after conversion. Our work addresses the $F_0$ modeling in whisper-to-speech conversion. We show that $F_0$ contours can be derived from the mapped spectral vectors, which can be used for the synthesis of a speech signal. We also present a hybrid unit selection approach for whisper-to-speech conversion. Unit selection is performed on the spectral vectors, where $F_0$ and its contour can be obtained as a byproduct without any additional modeling.
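The last point of the abstract, $F_0$ falling out of unit selection as a byproduct, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a hypothetical database in which every unit stores a spectral vector together with the $F_0$ value of the audible recording it came from, and it uses only a frame-wise Euclidean target cost. The hybrid part of the approach and any concatenation cost are left out, since the abstract does not specify them. The names `select_units`, `unit_spectra`, `unit_f0`, and `mapped_frames`, as well as the toy values, are made up for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def select_units(mapped_frames, unit_spectra, unit_f0):
    """Frame-wise nearest-neighbour unit selection (target cost only).

    Returns the indices of the selected units and the F0 contour read
    off those units. Illustrative assumption, not the paper's method.
    """
    chosen = []
    for frame in mapped_frames:
        # Pick the database unit whose spectral vector is closest to the frame.
        best = min(range(len(unit_spectra)),
                   key=lambda i: dist(frame, unit_spectra[i]))
        chosen.append(best)
    # F0 is looked up from the selected units; no separate F0 model is used.
    f0_contour = [unit_f0[i] for i in chosen]
    return chosen, f0_contour


# Toy database: 2-dimensional "spectral" vectors, each with a stored F0 in Hz
# (0.0 marks an unvoiced unit). All values are invented for this example.
unit_spectra = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
unit_f0 = [0.0, 120.0, 135.0]

# Mapped spectral vectors, e.g. the output of a whisper-to-speech spectral mapping.
mapped_frames = [(0.1, -0.1), (0.9, 1.2), (1.9, 0.4)]

chosen, f0_contour = select_units(mapped_frames, unit_spectra, unit_f0)
print(chosen, f0_contour)
```

Because each selected unit carries its own $F_0$ value, the contour is obtained by lookup alone, which is the sense in which no additional $F_0$ modeling is needed in this sketch.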
Reference:
Fundamental Frequency Generation for Whisper-to-Audible Speech Conversion (Matthias Janke, Kishore Prahallad, Michael Wand, Till Heistermann, Tanja Schultz), In The 39th International Conference on Acoustics, Speech, and Signal Processing, 2014. (ICASSP 2014)
Bibtex Entry:
@inproceedings{janke2014fundamental,
  year={2014},
  title={Fundamental Frequency Generation for Whisper-to-Audible Speech Conversion},
  note={ICASSP 2014},
  booktitle={The 39th International Conference on Acoustics, Speech, and Signal Processing},
  url={https://www.csl.uni-bremen.de/cms/images/documents/publications/Janke_ICASSP14_F0GenerationWhisper.pdf},
  abstract={In this work, we address the issues involved in whisper-to-audible speech conversion. Spectral mapping techniques using Gaussian mixture models or Artificial Neural Networks borrowed from voice conversion have been applied to transform whisper spectral features to normally phonated audible speech. However, the modeling and generation of fundamental frequency ($F_0$) and its contour in the converted speech is a major issue. Whispered speech does not contain explicit voicing characteristics and hence it is hard to derive a suitable $F_0$, making it difficult to generate a natural prosody after conversion. Our work addresses the $F_0$ modeling in whisper-to-speech conversion. We show that $F_0$ contours can be derived from the mapped spectral vectors, which can be used for the synthesis of a speech signal. We also present a hybrid unit selection approach for whisper-to-speech conversion. Unit selection is performed on the spectral vectors, where $F_0$ and its contour can be obtained as a byproduct without any additional modeling.},
  author={Janke, Matthias and Prahallad, Kishore and Wand, Michael and Heistermann, Till and Schultz, Tanja}
}