[2010.12267] Show and Speak: Directly Synthesize Spoken Description of Images

This paper proposes a new model, referred to as the show and speak (SAS) model, that for the first time directly synthesizes spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of the speech that describes it. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS synthesizes natural spoken descriptions for images, indicating that synthesizing spoken descriptions of images while bypassing text and phonemes is feasible.
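The abstract describes a three-stage pipeline: an image encoder, a decoder that predicts a speech spectrogram, and a WaveNet vocoder that turns the spectrogram into audio. A minimal sketch of that data flow is shown below; all shapes, layer choices, and function names here are illustrative assumptions, not the paper's actual architecture or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image):
    """Stand-in image encoder (hypothetical): in SAS a CNN would produce a
    grid of regional features; here we fake a 49 x 512 feature map
    (e.g. a 7x7 spatial grid with 512 channels)."""
    flat = image.reshape(-1)[:49 * 4]                 # toy projection input
    W = rng.standard_normal((49 * 4, 49 * 512)) * 0.01
    return (flat @ W).reshape(49, 512)

def decode_spectrogram(features, n_frames=100, n_mels=80):
    """Stand-in autoregressive decoder (hypothetical): maps image features
    to a mel-spectrogram of shape (n_frames, n_mels), each frame
    conditioned on the previous one."""
    context = features.mean(axis=0)                   # pooled image context
    W = rng.standard_normal((512, n_mels)) * 0.01
    frames, prev = [], np.zeros(n_mels)
    for _ in range(n_frames):
        frame = np.tanh(context @ W + prev)           # condition on previous frame
        frames.append(frame)
        prev = frame
    return np.stack(frames)

image = rng.standard_normal((224, 224, 3))            # dummy RGB image
features = encode_image(image)
mel = decode_spectrogram(features)
print(mel.shape)  # (100, 80); a vocoder such as WaveNet would convert this to a waveform
```

The point of the sketch is the interface: the decoder never sees text or phoneme tokens, only image features, which is the "bypassing text and phonemes" claim of the paper.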

Date: 2020/11/18 15:51
