Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis

Abstract: To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts. Besides strong results on 40+ languages, the framework demonstrates the capability to adapt to new languages under extreme low-resource and even few-shot scenarios of merely 40 seconds of transcribed recordings. It does so without the need for per-language resources such as a lexicon, extra corpora, auxiliary models, or linguistic expertise, thus ensuring scalability, while retaining satisfactory intelligibility and naturalness matching rich-resource models. Exhaustive comparative and ablation studies are performed to reveal the potential of the framework for low-resource languages. Furthermore, we propose a novel method to extract language-specific sub-networks in a multilingual model for a better understanding of its mechanism.
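To illustrate why byte inputs allow arbitrary input scripts, here is a minimal sketch of turning text into a byte sequence. It assumes UTF-8 encoding, and the helper name `text_to_byte_ids` is hypothetical; neither is taken from the released pipeline. Under this assumption, every script shares the same 256-value input vocabulary, so no per-language symbol set is needed.

```python
from typing import List


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as a list of byte values in the range 0-255.

    Assumption: UTF-8 is used as the byte encoding. The function name
    is illustrative and not part of the released pipeline.
    """
    return list(text.encode("utf-8"))


# Short words in different scripts, taken from the sample sentences on this page.
samples = {
    "en-us": "chicken",
    "zh-hk": "巧合",
    "el-gr": "αγάπη",
}

for locale, text in samples.items():
    ids = text_to_byte_ids(text)
    # Non-Latin scripts need several bytes per character, so the byte
    # sequence is longer than the character sequence.
    print(locale, "chars:", len(text), "bytes:", len(ids))
```

Because the mapping is lossless, the original text can always be recovered with `bytes(ids).decode("utf-8")`.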

For the paper, see here. The pipeline and pretrained models using open datasets are available here.

Rich-resource source languages

Below are audio samples used in the MOS tests.

English (US, en-us)

Recordings Phoneme inputs Byte inputs
There are ten calories in a serving of fried chicken.
Never be ashamed of your life goals, because that's your parents' job.
An article casts doubt on British tycoon Richard Branson's foray into the railroad business.

Cantonese (Hong Kong, zh-hk)

Recordings Phoneme inputs Byte inputs
他後來對陪審員說,這純屬巧合。
taa1 hau6loi4 deoi3 pui4sam2jyun4 syut3, ze5 seon4suk6 haau2hap6.
身為主席最重要把持的原則是什麼?
san1wai4 zyu2zik6 zeoi3 zung6jiu3 baa2ci4 dik1 jyun4zak1 si6 sam6mo1?
此外,草案亦提出其他技術性的修訂建議。
ci2ngoi6, cou2ngon3 jik6 tai4ceot1 kei4taa1 gei6seot6sing3 dik1 sau1deng6 gin3ji5.

Telugu (India, te-in)

Recordings Phoneme inputs Byte inputs
త్రివిక్రమ్ పెన్ను పవర్ ఏ రేంజ్ లో ఉంటుందో చూపించారు
Trivikram pennu pavar ē rēn̄j lō uṇṭundō cūpin̄cāru
ఎప్పుడైనా రేర్‌గా వేరే స్టార్ హీరోలు వస్తుంటారు
Eppuḍainā rērgā vērē sṭār hīrōlu vastuṇṭāru
తాజాగా ఆ గ్రామంలో రేవంత్ రెడ్డి కాలు మోపారు
Tājāgā ā grāmanlō rēvant reḍḍi kālu mōpāru

 

Low-resource target adaptation

Below are audio samples used in the subjective tests, or generated by the model used in the subjective tests for each experiment. We also present samples from byte models on the en-in CER-Ex test set, on which the single-language full-resource model often fails.

English (India, en-in)

After his education, he took up his career as a lawyer.
In the twentieth century there was the 'Red revolution' for social equality.
Parents of children in a government school were more likely to be satisfied if the school required children to wear a uniform and if the school had a formal complaint procedure.
(CER-Ex)
Looking at these figures, he said, it appears impossible to compete in the comity of nations technically, socially and economically (Source: EPB)
(CER-Ex)
Recordings
Phoneme inputs
Byte (10 samples)
Byte (30 samples)
Byte (1k samples)
Byte Single (9k samples)

Romanian (Romania, ro-ro)

Decizia însă nu a mai avut obiect, așa că Iosif Pop s-a întors liniștit la afacerile sale.
Ce ne facem dacă într-o bună zi ne trezim fără toate acestea?
Pentru aceasta până acum a fost adusă doar piatra.
Recordings
Phoneme inputs
Byte (10 samples, 0.1)
Byte (30 samples)
Byte (1k samples)
Byte Single (7k samples)

Greek (Greece, el-gr)

Αλλά αφορά την προστασία των ελληνικών θαλασσών που έχουν αρχίσει να μολύνονται επικίνδυνα.
Allá aforá tin prostasía ton ellinikón thalassón pou échoun archísei na molýnontai epikíndyna.
Μην αφήνετε μεγάλα χρηματικά ποσά ή αντικείμενα αξίας στο σπίτι σας.
Min afínete megála chrimatiká posá í antikeímena axías sto spíti sas.
Κι όμως, η αγάπη είναι τόσο δύσκολη, όσο και απλή.
Ki ómos, i agápi eínai tóso dýskoli, óso kai aplí.
Recordings
Phoneme inputs
Byte (10 samples, Pangram, 0.1)
Byte (30 samples)
Byte (500 samples)
Byte Single (7k samples)

Mandarin Chinese (Mainland China, zh-cn)

老叔公把牛绳子拴在树干上。
lao3 shu1-gong1 ba3 niu2 sheng2-zi3 shuan1 zai4 shu4-gan4 shang4.
我一定要好好改造,重新做人。
wo3 yi2-ding4 yao4 hao3-hao1 gai3-zao4, chong2-xin1 zuo4ren2.
这轻柔的声音,这瘦小的身影,给我一种可亲可信的感觉。
zhe4 qing1-rou2 de5 sheng1-yin1, zhe4 shou4-xiao3 de5 shen1-ying3, gei3 wo3 yi4-zhong3 ke3-qin1 ke3-xin4 de5 gan3-jüe2.
Recordings
Phoneme inputs
Byte (100 samples, Pinyin)
Byte (1k samples, Pinyin)
Byte Single (19k samples, Pinyin)

Cross-language speaker transfer

Samples of cross-language transfer between a Japanese female speaker and an American male speaker are given below. The model can generate English speech in the Japanese speaker's voice, and Japanese speech in the American speaker's voice, even though such cases are unseen during training. We further compare these samples with the Japanese speaker's own recordings of the corresponding English utterances, and find that our model produces more natural English speech with less of a Japanese accent.

The final round began with Bob Murphy three strokes off the pace.
He's not the type of guy you'll want to watch each week.
先輩と日本チームを応援した。
Senpai to Nihon chīmu o ōen shita.
目先の自律反発を予想する声も聞かれた。
Mesaki no jiritsu hanpatsu o yosō suru koe mo kika reta.
Japanese Female Speaker
Japanese Female Speaker (Recording)
American Male Speaker