Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Abstract: End-to-end TTS requires large amounts of data: costly speech corpora can hardly cover all the necessary knowledge, and neural models struggle to learn it, so additional knowledge must be injected manually. For example, to capture pronunciation knowledge for languages without regular orthography, a complicated grapheme-to-phoneme pipeline has to be built on a large, structured pronunciation lexicon, which adds extra and sometimes high costs when extending neural TTS to such languages. In this paper, we propose a framework that learns to extract knowledge from unstructured external resources using Token2Knowledge attention modules. The framework is applied to build a novel end-to-end TTS model named Neural Lexicon Reader that extracts pronunciations from raw lexicon texts. Experiments support the potential of our framework: the model significantly reduces pronunciation errors in low-resource, end-to-end Chinese TTS, and the lexicon-reading capability can be transferred to other languages with a smaller amount of data.

The paper is here. The pipeline and pretrained models using open datasets are available here.

General domain samples

Below are the audio samples used in the MOS tests.

Ground Truth | Baseline, 18K data | Baseline, 10K data | NLR, 18K data | NLR, 10K data
人均寿命可达七十岁左右。
两个月后,洗衣机干脆不能用了。
山地好恬静,只有秋虫此起彼伏的轻声吟唱。
本报讯:与缅甸、老挝、越南三国接壤的云南省思茅地区,禁毒任务繁重。

Model extensibility

Even for a trained model, the pronunciation of characters can be manipulated through the lexicon texts, and new knowledge can be introduced. Below is an example, using the script "矿工从巷道中走出":

Manipulated lexicon text | Changes | Audio

矿 ● kuàng (1)矿物,蕴藏在地层中的自然物质:矿藏(cáng)。铁矿。煤矿。矿产。矿泉。矿源。 (2)开采矿物的场所:矿井。矿坑。下矿。

巷 ● xiàng 胡同,里弄:小巷。陋巷。穷巷。巷陌(街道)。巷战(在城市街巷里进行的战斗)。穷街陋巷。 ● hàng (1)〔巷道〕采矿或探矿时挖的坑道。(2)义同(一)。

(Original)
(kuàng gōng cóng hàng dào zhōng zǒu chū)

释义:矿 ● wáng (1)矿物,蕴藏在地层中的自然物质:矿藏(cáng)。铁矿。煤矿。矿产。矿泉。矿源。 (2)开采矿物的场所:矿井。矿坑。下矿。

巷 ● xiàng 胡同,里弄:小巷。陋巷。穷巷。巷陌(街道)。巷战(在城市街巷里进行的战斗)。穷街陋巷。

"巷" is a heteronym that should be pronounced here with its less common reading (hàng) rather than its most common one (xiàng). In this sample we removed the text describing the reading hàng and changed the pronunciation of "矿" from kuàng to wáng.
(wáng gōng cóng xiàng dào zhōng zǒu chū)
释义:矿 ● kuàng (1)矿物,蕴藏在地层中的自然物质:矿藏(cáng)。铁矿。煤矿。矿产。矿泉。矿源。 (2)开采矿物的场所:矿井。矿坑。下矿。● tuó〔矿工〕矿山工人;尤指采矿的工人

In this sample we added a second (fake) pronunciation to the character "矿", making it a heteronym with a reading, tuó, that matches the context. In this way, for low-resource languages with incomplete lexicons, pronunciation knowledge can easily be added after the model is trained.
(tuó gōng cóng hàng dào zhōng zǒu chū)
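
The three lexicon edits in the table above (changing a reading, removing a reading, adding a reading) are plain string edits on the raw entry text. The following is a minimal sketch of such edits, assuming each entry is a single string in which "●" introduces each reading, as in the entries shown. The helper names (`split_readings`, `remove_reading`, `replace_reading`, `add_reading`) and the shortened glosses are hypothetical illustrations, not part of the released pipeline.

```python
import re
import unicodedata

# Leading run of Latin letters (including tone-marked vowels) in a segment.
_PINYIN = re.compile(r"[A-Za-z\u00C0-\u024F]+")


def _nfc(text):
    """Normalize to NFC so tone-marked vowels compare as single code points."""
    return unicodedata.normalize("NFC", text)


def split_readings(entry):
    """Split a raw entry into its headword and one text segment per reading."""
    head, *segments = [part.strip() for part in _nfc(entry).split("●")]
    return head, segments


def join_readings(head, segments):
    """Rebuild the raw entry text from a headword and its reading segments."""
    return " ".join([head] + ["● " + seg for seg in segments])


def reading_of(segment):
    """Return the pinyin that opens a segment, or an empty string if none."""
    match = _PINYIN.match(segment)
    return match.group(0) if match else ""


def remove_reading(entry, pinyin):
    """Drop every segment whose reading equals `pinyin`."""
    head, segments = split_readings(entry)
    kept = [seg for seg in segments if reading_of(seg) != _nfc(pinyin)]
    return join_readings(head, kept)


def replace_reading(entry, old, new):
    """Swap the pinyin `old` for `new`, leaving the gloss text untouched."""
    head, segments = split_readings(entry)
    old = _nfc(old)
    edited = [new + seg[len(old):] if reading_of(seg) == old else seg
              for seg in segments]
    return join_readings(head, edited)


def add_reading(entry, pinyin, gloss):
    """Append a new reading segment, turning the character into a heteronym."""
    head, segments = split_readings(entry)
    return join_readings(head, segments + [pinyin + gloss])


if __name__ == "__main__":
    # Glosses are shortened here for illustration only.
    kuang = "矿 ● kuàng (1)矿物。 (2)开采矿物的场所。"
    xiang = "巷 ● xiàng 胡同,里弄。 ● hàng (1)〔巷道〕采矿或探矿时挖的坑道。"

    # Second row of the table: change 矿 to wáng and hide the reading hàng.
    print(replace_reading(kuang, "kuàng", "wáng"))
    print(remove_reading(xiang, "hàng"))

    # Third row of the table: give 矿 an extra, context-specific reading.
    print(add_reading(kuang, "tuó", "〔矿工〕矿山工人"))
```

Because the sketch only rewrites text, it does not touch the model in any way; it produces the kind of edited lexicon text shown in the "Manipulated lexicon text" column.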