PhD Student
Idiap Research Institute
École Polytechnique Fédérale de Lausanne (EPFL)
Email: mutian.he at idiap dot ch
Github / Google Scholar / CV
I am a PhD candidate at the Idiap Research Institute, EPFL, Switzerland, advised by Phil Garner, working on spoken language understanding by combining speech and NLP techniques. Before that, I received my B.E. degree from Beihang University (BUAA) in 2019, and my MPhil degree from the Hong Kong University of Science and Technology in 2022, with a thesis on conceptualization in commonsense reasoning, advised by Yangqiu Song. I have also worked on speech synthesis at Microsoft, focusing on robustness, multilinguality, and low-resource conditions.
I'm interested in a broad range of topics on the machine learning side of speech and language processing, including pretraining, modelling, and generation.
Converting pretrained speech and language transformer models into linear-complexity, task-specific models via distillation, without re-pretraining and with minimal performance regression; examined on RoBERTa -> Linformer, Pythia -> Mamba LM, and Wav2Vec2 -> bi-Mamba2. The fine-tuning trajectory of the original transformer can be leveraged to preserve pretraining knowledge.
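A minimal PyTorch-style sketch of one distillation step in this setting; the teacher/student interfaces and loss weighting below are illustrative assumptions, and the actual recipe (layer matching, use of the fine-tuning trajectory) is more involved.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer, temperature=2.0, alpha=0.5):
    """One distillation step: a linear-complexity student (e.g. Linformer / bi-Mamba2)
    mimics a frozen, fine-tuned transformer teacher on the task logits (sketch only)."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(batch["inputs"])      # frozen fine-tuned transformer
    s_logits = student(batch["inputs"])          # linear-complexity student

    # Soft-target KL between teacher and student output distributions
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard-label task loss keeps the student anchored to the downstream task
    task_loss = F.cross_entropy(s_logits, batch["labels"])

    loss = alpha * kd_loss + (1 - alpha) * task_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```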
The paper presents a comprehensive study on replicating the human conceptual induction process (i.e. commonsense reasoning with conceptual knowledge, e.g. a duck is a bird), including 1) formulating the steps of machine conceptualization and conceptual induction; 2) collecting a dataset on the validity of event and triple conceptualization; and 3) developing NLP models to carry out the conceptualization process. An abstract commonsense knowledge base is then derived and shown to be helpful on downstream tasks.
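A toy illustration of what conceptualization means here: an instance mentioned in an event is abstracted to a concept, and the validity of the abstracted event is then judged by a learned model. The tiny is-a inventory below is made up for illustration.

```python
# Minimal sketch: abstract instances in an event to concepts via an is-a map.
ISA = {"milk": "beverage", "duck": "bird"}  # illustrative, not the real taxonomy

def conceptualize(event: str):
    """Return candidate abstracted events along with the (instance, concept) pair."""
    return [(event.replace(word, concept), word, concept)
            for word, concept in ISA.items() if word in event]

print(conceptualize("PersonX drinks milk"))
# [('PersonX drinks beverage', 'milk', 'beverage')]
```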
Speech translation is found to be a useful pretraining or auxiliary task for spoken language understanding, especially in low-resource and multilingual scenarios, and it enables few-shot transfer to new languages. Preserving the speech translation knowledge with Bayesian regularizers further helps the downstream task.
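One common form of Bayesian regularizer for preserving pretrained knowledge is an EWC-style quadratic penalty around the speech-translation weights, weighted by (diagonal) Fisher information; the sketch below is a generic illustration under that assumption, not necessarily the exact regularizer used in the work.

```python
import torch

def ewc_penalty(model, ref_params, fisher, strength=1.0):
    """Quadratic penalty keeping the fine-tuned weights close to the
    speech-translation pretrained weights (generic EWC-style sketch)."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in ref_params:
            penalty = penalty + (fisher[name] * (p - ref_params[name]) ** 2).sum()
    return strength * penalty

# Usage (illustrative): total_loss = slu_task_loss + ewc_penalty(model, st_params, st_fisher, 0.1)
```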
ChatGPT shows strong zero/few-shot performance on intent classification benchmarks, close to supervised models and much better than smaller LMs. Performance drops when ASR errors are present, when the task involves word pronunciation, or on slot filling, where prompting is more complex.
A French TTS system using a diffusion-based acoustic model and vocoder, with dedicated text analysis to resolve liaisons and heterophonic homographs.
A retrieval-augmented (RAG-style) end-to-end TTS system that directly takes text inputs. The model reads the external dictionary entry of a word to determine the correct reading, instead of memorizing pronunciations internally. The technique boosts low-resource performance on Mandarin, Cantonese, and Japanese, particularly for resolving polyphones (characters with different readings in different semantic contexts).
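An illustrative sketch of the retrieval-augmented input construction: a retrieved dictionary entry is appended to the raw text so the model can read the correct pronunciation rather than memorize it. The dictionary format and the [DICT] separator below are assumptions for illustration, not the system's actual interface.

```python
# Toy dictionary with one polyphonic character (illustrative entries only).
DICTIONARY = {
    "行": "xíng (to walk) / háng (row, profession)",
}

def augment_with_dictionary(text: str, dictionary=DICTIONARY) -> str:
    """Append retrieved dictionary entries for characters found in the input."""
    entries = [f"{ch}: {dictionary[ch]}" for ch in text if ch in dictionary]
    if not entries:
        return text
    return text + " [DICT] " + " ; ".join(entries)

print(augment_with_dictionary("银行在哪里"))
```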
Aimed at scaling TTS to innumerable new languages with minimal per-language effort and expert knowledge, a massively multilingual TTS model is pretrained and can then be adapted to target languages with limited resources, or even a few (e.g. 30) samples. Byte inputs are adopted to avoid the need for per-language knowledge and text analysis. Language-specific sub-networks related to the language family can be identified from the model parameters.
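Byte-level input is what removes the need for per-language tokenizers or lexicons: every script maps onto the same 256-symbol vocabulary. A trivial illustration:

```python
def to_byte_ids(text: str) -> list[int]:
    """UTF-8 byte ids: a language-agnostic input representation."""
    return list(text.encode("utf-8"))

print(to_byte_ids("Grüße"))  # non-ASCII characters expand to multiple byte ids
```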
Extending commonsense knowledge graphs (e.g. ATOMIC) and constructing new triples about unseen events via conceptualization, e.g. reasoning over events about "beverage" from events about "milk", using neural models trained with negative sampling.
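A generic sketch of a negative-sampling training loop for a triple plausibility scorer; the scorer interface and the random-corruption strategy are assumptions here, and the actual sampler in the work is more elaborate.

```python
import random
import torch
import torch.nn.functional as F

def negative_sampling_step(scorer, triples, all_concepts, optimizer, k=4):
    """Score each positive triple against k randomly corrupted negatives and
    train the scorer to rank the positive first (illustrative sketch)."""
    loss = torch.zeros(())
    for head, relation, tail in triples:
        pos = scorer(head, relation, tail)  # scalar score, higher = more plausible
        negs = [scorer(head, relation, random.choice(all_concepts)) for _ in range(k)]
        scores = torch.stack([pos, *negs]).unsqueeze(0)   # (1, k+1)
        target = torch.zeros(1, dtype=torch.long)         # positive sits at index 0
        loss = loss + F.cross_entropy(scores, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```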
Counting the subgraphs of a large graph that are isomorphic to a small pattern graph, using neural models that iteratively attend to both graphs; this allows much faster computation than the NP-hard exact algorithm, with limited error.
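A heavily simplified sketch of the idea: pattern-node embeddings cross-attend to graph-node embeddings, and a pooled representation regresses an approximate count. The real model iterates and attends in both directions; everything below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class CountSketch(nn.Module):
    """Toy cross-attention counter over pattern and graph node embeddings."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, pattern_nodes, graph_nodes):
        # pattern_nodes: (B, P, dim); graph_nodes: (B, N, dim)
        ctx, _ = self.attn(pattern_nodes, graph_nodes, graph_nodes)
        return self.head(ctx.mean(dim=1)).squeeze(-1)  # predicted count per item
```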
An attention mechanism for seq2seq autoregressive TTS that encourages the attention focus to move forward monotonically, but by at most one token per step, thus avoiding TTS errors due to skipping, repetition, and attention collapse when handling long or out-of-domain inputs.
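The core recursion can be sketched as follows: at each decoder step, the attention mass at every encoder position either stays put or advances by exactly one position, which rules out skipping and repetition. This is a simplified illustration, with the per-position move probabilities assumed to be predicted elsewhere in the network.

```python
import torch

def stepwise_monotonic_update(prev_alignment, move_prob):
    """One decoder step of a forward-only attention update (simplified sketch).
    prev_alignment: (B, N) attention over encoder steps at the previous frame.
    move_prob:      (B, N) per-position probability of advancing one step.
    """
    stay = prev_alignment * (1.0 - move_prob)
    # Shift the "move" mass one encoder position to the right
    move = torch.nn.functional.pad(prev_alignment * move_prob, (1, 0))[:, :-1]
    alignment = stay + move
    return alignment / alignment.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```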
Efficiently processing non-stationary text data that evolves over time (e.g. word meanings changing across the years) by time-wise feature smoothing or by leveraging features from past models.
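A generic sketch of time-wise feature smoothing under these assumptions: per-year feature vectors are exponentially blended with earlier years so the representation drifts gradually; the decay factor and interface are illustrative, not the exact scheme in the work.

```python
import numpy as np

def smooth_features_over_time(yearly_features, decay=0.7):
    """Exponentially smooth per-year feature vectors across years.
    yearly_features: dict {year: np.ndarray of shape (dim,)}."""
    smoothed, running = {}, None
    for year in sorted(yearly_features):
        feats = yearly_features[year]
        running = feats if running is None else decay * running + (1 - decay) * feats
        smoothed[year] = running
    return smoothed
```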