PhD Student
Idiap Research Institute
École Polytechnique Fédérale de Lausanne (EPFL)
Email: mutian.he at idiap dot ch
Github / Google Scholar / CV
I am a PhD candidate at the Idiap Research Institute, EPFL, Switzerland, advised by Phil Garner, working on spoken language understanding by combining speech and NLP techniques. Before that, I received my B.E. degree from Beihang University (BUAA) in 2019, and my MPhil degree from the Hong Kong University of Science and Technology in 2022, with a thesis on conceptualization in commonsense reasoning, advised by Yangqiu Song. I have also worked on speech synthesis at Microsoft, focusing on robustness, multilinguality, and low-resource conditions.
I'm interested in a broad range of topics on the machine learning side of speech and language processing, including pretraining, modelling, and generation.
Converting pretrained speech and language transformer models into linear-complexity, task-specific models via distillation, without re-pretraining and with minimal performance regression; examined on RoBERTa -> Linformer, Pythia -> Mamba LM, and Wav2Vec2 -> bi-Mamba2. The fine-tuning trajectory of the original transformer can be leveraged to preserve pretraining knowledge.
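A minimal PyTorch-style sketch of one distillation step in this setting; the teacher/student interfaces and loss weighting below are illustrative assumptions, and the actual recipe (layer matching, use of the fine-tuning trajectory) is more involved.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer, temperature=2.0, alpha=0.5):
    """One distillation step: a linear-complexity student (e.g. Linformer / bi-Mamba2)
    mimics a frozen, fine-tuned transformer teacher on the task logits (sketch only)."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(batch["inputs"])      # frozen fine-tuned transformer
    s_logits = student(batch["inputs"])          # linear-complexity student

    # Soft-target KL between teacher and student output distributions
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard-label task loss keeps the student anchored to the downstream task
    task_loss = F.cross_entropy(s_logits, batch["labels"])

    loss = alpha * kd_loss + (1 - alpha) * task_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```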
The paper presents a comprehensive study on replicating the human conceptual induction process (i.e. commonsense reasoning with conceptual knowledge, e.g. a duck is a bird), including 1) formulating the steps of machine conceptualization and conceptual induction; 2) collecting a dataset on the validity of event and triple conceptualization; and 3) developing NLP models to carry out the conceptualization process. An abstract commonsense knowledge base is then derived and shown to be helpful on downstream tasks.
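A toy illustration of what conceptualization means here: an instance mentioned in an event is abstracted to a concept, and the validity of the abstracted event is then judged by a learned model. The tiny is-a inventory below is made up for illustration.

```python
# Minimal sketch: abstract instances in an event to concepts via an is-a map.
ISA = {"milk": "beverage", "duck": "bird"}  # illustrative, not the real taxonomy

def conceptualize(event: str):
    """Return candidate abstracted events along with the (instance, concept) pair."""
    return [(event.replace(word, concept), word, concept)
            for word, concept in ISA.items() if word in event]

print(conceptualize("PersonX drinks milk"))
# [('PersonX drinks beverage', 'milk', 'beverage')]
```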
Speech translation is found to be a useful pretraining or auxiliary task for spoken language understanding, especially in low-resource and multilingual scenarios, and it enables few-shot transfer to new languages. Preserving the speech translation knowledge with Bayesian regularizers further helps the downstream task.
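One common form of Bayesian regularizer for preserving pretrained knowledge is an EWC-style quadratic penalty around the speech-translation weights, weighted by (diagonal) Fisher information; the sketch below is a generic illustration under that assumption, not necessarily the exact regularizer used in the work.

```python
import torch

def ewc_penalty(model, ref_params, fisher, strength=1.0):
    """Quadratic penalty keeping the fine-tuned weights close to the
    speech-translation pretrained weights (generic EWC-style sketch)."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if name in ref_params:
            penalty = penalty + (fisher[name] * (p - ref_params[name]) ** 2).sum()
    return strength * penalty

# Usage (illustrative): total_loss = slu_task_loss + ewc_penalty(model, st_params, st_fisher, 0.1)
```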
ChatGPT shows strong zero/few-shot performance on intent classification benchmarks, close to supervised models and much better than smaller LMs. Performance drops when ASR errors are present, when the task involves word pronunciation, or on slot filling, where prompting is more complex.
A French TTS system using a diffusion-based acoustic model and vocoder, with dedicated text analysis to resolve liaisons and heterophonic homographs.
A retrieval-augmented (RAG-style) end-to-end TTS system that directly takes text inputs. The model reads the external dictionary entry of a word to determine the correct reading, instead of memorizing pronunciations internally. The technique boosts low-resource performance on Mandarin, Cantonese, and Japanese, particularly for resolving polyphones (characters with different readings in different semantic contexts).
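An illustrative sketch of the retrieval-augmented input construction: a retrieved dictionary entry is appended to the raw text so the model can read the correct pronunciation rather than memorize it. The dictionary format and the [DICT] separator below are assumptions for illustration, not the system's actual interface.

```python
# Toy dictionary with one polyphonic character (illustrative entries only).
DICTIONARY = {
    "行": "xíng (to walk) / háng (row, profession)",
}

def augment_with_dictionary(text: str, dictionary=DICTIONARY) -> str:
    """Append retrieved dictionary entries for characters found in the input."""
    entries = [f"{ch}: {dictionary[ch]}" for ch in text if ch in dictionary]
    if not entries:
        return text
    return text + " [DICT] " + " ; ".join(entries)

print(augment_with_dictionary("银行在哪里"))
```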
Aimed at scaling TTS to innumerable new languages with minimal per-language effort and expert knowledge, a massively multilingual TTS model is pretrained and can then be adapted to target languages with limited resources, or even a few (e.g. 30) samples. Byte inputs are adopted to avoid the need for per-language knowledge and text analysis. Language-specific sub-networks related to the language family can be identified from the model parameters.
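Byte-level input is what removes the need for per-language tokenizers or lexicons: every script maps onto the same 256-symbol vocabulary. A trivial illustration:

```python
def to_byte_ids(text: str) -> list[int]:
    """UTF-8 byte ids: a language-agnostic input representation."""
    return list(text.encode("utf-8"))

print(to_byte_ids("Grüße"))  # non-ASCII characters expand to multiple byte ids
```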
Extending commonsense knowledge graphs (e.g. ATOMIC) and constructing new triples about unseen events via conceptualization, e.g. reasoning over events about "beverage" from events about "milk", using neural models trained with negative sampling.
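A generic sketch of a negative-sampling training loop for a triple plausibility scorer; the scorer interface and the random-corruption strategy are assumptions here, and the actual sampler in the work is more elaborate.

```python
import random
import torch
import torch.nn.functional as F

def negative_sampling_step(scorer, triples, all_concepts, optimizer, k=4):
    """Score each positive triple against k randomly corrupted negatives and
    train the scorer to rank the positive first (illustrative sketch)."""
    loss = torch.zeros(())
    for head, relation, tail in triples:
        pos = scorer(head, relation, tail)  # scalar score, higher = more plausible
        negs = [scorer(head, relation, random.choice(all_concepts)) for _ in range(k)]
        scores = torch.stack([pos, *negs]).unsqueeze(0)   # (1, k+1)
        target = torch.zeros(1, dtype=torch.long)         # positive sits at index 0
        loss = loss + F.cross_entropy(scores, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```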
Counting the subgraphs of a large graph that are isomorphic to a small pattern graph, using neural models that iteratively attend to both graphs; this allows much faster computation than the NP-hard exact algorithm, with limited error.
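A heavily simplified sketch of the idea: pattern-node embeddings cross-attend to graph-node embeddings, and a pooled representation regresses an approximate count. The real model iterates and attends in both directions; everything below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class CountSketch(nn.Module):
    """Toy cross-attention counter over pattern and graph node embeddings."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, pattern_nodes, graph_nodes):
        # pattern_nodes: (B, P, dim); graph_nodes: (B, N, dim)
        ctx, _ = self.attn(pattern_nodes, graph_nodes, graph_nodes)
        return self.head(ctx.mean(dim=1)).squeeze(-1)  # predicted count per item
```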
An attention mechanism for seq2seq autoregressive TTS that encourages the attention focus to move forward monotonically, but by at most one token per step, thus avoiding TTS errors due to skipping, repetition, and attention collapse when handling long or out-of-domain inputs.
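The core recursion can be sketched as follows: at each decoder step, the attention mass at every encoder position either stays put or advances by exactly one position, which rules out skipping and repetition. This is a simplified illustration, with the per-position move probabilities assumed to be predicted elsewhere in the network.

```python
import torch

def stepwise_monotonic_update(prev_alignment, move_prob):
    """One decoder step of a forward-only attention update (simplified sketch).
    prev_alignment: (B, N) attention over encoder steps at the previous frame.
    move_prob:      (B, N) per-position probability of advancing one step.
    """
    stay = prev_alignment * (1.0 - move_prob)
    # Shift the "move" mass one encoder position to the right
    move = torch.nn.functional.pad(prev_alignment * move_prob, (1, 0))[:, :-1]
    alignment = stay + move
    return alignment / alignment.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```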
Efficiently processing non-stationary text data that evolves over time (e.g. word meanings changing across the years) by time-wise feature smoothing or by leveraging features from past models.
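A generic sketch of time-wise feature smoothing under these assumptions: per-year feature vectors are exponentially blended with earlier years so the representation drifts gradually; the decay factor and interface are illustrative, not the exact scheme in the work.

```python
import numpy as np

def smooth_features_over_time(yearly_features, decay=0.7):
    """Exponentially smooth per-year feature vectors across years.
    yearly_features: dict {year: np.ndarray of shape (dim,)}."""
    smoothed, running = {}, None
    for year in sorted(yearly_features):
        feats = yearly_features[year]
        running = feats if running is None else decay * running + (1 - decay) * feats
        smoothed[year] = running
    return smoothed
```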