Mutian He


PhD Student
Idiap Research Institute
École Polytechnique Fédérale de Lausanne (EPFL)

Email: mutian.he at idiap dot ch

GitHub / Google Scholar / CV


I am a PhD candidate at the Idiap Research Institute and EPFL, Switzerland, advised by Phil Garner, working on spoken language understanding by combining speech and NLP techniques. Before that, I received my B.E. degree from Beihang University (BUAA) in 2019, and my MPhil degree from the Hong Kong University of Science and Technology in 2022, with a thesis on conceptualization in commonsense reasoning, advised by Yangqiu Song. I have also worked on speech synthesis at Microsoft, focusing on robustness, multilinguality, and low-resource conditions.

I'm interested in a broad range of topics on the machine learning side of speech and language processing, including pretraining, modelling, and generation.

Papers

  • Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity
    ICLR 2025 [Paper] [Code]
    Mutian He, Philip N. Garner

    Pretrained speech and language transformers are converted into linear-complexity task-specific models via distillation, without re-pretraining and with minimal performance regression, examined on RoBERTa -> Linformer, Pythia -> Mamba LM, and Wav2Vec2 -> bi-Mamba2. The fine-tuning trajectory of the original transformer can be leveraged to preserve pretraining knowledge.

  • Acquiring and Modelling Abstract Commonsense Knowledge via Conceptualization
    Artificial Intelligence (AIJ), 2024 [Paper] [Code & Data]
    Mutian He, Tianqing Fang, Weiqi Wang, Yangqiu Song

    The paper presents a comprehensive study on replicating the human conceptual induction process (i.e. commonsense reasoning via conceptual knowledge, e.g. a duck is a bird), including 1) formulating the steps of machine conceptualization and conceptual induction; 2) collecting a dataset on the validity of event and triple conceptualization; and 3) developing NLP models to carry out the conceptualization process. An abstract commonsense knowledge base is then derived and shown helpful on downstream tasks.

  • The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation
    Findings of EMNLP 2023 [Paper] [Code & Data]
    Mutian He, Philip N. Garner

    Speech translation is found useful as a pretraining or auxiliary task for downstream spoken language understanding models, especially in low-resource and multilingual scenarios, and allows few-shot transfer to new languages. Preserving speech translation knowledge with Bayesian regularizers is also found helpful for the downstream task.

  • Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding
    Interspeech 2023 [Paper] [Resources]
    Mutian He, Philip N. Garner

    ChatGPT shows strong zero/few-shot performance on intent classification benchmarks, close to supervised models and much better than smaller LMs. Performance degrades when there are ASR errors, when the task involves word pronunciation, or on slot filling tasks where prompting is more complex.

  • The Idiap Speech Synthesis System for the Blizzard Challenge 2023
    Proc. 18th Blizzard Challenge Workshop [Paper]
    Haolin Chen, Mutian He, Louise Coppieters de Gibson, Philip N. Garner

    A French TTS system using a diffusion-based acoustic model and vocoder, with specialized efforts on text analysis to resolve liaisons and heterophonic homographs.

  • Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge
    Interspeech 2022 [Paper] [Demo] [Code]
    Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

    A RAG-style E2E TTS system that directly takes text inputs. The model reads the external dictionary entry of a word to determine the correct reading, instead of memorizing pronunciations internally. The technique boosts low-resource performance on Mandarin, Cantonese, and Japanese, particularly for resolving polyphones (characters with different readings in different semantic contexts).

  • Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis
    [Paper] [Demo] [Code]
    Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

    Aiming to scale TTS to innumerable new languages with minimal per-language effort and expert knowledge, a massively multilingual TTS model is pretrained and can then be adapted to target languages with low resources or even a few (e.g. 30) shots. Byte input is adopted to avoid the need for per-language knowledge and text analysis. Language-specific sub-networks related to the language family can be identified from the model parameters.
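    To illustrate the byte-input idea (a minimal sketch of the general principle, not the system's actual front end; the function name is my own), text in any script can be mapped to a shared 256-symbol vocabulary via its UTF-8 encoding:

    ```python
    def text_to_byte_ids(text: str) -> list[int]:
        """Map text in any language to a sequence of byte IDs (0-255),
        sidestepping per-language tokenizers, lexicons, or phonemizers."""
        return list(text.encode("utf-8"))
    ```

    Every language then shares one small input vocabulary: ASCII letters map to single bytes, while e.g. a CJK character expands to three bytes, which the model must learn to group.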

  • On the Role of Conceptualization in Commonsense Knowledge Graph Construction
    [Paper] [Code]
    Mutian He, Yangqiu Song, Kun Xu, Dong Yu

    Extends commonsense knowledge graphs (e.g. ATOMIC) and constructs new triples with unseen events using conceptualization, e.g. reasoning over events about “beverage” from events about “milk”, with neural models trained by negative sampling.

  • Neural Subgraph Isomorphism Counting
    KDD 2020 [Paper]
    Xin Liu, Haojie Pan, Mutian He, Yangqiu Song, Xin Jiang

    Counts the subgraphs of a large graph that are isomorphic to a small pattern graph, using neural models that iteratively attend to both graphs, allowing much faster computation than the NP-hard exact algorithm with limited error.

  • Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS
    Interspeech 2019 [Paper] [Demo] [Code]
    Mutian He, Yan Deng, Lei He

    An attention mechanism for seq2seq autoregressive TTS that encourages the attention focus to move forward monotonically by at most one token per step, thus avoiding TTS errors due to skipping, repeating, and attention collapse when handling long or out-of-domain inputs.
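    The stay-or-advance constraint can be sketched as an expected-alignment recurrence (an illustrative simplification, not the paper's implementation; names and shapes are my own):

    ```python
    import math

    def stepwise_monotonic_alignment(energies):
        """Sketch of a stepwise monotonic alignment recurrence: at each
        decoder step, attention mass either stays at its current encoder
        position or advances by exactly one, so the focus never skips
        ahead or moves backward.

        energies: one list per decoder step of per-encoder-token attention
        energies; sigmoid(energy) is read as the probability of advancing.
        """
        t_enc = len(energies[0])
        alpha = [0.0] * t_enc
        alpha[0] = 1.0  # start focused on the first encoder token
        alignments = []
        for row in energies:
            p_move = [1.0 / (1.0 + math.exp(-e)) for e in row]
            p_move[-1] = 0.0  # at the last encoder position, can only stay
            stay = [a * (1.0 - p) for a, p in zip(alpha, p_move)]
            move = [0.0] + [a * p for a, p in zip(alpha[:-1], p_move[:-1])]
            alpha = [s + m for s, m in zip(stay, move)]
            alignments.append(alpha)
        return alignments
    ```

    Because mass can only stay or shift forward by one, each alignment row sums to 1 and the expected attended position is non-decreasing over decoder steps, which is what rules out skipping and repetition.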

  • Time-evolving Text Data Classification with Deep Neural Networks
    IJCAI 2018 [Paper]
    Yu He, Jianxin Li, Yangqiu Song, Mutian He, Hao Peng

    Efficiently processes non-stationary text data that evolve over time (e.g. word meanings changing through the years) by time-wise feature smoothing or leveraging features from past models.

Teaching

  • Intro to Natural Language Processing, HKUST, Spring 2020
  • Intro to Speech Processing, Idiap, Fall 2022, 2023

Miscellaneous

CSRankings is a powerful tool for identifying active researchers in various fields of computer science, but speech is not among the areas it covers. Inspired by the idea, I created a similar Speech Rankings when I was looking for potential PhD advisors.
Plain Academic