Speech Rankings

A list of researchers in the area of speech ordered by the number of relevant publications, for the purpose of identifying potential academic supervisors.

Report exported at 2024-10-16 04:11:58; see here for how it is created.
Export parameters: --year_start 2019 --year_end 2024 --year_shift 1 --author_start_year 1900 --exclude_venue SSW,ASRU,IWSLT,SLT --n_pubs 20 --rank_start 0 --rank_end 200 --output speech_rankings.html

#1 | Shinji Watanabe 0001 | DBLP | Google Scholar
By venue: ICASSP: 101; Interspeech: 89; TASLP: 12; ACL: 5; NAACL: 4; AAAI: 2; EMNLP-Findings: 2; ACL-Findings: 1; NeurIPS: 1; IJCAI: 1; ICML: 1
By year: 2024: 28; 2023: 60; 2022: 43; 2021: 34; 2020: 18; 2019: 26; 2018: 10
ISCA sessions: speech recognition: 8; speech synthesis: 4; non-autoregressive sequential modeling for speech processing: 4; speaker diarization: 4; low-resource asr development: 3; spoken language translation, information retrieval, summarization, resources, and evaluation: 2; spoken dialog systems and conversational analysis: 2; spoken language understanding: 2; spoken language processing: 2; novel models and training methods for asr: 2; asr: 2; source separation: 2; neural networks for language modeling: 2; robust speech recognition: 2; adjusting to speaker, accent, and domain: 2; novel transformer models for asr: 1; self-supervised learning in asr: 1; search methods and decoding algorithms for asr: 1; speech, voice, and hearing disorders: 1; articulation: 1; robust asr, and far-field/multi-talker asr: 1; search/decoding algorithms for asr: 1; streaming asr: 1; speech enhancement and intelligibility: 1; neural transducers, streaming asr and novel asr models: 1; speech segmentation: 1; adaptation, transfer learning, and distillation for asr: 1; speech processing & measurement: 1; single-channel and multi-channel speech enhancement: 1; spoken dialogue systems and multimodality: 1; tools, corpora and resources: 1; streaming for asr/rnn transducers: 1; acoustic event detection and acoustic scene classification: 1; low-resource speech recognition: 1; miscellaneous topics in asr: 1; emotion and sentiment analysis: 1; topics in asr: 1; cross/multi-lingual and code-switched asr: 1; speech signal analysis and representation: 1; target speaker detection, localization and separation: 1; single-channel speech enhancement: 1; asr neural network architectures and training: 1; speaker embedding: 1; noise robust and distant speech recognition: 1; sequence-to-sequence speech recognition: 1; asr for noisy and far-field speech: 1; speaker recognition: 1; speaker recognition evaluation: 1; asr neural network training: 1; spoken term detection, confidence measure, and end-to-end speech recognition: 1; asr neural network architectures: 1; speech and voice disorders: 1; search methods for speech recognition: 1; speech technologies for code-switching in multilingual communities: 1; nn architectures for asr: 1; language identification: 1; sequence models for asr: 1; end-to-end speech recognition: 1; the first dihard speech diarization challenge: 1; deep enhancement: 1; recurrent neural models for asr: 1
IEEE keywords: speech recognition: 59; task analysis: 19; speech enhancement: 18; self supervised learning: 15; data models: 14; end to end: 14; natural language processing: 14; decoding: 13; predictive models: 13; end to end speech recognition: 11; computational modeling: 9; pipelines: 9; benchmark testing: 8; ctc: 8; spoken language understanding: 8; speaker recognition: 8; recurrent neural nets: 8; adaptation models: 7; signal processing algorithms: 7; transformers: 7; speaker diarization: 7; speech coding: 7; training data: 6; encoding: 6; transformer: 6; vocabulary: 6; automatic speech recognition: 5; hidden markov models: 5; representation learning: 5; asr: 5; speech translation: 5; speech separation: 5; semantics: 5; speech synthesis: 5; multitasking: 5; text analysis: 5; source separation: 5; transfer learning: 4; correlation: 4; recording: 4; annotations: 4; transducers: 4; supervised learning: 4; time frequency analysis: 4; language translation: 4; pattern classification: 4; signal classification: 4; microphone arrays: 4; symbols: 3; analytical models: 3; streaming: 3; speech summarization: 3; behavioral sciences: 3; standards: 3; memory management: 3; data augmentation: 3; measurement: 3; oral communication: 3; audio visual: 3; hubert: 3; attention: 3; visualization: 3; eend: 3; noise measurement: 3; complex spectral mapping: 3; time domain analysis: 3; array signal processing: 3; misp challenge: 3; encoder decoder: 3; rnn t: 3; open source: 3; non autoregressive: 3; autoregressive processes: 3; sequence to sequence: 3; connectionist temporal classification: 3; convolutional neural nets: 3; audio signal processing: 3; biological system modeling: 2; benchmark: 2; protocols: 2; speech: 2; switches: 2; inference algorithms: 2; discrete units: 2; complexity theory: 2; interpretability: 2; buildings: 2; end to end models: 2; machine translation: 2; probabilistic logic: 2; databases: 2; linguistics: 2; speaker diarisation: 2; topic model: 2; phonetics: 2; sentiment analysis: 2; error analysis: 2; indexes: 2; estimation: 2; reverberation: 2; frame online speech enhancement: 2; end to end systems: 2; stop challenge: 2; degradation: 2; self supervised representations: 2; voice conversion: 2; reproducibility of results: 2; unsupervised asr: 2; semi supervised learning: 2; transducer: 2; quality assessment: 2; phase estimation: 2; self supervision: 2; code switched asr: 2; public domain software: 2; graph theory: 2; end to end asr: 2; text to speech: 2; cycle consistency: 2; end to end speech translation: 2; pattern clustering: 2; neural net architecture: 2; self attention: 2; joint ctc/attention: 2; unpaired data: 2; multiple microphone array: 2; sound event detection: 2; signal detection: 2; multilingual text to speech: 1; low resource adaptation: 1; graphone: 1; adaptation of masked language model: 1; task generalization: 1; evaluation: 1; foundation model: 1; semi autoregressive: 1; redundancy: 1; systematics: 1; long form asr: 1; data processing: 1; probes: 1; mutual information: 1; linear probing: 1; information theory: 1; solids: 1; conversational speech recognition: 1; conversation transcription: 1; multi talker automatic speech recognition: 1; label priors: 1; runtime: 1; forced alignment: 1; instruction tuning: 1; collaboration: 1; conversational speech: 1; context modeling: 1; contextual information: 1; robustness: 1; zero shot learning: 1; code switching: 1; splicing: 1; synthetic summary: 1; chatbots: 1; large language model: 1; chatgpt: 1; multi modal tokens: 1; image to speech synthesis: 1; vector quantization: 1; image to speech captioning: 1; multi modal speech processing: 1; memory: 1; costs: 1; dataset: 1; boosting: 1; modulation: 1; multitask: 1; spoken language model: 1; animation: 1; three dimensional displays: 1; multi task learning: 1; tongue: 1; speech animation: 1; solid modeling: 1; ema: 1; context: 1; generative context: 1; self supervised speech models: 1; beam search: 1; acoustic beams: 1; contextualization: 1; biasing: 1; st: 1; multi tasking: 1; mt: 1; low resource language lip reading: 1; multilingual automated labeling: 1; lips: 1; visual speech recognition: 1; lip reading: 1; artificial intelligence: 1; encoder decoder models: 1; modularity: 1; network architecture: 1; online diarization: 1; optimization: 1; robust automatic speech recognition: 1; articulatory attribute: 1; broad phonetic classes: 1; full and sub band integration: 1; acoustic beamforming: 1; computer architecture: 1; discrete fourier transforms: 1; low latency communication: 1; microphone array processing: 1; prediction algorithms: 1; spoken dialog system: 1; emotion recognition: 1; joint modelling: 1; history: 1; speaker attributes: 1; overthinking: 1; synchronization: 1; prosody transfer: 1; rhythm: 1; one shot: 1; synthesizers: 1; disentangled speech representation: 1; codes: 1; multilingual asr: 1; face recognition: 1; low resource asr: 1; cleaning: 1; usability: 1; target tracking: 1; disfluency detection: 1; espnet: 1; s3prl: 1; learning systems: 1; multiprotocol label switching: 1; pseudo labeling: 1; semisupervised learning: 1; limiting: 1; intermediate loss: 1; pre trained language model: 1; bit error rate: 1; bert: 1; masked language model: 1; adapter: 1; data mining: 1; evaluation protocol: 1; speaker verification: 1; video on demand: 1; computational efficiency: 1; end to end modeling: 1; memory efficient encoders: 1; dual speech/text encoder: 1; long spoken document: 1; e2e: 1; on device: 1; tensors: 1; e branchformer: 1; sequential distillation: 1; tensor decomposition: 1; articulatory: 1; gestural scores: 1; production systems: 1; factor analysis: 1; kinematics: 1; lda: 1; unsupervised: 1; wavlm: 1; automatic speech quality assessment: 1; speech language model: 1; discrete token: 1; closed box: 1; real time systems: 1; speech to text translation: 1; out of order: 1; heterogeneous networks: 1; self supervised models: 1; convolution: 1; structured pruning: 1; bridges: 1; connectors: 1; question answering (information retrieval): 1; speech to speech translation: 1; text to speech augmentation: 1; fine tuning: 1; speaker separation: 1; low complexity speech enhancement: 1; hearing aids design: 1; road transportation: 1; memory architecture: 1; quantization (signal): 1; tv: 1; multimodality: 1; production: 1; articulatory inversion: 1; articulatory speech processing: 1; text recognition: 1; spoken named entity recognition: 1; zero shot asr: 1; impedance matching: 1; acoustic measurements: 1; acoustic parameters: 1; phonetic alignment: 1; perceptual quality: 1; noise reduction: 1; enhancement: 1; explainable enhancement evaluation: 1; frequency estimation: 1; eda: 1; iterative methods: 1; inference mechanisms: 1; speech based user interfaces: 1; gtc: 1; multi speaker overlapped speech: 1; wfst: 1; wake word spotting: 1; audio visual systems: 1; microphone array: 1; ctc/attention speech recognition: 1; channel bank filters: 1; fourier transforms: 1; computer based training: 1; self supervised speech representation: 1; sensor fusion: 1; attention fusion: 1; rover: 1; generative model: 1; diffusion probabilistic model: 1; bic: 1; interactive systems: 1; unit based language model: 1; acoustic unit discovery: 1; gtc t: 1; noise robustness: 1; joint modeling: 1; natural languages: 1; audio captioning: 1; aac: 1; linguistic annotation: 1; recurrent neural network: 1; sru++: 1; bilingual asr: 1; computational linguistics: 1; audio processing: 1; open source toolkit: 1; software packages: 1; python: 1; end to end speech processing: 1; conformer: 1; image sequences: 1; non autoregressive sequence generation: 1; non autoregressive decoding: 1; multiprocessing systems: 1; conditional masked language model: 1; long sequence data: 1; gaussian processes: 1; search problems: 1; multitask learning: 1; stochastic processes: 1; continuous speech separation: 1; long recording speech separation: 1; online processing: 1; transforms: 1; dual path modeling: 1; noisy speech: 1; deep learning (artificial intelligence): 1; signal denoising: 1; loudspeakers: 1; diarization: 1; audio recording: 1; entropy: 1; target speaker speech recognition: 1; target speaker speech extraction: 1; uncertainty estimation: 1; direction of arrival estimation: 1; source localization: 1; multi encoder multi resolution (mem res): 1; multi encoder multi array (mem array): 1; hierarchical attention network (han): 1; curriculum learning: 1; end to end model: 1; multi talker mixed speech recognition: 1; knowledge distillation: 1; permutation invariant training: 1; overlapped speech recognition: 1; neural beamforming: 1; lightweight convolution: 1; dynamic convolution: 1; open source software: 1; proposals: 1; neural network: 1; region proposal network: 1; faster r cnn: 1; speaker adaptation: 1; end to end speech synthesis: 1; joint training of asr tts: 1; multi stream: 1; two stage training: 1; weakly supervised learning: 1; target speech extraction: 1; minimisation: 1; neural beamformer: 1; signal reconstruction: 1; voice activity detection: 1; ctc greedy search: 1; cloud computing: 1; covariance matrix adaptation evolution strategy (cma es): 1; multi objective optimization: 1; pareto optimisation: 1; genetic algorithm: 1; parallel processing: 1; deep neural network (dnn): 1; evolutionary computation: 1; attention models: 1; discriminative training: 1; optimisation: 1; softmax margin: 1; beam search training: 1; sequence learning: 1; multi speaker speech recognition: 1; cocktail party problem: 1; attention mechanism: 1; cold fusion: 1; automatic speech recognition (asr): 1; language model: 1; shallow fusion: 1; storage management: 1; deep fusion: 1; expert systems: 1; low resource language: 1; multilingual speech recognition: 1; acoustic model: 1; autoencoder: 1; weakly labeled data: 1; restricted boltzmann machine: 1; unsupervised learning: 1; conditional restricted boltzmann machine: 1; robust speech recognition: 1; acoustic modeling: 1; chime 5 challenge: 1; kaldi: 1; discrete representation: 1; mask inference: 1; interpolation: 1; error statistics: 1; stream attention: 1; speech codecs: 1; word processing: 1; sub word modeling: 1
Most publications (all venues): 2023: 98; 2024: 75; 2022: 73; 2021: 70; 2019: 47

Affiliations
Carnegie Mellon University, Pittsburgh, PA, USA
Johns Hopkins University, Baltimore, MD, USA (former)
Mitsubishi Electric Research Laboratories, Cambridge, MA, USA (2012 - 2017)
NTT Communication Science Laboratories, Kyoto, Japan (2001 - 2011)
Waseda University, Tokyo, Japan (PhD 2006)

Recent publications

TASLP2024 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe 0001
End-to-End Speech Recognition: A Survey.

TASLP2024 Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe 0001, Shinnosuke Takamichi, Hiroshi Saruwatari, 
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis.

TASLP2024 Shu-Wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li 0001, Abdelrahman Mohamed, Shinji Watanabe 0001, Hung-yi Lee, 
A Large-Scale Evaluation of Speech Foundation Models.

ICASSP2024 Siddhant Arora, George Saon, Shinji Watanabe 0001, Brian Kingsbury, 
Semi-Autoregressive Streaming ASR with Label Context.

ICASSP2024 Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon Jung, Yichen Lu, Soumi Maiti, Roshan S. Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe 0001, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang, 
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.

ICASSP2024 William Chen, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe 0001
Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing.

ICASSP2024 Kwanghee Choi, Jee-Weon Jung, Shinji Watanabe 0001
Understanding Probe Behaviors Through Variational Bounds of Mutual Information.

ICASSP2024 Samuele Cornell, Jee-Weon Jung, Shinji Watanabe 0001, Stefano Squartini, 
One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition.

ICASSP2024 Ruizhe Huang, Xiaohui Zhang 0007, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe 0001, Daniel Povey, Sanjeev Khudanpur, 
Less Peaky and More Accurate CTC Forced Alignment by Label Priors.

ICASSP2024 Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan S. Sharma, Shinji Watanabe 0001, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee, 
Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech.

ICASSP2024 Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization.

ICASSP2024 Amir Hussein, Dorsa Zeinali, Ondrej Klejch, Matthew Wiesner, Brian Yan, Shammur Absar Chowdhury, Ahmed Ali 0002, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora.

ICASSP2024 Jee-Weon Jung, Roshan S. Sharma, William Chen, Bhiksha Raj, Shinji Watanabe 0001
AugSumm: Towards Generalizable Speech Summarization Using Synthetic Labels from Large Language Models.

ICASSP2024 Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe 0001, Yong Man Ro, 
Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-Training and Multi-Modal Tokens.

ICASSP2024 Doyeop Kwak, Jaemin Jung, Kihyun Nam, Youngjoon Jang, Jee-Weon Jung, Shinji Watanabe 0001, Joon Son Chung, 
VoxMM: Rich Transcription of Conversations in the Wild.

ICASSP2024 Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhongqiu Wang, Shinji Watanabe 0001
Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor.

ICASSP2024 Takashi Maekaku, Jiatong Shi, Xuankai Chang, Yuya Fujita, Shinji Watanabe 0001
Hubertopic: Enhancing Semantic Representation of Hubert Through Self-Supervision Utilizing Topic Model.

ICASSP2024 Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-Weon Jung, Xuankai Chang, Shinji Watanabe 0001
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks.

ICASSP2024 Salvador Medina, Sarah L. Taylor, Carsten Stoll, Gareth Edwards, Alex Hauptmann 0001, Shinji Watanabe 0001, Iain A. Matthews, 
PhISANet: Phonetically Informed Speech Animation Network.

ICASSP2024 Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe 0001, Karen Livescu, 
Generative Context-Aware Fine-Tuning of Self-Supervised Speech Models.

#2 | Helen M. Meng | DBLP | Google Scholar
By venue: ICASSP: 74; Interspeech: 72; TASLP: 19; ICML: 1; IJCAI: 1
By year: 2024: 14; 2023: 30; 2022: 43; 2021: 31; 2020: 18; 2019: 22; 2018: 9
ISCA sessions: speech synthesis: 13; speech and language in health: 5; voice conversion and adaptation: 5; speech recognition of atypical speech: 4; speech recognition: 2; topics in asr: 2; spoken term detection: 2; asr neural network architectures: 2; neural techniques for voice conversion and waveform generation: 2; medical applications and visual asr: 2; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; multi-talker methods in speech processing: 1; single-channel speech enhancement: 1; multi-, cross-lingual and other topics in asr: 1; novel models and training methods for asr: 1; atypical speech analysis and detection: 1; multimodal speech emotion recognition and paralinguistics: 1; miscellaneous topics in speech, voice and hearing disorders: 1; spoofing-aware automatic speaker verification (sasv): 1; zero, low-resource and multi-modal speech recognition: 1; embedding and network architecture for speaker recognition: 1; voice anti-spoofing and countermeasure: 1; non-autoregressive sequential modeling for speech processing: 1; assessment of pathological speech and language: 1; non-native speech: 1; speaker recognition: 1; speech synthesis paradigms and methods: 1; speech in multimodality: 1; asr neural network architectures and training: 1; new trends in self-supervised speech processing: 1; multimodal speech processing: 1; learning techniques for speaker recognition: 1; speech and speaker recognition: 1; speech and audio classification: 1; lexicon and language model for speech recognition: 1; novel neural network architectures for acoustic modelling: 1; second language acquisition and code-switching: 1; voice conversion: 1; emotion recognition and analysis: 1; plenary talk: 1; expressive speech synthesis: 1; deep learning for source separation and pitch tracking: 1; application of asr in medical practice: 1
IEEE keywords: speech recognition: 37; speech synthesis: 21; speaker recognition: 19; natural language processing: 14; recurrent neural nets: 11; adaptation models: 8; speech coding: 8; speech separation: 8; emotion recognition: 8; task analysis: 7; vocoders: 7; speech enhancement: 7; speaker verification: 7; voice conversion: 7; speech emotion recognition: 7; decoding: 6; semantics: 6; self supervised learning: 6; transformer: 6; bayes methods: 6; deep learning (artificial intelligence): 6; text analysis: 6; optimisation: 6; text to speech: 5; transformers: 5; linguistics: 5; adversarial attack: 5; data models: 4; representation learning: 4; dysarthric speech reconstruction: 4; audio visual: 4; expressive speech synthesis: 4; data mining: 4; neural architecture search: 4; security of data: 4; gaussian processes: 4; quantisation (signal): 4; multi channel: 4; overlapped speech: 4; elderly speech: 3; dysarthric speech: 3; hidden markov models: 3; speech: 3; visualization: 3; training data: 3; speaker adaptation: 3; predictive models: 3; error analysis: 3; domain adaptation: 3; robustness: 3; bayesian learning: 3; knowledge distillation: 3; audio signal processing: 3; biometrics (access control): 3; speech intelligibility: 3; entropy: 3; voice activity detection: 3; variational inference: 3; language models: 3; convolutional neural nets: 3; pre trained asr system: 2; wav2vec2.0: 2; older adults: 2; bidirectional attention mechanism: 2; spectrogram: 2; multi modal: 2; vq vae: 2; cloning: 2; language model: 2; timbre: 2; transfer learning: 2; instruments: 2; coherence: 2; perturbation methods: 2; cognition: 2; vector quantization: 2; hierarchical: 2; estimation: 2; conformer: 2; end to end: 2; automatic speech recognition: 2; computational modeling: 2; noise reduction: 2; asr: 2; speaking style modelling: 2; bidirectional control: 2; multi task learning: 2; particle separators: 2; time frequency analysis: 2; source separation: 2; costs: 2; measurement: 2; data augmentation: 2; handicapped aids: 2; disordered speech recognition: 2; time delay neural network: 2; automatic speaker verification: 2; adversarial defense: 2; model uncertainty: 2; neural language models: 2; trees (mathematics): 2; benchmark testing: 2; audio visual systems: 2; anti spoofing: 2; speaker diarization: 2; multi look: 2; inference mechanisms: 2; gradient methods: 2; admm: 2; autoregressive processes: 2; quantization: 2; code switching: 2; standards: 1; multi lingual xlsr: 1; hubert: 1; films: 1; multiscale speaking style transfer: 1; text to speech synthesis: 1; games: 1; automatic dubbing: 1; cross lingual speaking style transfer: 1; prompt based learning: 1; diffusion model: 1; metric learning: 1; natural languages: 1; av hubert: 1; transforms: 1; pre training: 1; self supervised style enhancing: 1; dance expressiveness: 1; dance generation: 1; genre matching: 1; dance dynamics: 1; humanities: 1; dynamics: 1; beat alignment: 1; zero shot: 1; multi scale acoustic prompts: 1; prompt tuning: 1; parameter efficient tuning: 1; transformer adapter: 1; pre trained transformer: 1; multiple signal classification: 1; long multi track: 1; multi view midivae: 1; symbolic music generation: 1; two dimensional displays: 1; speech disentanglement: 1; vae: 1; voice cloning: 1; static var compensators: 1; harmonic analysis: 1; power harmonic filters: 1; synthesizers: 1; neural concatenation: 1; signal generators: 1; singing voice conversion: 1; speech normalization: 1; speech units: 1; pipelines: 1; speech representation learning: 1; information retrieval: 1; interaction gesture: 1; multi agent conversational interaction: 1; oral communication: 1; dialog intention and emotion: 1; co speech gesture generation: 1; neural tts: 1; multi stage multi codebook (msmc): 1; speech representation: 1; context modeling: 1; style modeling: 1; bit error rate: 1; multi scale: 1; speech dereverberation: 1; maximum likelihood detection: 1; nonlinear filters: 1; neural machine translation: 1; hierarchical attention mechanism: 1; machine translation: 1; meta learning: 1; meta generalized speaker verification: 1; performance evaluation: 1; optimization: 1; domain mismatch: 1; recording: 1; subband interaction: 1; inter subnet: 1; global spectral information: 1; feature selection: 1; rabbits: 1; disfluency pattern: 1; dementia detection: 1; audiobook speech synthesis: 1; prediction methods: 1; context aware: 1; multi sentence: 1; hierarchical transformer: 1; additives: 1; contrastive learning: 1; multiobjective optimization: 1; additive angular margin: 1; optimization methods: 1; attention mechanism: 1; alzheimer's disease: 1; sociology: 1; syntactics: 1; task oriented: 1; pretrained embeddings: 1; multimodality: 1; affective computing: 1; multi label: 1; emotional expression: 1; multi culture: 1; vocal bursts: 1; data analysis: 1; target speech extraction: 1; multi modal fusion: 1; fuses: 1; encoding: 1; 2d positional encoding: 1; cross attention: 1; end to end speech recognition: 1; multi talker speech recognition: 1; network architecture: 1; corrector network: 1; time domain: 1; time frequency domain: 1; learning systems: 1; synthetic corpus: 1; audio recording: 1; neural vocoder: 1; semantic augmentation: 1; upper bound: 1; difficulty aware: 1; stability analysis: 1; contextual biasing: 1; biased words: 1; sensitivity: 1; open vocabulary keyword spotting: 1; acoustic model: 1; dynamic network pruning: 1; melody unsupervision: 1; differentiable up sampling layer: 1; rhythm: 1; vocal range: 1; regulators: 1; annotations: 1; singing voice synthesis: 1; bi directional flow: 1; elderly speech recognition: 1; search problems: 1; uncertainty handling: 1; minimisation: 1; neural net architecture: 1; pattern classification: 1; adversarial attacks: 1; supervised learning: 1; monte carlo methods: 1; tree structure: 1; prosodic structure prediction: 1; computational linguistics: 1; span based decoder: 1; character level: 1; image segmentation: 1; phase information: 1; full band extractor: 1; multi scale time sensitive channel attention: 1; memory management: 1; convolution: 1; knowledge based systems: 1; flat lattice transformer: 1; rule based: 1; chinese text normalization: 1; non standard word: 1; relative position encoding: 1; articulatory inversion: 1; hybrid power systems: 1; xlnet: 1; speaking style: 1; conversational text to speech synthesis: 1; graph neural network: 1; matrix algebra: 1; end to end model: 1; forced alignment: 1; dereverberation and recognition: 1; reverberation: 1; speaker change detection: 1; multitask learning: 1; unsupervised learning: 1; unsupervised speech decomposition: 1; adversarial speaker adaptation: 1; speaker identity: 1; multi speaker: 1; knowledge transfer: 1; video to speech synthesis: 1; knowledge engineering: 1; lips: 1; predictive coding: 1; vocabulary: 1; vocoder: 1; uniform sampling: 1; path dropout: 1; partially fake audio detection: 1; audio deep synthesis detection challenge: 1; design methodology: 1; mean square error methods: 1; neural network quantization: 1; mixed precision: 1; connectionist temporal classification: 1; cross entropy: 1; disentangling: 1; hybrid bottleneck features: 1; feature fusion: 1; data handling: 1; m2met: 1; direction of arrival estimation: 1; direction of arrival: 1; delays: 1; generalisation (artificial intelligence): 1; lf mmi: 1; gaussian process: 1; any to many: 1; sequence to sequence modeling: 1; signal reconstruction: 1; signal sampling: 1; signal representation: 1; location relative attention: 1; multimodal speech recognition: 1; capsule: 1; exemplary emotion descriptor: 1; residual error: 1; capsule network: 1; spatial information: 1; sequential: 1; recurrent: 1; lstm rnn: 1; low bit quantization: 1; image recognition: 1; microphone arrays: 1; visual occlusion: 1; overlapped speech recognition: 1; jointly fine tuning: 1; filtering theory: 1; video signal processing: 1; emotion: 1; global style token: 1; expressive: 1; synthetic speech detection: 1; res2net: 1; replay detection: 1; multi scale feature: 1; asv anti spoofing: 1; adress: 1; patient diagnosis: 1; alzheimer's disease detection: 1; signal classification: 1; diseases: 1; features: 1; geriatrics: 1; medical diagnostic computing: 1; ctc: 1; non autoregressive: 1; neural network based text to speech: 1; grammars: 1; prosody control: 1; word processing: 1; syntactic parse tree traversal: 1; syntactic representation learning: 1; controllable and efficient: 1; semi autoregressive: 1; prosody modelling: 1; multi speaker and multi style tts: 1; hifi gan: 1; durian: 1; low resource condition: 1; weapons: 1; information filters: 1; switches: 1; uncertainty: 1; neurocognitive disorder detection: 1; dementia: 1; phonetic posteriorgrams: 1; x vector: 1; gmm i vector: 1; accent conversion: 1; accented speech recognition: 1; cross modal: 1; seq2seq: 1; adversarial training: 1; spatial smoothing: 1; spoofing countermeasure: 1; recurrent neural networks: 1; data compression: 1; alternating direction methods of multipliers: 1; audio visual speech recognition: 1; multilingual speech synthesis: 1; foreign accent: 1; spectral analysis: 1; center loss: 1; human computer interaction: 1; discriminative features: 1; gaussian process neural network: 1; activation function selection: 1; bayesian neural network: 1; neural network language models: 1; lstm: 1; parameter estimation: 1; connectionist temporal classification (ctc): 1; e learning: 1; computer assisted pronunciation training (capt): 1; convolutional neural network (cnn): 1; mispronunciation detection and diagnosis (mdd): 1; multi head self attention: 1; dilated residual network: 1; wavenet: 1; self attention: 1; blstm: 1; phonetic posteriorgrams (ppgs): 1; quasi fully recurrent neural network (qrnn): 1; parallel processing: 1; parallel wavenet: 1; text to speech (tts) synthesis: 1; convolutional neural network (cnn): 1; utterance level features: 1; spatial relationship information: 1; recurrent connection: 1; capsule networks: 1; natural gradient: 1; rnnlms: 1
Most publications (all venues): 2022: 76; 2023: 56; 2021: 54; 2024: 45; 2019: 29

Affiliations
The Chinese University of Hong Kong
Massachusetts Institute of Technology, Cambridge, MA, USA (former)

Recent publications

TASLP2024 Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu, 
Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition.

TASLP2024 Jingbei Li, Sipan Li, Ping Chen, Luwen Zhang, Yi Meng, Zhiyong Wu 0001, Helen Meng, Qiao Tian, Yuping Wang, Yuxuan Wang 0002, 
Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing.

TASLP2024 Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng
InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt.

ICASSP2024 Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu 0001, Xunying Liu, Helen Meng
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.

ICASSP2024 Xueyuan Chen, Xi Wang 0016, Shaofei Zhang, Lei He 0005, Zhiyong Wu 0001, Xixin Wu, Helen Meng
Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.

ICASSP2024 Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu 0001, Haozhi Huang 0004, Helen Meng
Enhancing Expressiveness in Dance Generation Via Integrating Frequency and Music Style Information.

ICASSP2024 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Dan Luo, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han 0001, Helen Meng
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts.

ICASSP2024 Zhe Li, Man-Wai Mak, Helen Mei-Ling Meng
Dual Parameter-Efficient Fine-Tuning for Speaker Representation Via Speaker Prompt Tuning and Adapters.

ICASSP2024 Zhiwei Lin, Jun Chen 0024, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu 0001, Helen Meng
Multi-View Midivae: Fusing Track- and Bar-View Representations for Long Multi-Track Symbolic Music Generation.

ICASSP2024 Hui Lu, Xixin Wu, Haohan Guo, Songxiang Liu, Zhiyong Wu 0001, Helen Meng
Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations.

ICASSP2024 Binzhu Sha, Xu Li 0015, Zhiyong Wu 0001, Ying Shan, Helen Meng
Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion.

ICASSP2024 Yuejiao Wang, Xixin Wu, Disong Wang, Lingwei Meng, Helen Meng
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization.

ICASSP2024 Haiwei Xue, Sicheng Yang, Zhensong Zhang, Zhiyong Wu 0001, Minglei Li 0001, Zonghong Dai, Helen Meng
Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models.

ICML2024 Dongchao Yang, Jinchuan Tian, Xu Tan 0003, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian 0002, Zhou Zhao, Xixin Wu, Helen M. Meng
UniAudio: Towards Universal Audio Generation with Large Language Models.

TASLP2023 Haohan Guo, Fenglong Xie, Xixin Wu, Frank K. Soong, Helen Meng
MSMC-TTS: Multi-Stage Multi-Codebook VQ-VAE Based Neural TTS.

TASLP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Helen Meng
MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis.

TASLP2023 Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu, 
Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition.

TASLP2023 Xixin Wu, Hui Lu, Kun Li 0003, Zhiyong Wu 0001, Xunying Liu, Helen Meng
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms.

TASLP2023 Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, Helen Meng
Meta-Generalization for Domain-Invariant Speaker Verification.

ICASSP2023 Jun Chen 0024, Wei Rao, Zilin Wang, Jiuxin Lin, Zhiyong Wu 0001, Yannan Wang, Shidong Shang, Helen Meng
Inter-Subnet: Speech Enhancement with Subband Interaction.

#3  | Haizhou Li 0001 | DBLP Google Scholar  
By venue: Interspeech: 70, ICASSP: 44, TASLP: 29, SpeechComm: 8, AAAI: 2, NeurIPS: 1
By year: 2024: 19, 2023: 24, 2022: 22, 2021: 31, 2020: 25, 2019: 26, 2018: 7
ISCA sessions: speech synthesis: 6, source separation: 4, voice conversion and adaptation: 3, speech signal characterization: 3, speech technologies for code-switching in multilingual communities: 3, invariant and robust pre-trained acoustic models: 2, analysis of speech and audio signals: 2, novel models and training methods for asr: 2, speaker recognition: 2, speech enhancement, bandwidth extension and hearing aids: 2, spoken term detection: 2, anti-spoofing for speaker verification: 1, speaker and language identification: 1, biosignal-enabled spoken communication: 1, asr: 1, resource-constrained asr: 1, target speaker detection, localization and separation: 1, the first dicova challenge: 1, spoken language understanding: 1, self-supervision and semi-supervision for neural asr training: 1, speech enhancement and intelligibility: 1, robust speaker recognition: 1, feature, embedding and neural architecture for speaker recognition: 1, neural signals for spoken communication: 1, the attacker’s perspective on automatic speaker verification: 1, targeted source separation: 1, speech in multimodality: 1, the interspeech 2020 far field speaker verification challenge: 1, speaker recognition challenges and applications: 1, anti-spoofing and liveness detection: 1, asr neural network architectures: 1, cross/multi-lingual and code-switched speech recognition: 1, the interspeech 2019 computational paralinguistics challenge (compare): 1, the 2019 automatic speaker verification spoofing and countermeasures challenge: 1, speaker recognition and anti-spoofing: 1, speech processing and analysis: 1, speaker recognition evaluation: 1, speaker and language recognition: 1, speech and audio characterization and segmentation: 1, neural waveform generation: 1, the zero resource speech challenge 2019: 1, speech and speaker recognition: 1, speaker recognition and diarization: 1, cross-lingual and multilingual asr: 1, speech and singing production: 1, prosody modeling and generation: 1, voice conversion and speech synthesis: 1, speaker verification: 1, show and tell: 1, source separation from monaural input: 1
IEEE keywords: speech recognition: 22, speaker recognition: 21, task analysis: 14, speech synthesis: 14, natural language processing: 9, transformers: 7, visualization: 6, speaker embedding: 6, data models: 6, speech coding: 5, speech enhancement: 5, target speaker extraction: 5, multi modal: 5, emotion recognition: 5, music: 5, speaker extraction: 5, lips: 4, computational modeling: 4, hidden markov models: 4, phonetics: 4, decoding: 4, training data: 4, self supervised learning: 4, time domain: 4, text analysis: 4, transformer: 3, time frequency analysis: 3, data mining: 3, rendering (computer graphics): 3, signal processing algorithms: 3, adaptation models: 3, voice activity detection: 3, synchronization: 3, pipelines: 3, music information retrieval: 3, speech intelligibility: 3, representation learning: 3, multi task learning: 3, singing voice separation: 3, voice conversion: 3, transfer learning: 3, security of data: 3, anti spoofing: 3, speaker verification: 2, accent: 2, steganalysis: 2, steganography: 2, speech separation: 2, internet: 2, measurement: 2, cross lingual voice conversion (xvc): 2, predictive models: 2, auditory system: 2, filtering algorithms: 2, direction of arrival: 2, location awareness: 2, linguistics: 2, sparsely overlapped speech: 2, noise robustness: 2, recurrent neural networks: 2, robustness: 2, direction of arrival estimation: 2, automatic dialogue evaluation: 2, correlation: 2, time frequency attention: 2, cocktail party problem: 2, codes: 2, benchmark testing: 2, image recognition: 2, lyrics transcription: 2, hearing: 2, convolutional neural nets: 2, pattern classification: 2, pre training: 2, tacotron: 2, vocoders: 2, speaker characterization: 2, voice conversion (vc): 2, cross lingual: 2, word processing: 2, tts: 2, signal detection: 2, signal reconstruction: 2, cepstral analysis: 2, source separation: 2, automatic cued speech recognition: 1, computational efficiency: 1, computation and parameter efficient: 1, cross attention: 1, resnet: 1, stride configuration: 1, temporal resolution: 1, 2d cnn: 1, convolutional neural networks: 1, image resolution: 1, computer architecture: 1, controllable: 1, text to speech (tts) synthesis: 1, accent intensity: 1, multi agent deep learning: 1, weight parameter aggregation: 1, streams: 1, low bit rate speech streams: 1, long short term memory: 1, pretraining: 1, siamese network: 1, psychoacoustic models: 1, self supervise: 1, synthetic data: 1, maximum mean discrepancy: 1, predictive coding: 1, electronics packaging: 1, multimodal sensors: 1, oral communication: 1, context modeling: 1, dialog systems: 1, history: 1, multi reference: 1, timbre: 1, pitch normalization: 1, text to speech (tts): 1, phonetic variation: 1, prosodic variation: 1, target speaker localization: 1, speaker dependent mask: 1, focusing: 1, emotional text to speech: 1, emotion prediction: 1, emotion control: 1, target speech diarization: 1, switches: 1, semantics: 1, speaker diarization: 1, prompt driven: 1, mimics: 1, active speaker detection: 1, audio visual: 1, interference: 1, speech: 1, low snr: 1, testing: 1, optimization: 1, artificial noise: 1, signal to noise ratio: 1, background noise: 1, gradient: 1, noise robust: 1, neuromorphics: 1, neurons: 1, encoding: 1, spiking neural networks: 1, spike encoding: 1, filter banks: 1, learnable audio front end: 1, system performance: 1, in the wild: 1, dino: 1, biological system modeling: 1, spiking neural network (snn): 1, voice activity detection (vad): 1, auditory attention: 1, power demand: 1, multiple signal classification: 1, lyrics transcription in polyphonic music: 1, integrated fine tuning: 1, vocal extraction: 1, robots: 1, speaker tracking: 1, cross modal attention: 1, estimation: 1, audio visual fusion: 1, progressive clustering: 1, diverse positive pairs: 1, supervised learning: 1, face recognition: 1, speech streams: 1, delays: 1, resistance: 1, pitch delays: 1, deep neural networks: 1, distortion: 1, voice over internet protocol: 1, quantization (signal): 1, multitask learning: 1, adapters: 1, multi domain generalization: 1, noise measurement: 1, restcn: 1, error analysis: 1, linguistic loss: 1, brain modeling: 1, speech stimulus: 1, electroencephalography: 1, eeg decoding: 1, speech envelope: 1, match mismatch classification: 1, visual occlusions: 1, design methodology: 1, inpainting: 1, noisy label: 1, deep cleansing: 1, audiovisual: 1, joint pre training: 1, speech representation: 1, analytical models: 1, feeds: 1, transformer cores: 1, sparse self attention: 1, central moment discrepancy (cmd): 1, missing modality imagination: 1, invariant feature: 1, multimodal emotion recognition: 1, automatic lyrics transcription in polyphonic music: 1, multitasking: 1, instruments: 1, singing skill evaluation: 1, lyrics synchronization: 1, singing information processing: 1, audio signal processing: 1, singing voice synthesis: 1, singing voice: 1, general speech mixture: 1, scenario aware differentiated loss: 1, filtering theory: 1, speech lip synchronization: 1, self enrollment: 1, multilingual: 1, language translation: 1, grammars: 1, natural languages: 1, selective auditory attention: 1, globalphone: 1, target language extraction: 1, lyrics transcription of polyphonic music: 1, beamforming: 1, doa estimation: 1, speaker localizer: 1, reverberation: 1, array signal processing: 1, multi scale frequency channel attention: 1, short utterance: 1, text independent speaker verification: 1, text detection: 1, visual text to speech: 1, automatic voice over: 1, textual visual attention: 1, image fusion: 1, lip speech synchronization: 1, video signal processing: 1, pseudo label selection: 1, self supervised speaker recognition: 1, loss gated learning: 1, unsupervised learning: 1, temporal convolutional network: 1, energy distribution: 1, prompt: 1, multimodal: 1, phrase break prediction: 1, morphological and phonological features: 1, deep learning (artificial intelligence): 1, self attention: 1, prosodic phrasing: 1, mongolian speech synthesis: 1, expressive speech synthesis: 1, audio databases: 1, frame and style reconstruction loss: 1, speech analysis: 1, voice conversion evaluation: 1, voice conversion challenges: 1, vocoding: 1, target speaker verification: 1, single and multi talker speaker verification: 1, interactive systems: 1, speech based user interfaces: 1, human computer interaction: 1, sport: 1, holistic framework: 1, text to speech (tts): 1, non parallel: 1, context vector: 1, autoencoder: 1, personalized speech generation: 1, language agnostic: 1, syntax: 1, computational linguistics: 1, graph theory: 1, graph neural network: 1, synthetic speech detection: 1, signal companding: 1, data augmentation: 1, signal fusion: 1, multi stage: 1, spectro temporal attention: 1, speech emotion recognition: 1, convolution: 1, channel attention: 1, disentangled feature learning: 1, signal denoising: 1, adversarial training: 1, signal representation: 1, image sequences: 1, acoustic embeddings: 1, linguistic embeddings: 1, image classification: 1, intent classification: 1, cloning: 1, speaker adaption: 1, target tracking: 1, voice cloning: 1, speech emotion recognition (ser): 1, emotional voice conversion: 1, emotional speech dataset: 1, evaluation by ranking: 1, musical acoustics: 1, evaluation of singing quality: 1, inter singer measures: 1, music theory motivated measures: 1, self organising feature maps: 1, depth wise separable convolution: 1, multi scale: 1, inference mechanisms: 1, knowledge distillation: 1, autoregressive processes: 1, chains corpus: 1, vocal tract constriction: 1, whispered speech: 1, synthetic attacks: 1, replay attacks: 1, generalized countermeasures: 1, asvspoof 2019: 1, wavenet adaptation: 1, singular value decomposition: 1, singular value decomposition (svd): 1, automatic speech recognition: 1, acoustic modeling: 1, music genre: 1, lyrics alignment: 1, sensor fusion: 1, multi scale fusion: 1, speech bandwidth extension: 1, signal restoration: 1, time domain analysis: 1, low resource asr: 1, catastrophic forgetting: 1, independent language model: 1, fine tuning: 1, text to speech: 1, code switching: 1, crosslingual word embedding: 1, end to end: 1, continuous wavelet transforms: 1, tandem feature: 1, phonetic posteriorgrams (ppgs): 1, wavenet vocoder: 1, sparse matrices: 1, dictionaries: 1, prosody conversion: 1, language modelling: 1, cross lingual embedding: 1, code switch: 1, audio source separation: 1, polyphonic music: 1, asr: 1, lyrics to audio alignment: 1, asvspoof 2017: 1, channel bank filters: 1, automatic speaker verification: 1, spatial differentiation: 1, band pass filters: 1, iir filters: 1, spectrum approximation loss: 1, phonetic posteriorgram (ppg): 1, average modeling approach (ama): 1
Most publications (all venues) at: 2024: 74, 2010: 70, 2023: 67, 2021: 65, 2015: 61

Affiliations
Chinese University of Hong Kong (Shenzhen), China
National University of Singapore, Department of Electrical and Computer Engineering, Singapore
Nanyang Technological University, Singapore (2006 - 2016)
Institute for Infocomm Research, A*STAR, Singapore (2003 - 2016)
University of New South Wales, Sydney, Australia (2011)
University of Eastern Finland, Kuopio, Finland (2009)
South China University of Technology, Guangzhou, China (PhD 1990)

Recent publications

SpeechComm2024 Shuai Wang 0016, Zhengyang Chen, Bing Han, Hongji Wang, Chengdong Liang, Binbin Zhang, Xu Xiang, Wen Ding, Johan Rohdin, Anna Silnova, Yanmin Qian, Haizhou Li 0001
Advancing speaker embedding learning: Wespeaker toolkit for research and production.

TASLP2024 Lei Liu, Li Liu 0036, Haizhou Li 0001
Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition.

TASLP2024 Tianchi Liu 0004, Kong Aik Lee, Qiongqiong Wang, Haizhou Li 0001
Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification.

TASLP2024 Rui Liu 0008, Berrak Sisman, Guanglai Gao, Haizhou Li 0001
Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering.

TASLP2024 Congcong Sun, Hui Tian 0002, Peng Tian, Haizhou Li 0001, Zhenxing Qian
Multi-Agent Deep Learning for the Detection of Multiple Speech Steganography Methods.

TASLP2024 Wupeng Wang, Zexu Pan, Xinke Li, Shuai Wang 0016, Haizhou Li 0001
Speech Separation With Pretrained Frontend to Minimize Domain Mismatch.

TASLP2024 Koichiro Yoshino, Yun-Nung Chen, Paul A. Crook, Satwik Kottur, Jinchao Li, Behnam Hedayatnia, Seungwhan Moon, Zhengcong Fei, Zekang Li, Jinchao Zhang, Yang Feng 0004, Jie Zhou 0016, Seokhwan Kim, Yang Liu 0004, Di Jin, Alexandros Papangelis, Karthik Gopalakrishnan 0001, Dilek Hakkani-Tur, Babak Damavandi, Alborz Geramifard, Chiori Hori, Ankit Shah, Chen Zhang 0020, Haizhou Li 0001, João Sedoc, Luis F. D'Haro, Rafael E. Banchs, Alexander Rudnicky
Overview of the Tenth Dialog System Technology Challenge: DSTC10.

TASLP2024 Mingyang Zhang 0003, Yi Zhou 0020, Yi Ren 0006, Chen Zhang 0020, Xiang Yin 0006, Haizhou Li 0001
RefXVC: Cross-Lingual Voice Conversion With Enhanced Reference Leveraging.

TASLP2024 Xuehao Zhou, Mingyang Zhang 0003, Yi Zhou 0020, Zhizheng Wu 0001, Haizhou Li 0001
Accented Text-to-Speech Synthesis With Limited Data.

ICASSP2024 Yu Chen, Xinyuan Qian, Zexu Pan, Kainan Chen, Haizhou Li 0001
LOCSELECT: Target Speaker Localization with an Auditory Selective Hearing Mechanism.

ICASSP2024 Sho Inoue, Kun Zhou 0003, Shuai Wang 0016, Haizhou Li 0001
Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis.

ICASSP2024 Yidi Jiang, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, Haizhou Li 0001
Prompt-Driven Target Speech Diarization.

ICASSP2024 Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang 0016, Haizhou Li 0001
Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-Talker Speech.

ICASSP2024 Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li 0001
Gradient Weighting for Speaker Verification in Extremely Low Signal-to-Noise Ratio.

ICASSP2024 Zeyang Song, Jibin Wu, Malu Zhang, Mike Zheng Shou, Haizhou Li 0001
Spiking-Leaf: A Learnable Auditory Front-End for Spiking Neural Networks.

ICASSP2024 Shuai Wang 0016, Qibing Bai, Qi Liu 0018, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li 0001
Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.

ICASSP2024 Qu Yang, Qianhui Liu, Nan Li, Meng Ge, Zeyang Song, Haizhou Li 0001
SVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks.

AAAI2024 Rui Liu 0008, Yifan Hu, Yi Ren 0006, Xiang Yin 0006, Haizhou Li 0001
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling.

AAAI2024 Jiadong Wang, Zexu Pan, Malu Zhang, Robby T. Tan, Haizhou Li 0001
Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition.

SpeechComm2023 Buddhi Wickramasinghe, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Julien Epps, Haizhou Li 0001, Ting Dang
DNN controlled adaptive front-end for replay attack detection systems.

#4  | Lei Xie 0001 | DBLP Google Scholar  
By venue: Interspeech: 58, ICASSP: 40, TASLP: 15, SpeechComm: 2, ACL: 1, AAAI: 1
By year: 2024: 12, 2023: 26, 2022: 28, 2021: 16, 2020: 13, 2019: 16, 2018: 6
ISCA sessions: speech synthesis: 12, speech recognition: 4, voice conversion and adaptation: 4, speaker and language recognition: 2, asr: 2, adjusting to speaker, accent, and domain: 2, anti-spoofing for speaker verification: 1, multi-talker methods in speech processing: 1, speech synthesis and voice conversion: 1, statistical machine translation: 1, models for streaming asr: 1, novel models and training methods for asr: 1, multi-, cross-lingual and other topics in asr: 1, spoken language processing: 1, other topics in speech recognition: 1, spoofing-aware automatic speaker verification (sasv): 1, dereverberation and echo cancellation: 1, tools, corpora and resources: 1, non-autoregressive sequential modeling for speech processing: 1, interspeech 2021 deep noise suppression challenge: 1, resource-constrained asr: 1, search/decoding techniques and confidence measures for asr: 1, interspeech 2021 acoustic echo cancellation challenge: 1, robust speaker recognition: 1, deep noise suppression challenge: 1, singing voice computing and processing in music: 1, summarization, semantic analysis and classification: 1, the attacker’s perspective on automatic speaker verification: 1, multi-channel speech enhancement: 1, streaming asr: 1, the interspeech 2020 far field speaker verification challenge: 1, model adaptation for asr: 1, asr for noisy and far-field speech: 1, cross-lingual and multilingual asr: 1, speech technologies for code-switching in multilingual communities: 1, extracting information from audio: 1, robust speech recognition: 1, spoken term detection: 1
IEEE keywords: speech recognition: 18, speech synthesis: 11, decoding: 8, linguistics: 8, timbre: 8, voice conversion: 8, speech: 7, task analysis: 7, speech enhancement: 7, natural language processing: 7, speaker recognition: 5, automatic speech recognition: 5, transforms: 4, emotion transfer: 4, data models: 4, predictive models: 4, noise reduction: 4, multitasking: 3, attention mechanism: 3, analytical models: 3, convolution: 3, fuses: 3, time frequency analysis: 3, end to end: 3, style transfer: 3, cloning: 2, zero shot: 2, disentangling: 2, data mining: 2, conversational asr: 2, conformer: 2, degradation: 2, data privacy: 2, speaker anonymization: 2, information filtering: 2, privacy protection: 2, singular value decomposition (svd): 2, privacy: 2, cross lingual: 2, disentanglement: 2, pipelines: 2, robustness: 2, representation learning: 2, visualization: 2, audio visual speech recognition: 2, process control: 2, multi scale: 2, perturbation methods: 2, acoustic distortion: 2, reverberation: 2, generative adversarial network: 2, vocoders: 2, low resource: 2, headphones: 2, personalized speech enhancement: 2, real time: 2, multi task learning: 2, source separation: 2, adaptation models: 2, acoustic echo cancellation: 2, noise suppression: 2, echo cancellers: 2, adversarial learning: 2, recurrent neural networks: 2, training data: 2, end to end asr: 2, microphone arrays: 2, alimeeting: 2, meeting transcription: 2, noise measurement: 2, voice activity detection: 2, gradient methods: 2, keyword spotting: 2, attention: 2, attention based model: 2, end to end speech recognition: 2, speaker cloning: 1, u net: 1, style cloning: 1, spectrogram: 1, two granularity modeling units: 1, asr ar multi task learning: 1, lasas: 1, temporal channel retrieval: 1, production: 1, reviews: 1, context: 1, oral communication: 1, context modeling: 1, latent variational: 1, cross modal representation: 1, matrix decomposition: 1, voiceprivacy challenge: 1, computational modeling: 1, streaming voice conversion: 1, dynamic masked convolution: 1, computer architecture: 1, predictive coding: 1, quiet attention: 1, buildings: 1, error analysis: 1, multimodal: 1, cross attention: 1, staged approach: 1, measurement: 1, encoding: 1, language models: 1, generative model: 1, self supervised learning: 1, semantics: 1, natural language prompts: 1, latent diffusion: 1, diffusion model: 1, phonetics: 1, diffusion processes: 1, style modeling: 1, adversarial attack: 1, speaker identification: 1, timbre reserved: 1, speech distortion: 1, information perturbation: 1, feature fusion: 1, expressive: 1, generative adversarial networks: 1, universal vocoder: 1, digital signal processing: 1, source filter model: 1, speaking style: 1, speaker adaptation: 1, contrastive learning: 1, clustering methods: 1, upper bound: 1, background sound: 1, social networking (online): 1, internet: 1, voice privacy challenge: 1, robust keyword spotting: 1, real time systems: 1, multi modality fusion: 1, audio visual keywords spotting: 1, lips: 1, far field speaker verification: 1, fine tuning: 1, weight transfer: 1, tuning: 1, band split: 1, complexity theory: 1, maximum likelihood detection: 1, two step network: 1, logic gates: 1, multiple factors decoupling: 1, expressive speech synthesis: 1, two stage: 1, minimization: 1, variational inference: 1, neural tts: 1, style and speaker attributes: 1, disjoint datasets: 1, autoregressive processes: 1, emotional speech synthesis: 1, virtual assistants: 1, emotion strengths: 1, principal component analysis: 1, emotion strength control: 1, natural languages: 1, databases: 1, text to speech (tts): 1, text analysis: 1, computational linguistics: 1, long form: 1, cross sentence: 1, dilated complex dual path conformer: 1, uformer: 1, speech enhancement and dereverberation: 1, encoder decoder attention: 1, medical signal processing: 1, modulation: 1, hybrid encoder and decoder: 1, filtering theory: 1, auditory system: 1, two stage network: 1, estimation: 1, ecapa tdnn: 1, super wide band: 1, information processing: 1, s dccrn: 1, adaptation: 1, one shot: 1, over fit: 1, topic related rescoring: 1, latent variational module: 1, meeting scenario: 1, speak diarization: 1, arrays: 1, multi speaker asr: 1, m2met: 1, speaker diarization: 1, variational autoencoder: 1, audio signal processing: 1, singing voice synthesis: 1, music: 1, normalizing flows: 1, optical filters: 1, corpus: 1, matched filters: 1, optical character recognition software: 1, multi domain: 1, shape: 1, performance gain: 1, lattice pruning: 1, speech coding: 1, decoder: 1, lattice generation: 1, acoustic modeling: 1, accent recognition: 1, accented speech recognition: 1, lf mmi: 1, convolutional neural nets: 1, computational complexity: 1, transformer: 1, wake word detection: 1, streaming: 1, transfer learning: 1, speaker adaption: 1, target tracking: 1, voice cloning: 1, pattern matching: 1, deep binary embeddings: 1, temporal context: 1, query by example: 1, image retrieval: 1, quantization (signal): 1, wavenet adaptation: 1, singular value decomposition: 1, voice conversion (vc): 1, sensor fusion: 1, multi scale fusion: 1, speech bandwidth extension: 1, signal restoration: 1, time domain analysis: 1, document image processing: 1, neural net architecture: 1, class imbalance: 1, hard examples: 1, wake up word detection: 1, error statistics: 1, statistical distributions: 1, cross entropy: 1, listen attend and spell: 1, interference suppression: 1, virtual adversarial training: 1, sequence to sequence: 1, adversarial training: 1, generators: 1, signal to noise ratio: 1, domain adversarial training: 1, asr: 1, computer aided instruction: 1, esl: 1, call: 1, language model: 1, code switching: 1, pattern classification: 1, kws: 1, adversarial examples: 1, permutation invariant training: 1, speech separation: 1, pitch tracking: 1, deep clustering: 1, self attention: 1, text to speech synthesis: 1, relative position aware representation: 1, recurrent neural nets: 1, sequence to sequence model: 1, audio visual systems: 1, robust speech recognition: 1, dropout: 1, bimodal df smn: 1, multi condition training: 1
Most publications (all venues) at: 2023: 56, 2021: 56, 2022: 52, 2024: 35, 2019: 33

Affiliations
Northwestern Polytechnical University, School of Computer Science, Xi'an, China
The Chinese University of Hong Kong, Department of Systems Engineering and Engineering Management, Hong Kong (2006 - 2007)
City University of Hong Kong, School of Creative Media, Hong Kong (2004 - 2006)
Northwestern Polytechnical University, Xi'an, China (PhD 2004)
Vrije Universiteit Brussel, Department of Electronics and Information Processing, Belgium (2001 - 2002)

Recent publications

SpeechComm2024 Li Zhang 0106, Ning Jiang, Qing Wang 0039, Yue Li, Quan Lu, Lei Xie 0001
Whisper-SV: Adapting Whisper for low-data-resource speaker verification.

TASLP2024 Tao Li, Zhichao Wang 0002, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie 0001
U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning.

TASLP2024 Qijie Shao, Pengcheng Guo, Jinghao Yan, Pengfei Hu 0004, Lei Xie 0001
Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition.

TASLP2024 Zhichao Wang 0002, Liumeng Xue, Qiuqiang Kong, Lei Xie 0001, Yuanzhe Chen, Qiao Tian, Yuping Wang
Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion.

TASLP2024 Kun Wei, Bei Li, Hang Lv 0001, Quan Lu, Ning Jiang, Lei Xie 0001
Conversational Speech Recognition by Learning Audio-Textual Cross-Modal Contextual Representation.

TASLP2024 Jixun Yao, Qing Wang 0039, Pengcheng Guo, Ziqian Ning, Lei Xie 0001
Distinctive and Natural Speaker Anonymization via Singular Value Transformation-Assisted Matrix.

TASLP2024 Xinfa Zhu, Yi Lei, Tao Li, Yongmao Zhang, Hongbin Zhou, Heng Lu 0004, Lei Xie 0001
METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer.

ICASSP2024 Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu 0004, Shuai Wang, Jixun Yao, Lei Xie 0001, Mengxiao Bi
Dualvc 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion.

ICASSP2024 He Wang, Pengcheng Guo, Pan Zhou, Lei Xie 0001
MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition.

ICASSP2024 Ziqian Wang, Xinfa Zhu, Zihan Zhang, Yuanjun Lv, Ning Jiang, Guoqing Zhao, Lei Xie 0001
SELM: Speech Enhancement using Discrete Tokens and Language Models.

ICASSP2024 Jixun Yao, Yuguang Yang 0005, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, Jingjing Yin, Hongbin Zhou, Heng Lu 0004, Lei Xie 0001
Promptvc: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts.

ACL2024 Zhichao Wang 0002, Yuanzhe Chen, Xinsheng Wang, Lei Xie 0001, Yuping Wang
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion.

TASLP2023 Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie 0001
DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech - A Study Between English and Mandarin.

TASLP2023 Zhichao Wang 0002, Xinsheng Wang, Qicong Xie, Tao Li, Lei Xie 0001, Qiao Tian, Yuping Wang
MSM-VC: High-Fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-Scale Style Modeling.

TASLP2023 Qing Wang 0039, Jixun Yao, Li Zhang 0106, Pengcheng Guo, Lei Xie 0001
Timbre-Reserved Adversarial Attack in Speaker Identification.

ICASSP2023 Mingshuai Liu, Shubo Lv, Zihan Zhang, Runduo Han, Xiang Hao, Xianjun Xia, Li Chen, Yijian Xiao, Lei Xie 0001
Two-Stage Neural Network for ICASSP 2023 Speech Signal Improvement Challenge.

ICASSP2023 Ziqian Ning, Qicong Xie, Pengcheng Zhu 0004, Zhichao Wang 0002, Liumeng Xue, Jixun Yao, Lei Xie 0001, Mengxiao Bi
Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features.

ICASSP2023 Kun Song, Yongmao Zhang, Yi Lei, Jian Cong, Hanzhao Li, Lei Xie 0001, Gang He, Jinfeng Bai
DSPGAN: A Gan-Based Universal Vocoder for High-Fidelity TTS by Time-Frequency Domain Supervision from DSP.

ICASSP2023 Zhichao Wang 0002, Xinsheng Wang, Lei Xie 0001, Yuanzhe Chen, Qiao Tian, Yuping Wang
Delivering Speaking Style in Low-Resource Voice Conversion with Multi-Factor Constraints.

ICASSP2023 Xiaopeng Yan, Yindi Yang, Zhihao Guo, Liangliang Peng, Lei Xie 0001
The NPU-Elevoc Personalized Speech Enhancement System for Icassp2023 DNS Challenge.

#5  | Yanmin Qian | DBLP Google Scholar  
By venue: Interspeech: 45, ICASSP: 44, TASLP: 15, SpeechComm: 2, NeurIPS: 1
By year: 2024: 14, 2023: 25, 2022: 23, 2021: 16, 2020: 13, 2019: 12, 2018: 4
ISCA sessions: speaker and language identification: 5, embedding and network architecture for speaker recognition: 4, cross-lingual and multilingual asr: 2, multi-talker methods in speech processing: 2, speaker recognition and anti-spoofing: 2, noise robust and distant speech recognition: 2, speaker recognition: 2, deep learning for source separation and pitch tracking: 2, speaker and language diarization: 1, speech recognition: 1, acoustic model adaptation for asr: 1, novel models and training methods for asr: 1, speaker embedding and diarization: 1, speech enhancement and intelligibility: 1, source separation: 1, topics in asr: 1, sdsv challenge 2021: 1, speech synthesis: 1, multimodal systems: 1, speaker, language, and privacy: 1, speaker recognition challenges and applications: 1, learning techniques for speaker recognition: 1, targeted source separation: 1, multilingual and code-switched asr: 1, anti-spoofing and liveness detection: 1, spoken term detection, confidence measure, and end-to-end speech recognition: 1, feature extraction for asr: 1, the 2019 automatic speaker verification spoofing and countermeasures challenge: 1, asr neural network training: 1, speech and audio source separation and scene analysis: 1, robust speech recognition: 1, acoustic modelling: 1
IEEE keywords: speech recognition: 22, speaker recognition: 18, speaker verification: 14, transformers: 9, task analysis: 8, error analysis: 8, data augmentation: 8, adaptation models: 7, data models: 6, degradation: 6, decoding: 5, self supervised learning: 5, robustness: 5, training data: 5, computational modeling: 5, speech synthesis: 4, speaker diarization: 4, speech enhancement: 4, end to end speech recognition: 4, natural language processing: 4, system performance: 3, computer architecture: 3, multi modality: 3, encoding: 3, noise measurement: 3, quantization (signal): 3, low resource speech recognition: 3, domain adaptation: 3, audio visual: 3, speaker embedding: 3, knowledge distillation: 3, continuous speech separation: 3, curriculum learning: 3, end to end: 3, speech separation: 3, source separation: 3, data handling: 3, clustering algorithms: 2, voice activity detection: 2, transducers: 2, factorized neural transducer: 2, predictive models: 2, vocabulary: 2, noise robustness: 2, interference: 2, visualization: 2, audio visual speech recognition: 2, unified cross modal attention: 2, resnet: 2, data mining: 2, fuses: 2, data collection: 2, switches: 2, semantics: 2, model compression: 2, large margin fine tuning: 2, dual path modeling: 2, deep learning (artificial intelligence): 2, unsupervised learning: 2, recurrent neural nets: 2, audio signal processing: 2, perturbation methods: 2, transforms: 2, gaussian processes: 2, reverberation: 2, text dependent speaker verification: 2, attention mechanism: 2, neural speaker diarization: 1, attention based encoder decoder: 1, ami: 1, iterative decoding: 1, callhome: 1, dihard: 1, long content speech recognition: 1, streaming and non streaming: 1, context modeling: 1, rnn t: 1, label correction: 1, iterative methods: 1, self supervised speaker verification: 1, cluster aware dino: 1, reliability: 1, dynamic loss gate: 1, modality corruption: 1, df resnet: 1, performance evaluation: 1, neural network quantization: 1, lightweight systems: 1, mobile handsets: 1, analytical models: 1, adaptive systems: 1, text to speech: 1, phonetics: 1, data splicing: 1, dictionaries: 1, splicing: 1, machine anomalous sound detection: 1, self supervised pre train: 1, fine tune: 1, employee welfare: 1, 3d speaker: 1, cross domain learning: 1, domain mismatch: 1, target speech diarization: 1, prompt driven: 1, mimics: 1, mixed sparsity: 1, large language models: 1, sparsity pruning: 1, resource management: 1, sensitivity: 1, in the wild: 1, filtering algorithms: 1, dino: 1, pipelines: 1, target speech extraction: 1, boosting: 1, vocoders: 1, frequency estimation: 1, speech discretization: 1, vocoder: 1, recording: 1, time frequency analysis: 1, reproducibility of results: 1, sampling frequency independent: 1, microphone number invariant: 1, frequency diversity: 1, universal speech enhancement: 1, attentive feature fusion: 1, depth first architecture: 1, complexity theory: 1, ecapa tdnn: 1, long form speech recognition: 1, context and speech encoder: 1, costs: 1, factorized aed: 1, text only: 1, interpolation: 1, search problems: 1, binary classification: 1, sphereface2: 1, modality absence: 1, noise robust: 1, machine learning: 1, multi clue processing: 1, benchmark testing: 1, cross modality attention: 1, target sound extraction: 1, misp challenge: 1, tv: 1, discriminator and transfer: 1, log likelihood ratio: 1, production: 1, wespeaker: 1, codes: 1, robust speech recognition: 1, supervised learning: 1, hubert: 1, tts conversion: 1, transformer transducer: 1, speech coding: 1, code switching asr: 1, cross modality learning: 1, industries: 1, learning systems: 1, asymmetric scenario: 1, duration mismatch: 1, focusing: 1, signal processing algorithms: 1, collaboration: 1, overlap ratio predictor: 1, memory pool: 1, multi accent: 1, layer wise adaptation: 1, accent embedding: 1, length perturbation: 1, optimisation: 1, self supervised pretrain: 1, representation learning: 1, image representation: 1, multilayer perceptrons: 1, text independent: 1, multi layer perceptron: 1, convolution attention: 1, local attention: 1, local information: 1, gaussian attention: 1, skipping memory: 1, low latency: 1, real time: 1, time domain analysis: 1, self knowledge distillation: 1, deep embedding learning: 1, knowledge engineering: 1, synchronisation: 1, object detection: 1, attention: 1, low quality video: 1, video signal processing: 1, microphone arrays: 1, multi speaker asr: 1, meeting transcription: 1, alimeeting: 1, m2met: 1, punctuation prediction: 1, edge devices: 1, streaming speech recognition: 1, multi task learning: 1, data utilization: 1, dynamic scheduling: 1, biometrics (access control): 1, audio visual deep neural network: 1, person verification: 1, face recognition: 1, data analysis: 1, multi modal system: 1, signal detection: 1, modified magnitude phase spectrum: 1, constant q modified octave coefficients: 1, mixture models: 1, signal classification: 1, unknown kind spoofing detection: 1, accent adaptation: 1, accent speech recognition: 1, rnnlm: 1, signal to distortion ratio: 1, blind source separation: 1, acoustic beamforming: 1, complex backpropagation: 1, convolution: 1, transfer functions: 1, array signal processing: 1, multi channel source separation: 1, contrastive learning: 1, i vector: 1, tts based data augmentation: 1, test time augmentation: 1, phone posteriorgram: 1, accent identification: 1, ppg: 1, data fusion: 1, unit selection synthesis: 1, x vector: 1, long recording speech separation: 1, convolutional neural nets: 1, online processing: 1, end to end asr: 1, acoustic modeling: 1, accent recognition: 1, accented speech recognition: 1, children’s speech recognition: 1, text to speech: 1, data selection: 1, variational auto encoder: 1, text independent speaker verification: 1, generative adversarial network: 1, end to end model: 1, multi talker mixed speech recognition: 1, permutation invariant training: 1, overlapped speech recognition: 1, transformer: 1, neural beamforming: 1, multitask learning: 1, channel information: 1, adversarial training: 1, multimodal: 1, audio visual systems: 1, text dependent: 1, adaptation: 1, text mismatch: 1, center loss: 1, angular softmax: 1, short duration text independent speaker verification: 1, speaker neural embedding: 1, triplet loss: 1, ctc: 1, hidden markov models: 1, multi speaker speech recognition: 1, cocktail party problem: 1, teacher student learning: 1, computer aided instruction: 1
Most publications (all venues): 2023: 42, 2022: 37, 2024: 30, 2018: 21, 2021: 20

Affiliations
URLs

Recent publications

SpeechComm2024 Shuai Wang 0016, Zhengyang Chen, Bing Han, Hongji Wang, Chengdong Liang, Binbin Zhang, Xu Xiang, Wen Ding, Johan Rohdin, Anna Silnova, Yanmin Qian, Haizhou Li 0001, 
Advancing speaker embedding learning: Wespeaker toolkit for research and production.

TASLP2024 Zhengyang Chen, Bing Han, Shuai Wang 0016, Yanmin Qian
Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer.

TASLP2024 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian
Advanced Long-Content Speech Recognition With Factorized Neural Transducer.

TASLP2024 Bing Han, Zhengyang Chen, Yanmin Qian
Self-Supervised Learning With Cluster-Aware-DINO for High-Performance Robust Speaker Verification.

TASLP2024 Jiahong Li, Chenda Li, Yifei Wu, Yanmin Qian
Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond.

TASLP2024 Bei Liu, Haoyu Wang 0007, Yanmin Qian
Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization.

TASLP2024 Wei Wang 0010, Yanmin Qian
Universal Cross-Lingual Data Generation for Low Resource ASR.

ICASSP2024 Bing Han, Zhiqiang Lv, Anbai Jiang, Wen Huang 0004, Zhengyang Chen, Yufeng Deng, Jiawei Ding, Cheng Lu 0007, Wei-Qiang Zhang 0001, Pingyi Fan, Jia Liu 0001, Yanmin Qian
Exploring Large Scale Pre-Trained Models for Robust Machine Anomalous Sound Detection.

ICASSP2024 Wen Huang 0004, Bing Han, Shuai Wang 0016, Zhengyang Chen, Yanmin Qian
Robust Cross-Domain Speaker Verification with Multi-Level Domain Adapters.

ICASSP2024 Yidi Jiang, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, Haizhou Li 0001, 
Prompt-Driven Target Speech Diarization.

ICASSP2024 Hang Shao, Bei Liu, Yanmin Qian
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models.

ICASSP2024 Shuai Wang 0016, Qibing Bai, Qi Liu 0018, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li 0001, 
Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.

ICASSP2024 Linfeng Yu, Wangyou Zhang, Chenpeng Du, Leying Zhang, Zheng Liang, Yanmin Qian
Generation-Based Target Speech Extraction with Speech Discretization and Vocoder.

ICASSP2024 Wangyou Zhang, Jee-weon Jung, Yanmin Qian
Improving Design of Input Condition Invariant Speech Enhancement.

TASLP2023 Bei Liu, Zhengyang Chen, Yanmin Qian
Depth-First Neural Architecture With Attentive Feature Fusion for Efficient Speaker Verification.

ICASSP2023 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian
LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.

ICASSP2023 Xun Gong 0005, Wei Wang 0010, Hang Shao, Xie Chen 0001, Yanmin Qian
Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR.

ICASSP2023 Bing Han, Zhengyang Chen, Yanmin Qian
Exploring Binary Classification Loss for Speaker Verification.

ICASSP2023 Jiahong Li, Chenda Li, Yifei Wu, Yanmin Qian
Robust Audio-Visual ASR with Unified Cross-Modal Attention.

ICASSP2023 Chenda Li, Yao Qian, Zhuo Chen 0006, Dongmei Wang, Takuya Yoshioka, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, 
Target Sound Extraction with Variable Cross-Modality Clues.

#6  | Björn W. Schuller | DBLP Google Scholar  
By venue: Interspeech: 68, ICASSP: 30, TASLP: 7
By year: 2024: 9, 2023: 20, 2022: 11, 2021: 18, 2020: 22, 2019: 14, 2018: 11
ISCA sessions: speech emotion recognition: 5, speech in health: 4, the first dicova challenge: 3, the interspeech 2020 computational paralinguistics challenge (compare): 3, spoken dialog systems and conversational analysis: 2, health-related speech analysis: 2, voice conversion and adaptation: 2, speech synthesis: 2, multimodal systems: 2, the interspeech 2021 computational paralinguistics challenge (compare): 2, computational paralinguistics: 2, social signals detection and speaker traits analysis: 2, attention mechanism for speaker state recognition: 2, the interspeech 2018 computational paralinguistics challenge (compare): 2, multimodal speech emotion recognition: 1, show and tell: 1, speech, voice, and hearing disorders: 1, speech and language in health: 1, automatic analysis of paralinguistics: 1, single-channel speech enhancement: 1, atypical speech analysis and detection: 1, asr technologies and systems: 1, (multimodal) speech emotion recognition: 1, pathological speech assessment: 1, atypical speech detection: 1, diverse modes of speech acquisition and processing: 1, health and affect: 1, speech type classification and diagnosis: 1, speech in multimodality: 1, alzheimer’s dementia recognition through spontaneous speech: 1, diarization: 1, acoustic scene classification: 1, bioacoustics and articulation: 1, speech enhancement: 1, representation learning of emotion and paralinguistics: 1, training strategy for speech emotion recognition: 1, the interspeech 2019 computational paralinguistics challenge (compare): 1, network architectures for emotion and paralinguistics recognition: 1, speech signal characterization: 1, representation learning for emotion: 1, speech and language analytics for mental health: 1, text analysis, multilingual issues and evaluation in speech synthesis: 1, emotion modeling: 1, emotion recognition and analysis: 1, speech pathology, depression, and medical applications: 1, speaker state and trait: 1, second language acquisition and code-switching: 1
IEEE keywords: emotion recognition: 21, speech recognition: 18, speech emotion recognition: 13, computational modeling: 6, task analysis: 5, speech enhancement: 4, transformers: 4, data models: 4, transfer learning: 4, multi task learning: 3, adaptation models: 3, recurrent neural nets: 3, attention mechanism: 3, predictive models: 2, computer architecture: 2, logic gates: 2, multitasking: 2, multi source domain adaptation: 2, speaker independent: 2, computer vision: 2, robustness: 2, linguistics: 2, data privacy: 2, self supervised learning: 2, signal processing algorithms: 2, machine learning: 2, human computer interaction: 2, mood: 2, affective computing: 2, semantics: 2, computer audition: 2, healthcare: 2, audio signal processing: 2, pattern classification: 2, signal classification: 2, artificial neural networks: 1, low complexity: 1, frame weighting: 1, residual fusion: 1, noise: 1, time domain analysis: 1, knowledge distillation (kd): 1, probabilistic logic: 1, audiogram: 1, auditory system: 1, multi head self attention: 1, hearing aids: 1, hearing aid: 1, indexes: 1, speech quality evaluation: 1, alzheimer’s disease: 1, computational complexity: 1, convolution: 1, hierarchical modelling: 1, attention free transformer: 1, alzheimer's disease: 1, stability analysis: 1, multi armed bandits: 1, multi modality: 1, joint distribution adaptation: 1, acoustic scene classification: 1, sharp minima: 1, deep neural networks: 1, acoustic measurements: 1, scene classification: 1, generalisation: 1, loss landscape: 1, prompt tuning: 1, large language model: 1, low rank adaptation: 1, time frequency analysis: 1, shifted window: 1, aggregates: 1, transformer: 1, merging: 1, hierarchical speech features: 1, source free cross corpus speech emotion recognition: 1, clustering algorithms: 1, contrastive learning: 1, masking: 1, emotional: 1, random splicing: 1, prediction algorithms: 1, speech: 1, splicing: 1, anonymization: 1, lightweight deep learning: 1, performance evaluation: 1, edge device: 1, neural structured learning: 1, art: 1, encoding: 1, infant directed speech: 1, adult directed speech: 1, automatic speech classification: 1, computational paralinguistics: 1, covid 19: 1, noise reduction: 1, iterative optimisation: 1, noise measurement: 1, covid 19 detection: 1, efficient edge analytics: 1, adaptive inference: 1, efficient deep learning: 1, self distillation: 1, redundancy: 1, particle measurements: 1, dataset bias reduction: 1, hardware: 1, asthma: 1, personnel: 1, speech modelling: 1, redundancy reduction: 1, recording: 1, multitask learning: 1, data collection: 1, mental health: 1, daily speech: 1, dams: 1, medical services: 1, anxiety disorders: 1, vocal burst detection: 1, animals: 1, nonverbal vocalization: 1, behavioral sciences: 1, zero shot learning: 1, generative learning: 1, emotional prototypes: 1, prototypes: 1, federated learning: 1, analytical models: 1, stuttering monitoring: 1, privacy: 1, decoupled knowledge distillation: 1, multi head attention: 1, knowledge engineering: 1, motion capture: 1, unsupervised domain adaptation: 1, adversarial learning: 1, medical computing: 1, hearing: 1, intelligent medicine: 1, health care: 1, digital phenotype: 1, overview: 1, relativistic discriminator: 1, domain adaptation: 1, deep neural network: 1, speech intelligibility: 1, decoding: 1, speech coding: 1, maximum mean discrepancy: 1, disentangled representation learning: 1, audio generation: 1, guided representation learning: 1, and generative adversarial neural network: 1, signal representation: 1, support vector machines: 1, multilayer perceptrons: 1, glottal source estimation: 1, iterative methods: 1, diseases: 1, glottal features: 1, end to end systems: 1, parkinson's disease: 1, filtering theory: 1, temporal convolutional networks: 1, electroencephalography: 1, medical signal processing: 1, hierarchical attention mechanism: 1, eeg signals: 1, relu: 1, arelu: 1, gated recurrent unit: 1, representation learning: 1, computational linguistics: 1, deep learning (artificial intelligence): 1, semantic: 1, paralinguistic: 1, audiotextual information: 1, vggish: 1, ordinal classification: 1, entropy: 1, consistent rank logits: 1, customer services: 1, convolutional neural nets: 1, adversarial attacks: 1, gradient methods: 1, convolutional neural network: 1, data protection: 1, adversarial training: 1, end to end affective computing: 1, adversarial networks: 1, emotional speech synthesis: 1, data augmentation: 1, unsupervised learning: 1, monotonic attention: 1, mean square error methods: 1, attention transfer: 1, depression: 1, hierarchical attention: 1, psychology: 1, behavioural sciences computing: 1, speech emotion: 1, frame level features: 1, lstm: 1, speech emotion prediction: 1, end to end: 1, joint training: 1, emotion classification: 1, audiovisual learning: 1, audio visual systems: 1, face recognition: 1, emotion regression: 1, state of mind: 1, mood congruency: 1, sentiment analysis: 1, context modeling: 1, hierarchical models: 1, recurrent neural networks: 1, gated recurrent units: 1, attention mechanisms: 1
Most publications (all venues): 2023: 102, 2022: 97, 2021: 97, 2017: 84, 2020: 76

Affiliations
Imperial College London, GLAM, UK
University of Augsburg, Department of Computer Science, Germany
University of Passau, Faculty of Computer Science and Mathematics, Germany (former)

Recent publications

TASLP2024 Jiaming Cheng, Ruiyu Liang, Lin Zhou 0001, Li Zhao 0003, Chengwei Huang, Björn W. Schuller
Residual Fusion Probabilistic Knowledge Distillation for Speech Enhancement.

TASLP2024 Ruiyu Liang, Yue Xie, Jiaming Cheng, Cong Pang, Björn W. Schuller
A Non-Invasive Speech Quality Evaluation Algorithm for Hearing Aids With Multi-Head Self-Attention and Audiogram-Based Features.

ICASSP2024 Zhongren Dong, Zixing Zhang 0001, Weixiang Xu, Jing Han 0010, Jianjun Ou, Björn W. Schuller
HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech.

ICASSP2024 Xiangheng He, Junjie Chen, Björn W. Schuller
Task Selection and Assignment for Multi-Modal Multi-Task Dialogue Act Classification with Non-Stationary Multi-Armed Bandits.

ICASSP2024 Cheng Lu 0005, Yuan Zong, Hailun Lian, Yan Zhao, Björn W. Schuller, Wenming Zheng, 
Improving Speaker-Independent Speech Emotion Recognition using Dynamic Joint Distribution Adaptation.

ICASSP2024 Manuel Milling, Andreas Triantafyllopoulos, Iosif Tsangko, Simon David Noel Rampp, Björn Wolfgang Schuller
Bringing the Discussion of Minima Sharpness to the Audio Domain: A Filter-Normalised Evaluation for Acoustic Scene Classification.

ICASSP2024 Liyizhe Peng, Zixing Zhang 0001, Tao Pang, Jing Han 0010, Huan Zhao 0003, Hao Chen, Björn W. Schuller
Customising General Large Language Models for Specialised Emotion Recognition Tasks.

ICASSP2024 Yong Wang, Cheng Lu 0005, Hailun Lian, Yan Zhao, Björn W. Schuller, Yuan Zong, Wenming Zheng, 
Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition.

ICASSP2024 Yan Zhao, Jincen Wang, Cheng Lu 0005, Sunan Li, Björn W. Schuller, Yuan Zong, Wenming Zheng, 
Emotion-Aware Contrastive Adaptation Network for Source-Free Cross-Corpus Speech Emotion Recognition.

ICASSP2023 Felix Burkhardt, Anna Derington, Matthias Kahlau, Klaus R. Scherer, Florian Eyben, Björn W. Schuller
Masking Speech Contents by Random Splicing: is Emotional Expression Preserved?

ICASSP2023 Yi Chang 0004, Zhao Ren, Thanh Tam Nguyen, Kun Qian 0003, Björn W. Schuller
Knowledge Transfer for on-Device Speech Emotion Recognition With Neural Structured Learning.

ICASSP2023 Najla D. Al Futaisi, Alejandrina Cristià, Björn W. Schuller
Hearttoheart: The Arts of Infant Versus Adult-Directed Speech Classification.

ICASSP2023 Shuo Liu 0012, Adria Mallol-Ragolta, Björn W. Schuller
COVID-19 Detection from Speech in Noisy Conditions.

ICASSP2023 Zhao Ren, Thanh Tam Nguyen, Yi Chang 0004, Björn W. Schuller
Fast Yet Effective Speech Emotion Recognition with Self-Distillation.

ICASSP2023 Georgios Rizos, Rafael A. Calvo, Björn W. Schuller
Positive-Pair Redundancy Reduction Regularisation for Speech-Based Asthma Diagnosis Prediction.

ICASSP2023 Meishu Song, Andreas Triantafyllopoulos, Zijiang Yang 0007, Hiroki Takeuchi, Toru Nakamura, Akifumi Kishi, Tetsuro Ishizawa, Kazuhiro Yoshiuchi, Xin Jing, Vincent Karas, Zhonghao Zhao, Kun Qian 0003, Bin Hu 0001, Björn W. Schuller, Yoshiharu Yamamoto, 
Daily Mental Health Monitoring from Speech: A Real-World Japanese Dataset and Multitask Learning Analysis.

ICASSP2023 Panagiotis Tzirakis, Alice Baird, Jeffrey A. Brooks, Christopher Gagne, Lauren Kim, Michael Opara, Christopher B. Gregory, Jacob Metrick, Garrett Boseck, Vineet Tiruvadi, Björn W. Schuller, Dacher Keltner, Alan Cowen, 
Large-Scale Nonverbal Vocalization Detection Using Transformers.

ICASSP2023 Xinzhou Xu, Jun Deng, Zixing Zhang 0001, Zhen Yang, Björn W. Schuller
Zero-Shot Speech Emotion Recognition Using Generative Learning with Reconstructed Prototypes.

ICASSP2023 Yongzi Yu, Wanyong Qiu, Chen Quan, Kun Qian 0003, Zhihua Wang, Yu Ma, Bin Hu 0001, Björn W. Schuller, Yoshiharu Yamamoto, 
Federated Intelligent Terminals Facilitate Stuttering Monitoring.

ICASSP2023 Ziping Zhao 0001, Huan Wang, Haishuai Wang, Björn W. Schuller
Hierarchical Network with Decoupled Knowledge Distillation for Speech Emotion Recognition.

#7  | Hung-yi Lee | DBLP Google Scholar  
By venue: ICASSP: 41, Interspeech: 41, TASLP: 11, ACL: 5, ACL-Findings: 2
By year: 2024: 12, 2023: 16, 2022: 22, 2021: 16, 2020: 17, 2019: 13, 2018: 4
ISCA sessions: speech synthesis: 5, speech recognition: 2, spoken language processing: 2, adaptation, transfer learning, and distillation for asr: 2, voice conversion and adaptation: 2, new trends in self-supervised speech processing: 2, neural techniques for voice conversion and waveform generation: 2, spoken language translation, information retrieval, summarization, resources, and evaluation: 1, speech analysis: 1, the voicemos challenge: 1, trustworthy speech processing: 1, spoofing-aware automatic speaker verification (sasv): 1, embedding and network architecture for speaker recognition: 1, neural network training methods for asr: 1, source separation: 1, spoken term detection & voice search: 1, voice anti-spoofing and countermeasure: 1, speech signal analysis and representation: 1, search for speech recognition: 1, conversational systems: 1, speech synthesis paradigms and methods: 1, applications of language technologies: 1, language learning and databases: 1, speech enhancement: 1, the zero resource speech challenge 2019: 1, turn management in dialogue: 1, speech and audio source separation and scene analysis: 1, voice conversion: 1, extracting information from audio: 1, spoken language understanding: 1, acoustic modelling: 1
IEEE keywords: speech recognition: 16, self supervised learning: 15, speaker recognition: 11, task analysis: 8, speech synthesis: 8, natural language processing: 7, benchmark testing: 7, computational modeling: 6, robustness: 6, adversarial attack: 6, adaptation models: 5, question answering (information retrieval): 5, spoken language understanding: 5, speech coding: 5, voice conversion: 5, representation learning: 4, data models: 4, unsupervised learning: 4, security of data: 4, predictive models: 3, speech enhancement: 3, linguistics: 3, spoken question answering: 3, semantics: 3, generative adversarial networks: 3, unsupervised asr: 3, few shot: 3, meta learning: 3, generative adversarial network: 3, speech representation learning: 3, biometrics (access control): 3, automatic speech recognition: 3, transformer: 2, transformers: 2, knowledge distillation: 2, evaluation: 2, benchmark: 2, analytical models: 2, self supervised: 2, emotion recognition: 2, visualization: 2, perturbation methods: 2, speaker verification: 2, vocoders: 2, vocoder: 2, decoding: 2, pipelines: 2, speech translation: 2, noise robustness: 2, maml: 2, automatic speaker verification: 2, supervised learning: 2, adversarial defense: 2, audio signal processing: 2, anti spoofing: 2, disentangled representations: 2, interactive systems: 2, source separation: 2, speech separation: 2, low resource: 2, end to end: 2, signal representation: 2, adversarial training: 2, prompting: 1, speech language model: 1, tuning: 1, non autoregressive: 1, neural machine translation: 1, biological system modeling: 1, task generalization: 1, protocols: 1, foundation model: 1, speech: 1, zero shot learning: 1, upper bound: 1, lattices: 1, in context learning: 1, large language models. asr confusion networks: 1, buildings: 1, instruction tuning: 1, collaboration: 1, multilingual: 1, code switch: 1, discrete unit: 1, zero resource: 1, manuals: 1, spoken content retrieval: 1, multitasking: 1, switches: 1, speech sentiment analysis: 1, paralinguistics: 1, large language models: 1, spoken dialogue modeling: 1, audio visual learning: 1, soft sensors: 1, scalability: 1, rendering (computer graphics): 1, purification: 1, adversarial sample detection: 1, ensemble learning: 1, user experience: 1, electronic mail: 1, large scaled pre trained model: 1, meta reinforcement learning: 1, generators: 1, natural language generation: 1, monte carlo methods: 1, autoregressive model: 1, neural speech synthesis: 1, neural network: 1, bars: 1, visually grounded speech: 1, multimodal speech processing: 1, image retrieval: 1, multilingual speech processing: 1, degradation: 1, computational efficiency: 1, once for all training: 1, sequence compression: 1, reproducibility of results: 1, espnet: 1, s3prl: 1, learning systems: 1, codes: 1, tokenization: 1, cloning: 1, structured pruning: 1, performance evaluation: 1, trainable pruning: 1, mobile handsets: 1, personalized tts: 1, voice cloning: 1, superb: 1, noise measurement: 1, ensemble knowledge distillation: 1, distortions: 1, bridges: 1, connectors: 1, syntactics: 1, unsupervised word segmentation: 1, self supervised speech representations: 1, unsupervised constituency parsing: 1, speaker adaptation: 1, tts: 1, signal sampling: 1, phone recognition: 1, hidden markov models: 1, pattern classification: 1, adversarial attacks: 1, data handling: 1, model compression: 1, voice activity detection: 1, computer based training: 1, open source: 1, self supervised speech representation: 1, error analysis: 1, self supervised speech models: 1, superb benchmark: 1, data bias: 1, partially fake audio detection: 1, audio deep synthesis detection challenge: 1, design methodology: 1, sensor fusion: 1, language translation: 1, pre training: 1, representation: 1, adaptive instance normalization: 1, activation guidance: 1, speaker representation: 1, multi speaker text to speech: 1, semi supervised learning: 1, any to any: 1, concatenative: 1, attention mechanism: 1, anil: 1, weapons: 1, information filters: 1, image rectification: 1, gallium nitride: 1, fisheye camera: 1, acoustic distortion: 1, data visualization: 1, code switching: 1, numerical models: 1, language model: 1, language adaptation: 1, iarpa babel: 1, analysis: 1, interpretability: 1, speech representation: 1, representation quantization: 1, quantisation (signal): 1, unsupervised training: 1, transformer encoders: 1, vector quantization: 1, spatial smoothing: 1, spoofing countermeasure: 1, label ambiguity problem: 1, permutation invariant training: 1, cocktail party problem: 1, speech question answering: 1, attention model: 1, toefl: 1, squad: 1, computer aided instruction: 1, domain adaptation: 1, sqa: 1, adversarial learning: 1, text analysis: 1, criticizing language model: 1, deep q network: 1, dialogue state tracking: 1, deep reinforcement learning: 1
Most publications (all venues): 2024: 58, 2022: 55, 2023: 42, 2021: 37, 2020: 36

Affiliations
URLs

Recent publications

TASLP2024 Kai-Wei Chang, Haibin Wu, Yu-Kai Wang, Yuan-Kuei Wu, Hua Shen, Wei-Cheng Tseng, Iu-thing Kang, Shang-wen Li 0001, Hung-Yi Lee
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks.

TASLP2024 Shensian Syu, Juncheng Xie, Hung-yi Lee
Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC.

TASLP2024 Shu-Wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li 0001, Abdelrahman Mohamed, Shinji Watanabe 0001, Hung-yi Lee
A Large-Scale Evaluation of Speech Foundation Models.

ICASSP2024 Kevin Everson, Yile Gu, Chao-Han Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-Yi Lee, Ariya Rastrow, Andreas Stolcke, 
Towards ASR Robust Spoken Language Understanding Through in-Context Learning with Word Confusion Networks.

ICASSP2024 Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan S. Sharma, Shinji Watanabe 0001, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee
Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech.

ICASSP2024 Kuan-Po Huang, Chih-Kai Yang, Yu-Kuan Fu, Ewan Dunbar, Hung-Yi Lee
Zero Resource Code-Switched Speech Benchmark Using Speech Utterance Pairs for Multiple Spoken Languages.

ICASSP2024 Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li 0001, Abdelrahman Mohamed, Hung-Yi Lee, Lin-Shan Lee, 
SpeechDPR: End-To-End Spoken Passage Retrieval For Open-Domain Spoken Question Answering.

ICASSP2024 Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-Yi Lee, Ivan Bulyko, 
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue.

ICASSP2024 Yuan Tseng, Layne Berry, Yiting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Poyao Huang 0001, Chun-Mao Lai, Shang-Wen Li 0001, David Harwath, Yu Tsao 0001, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models.

ICASSP2024 Haibin Wu, Heng-Cheng Kuo, Yu Tsao 0001, Hung-Yi Lee
Scalable Ensemble-Based Detection Method Against Adversarial Attacks For Speaker Verification.

ACL2024 Guan-Ting Lin, Cheng-Han Chiang, Hung-yi Lee
Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations.

ACL-Findings2024 Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan S. Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe 0001, 
On the Evaluation of Speech Foundation Models for Spoken Language Understanding.

TASLP2023 Yun-Yen Chuang, Hung-Min Hsu, Kevin Lin, Ray-I Chang, Hung-Yi Lee
MetaEx-GAN: Meta Exploration to Improve Natural Language Generation via Generative Adversarial Networks.

TASLP2023 Po-Chun Hsu, Da-Rong Liu, Andy T. Liu, Hung-yi Lee
Parallel Synthesis for Autoregressive Speech Generation.

ICASSP2023 Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath, 
M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval.

ICASSP2023 Hsuan-Jui Chen, Yen Meng, Hung-yi Lee
Once-for-All Sequence Compression for Self-Supervised Speech Models.

ICASSP2023 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola García, Hung-Yi Lee, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Euro: Espnet Unsupervised ASR Open-Source Toolkit.

ICASSP2023 Chan-Jan Hsu, Ho-Lam Chung, Hung-Yi Lee, Yu Tsao 0001, 
T5lephone: Bridging Speech and Text Self-Supervised Models for Spoken Language Understanding Via Phoneme Level T5.

ICASSP2023 Sung-Feng Huang, Chia-Ping Chen, Zhi-Sheng Chen, Yu-Pao Tsai, Hung-Yi Lee
Personalized Lightweight Text-to-Speech: Voice Cloning with Adaptive Structured Pruning.

ICASSP2023 Kuan-Po Huang, Tzu-hsun Feng, Yu-Kuan Fu, Tsu-Yuan Hsu, Po-Chieh Yen, Wei-Cheng Tseng, Kai-Wei Chang, Hung-Yi Lee
Ensemble Knowledge Distillation of Self-Supervised Speech Models.

#8  | Xunying Liu | DBLP Google Scholar  
By venue: Interspeech: 44, ICASSP: 38, TASLP: 18
By year: 2024: 6, 2023: 16, 2022: 17, 2021: 25, 2020: 13, 2019: 17, 2018: 6
ISCA sessions: speech and language in health: 6, speech recognition of atypical speech: 5, voice conversion and adaptation: 2, topics in asr: 2, asr neural network architectures: 2, medical applications and visual asr: 2, novel transformer models for asr: 1, acoustic model adaptation for asr: 1, speech recognition: 1, multi-, cross-lingual and other topics in asr: 1, novel models and training methods for asr: 1, multimodal speech emotion recognition and paralinguistics: 1, miscellaneous topics in speech, voice and hearing disorders: 1, zero, low-resource and multi-modal speech recognition: 1, voice anti-spoofing and countermeasure: 1, non-autoregressive sequential modeling for speech processing: 1, assessment of pathological speech and language: 1, speaker recognition: 1, multimodal speech processing: 1, learning techniques for speaker recognition: 1, speech and speaker recognition: 1, neural techniques for voice conversion and waveform generation: 1, speech and audio classification: 1, model adaptation for asr: 1, lexicon and language model for speech recognition: 1, novel neural network architectures for acoustic modelling: 1, second language acquisition and code-switching: 1, voice conversion: 1, multimodal systems: 1, expressive speech synthesis: 1, application of asr in medical practice: 1
IEEE keywords: speech recognition: 36, speaker recognition: 14, recurrent neural nets: 9, natural language processing: 9, bayes methods: 8, adaptation models: 7, speech synthesis: 7, data models: 6, task analysis: 6, bayesian learning: 6, speech separation: 6, data augmentation: 5, speaker adaptation: 5, emotion recognition: 5, deep learning (artificial intelligence): 5, gaussian processes: 5, optimisation: 5, elderly speech: 4, dysarthric speech: 4, audio visual: 4, switches: 4, transformer: 4, neural architecture search: 4, speech coding: 4, voice conversion: 4, speech emotion recognition: 4, quantisation (signal): 4, pre trained asr system: 3, older adults: 3, decoding: 3, perturbation methods: 3, dysarthric speech reconstruction: 3, conformer: 3, speech disorders: 3, end to end: 3, domain adaptation: 3, audio visual systems: 3, speech intelligibility: 3, multi channel: 3, overlapped speech: 3, language models: 3, convolutional neural nets: 3, wav2vec2.0: 2, gan: 2, multi modal: 2, visualization: 2, training data: 2, controllability: 2, error analysis: 2, transformers: 2, hidden markov models: 2, estimation: 2, speech enhancement: 2, automatic speech recognition: 2, computational modeling: 2, semantics: 2, self supervised learning: 2, linguistics: 2, adaptation: 2, lf mmi: 2, parameter estimation: 2, uncertainty: 2, handicapped aids: 2, disordered speech recognition: 2, time delay neural network: 2, model uncertainty: 2, neural language models: 2, multi look: 2, variational inference: 2, inference mechanisms: 2, lhuc: 2, gradient methods: 2, admm: 2, knowledge distillation: 2, quantization: 2, speaker verification: 2, code switching: 2, standards: 1, multi lingual xlsr: 1, hubert: 1, hybrid tdnn: 1, end to end conformer: 1, speech: 1, av hubert: 1, transforms: 1, low latency: 1, rapid adaptation: 1, interpolation: 1, specaugment: 1, reinforcement learning: 1, confidence score estimation: 1, speech dereverberation: 1, maximum likelihood detection: 1, nonlinear filters: 1, neural machine translation: 1, hierarchical attention mechanism: 1, machine translation: 1, generative adversarial networks: 1, vae: 1, alzheimer’s disease: 1, sociology: 1, syntactics: 1, task oriented: 1, transfer learning: 1, pretrained embeddings: 1, multimodality: 1, affective computing: 1, multi label: 1, bidirectional control: 1, multi task learning: 1, emotional expression: 1, multi culture: 1, vocal bursts: 1, data analysis: 1, bayesian: 1, nist: 1, elderly speech recognition: 1, search problems: 1, uncertainty handling: 1, minimisation: 1, neural net architecture: 1, monte carlo methods: 1, articulatory inversion: 1, hybrid power systems: 1, benchmark testing: 1, dereverberation and recognition: 1, reverberation: 1, speaker change detection: 1, audio signal processing: 1, multitask learning: 1, unsupervised learning: 1, unsupervised speech decomposition: 1, adversarial speaker adaptation: 1, speaker identity: 1, multi speaker: 1, knowledge transfer: 1, video to speech synthesis: 1, vector quantization: 1, measurement: 1, knowledge engineering: 1, lips: 1, predictive coding: 1, vocabulary: 1, uniform sampling: 1, path dropout: 1, mean square error methods: 1, neural network quantization: 1, source separation: 1, mixed precision: 1, direction of arrival estimation: 1, direction of arrival: 1, speaker diarization: 1, delays: 1, generalisation (artificial intelligence): 1, gaussian process: 1, any to many: 1, sequence to sequence modeling: 1, signal reconstruction: 1, signal sampling: 1, signal representation: 1, location relative attention: 1, multimodal speech recognition: 1, capsule: 1, exemplary emotion descriptor: 1, expressive speech synthesis: 1, residual error: 1, capsule network: 1, spatial information: 1, sequential: 1, recurrent: 1, tdnn: 1, switchboard: 1, lstm rnn: 1, low bit quantization: 1, image recognition: 1, microphone arrays: 1, visual occlusion: 1, overlapped speech recognition: 1, jointly fine tuning: 1, filtering theory: 1, video signal processing: 1, synthetic speech detection: 1, res2net: 1, voice activity detection: 1, replay detection: 1, multi scale feature: 1, asv anti spoofing: 1, adress: 1, cognition: 1, patient diagnosis: 1, alzheimer's disease detection: 1, signal classification: 1, diseases: 1, features: 1, geriatrics: 1, medical diagnostic computing: 1, asr: 1, controllable and efficient: 1, text to speech: 1, semi autoregressive: 1, prosody modelling: 1, autoregressive processes: 1, neurocognitive disorder detection: 1, dementia: 1, visual feature generation: 1, audio visual speech recognition (avsr): 1, phonetic posteriorgrams: 1, adversarial attack: 1, x vector: 1, gmm i vector: 1, accent conversion: 1, accented speech recognition: 1, cross modal: 1, seq2seq: 1, recurrent neural networks: 1, data compression: 1, alternating direction methods of multipliers: 1, audio visual speech recognition: 1, probability: 1, keyword search: 1, language model: 1, feedforward: 1, recurrent neural network: 1, succeeding words: 1, multilingual speech synthesis: 1, foreign accent: 1, gaussian process neural network: 1, activation function selection: 1, bayesian neural network: 1, neural network language models: 1, lstm: 1, connectionist temporal classification (ctc): 1, e learning: 1, computer assisted pronunciation training (capt): 1, convolutional neural network (cnn): 1, mispronunciation detection and diagnosis (mdd): 1, utterance level features: 1, spatial relationship information: 1, recurrent connection: 1, capsule networks: 1, maximum likelihood estimation: 1, entropy: 1, natural gradient: 1, rnnlms: 1
Most publications (all venues): 2022: 27, 2021: 27, 2024: 18, 2023: 17, 2019: 17

Affiliations
URLs

Recent publications

TASLP2024 Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu
Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition.

TASLP2024 Zengrui Jin, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan Li, Xunying Liu
Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition.

ICASSP2024 Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.

ICASSP2024 Jiajun Deng, Xurong Xie, Guinan Li, Mingyu Cui, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Zhaoqing Li, Xunying Liu
Towards High-Performance and Low-Latency Feature-Based Speaker Adaptation of Conformer Speech Recognition Systems.

ICASSP2024 Zengrui Jin, Xurong Xie, Tianzi Wang, Mengzhe Geng, Jiajun Deng, Guinan Li, Shujie Hu, Xunying Liu
Towards Automatic Data Augmentation for Disordered Speech Recognition.

ICASSP2024 Huimeng Wang, Zengrui Jin, Mengzhe Geng, Shujie Hu, Guinan Li, Tianzi Wang, Haoning Xu, Xunying Liu
Enhancing Pre-Trained ASR System Fine-Tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation.

TASLP2023 Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Guinan Li, Shujie Hu, Xunying Liu
Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems.

TASLP2023 Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu
Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition.

TASLP2023 Xixin Wu, Hui Lu, Kun Li 0003, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms.

ICASSP2023 Shujie Hu, Xurong Xie, Zengrui Jin, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng, 
Exploring Self-Supervised Pre-Trained ASR Models for Dysarthric and Elderly Speech Recognition.

ICASSP2023 Zengrui Jin, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, Xunying Liu
Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition.

ICASSP2023 Jinchao Li, Kaitao Song, Junan Li, Bo Zheng, Dongsheng Li 0002, Xixin Wu, Xunying Liu, Helen Meng, 
Leveraging Pretrained Representations With Task-Related Keywords for Alzheimer's Disease Detection.

ICASSP2023 Jinchao Li, Xixin Wu, Kaitao Song, Dongsheng Li 0002, Xunying Liu, Helen Meng, 
A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition.

ICASSP2023 Xurong Xie, Xunying Liu, Hui Chen 0020, Hongan Wang, 
Unsupervised Model-Based Speaker Adaptation of End-To-End Lattice-Free MMI Model for Speech Recognition.

Interspeech2023 Mingyu Cui, Jiawen Kang 0002, Jiajun Deng, Xi Yin 0010, Yutao Xie, Xie Chen 0001, Xunying Liu
Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems.

Interspeech2023 Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu
Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems.

Interspeech2023 Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu
Use of Speech Impairment Severity for Dysarthric Speech Recognition.

Interspeech2023 Mengzhe Geng, Xurong Xie, Rongfeng Su, Jianwei Yu, Zengrui Jin, Tianzi Wang, Shujie Hu, Zi Ye 0001, Helen Meng, Xunying Liu
On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition.

Interspeech2023 Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Helen Meng, Xunying Liu
Exploiting Cross-Domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition.

Interspeech2023 Zhaoqing Li, Tianzi Wang, Jiajun Deng, Junhao Xu, Shoukang Hu, Xunying Liu
Lossless 4-bit Quantization of Architecture Compressed Conformer ASR Systems on the 300-hr Switchboard Corpus.

#9  | Dong Yu 0001 | DBLP Google Scholar  
By venue: Interspeech: 40, ICASSP: 39, TASLP: 9, ACL: 2, ICLR: 2, ICML: 1, EMNLP: 1, ACL-Findings: 1, IJCAI: 1, NAACL: 1
By year: 2024: 5, 2023: 11, 2022: 15, 2021: 18, 2020: 22, 2019: 18, 2018: 8
ISCA sessions: speech recognition: 3; speech coding and enhancement: 2; speech synthesis: 2; voice conversion and adaptation: 2; speaker recognition: 2; source separation, dereverberation and echo cancellation: 2; multi-channel speech enhancement: 2; singing voice computing and processing in music: 2; deep learning for source separation and pitch tracking: 2; sequence models for asr: 2; speech enhancement and bandwidth expansion: 1; dereverberation and echo cancellation: 1; multi-, cross-lingual and other topics in asr: 1; topics in asr: 1; source separation: 1; novel neural network architectures for asr: 1; speech localization, enhancement, and quality assessment: 1; asr model training and strategies: 1; speech synthesis paradigms and methods: 1; multimodal speech processing: 1; speech and audio source separation and scene analysis: 1; speech enhancement: 1; asr neural network architectures: 1; asr neural network training: 1; asr for noisy and far-field speech: 1; robust speech recognition: 1; speaker verification using neural network methods: 1; expressive speech synthesis: 1; topics in speech recognition: 1
IEEE keywords: speech recognition: 23, speaker recognition: 11, speech synthesis: 9, speech enhancement: 8, speech separation: 7, task analysis: 6, natural language processing: 5, speaker embedding: 5, data augmentation: 5, reverberation: 4, decoding: 4, microphone arrays: 4, source separation: 4, recurrent neural nets: 4, end to end speech recognition: 4, self supervised learning: 3, automatic speech recognition: 3, voice activity detection: 3, unsupervised learning: 3, voice conversion: 3, vocoders: 2, spectrogram: 2, end to end: 2, measurement: 2, application program interfaces: 2, graphics processing units: 2, pattern clustering: 2, audio visual systems: 2, audio signal processing: 2, text analysis: 2, multi channel: 2, filtering theory: 2, semi supervised learning: 2, overlapped speech: 2, transfer learning: 2, domain adaptation: 2, maximum mean discrepancy: 2, speech coding: 2, code switching: 2, speaker verification: 2, self attention: 2, attention based model: 2, hidden markov models: 2, artificial neural networks: 1, loudspeakers: 1, hybrid method: 1, acoustic howling suppression: 1, kalman filters: 1, microphones: 1, adaptation models: 1, kalman filter: 1, recursive training: 1, noise reduction: 1, diffusion models: 1, signal to noise ratio: 1, generative models: 1, speech editing: 1, unsupervised tts acoustic modeling: 1, representation learning: 1, wavlm: 1, c dsvae: 1, transducers: 1, bayes methods: 1, discriminative training: 1, mutual information: 1, maximum mutual information: 1, minimum bayesian risk: 1, sequential training: 1, autoregressive model: 1, diffusion model: 1, text to sound generation: 1, transforms: 1, vocoder: 1, zero shot style transfer: 1, variational autoencoder: 1, supervised learning: 1, self supervised disentangled representation learning: 1, low quality data: 1, neural speech synthesis: 1, style transfer: 1, joint training: 1, dual path: 1, acoustic model: 1, echo suppression: 1, streaming: 1, dynamic weight attention: 1, acoustic environment: 1, speech simulation: 1, transient response: 1, multi speaker: 1, knowledge transfer: 1, video to speech synthesis: 1, vector quantization: 1, knowledge engineering: 1, lips: 1, predictive coding: 1, vocabulary: 1, rnn t: 1, code switched asr: 1, bilingual asr: 1, computational linguistics: 1, expert systems: 1, router architecture: 1, mixture of experts: 1, global information: 1, accent embedding: 1, domain embedding: 1, speaker clustering: 1, inference mechanisms: 1, overlap speech detection: 1, speaker diarization: 1, sensor fusion: 1, sound source separation: 1, audio visual processing: 1, rewriting systems: 1, interactive systems: 1, semantic role labeling: 1, dialogue understanding: 1, conversational semantic role labeling: 1, natural language understanding: 1, image recognition: 1, audio visual: 1, visual occlusion: 1, overlapped speech recognition: 1, jointly fine tuning: 1, video signal processing: 1, mvdr: 1, array signal processing: 1, adl mvdr: 1, neural architecture search: 1, transferable architecture: 1, neural net architecture: 1, multi granularity: 1, single channel: 1, self attentive network: 1, synthetic speech detection: 1, res2net: 1, replay detection: 1, multi scale feature: 1, asv anti spoofing: 1, target speaker speech recognition: 1, target speaker speech extraction: 1, uncertainty estimation: 1, direction of arrival estimation: 1, source localization: 1, contrastive learning: 1, target speaker enhancement: 1, robust speaker verification: 1, interference suppression: 1, speaker verification (sv): 1, phonetic posteriorgrams: 1, speech intelligibility: 1, regression analysis: 1, singing synthesis: 1, multi channel speech separation: 1, inter channel convolution differences: 1, spatial filters: 1, spatial features: 1, parallel optimization: 1, random sampling: 1, model partition: 1, lstm language model: 1, bmuf: 1, joint learning: 1, noise measurement: 1, speaker aware: 1, target speech enhancement: 1, time domain analysis: 1, gain: 1, teacher student: 1, accent conversion: 1, accented speech recognition: 1, target speech extraction: 1, minimisation: 1, neural beamformer: 1, signal reconstruction: 1, training data: 1, diffuse reflection: 1, acoustic simulation: 1, reflection: 1, persistent memory: 1, dfsmn: 1, multi modal: 1, audio visual speech recognition: 1, permutation invariant training: 1, encoding: 1, model integration: 1, multi band: 1, nist: 1, artificial intelligence: 1, mel frequency cepstral coefficient: 1, loss function: 1, boundary: 1, top k loss: 1, language model: 1, error analysis: 1, mathematical model: 1, switches: 1, attention based end to end speech recognition: 1, early update: 1, optimization: 1, token wise training: 1, discriminative feature learning: 1, sequence discriminative training: 1, acoustic variability: 1, asr: 1, variational inference: 1, convolutional neural nets: 1, quasifully recurrent neural network (qrnn): 1, parallel processing: 1, parallel wavenet: 1, text to speech (tts) synthesis: 1, convolutional neural network (cnn): 1, text to speech synthesis: 1, relative position aware representation: 1, sequence to sequence model: 1, teacher student training: 1, knowledge distillation: 1, multi domain: 1, all rounder: 1, feedforward neural nets: 1, cloud computing: 1, quantization: 1, polynomials: 1, privacy preserving: 1, dnn: 1, cryptography: 1, encryption: 1, text dependent: 1, end to end speaker verification: 1, seq2seq attention: 1, optimisation: 1, siamese neural networks: 1
Most publications (all venues) at: 2023: 57, 2022: 50, 2020: 47, 2019: 47, 2024: 44

Affiliations
Tencent AI Lab, China
Microsoft Research, Redmond, WA, USA (1998 - 2017)
University of Idaho, Moscow, ID, USA (PhD)

Recent publications

TASLP2024 Hao Zhang, Yixuan Zhang 0005, Meng Yu 0003, Dong Yu 0001
Enhanced Acoustic Howling Suppression via Hybrid Kalman Filter and Deep Learning Models.

ICASSP2024 Muqiao Yang, Chunlei Zhang, Yong Xu 0004, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu 0001
uSee: Unified Speech Enhancement And Editing with Conditional Diffusion Models.

ICML2024 Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su 0002, Wei Liang, Dong Yu 0001
Prompt-guided Precise Audio Editing with Diffusion Models.

ACL2024 Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang 0001, Ziyue Jiang 0001, Xuankai Chang, Jiatong Shi, Chao Weng, Zhou Zhao, Dong Yu 0001
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.

ACL2024 Yongxin Zhu 0003, Dan Su 0002, Liqiang He, Linli Xu, Dong Yu 0001
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer.

TASLP2023 Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu 0001
Unsupervised TTS Acoustic Modeling for TTS With Conditional Disentangled Sequential VAE.

TASLP2023 Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu 0001
Integrating Lattice-Free MMI Into End-to-End Speech Recognition.

TASLP2023 Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu 0001
Diffsound: Discrete Diffusion Model for Text-to-Sound Generation.

Interspeech2023 Yong Xu 0004, Vinay Kothapally, Meng Yu 0003, Shixiong Zhang, Dong Yu 0001
Zoneformer: On-device Neural Beamformer For In-car Multi-zone Speech Separation, Enhancement and Echo Cancellation.

Interspeech2023 Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu 0001, Shinji Watanabe 0001, 
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction.

Interspeech2023 Wei Xiao, Wenzhe Liu, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su 0002, Shidong Shang, Dong Yu 0001
Multi-mode Neural Speech Coding Based on Deep Generative Networks.

Interspeech2023 Yuping Yuan, Zhao You, Shulin Feng, Dan Su 0002, Yanchun Liang 0001, Xiaohu Shi, Dong Yu 0001
Compressed MoE ASR Model Based on Knowledge Distillation and Quantization.

Interspeech2023 Hao Zhang, Meng Yu 0003, Yuzhong Wu, Tao Yu, Dong Yu 0001
Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression.

Interspeech2023 Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu 0001, Zhao You, Dan Su 0002, Dong Yu 0001, Helen Meng, 
Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation.

EMNLP2023 Dian Yu 0001, Xiaoyang Wang, Wanshun Chen, Nan Du, Longyue Wang, Haitao Mi, Dong Yu 0001
More Than Spoken Words: Nonverbal Message Extraction and Generation.

ACL-Findings2023 Rongjie Huang, Chunlei Zhang, Yi Ren 0006, Zhou Zhao, Dong Yu 0001
Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech.

ICASSP2022 Jiachen Lian, Chunlei Zhang, Dong Yu 0001
Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion.

ICASSP2022 Songxiang Liu, Shan Yang, Dan Su 0002, Dong Yu 0001
Referee: Towards Reference-Free Cross-Speaker Style Transfer with Low-Quality Data for Expressive Speech Synthesis.

ICASSP2022 Dongpeng Ma, Yiwen Wang, Liqiang He, Mingjie Jin, Dan Su 0002, Dong Yu 0001
DP-DWA: Dual-Path Dynamic Weight Attention Network With Streaming Dfsmn-San For Automatic Speech Recognition.

ICASSP2022 Anton Ratnarajah, Shi-Xiong Zhang, Meng Yu 0003, Zhenyu Tang 0001, Dinesh Manocha, Dong Yu 0001
Fast-Rir: Fast Neural Diffuse Room Impulse Response Generator.

#10  | Zhiyong Wu 0001 | DBLP Google Scholar  
By venue: ICASSP: 48, Interspeech: 34, TASLP: 6, AAAI: 3, IJCAI: 2, EMNLP: 1
By year: 2024: 12, 2023: 25, 2022: 22, 2021: 14, 2020: 6, 2019: 11, 2018: 4
ISCA sessions: speech synthesis: 12; voice conversion and adaptation: 3; speech coding: 2; speech recognition: 2; spoken term detection: 2; speech coding and enhancement: 1; models for streaming asr: 1; single-channel speech enhancement: 1; embedding and network architecture for speaker recognition: 1; non-autoregressive sequential modeling for speech processing: 1; voice anti-spoofing and countermeasure: 1; speech synthesis paradigms and methods: 1; asr neural network architectures and training: 1; new trends in self-supervised speech processing: 1; neural techniques for voice conversion and waveform generation: 1; emotion recognition and analysis: 1; expressive speech synthesis: 1; deep learning for source separation and pitch tracking: 1
IEEE keywords: speech synthesis: 16, speech recognition: 15, natural language processing: 10, speaker recognition: 8, speech emotion recognition: 7, emotion recognition: 7, speech enhancement: 6, speech coding: 6, text analysis: 6, vocoders: 5, decoding: 5, recurrent neural nets: 5, expressive speech synthesis: 4, semantics: 4, noise reduction: 4, hidden markov models: 3, task analysis: 3, data mining: 3, data models: 3, text to speech: 3, voice conversion: 3, self supervised learning: 3, transformer: 3, transformers: 3, bidirectional attention mechanism: 2, spectrogram: 2, speech: 2, visualization: 2, training data: 2, cloning: 2, adaptation models: 2, language model: 2, timbre: 2, multiple signal classification: 2, encoding: 2, instruments: 2, coherence: 2, linguistics: 2, human computer interaction: 2, hierarchical: 2, predictive models: 2, computational modeling: 2, parallel processing: 2, speaking style modelling: 2, time frequency analysis: 2, robustness: 2, costs: 2, speaker verification: 2, automatic speaker verification: 2, pattern classification: 2, security of data: 2, adversarial defense: 2, trees (mathematics): 2, deep learning (artificial intelligence): 2, biometrics (access control): 2, adversarial attack: 2, optimisation: 2, entropy: 2, regression analysis: 2, ordinal regression: 2, code switching: 2, convolutional neural nets: 2, films: 1, multiscale speaking style transfer: 1, text to speech synthesis: 1, games: 1, automatic dubbing: 1, cross lingual speaking style transfer: 1, multi modal: 1, av hubert: 1, dysarthric speech reconstruction: 1, transforms: 1, audio visual: 1, vq vae: 1, pre training: 1, self supervised style enhancing: 1, dance expressiveness: 1, dance generation: 1, genre matching: 1, dance dynamics: 1, humanities: 1, dynamics: 1, beat alignment: 1, speaker adaptation: 1, zero shot: 1, multi scale acoustic prompts: 1, stereophonic music: 1, degradation: 1, codecs: 1, music generation: 1, neural codec: 1, image coding: 1, language models: 1, long multi track: 1, multi view midivae: 1, symbolic music generation: 1, two dimensional displays: 1, speech disentanglement: 1, vae: 1, voice cloning: 1, static var compensators: 1, harmonic analysis: 1, power harmonic filters: 1, synthesizers: 1, neural concatenation: 1, signal generators: 1, singing voice conversion: 1, information retrieval: 1, interaction gesture: 1, multi agent conversational interaction: 1, oral communication: 1, cognition: 1, dialog intention and emotion: 1, co speech gesture generation: 1, avatars: 1, motion processing: 1, multimodal learning: 1, gesture generation: 1, codes: 1, context modeling: 1, style modeling: 1, bit error rate: 1, multi scale: 1, automatic speech recognition: 1, neural machine translation: 1, hierarchical attention mechanism: 1, machine translation: 1, subband interaction: 1, inter subnet: 1, global spectral information: 1, speech signal improvement: 1, generative adversarial networks: 1, two stage: 1, reverberation: 1, real time systems: 1, speech restoration: 1, lightweight text to speech: 1, streaming text to speech: 1, diffusion probabilistic model: 1, probabilistic logic: 1, audiobook speech synthesis: 1, prediction methods: 1, context aware: 1, multi sentence: 1, hierarchical transformer: 1, target speech extraction: 1, multi modal fusion: 1, fuses: 1, 2d positional encoding: 1, cross attention: 1, transducers: 1, delays: 1, streaming: 1, computer architecture: 1, signal processing algorithms: 1, latency: 1, network architecture: 1, corrector network: 1, source separation: 1, time domain: 1, time frequency domain: 1, particle separators: 1, speech separation: 1, learning systems: 1, synthetic corpus: 1, measurement: 1, audio recording: 1, neural vocoder: 1, semantic augmentation: 1, upper bound: 1, data augmentation: 1, difficulty aware: 1, stability analysis: 1, error analysis: 1, contextual biasing: 1, conformer: 1, biased words: 1, sensitivity: 1, open vocabulary keyword spotting: 1, acoustic model: 1, dynamic network pruning: 1, melody unsupervision: 1, differentiable up sampling layer: 1, rhythm: 1, vocal range: 1, regulators: 1, bidirectional control: 1, annotations: 1, singing voice synthesis: 1, bi directional flow: 1, adversarial attacks: 1, supervised learning: 1, tree structure: 1, prosodic structure prediction: 1, computational linguistics: 1, span based decoder: 1, character level: 1, image segmentation: 1, speech to animation: 1, mixture of experts: 1, computer animation: 1, phonetic posteriorgrams: 1, phase information: 1, full band extractor: 1, multi scale time sensitive channel attention: 1, memory management: 1, convolution: 1, knowledge based systems: 1, flat lattice transformer: 1, rule based: 1, chinese text normalization: 1, none standard word: 1, relative position encoding: 1, xlnet: 1, knowledge distillation: 1, speaking style: 1, conversational text to speech synthesis: 1, graph neural network: 1, matrix algebra: 1, multi task learning: 1, end to end model: 1, forced alignment: 1, audio signal processing: 1, vocoder: 1, neural architecture search: 1, uniform sampling: 1, path dropout: 1, phoneme recognition: 1, mispronunciation detection and diagnosis: 1, acoustic phonetic linguistic embeddings: 1, computer aided pronunciation training: 1, connectionist temporal classification: 1, cross entropy: 1, disentangling: 1, hybrid bottleneck features: 1, voice activity detection: 1, capsule: 1, exemplary emotion descriptor: 1, residual error: 1, capsule network: 1, spatial information: 1, sequential: 1, recurrent: 1, emotion: 1, global style token: 1, expressive: 1, ctc: 1, non autoregressive: 1, autoregressive processes: 1, neural network based text to speech: 1, grammars: 1, prosody control: 1, word processing: 1, syntactic parse tree traversal: 1, syntactic representation learning: 1, goodness of pronunciation: 1, pronunciation assessment: 1, computer assisted language learning: 1, computer aided instruction: 1, multi speaker and multi style tts: 1, hifi gan: 1, durian: 1, low resource condition: 1, weapons: 1, perturbation methods: 1, information filters: 1, phonetic posteriorgrams: 1, speech intelligibility: 1, accent conversion: 1, accented speech recognition: 1, multilingual speech synthesis: 1, end to end: 1, foreign accent: 1, spectral analysis: 1, center loss: 1, discriminative features: 1, multi head self attention: 1, dilated residual network: 1, wavenet: 1, self attention: 1, blstm: 1, phonetic posteriorgrams (ppgs): 1, anchored reference sample: 1, mean opinion score (mos): 1, speech fluency assessment: 1, computer assisted language learning (call): 1, variational inference: 1, quasifully recurrent neural network (qrnn): 1, parallel wavenet: 1, text to speech (tts) synthesis: 1, convolutional neural network (cnn): 1, utterance level features: 1, spatial relationship information: 1, recurrent connection: 1, capsule networks: 1
Most publications (all venues) at: 2023: 44, 2024: 33, 2022: 33, 2021: 22, 2019: 17

Affiliations
Tsinghua University, Joint Research Center for Media Sciences, Beijing, China (PhD)
Chinese University of Hong Kong, Hong Kong

Recent publications

TASLP2024 Jingbei Li, Sipan Li, Ping Chen, Luwen Zhang, Yi Meng, Zhiyong Wu 0001, Helen Meng, Qiao Tian, Yuping Wang, Yuxuan Wang 0002, 
Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing.

ICASSP2024 Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.

ICASSP2024 Xueyuan Chen, Xi Wang 0016, Shaofei Zhang, Lei He 0005, Zhiyong Wu 0001, Xixin Wu, Helen Meng, 
Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.

ICASSP2024 Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu 0001, Haozhi Huang 0004, Helen Meng, 
Enhancing Expressiveness in Dance Generation Via Integrating Frequency and Music Style Information.

ICASSP2024 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Dan Luo, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han 0001, Helen Meng, 
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts.

ICASSP2024 Xingda Li, Fan Zhuo, Dan Luo, Jun Chen 0024, Shiyin Kang, Zhiyong Wu 0001, Tao Jiang, Yang Li, Han Fang, Yahui Zhou, 
Generating Stereophonic Music with Single-Stage Language Models.

ICASSP2024 Zhiwei Lin, Jun Chen 0024, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu 0001, Helen Meng, 
Multi-View Midivae: Fusing Track- and Bar-View Representations for Long Multi-Track Symbolic Music Generation.

ICASSP2024 Hui Lu, Xixin Wu, Haohan Guo, Songxiang Liu, Zhiyong Wu 0001, Helen Meng, 
Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations.

ICASSP2024 Binzhu Sha, Xu Li 0015, Zhiyong Wu 0001, Ying Shan, Helen Meng, 
Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion.

ICASSP2024 Haiwei Xue, Sicheng Yang, Zhensong Zhang, Zhiyong Wu 0001, Minglei Li 0001, Zonghong Dai, Helen Meng, 
Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models.

ICASSP2024 Sicheng Yang, Zunnan Xu, Haiwei Xue, Yongkang Cheng, Shaoli Huang, Mingming Gong, Zhiyong Wu 0001
FreeTalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness.

AAAI2024 Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu 0001, Shi-Xiong Zhang, Guangzhi Li, Yi Luo 0004, Rongzhi Gu, 
SECap: Speech Emotion Captioning with Large Language Model.

TASLP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Helen Meng, 
MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis.

TASLP2023 Xixin Wu, Hui Lu, Kun Li 0003, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms.

ICASSP2023 Jun Chen 0024, Wei Rao, Zilin Wang, Jiuxin Lin, Zhiyong Wu 0001, Yannan Wang, Shidong Shang, Helen Meng, 
Inter-Subnet: Speech Enhancement with Subband Interaction.

ICASSP2023 Jun Chen 0024, Yupeng Shi, Wenzhe Liu, Wei Rao, Shulin He, Andong Li, Yannan Wang, Zhiyong Wu 0001, Shidong Shang, Chengshi Zheng, 
Gesper: A Unified Framework for General Speech Restoration.

ICASSP2023 Jie Chen, Xingchen Song, Zhendong Peng, Binbin Zhang, Fuping Pan, Zhiyong Wu 0001
LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech.

ICASSP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis.

ICASSP2023 Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen 0024, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu 0001, Yujun Wang, Helen Meng, 
Av-Sepformer: Cross-Attention Sepformer for Audio-Visual Target Speaker Extraction.

ICASSP2023 Xingchen Song, Di Wu 0061, Zhiyong Wu 0001, Binbin Zhang, Yuekai Zhang, Zhendong Peng, Wenpeng Li, Fuping Pan, Changbao Zhu, 
TrimTail: Low-Latency Streaming ASR with Simple But Effective Spectrogram-Level Length Penalty.

#11  | Jinyu Li 0001 | DBLP Google Scholar  
By venue: ICASSP: 43, Interspeech: 37, TASLP: 7, ICML: 1, ACL: 1, EMNLP: 1
By year: 2024: 9, 2023: 12, 2022: 21, 2021: 17, 2020: 16, 2019: 11, 2018: 4
ISCA sessions: novel models and training methods for asr: 3; source separation: 3; asr neural network architectures: 3; streaming for asr/rnn transducers: 2; multi- and cross-lingual asr, other topics in asr: 2; streaming asr: 2; speech recognition: 1; statistical machine translation: 1; speaker and language recognition: 1; other topics in speech recognition: 1; robust asr, and far-field/multi-talker asr: 1; spoken language processing: 1; topics in asr: 1; self-supervision and semi-supervision for neural asr training: 1; neural network training methods for asr: 1; language and lexical modeling for asr: 1; asr model training and strategies: 1; acoustic model adaptation for asr: 1; new trends in self-supervised speech processing: 1; asr neural network architectures and training: 1; search for speech recognition: 1; multi-channel speech enhancement: 1; spoken term detection, confidence measure, and end-to-end speech recognition: 1; asr neural network training: 1; neural network training strategies for asr: 1; novel neural network architectures for acoustic modelling: 1; novel approaches to enhancement: 1; deep enhancement: 1
IEEE keywords: speech recognition: 34, recurrent neural nets: 8, transducers: 7, data models: 7, error analysis: 7, self supervised learning: 7, vocabulary: 6, task analysis: 6, speaker recognition: 6, natural language processing: 6, transformers: 5, predictive models: 5, speech enhancement: 5, automatic speech recognition: 5, factorized neural transducer: 4, speech translation: 4, representation learning: 4, transformer: 4, adaptation models: 4, oral communication: 4, end to end: 4, speech coding: 3, decoding: 3, computational modeling: 3, speech separation: 3, ctc: 3, multi talker automatic speech recognition: 3, transformer transducer: 3, continuous speech separation: 3, source separation: 3, speaker adaptation: 3, teacher student learning: 3, attention: 3, adversarial learning: 3, context modeling: 2, codecs: 2, language model: 2, speech synthesis: 2, semantics: 2, real time systems: 2, streaming: 2, degradation: 2, loading: 2, streaming inference: 2, speaker diarization: 2, conversation transcription: 2, training data: 2, contextual biasing: 2, contextual spelling correction: 2, analytical models: 2, interpolation: 2, multi talker asr: 2, transducer: 2, combination: 2, meeting transcription: 2, encoding: 2, audio signal processing: 2, lstm: 2, domain adaptation: 2, deep neural network: 2, neural network: 2, long content speech recognition: 1, streaming and non streaming: 1, rnn t: 1, computer architecture: 1, speech removal: 1, codes: 1, speech generation: 1, noise reduction: 1, audio text input: 1, multi task learning: 1, noise suppression: 1, target speaker extraction: 1, zero shot text to speech: 1, speech editing: 1, machine translation: 1, speech text joint pre training: 1, discrete tokenization: 1, unified modeling language: 1, costs: 1, timestamp: 1, synchronization: 1, joint: 1, weight sharing: 1, memory management: 1, model compression: 1, performance evaluation: 1, speech recognition and translation: 1, low rank approximation: 1, token level serialized output training: 1, multi talker speech recognition: 1, text only adaptation: 1, symbols: 1, measurement: 1, overlapping speech: 1, recording: 1, wavlm: 1, multi speaker: 1, bit error rate: 1, hubert: 1, ce: 1, fuses: 1, long form speech recognition: 1, context and speech encoder: 1, focusing: 1, microphone arrays: 1, geometry: 1, microphone array: 1, external attention: 1, speech to speech translation: 1, joint pre training: 1, data mining: 1, cross lingual modeling: 1, speaker change detection: 1, e2e asr: 1, f1 score: 1, limiting: 1, data simulation: 1, conversation analysis: 1, signal processing algorithms: 1, n gram: 1, kl divergence: 1, factorized transducer model: 1, neural transducer model: 1, non autoregressive: 1, language model adaptation: 1, multitasking: 1, pre training: 1, benchmark testing: 1, speaker: 1, linear programming: 1, end to end end point detection: 1, long form meeting transcription: 1, dual path rnn: 1, robust speech recognition: 1, contrastive learning: 1, wav2vec 2.0: 1, robust automatic speech recognition: 1, supervised learning: 1, hybrid: 1, cascaded: 1, two pass: 1, recurrent selective attention network: 1, configurable multilingual model: 1, multilingual speech recognition: 1, speaker inventory: 1, mathematical model: 1, estimated speech: 1, particle separators: 1, computer science: 1, correlation: 1, speaker separation: 1, multi channel microphone: 1, deep learning (artificial intelligence): 1, signal representation: 1, real time decoding: 1, multi speaker asr: 1, conformer: 1, attention based encoder decoder: 1, recurrent neural network transducer: 1, segmentation: 1, filtering theory: 1, system fusion: 1, neural language generation: 1, unsupervised learning: 1, acoustic model adaptation: 1, permutation invariant training: 1, libricss: 1, microphones: 1, overlapped speech: 1, production: 1, tensors: 1, rnn transducer: 1, virtual assistants: 1, alignments: 1, pre training: 1, pattern classification: 1, streaming attention based sequence to sequence asr: 1, latency reduction: 1, monotonic chunkwise attention: 1, entropy: 1, computer aided instruction: 1, latency: 1, label embedding: 1, knowledge representation: 1, backpropagation: 1, end to end system: 1, oov: 1, acoustic to word: 1, adaptation: 1, universal acoustic model: 1, mixture of experts: 1, mixture models: 1, layer trajectory: 1, future context frames: 1, temporal modeling: 1, senone classification: 1, signal classification: 1, code switching: 1, language identification: 1, asr: 1, domain invariant training: 1, speaker verification: 1
Most publications (all venues) at: 2022: 27, 2021: 25, 2024: 22, 2023: 22, 2020: 18

Affiliations
Microsoft Corporation, Redmond, WA, USA
Georgia Institute of Technology, Center for Signal and Image Processing, Atlanta, GA, USA (PhD)
University of Science and Technology of China, iFlytek Speech Lab, Hefei, China

Recent publications

TASLP2024 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
Advanced Long-Content Speech Recognition With Factorized Neural Transducer.

TASLP2024 Xiaofei Wang 0007, Manthan Thakker, Zhuo Chen 0006, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu 0001, Jinyu Li 0001, Takuya Yoshioka, 
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer.

TASLP2024 Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu 0012, Shujie Liu 0001, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Furu Wei, 
VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation.

TASLP2024 Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu 0012, Shuo Ren, Shujie Liu 0001, Zhuoyuan Yao, Xun Gong 0005, Li-Rong Dai 0001, Jinyu Li 0001, Furu Wei, 
SpeechLM: Enhanced Speech Pre-Training With Unpaired Textual Data.

ICASSP2024 Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li 0001, Yashesh Gaur, 
Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation.

ICASSP2024 Yiming Wang, Jinyu Li 0001
Residualtransformer: Residual Low-Rank Learning With Weight-Sharing For Transformer Layers.

ICASSP2024 Jian Wu 0027, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao 0017, Zhuo Chen 0006, Jinyu Li 0001
T-SOT FNT: Streaming Multi-Talker ASR with Text-Only Domain Adaptation Capability.

ICASSP2024 Mu Yang, Naoyuki Kanda, Xiaofei Wang 0009, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li 0001, Takuya Yoshioka, 
Diarist: Streaming Speech Translation with Speaker Diarization.

ICML2024 Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan 0003, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu 0001, Tao Qin 0001, Xiangyang Li 0001, Wei Ye 0004, Shikun Zhang, Jiang Bian 0002, Lei He 0005, Jinyu Li 0001, Sheng Zhao, 
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

ICASSP2023 Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, Sefik Emre Eskimez, 
Speech Separation with Large-Scale Self-Supervised Learning.

ICASSP2023 Ruchao Fan, Yiming Wang, Yashesh Gaur, Jinyu Li 0001
CTCBERT: Advancing Hidden-Unit BERT with CTC Objectives.

ICASSP2023 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.

ICASSP2023 Zili Huang, Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yiming Wang, Jinyu Li 0001, Takuya Yoshioka, Xiaofei Wang 0009, Peidong Wang, 
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition.

ICASSP2023 Naoyuki Kanda, Jian Wu 0027, Xiaofei Wang 0009, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition.

ICASSP2023 Xiaoqiang Wang 0006, Yanqing Liu, Jinyu Li 0001, Sheng Zhao, 
Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation.

ICASSP2023 Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu 0001, Lei He 0005, Jinyu Li 0001, Furu Wei, 
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation.

ICASSP2023 Jian Wu 0027, Zhuo Chen 0006, Min Hu, Xiong Xiao, Jinyu Li 0001
Speaker Change Detection For Transformer Transducer ASR.

ICASSP2023 Muqiao Yang, Naoyuki Kanda, Xiaofei Wang 0009, Jian Wu 0027, Sunit Sivasankaran, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Simulating Realistic Speech Overlaps Improves Multi-Talker ASR.

ICASSP2023 Rui Zhao 0017, Jian Xue, Partha Parthasarathy, Veljko Miljanic, Jinyu Li 0001
Fast and Accurate Factorized Neural Transducer for Text Adaption of End-to-End Speech Recognition Models.

Interspeech2023 Yuang Li, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, 
Accelerating Transducers through Adjacent Token Merging.

#12  | DeLiang Wang | DBLP Google Scholar  
By venueTASLP: 28Interspeech: 27ICASSP: 26
By year2024: 32023: 82022: 202021: 112020: 162019: 152018: 8
ISCA sessionsdeep enhancement: 3speech coding and privacy: 2single-channel speech enhancement: 2speech enhancement: 2asr for noisy and far-field speech: 2spatial and phase cues for source separation and speech recognition: 2multi-talker methods in speech processing: 1speech enhancement and denoising: 1speech recognition: 1dereverberation, noise reduction, and speaker extraction: 1challenges and opportunities for signal processing and machine learning for multiple smart devices: 1speech representation: 1multi-channel speech enhancement and hearing aids: 1source separation, dereverberation and echo cancellation: 1speech and audio quality assessment: 1noise reduction and intelligibility: 1speaker and language recognition: 1novel approaches to enhancement: 1source separation from monaural input: 1deep learning for source separation and pitch tracking: 1
IEEE keywordsspeech enhancement: 32speaker recognition: 13complex spectral mapping: 10recurrent neural nets: 9source separation: 9speech intelligibility: 8convolutional neural nets: 8reverberation: 7speaker separation: 6speech recognition: 6microphone arrays: 6microphones: 6array signal processing: 5time domain: 5location based training: 4time frequency analysis: 4noise measurement: 4estimation: 4monaural speech enhancement: 4microphone array processing: 4fourier transforms: 4deep casa: 4signal to noise ratio: 3robust speaker localization: 3direction of arrival estimation: 3time domain analysis: 3neural cascade architecture: 3convolution: 3robustness: 3self attention: 3time domain enhancement: 3blind source separation: 3permutation invariant training: 3beamforming: 3covariance matrices: 3acoustic noise: 3monaural speech separation: 3audio signal processing: 3phase estimation: 3speech dereverberation: 3deep neural networks: 3task analysis: 2continuous speaker separation: 2continuous speech separation: 2geometry: 2automatic speech recognition: 2self supervised learning: 2speech separation: 2frequency domain analysis: 2cross corpus generalization: 2deep learning (artificial intelligence): 2multi channel speaker separation: 2complex domain: 2bone conduction: 2attention based fusion: 2natural language processing: 2speaker diarization: 2optimisation: 2pruning: 2quantization: 2model compression: 2sparse regularization: 2hearing: 2talker independent speaker separation: 2dereverberation: 2encoding: 2decoding: 2computational auditory scene analysis: 2time frequency masking: 2iterative methods: 2conversational speaker separation: 1streams: 1multi speaker speech recognition: 1separation processes: 1multi channel speaker diarization: 1audiovisual speaker separation: 1multimodal speech processing: 1visualization: 1attentive audiovisual fusion: 1systematics: 1mimo complex spectral mapping: 1location awareness: 1merging: 1data mining: 1speaker extraction: 1attentive training: 
1talker independent: 1interference: 1speech: 1pitch tracking: 1multitasking: 1multi task learning: 1complex domain processing: 1densely connected convolutional recurrent neural network: 1voicing detection: 1frequency estimation: 1packet loss concealment: 1packet loss: 1semantics: 1diffusion model: 1low signal to noise ratio: 1generative model: 1background noise: 1recurrent neural network: 1talker independence: 1multi channel complex spectral mapping: 1spectrospatial filtering: 1spectrogram: 1neural net architecture: 1cascade architecture: 1signal representation: 1sensor fusion: 1signal denoising: 1air conduction: 1nonlinear distortions: 1acoustic echo cancellation: 1neurocontrollers: 1multi channel aec: 1echo suppression: 1mimo: 1fixed array: 1multichannel: 1triple path: 1robust automatic speech recognition: 1spectral magnitude: 1cross domain speech enhancement: 1multi speaker asr: 1meeting transcription: 1alimeeting: 1m2met: 1acoustic echo suppression: 1recurrent neural networks: 1feature combination: 1frame level snr estimation: 1long short term memory: 1dense convolutional network: 1self attention network: 1frequency domain loss: 1data compression: 1quantisation (signal): 1speaker inventory: 1mathematical model: 1estimated speech: 1particle separators: 1computer science: 1correlation: 1training data: 1data models: 1overlapped speech: 1modulation: 1computational modeling: 1performance evaluation: 1pipelines: 1quantization (signal): 1densely connected convolutional recurrent network: 1on device processing: 1real time speech enhancement: 1mobile communication: 1dual microphone mobile phones: 1complex domain separation: 1ensemble learning: 1singing voice separation: 1convolutional neural network: 1music: 1self attention mechanism: 1monaural speaker separation: 1causal processing: 1robust enhancement: 1channel generalization: 1robust speaker recognition: 1gammatone frequency cepstral coefficient (gfcc): 1masking based beamforming: 1x vector: 1gaussian processes: 
1gated convolutional recurrent network: 1distortion independent acoustic modeling: 1speech distortion: 1transient response: 1temporal convolutional networks: 1room impulse response: 1dense network: 1time frequency loss: 1speaker and noise independent: 1fully convolutional: 1voice telecommunication: 1processing artifacts: 1cochannel speech separation: 1two stage network: 1pattern clustering: 1divide and conquer methods: 1audio databases: 1fully convolutional neural network: 1mean absolute error: 1generalisation (artificial intelligence): 1gated linear units: 1residual learning: 1dilated convolutions: 1feedforward neural nets: 1sequence to sequence mapping: 1chimera++ networks: 1deep clustering: 1spatial features: 1gcc phat: 1steered response power: 1ideal ratio mask: 1denoising: 1signal reconstruction: 1phase: 1noise independent and speaker independent speech enhancement: 1real time implementation: 1tcnn: 1temporal convolutional neural network: 1complex valued deep neural networks: 1learning phase: 1phase aware speech enhancement: 1cdnn: 1spectral analysis: 1convolutional recurrent network: 1causal system: 1phase reconstruction: 1chimera++ networks: 1
Most publications (all venues) at2022: 232018: 212020: 202021: 192019: 16


Recent publications

TASLP2024 Hassan Taherian, DeLiang Wang
Multi-Channel Conversational Speaker Separation via Neural Diarization.

ICASSP2024 Vahid Ahmadi Kalkhorani, Anurag Kumar 0003, Ke Tan 0001, Buye Xu, DeLiang Wang
Audiovisual Speaker Separation with Full- and Sub-Band Modeling in the Time-Frequency Domain.

ICASSP2024 Hassan Taherian, Ashutosh Pandey 0004, Daniel Wong, Buye Xu, DeLiang Wang
Leveraging Sound Localization to Improve Continuous Speaker Separation.

TASLP2023 Ashutosh Pandey 0004, DeLiang Wang
Attentive Training: A New Training Framework for Speech Enhancement.

TASLP2023 Yixuan Zhang 0005, Heming Wang, DeLiang Wang
F0 Estimation and Voicing Detection With Cascade Architecture in Noisy Speech.

ICASSP2023 Hassan Taherian, DeLiang Wang
Multi-Resolution Location-Based Training for Multi-Channel Continuous Speech Separation.

ICASSP2023 Heming Wang, Yao Qian, Hemin Yang, Naoyuki Kanda, Peidong Wang, Takuya Yoshioka, Xiaofei Wang 0009, Yiming Wang, Shujie Liu 0001, Zhuo Chen 0006, DeLiang Wang, Michael Zeng 0001, 
DATA2VEC-SG: Improving Self-Supervised Learning Representations for Speech Generation Tasks.

ICASSP2023 Heming Wang, DeLiang Wang
Cross-Domain Diffusion Based Speech Enhancement for Very Noisy Speech.

Interspeech2023 Vahid Ahmadi Kalkhorani, Anurag Kumar 0003, Ke Tan 0001, Buye Xu, DeLiang Wang
Time-domain Transformer-based Audiovisual Speaker Separation.

Interspeech2023 Hassan Taherian, Ashutosh Pandey 0004, Daniel Wong, Buye Xu, DeLiang Wang
Multi-input Multi-output Complex Spectral Mapping for Speaker Separation.

Interspeech2023 Yufeng Yang, Ashutosh Pandey 0004, DeLiang Wang
Time-Domain Speech Enhancement for Robust Automatic Speech Recognition.

TASLP2022 Ashutosh Pandey 0004, DeLiang Wang
Self-Attending RNN for Speech Enhancement to Improve Cross-Corpus Generalization.

TASLP2022 Hassan Taherian, Ke Tan 0001, DeLiang Wang
Multi-Channel Talker-Independent Speaker Separation Through Location-Based Training.

TASLP2022 Ke Tan 0001, Zhong-Qiu Wang, DeLiang Wang
Neural Spectrospatial Filtering.

TASLP2022 Heming Wang, DeLiang Wang
Neural Cascade Architecture With Triple-Domain Loss for Speech Enhancement.

TASLP2022 Heming Wang, Xueliang Zhang 0001, DeLiang Wang
Fusing Bone-Conduction and Air-Conduction Sensors for Complex-Domain Speech Enhancement.

TASLP2022 Hao Zhang, DeLiang Wang
Neural Cascade Architecture for Multi-Channel Acoustic Echo Suppression.

ICASSP2022 Ashutosh Pandey 0004, Buye Xu, Anurag Kumar 0003, Jacob Donley, Paul Calamia, DeLiang Wang
TPARN: Triple-Path Attentive Recurrent Network for Time-Domain Multichannel Speech Enhancement.

ICASSP2022 Hassan Taherian, Ke Tan 0001, DeLiang Wang
Location-Based Training for Multi-Channel Talker-Independent Speaker Separation.

ICASSP2022 Heming Wang, Yao Qian, Xiaofei Wang 0009, Yiming Wang, Chengyi Wang 0002, Shujie Liu 0001, Takuya Yoshioka, Jinyu Li 0001, DeLiang Wang
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction.

#13  | Jun Du | DBLP Google Scholar  
By venueInterspeech: 35ICASSP: 28TASLP: 13SpeechComm: 3
By year2024: 82023: 142022: 102021: 132020: 172019: 152018: 2
ISCA sessionsspeaker diarization: 3speaker recognition: 3speaker embedding and diarization: 2speech enhancement: 2speech enhancement and denoising: 1multi-talker methods in speech processing: 1spoken dialog systems and conversational analysis: 1speech recognition: 1spoken language processing: 1acoustic scene analysis: 1low-resource asr development: 1spoken dialogue systems and multimodality: 1multimodal systems: 1tools, corpora and resources: 1interspeech 2021 deep noise suppression challenge: 1single-channel speech enhancement: 1voice activity detection and keyword spotting: 1asr model training and strategies: 1acoustic model adaptation for asr: 1acoustic scene classification: 1multi-channel speech enhancement: 1speech emotion recognition: 1speech coding and evaluation: 1speech and audio classification: 1corpus annotation and evaluation: 1far-field speech recognition: 1the second dihard speech diarization challenge (dihard ii): 1deep enhancement: 1the first dihard speech diarization challenge: 1
IEEE keywordsspeech enhancement: 22speech recognition: 19speaker diarization: 9visualization: 7noise measurement: 5task analysis: 5speaker recognition: 5misp challenge: 4recording: 4voice activity detection: 4data models: 4progressive learning: 4regression analysis: 4deep neural network: 4audio visual: 3hidden markov models: 3noise: 3error analysis: 3automatic speech recognition: 3adaptation models: 3speech separation: 3signal to noise ratio: 3reverberation: 3convolutional neural nets: 3robust speech recognition: 2optimization: 2mathematical models: 2attention: 2iterative methods: 2chime 7 challenge: 2robustness: 2estimation: 2emotion recognition: 2face recognition: 2semantics: 2data mining: 2multimodality: 2memory aware speaker embedding: 2attention network: 2telephone sets: 2time domain analysis: 2data augmentation: 2decoding: 2speech coding: 2speech intelligibility: 2post processing: 2entropy: 2image analysis: 2acoustic scene classification: 2convolutional neural networks: 2improved minima controlled recursive averaging: 2neural network: 2signal classification: 2fully convolutional neural network: 2attention mechanism: 2generalized gaussian distribution: 2mean square error methods: 2maximum likelihood estimation: 2least mean squares methods: 2gaussian distribution: 2ideal ratio mask: 2task generic: 1measurement: 1optimization objective: 1distortion measurement: 1diffusion model: 1score based: 1speech denoising: 1interpolating diffusion model: 1interpolation: 1writing: 1multi modal: 1aggregated optical flow map: 1trajectory: 1handwriting recognition: 1handwritten mathematical expression recognition: 1topology: 1multi channel speech enhancement: 1iterative mask estimation: 1redundancy: 1feature fusion: 1multi modal emotion recognition: 1entropy based fusion: 1structured pruning: 1network architecture optimization: 1target speaker enhancement: 1self supervised learning: 1speaker adaptive: 1target speaker extraction: 1real world scenarios: 1benchmark testing: 1oral 
communication: 1memory management: 1chime challenge: 1graphics processing units: 1sequence to sequence architecture: 1codes: 1adaptive refinement: 1dictionary learning: 1adaptive systems: 1dynamic mask: 1data quality control: 1time frequency analysis: 1wiener filter: 1gevd: 1wiener filters: 1speech distortion: 1mean square error: 1correlation: 1low rank approximation: 1synchronization: 1dcase 2022: 1testing: 1sound event localization and detection: 1model architecture: 1realistic data: 1location awareness: 1transfer learning: 1synthetic speech detection: 1quantum transfer learning: 1integrated circuit modeling: 1quantum machine learning: 1variational quantum circuit: 1pre trained model: 1speech synthesis: 1tv: 1quality assessment: 1visual embedding reconstruction: 1acoustic distortion: 1learning systems: 1public domain software: 1wake word spotting: 1audio visual systems: 1microphone array: 1ts vad: 1m2met: 1snr constriction: 1time domain: 1dihard iii challenge: 1filtering: 1iteration: 1signal processing algorithms: 1robust automatic speech recognition: 1acoustic model: 1neural net architecture: 1probability: 1cross entropy: 1optimisation: 1deep neural network (dnn): 1local response normalization: 1multi level and adaptive fusion: 1factorized bilinear pooling: 1multimodal emotion recognition: 1analytical models: 1class activation mapping: 1adaptive noise and speech estimation: 1computer architecture: 1additives: 1noise reduction: 1computational modeling: 1convolutional layers: 1sehae: 1hierarchical autoencoder: 1computational complexity: 1speaker adaptation: 1memory aware networks: 1microphone arrays: 1snr progressive learning: 1recurrent neural nets: 1dense structure: 1acoustic segment model: 1ctc: 1matrix algebra: 1scaling: 1model adaptation: 1dilated convolution: 1speaker verification: 1baum welch statistics: 1maximum likelihood: 1shape factors update: 1multi objective learning: 1speech activity detection: 1snr estimation: 1dihard data: 1geometric constraint: 
1geometry: 1linear programming: 1lstm: 12d to 2d mapping: 1fuzzy neural nets: 1performance evaluation: 1source separation: 1child speech extraction: 1realistic conditions: 1measures: 1prediction error modeling: 1gaussian processes: 1acoustic modeling: 1joint optimization: 1mixed bandwidth speech recognition: 1bandwidth expansion: 1function approximation: 1expressive power: 1universal approximation: 1vector to vector regression: 1improved speech presence probability: 1error statistics: 1teacher student learning: 1deep learning based speech enhancement: 1noise robust speech recognition: 1multiple speakers: 1interference: 1speaker dependent speech separation: 1chime 5 challenge: 1arrays: 1acoustic noise: 1statistical speech enhancement: 1signal denoising: 1gain function: 1
Most publications (all venues) at2023: 642024: 542020: 522021: 462019: 44


Recent publications

TASLP2024 Hang Chen, Qing Wang 0008, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 0001, 
Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition.

TASLP2024 Zilu Guo, Qing Wang 0008, Jun Du, Jia Pan, Qing-Feng Liu, Chin-Hui Lee 0001, 
A Variance-Preserving Interpolation Approach for Diffusion Models With Applications to Single Channel Speech Enhancement and Recognition.

ICASSP2024 Hanbo Cheng, Jun Du, Pengfei Hu 0006, Jiefeng Ma, Zhenrong Zhang, Mobai Xue, 
Viewing Writing as Video: Optical Flow based Multi-Modal Handwritten Mathematical Expression Recognition.

ICASSP2024 Feng Ma, Yanhui Tu, Maokui He, Ruoyu Wang 0029, Shutong Niu, Lei Sun 0010, Zhongfu Ye, Jun Du, Jia Pan, Chin-Hui Lee 0001, 
A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition.

ICASSP2024 Haotian Wang, Jun Du, Yusheng Dai, Chin-Hui Lee 0001, Yuling Ren, Yu Liu, 
Improving Multi-Modal Emotion Recognition Using Entropy-Based Fusion and Pruning-Based Network Architecture Optimization.

ICASSP2024 Minghui Wu, Haitao Tang, Jiahuan Fan, Ruoyu Wang, Hang Chen, Yanyong Zhang, Jun Du, Hengshun Zhou, Lei Sun, Xin Fang, Tian Gao, Genshun Wan, Jia Pan, Jianqing Gao, 
Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR through Efficient Joint Optimization.

ICASSP2024 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang 0029, Hongbo Lan, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao, 
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.

ICASSP2024 Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang 0029, Yanyan Yue, Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee 0001, 
Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture.

SpeechComm2023 Shi Cheng, Jun Du, Shutong Niu, Alejandrina Cristia, Xin Wang 0037, Qing Wang 0008, Chin-Hui Lee 0001, 
Using iterative adaptation and dynamic mask for child speech extraction under real-world multilingual conditions.

SpeechComm2023 Li Chai 0002, Hang Chen, Jun Du, Qing-Feng Liu, Chin-Hui Lee 0001, 
Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech.

TASLP2023 Mao-Kui He, Jun Du, Qing-Feng Liu, Chin-Hui Lee 0001, 
ANSD-MA-MSE: Adaptive Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding.

TASLP2023 Shutong Niu, Jun Du, Lei Sun 0010, Yu Hu 0003, Chin-Hui Lee 0001, 
QDM-SSD: Quality-Aware Dynamic Masking for Separation-Based Speaker Diarization.

TASLP2023 Jie Zhang 0042, Rui Tao, Jun Du, Li-Rong Dai 0001, 
SDW-SWF: Speech Distortion Weighted Single-Channel Wiener Filter for Noise Reduction.

ICASSP2023 Hang Chen, Shilong Wu, Yusheng Dai, Zhe Wang, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
Summary on the Multimodal Information Based Speech Processing (MISP) 2022 Challenge.

ICASSP2023 Shutong Niu, Jun Du, Qing Wang 0008, Li Chai 0002, Huaxin Wu, Zhaoxu Nian, Lei Sun 0010, Yi Fang, Jia Pan, Chin-Hui Lee 0001, 
An Experimental Study on Sound Event Localization and Detection Under Realistic Testing Conditions.

ICASSP2023 Ruoyu Wang 0029, Jun Du, Tian Gao, 
Quantum Transfer Learning Using the Large-Scale Unsupervised Pre-Trained Model Wavlm-Large for Synthetic Speech Detection.

ICASSP2023 Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition.

ICASSP2023 Chenyue Zhang, Hang Chen, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 0001, 
Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement.

Interspeech2023 Zilu Guo, Jun Du, Chin-Hui Lee 0001, Yu Gao, Wenbin Zhang, 
Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement.

Interspeech2023 Shutong Niu, Jun Du, Maokui He, Chin-Hui Lee 0001, Baoxiang Li, Jiakui Li, 
Unsupervised Adaptation with Quality-Aware Masking to Improve Target-Speaker Voice Activity Detection for Speaker Diarization.

#14  | Tara N. Sainath | DBLP Google Scholar  
By venueICASSP: 42Interspeech: 33TASLP: 1NAACL: 1ICLR: 1
By year2024: 92023: 202022: 162021: 112020: 82019: 112018: 3
ISCA sessionsspeech recognition: 3asr technologies and systems: 2asr: 2multi-, cross-lingual and other topics in asr: 2cross-lingual and multilingual asr: 2asr neural network architectures: 2analysis of speech and audio signals: 1feature modeling for asr: 1acoustic model adaptation for asr: 1search/decoding algorithms for asr: 1speech analysis: 1language modeling and lexical modeling for asr: 1speech representation: 1novel models and training methods for asr: 1resource-constrained asr: 1language and lexical modeling for asr: 1novel neural network architectures for asr: 1streaming for asr/rnn transducers: 1neural network training methods for asr: 1speech classification: 1lm adaptation, lexical units and punctuation: 1asr neural network architectures and training: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1end-to-end speech recognition: 1acoustic model adaptation: 1recurrent neural models for asr: 1
IEEE keywordsspeech recognition: 30decoding: 12recurrent neural nets: 9data models: 7computational modeling: 7adaptation models: 6end to end asr: 6speech coding: 6task analysis: 5transducers: 5error analysis: 5video on demand: 5conformer: 5natural language processing: 5automatic speech recognition: 4rnn t: 4vocabulary: 3context modeling: 3asr: 3multilingual: 3sequence to sequence: 3degradation: 2costs: 2computational efficiency: 2universal speech model: 2semisupervised learning: 2convolution: 2foundation model: 2buildings: 2computer architecture: 2transfer learning: 2production: 2predictive models: 2text analysis: 2two pass asr: 2rnnt: 2long form asr: 2latency: 2optimisation: 2phonetics: 2biasing: 2hidden markov models: 1end to end: 1tail: 1adapter finetuning: 1streaming multilingual asr: 1sparsity: 1topology: 1model pruning: 1model quantization: 1quantization (signal): 1dialect classifier: 1equity: 1us english: 1african american english: 1robustness: 1hardware: 1large language model: 1distance measurement: 1multilingual speech recognition: 1runtime efficiency: 1computational latency: 1large models: 1causal model: 1online asr: 1state space model: 1systematics: 1parameter efficient adaptation: 1tuning: 1acoustic beams: 1representations: 1modular: 1zero shot stitching: 1longform asr: 1fuses: 1tensors: 1weight sharing: 1machine learning: 1low rank decomposition: 1model compression: 1wearable computers: 1program processors: 1embedded speech recognition: 1segmentation: 1earth observing system: 1decoding algorithms: 1real time systems: 1signal processing algorithms: 1memory management: 1analytical models: 1domain adaptation: 1foundation models: 1frequency modulation: 1soft sensors: 1internal lm: 1text recognition: 1text injection: 1lattices: 1contextual biasing: 1network architecture: 1multitasking: 1capitalization: 1joint network: 1rnn transducer: 1pause prediction: 1and foundation speech models: 1model reprogramming: 1pre trained adaptation: 1benchmark testing: 
1focusing: 1cross lingual speech recognition: 1kernel: 1encoding: 1switches: 1utf 8 byte: 1unified modeling language: 1word piece: 1multilingual asr: 1joint training: 1contrastive learning: 1indexes: 1self supervised learning: 1linear programming: 1massive: 1lifelong learning: 1speaker recognition: 1fusion: 1gating: 1bilinear pooling: 1signal representation: 1cascaded encoders: 1second pass asr: 1mean square error methods: 1transformer: 1calibration: 1confidence: 1voice activity detection: 1attention based end to end models: 1echo state network: 1long form: 1echo: 1regression analysis: 1probability: 1endpointer: 1supervised learning: 1attention: 1sequence to sequence models: 1unsupervised learning: 1filtering theory: 1semi supervised training: 1mathematical model: 1pronunciation: 1las: 1spelling correction: 1attention models: 1language model: 1mobile handsets: 1end to end speech synthesis: 1speech synthesis: 1end to end speech recognition: 1
Most publications (all venues) at2023: 322022: 252019: 162018: 152021: 14


Recent publications

TASLP2024 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe 0001, 
End-to-End Speech Recognition: A Survey.

ICASSP2024 Junwen Bai, Bo Li 0028, Qiujia Li, Tara N. Sainath, Trevor Strohman, 
Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR.

ICASSP2024 Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li 0028, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal, 
USM-Lite: Quantization and Sparsity Aware Fine-Tuning for Speech Recognition with Universal Speech Models.

ICASSP2024 Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha, Dongseong Hwang, Tara N. Sainath, Françoise Beaufays, Pedro Moreno Mengibar, 
Improving Speech Recognition for African American English with Audio Classification.

ICASSP2024 W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang 0033, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study.

ICASSP2024 Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno 0001, 
Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models.

ICASSP2024 Haozhe Shan, Albert Gu, Zhong Meng, Weiran Wang, Krzysztof Choromanski, Tara N. Sainath
Augmenting Conformers With Structured State-Space Sequence Models For Online Speech Recognition.

ICASSP2024 Khe Chai Sim, Zhouyuan Huo, Tsendsuren Munkhdalai, Nikhil Siddhartha, Adam Stooke, Zhong Meng, Bo Li 0028, Tara N. Sainath
A Comparison of Parameter-Efficient ASR Domain Adaptation Methods for Universal Speech and Language Models.

NAACL2024 Weiran Wang, Rohit Prabhavalkar, Haozhe Shan, Zhong Meng, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li 0028, James Qin, Xingyu Cai, Adam Stooke, Chengjian Zheng, Yanzhang He, Tara N. Sainath, Pedro Moreno Mengibar, 
Massive End-to-end Speech Recognition Models with Time Reduction.

ICASSP2023 Rami Botros, Rohit Prabhavalkar, Johan Schalkwyk, Ciprian Chelba, Tara N. Sainath, Françoise Beaufays, 
Lego-Features: Exporting Modular Encoder Features for Streaming and Deliberation ASR.

ICASSP2023 Shuo-Yiin Chang, Chao Zhang 0031, Tara N. Sainath, Bo Li 0028, Trevor Strohman, 
Context-Aware end-to-end ASR Using Self-Attentive Embedding and Tensor Fusion.

ICASSP2023 Steven M. Hernandez, Ding Zhao, Shaojin Ding, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He, Ian McGraw, 
Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models.

ICASSP2023 Ke Hu, Tara N. Sainath, Bo Li 0028, Nan Du 0002, Yanping Huang, Andrew M. Dai, Yu Zhang 0033, Rodrigo Cabrera, Zhifeng Chen, Trevor Strohman, 
Massively Multilingual Shallow Fusion with Large Language Models.

ICASSP2023 W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman, 
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model.

ICASSP2023 Zhouyuan Huo, Khe Chai Sim, Bo Li 0028, Dongseong Hwang, Tara N. Sainath, Trevor Strohman, 
Resource-Efficient Transfer Learning from Speech Foundation Model Using Hierarchical Feature Fusion.

ICASSP2023 Bo Li 0028, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang 0033, Wei Han 0002, Trevor Strohman, Françoise Beaufays, 
Efficient Domain Adaptation for Speech Foundation Models.

ICASSP2023 Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang 0033, Bo Li 0028, Andrew Rosenberg, Bhuvana Ramabhadran, 
JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.

ICASSP2023 Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, W. Ronny Huang, Tara N. Sainath
A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale.

ICASSP2023 Tara N. Sainath, Rohit Prabhavalkar, Diamantino Caseiro, Pat Rondon, Cyril Allauzen, 
Improving Contextual Biasing with Text Injection.

ICASSP2023 Weiran Wang, Ding Zhao, Shaojin Ding, Hao Zhang 0010, Shuo-Yiin Chang, David Rybach, Tara N. Sainath, Yanzhang He, Ian McGraw, Shankar Kumar, 
Multi-Output RNN-T Joint Networks for Multi-Task Learning of ASR and Auxiliary Tasks.

#15  | Jianhua Tao 0001 | DBLP Google Scholar  
By venue: Interspeech: 42, ICASSP: 19, TASLP: 11, SpeechComm: 3, AAAI: 1, ICML: 1
By year: 2024: 5, 2023: 9, 2022: 8, 2021: 16, 2020: 20, 2019: 12, 2018: 7
ISCA sessions: speech emotion recognition: 4; speech synthesis: 4; voice conversion and adaptation: 3; speech coding and privacy: 2; topics in asr: 2; statistical parametric speech synthesis: 2; speech coding and enhancement: 1; speaker and language identification: 1; paralinguistics: 1; asr: 1; health and affect: 1; privacy-preserving machine learning for audio & speech processing: 1; search/decoding techniques and confidence measures for asr: 1; computational resource constrained speech recognition: 1; multi-channel audio and emotion recognition: 1; speech enhancement: 1; speech in multimodality: 1; asr neural network architectures: 1; speech in health: 1; sequence-to-sequence speech recognition: 1; spoken term detection, confidence measure, and end-to-end speech recognition: 1; speech and audio source separation and scene analysis: 1; emotion and personality in conversation: 1; audio signal characterization: 1; speech and voice disorders: 1; nn architectures for asr: 1; speech synthesis paradigms and methods: 1; emotion recognition and analysis: 1; deep enhancement: 1; source separation and spatial analysis: 1; prosody modeling and generation: 1
IEEE keywords: speech recognition: 14; speech synthesis: 12; natural language processing: 6; end to end: 6; speech enhancement: 5; speaker recognition: 5; predictive models: 4; transfer learning: 4; error analysis: 3; speech coding: 3; signal processing algorithms: 3; text analysis: 3; attention: 3; decoding: 3; emotion recognition: 3; noise robustness: 2; text to speech: 2; adversarial training: 2; filtering theory: 2; text based speech editing: 2; text editing: 2; recurrent neural nets: 2; optimisation: 2; end to end model: 2; autoregressive processes: 2; multimodal fusion: 2; self attention: 2; transformer: 2; speaker adaptation: 2; low resource: 2; synthetic speech detection: 1; interactive fusion: 1; noise measurement: 1; data models: 1; knowledge distillation: 1; noise: 1; noise robust: 1; fewer tokens: 1; language model: 1; speech codecs: 1; speech codec: 1; time invariant: 1; codes: 1; asvspoof: 1; multiscale permutation entropy: 1; nonlinear dynamics: 1; deepfakes: 1; power spectral entropy: 1; entropy: 1; audio deepfake detection: 1; splicing: 1; tail: 1; supervised learning: 1; partial label learning: 1; benchmark testing: 1; imbalanced learning: 1; pseudo label: 1; phase locked loops: 1; costs: 1; prosodic boundaries: 1; computational modeling: 1; multi task learning: 1; tagging: 1; multi modal embeddings: 1; bit error rate: 1; linguistics: 1; speaker dependent weighting: 1; direction of arrival estimation: 1; target speaker localization: 1; generalized cross correlation: 1; transforms: 1; location awareness: 1; controllability: 1; oral communication: 1; conversational tts: 1; multi modal: 1; semiconductor device modeling: 1; multi grained: 1; prosody: 1; waveform generators: 1; vocoders: 1; deterministic plus stochastic: 1; multiband excitation: 1; noise control: 1; vocoder: 1; stochastic processes: 1; one shot learning: 1; coarse to fine decoding: 1; mask prediction: 1; covid 19: 1; diseases: 1; digital health: 1; microorganisms: 1; regression analysis: 1; deep learning (artificial intelligence): 1; depression: 1; behavioural sciences computing: 1; global information embedding: 1; lstm: 1; mask and prediction: 1; fast: 1; bert: 1; non autoregressive: 1; cross modal: 1; teacher student learning: 1; language modeling: 1; gated recurrent fusion: 1; robust end to end speech recognition: 1; speech transformer: 1; speech distortion: 1; glottal source: 1; arx lf model: 1; iterative methods: 1; vocal tract: 1; signal denoising: 1; inverse problems: 1; source filter model: 1; speaker sensitive modeling: 1; conversational emotion recognition: 1; conversational transformer network (ctnet): 1; context sensitive modeling: 1; signal classification: 1; decoupled transformer: 1; automatic speech recognition: 1; code switching: 1; bi level decoupling: 1; prosody modeling: 1; speaking style modeling: 1; personalized speech synthesis: 1; speech emotion recognition: 1; cross attention: 1; few shot speaker adaptation: 1; the m2voc challenge: 1; prosody and voice factorization: 1; sequence to sequence: 1; robustness: 1; phoneme level autoregression: 1; clustering algorithms: 1; spectrogram: 1; end to end post filter: 1; deep clustering: 1; permutation invariant training: 1; deep attention fusion features: 1; speech separation: 1; interference: 1; prosody transfer: 1; audio signal processing: 1; optimization strategy: 1; multi head attention: 1; audio visual systems: 1; model level fusion: 1; image fusion: 1; video signal processing: 1; continuous emotion recognition: 1; forward backward algorithm: 1; synchronous transformer: 1; online speech recognition: 1; encoding: 1; asynchronous problem: 1; chunk by chunk: 1; cross lingual: 1; phoneme representation: 1; matrix decomposition: 1; speaker embedding: 1; word embedding: 1; punctuation prediction: 1; speech embedding: 1; adversarial: 1; language invariant: 1
Most publications (all venues) at: 2024: 58, 2023: 45, 2021: 43, 2022: 36, 2020: 36

Affiliations
Tsinghua University, Department of Automation, Beijing, China
University of Chinese Academy of Sciences, School of Artificial Intelligence, Beijing, China
Tsinghua University, Beijing, China (PhD 2001)

Recent publications

TASLP2024 Cunhang Fan, Mingming Ding, Jianhua Tao 0001, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv, 
Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection.

ICASSP2024 Yong Ren, Tao Wang 0074, Jiangyan Yi, Le Xu, Jianhua Tao 0001, Chu Yuan Zhang, Junzuo Zhou, 
Fewer-Token Neural Speech Codec with Time-Invariant Codes.

ICASSP2024 Chenglong Wang, Jiayi He, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Xiaohui Zhang 0006, 
Multi-Scale Permutation Entropy for Audio Deepfake Detection.

ICASSP2024 Mingyu Xu, Zheng Lian, Bin Liu 0041, Zerui Chen, Jianhua Tao 0001
Pseudo Labels Regularization for Imbalanced Partial-Label Learning.

AAAI2024 Xiaohui Zhang 0006, Jiangyan Yi, Chenglong Wang, Chu Yuan Zhang, Siding Zeng, Jianhua Tao 0001
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection.

SpeechComm2023 Jiangyan Yi, Jianhua Tao 0001, Ye Bai, Zhengkun Tian, Cunhang Fan, 
Transfer knowledge for punctuation prediction via adversarial training.

TASLP2023 Jiangyan Yi, Jianhua Tao 0001, Ruibo Fu, Tao Wang 0074, Chu Yuan Zhang, Chenglong Wang, 
Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings.

ICASSP2023 Guanjun Li, Wei Xue, Wenju Liu, Jiangyan Yi, Jianhua Tao 0001
GCC-Speaker: Target Speaker Localization with Optimal Speaker-Dependent Weighting in Multi-Speaker Scenarios.

ICASSP2023 Jinlong Xue, Yayue Deng, Fengping Wang, Ya Li, Yingming Gao, Jianhua Tao 0001, Jianqing Sun, Jiaen Liang, 
M2-CTTS: End-to-End Multi-Scale Multi-Modal Conversational Text-to-Speech Synthesis.

Interspeech2023 Haiyang Sun, Zheng Lian, Bin Liu 0041, Ying Li, Jianhua Tao 0001, Licai Sun, Cong Cai, Meng Wang, Yuan Cheng, 
EmotionNAS: Two-stream Neural Architecture Search for Speech Emotion Recognition.

Interspeech2023 Chenglong Wang, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Shuai Zhang 0014, Xun Chen, 
Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features.

Interspeech2023 Chenglong Wang, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Shuai Zhang 0014, Ruibo Fu, Xun Chen, 
TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection.

Interspeech2023 Ruiteng Zhang, Jianguo Wei, Xugang Lu, Yongwei Li, Junhai Xu, Di Jin 0001, Jianhua Tao 0001
SOT: Self-supervised Learning-Assisted Optimal Transport for Unsupervised Adaptive Speech Emotion Recognition.

ICML2023 Xiaohui Zhang 0006, Jiangyan Yi, Jianhua Tao 0001, Chenglong Wang, Chu Yuan Zhang, 
Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection.

SpeechComm2022 Wenhuan Lu, Xinyue Zhao, Na Guo, Yongwei Li, Jianguo Wei, Jianhua Tao 0001, Jianwu Dang 0001, 
One-shot emotional voice conversion based on feature separation.

TASLP2022 Tao Wang 0074, Ruibo Fu, Jiangyan Yi, Jianhua Tao 0001, Zhengqi Wen, 
NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation for Noise-Controllable Waveform Generation.

TASLP2022 Tao Wang 0074, Jiangyan Yi, Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, 
CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing.

ICASSP2022 Cong Cai, Bin Liu 0041, Jianhua Tao 0001, Zhengkun Tian, Jiahao Lu, Kexin Wang, 
End-to-End Network Based on Transformer for Automatic Detection of Covid-19.

ICASSP2022 Ya Li, Mingyue Niu, Ziping Zhao 0001, Jianhua Tao 0001
Automatic Depression Level Assessment from Speech By Long-Term Global Information Embedding.

ICASSP2022 Tao Wang 0074, Jiangyan Yi, Liqun Deng, Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, 
Context-Aware Mask Prediction Network for End-to-End Text-Based Speech Editing.

#16  | Longbiao Wang | DBLP Google Scholar  
By venue: Interspeech: 38, ICASSP: 30, TASLP: 5, SpeechComm: 3
By year: 2024: 8, 2023: 17, 2022: 21, 2021: 13, 2020: 12, 2019: 2, 2018: 3
ISCA sessions: analysis of speech and audio signals: 3; spatial audio: 3; speech synthesis: 3; asr: 2; emotion and sentiment analysis: 2; dnn architectures for speaker recognition: 2; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; multimodal speech emotion recognition: 1; paralinguistics: 1; biosignal-enabled spoken communication: 1; speech quality assessment: 1; speech representation: 1; zero, low-resource and multi-modal speech recognition: 1; dereverberation, noise reduction, and speaker extraction: 1; spoken dialogue systems and multimodality: 1; spoken language processing: 1; spoken dialogue systems: 1; robust speaker recognition: 1; targeted source separation: 1; speech and voice disorders: 1; speech emotion recognition: 1; single-channel speech enhancement: 1; voice and hearing disorders: 1; learning techniques for speaker recognition: 1; speech enhancement: 1; adaptation and accommodation in conversation: 1; robust speech recognition: 1; spoofing detection: 1; cognition and brain studies: 1
IEEE keywords: speech recognition: 14; speech synthesis: 6; speaker verification: 6; emotion recognition: 6; representation learning: 5; speech emotion recognition: 5; speaker recognition: 5; decoding: 4; meta learning: 4; natural language processing: 4; predictive models: 3; task analysis: 3; transformers: 3; speech enhancement: 3; training data: 2; data models: 2; spectrogram: 2; redundancy: 2; ctap: 2; contrastive learning: 2; minimal supervision: 2; self supervised learning: 2; semantics: 2; acoustic distortion: 2; automatic speech recognition: 2; visualization: 2; degradation: 2; transformer: 2; convolution: 2; noise measurement: 2; time frequency analysis: 2; time domain: 2; domain adaptation: 2; pattern classification: 2; speaker extraction: 2; speaker embedding: 2; reverberation: 2; naturalness: 2; convolutional neural nets: 2; interactive systems: 2; image representation: 2; capsule networks: 2; multilingual: 1; text to speech: 1; self supervised representations: 1; zero shot: 1; low resource: 1; text to speech (tts): 1; pre training: 1; agglutinative: 1; language modeling: 1; linguistics: 1; morphology: 1; prompt learning: 1; syntactics: 1; natural language understanding: 1; hierarchical multi task learning: 1; hidden markov models: 1; labeling: 1; cross domain slot filling: 1; filling: 1; pipelines: 1; vc: 1; text recognition: 1; explosions: 1; tts: 1; asr: 1; diffusion model: 1; controllability: 1; semantic coding: 1; substitution: 1; speech anti spoofing: 1; concatenation: 1; blending strategies: 1; data augmentation: 1; refining: 1; adaptation: 1; dysarthria: 1; program processors: 1; adaptation models: 1; meta generalized speaker verification: 1; performance evaluation: 1; optimization: 1; domain mismatch: 1; recording: 1; upper bound: 1; audio visual data: 1; co teaching+: 1; vae: 1; fast: 1; complexity theory: 1; knowledge distillation: 1; lightweight: 1; local global: 1; positional encoding: 1; natural languages: 1; encoding: 1; focusing: 1; anti spoofing: 1; learning systems: 1; biometrics (access control): 1; production: 1; lip biometrics: 1; visual speech: 1; cross modal: 1; correlation: 1; lips: 1; co learning: 1; joint training: 1; robust speech recognition: 1; residual noise: 1; speech distortion: 1; robustness: 1; refine network: 1; fuses: 1; multiresolution spectrograms: 1; time domain analysis: 1; noise robustness: 1; disentangled representation learning: 1; metric learning: 1; extraterrestrial measurements: 1; momentum augmentation: 1; multimodal fusion: 1; proposals: 1; dense video captioning: 1; center loss: 1; direction of arrival estimation: 1; beamforming: 1; doa estimation: 1; speaker localizer: 1; array signal processing: 1; mutual information: 1; content: 1; multiple references: 1; audio signal processing: 1; style: 1; feature distillation: 1; task driven loss: 1; model compression: 1; double constrained: 1; utterance level representation: 1; graph theory: 1; atmosphere: 1; dialogue level contextual information: 1; recurrent neural nets: 1; signal representation: 1; signal classification: 1; expressive speech synthesis: 1; style modeling: 1; style disentanglement: 1; multilayer perceptrons: 1; domain invariant: 1; meta generalized transformation: 1; query processing: 1; knowledge based systems: 1; knowledge retrieval: 1; dialogue system: 1; natural language generation: 1; multi head attention: 1; signal fusion: 1; multi stage: 1; pitch prediction: 1; pitch control: 1; speech coding: 1; speech codecs: 1; image recognition: 1; spectro temporal attention: 1; channel attention: 1; auditory encoder: 1; hearing: 1; convolutional neural network: 1; voice activity detection: 1; ear: 1; sensor fusion: 1; graph convolutional: 1; vgg 16: 1; image fusion: 1; multimodal emotion recognition: 1; optimisation: 1; meta speaker embedding network: 1; cross channel: 1; end to end model: 1; dysarthric speech recognition: 1; medical signal processing: 1; articulatory attribute detection: 1; multi view: 1; time frequency: 1; self attention: 1; multi target learning: 1; speech dereverberation: 1; two stage: 1; spectrograms fusion: 1; acoustic and lexical context information: 1; speech based user interfaces: 1; mandarin dialog act recognition: 1; hierarchical model.: 1
Most publications (all venues) at: 2022: 37, 2021: 33, 2023: 28, 2020: 22, 2019: 19

Affiliations
Nagaoka University of Technology

Recent publications

SpeechComm2024 Yuqin Lin, Jianwu Dang 0001, Longbiao Wang, Sheng Li 0010, Chenchen Ding, 
Disordered speech recognition considering low resources and abnormal articulation.

SpeechComm2024 Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li 0010, Jianwu Dang 0001, 
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network.

TASLP2024 Cheng Gong, Xin Wang 0037, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang 0001, Korin Richmond, Junichi Yamagishi, 
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations.

TASLP2024 Rui Liu 0008, Yifan Hu, Haolin Zuo, Zhaojie Luo, Longbiao Wang, Guanglai Gao, 
Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training.

TASLP2024 Xiao Wei, Yuhang Li, Yuke Si, Longbiao Wang, Xiaobao Wang, Jianwu Dang 0001, 
A Prompt-Based Hierarchical Pipeline for Cross-Domain Slot Filling.

ICASSP2024 Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang 0074, Longbiao Wang, Jianwu Dang 0001, 
Learning Speech Representation from Contrastive Token-Acoustic Pretraining.

ICASSP2024 Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang 0001, 
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models.

ICASSP2024 Linjuan Zhang, Kong Aik Lee, Lin Zhang, Longbiao Wang, Baoning Niu, 
CPAUG: Refining Copy-Paste Augmentation for Speech Anti-Spoofing.

TASLP2023 Yuqin Lin, Longbiao Wang, Yanbing Yang, Jianwu Dang 0001, 
CFDRN: A Cognition-Inspired Feature Decomposition and Recombination Network for Dysarthric Speech Recognition.

TASLP2023 Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, Helen Meng, 
Meta-Generalization for Domain-Invariant Speaker Verification.

ICASSP2023 Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, 
Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning.

ICASSP2023 Yuhao Liu, Cheng Gong, Longbiao Wang, Xixin Wu, Qiuyu Liu, Jianwu Dang 0001, 
VF-Taco2: Towards Fast and Lightweight Synthesis for Autoregressive Models with Variation Autoencoder and Feature Distillation.

ICASSP2023 Xiaohui Liu, Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Jianwu Dang 0001, 
Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection.

ICASSP2023 Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang 0001, 
Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification.

ICASSP2023 Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang 0001, Xiaobao Wang, Shiliang Zhang, 
Speech and Noise Dual-Stream Spectrogram Refine Network With Speech Distortion Loss For Robust Speech Recognition.

ICASSP2023 Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang 0001, Tatsuya Kawahara, 
Time-Domain Speech Enhancement Assisted by Multi-Resolution Frequency Encoder and Decoder.

ICASSP2023 Yao Sun, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, 
Noise-Disentanglement Metric Learning for Robust Speaker Verification.

ICASSP2023 Yiwei Wei, Shaozu Yuan, Meng Chen 0006, Longbiao Wang
Enhancing Multimodal Alignment with Momentum Augmentation for Dense Video Captioning.

Interspeech2023 Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang 0001, Chengyun Deng, Fei Wang, 
Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation.

Interspeech2023 Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang 0001, Shiliang Zhang, 
Rethinking the Visual Cues in Audio-Visual Speaker Extraction.

#17  | Jianwu Dang 0001 | DBLP Google Scholar  
By venue: Interspeech: 38, ICASSP: 28, SpeechComm: 4, TASLP: 4
By year: 2024: 6, 2023: 15, 2022: 21, 2021: 12, 2020: 14, 2019: 3, 2018: 3
ISCA sessions: analysis of speech and audio signals: 3; spatial audio: 3; speech synthesis: 2; asr: 2; emotion and sentiment analysis: 2; learning techniques for speaker recognition: 2; speech processing in the brain: 2; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; multimodal speech emotion recognition: 1; speech quality assessment: 1; speech representation: 1; zero, low-resource and multi-modal speech recognition: 1; dereverberation, noise reduction, and speaker extraction: 1; spoken dialogue systems and multimodality: 1; spoken language processing: 1; spoken dialogue systems: 1; robust speaker recognition: 1; targeted source separation: 1; speech and voice disorders: 1; speech emotion recognition: 1; conversational systems: 1; single-channel speech enhancement: 1; voice and hearing disorders: 1; acoustic phonetics: 1; speech enhancement: 1; adaptation and accommodation in conversation: 1; robust speech recognition: 1; spoofing detection: 1; cognition and brain studies: 1
IEEE keywords: speech recognition: 14; speaker verification: 6; emotion recognition: 6; speech synthesis: 5; representation learning: 5; speech emotion recognition: 5; speaker recognition: 5; meta learning: 4; natural language processing: 4; decoding: 3; predictive models: 3; task analysis: 3; spectrogram: 2; redundancy: 2; ctap: 2; minimal supervision: 2; text recognition: 2; self supervised learning: 2; semantics: 2; acoustic distortion: 2; visualization: 2; degradation: 2; transformer: 2; transformers: 2; convolution: 2; speech enhancement: 2; noise measurement: 2; time frequency analysis: 2; time domain: 2; domain adaptation: 2; pattern classification: 2; speaker extraction: 2; speaker embedding: 2; reverberation: 2; naturalness: 2; convolutional neural nets: 2; interactive systems: 2; image representation: 2; capsule networks: 2; training data: 1; multilingual: 1; text to speech: 1; self supervised representations: 1; data models: 1; zero shot: 1; low resource: 1; prompt learning: 1; syntactics: 1; natural language understanding: 1; hierarchical multi task learning: 1; hidden markov models: 1; labeling: 1; cross domain slot filling: 1; filling: 1; pipelines: 1; vc: 1; contrastive learning: 1; explosions: 1; tts: 1; asr: 1; diffusion model: 1; controllability: 1; semantic coding: 1; adaptation: 1; automatic speech recognition: 1; dysarthria: 1; program processors: 1; adaptation models: 1; meta generalized speaker verification: 1; performance evaluation: 1; optimization: 1; domain mismatch: 1; recording: 1; upper bound: 1; audio visual data: 1; co teaching+: 1; intent understanding: 1; oral communication: 1; paralinguistic information: 1; brain network features: 1; eeg: 1; human computer interaction: 1; perturbation methods: 1; linguistics: 1; brain: 1; vae: 1; fast: 1; complexity theory: 1; knowledge distillation: 1; lightweight: 1; local global: 1; positional encoding: 1; natural languages: 1; encoding: 1; focusing: 1; anti spoofing: 1; learning systems: 1; biometrics (access control): 1; production: 1; lip biometrics: 1; visual speech: 1; cross modal: 1; correlation: 1; lips: 1; co learning: 1; joint training: 1; robust speech recognition: 1; residual noise: 1; speech distortion: 1; robustness: 1; refine network: 1; fuses: 1; multiresolution spectrograms: 1; time domain analysis: 1; noise robustness: 1; disentangled representation learning: 1; metric learning: 1; extraterrestrial measurements: 1; center loss: 1; direction of arrival estimation: 1; beamforming: 1; doa estimation: 1; speaker localizer: 1; array signal processing: 1; mutual information: 1; content: 1; multiple references: 1; audio signal processing: 1; style: 1; feature distillation: 1; task driven loss: 1; model compression: 1; double constrained: 1; utterance level representation: 1; graph theory: 1; atmosphere: 1; dialogue level contextual information: 1; recurrent neural nets: 1; signal representation: 1; signal classification: 1; multilayer perceptrons: 1; domain invariant: 1; meta generalized transformation: 1; query processing: 1; knowledge based systems: 1; knowledge retrieval: 1; dialogue system: 1; natural language generation: 1; multi head attention: 1; signal fusion: 1; multi stage: 1; pitch prediction: 1; pitch control: 1; speech coding: 1; speech codecs: 1; image recognition: 1; spectro temporal attention: 1; channel attention: 1; auditory encoder: 1; hearing: 1; convolutional neural network: 1; voice activity detection: 1; ear: 1; sensor fusion: 1; graph convolutional: 1; vgg 16: 1; image fusion: 1; multimodal emotion recognition: 1; optimisation: 1; meta speaker embedding network: 1; cross channel: 1; end to end model: 1; dysarthric speech recognition: 1; medical signal processing: 1; articulatory attribute detection: 1; multi view: 1; time frequency: 1; self attention: 1; multi target learning: 1; speech dereverberation: 1; two stage: 1; spectrograms fusion: 1; acoustic and lexical context information: 1; speech based user interfaces: 1; mandarin dialog act recognition: 1; hierarchical model.: 1
Most publications (all venues) at: 2022: 42, 2021: 39, 2020: 29, 2019: 29, 2016: 25

Affiliations
Tianjin University, Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, China
Institut de la Communication Parlée (ICP), Centre National de la Recherche Scientifique (CNRS), France (2002-2003)
Japan Advanced Institute of Science and Technology, JAIST, Japan
Shizuoka University, Japan (PhD 1992)

Recent publications

SpeechComm2024 Yuqin Lin, Jianwu Dang 0001, Longbiao Wang, Sheng Li 0010, Chenchen Ding, 
Disordered speech recognition considering low resources and abnormal articulation.

SpeechComm2024 Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li 0010, Jianwu Dang 0001
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network.

TASLP2024 Cheng Gong, Xin Wang 0037, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang 0001, Korin Richmond, Junichi Yamagishi, 
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations.

TASLP2024 Xiao Wei, Yuhang Li, Yuke Si, Longbiao Wang, Xiaobao Wang, Jianwu Dang 0001
A Prompt-Based Hierarchical Pipeline for Cross-Domain Slot Filling.

ICASSP2024 Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang 0074, Longbiao Wang, Jianwu Dang 0001
Learning Speech Representation from Contrastive Token-Acoustic Pretraining.

ICASSP2024 Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang 0001
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models.

TASLP2023 Yuqin Lin, Longbiao Wang, Yanbing Yang, Jianwu Dang 0001
CFDRN: A Cognition-Inspired Feature Decomposition and Recombination Network for Dysarthric Speech Recognition.

TASLP2023 Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, Helen Meng, 
Meta-Generalization for Domain-Invariant Speaker Verification.

ICASSP2023 Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001
Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning.

ICASSP2023 Zhongjie Li, Bin Zhao, Gaoyan Zhang, Jianwu Dang 0001
Brain Network Features Differentiate Intentions from Different Emotional Expressions of the Same Text.

ICASSP2023 Yuhao Liu, Cheng Gong, Longbiao Wang, Xixin Wu, Qiuyu Liu, Jianwu Dang 0001
VF-Taco2: Towards Fast and Lightweight Synthesis for Autoregressive Models with Variation Autoencoder and Feature Distillation.

ICASSP2023 Xiaohui Liu, Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Jianwu Dang 0001
Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection.

ICASSP2023 Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang 0001
Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification.

ICASSP2023 Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang 0001, Xiaobao Wang, Shiliang Zhang, 
Speech and Noise Dual-Stream Spectrogram Refine Network With Speech Distortion Loss For Robust Speech Recognition.

ICASSP2023 Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang 0001, Tatsuya Kawahara, 
Time-Domain Speech Enhancement Assisted by Multi-Resolution Frequency Encoder and Decoder.

ICASSP2023 Yao Sun, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001
Noise-Disentanglement Metric Learning for Robust Speaker Verification.

Interspeech2023 Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang 0001, Chengyun Deng, Fei Wang, 
Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation.

Interspeech2023 Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang 0001, Shiliang Zhang, 
Rethinking the Visual Cues in Audio-Visual Speaker Extraction.

Interspeech2023 Yuhang Li, Xiao Wei, Yuke Si, Longbiao Wang, Xiaobao Wang, Jianwu Dang 0001
Improving Zero-shot Cross-domain Slot Filling via Transformer-based Slot Semantics Fusion.

Interspeech2023 Zhongjie Li, Gaoyan Zhang, Longbiao Wang, Jianwu Dang 0001
Discrimination of the Different Intents Carried by the Same Text Through Integrating Multimodal Information.

#18  | Chng Eng Siong | DBLP Google Scholar  
By venue: Interspeech: 32, ICASSP: 30, TASLP: 3, ICLR: 2, ACL: 2, NeurIPS: 1, AAAI: 1, IJCAI: 1, SpeechComm: 1, EMNLP: 1
By year: 2024: 10, 2023: 24, 2022: 13, 2021: 7, 2020: 10, 2019: 6, 2018: 4
ISCA sessions: analysis of speech and audio signals: 4; speaker and language identification: 2; speech recognition: 2; speech enhancement, bandwidth extension and hearing aids: 2; asr neural network architectures: 2; cross-lingual and multilingual asr: 2; end-to-end asr: 1; self-supervised learning in asr: 1; acoustic signal representation and analysis: 1; robust asr, and far-field/multi-talker asr: 1; multimodal speech emotion recognition and paralinguistics: 1; speech segmentation: 1; speech type classification and diagnosis: 1; language and accent recognition: 1; targeted source separation: 1; bi- and multilinguality: 1; acoustic model adaptation for asr: 1; lexicon and language model for speech recognition: 1; speaker and language recognition: 1; neural waveform generation: 1; speech technologies for code-switching in multilingual communities: 1; language modeling: 1; show and tell: 1; source separation from monaural input: 1
IEEE keywords: speech recognition: 14; speech enhancement: 11; noise measurement: 6; speaker recognition: 5; noise robustness: 4; task analysis: 4; contrastive learning: 4; speaker extraction: 4; automatic speech recognition: 3; self supervised learning: 3; error analysis: 3; representation learning: 3; adaptation models: 3; multi task learning: 3; signal processing algorithms: 3; speaker embedding: 3; speech coding: 2; convolution: 2; time domain analysis: 2; robustness: 2; background noise: 2; transformer: 2; transformers: 2; speech separation: 2; modulation: 2; uncertainty: 2; training data: 2; generative adversarial network: 2; multitasking: 2; noise robust speech recognition: 2; benchmark testing: 2; codes: 2; data augmentation: 2; keyword spotting: 2; entropy: 2; natural language processing: 2; information retrieval: 2; sensor fusion: 2; speech emotion recognition: 2; emotion recognition: 2; time domain: 2; signal reconstruction: 2; discrete codebook: 1; speech distortion: 1; code predictor: 1; distortion: 1; information interaction: 1; dual branch: 1; estimation: 1; spectrogram: 1; diagonal version of structured state space sequence (s4d) model: 1; online diarization: 1; filtering: 1; low latency communication: 1; spatial dictionary: 1; multi channel: 1; latency: 1; prompt tuning: 1; computational modeling: 1; zero shot learning: 1; explainable prompt: 1; automatic speaker verification: 1; label level knowledge distillation: 1; knowledge engineering: 1; knowledge distillation: 1; data mining: 1; attentive pooling: 1; feature modulation: 1; noisy speech separation: 1; decoding: 1; deepfake detection: 1; deepfakes: 1; representation regularization: 1; audio visual fusion: 1; measurement: 1; diffusion probabilistic model: 1; reinforcement learning: 1; generative adversarial networks: 1; unsupervised domain adaptation: 1; supervised learning: 1; gradient remedy: 1; interference: 1; gradient interference: 1; noise robust speech separation: 1; gradient modulation: 1; end to end network: 1; unify speech enhancement and separation: 1; disentangling representations: 1; noise robust automatic speech recognition: 1; visualization: 1; boosting: 1; performance evaluation: 1; low resource: 1; datamaps: 1; data models: 1; mixup: 1; xlsr: 1; language identification: 1; online speaker clustering: 1; clustering algorithms: 1; calibration: 1; speaker verification: 1; probabilistic logic: 1; multi modal: 1; representation: 1; linguistics: 1; linear programming: 1; bidirectional attention: 1; end to end: 1; forced alignment: 1; learning systems: 1; optimisation: 1; reinforcement leaning: 1; direction of arrival estimation: 1; beamforming: 1; doa estimation: 1; speaker localizer: 1; reverberation: 1; array signal processing: 1; joint training approach: 1; over suppression phenomenon: 1; interactive feature fusion: 1; noisy far field: 1; small footprint: 1; minimum word error: 1; autoregressive processes: 1; code switching: 1; non autoregressive: 1; asr: 1; gaussian processes: 1; dialogue relation extraction: 1; interactive systems: 1; pattern classification: 1; text analysis: 1; multi relations: 1; bert: 1; co attention mechanism: 1; convolutional neural nets: 1; multimodal fusion: 1; recurrent neural nets: 1; audio signal processing: 1; multi level acoustic information: 1; signal fusion: 1; multi stage: 1; image recognition: 1; spectro temporal attention: 1; channel attention: 1; disentangled feature learning: 1; signal denoising: 1; adversarial training: 1; signal representation: 1; online speech recognition: 1; early endpointing: 1; scalegrad: 1; analytical models: 1; depth wise separable convolution: 1; multi scale: 1; multi scale fusion: 1; speech bandwidth extension: 1; signal restoration: 1; low resource asr: 1; pre training: 1; catastrophic forgetting.: 1; independent language model: 1; fine tuning: 1; spectrum approximation loss: 1; source separation: 1
Most publications (all venues) at: 2023: 42, 2016: 27, 2015: 27, 2022: 26, 2024: 25


Recent publications

TASLP2024 Yuchen Hu, Chen Chen 0075, Qiushi Zhu, Eng Siong Chng
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR.

TASLP2024 Linhui Sun, Shuo Yuan, Aifei Gong, Lei Ye, Eng Siong Chng
Dual-Branch Modeling Based on State-Space Model for Speech Enhancement.

ICASSP2024 Weiguang Chen, Tran The Anh, Xionghu Zhong, Eng Siong Chng
Enhancing Low-Latency Speaker Diarization with Spatial Dictionary Learning.

ICASSP2024 Dianwen Ng, Chong Zhang 0003, Ruixi Zhang, Yukun Ma, Fabian Ritter Gutierrez, Trung Hieu Nguyen 0001, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma 0001, 
Are Soft Prompts Good Zero-Shot Learners for Speech Recognition?

ICASSP2024 Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng
Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification.

ICASSP2024 Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang 0003, Hao Wang 0199, Trung Hieu Nguyen 0001, Kun Zhou 0003, Dianwen Ng, Eng Siong Chng, Bin Ma 0001, 
SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance.

ICASSP2024 Zizheng Zhang, Chen Chen 0075, Hsin-Hung Chen, Xiang Liu, Yuchen Hu, Eng Siong Chng
Noise-Aware Speech Separation with Contrastive Learning.

ICASSP2024 Heqing Zou, Meng Shen 0002, Yuchen Hu, Chen Chen 0075, Eng Siong Chng, Deepu Rajan, 
Cross-Modality and Within-Modality Regularization for Audio-Visual Deepfake Detection.

ICLR2024 Chen Chen 0075, Ruizhe Li 0001, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Engsiong Chng, Chao-Han Huck Yang, 
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition.

ICLR2024 Yuchen Hu, Chen Chen 0075, Chao-Han Huck Yang, Ruizhe Li 0001, Chao Zhang 0031, Pin-Yu Chen, Engsiong Chng
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition.

ICASSP2023 Chen Chen 0075, Yuchen Hu, Weiwei Weng, Eng Siong Chng
Metric-Oriented Speech Enhancement Using Diffusion Probabilistic Model.

ICASSP2023 Chen Chen 0075, Yuchen Hu, Heqing Zou, Linhui Sun, Eng Siong Chng
Unsupervised Noise Adaptation Using Data Simulation.

ICASSP2023 Yuchen Hu, Chen Chen 0075, Ruizhe Li 0001, Qiushi Zhu, Eng Siong Chng
Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition.

ICASSP2023 Yuchen Hu, Chen Chen 0075, Heqing Zou, Xionghu Zhong, Eng Siong Chng
Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation.

ICASSP2023 Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, Jinjie Ni, Chong Zhang 0003, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma 0001, 
De'hubert: Disentangling Noise in a Self-Supervised Model for Robust Speech Recognition.

ICASSP2023 Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang 0003, Yukun Ma, Trung Hieu Nguyen 0001, Chongjia Ni, Eng Siong Chng, Bin Ma 0001, 
Contrastive Speech Mixup for Low-Resource Keyword Spotting.

ICASSP2023 Shangeth Rajaa, Kriti Anandan, Swaraj Dalmia, Tarun Gupta, Eng Siong Chng
Improving Spoken Language Identification with Map-Mix.

ICASSP2023 Alexey Sholokhov, Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng
Probabilistic Back-ends for Online Speaker Recognition and Clustering.

ICASSP2023 Yuhang Yang, Haihua Xu, Hao Huang 0009, Eng Siong Chng, Sheng Li 0010, 
Speech-Text Based Multi-Modal Training with Bidirectional Attention for Improved Speech Recognition.

Interspeech2023 Chen Chen 0075, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng
A Neural State-Space Modeling Approach to Efficient Speech Separation.

#19  | Kai Yu 0004 | DBLP Google Scholar  
By venue: ICASSP: 33; Interspeech: 22; TASLP: 14; AAAI: 1; SpeechComm: 1; EMNLP: 1
By year: 2024: 11; 2023: 12; 2022: 13; 2021: 6; 2020: 17; 2019: 11; 2018: 2
ISCA sessions: speech synthesis: 4; speaker recognition: 2; speech coding: 1; speech recognition: 1; automatic audio classification and audio captioning: 1; pathological speech analysis: 1; single-channel speech enhancement: 1; speaker embedding and diarization: 1; language and lexical modeling for asr: 1; voice activity detection and keyword spotting: 1; phonetic event detection and segmentation: 1; spoken language understanding: 1; anti-spoofing and liveness detection: 1; spoken term detection, confidence measure, and end-to-end speech recognition: 1; speaker recognition and anti-spoofing: 1; the 2019 automatic speaker verification spoofing and countermeasures challenge: 1; speaker verification using neural network methods: 1; acoustic modelling: 1
IEEE keywords: speech recognition: 10; natural language processing: 10; speech synthesis: 8; speaker recognition: 8; task analysis: 5; speech enhancement: 5; decoding: 5; vocoders: 5; text analysis: 5; visualization: 4; audio signal processing: 4; measurement: 3; text to speech: 3; semantics: 3; time domain analysis: 3; hidden markov models: 3; speaker verification: 3; natural language generation: 2; signal processing algorithms: 2; timbre: 2; recording: 2; diffusion: 2; labeling: 2; self supervised learning: 2; language modeling: 2; transfer learning: 2; adaptation models: 2; data models: 2; optimization: 2; natural languages: 2; transformers: 2; variational autoencoder: 2; gaussian processes: 2; lattice to sequence: 2; adversarial training: 2; teacher student learning: 2; data augmentation: 2; data handling: 2; text dependent speaker verification: 2; video signal processing: 2; dialogue policy: 2; slot filling: 2; speaker embedding: 2; interactive systems: 2; attention models: 2; recurrent neural nets: 2; encoder decoder architecture: 1; training schemes: 1; evaluation metrics: 1; audio recognition: 1; automated audio captioning: 1; efficiency: 1; flow matching: 1; mathematical models: 1; rectified flow: 1; trajectory: 1; speed quality tradeoff: 1; speaker embedding free: 1; stability analysis: 1; zero shot voice conversion: 1; linguistics: 1; cross attention: 1; face animation: 1; technological innovation: 1; talking face: 1; dubbing: 1; synchronization: 1; videos: 1; rhetoric: 1; expressive text to speech: 1; tts dataset: 1; large language models: 1; annotations: 1; manuals: 1; textual expressiveness: 1; systematics: 1; byte pair encoding: 1; syntactics: 1; rescore: 1; discrete audio token: 1; correlation: 1; category audio generation: 1; multimodal: 1; clustering: 1; audio text learning: 1; chatbots: 1; data curation pipeline: 1; detailed audio captioning: 1; metadata: 1; pipelines: 1; hierarchical semantic frame: 1; ontologies: 1; spoken language understanding: 1; relational graph attention network: 1; degradation: 1; multitasking: 1; discrete tokens: 1; speaker adaptation: 1; timbre normalization: 1; vector quantization: 1; discrete fourier transforms: 1; cepstrum: 1; noise measurement: 1; neural homomorphic synthesis: 1; spectral masking: 1; multi lingual: 1; multi speaker: 1; vqtts: 1; limmits: 1; classifier guidance: 1; emotion intensity control: 1; controllability: 1; noise reduction: 1; de noising diffusion models: 1; emotional tts: 1; complexity theory: 1; sound generation: 1; spice: 1; variation quantized gan: 1; text to sound: 1; error analysis: 1; audio visual: 1; misp challenge: 1; speaker diarization: 1; inverse problems: 1; speech editing: 1; zero shot adaptation: 1; diffusion probabilistic model: 1; probabilistic logic: 1; unit selection: 1; probability: 1; fastspeech2: 1; speech codecs: 1; voice cloning: 1; autoregressive processes: 1; mixture models: 1; prosody cloning: 1; prosody modelling: 1; mixture density network: 1; pre trained language model: 1; algebra: 1; lattice to lattice: 1; prosody control: 1; unsupervised learning: 1; prosody tagging: 1; decision trees: 1; word level prosody: 1; source filter model: 1; complex neural network: 1; weakly supervised learning: 1; category adaptation: 1; deep neural networks: 1; source separation: 1; supervised learning: 1; information retrieval: 1; audio text retrieval: 1; aggregation: 1; cross modal: 1; pre trained model: 1; pattern classification: 1; arbitrary wake word: 1; training detection criteria: 1; entropy: 1; wake word detection: 1; text prompt: 1; streaming: 1; conditional generation: 1; audio captioning: 1; diverse caption generation: 1; teacher training: 1; voice activity detection: 1; speech activity detection; weakly supervised learning: 1; convolutional neural networks: 1; i vector: 1; sound event detection: 1; dataset: 1; music: 1; text to audio grounding: 1; scalability: 1; multiple tasks: 1; actor critic: 1; parallel training: 1; automatic speech recognition: 1; attention based encoder decoder: 1; standards: 1; connectionist temporal classification: 1; variational auto encoder: 1; text independent speaker verification: 1; generative adversarial network: 1; binarization: 1; product quantization: 1; data compression: 1; neural network language model: 1; storage management: 1; quantisation (signal): 1; intent detection: 1; natural language understanding (nlu): 1; dual learning: 1; semi supervised learning: 1; domain adaptation: 1; prior knowledge: 1; label embedding: 1; natural language understanding: 1; on the fly data augmentation: 1; specaugment: 1; convolutional neural nets: 1; multitask learning: 1; channel information: 1; low resource: 1; dialogue state tracking: 1; hierarchical: 1; data sparsity: 1; polysemy: 1; multi sense embeddings: 1; word processing: 1; distributed representation: 1; search problems: 1; forward backward algorithm: 1; word lattice: 1; speech coding: 1; training data: 1; text dependent: 1; adaptation: 1; system performance: 1; text mismatch: 1; data collection: 1; multi agent systems: 1; policy adaptation: 1; graph theory: 1; ontologies (artificial intelligence): 1; deep reinforcement learning: 1; graph neural networks: 1; center loss: 1; angular softmax: 1; short duration text independent speaker verification: 1; speaker neural embedding: 1; triplet loss: 1; ctc: 1; computational modeling: 1; end to end speech recognition: 1; multi speaker speech recognition: 1; cocktail party problem: 1; attention mechanism: 1; knowledge distillation: 1; computer aided instruction: 1; language translation: 1; audio databases: 1; audio caption: 1; recurrent neural networks: 1; signal classification: 1
Most publications (all venues): 2020: 40; 2024: 34; 2023: 29; 2022: 26; 2019: 25

Affiliations
Shanghai Jiao Tong University, Computer Science and Engineering Department, China
Cambridge University, Engineering Department, UK (PhD 2006)

Recent publications

TASLP2024 Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu 0004
Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning.

ICASSP2024 Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen 0001, Kai Yu 0004
VoiceFlow: Efficient Text-To-Speech with Rectified Flow Matching.

ICASSP2024 Junjie Li, Yiwei Guo, Xie Chen 0001, Kai Yu 0004
SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention.

ICASSP2024 Tao Liu, Chenpeng Du, Shuai Fan 0005, Feilong Chen, Kai Yu 0004
DiffDub: Person-Generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-Encoder.

ICASSP2024 Sen Liu, Yiwei Guo, Xie Chen 0001, Kai Yu 0004
StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations.

ICASSP2024 Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen 0001, Kai Yu 0004
Acoustic BPE for Speech Generation with Discrete Tokens.

ICASSP2024 Zeyu Xie, Baihan Li, Xuenan Xu, Mengyue Wu, Kai Yu 0004
Enhancing Audio Generation Diversity with Visual Information.

ICASSP2024 Xuenan Xu, Xiaohang Xu 0004, Zeyu Xie, Pingyue Zhang, Mengyue Wu, Kai Yu 0004
A Detailed Audio-Text Data Simulation Pipeline Using Single-Event Sounds.

ICASSP2024 Hongshen Xu, Ruisheng Cao, Su Zhu, Sheng Jiang, Hanchong Zhang, Lu Chen 0002, Kai Yu 0004
A Birgat Model for Multi-Intent Spoken Language Understanding with Hierarchical Semantic Frames.

ICASSP2024 Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu 0004, Daniel Povey, Xie Chen 0001, 
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS.

AAAI2024 Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen 0001, Shuai Wang 0016, Hui Zhang, Kai Yu 0004
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.

TASLP2023 Chenpeng Du, Yiwei Guo, Xie Chen 0001, Kai Yu 0004
Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature.

TASLP2023 Wenbin Jiang, Kai Yu 0004
Speech Enhancement With Integration of Neural Homomorphic Synthesis and Spectral Masking.

ICASSP2023 Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu 0004
Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge.

ICASSP2023 Yiwei Guo, Chenpeng Du, Xie Chen 0001, Kai Yu 0004
Emodiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance.

ICASSP2023 Guangwei Li, Xuenan Xu, Lingfeng Dai, Mengyue Wu, Kai Yu 0004
Diverse and Vivid Sound Generation from Text Descriptions.

ICASSP2023 Tao Liu, Zhengyang Chen, Yanmin Qian, Kai Yu 0004
Multi-Speaker End-to-End Multi-Modal Speaker Diarization System for the MISP 2022 Challenge.

ICASSP2023 Zhijun Liu, Yiwei Guo, Kai Yu 0004
DiffVoice: Text-to-Speech with Latent Diffusion.

Interspeech2023 Wenbin Jiang, Fei Wen, Yifan Zhang, Kai Yu 0004
UnSE: Unsupervised Speech Enhancement Using Optimal Transport.

Interspeech2023 Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu 0004, Xie Chen 0001, 
Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation.

#20  | Tomoki Toda | DBLP Google Scholar  
By venue: Interspeech: 31; ICASSP: 30; TASLP: 10; SpeechComm: 1
By year: 2024: 7; 2023: 15; 2022: 12; 2021: 14; 2020: 11; 2019: 8; 2018: 5
ISCA sessions: speech synthesis: 10; voice conversion and adaptation: 3; neural techniques for voice conversion and waveform generation: 3; speech synthesis and voice conversion: 2; speech enhancement, bandwidth extension and hearing aids: 2; voice conversion and speech synthesis: 2; speech quality assessment: 1; spoken dialog systems and conversational analysis: 1; the voicemos challenge: 1; technology for disordered speech: 1; the zero resource speech challenge 2020: 1; neural waveform generation: 1; novel paradigms for direct synthesis based on speech-related biosignals: 1; sequence models for asr: 1; speech synthesis paradigms and methods: 1
IEEE keywords: speech synthesis: 16; vocoders: 12; voice conversion: 9; neural vocoder: 8; speech recognition: 7; natural language processing: 5; linguistics: 4; speech enhancement: 4; training data: 4; speaker recognition: 4; autoregressive processes: 4; transformer: 4; recurrent neural nets: 4; controllability: 3; text to speech: 3; real time systems: 3; electrolaryngeal speech: 3; voice conversion (vc): 3; open source software: 3; speech intelligibility: 3; sequence to sequence: 3; convolutional neural nets: 3; pathology: 2; model pretraining: 2; task analysis: 2; artificial neural networks: 2; data mining: 2; speech emotion recognition: 2; emotion recognition: 2; self supervised learning: 2; fundamental frequency control: 2; source filter model: 2; predictive models: 2; decoding: 2; error analysis: 2; convolution: 2; noisy to noisy vc: 2; noisy speech modeling: 2; self supervised speech representation: 2; robustness: 2; automatic speech recognition: 2; computer architecture: 2; mos prediction: 2; non autoregressive: 2; open source: 2; pitch dependent dilated convolution: 2; audio signal processing: 2; probability: 2; gaussian processes: 2; supervised learning: 2; speech coding: 2; transfer learning: 1; domain adaptation: 1; automatic speech recognition (asr): 1; low resourced asr: 1; electrolaryngeal (el) speech: 1; multichannel source separation: 1; direction of arrival estimation: 1; target speaker extraction: 1; source separation: 1; microphones: 1; interference: 1; multichannel variational autoencoder (mvae): 1; estimation error: 1; error correction: 1; multi modal fusion: 1; multitasking: 1; semantics: 1; coherence: 1; asr error detection: 1; visualization: 1; asr error correction: 1; audio difference captioning: 1; audio difference learning: 1; annotations: 1; audio captioning: 1; learning systems: 1; finite impulse response filters: 1; finite impulse response: 1; synthesizers: 1; jets: 1; wavenext: 1; convnext: 1; transformers: 1; intelligibility enhancement: 1; atypical speech: 1; harmonic analysis: 1; speech rate conversion: 1; generators: 1; noise robustness: 1; distortion: 1; degradation: 1; noise measurement: 1; data augmentation: 1; background noise: 1; mathematical models: 1; unified source filter networks: 1; single channel speech enhancement: 1; deep neural network: 1; noise2noise: 1; unsupervised learning: 1; behavioral sciences: 1; natural languages: 1; sequence to sequence voice conversion: 1; embedded systems: 1; computers: 1; low latency speech enhancement: 1; speaker normalization: 1; group theory: 1; vocal tract length: 1; asr: 1; writing: 1; minimally resourced asr: 1; limiting: 1; timbre: 1; frequency synthesizers: 1; documentation: 1; autoregressive models: 1; singing voice synthesis: 1; pytorch: 1; multi stream models: 1; variational auto encoder: 1; diffusion probabilistic model: 1; representation learning: 1; text to speech synthesis: 1; tts: 1; probabilistic logic: 1; generative adversarial networks: 1; source filter model: 1; speech naturalness assessment: 1; mean opinion score: 1; streaming: 1; speech quality assessment: 1; hearing: 1; sequence to sequence modeling: 1; decision making: 1; dysarthric speech: 1; pathological speech: 1; autoencoder: 1; computer based training: 1; signal denoising: 1; pretraining: 1; transformer network: 1; attention: 1; computational modeling: 1; sequence to sequence learning: 1; data models: 1; many to many vc: 1; parallel wavegan: 1; quasi periodic wavenet: 1; wavenet: 1; quasi periodic structure: 1; pitch controllability: 1; vocoder: 1; listener adaptation: 1; perceived emotion: 1; conformer: 1; bert: 1; language model: 1; text analysis: 1; vector quantized variational autoencoder: 1; nonparallel: 1; medical disorders: 1; dysarthria: 1; diffwave: 1; diffusion probabilistic vocoder: 1; sub modeling: 1; wavegrad: 1; noise: 1; call centres: 1; hierarchical multi task model: 1; contact center call: 1; customer satisfaction (cs): 1; long short term memory recurrent neural networks: 1; customer satisfaction: 1; customer services: 1; reproducibility of results: 1; end to end: 1; sound event detection: 1; weakly supervised learning: 1; self attention: 1; weighted forced attention: 1; forced alignment: 1; sequence to sequence model: 1; laplacian distribution: 1; prediction theory: 1; wavenet vocoder: 1; multiple samples output: 1; shallow model: 1; linear prediction: 1; fast fourier transforms: 1; gaussian inverse autoregressive flow: 1; parallel wavenet: 1; fftnet: 1; noise shaping: 1; wavenet fine tuning: 1; oversmoothed parameters: 1; cyclic recurrent neural network: 1
Most publications (all venues): 2014: 42; 2015: 37; 2023: 30; 2021: 28; 2018: 25


Recent publications

TASLP2024 Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda
Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition.

TASLP2024 Rui Wang, Li Li 0063, Tomoki Toda
Dual-Channel Target Speaker Extraction Based on Conditional Variational Autoencoder and Directional Information.

ICASSP2024 Jiajun He, Xiaohan Shi, Xingfeng Li 0001, Tomoki Toda
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction.

ICASSP2024 Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda
Audio Difference Learning for Audio Captioning.

ICASSP2024 Yamato Ohtani, Takuma Okamoto, Tomoki Toda, Hisashi Kawai, 
FIRNet: Fundamental Frequency Controllable Fast Neural Vocoder With Trainable Finite Impulse Response Filter.

ICASSP2024 Takuma Okamoto, Yamato Ohtani, Tomoki Toda, Hisashi Kawai, 
Convnext-TTS And Convnext-VC: Convnext-Based Fast End-To-End Sequence-To-Sequence Text-To-Speech And Voice Conversion.

ICASSP2024 Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda
Electrolaryngeal Speech Intelligibility Enhancement through Robust Linguistic Encoders.

TASLP2023 Keisuke Matsubara, Takuma Okamoto, Ryoichi Takashima, Tetsuya Takiguchi, Tomoki Toda, Hisashi Kawai, 
Harmonic-Net: Fundamental Frequency and Speech Rate Controllable Fast Neural Vocoder.

TASLP2023 Chao Xie, Tomoki Toda
Noisy-to-Noisy Voice Conversion Under Variations of Noisy Condition.

TASLP2023 Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda
High-Fidelity and Pitch-Controllable Neural Vocoder Based on Unified Source-Filter Networks.

ICASSP2023 Takuya Fujimura, Tomoki Toda
Analysis Of Noisy-Target Training For Dnn-Based Speech Enhancement.

ICASSP2023 Kazuhiro Kobayashi, Tomoki Hayashi, Tomoki Toda
Low-Latency Electrolaryngeal Speech Enhancement Based on Fastspeech2-Based Voice Conversion and Self-Supervised Speech Representation.

ICASSP2023 Atsushi Miyashita, Tomoki Toda
Representation of Vocal Tract Length Transformation Based on Group Theory.

ICASSP2023 Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda
Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition.

ICASSP2023 Ryuichi Yamamoto, Reo Yoneyama, Tomoki Toda
NNSVS: A Neural Network-Based Singing Voice Synthesis Toolkit.

ICASSP2023 Yusuke Yasuda, Tomoki Toda
Text-To-Speech Synthesis Based on Latent Variable Conversion Using Diffusion Probabilistic Model and Variational Autoencoder.

ICASSP2023 Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda
Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder.

Interspeech2023 Yeonjong Choi, Chao Xie, Tomoki Toda
Reverberation-Controllable Voice Conversion Using Reverberation Time Estimator.

Interspeech2023 Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda
Preference-based training framework for automatic speech quality assessment using deep neural network.

Interspeech2023 Takuma Okamoto, Tomoki Toda, Hisashi Kawai, 
E2E-S2S-VC: End-To-End Sequence-To-Sequence Voice Conversion.

#21  | Junichi Yamagishi | DBLP Google Scholar  
By venue: Interspeech: 32; ICASSP: 24; TASLP: 15
By year: 2024: 5; 2023: 12; 2022: 12; 2021: 8; 2020: 16; 2019: 13; 2018: 5
ISCA sessions: speech synthesis: 8; voice anti-spoofing and countermeasure: 3; voice privacy challenge: 3; speaker and language identification: 2; speech synthesis paradigms and methods: 2; anti-spoofing for speaker verification: 1; the voicemos challenge: 1; single-channel and multi-channel speech enhancement: 1; speech coding and restoration: 1; spoofing-aware automatic speaker verification (sasv): 1; intelligibility-enhancing speech modification: 1; single-channel speech enhancement: 1; emotion modeling and analysis: 1; neural techniques for voice conversion and waveform generation: 1; the 2019 automatic speaker verification spoofing and countermeasures challenge: 1; expressive speech synthesis: 1; voice conversion and speech synthesis: 1; prosody modeling and generation: 1; speaker verification: 1
IEEE keywords: speech synthesis: 20; speaker recognition: 11; vocoders: 10; speech recognition: 8; text to speech: 7; anti spoofing: 6; training data: 5; privacy: 5; voice conversion: 5; countermeasure: 5; presentation attack detection: 5; data privacy: 4; pipelines: 4; speech intelligibility: 4; neural network: 4; task analysis: 3; logical access: 3; neural vocoder: 3; speaker anonymization: 3; music: 3; filtering theory: 3; data models: 2; protocols: 2; self supervised learning: 2; information filtering: 2; asvspoof: 2; speech enhancement: 2; tacotron: 2; automatic speaker verification: 2; musical instruments: 2; natural language processing: 2; mos prediction: 2; speaker verification: 2; variational auto encoder: 2; hidden markov models: 2; speech coding: 2; security of data: 2; speaker adaptation: 2; fourier transforms: 2; autoregressive processes: 2; multilingual: 1; self supervised representations: 1; decoding: 1; zero shot: 1; spectrogram: 1; low resource: 1; pseudonymisation: 1; voice privacy: 1; anonymisation: 1; attack model: 1; recording: 1; degradation: 1; deepfake detection: 1; signal processing algorithms: 1; privacy friendly data: 1; language robust orthogonal householder neural network: 1; codecs: 1; deepfakes: 1; spoofing: 1; distributed databases: 1; countermeasures: 1; communication networks: 1; selection based anonymizer: 1; measurement: 1; information integrity: 1; synthetic aperture sonar: 1; orthogonal householder neural network anonymizer: 1; weighted additive angular softmax: 1; internet: 1; deepfake: 1; databases: 1; spoof localization: 1; partialspoof: 1; splicing: 1; forgery: 1; listening enhancement: 1; oral communication: 1; noise reduction: 1; noise measurement: 1; full end speech enhancement: 1; intelligibility: 1; transforms: 1; privacy preservation: 1; sex neutral voice: 1; attribute privacy: 1; multiple signal classification: 1; computational modeling: 1; software: 1; transformer: 1; text to speech synthesis: 1; music audio synthesis: 1; analytical models: 1; buildings: 1; spoof countermeasures: 1; security: 1; reinforcement learning: 1; musical instrument embeddings: 1; gaussian processes: 1; linkability: 1; speech naturalness assessment: 1; mean opinion score: 1; speech quality assessment: 1; hearing: 1; efficiency: 1; pruning: 1; vocoder: 1; computer crime: 1; estimation theory: 1; resnet: 1; attention: 1; tdnn: 1; feedforward neural nets: 1; deep learning (artificial intelligence): 1; time frequency analysis: 1; generative adversarial networks: 1; multi metric optimization: 1; reverberation: 1; speech analysis: 1; voice conversion evaluation: 1; voice conversion challenges: 1; speaker characterization: 1; vocoding: 1; entertainment: 1; listening test: 1; rakugo: 1; vector quantisation: 1; representation learning: 1; phone recognition: 1; image coding: 1; disentanglement: 1; speaker diarization: 1; duration modeling: 1; vector quantization: 1; automatic speaker verification (asv): 1; detection cost function: 1; spoofing counter measures: 1; backpropagation: 1; voice cloning: 1; short time fourier transform: 1; convolution: 1; waveform model: 1; recurrent neural nets: 1; fundamental frequency: 1; speaker embeddings: 1; transfer learning: 1; search problems: 1; probability: 1; sequences: 1; sampling methods: 1; sequence to sequence model: 1; stochastic processes: 1; neural waveform synthesizer: 1; fine tuning: 1; audio signal processing: 1; zero shot adaptation: 1; musical instrument sounds synthesis: 1; cepstral analysis: 1; complex valued representation: 1; boltzmann machines: 1; restricted boltzmann machine: 1; signal classification: 1; neural vocoding: 1; gan: 1; inference mechanisms: 1; glottal excitation model: 1; replay attacks: 1; spoofing attack: 1; vocal effort: 1; style conversion: 1; pulse model in log domain vocoder: 1; cyclegan: 1; lombard speech: 1; spectral analysis: 1; wavenet: 1; neural net architecture: 1; neural waveform modeling: 1; maximum likelihood estimation: 1; waveform analysis: 1; gaussian distribution: 1; waveform generators: 1; waveform modeling: 1; gradient methods: 1; text analysis: 1
Most publications (all venues): 2020: 32; 2019: 32; 2022: 31; 2018: 29; 2016: 29

Affiliations
National Institute of Informatics, Tokyo, Japan
University of Edinburgh, Scotland, UK (former)

Recent publications

TASLP2024 Cheng Gong, Xin Wang 0037, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang 0001, Korin Richmond, Junichi Yamagishi
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations.

TASLP2024 Michele Panariello, Natalia A. Tomashenko, Xin Wang 0037, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas W. D. Evans, Emmanuel Vincent 0001, Junichi Yamagishi
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation.

ICASSP2024 Xin Wang 0037, Junichi Yamagishi
Can Large-Scale Vocoded Spoofed Data Improve Speech Spoofing Countermeasure with a Self-Supervised Front End?

ICASSP2024 Wanying Ge, Xin Wang 0037, Junichi Yamagishi, Massimiliano Todisco, Nicholas W. D. Evans, 
Spoofing Attack Augmentation: Can Differently-Trained Attack Models Improve Generalisation?

ICASSP2024 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Nicholas W. D. Evans, Massimiliano Todisco, Jean-François Bonastre, Mickael Rouvier, 
Synvox2: Towards A Privacy-Friendly Voxceleb2 Dataset.

TASLP2023 Xuechen Liu, Xin Wang 0037, Md. Sahidullah, Jose Patino 0001, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas W. D. Evans, Andreas Nautsch, Kong Aik Lee, 
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.

TASLP2023 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Natalia A. Tomashenko, 
Speaker Anonymization Using Orthogonal Householder Neural Network.

TASLP2023 Lin Zhang, Xin Wang 0037, Erica Cooper, Nicholas W. D. Evans, Junichi Yamagishi
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance.

ICASSP2023 Haoyu Li, Yun Liu, Junichi Yamagishi
Joint Noise Reduction and Listening Enhancement for Full-End Speech Enhancement.

ICASSP2023 Paul-Gauthier Noé, Xiaoxiao Miao, Xin Wang 0037, Junichi Yamagishi, Jean-François Bonastre, Driss Matrouf, 
Hiding Speaker's Sex in Speech Using Zero-Evidence Speaker Representation in an Analysis/Synthesis Pipeline.

ICASSP2023 Xuan Shi, Erica Cooper, Xin Wang 0037, Junichi Yamagishi, Shrikanth Narayanan, 
Can Knowledge of End-to-End Text-to-Speech Models Improve Neural Midi-to-Audio Synthesis Systems?

ICASSP2023 Xin Wang 0037, Junichi Yamagishi
Spoofed Training Data for Speech Spoofing Countermeasure Can Be Efficiently Created Using Neural Vocoders.

Interspeech2023 Erica Cooper, Junichi Yamagishi
Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech.

Interspeech2023 Hieu-Thi Luong, Junichi Yamagishi
Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme.

Interspeech2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang 0037, Xuechen Liu, Md. Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas W. D. Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung, 
Towards Single Integrated Spoofing-aware Speaker Verification Embeddings.

Interspeech2023 Chang Zeng, Xin Wang 0037, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi
Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms.

Interspeech2023 Lin Zhang, Xin Wang 0037, Erica Cooper, Nicholas W. D. Evans, Junichi Yamagishi
Range-Based Equal Error Rate for Spoof Localization.

TASLP2022 Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi
Optimizing Tandem Speaker Verification and Anti-Spoofing Systems.

TASLP2022 Xuan Shi, Erica Cooper, Junichi Yamagishi
Use of Speaker Recognition Approaches for Learning and Evaluating Embedding Representations of Musical Instrument Sounds.

TASLP2022 Brij Mohan Lal Srivastava, Mohamed Maouche, Md. Sahidullah, Emmanuel Vincent 0001, Aurélien Bellet, Marc Tommasi, Natalia A. Tomashenko, Xin Wang 0037, Junichi Yamagishi
Privacy and Utility of X-Vector Based Speaker Anonymization.

#22  | Marc Delcroix | DBLP Google Scholar  
By venue: ICASSP: 34; Interspeech: 33; TASLP: 4
By year: 2024: 7; 2023: 12; 2022: 12; 2021: 14; 2020: 13; 2019: 9; 2018: 4
ISCA sessions: source separation: 3; adjusting to speaker, accent, and domain: 2; analysis of neural speech representations: 1; multi-talker methods in speech processing: 1; speech coding: 1; spoken language understanding, summarization, and information retrieval: 1; speech recognition: 1; speech coding and enhancement: 1; dereverberation, noise reduction, and speaker extraction: 1; speech enhancement and intelligibility: 1; speaker embedding and diarization: 1; search/decoding algorithms for asr: 1; novel models and training methods for asr: 1; single-channel speech enhancement: 1; speaker diarization: 1; streaming for asr/rnn transducers: 1; source separation, dereverberation and echo cancellation: 1; speech localization, enhancement, and quality assessment: 1; target speaker detection, localization and separation: 1; monaural source separation: 1; asr neural network architectures and training: 1; diarization: 1; targeted source separation: 1; lm adaptation, lexical units and punctuation: 1; asr for noisy and far-field speech: 1; asr neural network architectures: 1; speech and audio source separation and scene analysis: 1; neural networks for language modeling: 1; distant asr: 1; end-to-end speech recognition: 1
IEEE keywords: speech recognition: 18; speech enhancement: 13; source separation: 10; speaker recognition: 8; automatic speech recognition: 5; natural language processing: 5; neural network: 5; self supervised learning: 4; reverberation: 4; single channel speech enhancement: 3; transformers: 3; adaptation models: 3; target speech extraction: 3; recording: 3; recurrent neural nets: 3; blind source separation: 3; array signal processing: 3; degradation: 2; noise robust speech recognition: 2; processing distortion: 2; analytical models: 2; speech summarization: 2; encoding: 2; data models: 2; speech translation: 2; speech synthesis: 2; joint training: 2; bayes methods: 2; hidden markov models: 2; speaker diarization: 2; artificial neural networks: 2; continuous speech separation: 2; permutation invariant training: 2; particle separators: 2; convolution: 2; dynamic programming: 2; computational modeling: 2; computational efficiency: 2; memory management: 2; task analysis: 2; training data: 2; error analysis: 2; meeting recognition: 2; end to end speech recognition: 2; sensor fusion: 2; text analysis: 2; diarization: 2; signal to distortion ratio: 2; speech separation: 2; speech extraction: 2; convolutional neural nets: 2; online processing: 2; dynamic stream weights: 2; audio signal processing: 2; time domain network: 2; source counting: 2; time domain analysis: 2; backpropagation: 2; nonlinear distortion: 1; noise measurement: 1; interference: 1; speaker representation: 1; refining: 1; probing task: 1; speech representation: 1; linguistics: 1; layer wise similarity analysis: 1; long form asr: 1; complexity theory: 1; speaker embeddings: 1; noise robustness: 1; zero shot tts: 1; self supervised learning model: 1; acoustic distortion: 1; interpolation: 1; variational bayes: 1; discriminative training: 1; standards: 1; vbx: 1; tuning: 1; clustering: 1; feature aggregation: 1; pre trained models: 1; benchmark testing: 1; telephone sets: 1; data mining: 1; few shot adaptation: 1; sound event: 1; soundbeam: 1; target sound extraction: 1; oral communication: 1; graph pit: 1; video on demand: 1; end to end modeling: 1; memory efficient encoders: 1; dual speech/text encoder: 1; long spoken document: 1; end to end speech summarization: 1; measurement: 1; synthetic data augmentation: 1; how2 dataset: 1; multi modal data augmentation: 1; software: 1; tensors: 1; word error rate: 1; levenshtein distance: 1; iterative methods: 1; forward language model: 1; iterative decoding: 1; partial sentence aware backward language model: 1; iterative shallow fusion: 1; symbols: 1; shallow fusion: 1; language translation: 1; attention fusion: 1; rover: 1; pattern clustering: 1; infinite gmm: 1; mixture models: 1; gaussian processes: 1; attention based decoder: 1; recurrent neural network transducer: 1; end to end: 1; switches: 1; loss function: 1; large ensemble: 1; complementary neural language models: 1; iterative lattice generation: 1; lattice rescoring: 1; context carry over: 1; lattices: 1; input switching: 1; deep learning (artificial intelligence): 1; speakerbeam: 1; acoustic beamforming: 1; complex backpropagation: 1; transfer functions: 1; multi channel source separation: 1; speaker activity: 1; clustering algorithms: 1; databases: 1; signal processing algorithms: 1; long recording speech separation: 1; transforms: 1; dual path modeling: 1; end to end (e2e) speech recognition: 1; estimation theory: 1; bidirectional long short term memory (blstm): 1; imbalanced datasets: 1; confidence estimation: 1; auxiliary features: 1; audiovisual speaker localization: 1; audio visual systems: 1; image fusion: 1; data fusion: 1; video signal processing: 1; beamforming: 1; maximum likelihood estimation: 1; dereverberation: 1; optimisation: 1; filtering theory: 1; microphone array: 1; microphone arrays: 1; multi task loss: 1; spatial features: 1; separation: 1; smart devices: 1; robustness: 1; signal denoising: 1; robust asr: 1; and multi head self attention: 1; multi task learning: 1; auxiliary information: 1; computational complexity: 1; multi speaker speech recognition: 1; time domain: 1; frequency domain analysis: 1; audiovisual speaker tracking: 1; kalman filters: 1; tracking: 1; backprop kalman filter: 1; speaker embedding: 1; adversarial learning: 1; deep neural networks: 1; phoneme invariant feature: 1; text independent speaker recognition: 1; signal classification: 1; adaptation: 1; auxiliary feature: 1; domain adaptation: 1; topic model: 1; recurrent neural network language model: 1; sequence summary network: 1; semi supervised learning: 1; decoding: 1; encoder decoder: 1; autoencoder: 1; meeting diarization: 1; speaker attention: 1; speech separation/extraction: 1
Most publications (all venues): 2017: 24; 2024: 22; 2021: 22; 2023: 20; 2020: 17

Affiliations
URLs

Recent publications

TASLP2024 Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance.

ICASSP2024 Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima, 
What Do Self-Supervised Speech and Speaker Models Learn? New Findings from a Cross Model Layer-Wise Analysis.

ICASSP2024 William Chen, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe 0001, 
Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing.

ICASSP2024 Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima, 
Noise-Robust Zero-Shot Text-to-Speech Synthesis Conditioned on Self-Supervised Speech-Representation Model with Adapters.

ICASSP2024 Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts?

ICASSP2024 Dominik Klement, Mireia Díez, Federico Landini, Lukás Burget, Anna Silnova, Marc Delcroix, Naohiro Tawara, 
Discriminative Training of VBx Diarization.

ICASSP2024 Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocký, 
Target Speech Extraction with Pre-Trained Self-Supervised Learning Models.

TASLP2023 Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki, 
SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning.

TASLP2023 Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach, 
Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria.

ICASSP2023 Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Roshan S. Sharma, Kohei Matsuura, Shinji Watanabe 0001, 
Speech Summarization of Long Spoken Document: Improving Memory Efficiency of Speech/Text Encoders.

ICASSP2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura, 
Leveraging Large Text Corpora For End-To-End Speech Summarization.

ICASSP2023 Thilo von Neumann, Christoph Böddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach, 
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems.

ICASSP2023 Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix
Iterative Shallow Fusion of Backward Language Model for End-To-End Speech Recognition.

Interspeech2023 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma, 
SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

Interspeech2023 Marc Delcroix, Naohiro Tawara, Mireia Díez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukás Burget, Shoko Araki, 
Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization.

Interspeech2023 Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani, 
Target Speech Extraction with Conditional Diffusion Model.

Interspeech2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization.

Interspeech2023 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami, 
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data.

Interspeech2023 Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo, 
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss.

ICASSP2022 Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe 0001, 
Integrating Multiple ASR Systems into NLP Backend with Attention Fusion.

#23  | Prasanta Kumar Ghosh | DBLP Google Scholar  
By venue: Interspeech: 46; ICASSP: 21; TASLP: 2; SpeechComm: 1
By year: 2024: 1; 2023: 12; 2022: 8; 2021: 10; 2020: 15; 2019: 13; 2018: 11
ISCA sessions: speech signal characterization: 4; show and tell: 3; speech and voice disorders: 3; human speech production: 3; bioacoustics and articulation: 3; articulatory information, modeling and inversion: 3; speech, voice, and hearing disorders: 2; speech production: 2; speech signal analysis and representation: 2; source and supra-segmentals: 2; articulation: 1; speech signal analysis: 1; phonetics, phonology, and prosody: 1; dysarthric speech assessment: 1; analysis of speech and audio signals: 1; low-resource asr development: 1; speech production, perception and multimodality: 1; assessment of pathological speech and language: 1; cross/multi-lingual and code-switched asr: 1; the first dicova challenge: 1; diverse modes of speech acquisition and processing: 1; speech in health: 1; speaker recognition: 1; applications in language learning and healthcare: 1; deep enhancement: 1; source separation and spatial analysis: 1; voice conversion: 1; speech and singing production: 1; show and tell 6: 1
IEEE keywords: speech recognition: 7; amyotrophic lateral sclerosis: 5; acoustic to articulatory inversion: 5; speaker recognition: 5; diseases: 4; signal classification: 4; cepstral analysis: 4; whispered speech: 3; convolutional neural nets: 3; blstm: 3; vowels: 2; fricatives: 2; dysarthria: 2; production: 2; tongue: 2; data models: 2; speech synthesis: 2; transformers: 2; convolution: 2; natural language processing: 2; parkinson’s disease: 2; correlation methods: 2; filtering theory: 2; electromagnetic articulograph: 2; audio signal processing: 2; cnn: 2; spectral analysis: 1; severity: 1; sociology: 1; acoustic measurements: 1; constriction: 1; voicing: 1; statistics: 1; static: 1; source filter: 1; vowel: 1; dynamic: 1; shape: 1; information filters: 1; text to speech (tts): 1; model compression: 1; data constrained multi speaker: 1; multi lingual tts: 1; end to end: 1; sequence to sequence learning: 1; measurement: 1; atmospheric modeling: 1; speech production: 1; real time magnetic resonance imaging: 1; streaming media: 1; magnetic resonance imaging: 1; self supervised learning: 1; articulatory to acoustic forward mapping: 1; articulatory speech synthesis: 1; recording device: 1; dual attention pooling network: 1; real time magnetic resonance imaging video: 1; biomedical mri: 1; air tissue boundary segmentation: 1; 3 dimensional convolutional neural network: 1; tongue base: 1; velum: 1; medical image processing: 1; image segmentation: 1; image registration: 1; mel frequency cepstral coefficients: 1; model complexity: 1; noise: 1; pitch: 1; transfer learning: 1; medical computing: 1; x vectors: 1; pitch drop: 1; source filter interaction: 1; natural languages: 1; speaking rate: 1; support vector machines: 1; medical signal processing: 1; recurrent neural nets: 1; cnn lstm: 1; adaptation: 1; lf mmi: 1; hidden markov models: 1; maximum likelihood estimation: 1; pseudo likelihood correction technique: 1; acoustic signal detection: 1; attention network: 1; swallow sound signal: 1; feature selection: 1; biology computing: 1; bioacoustics: 1; cervical auscultation: 1; acoustic analysis: 1; gesture recognition: 1; head gestures: 1; euler angles: 1; lstm: 1; sustained phonations: 1; asthma: 1; classification: 1; opensmile: 1; latent variable model: 1; expectation maximisation algorithm: 1; dirichlet distribution: 1; source separation: 1; nmf: 1; exponential family distributions: 1; time varying: 1; non negative: 1; gif: 1; gibbs sampling: 1; probability: 1; glottal inverse filtering: 1; probabilistic weighted linear prediction: 1; formants: 1; amplitude modulation: 1; speaker verification: 1; articulatory data: 1; automatic speech recognition: 1; signal representation: 1; neutral speech: 1
Most publications (all venues): 2019: 25; 2018: 24; 2021: 21; 2023: 20; 2020: 19

Affiliations
Indian Institute of Science, Department of Electrical Engineering, Bangalore, India
URLs

Recent publications

ICASSP2024 Chowdam Venkata Thirumala Kumar, Tanuka Bhattacharjee, Seena Vengalil, Saraswati Nashi, Madassu Keerthipriya, Yamini Belur, Atchayaram Nalini, Prasanta Kumar Ghosh
Spectral Analysis of Vowels and Fricatives at Varied Levels of Dysarthria Severity for Amyotrophic Lateral Sclerosis.

ICASSP2023 Tanuka Bhattacharjee, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh
Exploring the Role of Fricatives in Classifying Healthy Subjects and Patients with Amyotrophic Lateral Sclerosis and Parkinson's Disease.

ICASSP2023 Tanuka Bhattacharjee, Chowdam Venkata Thirumala Kumar, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh
Static and Dynamic Source and Filter Cues for Classification of Amyotrophic Lateral Sclerosis Patients and Healthy Subjects.

ICASSP2023 Abhayjeet Singh, Amala Nagireddi, Deekshitha G, Jesuraja Bandekar, Roopa R., Sandhya Badiger, Sathvik Udupa, Prasanta Kumar Ghosh, Hema A. Murthy, Heiga Zen, Pranaw Kumar, Kamal Kant, Amol Bole, Bira Chandra Singh, Keiichi Tokuda, Mark Hasegawa-Johnson, Philipp Olbrich, 
Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech.

ICASSP2023 Sathvik Udupa, Prasanta Kumar Ghosh
Real-Time MRI Video Synthesis from Time Aligned Phonemes with Sequence-to-Sequence Networks.

ICASSP2023 Sathvik Udupa, C. Siddarth, Prasanta Kumar Ghosh
Improved Acoustic-to-Articulatory Inversion Using Representations from Pretrained Self-Supervised Learning Models.

Interspeech2023 Jesuraja Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh
Exploring a classification approach using quantised articulatory movements for acoustic to articulatory inversion.

Interspeech2023 Varun Belagali, M. V. Achuth Rao, Prasanta Kumar Ghosh
Weakly supervised glottis segmentation in high-speed videoendoscopy using bounding box labels.

Interspeech2023 Tanuka Bhattacharjee, Anjali Jayakumar, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh
Transfer Learning to Aid Dysarthria Severity Classification for Patients with Amyotrophic Lateral Sclerosis.

Interspeech2023 Siddarth Chandrasekar, Arvind Ramesh, Tilak Purohit, Prasanta Kumar Ghosh
A Study on the Importance of Formant Transitions for Stop-Consonant Classification in VCV Sequence.

Interspeech2023 Shelly Jain, Priyanshi Pal, Anil Kumar Vuppala, Prasanta Kumar Ghosh, Chiranjeevi Yarra, 
An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations.

Interspeech2023 Chowdam Venkata Thirumala Kumar, Tanuka Bhattacharjee, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh
Classification of Multi-class Vowels and Fricatives From Patients Having Amyotrophic Lateral Sclerosis with Varied Levels of Dysarthria Severity.

Interspeech2023 Mohammad Shaique Solanki, Ashutosh Bharadwaj, Jeevan Kylash, Prasanta Kumar Ghosh
Do Vocal Breath Sounds Encode Gender Cues for Automatic Gender Classification?

SpeechComm2022 Chiranjeevi Yarra, Prasanta Kumar Ghosh
Automatic syllable stress detection under non-parallel label and data condition.

ICASSP2022 Aravind Illa, Aanish Nair, Prasanta Kumar Ghosh
The impact of cross language on acoustic-to-articulatory inversion and its influence on articulatory speech synthesis.

ICASSP2022 Abinay Reddy Naini, Bhavuk Singhal, Prasanta Kumar Ghosh
Dual Attention Pooling Network for Recording Device Classification Using Neutral and Whispered Speech.

ICASSP2022 Anwesha Roy, Varun Belagali, Prasanta Kumar Ghosh
An Error Correction Scheme for Improved Air-Tissue Boundary in Real-Time MRI Video for Speech Production.

Interspeech2022 Anish Bhanushali, Grant Bridgman, Deekshitha G, Prasanta Kumar Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Srinivasan Umesh, Sathvik Udupa, Lodagala V. S. V. Durga Prasad, 
Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi.

Interspeech2022 Anwesha Roy, Varun Belagali, Prasanta Kumar Ghosh
Air tissue boundary segmentation using regional loss in real-time Magnetic Resonance Imaging video for speech production.

Interspeech2022 C. Siddarth, Sathvik Udupa, Prasanta Kumar Ghosh
Watch Me Speak: 2D Visualization of Human Mouth during Speech.

#24  | Chin-Hui Lee 0001 | DBLP Google Scholar  
By venue: ICASSP: 30; Interspeech: 26; TASLP: 11; SpeechComm: 3
By year: 2024: 7; 2023: 14; 2022: 7; 2021: 11; 2020: 17; 2019: 12; 2018: 2
ISCA sessions: acoustic scene classification: 2; multi-channel speech enhancement: 2; speech enhancement: 2; speech enhancement and denoising: 1; speech coding and enhancement: 1; multi-talker methods in speech processing: 1; spoken dialog systems and conversational analysis: 1; speech recognition: 1; spoken language processing: 1; speaker embedding and diarization: 1; acoustic scene analysis: 1; spoken dialogue systems and multimodality: 1; multimodal systems: 1; speaker diarization: 1; privacy-preserving machine learning for audio & speech processing: 1; single-channel speech enhancement: 1; voice activity detection and keyword spotting: 1; speech emotion recognition: 1; speech coding and evaluation: 1; speech and audio classification: 1; far-field speech recognition: 1; deep enhancement: 1; the first dihard speech diarization challenge: 1
IEEE keywords: speech enhancement: 22; speech recognition: 20; speaker diarization: 9; visualization: 8; task analysis: 6; deep neural network: 6; noise measurement: 5; data models: 5; regression analysis: 5; hidden markov models: 4; misp challenge: 4; recording: 4; adaptation models: 4; voice activity detection: 4; robust speech recognition: 3; audio visual: 3; noise: 3; error analysis: 3; speech separation: 3; reverberation: 3; progressive learning: 3; speaker recognition: 3; teacher student learning: 3; signal to noise ratio: 3; optimization: 2; iterative methods: 2; robustness: 2; estimation: 2; emotion recognition: 2; data mining: 2; benchmark testing: 2; multimodality: 2; memory aware speaker embedding: 2; attention network: 2; telephone sets: 2; data augmentation: 2; automatic speech recognition: 2; post processing: 2; image analysis: 2; acoustic scene classification: 2; convolutional neural networks: 2; improved minima controlled recursive averaging: 2; recurrent neural nets: 2; speech intelligibility: 2; fully convolutional neural network: 2; domain adaptation: 2; generalized gaussian distribution: 2; mean square error methods: 2; maximum likelihood estimation: 2; least mean squares methods: 2; gaussian distribution: 2; ideal ratio mask: 2; convolutional neural nets: 2; transfer learning: 2; task generic: 1; measurement: 1; optimization objective: 1; distortion measurement: 1; diffusion model: 1; mathematical models: 1; score based: 1; speech denoising: 1; interpolating diffusion model: 1; interpolation: 1; topology: 1; multi channel speech enhancement: 1; chime 7 challenge: 1; iterative mask estimation: 1; redundancy: 1; feature fusion: 1; multi modal emotion recognition: 1; entropy based fusion: 1; structured pruning: 1; network architecture optimization: 1; target speaker extraction: 1; real world scenarios: 1; oral communication: 1; memory management: 1; chime challenge: 1; graphics processing units: 1; sequence to sequence architecture: 1; codes: 1; degradation: 1; knowledge based systems: 1; boosting: 1; multilingual automatic speech recognition: 1; articulatory speech attributes: 1; adaptive refinement: 1; dictionary learning: 1; adaptive systems: 1; dynamic mask: 1; data quality control: 1; time domain analysis: 1; synchronization: 1; dcase 2022: 1; testing: 1; sound event localization and detection: 1; model architecture: 1; realistic data: 1; location awareness: 1; tv: 1; quality assessment: 1; convolution: 1; kernel: 1; encoding: 1; visual embedding reconstruction: 1; acoustic distortion: 1; learning systems: 1; public domain software: 1; wake word spotting: 1; audio visual systems: 1; microphone array: 1; decoding: 1; speech coding: 1; ts vad: 1; m2met: 1; dihard iii challenge: 1; filtering: 1; iteration: 1; signal processing algorithms: 1; robust automatic speech recognition: 1; acoustic model: 1; neural net architecture: 1; probability: 1; cross entropy: 1; entropy: 1; optimisation: 1; deep neural network (dnn): 1; local response normalization: 1; multi level and adaptive fusion: 1; face recognition: 1; factorized bilinear pooling: 1; multimodal emotion recognition: 1; analytical models: 1; class activation mapping: 1; adaptive noise and speech estimation: 1; computer architecture: 1; additives: 1; noise reduction: 1; computational modeling: 1; convolutional layers: 1; sehae: 1; hierarchical autoencoder: 1; data privacy: 1; acoustic modeling: 1; and federated learning: 1; quantum machine learning: 1; microphone arrays: 1; snr progressive learning: 1; neural network: 1; dense structure: 1; acoustic segment model: 1; semantics: 1; attention mechanism: 1; label embedding: 1; knowledge representation: 1; backpropagation: 1; maximum likelihood: 1; shape factors update: 1; multi objective learning: 1; tensors: 1; tensor train network: 1; tensor to vector regression: 1; speech activity detection: 1; snr estimation: 1; dihard data: 1; geometric constraint: 1; geometry: 1; linear programming: 1; lstm: 1; 2d to 2d mapping: 1; fuzzy neural nets: 1; performance evaluation: 1; source separation: 1; child speech extraction: 1; realistic conditions: 1; measures: 1; signal classification: 1; noise robustness: 1; adversarial robustness: 1; gradient methods: 1; speech recognition safety: 1; adversarial examples: 1; prediction error modeling: 1; gaussian processes: 1; pattern classification: 1; non native tone modeling and mispronunciation detection: 1; computer assisted pronunciation training (capt): 1; natural language processing: 1; computer assisted language learning (call): 1; function approximation: 1; expressive power: 1; universal approximation: 1; vector to vector regression: 1; improved speech presence probability: 1; error statistics: 1; deep learning based speech enhancement: 1; noise robust speech recognition: 1; cross modal training: 1; environmental aware training: 1; databases: 1; student teacher training: 1; audio visual speech recognition: 1; multiple speakers: 1; interference: 1; speaker dependent speech separation: 1; chime 5 challenge: 1; arrays: 1; acoustic noise: 1; statistical speech enhancement: 1; signal denoising: 1; gain function: 1
Most publications (all venues): 2023: 23; 2020: 23; 2017: 20; 2016: 20; 2014: 20

Affiliations
Georgia Institute of Technology, School of Electrical and Computer Engineering, USA
Bell Laboratories, Dialogue Systems Research Department, Murray Hill, New Jersey, USA (1981-2001)

Recent publications

TASLP2024 Hang Chen, Qing Wang 0008, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 0001
Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition.

TASLP2024 Zilu Guo, Qing Wang 0008, Jun Du, Jia Pan, Qing-Feng Liu, Chin-Hui Lee 0001
A Variance-Preserving Interpolation Approach for Diffusion Models With Applications to Single Channel Speech Enhancement and Recognition.

ICASSP2024 Feng Ma, Yanhui Tu, Maokui He, Ruoyu Wang 0029, Shutong Niu, Lei Sun 0010, Zhongfu Ye, Jun Du, Jia Pan, Chin-Hui Lee 0001
A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition.

ICASSP2024 Haotian Wang, Jun Du, Yusheng Dai, Chin-Hui Lee 0001, Yuling Ren, Yu Liu, 
Improving Multi-Modal Emotion Recognition Using Entropy-Based Fusion and Pruning-Based Network Architecture Optimization.

ICASSP2024 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang 0029, Hongbo Lan, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao, 
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.

ICASSP2024 Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang 0029, Yanyan Yue, Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee 0001
Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture.

ICASSP2024 Hao Yen, Sabato Marco Siniscalchi, Chin-Hui Lee 0001
Boosting End-to-End Multilingual Phoneme Recognition Through Exploiting Universal Speech Attributes Constraints.

SpeechComm2023 Shi Cheng, Jun Du, Shutong Niu, Alejandrina Cristià, Xin Wang 0037, Qing Wang 0008, Chin-Hui Lee 0001
Using iterative adaptation and dynamic mask for child speech extraction under real-world multilingual conditions.

SpeechComm2023 Li Chai 0002, Hang Chen, Jun Du, Qing-Feng Liu, Chin-Hui Lee 0001
Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech.

TASLP2023 Mao-Kui He, Jun Du, Qing-Feng Liu, Chin-Hui Lee 0001
ANSD-MA-MSE: Adaptive Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding.

TASLP2023 Shutong Niu, Jun Du, Lei Sun 0010, Yu Hu 0003, Chin-Hui Lee 0001
QDM-SSD: Quality-Aware Dynamic Masking for Separation-Based Speaker Diarization.

ICASSP2023 Hang Chen, Shilong Wu, Yusheng Dai, Zhe Wang, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
Summary on the Multimodal Information Based Speech Processing (MISP) 2022 Challenge.

ICASSP2023 Shutong Niu, Jun Du, Qing Wang 0008, Li Chai 0002, Huaxin Wu, Zhaoxu Nian, Lei Sun 0010, Yi Fang, Jia Pan, Chin-Hui Lee 0001
An Experimental Study on Sound Event Localization and Detection Under Realistic Testing Conditions.

ICASSP2023 Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The Multimodal Information Based Speech Processing (Misp) 2022 Challenge: Audio-Visual Diarization And Recognition.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee 0001
A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition.

ICASSP2023 Chenyue Zhang, Hang Chen, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 0001
Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement.

Interspeech2023 Zilu Guo, Jun Du, Chin-Hui Lee 0001, Yu Gao, Wenbin Zhang, 
Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement.

Interspeech2023 Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee 0001
A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models.

Interspeech2023 Shutong Niu, Jun Du, Maokui He, Chin-Hui Lee 0001, Baoxiang Li, Jiakui Li, 
Unsupervised Adaptation with Quality-Aware Masking to Improve Target-Speaker Voice Activity Detection for Speaker Diarization.

Interspeech2023 Haotian Wang, Jun Du, Hengshun Zhou, Chin-Hui Lee 0001, Yuling Ren, Jiangjiang Zhao, 
A Multiple-Teacher Pruning Based Self-Distillation (MT-PSD) Approach to Model Compression for Audio-Visual Wake Word Spotting.

#25  | John H. L. Hansen | DBLP Google Scholar  
By venue: Interspeech: 39; ICASSP: 14; TASLP: 9; SpeechComm: 6
By year: 2024: 7; 2023: 10; 2022: 13; 2021: 9; 2020: 9; 2019: 12; 2018: 8
ISCA sessions: speech recognition: 2; applications in transcription, education and learning: 2; dereverberation and echo cancellation: 2; speaker recognition challenges and applications: 2; integrating speech science and technology for clinical applications: 2; speech coding and enhancement: 1; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; speaker and language identification: 1; spoken language processing: 1; pathological speech analysis: 1; resource-constrained asr: 1; speech representation: 1; speech enhancement and intelligibility: 1; embedding and network architecture for speaker recognition: 1; multi-, cross-lingual and other topics in asr: 1; asr technologies and systems: 1; target speaker detection, localization and separation: 1; speech and audio quality assessment: 1; language learning: 1; the fearless steps challenge phase-02: 1; speaker embedding: 1; topics in speech and audio signal processing: 1; speaker recognition and diarization: 1; language learning and databases: 1; speech perception in adverse listening conditions: 1; speech enhancement: 1; speaker and language recognition: 1; speech and audio source separation and scene analysis: 1; speaker verification: 1; speaker verification using neural network methods: 1; adjusting to speaker, accent, and domain: 1; spoken corpora and annotation: 1; speech analysis and representation: 1; signal analysis for the natural, biological and social sciences: 1
IEEE keywords: speaker recognition: 8; task analysis: 5; speaker verification: 4; convolutional neural nets: 4; speech enhancement: 3; convolution: 3; transformers: 3; time frequency analysis: 3; computational modeling: 3; adaptation models: 3; deep neural network: 2; transformer: 2; reverberation: 2; transfer learning: 2; training data: 2; data models: 2; speaker embedding: 2; switches: 2; speech recognition: 2; generative adversarial networks: 2; audio signal processing: 2; neural net architecture: 2; calibration: 2; overlapping speech detection: 2; co channel speech detection: 2; speech separation: 2; natural language processing: 2; domain adaptation: 2; deformable convolutional networks: 1; monaural dereverberation: 1; filtering: 1; microphones: 1; minimum variance distortionless response: 1; deep filtering: 1; reflection: 1; distortion: 1; harmonic analysis: 1; noise measurement: 1; u net: 1; decoding: 1; complex valued network: 1; frequency transformation block: 1; massive naturalistic community resource: 1; nasa: 1; nasa apollo missions: 1; psychology: 1; fearless steps: 1; fs apollo: 1; auditory system: 1; real time systems: 1; cci mobile: 1; situational signal processing: 1; "emaging": 1; non linguistic: 1; tagging: 1; cochlear implants: 1; sound source localization (ssl): 1; wearable and portable devices: 1; cochlear implant (ci): 1; location awareness: 1; signal processing algorithms: 1; artificial neural networks: 1; blind speech dereverberation: 1; cepstral analysis: 1; measurement: 1; all pass system: 1; channel estimation: 1; minimum phase: 1; costs: 1; parameter efficiency: 1; adapter: 1; pre trained model: 1; error analysis: 1; graph networks: 1; complexity theory: 1; data augmentation: 1; fearless steps apollo: 1; focusing: 1; historical archiving: 1; speaker diarization: 1; continual learning: 1; speech recognition: 1; end to end systems: 1; domain expansion: 1; accented speech: 1; model adaptation: 1; attention: 1; context modeling: 1; dct transformation: 1; aggregates: 1; discrete cosine transforms: 1; global context modeling: 1; noise robustness: 1; energy consumption: 1; filterbank learning: 1; performance evaluation: 1; robustness: 1; small footprint: 1; keyword spotting: 1; filter banks: 1; end to end: 1; operating systems: 1; data mining: 1; self attention: 1; conformer: 1; swin transformer: 1; deep neural networks: 1; forensics: 1; discrepancy loss: 1; text analysis: 1; multi source domain adaptation: 1; domain adversarial training: 1; moment matching: 1; maximum mean discrepancy: 1; disentangled representation learning: 1; audio generation: 1; guided representation learning: 1; and generative adversarial neural network: 1; signal representation: 1; optimisation: 1; lombard effect: 1; whisper/vocal effort: 1; signal detection: 1; 1 d cnn: 1; convolutional neural network: 1; speech synthesis: 1; cocktail party problem: 1; speech modeling: 1; simultaneous speaker detection: 1; residual learning: 1; binary classifier: 1; adversarial domain adaptation: 1; deep learning (artificial intelligence): 1; embedding disentangling: 1; phone embedding: 1; computer assisted language learning: 1; mispronunciation verification: 1; siamese networks: 1; source counting: 1; mixed speech: 1; convolutional neural networks: 1; voice activity detection: 1; peer led team learning: 1; speaker clustering: 1; audio diarization: 1; sincnet: 1; speaker representation: 1; mixers: 1; adversarial training: 1; nist sre: 1; embedded systems: 1; pattern classification: 1; semi supervised learning: 1; mixture models: 1; unsupervised learning: 1; arabic dialect identification: 1; language identification: 1; i vector: 1; gaussian processes: 1
Most publications (all venues): 2010: 35; 2014: 34; 2015: 32; 2017: 31; 2016: 31


Recent publications

TASLP2024 Vinay Kothapally, John H. L. Hansen
Monaural Speech Dereverberation Using Deformable Convolutional Networks.

TASLP2024 Nursadul Mamun, John H. L. Hansen
Speech Enhancement for Cochlear Implant Recipients Using Deep Complex Convolution Transformer With Frequency Transformation.

ICASSP2024 John H. L. Hansen, Aditya Joglekar, Meena M. Chandra Shekar, Szu-Jui Chen, Xi Liu, 
Fearless Steps Apollo: Team Communications Based Community Resource Development for Science, Technology, Education, and Historical Preservation.

ICASSP2024 Taylor Lawson, John H. L. Hansen
Situational Signal Processing with Ecological Momentary Assessment: Leveraging Environmental Context for Cochlear Implant Users.

ICASSP2024 Xi Liu, Szu-Jui Chen, John H. L. Hansen
Dual-Path Minimum-Phase and All-Pass Decomposition Network for Single Channel Speech Dereverberation.

ICASSP2024 Mufan Sang, John H. L. Hansen
Efficient Adapter Tuning of Pre-Trained Speech Models for Automatic Speaker Verification.

ICASSP2024 Meena M. Chandra Shekar, John H. L. Hansen
Apollo's Unheard Voices: Graph Attention Networks for Speaker Diarization and Clustering for Fearless Steps Apollo Collection.

SpeechComm2023 Midia Yousefi, John H. L. Hansen
Single-channel speech separation using soft-minimum permutation invariant training.

TASLP2023 Shahram Ghorbani, John H. L. Hansen
Domain Expansion for End-to-End Speech Recognition: Applications for Accent/Dialect Speech.

TASLP2023 Wei Xia, John H. L. Hansen
Attention and DCT Based Global Context Modeling for Text-Independent Speaker Recognition.

ICASSP2023 Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen 0001, John H. L. Hansen
Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting.

ICASSP2023 Mufan Sang, Yong Zhao 0008, Gang Liu 0001, John H. L. Hansen, Jian Wu 0027, 
Improving Transformer-Based Networks with Locality for Automatic Speaker Verification.

Interspeech2023 Nursadul Mamun, John H. L. Hansen
CFTNet: Complex-valued Frequency Transformation Network for Speech Enhancement.

Interspeech2023 Meena M. Chandra Shekar, John H. L. Hansen
Speaker Tracking using Graph Attention Networks with Varying Duration Utterances across Multi-Channel Naturalistic Data: Fearless Steps Apollo-11 Audio Corpus.

Interspeech2023 Ram C. M. C. Shekar, Mu Yang, Kevin Hirschi, Stephen D. Looney, Okim Kang, John H. L. Hansen
Assessment of Non-Native Speech Intelligibility using Wav2vec2-based Mispronunciation Detection and Multi-level Goodness of Pronunciation Transformer.

Interspeech2023 Jiamin Xie, John H. L. Hansen
MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition.

Interspeech2023 Mu Yang, Ram C. M. C. Shekar, Okim Kang, John H. L. Hansen
What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model.

SpeechComm2022 Rasa Lileikyte, Dwight Irvin, John H. L. Hansen
Assessing child communication engagement and statistical speech patterns for American English via speech recognition in naturalistic active learning spaces.

TASLP2022 Vinay Kothapally, John H. L. Hansen
SkipConvGAN: Monaural Speech Dereverberation Using Generative Adversarial Networks via Complex Time-Frequency Masking.

TASLP2022 Zhenyu Wang, John H. L. Hansen
Multi-Source Domain Adaptation for Text-Independent Forensic Speaker Recognition.

#26  | Yu Tsao 0001 | DBLP Google Scholar  
By venue: Interspeech: 38; ICASSP: 17; TASLP: 9; ICLR: 2; NeurIPS: 1; ICML: 1
By year: 2024: 6; 2023: 9; 2022: 20; 2021: 9; 2020: 9; 2019: 12; 2018: 3
ISCA sessions: speech enhancement and intelligibility: 5; speech enhancement: 5; single-channel speech enhancement: 4; speech, voice, and hearing disorders: 2; dereverberation, noise reduction, and speaker extraction: 2; voice conversion and adaptation: 2; speech synthesis: 2; neural techniques for voice conversion and waveform generation: 2; speech coding and enhancement: 1; speech recognition: 1; speech production, perception and multimodality: 1; the voicemos challenge: 1; source separation: 1; speech intelligibility prediction for hearing-impaired listeners: 1; speech coding and privacy: 1; noise reduction and intelligibility: 1; intelligibility-enhancing speech modification: 1; model training for asr: 1; speech and audio classification: 1; speech intelligibility and quality: 1; audio events and acoustic scenes: 1; voice conversion: 1
IEEE keywords: speech enhancement: 13; speech recognition: 6; predictive models: 4; measurement: 3; unsupervised learning: 3; pattern classification: 3; generative adversarial networks: 2; error analysis: 2; self supervised learning: 2; task analysis: 2; ensemble learning: 2; perturbation methods: 2; speaker verification: 2; adaptation models: 2; spoken language understanding: 2; robustness: 2; convolutional neural nets: 2; deep learning (artificial intelligence): 2; signal denoising: 2; generative model: 2; deep neural network: 2; audio signal processing: 2; decoding: 2; natural language processing: 2; stargan: 1; face masked speech enhancement: 1; human in the loop: 1; noise measurement: 1; generators: 1; noise: 1; recording: 1; sinkhorn attention: 1; cross modality alignment: 1; transformers: 1; automatic speech recognition (asr): 1; pretrained language model (plm): 1; linguistics: 1; linear programming: 1; evaluation: 1; audio visual learning: 1; representation learning: 1; benchmark testing: 1; soft sensors: 1; visualization: 1; scalability: 1; rendering (computer graphics): 1; purification: 1; adversarial sample detection: 1; adversarial attack: 1; user experience: 1; multiprotocol label switching: 1; 3quest: 1; knowledge transfer: 1; sdi: 1; speech quality prediction: 1; multitasking: 1; speech intelligibility prediction: 1; stoi: 1; pesq: 1; quality assessment: 1; robust automatic speech recognition: 1; hidden markov models: 1; articulatory attribute: 1; broad phonetic classes: 1; phonetics: 1; end to end: 1; non intrusive speech assessment models: 1; acoustic distortion: 1; psychoacoustic models: 1; multi objective learning: 1; codes: 1; computational modeling: 1; spoken question answering: 1; speech translation: 1; speech coding: 1; question answering (information retrieval): 1; tokenization: 1; mos: 1; auditory system: 1; perturbation: 1; speech quality models: 1; adversarial examples: 1; data privacy: 1; low quality data: 1; data compression: 1; audio visual systems: 1; recurrent neural nets: 1; asynchronous multimodal learning: 1; audio visual: 1; floating point arithmetic: 1; deep neural network model compression: 1; inference acceleration: 1; adders: 1; speech dereverberation: 1; floating point integer arithmetic circuit: 1; unsupervised speech enhancement: 1; metricgan: 1; supervised learning: 1; reverberation: 1; speech recovery: 1; intermittent systems: 1; internet of things: 1; performance evaluation: 1; data models: 1; speech signal processing: 1; energy harvesting: 1; interpolation: 1; generative adversarial network: 1; unsupervised asr: 1; training data: 1; signal processing algorithms: 1; diffusion probabilistic model: 1; sensor fusion: 1; non invasive: 1; multimodal: 1; medical signal processing: 1; electromyography: 1; biometrics (access control): 1; security of data: 1; partially fake audio detection: 1; anti spoofing: 1; audio deep synthesis detection challenge: 1; speech synthesis: 1; quantum computing: 1; text analysis: 1; quantum machine learning: 1; text classification: 1; temporal convolution: 1; and heterogeneous computing: 1; bayes methods: 1; joint bayesian model: 1; affine transforms: 1; discriminative model: 1; speaker recognition: 1; statistical distributions: 1; unsupervised domain adaptation: 1; optimal transport: 1; spoken language identification: 1; maml: 1; meta learning: 1; source separation: 1; speech separation: 1; anil: 1; support vector machines: 1; phonotactic language recognition: 1; subspace based learning: 1; matrix decomposition: 1; subspace based representation: 1; gaussian processes: 1; multichannel speech enhancement: 1; distributed microphones: 1; fully convolutional network (fcn): 1; microphones: 1; phase estimation: 1; inner ear microphones: 1; raw waveform mapping: 1; generalizability: 1; dynamically sized decision tree: 1; decision trees: 1; deep neural networks: 1; regression analysis: 1; deep denoising autoencoder: 1; signal classification: 1; automatic speech recognition: 1; character error rate: 1; mean square error methods: 1; reinforcement learning: 1
Most publications (all venues) at: 2022: 46, 2023: 42, 2021: 38, 2019: 36, 2017: 31

Affiliations
Academia Sinica, Research Center for Information Technology Innovation, Taipei, Taiwan

Recent publications

TASLP2024 Syu-Siang Wang, Jia-Yang Chen, Bo-Ren Bai, Shih-Hau Fang, Yu Tsao 0001
Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics.

ICASSP2024 Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai, 
Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-Based ASR.

ICASSP2024 Yuan Tseng, Layne Berry, Yiting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Poyao Huang 0001, Chun-Mao Lai, Shang-Wen Li 0001, David Harwath, Yu Tsao 0001, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee, 
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models.

ICASSP2024 Haibin Wu, Heng-Cheng Kuo, Yu Tsao 0001, Hung-Yi Lee, 
Scalable Ensemble-Based Detection Method Against Adversarial Attacks For Speaker Verification.

ICASSP2024 Ryandhimas E. Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao 0001
Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model.

ICLR2024 Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao 0001, Yu-Chiang Frank Wang, 
Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech.

TASLP2023 Yen-Ju Lu, Chia-Yu Chang, Cheng Yu, Ching-Feng Liu, Jeih-weih Hung, Shinji Watanabe 0001, Yu Tsao 0001
Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information.

TASLP2023 Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen 0011, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao 0001
Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features.

ICASSP2023 Chan-Jan Hsu, Ho-Lam Chung, Hung-Yi Lee, Yu Tsao 0001
T5lephone: Bridging Speech and Text Self-Supervised Models for Spoken Language Understanding Via Phoneme Level T5.

ICASSP2023 Hsin-Yi Lin, Huan-Hsin Tseng, Yu Tsao 0001
On the Robustness of Non-Intrusive Speech Quality Model by Adversarial Examples.

Interspeech2023 Hsin-Hao Chen 0006, Yung-Lun Chien, Ming-Chi Yen, Shu-Wei Tsai, Tai-Shih Chi, Hsin-Min Wang, Yu Tsao 0001
Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features.

Interspeech2023 Li-Wei Chen, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao 0001, Hsin-Min Wang, 
A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech.

Interspeech2023 Yung-Lun Chien, Hsin-Hao Chen 0006, Ming-Chi Yen, Shu-Wei Tsai, Hsin-Min Wang, Yu Tsao 0001, Tai-Shih Chi, 
Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion.

Interspeech2023 Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Yu Tsao 0001
Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition.

ICLR2023 Chi-Chang Lee, Yu Tsao 0001, Hsin-Min Wang, Chu-Song Chen, 
D4AM: A General Denoising Framework for Downstream Acoustic Models.

TASLP2022 Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao 0001
Improved Lite Audio-Visual Speech Enhancement.

TASLP2022 Yu-Chen Lin, Cheng Yu, Yi-Te Hsu, Szu-Wei Fu, Yu Tsao 0001, Tei-Wei Kuo, 
SEOFP-NET: Compression and Acceleration of Deep Neural Networks for Speech Enhancement Using Sign-Exponent-Only Floating-Points.

ICASSP2022 Szu-Wei Fu, Cheng Yu, Kuo-Hsuan Hung, Mirco Ravanelli, Yu Tsao 0001
MetricGAN-U: Unsupervised Speech Enhancement/Dereverberation Based Only on Noisy/Reverberated Speech.

ICASSP2022 Yu-Chen Lin, Tsun-An Hsieh, Kuo-Hsuan Hung, Cheng Yu, Harinath Garudadri, Yu Tsao 0001, Tei-Wei Kuo, 
Speech Recovery For Real-World Self-Powered Intermittent Devices.

ICASSP2022 Guan-Ting Lin, Chan-Jan Hsu, Da-Rong Liu, Hung-Yi Lee, Yu Tsao 0001
Analyzing The Robustness of Unsupervised Speech Recognition.

#27  | Jing Xiao 0006 | DBLP Google Scholar  
By venue: Interspeech: 34, ICASSP: 30, ICML: 2, TASLP: 1, EMNLP-Findings: 1
By year: 2024: 7, 2023: 15, 2022: 15, 2021: 18, 2020: 12, 2019: 1
ISCA sessions: speech synthesis: 9; topics in asr: 2; speech, voice, and hearing disorders: 1; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; speech activity detection and modeling: 1; analysis of speech and audio signals: 1; speech perception, production, and acquisition: 1; speaker and language identification: 1; question answering from speech: 1; speech emotion recognition: 1; source separation: 1; novel models and training methods for asr: 1; multi-, cross-lingual and other topics in asr: 1; spoken language modeling and understanding: 1; acoustic event detection and classification: 1; non-autoregressive sequential modeling for speech processing: 1; speech signal analysis and representation: 1; graph and end-to-end learning for speaker recognition: 1; embedding and network architecture for speaker recognition: 1; acoustic event detection and acoustic scene classification: 1; voice conversion and adaptation: 1; spoken language understanding: 1; dnn architectures for speaker recognition: 1; speech and audio quality assessment: 1; phonetic event detection and segmentation: 1
IEEE keywords: speech synthesis: 14; speech recognition: 8; voice conversion: 6; task analysis: 6; text to speech: 5; natural language processing: 5; speaker recognition: 4; computational modeling: 3; contrastive learning: 3; timbre: 3; predictive models: 3; end to end: 2; emotion recognition: 2; emotional speech synthesis: 2; fuses: 2; mutual information: 2; adaptation models: 2; linguistics: 2; correlation: 2; computer vision: 2; multi modal: 2; convolution: 2; vector quantization: 2; dynamic programming: 2; zero shot: 2; text analysis: 2; transformer: 2; couplings: 1; differentiable aligner: 1; vae: 1; hierarchical vae: 1; computer architecture: 1; time invariant retrieval: 1; data mining: 1; self supervised learning: 1; phonetics: 1; noise reduction: 1; speech emotion diarization: 1; diffusion denoising probabilistic model: 1; probabilistic logic: 1; static var compensators: 1; emotion decoupling: 1; adaptive style fusion: 1; adaptive systems: 1; singing voice conversion: 1; llm: 1; model bias: 1; text categorization: 1; zero shot learning: 1; bias leverage: 1; robustness: 1; few shot learning: 1; knn methods: 1; gold: 1; automatic speech recognition: 1; benchmark testing: 1; monotonic alignment: 1; asr: 1; environmental sound classification: 1; data free: 1; audio classification: 1; knowledge distillation: 1; multiple signal classification: 1; music genre classification: 1; multi label: 1; contrastive loss: 1; symmetric cross modal attention: 1; adversarial learning: 1; speech representation disentanglement: 1; linear programming: 1; intonation intensity control: 1; relative attribute: 1; aligned cross entropy: 1; entropy: 1; non autoregressive asr: 1; mask ctc: 1; brain modeling: 1; time frequency analysis: 1; feature fusion: 1; federated learning: 1; graph convolution network: 1; electroencephalogram: 1; regression analysis: 1; pattern classification: 1; variance regularization: 1; attribute inference: 1; speaker age estimation: 1; label distribution learning: 1; any to any: 1; object detection: 1; self supervised: 1; low resource: 1; query processing: 1; pattern clustering: 1; interactive systems: 1; visual dialog: 1; patch embedding: 1; question answering (information retrieval): 1; incomplete utterance rewriting: 1; self attention weight matrix: 1; text edit: 1; synthetic noise: 1; adversarial perturbation: 1; contextual information: 1; grapheme to phoneme: 1; multi speaker text to speech: 1; conditional variational autoencoder: 1; nat: 1; end to end speech recognition: 1; parallel processing: 1; sampling methods: 1; single step generation: 1; ctc alignments: 1; intent detection: 1; continual learning: 1; computational linguistics: 1; slot filling: 1; grammar: 1; error analysis: 1; pointer generator network: 1; generators: 1; parameter generator: 1; semiotics: 1; text normalization: 1; unsupervised: 1; data acquisition: 1; information bottleneck: 1; unsupervised learning: 1; instance discriminator: 1; recurrent neural nets: 1; self attention: 1; rnn transducer: 1; feature maps: 1; network pruning: 1; matrix algebra: 1; pqr: 1; wireless channels: 1; linear dependency analysis: 1; waveform generators: 1; vocoders: 1; waveform generation: 1; location variable convolution: 1; vocoder: 1; convolutional codes: 1; strain: 1; speaker clustering: 1; aggregation hierarchy cluster: 1; digital tv: 1; analytical models: 1; tied variational autoencoder: 1; clustering methods: 1; generative flow: 1; non autoregressive: 1; autoregressive processes: 1; speech coding: 1; prosody modelling: 1; graph theory: 1; graph neural network: 1; baum welch algorithm: 1; real time systems: 1; signal processing algorithms: 1; feed forward transformer: 1
Most publications (all venues) at: 2021: 95, 2022: 76, 2020: 65, 2023: 57, 2024: 38

Affiliations
PingAn Technology, Shenzhen, China
Epson Research and Development, San Jose, CA, USA (former)
Carnegie Mellon University, Robotics Institute, Pittsburgh, PA, USA (PhD 2005)

Recent publications

TASLP2024 Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006
EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion.

ICASSP2024 Yimin Deng, Huaizhen Tang, Xulong Zhang 0001, Ning Cheng 0001, Jing Xiao 0006, Jianzong Wang, 
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval.

ICASSP2024 Haobin Tang, Xulong Zhang 0001, Ning Cheng 0001, Jing Xiao 0006, Jianzong Wang, 
ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis.

ICASSP2024 Zeyu Yang, Minchuan Chen, Yanping Li, Wei Hu, Shaojun Wang, Jing Xiao 0006, Zijian Li, 
ESVC: Combining Adaptive Style Fusion and Multi-Level Feature Disentanglement for Expressive Singing Voice Conversion.

ICASSP2024 Yong Zhang, Hanzhang Li, Zhitao Li, Ning Cheng 0001, Ming Li, Jing Xiao 0006, Jianzong Wang, 
Leveraging Biases in Large Language Models: "bias-kNN" for Effective Few-Shot Learning.

ICASSP2024 Ziyang Zhuang, Kun Zou, Chenfeng Miao, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao 0006
Improving Attention-Based End-to-End Speech Recognition by Monotonic Alignment Attention Matrix Reconstruction.

ICML2024 Chenfeng Miao, Qingying Zhu, Minchuan Chen, Wei Hu, Zijian Li, Shaojun Wang, Jing Xiao 0006
DFlow: A Generative Model Combining Denoising AutoEncoder and Normalizing Flow for High Fidelity Waveform Generation.

ICASSP2023 Zuheng Kang, Yayun He, Jianzong Wang, Junqing Peng, Xiaoyang Qu, Jing Xiao 0006
Feature-Rich Audio Model Inversion for Data-Free Knowledge Distillation Towards General Sound Classification.

ICASSP2023 Ganghui Ru, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
Improving Music Genre Classification from multi-modal Properties of Music and Genre Correlations Perspective.

ICASSP2023 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
Learning Speech Representations with Flexible Hidden Feature Dimensions.

ICASSP2023 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization.

ICASSP2023 Haobin Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis.

ICASSP2023 Xulong Zhang 0001, Haobin Tang, Jianzong Wang, Ning Cheng 0001, Jian Luo, Jing Xiao 0006
Dynamic Alignment Mask CTC: Improved Mask CTC With Aligned Cross Entropy.

ICASSP2023 Kexin Zhu, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
Improving EEG-based Emotion Recognition by Fusing Time-Frequency and Spatial Representations.

Interspeech2023 Minchuan Chen, Chenfeng Miao, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006
Exploring multi-task learning and data augmentation in dementia detection with self-supervised pretrained models.

Interspeech2023 Jiaxin Fan, Yong Zhang, Hanzhang Li, Jianzong Wang, Zhitao Li, Sheng Ouyang, Ning Cheng 0001, Jing Xiao 0006
Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism.

Interspeech2023 Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao 0006
SVVAD: Personal Voice Activity Detection for Speaker Verification.

Interspeech2023 Yifu Sun, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Kaiyu Hu, Jing Xiao 0006
Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning.

Interspeech2023 Fengyun Tan, Chaofeng Feng, Tao Wei, Shuai Gong, Jinqiang Leng, Wei Chu, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006
Improving End-to-End Modeling For Mandarin-English Code-Switching Using Lightweight Switch-Routing Mixture-of-Experts.

Interspeech2023 Haobin Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis.

#28  | Shrikanth Narayanan | DBLP Google Scholar  
By venue: ICASSP: 32, Interspeech: 32, TASLP: 1, ACL: 1
By year: 2024: 5, 2023: 11, 2022: 4, 2021: 7, 2020: 18, 2019: 11, 2018: 10
ISCA sessions: trustworthy speech processing: 2; speaker recognition and diarization: 2; speech and language analytics for mental health: 2; speaker state and trait: 2; speech pathology, depression, and medical applications: 2; phonetics, phonology, and prosody: 1; speaker and language diarization: 1; pathological speech analysis: 1; keynote 1 isca medallist: 1; connecting speech-science and speech-technology for children's speech: 1; assessment of pathological speech and language: 1; emotion and sentiment analysis: 1; phonetics: 1; speech enhancement, bandwidth extension and hearing aids: 1; the interspeech 2020 far field speaker verification challenge: 1; evaluation of speech technology systems and methods for resource construction and annotation: 1; speech in health: 1; speech signal characterization: 1; the voices from a distance challenge: 1; emotion and personality in conversation: 1; the second dihard speech diarization challenge (dihard ii): 1; topics in speech and audio signal processing: 1; integrating speech science and technology for clinical applications: 1; speaker diarization: 1; emotion recognition and analysis: 1; spoken corpora and annotation: 1; novel approaches to enhancement: 1
IEEE keywords: speech recognition: 14; emotion recognition: 8; speaker recognition: 8; annotations: 5; task analysis: 5; speech: 4; speaker diarization: 4; computational modeling: 3; pipelines: 3; visualization: 3; child speech: 3; pattern clustering: 3; benchmark testing: 2; data models: 2; natural languages: 2; speaker classification: 2; autism: 2; speech emotion recognition: 2; predictive models: 2; music: 2; signal processing algorithms: 2; data privacy: 2; adversarial training: 2; speaker embeddings: 2; x vector: 2; clustergan: 2; hospitals: 2; signal classification: 2; robustness: 2; annotation fusion: 2; behavioural sciences computing: 2; convolutional neural nets: 2; video signal processing: 2; autism spectrum disorder: 2; pattern classification: 2; medical disorders: 2; audio signal processing: 2; trustworthiness: 1; system performance: 1; self supervision: 1; speech enhancement: 1; large language model: 1; foundation model: 1; video summarization: 1; transformers: 1; data compression: 1; multimodal transformers: 1; image representation: 1; cross modal retrieval: 1; music information retrieval: 1; contrastive learning: 1; tagging: 1; self supervised learning: 1; multimodal learning: 1; semantics: 1; buildings: 1; motion segmentation: 1; lips: 1; audiovisual: 1; voice activity detection: 1; reproducibility of results: 1; emotion evaluation: 1; iemocap: 1; motion capture: 1; protocols: 1; reproducibility: 1; multimodal interaction modeling: 1; tv: 1; face recognition: 1; multimedia: 1; crops: 1; context understanding: 1; multimodal vision language pretrained models: 1; costs: 1; multilingual emotion recognition: 1; emotion clusters: 1; zero shot: 1; audio visual dataset: 1; taxonomy: 1; event detection: 1; audio event detection: 1; audio recognition: 1; medical services: 1; movies: 1; multiple signal classification: 1; software: 1; transformer: 1; vocoders: 1; text to speech synthesis: 1; music audio synthesis: 1; analytical models: 1; neural vocoder: 1; tacotron: 1; catalysts: 1; federated learning: 1; audio benchmarks: 1; machine learning: 1; statistical privacy: 1; speech emotion: 1; noise injection: 1; fairness: 1; nme sc: 1; generative adversarial networks: 1; clustering algorithms: 1; gallium nitride: 1; prototypes: 1; mcgan: 1; circadian rhythms: 1; diseases: 1; medical signal processing: 1; recurrent neural nets: 1; personnel: 1; ubiquitous computing: 1; health care: 1; statistics: 1; hybrid adversarial training: 1; multi task objective: 1; perturbation methods: 1; adversarial attack: 1; feature scattering: 1; multi scale: 1; score fusion: 1; uniform segmentation: 1; child forensic interview: 1; law administration: 1; deception detection: 1; behavioral signal processing: 1; triplet embedding: 1; trapezoidal signal regression: 1; signal warping: 1; cost accounting: 1; sequential analysis: 1; behavior: 1; suicidal risk: 1; asr: 1; couples conversations: 1; psychology: 1; military computing: 1; prosody: 1; sensor fusion: 1; support vector machines: 1; machine learning: 1; wearable: 1; time series: 1; wearable computers: 1; routine analysis: 1; data clustering: 1; music emotion recognition: 1; triplet embeddings: 1; inter rater agreement: 1; music perception: 1; segmentation: 1; cnn: 1; biomedical mri: 1; convlstm: 1; medical image processing: 1; rtmri: 1; extraterrestrial measurements: 1; supervised learning: 1; prototypical networks: 1; patient diagnosis: 1; gradient reversal: 1; medical diagnostic computing: 1; paediatrics: 1; natural language processing: 1; domain adversarial learning: 1; affective computing: 1; affective representation: 1; speaker invariant: 1; entropy: 1; signal representation: 1; deep latent space clustering: 1; medical computing: 1; adversarial invariance: 1; robust speaker recognition: 1; spectrogram: 1; document handling: 1; multitask learning: 1; situation awareness: 1; text classification: 1; optimisation: 1; emergency management: 1; clustering: 1; data mining: 1; mouse ultrasonic vocalizations: 1; biocommunications: 1; filtering theory: 1; subspace similarity: 1; sparse subspace clustering: 1; speaker role recognition: 1; lattice rescoring: 1; language model: 1; convolutional neural networks: 1; speech activity detection: 1; movie audio: 1; entertainment: 1; wearable sensing: 1; foreground detection: 1; detectors: 1; speaking patterns: 1; audio: 1; speech activity detector: 1; employment: 1; multitaper: 1; bioelectric potentials: 1; eeg: 1; brain computer interfaces: 1; electroencephalography: 1; delta: 1; syllable: 1
Most publications (all venues) at: 2013: 63, 2011: 53, 2016: 52, 2019: 50, 2008: 50


Recent publications

ICASSP2024 Tiantian Feng, Rajat Hebbar, Shrikanth Narayanan
TRUST-SER: On The Trustworthiness Of Fine-Tuning Pre-Trained Speech Embeddings For Speech Emotion Recognition.

ICASSP2024 Tiantian Feng, Shrikanth Narayanan
Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting.

ICASSP2024 Yoonsoo Nam, Adam Lehavi, Daniel Yang, Digbalay Bose, Swabha Swayamdipta, Shrikanth Narayanan
Does Video Summarization Require Videos? Quantifying the Effectiveness of Language in Video Summarization.

ICASSP2024 Shanti Stewart, Kleanthis Avramidis, Tiantian Feng, Shrikanth Narayanan
Emotion-Aligned Contrastive Learning Between Images and Music.

ICASSP2024 Anfeng Xu, Kevin Huang, Tiantian Feng, Helen Tager-Flusberg, Shrikanth Narayanan
Audio-Visual Child-Adult Speaker Classification in Dyadic Interactions.

ICASSP2023 Nikolaos Antoniou, Athanasios Katsamanis, Theodoros Giannakopoulos, Shrikanth Narayanan
Designing and Evaluating Speech Emotion Recognition Systems: A Reality Check Case Study with IEMOCAP.

ICASSP2023 Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Shrikanth Narayanan
Contextually-Rich Human Affect Perception Using Multimodal Scene Information.

ICASSP2023 Georgios Chochlakis, Gireesh Mahajan, Sabyasachee Baruah, Keith Burghardt, Kristina Lerman, Shrikanth Narayanan
Using Emotion Embeddings to Transfer Knowledge between Emotions, Languages, and Annotation Formats.

ICASSP2023 Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai, Shrikanth Narayanan
A Dataset for Audio-Visual Sound Event Detection in Movies.

ICASSP2023 Xuan Shi, Erica Cooper, Xin Wang 0037, Junichi Yamagishi, Shrikanth Narayanan
Can Knowledge of End-to-End Text-to-Speech Models Improve Neural Midi-to-Audio Synthesis Systems?

ICASSP2023 Tuo Zhang, Tiantian Feng, Samiul Alam, Sunwoo Lee, Mi Zhang 0002, Shrikanth S. Narayanan, Salman Avestimehr, 
FedAudio: A Federated Learning Benchmark for Audio Tasks.

Interspeech2023 Reed Blaylock, Shrikanth Narayanan
Beatboxing Kick Drum Kinematics.

Interspeech2023 Rimita Lahiri, Tiantian Feng, Rajat Hebbar, Catherine Lord, So Hyun Kim, Shrikanth Narayanan
Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism.

Interspeech2023 Thomas Melistas, Lefteris Kapelonis, Nikolaos Antoniou, Petros Mitseas, Dimitris Sgouropoulos, Theodoros Giannakopoulos, Athanasios Katsamanis, Shrikanth Narayanan
Cross-Lingual Features for Alzheimer's Dementia Detection from Speech.

Interspeech2023 Shrikanth Narayanan
Bridging Speech Science and Technology - Now and Into the Future.

Interspeech2023 Anfeng Xu, Rajat Hebbar, Rimita Lahiri, Tiantian Feng, Lindsay Butler, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan
Understanding Spoken Language Development of Children with ASD Using Pre-trained Speech Embeddings.

ICASSP2022 Tiantian Feng, Hanieh Hashemi, Murali Annavaram, Shrikanth S. Narayanan
Enhancing Privacy Through Domain Adaptive Noise Injection For Speech Emotion Recognition.

Interspeech2022 Tiantian Feng, Shrikanth Narayanan
Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling.

Interspeech2022 Tiantian Feng, Raghuveer Peri, Shrikanth Narayanan
User-Level Differential Privacy against Attribute Inference Attack of Speech Emotion Recognition on Federated Learning.

Interspeech2022 Nikolaos Flemotomos, Shrikanth Narayanan
Multimodal Clustering with Role Induced Constraints for Speaker Diarization.

#29  | Dan Su 0002 | DBLP Google Scholar  
By venue: Interspeech: 30, ICASSP: 29, ICML: 1, ACL: 1, AAAI: 1, ICLR: 1, IJCAI: 1, TASLP: 1
By year: 2024: 3, 2023: 4, 2022: 15, 2021: 11, 2020: 16, 2019: 10, 2018: 6
ISCA sessions: voice conversion and adaptation: 4; speech synthesis: 4; speech recognition: 2; deep learning for source separation and pitch tracking: 2; speech coding and enhancement: 1; speaker embedding and diarization: 1; tools, corpora and resources: 1; topics in asr: 1; source separation, dereverberation and echo cancellation: 1; novel neural network architectures for asr: 1; multi-channel speech enhancement: 1; speaker recognition: 1; asr neural network architectures and training: 1; new trends in self-supervised speech processing: 1; speech synthesis paradigms and methods: 1; multimodal speech processing: 1; speech enhancement: 1; asr neural network architectures: 1; speaker verification using neural network methods: 1; sequence models for asr: 1; expressive speech synthesis: 1; topics in speech recognition: 1
IEEE keywords: speech recognition: 11; speaker recognition: 8; speech synthesis: 6; speaker verification: 4; multi channel: 4; speech separation: 4; natural language processing: 4; recurrent neural nets: 3; overlapped speech: 3; speech enhancement: 3; data augmentation: 3; microphone arrays: 2; voice activity detection: 2; speaker diarization: 2; multi look: 2; transfer learning: 2; domain adaptation: 2; maximum mean discrepancy: 2; code switching: 2; speaker embedding: 2; attention based model: 2; automatic speech recognition: 2; end to end speech recognition: 2; expressive tts: 1; transformers: 1; bigvgan: 1; durian e: 1; adaptation models: 1; linguistics: 1; style adaptive instance normalization: 1; signal generators: 1; adaptive systems: 1; vits: 1; speaking style: 1; text analysis: 1; conversational text to speech synthesis: 1; graph neural network: 1; low quality data: 1; neural speech synthesis: 1; style transfer: 1; joint training: 1; dual path: 1; acoustic model: 1; echo suppression: 1; streaming: 1; dynamic weight attention: 1; training data: 1; error analysis: 1; three dimensional displays: 1; noisy label: 1; convolution: 1; attention module: 1; multi speaker: 1; knowledge transfer: 1; video to speech synthesis: 1; vector quantization: 1; measurement: 1; voice conversion: 1; knowledge engineering: 1; lips: 1; predictive coding: 1; vocabulary: 1; expert systems: 1; router architecture: 1; mixture of experts: 1; global information: 1; accent embedding: 1; domain embedding: 1; feature fusion: 1; data handling: 1; m2met: 1; direction of arrival estimation: 1; direction of arrival: 1; neural architecture search: 1; transferable architecture: 1; neural net architecture: 1; multi granularity: 1; single channel: 1; self attentive network: 1; source separation: 1; synthetic speech detection: 1; res2net: 1; replay detection: 1; multi scale feature: 1; asv anti spoofing: 1; ctc: 1; non autoregressive: 1; decoding: 1; transformer: 1; autoregressive processes: 1; speaker verification (sv): 1; phonetic posteriorgrams: 1; speech intelligibility: 1; speech coding: 1; end to end: 1; multi channel speech separation: 1; inter channel convolution differences: 1; reverberation: 1; spatial filters: 1; filtering theory: 1; spatial features: 1; parallel optimization: 1; random sampling: 1; model partition: 1; graphics processing units: 1