Speech Rankings

A list of researchers in the area of speech, ordered by the number of relevant publications, intended to help identify potential academic supervisors.

Report exported at 2024-10-16 04:11:58; see here for how it is created.
Export parameters: --year_start 2019 --year_end 2024 --year_shift 1 --author_start_year 1900 --exclude_venue SSW,ASRU,IWSLT,SLT --n_pubs 20 --rank_start 0 --rank_end 200 --output speech_rankings.html
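For reference, the sketch below shows one way a ranking like this could be computed from per-author publication records driven by the export parameters above. It is a minimal sketch under stated assumptions: the Publication record, the rank_researchers helper, and the reading of --year_shift as extending the upper year bound are illustrative, not the actual export script, and --author_start_year (which would need each author's first publication year) is not modeled.

from collections import Counter
from dataclasses import dataclass

@dataclass
class Publication:
    author: str   # canonical author name, e.g. "Shinji Watanabe 0001"
    venue: str    # e.g. "ICASSP", "Interspeech", "TASLP"
    year: int

def rank_researchers(pubs, year_start=2019, year_end=2024, year_shift=1,
                     exclude_venues=("SSW", "ASRU", "IWSLT", "SLT"),
                     n_pubs=20, rank_start=0, rank_end=200):
    """Count publications per author within [year_start, year_end + year_shift],
    skip excluded venues, keep authors with at least n_pubs papers, and return
    the [rank_start:rank_end] slice of the resulting ranking."""
    counts = Counter()
    for p in pubs:
        if p.venue in exclude_venues:
            continue
        if not (year_start <= p.year <= year_end + year_shift):
            continue
        counts[p.author] += 1
    ranked = [(a, c) for a, c in counts.most_common() if c >= n_pubs]
    return ranked[rank_start:rank_end]

if __name__ == "__main__":
    demo = [
        Publication("A. Example", "ICASSP", 2023),
        Publication("A. Example", "Interspeech", 2024),
        Publication("A. Example", "SLT", 2022),  # excluded venue, not counted
    ]
    print(rank_researchers(demo, n_pubs=1))  # -> [('A. Example', 2)]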

#1  | Shinji Watanabe 0001 | DBLP | Google Scholar
By venue: ICASSP: 101, Interspeech: 89, TASLP: 12, ACL: 5, NAACL: 4, AAAI: 2, EMNLP-Findings: 2, ACL-Findings: 1, NeurIPS: 1, IJCAI: 1, ICML: 1
By year: 2024: 28, 2023: 60, 2022: 43, 2021: 34, 2020: 18, 2019: 26, 2018: 10
ISCA sessionsspeech recognition: 8speech synthesis: 4non-autoregressive sequential modeling for speech processing: 4speaker diarization: 4low-resource asr development: 3spoken language translation, information retrieval, summarization, resources, and evaluation: 2spoken dialog systems and conversational analysis: 2spoken language understanding: 2spoken language processing: 2novel models and training methods for asr: 2asr: 2source separation: 2neural networks for language modeling: 2robust speech recognition: 2adjusting to speaker, accent, and domain: 2novel transformer models for asr: 1self-supervised learning in asr: 1search methods and decoding algorithms for asr: 1speech, voice, and hearing disorders: 1articulation: 1robust asr, and far-field/multi-talker asr: 1search/decoding algorithms for asr: 1streaming asr: 1speech enhancement and intelligibility: 1neural transducers, streaming asr and novel asr models: 1speech segmentation: 1adaptation, transfer learning, and distillation for asr: 1speech processing & measurement: 1single-channel and multi-channel speech enhancement: 1spoken dialogue systems and multimodality: 1tools, corpora and resources: 1streaming for asr/rnn transducers: 1acoustic event detection and acoustic scene classification: 1low-resource speech recognition: 1miscellanous topics in asr: 1emotion and sentiment analysis: 1topics in asr: 1cross/multi-lingual and code-switched asr: 1speech signal analysis and representation: 1target speaker detection, localization and separation: 1single-channel speech enhancement: 1asr neural network architectures and training: 1speaker embedding: 1noise robust and distant speech recognition: 1sequence-to-sequence speech recognition: 1asr for noisy and far-field speech: 1speaker recognition: 1speaker recognition evaluation: 1asr neural network training: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1asr neural network architectures: 1speech and voice disorders: 1search methods for speech recognition: 1speech technologies for code-switching in multilingual communities: 1nn architectures for asr: 1language identification: 1sequence models for asr: 1end-to-end speech recognition: 1the first dihard speech diarization challenge: 1deep enhancement: 1recurrent neural models for asr: 1
IEEE keywordsspeech recognition: 59task analysis: 19speech enhancement: 18self supervised learning: 15data models: 14end to end: 14natural language processing: 14decoding: 13predictive models: 13end to end speech recognition: 11computational modeling: 9pipelines: 9benchmark testing: 8ctc: 8spoken language understanding: 8speaker recognition: 8recurrent neural nets: 8adaptation models: 7signal processing algorithms: 7transformers: 7speaker diarization: 7speech coding: 7training data: 6encoding: 6transformer: 6vocabulary: 6automatic speech recognition: 5hidden markov models: 5representation learning: 5asr: 5speech translation: 5speech separation: 5semantics: 5speech synthesis: 5multitasking: 5text analysis: 5source separation: 5transfer learning: 4correlation: 4recording: 4annotations: 4transducers: 4supervised learning: 4time frequency analysis: 4language translation: 4pattern classification: 4signal classification: 4microphone arrays: 4symbols: 3analytical models: 3streaming: 3speech summarization: 3behavioral sciences: 3standards: 3memory management: 3data augmentation: 3measurement: 3oral communication: 3audio visual: 3hubert: 3attention: 3visualization: 3eend: 3noise measurement: 3complex spectral mapping: 3time domain analysis: 3array signal processing: 3misp challenge: 3encoder decoder: 3rnn t: 3open source: 3non autoregressive: 3autoregressive processes: 3sequence to sequence: 3connectionist temporal classification: 3convolutional neural nets: 3audio signal processing: 3biological system modeling: 2benchmark: 2protocols: 2speech: 2switches: 2inference algorithms: 2discrete units: 2complexity theory: 2interpretability: 2buildings: 2end to end models: 2machine translation: 2probabilistic logic: 2databases: 2linguistics: 2speaker diarisation: 2topic model: 2phonetics: 2sentiment analysis: 2error analysis: 2indexes: 2estimation: 2reverberation: 2frame online speech enhancement: 2end to end systems: 2stop challenge: 2degradation: 2self supervised representations: 2voice conversion: 2reproducibility of results: 2unsupervised asr: 2semi supervised learning: 2transducer: 2quality assessment: 2phase estimation: 2self supervision: 2code switched asr: 2public domain software: 2graph theory: 2end to end asr: 2text to speech: 2cycle consistency: 2end to end speech translation: 2pattern clustering: 2neural net architecture: 2self attention: 2joint ctc/attention: 2unpaired data: 2multiple microphone array: 2sound event detection: 2signal detection: 2multilingual text to speech: 1low resource adaptation: 1graphone: 1adaptation of masked language model: 1task generalization: 1evaluation: 1foundation model: 1semi autoregressive: 1redundancy: 1systematics: 1long form asr: 1data processing: 1probes: 1mutual information: 1linear probing: 1information theory: 1solids: 1conversational speech recognition: 1conversation transcription: 1multi talker automatic speech recognition: 1label priors: 1runtime: 1forced alignment: 1instruction tuning: 1collaboration: 1conversational speech: 1context modeling: 1contextual information: 1robustness: 1zero shot learning: 1code switching: 1splicing: 1synthetic summary: 1chatbots: 1large language model: 1chatgpt: 1multi modal tokens: 1image to speech synthesis: 1vector quantization: 1image to speech captioning: 1multi modal speech processing: 1memory: 1costs: 1dataset: 1boosting: 1modulation: 1multitask: 1spoken language model: 1animation: 1three dimensional displays: 1multi task learning: 1tongue: 1speech animation: 1solid modeling: 1ema: 1context: 1generative context: 
1self supervised speech models: 1beam search: 1acoustic beams: 1contextualization: 1biasing: 1st: 1multi tasking: 1mt: 1low resource language lip reading: 1multilingual automated labeling: 1lips: 1visual speech recognition: 1lip reading: 1artificial intelligence: 1encoder decoder models: 1modularity: 1network architecture: 1online diarization: 1optimization: 1robust automatic speech recognition: 1articulatory attribute: 1broad phonetic classes: 1full and sub band integration: 1acoustic beamforming: 1computer architecture: 1discrete fourier transforms: 1low latency communication: 1microphone array processing: 1prediction algorithms: 1spoken dialog system: 1emotion recognition: 1joint modelling: 1history: 1speaker attributes: 1overthinking: 1synchronization: 1prosody transfer: 1rhythm: 1one shot: 1synthesizers: 1disentangled speech representation: 1codes: 1multilingual asr: 1face recognition: 1low resource asr: 1cleaning: 1usability: 1target tracking: 1disfluency detection: 1espnet: 1s3prl: 1learning systems: 1multiprotocol label switching: 1pseudo labeling: 1semisupervised learning: 1limiting: 1intermediate loss: 1pre trained language model: 1bit error rate: 1bert: 1masked language model: 1adapter: 1data mining: 1evaluation protocol: 1speaker verification: 1video on demand: 1computational efficiency: 1end to end modeling: 1memory efficient encoders: 1dual speech/text encoder: 1long spoken document: 1e2e: 1on device: 1tensors: 1e branchformer: 1sequential distillation: 1tensor decomposition: 1articulatory: 1gestural scores: 1production systems: 1factor analysis: 1kinematics: 1lda: 1unsupervised: 1wavlm: 1automatic speech quality assessment: 1speech language model: 1discrete token: 1closed box: 1real time systems: 1speech to text translation: 1out of order: 1heterogeneous networks: 1self supervised models: 1convolution: 1structured pruning: 1bridges: 1connectors: 1question answering (information retrieval): 1speech to speech translation: 1text to speech augmentation: 1fine tuning: 1speaker separation: 1low complexity speech enhancement: 1hearing aids design: 1road transportation: 1memory architecture: 1quantization (signal): 1tv: 1multimodality: 1production: 1articulatory inversion: 1articulatory speech processing: 1text recognition: 1spoken named enitiy recognition: 1zero shot asr: 1impedance matching: 1acoustic measurements: 1acoustic parameters: 1phonetic alignment: 1perceptual quality: 1noise reduction: 1enhancement: 1explainable enhancement evaluation: 1frequency estimation: 1eda: 1iterative methods: 1inference mechanisms: 1speech based user interfaces: 1gtc: 1multi speaker overlapped speech: 1wfst: 1wake word spotting: 1audio visual systems: 1microphone array: 1ctc/attention speech recognition: 1channel bank filters: 1fourier transforms: 1computer based training: 1self supervised speech representation: 1sensor fusion: 1attention fusion: 1rover: 1generative model: 1diffusion probabilistic model: 1bic: 1interactive systems: 1unit based language model: 1acoustic unit discovery: 1gtc t: 1noise robustness: 1joint modeling: 1natural languages: 1audio captioning: 1aac: 1linguistic annotation: 1re current neural network: 1sru++: 1bilingual asr: 1computational linguistics: 1audio processing: 1open source toolkit: 1software packages: 1python: 1end to end speech processing: 1conformer: 1image sequences: 1non autoregressive sequence generation: 1non autoregressive decoding: 1multiprocessing systems: 1conditional masked language model: 1long sequence data: 1gaussian processes: 1search problems: 
1multitask learning: 1stochastic processes: 1continuous speech separation: 1long recording speech separation: 1online processing: 1transforms: 1dual path modeling: 1noisy speech: 1deep learning (artificial intelligence): 1signal denoising: 1loudspeakers: 1diarization: 1audio recording: 1entropy: 1target speaker speech recognition: 1targetspeaker speech extraction: 1uncertainty estimation: 1direction of arrival estimation: 1source localization: 1multi encoder multi resolution (mem res): 1multi encoder multi array (mem array): 1hierarchical attention network (han): 1curriculum learning: 1end to end model: 1multi talker mixed speech recognition: 1knowledge distillation: 1permutation invariant training: 1overlapped speech recognition: 1neural beamforming: 1lightweight convolution: 1dynamic convolution: 1open source software: 1proposals: 1neural network: 1region proposal network: 1faster r cnn: 1speaker adaptation: 1end to end speech synthesis: 1joint training of asr tts: 1multi stream: 1two stage training: 1weakly supervised learning: 1target speech extraction: 1minimisation: 1neural beamformer: 1signal reconstruction: 1voice activity detection: 1ctc greedy search: 1cloud computing: 1covariance matrix adaptation evolution strategy (cma es): 1multi objective optimization: 1pareto optimisation: 1genetic algorithm: 1parallel processing: 1deep neural network (dnn): 1evolutionary computation: 1attention models: 1discriminative training: 1optimisation: 1softmax margin: 1beam search training: 1sequence learning: 1multi speaker speech recognition: 1cocktail party problem: 1attention mechanism: 1cold fusion: 1automatic speech recognition (asr): 1language model: 1shallow fusion: 1storage management: 1deep fusion: 1expert systems: 1low resource language: 1multilingual speech recognition: 1acoustic model: 1autoencoder: 1weakly labeled data: 1restricted boltzmann machine: 1unsupervised learning: 1conditional restricted boltzmann machine: 1robust speech recognition: 1acoustic modeling: 1chime 5 challenge: 1kaldi: 1discrete representation: 1mask inference: 1interpolation: 1error statistics: 1stream attention: 1speech codecs: 1word processing: 1sub word modeling: 1
Most publications (all venues) at: 2023: 98, 2024: 75, 2022: 73, 2021: 70, 2019: 47

Affiliations
Carnegie Mellon University, Pittsburgh, PA, USA
Johns Hopkins University, Baltimore, MD, USA (former)
Mitsubishi Electric Research Laboratories, Cambridge, MA, USA (2012 - 2017)
NTT Communication Science Laboratories, Kyoto, Japan (2001 - 2011)
Waseda University, Tokyo, Japan (PhD 2006)

Recent publications

TASLP2024 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe 0001
End-to-End Speech Recognition: A Survey.

TASLP2024 Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe 0001, Shinnosuke Takamichi, Hiroshi Saruwatari, 
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis.

TASLP2024 Shu-Wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li 0001, Abdelrahman Mohamed, Shinji Watanabe 0001, Hung-yi Lee, 
A Large-Scale Evaluation of Speech Foundation Models.

ICASSP2024 Siddhant Arora, George Saon, Shinji Watanabe 0001, Brian Kingsbury, 
Semi-Autoregressive Streaming ASR with Label Context.

ICASSP2024 Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon Jung, Yichen Lu, Soumi Maiti, Roshan S. Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe 0001, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang, 
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.

ICASSP2024 William Chen, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe 0001
Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing.

ICASSP2024 Kwanghee Choi, Jee-Weon Jung, Shinji Watanabe 0001
Understanding Probe Behaviors Through Variational Bounds of Mutual Information.

ICASSP2024 Samuele Cornell, Jee-Weon Jung, Shinji Watanabe 0001, Stefano Squartini, 
One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition.

ICASSP2024 Ruizhe Huang, Xiaohui Zhang 0007, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe 0001, Daniel Povey, Sanjeev Khudanpur, 
Less Peaky and More Accurate CTC Forced Alignment by Label Priors.

ICASSP2024 Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan S. Sharma, Shinji Watanabe 0001, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee, 
Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech.

ICASSP2024 Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization.

ICASSP2024 Amir Hussein, Dorsa Zeinali, Ondrej Klejch, Matthew Wiesner, Brian Yan, Shammur Absar Chowdhury, Ahmed Ali 0002, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora.

ICASSP2024 Jee-Weon Jung, Roshan S. Sharma, William Chen, Bhiksha Raj, Shinji Watanabe 0001
AugSumm: Towards Generalizable Speech Summarization Using Synthetic Labels from Large Language Models.

ICASSP2024 Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe 0001, Yong Man Ro, 
Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-Training and Multi-Modal Tokens.

ICASSP2024 Doyeop Kwak, Jaemin Jung, Kihyun Nam, Youngjoon Jang, Jee-Weon Jung, Shinji Watanabe 0001, Joon Son Chung, 
VoxMM: Rich Transcription of Conversations in the Wild.

ICASSP2024 Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhongqiu Wang, Shinji Watanabe 0001
Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor.

ICASSP2024 Takashi Maekaku, Jiatong Shi, Xuankai Chang, Yuya Fujita, Shinji Watanabe 0001
Hubertopic: Enhancing Semantic Representation of Hubert Through Self-Supervision Utilizing Topic Model.

ICASSP2024 Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-Weon Jung, Xuankai Chang, Shinji Watanabe 0001
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks.

ICASSP2024 Salvador Medina, Sarah L. Taylor, Carsten Stoll, Gareth Edwards, Alex Hauptmann 0001, Shinji Watanabe 0001, Iain A. Matthews, 
PhISANet: Phonetically Informed Speech Animation Network.

ICASSP2024 Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe 0001, Karen Livescu, 
Generative Context-Aware Fine-Tuning of Self-Supervised Speech Models.

#2  | Helen M. Meng | DBLP | Google Scholar
By venue: ICASSP: 74, Interspeech: 72, TASLP: 19, ICML: 1, IJCAI: 1
By year: 2024: 14, 2023: 30, 2022: 43, 2021: 31, 2020: 18, 2019: 22, 2018: 9
ISCA sessionsspeech synthesis: 13speech and language in health: 5voice conversion and adaptation: 5speech recognition of atypical speech: 4speech recognition: 2topics in asr: 2spoken term detection: 2asr neural network architectures: 2neural techniques for voice conversion and waveform generation: 2medical applications and visual asr: 2spoken language translation, information retrieval, summarization, resources, and evaluation: 1multi-talker methods in speech processing: 1single-channel speech enhancement: 1multi-, cross-lingual and other topics in asr: 1novel models and training methods for asr: 1atypical speech analysis and detection: 1multimodal speech emotion recognition and paralinguistics: 1miscellaneous topics in speech, voice and hearing disorders: 1spoofing-aware automatic speaker verification (sasv): 1zero, low-resource and multi-modal speech recognition: 1embedding and network architecture for speaker recognition: 1voice anti-spoofing and countermeasure: 1non-autoregressive sequential modeling for speech processing: 1assessment of pathological speech and language: 1non-native speech: 1speaker recognition: 1speech synthesis paradigms and methods: 1speech in multimodality: 1asr neural network architectures and training: 1new trends in self-supervised speech processing: 1multimodal speech processing: 1learning techniques for speaker recognition: 1speech and speaker recognition: 1speech and audio classification: 1lexicon and language model for speech recognition: 1novel neural network architectures for acoustic modelling: 1second language acquisition and code-switching: 1voice conversion: 1emotion recognition and analysis: 1plenary talk: 1expressive speech synthesis: 1deep learning for source separation and pitch tracking: 1application of asr in medical practice: 1
IEEE keywordsspeech recognition: 37speech synthesis: 21speaker recognition: 19natural language processing: 14recurrent neural nets: 11adaptation models: 8speech coding: 8speech separation: 8emotion recognition: 8task analysis: 7vocoders: 7speech enhancement: 7speaker verification: 7voice conversion: 7speech emotion recognition: 7decoding: 6semantics: 6self supervised learning: 6transformer: 6bayes methods: 6deep learning (artificial intelligence): 6text analysis: 6optimisation: 6text to speech: 5transformers: 5linguistics: 5adversarial attack: 5data models: 4representation learning: 4dysarthric speech reconstruction: 4audio visual: 4expressive speech synthesis: 4data mining: 4neural architecture search: 4security of data: 4gaussian processes: 4quantisation (signal): 4multi channel: 4overlapped speech: 4elderly speech: 3dysarthric speech: 3hidden markov models: 3speech: 3visualization: 3training data: 3speaker adaptation: 3predictive models: 3error analysis: 3domain adaptation: 3robustness: 3bayesian learning: 3knowledge distillation: 3audio signal processing: 3biometrics (access control): 3speech intelligibility: 3entropy: 3voice activity detection: 3variational inference: 3language models: 3convolutional neural nets: 3pre trained asr system: 2wav2vec2.0: 2older adults: 2bidirectional attention mechanism: 2spectrogram: 2multi modal: 2vq vae: 2cloning: 2language model: 2timbre: 2transfer learning: 2instruments: 2coherence: 2perturbation methods: 2cognition: 2vector quantization: 2hierarchical: 2estimation: 2conformer: 2end to end: 2automatic speech recognition: 2computational modeling: 2noise reduction: 2asr: 2speaking style modelling: 2bidirectional control: 2multi task learning: 2particle separators: 2time frequency analysis: 2source separation: 2costs: 2measurement: 2data augmentation: 2handicapped aids: 2disordered speech recognition: 2time delay neural network: 2automatic speaker verification: 2adversarial defense: 2model uncertainty: 2neural language models: 2trees (mathematics): 2benchmark testing: 2audio visual systems: 2anti spoofing: 2speaker diarization: 2multi look: 2inference mechanisms: 2gradient methods: 2admm: 2autoregressive processes: 2quantization: 2code switching: 2standards: 1multi lingual xlsr: 1hubert: 1films: 1multiscale speaking style transfer: 1text to speech synthesis: 1games: 1automatic dubbing: 1cross lingual speaking style transfer: 1prompt based learning: 1diffusion model: 1metric learning: 1natural languages: 1av hubert: 1transforms: 1pre training: 1self supervised style enhancing: 1dance expressiveness: 1dance generation: 1genre matching: 1dance dynamics: 1humanities: 1dynamics: 1beat alignment: 1zero shot: 1multi scale acoustic prompts: 1prompt tuning: 1parameter efficient tuning: 1transformer adapter: 1pre trained transformer: 1multiple signal classification: 1long multi track: 1multi view midivae: 1symbolic music generation: 1two dimensional displays: 1speech disentanglement: 1vae: 1voice cloning: 1static var compensators: 1harmonic analysis: 1power harmonic filters: 1synthesizers: 1neural concatenation: 1signal generators: 1singing voice conversion: 1speech normalization: 1speech units: 1pipelines: 1speech representation learning: 1information retrieval: 1interaction gesture: 1multi agent conversational interaction: 1oral communication: 1dialog intention and emotion: 1co speech gesture generation: 1neural tts: 1multi stage multi codebook (msmc): 1speech representation: 1context modeling: 1style modeling: 1bit error rate: 1multi scale: 1speech 
dereverberation: 1maximum likelihood detection: 1nonlinear filters: 1neural machine translation: 1hierarchical attention mechanism: 1machine translation: 1meta learning: 1meta generalized speaker verification: 1performance evaluation: 1optimization: 1domain mismatch: 1recording: 1subband interaction: 1inter subnet: 1global spectral information: 1feature selection: 1rabbits: 1disfluency pattern: 1dementia detection: 1audiobook speech synthesis: 1prediction methods: 1context aware: 1multi sentence: 1hierarchical transformer: 1additives: 1contrastive learning: 1multiobjective optimization: 1additive angular margin: 1optimization methods: 1attention mechanism: 1alzheimer’s disease: 1sociology: 1syntactics: 1task oriented: 1pretrained embeddings: 1multimodality: 1affective computing: 1multi label: 1emotional expression: 1multi culture: 1vocal bursts: 1data analysis: 1target speech extraction: 1multi modal fusion: 1fuses: 1encoding: 12d positional encoding.: 1cross attention: 1end to end speech recognition: 1multi talker speech recognition: 1network architecture: 1corrector network: 1time domain: 1time frequency domain: 1learning systems: 1synthetic corpus: 1audio recording: 1neural vocoder: 1semantic augmentation: 1upper bound: 1difficulty aware: 1stability analysis: 1contextual biasing: 1biased words: 1sensitivity: 1open vocabulary keyword spotting: 1acoustic model: 1dynamic network pruning: 1melody unsupervision: 1differentiable up sampling layer: 1rhythm: 1vocal range: 1regulators: 1annotations: 1singing voice synthesis: 1bi directional flow: 1elderly speech recognition: 1search problems: 1uncertainty handling: 1minimisation: 1neural net architecture: 1pattern classification: 1adversarial attacks: 1supervised learning: 1monte carlo methods: 1tree structure: 1prosodic structure prediction: 1computational linguistics: 1span based decoder: 1character level: 1image segmentation: 1phase information: 1full band extractor: 1multi scale time sensitive channel attention: 1memory management: 1convolution: 1knowledge based systems: 1flat lattice transformer: 1rule based: 1chinese text normalization: 1none standard word: 1relative position encoding: 1articulatory inversion: 1hybrid power systems: 1xlnet: 1speaking style: 1conversational text to speech synthesis: 1graph neural network: 1matrix algebra: 1end to end model: 1forced alignment: 1dereverberation and recognition: 1reverberation: 1speaker change detection: 1multitask learning: 1unsupervised learning: 1unsupervised speech decomposition: 1adversarial speaker adaptation: 1speaker identity: 1multi speaker: 1knowledge transfer: 1video to speech synthesis: 1knowledge engineering: 1lips: 1predictive coding: 1vocabulary: 1vocoder: 1uniform sampling: 1path dropout: 1partially fake audio detection: 1audio deep synthesis detection challenge: 1design methodology: 1mean square error methods: 1neural network quantization: 1mixed precision: 1connectionist temporal classification: 1cross entropy: 1disentangling: 1hybrid bottleneck features: 1feature fusion: 1data handling: 1m2met: 1direction of arrival estimation: 1direction of arrival: 1delays: 1generalisation (artificial intelligence): 1lf mmi: 1gaussian process: 1any to many: 1sequence to sequence modeling: 1signal reconstruction: 1signal sampling: 1signal representation: 1location relative attention: 1multimodal speech recognition: 1capsule: 1exemplary emotion descriptor: 1residual error: 1capsule network: 1spatial information: 1sequential: 1recurrent: 1lstm rnn: 1low bit quantization: 1image recognition: 
1microphone arrays: 1visual occlusion: 1overlapped speech recognition: 1jointly fine tuning: 1filtering theory: 1video signal processing: 1emotion: 1global style token: 1expressive: 1synthetic speech detection: 1res2net: 1replay detection: 1multi scale feature: 1asv anti spoofing: 1adress: 1patient diagnosis: 1alzheimer's disease detection: 1signal classification: 1diseases: 1features: 1geriatrics: 1medical diagnostic computing: 1ctc: 1non autoregressive: 1neural network based text to speech: 1grammars: 1prosody control: 1word processing: 1syntactic parse tree traversal: 1syntactic representation learning: 1controllable and efficient: 1semi autoregressive: 1prosody modelling: 1multi speaker and multi style tts: 1hifi gan: 1durian: 1low resource condition: 1weapons: 1information filters: 1switches: 1uncertainty: 1neurocognitive disorder detection: 1dementia: 1phonetic pos teriorgrams: 1x vector: 1gmm i vector: 1accent conversion: 1accented speech recognition: 1cross modal: 1seq2seq: 1adversarial training: 1spatial smoothing: 1spoofing countermeasure: 1recurrent neural networks: 1data compression: 1alternating direction methods of multipliers: 1audio visual speech recognition: 1multilingual speech synthesis: 1foreign accent: 1spectral analysis: 1center loss: 1human computer interaction: 1discriminative features: 1gaussian process neural network: 1activation function selection: 1bayesian neural network: 1neural network language models: 1lstm: 1parameter estimation: 1connectionist temporal classification (ctc): 1e learning: 1computer assisted pronunciation training (capt): 1con volutional neural network (cnn): 1mispronunciation detection and diagnosis (mdd): 1multi head self attention: 1dilated residual network: 1wavenet: 1self attention: 1blstm: 1phonetic posteriorgrams(ppgs): 1quasifully recurrent neural network (qrnn): 1parallel processing: 1parallel wavenet: 1text to speech (tts) synthesis: 1convolutional neural network (cnn): 1utterance level features: 1spatial relationship information: 1recurrent connection: 1capsule networks: 1natural gradient: 1rnnlms: 1
Most publications (all venues) at: 2022: 76, 2023: 56, 2021: 54, 2024: 45, 2019: 29

Affiliations
The Chinese University of Hong Kong
Massachusetts Institute of Technology, Cambridge, MA, USA (former)

Recent publications

TASLP2024 Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu, 
Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition.

TASLP2024 Jingbei Li, Sipan Li, Ping Chen, Luwen Zhang, Yi Meng, Zhiyong Wu 0001, Helen Meng, Qiao Tian, Yuping Wang, Yuxuan Wang 0002, 
Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing.

TASLP2024 Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng
InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt.

ICASSP2024 Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu 0001, Xunying Liu, Helen Meng
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.

ICASSP2024 Xueyuan Chen, Xi Wang 0016, Shaofei Zhang, Lei He 0005, Zhiyong Wu 0001, Xixin Wu, Helen Meng
Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.

ICASSP2024 Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu 0001, Haozhi Huang 0004, Helen Meng
Enhancing Expressiveness in Dance Generation Via Integrating Frequency and Music Style Information.

ICASSP2024 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Dan Luo, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han 0001, Helen Meng
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts.

ICASSP2024 Zhe Li, Man-Wai Mak, Helen Mei-Ling Meng
Dual Parameter-Efficient Fine-Tuning for Speaker Representation Via Speaker Prompt Tuning and Adapters.

ICASSP2024 Zhiwei Lin, Jun Chen 0024, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu 0001, Helen Meng
Multi-View Midivae: Fusing Track- and Bar-View Representations for Long Multi-Track Symbolic Music Generation.

ICASSP2024 Hui Lu, Xixin Wu, Haohan Guo, Songxiang Liu, Zhiyong Wu 0001, Helen Meng
Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations.

ICASSP2024 Binzhu Sha, Xu Li 0015, Zhiyong Wu 0001, Ying Shan, Helen Meng
Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion.

ICASSP2024 Yuejiao Wang, Xixin Wu, Disong Wang, Lingwei Meng, Helen Meng
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization.

ICASSP2024 Haiwei Xue, Sicheng Yang, Zhensong Zhang, Zhiyong Wu 0001, Minglei Li 0001, Zonghong Dai, Helen Meng
Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models.

ICML2024 Dongchao Yang, Jinchuan Tian, Xu Tan 0003, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian 0002, Zhou Zhao, Xixin Wu, Helen M. Meng
UniAudio: Towards Universal Audio Generation with Large Language Models.

TASLP2023 Haohan Guo, Fenglong Xie, Xixin Wu, Frank K. Soong, Helen Meng
MSMC-TTS: Multi-Stage Multi-Codebook VQ-VAE Based Neural TTS.

TASLP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Helen Meng
MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis.

TASLP2023 Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu, 
Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition.

TASLP2023 Xixin Wu, Hui Lu, Kun Li 0003, Zhiyong Wu 0001, Xunying Liu, Helen Meng
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms.

TASLP2023 Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, Helen Meng
Meta-Generalization for Domain-Invariant Speaker Verification.

ICASSP2023 Jun Chen 0024, Wei Rao, Zilin Wang, Jiuxin Lin, Zhiyong Wu 0001, Yannan Wang, Shidong Shang, Helen Meng
Inter-Subnet: Speech Enhancement with Subband Interaction.

#3  | Haizhou Li 0001 | DBLP | Google Scholar
By venue: Interspeech: 70, ICASSP: 44, TASLP: 29, SpeechComm: 8, AAAI: 2, NeurIPS: 1
By year: 2024: 19, 2023: 24, 2022: 22, 2021: 31, 2020: 25, 2019: 26, 2018: 7
ISCA sessionsspeech synthesis: 6source separation: 4voice conversion and adaptation: 3speech signal characterization: 3speech technologies for code-switching in multilingual communities: 3invariant and robust pre-trained acoustic models: 2analysis of speech and audio signals: 2novel models and training methods for asr: 2speaker recognition: 2speech enhancement, bandwidth extension and hearing aids: 2spoken term detection: 2anti-spoofing for speaker verification: 1speaker and language identification: 1biosignal-enabled spoken communication: 1asr: 1resource-constrained asr: 1target speaker detection, localization and separation: 1the first dicova challenge: 1spoken language understanding: 1self-supervision and semi-supervision for neural asr training: 1speech enhancement and intelligibility: 1robust speaker recognition: 1feature, embedding and neural architecture for speaker recognition: 1neural signals for spoken communication: 1the attacker’s perpective on automatic speaker verification: 1targeted source separation: 1speech in multimodality: 1the interspeech 2020 far field speaker verification challenge: 1speaker recognition challenges and applications: 1anti-spoofing and liveness detection: 1asr neural network architectures: 1cross/multi-lingual and code-switched speech recognition: 1the interspeech 2019 computational paralinguistics challenge (compare): 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1speaker recognition and anti-spoofing: 1speech processing and analysis: 1speaker recognition evaluation: 1speaker and language recognition: 1speech and audio characterization and segmentation: 1neural waveform generation: 1the zero resource speech challenge 2019: 1speech and speaker recognition: 1speaker recognition and diarization: 1cross-lingual and multilingual asr: 1speech and singing production: 1prosody modeling and generation: 1voice conversion and speech synthesis: 1speaker verification: 1show and tell: 1source separation from monaural input: 1
IEEE keywordsspeech recognition: 22speaker recognition: 21task analysis: 14speech synthesis: 14natural language processing: 9transformers: 7visualization: 6speaker embedding: 6data models: 6speech coding: 5speech enhancement: 5target speaker extraction: 5multi modal: 5emotion recognition: 5music: 5speaker extraction: 5lips: 4computational modeling: 4hidden markov models: 4phonetics: 4decoding: 4training data: 4self supervised learning: 4time domain: 4text analysis: 4transformer: 3time frequency analysis: 3data mining: 3rendering (computer graphics): 3signal processing algorithms: 3adaptation models: 3voice activity detection: 3synchronization: 3pipelines: 3music information retrieval: 3speech intelligibility: 3representation learning: 3multi task learning: 3singing voice separation: 3voice conversion: 3transfer learning: 3security of data: 3anti spoofing: 3speaker verification: 2accent: 2steganalysis: 2steganography: 2speech separation: 2internet: 2measurement: 2cross lingual voice conversion (xvc): 2predictive models: 2auditory system: 2filtering algorithms: 2direction of arrival: 2location awareness: 2linguistics: 2sparsely overlapped speech: 2noise robustness: 2recurrent neural networks: 2robustness: 2direction of arrival estimation: 2automatic dialogue evaluation: 2correlation: 2time frequency attention: 2cocktail party problem: 2codes: 2benchmark testing: 2image recognition: 2lyrics transcription: 2hearing: 2convolutional neural nets: 2pattern classification: 2pre training: 2tacotron: 2vocoders: 2speaker characterization: 2voice conversion (vc): 2cross lingual: 2word processing: 2tts: 2signal detection: 2signal reconstruction: 2cepstral analysis: 2source separation: 2automatic cued speech recognition: 1computational efficiency: 1computation and parameter efficient: 1cross attention: 1resnet: 1stride configuration: 1temporal resolution: 12d cnn: 1convolutional neural networks: 1image resolution: 1computer architecture: 1controllable: 1text to speech (tts) synthesis: 1accent intensity: 1multi agent deep learning: 1weight parameter aggregation: 1streams: 1low bit rate speech streams: 1long short term memory: 1pretraining: 1siamese network: 1psychoacoustic models: 1self supervise: 1synthetic data: 1maximum mean discrepancy: 1predictive coding: 1electronics packaging: 1multimodal sensors: 1oral communication: 1context modeling: 1dialog systems: 1history: 1multi reference: 1timbre: 1pitch normalization: 1text to speech (tts): 1phonetic variation: 1prosodic variation: 1target speaker localization: 1speaker dependent mask: 1focusing: 1emotional text to speech: 1emotion prediction: 1emotion control: 1target speech diarization: 1switches: 1semantics: 1speaker diarization: 1prompt driven: 1mimics: 1active speaker detection: 1audio visual: 1interference: 1speech: 1low snr: 1testing: 1optimization: 1artificial noise: 1signal to noise ratio: 1background noise: 1gradient: 1noise robust: 1neuromorphics: 1neurons: 1encoding: 1spiking neural networks: 1spike encoding: 1filter banks: 1learnable audio front end: 1system performance: 1in the wild: 1dino: 1biological system modeling: 1spiking neural network (snn): 1voice activity detection (vad): 1auditory attention: 1power demand: 1multiple signal classification: 1lyrics transcription in polyphonic music: 1integrated fine tuning: 1vocal extraction: 1robots: 1speaker tracking: 1cross modal attention: 1estimation: 1audio visual fusion: 1progressive clustering: 1diverse positive pairs: 1supervised learning: 1face recognition: 1speech streams: 1delays: 
1resistance: 1pitch delays: 1deep neural networks: 1distortion: 1voice over internet protocol: 1quantization (signal): 1multitask learning: 1adapters: 1multi domain generalization: 1noise measurement: 1restcn: 1error analysis: 1linguistic loss: 1brain modeling: 1speech stimulus: 1electroencephalography: 1eeg decoding: 1speech envelope: 1match mismatch classification: 1visual occlusions: 1design methodology: 1inpainting: 1noisy label: 1deep cleansing: 1audiovisual: 1joint pre training: 1speech representation: 1analytical models: 1feeds: 1transformer cores: 1spare self attention: 1central moment discrepancy (cmd): 1missing modality imagination: 1invariant feature: 1multimodal emotion recognition: 1automatic lyrics transcription in polyphonic music: 1multitasking: 1instruments: 1singing skill evaluation: 1lyrics synchronization: 1singing information processing: 1audio signal processing: 1singing voice synthesis: 1singing voice: 1 $general speech mixture$ : 1scenario aware differentiated loss: 1filtering theory: 1speech lip synchronization: 1self enrollment: 1multilingual: 1language translation: 1grammars: 1natural languages: 1selective auditory attention: 1globalphone: 1target language extraction: 1lyrics transcription of polyphonic music: 1beamforming: 1doa estimation: 1speaker localizer: 1reverberation: 1array signal processing: 1multi scale frequency channel attention: 1short utterance: 1text independent speaker verification: 1text detection: 1visual text to speech: 1automatic voice over: 1textual visual attention: 1image fusion: 1lip speech synchronization: 1video signal processing: 1pseudo label selection: 1self supervised speaker recognition: 1loss gated learning: 1unsupervised learning: 1temporal convolutional network: 1energy distribution: 1prompt: 1multimodal: 1phrase break prediction: 1morphological and phonological features: 1deep learning (artificial intelligence): 1self attention: 1prosodic phrasing: 1mongolian speech synthesis: 1expressive speech synthesis: 1audio databases: 1frame and style reconstruction loss: 1speech analysis: 1voice conversion evaluation: 1voice conversion challenges: 1vocoding: 1target speaker verification: 1single and multi talker speaker verification: 1interactive systems: 1speech based user interfaces: 1human computer interaction: 1sport: 1holistic framework: 1text to speech (tts): 1non parallel: 1context vector: 1autoencoder: 1personalized speech generation: 1language agnostic: 1syntax: 1computational linguistics: 1graph theory: 1graph neural network: 1synthetic speech detection: 1signal companding: 1data augmentation: 1signal fusion: 1multi stage: 1spectro temporal attention: 1speech emotion recognition: 1convolution: 1channel attention: 1disentangled feature learning: 1signal denoising: 1adversarial training: 1signal representation: 1image sequences: 1acoustic embed dings: 1linguistic embeddings: 1image classification: 1intent classification: 1cloning: 1speaker adaption: 1target tracking: 1voice cloning: 1speech emotion recognition (ser): 1emotional voice conversion: 1emotional speech dataset: 1evaluation by ranking: 1musical acoustics: 1evaluation of singing quality: 1inter singer measures: 1music theory motivated measures: 1self organising feature maps: 1depth wise separable convolution: 1multi scale: 1inference mechanisms: 1knowledge distillation: 1autoregressive processes: 1chains corpus: 1vocal tract constriction: 1whispered speech: 1synthetic attacks: 1replay attacks: 1generalized countermeasures: 1asvspoof 2019: 1wavenet adaptation: 1singular 
value decomposition: 1singular value decomposition (svd): 1automatic speech recognition: 1acoustic modeling: 1music genre: 1lyrics alignment: 1sensor fusion: 1multi scale fusion: 1speech bandwidth extension: 1signal restoration: 1time domain analysis: 1low resource asr: 1catastrophic forgetting.: 1independent language model: 1fine tuning: 1text to speech: 1code switching: 1crosslingual word embedding: 1end to end: 1continuous wavelet transforms: 1tandem feature: 1phonetic posteriorgrams (ppgs): 1wavenet vocoder: 1sparse matrices: 1dictionaries: 1prosody conversion: 1language modelling: 1cross lingual embedding: 1code switch: 1audio source separation: 1polyphonic music: 1asr: 1lyrics to audio alignment: 1asvspoof 2017: 1channel bank filters: 1automatic speaker verification: 1spatial differentiation: 1band pass filters: 1iir filters: 1spectrum approximation loss: 1phonetic posteriorgram (ppg): 1average modeling approach (ama): 1
Most publications (all venues) at: 2024: 74, 2010: 70, 2023: 67, 2021: 65, 2015: 61

Affiliations
Chinese University of Hong Kong (Shenzhen), China
National University of Singapore, Department of Electrical and Computer Engineering, Singapore
Nanyang Technological University, Singapore (2006 - 2016)
Institute for Infocomm Research, A*STAR, Singapore (2003 - 2016)
University of New South Wales, Sydney, Australia (2011)
University of Eastern Finland, Kuopio, Finland (2009)
South China University of Technology, Guangzhou, China (PhD 1990)

Recent publications

SpeechComm2024 Shuai Wang 0016, Zhengyang Chen, Bing Han, Hongji Wang, Chengdong Liang, Binbin Zhang, Xu Xiang, Wen Ding, Johan Rohdin, Anna Silnova, Yanmin Qian, Haizhou Li 0001
Advancing speaker embedding learning: Wespeaker toolkit for research and production.

TASLP2024 Lei Liu, Li Liu 0036, Haizhou Li 0001
Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition.

TASLP2024 Tianchi Liu 0004, Kong Aik Lee, Qiongqiong Wang, Haizhou Li 0001
Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification.

TASLP2024 Rui Liu 0008, Berrak Sisman, Guanglai Gao, Haizhou Li 0001
Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering.

TASLP2024 Congcong Sun, Hui Tian 0002, Peng Tian, Haizhou Li 0001, Zhenxing Qian, 
Multi-Agent Deep Learning for the Detection of Multiple Speech Steganography Methods.

TASLP2024 Wupeng Wang, Zexu Pan, Xinke Li, Shuai Wang 0016, Haizhou Li 0001
Speech Separation With Pretrained Frontend to Minimize Domain Mismatch.

TASLP2024 Koichiro Yoshino, Yun-Nung Chen, Paul A. Crook, Satwik Kottur, Jinchao Li, Behnam Hedayatnia, Seungwhan Moon, Zhengcong Fei, Zekang Li, Jinchao Zhang, Yang Feng 0004, Jie Zhou 0016, Seokhwan Kim, Yang Liu 0004, Di Jin, Alexandros Papangelis, Karthik Gopalakrishnan 0001, Dilek Hakkani-Tur, Babak Damavandi, Alborz Geramifard, Chiori Hori, Ankit Shah, Chen Zhang 0020, Haizhou Li 0001, João Sedoc, Luis F. D'Haro, Rafael E. Banchs, Alexander Rudnicky, 
Overview of the Tenth Dialog System Technology Challenge: DSTC10.

TASLP2024 Mingyang Zhang 0003, Yi Zhou 0020, Yi Ren 0006, Chen Zhang 0020, Xiang Yin 0006, Haizhou Li 0001
RefXVC: Cross-Lingual Voice Conversion With Enhanced Reference Leveraging.

TASLP2024 Xuehao Zhou, Mingyang Zhang 0003, Yi Zhou 0020, Zhizheng Wu 0001, Haizhou Li 0001
Accented Text-to-Speech Synthesis With Limited Data.

ICASSP2024 Yu Chen, Xinyuan Qian, Zexu Pan, Kainan Chen, Haizhou Li 0001
LOCSELECT: Target Speaker Localization with an Auditory Selective Hearing Mechanism.

ICASSP2024 Sho Inoue, Kun Zhou 0003, Shuai Wang 0016, Haizhou Li 0001
Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis.

ICASSP2024 Yidi Jiang, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, Haizhou Li 0001
Prompt-Driven Target Speech Diarization.

ICASSP2024 Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang 0016, Haizhou Li 0001
Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-Talker Speech.

ICASSP2024 Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li 0001
Gradient Weighting for Speaker Verification in Extremely Low Signal-to-Noise Ratio.

ICASSP2024 Zeyang Song, Jibin Wu, Malu Zhang, Mike Zheng Shou, Haizhou Li 0001
Spiking-Leaf: A Learnable Auditory Front-End for Spiking Neural Networks.

ICASSP2024 Shuai Wang 0016, Qibing Bai, Qi Liu 0018, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li 0001
Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.

ICASSP2024 Qu Yang, Qianhui Liu, Nan Li, Meng Ge, Zeyang Song, Haizhou Li 0001
SVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks.

AAAI2024 Rui Liu 0008, Yifan Hu, Yi Ren 0006, Xiang Yin 0006, Haizhou Li 0001
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling.

AAAI2024 Jiadong Wang, Zexu Pan, Malu Zhang, Robby T. Tan, Haizhou Li 0001
Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition.

SpeechComm2023 Buddhi Wickramasinghe, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Julien Epps, Haizhou Li 0001, Ting Dang, 
DNN controlled adaptive front-end for replay attack detection systems.

#4  | Lei Xie 0001 | DBLP | Google Scholar
By venue: Interspeech: 58, ICASSP: 40, TASLP: 15, SpeechComm: 2, ACL: 1, AAAI: 1
By year: 2024: 12, 2023: 26, 2022: 28, 2021: 16, 2020: 13, 2019: 16, 2018: 6
ISCA sessionsspeech synthesis: 12speech recognition: 4voice conversion and adaptation: 4speaker and language recognition: 2asr: 2adjusting to speaker, accent, and domain: 2anti-spoofing for speaker verification: 1multi-talker methods in speech processing: 1speech synthesis and voice conversion: 1statistical machine translation: 1models for streaming asr: 1novel models and training methods for asr: 1multi-, cross-lingual and other topics in asr: 1spoken language processing: 1other topics in speech recognition: 1spoofing-aware automatic speaker verification (sasv): 1dereverberation and echo cancellation: 1tools, corpora and resources: 1non-autoregressive sequential modeling for speech processing: 1interspeech 2021 deep noise suppression challenge: 1resource-constrained asr: 1search/decoding techniques and confidence measures for asr: 1interspeech 2021 acoustic echo cancellation challenge: 1robust speaker recognition: 1deep noise suppression challenge: 1singing voice computing and processing in music: 1summarization, semantic analysis and classification: 1the attacker’s perpective on automatic speaker verification: 1multi-channel speech enhancement: 1streaming asr: 1the interspeech 2020 far field speaker verification challenge: 1model adaptation for asr: 1asr for noisy and far-field speech: 1cross-lingual and multilingual asr: 1speech technologies for code-switching in multilingual communities: 1extracting information from audio: 1robust speech recognition: 1spoken term detection: 1
IEEE keywordsspeech recognition: 18speech synthesis: 11decoding: 8linguistics: 8timbre: 8voice conversion: 8speech: 7task analysis: 7speech enhancement: 7natural language processing: 7speaker recognition: 5automatic speech recognition: 5transforms: 4emotion transfer: 4data models: 4predictive models: 4noise reduction: 4multitasking: 3attention mechanism: 3analytical models: 3convolution: 3fuses: 3time frequency analysis: 3end to end: 3style transfer: 3cloning: 2zero shot: 2disentangling: 2data mining: 2conversational asr: 2conformer: 2degradation: 2data privacy: 2speaker anonymization: 2information filtering: 2privacy protection: 2singular value decomposition (svd): 2privacy: 2cross lingual: 2disentanglement: 2pipelines: 2robustness: 2representation learning: 2visualization: 2audio visual speech recognition: 2process control: 2multi scale: 2perturbation methods: 2acoustic distortion: 2reverberation: 2generative adversarial network: 2vocoders: 2low resource: 2headphones: 2personalized speech enhancement: 2real time: 2multi task learning: 2source separation: 2adaptation models: 2acoustic echo cancellation: 2noise suppression: 2echo cancellers: 2adversarial learning: 2recurrent neural networks: 2training data: 2end to end asr: 2microphone arrays: 2alimeeting: 2meeting transcription: 2noise measurement: 2voice activity detection: 2gradient methods: 2keyword spotting: 2attention: 2attention based model: 2end to end speech recognition: 2speaker cloning: 1u net: 1style cloning: 1spectrogram: 1two granularity modeling units: 1asr ar multi task learning: 1lasas: 1temporal channel retrieval: 1production: 1reviews: 1context: 1oral communication: 1context modeling: 1latent variational: 1cross modal representation: 1matrix decomposition: 1voiceprivacy challenge: 1computational modeling: 1streaming voice conversion: 1dynamic masked convolution: 1computer architecture: 1predictive coding: 1quiet attention: 1buildings: 1error analysis: 1multimodal: 1cross attention: 1staged approach: 1measurement: 1encoding: 1language models: 1generative model: 1self supervised learning: 1semantics: 1natural language prompts: 1latent diffusion: 1diffusion model: 1phonetics: 1diffusion processes: 1style modeling: 1adversarial attack: 1speaker identification: 1timbre reserved: 1speech distortion: 1information perturbation: 1feature fusion: 1expressive: 1generative adversarial networks: 1universal vocoder: 1digital signal processing: 1source filter model: 1speaking style: 1speaker adaptation: 1contrastive learning: 1clustering methods: 1upper bound: 1background sound: 1social networking (online): 1internet: 1voice privacy challenge: 1robust keyword spotting: 1real time systems: 1multi modality fusion: 1audio visual keywords spotting: 1lips: 1far field speaker verification: 1fine tuning: 1weight transfer: 1tuning: 1band split: 1complexity theory: 1maximum likelihood detection: 1two step network: 1logic gates: 1multiple factors decoupling: 1expressive speech synthesis: 1two stage: 1minimization: 1variational inference: 1neural tts: 1style and speaker attributes: 1disjoint datasets: 1autoregressive processes: 1emotional speech synthesis: 1virtual assistants: 1emotion strengths: 1principal component analysis: 1emotion strength control: 1natural languages: 1databases: 1text to speech (tts): 1text analysis: 1computational linguistics: 1long form: 1cross sentence: 1dilated complex dual path conformer: 1uformer: 1speech enhancement and dereverberation: 1encoder decoder attention: 1medical signal processing: 1modulation: 1hybrid 
encoder and decoder: 1filtering theory: 1auditory system: 1two stage network: 1estimation: 1ecapa tdnn: 1super wide band: 1information processing: 1s dccrn: 1adaptation: 1one shot: 1over fit: 1topic realted rescoring: 1latent variational module: 1meeting scenario: 1speak diarization: 1arrays: 1multi speaker asr: 1m2met: 1speaker diarization: 1variational autoencoder: 1audio signal processing: 1singing voice synthesis: 1music: 1normalizing flows: 1optical filters: 1corpus: 1matched filters: 1optical character recognition software: 1multi domain: 1shape: 1performance gain: 1lattice pruning: 1speech coding: 1decoder: 1lattice generation: 1acoustic modeling: 1accent recognition: 1accented speech recognition: 1lf mmi: 1convolutional neural nets: 1computational complexity: 1transformer: 1wake word detection: 1streaming: 1transfer learning: 1speaker adaption: 1target tracking: 1voice cloning: 1pattern matching: 1deep binary embeddings: 1temporal context: 1query by example: 1image retrieval: 1quantization (signal): 1wavenet adaptation: 1singular value decomposition: 1voice conversion (vc): 1sensor fusion: 1multi scale fusion: 1speech bandwidth extension: 1signal restoration: 1time domain analysis: 1document image processing: 1neural net architecture: 1class imbalance: 1hard examples: 1wake up word detection: 1error statistics: 1statistical distributions: 1cross entropy: 1listen attend and spell: 1interference suppression: 1virtual adversarial training: 1sequence to sequence: 1adversarial training: 1generators: 1signal to noise ratio: 1domain adversarial training: 1asr: 1computer aided instruction: 1esl: 1call: 1language model: 1code switching: 1pattern classification: 1kws: 1adversarial examples: 1permutation invariant training: 1speech separation: 1pitch tracking: 1deep clustering: 1self attention: 1text to speech synthesis: 1relative position aware representation: 1recurrent neural nets: 1sequence to sequence model: 1audio visual systems: 1robust speech recognition: 1dropout: 1bimodal df smn: 1multi condition training: 1
Most publications (all venues) at: 2023: 56, 2021: 56, 2022: 52, 2024: 35, 2019: 33

Affiliations
Northwestern Polytechnical University, School of Computer Science, Xi'an, China
The Chinese University of Hong Kong, Department of Systems Engineering and Engineering Management, Hong Kong (2006 - 2007)
City University of Hong Kong, School of Creative Media, Hong Kong (2004 - 2006)
Northwestern Polytechnical University, Xi'an, China (PhD 2004)
Vrije Universiteit Brussel, Department of Electronics and Information Processing, Belgium (2001 - 2002)

Recent publications

SpeechComm2024 Li Zhang 0106, Ning Jiang, Qing Wang 0039, Yue Li, Quan Lu, Lei Xie 0001
Whisper-SV: Adapting Whisper for low-data-resource speaker verification.

TASLP2024 Tao Li, Zhichao Wang 0002, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie 0001
U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning.

TASLP2024 Qijie Shao, Pengcheng Guo, Jinghao Yan, Pengfei Hu 0004, Lei Xie 0001
Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition.

TASLP2024 Zhichao Wang 0002, Liumeng Xue, Qiuqiang Kong, Lei Xie 0001, Yuanzhe Chen, Qiao Tian, Yuping Wang, 
Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion.

TASLP2024 Kun Wei, Bei Li, Hang Lv 0001, Quan Lu, Ning Jiang, Lei Xie 0001
Conversational Speech Recognition by Learning Audio-Textual Cross-Modal Contextual Representation.

TASLP2024 Jixun Yao, Qing Wang 0039, Pengcheng Guo, Ziqian Ning, Lei Xie 0001
Distinctive and Natural Speaker Anonymization via Singular Value Transformation-Assisted Matrix.

TASLP2024 Xinfa Zhu, Yi Lei, Tao Li, Yongmao Zhang, Hongbin Zhou, Heng Lu 0004, Lei Xie 0001
METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer.

ICASSP2024 Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu 0004, Shuai Wang, Jixun Yao, Lei Xie 0001, Mengxiao Bi, 
Dualvc 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion.

ICASSP2024 He Wang, Pengcheng Guo, Pan Zhou, Lei Xie 0001
MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition.

ICASSP2024 Ziqian Wang, Xinfa Zhu, Zihan Zhang, Yuanjun Lv, Ning Jiang, Guoqing Zhao, Lei Xie 0001
SELM: Speech Enhancement using Discrete Tokens and Language Models.

ICASSP2024 Jixun Yao, Yuguang Yang 0005, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, Jingjing Yin, Hongbin Zhou, Heng Lu 0004, Lei Xie 0001
Promptvc: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts.

ACL2024 Zhichao Wang 0002, Yuanzhe Chen, Xinsheng Wang, Lei Xie 0001, Yuping Wang, 
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion.

TASLP2023 Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie 0001
DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech - A Study Between English and Mandarin.

TASLP2023 Zhichao Wang 0002, Xinsheng Wang, Qicong Xie, Tao Li, Lei Xie 0001, Qiao Tian, Yuping Wang, 
MSM-VC: High-Fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-Scale Style Modeling.

TASLP2023 Qing Wang 0039, Jixun Yao, Li Zhang 0106, Pengcheng Guo, Lei Xie 0001
Timbre-Reserved Adversarial Attack in Speaker Identification.

ICASSP2023 Mingshuai Liu, Shubo Lv, Zihan Zhang, Runduo Han, Xiang Hao, Xianjun Xia, Li Chen, Yijian Xiao, Lei Xie 0001
Two-Stage Neural Network for ICASSP 2023 Speech Signal Improvement Challenge.

ICASSP2023 Ziqian Ning, Qicong Xie, Pengcheng Zhu 0004, Zhichao Wang 0002, Liumeng Xue, Jixun Yao, Lei Xie 0001, Mengxiao Bi, 
Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features.

ICASSP2023 Kun Song, Yongmao Zhang, Yi Lei, Jian Cong, Hanzhao Li, Lei Xie 0001, Gang He, Jinfeng Bai, 
DSPGAN: A Gan-Based Universal Vocoder for High-Fidelity TTS by Time-Frequency Domain Supervision from DSP.

ICASSP2023 Zhichao Wang 0002, Xinsheng Wang, Lei Xie 0001, Yuanzhe Chen, Qiao Tian, Yuping Wang, 
Delivering Speaking Style in Low-Resource Voice Conversion with Multi-Factor Constraints.

ICASSP2023 Xiaopeng Yan, Yindi Yang, Zhihao Guo, Liangliang Peng, Lei Xie 0001
The NPU-Elevoc Personalized Speech Enhancement System for Icassp2023 DNS Challenge.

#5  | Yanmin Qian | DBLP Google Scholar  
By venue: Interspeech: 45; ICASSP: 44; TASLP: 15; SpeechComm: 2; NeurIPS: 1
By year: 2024: 14; 2023: 25; 2022: 23; 2021: 16; 2020: 13; 2019: 12; 2018: 4
ISCA sessionsspeaker and language identification: 5embedding and network architecture for speaker recognition: 4cross-lingual and multilingual asr: 2multi-talker methods in speech processing: 2speaker recognition and anti-spoofing: 2noise robust and distant speech recognition: 2speaker recognition: 2deep learning for source separation and pitch tracking: 2speaker and language diarization: 1speech recognition: 1acoustic model adaptation for asr: 1novel models and training methods for asr: 1speaker embedding and diarization: 1speech enhancement and intelligibility: 1source separation: 1topics in asr: 1sdsv challenge 2021: 1speech synthesis: 1multimodal systems: 1speaker, language, and privacy: 1speaker recognition challenges and applications: 1learning techniques for speaker recognition: 1targeted source separation: 1multilingual and code-switched asr: 1anti-spoofing and liveness detection: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1feature extraction for asr: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1asr neural network training: 1speech and audio source separation and scene analysis: 1robust speech recognition: 1acoustic modelling: 1
IEEE keywordsspeech recognition: 22speaker recognition: 18speaker verification: 14transformers: 9task analysis: 8error analysis: 8data augmentation: 8adaptation models: 7data models: 6degradation: 6decoding: 5self supervised learning: 5robustness: 5training data: 5computational modeling: 5speech synthesis: 4speaker diarization: 4speech enhancement: 4end to end speech recognition: 4natural language processing: 4system performance: 3computer architecture: 3multi modality: 3encoding: 3noise measurement: 3quantization (signal): 3low resource speech recognition: 3domain adaptation: 3audio visual: 3speaker embedding: 3knowledge distillation: 3continuous speech separation: 3curriculum learning: 3end to end: 3speech separation: 3source separation: 3data handling: 3clustering algorithms: 2voice activity detection: 2transducers: 2factorized neural transducer: 2predictive models: 2vocabulary: 2noise robustness: 2interference: 2visualization: 2audio visual speech recognition: 2unified cross modal attention: 2resnet: 2data mining: 2fuses: 2data collection: 2switches: 2semantics: 2model compression: 2large margin fine tuning: 2dual path modeling: 2deep learning (artificial intelligence): 2unsupervised learning: 2recurrent neural nets: 2audio signal processing: 2perturbation methods: 2transforms: 2gaussian processes: 2reverberation: 2text dependent speaker verification: 2attention mechanism: 2neural speaker diarization: 1attention based encoder decoder: 1ami: 1iterative decoding: 1callhome: 1dihard: 1long content speech recognition: 1streaming and non streaming: 1context modeling: 1rnn t: 1label correction: 1iterative methods: 1self supervised speaker verification: 1cluster aware dino: 1reliability: 1dynamic loss gate: 1modality corruption: 1df resnet: 1performance evaluation: 1neural network quantization: 1lightweight systems: 1mobile handsets: 1analytical models: 1adaptive systems: 1text to seech: 1phonetics: 1data splicing: 1dictionaries: 1splicing: 1machine anomalous sound detection: 1self supervised pre train: 1fine tune: 1employee welfare: 13d speaker: 1cross domain learning: 1domain mismatch: 1target speech diarization: 1prompt driven: 1mimics: 1mixed sparsity: 1large language models: 1sparsity pruning: 1resource management: 1sensitivity: 1in the wild: 1filtering algorithms: 1dino: 1pipelines: 1target speech extraction: 1boosting: 1vocoders: 1frequency estimation: 1speech discretization: 1vocoder: 1recording: 1time frequency analysis: 1reproducibility of results: 1sampling frequency independent: 1microphone number invariant: 1frequency diversity: 1universal speech enhancement: 1attentive feature fusion: 1depth first architecture: 1complexity theory: 1ecapa tdnn: 1long form speech recognition: 1context and speech encoder: 1costs: 1factorized aed: 1text only: 1interpolation: 1search problems: 1binary classification: 1sphereface2: 1modality absence: 1noise robust: 1machine learning: 1multi clue processing: 1benchmark testing: 1cross modality attention: 1target sound extraction: 1misp challenge: 1tv: 1discriminator and transfer: 1log likelihood ratio: 1production: 1wespeaker: 1codes: 1robust speech recognition: 1supervised learning: 1hubert: 1tts conversion: 1transformer transducer: 1speech coding: 1code switching asr: 1cross modality learning: 1industries: 1learning systems: 1asymmetric scenario: 1duration mismatch: 1focusing: 1signal processing algorithms: 1collaboration: 1overlap ratio predictor: 1memory pool: 1multi accent: 1layer wise adaptation: 1accent embedding: 1length perturbation: 
1optimisation: 1self supervised pretrain: 1representation learning: 1image representation: 1multilayer perceptrons: 1text independent: 1multi layer perceptron: 1convolution attention: 1local attention: 1local information: 1gaussian attention: 1skipping memory: 1low latency: 1real time: 1time domain analysis: 1self knowledge distillation: 1deep embedding learning: 1knowledge engineering: 1synchronisation: 1object detection: 1attention: 1low quality video: 1video signal processing: 1microphone arrays: 1multi speaker asr: 1meeting transcription: 1alimeeting: 1m2met: 1punctuation prediction: 1edge devices: 1streaming speech recognition: 1multi task learning: 1data utilization: 1dynamic scheduling: 1biometrics (access control): 1audio visual deep neural network: 1person verification: 1face recognition: 1data analysis: 1multi modal system: 1signal detection: 1modified magnitude phase spectrum: 1constant q modified octave coefficients: 1mixture models: 1signal classification: 1unknown kind spoofing detection: 1accent adaptation: 1accent speech recognition: 1rnnlm: 1signal to distortion ratio: 1blind source separation: 1acoustic beamforming: 1complex backpropagation: 1convolution: 1transfer functions: 1array signal processing: 1multi channel source separation: 1contrastive learning: 1i vector: 1tts based data augmentation: 1test time augmentation: 1phone posteriorgram: 1accent identification: 1ppg: 1data fusion: 1unit selection synthesis: 1x vector: 1long recording speech separation: 1convolutional neural nets: 1online processing: 1end to end asr: 1acoustic modeling: 1accent recognition: 1accented speech recognition: 1children’s speech recognition: 1text to speech: 1data selection: 1variational auto encoder: 1text independent speaker verification: 1generative adversarial network: 1end to end model: 1multi talker mixed speech recognition: 1permutation invariant training: 1overlapped speech recognition: 1transformer: 1neural beamforming: 1multitask learning: 1channel information: 1adversarial training: 1multimodal: 1audio visual systems: 1text dependent: 1adaptation: 1text mismatch: 1center loss: 1angular softmax: 1short duration text independent speaker verification: 1speaker neural embedding: 1triplet loss: 1ctc: 1hidden markov models: 1multi speaker speech recognition: 1cocktail party problem: 1teacher student learning: 1computer aided instruction: 1
Most publications (all venues) at: 2023: 42; 2022: 37; 2024: 30; 2018: 21; 2021: 20

Affiliations

Recent publications

SpeechComm2024 Shuai Wang 0016, Zhengyang Chen, Bing Han, Hongji Wang, Chengdong Liang, Binbin Zhang, Xu Xiang, Wen Ding, Johan Rohdin, Anna Silnova, Yanmin Qian, Haizhou Li 0001, 
Advancing speaker embedding learning: Wespeaker toolkit for research and production.

TASLP2024 Zhengyang Chen, Bing Han, Shuai Wang 0016, Yanmin Qian
Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer.

TASLP2024 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian
Advanced Long-Content Speech Recognition With Factorized Neural Transducer.

TASLP2024 Bing Han, Zhengyang Chen, Yanmin Qian
Self-Supervised Learning With Cluster-Aware-DINO for High-Performance Robust Speaker Verification.

TASLP2024 Jiahong Li, Chenda Li, Yifei Wu, Yanmin Qian
Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond.

TASLP2024 Bei Liu, Haoyu Wang 0007, Yanmin Qian
Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization.

TASLP2024 Wei Wang 0010, Yanmin Qian
Universal Cross-Lingual Data Generation for Low Resource ASR.

ICASSP2024 Bing Han, Zhiqiang Lv, Anbai Jiang, Wen Huang 0004, Zhengyang Chen, Yufeng Deng, Jiawei Ding, Cheng Lu 0007, Wei-Qiang Zhang 0001, Pingyi Fan, Jia Liu 0001, Yanmin Qian
Exploring Large Scale Pre-Trained Models for Robust Machine Anomalous Sound Detection.

ICASSP2024 Wen Huang 0004, Bing Han, Shuai Wang 0016, Zhengyang Chen, Yanmin Qian
Robust Cross-Domain Speaker Verification with Multi-Level Domain Adapters.

ICASSP2024 Yidi Jiang, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, Haizhou Li 0001, 
Prompt-Driven Target Speech Diarization.

ICASSP2024 Hang Shao, Bei Liu, Yanmin Qian
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models.

ICASSP2024 Shuai Wang 0016, Qibing Bai, Qi Liu 0018, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li 0001, 
Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.

ICASSP2024 Linfeng Yu, Wangyou Zhang, Chenpeng Du, Leying Zhang, Zheng Liang, Yanmin Qian
Generation-Based Target Speech Extraction with Speech Discretization and Vocoder.

ICASSP2024 Wangyou Zhang, Jee-weon Jung, Yanmin Qian
Improving Design of Input Condition Invariant Speech Enhancement.

TASLP2023 Bei Liu, Zhengyang Chen, Yanmin Qian
Depth-First Neural Architecture With Attentive Feature Fusion for Efficient Speaker Verification.

ICASSP2023 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian
LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.

ICASSP2023 Xun Gong 0005, Wei Wang 0010, Hang Shao, Xie Chen 0001, Yanmin Qian
Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR.

ICASSP2023 Bing Han, Zhengyang Chen, Yanmin Qian
Exploring Binary Classification Loss for Speaker Verification.

ICASSP2023 Jiahong Li, Chenda Li, Yifei Wu, Yanmin Qian
Robust Audio-Visual ASR with Unified Cross-Modal Attention.

ICASSP2023 Chenda Li, Yao Qian, Zhuo Chen 0006, Dongmei Wang, Takuya Yoshioka, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, 
Target Sound Extraction with Variable Cross-Modality Clues.

#6  | Björn W. Schuller | DBLP Google Scholar  
By venue: Interspeech: 68; ICASSP: 30; TASLP: 7
By year: 2024: 9; 2023: 20; 2022: 11; 2021: 18; 2020: 22; 2019: 14; 2018: 11
ISCA sessionsspeech emotion recognition: 5speech in health: 4the first dicova challenge: 3the interspeech 2020 computational paralinguistics challenge (compare): 3spoken dialog systems and conversational analysis: 2health-related speech analysis: 2voice conversion and adaptation: 2speech synthesis: 2multimodal systems: 2the interspeech 2021 computational paralinguistics challenge (compare): 2computational paralinguistics: 2social signals detection and speaker traits analysis: 2attention mechanism for speaker state recognition: 2the interspeech 2018 computational paralinguistics challenge (compare): 2multimodal speech emotion recognition: 1show and tell: 1speech, voice, and hearing disorders: 1speech and language in health: 1automatic analysis of paralinguistics: 1single-channel speech enhancement: 1atypical speech analysis and detection: 1asr technologies and systems: 1(multimodal) speech emotion recognition: 1pathological speech assessment: 1atypical speech detection: 1diverse modes of speech acquisition and processing: 1health and affect: 1speech type classification and diagnosis: 1speech in multimodality: 1alzheimer’s dementia recognition through spontaneous speech: 1diarization: 1acoustic scene classification: 1bioacoustics and articulation: 1speech enhancement: 1representation learning of emotion and paralinguistics: 1training strategy for speech emotion recognition: 1the interspeech 2019 computational paralinguistics challenge (compare): 1network architectures for emotion and paralinguistics recognition: 1speech signal characterization: 1representation learning for emotion: 1speech and language analytics for mental health: 1text analysis, multilingual issues and evaluation in speech synthesis: 1emotion modeling: 1emotion recognition and analysis: 1speech pathology, depression, and medical applications: 1speaker state and trait: 1second language acquisition and code-switching: 1
IEEE keywordsemotion recognition: 21speech recognition: 18speech emotion recognition: 13computational modeling: 6task analysis: 5speech enhancement: 4transformers: 4data models: 4transfer learning: 4multi task learning: 3adaptation models: 3recurrent neural nets: 3attention mechanism: 3predictive models: 2computer architecture: 2logic gates: 2multitasking: 2multi source domain adaptation: 2speaker independent: 2computer vision: 2robustness: 2linguistics: 2data privacy: 2self supervised learning: 2signal processing algorithms: 2machine learning: 2human computer interaction: 2mood: 2affective computing: 2semantics: 2computer audition: 2healthcare: 2audio signal processing: 2pattern classification: 2signal classification: 2artificial neural networks: 1low complexity: 1frame weighting: 1residual fusion: 1noise: 1time domain analysis: 1knowledge distillation (kd): 1probabilistic logic: 1audiogram: 1auditory system: 1multi head self attention: 1hearing aids: 1hearing aid: 1indexes: 1speech quality evaluation: 1alzheimer’s disease: 1computational complexity: 1convolution: 1hierarchical modelling: 1attention free transformer: 1alzheimer's disease: 1stability analysis: 1multi armed bandits: 1multi modality: 1joint distribution adaptation: 1acoustic scene classification: 1sharp minima: 1deep neural networks: 1acoustic measurements: 1scene classification: 1generalisation: 1loss landscape: 1prompt tuning: 1large language model: 1low rank adaptation: 1time frequency analysis: 1shifted window: 1aggregates: 1transformer: 1merging: 1hierarchical speech features: 1source free cross corpus speech emotion recognition: 1clustering algorithms: 1contrastive learning: 1masking: 1emotional: 1random splicing: 1prediction algorithms: 1speech: 1splicing: 1anonymization: 1lightweight deep learning: 1performance evaluation: 1edge device: 1neural structured learning: 1art: 1encoding: 1infant directed speech: 1adult directed speech: 1automatic speech classification: 1computational paralinguistics: 1covid 19: 1noise reduction: 1iterative optimisation: 1noise measurement: 1covid 19 detection: 1efficient edge analytics: 1adaptive inference: 1efficient deep learning: 1self distillation: 1redundancy: 1particle measurements: 1dataset bias reduction: 1hardware: 1asthma: 1personnel: 1speech modelling: 1redundancy reduction: 1recording: 1multitask learning: 1data collection: 1mental health: 1daily speech: 1dams: 1medical services: 1anxiety disorders: 1vo cal burst detection: 1animals: 1nonverbal vocalization: 1behavioral sciences: 1zero shot learning: 1generative learning: 1emotional prototypes: 1prototypes: 1federated learning: 1analytical models: 1stuttering monitoring: 1privacy: 1decoupled knowledge distillation: 1multi head attention: 1knowledge engineering: 1motion capture: 1unsupervised domain adaptation: 1adversarial learning: 1medical computing: 1hearing: 1intelligent medicine: 1health care: 1digital phenotype: 1overview: 1relativistic discriminator: 1domain adaptation: 1deep neural network: 1speech intelligibility: 1decoding: 1speech coding: 1maximum mean discrepancy: 1disentangled representation learning: 1audio generation: 1guided representation learning: 1and generative adversarial neural network: 1signal representation: 1support vector machines: 1multilayer perceptrons: 1glottal source estimation: 1iterative methods: 1diseases: 1glottal features: 1end to end systems: 1parkinson's disease: 1filtering theory: 1temporal convolutional networks: 1electroencephalography: 1medical signal processing: 1hierarchical attention 
mechanism: 1eeg signals: 1relu: 1arelu: 1gated recurrent unit: 1representation learning: 1computational linguistics: 1deep learning (artificial intelligence): 1semantic: 1paralinguistic: 1audiotextual information: 1vggish: 1ordinal classification: 1entropy: 1consistent rank logits: 1customer services: 1convolutional neural nets: 1adversarial attacks: 1gradient methods: 1convolutional neural network: 1data protection: 1adversarial training: 1end to end affective computing: 1adversarial networks: 1emotional speech synthesis: 1data augmentation: 1unsupervised learning: 1monotonic attention: 1mean square error methods: 1attention transfer: 1depression: 1hierarchical attention: 1psychology: 1behavioural sciences computing: 1speech emotion: 1frame level features: 1lstm: 1speech emotion prediction: 1end to end: 1joint training: 1emotion classification: 1audiovisual learning: 1audio visual systems: 1face recognition: 1emotion regression: 1state of mind: 1mood congruency: 1sentiment analysis: 1context modeling: 1hierarchical models: 1recurrent neural networks: 1gated recurrent units: 1attention mechanisms: 1
Most publications (all venues) at: 2023: 102; 2022: 97; 2021: 97; 2017: 84; 2020: 76

Affiliations
Imperial College London, GLAM, UK
University of Augsburg, Department of Computer Science, Germany
University of Passau, Faculty of Computer Science and Mathematics, Germany (former)

Recent publications

TASLP2024 Jiaming Cheng, Ruiyu Liang, Lin Zhou 0001, Li Zhao 0003, Chengwei Huang, Björn W. Schuller
Residual Fusion Probabilistic Knowledge Distillation for Speech Enhancement.

TASLP2024 Ruiyu Liang, Yue Xie, Jiaming Cheng, Cong Pang, Björn W. Schuller
A Non-Invasive Speech Quality Evaluation Algorithm for Hearing Aids With Multi-Head Self-Attention and Audiogram-Based Features.

ICASSP2024 Zhongren Dong, Zixing Zhang 0001, Weixiang Xu, Jing Han 0010, Jianjun Ou, Björn W. Schuller
HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech.

ICASSP2024 Xiangheng He, Junjie Chen, Björn W. Schuller
Task Selection and Assignment for Multi-Modal Multi-Task Dialogue Act Classification with Non-Stationary Multi-Armed Bandits.

ICASSP2024 Cheng Lu 0005, Yuan Zong, Hailun Lian, Yan Zhao, Björn W. Schuller, Wenming Zheng, 
Improving Speaker-Independent Speech Emotion Recognition using Dynamic Joint Distribution Adaptation.

ICASSP2024 Manuel Milling, Andreas Triantafyllopoulos, Iosif Tsangko, Simon David Noel Rampp, Björn Wolfgang Schuller
Bringing the Discussion of Minima Sharpness to the Audio Domain: A Filter-Normalised Evaluation for Acoustic Scene Classification.

ICASSP2024 Liyizhe Peng, Zixing Zhang 0001, Tao Pang, Jing Han 0010, Huan Zhao 0003, Hao Chen, Björn W. Schuller
Customising General Large Language Models for Specialised Emotion Recognition Tasks.

ICASSP2024 Yong Wang, Cheng Lu 0005, Hailun Lian, Yan Zhao, Björn W. Schuller, Yuan Zong, Wenming Zheng, 
Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition.

ICASSP2024 Yan Zhao, Jincen Wang, Cheng Lu 0005, Sunan Li, Björn W. Schuller, Yuan Zong, Wenming Zheng, 
Emotion-Aware Contrastive Adaptation Network for Source-Free Cross-Corpus Speech Emotion Recognition.

ICASSP2023 Felix Burkhardt, Anna Derington, Matthias Kahlau, Klaus R. Scherer, Florian Eyben, Björn W. Schuller
Masking Speech Contents by Random Splicing: is Emotional Expression Preserved?

ICASSP2023 Yi Chang 0004, Zhao Ren, Thanh Tam Nguyen, Kun Qian 0003, Björn W. Schuller
Knowledge Transfer for on-Device Speech Emotion Recognition With Neural Structured Learning.

ICASSP2023 Najla D. Al Futaisi, Alejandrina Cristià, Björn W. Schuller
Hearttoheart: The Arts of Infant Versus Adult-Directed Speech Classification.

ICASSP2023 Shuo Liu 0012, Adria Mallol-Ragolta, Björn W. Schuller
COVID-19 Detection from Speech in Noisy Conditions.

ICASSP2023 Zhao Ren, Thanh Tam Nguyen, Yi Chang 0004, Björn W. Schuller
Fast Yet Effective Speech Emotion Recognition with Self-Distillation.

ICASSP2023 Georgios Rizos, Rafael A. Calvo, Björn W. Schuller
Positive-Pair Redundancy Reduction Regularisation for Speech-Based Asthma Diagnosis Prediction.

ICASSP2023 Meishu Song, Andreas Triantafyllopoulos, Zijiang Yang 0007, Hiroki Takeuchi, Toru Nakamura, Akifumi Kishi, Tetsuro Ishizawa, Kazuhiro Yoshiuchi, Xin Jing, Vincent Karas, Zhonghao Zhao, Kun Qian 0003, Bin Hu 0001, Björn W. Schuller, Yoshiharu Yamamoto, 
Daily Mental Health Monitoring from Speech: A Real-World Japanese Dataset and Multitask Learning Analysis.

ICASSP2023 Panagiotis Tzirakis, Alice Baird, Jeffrey A. Brooks, Christopher Gagne, Lauren Kim, Michael Opara, Christopher B. Gregory, Jacob Metrick, Garrett Boseck, Vineet Tiruvadi, Björn W. Schuller, Dacher Keltner, Alan Cowen, 
Large-Scale Nonverbal Vocalization Detection Using Transformers.

ICASSP2023 Xinzhou Xu, Jun Deng, Zixing Zhang 0001, Zhen Yang, Björn W. Schuller
Zero-Shot Speech Emotion Recognition Using Generative Learning with Reconstructed Prototypes.

ICASSP2023 Yongzi Yu, Wanyong Qiu, Chen Quan, Kun Qian 0003, Zhihua Wang, Yu Ma, Bin Hu 0001, Björn W. Schuller, Yoshiharu Yamamoto, 
Federated Intelligent Terminals Facilitate Stuttering Monitoring.

ICASSP2023 Ziping Zhao 0001, Huan Wang, Haishuai Wang, Björn W. Schuller
Hierarchical Network with Decoupled Knowledge Distillation for Speech Emotion Recognition.

#7  | Hung-yi Lee | DBLP Google Scholar  
By venue: ICASSP: 41; Interspeech: 41; TASLP: 11; ACL: 5; ACL-Findings: 2
By year: 2024: 12; 2023: 16; 2022: 22; 2021: 16; 2020: 17; 2019: 13; 2018: 4
ISCA sessionsspeech synthesis: 5speech recognition: 2spoken language processing: 2adaptation, transfer learning, and distillation for asr: 2voice conversion and adaptation: 2new trends in self-supervised speech processing: 2neural techniques for voice conversion and waveform generation: 2spoken language translation, information retrieval, summarization, resources, and evaluation: 1speech analysis: 1the voicemos challenge: 1trustworthy speech processing: 1spoofing-aware automatic speaker verification (sasv): 1embedding and network architecture for speaker recognition: 1neural network training methods for asr: 1source separation: 1spoken term detection & voice search: 1voice anti-spoofing and countermeasure: 1speech signal analysis and representation: 1search for speech recognition: 1conversational systems: 1speech synthesis paradigms and methods: 1applications of language technologies: 1language learning and databases: 1speech enhancement: 1the zero resource speech challenge 2019: 1turn management in dialogue: 1speech and audio source separation and scene analysis: 1voice conversion: 1extracting information from audio: 1spoken language understanding: 1acoustic modelling: 1
IEEE keywordsspeech recognition: 16self supervised learning: 15speaker recognition: 11task analysis: 8speech synthesis: 8natural language processing: 7benchmark testing: 7computational modeling: 6robustness: 6adversarial attack: 6adaptation models: 5question answering (information retrieval): 5spoken language understanding: 5speech coding: 5voice conversion: 5representation learning: 4data models: 4unsupervised learning: 4security of data: 4predictive models: 3speech enhancement: 3linguistics: 3spoken question answering: 3semantics: 3generative adversarial networks: 3unsupervised asr: 3few shot: 3meta learning: 3generative adversarial network: 3speech representation learning: 3biometrics (access control): 3automatic speech recognition: 3transformer: 2transformers: 2knowledge distillation: 2evaluation: 2benchmark: 2analytical models: 2self supervised: 2emotion recognition: 2visualization: 2perturbation methods: 2speaker verification: 2vocoders: 2vocoder: 2decoding: 2pipelines: 2speech translation: 2noise robustness: 2maml: 2automatic speaker verification: 2supervised learning: 2adversarial defense: 2audio signal processing: 2anti spoofing: 2disentangled representations: 2interactive systems: 2source separation: 2speech separation: 2low resource: 2end to end: 2signal representation: 2adversarial training: 2prompting: 1speech language model: 1tuning: 1non autoregressive: 1neural machine translation: 1biological system modeling: 1task generalization: 1protocols: 1foundation model: 1speech: 1zero shot learning: 1upper bound: 1lattices: 1in context learning: 1large language models. asr confusion networks: 1buildings: 1instruction tuning: 1collaboration: 1multilingual: 1code switch: 1discrete unit: 1zero resource: 1manuals: 1spoken content retrieval: 1multitasking: 1switches: 1speech sentiment analysis: 1paralinguistics: 1large language models: 1spoken dialogue modeling: 1audio visual learning: 1soft sensors: 1scalability: 1rendering (computer graphics): 1purification: 1adversarial sample detection: 1ensemble learning: 1user experience: 1electronic mail: 1large scaled pre trained model: 1meta reinforcement learning: 1generators: 1natural language generation: 1monte carlo methods: 1autoregressive model: 1neural speech synthesis: 1neural network: 1bars: 1visually grounded speech: 1multimodal speech processing: 1image retrieval: 1multilingual speech processing: 1degradation: 1computational efficiency: 1once for all training: 1sequence compression: 1reproducibility of results: 1espnet: 1s3prl: 1learning systems: 1codes: 1tokenization: 1cloning: 1structured pruning: 1performance evaluation: 1trainable pruning: 1mobile handsets: 1personalized tts: 1voice cloning: 1superb: 1noise measurement: 1ensemble knowledge distillation: 1distortions: 1bridges: 1connectors: 1syntactics: 1unsupervised word segmentation: 1self supervised speech representations: 1unsupervised constituency parsing: 1speaker adaptation: 1tts: 1signal sampling: 1phone recognition: 1hidden markov models: 1pattern classification: 1adversarial attacks: 1data handling: 1model compression: 1voice activity detection: 1computer based training: 1open source: 1self supervised speech representation: 1error analysis: 1self supervised speech models: 1superb benchmark: 1data bias: 1partially fake audio detection: 1audio deep synthesis detection challenge: 1design methodology: 1sensor fusion: 1language translation: 1pre training: 1representation: 1adaptive instance normalization: 1activation guidance: 1speaker representation: 1multi speaker text to 
speech: 1semi supervised learning: 1any to any: 1con catenative: 1attention mechanism: 1anil: 1weapons: 1information filters: 1image rectification: 1gallium nitride: 1fisheye camera: 1acoustic distortion: 1data visualization: 1code switching: 1numerical models: 1language model: 1language adaptation: 1iarpa babel: 1analysis: 1interpretability: 1speech representation: 1representation quantization: 1quantisation (signal): 1unsupervised training: 1transformer encoders: 1vector quantization: 1spatial smoothing: 1spoofing countermeasure: 1label ambiguity problem: 1permutation invariant training: 1cocktail party problem: 1speech question answering: 1attention model: 1toefl: 1squad: 1computer aided instruction: 1domain adaptation: 1sqa: 1adversarial learning: 1text analysis: 1criticizing language model: 1deep q network: 1dialogue state tracking: 1deep reinforcement learning: 1
Most publications (all venues) at: 2024: 58; 2022: 55; 2023: 42; 2021: 37; 2020: 36

Affiliations

Recent publications

TASLP2024 Kai-Wei Chang, Haibin Wu, Yu-Kai Wang, Yuan-Kuei Wu, Hua Shen, Wei-Cheng Tseng, Iu-thing Kang, Shang-wen Li 0001, Hung-Yi Lee
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks.

TASLP2024 Shensian Syu, Juncheng Xie, Hung-yi Lee
Improving Non-Autoregressive Translation Quality With Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC.

TASLP2024 Shu-Wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li 0001, Abdelrahman Mohamed, Shinji Watanabe 0001, Hung-yi Lee
A Large-Scale Evaluation of Speech Foundation Models.

ICASSP2024 Kevin Everson, Yile Gu, Chao-Han Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-Yi Lee, Ariya Rastrow, Andreas Stolcke, 
Towards ASR Robust Spoken Language Understanding Through in-Context Learning with Word Confusion Networks.

ICASSP2024 Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan S. Sharma, Shinji Watanabe 0001, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee
Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech.

ICASSP2024 Kuan-Po Huang, Chih-Kai Yang, Yu-Kuan Fu, Ewan Dunbar, Hung-Yi Lee
Zero Resource Code-Switched Speech Benchmark Using Speech Utterance Pairs for Multiple Spoken Languages.

ICASSP2024 Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li 0001, Abdelrahman Mohamed, Hung-Yi Lee, Lin-Shan Lee, 
SpeechDPR: End-To-End Spoken Passage Retrieval For Open-Domain Spoken Question Answering.

ICASSP2024 Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-Yi Lee, Ivan Bulyko, 
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue.

ICASSP2024 Yuan Tseng, Layne Berry, Yiting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Poyao Huang 0001, Chun-Mao Lai, Shang-Wen Li 0001, David Harwath, Yu Tsao 0001, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models.

ICASSP2024 Haibin Wu, Heng-Cheng Kuo, Yu Tsao 0001, Hung-Yi Lee
Scalable Ensemble-Based Detection Method Against Adversarial Attacks For Speaker Verification.

ACL2024 Guan-Ting Lin, Cheng-Han Chiang, Hung-yi Lee
Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations.

ACL-Findings2024 Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan S. Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe 0001, 
On the Evaluation of Speech Foundation Models for Spoken Language Understanding.

TASLP2023 Yun-Yen Chuang, Hung-Min Hsu, Kevin Lin, Ray-I Chang, Hung-Yi Lee
MetaEx-GAN: Meta Exploration to Improve Natural Language Generation via Generative Adversarial Networks.

TASLP2023 Po-Chun Hsu, Da-Rong Liu, Andy T. Liu, Hung-yi Lee
Parallel Synthesis for Autoregressive Speech Generation.

ICASSP2023 Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath, 
M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval.

ICASSP2023 Hsuan-Jui Chen, Yen Meng, Hung-yi Lee
Once-for-All Sequence Compression for Self-Supervised Speech Models.

ICASSP2023 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola García, Hung-Yi Lee, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Euro: Espnet Unsupervised ASR Open-Source Toolkit.

ICASSP2023 Chan-Jan Hsu, Ho-Lam Chung, Hung-Yi Lee, Yu Tsao 0001, 
T5lephone: Bridging Speech and Text Self-Supervised Models for Spoken Language Understanding Via Phoneme Level T5.

ICASSP2023 Sung-Feng Huang, Chia-Ping Chen, Zhi-Sheng Chen, Yu-Pao Tsai, Hung-Yi Lee
Personalized Lightweight Text-to-Speech: Voice Cloning with Adaptive Structured Pruning.

ICASSP2023 Kuan-Po Huang, Tzu-hsun Feng, Yu-Kuan Fu, Tsu-Yuan Hsu, Po-Chieh Yen, Wei-Cheng Tseng, Kai-Wei Chang, Hung-Yi Lee
Ensemble Knowledge Distillation of Self-Supervised Speech Models.

#8  | Xunying Liu | DBLP Google Scholar  
By venue: Interspeech: 44; ICASSP: 38; TASLP: 18
By year: 2024: 6; 2023: 16; 2022: 17; 2021: 25; 2020: 13; 2019: 17; 2018: 6
ISCA sessionsspeech and language in health: 6speech recognition of atypical speech: 5voice conversion and adaptation: 2topics in asr: 2asr neural network architectures: 2medical applications and visual asr: 2novel transformer models for asr: 1acoustic model adaptation for asr: 1speech recognition: 1multi-, cross-lingual and other topics in asr: 1novel models and training methods for asr: 1multimodal speech emotion recognition and paralinguistics: 1miscellaneous topics in speech, voice and hearing disorders: 1zero, low-resource and multi-modal speech recognition: 1voice anti-spoofing and countermeasure: 1non-autoregressive sequential modeling for speech processing: 1assessment of pathological speech and language: 1speaker recognition: 1multimodal speech processing: 1learning techniques for speaker recognition: 1speech and speaker recognition: 1neural techniques for voice conversion and waveform generation: 1speech and audio classification: 1model adaptation for asr: 1lexicon and language model for speech recognition: 1novel neural network architectures for acoustic modelling: 1second language acquisition and code-switching: 1voice conversion: 1multimodal systems: 1expressive speech synthesis: 1application of asr in medical practice: 1
IEEE keywordsspeech recognition: 36speaker recognition: 14recurrent neural nets: 9natural language processing: 9bayes methods: 8adaptation models: 7speech synthesis: 7data models: 6task analysis: 6bayesian learning: 6speech separation: 6data augmentation: 5speaker adaptation: 5emotion recognition: 5deep learning (artificial intelligence): 5gaussian processes: 5optimisation: 5elderly speech: 4dysarthric speech: 4audio visual: 4switches: 4transformer: 4neural architecture search: 4speech coding: 4voice conversion: 4speech emotion recognition: 4quantisation (signal): 4pre trained asr system: 3older adults: 3decoding: 3perturbation methods: 3dysarthric speech reconstruction: 3conformer: 3speech disorders: 3end to end: 3domain adaptation: 3audio visual systems: 3speech intelligibility: 3multi channel: 3overlapped speech: 3language models: 3convolutional neural nets: 3wav2vec2.0: 2gan: 2multi modal: 2visualization: 2training data: 2controllability: 2error analysis: 2transformers: 2hidden markov models: 2estimation: 2speech enhancement: 2automatic speech recognition: 2computational modeling: 2semantics: 2self supervised learning: 2linguistics: 2adaptation: 2lf mmi: 2parameter estimation: 2uncertainty: 2handicapped aids: 2disordered speech recognition: 2time delay neural network: 2model uncertainty: 2neural language models: 2multi look: 2variational inference: 2inference mechanisms: 2lhuc: 2gradient methods: 2admm: 2knowledge distillation: 2quantization: 2speaker verification: 2code switching: 2standards: 1multi lingual xlsr: 1hubert: 1hybrid tdnn: 1end to end conformer: 1speech: 1av hubert: 1transforms: 1low latency: 1rapid adaptation: 1interpolation: 1specaugment: 1reinforcement learning: 1confidence score estimation: 1speech dereverberation: 1maximum likelihood detection: 1nonlinear filters: 1neural machine translation: 1hierarchical attention mechanism: 1machine translation: 1generative adversarial networks: 1vae: 1alzheimer’s disease: 1sociology: 1syntactics: 1task oriented: 1transfer learning: 1pretrained embeddings: 1multimodality: 1affective computing: 1multi label: 1bidirectional control: 1multi task learning: 1emotional expression: 1multi culture: 1vocal bursts: 1data analysis: 1bayesian: 1nist: 1elderly speech recognition: 1search problems: 1uncertainty handling: 1minimisation: 1neural net architecture: 1monte carlo methods: 1articulatory inversion: 1hybrid power systems: 1benchmark testing: 1dereverberation and recognition: 1reverberation: 1speaker change detection: 1audio signal processing: 1multitask learning: 1unsupervised learning: 1unsupervised speech decomposition: 1adversarial speaker adaptation: 1speaker identity: 1multi speaker: 1knowledge transfer: 1video to speech synthesis: 1vector quantization: 1measurement: 1knowledge engineering: 1lips: 1predictive coding: 1vocabulary: 1uniform sampling: 1path dropout: 1mean square error methods: 1neural network quantization: 1source separation: 1mixed precision: 1direction of arrival estimation: 1direction of arrival: 1speaker diarization: 1delays: 1generalisation (artificial intelligence): 1gaussian process: 1any to many: 1sequence to sequence modeling: 1signal reconstruction: 1signal sampling: 1signal representation: 1location relative attention: 1multimodal speech recognition: 1capsule: 1exemplary emotion descriptor: 1expressive speech synthesis: 1residual error: 1capsule network: 1spatial information: 1sequential: 1recurrent: 1tdnn: 1switchboard: 1lstm rnn: 1low bit quantization: 1image recognition: 1microphone arrays: 1visual 
occlusion: 1overlapped speech recognition: 1jointly fine tuning: 1filtering theory: 1video signal processing: 1synthetic speech detection: 1res2net: 1voice activity detection: 1replay detection: 1multi scale feature: 1asv anti spoofing: 1adress: 1cognition: 1patient diagnosis: 1alzheimer's disease detection: 1signal classification: 1diseases: 1features: 1geriatrics: 1medical diagnostic computing: 1asr: 1controllable and efficient: 1text to speech: 1semi autoregressive: 1prosody modelling: 1autoregressive processes: 1neurocognitive disorder detection: 1dementia: 1visual feature generation: 1audio visual speech recognition (avsr): 1phonetic pos teriorgrams: 1adversarial attack: 1x vector: 1gmm i vector: 1accent conversion: 1accented speech recognition: 1cross modal: 1seq2seq: 1recurrent neural networks: 1data compression: 1alternating direction methods of multipliers: 1audio visual speech recognition: 1probability: 1keyword search: 1language model: 1feedforward: 1recurrent neural network: 1succeeding words: 1multilingual speech synthesis: 1foreign accent: 1gaussian process neural network: 1activation function selection: 1bayesian neural network: 1neural network language models: 1lstm: 1connectionist temporal classification (ctc): 1e learning: 1computer assisted pronunciation training (capt): 1con volutional neural network (cnn): 1mispronunciation detection and diagnosis (mdd): 1utterance level features: 1spatial relationship information: 1recurrent connection: 1capsule networks: 1maximum likelihood estimation: 1entropy: 1natural gradient: 1rnnlms: 1
Most publications (all venues) at: 2022: 27; 2021: 27; 2024: 18; 2023: 17; 2019: 17

Affiliations

Recent publications

TASLP2024 Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu
Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition.

TASLP2024 Zengrui Jin, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan Li, Xunying Liu
Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition.

ICASSP2024 Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.

ICASSP2024 Jiajun Deng, Xurong Xie, Guinan Li, Mingyu Cui, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Zhaoqing Li, Xunying Liu
Towards High-Performance and Low-Latency Feature-Based Speaker Adaptation of Conformer Speech Recognition Systems.

ICASSP2024 Zengrui Jin, Xurong Xie, Tianzi Wang, Mengzhe Geng, Jiajun Deng, Guinan Li, Shujie Hu, Xunying Liu
Towards Automatic Data Augmentation for Disordered Speech Recognition.

ICASSP2024 Huimeng Wang, Zengrui Jin, Mengzhe Geng, Shujie Hu, Guinan Li, Tianzi Wang, Haoning Xu, Xunying Liu
Enhancing Pre-Trained ASR System Fine-Tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation.

TASLP2023 Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Guinan Li, Shujie Hu, Xunying Liu
Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems.

TASLP2023 Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu
Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition.

TASLP2023 Xixin Wu, Hui Lu, Kun Li 0003, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms.

ICASSP2023 Shujie Hu, Xurong Xie, Zengrui Jin, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng, 
Exploring Self-Supervised Pre-Trained ASR Models for Dysarthric and Elderly Speech Recognition.

ICASSP2023 Zengrui Jin, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, Xunying Liu
Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition.

ICASSP2023 Jinchao Li, Kaitao Song, Junan Li, Bo Zheng, Dongsheng Li 0002, Xixin Wu, Xunying Liu, Helen Meng, 
Leveraging Pretrained Representations With Task-Related Keywords for Alzheimer's Disease Detection.

ICASSP2023 Jinchao Li, Xixin Wu, Kaitao Song, Dongsheng Li 0002, Xunying Liu, Helen Meng, 
A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition.

ICASSP2023 Xurong Xie, Xunying Liu, Hui Chen 0020, Hongan Wang, 
Unsupervised Model-Based Speaker Adaptation of End-To-End Lattice-Free MMI Model for Speech Recognition.

Interspeech2023 Mingyu Cui, Jiawen Kang 0002, Jiajun Deng, Xi Yin 0010, Yutao Xie, Xie Chen 0001, Xunying Liu
Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems.

Interspeech2023 Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu
Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems.

Interspeech2023 Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu
Use of Speech Impairment Severity for Dysarthric Speech Recognition.

Interspeech2023 Mengzhe Geng, Xurong Xie, Rongfeng Su, Jianwei Yu, Zengrui Jin, Tianzi Wang, Shujie Hu, Zi Ye 0001, Helen Meng, Xunying Liu
On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition.

Interspeech2023 Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Helen Meng, Xunying Liu
Exploiting Cross-Domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition.

Interspeech2023 Zhaoqing Li, Tianzi Wang, Jiajun Deng, Junhao Xu, Shoukang Hu, Xunying Liu
Lossless 4-bit Quantization of Architecture Compressed Conformer ASR Systems on the 300-hr Switchboard Corpus.

#9  | Dong Yu 0001 | DBLP Google Scholar  
By venue: Interspeech: 40; ICASSP: 39; TASLP: 9; ACL: 2; ICLR: 2; ICML: 1; EMNLP: 1; ACL-Findings: 1; IJCAI: 1; NAACL: 1
By year: 2024: 5; 2023: 11; 2022: 15; 2021: 18; 2020: 22; 2019: 18; 2018: 8
ISCA sessionsspeech recognition: 3speech coding and enhancement: 2speech synthesis: 2voice conversion and adaptation: 2speaker recognition: 2source separation, dereverberation and echo cancellation: 2multi-channel speech enhancement: 2singing voice computing and processing in music: 2deep learning for source separation and pitch tracking: 2sequence models for asr: 2speech enhancement and bandwidth expansion: 1dereverberation and echo cancellation: 1multi-, cross-lingual and other topics in asr: 1topics in asr: 1source separation: 1novel neural network architectures for asr: 1speech localization, enhancement, and quality assessment: 1asr model training and strategies: 1speech synthesis paradigms and methods: 1multimodal speech processing: 1speech and audio source separation and scene analysis: 1speech enhancement: 1asr neural network architectures: 1asr neural network training: 1asr for noisy and far-field speech: 1robust speech recognition: 1speaker verification using neural network methods: 1expressive speech synthesis: 1topics in speech recognition: 1
IEEE keywordsspeech recognition: 23speaker recognition: 11speech synthesis: 9speech enhancement: 8speech separation: 7task analysis: 6natural language processing: 5speaker embedding: 5data augmentation: 5reverberation: 4decoding: 4microphone arrays: 4source separation: 4recurrent neural nets: 4end to end speech recognition: 4self supervised learning: 3automatic speech recognition: 3voice activity detection: 3unsupervised learning: 3voice conversion: 3vocoders: 2spectrogram: 2end to end: 2measurement: 2application program interfaces: 2graphics processing units: 2pattern clustering: 2audio visual systems: 2audio signal processing: 2text analysis: 2multi channel: 2filtering theory: 2semi supervised learning: 2overlapped speech: 2transfer learning: 2domain adaptation: 2maximum mean discrepancy: 2speech coding: 2code switching: 2speaker verification: 2self attention: 2attention based model: 2hidden markov models: 2artificial neural networks: 1loudspeakers: 1hybrid method: 1acoustic howling suppression: 1kalman filters: 1microphones: 1adaptation models: 1kalman filter: 1recursive training: 1noise reduction: 1diffusion models: 1signal to noise ratio: 1generative models: 1speech editing: 1unsupervised tts acoustic modeling: 1representation learning: 1wavlm: 1c dsvae: 1transducers: 1bayes methods: 1discriminative training: 1mutual information: 1maximum mutual information: 1minimum bayesian risk: 1sequential training: 1autoregressive model: 1diffusion model: 1text to sound generation: 1transforms: 1vocoder: 1zero shot style transfer: 1variational autoencoder: 1supervised learning: 1self supervised disentangled representation learning: 1low quality data: 1neural speech synthesis: 1style transfer: 1joint training: 1dual path: 1acoustic model: 1echo suppression: 1streaming: 1dynamic weight attention: 1acoustic environment: 1speech simulation: 1transient response: 1multi speaker: 1knowledge transfer: 1video to speech synthesis: 1vector quantization: 1knowledge engineering: 1lips: 1predictive coding: 1vocabulary: 1rnn t: 1code switched asr: 1bilingual asr: 1computational linguistics: 1expert systems: 1router architecture: 1mixture of experts: 1global information: 1accent embedding: 1domain embedding: 1speaker clustering: 1inference mechanisms: 1overlap speech detection: 1speaker diarization: 1sensor fusion: 1sound source separation: 1audio visual processing: 1rewriting systems: 1interactive systems: 1semantic role labeling: 1dialogue understanding: 1conversational semantic role labeling: 1natural language understanding: 1image recognition: 1audio visual: 1visual occlusion: 1overlapped speech recognition: 1jointly fine tuning: 1video signal processing: 1mvdr: 1array signal processing: 1adl mvdr: 1neural architecture search: 1transferable architecture: 1neural net architecture: 1multi granularity: 1single channel: 1self attentive network: 1synthetic speech detection: 1res2net: 1replay detection: 1multi scale feature: 1asv anti spoofing: 1target speaker speech recognition: 1targetspeaker speech extraction: 1uncertainty estimation: 1direction of arrival estimation: 1source localization: 1contrastive learning: 1target speaker enhancement: 1robust speaker verification: 1interference suppression: 1speaker verification (sv): 1phonetic pos teriorgrams: 1speech intelligibility: 1regression analysis: 1singing synthesis: 1multi channel speech separation: 1inter channel convolution differences: 1spatial filters: 1spatial features: 1parallel optimization: 1random sampling.: 1model partition: 1lstm language model: 
1bmuf: 1joint learning: 1noise measurement: 1speaker aware: 1target speech enhancement: 1time domain analysis: 1gain: 1teacher student: 1accent conversion: 1accented speech recognition: 1target speech extraction: 1minimisation: 1neural beamformer: 1signal reconstruction: 1training data: 1diffuse reflection: 1acoustic simulation: 1reflection: 1persistent memory: 1dfsmn: 1multi modal: 1audio visual speech recognition: 1permutation invariant training: 1encoding: 1model integration: 1multi band: 1nist: 1artificial intelligence: 1mel frequency cepstral coefficient: 1loss function: 1boundary: 1top k loss: 1language model: 1error analysis: 1mathematical model: 1switches: 1attention based end to end speech recognition: 1early update: 1optimization: 1token wise training: 1discriminative feature learning: 1sequence discriminative training: 1acoustic variability: 1asr: 1variational inference: 1convolutional neural nets: 1quasifully recurrent neural network (qrnn): 1parallel processing: 1parallel wavenet: 1text to speech (tts) synthesis: 1convolutional neural network (cnn): 1text to speech synthesis: 1relative position aware representation: 1sequence to sequence model: 1teacher student training: 1knowledge distillation: 1multi domain: 1all rounder: 1feedforward neural nets: 1cloud computing: 1quantization: 1polynomials: 1privacy preserving: 1dnn: 1cryptography: 1encryption: 1text dependent: 1end to end speaker verification: 1seq2seq attention: 1optimisation: 1siamese neural networks: 1
Most publications (all venues) at: 2023: 57; 2022: 50; 2020: 47; 2019: 47; 2024: 44

Affiliations
Tencent AI Lab, China
Microsoft Research, Redmond, WA, USA (1998 - 2017)
University of Idaho, Moscow, ID, USA (PhD)

Recent publications

TASLP2024 Hao Zhang, Yixuan Zhang 0005, Meng Yu 0003, Dong Yu 0001
Enhanced Acoustic Howling Suppression via Hybrid Kalman Filter and Deep Learning Models.

ICASSP2024 Muqiao Yang, Chunlei Zhang, Yong Xu 0004, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu 0001
uSee: Unified Speech Enhancement And Editing with Conditional Diffusion Models.

ICML2024 Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su 0002, Wei Liang, Dong Yu 0001
Prompt-guided Precise Audio Editing with Diffusion Models.

ACL2024 Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang 0001, Ziyue Jiang 0001, Xuankai Chang, Jiatong Shi, Chao Weng, Zhou Zhao, Dong Yu 0001
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.

ACL2024 Yongxin Zhu 0003, Dan Su 0002, Liqiang He, Linli Xu, Dong Yu 0001
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer.

TASLP2023 Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu 0001
Unsupervised TTS Acoustic Modeling for TTS With Conditional Disentangled Sequential VAE.

TASLP2023 Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu 0001
Integrating Lattice-Free MMI Into End-to-End Speech Recognition.

TASLP2023 Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu 0001
Diffsound: Discrete Diffusion Model for Text-to-Sound Generation.

Interspeech2023 Yong Xu 0004, Vinay Kothapally, Meng Yu 0003, Shixiong Zhang, Dong Yu 0001
Zoneformer: On-device Neural Beamformer For In-car Multi-zone Speech Separation, Enhancement and Echo Cancellation.

Interspeech2023 Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu 0001, Shinji Watanabe 0001, 
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction.

Interspeech2023 Wei Xiao, Wenzhe Liu, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su 0002, Shidong Shang, Dong Yu 0001
Multi-mode Neural Speech Coding Based on Deep Generative Networks.

Interspeech2023 Yuping Yuan, Zhao You, Shulin Feng, Dan Su 0002, Yanchun Liang 0001, Xiaohu Shi, Dong Yu 0001
Compressed MoE ASR Model Based on Knowledge Distillation and Quantization.

Interspeech2023 Hao Zhang, Meng Yu 0003, Yuzhong Wu, Tao Yu, Dong Yu 0001
Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression.

Interspeech2023 Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu 0001, Zhao You, Dan Su 0002, Dong Yu 0001, Helen Meng, 
Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation.

EMNLP2023 Dian Yu 0001, Xiaoyang Wang, Wanshun Chen, Nan Du, Longyue Wang, Haitao Mi, Dong Yu 0001
More Than Spoken Words: Nonverbal Message Extraction and Generation.

ACL-Findings2023 Rongjie Huang, Chunlei Zhang, Yi Ren 0006, Zhou Zhao, Dong Yu 0001
Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech.

ICASSP2022 Jiachen Lian, Chunlei Zhang, Dong Yu 0001
Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion.

ICASSP2022 Songxiang Liu, Shan Yang, Dan Su 0002, Dong Yu 0001
Referee: Towards Reference-Free Cross-Speaker Style Transfer with Low-Quality Data for Expressive Speech Synthesis.

ICASSP2022 Dongpeng Ma, Yiwen Wang, Liqiang He, Mingjie Jin, Dan Su 0002, Dong Yu 0001
DP-DWA: Dual-Path Dynamic Weight Attention Network With Streaming Dfsmn-San For Automatic Speech Recognition.

ICASSP2022 Anton Ratnarajah, Shi-Xiong Zhang, Meng Yu 0003, Zhenyu Tang 0001, Dinesh Manocha, Dong Yu 0001
Fast-Rir: Fast Neural Diffuse Room Impulse Response Generator.

#10  | Zhiyong Wu 0001 | DBLP Google Scholar  
By venue: ICASSP: 48; Interspeech: 34; TASLP: 6; AAAI: 3; IJCAI: 2; EMNLP: 1
By year: 2024: 12; 2023: 25; 2022: 22; 2021: 14; 2020: 6; 2019: 11; 2018: 4
ISCA sessionsspeech synthesis: 12voice conversion and adaptation: 3speech coding: 2speech recognition: 2spoken term detection: 2speech coding and enhancement: 1models for streaming asr: 1single-channel speech enhancement: 1embedding and network architecture for speaker recognition: 1non-autoregressive sequential modeling for speech processing: 1voice anti-spoofing and countermeasure: 1speech synthesis paradigms and methods: 1asr neural network architectures and training: 1new trends in self-supervised speech processing: 1neural techniques for voice conversion and waveform generation: 1emotion recognition and analysis: 1expressive speech synthesis: 1deep learning for source separation and pitch tracking: 1
IEEE keywordsspeech synthesis: 16speech recognition: 15natural language processing: 10speaker recognition: 8speech emotion recognition: 7emotion recognition: 7speech enhancement: 6speech coding: 6text analysis: 6vocoders: 5decoding: 5recurrent neural nets: 5expressive speech synthesis: 4semantics: 4noise reduction: 4hidden markov models: 3task analysis: 3data mining: 3data models: 3text to speech: 3voice conversion: 3self supervised learning: 3transformer: 3transformers: 3bidirectional attention mechanism: 2spectrogram: 2speech: 2visualization: 2training data: 2cloning: 2adaptation models: 2language model: 2timbre: 2multiple signal classification: 2encoding: 2instruments: 2coherence: 2linguistics: 2human computer interaction: 2hierarchical: 2predictive models: 2computational modeling: 2parallel processing: 2speaking style modelling: 2time frequency analysis: 2robustness: 2costs: 2speaker verification: 2automatic speaker verification: 2pattern classification: 2security of data: 2adversarial defense: 2trees (mathematics): 2deep learning (artificial intelligence): 2biometrics (access control): 2adversarial attack: 2optimisation: 2entropy: 2regression analysis: 2ordinal regression: 2code switching: 2convolutional neural nets: 2films: 1multiscale speaking style transfer: 1text to speech synthesis: 1games: 1automatic dubbing: 1cross lingual speaking style transfer: 1multi modal: 1av hubert: 1dysarthric speech reconstruction: 1transforms: 1audio visual: 1vq vae: 1pre training: 1self supervised style enhancing: 1dance expressiveness: 1dance generation: 1genre matching: 1dance dynamics: 1humanities: 1dynamics: 1beat alignment: 1speaker adaptation: 1zero shot: 1multi scale acoustic prompts: 1stereophonic music: 1degradation: 1codecs: 1music generation: 1neural codec: 1image coding: 1language models: 1long multi track: 1multi view midivae: 1symbolic music generation: 1two dimensional displays: 1speech disentanglement: 1vae: 1voice cloning: 1static var compensators: 1harmonic analysis: 1power harmonic filters: 1synthesizers: 1neural concatenation: 1signal generators: 1singing voice conversion: 1information retrieval: 1interaction gesture: 1multi agent conversational interaction: 1oral communication: 1cognition: 1dialog intention and emotion: 1co speech gesture generation: 1avatars: 1motion processing: 1multimodal learning: 1gesture generation: 1codes: 1context modeling: 1style modeling: 1bit error rate: 1multi scale: 1automatic speech recognition: 1neural machine translation: 1hierarchical attention mechanism: 1machine translation: 1subband interaction: 1inter subnet: 1global spectral information: 1speech signal improvement: 1generative adversarial networks: 1two stage: 1reverberation: 1real time systems: 1speech restoration: 1lightweight text to speech: 1streaming text to speech: 1diffusion probabilistic model: 1probabilistic logic: 1audiobook speech synthesis: 1prediction methods: 1context aware: 1multi sentence: 1hierarchical transformer: 1target speech extraction: 1multi modal fusion: 1fuses: 12d positional encoding.: 1cross attention: 1transducers: 1delays: 1streaming: 1computer architecture: 1signal processing algorithms: 1latency: 1network architecture: 1corrector network: 1source separation: 1time domain: 1time frequency domain: 1particle separators: 1speech separation: 1learning systems: 1synthetic corpus: 1measurement: 1audio recording: 1neural vocoder: 1semantic augmentation: 1upper bound: 1data augmentation: 1difficulty aware: 1stability analysis: 1error analysis: 1contextual biasing: 
1conformer: 1biased words: 1sensitivity: 1open vocabulary keyword spotting: 1acoustic model: 1dynamic network pruning: 1melody unsupervision: 1differentiable up sampling layer: 1rhythm: 1vocal range: 1regulators: 1bidirectional control: 1annotations: 1singing voice synthesis: 1bi directional flow: 1adversarial attacks: 1supervised learning: 1tree structure: 1prosodic structure prediction: 1computational linguistics: 1span based decoder: 1character level: 1image segmentation: 1speech to animation: 1mixture of experts: 1computer animation: 1phonetic posteriorgrams: 1phase information: 1full band extractor: 1multi scale time sensitive channel attention: 1memory management: 1convolution: 1knowledge based systems: 1flat lattice transformer: 1rule based: 1chinese text normalization: 1none standard word: 1relative position encoding: 1xlnet: 1knowledge distillation: 1speaking style: 1conversational text to speech synthesis: 1graph neural network: 1matrix algebra: 1multi task learning: 1end to end model: 1forced alignment: 1audio signal processing: 1vocoder: 1neural architecture search: 1uniform sampling: 1path dropout: 1phoneme recognition: 1mispronunciation detection and diagnosis: 1acoustic phonetic linguistic embeddings: 1computer aided pronunciation training: 1connectionist temporal classification: 1cross entropy: 1disentangling: 1hybrid bottleneck features: 1voice activity detection: 1capsule: 1exemplary emotion descriptor: 1residual error: 1capsule network: 1spatial information: 1sequential: 1recurrent: 1emotion: 1global style token: 1expressive: 1ctc: 1non autoregressive: 1autoregressive processes: 1neural network based text to speech: 1grammars: 1prosody control: 1word processing: 1syntactic parse tree traversal: 1syntactic representation learning: 1goodness of pronunciation: 1pronunciation assessment: 1computer assisted language learning: 1computer aided instruction: 1multi speaker and multi style tts: 1hifi gan: 1durian: 1low resource condition: 1weapons: 1perturbation methods: 1information filters: 1phonetic pos teriorgrams: 1speech intelligibility: 1accent conversion: 1accented speech recognition: 1multilingual speech synthesis: 1end to end: 1foreign accent: 1spectral analysis: 1center loss: 1discriminative features: 1multi head self attention: 1dilated residual network: 1wavenet: 1self attention: 1blstm: 1phonetic posteriorgrams(ppgs): 1anchored reference sample: 1mean opinion score (mos): 1speech fluency assessment: 1computer assisted language learning (call): 1variational inference: 1quasifully recurrent neural network (qrnn): 1parallel wavenet: 1text to speech (tts) synthesis: 1convolutional neural network (cnn): 1utterance level features: 1spatial relationship information: 1recurrent connection: 1capsule networks: 1
Most publications (all venues) at: 2023: 44, 2024: 33, 2022: 33, 2021: 22, 2019: 17
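
The per-researcher tallies in this report (By venue, By year, Most publications) are frequency counts over the venue-plus-year tags that prefix each entry in the publication lists below (e.g. TASLP2024, ICASSP2023). As a rough illustration only — the report's actual export script is not shown here, and the tag list in this sketch is a made-up sample rather than any specific researcher's record — such counts could be derived along these lines in Python:

import re
from collections import Counter

# Sample venue+year tags in the style used by this report's publication lists
# (illustrative values only).
tags = ["TASLP2024", "ICASSP2024", "ICASSP2024", "ICASSP2023", "Interspeech2023"]

by_venue = Counter()
by_year = Counter()
for tag in tags:
    m = re.fullmatch(r"([A-Za-z-]+)(\d{4})", tag)  # split "ICASSP2023" into venue and year
    if m:
        by_venue[m.group(1)] += 1
        by_year[int(m.group(2))] += 1

# Most frequent first, mirroring the ordering of the "By venue" / "By year" lines.
print(by_venue.most_common())  # [('ICASSP', 3), ('TASLP', 1), ('Interspeech', 1)]
print(by_year.most_common())   # [(2024, 3), (2023, 2)]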

Affiliations
Tsinghua University, Joint Research Center for Media Sciences, Beijing, China (PhD)
Chinese University of Hong Kong, Hong Kong

Recent publications

TASLP2024 Jingbei Li, Sipan Li, Ping Chen, Luwen Zhang, Yi Meng, Zhiyong Wu 0001, Helen Meng, Qiao Tian, Yuping Wang, Yuxuan Wang 0002, 
Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing.

ICASSP2024 Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.

ICASSP2024 Xueyuan Chen, Xi Wang 0016, Shaofei Zhang, Lei He 0005, Zhiyong Wu 0001, Xixin Wu, Helen Meng, 
Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.

ICASSP2024 Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu 0001, Haozhi Huang 0004, Helen Meng, 
Enhancing Expressiveness in Dance Generation Via Integrating Frequency and Music Style Information.

ICASSP2024 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Dan Luo, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han 0001, Helen Meng, 
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts.

ICASSP2024 Xingda Li, Fan Zhuo, Dan Luo, Jun Chen 0024, Shiyin Kang, Zhiyong Wu 0001, Tao Jiang, Yang Li, Han Fang, Yahui Zhou, 
Generating Stereophonic Music with Single-Stage Language Models.

ICASSP2024 Zhiwei Lin, Jun Chen 0024, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu 0001, Helen Meng, 
Multi-View Midivae: Fusing Track- and Bar-View Representations for Long Multi-Track Symbolic Music Generation.

ICASSP2024 Hui Lu, Xixin Wu, Haohan Guo, Songxiang Liu, Zhiyong Wu 0001, Helen Meng, 
Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations.

ICASSP2024 Binzhu Sha, Xu Li 0015, Zhiyong Wu 0001, Ying Shan, Helen Meng, 
Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion.

ICASSP2024 Haiwei Xue, Sicheng Yang, Zhensong Zhang, Zhiyong Wu 0001, Minglei Li 0001, Zonghong Dai, Helen Meng, 
Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models.

ICASSP2024 Sicheng Yang, Zunnan Xu, Haiwei Xue, Yongkang Cheng, Shaoli Huang, Mingming Gong, Zhiyong Wu 0001
FreeTalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness.

AAAI2024 Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu 0001, Shi-Xiong Zhang, Guangzhi Li, Yi Luo 0004, Rongzhi Gu, 
SECap: Speech Emotion Captioning with Large Language Model.

TASLP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Helen Meng, 
MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis.

TASLP2023 Xixin Wu, Hui Lu, Kun Li 0003, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms.

ICASSP2023 Jun Chen 0024, Wei Rao, Zilin Wang, Jiuxin Lin, Zhiyong Wu 0001, Yannan Wang, Shidong Shang, Helen Meng, 
Inter-Subnet: Speech Enhancement with Subband Interaction.

ICASSP2023 Jun Chen 0024, Yupeng Shi, Wenzhe Liu, Wei Rao, Shulin He, Andong Li, Yannan Wang, Zhiyong Wu 0001, Shidong Shang, Chengshi Zheng, 
Gesper: A Unified Framework for General Speech Restoration.

ICASSP2023 Jie Chen, Xingchen Song, Zhendong Peng, Binbin Zhang, Fuping Pan, Zhiyong Wu 0001
LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech.

ICASSP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis.

ICASSP2023 Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen 0024, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu 0001, Yujun Wang, Helen Meng, 
Av-Sepformer: Cross-Attention Sepformer for Audio-Visual Target Speaker Extraction.

ICASSP2023 Xingchen Song, Di Wu 0061, Zhiyong Wu 0001, Binbin Zhang, Yuekai Zhang, Zhendong Peng, Wenpeng Li, Fuping Pan, Changbao Zhu, 
TrimTail: Low-Latency Streaming ASR with Simple But Effective Spectrogram-Level Length Penalty.

#11  | Jinyu Li 0001 | DBLP Google Scholar  
By venue: ICASSP: 43, Interspeech: 37, TASLP: 7, ICML: 1, ACL: 1, EMNLP: 1
By year: 2024: 9, 2023: 12, 2022: 21, 2021: 17, 2020: 16, 2019: 11, 2018: 4
ISCA sessionsnovel models and training methods for asr: 3source separation: 3asr neural network architectures: 3streaming for asr/rnn transducers: 2multi- and cross-lingual asr, other topics in asr: 2streaming asr: 2speech recognition: 1statistical machine translation: 1speaker and language recognition: 1other topics in speech recognition: 1robust asr, and far-field/multi-talker asr: 1spoken language processing: 1topics in asr: 1self-supervision and semi-supervision for neural asr training: 1neural network training methods for asr: 1language and lexical modeling for asr: 1asr model training and strategies: 1acoustic model adaptation for asr: 1new trends in self-supervised speech processing: 1asr neural network architectures and training: 1search for speech recognition: 1multi-channel speech enhancement: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1asr neural network training: 1neural network training strategies for asr: 1novel neural network architectures for acoustic modelling: 1novel approaches to enhancement: 1deep enhancement: 1
IEEE keywordsspeech recognition: 34recurrent neural nets: 8transducers: 7data models: 7error analysis: 7self supervised learning: 7vocabulary: 6task analysis: 6speaker recognition: 6natural language processing: 6transformers: 5predictive models: 5speech enhancement: 5automatic speech recognition: 5factorized neural transducer: 4speech translation: 4representation learning: 4transformer: 4adaptation models: 4oral communication: 4end to end: 4speech coding: 3decoding: 3computational modeling: 3speech separation: 3ctc: 3multi talker automatic speech recognition: 3transformer transducer: 3continuous speech separation: 3source separation: 3speaker adaptation: 3teacher student learning: 3attention: 3adversarial learning: 3context modeling: 2codecs: 2language model: 2speech synthesis: 2semantics: 2real time systems: 2streaming: 2degradation: 2loading: 2streaming inference: 2speaker diarization: 2conversation transcription: 2training data: 2contextual biasing: 2contextual spelling correction: 2analytical models: 2interpolation: 2multi talker asr: 2transducer: 2combination: 2meeting transcription: 2encoding: 2audio signal processing: 2lstm: 2domain adaptation: 2deep neural network: 2neural network: 2long content speech recognition: 1streaming and non streaming: 1rnn t: 1computer architecture: 1speech removal: 1codes: 1speech generation: 1noise reduction: 1audio text input: 1multi task learning: 1noise suppression: 1target speaker extraction: 1zero shot text to speech: 1speech editing: 1machine translation: 1speech text joint pre training: 1discrete tokenization: 1unified modeling language: 1costs: 1timestamp: 1synchronization: 1joint: 1weight sharing: 1memory management: 1model compression: 1performance evaluation: 1speech recognition and translation: 1low rank approximation: 1token level serialized output training: 1multi talker speech recognition: 1text only adaptation: 1symbols: 1measurement: 1overlapping speech: 1recording: 1wavlm: 1multi speaker: 1bit error rate: 1hubert: 1ce: 1fuses: 1long form speech recognition: 1context and speech encoder: 1focusing: 1microphone arrays: 1geometry: 1microphone array: 1external attention: 1speech to speech translation: 1joint pre training: 1data mining: 1cross lingual modeling: 1speaker change detection: 1e2e asr: 1f1 score: 1limiting: 1data simulation: 1conversation analysis: 1signal processing algorithms: 1n gram: 1kl divergence: 1factorized transducer model: 1neural transducer model: 1non autoregressive: 1language model adaptation: 1multitasking: 1pre training: 1benchmark testing: 1speaker: 1linear programming: 1end to end end point detection: 1long form meeting transcription: 1dual path rnn: 1robust speech recognition: 1contrastive learning: 1wav2vec 2.0: 1robust automatic speech recognition: 1supervised learning: 1hybrid: 1cascaded: 1two pass: 1recurrent selective attention network: 1configurable multilingual model: 1multilingual speech recognition: 1speaker inventory: 1mathematical model: 1estimated speech: 1particle separators: 1computer science: 1correlation: 1speaker separation: 1multi channel microphone: 1deep learning (artificial intelligence): 1signal representation: 1real time decoding: 1multi speaker asr: 1conformer: 1attention based encoder decoder: 1recurrent neural network transducer: 1segmentation: 1filtering theory: 1system fusion: 1neural language generation: 1unsupervised learning: 1acoustic model adaptation: 1permutation invariant training: 1libricss: 1microphones: 1overlapped speech: 1production: 1tensors: 1rnn transducer: 1virtual 
assistants: 1alignments: 1pre training.: 1pattern classification: 1streaming attention based sequence to sequence asr: 1latency reduction: 1monotonic chunkwise attention: 1entropy: 1computer aided instruction: 1latency: 1label embedding: 1knowledge representation: 1backpropagation: 1end to end system: 1oov: 1acoustic to word: 1adaptation: 1universal acoustic model: 1mixture of experts: 1mixture models: 1layer trajectory: 1future context frames: 1temporal modeling: 1senone classification: 1signal classification: 1code switching: 1language identification: 1asr: 1domain invariant training: 1speaker verification: 1
Most publications (all venues) at: 2022: 27, 2021: 25, 2024: 22, 2023: 22, 2020: 18

Affiliations
Microsoft Corporation, Redmond, WA, USA
Georgia Institute of Technology, Center for Signal and Image Processing, Atlanta, GA, USA (PhD)
University of Science and Technology of China, iFlytek Speech Lab, Hefei, China

Recent publications

TASLP2024 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
Advanced Long-Content Speech Recognition With Factorized Neural Transducer.

TASLP2024 Xiaofei Wang 0007, Manthan Thakker, Zhuo Chen 0006, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu 0001, Jinyu Li 0001, Takuya Yoshioka, 
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer.

TASLP2024 Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu 0012, Shujie Liu 0001, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Furu Wei, 
VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation.

TASLP2024 Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu 0012, Shuo Ren, Shujie Liu 0001, Zhuoyuan Yao, Xun Gong 0005, Li-Rong Dai 0001, Jinyu Li 0001, Furu Wei, 
SpeechLM: Enhanced Speech Pre-Training With Unpaired Textual Data.

ICASSP2024 Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li 0001, Yashesh Gaur, 
Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation.

ICASSP2024 Yiming Wang, Jinyu Li 0001
Residualtransformer: Residual Low-Rank Learning With Weight-Sharing For Transformer Layers.

ICASSP2024 Jian Wu 0027, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao 0017, Zhuo Chen 0006, Jinyu Li 0001
T-SOT FNT: Streaming Multi-Talker ASR with Text-Only Domain Adaptation Capability.

ICASSP2024 Mu Yang, Naoyuki Kanda, Xiaofei Wang 0009, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li 0001, Takuya Yoshioka, 
Diarist: Streaming Speech Translation with Speaker Diarization.

ICML2024 Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan 0003, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu 0001, Tao Qin 0001, Xiangyang Li 0001, Wei Ye 0004, Shikun Zhang, Jiang Bian 0002, Lei He 0005, Jinyu Li 0001, Sheng Zhao, 
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

ICASSP2023 Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, Sefik Emre Eskimez, 
Speech Separation with Large-Scale Self-Supervised Learning.

ICASSP2023 Ruchao Fan, Yiming Wang, Yashesh Gaur, Jinyu Li 0001
CTCBERT: Advancing Hidden-Unit Bert with CTC Objectives.

ICASSP2023 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.

ICASSP2023 Zili Huang, Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yiming Wang, Jinyu Li 0001, Takuya Yoshioka, Xiaofei Wang 0009, Peidong Wang, 
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition.

ICASSP2023 Naoyuki Kanda, Jian Wu 0027, Xiaofei Wang 0009, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Vararray Meets T-Sot: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition.

ICASSP2023 Xiaoqiang Wang 0006, Yanqing Liu, Jinyu Li 0001, Sheng Zhao, 
Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation.

ICASSP2023 Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu 0001, Lei He 0005, Jinyu Li 0001, Furu Wei, 
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation.

ICASSP2023 Jian Wu 0027, Zhuo Chen 0006, Min Hu, Xiong Xiao, Jinyu Li 0001
Speaker Change Detection For Transformer Transducer ASR.

ICASSP2023 Muqiao Yang, Naoyuki Kanda, Xiaofei Wang 0009, Jian Wu 0027, Sunit Sivasankaran, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Simulating Realistic Speech Overlaps Improves Multi-Talker ASR.

ICASSP2023 Rui Zhao 0017, Jian Xue, Partha Parthasarathy, Veljko Miljanic, Jinyu Li 0001
Fast and Accurate Factorized Neural Transducer for Text Adaption of End-to-End Speech Recognition Models.

Interspeech2023 Yuang Li, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, 
Accelerating Transducers through Adjacent Token Merging.

#12  | DeLiang Wang | DBLP Google Scholar  
By venue: TASLP: 28, Interspeech: 27, ICASSP: 26
By year: 2024: 3, 2023: 8, 2022: 20, 2021: 11, 2020: 16, 2019: 15, 2018: 8
ISCA sessionsdeep enhancement: 3speech coding and privacy: 2single-channel speech enhancement: 2speech enhancement: 2asr for noisy and far-field speech: 2spatial and phase cues for source separation and speech recognition: 2multi-talker methods in speech processing: 1speech enhancement and denoising: 1speech recognition: 1dereverberation, noise reduction, and speaker extraction: 1challenges and opportunities for signal processing and machine learning for multiple smart devices: 1speech representation: 1multi-channel speech enhancement and hearing aids: 1source separation, dereverberation and echo cancellation: 1speech and audio quality assessment: 1noise reduction and intelligibility: 1speaker and language recognition: 1novel approaches to enhancement: 1source separation from monaural input: 1deep learning for source separation and pitch tracking: 1
IEEE keywordsspeech enhancement: 32speaker recognition: 13complex spectral mapping: 10recurrent neural nets: 9source separation: 9speech intelligibility: 8convolutional neural nets: 8reverberation: 7speaker separation: 6speech recognition: 6microphone arrays: 6microphones: 6array signal processing: 5time domain: 5location based training: 4time frequency analysis: 4noise measurement: 4estimation: 4monaural speech enhancement: 4microphone array processing: 4fourier transforms: 4deep casa: 4signal to noise ratio: 3robust speaker localization: 3direction of arrival estimation: 3time domain analysis: 3neural cascade architecture: 3convolution: 3robustness: 3self attention: 3time domain enhancement: 3blind source separation: 3permutation invariant training: 3beamforming: 3covariance matrices: 3acoustic noise: 3monaural speech separation: 3audio signal processing: 3phase estimation: 3speech dereverberation: 3deep neural networks: 3task analysis: 2continuous speaker separation: 2continuous speech separation: 2geometry: 2automatic speech recognition: 2self supervised learning: 2speech separation: 2frequency domain analysis: 2cross corpus generalization: 2deep learning (artificial intelligence): 2multi channel speaker separation: 2complex domain: 2bone conduction: 2attention based fusion: 2natural language processing: 2speaker diarization: 2optimisation: 2pruning: 2quantization: 2model compression: 2sparse regularization: 2hearing: 2talker independent speaker separation: 2dereverberation: 2encoding: 2decoding: 2computational auditory scene analysis: 2time frequency masking: 2iterative methods: 2conversational speaker separation: 1streams: 1multi speaker speech recognition: 1separation processes: 1multi channel speaker diarization: 1audiovisual speaker separation: 1multimodal speech processing: 1visualization: 1attentive audiovisual fusion: 1systematics: 1mimo complex spectral mapping: 1location awareness: 1merging: 1data mining: 1speaker extraction: 1attentive training: 1talker independent: 1interference: 1speech: 1pitch tracking: 1multitasking: 1multi task learning: 1complex domain processing: 1densely connected convolutional recurrent neural network: 1voicing detection: 1frequency estimation: 1packet loss concealment: 1packet loss: 1semantics: 1diffusion model: 1low signal to noise ratio: 1generative model: 1background noise: 1recurrent neural network: 1talker independence: 1multi channel complex spectral mapping: 1spectrospatial filtering: 1spectrogram: 1neural net architecture: 1cascade architecture: 1signal representation: 1sensor fusion: 1signal denoising: 1air conduction: 1nonlinear distortions: 1acoustic echo cancellation: 1neurocontrollers: 1multi channel aec: 1echo suppression: 1mimo: 1fixed array: 1multichannel: 1triple path: 1robust automatic speech recognition: 1spectral magnitude: 1cross domain speech enhancement: 1multi speaker asr: 1meeting transcription: 1alimeeting: 1m2met: 1acoustic echo suppression: 1recurrent neural networks: 1feature combination: 1frame level snr estimation: 1long short term memory: 1dense convolutional network: 1self attention network: 1frequency domain loss: 1data compression: 1quantisation (signal): 1speaker inventory: 1mathematical model: 1estimated speech: 1particle separators: 1computer science: 1correlation: 1training data: 1data models: 1overlapped speech: 1modulation: 1computational modeling: 1performance evaluation: 1pipelines: 1quantization (signal): 1densely connected convolutional recurrent network: 1on device processing: 1real time speech 
enhancement: 1mobile communication: 1dual microphone mobile phones: 1complex domain separation: 1ensemble learning: 1singing voice separation: 1convolutional neural network: 1music: 1self attention mechanism: 1monaural speaker separation: 1causal processing: 1robust enhancement: 1channel generalization: 1robust speaker recognition: 1gammatone frequency cepstral coefficient (gfcc): 1masking based beamforming: 1x vector: 1gaussian processes: 1gated convolutional recurrent network: 1distortion independent acoustic modeling: 1speech distortion: 1transient response: 1temporal convolutional networks: 1room impulse response: 1dense network: 1time frequency loss: 1speaker and noise independent: 1fully convolutional: 1voice telecommunication: 1processing artifacts: 1cochannel speech separation: 1two stage network: 1pattern clustering: 1divide and conquer methods: 1audio databases: 1fully convolutional neural network: 1mean absolute error: 1generalisation (artificial intelligence): 1gated linear units: 1residual learning: 1dilated convolutions: 1feedforward neural nets: 1sequence to sequence mapping: 1chimera++ networks: 1deep clustering: 1spatial features: 1gcc phat: 1steered response power: 1ideal ratio mask: 1denoising: 1signal reconstruction: 1phase: 1noise independent and speaker independent speech enhancement: 1real time implementation: 1tcnn: 1temporal convolutional neural network: 1complex valued deep neural networks: 1learning phase: 1phase aware speech enhancement: 1cdnn: 1spectral analysis: 1convolutional recurrent network: 1causal system: 1phase reconstruction: 1chimera + + networks: 1
Most publications (all venues) at: 2022: 23, 2018: 21, 2020: 20, 2021: 19, 2019: 16


Recent publications

TASLP2024 Hassan Taherian, DeLiang Wang
Multi-Channel Conversational Speaker Separation via Neural Diarization.

ICASSP2024 Vahid Ahmadi Kalkhorani, Anurag Kumar 0003, Ke Tan 0001, Buye Xu, DeLiang Wang
Audiovisual Speaker Separation with Full- and Sub-Band Modeling in the Time-Frequency Domain.

ICASSP2024 Hassan Taherian, Ashutosh Pandey 0004, Daniel Wong, Buye Xu, DeLiang Wang
Leveraging Sound Localization to Improve Continuous Speaker Separation.

TASLP2023 Ashutosh Pandey 0004, DeLiang Wang
Attentive Training: A New Training Framework for Speech Enhancement.

TASLP2023 Yixuan Zhang 0005, Heming Wang, DeLiang Wang
F0 Estimation and Voicing Detection With Cascade Architecture in Noisy Speech.

ICASSP2023 Hassan Taherian, DeLiang Wang
Multi-Resolution Location-Based Training for Multi-Channel Continuous Speech Separation.

ICASSP2023 Heming Wang, Yao Qian, Hemin Yang, Naoyuki Kanda, Peidong Wang, Takuya Yoshioka, Xiaofei Wang 0009, Yiming Wang, Shujie Liu 0001, Zhuo Chen 0006, DeLiang Wang, Michael Zeng 0001, 
DATA2VEC-SG: Improving Self-Supervised Learning Representations for Speech Generation Tasks.

ICASSP2023 Heming Wang, DeLiang Wang
Cross-Domain Diffusion Based Speech Enhancement for Very Noisy Speech.

Interspeech2023 Vahid Ahmadi Kalkhorani, Anurag Kumar 0003, Ke Tan 0001, Buye Xu, DeLiang Wang
Time-domain Transformer-based Audiovisual Speaker Separation.

Interspeech2023 Hassan Taherian, Ashutosh Pandey 0004, Daniel Wong, Buye Xu, DeLiang Wang
Multi-input Multi-output Complex Spectral Mapping for Speaker Separation.

Interspeech2023 Yufeng Yang, Ashutosh Pandey 0004, DeLiang Wang
Time-Domain Speech Enhancement for Robust Automatic Speech Recognition.

TASLP2022 Ashutosh Pandey 0004, DeLiang Wang
Self-Attending RNN for Speech Enhancement to Improve Cross-Corpus Generalization.

TASLP2022 Hassan Taherian, Ke Tan 0001, DeLiang Wang
Multi-Channel Talker-Independent Speaker Separation Through Location-Based Training.

TASLP2022 Ke Tan 0001, Zhong-Qiu Wang, DeLiang Wang
Neural Spectrospatial Filtering.

TASLP2022 Heming Wang, DeLiang Wang
Neural Cascade Architecture With Triple-Domain Loss for Speech Enhancement.

TASLP2022 Heming Wang, Xueliang Zhang 0001, DeLiang Wang
Fusing Bone-Conduction and Air-Conduction Sensors for Complex-Domain Speech Enhancement.

TASLP2022 Hao Zhang, DeLiang Wang
Neural Cascade Architecture for Multi-Channel Acoustic Echo Suppression.

ICASSP2022 Ashutosh Pandey 0004, Buye Xu, Anurag Kumar 0003, Jacob Donley, Paul Calamia, DeLiang Wang
TPARN: Triple-Path Attentive Recurrent Network for Time-Domain Multichannel Speech Enhancement.

ICASSP2022 Hassan Taherian, Ke Tan 0001, DeLiang Wang
Location-Based Training for Multi-Channel Talker-Independent Speaker Separation.

ICASSP2022 Heming Wang, Yao Qian, Xiaofei Wang 0009, Yiming Wang, Chengyi Wang 0002, Shujie Liu 0001, Takuya Yoshioka, Jinyu Li 0001, DeLiang Wang
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction.

#13  | Jun Du | DBLP Google Scholar  
By venue: Interspeech: 35, ICASSP: 28, TASLP: 13, SpeechComm: 3
By year: 2024: 8, 2023: 14, 2022: 10, 2021: 13, 2020: 17, 2019: 15, 2018: 2
ISCA sessionsspeaker diarization: 3speaker recognition: 3speaker embedding and diarization: 2speech enhancement: 2speech enhancement and denoising: 1multi-talker methods in speech processing: 1spoken dialog systems and conversational analysis: 1speech recognition: 1spoken language processing: 1acoustic scene analysis: 1low-resource asr development: 1spoken dialogue systems and multimodality: 1multimodal systems: 1tools, corpora and resources: 1interspeech 2021 deep noise suppression challenge: 1single-channel speech enhancement: 1voice activity detection and keyword spotting: 1asr model training and strategies: 1acoustic model adaptation for asr: 1acoustic scene classification: 1multi-channel speech enhancement: 1speech emotion recognition: 1speech coding and evaluation: 1speech and audio classification: 1corpus annotation and evaluation: 1far-field speech recognition: 1the second dihard speech diarization challenge (dihard ii): 1deep enhancement: 1the first dihard speech diarization challenge: 1
IEEE keywordsspeech enhancement: 22speech recognition: 19speaker diarization: 9visualization: 7noise measurement: 5task analysis: 5speaker recognition: 5misp challenge: 4recording: 4voice activity detection: 4data models: 4progressive learning: 4regression analysis: 4deep neural network: 4audio visual: 3hidden markov models: 3noise: 3error analysis: 3automatic speech recognition: 3adaptation models: 3speech separation: 3signal to noise ratio: 3reverberation: 3convolutional neural nets: 3robust speech recognition: 2optimization: 2mathematical models: 2attention: 2iterative methods: 2chime 7 challenge: 2robustness: 2estimation: 2emotion recognition: 2face recognition: 2semantics: 2data mining: 2multimodality: 2memory aware speaker embedding: 2attention network: 2telephone sets: 2time domain analysis: 2data augmentation: 2decoding: 2speech coding: 2speech intelligibility: 2post processing: 2entropy: 2image analysis: 2acoustic scene classification: 2convolutional neural networks: 2improved minima controlled recursive averaging: 2neural network: 2signal classification: 2fully convolutional neural network: 2attention mechanism: 2generalized gaussian distribution: 2mean square error methods: 2maximum likelihood estimation: 2least mean squares methods: 2gaussian distribution: 2ideal ratio mask: 2task generic: 1measurement: 1optimization objective: 1distortion measurement: 1diffusion model: 1score based: 1speech denoising: 1interpolating diffusion model: 1interpolation: 1writing: 1multi modal: 1aggregated optical flow map: 1trajectory: 1handwriting recognition: 1handwritten mathematical expression recognition: 1topology: 1multi channel speech enhancement: 1iterative mask estimation: 1redundancy: 1feature fusion: 1multi modal emotion recognition: 1entropy based fusion: 1structured pruning: 1network architecture optimization: 1target speaker enhancement: 1self supervised learning: 1speaker adaptive: 1target speaker extraction: 1real world scenarios: 1benchmark testing: 1oral communication: 1memory management: 1chime challenge: 1graphics processing units: 1sequence to sequence architecture: 1codes: 1adaptive refinement: 1dictionary learning: 1adaptive systems: 1dynamic mask: 1data quality control: 1time frequency analysis: 1wiener filter: 1gevd: 1wiener filters: 1speech distortion: 1mean square error: 1correlation: 1low rank approximation: 1synchronization: 1dcase 2022: 1testing: 1sound event localization and detection: 1model architecture: 1realistic data: 1location awareness: 1transfer learning: 1synthetic speech detection: 1quantum transfer learning: 1integrated circuit modeling: 1quantum machine learning: 1variational quantum circuit: 1pre trained model: 1speech synthesis: 1tv: 1quality assessment: 1visual embedding reconstruction: 1acoustic distortion: 1learning systems: 1public domain software: 1wake word spotting: 1audio visual systems: 1microphone array: 1ts vad: 1m2met: 1snr constriction: 1time domain: 1dihard iii challenge: 1filtering: 1iteration: 1signal processing algorithms: 1robust automatic speech recognition: 1acoustic model: 1neural net architecture: 1probability: 1cross entropy: 1optimisation: 1deep neural network (dnn): 1local response normalization: 1multi level and adaptive fusion: 1factorized bilinear pooling: 1multimodal emotion recognition: 1analytical models: 1class activation mapping: 1adaptive noise and speech estimation: 1computer architecture: 1additives: 1noise reduction: 1computational modeling: 1convolutional layers: 1sehae: 1hierarchical autoencoder: 1computational 
complexity: 1speaker adaptation: 1memory aware networks: 1microphone arrays: 1snr progressive learning: 1recurrent neural nets: 1dense structure: 1acoustic segment model: 1ctc: 1matrix algebra: 1scaling: 1model adaptation: 1dilated convolution: 1speaker verification: 1baum welch statistics: 1maximum likelihood: 1shape factors update: 1multi objective learning: 1speech activity detection: 1snr estimation: 1dihard data: 1geometric constraint: 1geometry: 1linear programming: 1lstm: 12d to 2d mapping: 1fuzzy neural nets: 1performance evaluation: 1source separation: 1child speech extraction: 1realistic conditions: 1measures: 1prediction error modeling: 1gaussian processes: 1acoustic modeling: 1joint optimization: 1mixed bandwidth speech recognition: 1bandwidth expansion: 1function approximation: 1expressive power: 1universal approximation: 1vector to vector regression: 1improved speech presence probability: 1error statistics: 1teacher student learning: 1deep learning based speech enhancement: 1noise robust speech recognition: 1multiple speakers: 1interference: 1speaker dependent speech separation: 1chime 5 challenge: 1arrays: 1acoustic noise: 1statistical speech enhancement: 1signal denoising: 1gain function: 1
Most publications (all venues) at: 2023: 64, 2024: 54, 2020: 52, 2021: 46, 2019: 44


Recent publications

TASLP2024 Hang Chen, Qing Wang 0008, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 0001, 
Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition.

TASLP2024 Zilu Guo, Qing Wang 0008, Jun Du, Jia Pan, Qing-Feng Liu, Chin-Hui Lee 0001, 
A Variance-Preserving Interpolation Approach for Diffusion Models With Applications to Single Channel Speech Enhancement and Recognition.

ICASSP2024 Hanbo Cheng, Jun Du, Pengfei Hu 0006, Jiefeng Ma, Zhenrong Zhang, Mobai Xue, 
Viewing Writing as Video: Optical Flow based Multi-Modal Handwritten Mathematical Expression Recognition.

ICASSP2024 Feng Ma, Yanhui Tu, Maokui He, Ruoyu Wang 0029, Shutong Niu, Lei Sun 0010, Zhongfu Ye, Jun Du, Jia Pan, Chin-Hui Lee 0001, 
A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition.

ICASSP2024 Haotian Wang, Jun Du, Yusheng Dai, Chin-Hui Lee 0001, Yuling Ren, Yu Liu, 
Improving Multi-Modal Emotion Recognition Using Entropy-Based Fusion and Pruning-Based Network Architecture Optimization.

ICASSP2024 Minghui Wu, Haitao Tang, Jiahuan Fan, Ruoyu Wang, Hang Chen, Yanyong Zhang, Jun Du, Hengshun Zhou, Lei Sun, Xin Fang, Tian Gao, Genshun Wan, Jia Pan, Jianqing Gao, 
Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR through Efficient Joint Optimization.

ICASSP2024 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang 0029, Hongbo Lan, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao, 
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.

ICASSP2024 Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang 0029, Yanyan Yue, Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee 0001, 
Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture.

SpeechComm2023 Shi Cheng, Jun Du, Shutong Niu, Alejandrina Cristià, Xin Wang 0037, Qing Wang 0008, Chin-Hui Lee 0001, 
Using iterative adaptation and dynamic mask for child speech extraction under real-world multilingual conditions.

SpeechComm2023 Li Chai 0002, Hang Chen, Jun Du, Qing-Feng Liu, Chin-Hui Lee 0001, 
Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech.

TASLP2023 Mao-Kui He, Jun Du, Qing-Feng Liu, Chin-Hui Lee 0001, 
ANSD-MA-MSE: Adaptive Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding.

TASLP2023 Shutong Niu, Jun Du, Lei Sun 0010, Yu Hu 0003, Chin-Hui Lee 0001, 
QDM-SSD: Quality-Aware Dynamic Masking for Separation-Based Speaker Diarization.

TASLP2023 Jie Zhang 0042, Rui Tao, Jun Du, Li-Rong Dai 0001, 
SDW-SWF: Speech Distortion Weighted Single-Channel Wiener Filter for Noise Reduction.

ICASSP2023 Hang Chen, Shilong Wu, Yusheng Dai, Zhe Wang, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
Summary on the Multimodal Information Based Speech Processing (MISP) 2022 Challenge.

ICASSP2023 Shutong Niu, Jun Du, Qing Wang 0008, Li Chai 0002, Huaxin Wu, Zhaoxu Nian, Lei Sun 0010, Yi Fang, Jia Pan, Chin-Hui Lee 0001, 
An Experimental Study on Sound Event Localization and Detection Under Realistic Testing Conditions.

ICASSP2023 Ruoyu Wang 0029, Jun Du, Tian Gao, 
Quantum Transfer Learning Using the Large-Scale Unsupervised Pre-Trained Model Wavlm-Large for Synthetic Speech Detection.

ICASSP2023 Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The Multimodal Information Based Speech Processing (Misp) 2022 Challenge: Audio-Visual Diarization And Recognition.

ICASSP2023 Chenyue Zhang, Hang Chen, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 0001, 
Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement.

Interspeech2023 Zilu Guo, Jun Du, Chin-Hui Lee 0001, Yu Gao, Wenbin Zhang, 
Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement.

Interspeech2023 Shutong Niu, Jun Du, Maokui He, Chin-Hui Lee 0001, Baoxiang Li, Jiakui Li, 
Unsupervised Adaptation with Quality-Aware Masking to Improve Target-Speaker Voice Activity Detection for Speaker Diarization.

#14  | Tara N. Sainath | DBLP Google Scholar  
By venue: ICASSP: 42, Interspeech: 33, TASLP: 1, NAACL: 1, ICLR: 1
By year: 2024: 9, 2023: 20, 2022: 16, 2021: 11, 2020: 8, 2019: 11, 2018: 3
ISCA sessionsspeech recognition: 3asr technologies and systems: 2asr: 2multi-, cross-lingual and other topics in asr: 2cross-lingual and multilingual asr: 2asr neural network architectures: 2analysis of speech and audio signals: 1feature modeling for asr: 1acoustic model adaptation for asr: 1search/decoding algorithms for asr: 1speech analysis: 1language modeling and lexical modeling for asr: 1speech representation: 1novel models and training methods for asr: 1resource-constrained asr: 1language and lexical modeling for asr: 1novel neural network architectures for asr: 1streaming for asr/rnn transducers: 1neural network training methods for asr: 1speech classification: 1lm adaptation, lexical units and punctuation: 1asr neural network architectures and training: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1end-to-end speech recognition: 1acoustic model adaptation: 1recurrent neural models for asr: 1
IEEE keywordsspeech recognition: 30decoding: 12recurrent neural nets: 9data models: 7computational modeling: 7adaptation models: 6end to end asr: 6speech coding: 6task analysis: 5transducers: 5error analysis: 5video on demand: 5conformer: 5natural language processing: 5automatic speech recognition: 4rnn t: 4vocabulary: 3context modeling: 3asr: 3multilingual: 3sequence to sequence: 3degradation: 2costs: 2computational efficiency: 2universal speech model: 2semisupervised learning: 2convolution: 2foundation model: 2buildings: 2computer architecture: 2transfer learning: 2production: 2predictive models: 2text analysis: 2two pass asr: 2rnnt: 2long form asr: 2latency: 2optimisation: 2phonetics: 2biasing: 2hidden markov models: 1end to end: 1tail: 1adapter finetuning: 1streaming multilingual asr: 1sparsity: 1topology: 1model pruning: 1model quantization: 1quantization (signal): 1dialect classifier: 1equity: 1us english: 1african american english: 1robustness: 1hardware: 1large language model: 1distance measurement: 1multilingual speech recognition: 1runtime efficiency: 1computational latency: 1large models: 1causal model: 1online asr: 1state space model: 1systematics: 1parameter efficient adaptation: 1tuning: 1acoustic beams: 1representations: 1modular: 1zero shot stitching: 1longform asr: 1fuses: 1tensors: 1weight sharing: 1machine learning: 1low rank decomposition: 1model compression: 1wearable computers: 1program processors: 1embedded speech recognition: 1segmentation: 1earth observing system: 1decoding algorithms: 1real time systems: 1signal processing algorithms: 1memory management: 1analytical models: 1domain adaptation: 1foundation models: 1frequency modulation: 1soft sensors: 1internal lm: 1text recognition: 1text injection: 1lattices: 1contextual biasing: 1network architecture: 1multitasking: 1capitalization: 1joint network: 1rnn transducer: 1pause prediction: 1and foundation speech models: 1model reprogramming: 1pre trained adaptation: 1benchmark testing: 1focusing: 1cross lingual speech recognition: 1kernel: 1encoding: 1switches: 1utf 8 byte: 1unified modeling language: 1word piece: 1multilingual asr: 1joint training: 1contrastive learning: 1indexes: 1self supervised learning: 1linear programming: 1massive: 1lifelong learning: 1speaker recognition: 1fusion: 1gating: 1bilinear pooling: 1signal representation: 1cascaded encoders: 1second pass asr: 1mean square error methods: 1transformer: 1calibration: 1confidence: 1voice activity detection: 1attention based end to end models: 1echo state network: 1long form: 1echo: 1regression analysis: 1probability: 1endpointer: 1supervised learning: 1attention: 1sequence to sequence models: 1unsupervised learning: 1filtering theory: 1semi supervised training: 1mathematical model: 1pronunciation: 1las: 1spelling correction: 1attention models: 1language model: 1mobile handsets: 1end to end speech synthesis: 1speech synthesis: 1end to end speech recognition: 1
Most publications (all venues) at: 2023: 32, 2022: 25, 2019: 16, 2018: 15, 2021: 14


Recent publications

TASLP2024 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe 0001, 
End-to-End Speech Recognition: A Survey.

ICASSP2024 Junwen Bai, Bo Li 0028, Qiujia Li, Tara N. Sainath, Trevor Strohman, 
Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR.

ICASSP2024 Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li 0028, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal, 
USM-Lite: Quantization and Sparsity Aware Fine-Tuning for Speech Recognition with Universal Speech Models.

ICASSP2024 Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha, Dongseong Hwang, Tara N. Sainath, Françoise Beaufays, Pedro Moreno Mengibar, 
Improving Speech Recognition for African American English with Audio Classification.

ICASSP2024 W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang 0033, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study.

ICASSP2024 Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno 0001, 
Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models.

ICASSP2024 Haozhe Shan, Albert Gu, Zhong Meng, Weiran Wang, Krzysztof Choromanski, Tara N. Sainath
Augmenting Conformers With Structured State-Space Sequence Models For Online Speech Recognition.

ICASSP2024 Khe Chai Sim, Zhouyuan Huo, Tsendsuren Munkhdalai, Nikhil Siddhartha, Adam Stooke, Zhong Meng, Bo Li 0028, Tara N. Sainath
A Comparison of Parameter-Efficient ASR Domain Adaptation Methods for Universal Speech and Language Models.

NAACL2024 Weiran Wang, Rohit Prabhavalkar, Haozhe Shan, Zhong Meng, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li 0028, James Qin, Xingyu Cai, Adam Stooke, Chengjian Zheng, Yanzhang He, Tara N. Sainath, Pedro Moreno Mengibar, 
Massive End-to-end Speech Recognition Models with Time Reduction.

ICASSP2023 Rami Botros, Rohit Prabhavalkar, Johan Schalkwyk, Ciprian Chelba, Tara N. Sainath, Françoise Beaufays, 
Lego-Features: Exporting Modular Encoder Features for Streaming and Deliberation ASR.

ICASSP2023 Shuo-Yiin Chang, Chao Zhang 0031, Tara N. Sainath, Bo Li 0028, Trevor Strohman, 
Context-Aware end-to-end ASR Using Self-Attentive Embedding and Tensor Fusion.

ICASSP2023 Steven M. Hernandez, Ding Zhao, Shaojin Ding, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He, Ian McGraw, 
Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models.

ICASSP2023 Ke Hu, Tara N. Sainath, Bo Li 0028, Nan Du 0002, Yanping Huang, Andrew M. Dai, Yu Zhang 0033, Rodrigo Cabrera, Zhifeng Chen, Trevor Strohman, 
Massively Multilingual Shallow Fusion with Large Language Models.

ICASSP2023 W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman, 
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model.

ICASSP2023 Zhouyuan Huo, Khe Chai Sim, Bo Li 0028, Dongseong Hwang, Tara N. Sainath, Trevor Strohman, 
Resource-Efficient Transfer Learning from Speech Foundation Model Using Hierarchical Feature Fusion.

ICASSP2023 Bo Li 0028, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang 0033, Wei Han 0002, Trevor Strohman, Françoise Beaufays, 
Efficient Domain Adaptation for Speech Foundation Models.

ICASSP2023 Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang 0033, Bo Li 0028, Andrew Rosenberg, Bhuvana Ramabhadran, 
JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.

ICASSP2023 Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, W. Ronny Huang, Tara N. Sainath
A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale.

ICASSP2023 Tara N. Sainath, Rohit Prabhavalkar, Diamantino Caseiro, Pat Rondon, Cyril Allauzen, 
Improving Contextual Biasing with Text Injection.

ICASSP2023 Weiran Wang, Ding Zhao, Shaojin Ding, Hao Zhang 0010, Shuo-Yiin Chang, David Rybach, Tara N. Sainath, Yanzhang He, Ian McGraw, Shankar Kumar, 
Multi-Output RNN-T Joint Networks for Multi-Task Learning of ASR and Auxiliary Tasks.
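
Author names throughout these publication lists carry DBLP-style four-digit disambiguation suffixes (e.g. Bo Li 0028, Jinyu Li 0001) that distinguish researchers who share a name; names shown without a suffix (e.g. Tara N. Sainath) are left as-is. A minimal sketch of separating the display name from its suffix, assuming the names are available as plain strings like the ones above (the helper split_dblp_name is hypothetical, not part of any library or of this report's tooling):

import re

def split_dblp_name(name):
    # Split a DBLP-style author string into (display name, disambiguation suffix or None).
    m = re.fullmatch(r"(.+?)\s+(\d{4})", name)
    if m:
        return m.group(1), m.group(2)
    return name, None

print(split_dblp_name("Bo Li 0028"))        # ('Bo Li', '0028')
print(split_dblp_name("Tara N. Sainath"))   # ('Tara N. Sainath', None)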

#15  | Jianhua Tao 0001 | DBLP Google Scholar  
By venue: Interspeech: 42, ICASSP: 19, TASLP: 11, SpeechComm: 3, AAAI: 1, ICML: 1
By year: 2024: 5, 2023: 9, 2022: 8, 2021: 16, 2020: 20, 2019: 12, 2018: 7
ISCA sessionsspeech emotion recognition: 4speech synthesis: 4voice conversion and adaptation: 3speech coding and privacy: 2topics in asr: 2statistical parametric speech synthesis: 2speech coding and enhancement: 1speaker and language identification: 1paralinguistics: 1asr: 1health and affect: 1privacy-preserving machine learning for audio & speech processing: 1search/decoding techniques and confidence measures for asr: 1computational resource constrained speech recognition: 1multi-channel audio and emotion recognition: 1speech enhancement: 1speech in multimodality: 1asr neural network architectures: 1speech in health: 1sequence-to-sequence speech recognition: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1speech and audio source separation and scene analysis: 1emotion and personality in conversation: 1audio signal characterization: 1speech and voice disorders: 1nn architectures for asr: 1speech synthesis paradigms and methods: 1emotion recognition and analysis: 1deep enhancement: 1source separation and spatial analysis: 1prosody modeling and generation: 1
IEEE keywordsspeech recognition: 14speech synthesis: 12natural language processing: 6end to end: 6speech enhancement: 5speaker recognition: 5predictive models: 4transfer learning: 4error analysis: 3speech coding: 3signal processing algorithms: 3text analysis: 3attention: 3decoding: 3emotion recognition: 3noise robustness: 2text to speech: 2adversarial training: 2filtering theory: 2text based speech editing: 2text editing: 2recurrent neural nets: 2optimisation: 2end to end model: 2autoregressive processes: 2multimodal fusion: 2self attention: 2transformer: 2speaker adaptation: 2low resource: 2synthetic speech detection: 1interactive fusion: 1noise measurement: 1data models: 1knowledge distillation: 1noise: 1noise robust: 1fewer tokens: 1language model: 1speech codecs: 1speech codec: 1time invariant: 1codes: 1asvspoof: 1multiscale permutation entropy: 1nonlinear dynamics: 1deepfakes: 1power spectral entropy: 1entropy: 1audio deepfake detection: 1splicing: 1tail: 1supervised learning: 1partial label learning: 1benchmark testing: 1imbalanced learning: 1pseudo label: 1phase locked loops: 1costs: 1prosodic boundaries: 1computational modeling: 1multi task learning: 1tagging: 1multi modal embeddings: 1bit error rate: 1linguistics: 1speaker dependent weighting: 1direction of arrival estimation: 1target speaker localization: 1generalized cross correlation: 1transforms: 1location awareness: 1controllability: 1oral communication: 1conversational tts: 1multi modal: 1semiconductor device modeling: 1multi grained: 1prosody: 1waveform generators: 1vocoders: 1deterministic plus stochastic: 1multiband excitation: 1noise control: 1vocoder: 1stochastic processes: 1one shot learning: 1coarse to fine decoding: 1mask prediction: 1covid 19: 1diseases: 1digital health: 1microorganisms: 1regression analysis: 1deep learning (artificial intelligence): 1depression: 1behavioural sciences computing: 1global information embedding: 1lstm: 1mask and prediction: 1fast: 1bert: 1non autoregressive: 1cross modal: 1teacher student learning: 1language modeling: 1gated recurrent fusion: 1robust end to end speech recognition: 1speech transformer: 1speech distortion: 1glottal source: 1arx lf model: 1iterative methods: 1vocal tract: 1signal denoising: 1inverse problems: 1source filter model: 1speaker sensitive modeling: 1conversational emotion recognition: 1conversational transformer network (ctnet): 1context sensitive modeling: 1signal classification: 1decoupled transformer: 1automatic speech recognition: 1code switching: 1bi level decoupling: 1prosody modeling: 1speaking style modeling: 1personalized speech synthesis: 1speech emotion recognition: 1cross attention: 1few shot speaker adaptation: 1the m2voc challenge: 1prosody and voice factorization: 1sequence to sequence: 1robustness: 1phoneme level autoregression: 1clustering algorithms: 1spectrogram: 1end to end post filter: 1deep clustering: 1permutation invariant training: 1deep attention fusion features: 1speech separation: 1interference: 1prosody transfer: 1audio signal processing: 1optimization strategy: 1multi head attention: 1audio visual systems: 1model level fusion: 1image fusion: 1video signal processing: 1continuous emotion recognition: 1forward backward algorithm: 1synchronous transformer: 1online speech recognition: 1encoding: 1asynchronous problem: 1chunk by chunk: 1cross lingual: 1phoneme representation: 1matrix decomposition: 1speaker embedding: 1word embedding: 1punctuation prediction: 1speech embedding: 1adversarial: 1language invariant: 1
Most publications (all venues) at: 2024: 58, 2023: 45, 2021: 43, 2022: 36, 2020: 36

Affiliations
Tsinghua University, Department of Automation, Beijing, China
University of Chinese Academy of Sciences, School of Artificial Intelligence, Beijing, China
Tsinghua University, Beijing, China (PhD 2001)

Recent publications

TASLP2024 Cunhang Fan, Mingming Ding, Jianhua Tao 0001, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv, 
Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection.

ICASSP2024 Yong Ren, Tao Wang 0074, Jiangyan Yi, Le Xu, Jianhua Tao 0001, Chu Yuan Zhang, Junzuo Zhou, 
Fewer-Token Neural Speech Codec with Time-Invariant Codes.

ICASSP2024 Chenglong Wang, Jiayi He, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Xiaohui Zhang 0006, 
Multi-Scale Permutation Entropy for Audio Deepfake Detection.

ICASSP2024 Mingyu Xu, Zheng Lian, Bin Liu 0041, Zerui Chen, Jianhua Tao 0001
Pseudo Labels Regularization for Imbalanced Partial-Label Learning.

AAAI2024 Xiaohui Zhang 0006, Jiangyan Yi, Chenglong Wang, Chu Yuan Zhang, Siding Zeng, Jianhua Tao 0001
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection.

SpeechComm2023 Jiangyan Yi, Jianhua Tao 0001, Ye Bai, Zhengkun Tian, Cunhang Fan, 
Transfer knowledge for punctuation prediction via adversarial training.

TASLP2023 Jiangyan Yi, Jianhua Tao 0001, Ruibo Fu, Tao Wang 0074, Chu Yuan Zhang, Chenglong Wang, 
Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings.

ICASSP2023 Guanjun Li, Wei Xue, Wenju Liu, Jiangyan Yi, Jianhua Tao 0001
GCC-Speaker: Target Speaker Localization with Optimal Speaker-Dependent Weighting in Multi-Speaker Scenarios.

ICASSP2023 Jinlong Xue, Yayue Deng, Fengping Wang, Ya Li, Yingming Gao, Jianhua Tao 0001, Jianqing Sun, Jiaen Liang, 
M2-CTTS: End-to-End Multi-Scale Multi-Modal Conversational Text-to-Speech Synthesis.

Interspeech2023 Haiyang Sun, Zheng Lian, Bin Liu 0041, Ying Li, Jianhua Tao 0001, Licai Sun, Cong Cai, Meng Wang, Yuan Cheng, 
EmotionNAS: Two-stream Neural Architecture Search for Speech Emotion Recognition.

Interspeech2023 Chenglong Wang, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Shuai Zhang 0014, Xun Chen, 
Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features.

Interspeech2023 Chenglong Wang, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Shuai Zhang 0014, Ruibo Fu, Xun Chen, 
TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection.

Interspeech2023 Ruiteng Zhang, Jianguo Wei, Xugang Lu, Yongwei Li, Junhai Xu, Di Jin 0001, Jianhua Tao 0001
SOT: Self-supervised Learning-Assisted Optimal Transport for Unsupervised Adaptive Speech Emotion Recognition.

ICML2023 Xiaohui Zhang 0006, Jiangyan Yi, Jianhua Tao 0001, Chenglong Wang, Chu Yuan Zhang, 
Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection.

SpeechComm2022 Wenhuan Lu, Xinyue Zhao, Na Guo, Yongwei Li, Jianguo Wei, Jianhua Tao 0001, Jianwu Dang 0001, 
One-shot emotional voice conversion based on feature separation.

TASLP2022 Tao Wang 0074, Ruibo Fu, Jiangyan Yi, Jianhua Tao 0001, Zhengqi Wen, 
NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation for Noise-Controllable Waveform Generation.

TASLP2022 Tao Wang 0074, Jiangyan Yi, Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, 
CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing.

ICASSP2022 Cong Cai, Bin Liu 0041, Jianhua Tao 0001, Zhengkun Tian, Jiahao Lu, Kexin Wang, 
End-to-End Network Based on Transformer for Automatic Detection of Covid-19.

ICASSP2022 Ya Li, Mingyue Niu, Ziping Zhao 0001, Jianhua Tao 0001
Automatic Depression Level Assessment from Speech By Long-Term Global Information Embedding.

ICASSP2022 Tao Wang 0074, Jiangyan Yi, Liqun Deng, Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, 
Context-Aware Mask Prediction Network for End-to-End Text-Based Speech Editing.

#16  | Longbiao Wang | DBLP Google Scholar  
By venue: Interspeech: 38, ICASSP: 30, TASLP: 5, SpeechComm: 3
By year: 2024: 8, 2023: 17, 2022: 21, 2021: 13, 2020: 12, 2019: 2, 2018: 3
ISCA sessionsanalysis of speech and audio signals: 3spatial audio: 3speech synthesis: 3asr: 2emotion and sentiment analysis: 2dnn architectures for speaker recognition: 2spoken language translation, information retrieval, summarization, resources, and evaluation: 1multimodal speech emotion recognition: 1paralinguistics: 1biosignal-enabled spoken communication: 1speech quality assessment: 1speech representation: 1zero, low-resource and multi-modal speech recognition: 1dereverberation, noise reduction, and speaker extraction: 1spoken dialogue systems and multimodality: 1spoken language processing: 1spoken dialogue systems: 1robust speaker recognition: 1targeted source separation: 1speech and voice disorders: 1speech emotion recognition: 1single-channel speech enhancement: 1voice and hearing disorders: 1learning techniques for speaker recognition: 1speech enhancement: 1adaptation and accommodation in conversation: 1robust speech recognition: 1spoofing detection: 1cognition and brain studies: 1
IEEE keywordsspeech recognition: 14speech synthesis: 6speaker verification: 6emotion recognition: 6representation learning: 5speech emotion recognition: 5speaker recognition: 5decoding: 4meta learning: 4natural language processing: 4predictive models: 3task analysis: 3transformers: 3speech enhancement: 3training data: 2data models: 2spectrogram: 2redundancy: 2ctap: 2contrastive learning: 2minimal supervision: 2self supervised learning: 2semantics: 2acoustic distortion: 2automatic speech recognition: 2visualization: 2degradation: 2transformer: 2convolution: 2noise measurement: 2time frequency analysis: 2time domain: 2domain adaptation: 2pattern classification: 2speaker extraction: 2speaker embedding: 2reverberation: 2naturalness: 2convolutional neural nets: 2interactive systems: 2image representation: 2capsule networks: 2multilingual: 1text to speech: 1self supervised representations: 1zero shot: 1low resource: 1text to speech (tts): 1pre training: 1agglutinative: 1language modeling: 1linguistics: 1morphology: 1prompt learning: 1syntactics: 1natural language understanding: 1hierarchical multi task learning: 1hidden markov models: 1labeling: 1cross domain slot filling: 1filling: 1pipelines: 1vc: 1text recognition: 1explosions: 1tts: 1asr: 1diffusion model: 1controllability: 1semantic coding: 1substitution: 1speech anti spoofing: 1concatenation: 1blending strategies: 1data augmentation: 1refining: 1adaptation: 1dysarthria: 1program processors: 1adaptation models: 1meta generalized speaker verification: 1performance evaluation: 1optimization: 1domain mismatch: 1recording: 1upper bound: 1audio visual data: 1co teaching+: 1vae: 1fast: 1complexity theory: 1knowledge distillation: 1lightweight: 1local global: 1positional encoding: 1natural languages: 1encoding: 1focusing: 1anti spoofing: 1learning systems: 1biometrics (access control): 1production: 1lip biometrics: 1visual speech: 1cross modal: 1correlation: 1lips: 1co learning: 1joint training: 1robust speech recognition: 1residual noise: 1speech distortion: 1robustness: 1refine network: 1fuses: 1multiresolution spectrograms: 1time domain analysis: 1noise robustness: 1disentangled representation learning: 1metric learning: 1extraterrestrial measurements: 1momentum augmentation: 1multimodal fusion: 1proposals: 1dense video captioning: 1center loss: 1direction of arrival estimation: 1beamforming: 1doa estimation: 1speaker localizer: 1array signal processing: 1mutual information: 1content: 1multiple references: 1audio signal processing: 1style: 1feature distillation: 1task driven loss: 1model compression: 1double constrained: 1utterance level representation: 1graph theory: 1atmosphere: 1dialogue level contextual information: 1recurrent neural nets: 1signal representation: 1signal classification: 1expressive speech synthesis: 1style modeling: 1style disentanglement: 1multilayer perceptrons: 1domain invariant: 1meta generalized transformation: 1query processing: 1knowledge based systems: 1knowledge retrieval: 1dialogue system: 1natural language generation: 1multi head attention: 1signal fusion: 1multi stage: 1pitch prediction: 1pitch control: 1speech coding: 1speech codecs: 1image recognition: 1spectro temporal attention: 1channel attention: 1auditory encoder: 1hearing: 1convolutional neural network: 1voice activity detection: 1ear: 1sensor fusion: 1graph convolutional: 1vgg 16: 1image fusion: 1multimodal emotion recognition: 1optimisation: 1meta speaker embedding network: 1cross channel: 1end to end model: 1dysarthric speech recognition: 1medical 
signal processing: 1articulatory attribute detection: 1multi view: 1time frequency: 1self attention: 1multi target learning: 1speech dereverberation: 1two stage: 1spectrograms fusion: 1acoustic and lexical context information: 1speech based user interfaces: 1mandarin dialog act recognition: 1hierarchical model.: 1
Most publications (all venues) at: 2022: 37, 2021: 33, 2023: 28, 2020: 22, 2019: 19

Affiliations
Nagaoka University of Technology

Recent publications

SpeechComm2024 Yuqin Lin, Jianwu Dang 0001, Longbiao Wang, Sheng Li 0010, Chenchen Ding, 
Disordered speech recognition considering low resources and abnormal articulation.

SpeechComm2024 Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li 0010, Jianwu Dang 0001, 
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network.

TASLP2024 Cheng Gong, Xin Wang 0037, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang 0001, Korin Richmond, Junichi Yamagishi, 
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations.

TASLP2024 Rui Liu 0008, Yifan Hu, Haolin Zuo, Zhaojie Luo, Longbiao Wang, Guanglai Gao, 
Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training.

TASLP2024 Xiao Wei, Yuhang Li, Yuke Si, Longbiao Wang, Xiaobao Wang, Jianwu Dang 0001, 
A Prompt-Based Hierarchical Pipeline for Cross-Domain Slot Filling.

ICASSP2024 Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang 0074, Longbiao Wang, Jianwu Dang 0001, 
Learning Speech Representation from Contrastive Token-Acoustic Pretraining.

ICASSP2024 Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang 0001, 
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models.

ICASSP2024 Linjuan Zhang, Kong Aik Lee, Lin Zhang, Longbiao Wang, Baoning Niu, 
CPAUG: Refining Copy-Paste Augmentation for Speech Anti-Spoofing.

TASLP2023 Yuqin Lin, Longbiao Wang, Yanbing Yang, Jianwu Dang 0001, 
CFDRN: A Cognition-Inspired Feature Decomposition and Recombination Network for Dysarthric Speech Recognition.

TASLP2023 Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, Helen Meng, 
Meta-Generalization for Domain-Invariant Speaker Verification.

ICASSP2023 Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, 
Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning.

ICASSP2023 Yuhao Liu, Cheng Gong, Longbiao Wang, Xixin Wu, Qiuyu Liu, Jianwu Dang 0001, 
VF-Taco2: Towards Fast and Lightweight Synthesis for Autoregressive Models with Variation Autoencoder and Feature Distillation.

ICASSP2023 Xiaohui Liu, Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Jianwu Dang 0001, 
Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection.

ICASSP2023 Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang 0001, 
Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification.

ICASSP2023 Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang 0001, Xiaobao Wang, Shiliang Zhang, 
Speech and Noise Dual-Stream Spectrogram Refine Network With Speech Distortion Loss For Robust Speech Recognition.

ICASSP2023 Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang 0001, Tatsuya Kawahara, 
Time-Domain Speech Enhancement Assisted by Multi-Resolution Frequency Encoder and Decoder.

ICASSP2023 Yao Sun, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, 
Noise-Disentanglement Metric Learning for Robust Speaker Verification.

ICASSP2023 Yiwei Wei, Shaozu Yuan, Meng Chen 0006, Longbiao Wang
Enhancing Multimodal Alignment with Momentum Augmentation for Dense Video Captioning.

Interspeech2023 Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang 0001, Chengyun Deng, Fei Wang, 
Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation.

Interspeech2023 Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang 0001, Shiliang Zhang, 
Rethinking the Visual Cues in Audio-Visual Speaker Extraction.

#17  | Jianwu Dang 0001 | DBLP Google Scholar  
By venue: Interspeech: 38, ICASSP: 28, SpeechComm: 4, TASLP: 4
By year: 2024: 6, 2023: 15, 2022: 21, 2021: 12, 2020: 14, 2019: 3, 2018: 3
ISCA sessionsanalysis of speech and audio signals: 3spatial audio: 3speech synthesis: 2asr: 2emotion and sentiment analysis: 2learning techniques for speaker recognition: 2speech processing in the brain: 2spoken language translation, information retrieval, summarization, resources, and evaluation: 1multimodal speech emotion recognition: 1speech quality assessment: 1speech representation: 1zero, low-resource and multi-modal speech recognition: 1dereverberation, noise reduction, and speaker extraction: 1spoken dialogue systems and multimodality: 1spoken language processing: 1spoken dialogue systems: 1robust speaker recognition: 1targeted source separation: 1speech and voice disorders: 1speech emotion recognition: 1conversational systems: 1single-channel speech enhancement: 1voice and hearing disorders: 1acoustic phonetics: 1speech enhancement: 1adaptation and accommodation in conversation: 1robust speech recognition: 1spoofing detection: 1cognition and brain studies: 1
IEEE keywordsspeech recognition: 14speaker verification: 6emotion recognition: 6speech synthesis: 5representation learning: 5speech emotion recognition: 5speaker recognition: 5meta learning: 4natural language processing: 4decoding: 3predictive models: 3task analysis: 3spectrogram: 2redundancy: 2ctap: 2minimal supervision: 2text recognition: 2self supervised learning: 2semantics: 2acoustic distortion: 2visualization: 2degradation: 2transformer: 2transformers: 2convolution: 2speech enhancement: 2noise measurement: 2time frequency analysis: 2time domain: 2domain adaptation: 2pattern classification: 2speaker extraction: 2speaker embedding: 2reverberation: 2naturalness: 2convolutional neural nets: 2interactive systems: 2image representation: 2capsule networks: 2training data: 1multilingual: 1text to speech: 1self supervised representations: 1data models: 1zero shot: 1low resource: 1prompt learning: 1syntactics: 1natural language understanding: 1hierarchical multi task learning: 1hidden markov models: 1labeling: 1cross domain slot filling: 1filling: 1pipelines: 1vc: 1contrastive learning: 1explosions: 1tts: 1asr: 1diffusion model: 1controllability: 1semantic coding: 1adaptation: 1automatic speech recognition: 1dysarthria: 1program processors: 1adaptation models: 1meta generalized speaker verification: 1performance evaluation: 1optimization: 1domain mismatch: 1recording: 1upper bound: 1audio visual data: 1co teaching+: 1intent understanding: 1oral communication: 1paralinguistic information: 1brain network features: 1eeg: 1human computer interaction: 1perturbation methods: 1linguistics: 1brain: 1vae: 1fast: 1complexity theory: 1knowledge distillation: 1lightweight: 1local global: 1positional encoding: 1natural languages: 1encoding: 1focusing: 1anti spoofing: 1learning systems: 1biometrics (access control): 1production: 1lip biometrics: 1visual speech: 1cross modal: 1correlation: 1lips: 1co learning: 1joint training: 1robust speech recognition: 1residual noise: 1speech distortion: 1robustness: 1refine network: 1fuses: 1multiresolution spectrograms: 1time domain analysis: 1noise robustness: 1disentangled representation learning: 1metric learning: 1extraterrestrial measurements: 1center loss: 1direction of arrival estimation: 1beamforming: 1doa estimation: 1speaker localizer: 1array signal processing: 1mutual information: 1content: 1multiple references: 1audio signal processing: 1style: 1feature distillation: 1task driven loss: 1model compression: 1double constrained: 1utterance level representation: 1graph theory: 1atmosphere: 1dialogue level contextual information: 1recurrent neural nets: 1signal representation: 1signal classification: 1multilayer perceptrons: 1domain invariant: 1meta generalized transformation: 1query processing: 1knowledge based systems: 1knowledge retrieval: 1dialogue system: 1natural language generation: 1multi head attention: 1signal fusion: 1multi stage: 1pitch prediction: 1pitch control: 1speech coding: 1speech codecs: 1image recognition: 1spectro temporal attention: 1channel attention: 1auditory encoder: 1hearing: 1convolutional neural network: 1voice activity detection: 1ear: 1sensor fusion: 1graph convolutional: 1vgg 16: 1image fusion: 1multimodal emotion recognition: 1optimisation: 1meta speaker embedding network: 1cross channel: 1end to end model: 1dysarthric speech recognition: 1medical signal processing: 1articulatory attribute detection: 1multi view: 1time frequency: 1self attention: 1multi target learning: 1speech dereverberation: 1two stage: 1spectrograms fusion: 
1acoustic and lexical context information: 1speech based user interfaces: 1mandarin dialog act recognition: 1hierarchical model.: 1
Most publications (all venues) at: 2022: 42, 2021: 39, 2020: 29, 2019: 29, 2016: 25

Affiliations
Tianjin University, Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, China
Institut de la Communication Parlée (ICP), Centre National de la Recherche Scientifique (CNRS), France (2002-2003)
Japan Advanced Institute of Science and Technology, JAIST, Japan
Shizuoka University, Japan (PhD 1992)

Recent publications

SpeechComm2024 Yuqin Lin, Jianwu Dang 0001, Longbiao Wang, Sheng Li 0010, Chenchen Ding, 
Disordered speech recognition considering low resources and abnormal articulation.

SpeechComm2024 Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li 0010, Jianwu Dang 0001
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network.

TASLP2024 Cheng Gong, Xin Wang 0037, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang 0001, Korin Richmond, Junichi Yamagishi, 
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations.

TASLP2024 Xiao Wei, Yuhang Li, Yuke Si, Longbiao Wang, Xiaobao Wang, Jianwu Dang 0001
A Prompt-Based Hierarchical Pipeline for Cross-Domain Slot Filling.

ICASSP2024 Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang 0074, Longbiao Wang, Jianwu Dang 0001
Learning Speech Representation from Contrastive Token-Acoustic Pretraining.

ICASSP2024 Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang 0001
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models.

TASLP2023 Yuqin Lin, Longbiao Wang, Yanbing Yang, Jianwu Dang 0001
CFDRN: A Cognition-Inspired Feature Decomposition and Recombination Network for Dysarthric Speech Recognition.

TASLP2023 Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, Helen Meng, 
Meta-Generalization for Domain-Invariant Speaker Verification.

ICASSP2023 Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001
Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning.

ICASSP2023 Zhongjie Li, Bin Zhao, Gaoyan Zhang, Jianwu Dang 0001
Brain Network Features Differentiate Intentions from Different Emotional Expressions of the Same Text.

ICASSP2023 Yuhao Liu, Cheng Gong, Longbiao Wang, Xixin Wu, Qiuyu Liu, Jianwu Dang 0001
VF-Taco2: Towards Fast and Lightweight Synthesis for Autoregressive Models with Variation Autoencoder and Feature Distillation.

ICASSP2023 Xiaohui Liu, Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Jianwu Dang 0001
Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection.

ICASSP2023 Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang 0001
Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification.

ICASSP2023 Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang 0001, Xiaobao Wang, Shiliang Zhang, 
Speech and Noise Dual-Stream Spectrogram Refine Network With Speech Distortion Loss For Robust Speech Recognition.

ICASSP2023 Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang 0001, Tatsuya Kawahara, 
Time-Domain Speech Enhancement Assisted by Multi-Resolution Frequency Encoder and Decoder.

ICASSP2023 Yao Sun, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001
Noise-Disentanglement Metric Learning for Robust Speaker Verification.

Interspeech2023 Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang 0001, Chengyun Deng, Fei Wang, 
Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation.

Interspeech2023 Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang 0001, Shiliang Zhang, 
Rethinking the Visual Cues in Audio-Visual Speaker Extraction.

Interspeech2023 Yuhang Li, Xiao Wei, Yuke Si, Longbiao Wang, Xiaobao Wang, Jianwu Dang 0001
Improving Zero-shot Cross-domain Slot Filling via Transformer-based Slot Semantics Fusion.

Interspeech2023 Zhongjie Li, Gaoyan Zhang, Longbiao Wang, Jianwu Dang 0001
Discrimination of the Different Intents Carried by the Same Text Through Integrating Multimodal Information.

#18  | Chng Eng Siong | DBLP Google Scholar  
By venue: Interspeech: 32, ICASSP: 30, TASLP: 3, ICLR: 2, ACL: 2, NeurIPS: 1, AAAI: 1, IJCAI: 1, SpeechComm: 1, EMNLP: 1
By year: 2024: 10, 2023: 24, 2022: 13, 2021: 7, 2020: 10, 2019: 6, 2018: 4
ISCA sessionsanalysis of speech and audio signals: 4speaker and language identification: 2speech recognition: 2speech enhancement, bandwidth extension and hearing aids: 2asr neural network architectures: 2cross-lingual and multilingual asr: 2end-to-end asr: 1self-supervised learning in asr: 1acoustic signal representation and analysis: 1robust asr, and far-field/multi-talker asr: 1multimodal speech emotion recognition and paralinguistics: 1speech segmentation: 1speech type classification and diagnosis: 1language and accent recognition: 1targeted source separation: 1bi- and multilinguality: 1acoustic model adaptation for asr: 1lexicon and language model for speech recognition: 1speaker and language recognition: 1neural waveform generation: 1speech technologies for code-switching in multilingual communities: 1language modeling: 1show and tell: 1source separation from monaural input: 1
IEEE keywordsspeech recognition: 14speech enhancement: 11noise measurement: 6speaker recognition: 5noise robustness: 4task analysis: 4contrastive learning: 4speaker extraction: 4automatic speech recognition: 3self supervised learning: 3error analysis: 3representation learning: 3adaptation models: 3multi task learning: 3signal processing algorithms: 3speaker embedding: 3speech coding: 2convolution: 2time domain analysis: 2robustness: 2background noise: 2transformer: 2transformers: 2speech separation: 2modulation: 2uncertainty: 2training data: 2generative adversarial network: 2multitasking: 2noise robust speech recognition: 2benchmark testing: 2codes: 2data augmentation: 2keyword spotting: 2entropy: 2natural language processing: 2information retrieval: 2sensor fusion: 2speech emotion recognition: 2emotion recognition: 2time domain: 2signal reconstruction: 2discrete codebook: 1speech distortion: 1code predictor: 1distortion: 1information interaction: 1dual branch: 1estimation: 1spectrogram: 1diagonal version of structured state space sequence (s4d) model: 1online diarization: 1filtering: 1low latency communication: 1spatial dictionary: 1multi channel: 1latency: 1prompt tuning: 1computational modeling: 1zero shot learning: 1explainable prompt: 1automatic speaker verification: 1label level knowledge distillation: 1knowledge engineering: 1knowledge distillation: 1data mining: 1attentive pooling: 1feature modulation: 1noisy speech separation: 1decoding: 1deepfake detection: 1deepfakes: 1representation regularization: 1audio visual fusion: 1measurement: 1diffusion probabilistic model: 1reinforcement learning: 1generative adversarial networks: 1unsupervised domain adaptation: 1supervised learning: 1gradient remedy: 1interference: 1gradient interference: 1noise robust speech separation: 1gradient modulation: 1end to end network: 1unify speech enhancement and separation: 1disentangling representations: 1noise robust automatic speech recognition: 1visualization: 1boosting: 1performance evaluation: 1low resource: 1datamaps: 1data models: 1mixup: 1xlsr: 1language identification: 1online speaker clustering: 1clustering algorithms: 1calibration: 1speaker verification: 1probabilistic logic: 1multi modal: 1representation: 1linguistics: 1linear programming: 1bidirectional attention: 1end to end: 1forced alignment: 1learning systems: 1optimisation: 1reinforcement leaning: 1direction of arrival estimation: 1beamforming: 1doa estimation: 1speaker localizer: 1reverberation: 1array signal processing: 1joint training approach: 1over suppression phenomenon: 1interactive feature fusion: 1noisy far field: 1small footprint: 1minimum word error: 1autoregressive processes: 1code switching: 1non autoregressive: 1asr: 1gaussian processes: 1dialogue relation extraction: 1interactive systems: 1pattern classification: 1text analysis: 1multi relations: 1bert: 1co attention mechanism: 1convolutional neural nets: 1multimodal fusion: 1recurrent neural nets: 1audio signal processing: 1multi level acoustic information: 1signal fusion: 1multi stage: 1image recognition: 1spectro temporal attention: 1channel attention: 1disentangled feature learning: 1signal denoising: 1adversarial training: 1signal representation: 1online speech recognition: 1early endpointing: 1scalegrad: 1analytical models: 1depth wise separable convolution: 1multi scale: 1multi scale fusion: 1speech bandwidth extension: 1signal restoration: 1low resource asr: 1pre training: 1catastrophic forgetting.: 1independent language model: 1fine tuning: 1spectrum 
approximation loss: 1source separation: 1
Most publications (all venues) at: 2023: 42, 2016: 27, 2015: 27, 2022: 26, 2024: 25


Recent publications

TASLP2024 Yuchen Hu, Chen Chen 0075, Qiushi Zhu, Eng Siong Chng
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR.

TASLP2024 Linhui Sun, Shuo Yuan, Aifei Gong, Lei Ye, Eng Siong Chng
Dual-Branch Modeling Based on State-Space Model for Speech Enhancement.

ICASSP2024 Weiguang Chen, Tran The Anh, Xionghu Zhong, Eng Siong Chng
Enhancing Low-Latency Speaker Diarization with Spatial Dictionary Learning.

ICASSP2024 Dianwen Ng, Chong Zhang 0003, Ruixi Zhang, Yukun Ma, Fabian Ritter Gutierrez, Trung Hieu Nguyen 0001, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma 0001, 
Are Soft Prompts Good Zero-Shot Learners for Speech Recognition?

ICASSP2024 Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng
Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification.

ICASSP2024 Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang 0003, Hao Wang 0199, Trung Hieu Nguyen 0001, Kun Zhou 0003, Dianwen Ng, Eng Siong Chng, Bin Ma 0001, 
SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance.

ICASSP2024 Zizheng Zhang, Chen Chen 0075, Hsin-Hung Chen, Xiang Liu, Yuchen Hu, Eng Siong Chng
Noise-Aware Speech Separation with Contrastive Learning.

ICASSP2024 Heqing Zou, Meng Shen 0002, Yuchen Hu, Chen Chen 0075, Eng Siong Chng, Deepu Rajan, 
Cross-Modality and Within-Modality Regularization for Audio-Visual Deepfake Detection.

ICLR2024 Chen Chen 0075, Ruizhe Li 0001, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Engsiong Chng, Chao-Han Huck Yang, 
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition.

ICLR2024 Yuchen Hu, Chen Chen 0075, Chao-Han Huck Yang, Ruizhe Li 0001, Chao Zhang 0031, Pin-Yu Chen, Engsiong Chng
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition.

ICASSP2023 Chen Chen 0075, Yuchen Hu, Weiwei Weng, Eng Siong Chng
Metric-Oriented Speech Enhancement Using Diffusion Probabilistic Model.

ICASSP2023 Chen Chen 0075, Yuchen Hu, Heqing Zou, Linhui Sun, Eng Siong Chng
Unsupervised Noise Adaptation Using Data Simulation.

ICASSP2023 Yuchen Hu, Chen Chen 0075, Ruizhe Li 0001, Qiushi Zhu, Eng Siong Chng
Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition.

ICASSP2023 Yuchen Hu, Chen Chen 0075, Heqing Zou, Xionghu Zhong, Eng Siong Chng
Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation.

ICASSP2023 Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, Jinjie Ni, Chong Zhang 0003, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma 0001, 
De'hubert: Disentangling Noise in a Self-Supervised Model for Robust Speech Recognition.

ICASSP2023 Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang 0003, Yukun Ma, Trung Hieu Nguyen 0001, Chongjia Ni, Eng Siong Chng, Bin Ma 0001, 
Contrastive Speech Mixup for Low-Resource Keyword Spotting.

ICASSP2023 Shangeth Rajaa, Kriti Anandan, Swaraj Dalmia, Tarun Gupta, Eng Siong Chng
Improving Spoken Language Identification with Map-Mix.

ICASSP2023 Alexey Sholokhov, Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng
Probabilistic Back-ends for Online Speaker Recognition and Clustering.

ICASSP2023 Yuhang Yang, Haihua Xu, Hao Huang 0009, Eng Siong Chng, Sheng Li 0010, 
Speech-Text Based Multi-Modal Training with Bidirectional Attention for Improved Speech Recognition.

Interspeech2023 Chen Chen 0075, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng
A Neural State-Space Modeling Approach to Efficient Speech Separation.

#19  | Kai Yu 0004 | DBLP Google Scholar  
By venue: ICASSP: 33, Interspeech: 22, TASLP: 14, AAAI: 1, SpeechComm: 1, EMNLP: 1
By year: 2024: 11, 2023: 12, 2022: 13, 2021: 6, 2020: 17, 2019: 11, 2018: 2
ISCA sessionsspeech synthesis: 4speaker recognition: 2speech coding: 1speech recognition: 1automatic audio classification and audio captioning: 1pathological speech analysis: 1single-channel speech enhancement: 1speaker embedding and diarization: 1language and lexical modeling for asr: 1voice activity detection and keyword spotting: 1phonetic event detection and segmentation: 1spoken language understanding: 1anti-spoofing and liveness detection: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1speaker recognition and anti-spoofing: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1speaker verification using neural network methods: 1acoustic modelling: 1
IEEE keywordsspeech recognition: 10natural language processing: 10speech synthesis: 8speaker recognition: 8task analysis: 5speech enhancement: 5decoding: 5vocoders: 5text analysis: 5visualization: 4audio signal processing: 4measurement: 3text to speech: 3semantics: 3time domain analysis: 3hidden markov models: 3speaker verification: 3natural language generation: 2signal processing algorithms: 2timbre: 2recording: 2diffusion: 2labeling: 2self supervised learning: 2language modeling: 2transfer learning: 2adaptation models: 2data models: 2optimization: 2natural languages: 2transformers: 2variational autoencoder: 2gaussian processes: 2lattice to sequence: 2adversarial training: 2teacher student learning: 2data augmentation: 2data handling: 2text dependent speaker verification: 2video signal processing: 2dialogue policy: 2slot filling: 2speaker embedding: 2interactive systems: 2attention models: 2recurrent neural nets: 2encoder decoder architecture: 1training schemes: 1evaluation metrics: 1audio recognition: 1automated audio captioning: 1efficiency: 1flow matching: 1mathematical models: 1rectified flow: 1trajectory: 1speed quality tradeoff: 1speaker embedding free: 1stability analysis: 1zero shot voice conversion: 1linguistics: 1cross attention: 1face animation: 1technological innovation: 1talking face: 1dubbing: 1synchronization: 1videos: 1rhetoric: 1expressive text to speech: 1tts dataset: 1large language models: 1annotations: 1manuals: 1textual expressiveness: 1systematics: 1byte pair encoding: 1syntactics: 1rescore: 1discrete audio token: 1correlation: 1category audio generation: 1multimodal: 1clustering: 1audio text learning: 1chatbots: 1data curation pipeline: 1detailed audio captioning: 1metadata: 1pipelines: 1hierarchical semantic frame: 1ontologies: 1spoken language understanding: 1relational graph attention network: 1degradation: 1multitasking: 1discrete tokens: 1speaker adaptation: 1timbre normalization: 1vector quantization: 1discrete fourier transforms: 1cepstrum: 1noise measurement: 1neural homomorphic synthesis: 1spectral masking: 1multi lingual: 1multi speaker: 1vqtts: 1limmits: 1classifier guidance: 1emotion intensity control: 1controllability: 1noise reduction: 1de noising diffusion models: 1emotional tts: 1complexity theory: 1sound generation: 1spice: 1variation quantized gan: 1text to sound: 1error analysis: 1audio visual: 1misp challenge: 1speaker diarization: 1inverse problems: 1speech editing: 1zero shot adaptation: 1diffusion probabilistic model: 1probabilistic logic: 1unit selection: 1probability: 1fastspeech2: 1speech codecs: 1voice cloning: 1autoregressive processes: 1mixture models: 1prosody cloning: 1prosody modelling: 1mixture density network: 1pre trained language model: 1algebra: 1lattice to lattice: 1prosody control: 1unsupervised learning: 1prosody tagging: 1decision trees: 1word level prosody: 1source filter model: 1complex neural network: 1weakly supervised learning: 1category adaptation: 1deep neural networks: 1source separation: 1supervised learning: 1information retrieval: 1audio text retrieval: 1aggregation: 1cross modal: 1pre trained model: 1pattern classification: 1arbitrary wake word: 1training detection criteria: 1entropy: 1wake word detection: 1text prompt: 1streaming: 1conditional generation: 1audio captioning: 1diverse caption generation: 1teacher training: 1voice activity detection: 1speech activity detection. 
weakly supervised learning: 1convolutional neural networks: 1i vector: 1sound event detection: 1dataset: 1music: 1text to audio grounding: 1scalability: 1multiple tasks: 1actor critic: 1parallel training: 1automatic speech recognition: 1attention based encoder decoder: 1standards: 1connectionist temporal classification: 1variational auto encoder: 1text independent speaker verification: 1generative adversarial network: 1binarization: 1product quantization: 1data compression: 1neural network language model: 1storage management: 1quantisation (signal): 1intent detection: 1natural language understanding (nlu): 1dual learning: 1semi supervised learning: 1domain adaptation: 1prior knowledge: 1label embedding: 1natural language understanding: 1on the fly data augmentation: 1specaugment: 1convolutional neural nets: 1multitask learning: 1channel information: 1low resource: 1dialogue state tracking: 1hierarchical: 1data sparsity: 1polysemy: 1multi sense embeddings: 1word processing: 1distributed representation: 1search problems: 1forward backward algorithm: 1word lattice: 1speech coding: 1training data: 1text dependent: 1adaptation: 1system performance: 1text mismatch: 1data collection: 1multi agent systems: 1policy adaptation: 1graph theory: 1ontologies (artificial intelligence): 1deep reinforcement learning: 1graph neural networks: 1center loss: 1angular softmax: 1short duration text independent speaker verification: 1speaker neural embedding: 1triplet loss: 1ctc: 1computational modeling: 1end to end speech recognition: 1multi speaker speech recognition: 1cocktail party problem: 1attention mechanism: 1knowledge distillation: 1computer aided instruction: 1language translation: 1audio databases: 1audio caption: 1recurrent neural networks: 1signal classification: 1
Most publications (all venues) at: 2020: 40, 2024: 34, 2023: 29, 2022: 26, 2019: 25

Affiliations
Shanghai Jiao Tong University, Computer Science and Engineering Department, China
Cambridge University, Engineering Department, UK (PhD 2006)

Recent publications

TASLP2024 Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu 0004
Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning.

ICASSP2024 Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen 0001, Kai Yu 0004
VoiceFlow: Efficient Text-To-Speech with Rectified Flow Matching.

ICASSP2024 Junjie Li, Yiwei Guo, Xie Chen 0001, Kai Yu 0004
SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention.

ICASSP2024 Tao Liu, Chenpeng Du, Shuai Fan 0005, Feilong Chen, Kai Yu 0004
DiffDub: Person-Generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-Encoder.

ICASSP2024 Sen Liu, Yiwei Guo, Xie Chen 0001, Kai Yu 0004
StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations.

ICASSP2024 Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen 0001, Kai Yu 0004
Acoustic BPE for Speech Generation with Discrete Tokens.

ICASSP2024 Zeyu Xie, Baihan Li, Xuenan Xu, Mengyue Wu, Kai Yu 0004
Enhancing Audio Generation Diversity with Visual Information.

ICASSP2024 Xuenan Xu, Xiaohang Xu 0004, Zeyu Xie, Pingyue Zhang, Mengyue Wu, Kai Yu 0004
A Detailed Audio-Text Data Simulation Pipeline Using Single-Event Sounds.

ICASSP2024 Hongshen Xu, Ruisheng Cao, Su Zhu, Sheng Jiang, Hanchong Zhang, Lu Chen 0002, Kai Yu 0004
A Birgat Model for Multi-Intent Spoken Language Understanding with Hierarchical Semantic Frames.

ICASSP2024 Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu 0004, Daniel Povey, Xie Chen 0001, 
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS.

AAAI2024 Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen 0001, Shuai Wang 0016, Hui Zhang, Kai Yu 0004
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.

TASLP2023 Chenpeng Du, Yiwei Guo, Xie Chen 0001, Kai Yu 0004
Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature.

TASLP2023 Wenbin Jiang, Kai Yu 0004
Speech Enhancement With Integration of Neural Homomorphic Synthesis and Spectral Masking.

ICASSP2023 Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu 0004
Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge.

ICASSP2023 Yiwei Guo, Chenpeng Du, Xie Chen 0001, Kai Yu 0004
Emodiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance.

ICASSP2023 Guangwei Li, Xuenan Xu, Lingfeng Dai, Mengyue Wu, Kai Yu 0004
Diverse and Vivid Sound Generation from Text Descriptions.

ICASSP2023 Tao Liu, Zhengyang Chen, Yanmin Qian, Kai Yu 0004
Multi-Speaker End-to-End Multi-Modal Speaker Diarization System for the MISP 2022 Challenge.

ICASSP2023 Zhijun Liu, Yiwei Guo, Kai Yu 0004
DiffVoice: Text-to-Speech with Latent Diffusion.

Interspeech2023 Wenbin Jiang, Fei Wen, Yifan Zhang, Kai Yu 0004
UnSE: Unsupervised Speech Enhancement Using Optimal Transport.

Interspeech2023 Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu 0004, Xie Chen 0001, 
Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation.

#20  | Tomoki Toda | DBLP Google Scholar  
By venue: Interspeech: 31, ICASSP: 30, TASLP: 10, SpeechComm: 1
By year: 2024: 7, 2023: 15, 2022: 12, 2021: 14, 2020: 11, 2019: 8, 2018: 5
ISCA sessionsspeech synthesis: 10voice conversion and adaptation: 3neural techniques for voice conversion and waveform generation: 3speech synthesis and voice conversion: 2speech enhancement, bandwidth extension and hearing aids: 2voice conversion and speech synthesis: 2speech quality assessment: 1spoken dialog systems and conversational analysis: 1the voicemos challenge: 1technology for disordered speech: 1the zero resource speech challenge 2020: 1neural waveform generation: 1novel paradigms for direct synthesis based on speech-related biosignals: 1sequence models for asr: 1speech synthesis paradigms and methods: 1
IEEE keywordsspeech synthesis: 16vocoders: 12voice conversion: 9neural vocoder: 8speech recognition: 7natural language processing: 5linguistics: 4speech enhancement: 4training data: 4speaker recognition: 4autoregressive processes: 4transformer: 4recurrent neural nets: 4controllability: 3text to speech: 3real time systems: 3electrolaryngeal speech: 3voice conversion (vc): 3open source software: 3speech intelligibility: 3sequence to sequence: 3convolutional neural nets: 3pathology: 2model pretraining: 2task analysis: 2artificial neural networks: 2data mining: 2speech emotion recognition: 2emotion recognition: 2self supervised learning: 2fundamental frequency control: 2source filter model: 2predictive models: 2decoding: 2error analysis: 2convolution: 2noisy to noisy vc: 2noisy speech modeling: 2self supervised speech representation: 2robustness: 2automatic speech recognition: 2computer architecture: 2mos prediction: 2non autoregressive: 2open source: 2pitch dependent dilated convolution: 2audio signal processing: 2probability: 2gaussian processes: 2supervised learning: 2speech coding: 2transfer learning: 1domain adaptation: 1automatic speech recognition (asr): 1low resourced asr: 1electrolaryngeal (el) speech: 1multichannel source separation: 1direction of arrival estimation: 1target speaker extraction: 1source separation: 1microphones: 1interference: 1multichannel variational autoencoder (mvae): 1estimation error: 1error correction: 1multi modal fusion: 1multitasking: 1semantics: 1coherence: 1asr error detection: 1visualization: 1asr error correction: 1audio difference captioning: 1audio difference learning: 1annotations: 1audio captioning: 1learning systems: 1finite impulse response filters: 1finite impulse response: 1synthesizers: 1jets: 1wavenext: 1convnext: 1transformers: 1intelligibility enhancement: 1atypical speech: 1harmonic analysis: 1speech rate conversion: 1generators: 1noise robustness: 1distortion: 1degradation: 1noise measurement: 1data augmentation: 1background noise: 1mathematical models: 1unified source filter networks: 1single channel speech enhancement: 1deep neural network: 1noise2noise: 1unsupervised learning: 1behavioral sciences: 1natural languages: 1sequence to sequence voice conversion: 1embedded systems: 1computers: 1low latency speech enhancement: 1speaker normalization: 1group theory: 1vocal tract length: 1asr: 1writing: 1minimally resourced asr: 1limiting: 1timbre: 1frequency synthesizers: 1documentation: 1autoregressive models: 1singing voice synthesis: 1pytorch: 1multi stream models: 1variational auto encoder: 1diffusion probabilistic model: 1representation learning: 1text to speech synthesis: 1tts: 1probabilistic logic: 1generative adversarial networks: 1source filter model: 1speech naturalness assessment: 1mean opinion score: 1streaming: 1speech quality assessment: 1hearing: 1sequence to sequence modeling: 1decision making: 1dysarthric speech: 1pathological speech: 1autoencoder: 1computer based training: 1signal denoising: 1pretraining: 1transformer network: 1attention: 1computational modeling: 1sequence to sequence learning: 1data models: 1many to many vc: 1parallel wavegan: 1quasi periodic wavenet: 1wavenet: 1quasi periodic structure: 1pitch controllability: 1vocoder: 1listener adaptation: 1perceived emotion: 1conformer: 1bert: 1language model: 1text analysis: 1vector quantized variational autoencoder: 1nonparallel: 1medical disorders: 1dysarthria: 1diffwave: 1diffusion probabilistic vocoder: 1sub modeling: 1wavegrad: 1noise: 1call centres: 1hierarchical 
multi task model: 1contact center call: 1customer satisfaction (cs): 1long short term memory recurrent neural networks: 1customer satisfaction: 1customer services: 1reproducibility of results: 1end to end: 1sound event detection: 1weakly supervised learning: 1self attention: 1weighted forced attention: 1forced alignment: 1sequence to sequence model: 1laplacian distribution: 1prediction theory: 1wavenet vocoder: 1multiple samples output: 1shallow model: 1linear prediction: 1fast fourier transforms: 1gaussian inverse autoregressive flow: 1parallel wavenet: 1fftnet: 1noise shaping: 1wavenet fine tuning: 1oversmoothed parameters: 1cyclic recurrent neural network: 1
Most publications (all venues) at: 2014: 42, 2015: 37, 2023: 30, 2021: 28, 2018: 25


Recent publications

TASLP2024 Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda
Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition.

TASLP2024 Rui Wang, Li Li 0063, Tomoki Toda
Dual-Channel Target Speaker Extraction Based on Conditional Variational Autoencoder and Directional Information.

ICASSP2024 Jiajun He, Xiaohan Shi, Xingfeng Li 0001, Tomoki Toda
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction.

ICASSP2024 Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda
Audio Difference Learning for Audio Captioning.

ICASSP2024 Yamato Ohtani, Takuma Okamoto, Tomoki Toda, Hisashi Kawai, 
FIRNet: Fundamental Frequency Controllable Fast Neural Vocoder With Trainable Finite Impulse Response Filter.

ICASSP2024 Takuma Okamoto, Yamato Ohtani, Tomoki Toda, Hisashi Kawai, 
Convnext-TTS And Convnext-VC: Convnext-Based Fast End-To-End Sequence-To-Sequence Text-To-Speech And Voice Conversion.

ICASSP2024 Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda
Electrolaryngeal Speech Intelligibility Enhancement through Robust Linguistic Encoders.

TASLP2023 Keisuke Matsubara, Takuma Okamoto, Ryoichi Takashima, Tetsuya Takiguchi, Tomoki Toda, Hisashi Kawai, 
Harmonic-Net: Fundamental Frequency and Speech Rate Controllable Fast Neural Vocoder.

TASLP2023 Chao Xie, Tomoki Toda
Noisy-to-Noisy Voice Conversion Under Variations of Noisy Condition.

TASLP2023 Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda
High-Fidelity and Pitch-Controllable Neural Vocoder Based on Unified Source-Filter Networks.

ICASSP2023 Takuya Fujimura, Tomoki Toda
Analysis Of Noisy-Target Training For Dnn-Based Speech Enhancement.

ICASSP2023 Kazuhiro Kobayashi, Tomoki Hayashi, Tomoki Toda
Low-Latency Electrolaryngeal Speech Enhancement Based on Fastspeech2-Based Voice Conversion and Self-Supervised Speech Representation.

ICASSP2023 Atsushi Miyashita, Tomoki Toda
Representation of Vocal Tract Length Transformation Based on Group Theory.

ICASSP2023 Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda
Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition.

ICASSP2023 Ryuichi Yamamoto, Reo Yoneyama, Tomoki Toda
NNSVS: A Neural Network-Based Singing Voice Synthesis Toolkit.

ICASSP2023 Yusuke Yasuda, Tomoki Toda
Text-To-Speech Synthesis Based on Latent Variable Conversion Using Diffusion Probabilistic Model and Variational Autoencoder.

ICASSP2023 Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda
Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder.

Interspeech2023 Yeonjong Choi, Chao Xie, Tomoki Toda
Reverberation-Controllable Voice Conversion Using Reverberation Time Estimator.

Interspeech2023 Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda
Preference-based training framework for automatic speech quality assessment using deep neural network.

Interspeech2023 Takuma Okamoto, Tomoki Toda, Hisashi Kawai, 
E2E-S2S-VC: End-To-End Sequence-To-Sequence Voice Conversion.

#21  | Junichi Yamagishi | DBLP Google Scholar  
By venue: Interspeech: 32, ICASSP: 24, TASLP: 15
By year: 2024: 5, 2023: 12, 2022: 12, 2021: 8, 2020: 16, 2019: 13, 2018: 5
ISCA sessionsspeech synthesis: 8voice anti-spoofing and countermeasure: 3voice privacy challenge: 3speaker and language identification: 2speech synthesis paradigms and methods: 2anti-spoofing for speaker verification: 1the voicemos challenge: 1single-channel and multi-channel speech enhancement: 1speech coding and restoration: 1spoofing-aware automatic speaker verification (sasv): 1intelligibility-enhancing speech modification: 1single-channel speech enhancement: 1emotion modeling and analysis: 1neural techniques for voice conversion and waveform generation: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1expressive speech synthesis: 1voice conversion and speech synthesis: 1prosody modeling and generation: 1speaker verification: 1
IEEE keywordsspeech synthesis: 20speaker recognition: 11vocoders: 10speech recognition: 8text to speech: 7anti spoofing: 6training data: 5privacy: 5voice conversion: 5countermeasure: 5presentation attack detection: 5data privacy: 4pipelines: 4speech intelligibility: 4neural network: 4task analysis: 3logical access: 3neural vocoder: 3speaker anonymization: 3music: 3filtering theory: 3data models: 2protocols: 2self supervised learning: 2information filtering: 2asvspoof: 2speech enhancement: 2tacotron: 2automatic speaker verification: 2musical instruments: 2natural language processing: 2mos prediction: 2speaker verification: 2variational auto encoder: 2hidden markov models: 2speech coding: 2security of data: 2speaker adaptation: 2fourier transforms: 2autoregressive processes: 2multilingual: 1self supervised representations: 1decoding: 1zero shot: 1spectrogram: 1low resource: 1pseudonymisation: 1voice privacy: 1anonymisation: 1attack model: 1recording: 1degradation: 1deepfake detection: 1signal processing algorithms: 1privacy friendly data: 1language robust orthogonal householder neural network: 1codecs: 1deepfakes: 1spoofing: 1distributed databases: 1countermeasures: 1communication networks: 1selection based anonymizer: 1measurement: 1information integrity: 1synthetic aperture sonar: 1orthogonal householder neural network anonymizer: 1weighted additive angular softmax: 1internet: 1deepfake: 1databases: 1spoof localization: 1partialspoof: 1splicing: 1forgery: 1listening enhancement: 1oral communication: 1noise reduction: 1noise measurement: 1full end speech enhancement: 1intelligibility: 1transforms: 1privacy preservation: 1sex neutral voice: 1attribute privacy: 1multiple signal classification: 1computational modeling: 1software: 1transformer: 1text to speech synthesis: 1music audio synthesis: 1analytical models: 1buildings: 1spoof countermeasures: 1security: 1reinforcement learning: 1musical instrument embeddings: 1gaussian processes: 1linkability: 1speech naturalness assessment: 1mean opinion score: 1speech quality assessment: 1hearing: 1efficiency: 1pruning: 1vocoder: 1computer crime: 1estimation theory: 1resnet: 1attention: 1tdnn: 1feedforward neural nets: 1deep learning (artificial intelligence): 1time frequency analysis: 1generative adversarial networks: 1multi metric optimization: 1reverberation: 1speech analysis: 1voice conversion evaluation: 1voice conversion challenges: 1speaker characterization: 1vocoding: 1entertainment: 1listening test: 1rakugo: 1vector quantisation: 1representation learning: 1phone recognition: 1image coding: 1disentanglement: 1speaker diarization: 1duration modeling: 1vector quantization: 1automatic speaker verification (asv): 1detect ion cost function: 1spoofing counter measures: 1backpropagation: 1voice cloning: 1short time fourier transform: 1convolution: 1waveform model: 1recurrent neural nets: 1fundamental frequency: 1speaker embeddings: 1transfer learning: 1search problems: 1probability: 1sequences: 1sampling methods: 1sequence to sequence model: 1stochastic processes: 1neural waveform synthesizer: 1fine tuning: 1audio signal processing: 1zero shot adaptation: 1musical instrument sounds synthesis: 1cepstral analysis: 1complex valued representation: 1boltzmann machines: 1restricted boltzmann machine: 1signal classification: 1neural vocoding: 1gan: 1inference mechanisms: 1glottal excitation model: 1replay attacks: 1spoofing attack: 1vocal effort: 1style conversion: 1pulse model in log domain vocoder: 1cyclegan: 1lom bard speech: 1spectral analysis: 1wavenet: 
1neural net architecture: 1neural waveform modeling: 1maximum likelihood estimation: 1waveform analysis: 1gaussian distribution: 1waveform generators: 1waveform modeling: 1gradient methods: 1text analysis: 1
Most publications (all venues) at: 2020: 32, 2019: 32, 2022: 31, 2018: 29, 2016: 29

Affiliations
National Institute of Informatics, Tokyo, Japan
University of Edinburgh, Scotland, UK (former)

Recent publications

TASLP2024 Cheng Gong, Xin Wang 0037, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang 0001, Korin Richmond, Junichi Yamagishi
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations.

TASLP2024 Michele Panariello, Natalia A. Tomashenko, Xin Wang 0037, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas W. D. Evans, Emmanuel Vincent 0001, Junichi Yamagishi
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation.

ICASSP2024 Xin Wang 0037, Junichi Yamagishi
Can Large-Scale Vocoded Spoofed Data Improve Speech Spoofing Countermeasure with a Self-Supervised Front End?

ICASSP2024 Wanying Ge, Xin Wang 0037, Junichi Yamagishi, Massimiliano Todisco, Nicholas W. D. Evans, 
Spoofing Attack Augmentation: Can Differently-Trained Attack Models Improve Generalisation?

ICASSP2024 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Nicholas W. D. Evans, Massimiliano Todisco, Jean-François Bonastre, Mickael Rouvier, 
Synvox2: Towards A Privacy-Friendly Voxceleb2 Dataset.

TASLP2023 Xuechen Liu, Xin Wang 0037, Md. Sahidullah, Jose Patino 0001, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas W. D. Evans, Andreas Nautsch, Kong Aik Lee, 
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.

TASLP2023 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Natalia A. Tomashenko, 
Speaker Anonymization Using Orthogonal Householder Neural Network.

TASLP2023 Lin Zhang, Xin Wang 0037, Erica Cooper, Nicholas W. D. Evans, Junichi Yamagishi
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance.

ICASSP2023 Haoyu Li, Yun Liu, Junichi Yamagishi
Joint Noise Reduction and Listening Enhancement for Full-End Speech Enhancement.

ICASSP2023 Paul-Gauthier Noé, Xiaoxiao Miao, Xin Wang 0037, Junichi Yamagishi, Jean-François Bonastre, Driss Matrouf, 
Hiding Speaker's Sex in Speech Using Zero-Evidence Speaker Representation in an Analysis/Synthesis Pipeline.

ICASSP2023 Xuan Shi, Erica Cooper, Xin Wang 0037, Junichi Yamagishi, Shrikanth Narayanan, 
Can Knowledge of End-to-End Text-to-Speech Models Improve Neural Midi-to-Audio Synthesis Systems?

ICASSP2023 Xin Wang 0037, Junichi Yamagishi
Spoofed Training Data for Speech Spoofing Countermeasure Can Be Efficiently Created Using Neural Vocoders.

Interspeech2023 Erica Cooper, Junichi Yamagishi
Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech.

Interspeech2023 Hieu-Thi Luong, Junichi Yamagishi
Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme.

Interspeech2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang 0037, Xuechen Liu, Md. Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas W. D. Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung, 
Towards Single Integrated Spoofing-aware Speaker Verification Embeddings.

Interspeech2023 Chang Zeng, Xin Wang 0037, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi
Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms.

Interspeech2023 Lin Zhang, Xin Wang 0037, Erica Cooper, Nicholas W. D. Evans, Junichi Yamagishi
Range-Based Equal Error Rate for Spoof Localization.

TASLP2022 Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi
Optimizing Tandem Speaker Verification and Anti-Spoofing Systems.

TASLP2022 Xuan Shi, Erica Cooper, Junichi Yamagishi
Use of Speaker Recognition Approaches for Learning and Evaluating Embedding Representations of Musical Instrument Sounds.

TASLP2022 Brij Mohan Lal Srivastava, Mohamed Maouche, Md. Sahidullah, Emmanuel Vincent 0001, Aurélien Bellet, Marc Tommasi, Natalia A. Tomashenko, Xin Wang 0037, Junichi Yamagishi
Privacy and Utility of X-Vector Based Speaker Anonymization.

#22  | Marc Delcroix | DBLP Google Scholar  
By venue: ICASSP: 34, Interspeech: 33, TASLP: 4
By year: 2024: 7, 2023: 12, 2022: 12, 2021: 14, 2020: 13, 2019: 9, 2018: 4
ISCA sessionssource separation: 3adjusting to speaker, accent, and domain: 2analysis of neural speech representations: 1multi-talker methods in speech processing: 1speech coding: 1spoken language understanding, summarization, and information retrieval: 1speech recognition: 1speech coding and enhancement: 1dereverberation, noise reduction, and speaker extraction: 1speech enhancement and intelligibility: 1speaker embedding and diarization: 1search/decoding algorithms for asr: 1novel models and training methods for asr: 1single-channel speech enhancement: 1speaker diarization: 1streaming for asr/rnn transducers: 1source separation, dereverberation and echo cancellation: 1speech localization, enhancement, and quality assessment: 1target speaker detection, localization and separation: 1monaural source separation: 1asr neural network architectures and training: 1diarization: 1targeted source separation: 1lm adaptation, lexical units and punctuation: 1asr for noisy and far-field speech: 1asr neural network architectures: 1speech and audio source separation and scene analysis: 1neural networks for language modeling: 1distant asr: 1end-to-end speech recognition: 1
IEEE keywordsspeech recognition: 18speech enhancement: 13source separation: 10speaker recognition: 8automatic speech recognition: 5natural language processing: 5neural network: 5self supervised learning: 4reverberation: 4single channel speech enhancement: 3transformers: 3adaptation models: 3target speech extraction: 3recording: 3recurrent neural nets: 3blind source separation: 3array signal processing: 3degradation: 2noise robust speech recognition: 2processing distortion: 2analytical models: 2speech summarization: 2encoding: 2data models: 2speech translation: 2speech synthesis: 2joint training: 2bayes methods: 2hidden markov models: 2speaker diarization: 2artificial neural networks: 2continuous speech separation: 2permutation invariant training: 2particle separators: 2convolution: 2dynamic programming: 2computational modeling: 2computational efficiency: 2memory management: 2task analysis: 2training data: 2error analysis: 2meeting recognition: 2end to end speech recognition: 2sensor fusion: 2text analysis: 2diarization: 2signal to distortion ratio: 2speech separation: 2speech extraction: 2convolutional neural nets: 2online processing: 2dynamic stream weights: 2audio signal processing: 2time domain network: 2source counting: 2time domain analysis: 2backpropagation: 2nonlinear distortion: 1noise measurement: 1interference: 1speaker representation: 1refining: 1probing task: 1speech representation: 1linguistics: 1layer wise similarity analysis: 1long form asr: 1complexity theory: 1speaker embeddings: 1noise robustness: 1zero shot tts: 1self supervised learning model: 1acoustic distortion: 1interpolation: 1variational bayes: 1discriminative training: 1standards: 1vbx: 1tuning: 1clustering: 1feature aggregation: 1pre trained models: 1benchmark testing: 1telephone sets: 1data mining: 1few shot adaptation: 1sound event: 1soundbeam: 1target sound extraction: 1oral communication: 1graph pit: 1video on demand: 1end to end modeling: 1memory efficient encoders: 1dual speech/text encoder: 1long spoken document: 1end to end speech summarization: 1measurement: 1synthetic data augmentation: 1how2 dataset: 1multi modal data augmentation: 1software: 1tensors: 1word error rate: 1levenshtein distance: 1iterative methods: 1forward language model: 1iterative decoding: 1partial sentence aware backward language model: 1iterative shallow fusion: 1symbols: 1shallow fusion: 1language translation: 1attention fusion: 1rover: 1pattern clustering: 1infinite gmm: 1mixture models: 1gaussian processes: 1attention based decoder: 1recurrent neural network transducer: 1end to end: 1switches: 1loss function: 1large ensemble: 1complementary neural language models: 1iterative lattice generation: 1lattice rescoring: 1context carry over: 1lattices: 1input switching: 1deep learning (artificial intelligence): 1speakerbeam: 1acoustic beamforming: 1complex backpropagation: 1transfer functions: 1multi channel source separation: 1speaker activity: 1clustering algorithms: 1databases: 1signal processing algorithms: 1long recording speech separation: 1transforms: 1dual path modeling: 1end to end (e2e) speech recognition: 1estimation theory: 1bidirectional long short term memory (blstm): 1imbalanced datasets: 1confidence estimation: 1auxiliary features: 1audiovisual speaker localization: 1audio visual systems: 1image fusion: 1data fusion: 1video signal processing: 1beamforming: 1maximum likelihood estimation: 1dereverberation: 1optimisation: 1filtering theory: 1microphone array: 1microphone arrays: 1multi task loss: 1spatial features: 
1separation: 1smart devices: 1robustness: 1signal denoising: 1robust asr: 1and multi head self attention: 1multi task learning: 1auxiliary information: 1computational complexity: 1multi speaker speech recognition: 1time domain: 1frequency domain analysis: 1audiovisual speaker tracking: 1kalman filters: 1tracking: 1backprop kalman filter: 1speaker embedding: 1adversarial learning: 1deep neural networks: 1phoneme invariant feature: 1text independent speaker recognition: 1signal classification: 1adaptation: 1auxiliary feature: 1domain adaptation: 1topic model: 1recurrent neural network language model: 1sequence summary network: 1semi supervised learning: 1decoding: 1encoder decoder: 1autoencoder: 1meeting diarization: 1speaker attention: 1speech separation/extraction: 1
Most publications (all venues) at: 2017: 24, 2024: 22, 2021: 22, 2023: 20, 2020: 17


Recent publications

TASLP2024 Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance.

ICASSP2024 Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima, 
What Do Self-Supervised Speech and Speaker Models Learn? New Findings from a Cross Model Layer-Wise Analysis.

ICASSP2024 William Chen, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe 0001, 
Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing.

ICASSP2024 Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima, 
Noise-Robust Zero-Shot Text-to-Speech Synthesis Conditioned on Self-Supervised Speech-Representation Model with Adapters.

ICASSP2024 Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts?

ICASSP2024 Dominik Klement, Mireia Díez, Federico Landini, Lukás Burget, Anna Silnova, Marc Delcroix, Naohiro Tawara, 
Discriminative Training of VBx Diarization.

ICASSP2024 Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocký, 
Target Speech Extraction with Pre-Trained Self-Supervised Learning Models.

TASLP2023 Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki, 
SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning.

TASLP2023 Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach, 
Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria.

ICASSP2023 Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Roshan S. Sharma, Kohei Matsuura, Shinji Watanabe 0001, 
Speech Summarization of Long Spoken Document: Improving Memory Efficiency of Speech/Text Encoders.

ICASSP2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura, 
Leveraging Large Text Corpora For End-To-End Speech Summarization.

ICASSP2023 Thilo von Neumann, Christoph Böddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach, 
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems.

ICASSP2023 Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix
Iterative Shallow Fusion of Backward Language Model for End-To-End Speech Recognition.

Interspeech2023 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma, 
SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

Interspeech2023 Marc Delcroix, Naohiro Tawara, Mireia Díez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukás Burget, Shoko Araki, 
Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization.

Interspeech2023 Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani, 
Target Speech Extraction with Conditional Diffusion Model.

Interspeech2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization.

Interspeech2023 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami, 
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data.

Interspeech2023 Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo, 
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss.

ICASSP2022 Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe 0001, 
Integrating Multiple ASR Systems into NLP Backend with Attention Fusion.

#23  | Prasanta Kumar Ghosh | DBLP Google Scholar  
By venue: Interspeech: 46, ICASSP: 21, TASLP: 2, SpeechComm: 1
By year: 2024: 1, 2023: 12, 2022: 8, 2021: 10, 2020: 15, 2019: 13, 2018: 11
ISCA sessionsspeech signal characterization: 4show and tell: 3speech and voice disorders: 3human speech production: 3bioacoustics and articulation: 3articulatory information, modeling and inversion: 3speech, voice, and hearing disorders: 2speech production: 2speech signal analysis and representation: 2source and supra-segmentals: 2articulation: 1speech signal analysis: 1phonetics, phonology, and prosody: 1dysarthric speech assessment: 1analysis of speech and audio signals: 1low-resource asr development: 1speech production, perception and multimodality: 1assessment of pathological speech and language: 1cross/multi-lingual and code-switched asr: 1the first dicova challenge: 1diverse modes of speech acquisition and processing: 1speech in health: 1speaker recognition: 1applications in language learning and healthcare: 1deep enhancement: 1source separation and spatial analysis: 1voice conversion: 1speech and singing production: 1show and tell 6: 1
IEEE keywordsspeech recognition: 7amyotrophic lateral sclerosis: 5acoustic to articulatory inversion: 5speaker recognition: 5diseases: 4signal classification: 4cepstral analysis: 4whispered speech: 3convolutional neural nets: 3blstm: 3vowels: 2fricatives: 2dysarthria: 2production: 2tongue: 2data models: 2speech synthesis: 2transformers: 2convolution: 2natural language processing: 2parkinson’s disease: 2correlation methods: 2filtering theory: 2electromagnetic articulograph: 2audio signal processing: 2cnn: 2spectral analysis: 1severity: 1sociology: 1acoustic measurements: 1constriction: 1voicing: 1statistics: 1static: 1source filter: 1vowel: 1dynamic: 1shape: 1information filters: 1text to speech (tts): 1model compression: 1data constrained multi speaker: 1multi lingual tts: 1end to end: 1sequence to sequence learning: 1measurement: 1atmospheric modeling: 1speech production: 1real time magnetic resonance imaging: 1streaming media: 1magnetic resonance imaging: 1self supervised learning: 1articulatory to acoustic forward mapping: 1articulatory speech synthesis: 1recording device: 1dual attention pooling network: 1real time magnetic resonance imaging video: 1biomedical mri: 1air tissue boundary segmentation: 13 dimensional convolutional neural network: 1tongue base: 1velum: 1medical image processing: 1image segmentation: 1image registration: 1mel frequency cepstral coefficients: 1model complexity: 1noise: 1pitch: 1transfer learning: 1medical computing: 1x vectors: 1pitch drop: 1source filter interaction: 1natural languages: 1speaking rate: 1support vector machines: 1medical signal processing: 1recurrent neural nets: 1cnn lstm: 1adaptation: 1lf mmi: 1hidden markov models: 1maximum likelihood estimation: 1pseudo likelihood correction technique: 1acoustic signal detection: 1attention network: 1swallow sound signal: 1feature selection: 1biology computing: 1bioacoustics: 1cervical auscultation: 1acoustic analysis: 1gesture recognition: 1head gestures: 1euler angles: 1lstm: 1sustained phonations: 1asthma: 1classification: 1opensmile: 1latent variable model: 1expectation maximisation algorithm: 1dirichlet distribution: 1source separation: 1nmf: 1exponential family distributions: 1time varying: 1non negative: 1gif: 1gibbs sampling: 1probability: 1glottal inverse filtering: 1probabilistic weighted linear prediction: 1formants: 1amplitude modulation: 1speaker verification: 1articulatory data: 1automatic speech recognition: 1signal representation: 1neutral speech: 1
Most publications (all venues) at: 2019: 25, 2018: 24, 2021: 21, 2023: 20, 2020: 19

Affiliations
Indian Institute of Science, Department of Electrical Engineering, Bangalore, India

Recent publications

ICASSP2024 Chowdam Venkata Thirumala Kumar, Tanuka Bhattacharjee, Seena Vengalil, Saraswati Nashi, Madassu Keerthipriya, Yamini Belur, Atchayaram Nalini, Prasanta Kumar Ghosh
Spectral Analysis of Vowels and Fricatives at Varied Levels of Dysarthria Severity for Amyotrophic Lateral Sclerosis.

ICASSP2023 Tanuka Bhattacharjee, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh
Exploring the Role of Fricatives in Classifying Healthy Subjects and Patients with Amyotrophic Lateral Sclerosis and Parkinson's Disease.

ICASSP2023 Tanuka Bhattacharjee, Chowdam Venkata Thirumala Kumar, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh
Static and Dynamic Source and Filter Cues for Classification of Amyotrophic Lateral Sclerosis Patients and Healthy Subjects.

ICASSP2023 Abhayjeet Singh, Amala Nagireddi, Deekshitha G, Jesuraja Bandekar, Roopa R., Sandhya Badiger, Sathvik Udupa, Prasanta Kumar Ghosh, Hema A. Murthy, Heiga Zen, Pranaw Kumar, Kamal Kant, Amol Bole, Bira Chandra Singh, Keiichi Tokuda, Mark Hasegawa-Johnson, Philipp Olbrich, 
Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech.

ICASSP2023 Sathvik Udupa, Prasanta Kumar Ghosh
Real-Time MRI Video Synthesis from Time Aligned Phonemes with Sequence-to-Sequence Networks.

ICASSP2023 Sathvik Udupa, C. Siddarth, Prasanta Kumar Ghosh
Improved Acoustic-to-Articulatory Inversion Using Representations from Pretrained Self-Supervised Learning Models.

Interspeech2023 Jesuraja Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh
Exploring a classification approach using quantised articulatory movements for acoustic to articulatory inversion.

Interspeech2023 Varun Belagali, M. V. Achuth Rao, Prasanta Kumar Ghosh
Weakly supervised glottis segmentation in high-speed videoendoscopy using bounding box labels.

Interspeech2023 Tanuka Bhattacharjee, Anjali Jayakumar, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh
Transfer Learning to Aid Dysarthria Severity Classification for Patients with Amyotrophic Lateral Sclerosis.

Interspeech2023 Siddarth Chandrasekar, Arvind Ramesh, Tilak Purohit, Prasanta Kumar Ghosh
A Study on the Importance of Formant Transitions for Stop-Consonant Classification in VCV Sequence.

Interspeech2023 Shelly Jain, Priyanshi Pal, Anil Kumar Vuppala, Prasanta Kumar Ghosh, Chiranjeevi Yarra, 
An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations.

Interspeech2023 Chowdam Venkata Thirumala Kumar, Tanuka Bhattacharjee, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Prasanta Kumar Ghosh
Classification of Multi-class Vowels and Fricatives From Patients Having Amyotrophic Lateral Sclerosis with Varied Levels of Dysarthria Severity.

Interspeech2023 Mohammad Shaique Solanki, Ashutosh Bharadwaj, Jeevan Kylash, Prasanta Kumar Ghosh
Do Vocal Breath Sounds Encode Gender Cues for Automatic Gender Classification?

SpeechComm2022 Chiranjeevi Yarra, Prasanta Kumar Ghosh
Automatic syllable stress detection under non-parallel label and data condition.

ICASSP2022 Aravind Illa, Aanish Nair, Prasanta Kumar Ghosh
The impact of cross language on acoustic-to-articulatory inversion and its influence on articulatory speech synthesis.

ICASSP2022 Abinay Reddy Naini, Bhavuk Singhal, Prasanta Kumar Ghosh
Dual Attention Pooling Network for Recording Device Classification Using Neutral and Whispered Speech.

ICASSP2022 Anwesha Roy, Varun Belagali, Prasanta Kumar Ghosh
An Error Correction Scheme for Improved Air-Tissue Boundary in Real-Time MRI Video for Speech Production.

Interspeech2022 Anish Bhanushali, Grant Bridgman, Deekshitha G, Prasanta Kumar Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Srinivasan Umesh, Sathvik Udupa, Lodagala V. S. V. Durga Prasad, 
Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi.

Interspeech2022 Anwesha Roy, Varun Belagali, Prasanta Kumar Ghosh
Air tissue boundary segmentation using regional loss in real-time Magnetic Resonance Imaging video for speech production.

Interspeech2022 C. Siddarth, Sathvik Udupa, Prasanta Kumar Ghosh
Watch Me Speak: 2D Visualization of Human Mouth during Speech.

#24  | Chin-Hui Lee 0001 | DBLP Google Scholar  
By venue: ICASSP: 30, Interspeech: 26, TASLP: 11, SpeechComm: 3
By year: 2024: 7, 2023: 14, 2022: 7, 2021: 11, 2020: 17, 2019: 12, 2018: 2
ISCA sessionsacoustic scene classification: 2multi-channel speech enhancement: 2speech enhancement: 2speech enhancement and denoising: 1speech coding and enhancement: 1multi-talker methods in speech processing: 1spoken dialog systems and conversational analysis: 1speech recognition: 1spoken language processing: 1speaker embedding and diarization: 1acoustic scene analysis: 1spoken dialogue systems and multimodality: 1multimodal systems: 1speaker diarization: 1privacy-preserving machine learning for audio & speech processing: 1single-channel speech enhancement: 1voice activity detection and keyword spotting: 1speech emotion recognition: 1speech coding and evaluation: 1speech and audio classification: 1far-field speech recognition: 1deep enhancement: 1the first dihard speech diarization challenge: 1
IEEE keywordsspeech enhancement: 22speech recognition: 20speaker diarization: 9visualization: 8task analysis: 6deep neural network: 6noise measurement: 5data models: 5regression analysis: 5hidden markov models: 4misp challenge: 4recording: 4adaptation models: 4voice activity detection: 4robust speech recognition: 3audio visual: 3noise: 3error analysis: 3speech separation: 3reverberation: 3progressive learning: 3speaker recognition: 3teacher student learning: 3signal to noise ratio: 3optimization: 2iterative methods: 2robustness: 2estimation: 2emotion recognition: 2data mining: 2benchmark testing: 2multimodality: 2memory aware speaker embedding: 2attention network: 2telephone sets: 2data augmentation: 2automatic speech recognition: 2post processing: 2image analysis: 2acoustic scene classification: 2convolutional neural networks: 2improved minima controlled recursive averaging: 2recurrent neural nets: 2speech intelligibility: 2fully convolutional neural network: 2domain adaptation: 2generalized gaussian distribution: 2mean square error methods: 2maximum likelihood estimation: 2least mean squares methods: 2gaussian distribution: 2ideal ratio mask: 2convolutional neural nets: 2transfer learning: 2task generic: 1measurement: 1optimization objective: 1distortion measurement: 1diffusion model: 1mathematical models: 1score based: 1speech denoising: 1interpolating diffusion model: 1interpolation: 1topology: 1multi channel speech enhancement: 1chime 7 challenge: 1iterative mask estimation: 1redundancy: 1feature fusion: 1multi modal emotion recognition: 1entropy based fusion: 1structured pruning: 1network architecture optimization: 1target speaker extraction: 1real world scenarios: 1oral communication: 1memory management: 1chime challenge: 1graphics processing units: 1sequence to sequence architecture: 1codes: 1degradation: 1knowledge based systems: 1boosting: 1multilingual automatic speech recognition: 1articulatory speech attributes: 1adaptive refinement: 1dictionary learning: 1adaptive systems: 1dynamic mask: 1data quality control: 1time domain analysis: 1synchronization: 1dcase 2022: 1testing: 1sound event localization and detection: 1model architecture: 1realistic data: 1location awareness: 1tv: 1quality assessment: 1convolution: 1kernel: 1encoding: 1visual embedding reconstruction: 1acoustic distortion: 1learning systems: 1public domain software: 1wake word spotting: 1audio visual systems: 1microphone array: 1decoding: 1speech coding: 1ts vad: 1m2met: 1dihard iii challenge: 1filtering: 1iteration: 1signal processing algorithms: 1robust automatic speech recognition: 1acoustic model: 1neural net architecture: 1probability: 1cross entropy: 1entropy: 1optimisation: 1deep neural network (dnn): 1local response normalization: 1multi level and adaptive fusion: 1face recognition: 1factorized bilinear pooling: 1multimodal emotion recognition: 1analytical models: 1class activation mapping: 1adaptive noise and speech estimation: 1computer architecture: 1additives: 1noise reduction: 1computational modeling: 1convolutional layers: 1sehae: 1hierarchical autoencoder: 1data privacy: 1acoustic modeling: 1and federated learning: 1quantum machine learning: 1microphone arrays: 1snr progressive learning: 1neural network: 1dense structure: 1acoustic segment model: 1semantics: 1attention mechanism: 1label embedding: 1knowledge representation: 1backpropagation: 1maximum likelihood: 1shape factors update: 1multi objective learning: 1tensors: 1tensor train network: 1tensor to vector regression: 1speech activity 
detection: 1snr estimation: 1dihard data: 1geometric constraint: 1geometry: 1linear programming: 1lstm: 12d to 2d mapping: 1fuzzy neural nets: 1performance evaluation: 1source separation: 1child speech extraction: 1realistic conditions: 1measures: 1signal classification: 1noise robustness: 1adversarial robustness: 1gradient methods: 1speech recognition safety: 1adversarial examples: 1prediction error modeling: 1gaussian processes: 1pattern classification: 1non native tone modeling and mispronunciation detection: 1computer assisted pronunciation training (capt): 1natural language processing: 1computer assisted language learning (call): 1function approximation: 1expressive power: 1universal approximation: 1vector to vector regression: 1improved speech presence probability: 1error statistics: 1deep learning based speech enhancement: 1noise robust speech recognition: 1cross modal training: 1environmental aware training: 1databases: 1student teacher training: 1audio visual speech recognition: 1multiple speakers: 1interference: 1speaker dependent speech separation: 1chime 5 challenge: 1arrays: 1acoustic noise: 1statistical speech enhancement: 1signal denoising: 1gain function: 1
Most publications (all venues) at: 2023: 23, 2020: 23, 2017: 20, 2016: 20, 2014: 20

Affiliations
Georgia Institute of Technology, School of Electrical and Computer Engineering, USA
Bell Laboratories, Dialogue Systems Research Department, Murray Hill, New Jersey, USA (1981-2001)

Recent publications

TASLP2024 Hang Chen, Qing Wang 0008, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 0001
Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition.

TASLP2024 Zilu Guo, Qing Wang 0008, Jun Du, Jia Pan, Qing-Feng Liu, Chin-Hui Lee 0001
A Variance-Preserving Interpolation Approach for Diffusion Models With Applications to Single Channel Speech Enhancement and Recognition.

ICASSP2024 Feng Ma, Yanhui Tu, Maokui He, Ruoyu Wang 0029, Shutong Niu, Lei Sun 0010, Zhongfu Ye, Jun Du, Jia Pan, Chin-Hui Lee 0001
A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition.

ICASSP2024 Haotian Wang, Jun Du, Yusheng Dai, Chin-Hui Lee 0001, Yuling Ren, Yu Liu, 
Improving Multi-Modal Emotion Recognition Using Entropy-Based Fusion and Pruning-Based Network Architecture Optimization.

ICASSP2024 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang 0029, Hongbo Lan, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao, 
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.

ICASSP2024 Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang 0029, Yanyan Yue, Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee 0001
Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture.

ICASSP2024 Hao Yen, Sabato Marco Siniscalchi, Chin-Hui Lee 0001
Boosting End-to-End Multilingual Phoneme Recognition Through Exploiting Universal Speech Attributes Constraints.

SpeechComm2023 Shi Cheng, Jun Du, Shutong Niu, Alejandrina Cristià, Xin Wang 0037, Qing Wang 0008, Chin-Hui Lee 0001
Using iterative adaptation and dynamic mask for child speech extraction under real-world multilingual conditions.

SpeechComm2023 Li Chai 0002, Hang Chen, Jun Du, Qing-Feng Liu, Chin-Hui Lee 0001
Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech.

TASLP2023 Mao-Kui He, Jun Du, Qing-Feng Liu, Chin-Hui Lee 0001
ANSD-MA-MSE: Adaptive Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding.

TASLP2023 Shutong Niu, Jun Du, Lei Sun 0010, Yu Hu 0003, Chin-Hui Lee 0001
QDM-SSD: Quality-Aware Dynamic Masking for Separation-Based Speaker Diarization.

ICASSP2023 Hang Chen, Shilong Wu, Yusheng Dai, Zhe Wang, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
Summary on the Multimodal Information Based Speech Processing (MISP) 2022 Challenge.

ICASSP2023 Shutong Niu, Jun Du, Qing Wang 0008, Li Chai 0002, Huaxin Wu, Zhaoxu Nian, Lei Sun 0010, Yi Fang, Jia Pan, Chin-Hui Lee 0001
An Experimental Study on Sound Event Localization and Detection Under Realistic Testing Conditions.

ICASSP2023 Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee 0001
A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition.

ICASSP2023 Chenyue Zhang, Hang Chen, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 0001
Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement.

Interspeech2023 Zilu Guo, Jun Du, Chin-Hui Lee 0001, Yu Gao, Wenbin Zhang, 
Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement.

Interspeech2023 Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee 0001
A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models.

Interspeech2023 Shutong Niu, Jun Du, Maokui He, Chin-Hui Lee 0001, Baoxiang Li, Jiakui Li, 
Unsupervised Adaptation with Quality-Aware Masking to Improve Target-Speaker Voice Activity Detection for Speaker Diarization.

Interspeech2023 Haotian Wang, Jun Du, Hengshun Zhou, Chin-Hui Lee 0001, Yuling Ren, Jiangjiang Zhao, 
A Multiple-Teacher Pruning Based Self-Distillation (MT-PSD) Approach to Model Compression for Audio-Visual Wake Word Spotting.

#25  | John H. L. Hansen | DBLP Google Scholar  
By venue: Interspeech: 39, ICASSP: 14, TASLP: 9, SpeechComm: 6
By year: 2024: 7, 2023: 10, 2022: 13, 2021: 9, 2020: 9, 2019: 12, 2018: 8
ISCA sessionsspeech recognition: 2applications in transcription, education and learning: 2dereverberation and echo cancellation: 2speaker recognition challenges and applications: 2integrating speech science and technology for clinical applications: 2speech coding and enhancement: 1spoken language translation, information retrieval, summarization, resources, and evaluation: 1speaker and language identification: 1spoken language processing: 1pathological speech analysis: 1resource-constrained asr: 1speech representation: 1speech enhancement and intelligibility: 1embedding and network architecture for speaker recognition: 1multi-, cross-lingual and other topics in asr: 1asr technologies and systems: 1target speaker detection, localization and separation: 1speech and audio quality assessment: 1language learning: 1the fearless steps challenge phase-02: 1speaker embedding: 1topics in speech and audio signal processing: 1speaker recognition and diarization: 1language learning and databases: 1speech perception in adverse listening conditions: 1speech enhancement: 1speaker and language recognition: 1speech and audio source separation and scene analysis: 1speaker verification: 1speaker verification using neural network methods: 1adjusting to speaker, accent, and domain: 1spoken corpora and annotation: 1speech analysis and representation: 1signal analysis for the natural, biological and social sciences: 1
IEEE keywordsspeaker recognition: 8task analysis: 5speaker verification: 4convolutional neural nets: 4speech enhancement: 3convolution: 3transformers: 3time frequency analysis: 3computational modeling: 3adaptation models: 3deep neural network: 2transformer: 2reverberation: 2transfer learning: 2training data: 2data models: 2speaker embedding: 2switches: 2speech recognition: 2generative adversarial networks: 2audio signal processing: 2neural net architecture: 2calibration: 2overlapping speech detection: 2co channel speech detection: 2speech separation: 2natural language processing: 2domain adaptation: 2deformable convolutional networks: 1monaural dereverberation: 1filtering: 1microphones: 1minimum variance distortionless response: 1deep filtering: 1reflection: 1distortion: 1harmonic analysis: 1noise measurement: 1u net: 1decoding: 1complex valued network: 1frequency transformation block: 1massive naturalistic community resource: 1nasa: 1nasa apollo missions: 1psychology: 1fearless steps: 1fs apollo: 1auditory system: 1real time systems: 1cci mobile: 1situational signal processing: 1"emaging": 1non linguistic: 1tagging: 1cochlear implants: 1sound source localization (ssl): 1wearable and portable devices: 1cochlear implant (ci): 1location awareness: 1signal processing algorithms: 1artificial neural networks: 1blind speech dereverberation: 1cepstral analysis: 1measurement: 1all pass system: 1channel estimation: 1minimum phase: 1costs: 1parameter efficiency: 1adapter: 1pre trained model: 1error analysis: 1graph networks: 1complexity theory: 1data augmentation: 1fearless steps apollo: 1focusing: 1historical archiving: 1speaker diarization: 1continual learning: 1speech re cognition: 1end to end systems: 1domain expansion: 1accented speech: 1model adaptation: 1attention: 1context modeling: 1dct transformation: 1aggregates: 1discrete cosine transforms: 1global context modeling: 1noise robustness: 1energy consumption: 1filterbank learning: 1performance evaluation: 1robustness: 1small footprint: 1keyword spotting: 1filter banks: 1end to end: 1operating systems: 1data mining: 1self attention: 1conformer: 1swin transformer: 1deep neural networks: 1forensics: 1discrepancy loss: 1text analysis: 1multi source domain adaptation: 1domain adversarial training: 1moment matching: 1maximum mean discrepancy: 1disentangled representation learning: 1audio generation: 1guided representation learning: 1and generative adversarial neural network: 1signal representation: 1optimisation: 1lombard effect: 1whisper/vocal effort: 1signal detection: 11 d cnn: 1convolutional neural network: 1speech synthesis: 1cocktail party problem: 1speech modeling: 1simultaneous speaker detection: 1residual learning: 1binary classifier: 1adversarial domain adaptation: 1deep learning (artificial intelligence): 1embedding disentangling: 1phone embedding: 1computer assisted language learning: 1mispronunciation verification: 1siamese networks: 1source counting: 1mixed speech: 1convolutional neural networks: 1voice activity detection: 1peer led team learning: 1speaker clustering: 1audio diarization: 1sincnet: 1speaker representation: 1mixers: 1adversarial training: 1nist sre: 1embedded systems: 1pattern classification: 1semi supervised learning: 1mixture models: 1unsupervised learning: 1arabic dialect identification: 1language identification: 1i vector: 1gaussian processes: 1
Most publications (all venues) at: 2010: 35, 2014: 34, 2015: 32, 2017: 31, 2016: 31


Recent publications

TASLP2024 Vinay Kothapally, John H. L. Hansen
Monaural Speech Dereverberation Using Deformable Convolutional Networks.

TASLP2024 Nursadul Mamun, John H. L. Hansen
Speech Enhancement for Cochlear Implant Recipients Using Deep Complex Convolution Transformer With Frequency Transformation.

ICASSP2024 John H. L. Hansen, Aditya Joglekar, Meena M. Chandra Shekar, Szu-Jui Chen, Xi Liu, 
Fearless Steps Apollo: Team Communications Based Community Resource Development for Science, Technology, Education, and Historical Preservation.

ICASSP2024 Taylor Lawson, John H. L. Hansen
Situational Signal Processing with Ecological Momentary Assessment: Leveraging Environmental Context for Cochlear Implant Users.

ICASSP2024 Xi Liu, Szu-Jui Chen, John H. L. Hansen
Dual-Path Minimum-Phase and All-Pass Decomposition Network for Single Channel Speech Dereverberation.

ICASSP2024 Mufan Sang, John H. L. Hansen
Efficient Adapter Tuning of Pre-Trained Speech Models for Automatic Speaker Verification.

ICASSP2024 Meena M. Chandra Shekar, John H. L. Hansen
Apollo's Unheard Voices: Graph Attention Networks for Speaker Diarization and Clustering for Fearless Steps Apollo Collection.

SpeechComm2023 Midia Yousefi, John H. L. Hansen
Single-channel speech separation using soft-minimum permutation invariant training.

TASLP2023 Shahram Ghorbani, John H. L. Hansen
Domain Expansion for End-to-End Speech Recognition: Applications for Accent/Dialect Speech.

TASLP2023 Wei Xia, John H. L. Hansen
Attention and DCT Based Global Context Modeling for Text-Independent Speaker Recognition.

ICASSP2023 Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen 0001, John H. L. Hansen
Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting.

ICASSP2023 Mufan Sang, Yong Zhao 0008, Gang Liu 0001, John H. L. Hansen, Jian Wu 0027, 
Improving Transformer-Based Networks with Locality for Automatic Speaker Verification.

Interspeech2023 Nursadul Mamun, John H. L. Hansen
CFTNet: Complex-valued Frequency Transformation Network for Speech Enhancement.

Interspeech2023 Meena M. Chandra Shekar, John H. L. Hansen
Speaker Tracking using Graph Attention Networks with Varying Duration Utterances across Multi-Channel Naturalistic Data: Fearless Steps Apollo-11 Audio Corpus.

Interspeech2023 Ram C. M. C. Shekar, Mu Yang, Kevin Hirschi, Stephen D. Looney, Okim Kang, John H. L. Hansen
Assessment of Non-Native Speech Intelligibility using Wav2vec2-based Mispronunciation Detection and Multi-level Goodness of Pronunciation Transformer.

Interspeech2023 Jiamin Xie, John H. L. Hansen
MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition.

Interspeech2023 Mu Yang, Ram C. M. C. Shekar, Okim Kang, John H. L. Hansen
What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model.

SpeechComm2022 Rasa Lileikyte, Dwight Irvin, John H. L. Hansen
Assessing child communication engagement and statistical speech patterns for American English via speech recognition in naturalistic active learning spaces.

TASLP2022 Vinay Kothapally, John H. L. Hansen
SkipConvGAN: Monaural Speech Dereverberation Using Generative Adversarial Networks via Complex Time-Frequency Masking.

TASLP2022 Zhenyu Wang, John H. L. Hansen
Multi-Source Domain Adaptation for Text-Independent Forensic Speaker Recognition.

#26  | Yu Tsao 0001 | DBLP Google Scholar  
By venue: Interspeech: 38, ICASSP: 17, TASLP: 9, ICLR: 2, NeurIPS: 1, ICML: 1
By year: 2024: 6, 2023: 9, 2022: 20, 2021: 9, 2020: 9, 2019: 12, 2018: 3
ISCA sessionsspeech enhancement and intelligibility: 5speech enhancement: 5single-channel speech enhancement: 4speech, voice, and hearing disorders: 2dereverberation, noise reduction, and speaker extraction: 2voice conversion and adaptation: 2speech synthesis: 2neural techniques for voice conversion and waveform generation: 2speech coding and enhancement: 1speech recognition: 1speech production, perception and multimodality: 1the voicemos challenge: 1source separation: 1speech intelligibility prediction for hearing-impaired listeners: 1speech coding and privacy: 1noise reduction and intelligibility: 1intelligibility-enhancing speech modification: 1model training for asr: 1speech and audio classification: 1speech intelligibility and quality: 1audio events and acoustic scenes: 1voice conversion: 1
IEEE keywordsspeech enhancement: 13speech recognition: 6predictive models: 4measurement: 3unsupervised learning: 3pattern classification: 3generative adversarial networks: 2error analysis: 2self supervised learning: 2task analysis: 2ensemble learning: 2perturbation methods: 2speaker verification: 2adaptation models: 2spoken language understanding: 2robustness: 2convolutional neural nets: 2deep learning (artificial intelligence): 2signal denoising: 2generative model: 2deep neural network: 2audio signal processing: 2decoding: 2natural language processing: 2stargan: 1face masked speech enhancement: 1human in the loop: 1noise measurement: 1generators: 1noise: 1recording: 1sinkhorn attention: 1cross modality alignment: 1transformers: 1automatic speech recognition (asr): 1pretrained language model (plm): 1linguistics: 1linear programming: 1evaluation: 1audio visual learning: 1representation learning: 1benchmark testing: 1soft sensors: 1visualization: 1scalability: 1rendering (computer graphics): 1purification: 1adversarial sample detection: 1adversarial attack: 1user experience: 1multiprotocol label switching: 13quest: 1knowledge transfer: 1sdi: 1speech quality prediction: 1multitasking: 1speech intelligibility prediction: 1stoi: 1pesq: 1quality assessment: 1robust automatic speech recognition: 1hidden markov models: 1articulatory attribute: 1broad phonetic classes: 1phonetics: 1end to end: 1non intrusive speech assessment models: 1acoustic distortion: 1psychoacoustic models: 1multi objective learning: 1codes: 1computational modeling: 1spoken question answering: 1speech translation: 1speech coding: 1question answering (information retrieval): 1tokenization: 1mos: 1auditory system: 1perturbation: 1speech quality models: 1adversarial examples: 1data privacy: 1low quality data: 1data compression: 1audio visual systems: 1recurrent neural nets: 1asynchronous multimodal learning: 1audio visual: 1floating point arithmetic: 1deep neural network model compression: 1inference acceleration: 1adders: 1speech dereverberation: 1floating point integer arithmetic circuit: 1unsupervised speech enhancement: 1metricgan: 1supervised learning: 1reverberation: 1speech recovery: 1intermittent systems: 1internet of things: 1performance evaluation: 1data models: 1speech signal processing: 1energy harvesting: 1interpolation: 1generative adversarial network: 1unsupervised asr: 1training data: 1signal processing algorithms: 1diffusion probabilistic model: 1sensor fusion: 1non invasive: 1multimodal: 1medical signal processing: 1electromyography: 1biometrics (access control): 1security of data: 1partially fake audio detection: 1anti spoofing: 1audio deep synthesis detection challenge: 1speech synthesis: 1quantum computing: 1text analysis: 1quantum machine learning: 1text classification: 1temporal convolution: 1and heterogeneous computing: 1bayes methods: 1joint bayesian model: 1affine transforms: 1discriminative model: 1speaker recognition: 1statistical distributions: 1unsupervised domain adaptation: 1optimal transport: 1spoken language identification: 1maml: 1meta learning: 1source separation: 1speech separation: 1anil: 1support vector machines: 1phonotactic language recognition: 1subspace based learning: 1matrix decomposition: 1subspace based representation: 1gaussian processes: 1multichannel speech enhancement: 1distributed microphones: 1fully convolutional network (fcn): 1microphones: 1phase estimation: 1inner ear microphones: 1raw waveform mapping: 1generalizability: 1dynamically sized decision tree: 1decision trees: 
1deep neural networks: 1regression analysis: 1deep denoising autoencoder: 1signal classification: 1automatic speech recognition: 1character error rate: 1mean square error methods: 1reinforcement learning: 1
Most publications (all venues) at: 2022: 46, 2023: 42, 2021: 38, 2019: 36, 2017: 31

Affiliations
Academia Sinica, Research Center for Information Technology Innovation, Taipei, Taiwan

Recent publications

TASLP2024 Syu-Siang Wang, Jia-Yang Chen, Bo-Ren Bai, Shih-Hau Fang, Yu Tsao 0001
Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics.

ICASSP2024 Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai, 
Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-Based ASR.

ICASSP2024 Yuan Tseng, Layne Berry, Yiting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang 0001, Chun-Mao Lai, Shang-Wen Li 0001, David Harwath, Yu Tsao 0001, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee, 
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models.

ICASSP2024 Haibin Wu, Heng-Cheng Kuo, Yu Tsao 0001, Hung-Yi Lee, 
Scalable Ensemble-Based Detection Method Against Adversarial Attacks For Speaker Verification.

ICASSP2024 Ryandhimas E. Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao 0001
Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model.

ICLR2024 Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao 0001, Yu-Chiang Frank Wang, 
Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech.

TASLP2023 Yen-Ju Lu, Chia-Yu Chang, Cheng Yu, Ching-Feng Liu, Jeih-weih Hung, Shinji Watanabe 0001, Yu Tsao 0001
Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information.

TASLP2023 Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen 0011, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao 0001
Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features.

ICASSP2023 Chan-Jan Hsu, Ho-Lam Chung, Hung-Yi Lee, Yu Tsao 0001
T5lephone: Bridging Speech and Text Self-Supervised Models for Spoken Language Understanding Via Phoneme Level T5.

ICASSP2023 Hsin-Yi Lin, Huan-Hsin Tseng, Yu Tsao 0001
On the Robustness of Non-Intrusive Speech Quality Model by Adversarial Examples.

Interspeech2023 Hsin-Hao Chen 0006, Yung-Lun Chien, Ming-Chi Yen, Shu-Wei Tsai, Tai-Shih Chi, Hsin-Min Wang, Yu Tsao 0001
Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features.

Interspeech2023 Li-Wei Chen, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao 0001, Hsin-Min Wang, 
A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech.

Interspeech2023 Yung-Lun Chien, Hsin-Hao Chen 0006, Ming-Chi Yen, Shu-Wei Tsai, Hsin-Min Wang, Yu Tsao 0001, Tai-Shih Chi, 
Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion.

Interspeech2023 Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Yu Tsao 0001
Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition.

ICLR2023 Chi-Chang Lee, Yu Tsao 0001, Hsin-Min Wang, Chu-Song Chen, 
D4AM: A General Denoising Framework for Downstream Acoustic Models.

TASLP2022 Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao 0001
Improved Lite Audio-Visual Speech Enhancement.

TASLP2022 Yu-Chen Lin, Cheng Yu, Yi-Te Hsu, Szu-Wei Fu, Yu Tsao 0001, Tei-Wei Kuo, 
SEOFP-NET: Compression and Acceleration of Deep Neural Networks for Speech Enhancement Using Sign-Exponent-Only Floating-Points.

ICASSP2022 Szu-Wei Fu, Cheng Yu, Kuo-Hsuan Hung, Mirco Ravanelli, Yu Tsao 0001
MetricGAN-U: Unsupervised Speech Enhancement/Dereverberation Based Only on Noisy/Reverberated Speech.

ICASSP2022 Yu-Chen Lin, Tsun-An Hsieh, Kuo-Hsuan Hung, Cheng Yu, Harinath Garudadri, Yu Tsao 0001, Tei-Wei Kuo, 
Speech Recovery For Real-World Self-Powered Intermittent Devices.

ICASSP2022 Guan-Ting Lin, Chan-Jan Hsu, Da-Rong Liu, Hung-Yi Lee, Yu Tsao 0001
Analyzing The Robustness of Unsupervised Speech Recognition.

#27  | Jing Xiao 0006 | DBLP Google Scholar  
By venue: Interspeech: 34, ICASSP: 30, ICML: 2, TASLP: 1, EMNLP-Findings: 1
By year: 2024: 7, 2023: 15, 2022: 15, 2021: 18, 2020: 12, 2019: 1
ISCA sessionsspeech synthesis: 9topics in asr: 2speech, voice, and hearing disorders: 1spoken language translation, information retrieval, summarization, resources, and evaluation: 1speech activity detection and modeling: 1analysis of speech and audio signals: 1speech perception, production, and acquisition: 1speaker and language identification: 1question answering from speech: 1speech emotion recognition: 1source separation: 1novel models and training methods for asr: 1multi-, cross-lingual and other topics in asr: 1spoken language modeling and understanding: 1acoustic event detection and classification: 1non-autoregressive sequential modeling for speech processing: 1speech signal analysis and representation: 1graph and end-to-end learning for speaker recognition: 1embedding and network architecture for speaker recognition: 1acoustic event detection and acoustic scene classification: 1voice conversion and adaptation: 1spoken language understanding: 1dnn architectures for speaker recognition: 1speech and audio quality assessment: 1phonetic event detection and segmentation: 1
IEEE keywordsspeech synthesis: 14speech recognition: 8voice conversion: 6task analysis: 6text to speech: 5natural language processing: 5speaker recognition: 4computational modeling: 3contrastive learning: 3timbre: 3predictive models: 3end to end: 2emotion recognition: 2emotional speech synthesis: 2fuses: 2mutual information: 2adaptation models: 2linguistics: 2correlation: 2computer vision: 2multi modal: 2convolution: 2vector quantization: 2dynamic programming: 2zero shot: 2text analysis: 2transformer: 2couplings: 1differentiable aligner: 1vae: 1hierarchical vae: 1computer architecture: 1time invariant retrieval: 1data mining: 1self supervised learning: 1phonetics: 1noise reduction: 1speech emotion diarization: 1diffusion denoising probabilistic model: 1probabilistic logic: 1static var compensators: 1emotion decoupling: 1adaptive style fusion: 1adaptive systems: 1singing voice conversion: 1llm: 1model bias: 1text categorization: 1zero shot learning: 1bias leverage: 1robustness: 1few shot learning: 1knn methods: 1gold: 1automatic speech recognition: 1benchmark testing: 1monotonic alignment: 1asr: 1environmental sound classification: 1data free: 1audio classification: 1knowledge distillation: 1multiple signal classification: 1music genre classification: 1multi label: 1contrastive loss: 1symmetric cross modal attention: 1adversarial learning: 1speech representation disentanglement: 1linear programming: 1intonation intensity control: 1relative attribute: 1aligned cross entropy: 1entropy: 1non autoregressive asr: 1mask ctc: 1brain modeling: 1time frequency analysis: 1feature fusion: 1federated learning: 1graph convolution network: 1electroencephalogram: 1regression analysis: 1pattern classification: 1variance regularization: 1attribute inference: 1speaker age estimation: 1label distribution learning: 1any to any: 1object detection: 1self supervised: 1low resource: 1query processing: 1pattern clustering: 1interactive systems: 1visual dialog: 1patch embedding: 1question answering (information retrieval): 1incomplete utterance rewriting: 1self attention weight matrix: 1text edit: 1synthetic noise: 1adversarial perturbation: 1contextual information: 1grapheme to phoneme: 1multi speaker text to speech: 1conditional variational autoencoder: 1nat: 1end to end speech recognition: 1parallel processing: 1sampling methods: 1single step generation: 1ctc alignments: 1intent detection: 1continual learning: 1computational linguistics: 1slot filling: 1grammar: 1error analysis: 1pointer generator network: 1generators: 1parameter genera tor: 1semiotics: 1text normalization: 1unsupervised: 1data acquisition: 1information bottleneck: 1unsupervised learning: 1instance discriminator: 1recurrent neural nets: 1self attention: 1rnn transducer: 1feature maps: 1network pruning: 1matrix algebra: 1pqr: 1wireless channels: 1linear dependency analysis: 1waveform generators: 1vocoders: 1waveform generation: 1location variable convolution: 1vocoder: 1convolutional codes: 1strain: 1speaker clustering: 1aggregation hierarchy cluster: 1digital tv: 1analytical models: 1tied variational autoencoder: 1clustering methods: 1generative flow: 1non autoregressive: 1autoregressive processes: 1speech coding: 1prosody modelling: 1graph theory: 1graph neural network: 1baum welch algorithm: 1real time systems: 1signal processing algorithms: 1feed forward transformer: 1
Most publications (all venues) at: 2021: 95, 2022: 76, 2020: 65, 2023: 57, 2024: 38

Affiliations
PingAn Technology, Shenzhen, China
Epson Research and Development, San Jose, CA, USA (former)
Carnegie Mellon University, Robotics Institute, Pittsburgh, PA, USA (PhD 2005)

Recent publications

TASLP2024 Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006
EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion.

ICASSP2024 Yimin Deng, Huaizhen Tang, Xulong Zhang 0001, Ning Cheng 0001, Jing Xiao 0006, Jianzong Wang, 
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval.

ICASSP2024 Haobin Tang, Xulong Zhang 0001, Ning Cheng 0001, Jing Xiao 0006, Jianzong Wang, 
ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis.

ICASSP2024 Zeyu Yang, Minchuan Chen, Yanping Li, Wei Hu, Shaojun Wang, Jing Xiao 0006, Zijian Li, 
ESVC: Combining Adaptive Style Fusion and Multi-Level Feature Disentanglement for Expressive Singing Voice Conversion.

ICASSP2024 Yong Zhang, Hanzhang Li, Zhitao Li, Ning Cheng 0001, Ming Li, Jing Xiao 0006, Jianzong Wang, 
Leveraging Biases in Large Language Models: "bias-kNN" for Effective Few-Shot Learning.

ICASSP2024 Ziyang Zhuang, Kun Zou, Chenfeng Miao, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao 0006
Improving Attention-Based End-to-End Speech Recognition by Monotonic Alignment Attention Matrix Reconstruction.

ICML2024 Chenfeng Miao, Qingying Zhu, Minchuan Chen, Wei Hu, Zijian Li, Shaojun Wang, Jing Xiao 0006
DFlow: A Generative Model Combining Denoising AutoEncoder and Normalizing Flow for High Fidelity Waveform Generation.

ICASSP2023 Zuheng Kang, Yayun He, Jianzong Wang, Junqing Peng, Xiaoyang Qu, Jing Xiao 0006
Feature-Rich Audio Model Inversion for Data-Free Knowledge Distillation Towards General Sound Classification.

ICASSP2023 Ganghui Ru, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
Improving Music Genre Classification from multi-modal Properties of Music and Genre Correlations Perspective.

ICASSP2023 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
Learning Speech Representations with Flexible Hidden Feature Dimensions.

ICASSP2023 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization.

ICASSP2023 Haobin Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis.

ICASSP2023 Xulong Zhang 0001, Haobin Tang, Jianzong Wang, Ning Cheng 0001, Jian Luo, Jing Xiao 0006
Dynamic Alignment Mask CTC: Improved Mask CTC With Aligned Cross Entropy.

ICASSP2023 Kexin Zhu, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
Improving EEG-based Emotion Recognition by Fusing Time-Frequency and Spatial Representations.

Interspeech2023 Minchuan Chen, Chenfeng Miao, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006
Exploring multi-task learning and data augmentation in dementia detection with self-supervised pretrained models.

Interspeech2023 Jiaxin Fan, Yong Zhang, Hanzhang Li, Jianzong Wang, Zhitao Li, Sheng Ouyang, Ning Cheng 0001, Jing Xiao 0006
Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism.

Interspeech2023 Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao 0006
SVVAD: Personal Voice Activity Detection for Speaker Verification.

Interspeech2023 Yifu Sun, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Kaiyu Hu, Jing Xiao 0006
Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning.

Interspeech2023 Fengyun Tan, Chaofeng Feng, Tao Wei, Shuai Gong, Jinqiang Leng, Wei Chu, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006
Improving End-to-End Modeling For Mandarin-English Code-Switching Using Lightweight Switch-Routing Mixture-of-Experts.

Interspeech2023 Haobin Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis.

#28  | Shrikanth Narayanan | DBLP Google Scholar  
By venue: ICASSP: 32, Interspeech: 32, TASLP: 1, ACL: 1
By year: 2024: 5, 2023: 11, 2022: 4, 2021: 7, 2020: 18, 2019: 11, 2018: 10
ISCA sessionstrustworthy speech processing: 2speaker recognition and diarization: 2speech and language analytics for mental health: 2speaker state and trait: 2speech pathology, depression, and medical applications: 2phonetics, phonology, and prosody: 1speaker and language diarization: 1pathological speech analysis: 1keynote 1 isca medallist: 1connecting speech-science and speech-technology for children's speech: 1assessment of pathological speech and language: 1emotion and sentiment analysis: 1phonetics: 1speech enhancement, bandwidth extension and hearing aids: 1the interspeech 2020 far field speaker verification challenge: 1evaluation of speech technology systems and methods for resource construction and annotation: 1speech in health: 1speech signal characterization: 1the voices from a distance challenge: 1emotion and personality in conversation: 1the second dihard speech diarization challenge (dihard ii): 1topics in speech and audio signal processing: 1integrating speech science and technology for clinical applications: 1speaker diarization: 1emotion recognition and analysis: 1spoken corpora and annotation: 1novel approaches to enhancement: 1
IEEE keywordsspeech recognition: 14emotion recognition: 8speaker recognition: 8annotations: 5task analysis: 5speech: 4speaker diarization: 4computational modeling: 3pipelines: 3visualization: 3child speech: 3pattern clustering: 3benchmark testing: 2data models: 2natural languages: 2speaker classification: 2autism: 2speech emotion recognition: 2predictive models: 2music: 2signal processing algorithms: 2data privacy: 2adversarial training: 2speaker embeddings: 2x vector: 2clustergan: 2hospitals: 2signal classification: 2robustness: 2annotation fusion: 2behavioural sciences computing: 2convolutional neural nets: 2video signal processing: 2autism spectrum disorder: 2pattern classification: 2medical disorders: 2audio signal processing: 2trustworthiness: 1system performance: 1self supervision: 1speech enhancement: 1large language model: 1foundation model: 1video summarization: 1transformers: 1data compression: 1multimodal transformers: 1image representation: 1cross modal retrieval: 1music information retrieval: 1contrastive learning: 1tagging: 1self supervised learning: 1multimodal learning: 1semantics: 1buildings: 1motion segmentation: 1lips: 1audiovisual: 1voice activity detection: 1reproducibility of results: 1emotion evaluation: 1iemocap: 1motion capture: 1protocols: 1reproducibility: 1multimodal interaction modeling: 1tv: 1face recognition: 1multimedia: 1crops: 1context understanding: 1multimodal vision language pretrained models: 1costs: 1multilingual emotion recognition: 1emotion clusters: 1zero shot: 1audio visual dataset: 1taxonomy: 1event detection: 1audio event detection: 1audio recognition: 1medical services: 1movies: 1multiple signal classification: 1software: 1transformer: 1vocoders: 1text to speech synthesis: 1music audio synthesis: 1analytical models: 1neural vocoder: 1tacotron: 1catalysts: 1federated learning: 1audio benchmarks: 1machine learning: 1statistical privacy: 1speech emotion: 1noise enjection: 1fairness: 1nme sc: 1generative adversarial networks: 1clustering algorithms: 1gallium nitride: 1prototypes: 1mcgan: 1circadian rhythms: 1diseases: 1medical signal processing: 1recurrent neural nets: 1personnel: 1ubiquitous computing: 1health care: 1statistics: 1hybrid adversarial training: 1multi task objective: 1perturbation methods: 1adversarial attack: 1feature scattering: 1multi scale: 1score fusion: 1uniform segmentation: 1child forensic interview: 1law administration: 1deception detection: 1behavioral signal processing: 1triplet embedding: 1trapezoidal signal regression: 1signal warping: 1cost accounting: 1sequential analysis: 1behavior: 1suicidal risk: 1asr: 1couples conversations: 1psychology: 1military computing: 1prosody: 1sensor fusion: 1support vector machines: 1machine learning.: 1wearable: 1time series: 1wearable computers: 1routine analysis: 1data clustering: 1music emotion recognition: 1triplet embeddings: 1inter rater agreement: 1music perception: 1segmentation: 1cnn: 1biomedical mri: 1convlstm: 1medical image processing: 1rtmri: 1extraterrestrial measurements: 1supervised learning: 1prototypical networks: 1patient diagnosis: 1gradient reversal: 1medical diagnostic computing: 1paediatrics: 1natural language processing: 1domain adversarial learning: 1affective computing: 1affective representation: 1speaker invariant: 1entropy: 1signal representation: 1deep latent space clustering: 1medical computing: 1adversarial invariance: 1robust speaker recognition: 1spectrogram: 1document handling: 1multitask learning: 1situation awareness: 1text classification: 
1optimisation: 1emergency management: 1clustering: 1data mining: 1mouse ultrasonic vocalizations: 1biocommunications: 1filtering theory: 1subspace similarity: 1sparse subspace clustering: 1speaker role recognition: 1lattice rescoring: 1language model: 1convo lutional neural networks: 1speech activity detection: 1movie audio: 1entertainment: 1wearable sensing: 1foreground detection: 1detectors: 1speaking patterns: 1audio: 1speech activity detector: 1employment: 1multitaper: 1bioelectric potentials: 1eeg: 1brain computer interfaces: 1electroencephalography: 1delta: 1syllable: 1
Most publications (all venues) at: 2013: 63, 2011: 53, 2016: 52, 2019: 50, 2008: 50


Recent publications

ICASSP2024 Tiantian Feng, Rajat Hebbar, Shrikanth Narayanan
TRUST-SER: On The Trustworthiness Of Fine-Tuning Pre-Trained Speech Embeddings For Speech Emotion Recognition.

ICASSP2024 Tiantian Feng, Shrikanth Narayanan
Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting.

ICASSP2024 Yoonsoo Nam, Adam Lehavi, Daniel Yang, Digbalay Bose, Swabha Swayamdipta, Shrikanth Narayanan
Does Video Summarization Require Videos? Quantifying the Effectiveness of Language in Video Summarization.

ICASSP2024 Shanti Stewart, Kleanthis Avramidis, Tiantian Feng, Shrikanth Narayanan
Emotion-Aligned Contrastive Learning Between Images and Music.

ICASSP2024 Anfeng Xu, Kevin Huang, Tiantian Feng, Helen Tager-Flusberg, Shrikanth Narayanan
Audio-Visual Child-Adult Speaker Classification in Dyadic Interactions.

ICASSP2023 Nikolaos Antoniou, Athanasios Katsamanis, Theodoros Giannakopoulos, Shrikanth Narayanan
Designing and Evaluating Speech Emotion Recognition Systems: A Reality Check Case Study with IEMOCAP.

ICASSP2023 Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Shrikanth Narayanan
Contextually-Rich Human Affect Perception Using Multimodal Scene Information.

ICASSP2023 Georgios Chochlakis, Gireesh Mahajan, Sabyasachee Baruah, Keith Burghardt, Kristina Lerman, Shrikanth Narayanan
Using Emotion Embeddings to Transfer Knowledge between Emotions, Languages, and Annotation Formats.

ICASSP2023 Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai, Shrikanth Narayanan
A Dataset for Audio-Visual Sound Event Detection in Movies.

ICASSP2023 Xuan Shi, Erica Cooper, Xin Wang 0037, Junichi Yamagishi, Shrikanth Narayanan
Can Knowledge of End-to-End Text-to-Speech Models Improve Neural Midi-to-Audio Synthesis Systems?

ICASSP2023 Tuo Zhang, Tiantian Feng, Samiul Alam, Sunwoo Lee, Mi Zhang 0002, Shrikanth S. Narayanan, Salman Avestimehr, 
FedAudio: A Federated Learning Benchmark for Audio Tasks.

Interspeech2023 Reed Blaylock, Shrikanth Narayanan
Beatboxing Kick Drum Kinematics.

Interspeech2023 Rimita Lahiri, Tiantian Feng, Rajat Hebbar, Catherine Lord, So Hyun Kim, Shrikanth Narayanan
Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism.

Interspeech2023 Thomas Melistas, Lefteris Kapelonis, Nikolaos Antoniou, Petros Mitseas, Dimitris Sgouropoulos, Theodoros Giannakopoulos, Athanasios Katsamanis, Shrikanth Narayanan
Cross-Lingual Features for Alzheimer's Dementia Detection from Speech.

Interspeech2023 Shrikanth Narayanan
Bridging Speech Science and Technology - Now and Into the Future.

Interspeech2023 Anfeng Xu, Rajat Hebbar, Rimita Lahiri, Tiantian Feng, Lindsay Butler, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan
Understanding Spoken Language Development of Children with ASD Using Pre-trained Speech Embeddings.

ICASSP2022 Tiantian Feng, Hanieh Hashemi, Murali Annavaram, Shrikanth S. Narayanan
Enhancing Privacy Through Domain Adaptive Noise Injection For Speech Emotion Recognition.

Interspeech2022 Tiantian Feng, Shrikanth Narayanan
Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling.

Interspeech2022 Tiantian Feng, Raghuveer Peri, Shrikanth Narayanan
User-Level Differential Privacy against Attribute Inference Attack of Speech Emotion Recognition on Federated Learning.

Interspeech2022 Nikolaos Flemotomos, Shrikanth Narayanan
Multimodal Clustering with Role Induced Constraints for Speaker Diarization.

#29  | Dan Su 0002 | DBLP Google Scholar  
By venue: Interspeech: 30, ICASSP: 29, ICML: 1, ACL: 1, AAAI: 1, ICLR: 1, IJCAI: 1, TASLP: 1
By year: 2024: 3, 2023: 4, 2022: 15, 2021: 11, 2020: 16, 2019: 10, 2018: 6
ISCA sessionsvoice conversion and adaptation: 4speech synthesis: 4speech recognition: 2deep learning for source separation and pitch tracking: 2speech coding and enhancement: 1speaker embedding and diarization: 1tools, corpora and resources: 1topics in asr: 1source separation, dereverberation and echo cancellation: 1novel neural network architectures for asr: 1multi-channel speech enhancement: 1speaker recognition: 1asr neural network architectures and training: 1new trends in self-supervised speech processing: 1speech synthesis paradigms and methods: 1multimodal speech processing: 1speech enhancement: 1asr neural network architectures: 1speaker verification using neural network methods: 1sequence models for asr: 1expressive speech synthesis: 1topics in speech recognition: 1
IEEE keywordsspeech recognition: 11speaker recognition: 8speech synthesis: 6speaker verification: 4multi channel: 4speech separation: 4natural language processing: 4recurrent neural nets: 3overlapped speech: 3speech enhancement: 3data augmentation: 3microphone arrays: 2voice activity detection: 2speaker diarization: 2multi look: 2transfer learning: 2domain adaptation: 2maximum mean discrepancy: 2code switching: 2speaker embedding: 2attention based model: 2automatic speech recognition: 2end to end speech recognition: 2expressive tts: 1transformers: 1bigvgan: 1durian e: 1adaptation models: 1linguistics: 1style adaptive instance normalization: 1signal generators: 1adaptive systems: 1vits: 1speaking style: 1text analysis: 1conversational text to speech synthesis: 1graph neural network: 1low quality data: 1neural speech synthesis: 1style transfer: 1joint training: 1dual path: 1acoustic model: 1echo suppression: 1streaming: 1dynamic weight attention: 1training data: 1error analysis: 1three dimensional displays: 1noisy label: 1convolution: 1attention module: 1multi speaker: 1knowledge transfer: 1video to speech synthesis: 1vector quantization: 1measurement: 1voice conversion: 1knowledge engineering: 1lips: 1predictive coding: 1vocabulary: 1expert systems: 1router architecture: 1mixture of experts: 1global information: 1accent embedding: 1domain embedding: 1feature fusion: 1data handling: 1m2met: 1direction of arrival estimation: 1direction of arrival: 1neural architecture search: 1transferable architecture: 1neural net architecture: 1multi granularity: 1single channel: 1self attentive network: 1source separation: 1synthetic speech detection: 1res2net: 1replay detection: 1multi scale feature: 1asv anti spoofing: 1ctc: 1non autoregressive: 1decoding: 1transformer: 1autoregressive processes: 1speaker verification (sv): 1phonetic pos teriorgrams: 1speech intelligibility: 1speech coding: 1end to end: 1multi channel speech separation: 1inter channel convolution differences: 1reverberation: 1spatial filters: 1filtering theory: 1spatial features: 1parallel optimization: 1random sampling.: 1model partition: 1graphics processing units: 1lstm language model: 1bmuf: 1joint learning: 1noise measurement: 1speaker aware: 1target speech enhancement: 1time domain analysis: 1gain: 1teacher student: 1semi supervised learning: 1accent conversion: 1accented speech recognition: 1self attention: 1persistent memory: 1dfsmn: 1permutation invariant training: 1encoding: 1model integration: 1multi band: 1nist: 1artificial intelligence: 1mel frequency cepstral coefficient: 1loss function: 1boundary: 1top k loss: 1task analysis: 1language model: 1discriminative feature learning: 1sequence discriminative training: 1acoustic variability: 1hidden markov models: 1asr: 1variational inference: 1convolutional neural nets: 1quasifully recurrent neural network (qrnn): 1parallel processing: 1parallel wavenet: 1text to speech (tts) synthesis: 1convolutional neural network (cnn): 1teacher student training: 1knowledge distillation: 1multi domain: 1all rounder: 1feedforward neural nets: 1
Most publications (all venues) at: 2021: 24, 2022: 22, 2020: 19, 2019: 19, 2018: 8

Affiliations
Tencent AI Lab, Shenzhen, China
URLs

Recent publications

ICASSP2024 Yu Gu, Qiushi Zhu, Guangzhi Lei, Chao Weng, Dan Su 0002
DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis.

ICML2024 Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su 0002, Wei Liang, Dong Yu 0001, 
Prompt-guided Precise Audio Editing with Diffusion Models.

ACL2024 Yongxin Zhu 0003, Dan Su 0002, Liqiang He, Linli Xu, Dong Yu 0001, 
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer.

Interspeech2023 Wei Xiao, Wenzhe Liu, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su 0002, Shidong Shang, Dong Yu 0001, 
Multi-mode Neural Speech Coding Based on Deep Generative Networks.

Interspeech2023 Yuping Yuan, Zhao You, Shulin Feng, Dan Su 0002, Yanchun Liang 0001, Xiaohu Shi, Dong Yu 0001, 
Compressed MoE ASR Model Based on Knowledge Distillation and Quantization.

Interspeech2023 Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu 0001, Zhao You, Dan Su 0002, Dong Yu 0001, Helen Meng, 
Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation.

AAAI2023 Yi Lei, Shan Yang, Xinsheng Wang, Qicong Xie, Jixun Yao, Lei Xie 0001, Dan Su 0002
UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis.

ICASSP2022 Jingbei Li, Yi Meng, Chenyi Li, Zhiyong Wu 0001, Helen Meng, Chao Weng, Dan Su 0002
Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-Based Multi-Modal Context Modeling.

ICASSP2022 Songxiang Liu, Shan Yang, Dan Su 0002, Dong Yu 0001, 
Referee: Towards Reference-Free Cross-Speaker Style Transfer with Low-Quality Data for Expressive Speech Synthesis.

ICASSP2022 Dongpeng Ma, Yiwen Wang, Liqiang He, Mingjie Jin, Dan Su 0002, Dong Yu 0001, 
DP-DWA: Dual-Path Dynamic Weight Attention Network With Streaming DFSMN-SAN For Automatic Speech Recognition.

ICASSP2022 Xiaoyi Qin, Na Li 0012, Chao Weng, Dan Su 0002, Ming Li 0026, 
Simple Attention Module Based Speaker Verification with Iterative Noisy Label Detection.

ICASSP2022 Disong Wang, Shan Yang, Dan Su 0002, Xunying Liu, Dong Yu 0001, Helen Meng, 
VCVTS: Multi-Speaker Video-to-Speech Synthesis Via Cross-Modal Knowledge Transfer from Voice Conversion.

ICASSP2022 Zhao You, Shulin Feng, Dan Su 0002, Dong Yu 0001, 
Speechmoe2: Mixture-of-Experts Model with Improved Routing.

ICASSP2022 Naijun Zheng, Na Li 0012, Xixin Wu, Lingwei Meng, Jiawen Kang 0002, Haibin Wu, Chao Weng, Dan Su 0002, Helen Meng, 
The CUHK-Tencent Speaker Diarization System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.

ICASSP2022 Naijun Zheng, Na Li 0012, Jianwei Yu, Chao Weng, Dan Su 0002, Xunying Liu, Helen Meng, 
Multi-Channel Speaker Diarization Using Spatial Features for Meetings.

Interspeech2022 Yi Lei, Shan Yang, Jian Cong, Lei Xie 0001, Dan Su 0002
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion.

Interspeech2022 Xiaoyi Qin, Na Li 0012, Chao Weng, Dan Su 0002, Ming Li 0026, 
Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings.

Interspeech2022 Liumeng Xue, Shan Yang, Na Hu, Dan Su 0002, Lei Xie 0001, 
Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers.

Interspeech2022 Yixuan Zhou 0002, Changhe Song, Jingbei Li, Zhiyong Wu 0001, Yanyao Bian, Dan Su 0002, Helen Meng, 
Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis.

Interspeech2022 Yixuan Zhou 0002, Changhe Song, Xiang Li 0105, Luwen Zhang, Zhiyong Wu 0001, Yanyao Bian, Dan Su 0002, Helen Meng, 
Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis.

#30  | Mark Hasegawa-Johnson | DBLP Google Scholar  
By venue: Interspeech: 27, ICASSP: 18, TASLP: 6, ICML: 5, ACL: 2, ACL-Findings: 2, SpeechComm: 2, NAACL: 1
By year: 2024: 4, 2023: 9, 2022: 10, 2021: 14, 2020: 12, 2019: 8, 2018: 6
ISCA sessionsvoice conversion: 1paralinguistics: 1speech synthesis and voice conversion: 1analysis of speech and audio signals: 1spoken language modeling and understanding: 1atypical speech detection: 1cross/multi-lingual asr: 1speech synthesis: 1topics in asr: 1the first dicova challenge: 1noise reduction and intelligibility: 1applications of asr: 1cross/multi-lingual and code-switched speech recognition: 1spoken language understanding: 1speech translation and multilingual/multimodal learning: 1phonetic event detection and segmentation: 1diarization: 1speech and voice disorders: 1model adaptation for asr: 1speech in the brain: 1spoken term detection: 1extracting information from audio: 1adjusting to speaker, accent, and domain: 1topics in speech recognition: 1multimodal systems: 1deep neural networks: 1speaker state and trait: 1
IEEE keywordsspeech recognition: 11natural language processing: 6automatic speech recognition: 5speech synthesis: 5data models: 3ctc: 3voice conversion: 3speaker recognition: 3decoding: 3end to end: 2visualization: 2unsupervised learning: 2signal classification: 2speaker change detection: 2speaker adaptation: 2language acquisition: 2multimodal learning: 2testing: 2task analysis: 2connectionist temporal classification: 1entropy maximization: 1adaptation models: 1minimization: 1symbols: 1predictive models: 1dynamic scheduling: 1transducers: 1training data: 1error analysis: 1grapheme to phoneme transducer: 1g2p: 1buildings: 1self supervised speech processing: 1acoustic unit discovery: 1benchmark testing: 1unsupervised phoneme segmentation: 1codes: 1text to speech (tts): 1model compression: 1data constrained multi speaker: 1multi lingual tts: 1audio visual attention: 1fuses: 1target speaker extraction: 1streaming media: 1under resourced asr: 1computer based training: 1autosegmental phonology: 1tones: 1cross lingual adap tation: 1ipa: 1speech disentanglement: 1pneumodynamics: 1time frequency analysis: 1covid 19: 1diseases: 1patient diagnosis: 1medical signal processing: 1medical signal detection: 1audio signal processing: 1telemedicine: 1dicova ii: 1signal detection: 1affine transforms: 1speaker segmentation: 1fairness in machine learning: 1counterfactual fairness: 1probability: 1cross modal captioning: 1image annotation: 1multimodal modelling: 1image to speech generation: 1multilingual: 1zero shot learning: 1phonotactics: 1medical disorders: 1dysarthric speech: 1data augmentation: 1speech intelligibility: 1asr: 1sequence to sequence: 1image to speech: 1image captioning: 1encoder decoder: 1spoken term discovery: 1low resource speech technology: 1transfer learning: 1multiple instance learning: 1child speech: 1convolution: 1paediatrics: 1behavioural sciences computing: 1voice activity detection: 1language development: 1speaker diarization: 1recurrent neural nets: 1speech codecs: 1source separation: 1signal reconstruction: 1text analysis: 1human computer interaction: 1image retrieval: 1image representation: 1language translation: 1machine translation: 1unsupervised word discovery: 1wavenet vocoder: 1f0 conversion: 1autoencoder: 1encoding: 1dialog act recognition: 1text recognition: 1multiview training: 1non parallel data: 1spoken language understanding: 1smoothing methods: 1acoustic landmarks: 1acoustic modeling: 1emotion recognition: 1vocal expression: 1latent semantic analysis: 1laughter: 1dimensional analysis: 1perception: 1loss measurement: 1microsoft windows: 1siamese networks: 1sequence embedding: 1
Most publications (all venues) at: 2022: 22, 2017: 20, 2016: 19, 2021: 18, 2020: 18


Recent publications

ICASSP2024 SooHwan Eom, Eunseop Yoon, Hee Suk Yoon, Chanwoo Kim 0001, Mark Hasegawa-Johnson, Chang D. Yoo, 
AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition.

ICASSP2024 Heting Gao, Mark Hasegawa-Johnson, Chang D. Yoo, 
G2PU: Grapheme-To-Phoneme Transducer with Speech Units.

ICASSP2024 Liming Wang, Mark Hasegawa-Johnson, Chang D. Yoo, 
Unsupervised Speech Recognition with N-skipgram and Positional Unigram Matching.

ICML2024 Heting Gao, Kaizhi Qian, Junrui Ni, Chuang Gan, Mark A. Hasegawa-Johnson, Shiyu Chang, Yang Zhang 0001, 
Speech Self-Supervised Learning Using Diffusion Model Synthetic Data.

ICASSP2023 Abhayjeet Singh, Amala Nagireddi, Deekshitha G, Jesuraja Bandekar, Roopa R., Sandhya Badiger, Sathvik Udupa, Prasanta Kumar Ghosh, Hema A. Murthy, Heiga Zen, Pranaw Kumar, Kamal Kant, Amol Bole, Bira Chandra Singh, Keiichi Tokuda, Mark Hasegawa-Johnson, Philipp Olbrich, 
Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech.

ICASSP2023 Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson
Dual-Path Cross-Modal Attention for Better Audio-Visual Speech Extraction.

Interspeech2023 Wonjune Kang, Mark Hasegawa-Johnson, Deb Roy, 
End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions.

Interspeech2023 Jialu Li 0002, Mark Hasegawa-Johnson, Nancy L. McElwain, 
Towards Robust Family-Infant Audio Analysis Based on Unsupervised Pretraining of Wav2vec 2.0 on Large-Scale Unlabeled Family Audio.

Interspeech2023 Eunseop Yoon, Hee Suk Yoon, Dhananjaya Gowda, SooHwan Eom, Daehyeok Kim, John B. Harvill, Heting Gao, Mark Hasegawa-Johnson, Chanwoo Kim 0001, Chang D. Yoo, 
Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction.

Interspeech2023 Wanyue Zhai, Mark Hasegawa-Johnson
Wav2ToBI: a new approach to automatic ToBI transcription.

ACL2023 Liming Wang, Mark Hasegawa-Johnson, Chang Dong Yoo, 
A Theory of Unsupervised Speech Recognition.

ACL-Findings2023 Liming Wang, Junrui Ni, Heting Gao, Jialu Li 0002, Kai Chieh Chang, Xulin Fan, Junkai Wu, Mark Hasegawa-Johnson, Chang Dong Yoo, 
Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition.

ACL-Findings2023 Eunseop Yoon, Hee Suk Yoon, John B. Harvill, Mark Hasegawa-Johnson, Chang Dong Yoo, 
INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced Non-Native Speech Recognition.

SpeechComm2022 Heting Gao, Xiaoxuan Wang, Sunghun Kang, Rusty Mina, Dias Issa, John B. Harvill, Leda Sari, Mark Hasegawa-Johnson, Chang D. Yoo, 
Seamless equal accuracy ratio for inclusive CTC speech recognition.

TASLP2022 Jialu Li 0002, Mark Hasegawa-Johnson
Autosegmental Neural Nets 2.0: An Extensive Study of Training Synchronous and Asynchronous Phones and Tones for Under-Resourced Tonal Languages.

ICASSP2022 Chak Ho Chan, Kaizhi Qian, Yang Zhang 0001, Mark Hasegawa-Johnson
SpeechSplit2.0: Unsupervised Speech Disentanglement for Voice Conversion without Tuning Autoencoder Bottlenecks.

ICASSP2022 John B. Harvill, Yash R. Wani, Moitreya Chatterjee, Mustafa Alam, David G. Beiser, David Chestek, Mark Hasegawa-Johnson, Narendra Ahuja, 
Detection of Covid-19 from Joint Time and Frequency Analysis of Speech, Breathing and Cough Audio.

Interspeech2022 Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang 0001, Shiyu Chang, Mark Hasegawa-Johnson
WavPrompt: Towards Few-Shot Spoken Language Understanding with Frozen Language Models.

Interspeech2022 John B. Harvill, Mark Hasegawa-Johnson, Chang D. Yoo, 
Frame-Level Stutter Detection.

Interspeech2022 Mahir Morshed, Mark Hasegawa-Johnson
Cross-lingual articulatory feature information transfer for speech recognition using recurrent progressive neural networks.

#31  | James R. Glass | DBLP Google Scholar  
By venue: Interspeech: 33, ICASSP: 18, NeurIPS: 3, TASLP: 3, ICLR: 2, ACL: 2, NAACL: 1, AAAI: 1
By year: 2024: 2, 2023: 6, 2022: 7, 2021: 12, 2020: 12, 2019: 18, 2018: 6
ISCA sessionsnew trends in self-supervised speech processing: 2analysis of speech and audio signals: 1invariant and robust pre-trained acoustic models: 1cross-lingual and multilingual asr: 1speech synthesis: 1acoustic event detection and acoustic scene classification: 1assessment of pathological speech and language: 1non-autoregressive sequential modeling for speech processing: 1spoken dialogue systems: 1tools, corpora and resources: 1multimodal systems: 1low-resource speech recognition: 1language recognition: 1speech signal representation: 1spoken dialogue system: 1speech translation and multilingual/multimodal learning: 1speaker recognition challenges and applications: 1zero-resource asr: 1end-to-end speech recognition: 1speech signal characterization: 1speech and audio classification: 1speech and audio source separation and scene analysis: 1speech recognition and beyond: 1dialogue speech understanding: 1applications of language technologies: 1speaker recognition and diarization: 1speaker recognition: 1sequence models for asr: 1integrating speech science and technology for clinical applications: 1deep neural networks: 1robust speech recognition: 1neural network training strategies for asr: 1
IEEE keywordsspeech recognition: 12natural language processing: 7self supervised learning: 3speech synthesis: 3speaker recognition: 3computational modeling: 2entropy: 2cross modal: 2natural language interfaces: 2information retrieval: 2audio signal processing: 2image representation: 2speech representation learning: 2interactive systems: 2unsupervised learning: 2text analysis: 2self attention: 2language identification: 2dialect identification: 2convolutional neural nets: 2unsupervised speech processing: 2measurement: 1robustness: 1representation analysis: 1information theory: 1self supervised speech representation learning: 1retrieval: 1multilingual: 1knowledge distillation: 1analytical models: 1cross lingual: 1predictive models: 1transformer: 1pronunciation assessment: 1transformers: 1support vector machines: 1corpus: 1audio classification: 1speech: 1signal classification: 1pattern classification: 1condition monitoring: 1vocal sounds: 1real time systems: 1primary progressive aphasia: 1clinical trials: 1cognitive impairment: 1minimally invasive surgery: 1repetition assessment: 1time measurement: 1cross lingual transfer learning: 1adaptation: 1self training: 1asr: 1efficiency: 1pruning: 1text to speech: 1speech intelligibility: 1vocoder: 1vocoders: 1transfer learning: 1ensemble: 1noisy label: 1audio tagging: 1signal sampling: 1imbalanced learning: 1audio event classification: 1unsupervised pre training: 1recurrent neural nets: 1comparative analysis: 1spoken language understanding: 1semi supervised learning: 1maximum likelihood estimation: 1wordpiece: 1subword: 1end to end: 1and cross lingual retrieval: 1semantic embedding space: 1vision and spoken language: 1social networking (online): 1large scale: 1arabic dialect: 1dataset: 1query processing: 1query languages: 1crowdsourcing: 1semantic embedding: 1dialogue system: 1convolutional neural network: 1reinforcement learning: 1pattern clustering: 1brain: 1dnns: 1bottleneck feature: 1cepstral analysis: 1time contrastive learning: 1speaker verification: 1image segmentation: 1gmm ubm: 1gaussian processes: 1language translation: 1speech2vec: 1bilingual lexicon induction: 1speech to text translation: 1multimodal speech processing: 1vision and language: 1data augmentation: 1variational autoencoder: 1adversarial training: 1text to speech synthesis: 1disentangled representation learning: 1variational inference: 1factorial deep markov model: 1markov processes: 1task analysis: 1fusion: 1recognition: 1attention: 1person verification: 1multi modal: 1missing data: 1face recognition: 1image fusion: 1
Most publications (all venues) at: 2019: 47, 2018: 34, 2021: 23, 2020: 23, 2022: 20


Recent publications

ICASSP2024 Alexander H. Liu, Sung-Lin Yeh, James R. Glass
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective.

NAACL2024 Heng-Jui Chang, James R. Glass
R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces.

ICASSP2023 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas 0001, Rogério Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James R. Glass
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval.

Interspeech2023 Yuan Gong 0001, Sameer Khurana, Leonid Karlinsky, James R. Glass
Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers.

Interspeech2023 Heng-Jui Chang, Alexander H. Liu, James R. Glass
Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering.

Interspeech2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas 0001, Rogério Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James R. Glass
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages.

NeurIPS2023 Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning.

ICLR2023 Yuan Gong 0001, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James R. Glass
Contrastive Audio-Visual Masked Autoencoder.

ICASSP2022 Yuan Gong 0001, Ziyi Chen, Iek-Heng Chu, Peng Chang 0002, James R. Glass
Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment.

ICASSP2022 Yuan Gong 0001, Jin Yu, James R. Glass
Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition.

ICASSP2022 R'mani Haulcy, Katerina Placek, Brian Tracey, Adam P. Vogel, James R. Glass
Repetition Assessment for Speech and Language Disorders: A Study of the Logopenic Variant of Primary Progressive Aphasia.

ICASSP2022 Sameer Khurana, Antoine Laurent, James R. Glass
Magic Dust for Cross-Lingual Adaptation of Monolingual Wav2vec-2.0.

ICASSP2022 Cheng-I Jeff Lai, Erica Cooper, Yang Zhang 0001, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David D. Cox, James R. Glass
On the Interplay between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis.

Interspeech2022 Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James R. Glass
Simple and Effective Unsupervised Speech Synthesis.

AAAI2022 Yuan Gong 0001, Cheng-I Lai, Yu-An Chung, James R. Glass
SSAST: Self-Supervised Audio Spectrogram Transformer.

TASLP2021 Yuan Gong 0001, Yu-An Chung, James R. Glass
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation.

ICASSP2021 Yu-An Chung, Yonatan Belinkov, James R. Glass
Similarity Analysis of Self-Supervised Speech Representations.

ICASSP2021 Cheng-I Lai, Yung-Sung Chuang, Hung-Yi Lee, Shang-Wen Li 0001, James R. Glass
Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining.

Interspeech2021 Yuan Gong 0001, Yu-An Chung, James R. Glass
AST: Audio Spectrogram Transformer.

Interspeech2021 R'mani Haulcy, James R. Glass
CLAC: A Speech Corpus of Healthy English Speakers.

#32  | Yu Zhang 0033 | DBLP Google Scholar  
By venue: ICASSP: 33, Interspeech: 26, ICLR: 2, ICML: 1, NeurIPS: 1
By year: 2024: 1, 2023: 13, 2022: 13, 2021: 15, 2020: 12, 2019: 8, 2018: 1
ISCA sessionsspeech synthesis: 7speech recognition: 2spoken language processing: 2search/decoding techniques and confidence measures for asr: 2asr neural network architectures and training: 2training strategies for asr: 2resource-constrained asr: 1novel models and training methods for asr: 1adaptation, transfer learning, and distillation for asr: 1self-supervised, semi-supervised, adaptation and data augmentation for asr: 1self-supervision and semi-supervision for neural asr training: 1non-autoregressive sequential modeling for speech processing: 1multi- and cross-lingual asr, other topics in asr: 1asr neural network architectures: 1asr neural network training: 1
IEEE keywordsspeech recognition: 21speech synthesis: 8natural language processing: 7task analysis: 5data models: 5adaptation models: 5speech coding: 5text to speech: 4rnn t: 3text analysis: 3automatic speech recognition: 3end to end: 3conformer: 3recurrent neural nets: 3data augmentation: 3video on demand: 2vocabulary: 2decoding: 2computational modeling: 2error analysis: 2semi supervised learning: 2transducers: 2text injection: 2visualization: 2self supervised learning: 2multilingual: 2confidence scores: 2probability: 2end to end asr: 2speaker recognition: 2transformer: 2supervised learning: 2end to end speech recognition: 2fine grained vae: 2tacotron 2: 2hardware: 1large language model: 1distance measurement: 1multilingual speech recognition: 1computer architecture: 1noisy student training: 1machine learning: 1production: 1rnn transducer: 1knowledge distillation: 1domain adaptation: 1foundation models: 1frequency modulation: 1soft sensors: 1internal lm: 1text recognition: 1multilingual text to speech synthesis: 1semisupervised learning: 1massive multilingual pretraining: 1speech–text semi supervised joint learning: 1acoustic modeling: 1recurrent neural networks: 1lattices: 1loss measurement: 1speech text representation learning: 1simultaneous localization and mapping: 1inspection: 1and foundation speech models: 1model reprogramming: 1pre trained adaptation: 1benchmark testing: 1focusing: 1cross lingual speech recognition: 1convolution: 1kernel: 1encoding: 1multilingual asr: 1transfer learning: 1joint training: 1contrastive learning: 1indexes: 1linear programming: 1consistency regularization: 1self supervised: 1degradation: 1massive: 1lifelong learning: 1out of domain: 1feature selection: 1estimation theory: 1two pass asr: 1rnnt: 1long form asr: 1pattern classification: 1emotion recognition: 1paralinguistics: 1representation learning: 1speech: 1streaming asr: 1model distillation: 1non streaming asr: 1vae: 1iterative methods: 1computational complexity: 1neural tts: 1self attention: 1non autoregressive: 1autoregressive processes: 1cascaded encoders: 1latency: 1hidden markov models: 1mean square error methods: 1calibration: 1confidence: 1voice activity detection: 1attention based end to end models: 1echo state network: 1long form: 1echo: 1reproducibility of results: 1open source: 1open source software: 1sentiment analysis: 1speech sentiment analysis: 1end to end asr model: 1analytical models: 1asr pretraining: 1multi domain training: 1optimisation: 1standards: 1vector quantization: 1measurement: 1regression analysis: 1hierarchical: 1pre training: 1data efficiency: 1tacotron: 1expert systems: 1unpaired data: 1cycle consistency: 1variational autoencoder: 1adversarial training: 1text to speech synthesis: 1end to end speech synthesis: 1
Most publications (all venues) at: 2023: 23, 2022: 22, 2021: 21, 2020: 18, 2019: 13

Affiliations
Google
Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA (PhD 2017)
URLs

Recent publications

ICASSP2024 W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang 0033, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath, 
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study.

ICASSP2023 Ke Hu, Tara N. Sainath, Bo Li 0028, Nan Du 0002, Yanping Huang, Andrew M. Dai, Yu Zhang 0033, Rodrigo Cabrera, Zhifeng Chen, Trevor Strohman, 
Massively Multilingual Shallow Fusion with Large Language Models.

ICASSP2023 Dongseong Hwang, Khe Chai Sim, Yu Zhang 0033, Trevor Strohman, 
Comparison of Soft and Hard Target RNN-T Distillation for Large-Scale ASR.

ICASSP2023 Bo Li 0028, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang 0033, Wei Han 0002, Trevor Strohman, Françoise Beaufays, 
Efficient Domain Adaptation for Speech Foundation Models.

ICASSP2023 Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang 0033, Bo Li 0028, Andrew Rosenberg, Bhuvana Ramabhadran, 
JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.

ICASSP2023 Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang 0033, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran, 
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech.

ICASSP2023 Yongqiang Wang, Zhehuai Chen, Chengjian Zheng, Yu Zhang 0033, Wei Han 0002, Parisa Haghani, 
Accelerating RNN-T Training and Inference Using CTC Guidance.

ICASSP2023 Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang 0033
Understanding Shared Speech-Text Representations.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman, 
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee 0001, 
A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition.

Interspeech2023 Zih-Ching Chen, Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath, 
How to Estimate Model Transferability of Pre-Trained Speech Models?

Interspeech2023 Ke Hu, Bo Li 0028, Tara N. Sainath, Yu Zhang 0033, Françoise Beaufays, 
Mixture-of-Expert Conformer for Streaming Multilingual ASR.

Interspeech2023 Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang 0033, Wei Han 0002, Ankur Bapna, 
LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus.

ICML2023 Yong Cheng, Yu Zhang 0033, Melvin Johnson, Wolfgang Macherey, Ankur Bapna, 
Mu2SLAM: Multitask, Multilingual Speech and Language Models.

ICASSP2022 Junwen Bai, Bo Li 0028, Yu Zhang 0033, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath, 
Joint Unsupervised and Supervised Training for Multilingual ASR.

ICASSP2022 Zhehuai Chen, Yu Zhang 0033, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno 0001, Gary Wang, 
Tts4pretrain 2.0: Advancing the use of Text and Speech in ASR Pretraining with Consistency and Contrastive Losses.

ICASSP2022 Bo Li 0028, Ruoming Pang, Yu Zhang 0033, Tara N. Sainath, Trevor Strohman, Parisa Haghani, Yun Zhu, Brian Farris, Neeraj Gaur, Manasa Prasad, 
Massively Multilingual ASR: A Lifelong Learning Solution.

ICASSP2022 Qiujia Li, Yu Zhang 0033, David Qiu, Yanzhang He, Liangliang Cao, Philip C. Woodland, 
Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition.

ICASSP2022 Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Weiran Wang, David Qiu, Chung-Cheng Chiu, Rohit Prabhavalkar, Alexander Gruenstein, Anmol Gulati, Bo Li 0028, David Rybach, Emmanuel Guzman, Ian McGraw, James Qin, Krzysztof Choromanski, Qiao Liang 0001, Robert David, Ruoming Pang, Shuo-Yiin Chang, Trevor Strohman, W. Ronny Huang, Wei Han 0002, Yonghui Wu, Yu Zhang 0033
Improving The Latency And Quality Of Cascaded Encoders.

ICASSP2022 Joel Shor, Aren Jansen, Wei Han 0002, Daniel S. Park, Yu Zhang 0033
Universal Paralinguistic Speech Representations Using self-Supervised Conformers.

#33  | Zhen-Hua Ling | DBLP Google Scholar  
By venue: ICASSP: 22, Interspeech: 21, TASLP: 15, ACL-Findings: 1, EMNLP-Findings: 1, AAAI: 1, EMNLP: 1
By year: 2024: 8, 2023: 11, 2022: 9, 2021: 13, 2020: 11, 2019: 8, 2018: 2
ISCA sessionsspeech synthesis: 9speech coding and enhancement: 2voice conversion and adaptation: 2speech perception, production, and acquisition: 1asr model training and strategies: 1acoustic model adaptation for asr: 1speaker recognition: 1corpus annotation and evaluation: 1singing and multimodal synthesis: 1voice conversion and speech synthesis: 1speech synthesis paradigms and methods: 1
IEEE keywordsspeech synthesis: 17speech recognition: 12predictive models: 6voice conversion: 6vocoders: 6natural language processing: 6speech enhancement: 5neural network: 5sequence to sequence: 5task analysis: 4text to speech: 3fourier transforms: 3statistical parametric speech synthesis: 3speaker recognition: 3transformer: 3anti wrapping loss: 2parallel estimation architecture: 2prediction algorithms: 2speech phase prediction: 2estimation: 2data models: 2wav2vec 2.0: 2tongue: 2ultrasonic imaging: 2lips: 2semantics: 2costs: 2signal processing algorithms: 2phase spectrum: 2amplitude spectrum: 2spectrogram: 2neural vocoder: 2speech: 2voice activity detection: 2naturalness: 2audio signal processing: 2diseases: 2deep learning (artificial intelligence): 2text analysis: 2variational autoencoder: 2interactive systems: 2knowledge representation: 2response selection: 2decoding: 2tacotron: 2unit selection: 2speech coding: 2vocoder: 2gaussian processes: 2speech generation: 1low latency communication: 1low latency: 1self supervised prosody learning: 1transformers: 1lpc residual: 1audio visual speech enhancement: 1ultrasound tongue image: 1knowledge distillation: 1videos: 1memory network: 1glass box: 1personalized speech generation: 1voice privacy: 1rendering (computer graphics): 1adversary attack: 1perturbation methods: 1privacy: 1data privacy: 1information filtering: 1voice anonymization: 1pseudo speaker distribution: 1speaker uncertainty: 1pseudo speaker vector: 1uncertainty: 1oral communication: 1conversational speech: 1turn taking events: 1coherence: 1temporal connection: 1eeg channel selection: 1gumbel softmax function: 1electroencephalography: 1multi talker conditions: 1neuro steered speech enhancement: 1gain measurement: 1phase estimation: 1convolution: 1phonetic representations: 1multilingual: 1phonetics: 1dictionaries: 1iterative algorithms: 1delays: 1phase wrapping: 1acoustic features: 1low pass filters: 1alzheimer’s dementia detection: 1wav2vec2.0: 1lipreading: 1representation learning: 1self supervised learning: 1visualization: 1audio visual speech recognition: 1silent speech interface: 1iterative methods: 1pseudo target: 1domain adversarial training: 1articulatory to acoustic conversion: 1signal denoising: 1denoising: 1transient response: 1dereverberation: 1reverberation: 1recognition synthesis: 1any to one: 1cyclic training: 1pre trained grapheme model: 1self supervised training: 1supervised learning: 1grapheme to phoneme conversion: 1mutual information: 1content: 1multiple references: 1style: 1sensor fusion: 1bimodal fusion: 1gaze tracking: 1bottleneck feature: 1eye tracking: 1neurophysiology: 1dementia detection: 1image representation: 1wavelet transforms: 1prosody modeling: 1fastspeech: 1discourse level modeling: 1convolutional neural nets: 1anti spoofing: 1adversarial example generation: 1dialogue system technology challenge: 1dialogue disentanglement: 1object oriented methods: 1multiple participants: 1deep contextualized utterance representations: 1dialogue success: 1structured query language: 1bit error rate: 1cross domain: 1text to sql: 1lightweight multi head attention: 1databases: 1encoder decoder: 1multi turn: 1history: 1recurrent neural nets: 1bert: 1hidden markov models: 1autoregressive processes: 1pitch prediction: 1pitch control: 1speech codecs: 1support vector machines: 1alzheimer’s disease: 1speech analysis: 1bottleneck features: 1data augmentation: 1error analysis: 1robustness: 1phoneme level autoregression: 1information retrieval: 1pattern matching: 1interactive 
matching network: 1dialogue: 1software agents: 1utterance to utterance: 1disentangle: 1sequence to sequence (seq2seq): 1linguistics: 1signal representation: 1adversarial training: 1mathematical model: 1natural languages: 1ordinary differential equations: 1ffjord: 1ode: 1generative models: 1deep neural network: 1hidden markov model: 1attention: 1mel spectrogram: 1neural waveform generator: 1wavernn: 1multiple target learning: 1inverse transforms: 1waveform generators: 1signal reconstruction: 1dnn: 1spectral enhancement: 1quantisation (signal): 1cats: 1codecs: 1cross channel: 1channel adversarial training: 1computer architecture: 1computational linguistics: 1text supervision: 1unsupervised learning: 1style transfer: 1
Most publications (all venues) at: 2020: 34, 2023: 28, 2021: 27, 2018: 24, 2019: 23


Recent publications

TASLP2024 Yang Ai, Zhen-Hua Ling
Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks.

TASLP2024 Zhaoci Liu, Liping Chen, Ya-Jun Hu, Zhen-Hua Ling, Jia Pan, 
PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS.

TASLP2024 Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement.

ICASSP2024 Shihao Chen, Liping Chen, Jie Zhang 0042, Kong-Aik Lee, Zhenhua Ling, Lirong Dai 0001, 
Adversarial Speech for Voice Privacy Protection from Personalized Speech Generation.

ICASSP2024 Liping Chen, Kong Aik Lee, Wu Guo, Zhen-Hua Ling
Modeling Pseudo-Speaker Uncertainty in Voice Anonymization.

ICASSP2024 Kangdi Mei, Zhaoci Liu, Hui-Peng Du, Hengyu Li, Yang Ai, Liping Chen, Zhenhua Ling
Considering Temporal Connection between Turns for Conversational Speech Synthesis.

ICASSP2024 Qing-Tian Xu, Jie Zhang 0042, Zhen-Hua Ling
An End-to-End EEG Channel Selection Method with Residual Gumbel Softmax for Brain-Assisted Speech Enhancement.

ACL-Findings2024 Qian Wang, Jia-Chen Gu, Zhen-Hua Ling
X-ACE: Explainable and Multi-factor Audio Captioning Evaluation.

TASLP2023 Yang Ai, Zhen-Hua Ling
APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra.

TASLP2023 Chang Liu, Zhen-Hua Ling, Ling-Hui Chen, 
Pronunciation Dictionary-Free Multilingual Speech Synthesis Using Learned Phonetic Representations.

ICASSP2023 Yang Ai, Zhen-Hua Ling
Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses.

ICASSP2023 Kangdi Mei, Xinyun Ding, Yinlong Liu, Zhiqiang Guo, Feiyang Xu, Xin Li 0064, Tuya Naren, Jiahong Yuan, Zhenhua Ling
The USTC System for ADReSS-M Challenge.

ICASSP2023 Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu 0006, 
Self-Supervised Audio-Visual Speech Representations Learning by Multimodal Self-Distillation.

ICASSP2023 Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
Speech Reconstruction from Silent Tongue and Lip Articulation by Pseudo Target Generation and Domain Adversarial Training.

Interspeech2023 Jie Zhang 0042, Qing-Tian Xu, Qiu-Shi Zhu, Zhen-Hua Ling
BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions.

Interspeech2023 Zhaoci Liu, Zhen-Hua Ling, Ya-Jun Hu, Jia Pan, Jin-Wei Wang, Yun-Di Wu, 
Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations.

Interspeech2023 Ye-Xin Lu, Yang Ai, Zhen-Hua Ling
MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra.

Interspeech2023 Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation.

EMNLP-Findings2023 Yue Chen, Tianwei He, Hongbin Zhou, Jia-Chen Gu, Heng Lu 0002, Zhen-Hua Ling
Symbolization, Prompt, and Classification: A Framework for Implicit Speaker Identification in Novels.

TASLP2022 Yang Ai, Zhen-Hua Ling, Wei-Lu Wu, Ang Li, 
Denoising-and-Dereverberation Hierarchical Neural Vocoder for Statistical Parametric Speech Synthesis.

#34  | Li-Rong Dai 0001 | DBLP Google Scholar  
By venue: Interspeech: 31, ICASSP: 20, TASLP: 8, AAAI: 2, EMNLP: 1
By year: 2024: 6, 2023: 10, 2022: 11, 2021: 9, 2020: 10, 2019: 11, 2018: 5
ISCA sessionsanalysis of speech and audio signals: 2speech activity detection and modeling: 2novel models and training methods for asr: 2speaker and language recognition: 2speech synthesis: 2speaker recognition: 2speech recognition: 1multimodal speech emotion recognition and paralinguistics: 1low-resource asr development: 1multimodal systems: 1language and accent recognition: 1acoustic event detection and acoustic scene classification: 1openasr20 and low resource asr development: 1learning techniques for speaker recognition: 1voice conversion and adaptation: 1asr neural network architectures and training: 1acoustic event detection: 1corpus annotation and evaluation: 1speaker recognition and diarization: 1singing and multimodal synthesis: 1speaker verification using neural network methods: 1representation learning for emotion: 1voice conversion and speech synthesis: 1novel neural network architectures for acoustic modelling: 1speech synthesis paradigms and methods: 1
IEEE keywordsspeech recognition: 12speaker recognition: 6representation learning: 4data models: 4voice conversion: 4self supervised pre training: 4speaker verification: 4speech enhancement: 3noise measurement: 3noise robustness: 3supervised learning: 3sequence to sequence: 3speech synthesis: 3convolutional neural nets: 3audio signal processing: 3attention: 3task analysis: 2text to speech: 2computational modeling: 2knowledge distillation: 2robustness: 2adaptation models: 2end to end: 2time domain analysis: 2wav2vec2.0: 2deep learning (artificial intelligence): 2tacotron: 2unit selection: 2autoregressive processes: 2signal representation: 2gaussian processes: 2speech separation: 2sound event detection: 2audio tagging: 2recurrent neural nets: 2speech text joint pre training: 1transformers: 1speech translation: 1discrete tokenization: 1unified modeling language: 1glass box: 1personalized speech generation: 1voice privacy: 1rendering (computer graphics): 1adversary attack: 1perturbation methods: 1privacy: 1couplings: 1harmonic analysis: 1variational autoencoder: 1adversarial learning: 1neural source filter model: 1synthesizers: 1singing voice synthesis: 1vits: 1spatiotemporal phenomena: 1microphone arrays: 1direction of arrival estimation: 1hearing aids: 1target speaker extraction: 1doa mis match: 1spatiotemporal features: 1microphone array: 1uncertainty: 1pre training: 1cross lingual: 1cross modal: 1decoding: 1speech to text translation: 1time frequency analysis: 1wiener filter: 1gevd: 1wiener filters: 1speech distortion: 1mean square error: 1signal to noise ratio: 1correlation: 1low rank approximation: 1degradation: 1stargan: 1domain adaptation: 1performance evaluation: 1data augmentation: 1recording: 1audio visual speech enhancement: 1multi branch: 1upper bound: 1fourier transforms: 1multi modality: 1lightweight model: 1multi scale: 1visualization: 1streaming media: 1automatic speech recognition: 1contrastive learning: 1analytical models: 1knowledge based systems: 1self supervised learning: 1anomalous sound detection: 1covid 19: 1binary classification: 1cepstral analysis: 1medical signal processing: 1audio recording: 1respiratory diagnosis: 1supervised pre training: 1label smoothing: 1unsupervised domain adaptation: 1speech emotion recognition: 1emotion recognition: 1signal reconstruction: 1style transformation: 1convolutional neural network: 1disentanglement: 1speech representation: 1sequence alignment: 1probability: 1multi granularity: 1post inference: 1inference mechanisms: 1end to end asr: 1encoder decoder: 1hidden markov models: 1dense residual networks: 1model ensemble: 1embedding learning: 1disentangle: 1sequence to sequence (seq2seq): 1voice activity detection: 1linguistics: 1adversarial training: 1ctc: 1matrix algebra: 1scaling: 1model adaptation: 1dilated convolution: 1baum welch statistics: 1attention mechanism: 1source separation: 1time domain: 1sparse encoder: 1speaker identification: 1target tracking: 1semi supervised learning: 1weakly labeled: 1deep neural network: 1hidden markov model: 1computational auditory scene analysis: 1label permutation problem: 1vocoders: 1mel spectrogram: 1weakly labelled data: 1signal classification: 1text analysis: 1computational linguistics: 1neural network: 1text supervision: 1natural language processing: 1
Most publications (all venues) at: 2016: 34, 2014: 33, 2015: 26, 2023: 25, 2018: 25

Affiliations
University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China
URLs

Recent publications

TASLP2024 Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu 0012, Shuo Ren, Shujie Liu 0001, Zhuoyuan Yao, Xun Gong 0005, Li-Rong Dai 0001, Jinyu Li 0001, Furu Wei, 
SpeechLM: Enhanced Speech Pre-Training With Unpaired Textual Data.

ICASSP2024 Shihao Chen, Liping Chen, Jie Zhang 0042, Kong-Aik Lee, Zhenhua Ling, Lirong Dai 0001
Adversarial Speech for Voice Privacy Protection from Personalized Speech Generation.

ICASSP2024 Jianwei Cui, Yu Gu, Chao Weng, Jie Zhang 0042, Liping Chen, Lirong Dai 0001
Sifisinger: A High-Fidelity End-to-End Singing Voice Synthesizer Based on Source-Filter Model.

ICASSP2024 Yichi Wang, Jie Zhang 0042, Shihao Chen, Weitai Zhang, Zhongyi Ye, Xinyuan Zhou, Lirong Dai 0001
A Study of Multichannel Spatiotemporal Features and Knowledge Distillation on Robust Target Speaker Extraction.

ICASSP2024 Weitai Zhang, Hanyi Zhang, Chenxuan Liu, Zhongyi Ye, Xinyuan Zhou, Chao Lin, Lirong Dai 0001
Pre-Trained Acoustic-and-Textual Modeling for End-To-End Speech-To-Text Translation.

AAAI2024 Qiushi Zhu, Jie Zhang 0042, Yu Gu, Yuchen Hu, Lirong Dai 0001
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation.

TASLP2023 Jie Zhang 0042, Rui Tao, Jun Du, Li-Rong Dai 0001
SDW-SWF: Speech Distortion Weighted Single-Channel Wiener Filter for Noise Reduction.

TASLP2023 Qiu-Shi Zhu, Jie Zhang 0042, Ziqiang Zhang, Li-Rong Dai 0001
A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition.

ICASSP2023 Hang-Rui Hu, Yan Song 0001, Jian-Tao Zhang, Li-Rong Dai 0001, Ian McLoughlin 0001, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, 
Stargan-vc Based Cross-Domain Data Augmentation for Speaker Verification.

ICASSP2023 Haitao Xu, Liangfa Wei, Jie Zhang 0042, Jianming Yang, Yannan Wang, Tian Gao, Xin Fang, Li-Rong Dai 0001
A Multi-Scale Feature Aggregation Based Lightweight Network for Audio-Visual Speech Enhancement.

ICASSP2023 Qiu-Shi Zhu, Long Zhou, Jie Zhang 0042, Shujie Liu 0001, Yu-Chen Hu, Li-Rong Dai 0001
Robust Data2VEC: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning.

Interspeech2023 Kang Li, Yan Song 0001, Ian McLoughlin 0001, Lin Liu 0017, Jin Li, Li-Rong Dai 0001
Fine-tuning Audio Spectrogram Transformer with Task-aware Adapters for Sound Event Detection.

Interspeech2023 Mohan Shi, Zhihao Du, Qian Chen 0003, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang 0042, Li-Rong Dai 0001
CASA-ASR: Context-Aware Speaker-Attributed ASR.

Interspeech2023 Mohan Shi, Yuchun Shu, Lingyun Zuo, Qian Chen 0003, Shiliang Zhang, Jie Zhang 0042, Li-Rong Dai 0001
Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction.

Interspeech2023 Jingyuan Wang, Jie Zhang 0042, Li-Rong Dai 0001
Real-Time Causal Spectro-Temporal Voice Activity Detection Based on Convolutional Encoding and Residual Decoding.

Interspeech2023 Xiao-Min Zeng, Yan Song 0001, Ian McLoughlin 0001, Lin Liu 0017, Li-Rong Dai 0001
Robust Prototype Learning for Anomalous Sound Detection.

ICASSP2022 Han Chen, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Self-Supervised Representation Learning for Unsupervised Anomalous Sound Detection Under Domain Shift.

ICASSP2022 Xing-Yu Chen, Qiu-Shi Zhu, Jie Zhang 0042, Li-Rong Dai 0001
Supervised and Self-Supervised Pretraining Based Covid-19 Detection Using Acoustic Breathing/Cough/Speech Signals.

ICASSP2022 Hang-Rui Hu, Yan Song 0001, Ying Liu, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Domain Robust Deep Embedding Learning for Speaker Recognition.

ICASSP2022 Yuxuan Xi, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Frontend Attributes Disentanglement for Speech Emotion Recognition.

#35  | Sriram Ganapathy | DBLP Google Scholar  
By venue: Interspeech: 35, ICASSP: 18, TASLP: 5, SpeechComm: 2, EMNLP: 1
By year: 2024: 5, 2023: 5, 2022: 11, 2021: 11, 2020: 12, 2019: 11, 2018: 6
ISCA sessionsspeaker diarization: 3the first dicova challenge: 2feature extraction and distant asr: 2the second dihard speech diarization challenge (dihard ii): 2speaker verification: 2speaker and language diarization: 1neural processing of speech and language: 1speaker and language identification: 1voice conversion and adaptation: 1low-resource asr development: 1show and tell: 1speech and language in health: 1robust asr, and far-field/multi-talker asr: 1atypical speech detection: 1non-intrusive objective speech quality assessment (nisqa) challenge for online conferencing applications: 1survey talk: 1conferencingspeech 2021 challenge: 1spoken language understanding: 1language learning: 1speaker recognition: 1speech and voice disorders: 1feature extraction for asr: 1speaker recognition and anti-spoofing: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1speaker and language recognition: 1neural network training strategies for asr: 1second language acquisition and code-switching: 1perspective talk: 1distant asr: 1
IEEE keywordsspeech recognition: 9speaker recognition: 6task analysis: 5speaker diarization: 4natural language processing: 4self supervised learning: 3deep learning (artificial intelligence): 3signal representation: 3semantics: 2joint modeling: 2dereverberation: 2end to end asr: 2reverberation: 2predictive models: 2audio recording: 2representation learning: 2text analysis: 2pattern clustering: 2iterative methods: 2supervised learning: 2autoregressive processes: 2recurrent neural nets: 2filtering theory: 2natural languages: 2time frequency analysis: 2modulation filtering: 2talker change detection: 2hidden unit clustering: 1context modeling: 1zerospeech. low resource asr: 1transformers: 1hidden markov models: 1contrastive loss: 1mirrors: 1frequency domain analysis: 1convolution: 1frequency domain auto regressive modeling: 1analytical models: 1video on demand: 1multimodal modeling: 1low resource languages: 1web sites: 1metadata: 1language identification: 1disentangled representation learning: 1speech emotion modeling: 1training data: 1data models: 1decoding: 1style transfer: 1supervised hierarchical clustering: 1clustering algorithms: 1benchmark testing: 1signal processing algorithms: 1graph neural networks: 1multi modal emotion recognition: 1transformer networks: 1emotion recognition: 1learnable front end: 1self attention models: 1contrastive predictive coding: 1prediction theory: 1deep clustering: 1zerospeech challenge: 1end to end automatic speech recognition: 1convolutional neural nets: 1frequency domain linear prediction (fdlp): 1machine learning: 1pattern classification: 1covid 19: 1healthcare: 1audio signal processing: 1respiratory diagnosis: 1signal classification: 1graph structural clustering: 1path integral clustering: 1optimisation: 12 stage relevance weighting: 1feature selection: 1raw speech waveform: 1feedback of acoustic embeddings: 1speech representation learning: 1transfer learning: 1buildings: 1vocoders: 1end to end modeling: 1synthesizers: 1lyrics transcription: 1voice to singing style transfer: 1forensics: 1harmonic analysis: 1nisp dataset: 1frequency measurement: 1physical parameters: 1linguistics: 1speaker profiling: 1voice forensics: 1canonical correlation analysis (cca): 1multi way cca: 1medical signal processing: 1electroencephalography: 1neurophysiology: 1audio eeg analysis: 1deep cca: 1automatic speech recognition: 1acoustic phonetics: 1relevance modeling: 1raw waveform processing: 1kernel: 1deep representation learning: 1modulation: 1urban sound classification: 1i vectors: 1long short term memory (lstm) networks: 1sequence modeling: 1attention networks: 1language recognition: 1human versus machine: 1language familiarity: 1response time: 1hearing: 1speech intelligibility: 1benchmarking speaker diarization: 1parameter estimation: 1singing voice separation: 1speech separation: 1robust speech recognition: 1unsupervised filter learning: 1convolutional variational autoencoder: 1skip connections: 1support vector machines: 1regression analysis: 1deep neural network: 1automatic joint height and age estimation: 1support vector regression: 1short duration: 1gaussian processes: 1hierarchical gru: 1end to end language identification: 1attention: 1dimensionality reduction: 1probability: 1plda scoring: 1gaussian distribution: 1gaussian back end: 1speaker verification.: 1linear discriminant analysis: 1x vectors: 1reaction time: 1oral communication: 1digital audio broadcasting: 1speech analysis: 1random forest regression: 1
Most publications (all venues) at: 2020: 23, 2022: 16, 2021: 16, 2019: 16, 2023: 13

Affiliations
URLs

Recent publications

SpeechComm2024 Shikha Baghel, Shreyas Ramoji, Somil Jain, Pratik Roy Chowdhuri, Prachi Singh, Deepu Vijayasenan, Sriram Ganapathy
Summary of the DISPLACE challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments.

TASLP2024 Varun Krishna, Tarun Sai, Sriram Ganapathy
Representation Learning With Hidden Unit Clustering for Low Resource Speech Applications.

TASLP2024 Anurenjan Purushothaman, Debottam Dutta, Rohit Kumar, Sriram Ganapathy
Speech Dereverberation With Frequency Domain Autoregressive Modeling.

ICASSP2024 Shikhar Bharadwaj, Min Ma, Shikhar Vashishth, Ankur Bapna, Sriram Ganapathy, Vera Axelrod, Siddharth Dalmia, Wei Han, Yu Zhang, Daan van Esch, Sandy Ritchie, Partha Talukdar, Jason Riesa, 
Multimodal Modeling for Spoken Language Identification.

ICASSP2024 Soumya Dutta, Sriram Ganapathy
Zero Shot Audio To Audio Emotion Transfer With Speaker Disentanglement.

ICASSP2023 Prachi Singh, Amrit Kaul, Sriram Ganapathy
Supervised Hierarchical Clustering Using Graph Neural Networks for Speaker Diarization.

Interspeech2023 Shikha Baghel, Shreyas Ramoji, Sidharth, Ranjana H, Prachi Singh, Somil Jain, Pratik Roy Chowdhuri, Kaustubh Kulkarni, Swapnil Padhi, Deepu Vijayasenan, Sriram Ganapathy
The DISPLACE Challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments.

Interspeech2023 Akshara Soman, Vidhi Sinha, Sriram Ganapathy
Enhancing the EEG Speech Match Mismatch Tasks With Word Boundaries.

Interspeech2023 Shikhar Vashishth, Shikhar Bharadwaj, Sriram Ganapathy, Ankur Bapna, Min Ma, Wei Han 0002, Vera Axelrod, Partha Talukdar, 
Label Aware Speech Representation Learning For Language Identification.

EMNLP2023 Darshan Prabhu, Preethi Jyothi, Sriram Ganapathy, Vinit Unni, 
Accented Speech Recognition With Accent-specific Codebooks.

ICASSP2022 Soumya Dutta, Sriram Ganapathy
Multimodal Transformer with Learnable Frontend and Self Attention for Emotion Recognition.

ICASSP2022 Varun Krishna, Sriram Ganapathy
Self Supervised Representation Learning with Deep Clustering for Acoustic Unit Discovery from Raw Speech.

ICASSP2022 Rohit Kumar, Anurenjan Purushothaman, Anirudh Sreeram, Sriram Ganapathy
End-To-End Speech Recognition with Joint Dereverberation of Sub-Band Autoregressive Envelopes.

ICASSP2022 Neeraj Kumar Sharma 0001, Srikanth Raj Chetupalli, Debarpan Bhattacharya, Debottam Dutta, Pravin Mote, Sriram Ganapathy
The Second Dicova Challenge: Dataset and Performance Analysis for Diagnosis of Covid-19 Using Acoustics.

Interspeech2022 Shrutina Agarwal, Naoya Takahashi, Sriram Ganapathy
Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer.

Interspeech2022 Tarun Sai Bandarupalli, Shakti Rath, Nirmesh Shah, Naoyuki Onoe, Sriram Ganapathy
Semi-supervised Acoustic and Language Modeling for Hindi ASR.

Interspeech2022 Debarpan Bhattacharya, Debottam Dutta, Neeraj Kumar Sharma 0001, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K. K, Sadhana Gonuguntla, Murali Alagesan, 
Coswara: A website application enabling COVID-19 screening by analysing respiratory sound samples and health symptoms.

Interspeech2022 Debarpan Bhattacharya, Debottam Dutta, Neeraj Kumar Sharma 0001, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K. K, Sadhana Gonuguntla, Murali Alagesan, 
Analyzing the impact of SARS-CoV-2 variants on respiratory sound signals.

Interspeech2022 Srikanth Raj Chetupalli, Sriram Ganapathy
Speaker conditioned acoustic modeling for multi-speaker conversational ASR.

Interspeech2022 Debottam Dutta, Debarpan Bhattacharya, Sriram Ganapathy, Amir Hossein Poorjam, Deepak Mittal, Maneesh Singh 0001, 
Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection.

#36  | Najim Dehak | DBLP Google Scholar  
By venue: Interspeech: 44, ICASSP: 14, TASLP: 2
By year: 2024: 1, 2023: 5, 2022: 6, 2021: 13, 2020: 12, 2019: 13, 2018: 10
ISCA sessionstrustworthy speech processing: 3robust speaker recognition: 2speaker recognition and diarization: 2non-autoregressive sequential modeling for speech processing: 2the attacker’s perspective on automatic speaker verification: 2speaker verification: 2speaker state and trait: 2language identification and diarization: 1speech recognition: 1pathological speech analysis: 1speaker recognition: 1speech, voice, and hearing disorders: 1self supervision and anti-spoofing: 1voice activity detection and keyword spotting: 1the adresso challenge: 1embedding and network architecture for speaker recognition: 1voice anti-spoofing and countermeasure: 1lm adaptation, lexical units and punctuation: 1the zero resource speech challenge 2020: 1speaker embedding: 1alzheimer’s dementia recognition through spontaneous speech: 1phonetic event detection and segmentation: 1spoken term detection: 1speaker recognition and anti-spoofing: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1speech and voice disorders: 1representation learning of emotion and paralinguistics: 1the voices from a distance challenge: 1speaker recognition evaluation: 1nn architectures for asr: 1language identification: 1representation learning for emotion: 1deep neural networks: 1the first dihard speech diarization challenge: 1extracting information from audio: 1topics in speech recognition: 1
IEEE keywordsspeaker recognition: 11speech recognition: 5emotion recognition: 3transfer learning: 3speaker verification: 3decoding: 2automatic speech recognition: 2supervised learning: 2signal denoising: 2perceptual loss: 2natural language processing: 2speech enhancement: 2feature enhancement: 2microphones: 2end to end: 1transformers: 1non autoregressive model: 1iterative refinement: 1attractor mechanism: 1estimation: 1self attention: 1clustering: 1recording: 1speaker diarization: 1object detection: 1attention: 1connectionist temporal classification: 1regularization: 1understanding: 1text to speech: 1unsupervised learning: 1speech synthesis: 1multilingual: 1zero shot learning: 1phonotactics: 1self supervised features: 1pre trained networks: 1multi task learning: 1deep learning (artificial intelligence): 1speech denoising: 1audio signal processing: 1signal classification: 1data augmentation: 1copypaste: 1x vector: 1language acquisition: 1spoken term discovery: 1low resource speech technology: 1multimodal learning: 1visualization: 1automatic speaker recognition: 1surgery: 1probability: 1tonsillectomy: 1septoplasty: 1sinus surgery: 1deep feature loss: 1i vectors: 1medical disorders: 1diseases: 1patient diagnosis: 1parkinson’s disease: 1medical signal processing: 1neurophysiology: 1speech: 1x vectors: 1channel bank filters: 1far field adaptation: 1dereverberation: 1data handling: 1cyclegan: 1linear discriminant analysis: 1pre trained: 1x vector: 1cold fusion: 1automatic speech recognition (asr): 1language model: 1shallow fusion: 1storage management: 1deep fusion: 1sequence to sequence: 1asvspoof: 1replay attacks: 1automatic speaker verification: 1security of data: 1spoofing attack: 1anti spoofing: 1filtering theory: 1telephone sets: 1bandwidth: 1deep residual cnn: 1blstm: 1spectrogram: 1bandwidth extension: 1generative adversarial neural networks (gans): 1unsupervised domain adaptation: 1cycle gans: 1
Most publications (all venues) at: 2018: 26, 2021: 24, 2019: 20, 2020: 19, 2022: 16


Recent publications

TASLP2024 Magdalena Rybicka, Jesús Villalba 0001, Thomas Thebaud, Najim Dehak, Konrad Kowalczyk, 
End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors.

Interspeech2023 Jesús Villalba 0001, Jonas Borgstrom, Maliha Jahan, Saurabh Kataria, Leibny Paola García, Pedro A. Torres-Carrasquillo, Najim Dehak
Advances in Language Recognition in Low Resource African Languages: The JHU-MIT Submission for NIST LRE22.

Interspeech2023 Saurabhchand Bhati, Jesús Villalba 0001, Laureano Moro-Velázquez, Thomas Thebaud, Najim Dehak
Segmental SpeechCLIP: Utilizing Pretrained Image-text Models for Audio-Visual Learning.

Interspeech2023 Anna Favaro, Tianyu Cao 0003, Thomas Thebaud, Jesús Villalba 0001, Ankur A. Butala, Najim Dehak, Laureano Moro-Velázquez, 
Do Phonatory Features Display Robustness to Characterize Parkinsonian Speech Across Corpora?

Interspeech2023 Saurabh Kataria, Jesús Villalba 0001, Laureano Moro-Velázquez, Thomas Thebaud, Najim Dehak
Self-FiLM: Conditioning GANs with self-supervised representations for bandwidth extension based speaker recognition.

Interspeech2023 Helin Wang, Thomas Thebaud, Jesús Villalba 0001, Myra Sydnor, Becky Lammers, Najim Dehak, Laureano Moro-Velázquez, 
DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model.

Interspeech2022 Jaejin Cho, Raghavendra Pappagari, Piotr Zelasko, Laureano Moro-Velázquez, Jesús Villalba 0001, Najim Dehak
Non-contrastive self-supervised learning of utterance-level speech representations.

Interspeech2022 Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesús Villalba 0001, Sanjeev Khudanpur, Najim Dehak
Defense against Adversarial Attacks on Hybrid Speech Recognition System using Adversarial Fine-tuning with Denoiser.

Interspeech2022 Sonal Joshi, Saurabh Kataria, Jesús Villalba 0001, Najim Dehak
AdvEst: Adversarial Perturbation Estimation to Classify and Detect Adversarial Attacks against Speaker Identification.

Interspeech2022 Saurabh Kataria, Jesús Villalba 0001, Laureano Moro-Velázquez, Najim Dehak
Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification.

Interspeech2022 Magdalena Rybicka, Jesús Villalba 0001, Najim Dehak, Konrad Kowalczyk, 
End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors.

Interspeech2022 Yiwen Shao, Jesús Villalba 0001, Sonal Joshi, Saurabh Kataria, Sanjeev Khudanpur, Najim Dehak
Chunking Defense for Adversarial Attacks on ASR.

ICASSP2021 Nanxin Chen, Piotr Zelasko, Jesús Villalba 0001, Najim Dehak
Focus on the Present: A Regularization Method for the ASR Source-Target Attention Layer.

ICASSP2021 Jaejin Cho, Piotr Zelasko, Jesús Villalba 0001, Najim Dehak
Improving Reconstruction Loss Based Speaker Embedding in Unsupervised and Semi-Supervised Scenarios.

ICASSP2021 Siyuan Feng 0001, Piotr Zelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak
How Phonotactics Affect Multilingual and Zero-Shot ASR Performance.

ICASSP2021 Saurabh Kataria, Jesús Villalba 0001, Najim Dehak
Perceptual Loss Based Speech Denoising with an Ensemble of Audio Pattern Recognition and Self-Supervised Models.

ICASSP2021 Raghavendra Pappagari, Jesús Villalba 0001, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak
CopyPaste: An Augmentation Method for Speech Emotion Recognition.

ICASSP2021 Liming Wang, Xinsheng Wang, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak
Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval.

Interspeech2021 Saurabhchand Bhati, Jesús Villalba 0001, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation.

Interspeech2021 Nanxin Chen, Piotr Zelasko, Laureano Moro-Velázquez, Jesús Villalba 0001, Najim Dehak
Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition.

#37  | Yuexian Zou | DBLP Google Scholar  
By venueInterspeech: 25ICASSP: 19ACL-Findings: 4TASLP: 3AAAI: 3ACL: 1NAACL-Findings: 1IJCAI: 1EMNLP-Findings: 1
By year2024: 62023: 152022: 112021: 152020: 72019: 22018: 2
ISCA sessionsacoustic event detection and classification: 3spoken dialogue systems: 3spoken dialog systems and conversational analysis: 2source separation: 2acoustic event detection: 2end-to-end spoken dialog systems: 1neural-based speech and acoustic analysis: 1analysis of speech and audio signals: 1speech synthesis: 1spoken language translation, information retrieval, summarization, resources, and evaluation: 1multi-, cross-lingual and other topics in asr: 1spoken term detection & voice search: 1acoustic event detection and acoustic scene classification: 1speech signal analysis and representation: 1speaker embedding: 1speech enhancement: 1the interspeech 2018 computational paralinguistics challenge (compare): 1source separation and spatial analysis: 1
IEEE keywordstask analysis: 6natural language processing: 5benchmark testing: 3multitask learning: 3spoken language understanding: 3audio signal processing: 3decoding: 2end to end: 2robustness: 2syntactics: 2filling: 2slot filling: 2multiple instance learning: 2supervised learning: 2speaker verification: 2question answering: 2text analysis: 2audio tagging: 2image segmentation: 2filtering theory: 2electronic mail: 1chatbots: 1chatgpt: 1engines: 1noise measurement: 1multimodal learning: 1pipelines: 1audio language dataset: 1transducers: 1bayes methods: 1discriminative training: 1automatic speech recognition: 1mutual information: 1maximum mutual information: 1minimum bayesian risk: 1sequential training: 1autoregressive model: 1diffusion model: 1measurement: 1vocoders: 1text to sound generation: 1transforms: 1spectrogram: 1vocoder: 1data models: 1machine translation: 1speech translation: 1data augmentation: 1mix at three levels: 1training data: 1cross modal matching: 1video music retrieval: 1label noise: 1self training: 1data mining: 1natural languages: 1retrieval based dialogue: 1syntactic dependency parsing: 1multi intent classification: 1correlation: 1semantics: 1stacking: 1bilin ear attention: 1features interaction: 1query processing: 1multiple intent detection: 1self distillation: 1internet: 1representation learning: 1keyword spotting: 1orthogonality regularization: 1sound event detection: 1signal detection: 1deep learning (artificial intelligence): 1mutual learning: 1few shot learning: 1transductive inference: 1machine reading comprehension: 1multi granularity representation: 1interactive systems: 1iteratively co interactive network: 1aspect based sentiment information: 1bert: 1weak labels: 1two stream framework: 1class wise attentional clips: 1image sequences: 1image recognition: 1action recognition: 1motion representation: 1video understanding: 1temporal modeling: 1object recognition: 1image representation: 1image motion analysis: 1video signal processing: 1knowledge discovery: 1spoken question answering: 1knowledge distillation: 1manuals: 1self supervised learning: 1unsupervised learning: 1speaker recognition: 1contrastive learning: 1curve text: 1text detection: 1two stage segmentation: 1scene text detection: 1self attention: 1speech recognition: 1speech enhancement: 1multi channel speech separation: 1inter channel convolution differences: 1reverberation: 1spatial filters: 1spatial features: 1weakly labelled data: 1object detection: 1convolutional neural nets: 1spatial attention: 1channel wise attention: 1structural correlations between images textures: 1generative adversarial networks: 1image texture: 1realistic images: 1semantic information preserved loss: 1s2pt: 1image resolution: 1correlated residual blocks: 1pattern clustering: 1object tracking: 1smc phd filter: 1probability: 1audio visual tracking: 1particle filtering (numerical methods): 1audio visual systems: 1particle flow: 1monte carlo methods: 1
Most publications (all venues) at2023: 422021: 372022: 352019: 302024: 28


Recent publications

TASLP2024 Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang 0001, 
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

AAAI2024 Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yaowei Li, Xianwei Zhuang, Yuexian Zou
Towards Multi-Intent Spoken Language Understanding via Hierarchical Attention and Optimal Transport.

AAAI2024 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren 0006, Yuexian Zou, Zhou Zhao, Shinji Watanabe 0001, 
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.

ACL2024 Xianwei Zhuang, Xuxin Cheng, Liming Liang, Yuxin Xie, Zhichang Wang, Zhiqi Huang, Yuexian Zou
PCAD: Towards ASR-Robust Spoken Language Understanding via Prototype Calibration and Asymmetric Decoupling.

ACL-Findings2024 Xuxin Cheng, Zhihong Zhu, Bang Yang, Xianwei Zhuang, Hongxiang Li, Yuexian Zou
Cyclical Contrastive Learning Based on Geodesic for Zero-shot Cross-lingual Spoken Language Understanding.

ACL-Findings2024 Xuxin Cheng, Zhihong Zhu, Xianwei Zhuang, Zhanpeng Chen, Zhiqi Huang, Yuexian Zou
MoE-SLU: Towards ASR-Robust Spoken Language Understanding via Mixture-of-Experts.

TASLP2023 Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu 0001, 
Integrating Lattice-Free MMI Into End-to-End Speech Recognition.

TASLP2023 Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu 0001, 
Diffsound: Discrete Diffusion Model for Text-to-Sound Generation.

ICASSP2023 Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou
M3ST: Mix at Three Levels for Speech Translation.

ICASSP2023 Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yaowei Li, Yuexian Zou
SSVMR: Saliency-Based Self-Training for Video-Music Retrieval.

ICASSP2023 Tengtao Song, Nuo Chen 0001, Ji Jiang, Zhihong Zhu, Yuexian Zou
Improving Retrieval-Based Dialogue System Via Syntax-Informed Attention.

ICASSP2023 Zhihong Zhu, Weiyuan Xu, Xuxin Cheng, Tengtao Song, Yuexian Zou
A Dynamic Graph Interactive Framework with Label-Semantic Injection for Spoken Language Understanding.

Interspeech2023 Xuxin Cheng, Ziyu Yao 0001, Zhihong Zhu, Yaowei Li, Hongxiang Li, Yuexian Zou
C²A-SLU: Cross and Contrastive Attention for Improving ASR Robustness in Spoken Language Understanding.

Interspeech2023 Xuxin Cheng, Wanshi Xu, Ziyu Yao 0001, Zhihong Zhu, Yaowei Li, Hongxiang Li, Yuexian Zou
FC-MTLF: A Fine- and Coarse-grained Multi-Task Learning Framework for Cross-Lingual Spoken Language Understanding.

Interspeech2023 Xuxin Cheng, Zhihong Zhu, Ziyu Yao 0001, Hongxiang Li, Yaowei Li, Yuexian Zou
GhostT5: Generate More Features with Cheap Operations to Improve Textless Spoken Question Answering.

Interspeech2023 Yifei Xin, Dongchao Yang, Yuexian Zou
Background-aware Modeling for Weakly Supervised Sound Event Detection.

Interspeech2023 Yifei Xin, Yuexian Zou
Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions.

Interspeech2023 Dongchao Yang, Songxiang Liu, Helin Wang, Jianwei Yu, Chao Weng, Yuexian Zou
NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS.

Interspeech2023 Zhihong Zhu, Xuxin Cheng, Dongsheng Chen, Zhiqi Huang, Hongxiang Li, Yuexian Zou
Mix before Align: Towards Zero-shot Cross-lingual Sentiment Analysis via Soft-Mix and Multi-View Learning.

ACL-Findings2023 Xuxin Cheng, Bowen Cao, Qichen Ye, Zhihong Zhu, Hongxiang Li, Yuexian Zou
ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding.

#38  | Tan Lee 0001 | DBLP Google Scholar  
By venueInterspeech: 36ICASSP: 15TASLP: 5
By year2024: 32023: 112022: 112021: 52020: 102019: 122018: 4
ISCA sessionsspeech synthesis: 7speech emotion recognition: 4embedding and network architecture for speaker recognition: 2multimodal speech emotion recognition and paralinguistics: 2speaker and language identification: 1speech recognition: 1connecting speech-science and speech-technology for children's speech: 1phonetics: 1speaker and language recognition: 1atypical speech detection: 1assessment of pathological speech and language: 1voice anti-spoofing and countermeasure: 1dnn architectures for speaker recognition: 1language learning: 1speech, language, and multimodal resources: 1zero-resource asr: 1the zero resource speech challenge 2019: 1network architectures for emotion and paralinguistics recognition: 1speech and language analytics for medical applications: 1speech and voice disorders: 1model adaptation for asr: 1adjusting to speaker, accent, and domain: 1zero-resource speech recognition: 1deception, personality, and culture attribute: 1speech pathology, depression, and medical applications: 1
IEEE keywordsspeech recognition: 8speaker recognition: 4adaptation models: 3speaker verification: 3hidden markov models: 3databases: 3signal classification: 3gaussian processes: 3automatic speech recognition: 2matrix decomposition: 2noise measurement: 2timbre: 2labeling: 2speech synthesis: 2emotion recognition: 2bayes methods: 2bayesian learning: 2lhuc: 2convolutional neural nets: 2speaker adaptation: 2unsupervised learning: 2natural language processing: 2black box models: 1domain adaptation: 1computational modeling: 1memory management: 1closed box: 1estimation: 1signal processing algorithms: 1reprogramming: 1costs: 1parameter efficient fine tuning: 1child speech: 1sparse matrices: 1whisper: 1lora: 1tongue: 1learning from noisy labels: 1articulation disorder: 1neck: 1personalized speech synthesis: 1speech: 1emotion transfer: 1emotion intensity: 1cross speaker: 1zero shot: 1prosody: 1channel frequency attention: 1resnet: 1text independent speaker verification: 1convolution: 1convolutional neural networks: 1phone embed ding: 1production: 1linguistic acoustic similarity: 1goodness of pronunciation: 1transformers: 1acoustic measurements: 1pronunciation scoring: 1fluency scoring: 1indexes: 1self supervised learning: 1non native speech: 1predictive models: 1training data: 1plda: 1system performance: 1correlation: 1regularization: 1probabilistic logic: 1soft labeling: 1learning with noisy labels: 1segment based speech emotion recognition: 1iterative self learning: 1task analysis: 1text analysis: 1text to speech: 1pre training: 1data reduction: 1adaptation: 1tdnn: 1switchboard: 1biometrics (access control): 1multivariate empirical mode decomposition: 1biometrics: 1medical signal processing: 1electroencephalography: 1resting state eeg: 1hilbert transforms: 1neurophysiology: 1connectivity: 1unsupervised deep factorization: 1unsupervised subword modeling: 1mixture factorized auto encoder: 1computational linguistics: 1signal reconstruction: 1speech coding: 1signal representation: 1acoustic scene classification: 1time frequency analysis: 1median filtering: 1wavelet transforms: 1feature decomposition: 1audio signal processing: 1sound duration: 1convolutional neural network: 1median filters: 1multi task learning: 1robust features: 1zero resource: 1voice assessment: 1posterior features: 1probability: 1continuous speech: 1acoustic features: 1dnn based asr system: 1speech emotion recognition: 1hybrid dnn hmm: 1subspace based gmm: 1pattern clustering: 1unsupervised adaptation: 1adversarial learning: 1language recognition: 1domain mismatch: 1aphasia: 1cnn: 1speech assessment: 1phone posteriorgrams: 1asr: 1maximum likelihood estimation: 1
Most publications (all venues) at2022: 192020: 182021: 162010: 152019: 14


Recent publications

ICASSP2024 Jingyu Li, Tan Lee 0001
Efficient Black-Box Speaker Verification Model Adaptation With Reprogramming And Backend Learning.

ICASSP2024 Wei Liu, Ying Qin, Zhiyuan Peng, Tan Lee 0001
Sparsely Shared Lora on Whisper for Child Speech Recognition.

ICASSP2024 Yusheng Tian, Jingyu Li, Tan Lee 0001
Creating Personalized Synthetic Voices from Articulation Impaired Speech Using Augmented Reconstruction Loss.

TASLP2023 Guangyan Zhang, Ying Qin, Wenjie Zhang, Jialun Wu, Mei Li, Yutao Gai, Feijun Jiang, Tan Lee 0001
iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis Based on Disentanglement Between Prosody and Timbre.

ICASSP2023 Jingyu Li, Yusheng Tian, Tan Lee 0001
Convolution-Based Channel-Frequency Attention for Text-Independent Speaker Verification.

ICASSP2023 Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li 0119, Zejun Ma, Tan Lee 0001
Leveraging Phone-Level Linguistic-Acoustic Similarity For Utterance-Level Pronunciation Scoring.

ICASSP2023 Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li 0119, Zejun Ma, Tan Lee 0001
An ASR-Free Fluency Scoring Approach with Self-Supervised Learning.

ICASSP2023 Zhiyuan Peng, Mingjie Shao, Xuanji He, Xu Li, Tan Lee 0001, Ke Ding, Guanglu Wan, 
Covariance Regularization for Probabilistic Linear Discriminant Analysis.

Interspeech2023 Jingyu Li, Wei Liu, Zhaoyang Zhang 0001, Jiong Wang, Tan Lee 0001
Model Compression for DNN-based Speaker Verification Using Weight Quantization.

Interspeech2023 Wei Liu, Zhiyuan Peng, Tan Lee 0001
CoMFLP: Correlation Measure Based Fast Search on ASR Layer Pruning.

Interspeech2023 Si Ioi Ng, Cymie Wing-Yee Ng, Tan Lee 0001
A Study on Using Duration and Formant Features in Automatic Detection of Speech Sound Disorder in Children.

Interspeech2023 Dehua Tao, Tan Lee 0001, Harold Chui, Sarah Luk, 
A Study on Prosodic Entrainment in Relation to Therapist Empathy in Counseling Conversation.

Interspeech2023 Yusheng Tian, Guangyan Zhang, Tan Lee 0001
Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models.

Interspeech2023 Yujia Xiao, Shaofei Zhang, Xi Wang 0016, Xu Tan 0003, Lei He 0005, Sheng Zhao, Frank K. Soong, Tan Lee 0001
ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.

TASLP2022 Shuiyang Mao, P. C. Ching, Tan Lee 0001
Enhancing Segment-Based Speech Emotion Recognition by Iterative Self-Learning.

ICASSP2022 Guangyan Zhang, Yichong Leng, Daxin Tan, Ying Qin, Kaitao Song, Xu Tan 0003, Sheng Zhao, Tan Lee 0001
A Study on the Efficacy of Model Pre-Training In Developing Neural Text-to-Speech System.

Interspeech2022 Jonathan Him Nok Lee, Dehua Tao, Harold Chui, Tan Lee 0001, Sarah Luk, Nicolette Wing Tung Lee, Koonkan Fung, 
Durational Patterning at Discourse Boundaries in Relation to Therapist Empathy in Psychotherapy.

Interspeech2022 Jingyu Li, Wei Liu, Tan Lee 0001
EDITnet: A Lightweight Network for Unsupervised Domain Adaptation in Speaker Verification.

Interspeech2022 Si Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee 0001
Automatic Detection of Speech Sound Disorder in Child Speech Using Posterior-based Speaker Representations.

Interspeech2022 Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee 0001, Guanglu Wan, 
Unifying Cosine and PLDA Back-ends for Speaker Verification.

#39  | Lukás Burget | DBLP Google Scholar  
By venueInterspeech: 31ICASSP: 21TASLP: 4
By year2024: 42023: 82022: 102021: 92020: 72019: 132018: 5
ISCA sessionsspeaker recognition and diarization: 3speaker recognition: 3embedding and network architecture for speaker recognition: 2large-scale evaluation of short-duration speaker verification: 2the voices from a distance challenge: 2multi-talker methods in speech processing: 1language identification and diarization: 1source separation: 1speaker and language identification: 1self-supervised, semi-supervised, adaptation and data augmentation for asr: 1speaker embedding and diarization: 1search/decoding algorithms for asr: 1robust speaker recognition: 1language modeling and text-based innovations for asr: 1linguistic components in end-to-end asr: 1graph and end-to-end learning for speaker recognition: 1sequence-to-sequence speech recognition: 1zero-resource asr: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1language modeling: 1the first dihard speech diarization challenge: 1topics in speech recognition: 1low resource speech recognition challenge for indian languages: 1speaker verification: 1
IEEE keywordsspeaker diarization: 8speaker recognition: 8bayes methods: 7speech recognition: 7variational bayes: 5hidden markov models: 5data models: 4speaker verification: 4natural language processing: 4pattern clustering: 4transformers: 3discriminative training: 3unsupervised learning: 3hmm: 3oral communication: 2end to end neural diarization: 2decoding: 2telephone sets: 2analytical models: 2clustering: 2training data: 2adaptation models: 2acoustic unit discovery: 2speaker embedding: 2dihard: 2optimisation: 2x vector: 2biological system modeling: 1attractor: 1diaper: 1perceiver: 1long short term memory: 1automatic speech recognition: 1switches: 1computational modeling: 1confidences measures: 1end to end systems: 1system fusion: 1limiting: 1lattices: 1estimation: 1tokenization: 1error correction: 1conversational telephone speech: 1source coding: 1calibration: 1telephony: 1standards: 1vbx: 1tuning: 1self supervised features: 1smoothing methods: 1hubert: 1emotion recognition: 1iemocap: 1wavlm: 1wav2vec 2.0: 1correlation: 1codes: 1simulated conversations: 1voice activity detection: 1transfer learning: 1adapter: 1fine tuning: 1pre trained model: 1nist: 1psda: 1von mises fisher: 1probabilistic logic: 1non parametric bayesian models: 1blind source separation: 1unsupervised target speech extraction: 1frequency domain analysis: 1cross domain: 1dpccn: 1mixture remix: 1speech separation: 1time domain analysis: 1beamforming: 1speech enhancement: 1multi channel: 1multisv: 1dataset: 1array signal processing: 1julia language: 1lattice free mmi: 1algebra: 1computer languages: 1end to end asr: 1signal processing algorithms: 1forward backward: 1linear programming: 1self supervision: 1speech synthesis: 1sequence to sequence: 1cycle consistency: 1voxsrc challenge: 1voxconverse: 1auxiliary loss: 1joint training: 1language translation: 1how2 dataset: 1coupled de coding: 1spoken language translation: 1asr objective: 1end to end differentiable pipeline: 1hierarchical subspace model: 1bayesian methods: 1text analysis: 1pattern classification: 1embeddings: 1gaussian distribution: 1topic identification: 1on the fly data augmentation: 1specaugment: 1convolutional neural nets: 1probability: 1linear discriminant analysis: 1chime: 1inference mechanisms: 1attention models: 1recurrent neural nets: 1softmax margin: 1beam search training: 1sequence learning: 1i vectors: 1i vector extractor: 1entropy: 1domain adaptation: 1neural net architecture: 1topology: 1deep neural network: 1tensorflow: 1kaldi: 1network topology: 1task analysis: 1
Most publications (all venues) at2019: 212022: 182018: 162011: 162016: 15


Recent publications

TASLP2024 Federico Landini, Mireia Díez, Themos Stafylakis, Lukás Burget
DiaPer: End-to-End Neural Diarization With Perceiver-Based Attractors.

ICASSP2024 Karel Benes, Martin Kocour, Lukás Burget
Hystoc: Obtaining Word Confidences for Fusion of End-To-End ASR Systems.

ICASSP2024 Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Díez, Lukás Burget, Yuhang Cao, Heng Lu, Jan Cernocký, 
Diacorrect: Error Correction Back-End for Speaker Diarization.

ICASSP2024 Dominik Klement, Mireia Díez, Federico Landini, Lukás Burget, Anna Silnova, Marc Delcroix, Naohiro Tawara, 
Discriminative Training of VBx Diarization.

ICASSP2023 Sofoklis Kakouros, Themos Stafylakis, Ladislav Mosner, Lukás Burget
Speech-Based Emotion Recognition with Self-Supervised Models Using Attentive Channel-Wise Correlations and Label Smoothing.

ICASSP2023 Federico Landini, Mireia Díez, Alicia Lozano-Diez, Lukás Burget
Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization.

ICASSP2023 Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldrich Plchot, Ladislav Mosner, Lukás Burget, Jan Cernocký, 
Parameter-Efficient Transfer Learning of Pre-Trained Transformer Models for Speaker Verification Using Adapters.

ICASSP2023 Anna Silnova, Niko Brümmer, Albert Swart, Lukás Burget
Toroidal Probabilistic Spherical Discriminant Analysis.

Interspeech2023 Marc Delcroix, Naohiro Tawara, Mireia Díez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukás Burget, Shoko Araki, 
Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization.

Interspeech2023 Pavel Matejka, Anna Silnova, Josef Slavícek, Ladislav Mosner, Oldrich Plchot, Michal Klco, Junyi Peng, Themos Stafylakis, Lukás Burget
Description and Analysis of ABC Submission to NIST LRE 2022.

Interspeech2023 Ladislav Mosner, Oldrich Plchot, Junyi Peng, Lukás Burget, Jan Cernocký, 
Multi-Channel Speech Separation with Cross-Attention and Beamforming.

Interspeech2023 Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukás Burget, Jan Cernocký, 
Improving Speaker Verification with Self-Pretrained Transformer Models.

TASLP2022 Lucas Ondel, Bolaji Yusuf, Lukás Burget, Murat Saraçlar, 
Non-Parametric Bayesian Subspace Models for Acoustic Unit Discovery.

ICASSP2022 Jiangyu Han, Yanhua Long, Lukás Burget, Jan Cernocký, 
DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation and Extraction.

ICASSP2022 Ladislav Mosner, Oldrich Plchot, Lukás Burget, Jan Honza Cernocký, 
Multisv: Dataset for Far-Field Multi-Channel Speaker Verification.

ICASSP2022 Lucas Ondel, Léa-Marie Lam-Yee-Mui, Martin Kocour, Caio Filippo Corro, Lukás Burget
GPU-Accelerated Forward-Backward Algorithm with Application to Lattice-Free MMI.

Interspeech2022 Murali Karthick Baskar, Tim Herzig, Diana Nguyen, Mireia Díez, Tim Polzehl, Lukás Burget, Jan Cernocký, 
Speaker adaptation for Wav2vec2 based dysarthric ASR.

Interspeech2022 Niko Brummer, Albert Swart, Ladislav Mosner, Anna Silnova, Oldrich Plchot, Themos Stafylakis, Lukás Burget
Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddings.

Interspeech2022 Martin Kocour, Katerina Zmolíková, Lucas Ondel, Jan Svec, Marc Delcroix, Tsubasa Ochiai, Lukás Burget, Jan Cernocký, 
Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model.

Interspeech2022 Federico Landini, Alicia Lozano-Diez, Mireia Díez, Lukás Burget
From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization.

#40  | Xixin Wu | DBLP Google Scholar  
By venueInterspeech: 25ICASSP: 24TASLP: 6ICML: 1
By year2024: 62023: 102022: 102021: 72020: 82019: 112018: 4
ISCA sessionsspeech synthesis: 5speech and language in health: 2spoken language translation, information retrieval, summarization, resources, and evaluation: 1multi-talker methods in speech processing: 1spoofing-aware automatic speaker verification (sasv): 1voice anti-spoofing and countermeasure: 1non-autoregressive sequential modeling for speech processing: 1speech recognition of atypical speech: 1automatic speech recognition for non-native children’s speech: 1speaker recognition: 1spoken language evaluation: 1learning techniques for speaker recognition: 1asr neural network architectures: 1neural techniques for voice conversion and waveform generation: 1speech and audio classification: 1lexicon and language model for speech recognition: 1second language acquisition and code-switching: 1voice conversion: 1expressive speech synthesis: 1application of asr in medical practice: 1
IEEE keywordsspeech synthesis: 10speech recognition: 9speaker recognition: 7emotion recognition: 6speech coding: 5speech emotion recognition: 5dysarthric speech reconstruction: 4voice conversion: 4recurrent neural nets: 4expressive speech synthesis: 3data models: 3adaptation models: 3decoding: 3linguistics: 3predictive models: 3semantics: 3task analysis: 2vq vae: 2cloning: 2text to speech: 2vae: 2representation learning: 2vocoders: 2computational modeling: 2self supervised learning: 2knowledge distillation: 2deep learning (artificial intelligence): 2optimisation: 2adversarial attack: 2natural language processing: 2speech intelligibility: 2code switching: 2gaussian processes: 2entropy: 2convolutional neural nets: 2multi modal: 1av hubert: 1transforms: 1audio visual: 1visualization: 1training data: 1speech enhancement: 1data mining: 1pre training: 1self supervised style enhancing: 1spectrogram: 1speaker adaptation: 1language model: 1zero shot: 1timbre: 1multi scale acoustic prompts: 1speech disentanglement: 1voice cloning: 1speech normalization: 1speech units: 1perturbation methods: 1pipelines: 1speech representation learning: 1neural tts: 1vector quantization: 1multi stage multi codebook (msmc): 1speech representation: 1context modeling: 1style modeling: 1hidden markov models: 1hierarchical: 1bit error rate: 1multi scale: 1automatic speech recognition: 1neural machine translation: 1transformer: 1transformers: 1hierarchical attention mechanism: 1machine translation: 1alzheimer’s disease: 1sociology: 1syntactics: 1task oriented: 1transfer learning: 1pretrained embeddings: 1multimodality: 1affective computing: 1multi label: 1bidirectional control: 1multi task learning: 1emotional expression: 1multi culture: 1vocal bursts: 1data analysis: 1degradation: 1fast: 1complexity theory: 1lightweight: 1error analysis: 1domain adaptation: 1end to end speech recognition: 1multi talker speech recognition: 1particle separators: 1speech separation: 1speaker change detection: 1audio signal processing: 1multitask learning: 1unsupervised learning: 1unsupervised speech decomposition: 1adversarial speaker adaptation: 1speaker identity: 1neural architecture search: 1uniform sampling: 1path dropout: 1design methodology: 1benchmark testing: 1robustness: 1feature fusion: 1multi channel: 1voice activity detection: 1data handling: 1overlapped speech: 1m2met: 1speaker diarization: 1any to many: 1sequence to sequence modeling: 1signal reconstruction: 1signal sampling: 1signal representation: 1location relative attention: 1capsule: 1exemplary emotion descriptor: 1residual error: 1capsule network: 1spatial information: 1sequential: 1recurrent: 1phonetic pos teriorgrams: 1x vector: 1gmm i vector: 1speaker verification: 1accent conversion: 1accented speech recognition: 1cross modal: 1seq2seq: 1multilingual speech synthesis: 1end to end: 1foreign accent: 1spectral analysis: 1center loss: 1human computer interaction: 1discriminative features: 1bayes methods: 1gaussian process neural network: 1activation function selection: 1inference mechanisms: 1bayesian neural network: 1variational inference: 1quasifully recurrent neural network (qrnn): 1parallel processing: 1parallel wavenet: 1text to speech (tts) synthesis: 1convolutional neural network (cnn): 1utterance level features: 1spatial relationship information: 1recurrent connection: 1capsule networks: 1natural gradient: 1rnnlms: 1gradient methods: 1
Most publications (all venues) at2024: 272023: 192022: 172019: 132021: 10


Recent publications

ICASSP2024 Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction.

ICASSP2024 Xueyuan Chen, Xi Wang 0016, Shaofei Zhang, Lei He 0005, Zhiyong Wu 0001, Xixin Wu, Helen Meng, 
Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.

ICASSP2024 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Dan Luo, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han 0001, Helen Meng, 
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts.

ICASSP2024 Hui Lu, Xixin Wu, Haohan Guo, Songxiang Liu, Zhiyong Wu 0001, Helen Meng, 
Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations.

ICASSP2024 Yuejiao Wang, Xixin Wu, Disong Wang, Lingwei Meng, Helen Meng, 
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization.

ICML2024 Dongchao Yang, Jinchuan Tian, Xu Tan 0003, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian 0002, Zhou Zhao, Xixin Wu, Helen M. Meng, 
UniAudio: Towards Universal Audio Generation with Large Language Models.

TASLP2023 Haohan Guo, Fenglong Xie, Xixin Wu, Frank K. Soong, Helen Meng, 
MSMC-TTS: Multi-Stage Multi-Codebook VQ-VAE Based Neural TTS.

TASLP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Helen Meng, 
MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis.

TASLP2023 Xixin Wu, Hui Lu, Kun Li 0003, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms.

ICASSP2023 Jinchao Li, Kaitao Song, Junan Li, Bo Zheng, Dongsheng Li 0002, Xixin Wu, Xunying Liu, Helen Meng, 
Leveraging Pretrained Representations With Task-Related Keywords for Alzheimer's Disease Detection.

ICASSP2023 Jinchao Li, Xixin Wu, Kaitao Song, Dongsheng Li 0002, Xunying Liu, Helen Meng, 
A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition.

ICASSP2023 Yuhao Liu, Cheng Gong, Longbiao Wang, Xixin Wu, Qiuyu Liu, Jianwu Dang 0001, 
VF-Taco2: Towards Fast and Lightweight Synthesis for Autoregressive Models with Variation Autoencoder and Feature Distillation.

ICASSP2023 Lingwei Meng, Jiawen Kang 0002, Mingyu Cui, Yuejiao Wang, Xixin Wu, Helen Meng, 
A Sidecar Separator Can Convert A Single-Talker Speech Recognition System to A Multi-Talker One.

Interspeech2023 Yunxiang Li, Pengfei Liu 0003, Xixin Wu, Helen Meng, 
PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts.

Interspeech2023 Lingwei Meng, Jiawen Kang 0002, Mingyu Cui, Haibin Wu, Xixin Wu, Helen Meng, 
Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator.

Interspeech2023 Helen Meng, Brian Mak, Man-Wai Mak, Helene H. Fung, Xianmin Gong, Timothy C. Y. Kwok, Xunying Liu, Vincent C. T. Mok, Patrick C. M. Wong, Jean Woo, Xixin Wu, Ka Ho Wong, Sean Shensheng Xu, Naijun Zheng, Ranzo Huang, Jiawen Kang 0002, Xiaoquan Ke, Junan Li, Jinchao Li, Yi Wang, 
Integrated and Enhanced Pipeline System to Support Spoken Language Analytics for Screening Neurocognitive Disorders.

ICASSP2022 Hang Su, Danyang Zhao, Long Dang, Minglei Li 0001, Xixin Wu, Xunying Liu, Helen Meng, 
A Multitask Learning Framework for Speaker Change Detection with Content Information from Unsupervised Speech Decomposition.

ICASSP2022 Disong Wang, Songxiang Liu, Xixin Wu, Hui Lu, Lifa Sun, Xunying Liu, Helen Meng, 
Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation.

ICASSP2022 Xixin Wu, Shoukang Hu, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Neural Architecture Search for Speech Emotion Recognition.

ICASSP2022 Haibin Wu, Bo Zheng, Xu Li 0015, Xixin Wu, Hung-Yi Lee, Helen Meng, 
Characterizing the Adversarial Vulnerability of Speech self-Supervised Learning.

#41  | Zhou Zhao | DBLP Google Scholar  
By venueACL: 13ACL-Findings: 9AAAI: 7ICASSP: 5NeurIPS: 5ICML: 4ICLR: 4IJCAI: 4Interspeech: 2NAACL: 1EMNLP: 1KDD: 1
By year2024: 152023: 192022: 112021: 52020: 32019: 3
ISCA sessionsspeech synthesis: 1speech coding and privacy: 1
IEEE keywordstext to speech: 3task analysis: 2title generation: 2summarization: 2action item detection: 2keyphrase extraction: 2grasping: 2topic segmentation: 2speech enhancement: 2style control: 1codecs: 1speech coding: 1programming: 1dataset: 1data handling: 1performance gain: 1data mining: 1long form spoken language processing: 1benchmark testing: 1annotations: 1manuals: 1recording: 1prosody modeling: 1pre training: 1focusing: 1predictive models: 1shape: 1signal denoising: 1generative adversarial network: 1singing voice synthesis: 1noisy audio: 1speech synthesis: 1denoise: 1
Most publications (all venues) at2023: 722024: 602022: 472020: 382021: 37


Recent publications

ICASSP2024 Shengpeng Ji, Jialong Zuo, Minghui Fang 0002, Ziyue Jiang 0004, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao
TextrolSpeech: A Text Style Control Speech Corpus with Codec Language Text-to-Speech Models.

ICML2024 Rongjie Huang, Ruofan Hu, Yongqi Wang, Zehan Wang 0001, Xize Cheng, Ziyue Jiang 0001, Zhenhui Ye, Dongchao Yang, Luping Liu, Peng Gao 0007, Zhou Zhao
InstructSpeech: Following Speech Editing Instructions via Large Language Models.

ICML2024 Dongchao Yang, Jinchuan Tian, Xu Tan 0003, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian 0002, Zhou Zhao, Xixin Wu, Helen M. Meng, 
UniAudio: Towards Universal Audio Generation with Large Language Models.

ICLR2024 Ziyue Jiang 0001, Jinglin Liu, Yi Ren 0006, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang 0020, Pengfei Wei 0001, Chunfeng Wang, Xiang Yin 0006, Zejun Ma, Zhou Zhao
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis.

AAAI2024 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren 0006, Yuexian Zou, Zhou Zhao, Shinji Watanabe 0001, 
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.

AAAI2024 Yu Zhang 0126, Rongjie Huang, Ruiqi Li, Jinzheng He, Yan Xia 0006, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis.

ACL2024 Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang 0001, Ziyue Jiang 0001, Xuankai Chang, Jiatong Shi, Chao Weng, Zhou Zhao, Dong Yu 0001, 
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.

ACL2024 Shengpeng Ji, Ziyue Jiang 0001, Hanting Wang, Jialong Zuo, Zhou Zhao
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech.

ACL2024 Songju Lei, Xize Cheng, Mengjiao Lyu, Jianqiao Hu, Jintao Tan, Runlin Liu, Lingyu Xiong, Tao Jin 0004, Xiandong Li, Zhou Zhao
Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation.

ACL2024 Ruiqi Li, Yu Zhang 0126, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao
Robust Singing Voice Transcription Serves Synthesis.

ACL2024 Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang 0001, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou, 
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension.

NAACL2024 Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt.

ACL-Findings2024 Xize Cheng, Rongjie Huang, Linjun Li, Zehan Wang 0001, Tao Jin 0004, Aoxiong Yin, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation.

ACL-Findings2024 Ruiqi Li, Rongjie Huang, Yongqi Wang, Zhiqing Hong, Zhou Zhao
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion.

ACL-Findings2024 Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, Zhou Zhao
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing.

ICASSP2023 Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen 0003, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren 0006, Zhou Zhao
Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG).

ICASSP2023 Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen 0003, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren 0006, Zhou Zhao
MUG: A General Meeting Understanding and Generation Benchmark.

ICML2023 Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren 0006, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin 0006, Zhou Zhao
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models.

NeurIPS2023 Haoyi Duan, Yan Xia 0006, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks.

ICLR2023 Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren 0006, Lichao Zhang, Jinzheng He, Zhou Zhao
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation.

#42  | Jianwei Yu | DBLP Google Scholar  
By venueInterspeech: 24ICASSP: 22TASLP: 8AAAI: 1
By year2024: 42023: 122022: 82021: 132020: 72019: 92018: 2
ISCA sessionsspeech coding and enhancement: 3speech recognition of atypical speech: 3speech and language in health: 2speech synthesis: 2topics in asr: 2speech recognition: 1spoken dialogue systems and multimodality: 1multi-, cross-lingual and other topics in asr: 1acoustic event detection and classification: 1source separation, dereverberation and echo cancellation: 1multimodal speech processing: 1speech and speaker recognition: 1asr neural network architectures: 1medical applications and visual asr: 1lexicon and language model for speech recognition: 1novel neural network architectures for acoustic modelling: 1application of asr in medical practice: 1
IEEE keywordsspeech recognition: 16speaker recognition: 8bayes methods: 6speech enhancement: 5speech separation: 5natural language processing: 5optimisation: 4quantisation (signal): 4recurrent neural nets: 4task analysis: 3deep learning (artificial intelligence): 3multi channel: 3overlapped speech: 3transformer: 3language models: 3gaussian processes: 3noise reduction: 2computational modeling: 2computational efficiency: 2training data: 2pipelines: 2band split rnn: 2estimation: 2spectrogram: 2automatic speech recognition: 2decoding: 2end to end: 2measurement: 2recurrent neural networks: 2audio visual: 2audio visual systems: 2speech intelligibility: 2multi look: 2variational inference: 2inference mechanisms: 2bayesian learning: 2gradient methods: 2admm: 2quantization: 2speaker verification: 2speech synthesis: 2multi path transformer: 1speech denoising: 1transformers: 1complexity scaling: 1neural network: 1computer architecture: 1system performance: 1in the wild: 1filtering algorithms: 1dino: 1self supervised learning: 1filtering: 1speeech data preprocessing: 1speaker clustering: 1data preprocessing: 1annotations: 1background noise: 1music separation: 1multiple signal classification: 1instruments: 1data models: 1transducers: 1discriminative training: 1mutual information: 1maximum mutual information: 1minimum bayesian risk: 1sequential training: 1autoregressive model: 1diffusion model: 1vocoders: 1text to sound generation: 1transforms: 1vocoder: 1headphones: 1personalized speech enhancement: 1artificial intelligence: 1dns challenge 2023: 1bandwidth: 1benchmark testing: 1universal sample rate: 1dynamic complexity: 1search problems: 1uncertainty handling: 1minimisation: 1neural architecture search: 1neural net architecture: 1time delay neural network: 1dereverberation and recognition: 1reverberation: 1mean square error methods: 1neural network quantization: 1source separation: 1mixed precision: 1direction of arrival estimation: 1direction of arrival: 1speaker diarization: 1delays: 1domain adaptation: 1generalisation (artificial intelligence): 1lf mmi: 1gaussian process: 1handicapped aids: 1speaker adaptation: 1data augmentation: 1multimodal speech recognition: 1disordered speech recognition: 1lstm rnn: 1low bit quantization: 1image recognition: 1microphone arrays: 1visual occlusion: 1overlapped speech recognition: 1jointly fine tuning: 1filtering theory: 1video signal processing: 1adress: 1cognition: 1patient diagnosis: 1alzheimer's disease detection: 1signal classification: 1diseases: 1features: 1geriatrics: 1linguistics: 1medical diagnostic computing: 1asr: 1switches: 1adaptation models: 1model uncertainty: 1neural language models: 1uncertainty: 1neurocognitive disorder detection: 1elderly speech: 1dementia: 1adversarial attack: 1x vector: 1gmm i vector: 1dysarthric speech reconstruction: 1cross modal: 1knowledge distillation: 1seq2seq: 1voice conversion: 1data compression: 1alternating direction methods of multipliers: 1multi modal: 1audio visual speech recognition: 1multilingual speech synthesis: 1speech coding: 1code switching: 1foreign accent: 1gaussian process neural network: 1activation function selection: 1bayesian neural network: 1neural network language models: 1lstm: 1parameter estimation: 1utterance level features: 1spatial relationship information: 1speech emotion recognition: 1convolutional neural nets: 1emotion recognition: 1recurrent connection: 1capsule networks: 1entropy: 1natural gradient: 1rnnlms: 1
Most publications (all venues) at2023: 182021: 182022: 172024: 122019: 10


Recent publications

ICASSP2024 Hangting Chen, Jianwei Yu, Chao Weng, 
Complexity Scaling for Speech Denoising.

ICASSP2024 Shuai Wang 0016, Qibing Bai, Qi Liu 0018, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li 0001, 
Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.

ICASSP2024 Jianwei Yu, Hangting Chen, Yanyao Bian, Xiang Li, Yi Luo, Jinchuan Tian, Mengyang Liu, Jiayi Jiang, Shuai Wang, 
AutoPrep: An Automatic Preprocessing Framework for In-The-Wild Speech Data.

AAAI2024 Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu 0001, Shi-Xiong Zhang, Guangzhi Li, Yi Luo 0004, Rongzhi Gu, 
SECap: Speech Emotion Captioning with Large Language Model.

TASLP2023 Yi Luo 0004, Jianwei Yu
Music Source Separation With Band-Split RNN.

TASLP2023 Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu 0001, 
Integrating Lattice-Free MMI Into End-to-End Speech Recognition.

TASLP2023 Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu 0001, 
Diffsound: Discrete Diffusion Model for Text-to-Sound Generation.

ICASSP2023 Jianwei Yu, Hangting Chen, Yi Luo 0004, Rongzhi Gu, Weihua Li, Chao Weng, 
TSpeech-AI System Description to the 5th Deep Noise Suppression (DNS) Challenge.

ICASSP2023 Jianwei Yu, Yi Luo 0004, 
Efficient Monaural Speech Enhancement with Universal Sample Rate Band-Split RNN.

Interspeech2023 Yi Luo 0004, Jianwei Yu
FRA-RIR: Fast Random Approximation of the Image-source Method.

Interspeech2023 Hangting Chen, Jianwei Yu, Yi Luo 0004, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng, 
Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression.

Interspeech2023 Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu, 
Use of Speech Impairment Severity for Dysarthric Speech Recognition.

Interspeech2023 Mengzhe Geng, Xurong Xie, Rongfeng Su, Jianwei Yu, Zengrui Jin, Tianzi Wang, Shujie Hu, Zi Ye 0001, Helen Meng, Xunying Liu, 
On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition.

Interspeech2023 Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu 0001, Shinji Watanabe 0001, 
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction.

Interspeech2023 Dongchao Yang, Songxiang Liu, Helin Wang, Jianwei Yu, Chao Weng, Yuexian Zou, 
NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS.

Interspeech2023 Jianwei Yu, Hangting Chen, Yi Luo 0004, Rongzhi Gu, Chao Weng, 
High Fidelity Speech Enhancement with Band-split RNN.

TASLP2022 Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng, 
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks.

ICASSP2022 Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng, 
Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition.

ICASSP2022 Junhao Xu, Jianwei Yu, Xunying Liu, Helen Meng, 
Mixed Precision DNN Quantization for Overlapped Speech Separation and Recognition.

ICASSP2022 Naijun Zheng, Na Li 0012, Jianwei Yu, Chao Weng, Dan Su 0002, Xunying Liu, Helen Meng, 
Multi-Channel Speaker Diarization Using Spatial Features for Meetings.

#43  | Sanjeev Khudanpur | DBLP Google Scholar  
By venueInterspeech: 35ICASSP: 18TASLP: 1
By year2024: 42023: 102022: 52021: 92020: 72019: 102018: 9
ISCA sessionsspeaker recognition evaluation: 3merlion ccs challenge: 2trustworthy speech processing: 2the voices from a distance challenge: 2speech recognition: 1multi-talker methods in speech processing: 1resources for spoken language processing: 1speaker and language recognition: 1tools, corpora and resources: 1language and accent recognition: 1source separation: 1graph and end-to-end learning for speaker recognition: 1linguistic components in end-to-end asr: 1feature extraction and distant asr: 1lm adaptation, lexical units and punctuation: 1neural networks for language modeling: 1asr neural network architectures and training: 1summarization, semantic analysis and classification: 1nn architectures for asr: 1spoken language processing for children’s speech: 1speaker recognition and diarization: 1recurrent neural models for asr: 1novel neural network architectures for acoustic modelling: 1robust speech recognition: 1speaker state and trait: 1end-to-end speech recognition: 1language modeling: 1acoustic modelling: 1the first dihard speech diarization challenge: 1extracting information from audio: 1
IEEE keywordsspeech recognition: 8decoding: 7automatic speech recognition: 5speech coding: 4end to end: 3code switching: 3switches: 3self supervised learning: 3pipelines: 3transformer: 3standards: 2forced alignment: 2predictive models: 2speech enhancement: 2benchmark testing: 2multi talker asr: 2task analysis: 2speech separation: 2natural language processing: 2speaker diarization: 2label priors: 1ctc: 1runtime: 1behavioral sciences: 1conversational speech: 1context modeling: 1contextual information: 1end to end models: 1memory management: 1machine translation: 1speech translation: 1robustness: 1zero shot learning: 1data models: 1data augmentation: 1asr: 1splicing: 1interaction: 1degradation: 1visualization: 1language bias: 1transducers: 1hidden markov models: 1surt: 1analytical models: 1reproducibility of results: 1espnet: 1s3prl: 1unsupervised asr: 1learning systems: 1aggregates: 1adaptation models: 1target speaker asr: 1codes: 1error correction: 1information retrieval: 1measurement: 1keyword search: 1confidence: 1timing: 1estimation: 1language diarization: 1multitasking: 1token: 1language posterior: 1channel bank filters: 1fourier transforms: 1cross lingual asr: 1lattice free mmi: 1mutual information: 1self supervised: 1few shot learning: 1lattice pruning: 1decoder: 1lattice generation: 1parallel processing: 1lattice rescoring: 1parallel computation: 1neural language models: 1noisy speech: 1deep learning (artificial intelligence): 1signal denoising: 1source separation: 1lf mmi: 1convolutional neural nets: 1computational complexity: 1gradient methods: 1wake word detection: 1streaming: 1voice activity detection: 1proposals: 1neural network: 1region proposal network: 1faster r cnn: 1language model adaptation: 1neural language model: 1interpolation: 1merging: 1linear interpolation: 1robust speech recognition: 1microphone arrays: 1acoustic modeling: 1chime 5 challenge: 1array signal processing: 1kaldi: 1speaker recognition: 1deep neural networks: 1x vectors: 1
Most publications (all venues) at2018: 272021: 192019: 192023: 182017: 15


Recent publications

ICASSP2024 Ruizhe Huang, Xiaohui Zhang 0007, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe 0001, Daniel Povey, Sanjeev Khudanpur
Less Peaky and More Accurate CTC Forced Alignment by Label Priors.

ICASSP2024 Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe 0001, Sanjeev Khudanpur
Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization.

ICASSP2024 Amir Hussein, Dorsa Zeinali, Ondrej Klejch, Matthew Wiesner, Brian Yan, Shammur Absar Chowdhury, Ahmed Ali 0002, Shinji Watanabe 0001, Sanjeev Khudanpur
Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora.

ICASSP2024 Hexin Liu, Leibny Paola Garcia, Xiangyu Zhang, Andy W. H. Khong, Sanjeev Khudanpur
Enhancing Code-Switching Speech Recognition With Interactive Language Biases.

TASLP2023 Desh Raj, Daniel Povey, Sanjeev Khudanpur
SURT 2.0: Advances in Transducer-Based Multi-Talker Speech Recognition.

ICASSP2023 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola García, Hung-Yi Lee, Shinji Watanabe 0001, Sanjeev Khudanpur
Euro: Espnet Unsupervised ASR Open-Source Toolkit.

ICASSP2023 Zili Huang, Desh Raj, Paola García 0001, Sanjeev Khudanpur
Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings.

ICASSP2023 Ruizhe Huang, Matthew Wiesner, Leibny Paola García-Perera, Daniel Povey, Jan Trmal, Sanjeev Khudanpur
Building Keyword Search System from End-To-End Asr Systems.

ICASSP2023 Hexin Liu, Haihua Xu, Leibny Paola García, Andy W. H. Khong, Yi He, Sanjeev Khudanpur
Reducing Language Confusion for Code-Switching Speech Recognition with Token-Level Language Diarization.

Interspeech2023 Yi Han Victoria Chua, Hexin Liu, Leibny Paola García, Fei Ting Woon, Jinyi Wong, Xiangyu Zhang, Sanjeev Khudanpur, Andy W. H. Khong, Justin Dauwels, Suzy J. Styles, 
MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization.

Interspeech2023 Dongji Gao, Matthew Wiesner, Hainan Xu, Leibny Paola García, Daniel Povey, Sanjeev Khudanpur
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts.

Interspeech2023 Desh Raj, Daniel Povey, Sanjeev Khudanpur
GPU-accelerated Guided Source Separation for Meeting Transcription.

Interspeech2023 Suzy J. Styles, Yi Han Victoria Chua, Fei Ting Woon, Hexin Liu, Leibny Paola García, Sanjeev Khudanpur, Andy W. H. Khong, Justin Dauwels, 
Investigating model performance in language identification: beyond simple error statistics.

Interspeech2023 Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur
HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation.

ICASSP2022 Zili Huang, Shinji Watanabe 0001, Shu-Wen Yang, Paola García 0001, Sanjeev Khudanpur
Investigating Self-Supervised Learning for Speech Enhancement and Separation.

ICASSP2022 Matthew Wiesner, Desh Raj, Sanjeev Khudanpur
Injecting Text and Cross-Lingual Supervision in Few-Shot Learning from Self-Supervised Models.

Interspeech2022 Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesús Villalba 0001, Sanjeev Khudanpur, Najim Dehak, 
Defense against Adversarial Attacks on Hybrid Speech Recognition System using Adversarial Fine-tuning with Denoiser.

Interspeech2022 Hexin Liu, Leibny Paola García-Perera, Andy W. H. Khong, Suzy J. Styles, Sanjeev Khudanpur
PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification.

Interspeech2022 Yiwen Shao, Jesús Villalba 0001, Sonal Joshi, Saurabh Kataria, Sanjeev Khudanpur, Najim Dehak, 
Chunking Defense for Adversarial Attacks on ASR.

ICASSP2021 Hang Lv 0001, Zhehuai Chen, Hainan Xu, Daniel Povey, Lei Xie 0001, Sanjeev Khudanpur
An Asynchronous WFST-Based Decoder for Automatic Speech Recognition.

#44  | Ming Li 0026 | DBLP Google Scholar  
By venueInterspeech: 26ICASSP: 20TASLP: 7
By year2024: 42023: 112022: 112021: 72020: 82019: 112018: 1
ISCA sessionsspeech synthesis: 3speaker embedding and diarization: 2speaker recognition: 2speaker and language recognition: 2anti-spoofing for speaker verification: 1multi-talker methods in speech processing: 1analysis of speech and audio signals: 1spoofing-aware automatic speaker verification (sasv): 1spoken term detection & voice search: 1sdsv challenge 2021: 1robust speaker recognition: 1feature, embedding and neural architecture for speaker recognition: 1targeted source separation: 1speaker diarization: 1the fearless steps challenge phase-02: 1the interspeech 2020 far field speaker verification challenge: 1the voices from a distance challenge: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1speaker recognition and diarization: 1the interspeech 2019 computational paralinguistics challenge (compare): 1speaker verification using neural network methods: 1
IEEE keywordsspeaker recognition: 10speaker verification: 7task analysis: 6voice activity detection: 6databases: 5error analysis: 5target speaker voice activity detection: 5adaptation models: 4training data: 4voice conversion: 4speech recognition: 3anti spoofing: 3speaker diarization: 3three dimensional displays: 3data models: 3convolution: 3self supervised learning: 3network architecture: 2automatic speech recognition: 2conformer: 2recording: 2security: 2microphone arrays: 2far field: 2target tracking: 2misp challenge: 2measurement: 2visualization: 2clustering: 2deep neural network: 2robustness: 2deep learning (artificial intelligence): 2attention: 2speaker embedding: 2natural language processing: 2transfer learning: 1knowledge distillation: 1phonetics: 1time varying: 1face recognition: 1template updating: 1cross age: 1time varying systems: 1videos: 1reinforcement learning: 1aging: 1couplings: 1invertible neural networks: 1audio security: 1information sharing: 1multi channel: 1source speaker identification: 1data mining: 1spoofing attack: 1closed box: 1protocols: 1signal processing algorithms: 1add challenge: 1deepfakes: 1partially spoofed audio detection: 1wav2vec: 1pretraining: 1transformer: 1transformers: 1performance gain: 1audio visual speaker diarization: 1sequence to sequence transformers: 1graphics processing units: 1predictive models: 1complex acoustic scenarios: 1multimodal system: 1degradation: 1simple attention module: 1semantics: 1solid modeling: 1audio visual wake up word spotting: 1singing language identification (slid): 1universal singing speech language identification (ulid): 1music database: 1clustering algorithms: 1self labeling: 1representation learning: 1encoding: 1audio visual data: 1pattern clustering: 1fuses: 1audio visual wake word spotting: 1multimodal fusion: 1ecapa tdnnlite: 1asymmetric enroll verify structure: 1lightweight speaker verification: 1noisy label: 1attention module: 1end to end speaker diarization: 1multichannel speaker diarization: 1cloning: 1intermediate representation: 1zero shot: 1disentanglement: 1computer assisted piano learning: 1performance evaluation: 1hidden markov models: 1recurrent neural networks: 1dynamic time warping: 1convolutional neural network: 1music: 1piano performance evaluation: 1iterative methods: 1contrastive learning: 1online data augmentation: 1utterance level aggregation: 1load modeling: 1speaker and language recognition: 1variable length training: 1neural network: 1noisy conditions: 1text dependent: 1multichannel: 1open source database: 1convolutional neural nets: 1cnn blstm: 1utterance level: 1language identification: 1end to end: 1phonetic feature: 1electrolaryngeal speech: 1speech enhancement: 1probability: 1fundamental frequency: 1speech intelligibility: 1
Most publications (all venues) at2023: 252022: 242021: 222020: 202019: 18

Affiliations
Duke Kunshan University, Data Science Research Center, China
Sun Yat-Sen University - Carnegie Mellon University Joint Institute of Engineering, China (former)
University of Southern California, Los Angeles, CA, USA (former)
Chinese Academy of Sciences, Institute of Acoustics, China (former)

Recent publications

TASLP2024 Danwei Cai, Ming Li 0026
Leveraging ASR Pretrained Conformers for Speaker Verification Through Transfer Learning and Knowledge Distillation.

TASLP2024 Xiaoyi Qin, Na Li 0012, Shufei Duan, Ming Li 0026
Investigating Long-Term and Short-Term Time-Varying Speaker Verification.

ICASSP2024 Zexin Cai, Ming Li 0026
Invertible Voice Conversion with Parallel Data.

ICASSP2024 Weiqing Wang, Danwei Cai, Ming Cheng, Ming Li 0026
Joint Inference of Speaker Diarization and ASR with Multi-Stage Information Sharing.

TASLP2023 Xiaoyi Qin, Danwei Cai, Ming Li 0026
Robust Multi-Channel Far-Field Speaker Verification Under Different In-Domain Data Availability Scenarios.

ICASSP2023 Danwei Cai, Zexin Cai, Ming Li 0026
Identifying Source Speakers for Voice Conversion Based Spoofing Attacks on Speaker Verification Systems.

ICASSP2023 Zexin Cai, Weiqing Wang, Ming Li 0026
Waveform Boundary Detection for Partially Spoofed Audio.

ICASSP2023 Danwei Cai, Weiqing Wang, Ming Li 0026, Rui Xia, Chuanzeng Huang, 
Pretraining Conformer with ASR for Speaker Verification.

ICASSP2023 Ming Cheng, Haoxu Wang, Ziteng Wang, Qiang Fu 0001, Ming Li 0026
The WHU-Alibaba Audio-Visual Speaker Diarization System for the MISP 2022 Challenge.

ICASSP2023 Ming Cheng, Weiqing Wang, Yucong Zhang, Xiaoyi Qin, Ming Li 0026
Target-Speaker Voice Activity Detection Via Sequence-to-Sequence Prediction.

ICASSP2023 Haoxu Wang, Ming Cheng, Qiang Fu 0001, Ming Li 0026
The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis.

ICASSP2023 Xingming Wang, Hao Wu, Chen Ding, Chuanzeng Huang, Ming Li 0026
Exploring Universal Singing Speech Language Identification Using Self-Supervised Learning Based Front-End Features.

Interspeech2023 Xingming Wang, Bang Zeng, Hongbin Suo, Yulong Wan, Ming Li 0026
Robust Audio Anti-spoofing Countermeasure with Joint Training of Front-end and Back-end Models.

Interspeech2023 Bang Zeng, Hongbin Suo, Yulong Wan, Ming Li 0026
SEF-Net: Speaker Embedding Free Target Speaker Extraction Network.

Interspeech2023 Yucong Zhang, Hongbin Suo, Yulong Wan, Ming Li 0026
Outlier-aware Inlier Modeling and Multi-scale Scoring for Anomalous Sound Detection via Multitask Learning.

TASLP2022 Danwei Cai, Weiqing Wang, Ming Li 0026
Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition.

TASLP2022 Weiqing Wang, Qingjian Lin, Danwei Cai, Ming Li 0026
Similarity Measurement of Segment-Level Speaker Embeddings in Speaker Diarization.

ICASSP2022 Ming Cheng, Haoxu Wang, Yechen Wang, Ming Li 0026
The DKU Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge.

ICASSP2022 Qingjian Li, Lin Yang, Xuyang Wang, Xiaoyi Qin, Junjie Wang, Ming Li 0026
Towards Lightweight Applications: Asymmetric Enroll-Verify Structure for Speaker Verification.

ICASSP2022 Xiaoyi Qin, Na Li 0012, Chao Weng, Dan Su 0002, Ming Li 0026
Simple Attention Module Based Speaker Verification with Iterative Noisy Label Detection.

#45  | Ralf Schlüter | DBLP Google Scholar  
By venueInterspeech: 32ICASSP: 18TASLP: 1SpeechComm: 1
By year2024: 32023: 52022: 82021: 72020: 132019: 112018: 5
ISCA sessionsnovel models and training methods for asr: 2linguistic components in end-to-end asr: 2search for speech recognition: 2neural networks for language modeling: 2asr neural network training: 2end-to-end speech recognition: 2new computational strategies for asr training and inference: 1multi-talker methods in speech processing: 1speech recognition: 1asr: 1neural transducers, streaming asr and novel asr models: 1language modeling and text-based innovations for asr: 1applications in transcription, education and learning: 1neural network training methods and architectures for asr: 1novel neural network architectures for asr: 1asr neural network architectures and training: 1general topics in speech recognition: 1training strategies for asr: 1model adaptation for asr: 1asr neural network architectures: 1model training for asr: 1corpus annotation and evaluation: 1sequence models for asr: 1asr systems and technologies: 1acoustic model adaptation: 1language modeling: 1
IEEE keywordsspeech recognition: 15hidden markov models: 8decoding: 7natural language processing: 5recurrent neural nets: 5transducers: 4end to end: 3neural transducer: 3sequence discriminative training: 3automatic speech recognition: 2data models: 2task analysis: 2language model: 2estimation: 2transducer: 2switches: 2switchboard: 2bayes methods: 2vocabulary: 2transformer: 2lstm: 2acoustic modeling: 2correlation: 1standards: 1symbols: 1chunked attention models: 1streamable: 1degradation: 1error analysis: 1artificial neural networks: 1multitasking: 1adversarial: 1speaker: 1asr: 1multi task: 1librispeech: 1regularization: 1blstm acoustic model: 1cart free hybrid hmm: 1trees (mathematics): 1beam search: 1lattice: 1sequence training: 1global normalization: 1language model integration: 1hybrid conformer hmm: 1autoregressive processes: 1phoneme: 1phonetics: 1acoustic beams: 1attention: 1end to end speech recognition: 1direct hmm: 1stochastic processes: 1latent models: 1lace: 1resnet: 1dense prediction: 1cnn: 1language modeling: 1feedforward neural nets: 1self attention: 1entropy: 1maximum mutual information: 1optimisation: 1text analysis: 1data augmentation: 1audio signal processing: 1speech synthesis: 1speaker recognition: 1layer normalized lstm: 1layer normalization: 1specaugment: 1hybrid blstm hmm: 1ted lium release 2: 1multi dimensional lstm: 12d sequence to sequence model: 1joint training: 1speech enhancement: 1single channel asr: 1optimization: 1noise measurement: 1chime 4: 1robust asr: 1
Most publications (all venues) at2013: 212011: 212012: 202019: 172021: 15


Recent publications

TASLP2024 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe 0001, 
End-to-End Speech Recognition: A Survey.

ICASSP2024 Zijian Yang, Wei Zhou 0043, Ralf Schlüter, Hermann Ney, 
On the Relation Between Internal Language Model and Sequence Discriminative Training for Neural Transducers.

ICASSP2024 Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney, 
Chunked Attention-Based Encoder-Decoder Model for Streaming Speech Recognition.

ICASSP2023 Zijian Yang, Wei Zhou 0043, Ralf Schlüter, Hermann Ney, 
Lattice-Free Sequence Discriminative Training for Phoneme-Based Neural Transducers.

ICASSP2023 Wei Zhou 0043, Haotian Wu, Jingjing Xu, Mohammad Zeineldeen, Christoph Lüscher, Ralf Schlüter, Hermann Ney, 
Enhancing and Adversarial: Improve ASR with Speaker Labels.

Interspeech2023 Wei Zhou 0043, Eugen Beck, Simon Berger, Ralf Schlüter, Hermann Ney, 
RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech Recognition.

Interspeech2023 Simon Berger, Peter Vieting, Christoph Böddeker, Ralf Schlüter, Reinhold Haeb-Umbach, 
Mixture Encoder for Joint Speech Separation and Recognition.

Interspeech2023 Tina Raissi, Christoph Lüscher, Moritz Gunz, Ralf Schlüter, Hermann Ney, 
Competitive and Resource Efficient Factored Hybrid HMM Systems are Simpler Than You Think.

ICASSP2022 Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney, 
Improving Factored Hybrid HMM Acoustic Modeling without State Tying.

ICASSP2022 Nils-Philipp Wynands, Wilfried Michel, Jan Rosendahl, Ralf Schlüter, Hermann Ney, 
Efficient Sequence Training of Attention Models Using Approximative Recombination.

ICASSP2022 Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Wilfried Michel, Alexander Gerstenberger, Ralf Schlüter, Hermann Ney, 
Conformer-Based Hybrid ASR System For Switchboard Dataset.

ICASSP2022 Wei Zhou 0043, Zuoyun Zheng, Ralf Schlüter, Hermann Ney, 
On Language Model Integration for RNN Transducer Based Speech Recognition.

Interspeech2022 Felix Meyer, Wilfried Michel, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney, 
Automatic Learning of Subword Dependent Model Scales.

Interspeech2022 Zijian Yang, Yingbo Gao, Alexander Gerstenberger, Jintao Jiang, Ralf Schlüter, Hermann Ney, 
Self-Normalized Importance Sampling for Neural Language Modeling.

Interspeech2022 Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, Hermann Ney, 
Improving the Training Recipe for a Robust Conformer-based Hybrid Model.

Interspeech2022 Wei Zhou 0043, Wilfried Michel, Ralf Schlüter, Hermann Ney, 
Efficient Training of Neural Transducer for Speech Recognition.

ICASSP2021 Wei Zhou 0043, Simon Berger, Ralf Schlüter, Hermann Ney, 
Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition.

Interspeech2021 Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney, 
On Sampling-Based Training Criteria for Neural Language Modeling.

Interspeech2021 Yu Qiao 0005, Wei Zhou 0043, Elma Kerz, Ralf Schlüter
The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech.

Interspeech2021 Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney, 
Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models.

#46  | Hermann Ney | DBLP Google Scholar  
By venueInterspeech: 30ICASSP: 19TASLP: 1EMNLP: 1SpeechComm: 1
By year2024: 32023: 42022: 92021: 72020: 142019: 102018: 5
ISCA sessionsnovel models and training methods for asr: 2linguistic components in end-to-end asr: 2search for speech recognition: 2neural networks for language modeling: 2asr neural network training: 2new computational strategies for asr training and inference: 1speech recognition: 1asr: 1neural transducers, streaming asr and novel asr models: 1language modeling and text-based innovations for asr: 1keynote: 1neural network training methods and architectures for asr: 1novel neural network architectures for asr: 1asr neural network architectures and training: 1general topics in speech recognition: 1training strategies for asr: 1model adaptation for asr: 1asr neural network architectures: 1model training for asr: 1corpus annotation and evaluation: 1sequence models for asr: 1asr systems and technologies: 1acoustic model adaptation: 1language modeling: 1end-to-end speech recognition: 1
IEEE keywordsspeech recognition: 16hidden markov models: 7decoding: 6natural language processing: 6recurrent neural nets: 5transducers: 4neural transducer: 3sequence discriminative training: 3noise measurement: 2task analysis: 2language model: 2estimation: 2transducer: 2switches: 2asr: 2switchboard: 2bayes methods: 2vocabulary: 2transformer: 2lstm: 2acoustic modeling: 2language modeling: 2end to end: 2knowledge based systems: 1oral communication: 1document grounded dialog: 1factuality: 1dstc: 1response generation: 1retrieval augmentation: 1dialog system: 1databases: 1correlation: 1standards: 1symbols: 1chunked attention models: 1streamable: 1degradation: 1error analysis: 1artificial neural networks: 1multitasking: 1adversarial: 1speaker: 1multi task: 1librispeech: 1regularization: 1blstm acoustic model: 1cart free hybrid hmm: 1trees (mathematics): 1beam search: 1lattice: 1sequence training: 1global normalization: 1language model integration: 1hybrid conformer hmm: 1autoregressive processes: 1phoneme: 1phonetics: 1acoustic beams: 1automatic speech recognition: 1attention: 1end to end speech recognition: 1direct hmm: 1stochastic processes: 1latent models: 1lace: 1resnet: 1dense prediction: 1cnn: 1teacher student learning: 1domain robustness: 1feedforward neural nets: 1self attention: 1entropy: 1maximum mutual information: 1optimisation: 1text analysis: 1data augmentation: 1audio signal processing: 1speech synthesis: 1speaker recognition: 1layer normalized lstm: 1layer normalization: 1specaugment: 1hybrid blstm hmm: 1ted lium release 2: 1multi dimensional lstm: 12d sequence to sequence model: 1joint training: 1speech enhancement: 1single channel asr: 1optimization: 1data models: 1chime 4: 1robust asr: 1
Most publications (all venues) at2012: 522013: 482011: 442005: 412014: 38


Recent publications

TASLP2024 David Thulke, Nico Daheim, Christian Dugast, Hermann Ney
Task-Oriented Document-Grounded Dialog Systems by HLTPR@RWTH for DSTC9 and DSTC10.

ICASSP2024 Zijian Yang, Wei Zhou 0043, Ralf Schlüter, Hermann Ney
On the Relation Between Internal Language Model and Sequence Discriminative Training for Neural Transducers.

ICASSP2024 Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney
Chunked Attention-Based Encoder-Decoder Model for Streaming Speech Recognition.

ICASSP2023 Zijian Yang, Wei Zhou 0043, Ralf Schlüter, Hermann Ney
Lattice-Free Sequence Discriminative Training for Phoneme-Based Neural Transducers.

ICASSP2023 Wei Zhou 0043, Haotian Wu, Jingjing Xu, Mohammad Zeineldeen, Christoph Lüscher, Ralf Schlüter, Hermann Ney
Enhancing and Adversarial: Improve ASR with Speaker Labels.

Interspeech2023 Wei Zhou 0043, Eugen Beck, Simon Berger, Ralf Schlüter, Hermann Ney
RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech Recognition.

Interspeech2023 Tina Raissi, Christoph Lüscher, Moritz Gunz, Ralf Schlüter, Hermann Ney
Competitive and Resource Efficient Factored Hybrid HMM Systems are Simpler Than You Think.

ICASSP2022 Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney
Improving Factored Hybrid HMM Acoustic Modeling without State Tying.

ICASSP2022 Nils-Philipp Wynands, Wilfried Michel, Jan Rosendahl, Ralf Schlüter, Hermann Ney
Efficient Sequence Training of Attention Models Using Approximative Recombination.

ICASSP2022 Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Wilfried Michel, Alexander Gerstenberger, Ralf Schlüter, Hermann Ney
Conformer-Based Hybrid ASR System For Switchboard Dataset.

ICASSP2022 Wei Zhou 0043, Zuoyun Zheng, Ralf Schlüter, Hermann Ney
On Language Model Integration for RNN Transducer Based Speech Recognition.

Interspeech2022 Felix Meyer, Wilfried Michel, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
Automatic Learning of Subword Dependent Model Scales.

Interspeech2022 Zijian Yang, Yingbo Gao, Alexander Gerstenberger, Jintao Jiang, Ralf Schlüter, Hermann Ney
Self-Normalized Importance Sampling for Neural Language Modeling.

Interspeech2022 Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, Hermann Ney
Improving the Training Recipe for a Robust Conformer-based Hybrid Model.

Interspeech2022 Wei Zhou 0043, Wilfried Michel, Ralf Schlüter, Hermann Ney
Efficient Training of Neural Transducer for Speech Recognition.

EMNLP2022 Viet Anh Khoa Tran, David Thulke, Yingbo Gao, Christian Herold, Hermann Ney
Does Joint Training Really Help Cascaded Speech Translation?

ICASSP2021 Wei Zhou 0043, Simon Berger, Ralf Schlüter, Hermann Ney
Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition.

Interspeech2021 Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney
On Sampling-Based Training Criteria for Neural Language Modeling.

Interspeech2021 Hermann Ney
Forty Years of Speech and Language Processing: From Bayes Decision Rule to Deep Learning.

Interspeech2021 Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney
Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models.

#47  | Hiroshi Saruwatari | DBLP Google Scholar  
By venueInterspeech: 21ICASSP: 19TASLP: 8SpeechComm: 3IJCAI: 1
By year2024: 52023: 122022: 82021: 112020: 142019: 2
ISCA sessionsspeech synthesis: 12speech synthesis paradigms and methods: 2speech perception, production, and acquisition: 1speech coding and restoration: 1the voicemos challenge: 1spoken language processing: 1voice conversion and adaptation: 1speech annotation and speech assessment: 1speech signal representation: 1
IEEE keywordsspeech synthesis: 7interpolation: 4text to speech synthesis: 3multizone sound field control: 3loudspeakers: 3personal audio: 3regression analysis: 3filtering theory: 3blind source separation: 3optimisation: 3covariance matrices: 3speech recognition: 3training data: 2adaptation models: 2symbols: 2task analysis: 2video on demand: 2cost function: 2pressure matching: 2amplitude matching: 2source separation: 2predictive models: 2kernel ridge regression: 2sound field interpolation: 2active noise control: 2speaker embedding: 2speaker recognition: 2spatial covariance matrix: 2diffuse noise: 2blind speech extraction: 2acoustic field: 2numerical analysis: 2sound reproduction: 2speech perception: 2generative adversarial networks: 2generative adversarial network: 2black box optimization: 2human computation: 2gaussian distribution: 2audio signal processing: 2gaussian processes: 2music: 2transfer learning: 1multilingual text to speech: 1low resource adaptation: 1graphone: 1data models: 1adaptation of masked language model: 1acoustic measurements: 1corpus construction: 1linguistics: 1core set selection: 1diversification: 1data selection: 1human machine systems: 1oral communication: 1generative spoken language model: 1speech analysis: 1zipf’s law: 1annotations: 1speech representation: 1vocabulary: 1sound field synthesis: 1optimization: 1exterior radiation: 1directional weighting: 1frequency synthesizers: 1potential energy: 1indexes: 1time domain analysis: 1time frequency analysis: 1multichannel music source separation: 1independent deeply learned matrix analysis: 1product of experts: 1independent low rank matrix analysis: 1spectrogram: 1social networking (online): 1vocal ensemble: 1audio source separation: 1corpus: 1lead: 1audio recording: 1singing voice synthesis: 1signal processing algorithms: 1singing voice: 1degradation: 1controllability: 1cross lingual speech synthesis: 1multi speaker speech synthesis: 1speaker generation: 1context modeling: 1audiobook: 1aggregates: 1signal resolution: 1speech prosody: 1tts: 1feeds: 1multi speaker tts: 1rhythm: 1categorized pause insertion: 1phrasing: 1phrase break prediction: 1transformers: 1bit error rate: 1bert: 1pause insertion: 1principal component analysis: 1acoustic transfer function: 1hilbert spaces: 1transfer functions: 1reproducing kernel hilbert space: 1helmholtz equation: 1microphone arrays: 1sound field control: 1spatial active noise control: 1kernel interpolation: 1adaptive filters: 1adaptive filter: 1active learning: 1deep speaker representation learning: 1multi speaker generative modeling: 1perceptual speaker similarity: 1estimation theory: 1em algorithm: 1physics computing: 1acoustic variables control: 1scalability: 1auxiliary classifier: 1backpropagation algorithms: 1conditional generator: 1domain adaptation: 1text analysis: 1mutual information: 1cross lingual: 1multivariate complex generalized gaussian distribution: 1crowdsourcing: 1computational modeling: 1human voice: 1gallium nitride: 1generators: 1spatial noise: 1convolution: 1matrix decomposition: 1joint diagonalization: 1spatial covariance model: 1student’s t distribution: 1frequency domain analysis: 1independent positive semidefinite tensor analysis: 1tensors: 1deep gaussian process: 1bayesian deep model: 1recurrent neural nets: 1simple recurrent unit: 1sequential modeling: 1wave u net: 1discrete wavelet transform: 1deep neural networks: 1discrete wavelet transforms: 1time domain audio source separation: 1spectral differentials: 1deep neural network: 1minimum phase filter: 1sub 
band processing: 1hilbert transforms: 1voice conversion: 1sound field reproduction: 1mode matching method: 1spherical wavefunction expansion: 1artificial double tracking: 1modulation spectrum: 1moment matching network: 1inter utterance pitch variation: 1dnn based singing voice synthesis: 1
Most publications (all venues) at2021: 332020: 322023: 252022: 242012: 23

Affiliations
URLs

Recent publications

SpeechComm2024 Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari
JNV corpus: A corpus of Japanese nonverbal vocalizations with diverse phrases and emotions.

TASLP2024 Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe 0001, Shinnosuke Takamichi, Hiroshi Saruwatari
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis.

ICASSP2024 Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
Diversity-Based Core-Set Selection for Text-to-Speech with Linguistic and Acoustic Features.

ICASSP2024 Shinnosuke Takamichi, Hiroki Maeda, Joonyong Park, Daisuke Saito, Hiroshi Saruwatari
Do Learned Speech Symbols Follow Zipf's Law?

ICASSP2024 Yoshihide Tomita, Shoichi Koyama, Hiroshi Saruwatari
Localizing Acoustic Energy in Sound Field Synthesis by Directionally Weighted Exterior Radiation Suppression.

TASLP2023 Takumi Abe, Shoichi Koyama, Natsuki Ueno, Hiroshi Saruwatari
Amplitude Matching for Multizone Sound Field Control.

TASLP2023 Takuya Hasumi, Tomohiko Nakamura, Norihiro Takamune, Hiroshi Saruwatari, Daichi Kitamura, Yu Takahashi, Kazunobu Kondo, 
PoP-IDLMA: Product-of-Prior Independent Deeply Learned Matrix Analysis for Multichannel Music Source Separation.

ICASSP2023 Tomohiko Nakamura, Shinnosuke Takamichi, Naoko Tanji, Satoru Fukayama, Hiroshi Saruwatari
jaCappella Corpus: A Japanese a Cappella Vocal Ensemble Corpus.

ICASSP2023 Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari
MID-Attribute Speaker Generation Using Optimal-Transport-Based Interpolation of Gaussian Mixture Models.

ICASSP2023 Detai Xin, Sharath Adavanne, Federico Ang, Ashish Kulkarni, Shinnosuke Takamichi, Hiroshi Saruwatari
Improving Speech Prosody of Audiobook Text-To-Speech Synthesis with Acoustic and Textual Contexts.

ICASSP2023 Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari
Duration-Aware Pause Insertion Using Pre-Trained Language Model for Multi-Speaker Text-To-Speech.

Interspeech2023 Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki, Detai Xin, Hiroshi Saruwatari
How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics.

Interspeech2023 Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari
CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center.

Interspeech2023 Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, Hiroshi Saruwatari
ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings.

Interspeech2023 Yota Ueda, Shinnosuke Takamichi, Yuki Saito, Norihiro Takamune, Hiroshi Saruwatari
HumanDiffusion: diffusion model using perceptual gradients.

Interspeech2023 Detai Xin, Shinnosuke Takamichi, Ai Morimatsu, Hiroshi Saruwatari
Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus.

IJCAI2023 Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe 0001, Shinnosuke Takamichi, Hiroshi Saruwatari
Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining.

TASLP2022 Juliano G. C. Ribeiro, Natsuki Ueno, Shoichi Koyama, Hiroshi Saruwatari
Region-to-Region Kernel Interpolation of Acoustic Transfer Functions Constrained by Physical Properties.

Interspeech2022 Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari
Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis.

Interspeech2022 Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari
Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History.

#48  | Pengyuan Zhang | DBLP Google Scholar  
By venueInterspeech: 27ICASSP: 13TASLP: 8SpeechComm: 2
By year2024: 42023: 42022: 142021: 122020: 62019: 72018: 3
ISCA sessionsnovel models and training methods for asr: 2speech synthesis: 2speaker embedding and diarization: 1spatial audio: 1speech emotion recognition: 1spoken language processing: 1multi-, cross-lingual and other topics in asr: 1spoofing-aware automatic speaker verification (sasv): 1atypical speech analysis and detection: 1low-resource asr development: 1voice conversion and adaptation: 1single-channel speech enhancement: 1speaker recognition: 1voice anti-spoofing and countermeasure: 1noise robust and distant speech recognition: 1the fearless steps challenge phase-02: 1general topics in speech recognition: 1speech recognition and beyond: 1lexicon and language model for speech recognition: 1asr neural network training: 1asr for noisy and far-field speech: 1model adaptation for asr: 1acoustic scenes and rare events: 1neural network training strategies for asr: 1language modeling: 1
IEEE keywordsspeech recognition: 12automatic speech recognition: 4natural language processing: 4decoding: 4end to end speech recognition: 4task analysis: 3data models: 3voice activity detection: 3computational modeling: 3text analysis: 3hidden markov models: 3pre training: 2filtering: 2estimation: 2pseudo labeling: 2speech synthesis: 2end to end: 2signal classification: 2ctc/attention speech recognition: 2error analysis: 2transformer: 2online speech recognition: 2computer architecture: 2oral communication: 1clustering algorithms: 1metric embedding learning: 1online clustering: 1speaker diarization: 1machine learning algorithms: 1domain adaptation: 1adaptation models: 1self supervised learning: 1generalization ability: 1spoofing detection: 1one class classification: 1classification algorithms: 1knowledge distillation: 1res2net: 1large margin fine tuning: 1speech anti spoofing: 1duration mismatch: 1visual explanations: 1recurrent neural networks: 1partitioning algorithms: 1content of silence: 1convolutional neural networks: 1robustness: 1proportion fo silence duration: 1anti spoofing: 1semi supervised learning: 1noise measurement: 1tuning: 1time frequency analysis: 1tdnn: 1optimization: 1convolution: 1progressive channel fusion: 1spectrogram: 1speaker verification: 1random processes: 1supervised learning: 1self supervised pre training: 1sensor fusion: 1frequency domain: 1dual path transformer: 1speech enhancement: 1full band and sub band fusion: 1least squares approximations: 1image fusion: 1knowledge transfer: 1connectionist temporal classification: 1pre trained language model: 1non autoregressive: 1autoregressive processes: 1keyword confidence scoring: 1keyword search: 1transformers: 1phoneme alignment: 1history: 1language model: 1graphics processing units: 1history utterance: 1performance gain: 1grammars: 1speech coding: 1unpaired data: 1rnn t: 1probability: 1multi level detection: 1constrained attention: 1home automation: 1keyword spotting: 1vocabulary: 1cloning: 1text to speech: 1vocoders: 1m2voc challenge: 1finetune: 1manuals: 1voice cloning: 1predictive models: 1heuristic algorithms: 1hybrid ctc/attention speech recognition: 1dataset: 1speaker recognition: 1chinese: 1computational efficiency: 1autoregressive moving average: 1convolutional neural nets: 1interpretability: 1recurrent neural nets: 1autoregressive moving average processes: 1neural language models: 1multitask learning: 1self attention: 1prosodic boundary prediction: 1
Most publications (all venues) at2022: 352021: 242023: 222019: 192024: 15

Affiliations
URLs

Recent publications

TASLP2024 Yifan Chen, Gaofeng Cheng, Runyan Yang, Pengyuan Zhang, Yonghong Yan 0002, 
Interrelate Training and Clustering for Online Speaker Diarization.

TASLP2024 Han Zhu 0004, Gaofeng Cheng, Jindong Wang 0001, Wenxin Hou, Pengyuan Zhang, Yonghong Yan 0002, 
Boosting Cross-Domain Speech Recognition With Self-Supervision.

ICASSP2024 Jingze Lu, Yuxiang Zhang, Wenchao Wang, Zengqiang Shang, Pengyuan Zhang
One-Class Knowledge Distillation for Spoofing Speech Detection.

ICASSP2024 Yuxiang Zhang, Jingze Lu, Zengqiang Shang, Wenchao Wang, Pengyuan Zhang
Improving Short Utterance Anti-Spoofing with AASIST2.

SpeechComm2023 Feng Dang, Hangting Chen, Qi Hu, Pengyuan Zhang, Yonghong Yan 0002, 
First coarse, fine afterward: A lightweight two-stage complex approach for monaural speech enhancement.

TASLP2023 Yuxiang Zhang, Zhuo Li, Jingze Lu, Hua Hua, Wenchao Wang, Pengyuan Zhang
The Impact of Silence on Speech Anti-Spoofing.

TASLP2023 Han Zhu 0004, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan 0002, 
Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition.

ICASSP2023 Zhenduo Zhao, Zhuo Li, Wenchao Wang, Pengyuan Zhang
PCF: ECAPA-TDNN with Progressive Channel Fusion for Speaker Verification.

TASLP2022 Changfeng Gao, Gaofeng Cheng, Ta Li, Pengyuan Zhang, Yonghong Yan 0002, 
Self-Supervised Pre-Training for Attention-Based Encoder-Decoder ASR Model.

ICASSP2022 Feng Dang, Hangting Chen, Pengyuan Zhang
DPT-FSNet: Dual-Path Transformer Based Full-Band and Sub-Band Fusion Network for Speech Enhancement.

ICASSP2022 Keqi Deng, Songjun Cao, Yike Zhang, Long Ma, Gaofeng Cheng, Ji Xu, Pengyuan Zhang
Improving CTC-Based Speech Recognition Via Knowledge Transferring from Pre-Trained Language Models.

ICASSP2022 Keqi Deng, Zehui Yang, Shinji Watanabe 0001, Yosuke Higuchi, Gaofeng Cheng, Pengyuan Zhang
Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models.

Interspeech2022 Yifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan 0002, 
Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization.

Interspeech2022 Hangting Chen, Yi Yang 0057, Feng Dang, Pengyuan Zhang
Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output.

Interspeech2022 Chengxin Chen, Pengyuan Zhang
CTA-RNN: Channel and Temporal-wise Attention RNN leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition.

Interspeech2022 Yukun Liu, Ta Li, Pengyuan Zhang, Yonghong Yan 0002, 
NAS-SCAE: Searching Compact Attention-based Encoders For End-to-end Automatic Speech Recognition.

Interspeech2022 Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie 0001, Yonghong Yan 0002, 
Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset.

Interspeech2022 Lingxuan Ye, Gaofeng Cheng, Runyan Yang, Zehui Yang, Sanli Tian, Pengyuan Zhang, Yonghong Yan 0002, 
Improving Recognition of Out-of-vocabulary Words in E2E Code-switching ASR by Fusing Speech Generation Methods.

Interspeech2022 Yuxiang Zhang, Zhuo Li, Wenchao Wang, Pengyuan Zhang
SASV Based on Pre-trained ASV System and Integrated Scoring Module.

Interspeech2022 Xueshuai Zhang, Jiakun Shen, Jun Zhou 0024, Pengyuan Zhang, Yonghong Yan 0002, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shaoxing Zhang, Aijun Sun, 
Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics.

#49  | Kong-Aik Lee | DBLP Google Scholar  
By venueICASSP: 20Interspeech: 20TASLP: 7NeurIPS: 1SpeechComm: 1
By year2024: 72023: 132022: 72021: 52020: 92019: 62018: 2
ISCA sessionsrobust speaker recognition: 3speaker recognition and diarization: 2speaker verification: 2speech coding and enhancement: 1anti-spoofing for speaker verification: 1voice anti-spoofing and countermeasure: 1feature, embedding and neural architecture for speaker recognition: 1anti-spoofing and liveness detection: 1speaker recognition challenges and applications: 1the attacker’s perpective on automatic speaker verification: 1large-scale evaluation of short-duration speaker verification: 1dnn architectures for speaker recognition: 1learning techniques for speaker recognition: 1speaker recognition evaluation: 1speaker recognition: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1
IEEE keywordsspeaker recognition: 13speaker verification: 10task analysis: 5uncertainty: 4meta learning: 4transformers: 3optimization: 3representation learning: 3asvspoof: 2anti spoofing: 2presentation attack detection: 2self supervised learning: 2visualization: 2probabilistic logic: 2speech recognition: 2pattern classification: 2unsupervised learning: 2natural language processing: 2domain adaptation: 2resnet: 1time frequency analysis: 1stride configuration: 1temporal resolution: 12d cnn: 1convolutional neural networks: 1image resolution: 1computer architecture: 1electronic mail: 1authentication: 1multitasking: 1spoof aware speaker verification (sasv): 1glass box: 1personalized speech generation: 1text to speech: 1voice privacy: 1rendering (computer graphics): 1adversary attack: 1perturbation methods: 1voice conversion: 1privacy: 1data privacy: 1information filtering: 1voice anonymization: 1pseudo speaker distribution: 1speaker uncertainty: 1pseudo speaker vector: 1low snr: 1testing: 1artificial noise: 1signal to noise ratio: 1background noise: 1gradient: 1noise robust: 1automatic speaker verification: 1label level knowledge distillation: 1knowledge engineering: 1knowledge distillation: 1substitution: 1training data: 1speech anti spoofing: 1concatenation: 1blending strategies: 1data augmentation: 1refining: 1codecs: 1deepfakes: 1spoofing: 1distributed databases: 1protocols: 1countermeasures: 1communication networks: 1progressive clustering: 1multi modal: 1diverse positive pairs: 1supervised learning: 1face recognition: 1meta generalized speaker verification: 1performance evaluation: 1domain mismatch: 1recording: 1upper bound: 1audio visual data: 1co teaching+: 1local global: 1positional encoding: 1natural languages: 1transformer: 1encoding: 1convolution: 1focusing: 1learning systems: 1biometrics (access control): 1production: 1lip biometrics: 1visual speech: 1cross modal: 1correlation: 1lips: 1co learning: 1online speaker clustering: 1clustering algorithms: 1calibration: 1signal processing algorithms: 1noise robustness: 1disentangled representation learning: 1degradation: 1metric learning: 1extraterrestrial measurements: 1noise measurement: 1noisy label: 1deep cleansing: 1audiovisual: 1codes: 1speaker embeddings: 1plda: 1measurement uncertainty: 1xi vector: 1estimation: 1latent variables: 1uncertainty estimation: 1artificial intelligence: 1measurement: 1coherence: 1dialogue generation: 1multi scale frequency channel attention: 1short utterance: 1convolutional neural nets: 1text independent speaker verification: 1pseudo label selection: 1self supervised speaker recognition: 1loss gated learning: 1microphone arrays: 1multi speaker asr: 1meeting transcription: 1alimeeting: 1m2met: 1speaker diarization: 1multilayer perceptrons: 1domain invariant: 1meta generalized transformation: 1optimisation: 1meta speaker embedding network: 1cross channel: 1automatic speaker verification (asv): 1security of data: 1detect ion cost function: 1spoofing counter measures: 1speech articulatory attributes: 1backpropagation: 1maximal figure of merit: 1deep bottleneck features: 1convolutional recurrent neural network: 1spoken language recognition: 1speak verification: 1generalized framework: 1probability: 1correlation alignment: 1interpolation: 1regularization: 1correlation methods: 1unsupervised: 1discriminant analysis: 1linear discriminant analysis: 1
Most publications (all venues) at2024: 202021: 192020: 192023: 182022: 16

Affiliations
Institute for Infocomm Research, Singapore
URLs

Recent publications

TASLP2024 Tianchi Liu 0004, Kong Aik Lee, Qiongqiong Wang, Haizhou Li 0001, 
Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification.

TASLP2024 Xuechen Liu, Md. Sahidullah, Kong Aik Lee, Tomi Kinnunen, 
Generalizing Speaker Verification for Spoof Awareness in the Embedding Space.

ICASSP2024 Shihao Chen, Liping Chen, Jie Zhang 0042, Kong-Aik Lee, Zhenhua Ling, Lirong Dai 0001, 
Adversarial Speech for Voice Privacy Protection from Personalized Speech Generation.

ICASSP2024 Liping Chen, Kong Aik Lee, Wu Guo, Zhen-Hua Ling, 
Modeling Pseudo-Speaker Uncertainty in Voice Anonymization.

ICASSP2024 Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li 0001, 
Gradient Weighting for Speaker Verification in Extremely Low Signal-to-Noise Ratio.

ICASSP2024 Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng, 
Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification.

ICASSP2024 Linjuan Zhang, Kong Aik Lee, Lin Zhang, Longbiao Wang, Baoning Niu, 
CPAUG: Refining Copy-Paste Augmentation for Speech Anti-Spoofing.

TASLP2023 Xuechen Liu, Xin Wang 0037, Md. Sahidullah, Jose Patino 0001, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas W. D. Evans, Andreas Nautsch, Kong Aik Lee
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.

TASLP2023 Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li 0001, 
Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs.

TASLP2023 Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, Helen Meng, 
Meta-Generalization for Domain-Invariant Speaker Verification.

ICASSP2023 Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, 
Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning.

ICASSP2023 Xiaohui Liu, Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Jianwu Dang 0001, 
Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection.

ICASSP2023 Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang 0001, 
Cross-Modal Audio-Visual Co-Learning for Text-Independent Speaker Verification.

ICASSP2023 Alexey Sholokhov, Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng, 
Probabilistic Back-ends for Online Speaker Recognition and Clustering.

ICASSP2023 Yao Sun, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang 0001, 
Noise-Disentanglement Metric Learning for Robust Speaker Verification.

ICASSP2023 Ruijie Tao, Kong Aik Lee, Zhan Shi, Haizhou Li 0001, 
Speaker Recognition with Two-Step Multi-Modal Deep Cleansing.

ICASSP2023 Qiongqiong Wang, Kong Aik Lee, Tianchi Liu 0004, 
Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification.

Interspeech2023 Xuechen Liu, Md. Sahidullah, Kong Aik Lee, Tomi Kinnunen, 
Speaker-Aware Anti-spoofing.

Interspeech2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang 0037, Xuechen Liu, Md. Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas W. D. Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung, 
Towards Single Integrated Spoofing-aware Speaker Verification Embeddings.

NeurIPS2023 Tianchi Liu 0004, Kong Aik Lee, Qiongqiong Wang, Haizhou Li 0001, 
Disentangling Voice and Content with Self-Supervision for Speaker Recognition.

#50  | Yifan Gong 0001 | DBLP Google Scholar  
By venueICASSP: 26Interspeech: 21TASLP: 2
By year2024: 12022: 32021: 132020: 132019: 162018: 3
ISCA sessionsasr neural network architectures: 3streaming for asr/rnn transducers: 2multi- and cross-lingual asr, other topics in asr: 2novel models and training methods for asr: 1topics in asr: 1self-supervision and semi-supervision for neural asr training: 1neural network training methods for asr: 1acoustic model adaptation for asr: 1streaming asr: 1feature extraction and distant asr: 1asr neural network architectures and training: 1search for speech recognition: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1asr neural network training: 1novel neural network architectures for acoustic modelling: 1novel approaches to enhancement: 1deep enhancement: 1
IEEE keywordsspeech recognition: 22recurrent neural nets: 5end to end: 4speaker recognition: 4attention: 4data models: 3adaptation models: 3decoding: 3natural language processing: 3deep neural network: 3probability: 3meeting transcription: 3task analysis: 3speaker adaptation: 3teacher student learning: 3automatic speech recognition: 3lstm: 3adversarial learning: 3error analysis: 2combination: 2speech separation: 2acoustic model adaptation: 2speech synthesis: 2domain adaptation: 2adaptation: 2ctc: 2speech coding: 2vocabulary: 2signal classification: 2neural network: 2llm: 1pretrained lm: 1fully formatted e2e asr transcription: 1multi talker asr: 1streaming: 1end to end end point detection: 1training data: 1real time systems: 1hybrid: 1cascaded: 1two pass: 1speaker inventory: 1mathematical model: 1estimated speech: 1particle separators: 1computer science: 1correlation: 1speaker separation: 1language model: 1attention based encoder decoder: 1recurrent neural network transducer: 1sequence training: 1self teaching: 1regularization: 1segmentation: 1microphone arrays: 1sound source localisation: 1hidden markov models: 1speaker location: 1hidden markov model: 1diarisation: 1filtering theory: 1audio signal processing: 1source separation: 1system fusion: 1speaker diarization: 1unsupervised adaptation: 1speaker adaption: 1robustness: 1production systems: 1neural language generation: 1unsupervised learning: 1transducers: 1production: 1tensors: 1rnn transducer: 1virtual assistants: 1alignments: 1pre training.: 1pattern classification: 1streaming attention based sequence to sequence asr: 1encoding: 1latency reduction: 1monotonic chunkwise attention: 1entropy: 1computer aided instruction: 1latency: 1label embedding: 1knowledge representation: 1backpropagation: 1text analysis: 1text to speech: 1rnn t: 1keyword spotting: 1end to end system: 1oov: 1acoustic to word: 1universal acoustic model: 1mixture of experts: 1mixture models: 1interpolation: 1word embedding: 1confidence classifier: 1digital assistant: 1acoustic state prediction: 1model adaptation: 1model combination: 1layer trajectory: 1future context frames: 1temporal modeling: 1senone classification: 1code switching: 1language identification: 1asr: 1domain invariant training: 1computational modeling: 1predictive models: 1speaker verification: 1context modeling: 1data mining: 1speaker extraction: 1speaker profile: 1periodic structures: 1cloud computing: 1quantization: 1polynomials: 1privacy preserving: 1dnn: 1application program interfaces: 1cryptography: 1encryption: 1
Most publications (all venues) at2019: 212021: 162020: 152018: 132015: 11

Affiliations
Microsoft Corporation, Redmond, WA, USA
Texas Instruments Inc., Dallas, TX, USA
INRIA-Lorraine, Nancy, France
Henri Poincaré University, Department of Mathematics and Computer Science, Nancy, France (PhD)

Recent publications

ICASSP2024 Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, Guoli Ye, Yao Qian, Yifan Gong 0001, Ed Lin, Michael Zeng 0001, 
Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition.

ICASSP2022 Liang Lu 0001, Jinyu Li 0001, Yifan Gong 0001
Endpoint Detection for Streaming End-to-End Multi-Talker ASR.

ICASSP2022 Guoli Ye, Vadim Mazalov, Jinyu Li 0001, Yifan Gong 0001
Have Best of Both Worlds: Two-Pass Hybrid and E2E Cascading Framework for Speech Recognition.

Interspeech2022 Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li 0001, Xie Chen 0001, Yu Wu 0012, Yifan Gong 0001
Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition.

TASLP2021 Peidong Wang, Zhuo Chen 0006, DeLiang Wang, Jinyu Li 0001, Yifan Gong 0001
Speaker Separation Using Speaker Inventories and Estimated Speech.

ICASSP2021 Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu 0001, Xie Chen 0001, Jinyu Li 0001, Yifan Gong 0001
Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition.

ICASSP2021 Eric Sun, Liang Lu 0001, Zhong Meng, Yifan Gong 0001
Sequence-Level Self-Teaching Regularization.

ICASSP2021 Jeremy Heng Meng Wong, Dimitrios Dimitriadis, Ken'ichi Kumatani, Yashesh Gaur, George Polovets, Partha Parthasarathy, Eric Sun, Jinyu Li 0001, Yifan Gong 0001
Ensemble Combination between Different Time Segmentations.

ICASSP2021 Jeremy Heng Meng Wong, Xiong Xiao, Yifan Gong 0001
Hidden Markov Model Diarisation with Speaker Location Information.

ICASSP2021 Xiong Xiao, Naoyuki Kanda, Zhuo Chen 0006, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao 0008, Gang Liu 0001, Yu Wu 0012, Jian Wu 0027, Shujie Liu 0001, Jinyu Li 0001, Yifan Gong 0001
Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020.

Interspeech2021 Liang Lu 0001, Naoyuki Kanda, Jinyu Li 0001, Yifan Gong 0001
Streaming Multi-Talker Speech Recognition with Joint Speaker Identification.

Interspeech2021 Liang Lu 0001, Zhong Meng, Naoyuki Kanda, Jinyu Li 0001, Yifan Gong 0001
On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer.

Interspeech2021 Yan Huang 0028, Guoli Ye, Jinyu Li 0001, Yifan Gong 0001
Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need.

Interspeech2021 Yan Deng, Rui Zhao 0017, Zhong Meng, Xie Chen 0001, Bing Liu, Jinyu Li 0001, Yifan Gong 0001, Lei He 0005, 
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS.

Interspeech2021 Vikas Joshi, Amit Das 0007, Eric Sun, Rupesh R. Mehta, Jinyu Li 0001, Yifan Gong 0001
Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems.

Interspeech2021 Zhong Meng, Yu Wu 0012, Naoyuki Kanda, Liang Lu 0001, Xie Chen 0001, Guoli Ye, Eric Sun, Jinyu Li 0001, Yifan Gong 0001
Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition.

Interspeech2021 Eric Sun, Jinyu Li 0001, Zhong Meng, Yu Wu 0012, Jian Xue, Shujie Liu 0001, Yifan Gong 0001
Improving Multilingual Transformer Transducer Models by Reducing Language Confusions.

ICASSP2020 Yan Huang 0028, Yifan Gong 0001
Acoustic Model Adaptation for Presentation Transcription and Intelligent Meeting Assistant Systems.

ICASSP2020 Yan Huang 0028, Lei He 0005, Wenning Wei, William Gale, Jinyu Li 0001, Yifan Gong 0001
Using Personalized Speech Synthesis and Neural Language Generator for Rapid Speaker Adaptation.

ICASSP2020 Hu Hu, Rui Zhao 0017, Jinyu Li 0001, Liang Lu 0001, Yifan Gong 0001
Exploring Pre-Training with Alignments for RNN Transducer Based End-to-End Speech Recognition.

#51  | Daniel Povey | DBLP Google Scholar  
By venueInterspeech: 31ICASSP: 15TASLP: 2ICLR: 1
By year2024: 52023: 92022: 12021: 62020: 92019: 92018: 10
ISCA sessionsspeech recognition: 3speaker recognition evaluation: 3tools, corpora and resources: 2the voices from a distance challenge: 2multi-talker methods in speech processing: 1neural transducers, streaming asr and novel asr models: 1feature extraction and distant asr: 1lm adaptation, lexical units and punctuation: 1neural networks for language modeling: 1multilingual and code-switched asr: 1asr neural network architectures and training: 1summarization, semantic analysis and classification: 1representation learning of emotion and paralinguistics: 1spoken language processing for children’s speech: 1speaker recognition and diarization: 1recurrent neural models for asr: 1novel neural network architectures for acoustic modelling: 1robust speech recognition: 1speaker state and trait: 1end-to-end speech recognition: 1language modeling: 1acoustic modelling: 1representation learning for emotion: 1the first dihard speech diarization challenge: 1speaker verification using neural network methods: 1
IEEE keywordsspeech recognition: 10automatic speech recognition: 5transducers: 4decoding: 4predictive models: 3pipelines: 3natural language processing: 3transformer: 3standards: 2forced alignment: 2error analysis: 2switches: 2computational modeling: 2transducer: 2degradation: 2measurement: 2estimation: 2end to end: 2streaming: 2decoder: 2convolutional neural nets: 2speaker diarization: 2label priors: 1ctc: 1runtime: 1behavioral sciences: 1buildings: 1corpus: 1librivox: 1audio alignment: 1solids: 1computational efficiency: 1contextualized asr: 1encoding: 1prompts: 1codes: 1text to speech: 1multitasking: 1self supervised learning: 1discrete tokens: 1hidden markov models: 1benchmark testing: 1multi talker asr: 1surt: 1analytical models: 1task analysis: 1filtering: 1semi supervised learning: 1data models: 1noise measurement: 1pseudo labeling: 1tuning: 1neural transducer: 1vector quantization: 1prediction algorithms: 1knowledge distillation: 1signal processing algorithms: 1asr: 1error correction: 1information retrieval: 1keyword search: 1confidence: 1timing: 1delays: 1low latency communication: 1low latency: 1symbols: 1delay penalized: 1lattice pruning: 1speech coding: 1lattice generation: 1multistream cnn: 1robust acoustic modeling: 1neural net architecture: 1parallel processing: 1lattice rescoring: 1parallel computation: 1neural language models: 1lf mmi: 1computational complexity: 1gradient methods: 1wake word detection: 1voice activity detection: 1multiprocessing systems: 1parallel computing: 1edge: 1graphics processing units: 1optimisation: 1wfst: 1proposals: 1neural network: 1region proposal network: 1faster r cnn: 1language model adaptation: 1neural language model: 1interpolation: 1merging: 1linear interpolation: 1speaker recognition: 1deep neural networks: 1x vectors: 1
Most publications (all venues) at2018: 212019: 142015: 132023: 112020: 11


Recent publications

ICASSP2024 Ruizhe Huang, Xiaohui Zhang 0007, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe 0001, Daniel Povey, Sanjeev Khudanpur, 
Less Peaky and More Accurate CTC Forced Alignment by Label Priors.

ICASSP2024 Wei Kang 0006, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey
Libriheavy: A 50,000 Hours ASR Corpus with Punctuation Casing and Context.

ICASSP2024 Xiaoyu Yang, Wei Kang 0006, Zengwei Yao, Yifan Yang, Liyong Guo, Fangjun Kuang, Long Lin, Daniel Povey
PromptASR for Contextualized ASR with Controllable Style.

ICASSP2024 Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu 0004, Daniel Povey, Xie Chen 0001, 
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS.

ICLR2024 Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang 0006, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey
Zipformer: A faster and better encoder for automatic speech recognition.

TASLP2023 Desh Raj, Daniel Povey, Sanjeev Khudanpur, 
SURT 2.0: Advances in Transducer-Based Multi-Talker Speech Recognition.

TASLP2023 Han Zhu 0004, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan 0002, 
Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition.

ICASSP2023 Liyong Guo, Xiaoyu Yang, Quandong Wang, Yuxiang Kong, Zengwei Yao, Fan Cui, Fangjun Kuang, Wei Kang 0006, Long Lin, Mingshuang Luo, Piotr Zelasko, Daniel Povey
Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation.

ICASSP2023 Ruizhe Huang, Matthew Wiesner, Leibny Paola García-Perera, Daniel Povey, Jan Trmal, Sanjeev Khudanpur, 
Building Keyword Search System from End-To-End ASR Systems.

ICASSP2023 Wei Kang 0006, Zengwei Yao, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Long Lin, Piotr Zelasko, Daniel Povey
Delay-Penalized Transducer for Low-Latency Streaming ASR.

Interspeech2023 Dongji Gao, Matthew Wiesner, Hainan Xu, Leibny Paola García, Daniel Povey, Sanjeev Khudanpur, 
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts.

Interspeech2023 Desh Raj, Daniel Povey, Sanjeev Khudanpur, 
GPU-accelerated Guided Source Separation for Meeting Transcription.

Interspeech2023 Yifan Yang, Xiaoyu Yang, Liyong Guo, Zengwei Yao, Wei Kang 0006, Fangjun Kuang, Long Lin, Xie Chen 0001, Daniel Povey
Blank-regularized CTC for Frame Skipping in Neural Transducer.

Interspeech2023 Zengwei Yao, Wei Kang 0006, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Yifan Yang, Long Lin, Daniel Povey
Delay-penalized CTC Implemented Based on Finite State Transducer.

Interspeech2022 Fangjun Kuang, Liyong Guo, Wei Kang 0006, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey
Pruned RNN-T for fast, memory-efficient ASR training.

ICASSP2021 Hang Lv 0001, Zhehuai Chen, Hainan Xu, Daniel Povey, Lei Xie 0001, Sanjeev Khudanpur, 
An Asynchronous WFST-Based Decoder for Automatic Speech Recognition.

ICASSP2021 Kyu Jeong Han, Jing Pan, Venkata Krishna Naveen Tadala, Tao Ma, Dan Povey
Multistream CNN for Robust Acoustic Modeling.

ICASSP2021 Ke Li 0018, Daniel Povey, Sanjeev Khudanpur, 
A Parallelizable Lattice Rescoring Strategy with Neural Language Models.

ICASSP2021 Yiming Wang 0006, Hang Lv 0001, Daniel Povey, Lei Xie 0001, Sanjeev Khudanpur, 
Wake Word Detection with Streaming Transformers.

Interspeech2021 Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su 0002, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe 0001, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, Zhiyong Yan, 
GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio.

#52  | Zhuo Chen 0006 | DBLP Google Scholar  
By venueICASSP: 25Interspeech: 19TASLP: 4ICML: 1
By year2024: 32023: 112022: 102021: 152020: 62019: 32018: 1
ISCA sessionssource separation: 4speech recognition: 1multi-talker methods in speech processing: 1speaker and language recognition: 1other topics in speech recognition: 1robust asr, and far-field/multi-talker asr: 1single-channel speech enhancement: 1tools, corpora and resources: 1speaker diarization: 1applications in transcription, education and learning: 1multi- and cross-lingual asr, other topics in asr: 1asr neural network architectures: 1noise robust and distant speech recognition: 1multi-channel speech enhancement: 1rich transcription and asr systems: 1distant asr: 1
IEEE keywordsspeech recognition: 14speech separation: 8speaker recognition: 7continuous speech separation: 6error analysis: 5recurrent neural nets: 5speech enhancement: 4self supervised learning: 4source separation: 4task analysis: 3oral communication: 3computational modeling: 3transformers: 3multi talker automatic speech recognition: 3automatic speech recognition: 3audio signal processing: 3codecs: 2speech coding: 2semantics: 2transducers: 2conversation transcription: 2analytical models: 2representation learning: 2microphone arrays: 2benchmark testing: 2dual path modeling: 2speaker counting: 2speaker diarization: 2meeting transcription: 2transformer: 2convolutional neural nets: 2speech removal: 1codes: 1speech generation: 1noise reduction: 1audio text input: 1multi task learning: 1noise suppression: 1target speaker extraction: 1zero shot text to speech: 1speech editing: 1machine translation: 1speech translation: 1language model: 1speech synthesis: 1token level serialized output training: 1multi talker speech recognition: 1factorized neural transducer: 1text only adaptation: 1adaptation models: 1symbols: 1vocabulary: 1data models: 1wavlm: 1multi speaker: 1model size reduction: 1speech interruption detection: 1performance evaluation: 1semi supervised learning: 1pandemics: 1quantization (signal): 1focusing: 1training data: 1streaming inference: 1geometry: 1microphone array: 1machine learning: 1multi clue processing: 1robustness: 1cross modality attention: 1target sound extraction: 1packet loss concealment: 1packet loss: 1speaker change detection: 1degradation: 1e2e asr: 1transformer transducer: 1f1 score: 1limiting: 1data simulation: 1conversation analysis: 1signal processing algorithms: 1overlap ratio predictor: 1memory pool: 1multitasking: 1pre training: 1speaker: 1linear programming: 1personalized speech enhancement: 1speaker embedding: 1speech intelligibility: 1teleconferencing: 1perceptual speech quality: 1rich transcription: 1voice activity detection: 1multi talker asr: 1transducer: 1long form meeting transcription: 1dual path rnn: 1recurrent selective attention network: 1speaker inventory: 1mathematical model: 1estimated speech: 1particle separators: 1computer science: 1correlation: 1speaker separation: 1multi channel microphone: 1deep learning (artificial intelligence): 1signal representation: 1multi speaker asr: 1conformer: 1bayes methods: 1probability: 1speaker identification: 1natural language processing: 1minimum bayes risk training: 1long recording speech separation: 1online processing: 1transforms: 1neural network: 1wireless channels: 1blind source separation: 1separation: 1mimo communication: 1filtering theory: 1system fusion: 1permutation invariant training: 1libricss: 1microphones: 1overlapped speech: 1time domain: 1recurrent neural networks: 1frequency domain analysis: 1cnn: 1speaker verification: 1attentive pooling: 1lstm: 1attention: 1context modeling: 1data mining: 1speaker extraction: 1speaker profile: 1periodic structures: 1speaker independent speech separation: 1array signal processing: 1
Most publications (all venues) at2021: 232023: 192022: 192017: 112020: 10

Affiliations
Microsoft, Redmond, WA, USA
Columbia University, New York, NY, USA (PhD 2017)

Recent publications

TASLP2024 Xiaofei Wang 0007, Manthan Thakker, Zhuo Chen 0006, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu 0001, Jinyu Li 0001, Takuya Yoshioka, 
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer.

TASLP2024 Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu 0012, Shujie Liu 0001, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Furu Wei, 
VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation.

ICASSP2024 Jian Wu 0027, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao 0017, Zhuo Chen 0006, Jinyu Li 0001, 
T-SOT FNT: Streaming Multi-Talker ASR with Text-Only Domain Adaptation Capability.

ICASSP2023 Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, Sefik Emre Eskimez, 
Speech Separation with Large-Scale Self-Supervised Learning.

ICASSP2023 Quchen Fu, Szu-Wei Fu, Yaran Fan, Yu Wu 0012, Zhuo Chen 0006, Jayant Gupchup, Ross Cutler, 
Real-Time Speech Interruption Analysis: from Cloud to Client Deployment.

ICASSP2023 Zili Huang, Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yiming Wang, Jinyu Li 0001, Takuya Yoshioka, Xiaofei Wang 0009, Peidong Wang, 
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition.

ICASSP2023 Naoyuki Kanda, Jian Wu 0027, Xiaofei Wang 0009, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Vararray Meets T-Sot: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition.

ICASSP2023 Chenda Li, Yao Qian, Zhuo Chen 0006, Dongmei Wang, Takuya Yoshioka, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, 
Target Sound Extraction with Variable Cross-Modality Clues.

ICASSP2023 Heming Wang, Yao Qian, Hemin Yang, Naoyuki Kanda, Peidong Wang, Takuya Yoshioka, Xiaofei Wang 0009, Yiming Wang, Shujie Liu 0001, Zhuo Chen 0006, DeLiang Wang, Michael Zeng 0001, 
DATA2VEC-SG: Improving Self-Supervised Learning Representations for Speech Generation Tasks.

ICASSP2023 Jian Wu 0027, Zhuo Chen 0006, Min Hu, Xiong Xiao, Jinyu Li 0001, 
Speaker Change Detection For Transformer Transducer ASR.

ICASSP2023 Muqiao Yang, Naoyuki Kanda, Xiaofei Wang 0009, Jian Wu 0027, Sunit Sivasankaran, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Simulating Realistic Speech Overlaps Improves Multi-Talker ASR.

Interspeech2023 Chenda Li, Yao Qian, Zhuo Chen 0006, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng 0001, 
Adapting Multi-Lingual ASR Models for Handling Multiple Talkers.

Interspeech2023 Midia Yousefi, Naoyuki Kanda, Dongmei Wang, Zhuo Chen 0006, Xiaofei Wang 0009, Takuya Yoshioka, 
Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach.

ICML2023 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Shujie Liu 0001, Daniel Tompkins, Zhuo Chen 0006, Wanxiang Che, Xiangzhan Yu, Furu Wei, 
BEATs: Audio Pre-Training with Acoustic Tokenizers.

TASLP2022 Chenda Li, Zhuo Chen 0006, Yanmin Qian, 
Dual-Path Modeling With Memory Embedding Model for Continuous Speech Separation.

ICASSP2022 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Zhengyang Chen, Zhuo Chen 0006, Shujie Liu 0001, Jian Wu 0027, Yao Qian, Furu Wei, Jinyu Li 0001, Xiangzhan Yu, 
Unispeech-Sat: Universal Speech Representation Learning With Speaker Aware Pre-Training.

ICASSP2022 Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang 0009, Zhuo Chen 0006, Xuedong Huang 0001, 
Personalized speech enhancement: new models and Comprehensive evaluation.

ICASSP2022 Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang 0009, Zhong Meng, Zhuo Chen 0006, Takuya Yoshioka, 
Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR.

ICASSP2022 Desh Raj, Liang Lu 0001, Zhuo Chen 0006, Yashesh Gaur, Jinyu Li 0001, 
Continuous Streaming Multi-Talker ASR with Dual-Path Transducers.

ICASSP2022 Yixuan Zhang 0005, Zhuo Chen 0006, Jian Wu 0027, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li 0001, 
Continuous Speech Separation with Recurrent Selective Attention Network.

#53  | Ryo Masumura | DBLP Google Scholar  
By venueInterspeech: 33ICASSP: 14TASLP: 1
By year2023: 112022: 82021: 112020: 92019: 62018: 3
ISCA sessionsspeech recognition: 2end-to-end asr: 2spoken dialog systems and conversational analysis: 1paralinguistics: 1speech coding and enhancement: 1other topics in speech recognition: 1multi-, cross-lingual and other topics in asr: 1speech synthesis: 1spoken dialogue systems and multimodality: 1single-channel speech enhancement: 1speech emotion recognition: 1novel models and training methods for asr: 1spoken language processing: 1voice activity detection and keyword spotting: 1neural network training methods for asr: 1streaming for asr/rnn transducers: 1search/decoding techniques and confidence measures for asr: 1applications in transcription, education and learning: 1speech classification: 1training strategies for asr: 1asr neural network architectures and training: 1spoken language understanding: 1speech synthesis paradigms and methods: 1training strategy for speech emotion recognition: 1model training for asr: 1dialogue speech understanding: 1nn architectures for asr: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1speaker characterization and analysis: 1selected topics in neural speech processing: 1asr systems and technologies: 1
IEEE keywordsspeech recognition: 10neural network: 4recurrent neural nets: 4automatic speech recognition: 3task analysis: 3end to end: 3natural language processing: 3training data: 2recurrent neural network transducer: 2probability: 2knowledge distillation: 2transformer: 2end to end automatic speech recognition: 2end to end speech summarization: 1measurement: 1synthetic data augmentation: 1how2 dataset: 1multi modal data augmentation: 1oral communication: 1multi party conversation: 1video conversation: 1next speaker prediction: 1non verbal information: 1behavioral sciences: 1neural transducer: 1recurrent neural networks: 1robustness: 1linguistics: 1decoding: 1scheduled sampling: 1buildings: 1multilingual: 1representation learning: 1transformers: 1cross lingual: 1self supervised speech representation learning: 1attention based decoder: 1listener adaptation: 1perceived emotion: 1speech emotion recognition: 1emotion recognition: 1sequence to sequence pre training: 1text analysis: 1language translation: 1pointer generator networks: 1self supervised learning: 1spoken text normalization: 1blind source separation: 1audio signal processing: 1speech separation: 1audio visual: 1and cross modal: 1large context endo to end automatic speech recognition: 1hierarchical encoder decoder: 1entropy: 1whole network pre training: 1synchronisation: 1autoregressive processes: 1call centres: 1hierarchical multi task model: 1contact center call: 1customer satisfaction (cs): 1long short term memory recurrent neural networks: 1customer satisfaction: 1customer services: 1large context pointer generator networks: 1spoken to written style conversion: 1sequence level consistency training: 1specaugment: 1semi supervised learning: 1connectionist temporal classification: 1attention weight: 1speech codecs: 1speech coding: 1attention based encoder decoder: 1hierarchical recurrent encoder decoder: 1
Most publications (all venues) at2023: 192021: 182019: 182020: 162022: 15

Affiliations
URLs

Recent publications

ICASSP2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura
Leveraging Large Text Corpora For End-To-End Speech Summarization.

ICASSP2023 Saki Mizuno, Nobukatsu Hojo, Satoshi Kobashikawa, Ryo Masumura
Next-Speaker Prediction Based on Non-Verbal Information in Multi-Party Video Conversation.

ICASSP2023 Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura
Improving Scheduled Sampling for Neural Transducer-Based ASR.

ICASSP2023 Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Hiroshi Sato, Taiga Yamane, Takanori Ashihara, Kohei Matsuura, Takafumi Moriya, 
Leveraging Language Embeddings for Cross-Lingual Self-Supervised Speech Representation Learning.

Interspeech2023 Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, 
Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer.

Interspeech2023 Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, Saki Mizuno, Nobukatsu Hojo, 
Transcribing Speech as Spoken and Written Dual Text Using an Autoregressive Model.

Interspeech2023 Yuki Kitagishi, Naohiro Tawara, Atsunori Ogawa, Ryo Masumura, Taichi Asami, 
What are differences? Comparing DNN and Human by Their Performance and Characteristics in Speaker Age Estimation.

Interspeech2023 Naoki Makishima, Keita Suzuki, Satoshi Suzuki, Atsushi Ando, Ryo Masumura
Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction.

Interspeech2023 Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando, 
End-to-End Joint Target and Non-Target Speakers ASR.

Interspeech2023 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami, 
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data.

Interspeech2023 Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo, 
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss.

ICASSP2022 Takafumi Moriya, Takanori Ashihara, Atsushi Ando, Hiroshi Sato, Tomohiro Tanaka, Kohei Matsuura, Ryo Masumura, Marc Delcroix, Takahiro Shinozaki, 
Hybrid RNN-T/Attention-Based Streaming ASR with Triggered Chunkwise Attention and Dual Internal Language Model Integration.

Interspeech2022 Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura
Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data.

Interspeech2022 Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando, 
End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training.

Interspeech2022 Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari, 
Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis.

Interspeech2022 Fumio Nihei, Ryo Ishii, Yukiko I. Nakano, Kyosuke Nishida, Ryo Masumura, Atsushi Fukayama, Takao Nakamura, 
Dialogue Acts Aided Important Utterance Detection Based on Multiparty and Multimodal Information.

Interspeech2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations.

Interspeech2022 Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi, 
Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition.

Interspeech2022 Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, 
Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks.

ICASSP2021 Atsushi Ando, Ryo Masumura, Hiroshi Sato, Takafumi Moriya, Takanori Ashihara, Yusuke Ijima, Tomoki Toda, 
Speech Emotion Recognition Based on Listener Adaptive Models.

#54  | Takuya Yoshioka | DBLP Google Scholar  
By venueICASSP: 25Interspeech: 21TASLP: 1NAACL-Findings: 1
By year2024: 52023: 122022: 102021: 102020: 62019: 32018: 2
ISCA sessionssource separation: 3speech recognition: 2robust asr, and far-field/multi-talker asr: 2single-channel speech enhancement: 2speech enhancement and denoising: 1multi-talker methods in speech processing: 1other topics in speech recognition: 1applications in transcription, education and learning: 1multi- and cross-lingual asr, other topics in asr: 1asr neural network architectures: 1training strategies for asr: 1noise robust and distant speech recognition: 1multi-channel speech enhancement: 1rich transcription and asr systems: 1source separation from monaural input: 1distant asr: 1
IEEE keywordsspeech recognition: 12speech enhancement: 7error analysis: 7speech separation: 7speaker diarization: 5speaker recognition: 5oral communication: 4self supervised learning: 4continuous speech separation: 4task analysis: 3transformers: 3voice activity detection: 3multi talker automatic speech recognition: 3automatic speech recognition: 3audio signal processing: 3recurrent neural nets: 3source separation: 3speaker profile: 2signal processing algorithms: 2ts vad: 2streaming inference: 2computational modeling: 2conversation transcription: 2training data: 2microphone arrays: 2personalized speech enhancement: 2perceptual speech quality: 2rich transcription: 2speaker counting: 2natural language processing: 2meeting transcription: 2speaker identification: 2transformer: 2speech removal: 1codes: 1speech generation: 1codecs: 1noise reduction: 1audio text input: 1multi task learning: 1noise suppression: 1target speaker extraction: 1zero shot text to speech: 1speech coding: 1speech editing: 1clustering algorithms: 1speaker counting error: 1pet tsvad: 1token level serialized output training: 1transducers: 1multi talker speech recognition: 1factorized neural transducer: 1text only adaptation: 1adaptation models: 1symbols: 1vocabulary: 1measurement: 1speech translation: 1overlapping speech: 1recording: 1data models: 1wavlm: 1multi speaker: 1representation learning: 1focusing: 1geometry: 1microphone array: 1machine learning: 1multi clue processing: 1benchmark testing: 1robustness: 1cross modality attention: 1target sound extraction: 1target speech extraction: 1detectors: 1knowledge distillation: 1background noise: 1interference: 1packet loss concealment: 1packet loss: 1semantics: 1tensors: 1eend eda: 1correlation: 1data simulation: 1conversation analysis: 1analytical models: 1iterative methods: 1p.835: 1deep noise suppression: 1signal denoising: 1personalized noise suppression: 1speaker embedding: 1speech intelligibility: 1teleconferencing: 1robust automatic speech recognition: 1recurrent selective attention network: 1hypothesis stitcher: 1decoding: 1computer architecture: 1multi channel microphone: 1deep learning (artificial intelligence): 1signal representation: 1multi speaker asr: 1conformer: 1bayes methods: 1probability: 1minimum bayes risk training: 1filtering theory: 1system fusion: 1permutation invariant training: 1libricss: 1microphones: 1overlapped speech: 1time domain: 1recurrent neural networks: 1frequency domain analysis: 1attention: 1context modeling: 1data mining: 1speaker extraction: 1periodic structures: 1speaker independent speech separation: 1array signal processing: 1
Most publications (all venues) at2022: 192023: 172021: 152013: 102024: 9

Affiliations
URLs

Recent publications

TASLP2024 Xiaofei Wang 0007, Manthan Thakker, Zhuo Chen 0006, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu 0001, Jinyu Li 0001, Takuya Yoshioka
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer.

ICASSP2024 Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Midia Yousefi, Takuya Yoshioka, Jian Wu, 
Profile-Error-Tolerant Target-Speaker Voice Activity Detection.

ICASSP2024 Jian Wu 0027, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao 0017, Zhuo Chen 0006, Jinyu Li 0001, 
T-SOT FNT: Streaming Multi-Talker ASR with Text-Only Domain Adaptation Capability.

ICASSP2024 Mu Yang, Naoyuki Kanda, Xiaofei Wang 0009, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li 0001, Takuya Yoshioka
Diarist: Streaming Speech Translation with Speaker Diarization.

NAACL-Findings2024 Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu 0001, Dongdong Chen 0001, Yao Qian, Xuemei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao 0004, Yu Shi 0001, Lu Yuan, Takuya Yoshioka, Michael Zeng 0001, Xuedong Huang 0001, 
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data.

ICASSP2023 Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, Sefik Emre Eskimez, 
Speech Separation with Large-Scale Self-Supervised Learning.

ICASSP2023 Zili Huang, Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yiming Wang, Jinyu Li 0001, Takuya Yoshioka, Xiaofei Wang 0009, Peidong Wang, 
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition.

ICASSP2023 Naoyuki Kanda, Jian Wu 0027, Xiaofei Wang 0009, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka
Vararray Meets T-Sot: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition.

ICASSP2023 Chenda Li, Yao Qian, Zhuo Chen 0006, Dongmei Wang, Takuya Yoshioka, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, 
Target Sound Extraction with Variable Cross-Modality Clues.

ICASSP2023 Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka
Breaking the Trade-Off in Personalized Speech Enhancement With Cross-Task Knowledge Distillation.

ICASSP2023 Heming Wang, Yao Qian, Hemin Yang, Naoyuki Kanda, Peidong Wang, Takuya Yoshioka, Xiaofei Wang 0009, Yiming Wang, Shujie Liu 0001, Zhuo Chen 0006, DeLiang Wang, Michael Zeng 0001, 
DATA2VEC-SG: Improving Self-Supervised Learning Representations for Speech Generation Tasks.

ICASSP2023 Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Takuya Yoshioka, Jian Wu 0027, 
Target Speaker Voice Activity Detection with Transformers and Its Integration with End-To-End Neural Diarization.

ICASSP2023 Muqiao Yang, Naoyuki Kanda, Xiaofei Wang 0009, Jian Wu 0027, Sunit Sivasankaran, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka
Simulating Realistic Speech Overlaps Improves Multi-Talker ASR.

Interspeech2023 Sefik Emre Eskimez, Takuya Yoshioka, Alex Ju, Min Tang, Tanel Pärnamaa, Huaming Wang, 
Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation.

Interspeech2023 Naoyuki Kanda, Takuya Yoshioka, Yang Liu, 
Factual Consistency Oriented Speech Recognition.

Interspeech2023 Chenda Li, Yao Qian, Zhuo Chen 0006, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng 0001, 
Adapting Multi-Lingual ASR Models for Handling Multiple Talkers.

Interspeech2023 Midia Yousefi, Naoyuki Kanda, Dongmei Wang, Zhuo Chen 0006, Xiaofei Wang 0009, Takuya Yoshioka
Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach.

ICASSP2022 Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, Robert Aichner, 
Icassp 2022 Deep Noise Suppression Challenge.

ICASSP2022 Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang 0009, Zhuo Chen 0006, Xuedong Huang 0001, 
Personalized speech enhancement: new models and Comprehensive evaluation.

ICASSP2022 Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang 0009, Zhong Meng, Zhuo Chen 0006, Takuya Yoshioka
Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR.

#55  | Keisuke Kinoshita | DBLP Google Scholar  
By venueInterspeech: 21ICASSP: 20TASLP: 5SpeechComm: 1
By year2024: 12023: 42022: 82021: 122020: 122019: 82018: 2
ISCA sessionssource separation: 3speech enhancement: 2dereverberation, noise reduction, and speaker extraction: 1speaker embedding and diarization: 1single-channel speech enhancement: 1speaker diarization: 1source separation, dereverberation and echo cancellation: 1speech localization, enhancement, and quality assessment: 1speech enhancement and intelligibility: 1noise reduction and intelligibility: 1monaural source separation: 1multi-channel speech enhancement: 1diarization: 1targeted source separation: 1asr for noisy and far-field speech: 1speech and audio source separation and scene analysis: 1distant asr: 1speech intelligibility and quality: 1
IEEE keywordsspeech recognition: 14source separation: 10speech enhancement: 10reverberation: 7speaker recognition: 7blind source separation: 5dereverberation: 4neural network: 4online processing: 3recording: 3array signal processing: 3backpropagation: 3transfer functions: 2microphone array: 2artificial neural networks: 2continuous speech separation: 2permutation invariant training: 2particle separators: 2convolution: 2dynamic programming: 2blind source separation (bss): 2weighted prediction error (wpe): 2expectation maximization (em) algorithm: 2gaussian distribution: 2multivariate complex gaussian distribution: 2full rank spatial covariance analysis (fca): 2meeting recognition: 2switches: 2maximum likelihood estimation: 2diarization: 2signal to distortion ratio: 2speech separation: 2speech extraction: 2convolutional neural nets: 2dynamic stream weights: 2audio signal processing: 2optimisation: 2target speech extraction: 2time domain network: 2source counting: 2robust asr: 2time domain analysis: 2delays: 1noise reduction: 1spatial regularization: 1optimization: 1real time systems: 1telephone sets: 1data mining: 1few shot adaptation: 1sound event: 1soundbeam: 1target sound extraction: 1oral communication: 1graph pit: 1analytical models: 1sensors: 1blind dereverberation (bd): 1covariance matrices: 1probabilistic logic: 1error analysis: 1computational efficiency: 1software: 1tensors: 1word error rate: 1levenshtein distance: 1time frequency analysis: 1minimization: 1linear prediction (lp): 1pattern clustering: 1bayes methods: 1infinite gmm: 1mixture models: 1gaussian processes: 1training data: 1computational modeling: 1loss function: 1input switching: 1deep learning (artificial intelligence): 1noise robust speech recognition: 1speakerbeam: 1expectation maximisation algorithm: 1microphones: 1covariance analysis: 1acoustic beamforming: 1complex backpropagation: 1multi channel source separation: 1speaker activity: 1clustering algorithms: 1memory management: 1databases: 1signal processing algorithms: 1speaker diarization: 1long recording speech separation: 1transforms: 1dual path modeling: 1sensor fusion: 1audiovisual speaker localization: 1audio visual systems: 1image fusion: 1data fusion: 1video signal processing: 1beamforming: 1automatic speech recognition: 1filtering theory: 1microphone arrays: 1multi task loss: 1spatial features: 1separation: 1smart devices: 1robustness: 1task analysis: 1single channel speech enhancement: 1signal denoising: 1joint training: 1computational complexity: 1end to end speech recognition: 1hidden markov models: 1multi speaker speech recognition: 1time domain: 1frequency domain analysis: 1audiovisual speaker tracking: 1kalman filters: 1tracking: 1recurrent neural nets: 1backprop kalman filter: 1adaptation: 1auxiliary feature: 1iterative methods: 1joint optimization: 1least squares approximations: 1meeting diarization: 1speaker attention: 1speech separation/extraction: 1
Most publications (all venues) at2021: 242020: 172017: 162019: 152018: 12

Affiliations
URLs

Recent publications

TASLP2024 Tetsuya Ueda, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki, Shoji Makino, 
Blind and Spatially-Regularized Online Joint Optimization of Source Separation, Dereverberation, and Noise Reduction.

TASLP2023 Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki, 
SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning.

TASLP2023 Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach, 
Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria.

TASLP2023 Hiroshi Sawada, Rintaro Ikeshita, Keisuke Kinoshita, Tomohiro Nakatani, 
Multi-Frame Full-Rank Spatial Covariance Analysis for Underdetermined Blind Source Separation and Dereverberation.

ICASSP2023 Thilo von Neumann, Christoph Böddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach, 
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems.

ICASSP2022 Naoyuki Kamo, Rintaro Ikeshita, Keisuke Kinoshita, Tomohiro Nakatani, 
Importance of Switch Optimization Criterion in Switching WPE Dereverberation.

ICASSP2022 Keisuke Kinoshita, Marc Delcroix, Tomoharu Iwata, 
Tight Integration Of Neural- And Clustering-Based Diarization Through Deep Unfolding Of Infinite Gaussian Mixture Model.

ICASSP2022 Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach, 
SA-SDR: A Novel Loss Function for Separation of Meeting Style Data.

ICASSP2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya, 
Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition.

ICASSP2022 Hiroshi Sawada, Rintaro Ikeshita, Keisuke Kinoshita, Tomohiro Nakatani, 
Multi-Frame Full-Rank Spatial Covariance Analysis for Underdetermined BSS in Reverberant Environments.

Interspeech2022 Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolíková, Hiroshi Sato, Tomohiro Nakatani, 
Listen only to me! How well can target speech extraction handle false alarms?

Interspeech2022 Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Böddeker, Reinhold Haeb-Umbach, 
Utterance-by-utterance overlap-aware neural diarization with Graph-PIT.

Interspeech2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, 
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations.

ICASSP2021 Christoph Böddeker, Wangyou Zhang, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, Yanmin Qian, Reinhold Haeb-Umbach, 
Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation.

ICASSP2021 Marc Delcroix, Katerina Zmolíková, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani, 
Speaker Activity Driven Neural Speech Extraction.

ICASSP2021 Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara, 
Integrating End-to-End Neural and Clustering-Based Diarization: Getting the Best of Both Worlds.

ICASSP2021 Chenda Li, Zhuo Chen 0006, Yi Luo 0004, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe 0001, Yanmin Qian, 
Dual-Path Modeling for Long Recording Speech Separation in Meetings.

ICASSP2021 Julio Wissing, Benedikt T. Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura, 
Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain.

Interspeech2021 Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki, 
Few-Shot Learning of New Sound Classes for Target Sound Extraction.

Interspeech2021 Cong Han, Yi Luo 0004, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe 0001, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen 0006, 
Continuous Speech Separation Using Speaker Inventory for Long Recording.

#56  | Jiangyan Yi | DBLP Google Scholar  
By venueInterspeech: 21ICASSP: 13TASLP: 9SpeechComm: 2AAAI: 1ICML: 1
By year2024: 52023: 72022: 42021: 112020: 132019: 7
ISCA sessionsspeech synthesis: 3topics in asr: 2voice conversion and adaptation: 2speech coding and enhancement: 1speaker and language identification: 1asr: 1privacy-preserving machine learning for audio & speech processing: 1search/decoding techniques and confidence measures for asr: 1speech coding and privacy: 1computational resource constrained speech recognition: 1multi-channel audio and emotion recognition: 1speech enhancement: 1asr neural network architectures: 1sequence-to-sequence speech recognition: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1speech and audio source separation and scene analysis: 1nn architectures for asr: 1
IEEE keywordsspeech recognition: 11speech synthesis: 9natural language processing: 6end to end: 5speech enhancement: 4predictive models: 4transfer learning: 4error analysis: 3speech coding: 3text analysis: 3decoding: 3speaker recognition: 3noise robustness: 2text to speech: 2asvspoof: 2adversarial training: 2signal processing algorithms: 2spectrogram: 2text based speech editing: 2text editing: 2end to end model: 2attention: 2low resource: 2synthetic speech detection: 1interactive fusion: 1noise measurement: 1data models: 1knowledge distillation: 1noise: 1noise robust: 1fewer tokens: 1language model: 1speech codecs: 1speech codec: 1time invariant: 1codes: 1multiscale permutation entropy: 1nonlinear dynamics: 1deepfakes: 1power spectral entropy: 1entropy: 1audio deepfake detection: 1splicing: 1costs: 1prosodic boundaries: 1computational modeling: 1multi task learning: 1tagging: 1multi modal embeddings: 1bit error rate: 1linguistics: 1speaker dependent weighting: 1direction of arrival estimation: 1target speaker localization: 1generalized cross correlation: 1transforms: 1location awareness: 1buildings: 1automatic speaker verification: 1complexity theory: 1architecture: 1fake speech detection: 1voice activity detection: 1self distillation: 1task analysis: 1waveform generators: 1vocoders: 1deterministic plus stochastic: 1multiband excitation: 1noise control: 1vocoder: 1filtering theory: 1stochastic processes: 1one shot learning: 1coarse to fine decoding: 1mask prediction: 1mask and prediction: 1fast: 1bert: 1non autoregressive: 1cross modal: 1autoregressive processes: 1teacher student learning: 1language modeling: 1gated recurrent fusion: 1robust end to end speech recognition: 1speech transformer: 1speech distortion: 1decoupled transformer: 1automatic speech recognition: 1code switching: 1bi level decoupling: 1prosody modeling: 1speaking style modeling: 1personalized speech synthesis: 1few shot speaker adaptation: 1the m2voc challenge: 1prosody and voice factorization: 1sequence to sequence: 1transformer: 1robustness: 1phoneme level autoregression: 1clustering algorithms: 1end to end post filter: 1deep clustering: 1permutation invariant training: 1deep attention fusion features: 1speech separation: 1interference: 1prosody transfer: 1speaker adaptation: 1audio signal processing: 1optimisation: 1optimization strategy: 1forward backward algorithm: 1synchronous transformer: 1online speech recognition: 1encoding: 1asynchronous problem: 1chunk by chunk: 1cross lingual: 1word embedding: 1punctuation prediction: 1speech embedding: 1self attention: 1adversarial: 1language invariant: 1
Most publications (all venues) at2024: 212023: 202020: 182022: 172021: 17

Affiliations
URLs

Recent publications

SpeechComm2024 Cunhang Fan, Jun Xue, Shunbo Dong, Mingming Ding, Jiangyan Yi, Jinpeng Li, Zhao Lv, 
Subband fusion of complex spectrogram for fake speech detection.

TASLP2024 Cunhang Fan, Mingming Ding, Jianhua Tao 0001, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv, 
Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection.

ICASSP2024 Yong Ren, Tao Wang 0074, Jiangyan Yi, Le Xu, Jianhua Tao 0001, Chu Yuan Zhang, Junzuo Zhou, 
Fewer-Token Neural Speech Codec with Time-Invariant Codes.

ICASSP2024 Chenglong Wang, Jiayi He, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Xiaohui Zhang 0006, 
Multi-Scale Permutation Entropy for Audio Deepfake Detection.

AAAI2024 Xiaohui Zhang 0006, Jiangyan Yi, Chenglong Wang, Chu Yuan Zhang, Siding Zeng, Jianhua Tao 0001, 
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection.

SpeechComm2023 Jiangyan Yi, Jianhua Tao 0001, Ye Bai, Zhengkun Tian, Cunhang Fan, 
Transfer knowledge for punctuation prediction via adversarial training.

TASLP2023 Jiangyan Yi, Jianhua Tao 0001, Ruibo Fu, Tao Wang 0074, Chu Yuan Zhang, Chenglong Wang, 
Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction With Multi-Modal Embeddings.

ICASSP2023 Guanjun Li, Wei Xue, Wenju Liu, Jiangyan Yi, Jianhua Tao 0001, 
GCC-Speaker: Target Speaker Localization with Optimal Speaker-Dependent Weighting in Multi-Speaker Scenarios.

ICASSP2023 Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, Zhao Lv, 
Learning From Yourself: A Self-Distillation Method For Fake Speech Detection.

Interspeech2023 Chenglong Wang, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Shuai Zhang 0014, Xun Chen, 
Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features.

Interspeech2023 Chenglong Wang, Jiangyan Yi, Jianhua Tao 0001, Chu Yuan Zhang, Shuai Zhang 0014, Ruibo Fu, Xun Chen, 
TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection.

ICML2023 Xiaohui Zhang 0006, Jiangyan Yi, Jianhua Tao 0001, Chenglong Wang, Chu Yuan Zhang, 
Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection.

TASLP2022 Tao Wang 0074, Ruibo Fu, Jiangyan Yi, Jianhua Tao 0001, Zhengqi Wen, 
NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation for Noise-Controllable Waveform Generation.

TASLP2022 Tao Wang 0074, Jiangyan Yi, Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, 
CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing.

ICASSP2022 Tao Wang 0074, Jiangyan Yi, Liqun Deng, Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, 
Context-Aware Mask Prediction Network for End-to-End Text-Based Speech Editing.

Interspeech2022 Shuai Zhang 0014, Jiangyan Yi, Zhengkun Tian, Jianhua Tao 0001, Yu Ting Yeung, Liqun Deng, 
reducing multilingual context confusion for end-to-end code-switching automatic speech recognition.

TASLP2021 Ye Bai, Jiangyan Yi, Jianhua Tao 0001, Zhengkun Tian, Zhengqi Wen, Shuai Zhang 0014, 
Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT.

TASLP2021 Ye Bai, Jiangyan Yi, Jianhua Tao 0001, Zhengqi Wen, Zhengkun Tian, Shuai Zhang 0014, 
Integrating Knowledge Into End-to-End Speech Recognition From External Text-Only Data.

TASLP2021 Cunhang Fan, Jiangyan Yi, Jianhua Tao 0001, Zhengkun Tian, Bin Liu 0041, Zhengqi Wen, 
Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition.

ICASSP2021 Shuai Zhang 0014, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao 0001, Zhengqi Wen, 
Decoupling Pronunciation and Language for End-to-End Code-Switching Automatic Speech Recognition.

#57  | Chao Weng | DBLP Google Scholar  
By venueICASSP: 25Interspeech: 18TASLP: 3ACL: 1
By year2024: 62023: 82022: 82021: 102020: 82019: 52018: 2
ISCA sessionsspeech synthesis: 2speech coding and enhancement: 2singing voice computing and processing in music: 2sequence models for asr: 2speech recognition: 1speaker embedding and diarization: 1acoustic event detection and classification: 1tools, corpora and resources: 1topics in asr: 1source separation, dereverberation and echo cancellation: 1asr model training and strategies: 1multi-channel speech enhancement: 1speech synthesis paradigms and methods: 1asr neural network training: 1
IEEE keywordsspeech recognition: 11speaker recognition: 9speech enhancement: 5decoding: 4overlapped speech: 4end to end speech recognition: 4automatic speech recognition: 3speech synthesis: 3natural language processing: 3speaker embedding: 3voice activity detection: 3speaker diarization: 3multi channel: 3diffusion model: 2noise reduction: 2computational modeling: 2transformers: 2vits: 2training data: 2bayes methods: 2task analysis: 2measurement: 2text analysis: 2recurrent neural nets: 2error analysis: 2speaker verification: 2pattern clustering: 2direction of arrival estimation: 2multi look: 2source separation: 2unsupervised learning: 2attention based model: 2prompt based learning: 1text to speech: 1metric learning: 1natural languages: 1representation learning: 1semantics: 1speech: 1multi path transformer: 1computational efficiency: 1speech denoising: 1complexity scaling: 1neural network: 1computer architecture: 1couplings: 1harmonic analysis: 1variational autoencoder: 1adversarial learning: 1neural source filter model: 1synthesizers: 1singing voice synthesis: 1expressive tts: 1bigvgan: 1durian e: 1adaptation models: 1linguistics: 1style adaptive instance normalization: 1signal generators: 1adaptive systems: 1multi channel speech enhancement: 1iterative methods: 1data models: 1optimization inspired: 1proximal gradient decent: 1pipelines: 1array signal processing: 1transducers: 1discriminative training: 1mutual information: 1maximum mutual information: 1minimum bayesian risk: 1sequential training: 1end to end: 1autoregressive model: 1vocoders: 1text to sound generation: 1transforms: 1spectrogram: 1vocoder: 1headphones: 1personalized speech enhancement: 1artificial intelligence: 1recurrent neural networks: 1band split rnn: 1dns challenge 2023: 1speaking style: 1conversational text to speech synthesis: 1graph neural network: 1three dimensional displays: 1noisy label: 1convolution: 1attention module: 1rnn t: 1code switched asr: 1bilingual asr: 1computational linguistics: 1speaker clustering: 1inference mechanisms: 1overlap speech detection: 1feature fusion: 1data handling: 1m2met: 1direction of arrival: 1synthetic speech detection: 1res2net: 1replay detection: 1multi scale feature: 1asv anti spoofing: 1target speaker speech recognition: 1targetspeaker speech extraction: 1uncertainty estimation: 1ctc: 1non autoregressive: 1transformer: 1autoregressive processes: 1microphone arrays: 1source localization: 1contrastive learning: 1semi supervised learning: 1data augmentation: 1self supervised learning: 1target speaker enhancement: 1robust speaker verification: 1interference suppression: 1speech separation: 1regression analysis: 1singing synthesis: 1audio signal processing: 1voice conversion: 1target speech extraction: 1minimisation: 1neural beamformer: 1signal reconstruction: 1self attention: 1persistent memory: 1dfsmn: 1robust automatic speech recognition: 1cepstral analysis: 1channel normalization: 1cepstral mean normalization: 1language model: 1code switching: 1mathematical model: 1switches: 1attention based end to end speech recognition: 1early update: 1hidden markov models: 1optimization: 1token wise training: 1
Most publications (all venues) at2023: 162022: 152021: 142019: 112024: 9

Affiliations
URLs

Recent publications

TASLP2024 Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng, 
InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt.

ICASSP2024 Hangting Chen, Jianwei Yu, Chao Weng
Complexity Scaling for Speech Denoising.

ICASSP2024 Jianwei Cui, Yu Gu, Chao Weng, Jie Zhang 0042, Liping Chen, Lirong Dai 0001, 
Sifisinger: A High-Fidelity End-to-End Singing Voice Synthesizer Based on Source-Filter Model.

ICASSP2024 Yu Gu, Qiushi Zhu, Guangzhi Lei, Chao Weng, Dan Su 0002, 
DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis.

ICASSP2024 Andong Li, Rilin Chen, Yu Gu, Chao Weng, Dan Su, 
Opine: Leveraging a Optimization-Inspired Deep Unfolding Method for Multi-Channel Speech Enhancement.

ACL2024 Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang 0001, Ziyue Jiang 0001, Xuankai Chang, Jiatong Shi, Chao Weng, Zhou Zhao, Dong Yu 0001, 
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.

TASLP2023 Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu 0001, 
Integrating Lattice-Free MMI Into End-to-End Speech Recognition.

TASLP2023 Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu 0001, 
Diffsound: Discrete Diffusion Model for Text-to-Sound Generation.

ICASSP2023 Jianwei Yu, Hangting Chen, Yi Luo 0004, Rongzhi Gu, Weihua Li, Chao Weng
TSpeech-AI System Description to the 5th Deep Noise Suppression (DNS) Challenge.

Interspeech2023 Xiang Li 0105, Songxiang Liu, Max W. Y. Lam, Zhiyong Wu 0001, Chao Weng, Helen Meng, 
Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model.

Interspeech2023 Hangting Chen, Jianwei Yu, Yi Luo 0004, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng
Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression.

Interspeech2023 Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu 0001, Shinji Watanabe 0001, 
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction.

Interspeech2023 Dongchao Yang, Songxiang Liu, Helin Wang, Jianwei Yu, Chao Weng, Yuexian Zou, 
NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS.

Interspeech2023 Jianwei Yu, Hangting Chen, Yi Luo 0004, Rongzhi Gu, Chao Weng
High Fidelity Speech Enhancement with Band-split RNN.

ICASSP2022 Jingbei Li, Yi Meng, Chenyi Li, Zhiyong Wu 0001, Helen Meng, Chao Weng, Dan Su 0002, 
Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-Based Multi-Modal Context Modeling.

ICASSP2022 Xiaoyi Qin, Na Li 0012, Chao Weng, Dan Su 0002, Ming Li 0026, 
Simple Attention Module Based Speaker Verification with Iterative Noisy Label Detection.

ICASSP2022 Brian Yan, Chunlei Zhang, Meng Yu 0003, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe 0001, Dong Yu 0001, 
Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization.

ICASSP2022 Chunlei Zhang, Jiatong Shi, Chao Weng, Meng Yu 0003, Dong Yu 0001, 
Towards end-to-end Speaker Diarization with Generalized Neural Speaker Clustering.

ICASSP2022 Naijun Zheng, Na Li 0012, Xixin Wu, Lingwei Meng, Jiawen Kang 0002, Haibin Wu, Chao Weng, Dan Su 0002, Helen Meng, 
The CUHK-Tencent Speaker Diarization System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.

ICASSP2022 Naijun Zheng, Na Li 0012, Jianwei Yu, Chao Weng, Dan Su 0002, Xunying Liu, Helen Meng, 
Multi-Channel Speaker Diarization Using Spatial Features for Meetings.

#58  | Zejun Ma | DBLP Google Scholar  
By venueICASSP: 20Interspeech: 20ICLR: 2ICML: 1IJCAI: 1ACL-Findings: 1AAAI: 1KDD: 1
By year2024: 72023: 162022: 132021: 92020: 2
ISCA sessionsspeech recognition: 6speech synthesis: 3end-to-end asr: 1statistical machine translation: 1voice conversion: 1speech segmentation: 1applications in transcription, education and learning: 1neural transducers, streaming asr and novel asr models: 1asr: 1speech enhancement and intelligibility: 1neural network training methods for asr: 1voice activity detection and keyword spotting: 1streaming for asr/rnn transducers: 1
IEEE keywordsspeech recognition: 8natural language processing: 6automatic speech recognition: 4audio signal processing: 4representation learning: 3music: 3speech synthesis: 3text analysis: 3decoding: 2task analysis: 2training data: 2error analysis: 2degradation: 2multilingual: 2asr: 2q former: 2switches: 2transformers: 2databases: 2ctc: 2speech coding: 2oral communication: 1serialized output training: 1speaker aware: 1semantics: 1multi talker: 1language extension: 1dual encoders: 1multimodal large language model: 1visual perception: 1audio captioning: 1large language model: 1connectors: 1long form speech: 1phone embed ding: 1production: 1linguistic acoustic similarity: 1goodness of pronunciation: 1acoustic measurements: 1pronunciation scoring: 1fluency scoring: 1indexes: 1self supervised learning: 1non native speech: 1predictive models: 1fusion: 1artificial neural networks: 1domain adaptation: 1adaptation models: 1language model: 1estimation: 1internal language model: 1computational modeling: 1parallel: 1performance evaluation: 1parallel processing: 1liteg2p: 1g2p: 1tts: 1expert knowledge: 1sound event detection: 1semantic networks: 1convolutional neural nets: 1audio classification: 1transformer: 1graphics processing units: 1token semantic module: 1signal classification: 1collaborative decoding: 1contextual biasing: 1knowledge selection: 1contextual speech recognition: 1semi supervised learning (artificial intelligence): 1semi supervised learning: 1unsupervised learning: 1end to end model: 1signal representation: 1pseudo labeling: 1language adaptation: 1direction of arrival estimation: 1data augmentation: 1multi channel multi speaker speech recognition: 1reverberation: 1alimeeting: 1speaker recognition: 1m2met: 1speaker diarization: 1approximation theory: 1bayes methods: 1pattern classification: 1melody midi: 1pose estimation: 1audio recording: 1pitch refinement: 1high resolution network (hrnet): 1melody extraction: 1cross modal learning: 1multi modal fusion: 1audio visual systems: 1video streaming: 1audio visual voice detection: 1voice activity detection: 1video signal processing: 1rule embedding: 1confusion module: 1hidden markov models: 1phonetic posteriorgrams: 1singing voice conversion: 1text to speech: 1emotion recognition: 1speaker determination: 1emotion classifi cation: 1audiobook: 1transfer learning: 1entropy: 1recurrent neural nets: 1rnn t: 1feedforward neural nets: 1keyword spotting: 1multi task: 1joint modeling: 1sequence to sequence: 1linguistics: 1text to speech front end: 1semi auto regressive: 1pipelines: 1knowledge based systems: 1multi head self attention: 1imbalanced dataset: 1mandarin: 1text normalization: 1
Most publications (all venues) at2023: 332022: 292024: 132021: 132020: 6

Affiliations
URLs

Recent publications

ICASSP2024 Zhiyun Fan, Linhao Dong, Jun Zhang 0066, Lu Lu 0015, Zejun Ma
SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR.

ICASSP2024 Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Lu Lu, Zejun Ma
Extending Multilingual ASR to New Languages Using Supplementary Encoder and Decoder Components.

ICASSP2024 Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan 0019, Wei Li 0119, Lu Lu 0015, Zejun Ma, Chao Zhang 0031, 
Extending Large Language Models for Speech and Audio Captioning.

ICASSP2024 Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan 0019, Wei Li 0119, Lu Lu 0015, Zejun Ma, Chao Zhang 0031, 
Connecting Speech Encoder and Large Language Model for ASR.

ICML2024 Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan 0019, Wei Li 0119, Lu Lu 0015, Zejun Ma, Yuxuan Wang 0002, Chao Zhang 0031, 
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.

ICLR2024 Ziyue Jiang 0001, Jinglin Liu, Yi Ren 0006, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang 0020, Pengfei Wei 0001, Chunfeng Wang, Xiang Yin 0006, Zejun Ma, Zhou Zhao, 
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis.

ICLR2024 Qianqian Dong, Zhiying Huang, Qi Tian 0001, Chen Xu 0008, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li 0001, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu 0015, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang 0002, 
PolyVoice: Language Models for Speech to Speech Translation.

ICASSP2023 Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li 0119, Zejun Ma, Tan Lee 0001, 
Leveraging Phone-Level Linguistic-Acoustic Similarity For Utterance-Level Pronunciation Scoring.

ICASSP2023 Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li 0119, Zejun Ma, Tan Lee 0001, 
An ASR-Free Fluency Scoring Approach with Self-Supervised Learning.

ICASSP2023 Rao Ma, Xiaobo Wu, Jin Qiu, Yanan Qin, Haihua Xu, Peihao Wu, Zejun Ma
Internal Language Model Estimation Based Adaptive Language Model Fusion for Domain Adaptation.

ICASSP2023 Chunfeng Wang, Peisong Huang, Yuxiang Zou, Haoyu Zhang, Shichao Liu 0003, Xiang Yin 0006, Zejun Ma
LiteG2P: A Fast, Light and High Accuracy Model for Grapheme-to-Phoneme Conversion.

Interspeech2023 Xianzhao Chen, Yist Y. Lin, Kang Wang, Yi He, Zejun Ma
Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition.

Interspeech2023 Zhipeng Chen, Haihua Xu, Yerbolat Khassanov, Yi He, Lu Lu, Zejun Ma, Ji Wu 0002, 
Knowledge Distillation Approach for Efficient Internal Language Model Estimation.

Interspeech2023 Yahuan Cong, Haoyu Zhang, Haopeng Lin, Shichao Liu 0003, Chunfeng Wang, Yi Ren 0006, Xiang Yin 0006, Zejun Ma
GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech.

Interspeech2023 Zhiyun Fan, Linhao Dong, Chen Shen 0011, Zhenlin Liang, Jun Zhang 0066, Lu Lu 0015, Zejun Ma
Language-specific Boundary Learning for Improving Mandarin-English Code-switching Speech Recognition.

Interspeech2023 Kaiqi Fu, Shaojun Gao, Shuju Shi, Xiaohai Tian, Wei Li 0119, Zejun Ma
Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring.

Interspeech2023 Lu Huang, Boyu Li, Jun Zhang 0066, Lu Lu 0015, Zejun Ma
Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer.

Interspeech2023 Yist Y. Lin, Tao Han, Haihua Xu, Van Tung Pham, Yerbolat Khassanov, Tze Yuang Chong, Yi He, Lu Lu 0015, Zejun Ma
Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition.

Interspeech2023 Shuju Shi, Kaiqi Fu, Yiwei Gu, Xiaohai Tian, Shaojun Gao, Wei Li 0119, Zejun Ma
Disentangling the Contribution of Non-native Speech in Automated Pronunciation Assessment.

Interspeech2023 Kun Song, Yi Ren 0006, Yi Lei, Chunfeng Wang, Kun Wei, Lei Xie 0001, Xiang Yin 0006, Zejun Ma
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation.

#59  | Wenwu Wang 0001 | DBLP Google Scholar  
By venueICASSP: 18TASLP: 13Interspeech: 13AAAI: 1ICML: 1
By year2024: 82023: 112022: 72021: 72020: 72019: 6
ISCA sessionsanalysis of speech and audio signals: 4source separation: 2acoustic event detection: 2automatic audio classification and audio captioning: 1acoustic scene analysis: 1acoustic event detection and classification: 1speaker and language recognition: 1acoustic event detection and acoustic scene classification: 1
IEEE keywordstask analysis: 9audio signal processing: 9transformers: 4convolutional neural nets: 4source separation: 4audio tagging: 4self supervised learning: 3semantics: 3visualization: 3time frequency analysis: 3computational modeling: 2audio classification: 2spectrogram: 2scene classification: 2predictive models: 2diffusion model: 2audio generation: 2music: 2noise measurement: 2representation learning: 2decoding: 2singing voice separation: 2cross modal task: 2audio captioning: 2logic gates: 2deep learning (artificial intelligence): 2particle filtering (numerical methods): 2direction of arrival estimation: 2object detection: 2object tracking: 2audio visual systems: 2filtering theory: 2reverberation: 2pattern clustering: 2weakly labelled data: 2multiple instance learning: 2signal classification: 2monaural source separation: 2speech intelligibility: 2audio spectrogram: 1context modeling: 1vision transformers: 1group masked model learning: 1similarity learning: 1image reconstruction: 1acoustic scene classification: 1scene event relation: 1couplings: 1cooperative modelling: 1adaptation models: 1audio event classification: 1aigc: 1speech synthesis: 1electronic mail: 1chatbots: 1chatgpt: 1engines: 1multimodal learning: 1pipelines: 1audio language dataset: 1weakly supervised learning: 1cross modal aggregation: 1video sequences: 1segment based attention: 1audio visual video parsing: 1anomalous sound detection: 1domain shift: 1metadata: 1tail: 1long tail problem: 1measurement: 1data models: 1retrieval information: 1time frequency attention: 1frequency band aware: 1u shaped: 1speech enhancement: 1transformer: 1computer architecture: 1optimization: 1particle separators: 1euclidean distance: 1deep unfolding: 1analysis prior: 1contrastive learning: 1caption consistency regularization: 1sequential audio tagging: 1connectionist temporal classification: 1tagging: 1gated contextual transformer: 1training data: 1data mining: 1optimal transport: 1speaker identification: 1gating mechanism: 1optimal transport kernel embedding: 1maximum likelihood estimation: 1backpropagation: 1natural language processing: 1reinforcement learning: 1gans: 1sound event detection: 1signal detection: 1mutual learning: 1few shot learning: 1transductive inference: 1microphone arrays: 1tracking: 1multiple speaker tracking: 1target tracking: 1audio visual fusion: 1speaker recognition: 1pmbm filter: 1estimation: 1harmonic analysis: 1multi speaker localization: 1sensor arrays: 1circular harmonics: 1bayesian nonparametrics (bnp): 1location awareness: 1microphone array signal processing: 1array signal processing: 1direction of arrival (doa) estimation: 1evolving multi resolution pooling cnn: 1neural architecture search: 1pareto optimisation: 1neural net architecture: 1genetic algorithm: 1signal resolution: 1voice activity detection: 1genetic algorithms: 1monaural singing voice separation: 1weak labels: 1two stream framework: 1class wise attentional clips: 1loss measurement: 1mean square error methods: 1phase measurement: 1deep neural network: 1loss function: 1speech dereverberation: 1complex ideal ratio mask: 1weight measurement: 1transfer learning: 1speech recognition: 1emotion recognition: 1computational complexity: 1pretrained audio neural networks: 1metric learning: 1meta learning: 1class imbalance: 1image classification: 1remote sensing: 1spatial attention: 1channel wise attention: 1supervised learning: 1out of distribution: 1convolutional neural network: 1pseudo labelling: 1discrete fourier transforms: 1interpolated dft: 1analytical solution: 1frequency estimation: 1jacobsen estimator: 1interpolation: 1window function: 1audioset: 1attention neural network: 1comb filters: 1ipd: 1binaural audio: 1multipath propagation: 1acoustic wave propagation: 1ild: 1comb filter effect: 1signal representation: 1interaural coherence: 1rirs: 1dereverberation mask: 1deep neural networks: 1transient response: 1highly reverberant room environments: 1smc phd filter: 1probability: 1audio visual tracking: 1particle flow: 1monte carlo methods: 1audio quality: 1neural network: 1intelligibility: 1background adaptation: 1listening experience: 1recurrent neural nets: 1proximal algorithm: 1recurrent neural network: 1
Most publications (all venues) at2023: 432022: 432024: 372018: 372021: 29

Affiliations
University of Surrey, Guildford, UK

Recent publications

TASLP2024 Sara Atito Ali Ahmed 0001, Muhammad Awais 0001, Wenwu Wang 0001, Mark D. Plumbley, Josef Kittler, 
ASiT: Local-Global Audio Spectrogram Vision Transformer for Event Classification.

TASLP2024 Yuanbo Hou, Bo Kang, Andrew Mitchell, Wenwu Wang 0001, Jian Kang 0002, Dick Botteldooren, 
Cooperative Scene-Event Modelling for Acoustic Scene Classification.

TASLP2024 Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang 0001, Yuxuan Wang 0002, Mark D. Plumbley, 
AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining.

TASLP2024 Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang 0001
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

ICASSP2024 Yaru Chen, Ruohao Guo, Xubo Liu, Peipei Wu, Guangyao Li, Zhenbo Li, Wenwu Wang 0001
CM-PIE: Cross-Modal Perception for Interactive-Enhanced Audio-Visual Video Parsing.

ICASSP2024 Haiyan Lan, Qiaoxi Zhu, Jian Guan 0001, Yuming Wei, Wenwu Wang 0001
Hierarchical Metadata Information Constrained Self-Supervised Learning for Anomalous Sound Detection under Domain Shift.

ICASSP2024 Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang 0001
Retrieval-Augmented Text-to-Audio Generation.

AAAI2024 Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang 0001, Mark D. Plumbley, 
Learning Temporal Resolution in Spectrogram for Audio Classification.

TASLP2023 Yi Li, Yang Sun 0003, Wenwu Wang 0001, Syed Mohsen Naqvi, 
U-Shaped Transformer With Frequency-Band Aware Attention for Speech Enhancement.

TASLP2023 Weitao Yuan, Shengbei Wang, Jianming Wang, Masashi Unoki, Wenwu Wang 0001
Unsupervised Deep Unfolded Representation Learning for Singing Voice Separation.

TASLP2023 Yiming Zhang, Hong Yu 0006, Ruoyi Du, Zheng-Hua Tan, Wenwu Wang 0001, Zhanyu Ma, Yuan Dong, 
ACTUAL: Audio Captioning With Caption Feature Space Regularization.

ICASSP2023 Yuanbo Hou, Yun Wang, Wenwu Wang 0001, Dick Botteldooren, 
GCT: Gated Contextual Transformer for Sequential Audio Tagging.

ICASSP2023 Weitao Yuan, Yuren Bian, Shengbei Wang, Masashi Unoki, Wenwu Wang 0001
An Improved Optimal Transport Kernel Embedding Method with Gating Mechanism for Singing Voice Separation and Speaker Identification.

Interspeech2023 Yuanbo Hou, Siyang Song, Cheng Luo, Andrew Mitchell, Qiaoqiao Ren, Weicheng Xie 0001, Jian Kang 0002, Wenwu Wang 0001, Dick Botteldooren, 
Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning.

Interspeech2023 Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang 0001
Adapting Language-Audio Models as Few-Shot Audio Learners.

Interspeech2023 Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang 0006, H. Lilian Tang, Mark D. Plumbley, Volkan Kiliç, Wenwu Wang 0001
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention.

Interspeech2023 Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang 0001, Mark D. Plumbley, 
Ontology-aware Learning and Evaluation for Audio Tagging.

Interspeech2023 Jianyuan Sun, Xubo Liu, Xinhao Mei, Volkan Kiliç, Mark D. Plumbley, Wenwu Wang 0001
Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning.

ICML2023 Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo P. Mandic, Wenwu Wang 0001, Mark D. Plumbley, 
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.

ICASSP2022 Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang 0001
Diverse Audio Captioning Via Adversarial Training.

#60  | Shujie Liu 0001 | DBLP Google Scholar  
By venueICASSP: 18Interspeech: 15TASLP: 4AAAI: 3ICML: 2ACL: 2NeurIPS: 1EMNLP: 1
By year2024: 42023: 122022: 142021: 82020: 72019: 1
ISCA sessionsnovel models and training methods for asr: 3source separation: 3analysis of speech and audio signals: 1speech recognition: 1statistical machine translation: 1speaker and language recognition: 1multi- and cross-lingual asr, other topics in asr: 1asr model training and strategies: 1streaming asr: 1asr neural network architectures: 1speech synthesis: 1
IEEE keywordsspeech recognition: 11speech enhancement: 6speaker recognition: 5transformers: 4task analysis: 4representation learning: 4self supervised learning: 4transformer: 4natural language processing: 4speech coding: 3semantics: 3error analysis: 3automatic speech recognition: 3speech separation: 3transducers: 2factorized neural transducer: 2predictive models: 2vocabulary: 2codecs: 2speech translation: 2data models: 2multitasking: 2pre training: 2decoding: 2benchmark testing: 2robustness: 2analytical models: 2transformer transducer: 2speaker verification: 2source separation: 2recurrent neural nets: 2long content speech recognition: 1streaming and non streaming: 1context modeling: 1rnn t: 1computer architecture: 1speech removal: 1codes: 1speech generation: 1noise reduction: 1audio text input: 1multi task learning: 1noise suppression: 1target speaker extraction: 1zero shot text to speech: 1speech editing: 1machine translation: 1language model: 1speech synthesis: 1speech text joint pre training: 1discrete tokenization: 1unified modeling language: 1variance adaptor: 1neural tts: 1adaptation models: 1semiconductor device modeling: 1speecht5: 1prosody: 1fuses: 1long form speech recognition: 1context and speech encoder: 1machine learning: 1multi clue processing: 1cross modality attention: 1target sound extraction: 1packet loss concealment: 1packet loss: 1speech to speech translation: 1joint pre training: 1data mining: 1cross lingual modeling: 1tts conversion: 1code switching asr: 1cross modality learning: 1industries: 1learning systems: 1noise robustness: 1contrastive learning: 1noise measurement: 1self supervised pre training: 1self supervised pretrain: 1unsupervised learning: 1image representation: 1speaker: 1linear programming: 1text analysis: 1speaker identification: 1robust automatic speech recognition: 1switches: 1multi modality: 1end to end: 1supervised learning: 1configurable multilingual model: 1multilingual speech recognition: 1multi channel microphone: 1deep learning (artificial intelligence): 1signal representation: 1real time decoding: 1transducer: 1encoding: 1continuous speech separation: 1multi speaker asr: 1conformer: 1filtering theory: 1audio signal processing: 1system fusion: 1speaker diarization: 1
Most publications (all venues) at2023: 212022: 202024: 142021: 142020: 14

Affiliations
Microsoft Research Asia, Beijing, China

Recent publications

TASLP2024 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
Advanced Long-Content Speech Recognition With Factorized Neural Transducer.

TASLP2024 Xiaofei Wang 0007, Manthan Thakker, Zhuo Chen 0006, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu 0001, Jinyu Li 0001, Takuya Yoshioka, 
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer.

TASLP2024 Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu 0012, Shujie Liu 0001, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Furu Wei, 
VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation.

TASLP2024 Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu 0012, Shuo Ren, Shujie Liu 0001, Zhuoyuan Yao, Xun Gong 0005, Li-Rong Dai 0001, Jinyu Li 0001, Furu Wei, 
SpeechLM: Enhanced Speech Pre-Training With Unpaired Textual Data.

ICASSP2023 Yan Deng, Long Zhou, Yuanhao Yi, Shujie Liu 0001, Lei He 0005, 
Prosody-Aware SpeechT5 for Expressive Neural TTS.

ICASSP2023 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.

ICASSP2023 Chenda Li, Yao Qian, Zhuo Chen 0006, Dongmei Wang, Takuya Yoshioka, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, 
Target Sound Extraction with Variable Cross-Modality Clues.

ICASSP2023 Heming Wang, Yao Qian, Hemin Yang, Naoyuki Kanda, Peidong Wang, Takuya Yoshioka, Xiaofei Wang 0009, Yiming Wang, Shujie Liu 0001, Zhuo Chen 0006, DeLiang Wang, Michael Zeng 0001, 
DATA2VEC-SG: Improving Self-Supervised Learning Representations for Speech Generation Tasks.

ICASSP2023 Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu 0001, Lei He 0005, Jinyu Li 0001, Furu Wei, 
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation.

ICASSP2023 Haibin Yu, Yuxuan Hu, Yao Qian, Ma Jin, Linquan Liu, Shujie Liu 0001, Yu Shi 0001, Yanmin Qian, Edward Lin, Michael Zeng 0001, 
Code-Switching Text Generation and Injection in Mandarin-English ASR.

ICASSP2023 Qiu-Shi Zhu, Long Zhou, Jie Zhang 0042, Shujie Liu 0001, Yu-Chen Hu, Li-Rong Dai 0001, 
Robust Data2VEC: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning.

Interspeech2023 Youngdo Ahn, Chengyi Wang 0002, Yu Wu 0012, Jong Won Shin, Shujie Liu 0001
GRAVO: Learning to Generate Relevant Audio from Visual Features with Noisy Online Videos.

Interspeech2023 Yuang Li, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001
Accelerating Transducers through Adjacent Token Merging.

Interspeech2023 Peidong Wang, Eric Sun, Jian Xue, Yu Wu 0012, Long Zhou, Yashesh Gaur, Shujie Liu 0001, Jinyu Li 0001, 
LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers.

ICML2023 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Shujie Liu 0001, Daniel Tompkins, Zhuo Chen 0006, Wanxiang Che, Xiangzhan Yu, Furu Wei, 
BEATs: Audio Pre-Training with Acoustic Tokenizers.

NeurIPS2023 Chenyang Le, Yao Qian, Long Zhou, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, Xuedong Huang 0001, 
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation.

ICASSP2022 Zhengyang Chen, Sanyuan Chen, Yu Wu 0012, Yao Qian, Chengyi Wang 0002, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, 
Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification.

ICASSP2022 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Zhengyang Chen, Zhuo Chen 0006, Shujie Liu 0001, Jian Wu 0027, Yao Qian, Furu Wei, Jinyu Li 0001, Xiangzhan Yu, 
Unispeech-Sat: Universal Speech Representation Learning With Speaker Aware Pre-Training.

ICASSP2022 Rui Wang 0073, Junyi Ao, Long Zhou, Shujie Liu 0001, Zhihua Wei 0001, Tom Ko, Qing Li 0001, Yu Zhang 0006, 
Multi-View Self-Attention Based Transformer for Speaker Recognition.

ICASSP2022 Heming Wang, Yao Qian, Xiaofei Wang 0009, Yiming Wang, Chengyi Wang 0002, Shujie Liu 0001, Takuya Yoshioka, Jinyu Li 0001, DeLiang Wang, 
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction.

#61  | Yonghong Yan 0002 | DBLP Google Scholar  
By venueInterspeech: 24TASLP: 11ICASSP: 7SpeechComm: 3
By year2024: 22023: 22022: 122021: 82020: 42019: 112018: 6
ISCA sessionsnovel models and training methods for asr: 3speech synthesis: 2speaker embedding and diarization: 1asr: 1spoken language processing: 1multi-, cross-lingual and other topics in asr: 1atypical speech analysis and detection: 1low-resource asr development: 1source separation, dereverberation and echo cancellation: 1speech recognition and beyond: 1lexicon and language model for speech recognition: 1asr neural network training: 1speaker and language recognition: 1asr for noisy and far-field speech: 1model adaptation for asr: 1acoustic scenes and rare events: 1novel neural network architectures for acoustic modelling: 1neural network training strategies for asr: 1spoken dialogue systems and conversational analysis: 1source separation and spatial analysis: 1language modeling: 1
IEEE keywordsspeech recognition: 10automatic speech recognition: 5end to end speech recognition: 5hidden markov models: 4data models: 3computational modeling: 3end to end: 3task analysis: 2pre training: 2filtering: 2estimation: 2pseudo labeling: 2text analysis: 2natural language processing: 2error analysis: 2transformer: 2online speech recognition: 2decoding: 2computer architecture: 2speech enhancement: 2recurrent neural nets: 2oral communication: 1clustering algorithms: 1metric embedding learning: 1online clustering: 1speaker diarization: 1machine learning algorithms: 1domain adaptation: 1adaptation models: 1self supervised learning: 1semi supervised learning: 1noise measurement: 1tuning: 1mixture models: 1hybrid dnn hmm speech recognition: 1gaussian processes: 1entropy: 1pattern classification: 1long tailed problem: 1probability: 1random processes: 1supervised learning: 1self supervised pre training: 1signal classification: 1weighted histogram analysis: 1direction of arrival estimation: 1mean square error methods: 1deep neural network: 1source separation: 1sound source localization: 1time frequency mask: 1steering vector phase difference: 1keyword confidence scoring: 1keyword search: 1transformers: 1phoneme alignment: 1history: 1language model: 1graphics processing units: 1history utterance: 1performance gain: 1grammars: 1speech coding: 1unpaired data: 1heuristic algorithms: 1hybrid ctc/attention speech recognition: 1multilayer perceptrons: 1pruning: 1matrix algebra: 1model compression: 1matrix product operators: 1data compression: 1computational efficiency: 1ctc/attention speech recognition: 1autoregressive moving average: 1convolutional neural nets: 1interpretability: 1autoregressive moving average processes: 1neural language models: 1multitask learning: 1self attention: 1prosodic boundary prediction: 1speech synthesis: 1binaural cue preservation: 1complex deep neural network: 1hearing: 1interference suppression: 1complex ideal ratio mask: 1binaural speech enhancement: 1different timescales: 1biological research: 1speaker embedding: 1speaker recognition: 1t vector: 1binaural synthesis: 1elevation perception: 1elevation control: 1spectral cues: 1head related transfer function: 1
Most publications (all venues) at2008: 292012: 252009: 242021: 232013: 21

Affiliations
Chinese Academy of Sciences, Institute of Acoustics / Xinjiang Technical Institute of Physics and Chemistry, China

Recent publications

TASLP2024 Yifan Chen, Gaofeng Cheng, Runyan Yang, Pengyuan Zhang, Yonghong Yan 0002
Interrelate Training and Clustering for Online Speaker Diarization.

TASLP2024 Han Zhu 0004, Gaofeng Cheng, Jindong Wang 0001, Wenxin Hou, Pengyuan Zhang, Yonghong Yan 0002
Boosting Cross-Domain Speech Recognition With Self-Supervision.

SpeechComm2023 Feng Dang, Hangting Chen, Qi Hu, Pengyuan Zhang, Yonghong Yan 0002
First coarse, fine afterward: A lightweight two-stage complex approach for monaural speech enhancement.

TASLP2023 Han Zhu 0004, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan 0002
Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition.

TASLP2022 Gaofeng Cheng, Haoran Miao, Runyan Yang, Keqi Deng, Yonghong Yan 0002
ETEH: Unified Attention-Based End-to-End ASR and KWS Architecture.

TASLP2022 Keqi Deng, Gaofeng Cheng, Runyan Yang, Yonghong Yan 0002
Alleviating ASR Long-Tailed Problem by Decoupling the Learning of Representation and Classification.

TASLP2022 Changfeng Gao, Gaofeng Cheng, Ta Li, Pengyuan Zhang, Yonghong Yan 0002
Self-Supervised Pre-Training for Attention-Based Encoder-Decoder ASR Model.

Interspeech2022 Yifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan 0002
Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization.

Interspeech2022 Zehan Li, Haoran Miao, Keqi Deng, Gaofeng Cheng, Sanli Tian, Ta Li, Yonghong Yan 0002
Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies.

Interspeech2022 Yukun Liu, Ta Li, Pengyuan Zhang, Yonghong Yan 0002
NAS-SCAE: Searching Compact Attention-based Encoders For End-to-end Automatic Speech Recognition.

Interspeech2022 Sanli Tian, Keqi Deng, Zehan Li, Lingxuan Ye, Gaofeng Cheng, Ta Li, Yonghong Yan 0002
Knowledge Distillation For CTC-based Speech Recognition Via Consistent Acoustic Representation Learning.

Interspeech2022 Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie 0001, Yonghong Yan 0002
Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset.

Interspeech2022 Lingxuan Ye, Gaofeng Cheng, Runyan Yang, Zehui Yang, Sanli Tian, Pengyuan Zhang, Yonghong Yan 0002
Improving Recognition of Out-of-vocabulary Words in E2E Code-switching ASR by Fusing Speech Generation Methods.

Interspeech2022 Xueshuai Zhang, Jiakun Shen, Jun Zhou 0024, Pengyuan Zhang, Yonghong Yan 0002, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shaoxing Zhang, Aijun Sun, 
Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics.

Interspeech2022 Han Zhu 0004, Li Wang, Gaofeng Cheng, Jindong Wang 0001, Pengyuan Zhang, Yonghong Yan 0002
Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR.

Interspeech2022 Han Zhu 0004, Jindong Wang 0001, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan 0002
Decoupled Federated Learning for ASR with Non-IID Data.

SpeechComm2021 Danyang Liu, Ji Xu, Pengyuan Zhang, Yonghong Yan 0002
A unified system for multilingual speech recognition and language identification.

TASLP2021 Longbiao Cheng, Xingwei Sun, Dingding Yao, Junfeng Li, Yonghong Yan 0002
Estimation Reliability Function Assisted Sound Source Localization With Enhanced Steering Vector Phase Difference.

TASLP2021 Runyan Yang, Gaofeng Cheng, Haoran Miao, Ta Li, Pengyuan Zhang, Yonghong Yan 0002
Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments.

ICASSP2021 Keqi Deng, Gaofeng Cheng, Haoran Miao, Pengyuan Zhang, Yonghong Yan 0002
History Utterance Embedding Transformer LM for Speech Recognition.

#62  | Tatsuya Kawahara | DBLP Google Scholar  
By venueInterspeech: 27ICASSP: 10TASLP: 6NAACL: 1
By year2024: 42023: 52022: 82021: 42020: 92019: 92018: 5
ISCA sessionsspeech emotion recognition: 2turn management in dialogue: 2speech recognition: 1asr: 1multi-, cross-lingual and other topics in asr: 1speaking styles and interaction styles: 1asr technologies and systems: 1dereverberation, noise reduction, and speaker extraction: 1streaming for asr/rnn transducers: 1search/decoding techniques and confidence measures for asr: 1spoken dialogue system: 1neural networks for language modeling: 1asr neural network architectures and training: 1streaming asr: 1topics in asr: 1conversational systems: 1cross-lingual and multilingual asr: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1nn architectures for asr: 1training strategy for speech emotion recognition: 1multimodal dialogue systems: 1spoken dialogue systems and conversational analysis: 1acoustic modelling: 1recurrent neural models for asr: 1adjusting to speaker, accent, and domain: 1
IEEE keywordsspeech recognition: 7predictive models: 4speech synthesis: 4speech enhancement: 4decoding: 4domain adaptation: 3matrix decomposition: 3blind source separation: 3covariance matrices: 3emotion recognition: 2adaptation models: 2linguistics: 2fuses: 2task analysis: 2low resource language: 2time frequency analysis: 2maximum likelihood estimation: 2joint diagonalization: 2natural language processing: 2speaker recognition: 2speech coding: 2unsupervised learning: 2computational modeling: 1vocoders: 1data models: 1data augmentation: 1pretrained model: 1multi task learning (mtl): 1transformers: 1speech emotion recognition (ser): 1target recognition: 1adapters: 1predictive system: 1diffusion model: 1storms: 1generative model: 1diffusion processes: 1pipelines: 1training data: 1fake audio detection (fad): 1self supervised learned (ssl) model: 1model fusion: 1mos prediction: 1logic gates: 1transducers: 1context modeling: 1attention based encoder decoder: 1connectionist temporal classification: 1complexity theory: 1streaming automatic speech recognition: 1knowledge distillation: 1monotonic chunkwise attention: 1language adaptation: 1automatic speech recognition: 1multitasking: 1khmer language: 1self supervised pretraining: 1time domain: 1multiresolution spectrograms: 1spectrogram: 1time domain analysis: 1source separation: 1dereverberation: 1reverberation: 1covariance analysis: 1iterative methods: 1audio signal processing: 1autoregressive moving average processes: 1optimisation: 1multichannel audio signal processing: 1fastspeech 2: 1transformer: 1multiple corpora: 1speech emotion recognition: 1multi task learning: 1self attention mechanism: 1non autoregressive decoding: 1multiprocessing systems: 1language translation: 1encoding: 1conditional masked language model: 1end to end speech translation: 1autoregressive processes: 1non native acoustic modeling: 1cross lingual transfer: 1capt: 1pronunciation error detection and diagnosis: 1call: 1blind source separation (bss): 1full rank spatial covariance matrix: 1gaussian distribution: 1multichannel nonnegative matrix factorization: 1image representation: 1multichannel speech enhancement: 1matrix algebra: 1signal denoising: 1variational autoencoder: 1supervised learning: 1nonnegative matrix factorization: 1transfer learning: 1text analysis: 1end to end asr: 1vocabulary: 1multilingual speech recognition: 1multi speaker speech synthesis: 1training data augmentation: 1acoustic to word model: 1sequence to sequence speech synthesis: 1sequence to sequence speech recognition: 1
Most publications (all venues) at2018: 262020: 232019: 222017: 202004: 20


Recent publications

TASLP2024 Sei Ueno, Akinobu Lee, Tatsuya Kawahara
Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition.

ICASSP2024 Yuan Gao, Hao Shi, Chenhui Chu, Tatsuya Kawahara
Enhancing Two-Stage Finetuning for Speech Emotion Recognition Using Adapters.

ICASSP2024 Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya 0001, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji, 
Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders.

ICASSP2024 Wangjin Zhou, Zhengdong Yang, Chenhui Chu, Sheng Li 0010, Raj Dabre, Yi Zhao, Tatsuya Kawahara
MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction.

TASLP2023 Hirofumi Inaguma, Tatsuya Kawahara
Alignment Knowledge Distillation for Online Streaming Attention-Based Speech Recognition.

ICASSP2023 Soky Kak, Sheng Li 0010, Chenhui Chu, Tatsuya Kawahara
Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language.

ICASSP2023 Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang 0001, Tatsuya Kawahara
Time-Domain Speech Enhancement Assisted by Multi-Resolution Frequency Encoder and Decoder.

Interspeech2023 Yuan Gao, Chenhui Chu, Tatsuya Kawahara
Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining.

Interspeech2023 Jaeyoung Lee, Masato Mimura, Tatsuya Kawahara
Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model.

TASLP2022 Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Mathieu Fontaine 0002, Kazuyoshi Yoshii, Tatsuya Kawahara
Autoregressive Moving Average Jointly-Diagonalizable Spatial Covariance Analysis for Joint Source Separation and Dereverberation.

ICASSP2022 Sei Ueno, Tatsuya Kawahara
Phone-Informed Refinement of Synthesized Mel Spectrogram for Data Augmentation in Speech Recognition.

ICASSP2022 Heran Zhang, Masato Mimura, Tatsuya Kawahara, Kenkichi Ishizuka, 
Selective Multi-Task Learning For Speech Emotion Recognition Using Corpora Of Different Styles.

Interspeech2022 Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM.

Interspeech2022 Soky Kak, Sheng Li 0010, Masato Mimura, Chenhui Chu, Tatsuya Kawahara
Leveraging Simultaneous Translation for Enhancing Transcription of Low-resource Language via Cross Attention Mechanism.

Interspeech2022 Seiya Kawano, Muteki Arioka, Akishige Yuguchi, Kenta Yamamoto, Koji Inoue, Tatsuya Kawahara, Satoshi Nakamura 0001, Koichiro Yoshino, 
Multimodal Persuasive Dialogue Corpus using Teleoperated Android.

Interspeech2022 Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto, 
End-to-end Speech-to-Punctuated-Text Recognition.

Interspeech2022 Hao Shi, Longbiao Wang, Sheng Li 0010, Jianwu Dang 0001, Tatsuya Kawahara
Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction.

ICASSP2021 Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe 0001, 
ORTHROS: non-autoregressive end-to-end speech translation With dual-decoder.

Interspeech2021 Hirofumi Inaguma, Tatsuya Kawahara
StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR.

Interspeech2021 Hirofumi Inaguma, Tatsuya Kawahara
VAD-Free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording.

#63  | Xin Wang 0037 | DBLP Google Scholar  
By venueICASSP: 18Interspeech: 16TASLP: 9SpeechComm: 1
By year2024: 62023: 102022: 42021: 52020: 112019: 72018: 1
ISCA sessionsspeech synthesis: 4voice anti-spoofing and countermeasure: 3speaker and language identification: 2voice privacy challenge: 2anti-spoofing for speaker verification: 1speech coding and restoration: 1speech synthesis paradigms and methods: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1prosody modeling and generation: 1
IEEE keywordsspeech synthesis: 12vocoders: 7speaker recognition: 7speech recognition: 5privacy: 5countermeasure: 5presentation attack detection: 5anti spoofing: 5training data: 4text to speech: 4data privacy: 4voice conversion: 3task analysis: 3logical access: 3neural vocoder: 3speaker anonymization: 3neural network: 3data models: 2protocols: 2self supervised learning: 2asvspoof: 2information filtering: 2pipelines: 2music: 2tacotron: 2natural language processing: 2speaker verification: 2variational auto encoder: 2speech coding: 2fourier transforms: 2autoregressive processes: 2multilingual: 1self supervised representations: 1decoding: 1zero shot: 1spectrogram: 1low resource: 1pseudonymisation: 1voice privacy: 1anonymisation: 1attack model: 1recording: 1degradation: 1deepfake detection: 1signal processing algorithms: 1cloning: 1generated speech detection: 1hifi gan: 1source coding: 1watermarking: 1voice cloning: 1voice activity detection: 1collaboration: 1privacy friendly data: 1language robust orthogonal householder neural network: 1codecs: 1deepfakes: 1spoofing: 1distributed databases: 1countermeasures: 1communication networks: 1selection based anonymizer: 1measurement: 1information integrity: 1synthetic aperture sonar: 1orthogonal householder neural network anonymizer: 1weighted additive angular softmax: 1internet: 1deepfake: 1databases: 1spoof localization: 1partialspoof: 1splicing: 1forgery: 1transforms: 1privacy preservation: 1sex neutral voice: 1attribute privacy: 1multiple signal classification: 1computational modeling: 1software: 1transformer: 1text to speech synthesis: 1music audio synthesis: 1analytical models: 1buildings: 1linkability: 1computer crime: 1estimation theory: 1resnet: 1attention: 1tdnn: 1feedforward neural nets: 1deep learning (artificial intelligence): 1entertainment: 1listening test: 1rakugo: 1duration modeling: 1vector quantization: 1hidden markov models: 1automatic speaker verification (asv): 1security of data: 1detect ion cost function: 1spoofing counter measures: 1short time fourier transform: 1convolution: 1waveform model: 1filtering theory: 1recurrent neural nets: 1fundamental frequency: 1speaker embeddings: 1transfer learning: 1speaker adaptation: 1speech enhancement: 1performance evaluation: 1source separation: 1child speech extraction: 1speech separation: 1realistic conditions: 1measures: 1reverberation: 1signal classification: 1search problems: 1probability: 1sequences: 1sampling methods: 1sequence to sequence model: 1stochastic processes: 1neural waveform synthesizer: 1musical instruments: 1fine tuning: 1audio signal processing: 1zero shot adaptation: 1musical instrument sounds synthesis: 1spectral analysis: 1wavenet: 1neural net architecture: 1neural waveform modeling: 1maximum likelihood estimation: 1waveform analysis: 1gaussian distribution: 1waveform generators: 1waveform modeling: 1gradient methods: 1text analysis: 1
Most publications (all venues) at2024: 162021: 162020: 162019: 152023: 14

Affiliations
Graduate University for Advanced Studies (SOKENDAI), National Institute of Informatics, Department of Informatics, Tokyo, Japan

Recent publications

TASLP2024 Cheng Gong, Xin Wang 0037, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang 0001, Korin Richmond, Junichi Yamagishi, 
ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations.

TASLP2024 Michele Panariello, Natalia A. Tomashenko, Xin Wang 0037, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas W. D. Evans, Emmanuel Vincent 0001, Junichi Yamagishi, 
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation.

ICASSP2024 Xin Wang 0037, Junichi Yamagishi, 
Can Large-Scale Vocoded Spoofed Data Improve Speech Spoofing Countermeasure with a Self-Supervised Front End?

ICASSP2024 Wanying Ge, Xin Wang 0037, Junichi Yamagishi, Massimiliano Todisco, Nicholas W. D. Evans, 
Spoofing Attack Augmentation: Can Differently-Trained Attack Models Improve Generalisation?

ICASSP2024 Lauri Juvela, Xin Wang 0037
Collaborative Watermarking for Adversarial Speech Synthesis.

ICASSP2024 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Nicholas W. D. Evans, Massimiliano Todisco, Jean-François Bonastre, Mickael Rouvier, 
Synvox2: Towards A Privacy-Friendly Voxceleb2 Dataset.

SpeechComm2023 Shi Cheng, Jun Du, Shutong Niu, Alejandrina Cristià, Xin Wang 0037, Qing Wang 0008, Chin-Hui Lee 0001, 
Using iterative adaptation and dynamic mask for child speech extraction under real-world multilingual conditions.

TASLP2023 Xuechen Liu, Xin Wang 0037, Md. Sahidullah, Jose Patino 0001, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas W. D. Evans, Andreas Nautsch, Kong Aik Lee, 
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.

TASLP2023 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Natalia A. Tomashenko, 
Speaker Anonymization Using Orthogonal Householder Neural Network.

TASLP2023 Lin Zhang, Xin Wang 0037, Erica Cooper, Nicholas W. D. Evans, Junichi Yamagishi, 
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance.

ICASSP2023 Paul-Gauthier Noé, Xiaoxiao Miao, Xin Wang 0037, Junichi Yamagishi, Jean-François Bonastre, Driss Matrouf, 
Hiding Speaker's Sex in Speech Using Zero-Evidence Speaker Representation in an Analysis/Synthesis Pipeline.

ICASSP2023 Xuan Shi, Erica Cooper, Xin Wang 0037, Junichi Yamagishi, Shrikanth Narayanan, 
Can Knowledge of End-to-End Text-to-Speech Models Improve Neural Midi-to-Audio Synthesis Systems?

ICASSP2023 Xin Wang 0037, Junichi Yamagishi, 
Spoofed Training Data for Speech Spoofing Countermeasure Can Be Efficiently Created Using Neural Vocoders.

Interspeech2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang 0037, Xuechen Liu, Md. Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas W. D. Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung, 
Towards Single Integrated Spoofing-aware Speaker Verification Embeddings.

Interspeech2023 Chang Zeng, Xin Wang 0037, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi, 
Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms.

Interspeech2023 Lin Zhang, Xin Wang 0037, Erica Cooper, Nicholas W. D. Evans, Junichi Yamagishi, 
Range-Based Equal Error Rate for Spoof Localization.

TASLP2022 Brij Mohan Lal Srivastava, Mohamed Maouche, Md. Sahidullah, Emmanuel Vincent 0001, Aurélien Bellet, Marc Tommasi, Natalia A. Tomashenko, Xin Wang 0037, Junichi Yamagishi, 
Privacy and Utility of X-Vector Based Speaker Anonymization.

ICASSP2022 Xin Wang 0037, Junichi Yamagishi, 
Estimating the Confidence of Speech Spoofing Countermeasure.

ICASSP2022 Chang Zeng, Xin Wang 0037, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi, 
Attention Back-End for Automatic Speaker Verification with Multiple Enrollment Utterances.

Interspeech2022 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Natalia A. Tomashenko, 
Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions.

#64  | Wei-Ning Hsu | DBLP Google Scholar  
By venueInterspeech: 16ICASSP: 7ACL: 6ICLR: 4NeurIPS: 4ICML: 2ACL-Findings: 1EMNLP-Findings: 1EMNLP: 1NAACL: 1TASLP: 1
By year2024: 32023: 112022: 132021: 62020: 32019: 52018: 3
ISCA sessionsspeech synthesis: 3resources for spoken language processing: 1spoken language processing: 1zero, low-resource and multi-modal speech recognition: 1speaker recognition and anti-spoofing: 1resource-constrained asr: 1self-supervision and semi-supervision for neural asr training: 1speech signal representation: 1new trends in self-supervised speech processing: 1speech signal characterization: 1speech recognition and beyond: 1deep neural networks: 1robust speech recognition: 1neural network training strategies for asr: 1
IEEE keywordspre training: 3task analysis: 3speech recognition: 3self supervised learning: 3representation learning: 2unsupervised learning: 2speech synthesis: 2multimodal: 1roads: 1data models: 1machine translation: 1speech translation: 1domain adaptation: 1continual learning: 1on device: 1computational modeling: 1adaptation models: 1benchmark testing: 1signal processing algorithms: 1asr: 1smoothing methods: 1measurement: 1self supervision: 1unit discovery: 1diarization: 1mixture speech: 1cocktail party: 1source separation: 1multispeaker asr: 1self supervised pre training: 1object recognition: 1recording: 1bert: 1pattern clustering: 1natural language processing: 1supervised learning: 1text analysis: 1text to speech: 1semi supervised learning: 1data efficiency: 1tacotron: 1data augmentation: 1variational autoencoder: 1speaker recognition: 1adversarial training: 1text to speech synthesis: 1
Most publications (all venues) at2023: 192022: 192021: 122024: 102018: 7

Recent publications

ICASSP2024 Peng-Jen Chen, Bowen Shi, Kelvin Niu, Ann Lee 0001, Wei-Ning Hsu
M2BART: Multilingual and Multimodal Encoder-Decoder Pre-Training for Any-to-Any Machine Translation.

ICLR2024 Alexander H. Liu, Matthew Le 0001, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu
Generative Pre-training for Speech with Flow Matching.

ACL2024 HyoJung Han, Mohamed Anwar, Juan Pino 0001, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang, 
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception.

ICASSP2023 Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed, 
Continual Learning for On-Device Speech Recognition Using Disentangled Conformers.

ICASSP2023 Ali Elkahky, Wei-Ning Hsu, Paden Tomasello, Tu Anh Nguyen, Robin Algayres, Yossi Adi, Jade Copet, Emmanuel Dupoux, Abdelrahman Mohamed, 
Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training?

ICASSP2023 Maryam Fazel-Zarandi, Wei-Ning Hsu
Cocktail HuBERT: Generalized Self-Supervised Pre-Training for Mixture and Single-Source Speech.

Interspeech2023 Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino 0001, Changhan Wang, 
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation.

Interspeech2023 Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux, 
Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis.

ICML2023 Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli, 
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language.

NeurIPS2023 Matthew Le 0001, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale.

NeurIPS2023 Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass, 
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning.

ACL2023 Changhan Wang, Hirofumi Inaguma, Peng-Jen Chen, Ilia Kulikov, Yun Tang 0002, Wei-Ning Hsu, Michael Auli, Juan Pino 0001, 
Simple and Effective Unsupervised Speech Translation.

ACL-Findings2023 Peng-Jen Chen, Kevin Tran, Yilin Yang, Jingfei Du, Justine Kao, Yu-An Chung, Paden Tomasello, Paul-Ambroise Duquenne, Holger Schwenk, Hongyu Gong, Hirofumi Inaguma, Sravya Popuri, Changhan Wang, Juan Pino 0001, Wei-Ning Hsu, Ann Lee 0001, 
Speech-to-Speech Translation for a Real-world Unwritten Language.

EMNLP-Findings2023 Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli, 
Toward Joint Language Modeling for Speech Units and Text.

Interspeech2022 Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James R. Glass, 
Simple and Effective Unsupervised Speech Synthesis.

Interspeech2022 Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino 0001, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee 0001, 
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation.

Interspeech2022 Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed, 
Robust Self-Supervised Audio-Visual Speech Recognition.

Interspeech2022 Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu
Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT.

Interspeech2022 Apoorv Vyas, Wei-Ning Hsu, Michael Auli, Alexei Baevski, 
On-demand compute reduction with stochastic wav2vec 2.0.

ICML2022 Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli, 
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language.

#65  | Joon Son Chung | DBLP Google Scholar  
By venueICASSP: 24Interspeech: 17TASLP: 1ICML: 1AAAI: 1
By year2024: 112023: 82022: 42021: 62020: 82019: 42018: 3
ISCA sessionsspeaker diarization: 3multimodal speech processing: 2analysis of speech and audio signals: 1speaker recognition: 1speaker and language identification: 1speaker and language recognition: 1tools, corpora and resources: 1bi- and multilinguality: 1learning techniques for speaker recognition: 1speech enhancement: 1speaker recognition and diarization: 1deep enhancement: 1multimodal systems: 1speaker verification: 1
IEEE keywordsspeaker recognition: 9speaker verification: 7speaker diarisation: 5speech recognition: 5visualization: 5task analysis: 4self supervised learning: 4annotations: 3training data: 2data models: 2data mining: 2oral communication: 2audio visual: 2diffusion: 2computational modeling: 2benchmark testing: 2codes: 2visually grounded speech: 2speech enhancement: 2protocols: 2graph attention networks: 2graph theory: 2spatial resolution: 1continuous sign language recognition: 1bidirectional control: 1bi directional fusion: 1sign language: 1temporal modelling: 1semantics: 1slowfast network: 1degradation: 1speaker embedding: 1calibration: 1robustness: 1session information: 1network architecture: 1multi modal speech processing: 1infonce loss: 1active speaker detection: 1linear programming: 1costs: 1dataset: 1pipelines: 1stochastic differential equation: 1phonetics: 1speech separation: 1style control: 1controllability: 1text to speech: 1manuals: 1linguistics: 1text to audio: 1latent diffusion model: 1production: 1vocoders: 1convolution: 1transforms: 1lightweight model: 1fast diffusion: 1speech synthesis: 1vocoder: 1buildings: 1contrastive learning: 1masked autoencoder: 1evaluation protocol: 1data augmentation: 1correlation: 1metric learning: 1filtering: 1measurement: 1user defined keyword spotting: 1noise robustness: 1speaker embeddings: 1noise robust: 1speech coding: 1background noise: 1dimensionality reduction: 1biometrics (access control): 1biological system modeling: 1diffusion model: 1multi speaker text to speech (tts): 1audiovisual biometrics: 1image segmentation: 1audio visual correspondence: 1anti spoofing: 1audio spoofing detection: 1end to end: 1heterogeneous: 1beam search: 1keyword score: 1contextual biasing: 1keyword boosting: 1pattern clustering: 1multi scale: 1gaussian processes: 1entertainment: 1domain adaptation: 1graph neural network: 1graph attention network: 1signal classification: 1optimisation: 1entropy: 1cross modal distillation: 1lip reading: 1filtering theory: 1speaker representation: 1source separation: 1triplet loss: 1cross modal learning: 1selfsupervised machine learning: 1audio visual systems: 1signal representation: 1audio visual synchronisation: 1cross modal supervision: 1synchronization: 1lips: 1cross modal embedding: 1streaming media: 1speech: 1cnns: 1
Most publications (all venues) at2024: 252020: 162023: 132021: 112022: 10

Recent publications

TASLP2024 Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown 0006, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman, 
The VoxCeleb Speaker Recognition Challenge: A Retrospective.

ICASSP2024 Junseok Ahn, Youngjoon Jang, Joon Son Chung
Slowfast Network for Continuous Sign Language Recognition.

ICASSP2024 Hee-Soo Heo, Kihyun Nam, Bong-Jin Lee, Youngki Kwon, Minjae Lee, You Jin Kim, Joon Son Chung
Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification.

ICASSP2024 Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, Joon Son Chung
TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning.

ICASSP2024 Doyeop Kwak, Jaemin Jung, Kihyun Nam, Youngjoon Jang, Jee-Weon Jung, Shinji Watanabe 0001, Joon Son Chung
VoxMM: Rich Transcription of Conversations in the Wild.

ICASSP2024 Suyeon Lee, Chaeyoung Jung, Youngjoon Jang, Jaehun Kim, Joon Son Chung
Seeing Through The Conversation: Audio-Visual Speech Separation Based on Diffusion Model.

ICASSP2024 Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung
VoiceLDM: Text-to-Speech with Environmental Context.

ICASSP2024 Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung
FreGrad: Lightweight and Fast Frequency-Aware Diffusion Vocoder.

ICASSP2024 Jongbhin Woo, Hyeonggon Ryu, Arda Senocak, Joon Son Chung
Speech Guided Masked Image Modeling for Visually Grounded Speech.

ICML2024 Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning.

AAAI2024 Ji-Hoon Kim, Jaehun Kim, Joon Son Chung
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos.

ICASSP2023 Jee-Weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown 0006, Youngki Kwon, Shinji Watanabe 0001, Joon Son Chung
In Search of Strong Embedding Extractors for Speaker Diarisation.

ICASSP2023 Jaemin Jung, Youkyum Kim, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Youngjoon Jang, Joon Son Chung
Metric Learning for User-Defined Keyword Spotting.

ICASSP2023 You Jin Kim, Hee-Soo Heo, Jee-Weon Jung, Youngki Kwon, Bong-Jin Lee, Joon Son Chung
Advancing the Dimensionality Reduction of Speaker Embeddings for Speaker Diarisation: Disentangling Noise and Informing Speech Activity.

ICASSP2023 Jiyoung Lee, Joon Son Chung, Soo-Whan Chung, 
Imaginary Voice: Face-Styled Diffusion Model for Text-to-Speech.

ICASSP2023 Hyeonggon Ryu, Arda Senocak, In So Kweon, Joon Son Chung
Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples.

Interspeech2023 Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak, 
FlexiAST: Flexibility is What AST Needs.

Interspeech2023 Hee-Soo Heo, Jee-weon Jung, Jingu Kang, Youngki Kwon, Bong-Jin Lee, You Jin Kim, Joon Son Chung
Curriculum Learning for Self-supervised Speaker Verification.

Interspeech2023 Kihyun Nam, Youkyum Kim, Jaesung Huh, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung
Disentangled Representation Learning for Multilingual Speaker Recognition.

ICASSP2022 Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, Nicholas W. D. Evans, 
AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks.

#66  | Jesús Villalba 0001 | DBLP Google Scholar  
By venueInterspeech: 31ICASSP: 11TASLP: 1
By year2024: 12023: 52022: 62021: 102020: 82019: 72018: 6
ISCA sessionstrustworthy speech processing: 3robust speaker recognition: 2the attacker’s perpective on automatic speaker verification: 2speaker verification: 2speaker state and trait: 2language identification and diarization: 1speech recognition: 1pathological speech analysis: 1speaker recognition: 1speech, voice, and hearing disorders: 1self supervision and anti-spoofing: 1speaker recognition and diarization: 1voice activity detection and keyword spotting: 1non-autoregressive sequential modeling for speech processing: 1the adresso challenge: 1embedding and network architecture for speaker recognition: 1voice anti-spoofing and countermeasure: 1the zero resource speech challenge 2020: 1speaker embedding: 1speaker recognition and anti-spoofing: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1the voices from a distance challenge: 1speaker recognition evaluation: 1language identification: 1the first dihard speech diarization challenge: 1
IEEE keywordsspeaker recognition: 9speech recognition: 4emotion recognition: 3transfer learning: 3speaker verification: 3supervised learning: 2signal denoising: 2perceptual loss: 2natural language processing: 2speech enhancement: 2feature enhancement: 2microphones: 2end to end: 1transformers: 1non autoregressive model: 1iterative refinement: 1attractor mechanism: 1decoding: 1estimation: 1self attention: 1clustering: 1recording: 1speaker diarization: 1object detection: 1attention: 1connectionist temporal classification: 1regularization: 1understanding: 1text to speech: 1automatic speech recognition: 1unsupervised learning: 1speech synthesis: 1self supervised features: 1pre trained networks: 1multi task learning: 1deep learning (artificial intelligence): 1speech denoising: 1audio signal processing: 1signal classification: 1data augmentation: 1copypaste: 1x vector: 1deep feature loss: 1i vectors: 1medical disorders: 1diseases: 1patient diagnosis: 1parkinson’s disease: 1medical signal processing: 1neurophysiology: 1speech: 1x vectors: 1channel bank filters: 1far field adaptation: 1dereverberation: 1data handling: 1cyclegan: 1linear discriminant analysis: 1pre trained: 1x vector: 1cold fusion: 1automatic speech recognition (asr): 1language model: 1shallow fusion: 1storage management: 1deep fusion: 1sequence to sequence: 1telephone sets: 1bandwidth: 1deep residual cnn: 1blstm: 1spectrogram: 1bandwidth extension: 1generative adversarial neural networks (gans): 1unsupervised domain adaptation: 1cycle gans: 1
Most publications (all venues) at2021: 172020: 142022: 132019: 132018: 12

Affiliations
Johns Hopkins University, Center for Language and Speech Processing, Baltimore, MD, USA
University of Zaragoza, Spain

Recent publications

TASLP2024 Magdalena Rybicka, Jesús Villalba 0001, Thomas Thebaud, Najim Dehak, Konrad Kowalczyk, 
End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors.

Interspeech2023 Jesús Villalba 0001, Jonas Borgstrom, Maliha Jahan, Saurabh Kataria, Leibny Paola García, Pedro A. Torres-Carrasquillo, Najim Dehak, 
Advances in Language Recognition in Low Resource African Languages: The JHU-MIT Submission for NIST LRE22.

Interspeech2023 Saurabhchand Bhati, Jesús Villalba 0001, Laureano Moro-Velázquez, Thomas Thebaud, Najim Dehak, 
Segmental SpeechCLIP: Utilizing Pretrained Image-text Models for Audio-Visual Learning.

Interspeech2023 Anna Favaro, Tianyu Cao 0003, Thomas Thebaud, Jesús Villalba 0001, Ankur A. Butala, Najim Dehak, Laureano Moro-Velázquez, 
Do Phonatory Features Display Robustness to Characterize Parkinsonian Speech Across Corpora?

Interspeech2023 Saurabh Kataria, Jesús Villalba 0001, Laureano Moro-Velázquez, Thomas Thebaud, Najim Dehak, 
Self-FiLM: Conditioning GANs with self-supervised representations for bandwidth extension based speaker recognition.

Interspeech2023 Helin Wang, Thomas Thebaud, Jesús Villalba 0001, Myra Sydnor, Becky Lammers, Najim Dehak, Laureano Moro-Velázquez, 
DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model.

Interspeech2022 Jaejin Cho, Raghavendra Pappagari, Piotr Zelasko, Laureano Moro-Velázquez, Jesús Villalba 0001, Najim Dehak, 
Non-contrastive self-supervised learning of utterance-level speech representations.

Interspeech2022 Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesús Villalba 0001, Sanjeev Khudanpur, Najim Dehak, 
Defense against Adversarial Attacks on Hybrid Speech Recognition System using Adversarial Fine-tuning with Denoiser.

Interspeech2022 Sonal Joshi, Saurabh Kataria, Jesús Villalba 0001, Najim Dehak, 
AdvEst: Adversarial Perturbation Estimation to Classify and Detect Adversarial Attacks against Speaker Identification.

Interspeech2022 Saurabh Kataria, Jesús Villalba 0001, Laureano Moro-Velázquez, Najim Dehak, 
Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification.

Interspeech2022 Magdalena Rybicka, Jesús Villalba 0001, Najim Dehak, Konrad Kowalczyk, 
End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors.

Interspeech2022 Yiwen Shao, Jesús Villalba 0001, Sonal Joshi, Saurabh Kataria, Sanjeev Khudanpur, Najim Dehak, 
Chunking Defense for Adversarial Attacks on ASR.

ICASSP2021 Nanxin Chen, Piotr Zelasko, Jesús Villalba 0001, Najim Dehak, 
Focus on the Present: A Regularization Method for the ASR Source-Target Attention Layer.

ICASSP2021 Jaejin Cho, Piotr Zelasko, Jesús Villalba 0001, Najim Dehak, 
Improving Reconstruction Loss Based Speaker Embedding in Unsupervised and Semi-Supervised Scenarios.

ICASSP2021 Saurabh Kataria, Jesús Villalba 0001, Najim Dehak, 
Perceptual Loss Based Speech Denoising with an Ensemble of Audio Pattern Recognition and Self-Supervised Models.

ICASSP2021 Raghavendra Pappagari, Jesús Villalba 0001, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak, 
CopyPaste: An Augmentation Method for Speech Emotion Recognition.

Interspeech2021 Saurabhchand Bhati, Jesús Villalba 0001, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak, 
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation.

Interspeech2021 Nanxin Chen, Piotr Zelasko, Laureano Moro-Velázquez, Jesús Villalba 0001, Najim Dehak, 
Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition.

Interspeech2021 Saurabh Kataria, Jesús Villalba 0001, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak, 
Deep Feature CycleGANs: Speaker Identity Preserving Non-Parallel Microphone-Telephone Domain Adaptation for Speaker Verification.

Interspeech2021 Raghavendra Pappagari, Jaejin Cho, Sonal Joshi, Laureano Moro-Velázquez, Piotr Zelasko, Jesús Villalba 0001, Najim Dehak, 
Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in Low-Resource Scenarios.

#67  | Tomohiro Nakatani | DBLP Google Scholar  
By venueInterspeech: 19ICASSP: 19TASLP: 4SpeechComm: 1
By year2024: 12023: 42022: 32021: 62020: 132019: 132018: 3
ISCA sessionsspeech enhancement: 2speech coding and enhancement: 1multi-talker methods in speech processing: 1speech coding: 1dereverberation, noise reduction, and speaker extraction: 1speech localization, enhancement, and quality assessment: 1speech enhancement and intelligibility: 1noise reduction and intelligibility: 1monaural source separation: 1multi-channel speech enhancement: 1diarization: 1asr for noisy and far-field speech: 1asr neural network architectures: 1speech and audio source separation and scene analysis: 1neural networks for language modeling: 1adjusting to speaker, accent, and domain: 1distant asr: 1speech intelligibility and quality: 1
IEEE keywordsspeech recognition: 14speech enhancement: 9blind source separation: 8reverberation: 7source separation: 7speaker recognition: 6dereverberation: 4neural network: 4gaussian distribution: 3maximum likelihood estimation: 3audio signal processing: 3array signal processing: 3optimisation: 3backpropagation: 3natural language processing: 3online processing: 2transfer functions: 2microphone array: 2blind source separation (bss): 2weighted prediction error (wpe): 2expectation maximization (em) algorithm: 2multivariate complex gaussian distribution: 2full rank spatial covariance analysis (fca): 2covariance matrices: 2covariance analysis: 2microphone arrays: 2dynamic stream weights: 2target speech extraction: 2time domain network: 2source counting: 2robust asr: 2time domain analysis: 2frequency domain analysis: 2delays: 1noise reduction: 1spatial regularization: 1optimization: 1real time systems: 1analytical models: 1sensors: 1blind dereverberation (bd): 1probabilistic logic: 1time frequency analysis: 1switches: 1minimization: 1linear prediction (lp): 1expectation maximisation algorithm: 1microphones: 1independent component analysis: 1wiener filters: 1joint diagonalization: 1multichannel wiener filter: 1signal to distortion ratio: 1acoustic beamforming: 1complex backpropagation: 1convolution: 1multi channel source separation: 1speaker activity: 1meeting recognition: 1speech extraction: 1sensor fusion: 1audiovisual speaker localization: 1audio visual systems: 1image fusion: 1data fusion: 1video signal processing: 1beamforming: 1automatic speech recognition: 1filtering theory: 1multi task loss: 1spatial features: 1block coordinate descent method: 1independent vector analysis: 1generalized eigenvalue problem: 1gaussian noise: 1overdetermined: 1diarization: 1separation: 1smart devices: 1robustness: 1task analysis: 1single channel speech enhancement: 1signal denoising: 1student’s t distribution: 1independent positive semidefinite tensor analysis: 1tensors: 1joint training: 1convolutional neural nets: 1computational complexity: 1end to end speech recognition: 1hidden markov models: 1multi speaker speech recognition: 1time domain: 1speech separation: 1audiovisual speaker tracking: 1kalman filters: 1tracking: 1recurrent neural nets: 1backprop kalman filter: 1adaptation: 1auxiliary feature: 1domain adaptation: 1text analysis: 1topic model: 1recurrent neural network language model: 1sequence summary network: 1iterative methods: 1joint optimization: 1least squares approximations: 1semi supervised learning: 1encoding: 1decoding: 1encoder decoder: 1speech synthesis: 1autoencoder: 1meeting diarization: 1speaker attention: 1speech separation/extraction: 1integer programming: 1compressive speech summarization: 1maximum coverage of content words: 1oracle (upper bound) performance: 1integer linear programming (ilp): 1linear programming: 1
Most publications (all venues) at2019: 242017: 242021: 232018: 232013: 20

Recent publications

TASLP2024 Tetsuya Ueda, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki, Shoji Makino, 
Blind and Spatially-Regularized Online Joint Optimization of Source Separation, Dereverberation, and Noise Reduction.

TASLP2023 Hiroshi Sawada, Rintaro Ikeshita, Keisuke Kinoshita, Tomohiro Nakatani
Multi-Frame Full-Rank Spatial Covariance Analysis for Underdetermined Blind Source Separation and Dereverberation.

Interspeech2023 Shoko Araki, Ayako Yamamoto, Tsubasa Ochiai, Kenichi Arai, Atsunori Ogawa, Tomohiro Nakatani, Toshio Irino, 
Impact of Residual Noise and Artifacts in Speech Enhancement Errors on Intelligibility of Human and Machine.

Interspeech2023 Marc Delcroix, Naohiro Tawara, Mireia Díez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukás Burget, Shoko Araki, 
Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization.

Interspeech2023 Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani
Target Speech Extraction with Conditional Diffusion Model.

ICASSP2022 Naoyuki Kamo, Rintaro Ikeshita, Keisuke Kinoshita, Tomohiro Nakatani
Importance of Switch Optimization Criterion in Switching WPE Dereverberation.

ICASSP2022 Hiroshi Sawada, Rintaro Ikeshita, Keisuke Kinoshita, Tomohiro Nakatani
Multi-Frame Full-Rank Spatial Covariance Analysis for Underdetermined BSS in Reverberant Environments.

Interspeech2022 Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolíková, Hiroshi Sato, Tomohiro Nakatani
Listen only to me! How well can target speech extraction handle false alarms?

TASLP2021 Nobutaka Ito, Rintaro Ikeshita, Hiroshi Sawada, Tomohiro Nakatani
A Joint Diagonalization Based Efficient Approach to Underdetermined Blind Audio Source Separation Using the Multichannel Wiener Filter.

ICASSP2021 Christoph Böddeker, Wangyou Zhang, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, Yanmin Qian, Reinhold Haeb-Umbach, 
Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation.

ICASSP2021 Marc Delcroix, Katerina Zmolíková, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani
Speaker Activity Driven Neural Speech Extraction.

ICASSP2021 Julio Wissing, Benedikt T. Bönninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura, 
Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain.

Interspeech2021 Christopher Schymura, Benedikt T. Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa, 
PILOT: Introducing Transformers for Probabilistic Sound Event Localization.

Interspeech2021 Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani
Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility.

SpeechComm2020 Katsuhiko Yamamoto, Toshio Irino, Shoko Araki, Keisuke Kinoshita, Tomohiro Nakatani
GEDI: Gammachirp envelope distortion index for predicting intelligibility of enhanced speech.

TASLP2020 Tomohiro Nakatani, Christoph Böddeker, Keisuke Kinoshita, Rintaro Ikeshita, Marc Delcroix, Reinhold Haeb-Umbach, 
Jointly Optimal Denoising, Dereverberation, and Source Separation.

ICASSP2020 Marc Delcroix, Tsubasa Ochiai, Katerina Zmolíková, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki, 
Improving Speaker Discrimination of Target Speech Extraction With Time-Domain Speakerbeam.

ICASSP2020 Rintaro Ikeshita, Tomohiro Nakatani, Shoko Araki, 
Overdetermined Independent Vector Analysis.

ICASSP2020 Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani
Tackling Real Noisy Reverberant Meetings with All-Neural Source Separation, Counting, and Diarization System.

ICASSP2020 Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani
Improving Noise Robust Automatic Speech Recognition with Single-Channel Time-Domain Enhancement Network.

#68  | Chao Zhang 0031 | DBLP Google Scholar  
By venue: ICASSP: 19, Interspeech: 16, TASLP: 4, ICML: 1, ICLR: 1, ACL-Findings: 1, SpeechComm: 1
By year: 2024: 9, 2023: 12, 2022: 5, 2021: 7, 2020: 4, 2019: 4, 2018: 2
ISCA sessions: speech recognition: 3; speech emotion recognition: 1; asr technologies and systems: 1; neural transducers, streaming asr and novel asr models: 1; multi-, cross-lingual and other topics in asr: 1; robust asr, and far-field/multi-talker asr: 1; neural network training methods for asr: 1; speech synthesis: 1; the interspeech 2020 far field speaker verification challenge: 1; spatial audio: 1; asr neural network architectures: 1; speech and audio source separation and scene analysis: 1; acoustic model adaptation: 1; novel neural network architectures for acoustic modelling: 1
IEEE keywordsspeech recognition: 10error analysis: 6speech synthesis: 4decoding: 4training data: 4pointer generator: 3generators: 3automatic speech recognition: 3task analysis: 3data models: 3cross utterance: 3speaker recognition: 3production: 2speech enhancement: 2tts: 2context modeling: 2contextual speech recognition: 2end to end: 2q former: 2adaptation models: 2switches: 2end to end asr: 2hidden markov models: 2bert: 2speaker diarisation: 2foundation model: 2emotion recognition: 2asr: 2natural language processing: 2d vector: 2text to speech: 1mirrors: 1variational autoencoder: 1natural languages: 1pre trained language model: 1speech editing: 1spectrogram: 1multimedia systems: 1vegetation: 1encoding: 1audio visual: 1graph neural networks: 1dual encoders: 1multimodal large language model: 1visual perception: 1audio captioning: 1whisper model: 1test time adaptation: 1in context learning: 1large pre trained models: 1large language model: 1connectors: 1long form speech: 1speaker adaptive training: 1performance evaluation: 1whisper: 1quantization (signal): 1lora: 1quantisation: 1language model discounting: 1minimum bayes' risk: 1transformers: 1fastspeech2: 1bit error rate: 1wav2vec: 1video on demand: 1transducers: 1longform asr: 1fuses: 1tensors: 1spectral clustering: 1speaker embedding: 1wav2vec 2.0: 1clustering methods: 1contextual biasing: 1zero shot learning: 1probability: 1spoken language understanding: 1filling: 1predictive models: 1slot filling: 1depression: 1speech based depression detection: 1self supervised learning: 1analytical models: 1buildings: 1multilingual: 1utf 8 byte: 1unified modeling language: 1word piece: 1fusion: 1gating: 1bilinear pooling: 1rnn t: 1recurrent neural nets: 1signal representation: 1transfer learning: 1regression analysis: 1m2voc: 1speech intelligibility: 1voice cloning: 1few shots learning: 1end to end speech synthesis: 1language models: 1transformer: 1lstm: 1diarisation: 1distributed representation: 1content aware speaker embedding: 1tacotron2: 1speech coding: 1prosody: 1kalman filtering: 1backpropagation: 1deep neural network: 1kalman filters: 1speaker embeddings: 1large margin softmax: 1overlapping speech: 1self attention: 1model combination: 1speaker diarization: 1python: 1
Most publications (all venues) at: 2024: 25, 2023: 20, 2021: 13, 2022: 9, 2015: 8

Affiliations
Tsinghua University, Department of Electronic Engineering, Beijing, China
University of Cambridge, Department of Engineering, UK (PhD 2017)

Recent publications

TASLP2024 Yang Li 0116, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian 0002, Ying Wen 0001, Wei Pan 0004, Chao Zhang 0031, Jun Wang 0012, Yang Yang 0001, Fanglei Sun, 
Cross-Utterance Conditioned VAE for Speech Generation.

TASLP2024 Guangzhi Sun, Chao Zhang 0031, Philip C. Woodland, 
Graph Neural Networks for Contextual ASR With the Tree-Constrained Pointer Generator.

ICASSP2024 Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan 0019, Wei Li 0119, Lu Lu 0015, Zejun Ma, Chao Zhang 0031
Extending Large Language Models for Speech and Audio Captioning.

ICASSP2024 Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang 0031
Can Whisper Perform Speech-Based In-Context Learning?

ICASSP2024 Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan 0019, Wei Li 0119, Lu Lu 0015, Zejun Ma, Chao Zhang 0031
Connecting Speech Encoder and Large Language Model for ASR.

ICASSP2024 Qiuming Zhao, Guangzhi Sun, Chao Zhang 0031, Mingxing Xu, Thomas Fang Zheng, 
Enhancing Quantised End-to-End ASR Models Via Personalisation.

ICML2024 Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan 0019, Wei Li 0119, Lu Lu 0015, Zejun Ma, Yuxuan Wang 0002, Chao Zhang 0031
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.

ICLR2024 Yuchen Hu, Chen Chen 0075, Chao-Han Huck Yang, Ruizhe Li 0001, Chao Zhang 0031, Pin-Yu Chen, Engsiong Chng, 
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition.

ACL-Findings2024 Guangzhi Sun, Shutong Feng, Dongcheng Jiang, Chao Zhang 0031, Milica Gasic, Philip C. Woodland, 
Speech-based Slot Filling using Large Language Models.

SpeechComm2023 Qiujia Li, Chao Zhang 0031, Philip C. Woodland, 
Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring.

TASLP2023 Guangzhi Sun, Chao Zhang 0031, Philip C. Woodland, 
Minimising Biasing Word Errors for Contextual ASR With the Tree-Constrained Pointer Generator.

TASLP2023 Ya-Jie Zhang, Chao Zhang 0031, Wei Song, Zhengchen Zhang, Youzheng Wu, Xiaodong He 0001, 
Prosody Modelling With Pre-Trained Cross-Utterance Representations for Improved Speech Synthesis.

ICASSP2023 Shuo-Yiin Chang, Chao Zhang 0031, Tara N. Sainath, Bo Li 0028, Trevor Strohman, 
Context-Aware end-to-end ASR Using Self-Attentive Embedding and Tensor Fusion.

ICASSP2023 Evonne P. C. Lee, Guangzhi Sun, Chao Zhang 0031, Philip C. Woodland, 
Spectral Clustering-Aware Learning of Embeddings for Speaker Diarisation.

ICASSP2023 Guangzhi Sun, Chao Zhang 0031, Philip C. Woodland, 
End-to-End Spoken Language Understanding with Tree-Constrained Pointer Generator.

ICASSP2023 Wen Wu, Chao Zhang 0031, Philip C. Woodland, 
Self-Supervised Representations in Speech-Based Depression Detection.

ICASSP2023 Chao Zhang 0031, Bo Li 0028, Tara N. Sainath, Trevor Strohman, Shuo-Yiin Chang, 
UML: A Universal Monolingual Output Layer For Multilingual Asr.

Interspeech2023 Dongcheng Jiang, Chao Zhang 0031, Philip C. Woodland, 
A Neural Time Alignment Module for End-to-End Automatic Speech Recognition.

Interspeech2023 Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang 0027, Chao Zhang 0031, Xie Chen 0001, 
Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation.

Interspeech2023 Guangzhi Sun, Xianrui Zheng, Chao Zhang 0031, Philip C. Woodland, 
Can Contextual Biasing Remain Effective with Whisper and GPT-2?

#69  | Brian Kingsbury | DBLP Google Scholar  
By venue: Interspeech: 22, ICASSP: 20, TASLP: 1
By year: 2024: 2, 2023: 6, 2022: 10, 2021: 12, 2020: 7, 2019: 6
ISCA sessions: spoken language understanding: 2; asr neural network training: 2; speech recognition: 1; cross-lingual and multilingual asr: 1; end-to-end spoken dialog systems: 1; novel models and training methods for asr: 1; neural transducers, streaming asr and novel asr models: 1; multi-, cross-lingual and other topics in asr: 1; asr: 1; spoken language modeling and understanding: 1; streaming for asr/rnn transducers: 1; neural network training methods for asr: 1; language and lexical modeling for asr: 1; multimodal systems: 1; low-resource speech recognition: 1; novel neural network architectures for asr: 1; multilingual and code-switched asr: 1; summarization, semantic analysis and classification: 1; asr neural network architectures and training: 1; resources – annotation – evaluation: 1
IEEE keywordsspeech recognition: 12automatic speech recognition: 9spoken language understanding: 7natural language processing: 6recurrent neural nets: 5switches: 4data models: 3text analysis: 3decoding: 2signal processing algorithms: 2deep neural networks: 2task analysis: 2transducers: 2data handling: 2end to end systems: 2rnn transducers: 2distributed training: 2end to end asr: 2parallel computing: 2ctc: 1encoding: 1streaming: 1semi autoregressive: 1asr: 1inference algorithms: 1feedback loop: 1supervised training: 1complexity theory: 1bilevel optimization: 1unsupervised training: 1retrieval: 1multilingual: 1entropy: 1cross modal: 1knowledge distillation: 1analytical models: 1cross lingual: 1predictive models: 1knowledge transfer: 1transformers: 1bit error rate: 1filling: 1telephone sets: 1training data: 1dialog history: 1transforms: 1robustness: 1multi speaker: 1end to end: 1recording: 1weakly supervised learning: 1nearest neighbors: 1text classification: 1voice conversations: 1nearest neighbour methods: 1intent classification: 1software agents: 1virtual reality: 1attention: 1atis: 1speech coding: 1encoder decoder: 1interactive systems: 1spoken dialog system: 1end to end models: 1natural languages: 1adaptation: 1end to end mod els: 1language model customization: 1decentralized training: 1convergence: 1asynchronous training: 1data analysis: 1lvcsr: 1data privacy: 1gdpr: 1federated learning: 1adaptation models: 1adaptive training: 1transformer networks: 1end to end systems: 1self supervised pre training: 1sensor fusion: 1recurrent neural network transducer: 1multiplicative integration: 1speaker recognition: 1synthetic speech augmentation: 1pre trained text embedding: 1speech to intent: 1decentralized sgd: 1image recognition: 1supercomputers: 1noise injection: 1broadcast news: 1deep neural networks.: 1switchboard.: 1parallel processing: 1graphics processing units: 1lstm: 1
Most publications (all venues) at: 2013: 16, 2021: 15, 2022: 13, 2019: 10, 2014: 9


Recent publications

ICASSP2024 Siddhant Arora, George Saon, Shinji Watanabe 0001, Brian Kingsbury
Semi-Autoregressive Streaming ASR with Label Context.

ICASSP2024 A F. M. Saif, Xiaodong Cui, Han Shen, Songtao Lu, Brian Kingsbury, Tianyi Chen, 
Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization.

ICASSP2023 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas 0001, Rogério Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James R. Glass, 
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval.

ICASSP2023 Vishal Sunder, Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Brian Kingsbury, Eric Fosler-Lussier, 
Fine-Grained Textual Knowledge Transfer to Improve RNN Transducers for Speech Recognition and Understanding.

ICASSP2023 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, George Saon, Brian Kingsbury
Multi-Speaker Data Augmentation for Improved end-to-end Automatic Speech Recognition.

Interspeech2023 Xiaodong Cui, George Saon, Brian Kingsbury
Improving RNN Transducer Acoustic Models for English Conversational Speech Recognition.

Interspeech2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas 0001, Rogério Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James R. Glass, 
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages.

Interspeech2023 Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Brian Kingsbury
ConvKT: Conversation-Level Knowledge Transfer for Context Aware End-to-End Spoken Language Understanding.

ICASSP2022 Zvi Kons, Aharon Satt, Hong-Kwang Kuo, Samuel Thomas 0001, Boaz Carmeli, Ron Hoory, Brian Kingsbury
A New Data Augmentation Method for Intent Classification Enhancement and its Application on Spoken Conversation Datasets.

ICASSP2022 Hong-Kwang Jeff Kuo, Zoltán Tüske, Samuel Thomas 0001, Brian Kingsbury, George Saon, 
Improving End-to-end Models for Set Prediction in Spoken Language Understanding.

ICASSP2022 Vishal Sunder, Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Jatin Ganhotra, Brian Kingsbury, Eric Fosler-Lussier, 
Towards End-to-End Integration of Dialog History for Improved Spoken Language Understanding.

ICASSP2022 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Brian Kingsbury, George Saon, 
Towards Reducing the Need for Speech Training Data to Build Spoken Language Understanding Systems.

ICASSP2022 Samuel Thomas 0001, Brian Kingsbury, George Saon, Hong-Kwang Jeff Kuo, 
Integrating Text Inputs for Training and Adapting RNN Transducer ASR Models.

Interspeech2022 Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata, 
Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing.

Interspeech2022 Andrea Fasoli, Chia-Yu Chen, Mauricio J. Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan, 
Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization.

Interspeech2022 Takashi Fukuda, Samuel Thomas 0001, Masayuki Suzuki, Gakuto Kurata, George Saon, Brian Kingsbury
Global RNN Transducer Models For Multi-dialect Speech Recognition.

Interspeech2022 Jiatong Shi, George Saon, David Haws, Shinji Watanabe 0001, Brian Kingsbury
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States.

Interspeech2022 Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas 0001, Hong-Kwang Kuo, Brian Kingsbury
Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems.

TASLP2021 Xiaodong Cui, Wei Zhang 0022, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David S. Kung 0001, 
Asynchronous Decentralized Distributed Training of Acoustic Models.

ICASSP2021 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, 
RNN Transducer Models for Spoken Language Understanding.

#70  | Jonathan Le Roux | DBLP Google Scholar  
By venue: ICASSP: 23, Interspeech: 15, TASLP: 5
By year: 2024: 4, 2023: 6, 2022: 6, 2021: 8, 2020: 9, 2019: 9, 2018: 1
ISCA sessions: dialog management: 1; spoken dialogue systems: 1; source separation: 1; self-supervision and semi-supervision for neural asr training: 1; acoustic event detection and acoustic scene classification: 1; novel neural network architectures for asr: 1; streaming for asr/rnn transducers: 1; asr neural network architectures: 1; privacy and security in speech communication: 1; diarization: 1; end-to-end speech recognition: 1; search methods for speech recognition: 1; speech technologies for code-switching in multilingual communities: 1; speech and audio source separation and scene analysis: 1; spatial and phase cues for source separation and speech recognition: 1
IEEE keywordsspeech recognition: 12source separation: 9speech enhancement: 6time frequency analysis: 5audio source separation: 4music: 4speech separation: 4reverberation: 4natural language processing: 4recurrent neural nets: 4task analysis: 3signal processing algorithms: 3cocktail party problem: 3speech: 3graph theory: 3self training: 3speaker recognition: 3transformer: 3end to end: 3adaptation models: 2diffusion processes: 2domain adaptation: 2sound event detection: 2sound effects: 2motion pictures: 2soundtrack: 2tagging: 2semi supervised learning: 2degradation: 2ctc: 2pattern classification: 2wfst: 2end to end speech recognition: 2pseudo labeling: 2automatic speech recognition: 2mask inference: 2audio signal processing: 2noise measurement: 2triggered attention: 2signal classification: 2speech coding: 2artificial neural networks: 1meeting separation: 1microphones: 1indexes: 1meeting recognition: 1speaker diarization: 1computer architecture: 1stability analysis: 1generative adversarial networks: 1vocoders: 1diffusion process: 1deep audio synthesis: 1spectral envelope: 1generative adversarial network (gan): 1spectrogram: 1diffusion models: 1speech generation: 1griffin lim algorithm: 1brain modeling: 1multi modal: 1speaker extraction: 1eeg: 1electroencephalography: 1auditory attention: 1event detection: 1remixing: 1measurement: 1audio tagging: 1separation processes: 1discrete fourier transforms: 1low latency communication: 1frame online speech enhancement: 1complex spectral mapping: 1microphone array processing: 1prediction algorithms: 1time domain analysis: 1array signal processing: 1wiener filtering: 1training data: 1filtering: 1wiener filters: 1room impulse response: 1taxonomy: 1sound hierarchy: 1computational modeling: 1hyperbolic space: 1manifolds: 1uncertainty: 1cold diffusion: 1mathematical models: 1robustness: 1unfolded training: 1diffusion probabilistic model: 1gtc: 1multi speaker overlapped speech: 1end to end asr: 1semi supervised learning (artificial intelligence): 1iterative methods: 1gtc t: 1rnn t: 1transducer: 1asr: 1instruments: 1benchmark testing: 1regression analysis: 1blind deconvolution: 1deep learning (artificial intelligence): 1supervised learning: 1speech dereverberation: 1rir estimation: 1filtering theory: 1dropout: 1iterative pseudo labeling: 1self supervised asr: 1dilated self attention: 1computational complexity: 1language translation: 1graph based temporal classification: 1semi supervised asr: 1blind source separation: 1probability: 1weak supervision: 1overlapped speech recognition: 1decoding: 1neural beamforming: 1standards: 1audio coding: 1streaming: 1semi supervised classification: 1weakly labeled data: 1neural turing machine: 1turing machines: 1unsupervised speaker adaptation: 1speaker memory: 1pattern clustering: 1chimera network: 1deep clustering: 1low latency: 1speaker independent speech separation: 1expert systems: 1unpaired data: 1cycle consistency: 1end to end automatic speech recognition: 1connectionist temporal classification: 1frame synchronous decoding: 1attention mechanism: 1signal to noise ratio: 1objective measure: 1signal denoising: 1discrete representation: 1phase estimation: 1estimation: 1interpolation: 1
Most publications (all venues) at: 2024: 18, 2019: 17, 2021: 16, 2023: 13, 2020: 13


Recent publications

TASLP2024 Christoph Böddeker, Aswin Shanmugam Subramanian, Gordon Wichern, Reinhold Haeb-Umbach, Jonathan Le Roux
TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings.

ICASSP2024 Teysir Baoueb, Haocheng Liu, Mathieu Fontaine 0002, Jonathan Le Roux, Gaël Richard, 
SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis.

ICASSP2024 Haocheng Liu, Teysir Baoueb, Mathieu Fontaine 0002, Jonathan Le Roux, Gaël Richard, 
GLA-GRAD: A Griffin-Lim Extended Waveform Generation Diffusion Model.

ICASSP2024 Zexu Pan, Gordon Wichern, François G. Germain, Sameer Khurana, Jonathan Le Roux
NeuroHeed+: Improving Neuro-Steered Speaker Extraction with Joint Auditory Attention Detection.

TASLP2023 Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux
Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks.

TASLP2023 Zhong-Qiu Wang, Gordon Wichern, Shinji Watanabe 0001, Jonathan Le Roux
STFT-Domain Neural Speech Enhancement With Very Low Algorithmic Latency.

ICASSP2023 Rohith Aralikatti, Christoph Böddeker, Gordon Wichern, Aswin Shanmugam Subramanian, Jonathan Le Roux
Reverberation as Supervision For Speech Separation.

ICASSP2023 Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Jonathan Le Roux
Hyperbolic Audio Source Separation.

ICASSP2023 Hao Yen, François G. Germain, Gordon Wichern, Jonathan Le Roux
Cold Diffusion for Speech Enhancement.

Interspeech2023 Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh K. Jha, Diego Romeres, Jonathan Le Roux
Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos.

ICASSP2022 Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe 0001, Jonathan Le Roux
Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR.

ICASSP2022 Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori, 
Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy.

ICASSP2022 Niko Moritz, Takaaki Hori, Shinji Watanabe 0001, Jonathan Le Roux
Sequence Transduction with Graph-Based Supervision.

ICASSP2022 Darius Petermann, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks.

Interspeech2022 Chiori Hori, Takaaki Hori, Jonathan Le Roux
Low-Latency Online Streaming VideoQA Using Audio-Visual Transformers.

Interspeech2022 Efthymios Tzinis, Gordon Wichern, Aswin Shanmugam Subramanian, Paris Smaragdis, Jonathan Le Roux
Heterogeneous Target Speech Separation.

TASLP2021 Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux
Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation.

ICASSP2021 Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux
Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training.

ICASSP2021 Niko Moritz, Takaaki Hori, Jonathan Le Roux
Capturing Multi-Resolution Context by Dilated Self-Attention.

ICASSP2021 Niko Moritz, Takaaki Hori, Jonathan Le Roux
Semi-Supervised Speech Recognition Via Graph-Based Temporal Classification.

#71  | Jianzong Wang | DBLP Google Scholar  
By venue: ICASSP: 23, Interspeech: 20
By year: 2024: 3, 2023: 12, 2022: 11, 2021: 9, 2020: 8
ISCA sessions: speech synthesis: 4; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; speech activity detection and modeling: 1; analysis of speech and audio signals: 1; question answering from speech: 1; speech emotion recognition: 1; source separation: 1; voice conversion and adaptation: 1; acoustic event detection and classification: 1; speech signal analysis and representation: 1; graph and end-to-end learning for speaker recognition: 1; embedding and network architecture for speaker recognition: 1; acoustic event detection and acoustic scene classification: 1; spoken language understanding: 1; dnn architectures for speaker recognition: 1; topics in asr: 1; phonetic event detection and segmentation: 1
IEEE keywordsspeech synthesis: 11speech recognition: 6voice conversion: 5natural language processing: 5task analysis: 4speaker recognition: 4contrastive learning: 3timbre: 3predictive models: 3emotion recognition: 2emotional speech synthesis: 2computer vision: 2multi modal: 2convolution: 2vector quantization: 2dynamic programming: 2zero shot: 2text analysis: 2transformer: 2text to speech: 2time invariant retrieval: 1data mining: 1self supervised learning: 1phonetics: 1noise reduction: 1speech emotion diarization: 1diffusion denoising probabilistic model: 1probabilistic logic: 1llm: 1model bias: 1text categorization: 1zero shot learning: 1bias leverage: 1adaptation models: 1robustness: 1few shot learning: 1knn methods: 1gold: 1environmental sound classification: 1computational modeling: 1data free: 1audio classification: 1knowledge distillation: 1multiple signal classification: 1fuses: 1music genre classification: 1multi label: 1contrastive loss: 1correlation: 1symmetric cross modal attention: 1adversarial learning: 1speech representation disentanglement: 1linear programming: 1linguistics: 1intonation intensity control: 1relative attribute: 1aligned cross entropy: 1entropy: 1non autoregressive asr: 1mask ctc: 1brain modeling: 1time frequency analysis: 1feature fusion: 1federated learning: 1graph convolution network: 1electroencephalogram: 1regression analysis: 1pattern classification: 1variance regularization: 1attribute inference: 1speaker age estimation: 1label distribution learning: 1any to any: 1object detection: 1self supervised: 1low resource: 1query processing: 1pattern clustering: 1interactive systems: 1visual dialog: 1patch embedding: 1question answering (information retrieval): 1incomplete utterance rewriting: 1self attention weight matrix: 1text edit: 1synthetic noise: 1adversarial perturbation: 1contextual information: 1grapheme to phoneme: 1multi speaker text to speech: 1conditional variational autoencoder: 1intent detection: 1continual learning: 1computational linguistics: 1slot filling: 1recurrent neural nets: 1self attention: 1rnn transducer: 1waveform generators: 1vocoders: 1waveform generation: 1location variable convolution: 1vocoder: 1convolutional codes: 1mutual information: 1strain: 1speaker clustering: 1aggregation hierarchy cluster: 1digital tv: 1analytical models: 1tied variational autoencoder: 1clustering methods: 1speech coding: 1prosody modelling: 1graph theory: 1graph neural network: 1baum welch algorithm: 1real time systems: 1signal processing algorithms: 1feed forward transformer: 1
Most publications (all venues) at: 2022: 53, 2021: 40, 2023: 36, 2024: 26, 2020: 23


Recent publications

ICASSP2024 Yimin Deng, Huaizhen Tang, Xulong Zhang 0001, Ning Cheng 0001, Jing Xiao 0006, Jianzong Wang
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval.

ICASSP2024 Haobin Tang, Xulong Zhang 0001, Ning Cheng 0001, Jing Xiao 0006, Jianzong Wang
ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis.

ICASSP2024 Yong Zhang, Hanzhang Li, Zhitao Li, Ning Cheng 0001, Ming Li, Jing Xiao 0006, Jianzong Wang
Leveraging Biases in Large Language Models: "bias-kNN" for Effective Few-Shot Learning.

ICASSP2023 Zuheng Kang, Yayun He, Jianzong Wang, Junqing Peng, Xiaoyang Qu, Jing Xiao 0006, 
Feature-Rich Audio Model Inversion for Data-Free Knowledge Distillation Towards General Sound Classification.

ICASSP2023 Ganghui Ru, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Improving Music Genre Classification from multi-modal Properties of Music and Genre Correlations Perspective.

ICASSP2023 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Learning Speech Representations with Flexible Hidden Feature Dimensions.

ICASSP2023 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization.

ICASSP2023 Haobin Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis.

ICASSP2023 Xulong Zhang 0001, Haobin Tang, Jianzong Wang, Ning Cheng 0001, Jian Luo, Jing Xiao 0006, 
Dynamic Alignment Mask CTC: Improved Mask CTC With Aligned Cross Entropy.

ICASSP2023 Kexin Zhu, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Improving EEG-based Emotion Recognition by Fusing Time-Frequency and Spatial Representations.

Interspeech2023 Jiaxin Fan, Yong Zhang, Hanzhang Li, Jianzong Wang, Zhitao Li, Sheng Ouyang, Ning Cheng 0001, Jing Xiao 0006, 
Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism.

Interspeech2023 Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao 0006, 
SVVAD: Personal Voice Activity Detection for Speaker Verification.

Interspeech2023 Yifu Sun, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Kaiyu Hu, Jing Xiao 0006, 
Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning.

Interspeech2023 Haobin Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis.

Interspeech2023 Yong Zhang, Zhitao Li, Jianzong Wang, Yiming Gao 0010, Ning Cheng 0001, Fengying Yu, Jing Xiao 0006, 
Prompt Guided Copy Mechanism for Conversational Question Answering.

ICASSP2022 Shijing Si, Jianzong Wang, Junqing Peng, Jing Xiao 0006, 
Towards Speaker Age Estimation With Label Distribution Learning.

ICASSP2022 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Avqvc: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning.

ICASSP2022 Qiqi Wang 0005, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning.

ICASSP2022 Tong Ye, Shijing Si, Jianzong Wang, Rui Wang, Ning Cheng 0001, Jing Xiao 0006, 
VU-BERT: A Unified Framework for Visual Dialog.

ICASSP2022 Yong Zhang, Zhitao Li, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Self-Attention for Incomplete Utterance Rewriting.

#72  | Chi-Chun Lee | DBLP Google Scholar  
By venue: Interspeech: 27, ICASSP: 14, TASLP: 1
By year: 2024: 4, 2023: 7, 2022: 6, 2021: 1, 2020: 10, 2019: 9, 2018: 5
ISCA sessions: speech emotion recognition: 8; network architectures for emotion and paralinguistics recognition: 2; speech enhancement and bandwidth expansion: 1; paralinguistics: 1; (multimodal) speech emotion recognition: 1; trustworthy speech processing: 1; speech signal analysis and representation: 1; speech in multimodality: 1; diarization: 1; computational paralinguistics: 1; speech and language analytics for medical applications: 1; acoustic phonetics: 1; attention mechanism for speaker state recognition: 1; the interspeech 2019 computational paralinguistics challenge (compare): 1; representation learning for emotion: 1; speech pathology, depression, and medical applications: 1; speech and language analytics for mental health: 1; the interspeech 2018 computational paralinguistics challenge (compare): 1; deception, personality, and culture attribute: 1
IEEE keywordsemotion recognition: 10speech recognition: 8speech emotion recognition: 6conversation: 3adaptation models: 2analytical models: 2data models: 2physiology: 2soft label learning: 2affective multimedia: 2convolutional neural nets: 2signal classification: 2interactive systems: 2particle measurements: 1atmospheric measurements: 1oral communication: 1vowel space characteristics: 1extraterrestrial measurements: 1diagnosis: 1acoustic measurements: 1severity assessment: 1autism: 1task analysis: 1pathology: 1gamma correction: 1respiratory sound classification: 1mix up: 1data augmentation: 1stethoscope: 1spectrogram: 1sensitivity: 1stability criteria: 1gender neutrality: 1annotations: 1focusing: 1speaker rater biases: 1fairness: 1human factors: 1wearable devices: 1mutual learning: 1federated learning: 1stress detection: 1heart rate variability: 1stress measurement: 1fair representation: 1measurement: 1perceptual fairness: 1rater bias: 1transfer learning: 1domain adaptation: 1cross lingual: 1phonetics: 1modulation: 1multi label learning: 1natural language processing: 1distribution label learning: 1cognition: 1transformer: 1auditory saliency: 1audio signal processing: 1multi site transfer: 1adversarial domain adaptation: 1diseases: 1adhd: 1biomedical mri: 1medical image processing: 1neurophysiology: 1image classification: 1unsupervised learning: 1fmri: 1regression analysis: 1graph convolutional network: 1mean square error methods: 1decision making: 1small group interaction: 1group performance: 1behavioural sciences computing: 1graph theory: 1graph convolution network: 1personality recognition: 1image representation: 1video signal processing: 1dialogical emotion decoder: 1decoding: 1behavioral signal processing (bsp): 1cross corpus learning: 1adversarial network: 1blstm: 1annotator modeling: 1interaction: 1human computer interaction: 1spoken dialogs: 1attention mechanism: 1
Most publications (all venues) at: 2019: 26, 2020: 21, 2023: 17, 2018: 16, 2022: 15


Recent publications

TASLP2024 Chin-Po Chen, Ho-Hsien Pan, Susan Shur-Fen Gau, Chi-Chun Lee
Using Measures of Vowel Space for Autistic Traits Characterization.

ICASSP2024 An-Yan Chang, Jing-Tong Tzeng, Huan-Yu Chen, Chih-Wei Sung, Chun-Hsiang Huang, Edward Pei-Chuan Huang, Chi-Chun Lee
GaP-Aug: Gamma Patch-Wise Correction Augmentation Method for Respiratory Sound Classification.

ICASSP2024 Woan-Shiuan Chien, Shreya G. Upadhyay, Chi-Chun Lee
Balancing Speaker-Rater Fairness for Gender-Neutral Speech Emotion Recognition.

ICASSP2024 Po-Chen Lin, Jeng-Lin Li, Woan-Shiuan Chien, Chi-Chun Lee
In-The-Wild Physiological-Based Stress Detection Using Federated Strategy.

ICASSP2023 Woan-Shiuan Chien, Chi-Chun Lee
Achieving Fair Speech Emotion Recognition via Perceptual Fairness.

ICASSP2023 Shreya G. Upadhyay, Luz Martinez-Lucas, Bo-Hao Su, Wei-Cheng Lin, Woan-Shiuan Chien, Ya-Tse Wu, William Katz, Carlos Busso, Chi-Chun Lee
Phonetic Anchor-Based Transfer Learning to Facilitate Unsupervised Cross-Lingual Speech Emotion Recognition.

Interspeech2023 Huang-Cheng Chou, Lucas Goncalves, Seong-Gyun Leem, Chi-Chun Lee, Carlos Busso, 
The Importance of Calibration: Rethinking Confidence and Performance of Speech Multi-label Emotion Classifiers.

Interspeech2023 Yin-Tse Lin, Bo-Hao Su, Chi-Han Lin, Shih-Chan Kuo, Jyh-Shing Roger Jang, Chi-Chun Lee
Noise-Robust Bandwidth Expansion for 8K Speech Recordings.

Interspeech2023 Shao-Hao Lu, Yun-Shao Lin, Chi-Chun Lee
Speaking State Decoder with Transition Detection for Next Speaker Prediction.

Interspeech2023 Ya-Tse Wu, Yuan-Ting Chang, Shao-Hao Lu, Jing-Yi Chuang, Chi-Chun Lee
A Context-Constrained Sentence Modeling for Deception Detection in Real Interrogation.

Interspeech2023 Ya-Tse Wu, Chi-Chun Lee
MetricAug: A Distortion Metric-Lead Augmentation Strategy for Training Noise-Robust Speech Emotion Recognizer.

ICASSP2022 Huang-Cheng Chou, Wei-Cheng Lin, Chi-Chun Lee, Carlos Busso, 
Exploiting Annotators' Typed Description of Emotion Perception to Maximize Utilization of Ratings for Speech Emotion Recognition.

ICASSP2022 Ya-Tse Wu, Jeng-Lin Li, Chi-Chun Lee
An Audio-Saliency Masking Transformer for Audio Emotion Classification in Movies.

Interspeech2022 Chun-Yu Chen, Yun-Shao Lin, Chi-Chun Lee
Emotion-Shift Aware CRF for Decoding Emotion Sequence in Conversation.

Interspeech2022 Huang-Cheng Chou, Chi-Chun Lee, Carlos Busso, 
Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier.

Interspeech2022 Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee
An Attention-Based Method for Guiding Attribute-Aligned Speech Representation Learning.

Interspeech2022 Bo-Hao Su, Chi-Chun Lee
Vaccinating SER to Neutralize Adversarial Attacks with Self-Supervised Augmentation Strategy.

Interspeech2021 Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee
An Attribute-Aligned Strategy for Learning Speech Representation.

ICASSP2020 Ya-Lin Huang, Wan-Ting Hsieh, Hao-Chun Yang, Chi-Chun Lee
Conditional Domain Adversarial Transfer for Robust Cross-Site ADHD Classification Using Functional MRI.

ICASSP2020 Yun-Shao Lin, Chi-Chun Lee
Predicting Performance Outcome with a Conversational Graph Convolutional Network for Small Group Interactions.

#73  | Elmar Nöth | DBLP Google Scholar  
By venue: Interspeech: 33, ICASSP: 7, SpeechComm: 2
By year: 2024: 2, 2023: 12, 2022: 7, 2021: 7, 2020: 4, 2019: 8, 2018: 2
ISCA sessions: speech and language in health: 6; connecting speech-science and speech-technology for children's speech: 3; speech, voice, and hearing disorders: 2; speech and language analytics for medical applications: 2; atypical speech detection: 1; acoustic signal representation and analysis: 1; self-supervised, semi-supervised, adaptation and data augmentation for asr: 1; technology for disordered speech: 1; show and tell: 1; miscellaneous topics in speech, voice and hearing disorders: 1; speech and audio analysis: 1; voice quality characterization for clinical voice assessment: 1; the interspeech 2021 computational paralinguistics challenge (compare): 1; the adresso challenge: 1; disordered speech: 1; noise reduction and intelligibility: 1; the interspeech 2020 computational paralinguistics challenge (compare): 1; speech perception in adverse listening conditions: 1; speech and audio classification: 1; the interspeech 2019 computational paralinguistics challenge (compare): 1; applications in language learning and healthcare: 1; social signals detection and speaker traits analysis: 1; speech and language analytics for mental health: 1; automatic detection and recognition of voice and speech disorders: 1
IEEE keywordsdiseases: 4speech analysis: 3alzheimer’s disease: 3medical signal processing: 3depression: 2medical diagnostic computing: 2natural language processing: 2parkinson’s disease: 2gait analysis: 2neurophysiology: 2automatic assessment: 1children’s speech: 1pathologic speech: 1lips: 1cleft lip and palate: 1task analysis: 1sociology: 1mental health: 1adaptation models: 1longitudinal assessment: 1contrastive training: 1language analysis: 1transfer learning: 1artificial neural networks: 1computational modeling: 1emotion recognition: 1forestry: 1analytical models: 1linguistic analysis: 1medical disorders: 1psen1–e280a: 1acoustic analysis: 1smartphones: 1patient treatment: 1deep learning (artificial intelligence): 1smart phones: 1handwriting analysis: 1mixture models: 1gmm ubm: 1speaker recognition: 1ivectors: 1gaussian processes: 1speech recognition: 1automatic diagnosis: 1neural network language models: 1recurrent neural nets: 1long short term memory: 1language models: 1
Most publications (all venues) at: 2022: 20, 2023: 19, 2019: 19, 2015: 19, 2009: 18


Recent publications

ICASSP2024 Ilja Baumann, Dominik Wagner 0002, Maria Schuster, Elmar Nöth, Tobias Bocklet, 
Towards Interpretability of Automatic Phoneme Analysis in Cleft Lip and Palate Speech.

ICASSP2024 Paula Andrea Pérez-Toro, Judith Dineley, Agnieszka Kaczkowska, Pauline Conde, Yuezhou Zhang, Faith Matcham, Sara Siddi, Josep Maria Haro, Stuart Bruce, Til Wykes, Raquel Bailón, Srinivasan Vairavan, Richard J. B. Dobson, Andreas K. Maier, Elmar Nöth, Juan Rafael Orozco-Arroyave, Vaibhav A. Narayan, Nicholas Cummins, 
Longitudinal Modeling of Depression Shifts Using Speech and Language.

SpeechComm2023 Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Philipp Klumpp, Juan Camilo Vásquez-Correa, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, 
Depression assessment in people with Parkinson's disease: The combination of acoustic features and natural language processing.

ICASSP2023 Paula Andrea Pérez-Toro, Dalia Rodríguez-Salas, Tomás Arias-Vergara, Sebastian P. Bayerl, Philipp Klumpp, Korbinian Riedhammer, Maria Schuster, Elmar Nöth, Andreas K. Maier, Juan Rafael Orozco-Arroyave, 
Transferring Quantified Emotion Knowledge for the Detection of Depression in Alzheimer's Disease Using Forestnets.

Interspeech2023 Soroosh Tayebi Arasteh, Cristian David Ríos-Urrego, Elmar Nöth, Andreas Maier 0001, Seung Hee Yang, Jan Rusz, Juan Rafael Orozco-Arroyave, 
Federated Learning for Secure Development of AI Models for Parkinson's Disease Detection Using Speech from Different Languages.

Interspeech2023 Tomás Arias-Vergara, Elizabeth Londoño-Mora, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas Maier 0001, 
Measuring Phonological Precision in Children with Cleft Lip and Palate.

Interspeech2023 Ilja Baumann, Dominik Wagner 0002, Franziska Braun, Sebastian P. Bayerl, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet, 
Influence of Utterance and Speaker Characteristics on the Classification of Children with Cleft Lip and Palate.

Interspeech2023 Sebastian P. Bayerl, Dominik Wagner 0002, Ilja Baumann, Florian Hönig, Tobias Bocklet, Elmar Nöth, Korbinian Riedhammer, 
A Stutter Seldom Comes Alone - Cross-Corpus Stuttering Detection as a Multi-label Problem.

Interspeech2023 Franziska Braun, Sebastian P. Bayerl, Paula Andrea Pérez-Toro, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Elmar Nöth, Tobias Bocklet, Korbinian Riedhammer, 
Classifying Dementia in the Presence of Depression: A Cross-Corpus Study.

Interspeech2023 Daniel Escobar-Grisales, Tomás Arias-Vergara, Cristian David Ríos-Urrego, Elmar Nöth, Adolfo M. García, Juan Rafael Orozco-Arroyave, 
An Automatic Multimodal Approach to Analyze Linguistic and Acoustic Cues on Parkinson's Disease Patients.

Interspeech2023 Hiuching Hung, Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Andreas Maier 0001, Elmar Nöth
Speaking Clearly, Understanding Better: Predicting the L2 Narrative Comprehension of Chinese Bilingual Kindergarten Children Based on Speech Intelligibility Using a Machine Learning Approach.

Interspeech2023 Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Franziska Braun, Florian Hönig, Carlos Andrés Tobón-Quintero, David Aguillón, Francisco Lopera, Liliana Hincapié-Henao, Maria Schuster, Korbinian Riedhammer, Andreas Maier 0001, Elmar Nöth, Juan Rafael Orozco-Arroyave, 
Automatic Assessment of Alzheimer's across Three Languages Using Speech and Language Features.

Interspeech2023 Cristian David Ríos-Urrego, Jan Rusz, Elmar Nöth, Juan Rafael Orozco-Arroyave, 
Automatic Classification of Hypokinetic and Hyperkinetic Dysarthria based on GMM-Supervectors.

Interspeech2023 Dominik Wagner 0002, Ilja Baumann, Franziska Braun, Sebastian P. Bayerl, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet, 
Multi-class Detection of Pathological Speech with Latent Features: How does it perform on unseen data?

Interspeech2022 Sebastian Peter Bayerl, Dominik Wagner 0002, Elmar Nöth, Korbinian Riedhammer, 
Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0.

Interspeech2022 Christian Bergler, Alexander Barnhill, Dominik Perrin, Manuel Schmitt, Andreas K. Maier, Elmar Nöth
ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep Learning.

Interspeech2022 Teena tom Dieck, Paula Andrea Pérez-Toro, Tomas Arias, Elmar Nöth, Philipp Klumpp, 
Wav2vec behind the Scenes: How end2end Models learn Phonetics.

Interspeech2022 Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier, Seung Hee Yang, 
Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition.

Interspeech2022 Paula Andrea Pérez-Toro, Philipp Klumpp, Abner Hernandez, Tomas Arias, Patricia Lillo, Andrea Slachevsky, Adolfo Martín García, Maria Schuster, Andreas K. Maier, Elmar Nöth, Juan Rafael Orozco-Arroyave, 
Alzheimer's Detection from English to Spanish Using Acoustic and Linguistic Embeddings.

Interspeech2022 P. Schäfer, Paula Andrea Pérez-Toro, Philipp Klumpp, Juan Rafael Orozco-Arroyave, Elmar Nöth, Andreas K. Maier, A. Abad, Maria Schuster, Tomás Arias-Vergara, 
CoachLea: an Android Application to Evaluate the Speech Production and Perception of Children with Hearing Loss.

#74  | Peter Bell 0001 | DBLP Google Scholar  
By venue: Interspeech: 25, ICASSP: 14, TASLP: 2, SpeechComm: 1
By year: 2024: 2, 2023: 13, 2022: 5, 2021: 8, 2020: 7, 2019: 6, 2018: 1
ISCA sessions: feature extraction and distant asr: 3; speech recognition: 2; paralinguistics: 1; perception of paralinguistics: 1; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; speech synthesis and voice conversion: 1; spoken dialog systems and conversational analysis: 1; cross/multi-lingual asr: 1; robust speaker recognition: 1; spoken dialogue systems: 1; linguistic components in end-to-end asr: 1; topics in asr: 1; embedding and network architecture for speaker recognition: 1; spoken language processing: 1; self-supervision and semi-supervision for neural asr training: 1; neural network training methods for asr: 1; asr model training and strategies: 1; asr neural network training: 1; model training for asr: 1; feature extraction for asr: 1; asr neural network architectures: 1; acoustic model adaptation: 1
IEEE keywordsspeech recognition: 9automatic speech recognition: 4adaptation models: 3task analysis: 3natural language processing: 3explanation: 2error analysis: 2end to end: 2phonetics: 2oral communication: 2sensor fusion: 2asr: 2acoustic modelling: 2explainable ai: 1phoneme recognition: 1image segmentation: 1reliability: 1image classification: 1confusion matrix: 1phonetic error analysis: 1hybrid: 1broad phonetic classes: 1timit: 1phone recognition: 1nose: 1raw signal representation: 1multi stream acoustic modelling: 1fourier transforms: 1fourier transform: 1shape: 1streaming media: 1information filters: 1cross domain fusion: 1multitask learning: 1computational modeling: 1affective computing: 1impression recognition: 1listener adaptation: 1correlation coefficient: 1dyadic interaction: 1computer architecture: 1bias in speech recognition: 1conversational speech: 1english accents: 1data models: 1linguistics: 1recording: 1audio visual speech enhancement: 1speech enhancement: 1layout: 1dictionaries: 1intelligibility evaluation: 1keyword spotting: 1internet: 1machine learning: 1quality assessment: 1speech emotion recognition: 1multi task learning: 1emotion recognition: 1wav2vec 2.0: 1sequence training: 1topology: 1mathematical models: 1e2e asr: 1ctc: 1mmi: 1planning: 1phase based source filter separation: 1multi head cnns: 1raw phase spectrum: 1general classifier: 1recurrent neural nets: 1language model: 1top down training: 1layer wise training: 1domain adaptation: 1multilingual speech recognition: 1diarization: 1deep neural network: 1domain adversarial training: 1adversarial learning: 1speaker verification: 1speaker recognition: 1convolutional neural nets: 1signal resolution: 1low pass filters: 1signal representation: 1computer vision: 1bottleneck features: 1statistical normalisation: 1deep neural networks: 1probability density function: 1attention: 1decoding: 1
Most publications (all venues) at: 2023: 16, 2024: 15, 2020: 15, 2021: 12, 2019: 10

Affiliations
University of Edinburgh, Centre for Speech Technology Research, UK

Recent publications

SpeechComm2024 Georgios Karakasidis, Mikko Kurimo, Peter Bell 0001, Tamás Grósz, 
Comparison and analysis of new curriculum criteria for end-to-end ASR.

ICASSP2024 Xiaoliang Wu, Peter Bell 0001, Ajitha Rajan, 
Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition.

TASLP2023 Erfan Loweimi, Andrea Carmantini, Peter Bell 0001, Steve Renals, Zoran Cvetkovic, 
Phonetic Error Analysis Beyond Phone Error Rate.

TASLP2023 Erfan Loweimi, Zhengjun Yue, Peter Bell 0001, Steve Renals, Zoran Cvetkovic, 
Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform.

ICASSP2023 Yuanchao Li, Peter Bell 0001, Catherine Lai, 
Multimodal Dyadic Impression Recognition via Listener Adaptive Cross-Domain Fusion.

ICASSP2023 Ramon Sanabria, Nikolay Bogoychev, Nina Markl, Andrea Carmantini, Ondrej Klejch, Peter Bell 0001
The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR.

ICASSP2023 Cassia Valentini-Botinhao, Andrea Lorena Aldana Blanco, Ondrej Klejch, Peter Bell 0001
Efficient Intelligibility Evaluation Using Keyword Spotting: A Study on Audio-Visual Speech Enhancement.

ICASSP2023 Xiaoliang Wu, Peter Bell 0001, Ajitha Rajan, 
Explanations for Automatic Speech Recognition.

Interspeech2023 Debasmita Bhattacharya, Jie Chi, Julia Hirschberg, Peter Bell 0001
Capturing Formality in Speech Across Domains and Languages.

Interspeech2023 Jie Chi, Brian Lu, Jason Eisner, Peter Bell 0001, Preethi Jyothi, Ahmed M. Ali 0002, 
Unsupervised Code-switched Text Generation from Parallel Text.

Interspeech2023 Yuanchao Li, Peter Bell 0001, Catherine Lai, 
Transfer Learning for Personality Perception via Speech Emotion Recognition.

Interspeech2023 Yuanchao Li, Zeyu Zhao 0004, Ondrej Klejch, Peter Bell 0001, Catherine Lai, 
ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition.

Interspeech2023 Christoph Minixhofer, Ondrej Klejch, Peter Bell 0001
Evaluating and reducing the distance between synthetic and real speech distributions.

Interspeech2023 Sarenne Wallbridge, Peter Bell 0001, Catherine Lai, 
Quantifying the perceptual value of lexical and non-lexical channels in speech.

Interspeech2023 Zeyu Zhao 0004, Peter Bell 0001
Regarding Topology and Variant Frame Rates for Differentiable WFST-based End-to-End ASR.

ICASSP2022 Yuanchao Li, Peter Bell 0001, Catherine Lai, 
Fusing ASR Outputs in Joint Training for Speech Emotion Recognition.

ICASSP2022 Zeyu Zhao 0004, Peter Bell 0001
Investigating Sequence-Level Normalisation For CTC-Like End-to-End ASR.

Interspeech2022 Ondrej Klejch, Electra Wallington, Peter Bell 0001
Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR.

Interspeech2022 Chau Luu, Steve Renals, Peter Bell 0001
Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations.

Interspeech2022 Sarenne Carrol Wallbridge, Catherine Lai, Peter Bell 0001
Investigating perception of spoken dialogue acceptability through surprisal.

#75  | Bo Li 0028 | DBLP Google Scholar  
By venue: ICASSP: 26, Interspeech: 13, NAACL: 1, ICLR: 1
By year: 2024: 4, 2023: 11, 2022: 8, 2021: 7, 2020: 5, 2019: 5, 2018: 1
ISCA sessions: speech recognition: 2; asr technologies and systems: 2; multi-, cross-lingual and other topics in asr: 2; acoustic model adaptation for asr: 1; search/decoding techniques and confidence measures for asr: 1; streaming for asr/rnn transducers: 1; speech classification: 1; training strategies for asr: 1; asr neural network architectures: 1; acoustic model adaptation: 1
IEEE keywordsspeech recognition: 19recurrent neural nets: 8adaptation models: 6natural language processing: 6computational modeling: 4speech coding: 4transducers: 3data models: 3task analysis: 3end to end asr: 3decoding: 3multilingual: 3rnn t: 3degradation: 2error analysis: 2costs: 2universal speech model: 2automatic speech recognition: 2foundation model: 2video on demand: 2transfer learning: 2asr: 2conformer: 2latency: 2probability: 2optimisation: 2end to end speech recognition: 2tail: 1adapter finetuning: 1streaming multilingual asr: 1sparsity: 1topology: 1computational efficiency: 1model pruning: 1model quantization: 1quantization (signal): 1parameter efficient adaptation: 1tuning: 1longform asr: 1fuses: 1tensors: 1computer architecture: 1memory management: 1analytical models: 1domain adaptation: 1foundation models: 1frequency modulation: 1soft sensors: 1internal lm: 1text recognition: 1text injection: 1and foundation speech models: 1model reprogramming: 1pre trained adaptation: 1benchmark testing: 1focusing: 1cross lingual speech recognition: 1convolution: 1kernel: 1encoding: 1buildings: 1production: 1switches: 1utf 8 byte: 1unified modeling language: 1word piece: 1multilingual asr: 1joint training: 1contrastive learning: 1indexes: 1self supervised learning: 1linear programming: 1massive: 1lifelong learning: 1two pass asr: 1rnnt: 1long form asr: 1speaker recognition: 1fusion: 1gating: 1bilinear pooling: 1signal representation: 1cascaded encoders: 1confidence scores: 1hidden markov models: 1mean square error methods: 1transformer: 1calibration: 1confidence: 1voice activity detection: 1attention based end to end models: 1regression analysis: 1endpointer: 1data augmentation: 1multi domain training: 1vocabulary: 1unsupervised learning: 1sequence to sequence: 1filtering theory: 1semi supervised training: 1mobile handsets: 1sequence classification: 1connectionist temporal classification: 1stimulated learning: 1end to end speech synthesis: 1speech synthesis: 1
Most publications (all venues) at: 2023: 15, 2022: 14, 2019: 9, 2021: 8, 2017: 8

Affiliations
Google Inc., USA
National University of Singapore, Singapore (former)

Recent publications

ICASSP2024 Junwen Bai, Bo Li 0028, Qiujia Li, Tara N. Sainath, Trevor Strohman, 
Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR.

ICASSP2024 Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li 0028, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal, 
USM-Lite: Quantization and Sparsity Aware Fine-Tuning for Speech Recognition with Universal Speech Models.

ICASSP2024 Khe Chai Sim, Zhouyuan Huo, Tsendsuren Munkhdalai, Nikhil Siddhartha, Adam Stooke, Zhong Meng, Bo Li 0028, Tara N. Sainath, 
A Comparison of Parameter-Efficient ASR Domain Adaptation Methods for Universal Speech and Language Models.

NAACL2024 Weiran Wang, Rohit Prabhavalkar, Haozhe Shan, Zhong Meng, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li 0028, James Qin, Xingyu Cai, Adam Stooke, Chengjian Zheng, Yanzhang He, Tara N. Sainath, Pedro Moreno Mengibar, 
Massive End-to-end Speech Recognition Models with Time Reduction.

ICASSP2023 Shuo-Yiin Chang, Chao Zhang 0031, Tara N. Sainath, Bo Li 0028, Trevor Strohman, 
Context-Aware end-to-end ASR Using Self-Attentive Embedding and Tensor Fusion.

ICASSP2023 Ke Hu, Tara N. Sainath, Bo Li 0028, Nan Du 0002, Yanping Huang, Andrew M. Dai, Yu Zhang 0033, Rodrigo Cabrera, Zhifeng Chen, Trevor Strohman, 
Massively Multilingual Shallow Fusion with Large Language Models.

ICASSP2023 Zhouyuan Huo, Khe Chai Sim, Bo Li 0028, Dongseong Hwang, Tara N. Sainath, Trevor Strohman, 
Resource-Efficient Transfer Learning from Speech Foundation Model Using Hierarchical Feature Fusion.

ICASSP2023 Bo Li 0028, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang 0033, Wei Han 0002, Trevor Strohman, Françoise Beaufays, 
Efficient Domain Adaptation for Speech Foundation Models.

ICASSP2023 Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang 0033, Bo Li 0028, Andrew Rosenberg, Bhuvana Ramabhadran, 
JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman, 
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee 0001, 
A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition.

ICASSP2023 Chao Zhang 0031, Bo Li 0028, Tara N. Sainath, Trevor Strohman, Shuo-Yiin Chang, 
UML: A Universal Monolingual Output Layer For Multilingual Asr.

Interspeech2023 Zih-Ching Chen, Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath, 
How to Estimate Model Transferability of Pre-Trained Speech Models?

Interspeech2023 Ke Hu, Bo Li 0028, Tara N. Sainath, Yu Zhang 0033, Françoise Beaufays, 
Mixture-of-Expert Conformer for Streaming Multilingual ASR.

Interspeech2023 Qiujia Li, Bo Li 0028, Dongseong Hwang, Tara N. Sainath, Pedro Moreno Mengibar, 
Modular Domain Adaptation for Conformer-Based Streaming ASR.

ICASSP2022 Junwen Bai, Bo Li 0028, Yu Zhang 0033, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath, 
Joint Unsupervised and Supervised Training for Multilingual ASR.

ICASSP2022 Bo Li 0028, Ruoming Pang, Yu Zhang 0033, Tara N. Sainath, Trevor Strohman, Parisa Haghani, Yun Zhu, Brian Farris, Neeraj Gaur, Manasa Prasad, 
Massively Multilingual ASR: A Lifelong Learning Solution.

ICASSP2022 Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Weiran Wang, David Qiu, Chung-Cheng Chiu, Rohit Prabhavalkar, Alexander Gruenstein, Anmol Gulati, Bo Li 0028, David Rybach, Emmanuel Guzman, Ian McGraw, James Qin, Krzysztof Choromanski, Qiao Liang 0001, Robert David, Ruoming Pang, Shuo-Yiin Chang, Trevor Strohman, W. Ronny Huang, Wei Han 0002, Yonghui Wu, Yu Zhang 0033, 
Improving The Latency And Quality Of Cascaded Encoders.

ICASSP2022 Chao Zhang 0031, Bo Li 0028, Zhiyun Lu, Tara N. Sainath, Shuo-Yiin Chang, 
Improving the Fusion of Acoustic and Text Representations in RNN-T.

Interspeech2022 Shuo-Yiin Chang, Bo Li 0028, Tara N. Sainath, Chao Zhang 0031, Trevor Strohman, Qiao Liang 0001, Yanzhang He, 
Turn-Taking Prediction for Natural Conversational Speech.

#76  | Sheng Zhao | DBLP Google Scholar  
By venue: Interspeech: 14, ICASSP: 11, ICLR: 4, ICML: 3, NeurIPS: 3, AAAI: 3, TASLP: 1, KDD: 1, IJCAI: 1
By year: 2024: 4, 2023: 7, 2022: 9, 2021: 8, 2020: 7, 2019: 6
ISCA sessions: speech synthesis: 8; show and tell: 1; voice conversion and adaptation: 1; language and lexical modeling for asr: 1; asr model training and strategies: 1; asr neural network architectures and training: 1; speech synthesis paradigms and methods: 1
IEEE keywordstext to speech: 7speech synthesis: 7speech recognition: 4contextual biasing: 2contextual spelling correction: 2vocoders: 2lightweight: 2medical image processing: 2speaker recognition: 2speech intelligibility: 2style control: 1prompt: 1benchmark testing: 1decoding: 1image synthesis: 1task analysis: 1context modeling: 1external attention: 1data models: 1semantics: 1multi lingual: 1convolution: 1waveglow: 1multi speaker: 1non autoregressive: 1iterative methods: 1fast sampling: 1probability: 1image denoising: 1denoising diffusion probabilistic models: 1optimisation: 1vocoder: 1speech to animation: 1mixture of experts: 1transformer: 1computer animation: 1phonetic posteriorgrams: 1text analysis: 1pre training: 1data reduction: 1speech quality assessment: 1correlation methods: 1mos prediction: 1mean bias network: 1sensitivity analysis: 1video signal processing: 1search problems: 1neural architecture search: 1fast: 1autoregressive processes: 1adaptation: 1untranscribed data: 1signal reconstruction: 1frame level condition: 1speech enhancement: 1noisy speech: 1signal denoising: 1denoise: 1multi head self attention: 1speech emotion recognition: 1emotion recognition: 1dilated residual network: 1
Most publications (all venues) at2023: 212024: 172022: 122021: 112019: 11


Recent publications

ICML2024 Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan 0003, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu 0001, Tao Qin 0001, Xiangyang Li 0001, Wei Ye 0004, Shikun Zhang, Jiang Bian 0002, Lei He 0005, Jinyu Li 0001, Sheng Zhao
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

ICML2024 Dongchao Yang, Jinchuan Tian, Xu Tan 0003, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian 0002, Zhou Zhao, Xixin Wu, Helen M. Meng, 
UniAudio: Towards Universal Audio Generation with Large Language Models.

ICLR2024 Yichong Leng, Zhifang Guo, Kai Shen, Zeqian Ju, Xu Tan 0003, Eric Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He 0005, Xiangyang Li 0001, Sheng Zhao, Tao Qin 0001, Jiang Bian 0002, 
PromptTTS 2: Describing and Generating Voices with Text Prompt.

ICLR2024 Kai Shen, Zeqian Ju, Xu Tan 0003, Eric Liu, Yichong Leng, Lei He 0005, Tao Qin 0001, Sheng Zhao, Jiang Bian 0002, 
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.

ICASSP2023 Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, Xu Tan 0003, 
Prompttts: Controllable Text-To-Speech With Text Descriptions.

ICASSP2023 Xiaoqiang Wang 0006, Yanqing Liu, Jinyu Li 0001, Sheng Zhao
Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation.

ICASSP2023 Chen Zhang 0020, Shubham Bansal, Aakash Lakhera, Jinzhu Li, Gang Wang 0001, Sandeepkumar Satpal, Sheng Zhao, Lei He 0005, 
LeanSpeech: The Microsoft Lightweight Speech Synthesis System for Limmits Challenge 2023.

Interspeech2023 Brendan Walsh, Mark Hamilton, Greg Newby, Xi Wang 0016, Serena Ruan, Sheng Zhao, Lei He 0005, Shaofei Zhang, Eric Dettinger, William T. Freeman, Markus Weimer, 
Large-Scale Automatic Audiobook Creation.

Interspeech2023 Yujia Xiao, Shaofei Zhang, Xi Wang 0016, Xu Tan 0003, Lei He 0005, Sheng Zhao, Frank K. Soong, Tan Lee 0001, 
ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.

NeurIPS2023 Yuancheng Wang, Zeqian Ju, Xu Tan 0003, Lei He 0005, Zhizheng Wu 0001, Jiang Bian 0002, Sheng Zhao
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models.

AAAI2023 Yihan Wu, Junliang Guo, Xu Tan 0003, Chen Zhang 0020, Bohan Li 0003, Ruihua Song, Lei He 0005, Sheng Zhao, Arul Menezes, Jiang Bian 0002, 
VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing.

TASLP2022 Xiaoqiang Wang 0006, Yanqing Liu, Jinyu Li 0001, Veljko Miljanic, Sheng Zhao, Hosam Khalil, 
Towards Contextual Spelling Correction for Customization of End-to-End Speech Recognition Systems.

ICASSP2022 Zehua Chen, Xu Tan 0003, Ke Wang, Shifeng Pan, Danilo P. Mandic, Lei He 0005, Sheng Zhao
Infergrad: Improving Diffusion Models for Vocoder by Considering Inference in Training.

ICASSP2022 Liyang Chen, Zhiyong Wu 0001, Jun Ling, Runnan Li, Xu Tan 0003, Sheng Zhao
Transformer-S2A: Robust and Efficient Speech-to-Animation.

ICASSP2022 Guangyan Zhang, Yichong Leng, Daxin Tan, Ying Qin, Kaitao Song, Xu Tan 0003, Sheng Zhao, Tan Lee 0001, 
A Study on the Efficacy of Model Pre-Training In Developing Neural Text-to-Speech System.

Interspeech2022 Yanqing Liu, Ruiqing Xue, Lei He 0005, Xu Tan 0003, Sheng Zhao
DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders.

Interspeech2022 Yihan Wu, Xu Tan 0003, Bohan Li 0003, Lei He 0005, Sheng Zhao, Ruihua Song, Tao Qin 0001, Tie-Yan Liu, 
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.

Interspeech2022 Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang 0006, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo, 
RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion.

Interspeech2022 Guangyan Zhang, Kaitao Song, Xu Tan 0003, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang 0001, Wei Zhou, Tao Qin 0001, Tan Lee 0001, Sheng Zhao
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech.

NeurIPS2022 Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen 0008, Xu Tan 0003, Danilo P. Mandic, Lei He 0005, Xiangyang Li 0001, Tao Qin 0001, Sheng Zhao, Tie-Yan Liu, 
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis.

#77  | Yi Ren 0006 | DBLP Google Scholar  
By venueICLR: 5AAAI: 5ICASSP: 5Interspeech: 5ACL: 5NeurIPS: 4IJCAI: 4ACL-Findings: 3ICML: 2KDD: 2TASLP: 1
By year2024: 42023: 132022: 112021: 72020: 42019: 2
ISCA sessionsspeech synthesis: 3statistical machine translation: 1speech coding and privacy: 1
IEEE keywordsspeech enhancement: 4text to speech: 3data mining: 2task analysis: 2title generation: 2summarization: 2action item detection: 2keyphrase extraction: 2grasping: 2topic segmentation: 2signal denoising: 2speech synthesis: 2denoise: 2speech recognition: 1speaker embedding: 1cross lingual voice conversion (xvc): 1multi reference: 1timbre: 1pitch normalization: 1data handling: 1performance gain: 1long form spoken language processing: 1benchmark testing: 1annotations: 1manuals: 1recording: 1prosody modeling: 1pre training: 1focusing: 1predictive models: 1shape: 1generative adversarial network: 1singing voice synthesis: 1noisy audio: 1frame level condition: 1noisy speech: 1speaker recognition: 1
Most publications (all venues) at2023: 232022: 212021: 112024: 92020: 8

Affiliations
Zhejiang University, China

Recent publications

TASLP2024 Mingyang Zhang 0003, Yi Zhou 0020, Yi Ren 0006, Chen Zhang 0020, Xiang Yin 0006, Haizhou Li 0001, 
RefXVC: Cross-Lingual Voice Conversion With Enhanced Reference Leveraging.

ICLR2024 Ziyue Jiang 0001, Jinglin Liu, Yi Ren 0006, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang 0020, Pengfei Wei 0001, Chunfeng Wang, Xiang Yin 0006, Zejun Ma, Zhou Zhao, 
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis.

AAAI2024 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren 0006, Yuexian Zou, Zhou Zhao, Shinji Watanabe 0001, 
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.

AAAI2024 Rui Liu 0008, Yifan Hu, Yi Ren 0006, Xiang Yin 0006, Haizhou Li 0001, 
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling.

ICASSP2023 Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen 0003, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren 0006, Zhou Zhao, 
Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG).

ICASSP2023 Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen 0003, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren 0006, Zhou Zhao, 
MUG: A General Meeting Understanding and Generation Benchmark.

Interspeech2023 Yahuan Cong, Haoyu Zhang, Haopeng Lin, Shichao Liu 0003, Chunfeng Wang, Yi Ren 0006, Xiang Yin 0006, Zejun Ma, 
GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech.

Interspeech2023 Kun Song, Yi Ren 0006, Yi Lei, Chunfeng Wang, Kun Wei, Lei Xie 0001, Xiang Yin 0006, Zejun Ma, 
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation.

ICML2023 Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren 0006, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin 0006, Zhou Zhao, 
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models.

ICLR2023 Yi Ren 0006, Chen Zhang 0020, Shuicheng Yan, 
Bag of Tricks for Unsupervised Text-to-Speech.

ICLR2023 Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren 0006, Lichao Zhang, Jinzheng He, Zhou Zhao, 
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation.

ICLR2023 Zhenhui Ye, Ziyue Jiang 0001, Yi Ren 0006, Jinglin Liu, Jinzheng He, Zhou Zhao, 
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis.

ACL2023 Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren 0006, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin 0006, Zhou Zhao, 
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation.

ACL2023 Zhenhui Ye, Rongjie Huang, Yi Ren 0006, Ziyue Jiang 0001, Jinglin Liu, Jinzheng He, Xiang Yin 0006, Zhou Zhao, 
CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training.

ACL-Findings2023 Rongjie Huang, Yi Ren 0006, Ziyue Jiang 0001, Chenye Cui, Jinglin Liu, Zhou Zhao, 
FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis.

ACL-Findings2023 Rongjie Huang, Chunlei Zhang, Yi Ren 0006, Zhou Zhao, Dong Yu 0001, 
Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech.

ACL-Findings2023 Ziyue Jiang 0001, Qian Yang, Jialong Zuo, Zhenhui Ye, Rongjie Huang, Yi Ren 0006, Zhou Zhao, 
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models.

ICASSP2022 Yi Ren 0006, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen 0003, Zhijie Yan, Zhou Zhao, 
Prosospeech: Enhancing Prosody with Quantized Vector Pre-Training in Text-To-Speech.

ICASSP2022 Lichao Zhang, Yi Ren 0006, Liqun Deng, Zhou Zhao, 
HiFiDenoise: High-Fidelity Denoising Text to Speech with Adversarial Networks.

NeurIPS2022 Rongjie Huang, Yi Ren 0006, Jinglin Liu, Chenye Cui, Zhou Zhao, 
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech.

#78  | Ariya Rastrow | DBLP Google Scholar  
By venueInterspeech: 20ICASSP: 17ACL-Findings: 2ICML: 1KDD: 1
By year2024: 52023: 102022: 72021: 122020: 32019: 22018: 2
ISCA sessionsspoken language understanding: 3resource-constrained asr: 3new computational strategies for asr training and inference: 2lexical and language modeling for asr: 1speech recognition: 1spoken dialog systems and conversational analysis: 1applications in transcription, education and learning: 1neural network training methods for asr: 1multi- and cross-lingual asr, other topics in asr: 1self-supervision and semi-supervision for neural asr training: 1computational resource constrained speech recognition: 1neural networks for language modeling: 1speech synthesis: 1syllabification, rhythm, and voice activity detection: 1language modeling: 1
IEEE keywordsspeech recognition: 14automatic speech recognition: 6natural language processing: 5personalization: 5error analysis: 4transducers: 4adaptation models: 3end to end: 3attention: 3language modeling: 3recurrent neural nets: 3spoken language understanding: 2neural transducer: 2runtime: 2costs: 2computational modeling: 2contextual biasing: 2semantics: 2inference optimization: 2decoding: 2second pass rescoring: 2optimisation: 2oral communication: 1catalysts: 1task oriented dialogue: 1self supervised learning: 1task analysis: 1speech enhancement: 1zero shot learning: 1upper bound: 1robustness: 1lattices: 1question answering (information retrieval): 1in context learning: 1large language models. asr confusion networks: 1max margin: 1end to end speech recognition models: 1sequence discriminative criterion: 1minimum word error rate training: 1conformer: 1contact name recognition: 1logic gates: 1context: 1transfer learning: 1adaptation: 1production: 1data collection: 1data models: 1fine tuning: 1measurement uncertainty: 1fuses: 1pronunciation: 1rnn t: 1technological innovation: 1self learning: 1performance evaluation: 1recurrent neural networks: 1federated learning: 1weak supervision: 1privacy: 1switches: 1wake word spotting: 1neural biasing: 1computer architecture: 1cache storage: 1streaming: 1latency: 1bert: 1pretrained model: 1minimum wer training: 1masked language model: 1degradation: 1rescoring: 1signal processing algorithms: 1shallow fusion: 1inference algorithms: 1accent invariance: 1domain adversarial training: 1rnn transducer: 1end to end asr: 1multi accent asr: 1domain adaptation: 1audio signal processing: 1recurrent neural network transducer (rnn t): 1on device speech recognition: 1multilingual: 1joint modeling: 1recurrent neural network transducer: 1code switching: 1language identification: 1reinforce: 1multitask training: 1neural interfaces: 1entropy: 1minimum word error rate: 1hidden markov models: 1
Most publications (all venues) at2021: 172023: 162022: 82020: 82024: 7


Recent publications

ICASSP2024 David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn Hoffmeister, 
Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition.

ICASSP2024 Kevin Everson, Yile Gu, Chao-Han Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-Yi Lee, Ariya Rastrow, Andreas Stolcke, 
Towards ASR Robust Spoken Language Understanding Through in-Context Learning with Word Confusion Networks.

ICASSP2024 Rupak Vignesh Swaminathan, Grant P. Strimel, Ariya Rastrow, Sri Harish Mallidi, Kai Zhen, Hieu Duy Nguyen, Nathan Susanj, Athanasios Mouchtaris, 
Max-Margin Transducer Loss: Improving Sequence-Discriminative Training Using a Large-Margin Learning Strategy.

ICML2024 Hitesh Tulsiani, David M. Chan, Shalini Ghosh, Garima Lalwani, Prabhat Pandey, Ankish Bansal, Sri Garimella, Ariya Rastrow, Björn Hoffmeister, 
An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems.

ACL-Findings2024 Aditya Gourav, Jari Kolehmainen, Prashanth Gurunath Shivakumar, Yile Gu, Grant P. Strimel, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko, 
Multi-Modal Retrieval For Large Language Model Based Speech Recognition.

ICASSP2023 Anastasios Alexandridis, Kanthashree Mysore Sathyendra, Grant P. Strimel, Feng-Ju Chang, Ariya Rastrow, Nathan Susanj, Athanasios Mouchtaris, 
Gated Contextual Adapters For Selective Contextual Biasing In Neural Transducers.

ICASSP2023 David M. Chan, Shalini Ghosh, Ariya Rastrow, Björn Hoffmeister, 
Domain Adaptation with External Off-Policy Acoustic Catalogs for Scalable Contextual End-to-End Automated Speech Recognition.

ICASSP2023 Rahul Pandey, Roger Ren, Qi Luo, Jing Liu, Ariya Rastrow, Ankur Gandhe, Denis Filimonov, Grant P. Strimel, Andreas Stolcke, Ivan Bulyko, 
Procter: Pronunciation-Aware Contextual Adapter For Personalized Speech Recognition In Neural Transducers.

ICASSP2023 Milind Rao, Gopinath Chennupati, Gautam Tiwari, Anit Kumar Sahu, Anirudh Raju, Ariya Rastrow, Jasha Droppo, 
Federated Self-Learning with Weak Supervision for Speech Recognition.

ICASSP2023 Saumya Y. Sahai, Jing Liu, Thejaswi Muniyappa, Kanthashree Mysore Sathyendra, Anastasios Alexandridis, Grant P. Strimel, Ross McGowan, Ariya Rastrow, Feng-Ju Chang, Athanasios Mouchtaris, Siegfried Kunzmann, 
Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition.

Interspeech2023 Denis Filimonov, Prabhat Pandey, Ariya Rastrow, Ankur Gandhe, Andreas Stolcke, 
Streaming Speech-to-Confusion Network Speech Recognition.

Interspeech2023 Yile Gu, Prashanth Gurunath Shivakumar, Jari Kolehmainen, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko, 
Scaling Laws for Discriminative Speech Recognition Rescoring Models.

Interspeech2023 Jari Kolehmainen, Yile Gu, Aditya Gourav, Prashanth Gurunath Shivakumar, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko, 
Personalization for BERT-based Discriminative Speech Recognition Rescoring.

Interspeech2023 Andreas Schwarz, Di He 0004, Maarten Van Segbroeck, Mohammed Hethnawi, Ariya Rastrow
Personalized Predictive ASR for Latency Reduction in Voice Assistants.

Interspeech2023 Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko, 
Distillation Strategies for Discriminative Speech Recognition Rescoring.

ICASSP2022 Anastasios Alexandridis, Grant P. Strimel, Ariya Rastrow, Pavel Kveton, Jon Webb, Maurizio Omologo, Siegfried Kunzmann, Athanasios Mouchtaris, 
Caching Networks: Capitalizing on Common Speech for ASR.

ICASSP2022 Liyan Xu, Yile Gu, Jari Kolehmainen, Haidar Khan, Ankur Gandhe, Ariya Rastrow, Andreas Stolcke, Ivan Bulyko, 
RescoreBERT: Discriminative Speech Recognition Rescoring With Bert.

Interspeech2022 Phani Sankar Nidadavolu, Na Xu, Nick Jutila, Ravi Teja Gadde, Aswarth Abhilash Dara, Joseph Savold, Sapan Patel, Aaron Hoff, Veerdhawal Pande, Kevin Crews, Ankur Gandhe, Ariya Rastrow, Roland Maas, 
RefTextLAS: Reference Text Biased Listen, Attend, and Spell Model For Accurate Reading Evaluation.

Interspeech2022 Anirudh Raju, Milind Rao, Gautam Tiwari, Pranav Dheram, Bryan Anderson, Zhe Zhang, Chul Lee, Bach Bui, Ariya Rastrow
On joint training with interfaces for spoken language understanding.

Interspeech2022 Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian John King, Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel, 
Compute Cost Amortized Transformer for Streaming ASR.

#79  | Carlos Busso | DBLP Google Scholar  
By venueInterspeech: 17ICASSP: 14TASLP: 5SpeechComm: 4
By year2024: 52023: 112022: 52021: 42020: 52019: 52018: 5
ISCA sessionsspeech emotion recognition: 5paralinguistics: 2emotion modeling: 2(multimodal) speech emotion recognition: 1emotion and sentiment analysis: 1voice activity detection: 1computational paralinguistics: 1emotion modeling and analysis: 1speaker state and trait: 1emotion recognition and analysis: 1syllabification, rhythm, and voice activity detection: 1
IEEE keywordsspeech recognition: 17emotion recognition: 17speech emotion recognition: 13adaptation models: 4noisy speech: 3regression analysis: 3speech enhancement: 2feature selection: 2noise measurement: 2recording: 2task analysis: 2representation learning: 2self supervised learning: 2databases: 2audiovisual emotion recognition: 2ladder networks: 2transfer learning: 2domain adaptation: 2preference learning: 2time continuous emotional traces: 1dynamic speech emotion recognition: 1predictive models: 1speech: 1unsupervised domain adaptation: 1speaker embeddings: 1multitasking: 1multi task learning: 1contrastive learning: 1clustering: 1emotion rankers: 1sequence to sequence modeling: 1chunk level segmentation: 1hidden markov models: 1annotations: 1focusing: 1visualization: 1contrastive teacher student learning: 1chunk level modeling: 1lexical information: 1data segmentation approach: 1timing: 1performance gain: 1robustness: 1analytical models: 1cross lingual: 1phonetics: 1modulation: 1soft label learning: 1multi label learning: 1natural language processing: 1distribution label learning: 1auxiliary networks: 1neural net architecture: 1multimodal fusion: 1transformers: 1shared losses: 1audio visual systems: 1acoustic feature: 1acoustic noise: 1disentangled representation learning: 1audio generation: 1guided representation learning: 1audio signal processing: 1and generative adversarial neural network: 1signal representation: 1semi supervised learning (ssl): 1semisupervised learning: 1labeling: 1speech emotion recognition (ser): 1unsupervised clusters: 1signal denoising: 1semi supervised emotion recognition: 1reject option.: 1monte carlo dropout: 1activation functions: 1monte carlo methods: 1pattern classification: 1curriculum learning: 1inter evaluator agreement: 1information retrieval: 1emotion retrieval: 1ranking: 1perception: 1speech synthesis: 1triplet loss: 1
Most publications (all venues) at2023: 242017: 182016: 172024: 162022: 15


Recent publications

SpeechComm2024 Wei-Cheng Lin, Carlos Busso
Deep temporal clustering features for speech emotion recognition.

TASLP2024 Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso
Selective Acoustic Feature Enhancement for Speech Emotion Recognition With Noisy Speech.

ICASSP2024 Luz Martinez-Lucas, Carlos Busso
Dynamic Speech Emotion Recognition Using A Conditional Neural Process.

ICASSP2024 Abinay Reddy Naini, Mary A. Kohler, Elizabeth Richerson, Donita Robinson, Carlos Busso
Generalization of Self-Supervised Learning-Based Representations for Cross-Domain Speech Emotion Recognition.

ICASSP2024 Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman, 
Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition.

SpeechComm2023 Andrea Vidal, Carlos Busso
Multimodal attention for lip synthesis using conditional generative adversarial networks.

TASLP2023 Wei-Cheng Lin, Carlos Busso
Sequential Modeling by Leveraging Non-Uniform Distribution of Speech Emotion.

ICASSP2023 Lucas Goncalves, Carlos Busso
Learning Cross-Modal Audiovisual Representations with Ladder Networks for Emotion Recognition.

ICASSP2023 Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso
Adapting a Self-Supervised Speech Representation for Noisy Speech Emotion Recognition by Using Contrastive Teacher-Student Learning.

ICASSP2023 Wei-Cheng Lin, Carlos Busso
Role of Lexical Boundary Information in Chunk-Level Segmentation for Speech Emotion Recognition.

ICASSP2023 Abinay Reddy Naini, Mary A. Kohler, Carlos Busso
Unsupervised Domain Adaptation for Preference Learning Based Speech Emotion Recognition.

ICASSP2023 Shreya G. Upadhyay, Luz Martinez-Lucas, Bo-Hao Su, Wei-Cheng Lin, Woan-Shiuan Chien, Ya-Tse Wu, William Katz, Carlos Busso, Chi-Chun Lee, 
Phonetic Anchor-Based Transfer Learning to Facilitate Unsupervised Cross-Lingual Speech Emotion Recognition.

Interspeech2023 Huang-Cheng Chou, Lucas Goncalves, Seong-Gyun Leem, Chi-Chun Lee, Carlos Busso
The Importance of Calibration: Rethinking Confidence and Performance of Speech Multi-label Emotion Classifiers.

Interspeech2023 Nicolás Grágeda, Eduardo Alvarado, Rodrigo Mahú, Carlos Busso, Néstor Becerra Yoma, 
Distant Speech Emotion Recognition in an Indoor Human-robot Interaction Scenario.

Interspeech2023 Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso
Computation and Memory Efficient Noise Adaptation of Wav2Vec2.0 for Noisy Speech Emotion Recognition with Skip Connection Adapters.

Interspeech2023 Abinay Reddy Naini, Ali N. Salman, Carlos Busso
Preference Learning Labels by Anchoring on Consecutive Annotations.

ICASSP2022 Huang-Cheng Chou, Wei-Cheng Lin, Chi-Chun Lee, Carlos Busso
Exploiting Annotators' Typed Description of Emotion Perception to Maximize Utilization of Ratings for Speech Emotion Recognition.

ICASSP2022 Lucas Goncalves, Carlos Busso
AuxFormer: Robust Approach to Audiovisual Emotion Recognition.

ICASSP2022 Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso
Not All Features are Equal: Selection of Robust Features for Speech Emotion Recognition in Noisy Environments.

Interspeech2022 Huang-Cheng Chou, Chi-Chun Lee, Carlos Busso
Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier.

#80  | Tomohiro Tanaka | DBLP Google Scholar  
By venueInterspeech: 27ICASSP: 12TASLP: 1
By year2023: 112022: 52021: 102020: 62019: 62018: 2
ISCA sessionsspeech recognition: 2analysis of neural speech representations: 1spoken dialog systems and conversational analysis: 1end-to-end asr: 1spoken language understanding, summarization, and information retrieval: 1speech coding and enhancement: 1speech representation: 1multi-, cross-lingual and other topics in asr: 1single-channel speech enhancement: 1novel models and training methods for asr: 1spoken language processing: 1voice activity detection and keyword spotting: 1neural network training methods for asr: 1streaming for asr/rnn transducers: 1search/decoding techniques and confidence measures for asr: 1applications in transcription, education and learning: 1training strategies for asr: 1asr neural network architectures and training: 1spoken language understanding: 1conversational systems: 1model training for asr: 1dialogue speech understanding: 1nn architectures for asr: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1selected topics in neural speech processing: 1asr systems and technologies: 1
IEEE keywordsspeech recognition: 9neural network: 4natural language processing: 4automatic speech recognition: 3end to end: 3recurrent neural nets: 3self supervised learning: 2task analysis: 2recurrent neural network transducer: 2knowledge distillation: 2data models: 1speech representation: 1focusing: 1analytical models: 1language dependency: 1low resource: 1training data: 1end to end speech summarization: 1measurement: 1synthetic data augmentation: 1how2 dataset: 1multi modal data augmentation: 1neural transducer: 1recurrent neural networks: 1robustness: 1linguistics: 1decoding: 1scheduled sampling: 1buildings: 1multilingual: 1representation learning: 1transformers: 1cross lingual: 1self supervised speech representation learning: 1attention based decoder: 1sequence to sequence pre training: 1text analysis: 1language translation: 1pointer generator networks: 1spoken text normalization: 1blind source separation: 1audio signal processing: 1speech separation: 1audio visual: 1and cross modal: 1large context end to end automatic speech recognition: 1transformer: 1hierarchical encoder decoder: 1entropy: 1whole network pre training: 1synchronisation: 1autoregressive processes: 1intelligent robots: 1zero resource word segmentation: 1data acquisition: 1spoken language acquisition: 1word processing: 1unsupervised learning: 1reinforcement learning: 1connectionist temporal classification: 1probability: 1attention weight: 1speech codecs: 1cloud computing: 1covariance matrix adaptation evolution strategy (cma es): 1multi objective optimization: 1pareto optimisation: 1genetic algorithm: 1hidden markov models: 1parallel processing: 1deep neural network (dnn): 1evolutionary computation: 1end to end automatic speech recognition: 1speech coding: 1attention based encoder decoder: 1hierarchical recurrent encoder decoder: 1
Most publications (all venues) at2021: 162023: 152022: 112019: 112020: 9


Recent publications

ICASSP2023 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka
Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models.

ICASSP2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura, 
Leveraging Large Text Corpora For End-To-End Speech Summarization.

ICASSP2023 Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, 
Improving Scheduled Sampling for Neural Transducer-Based ASR.

ICASSP2023 Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Hiroshi Sato, Taiga Yamane, Takanori Ashihara, Kohei Matsuura, Takafumi Moriya, 
Leveraging Language Embeddings for Cross-Lingual Self-Supervised Speech Representation Learning.

Interspeech2023 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma, 
SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

Interspeech2023 Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka
Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer.

Interspeech2023 Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, Saki Mizuno, Nobukatsu Hojo, 
Transcribing Speech as Spoken and Written Dual Text Using an Autoregressive Model.

Interspeech2023 Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando, 
End-to-End Joint Target and Non-Target Speakers ASR.

Interspeech2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, 
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization.

Interspeech2023 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami, 
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data.

Interspeech2023 Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo, 
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss.

ICASSP2022 Takafumi Moriya, Takanori Ashihara, Atsushi Ando, Hiroshi Sato, Tomohiro Tanaka, Kohei Matsuura, Ryo Masumura, Marc Delcroix, Takahiro Shinozaki, 
Hybrid RNN-T/Attention-Based Streaming ASR with Triggered Chunkwise Attention and Dual Internal Language Model Integration.

Interspeech2022 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka
Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models.

Interspeech2022 Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando, 
End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training.

Interspeech2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, 
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations.

Interspeech2022 Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, 
Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks.

ICASSP2021 Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura, 
MAPGN: Masked Pointer-Generator Network for Sequence-to-Sequence Pre-Training.

ICASSP2021 Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura, 
Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss.

ICASSP2021 Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, 
Hierarchical Transformer-Based Large-Context End-To-End ASR with Large-Context Knowledge Distillation.

ICASSP2021 Takafumi Moriya, Takanori Ashihara, Tomohiro Tanaka, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Yusuke Ijima, Ryo Masumura, Yusuke Shinohara, 
Simpleflat: A Simple Whole-Network Pre-Training Approach for RNN Transducer-Based End-to-End Speech Recognition.

#81  | Xu Tan 0003 | DBLP Google Scholar  
By venueICASSP: 10Interspeech: 9NeurIPS: 5ICLR: 4ICML: 3AAAI: 3ACL: 2KDD: 2EMNLP-Findings: 1TASLP: 1
By year2024: 42023: 52022: 82021: 132020: 62019: 4
ISCA sessionsspeech synthesis: 6voice conversion and adaptation: 1multi- and cross-lingual asr, other topics in asr: 1singing and multimodal synthesis: 1
IEEE keywordstext to speech: 7speech synthesis: 6medical image processing: 2text analysis: 2speaker recognition: 2speech intelligibility: 2speech recognition: 2style control: 1prompt: 1benchmark testing: 1decoding: 1image synthesis: 1task analysis: 1iterative methods: 1fast sampling: 1probability: 1image denoising: 1vocoders: 1denoising diffusion probabilistic models: 1optimisation: 1vocoder: 1speech to animation: 1mixture of experts: 1transformer: 1computer animation: 1phonetic posteriorgrams: 1pre training: 1data reduction: 1speech quality assessment: 1correlation methods: 1mos prediction: 1mean bias network: 1sensitivity analysis: 1video signal processing: 1search problems: 1neural architecture search: 1fast: 1lightweight: 1autoregressive processes: 1data augmentation: 1low resource: 1mixup: 1adaptation: 1untranscribed data: 1signal reconstruction: 1frame level condition: 1speech enhancement: 1noisy speech: 1signal denoising: 1denoise: 1automatic speech recognition: 1reproducibility of results: 1open source: 1supervised learning: 1open source software: 1end to end: 1error propagation: 1accuracy drop: 1language characteristic: 1natural language processing: 1sequence generation: 1
Most publications (all venues) at2023: 362024: 272022: 242021: 242019: 19

Affiliations
Microsoft Research Asia, Beijing, China

Recent publications

ICML2024 Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan 0003, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu 0001, Tao Qin 0001, Xiangyang Li 0001, Wei Ye 0004, Shikun Zhang, Jiang Bian 0002, Lei He 0005, Jinyu Li 0001, Sheng Zhao, 
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

ICML2024 Dongchao Yang, Jinchuan Tian, Xu Tan 0003, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian 0002, Zhou Zhao, Xixin Wu, Helen M. Meng, 
UniAudio: Towards Universal Audio Generation with Large Language Models.

ICLR2024 Yichong Leng, Zhifang Guo, Kai Shen, Zeqian Ju, Xu Tan 0003, Eric Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He 0005, Xiangyang Li 0001, Sheng Zhao, Tao Qin 0001, Jiang Bian 0002, 
PromptTTS 2: Describing and Generating Voices with Text Prompt.

ICLR2024 Kai Shen, Zeqian Ju, Xu Tan 0003, Eric Liu, Yichong Leng, Lei He 0005, Tao Qin 0001, Sheng Zhao, Jiang Bian 0002, 
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.

ICASSP2023 Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, Xu Tan 0003
Prompttts: Controllable Text-To-Speech With Text Descriptions.

Interspeech2023 Yujia Xiao, Shaofei Zhang, Xi Wang 0016, Xu Tan 0003, Lei He 0005, Sheng Zhao, Frank K. Soong, Tan Lee 0001, 
ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.

NeurIPS2023 Yuancheng Wang, Zeqian Ju, Xu Tan 0003, Lei He 0005, Zhizheng Wu 0001, Jiang Bian 0002, Sheng Zhao, 
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models.

AAAI2023 Yichong Leng, Xu Tan 0003, Wenjie Liu, Kaitao Song, Rui Wang 0028, Xiang-Yang Li 0001, Tao Qin 0001, Edward Lin, Tie-Yan Liu, 
SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition.

AAAI2023 Yihan Wu, Junliang Guo, Xu Tan 0003, Chen Zhang 0020, Bohan Li 0003, Ruihua Song, Lei He 0005, Sheng Zhao, Arul Menezes, Jiang Bian 0002, 
VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing.

ICASSP2022 Zehua Chen, Xu Tan 0003, Ke Wang, Shifeng Pan, Danilo P. Mandic, Lei He 0005, Sheng Zhao, 
Infergrad: Improving Diffusion Models for Vocoder by Considering Inference in Training.

ICASSP2022 Liyang Chen, Zhiyong Wu 0001, Jun Ling, Runnan Li, Xu Tan 0003, Sheng Zhao, 
Transformer-S2A: Robust and Efficient Speech-to-Animation.

ICASSP2022 Guangyan Zhang, Yichong Leng, Daxin Tan, Ying Qin, Kaitao Song, Xu Tan 0003, Sheng Zhao, Tan Lee 0001, 
A Study on the Efficacy of Model Pre-Training In Developing Neural Text-to-Speech System.

Interspeech2022 Yanqing Liu, Ruiqing Xue, Lei He 0005, Xu Tan 0003, Sheng Zhao, 
DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders.

Interspeech2022 Yihan Wu, Xu Tan 0003, Bohan Li 0003, Lei He 0005, Sheng Zhao, Ruihua Song, Tao Qin 0001, Tie-Yan Liu, 
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.

Interspeech2022 Guangyan Zhang, Kaitao Song, Xu Tan 0003, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang 0001, Wei Zhou, Tao Qin 0001, Tan Lee 0001, Sheng Zhao, 
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech.

NeurIPS2022 Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen 0008, Xu Tan 0003, Danilo P. Mandic, Lei He 0005, Xiangyang Li 0001, Tao Qin 0001, Sheng Zhao, Tie-Yan Liu, 
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis.

ACL2022 Yi Ren 0006, Xu Tan 0003, Tao Qin 0001, Zhou Zhao, Tie-Yan Liu, 
Revisiting Over-Smoothness in Text to Speech.

ICASSP2021 Yichong Leng, Xu Tan 0003, Sheng Zhao, Frank K. Soong, Xiang-Yang Li 0001, Tao Qin 0001, 
MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network.

ICASSP2021 Renqian Luo, Xu Tan 0003, Rui Wang 0028, Tao Qin 0001, Jinzhu Li, Sheng Zhao, Enhong Chen, Tie-Yan Liu, 
Lightspeech: Lightweight and Fast Text to Speech with Neural Architecture Search.

ICASSP2021 Linghui Meng 0001, Jin Xu 0010, Xu Tan 0003, Jindong Wang 0001, Tao Qin 0001, Bo Xu 0002, 
MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition.

#82  | Hsin-Min Wang | DBLP Google Scholar  
By venueInterspeech: 22ICASSP: 9TASLP: 7ICLR: 1
By year2024: 12023: 72022: 92021: 62020: 82019: 62018: 2
ISCA sessionsspeech, voice, and hearing disorders: 2single-channel speech enhancement: 2source separation: 2speech synthesis: 2speech enhancement: 2speech coding and enhancement: 1the voicemos challenge: 1neural transducers, streaming asr and novel asr models: 1speech intelligibility prediction for hearing-impaired listeners: 1speech enhancement and intelligibility: 1spoken machine translation: 1voice conversion and adaptation: 1noise reduction and intelligibility: 1model training for asr: 1neural techniques for voice conversion and waveform generation: 1speech intelligibility and quality: 1voice conversion: 1
IEEE keywordsspeech enhancement: 7speech recognition: 5predictive models: 3decoding: 3speaker verification: 3convolutional neural nets: 3natural language processing: 3task analysis: 2measurement: 2deep neural network: 2audio signal processing: 2gaussian processes: 2speaker recognition: 2multiprotocol label switching: 13quest: 1knowledge transfer: 1sdi: 1speech quality prediction: 1multitasking: 1speech intelligibility prediction: 1stoi: 1self supervised learning: 1pesq: 1quality assessment: 1parent embedding learning: 1standards: 1interference: 1partial adaptive score normalization: 1data mining: 1self constraint learning: 1phonetic information: 1reconstruction learning: 1phonetics: 1non intrusive speech assessment models: 1acoustic distortion: 1adaptation models: 1psychoacoustic models: 1multi objective learning: 1data privacy: 1low quality data: 1data compression: 1audio visual systems: 1recurrent neural nets: 1asynchronous multimodal learning: 1audio visual: 1sensor fusion: 1non invasive: 1deep learning (artificial intelligence): 1multimodal: 1medical signal processing: 1electromyography: 1biometrics (access control): 1security of data: 1partially fake audio detection: 1anti spoofing: 1audio deep synthesis detection challenge: 1speech synthesis: 1bert: 1language model: 1text analysis: 1orderless nade: 1sample weighting: 1melody harmonization: 1coherence: 1blocked gibbs sampling: 1support vector machines: 1phonotactic language recognition: 1subspace based learning: 1matrix decomposition: 1subspace based representation: 1multichannel speech enhancement: 1distributed microphones: 1fully convolutional network (fcn): 1microphones: 1phase estimation: 1inner ear microphones: 1raw waveform mapping: 1generalizability: 1dynamically sized decision tree: 1decision trees: 1signal denoising: 1deep neural networks: 1ensemble learning: 1time delay neural network: 1statistics pooling: 1statistics: 1multilayer perceptrons: 1articulatory feature: 1speaker identification: 1convolutional neural network: 1regression analysis: 1unsupervised learning: 1deep denoising autoencoder: 1signal classification: 1automatic speech recognition: 1character error rate: 1mean square error methods: 1reinforcement learning: 1
Most publications (all venues) at2017: 242019: 222022: 212021: 202015: 20

Affiliations
Academia Sinica, Taipei, Taiwan
National Taiwan University, Taipei, Taiwan (PhD 1995)

Recent publications

ICASSP2024 Ryandhimas E. Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao 0001, 
Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model.

TASLP2023 Qian-Bei Hong, Chung-Hsien Wu 0001, Hsin-Min Wang
Generalization Ability Improvement of Speaker Representation and Anti-Interference for Speaker Verification.

TASLP2023 Qian-Bei Hong, Chung-Hsien Wu 0001, Hsin-Min Wang
Decomposition and Reorganization of Phonetic Information for Speaker Embedding Learning.

TASLP2023 Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen 0011, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao 0001, 
Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features.

Interspeech2023 Hsin-Hao Chen 0006, Yung-Lun Chien, Ming-Chi Yen, Shu-Wei Tsai, Tai-Shih Chi, Hsin-Min Wang, Yu Tsao 0001, 
Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features.

Interspeech2023 Li-Wei Chen, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao 0001, Hsin-Min Wang
A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech.

Interspeech2023 Yung-Lun Chien, Hsin-Hao Chen 0006, Ming-Chi Yen, Shu-Wei Tsai, Hsin-Min Wang, Yu Tsao 0001, Tai-Shih Chi, 
Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion.

ICLR2023 Chi-Chang Lee, Yu Tsao 0001, Hsin-Min Wang, Chu-Song Chen, 
D4AM: A General Denoising Framework for Downstream Acoustic Models.

TASLP2022 Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao 0001, 
Improved Lite Audio-Visual Speech Enhancement.

ICASSP2022 Kuan-Chen Wang, Kai-Chun Liu, Hsin-Min Wang, Yu Tsao 0001, 
EMGSE: Acoustic/EMG Fusion for Multimodal Speech Enhancement.

ICASSP2022 Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao 0001, Hsin-Min Wang, Helen Meng, 
Partially Fake Audio Detection by Self-Attention-Based Fake Span Discovery.

Interspeech2022 Wen-Chin Huang, Erica Cooper, Yu Tsao 0001, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi, 
The VoiceMOS Challenge 2022.

Interspeech2022 Hung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang
Chain-based Discriminative Autoencoders for Speech Recognition.

Interspeech2022 Chi-Chang Lee, Cheng-Hung Hu, Yu-Chen Lin, Chu-Song Chen, Hsin-Min Wang, Yu Tsao 0001, 
NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling.

Interspeech2022 Fan-Lin Wang, Hung-Shin Lee, Yu Tsao 0001, Hsin-Min Wang
Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks.

Interspeech2022 Ryandhimas Edo Zezario, Fei Chen 0011, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao 0001, 
MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids.

Interspeech2022 Ryandhimas Edo Zezario, Szu-Wei Fu, Fei Chen 0011, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao 0001, 
MTI-Net: A Multi-Target Speech Intelligibility Prediction Model.

ICASSP2021 Wen-Chin Huang, Chia-Hua Wu, Shang-Bao Luo, Kuan-Yu Chen, Hsin-Min Wang, Tomoki Toda, 
Speech Recognition by Simply Fine-Tuning Bert.

ICASSP2021 Chung-En Sun, Yi-Wei Chen, Hung-Shin Lee, Yen-Hsing Chen, Hsin-Min Wang
Melody Harmonization Using Orderless Nade, Chord Balancing, and Blocked Gibbs Sampling.

Interspeech2021 Yao-Fei Cheng, Hung-Shin Lee, Hsin-Min Wang
AlloST: Low-Resource Speech Translation Without Source Transcription.

#83  | Paavo Alku | DBLP Google Scholar  
By venueSpeechComm: 14Interspeech: 12ICASSP: 8TASLP: 5
By year2024: 42023: 72022: 42021: 22020: 62019: 132018: 3
ISCA sessionsvoice conversion for style, accent, and emotion: 2pathological speech analysis: 1health-related speech analysis: 1atypical speech analysis and detection: 1phonetics: 1speech in health: 1neural techniques for voice conversion and waveform generation: 1voice quality characterization for clinical voice assessment: 1speech analysis and representation: 1voice conversion and speech synthesis: 1speech pathology, depression, and medical applications: 1
IEEE keywordsspeech recognition: 4diseases: 3glottal features: 3parkinson's disease: 3task analysis: 3support vector machines: 3speech synthesis: 3vocoders: 3svm: 2mfccs: 2cepstral analysis: 2databases: 2medical diagnosis: 2dnn: 2speech analysis: 2text to speech: 2pathology: 1mel frequency cepstral coefficient: 1phonological features: 1automatic speech assessment: 1speech attributes: 1mathematical models: 1exemplar based: 1random forest: 1sparse representation: 1pipelines: 1non negative least squares: 1dysarthria: 1severity level classification: 1wav2vec 2.0: 1spectrogram: 1vocal intensity: 1oral communication: 1calibration: 1regulation: 1spl: 1training data: 1voice disorders: 1pathological voices: 1crossdatabase evaluation: 1wav2vec: 1reliability: 1recording: 1multilayer perceptrons: 1glottal source estimation: 1iterative methods: 1pattern classification: 1end to end systems: 1filtering theory: 1formant tracking: 1time varying linear prediction: 1weighted linear prediction: 1kalman filters: 1dynamic programming: 1quasi closed phase analysis: 1glottal closure instants: 1excitation source: 1emotion recognition: 1epochs: 1emotions: 1formant modification: 1children speech recognition: 1hidden markov models: 1wavenet: 1glottal source model: 1feedforward neural nets: 1noise robustness: 1f0 estimation: 1spectral analysis: 1data augmentation: 1correlation methods: 1neural vocoding: 1gan: 1inference mechanisms: 1glottal excitation model: 1vocal effort: 1style conversion: 1pulse model in log domain vocoder: 1cyclegan: 1lombard speech: 1
Most publications (all venues) at2012: 192019: 182014: 182016: 162017: 15


Recent publications

SpeechComm2024 Mittapalle Kiran Reddy, Yagnavajjula Madhu Keerthana, Paavo Alku
Classification of functional dysphonia using the tunable Q wavelet transform.

SpeechComm2024 Paavo Alku, Manila Kodali, Laura Laaksonen, Sudarsana Reddy Kadiri, 
AVID: A speech database for machine learning studies on vocal intensity.

SpeechComm2024 Yagnavajjula Madhu Keerthana, Mittapalle Kiran Reddy, Paavo Alku, K. Sreenivasa Rao, Pabitra Mitra, 
Automatic classification of neurological voice disorders using wavelet scattering features.

SpeechComm2024 Farhad Javanmardi, Sudarsana Reddy Kadiri, Paavo Alku
Pre-trained models for detection and severity level classification of dysarthria from speech.

TASLP2023 Yuanyuan Liu 0002, Mittapalle Kiran Reddy, Nelly Penttilä, Tiina Ihalainen, Paavo Alku, Okko Räsänen, 
Automatic Assessment of Parkinson's Disease Using Speech Representations of Phonation and Articulation.

TASLP2023 Mittapalle Kiran Reddy, Paavo Alku
Exemplar-Based Sparse Representations for Detection of Parkinson's Disease From Speech.

ICASSP2023 Farhad Javanmardi, Saska Tirronen, Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku
Wav2vec-Based Detection and Severity Level Classification of Dysarthria From Speech.

ICASSP2023 Manila Kodali, Sudarsana Reddy Kadiri, Laura Laaksonen, Paavo Alku
Automatic Classification of Vocal Intensity Category from Speech.

ICASSP2023 Saska Tirronen, Farhad Javanmardi, Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku
Utilizing Wav2Vec In Database-Independent Voice Disorder Detection.

Interspeech2023 Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku
Severity Classification of Parkinson's Disease from Speech using Single Frequency Filtering-based Features.

Interspeech2023 Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku
Classification of Vocal Intensity Category from Speech using the Wav2vec2 and Whisper Embeddings.

SpeechComm2022 Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Paavo Alku, Mikko Kurimo, 
A formant modification method for improved ASR of children's speech.

SpeechComm2022 Mittapalle Kiran Reddy, Hilla Pohjalainen, Pyry Helkkula, Kasimir Kaitue, Mikko Minkkinen, Heli Tolppanen, Tuomo Nieminen, Paavo Alku
Glottal flow characteristics in vowels produced by speakers with heart failure.

Interspeech2022 Farhad Javanmardi, Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku
Comparing 1-dimensional and 2-dimensional spectral feature representations in voice pathology detection using machine learning and deep learning classifiers.

Interspeech2022 Sudarsana Reddy Kadiri, Farhad Javanmardi, Paavo Alku
Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals.

SpeechComm2021 Krishna Gurugubelli, Anil Kumar Vuppala, N. P. Narendra, Paavo Alku
Duration of the rhotic approximant /ɹ/ in spastic dysarthria of different severity levels.

TASLP2021 N. P. Narendra, Björn W. Schuller, Paavo Alku
The Detection of Parkinson's Disease From Speech Using Voice Source Information.

SpeechComm2020 Sudarsana Reddy Kadiri, Paavo Alku, B. Yegnanarayana 0001, 
Analysis and classification of phonation types in speech and singing voice.

SpeechComm2020 N. P. Narendra, Paavo Alku
Automatic intelligibility assessment of dysarthric speech using glottal parameters.

TASLP2020 Dhananjaya N. Gowda, Sudarsana Reddy Kadiri, Brad H. Story, Paavo Alku
Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals.

#84  | Naoyuki Kanda | DBLP Google Scholar  
By venueInterspeech: 21ICASSP: 16TASLP: 1NAACL-Findings: 1
By year2024: 62023: 82022: 62021: 112020: 22019: 52018: 1
ISCA sessionsspeech recognition: 2robust asr, and far-field/multi-talker asr: 2source separation: 2multi- and cross-lingual asr, other topics in asr: 2multi-talker methods in speech processing: 1other topics in speech recognition: 1novel models and training methods for asr: 1streaming for asr/rnn transducers: 1applications in transcription, education and learning: 1neural network training methods for asr: 1asr neural network architectures: 1training strategies for asr: 1speaker recognition: 1turn management in dialogue: 1far-field speech recognition: 1asr neural network training: 1neural network training strategies for asr: 1
IEEE keywordsspeech recognition: 8error analysis: 7speaker diarization: 5oral communication: 4transformers: 3voice activity detection: 3self supervised learning: 3speech separation: 3multi talker automatic speech recognition: 3speaker recognition: 3natural language processing: 3speech enhancement: 2task analysis: 2speech translation: 2decoding: 2signal processing algorithms: 2ts vad: 2streaming inference: 2computational modeling: 2conversation transcription: 2training data: 2rich transcription: 2speaker counting: 2audio signal processing: 2speaker identification: 2speech removal: 1codes: 1speech generation: 1codecs: 1noise reduction: 1audio text input: 1multi task learning: 1noise suppression: 1target speaker extraction: 1zero shot text to speech: 1speech coding: 1speech editing: 1costs: 1timestamp: 1synchronization: 1real time systems: 1streaming: 1joint: 1clustering algorithms: 1speaker profile: 1speaker counting error: 1pet tsvad: 1token level serialized output training: 1transducers: 1multi talker speech recognition: 1factorized neural transducer: 1text only adaptation: 1adaptation models: 1symbols: 1vocabulary: 1measurement: 1overlapping speech: 1recording: 1data models: 1wavlm: 1multi speaker: 1representation learning: 1focusing: 1microphone arrays: 1geometry: 1microphone array: 1tensors: 1eend eda: 1correlation: 1data simulation: 1conversation analysis: 1analytical models: 1hypothesis stitcher: 1computer architecture: 1bayes methods: 1probability: 1minimum bayes risk training: 1language model: 1attention based encoder decoder: 1recurrent neural network transducer: 1transfer learning: 1pre training: 1spoken language understanding: 1end to end approach: 1filtering theory: 1source separation: 1system fusion: 1acoustic model: 1
Most publications (all venues) at2021: 172024: 112022: 112023: 92019: 7


Recent publications

TASLP2024 Xiaofei Wang 0007, Manthan Thakker, Zhuo Chen 0006, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu 0001, Jinyu Li 0001, Takuya Yoshioka, 
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer.

ICASSP2024 Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li 0001, Yashesh Gaur, 
Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation.

ICASSP2024 Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Midia Yousefi, Takuya Yoshioka, Jian Wu, 
Profile-Error-Tolerant Target-Speaker Voice Activity Detection.

ICASSP2024 Jian Wu 0027, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao 0017, Zhuo Chen 0006, Jinyu Li 0001, 
T-SOT FNT: Streaming Multi-Talker ASR with Text-Only Domain Adaptation Capability.

ICASSP2024 Mu Yang, Naoyuki Kanda, Xiaofei Wang 0009, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li 0001, Takuya Yoshioka, 
Diarist: Streaming Speech Translation with Speaker Diarization.

NAACL-Findings2024 Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu 0001, Dongdong Chen 0001, Yao Qian, Xuemei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao 0004, Yu Shi 0001, Lu Yuan, Takuya Yoshioka, Michael Zeng 0001, Xuedong Huang 0001, 
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data.

ICASSP2023 Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, Sefik Emre Eskimez, 
Speech Separation with Large-Scale Self-Supervised Learning.

ICASSP2023 Zili Huang, Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yiming Wang, Jinyu Li 0001, Takuya Yoshioka, Xiaofei Wang 0009, Peidong Wang, 
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition.

ICASSP2023 Naoyuki Kanda, Jian Wu 0027, Xiaofei Wang 0009, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Vararray Meets T-Sot: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition.

ICASSP2023 Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Takuya Yoshioka, Jian Wu 0027, 
Target Speaker Voice Activity Detection with Transformers and Its Integration with End-To-End Neural Diarization.

ICASSP2023 Muqiao Yang, Naoyuki Kanda, Xiaofei Wang 0009, Jian Wu 0027, Sunit Sivasankaran, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Simulating Realistic Speech Overlaps Improves Multi-Talker ASR.

Interspeech2023 Naoyuki Kanda, Takuya Yoshioka, Yang Liu, 
Factual Consistency Oriented Speech Recognition.

Interspeech2023 Chenda Li, Yao Qian, Zhuo Chen 0006, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng 0001, 
Adapting Multi-Lingual ASR Models for Handling Multiple Talkers.

Interspeech2023 Midia Yousefi, Naoyuki Kanda, Dongmei Wang, Zhuo Chen 0006, Xiaofei Wang 0009, Takuya Yoshioka, 
Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach.

ICASSP2022 Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang 0009, Zhong Meng, Zhuo Chen 0006, Takuya Yoshioka, 
Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR.

Interspeech2022 Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiong Xiao, Zhong Meng, Xiaofei Wang 0009, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings.

Interspeech2022 Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiong Xiao, Zhong Meng, Xiaofei Wang 0009, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Streaming Multi-Talker ASR with Token-Level Serialized Output Training.

Interspeech2022 Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li 0001, Xie Chen 0001, Yu Wu 0012, Yifan Gong 0001, 
Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition.

Interspeech2022 Xiaofei Wang 0009, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka, 
Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation.

Interspeech2022 Wangyou Zhang, Zhuo Chen 0006, Naoyuki Kanda, Shujie Liu 0001, Jinyu Li 0001, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei, 
Separating Long-Form Speech with Group-wise Permutation Invariant Training.

#85  | Alan W. Black | DBLP Google Scholar  
By venue: Interspeech: 22, ICASSP: 13, ACL: 2, EMNLP-Findings: 1, TASLP: 1
By year: 2024: 2, 2023: 4, 2022: 9, 2021: 6, 2020: 5, 2019: 10, 2018: 3
ISCA sessions: spoken language understanding: 3; articulation: 1; low-resource asr development: 1; human speech & signal processing: 1; inclusive and fair speech technologies: 1; speech processing & measurement: 1; cross/multi-lingual and code-switched asr: 1; spoken language processing: 1; multilingual and code-switched asr: 1; speech signal representation: 1; speech and language analytics for mental health: 1; the zero resource speech challenge 2019: 1; speech translation: 1; cross-lingual and multilingual asr: 1; speech annotation and labelling: 1; speech technologies for code-switching in multilingual communities: 1; the interspeech 2019 computational paralinguistics challenge (compare): 1; speech synthesis paradigms and methods: 1; spoken dialogue systems and conversational analysis: 1; the interspeech 2018 computational paralinguistics challenge (compare): 1
IEEE keywords: speech recognition: 8; natural language processing: 6; text analysis: 4; signal processing algorithms: 3; self supervised learning: 2; spoken language understanding: 2; data models: 2; kinematics: 2; linguistics: 2; estimation: 2; multilingual: 2; multilingual speech recognition: 2; speech synthesis: 2; task analysis: 1; benchmark testing: 1; phonetics: 1; unsupervised unit discovery: 1; organizations: 1; cross lingual and multilingual speech processing: 1; transforms: 1; articulatory kinematics: 1; articulatory: 1; representation learning: 1; gestural scores: 1; production systems: 1; factor analysis: 1; time frequency analysis: 1; pseudo wigner ville distribution (pwvd): 1; speech analysis: 1; frequency estimation: 1; classification algorithms: 1; interference: 1; pitch tracking: 1; buildings: 1; production: 1; articulatory inversion: 1; articulatory speech processing: 1; measurement: 1; correlation: 1; behavioral sciences: 1; public domain software: 1; speech based user interfaces: 1; language translation: 1; open source: 1; end to end: 1; speech summarization: 1; long sequence modeling: 1; concept learning: 1; convolutional neural nets: 1; intent: 1; long short term memory: 1; cross lingual: 1; low resource: 1; phone distribution estimation: 1; fitting: 1; ranking models: 1; low resource languages: 1; multilingual speech alignment: 1; multilingual phonetic dataset: 1; low resource speech recognition: 1; automatic speech recognition: 1; human computer interaction: 1; unsupervised learning: 1; image retrieval: 1; image representation: 1; phonology: 1; universal phone recognition: 1; found speech data: 1; natural languages: 1; multilingual language models: 1; ctc based decoding: 1; low resource asr: 1; phoneme level language models: 1
Most publications (all venues) at 2019: 35, 2020: 25, 2021: 24, 2018: 21, 2016: 21


Recent publications

ICASSP2024 Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li 0001, Alan W. Black, Gopala Krishna Anumanchipalli, 
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in Hubert.

ICASSP2024 Cheol Jun Cho, Abdelrahman Mohamed, Alan W. Black, Gopala Krishna Anumanchipalli, 
Self-Supervised Models of Speech Infer Universal Articulatory Kinematics.

ICASSP2023 Jiachen Lian, Alan W. Black, Yijing Lu, Louis Goldstein, Shinji Watanabe 0001, Gopala Krishna Anumanchipalli, 
Articulatory Representation Learning via Joint Factor Analysis and Neural Matrix Factorization.

ICASSP2023 Yisi Liu, Peter Wu, Alan W. Black, Gopala Krishna Anumanchipalli, 
A Fast and Accurate Pitch Estimation Algorithm Based on the Pseudo Wigner-Ville Distribution.

ICASSP2023 Peter Wu, Li-Wei Chen, Cheol Jun Cho, Shinji Watanabe 0001, Louis Goldstein, Alan W. Black, Gopala Krishna Anumanchipalli, 
Speaker-Independent Acoustic-to-Articulatory Speech Inversion.

Interspeech2023 Peter Wu, Tingle Li, Yijing Lu, Yubin Zhang, Jiachen Lian, Alan W. Black, Louis Goldstein, Shinji Watanabe 0001, Gopala Krishna Anumanchipalli, 
Deep Speech Synthesis from MRI-Based Articulatory Representations.

ICASSP2022 Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan 0003, Brian Yan, Ngoc Thang Vu, Alan W. Black, Shinji Watanabe 0001, 
ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet.

ICASSP2022 Roshan Sharma, Shruti Palaskar, Alan W. Black, Florian Metze, 
End-to-End Speech Summarization Using Restricted Self-Attention.

Interspeech2022 Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan W. Black, Shinji Watanabe 0001, 
Two-Pass Low Latency End-to-End Spoken Language Understanding.

Interspeech2022 Xinjian Li, Florian Metze, David R. Mortensen, Alan W. Black, Shinji Watanabe 0001, 
ASR2K: Speech Recognition for Around 2000 Languages without Audio.

Interspeech2022 Jiachen Lian, Alan W. Black, Louis Goldstein, Gopala Krishna Anumanchipalli, 
Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition.

Interspeech2022 Perez Ogayo, Graham Neubig, Alan W. Black
Building African Voices.

Interspeech2022 Peter Wu, Shinji Watanabe 0001, Louis Goldstein, Alan W. Black, Gopala Krishna Anumanchipalli, 
Deep Speech Synthesis from Articulatory Representations.

Interspeech2022 Hemant Yadav, Akshat Gupta, Sai Krishna Rallabandi, Alan W. Black, Rajiv Ratn Shah, 
Intent classification using pre-trained language agnostic embeddings for low resource languages.

EMNLP-Findings2022 Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W. Black, Shinji Watanabe 0001, 
Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models.

ICASSP2021 Akshat Gupta, Xinjian Li, Sai Krishna Rallabandi, Alan W. Black
Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages.

ICASSP2021 Xinjian Li, Juncheng Li 0001, Jiali Yao, Alan W. Black, Florian Metze, 
Phone Distribution Estimation for Low Resource Languages.

ICASSP2021 Xinjian Li, David R. Mortensen, Florian Metze, Alan W. Black
Multilingual Phonetic Dataset for Low Resource Speech Recognition.

Interspeech2021 Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan 0002, Siddharth Dalmia, Florian Metze, Shinji Watanabe 0001, Alan W. Black
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding.

Interspeech2021 Xinjian Li, Juncheng Li 0001, Florian Metze, Alan W. Black
Hierarchical Phone Recognition with Compositional Phonetics.

#86  | Shiliang Zhang | DBLP Google Scholar  
By venue: Interspeech: 22, ICASSP: 15, ACL-Findings: 1, EMNLP: 1
By year: 2024: 8, 2023: 11, 2022: 6, 2021: 2, 2020: 5, 2019: 5, 2018: 2
ISCA sessions: speech recognition: 6; analysis of speech and audio signals: 1; multi-talker methods in speech processing: 1; speech activity detection and modeling: 1; neural transducers, streaming asr and novel asr models: 1; other topics in speech recognition: 1; resource-constrained asr: 1; speaker diarization: 1; single-channel speech enhancement: 1; evaluation of speech technology systems and methods for resource construction and annotation: 1; asr neural network architectures: 1; streaming asr: 1; speech and audio classification: 1; cross-lingual and multilingual asr: 1; spoken term detection, confidence measure, and end-to-end speech recognition: 1; topics in speech recognition: 1; sequence models for asr: 1
IEEE keywords: speech recognition: 7; audio visual speech recognition: 4; data models: 3; speech enhancement: 3; task analysis: 2; computational modeling: 2; web conferencing: 2; benchmark testing: 2; visualization: 2; streaming media: 2; robustness: 2; robust speech recognition: 2; predictive models: 2; speaker diarization: 2; microphone arrays: 2; alimeeting: 2; meeting transcription: 2; decoder only transformer: 1; transformers: 1; loss masking: 1; kl divergence loss: 1; discrete token based asr: 1; speech discretization: 1; vocabulary: 1; superluminescent diodes: 1; freqcodec: 1; funcodec: 1; sound stream: 1; complexity theory: 1; encodec: 1; frequency domain analysis: 1; speech coding: 1; speech codecs: 1; speech codec: 1; speech emotion recognition: 1; data mining: 1; emotion recognition: 1; text generation: 1; data augmentation: 1; self supervised learning: 1; synthetic data: 1; speech synthesis: 1; stability analysis: 1; source coding: 1; contextualized asr: 1; non autoregressive asr: 1; end to end asr: 1; hotword customization: 1; decoding: 1; slides: 1; corpus: 1; text recognition: 1; pipelines: 1; degradation: 1; costs: 1; multi modal processing: 1; computational efficiency: 1; cross layer design: 1; down up sampling: 1; biasing prediction: 1; multi modal: 1; adaptation models: 1; contextual adaptation: 1; real time systems: 1; joint training: 1; noise measurement: 1; acoustic distortion: 1; residual noise: 1; speech distortion: 1; refine network: 1; error analysis: 1; two stage framework: 1; overlap aware modeling: 1; end to end neural diarization: 1; encoding: 1; simple object access protocol: 1; text to speech: 1; prosody modeling: 1; pre training: 1; focusing: 1; shape: 1; headphones: 1; automatic speech recognition: 1; meeting scenario: 1; speak diarization: 1; arrays: 1; multi speaker asr: 1; natural language processing: 1; speaker recognition: 1; m2met: 1; iterative methods: 1; phoneme aware network: 1; speech intelligibility: 1; phonetic posteriorgram: 1; monaural speech enhancement: 1; mandarin speech recognition: 1; modeling units: 1; hybrid character syllable: 1; connectionist temporal classification: 1; dfsmn ctc smbr: 1; audio visual systems: 1; dropout: 1; bimodal df smn: 1; multi condition training: 1
Most publications (all venues) at 2023: 38, 2024: 32, 2022: 31, 2021: 17, 2018: 16


Recent publications

ICASSP2024 Qian Chen 0003, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang 0003, 
Loss Masking Is Not Needed In Decoder-Only Transformer For Discrete-Token-Based ASR.

ICASSP2024 Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng, 
FunCodec: A Fundamental, Reproducible and Integrable Open-Source Toolkit for Neural Speech Codec.

ICASSP2024 Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen 0003, Shiliang Zhang, Xie Chen 0001, 
Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition.

ICASSP2024 Xian Shi, Yexin Yang, Zerui Li, Yanni Chen, Zhifu Gao, Shiliang Zhang
SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability.

ICASSP2024 Haoxu Wang, Fan Yu, Xian Shi, Yuezhang Wang, Shiliang Zhang, Ming Li, 
SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus.

ICASSP2024 Fan Yu, Haoxu Wang, Ziyang Ma, Shiliang Zhang
Hourglass-AVSR: Down-Up Sampling-Based Computational Efficiency Model for Audio-Visual Speech Recognition.

ICASSP2024 Fan Yu, Haoxu Wang, Xian Shi, Shiliang Zhang
LCB-Net: Long-Context Biasing for Audio-Visual Speech Recognition.

ACL-Findings2024 Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen 0001, 
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation.

ICASSP2023 Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang 0001, Xiaobao Wang, Shiliang Zhang
Speech and Noise Dual-Stream Spectrogram Refine Network With Speech Distortion Loss For Robust Speech Recognition.

ICASSP2023 Jiaming Wang, Zhihao Du, Shiliang Zhang
TOLD: a Novel Two-Stage Overlap-Aware Framework for Speaker Diarization.

Interspeech2023 Keyu An, Xian Shi, Shiliang Zhang
BAT: Boundary aware transducer for memory-efficient and low-latency ASR.

Interspeech2023 Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Shiliang Zhang
FunASR: A Fundamental End-to-End Speech Recognition Toolkit.

Interspeech2023 Yue Gu, Zhihao Du, Shiliang Zhang, Qian Chen 0003, Jiqing Han 0001, 
Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition.

Interspeech2023 Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang 0001, Shiliang Zhang
Rethinking the Visual Cues in Audio-Visual Speaker Extraction.

Interspeech2023 Yuhao Liang, Fan Yu, Yangze Li, Pengcheng Guo, Shiliang Zhang, Qian Chen 0003, Lei Xie 0001, 
BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR.

Interspeech2023 Mohan Shi, Zhihao Du, Qian Chen 0003, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang 0042, Li-Rong Dai 0001, 
CASA-ASR: Context-Aware Speaker-Attributed ASR.

Interspeech2023 Xian Shi, Haoneng Luo, Zhifu Gao, Shiliang Zhang, Zhijie Yan, 
Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System.

Interspeech2023 Mohan Shi, Yuchun Shu, Lingyun Zuo, Qian Chen 0003, Shiliang Zhang, Jie Zhang 0042, Li-Rong Dai 0001, 
Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction.

Interspeech2023 Xiaohuan Zhou, Jiaming Wang, Zeyu Cui, Shiliang Zhang, Zhijie Yan, Jingren Zhou, Chang Zhou, 
MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for speech recognition.

ICASSP2022 Yi Ren 0006, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen 0003, Zhijie Yan, Zhou Zhao, 
Prosospeech: Enhancing Prosody with Quantized Vector Pre-Training in Text-To-Speech.

#87  | Xuankai Chang | DBLP Google Scholar  
By venue: Interspeech: 18, ICASSP: 14, TASLP: 2, ACL: 2, ICML: 1, AAAI: 1
By year: 2024: 8, 2023: 7, 2022: 9, 2021: 6, 2020: 5, 2019: 2, 2018: 1
ISCA sessions: speech recognition: 3; non-autoregressive sequential modeling for speech processing: 2; multi-talker methods in speech processing: 1; speech, voice, and hearing disorders: 1; spoken language understanding: 1; robust asr, and far-field/multi-talker asr: 1; speech enhancement and intelligibility: 1; speech synthesis: 1; low-resource speech recognition: 1; speech signal analysis and representation: 1; asr neural network architectures and training: 1; neural networks for language modeling: 1; noise robust and distant speech recognition: 1; asr neural network training: 1; robust speech recognition: 1
IEEE keywords: speech recognition: 11; self supervised learning: 6; task analysis: 4; computational modeling: 3; representation learning: 3; hubert: 3; natural language processing: 3; decoding: 3; spoken language understanding: 2; end to end: 2; topic model: 2; data models: 2; vocabulary: 2; transfer learning: 2; multitasking: 2; error analysis: 2; hidden markov models: 2; ctc: 2; transformer: 2; biological system modeling: 1; task generalization: 1; evaluation: 1; benchmark: 1; benchmark testing: 1; protocols: 1; analytical models: 1; foundation model: 1; speech: 1; redundancy: 1; discrete units: 1; speech translation: 1; correlation: 1; systematics: 1; semantics: 1; multitask: 1; spoken language model: 1; speech synthesis: 1; transducers: 1; st: 1; multi tasking: 1; mt: 1; automatic speech recognition: 1; adapter: 1; adaptation models: 1; lda: 1; unsupervised: 1; wavlm: 1; predictive models: 1; probabilistic logic: 1; text analysis: 1; public domain software: 1; speech based user interfaces: 1; language translation: 1; open source: 1; gtc: 1; pattern classification: 1; graph theory: 1; multi speaker overlapped speech: 1; end to end asr: 1; wfst: 1; bic: 1; interactive systems: 1; unit based language model: 1; acoustic unit discovery: 1; noise robustness: 1; speech enhancement: 1; joint modeling: 1; natural languages: 1; transformers: 1; audio captioning: 1; aac: 1; asr: 1; rich transcription: 1; speaker identification: 1; hypothesis stitcher: 1; computer architecture: 1; end to end speech processing: 1; conformer: 1; curriculum learning: 1; end to end model: 1; multi talker mixed speech recognition: 1; knowledge distillation: 1; speaker recognition: 1; permutation invariant training: 1; overlapped speech recognition: 1; recurrent neural nets: 1; speech separation: 1; reverberation: 1; neural beamforming: 1; end to end speech recognition: 1; multi speaker speech recognition: 1; cocktail party problem: 1; attention mechanism: 1
Most publications (all venues) at 2023: 17, 2024: 16, 2022: 16, 2021: 10, 2020: 7


Recent publications

TASLP2024 Shu-Wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li 0001, Abdelrahman Mohamed, Shinji Watanabe 0001, Hung-yi Lee, 
A Large-Scale Evaluation of Speech Foundation Models.

ICASSP2024 Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon Jung, Yichen Lu, Soumi Maiti, Roshan S. Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe 0001, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang, 
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.

ICASSP2024 Takashi Maekaku, Jiatong Shi, Xuankai Chang, Yuya Fujita, Shinji Watanabe 0001, 
Hubertopic: Enhancing Semantic Representation of Hubert Through Self-Supervision Utilizing Topic Model.

ICASSP2024 Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-Weon Jung, Xuankai Chang, Shinji Watanabe 0001, 
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks.

ICASSP2024 Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe 0001, 
Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing.

ICML2024 Dongchao Yang, Jinchuan Tian, Xu Tan 0003, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian 0002, Zhou Zhao, Xixin Wu, Helen M. Meng, 
UniAudio: Towards Universal Audio Generation with Large Language Models.

AAAI2024 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren 0006, Yuexian Zou, Zhou Zhao, Shinji Watanabe 0001, 
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.

ACL2024 Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang 0001, Ziyue Jiang 0001, Xuankai Chang, Jiatong Shi, Chao Weng, Zhou Zhao, Dong Yu 0001, 
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.

ICASSP2023 Junwei Huang, Karthik Ganesan 0003, Soumi Maiti, Young Min Kim, Xuankai Chang, Paul Liang, Shinji Watanabe 0001, 
FindAdaptNet: Find and Insert Adapters by Learned Layer Importance.

ICASSP2023 Takashi Maekaku, Yuya Fujita, Xuankai Chang, Shinji Watanabe 0001, 
Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model.

Interspeech2023 Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe 0001, 
Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning.

Interspeech2023 William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, Shinji Watanabe 0001, 
Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute.

Interspeech2023 Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey, 
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition.

Interspeech2023 Jiatong Shi, Dan Berrebbi, William Chen, En-Pei Hu, Wei-Ping Huang, Ho-Lam Chung, Xuankai Chang, Shang-Wen Li 0001, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe 0001, 
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark.

Interspeech2023 Jiyang Tang, William Chen, Xuankai Chang, Shinji Watanabe 0001, Brian MacWhinney, 
A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning.

ICASSP2022 Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan 0003, Brian Yan, Ngoc Thang Vu, Alan W. Black, Shinji Watanabe 0001, 
ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet.

ICASSP2022 Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe 0001, Jonathan Le Roux, 
Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR.

ICASSP2022 Takashi Maekaku, Xuankai Chang, Yuya Fujita, Shinji Watanabe 0001, 
An Exploration of Hubert with Large Number of Cluster Units and Model Assessment Using Bayesian Information Criterion.

ICASSP2022 Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe 0001, 
Joint Speech Recognition and Audio Captioning.

Interspeech2022 Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan W. Black, Shinji Watanabe 0001, 
Two-Pass Low Latency End-to-End Spoken Language Understanding.

#88  | Ian McLoughlin 0001 | DBLP Google Scholar  
By venue: Interspeech: 22, ICASSP: 13, TASLP: 3
By year: 2024: 1, 2023: 3, 2022: 5, 2021: 7, 2020: 8, 2019: 10, 2018: 4
ISCA sessions: speaker and language recognition: 3; analysis of speech and audio signals: 2; neural transducers, streaming asr and novel asr models: 1; resource-constrained asr: 1; language and accent recognition: 1; acoustic event detection and acoustic scene classification: 1; asr neural network architectures: 1; learning techniques for speaker recognition: 1; speech and voice disorders: 1; asr neural network architectures and training: 1; acoustic event detection: 1; speaker recognition and diarization: 1; speech synthesis: 1; speech and audio classification: 1; audio signal characterization: 1; speaker verification using neural network methods: 1; representation learning for emotion: 1; acoustic scenes and rare events: 1; novel neural network architectures for acoustic modelling: 1
IEEE keywords: speaker verification: 4; speech recognition: 4; speaker recognition: 4; convolutional neural nets: 3; signal classification: 3; recurrent neural nets: 3; audio signal processing: 3; representation learning: 2; knowledge based systems: 2; deep learning (artificial intelligence): 2; supervised learning: 2; speech separation: 2; sound event detection: 2; audio tagging: 2; runtime: 1; meta learning: 1; episodic training: 1; iron: 1; domain alignment: 1; performance gain: 1; degradation: 1; stargan: 1; domain adaptation: 1; performance evaluation: 1; data models: 1; adaptation models: 1; data augmentation: 1; recording: 1; self supervised learning: 1; anomalous sound detection: 1; label smoothing: 1; unsupervised domain adaptation: 1; knowledge distillation: 1; end to end: 1; speech emotion recognition: 1; emotion recognition: 1; signal reconstruction: 1; style transformation: 1; convolutional neural network: 1; disentanglement: 1; sequence alignment: 1; probability: 1; multi granularity: 1; post inference: 1; inference mechanisms: 1; end to end asr: 1; encoder decoder: 1; dense residual networks: 1; model ensemble: 1; embedding learning: 1; segan: 1; speech enhancement: 1; generative adversarial network: 1; gan: 1; convolution: 1; deconvolution: 1; self attention: 1; multi view learning: 1; pattern classification: 1; music classification: 1; audio classification: 1; gradient blending: 1; music: 1; vocal source excitation: 1; medical disorders: 1; medical computing: 1; patient rehabilitation: 1; whisper to speech conversion: 1; glottal flow model: 1; speech synthesis: 1; laryngectomy: 1; source separation: 1; time domain: 1; sparse encoder: 1; speaker identification: 1; target tracking: 1; signal representation: 1; time domain analysis: 1; semi supervised learning: 1; weakly labeled: 1; computational auditory scene analysis: 1; label permutation problem: 1; autoregressive processes: 1; spectral model: 1; voice quality: 1; iterative methods: 1; glottal flow: 1; adaptive filters: 1; spectral tilt: 1; glottal inverse filtering: 1; multi label: 1; audio event detection: 1; estimation theory: 1; isolated sound: 1; overlapping sound: 1; convolutional recurrent neural network: 1; multi task: 1; weakly labelled data: 1; attention: 1
Most publications (all venues) at 2021: 21, 2020: 17, 2016: 13, 2019: 12, 2014: 12

Affiliations
Singapore Institute of Technology, Singapore
University of Science and Technology of China, Hefei, China
University of Kent, UK (2015 - 2019)
Nanyang Technological University, Singapore (former)
University of Birmingham, UK (PhD 1997)

Recent publications

ICASSP2024 Jian-Tao Zhang, Yan Song 0001, Jin Li, Wu Guo, Hao-Yu Song, Ian McLoughlin 0001
Meta Representation Learning Method for Robust Speaker Verification in Unseen Domains.

ICASSP2023 Hang-Rui Hu, Yan Song 0001, Jian-Tao Zhang, Li-Rong Dai 0001, Ian McLoughlin 0001, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, 
Stargan-vc Based Cross-Domain Data Augmentation for Speaker Verification.

Interspeech2023 Kang Li, Yan Song 0001, Ian McLoughlin 0001, Lin Liu 0017, Jin Li, Li-Rong Dai 0001, 
Fine-tuning Audio Spectrogram Transformer with Task-aware Adapters for Sound Event Detection.

Interspeech2023 Xiao-Min Zeng, Yan Song 0001, Ian McLoughlin 0001, Lin Liu 0017, Li-Rong Dai 0001, 
Robust Prototype Learning for Anomalous Sound Detection.

ICASSP2022 Han Chen, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Self-Supervised Representation Learning for Unsupervised Anomalous Sound Detection Under Domain Shift.

ICASSP2022 Hang-Rui Hu, Yan Song 0001, Ying Liu, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Domain Robust Deep Embedding Learning for Speaker Recognition.

ICASSP2022 Yuxuan Xi, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Frontend Attributes Disentanglement for Speech Emotion Recognition.

Interspeech2022 Zhifu Gao, Shiliang Zhang, Ian McLoughlin 0001, Zhijie Yan, 
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition.

Interspeech2022 Hang-Rui Hu, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Class-Aware Distribution Alignment based Unsupervised Domain Adaptation for Speaker Verification.

TASLP2021 Jian Tang, Jie Zhang 0042, Yan Song 0001, Ian McLoughlin 0001, Li-Rong Dai 0001, 
Multi-Granularity Sequence Alignment Mapping for Encoder-Decoder Based End-to-End ASR.

ICASSP2021 Ying Liu, Yan Song 0001, Ian McLoughlin 0001, Lin Liu 0017, Li-Rong Dai 0001, 
An Effective Deep Embedding Learning Method Based on Dense-Residual Networks for Speaker Verification.

ICASSP2021 Huy Phan, Huy Le Nguyen, Oliver Y. Chén, Philipp Koch, Ngoc Q. K. Duong, Ian McLoughlin 0001, Alfred Mertins, 
Self-Attention Generative Adversarial Network for Speech Enhancement.

ICASSP2021 Huy Phan, Huy Le Nguyen, Oliver Y. Chén, Lam Dang Pham, Philipp Koch, Ian McLoughlin 0001, Alfred Mertins, 
Multi-View Audio And Music Classification.

Interspeech2021 Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin 0001
Extremely Low Footprint End-to-End ASR System for Smart Device.

Interspeech2021 Hui Wang, Lin Liu 0017, Yan Song 0001, Lei Fang, Ian McLoughlin 0001, Li-Rong Dai 0001, 
A Weight Moving Average Based Alternate Decoupled Learning Algorithm for Long-Tailed Language Identification.

Interspeech2021 Xu Zheng, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection.

TASLP2020 Olivier Perrotin, Ian Vince McLoughlin
Glottal Flow Synthesis for Whisper-to-Speech Conversion.

ICASSP2020 Hui Wang, Yan Song 0001, Zengxi Li, Ian McLoughlin 0001, Li-Rong Dai 0001, 
An Online Speaker-aware Speech Separation Approach Based on Time-domain Representation.

ICASSP2020 Jie Yan, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001
Task-Aware Mean Teacher Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection.

Interspeech2020 Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin 0001
SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition.

#89  | Takafumi Moriya | DBLP Google Scholar  
By venue: Interspeech: 23, ICASSP: 14, TASLP: 1
By year: 2024: 2, 2023: 11, 2022: 7, 2021: 6, 2020: 3, 2019: 6, 2018: 3
ISCA sessions: novel models and training methods for asr: 2; adjusting to speaker, accent, and domain: 2; analysis of neural speech representations: 1; speech synthesis and voice conversion: 1; end-to-end asr: 1; spoken language understanding, summarization, and information retrieval: 1; speech recognition: 1; speech coding and enhancement: 1; speech representation: 1; multi-, cross-lingual and other topics in asr: 1; single-channel speech enhancement: 1; speech perception: 1; streaming for asr/rnn transducers: 1; source separation, dereverberation and echo cancellation: 1; search/decoding techniques and confidence measures for asr: 1; asr neural network architectures and training: 1; speech and audio classification: 1; model training for asr: 1; nn architectures for asr: 1; spoken term detection, confidence measure, and end-to-end speech recognition: 1; selected topics in neural speech processing: 1
IEEE keywords: speech recognition: 9; automatic speech recognition: 4; neural network: 4; speech enhancement: 3; self supervised learning: 3; end to end: 3; recurrent neural nets: 3; speech representation: 2; linguistics: 2; analytical models: 2; transformers: 2; task analysis: 2; natural language processing: 2; recurrent neural network transducer: 2; probability: 2; end to end automatic speech recognition: 2; speaker representation: 1; refining: 1; probing task: 1; layer wise similarity analysis: 1; speaker embeddings: 1; noise robustness: 1; degradation: 1; adaptation models: 1; zero shot tts: 1; speech synthesis: 1; self supervised learning model: 1; data models: 1; focusing: 1; language dependency: 1; low resource: 1; training data: 1; end to end speech summarization: 1; measurement: 1; synthetic data augmentation: 1; how2 dataset: 1; multi modal data augmentation: 1; neural transducer: 1; recurrent neural networks: 1; robustness: 1; decoding: 1; scheduled sampling: 1; iterative methods: 1; forward language model: 1; end to end speech recognition: 1; iterative decoding: 1; partial sentence aware backward language model: 1; iterative shallow fusion: 1; symbols: 1; shallow fusion: 1; buildings: 1; multilingual: 1; representation learning: 1; cross lingual: 1; self supervised speech representation learning: 1; attention based decoder: 1; input switching: 1; deep learning (artificial intelligence): 1; noise robust speech recognition: 1; speech separation: 1; speakerbeam: 1; speech extraction: 1; listener adaptation: 1; perceived emotion: 1; speech emotion recognition: 1; emotion recognition: 1; entropy: 1; whole network pre training: 1; synchronisation: 1; autoregressive processes: 1; sequence level consistency training: 1; specaugment: 1; transformer: 1; semi supervised learning: 1; connectionist temporal classification: 1; attention weight: 1; knowledge distillation: 1; speech codecs: 1; cloud computing: 1; covariance matrix adaptation evolution strategy (cma es): 1; multi objective optimization: 1; pareto optimisation: 1; genetic algorithm: 1; hidden markov models: 1; parallel processing: 1; deep neural network (dnn): 1; evolutionary computation: 1; speech coding: 1; attention based encoder decoder: 1; hierarchical recurrent encoder decoder: 1
Most publications (all venues) at 2023: 13, 2022: 9, 2019: 9, 2024: 7, 2018: 7


Recent publications

ICASSP2024 Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima, 
What Do Self-Supervised Speech and Speaker Models Learn? New Findings from a Cross Model Layer-Wise Analysis.

ICASSP2024 Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima, 
Noise-Robust Zero-Shot Text-to-Speech Synthesis Conditioned on Self-Supervised Speech-Representation Model with Adapters.

ICASSP2023 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, 
Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models.

ICASSP2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura, 
Leveraging Large Text Corpora For End-To-End Speech Summarization.

ICASSP2023 Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, 
Improving Scheduled Sampling for Neural Transducer-Based ASR.

ICASSP2023 Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix, 
Iterative Shallow Fusion of Backward Language Model for End-To-End Speech Recognition.

ICASSP2023 Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Hiroshi Sato, Taiga Yamane, Takanori Ashihara, Kohei Matsuura, Takafumi Moriya
Leveraging Language Embeddings for Cross-Lingual Self-Supervised Speech Representation Learning.

Interspeech2023 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma, 
SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

Interspeech2023 Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima, 
VC-T: Streaming Voice Conversion Based on Neural Transducer.

Interspeech2023 Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando, 
End-to-End Joint Target and Non-Target Speakers ASR.

Interspeech2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, 
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization.

Interspeech2023 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami, 
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data.

Interspeech2023 Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo, 
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss.

ICASSP2022 Takafumi Moriya, Takanori Ashihara, Atsushi Ando, Hiroshi Sato, Tomohiro Tanaka, Kohei Matsuura, Ryo Masumura, Marc Delcroix, Takahiro Shinozaki, 
Hybrid RNN-T/Attention-Based Streaming ASR with Triggered Chunkwise Attention and Dual Internal Language Model Integration.

ICASSP2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya
Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition.

Interspeech2022 Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, 
Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models.

Interspeech2022 Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando, 
End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training.

Interspeech2022 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki, 
Streaming Target-Speaker ASR with Neural Transducer.

Interspeech2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, 
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations.

Interspeech2022 Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya
Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks.

#90  | Zhong Meng | DBLP Google Scholar  
By venue: Interspeech: 19, ICASSP: 18, NAACL: 1
By year: 2024: 4, 2023: 3, 2022: 7, 2021: 10, 2020: 6, 2019: 6, 2018: 2
ISCA sessions: multi- and cross-lingual asr, other topics in asr: 3; speech recognition: 2; other topics in speech recognition: 1; robust asr, and far-field/multi-talker asr: 1; novel models and training methods for asr: 1; source separation: 1; self-supervision and semi-supervision for neural asr training: 1; applications in transcription, education and learning: 1; neural network training methods for asr: 1; asr neural network architectures: 1; training strategies for asr: 1; asr neural network architectures and training: 1; spoken term detection, confidence measure, and end-to-end speech recognition: 1; asr neural network training: 1; novel approaches to enhancement: 1; deep enhancement: 1
IEEE keywords: speech recognition: 14; automatic speech recognition: 5; computational modeling: 4; adaptation models: 4; task analysis: 3; data models: 3; speaker recognition: 3; natural language processing: 3; deep neural network: 3; teacher student learning: 3; adversarial learning: 3; error analysis: 2; decoding: 2; transducers: 2; predictive models: 2; rich transcription: 2; speaker counting: 2; audio signal processing: 2; continuous speech separation: 2; recurrent neural nets: 2; speaker identification: 2; probability: 2; domain adaptation: 2; neural network: 2; speaker adaptation: 2; computational efficiency: 1; runtime efficiency: 1; end to end asr: 1; computational latency: 1; large models: 1; causal model: 1; context modeling: 1; online asr: 1; state space model: 1; convolution: 1; conformer: 1; systematics: 1; parameter efficient adaptation: 1; foundation model: 1; tuning: 1; universal speech model: 1; internal lm: 1; text recognition: 1; text injection: 1; transformer transducer: 1; factorized neural transducer: 1; language model adaptation: 1; vocabulary: 1; voice activity detection: 1; speaker diarization: 1; meeting transcription: 1; recurrent selective attention network: 1; source separation: 1; hypothesis stitcher: 1; computer architecture: 1; bayes methods: 1; speech separation: 1; minimum bayes risk training: 1; language model: 1; attention based encoder decoder: 1; recurrent neural network transducer: 1; sequence training: 1; self teaching: 1; regularization: 1; permutation invariant training: 1; libricss: 1; microphones: 1; overlapped speech: 1; entropy: 1; computer aided instruction: 1; latency: 1; lstm: 1; label embedding: 1; knowledge representation: 1; backpropagation: 1; domain invariant training: 1; attention: 1; speaker verification: 1
Most publications (all venues) at 2021: 15, 2020: 9, 2024: 8, 2023: 8, 2022: 8


Recent publications

ICASSP2024 Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno 0001, 
Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models.

ICASSP2024 Haozhe Shan, Albert Gu, Zhong Meng, Weiran Wang, Krzysztof Choromanski, Tara N. Sainath, 
Augmenting Conformers With Structured State-Space Sequence Models For Online Speech Recognition.

ICASSP2024 Khe Chai Sim, Zhouyuan Huo, Tsendsuren Munkhdalai, Nikhil Siddhartha, Adam Stooke, Zhong Meng, Bo Li 0028, Tara N. Sainath, 
A Comparison of Parameter-Efficient ASR Domain Adaptation Methods for Universal Speech and Language Models.

NAACL2024 Weiran Wang, Rohit Prabhavalkar, Haozhe Shan, Zhong Meng, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li 0028, James Qin, Xingyu Cai, Adam Stooke, Chengjian Zheng, Yanzhang He, Tara N. Sainath, Pedro Moreno Mengibar, 
Massive End-to-end Speech Recognition Models with Time Reduction.

ICASSP2023 Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang 0033, Bo Li 0028, Andrew Rosenberg, Bhuvana Ramabhadran, 
JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.

Interspeech2023 Shaan Bijwadia, Shuo-Yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang, 
Text Injection for Capitalization and Turn-Taking Prediction in Speech Models.

Interspeech2023 Cal Peyser, Zhong Meng, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho, Ke Hu, 
Improving Joint Speech-Text Representations Without Alignment.

ICASSP2022 Xie Chen 0001, Zhong Meng, Sarangarajan Parthasarathy, Jinyu Li 0001, 
Factorized Neural Transducer for Efficient Language Model Adaptation.

ICASSP2022 Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang 0009, Zhong Meng, Zhuo Chen 0006, Takuya Yoshioka, 
Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR.

ICASSP2022 Yixuan Zhang 0005, Zhuo Chen 0006, Jian Wu 0027, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li 0001, 
Continuous Speech Separation with Recurrent Selective Attention Network.

Interspeech2022 Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiong Xiao, Zhong Meng, Xiaofei Wang 0009, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings.

Interspeech2022 Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiong Xiao, Zhong Meng, Xiaofei Wang 0009, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Streaming Multi-Talker ASR with Token-Level Serialized Output Training.

Interspeech2022 Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li 0001, Xie Chen 0001, Yu Wu 0012, Yifan Gong 0001, 
Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition.

Interspeech2022 Wangyou Zhang, Zhuo Chen 0006, Naoyuki Kanda, Shujie Liu 0001, Jinyu Li 0001, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei, 
Separating Long-Form Speech with Group-wise Permutation Invariant Training.

ICASSP2021 Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang 0009, Zhong Meng, Takuya Yoshioka, 
Hypothesis Stitcher for End-to-End Speaker-Attributed ASR on Long-Form Multi-Talker Recordings.

ICASSP2021 Naoyuki Kanda, Zhong Meng, Liang Lu 0001, Yashesh Gaur, Xiaofei Wang 0009, Zhuo Chen 0006, Takuya Yoshioka, 
Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR.

ICASSP2021 Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu 0001, Xie Chen 0001, Jinyu Li 0001, Yifan Gong 0001, 
Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition.

ICASSP2021 Eric Sun, Liang Lu 0001, Zhong Meng, Yifan Gong 0001, 
Sequence-Level Self-Teaching Regularization.

Interspeech2021 Liang Lu 0001, Zhong Meng, Naoyuki Kanda, Jinyu Li 0001, Yifan Gong 0001, 
On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer.

Interspeech2021 Yan Deng, Rui Zhao 0017, Zhong Meng, Xie Chen 0001, Bing Liu, Jinyu Li 0001, Yifan Gong 0001, Lei He 0005, 
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS.

#91  | Nicholas W. D. Evans | DBLP Google Scholar  
By venue: Interspeech: 23, ICASSP: 10, TASLP: 4
By year: 2024: 4, 2023: 7, 2022: 3, 2021: 9, 2020: 6, 2019: 4, 2018: 4
ISCA sessions: voice anti-spoofing and countermeasure: 3; voice privacy challenge: 3; speech coding: 2; anti-spoofing for speaker verification: 1; speaker and language identification: 1; spoofing-aware automatic speaker verification (sasv): 1; robust speaker recognition: 1; privacy-preserving machine learning for audio & speech processing: 1; the first dicova challenge: 1; graph and end-to-end learning for speaker recognition: 1; anti-spoofing and liveness detection: 1; speaker recognition: 1; privacy in speech and audio interfaces: 1; the 2019 automatic speaker verification spoofing and countermeasures challenge: 1; novel approaches to enhancement: 1; spoken corpora and annotation: 1; the first dihard speech diarization challenge: 1; speaker verification: 1
IEEE keywords: speaker recognition: 5; task analysis: 4; anti spoofing: 4; presentation attack detection: 4; privacy: 3; data privacy: 3; protocols: 3; countermeasure: 2; speaker anonymization: 2; codecs: 2; spoofing: 2; countermeasures: 2; speech recognition: 2; databases: 2; automatic speaker verification: 2; artificial bandwidth extension: 2; variational auto encoder: 2; speech quality: 2; latent variable: 2; pseudonymisation: 1; voice privacy: 1; anonymisation: 1; voice conversion: 1; attack model: 1; speech synthesis: 1; recording: 1; training data: 1; degradation: 1; text to speech: 1; deepfake detection: 1; signal processing algorithms: 1; privacy friendly data: 1; information filtering: 1; language robust orthogonal householder neural network: 1; vocoders: 1; language modeling: 1; linguistics: 1; neural audio codec: 1; semantics: 1; asvspoof: 1; deepfakes: 1; distributed databases: 1; communication networks: 1; internet: 1; deepfake: 1; self supervised learning: 1; spoof localization: 1; partialspoof: 1; splicing: 1; forgery: 1; error analysis: 1; spoofing countermeasures: 1; joint optimisation: 1; speaker verification: 1; spoofing detection: 1; graph attention networks: 1; audio spoofing detection: 1; graph theory: 1; end to end: 1; heterogeneous: 1; transient response: 1; data augmentation: 1; filtering theory: 1; media: 1; i vectors: 1; tv: 1; sinc net: 1; cepstral analysis: 1; raw signal: 1; spectrogram: 1; x vectors: 1; speaker diarization: 1; public domain software: 1; signal classification: 1; automatic speaker verification (asv): 1; security of data: 1; detection cost function: 1; spoofing counter measures: 1; statistical distributions: 1; mean square error methods: 1; generative adversarial network: 1; telephony: 1; regression analysis: 1; speech coding: 1; dimensionality reduction: 1
Most publications (all venues) at 2024: 16, 2021: 15, 2015: 15, 2022: 14, 2018: 14


Recent publications

TASLP2024 Michele Panariello, Natalia A. Tomashenko, Xin Wang 0037, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas W. D. Evans, Emmanuel Vincent 0001, Junichi Yamagishi, 
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation.

ICASSP2024 Wanying Ge, Xin Wang 0037, Junichi Yamagishi, Massimiliano Todisco, Nicholas W. D. Evans
Spoofing Attack Augmentation: Can Differently-Trained Attack Models Improve Generalisation?

ICASSP2024 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Nicholas W. D. Evans, Massimiliano Todisco, Jean-François Bonastre, Mickael Rouvier, 
Synvox2: Towards A Privacy-Friendly Voxceleb2 Dataset.

ICASSP2024 Michele Panariello, Francesco Nespoli, Massimiliano Todisco, Nicholas W. D. Evans
Speaker Anonymization Using Neural Audio Codec Language Models.

TASLP2023 Xuechen Liu, Xin Wang 0037, Md. Sahidullah, Jose Patino 0001, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas W. D. Evans, Andreas Nautsch, Kong Aik Lee, 
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.

TASLP2023 Lin Zhang, Xin Wang 0037, Erica Cooper, Nicholas W. D. Evans, Junichi Yamagishi, 
The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance.

ICASSP2023 Wanying Ge, Hemlata Tak, Massimiliano Todisco, Nicholas W. D. Evans
Can Spoofing Countermeasure And Speaker Verification Systems Be Jointly Optimised?

Interspeech2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang 0037, Xuechen Liu, Md. Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas W. D. Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung, 
Towards Single Integrated Spoofing-aware Speaker Verification Embeddings.

Interspeech2023 Michele Panariello, Wanying Ge, Hemlata Tak, Massimiliano Todisco, Nicholas W. D. Evans
Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems.

Interspeech2023 Michele Panariello, Massimiliano Todisco, Nicholas W. D. Evans
Vocoder drift in x-vector-based speaker anonymization.

Interspeech2023 Lin Zhang, Xin Wang 0037, Erica Cooper, Nicholas W. D. Evans, Junichi Yamagishi, 
Range-Based Equal Error Rate for Spoof Localization.

ICASSP2022 Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, Nicholas W. D. Evans
AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks.

ICASSP2022 Hemlata Tak, Madhu R. Kamble, Jose Patino 0001, Massimiliano Todisco, Nicholas W. D. Evans
Rawboost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing.

Interspeech2022 Jee-weon Jung, Hemlata Tak, Hye-jin Shim, Hee-Soo Heo, Bong-Jin Lee, Soo-Whan Chung, Ha-Jin Yu, Nicholas W. D. Evans, Tomi Kinnunen, 
SASV 2022: The First Spoofing-Aware Speaker Verification Challenge.

ICASSP2021 Anthony Larcher, Ambuj Mehrish, Marie Tahon, Sylvain Meignier, Jean Carrive, David Doukhan, Olivier Galibert, Nicholas W. D. Evans
Speaker Embeddings for Diarization of Broadcast Data In The Allies Challenge.

ICASSP2021 Hemlata Tak, Jose Patino 0001, Massimiliano Todisco, Andreas Nautsch, Nicholas W. D. Evans, Anthony Larcher, 
End-to-End anti-spoofing with RawNet2.

Interspeech2021 Jose Patino 0001, Natalia A. Tomashenko, Massimiliano Todisco, Andreas Nautsch, Nicholas W. D. Evans
Speaker Anonymisation Using the McAdams Coefficient.

Interspeech2021 Oubaïda Chouchane, Baptiste Brossier, Jorge Esteban Gamboa Gamboa, Thomas Lardy, Hemlata Tak, Orhan Ermis, Madhu R. Kamble, Jose Patino 0001, Nicholas W. D. Evans, Melek Önen, Massimiliano Todisco, 
Privacy-Preserving Voice Anti-Spoofing Using Secure Multi-Party Computation.

Interspeech2021 Wanying Ge, Michele Panariello, Jose Patino 0001, Massimiliano Todisco, Nicholas W. D. Evans
Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection.

Interspeech2021 Madhu R. Kamble, José Andrés González López, Teresa Grau, Juan M. Espín, Lorenzo Cascioli, Yiqing Huang, Alejandro Gómez Alanís, Jose Patino 0001, Roberto Font, Antonio M. Peinado, Angel M. Gomez, Nicholas W. D. Evans, Maria A. Zuluaga, Massimiliano Todisco, 
PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge.

#92  | Satoshi Nakamura 0001 | DBLP Google Scholar  
By venue: Interspeech: 24, TASLP: 8, ICASSP: 5
By year: 2024: 1, 2023: 3, 2022: 6, 2021: 6, 2020: 9, 2019: 8, 2018: 4
ISCA sessions: spoken machine translation: 4; search methods and decoding algorithms for asr: 1; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; phonetics: 1; speaking styles and interaction styles: 1; self-supervised, semi-supervised, adaptation and data augmentation for asr: 1; speech synthesis: 1; low-resource speech recognition: 1; lm adaptation, lexical units and punctuation: 1; general topics in speech recognition: 1; neural signals for spoken communication: 1; the zero resource speech challenge 2020: 1; topics in asr: 1; turn management in dialogue: 1; search methods for speech recognition: 1; speech in the brain: 1; the zero resource speech challenge 2019: 1; sequence models for asr: 1; acoustic model adaptation: 1; integrating speech science and technology for clinical applications: 1; statistical parametric speech synthesis: 1
IEEE keywords: speech recognition: 7; speech synthesis: 5; natural language processing: 3; text to speech: 2; lombard effect: 2; dpgmm: 2; unsupervised phoneme discovery: 2; zerospeech: 2; recurrent neural nets: 2; unsupervised learning: 2; gaussian processes: 2; signal reconstruction: 2; speech chain: 2; tts: 2; asr: 2; computational modeling: 1; speech segmentation: 1; transformers: 1; symbols: 1; pretrained speech encoder: 1; predictive models: 1; end to end speech to text translation: 1; self adaptive: 1; machine speech chain: 1; incremental: 1; low latency communication: 1; hurricanes: 1; noise measurement: 1; real time systems: 1; acoustic noise: 1; dynamic adaptation: 1; machine speech chain inference: 1; signal denoising: 1; speech intelligibility: 1; hearing: 1; low resource asr: 1; infant speech perception: 1; engrams: 1; perception of phonemes: 1; rnn: 1; functional load: 1; training data: 1; multi linguality: 1; standards: 1; probability: 1; neural machine translation (nmt): 1; data augmentation: 1; decoding: 1; europe: 1; automatic speech recognition: 1; ctc: 1; transformer: 1; hybrid asr: 1; video signal processing: 1; information retrieval: 1; interactive systems: 1; emotion recognition: 1; affective computing: 1; human computer interaction: 1; emotion elicitation: 1; chat based dialogue system: 1; independent component analysis: 1; blind source separation: 1; cognition: 1; eeg: 1; medical signal processing: 1; electroencephalography: 1; speech artifact removal: 1; neurophysiology: 1; spoken word production: 1; tensor decomposition: 1; brain: 1; straight through estimator: 1; end to end feedback loss: 1; word segmentation: 1; data models: 1; cross lingual speech processing: 1; labeling: 1; indexes: 1; tobi label generation: 1; prosody detection: 1; task analysis: 1
Most publications (all venues) at 2014: 51, 2015: 47, 2018: 46, 2016: 41, 2020: 39

Affiliations
Nara Institute of Science and Technology, Ikoma, Japan
ATR Spoken Language Communication Labs, Kyoto, Japan
National Institute of Information and Communications Technology (NICT), Spoken Language Communication Group, Keihanna Science City, Japan
Sharp Corporation, Nara, Japan
Kyoto University, Japan (PhD 1992)

Recent publications

TASLP2024 Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura 0001
Improving Speech Translation Accuracy and Time Efficiency With Fine-Tuned wav2vec 2.0-Based Speech Segmentation.

ICASSP2023 Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura 0001
Self-Adaptive Incremental Machine Speech Chain for Lombard TTS with High-Granularity ASR Feedback in Dynamic Noise Condition.

Interspeech2023 Yasumasa Kano, Katsuhito Sudoh, Satoshi Nakamura 0001
Average Token Delay: A Latency Metric for Simultaneous Translation.

Interspeech2023 Yuta Nishikawa, Satoshi Nakamura 0001
Inter-connection: Effective Connection between Pre-trained Encoder and Decoder for Speech Translation.

TASLP2022 Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura 0001
A Machine Speech Chain Approach for Dynamically Adaptive Lombard TTS in Static and Dynamic Noise Environments.

TASLP2022 Bin Wu, Sakriani Sakti, Jinsong Zhang 0001, Satoshi Nakamura 0001
Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-Resource ASR.

Interspeech2022 Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura 0001
Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation.

Interspeech2022 Kei Furukawa, Takeshi Kishiyama, Satoshi Nakamura 0001
Applying Syntax-Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis.

Interspeech2022 Seiya Kawano, Muteki Arioka, Akishige Yuguchi, Kenta Yamamoto, Koji Inoue, Tatsuya Kawahara, Satoshi Nakamura 0001, Koichiro Yoshino, 
Multimodal Persuasive Dialogue Corpus using Teleoperated Android.

Interspeech2022 Heli Qi, Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura 0001
Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing.

TASLP2021 Bin Wu, Sakriani Sakti, Jinsong Zhang 0001, Satoshi Nakamura 0001
Tackling Perception Bias in Unsupervised Phoneme Discovery Using DPGMM-RNN Hybrid Model and Functional Load.

Interspeech2021 Johanes Effendi, Sakriani Sakti, Satoshi Nakamura 0001
Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer.

Interspeech2021 Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura 0001
ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation.

Interspeech2021 Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura 0001
Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder.

Interspeech2021 Shun Takahashi, Sakriani Sakti, Satoshi Nakamura 0001
Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages.

Interspeech2021 Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura 0001
Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation.

TASLP2020 Yuta Nishimura, Katsuhito Sudoh, Graham Neubig, Satoshi Nakamura 0001
Multi-Source Neural Machine Translation With Missing Data.

TASLP2020 Andros Tjandra, Sakriani Sakti, Satoshi Nakamura 0001
Machine Speech Chain.

TASLP2020 Andros Tjandra, Sakriani Sakti, Satoshi Nakamura 0001
Corrections to "Machine Speech Chain".

ICASSP2020 Andros Tjandra, Chunxi Liu, Frank Zhang 0001, Xiaohui Zhang 0007, Yongqiang Wang 0005, Gabriel Synnaeve, Satoshi Nakamura 0001, Geoffrey Zweig, 
DEJA-VU: Double Feature Presentation and Iterated Loss in Deep Transformer Networks.

#93  | Andreas Stolcke | DBLP Google Scholar  
By venueICASSP: 21Interspeech: 14ACL-Findings: 1EMNLP: 1
By year2024: 32023: 52022: 112021: 122020: 32019: 22018: 1
ISCA sessionsspeaker recognition and anti-spoofing: 2inclusive and fair speech technologies: 2speaker recognition: 2speech recognition: 1new computational strategies for asr training and inference: 1speaker diarization: 1multi- and cross-lingual asr, other topics in asr: 1self-supervision and semi-supervision for neural asr training: 1training strategies for asr: 1speaker recognition challenges and applications: 1rich transcription and asr systems: 1
IEEE keywordsspeech recognition: 13speaker recognition: 6automatic speech recognition: 4recurrent neural nets: 4speaker verification: 3adaptation models: 3natural language processing: 3spoken language understanding: 2emotion recognition: 2degradation: 2signal processing algorithms: 2error analysis: 2personalization: 2embedding adaptation: 2optimisation: 2language modeling: 2speech enhancement: 1zero shot learning: 1upper bound: 1robustness: 1lattices: 1question answering (information retrieval): 1in context learning: 1large language models. asr confusion networks: 1runtime: 1computational modeling: 1bridges: 1asymmetric speaker recognition: 1embedding space alignment: 1performance gain: 1multitasking: 1switches: 1speech sentiment analysis: 1paralinguistics: 1transformers: 1large language models: 1spoken dialogue modeling: 1linguistics: 1analytical models: 1heuristic algorithms: 1turn taking: 1multi armed bandits: 1endpointing: 1dialog modeling: 1neural transducer: 1transducers: 1attention: 1measurement uncertainty: 1fuses: 1contextual biasing: 1pronunciation: 1rnn t: 1graph based learning: 1cross utterance: 1measurement: 1hypothesis rescoring: 1label propagation: 1interactive systems: 1self supervised training: 1human computer interaction: 1supervised learning: 1dialogue: 1rejection mechanism: 1signal representation: 1diarization: 1multi task learning: 1speaker identification: 1few shot open set learning: 1score fusion: 1deep learning (artificial intelligence): 1model fairness: 1bert: 1pretrained model: 1second pass rescoring: 1minimum wer training: 1masked language model: 1bayes methods: 1robust speech recognition: 1adversarial robustness: 1sequence modeling: 1speech recognition safety: 1metric learning: 1pattern classification: 1mixup: 1prototypical loss: 1interpolation: 1rescoring: 1decoding: 1shallow fusion: 1inference algorithms: 1pattern clustering: 1neural net architecture: 1computational complexity: 1encoder decoder attractor: 1end to end neural diarization: 1online inference: 1speaker diarization: 1accent invariance: 1domain adversarial training: 1rnn transducer: 1end to end asr: 1multi accent asr: 1unsupervised pre training: 1contrastive predictive coding: 1speech emotion recognition: 1unsupervised learning: 1multilingual: 1joint modeling: 1recurrent neural network transducer: 1code switching: 1language identification: 1reinforce: 1multitask training: 1neural interfaces: 1entropy: 1hot spots: 1feature fusion: 1meeting understanding: 1involvement: 1sentiment analysis: 1acoustic modeling: 1multimodal fusion: 1cepstral analysis: 1audio feature extraction: 1
Most publications (all venues) at 2006: 18, 2005: 18, 2004: 16, 2021: 15, 2022: 14


Recent publications

ICASSP2024 Kevin Everson, Yile Gu, Chao-Han Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-Yi Lee, Ariya Rastrow, Andreas Stolcke
Towards ASR Robust Spoken Language Understanding Through in-Context Learning with Word Confusion Networks.

ICASSP2024 Chenyang Gao, Brecht Desplanques, Chelsea J.-T. Ju, Aman Chadha, Andreas Stolcke
Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models.

ICASSP2024 Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-Yi Lee, Ivan Bulyko, 
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue.

ICASSP2023 Do June Min, Andreas Stolcke, Anirudh Raju, Colin Vaz, Di He 0004, Venkatesh Ravichandran, Viet Anh Trinh, 
Adaptive Endpointing with Deep Contextual Multi-Armed Bandits.

ICASSP2023 Rahul Pandey, Roger Ren, Qi Luo, Jing Liu, Ariya Rastrow, Ankur Gandhe, Denis Filimonov, Grant P. Strimel, Andreas Stolcke, Ivan Bulyko, 
Procter: Pronunciation-Aware Contextual Adapter For Personalized Speech Recognition In Neural Transducers.

ICASSP2023 Srinath Tankasala, Long Chen, Andreas Stolcke, Anirudh Raju, Qianli Deng, Chander Chandak, Aparna Khare, Roland Maas, Venkatesh Ravichandran, 
Cross-Utterance ASR Rescoring with Graph-Based Label Propagation.

Interspeech2023 Aakriti Agrawal, Milind Rao, Anit Kumar Sahu, Gopinath Chennupati, Andreas Stolcke
Learning When to Trust Which Teacher for Weakly Supervised ASR.

Interspeech2023 Denis Filimonov, Prabhat Pandey, Ariya Rastrow, Ankur Gandhe, Andreas Stolcke
Streaming Speech-to-Confusion Network Speech Recognition.

ICASSP2022 Metehan Cekic, Ruirui Li 0002, Zeya Chen, Yuguang Yang 0004, Andreas Stolcke, Upamanyu Madhow, 
Self-Supervised Speaker Recognition Training using Human-Machine Dialogues.

ICASSP2022 Aparna Khare, Eunjung Han, Yuguang Yang 0004, Andreas Stolcke
ASR-Aware End-to-End Neural Diarization.

ICASSP2022 K. C. Kishan, Zhenning Tan, Long Chen, Minho Jin, Eunjung Han, Andreas Stolcke, Chul Lee, 
OpenFEAT: Improving Speaker Identification by Open-Set Few-Shot Embedding Adaptation with Transformer.

ICASSP2022 Hua Shen, Yuguang Yang 0004, Guoli Sun, Ryan Langman, Eunjung Han, Jasha Droppo, Andreas Stolcke
Improving Fairness in Speaker Verification via Group-Adapted Fusion Network.

ICASSP2022 Liyan Xu, Yile Gu, Jari Kolehmainen, Haidar Khan, Ankur Gandhe, Ariya Rastrow, Andreas Stolcke, Ivan Bulyko, 
RescoreBERT: Discriminative Speech Recognition Rescoring With Bert.

ICASSP2022 Chao-Han Huck Yang, Zeeshan Ahmed, Yile Gu, Joseph Szurley, Roger Ren, Linda Liu, Andreas Stolcke, Ivan Bulyko, 
Mitigating Closed-Model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition.

ICASSP2022 Xin Zhang, Minho Jin, Roger Cheng, Ruirui Li 0002, Eunjung Han, Andreas Stolcke
Contrastive-mixup Learning for Improved Speaker Verification.

Interspeech2022 Long Chen, Yixiong Meng, Venkatesh Ravichandran, Andreas Stolcke
Graph-based Multi-View Fusion and Local Adaptation: Mitigating Within-Household Confusability for Speaker Identification.

Interspeech2022 Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke
Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities.

Interspeech2022 Minho Jin, Chelsea Ju, Zeya Chen, Yi-Chieh Liu, Jasha Droppo, Andreas Stolcke
Adversarial Reweighting for Speaker Verification Fairness.

Interspeech2022 Viet Anh Trinh, Pegah Ghahremani, Brian John King, Jasha Droppo, Andreas Stolcke, Roland Maas, 
Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation.

ICASSP2021 Aditya Gourav, Linda Liu, Ankur Gandhe, Yile Gu, Guitang Lan, Xiangyang Huang, Shashank Kalmane, Gautam Tiwari, Denis Filimonov, Ariya Rastrow, Andreas Stolcke, Ivan Bulyko, 
Personalization Strategies for End-to-End Speech Recognition Systems.

#94  | Sheng Li 0010 | DBLP Google Scholar  
By venue: Interspeech: 18, ICASSP: 15, SpeechComm: 2, ACL-Findings: 1, TASLP: 1
By year: 2024: 3, 2023: 6, 2022: 9, 2021: 5, 2020: 5, 2019: 6, 2018: 3
ISCA sessionsmulti-, cross-lingual and other topics in asr: 1speech quality assessment: 1speech representation: 1zero, low-resource and multi-modal speech recognition: 1dereverberation, noise reduction, and speaker extraction: 1other topics in speech recognition: 1speech enhancement and intelligibility: 1source separation: 1oriental language recognition: 1speech and voice disorders: 1single-channel speech enhancement: 1cross-lingual and multilingual asr: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1nn architectures for asr: 1speech and audio classification: 1acoustic modelling: 1audio events and acoustic scenes: 1language identification: 1
IEEE keywordsspeech recognition: 12training data: 3linguistics: 3knowledge distillation: 3task analysis: 2acoustic model: 2speech separation: 2end to end: 2transformer: 2spoken language identification: 2fake audio detection (fad): 1self supervised learned (ssl) model: 1model fusion: 1mos prediction: 1predictive models: 1speech synthesis: 1logic gates: 1language adaptation: 1domain adaptation: 1automatic speech recognition: 1multitasking: 1adaptation models: 1khmer language: 1low resource language: 1self supervised pretraining: 1binary codes: 1bridges: 1end to end multilingual model: 1speech emotion recognition: 1speech enhancement: 1emotion recognition: 1federated learning: 1differential privacy: 1distributed databases: 1privacy protection: 1privacy: 1generalization: 1vocoders: 1source separation: 1data augmentation: 1vocoder: 1error analysis: 1multi modal: 1representation: 1linear programming: 1bidirectional attention: 1forced alignment: 1learning systems: 1feature distillation: 1task driven loss: 1model compression: 1data imbalance: 1hard sample mining: 1dynamic mixing: 1weighted loss: 1speaker recognition: 1modeling units: 1multi task learning: 1ctc/attention: 1natural language processing: 1mandarin speech recognition: 1tone modeling: 1estimation: 1encoder decoder: 1signal processing algorithms: 1pitch tracking: 1viterbi algorithm: 1auditory encoder: 1hearing: 1convolutional neural network: 1voice activity detection: 1ear: 1internal representation learning: 1short utterances: 1end to end model: 1dysarthric speech recognition: 1medical signal processing: 1articulatory attribute detection: 1time frequency analysis: 1multi target learning: 1speech dereverberation: 1two stage: 1spectrograms fusion: 1reverberation: 1teacher model optimization: 1natural languages: 1computer aided instruction: 1short utterance feature representation: 1interactive teacher student learning: 1connec tionist temporal classification: 1
Most publications (all venues) at 2022: 18, 2023: 17, 2021: 13, 2020: 12, 2024: 9

Affiliations
National Institute of Information and Communications Technology (NICT), Universal Communication Research Institute (UCRI), Kyoto, Japan
Kyoto University, Graduate School of Informatics, Japan (2012-2017, PhD 2016)
Shenzhen Institutes of Advanced Technology, Shenzhen, China (2008-2012)
Chinese Academy of Sciences, Beijing, China (2008-2012)
Chinese University of Hong Kong, Hong Kong (2008-2012)
Nanjing University, China (2002-2009)

Recent publications

SpeechComm2024 Yuqin Lin, Jianwu Dang 0001, Longbiao Wang, Sheng Li 0010, Chenchen Ding, 
Disordered speech recognition considering low resources and abnormal articulation.

SpeechComm2024 Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li 0010, Jianwu Dang 0001, 
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network.

ICASSP2024 Wangjin Zhou, Zhengdong Yang, Chenhui Chu, Sheng Li 0010, Raj Dabre, Yi Zhao, Tatsuya Kawahara, 
MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction.

ICASSP2023 Soky Kak, Sheng Li 0010, Chenhui Chu, Tatsuya Kawahara, 
Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language.

ICASSP2023 Qianying Liu, Zhuo Gong, Zhengdong Yang, Yuhang Yang, Sheng Li 0010, Chenchen Ding, Nobuaki Minematsu, Hao Huang 0009, Fei Cheng 0002, Chenhui Chu, Sadao Kurohashi, 
Hierarchical Softmax for End-To-End Low-Resource Multilingual Speech Recognition.

ICASSP2023 Chao Tan, Yang Cao 0011, Sheng Li 0010, Masatoshi Yoshikawa, 
General or Specific? Investigating Effective Privacy Protection in Federated Learning for Speech Emotion Recognition.

ICASSP2023 Kai Wang, Yuhang Yang, Hao Huang 0009, Ying Hu 0005, Sheng Li 0010
Speakeraugment: Data Augmentation for Generalizable Source Separation via Speaker Parameter Manipulation.

ICASSP2023 Yuhang Yang, Haihua Xu, Hao Huang 0009, Eng Siong Chng, Sheng Li 0010
Speech-Text Based Multi-Modal Training with Bidirectional Attention for Improved Speech Recognition.

ACL-Findings2023 Shuichiro Shimizu, Chenhui Chu, Sheng Li 0010, Sadao Kurohashi, 
Towards Speech Dialogue Translation Mediating Speakers of Different Languages.

ICASSP2022 Yongjie Lv, Longbiao Wang, Meng Ge, Sheng Li 0010, Chenchen Ding, Lixin Pan, Yuguang Wang 0003, Jianwu Dang 0001, Kiyoshi Honda, 
Compressing Transformer-Based ASR Model by Task-Driven Loss and Attention-Based Multi-Level Feature Distillation.

ICASSP2022 Kai Wang, Yizhou Peng, Hao Huang 0009, Ying Hu 0005, Sheng Li 0010
Mining Hard Samples Locally And Globally For Improved Speech Separation.

Interspeech2022 Soky Kak, Sheng Li 0010, Masato Mimura, Chenhui Chu, Tatsuya Kawahara, 
Leveraging Simultaneous Translation for Enhancing Transcription of Low-resource Language via Cross Attention Mechanism.

Interspeech2022 Kai Li 0018, Sheng Li 0010, Xugang Lu, Masato Akagi, Meng Liu, Lin Zhang, Chang Zeng, Longbiao Wang, Jianwu Dang 0001, Masashi Unoki, 
Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection.

Interspeech2022 Nan Li, Meng Ge, Longbiao Wang, Masashi Unoki, Sheng Li 0010, Jianwu Dang 0001, 
Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network.

Interspeech2022 Siqing Qin, Longbiao Wang, Sheng Li 0010, Yuqin Lin, Jianwu Dang 0001, 
Finer-grained Modeling units-based Meta-Learning for Low-resource Tibetan Speech Recognition.

Interspeech2022 Hao Shi, Longbiao Wang, Sheng Li 0010, Jianwu Dang 0001, Tatsuya Kawahara, 
Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction.

Interspeech2022 Longfei Yang, Wenqing Wei, Sheng Li 0010, Jiyi Li, Takahiro Shinozaki, 
Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection.

Interspeech2022 Zhengdong Yang, Wangjin Zhou, Chenhui Chu, Sheng Li 0010, Raj Dabre, Raphael Rubino, Yi Zhao, 
Fusion of Self-supervised Learned Models for MOS Prediction.

ICASSP2021 Shunfei Chen, Xinhui Hu, Sheng Li 0010, Xinkang Xu, 
An Investigation of Using Hybrid Modeling Units for Improving End-to-End Speech Recognition System.

ICASSP2021 Hao Huang 0009, Kai Wang, Ying Hu 0005, Sheng Li 0010
Encoder-Decoder Based Pitch Tracking and Joint Model Training for Mandarin Tone Classification.

#95  | Zhengqi Wen | DBLP Google Scholar  
By venue: Interspeech: 21, TASLP: 8, ICASSP: 8
By year: 2024: 1, 2023: 1, 2022: 3, 2021: 8, 2020: 13, 2019: 7, 2018: 4
ISCA sessionsspeech synthesis: 4voice conversion and adaptation: 3statistical parametric speech synthesis: 2topics in asr: 1search/decoding techniques and confidence measures for asr: 1computational resource constrained speech recognition: 1multi-channel audio and emotion recognition: 1speech enhancement: 1asr neural network architectures: 1sequence-to-sequence speech recognition: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1speech and audio source separation and scene analysis: 1nn architectures for asr: 1speech synthesis paradigms and methods: 1prosody modeling and generation: 1
IEEE keywordsspeech synthesis: 8speech recognition: 7natural language processing: 4end to end: 4speaker recognition: 4speech enhancement: 3text analysis: 3transfer learning: 3decoding: 3spectrogram: 2speech coding: 2text based speech editing: 2text editing: 2end to end model: 2attention: 2speaker adaptation: 2noise robustness: 1synthetic speech detection: 1interactive fusion: 1noise measurement: 1data models: 1knowledge distillation: 1noise: 1noise robust: 1asvspoof: 1buildings: 1automatic speaker verification: 1complexity theory: 1architecture: 1fake speech detection: 1voice activity detection: 1self distillation: 1task analysis: 1waveform generators: 1vocoders: 1deterministic plus stochastic: 1multiband excitation: 1noise control: 1vocoder: 1filtering theory: 1stochastic processes: 1text to speech: 1one shot learning: 1coarse to fine decoding: 1mask prediction: 1mask and prediction: 1fast: 1bert: 1non autoregressive: 1cross modal: 1autoregressive processes: 1teacher student learning: 1language modeling: 1gated recurrent fusion: 1robust end to end speech recognition: 1speech transformer: 1speech distortion: 1decoupled transformer: 1automatic speech recognition: 1code switching: 1bi level decoupling: 1prosody modeling: 1speaking style modeling: 1personalized speech synthesis: 1few shot speaker adaptation: 1the m2voc challenge: 1prosody and voice factorization: 1clustering algorithms: 1end to end post filter: 1deep clustering: 1permutation invariant training: 1deep attention fusion features: 1speech separation: 1interference: 1prosody transfer: 1audio signal processing: 1optimisation: 1optimization strategy: 1forward backward algorithm: 1synchronous transformer: 1online speech recognition: 1encoding: 1asynchronous problem: 1signal processing algorithms: 1predictive models: 1chunk by chunk: 1cross lingual: 1adversarial training: 1low resource: 1phoneme representation: 1matrix decomposition: 1speaker embedding: 1
Most publications (all venues) at 2024: 22, 2020: 19, 2021: 15, 2016: 14, 2019: 12


Recent publications

TASLP2024 Cunhang Fan, Mingming Ding, Jianhua Tao 0001, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv, 
Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection.

ICASSP2023 Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, Zhao Lv, 
Learning From Yourself: A Self-Distillation Method For Fake Speech Detection.

TASLP2022 Tao Wang 0074, Ruibo Fu, Jiangyan Yi, Jianhua Tao 0001, Zhengqi Wen
NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation for Noise-Controllable Waveform Generation.

TASLP2022 Tao Wang 0074, Jiangyan Yi, Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen
CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing.

ICASSP2022 Tao Wang 0074, Jiangyan Yi, Liqun Deng, Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen
Context-Aware Mask Prediction Network for End-to-End Text-Based Speech Editing.

TASLP2021 Ye Bai, Jiangyan Yi, Jianhua Tao 0001, Zhengkun Tian, Zhengqi Wen, Shuai Zhang 0014, 
Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT.

TASLP2021 Ye Bai, Jiangyan Yi, Jianhua Tao 0001, Zhengqi Wen, Zhengkun Tian, Shuai Zhang 0014, 
Integrating Knowledge Into End-to-End Speech Recognition From External Text-Only Data.

TASLP2021 Cunhang Fan, Jiangyan Yi, Jianhua Tao 0001, Zhengkun Tian, Bin Liu 0041, Zhengqi Wen
Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition.

ICASSP2021 Shuai Zhang 0014, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao 0001, Zhengqi Wen
Decoupling Pronunciation and Language for End-to-End Code-Switching Automatic Speech Recognition.

ICASSP2021 Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, Jiangyan Yi, Tao Wang 0074, Chunyu Qiang, 
Bi-Level Style and Prosody Decoupling Modeling for Personalized End-to-End Speech Synthesis.

ICASSP2021 Tao Wang 0074, Ruibo Fu, Jiangyan Yi, Jianhua Tao 0001, Zhengqi Wen, Chunyu Qiang, Shiming Wang, 
Prosody and Voice Factorization for Few-Shot Speaker Adaptation in the Challenge M2voc 2021.

Interspeech2021 Shuai Zhang 0014, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao 0001, Xuefei Liu, Zhengqi Wen
End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition.

Interspeech2021 Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao 0001, Shuai Zhang 0014, Zhengqi Wen
FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization.

TASLP2020 Cunhang Fan, Jianhua Tao 0001, Bin Liu 0041, Jiangyan Yi, Zhengqi Wen, Xuefei Liu, 
End-to-End Post-Filter for Speech Separation With Deep Attention Fusion Features.

ICASSP2020 Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, Jiangyan Yi, Tao Wang 0074, 
Focusing on Attention: Prosody Transfer and Adaptative Optimization Strategy for Multi-Speaker End-to-End Speech Synthesis.

ICASSP2020 Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao 0001, Shuai Zhang 0014, Zhengqi Wen
Synchronous Transformers for end-to-end Speech Recognition.

Interspeech2020 Ye Bai, Jiangyan Yi, Jianhua Tao 0001, Zhengkun Tian, Zhengqi Wen, Shuai Zhang 0014, 
Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition.

Interspeech2020 Cunhang Fan, Jianhua Tao 0001, Bin Liu 0041, Jiangyan Yi, Zhengqi Wen
Gated Recurrent Fusion of Spatial and Spectral Features for Multi-Channel Speech Separation with Deep Embedding Representations.

Interspeech2020 Cunhang Fan, Jianhua Tao 0001, Bin Liu 0041, Jiangyan Yi, Zhengqi Wen
Joint Training for Simultaneous Speech Denoising and Dereverberation with Deep Embedding Representations.

Interspeech2020 Ruibo Fu, Jianhua Tao 0001, Zhengqi Wen, Jiangyan Yi, Chunyu Qiang, Tao Wang 0074, 
Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis.

#96  | Jan Cernocký | DBLP Google Scholar  
By venue: Interspeech: 22, ICASSP: 13, TASLP: 2
By year: 2024: 2, 2023: 4, 2022: 6, 2021: 10, 2020: 3, 2019: 8, 2018: 4
ISCA sessionsspeaker recognition and diarization: 2automatic speech recognition in air traffic management: 2source separation: 1speaker and language identification: 1self-supervised, semi-supervised, adaptation and data augmentation for asr: 1search/decoding algorithms for asr: 1robust speaker recognition: 1linguistic components in end-to-end asr: 1graph and end-to-end learning for speaker recognition: 1embedding and network architecture for speaker recognition: 1target speaker detection, localization and separation: 1sequence-to-sequence speech recognition: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1zero-resource asr: 1speaker recognition: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1topics in speech recognition: 1dereverberation: 1low resource speech recognition challenge for indian languages: 1neural network training strategies for asr: 1
IEEE keywordsspeech recognition: 5speaker recognition: 5transformers: 4speaker diarization: 3speaker verification: 3telephone sets: 2data models: 2adaptation models: 2natural language processing: 2training data: 2pattern clustering: 2variational bayes: 2bayes methods: 2hidden markov models: 2speaker embedding: 2optimisation: 2x vector: 2error correction: 1conversational telephone speech: 1source coding: 1calibration: 1telephony: 1target speech extraction: 1feature aggregation: 1pre trained models: 1benchmark testing: 1self supervised learning: 1costs: 1spoken term detection: 1end to end keyword search: 1indexing: 1keyword search: 1decoding: 1asr free keyword search: 1lattices: 1keyword spotting: 1vocabulary: 1transfer learning: 1adapter: 1fine tuning: 1pre trained model: 1blind source separation: 1unsupervised target speech extraction: 1frequency domain analysis: 1cross domain: 1dpccn: 1mixture remix: 1speech separation: 1time domain analysis: 1beamforming: 1speech enhancement: 1multi channel: 1multisv: 1dataset: 1array signal processing: 1self supervision: 1speech synthesis: 1sequence to sequence: 1cycle consistency: 1adaptation: 1data augmentation: 1robustness: 1x vectors: 1auxiliary loss: 1joint training: 1language translation: 1how2 dataset: 1coupled de coding: 1spoken language translation: 1asr objective: 1end to end differentiable pipeline: 1unsupervised learning: 1hierarchical subspace model: 1acoustic unit discovery: 1clustering: 1on the fly data augmentation: 1specaugment: 1convolutional neural nets: 1probability: 1dihard: 1hmm: 1linear discriminant analysis: 1attention models: 1discriminative training: 1recurrent neural nets: 1softmax margin: 1beam search training: 1sequence learning: 1topology: 1deep neural network: 1tensorflow: 1kaldi: 1network topology: 1task analysis: 1
Most publications (all venues) at 2021: 15, 2019: 15, 2022: 14, 2016: 12, 2018: 11

Affiliations
Brno University of Technology

Recent publications

ICASSP2024 Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Díez, Lukás Burget, Yuhang Cao, Heng Lu, Jan Cernocký
Diacorrect: Error Correction Back-End for Speaker Diarization.

ICASSP2024 Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocký
Target Speech Extraction with Pre-Trained Self-Supervised Learning Models.

TASLP2023 Bolaji Yusuf, Jan Cernocký, Murat Saraçlar, 
End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations.

ICASSP2023 Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldrich Plchot, Ladislav Mosner, Lukás Burget, Jan Cernocký
Parameter-Efficient Transfer Learning of Pre-Trained Transformer Models for Speaker Verification Using Adapters.

Interspeech2023 Ladislav Mosner, Oldrich Plchot, Junyi Peng, Lukás Burget, Jan Cernocký
Multi-Channel Speech Separation with Cross-Attention and Beamforming.

Interspeech2023 Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukás Burget, Jan Cernocký
Improving Speaker Verification with Self-Pretrained Transformer Models.

ICASSP2022 Jiangyu Han, Yanhua Long, Lukás Burget, Jan Cernocký
DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation and Extraction.

ICASSP2022 Ladislav Mosner, Oldrich Plchot, Lukás Burget, Jan Honza Cernocký
Multisv: Dataset for Far-Field Multi-Channel Speaker Verification.

Interspeech2022 Murali Karthick Baskar, Tim Herzig, Diana Nguyen, Mireia Díez, Tim Polzehl, Lukás Burget, Jan Cernocký
Speaker adaptation for Wav2vec2 based dysarthric ASR.

Interspeech2022 Martin Kocour, Katerina Zmolíková, Lucas Ondel, Jan Svec, Marc Delcroix, Tsubasa Ochiai, Lukás Burget, Jan Cernocký
Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model.

Interspeech2022 Junyi Peng, Rongzhi Gu, Ladislav Mosner, Oldrich Plchot, Lukás Burget, Jan Cernocký
Learnable Sparse Filterbank for Speaker Verification.

Interspeech2022 Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Anna Silnova, Lukás Burget, Jan Cernocký
Training speaker embedding extractors using multi-speaker audio with unknown speaker boundaries.

ICASSP2021 Murali Karthick Baskar, Lukás Burget, Shinji Watanabe 0001, Ramón Fernandez Astudillo, Jan Honza Cernocký
Eat: Enhanced ASR-TTS for Self-Supervised Speech Recognition.

ICASSP2021 Martin Karafiát, Karel Veselý, Jan Honza Cernocký, Ján Profant, Jirí Nytra, Miroslav Hlavácek, Tomás Pavlícek, 
Analysis of X-Vectors for Low-Resource Speech Recognition.

ICASSP2021 Hari Krishna Vydana, Martin Karafiát, Katerina Zmolíková, Lukás Burget, Honza Cernocký
Jointly Trained Transformers Models for Spoken Language Translation.

ICASSP2021 Bolaji Yusuf, Lucas Ondel, Lukás Burget, Jan Cernocký, Murat Saraçlar, 
A Hierarchical Subspace Model for Language-Attuned Acoustic Unit Discovery.

Interspeech2021 Ekaterina Egorova, Hari Krishna Vydana, Lukás Burget, Jan Cernocký
Out-of-Vocabulary Words Detection with Attention and CTC Alignments in an End-to-End ASR System.

Interspeech2021 Martin Kocour, Karel Veselý, Alexander Blatt, Juan Zuluaga-Gomez, Igor Szöke, Jan Cernocký, Dietrich Klakow, Petr Motlícek, 
Boosting of Contextual Information in ASR for Air-Traffic Call-Sign Recognition.

Interspeech2021 Junyi Peng, Xiaoyang Qu, Rongzhi Gu, Jianzong Wang, Jing Xiao 0006, Lukás Burget, Jan Cernocký
Effective Phase Encoding for End-To-End Speaker Verification.

Interspeech2021 Junyi Peng, Xiaoyang Qu, Jianzong Wang, Rongzhi Gu, Jing Xiao 0006, Lukás Burget, Jan Cernocký
ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform.

#97  | Yu Wu 0012 | DBLP Google Scholar  
By venue: Interspeech: 17, ICASSP: 11, TASLP: 3, ICML: 2, ACL: 2, AAAI: 2
By year: 2024: 3, 2023: 7, 2022: 11, 2021: 10, 2020: 6
ISCA sessionsnovel models and training methods for asr: 3source separation: 2multi- and cross-lingual asr, other topics in asr: 2analysis of speech and audio signals: 1speech recognition: 1statistical machine translation: 1speaker and language recognition: 1other topics in speech recognition: 1robust asr, and far-field/multi-talker asr: 1neural network training methods for asr: 1asr model training and strategies: 1streaming asr: 1asr neural network architectures: 1
IEEE keywordsspeech recognition: 8representation learning: 4self supervised learning: 4speaker recognition: 4transformers: 3speech separation: 3transformer: 3transducers: 2factorized neural transducer: 2predictive models: 2vocabulary: 2speech translation: 2task analysis: 2data models: 2error analysis: 2computational modeling: 2speech enhancement: 2natural language processing: 2source separation: 2recurrent neural nets: 2long content speech recognition: 1streaming and non streaming: 1context modeling: 1rnn t: 1computer architecture: 1codecs: 1machine translation: 1language model: 1speech coding: 1speech synthesis: 1semantics: 1speech text joint pre training: 1discrete tokenization: 1unified modeling language: 1oral communication: 1wavlm: 1multi speaker: 1conversation transcription: 1model size reduction: 1speech interruption detection: 1performance evaluation: 1semi supervised learning: 1pandemics: 1analytical models: 1quantization (signal): 1fuses: 1long form speech recognition: 1context and speech encoder: 1self supervised pretrain: 1unsupervised learning: 1speaker verification: 1image representation: 1multitasking: 1pre training: 1benchmark testing: 1speaker: 1linear programming: 1robust speech recognition: 1contrastive learning: 1wav2vec 2.0: 1automatic speech recognition: 1supervised learning: 1multi channel microphone: 1deep learning (artificial intelligence): 1signal representation: 1real time decoding: 1decoding: 1transducer: 1encoding: 1continuous speech separation: 1multi speaker asr: 1conformer: 1filtering theory: 1audio signal processing: 1system fusion: 1speaker diarization: 1
Most publications (all venues) at 2022: 17, 2023: 13, 2021: 13, 2020: 11, 2018: 8

Affiliations
Microsoft Research Asia, Beijing, China
Beihang University, State Key Lab of Software Development Environment, Beijing, China

Recent publications

TASLP2024 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
Advanced Long-Content Speech Recognition With Factorized Neural Transducer.

TASLP2024 Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu 0012, Shujie Liu 0001, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Furu Wei, 
VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation.

TASLP2024 Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu 0012, Shuo Ren, Shujie Liu 0001, Zhuoyuan Yao, Xun Gong 0005, Li-Rong Dai 0001, Jinyu Li 0001, Furu Wei, 
SpeechLM: Enhanced Speech Pre-Training With Unpaired Textual Data.

ICASSP2023 Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, Sefik Emre Eskimez, 
Speech Separation with Large-Scale Self-Supervised Learning.

ICASSP2023 Quchen Fu, Szu-Wei Fu, Yaran Fan, Yu Wu 0012, Zhuo Chen 0006, Jayant Gupchup, Ross Cutler, 
Real-Time Speech Interruption Analysis: from Cloud to Client Deployment.

ICASSP2023 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.

Interspeech2023 Youngdo Ahn, Chengyi Wang 0002, Yu Wu 0012, Jong Won Shin, Shujie Liu 0001, 
GRAVO: Learning to Generate Relevant Audio from Visual Features with Noisy Online Videos.

Interspeech2023 Yuang Li, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, 
Accelerating Transducers through Adjacent Token Merging.

Interspeech2023 Peidong Wang, Eric Sun, Jian Xue, Yu Wu 0012, Long Zhou, Yashesh Gaur, Shujie Liu 0001, Jinyu Li 0001, 
LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers.

ICML2023 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Shujie Liu 0001, Daniel Tompkins, Zhuo Chen 0006, Wanxiang Che, Xiangzhan Yu, Furu Wei, 
BEATs: Audio Pre-Training with Acoustic Tokenizers.

ICASSP2022 Zhengyang Chen, Sanyuan Chen, Yu Wu 0012, Yao Qian, Chengyi Wang 0002, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, 
Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification.

ICASSP2022 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Zhengyang Chen, Zhuo Chen 0006, Shujie Liu 0001, Jian Wu 0027, Yao Qian, Furu Wei, Jinyu Li 0001, Xiangzhan Yu, 
Unispeech-Sat: Universal Speech Representation Learning With Speaker Aware Pre-Training.

ICASSP2022 Yiming Wang, Jinyu Li 0001, Heming Wang, Yao Qian, Chengyi Wang 0002, Yu Wu 0012
Wav2vec-Switch: Contrastive Learning from Original-Noisy Speech Pairs for Robust Speech Recognition.

ICASSP2022 Chengyi Wang 0002, Yu Wu 0012, Sanyuan Chen, Shujie Liu 0001, Jinyu Li 0001, Yao Qian, Zhenglu Yang, 
Improving Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision.

Interspeech2022 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Shujie Liu 0001, Zhuo Chen 0006, Peidong Wang, Gang Liu 0001, Jinyu Li 0001, Jian Wu 0027, Xiangzhan Yu, Furu Wei, 
Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Interspeech2022 Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiong Xiao, Zhong Meng, Xiaofei Wang 0009, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings.

Interspeech2022 Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiong Xiao, Zhong Meng, Xiaofei Wang 0009, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Streaming Multi-Talker ASR with Token-Level Serialized Output Training.

Interspeech2022 Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li 0001, Xie Chen 0001, Yu Wu 0012, Yifan Gong 0001, 
Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition.

Interspeech2022 Shuo Ren, Shujie Liu 0001, Yu Wu 0012, Long Zhou, Furu Wei, 
Speech Pre-training with Acoustic Piece.

Interspeech2022 Chengyi Wang 0002, Yiming Wang, Yu Wu 0012, Sanyuan Chen, Jinyu Li 0001, Shujie Liu 0001, Furu Wei, 
Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training.

#98  | Bhuvana Ramabhadran | DBLP Google Scholar  
By venue: Interspeech: 20, ICASSP: 17
By year: 2024: 2, 2023: 7, 2022: 8, 2021: 8, 2020: 7, 2019: 3, 2018: 2
ISCA sessionsnovel models and training methods for asr: 3neural network training methods for asr: 2self-supervised learning in asr: 1speech recognition: 1resource-constrained asr: 1asr: 1self-supervised, semi-supervised, adaptation and data augmentation for asr: 1streaming for asr/rnn transducers: 1speech recognition of atypical speech: 1self-supervision and semi-supervision for neural asr training: 1asr neural network architectures and training: 1training strategies for asr: 1multilingual and code-switched asr: 1cross-lingual and multilingual asr: 1speech synthesis: 1adjusting to speaker, accent, and domain: 1perspective talk: 1
IEEE keywordsspeech recognition: 10speech synthesis: 4natural language processing: 4computational modeling: 3adaptation models: 3data models: 3n best rescoring: 3rnn t: 3multilingual: 3automatic speech recognition: 2error analysis: 2standards: 2convolution: 2entropy: 2text injection: 2task analysis: 2end to end speech recognition: 2signal processing algorithms: 2data augmentation: 2domain adaptation: 1multitasking: 1algebra: 1trajectory: 1task vectors: 1multilingual modeling: 1representation learning: 1unsupervised learning: 1soft sensors: 1recording: 1joint speech text models: 1submodels: 1decoding: 1self attention: 1video on demand: 1additives: 1lattices: 1fine tuning: 1large scale language models: 1transducers: 1internal lm: 1text recognition: 1multilingual text to speech synthesis: 1semisupervised learning: 1massive multilingual pretraining: 1speech–text semi supervised joint learning: 1loss measurement: 1speech text representation learning: 1visualization: 1simultaneous localization and mapping: 1inspection: 1text analysis: 1consistency regularization: 1self supervised: 1sequence to sequence model: 1speech normalization: 1speaker recognition: 1speech impairments: 1voice conversion: 1gradient methods: 1language id: 1mixture of experts: 1smoothing methods: 1neurons: 1random variables: 1transformer: 1rnn transducer: 1dropout: 1regularization: 1error statistics: 1language independent: 1transliteration: 1fine grained vae: 1text to speech: 1vector quantization: 1measurement: 1tacotron 2: 1classification algorithms: 1encoder decoder: 1code switched automatic speech recognition: 1speech coding: 1language model adaptation: 1
Most publications (all venues) at 2011: 24, 2017: 20, 2010: 16, 2009: 16, 2014: 15


Recent publications

ICASSP2024 Gowtham Ramesh, Kartik Audhkhasi, Bhuvana Ramabhadran
Task Vector Algebra for ASR Models.

ICASSP2024 Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov, 
Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data.

ICASSP2023 Kartik Audhkhasi, Brian Farris, Bhuvana Ramabhadran, Pedro J. Moreno 0001, 
Modular Conformer Training for Flexible End-to-End ASR.

ICASSP2023 Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel S. Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno 0001, Michael Riley 0001, 
Large-Scale Language Model Rescoring on Long-Form Data.

ICASSP2023 Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang 0033, Bo Li 0028, Andrew Rosenberg, Bhuvana Ramabhadran
JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.

ICASSP2023 Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang 0033, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech.

ICASSP2023 Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang 0033, 
Understanding Shared Speech-Text Representations.

Interspeech2023 Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi, 
O-1: Self-training with Oracle and 1-best Hypothesis.

Interspeech2023 Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran
Using Text Injection to Improve Recognition of Personal Identifiers in Speech.

ICASSP2022 Zhehuai Chen, Yu Zhang 0033, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno 0001, Gary Wang, 
Tts4pretrain 2.0: Advancing the use of Text and Speech in ASR Pretraining with Consistency and Contrastive Losses.

ICASSP2022 Neeraj Gaur, Tongzhou Chen, Ehsan Variani, Parisa Haghani, Bhuvana Ramabhadran, Pedro J. Moreno 0001, 
Multilingual Second-Pass Rescoring for Automatic Speech Recognition Systems.

Interspeech2022 Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno 0001, 
Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition.

Interspeech2022 Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang 0033, Nicolás Serrano, 
Reducing Domain mismatch in Self-supervised speech pre-training.

Interspeech2022 Zhehuai Chen, Yu Zhang 0033, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno 0001, Ankur Bapna, Heiga Zen, 
MAESTRO: Matched Speech Text Representations through Modality Matching.

Interspeech2022 Ehsan Variani, Michael Riley 0001, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran
On Adaptive Weight Interpolation of the Hybrid Autoregressive Transducer.

Interspeech2022 Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach, 
Improving Rare Word Recognition with LM-aware MWER Training.

Interspeech2022 Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno 0001, 
Non-Parallel Voice Conversion for ASR Augmentation.

ICASSP2021 Rohan Doshi, Youzheng Chen, Liyang Jiang, Xia Zhang, Fadi Biadsy, Bhuvana Ramabhadran, Fang Chu, Andrew Rosenberg, Pedro J. Moreno 0001, 
Extending Parrotron: An End-to-End, Speech Conversion and Speech Recognition Model for Atypical Speech.

ICASSP2021 Neeraj Gaur, Brian Farris, Parisa Haghani, Isabel Leal, Pedro J. Moreno 0001, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu, 
Mixture of Informed Experts for Multilingual Speech Recognition.

ICASSP2021 Hainan Xu, Yinghui Huang, Yun Zhu, Kartik Audhkhasi, Bhuvana Ramabhadran
Convolutional Dropout and Wordpiece Augmentation for End-to-End Speech Recognition.

#99  | Boris Ginsburg | DBLP Google Scholar  
By venue: Interspeech: 19, ICASSP: 17, ICML: 1
By year: 2024: 8, 2023: 12, 2022: 7, 2021: 6, 2020: 3, 2019: 1
ISCA sessionsspeech synthesis: 3speech recognition: 2show and tell: 2applications in transcription, education and learning: 2end-to-end asr: 1spoken language understanding, summarization, and information retrieval: 1speaker and language identification: 1other topics in speech recognition: 1novel models and training methods for asr: 1speaker recognition and diarization: 1non-autoregressive sequential modeling for speech processing: 1miscellanous topics in asr: 1computational resource constrained speech recognition: 1end-to-end speech recognition: 1
IEEE keywordsspeech recognition: 7adaptation models: 4automatic speech recognition: 4asr: 4decoding: 4error analysis: 3benchmark testing: 3conformer: 3speech synthesis: 3convolutional neural nets: 3in context learning: 2computational modeling: 2convolution: 2robustness: 2computer architecture: 2diarization: 2data models: 2speaker verification: 2transducers: 2rnn t: 2self supervised learning: 2taxonomy: 1degradation: 1gpt: 1finite state automata: 1text to speech: 1cognition: 1noise measurement: 1large language models: 1text normalization: 1noise robustness: 1training data: 1multi lingual: 1generated transcriptions: 1visualization: 1audio visual speech recognition: 1llm: 1ast: 1task analysis: 1real time systems: 1earnings 21: 1long form audio: 1ted lium: 1automatic speech recognition (asr): 1coraal: 1costs: 1ctc: 1rnnt: 1streaming asr: 1fastconformer: 1vector quantization: 1speech coding: 1neural codec: 1fast conformer: 1systematics: 1tokenization: 1pronunciation aware modeling: 1analytical models: 1data preprocessing: 1target tracking: 1control systems: 1voxlingua107: 1data mining: 1spoken language identification: 1measurement: 1representation learning: 1self supervised: 1linguistics: 1predictive models: 1voice conversion: 1standards: 1symbols: 1time frequency analysis: 1multi speaker asr: 1source separation: 1target speaker asr: 1context: 1speaker recognition: 1t vectors: 1speaker embedding: 1multilayer perceptrons: 1speech enhancement: 1mlp mixer: 1mel spectrogram generation: 1autoregressive processes: 1depth wise separable convolution: 1voice activity detection: 1spelling correction: 1pre trained language models: 1natural language processing: 1time channel separable convolution: 1convolutional networks: 1depthwise separable convolution: 1
Most publications (all venues) at 2024: 22, 2023: 22, 2021: 14, 2022: 10, 2018: 5


Recent publications

ICASSP2024 Yang Zhang 0089, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly Lavrukhin, Evelina Bakhturina, Boris Ginsburg
A Chat about Boring Problems: Studying GPT-Based Text Normalization.

ICASSP2024 Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte, 
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer.

ICASSP2024 Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg
SALM: Speech-Augmented Language Model with in-Context Learning for Speech Recognition and Translation.

ICASSP2024 Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg
Investigating End-to-End ASR Architectures for Long Form Audio Transcription.

ICASSP2024 Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg
Stateful Conformer with Cache-Based Inference for Streaming Automatic Speech Recognition.

ICASSP2024 Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg
Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition.

ICASSP2024 Hainan Xu, Zhehuai Chen, Fei Jia, Boris Ginsburg
Transducers with Pronunciation-Aware Embeddings for Automatic Speech Recognition.

ICML2024 Paarth Neekhara, Shehzeen Samarah Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian J. McAuley, 
SelfVC: Voice Conversion With Iterative Refinement using Self Transformations.

ICASSP2023 Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro, 
Vani: Very-Lightweight Accent-Controllable TTS for Native And Non-Native Speakers With Identity Preservation.

ICASSP2023 Travis M. Bartley, Fei Jia, Krishna C. Puvvada, Samuel Kriman, Boris Ginsburg
Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models.

ICASSP2023 Shehzeen Hussain, Paarth Neekhara, Jocelyn Huang, Jason Li, Boris Ginsburg
ACE-VC: Adaptive and Controllable Voice Conversion Using Explicitly Disentangled Self-Supervised Speech Representations.

ICASSP2023 Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe 0001, Boris Ginsburg
Multi-Blank Transducers for Speech Recognition.

ICASSP2023 Yang Zhang 0089, Krishna C. Puvvada, Vitaly Lavrukhin, Boris Ginsburg
Conformer-Based Target-Speaker Automatic Speech Recognition For Single-Channel Audio.

Interspeech2023 Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg
SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings.

Interspeech2023 Vladimir Bataev, Roman Korostik, Evgeny Shabalin, Vitaly Lavrukhin, Boris Ginsburg
Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator.

Interspeech2023 Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg
Confidence-based Ensembles of End-to-End Speech Recognition Models.

Interspeech2023 Cheng-Ping Hsieh, Subhankar Ghosh, Boris Ginsburg
Adapter-Based Extension of Multi-Speaker Text-To-Speech Model for New Speakers.

Interspeech2023 He Huang, Jagadeesh Balam, Boris Ginsburg
Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling.

Interspeech2023 Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
A Compact End-to-End Model with Local and Global Context for Spoken Language Identification.

Interspeech2023 Elena Rastorgueva, Vitaly Lavrukhin, Boris Ginsburg
NeMo Forced Aligner and its application to word alignment for subtitle generation.

#100  | Bin Ma 0001 | DBLP Google Scholar  
By venue: ICASSP: 19, Interspeech: 15, EMNLP: 1, TASLP: 1
By year: 2024: 3, 2023: 12, 2022: 5, 2021: 4, 2020: 6, 2019: 5, 2018: 1
ISCA sessionsspeech recognition: 2analysis of speech and audio signals: 2asr neural network architectures: 2cross-lingual and multilingual asr: 2self-supervised learning in asr: 1speaker and language identification: 1acoustic model adaptation for asr: 1speech synthesis: 1far-field speech recognition: 1neural techniques for voice conversion and waveform generation: 1spoken term detection: 1
IEEE keywordsspeech enhancement: 8speech recognition: 7benchmark testing: 5adaptation models: 4time frequency analysis: 4task analysis: 3transformer: 3transformers: 3speech separation: 3attention: 3convolution: 3natural language processing: 3robustness: 2recurrent neural networks: 2logic gates: 2semantics: 2signal processing algorithms: 2low resource: 2data models: 2microphone arrays: 2alimeeting: 2meeting transcription: 2speaker recognition: 2convolutional neural nets: 2recurrent neural nets: 2attention mechanism: 2prompt tuning: 1computational modeling: 1zero shot learning: 1explainable prompt: 1background noise: 1data mining: 1modulation: 1attentive pooling: 1feature modulation: 1recurrent: 1time domain analysis: 1cross modal learning: 1knowledge transfer: 1fuses: 1pooling layer: 1speech representation: 1noise robustness: 1disentangling representations: 1noise robust automatic speech recognition: 1noise measurement: 1self supervised learning: 1visualization: 1boosting: 1performance evaluation: 1contrastive learning: 1data augmentation: 1keyword spotting: 1pre trained models: 1knowledge distillation: 1linguistics: 1spoken language understanding: 1conformer: 1monaural speech enhancement: 1complex networks: 1upper bound: 1lightweight adaptation: 1prefix tuning: 1language translation: 1speech to text translation: 1energy resolution: 1noise reduction: 1filtering: 1acoustic echo cancellation: 1deep noise suppression: 1architecture: 1complex neural network: 1echo cancellers: 1headphones: 1automatic speech recognition: 1meeting scenario: 1speak diarization: 1arrays: 1multi speaker asr: 1m2met: 1speaker diarization: 1frequency recurrence: 1feature representation: 1frequency domain analysis: 1speech coding: 1online speech recognition: 1early endpointing: 1scalegrad: 1analytical models: 1deep learning (artificial intelligence): 1complex network: 1text analysis: 1phonetic posteriorgrams: 1vocoders: 1fastspeech: 1lpcnet: 1speech synthesis: 1cross lingual: 1voice conversion: 1gaussian processes: 1pattern matching: 1deep binary embeddings: 1temporal context: 1query by example: 1image retrieval: 1quantization (signal): 1low resource asr: 1pre training: 1catastrophic forgetting.: 1independent language model: 1fine tuning: 1audio visual systems: 1robust speech recognition: 1dropout: 1bimodal df smn: 1multi condition training: 1audio visual speech recognition: 1
Most publications (all venues) at 2010: 29, 2015: 21, 2014: 21, 2016: 19, 2017: 18

Affiliations
Alibaba Group, Speech Lab, Singapore
Nanyang Technological University, School of Computer Science and Engineering, Singapore
Institute for Infocomm Research, A*STAR, Singapore (since 2004)
University of Hong Kong, Hong Kong (PhD 2000)

Recent publications

ICASSP2024 Dianwen Ng, Chong Zhang 0003, Ruixi Zhang, Yukun Ma, Fabian Ritter Gutierrez, Trung Hieu Nguyen 0001, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma 0001
Are Soft Prompts Good Zero-Shot Learners for Speech Recognition?

ICASSP2024 Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang 0003, Hao Wang 0199, Trung Hieu Nguyen 0001, Kun Zhou 0003, Dianwen Ng, Eng Siong Chng, Bin Ma 0001
SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance.

ICASSP2024 Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang 0003, Hao Wang 0199, Trung Hieu Nguyen 0001, Kun Zhou 0003, Jia Qi Yip, Dianwen Ng, Bin Ma 0001
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation.

ICASSP2023 Yukun Ma, Trung Hieu Nguyen 0001, Jinjie Ni, Wen Wang, Qian Chen 0003, Chong Zhang 0003, Bin Ma 0001
Auxiliary Pooling Layer For Spoken Language Understanding.

ICASSP2023 Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, Jinjie Ni, Chong Zhang 0003, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma 0001
De'hubert: Disentangling Noise in a Self-Supervised Model for Robust Speech Recognition.

ICASSP2023 Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang 0003, Yukun Ma, Trung Hieu Nguyen 0001, Chongjia Ni, Eng Siong Chng, Bin Ma 0001
Contrastive Speech Mixup for Low-Resource Keyword Spotting.

ICASSP2023 Jinjie Ni, Yukun Ma, Wen Wang, Qian Chen 0033, Dianwen Ng, Han Lei, Trung Hieu Nguyen 0001, Chong Zhang 0003, Bin Ma 0001, Erik Cambria, 
Adaptive Knowledge Distillation Between Text and Speech Pre-Trained Models.

ICASSP2023 Shengkui Zhao, Bin Ma 0001
D2Former: A Fully Complex Dual-Path Dual-Decoder Conformer Network Using Joint Complex Masking and Complex Spectral Mapping for Monaural Speech Enhancement.

ICASSP2023 Shengkui Zhao, Bin Ma 0001
MossFormer: Pushing the Performance Limit of Monaural Speech Separation Using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions.

Interspeech2023 Dianwen Ng, Chong Zhang 0003, Ruixi Zhang, Yukun Ma, Trung Hieu Nguyen 0001, Chongjia Ni, Shengkui Zhao, Qian Chen 0003, Wen Wang, Eng Siong Chng, Bin Ma 0001
Adapter-tuning with Effective Token-dependent Representation Shift for Automatic Speech Recognition.

Interspeech2023 Dianwen Ng, Yang Xiao, Jia Qi Yip, Zhao Yang, Biao Tian, Qiang Fu 0001, Eng Siong Chng, Bin Ma 0001
Small Footprint Multi-channel Network for Keyword Spotting with Centroid Based Awareness.

Interspeech2023 Zhao Yang, Dianwen Ng, Chong Zhang 0003, Xiao Fu, Rui Jiang, Wei Xi, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma 0001, Jizhong Zhao, 
Dual Acoustic Linguistic Self-supervised Representation Learning for Cross-Domain Speech Recognition.

Interspeech2023 Zhao Yang, Dianwen Ng, Chong Zhang 0003, Rui Jiang, Wei Xi, Yukun Ma, Chongjia Ni, Jizhong Zhao, Bin Ma 0001, Eng Siong Chng, 
A Unified Recognition and Correction Model under Noisy and Accent Speech Conditions.

Interspeech2023 Zhao Yang, Dianwen Ng, Xizhe Li, Chong Zhang 0003, Rui Jiang, Wei Xi, Yukun Ma, Chongjia Ni, Jizhong Zhao, Bin Ma 0001, Eng Siong Chng, 
Dual-Memory Multi-Modal Learning for Continual Spoken Keyword Spotting with Confidence Selection and Diversity Enhancement.

Interspeech2023 Jia Qi Yip, Duc-Tuan Truong, Dianwen Ng, Chong Zhang 0003, Yukun Ma, Trung Hieu Nguyen 0001, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma 0001
ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention.

ICASSP2022 Yukun Ma, Trung Hieu Nguyen 0001, Bin Ma 0001
CPT: Cross-Modal Prefix-Tuning for Speech-To-Text Translation.

ICASSP2022 Karn N. Watcharasupat, Thi Ngoc Tho Nguyen, Woon-Seng Gan, Shengkui Zhao, Bin Ma 0001
End-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise Suppression.

ICASSP2022 Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie 0001, Siqi Zheng, Zhihao Du, Weilong Huang, Pengcheng Guo, Zhijie Yan, Bin Ma 0001, Xin Xu, Hui Bu, 
M2Met: The Icassp 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.

ICASSP2022 Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie 0001, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma 0001, Xin Xu, Hui Bu, 
Summary on the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge.

ICASSP2022 Shengkui Zhao, Bin Ma 0001, Karn N. Watcharasupat, Woon-Seng Gan, 
FRCRN: Boosting Feature Representation Using Frequency Recurrence for Monaural Speech Enhancement.

#101  | Lei He 0005 | DBLP Google Scholar  
By venue: Interspeech: 17, ICASSP: 13, ICLR: 2, NeurIPS: 2, ICML: 1, AAAI: 1
By year: 2024: 4, 2023: 7, 2022: 10, 2021: 3, 2020: 6, 2019: 5, 2018: 1
ISCA sessionsspeech synthesis: 11show and tell: 1voice conversion and adaptation: 1self-supervision and semi-supervision for neural asr training: 1acoustic model adaptation for asr: 1asr neural network architectures and training: 1voice conversion and speech synthesis: 1
IEEE keywordsspeech synthesis: 9prosody: 4data models: 3neural tts: 3text to speech: 3vocoders: 3speech recognition: 3text analysis: 3training data: 2speech enhancement: 2data mining: 2pre training: 2semiconductor device modeling: 2decoding: 2natural language processing: 2unsupervised learning: 2vq vae: 1expressive speech synthesis: 1self supervised style enhancing: 1spectrogram: 1multitasking: 1variance adaptor: 1adaptation models: 1speecht5: 1speech to speech translation: 1joint pre training: 1cross lingual modeling: 1analytical models: 1task analysis: 1multi lingual: 1convolution: 1waveglow: 1multi speaker: 1lightweight: 1iterative methods: 1fast sampling: 1probability: 1image denoising: 1denoising diffusion probabilistic models: 1medical image processing: 1optimisation: 1vocoder: 1computational modeling: 1memory management: 1self attention: 1transformers: 1efficient transformer: 1few shot: 1attention: 1emotion recognition: 1mist: 1speech coding: 1speech codecs: 1tts: 1domain adaptation: 1machine speech chain: 1speech bert embedding: 1large scale pre training: 1acoustic distortion: 1bit error rate: 1databases: 1neural language generation: 1acoustic model adaptation: 1speaker recognition: 1speaker adaptation: 1adaptation: 1recurrent neural nets: 1rnn t: 1keyword spotting: 1bert: 1knowledge representation: 1style transfer: 1variational autoencoder: 1
Most publications (all venues) at 2023: 11, 2022: 11, 2021: 9, 2020: 9, 2024: 7

Affiliations
Microsoft China, Speech and Language Group, Beijing, China

Recent publications

ICASSP2024 Xueyuan Chen, Xi Wang 0016, Shaofei Zhang, Lei He 0005, Zhiyong Wu 0001, Xixin Wu, Helen Meng, 
Stylespeech: Self-Supervised Style Enhancing with VQ-VAE-Based Pre-Training for Expressive Audiobook Speech Synthesis.

ICML2024 Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan 0003, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu 0001, Tao Qin 0001, Xiangyang Li 0001, Wei Ye 0004, Shikun Zhang, Jiang Bian 0002, Lei He 0005, Jinyu Li 0001, Sheng Zhao, 
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

ICLR2024 Yichong Leng, Zhifang Guo, Kai Shen, Zeqian Ju, Xu Tan 0003, Eric Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He 0005, Xiangyang Li 0001, Sheng Zhao, Tao Qin 0001, Jiang Bian 0002, 
PromptTTS 2: Describing and Generating Voices with Text Prompt.

ICLR2024 Kai Shen, Zeqian Ju, Xu Tan 0003, Eric Liu, Yichong Leng, Lei He 0005, Tao Qin 0001, Sheng Zhao, Jiang Bian 0002, 
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.

ICASSP2023 Yan Deng, Long Zhou, Yuanhao Yi, Shujie Liu 0001, Lei He 0005
Prosody-Aware Speecht5 for Expressive Neural TTS.

ICASSP2023 Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu 0001, Lei He 0005, Jinyu Li 0001, Furu Wei, 
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation.

ICASSP2023 Chen Zhang 0020, Shubham Bansal, Aakash Lakhera, Jinzhu Li, Gang Wang 0001, Sandeepkumar Satpal, Sheng Zhao, Lei He 0005
LeanSpeech: The Microsoft Lightweight Speech Synthesis System for Limmits Challenge 2023.

Interspeech2023 Brendan Walsh, Mark Hamilton, Greg Newby, Xi Wang 0016, Serena Ruan, Sheng Zhao, Lei He 0005, Shaofei Zhang, Eric Dettinger, William T. Freeman, Markus Weimer, 
Large-Scale Automatic Audiobook Creation.

Interspeech2023 Yujia Xiao, Shaofei Zhang, Xi Wang 0016, Xu Tan 0003, Lei He 0005, Sheng Zhao, Frank K. Soong, Tan Lee 0001, 
ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.

NeurIPS2023 Yuancheng Wang, Zeqian Ju, Xu Tan 0003, Lei He 0005, Zhizheng Wu 0001, Jiang Bian 0002, Sheng Zhao, 
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models.

AAAI2023 Yihan Wu, Junliang Guo, Xu Tan 0003, Chen Zhang 0020, Bohan Li 0003, Ruihua Song, Lei He 0005, Sheng Zhao, Arul Menezes, Jiang Bian 0002, 
VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing.

ICASSP2022 Zehua Chen, Xu Tan 0003, Ke Wang, Shifeng Pan, Danilo P. Mandic, Lei He 0005, Sheng Zhao, 
Infergrad: Improving Diffusion Models for Vocoder by Considering Inference in Training.

ICASSP2022 Yujia Xiao, Xi Wang 0016, Lei He 0005, Frank K. Soong, 
Improving Fastspeech TTS with Efficient Self-Attention and Compact Feed-Forward Network.

ICASSP2022 Yuanhao Yi, Lei He 0005, Shifeng Pan, Xi Wang 0016, Yujia Xiao, 
Prosodyspeech: Towards Advanced Prosody Model for Neural Text-to-Speech.

ICASSP2022 Fengpeng Yue, Yan Deng, Lei He 0005, Tom Ko, Yu Zhang 0006, 
Exploring Machine Speech Chain For Domain Adaptation.

Interspeech2022 Mutian He 0001, Jingzhou Yang, Lei He 0005, Frank K. Soong, 
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge.

Interspeech2022 Yanqing Liu, Ruiqing Xue, Lei He 0005, Xu Tan 0003, Sheng Zhao, 
DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders.

Interspeech2022 Yihan Wu, Xu Tan 0003, Bohan Li 0003, Lei He 0005, Sheng Zhao, Ruihua Song, Tao Qin 0001, Tie-Yan Liu, 
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.

Interspeech2022 Yihan Wu, Xi Wang 0016, Shaofei Zhang, Lei He 0005, Ruihua Song, Jian-Yun Nie, 
Self-supervised Context-aware Style Representation for Expressive Speech Synthesis.

Interspeech2022 Yuanhao Yi, Lei He 0005, Shifeng Pan, Xi Wang 0016, Yuchao Zhang, 
SoftSpeech: Unsupervised Duration Model in FastSpeech 2.

#102  | Philip C. Woodland | DBLP Google Scholar  
By venue ICASSP: 16, Interspeech: 13, TASLP: 3, SpeechComm: 2, ACL: 1, ACL-Findings: 1
By year 2024: 7, 2023: 11, 2022: 4, 2021: 6, 2020: 2, 2019: 3, 2018: 3
ISCA sessionsspeech recognition: 3speech emotion recognition: 1neural transducers, streaming asr and novel asr models: 1robust asr, and far-field/multi-talker asr: 1neural network training methods for asr: 1search/decoding techniques and confidence measures for asr: 1speaker embedding: 1asr neural network architectures: 1neural network training strategies for asr: 1acoustic model adaptation: 1novel neural network architectures for acoustic modelling: 1
IEEE keywordsspeech recognition: 12data models: 5hidden markov models: 4adaptation models: 4probability: 4natural language processing: 4predictive models: 3pointer generator: 3generators: 3end to end: 3self supervised learning: 3emotion recognition: 3error analysis: 3neural transducer: 2domain adaptation: 2context modeling: 2contextual speech recognition: 2asr: 2decoding: 2training data: 2speaker diarisation: 2foundation model: 2automatic speech recognition: 2confidence scores: 2d vector: 2speaker recognition: 2transducers: 1e2e asr: 1standards: 1vegetation: 1encoding: 1audio visual: 1graph neural networks: 1ctc: 1bridges: 1domain shifts: 1text data: 1system performance: 1speech emotion recognition: 1parameter efficient finetuning: 1pipelines: 1language model discounting: 1minimum bayes' risk: 1domain shifting: 1switches: 1language model: 1spectral clustering: 1speaker embedding: 1wav2vec 2.0: 1clustering methods: 1time frequency analysis: 1source separation: 1contextual biasing: 1zero shot learning: 1spoken language understanding: 1filling: 1slot filling: 1depression: 1speech based depression detection: 1analytical models: 1out of domain: 1feature selection: 1estimation theory: 1supervised learning: 1knowledge distillation: 1end to end asr: 1cross utterance: 1language models: 1transformer: 1lstm: 1diarisation: 1distributed representation: 1content aware speaker embedding: 1speaker embeddings: 1large margin softmax: 1overlapping speech: 1self attention: 1model combination: 1speaker diarization: 1python: 1
Most publications (all venues) at 2023: 16, 2015: 15, 2024: 14, 2021: 11, 2013: 10


Recent publications

SpeechComm2024 Keqi Deng, Philip C. Woodland
Decoupled structure for improved adaptability of end-to-end models.

TASLP2024 Keqi Deng, Philip C. Woodland
Label-Synchronous Neural Transducer for Adaptable Online E2E Speech Recognition.

TASLP2024 Guangzhi Sun, Chao Zhang 0031, Philip C. Woodland
Graph Neural Networks for Contextual ASR With the Tree-Constrained Pointer Generator.

ICASSP2024 Keqi Deng, Philip C. Woodland
FastInject: Injecting Unpaired Text Data into CTC-Based ASR Training.

ICASSP2024 Nineli Lashkarashvili, Wen Wu, Guangzhi Sun, Philip C. Woodland
Parameter Efficient Finetuning for Speech Emotion Recognition and Domain Adaptation.

ACL2024 Keqi Deng, Philip C. Woodland
Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation.

ACL-Findings2024 Guangzhi Sun, Shutong Feng, Dongcheng Jiang, Chao Zhang 0031, Milica Gasic, Philip C. Woodland
Speech-based Slot Filling using Large Language Models.

SpeechComm2023 Qiujia Li, Chao Zhang 0031, Philip C. Woodland
Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring.

TASLP2023 Guangzhi Sun, Chao Zhang 0031, Philip C. Woodland
Minimising Biasing Word Errors for Contextual ASR With the Tree-Constrained Pointer Generator.

ICASSP2023 Keqi Deng, Philip C. Woodland
Adaptable End-to-End ASR Models Using Replaceable Internal LMs and Residual Softmax.

ICASSP2023 Evonne P. C. Lee, Guangzhi Sun, Chao Zhang 0031, Philip C. Woodland
Spectral Clustering-Aware Learning of Embeddings for Speaker Diarisation.

ICASSP2023 Yuang Li, Xianrui Zheng, Philip C. Woodland
Self-Supervised Learning-Based Source Separation for Meeting Data.

ICASSP2023 Guangzhi Sun, Chao Zhang 0031, Philip C. Woodland
End-to-End Spoken Language Understanding with Tree-Constrained Pointer Generator.

ICASSP2023 Wen Wu, Chao Zhang 0031, Philip C. Woodland
Self-Supervised Representations in Speech-Based Depression Detection.

Interspeech2023 Dongcheng Jiang, Chao Zhang 0031, Philip C. Woodland
A Neural Time Alignment Module for End-to-End Automatic Speech Recognition.

Interspeech2023 Florian L. Kreyssig, Yangyang Shi, Jinxi Guo, Leda Sari, Abdel-rahman Mohamed, Philip C. Woodland
Biased Self-supervised Learning for ASR.

Interspeech2023 Guangzhi Sun, Xianrui Zheng, Chao Zhang 0031, Philip C. Woodland
Can Contextual Biasing Remain Effective with Whisper and GPT-2?

Interspeech2023 Wen Wu, Chao Zhang 0031, Philip C. Woodland
Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations.

ICASSP2022 Qiujia Li, Yu Zhang 0033, David Qiu, Yanzhang He, Liangliang Cao, Philip C. Woodland
Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition.

ICASSP2022 Xiaoyu Yang, Qiujia Li, Philip C. Woodland
Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-Trained Models.

#103  | Timo Gerkmann | DBLP Google Scholar  
By venue ICASSP: 16, Interspeech: 12, TASLP: 7, NeurIPS: 1
By year 2024: 5, 2023: 13, 2022: 7, 2021: 5, 2020: 3, 2019: 3
ISCA sessionssingle-channel and multi-channel speech enhancement: 3speech enhancement: 2speech coding and enhancement: 1speech coding: 1source separation: 1speech emotion recognition: 1dereverberation, noise reduction, and speaker extraction: 1(multimodal) speech emotion recognition: 1speech and audio source separation and scene analysis: 1
IEEE keywordsspeech enhancement: 17task analysis: 6diffusion models: 6signal processing algorithms: 5noise reduction: 4noise measurement: 4multi channel: 3nonlinear filters: 3spatial filters: 3computational modeling: 3uncertainty estimation: 3deep neural networks: 3estimation: 3dereverberation: 3time frequency analysis: 3variational autoencoder: 3training data: 3filtering theory: 3speech separation: 2predictive models: 2data models: 2bayesian estimator: 2wiener filters: 2uncertainty: 2diffusion processes: 2score based generative models: 2stochastic processes: 2speech recognition: 2acoustic distortion: 2nonlinear distortion: 2end to end learning: 2input features: 2generalization: 2multichannel: 2least mean squares methods: 2gaussian distribution: 2gaussian noise: 2array signal processing: 2semi supervised learning: 2filtering: 1maximum likelihood detection: 1dnn based: 1spatially selective filter (ssf): 1stochastic differential equations: 1automatic speech recognition: 1transformers: 1knowledge distillation: 1self supervised learning: 1in the wild: 1arousal: 1speech emotion conversion: 1non parallel samples: 1speech synthesis: 1iterative algorithms: 1ptychography: 1diffractive imaging: 1x ray microscopy: 1spectrogram: 1phase retrieval: 1image reconstruction: 1real time systems: 1visualization: 1live feedback: 1mathematical models: 1predictive learning: 1speech dereverberation: 1biological system modeling: 1score matching: 1microphones: 1joint non linear spatial and tempo spectral filtering: 1complex gaussian mixture models: 1working environment noise: 1ego noise reduction: 1human robot interaction: 1audio recording: 1adaptation models: 1multichannel non negative matrix factorization: 1bandwidth: 1generative modelling: 1computer vision: 1bandwidth extension: 1speech signal improvement: 1causal processing: 1process control: 1adaptive systems: 1universal speech enhancement: 1costs: 1spatially selective non linear filters: 1speaker extraction: 1spatial steering: 1location awareness: 1wiener filter: 1deep neural network: 1maximum likelihood estimation: 1prediction theory: 1neural network: 1online algorithm: 1hearing devices: 1signal to noise ratio: 1microphone arrays: 1nonlinear spatial filtering: 1statistical distributions: 1deep generative model: 1estimation theory: 1signal classification: 1generative model: 1robustness: 1channel bank filters: 1auditory filterbank: 1tasnet: 1audio coding: 1hearing: 1nonlinear filtering: 1acoustic beamforming: 1
Most publications (all venues) at 2023: 21, 2024: 16, 2022: 15, 2015: 15, 2021: 14


Recent publications

TASLP2024 Kristina Tesch, Timo Gerkmann
Multi-Channel Speech Separation Using Spatially Selective Deep Non-Linear Filters.

ICASSP2024 Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann
Single and Few-Step Diffusion for Generative Speech Enhancement.

ICASSP2024 Danilo de Oliveira, Timo Gerkmann
Distilling Hubert with LSTMs via Decoupled Knowledge Distillation.

ICASSP2024 Navin Raj Prabhu, Bunlong Lay, Simon Welker, Nale Lehmann-Willenbrock, Timo Gerkmann
EMOCONV-Diff: Diffusion-Based Speech Emotion Conversion for Non-Parallel and in-the-Wild Data.

ICASSP2024 Simon Welker, Tal Peer, Henry N. Chapman, Timo Gerkmann
Live Iterative Ptychography with Projection-Based Algorithms.

TASLP2023 Huajian Fang, Dennis Becker, Stefan Wermter, Timo Gerkmann
Integrating Uncertainty Into Neural Network-Based Speech Enhancement.

TASLP2023 Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann
StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation.

TASLP2023 Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann
Speech Enhancement and Dereverberation With Diffusion-Based Generative Models.

TASLP2023 Kristina Tesch, Timo Gerkmann
Insights Into Deep Non-Linear Filters for Improved Multi-Channel Speech Enhancement.

ICASSP2023 Huajian Fang, Timo Gerkmann
Uncertainty Estimation in Deep Speech Enhancement Using Complex Gaussian Mixture Models.

ICASSP2023 Huajian Fang, Niklas Wittmer, Johannes Twiefel, Stefan Wermter, Timo Gerkmann
Partially Adaptive Multichannel Joint Reduction of Ego-Noise and Environmental Noise.

ICASSP2023 Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann
Analysing Diffusion-based Generative Approaches Versus Discriminative Approaches for Speech Restoration.

ICASSP2023 Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Tal Peer, Timo Gerkmann
Speech Signal Improvement Using Causal Generative Diffusion Models.

ICASSP2023 Kristina Tesch, Timo Gerkmann
Spatially Selective Deep Non-Linear Filters For Speaker Extraction.

Interspeech2023 Bunlong Lay, Simon Welker, Julius Richter, Timo Gerkmann
Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement.

Interspeech2023 Jean-Marie Lemercier, Julian Tobergte, Timo Gerkmann
Extending DNN-based Multiplicative Masking to Deep Subband Filtering for Improved Dereverberation.

Interspeech2023 Héctor Martel, Julius Richter, Kai Li, Xiaolin Hu 0001, Timo Gerkmann
Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model.

Interspeech2023 Danilo de Oliveira, Navin Raj Prabhu, Timo Gerkmann
Leveraging Semantic Information for Efficient Self-Supervised Emotion Recognition with Audio-Textual Distilled Models.

ICASSP2022 Huajian Fang, Tal Peer, Stefan Wermter, Timo Gerkmann
Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement.

ICASSP2022 Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann
Customizable End-To-End Optimization Of Online Neural Network-Supported Dereverberation For Hearing Devices.

#104  | Joon-Hyuk Chang | DBLP Google Scholar  
By venue Interspeech: 23, ICASSP: 9, TASLP: 4
By year 2024: 3, 2023: 15, 2022: 14, 2021: 1, 2020: 2, 2019: 1
ISCA sessionsanalysis of speech and audio signals: 3speech synthesis: 2speaker and language recognition: 2voice conversion and adaptation: 2dereverberation and echo cancellation: 2speaker and language identification: 1novel transformer models for asr: 1speech recognition: 1speech emotion recognition: 1speech coding and enhancement: 1acoustic signal representation and analysis: 1spoofing-aware automatic speaker verification (sasv): 1resource-constrained asr: 1neural network training methods for asr: 1self-supervised, semi-supervised, adaptation and data augmentation for asr: 1novel models and training methods for asr: 1search/decoding techniques and confidence measures for asr: 1
IEEE keywordsdata models: 5speech recognition: 4semi supervised learning: 3data augmentation: 3automatic speech recognition: 3adaptation models: 3computational modeling: 2semisupervised learning: 2self training: 2signal processing algorithms: 2predictive models: 2deep neural network: 2speech dereverberation: 2microphones: 2single microphone: 2offline processing: 2weighted prediction error: 2reverberation: 2end to end speech recognition: 1transformers: 1self supervised learning: 1pseudo labeling: 1task analysis: 1diffusion model: 1noise reduction: 1vocoders: 1e2e tts: 1score based model: 1pipelines: 1neural transducer: 1filtering: 1unsupervised domain adaptation: 1personalization: 1synthesized data: 1domain adaptation: 1continual learning: 1representation learning: 1learning systems: 1time scale modification: 1clustering algorithms: 1heuristic algorithms: 1adaptive speaking rate: 1over the top media services: 1time scale reconstruction: 1dynamics: 1real time systems: 1speech: 1forced alignment: 1redundancy: 1auxiliary loss: 1error analysis: 1visualization: 1end to end neural diarization: 1multi head attention: 1voice activity detection: 1speaker diarization: 1degradation: 1weight averaging: 1representation based speech recognition model: 1additive noise: 1benchmark testing: 1robustness: 1self distillation: 1speech enhancement: 1film conditioning: 1stacking: 1frame wise ctc based posterior probability: 1estimation: 1modulation: 1text to speech: 1iterative decoding: 1attention based end to end speech synthesis: 1decoding: 1linguistics: 1spectrogram: 1speech synthesis: 1prediction theory: 1signal denoising: 1virtual acoustic channel expansion: 1optimisation: 1speaker verification: 1speaker recognition: 1cross modal distillation: 1multi task learning: 1deep learning (artificial intelligence): 1acoustic model: 1supervised learning: 1language model: 1knowledge distillation: 1natural language processing: 1interpolation: 1
Most publications (all venues) at 2023: 29, 2022: 21, 2016: 12, 2021: 11, 2010: 11


Recent publications

TASLP2024 Jae-Hong Lee, Joon-Hyuk Chang
Partitioning Attention Weight: Mitigating Adverse Effect of Incorrect Pseudo-Labels for Self-Supervised ASR.

ICASSP2024 Won-Gook Choi, Donghyun Seong, Joon-Hyuk Chang
Adversarial Learning on Compressed Posterior Space for Non-Iterative Score-based End-to-End Text-to-Speech.

ICASSP2024 Dong-Hyun Kim, Jae-Hong Lee, Joon-Hyuk Chang
Text-Only Unsupervised Domain Adaptation for Neural Transducer-Based ASR Personalization Using Synthesized Data.

ICASSP2023 Jin-Seong Choi, Jae-Hong Lee, Chae-Won Lee, Joon-Hyuk Chang
M-CTRL: A Continual Representation Learning Framework with Slowly Improving Past Pre-Trained Model.

ICASSP2023 Sohee Jang, Jiye Kim, Yeon-Ju Kim, Joon-Hyuk Chang
Adaptive Time-Scale Modification for Improving Speech Intelligibility Based On Phoneme Clustering For Streaming Services.

ICASSP2023 Ye-Rin Jeoung, Joon-Young Yang, Jeong-Hwan Choi, Joon-Hyuk Chang
Improving Transformer-Based End-to-End Speaker Diarization by Assigning Auxiliary Losses to Attention Heads.

ICASSP2023 Jae-Hong Lee, Dong-Hyun Kim, Joon-Hyuk Chang
Repackagingaugment: Overcoming Prediction Error Amplification in Weight-Averaged Speech Recognition Models Subject to Self-Training.

ICASSP2023 Ju-Seok Seong, Jeong-Hwan Choi, Jehyun Kyung, Ye-Rin Jeoung, Joon-Hyuk Chang
Noise-Aware Target Extension with Self-Distillation for Robust Speech Recognition.

ICASSP2023 Da-Hee Yang, Joon-Hyuk Chang
Selective Film Conditioning with CTC-Based ASR Probability for Speech Enhancement.

Interspeech2023 Min-Sang Baek, Joon-Young Yang, Joon-Hyuk Chang
Deeply Supervised Curriculum Learning for Deep Neural Network-based Sound Source Localization.

Interspeech2023 Jae-Heung Cho, Joon-Hyuk Chang
SR-SRP: Super-Resolution based SRP-PHAT for Sound Source Localization and Tracking.

Interspeech2023 Won-Gook Choi, Joon-Hyuk Chang
Resolution Consistency Training on Time-Frequency Domain for Semi-Supervised Sound Event Detection.

Interspeech2023 Won-Gook Choi, So-Jeong Kim, Tae-Ho Kim, Joon-Hyuk Chang
Prior-free Guided TTS: An Improved and Efficient Diffusion-based Text-Guided Speech Synthesis.

Interspeech2023 Ye-Rin Jeoung, Jeong-Hwan Choi, Ju-Seok Seong, Jehyun Kyung, Joon-Hyuk Chang
Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization.

Interspeech2023 Do-Hee Kim, Ji-Eun Choi, Joon-Hyuk Chang
Intra-ensemble: A New Method for Combining Intermediate Outputs in Transformer-based Automatic Speech Recognition.

Interspeech2023 Do-Hee Kim, Daeyeol Shim, Joon-Hyuk Chang
General-purpose Adversarial Training for Enhanced Automatic Speech Recognition Model Generalization.

Interspeech2023 Jehyun Kyung, Ju-Seok Seong, Jeong-Hwan Choi, Ye-Rin Jeoung, Joon-Hyuk Chang
Improving Joint Speech and Emotion Recognition Using Global Style Tokens.

Interspeech2023 JungPhil Park, Jeong-Hwan Choi, Yungyeo Kim, Joon-Hyuk Chang
HAD-ANC: A Hybrid System Comprising an Adaptive Filter and Deep Neural Networks for Active Noise Control.

TASLP2022 Moa Lee, Junmo Lee, Joon-Hyuk Chang
Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis.

TASLP2022 Joon-Young Yang, Joon-Hyuk Chang
VACE-WPE: Virtual Acoustic Channel Expansion Based on Neural Networks for Weighted Prediction Error-Based Speech Dereverberation.

#105  | Bhiksha Raj | DBLP Google Scholar  
By venue ICASSP: 14, Interspeech: 12, NeurIPS: 2, EMNLP: 2, ACL: 1, ACL-Findings: 1, NAACL-Findings: 1, AAAI: 1, TASLP: 1, ICLR: 1
By year 2024: 9, 2023: 9, 2022: 4, 2021: 5, 2020: 3, 2019: 5, 2018: 1
ISCA sessionsprivacy and security in speech communication: 2spoken language translation, information retrieval, summarization, resources, and evaluation: 1speech recognition: 1phonetics, phonology, and prosody: 1(multimodal) speech emotion recognition: 1novel models and training methods for asr: 1single-channel and multi-channel speech enhancement: 1acoustic event detection and acoustic scene classification: 1speaker recognition: 1speech synthesis: 1application of asr in medical practice: 1
IEEE keywordstask analysis: 3speech recognition: 3self supervised learning: 3speech enhancement: 3computational efficiency: 2audio classification: 2natural languages: 2measurement: 2recording: 2noise reduction: 2time frequency analysis: 2speech: 2audio signal processing: 2natural language processing: 2speaker recognition: 2convolutional neural nets: 2weak label learning: 1negative sampling: 1sampling methods: 1image classification: 1training data: 1prefix tuning: 1force: 1bridges: 1contrastive learning: 1text only training: 1adaptation models: 1automated audio captioning: 1prompt generation: 1contrastive language audio pre training: 1speech emotion recognition: 1acoustic properties: 1emotion recognition: 1prompt augmentation: 1emotion audio retrieval: 1ear: 1ser: 1buildings: 1benchmark: 1benchmark testing: 1instruction tuning: 1collaboration: 1speech summarization: 1synthetic summary: 1chatbots: 1large language model: 1chatgpt: 1data augmentation: 1annotations: 1probabilistic logic: 1diffusion models: 1signal to noise ratio: 1reverberation: 1generative models: 1speech editing: 1secure multiparty computation: 1oral communication: 1secure modular hashing: 1automatic speaker diarization: 1cryptography: 1real time systems: 1privacy: 1interpretability: 1impedance matching: 1acoustic measurements: 1acoustic parameters: 1phonetic alignment: 1perceptual quality: 1enhancement: 1explainable enhancement evaluation: 1data models: 1frequency estimation: 1pattern classification: 1adversarial attacks: 1high frequency: 1robustness: 1filtering theory: 1speaker identification: 1discrete cosine transforms: 1steganography: 1adversarial examples: 1neurons: 1computational modeling: 1model compression: 1annealing: 1compact representations: 1distillation: 1visualization: 1sound event detection: 1confidence intervals: 1pattern recognition: 1jackknife estimates: 1acoustic signal detection: 1weak labels: 1signal representation: 1signal classification: 1query processing: 1joint audio text embedding: 1search engines: 1cross modal retrieval: 1text analysis: 1audio search engine: 1content based audio retrieval: 1audio databases: 1siamese neural network: 1meta data: 1query by example: 1activity recognition: 1wifi sensing: 1channel state information: 1
Most publications (all venues) at 2023: 40, 2024: 38, 2012: 25, 2011: 22, 2022: 21

Affiliations
Carnegie Mellon University, Pittsburgh, USA

Recent publications

ICASSP2024 Ankit Shah 0001, Fuyu Tang, Zelin Ye, Rita Singh, Bhiksha Raj
Importance of Negative Sampling in Weak Label Learning.

ICASSP2024 Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang, 
Training Audio Captioning Models without Audio.

ICASSP2024 Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh, 
Prompting Audios Using Acoustic Properties for Emotion Representation.

ICASSP2024 Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan S. Sharma, Shinji Watanabe 0001, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee, 
Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech.

ICASSP2024 Jee-Weon Jung, Roshan S. Sharma, William Chen, Bhiksha Raj, Shinji Watanabe 0001, 
AugSumm: Towards Generalizable Speech Summarization Using Synthetic Labels from Large Language Models.

ICASSP2024 Muqiao Yang, Chunlei Zhang, Yong Xu 0004, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu 0001, 
uSee: Unified Speech Enhancement And Editing with Conditional Diffusion Models.

ACL2024 Roshan Sharma 0001, Suwon Shon, Mark Lindsey, Hira Dhamyal, Bhiksha Raj
Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization?

ACL-Findings2024 Umberto Cappellazzo, Enrico Fini, Muqiao Yang, Daniele Falavigna, Alessio Brutti, Bhiksha Raj
Continual Contrastive Spoken Language Understanding.

NAACL-Findings2024 Roshan Sharma 0001, Ruchira Sharma, Hira Dhamyal, Rita Singh, Bhiksha Raj
R-BASS : Relevance-aided Block-wise Adaptation for Speech Summarization.

ICASSP2023 Francisco Teixeira, Alberto Abad, Bhiksha Raj, Isabel Trancoso, 
Privacy-Preserving Automatic Speaker Diarization.

ICASSP2023 Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag Kumar 0003, Shinji Watanabe 0001, Bhiksha Raj
Paaploss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement.

ICASSP2023 Yunyang Zeng, Joseph Konan, Shuo Han, David Bick, Muqiao Yang, Anurag Kumar 0003, Shinji Watanabe 0001, Bhiksha Raj
TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement.

Interspeech2023 Roshan Sharma 0001, Siddhant Arora, Kenneth Zheng, Shinji Watanabe 0001, Rita Singh, Bhiksha Raj
BASS: Block-wise Adaptation for Speech Summarization.

Interspeech2023 Raphaël Olivier, Bhiksha Raj
There is more than one kind of robustness: Fooling Whisper with adversarial examples.

Interspeech2023 Liao Qu, Xianwei Zou, Xiang Li 0106, Yandong Wen, Rita Singh, Bhiksha Raj
The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features.

NeurIPS2023 Shentong Mo, Bhiksha Raj
Weakly-Supervised Audio-Visual Segmentation.

AAAI2023 Xiang Li 0003, Haoyuan Cao, Shijie Zhao 0001, Junlin Li, Li Zhang 0006, Bhiksha Raj
Panoramic Video Salient Object Detection with Ambisonic Audio Guidance.

EMNLP2023 Xiang Li 0106, Jinglu Wang, Xiaohao Xu, Muqiao Yang, Fan Yang, Yizhou Zhao, Rita Singh, Bhiksha Raj
Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text.

Interspeech2022 Hira Dhamyal, Bhiksha Raj, Rita Singh, 
Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection.

Interspeech2022 Raphaël Olivier, Bhiksha Raj
Recent improvements of ASR models in the face of adversarial attacks.

#106  | Meng Yu 0003 | DBLP Google Scholar  
By venue Interspeech: 19, ICASSP: 14, TASLP: 3
By year 2024: 1, 2023: 2, 2022: 4, 2021: 11, 2020: 7, 2019: 8, 2018: 3
ISCA sessionssource separation, dereverberation and echo cancellation: 2multi-channel speech enhancement: 2speech and audio source separation and scene analysis: 2asr for noisy and far-field speech: 2deep learning for source separation and pitch tracking: 2speech enhancement and bandwidth expansion: 1speech coding and enhancement: 1dereverberation and echo cancellation: 1source separation: 1speech localization, enhancement, and quality assessment: 1speech synthesis paradigms and methods: 1multimodal speech processing: 1speech enhancement: 1topics in speech recognition: 1
IEEE keywordsspeech recognition: 8speaker recognition: 7speech enhancement: 6speaker embedding: 5speech separation: 4source separation: 3reverberation: 2pattern clustering: 2microphone arrays: 2end to end speech recognition: 2overlapped speech: 2speaker verification: 2multi channel: 2artificial neural networks: 1loudspeakers: 1hybrid method: 1acoustic howling suppression: 1kalman filters: 1microphones: 1adaptation models: 1kalman filter: 1recursive training: 1acoustic environment: 1speech simulation: 1transient response: 1application program interfaces: 1graphics processing units: 1rnn t: 1code switched asr: 1natural language processing: 1bilingual asr: 1computational linguistics: 1speaker clustering: 1inference mechanisms: 1voice activity detection: 1overlap speech detection: 1speaker diarization: 1sensor fusion: 1audio visual systems: 1sound source separation: 1audio signal processing: 1audio visual processing: 1speech synthesis: 1mvdr: 1array signal processing: 1adl mvdr: 1target speaker speech recognition: 1targetspeaker speech extraction: 1uncertainty estimation: 1recurrent neural nets: 1direction of arrival estimation: 1source localization: 1text analysis: 1contrastive learning: 1semi supervised learning: 1data augmentation: 1unsupervised learning: 1self supervised learning: 1target speaker enhancement: 1robust speaker verification: 1interference suppression: 1multi look: 1end to end: 1multi channel speech separation: 1inter channel convolution differences: 1spatial filters: 1filtering theory: 1spatial features: 1joint learning: 1noise measurement: 1speaker aware: 1target speech enhancement: 1time domain analysis: 1gain: 1target speech extraction: 1minimisation: 1neural beamformer: 1signal reconstruction: 1permutation invariant training: 1encoding: 1model integration: 1multi band: 1nist: 1artificial intelligence: 1mel frequency cepstral coefficient: 1loss function: 1boundary: 1top k loss: 1task analysis: 1text dependent: 1end to end speaker verification: 1seq2seq attention: 1optimisation: 1siamese neural networks: 1
Most publications (all venues) at 2021: 17, 2019: 15, 2022: 10, 2020: 9, 2023: 7


Recent publications

TASLP2024 Hao Zhang, Yixuan Zhang 0005, Meng Yu 0003, Dong Yu 0001, 
Enhanced Acoustic Howling Suppression via Hybrid Kalman Filter and Deep Learning Models.

Interspeech2023 Yong Xu 0004, Vinay Kothapally, Meng Yu 0003, Shixiong Zhang, Dong Yu 0001, 
Zoneformer: On-device Neural Beamformer For In-car Multi-zone Speech Separation, Enhancement and Echo Cancellation.

Interspeech2023 Hao Zhang, Meng Yu 0003, Yuzhong Wu, Tao Yu, Dong Yu 0001, 
Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression.

ICASSP2022 Anton Ratnarajah, Shi-Xiong Zhang, Meng Yu 0003, Zhenyu Tang 0001, Dinesh Manocha, Dong Yu 0001, 
Fast-Rir: Fast Neural Diffuse Room Impulse Response Generator.

ICASSP2022 Brian Yan, Chunlei Zhang, Meng Yu 0003, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe 0001, Dong Yu 0001, 
Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization.

ICASSP2022 Chunlei Zhang, Jiatong Shi, Chao Weng, Meng Yu 0003, Dong Yu 0001, 
Towards end-to-end Speaker Diarization with Generalized Neural Speaker Clustering.

Interspeech2022 Vinay Kothapally, Yong Xu 0004, Meng Yu 0003, Shi-Xiong Zhang, Dong Yu 0001, 
Joint Neural AEC and Beamforming with Double-Talk Detection.

TASLP2021 Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu 0004, Meng Yu 0003, Dong Yu 0001, Jesper Jensen 0001, 
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation.

TASLP2021 Zhuohuang Zhang, Yong Xu 0004, Meng Yu 0003, Shi-Xiong Zhang, Lianwu Chen, Donald S. Williamson, Dong Yu 0001, 
Multi-Channel Multi-Frame ADL-MVDR for Target Speech Separation.

ICASSP2021 Jiatong Shi, Chunlei Zhang, Chao Weng, Shinji Watanabe 0001, Meng Yu 0003, Dong Yu 0001, 
Improving RNN Transducer with Target Speaker Extraction and Neural Uncertainty Estimation.

ICASSP2021 Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe 0001, Meng Yu 0003, Yong Xu 0004, Shi-Xiong Zhang, Dong Yu 0001, 
Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization.

ICASSP2021 Wei Xia, Chunlei Zhang, Chao Weng, Meng Yu 0003, Dong Yu 0001, 
Self-Supervised Text-Independent Speaker Verification Using Prototypical Momentum Contrastive Learning.

ICASSP2021 Chunlei Zhang, Meng Yu 0003, Chao Weng, Dong Yu 0001, 
Towards Robust Speaker Verification with Target Speaker Enhancement.

ICASSP2021 Naijun Zheng, Na Li 0012, Bo Wu, Meng Yu 0003, Jianwei Yu, Chao Weng, Dan Su 0002, Xunying Liu, Helen Meng, 
A Joint Training Framework of Multi-Look Separator and Speaker Embedding Extractor for Overlapped Speech.

Interspeech2021 Xiyun Li, Yong Xu 0004, Meng Yu 0003, Shi-Xiong Zhang, Jiaming Xu 0001, Bo Xu 0002, Dong Yu 0001, 
MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation.

Interspeech2021 Helin Wang, Bo Wu, Lianwu Chen, Meng Yu 0003, Jianwei Yu, Yong Xu 0004, Shi-Xiong Zhang, Chao Weng, Dan Su 0002, Dong Yu 0001, 
TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation.

Interspeech2021 Yong Xu 0004, Zhuohuang Zhang, Meng Yu 0003, Shi-Xiong Zhang, Dong Yu 0001, 
Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation.

Interspeech2021 Meng Yu 0003, Chunlei Zhang, Yong Xu 0004, Shi-Xiong Zhang, Dong Yu 0001, 
MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment.

ICASSP2020 Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu 0004, Meng Yu 0003, Dan Su 0002, Yuexian Zou, Dong Yu 0001, 
Enhancing End-to-End Multi-Channel Speech Separation Via Spatial Feature Learning.

ICASSP2020 Xuan Ji, Meng Yu 0003, Chunlei Zhang, Dan Su 0002, Tao Yu, Xiaoyu Liu, Dong Yu 0001, 
Speaker-Aware Target Speaker Enhancement by Jointly Learning with Speaker Embedding Extraction.

#107  | Samuel Thomas 0001 | DBLP Google Scholar  
By venue Interspeech: 18, ICASSP: 17, TASLP: 1
By year 2023: 6, 2022: 8, 2021: 8, 2020: 7, 2019: 5, 2018: 2
ISCA sessionsspoken language understanding: 3cross-lingual and multilingual asr: 1end-to-end spoken dialog systems: 1multi-, cross-lingual and other topics in asr: 1other topics in speech recognition: 1spoken language modeling and understanding: 1multi- and cross-lingual asr, other topics in asr: 1multimodal systems: 1low-resource speech recognition: 1multilingual and code-switched asr: 1asr neural network architectures: 1multimodal speech processing: 1model adaptation for asr: 1rich transcription and asr systems: 1adjusting to speaker, accent, and domain: 1neural network training strategies for asr: 1
IEEE keywordsspoken language understanding: 8speech recognition: 8automatic speech recognition: 6natural language processing: 6recurrent neural nets: 5text analysis: 4transducers: 3data models: 3interpolation: 2predictive models: 2data handling: 2end to end systems: 2rnn transducers: 2speaker change detection: 2testing: 2task analysis: 2oral communication: 1computationally inexpensive: 1recurrent neural networks: 1rnn transducer: 1semi supervised training: 1end to end asr: 1retrieval: 1multilingual: 1entropy: 1cross modal: 1knowledge distillation: 1analytical models: 1cross lingual: 1knowledge transfer: 1switches: 1transformers: 1bit error rate: 1filling: 1telephone sets: 1training data: 1dialog history: 1transforms: 1robustness: 1multi speaker: 1end to end: 1recording: 1weakly supervised learning: 1nearest neighbors: 1text classification: 1voice conversations: 1nearest neighbour methods: 1intent classification: 1software agents: 1virtual reality: 1attention: 1atis: 1decoding: 1speech coding: 1encoder decoder: 1interactive systems: 1spoken dialog system: 1end to end models: 1natural languages: 1adaptation: 1end to end mod els: 1language model customization: 1signal detection: 1affine transforms: 1speaker adaptation: 1speaker segmentation: 1speaker recognition: 1data analysis: 1transformer networks: 1end to end systems: 1self supervised pre training: 1synthetic speech augmentation: 1pre trained text embedding: 1speech to intent: 1image sequences: 1audio visual speech processing: 1image texture: 1u net: 1image restoration: 1image inpainting: 1video signal processing: 1encoding: 1dialog act recognition: 1text recognition: 1multiview training: 1non parallel data: 1loss measurement: 1microsoft windows: 1siamese networks: 1sequence embedding: 1n gram: 1subword: 1rnnlm: 1vocabulary: 1tem plate: 1broadcast news: 1deep neural networks.: 1
Most publications (all venues) at 2021: 11, 2022: 9, 2019: 9, 2010: 9, 2012: 8

Affiliations
IBM Research AI, Thomas J. Watson Research Center, NY, USA
Johns Hopkins University, USA (former)

Recent publications

ICASSP2023 Takashi Fukuda, Samuel Thomas 0001
Effective Training of RNN Transducer Models on Diverse Sources of Speech and Text Data.

ICASSP2023 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas 0001, Rogério Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James R. Glass, 
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval.

ICASSP2023 Vishal Sunder, Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Brian Kingsbury, Eric Fosler-Lussier, 
Fine-Grained Textual Knowledge Transfer to Improve RNN Transducers for Speech Recognition and Understanding.

ICASSP2023 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, George Saon, Brian Kingsbury, 
Multi-Speaker Data Augmentation for Improved end-to-end Automatic Speech Recognition.

Interspeech2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas 0001, Rogério Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James R. Glass, 
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages.

Interspeech2023 Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Brian Kingsbury, 
ConvKT: Conversation-Level Knowledge Transfer for Context Aware End-to-End Spoken Language Understanding.

ICASSP2022 Zvi Kons, Aharon Satt, Hong-Kwang Kuo, Samuel Thomas 0001, Boaz Carmeli, Ron Hoory, Brian Kingsbury, 
A New Data Augmentation Method for Intent Classification Enhancement and its Application on Spoken Conversation Datasets.

ICASSP2022 Hong-Kwang Jeff Kuo, Zoltán Tüske, Samuel Thomas 0001, Brian Kingsbury, George Saon, 
Improving End-to-end Models for Set Prediction in Spoken Language Understanding.

ICASSP2022 Vishal Sunder, Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Jatin Ganhotra, Brian Kingsbury, Eric Fosler-Lussier, 
Towards End-to-End Integration of Dialog History for Improved Spoken Language Understanding.

ICASSP2022 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Brian Kingsbury, George Saon, 
Towards Reducing the Need for Speech Training Data to Build Spoken Language Understanding Systems.

ICASSP2022 Samuel Thomas 0001, Brian Kingsbury, George Saon, Hong-Kwang Jeff Kuo, 
Integrating Text Inputs for Training and Adapting RNN Transducer ASR Models.

Interspeech2022 Takashi Fukuda, Samuel Thomas 0001, Masayuki Suzuki, Gakuto Kurata, George Saon, Brian Kingsbury, 
Global RNN Transducer Models For Multi-dialect Speech Recognition.

Interspeech2022 Zvi Kons, Hagai Aronowitz, Edmilson da Silva Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas 0001, George Saon, 
Extending RNN-T-based speech recognition systems with emotion and language classification.

Interspeech2022 Vishal Sunder, Eric Fosler-Lussier, Samuel Thomas 0001, Hong-Kwang Kuo, Brian Kingsbury, 
Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems.

TASLP2021 Leda Sari, Mark Hasegawa-Johnson, Samuel Thomas 0001
Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection.

ICASSP2021 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, 
RNN Transducer Models for Spoken Language Understanding.

ICASSP2021 Edmilson da Silva Morais, Hong-Kwang Jeff Kuo, Samuel Thomas 0001, Zoltán Tüske, Brian Kingsbury, 
End-to-End Spoken Language Understanding Using Transformer Networks and Self-Supervised Pre-Trained Features.

Interspeech2021 Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Jeff Kuo, Samuel Thomas 0001, Edmilson da Silva Morais, 
Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs.

Interspeech2021 Takashi Fukuda, Samuel Thomas 0001
Knowledge Distillation Based Training of Universal ASR Source Models for Cross-Lingual Transfer.

Interspeech2021 Jatin Ganhotra, Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury, 
Integrating Dialog History into End-to-End Spoken Language Understanding Systems.

#108  | Jee-Weon Jung | DBLP Google Scholar  
By venue Interspeech: 19, ICASSP: 14, TASLP: 1, NAACL: 1, ACL-Findings: 1
By year 2024: 10, 2023: 9, 2022: 4, 2021: 4, 2020: 4, 2019: 4, 2018: 1
ISCA sessionsspeaker recognition: 3speaker and language recognition: 3speaker and language identification: 2speaker diarization: 2anti-spoofing for speaker verification: 1speech coding and enhancement: 1spoofing-aware automatic speaker verification (sasv): 1graph and end-to-end learning for speaker recognition: 1acoustic scene classification: 1anti-spoofing and liveness detection: 1speech and audio characterization and segmentation: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1speaker verification using neural network methods: 1
IEEE keywordsspeaker diarisation: 6speaker recognition: 4speech recognition: 4annotations: 3speaker verification: 3end to end: 3training data: 2data models: 2task analysis: 2self supervised learning: 2correlation: 2pipelines: 2recording: 2data augmentation: 2graph attention networks: 2graph theory: 2redundancy: 1discrete units: 1speech translation: 1spoken language understanding: 1systematics: 1data processing: 1probes: 1mutual information: 1linear probing: 1interpretability: 1information theory: 1analytical models: 1solids: 1behavioral sciences: 1conversational speech recognition: 1conversation transcription: 1multi talker automatic speech recognition: 1speaker diarization: 1speech summarization: 1synthetic summary: 1chatbots: 1large language model: 1measurement: 1chatgpt: 1probabilistic logic: 1costs: 1oral communication: 1dataset: 1audio visual: 1multitask: 1spoken language model: 1speech synthesis: 1vocabulary: 1time frequency analysis: 1speech enhancement: 1computational modeling: 1reproducibility of results: 1sampling frequency independent: 1microphone number invariant: 1frequency diversity: 1computer architecture: 1universal speech enhancement: 1embedding extractor: 1aggregates: 1reliability: 1data mining: 1evaluation protocol: 1protocols: 1noise robustness: 1speaker embeddings: 1noise robust: 1speech coding: 1background noise: 1visualization: 1dimensionality reduction: 1codes: 1online speaker diarisation: 1dual buffer: 1centroid: 1silhouette coefficient: 1real time systems: 1anti spoofing: 1audio spoofing detection: 1heterogeneous: 1pattern clustering: 1multi scale: 1gaussian processes: 1graph neural network: 1graph attention network: 1signal classification: 1
Most publications (all venues) at 2024: 22, 2023: 12, 2022: 11, 2020: 11, 2021: 9


Recent publications

TASLP2024 Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown 0006, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman, 
The VoxCeleb Speaker Recognition Challenge: A Retrospective.

ICASSP2024 Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon Jung, Yichen Lu, Soumi Maiti, Roshan S. Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe 0001, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang, 
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.

ICASSP2024 Kwanghee Choi, Jee-Weon Jung, Shinji Watanabe 0001, 
Understanding Probe Behaviors Through Variational Bounds of Mutual Information.

ICASSP2024 Samuele Cornell, Jee-Weon Jung, Shinji Watanabe 0001, Stefano Squartini, 
One Model to Rule Them All ? Towards End-to-End Joint Speaker Diarization and Speech Recognition.

ICASSP2024 Jee-Weon Jung, Roshan S. Sharma, William Chen, Bhiksha Raj, Shinji Watanabe 0001, 
AugSumm: Towards Generalizable Speech Summarization Using Synthetic Labels from Large Language Models.

ICASSP2024 Doyeop Kwak, Jaemin Jung, Kihyun Nam, Youngjoon Jang, Jee-Weon Jung, Shinji Watanabe 0001, Joon Son Chung, 
VoxMM: Rich Transcription of Conversations in the Wild.

ICASSP2024 Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-Weon Jung, Xuankai Chang, Shinji Watanabe 0001, 
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks.

ICASSP2024 Wangyou Zhang, Jee-weon Jung, Yanmin Qian, 
Improving Design of Input Condition Invariant Speech Enhancement.

NAACL2024 Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan S. Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe 0001, 
UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions.

ACL-Findings2024 Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan S. Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe 0001, 
On the Evaluation of Speech Foundation Models for Spoken Language Understanding.

ICASSP2023 Hee-Soo Heo, Youngki Kwon, Bong-Jin Lee, You Jin Kim, Jee-Weon Jung
High-Resolution Embedding Extractor for Speaker Diarisation.

ICASSP2023 Jee-Weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown 0006, Youngki Kwon, Shinji Watanabe 0001, Joon Son Chung, 
In Search of Strong Embedding Extractors for Speaker Diarisation.

ICASSP2023 You Jin Kim, Hee-Soo Heo, Jee-Weon Jung, Youngki Kwon, Bong-Jin Lee, Joon Son Chung, 
Advancing the Dimensionality Reduction of Speaker Embeddings for Speaker Diarisation: Disentangling Noise and Informing Speech Activity.

ICASSP2023 Youngki Kwon, Hee-Soo Heo, Bong-Jin Lee, You Jin Kim, Jee-Weon Jung
Absolute Decision Corrupts Absolutely: Conservative Online Speaker Diarisation.

Interspeech2023 Hee-Soo Heo, Jee-weon Jung, Jingu Kang, Youngki Kwon, Bong-Jin Lee, You Jin Kim, Joon Son Chung, 
Curriculum Learning for Self-supervised Speaker Verification.

Interspeech2023 Jee-weon Jung, Soonshin Seo, Hee-Soo Heo, Geonmin Kim, You Jin Kim, Youngki Kwon, Minjae Lee, Bong-Jin Lee, 
Encoder-decoder Multimodal Speaker Change Detection.

Interspeech2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang 0037, Xuechen Liu, Md. Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas W. D. Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung
Towards Single Integrated Spoofing-aware Speaker Verification Embeddings.

Interspeech2023 Kihyun Nam, Youkyum Kim, Jaesung Huh, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung, 
Disentangled Representation Learning for Multilingual Speaker Recognition.

Interspeech2023 Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen, 
Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing.

ICASSP2022 Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, Nicholas W. D. Evans, 
AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks.

#109  | Abdel-rahman Mohamed | DBLP Google Scholar  
By venue ICASSP: 14, Interspeech: 11, ACL: 4, TASLP: 3, ICLR: 1, EMNLP: 1, NeurIPS: 1
By year 2024: 6, 2023: 8, 2022: 9, 2021: 6, 2020: 6
ISCA sessionsspeech recognition: 3spoken language processing: 1zero, low-resource and multi-modal speech recognition: 1speaker recognition and anti-spoofing: 1self-supervised, semi-supervised, adaptation and data augmentation for asr: 1cross/multi-lingual and code-switched asr: 1speech synthesis: 1speech signal analysis and representation: 1new trends in self-supervised speech processing: 1
IEEE keywordsspeech recognition: 11self supervised learning: 7task analysis: 6representation learning: 4benchmark testing: 4natural language processing: 4supervised learning: 4unsupervised learning: 3evaluation: 2computational modeling: 2analytical models: 2speech: 2spoken language understanding: 2semantics: 2adaptation models: 2text analysis: 2biological system modeling: 1task generalization: 1benchmark: 1protocols: 1foundation model: 1phonetics: 1unsupervised unit discovery: 1organizations: 1cross lingual and multilingual speech processing: 1data models: 1transforms: 1kinematics: 1articulatory kinematics: 1spoken question answering: 1question answering (information retrieval): 1manuals: 1spoken content retrieval: 1audio visual learning: 1soft sensors: 1visualization: 1artificial intelligence: 1machine translation: 1encoder decoder models: 1modularity: 1decoding: 1predictive models: 1end to end: 1vocabulary: 1probing analysis: 1production: 1acoustic measurements: 1grounding: 1speech representation: 1acoustic to articulatory inversion: 1electromagnetic articulography (ema): 1domain adaptation: 1continual learning: 1on device: 1signal processing algorithms: 1asr: 1smoothing methods: 1measurement: 1self supervision: 1unit discovery: 1training data: 1multilingual: 1multi softmax: 1rnn t: 1pipelines: 1tokenization: 1bert: 1pattern clustering: 1pre training: 1signal representation: 1pseudo labeling: 1contrastive learning: 1signal classification: 1distant supervision: 1unsupervised and semi supervised learning: 1audio signal processing: 1zero and low resource asr.: 1dataset: 1social networking (online): 1end to end asr: 1weak supervision: 1acoustic modeling: 1hybrid speech recognition: 1transformer: 1recurrent neural networks: 1
Most publications (all venues) at 2023: 17, 2022: 17, 2021: 8, 2024: 7, 2020: 7


Recent publications

TASLP2024 Shu-Wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li 0001, Abdelrahman Mohamed, Shinji Watanabe 0001, Hung-yi Lee, 
A Large-Scale Evaluation of Speech Foundation Models.

ICASSP2024 Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li 0001, Alan W. Black, Gopala Krishna Anumanchipalli, 
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in Hubert.

ICASSP2024 Cheol Jun Cho, Abdelrahman Mohamed, Alan W. Black, Gopala Krishna Anumanchipalli, 
Self-Supervised Models of Speech Infer Universal Articulatory Kinematics.

ICASSP2024 Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li 0001, Abdelrahman Mohamed, Hung-Yi Lee, Lin-Shan Lee, 
SpeechDPR: End-To-End Spoken Passage Retrieval For Open-Domain Spoken Question Answering.

ICASSP2024 Yuan Tseng, Layne Berry, Yiting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Poyao Huang 0001, Chun-Mao Lai, Shang-Wen Li 0001, David Harwath, Yu Tsao 0001, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee, 
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models.

ACL2024 Puyuan Peng, Po-Yao Huang 0001, Shang-Wen Li 0001, Abdelrahman Mohamed, David Harwath, 
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild.

TASLP2023 Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe 0001, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed
LegoNN: Building Modular Encoder-Decoder Models.

ICASSP2023 Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, Gopala Krishna Anumanchipalli, 
Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech.

ICASSP2023 Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed
Continual Learning for On-Device Speech Recognition Using Disentangled Conformers.

ICASSP2023 Ali Elkahky, Wei-Ning Hsu, Paden Tomasello, Tu Anh Nguyen, Robin Algayres, Yossi Adi, Jade Copet, Emmanuel Dupoux, Abdelrahman Mohamed
Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training?

ICASSP2023 Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman Mohamed, Duc Le, Michael L. Seltzer, 
Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities.

Interspeech2023 Florian L. Kreyssig, Yangyang Shi, Jinxi Guo, Leda Sari, Abdel-rahman Mohamed, Philip C. Woodland, 
Biased Self-supervised Learning for ASR.

Interspeech2023 Puyuan Peng, Shang-Wen Li 0001, Okko Räsänen, Abdelrahman Mohamed, David Harwath, 
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model.

Interspeech2023 Jiatong Shi, Dan Berrebbi, William Chen, En-Pei Hu, Wei-Ping Huang, Ho-Lam Chung, Xuankai Chang, Shang-Wen Li 0001, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe 0001, 
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark.

Interspeech2022 Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-Wen Yang, Hsuan-Jui Chen, Shuyan Annie Dong, Shang-Wen Li 0001, Abdelrahman Mohamed, Hung-yi Lee, Lin-Shan Lee, 
DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering.

Interspeech2022 Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
Robust Self-Supervised Audio-Visual Speech Recognition.

Interspeech2022 Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu, 
Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT.

Interspeech2022 Weiyi Zheng, Alex Xiao, Gil Keren, Duc Le, Frank Zhang 0001, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed
Scaling ASR Improves Zero and Few Shot Learning.

ICLR2022 Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.

ACL2022 Eugene Kharitonov, Ann Lee 0001, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu, 
Text-Free Prosody-Aware Generative Spoken Language Modeling.

#110  | Xie Chen 0001 | DBLP Google Scholar  
By venue ICASSP: 18, Interspeech: 12, TASLP: 3, AAAI: 1, ACL-Findings: 1
By year 2024: 9, 2023: 13, 2022: 3, 2021: 4, 2020: 1, 2019: 4, 2018: 1
ISCA sessionsspeech recognition: 4speech synthesis: 2novel transformer models for asr: 1self-supervised learning in asr: 1novel models and training methods for asr: 1self-supervision and semi-supervision for neural asr training: 1neural network training methods for asr: 1language modeling: 1
IEEE keywordsspeech recognition: 13self supervised learning: 5recurrent neural nets: 5predictive models: 4vocabulary: 4adaptation models: 4transducers: 3factorized neural transducer: 3task analysis: 3natural language processing: 3transformers: 2text to speech: 2timbre: 2recording: 2speech synthesis: 2speech enhancement: 2adapter: 2error analysis: 2decoding: 2noise reduction: 2language model: 2optimisation: 2long content speech recognition: 1streaming and non streaming: 1context modeling: 1rnn t: 1computer architecture: 1efficiency: 1flow matching: 1mathematical models: 1rectified flow: 1trajectory: 1speed quality tradeoff: 1signal processing algorithms: 1speaker embedding free: 1stability analysis: 1zero shot voice conversion: 1linguistics: 1semantics: 1cross attention: 1rhetoric: 1expressive text to speech: 1tts dataset: 1labeling: 1large language models: 1annotations: 1manuals: 1textual expressiveness: 1systematics: 1speech emotion recognition: 1data mining: 1emotion recognition: 1text generation: 1data augmentation: 1synthetic data: 1byte pair encoding: 1syntactics: 1rescore: 1discrete audio token: 1language modeling: 1correlation: 1degradation: 1multitasking: 1measurement: 1discrete tokens: 1speaker adaptation: 1timbre normalization: 1vocoders: 1vector quantization: 1automatic speech recognition: 1computational modeling: 1front end feature: 1filter banks: 1fuses: 1long form speech recognition: 1context and speech encoder: 1costs: 1domain adaptation: 1factorized aed: 1end to end speech recognition: 1text only: 1interpolation: 1classifier guidance: 1emotion intensity control: 1controllability: 1optimization: 1de noising diffusion models: 1emotional tts: 1speech separation: 1performance gain: 1transformer transducer: 1data models: 1language model adaptation: 1real time decoding: 1transducer: 1transformer: 1encoding: 1attention based encoder decoder: 1recurrent neural network transducer: 1quantization: 1recurrent neural networks: 1data compression: 1alternating direction methods of multipliers: 1language models: 1quantisation (signal): 1probability: 1keyword search: 1feedforward: 1recurrent neural network: 1succeeding words: 1pattern clustering: 1importance sampling: 1computational linguistics: 1maximum entropy language model: 1noise contrastive estimation: 1sampled softmax: 1maximum entropy methods: 1variational inference: 1bayes methods: 1neural network language models: 1lstm: 1parameter estimation: 1gaussian processes: 1entropy: 1natural gradient: 1rnnlms: 1gradient methods: 1
Most publications (all venues) at 2024: 33, 2023: 21, 2021: 7, 2015: 6, 2019: 5

Affiliations
Shanghai Jiao Tong University, China
Microsoft, Redmond, WA, USA (former)
University of Cambridge, UK (former)

Recent publications

TASLP2024 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
Advanced Long-Content Speech Recognition With Factorized Neural Transducer.

ICASSP2024 Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen 0001, Kai Yu 0004, 
VoiceFlow: Efficient Text-To-Speech with Rectified Flow Matching.

ICASSP2024 Junjie Li, Yiwei Guo, Xie Chen 0001, Kai Yu 0004, 
SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention.

ICASSP2024 Sen Liu, Yiwei Guo, Xie Chen 0001, Kai Yu 0004, 
StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations.

ICASSP2024 Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen 0003, Shiliang Zhang, Xie Chen 0001
Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition.

ICASSP2024 Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen 0001, Kai Yu 0004, 
Acoustic BPE for Speech Generation with Discrete Tokens.

ICASSP2024 Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu 0004, Daniel Povey, Xie Chen 0001
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS.

AAAI2024 Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen 0001, Shuai Wang 0016, Hui Zhang, Kai Yu 0004, 
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.

ACL-Findings2024 Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen 0001
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation.

TASLP2023 Chenpeng Du, Yiwei Guo, Xie Chen 0001, Kai Yu 0004, 
Speaker Adaptive Text-to-Speech With Timbre-Normalized Vector-Quantized Feature.

ICASSP2023 Xie Chen 0001, Ziyang Ma, Changli Tang, Yujin Wang, Zhisheng Zheng, 
Front-End Adapter: Adapting Front-End Input of Speech Based Self-Supervised Learning for Speech Recognition.

ICASSP2023 Xun Gong 0005, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001, Rui Zhao 0017, Xie Chen 0001, Yanmin Qian, 
LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer.

ICASSP2023 Xun Gong 0005, Wei Wang 0010, Hang Shao, Xie Chen 0001, Yanmin Qian, 
Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR.

ICASSP2023 Yiwei Guo, Chenpeng Du, Xie Chen 0001, Kai Yu 0004, 
Emodiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance.

ICASSP2023 Tianrui Wang, Xie Chen 0001, Zhuo Chen, Shu Yu, Weibin Zhu, 
An Adapter Based Multi-Label Pre-Training for Speech Separation and Enhancement.

Interspeech2023 Mingyu Cui, Jiawen Kang 0002, Jiajun Deng, Xi Yin 0010, Yutao Xie, Xie Chen 0001, Xunying Liu, 
Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems.

Interspeech2023 Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu 0004, Xie Chen 0001
Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation.

Interspeech2023 Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen 0001, Kai Yu 0004, 
DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech.

Interspeech2023 Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen 0001
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets.

Interspeech2023 Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang 0027, Chao Zhang 0031, Xie Chen 0001
Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation.

#111  | Yanzhang He | DBLP Google Scholar  
By venue: ICASSP: 17, Interspeech: 17, NAACL: 1
By year: 2024: 3, 2023: 5, 2022: 10, 2021: 8, 2020: 7, 2019: 2
ISCA sessions: asr: 2; speech recognition: 1; asr technologies and systems: 1; speech segmentation: 1; search/decoding algorithms for asr: 1; multi-, cross-lingual and other topics in asr: 1; novel models and training methods for asr: 1; resource-constrained asr: 1; search/decoding techniques and confidence measures for asr: 1; spoken term detection & voice search: 1; streaming for asr/rnn transducers: 1; speech classification: 1; streaming asr: 1; evaluation of speech technology systems and methods for resource construction and annotation: 1; single-channel speech enhancement: 1; asr neural network architectures: 1
IEEE keywordsspeech recognition: 13recurrent neural nets: 6end to end asr: 4error analysis: 3decoding: 3conformer: 3automatic speech recognition: 3natural language processing: 3rnn t: 3probability: 3speech coding: 3computational modeling: 2computational efficiency: 2vocabulary: 2confidence scores: 2latency: 2optimisation: 2sparsity: 1costs: 1topology: 1model pruning: 1model quantization: 1quantization (signal): 1universal speech model: 1runtime efficiency: 1computational latency: 1large models: 1task analysis: 1weight sharing: 1machine learning: 1low rank decomposition: 1model compression: 1wearable computers: 1program processors: 1data models: 1embedded speech recognition: 1video on demand: 1segmentation: 1earth observing system: 1decoding algorithms: 1real time systems: 1signal processing algorithms: 1asr: 1noise robustness: 1speech enhancement: 1noise robust asr: 1speaker embedding: 1noise measurement: 1voice filter: 1modulation: 1transducers: 1network architecture: 1multitasking: 1capitalization: 1joint network: 1rnn transducer: 1predictive models: 1pause prediction: 1semi supervised learning (artificial intelligence): 1domain adaptation: 1self supervised learning: 1semi supervised learning: 1out of domain: 1feature selection: 1estimation theory: 1end to end: 1two pass asr: 1rnnt: 1long form asr: 1speaker recognition: 1cascaded encoders: 1hidden markov models: 1mean square error methods: 1transformer: 1calibration: 1confidence: 1voice activity detection: 1attention based end to end models: 1regression analysis: 1endpointer: 1text analysis: 1supervised learning: 1mobile handsets: 1
Most publications (all venues) at 2022: 14, 2021: 11, 2023: 10, 2020: 7, 2019: 4

Recent publications

ICASSP2024 Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li 0028, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal, 
USM-Lite: Quantization and Sparsity Aware Fine-Tuning for Speech Recognition with Universal Speech Models.

ICASSP2024 Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno 0001, 
Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models.

NAACL2024 Weiran Wang, Rohit Prabhavalkar, Haozhe Shan, Zhong Meng, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li 0028, James Qin, Xingyu Cai, Adam Stooke, Chengjian Zheng, Yanzhang He, Tara N. Sainath, Pedro Moreno Mengibar, 
Massive End-to-end Speech Recognition Models with Time Reduction.

ICASSP2023 Steven M. Hernandez, Ding Zhao, Shaojin Ding, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He, Ian McGraw, 
Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models.

ICASSP2023 W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman, 
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model.

ICASSP2023 Tom O'Malley, Shaojin Ding, Arun Narayanan, Quan Wang, Rajeev Rikhye, Qiao Liang 0001, Yanzhang He, Ian McGraw, 
Conditional Conformer: Improving Speaker Modulation For Single And Multi-User Speech Enhancement.

ICASSP2023 Weiran Wang, Ding Zhao, Shaojin Ding, Hao Zhang 0010, Shuo-Yiin Chang, David Rybach, Tara N. Sainath, Yanzhang He, Ian McGraw, Shankar Kumar, 
Multi-Output RNN-T Joint Networks for Multi-Task Learning of ASR and Auxiliary Tasks.

Interspeech2023 Oleg Rybakov, Phoenix Meadowlark, Shaojin Ding, David Qiu, Jian Li, David Rim, Yanzhang He
2-bit Conformer quantization for automatic speech recognition.

ICASSP2022 Dongseong Hwang, Ananya Misra, Zhouyuan Huo, Nikhil Siddhartha, Shefali Garg, David Qiu, Khe Chai Sim, Trevor Strohman, Françoise Beaufays, Yanzhang He
Large-Scale ASR Domain Adaptation Using Self- and Semi-Supervised Learning.

ICASSP2022 Qiujia Li, Yu Zhang 0033, David Qiu, Yanzhang He, Liangliang Cao, Philip C. Woodland, 
Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition.

ICASSP2022 Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Weiran Wang, David Qiu, Chung-Cheng Chiu, Rohit Prabhavalkar, Alexander Gruenstein, Anmol Gulati, Bo Li 0028, David Rybach, Emmanuel Guzman, Ian McGraw, James Qin, Krzysztof Choromanski, Qiao Liang 0001, Robert David, Ruoming Pang, Shuo-Yiin Chang, Trevor Strohman, W. Ronny Huang, Wei Han 0002, Yonghui Wu, Yu Zhang 0033, 
Improving The Latency And Quality Of Cascaded Encoders.

Interspeech2022 Shuo-Yiin Chang, Bo Li 0028, Tara N. Sainath, Chao Zhang 0031, Trevor Strohman, Qiao Liang 0001, Yanzhang He
Turn-Taking Prediction for Natural Conversational Speech.

Interspeech2022 Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov, 
4-bit Conformer with Native Quantization Aware Training for Speech Recognition.

Interspeech2022 Shaojin Ding, Rajeev Rikhye, Qiao Liang 0001, Yanzhang He, Quan Wang, Arun Narayanan, Tom O'Malley, Ian McGraw, 
Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition.

Interspeech2022 Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang 0016, Rina Panigrahy, Qiao Liang 0001, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman, 
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes.

Interspeech2022 Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang, 
Improving Deliberation by Text-Only and Semi-Supervised Training.

Interspeech2022 Bo Li 0028, Tara N. Sainath, Ruoming Pang, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang 0001, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani, 
A Language Agnostic Multilingual Streaming On-Device ASR System.

Interspeech2022 Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach, 
Improving Rare Word Recognition with LM-aware MWER Training.

ICASSP2021 Bo Li 0028, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han 0002, Qiao Liang 0001, Yu Zhang 0033, Trevor Strohman, Yonghui Wu, 
A Better and Faster end-to-end Model for Streaming ASR.

ICASSP2021 Qiujia Li, David Qiu, Yu Zhang 0033, Bo Li 0028, Yanzhang He, Philip C. Woodland, Liangliang Cao, Trevor Strohman, 
Confidence Estimation for Attention-Based Sequence-to-Sequence Models for Speech Recognition.

#112  | Shoukang Hu | DBLP Google Scholar  
By venue: Interspeech: 17, ICASSP: 12, TASLP: 6
By year: 2023: 2, 2022: 8, 2021: 10, 2020: 3, 2019: 10, 2018: 2
ISCA sessions: speech and language in health: 2; speech recognition of atypical speech: 2; topics in asr: 2; medical applications and visual asr: 2; speech recognition: 1; multi-, cross-lingual and other topics in asr: 1; miscellaneous topics in speech, voice and hearing disorders: 1; zero, low-resource and multi-modal speech recognition: 1; speech and speaker recognition: 1; asr neural network architectures: 1; lexicon and language model for speech recognition: 1; novel neural network architectures for acoustic modelling: 1; application of asr in medical practice: 1
IEEE keywordsspeech recognition: 16bayes methods: 7recurrent neural nets: 5neural architecture search: 4bayesian learning: 4gaussian processes: 4optimisation: 4natural language processing: 4deep learning (artificial intelligence): 3speaker recognition: 3transformer: 3language models: 3quantisation (signal): 3time delay neural network: 2model uncertainty: 2neural language models: 2domain adaptation: 2adaptation models: 2speech emotion recognition: 2emotion recognition: 2variational inference: 2inference mechanisms: 2speaker adaptation: 2gradient methods: 2admm: 2quantization: 2search problems: 1uncertainty handling: 1minimisation: 1neural net architecture: 1monte carlo methods: 1error analysis: 1articulatory inversion: 1dysarthric speech: 1hybrid power systems: 1benchmark testing: 1uniform sampling: 1path dropout: 1delays: 1generalisation (artificial intelligence): 1lf mmi: 1gaussian process: 1handicapped aids: 1data augmentation: 1multimodal speech recognition: 1disordered speech recognition: 1lstm rnn: 1low bit quantization: 1image recognition: 1microphone arrays: 1audio visual: 1visual occlusion: 1overlapped speech recognition: 1multi channel: 1jointly fine tuning: 1speech separation: 1filtering theory: 1video signal processing: 1training data: 1switches: 1estimation: 1uncertainty: 1automatic speech recognition: 1neurocognitive disorder detection: 1elderly speech: 1dementia: 1recurrent neural networks: 1data compression: 1alternating direction methods of multipliers: 1gaussian process neural network: 1activation function selection: 1bayesian neural network: 1neural network language models: 1lstm: 1parameter estimation: 1utterance level features: 1spatial relationship information: 1convolutional neural nets: 1recurrent connection: 1capsule networks: 1hidden markov models: 1maximum likelihood estimation: 1lhuc: 1entropy: 1natural gradient: 1rnnlms: 1
Most publications (all venues) at 2022: 14, 2021: 10, 2019: 10, 2024: 8, 2023: 6

Recent publications

Interspeech2023 Zhaoqing Li, Tianzi Wang, Jiajun Deng, Junhao Xu, Shoukang Hu, Xunying Liu, 
Lossless 4-bit Quantization of Architecture Compressed Conformer ASR Systems on the 300-hr Switchboard Corpus.

Interspeech2023 Tianzi Wang, Shoukang Hu, Jiajun Deng, Zengrui Jin, Mengzhe Geng, Yi Wang, Helen Meng, Xunying Liu, 
Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition.

TASLP2022 Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng, 
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks.

TASLP2022 Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng, 
Bayesian Neural Network Language Modeling for Speech Recognition.

ICASSP2022 Shujie Hu, Shansong Liu, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shoukang Hu, Mingyu Cui, Xunying Liu, Helen Meng, 
Exploiting Cross Domain Acoustic-to-Articulatory Inverted Features for Disordered Speech Recognition.

ICASSP2022 Xixin Wu, Shoukang Hu, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Neural Architecture Search for Speech Emotion Recognition.

Interspeech2022 Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng, 
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems.

Interspeech2022 Tianzi Wang, Jiajun Deng, Mengzhe Geng, Zi Ye 0001, Shoukang Hu, Yi Wang, Mingyu Cui, Zengrui Jin, Xunying Liu, Helen Meng, 
Conformer Based Elderly Speech Recognition System for Alzheimer's Disease Detection.

Interspeech2022 Yi Wang, Tianzi Wang, Zi Ye 0001, Lingwei Meng, Shoukang Hu, Xixin Wu, Xunying Liu, Helen Meng, 
Exploring linguistic feature and model combination for speech recognition based automatic AD detection.

Interspeech2022 Junhao Xu, Shoukang Hu, Xunying Liu, Helen Meng, 
Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus.

TASLP2021 Shoukang Hu, Xurong Xie, Shansong Liu, Jianwei Yu, Zi Ye 0001, Mengzhe Geng, Xunying Liu, Helen Meng, 
Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition.

TASLP2021 Shansong Liu, Mengzhe Geng, Shoukang Hu, Xurong Xie, Mingyu Cui, Jianwei Yu, Xunying Liu, Helen Meng, 
Recent Progress in the CUHK Dysarthric Speech Recognition System.

TASLP2021 Junhao Xu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng, 
Mixed Precision Low-Bit Quantization of Neural Network Language Models for Speech Recognition.

TASLP2021 Jianwei Yu, Shi-Xiong Zhang, Bo Wu, Shansong Liu, Shoukang Hu, Mengzhe Geng, Xunying Liu, Helen Meng, Dong Yu 0001, 
Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech.

ICASSP2021 Shoukang Hu, Xurong Xie, Shansong Liu, Mingyu Cui, Mengzhe Geng, Xunying Liu, Helen Meng, 
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks.

ICASSP2021 Junhao Xu, Shoukang Hu, Jianwei Yu, Xunying Liu, Helen Meng, 
Mixed Precision Quantization of Transformer Language Models for Speech Recognition.

ICASSP2021 Boyang Xue, Jianwei Yu, Junhao Xu, Shansong Liu, Shoukang Hu, Zi Ye 0001, Mengzhe Geng, Xunying Liu, Helen Meng, 
Bayesian Transformer Language Models for Speech Recognition.

ICASSP2021 Zi Ye 0001, Shoukang Hu, Jinchao Li, Xurong Xie, Mengzhe Geng, Jianwei Yu, Junhao Xu, Boyang Xue, Shansong Liu, Xunying Liu, Helen Meng, 
Development of the Cuhk Elderly Speech Recognition System for Neurocognitive Disorder Detection Using the Dementiabank Corpus.

Interspeech2021 Jiajun Deng, Fabian Ritter Gutierrez, Shoukang Hu, Mengzhe Geng, Xurong Xie, Zi Ye 0001, Shansong Liu, Jianwei Yu, Xunying Liu, Helen Meng, 
Bayesian Parametric and Architectural Domain Adaptation of LF-MMI Trained TDNNs for Elderly and Dysarthric Speech Recognition.

Interspeech2021 Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye 0001, Zengrui Jin, Xunying Liu, Helen Meng, 
Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition.

#113  | Rohit Prabhavalkar | DBLP Google Scholar  
By venue: ICASSP: 20, Interspeech: 13, TASLP: 1, NAACL: 1
By year: 2024: 4, 2023: 10, 2022: 6, 2021: 6, 2020: 3, 2019: 5, 2018: 1
ISCA sessions: speech recognition: 2; asr: 1; search/decoding algorithms for asr: 1; speech analysis: 1; novel models and training methods for asr: 1; resource-constrained asr: 1; novel neural network architectures for asr: 1; noise robust and distant speech recognition: 1; cross-lingual and multilingual asr: 1; sequence-to-sequence speech recognition: 1; asr neural network architectures: 1; end-to-end speech recognition: 1
IEEE keywordsspeech recognition: 14decoding: 6data models: 5computational modeling: 5end to end asr: 4error analysis: 3speech coding: 3automatic speech recognition: 2task analysis: 2computational efficiency: 2conformer: 2video on demand: 2adaptation models: 2context modeling: 2two pass asr: 2rnnt: 2long form asr: 2predictive models: 2recurrent neural nets: 2hidden markov models: 1end to end: 1sparsity: 1costs: 1topology: 1model pruning: 1model quantization: 1quantization (signal): 1universal speech model: 1runtime efficiency: 1computational latency: 1large models: 1buildings: 1acoustic beams: 1representations: 1modular: 1zero shot stitching: 1weight sharing: 1machine learning: 1low rank decomposition: 1model compression: 1wearable computers: 1program processors: 1embedded speech recognition: 1segmentation: 1earth observing system: 1decoding algorithms: 1real time systems: 1signal processing algorithms: 1asr: 1force: 1siamese network: 1bridges: 1semi supervised learning: 1knowledge engineering: 1contrastive loss: 1cross training: 1transducers: 1internal lm: 1text recognition: 1text injection: 1semisupervised learning: 1lattices: 1production: 1contextual biasing: 1and foundation speech models: 1model reprogramming: 1pre trained adaptation: 1benchmark testing: 1focusing: 1cross lingual speech recognition: 1named entities: 1natural language processing: 1class language model: 1shallow fusion: 1end toend speech recognition: 1speaker recognition: 1speech enhancement: 1acoustic echo cancellation: 1acoustic simulation: 1echo suppression: 1multi task loss: 1sequence to sequence model: 1computer architecture: 1second pass asr: 1mean square error methods: 1transformer: 1calibration: 1confidence: 1voice activity detection: 1attention based end to end models: 1audio recording: 1acoustic features: 1unspoken punctuation: 1optimisation: 1vocabulary: 1mathematical model: 1pronunciation: 1sequence to sequence: 1las: 1phonetics: 1biasing: 1mobile handsets: 1
Most publications (all venues) at 2023: 17, 2019: 10, 2018: 10, 2021: 9, 2022: 8

Recent publications

TASLP2024 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe 0001, 
End-to-End Speech Recognition: A Survey.

ICASSP2024 Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li 0028, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal, 
USM-Lite: Quantization and Sparsity Aware Fine-Tuning for Speech Recognition with Universal Speech Models.

ICASSP2024 Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno 0001, 
Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models.

NAACL2024 Weiran Wang, Rohit Prabhavalkar, Haozhe Shan, Zhong Meng, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li 0028, James Qin, Xingyu Cai, Adam Stooke, Chengjian Zheng, Yanzhang He, Tara N. Sainath, Pedro Moreno Mengibar, 
Massive End-to-end Speech Recognition Models with Time Reduction.

ICASSP2023 Rami Botros, Rohit Prabhavalkar, Johan Schalkwyk, Ciprian Chelba, Tara N. Sainath, Françoise Beaufays, 
Lego-Features: Exporting Modular Encoder Features for Streaming and Deliberation ASR.

ICASSP2023 Steven M. Hernandez, Ding Zhao, Shaojin Ding, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He, Ian McGraw, 
Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models.

ICASSP2023 W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman, 
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model.

ICASSP2023 Soheil Khorram, Anshuman Tripathi, Jaeyoung Kim, Han Lu, Qian Zhang, Rohit Prabhavalkar, Hasim Sak, 
Cross-Training: A Semi-Supervised Training Scheme for Speech Recognition.

ICASSP2023 Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang 0033, Bo Li 0028, Andrew Rosenberg, Bhuvana Ramabhadran, 
JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.

ICASSP2023 Cal Peyser, Michael Picheny, Kyunghyun Cho, Rohit Prabhavalkar, W. Ronny Huang, Tara N. Sainath, 
A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale.

ICASSP2023 Tara N. Sainath, Rohit Prabhavalkar, Diamantino Caseiro, Pat Rondon, Cyril Allauzen, 
Improving Contextual Biasing with Text Injection.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman, 
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition.

Interspeech2023 Zih-Ching Chen, Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath, 
How to Estimate Model Transferability of Pre-Trained Speech Models?

Interspeech2023 Cal Peyser, Zhong Meng, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho, Ke Hu, 
Improving Joint Speech-Text Representations Without Alignment.

ICASSP2022 Antoine Bruguier, Duc Le, Rohit Prabhavalkar, Dangna Li, Zhe Liu 0011, Bo Wang, Eun Chang, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer, 
Neural-FST Class Language Model for End-to-End Speech Recognition.

ICASSP2022 Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Weiran Wang, David Qiu, Chung-Cheng Chiu, Rohit Prabhavalkar, Alexander Gruenstein, Anmol Gulati, Bo Li 0028, David Rybach, Emmanuel Guzman, Ian McGraw, James Qin, Krzysztof Choromanski, Qiao Liang 0001, Robert David, Ruoming Pang, Shuo-Yiin Chang, Trevor Strohman, W. Ronny Huang, Wei Han 0002, Yonghui Wu, Yu Zhang 0033, 
Improving The Latency And Quality Of Cascaded Encoders.

Interspeech2022 Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang 0016, Rina Panigrahy, Qiao Liang 0001, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman, 
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes.

Interspeech2022 Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang, 
Improving Deliberation by Text-Only and Semi-Supervised Training.

Interspeech2022 W. Ronny Huang, Shuo-Yiin Chang, David Rybach, Tara N. Sainath, Rohit Prabhavalkar, Cal Peyser, Zhiyun Lu, Cyril Allauzen, 
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR.

Interspeech2022 Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach, 
Improving Rare Word Recognition with LM-aware MWER Training.

#114  | Jian Wu 0027 | DBLP Google Scholar  
By venue: ICASSP: 18, Interspeech: 17
By year: 2024: 1, 2023: 7, 2022: 5, 2021: 7, 2020: 13, 2019: 2
ISCA sessions: source separation: 2; speaker and language recognition: 1; other topics in speech recognition: 1; robust asr, and far-field/multi-talker asr: 1; search/decoding techniques and confidence measures for asr: 1; tools, corpora and resources: 1; deep noise suppression challenge: 1; streaming asr: 1; feature extraction and distant asr: 1; asr neural network architectures and training: 1; singing voice computing and processing in music: 1; speaker diarization: 1; multi-channel speech enhancement: 1; the interspeech 2020 far field speaker verification challenge: 1; speech and audio source separation and scene analysis: 1; asr for noisy and far-field speech: 1
IEEE keywordsspeech recognition: 7error analysis: 6speaker recognition: 5recurrent neural nets: 4oral communication: 3self supervised learning: 3speech separation: 3multi talker automatic speech recognition: 3transformers: 3speaker diarization: 3continuous speech separation: 3transducers: 2adaptation models: 2computational modeling: 2conversation transcription: 2representation learning: 2training data: 2transformer: 2conformer: 2speaker verification: 2source separation: 2audio signal processing: 2overlapped speech: 2token level serialized output training: 1multi talker speech recognition: 1factorized neural transducer: 1text only adaptation: 1symbols: 1vocabulary: 1data models: 1wavlm: 1multi speaker: 1focusing: 1microphone arrays: 1streaming inference: 1geometry: 1microphone array: 1operating systems: 1data mining: 1self attention: 1convolution: 1swin transformer: 1tensors: 1eend eda: 1correlation: 1voice activity detection: 1ts vad: 1speaker change detection: 1degradation: 1e2e asr: 1transformer transducer: 1f1 score: 1limiting: 1data simulation: 1conversation analysis: 1analytical models: 1signal processing algorithms: 1multitasking: 1pre training: 1benchmark testing: 1speaker: 1linear programming: 1meeting transcription: 1recurrent selective attention network: 1multi speaker asr: 1attention: 1multi dialect: 1acoustic modeling: 1mixture of experts: 1natural language processing: 1filtering theory: 1system fusion: 1automatic speech recognition: 1permutation invariant training: 1libricss: 1microphones: 1adaptation: 1text analysis: 1text to speech: 1rnn t: 1keyword spotting: 1speech synthesis: 1pattern clustering: 1graph neural networks: 1matrix algebra: 1graph theory: 1deep speaker embedding: 1multi modal: 1audio visual systems: 1audio visual speech recognition: 1convolutional neural nets: 1cnn: 1attentive pooling: 1lstm: 1
Most publications (all venues) at 2020: 13, 2022: 12, 2021: 12, 2023: 11, 2019: 5

Affiliations
Microsoft Corporation, USA
Northwestern Polytechnical University, Xi'an, China

Recent publications

ICASSP2024 Jian Wu 0027, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao 0017, Zhuo Chen 0006, Jinyu Li 0001, 
T-SOT FNT: Streaming Multi-Talker ASR with Text-Only Domain Adaptation Capability.

ICASSP2023 Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiaofei Wang 0009, Takuya Yoshioka, Jinyu Li 0001, Sunit Sivasankaran, Sefik Emre Eskimez, 
Speech Separation with Large-Scale Self-Supervised Learning.

ICASSP2023 Zili Huang, Zhuo Chen 0006, Naoyuki Kanda, Jian Wu 0027, Yiming Wang, Jinyu Li 0001, Takuya Yoshioka, Xiaofei Wang 0009, Peidong Wang, 
Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition.

ICASSP2023 Naoyuki Kanda, Jian Wu 0027, Xiaofei Wang 0009, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Vararray Meets T-Sot: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition.

ICASSP2023 Mufan Sang, Yong Zhao 0008, Gang Liu 0001, John H. L. Hansen, Jian Wu 0027
Improving Transformer-Based Networks with Locality for Automatic Speaker Verification.

ICASSP2023 Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Takuya Yoshioka, Jian Wu 0027
Target Speaker Voice Activity Detection with Transformers and Its Integration with End-To-End Neural Diarization.

ICASSP2023 Jian Wu 0027, Zhuo Chen 0006, Min Hu, Xiong Xiao, Jinyu Li 0001, 
Speaker Change Detection For Transformer Transducer ASR.

ICASSP2023 Muqiao Yang, Naoyuki Kanda, Xiaofei Wang 0009, Jian Wu 0027, Sunit Sivasankaran, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Simulating Realistic Speech Overlaps Improves Multi-Talker ASR.

ICASSP2022 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Zhengyang Chen, Zhuo Chen 0006, Shujie Liu 0001, Jian Wu 0027, Yao Qian, Furu Wei, Jinyu Li 0001, Xiangzhan Yu, 
Unispeech-Sat: Universal Speech Representation Learning With Speaker Aware Pre-Training.

ICASSP2022 Yixuan Zhang 0005, Zhuo Chen 0006, Jian Wu 0027, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li 0001, 
Continuous Speech Separation with Recurrent Selective Attention Network.

Interspeech2022 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Shujie Liu 0001, Zhuo Chen 0006, Peidong Wang, Gang Liu 0001, Jinyu Li 0001, Jian Wu 0027, Xiangzhan Yu, Furu Wei, 
Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Interspeech2022 Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiong Xiao, Zhong Meng, Xiaofei Wang 0009, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings.

Interspeech2022 Naoyuki Kanda, Jian Wu 0027, Yu Wu 0012, Xiong Xiao, Zhong Meng, Xiaofei Wang 0009, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka, 
Streaming Multi-Talker ASR with Token-Level Serialized Output Training.

ICASSP2021 Sanyuan Chen, Yu Wu 0012, Zhuo Chen 0006, Jian Wu 0027, Jinyu Li 0001, Takuya Yoshioka, Chengyi Wang 0002, Shujie Liu 0001, Ming Zhou 0001, 
Continuous Speech Separation with Conformer.

ICASSP2021 Amit Das 0007, Kshitiz Kumar, Jian Wu 0027
Multi-Dialect Speech Recognition in English Using Attention on Ensemble of Experts.

ICASSP2021 Xiong Xiao, Naoyuki Kanda, Zhuo Chen 0006, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao 0008, Gang Liu 0001, Yu Wu 0012, Jian Wu 0027, Shujie Liu 0001, Jinyu Li 0001, Yifan Gong 0001, 
Microsoft Speaker Diarization System for the Voxceleb Speaker Recognition Challenge 2020.

Interspeech2021 Amber Afshan, Kshitiz Kumar, Jian Wu 0027
Sequence-Level Confidence Classifier for ASR Utterance Accuracy and Application to Acoustic Models.

Interspeech2021 Sanyuan Chen, Yu Wu 0012, Zhuo Chen 0006, Jian Wu 0027, Takuya Yoshioka, Shujie Liu 0001, Jinyu Li 0001, Xiangzhan Yu, 
Ultra Fast Speech Separation Model with Teacher Student Learning.

Interspeech2021 Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen 0006, Yanxin Hu, Lei Xie 0001, Jian Wu 0027, Hui Bu, Xin Xu, Jun Du, Jingdong Chen, 
AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario.

Interspeech2021 Jian Wu 0027, Zhuo Chen 0006, Sanyuan Chen, Yu Wu 0012, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu 0001, Jinyu Li 0001, 
Investigation of Practical Aspects of Single Channel Speech Separation for ASR.

#115  | Tanja Schultz | DBLP Google Scholar  
By venue: Interspeech: 23, ICASSP: 10, SpeechComm: 1
By year: 2023: 4, 2022: 10, 2021: 5, 2020: 10, 2019: 2, 2018: 3
ISCA sessions: speech and language in health: 3; source separation: 2; health and affect: 2; cross/multi-lingual and code-switched speech recognition: 2; novel paradigms for direct synthesis based on speech-related biosignals: 2; biosignal-enabled spoken communication: 1; speech synthesis: 1; target speaker detection, localization and separation: 1; new trends in self-supervised speech processing: 1; neural signals for spoken communication: 1; speech in multimodality: 1; computational paralinguistics: 1; human speech production: 1; accoustic phonetics of l1-l2 and other interactions: 1; applications of language technologies: 1; keynote: 1; speech and language analytics for mental health: 1
IEEE keywordsnatural language processing: 6speech recognition: 6ethiopian languages: 3electroencephalography: 2decoding: 2speech: 2speech synthesis: 2gaussian processes: 2language translation: 2hearing: 2globalphone: 2vocabulary: 2deep neural networks: 2brain modeling: 1auditory system: 1speech stimulus: 1eeg decoding: 1speech envelope: 1match mismatch classification: 1task analysis: 1production: 1silent speech interfaces: 1emg to speech: 1electromyography: 1predictive models: 1muscles: 1voice conversion: 1support vector machines: 1alzheimer’s disease: 1diseases: 1speech & language: 1handicapped aids: 1ilse corpus: 1adress challenge: 1acoustic and linguistic features: 1medical disorders: 1stereo tactic eeg: 1medical signal processing: 1neuroprosthesis: 1low latency processing of neural signals: 1prosthetics: 1neurophysiology: 1speech intelligibility: 1biomedical electrodes: 1multilingual: 1grammars: 1natural languages: 1selective auditory attention: 1target language extraction: 1cocktail party problem: 1malayalam: 1sub word segmentation: 1code switching: 1oov: 1language modelling: 1image segmentation: 1low resource languages: 1computer vision: 1computer audition: 1medical computing: 1healthcare: 1audio signal processing: 1intelligent medicine: 1health care: 1digital phenotype: 1overview: 1modeling units: 1end to end asr: 1automatic speech recognition: 1out of vocabulary: 1hidden markov models: 1dnn: 1linguistics: 1
Most publications (all venues) at 2014: 36, 2013: 24, 2009: 24, 2022: 23, 2012: 23

Affiliations
University of Bremen, Cognitive Systems Lab, Germany
Carnegie Mellon University, Pittsburgh, USA (former)

Recent publications

ICASSP2023 Marvin Borsdorf, Saurav Pahuja, Gabriel Ivucic, Siqi Cai, Haizhou Li 0001, Tanja Schultz
Multi-Head Attention and GRU for Improved Match-Mismatch Classification of Speech Stimulus and EEG Response.

ICASSP2023 Kevin Scheck, Tanja Schultz
Multi-Speaker Speech Synthesis from Electromyographic Signals by Soft Speech Unit Prediction.

Interspeech2023 Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso, 
Towards Reference Speech Characterization for Health Applications.

Interspeech2023 Kevin Scheck, Tanja Schultz
STE-GAN: Speech-to-Electromyography Signal Conversion using Generative Adversarial Networks.

SpeechComm2022 Martha Yifiru Tachbelie, Solomon Teferra Abate, Tanja Schultz
Multilingual speech recognition for GlobalPhone languages.

ICASSP2022 Ayimnisagul Ablimit, Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso, 
Exploring Dementia Detection from Speech: Cross Corpus Analysis.

ICASSP2022 Miguel Angrick, Maarten C. Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Albert J. Colon, G. Louis Wagner, Dean J. Krusienski, Pieter Leonard Kubben, Tanja Schultz, Christian Herff, 
Towards Closed-Loop Speech Synthesis from Stereotactic EEG: A Unit Selection Approach.

ICASSP2022 Marvin Borsdorf, Kevin Scheck, Haizhou Li 0001, Tanja Schultz
Experts Versus All-Rounders: Target Language Extraction for Multiple Target Languages.

ICASSP2022 Sreeja Manghat, Sreeram Manghat, Tanja Schultz
Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages.

ICASSP2022 Kun Qian 0003, Tanja Schultz, Björn W. Schuller, 
An Overview of the FIRST ICASSP Special Session on Computer Audition for Healthcare.

Interspeech2022 Ayimnisagul Ablimit, Karen Scholz, Tanja Schultz
Deep Learning Approaches for Detecting Alzheimer's Dementia from Conversational Speech of ILSE Study.

Interspeech2022 Marvin Borsdorf, Kevin Scheck, Haizhou Li 0001, Tanja Schultz
Blind Language Separation: Disentangling Multilingual Cocktail Party Voices by Language.

Interspeech2022 Catarina Botelho, Tanja Schultz, Alberto Abad, Isabel Trancoso, 
Challenges of using longitudinal and cross-domain corpora on studies of pathological speech.

Interspeech2022 Sreeram Manghat, Sreeja Manghat, Tanja Schultz
Normalization of code-switched text for speech synthesis.

ICASSP2021 Solomon Teferra Abate, Martha Yifiru Tachbelie, Tanja Schultz
End-to-End Multilingual Automatic Speech Recognition for Less-Resourced Languages: The Case of Four Ethiopian Languages.

Interspeech2021 Marvin Borsdorf, Chenglin Xu, Haizhou Li 0001, Tanja Schultz
Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers.

Interspeech2021 Marvin Borsdorf, Chenglin Xu, Haizhou Li 0001, Tanja Schultz
GlobalPhone Mix-To-Separate Out of 2: A Multilingual 2000 Speakers Mixtures Database for Speech Separation.

Interspeech2021 Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso, 
Visual Speech for Obstructive Sleep Apnea Detection.

Interspeech2021 Lars Steinert, Felix Putze, Dennis Küster, Tanja Schultz
Audio-Visual Recognition of Emotional Engagement of People with Dementia.

ICASSP2020 Solomon Teferra Abate, Martha Yifiru Tachbelie, Tanja Schultz
Deep Neural Networks Based Automatic Speech Recognition for Four Ethiopian Languages.

#116  | Florian Metze | DBLP Google Scholar  
By venue: Interspeech: 16, ICASSP: 10, TASLP: 2, EMNLP-Findings: 2, SpeechComm: 1, NeurIPS: 1, ACL: 1, NAACL: 1
By year: 2023: 1, 2022: 4, 2021: 6, 2020: 7, 2019: 12, 2018: 4
ISCA sessions: cross/multi-lingual and code-switched asr: 2; acoustic event detection and classification: 1; low-resource asr development: 1; spoken language understanding: 1; spoken language processing: 1; asr neural network architectures: 1; multilingual and code-switched asr: 1; nn architectures for asr: 1; cross-lingual and multilingual asr: 1; speech annotation and labelling: 1; multimodal asr: 1; speaker diarization: 1; acoustic scenes and rare events: 1; audio events and acoustic scenes: 1; asr systems and technologies: 1
IEEE keywordsspeech recognition: 9natural language processing: 5machine translation: 2end to end: 2vocabulary: 2task analysis: 2text analysis: 2multilingual speech recognition: 2artificial intelligence: 1encoder decoder models: 1modularity: 1decoding: 1predictive models: 1speech summarization: 1long sequence modeling: 1concept learning: 1phone distribution estimation: 1fitting: 1ranking models: 1estimation: 1signal processing algorithms: 1low resource languages: 1multilingual speech alignment: 1multilingual phonetic dataset: 1low resource speech recognition: 1automatic speech recognition: 1human computer interaction: 1unsupervised learning: 1image retrieval: 1speech synthesis: 1image representation: 1phonology: 1universal phone recognition: 1domain adaptation: 1diarization: 1computational linguistics: 1language translation: 1medical transcription: 1speaker recognition: 1asr error correction: 1noisy asr: 1robustness: 1multimodal learning: 1event detection: 1dogs: 1weak labeling: 1probability: 1sound event detection (sed): 1connectionist temporal classification (ctc): 1labeling: 1indexes: 1sequential labeling: 1automobiles: 1error statistics: 1multimodal asr: 1multilingual language models: 1ctc based decoding: 1low resource asr: 1phoneme level language models: 1attention: 1context modeling: 1switches: 1contextual embeddings: 1acoustic word embeddings: 1acoustic to word speech recognition: 1
Most publications (all venues) at 2019: 27, 2014: 24, 2013: 20, 2012: 20, 2018: 19


Recent publications

TASLP2023 Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji Watanabe 0001, Florian Metze, Luke Zettlemoyer, Abdelrahman Mohamed, 
LegoNN: Building Modular Encoder-Decoder Models.

ICASSP2022 Roshan Sharma, Shruti Palaskar, Alan W. Black, Florian Metze
End-to-End Speech Summarization Using Restricted Self-Attention.

Interspeech2022 Juncheng Li 0001, Shuhui Qu, Po-Yao Huang 0001, Florian Metze
AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification.

Interspeech2022 Xinjian Li, Florian Metze, David R. Mortensen, Alan W. Black, Shinji Watanabe 0001, 
ASR2K: Speech Recognition for Around 2000 Languages without Audio.

EMNLP-Findings2022 Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W. Black, Shinji Watanabe 0001, 
Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models.

ICASSP2021 Xinjian Li, Juncheng Li 0001, Jiali Yao, Alan W. Black, Florian Metze
Phone Distribution Estimation for Low Resource Languages.

ICASSP2021 Xinjian Li, David R. Mortensen, Florian Metze, Alan W. Black, 
Multilingual Phonetic Dataset for Low Resource Speech Recognition.

Interspeech2021 Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan 0002, Siddharth Dalmia, Florian Metze, Shinji Watanabe 0001, Alan W. Black, 
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding.

Interspeech2021 Xinjian Li, Juncheng Li 0001, Florian Metze, Alan W. Black, 
Hierarchical Phone Recognition with Compositional Phonetics.

Interspeech2021 Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, Florian Metze
Multimodal Speech Summarization Through Semantic Concept Learning.

Interspeech2021 Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe 0001, 
Differentiable Allophone Graphs for Language-Universal Speech Recognition.

TASLP2020 Odette Scharenborg, Lucas Ondel, Shruti Palaskar, Philip Arthur, Francesco Ciannella, Mingxing Du, Elin Larsen, Danny Merkx, Rachid Riad, Liming Wang, Emmanuel Dupoux, Laurent Besacier, Alan W. Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stüker, Pierre Godard, Markus Müller 0001, 
Speech Technology for Unwritten Languages.

ICASSP2020 Xinjian Li, Siddharth Dalmia, Juncheng Li 0001, Matthew Lee 0012, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R. Mortensen, Graham Neubig, Alan W. Black, Florian Metze
Universal Phone Recognition with a Multilingual Allophone System.

ICASSP2020 Anirudh Mani, Shruti Palaskar, Nimshi Venkat Meripo, Sandeep Konam, Florian Metze
ASR Error Correction and Domain Adaptation Using Machine Translation.

ICASSP2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze
Looking Enhances Listening: Recovering Missing Speech Using Images.

Interspeech2020 Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf, 
Contextual RNN-T for Open Domain ASR.

Interspeech2020 Zimeng Qiu, Yiyuan Li, Xinjian Li, Florian Metze, William M. Campbell, 
Towards Context-Aware End-to-End Code-Switching Speech Recognition.

EMNLP-Findings2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott, 
Fine-Grained Grounding for Multimodal Speech Recognition.

SpeechComm2019 Okko Räsänen, Shreyas Seshadri, Julien Karadayi, Eric Riebling, John P. Bunce, Alejandrina Cristià, Florian Metze, Marisa Casillas, Celia Rosemberg, Elika Bergelson, Melanie Soderstrom, 
Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech.

ICASSP2019 Yun Wang 0005, Florian Metze
Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling.

#117  | Shuai Wang 0016 | DBLP Google Scholar  
By venue: ICASSP: 15, Interspeech: 11, TASLP: 6, SpeechComm: 1, AAAI: 1
By year: 2024: 8, 2023: 2, 2022: 2, 2021: 5, 2020: 9, 2019: 7, 2018: 1
ISCA sessions: speaker recognition: 2; speaker and language diarization: 1; embedding and network architecture for speaker recognition: 1; speaker recognition challenges and applications: 1; learning techniques for speaker recognition: 1; anti-spoofing and liveness detection: 1; speaker recognition and diarization: 1; speaker recognition and anti-spoofing: 1; the 2019 automatic speaker verification spoofing and countermeasures challenge: 1; speaker verification using neural network methods: 1
IEEE keywordsspeaker recognition: 12speaker verification: 8task analysis: 4data augmentation: 4system performance: 3dihard: 3speaker embedding: 3speaker diarization: 3hidden markov models: 3data handling: 3voice activity detection: 2degradation: 2data mining: 2adaptation models: 2speech recognition: 2training data: 2self supervised learning: 2teacher student learning: 2text dependent speaker verification: 2x vector: 2pattern clustering: 2bayes methods: 2variational bayes: 2hmm: 2clustering algorithms: 1neural speaker diarization: 1attention based encoder decoder: 1ami: 1transformers: 1iterative decoding: 1callhome: 1decoding: 1pretraining: 1siamese network: 1psychoacoustic models: 1self supervise: 1synthetic data: 1speech separation: 1maximum mean discrepancy: 1predictive coding: 1electronics packaging: 13d speaker: 1cross domain learning: 1domain mismatch: 1rendering (computer graphics): 1emotional text to speech: 1emotion prediction: 1emotion control: 1linguistics: 1predictive models: 1target speaker extraction: 1synchronization: 1active speaker detection: 1audio visual: 1interference: 1speech: 1sparsely overlapped speech: 1in the wild: 1filtering algorithms: 1dino: 1pipelines: 1production: 1data models: 1wespeaker: 1codes: 1self knowledge distillation: 1computational modeling: 1model compression: 1deep embedding learning: 1knowledge engineering: 1quantization (signal): 1teacher training: 1speech activity detection. weakly supervised learning: 1convolutional neural networks: 1biometrics (access control): 1audio visual deep neural network: 1person verification: 1deep learning (artificial intelligence): 1face recognition: 1data analysis: 1multi modal system: 1domain adaptation: 1unsupervised learning: 1contrastive learning: 1speech synthesis: 1i vector: 1unit selection synthesis: 1variational auto encoder: 1text independent speaker verification: 1generative adversarial network: 1on the fly data augmentation: 1specaugment: 1convolutional neural nets: 1multitask learning: 1channel information: 1adversarial training: 1probability: 1optimisation: 1linear discriminant analysis: 1chime: 1inference mechanisms: 1text dependent: 1adaptation: 1text mismatch: 1data collection: 1center loss: 1angular softmax: 1short duration text independent speaker verification: 1speaker neural embedding: 1triplet loss: 1gaussian processes: 1knowledge distillation: 1computer aided instruction: 1
Most publications (all venues) at 2024: 17, 2020: 11, 2019: 10, 2021: 7, 2018: 7

Affiliations
Shanghai Jiao Tong University, Department of Computer Science and Engineering, China

Recent publications

SpeechComm2024 Shuai Wang 0016, Zhengyang Chen, Bing Han, Hongji Wang, Chengdong Liang, Binbin Zhang, Xu Xiang, Wen Ding, Johan Rohdin, Anna Silnova, Yanmin Qian, Haizhou Li 0001, 
Advancing speaker embedding learning: Wespeaker toolkit for research and production.

TASLP2024 Zhengyang Chen, Bing Han, Shuai Wang 0016, Yanmin Qian, 
Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer.

TASLP2024 Wupeng Wang, Zexu Pan, Xinke Li, Shuai Wang 0016, Haizhou Li 0001, 
Speech Separation With Pretrained Frontend to Minimize Domain Mismatch.

ICASSP2024 Wen Huang 0004, Bing Han, Shuai Wang 0016, Zhengyang Chen, Yanmin Qian, 
Robust Cross-Domain Speaker Verification with Multi-Level Domain Adapters.

ICASSP2024 Sho Inoue, Kun Zhou 0003, Shuai Wang 0016, Haizhou Li 0001, 
Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis.

ICASSP2024 Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang 0016, Haizhou Li 0001, 
Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-Talker Speech.

ICASSP2024 Shuai Wang 0016, Qibing Bai, Qi Liu 0018, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li 0001, 
Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.

AAAI2024 Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen 0001, Shuai Wang 0016, Hui Zhang, Kai Yu 0004, 
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.

ICASSP2023 Hongji Wang, Chengdong Liang, Shuai Wang 0016, Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei Deng, Yanmin Qian, 
Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit.

Interspeech2023 Zhengyang Chen, Bing Han, Shuai Wang 0016, Yanmin Qian, 
Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor.

ICASSP2022 Bei Liu, Haoyu Wang 0007, Zhengyang Chen, Shuai Wang 0016, Yanmin Qian, 
Self-Knowledge Distillation via Feature Enhancement for Speaker Verification.

Interspeech2022 Bei Liu, Zhengyang Chen, Shuai Wang 0016, Haoyu Wang 0007, Bing Han, Yanmin Qian, 
DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design.

TASLP2021 Heinrich Dinkel, Shuai Wang 0016, Xuenan Xu, Mengyue Wu, Kai Yu 0004, 
Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training.

TASLP2021 Yanmin Qian, Zhengyang Chen, Shuai Wang 0016
Audio-Visual Deep Neural Network for Robust Person Verification.

ICASSP2021 Zhengyang Chen, Shuai Wang 0016, Yanmin Qian, 
Self-Supervised Learning Based Domain Adaptation for Robust Speaker Verification.

ICASSP2021 Chenpeng Du, Bing Han, Shuai Wang 0016, Yanmin Qian, Kai Yu 0004, 
SynAug: Synthesis-Based Data Augmentation for Text-Dependent Speaker Verification.

ICASSP2021 Houjun Huang, Xu Xiang, Fei Zhao, Shuai Wang 0016, Yanmin Qian, 
Unit Selection Synthesis Based Data Augmentation for Fixed Phrase Speaker Verification.

TASLP2020 Shuai Wang 0016, Yexin Yang, Zhanghao Wu, Yanmin Qian, Kai Yu 0004, 
Data Augmentation Using Deep Generative Models for Embedding Based Speaker Recognition.

ICASSP2020 Shuai Wang 0016, Johan Rohdin, Oldrich Plchot, Lukás Burget, Kai Yu 0004, Jan Cernocký, 
Investigation of Specaugment for Deep Speaker Embedding Learning.

ICASSP2020 Zhengyang Chen, Shuai Wang 0016, Yanmin Qian, Kai Yu 0004, 
Channel Invariant Speaker Embedding Learning with Joint Multi-Task and Adversarial Training.

#118  | Thomas Hain | DBLP Google Scholar  
By venue: Interspeech: 19, ICASSP: 14, TASLP: 1
By year: 2024: 5, 2023: 5, 2022: 5, 2021: 4, 2020: 9, 2019: 4, 2018: 2
ISCA sessions: speech emotion recognition: 2; speech recognition: 1; multilingual models for asr: 1; speech intelligibility prediction for hearing-impaired listeners: 1; asr: 1; low-resource asr development: 1; search/decoding techniques and confidence measures for asr: 1; speech and audio quality assessment: 1; the zero resource speech challenge 2020: 1; multilingual and code-switched asr: 1; speaker recognition: 1; learning techniques for speaker recognition: 1; computational paralinguistics: 1; speech recognition and beyond: 1; applications of language technologies: 1; network architectures for emotion and paralinguistics recognition: 1; speech analysis and representation: 1; applications in education and learning: 1
IEEE keywordsadaptation models: 7training data: 5data models: 4speech recognition: 4switches: 3speech enhancement: 3noise measurement: 3self supervised learning: 2predictive models: 2benchmark testing: 2computational modeling: 2speech separation: 2task analysis: 2telephone sets: 1domain adaptation: 1automatic speech recognition: 1pre training: 1pseudo labeling: 1measurement: 1performance gain: 1self supervised fine tuning: 1correspondence training: 1data augmentation: 1graphics processing units: 1hearing impairment: 1intelligibility prediction: 1psychology: 1dual path transformer: 1artificial neural networks: 1transformers: 1conformer: 1teacher student training: 1knowledge distillation: 1analytical models: 1asr: 1self supervised representations: 1loss measurement: 1loss functions: 1encoding: 1psychoacoustic models: 1deformable convolution: 1dynamic neural networks: 1deformable models: 1convolution: 1radio frequency: 1unsupervised: 1submodular: 1hidden markov models: 1natural language processing: 1data selection: 1contrastive loss: 1gaussian processes: 1perception bias: 1l2 learning: 1voice activity detection: 1pronunciation assessment: 1temporal convolutional network: 1audio anomaly classification: 1convolutional neural nets: 1signal denoising: 1self attention: 1audio recording: 1audio signal processing: 1distortion: 1speaker recognition: 1gaussian noise: 1filtering theory: 1generative adversarial networks: 1voice conversion: 1low resource: 1back propagation: 1connectionist temporal classification: 1end to end speech recognition: 1multiple hypothesis: 1semi supervised adaptation: 1signal classification: 1speaker embeddings: 1two dimensional displays: 1speaker identification: 1periodic structures: 1hierarchical attention: 1attention mechanism: 1x vectors: 1context modeling: 1rnnlm: 1language model adaptation: 1multi domain asr: 1
Most publications (all venues) at 2015: 18, 2023: 17, 2016: 17, 2024: 14, 2022: 14

Affiliations
University of Sheffield, England, UK

Recent publications

ICASSP2024 Rehan Ahmad, Muhammad Umar Farooq, Thomas Hain
Progressive Unsupervised Domain Adaptation for ASR Using Ensemble Models and Multi-Stage Training.

ICASSP2024 George Close, William Ravenscroft, Thomas Hain, Stefan Goetze, 
Multi-CMGAN+/+: Leveraging Multi-Objective Speech Quality Metric Prediction for Speech Enhancement.

ICASSP2024 Amit Meghanani, Thomas Hain
SCORE: Self-Supervised Correspondence Fine-Tuning for Improved Content Representations.

ICASSP2024 Rhiannon Mogridge, George Close, Robert Sutherland, Thomas Hain, Jon Barker, Stefan Goetze, Anton Ragni, 
Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users Using Intermediate ASR Features and Human Memory Models.

ICASSP2024 William Ravenscroft, Stefan Goetze, Thomas Hain
Combining Conformer and Dual-Path-Transformer Networks for Single Channel Noisy Reverberant Speech Separation.

ICASSP2023 Rehan Ahmad, Md Asif Jalal, Muhammad Umar Farooq, Anna Ollerenshaw, Thomas Hain
Towards Domain Generalisation in ASR with Elitist Sampling and Ensemble Knowledge Distillation.

ICASSP2023 George Close, William Ravenscroft, Thomas Hain, Stefan Goetze, 
Perceive and Predict: Self-Supervised Speech Representation Based Loss Functions for Speech Enhancement.

ICASSP2023 William Ravenscroft, Stefan Goetze, Thomas Hain
Deformable Temporal Convolutional Networks for Monaural Noisy Reverberant Speech Separation.

Interspeech2023 Cong-Thanh Do, Rama Doddipatla, Mohan Li, Thomas Hain
Domain Adaptive Self-supervised Training of Automatic Speech Recognition.

Interspeech2023 Muhammad Umar Farooq, Thomas Hain
Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition.

ICASSP2022 Chanho Park, Rehan Ahmad, Thomas Hain
Unsupervised Data Selection for Speech Recognition with Contrastive Loss Ratios.

ICASSP2022 Jose Antonio Lopez Saenz, Thomas Hain
A Model for Assessor Bias in Automatic Pronunciation Assessment.

Interspeech2022 George Close, Samuel Hollands, Stefan Goetze, Thomas Hain
Non-intrusive Speech Intelligibility Metric Prediction for Hearing Impaired Individuals.

Interspeech2022 Muhammad Umar Farooq, Thomas Hain
Investigating the Impact of Crosslingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition.

Interspeech2022 Muhammad Umar Farooq, Darshan Adiga Haniya Narayana, Thomas Hain
Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion.

ICASSP2021 Qiang Huang 0008, Thomas Hain
Improving Audio Anomalies Recognition Using Temporal Convolutional Attention Networks.

ICASSP2021 Mingjie Chen, Yanpei Shi, Thomas Hain
Towards Low-Resource Stargan Voice Conversion Using Weight Adaptive Instance Normalization.

ICASSP2021 Cong-Thanh Do, Rama Doddipatla, Thomas Hain
Multiple-Hypothesis CTC-Based Semi-Supervised Adaptation of End-to-End Speech Recognition.

Interspeech2021 Anna Ollerenshaw, Md. Asif Jalal, Thomas Hain
Insights on Neural Representations for End-to-End Speech Recognition.

ICASSP2020 Yanpei Shi, Qiang Huang 0008, Thomas Hain
H-Vectors: Utterance-Level Speaker Embedding Using a Hierarchical Attention Model.

#119  | George Saon | DBLP Google Scholar  
By venue: Interspeech: 18, ICASSP: 14, EMNLP: 1, TASLP: 1
By year: 2024: 2, 2023: 4, 2022: 10, 2021: 8, 2020: 3, 2019: 7
ISCA sessions: asr: 2; asr neural network training: 2; speech recognition: 1; novel models and training methods for asr: 1; neural transducers, streaming asr and novel asr models: 1; multi-, cross-lingual and other topics in asr: 1; other topics in speech recognition: 1; streaming for asr/rnn transducers: 1; neural network training methods for asr: 1; spoken language understanding: 1; language and lexical modeling for asr: 1; novel neural network architectures for asr: 1; streaming asr: 1; asr neural network architectures and training: 1; resources – annotation – evaluation: 1; sequence-to-sequence speech recognition: 1
IEEE keywordsspeech recognition: 10automatic speech recognition: 8recurrent neural nets: 6switches: 4transducers: 3end to end asr: 3natural language processing: 3spoken language understanding: 3decoding: 2telephone sets: 2rnn transducers: 2text analysis: 2distributed training: 2parallel computing: 2ctc: 1encoding: 1streaming: 1semi autoregressive: 1signal processing algorithms: 1asr: 1inference algorithms: 1context modeling: 1bert: 1large language models: 1knowledge distillation: 1linguistics: 1diagonal state space models: 1extraterrestrial measurements: 1transformers: 1neural transducers: 1aerospace electronics: 1structured state space models: 1training data: 1dialog history: 1data models: 1transforms: 1robustness: 1multi speaker: 1end to end: 1recording: 1spiking neural unit: 1neural net architecture: 1neurophysiology: 1spiking neural networks: 1rnn t: 1synapse types: 1brain: 1data handling: 1attention: 1atis: 1speech coding: 1encoder decoder: 1end to end models: 1natural languages: 1adaptation: 1end to end mod els: 1language model customization: 1decentralized training: 1deep neural networks: 1convergence: 1asynchronous training: 1data analysis: 1sensor fusion: 1recurrent neural network transducer: 1multiplicative integration: 1speaker recognition: 1decentralized sgd: 1image recognition: 1supercomputers: 1task analysis: 1noise injection: 1broadcast news: 1deep neural networks.: 1switchboard.: 1parallel processing: 1graphics processing units: 1lstm: 1
Most publications (all venues) at: 2022: 10, 2017: 10, 2021: 9, 2019: 8, 2014: 8

Affiliations
URLs

Recent publications

ICASSP2024 Siddhant Arora, George Saon, Shinji Watanabe 0001, Brian Kingsbury, 
Semi-Autoregressive Streaming ASR with Label Context.

ICASSP2024 Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Masayasu Muraoka, George Saon
Multiple Representation Transfer from Large Language Models to End-to-End ASR Systems.

ICASSP2023 George Saon, Ankit Gupta 0001, Xiaodong Cui, 
Diagonal State Space Augmented Transformers for Speech Recognition.

ICASSP2023 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, George Saon, Brian Kingsbury, 
Multi-Speaker Data Augmentation for Improved end-to-end Automatic Speech Recognition.

Interspeech2023 Xiaodong Cui, George Saon, Brian Kingsbury, 
Improving RNN Transducer Acoustic Models for English Conversational Speech Recognition.

EMNLP2023 Ashish R. Mittal, Sunita Sarawagi, Preethi Jyothi, George Saon, Gakuto Kurata, 
Speech-enriched Memory for Inference-time Adaptation of ASR Models to Word Dictionaries.

ICASSP2022 Thomas Bohnstingl, Ayush Garg 0006, Stanislaw Wozniak, George Saon, Evangelos Eleftheriou, Angeliki Pantazi, 
Speech Recognition Using Biologically-Inspired Neural Networks.

ICASSP2022 Hong-Kwang Jeff Kuo, Zoltán Tüske, Samuel Thomas 0001, Brian Kingsbury, George Saon
Improving End-to-end Models for Set Prediction in Spoken Language Understanding.

ICASSP2022 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, Brian Kingsbury, George Saon
Towards Reducing the Need for Speech Training Data to Build Spoken Language Understanding Systems.

ICASSP2022 Samuel Thomas 0001, Brian Kingsbury, George Saon, Hong-Kwang Jeff Kuo, 
Integrating Text Inputs for Training and Adapting RNN Transducer ASR Models.

Interspeech2022 Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata, 
Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing.

Interspeech2022 Andrea Fasoli, Chia-Yu Chen, Mauricio J. Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan, 
Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization.

Interspeech2022 Takashi Fukuda, Samuel Thomas 0001, Masayuki Suzuki, Gakuto Kurata, George Saon, Brian Kingsbury, 
Global RNN Transducer Models For Multi-dialect Speech Recognition.

Interspeech2022 Zvi Kons, Hagai Aronowitz, Edmilson da Silva Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas 0001, George Saon
Extending RNN-T-based speech recognition systems with emotion and language classification.

Interspeech2022 Jiatong Shi, George Saon, David Haws, Shinji Watanabe 0001, Brian Kingsbury, 
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States.

Interspeech2022 Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Nobuyasu Itoh, George Saon
Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems.

TASLP2021 Xiaodong Cui, Wei Zhang 0022, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David S. Kung 0001, 
Asynchronous Decentralized Distributed Training of Acoustic Models.

ICASSP2021 Samuel Thomas 0001, Hong-Kwang Jeff Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, 
RNN Transducer Models for Spoken Language Understanding.

ICASSP2021 George Saon, Zoltán Tüske, Daniel Bolaños, Brian Kingsbury, 
Advancing RNN Transducer Technology for Speech Recognition.

Interspeech2021 Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltán Tüske, 
Reducing Exposure Bias in Training Recurrent Neural Network Transducers.

#120  | Xurong Xie | DBLP Google Scholar  
By venue: Interspeech: 17, ICASSP: 10, TASLP: 7
By year: 2024: 3, 2023: 8, 2022: 6, 2021: 9, 2020: 2, 2019: 4, 2018: 2
ISCA sessionsspeech recognition of atypical speech: 4speech and language in health: 3multi-, cross-lingual and other topics in asr: 2topics in asr: 2acoustic model adaptation for asr: 1novel models and training methods for asr: 1asr neural network architectures: 1model adaptation for asr: 1novel neural network architectures for acoustic modelling: 1application of asr in medical practice: 1
IEEE keywordsspeech recognition: 14bayes methods: 6speaker adaptation: 5adaptation models: 5speaker recognition: 5data models: 4bayesian learning: 4elderly speech: 3dysarthric speech: 3switches: 3data augmentation: 3natural language processing: 3pre trained asr system: 2wav2vec2.0: 2older adults: 2decoding: 2task analysis: 2conformer: 2controllability: 2speech disorders: 2hidden markov models: 2adaptation: 2lf mmi: 2handicapped aids: 2disordered speech recognition: 2neural architecture search: 2deep learning (artificial intelligence): 2time delay neural network: 2domain adaptation: 2inference mechanisms: 2lhuc: 2gaussian processes: 2standards: 1multi lingual xlsr: 1hubert: 1training data: 1low latency: 1rapid adaptation: 1interpolation: 1specaugment: 1reinforcement learning: 1confidence score estimation: 1transformers: 1self supervised learning: 1generative adversarial networks: 1vae: 1gan: 1perturbation methods: 1bayesian: 1nist: 1end to end: 1parameter estimation: 1uncertainty: 1elderly speech recognition: 1search problems: 1uncertainty handling: 1minimisation: 1neural net architecture: 1error analysis: 1articulatory inversion: 1hybrid power systems: 1benchmark testing: 1variational inference: 1delays: 1generalisation (artificial intelligence): 1gaussian process: 1multimodal speech recognition: 1tdnn: 1switchboard: 1automatic speech recognition: 1neurocognitive disorder detection: 1dementia: 1gaussian process neural network: 1activation function selection: 1bayesian neural network: 1maximum likelihood estimation: 1
Most publications (all venues) at: 2022: 13, 2023: 10, 2024: 7, 2021: 6, 2019: 4

Affiliations
URLs

Recent publications

TASLP2024 Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu, 
Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition.

ICASSP2024 Jiajun Deng, Xurong Xie, Guinan Li, Mingyu Cui, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Zhaoqing Li, Xunying Liu, 
Towards High-Performance and Low-Latency Feature-Based Speaker Adaptation of Conformer Speech Recognition Systems.

ICASSP2024 Zengrui Jin, Xurong Xie, Tianzi Wang, Mengzhe Geng, Jiajun Deng, Guinan Li, Shujie Hu, Xunying Liu, 
Towards Automatic Data Augmentation for Disordered Speech Recognition.

TASLP2023 Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Guinan Li, Shujie Hu, Xunying Liu, 
Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems.

ICASSP2023 Shujie Hu, Xurong Xie, Zengrui Jin, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng, 
Exploring Self-Supervised Pre-Trained ASR Models for Dysarthric and Elderly Speech Recognition.

ICASSP2023 Zengrui Jin, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, Xunying Liu, 
Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition.

ICASSP2023 Xurong Xie, Xunying Liu, Hui Chen 0020, Hongan Wang, 
Unsupervised Model-Based Speaker Adaptation of End-To-End Lattice-Free MMI Model for Speech Recognition.

Interspeech2023 Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu, 
Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems.

Interspeech2023 Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu, 
Use of Speech Impairment Severity for Dysarthric Speech Recognition.

Interspeech2023 Mengzhe Geng, Xurong Xie, Rongfeng Su, Jianwei Yu, Zengrui Jin, Tianzi Wang, Shujie Hu, Zi Ye 0001, Helen Meng, Xunying Liu, 
On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition.

Interspeech2023 Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Helen Meng, Xunying Liu, 
Exploiting Cross-Domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition.

TASLP2022 Mengzhe Geng, Xurong Xie, Zi Ye 0001, Tianzi Wang, Guinan Li, Shujie Hu, Xunying Liu, Helen Meng, 
Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition.

TASLP2022 Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng, 
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks.

ICASSP2022 Shujie Hu, Shansong Liu, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shoukang Hu, Mingyu Cui, Xunying Liu, Helen Meng, 
Exploiting Cross Domain Acoustic-to-Articulatory Inverted Features for Disordered Speech Recognition.

Interspeech2022 Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng, 
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems.

Interspeech2022 Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng, 
Confidence Score Based Conformer Speaker Adaptation for Speech Recognition.

Interspeech2022 Jin Li, Rongfeng Su, Xurong Xie, Lan Wang, Nan Yan, 
A Multi-level Acoustic Feature Extraction Framework for Transformer Based End-to-End Speech Recognition.

TASLP2021 Shoukang Hu, Xurong Xie, Shansong Liu, Jianwei Yu, Zi Ye 0001, Mengzhe Geng, Xunying Liu, Helen Meng, 
Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition.

TASLP2021 Shansong Liu, Mengzhe Geng, Shoukang Hu, Xurong Xie, Mingyu Cui, Jianwei Yu, Xunying Liu, Helen Meng, 
Recent Progress in the CUHK Dysarthric Speech Recognition System.

TASLP2021 Xurong Xie, Xunying Liu, Tan Lee 0001, Lan Wang, 
Bayesian Learning for Deep Neural Network Adaptation.

#121  | Trevor Strohman | DBLP Google Scholar  
By venue: ICASSP: 20, Interspeech: 14
By year: 2024: 1, 2023: 8, 2022: 15, 2021: 4, 2020: 5, 2019: 1
ISCA sessionsasr technologies and systems: 2multi-, cross-lingual and other topics in asr: 2asr: 1search/decoding algorithms for asr: 1language modeling and lexical modeling for asr: 1low-resource asr development: 1zero, low-resource and multi-modal speech recognition: 1novel models and training methods for asr: 1language and lexical modeling for asr: 1speech classification: 1asr neural network architectures and training: 1asr neural network architectures: 1
IEEE keywordsspeech recognition: 12adaptation models: 6decoding: 6end to end asr: 4recurrent neural nets: 4transducers: 3error analysis: 3video on demand: 3natural language processing: 3computational modeling: 3rnn t: 3degradation: 2computer architecture: 2task analysis: 2asr: 2production: 2data models: 2semi supervised learning: 2domain adaptation: 2multilingual: 2text analysis: 2two pass asr: 2conformer: 2speech coding: 2rnnt: 2long form asr: 2latency: 2tail: 1adapter finetuning: 1streaming multilingual asr: 1longform asr: 1fuses: 1tensors: 1segmentation: 1earth observing system: 1decoding algorithms: 1real time systems: 1signal processing algorithms: 1transfer learning: 1costs: 1memory management: 1analytical models: 1foundation model: 1noisy student training: 1machine learning: 1rnn transducer: 1knowledge distillation: 1foundation models: 1frequency modulation: 1soft sensors: 1and foundation speech models: 1model reprogramming: 1pre trained adaptation: 1benchmark testing: 1focusing: 1cross lingual speech recognition: 1buildings: 1switches: 1utf 8 byte: 1unified modeling language: 1word piece: 1semi supervised learning (artificial intelligence): 1self supervised learning: 1massive: 1lifelong learning: 1on device learning: 1associative memory: 1fast contextual adaptation: 1speaker recognition: 1cascaded encoders: 1confidence scores: 1probability: 1hidden markov models: 1second pass asr: 1endpointer: 1optimisation: 1vocabulary: 1supervised learning: 1
Most publications (all venues) at: 2022: 20, 2023: 10, 2021: 8, 2020: 5, 2005: 3

Affiliations
URLs

Recent publications

ICASSP2024 Junwen Bai, Bo Li 0028, Qiujia Li, Tara N. Sainath, Trevor Strohman
Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR.

ICASSP2023 Shuo-Yiin Chang, Chao Zhang 0031, Tara N. Sainath, Bo Li 0028, Trevor Strohman
Context-Aware end-to-end ASR Using Self-Attentive Embedding and Tensor Fusion.

ICASSP2023 Ke Hu, Tara N. Sainath, Bo Li 0028, Nan Du 0002, Yanping Huang, Andrew M. Dai, Yu Zhang 0033, Rodrigo Cabrera, Zhifeng Chen, Trevor Strohman
Massively Multilingual Shallow Fusion with Large Language Models.

ICASSP2023 W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model.

ICASSP2023 Zhouyuan Huo, Khe Chai Sim, Bo Li 0028, Dongseong Hwang, Tara N. Sainath, Trevor Strohman
Resource-Efficient Transfer Learning from Speech Foundation Model Using Hierarchical Feature Fusion.

ICASSP2023 Dongseong Hwang, Khe Chai Sim, Yu Zhang 0033, Trevor Strohman
Comparison of Soft and Hard Target RNN-T Distillation for Large-Scale ASR.

ICASSP2023 Bo Li 0028, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang 0033, Wei Han 0002, Trevor Strohman, Françoise Beaufays, 
Efficient Domain Adaptation for Speech Foundation Models.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition.

ICASSP2023 Chao Zhang 0031, Bo Li 0028, Tara N. Sainath, Trevor Strohman, Shuo-Yiin Chang, 
UML: A Universal Monolingual Output Layer For Multilingual ASR.

ICASSP2022 Ke Hu, Tara N. Sainath, Arun Narayanan, Ruoming Pang, Trevor Strohman
Transducer-Based Streaming Deliberation for Cascaded Encoders.

ICASSP2022 Dongseong Hwang, Ananya Misra, Zhouyuan Huo, Nikhil Siddhartha, Shefali Garg, David Qiu, Khe Chai Sim, Trevor Strohman, Françoise Beaufays, Yanzhang He, 
Large-Scale ASR Domain Adaptation Using Self- and Semi-Supervised Learning.

ICASSP2022 Bo Li 0028, Ruoming Pang, Yu Zhang 0033, Tara N. Sainath, Trevor Strohman, Parisa Haghani, Yun Zhu, Brian Farris, Neeraj Gaur, Manasa Prasad, 
Massively Multilingual ASR: A Lifelong Learning Solution.

ICASSP2022 Tsendsuren Munkhdalai, Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Trevor Strohman, Françoise Beaufays, 
Fast Contextual Adaptation with Neural Associative Memory for On-Device Personalized Speech Recognition.

ICASSP2022 Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Weiran Wang, David Qiu, Chung-Cheng Chiu, Rohit Prabhavalkar, Alexander Gruenstein, Anmol Gulati, Bo Li 0028, David Rybach, Emmanuel Guzman, Ian McGraw, James Qin, Krzysztof Choromanski, Qiao Liang 0001, Robert David, Ruoming Pang, Shuo-Yiin Chang, Trevor Strohman, W. Ronny Huang, Wei Han 0002, Yonghui Wu, Yu Zhang 0033, 
Improving The Latency And Quality Of Cascaded Encoders.

Interspeech2022 Shuo-Yiin Chang, Bo Li 0028, Tara N. Sainath, Chao Zhang 0031, Trevor Strohman, Qiao Liang 0001, Yanzhang He, 
Turn-Taking Prediction for Natural Conversational Speech.

Interspeech2022 Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara N. Sainath, Bo Li 0028, Qiao Liang 0001, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman
Streaming Intended Query Detection using E2E Modeling for Continued Conversation.

Interspeech2022 Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang 0016, Rina Panigrahy, Qiao Liang 0001, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes.

Interspeech2022 Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang, 
Improving Deliberation by Text-Only and Semi-Supervised Training.

Interspeech2022 W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor D. Strohman, Shankar Kumar, 
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition.

Interspeech2022 Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays, 
Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device.

#122  | Abeer Alwan | DBLP Google Scholar  
By venue: Interspeech: 22, ICASSP: 7, SpeechComm: 3, TASLP: 1
By year: 2024: 2, 2023: 6, 2022: 10, 2021: 3, 2020: 4, 2019: 4, 2018: 4
ISCA sessionsconnecting speech-science and speech-technology for children's speech: 2speech and language in health: 2speaking styles and interaction styles: 2speech signal analysis: 1low-resource asr development: 1inclusive and fair speech technologies: 1multimodal speech emotion recognition and paralinguistics: 1non-autoregressive sequential modeling for speech processing: 1topics in asr: 1speaker recognition: 1computational paralinguistics: 1large-scale evaluation of short-duration speaker verification: 1summarization, semantic analysis and classification: 1the interspeech 2019 computational paralinguistics challenge (compare): 1spoken language processing for children’s speech: 1integrating speech science and technology for clinical applications: 1acoustic modelling: 1the interspeech 2018 computational paralinguistics challenge (compare): 1applications in education and learning: 1
IEEE keywordsspeech recognition: 4data augmentation: 4adaptation models: 2transformers: 2children’s speech: 2african american english: 2information retrieval: 1training data: 1automatic speech recognition: 1spoken question answering: 1audio recording: 1large language models: 1spoken language understanding: 1computational modeling: 1iterative decoding: 1non autoregressive transformer: 1ctc alignment: 1end to end asr: 1intermediate loss: 1decoding: 1semantics: 1social networking (online): 1data models: 1self supervised learning: 1language modeling: 1dialect identification: 1blogs: 1dialect robust asr: 1hidden markov models: 1low resource asr: 1linear predictive coding: 1frame rate: 1time frequency resolution: 1x vector: 1depression detection: 1meta initialization: 1computer aided instruction: 1kindergarten aged asr: 1task augmentation: 1child asr: 1error analysis: 1feature normalization: 1child speech recognition: 1frequency domain analysis: 1perturbation methods: 1fundamental frequency: 1sensor fusion: 1speaker discrimination: 1voice quality: 1automatic speaker verification: 1speaker perception: 1decision making: 1cepstral analysis: 1speaker recognition: 1
Most publications (all venues) at: 2022: 12, 2013: 11, 2006: 11, 2010: 10, 2009: 10

Affiliations
University of California, Los Angeles, USA

Recent publications

SpeechComm2024 Jinhan Wang, Vijay Ravi, Jonathan Flint, Abeer Alwan
Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification.

ICASSP2024 Natarajan Balaji Shankar, Alexander Johnson, Christina Chance, Hariram Veeramani, Abeer Alwan
CORAAL QA: A Dataset and Framework for Open Domain Spontaneous Speech Question Answering from Long Audio Files.

TASLP2023 Ruchao Fan, Wei Chu, Peng Chang 0002, Abeer Alwan
A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition.

ICASSP2023 Alexander Johnson, Vishwas M. Shetty, Mari Ostendorf, Abeer Alwan
Leveraging Multiple Sources in Automatic African American English Dialect Detection for Adults and Children.

Interspeech2023 Eray Eren, Lee Ngee Tan, Abeer Alwan
FusedF0: Improving DNN-based F0 Estimation by Fusion of Summary-Correlograms and Raw Waveform Representations of Speech Signals.

Interspeech2023 Alexander Johnson, Hariram Veeramani, Natarajan Balaji Shankar, Abeer Alwan
An Equitable Framework for Automatically Assessing Children's Oral Narrative Language Abilities.

Interspeech2023 Vishwas M. Shetty, Steven M. Lulich, Abeer Alwan
Developmental Articulatory and Acoustic Features for Six to Ten Year Old Children.

Interspeech2023 Jinhan Wang, Vijay Ravi, Abeer Alwan
Non-uniform Speaker Disentanglement For Depression Detection From Raw Speech Signals.

SpeechComm2022 Gary Yeung, Ruchao Fan, Abeer Alwan
Fundamental frequency feature warping for frequency normalization and data augmentation in child automatic speech recognition.

ICASSP2022 Alexander Johnson, Ruchao Fan, Robin Morris, Abeer Alwan
LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and Zero-Resource Children's Dialects.

ICASSP2022 Vijay Ravi, Jinhan Wang, Jonathan Flint, Abeer Alwan
Fraug: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals.

ICASSP2022 Yunzheng Zhu, Ruchao Fan, Abeer Alwan
Towards Better Meta-Initialization with Task Augmentation for Kindergarten-Aged Speech Recognition.

Interspeech2022 Amber Afshan, Abeer Alwan
Attention-based conditioning methods using variable frame rate for style-robust speaker verification.

Interspeech2022 Amber Afshan, Abeer Alwan
Learning from human perception to improve automatic speaker verification in style-mismatched conditions.

Interspeech2022 Ruchao Fan, Abeer Alwan
DRAFT: A Novel Framework to Reduce Domain Shifting in Self-supervised Learning and Its Application to Children's ASR.

Interspeech2022 Alexander Johnson, Kevin Everson, Vijay Ravi, Anissa Gladney, Mari Ostendorf, Abeer Alwan
Automatic Dialect Density Estimation for African American English.

Interspeech2022 Vijay Ravi, Jinhan Wang, Jonathan Flint, Abeer Alwan
A Step Towards Preserving Speakers' Identity While Detecting Depression Via Speaker Disentanglement.

Interspeech2022 Jinhan Wang, Vijay Ravi, Jonathan Flint, Abeer Alwan
Unsupervised Instance Discriminative Learning for Depression Detection from Speech Signals.

ICASSP2021 Gary Yeung, Ruchao Fan, Abeer Alwan
Fundamental Frequency Feature Normalization and Data Augmentation for Child Speech Recognition.

Interspeech2021 Ruchao Fan, Wei Chu, Peng Chang 0002, Jing Xiao 0006, Abeer Alwan
An Improved Single Step Non-Autoregressive Transformer for Automatic Speech Recognition.

#123  | Mark J. F. Gales | DBLP Google Scholar  
By venue: Interspeech: 21, ICASSP: 8, TASLP: 3, NAACL: 1
By year: 2024: 2, 2023: 5, 2022: 2, 2021: 4, 2020: 8, 2019: 6, 2018: 6
ISCA sessionsspeech recognition: 3spoken language evaluations: 3speech synthesis: 2language learning and databases: 2applications in education and learning: 2show and tell: 1applications in transcription, education and learning: 1automatic speech recognition for non-native children’s speech: 1pronunciation: 1summarization, semantic analysis and classification: 1language modeling: 1recurrent neural models for asr: 1statistical parametric speech synthesis: 1acoustic model adaptation: 1
IEEE keywordsspeech recognition: 8natural language processing: 5recurrent neural nets: 3data models: 2computational modeling: 2predictive models: 2confidence: 2recurrent neural network: 2ensemble: 2spoken language assessment: 2lattice: 2probability: 2grammar: 1spoken grammatical error correction: 1disfluency detection: 1automatic speaking assessment and feedback: 1foundation speech recognition models: 1analytical models: 1pipelines: 1speech: 1text to speech: 1computer architecture: 1speech synthesis: 1ensemble methods: 1prosody prediction: 1information retrieval: 1attention: 1graph structures: 1error correction: 1training data: 1task analysis: 1measurement uncertainty: 1dirichlet: 1grammatical error correction: 1transformer: 1distribution distillation: 1uncertainty: 1end to end training: 1embedding passing: 1language translation: 1speech translation: 1spoken language processing: 1regression analysis: 1bias in deep learning: 1deep learning (artificial intelligence): 1concept activation vectors: 1sub word: 1neural network: 1quality control: 1keyword search: 1language model: 1feedforward: 1succeeding words: 1automatic speech recognition: 1teacher student: 1random forest: 1audio signal processing: 1television broadcasting: 1lattice free: 1computer aided instruction: 1linguistics: 1grammatical error detection: 1call: 1bi directional recurrent neural network: 1confusion network: 1confidence estimation: 1
Most publications (all venues) at: 2023: 30, 2015: 26, 2011: 25, 2013: 24, 2024: 23


Recent publications

ICASSP2024 Stefano Bannò, Rao Ma, Mengjie Qian, Kate M. Knill, Mark J. F. Gales
Towards End-to-End Spoken Grammatical Error Correction.

NAACL2024 Rao Ma, Adian Liusie, Mark J. F. Gales, Kate M. Knill, 
Investigating the Emergent Audio Classification Ability of ASR Foundation Models.

ICASSP2023 Tian Huey Teh, Vivian Hu, Devang S. Ram Mohan, Zack Hodari, Christopher G. R. Wallis, Tomás Gómez Ibarrondo, Alexandra Torresquintero, James Leoni, Mark J. F. Gales, Simon King 0001, 
Ensemble Prosody Prediction For Expressive Speech Synthesis.

Interspeech2023 Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales
Multi-Head State Space Model for Speech Recognition.

Interspeech2023 Rao Ma, Mark J. F. Gales, Kate M. Knill, Mengjie Qian, 
N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space.

Interspeech2023 Rao Ma, Mengjie Qian, Mark J. F. Gales, Kate M. Knill, 
Adapting an Unadaptable ASR System.

Interspeech2023 Diane Nicholls, Kate M. Knill, Mark J. F. Gales, Anton Ragni, Paul Ricketts, 
Speak & Improve: L2 English Speaking Practice Tool.

TASLP2022 Anton Ragni, Mark J. F. Gales, Oliver Rose, Katherine M. Knill, Alexandros Kastanos, Qiujia Li, Preben Ness, 
Increasing Context for Estimating Confidence Scores in Automatic Speech Recognition.

Interspeech2022 Stefano Bannò, Bhanu Balusu, Mark J. F. Gales, Kate M. Knill, Konstantinos Kyriakopoulos, 
View-Specific Assessment of L2 Spoken English.

ICASSP2021 Yassir Fathullah, Mark J. F. Gales, Andrey Malinin, 
Ensemble Distillation Approaches for Grammatical Error Correction.

ICASSP2021 Yiting Lu, Yu Wang 0027, Mark J. F. Gales
Efficient Use of End-to-End Data in Spoken Language Processing.

ICASSP2021 Xizi Wei, Mark J. F. Gales, Kate M. Knill, 
Analysing Bias in Spoken Language Assessment Using Concept Activation Vectors.

Interspeech2021 Qingyun Dou, Xixin Wu, Moquan Wan, Yiting Lu, Mark J. F. Gales
Deliberation-Based Multi-Pass Speech Synthesis.

ICASSP2020 Alexandros Kastanos, Anton Ragni, Mark J. F. Gales
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks.

Interspeech2020 Qingyun Dou, Joshua Efiong, Mark J. F. Gales
Attention Forcing for Speech Synthesis.

Interspeech2020 Kate M. Knill, Linlin Wang, Yu Wang 0027, Xixin Wu, Mark J. F. Gales
Non-Native Children's Automatic Speech Recognition: The INTERSPEECH 2020 Shared Task ALTA Systems.

Interspeech2020 Konstantinos Kyriakopoulos, Kate M. Knill, Mark J. F. Gales
Automatic Detection of Accent and Lexical Pronunciation Errors in Spontaneous Non-Native English Speech.

Interspeech2020 Yiting Lu, Mark J. F. Gales, Yu Wang 0027, 
Spoken Language 'Grammatical Error Correction'.

Interspeech2020 Potsawee Manakul, Mark J. F. Gales, Linlin Wang, 
Abstractive Spoken Document Summarization Using Hierarchical Model with Multi-Stage Attention Diversity Optimization.

Interspeech2020 Vyas Raina, Mark J. F. Gales, Kate M. Knill, 
Universal Adversarial Attacks on Spoken Language Assessment Systems.

#124  | Sakriani Sakti | DBLP Google Scholar  
By venue: Interspeech: 21, TASLP: 6, ICASSP: 5, EMNLP: 1
By year: 2023: 5, 2022: 3, 2021: 6, 2020: 8, 2019: 8, 2018: 3
ISCA sessionsspoken machine translation: 3speech synthesis: 2the zero resource speech challenge 2020: 2the zero resource speech challenge 2019: 2speech recognition: 1self-supervised, semi-supervised, adaptation and data augmentation for asr: 1low-resource speech recognition: 1lm adaptation, lexical units and punctuation: 1general topics in speech recognition: 1neural signals for spoken communication: 1topics in asr: 1search methods for speech recognition: 1speech in the brain: 1sequence models for asr: 1acoustic model adaptation: 1statistical parametric speech synthesis: 1
IEEE keywordsspeech synthesis: 5speech recognition: 5text to speech: 2lombard effect: 2dpgmm: 2unsupervised phoneme discovery: 2zerospeech: 2recurrent neural nets: 2unsupervised learning: 2natural language processing: 2gaussian processes: 2signal reconstruction: 2speech chain: 2tts: 2asr: 2feature fusion: 1isotropy analysis: 1representation learning: 1bit error rate: 1acoustic measurements: 1self supervised learning: 1geometry: 1predictive coding: 1zero resource speech challenge: 1self adaptive: 1machine speech chain: 1incremental: 1low latency communication: 1hurricanes: 1noise measurement: 1real time systems: 1acoustic noise: 1dynamic adaptation: 1machine speech chain inference: 1signal denoising: 1speech intelligibility: 1hearing: 1low resource asr: 1infant speech perception: 1engrams: 1perception of phonemes: 1rnn: 1functional load: 1automatic speech recognition: 1information retrieval: 1interactive systems: 1emotion recognition: 1affective computing: 1human computer interaction: 1emotion elicitation: 1chat based dialogue system: 1independent component analysis: 1blind source separation: 1cognition: 1eeg: 1medical signal processing: 1electroencephalography: 1speech artifact removal: 1neurophysiology: 1spoken word production: 1tensor decomposition: 1brain: 1straight through estimator: 1end to end feedback loss: 1word segmentation: 1data models: 1cross lingual speech processing: 1labeling: 1indexes: 1tobi label generation: 1prosody detection: 1task analysis: 1
Most publications (all venues) at: 2014: 41, 2015: 35, 2018: 33, 2020: 21, 2019: 20

Affiliations
URLs

Recent publications

ICASSP2023 Jianan Chen, Sakriani Sakti
An Isotropy Analysis for Self-Supervised Acoustic Unit Embeddings on the Zero Resource Speech Challenge 2021 Framework.

ICASSP2023 Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura 0001, 
Self-Adaptive Incremental Machine Speech Chain for Lombard TTS with High-Granularity ASR Feedback in Dynamic Noise Condition.

Interspeech2023 Shun Takahashi, Sakriani Sakti
Unsupervised Learning of Discrete Latent Representations with Data-Adaptive Dimensionality from Continuous Speech Streams.

Interspeech2023 Chung Tran, Chi Mai Luong, Sakriani Sakti
STEN-TTS: Improving Zero-shot Cross-Lingual Transfer for Multi-Lingual TTS with Style-Enhanced Normalization Diffusion Framework.

EMNLP2023 Ruhiyah Widiaputri, Ayu Purwarianti, Dessi Puji Lestari, Kurniawati Azizah, Dipta Tanaya, Sakriani Sakti
Speech Recognition and Meaning Interpretation: Towards Disambiguation of Structurally Ambiguous Spoken Utterances in Indonesian.

TASLP2022 Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura 0001, 
A Machine Speech Chain Approach for Dynamically Adaptive Lombard TTS in Static and Dynamic Noise Environments.

TASLP2022 Bin Wu, Sakriani Sakti, Jinsong Zhang 0001, Satoshi Nakamura 0001, 
Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-Resource ASR.

Interspeech2022 Heli Qi, Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura 0001, 
Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing.

TASLP2021 Bin Wu, Sakriani Sakti, Jinsong Zhang 0001, Satoshi Nakamura 0001, 
Tackling Perception Bias in Unsupervised Phoneme Discovery Using DPGMM-RNN Hybrid Model and Functional Load.

Interspeech2021 Johanes Effendi, Sakriani Sakti, Satoshi Nakamura 0001, 
Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer.

Interspeech2021 Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura 0001, 
ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation.

Interspeech2021 Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura 0001, 
Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder.

Interspeech2021 Shun Takahashi, Sakriani Sakti, Satoshi Nakamura 0001, 
Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages.

Interspeech2021 Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura 0001, 
Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation.

TASLP2020 Andros Tjandra, Sakriani Sakti, Satoshi Nakamura 0001, 
Machine Speech Chain.

TASLP2020 Andros Tjandra, Sakriani Sakti, Satoshi Nakamura 0001, 
Corrections to "Machine Speech Chain".

Interspeech2020 Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux, 
The Zero Resource Speech Challenge 2020: Discovering Discrete Subword and Word Units.

Interspeech2020 Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura 0001, 
Augmenting Images for ASR and TTS Through Single-Loop and Dual-Loop Multimodal Chain Framework.

Interspeech2020 Sashi Novitasari, Andros Tjandra, Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura 0001, 
Incremental Machine Speech Chain Towards Enabling Listening While Speaking in Real-Time.

Interspeech2020 Ivan Halim Parmonangan, Hiroki Tanaka, Sakriani Sakti, Satoshi Nakamura 0001, 
Combining Audio and Brain Activity for Predicting Speech Quality.

#125  | Odette Scharenborg | DBLP Google Scholar  
By venue: Interspeech: 17, ICASSP: 9, SpeechComm: 4, TASLP: 3
By year: 2024: 1, 2023: 4, 2022: 8, 2021: 7, 2020: 6, 2019: 4, 2018: 3
ISCA sessionsspeech and voice disorders: 2applications of asr: 2spoken language processing: 1low-resource asr development: 1technology for disordered speech: 1multi-, cross-lingual and other topics in asr: 1spoken dialogue systems and multimodality: 1low-resource speech recognition: 1topics in asr: 1phonetic event detection and segmentation: 1neural networks for language modeling: 1speech in the brain: 1signal analysis for the natural, biological and social sciences: 1speech perception in adverse conditions: 1deep neural networks: 1
IEEE keywordsspeech recognition: 7misp challenge: 4visualization: 4speech synthesis: 4automatic speech recognition: 3natural language processing: 3speech enhancement: 2multimodality: 2recording: 2audio visual: 2speaker diarization: 2multimodal modelling: 2text analysis: 2decoding: 2data mining: 1target speaker extraction: 1real world scenarios: 1benchmark testing: 1synchronization: 1brain modeling: 1resnet: 1covert (imagined) speech: 1computational modeling: 1electroencephalography (eeg): 1brain computer interfaces: 1electroencephalography: 1databases: 1tv: 1quality assessment: 1public domain software: 1wake word spotting: 1audio visual systems: 1speaker recognition: 1microphone array: 1pathology: 1sequence to sequence modeling: 1decision making: 1dysarthric speech: 1pathological speech: 1autoencoder: 1voice conversion: 1task analysis: 1cross modal captioning: 1image annotation: 1image to speech generation: 1image recognition: 1speech to image generation: 1adversarial learning: 1supervised learning: 1speech embedding: 1multilingual: 1zero shot learning: 1phonotactics: 1sequence to sequence: 1image to speech: 1image captioning: 1encoder decoder: 1language acquisition: 1spoken term discovery: 1low resource speech technology: 1multimodal learning: 1human computer interaction: 1unsupervised learning: 1image retrieval: 1image representation: 1
Most publications (all venues) at: 2021: 16, 2022: 13, 2023: 10, 2020: 9, 2019: 8

Affiliations
URLs

Recent publications

ICASSP2024 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang 0029, Hongbo Lan, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao, 
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.

SpeechComm2023 Bence Mark Halpern, Siyuan Feng 0001, Rob van Son, Michiel W. M. van den Brekel, Odette Scharenborg
Automatic evaluation of spontaneous oral cancer speech using ratings from naive listeners.

ICASSP2023 Hang Chen, Shilong Wu, Yusheng Dai, Zhe Wang, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
Summary on the Multimodal Information Based Speech Processing (MISP) 2022 Challenge.

ICASSP2023 Bo Dekker, Alfred C. Schouten, Odette Scharenborg
DAIS: The Delft Database of EEG Recordings of Dutch Articulated and Imagined Speech.

ICASSP2023 Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization And Recognition.

SpeechComm2022 Bence Mark Halpern, Siyuan Feng 0001, Rob van Son, Michiel W. M. van den Brekel, Odette Scharenborg
Low-resource automatic speech recognition and error analyses of oral cancer speech.

ICASSP2022 Hang Chen, Hengshun Zhou, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines And Results.

ICASSP2022 Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda, 
Towards Identity Preserving Normal to Dysarthric Voice Conversion.

Interspeech2022 Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee 0001, Sabato Marco Siniscalchi, Shinji Watanabe 0001, Odette Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan, 
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis.

Interspeech2022 Tanvina Patel, Odette Scharenborg
Using cross-model learnings for the Gram Vaani ASR Challenge 2022.

Interspeech2022 Luke Prananta, Bence Mark Halpern, Siyuan Feng 0001, Odette Scharenborg
The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition.

Interspeech2022 Yuanyuan Zhang, Yixuan Zhang, Bence Mark Halpern, Tanvina Patel, Odette Scharenborg
Mitigating bias against non-native accents.

Interspeech2022 Hengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee 0001, Sabato Marco Siniscalchi, Shinji Watanabe 0001, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jianqing Gao, 
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis.

SpeechComm2021 Polina Drozdova, Roeland van Hout, Sven L. Mattys, Odette Scharenborg
The effect of intermittent noise on lexically-guided perceptual learning in native and non-native listening.

TASLP2021 Xinsheng Wang, Justin van der Hout, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg
Synthesizing Spoken Descriptions of Images.

TASLP2021 Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg
Generating Images From Spoken Descriptions.

ICASSP2021 Siyuan Feng 0001, Piotr Zelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak, 
How Phonotactics Affect Multilingual and Zero-Shot ASR Performance.

ICASSP2021 Xinsheng Wang, Siyuan Feng 0001, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg
Show and Speak: Directly Synthesize Spoken Description of Images.

ICASSP2021 Liming Wang, Xinsheng Wang, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak, 
Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval.

Interspeech2021 Siyuan Feng 0001, Piotr Zelasko, Laureano Moro-Velázquez, Odette Scharenborg
Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation.

#126  | Shinnosuke Takamichi | DBLP Google Scholar  
By venue: Interspeech: 17, ICASSP: 11, SpeechComm: 2, TASLP: 2, IJCAI: 1
By year: 2024: 5, 2023: 9, 2022: 6, 2021: 5, 2020: 6, 2019: 2
ISCA sessionsspeech synthesis: 10speech perception, production, and acquisition: 1speech coding and restoration: 1the voicemos challenge: 1spoken language processing: 1speech annotation and speech assessment: 1speech synthesis paradigms and methods: 1speech in the brain: 1
IEEE keywordsspeech synthesis: 5text to speech synthesis: 3symbols: 2task analysis: 2video on demand: 2speaker embedding: 2speaker recognition: 2speech perception: 2generative adversarial networks: 2generative adversarial network: 2black box optimization: 2human computation: 2speech recognition: 2transfer learning: 1training data: 1multilingual text to speech: 1low resource adaptation: 1graphone: 1data models: 1adaptation models: 1adaptation of masked language model: 1vocal imitation: 1foley sound synthesis: 1sound event label: 1rhythm: 1environmental sound synthesis: 1decoding: 1acoustic measurements: 1corpus construction: 1linguistics: 1core set selection: 1diversification: 1data selection: 1human machine systems: 1oral communication: 1generative spoken language model: 1speech analysis: 1zipf’s law: 1annotations: 1speech representation: 1vocabulary: 1social networking (online): 1vocal ensemble: 1audio source separation: 1corpus: 1lead: 1audio recording: 1singing voice synthesis: 1signal processing algorithms: 1singing voice: 1degradation: 1controllability: 1cross lingual speech synthesis: 1multi speaker speech synthesis: 1speaker generation: 1interpolation: 1context modeling: 1audiobook: 1aggregates: 1signal resolution: 1speech prosody: 1tts: 1predictive models: 1feeds: 1active learning: 1deep speaker representation learning: 1multi speaker generative modeling: 1perceptual speaker similarity: 1scalability: 1auxiliary classifier: 1backpropagation algorithms: 1conditional generator: 1domain adaptation: 1text analysis: 1mutual information: 1cross lingual: 1crowdsourcing: 1computational modeling: 1human voice: 1gallium nitride: 1generators: 1spectral differentials: 1deep neural network: 1minimum phase filter: 1sub band processing: 1hilbert transforms: 1voice conversion: 1gaussian processes: 1artificial double tracking: 1modulation spectrum: 1moment matching network: 1inter utterance pitch variation: 1dnn based singing voice synthesis: 1music: 1filtering theory: 1
Most publications (all venues) at: 2021: 18, 2024: 16, 2023: 16, 2022: 16, 2020: 16

Affiliations
URLs

Recent publications

SpeechComm2024 Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari, 
JNV corpus: A corpus of Japanese nonverbal vocalizations with diverse phrases and emotions.

TASLP2024 Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe 0001, Shinnosuke Takamichi, Hiroshi Saruwatari, 
Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis.

ICASSP2024 Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryotaro Nagase, Takahiro Fukumori, Yoichi Yamashita, 
Environmental Sound Synthesis from Vocal Imitations and Sound Event Labels.

ICASSP2024 Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari, 
Diversity-Based Core-Set Selection for Text-to-Speech with Linguistic and Acoustic Features.

ICASSP2024 Shinnosuke Takamichi, Hiroki Maeda, Joonyong Park, Daisuke Saito, Hiroshi Saruwatari, 
Do Learned Speech Symbols Follow Zipf's Law?

ICASSP2023 Tomohiko Nakamura, Shinnosuke Takamichi, Naoko Tanji, Satoru Fukayama, Hiroshi Saruwatari, 
jaCappella Corpus: A Japanese a Cappella Vocal Ensemble Corpus.

ICASSP2023 Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari, 
MID-Attribute Speaker Generation Using Optimal-Transport-Based Interpolation of Gaussian Mixture Models.

ICASSP2023 Detai Xin, Sharath Adavanne, Federico Ang, Ashish Kulkarni, Shinnosuke Takamichi, Hiroshi Saruwatari, 
Improving Speech Prosody of Audiobook Text-To-Speech Synthesis with Acoustic and Textual Contexts.

Interspeech2023 Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki, Detai Xin, Hiroshi Saruwatari, 
How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics.

Interspeech2023 Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari, 
CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center.

Interspeech2023 Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, Hiroshi Saruwatari, 
ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings.

Interspeech2023 Yota Ueda, Shinnosuke Takamichi, Yuki Saito, Norihiro Takamune, Hiroshi Saruwatari, 
HumanDiffusion: diffusion model using perceptual gradients.

Interspeech2023 Detai Xin, Shinnosuke Takamichi, Ai Morimatsu, Hiroshi Saruwatari, 
Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus.

IJCAI2023 Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe 0001, Shinnosuke Takamichi, Hiroshi Saruwatari, 
Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining.

Interspeech2022 Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari, 
Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis.

Interspeech2022 Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari, 
Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History.

Interspeech2022 Takaaki Saeki, Shinnosuke Takamichi, Tomohiko Nakamura, Naoko Tanji, Hiroshi Saruwatari, 
SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling.

Interspeech2022 Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Hiroshi Saruwatari, 
UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022.

Interspeech2022 Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari, 
STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent.

Interspeech2022 Shinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari, 
J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis.

#127  | Sabato Marco Siniscalchi | DBLP Google Scholar  
By venue: Interspeech: 14, ICASSP: 13, TASLP: 4, ICLR: 1, NeurIPS: 1
By year: 2024: 3, 2023: 8, 2022: 4, 2021: 5, 2020: 8, 2019: 5
ISCA sessionsspeech coding and enhancement: 2acoustic scene classification: 2bioacoustics and articulation: 2speaker and language identification: 1speech recognition: 1spoken language processing: 1spoken dialogue systems and multimodality: 1speech signal analysis and representation: 1privacy-preserving machine learning for audio & speech processing: 1multi-channel speech enhancement: 1speech synthesis: 1
IEEE keywordsspeech recognition: 9speech enhancement: 6misp challenge: 4visualization: 4deep neural network: 4data models: 3benchmark testing: 2multimodality: 2recording: 2audio visual: 2speaker diarization: 2hidden markov models: 2noise measurement: 2acoustic to articulatory inversion: 2task analysis: 2automatic speech recognition: 2natural language processing: 2regression analysis: 2transfer learning: 2data mining: 1target speaker extraction: 1real world scenarios: 1error analysis: 1degradation: 1knowledge based systems: 1boosting: 1multilingual automatic speech recognition: 1articulatory speech attributes: 1synchronization: 1tv: 1quality assessment: 1convolution: 1kernel: 1encoding: 1multi task training: 1mel frequency cepstral coefficient: 1speaker independent models: 1public domain software: 1wake word spotting: 1audio visual systems: 1speaker recognition: 1microphone array: 1image analysis: 1acoustic scene classification: 1data augmentation: 1convolutional neural networks: 1robustness: 1analytical models: 1class activation mapping: 1deep learning (artificial intelligence): 1estimation theory: 1dnn: 1feedforward neural nets: 1fbe: 1data privacy: 1acoustic modeling: 1and federated learning: 1quantum machine learning: 1recurrent neural nets: 1speech articulatory attributes: 1backpropagation: 1maximal figure of merit: 1deep bottleneck features: 1convolutional recurrent neural network: 1spoken language recognition: 1tensors: 1tensor train network: 1tensor to vector regression: 1noise robustness: 1domain adaptation: 1teacher student learning: 1adaptation models: 1pattern classification: 1non native tone modeling and mispronunciation detection: 1computer assisted pronunciation training (capt): 1computer assisted language learning (call): 1function approximation: 1expressive power: 1universal approximation: 1vector to vector regression: 1training data: 1switches: 1code switching: 1decoding: 1retraining free: 1multilingual speech recognition: 1cross modal training: 1signal to noise ratio: 1environmental aware training: 1databases: 1student teacher training: 1audio visual speech recognition: 1
Most publications (all venues) at: 2024: 13, 2020: 13, 2023: 12, 2021: 12, 2017: 10

Affiliations
URLs

Recent publications

ICASSP2024 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang 0029, Hongbo Lan, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao, 
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.

ICASSP2024 Hao Yen, Sabato Marco Siniscalchi, Chin-Hui Lee 0001, 
Boosting End-to-End Multilingual Phoneme Recognition Through Exploiting Universal Speech Attributes Constraints.

ICLR2024 Chen Chen 0075, Ruizhe Li 0001, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Engsiong Chng, Chao-Han Huck Yang, 
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition.

ICASSP2023 Hang Chen, Shilong Wu, Yusheng Dai, Zhe Wang, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
Summary on the Multimodal Information Based Speech Processing (MISP) 2022 Challenge.

ICASSP2023 Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization And Recognition.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee 0001, 
A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition.

Interspeech2023 Chun-Wei Ho, Chao-Han Huck Yang, Sabato Marco Siniscalchi
Differentially Private Adapters for Parameter Efficient Acoustic Modeling.

Interspeech2023 Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee 0001, 
A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models.

Interspeech2023 Salvatore Sarni, Sandro Cumani, Sabato Marco Siniscalchi, Andrea Bottino, 
Description and analysis of the KPT system for NIST Language Recognition Evaluation 2022.

Interspeech2023 Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Yu Tsao 0001, 
Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition.

NeurIPS2023 Chen Chen 0075, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Chng Eng Siong, 
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models.

TASLP2022 Abdolreza Sabzi Shahrebabaki, Giampiero Salvi, Torbjørn Svendsen, Sabato Marco Siniscalchi
Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models.

ICASSP2022 Hang Chen, Hengshun Zhou, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines And Results.

Interspeech2022 Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee 0001, Sabato Marco Siniscalchi, Shinji Watanabe 0001, Odette Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan, 
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis.

Interspeech2022 Hengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee 0001, Sabato Marco Siniscalchi, Shinji Watanabe 0001, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jianqing Gao, 
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis.

ICASSP2021 Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai 0002, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee 0001, 
A Two-Stage Approach to Device-Robust Acoustic Scene Classification.

ICASSP2021 Abdolreza Sabzi Shahrebabaki, Negar Olfati, Ali Shariq Imran, Magne Hallstein Johnsen, Sabato Marco Siniscalchi, Torbjørn Svendsen, 
A Two-Stage Deep Modeling Approach to Articulatory Inversion.

ICASSP2021 Chao-Han Huck Yang, Jun Qi 0002, Samuel Yen-Chi Chen, Pin-Yu Chen, Sabato Marco Siniscalchi, Xiaoli Ma, Chin-Hui Lee 0001, 
Decentralizing Feature Extraction with Quantum Convolutional Neural Network for Automatic Speech Recognition.

Interspeech2021 Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Torbjørn Svendsen, 
Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation.

Interspeech2021 Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee 0001, 
PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification.

#128  | Tao Qin 0001 | DBLP Google Scholar  
By venue: Interspeech: 6, TASLP: 5, ICASSP: 5, ICLR: 4, NeurIPS: 4, ICML: 2, AAAI: 2, ACL: 2, KDD: 2, EMNLP-Findings: 1
By year: 2024: 3, 2023: 1, 2022: 6, 2021: 13, 2020: 5, 2019: 5
ISCA sessionsspeech synthesis: 4voice conversion and adaptation: 1multi- and cross-lingual asr, other topics in asr: 1
IEEE keywordsspeech synthesis: 4speech recognition: 3neural machine translation: 3text to speech: 3natural language processing: 3data models: 2task analysis: 2language translation: 2speech intelligibility: 2training data: 1meta learning: 1transformers: 1cross lingual adaptation: 1adaptation models: 1parameter efficiency: 1multiple teachers: 1random processes: 1sub networks: 1natural languages: 1knowledge distillation: 1dropout: 1speech quality assessment: 1medical image processing: 1correlation methods: 1mos prediction: 1mean bias network: 1sensitivity analysis: 1video signal processing: 1search problems: 1neural architecture search: 1fast: 1lightweight: 1autoregressive processes: 1data augmentation: 1low resource: 1mixup: 1adaptation: 1untranscribed data: 1signal reconstruction: 1frame level condition: 1speech enhancement: 1noisy speech: 1signal denoising: 1speaker recognition: 1denoise: 1neural architecture search (nas): 1neural net architecture: 1mathematical model: 1computational modeling: 1semi supervised learning: 1decoding: 1error propagation: 1accuracy drop: 1text analysis: 1language characteristic: 1sequence generation: 1
Most publications (all venues) at: 2021: 45, 2019: 45, 2022: 39, 2020: 26, 2023: 23

Affiliations
Microsoft Research, Beijing, China
URLs

Recent publications

ICML2024 Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan 0003, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu 0001, Tao Qin 0001, Xiangyang Li 0001, Wei Ye 0004, Shikun Zhang, Jiang Bian 0002, Lei He 0005, Jinyu Li 0001, Sheng Zhao, 
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

ICLR2024 Yichong Leng, Zhifang Guo, Kai Shen, Zeqian Ju, Xu Tan 0003, Eric Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He 0005, Xiangyang Li 0001, Sheng Zhao, Tao Qin 0001, Jiang Bian 0002, 
PromptTTS 2: Describing and Generating Voices with Text Prompt.

ICLR2024 Kai Shen, Zeqian Ju, Xu Tan 0003, Eric Liu, Yichong Leng, Lei He 0005, Tao Qin 0001, Sheng Zhao, Jiang Bian 0002, 
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers.

AAAI2023 Yichong Leng, Xu Tan 0003, Wenjie Liu, Kaitao Song, Rui Wang 0028, Xiang-Yang Li 0001, Tao Qin 0001, Edward Lin, Tie-Yan Liu, 
SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition.

TASLP2022 Wenxin Hou, Han Zhu 0004, Yidong Wang, Jindong Wang 0001, Tao Qin 0001, Renjun Xu, Takahiro Shinozaki, 
Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition.

TASLP2022 Xiaobo Liang, Lijun Wu, Juntao Li, Tao Qin 0001, Min Zhang 0005, Tie-Yan Liu, 
Multi-Teacher Distillation With Single Model for Neural Machine Translation.

Interspeech2022 Yihan Wu, Xu Tan 0003, Bohan Li 0003, Lei He 0005, Sheng Zhao, Ruihua Song, Tao Qin 0001, Tie-Yan Liu, 
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios.

Interspeech2022 Guangyan Zhang, Kaitao Song, Xu Tan 0003, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang 0001, Wei Zhou, Tao Qin 0001, Tan Lee 0001, Sheng Zhao, 
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech.

NeurIPS2022 Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen 0008, Xu Tan 0003, Danilo P. Mandic, Lei He 0005, Xiangyang Li 0001, Tao Qin 0001, Sheng Zhao, Tie-Yan Liu, 
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis.

ACL2022 Yi Ren 0006, Xu Tan 0003, Tao Qin 0001, Zhou Zhao, Tie-Yan Liu, 
Revisiting Over-Smoothness in Text to Speech.

ICASSP2021 Yichong Leng, Xu Tan 0003, Sheng Zhao, Frank K. Soong, Xiang-Yang Li 0001, Tao Qin 0001
MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network.

ICASSP2021 Renqian Luo, Xu Tan 0003, Rui Wang 0028, Tao Qin 0001, Jinzhu Li, Sheng Zhao, Enhong Chen, Tie-Yan Liu, 
Lightspeech: Lightweight and Fast Text to Speech with Neural Architecture Search.

ICASSP2021 Linghui Meng 0001, Jin Xu 0010, Xu Tan 0003, Jindong Wang 0001, Tao Qin 0001, Bo Xu 0002, 
MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition.

ICASSP2021 Yuzi Yan, Xu Tan 0003, Bohan Li 0003, Tao Qin 0001, Sheng Zhao, Yuan Shen 0001, Tie-Yan Liu, 
Adaspeech 2: Adaptive Text to Speech with Untranscribed Data.

ICASSP2021 Chen Zhang 0020, Yi Ren 0006, Xu Tan 0003, Jinglin Liu, Kejun Zhang, Tao Qin 0001, Sheng Zhao, Tie-Yan Liu, 
Denoispeech: Denoising Text to Speech with Frame-Level Noise Modeling.

Interspeech2021 Wenxin Hou, Jindong Wang 0001, Xu Tan 0003, Tao Qin 0001, Takahiro Shinozaki, 
Cross-Domain Speech Recognition with Unsupervised Character-Level Distribution Matching.

Interspeech2021 Yuzi Yan, Xu Tan 0003, Bohan Li 0003, Guangyan Zhang, Tao Qin 0001, Sheng Zhao, Yuan Shen 0001, Wei-Qiang Zhang, Tie-Yan Liu, 
Adaptive Text to Speech for Spontaneous Style.

NeurIPS2021 Jiawei Chen 0008, Xu Tan 0003, Yichong Leng, Jin Xu 0010, Guihua Wen, Tao Qin 0001, Tie-Yan Liu, 
Speech-T: Transducer for Text to Speech and Beyond.

NeurIPS2021 Yichong Leng, Xu Tan 0003, Linchen Zhu, Jin Xu 0010, Renqian Luo, Linquan Liu, Tao Qin 0001, Xiangyang Li 0001, Edward Lin, Tie-Yan Liu, 
FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition.

ICLR2021 Yi Ren 0006, Chenxu Hu, Xu Tan 0003, Tao Qin 0001, Sheng Zhao, Zhou Zhao, Tie-Yan Liu, 
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

#129  | Jiqing Han 0001 | DBLP Google Scholar  
By venueInterspeech: 18ICASSP: 12TASLP: 3
By year2024: 32023: 62022: 32021: 42020: 92019: 62018: 2
ISCA sessionssingle-channel speech enhancement: 2audio signal characterization: 2acoustic scenes and rare events: 2spoken language translation, information retrieval, summarization, resources, and evaluation: 1speech recognition: 1speaker and language identification: 1multimodal speech emotion recognition and paralinguistics: 1robust speaker recognition: 1emotion and sentiment analysis: 1neural network training methods for asr: 1noise robust and distant speech recognition: 1acoustic scene classification: 1emotion modeling and analysis: 1speech and audio source separation and scene analysis: 1speech enhancement: 1
IEEE keywordsspeech recognition: 4speaker verification: 3task analysis: 3adaptation models: 2open set domain adaptation: 2event detection: 2emotion recognition: 2probability: 2speaker recognition: 2speech coding: 2speech enhancement: 2convolutional neural nets: 2speech intelligibility: 2monaural speech enhancement: 2multiaccess communication: 1graph convolutional network: 1measurement: 1distance distribution: 1clustering methods: 1covid 19: 1quasi periodic dependency: 1respiratory sound classification: 1sparse matrices: 1self supervised learning: 1representation learning: 1feature disentanglement: 1polyphonic sound event detection: 1contrastive loss: 1error analysis: 1spoofed speech detection: 1graph convolutional networks: 1image edge detection: 1design methodology: 1knowledge engineering: 1focusing: 1anti spoofing: 1graph neural networks: 1sound event detection: 1subband dependency: 1recurrent neural networks: 1convolution: 1convolutional neural networks: 1self attention: 1data models: 1bit error rate: 1bert: 1wav2vec2.0: 1auxiliary task: 1cross attention: 1multimodal emotion recognition: 1signal detection: 1sparse self attention: 1sparsemax: 1semi supervised sound event detection: 1audio signal processing: 1signal classification: 1pairwise distance distributions: 1audio classification: 1temporal dependency modeling: 1benchmark testing: 1hierarchical contrastive predictive coding: 1music: 1predictive models: 1predictive coding: 1generative vocoder: 1vocoders: 1recurrent neural nets: 1joint framework: 1denoising autoencoder: 1text analysis: 1speech emotion recognition: 1non negative matrix factorization: 1cross corpus: 1matrix decomposition: 1natural language processing: 1transfer subspace learning: 1i vector framework: 1task driven multilevel framework: 1end to end: 1gaussian processes: 1iterative methods: 1phoneme aware network: 1phonetic posteriorgram: 1structured sparse: 1end toend: 1automatic speech recognition: 1attention: 1backpropagation: 1time frequency analysis: 1generative adversarial training: 1permutation invariant training: 1source separation: 1gated convolutional neural network: 1speech separation: 1cocktail party problem: 1
Most publications (all venues) at2019: 162020: 152011: 122023: 112021: 8

Affiliations
Harbin Institute of Technology, School of Computer Science and Technology, China
URLs

Recent publications

TASLP2024 Jianchen Li, Jiqing Han 0001, Fan Qian, Tieran Zheng, Yongjun He, Guibin Zheng, 
Distance Metric-Based Open-Set Domain Adaptation for Speaker Verification.

ICASSP2024 Wenjie Song 0003, Jiqing Han 0001, Jianchen Li, Guibin Zheng, Tieran Zheng, Yongjun He, 
Modeling Quasi-Periodic Dependency via Self-Supervised Pre-Training for Respiratory Sound Classification.

ICASSP2024 Yadong Guan, Jiqing Han 0001, Hongwei Song, Wenjie Song 0003, Guibin Zheng, Tieran Zheng, Yongjun He, 
Contrastive Loss Based Frame-Wise Feature Disentanglement for Polyphonic Sound Event Detection.

ICASSP2023 Feng Chen, Shiwen Deng, Tieran Zheng, Yongjun He, Jiqing Han 0001
Graph-Based Spectro-Temporal Dependency Modeling for Anti-Spoofing.

ICASSP2023 Yadong Guan, Guibin Zheng, Jiqing Han 0001, Huanliang Wang, 
Subband Dependency Modeling for Sound Event Detection.

ICASSP2023 Dekai Sun, Yancheng He, Jiqing Han 0001
Using Auxiliary Tasks In Multimodal Fusion of Wav2vec 2.0 And Bert for Multimodal Emotion Recognition.

Interspeech2023 Ying Shi 0001, Dong Wang 0013, Lantian Li, Jiqing Han 0001, Shi Yin, 
Spot Keywords From Very Noisy and Mixed Speech.

Interspeech2023 Yue Gu, Zhihao Du, Shiliang Zhang, Qian Chen 0003, Jiqing Han 0001
Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition.

Interspeech2023 Jianchen Li, Jiqing Han 0001, Shiwen Deng, Tieran Zheng, Yongjun He, Guibin Zheng, 
Mutual Information-based Embedding Decoupling for Generalizable Speaker Verification.

ICASSP2022 Yadong Guan, Jiabin Xue, Guibin Zheng, Jiqing Han 0001
Sparse Self-Attention for Semi-Supervised Sound Event Detection.

ICASSP2022 Jianchen Li, Jiqing Han 0001, Hongwei Song, 
CDMA: Cross-Domain Distance Metric Adaptation for Speaker Verification.

Interspeech2022 Fan Qian, Hongwei Song, Jiqing Han 0001
Word-wise Sparse Attention for Multimodal Sentiment Analysis.

ICASSP2021 Hongwei Song, Jiqing Han 0001, Shiwen Deng, Zhihao Du, 
Capturing Temporal Dependencies Through Future Prediction for CNN-Based Audio Classifiers.

Interspeech2021 Jianchen Li, Jiqing Han 0001, Hongwei Song, 
Gradient Regularization for Noise-Robust Speaker Verification.

Interspeech2021 Fan Qian, Jiqing Han 0001
Multimodal Sentiment Analysis with Temporal Modality Attention.

Interspeech2021 Jiabin Xue, Tieran Zheng, Jiqing Han 0001
Model-Agnostic Fast Adaptive Multi-Objective Balancing Algorithm for Multilingual Automatic Speech Recognition Model Training.

TASLP2020 Zhihao Du, Xueliang Zhang 0001, Jiqing Han 0001
A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement.

TASLP2020 Hui Luo, Jiqing Han 0001
Nonnegative Matrix Factorization Based Transfer Subspace Learning for Cross-Corpus Speech Emotion Recognition.

ICASSP2020 Chen Chen 0086, Jiqing Han 0001
TDMF: Task-Driven Multilevel Framework for End-to-End Speaker Verification.

ICASSP2020 Zhihao Du, Ming Lei, Jiqing Han 0001, Shiliang Zhang, 
Pan: Phoneme-Aware Network for Monaural Speech Enhancement.

#130  | Yossi Adi | DBLP Google Scholar  
By venueInterspeech: 12ICASSP: 8NeurIPS: 3ICLR: 2EMNLP: 2ACL: 2AAAI: 1EMNLP-Findings: 1NAACL: 1ICML: 1
By year2024: 22023: 112022: 102021: 32020: 62019: 1
ISCA sessionsspeech synthesis: 2single-channel speech enhancement: 2analysis of speech and audio signals: 1acoustic signal representation and analysis: 1spoken language processing: 1zero, low-resource and multi-modal speech recognition: 1single-channel and multi-channel speech enhancement: 1privacy and security in speech communication: 1phonetic event detection and segmentation: 1voice conversion and adaptation: 1
IEEE keywordsself supervised learning: 3speech recognition: 2speech enhancement: 2speaker recognition: 2smoothing methods: 1measurement: 1representation learning: 1self supervision: 1unit discovery: 1prosody transfer: 1tv: 1expressive speech to speech translation: 1benchmark testing: 1controllable text to speech: 1focusing: 1protocols: 1redundancy: 1measurement units: 1speech lm: 1textless nlp: 1correlation: 1analytical models: 1visualization: 1generative spoken language modeling: 1codes: 1domain adaptation: 1zero shot learning: 1unsupervised denoising: 1unsupervised learning: 1speaker classification: 1reverberation: 1source separation: 1signal classification: 1audio generation: 1speech synthesis: 1phoneme boundary detection: 1neural net architecture: 1sequence segmentation: 1structured prediction: 1recurrent neural networks (rnns): 1convergence: 1natural language processing: 1signal representation: 1error statistics: 1automatic speech recognition: 1multi task learning: 1adversarial learning: 1
Most publications (all venues) at2023: 212022: 202024: 192021: 92020: 8

Affiliations
URLs

Recent publications

ICLR2024 Alon Ziv, Itai Gat, Gaël Le Lan, Tal Remez, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Yossi Adi
Masked Audio Generation using a Single Non-Autoregressive Transformer.

AAAI2024 Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation.

ICASSP2023 Ali Elkahky, Wei-Ning Hsu, Paden Tomasello, Tu Anh Nguyen, Robin Algayres, Yossi Adi, Jade Copet, Emmanuel Dupoux, Abdelrahman Mohamed, 
Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training?

ICASSP2023 Wen-Chin Huang, Benjamin Peloquin, Justine Kao, Changhan Wang, Hongyu Gong, Elizabeth Salesky, Yossi Adi, Ann Lee 0001, Peng-Jen Chen, 
A Holistic Cascade System, Benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation.

ICASSP2023 Amitay Sicherman, Yossi Adi
Analysing Discrete Self Supervised Speech Representation For Spoken Language Modeling.

Interspeech2023 Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux, 
Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis.

Interspeech2023 Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz, 
Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation.

NeurIPS2023 Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz 0001, Yossi Adi
Textually Pretrained Speech Language Models.

NeurIPS2023 Matthew Le 0001, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu, 
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale.

NeurIPS2023 Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez, 
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion.

ICLR2023 Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi
AudioGen: Textually Guided Audio Generation.

EMNLP2023 Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoît Sagot, Emmanuel Dupoux, 
Generative Spoken Language Model based on continuous word-sized audio tokens.

EMNLP-Findings2023 Gallil Maimon, Yossi Adi
Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units.

ICASSP2022 Efthymios Tzinis, Yossi Adi, Vamsi K. Ithapu, Buye Xu, Anurag Kumar 0003, 
Continual Self-Training With Bootstrapped Remixing For Speech Enhancement.

Interspeech2022 Shahaf Bassan, Yossi Adi, Jeffrey S. Rosenschein, 
Unsupervised Symbolic Music Segmentation using Ensemble Temporal Prediction Errors.

Interspeech2022 Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino 0001, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee 0001, 
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation.

Interspeech2022 Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski, 
Probing phoneme, language and speaker information in unsupervised speech representations.

Interspeech2022 Or Tal, Moshe Mandel, Felix Kreuk, Yossi Adi
A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement.

Interspeech2022 Arnon Turetzky, Tzvi Michelson, Yossi Adi, Shmuel Peleg, 
Deep Audio Waveform Prior.

ACL2022 Eugene Kharitonov, Ann Lee 0001, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu, 
Text-Free Prosody-Aware Generative Spoken Language Modeling.

#131  | Ozlem Kalinli | DBLP Google Scholar  
By venueICASSP: 16Interspeech: 15NAACL: 1EMNLP-Findings: 1
By year2024: 92023: 52022: 92021: 82019: 2
ISCA sessionsnovel neural network architectures for asr: 2resource-constrained asr: 2speech recognition: 1end-to-end spoken dialog systems: 1other topics in speech recognition: 1summarization, entity extraction, evaluation and others: 1spoken language understanding: 1neural transducers, streaming asr and novel asr models: 1self-supervised, semi-supervised, adaptation and data augmentation for asr: 1language and lexical modeling for asr: 1streaming for asr/rnn transducers: 1speech synthesis: 1speech recognition and beyond: 1
IEEE keywordsspeech recognition: 13adaptation models: 6automatic speech recognition: 4large language models: 4error analysis: 4task analysis: 3transducers: 3rnn t: 3natural language processing: 3multilingual: 3contextual biasing: 2multitasking: 2decoding: 2language modeling: 2supernet: 2training data: 2pruning: 2robustness: 2computational modeling: 1aggregates: 1ensemble learning: 1knowledge distillation: 1memorization: 1language models: 1ensemble methods: 1llama: 1large language model: 1question answering (information retrieval): 1domain adaptation: 1generators: 1correction focused training: 1efficiency: 1compression: 1knowledge distillations: 1on device: 1hardware: 1recurrent neural networks: 1multi task training: 1context modeling: 1dynamic prompting: 1privacy preserving machine learning: 1data models: 1servers: 1sparsity: 1distance measurement: 1anchored speech recognition: 1performance evaluation: 1noise measurement: 1background speech suppression: 1multi softmax: 1pipelines: 1tokenization: 1sparse: 1signal processing algorithms: 1named entities: 1class language model: 1shallow fusion: 1end toend speech recognition: 1non causal convolution: 1talking heads: 1data compression: 1convolution: 1augmented memory: 1e2e asr: 1deep learning (artificial intelligence): 1neural network pruning: 1graphics processing units: 1pareto optimisation: 1sparsity optimization: 1sample adaptive policy: 1data augmentation: 1perturbation methods: 1gaussian distribution: 1gaussian noise: 1robust automatic speech recognition: 1cepstral analysis: 1channel normalization: 1cepstral mean normalization: 1
Most publications (all venues) at2024: 122023: 112021: 112022: 102009: 4

Affiliations
URLs

Recent publications

ICASSP2024 Zhe Liu 0011, Ozlem Kalinli
Forgetting Private Textual Sequences in Language Models Via Leave-One-Out Ensemble.

ICASSP2024 Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer, 
Prompting Large Language Models with Speech Recognition Abilities.

ICASSP2024 Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen, 
End-to-End Speech Recognition Contextualization with Large Language Models.

ICASSP2024 Yingyi Ma, Zhe Liu 0011, Ozlem Kalinli
Correction Focused Language Model Training For Speech Recognition.

ICASSP2024 Yuan Shangguan, Haichuan Yang, Danni Li, Chunyang Wu, Yassir Fathullah, Dilin Wang, Ayushi Dalmia, Raghuraman Krishnamoorthi, Ozlem Kalinli, Junteng Jia, Jay Mahadeokar, Xin Lei, Mike Seltzer, Vikas Chandra, 
TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-Device ASR Models.

ICASSP2024 Chuanneng Sun, Zeeshan Ahmed, Yingyi Ma, Zhe Liu 0011, Lucas Kabela, Yutong Pang, Ozlem Kalinli
Contextual Biasing of Named-Entities with Large Language Models.

ICASSP2024 Arpita Vats, Zhe Liu, Peng Su, Debjyoti Paul, Yingyi Ma, Yutong Pang, Zeeshan Ahmed, Ozlem Kalinli
Recovering from Privacy-Preserving Masking with Large Language Models.

ICASSP2024 Jiamin Xie, Ke Li, Jinxi Guo, Andros Tjandra, Yuan Shangguan, Leda Sari, Chunyang Wu, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of a Multilingual ASR Model.

NAACL2024 Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer, 
AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs.

ICASSP2023 Desh Raj, Junteng Jia, Jay Mahadeokar, Chunyang Wu, Niko Moritz, Xiaohui Zhang 0007, Ozlem Kalinli
Anchored Speech Recognition with Neural Transducers.

ICASSP2023 Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman Mohamed, Duc Le, Michael L. Seltzer, 
Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities.

ICASSP2023 Mu Yang, Andros Tjandra, Chunxi Liu, David Zhang, Duc Le, Ozlem Kalinli
Learning ASR Pathways: A Sparse Multilingual ASR Model.

Interspeech2023 Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales, 
Multi-Head State Space Model for Speech Recognition.

Interspeech2023 Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer, 
Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding.

ICASSP2022 Antoine Bruguier, Duc Le, Rohit Prabhavalkar, Dangna Li, Zhe Liu 0011, Bo Wang, Eun Chang, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer, 
Neural-FST Class Language Model for End-to-End Speech Recognition.

ICASSP2022 Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang 0007, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, Mike Seltzer, 
Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution.

ICASSP2022 Haichuan Yang, Yuan Shangguan, Dilin Wang, Meng Li 0004, Pierce Chuang, Xiaohui Zhang 0007, Ganesh Venkatesh, Ozlem Kalinli, Vikas Chandra, 
Omni-Sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR Via Supernet.

Interspeech2022 Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide, 
Federated Domain Adaptation for ASR with Full Self-Supervision.

Interspeech2022 Suyoun Kim, Duc Le, Weiyi Zheng, Tarun Singh, Abhinav Arora, Xiaoyu Zhai, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer, 
Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric.

Interspeech2022 Duc Le, Akshat Shrivastava, Paden D. Tomasello, Suyoun Kim, Aleksandr Livshits, Ozlem Kalinli, Michael L. Seltzer, 
Deliberation Model for On-Device Spoken Language Understanding.

#132  | Ning Cheng 0001 | DBLP Google Scholar  
By venueICASSP: 19Interspeech: 14
By year2024: 32023: 102022: 82021: 62020: 6
ISCA sessionsspeech synthesis: 3spoken language translation, information retrieval, summarization, resources, and evaluation: 1analysis of speech and audio signals: 1question answering from speech: 1source separation: 1voice conversion and adaptation: 1acoustic event detection and classification: 1speech signal analysis and representation: 1acoustic event detection and acoustic scene classification: 1spoken language understanding: 1topics in asr: 1phonetic event detection and segmentation: 1
IEEE keywordsspeech synthesis: 10voice conversion: 5speech recognition: 5natural language processing: 4contrastive learning: 3timbre: 3predictive models: 3speaker recognition: 3emotion recognition: 2emotional speech synthesis: 2multi modal: 2convolution: 2task analysis: 2vector quantization: 2dynamic programming: 2zero shot: 2text analysis: 2text to speech: 2time invariant retrieval: 1data mining: 1self supervised learning: 1phonetics: 1noise reduction: 1speech emotion diarization: 1diffusion denoising probabilistic model: 1probabilistic logic: 1llm: 1model bias: 1text categorization: 1zero shot learning: 1bias leverage: 1adaptation models: 1robustness: 1few shot learning: 1knn methods: 1gold: 1multiple signal classification: 1fuses: 1music genre classification: 1multi label: 1contrastive loss: 1correlation: 1symmetric cross modal attention: 1adversarial learning: 1speech representation disentanglement: 1linear programming: 1linguistics: 1intonation intensity control: 1relative attribute: 1aligned cross entropy: 1entropy: 1non autoregressive asr: 1mask ctc: 1brain modeling: 1time frequency analysis: 1feature fusion: 1federated learning: 1graph convolution network: 1electroencephalogram: 1any to any: 1object detection: 1self supervised: 1low resource: 1query processing: 1pattern clustering: 1interactive systems: 1visual dialog: 1transformer: 1patch embedding: 1question answering (information retrieval): 1computer vision: 1incomplete utterance rewriting: 1self attention weight matrix: 1text edit: 1multi speaker text to speech: 1conditional variational autoencoder: 1intent detection: 1continual learning: 1computational linguistics: 1slot filling: 1recurrent neural nets: 1self attention: 1rnn transducer: 1waveform generators: 1vocoders: 1waveform generation: 1location variable convolution: 1vocoder: 1convolutional codes: 1speech coding: 1prosody modelling: 1graph theory: 1graph neural network: 1baum welch algorithm: 1real time systems: 1signal processing algorithms: 1feed forward transformer: 1
Most publications (all venues) at2022: 282023: 272021: 212024: 172020: 9

Affiliations
Ping An Technology (Shenzhen) Co., Ltd., China
Chinese Academy of Sciences, Institute of Automation, Beijing, China (former)
Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology, China (former)
University of the Chinese Academy of Sciences (UCAS), Beijing, China (PhD 2009)

Recent publications

ICASSP2024 Yimin Deng, Huaizhen Tang, Xulong Zhang 0001, Ning Cheng 0001, Jing Xiao 0006, Jianzong Wang, 
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval.

ICASSP2024 Haobin Tang, Xulong Zhang 0001, Ning Cheng 0001, Jing Xiao 0006, Jianzong Wang, 
ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis.

ICASSP2024 Yong Zhang, Hanzhang Li, Zhitao Li, Ning Cheng 0001, Ming Li, Jing Xiao 0006, Jianzong Wang, 
Leveraging Biases in Large Language Models: "bias-kNN" for Effective Few-Shot Learning.

ICASSP2023 Ganghui Ru, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Improving Music Genre Classification from multi-modal Properties of Music and Genre Correlations Perspective.

ICASSP2023 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Learning Speech Representations with Flexible Hidden Feature Dimensions.

ICASSP2023 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
VQ-CL: Learning Disentangled Speech Representations with Contrastive Learning and Vector Quantization.

ICASSP2023 Haobin Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis.

ICASSP2023 Xulong Zhang 0001, Haobin Tang, Jianzong Wang, Ning Cheng 0001, Jian Luo, Jing Xiao 0006, 
Dynamic Alignment Mask CTC: Improved Mask CTC With Aligned Cross Entropy.

ICASSP2023 Kexin Zhu, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Improving EEG-based Emotion Recognition by Fusing Time-Frequency and Spatial Representations.

Interspeech2023 Jiaxin Fan, Yong Zhang, Hanzhang Li, Jianzong Wang, Zhitao Li, Sheng Ouyang, Ning Cheng 0001, Jing Xiao 0006, 
Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism.

Interspeech2023 Yifu Sun, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Kaiyu Hu, Jing Xiao 0006, 
Investigation of Music Emotion Recognition Based on Segmented Semi-Supervised Learning.

Interspeech2023 Haobin Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis.

Interspeech2023 Yong Zhang, Zhitao Li, Jianzong Wang, Yiming Gao 0010, Ning Cheng 0001, Fengying Yu, Jing Xiao 0006, 
Prompt Guided Copy Mechanism for Conversational Question Answering.

ICASSP2022 Huaizhen Tang, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Avqvc: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning.

ICASSP2022 Qiqi Wang 0005, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning.

ICASSP2022 Tong Ye, Shijing Si, Jianzong Wang, Rui Wang, Ning Cheng 0001, Jing Xiao 0006, 
VU-BERT: A Unified Framework for Visual Dialog.

ICASSP2022 Yong Zhang, Zhitao Li, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
Self-Attention for Incomplete Utterance Rewriting.

ICASSP2022 Botao Zhao 0001, Xulong Zhang 0001, Jianzong Wang, Ning Cheng 0001, Jing Xiao 0006, 
nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-Shot Multi-speaker text-to-speech.

Interspeech2022 Jian Luo, Jianzong Wang, Ning Cheng 0001, Edward Xiao, Xulong Zhang 0001, Jing Xiao 0006, 
Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation.

Interspeech2022 Sicheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu 0001, Aolan Sun, Jianzong Wang, Ning Cheng 0001, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng, 
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion.

#133  | Mathew Magimai-Doss | DBLP Google Scholar  
By venueInterspeech: 17ICASSP: 13TASLP: 1SpeechComm: 1
By year2024: 32023: 52022: 32021: 52020: 52019: 72018: 4
ISCA sessionsspeaker state and trait: 2dysarthric speech assessment: 1connecting speech-science and speech-technology for children's speech: 1paralinguistics: 1neural-based speech and acoustic analysis: 1acoustic signal representation and analysis: 1speech segmentation: 1speech recognition of atypical speech: 1speech signal analysis and representation: 1disordered speech: 1assessment of pathological speech and language: 1alzheimer’s dementia recognition through spontaneous speech: 1the interspeech 2019 computational paralinguistics challenge (compare): 1topics in speech and audio signal processing: 1speaker verification: 1the interspeech 2018 computational paralinguistics challenge (compare): 1
IEEE keywordsspeech recognition: 5hidden markov models: 4diseases: 3medical signal processing: 3convolutional neural networks: 3computational modeling: 2emotion recognition: 2articulatory features: 2low pass filters: 2sign language recognition: 2quantization: 1standards: 1adaptive precision: 1quantization aware training: 1adaptation models: 1numerical models: 1post training quantization: 1quantization (signal): 1language disorder: 1svm: 1proposals: 1parkinson’s disease: 1syllable level features: 1classification: 1linguistics: 1dimensional emotion: 1vad: 1self supervised learning (ssl): 1self supervised learning: 1planning: 1reliability: 1uncertainty: 1speech emotion recognition: 1cepstral analysis: 1end to end modelling: 1convolution neural network: 1anxiety disorders: 1boaw: 1lung: 1covid 19 identification: 1epidemics: 1compare features: 1audio signal processing: 1phoneme recognition: 1breathing pattern estimation: 1pneumodynamics: 1mean square error methods: 1speech breathing: 1respiratory parameters: 1parameter estimation: 1transfer learning: 1production: 1sleep: 1sleepiness: 1estimation: 1paralinguistic speech processing: 1end to end acoustic modeling: 1pathological speech processing: 1lf mmi: 1dysarthria: 1entropy: 1gaussian processes: 1phonocardiogram: 1time frequency analysis: 1s1–s2 detection: 1modified zff: 1zero frequency filter: 1phonocardiography: 1sign language processing: 1hand shape modeling: 1hand movement modeling: 1handicapped aids: 1multilingual sign language recognition: 1gesture recognition: 1natural language processing: 1acoustic modeling: 1end to end training.: 1children speech recognition: 1segment level training: 1probability: 1confidence measures: 1local posterior probability: 1zero frequency filtering: 1glottal source signals: 1depression detection: 1subunits: 1hidden markov model: 1sign language: 1
Most publications (all venues) at2011: 142021: 132012: 112007: 112024: 9

Affiliations
URLs

Recent publications

TASLP2024 Vishal Kumar, Vinayak Abrol, Mathew Magimai-Doss
On the Quantization of Neural Models for Speaker Verification.

ICASSP2024 Sevada Hovsepyan, Mathew Magimai-Doss
Syllable Level Features for Parkinson's Disease Detection from Speech.

ICASSP2024 Bogdan Vlasenko, Sargam Vyas, Mathew Magimai-Doss
Comparing Data-Driven and Handcrafted Features for Dimensional Emotion Recognition.

ICASSP2023 Tilak Purohit, Sarthak Yadav, Bogdan Vlasenko, S. Pavankumar Dubagunta, Mathew Magimai-Doss
Towards Learning Emotion Information from Short Segments of Speech.

Interspeech2023 Enno Hermann, Mathew Magimai-Doss
Few-shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation.

Interspeech2023 Timothy Piton, Enno Hermann, Angela Pasqualotto, Marjolaine Cohen, Mathew Magimai-Doss, Daphne Bavelier, 
Using Commercial ASR Solutions to Assess Reading Skills in Children: A Case Report.

Interspeech2023 Tilak Purohit, Bogdan Vlasenko, Mathew Magimai-Doss
Implicit phonetic information modeling for speech emotion recognition.

Interspeech2023 Eklavya Sarkar, Mathew Magimai-Doss
Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?

ICASSP2022 Zohreh Mostaani, RaviShankar Prasad, Bogdan Vlasenko, Mathew Magimai-Doss
Modeling of Pre-Trained Neural Network Embeddings Learned From Raw Waveform for COVID-19 Infection Detection.

Interspeech2022 Zohreh Mostaani, Mathew Magimai-Doss
On Breathing Pattern Information in Synthetic Speech.

Interspeech2022 Eklavya Sarkar, RaviShankar Prasad, Mathew Magimai-Doss
Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering.

ICASSP2021 Zohreh Mostaani, Venkata Srikanth Nallanthighal, Aki Härmä, Helmer Strik, Mathew Magimai-Doss
On The Relationship Between Speech-Based Breathing Signal Prediction Evaluation Measures and Breathing Parameters Estimation.

Interspeech2021 Enno Hermann, Mathew Magimai-Doss
Handling Acoustic Variation in Dysarthric Speech Recognition Systems Through Model Combination.

Interspeech2021 RaviShankar Prasad, Mathew Magimai-Doss
Identification of F1 and F2 in Speech Using Modified Zero Frequency Filtering.

Interspeech2021 Juan Camilo Vásquez-Correa, Julian Fritsch, Juan Rafael Orozco-Arroyave, Elmar Nöth, Mathew Magimai-Doss
On Modeling Glottal Source Information for Phonation Assessment in Parkinson's Disease.

Interspeech2021 Esaú Villatoro-Tello, S. Pavankumar Dubagunta, Julian Fritsch, Gabriela Ramírez-de-la-Rosa, Petr Motlícek, Mathew Magimai-Doss
Late Fusion of the Available Lexicon and Raw Waveform-Based Acoustic Modeling for Depression and Dementia Recognition.

ICASSP2020 Julian Fritsch, S. Pavankumar Dubagunta, Mathew Magimai-Doss
Estimating the Degree of Sleepiness by Integrating Articulatory Feature Knowledge in Raw Waveform Based CNNs.

ICASSP2020 Enno Hermann, Mathew Magimai-Doss
Dysarthric Speech Recognition with Lattice-Free MMI.

ICASSP2020 RaviShankar Prasad, Gürkan Yilmaz, Olivier Chételat, Mathew Magimai-Doss
Detection Of S1 And S2 Locations In Phonocardiogram Signals Using Zero Frequency Filter.

ICASSP2020 Sandrine Tornay, Marzieh Razavi, Mathew Magimai-Doss
Towards Multilingual Sign Language Recognition.

#134  | Hisashi Kawai | DBLP Google Scholar  
By venueInterspeech: 17ICASSP: 10TASLP: 4SpeechComm: 1
By year2024: 32023: 22022: 22021: 62020: 42019: 112018: 4
ISCA sessionsspeech synthesis: 3speech synthesis and voice conversion: 1speaker and language recognition: 1topics in asr: 1large-scale evaluation of short-duration speaker verification: 1cross-lingual and multilingual asr: 1asr for noisy and far-field speech: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1nn architectures for asr: 1speech enhancement: 1speech and audio classification: 1acoustic modelling: 1audio events and acoustic scenes: 1text analysis, multilingual issues and evaluation in speech synthesis: 1language identification: 1
IEEE keywordsvocoders: 6speech synthesis: 6neural vocoder: 5speech recognition: 5spoken language identification: 3knowledge distillation: 3transformers: 2fundamental frequency control: 2text to speech: 2real time systems: 2voice conversion: 2pattern classification: 2speaker recognition: 2autoregressive processes: 2error analysis: 1sinkhorn attention: 1cross modality alignment: 1automatic speech recognition (asr): 1pretrained language model (plm): 1linguistics: 1linear programming: 1controllability: 1finite impulse response filters: 1finite impulse response: 1source filter model: 1synthesizers: 1predictive models: 1jets: 1wavenext: 1convnext: 1decoding: 1harmonic analysis: 1speech rate conversion: 1generators: 1convolution: 1bayes methods: 1joint bayesian model: 1affine transforms: 1discriminative model: 1generative model: 1speaker verification: 1parallel wavegan: 1convolutional neural nets: 1pitch dependent dilated convolution: 1quasi periodic wavenet: 1statistical distributions: 1unsupervised domain adaptation: 1optimal transport: 1medical disorders: 1speech intelligibility: 1dysarthria: 1diffwave: 1diffusion probabilistic vocoder: 1speech enhancement: 1probability: 1sub modeling: 1wavegrad: 1noise: 1internal representation learning: 1short utterances: 1transformer: 1weighted forced attention: 1natural language processing: 1forced alignment: 1sequence to sequence model: 1fast fourier transforms: 1gaussian inverse autoregressive flow: 1parallel wavenet: 1fftnet: 1noise shaping: 1gaussian processes: 1teacher model optimization: 1natural languages: 1computer aided instruction: 1short utterance feature representation: 1interactive teacher student learning: 1acoustic model: 1connec tionist temporal classification: 1
Most publications (all venues) at2010: 262019: 172011: 152018: 142016: 13

Affiliations
URLs

Recent publications

ICASSP2024 Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai
Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-Based ASR.

ICASSP2024 Yamato Ohtani, Takuma Okamoto, Tomoki Toda, Hisashi Kawai
FIRNet: Fundamental Frequency Controllable Fast Neural Vocoder With Trainable Finite Impulse Response Filter.

ICASSP2024 Takuma Okamoto, Yamato Ohtani, Tomoki Toda, Hisashi Kawai
Convnext-TTS And Convnext-VC: Convnext-Based Fast End-To-End Sequence-To-Sequence Text-To-Speech And Voice Conversion.

TASLP2023 Keisuke Matsubara, Takuma Okamoto, Ryoichi Takashima, Tetsuya Takiguchi, Tomoki Toda, Hisashi Kawai
Harmonic-Net: Fundamental Frequency and Speech Rate Controllable Fast Neural Vocoder.

Interspeech2023 Takuma Okamoto, Tomoki Toda, Hisashi Kawai
E2E-S2S-VC: End-To-End Sequence-To-Sequence Voice Conversion.

SpeechComm2022 Takuma Okamoto, Keisuke Matsubara, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
Neural speech-rate conversion with multispeaker WaveNet vocoder.

Interspeech2022 Peng Shen, Xugang Lu, Hisashi Kawai
Transducer-based language embedding for spoken language identification.

TASLP2021 Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai
Coupling a Generative Model With a Discriminative Learning Framework for Speaker Verification.

TASLP2021 Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda, 
Quasi-Periodic Parallel WaveGAN: A Non-Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network.

ICASSP2021 Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai
Unsupervised Neural Adaptation Model Based on Optimal Transport for Spoken Language Identification.

ICASSP2021 Keisuke Matsubara, Takuma Okamoto, Ryoichi Takashima, Tetsuya Takiguchi, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
High-Intelligibility Speech Synthesis for Dysarthric Speakers with LPCNet-Based TTS and CycleVAE-Based VC.

ICASSP2021 Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders.

Interspeech2021 Masakiyo Fujimoto, Hisashi Kawai
Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture.

TASLP2020 Peng Shen, Xugang Lu, Sheng Li 0010, Hisashi Kawai
Knowledge Distillation-Based Representation Learning for Short-Utterance Spoken Language Identification.

ICASSP2020 Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
Transformer-Based Text-to-Speech with Weighted Forced Attention.

Interspeech2020 Peng Shen, Xugang Lu, Hisashi Kawai
Investigation of NICT Submission for Short-Duration Speaker Verification Challenge 2020.

Interspeech2020 Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda, 
Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation.

ICASSP2019 Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
Investigations of Real-time Gaussian Fftnet and Parallel Wavenet Neural Vocoders with Simple Acoustic Features.

ICASSP2019 Peng Shen, Xugang Lu, Sheng Li 0010, Hisashi Kawai
Interactive Learning of Teacher-student Model for Short Utterance Spoken Language Identification.

ICASSP2019 Ryoichi Takashima, Sheng Li 0010, Hisashi Kawai
Investigation of Sequence-level Knowledge Distillation Methods for CTC Acoustic Models.

#135  | Qingyang Hong | DBLP Google Scholar  
By venueInterspeech: 17ICASSP: 13AAAI: 1TASLP: 1
By year2024: 32023: 82022: 42021: 92020: 52019: 3
ISCA sessionsspeaker recognition: 3oriental language recognition: 2language recognition: 2speech synthesis: 1end-to-end spoken dialog systems: 1speaker and language identification: 1speaker embedding and diarization: 1speaker and language recognition: 1non-autoregressive sequential modeling for speech processing: 1feature, embedding and neural architecture for speaker recognition: 1large-scale evaluation of short-duration speaker verification: 1asr neural network architectures: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1
IEEE keywordsspeaker recognition: 6speaker verification: 4clustering algorithms: 3transformers: 3speaker clustering: 3speech recognition: 3noise reduction: 2speech synthesis: 2signal processing algorithms: 2pre trained model: 2error analysis: 2graph convolutional network: 2convolution: 2clustering methods: 2adaptation models: 2x vector: 2deep learning (artificial intelligence): 2costs: 1rectified flow: 1ordinary differential equations: 1gaussian distribution: 1probabilistic logic: 1system performance: 1multitasking: 1fine tuning: 1benchmark testing: 1clustering: 1label correction: 1degradation: 1multimodal: 1far field: 1multi talker: 1transfer learning: 1runtime: 1production: 1conformer: 1computer architecture: 1runtime environment: 1couplings: 1convolutional neural networks: 1community detection: 1voice activity detection: 1speaker diarization: 1stability analysis: 1loss weight adaption: 1model agnostic meta learning: 1homoscedastic uncertainty: 1manuals: 1convolutional neural network: 1low resource automatic speech recognition: 1uncertainty: 1noisy labels: 1bayes methods: 1probabilistic linear discriminant analysis: 1training data: 1semi supervised learning: 1semisupervised learning: 1multi speaker: 1text analysis: 1multi lingual: 1non autoregressive: 1natural language processing: 1lightweight: 1autoregressive processes: 1multi accent: 1global embedding: 1end to end: 1data augmentation: 1domain adaptation: 1open source toolkit: 1deep neural networks: 1linear discriminant analysis: 1sensor fusion: 1f tdnn: 1prediction theory: 1as norm: 1domain mismatch: 1sre19: 1speaker embedding: 1speech coding: 1optimisation: 1adversarial training: 1multi task: 1
Most publications (all venues) at2021: 152023: 92022: 92024: 82019: 7

Affiliations
URLs

Recent publications

ICASSP2024 Wenhao Guan, Qi Su, Haodong Zhou, Shiyu Miao, Xingjia Xie, Lin Li, Qingyang Hong
Reflow-TTS: A Rectified Flow Model for High-Fidelity Text-to-Speech.

ICASSP2024 Yishuang Li, Hukai Huang, Zhicong Chen, Wenhao Guan, Jiayan Lin, Lin Li, Qingyang Hong
SR-HuBERT: An Efficient Pre-Trained Model for Speaker Verification.

AAAI2024 Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang, Jiayan Lin, Lingyan Huang, Lin Li, Qingyang Hong
MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis.

ICASSP2023 Zhicong Chen, Jie Wang, Wenxuan Hu, Lin Li 0032, Qingyang Hong
Unsupervised Speaker Verification Using Pre-Trained Model and Label Correction.

ICASSP2023 Tao Li, Haodong Zhou, Jie Wang, Qingyang Hong, Lin Li, 
The XMU System for Audio-Visual Diarization and Recognition in MISP Challenge 2022.

ICASSP2023 Dexin Liao, Tao Jiang 0033, Feng Wang, Lin Li 0032, Qingyang Hong
Towards A Unified Conformer Structure: from ASR to ASV Task.

ICASSP2023 Jie Wang, Zhicong Chen, Haodong Zhou, Lin Li, Qingyang Hong
Community Detection Graph Convolutional Network for Overlap-Aware Speaker Diarization.

ICASSP2023 Qiulin Wang, Wenxuan Hu, Lin Li 0032, Qingyang Hong
Meta Learning with Adaptive Loss Weight for Low-Resource Speech Recognition.

Interspeech2023 Wenhao Guan, Tao Li, Yishuang Li, Hukai Huang, Qingyang Hong, Lin Li, 
Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge.

Interspeech2023 Lingyan Huang, Tao Li, Haodong Zhou, Qingyang Hong, Lin Li, 
Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding.

Interspeech2023 Feng Wang, Lingyan Huang, Tao Li, Qingyang Hong, Lin Li 0032, 
Conformer-based Language Embedding with Self-Knowledge Distillation for Spoken Language Identification.

TASLP2022 Lin Li 0032, Fuchuan Tong, Qingyang Hong
When Speaker Recognition Meets Noisy Labels: Optimizations for Front-Ends and Back-Ends.

ICASSP2022 Fuchuan Tong, Siqi Zheng, Min Zhang, Yafeng Chen, Hongbin Suo, Qingyang Hong, Lin Li 0032, 
Graph Convolutional Network Based Semi-Supervised Learning on Multi-Speaker Meeting Data.

Interspeech2022 Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li 0032, Qingyang Hong
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting.

Interspeech2022 Binling Wang, Feng Wang, Wenxuan Hu, Qiulin Wang, Jing Li, Dong Wang 0013, Lin Li 0032, Qingyang Hong
Oriental Language Recognition (OLR) 2021: Summary and Analysis.

ICASSP2021 Song Li, Beibei Ouyang, Lin Li 0032, Qingyang Hong
Light-TTS: Lightweight Multi-Speaker Multi-Lingual Text-to-Speech.

ICASSP2021 Song Li, Beibei Ouyang, Dexin Liao, Shipeng Xia, Lin Li 0032, Qingyang Hong
End-To-End Multi-Accent Speech Recognition with Unsupervised Accent Modelling.

ICASSP2021 Fuchuan Tong, Miao Zhao, Jianfeng Zhou, Hao Lu, Zheng Li, Lin Li 0032, Qingyang Hong
ASV-SUBTOOLS: Open Source Toolkit for Automatic Speaker Verification.

Interspeech2021 Zheng Li, Yan Liu, Lin Li 0032, Qingyang Hong
Additive Phoneme-Aware Margin Softmax Loss for Language Recognition.

Interspeech2021 Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li 0032, Qingyang Hong
Real-Time End-to-End Monaural Multi-Speaker Speech Recognition.

#136  | Zhengyang Chen | DBLP Google Scholar  
By venueICASSP: 15Interspeech: 12TASLP: 4SpeechComm: 1
By year2024: 72023: 72022: 102021: 42020: 32019: 1
ISCA sessionsembedding and network architecture for speaker recognition: 4speaker and language diarization: 1speaker and language identification: 1speaker recognition and anti-spoofing: 1sdsv challenge 2021: 1speaker, language, and privacy: 1speaker recognition challenges and applications: 1learning techniques for speaker recognition: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1
IEEE keywordsspeaker verification: 10speaker recognition: 8task analysis: 6transformers: 4data models: 4self supervised learning: 4degradation: 4speaker diarization: 3system performance: 2clustering algorithms: 2voice activity detection: 2robustness: 2adaptation models: 2speech recognition: 2computational modeling: 2large margin fine tuning: 2speaker embedding: 2representation learning: 2unsupervised learning: 2neural speaker diarization: 1attention based encoder decoder: 1ami: 1iterative decoding: 1callhome: 1dihard: 1decoding: 1label correction: 1iterative methods: 1self supervised speaker verification: 1cluster aware dino: 1multi modality: 1reliability: 1dynamic loss gate: 1fuses: 1machine anomalous sound detection: 1data collection: 1self supervised pre train: 1fine tune: 1employee welfare: 13d speaker: 1data mining: 1cross domain learning: 1domain mismatch: 1target speech diarization: 1switches: 1semantics: 1prompt driven: 1mimics: 1training data: 1in the wild: 1filtering algorithms: 1dino: 1pipelines: 1attentive feature fusion: 1resnet: 1depth first architecture: 1complexity theory: 1computer architecture: 1ecapa tdnn: 1search problems: 1binary classification: 1noise measurement: 1sphereface2: 1error analysis: 1audio visual: 1misp challenge: 1production: 1wespeaker: 1codes: 1asymmetric scenario: 1duration mismatch: 1focusing: 1signal processing algorithms: 1collaboration: 1self supervised pretrain: 1image representation: 1multitasking: 1pre training: 1benchmark testing: 1speaker: 1linear programming: 1multilayer perceptrons: 1text independent: 1multi layer perceptron: 1convolution attention: 1local attention: 1natural language processing: 1local information: 1gaussian attention: 1self knowledge distillation: 1model compression: 1deep embedding learning: 1knowledge engineering: 1quantization (signal): 1biometrics (access control): 1audio visual deep neural network: 1person verification: 1deep learning (artificial intelligence): 1face recognition: 1data augmentation: 1data analysis: 1multi modal system: 1domain adaptation: 1contrastive learning: 1multitask learning: 1channel information: 1text dependent speaker verification: 1adversarial training: 1
Most publications (all venues) at2022: 172024: 142023: 102021: 52020: 3

Affiliations
URLs

Recent publications

SpeechComm2024 Shuai Wang 0016, Zhengyang Chen, Bing Han, Hongji Wang, Chengdong Liang, Binbin Zhang, Xu Xiang, Wen Ding, Johan Rohdin, Anna Silnova, Yanmin Qian, Haizhou Li 0001, 
Advancing speaker embedding learning: Wespeaker toolkit for research and production.

TASLP2024 Zhengyang Chen, Bing Han, Shuai Wang 0016, Yanmin Qian, 
Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer.

TASLP2024 Bing Han, Zhengyang Chen, Yanmin Qian, 
Self-Supervised Learning With Cluster-Aware-DINO for High-Performance Robust Speaker Verification.

ICASSP2024 Bing Han, Zhiqiang Lv, Anbai Jiang, Wen Huang 0004, Zhengyang Chen, Yufeng Deng, Jiawei Ding, Cheng Lu 0007, Wei-Qiang Zhang 0001, Pingyi Fan, Jia Liu 0001, Yanmin Qian, 
Exploring Large Scale Pre-Trained Models for Robust Machine Anomalous Sound Detection.

ICASSP2024 Wen Huang 0004, Bing Han, Shuai Wang 0016, Zhengyang Chen, Yanmin Qian, 
Robust Cross-Domain Speaker Verification with Multi-Level Domain Adapters.

ICASSP2024 Yidi Jiang, Zhengyang Chen, Ruijie Tao, Liqun Deng, Yanmin Qian, Haizhou Li 0001, 
Prompt-Driven Target Speech Diarization.

ICASSP2024 Shuai Wang 0016, Qibing Bai, Qi Liu 0018, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li 0001, 
Leveraging in-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition.

TASLP2023 Bei Liu, Zhengyang Chen, Yanmin Qian, 
Depth-First Neural Architecture With Attentive Feature Fusion for Efficient Speaker Verification.

ICASSP2023 Bing Han, Zhengyang Chen, Yanmin Qian, 
Exploring Binary Classification Loss for Speaker Verification.

ICASSP2023 Tao Liu, Zhengyang Chen, Yanmin Qian, Kai Yu 0004, 
Multi-Speaker End-to-End Multi-Modal Speaker Diarization System for the MISP 2022 Challenge.

ICASSP2023 Hongji Wang, Chengdong Liang, Shuai Wang 0016, Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei Deng, Yanmin Qian, 
Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit.

ICASSP2023 Leying Zhang, Zhengyang Chen, Yanmin Qian, 
Adaptive Large Margin Fine-Tuning For Robust Speaker Verification.

Interspeech2023 Zhengyang Chen, Bing Han, Shuai Wang 0016, Yanmin Qian, 
Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor.

Interspeech2023 Zhengyang Chen, Bing Han, Xu Xiang, Houjun Huang, Bei Liu, Yanmin Qian, 
Build a SRE Challenge System: Lessons from VoxSRC 2022 and CNSRC 2022.

ICASSP2022 Zhengyang Chen, Sanyuan Chen, Yu Wu 0012, Yao Qian, Chengyi Wang 0002, Shujie Liu 0001, Yanmin Qian, Michael Zeng 0001, 
Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification.

ICASSP2022 Sanyuan Chen, Yu Wu 0012, Chengyi Wang 0002, Zhengyang Chen, Zhuo Chen 0006, Shujie Liu 0001, Jian Wu 0027, Yao Qian, Furu Wei, Jinyu Li 0001, Xiangzhan Yu, 
Unispeech-Sat: Universal Speech Representation Learning With Speaker Aware Pre-Training.

ICASSP2022 Bing Han, Zhengyang Chen, Bei Liu, Yanmin Qian, 
MLP-SVNET: A Multi-Layer Perceptrons Based Network for Speaker Verification.

ICASSP2022 Bing Han, Zhengyang Chen, Yanmin Qian, 
Local Information Modeling with Self-Attention for Speaker Verification.

ICASSP2022 Bei Liu, Haoyu Wang 0007, Zhengyang Chen, Shuai Wang 0016, Yanmin Qian, 
Self-Knowledge Distillation via Feature Enhancement for Speaker Verification.

Interspeech2022 Bing Han, Zhengyang Chen, Yanmin Qian, 
Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction.

#137  | Mengzhe Geng | DBLP Google Scholar  
By venueInterspeech: 14TASLP: 9ICASSP: 9
By year2024: 52023: 82022: 72021: 92020: 22019: 1
ISCA sessionsspeech and language in health: 4speech recognition of atypical speech: 3topics in asr: 2acoustic model adaptation for asr: 1multi-, cross-lingual and other topics in asr: 1novel models and training methods for asr: 1miscellaneous topics in speech, voice and hearing disorders: 1speech and speaker recognition: 1
IEEE keywordsspeech recognition: 17data augmentation: 5adaptation models: 5elderly speech: 4dysarthric speech: 4data models: 4bayes methods: 4pre trained asr system: 3older adults: 3task analysis: 3perturbation methods: 3speaker adaptation: 3speech disorders: 3speaker recognition: 3neural architecture search: 3bayesian learning: 3natural language processing: 3wav2vec2.0: 2decoding: 2gan: 2training data: 2switches: 2conformer: 2controllability: 2error analysis: 2estimation: 2audio visual: 2speech separation: 2handicapped aids: 2disordered speech recognition: 2deep learning (artificial intelligence): 2time delay neural network: 2model uncertainty: 2neural language models: 2domain adaptation: 2standards: 1multi lingual xlsr: 1hubert: 1hybrid tdnn: 1end to end conformer: 1speech: 1low latency: 1rapid adaptation: 1interpolation: 1specaugment: 1reinforcement learning: 1speech enhancement: 1speech dereverberation: 1maximum likelihood detection: 1nonlinear filters: 1visualization: 1end to end: 1self supervised learning: 1generative adversarial networks: 1vae: 1elderly speech recognition: 1search problems: 1uncertainty handling: 1minimisation: 1neural net architecture: 1recurrent neural nets: 1monte carlo methods: 1gaussian processes: 1articulatory inversion: 1hybrid power systems: 1benchmark testing: 1variational inference: 1delays: 1generalisation (artificial intelligence): 1lf mmi: 1inference mechanisms: 1gaussian process: 1multimodal speech recognition: 1image recognition: 1microphone arrays: 1visual occlusion: 1overlapped speech recognition: 1multi channel: 1jointly fine tuning: 1filtering theory: 1video signal processing: 1transformer: 1uncertainty: 1automatic speech recognition: 1neurocognitive disorder detection: 1dementia: 1
Most publications (all venues) at2022: 112024: 102023: 92021: 62020: 2

Affiliations
URLs

Recent publications

TASLP2024 Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu, 
Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition.

TASLP2024 Zengrui Jin, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan Li, Xunying Liu, 
Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition.

ICASSP2024 Jiajun Deng, Xurong Xie, Guinan Li, Mingyu Cui, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Zhaoqing Li, Xunying Liu, 
Towards High-Performance and Low-Latency Feature-Based Speaker Adaptation of Conformer Speech Recognition Systems.

ICASSP2024 Zengrui Jin, Xurong Xie, Tianzi Wang, Mengzhe Geng, Jiajun Deng, Guinan Li, Shujie Hu, Xunying Liu, 
Towards Automatic Data Augmentation for Disordered Speech Recognition.

ICASSP2024 Huimeng Wang, Zengrui Jin, Mengzhe Geng, Shujie Hu, Guinan Li, Tianzi Wang, Haoning Xu, Xunying Liu, 
Enhancing Pre-Trained ASR System Fine-Tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation.

TASLP2023 Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu, 
Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition.

ICASSP2023 Shujie Hu, Xurong Xie, Zengrui Jin, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng, 
Exploring Self-Supervised Pre-Trained ASR Models for Dysarthric and Elderly Speech Recognition.

ICASSP2023 Zengrui Jin, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, Xunying Liu, 
Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition.

Interspeech2023 Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu, 
Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems.

Interspeech2023 Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu, 
Use of Speech Impairment Severity for Dysarthric Speech Recognition.

Interspeech2023 Mengzhe Geng, Xurong Xie, Rongfeng Su, Jianwei Yu, Zengrui Jin, Tianzi Wang, Shujie Hu, Zi Ye 0001, Helen Meng, Xunying Liu, 
On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition.

Interspeech2023 Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Helen Meng, Xunying Liu, 
Exploiting Cross-Domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition.

Interspeech2023 Tianzi Wang, Shoukang Hu, Jiajun Deng, Zengrui Jin, Mengzhe Geng, Yi Wang, Helen Meng, Xunying Liu, 
Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition.

TASLP2022 Mengzhe Geng, Xurong Xie, Zi Ye 0001, Tianzi Wang, Guinan Li, Shujie Hu, Xunying Liu, Helen Meng, 
Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition.

TASLP2022 Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng, 
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks.

TASLP2022 Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng, 
Bayesian Neural Network Language Modeling for Speech Recognition.

ICASSP2022 Shujie Hu, Shansong Liu, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shoukang Hu, Mingyu Cui, Xunying Liu, Helen Meng, 
Exploiting Cross Domain Acoustic-to-Articulatory Inverted Features for Disordered Speech Recognition.

Interspeech2022 Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng, 
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems.

Interspeech2022 Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng, 
Confidence Score Based Conformer Speaker Adaptation for Speech Recognition.

Interspeech2022 Tianzi Wang, Jiajun Deng, Mengzhe Geng, Zi Ye 0001, Shoukang Hu, Yi Wang, Mingyu Cui, Zengrui Jin, Xunying Liu, Helen Meng, 
Conformer Based Elderly Speech Recognition System for Alzheimer's Disease Detection.

#138  | Chao-Han Huck Yang | DBLP Google Scholar  
By venueICASSP: 16Interspeech: 9ICLR: 2ACL: 1TASLP: 1NeurIPS: 1EMNLP: 1ICML: 1
By year 2024: 8, 2023: 12, 2022: 2, 2021: 4, 2020: 6
ISCA sessionsspeech recognition: 2speech coding and enhancement: 2analysis of speech and audio signals: 1speaker and language identification: 1speech synthesis: 1privacy-preserving machine learning for audio & speech processing: 1multi-channel speech enhancement: 1
IEEE keywordsspeech recognition: 9speech enhancement: 4analytical models: 4data models: 3adaptation models: 3upper bound: 2robustness: 2in context learning: 2spoken language understanding: 2automatic speech recognition: 2tensor train network: 2robust speech recognition: 2adversarial robustness: 2recurrent neural nets: 2speech recognition safety: 2quantum machine learning: 2convolutional neural nets: 2gradient methods: 2zero shot learning: 1lattices: 1question answering (information retrieval): 1large language models. asr confusion networks: 1training data: 1low resource speech classification: 1recurrent neural networks: 1quantum mechanics: 1multiple kernel learning: 1quantum kernel projection: 1production: 1wake word verification: 1hot fixing: 1neural model reprogramming: 1end to end asr: 1target recognition: 1predictive models: 1multitasking: 1switches: 1speech sentiment analysis: 1emotion recognition: 1paralinguistics: 1transformers: 1large language models: 1spoken dialogue modeling: 1linguistics: 1whisper model: 1error analysis: 1test time adaptation: 1decoding: 1large pre trained models: 1task analysis: 1computational modeling: 1tensor train deep neural network: 1spoken command recognition: 1low rank tensor train decomposition: 1wireless communication: 1riemannian gradient descent: 1and foundation speech models: 1model reprogramming: 1pre trained adaptation: 1benchmark testing: 1focusing: 1cross lingual speech recognition: 1convolution: 1kernel: 1encoding: 1bayes methods: 1sequence modeling: 1optimisation: 1quantum computing: 1text analysis: 1pattern classification: 1text classification: 1temporal convolution: 1and heterogeneous computing: 1image analysis: 1acoustic scene classification: 1data augmentation: 1convolutional neural networks: 1class activation mapping: 1data privacy: 1acoustic modeling: 1and federated learning: 1road traffic: 1driving behaviors reasoning: 1action recognition: 1temporal reasoning: 1behavioural sciences computing: 1self driving vehicles: 1video saliency: 1self attention models: 1video signal processing: 1regression analysis: 1tensors: 1deep neural network: 1tensor to vector regression: 1concave programming: 1distributed automatic speech recognition: 1submodular function: 1convex programming: 1lovasz bregman divergence: 1rank aggregation: 1speech intelligibility: 1adversarial examples: 1y net: 1discrete wavelet transform: 1image denoising: 1discrete wavelet transforms: 1image restoration: 1image reconstruction: 1multi scale feature aggregation: 1computer vision: 1image representation: 1structure similarity: 1single image dehazing: 1
Most publications (all venues) at 2023: 25, 2024: 18, 2021: 14, 2022: 12, 2020: 12

Affiliations
URLs

Recent publications

ICASSP2024 Kevin Everson, Yile Gu, Chao-Han Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-Yi Lee, Ariya Rastrow, Andreas Stolcke, 
Towards ASR Robust Spoken Language Understanding Through in-Context Learning with Word Confusion Networks.

ICASSP2024 Xianyan Fu, Xiao-Lei Zhang 0001, Chao-Han Huck Yang, Jun Qi 0002, 
Exploiting A Quantum Multiple Kernel Learning Approach For Low-Resource Spoken Command Recognition.

ICASSP2024 Pin-Jui Ku, I-Fan Chen, Chao-Han Huck Yang, Anirudh Raju, Pranav Dheram, Pegah Ghahremani, Brian King, Jing Liu, Roger Ren, Phani Sankar Nidadavolu, 
Hot-Fixing Wake Word Recognition for End-to-End ASR Via Neural Model Reprogramming.

ICASSP2024 Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-Yi Lee, Ivan Bulyko, 
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue.

ICASSP2024 Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang 0031, 
Can Whisper Perform Speech-Based In-Context Learning?

ICLR2024 Chen Chen 0075, Ruizhe Li 0001, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Engsiong Chng, Chao-Han Huck Yang
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition.

ICLR2024 Yuchen Hu, Chen Chen 0075, Chao-Han Huck Yang, Ruizhe Li 0001, Chao Zhang 0031, Pin-Yu Chen, Engsiong Chng, 
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition.

ACL2024 Yuchen Hu, Chen Chen 0075, Chao-Han Huck Yang, Ruizhe Li 0001, Dong Zhang, Zhehuai Chen, EngSiong Chng, 
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators.

TASLP2023 Jun Qi 0002, Chao-Han Huck Yang, Pin-Yu Chen, Javier Tejedor, 
Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman, 
From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition.

ICASSP2023 Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee 0001, 
A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition.

Interspeech2023 Chen Chen 0075, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng, 
A Neural State-Space Modeling Approach to Efficient Speech Separation.

Interspeech2023 Zih-Ching Chen, Chao-Han Huck Yang, Bo Li 0028, Yu Zhang 0033, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath, 
How to Estimate Model Transferability of Pre-Trained Speech Models?

Interspeech2023 Chun-Wei Ho, Chao-Han Huck Yang, Sabato Marco Siniscalchi, 
Differentially Private Adapters for Parameter Efficient Acoustic Modeling.

Interspeech2023 Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee 0001, 
A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models.

Interspeech2023 Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegnér, 
A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model.

Interspeech2023 Li-Jen Yang, Chao-Han Huck Yang, Jen-Tzung Chien, 
Parameter-Efficient Learning for Text-to-Speech Accent Adaptation.

Interspeech2023 Hao Yen, Pin-Jui Ku, Chao-Han Huck Yang, Hu Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Yu Tsao 0001, 
Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition.

NeurIPS2023 Chen Chen 0075, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Chng Eng Siong, 
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models.

EMNLP2023 Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper Tegnér, 
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition.

#139  | Jiatong Shi | DBLP Google Scholar  
By venueICASSP: 13Interspeech: 12ACL: 3TASLP: 1ICML: 1ICLR: 1AAAI: 1
By year 2024: 9, 2023: 8, 2022: 9, 2021: 4, 2020: 2
ISCA sessionsspeech recognition: 3speech synthesis: 2low-resource asr development: 1spoken language processing: 1asr: 1speech segmentation: 1speech signal analysis and representation: 1cross/multi-lingual and code-switched speech recognition: 1pronunciation: 1
IEEE keywordsself supervised learning: 6benchmark testing: 4task analysis: 4speech recognition: 4speech enhancement: 3benchmark: 2representation learning: 2analytical models: 2discrete units: 2spoken language understanding: 2buildings: 2semantics: 2data models: 2unsupervised asr: 2speaker recognition: 2biological system modeling: 1task generalization: 1evaluation: 1computational modeling: 1protocols: 1foundation model: 1speech: 1redundancy: 1speech translation: 1correlation: 1systematics: 1end to end: 1instruction tuning: 1collaboration: 1topic model: 1hubert: 1multilingual asr: 1ctc: 1face recognition: 1low resource asr: 1predictive models: 1cleaning: 1usability: 1reproducibility of results: 1espnet: 1s3prl: 1decoding: 1pipelines: 1learning systems: 1bridges: 1connectors: 1question answering (information retrieval): 1speech to speech translation: 1multitasking: 1text to speech augmentation: 1duration prediction: 1feature processing: 1singing voice synthesis: 1faces: 1pre training: 1automatic song writing: 1dual transformation loss: 1audio signal processing: 1music objective evaluation: 1natural language processing: 1music: 1pattern clustering: 1speaker embedding: 1speaker clustering: 1inference mechanisms: 1voice activity detection: 1overlap speech detection: 1speaker diarization: 1end to end speech processing: 1conformer: 1transformer: 1data acquisition: 1perceptual loss: 1entropy: 1sequence to sequence singing voice synthesis: 1databases: 1perceptual entropy: 1target speaker speech recognition: 1targetspeaker speech extraction: 1uncertainty estimation: 1recurrent neural nets: 1
Most publications (all venues) at 2024: 27, 2023: 24, 2022: 14, 2021: 11, 2020: 2

Affiliations
URLs

Recent publications

TASLP2024 Shu-Wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li 0001, Abdelrahman Mohamed, Shinji Watanabe 0001, Hung-yi Lee, 
A Large-Scale Evaluation of Speech Foundation Models.

ICASSP2024 Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon Jung, Yichen Lu, Soumi Maiti, Roshan S. Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe 0001, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang, 
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.

ICASSP2024 Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan S. Sharma, Shinji Watanabe 0001, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee, 
Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech.

ICASSP2024 Takashi Maekaku, Jiatong Shi, Xuankai Chang, Yuya Fujita, Shinji Watanabe 0001, 
Hubertopic: Enhancing Semantic Representation of Hubert Through Self-Supervision Utilizing Topic Model.

ICML2024 Dongchao Yang, Jinchuan Tian, Xu Tan 0003, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian 0002, Zhou Zhao, Xixin Wu, Helen M. Meng, 
UniAudio: Towards Universal Audio Generation with Large Language Models.

ICLR2024 Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Y. Sun, 
Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction.

AAAI2024 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren 0006, Yuexian Zou, Zhou Zhao, Shinji Watanabe 0001, 
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.

ACL2024 Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel Robinson, Jiatong Shi, Shinji Watanabe 0001, Graham Neubig, David R. Mortensen, Lori S. Levin, 
Wav2Gloss: Generating Interlinear Glossed Text from Speech.

ACL2024 Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang 0001, Ziyue Jiang 0001, Xuankai Chang, Jiatong Shi, Chao Weng, Zhou Zhao, Dong Yu 0001, 
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners.

ICASSP2023 William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe 0001, 
Improving Massively Multilingual ASR with Auxiliary CTC Objectives.

ICASSP2023 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola García, Hung-Yi Lee, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Euro: Espnet Unsupervised ASR Open-Source Toolkit.

ICASSP2023 Jiatong Shi, Chan-Jan Hsu, Ho-Lam Chung, Dongji Gao, Paola García 0001, Shinji Watanabe 0001, Ann Lee 0001, Hung-Yi Lee, 
Bridging Speech and Textual Pre-Trained Models With Unsupervised ASR.

ICASSP2023 Jiatong Shi, Yun Tang 0002, Ann Lee 0001, Hirofumi Inaguma, Changhan Wang, Juan Pino 0001, Shinji Watanabe 0001, 
Enhancing Speech-To-Speech Translation with Multiple TTS Targets.

ICASSP2023 Yuning Wu, Jiatong Shi, Tao Qian, Dongji Gao, Qin Jin, 
Phoneix: Acoustic Feature Processing Strategy for Enhanced Singing Pronunciation With Phoneme Distribution Predictor.

Interspeech2023 Jiatong Shi, Yun Tang 0002, Hirofumi Inaguma, Hongyu Gong, Juan Pino 0001, Shinji Watanabe 0001, 
Exploration on HuBERT with Multiple Resolution.

Interspeech2023 Jiatong Shi, Dan Berrebbi, William Chen, En-Pei Hu, Wei-Ping Huang, Ho-Lam Chung, Xuankai Chang, Shang-Wen Li 0001, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe 0001, 
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark.

Interspeech2023 Yui Sudo, Muhammad Shakeel 0001, Brian Yan, Jiatong Shi, Shinji Watanabe 0001, 
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders.

ICASSP2022 Tao Qian, Jiatong Shi, Shuai Guo, Peter Wu, Qin Jin, 
Training Strategies for Automatic Song Writing: A Unified Framework Perspective.

ICASSP2022 Chunlei Zhang, Jiatong Shi, Chao Weng, Meng Yu 0003, Dong Yu 0001, 
Towards end-to-end Speaker Diarization with Generalized Neural Speaker Clustering.

Interspeech2022 Dan Berrebbi, Jiatong Shi, Brian Yan, Osbel López-Francisco, Jonathan D. Amith, Shinji Watanabe 0001, 
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation.

#140  | Xugang Lu | DBLP Google Scholar  
By venueInterspeech: 18ICASSP: 7TASLP: 4SpeechComm: 1NeurIPS: 1
By year 2024: 3, 2023: 4, 2022: 4, 2021: 5, 2020: 5, 2019: 7, 2018: 3
ISCA sessionsspeech enhancement and intelligibility: 3paralinguistics: 2speech enhancement: 2speech quality assessment: 1speaker and language recognition: 1single-channel speech enhancement: 1large-scale evaluation of short-duration speaker verification: 1cross-lingual and multilingual asr: 1spoken term detection, confidence measure, and end-to-end speech recognition: 1nn architectures for asr: 1speech and audio classification: 1acoustic modelling: 1audio events and acoustic scenes: 1language identification: 1
IEEE keywordsspeech recognition: 5optimal transport: 4speaker verification: 3spoken language identification: 3couplings: 2unsupervised domain adaptation: 2adaptation models: 2speaker recognition: 2probability distribution: 2signal processing algorithms: 2pattern classification: 2knowledge distillation: 2speech enhancement: 2costs: 1smoothing methods: 1coupling regularization: 1task analysis: 1error analysis: 1sinkhorn attention: 1cross modality alignment: 1transformers: 1automatic speech recognition (asr): 1pretrained language model (plm): 1linguistics: 1linear programming: 1speech emotion recognition: 1emotion recognition: 1self supervised learning: 1prediction algorithms: 1open set domain adaptation: 1domain adaptation: 1diversified memory bank: 1heuristic algorithms: 1digital multimedia broadcasting: 1robustness: 1self paced learning: 1mathematical models: 1re parameterization: 1inference speed: 1cross sequential transformation: 1solids: 1network topology: 1codes: 1bayes methods: 1joint bayesian model: 1affine transforms: 1discriminative model: 1generative model: 1statistical distributions: 1internal representation learning: 1short utterances: 1generalizability: 1dynamically sized decision tree: 1decision trees: 1signal denoising: 1deep neural networks: 1ensemble learning: 1regression analysis: 1decoding: 1unsupervised learning: 1deep denoising autoencoder: 1signal classification: 1teacher model optimization: 1natural languages: 1computer aided instruction: 1short utterance feature representation: 1interactive teacher student learning: 1
Most publications (all venues) at 2016: 16, 2020: 11, 2017: 11, 2023: 10, 2022: 10

Affiliations
URLs

Recent publications

TASLP2024 Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin 0001, Lin Zhang, Junhai Xu, 
Unsupervised Adaptive Speaker Recognition by Coupling-Regularized Optimal Transport.

ICASSP2024 Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai, 
Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-Based ASR.

ICASSP2024 Ruiteng Zhang, Jianguo Wei, Xugang Lu, Yongwei Li, Wenhuan Lu, Di Jin 0001, Junhai Xu, 
Self-Supervised Domain Exploration with an Optimal Transport Regularization for Open Set Cross-Domain Speech Emotion Recognition.

SpeechComm2023 Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin 0001, Lin Zhang, Yantao Ji, Junhai Xu, 
Self-supervised learning based domain regularization for mask-wearing speaker verification.

ICASSP2023 Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin 0001, Lin Zhang, Junhai Xu, 
Optimal Transport with a Diversified Memory Bank for Cross-Domain Speaker Verification.

Interspeech2023 Yang Liu, Haoqin Sun, Geng Chen, Qingyue Wang, Zhen Zhao, Xugang Lu, Longbiao Wang, 
Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions.

Interspeech2023 Ruiteng Zhang, Jianguo Wei, Xugang Lu, Yongwei Li, Junhai Xu, Di Jin 0001, Jianhua Tao 0001, 
SOT: Self-supervised Learning-Assisted Optimal Transport for Unsupervised Adaptive Speech Emotion Recognition.

ICASSP2022 Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Lin Zhang, Yantao Ji, Junhai Xu, Xugang Lu
CS-REP: Making Speaker Verification Networks Embracing Re-Parameterization.

Interspeech2022 Rong Chao, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao 0001, 
Perceptual Contrast Stretching on Target Feature for Speech Enhancement.

Interspeech2022 Kai Li 0018, Sheng Li 0010, Xugang Lu, Masato Akagi, Meng Liu, Lin Zhang, Chang Zeng, Longbiao Wang, Jianwu Dang 0001, Masashi Unoki, 
Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection.

Interspeech2022 Peng Shen, Xugang Lu, Hisashi Kawai, 
Transducer-based language embedding for spoken language identification.

TASLP2021 Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai, 
Coupling a Generative Model With a Discriminative Learning Framework for Speaker Verification.

ICASSP2021 Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai, 
Unsupervised Neural Adaptation Model Based on Optimal Transport for Spoken Language Identification.

Interspeech2021 Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao 0001, 
MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement.

Interspeech2021 Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao 0001, 
Improving Perceptual Quality by Phone-Fortified Perceptual Loss Using Wasserstein Distance for Speech Enhancement.

NeurIPS2021 Hsin-Yi Lin, Huan-Hsin Tseng, Xugang Lu, Yu Tsao 0001, 
Unsupervised Noise Adaptive Speech Enhancement by Discriminator-Constrained Optimal Transport.

TASLP2020 Peng Shen, Xugang Lu, Sheng Li 0010, Hisashi Kawai, 
Knowledge Distillation-Based Representation Learning for Short-Utterance Spoken Language Identification.

TASLP2020 Cheng Yu, Ryandhimas E. Zezario, Syu-Siang Wang, Jonathan Sherman, Yi-Yen Hsieh, Xugang Lu, Hsin-Min Wang, Yu Tsao 0001, 
Speech Enhancement Based on Denoising Autoencoder With Multi-Branched Encoders.

ICASSP2020 Ryandhimas E. Zezario, Tassadaq Hussain, Xugang Lu, Hsin-Min Wang, Yu Tsao 0001, 
Self-Supervised Denoising Autoencoder with Linear Regression Decoder for Speech Enhancement.

Interspeech2020 Yen-Ju Lu, Chien-Feng Liao, Xugang Lu, Jeih-weih Hung, Yu Tsao 0001, 
Incorporating Broad Phonetic Information for Speech Enhancement.

#141  | S. R. Mahadeva Prasanna | DBLP Google Scholar  
By venueInterspeech: 22TASLP: 5SpeechComm: 3ICASSP: 1
By year 2024: 1, 2023: 2, 2022: 2, 2021: 2, 2020: 6, 2019: 6, 2018: 12
ISCA sessionsspeech and voice disorders: 3acoustic analysis-synthesis of speech disorders: 3language identification and diarization: 1speaker and language recognition: 1speech type classification and diagnosis: 1language and accent recognition: 1speech in health: 1phonetic event detection and segmentation: 1speech and speaker recognition: 1show and tell 6: 1show and tell 7: 1speech recognition for indian languages: 1speech segments and voice quality: 1spoofing detection: 1measuring pitch and articulation: 1integrating speech science and technology for clinical applications: 1spoken term detection: 1speech and singing production: 1
IEEE keywordsspectrogram: 3speech recognition: 3task analysis: 2time frequency analysis: 2support vector machines: 2cepstral analysis: 2hidden markov models: 2mel frequency cepstral coefficient: 1self supervised implicit language representation: 1language change detection: 1x vector: 1language diarization (ld): 1covariance matrices: 1harmonic percussive source separation: 1multiple signal classification: 1harmonic analysis: 1radio broadcast audio classification: 1multi task learning: 1speech music overlap detection: 1cnn: 1speech music classification: 1signal classification: 1gmm: 1spectral peak tracking: 1svm: 1time frequency audio features: 1probability: 1audio signal processing: 1music: 1gaussian processes: 1and vowel onset point: 1fourier transforms: 1consonant vowel transitions: 1discrete cosine transforms: 1misarticulated stops: 1cleft lip and palate: 1cleft palate: 1epochs: 1and velopharyngeal dysfunction: 1nasalized voiced stops: 1single pole filter: 1glottal closure instants: 1glottal opening instants: 1neural net architecture: 1electroglottograph: 1generative adversarial network: 1
Most publications (all venues) at 2018: 29, 2023: 25, 2019: 23, 2017: 23, 2022: 21

Affiliations
URLs

Recent publications

TASLP2024 Jagabandhu Mishra, S. R. Mahadeva Prasanna
Implicit Self-Supervised Language Representation for Spoken Language Diarization.

TASLP2023 Mrinmoy Bhattacharjee, S. R. M. Prasanna, Prithwijit Guha, 
Clean vs. Overlapped Speech-Music Detection Using Harmonic-Percussive Features and Multi-Task Learning.

Interspeech2023 Jagabandhu Mishra, Jayadev N. Patil, Amartya Chowdhury, S. R. Mahadeva Prasanna
End to End Spoken Language Diarization with Wav2vec Embeddings.

SpeechComm2022 Mrinmoy Bhattacharjee, S. R. Mahadeva Prasanna, Prithwijit Guha, 
Speech/music classification using phase-based and magnitude-based features.

Interspeech2022 Moakala Tzudir, Priyankoo Sarmah, S. R. Mahadeva Prasanna
Prosodic Information in Dialect Identification of a Tonal Language: The case of Ao.

Interspeech2021 Shikha Baghel, Mrinmoy Bhattacharjee, S. R. Mahadeva Prasanna, Prithwijit Guha, 
Automatic Detection of Shouted Speech Segments in Indian News Debates.

Interspeech2021 Moakala Tzudir, Shikha Baghel, Priyankoo Sarmah, S. R. Mahadeva Prasanna
Excitation Source Feature Based Dialect Identification in Ao - A Low Resource Language.

SpeechComm2020 Protima Nomo Sudro, S. R. Mahadeva Prasanna
Enhancement of cleft palate speech using temporal and spectral processing.

SpeechComm2020 Akhilesh Kumar Dubey, S. R. Mahadeva Prasanna, Samarendra Dandapat, 
Sinusoidal model-based hypernasality detection in cleft palate speech using CVCV sequence.

TASLP2020 Mrinmoy Bhattacharjee, S. R. Mahadeva Prasanna, Prithwijit Guha, 
Speech/Music Classification Using Features From Spectral Peaks.

TASLP2020 Vikram C. Mathad, S. R. Mahadeva Prasanna
Vowel Onset Point Based Screening of Misarticulated Stops in Cleft Lip and Palate Speech.

Interspeech2020 Ajish K. Abraham, M. Pushpavathi, N. Sreedevi, A. Navya, Vikram C. Mathad, S. R. Mahadeva Prasanna
Spectral Moment and Duration of Burst of Plosives in Speech of Children with Hearing Impairment and Typically Developing Children - A Comparative Study.

Interspeech2020 Ayush Agarwal, Jagabandhu Mishra, S. R. Mahadeva Prasanna
VOP Detection in Variable Speech Rate Condition.

TASLP2019 Vikram C. M., Nagaraj Adiga, S. R. Mahadeva Prasanna
Detection of Nasalized Voiced Stops in Cleft Palate Speech Using Epoch-Synchronous Features.

ICASSP2019 K. T. Deepak, Pavitra Kulkarni, U. Mudenagudi, S. R. M. Prasanna
Glottal Instants Extraction from Speech Signal Using Generative Adversarial Network.

Interspeech2019 Akhilesh Kumar Dubey, S. R. Mahadeva Prasanna, Samarendra Dandapat, 
Hypernasality Severity Detection Using Constant Q Cepstral Coefficients.

Interspeech2019 Sarfaraz Jelil, Abhishek Shrivastava, Rohan Kumar Das, S. R. Mahadeva Prasanna, Rohit Sinha 0003, 
SpeechMarker: A Voice Based Multi-Level Attendance Application.

Interspeech2019 Sishir Kalita, Protima Nomo Sudro, S. R. Mahadeva Prasanna, Samarendra Dandapat, 
Nasal Air Emission in Sibilant Fricatives of Cleft Lip and Palate Speech.

Interspeech2019 Protima Nomo Sudro, S. R. Mahadeva Prasanna
Modification of Devoicing Error in Cleft Lip and Palate Speech.

Interspeech2018 Kishalay Chakraborty, Senjam Shantirani Devi, Sanjeevan Devnath, S. R. Mahadeva Prasanna, Priyankoo Sarmah, 
Glotto Vibrato Graph: A Device and Method for Recording, Analysis and Visualization of Glottal Activity.

#142  | Jon Barker | DBLP Google Scholar  
By venueInterspeech: 16ICASSP: 14TASLP: 1
By year 2024: 2, 2023: 2, 2022: 10, 2021: 7, 2020: 5, 2019: 2, 2018: 3
ISCA sessionsspeech intelligibility prediction for hearing-impaired listeners: 3source separation: 2multi-channel speech enhancement and hearing aids: 2technology for disordered speech: 1novel models and training methods for asr: 1assessment of pathological speech and language: 1noise robust and distant speech recognition: 1speech in health: 1applications of language technologies: 1robust speech recognition: 1spatial and phase cues for source separation and speech recognition: 1speech analysis and representation: 1
IEEE keywordsspeech recognition: 9hearing aids: 4speech intelligibility: 4machine learning: 3speech enhancement: 3hearing loss: 3speech in noise: 3source separation: 3handicapped aids: 3hearing aid: 2noise measurement: 2multi stream acoustic modelling: 2data augmentation: 2natural language processing: 2time domain analysis: 2speaker recognition: 2auditory system: 1measurement: 1signal processing algorithms: 1training data: 1hearing impairment: 1intelligibility prediction: 1psychology: 1predictive models: 1complexity theory: 1array signal processing: 1recording: 1transducers: 1acoustic measurements: 1speech quality: 1dysarthric automatic speech recognition: 1filtering theory: 1speech coding: 1source filter separation and fusion: 1audio signal processing: 1automatic speech recognition: 1data simulation: 1microphones: 1auditory model: 1deep neural network: 1buildings: 1feature fusion: 1multi modal dysarthric speech recognition: 1convolution: 1databases: 1performance gain: 1voice source: 1sung speech: 1hearing aid speech processing: 1intelligibility objective: 1medical signal processing: 1hearing: 1optimisation: 1backpropagation: 1differentiable framework: 1noise: 1multi speaker extraction: 1reverberation: 1object recognition: 1multi channel source separation: 1transfer learning: 1dysarthric speech recognition: 1probability: 1posterior probability: 1entropy: 1gaussian distribution: 1data selection: 1spectral analysis: 1language modelling: 1continuous dysarthric speech recognition: 1vocabulary: 1out of domain data: 1multi speaker asr: 1convolutional neural nets: 1tasnet: 1multi channel: 1speech separation: 1end to end: 1dysarthria: 1speech tempo: 1hidden markov models: 1personalised speech recognition: 1mixture models: 1phonetics: 1gaussian processes: 1
Most publications (all venues) at 2022: 12, 2017: 11, 2015: 10, 2018: 9, 2021: 7

Affiliations
URLs

Recent publications

ICASSP2024 Jon Barker, Michael A. Akeroyd, Will Bailey, Trevor J. Cox, John F. Culling, Jennifer Firth, Simone Graetzer, Graham Naylor, 
The 2nd Clarity Prediction Challenge: A Machine Learning Challenge for Hearing Aid Intelligibility Prediction.

ICASSP2024 Rhiannon Mogridge, George Close, Robert Sutherland, Thomas Hain, Jon Barker, Stefan Goetze, Anton Ragni, 
Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users Using Intermediate ASR Features and Human Memory Models.

ICASSP2023 Michael A. Akeroyd, Will Bailey, Jon Barker, Trevor J. Cox, John F. Culling, Simone Graetzer, Graham Naylor, Zuzanna Podwinska, Zehai Tu, 
The 2nd Clarity Enhancement Challenge for Hearing Aid Speech Intelligibility Enhancement: Overview and Outcomes.

ICASSP2023 Trevor J. Cox, Jon Barker, Will Bailey, Simone Graetzer, Michael A. Akeroyd, John F. Culling, Graham Naylor, 
Overview of the 2023 ICASSP SP Clarity Challenge: Speech Enhancement for Hearing Aids.

TASLP2022 Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic, 
Acoustic Modelling From Raw Source and Filter Components for Dysarthric Speech Recognition.

ICASSP2022 Jack Deadman, Jon Barker
Improved Simulation of Realistically-Spatialised Simultaneous Speech Using Multi-Camera Analysis in The Chime-5 Dataset.

ICASSP2022 Zehai Tu, Jack Deadman, Ning Ma 0002, Jon Barker
Auditory-Based Data Augmentation for end-to-end Automatic Speech Recognition.

ICASSP2022 Zhengjun Yue, Erfan Loweimi, Zoran Cvetkovic, Heidi Christensen, Jon Barker
Multi-Modal Acoustic-Articulatory Feature Fusion For Dysarthric Speech Recognition.

Interspeech2022 Jon Barker, Michael Akeroyd, Trevor J. Cox, John F. Culling, Jennifer Firth, Simone Graetzer, Holly Griffiths, Lara Harris, Graham Naylor, Zuzanna Podwinska, Eszter Porter, Rhoddy Viveros Muñoz, 
The 1st Clarity Prediction Challenge: A machine learning challenge for hearing aid intelligibility prediction.

Interspeech2022 Jack Deadman, Jon Barker
Modelling Turn-taking in Multispeaker Parties for Realistic Data Simulation.

Interspeech2022 Zehai Tu, Ning Ma 0002, Jon Barker
Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners.

Interspeech2022 Zehai Tu, Ning Ma 0002, Jon Barker
Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction.

Interspeech2022 Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic, 
Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs.

Interspeech2022 Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker
On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training.

ICASSP2021 Gerardo Roa Dabike, Jon Barker
The use of Voice Source Features for Sung Speech Recognition.

ICASSP2021 Zehai Tu, Ning Ma 0002, Jon Barker
DHASP: Differentiable Hearing Aid Speech Processing.

ICASSP2021 Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker
Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism.

Interspeech2021 Simone Graetzer, Jon Barker, Trevor J. Cox, Michael Akeroyd, John F. Culling, Graham Naylor, Eszter Porter, Rhoddy Viveros Muñoz, 
Clarity-2021 Challenges: Machine Learning Challenges for Advancing Hearing Aid Processing.

Interspeech2021 Zehai Tu, Ning Ma 0002, Jon Barker
Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model.

Interspeech2021 Zhengjun Yue, Jon Barker, Heidi Christensen, Cristina McKean, Elaine Ashton, Yvonne Wren, Swapnil Gadgil, Rebecca Bright, 
Parental Spoken Scaffolding and Narrative Skills in Crowd-Sourced Storytelling Samples of Young Children.

#143  | Preethi Jyothi | DBLP Google Scholar  
By venueInterspeech: 18ICASSP: 5ACL: 2EMNLP: 2ICLR: 1IJCAI: 1EMNLP-Findings: 1ACL-Findings: 1
By year 2023: 10, 2022: 5, 2021: 7, 2020: 5, 2019: 1, 2018: 3
ISCA sessionsshow and tell: 2speech recognition: 2speech synthesis: 1novel models and training methods for asr: 1cross/multi-lingual asr: 1multi- and cross-lingual asr, other topics in asr: 1cross/multi-lingual and code-switched asr: 1low-resource speech recognition: 1communication and interaction, multimodality: 1acoustic model adaptation for asr: 1multimodal speech processing: 1multilingual and code-switched asr: 1cross-lingual and multilingual asr: 1topics in speech recognition: 1adjusting to speaker, accent, and domain: 1acoustic scenes and rare events: 1
IEEE keywordsspeech recognition: 4natural language processing: 2ctc: 1code switched asr: 1speech coding: 1zero shot asr: 1task analysis: 1implicit language model: 1rare word asr: 1rnn transducer: 1speaker adaptation: 1personalization: 1error detection: 1speaker recognition: 1data selection: 1accentadaptation: 1data augmentation: 1speech enhancement: 1robust asr: 1multi task and adversarial learning: 1error analysis: 1context modeling: 1coupled training: 1sequence to sequence models with attention: 1accented speech recognition: 1
Most publications (all venues) at 2021: 18, 2024: 14, 2023: 14, 2022: 12, 2020: 8

Affiliations
URLs

Recent publications

ICASSP2023 Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe 0001, 
Towards Zero-Shot Code-Switched Speech Recognition.

Interspeech2023 Vineet Bhat, Preethi Jyothi, Pushpak Bhattacharyya, 
DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction.

Interspeech2023 Jie Chi, Brian Lu, Jason Eisner, Peter Bell 0001, Preethi Jyothi, Ahmed M. Ali 0002, 
Unsupervised Code-switched Text Generation from Parallel Text.

Interspeech2023 Tankala Pavan Kalyan, Preeti Rao, Preethi Jyothi, Pushpak Bhattacharyya, 
Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS.

Interspeech2023 Vinit S. Unni, Ashish R. Mittal, Preethi Jyothi, Sunita Sarawagi, 
Improving RNN-Transducers with Acoustic LookAhead.

ICLR2023 Ashish R. Mittal, Sunita Sarawagi, Preethi Jyothi
In-Situ Text-Only Adaptation of Speech Models with Low-Overhead Speech Imputations.

IJCAI2023 Piyush Singh Pasi, Karthikeya Battepati, Preethi Jyothi, Ganesh Ramakrishnan, Tanmay Mahapatra, Manoj Singh, 
Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration.

ACL2023 Suraj Kothawade, Anmol Reddy Mekala, D. Chandra Sekhara Hetha Havya, Mayank Kothyari, Rishabh K. Iyer, Ganesh Ramakrishnan, Preethi Jyothi
DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation.

EMNLP2023 Ashish R. Mittal, Sunita Sarawagi, Preethi Jyothi, George Saon, Gakuto Kurata, 
Speech-enriched Memory for Inference-time Adaptation of ASR Models to Word Dictionaries.

EMNLP2023 Darshan Prabhu, Preethi Jyothi, Sriram Ganapathy, Vinit Unni, 
Accented Speech Recognition With Accent-specific Codebooks.

ICASSP2022 Vinit Unni, Shreya Khare, Ashish R. Mittal, Preethi Jyothi, Sunita Sarawagi, Samarth Bharadwaj, 
Adaptive Discounting of Implicit Language Models in RNN-Transducers.

Interspeech2022 Arjit Jain, Pranay Reddy Samala, Deepak Mittal, Preethi Jyothi, Maneesh Singh 0001, 
SPLICEOUT: A Simple and Efficient Audio Augmentation Method.

Interspeech2022 Rishabh Kumar, Devaraja Adiga, Mayank Kothyari, Jatin Dalal, Ganesh Ramakrishnan, Preethi Jyothi
VAgyojaka: An Annotating and Post-Editing Tool for Automatic Speech Recognition.

Interspeech2022 Rishabh Kumar, Devaraja Adiga, Rishav Ranjan, Amrith Krishna, Ganesh Ramakrishnan, Pawan Goyal 0002, Preethi Jyothi
Linguistically Informed Post-processing for ASR Error correction in Sanskrit.

EMNLP-Findings2022 Ashish R. Mittal, Durga Sivasubramanian, Rishabh K. Iyer, Preethi Jyothi, Ganesh Ramakrishnan, 
Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training.

ICASSP2021 Abhijeet Awasthi, Aman Kansal, Sunita Sarawagi, Preethi Jyothi
Error-Driven Fixed-Budget ASR Personalization for Accented Speakers.

ICASSP2021 Archiki Prasad, Preethi Jyothi, Rajbabu Velmurugan, 
An Investigation of End-to-End Models for Robust Speech Recognition.

Interspeech2021 Anuj Diwan, Preethi Jyothi
Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages.

Interspeech2021 Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan K. M., Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish R. Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan, 
MUCS 2021: Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages.

Interspeech2021 Shreya Khare, Ashish R. Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj, 
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration.

#144  | Massimiliano Todisco | DBLP Google Scholar  
By venueInterspeech: 19ICASSP: 9TASLP: 3
By year 2024: 4, 2023: 5, 2022: 2, 2021: 7, 2020: 5, 2019: 5, 2018: 3
ISCA sessionsspeech coding: 2voice anti-spoofing and countermeasure: 2voice privacy challenge: 2anti-spoofing for speaker verification: 1robust speaker recognition: 1privacy-preserving machine learning for audio & speech processing: 1the first dicova challenge: 1graph and end-to-end learning for speaker recognition: 1anti-spoofing and liveness detection: 1speaker recognition evaluation: 1speaker recognition: 1privacy in speech and audio interfaces: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1novel approaches to enhancement: 1spoken corpora and annotation: 1speaker verification: 1
IEEE keywordsspeaker recognition: 5task analysis: 4presentation attack detection: 4privacy: 3data privacy: 3protocols: 3anti spoofing: 2speaker anonymization: 2codecs: 2spoofing: 2countermeasures: 2automatic speaker verification: 2artificial bandwidth extension: 2variational auto encoder: 2speech quality: 2latent variable: 2pseudonymisation: 1voice privacy: 1anonymisation: 1voice conversion: 1attack model: 1speech synthesis: 1recording: 1training data: 1degradation: 1text to speech: 1countermeasure: 1deepfake detection: 1signal processing algorithms: 1privacy friendly data: 1information filtering: 1language robust orthogonal householder neural network: 1vocoders: 1language modeling: 1linguistics: 1neural audio codec: 1semantics: 1asvspoof: 1deepfakes: 1distributed databases: 1communication networks: 1error analysis: 1spoofing countermeasures: 1joint optimisation: 1speaker verification: 1spoofing detection: 1covid 19: 1diseases: 1patient diagnosis: 1respiratory sounds: 1medical signal processing: 1auditory acoustic features: 1recurrent neural nets: 1bi lstm: 1audio signal processing: 1transient response: 1data augmentation: 1filtering theory: 1public domain software: 1signal classification: 1automatic speaker verification (asv): 1security of data: 1detect ion cost function: 1spoofing counter measures: 1speech recognition: 1statistical distributions: 1mean square error methods: 1generative adversarial network: 1telephony: 1regression analysis: 1speech coding: 1dimensionality reduction: 1
Most publications (all venues) at 2022: 13, 2021: 13, 2023: 11, 2024: 10, 2020: 9

Affiliations
URLs

Recent publications

TASLP2024 Michele Panariello, Natalia A. Tomashenko, Xin Wang 0037, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas W. D. Evans, Emmanuel Vincent 0001, Junichi Yamagishi, 
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation.

ICASSP2024 Wanying Ge, Xin Wang 0037, Junichi Yamagishi, Massimiliano Todisco, Nicholas W. D. Evans, 
Spoofing Attack Augmentation: Can Differently-Trained Attack Models Improve Generalisation?

ICASSP2024 Xiaoxiao Miao, Xin Wang 0037, Erica Cooper, Junichi Yamagishi, Nicholas W. D. Evans, Massimiliano Todisco, Jean-François Bonastre, Mickael Rouvier, 
Synvox2: Towards A Privacy-Friendly Voxceleb2 Dataset.

ICASSP2024 Michele Panariello, Francesco Nespoli, Massimiliano Todisco, Nicholas W. D. Evans, 
Speaker Anonymization Using Neural Audio Codec Language Models.

TASLP2023 Xuechen Liu, Xin Wang 0037, Md. Sahidullah, Jose Patino 0001, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas W. D. Evans, Andreas Nautsch, Kong Aik Lee, 
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.

ICASSP2023 Wanying Ge, Hemlata Tak, Massimiliano Todisco, Nicholas W. D. Evans, 
Can Spoofing Countermeasure And Speaker Verification Systems Be Jointly Optimised?

Interspeech2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang 0037, Xuechen Liu, Md. Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas W. D. Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung, 
Towards Single Integrated Spoofing-aware Speaker Verification Embeddings.

Interspeech2023 Michele Panariello, Wanying Ge, Hemlata Tak, Massimiliano Todisco, Nicholas W. D. Evans, 
Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems.

Interspeech2023 Michele Panariello, Massimiliano Todisco, Nicholas W. D. Evans, 
Vocoder drift in x-vector-based speaker anonymization.

ICASSP2022 Madhu R. Kamble, Jose Patino 0001, Maria A. Zuluaga, Massimiliano Todisco
Exploring Auditory Acoustic Features for The Diagnosis of Covid-19.

ICASSP2022 Hemlata Tak, Madhu R. Kamble, Jose Patino 0001, Massimiliano Todisco, Nicholas W. D. Evans, 
Rawboost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing.

ICASSP2021 Hemlata Tak, Jose Patino 0001, Massimiliano Todisco, Andreas Nautsch, Nicholas W. D. Evans, Anthony Larcher, 
End-to-End anti-spoofing with RawNet2.

Interspeech2021 Jose Patino 0001, Natalia A. Tomashenko, Massimiliano Todisco, Andreas Nautsch, Nicholas W. D. Evans, 
Speaker Anonymisation Using the McAdams Coefficient.

Interspeech2021 Oubaïda Chouchane, Baptiste Brossier, Jorge Esteban Gamboa Gamboa, Thomas Lardy, Hemlata Tak, Orhan Ermis, Madhu R. Kamble, Jose Patino 0001, Nicholas W. D. Evans, Melek Önen, Massimiliano Todisco
Privacy-Preserving Voice Anti-Spoofing Using Secure Multi-Party Computation.

Interspeech2021 Wanying Ge, Michele Panariello, Jose Patino 0001, Massimiliano Todisco, Nicholas W. D. Evans, 
Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection.

Interspeech2021 Madhu R. Kamble, José Andrés González López, Teresa Grau, Juan M. Espín, Lorenzo Cascioli, Yiqing Huang, Alejandro Gómez Alanís, Jose Patino 0001, Roberto Font, Antonio M. Peinado, Angel M. Gomez, Nicholas W. D. Evans, Maria A. Zuluaga, Massimiliano Todisco
PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge.

Interspeech2021 Tomi Kinnunen, Andreas Nautsch, Md. Sahidullah, Nicholas W. D. Evans, Xin Wang 0037, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee, 
Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing.

Interspeech2021 Hemlata Tak, Jee-weon Jung, Jose Patino 0001, Massimiliano Todisco, Nicholas W. D. Evans, 
Graph Attention Networks for Anti-Spoofing.

TASLP2020 Tomi Kinnunen, Héctor Delgado, Nicholas W. D. Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang 0037, Md. Sahidullah, Junichi Yamagishi, Douglas A. Reynolds, 
Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals.

ICASSP2020 Pramod B. Bachhav, Massimiliano Todisco, Nicholas W. D. Evans, 
Artificial Bandwidth Extension Using Conditional Variational Auto-encoders and Adversarial Learning.

#145  | Tsubasa Ochiai | DBLP Google Scholar  
By venueInterspeech: 17ICASSP: 12TASLP: 2
By year 2024: 3, 2023: 4, 2022: 6, 2021: 8, 2020: 5, 2019: 4, 2018: 1
ISCA sessionsspeech coding and enhancement: 2speech recognition: 1dereverberation, noise reduction, and speaker extraction: 1speech enhancement and intelligibility: 1search/decoding algorithms for asr: 1novel models and training methods for asr: 1single-channel speech enhancement: 1source separation: 1streaming for asr/rnn transducers: 1source separation, dereverberation and echo cancellation: 1speech localization, enhancement, and quality assessment: 1asr neural network architectures and training: 1targeted source separation: 1asr for noisy and far-field speech: 1speech and audio source separation and scene analysis: 1recurrent neural models for asr: 1
IEEE keywordsspeech recognition: 11speech enhancement: 8neural network: 4speaker recognition: 4single channel speech enhancement: 3target speech extraction: 3source separation: 3noise robust speech recognition: 2processing distortion: 2speech extraction: 2blind source separation: 2reverberation: 2array signal processing: 2recurrent neural nets: 2dynamic stream weights: 2time domain network: 2degradation: 1nonlinear distortion: 1noise measurement: 1interference: 1joint training: 1data models: 1acoustic distortion: 1interpolation: 1feature aggregation: 1pre trained models: 1transformers: 1adaptation models: 1benchmark testing: 1self supervised learning: 1telephone sets: 1artificial neural networks: 1data mining: 1few shot adaptation: 1sound event: 1soundbeam: 1target sound extraction: 1recording: 1input switching: 1deep learning (artificial intelligence): 1speech separation: 1speakerbeam: 1signal to distortion ratio: 1acoustic beamforming: 1complex backpropagation: 1convolution: 1transfer functions: 1multi channel source separation: 1speaker activity: 1meeting recognition: 1recurrent neural network transducer: 1entropy: 1natural language processing: 1whole network pre training: 1synchronisation: 1end to end: 1autoregressive processes: 1sensor fusion: 1audiovisual speaker localization: 1audio visual systems: 1audio signal processing: 1image fusion: 1data fusion: 1video signal processing: 1microphone arrays: 1multi task loss: 1spatial features: 1signal denoising: 1robust asr: 1time domain analysis: 1audiovisual speaker tracking: 1kalman filters: 1tracking: 1backprop kalman filter: 1backpropagation: 1adaptation: 1auxiliary feature: 1speaker attention: 1speech separation/extraction: 1
Most publications (all venues) at 2021: 12, 2024: 9, 2022: 8, 2023: 7, 2020: 7

Affiliations
URLs

Recent publications

TASLP2024 Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance.

ICASSP2024 Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts?

ICASSP2024 Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocký, 
Target Speech Extraction with Pre-Trained Self-Supervised Learning Models.

TASLP2023 Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki, 
SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning.

Interspeech2023 Shoko Araki, Ayako Yamamoto, Tsubasa Ochiai, Kenichi Arai, Atsunori Ogawa, Tomohiro Nakatani, Toshio Irino, 
Impact of Residual Noise and Artifacts in Speech Enhancement Errors on Intelligibility of Human and Machine.

Interspeech2023 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami, 
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data.

Interspeech2023 Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo, 
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss.

ICASSP2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya, 
Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition.

Interspeech2022 Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolíková, Hiroshi Sato, Tomohiro Nakatani, 
Listen only to me! How well can target speech extraction handle false alarms?

Interspeech2022 Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
How bad are artifacts?: Analyzing the impact of speech enhancement errors on ASR.

Interspeech2022 Martin Kocour, Katerina Zmolíková, Lucas Ondel, Jan Svec, Marc Delcroix, Tsubasa Ochiai, Lukás Burget, Jan Cernocký, 
Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model.

Interspeech2022 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki, 
Streaming Target-Speaker ASR with Neural Transducer.

Interspeech2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, 
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations.

ICASSP2021 Christoph Böddeker, Wangyou Zhang, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, Yanmin Qian, Reinhold Haeb-Umbach, 
Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation.

ICASSP2021 Marc Delcroix, Katerina Zmolíková, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani, 
Speaker Activity Driven Neural Speech Extraction.

ICASSP2021 Takafumi Moriya, Takanori Ashihara, Tomohiro Tanaka, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Yusuke Ijima, Ryo Masumura, Yusuke Shinohara, 
Simpleflat: A Simple Whole-Network Pre-Training Approach for RNN Transducer-Based End-to-End Speech Recognition.

ICASSP2021 Julio Wissing, Benedikt T. Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura, 
Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain.

Interspeech2021 Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki, 
Few-Shot Learning of New Sound Classes for Target Sound Extraction.

Interspeech2021 Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix, Taichi Asami, 
Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture.

Interspeech2021 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo, 
Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition.

#146  | Yonghui Wu | DBLP Google Scholar  
By venueICASSP: 14Interspeech: 13ICLR: 2ICML: 1NeurIPS: 1
By year 2022: 3, 2021: 6, 2020: 9, 2019: 10, 2018: 3
ISCA sessionsspeech synthesis: 5asr neural network architectures: 2asr neural network architectures and training: 1training strategies for asr: 1speech translation: 1cross-lingual and multilingual asr: 1application of asr in medical practice: 1end-to-end speech recognition: 1
IEEE keywordsspeech recognition: 9speech synthesis: 5recurrent neural nets: 5speech coding: 4data augmentation: 3conformer: 2speaker recognition: 2rnn t: 2latency: 2regression analysis: 2optimisation: 2natural language processing: 2end to end speech recognition: 2fine grained vae: 2text to speech: 2data models: 2tacotron 2: 2two pass asr: 1end to end asr: 1rnnt: 1long form asr: 1vae: 1iterative methods: 1text analysis: 1computational complexity: 1neural tts: 1self attention: 1non autoregressive: 1autoregressive processes: 1cascaded encoders: 1probability: 1endpointer: 1multi domain training: 1vocabulary: 1standards: 1vector quantization: 1measurement: 1hierarchical: 1mobile handsets: 1variational autoencoder: 1adversarial training: 1text to speech synthesis: 1weakly supervised learning: 1training data: 1speech translation: 1synthetic training data: 1decoding: 1computer architecture: 1predictive models: 1sequence to sequence model: 1task analysis: 1multilingual: 1end to end speech synthesis: 1
Most publications (all venues) at 2022: 25, 2020: 25, 2019: 25, 2023: 21, 2021: 21

Affiliations
URLs

Recent publications

ICASSP2022 Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Weiran Wang, David Qiu, Chung-Cheng Chiu, Rohit Prabhavalkar, Alexander Gruenstein, Anmol Gulati, Bo Li 0028, David Rybach, Emmanuel Guzman, Ian McGraw, James Qin, Krzysztof Choromanski, Qiao Liang 0001, Robert David, Ruoming Pang, Shuo-Yiin Chang, Trevor Strohman, W. Ronny Huang, Wei Han 0002, Yonghui Wu, Yu Zhang 0033, 
Improving The Latency And Quality Of Cascaded Encoders.

Interspeech2022 Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang 0033, Yonghui Wu, Rob Clark, 
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks.

ICML2022 Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu
Self-supervised learning with random-projection quantizer for speech recognition.

ICASSP2021 Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang 0033, Ye Jia, Ron J. Weiss, Yonghui Wu
Parallel Tacotron: Non-Autoregressive and Controllable TTS.

ICASSP2021 Bo Li 0028, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han 0002, Qiao Liang 0001, Yu Zhang 0033, Trevor Strohman, Yonghui Wu
A Better and Faster end-to-end Model for Streaming ASR.

ICASSP2021 Jiahui Yu, Chung-Cheng Chiu, Bo Li 0028, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, Arun Narayanan, Wei Han 0002, Anmol Gulati, Yonghui Wu, Ruoming Pang, 
FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization.

Interspeech2021 Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang 0033, Ye Jia, R. J. Skerry-Ryan, Yonghui Wu
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling.

Interspeech2021 Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang 0033, Yonghui Wu
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS.

ICLR2021 Jiahui Yu, Wei Han 0002, Anmol Gulati, Chung-Cheng Chiu, Bo Li 0028, Tara N. Sainath, Yonghui Wu, Ruoming Pang, 
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling.

ICASSP2020 Bo Li 0028, Shuo-Yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, Yonghui Wu
Towards Fast and Accurate Streaming End-To-End ASR.

ICASSP2020 Daniel S. Park, Yu Zhang 0033, Chung-Cheng Chiu, Youzheng Chen, Bo Li 0028, William Chan, Quoc V. Le, Yonghui Wu
Specaugment on Large Scale Datasets.

ICASSP2020 Tara N. Sainath, Yanzhang He, Bo Li 0028, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-Yiin Chang, Wei Li 0133, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alexander Gruenstein, Ke Hu, Anjuli Kannan, Qiao Liang 0001, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirkó Visontai, Yonghui Wu, Yu Zhang 0033, Ding Zhao, 
A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency.

ICASSP2020 Guangzhi Sun, Yu Zhang 0033, Ron J. Weiss, Yuan Cao 0007, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu
Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior.

ICASSP2020 Guangzhi Sun, Yu Zhang 0033, Ron J. Weiss, Yuan Cao 0007, Heiga Zen, Yonghui Wu
Fully-Hierarchical Fine-Grained Prosody Modeling For Interpretable Speech Synthesis.

ICASSP2020 Gary Wang, Andrew Rosenberg, Zhehuai Chen, Yu Zhang 0033, Bhuvana Ramabhadran, Yonghui Wu, Pedro J. Moreno 0001, 
Improving Speech Recognition Using Consistent Predictions on Synthesized Speech.

Interspeech2020 Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang 0033, Jiahui Yu, Wei Han 0002, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang, 
Conformer: Convolution-augmented Transformer for Speech Recognition.

Interspeech2020 Wei Han 0002, Zhengdong Zhang, Yu Zhang 0033, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context.

Interspeech2020 Daniel S. Park, Yu Zhang 0033, Ye Jia, Wei Han 0002, Chung-Cheng Chiu, Bo Li 0028, Yonghui Wu, Quoc V. Le, 
Improved Noisy Student Training for Automatic Speech Recognition.

ICASSP2019 Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang 0001, Deepti Bhatia, Yuan Shangguan, Bo Li 0028, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-Yiin Chang, Kanishka Rao, Alexander Gruenstein, 
Streaming End-to-end Speech Recognition for Mobile Devices.

ICASSP2019 Wei-Ning Hsu, Yu Zhang 0033, Ron J. Weiss, Yu-An Chung, Yuxuan Wang 0002, Yonghui Wu, James R. Glass, 
Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization.

#147  | Takaaki Hori | DBLP Google Scholar  
By venue: ICASSP: 14, Interspeech: 14, TASLP: 2
By year: 2024: 1, 2023: 1, 2022: 4, 2021: 7, 2020: 5, 2019: 11, 2018: 1
ISCA sessions: spoken dialogue systems: 1; self-supervision and semi-supervision for neural asr training: 1; acoustic event detection and acoustic scene classification: 1; novel neural network architectures for asr: 1; streaming for asr/rnn transducers: 1; asr neural network architectures: 1; diarization: 1; sequence-to-sequence speech recognition: 1; emotion and personality in conversation: 1; spoken term detection, confidence measure, and end-to-end speech recognition: 1; end-to-end speech recognition: 1; search methods for speech recognition: 1; speech technologies for code-switching in multilingual communities: 1; recurrent neural models for asr: 1
IEEE keywords: speech recognition: 14; natural language processing: 5; end to end speech recognition: 4; automatic speech recognition: 3; end to end: 3; graph theory: 3; self training: 3; recurrent neural nets: 3; speech coding: 3; decoding: 2; ctc: 2; pattern classification: 2; wfst: 2; pseudo labeling: 2; transformer: 2; microphone arrays: 2; connectionist temporal classification: 2; joint ctc/attention: 2; triggered attention: 2; hidden markov models: 1; data models: 1; task analysis: 1; neural transducer: 1; buildings: 1; transducers: 1; attention masking: 1; transformer transducer: 1; transformers: 1; acoustic rescoring: 1; conformer: 1; pipelines: 1; gtc: 1; multi speaker overlapped speech: 1; end to end asr: 1; semi supervised learning (artificial intelligence): 1; iterative methods: 1; semi supervised learning: 1; gtc t: 1; rnn t: 1; transducer: 1; asr: 1; domain adaptation: 1; dropout: 1; iterative pseudo labeling: 1; self supervised asr: 1; dilated self attention: 1; computational complexity: 1; language translation: 1; graph based temporal classification: 1; semi supervised asr: 1; encoding: 1; encoder decoder: 1; multi encoder multi resolution (mem res): 1; multi encoder multi array (mem array): 1; hierarchical attention network (han): 1; audio coding: 1; streaming: 1; neural turing machine: 1; turing machines: 1; unsupervised speaker adaptation: 1; speaker recognition: 1; speaker memory: 1; attention models: 1; discriminative training: 1; optimisation: 1; softmax margin: 1; beam search training: 1; sequence learning: 1; cold fusion: 1; automatic speech recognition (asr): 1; language model: 1; shallow fusion: 1; storage management: 1; deep fusion: 1; sequence to sequence: 1; expert systems: 1; unpaired data: 1; cycle consistency: 1; end to end automatic speech recognition: 1; frame synchronous decoding: 1; attention mechanism: 1; signal classification: 1; error statistics: 1; stream attention: 1; speech codecs: 1; multiple microphone array: 1
Most publications (all venues) at 2019: 18, 2017: 17, 2012: 14, 2018: 12, 2013: 10

Affiliations
URLs

Recent publications

TASLP2024 Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe 0001, 
End-to-End Speech Recognition: A Survey.

ICASSP2023 Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva, Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza Silovsky, Ruchir Travadi, Xiaodan Zhuang, 
Variable Attention Masking for Configurable Transformer Transducer Speech Recognition.

ICASSP2022 Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe 0001, Jonathan Le Roux, 
Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR.

ICASSP2022 Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori
Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy.

ICASSP2022 Niko Moritz, Takaaki Hori, Shinji Watanabe 0001, Jonathan Le Roux, 
Sequence Transduction with Graph-Based Supervision.

Interspeech2022 Chiori Hori, Takaaki Hori, Jonathan Le Roux, 
Low-Latency Online Streaming VideoQA Using Audio-Visual Transformers.

ICASSP2021 Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux, 
Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training.

ICASSP2021 Niko Moritz, Takaaki Hori, Jonathan Le Roux, 
Capturing Multi-Resolution Context by Dilated Self-Attention.

ICASSP2021 Niko Moritz, Takaaki Hori, Jonathan Le Roux, 
Semi-Supervised Speech Recognition Via Graph-Based Temporal Classification.

Interspeech2021 Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition.

Interspeech2021 Chiori Hori, Takaaki Hori, Jonathan Le Roux, 
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers.

Interspeech2021 Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux, 
Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers.

Interspeech2021 Niko Moritz, Takaaki Hori, Jonathan Le Roux, 
Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition.

TASLP2020 Ruizhi Li, Xiaofei Wang 0007, Sri Harish Mallidi, Shinji Watanabe 0001, Takaaki Hori, Hynek Hermansky, 
Multi-Stream End-to-End Speech Recognition.

ICASSP2020 Niko Moritz, Takaaki Hori, Jonathan Le Roux, 
Streaming Automatic Speech Recognition with the Transformer Model.

ICASSP2020 Leda Sari, Niko Moritz, Takaaki Hori, Jonathan Le Roux, 
Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR.

Interspeech2020 Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux, 
Transformer-Based Long-Context End-to-End Speech Recognition.

Interspeech2020 Niko Moritz, Gordon Wichern, Takaaki Hori, Jonathan Le Roux, 
All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection.

ICASSP2019 Murali Karthick Baskar, Lukás Burget, Shinji Watanabe 0001, Martin Karafiát, Takaaki Hori, Jan Honza Cernocký, 
Promising Accurate Prefix Boosting for Sequence-to-sequence ASR.

ICASSP2019 Jaejin Cho, Shinji Watanabe 0001, Takaaki Hori, Murali Karthick Baskar, Hirofumi Inaguma, Jesús Villalba 0001, Najim Dehak, 
Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition.

#148  | Simon King 0001 | DBLP Google Scholar  
By venue: Interspeech: 21, ICASSP: 7, TASLP: 2
By year: 2024: 1, 2023: 4, 2022: 3, 2021: 4, 2020: 6, 2019: 7, 2018: 5
ISCA sessions: speech synthesis: 11; text analysis, multilingual issues and evaluation in speech synthesis: 2; show and tell: 1; intelligibility-enhancing speech modification: 1; speech synthesis paradigms and methods: 1; speech intelligibility: 1; representation learning of emotion and paralinguistics: 1; prosody modeling and generation: 1; speech perception in adverse conditions: 1; voice conversion and speech synthesis: 1
IEEE keywords: speech synthesis: 8; vocoders: 3; task analysis: 2; neural vocoder: 2; filtering theory: 2; aerospace electronics: 1; style modelling: 1; speech: 1; prosody: 1; prosody transfer: 1; prosody modeling and generation: 1; text to speech: 1; computational modeling: 1; data models: 1; computer architecture: 1; predictive models: 1; ensemble methods: 1; prosody prediction: 1; machine learning: 1; representation learning: 1; differentiable dsp: 1; synthesizers: 1; speech coding: 1; signal processing algorithms: 1; machine learning algorithms: 1; training data: 1; speech analysis: 1; voice conversion evaluation: 1; voice conversion challenges: 1; speaker characterization: 1; pipelines: 1; vocoding: 1; voice conversion: 1; neural network: 1; variational auto encoder: 1; recurrent neural nets: 1; fundamental frequency: 1; multilingual: 1; speaker embedding: 1; subjective evaluation: 1; linguistics: 1; dnn: 1; cross language: 1; natural language processing: 1; tts: 1; asvspoof: 1; replay attacks: 1; automatic speaker verification: 1; security of data: 1; spoofing attack: 1; anti spoofing: 1; speaker recognition: 1; speech reconstruction: 1; convolutional neural nets: 1; convolutional neural network: 1
Most publications (all venues) at 2013: 27, 2016: 20, 2010: 18, 2014: 17, 2015: 16

Affiliations
University of Edinburgh, Centre for Speech Technology Research, Scotland, UK

Recent publications

ICASSP2024 Atli Sigurgeirsson, Simon King 0001
Controllable Speaking Styles Using A Large Language Model.

ICASSP2023 Atli Þór Sigurgeirsson, Simon King 0001
Do Prosody Transfer Models Transfer Prosody?

ICASSP2023 Tian Huey Teh, Vivian Hu, Devang S. Ram Mohan, Zack Hodari, Christopher G. R. Wallis, Tomás Gómez Ibarrondo, Alexandra Torresquintero, James Leoni, Mark J. F. Gales, Simon King 0001
Ensemble Prosody Prediction For Expressive Speech Synthesis.

ICASSP2023 Jacob J. Webber, Cassia Valentini-Botinhao, Evelyn Williams, Gustav Eje Henter, Simon King 0001
Autovocoder: Fast Waveform Generation from a Learned Speech Representation Using Differentiable Digital Signal Processing.

Interspeech2023 Niamh Corkey, Johannah O'Mahony, Simon King 0001
Intonation Control for Neural Text-to-Speech Synthesis with Polynomial Models of F0.

Interspeech2022 Jason Fong, Daniel Lyth, Gustav Eje Henter, Hao Tang, Simon King 0001
Speech Audio Corrector: using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech.

Interspeech2022 Sébastien Le Maguer, Simon King 0001, Naomi Harte, 
Back to the Future: Extending the Blizzard Challenge 2013.

Interspeech2022 Johannah O'Mahony, Catherine Lai, Simon King 0001
Combining conversational speech with read speech to improve prosody in Text-to-Speech synthesis.

TASLP2021 Berrak Sisman, Junichi Yamagishi, Simon King 0001, Haizhou Li 0001, 
An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning.

Interspeech2021 Devang S. Ram Mohan, Qinmin Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King 0001
Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis.

Interspeech2021 Alexandra Torresquintero, Tian Huey Teh, Christopher G. R. Wallis, Marlene Staib, Devang S. Ram Mohan, Vivian Hu, Lorenzo Foglianti, Jiameng Gao, Simon King 0001
ADEPT: A Dataset for Evaluating Prosody Transfer.

Interspeech2021 Cassia Valentini-Botinhao, Simon King 0001
Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech.

TASLP2020 Xin Wang 0037, Shinji Takaki, Junichi Yamagishi, Simon King 0001, Keiichi Tokuda, 
A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis.

ICASSP2020 Ivan Himawan, Sandesh Aryal, Iris Ouyang, Sam Kang, Pierre Lanchantin, Simon King 0001
Speaker Adaptation of a Multilingual Acoustic Model for Cross-Language Synthesis.

Interspeech2020 Carol Chermaz, Simon King 0001
A Sound Engineering Approach to Near End Listening Enhancement.

Interspeech2020 Jason Fong, Jason Taylor, Simon King 0001
Testing the Limits of Representation Mixing for Pronunciation Correction in End-to-End Speech Synthesis.

Interspeech2020 Pilar Oplustil Gallegos, Jennifer Williams 0001, Joanna Rownicka, Simon King 0001
An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets.

Interspeech2020 Jacob J. Webber, Olivier Perrotin, Simon King 0001
Hider-Finder-Combiner: An Adversarial Architecture for General Speech Signal Modification.

ICASSP2019 Cheng-I Lai, Alberto Abad, Korin Richmond, Junichi Yamagishi, Najim Dehak, Simon King 0001
Attentive Filtering Networks for Audio Replay Attack Detection.

ICASSP2019 Oliver Watts, Cassia Valentini-Botinhao, Simon King 0001
Speech Waveform Reconstruction Using Convolutional Neural Networks with Noise and Periodic Inputs.

#149  | Peter Birkholz | DBLP Google Scholar  
By venue: Interspeech: 17, SpeechComm: 5, ICASSP: 5, TASLP: 3
By year: 2024: 2, 2023: 3, 2022: 12, 2021: 4, 2020: 6, 2019: 3
ISCA sessions: speech synthesis: 4; speech production: 3; speech processing & measurement: 2; show and tell: 2; miscellaneous topics in speech, voice and hearing disorders: 1; phonetics: 1; human speech & signal processing: 1; pathological speech assessment: 1; tonal aspects of acoustic phonetics and prosody: 1; language learning: 1
IEEE keywords: speech synthesis: 4; synthesizers: 2; articulatory synthesis: 2; natural language processing: 2; silent speech: 2; trajectory: 1; speech inversion: 1; convolutional recurrent neural networks: 1; data models: 1; copy synthesis: 1; long short term memory: 1; vocaltractlab (vtl): 1; speech recognition: 1; optimization: 1; automatic phoneme recognition: 1; vocal learning simulation: 1; visualization: 1; articulatory speech synthesis: 1; pattern clustering: 1; german phonology: 1; r allophones in german: 1; acoustic resonance: 1; transfer function measurement: 1; silicon: 1; vocal tract walls: 1; acoustic resonances: 1; voice activity detection: 1; data handling: 1; prosodic annotation: 1; carina: 1; speech data: 1; approximation theory: 1; pitch modeling: 1; intrinsic f0 variation: 1; co intrinsic f0 variation: 1; target approximation model: 1; ssi: 1; eos: 1; acoustic variables measurement: 1; electro optical stomatography: 1; speaker recognition: 1; intonation modeling: 1; support vector machines: 1; regression analysis: 1; mean square error methods: 1; articulation to speech synthesis: 1
Most publications (all venues) at 2022: 17, 2021: 10, 2020: 10, 2023: 7, 2018: 6

Affiliations
URLs

Recent publications

SpeechComm2024 Simon Stone, Peter Birkholz
Monophthong vocal tract shapes are sufficient for articulatory synthesis of German primary diphthongs.

TASLP2024 Yingming Gao, Peter Birkholz, Ya Li, 
Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks.

SpeechComm2023 Daniel R. van Niekerk, Anqi Xu, Branislav Gerazov, Paul Konstantin Krug, Peter Birkholz, Lorna F. Halliday, Santitham Prom-on, Yi Xu 0007, 
Simulating vocal learning of spoken language: Beyond imitation.

TASLP2023 Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu 0007, 
Artificial Vocal Learning Guided by Phoneme Recognition and Visual Information.

Interspeech2023 Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel R. van Niekerk, Anqi Xu, Yi Xu 0007, 
Self-Supervised Solution to the Control Problem of Articulatory Synthesis.

TASLP2022 Simon Stone, Yingming Gao, Peter Birkholz
Articulatory Synthesis of Vocalized /r/ Allophones in German.

ICASSP2022 Peter Birkholz, P. Häsner, Steffen Kürbis, 
Acoustic Comparison of Physical Vocal Tract Models with Hard and Soft Walls.

ICASSP2022 Hannes Kath, Simon Stone, Stefan Rapp, Peter Birkholz
Carina - A Corpus of Aligned German Read Speech Including Annotations.

Interspeech2022 Pouriya Amini Digehsara, João Vítor Possamai de Menezes, Christoph Wagner, Michael Bärhold, Petr Schaffer, Dirk Plettemeier, Peter Birkholz
A user-friendly headset for radar-based silent speech recognition.

Interspeech2022 Arne-Lukas Fietkau, Simon Stone, Peter Birkholz
Relationship between the acoustic time intervals and tongue movements of German diphthongs.

Interspeech2022 Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu 0007, 
Articulatory Synthesis for Data Augmentation in Phoneme Recognition.

Interspeech2022 Ingo Langheinrich, Simon Stone, Xinyu Zhang, Peter Birkholz
Glottal inverse filtering based on articulatory synthesis and deep learning.

Interspeech2022 Leon Liebig, Christoph Wagner, Alexander Mainka, Peter Birkholz
An investigation of regression-based prediction of the femininity or masculinity in speech of transgender people.

Interspeech2022 João Vítor Menezes, Pouriya Amini Digehsara, Christoph Wagner, Marco Mütze, Michael Bärhold, Petr Schaffer, Dirk Plettemeier, Peter Birkholz
Evaluation of different antenna types and positions in a stepped frequency continuous-wave radar-based silent speech interface.

Interspeech2022 Debasish Ray Mohapatra, Mario Fleischer, Victor Zappi, Peter Birkholz, Sidney S. Fels, 
Three-dimensional finite-difference time-domain acoustic analysis of simplified vocal tract shapes.

Interspeech2022 Daniel R. van Niekerk, Anqi Xu, Branislav Gerazov, Paul Konstantin Krug, Peter Birkholz, Yi Xu 0007, 
Exploration strategies for articulatory synthesis of complex syllable onsets.

Interspeech2022 Yi Xu 0007, Anqi Xu, Daniel R. van Niekerk, Branislav Gerazov, Peter Birkholz, Paul Konstantin Krug, Santitham Prom-on, Lorna F. Halliday, 
Evoc-Learn - High quality simulation of early vocal learning.

SpeechComm2021 Peter Birkholz, Susanne Drechsel, 
Effects of the piriform fossae, transvelar acoustic coupling, and laryngeal wall vibration on the naturalness of articulatory speech synthesis.

Interspeech2021 Rémi Blandin, Marc Arnela, Simon Félix, Jean-Baptiste Doc, Peter Birkholz
Comparison of the Finite Element Method, the Multimodal Method and the Transmission-Line Model for the Computation of Vocal Tract Transfer Functions.

Interspeech2021 Alexander Wilbrandt, Simon Stone, Peter Birkholz
Articulatory Data Recorder: A Framework for Real-Time Articulatory Data Recording.

#150  | Jesper Jensen 0001 | DBLP Google Scholar  
By venue: TASLP: 11, ICASSP: 11, Interspeech: 6, SpeechComm: 2
By year: 2024: 5, 2023: 4, 2022: 2, 2021: 5, 2020: 8, 2019: 6
ISCA sessions: speech synthesis: 2; speech enhancement and intelligibility: 1; noise reduction and intelligibility: 1; speech intelligibility: 1; speech recognition and beyond: 1
IEEE keywords: speech enhancement: 15; speech intelligibility: 6; speech recognition: 4; time frequency analysis: 3; noise measurement: 3; indexes: 3; noise reduction: 3; audio visual systems: 3; audio signal processing: 3; speech presence probability: 2; supervised learning: 2; robustness: 2; acoustic distortion: 2; convolution: 2; noise robustness: 2; performance evaluation: 2; keyword spotting: 2; hearing: 2; array signal processing: 2; multi task learning: 2; hearing aids: 2; deep neural networks: 2; maximum likelihood estimation: 2; filtering theory: 2; speech: 2; audio visual speech enhancement: 2; training data: 1; speech intelligibility prediction: 1; frequency estimation: 1; target speaker: 1; self supervised learning: 1; voice activity detection: 1; predictive coding: 1; bandwidth: 1; artificial neural networks: 1; speaker adaptation: 1; adaptation models: 1; bone conducted speech: 1; correlation: 1; bandwidth extension: 1; working environment noise: 1; generalization: 1; computational modeling: 1; diffusion models: 1; schedules: 1; databases: 1; image synthesis: 1; simulation: 1; hearing assistive devices: 1; interaural cues: 1; transformers: 1; complex convolutional neural networks: 1; signal processing algorithms: 1; binaural speech enhancement: 1; adaptive: 1; approximated speech intelligibility index: 1; near end listening enhancement: 1; optimization: 1; speech quality: 1; minimum processing: 1; energy consumption: 1; switches: 1; filterbank learning: 1; small footprint: 1; filter banks: 1; end to end: 1; maximum likelihood: 1; beamforming: 1; signal denoising: 1; turn taking: 1; speech behavior: 1; approximation theory: 1; asii: 1; speech intelligibility enhancement: 1; multi microphone: 1; beamformer: 1; regression analysis: 1; spectro temporal modulation: 1; speech quality model: 1; modulation: 1; keyword embedding: 1; text analysis: 1; computational complexity: 1; multi condition training: 1; loss function: 1; deep metric learning: 1; sensor fusion: 1; source separation: 1; sound source separation: 1; audio visual processing: 1; speech separation: 1; speech synthesis: 1; speech inpainting: 1; deep learning (artificial intelligence): 1; audio visual: 1; face landmarks: 1; mean square error methods: 1; time domain: 1; gradient methods: 1; fully convolutional neural networks: 1; objective intelligibility: 1; robust keyword spotting: 1; auditory system: 1; hearing assistive device: 1; constant q transform: 1; assistive devices: 1; external speaker: 1; microphones: 1; generalized cross correlation: 1; residual neural networks: 1; multichannel speech enhancement: 1; expectation maximisation algorithm: 1; probability: 1; kalman filter: 1; recursive expectation maximization: 1; own voice retrieval: 1; spectral analysis: 1; multi microphone speech enhancement: 1; power spectral density estimation: 1; monaural: 1; intelligibility: 1; prediction: 1; intrusive: 1; pattern classification: 1; human auditory system: 1; mutual information: 1; maximum likelihood classifier: 1; decoding: 1; gaussian mixture model: 1; vocabulary: 1; gaussian processes: 1; least mean squares methods: 1; minimum mean square error estimator: 1; correlation theory: 1; lombard effect: 1; visualization: 1; objective functions: 1; training targets: 1
Most publications (all venues) at 2019: 14, 2023: 12, 2017: 12, 2016: 12, 2024: 11

Affiliations
Aalborg University, Department of Electronic Systems, Denmark
Oticon A/S, Smørum, Denmark
Delft University of Technology, The Netherlands

Recent publications

TASLP2024 Mathias Bach Pedersen, Søren Holdt Jensen, Zheng-Hua Tan, Jesper Jensen 0001
Data-Driven Non-Intrusive Speech Intelligibility Prediction Using Speech Presence Probability.

ICASSP2024 Holger Severin Bovbjerg, Jesper Jensen 0001, Jan Østergaard, Zheng-Hua Tan, 
Self-Supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions.

ICASSP2024 Amin Edraki, Wai-Yip Chan, Jesper Jensen 0001, Daniel Fogerty, 
Speaker Adaptation For Enhancement Of Bone-Conducted Speech.

ICASSP2024 Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen 0001, Tommy Sonne Alstrøm, Tobias May, 
Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler.

ICASSP2024 Vikas Tokala, Eric Grinstein, Mike Brookes, Simon Doclo, Jesper Jensen 0001, Patrick A. Naylor, 
Binaural Speech Enhancement Using Deep Complex Convolutional Transformer Networks.

SpeechComm2023 Iván López-Espejo, Amin Edraki, Wai-Yip Chan, Zheng-Hua Tan, Jesper Jensen 0001
On the deficiency of intelligibility metrics as proxies for subjective intelligibility.

TASLP2023 Andreas Jonas Fuglsig, Jesper Jensen 0001, Zheng-Hua Tan, Lars Søndergaard Bertelsen, Jens Christian Lindof, Jan Østergaard, 
Minimum Processing Near-End Listening Enhancement.

ICASSP2023 Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen 0001, John H. L. Hansen, 
Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting.

Interspeech2023 Juan Felipe Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen 0001
Speech inpainting: Context-based speech synthesis guided by video.

TASLP2022 Poul Hoang, Jan Mark de Haan, Zheng-Hua Tan, Jesper Jensen 0001
Multichannel Speech Enhancement With Own Voice-Based Interfering Speech Suppression for Hearing Assistive Devices.

ICASSP2022 Andreas Jonas Fuglsig, Jan Østergaard, Jesper Jensen 0001, Lars Søndergaard Bertelsen, Peter Mariager, Zheng-Hua Tan, 
Joint Far- and Near-End Speech Intelligibility Enhancement Based on the Approximated Speech Intelligibility Index.

TASLP2021 Amin Edraki, Wai-Yip Chan, Jesper Jensen 0001, Daniel Fogerty, 
Speech Intelligibility Prediction Using Spectro-Temporal Modulation Analysis.

TASLP2021 Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen 0001
A Novel Loss Function and Training Strategy for Noise-Robust Keyword Spotting.

TASLP2021 Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu 0004, Meng Yu 0003, Dong Yu 0001, Jesper Jensen 0001
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation.

ICASSP2021 Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen 0001
Audio-Visual Speech Inpainting with Deep Learning.

Interspeech2021 Amin Edraki, Wai-Yip Chan, Jesper Jensen 0001, Daniel Fogerty, 
A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction.

SpeechComm2020 Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen 0001
Deep-learning-based audio-visual speech enhancement in presence of Lombard effect.

TASLP2020 Morten Kolbæk, Zheng-Hua Tan, Søren Holdt Jensen, Jesper Jensen 0001
On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement.

TASLP2020 Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen 0001
Improved External Speaker-Robust Keyword Spotting for Hearing Assistive Devices.

TASLP2020 Juan M. Martín-Doñas, Jesper Jensen 0001, Zheng-Hua Tan, Angel M. Gomez, Antonio M. Peinado, 
Online Multichannel Speech Enhancement Based on Recursive EM and DNN-Based Speech Presence Estimation.

#151  | Steve Renals | DBLP Google Scholar  
By venue: Interspeech: 16, ICASSP: 9, TASLP: 3, SpeechComm: 2
By year: 2023: 2, 2022: 2, 2021: 8, 2020: 9, 2019: 9
ISCA sessions: feature extraction and distant asr: 3; robust speaker recognition: 1; topics in asr: 1; embedding and network architecture for speaker recognition: 1; diverse modes of speech acquisition and processing: 1; neural network training methods for asr: 1; evaluation of speech technology systems and methods for resource construction and annotation: 1; asr model training and strategies: 1; asr neural network training: 1; medical applications and visual asr: 1; model training for asr: 1; feature extraction for asr: 1; spoken language processing for children’s speech: 1; asr neural network architectures: 1
IEEE keywords: speech recognition: 9; end to end: 3; acoustic modelling: 2; natural language processing: 2; error analysis: 1; confusion matrix: 1; phonetic error analysis: 1; hybrid: 1; broad phonetic classes: 1; timit: 1; phone recognition: 1; phonetics: 1; nose: 1; task analysis: 1; automatic speech recognition: 1; raw signal representation: 1; multi stream acoustic modelling: 1; fourier transforms: 1; fourier transform: 1; shape: 1; streaming media: 1; information filters: 1; regression analysis: 1; waveform based models: 1; data augmentation: 1; vicinal risk minimization: 1; out of distribution generalization: 1; sensor fusion: 1; phase based source filter separation: 1; multi head cnns: 1; asr: 1; raw phase spectrum: 1; general classifier: 1; recurrent neural nets: 1; language model: 1; top down training: 1; layer wise training: 1; domain adaptation: 1; multilingual speech recognition: 1; diarization: 1; deep neural network: 1; domain adversarial training: 1; adversarial learning: 1; speaker verification: 1; speaker recognition: 1; convolutional neural nets: 1; signal resolution: 1; low pass filters: 1; signal representation: 1; computer vision: 1; transfer learning: 1; robust speech recognition: 1; bottleneck features: 1; statistical normalisation: 1; deep neural networks: 1; probability density function: 1; speaker independent: 1; ultrasound tongue imaging: 1; ultrasound: 1; child speech: 1; speech therapy: 1; attention: 1; decoding: 1
Most publications (all venues) at 2013: 17, 2014: 16, 2020: 15, 2019: 15, 2017: 15


Recent publications

TASLP2023 Erfan Loweimi, Andrea Carmantini, Peter Bell 0001, Steve Renals, Zoran Cvetkovic, 
Phonetic Error Analysis Beyond Phone Error Rate.

TASLP2023 Erfan Loweimi, Zhengjun Yue, Peter Bell 0001, Steve Renals, Zoran Cvetkovic, 
Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform.

TASLP2022 Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu 0001, 
Towards Robust Waveform-Based Acoustic Models.

Interspeech2022 Chau Luu, Steve Renals, Peter Bell 0001, 
Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations.

SpeechComm2021 Manuel Sam Ribeiro, Joanne Cleland, Aciel Eshky, Korin Richmond, Steve Renals
Exploiting ultrasound tongue imaging for the automatic detection of speech articulation errors.

SpeechComm2021 Aciel Eshky, Joanne Cleland, Manuel Sam Ribeiro, Eleanor Sugden, Korin Richmond, Steve Renals
Automatic audiovisual synchronisation for ultrasound tongue imaging.

ICASSP2021 Erfan Loweimi, Zoran Cvetkovic, Peter Bell 0001, Steve Renals
Speech Acoustic Modelling from Raw Phase Spectrum.

ICASSP2021 Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, Erfan Loweimi, Peter Bell 0001, Steve Renals
Train Your Classifier First: Cascade Neural Networks Training from Upper Layers to Lower Layers.

Interspeech2021 Erfan Loweimi, Zoran Cvetkovic, Peter Bell 0001, Steve Renals
Speech Acoustic Modelling Using Raw Source and Filter Components.

Interspeech2021 Chau Luu, Peter Bell 0001, Steve Renals
Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization.

Interspeech2021 Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals
Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video.

Interspeech2021 Shucong Zhang, Erfan Loweimi, Peter Bell 0001, Steve Renals
Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models.

ICASSP2020 Alberto Abad, Peter Bell 0001, Andrea Carmantini, Steve Renals
Cross Lingual Transfer Learning for Zero-Resource Domain Adaptation.

ICASSP2020 Chau Luu, Peter Bell 0001, Steve Renals
Channel Adversarial Training for Speaker Verification and Diarization.

ICASSP2020 Joanna Rownicka, Peter Bell 0001, Steve Renals
Multi-Scale Octave Convolutions for Robust Speech Recognition.

ICASSP2020 Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, Steve Renals
Learning Noise Invariant Features Through Transfer Learning For Robust End-to-End Speech Recognition.

Interspeech2020 Ahmed Ali 0002, Steve Renals
Word Error Rate Estimation Without ASR Output: e-WER2.

Interspeech2020 Neethu M. Joy, Dino Oglic, Zoran Cvetkovic, Peter Bell 0001, Steve Renals
Deep Scattering Power Spectrum Features for Robust Speech Recognition.

Interspeech2020 Erfan Loweimi, Peter Bell 0001, Steve Renals
On the Robustness and Training Dynamics of Raw Waveform Models.

Interspeech2020 Erfan Loweimi, Peter Bell 0001, Steve Renals
Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling.

#152  | Herman Kamper | DBLP Google Scholar  
By venue: Interspeech: 18, ICASSP: 6, TASLP: 5, NAACL: 1
By year: 2024: 2, 2023: 5, 2022: 3, 2021: 6, 2020: 4, 2019: 8, 2018: 2
ISCA sessions: low-resource speech recognition: 3; speech recognition: 2; multimodal systems: 2; speech synthesis and voice conversion: 1; multi-modal systems: 1; low-resource asr development: 1; zero, low-resource and multi-modal speech recognition: 1; the zero resource speech challenge 2020: 1; topics in asr: 1; the zero resource speech challenge 2019: 1; feature extraction for asr: 1; corpus annotation and evaluation: 1; selected topics in neural speech processing: 1; topics in speech recognition: 1
IEEE keywords: speech recognition: 6; multimodal modelling: 4; natural language processing: 4; acoustic word embeddings: 3; zero resource speech processing: 3; query by example: 3; adaptation models: 2; speech synthesis: 2; task analysis: 2; word acquisition: 2; acoustic unit discovery: 2; multilingual models: 2; recurrent neural nets: 2; vocabulary: 2; signal classification: 2; low resource speech processing: 2; unsupervised learning: 2; visual grounding: 2; semantic retrieval: 2; speech disentanglement: 1; generative adversarial networks: 1; generators: 1; convolution: 1; unconditional speech synthesis: 1; few shot learning: 1; reviews: 1; visually grounded speech models: 1; visualization: 1; low resource language: 1; costs: 1; phone segmentation: 1; unsupervised word segmentation: 1; zero resource speech: 1; hidden markov models: 1; data models: 1; dynamic programming: 1; benchmark testing: 1; voice conversion: 1; self supervised learning: 1; indexterms: 1; linguistics: 1; speaker recognition: 1; image representation: 1; gaussian processes: 1; transfer learning: 1; computational linguistics: 1; supervised learning: 1; speech translation: 1; text analysis: 1; speech classification: 1; unwritten languages: 1; indexing: 1; signal reconstruction: 1; signal representation: 1; information retrieval: 1; speech retrieval: 1; keyword spotting: 1; cross modal matching: 1; convolutional neural nets: 1; one shot learning: 1; image resolution: 1; nearest neighbour methods: 1; decoding: 1; query processing: 1; speech search: 1
Most publications (all venues) at 2021: 14, 2020: 13, 2019: 10, 2023: 9, 2022: 9

Affiliations
URLs

Recent publications

TASLP2024 Matthew Baas, Herman Kamper
Disentanglement in a GAN for Unconditional Speech Synthesis.

TASLP2024 Leanne Nortje, Dan Oneata, Herman Kamper
Visually Grounded Few-Shot Word Learning in Low-Resource Settings.

TASLP2023 Herman Kamper
Word Segmentation on Discovered Phone Units With Dynamic Programming and Self-Supervised Scoring.

Interspeech2023 Matthew Baas, Benjamin van Niekerk, Herman Kamper
Voice Conversion With Just Nearest Neighbors.

Interspeech2023 Christiaan Jacobs, Nathanaël Carraz Rakotonirina, Everlyn Asiko Chimoto, Bruce A. Bassett, Herman Kamper
Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili.

Interspeech2023 Ruan van der Merwe, Herman Kamper
Mitigating Catastrophic Forgetting for Few-Shot Spoken Word Classification Through Meta-Learning.

Interspeech2023 Leanne Nortje, Benjamin van Niekerk, Herman Kamper
Visually grounded few-shot word acquisition with fewer shots.

ICASSP2022 Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Matthew Baas, Hugo Seuté, Herman Kamper
A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion.

Interspeech2022 Matthew Baas, Herman Kamper
Voice Conversion Can Improve ASR in Very Low-Resource Settings.

Interspeech2022 Werner van der Merwe, Herman Kamper, Johan Adam du Preez, 
A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery.

TASLP2021 Herman Kamper, Yevgen Matusevych, Sharon Goldwater, 
Improved Acoustic Word Embeddings for Zero-Resource Languages Using Multilingual Transfer.

Interspeech2021 Christiaan Jacobs, Herman Kamper
Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language.

Interspeech2021 Herman Kamper, Benjamin van Niekerk, 
Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks.

Interspeech2021 Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper
Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing.

Interspeech2021 Leanne Nortje, Herman Kamper
Direct Multimodal Few-Shot Learning of Speech and Images.

Interspeech2021 Kayode Olaleye, Herman Kamper
Attention-Based Keyword Localisation in Speech Using Visual Grounding.

ICASSP2020 Sameer Bansal, Herman Kamper, Adam Lopez, Sharon Goldwater, 
Cross-Lingual Topic Prediction For Speech Using Translations.

ICASSP2020 Herman Kamper, Yevgen Matusevych, Sharon Goldwater, 
Multilingual Acoustic Word Embedding Models for Processing Zero-resource Languages.

Interspeech2020 Benjamin van Niekerk, Leanne Nortje, Herman Kamper
Vector-Quantized Neural Networks for Acoustic Unit Discovery in the ZeroSpeech 2020 Challenge.

Interspeech2020 Leanne Nortje, Herman Kamper
Unsupervised vs. Transfer Learning for Multimodal One-Shot Matching of Speech and Images.

#153  | Tomoki Hayashi | DBLP Google Scholar  
By venue: ICASSP: 14, Interspeech: 13, TASLP: 3
By year: 2023: 1, 2022: 3, 2021: 8, 2020: 9, 2019: 6, 2018: 3
ISCA sessions: speech synthesis: 3; neural techniques for voice conversion and waveform generation: 2; acoustic event detection and acoustic scene classification: 1; speech enhancement, bandwidth extension and hearing aids: 1; voice conversion and adaptation: 1; the zero resource speech challenge 2020: 1; neural waveform generation: 1; sequence models for asr: 1; recurrent neural models for asr: 1; voice conversion and speech synthesis: 1
IEEE keywords: voice conversion: 7; speech synthesis: 6; sequence to sequence: 4; speech recognition: 4; transformer: 4; self supervised speech representation: 3; autoregressive processes: 3; recurrent neural nets: 3; convolutional neural nets: 3; vocoders: 3; end to end: 3; speech coding: 3; non autoregressive: 2; natural language processing: 2; streaming: 2; open source: 2; pitch dependent dilated convolution: 2; neural vocoder: 2; audio signal processing: 2; conformer: 2; speaker recognition: 2; open source software: 2; supervised learning: 2; unpaired data: 2; electrolaryngeal speech: 1; speech enhancement: 1; natural languages: 1; sequence to sequence voice conversion: 1; embedded systems: 1; robustness: 1; computers: 1; low latency speech enhancement: 1; self supervised learning: 1; computer based training: 1; pretraining: 1; parallel wavegan: 1; quasi periodic wavenet: 1; wavenet: 1; quasi periodic structure: 1; pitch controllability: 1; vocoder: 1; end to end speech processing: 1; convolution: 1; sequence to sequence modeling: 1; vq wav2vec: 1; any to one voice conversion: 1; signal representation: 1; vector quantized variational autoencoder: 1; nonparallel: 1; gaussian processes: 1; automatic speech recognition: 1; text to speech: 1; reproducibility of results: 1; biological system modeling: 1; speaker adaptation: 1; data models: 1; adaptation models: 1; end to end speech synthesis: 1; joint training of asr tts: 1; pipelines: 1; sound event detection: 1; weakly supervised learning: 1; self attention: 1; laplacian distribution: 1; prediction theory: 1; wavenet vocoder: 1; multiple samples output: 1; shallow model: 1; linear prediction: 1; signal detection: 1; voice activity detection: 1; ctc greedy search: 1; expert systems: 1; cycle consistency: 1; wavenet fine tuning: 1; oversmoothed parameters: 1; cyclic recurrent neural network: 1
Most publications (all venues) at 2021: 16, 2020: 15, 2019: 13, 2018: 10, 2022: 9

Affiliations
URLs

Recent publications

ICASSP2023 Kazuhiro Kobayashi, Tomoki Hayashi, Tomoki Toda, 
Low-Latency Electrolaryngeal Speech Enhancement Based on Fastspeech2-Based Voice Conversion and Self-Supervised Speech Representation.

ICASSP2022 Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, 
An Investigation of Streaming Non-Autoregressive sequence-to-sequence Voice Conversion.

ICASSP2022 Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe 0001, Tomoki Toda, 
S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations.

Interspeech2022 Jiatong Shi, Shuai Guo, Tao Qian, Tomoki Hayashi, Yuning Wu, Fangzheng Xu, Xuankai Chang, Huazhe Li, Peter Wu, Shinji Watanabe 0001, Qin Jin, 
Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis.

TASLP2021 Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda, 
Pretraining Techniques for Sequence-to-Sequence Voice Conversion.

TASLP2021 Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda, 
Quasi-Periodic Parallel WaveGAN: A Non-Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network.

TASLP2021 Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda, 
Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network.

ICASSP2021 Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi 0003, Shinji Watanabe 0001, Kun Wei, Wangyou Zhang, Yuekai Zhang, 
Recent Developments on Espnet Toolkit Boosted By Conformer.

ICASSP2021 Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda, 
Non-Autoregressive Sequence-To-Sequence Voice Conversion.

ICASSP2021 Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi
Any-to-One Sequence-to-Sequence Voice Conversion Using Self-Supervised Discrete Speech Representations.

ICASSP2021 Kazuhiro Kobayashi, Wen-Chin Huang, Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Tomoki Toda, 
Crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder.

Interspeech2021 Tatsuya Komatsu, Shinji Watanabe 0001, Koichi Miyazaki, Tomoki Hayashi
Acoustic Event Detection with Classifier Chains.

ICASSP2020 Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe 0001, Tomoki Toda, Kazuya Takeda, Yu Zhang 0033, Xu Tan 0003, 
Espnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit.

ICASSP2020 Katsuki Inoue, Sunao Hara, Masanobu Abe, Tomoki Hayashi, Ryuichi Yamamoto, Shinji Watanabe 0001, 
Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models.

ICASSP2020 Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe 0001, Tomoki Toda, Kazuya Takeda, 
Weakly-Supervised Sound Event Detection with Self-Attention.

ICASSP2020 Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, 
Efficient Shallow Wavenet Vocoder Using Multiple Samples Output Based on Laplacian Distribution and Linear Prediction.

ICASSP2020 Takenori Yoshimura, Tomoki Hayashi, Kazuya Takeda, Shinji Watanabe 0001, 
End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection.

Interspeech2020 Shu Hikosaka, Shogo Seki, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, Hideki Banno, Tomoki Toda, 
Intelligibility Enhancement Based on Speech Waveform Modification Using Hearing Impairment.

Interspeech2020 Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda, 
Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining.

Interspeech2020 Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, Tomoki Toda, 
Cyclic Spectral Modeling for Unsupervised Unit Discovery into Voice Conversion with Excitation and Waveform Modeling.

#154  | Ruoming Pang | DBLP Google Scholar  
By venue: ICASSP: 14, Interspeech: 13, ICLR: 2, NeurIPS: 1
By year: 2022: 5, 2021: 9, 2020: 9, 2019: 5, 2018: 2
ISCA sessions: asr neural network architectures: 3; streaming for asr/rnn transducers: 2; asr neural network architectures and training: 2; language modeling and lexical modeling for asr: 1; multi-, cross-lingual and other topics in asr: 1; speech synthesis: 1; streaming asr: 1; lm adaptation, lexical units and punctuation: 1; end-to-end speech recognition: 1
IEEE keywords: speech recognition: 11; recurrent neural nets: 7; speech coding: 4; text analysis: 2; decoding: 2; two pass asr: 2; end to end asr: 2; conformer: 2; rnnt: 2; long form asr: 2; natural language processing: 2; rnn t: 2; latency: 2; optimisation: 2; transducers: 1; degradation: 1; multilingual: 1; massive: 1; lifelong learning: 1; task analysis: 1; speaker recognition: 1; streaming asr: 1; model distillation: 1; non streaming asr: 1; cascaded encoders: 1; error analysis: 1; computational modeling: 1; computer architecture: 1; second pass asr: 1; dynamic sparse models: 1; model pruning: 1; asr: 1; regression analysis: 1; probability: 1; endpointer: 1; vocabulary: 1; supervised learning: 1; unsupervised learning: 1; sequence to sequence: 1; filtering theory: 1; semi supervised training: 1; mobile handsets: 1
Most publications (all venues) at 2021: 17, 2020: 13, 2019: 12, 2022: 8, 2024: 6

Affiliations
URLs

Recent publications

ICASSP2022 Ke Hu, Tara N. Sainath, Arun Narayanan, Ruoming Pang, Trevor Strohman, 
Transducer-Based Streaming Deliberation for Cascaded Encoders.

ICASSP2022 Bo Li 0028, Ruoming Pang, Yu Zhang 0033, Tara N. Sainath, Trevor Strohman, Parisa Haghani, Yun Zhu, Brian Farris, Neeraj Gaur, Manasa Prasad, 
Massively Multilingual ASR: A Lifelong Learning Solution.

ICASSP2022 Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Weiran Wang, David Qiu, Chung-Cheng Chiu, Rohit Prabhavalkar, Alexander Gruenstein, Anmol Gulati, Bo Li 0028, David Rybach, Emmanuel Guzman, Ian McGraw, James Qin, Krzysztof Choromanski, Qiao Liang 0001, Robert David, Ruoming Pang, Shuo-Yiin Chang, Trevor Strohman, W. Ronny Huang, Wei Han 0002, Yonghui Wu, Yu Zhang 0033, 
Improving The Latency And Quality Of Cascaded Encoders.

Interspeech2022 W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor D. Strohman, Shankar Kumar, 
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition.

Interspeech2022 Bo Li 0028, Tara N. Sainath, Ruoming Pang, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang 0001, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani, 
A Language Agnostic Multilingual Streaming On-Device ASR System.

ICASSP2021 Thibault Doutre, Wei Han 0002, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang 0033, Liangliang Cao, 
Improving Streaming Automatic Speech Recognition with Non-Streaming Model Distillation on Unsupervised Data.

ICASSP2021 Bo Li 0028, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han 0002, Qiao Liang 0001, Yu Zhang 0033, Trevor Strohman, Yonghui Wu, 
A Better and Faster end-to-end Model for Streaming ASR.

ICASSP2021 Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman, 
Cascaded Encoders for Unifying Streaming and Non-Streaming ASR.

ICASSP2021 Zhaofeng Wu, Ding Zhao, Qiao Liang 0001, Jiahui Yu, Anmol Gulati, Ruoming Pang
Dynamic Sparsity Neural Networks for Automatic Speech Recognition.

ICASSP2021 Jiahui Yu, Chung-Cheng Chiu, Bo Li 0028, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, Arun Narayanan, Wei Han 0002, Anmol Gulati, Yonghui Wu, Ruoming Pang
FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization.

Interspeech2021 Thibault Doutre, Wei Han 0002, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao, 
Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models.

Interspeech2021 Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li 0028, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li 0133, Qiao Liang 0001, Pat Rondon, 
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling.

Interspeech2021 Andros Tjandra, Ruoming Pang, Yu Zhang 0033, Shigeki Karita, 
Unsupervised Learning of Disentangled Speech Content and Style Representation.

ICLR2021 Jiahui Yu, Wei Han 0002, Anmol Gulati, Chung-Cheng Chiu, Bo Li 0028, Tara N. Sainath, Yonghui Wu, Ruoming Pang
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling.

ICASSP2020 Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar, 
Deliberation Model Based Two-Pass End-To-End Speech Recognition.

ICASSP2020 Bo Li 0028, Shuo-Yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, Yonghui Wu, 
Towards Fast and Accurate Streaming End-To-End ASR.

ICASSP2020 Tara N. Sainath, Yanzhang He, Bo Li 0028, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-Yiin Chang, Wei Li 0133, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alexander Gruenstein, Ke Hu, Anjuli Kannan, Qiao Liang 0001, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirkó Visontai, Yonghui Wu, Yu Zhang 0033, Ding Zhao, 
A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency.

ICASSP2020 Tara N. Sainath, Ruoming Pang, Ron J. Weiss, Yanzhang He, Chung-Cheng Chiu, Trevor Strohman, 
An Attention-Based Joint Acoustic and Text on-Device End-To-End Model.

Interspeech2020 Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang 0033, Jiahui Yu, Wei Han 0002, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang
Conformer: Convolution-augmented Transformer for Speech Recognition.

Interspeech2020 Wei Han 0002, Zhengdong Zhang, Yu Zhang 0033, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu, 
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context.

#155  | Atsunori Ogawa | DBLP Google Scholar  
By venue: Interspeech: 16, ICASSP: 13
By year: 2024: 1, 2023: 8, 2022: 3, 2021: 2, 2020: 4, 2019: 9, 2018: 2
ISCA sessions: speech coding and enhancement: 1; multi-talker methods in speech processing: 1; paralinguistics: 1; spoken language understanding, summarization, and information retrieval: 1; speech recognition: 1; novel models and training methods for asr: 1; speech enhancement and intelligibility: 1; noise reduction and intelligibility: 1; lm adaptation, lexical units and punctuation: 1; speech enhancement: 1; asr for noisy and far-field speech: 1; asr neural network architectures: 1; speech and audio source separation and scene analysis: 1; neural networks for language modeling: 1; adjusting to speaker, accent, and domain: 1; end-to-end speech recognition: 1
IEEE keywords: speech recognition: 5; natural language processing: 5; automatic speech recognition: 4; speech summarization: 2; encoding: 2; speech translation: 2; text analysis: 2; speaker recognition: 2; long form asr: 1; transformers: 1; complexity theory: 1; data models: 1; adaptation models: 1; self supervised learning: 1; video on demand: 1; computational modeling: 1; computational efficiency: 1; memory management: 1; end to end modeling: 1; memory efficient encoders: 1; dual speech/text encoder: 1; long spoken document: 1; task analysis: 1; training data: 1; end to end speech summarization: 1; measurement: 1; synthetic data augmentation: 1; how2 dataset: 1; multi modal data augmentation: 1; iterative methods: 1; forward language model: 1; end to end speech recognition: 1; iterative decoding: 1; partial sentence aware backward language model: 1; iterative shallow fusion: 1; symbols: 1; shallow fusion: 1; sensor fusion: 1; language translation: 1; attention fusion: 1; rover: 1; large ensemble: 1; error analysis: 1; complementary neural language models: 1; iterative lattice generation: 1; lattice rescoring: 1; context carry over: 1; lattices: 1; end to end (e2e) speech recognition: 1; estimation theory: 1; recurrent neural nets: 1; bidirectional long short term memory (blstm): 1; imbalanced datasets: 1; confidence estimation: 1; auxiliary features: 1; cluster voting: 1; i vectors: 1; deep neural network: 1; speaker clustering: 1; age and gender estimation: 1; speaker embedding: 1; adversarial learning: 1; deep neural networks: 1; phoneme invariant feature: 1; text independent speaker recognition: 1; signal classification: 1; domain adaptation: 1; topic model: 1; recurrent neural network language model: 1; sequence summary network: 1; semi supervised learning: 1; decoding: 1; encoder decoder: 1; speech synthesis: 1; autoencoder: 1; speaker attention: 1; speech enhancement: 1; blind source separation: 1; neural network: 1; speech separation/extraction: 1; integer programming: 1; compressive speech summarization: 1; maximum coverage of content words: 1; oracle (upper bound) performance: 1; integer linear programming (ilp): 1; linear programming: 1
Most publications (all venues) at 2023: 16, 2017: 15, 2019: 10, 2018: 9, 2013: 9

Affiliations
URLs

Recent publications

ICASSP2024 William Chen, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe 0001, 
Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing.

ICASSP2023 Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Roshan S. Sharma, Kohei Matsuura, Shinji Watanabe 0001, 
Speech Summarization of Long Spoken Document: Improving Memory Efficiency of Speech/Text Encoders.

ICASSP2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura, 
Leveraging Large Text Corpora For End-To-End Speech Summarization.

ICASSP2023 Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix, 
Iterative Shallow Fusion of Backward Language Model for End-To-End Speech Recognition.

Interspeech2023 Shoko Araki, Ayako Yamamoto, Tsubasa Ochiai, Kenichi Arai, Atsunori Ogawa, Tomohiro Nakatani, Toshio Irino, 
Impact of Residual Noise and Artifacts in Speech Enhancement Errors on Intelligibility of Human and Machine.

Interspeech2023 Marc Delcroix, Naohiro Tawara, Mireia Díez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukás Burget, Shoko Araki, 
Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization.

Interspeech2023 Yuki Kitagishi, Naohiro Tawara, Atsunori Ogawa, Ryo Masumura, Taichi Asami, 
What are differences? Comparing DNN and Human by Their Performance and Characteristics in Speaker Age Estimation.

Interspeech2023 Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, 
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization.

Interspeech2023 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami, 
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data.

ICASSP2022 Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe 0001, 
Integrating Multiple ASR Systems into NLP Backend with Attention Fusion.

ICASSP2022 Atsunori Ogawa, Naohiro Tawara, Marc Delcroix, Shoko Araki, 
Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models.

Interspeech2022 Koharu Horii, Meiko Fukuda, Kengo Ohta, Ryota Nishimura, Atsunori Ogawa, Norihide Kitaoka, 
End-to-End Spontaneous Speech Recognition Using Disfluency Labeling.

ICASSP2021 Atsunori Ogawa, Naohiro Tawara, Takatomo Kano, Marc Delcroix, 
BLSTM-Based Confidence Estimation for End-to-End Speech Recognition.

Interspeech2021 Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, 
Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility.

ICASSP2020 Naohiro Tawara, Hosana Kamiyama, Satoshi Kobashikawa, Atsunori Ogawa
Improving Speaker-Attribute Estimation by Voting Based on Speaker Cluster Information.

ICASSP2020 Naohiro Tawara, Atsunori Ogawa, Tomoharu Iwata, Marc Delcroix, Tetsuji Ogawa, 
Frame-Level Phoneme-Invariant Speaker Embedding for Text-Independent Speaker Recognition on Extremely Short Utterances.

Interspeech2020 Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Toshio Irino, 
Predicting Intelligibility of Enhanced Speech Using Posteriors Derived from DNN-Based ASR System.

Interspeech2020 Atsunori Ogawa, Naohiro Tawara, Marc Delcroix, 
Language Model Data Augmentation Based on Text Domain Transfer.

ICASSP2019 Michael Hentschel, Marc Delcroix, Atsunori Ogawa, Tomoharu Iwata, Tomohiro Nakatani, 
A Unified Framework for Feature-based Domain Adaptation of Neural Network Language Models.

ICASSP2019 Shigeki Karita, Shinji Watanabe 0001, Tomoharu Iwata, Marc Delcroix, Atsunori Ogawa, Tomohiro Nakatani, 
Semi-supervised End-to-end Speech Recognition Using Text-to-speech and Autoencoders.

#156  | Petr Motlícek | DBLP Google Scholar  
By venueInterspeech: 20ICASSP: 9
By year2024: 42023: 42022: 12021: 92020: 42019: 32018: 4
ISCA sessionsautomatic speech recognition in air traffic management: 3speech emotion recognition: 1novel transformer models for asr: 1search methods and decoding algorithms for asr: 1show and tell: 1embedding and network architecture for speaker recognition: 1openasr20 and low resource asr development: 1voice activity detection: 1assessment of pathological speech and language: 1multilingual and code-switched asr: 1learning techniques for speaker recognition: 1applications of asr: 1model adaptation for asr: 1cross-lingual and multilingual asr: 1speaker verification using neural network methods: 1deep learning for source separation and pitch tracking: 1speaker verification: 1multimodal systems: 1
IEEE keywordsspeech recognition: 5task analysis: 3automatic speech recognition: 2air traffic control: 2transformers: 2human computer interaction: 2finite state transducers: 2natural language processing: 2training data: 1domain adaptation: 1boosting: 1contextual biasing: 1atmospheric modeling: 1decoding: 1signal processing algorithms: 1rare word recognition: 1speaker change detection: 1degradation: 1multitask learning: 1f1 score: 1measurement: 1speaker turn detection: 1nist: 1out of domain: 1fine tuning: 1delay effects: 1wav2vec2: 1linguistics: 1language identification: 1low resource: 1word confusion networks: 1intent recognition: 1performance evaluation: 1encoding: 1self supervised learning: 1knowledge distillation: 1intent classification: 1cross modal alignment: 1word consensus networks: 1cross modal attention: 1spoken language understanding: 1benchmark testing: 1annotations: 1manuals: 1pipelines: 1risk analysis: 1video surveillance: 1air traffic: 1air surveillance data: 1callsign detection: 1data handling: 1aerospace computing: 1aircraft communication: 1software reliability: 1speech dataset: 1oov word recognition: 1multi genre speech recognition: 1semi supervised learning: 1incremental training: 1sensor fusion: 1bayesian fusion: 1speaker recognition: 1probability: 1inter task fusion: 1
Most publications (all venues) at 2022: 17, 2021: 17, 2020: 14, 2019: 12, 2012: 11

Affiliations
URLs

Recent publications

ICASSP2024 Mrinmoy Bhattacharjee, Iuliia Nigmatulina, Amrutha Prasad, Pradeep Rangappa, Srikanth R. Madikeri, Petr Motlícek, Hartmut Helmke, Matthias Kleinert, 
Contextual Biasing Methods for Improving Rare Word Detection in Automatic Speech Recognition.

ICASSP2024 Shashi Kumar, Srikanth R. Madikeri, Iuliia Nigmatulina, Esaú Villatoro-Tello, Petr Motlícek, Karthik Pandia, S. Pavankumar Dubagunta, Aravind Ganapathiraju, 
Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers.

ICASSP2024 Amrutha Prasad, Andrés Carofilis, Geoffroy Vanderreydt, Driss Khalil, Srikanth R. Madikeri, Petr Motlícek, Christof Schüpbach, 
Fine-Tuning Self-Supervised Models for Language Identification Using Orthonormal Constraint.

ICASSP2024 Esaú Villatoro-Tello, Srikanth R. Madikeri, Bidisha Sharma, Driss Khalil, Shashi Kumar, Iuliia Nigmatulina, Petr Motlícek, Aravind Ganapathiraju, 
Probability-Aware Word-Confusion-Network-To-Text Alignment Approach for Intent Classification.

ICASSP2023 Esaú Villatoro-Tello, Srikanth R. Madikeri, Juan Zuluaga-Gomez, Bidisha Sharma, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Petr Motlícek, Alexei V. Ivanov, Aravind Ganapathiraju, 
Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks.

Interspeech2023 Sergio Burdisso, Esaú Villatoro-Tello, Srikanth R. Madikeri, Petr Motlícek
Node-weighted Graph Convolutional Network for Depression Detection in Transcribed Clinical Interviews.

Interspeech2023 Florian Mai, Juan Zuluaga-Gomez, Titouan Parcollet, Petr Motlícek
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition.

Interspeech2023 Iuliia Nigmatulina, Srikanth R. Madikeri, Esaú Villatoro-Tello, Petr Motlícek, Juan Zuluaga-Gomez, Karthik Pandia, Aravind Ganapathiraju, 
Implementing Contextual Biasing in GPU Decoder for Online ASR.

ICASSP2022 Iuliia Nigmatulina, Juan Zuluaga-Gomez, Amrutha Prasad, Seyyed Saeed Sarfjoo, Petr Motlícek
A Two-Step Approach to Leverage Contextual Data: Speech Recognition in Air-Traffic Communications.

ICASSP2021 Rudolf A. Braun, Srikanth R. Madikeri, Petr Motlícek
A Comparison of Methods for OOV-Word Recognition on a New Public Dataset.

Interspeech2021 Maël Fabien, Shantipriya Parida, Petr Motlícek, Dawei Zhu, Aravind Krishnan, Hoang H. Nguyen, 
ROXANNE Research Platform: Automate Criminal Investigations.

Interspeech2021 Weipeng He, Petr Motlícek, Jean-Marc Odobez, 
Multi-Task Neural Network for Robust Multiple Speaker Embedding Extraction.

Interspeech2021 Martin Kocour, Karel Veselý, Alexander Blatt, Juan Zuluaga-Gomez, Igor Szöke, Jan Cernocký, Dietrich Klakow, Petr Motlícek
Boosting of Contextual Information in ASR for Air-Traffic Call-Sign Recognition.

Interspeech2021 Srikanth R. Madikeri, Petr Motlícek, Hervé Bourlard, 
Multitask Adaptation with Lattice-Free MMI for Multi-Genre Speech Recognition of Low Resource Languages.

Interspeech2021 Oliver Ohneiser, Seyyed Saeed Sarfjoo, Hartmut Helmke, Shruthi Shetty, Petr Motlícek, Matthias Kleinert, Heiko Ehr, Sarunas Murauskas, 
Robust Command Recognition for Lithuanian Air Traffic Control Tower Utterances.

Interspeech2021 Seyyed Saeed Sarfjoo, Srikanth R. Madikeri, Petr Motlícek
Speech Activity Detection Based on Multilingual Speech Recognition System.

Interspeech2021 Esaú Villatoro-Tello, S. Pavankumar Dubagunta, Julian Fritsch, Gabriela Ramírez-de-la-Rosa, Petr Motlícek, Mathew Magimai-Doss, 
Late Fusion of the Available Lexicon and Raw Waveform-Based Acoustic Modeling for Depression and Dementia Recognition.

Interspeech2021 Juan Zuluaga-Gomez, Iuliia Nigmatulina, Amrutha Prasad, Petr Motlícek, Karel Veselý, Martin Kocour, Igor Szöke, 
Contextual Semi-Supervised Learning: An Approach to Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems.

ICASSP2020 Banriskhem K. Khonglah, Srikanth R. Madikeri, Subhadeep Dey, Hervé Bourlard, Petr Motlícek, Jayadev Billa, 
Incremental Semi-Supervised Learning for Multi-Genre Speech Recognition.

Interspeech2020 Srikanth R. Madikeri, Banriskhem K. Khonglah, Sibo Tong, Petr Motlícek, Hervé Bourlard, Daniel Povey, 
Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems.

#157  | Hervé Bourlard | DBLP Google Scholar  
By venue: Interspeech: 13, ICASSP: 11, TASLP: 3, SpeechComm: 2
By year: 2022: 2, 2021: 6, 2020: 8, 2019: 8, 2018: 5
ISCA sessionsatypical speech analysis and detection: 1novel models and training methods for asr: 1openasr20 and low resource asr development: 1neural network training methods and architectures for asr: 1speech in health: 1multilingual and code-switched asr: 1speech and language analytics for medical applications: 1model training for asr: 1extracting information from audio: 1plenary talk: 1dereverberation: 1spoken term detection: 1adjusting to speaker, accent, and domain: 1
IEEE keywordsspeech recognition: 5medical disorders: 4medical signal processing: 4speech intelligibility: 3diseases: 3hidden markov models: 3parkinson’s disease: 2amyotrophic lateral sclerosis: 2dysarthria: 2cerebral palsy: 2neurophysiology: 2speech: 2svm: 2weibull distribution: 2dtw: 2speech synthesis: 2patient diagnosis: 2expectation maximisation algorithm: 1poisson distribution: 1non negative matrix factorization: 1speech dereverberation: 1matrix decomposition: 1variational au toencoders: 1reverberation: 1monte carlo methods: 1matrix algebra: 1convolutional neural nets: 1pairwise distance: 1convolutional neural network: 1perceptual classification: 1tools: 1apraxia of speech: 1databases: 1hierarchical classification: 1support vector machine classification: 1support vector machine: 1cross lingual adaptation: 1automatic speech recognition: 1natural language processing: 1lfmmi: 1self supervised pretraining: 1supervised learning: 1regression analysis: 1spectral subspace: 1spectral modulation: 1hearing impairment: 1speaker recognition: 1svd: 1support vector machines: 1patient treatment: 1entropy: 1medical signal detection: 1non parametric sparsity: 1parametric sparsity: 1parkinson's disease: 1medical diagnostic computing: 1neurons: 1spoken term detection: 1bottleneck features: 1subsequence detection: 1deep neural network: 1query by example: 1end to end: 1task analysis: 1p estoi: 1tts: 1multi genre speech recognition: 1semi supervised learning: 1incremental training: 1speech enhancement: 1estoi: 1stoi: 1pathological speech intelligibility: 1statistical distributions: 1cepstral analysis: 1super gaussianity: 1signal classification: 1stability analysis: 1digital iir filters: 1mathematical model: 1prosody modelling: 1fujisaki model: 1explosions: 1muscles: 1multilingual asr: 1ctc: 1language adaptive training: 1end to end lf mmi: 1dropout uncertainty: 1word confidence: 1wer estimation: 1error localization: 1decoding: 1estimation: 1predictive models: 1uncertainty: 1
Most publications (all venues) at 2014: 20, 2004: 17, 2002: 16, 2012: 15, 2011: 15

Affiliations
URLs

Recent publications

Interspeech2022 Cécile Fougeron, Nicolas Audibert, Ina Kodrasi, Parvaneh Janbakhshi, Michaela Pernon, Nathalie Lévêque, Stephanie Borel, Marina Laganaro, Hervé Bourlard, Frédéric Assal, 
Comparison of 5 methods for the evaluation of intelligibility in mild to moderate French dysarthric speech.

Interspeech2022 Selen Hande Kabil, Hervé Bourlard
From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition.

ICASSP2021 Deepak Baby, Hervé Bourlard
Speech Dereverberation Using Variational Autoencoders.

ICASSP2021 Parvaneh Janbakhshi, Ina Kodrasi, Hervé Bourlard
Automatic Dysarthric Speech Detection Exploiting Pairwise Distance-Based Convolutional Neural Networks.

ICASSP2021 Ina Kodrasi, Michaela Pernon, Marina Laganaro, Hervé Bourlard
Automatic And Perceptual Discrimination Between Dysarthria, Apraxia of Speech, and Neurotypical Speech.

ICASSP2021 Apoorv Vyas, Srikanth R. Madikeri, Hervé Bourlard
Lattice-Free Mmi Adaptation of Self-Supervised Pretrained Acoustic Models.

Interspeech2021 Srikanth R. Madikeri, Petr Motlícek, Hervé Bourlard
Multitask Adaptation with Lattice-Free MMI for Multi-Genre Speech Recognition of Low Resource Languages.

Interspeech2021 Apoorv Vyas, Srikanth R. Madikeri, Hervé Bourlard
Comparing CTC and LFMMI for Out-of-Domain Adaptation of wav2vec 2.0 Acoustic Model.

SpeechComm2020 Pranay Dighe, Afsaneh Asaei, Hervé Bourlard
On quantifying the quality of acoustic models in hybrid DNN-HMM ASR.

TASLP2020 Parvaneh Janbakhshi, Ina Kodrasi, Hervé Bourlard
Automatic Pathological Speech Intelligibility Assessment Exploiting Subspace-Based Analyses.

TASLP2020 Ina Kodrasi, Hervé Bourlard
Spectro-Temporal Sparsity Characterization for Dysarthric Speech Detection.

TASLP2020 Dhananjay Ram, Lesly Miculicich, Hervé Bourlard
Neural Network Based End-to-End Query by Example Spoken Term Detection.

ICASSP2020 Parvaneh Janbakhshi, Ina Kodrasi, Hervé Bourlard
Synthetic Speech References for Automatic Pathological Speech Intelligibility Assessment.

ICASSP2020 Banriskhem K. Khonglah, Srikanth R. Madikeri, Subhadeep Dey, Hervé Bourlard, Petr Motlícek, Jayadev Billa, 
Incremental Semi-Supervised Learning for Multi-Genre Speech Recognition.

Interspeech2020 Ina Kodrasi, Michaela Pernon, Marina Laganaro, Hervé Bourlard
Automatic Discrimination of Apraxia of Speech and Dysarthria Using a Minimalistic Set of Handcrafted Features.

Interspeech2020 Srikanth R. Madikeri, Banriskhem K. Khonglah, Sibo Tong, Petr Motlícek, Hervé Bourlard, Daniel Povey, 
Lattice-Free Maximum Mutual Information Training of Multilingual Speech Recognition Systems.

SpeechComm2019 Pranay Dighe, Afsaneh Asaei, Hervé Bourlard
Low-rank and sparse subspace modeling of speech for DNN based acoustic modeling.

ICASSP2019 Parvaneh Janbakhshi, Ina Kodrasi, Hervé Bourlard
Pathological Speech Intelligibility Assessment Based on the Short-time Objective Intelligibility Measure.

ICASSP2019 Ina Kodrasi, Hervé Bourlard
Super-gaussianity of Speech Spectral Coefficients as a Potential Biomarker for Dysarthric Speech Detection.

ICASSP2019 François Marelli, Bastian Schnell, Hervé Bourlard, Thierry Dutoit, Philip N. Garner, 
An End-to-end Network to Synthesize Intonation Using a Generalized Command Response Model.

#158  | Zheng-Hua Tan | DBLP Google Scholar  
By venue: TASLP: 11, ICASSP: 10, Interspeech: 5, SpeechComm: 2, ICLR: 1
By year: 2024: 4, 2023: 5, 2022: 4, 2021: 3, 2020: 7, 2019: 5, 2018: 1
ISCA sessionsspeech synthesis: 2speech segmentation: 1speech recognition and beyond: 1language identification: 1
IEEE keywordsspeech enhancement: 11speech intelligibility: 4noise measurement: 3indexes: 3speech recognition: 3audio visual systems: 3audio signal processing: 3time frequency analysis: 2speech presence probability: 2noise reduction: 2supervised learning: 2robustness: 2visualization: 2noise robustness: 2performance evaluation: 2keyword spotting: 2array signal processing: 2speaker recognition: 2multi task learning: 2hearing aids: 2deep neural networks: 2maximum likelihood estimation: 2filtering theory: 2audio visual speech enhancement: 2training data: 1speech intelligibility prediction: 1frequency estimation: 1target speaker: 1self supervised learning: 1voice activity detection: 1predictive coding: 1working environment noise: 1generalization: 1computational modeling: 1diffusion models: 1schedules: 1databases: 1image synthesis: 1adaptive: 1approximated speech intelligibility index: 1near end listening enhancement: 1optimization: 1acoustic distortion: 1speech quality: 1minimum processing: 1cross modal task: 1contrastive learning: 1decoding: 1audio captioning: 1caption consistency regularization: 1semantics: 1task analysis: 1energy consumption: 1switches: 1filterbank learning: 1small footprint: 1filter banks: 1end to end: 1maximum likelihood: 1beamforming: 1signal denoising: 1turn taking: 1speech behavior: 1hearing: 1approximation theory: 1asii: 1speech intelligibility enhancement: 1multi microphone: 1beamformer: 1microphone arrays: 1multi speaker asr: 1meeting transcription: 1natural language processing: 1alimeeting: 1m2met: 1speaker diarization: 1keyword embedding: 1text analysis: 1computational complexity: 1multi condition training: 1loss function: 1deep metric learning: 1sensor fusion: 1source separation: 1sound source separation: 1audio visual processing: 1speech separation: 1speech synthesis: 1speech inpainting: 1deep learning (artificial intelligence): 1audio visual: 1face landmarks: 1mean square error methods: 1time domain: 1gradient methods: 1fully convolutional neural networks: 1objective intelligibility: 1robust keyword spotting: 1auditory system: 1hearing assistive device: 1constant q transform: 1assistive devices: 1external speaker: 1microphones: 1generalized cross correlation: 1residual neural networks: 1multichannel speech enhancement: 1expectation maximisation algorithm: 1probability: 1kalman filter: 1recursive expectation maximization: 1own voice retrieval: 1spectral analysis: 1multi microphone speech enhancement: 1power spectral density estimation: 1cepstral feature: 1convolutional neural nets: 1security of data: 1convo lutional neural network: 1adversarial attack: 1web sites: 1signal classification: 1least mean squares methods: 1minimum mean square error estimator: 1correlation theory: 1pattern clustering: 1brain: 1dnns: 1bottleneck feature: 1cepstral analysis: 1time contrastive learning: 1speaker verification: 1image segmentation: 1gmm ubm: 1gaussian processes: 1lombard effect: 1convolution: 1speech: 1objective functions: 1training targets: 1
Most publications (all venues) at 2016: 25, 2018: 22, 2021: 21, 2022: 18, 2017: 18

Affiliations
URLs

Recent publications

TASLP2024 Mathias Bach Pedersen, Søren Holdt Jensen, Zheng-Hua Tan, Jesper Jensen 0001, 
Data-Driven Non-Intrusive Speech Intelligibility Prediction Using Speech Presence Probability.

ICASSP2024 Holger Severin Bovbjerg, Jesper Jensen 0001, Jan Østergaard, Zheng-Hua Tan
Self-Supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions.

ICASSP2024 Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen 0001, Tommy Sonne Alstrøm, Tobias May, 
Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler.

ICLR2024 Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen, Zheng-Hua Tan
Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners.

SpeechComm2023 Iván López-Espejo, Amin Edraki, Wai-Yip Chan, Zheng-Hua Tan, Jesper Jensen 0001, 
On the deficiency of intelligibility metrics as proxies for subjective intelligibility.

TASLP2023 Andreas Jonas Fuglsig, Jesper Jensen 0001, Zheng-Hua Tan, Lars Søndergaard Bertelsen, Jens Christian Lindof, Jan Østergaard, 
Minimum Processing Near-End Listening Enhancement.

TASLP2023 Yiming Zhang, Hong Yu 0006, Ruoyi Du, Zheng-Hua Tan, Wenwu Wang 0001, Zhanyu Ma, Yuan Dong, 
ACTUAL: Audio Captioning With Caption Feature Space Regularization.

ICASSP2023 Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen 0001, John H. L. Hansen, 
Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting.

Interspeech2023 Juan Felipe Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen 0001, 
Speech inpainting: Context-based speech synthesis guided by video.

TASLP2022 Poul Hoang, Jan Mark de Haan, Zheng-Hua Tan, Jesper Jensen 0001, 
Multichannel Speech Enhancement With Own Voice-Based Interfering Speech Suppression for Hearing Assistive Devices.

ICASSP2022 Andreas Jonas Fuglsig, Jan Østergaard, Jesper Jensen 0001, Lars Søndergaard Bertelsen, Peter Mariager, Zheng-Hua Tan
Joint Far- and Near-End Speech Intelligibility Enhancement Based on the Approximated Speech Intelligibility Index.

ICASSP2022 Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie 0001, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma 0001, Xin Xu, Hui Bu, 
Summary on the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge.

Interspeech2022 Claus M. Larsen, Peter Koch 0001, Zheng-Hua Tan
Adversarial Multi-Task Deep Learning for Noise-Robust Voice Activity Detection with Low Algorithmic Delay.

TASLP2021 Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen 0001, 
A Novel Loss Function and Training Strategy for Noise-Robust Keyword Spotting.

TASLP2021 Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu 0004, Meng Yu 0003, Dong Yu 0001, Jesper Jensen 0001, 
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation.

ICASSP2021 Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen 0001, 
Audio-Visual Speech Inpainting with Deep Learning.

SpeechComm2020 Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen 0001, 
Deep-learning-based audio-visual speech enhancement in presence of Lombard effect.

TASLP2020 Morten Kolbæk, Zheng-Hua Tan, Søren Holdt Jensen, Jesper Jensen 0001, 
On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement.

TASLP2020 Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen 0001, 
Improved External Speaker-Robust Keyword Spotting for Hearing Assistive Devices.

TASLP2020 Juan M. Martín-Doñas, Jesper Jensen 0001, Zheng-Hua Tan, Angel M. Gomez, Antonio M. Peinado, 
Online Multichannel Speech Enhancement Based on Recursive EM and DNN-Based Speech Presence Estimation.

#159  | Jen-Tzung Chien | DBLP Google Scholar  
By venue: ICASSP: 14, Interspeech: 12, TASLP: 3
By year: 2024: 5, 2023: 6, 2022: 3, 2021: 4, 2020: 5, 2019: 6
ISCA sessionsspeaker recognition: 2spoken language understanding, summarization, and information retrieval: 1show and tell: 1speech synthesis: 1spoken language processing: 1novel neural network architectures for asr: 1spoken dialogue systems: 1neural networks for language modeling: 1spoken dialogue system: 1dialogue speech understanding: 1semantic analysis and classification: 1
IEEE keywordsnatural language processing: 6variational autoencoder: 5speaker verification: 5domain adaptation: 4contrastive learning: 3self supervised learning: 3task analysis: 3recurrent neural nets: 3disentangled representation learning: 2speaker embedding: 2linguistics: 2speech recognition: 2adaptation models: 2data augmentation: 2pipelines: 2policy optimization: 2reinforcement learning: 2natural language understanding: 2adversarial training: 2data mining: 2recurrent neural network: 2mutual information: 2gaussian distribution: 2speaker recognition: 2stochastic processes: 2electronic mail: 1labeling: 1faces: 1error analysis: 1switches: 1code switching: 1bilingual speech recognition: 1attention guidance: 1speech coding: 1parameter efficiency: 1representation learning: 1hard negative pairs: 1reverberation: 1weighted contrastive loss: 1multi domain dialogue system: 1dialogue system robustness: 1robustness: 1multi step prompting strategy: 1vae: 1simclr: 1costs: 1guidance learning: 1optimization: 1transformers: 1dialogue system: 1hierarchical reinforcement learning: 1prompt based learning: 1meta learning: 1computational modeling: 1supervised learning: 1data models: 1noise measurement: 1perturbation methods: 1sentence embedding: 1pattern classification: 1document handling: 1document representation: 1optimisation: 1sequential learning: 1language translation: 1adversarial learning: 1transformer: 1minimax techniques: 1mask language model: 1talking face generation: 1training data: 1domain mapping: 1generative model: 1image synthesis: 1flow based model: 1interactive systems: 1dialogue generation: 1normalizing flow: 1autoregressive processes: 1domain adversarial training: 1speaker verification (sv): 1standards: 1markov state: 1latent variable model: 1speech enhancement: 1deep sequential learning: 1source separation: 1speech intelligibility: 1stochastic transition: 1backpropagation: 1markov processes: 1sequence generation: 1hierarchical model: 1image representation: 1i vectors: 1maximum mean discrepancy: 1x vectors: 1
Most publications (all venues) at 2020: 21, 2023: 16, 2021: 16, 2019: 16, 2018: 16


Recent publications

TASLP2024 Youzhi Tu, Man-Wai Mak, Jen-Tzung Chien
Contrastive Self-Supervised Speaker Embedding With Sequential Disentanglement.

ICASSP2024 Bobbi Aditya, Mahdin Rohmatillah, Liang-Hsuan Tai, Jen-Tzung Chien
Attention-Guided Adaptation for Code-Switching Speech Recognition.

ICASSP2024 Chong-Xin Gan, Man-Wai Mak, Weiwei Lin 0002, Jen-Tzung Chien
Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification.

ICASSP2024 Mahdin Rohmatillah, Jen-Tzung Chien
Revise the NLU: A Prompting Strategy for Robust Dialogue System.

ICASSP2024 Youzhi Tu, Man-Wai Mak, Jen-Tzung Chien
Contrastive Speaker Embedding With Sequential Disentanglement.

TASLP2023 Mahdin Rohmatillah, Jen-Tzung Chien
Hierarchical Reinforcement Learning With Guidance for Multi-Domain Dialogue Policy.

ICASSP2023 Ming-Yen Chen, Mahdin Rohmatillah, Ching-Hsien Lee, Jen-Tzung Chien
Meta Learning for Domain Agnostic Soft Prompt.

ICASSP2023 Jen-Tzung Chien, Yuan-An Chen, 
Self-Supervised Adversarial Training for Contrastive Sentence Embedding.

Interspeech2023 Jen-Tzung Chien, Shang-En Li, 
Contrastive Disentangled Learning for Memory-Augmented Transformer.

Interspeech2023 Mahdin Rohmatillah, Bobbi Aditya, Li-Jen Yang, Bryan Gautama Ngo, Willianto Sulaiman, Jen-Tzung Chien
Promoting Mental Self-Disclosure in a Spoken Dialogue System.

Interspeech2023 Li-Jen Yang, Chao-Han Huck Yang, Jen-Tzung Chien
Parameter-Efficient Learning for Text-to-Speech Accent Adaptation.

ICASSP2022 Chang-Ting Chu, Mahdin Rohmatillah, Ching-Hsien Lee, Jen-Tzung Chien
Augmentation Strategy Optimization for Language Understanding.

ICASSP2022 Hou Lio, Shang-En Li, Jen-Tzung Chien
Adversarial Mask Transformer for Sequential Learning.

Interspeech2022 Jen-Tzung Chien, Yu-Han Huang, 
Bayesian Transformer Using Disentangled Mask Attention.

ICASSP2021 Sheng-Jhe Huang, Jen-Tzung Chien
Attribute Decomposition for Flow-Based Domain Mapping.

ICASSP2021 Tien-Ching Luo, Jen-Tzung Chien
Variational Dialogue Generation with Normalizing Flows.

Interspeech2021 Chi-Hang Leong, Yu-Han Huang, Jen-Tzung Chien
Online Compressive Transformer for End-to-End Speech Recognition.

Interspeech2021 Mahdin Rohmatillah, Jen-Tzung Chien
Causal Confusion Reduction for Robust Multi-Domain Dialogue Policy.

TASLP2020 Youzhi Tu, Man-Wai Mak, Jen-Tzung Chien
Variational Domain Adversarial Learning With Mutual Information Maximization for Speaker Verification.

ICASSP2020 Youzhi Tu, Man-Wai Mak, Jen-Tzung Chien
Information Maximized Variational Domain Adversarial Learning for Speaker Verification.

#160  | Man-Wai Mak | DBLP Google Scholar  
By venue: ICASSP: 13, Interspeech: 8, TASLP: 7, ICML: 1
By year: 2024: 5, 2023: 6, 2022: 6, 2021: 3, 2020: 7, 2019: 2
ISCA sessionsspeaker recognition: 3speech and language in health: 1speaker and language recognition: 1atypical speech analysis and detection: 1feature, embedding and neural architecture for speaker recognition: 1speaker embedding: 1
IEEE keywordsspeaker verification: 11speaker recognition: 9domain adaptation: 7speaker embedding: 5linguistics: 4data augmentation: 4mutual information: 4variational autoencoder: 3contrastive learning: 3self supervised learning: 3transfer learning: 3adaptation models: 3statistics pooling: 3maximum mean discrepancy: 3disentangled representation learning: 2representation learning: 2data mining: 2transformers: 2data models: 2task analysis: 2speech recognition: 2asr: 2spectral analysis: 2speaker verification (sv): 2gaussian distribution: 2electronic mail: 1labeling: 1faces: 1hard negative pairs: 1reverberation: 1weighted contrastive loss: 1prompt tuning: 1parameter efficient tuning: 1transformer adapter: 1pre trained transformer: 1vae: 1simclr: 1depression detection: 1speaker disentanglement: 1depression: 1detectors: 1interference: 1telephone sets: 1robust speaker recognition: 1domain shift: 1convolution: 1kernel: 1weight space ensemble: 1computer architecture: 1maml: 1meta learning: 1computational modeling: 1deep speaker embedding: 1hidden markov models: 1text dependent speaker verification: 1model adaptation: 1speech enhancement: 1feature selection: 1transformer: 1rabbits: 1disfluency pattern: 1dementia detection: 1additives: 1multiobjective optimization: 1additive angular margin: 1optimization methods: 1attention mechanism: 1gumbel softmax: 1attention models: 1deep neural networks: 1gaussian processes: 1short time fourier transform: 1self attention: 1audio signal processing: 1search problems: 1bayes methods: 1hyper parameter optimization: 1robust speaker verification: 1population based learning: 1filtering theory: 1parameter estimation: 1adress: 1cognition: 1patient diagnosis: 1alzheimer's disease detection: 1signal classification: 1diseases: 1features: 1geriatrics: 1medical diagnostic computing: 1natural language processing: 1spectral pooling: 1signal representation: 1domain adversarial training: 1standards: 1adversarial training: 1i vectors: 1x vectors: 1
Most publications (all venues) at 2023: 14, 2022: 11, 2020: 11, 2018: 11, 2024: 10

Affiliations
Hong Kong Polytechnic University, Hong Kong

Recent publications

TASLP2024 Youzhi Tu, Man-Wai Mak, Jen-Tzung Chien, 
Contrastive Self-Supervised Speaker Embedding With Sequential Disentanglement.

ICASSP2024 Chong-Xin Gan, Man-Wai Mak, Weiwei Lin 0002, Jen-Tzung Chien, 
Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification.

ICASSP2024 Zhe Li, Man-Wai Mak, Helen Mei-Ling Meng, 
Dual Parameter-Efficient Fine-Tuning for Speaker Representation Via Speaker Prompt Tuning and Adapters.

ICASSP2024 Youzhi Tu, Man-Wai Mak, Jen-Tzung Chien, 
Contrastive Speaker Embedding With Sequential Disentanglement.

ICASSP2024 Lishi Zuo, Man-Wai Mak, Youzhi Tu, 
Promoting Independence of Depression and Speaker Features for Speaker Disentanglement in Speech-Based Depression Detection.

TASLP2023 Weiwei Lin 0002, Man-Wai Mak
Robust Speaker Verification Using Deep Weight Space Ensemble.

TASLP2023 Weiwei Lin 0002, Man-Wai Mak
Model-Agnostic Meta-Learning for Fast Text-Dependent Speaker Embedding Adaptation.

ICASSP2023 Xiaoquan Ke, Man-Wai Mak, Helen M. Meng, 
Feature Selection and Text Embedding for Detecting Dementia from Spontaneous Cantonese.

ICASSP2023 Zhe Li, Man-Wai Mak, Helen Mei-Ling Meng, 
Discriminative Speaker Representation Via Contrastive Learning with Class-Aware Attention in Angular Space.

Interspeech2023 Helen Meng, Brian Mak, Man-Wai Mak, Helene H. Fung, Xianmin Gong, Timothy C. Y. Kwok, Xunying Liu, Vincent C. T. Mok, Patrick C. M. Wong, Jean Woo, Xixin Wu, Ka Ho Wong, Sean Shensheng Xu, Naijun Zheng, Ranzo Huang, Jiawen Kang 0002, Xiaoquan Ke, Junan Li, Jinchao Li, Yi Wang, 
Integrated and Enhanced Pipeline System to Support Spoken Language Analytics for Screening Neurocognitive Disorders.

ICML2023 Weiwei Lin 0002, Chenhang He, Man-Wai Mak, Youzhi Tu, 
Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations.

TASLP2022 Weiwei Lin 0002, Man-Wai Mak
Mixture Representation Learning for Deep Speaker Embedding.

TASLP2022 Youzhi Tu, Man-Wai Mak
Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding.

ICASSP2022 Weiwei Lin 0002, Man-Wai Mak
Robust Speaker Verification Using Population-Based Data Augmentation.

ICASSP2022 Lu Yi 0001, Man-Wai Mak
Disentangled Speaker Embedding for Robust Speaker Verification.

Interspeech2022 Zhenke Gao, Man-Wai Mak, Weiwei Lin 0002, 
UNet-DenseNet for Robust Far-Field Speaker Verification.

Interspeech2022 Xiaoquan Ke, Man-Wai Mak, Helen M. Meng, 
Automatic Selection of Discriminative Features for Dementia Detection in Cantonese-Speaking People.

ICASSP2021 Jinchao Li, Jianwei Yu, Zi Ye 0001, Simon Wong, Man-Wai Mak, Brian Mak, Xunying Liu, Helen Meng, 
A Comparative Study of Acoustic and Linguistic Features Classification for Alzheimer's Disease Detection.

ICASSP2021 Youzhi Tu, Man-Wai Mak
Short-Time Spectral Aggregation for Speaker Embedding.

Interspeech2021 Youzhi Tu, Man-Wai Mak
Mutual Information Enhanced Training for Speaker Embedding.

#161  | Zhongqiu Wang | DBLP Google Scholar  
By venue: TASLP: 12, ICASSP: 10, Interspeech: 6, NeurIPS: 1
By year: 2024: 3, 2023: 7, 2022: 5, 2021: 2, 2020: 3, 2019: 5, 2018: 4
ISCA sessionsspatial and phase cues for source separation and speech recognition: 2speech enhancement and intelligibility: 1speaker and language recognition: 1deep enhancement: 1deep learning for source separation and pitch tracking: 1
IEEE keywordsspeech enhancement: 11reverberation: 7speaker recognition: 7complex spectral mapping: 6array signal processing: 6source separation: 5microphone array processing: 5microphones: 4speech separation: 4speech recognition: 4noise measurement: 3time domain analysis: 3time frequency analysis: 3phase estimation: 3signal processing algorithms: 3beamforming: 3covariance matrices: 3benchmark testing: 2sound effects: 2audio source separation: 2motion pictures: 2soundtrack: 2music: 2speech: 2computational modeling: 2frame online speech enhancement: 2training data: 2speaker separation: 2microphone arrays: 2estimation: 2robust speaker localization: 2direction of arrival estimation: 2deep learning (artificial intelligence): 2speech dereverberation: 2time frequency masking: 2deep neural networks: 2iterative methods: 2mathematical models: 1nonlinear filters: 1filtering algorithms: 1unsupervised neural speech dereverberation (usd): 1boosting: 1transformer: 1transformers: 1modulation: 1data mining: 1target speaker extraction: 1real world scenarios: 1multimodality: 1misp challenge: 1visualization: 1recording: 1sound event detection: 1event detection: 1remixing: 1measurement: 1tagging: 1audio tagging: 1separation processes: 1full and sub band integration: 1acoustic beamforming: 1computer architecture: 1task analysis: 1discrete fourier transforms: 1low latency communication: 1prediction algorithms: 1indexes: 1pipelines: 1data models: 1predictive models: 1memory management: 1low complexity speech enhancement: 1hearing aids design: 1road transportation: 1memory architecture: 1quantization (signal): 1multi channel complex spectral mapping: 1spectrospatial filtering: 1spectrogram: 1geometry: 1adaptation models: 1generative model: 1diffusion probabilistic model: 1instruments: 1continuous speech separation: 1speaker diarization: 1regression analysis: 1blind deconvolution: 1supervised learning: 1rir estimation: 1filtering theory: 1continuous speaker separation: 1robust speaker recognition: 1gammatone frequency cepstral coefficient (gfcc): 1masking based beamforming: 1x vector: 1gaussian processes: 1transient response: 1chimera++ networks: 1blind source separation: 1deep clustering: 1permutation invariant training: 1spatial features: 1acoustic noise: 1gcc phat: 1steered response power: 1audio signal processing: 1ideal ratio mask: 1denoising: 1signal reconstruction: 1dereverberation: 1speech intelligibility: 1phase: 1phase reconstruction: 1chimera + + networks: 1fourier transforms: 1
Most publications (all venues) at 2023: 17, 2022: 11, 2018: 11, 2024: 10, 2021: 8

Affiliations
URLs

Recent publications

TASLP2024 Zhong-Qiu Wang
USDnet: Unsupervised Speech Dereverberation via Neural Forward Filtering.

ICASSP2024 Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhongqiu Wang, Shinji Watanabe 0001, 
Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor.

ICASSP2024 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang 0029, Hongbo Lan, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao, 
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.

TASLP2023 Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux, 
Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks.

TASLP2023 Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe 0001, 
TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation.

TASLP2023 Zhong-Qiu Wang, Gordon Wichern, Shinji Watanabe 0001, Jonathan Le Roux, 
STFT-Domain Neural Speech Enhancement With Very Low Algorithmic Latency.

ICASSP2023 Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe 0001, Manuel Pariente, Nobutaka Ono, Stefano Squartini, 
Multi-Channel Speaker Extraction with Adversarial Training: The Wavlab Submission to The Clarity ICASSP 2023 Grand Challenge.

ICASSP2023 Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe 0001, 
TF-GRIDNET: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation.

ICASSP2023 Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe 0001, 
Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling.

NeurIPS2023 Zhong-Qiu Wang, Shinji Watanabe 0001, 
UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures.

TASLP2022 Ke Tan 0001, Zhong-Qiu Wang, DeLiang Wang, 
Neural Spectrospatial Filtering.

ICASSP2022 Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe 0001, Alexander Richard, Cheng Yu, Yu Tsao 0001, 
Conditional Diffusion Probabilistic Model for Speech Enhancement.

ICASSP2022 Darius Petermann, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux, 
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks.

ICASSP2022 Zhong-Qiu Wang, DeLiang Wang, 
Localization based Sequential Grouping for Continuous Speech Separation.

Interspeech2022 Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao 0001, Yanmin Qian, Shinji Watanabe 0001, 
ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding.

TASLP2021 Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux, 
Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation.

ICASSP2021 Zhong-Qiu Wang, DeLiang Wang, 
Count And Separate: Incorporating Speaker Counting For Continuous Speaker Separation.

TASLP2020 Hassan Taherian, Zhong-Qiu Wang, Jorge Chang, DeLiang Wang, 
Robust Speaker Recognition Based on Single-Channel and Multi-Channel Speech Enhancement.

TASLP2020 Zhong-Qiu Wang, DeLiang Wang, 
Deep Learning Based Target Cancellation for Speech Dereverberation.

TASLP2020 Zhong-Qiu Wang, Peidong Wang, DeLiang Wang, 
Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR.

#162  | Shiyin Kang | DBLP Google Scholar  
By venue: ICASSP: 15, Interspeech: 12, TASLP: 2
By year: 2024: 3, 2023: 5, 2022: 6, 2021: 4, 2020: 5, 2019: 5, 2018: 1
ISCA sessionsspeech synthesis: 5voice conversion and adaptation: 2non-autoregressive sequential modeling for speech processing: 1speech synthesis paradigms and methods: 1speech-to-text and speech assessment: 1neural techniques for voice conversion and waveform generation: 1expressive speech synthesis: 1
IEEE keywordsspeech synthesis: 8speech coding: 5speech recognition: 5speaker recognition: 4expressive speech synthesis: 3natural language processing: 3recurrent neural nets: 3text to speech: 2adaptation models: 2language model: 2decoding: 2multiple signal classification: 2coherence: 2hierarchical: 2speaking style modelling: 2time frequency analysis: 2speech separation: 2voice conversion: 2vocoders: 2cloning: 1speaker adaptation: 1zero shot: 1timbre: 1multi scale acoustic prompts: 1stereophonic music: 1degradation: 1codecs: 1music generation: 1encoding: 1neural codec: 1image coding: 1language models: 1long multi track: 1instruments: 1multi view midivae: 1symbolic music generation: 1two dimensional displays: 1context modeling: 1style modeling: 1semantics: 1hidden markov models: 1bit error rate: 1multi scale: 1predictive models: 1audiobook speech synthesis: 1prediction methods: 1transformers: 1context aware: 1multi sentence: 1hierarchical transformer: 1speech: 1network architecture: 1corrector network: 1source separation: 1time domain: 1time frequency domain: 1particle separators: 1robustness: 1learning systems: 1stability analysis: 1error analysis: 1contextual biasing: 1conformer: 1biased words: 1sensitivity: 1phase information: 1full band extractor: 1noise reduction: 1speech enhancement: 1multi scale time sensitive channel attention: 1memory management: 1convolution: 1text analysis: 1xlnet: 1knowledge distillation: 1connectionist temporal classification: 1cross entropy: 1entropy: 1disentangling: 1hybrid bottleneck features: 1voice activity detection: 1capsule: 1exemplary emotion descriptor: 1speech emotion recognition: 1emotion recognition: 1residual error: 1multi speaker and multi style tts: 1hifi gan: 1durian: 1low resource condition: 1phonetic pos teriorgrams: 1speech intelligibility: 1code switching: 1accent conversion: 1accented speech recognition: 1multi modal: 1audio visual systems: 1overlapped speech: 1audio visual speech recognition: 1wavenet: 1self attention: 1blstm: 1optimisation: 1phonetic posteriorgrams(ppgs): 1variational inference: 1convolutional neural nets: 1quasifully recurrent neural network (qrnn): 1parallel processing: 1parallel wavenet: 1text to speech (tts) synthesis: 1convolutional neural network (cnn): 1
Most publications (all venues) at 2023: 9, 2019: 9, 2024: 8, 2022: 6, 2021: 6

Affiliations
URLs

Recent publications

ICASSP2024 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Dan Luo, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han 0001, Helen Meng, 
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts.

ICASSP2024 Xingda Li, Fan Zhuo, Dan Luo, Jun Chen 0024, Shiyin Kang, Zhiyong Wu 0001, Tao Jiang, Yang Li, Han Fang, Yahui Zhou, 
Generating Stereophonic Music with Single-Stage Language Models.

ICASSP2024 Zhiwei Lin, Jun Chen 0024, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu 0001, Helen Meng, 
Multi-View Midivae: Fusing Track- and Bar-View Representations for Long Multi-Track Symbolic Music Generation.

TASLP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Xixin Wu, Shiyin Kang, Helen Meng, 
MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis.

ICASSP2023 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis.

ICASSP2023 Weinan Tong, Jiaxu Zhu, Jun Chen 0024, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
TFCnet: Time-Frequency Domain Corrector for Speech Separation.

ICASSP2023 Yaoxun Xu, Baiji Liu, Qiaochu Huang, Xingchen Song, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
CB-Conformer: Contextual Biasing Conformer for Biased Word Recognition.

Interspeech2023 Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou 0002, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis.

ICASSP2022 Jun Chen 0024, Zilin Wang, Deyi Tuo, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
FullSubNet+: Channel Attention Fullsubnet with Complex Spectrograms for Speech Enhancement.

ICASSP2022 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis.

ICASSP2022 Xintao Zhao, Feng Liu, Changhe Song, Zhiyong Wu 0001, Shiyin Kang, Deyi Tuo, Helen Meng, 
Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion.

Interspeech2022 Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu 0001, Helen Meng, 
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information.

Interspeech2022 Shun Lei, Yixuan Zhou 0002, Liyang Chen, Jiankun Hu, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis.

Interspeech2022 Shaohuan Zhou, Shun Lei, Weiya You, Deyi Tuo, Yuren You, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information.

TASLP2021 Xixin Wu, Yuewen Cao, Hui Lu, Songxiang Liu, Shiyin Kang, Zhiyong Wu 0001, Xunying Liu, Helen Meng, 
Exemplar-Based Emotive Speech Synthesis.

ICASSP2021 Jie Wang, Yuren You, Feng Liu, Deyi Tuo, Shiyin Kang, Zhiyong Wu 0001, Helen Meng, 
The Huya Multi-Speaker and Multi-Style Speech Synthesis System for M2voc Challenge 2020.

Interspeech2021 Hui Lu, Zhiyong Wu 0001, Xixin Wu, Xu Li 0015, Shiyin Kang, Xunying Liu, Helen Meng, 
VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis.

Interspeech2021 Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu 0001, Shiyin Kang, Helen Meng, 
Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion.

ICASSP2020 Yuewen Cao, Songxiang Liu, Xixin Wu, Shiyin Kang, Peng Liu, Zhiyong Wu 0001, Xunying Liu, Dan Su 0002, Dong Yu 0001, Helen Meng, 
Code-Switched Speech Synthesis Using Bilingual Phonetic Posteriorgram with Only Monolingual Corpora.

ICASSP2020 Songxiang Liu, Disong Wang, Yuewen Cao, Lifa Sun, Xixin Wu, Shiyin Kang, Zhiyong Wu 0001, Xunying Liu, Dan Su 0002, Dong Yu 0001, Helen Meng, 
End-To-End Accent Conversion Without Using Native Utterances.

#163  | Frank K. Soong | DBLP Google Scholar  
By venue: ICASSP: 13, Interspeech: 11, TASLP: 3, SpeechComm: 1
By year: 2023: 2, 2022: 7, 2021: 5, 2020: 5, 2019: 7, 2018: 2
ISCA sessionsspeech synthesis: 8singing voice computing and processing in music: 1voice conversion and speech synthesis: 1applications in education and learning: 1
IEEE keywordsspeech synthesis: 7natural language processing: 7neural tts: 4vocoders: 4speech recognition: 4regression analysis: 3speech coding: 2autoregressive processes: 2text analysis: 2linguistics: 2pronunciation assessment: 2computer assisted language learning: 2prosody: 2ordinal regression: 2computer aided instruction: 2unit selection: 2recurrent neural nets: 2lpcnet: 2vq vae: 1vector quantization: 1multi stage multi codebook (msmc): 1speech representation: 1predictive models: 1variational inference: 1style and speaker attributes: 1disjoint datasets: 1style transfer: 1speaker recognition: 1text to speech (tts): 1computational linguistics: 1long form: 1cross sentence: 1universal ordinal regression: 1mispronunciation detection: 1computational modeling: 1memory management: 1self attention: 1transformers: 1semiconductor device modeling: 1efficient transformer: 1phoneme recognition: 1mispronunciation detection and diagnosis: 1acoustic phonetic linguistic embeddings: 1computer aided pronunciation training: 1speech bert embedding: 1training data: 1large scale pre training: 1data models: 1acoustic distortion: 1bit error rate: 1decoding: 1databases: 1speech quality assessment: 1medical image processing: 1correlation methods: 1speech intelligibility: 1mos prediction: 1mean bias network: 1sensitivity analysis: 1video signal processing: 1goodness of pronunciation: 1trajectory tiling: 1spectral analysis: 1sequence to sequence: 1hybrid text to speech: 1text to speech: 1lp mdn: 1neural vocoder: 1filtering theory: 1bert: 1dynamic acoustic difference: 1probability: 1absolute f0 difference: 1kl divergence: 1domain adversarial training: 1asr: 1esl: 1keyword spotting: 1call: 1anchored reference sample: 1pattern classification: 1mean opinion score (mos): 1speech fluency assessment: 1computer assisted language learning (call): 1permutation invariant training: 1speech separation: 1pitch tracking: 1deep clustering: 1source separation: 1
Most publications (all venues) at 2007: 25, 2006: 25, 2008: 24, 2012: 15, 2010: 15

Affiliations
Microsoft Research Asia, Beijing, China
Chinese University of Hong Kong (CUHK), Department of Systems Engineering and Engineering Management, Hong Kong
Bell Labs Research, Murray Hill, NJ, USA
Stanford University, Department of Electrical Engineering, CA, USA (PhD)

Recent publications

TASLP2023 Haohan Guo, Fenglong Xie, Xixin Wu, Frank K. Soong, Helen Meng, 
MSMC-TTS: Multi-Stage Multi-Codebook VQ-VAE Based Neural TTS.

Interspeech2023 Yujia Xiao, Shaofei Zhang, Xi Wang 0016, Xu Tan 0003, Lei He 0005, Sheng Zhao, Frank K. Soong, Tan Lee 0001, 
ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading.

TASLP2022 Xiaochun An, Frank K. Soong, Lei Xie 0001, 
Disentangling Style and Speaker Attributes for TTS Style Transfer.

TASLP2022 Liumeng Xue, Frank K. Soong, Shaofei Zhang, Lei Xie 0001, 
ParaTTS: Learning Linguistic and Prosodic Cross-Sentence Information in Paragraph-Based TTS.

ICASSP2022 Shaoguang Mao, Frank K. Soong, Yan Xia 0005, Jonathan Tien, 
A Universal Ordinal Regression for Assessing Phoneme-Level Pronunciation.

ICASSP2022 Yujia Xiao, Xi Wang 0016, Lei He 0005, Frank K. Soong
Improving Fastspeech TTS with Efficient Self-Attention and Compact Feed-Forward Network.

ICASSP2022 Wenxuan Ye, Shaoguang Mao, Frank K. Soong, Wenshan Wu, Yan Xia 0005, Jonathan Tien, Zhiyong Wu 0001, 
An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings.

Interspeech2022 Mutian He 0001, Jingzhou Yang, Lei He 0005, Frank K. Soong
Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge.

Interspeech2022 Haohan Guo, Feng-Long Xie, Frank K. Soong, Xixin Wu, Helen Meng, 
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS.

ICASSP2021 Liping Chen, Yan Deng, Xi Wang 0016, Frank K. Soong, Lei He 0005, 
Speech Bert Embedding for Improving Prosody in Neural TTS.

ICASSP2021 Yichong Leng, Xu Tan 0003, Sheng Zhao, Frank K. Soong, Xiang-Yang Li 0001, Tao Qin 0001, 
MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network.

ICASSP2021 Bin Su, Shaoguang Mao, Frank K. Soong, Yan Xia 0005, Jonathan Tien, Zhiyong Wu 0001, 
Improving Pronunciation Assessment Via Ordinal Regression with Anchored Reference Samples.

ICASSP2021 Feng-Long Xie, Xinhui Li, Wen-Chao Su, Li Lu, Frank K. Soong
A New High Quality Trajectory Tiling Based Hybrid TTS In Real Time.

Interspeech2021 Xiaochun An, Frank K. Soong, Lei Xie 0001, 
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-End Neural TTS.

ICASSP2020 Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank K. Soong, Hong-Goo Kang, 
Improving LPCNET-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network.

ICASSP2020 Yujia Xiao, Lei He 0005, Huaiping Ming, Frank K. Soong
Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS.

ICASSP2020 Feng-Long Xie, Xinhui Li, Bo Liu, Yibin Zheng, Li Meng, Li Lu, Frank K. Soong
An Improved Frame-Unit-Selection Based Voice Conversion System Without Parallel Training Data.

Interspeech2020 Yang Cui, Xi Wang 0016, Lei He 0005, Frank K. Soong
An Efficient Subband Linear Prediction for LPCNet-Based Neural Synthesis.

Interspeech2020 Yuanbo Hou, Frank K. Soong, Jian Luan 0001, Shengchen Li, 
Transfer Learning for Improving Singing-Voice Detection in Polyphonic Instrumental Music.

SpeechComm2019 Feng-Long Xie, Frank K. Soong, Haifeng Li 0001, 
Voice conversion with SI-DNN and KL divergence based mapping without parallel training data.

#164  | Lin Li 0032 | DBLP Google Scholar  
By venue: Interspeech: 16, ICASSP: 10, TASLP: 2
By year: 2024: 2, 2023: 5, 2022: 4, 2021: 9, 2020: 5, 2019: 3
ISCA sessionsspeaker recognition: 3speaker and language identification: 2oriental language recognition: 2language recognition: 2speaker embedding and diarization: 1speaker and language recognition: 1non-autoregressive sequential modeling for speech processing: 1feature, embedding and neural architecture for speaker recognition: 1large-scale evaluation of short-duration speaker verification: 1asr neural network architectures: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1
IEEE keywordsspeaker recognition: 7speaker verification: 4speech recognition: 3noisy labels: 2transformers: 2speaker clustering: 2adaptation models: 2x vector: 2deep learning (artificial intelligence): 2computational modeling: 1exponential moving average: 1noise measurement: 1label ensembling: 1confidence ranking: 1noise: 1predictive models: 1task analysis: 1performance evaluation: 1representation learning: 1robustness: 1visualization: 1stable learning: 1label correction: 1degradation: 1clustering algorithms: 1signal processing algorithms: 1pre trained model: 1transfer learning: 1runtime: 1production: 1conformer: 1computer architecture: 1runtime environment: 1stability analysis: 1loss weight adaption: 1model agnostic meta learning: 1homoscedastic uncertainty: 1manuals: 1convolutional neural network: 1low resource automatic speech recognition: 1uncertainty: 1bayes methods: 1probabilistic linear discriminant analysis: 1training data: 1noise reduction: 1graph convolutional network: 1semi supervised learning: 1semisupervised learning: 1convolution: 1clustering methods: 1multi speaker: 1text analysis: 1multi lingual: 1non autoregressive: 1natural language processing: 1speech synthesis: 1lightweight: 1autoregressive processes: 1error analysis: 1multi accent: 1global embedding: 1end to end: 1data augmentation: 1domain adaptation: 1open source toolkit: 1deep neural networks: 1linear discriminant analysis: 1sensor fusion: 1f tdnn: 1prediction theory: 1as norm: 1domain mismatch: 1sre19: 1speaker embedding: 1speech coding: 1optimisation: 1adversarial training: 1multi task: 1
Most publications (all venues) at 2021: 15, 2022: 10, 2020: 7, 2019: 7, 2016: 6

Affiliations
Xiamen University, Department of Electronic Engineering, China
University of Science and Technology of China, Hefei, China (PhD 2008)

Recent publications

TASLP2024 Zhihua Fang, Liang He 0003, Lin Li 0032, Ying Hu 0005, 
Improving Speaker Verification With Noise-Aware Label Ensembling and Sample Selection: Learning and Correcting Noisy Speaker Labels.

ICASSP2024 Jian Zhang, Jing Ma, Xiaochen Guo, Lin Li 0032, Liang He 0003, 
A Speaker Recognition Method Based on Stable Learning.

ICASSP2023 Zhicong Chen, Jie Wang, Wenxuan Hu, Lin Li 0032, Qingyang Hong, 
Unsupervised Speaker Verification Using Pre-Trained Model and Label Correction.

ICASSP2023 Dexin Liao, Tao Jiang 0033, Feng Wang, Lin Li 0032, Qingyang Hong, 
Towards A Unified Conformer Structure: from ASR to ASV Task.

ICASSP2023 Qiulin Wang, Wenxuan Hu, Lin Li 0032, Qingyang Hong, 
Meta Learning with Adaptive Loss Weight for Low-Resource Speech Recognition.

Interspeech2023 Zhihua Fang, Liang He 0003, Hanhan Ma, Xiaochen Guo, Lin Li 0032
Robust Training for Speaker Verification against Noisy Labels.

Interspeech2023 Feng Wang, Lingyan Huang, Tao Li, Qingyang Hong, Lin Li 0032
Conformer-based Language Embedding with Self-Knowledge Distillation for Spoken Language Identification.

TASLP2022 Lin Li 0032, Fuchuan Tong, Qingyang Hong, 
When Speaker Recognition Meets Noisy Labels: Optimizations for Front-Ends and Back-Ends.

ICASSP2022 Fuchuan Tong, Siqi Zheng, Min Zhang, Yafeng Chen, Hongbin Suo, Qingyang Hong, Lin Li 0032
Graph Convolutional Network Based Semi-Supervised Learning on Multi-Speaker Meeting Data.

Interspeech2022 Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li 0032, Qingyang Hong, 
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting.

Interspeech2022 Binling Wang, Feng Wang, Wenxuan Hu, Qiulin Wang, Jing Li, Dong Wang 0013, Lin Li 0032, Qingyang Hong, 
Oriental Language Recognition (OLR) 2021: Summary and Analysis.

ICASSP2021 Song Li, Beibei Ouyang, Lin Li 0032, Qingyang Hong, 
Light-TTS: Lightweight Multi-Speaker Multi-Lingual Text-to-Speech.

ICASSP2021 Song Li, Beibei Ouyang, Dexin Liao, Shipeng Xia, Lin Li 0032, Qingyang Hong, 
End-To-End Multi-Accent Speech Recognition with Unsupervised Accent Modelling.

ICASSP2021 Fuchuan Tong, Miao Zhao, Jianfeng Zhou, Hao Lu, Zheng Li, Lin Li 0032, Qingyang Hong, 
ASV-SUBTOOLS: Open Source Toolkit for Automatic Speaker Verification.

Interspeech2021 Zheng Li, Yan Liu, Lin Li 0032, Qingyang Hong, 
Additive Phoneme-Aware Margin Softmax Loss for Language Recognition.

Interspeech2021 Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li 0032, Qingyang Hong, 
Real-Time End-to-End Monaural Multi-Speaker Speech Recognition.

Interspeech2021 Jing Li, Binling Wang, Yiming Zhi, Zheng Li, Lin Li 0032, Qingyang Hong, Dong Wang 0013, 
Oriental Language Recognition (OLR) 2020: Summary and Analysis.

Interspeech2021 Dexin Liao, Jing Li, Yiming Zhi, Song Li, Qingyang Hong, Lin Li 0032
An Integrated Framework for Two-Pass Personalized Voice Trigger.

Interspeech2021 Yan Liu, Zheng Li, Lin Li 0032, Qingyang Hong, 
Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification.

Interspeech2021 Fuchuan Tong, Yan Liu, Song Li, Jie Wang, Lin Li 0032, Qingyang Hong, 
Automatic Error Correction for Speaker Embedding Learning with Noisy Labels.

#165  | Pengcheng Guo | DBLP Google Scholar  
By venue: Interspeech: 13, ICASSP: 11, TASLP: 4
By year: 2024: 4, 2023: 10, 2022: 5, 2021: 2, 2020: 2, 2019: 4, 2018: 1
ISCA sessionsspeech recognition: 2speaker and language recognition: 2anti-spoofing for speaker verification: 1multi-talker methods in speech processing: 1models for streaming asr: 1robust asr, and far-field/multi-talker asr: 1non-autoregressive sequential modeling for speech processing: 1asr neural network architectures and training: 1the attacker’s perpective on automatic speaker verification: 1model adaptation for asr: 1speech technologies for code-switching in multilingual communities: 1
IEEE keywordsspeech recognition: 8linguistics: 3visualization: 3multitasking: 2task analysis: 2data privacy: 2speaker anonymization: 2information filtering: 2privacy protection: 2privacy: 2end to end: 2error analysis: 2fuses: 2representation learning: 2multimodal: 2audio visual speech recognition: 2timbre: 2voice conversion: 2speaker recognition: 2source separation: 2data models: 2microphone arrays: 2automatic speech recognition: 2alimeeting: 2meeting transcription: 2natural language processing: 2two granularity modeling units: 1asr ar multi task learning: 1lasas: 1decoding: 1degradation: 1transforms: 1matrix decomposition: 1voiceprivacy challenge: 1singular value decomposition (svd): 1redundancy: 1discrete units: 1speech translation: 1self supervised learning: 1correlation: 1spoken language understanding: 1systematics: 1buildings: 1robustness: 1cross attention: 1adversarial attack: 1perturbation methods: 1speaker identification: 1predictive models: 1speech synthesis: 1timbre reserved: 1data processing: 1reverberation: 1multi task learning: 1upper bound: 1background sound: 1acoustic distortion: 1social networking (online): 1internet: 1voice privacy challenge: 1robust keyword spotting: 1real time systems: 1multi modality fusion: 1noise reduction: 1audio visual keywords spotting: 1lips: 1headphones: 1meeting scenario: 1speak diarization: 1arrays: 1multi speaker asr: 1m2met: 1speaker diarization: 1optical filters: 1corpus: 1matched filters: 1noise measurement: 1optical character recognition software: 1multi domain: 1end to end speech processing: 1conformer: 1transformer: 1error statistics: 1statistical distributions: 1attention: 1cross entropy: 1gradient methods: 1listen attend and spell: 1interference suppression: 1virtual adversarial training: 1sequence to sequence: 1adversarial training: 1domain adversarial training: 1asr: 1computer aided instruction: 1esl: 1keyword spotting: 1call: 1
Most publications (all venues) at: 2024: 22, 2023: 16, 2022: 11, 2021: 10, 2020: 6

Affiliations
URLs

Recent publications

TASLP2024 Qijie Shao, Pengcheng Guo, Jinghao Yan, Pengfei Hu 0004, Lei Xie 0001, 
Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition.

TASLP2024 Jixun Yao, Qing Wang 0039, Pengcheng Guo, Ziqian Ning, Lei Xie 0001, 
Distinctive and Natural Speaker Anonymization via Singular Value Transformation-Assisted Matrix.

ICASSP2024 Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon Jung, Yichen Lu, Soumi Maiti, Roshan S. Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe 0001, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang, 
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.

ICASSP2024 He Wang, Pengcheng Guo, Pan Zhou, Lei Xie 0001, 
MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition.

TASLP2023 Qing Wang 0039, Jixun Yao, Li Zhang 0106, Pengcheng Guo, Lei Xie 0001, 
Timbre-Reserved Adversarial Attack in Speaker Identification.

ICASSP2023 Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen, 
The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge.

ICASSP2023 Jixun Yao, Yi Lei, Qing Wang 0039, Pengcheng Guo, Ziqian Ning, Lei Xie 0001, Hai Li, Junhui Liu, Danming Xie, 
Preserving Background Sound in Noise-Robust Voice Conversion Via Multi-Task Learning.

ICASSP2023 Jixun Yao, Qing Wang 0039, Yi Lei, Pengcheng Guo, Lei Xie 0001, Namin Wang, Jie Liu, 
Distinguishable Speaker Anonymization Based on Formant and Fundamental Frequency Scaling.

ICASSP2023 Ao Zhang, He Wang, Pengcheng Guo, Yihui Fu, Lei Xie 0001, Yingying Gao, Shilei Zhang, Junlan Feng, 
VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting.

Interspeech2023 Qing Wang 0039, Jixun Yao, Ziqian Wang, Pengcheng Guo, Lei Xie 0001, 
Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification.

Interspeech2023 Kaixun Huang, Ao Zhang, Zhanheng Yang, Pengcheng Guo, Bingshen Mu, Tianyi Xu, Lei Xie 0001, 
Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network.

Interspeech2023 Yuhao Liang, Fan Yu, Yangze Li, Pengcheng Guo, Shiliang Zhang, Qian Chen 0003, Lei Xie 0001, 
BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR.

Interspeech2023 Tianyi Xu, Zhanheng Yang, Kaixun Huang, Pengcheng Guo, Ao Zhang, Biao Li, Changru Chen, Chao Li, Lei Xie 0001, 
Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition.

Interspeech2023 Hongfei Xue, Qijie Shao, Peikun Chen, Pengcheng Guo, Lei Xie 0001, Jie Liu, 
TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition.

ICASSP2022 Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie 0001, Siqi Zheng, Zhihao Du, Weilong Huang, Pengcheng Guo, Zhijie Yan, Bin Ma 0001, Xin Xu, Hui Bu, 
M2Met: The Icassp 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.

ICASSP2022 Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie 0001, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma 0001, Xin Xu, Hui Bu, 
Summary on the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge.

ICASSP2022 Binbin Zhang, Hang Lv 0001, Pengcheng Guo, Qijie Shao, Chao Yang 0031, Lei Xie 0001, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu 0061, Zhendong Peng, 
WENETSPEECH: A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition.

Interspeech2022 Qijie Shao, Jinghao Yan, Jian Kang 0006, Pengcheng Guo, Xian Shi, Pengfei Hu 0004, Lei Xie 0001, 
Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition.

Interspeech2022 Kun Wei, Pengcheng Guo, Ning Jiang, 
Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism.

ICASSP2021 Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi 0003, Shinji Watanabe 0001, Kun Wei, Wangyou Zhang, Yuekai Zhang, 
Recent Developments on Espnet Toolkit Boosted By Conformer.

#166  | L. Paola García-Perera | DBLP Google Scholar  
By venue: Interspeech: 14, ICASSP: 11, TASLP: 2, NAACL: 1
By year: 2024: 1, 2023: 10, 2022: 4, 2021: 5, 2020: 5, 2019: 3
ISCA sessionsmerlion ccs challenge: 2speaker diarization: 2language identification and diarization: 1speech recognition: 1speaker and language recognition: 1adaptation, transfer learning, and distillation for asr: 1language and accent recognition: 1linguistic components in end-to-end asr: 1phonetic event detection and segmentation: 1speaker recognition evaluation: 1spoken language processing for children’s speech: 1speaker recognition and diarization: 1
IEEE keywordsspeaker diarization: 5speaker recognition: 5self supervised learning: 4decoding: 4speech enhancement: 4eend: 3pipelines: 3estimation: 2unsupervised asr: 2speech recognition: 2end to end: 2switches: 2channel bank filters: 2speaker verification: 2feature enhancement: 2network architecture: 1online diarization: 1oral communication: 1optimization: 1transformers: 1recording: 1reproducibility of results: 1espnet: 1s3prl: 1learning systems: 1task analysis: 1aggregates: 1adaptation models: 1multi talker asr: 1target speaker asr: 1codes: 1error correction: 1information retrieval: 1measurement: 1keyword search: 1confidence: 1timing: 1forced alignment: 1language diarization: 1automatic speech recognition: 1multitasking: 1token: 1code switching: 1language posterior: 1speech coding: 1bridges: 1connectors: 1benchmark testing: 1question answering (information retrieval): 1spoken language understanding: 1semantics: 1eda: 1iterative methods: 1encoding: 1recurrent neural nets: 1inference mechanisms: 1fourier transforms: 1speech separation: 1pattern clustering: 1pattern classification: 1overlapped speech detection: 1matrix algebra: 1voice activity detection: 1resegmentation: 1standards: 1proposals: 1neural network: 1region proposal network: 1faster r cnn: 1predictive models: 1signal denoising: 1perceptual loss: 1deep feature loss: 1far field adaptation: 1dereverberation: 1data handling: 1cyclegan: 1linear discriminant analysis: 1
Most publications (all venues) at: 2023: 21, 2022: 15, 2021: 14, 2020: 11, 2024: 6

Affiliations
Johns Hopkins University, Center for Language and Speech Processing, Baltimore, MD, USA
Nuance Communications, Inc. (former)
Agnitio S.L., Madrid, Spain (former)
University of Zaragoza, Spain (PhD 2014)
Monterrey Institute of Technology and Higher Education (ITESM), Computer Science Department, Monterrey, Mexico

Recent publications

NAACL2024 Patrick Foley, Matthew Wiesner, Bismarck Odoom, Leibny Paola García-Perera, Kenton Murray, Philipp Koehn, 
Where are you from? Geolocating Speech and Applications to Language Identification.

TASLP2023 Shota Horiguchi, Shinji Watanabe 0001, Paola García 0001, Yuki Takashima, Yohei Kawaguchi, 
Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors.

ICASSP2023 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola García, Hung-Yi Lee, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Euro: Espnet Unsupervised ASR Open-Source Toolkit.

ICASSP2023 Zili Huang, Desh Raj, Paola García 0001, Sanjeev Khudanpur, 
Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings.

ICASSP2023 Ruizhe Huang, Matthew Wiesner, Leibny Paola García-Perera, Daniel Povey, Jan Trmal, Sanjeev Khudanpur, 
Building Keyword Search System from End-To-End Asr Systems.

ICASSP2023 Hexin Liu, Haihua Xu, Leibny Paola García, Andy W. H. Khong, Yi He, Sanjeev Khudanpur, 
Reducing Language Confusion for Code-Switching Speech Recognition with Token-Level Language Diarization.

ICASSP2023 Jiatong Shi, Chan-Jan Hsu, Ho-Lam Chung, Dongji Gao, Paola García 0001, Shinji Watanabe 0001, Ann Lee 0001, Hung-Yi Lee, 
Bridging Speech and Textual Pre-Trained Models With Unsupervised ASR.

Interspeech2023 Jesús Villalba 0001, Jonas Borgstrom, Maliha Jahan, Saurabh Kataria, Leibny Paola García, Pedro A. Torres-Carrasquillo, Najim Dehak, 
Advances in Language Recognition in Low Resource African Languages: The JHU-MIT Submission for NIST LRE22.

Interspeech2023 Yi Han Victoria Chua, Hexin Liu, Leibny Paola García, Fei Ting Woon, Jinyi Wong, Xiangyu Zhang, Sanjeev Khudanpur, Andy W. H. Khong, Justin Dauwels, Suzy J. Styles, 
MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization.

Interspeech2023 Dongji Gao, Matthew Wiesner, Hainan Xu, Leibny Paola García, Daniel Povey, Sanjeev Khudanpur, 
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts.

Interspeech2023 Suzy J. Styles, Yi Han Victoria Chua, Fei Ting Woon, Hexin Liu, Leibny Paola García, Sanjeev Khudanpur, Andy W. H. Khong, Justin Dauwels, 
Investigating model performance in language identification: beyond simple error statistics.

TASLP2022 Shota Horiguchi, Yusuke Fujita, Shinji Watanabe 0001, Yawen Xue, Paola García 0001
Encoder-Decoder Based Attractors for End-to-End Neural Diarization.

ICASSP2022 Zili Huang, Shinji Watanabe 0001, Shu-Wen Yang, Paola García 0001, Sanjeev Khudanpur, 
Investigating Self-Supervised Learning for Speech Enhancement and Separation.

Interspeech2022 Hexin Liu, Leibny Paola García-Perera, Andy W. H. Khong, Suzy J. Styles, Sanjeev Khudanpur, 
PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification.

Interspeech2022 Yuki Takashima, Shota Horiguchi, Shinji Watanabe 0001, Leibny Paola García-Perera, Yohei Kawaguchi, 
Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models.

ICASSP2021 Shota Horiguchi, Paola García 0001, Yusuke Fujita, Shinji Watanabe 0001, Kenji Nagamatsu, 
End-To-End Speaker Diarization as Post-Processing.

Interspeech2021 Hexin Liu, Leibny Paola García-Perera, Xinyi Zhang, Justin Dauwels, Andy W. H. Khong, Sanjeev Khudanpur, Suzy J. Styles, 
End-to-End Language Diarization for Bilingual Code-Switching Speech.

Interspeech2021 Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe 0001, Leibny Paola García-Perera, Kenji Nagamatsu, 
Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization.

Interspeech2021 Matthew Wiesner, Mousmita Sarma, Ashish Arora, Desh Raj, Dongji Gao, Ruizhe Huang, Supreet Preet, Moris Johnson, Zikra Iqbal, Nagendra Goel, Jan Trmal, Leibny Paola García-Perera, Sanjeev Khudanpur, 
Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition.

Interspeech2021 Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe 0001, Leibny Paola García-Perera, Kenji Nagamatsu, 
Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers.

#167  | Bo Xu 0002 | DBLP Google Scholar  
By venue: ICASSP: 11, Interspeech: 11, AAAI: 3, TASLP: 1, ACL: 1, IJCAI: 1
By year: 2024: 4, 2023: 4, 2022: 3, 2021: 7, 2020: 3, 2019: 2, 2018: 5
ISCA sessionssequence models for asr: 2question answering from speech: 1speech recognition: 1language and accent recognition: 1source separation, dereverberation and echo cancellation: 1single-channel speech enhancement: 1targeted source separation: 1speech synthesis: 1speaker verification using neural network methods: 1dereverberation: 1
IEEE keywordsspeech recognition: 7natural language processing: 4cocktail party problem: 3visualization: 2continuous integrate and fire: 2multi agent systems: 2low resource: 2interactive systems: 2hearing: 2multi cue guided separation: 1target speaker separation: 1semi supervised learning: 1semisupervised learning: 1spectrogram: 1multimodal machine learning: 1multimodal speech recognition: 1linguistics: 1lips: 1bisimulation: 1decision making: 1computational efficiency: 1reinforcement learning: 1codes: 1oral communication: 1pre training: 1complexity theory: 1terminology: 1medical dialogues: 1filling: 1spoken language understanding: 1semantics: 1slot filling: 1information retrieval: 1deep matching: 1semantic networks: 1deep learning (artificial intelligence): 1external knowledge: 1human computer interaction: 1retrieval based dialogue systems: 1response selection: 1text analysis: 1document handling: 1visual dialog: 1contrastive learning: 1cross modal understanding: 1question answering (information retrieval): 1data visualisation: 1collaborative decoding: 1contextual biasing: 1knowledge selection: 1contextual speech recognition: 1voiceprint: 1speaker extraction: 1emotion recognition: 1onset cue: 1face recognition: 1voice activity detection: 1onset/offset cues: 1speaker recognition: 1source separation: 1dual channel speech separation: 1speaker and direction inferred separation: 1anechoic chambers (acoustic): 1signal representation: 1data augmentation: 1mixup: 1soft and monotonic alignment: 1acoustic boundary positioning: 1online speech recognition: 1decoding: 1end to end model: 1self attention network: 1recurrent neural nets: 1latency control: 1encoder decoder: 1feedforward neural nets: 1end to end: 1
Most publications (all venues) at: 2018: 35, 2023: 32, 2014: 30, 2002: 28, 2024: 26

Affiliations
University of Science and Technology of China, Department of Automation, Hefei, China
Chinese Academy of Sciences, Center for Excellence in Brain Science and Intelligence Technology, Beijing, China
Chinese Academy of Sciences, Institute of Automation, National Laboratory of Pattern Recognition, Beijing, China

Recent publications

TASLP2024 Jiaming Xu 0001, Jian Cui, Yunzhe Hao, Bo Xu 0002
Multi-Cue Guided Semi-Supervised Learning Toward Target Speaker Separation in Real Environments.

ICASSP2024 Ziyi Ni, Minglun Han, Feilong Chen, Linghui Meng 0001, Jing Shi 0003, Pin Lv, Bo Xu 0002
ViLaS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition.

ICASSP2024 Jingqing Ruan, Runpeng Xie, Xuantang Xiong, Shuang Xu, Bo Xu 0002
MaDE: Multi-Scale Decision Enhancement for Multi-Agent Reinforcement Learning.

ACL2024 Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo Xu 0002, Guoqi Li, 
SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network.

ICASSP2023 Zefa Hu, Xiuyi Chen, Haoran Wu, Minglun Han, Ziyi Ni, Jing Shi 0003, Shuang Xu, Bo Xu 0002
Matching-Based Term Semantics Pre-Training for Spoken Patient Query Understanding.

Interspeech2023 Feilong Chen, Minglun Han, Jing Shi 0003, Shuang Xu, Bo Xu 0002
Enhancing Visual Question Answering via Deconstructing Questions and Explicating Answers.

Interspeech2023 Minglun Han, Feilong Chen, Jing Shi 0003, Shuang Xu, Bo Xu 0002
Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation.

AAAI2023 Qingyu Wang, Tielin Zhang, Minglun Han, Yi Wang, Duzhen Zhang, Bo Xu 0002
Complex Dynamic Neurons Improved Spiking Transformer Network for Efficient Automatic Speech Recognition.

ICASSP2022 Xiuyi Chen, Feilong Chen, Shuang Xu, Bo Xu 0002
A Multi Domain Knowledge Enhanced Matching Network for Response Selection in Retrieval-Based Dialogue Systems.

ICASSP2022 Feilong Chen, Xiuyi Chen, Shuang Xu, Bo Xu 0002
Improving Cross-Modal Understanding in Visual Dialog Via Contrastive Learning.

ICASSP2022 Minglun Han, Linhao Dong, Zhenlin Liang, Meng Cai, Shiyu Zhou, Zejun Ma, Bo Xu 0002
Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection.

ICASSP2021 Yunzhe Hao, Jiaming Xu 0001, Peng Zhang, Bo Xu 0002
Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments.

ICASSP2021 Chenxing Li, Jiaming Xu 0001, Nima Mesgarani, Bo Xu 0002
Speaker and Direction Inferred Dual-Channel Speech Separation.

ICASSP2021 Linghui Meng 0001, Jin Xu 0010, Xu Tan 0003, Jindong Wang 0001, Tao Qin 0001, Bo Xu 0002
MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition.

Interspeech2021 Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu 0002
Exploring wav2vec 2.0 on Speaker Verification and Language Identification.

Interspeech2021 Xiyun Li, Yong Xu 0004, Meng Yu 0003, Shi-Xiong Zhang, Jiaming Xu 0001, Bo Xu 0002, Dong Yu 0001, 
MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation.

AAAI2021 Qianqian Dong, Mingxuan Wang, Hao Zhou 0012, Shuang Xu, Bo Xu 0002, Lei Li 0005, 
Consecutive Decoding for Speech-to-text Translation.

AAAI2021 Qianqian Dong, Rong Ye, Mingxuan Wang, Hao Zhou 0012, Shuang Xu, Bo Xu 0002, Lei Li 0005, 
Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation.

ICASSP2020 Linhao Dong, Bo Xu 0002
CIF: Continuous Integrate-And-Fire for End-To-End Speech Recognition.

Interspeech2020 Jing Shi 0003, Jiaming Xu 0001, Yusuke Fujita, Shinji Watanabe 0001, Bo Xu 0002
Speaker-Conditional Chain Model for Speech Separation and Extraction.

#168  | Ross Cutler | DBLP Google Scholar  
By venue: ICASSP: 15, Interspeech: 13
By year: 2024: 2, 2023: 4, 2022: 7, 2021: 7, 2020: 5, 2019: 3
ISCA sessionsspeech coding and enhancement: 2speech transmission & coding: 2audio deep plc (packet loss concealment) challenge: 1acoustic scene analysis: 1non-intrusive objective speech quality assessment (nisqa) challenge for online conferencing applications: 1interspeech 2021 acoustic echo cancellation challenge: 1speech localization, enhancement, and quality assessment: 1interspeech 2021 deep noise suppression challenge: 1deep noise suppression challenge: 1speech enhancement: 1speech and audio classification: 1
IEEE keywordsspeech enhancement: 7perceptual speech quality: 5acoustic echo cancellation: 4speech quality assessment: 3crowdsourcing: 3subjective test: 3echo cancellers: 3deep noise suppressor: 3single talk: 2double talk: 2echo suppression: 2p.835: 2signal denoising: 2objective metric: 2speech: 2metric: 2machine learning: 2estimation: 1direction of arrival estimation: 1a/v fusion: 1speaker detection: 1variable speed drives: 1real time systems: 1signal processing algorithms: 1perceptual dimensions: 1oral communication: 1signal quality: 1surveys: 1quality assessment: 1model size reduction: 1speech interruption detection: 1computational modeling: 1performance evaluation: 1transformers: 1semi supervised learning: 1pandemics: 1analytical models: 1quantization (signal): 1production: 1privacy: 1noise reduction: 1test set: 1ontologies: 1speech recognition: 1iterative methods: 1deep noise suppression: 1personalized noise suppression: 1quality management: 1subjective quality assessment: 1acoustic measurements: 1tools: 1p.808: 1dns: 1interference suppression: 1telephony: 1life estimation: 1writing: 1cinematography: 1multimodal fusion: 1video conferencing: 1active speaker detection: 1virtual cinematography: 1cameras: 1computer vision: 1sound source localization: 1streaming media: 1real time speech enhancement: 1mean square error methods: 1loss function: 1recurrent neural networks: 1speech distortion: 1recurrent neural nets: 1mean opinion score: 1audio quality assessment: 1deep neural network: 1
Most publications (all venues) at: 2021: 11, 2023: 10, 2022: 10, 2020: 10, 2024: 9

Affiliations
URLs

Recent publications

ICASSP2024 Ilya Gurvich, Ido Leichter, Dharmendar Reddy Palle, Yossi Asher, Alon Vinnikov, Igor Abramovski, Vishak Gopal, Ross Cutler, Eyal Krupka, 
A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism.

ICASSP2024 Babak Naderi, Ross Cutler, Nicolae-Catalin Ristea, 
Multi-Dimensional Speech Quality Assessment in Crowdsourcing.

ICASSP2023 Quchen Fu, Szu-Wei Fu, Yaran Fan, Yu Wu 0012, Zhuo Chen 0006, Jayant Gupchup, Ross Cutler
Real-Time Speech Interruption Analysis: from Cloud to Client Deployment.

ICASSP2023 Xavier Gitiaux, Aditya Khant, Ebrahim Beyrami, Chandan K. A. Reddy, Jayant Gupchup, Ross Cutler
AURA: Privacy-Preserving Augmentation to Improve Test Set Diversity in Speech Enhancement.

Interspeech2023 Lorenz Diener, Marju Purin, Sten Sootla, Ando Saabas, Robert Aichner, Ross Cutler
PLCMOS - A Data-driven Non-intrusive Metric for The Evaluation of Packet Loss Concealment Algorithms.

Interspeech2023 Nicolae-Catalin Ristea, Evgenii Indenbom, Ando Saabas, Tanel Pärnamaa, Jegor Guzvin, Ross Cutler
DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation.

ICASSP2022 Ross Cutler, Ando Saabas, Tanel Pärnamaa, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sørensen, Robert Aichner, 
ICASSP 2022 Acoustic Echo Cancellation Challenge.

ICASSP2022 Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, Robert Aichner, 
Icassp 2022 Deep Noise Suppression Challenge.

ICASSP2022 Marju Purin, Sten Sootla, Mateja Sponza, Ando Saabas, Ross Cutler
AECMOS: A Speech Quality Assessment Metric for Echo Impairment.

ICASSP2022 Chandan K. A. Reddy, Vishak Gopal, Ross Cutler
Dnsmos P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors.

Interspeech2022 Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner, Ross Cutler
INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge.

Interspeech2022 Chandan K. A. Reddy, Vishak Gopal, Harishchandra Dubey, Ross Cutler, Sergiy Matusevych, Robert Aichner, 
MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection.

Interspeech2022 Gaoxiong Yi, Wei Xiao, Yiming Xiao, Babak Naderi, Sebastian Möller 0001, Wafaa Wardah, Gabriel Mittag, Ross Cutler, Zhuohuang Zhang, Donald S. Williamson, Fei Chen 0011, Fuzheng Yang, Shidong Shang, 
ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications.

ICASSP2021 Ross Cutler, Babak Naderi, Markus Loide, Sten Sootla, Ando Saabas, 
Crowdsourcing Approach for Subjective Evaluation of Echo Impairment.

ICASSP2021 Chandan K. A. Reddy, Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan 0003, 
ICASSP 2021 Deep Noise Suppression Challenge.

ICASSP2021 Chandan K. A. Reddy, Vishak Gopal, Ross Cutler
Dnsmos: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors.

ICASSP2021 Kusha Sridhar, Ross Cutler, Ando Saabas, Tanel Pärnamaa, Markus Loide, Hannes Gamper, Sebastian Braun, Robert Aichner, Sriram Srinivasan 0003, 
ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets, Testing Framework, and Results.

Interspeech2021 Ross Cutler, Ando Saabas, Tanel Pärnamaa, Markus Loide, Sten Sootla, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sørensen, Robert Aichner, Sriram Srinivasan 0003, 
INTERSPEECH 2021 Acoustic Echo Cancellation Challenge.

Interspeech2021 Babak Naderi, Ross Cutler
Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing.

Interspeech2021 Chandan K. A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Asokan Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan 0003, 
INTERSPEECH 2021 Deep Noise Suppression Challenge.

#169  | Brian Yan | DBLP Google Scholar  
By venue: ICASSP: 14, Interspeech: 12, EMNLP-Findings: 2
By year: 2024: 4, 2023: 16, 2022: 7, 2021: 1
ISCA sessionsspeech recognition: 4spoken dialog systems and conversational analysis: 2novel transformer models for asr: 1spoken language translation, information retrieval, summarization, resources, and evaluation: 1spoken language understanding: 1low-resource asr development: 1speech enhancement and intelligibility: 1cross/multi-lingual and code-switched asr: 1
IEEE keywordsspeech recognition: 5spoken language understanding: 5data models: 4task analysis: 4predictive models: 3semantics: 3speech translation: 2self supervised learning: 2end to end: 2speech coding: 2computational modeling: 2end to end systems: 2benchmark testing: 2stop challenge: 2pipelines: 2ctc: 2code switched asr: 2natural language processing: 2redundancy: 1discrete units: 1correlation: 1systematics: 1conversational speech: 1context modeling: 1speech enhancement: 1contextual information: 1end to end models: 1memory management: 1machine translation: 1robustness: 1decoding: 1zero shot learning: 1code switching: 1data augmentation: 1asr: 1splicing: 1transducers: 1transfer learning: 1multitasking: 1st: 1multi tasking: 1mt: 1vocabulary: 1oral communication: 1spoken dialog system: 1emotion recognition: 1joint modelling: 1history: 1speaker attributes: 1adaptation models: 1overthinking: 1multilingual asr: 1face recognition: 1low resource asr: 1cleaning: 1usability: 1target tracking: 1e2e: 1on device: 1tensors: 1e branchformer: 1sequential distillation: 1tensor decomposition: 1closed box: 1real time systems: 1speech to text translation: 1out of order: 1zero shot asr: 1text analysis: 1public domain software: 1speech based user interfaces: 1language translation: 1open source: 1rnn t: 1bilingual asr: 1computational linguistics: 1
Most publications (all venues) at: 2023: 24, 2022: 8, 2024: 7, 2021: 4

Affiliations
URLs

Recent publications

ICASSP2024 Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-Weon Jung, Yichen Lu, Soumi Maiti, Roshan S. Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe 0001, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang, 
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study.

ICASSP2024 Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization.

ICASSP2024 Amir Hussein, Dorsa Zeinali, Ondrej Klejch, Matthew Wiesner, Brian Yan, Shammur Absar Chowdhury, Ahmed Ali 0002, Shinji Watanabe 0001, Sanjeev Khudanpur, 
Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora.

ICASSP2024 Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe 0001, 
Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing.

ICASSP2023 Siddhant Arora, Hayato Futami, Emiru Tsunoo, Brian Yan, Shinji Watanabe 0001, 
Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History.

ICASSP2023 Siddhant Arora, Hayato Futami, Shih-Lun Wu, Jessica Huynh, Yifan Peng, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe 0001, 
A Study on the Integration of Pipeline and E2E SLU Systems for Spoken Semantic Parsing Toward Stop Quality Challenge.

ICASSP2023 Dan Berrebbi, Brian Yan, Shinji Watanabe 0001, 
Avoid Overthinking in Self-Supervised Models for Speech Recognition.

ICASSP2023 William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe 0001, 
Improving Massively Multilingual ASR with Auxiliary CTC Objectives.

ICASSP2023 Hayato Futami, Jessica Huynh, Siddhant Arora, Shih-Lun Wu, Yosuke Kashiwagi, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe 0001, 
The Pipeline System of ASR and NLU with MLM-based data Augmentation Toward Stop Low-Resource Challenge.

ICASSP2023 Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe 0001, 
E-Branchformer-Based E2E SLU Toward Stop on-Device Challenge.

ICASSP2023 Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe 0001, 
Align, Write, Re-Order: Explainable End-to-End Speech Translation via Operation Sequence Generation.

ICASSP2023 Brian Yan, Matthew Wiesner, Ondrej Klejch, Preethi Jyothi, Shinji Watanabe 0001, 
Towards Zero-Shot Code-Switched Speech Recognition.

Interspeech2023 Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe 0001, 
Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding.

Interspeech2023 Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe 0001, 
Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning.

Interspeech2023 Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe 0001, 
Tensor decomposition for minimization of E2E SLU model toward on-device processing.

Interspeech2023 Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe 0001, 
A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks.

Interspeech2023 Puyuan Peng, Brian Yan, Shinji Watanabe 0001, David Harwath, 
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization.

Interspeech2023 Peter Polák, Brian Yan, Shinji Watanabe 0001, Alex Waibel, Ondrej Bojar, 
Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff.

Interspeech2023 Yui Sudo, Muhammad Shakeel 0001, Brian Yan, Jiatong Shi, Shinji Watanabe 0001, 
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders.

Interspeech2023 Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong Yu 0001, Shinji Watanabe 0001, 
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction.

#170  | Tomi Kinnunen | DBLP Google Scholar  
By venue: Interspeech: 17, TASLP: 5, ICASSP: 4, SpeechComm: 1
By year: 2024: 1, 2023: 7, 2022: 4, 2021: 2, 2020: 5, 2019: 6, 2018: 2
ISCA sessionsspeech coding and enhancement: 3the attacker’s perspective on automatic speaker verification: 2anti-spoofing for speaker verification: 1speaker and language identification: 1spoofing-aware automatic speaker verification (sasv): 1speech coding and privacy: 1voice anti-spoofing and countermeasure: 1speaker recognition: 1speaker embedding: 1speaker recognition evaluation: 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1speaker recognition and diarization: 1deep learning for source separation and pitch tracking: 1speaker verification: 1
IEEE keywordsspeaker recognition: 6asvspoof: 2task analysis: 2spoofing: 2presentation attack detection: 2speaker verification: 2electronic mail: 1authentication: 1multitasking: 1optimization: 1anti spoofing: 1spoof aware speaker verification (sasv): 1codecs: 1deepfakes: 1distributed databases: 1protocols: 1countermeasures: 1communication networks: 1gabor filters: 1initialisation: 1learnable frontend: 1leaf: 1learnable filterbanks: 1voice activity detection: 1sensitivity: 1spoof countermeasures: 1security: 1reinforcement learning: 1multi regime compression: 1deep learning (artificial intelligence): 1nonlinear compression: 1data compression: 1automatic speaker verification (asv): 1security of data: 1detect ion cost function: 1spoofing counter measures: 1regression model: 1regression analysis: 1waveform to sinusoid regression: 1recurrent neural networks: 1frequency estimation: 1recurrent neural nets: 1pitch: 1f0: 1fundamental frequency: 1signal classification: 1mimicry: 1social networking (online): 1speaker ranking: 1public demo: 1large scale speaker identification: 1voxceleb: 1web service: 1
Most publications (all venues) at: 2013: 18, 2020: 17, 2021: 14, 2018: 14, 2016: 14

Affiliations
URLs

Recent publications

TASLP2024 Xuechen Liu, Md. Sahidullah, Kong Aik Lee, Tomi Kinnunen
Generalizing Speaker Verification for Spoof Awareness in the Embedding Space.

TASLP2023 Xuechen Liu, Xin Wang 0037, Md. Sahidullah, Jose Patino 0001, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas W. D. Evans, Andreas Nautsch, Kong Aik Lee, 
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild.

ICASSP2023 Mark Anderson 0006, Tomi Kinnunen, Naomi Harte, 
Learnable Frontends That Do Not Learn: Quantifying Sensitivity To Filterbank Initialisation.

Interspeech2023 Xuechen Liu, Md. Sahidullah, Kong Aik Lee, Tomi Kinnunen
Speaker-Aware Anti-spoofing.

Interspeech2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang 0037, Xuechen Liu, Md. Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas W. D. Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung, 
Towards Single Integrated Spoofing-aware Speaker Verification Embeddings.

Interspeech2023 Hye-jin Shim, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen
How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning.

Interspeech2023 Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen
Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing.

Interspeech2023 Vishwanath Pratap Singh, Md. Sahidullah, Tomi Kinnunen
Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech.

SpeechComm2022 Lauri Tavi, Tomi Kinnunen, Rosa González Hautamäki, 
Improving speaker de-identification with functional data analysis of f0 trajectories.

TASLP2022 Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi, 
Optimizing Tandem Speaker Verification and Anti-Spoofing Systems.

ICASSP2022 Xuechen Liu, Md. Sahidullah, Tomi Kinnunen
Learnable Nonlinear Compression for Robust Speaker Verification.

Interspeech2022 Jee-weon Jung, Hemlata Tak, Hye-jin Shim, Hee-Soo Heo, Bong-Jin Lee, Soo-Whan Chung, Ha-Jin Yu, Nicholas W. D. Evans, Tomi Kinnunen
SASV 2022: The First Spoofing-Aware Speaker Verification Challenge.

Interspeech2021 Bhusan Chettri, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen
Data Quality as Predictor of Voice Anti-Spoofing Generalization.

Interspeech2021 Tomi Kinnunen, Andreas Nautsch, Md. Sahidullah, Nicholas W. D. Evans, Xin Wang 0037, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee, 
Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing.

TASLP2020 Tomi Kinnunen, Héctor Delgado, Nicholas W. D. Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang 0037, Md. Sahidullah, Junichi Yamagishi, Douglas A. Reynolds, 
Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals.

Interspeech2020 Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, Haizhou Li 0001, 
The Attacker's Perspective on Automatic Speaker Verification: An Overview.

Interspeech2020 Rosa González Hautamäki, Tomi Kinnunen
Why Did the x-Vector System Miss a Target Speaker? Impact of Acoustic Mismatch Upon Target Score on VoxCeleb Data.

Interspeech2020 Xuechen Liu, Md. Sahidullah, Tomi Kinnunen
A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings.

Interspeech2020 Alexey Sholokhov, Tomi Kinnunen, Ville Vestman, Kong Aik Lee, 
Extrapolating False Alarm Rates in Automatic Speaker Verification.

TASLP2019 Akihiro Kato, Tomi H. Kinnunen
Statistical Regression Models for Noise Robust F0 Estimation Using Recurrent Deep Neural Networks.

#171  | Reinhold Häb-Umbach | DBLP Google Scholar  
By venue: Interspeech: 15, ICASSP: 9, TASLP: 3
By year: 2024: 2, 2023: 5, 2022: 4, 2021: 2, 2020: 6, 2019: 6, 2018: 2
ISCA sessionssource separation: 2far-field speech recognition: 2multi-talker methods in speech processing: 1speaker recognition: 1speaker embedding and diarization: 1voice conversion and adaptation: 1the fearless steps challenge phase-02: 1monaural source separation: 1diarization: 1speech enhancement: 1privacy in speech and audio interfaces: 1distant asr: 1zero-resource speech recognition: 1
IEEE keywordssource separation: 8speech enhancement: 4speech recognition: 4reverberation: 4blind source separation: 3array signal processing: 3artificial neural networks: 2meeting recognition: 2speaker embeddings: 2error analysis: 2diarization: 2permutation invariant training: 2particle separators: 2convolution: 2dynamic programming: 2recording: 2signal to distortion ratio: 2dereverberation: 2optimisation: 2frequency domain analysis: 2audio signal processing: 2time domain analysis: 2backpropagation: 2speaker recognition: 2time frequency analysis: 1meeting separation: 1microphones: 1indexes: 1speaker diarization: 1computer architecture: 1task analysis: 1meeting data: 1mixture model: 1data models: 1mixture models: 1interpolation: 1clustering: 1continuous speech separation: 1oral communication: 1graph pit: 1teacher student training: 1signal resolution: 1eend: 1speaker verification: 1dvectors: 1computational efficiency: 1software: 1tensors: 1word error rate: 1levenshtein distance: 1training data: 1switches: 1computational modeling: 1loss function: 1acoustic beamforming: 1complex backpropagation: 1transfer functions: 1multi channel source separation: 1beamforming: 1automatic speech recognition: 1maximum likelihood estimation: 1filtering theory: 1microphone array: 1robust automatic speech recognition: 1multichannel source separation: 1mean square error methods: 1joint training: 1convolutional neural nets: 1computational complexity: 1end to end speech recognition: 1hidden markov models: 1multi speaker speech recognition: 1time domain: 1speech separation: 1iterative methods: 1joint optimization: 1least squares approximations: 1robust asr: 1meeting diarization: 1source counting: 1online processing: 1neural network: 1
Most publications (all venues) at: 2021: 17, 2019: 16, 2013: 15, 2018: 13, 2023: 12

Affiliations
University of Paderborn, Department of Electrical Engineering and Information Technology, Germany

Recent publications

TASLP2024 Christoph Böddeker, Aswin Shanmugam Subramanian, Gordon Wichern, Reinhold Haeb-Umbach, Jonathan Le Roux, 
TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings.

ICASSP2024 Tobias Cord-Landwehr, Christoph Böddeker, Catalin Zorila, Rama Doddipatla, Reinhold Haeb-Umbach
Geodesic Interpolation of Frame-Wise Speaker Embeddings for the Diarization of Meeting Scenarios.

TASLP2023 Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach
Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria.

ICASSP2023 Tobias Cord-Landwehr, Christoph Böddeker, Catalin Zorila, Rama Doddipatla, Reinhold Haeb-Umbach
Frame-Wise and Overlap-Robust Speaker Embeddings for Meeting Diarization.

ICASSP2023 Thilo von Neumann, Christoph Böddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems.

Interspeech2023 Simon Berger, Peter Vieting, Christoph Böddeker, Ralf Schlüter, Reinhold Haeb-Umbach
Mixture Encoder for Joint Speech Separation and Recognition.

Interspeech2023 Tobias Cord-Landwehr, Christoph Böddeker, Catalin Zorila, Rama Doddipatla, Reinhold Haeb-Umbach
A Teacher-Student Approach for Extracting Informative Speaker Embeddings From Speech Mixtures.

ICASSP2022 Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach
SA-SDR: A Novel Loss Function for Separation of Meeting Style Data.

Interspeech2022 Christoph Böddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach
An Initialization Scheme for Meeting Separation with Spatial Mixture Models.

Interspeech2022 Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Böddeker, Reinhold Haeb-Umbach
Utterance-by-utterance overlap-aware neural diarization with Graph-PIT.

Interspeech2022 Michael Kuhlmann, Fritz Seebauer, Janek Ebbers, Petra Wagner, Reinhold Haeb-Umbach
Investigation into Target Speaking Rate Adaptation for Voice Conversion.

ICASSP2021 Christoph Böddeker, Wangyou Zhang, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, Yanmin Qian, Reinhold Haeb-Umbach
Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation.

Interspeech2021 Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach
Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers.

TASLP2020 Tomohiro Nakatani, Christoph Böddeker, Keisuke Kinoshita, Rintaro Ikeshita, Marc Delcroix, Reinhold Haeb-Umbach
Jointly Optimal Denoising, Dereverberation, and Source Separation.

ICASSP2020 Jens Heitkaemper, Darius Jakobeit, Christoph Böddeker, Lukas Drude, Reinhold Haeb-Umbach
Demystifying TasNet: A Dissecting Approach.

ICASSP2020 Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Böddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
End-to-End Training of Time Domain Audio Separation and Recognition.

Interspeech2020 Jens Heitkaemper, Joerg Schmalenstroeer, Reinhold Haeb-Umbach
Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments.

Interspeech2020 Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation.

Interspeech2020 Thilo von Neumann, Christoph Böddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source Counting, Separation and ASR.

ICASSP2019 Jahn Heymann, Lukas Drude, Reinhold Haeb-Umbach, Keisuke Kinoshita, Tomohiro Nakatani, 
Joint Optimization of Neural Network-based WPE Dereverberation and Acoustic Model for Robust Online ASR.

#172  | Hong-Goo Kang | DBLP Google Scholar  
By venue: Interspeech: 17, ICASSP: 8, TASLP: 2
By year: 2024: 2, 2023: 8, 2022: 4, 2021: 1, 2020: 7, 2019: 4, 2018: 1
ISCA sessionsspeech synthesis: 4speech coding and enhancement: 2multimodal speech processing: 2speech signal analysis: 1analysis of speech and audio signals: 1speaker recognition and diarization: 1spoken term detection and voice search: 1speaker recognition: 1speech enhancement, bandwidth extension and hearing aids: 1speaker embedding: 1speech coding and evaluation: 1statistical parametric speech synthesis: 1
IEEE keywordstask analysis: 4speech synthesis: 4transformers: 3speech enhancement: 2speech recognition: 2context modeling: 1data mining: 1disentangled representation: 1transformer: 1supervised clustering: 1arabic dialect identification: 1global context: 1bridges: 1self supervised model: 1data models: 1adaptation models: 1fine tuning: 1style modeling: 1generators: 1synthesizers: 1convolution: 1convolutional neural networks: 1articulation to speech: 1multi speaker: 1speech: 1brain modeling: 1auditory eeg decoding: 1measurement: 1electroencephalography: 1decoding: 1global conditioner: 1codes: 1phase reconstruction: 1phase continuity loss: 1denoising: 1text to speech: 1vocoders: 1lp mdn: 1recurrent neural nets: 1lpcnet: 1neural vocoder: 1filtering theory: 1autoregressive processes: 1emotion intensity control: 1emotional tts: 1interpolation: 1emotion recognition: 1end to end gst tacotron2: 1time frequency analysis: 1single channel speech enhancement: 1complex valued time frequency mask: 1spectrogram consistency: 1microphones: 1exact time domain reconstruction: 1audio visual synchronisation: 1cross modal supervision: 1synchronization: 1self supervised learning: 1lips: 1cross modal embedding: 1visualization: 1streaming media: 1automatic speech recognition: 1learning: 1combined query strategy: 1
Most publications (all venues) at: 2023: 19, 2020: 14, 2022: 11, 2018: 11, 2019: 10

Affiliations
URLs

Recent publications

TASLP2024 Zainab Alhakeem, Se-In Jang, Hong-Goo Kang
Disentangled Representations in Local-Global Contexts for Arabic Dialect Identification.

ICASSP2024 Hejung Yang, Hong-Goo Kang
On Fine-Tuning Pre-Trained Speech Models With EMA-Target Self-Supervised Loss.

ICASSP2023 Miseul Kim, Zhenyu Piao, Jihyun Lee, Hong-Goo Kang
Style Modeling for Multi-Speaker Articulation-to-Speech.

ICASSP2023 Zhenyu Piao, Miseul Kim, Hyungchan Yoon, Hong-Goo Kang
HappyQuokka System for ICASSP 2023 Auditory EEG Challenge.

Interspeech2023 Woo-Jin Chung, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang
MF-PAM: Accurate Pitch Estimation through Periodicity Analysis and Multi-level Feature Fusion.

Interspeech2023 Doyeon Kim, Soo-Whan Chung, Hyewon Han, Youna Ji, Hong-Goo Kang
HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders.

Interspeech2023 Jihyun Kim, Hong-Goo Kang
Contrastive Learning based Deep Latent Masking for Music Source Separation.

Interspeech2023 Hejung Yang, Hong-Goo Kang
Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement.

Interspeech2023 Hyungchan Yoon, Changhwan Kim, Eunwoo Song, Hyun-Wook Yoon, Hong-Goo Kang
Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech.

Interspeech2023 Hyungchan Yoon, Seyun Um, Changhwan Kim, Hong-Goo Kang
Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech.

ICASSP2022 Doyeon Kim, Hyewon Han, Hyeon-Kyeong Shin, Soo-Whan Chung, Hong-Goo Kang
Phase Continuity: Learning Derivatives of Phase Spectrum for Speech Enhancement.

Interspeech2022 Miseul Kim, Zhenyu Piao, Seyun Um, Ran Lee, Jaemin Joh, Seungshin Lee, Hong-Goo Kang
Light-Weight Speaker Verification with Global Context Information.

Interspeech2022 Changhwan Kim, Seyun Um, Hyungchan Yoon, Hong-Goo Kang
FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS.

Interspeech2022 Hyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang
Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting.

Interspeech2021 Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang
LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks.

ICASSP2020 Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank K. Soong, Hong-Goo Kang
Improving LPCNET-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network.

ICASSP2020 Se-Yun Um, Sangshin Oh, Kyungguen Byun, Inseon Jang, Chunghyun Ahn, Hong-Goo Kang
Emotional Speech Synthesis with Rich and Granularized Control.

Interspeech2020 Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang
FaceFilter: Audio-Visual Speech Separation Using Still Images.

Interspeech2020 Soo-Whan Chung, Hong-Goo Kang, Joon Son Chung, 
Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision.

Interspeech2020 Hyewon Han, Soo-Whan Chung, Hong-Goo Kang
MIRNet: Learning Multiple Identities Representations in Overlapped Speech.

#173  | Yan Song 0001 | DBLP Google Scholar  
By venue: Interspeech: 15, ICASSP: 10, TASLP: 2
By year: 2024: 1, 2023: 3, 2022: 4, 2021: 4, 2020: 5, 2019: 5, 2018: 5
ISCA sessionsanalysis of speech and audio signals: 2speaker and language recognition: 2language and accent recognition: 1acoustic event detection and acoustic scene classification: 1learning techniques for speaker recognition: 1asr neural network architectures and training: 1acoustic event detection: 1speaker recognition and diarization: 1speaker verification using neural network methods: 1representation learning for emotion: 1acoustic scenes and rare events: 1novel neural network architectures for acoustic modelling: 1speaker verification: 1
IEEE keywordsspeech recognition: 5speaker verification: 4speaker recognition: 4convolutional neural nets: 3representation learning: 2deep learning (artificial intelligence): 2supervised learning: 2speech separation: 2sound event detection: 2audio tagging: 2recurrent neural nets: 2audio signal processing: 2runtime: 1meta learning: 1episodic training: 1iron: 1domain alignment: 1performance gain: 1degradation: 1stargan: 1domain adaptation: 1performance evaluation: 1data models: 1adaptation models: 1data augmentation: 1recording: 1knowledge based systems: 1self supervised learning: 1anomalous sound detection: 1label smoothing: 1unsupervised domain adaptation: 1knowledge distillation: 1end to end: 1speech emotion recognition: 1emotion recognition: 1signal reconstruction: 1style transformation: 1convolutional neural network: 1disentanglement: 1sequence alignment: 1probability: 1multi granularity: 1post inference: 1inference mechanisms: 1end to end asr: 1encoder decoder: 1dense residual networks: 1model ensemble: 1embedding learning: 1source separation: 1time domain: 1sparse encoder: 1speaker identification: 1target tracking: 1signal representation: 1time domain analysis: 1semi supervised learning: 1weakly labeled: 1computational auditory scene analysis: 1label permutation problem: 1autoregressive processes: 1pattern clustering: 1document handling: 1hidden markov models: 1topic detection: 1agglomerative hierarchical clustering: 1natural language processing: 1consensus analysis: 1weakly labelled data: 1attention: 1signal classification: 1
Most publications (all venues) at: 2007: 10, 2006: 10, 2023: 9, 2019: 9, 2018: 9

Affiliations
University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, China

Recent publications

ICASSP2024 Jian-Tao Zhang, Yan Song 0001, Jin Li, Wu Guo, Hao-Yu Song, Ian McLoughlin 0001, 
Meta Representation Learning Method for Robust Speaker Verification in Unseen Domains.

ICASSP2023 Hang-Rui Hu, Yan Song 0001, Jian-Tao Zhang, Li-Rong Dai 0001, Ian McLoughlin 0001, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, 
Stargan-vc Based Cross-Domain Data Augmentation for Speaker Verification.

Interspeech2023 Kang Li, Yan Song 0001, Ian McLoughlin 0001, Lin Liu 0017, Jin Li, Li-Rong Dai 0001, 
Fine-tuning Audio Spectrogram Transformer with Task-aware Adapters for Sound Event Detection.

Interspeech2023 Xiao-Min Zeng, Yan Song 0001, Ian McLoughlin 0001, Lin Liu 0017, Li-Rong Dai 0001, 
Robust Prototype Learning for Anomalous Sound Detection.

ICASSP2022 Han Chen, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Self-Supervised Representation Learning for Unsupervised Anomalous Sound Detection Under Domain Shift.

ICASSP2022 Hang-Rui Hu, Yan Song 0001, Ying Liu, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Domain Robust Deep Embedding Learning for Speaker Recognition.

ICASSP2022 Yuxuan Xi, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Frontend Attributes Disentanglement for Speech Emotion Recognition.

Interspeech2022 Hang-Rui Hu, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
Class-Aware Distribution Alignment based Unsupervised Domain Adaptation for Speaker Verification.

TASLP2021 Jian Tang, Jie Zhang 0042, Yan Song 0001, Ian McLoughlin 0001, Li-Rong Dai 0001, 
Multi-Granularity Sequence Alignment Mapping for Encoder-Decoder Based End-to-End ASR.

ICASSP2021 Ying Liu, Yan Song 0001, Ian McLoughlin 0001, Lin Liu 0017, Li-Rong Dai 0001, 
An Effective Deep Embedding Learning Method Based on Dense-Residual Networks for Speaker Verification.

Interspeech2021 Hui Wang, Lin Liu 0017, Yan Song 0001, Lei Fang, Ian McLoughlin 0001, Li-Rong Dai 0001, 
A Weight Moving Average Based Alternate Decoupled Learning Algorithm for Long-Tailed Language Identification.

Interspeech2021 Xu Zheng, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection.

ICASSP2020 Hui Wang, Yan Song 0001, Zengxi Li, Ian McLoughlin 0001, Li-Rong Dai 0001, 
An Online Speaker-aware Speech Separation Approach Based on Time-domain Representation.

ICASSP2020 Jie Yan, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, 
Task-Aware Mean Teacher Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection.

Interspeech2020 Ying Liu, Yan Song 0001, Yiheng Jiang, Ian McLoughlin 0001, Lin Liu 0017, Li-Rong Dai 0001, 
An Effective Speaker Recognition Method Based on Joint Identification and Verification Supervisions.

Interspeech2020 Zi-qiang Zhang, Yan Song 0001, Jian-Shu Zhang, Ian McLoughlin 0001, Li-Rong Dai 0001, 
Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution.

Interspeech2020 Xu Zheng, Yan Song 0001, Jie Yan, Li-Rong Dai 0001, Ian McLoughlin 0001, Lin Liu 0017, 
An Effective Perturbation Based Semi-Supervised Learning Method for Sound Event Detection.

TASLP2019 Zengxi Li, Yan Song 0001, Li-Rong Dai 0001, Ian McLoughlin 0001, 
Listening and Grouping: An Online Autoregressive Approach for Monaural Speech Separation.

ICASSP2019 Jian Sun, Wu Guo, Zhi Chen, Yan Song 0001
Topic Detection in Conversational Telephone Speech Using CNN with Multi-stream Inputs.

ICASSP2019 Jie Yan, Yan Song 0001, Wu Guo, Li-Rong Dai 0001, Ian McLoughlin 0001, Liang Chen, 
A Region Based Attention Method for Weakly Supervised Sound Event Detection and Classification.

#174  | Athanasios Mouchtaris | DBLP Google Scholar  
By venue: Interspeech: 14, ICASSP: 13
By year: 2024: 1, 2023: 5, 2022: 9, 2021: 8, 2020: 4
ISCA sessionsresource-constrained asr: 3spoken language understanding: 3streaming asr: 2novel transformer models for asr: 1topics in asr: 1language and lexical modeling for asr: 1privacy-preserving machine learning for audio & speech processing: 1summarization, semantic analysis and classification: 1computational resource constrained speech recognition: 1
IEEE keywordsspeech recognition: 11personalization: 5transducers: 4automatic speech recognition: 3contextual biasing: 3end to end: 3multilingual: 3spoken language understanding: 3natural language processing: 3conformer: 2neural transducer: 2runtime: 2costs: 2contact name recognition: 2attention: 2end to end asr: 2switches: 2interactive systems: 2audio signal processing: 2decoding: 2signal classification: 2speech synthesis: 2error analysis: 1max margin: 1end to end speech recognition models: 1sequence discriminative criterion: 1minimum word error rate training: 1computational modeling: 1adaptation models: 1logic gates: 1tail: 1neural transducers: 1encoding: 1rnn t: 1semantics: 1tokenization: 1degradation: 1performance evaluation: 1human computer interaction: 1inference optimization: 1wake word spotting: 1neural biasing: 1computer architecture: 1cross modal learning: 1signal to interpretation: 1end to end neural model: 1cache storage: 1streaming: 1latency: 1latency reduction: 1on device: 1e2e: 1gating: 1dialogue act: 1pitch: 1prosody: 1transformer network: 1attention layer: 1multichannel asr: 1channel coding: 1array signal processing: 1joint modeling: 1recurrent neural network transducer: 1code switching: 1recurrent neural nets: 1optimisation: 1language identification: 1pronunciation generation: 1end to end models: 1grapheme to phoneme (g2p): 1byte representation: 1
Most publications (all venues) at: 2021: 12, 2022: 11, 2017: 9, 2015: 8, 2014: 8

Affiliations
URLs

Recent publications

ICASSP2024 Rupak Vignesh Swaminathan, Grant P. Strimel, Ariya Rastrow, Sri Harish Mallidi, Kai Zhen, Hieu Duy Nguyen, Nathan Susanj, Athanasios Mouchtaris
Max-Margin Transducer Loss: Improving Sequence-Discriminative Training Using a Large-Margin Learning Strategy.

ICASSP2023 Anastasios Alexandridis, Kanthashree Mysore Sathyendra, Grant P. Strimel, Feng-Ju Chang, Ariya Rastrow, Nathan Susanj, Athanasios Mouchtaris
Gated Contextual Adapters For Selective Contextual Biasing In Neural Transducers.

ICASSP2023 Xuandi Fu, Kanthashree Mysore Sathyendra, Ankur Gandhe, Jing Liu, Grant P. Strimel, Ross McGowan, Athanasios Mouchtaris
Robust Acoustic And Semantic Contextual Biasing In Neural Transducers For Speech Recognition.

ICASSP2023 Markus Müller, Anastasios Alexandridis, Zach Trozenski, Joel Whiteman, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, 
Multilingual End-To-End Spoken Language Understanding For Ultra-Low Footprint Applications.

ICASSP2023 Saumya Y. Sahai, Jing Liu, Thejaswi Muniyappa, Kanthashree Mysore Sathyendra, Anastasios Alexandridis, Grant P. Strimel, Ross McGowan, Ariya Rastrow, Feng-Ju Chang, Athanasios Mouchtaris, Siegfried Kunzmann, 
Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition.

Interspeech2023 Martin Radfar, Paulina Lyskawa, Brandon Trujillo, Yi Xie, Kai Zhen, Jahn Heymann, Denis Filimonov, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris
Conmer: Streaming Conformer Without Self-attention for Interactive Voice Assistants.

ICASSP2022 Bhuvan Agrawal, Markus Müller, Samridhi Choudhary, Martin Radfar, Athanasios Mouchtaris, Ross McGowan, Nathan Susanj, Siegfried Kunzmann, 
Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding.

ICASSP2022 Anastasios Alexandridis, Grant P. Strimel, Ariya Rastrow, Pavel Kveton, Jon Webb, Maurizio Omologo, Siegfried Kunzmann, Athanasios Mouchtaris
Caching Networks: Capitalizing on Common Speech for ASR.

ICASSP2022 Anastasios Alexandridis, Kanthashree Mysore Sathyendra, Grant P. Strimel, Pavel Kveton, Jon Webb, Athanasios Mouchtaris
TINYS2I: A Small-Footprint Utterance Classification Model with Contextual Support for On-Device SLU.

ICASSP2022 Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, Jing Liu, Jinru Su, Grant P. Strimel, Athanasios Mouchtaris, Siegfried Kunzmann, 
Contextual Adapters for Personalized Speech Recognition in Neural Transducers.

ICASSP2022 Kai Wei, Dillon Knox, Martin Radfar, Thanh Tran, Markus Müller, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Maurizio Omologo, 
A Neural Prosody Encoder for End-to-End Dialogue Act Classification.

Interspeech2022 Kaiqi Zhao 0002, Hieu Nguyen, Animesh Jain, Nathan Susanj, Athanasios Mouchtaris, Lokesh Gupta, Ming Zhao 0002, 
Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer.

Interspeech2022 Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris
ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition.

Interspeech2022 Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian John King, Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel, 
Compute Cost Amortized Transformer for Streaming ASR.

Interspeech2022 Kai Zhen, Hieu Duy Nguyen, Raviteja Chinta, Nathan Susanj, Athanasios Mouchtaris, Tariq Afzal, Ariya Rastrow, 
Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition.

ICASSP2021 Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian John King, Siegfried Kunzmann, 
End-to-End Multi-Channel Transformer for Speech Recognition.

ICASSP2021 Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried Kunzmann, 
Joint ASR and Language Identification Using RNN-T: An Efficient Approach to Dynamic Language Switching.

Interspeech2021 Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo, 
Multi-Channel Transformer Transducer for Speech Recognition.

Interspeech2021 Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, Maurizio Omologo, 
Phonetically Induced Subwords for End-to-End Speech Recognition.

Interspeech2021 Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow, 
FANS: Fusing ASR and NLU for On-Device SLU.

#175  | David F. Harwath | DBLP Google Scholar  
By venue: Interspeech: 12; ICASSP: 10; ACL: 3; ICLR: 2
By year: 2024: 3; 2023: 10; 2022: 5; 2021: 3; 2020: 3; 2019: 3
ISCA sessions: speech recognition: 2; multimodal systems: 2; dialog management: 1; cross-lingual and multilingual asr: 1; acoustic signal representation and analysis: 1; zero, low-resource and multi-modal speech recognition: 1; low-resource speech recognition: 1; speech translation and multilingual/multimodal learning: 1; zero-resource asr: 1; speech recognition and beyond: 1
IEEE keywords: self supervised learning: 3; visually grounded speech: 3; image retrieval: 3; speech recognition: 3; benchmark testing: 2; visualization: 2; task analysis: 2; multimodal speech processing: 2; data models: 2; vision and language: 2; natural language processing: 2; evaluation: 1; audio visual learning: 1; representation learning: 1; soft sensors: 1; semantics: 1; multilingual speech processing: 1; geometry: 1; speech enhancement: 1; three dimensional displays: 1; rendering (computer graphics): 1; streaming media: 1; encoding: 1; domain adaptation: 1; continual learning: 1; on device: 1; computational modeling: 1; adaptation models: 1; signal processing algorithms: 1; asr: 1; error analysis: 1; automatic speech recognition: 1; unsupervised data selection: 1; correlation: 1; retrieval: 1; multilingual: 1; entropy: 1; cross modal: 1; knowledge distillation: 1; analytical models: 1; cross lingual: 1; predictive models: 1; self supervised speech processing: 1; deep learning (artificial intelligence): 1; computer vision: 1; image representation: 1; self supervised representation learning: 1; adversarial training: 1; information retrieval: 1; and cross lingual retrieval: 1; semantic embedding space: 1; vision and spoken language: 1; self attention: 1; convolutional neural nets: 1; unsupervised speech processing: 1
Most publications (all venues) at: 2023: 17; 2022: 12; 2024: 9; 2021: 6; 2019: 5

Affiliations
Massachusetts Institute of Technology, Cambridge, USA (PhD 2018)

Recent publications

ICASSP2024 Yuan Tseng, Layne Berry, Yiting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang 0001, Chun-Mao Lai, Shang-Wen Li 0001, David Harwath, Yu Tsao 0001, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee, 
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models.

ACL2024 Puyuan Peng, Po-Yao Huang 0001, Shang-Wen Li 0001, Abdelrahman Mohamed, David Harwath
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild.

ACL2024 Jordan Voas, David Harwath, Raymond Mooney, 
Multimodal Contextualized Semantic Parsing from Speech.

ICASSP2023 Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath
M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval.

ICASSP2023 Changan Chen, Wei Sun, David Harwath, Kristen Grauman, 
Learning Audio-Visual Dereverberation.

ICASSP2023 Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed, 
Continual Learning for On-Device Speech Recognition Using Disentangled Conformers.

ICASSP2023 Reem Gody, David Harwath
Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models.

ICASSP2023 Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas 0001, Rogério Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James R. Glass, 
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval.

Interspeech2023 Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh K. Jha, Diego Romeres, Jonathan Le Roux, 
Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos.

Interspeech2023 Puyuan Peng, Shang-Wen Li 0001, Okko Räsänen, Abdelrahman Mohamed, David Harwath
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model.

Interspeech2023 Puyuan Peng, Brian Yan, Shinji Watanabe 0001, David Harwath
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization.

Interspeech2023 Andrew Rouditchenko, Sameer Khurana, Samuel Thomas 0001, Rogério Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James R. Glass, 
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages.

ICLR2023 Yuan Gong 0001, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James R. Glass, 
Contrastive Audio-Visual Masked Autoencoder.

ICASSP2022 Puyuan Peng, David Harwath
Fast-Slow Transformer for Visually Grounding Speech.

ICASSP2022 David Xu 0006, David Harwath
Adversarial Input Ablation for Audio-Visual Learning.

Interspeech2022 Alan Baade, Puyuan Peng, David Harwath
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer.

Interspeech2022 Tyler Miller, David Harwath
Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded Speech.

Interspeech2022 Puyuan Peng, David Harwath
Word Discovery in Visually Grounded, Self-Supervised Speech Models.

Interspeech2021 Andrew Rouditchenko, Angie W. Boggust, David Harwath, Samuel Thomas 0001, Hilde Kuehne, Brian Chen 0001, Rameswar Panda, Rogério Feris, Brian Kingsbury, Michael Picheny, James R. Glass, 
Cascaded Multilingual Audio-Visual Learning from Videos.

Interspeech2021 Andrew Rouditchenko, Angie W. Boggust, David Harwath, Brian Chen 0001, Dhiraj Joshi, Samuel Thomas 0001, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogério Schmidt Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba 0001, James R. Glass, 
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos.

#176  | Emmanuel Dupoux | DBLP Google Scholar  
By venue: Interspeech: 17; ICASSP: 3; EMNLP: 2; ACL: 2; NeurIPS: 1; EMNLP-Findings: 1; TASLP: 1
By year: 2023: 9; 2022: 4; 2021: 3; 2020: 7; 2019: 1; 2018: 3
ISCA sessions: invariant and robust pre-trained acoustic models: 2; connecting speech-science and speech-technology for children’s speech: 2; speech synthesis: 2; zero, low-resource and multi-modal speech recognition: 2; low-resource speech recognition: 1; speech and audio quality assessment: 1; the zero resource speech challenge 2020: 1; diarization: 1; acoustic phonetics and prosody: 1; the zero resource speech challenge 2019: 1; zero-resource speech recognition: 1; topics in speech recognition: 1; sequence models for asr: 1
IEEE keywords: speech recognition: 3; text analysis: 2; unsupervised learning: 2; natural language processing: 2; smoothing methods: 1; measurement: 1; representation learning: 1; self supervision: 1; self supervised learning: 1; unit discovery: 1; automatic speech recognition: 1; human computer interaction: 1; image retrieval: 1; speech synthesis: 1; image representation: 1; distant supervision: 1; unsupervised and semi supervised learning: 1; audio signal processing: 1; zero and low resource asr.: 1; dataset: 1; unsupervised pretraining: 1; speech coding: 1; computational linguistics: 1; low resources: 1; cross lingual: 1
Most publications (all venues) at: 2020: 18; 2023: 17; 2022: 16; 2021: 14; 2016: 14

Affiliations
URLs

Recent publications

ICASSP2023 Ali Elkahky, Wei-Ning Hsu, Paden Tomasello, Tu Anh Nguyen, Robin Algayres, Yossi Adi, Jade Copet, Emmanuel Dupoux, Abdelrahman Mohamed, 
Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training?

Interspeech2023 Mark Hallap, Emmanuel Dupoux, Ewan Dunbar, 
Evaluating context-invariance in unsupervised speech representations.

Interspeech2023 Marvin Lavechin, Yaya Sy, Hadrien Titeux, María Andrea Cruz Blandón, Okko Räsänen, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristià, 
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models.

Interspeech2023 Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux
Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis.

Interspeech2023 Maureen de Seyssel, Marvin Lavechin, Hadrien Titeux, Arthur Thomas, Gwendal Virlet, Andrea Santos Revilla, Guillaume Wisniewski, Bogdan Ludusan, Emmanuel Dupoux
ProsAudit, a prosodic benchmark for self-supervised speech models.

Interspeech2023 Yaya Sy, William N. Havard, Marvin Lavechin, Emmanuel Dupoux, Alejandrina Cristià, 
Measuring Language Development From Child-centered Recordings.

NeurIPS2023 Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz 0001, Yossi Adi, 
Textually Pretrained Speech Language Models.

EMNLP2023 Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoît Sagot, Emmanuel Dupoux
Generative Spoken Language Model based on continuous word-sized audio tokens.

EMNLP-Findings2023 Robin Algayres, Pablo Diego-Simon, Benoît Sagot, Emmanuel Dupoux
XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words.

Interspeech2022 Robin Algayres, Adel Nabli, Benoît Sagot, Emmanuel Dupoux
Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning.

Interspeech2022 Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski, 
Probing phoneme, language and speaker information in unsupervised speech representations.

ACL2022 Eugene Kharitonov, Ann Lee 0001, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu, 
Text-Free Prosody-Aware Generative Spoken Language Modeling.

EMNLP2022 Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi, 
Textless Speech Emotion Conversion using Discrete & Decomposed Representations.

Interspeech2021 Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux
The Zero Resource Speech Challenge 2021: Spoken Language Modelling.

Interspeech2021 Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

ACL2021 Changhan Wang, Morgane Rivière, Ann Lee 0001, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Miguel Pino, Emmanuel Dupoux
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation.

TASLP2020 Odette Scharenborg, Lucas Ondel, Shruti Palaskar, Philip Arthur, Francesco Ciannella, Mingxing Du, Elin Larsen, Danny Merkx, Rachid Riad, Liming Wang, Emmanuel Dupoux, Laurent Besacier, Alan W. Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stüker, Pierre Godard, Markus Müller 0001, 
Speech Technology for Unwritten Languages.

ICASSP2020 Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux
Libri-Light: A Benchmark for ASR with Limited or No Supervision.

ICASSP2020 Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux
Unsupervised Pretraining Transfers Well Across Languages.

Interspeech2020 Robin Algayres, Mohamed Salah Zaïem, Benoît Sagot, Emmanuel Dupoux
Evaluating the Reliability of Acoustic Speech Embeddings.

#177  | Christian Fügen | DBLP Google Scholar  
By venue: Interspeech: 16; ICASSP: 10; NAACL: 1
By year: 2024: 3; 2023: 2; 2022: 2; 2021: 12; 2020: 6; 2019: 2
ISCA sessions: novel neural network architectures for asr: 2; speech synthesis: 2; speech recognition: 1; multi-talker methods in speech processing: 1; summarization, entity extraction, evaluation and others: 1; self-supervised, semi-supervised, adaptation and data augmentation for asr: 1; speech signal analysis and representation: 1; language and lexical modeling for asr: 1; streaming for asr/rnn transducers: 1; speech coding and privacy: 1; resource-constrained asr: 1; asr neural network architectures: 1; new trends in self-supervised speech processing: 1; lexicon and language model for speech recognition: 1
IEEE keywords: speech recognition: 10; natural language processing: 4; task analysis: 2; recurrent neural nets: 2; rnn t: 2; decoding: 2; acoustic modeling: 2; hybrid speech recognition: 2; llama: 1; large language model: 1; question answering (information retrieval): 1; contextual biasing: 1; adaptation models: 1; large language models: 1; leveraging unpaired text: 1; streaming end to end speech recognition: 1; language model fusion: 1; convolutional neural nets: 1; speech enhancement: 1; internet telephony: 1; packet loss concealment: 1; long short term memory: 1; speech coding: 1; speech intelligibility: 1; voice over ip: 1; convolutional codes: 1; transducers: 1; recurrent transducer: 1; automatic speech recognition: 1; smart devices: 1; microprocessors: 1; computer architecture: 1; on device inference: 1; asr: 1; signal representation: 1; pseudo labeling: 1; contrastive learning: 1; supervised learning: 1; text analysis: 1; distant supervision: 1; unsupervised and semi supervised learning: 1; audio signal processing: 1; unsupervised learning: 1; zero and low resource asr.: 1; dataset: 1; graphemic pronunciation learning: 1; chenones: 1; transformer: 1; recurrent neural networks: 1; class based language model: 1; token passing: 1; end to end speech recognition: 1; weighted finite state transducer: 1
Most publications (all venues) at: 2021: 15; 2020: 8; 2007: 6; 2006: 6; 2024: 5

Affiliations
URLs

Recent publications

ICASSP2024 Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer, 
Prompting Large Language Models with Speech Recognition Abilities.

ICASSP2024 Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen
End-to-End Speech Recognition Contextualization with Large Language Models.

NAACL2024 Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer, 
AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs.

Interspeech2023 Pingchuan Ma 0001, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic, 
Streaming Audio-Visual Speech Recognition with Alignment Regularization.

Interspeech2023 Ju Lin, Niko Moritz, Ruiming Xie, Kaustubh Kalgaonkar, Christian Fuegen, Frank Seide, 
Directional Speech Recognition for Speaker Disambiguation and Cross-talk Suppression.

Interspeech2022 Suyoun Kim, Duc Le, Weiyi Zheng, Tarun Singh, Abhinav Arora, Xiaoyu Zhai, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer, 
Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric.

Interspeech2022 Weiyi Zheng, Alex Xiao, Gil Keren, Duc Le, Frank Zhang 0001, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed, 
Scaling ASR Improves Zero and Few Shot Learning.

ICASSP2021 Suyoun Kim, Yuan Shangguan, Jay Mahadeokar, Antoine Bruguier, Christian Fuegen, Michael L. Seltzer, Duc Le, 
Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer.

ICASSP2021 Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, Christian Fuegen
A Time-Domain Convolutional Recurrent Network for Packet Loss Concealment.

ICASSP2021 Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra, 
Memory-Efficient Speech Recognition on Smart Devices.

ICASSP2021 Alex Xiao, Christian Fuegen, Abdelrahman Mohamed, 
Contrastive Semi-Supervised Learning for ASR.

Interspeech2021 Anurag Kumar 0003, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen
Do Sound Event Representations Generalize to Other Audio Tasks? A Case Study in Audio Transfer Learning.

Interspeech2021 Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer, 
Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding.

Interspeech2021 Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer, 
Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion.

Interspeech2021 Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, Christian Fuegen
A Two-Stage Approach to Speech Bandwidth Extension.

Interspeech2021 Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer, 
Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios.

Interspeech2021 Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer, 
Dissecting User-Perceived Latency of On-Device E2E Speech Recognition.

Interspeech2021 Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer, 
Dynamic Encoder Transducer: A Flexible Solution for Trading Off Accuracy for Latency.

Interspeech2021 Chunyang Wu, Zhiping Xiu, Yangyang Shi, Ozlem Kalinli, Christian Fuegen, Thilo Köhler, Qing He, 
Transformer-Based Acoustic Modeling for Streaming Speech Synthesis.

ICASSP2020 Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux, 
Libri-Light: A Benchmark for ASR with Limited or No Supervision.

#178  | Grant P. Strimel | DBLP Google Scholar  
By venue: ICASSP: 13; Interspeech: 12; ACL-Findings: 1; AAAI: 1
By year: 2024: 2; 2023: 8; 2022: 9; 2021: 5; 2020: 2; 2018: 1
ISCA sessions: resource-constrained asr: 3; speech recognition: 1; novel transformer models for asr: 1; spoken term detection and voice search: 1; streaming asr: 1; neural network training methods for asr: 1; spoken language processing: 1; summarization, semantic analysis and classification: 1; computational resource constrained speech recognition: 1; spoken dialogue systems and conversational analysis: 1
IEEE keywords: speech recognition: 10; personalization: 6; transducers: 5; contextual biasing: 4; end to end: 4; error analysis: 3; automatic speech recognition: 3; neural transducer: 3; adaptation models: 3; rnn t: 3; attention: 3; spoken language understanding: 3; conformer: 2; runtime: 2; costs: 2; contact name recognition: 2; semantics: 2; switches: 2; inference optimization: 2; decoding: 2; signal classification: 2; interactive systems: 2; recurrent neural nets: 2; natural language processing: 2; audio signal processing: 2; max margin: 1; end to end speech recognition models: 1; sequence discriminative criterion: 1; minimum word error rate training: 1; computational modeling: 1; logic gates: 1; personalized speech recognition: 1; context modeling: 1; early late fusion: 1; dialog act: 1; contextual adapter: 1; tail: 1; neural transducers: 1; encoding: 1; end to end asr: 1; tokenization: 1; degradation: 1; multilingual: 1; performance evaluation: 1; human computer interaction: 1; measurement uncertainty: 1; fuses: 1; pronunciation: 1; wake word spotting: 1; neural biasing: 1; computer architecture: 1; cache storage: 1; streaming: 1; latency: 1; latency reduction: 1; on device: 1; semantic beam search: 1; multi task learning: 1; rnn transducer: 1; speech coding: 1; autoregressive processes: 1; e2e: 1; gating: 1; dialogue act: 1; pitch: 1; speech synthesis: 1; prosody: 1; optimisation: 1; recurrent neural network transducer (rnn t): 1; on device speech recognition: 1
Most publications (all venues) at: 2023: 11; 2022: 11; 2021: 6; 2024: 2; 2020: 2

Affiliations
URLs

Recent publications

ICASSP2024 Rupak Vignesh Swaminathan, Grant P. Strimel, Ariya Rastrow, Sri Harish Mallidi, Kai Zhen, Hieu Duy Nguyen, Nathan Susanj, Athanasios Mouchtaris, 
Max-Margin Transducer Loss: Improving Sequence-Discriminative Training Using a Large-Margin Learning Strategy.

ACL-Findings2024 Aditya Gourav, Jari Kolehmainen, Prashanth Gurunath Shivakumar, Yile Gu, Grant P. Strimel, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko, 
Multi-Modal Retrieval For Large Language Model Based Speech Recognition.

ICASSP2023 Anastasios Alexandridis, Kanthashree Mysore Sathyendra, Grant P. Strimel, Feng-Ju Chang, Ariya Rastrow, Nathan Susanj, Athanasios Mouchtaris, 
Gated Contextual Adapters For Selective Contextual Biasing In Neural Transducers.

ICASSP2023 Feng-Ju Chang, Thejaswi Muniyappa, Kanthashree Mysore Sathyendra, Kai Wei, Grant P. Strimel, Ross McGowan, 
Dialog Act Guided Contextual Adapter for Personalized Speech Recognition.

ICASSP2023 Xuandi Fu, Kanthashree Mysore Sathyendra, Ankur Gandhe, Jing Liu, Grant P. Strimel, Ross McGowan, Athanasios Mouchtaris, 
Robust Acoustic And Semantic Contextual Biasing In Neural Transducers For Speech Recognition.

ICASSP2023 Markus Müller, Anastasios Alexandridis, Zach Trozenski, Joel Whiteman, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, 
Multilingual End-To-End Spoken Language Understanding For Ultra-Low Footprint Applications.

ICASSP2023 Rahul Pandey, Roger Ren, Qi Luo, Jing Liu, Ariya Rastrow, Ankur Gandhe, Denis Filimonov, Grant P. Strimel, Andreas Stolcke, Ivan Bulyko, 
Procter: Pronunciation-Aware Contextual Adapter For Personalized Speech Recognition In Neural Transducers.

ICASSP2023 Saumya Y. Sahai, Jing Liu, Thejaswi Muniyappa, Kanthashree Mysore Sathyendra, Anastasios Alexandridis, Grant P. Strimel, Ross McGowan, Ariya Rastrow, Feng-Ju Chang, Athanasios Mouchtaris, Siegfried Kunzmann, 
Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition.

Interspeech2023 Yiting Lu, Philip Harding, Kanthashree Mysore Sathyendra, Sibo Tong, Xuandi Fu, Jing Liu, Feng-Ju Chang, Simon Wiesler, Grant P. Strimel
Model-Internal Slot-triggered Biasing for Domain Expansion in Neural Transducer ASR Models.

Interspeech2023 Martin Radfar, Paulina Lyskawa, Brandon Trujillo, Yi Xie, Kai Zhen, Jahn Heymann, Denis Filimonov, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, 
Conmer: Streaming Conformer Without Self-attention for Interactive Voice Assistants.

ICASSP2022 Anastasios Alexandridis, Grant P. Strimel, Ariya Rastrow, Pavel Kveton, Jon Webb, Maurizio Omologo, Siegfried Kunzmann, Athanasios Mouchtaris, 
Caching Networks: Capitalizing on Common Speech for ASR.

ICASSP2022 Anastasios Alexandridis, Kanthashree Mysore Sathyendra, Grant P. Strimel, Pavel Kveton, Jon Webb, Athanasios Mouchtaris, 
TINYS2I: A Small-Footprint Utterance Classification Model with Contextual Support for On-Device SLU.

ICASSP2022 Xuandi Fu, Feng-Ju Chang, Martin Radfar, Kai Wei, Jing Liu, Grant P. Strimel, Kanthashree Mysore Sathyendra, 
Multi-Task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding.

ICASSP2022 Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, Jing Liu, Jinru Su, Grant P. Strimel, Athanasios Mouchtaris, Siegfried Kunzmann, 
Contextual Adapters for Personalized Speech Recognition in Neural Transducers.

ICASSP2022 Kai Wei, Dillon Knox, Martin Radfar, Thanh Tran, Markus Müller, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Maurizio Omologo, 
A Neural Prosody Encoder for End-to-End Dialogue Act Classification.

Interspeech2022 Christin Jose, Joe Wang, Grant P. Strimel, Mohammad Omar Khursheed, Yuriy Mishchenko, Brian Kulis, 
Latency Control for Keyword Spotting.

Interspeech2022 Martin Radfar, Rohit Barnwal, Rupak Vignesh Swaminathan, Feng-Ju Chang, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, 
ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition.

Interspeech2022 Yi Xie, Jonathan Macoskey, Martin Radfar, Feng-Ju Chang, Brian John King, Ariya Rastrow, Athanasios Mouchtaris, Grant P. Strimel
Compute Cost Amortized Transformer for Streaming ASR.

AAAI2022 Thanh Tran, Kai Wei, Weitong Ruan, Ross McGowan, Nathan Susanj, Grant P. Strimel
Adaptive Global-Local Context Fusion for Multi-Turn Spoken Language Understanding.

ICASSP2021 Jon Macoskey, Grant P. Strimel, Ariya Rastrow, 
Bifocal Neural ASR: Exploiting Keyword Spotting for Inference Optimization.

#179  | Tom Ko | DBLP Google Scholar  
By venue: Interspeech: 15; ICASSP: 4; ACL: 3; TASLP: 1; ICLR: 1; IJCAI: 1; ACL-Findings: 1
By year: 2024: 3; 2023: 7; 2022: 7; 2021: 3; 2020: 3; 2019: 1; 2018: 2
ISCA sessions: spoken term detection & voice search: 2; analysis of speech and audio signals: 1; invariant and robust pre-trained acoustic models: 1; spoken language translation, information retrieval, summarization, resources, and evaluation: 1; novel models and training methods for asr: 1; speech synthesis: 1; spoken language processing: 1; asr: 1; language and lexical modeling for asr: 1; spoken term detection: 1; speech classification: 1; speaker recognition: 1; application of asr in medical practice: 1; speaker verification using neural network methods: 1
IEEE keywords: task analysis: 3; text analysis: 2; speaker verification: 2; natural language processing: 2; electronic mail: 1; chatbots: 1; chatgpt: 1; engines: 1; noise measurement: 1; multimodal learning: 1; pipelines: 1; audio language dataset: 1; data models: 1; machine translation: 1; speech translation: 1; benchmark testing: 1; data augmentation: 1; mix at three levels: 1; transformer: 1; speaker identification: 1; speaker recognition: 1; speech recognition: 1; domain adaptation: 1; machine speech chain: 1; speech synthesis: 1; meta learning: 1; extraterrestrial measurements: 1; prototypical networks: 1; target recognition: 1
Most publications (all venues) at: 2023: 12; 2022: 8; 2021: 8; 2024: 7; 2020: 4

Affiliations
URLs

Recent publications

TASLP2024 Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang 0001, 
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

ICLR2024 Qianqian Dong, Zhiying Huang, Qi Tian 0001, Chen Xu 0008, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li 0001, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu 0015, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang 0002, 
PolyVoice: Language Models for Speech to Speech Translation.

ACL2024 Zhichao Huang, Chutong Meng, Tom Ko
RepCodec: A Speech Representation Codec for Speech Tokenization.

ICASSP2023 Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou, 
M3ST: Mix at Three Levels for Speech Translation.

Interspeech2023 Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang 0006, H. Lilian Tang, Mark D. Plumbley, Volkan Kiliç, Wenwu Wang 0001, 
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention.

Interspeech2023 Chutong Meng, Junyi Ao, Tom Ko, Mingxuan Wang, Haizhou Li 0001, 
CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning.

Interspeech2023 Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao, 
GigaST: A 10,000-hour Pseudo Speech Translation Corpus.

IJCAI2023 Chen Xu 0008, Rong Ye, Qianqian Dong, Chengqi Zhao, Tom Ko, Mingxuan Wang, Tong Xiao, Jingbo Zhu, 
Recent Advances in Direct Speech-to-text Translation.

ACL2023 Chen Xu 0008, Xiaoqian Liu, Xiaowen Liu, Qingxuan Sun, Yuhao Zhang, Murun Yang, Qianqian Dong, Tom Ko, Mingxuan Wang, Tong Xiao, Anxiang Ma, Jingbo Zhu, 
CTC-based Non-autoregressive Speech Translation.

ACL-Findings2023 Dong Zhang, Rong Ye, Tom Ko, Mingxuan Wang, Yaqian Zhou, 
DUB: Discrete Unit Back-translation for Speech Translation.

ICASSP2022 Rui Wang 0073, Junyi Ao, Long Zhou, Shujie Liu 0001, Zhihua Wei 0001, Tom Ko, Qing Li 0001, Yu Zhang 0006, 
Multi-View Self-Attention Based Transformer for Speaker Recognition.

ICASSP2022 Fengpeng Yue, Yan Deng, Lei He 0005, Tom Ko, Yu Zhang 0006, 
Exploring Machine Speech Chain For Domain Adaptation.

Interspeech2022 Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu 0001, Haizhou Li 0001, Tom Ko, Lirong Dai 0001, Jinyu Li 0001, Yao Qian, Furu Wei, 
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data.

Interspeech2022 Qibing Bai, Tom Ko, Yu Zhang 0006, 
A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis.

Interspeech2022 Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang 0006, 
Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation.

Interspeech2022 Rui Wang 0073, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei 0001, Yu Zhang 0006, Tom Ko, Haizhou Li 0001, 
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT.

ACL2022 Junyi Ao, Rui Wang 0073, Long Zhou, Chengyi Wang 0002, Shuo Ren, Yu Wu 0012, Shujie Liu 0001, Tom Ko, Qing Li 0001, Yu Zhang 0006, Zhihua Wei 0001, Yao Qian, Jinyu Li 0001, Furu Wei, 
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing.

Interspeech2021 Yangbin Chen, Tom Ko, Jianping Wang 0001, 
A Meta-Learning Approach for User-Defined Spoken Term Classification with Varying Classes and Examples.

Interspeech2021 Qiushi Huang, Tom Ko, H. Lilian Tang, Xubo Liu, Bo Wu 0018, 
Token-Level Supervised Contrastive Learning for Punctuation Restoration.

Interspeech2021 Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-yi Lee, Lei Xie, 
Auto-KWS 2021 Challenge: Task, Datasets, and Baselines.

#180  | Nicholas Cummins | DBLP Google Scholar  
By venue: Interspeech: 21; ICASSP: 4; TASLP: 1
By year: 2024: 1; 2023: 3; 2022: 2; 2021: 3; 2020: 7; 2019: 6; 2018: 4
ISCA sessions: speech and language in health: 5; speech in health: 3; speech emotion recognition: 2; attention mechanism for speaker state recognition: 2; diverse modes of speech acquisition and processing: 1; alzheimer’s dementia recognition through spontaneous speech: 1; social signals detection and speaker traits analysis: 1; training strategy for speech emotion recognition: 1; speech signal characterization: 1; speech and language analytics for mental health: 1; text analysis, multilingual issues and evaluation in speech synthesis: 1; speech pathology, depression, and medical applications: 1; speaker state and trait: 1
IEEE keywords: depression: 2; emotion recognition: 2; sociology: 1; speech analysis: 1; mental health: 1; adaptation models: 1; longitudinal assessment: 1; contrastive training: 1; language analysis: 1; disentangled representation learning: 1; audio generation: 1; guided representation learning: 1; audio signal processing: 1; and generative adversarial neural network: 1; signal representation: 1; temporal convolutional networks: 1; human computer interaction: 1; electroencephalography: 1; medical signal processing: 1; hierarchical attention mechanism: 1; recurrent neural nets: 1; eeg signals: 1; signal classification: 1; speech recognition: 1; monotonic attention: 1; mean square error methods: 1; attention transfer: 1; hierarchical attention: 1; psychology: 1; behavioural sciences computing: 1; state of mind: 1; mood congruency: 1; sentiment analysis: 1; context modeling: 1; hierarchical models: 1; recurrent neural networks: 1; gated recurrent units: 1; attention mechanisms: 1; mood: 1; logic gates: 1; task analysis: 1
Most publications (all venues) at: 2017: 26; 2018: 21; 2019: 17; 2020: 15; 2022: 11


Recent publications

ICASSP2024 Paula Andrea Pérez-Toro, Judith Dineley, Agnieszka Kaczkowska, Pauline Conde, Yuezhou Zhang, Faith Matcham, Sara Siddi, Josep Maria Haro, Stuart Bruce, Til Wykes, Raquel Bailón, Srinivasan Vairavan, Richard J. B. Dobson, Andreas K. Maier, Elmar Nöth, Juan Rafael Orozco-Arroyave, Vaibhav A. Narayan, Nicholas Cummins
Longitudinal Modeling of Depression Shifts Using Speech and Language.

Interspeech2023 Edward L. Campbell, Judith Dineley, Pauline Conde, Faith Matcham, Katie M. White, Carolin Oetzmann, Sara Simblett, Stuart Bruce, Amos A. Folarin, Til Wykes, Srinivasan Vairavan, Richard J. B. Dobson, Laura Docío Fernández, Carmen García-Mateo, Vaibhav A. Narayan, Matthew Hotopf, Nicholas Cummins
Classifying depression symptom severity: Assessment of speech representations in personalized and generalized machine learning models.

Interspeech2023 Judith Dineley, Ewan Carr, Faith Matcham, Johnny Downs, Richard J. B. Dobson, Thomas F. Quatieri, Nicholas Cummins
Towards robust paralinguistic assessment for real-world mobile health (mHealth) monitoring: an initial study of reverberation effects on speech.

Interspeech2023 Salvatore Fara, Orlaith Hickey, Alexandra Livia Georgescu, Stefano Goria, Emilia Molimpakis, Nicholas Cummins
Bayesian Networks for the robust and unbiased prediction of depression and its symptoms utilizing speech and multimodal data.

Interspeech2022 Salvatore Fara, Stefano Goria, Emilia Molimpakis, Nicholas Cummins
Speech and the n-Back task as a lens into depression. How combining both may allow us to isolate different core symptoms of depression.

Interspeech2022 Bahman Mirheidari, André Bittar, Nicholas Cummins, Johnny Downs, Helen L. Fisher, Heidi Christensen, 
Automatic Detection of Expressed Emotion from Five-Minute Speech Samples: Challenges and Opportunities.

TASLP2021 Kazi Nazmul Haque, Rajib Rana, Jiajun Liu, John H. L. Hansen, Nicholas Cummins, Carlos Busso, Björn W. Schuller, 
Guided Generative Adversarial Neural Network for Representation Learning and Audio Generation Using Fewer Labelled Audio Data.

ICASSP2021 Chao Li, Boyang Chen, Ziping Zhao 0001, Nicholas Cummins, Björn W. Schuller, 
Hierarchical Attention-Based Temporal Convolutional Networks for Eeg-Based Emotion Recognition.

Interspeech2021 Judith Dineley, Grace Lavelle, Daniel Leightley, Faith Matcham, Sara Siddi, Maria Teresa Peñarrubia-María, Katie M. White, Alina Ivan, Carolin Oetzmann, Sara Simblett, Erin Dawe-Lane, Stuart Bruce, Daniel Stahl, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Amos A. Folarin, Josep Maria Haro, Til Wykes, Richard J. B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Björn W. Schuller, Nicholas Cummins, RADAR-CNS Consortium, 
Remote Smartphone-Based Speech Collection: Acceptance and Barriers in Individuals with Major Depressive Disorder.

ICASSP2020 Ziping Zhao 0001, Zhongtian Bao, Zixing Zhang 0001, Nicholas Cummins, Haishuai Wang, Björn W. Schuller, 
Hierarchical Attention Transfer Networks for Depression Assessment from Speech.

Interspeech2020 Merlin Albes, Zhao Ren, Björn W. Schuller, Nicholas Cummins
Squeeze for Sneeze: Compact Neural Networks for Cold and Flu Recognition.

Interspeech2020 Alice Baird, Nicholas Cummins, Sebastian Schnieder, Jarek Krajewski, Björn W. Schuller, 
An Evaluation of the Effect of Anxiety on Speech - Computational Prediction of Anxiety from Sustained Vowels.

Interspeech2020 Nicholas Cummins, Yilin Pan, Zhao Ren, Julian Fritsch, Venkata Srikanth Nallanthighal, Heidi Christensen, Daniel Blackburn, Björn W. Schuller, Mathew Magimai-Doss, Helmer Strik, Aki Härmä, 
A Comparison of Acoustic and Linguistics Methodologies for Alzheimer's Dementia Recognition.

Interspeech2020 Adria Mallol-Ragolta, Nicholas Cummins, Björn W. Schuller, 
An Investigation of Cross-Cultural Semi-Supervised Learning for Continuous Affect Recognition.

Interspeech2020 Zhao Ren, Jing Han 0010, Nicholas Cummins, Björn W. Schuller, 
Enhancing Transferability of Black-Box Adversarial Attacks via Lifelong Learning for Speech Emotion Recognition Models.

Interspeech2020 Ziping Zhao 0001, Qifei Li, Nicholas Cummins, Bin Liu 0041, Haishuai Wang, Jianhua Tao 0001, Björn W. Schuller, 
Hybrid Network Feature Extraction for Depression Assessment from Speech.

ICASSP2019 Lukas Stappen, Nicholas Cummins, Eva-Maria Meßner, Harald Baumeister, Judith Dineley, Björn W. Schuller, 
Context Modelling Using Hierarchical Attention Networks for Sentiment and Self-assessed Emotion Detection in Spoken Narratives.

Interspeech2019 Alice Baird, Shahin Amiriparian, Nicholas Cummins, Sarah Sturmbauer, Johanna Janson, Eva-Maria Meßner, Harald Baumeister, Nicolas Rohleder, Björn W. Schuller, 
Using Speech to Predict Sequentially Measured Cortisol Levels During a Trier Social Stress Test.

Interspeech2019 Adria Mallol-Ragolta, Ziping Zhao 0001, Lukas Stappen, Nicholas Cummins, Björn W. Schuller, 
A Hierarchical Attention Network-Based Approach for Depression Detection from Transcribed Clinical Interviews.

Interspeech2019 Maximilian Schmitt, Nicholas Cummins, Björn W. Schuller, 
Continuous Emotion Recognition in Speech - Do We Need Recurrence?

#181  | Mads Græsbøll Christensen | DBLP Google Scholar  
By venue: ICASSP: 11; TASLP: 7; Interspeech: 5; SpeechComm: 3
By year: 2024: 2; 2023: 5; 2022: 2; 2021: 4; 2020: 5; 2019: 8
ISCA sessions: speech enhancement: 2; analysis of speech and audio signals: 1; speech signal analysis and representation: 1; single-channel speech enhancement: 1
IEEE keywords: speech enhancement: 6; estimation: 4; frequency estimation: 4; loudspeakers: 3; bayes methods: 3; matrix decomposition: 3; audio signal processing: 3; bayesian permutation training: 2; variational autoencoder: 2; deep representation learning: 2; headphones: 2; harmonic analysis: 2; adaptive filters: 2; causality: 2; noise psd estimation: 2; eigenvalues and eigenfunctions: 2; computational complexity: 2; conjugate gradient methods: 2; hidden markov models: 2; pre whitening: 2; maximum likelihood estimation: 2; fundamental frequency: 2; variable span trade off filter: 2; artificial neural networks: 1; representation learning: 1; decoding: 1; signal representation: 1; adversarial training: 1; bandwidth: 1; harmonics: 1; anc: 1; adaptation models: 1; attenuation: 1; predictive models: 1; speech prediction: 1; probability: 1; wiener filters: 1; speech presence probabilities: 1; dictionaries: 1; stochastic processes: 1; personal sound zone: 1; expectation maximization: 1; semidefinite relaxation: 1; complex gaussian mixture model: 1; biconvex optimization: 1; optimization: 1; perturbation methods: 1; robustness: 1; signal processing algorithms: 1; uncertainty: 1; training data: 1; noise reduction: 1; frequency bin wise: 1; frequency domain analysis: 1; a posteriori probability: 1; speech presence probability: 1; gated recurrent units: 1; voice activity detection: 1; fixed filter anc: 1; anc headphones: 1; long term linear prediction: 1; active noise control: 1; feedforward: 1; filtering theory: 1; speech attenuation: 1; deep learning (artificial intelligence): 1; conjugate gradient method: 1; personal sound zones: 1; physically meaningful constraints: 1; reverberation: 1; subspace based approach: 1; generalized eigenvalue decomposition: 1; poisson distribution: 1; signal denoising: 1; hidden markov model (hmm): 1; non negative matrix factorization (nmf): 1; least mean squares methods: 1; mixture models: 1; minimum mean square error (mmse): 1; poisson mixture model (pmm): 1; auto regressive model: 1; levinson durbin recursion: 1; generalized analysis by synthesis: 1; dnn: 1; parameter estimation: 1; autoregressive processes: 1; maximum likelihood: 1; spectral shape: 1; coloured noise: 1; colored noise: 1; least squares: 1; lcmv filter: 1; conjugate gradient: 1; sound zone control: 1; reduced rank: 1; distortion: 1; acoustic variables control: 1; multiple signal classification: 1; delays: 1; music information retrieval: 1; multi pitch estimation: 1; sterephonic signal analysis: 1; model selection: 1; vector quantization: 1; instruments: 1; multi channel pitch estimation: 1; autoregressive model: 1; hearing aids: 1; kalman filters: 1; kalman filter: 1; pitch estimation: 1; speech intelligibility: 1; binaural enhancement: 1; markov process: 1; tracking: 1; correlation methods: 1; fundamental frequency or pitch tracking: 1; harmonic model: 1; harmonic order: 1; voiced unvoiced detection: 1; markov processes: 1; gross error rate: 1; estimation theory: 1; spectral flatness measure: 1; auditory system: 1; human auditory system: 1; measurement: 1; personal sound: 1; microphones: 1; acoustic distortion: 1; masking effect: 1; sound zones: 1; pattern classification: 1; diseases: 1; patient monitoring: 1; segmentation: 1; parkinson’s disease: 1; quality control: 1; bayesian nonparametric: 1; infinite hmm: 1; signal classification: 1
Most publications (all venues) at: 2016: 21; 2019: 20; 2014: 20; 2013: 18; 2010: 18


Recent publications

TASLP2024 Yang Xiang, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen
A Two-Stage Deep Representation Learning-Based Speech Enhancement Method Using Variational Autoencoder and Adversarial Training.

ICASSP2024 Yurii Iotov, Sidsel Marie Nørholm, Peter John McCutcheon, Mads Græsbøll Christensen
Improving Speech Attenuation in Headphones using Harmonic Model Decomposition and Multiple-Frequency ANC.

SpeechComm2023 Alfredo Esquivel Jaramillo, Jesper Kjær Nielsen, Mads Græsbøll Christensen
An adaptive autoregressive pre-whitener for speech and acoustic signals based on parametric NMF.

TASLP2023 Jesper Kjær Nielsen, Mads Græsbøll Christensen, Jesper Bünsow Boldt, 
An Analysis of Traditional Noise Power Spectral Density Estimators Based on the Gaussian Stochastic Volatility Model.

TASLP2023 Junqing Zhang, Liming Shi, Mads Græsbøll Christensen, Wen Zhang 0002, Lijun Zhang 0004, Jingdong Chen, 
CGMM-Based Sound Zone Generation Using Robust Pressure Matching With ATF Perturbation Constraints.

ICASSP2023 Shuai Tao, Himavanth Reddy, Jesper Rindom Jensen, Mads Græsbøll Christensen
Frequency Bin-Wise Single Channel Speech Presence Probability Estimation Using Multiple DNNS.

Interspeech2023 Debang Liu, Tianqi Zhang, Mads Græsbøll Christensen, Ying Wei, Zeliang An, 
Audio-Visual Fusion using Multiscale Temporal Convolutional Attention for Time-Domain Speech Separation.

ICASSP2022 Yurii Iotov, Sidsel Marie Nørholm, Valiantsin Belyi, Mads Dyrholm, Mads Græsbøll Christensen
Computationally Efficient Fixed-Filter ANC for Speech Based on Long-Term Prediction for Headphone Applications.

ICASSP2022 Yang Xiang, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen
A Bayesian Permutation Training Deep Representation Learning Method for Speech Enhancement with Variational Autoencoder.

SpeechComm2021 Amir Hossein Poorjam, Mathew Shaji Kavalekalam, Liming Shi, Yordan P. Raykov, Jesper Rindom Jensen, Max A. Little, Mads Græsbøll Christensen
Automatic quality control and enhancement for voice-based remote Parkinson's disease detection.

TASLP2021 Liming Shi, Taewoong Lee, Lijun Zhang 0004, Jesper Kjær Nielsen, Mads Græsbøll Christensen
Generation of Personal Sound Zones With Physical Meaningful Constraints and Conjugate Gradient Method.

ICASSP2021 Yang Xiang, Liming Shi, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen
A Novel NMF-HMM Speech Enhancement Algorithm Based on Poisson Mixture Model.

Interspeech2021 Alfredo Esquivel Jaramillo, Jesper Kjær Nielsen, Mads Græsbøll Christensen
Speech Decomposition Based on a Hybrid Speech Model and Optimal Segmentation.

SpeechComm2020 Jesper Rindom Jensen, Sam Karimian-Azari, Mads Græsbøll Christensen, Jacob Benesty, 
Harmonic beamformers for speech enhancement and dereverberation in the time domain.

ICASSP2020 Zihao Cui, Changchun Bao, Jesper Kjær Nielsen, Mads Græsbøll Christensen
Autoregressive Parameter Estimation with Dnn-Based Pre-Processing.

ICASSP2020 Alfredo Esquivel Jaramillo, Andreas Jakobsson, Jesper Kjær Nielsen, Mads Græsbøll Christensen
Robust Fundamental Frequency Estimation in Coloured Noise.

ICASSP2020 Liming Shi, Taewoong Lee, Lijun Zhang 0004, Jesper Kjær Nielsen, Mads Græsbøll Christensen
A Fast Reduced-Rank Sound Zone Control Algorithm Using The Conjugate Gradient Method.

Interspeech2020 Yang Xiang, Liming Shi, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen
An NMF-HMM Speech Enhancement Method Based on Kullback-Leibler Divergence.

TASLP2019 Martin Weiss Hansen, Jesper Rindom Jensen, Mads Græsbøll Christensen
Estimation of Fundamental Frequencies in Stereophonic Music Mixtures.

TASLP2019 Mathew Shaji Kavalekalam, Jesper Kjær Nielsen, Jesper Bünsow Boldt, Mads Græsbøll Christensen
Model-Based Speech Enhancement for Intelligibility Improvement in Binaural Hearing Aids.

#182  | Juan Rafael Orozco-Arroyave | DBLP Google Scholar  
By venue: Interspeech: 19; ICASSP: 5; SpeechComm: 2
By year: 2024: 1; 2023: 7; 2022: 3; 2021: 5; 2020: 3; 2019: 5; 2018: 2
ISCA sessions: speech and language in health: 4; speech and language analytics for medical applications: 2; speech, voice, and hearing disorders: 1; connecting speech-science and speech-technology for children’s speech: 1; technology for disordered speech: 1; show and tell: 1; the interspeech 2021 computational paralinguistics challenge (compare): 1; the adresso challenge: 1; disordered speech: 1; the interspeech 2020 computational paralinguistics challenge (compare): 1; speech perception in adverse listening conditions: 1; applications in language learning and healthcare: 1; social signals detection and speaker traits analysis: 1; speech and language analytics for mental health: 1; automatic detection and recognition of voice and speech disorders: 1
IEEE keywords: speech analysis: 3; diseases: 3; medical signal processing: 3; depression: 2; alzheimer’s disease: 2; parkinson’s disease: 2; gait analysis: 2; sociology: 1; mental health: 1; adaptation models: 1; longitudinal assessment: 1; contrastive training: 1; language analysis: 1; transfer learning: 1; artificial neural networks: 1; computational modeling: 1; emotion recognition: 1; forestry: 1; analytical models: 1; linguistic analysis: 1; medical disorders: 1; medical diagnostic computing: 1; natural language processing: 1; psen1–e280a: 1; acoustic analysis: 1; smartphones: 1; patient treatment: 1; deep learning (artificial intelligence): 1; smart phones: 1; handwriting analysis: 1; mixture models: 1; neurophysiology: 1; gmm ubm: 1; speaker recognition: 1; ivectors: 1; gaussian processes: 1
Most publications (all venues) at: 2021: 19; 2022: 15; 2018: 15; 2023: 14; 2019: 14

Affiliations
URLs

Recent publications

ICASSP2024 Paula Andrea Pérez-Toro, Judith Dineley, Agnieszka Kaczkowska, Pauline Conde, Yuezhou Zhang, Faith Matcham, Sara Siddi, Josep Maria Haro, Stuart Bruce, Til Wykes, Raquel Bailón, Srinivasan Vairavan, Richard J. B. Dobson, Andreas K. Maier, Elmar Nöth, Juan Rafael Orozco-Arroyave, Vaibhav A. Narayan, Nicholas Cummins, 
Longitudinal Modeling of Depression Shifts Using Speech and Language.

SpeechComm2023 Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Philipp Klumpp, Juan Camilo Vásquez-Correa, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave
Depression assessment in people with Parkinson's disease: The combination of acoustic features and natural language processing.

ICASSP2023 Paula Andrea Pérez-Toro, Dalia Rodríguez-Salas, Tomás Arias-Vergara, Sebastian P. Bayerl, Philipp Klumpp, Korbinian Riedhammer, Maria Schuster, Elmar Nöth, Andreas K. Maier, Juan Rafael Orozco-Arroyave
Transferring Quantified Emotion Knowledge for the Detection of Depression in Alzheimer's Disease Using Forestnets.

Interspeech2023 Soroosh Tayebi Arasteh, Cristian David Ríos-Urrego, Elmar Nöth, Andreas Maier 0001, Seung Hee Yang, Jan Rusz, Juan Rafael Orozco-Arroyave
Federated Learning for Secure Development of AI Models for Parkinson's Disease Detection Using Speech from Different Languages.

Interspeech2023 Tomás Arias-Vergara, Elizabeth Londoño-Mora, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas Maier 0001, 
Measuring Phonological Precision in Children with Cleft Lip and Palate.

Interspeech2023 Daniel Escobar-Grisales, Tomás Arias-Vergara, Cristian David Ríos-Urrego, Elmar Nöth, Adolfo M. García, Juan Rafael Orozco-Arroyave
An Automatic Multimodal Approach to Analyze Linguistic and Acoustic Cues on Parkinson's Disease Patients.

Interspeech2023 Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Franziska Braun, Florian Hönig, Carlos Andrés Tobón-Quintero, David Aguillón, Francisco Lopera, Liliana Hincapié-Henao, Maria Schuster, Korbinian Riedhammer, Andreas Maier 0001, Elmar Nöth, Juan Rafael Orozco-Arroyave
Automatic Assessment of Alzheimer's across Three Languages Using Speech and Language Features.

Interspeech2023 Cristian David Ríos-Urrego, Jan Rusz, Elmar Nöth, Juan Rafael Orozco-Arroyave
Automatic Classification of Hypokinetic and Hyperkinetic Dysarthria based on GMM-Supervectors.

Interspeech2022 Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier, Seung Hee Yang, 
Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition.

Interspeech2022 Paula Andrea Pérez-Toro, Philipp Klumpp, Abner Hernandez, Tomas Arias, Patricia Lillo, Andrea Slachevsky, Adolfo Martín García, Maria Schuster, Andreas K. Maier, Elmar Nöth, Juan Rafael Orozco-Arroyave
Alzheimer's Detection from English to Spanish Using Acoustic and Linguistic Embeddings.

Interspeech2022 P. Schäfer, Paula Andrea Pérez-Toro, Philipp Klumpp, Juan Rafael Orozco-Arroyave, Elmar Nöth, Andreas K. Maier, A. Abad, Maria Schuster, Tomás Arias-Vergara, 
CoachLea: an Android Application to Evaluate the Speech Production and Perception of Children with Hearing Loss.

ICASSP2021 Paula Andrea Pérez-Toro, Juan Camilo Vásquez-Correa, Tomás Arias-Vergara, Philipp Klumpp, M. Sierra-Castrillón, M. E. Roldán-López, David Aguillón, Liliana Hincapié-Henao, Carlos Andrés Tobón-Quintero, Tobias Bocklet, Maria Schuster, Juan Rafael Orozco-Arroyave, Elmar Nöth, 
Acoustic and Linguistic Analyses to Assess Early-Onset and Genetic Alzheimer's Disease.

ICASSP2021 Juan Camilo Vásquez-Correa, Tomás Arias-Vergara, Philipp Klumpp, Paula Andrea Pérez-Toro, Juan Rafael Orozco-Arroyave, Elmar Nöth, 
End-2-End Modeling of Speech and Gait from Patients with Parkinson's Disease: Comparison Between High Quality Vs. Smartphone Data.

Interspeech2021 Philipp Klumpp, Tobias Bocklet, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Sebastian P. Bayerl, Juan Rafael Orozco-Arroyave, Elmar Nöth, 
The Phonetic Footprint of Covid-19?

Interspeech2021 Paula Andrea Pérez-Toro, Sebastian P. Bayerl, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Philipp Klumpp, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Korbinian Riedhammer, 
Influence of the Interviewer on the Automatic Assessment of Alzheimer's Disease in the Context of the ADReSSo Challenge.

Interspeech2021 Juan Camilo Vásquez-Correa, Julian Fritsch, Juan Rafael Orozco-Arroyave, Elmar Nöth, Mathew Magimai-Doss, 
On Modeling Glottal Source Information for Phonation Assessment in Parkinson's Disease.

SpeechComm2020 Juan Camilo Vásquez-Correa, Tomás Arias-Vergara, Maria Schuster, Juan Rafael Orozco-Arroyave, Elmar Nöth, 
Parallel Representation Learning for the Classification of Pathological Speech: Studies on Parkinson's Disease and Cleft Lip and Palate.

ICASSP2020 Juan Camilo Vásquez-Correa, Tobias Bocklet, Juan Rafael Orozco-Arroyave, Elmar Nöth, 
Comparison of User Models Based on GMM-UBM and I-Vectors for Speech, Handwriting, and Gait Assessment of Parkinson's Disease Patients.

Interspeech2020 Philipp Klumpp, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Florian Hönig, Elmar Nöth, Juan Rafael Orozco-Arroyave
Surgical Mask Detection with Deep Recurrent Phonetic Models.

Interspeech2019 Tomás Arias-Vergara, Juan Rafael Orozco-Arroyave, Milos Cernak, Sandra Gollwitzer, Maria Schuster, Elmar Nöth, 
Phone-Attribute Posteriors to Evaluate the Speech of Cochlear Implant Users.

#183  | Emmanuel Vincent 0001 | DBLP Google Scholar  
By venue: Interspeech: 16; ICASSP: 6; TASLP: 3; SpeechComm: 1
By year: 2024: 1; 2023: 2; 2022: 3; 2021: 1; 2020: 12; 2019: 5; 2018: 2
ISCA sessions: voice privacy challenge: 3; speech synthesis: 1; neural-based speech and acoustic analysis: 1; trustworthy speech processing: 1; multi-channel speech enhancement and hearing aids: 1; diarization: 1; speech processing and analysis: 1; monaural source separation: 1; asr model training and strategies: 1; acoustic model adaptation for asr: 1; speech enhancement: 1; privacy in speech and audio interfaces: 1; robust speech recognition: 1; spatial and phase cues for source separation and speech recognition: 1
IEEE keywords: speaker recognition: 5; privacy: 3; data privacy: 3; voice conversion: 3; speech recognition: 3; pattern classification: 2; signal classification: 2; speech separation: 2; speaker verification: 2; weak labels: 2; pseudonymisation: 1; voice privacy: 1; anonymisation: 1; attack model: 1; protocols: 1; speech synthesis: 1; recording: 1; task analysis: 1; linkability: 1; speaker anonymization: 1; natural language processing: 1; acoustic scene classification: 1; knowledge based systems: 1; adversarial domain adaptation: 1; feature normalization: 1; moment matching: 1; image classification: 1; unsupervised learning: 1; bandwidth: 1; filterbank design: 1; bridges: 1; noise measurement: 1; iterative methods: 1; localization: 1; source separation: 1; deflation: 1; linkage attack: 1; audio embedding.: 1; prototypical network: 1; audio tagging: 1; audio signal processing: 1; triplet loss: 1; sound event detection: 1; confidence intervals: 1; pattern recognition: 1; jackknife estimates: 1; acoustic signal detection: 1; signal representation: 1; speech enhancement: 1; uncertainty propagation: 1; robustness: 1; reverberation: 1; i vector: 1; data distortion: 1
Most publications (all venues) at: 2020: 17; 2010: 17; 2017: 16; 2015: 16; 2022: 15

Affiliations
Inria Nancy - Grand Est, Villers-lès-Nancy, France

Recent publications

TASLP2024 Michele Panariello, Natalia A. Tomashenko, Xin Wang 0037, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas W. D. Evans, Emmanuel Vincent 0001, Junichi Yamagishi, 
The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation.

Interspeech2023 Sewade Ogun, Vincent Colotte, Emmanuel Vincent 0001
Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS.

Interspeech2023 Prerak Srivastava, Antoine Deleforge, Archontis Politis, Emmanuel Vincent 0001
How to (Virtually) Train Your Speaker Localizer.

TASLP2022 Brij Mohan Lal Srivastava, Mohamed Maouche, Md. Sahidullah, Emmanuel Vincent 0001, Aurélien Bellet, Marc Tommasi, Natalia A. Tomashenko, Xin Wang 0037, Junichi Yamagishi, 
Privacy and Utility of X-Vector Based Speaker Anonymization.

ICASSP2022 Michel Olvera, Emmanuel Vincent 0001, Gilles Gasso, 
On The Impact of Normalization Strategies in Unsupervised Adversarial Domain Adaptation for Acoustic Scene Classification.

Interspeech2022 Mohamed Maouche, Brij Mohan Lal Srivastava, Nathalie Vauquier, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent 0001
Enhancing Speech Privacy with Slicing.

Interspeech2021 Sunit Sivasankaran, Emmanuel Vincent 0001, Dominique Fohr, 
Explaining Deep Learning Models for Speech Enhancement.

ICASSP2020 Manuel Pariente, Samuele Cornell, Antoine Deleforge, Emmanuel Vincent 0001
Filterbank Design for End-to-end Speech Separation.

ICASSP2020 Sunit Sivasankaran, Emmanuel Vincent 0001, Dominique Fohr, 
SLOGD: Speaker Location Guided Deflation Approach to Speech Separation.

ICASSP2020 Brij Mohan Lal Srivastava, Nathalie Vauquier, Md. Sahidullah, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent 0001
Evaluating Voice Conversion-Based Privacy Protection against Informed Attackers.

ICASSP2020 Nicolas Turpault, Romain Serizel, Emmanuel Vincent 0001
Limitations of Weak Labels for Embedding and Tagging.

Interspeech2020 Samuele Cornell, Maurizio Omologo, Stefano Squartini, Emmanuel Vincent 0001
Detecting and Counting Overlapping Speakers in Distant Speech Scenarios.

Interspeech2020 Mathieu Hu, Laurent Pierron, Emmanuel Vincent 0001, Denis Jouvet, 
Kaldi-Web: An Installation-Free, On-Device Speech Recognition System.

Interspeech2020 Mohamed Maouche, Brij Mohan Lal Srivastava, Nathalie Vauquier, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent 0001
A Comparative Study of Speech Anonymization Metrics.

Interspeech2020 Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge, Emmanuel Vincent 0001
Asteroid: The PyTorch-Based Audio Source Separation Toolkit for Researchers.

Interspeech2020 Imran A. Sheikh, Emmanuel Vincent 0001, Irina Illina, 
On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data.

Interspeech2020 Brij Mohan Lal Srivastava, Natalia A. Tomashenko, Xin Wang 0037, Emmanuel Vincent 0001, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, Marc Tommasi, 
Design Choices for X-Vector Based Speaker Anonymization.

Interspeech2020 Natalia A. Tomashenko, Brij Mohan Lal Srivastava, Xin Wang 0037, Emmanuel Vincent 0001, Andreas Nautsch, Junichi Yamagishi, Nicholas W. D. Evans, Jose Patino 0001, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco, 
Introducing the VoicePrivacy Initiative.

Interspeech2020 M. A. Tugtekin Turan, Emmanuel Vincent 0001, Denis Jouvet, 
Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation.

SpeechComm2019 Nancy Bertin, Ewen Camberlein, Romain Lebarbenchon, Emmanuel Vincent 0001, Sunit Sivasankaran, Irina Illina, Frédéric Bimbot, 
VoiceHome-2, an extended corpus for multichannel speech processing in real homes.

#184  | Shoko Araki | DBLP Google Scholar  
By venue: ICASSP: 11, Interspeech: 11, TASLP: 3, SpeechComm: 1
By year: 2024: 4, 2023: 3, 2022: 2, 2021: 4, 2020: 9, 2019: 3, 2018: 1
ISCA sessionsspeech enhancement and intelligibility: 2speech coding and enhancement: 1multi-talker methods in speech processing: 1source separation: 1speech localization, enhancement, and quality assessment: 1noise reduction and intelligibility: 1multi-channel speech enhancement: 1targeted source separation: 1speech enhancement: 1speech intelligibility and quality: 1
IEEE keywordsspeech recognition: 5speech enhancement: 4blind source separation: 4reverberation: 4source separation: 4speaker recognition: 4target speech extraction: 3single channel speech enhancement: 2processing distortion: 2online processing: 2dynamic stream weights: 2source counting: 2neural network: 2degradation: 1nonlinear distortion: 1noise measurement: 1noise robust speech recognition: 1interference: 1delays: 1noise reduction: 1spatial regularization: 1optimization: 1dereverberation: 1transfer functions: 1real time systems: 1microphone array: 1joint training: 1data models: 1acoustic distortion: 1interpolation: 1feature aggregation: 1pre trained models: 1transformers: 1adaptation models: 1benchmark testing: 1self supervised learning: 1telephone sets: 1artificial neural networks: 1data mining: 1few shot adaptation: 1sound event: 1soundbeam: 1target sound extraction: 1recording: 1large ensemble: 1error analysis: 1automatic speech recognition: 1complementary neural language models: 1iterative lattice generation: 1lattice rescoring: 1context carry over: 1lattices: 1sensor fusion: 1audiovisual speaker localization: 1audio visual systems: 1audio signal processing: 1image fusion: 1data fusion: 1array signal processing: 1video signal processing: 1microphone arrays: 1time domain network: 1multi task loss: 1spatial features: 1frequency domain blind source separation: 1cayley transform: 1frequency domain analysis: 1transforms: 1sparse: 1fastica: 1geometry: 1unitary constraint: 1filtering theory: 1block coordinate descent method: 1independent vector analysis: 1generalized eigenvalue problem: 1gaussian noise: 1overdetermined: 1diarization: 1separation: 1smart devices: 1robustness: 1task analysis: 1audiovisual speaker tracking: 1kalman filters: 1tracking: 1recurrent neural nets: 1backprop kalman filter: 1backpropagation: 1adaptation: 1auxiliary feature: 1meeting diarization: 1
Most publications (all venues) at: 2021: 13, 2020: 13, 2004: 13, 2024: 11, 2012: 11

Affiliations
URLs

Recent publications

TASLP2024 Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance.

TASLP2024 Tetsuya Ueda, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki, Shoji Makino, 
Blind and Spatially-Regularized Online Joint Optimization of Source Separation, Dereverberation, and Noise Reduction.

ICASSP2024 Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts?

ICASSP2024 Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocký, 
Target Speech Extraction with Pre-Trained Self-Supervised Learning Models.

TASLP2023 Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki
SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning.

Interspeech2023 Shoko Araki, Ayako Yamamoto, Tsubasa Ochiai, Kenichi Arai, Atsunori Ogawa, Tomohiro Nakatani, Toshio Irino, 
Impact of Residual Noise and Artifacts in Speech Enhancement Errors on Intelligibility of Human and Machine.

Interspeech2023 Marc Delcroix, Naohiro Tawara, Mireia Díez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukás Burget, Shoko Araki
Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization.

ICASSP2022 Atsunori Ogawa, Naohiro Tawara, Marc Delcroix, Shoko Araki
Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models.

Interspeech2022 Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
How bad are artifacts?: Analyzing the impact of speech enhancement errors on ASR.

ICASSP2021 Julio Wissing, Benedikt T. Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura, 
Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain.

Interspeech2021 Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki
Few-Shot Learning of New Sound Classes for Target Sound Extraction.

Interspeech2021 Christopher Schymura, Benedikt T. Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa, 
PILOT: Introducing Transformers for Probabilistic Sound Event Localization.

Interspeech2021 Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, 
Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility.

SpeechComm2020 Katsuhiko Yamamoto, Toshio Irino, Shoko Araki, Keisuke Kinoshita, Tomohiro Nakatani, 
GEDI: Gammachirp envelope distortion index for predicting intelligibility of enhanced speech.

ICASSP2020 Marc Delcroix, Tsubasa Ochiai, Katerina Zmolíková, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki
Improving Speaker Discrimination of Target Speech Extraction With Time-Domain Speakerbeam.

ICASSP2020 Satoru Emura, Hiroshi Sawada, Shoko Araki, Noboru Harada, 
A Frequency-Domain BSS Method Based on ℓ1 Norm, Unitary Constraint, and Cayley Transform.

ICASSP2020 Rintaro Ikeshita, Tomohiro Nakatani, Shoko Araki
Overdetermined Independent Vector Analysis.

ICASSP2020 Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, 
Tackling Real Noisy Reverberant Meetings with All-Neural Source Separation, Counting, and Diarization System.

ICASSP2020 Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa, 
A Dynamic Stream Weight Backprop Kalman Filter for Audiovisual Speaker Tracking.

Interspeech2020 Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Toshio Irino, 
Predicting Intelligibility of Enhanced Speech Using Posteriors Derived from DNN-Based ASR System.

#185  | Shansong Liu | DBLP Google Scholar  
By venue: Interspeech: 13, ICASSP: 9, TASLP: 4
By year: 2024: 3, 2023: 1, 2022: 3, 2021: 9, 2020: 3, 2019: 5, 2018: 2
ISCA sessionsspeech recognition of atypical speech: 3speech synthesis: 2topics in asr: 2medical applications and visual asr: 2speech and speaker recognition: 1asr neural network architectures: 1novel neural network architectures for acoustic modelling: 1application of asr in medical practice: 1
IEEE keywordsspeech recognition: 9bayes methods: 4natural language processing: 3rhythm: 2task analysis: 2neural architecture search: 2deep learning (artificial intelligence): 2time delay neural network: 2domain adaptation: 2adaptation models: 2inference mechanisms: 2bayesian learning: 2speaker recognition: 2speech separation: 2mu llama: 1text to music generation: 1natural languages: 1measurement: 1music question answering: 1tagging: 1question answering (information retrieval): 1musicqa dataset: 1humming melody transcription: 1humtrans dataset: 1open source: 1instruments: 1largest known humming dataset: 1software development management: 1recording: 1unified tag set: 1cross modal matching: 1temporal information: 1optical flow: 1video music retrieval: 1cross attention: 1industries: 1search problems: 1uncertainty handling: 1minimisation: 1neural net architecture: 1error analysis: 1articulatory inversion: 1dysarthric speech: 1hybrid power systems: 1benchmark testing: 1variational inference: 1delays: 1generalisation (artificial intelligence): 1lf mmi: 1gaussian process: 1handicapped aids: 1speaker adaptation: 1data augmentation: 1multimodal speech recognition: 1disordered speech recognition: 1image recognition: 1microphone arrays: 1audio visual: 1visual occlusion: 1overlapped speech recognition: 1multi channel: 1jointly fine tuning: 1filtering theory: 1video signal processing: 1training data: 1switches: 1transformer: 1model uncertainty: 1estimation: 1neural language models: 1uncertainty: 1automatic speech recognition: 1neurocognitive disorder detection: 1elderly speech: 1dementia: 1multi modal: 1audio visual systems: 1overlapped speech: 1audio visual speech recognition: 1gaussian process neural network: 1activation function selection: 1bayesian neural network: 1gaussian processes: 1
Most publications (all venues) at: 2022: 6, 2021: 6, 2019: 5, 2018: 4, 2024: 3

Affiliations
URLs

Recent publications

ICASSP2024 Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan, 
Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning.

ICASSP2024 Shansong Liu, Xu Li 0015, Dian Li, Ying Shan, 
Humtrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond.

ICASSP2024 Tianjun Mao, Shansong Liu, Yunxuan Zhang, Dian Li, Ying Shan, 
Unified Pretraining Target Based Video-Music Retrieval with Music Rhythm and Video Optical Flow Information.

Interspeech2023 Zhihan Yang, Shansong Liu, Xu Li 0015, Haozhe Wu, Zhiyong Wu 0001, Ying Shan, Jia Jia 0001, 
Prosody Modeling with 3D Visual Information for Expressive Video Dubbing.

TASLP2022 Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng, 
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks.

ICASSP2022 Shujie Hu, Shansong Liu, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shoukang Hu, Mingyu Cui, Xunying Liu, Helen Meng, 
Exploiting Cross Domain Acoustic-to-Articulatory Inverted Features for Disordered Speech Recognition.

Interspeech2022 Xu Li 0015, Shansong Liu, Ying Shan, 
A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion.

TASLP2021 Shoukang Hu, Xurong Xie, Shansong Liu, Jianwei Yu, Zi Ye 0001, Mengzhe Geng, Xunying Liu, Helen Meng, 
Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition.

TASLP2021 Shansong Liu, Mengzhe Geng, Shoukang Hu, Xurong Xie, Mingyu Cui, Jianwei Yu, Xunying Liu, Helen Meng, 
Recent Progress in the CUHK Dysarthric Speech Recognition System.

TASLP2021 Jianwei Yu, Shi-Xiong Zhang, Bo Wu, Shansong Liu, Shoukang Hu, Mengzhe Geng, Xunying Liu, Helen Meng, Dong Yu 0001, 
Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech.

ICASSP2021 Shoukang Hu, Xurong Xie, Shansong Liu, Mingyu Cui, Mengzhe Geng, Xunying Liu, Helen Meng, 
Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks.

ICASSP2021 Boyang Xue, Jianwei Yu, Junhao Xu, Shansong Liu, Shoukang Hu, Zi Ye 0001, Mengzhe Geng, Xunying Liu, Helen Meng, 
Bayesian Transformer Language Models for Speech Recognition.

ICASSP2021 Zi Ye 0001, Shoukang Hu, Jinchao Li, Xurong Xie, Mengzhe Geng, Jianwei Yu, Junhao Xu, Boyang Xue, Shansong Liu, Xunying Liu, Helen Meng, 
Development of the Cuhk Elderly Speech Recognition System for Neurocognitive Disorder Detection Using the Dementiabank Corpus.

Interspeech2021 Jiajun Deng, Fabian Ritter Gutierrez, Shoukang Hu, Mengzhe Geng, Xurong Xie, Zi Ye 0001, Shansong Liu, Jianwei Yu, Xunying Liu, Helen Meng, 
Bayesian Parametric and Architectural Domain Adaptation of LF-MMI Trained TDNNs for Elderly and Dysarthric Speech Recognition.

Interspeech2021 Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye 0001, Zengrui Jin, Xunying Liu, Helen Meng, 
Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition.

Interspeech2021 Zengrui Jin, Mengzhe Geng, Xurong Xie, Jianwei Yu, Shansong Liu, Xunying Liu, Helen Meng, 
Adversarial Data Augmentation for Disordered Speech Recognition.

ICASSP2020 Jianwei Yu, Shi-Xiong Zhang, Jian Wu 0027, Shahram Ghorbani, Bo Wu, Shiyin Kang, Shansong Liu, Xunying Liu, Helen Meng, Dong Yu 0001, 
Audio-Visual Recognition of Overlapped Speech for the LRS2 Dataset.

Interspeech2020 Mengzhe Geng, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng, 
Investigation of Data Augmentation Techniques for Disordered Speech Recognition.

Interspeech2020 Shansong Liu, Xurong Xie, Jianwei Yu, Shoukang Hu, Mengzhe Geng, Rongfeng Su, Shi-Xiong Zhang, Xunying Liu, Helen Meng, 
Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition.

ICASSP2019 Shoukang Hu, Max W. Y. Lam, Xurong Xie, Shansong Liu, Jianwei Yu, Xixin Wu, Xunying Liu, Helen Meng, 
Bayesian and Gaussian Process Neural Networks for Large Vocabulary Continuous Speech Recognition.

#186  | Shixiong Zhang | DBLP Google Scholar  
By venue: Interspeech: 13, ICASSP: 9, TASLP: 3, AAAI: 1
By year: 2024: 1, 2023: 2, 2022: 3, 2021: 9, 2020: 7, 2019: 4
ISCA sessionssource separation, dereverberation and echo cancellation: 2speech enhancement and bandwidth expansion: 1dereverberation and echo cancellation: 1speaker recognition: 1source separation: 1speech localization, enhancement, and quality assessment: 1topics in asr: 1multi-channel speech enhancement: 1multimodal speech processing: 1speech and audio source separation and scene analysis: 1speech enhancement: 1asr for noisy and far-field speech: 1
IEEE keywordsspeech recognition: 10speech separation: 4speech enhancement: 3source separation: 3task analysis: 2application program interfaces: 2reverberation: 2audio visual systems: 2microphone arrays: 2filtering theory: 2end to end speech recognition: 2measurement: 1optimization: 1acoustic environment: 1speech simulation: 1transient response: 1graphics processing units: 1rnn t: 1code switched asr: 1natural language processing: 1bilingual asr: 1computational linguistics: 1sensor fusion: 1sound source separation: 1audio signal processing: 1audio visual processing: 1speech synthesis: 1image recognition: 1audio visual: 1visual occlusion: 1overlapped speech recognition: 1multi channel: 1jointly fine tuning: 1video signal processing: 1mvdr: 1array signal processing: 1adl mvdr: 1direction of arrival estimation: 1source localization: 1speaker recognition: 1audio video synchronization: 1human computer interaction: 1labeling: 1synchronization: 1speaker diarization: 1self supervised learning: 1multi modal learning: 1learning systems: 1end to end: 1multi channel speech separation: 1inter channel convolution differences: 1spatial filters: 1spatial features: 1target speech extraction: 1minimisation: 1neural beamformer: 1signal reconstruction: 1multi modal: 1overlapped speech: 1audio visual speech recognition: 1cloud computing: 1quantization: 1polynomials: 1privacy preserving: 1dnn: 1decoding: 1cryptography: 1speech coding: 1encryption: 1
Most publications (all venues) at: 2021: 24, 2022: 12, 2024: 11, 2023: 11, 2020: 11

Affiliations
URLs

Recent publications

AAAI2024 Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu 0001, Shi-Xiong Zhang, Guangzhi Li, Yi Luo 0004, Rongzhi Gu, 
SECap: Speech Emotion Captioning with Large Language Model.

ICASSP2023 Ruize Xu, Ruoxuan Feng, Shi-Xiong Zhang, Di Hu 0001, 
MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning.

Interspeech2023 Yong Xu 0004, Vinay Kothapally, Meng Yu 0003, Shixiong Zhang, Dong Yu 0001, 
Zoneformer: On-device Neural Beamformer For In-car Multi-zone Speech Separation, Enhancement and Echo Cancellation.

ICASSP2022 Anton Ratnarajah, Shi-Xiong Zhang, Meng Yu 0003, Zhenyu Tang 0001, Dinesh Manocha, Dong Yu 0001, 
Fast-Rir: Fast Neural Diffuse Room Impulse Response Generator.

ICASSP2022 Brian Yan, Chunlei Zhang, Meng Yu 0003, Shi-Xiong Zhang, Siddharth Dalmia, Dan Berrebbi, Chao Weng, Shinji Watanabe 0001, Dong Yu 0001, 
Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization.

Interspeech2022 Vinay Kothapally, Yong Xu 0004, Meng Yu 0003, Shi-Xiong Zhang, Dong Yu 0001, 
Joint Neural AEC and Beamforming with Double-Talk Detection.

TASLP2021 Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu 0004, Meng Yu 0003, Dong Yu 0001, Jesper Jensen 0001, 
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation.

TASLP2021 Jianwei Yu, Shi-Xiong Zhang, Bo Wu, Shansong Liu, Shoukang Hu, Mengzhe Geng, Xunying Liu, Helen Meng, Dong Yu 0001, 
Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech.

TASLP2021 Zhuohuang Zhang, Yong Xu 0004, Meng Yu 0003, Shi-Xiong Zhang, Lianwu Chen, Donald S. Williamson, Dong Yu 0001, 
Multi-Channel Multi-Frame ADL-MVDR for Target Speech Separation.

ICASSP2021 Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe 0001, Meng Yu 0003, Yong Xu 0004, Shi-Xiong Zhang, Dong Yu 0001, 
Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization.

Interspeech2021 Saurabh Kataria, Shi-Xiong Zhang, Dong Yu 0001, 
Multi-Channel Speaker Verification for Single and Multi-Talker Speech.

Interspeech2021 Xiyun Li, Yong Xu 0004, Meng Yu 0003, Shi-Xiong Zhang, Jiaming Xu 0001, Bo Xu 0002, Dong Yu 0001, 
MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation.

Interspeech2021 Helin Wang, Bo Wu, Lianwu Chen, Meng Yu 0003, Jianwei Yu, Yong Xu 0004, Shi-Xiong Zhang, Chao Weng, Dan Su 0002, Dong Yu 0001, 
TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation.

Interspeech2021 Yong Xu 0004, Zhuohuang Zhang, Meng Yu 0003, Shi-Xiong Zhang, Dong Yu 0001, 
Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation.

Interspeech2021 Meng Yu 0003, Chunlei Zhang, Yong Xu 0004, Shi-Xiong Zhang, Dong Yu 0001, 
MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment.

ICASSP2020 Yifan Ding, Yong Xu 0004, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang, 
Self-Supervised Learning for Audio-Visual Speaker Diarization.

ICASSP2020 Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu 0004, Meng Yu 0003, Dan Su 0002, Yuexian Zou, Dong Yu 0001, 
Enhancing End-to-End Multi-Channel Speech Separation Via Spatial Feature Learning.

ICASSP2020 Aswin Shanmugam Subramanian, Chao Weng, Meng Yu 0003, Shi-Xiong Zhang, Yong Xu 0004, Shinji Watanabe 0001, Dong Yu 0001, 
Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Objectives.

ICASSP2020 Jianwei Yu, Shi-Xiong Zhang, Jian Wu 0027, Shahram Ghorbani, Bo Wu, Shiyin Kang, Shansong Liu, Xunying Liu, Helen Meng, Dong Yu 0001, 
Audio-Visual Recognition of Overlapped Speech for the LRS2 Dataset.

Interspeech2020 Shansong Liu, Xurong Xie, Jianwei Yu, Shoukang Hu, Mengzhe Geng, Rongfeng Su, Shi-Xiong Zhang, Xunying Liu, Helen Meng, 
Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition.

#187  | Hiroshi Sato | DBLP Google Scholar  
By venue: Interspeech: 16, ICASSP: 9, TASLP: 1
By year: 2024: 3, 2023: 7, 2022: 8, 2021: 4, 2020: 2, 2019: 2
ISCA sessionsspeech recognition: 2novel models and training methods for asr: 2spoken dialog systems and conversational analysis: 1end-to-end asr: 1speech coding and enhancement: 1dereverberation, noise reduction, and speaker extraction: 1speech enhancement and intelligibility: 1multi-, cross-lingual and other topics in asr: 1single-channel speech enhancement: 1streaming for asr/rnn transducers: 1source separation, dereverberation and echo cancellation: 1asr neural network architectures and training: 1speech and audio classification: 1model training for asr: 1
IEEE keywordsspeech recognition: 7speech enhancement: 4neural network: 4end to end: 3degradation: 2single channel speech enhancement: 2noise robust speech recognition: 2processing distortion: 2transformers: 2recurrent neural nets: 2natural language processing: 2recurrent neural network transducer: 2probability: 2nonlinear distortion: 1noise measurement: 1interference: 1speaker embeddings: 1noise robustness: 1adaptation models: 1zero shot tts: 1self supervised learning: 1speech synthesis: 1self supervised learning model: 1joint training: 1data models: 1acoustic distortion: 1interpolation: 1neural transducer: 1recurrent neural networks: 1robustness: 1linguistics: 1decoding: 1scheduled sampling: 1buildings: 1multilingual: 1automatic speech recognition: 1representation learning: 1cross lingual: 1self supervised speech representation learning: 1task analysis: 1attention based decoder: 1input switching: 1deep learning (artificial intelligence): 1speech separation: 1speakerbeam: 1speech extraction: 1listener adaptation: 1perceived emotion: 1speech emotion recognition: 1emotion recognition: 1entropy: 1whole network pre training: 1synchronisation: 1autoregressive processes: 1connectionist temporal classification: 1attention weight: 1knowledge distillation: 1speech codecs: 1
Most publications (all venues) at: 2023: 18, 2022: 16, 2012: 13, 2019: 12, 2021: 8

Affiliations
URLs

Recent publications

TASLP2024 Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance.

ICASSP2024 Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima, 
Noise-Robust Zero-Shot Text-to-Speech Synthesis Conditioned on Self-Supervised Speech-Representation Model with Adapters.

ICASSP2024 Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts?

ICASSP2023 Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, 
Improving Scheduled Sampling for Neural Transducer-Based ASR.

ICASSP2023 Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Hiroshi Sato, Taiga Yamane, Takanori Ashihara, Kohei Matsuura, Takafumi Moriya, 
Leveraging Language Embeddings for Cross-Lingual Self-Supervised Speech Representation Learning.

Interspeech2023 Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, 
Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer.

Interspeech2023 Mana Ihori, Hiroshi Sato, Tomohiro Tanaka, Ryo Masumura, Saki Mizuno, Nobukatsu Hojo, 
Transcribing Speech as Spoken and Written Dual Text Using an Autoregressive Model.

Interspeech2023 Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando, 
End-to-End Joint Target and Non-Target Speakers ASR.

Interspeech2023 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami, 
Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data.

Interspeech2023 Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo, 
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss.

ICASSP2022 Takafumi Moriya, Takanori Ashihara, Atsushi Ando, Hiroshi Sato, Tomohiro Tanaka, Kohei Matsuura, Ryo Masumura, Marc Delcroix, Takahiro Shinozaki, 
Hybrid RNN-T/Attention-Based Streaming ASR with Triggered Chunkwise Attention and Dual Internal Language Model Integration.

ICASSP2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya, 
Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition.

Interspeech2022 Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolíková, Hiroshi Sato, Tomohiro Nakatani, 
Listen only to me! How well can target speech extraction handle false alarms?

Interspeech2022 Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri, 
How bad are artifacts?: Analyzing the impact of speech enhancement errors on ASR.

Interspeech2022 Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando, 
End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training.

Interspeech2022 Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki, 
Streaming Target-Speaker ASR with Neural Transducer.

Interspeech2022 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura, 
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations.

Interspeech2022 Tomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, 
Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream Tasks.

ICASSP2021 Atsushi Ando, Ryo Masumura, Hiroshi Sato, Takafumi Moriya, Takanori Ashihara, Yusuke Ijima, Tomoki Toda, 
Speech Emotion Recognition Based on Listener Adaptive Models.

ICASSP2021 Takafumi Moriya, Takanori Ashihara, Tomohiro Tanaka, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Yusuke Ijima, Ryo Masumura, Yusuke Shinohara, 
Simpleflat: A Simple Whole-Network Pre-Training Approach for RNN Transducer-Based End-to-End Speech Recognition.

#188  | Pedro J. Moreno 0001 | DBLP Google Scholar  
By venue: Interspeech: 15, ICASSP: 10, NAACL: 1
By year: 2024: 3, 2023: 4, 2022: 6, 2021: 6, 2020: 5, 2019: 1, 2018: 1
ISCA sessionsnovel models and training methods for asr: 2self-supervised, semi-supervised, adaptation and data augmentation for asr: 2feature modeling for asr: 1acoustic model adaptation for asr: 1streaming for asr/rnn transducers: 1speech recognition of atypical speech: 1self-supervision and semi-supervision for neural asr training: 1neural network training methods for asr: 1asr neural network architectures and training: 1training strategies for asr: 1multilingual and code-switched asr: 1medical applications and visual asr: 1recurrent neural models for asr: 1
IEEE keywordsspeech recognition: 8error analysis: 4n best rescoring: 3natural language processing: 3video on demand: 2automatic speech recognition: 2computational modeling: 2decoding: 2speech synthesis: 2rnn t: 2multilingual: 2dialect classifier: 1equity: 1semisupervised learning: 1us english: 1african american english: 1robustness: 1computational efficiency: 1runtime efficiency: 1end to end asr: 1computational latency: 1large models: 1task analysis: 1standards: 1submodels: 1convolution: 1self attention: 1additives: 1entropy: 1lattices: 1fine tuning: 1large scale language models: 1text analysis: 1consistency regularization: 1self supervised: 1sequence to sequence model: 1speech normalization: 1speaker recognition: 1speech impairments: 1voice conversion: 1end to end speech recognition: 1gradient methods: 1language id: 1mixture of experts: 1classification algorithms: 1encoder decoder: 1signal processing algorithms: 1data augmentation: 1
Most publications (all venues) at: 2022: 10, 2023: 8, 2021: 7, 2018: 6, 2024: 5

Affiliations
Google Inc., Mountain View, CA, USA

Recent publications

ICASSP2024 Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha, Dongseong Hwang, Tara N. Sainath, Françoise Beaufays, Pedro Moreno Mengibar
Improving Speech Recognition for African American English with Audio Classification.

ICASSP2024 Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J. Moreno 0001
Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models.

NAACL2024 Weiran Wang, Rohit Prabhavalkar, Haozhe Shan, Zhong Meng, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li 0028, James Qin, Xingyu Cai, Adam Stooke, Chengjian Zheng, Yanzhang He, Tara N. Sainath, Pedro Moreno Mengibar
Massive End-to-end Speech Recognition Models with Time Reduction.

ICASSP2023 Kartik Audhkhasi, Brian Farris, Bhuvana Ramabhadran, Pedro J. Moreno 0001
Modular Conformer Training for Flexible End-to-End ASR.

ICASSP2023 Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel S. Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno 0001, Michael Riley 0001, 
Large-Scale Language Model Rescoring on Long-Form Data.

Interspeech2023 Zhouyuan Huo, Khe Chai Sim, Dongseong Hwang, Tsendsuren Munkhdalai, Tara N. Sainath, Pedro Moreno Mengibar
Re-investigating the Efficient Transfer Learning of Speech Foundation Model using Feature Fusion Methods.

Interspeech2023 Qiujia Li, Bo Li 0028, Dongseong Hwang, Tara N. Sainath, Pedro Moreno Mengibar
Modular Domain Adaptation for Conformer-Based Streaming ASR.

ICASSP2022 Zhehuai Chen, Yu Zhang 0033, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno 0001, Gary Wang, 
Tts4pretrain 2.0: Advancing the use of Text and Speech in ASR Pretraining with Consistency and Contrastive Losses.

ICASSP2022 Neeraj Gaur, Tongzhou Chen, Ehsan Variani, Parisa Haghani, Bhuvana Ramabhadran, Pedro J. Moreno 0001
Multilingual Second-Pass Rescoring for Automatic Speech Recognition Systems.

Interspeech2022 Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno 0001
Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition.

Interspeech2022 Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro J. Moreno 0001
A Scalable Model Specialization Framework for Training and Inference using Submodels and its Application to Speech Model Personalization.

Interspeech2022 Zhehuai Chen, Yu Zhang 0033, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno 0001, Ankur Bapna, Heiga Zen, 
MAESTRO: Matched Speech Text Representations through Modality Matching.

Interspeech2022 Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno 0001
Non-Parallel Voice Conversion for ASR Augmentation.

ICASSP2021 Rohan Doshi, Youzheng Chen, Liyang Jiang, Xia Zhang, Fadi Biadsy, Bhuvana Ramabhadran, Fang Chu, Andrew Rosenberg, Pedro J. Moreno 0001
Extending Parrotron: An End-to-End, Speech Conversion and Speech Recognition Model for Atypical Speech.

ICASSP2021 Neeraj Gaur, Brian Farris, Parisa Haghani, Isabel Leal, Pedro J. Moreno 0001, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu, 
Mixture of Informed Experts for Multilingual Speech Recognition.

Interspeech2021 Kartik Audhkhasi, Tongzhou Chen, Bhuvana Ramabhadran, Pedro J. Moreno 0001
Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition.

Interspeech2021 Zhehuai Chen, Bhuvana Ramabhadran, Fadi Biadsy, Xia Zhang, Youzheng Chen, Liyang Jiang, Fang Chu, Rohan Doshi, Pedro J. Moreno 0001
Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech.

Interspeech2021 Zhehuai Chen, Andrew Rosenberg, Yu Zhang 0033, Heiga Zen, Mohammadreza Ghodsi, Yinghui Huang, Jesse Emond, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno 0001
Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation.

Interspeech2021 Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Pedro J. Moreno 0001, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu, 
Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence.

ICASSP2020 Ehsan Variani, Tongzhou Chen, James Apfel, Bhuvana Ramabhadran, Seungji Lee, Pedro J. Moreno 0001
Neural Oracle Search on N-BEST Hypotheses.

#189  | Gabriel Synnaeve | DBLP Google Scholar  
By venue: Interspeech: 12, ICASSP: 8, ICLR: 2, NeurIPS: 2, EMNLP: 1, ICML: 1
By year: 2024: 1, 2023: 5, 2022: 2, 2021: 5, 2020: 9, 2019: 3, 2018: 1
ISCA sessionsself-supervision and semi-supervision for neural asr training: 2topics in asr: 2neural networks for language modeling: 2speech synthesis: 1single-channel speech enhancement: 1multilingual and code-switched asr: 1computational resource constrained speech recognition: 1asr model training and strategies: 1sequence models for asr: 1
IEEE keywordsspeech recognition: 7natural language processing: 6pseudo labeling: 2supervised learning: 1semi supervised learning: 1massively multilingual models: 1entropy: 1joint training: 1pattern classification: 1contrastive learning: 1self supervision: 1semi supervised: 1self supervised learning: 1pre training: 1self training: 1text analysis: 1distant supervision: 1unsupervised and semi supervised learning: 1audio signal processing: 1unsupervised learning: 1zero and low resource asr.: 1dataset: 1ctc: 1transformer: 1hybrid asr: 1video signal processing: 1error statistics: 1automatic speech recognition: 1multi task learning: 1speaker recognition: 1adversarial learning: 1c++ language: 1public domain software: 1end to end: 1open source software: 1
Most publications (all venues) at: 2023: 16, 2021: 14, 2020: 14, 2019: 12, 2022: 11

Affiliations
URLs

Recent publications

ICLR2024 Alon Ziv, Itai Gat, Gaël Le Lan, Tal Remez, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Yossi Adi, 
Masked Audio Generation using a Single Non-Autoregressive Transformer.

Interspeech2023 Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux, 
Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis.

NeurIPS2023 Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz 0001, Yossi Adi, 
Textually Pretrained Speech Language Models.

NeurIPS2023 Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez, 
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion.

ICLR2023 Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi, 
AudioGen: Textually Guided Audio Generation.

EMNLP2023 Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoît Sagot, Emmanuel Dupoux, 
Generative Spoken Language Model based on continuous word-sized audio tokens.

ICASSP2022 Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert, 
Pseudo-Labeling for Massively Multilingual Speech Recognition.

ICASSP2022 Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert, 
Word Order does not Matter for Speech Recognition.

ICASSP2021 Chaitanya Talnikar, Tatiana Likhomanenko, Ronan Collobert, Gabriel Synnaeve
Joint Masked CPC And CTC Training For ASR.

ICASSP2021 Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, Michael Auli, 
Self-Training and Pre-Training are Complementary for Speech Recognition.

Interspeech2021 Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee 0001, Ronan Collobert, Gabriel Synnaeve, Michael Auli, 
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training.

Interspeech2021 Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert, 
slimIPL: Language-Model-Free Iterative Pseudo-Labeling.

Interspeech2021 Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve
Rethinking Evaluation in ASR: Are Our Models Robust Enough?

ICASSP2020 Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux, 
Libri-Light: A Benchmark for ASR with Limited or No Supervision.

ICASSP2020 Andros Tjandra, Chunxi Liu, Frank Zhang 0001, Xiaohui Zhang 0007, Yongqiang Wang 0005, Gabriel Synnaeve, Satoshi Nakamura 0001, Geoffrey Zweig, 
DEJA-VU: Double Feature Presentation and Iterated Loss in Deep Transformer Networks.

Interspeech2020 Alexandre Défossez, Gabriel Synnaeve, Yossi Adi, 
Real Time Speech Enhancement in the Waveform Domain.

Interspeech2020 Da-Rong Liu, Chunxi Liu, Frank Zhang 0001, Gabriel Synnaeve, Yatharth Saraf, Geoffrey Zweig, 
Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model.

Interspeech2020 Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Y. Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert, 
Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters.

Interspeech2020 Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Y. Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert, 
Scaling Up Online Speech Recognition Using ConvNets.

Interspeech2020 Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert, 
MLS: A Large-Scale Multilingual Dataset for Speech Research.

#190  | Juan Pino 0001 | DBLP Google Scholar  
By venue: ACL: 10, Interspeech: 10, ICASSP: 3, ICML: 1, ACL-Findings: 1, NAACL: 1
By year: 2024: 1, 2023: 9, 2022: 6, 2021: 7, 2020: 3
ISCA sessionsspoken language processing: 2spoken machine translation: 2speech translation and multilingual/multimodal learning: 2resources for spoken language processing: 1speech recognition: 1cross/multi-lingual asr: 1multilingual and code-switched asr: 1
IEEE keywordsspeech recognition: 2speech to speech translation: 1multitasking: 1text to speech augmentation: 1data models: 1discrete units: 1analytical models: 1streaming speech translation: 1simultaneous speech translation: 1language translation: 1end to end speech translation: 1training data: 1error analysis: 1speech enhancement: 1noise reduction: 1multi task learning: 1machine translation: 1speech translation: 1
Most publications (all venues) at: 2023: 17, 2021: 14, 2020: 14, 2022: 8, 2019: 7

Affiliations
Meta AI
University of Cambridge, Department of Engineering, UK (former)
Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA, USA (former)

Recent publications

ACL2024 HyoJung Han, Mohamed Anwar, Juan Pino 0001, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang, 
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception.

ICASSP2023 Jiatong Shi, Yun Tang 0002, Ann Lee 0001, Hirofumi Inaguma, Changhan Wang, Juan Pino 0001, Shinji Watanabe 0001, 
Enhancing Speech-To-Speech Translation with Multiple TTS Targets.

Interspeech2023 Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino 0001, Changhan Wang, 
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation.

Interspeech2023 Jiatong Shi, Yun Tang 0002, Hirofumi Inaguma, Hongyu Gong, Juan Pino 0001, Shinji Watanabe 0001, 
Exploration on HuBERT with Multiple Resolution.

ICML2023 Phuong-Hang Le, Hongyu Gong, Changhan Wang, Juan Pino 0001, Benjamin Lecouteux, Didier Schwab, 
Pre-training for Speech Translation: CTC Meets Optimal Transport.

ACL2023 Yun Tang 0002, Anna Y. Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden Tomasello, Juan Pino 0001
Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks.

ACL2023 Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee 0001, Vedanuj Goswami, Changhan Wang, Juan Pino 0001, Benoît Sagot, Holger Schwenk, 
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations.

ACL2023 Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang 0002, Ann Lee 0001, Shinji Watanabe 0001, Juan Pino 0001
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units.

ACL2023 Changhan Wang, Hirofumi Inaguma, Peng-Jen Chen, Ilia Kulikov, Yun Tang 0002, Wei-Ning Hsu, Michael Auli, Juan Pino 0001
Simple and Effective Unsupervised Speech Translation.

ACL-Findings2023 Peng-Jen Chen, Kevin Tran, Yilin Yang, Jingfei Du, Justine Kao, Yu-An Chung, Paden Tomasello, Paul-Ambroise Duquenne, Holger Schwenk, Hongyu Gong, Hirofumi Inaguma, Sravya Popuri, Changhan Wang, Juan Pino 0001, Wei-Ning Hsu, Ann Lee 0001, 
Speech-to-Speech Translation for a Real-world Unwritten Language.

Interspeech2022 Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino 0001, Alexei Baevski, Alexis Conneau, Michael Auli, 
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale.

Interspeech2022 Danni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang 0002, Juan Miguel Pino
From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation.

Interspeech2022 Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino 0001, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee 0001, 
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation.

ACL2022 Ann Lee 0001, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang 0002, Juan Pino 0001, Wei-Ning Hsu, 
Direct Speech-to-Speech Translation With Discrete Units.

ACL2022 Yun Tang 0002, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Miguel Pino
Unified Speech-Text Pre-training for Speech Translation and Recognition.

NAACL2022 Ann Lee 0001, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Miguel Pino, Jiatao Gu, Wei-Ning Hsu, 
Textless Speech-to-Speech Translation on Real Data.

ICASSP2021 Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Miguel Pino
Streaming Simultaneous Speech Translation with Augmented Memory Transformer.

ICASSP2021 Yun Tang 0002, Juan Miguel Pino, Changhan Wang, Xutai Ma, Dmitriy Genzel, 
A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks.

Interspeech2021 Changhan Wang, Anne Wu, Jiatao Gu, Juan Pino 0001
CoVoST 2 and Massively Multilingual Speech Translation.

Interspeech2021 Changhan Wang, Anne Wu, Juan Pino 0001, Alexei Baevski, Michael Auli, Alexis Conneau, 
Large-Scale Self- and Semi-Supervised Learning for Speech Translation.

#191  | Jingdong Chen | DBLP Google Scholar  
By venue: ICASSP: 13, TASLP: 5, Interspeech: 4, SpeechComm: 3
By year: 2024: 4, 2023: 7, 2022: 6, 2021: 1, 2020: 4, 2019: 2, 2018: 1
ISCA sessionsspoken language processing: 1spoken dialogue systems and multimodality: 1tools, corpora and resources: 1speaker verification: 1
IEEE keywordssignal processing algorithms: 5filtering algorithms: 4speech enhancement: 4misp challenge: 4speech recognition: 3visualization: 3optimization: 3loudspeakers: 2transfer functions: 2prediction algorithms: 2linear prediction: 2multimodality: 2recording: 2audio visual: 2speaker diarization: 2maximum likelihood detection: 2nonlinear filters: 2reverberation: 2microphone arrays: 2metric learning: 2entropy: 2speaker recognition: 2speaker verification: 2semi blind source separation: 1computational modeling: 1odd power series expansion: 1convolution: 1convolutive transfer function model: 1bilinear: 1nonlinear acoustic echo cancellation: 1ip networks: 1alternating optimization: 1simulation: 1trade off prewhitening: 1computational efficiency: 1bilinear forms: 1acoustic source localization: 1location awareness: 1predictive models: 1data mining: 1target speaker extraction: 1real world scenarios: 1benchmark testing: 1personal sound zone: 1expectation maximization: 1semidefinite relaxation: 1complex gaussian mixture model: 1biconvex optimization: 1perturbation methods: 1robustness: 1uncertainty: 1synchronization: 1cost function: 1kronecker product decomposition: 1frequencydomain adaptive filter: 1recursive least squares (rls) algorithm: 1frequency domain analysis: 1adaptation models: 1adaptive filters: 1acoustic system identification: 1adaptive systems: 1switching filter: 1oral communication: 1switches: 1filtering: 1kronecker product: 1dereverberation: 1weighted prediction error: 1convolutive transfer function: 1source separation: 1analytical models: 1independent vector analysis: 1spatially informed source extraction: 1headphones: 1speaker extraction: 1rendering (computer graphics): 1modified rhyme test: 1antiphasic presentation: 1psychoacoustics: 1multiple input/binaural output: 1tv: 1quality assessment: 1pattern classification: 1calibration: 1curriculum learning: 1bipartite ranking: 1end to end: 1adaptive weighted prediction error: 1speech dereverberation: 1kronecker product filtering: 1reflection: 1multichannel linear prediction: 1public domain software: 1automatic speech recognition: 1wake word spotting: 1audio visual systems: 1microphone array: 1multiframe wiener filter: 1deep learning (artificial intelligence): 1signal denoising: 1wiener filters: 1dnn: 1single channel noise reduction: 1correlation methods: 1interframe correlation: 1multiframe mvdr filter: 1squared mahalanobis distance: 1pauc: 1measurement: 1detection algorithms: 1nist: 1pauc optimization: 1verification loss: 1pipelines: 1speaker centers: 1diffusion strategy: 1multi agent systems: 1joint sparsity: 1ℓ∞,1 norm regularization: 1gradient methods: 1least mean squares methods: 1optimisation: 1network theory (graphs): 1proximal operator: 1distributed optimization: 1auc: 1voice activity detection: 1deep neural networks: 1
Most publications (all venues) at: 2023: 31, 2024: 29, 2021: 28, 2022: 25, 2020: 25

Affiliations
URLs

Recent publications

SpeechComm2024 Chao Pan 0001, Jingdong Chen, Jacob Benesty, 
On intrusive speech quality measures and a global SNR based metric.

TASLP2024 Xianrui Wang, Yichen Yang 0010, Andreas Brendel, Tetsuya Ueda, Shoji Makino, Jacob Benesty, Walter Kellermann, Jingdong Chen
On Semi-Blind Source Separation-Based Approaches to Nonlinear Echo Cancellation Based on Bilinear Alternating Optimization.

ICASSP2024 Zhiheng Wang, Hongsen He, Jingdong Chen, Jacob Benesty, Yi Yu 0002, 
A Steered Response Power Approach with Bilinear Prediction-Based Trade-Off Prewhitening for Speaker Localization.

ICASSP2024 Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang 0029, Hongbo Lan, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao, 
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction.

TASLP2023 Junqing Zhang, Liming Shi, Mads Græsbøll Christensen, Wen Zhang 0002, Lijun Zhang 0004, Jingdong Chen
CGMM-Based Sound Zone Generation Using Robust Pressure Matching With ATF Perturbation Constraints.

ICASSP2023 Hang Chen, Shilong Wu, Yusheng Dai, Zhe Wang, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
Summary on the Multimodal Information Based Speech Processing (MISP) 2022 Challenge.

ICASSP2023 Hongsen He, Jingdong Chen, Jacob Benesty, Yi Yu 0002, 
A Frequency-Domain Recursive Least-Squares Adaptive Filtering Algorithm Based On A Kronecker Product Decomposition.

ICASSP2023 Gongping Huang, Jacob Benesty, Israel Cohen, Emil Winebrand, Jingdong Chen, Walter Kellermann, 
Switching Kronecker Product Linear Filtering for Multispeaker Adaptive Speech Dereverberation.

ICASSP2023 Xianrui Wang, Andreas Brendel, Gongping Huang, Yichen Yang 0010, Walter Kellermann, Jingdong Chen
Spatially Informed Independent vector analysis for Source Extraction based on the convolutive Transfer Function Model.

ICASSP2023 Xianrui Wang, Ningning Pan, Jacob Benesty, Jingdong Chen
On Multiple-Input/Binaural-Output Antiphasic Speaker Signal Extraction.

ICASSP2023 Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The Multimodal Information Based Speech Processing (Misp) 2022 Challenge: Audio-Visual Diarization And Recognition.

TASLP2022 Zhongxin Bai, Jianyu Wang, Xiao-Lei Zhang 0001, Jingdong Chen
End-to-End Speaker Verification via Curriculum Bipartite Ranking Weighted Binary Cross-Entropy.

TASLP2022 Gongping Huang, Jacob Benesty, Israel Cohen, Jingdong Chen
Kronecker Product Multichannel Linear Filtering for Adaptive Weighted Prediction Error-Based Speech Dereverberation.

ICASSP2022 Hang Chen, Hengshun Zhou, Jun Du, Chin-Hui Lee 0001, Jingdong Chen, Shinji Watanabe 0001, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, Bao-Cai Yin, Jia Pan, Jianqing Gao, Cong Liu 0006, 
The First Multimodal Information Based Speech Processing (Misp) Challenge: Data, Tasks, Baselines And Results.

ICASSP2022 Ningning Pan, Jingdong Chen, Jacob Benesty, 
DNN Based Multiframe Single-Channel Noise Reduction Filters.

Interspeech2022 Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee 0001, Sabato Marco Siniscalchi, Shinji Watanabe 0001, Odette Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan, 
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis.

Interspeech2022 Hengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee 0001, Sabato Marco Siniscalchi, Shinji Watanabe 0001, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jianqing Gao, 
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis.

Interspeech2021 Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen 0006, Yanxin Hu, Lei Xie 0001, Jian Wu 0027, Hui Bu, Xin Xu, Jun Du, Jingdong Chen
AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario.

SpeechComm2020 Zhongxin Bai, Xiao-Lei Zhang 0001, Jingdong Chen
Cosine metric learning based speaker verification.

TASLP2020 Zhongxin Bai, Xiao-Lei Zhang 0001, Jingdong Chen
Speaker Verification by Partial AUC Optimization With Mahalanobis Distance Metric Learning.

#192  | Rohan Kumar Das | DBLP Google Scholar  
By venue: Interspeech: 15, ICASSP: 7, TASLP: 3
By year: 2023: 2, 2022: 2, 2021: 3, 2020: 8, 2019: 10
ISCA sessionsspeaker recognition: 2speech and speaker recognition: 2speech signal characterization: 2neural-based speech and acoustic analysis: 1the first dicova challenge: 1the attacker’s perpective on automatic speaker verification: 1the interspeech 2020 far field speaker verification challenge: 1speaker recognition challenges and applications: 1anti-spoofing and liveness detection: 1the interspeech 2019 computational paralinguistics challenge (compare): 1the 2019 automatic speaker verification spoofing and countermeasures challenge: 1speaker recognition evaluation: 1
IEEE keywordsspeaker recognition: 8cepstral analysis: 3convolutional neural nets: 2speech recognition: 2signal detection: 2synthetic speech detection: 2security of data: 2anti spoofing: 2speech synthesis: 2natural language processing: 2progressive clustering: 1multi modal: 1diverse positive pairs: 1supervised learning: 1face recognition: 1self supervised learning: 1task analysis: 1multi scale frequency channel attention: 1short utterance: 1text independent speaker verification: 1pattern classification: 1pseudo label selection: 1self supervised speaker recognition: 1loss gated learning: 1unsupervised learning: 1modified magnitude phase spectrum: 1constant q modified octave coefficients: 1mixture models: 1signal classification: 1transforms: 1unknown kind spoofing detection: 1gaussian processes: 1signal companding: 1data augmentation: 1chains corpus: 1vocal tract constriction: 1whispered speech: 1speaker characterization: 1synthetic attacks: 1replay attacks: 1generalized countermeasures: 1asvspoof 2019: 1text analysis: 1text to speech: 1code switching: 1crosslingual word embedding: 1word processing: 1end to end: 1replay speech detection: 1multi level transform (mlt): 1voice activity detection: 1constant q multi level coefficients (cmc): 1phonetic posteriorgram (ppg): 1cross lingual: 1average modeling approach (ama): 1voice conversion: 1
Most publications (all venues) at: 2020: 21, 2019: 20, 2024: 9, 2018: 9, 2022: 8

Affiliations
URLs

Recent publications

TASLP2023 Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li 0001, 
Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs.

Interspeech2023 Tanmay Khandelwal, Rohan Kumar Das
A Multi-Task Learning Framework for Sound Event Detection using High-level Acoustic Characteristics of Sounds.

ICASSP2022 Tianchi Liu 0004, Rohan Kumar Das, Kong Aik Lee, Haizhou Li 0001, 
MFA: TDNN with Multi-Scale Frequency-Channel Attention for Text-Independent Speaker Verification with Short Utterances.

ICASSP2022 Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li 0001, 
Self-Supervised Speaker Recognition with Loss-Gated Learning.

TASLP2021 Jichen Yang, Hongji Wang, Rohan Kumar Das, Yanmin Qian, 
Modified Magnitude-Phase Spectrum Information for Spoofing Detection.

ICASSP2021 Rohan Kumar Das, Jichen Yang, Haizhou Li 0001, 
Data Augmentation with Signal Companding for Detection of Logical Access Attacks.

Interspeech2021 Rohan Kumar Das, Maulik C. Madhavi, Haizhou Li 0001, 
Diagnosis of COVID-19 Using Auditory Acoustic Cues.

ICASSP2020 Rohan Kumar Das, Haizhou Li 0001, 
On the Importance of Vocal Tract Constriction for Speaker Characterization: The Whispered Speech Study.

ICASSP2020 Rohan Kumar Das, Jichen Yang, Haizhou Li 0001, 
Assessing the Scope of Generalized Countermeasures for Anti-Spoofing.

ICASSP2020 Xuehao Zhou, Xiaohai Tian, Grandee Lee, Rohan Kumar Das, Haizhou Li 0001, 
End-to-End Code-Switching TTS with Cross-Lingual Language Model.

Interspeech2020 Tianchi Liu 0004, Rohan Kumar Das, Maulik C. Madhavi, Shengmei Shen, Haizhou Li 0001, 
Speaker-Utterance Dual Attention for Speaker and Utterance Verification.

Interspeech2020 Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, Haizhou Li 0001, 
The Attacker's Perspective on Automatic Speaker Verification: An Overview.

Interspeech2020 Xiaoyi Qin, Ming Li 0026, Hui Bu, Wei Rao, Rohan Kumar Das, Shrikanth Narayanan, Haizhou Li 0001, 
The INTERSPEECH 2020 Far-Field Speaker Verification Challenge.

Interspeech2020 Ruijie Tao, Rohan Kumar Das, Haizhou Li 0001, 
Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network.

Interspeech2020 Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li 0001, 
Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks.

TASLP2019 Jichen Yang, Rohan Kumar Das, Nina Zhou, 
Extraction of Octave Spectra Information for Spoofing Attack Detection.

ICASSP2019 Yi Zhou 0020, Xiaohai Tian, Haihua Xu, Rohan Kumar Das, Haizhou Li 0001, 
Cross-lingual Voice Conversion with Bilingual Phonetic Posteriorgram and Average Modeling.

Interspeech2019 Rohan Kumar Das, Haizhou Li 0001, 
Instantaneous Phase and Long-Term Acoustic Cues for Orca Activity Detection.

Interspeech2019 Rohan Kumar Das, Jichen Yang, Haizhou Li 0001, 
Long Range Acoustic Features for Spoofed Speech Detection.

Interspeech2019 Sarfaraz Jelil, Abhishek Shrivastava, Rohan Kumar Das, S. R. Mahadeva Prasanna, Rohit Sinha 0003, 
SpeechMarker: A Voice Based Multi-Level Attendance Application.

#193  | Heidi Christensen | DBLP Google Scholar  
By venueInterspeech: 16ICASSP: 7TASLP: 1SpeechComm: 1
By year2023: 12022: 62021: 62020: 82019: 32018: 1
ISCA sessionsspeech and language in health: 2assessment of pathological speech and language: 2speech and voice disorders: 2speech in health: 2summarization, entity extraction, evaluation and others: 1technology for disordered speech: 1survey talk: 1the adresso challenge: 1alzheimer’s dementia recognition through spontaneous speech: 1voice and hearing disorders: 1medical applications and visual asr: 1integrating speech science and technology for clinical applications: 1
IEEE keywordsspeech recognition: 5handicapped aids: 3multi stream acoustic modelling: 2cognition: 2biometrics (access control): 1medical treatment: 1transgender voice: 1focusing: 1binary gender classification: 1human evaluation: 1dysarthric automatic speech recognition: 1filtering theory: 1speech coding: 1source filter separation and fusion: 1buildings: 1feature fusion: 1multi modal dysarthric speech recognition: 1convolution: 1databases: 1performance gain: 1support vector machines: 1regression analysis: 1x vector: 1multi task learning: 1age estimation: 1cognitive decline estimation: 1behavioural sciences computing: 1sincnet: 1transfer learning: 1dysarthric speech recognition: 1probability: 1posterior probability: 1entropy: 1gaussian distribution: 1speaker recognition: 1data selection: 1spectral analysis: 1language modelling: 1natural language processing: 1continuous dysarthric speech recognition: 1vocabulary: 1out of domain data: 1clinical applications of speech technology: 1pattern classification: 1automatic speech recognition: 1diseases: 1medical disorders: 1patient diagnosis: 1virtual reality: 1geriatrics: 1medical diagnostic computing: 1neurophysiology: 1speaker diarisation: 1software agents: 1brain: 1dysarthria: 1speech tempo: 1hidden markov models: 1personalised speech recognition: 1data augmentation: 1mixture models: 1time domain analysis: 1phonetics: 1gaussian processes: 1
Most publications (all venues) at2020: 92021: 82017: 72015: 72022: 6

Affiliations
URLs

Recent publications

ICASSP2023 Sebastian Ellis, Stefan Goetze, Heidi Christensen
Moving Towards Non-Binary Gender Identification Via Analysis of System Errors in Binary Gender Classification.

TASLP2022 Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic, 
Acoustic Modelling From Raw Source and Filter Components for Dysarthric Speech Recognition.

ICASSP2022 Zhengjun Yue, Erfan Loweimi, Zoran Cvetkovic, Heidi Christensen, Jon Barker, 
Multi-Modal Acoustic-Articulatory Feature Fusion For Dysarthric Speech Recognition.

Interspeech2022 Samuel Hollands, Daniel Blackburn, Heidi Christensen
Evaluating the Performance of State-of-the-Art ASR Systems on Non-Native English using Corpora with Extensive Language Background Variation.

Interspeech2022 Bahman Mirheidari, Daniel Blackburn, Heidi Christensen
Automatic cognitive assessment: Combining sparse datasets with disparate cognitive scores.

Interspeech2022 Bahman Mirheidari, André Bittar, Nicholas Cummins, Johnny Downs, Helen L. Fisher, Heidi Christensen
Automatic Detection of Expressed Emotion from Five-Minute Speech Samples: Challenges and Opportunities.

Interspeech2022 Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic, 
Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs.

SpeechComm2021 Lubna Alhinti, Heidi Christensen, Stuart P. Cunningham, 
Acoustic differences in emotional speech of people with dysarthria.

ICASSP2021 Yilin Pan, Venkata Srikanth Nallanthighal, Daniel Blackburn, Heidi Christensen, Aki Härmä, 
Multi-Task Estimation of Age and Cognitive Decline from Speech.

Interspeech2021 Heidi Christensen
Towards Automatic Speech Recognition for People with Atypical Speech.

Interspeech2021 Bahman Mirheidari, Yilin Pan, Daniel Blackburn, Ronan O'Malley, Heidi Christensen
Identifying Cognitive Impairment Using Sentence Representation Vectors.

Interspeech2021 Yilin Pan, Bahman Mirheidari, Jennifer M. Harris, Jennifer C. Thompson, Matthew Jones, Julie S. Snowden, Daniel Blackburn, Heidi Christensen
Using the Outputs of Different Automatic Speech Recognition Paradigms for Acoustic- and BERT-Based Alzheimer's Dementia Detection Through Spontaneous Speech.

Interspeech2021 Zhengjun Yue, Jon Barker, Heidi Christensen, Cristina McKean, Elaine Ashton, Yvonne Wren, Swapnil Gadgil, Rebecca Bright, 
Parental Spoken Scaffolding and Narrative Skills in Crowd-Sourced Storytelling Samples of Young Children.

ICASSP2020 Feifei Xiong, Jon Barker, Zhengjun Yue, Heidi Christensen
Source Domain Data Selection for Improved Transfer Learning Targeting Dysarthric Speech Recognition.

ICASSP2020 Zhengjun Yue, Feifei Xiong, Heidi Christensen, Jon Barker, 
Exploring Appropriate Acoustic and Language Modelling Choices for Continuous Dysarthric Speech Recognition.

Interspeech2020 Lubna Alhinti, Stuart P. Cunningham, Heidi Christensen
Recognising Emotions in Dysarthric Speech Using Typical Speech Data.

Interspeech2020 Nicholas Cummins, Yilin Pan, Zhao Ren, Julian Fritsch, Venkata Srikanth Nallanthighal, Heidi Christensen, Daniel Blackburn, Björn W. Schuller, Mathew Magimai-Doss, Helmer Strik, Aki Härmä, 
A Comparison of Acoustic and Linguistics Methodologies for Alzheimer's Dementia Recognition.

Interspeech2020 Bahman Mirheidari, Daniel Blackburn, Ronan O'Malley, Annalena Venneri, Traci Walker, Markus Reuber, Heidi Christensen
Improving Cognitive Impairment Classification by Generative Neural Network-Based Feature Augmentation.

Interspeech2020 Yilin Pan, Bahman Mirheidari, Markus Reuber, Annalena Venneri, Daniel Blackburn, Heidi Christensen
Improving Detection of Alzheimer's Disease Using Automatic Speech Recognition to Identify High-Quality Segments for More Robust Feature Extraction.

Interspeech2020 Yilin Pan, Bahman Mirheidari, Zehai Tu, Ronan O'Malley, Traci Walker, Annalena Venneri, Markus Reuber, Daniel Blackburn, Heidi Christensen
Acoustic Feature Extraction with Interpretable Deep Neural Network for Neurodegenerative Related Disorder Classification.

#194  | Meng Ge | DBLP Google Scholar  
By venueInterspeech: 15ICASSP: 8SpeechComm: 1TASLP: 1
By year2024: 42023: 42022: 102021: 32020: 32019: 1
ISCA sessionsanalysis of speech and audio signals: 4spatial audio: 3source separation: 2asr: 2speech representation: 1targeted source separation: 1single-channel speech enhancement: 1speech enhancement: 1
IEEE keywordsspeech recognition: 3speaker recognition: 3target speaker extraction: 2sparsely overlapped speech: 2voice activity detection: 2speaker extraction: 2speaker embedding: 2reverberation: 2data mining: 1synchronization: 1active speaker detection: 1audio visual: 1interference: 1speech: 1low snr: 1testing: 1optimization: 1artificial noise: 1signal to noise ratio: 1background noise: 1speaker verification: 1gradient: 1noise robust: 1noise robustness: 1biological system modeling: 1computational modeling: 1recurrent neural networks: 1spiking neural network (snn): 1voice activity detection (vad): 1auditory attention: 1robustness: 1power demand: 1general speech mixture: 1multi modal: 1scenario aware differentiated loss: 1speech intelligibility: 1filtering theory: 1direction of arrival estimation: 1beamforming: 1doa estimation: 1speaker localizer: 1array signal processing: 1feature distillation: 1task driven loss: 1model compression: 1transformer: 1signal fusion: 1multi stage: 1time domain: 1auditory encoder: 1hearing: 1convolutional neural network: 1ear: 1time frequency analysis: 1multi target learning: 1speech dereverberation: 1two stage: 1spectrograms fusion: 1
Most publications (all venues) at2022: 172024: 112023: 92021: 72020: 6

Affiliations
URLs

Recent publications

SpeechComm2024 Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li 0010, Jianwu Dang 0001, 
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network.

ICASSP2024 Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang 0016, Haizhou Li 0001, 
Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-Talker Speech.

ICASSP2024 Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li 0001, 
Gradient Weighting for Speaker Verification in Extremely Low Signal-to-Noise Ratio.

ICASSP2024 Qu Yang, Qianhui Liu, Nan Li, Meng Ge, Zeyang Song, Haizhou Li 0001, 
SVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks.

Interspeech2023 Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang 0001, Chengyun Deng, Fei Wang, 
Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation.

Interspeech2023 Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang 0001, Shiliang Zhang, 
Rethinking the Visual Cues in Audio-Visual Speaker Extraction.

Interspeech2023 Qinghua Liu, Meng Ge, Zhizheng Wu 0001, Haizhou Li 0001, 
PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network.

Interspeech2023 Honglong Wang, Chengyun Deng, Yanjie Fu, Meng Ge, Longbiao Wang, Gaoyan Zhang, Jianwu Dang 0001, Fei Wang, 
SDNet: Stream-attention and Dual-feature Learning Network for Ad-hoc Array Speech Separation.

TASLP2022 Zexu Pan, Meng Ge, Haizhou Li 0001, 
USEV: Universal Speaker Extraction With Visual Cue.

ICASSP2022 Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang 0001, Haizhou Li 0001, 
L-SpEx: Localized Target Speaker Extraction.

ICASSP2022 Yongjie Lv, Longbiao Wang, Meng Ge, Sheng Li 0010, Chenchen Ding, Lixin Pan, Yuguang Wang 0003, Jianwu Dang 0001, Kiyoshi Honda, 
Compressing Transformer-Based ASR Model by Task-Driven Loss and Attention-Based Multi-Level Feature Distillation.

Interspeech2022 Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang 0001, 
Iterative Sound Source Localization for Unknown Number of Sources.

Interspeech2022 Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang 0001, 
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network.

Interspeech2022 Nan Li, Meng Ge, Longbiao Wang, Masashi Unoki, Sheng Li 0010, Jianwu Dang 0001, 
Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network.

Interspeech2022 Zexu Pan, Meng Ge, Haizhou Li 0001, 
A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction.

Interspeech2022 Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang 0001, 
Language-specific Characteristic Assistance for Code-switching Speech Recognition.

Interspeech2022 Qiang Xu, Tongtong Song, Longbiao Wang, Hao Shi, Yuqin Lin, Yongjie Lv, Meng Ge, Qiang Yu 0005, Jianwu Dang 0001, 
Self-Distillation Based on High-level Information Supervision for Compressing End-to-End ASR Model.

Interspeech2022 Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang 0001, 
MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources.

ICASSP2021 Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang 0001, Haizhou Li 0001, 
Multi-Stage Speaker Extraction with Utterance and Frame-Level Reference Signals.

ICASSP2021 Nan Li, Longbiao Wang, Masashi Unoki, Sheng Li 0010, Rui Wang 0102, Meng Ge, Jianwu Dang 0001, 
Robust Voice Activity Detection Using a Masked Auditory Encoder Based Convolutional Neural Network.

#195  | Gaofeng Cheng | DBLP Google Scholar  
By venueInterspeech: 11TASLP: 8ICASSP: 5SpeechComm: 1
By year2024: 32023: 12022: 122021: 32020: 22019: 12018: 3
ISCA sessionsnovel models and training methods for asr: 2speaker embedding and diarization: 1asr: 1spoken language processing: 1multi-, cross-lingual and other topics in asr: 1low-resource asr development: 1asr neural network training: 1novel neural network architectures for acoustic modelling: 1neural network training strategies for asr: 1acoustic modelling: 1
IEEE keywordsspeech recognition: 11end to end speech recognition: 6automatic speech recognition: 5hidden markov models: 3text analysis: 3natural language processing: 3decoding: 3pre training: 2filtering: 2data models: 2estimation: 2pseudo labeling: 2computational modeling: 2end to end: 2signal classification: 2ctc/attention speech recognition: 2error analysis: 2transformer: 2online speech recognition: 2computer architecture: 2oral communication: 1clustering algorithms: 1metric embedding learning: 1online clustering: 1speaker diarization: 1machine learning algorithms: 1task analysis: 1domain adaptation: 1adaptation models: 1self supervised learning: 1semi supervised learning: 1noise measurement: 1tuning: 1mixture models: 1hybrid dnn hmm speech recognition: 1gaussian processes: 1entropy: 1pattern classification: 1long tailed problem: 1probability: 1random processes: 1supervised learning: 1self supervised pre training: 1knowledge transfer: 1connectionist temporal classification: 1pre trained language model: 1non autoregressive: 1autoregressive processes: 1keyword confidence scoring: 1keyword search: 1transformers: 1phoneme alignment: 1history: 1language model: 1graphics processing units: 1history utterance: 1performance gain: 1grammars: 1speech coding: 1unpaired data: 1heuristic algorithms: 1hybrid ctc/attention speech recognition: 1computational efficiency: 1
Most publications (all venues) at2022: 172021: 72024: 52018: 42019: 3

Affiliations
URLs

Recent publications

SpeechComm2024 Sanli Tian, Zehan Li, Zhaobiao Lyv, Gaofeng Cheng, Qing Xiao, Ta Li, Qingwei Zhao, 
Factorized and progressive knowledge distillation for CTC-based ASR models.

TASLP2024 Yifan Chen, Gaofeng Cheng, Runyan Yang, Pengyuan Zhang, Yonghong Yan 0002, 
Interrelate Training and Clustering for Online Speaker Diarization.

TASLP2024 Han Zhu 0004, Gaofeng Cheng, Jindong Wang 0001, Wenxin Hou, Pengyuan Zhang, Yonghong Yan 0002, 
Boosting Cross-Domain Speech Recognition With Self-Supervision.

TASLP2023 Han Zhu 0004, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan 0002, 
Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition.

TASLP2022 Gaofeng Cheng, Haoran Miao, Runyan Yang, Keqi Deng, Yonghong Yan 0002, 
ETEH: Unified Attention-Based End-to-End ASR and KWS Architecture.

TASLP2022 Keqi Deng, Gaofeng Cheng, Runyan Yang, Yonghong Yan 0002, 
Alleviating ASR Long-Tailed Problem by Decoupling the Learning of Representation and Classification.

TASLP2022 Changfeng Gao, Gaofeng Cheng, Ta Li, Pengyuan Zhang, Yonghong Yan 0002, 
Self-Supervised Pre-Training for Attention-Based Encoder-Decoder ASR Model.

ICASSP2022 Keqi Deng, Songjun Cao, Yike Zhang, Long Ma, Gaofeng Cheng, Ji Xu, Pengyuan Zhang, 
Improving CTC-Based Speech Recognition Via Knowledge Transferring from Pre-Trained Language Models.

ICASSP2022 Keqi Deng, Zehui Yang, Shinji Watanabe 0001, Yosuke Higuchi, Gaofeng Cheng, Pengyuan Zhang, 
Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models.

Interspeech2022 Yifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan 0002, 
Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization.

Interspeech2022 Zehan Li, Haoran Miao, Keqi Deng, Gaofeng Cheng, Sanli Tian, Ta Li, Yonghong Yan 0002, 
Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies.

Interspeech2022 Sanli Tian, Keqi Deng, Zehan Li, Lingxuan Ye, Gaofeng Cheng, Ta Li, Yonghong Yan 0002, 
Knowledge Distillation For CTC-based Speech Recognition Via Consistent Acoustic Representation Learning.

Interspeech2022 Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie 0001, Yonghong Yan 0002, 
Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset.

Interspeech2022 Lingxuan Ye, Gaofeng Cheng, Runyan Yang, Zehui Yang, Sanli Tian, Pengyuan Zhang, Yonghong Yan 0002, 
Improving Recognition of Out-of-vocabulary Words in E2E Code-switching ASR by Fusing Speech Generation Methods.

Interspeech2022 Han Zhu 0004, Li Wang, Gaofeng Cheng, Jindong Wang 0001, Pengyuan Zhang, Yonghong Yan 0002, 
Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR.

Interspeech2022 Han Zhu 0004, Jindong Wang 0001, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan 0002, 
Decoupled Federated Learning for ASR with Non-IID Data.

TASLP2021 Runyan Yang, Gaofeng Cheng, Haoran Miao, Ta Li, Pengyuan Zhang, Yonghong Yan 0002, 
Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments.

ICASSP2021 Keqi Deng, Gaofeng Cheng, Haoran Miao, Pengyuan Zhang, Yonghong Yan 0002, 
History Utterance Embedding Transformer LM for Speech Recognition.

ICASSP2021 Changfeng Gao, Gaofeng Cheng, Runyan Yang, Han Zhu 0004, Pengyuan Zhang, Yonghong Yan 0002, 
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Text Data.

TASLP2020 Haoran Miao, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan 0002, 
Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture.

#196  | Visar Berisha | DBLP Google Scholar  
By venueInterspeech: 14ICASSP: 7TASLP: 3NeurIPS: 1
By year2023: 62022: 22021: 32020: 42019: 72018: 3
ISCA sessionsspeech coding and enhancement: 1speech and language in health: 1pathological speech analysis: 1assessment of pathological speech and language: 1speech enhancement and intelligibility: 1asr neural network architectures: 1voice and hearing disorders: 1emotion and personality in conversation: 1voice quality characterization for clinical voice assessment: 1speech and language analytics for mental health: 1speech and audio classification: 1speaker verification using neural network methods: 1applications in education and learning: 1spoken dialogue systems and conversational analysis: 1
IEEE keywordsdysarthria: 4hypernasality: 4robustness: 3medical disorders: 3speech: 3production: 2acoustic measurements: 2cleft lip and palate: 2clinical speech analytics: 2speech recognition: 2signal classification: 2velopharyngeal dysfunction: 2medical signal processing: 2and second language learning: 1convolution neural networks: 1articulation precision: 1consonant vowel transitions: 1pronunciation scores: 1databases: 1embedding learning: 1convolution: 1contrastive loss: 1spectrogram: 1dysphonic voice: 1recording: 1rivers: 1deepfake technology: 1benford’s law: 1frequency domain analysis: 1speech spectra: 1finance: 1synthetic speech: 1detecting deepfakes: 1decorrelation: 1transformers: 1bit error rate: 1language modeling: 1decorrelated features: 1predictive models: 1estimation error: 1attention: 1cleft palate: 1recurrent neural networks: 1cavity resonators: 1neurological diseases: 1speech features: 1pattern classification: 1deep neural network: 1dysarthric speech: 1bioacoustics: 1amyotrophic lateral sclerosis (als): 1diseases: 1patient treatment: 1patient diagnosis: 1tremor: 1neurophysiology: 1velopharyngeal dysfunction: 1automatic speech recognition: 1error analysis: 1mathematical model: 1sentence embeddings: 1asr error simulator: 1semantic embedding: 1natural language processing: 1semantics: 1task analysis: 1
Most publications (all venues) at2019: 132020: 102018: 92016: 92023: 8

Affiliations
URLs

Recent publications

TASLP2023 Vikram C. Mathad, Julie M. Liss, Kathy Chapman, Nancy Scherer, Visar Berisha
Consonant-Vowel Transition Models Based on Deep Learning for Objective Evaluation of Articulation.

TASLP2023 Jianwei Zhang, Julie Liss, Suren Jayasuriya, Visar Berisha
Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection.

ICASSP2023 Leo Hsu, Visar Berisha
Does Human Speech Follow Benford's Law?

ICASSP2023 Lingfeng Xu, Kimberly D. Mueller, Julie Liss, Visar Berisha
Decorrelating Language Model Embeddings for Speech-Based Prediction of Cognitive Impairment.

Interspeech2023 Yan Xiong, Visar Berisha, Chaitali Chakrabarti, 
Aligning Speech Enhancement for Improving Downstream Classification Performance.

NeurIPS2023 Jianwei Zhang, Suren Jayasuriya, Visar Berisha
Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer.

Interspeech2022 Visar Berisha, Chelsea Krantsevich, Gabriela Stegmann, Shira Hahn, Julie Liss, 
Are reported accuracies in the clinical speech machine learning literature overoptimistic?

Interspeech2022 Kelvin Tran, Lingfeng Xu, Gabriela Stegmann, Julie Liss, Visar Berisha, Rene Utianski, 
Investigating the Impact of Speech Compression on the Acoustics of Dysarthric Speech.

ICASSP2021 Vikram C. Mathad, Nancy Scherer, Kathy Chapman, Julie Liss, Visar Berisha
An Attention Model for Hypernasality Prediction in Children with Cleft Palate.

Interspeech2021 Vikram C. Mathad, Tristan J. Mahr, Nancy Scherer, Kathy Chapman, Katherine C. Hustad, Julie Liss, Visar Berisha
The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation.

Interspeech2021 Jianwei Zhang, Suren Jayasuriya, Visar Berisha
Restoring Degraded Speech via a Modified Diffusion Model.

TASLP2020 Michael Saxon, Ayush Tripathi, Yishan Jiao, Julie M. Liss, Visar Berisha
Robust Estimation of Hypernasality in Dysarthria With Acoustic Model Likelihood Features.

ICASSP2020 Vikram C. Mathad, Kathy Chapman, Julie Liss, Nancy Scherer, Visar Berisha
Deep Learning Based Prediction of Hypernasality for Clinical Applications.

Interspeech2020 Deepak Kadetotad, Jian Meng, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo, 
Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity.

Interspeech2020 Meredith Moore, Piyush Papreja, Michael Saxon, Visar Berisha, Sethuraman Panchanathan, 
UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech.

ICASSP2019 Jacob Peplinski, Visar Berisha, Julie Liss, Shira Hahn, Jeremy Shefner, Seward B. Rutkove, Kristin Qi, Kerisa Shelton, 
Objective Assessment of Vocal Tremor.

ICASSP2019 Michael Saxon, Julie Liss, Visar Berisha
Objective Measures of Plosive Nasalization in Hypernasal Speech.

ICASSP2019 Rohit Voleti, Julie M. Liss, Visar Berisha
Investigating the Effects of Word Substitution Errors on Sentence Embeddings.

Interspeech2019 Nichola Lubold, Stephanie A. Borrie, Tyson S. Barrett, Megan M. Willi, Visar Berisha
Do Conversational Partners Entrain on Articulatory Precision?

Interspeech2019 Meredith Moore, Michael Saxon, Hemanth Venkateswara, Visar Berisha, Sethuraman Panchanathan, 
Say What? A Dataset for Exploring the Error Patterns That Two ASR Engines Make.

#197  | Xueliang Zhang 0001 | DBLP Google Scholar  
By venueInterspeech: 11ICASSP: 10TASLP: 4
By year2024: 22023: 12022: 62021: 42020: 62019: 42018: 2
ISCA sessionsdereverberation and echo cancellation: 2speech coding and privacy: 1conferencingspeech 2021 challenge: 1interspeech 2021 deep noise suppression challenge: 1noise robust and distant speech recognition: 1speech and audio quality assessment: 1phonetic event detection and segmentation: 1speech enhancement: 1novel approaches to enhancement: 1deep learning for source separation and pitch tracking: 1
IEEE keywordsspeech enhancement: 7recurrent neural nets: 4time frequency analysis: 3acoustic noise: 3complex spectral mapping: 3convolutional neural nets: 3direction of arrival estimation: 2target speaker extraction: 2recurrent neural networks: 2convolution: 2arn: 2data mining: 2anchor information: 2neural network: 2bone conduction: 2attention based fusion: 2microphones: 2speech intelligibility: 2real time systems: 1gsc: 1refining: 1signal processing algorithms: 1inplace crn: 1task analysis: 1noise reduction: 1fuses: 1speakerfilter: 1production: 1cepstral analysis: 1fourier transforms: 1predictive models: 1sensor fusion: 1signal denoising: 1air conduction: 1computational complexity: 1speech dereverberation: 1microphone array processing: 1mathematical operators: 1convolutional recurrent neural network: 1supervised single channel speech enhancement: 1loss metric mismatch: 1function smoothing: 1feature combination: 1frame level snr estimation: 1long short term memory: 1densely connected convolutional recurrent network: 1on device processing: 1microphone arrays: 1real time speech enhancement: 1mobile communication: 1dual microphone mobile phones: 1generative vocoder: 1vocoders: 1speech coding: 1joint framework: 1denoising autoencoder: 1monaural speech enhancement: 1deepfilter: 1speaker extraction: 1particle separators: 1signal to noise ratio: 1systematics: 1robust speaker localization: 1gcc phat: 1steered response power: 1time frequency masking: 1deep neural networks: 1audio signal processing: 1signal approximation: 1mask estimation: 1real spectrum: 1robust speaker verification: 1deep speaker: 1convolutional recurrent network: 1speech separation: 1
Most publications (all venues) at2022: 112020: 102019: 102018: 72024: 6

Affiliations
Inner Mongolia University, College of Computer Science, Inner Mongolia Key Laboratory of Mongolian Information Processing Technology, China
Chinese Academy of Sciences, National Laboratory of Pattern Recognition, NLPR, Institute of Automation, China
URLs

Recent publications

ICASSP2024 Shulin He, Jinjiang Liu, Hao Li 0046, Yang Yang 0121, Fei Chen 0011, Xueliang Zhang 0001
3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications.

ICASSP2024 Shulin He, Huaiwen Zhang, Wei Rao, Kanghao Zhang, Yukai Ju, Yang Yang 0121, Xueliang Zhang 0001
Hierarchical Speaker Representation for Target Speaker Extraction.

ICASSP2023 Shulin He, Wei Rao, Jinjiang Liu, Jun Chen 0024, Yukai Ju, Xueliang Zhang 0001, Yannan Wang, Shidong Shang, 
Speech Enhancement with Intelligent Neural Homomorphic Synthesis.

TASLP2022 Heming Wang, Xueliang Zhang 0001, DeLiang Wang, 
Fusing Bone-Conduction and Air-Conduction Sensors for Complex-Domain Speech Enhancement.

ICASSP2022 Jinjiang Liu, Xueliang Zhang 0001
DRC-NET: Densely Connected Recurrent Convolutional Neural Network for Speech Dereverberation.

ICASSP2022 Heming Wang, Xueliang Zhang 0001, DeLiang Wang, 
Attention-Based Fusion for Bone-Conducted and Air-Conducted Speech Enhancement in the Complex Domain.

ICASSP2022 Yang Yang 0121, Hui Zhang 0031, Xueliang Zhang 0001, Huaiwen Zhang, 
Alleviating the Loss-Metric Mismatch in Supervised Single-Channel Speech Enhancement.

Interspeech2022 Jiahui Pan, Shuai Nie, Hui Zhang 0031, Shulin He, Kanghao Zhang, Shan Liang, Xueliang Zhang 0001, Jianhua Tao 0001, 
Speaker recognition-assisted robust audio deepfake detection.

Interspeech2022 Chenggang Zhang, Jinjiang Liu, Xueliang Zhang 0001
LCSM: A Lightweight Complex Spectral Mapping Framework for Stereophonic Acoustic Echo Cancellation.

TASLP2021 Hao Li 0046, DeLiang Wang, Xueliang Zhang 0001, Guanglai Gao, 
Recurrent Neural Networks and Acoustic Features for Frame-Level Signal-to-Noise Ratio Estimation.

ICASSP2021 Ke Tan 0001, Xueliang Zhang 0001, DeLiang Wang, 
Real-Time Speech Enhancement for Mobile Communication Based on Dual-Channel Complex Spectral Mapping.

Interspeech2021 Jinjiang Liu, Xueliang Zhang 0001
Inplace Gated Convolutional Recurrent Neural Network for Dual-Channel Speech Enhancement.

Interspeech2021 Kanghao Zhang, Shulin He, Hao Li 0046, Xueliang Zhang 0001
DBNet: A Dual-Branch Network Architecture Processing on Spectrum and Waveform for Single-Channel Speech Enhancement.

TASLP2020 Zhihao Du, Xueliang Zhang 0001, Jiqing Han 0001, 
A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement.

ICASSP2020 Shulin He, Hao Li 0046, Xueliang Zhang 0001
Speakerfilter: Deep Learning-Based Target Speaker Extraction Using Anchor Speech.

Interspeech2020 Zhihao Du, Jiqing Han 0001, Xueliang Zhang 0001
Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition.

Interspeech2020 Hao Li 0046, DeLiang Wang, Xueliang Zhang 0001, Guanglai Gao, 
Frame-Level Signal-to-Noise Ratio Estimation Using Deep Learning.

Interspeech2020 Tianjiao Xu, Hui Zhang 0031, Xueliang Zhang 0001
Polishing the Classical Likelihood Ratio Test by Supervised Learning for Voice Activity Detection.

Interspeech2020 Chenggang Zhang, Xueliang Zhang 0001
A Robust and Cascaded Acoustic Echo Cancellation Based on Deep Learning.

TASLP2019 Zhong-Qiu Wang, Xueliang Zhang 0001, DeLiang Wang, 
Robust Speaker Localization Guided by Deep Learning-Based Time-Frequency Masking.

#198  | Wen-Chin Huang | DBLP Google Scholar  
By venueICASSP: 12Interspeech: 8TASLP: 4ACL: 1
By year2024: 32023: 22022: 92021: 82020: 12019: 2
ISCA sessionsspeech synthesis: 3voice conversion and adaptation: 2the voicemos challenge: 1technology for disordered speech: 1neural techniques for voice conversion and waveform generation: 1
IEEE keywordsspeech synthesis: 6voice conversion: 6speech recognition: 5task analysis: 3speaker recognition: 3pathology: 2model pretraining: 2linguistics: 2computational modeling: 2benchmark testing: 2protocols: 2self supervised learning: 2electrolaryngeal speech: 2training data: 2speech intelligibility: 2mos prediction: 2sequence to sequence modeling: 2self supervised speech representation: 2voice conversion (vc): 2natural language processing: 2sequence to sequence: 2transformer: 2transfer learning: 1domain adaptation: 1automatic speech recognition (asr): 1low resourced asr: 1electrolaryngeal (el) speech: 1biological system modeling: 1task generalization: 1evaluation: 1benchmark: 1representation learning: 1analytical models: 1foundation model: 1speech: 1error analysis: 1speech enhancement: 1data mining: 1intelligibility enhancement: 1atypical speech: 1prosody transfer: 1tv: 1expressive speech to speech translation: 1controllable text to speech: 1focusing: 1writing: 1automatic speech recognition: 1minimally resourced asr: 1limiting: 1timbre: 1speech naturalness assessment: 1mean opinion score: 1speech quality assessment: 1hearing: 1decision making: 1dysarthric speech: 1pathological speech: 1autoencoder: 1computer based training: 1open source: 1noisy to noisy vc: 1noisy speech modeling: 1signal denoising: 1pretraining: 1recurrent neural nets: 1transformer network: 1attention: 1sequence to sequence learning: 1data models: 1many to many vc: 1decoding: 1computer architecture: 1convolution: 1non autoregressive: 1conformer: 1vq wav2vec: 1any to one voice conversion: 1signal representation: 1bert: 1language model: 1text analysis: 1vector quantized variational autoencoder: 1vocoders: 1open source software: 1nonparallel: 1neural vocoder: 1gaussian processes: 1
Most publications (all venues) at2021: 152022: 102020: 102023: 82024: 6

Affiliations
URLs

Recent publications

TASLP2024 Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda, 
Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition.

TASLP2024 Shu-Wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li 0001, Abdelrahman Mohamed, Shinji Watanabe 0001, Hung-yi Lee, 
A Large-Scale Evaluation of Speech Foundation Models.

ICASSP2024 Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda, 
Electrolaryngeal Speech Intelligibility Enhancement through Robust Linguistic Encoders.

ICASSP2023 Wen-Chin Huang, Benjamin Peloquin, Justine Kao, Changhan Wang, Hongyu Gong, Elizabeth Salesky, Yossi Adi, Ann Lee 0001, Peng-Jen Chen, 
A Holistic Cascade System, Benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation.

ICASSP2023 Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda, 
Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition.

ICASSP2022 Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi, 
Generalization Ability of MOS Prediction Networks.

ICASSP2022 Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda, 
LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech.

ICASSP2022 Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda, 
Towards Identity Preserving Normal to Dysarthric Voice Conversion.

ICASSP2022 Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe 0001, Tomoki Toda, 
S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations.

ICASSP2022 Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, Tomoki Toda, 
Direct Noisy Speech Modeling for Noisy-To-Noisy Voice Conversion.

Interspeech2022 Wen-Chin Huang, Erica Cooper, Yu Tsao 0001, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi, 
The VoiceMOS Challenge 2022.

Interspeech2022 Wen-Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru, Anjali Menon, 
End-to-End Binaural Speech Synthesis.

Interspeech2022 Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda, 
Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition.

ACL2022 Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-Wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li 0001, Shinji Watanabe 0001, Abdelrahman Mohamed, Hung-yi Lee, 
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities.

TASLP2021 Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda, 
Pretraining Techniques for Sequence-to-Sequence Voice Conversion.

TASLP2021 Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Tomoki Toda, 
Many-to-Many Voice Transformer Network.

ICASSP2021 Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda, 
Non-Autoregressive Sequence-To-Sequence Voice Conversion.

ICASSP2021 Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, 
Any-to-One Sequence-to-Sequence Voice Conversion Using Self-Supervised Discrete Speech Representations.

ICASSP2021 Wen-Chin Huang, Chia-Hua Wu, Shang-Bao Luo, Kuan-Yu Chen, Hsin-Min Wang, Tomoki Toda, 
Speech Recognition by Simply Fine-Tuning Bert.

ICASSP2021 Kazuhiro Kobayashi, Wen-Chin Huang, Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Tomoki Toda, 
Crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder.

#199  | Jinsong Zhang 0001 | DBLP Google Scholar  
By venueInterspeech: 22TASLP: 2ICASSP: 1
By year2023: 32022: 42021: 42020: 72019: 22018: 5
ISCA sessionsspeech synthesis: 2applications in transcription, education and learning: 2pronunciation: 2speech perception, production, and acquisition: 1miscellaneous topics in speech, voice and hearing disorders: 1show and tell iii (vr): 1non-native speech: 1speech perception: 1speech signal representation: 1bi- and multilinguality: 1tonal aspects of acoustic phonetics and prosody: 1speech annotation and speech assessment: 1first and second language acquisition: 1bilingualism, l2, and non-nativeness: 1speech and speaker perception: 1source and supra-segmentals: 1second language acquisition and code-switching: 1deep learning for source separation and pitch tracking: 1speech prosody: 1
IEEE keywordsdpgmm: 2unsupervised phoneme discovery: 2zerospeech: 2recurrent neural nets: 2unsupervised learning: 2gaussian processes: 2redundancy: 1computational modeling: 1computational efficiency: 1source coding: 1mutual information: 1light weight: 1linguistics: 1disentanglement: 1voice conversion: 1speech recognition: 1hearing: 1low resource asr: 1natural language processing: 1infant speech perception: 1engrams: 1perception of phonemes: 1rnn: 1functional load: 1
Most publications (all venues) at2016: 162022: 142021: 142018: 142020: 12

Affiliations
Beijing Language and Culture University, Beijing, China
TU Dresden, Institute of Acoustics and Speech Communication, Dresden, Germany (former)
URLs

Recent publications

ICASSP2023 Liangjie Huang, Tian Yuan, Yunming Liang, Zeyu Chen, Can Wen, Yanlu Xie, Jinsong Zhang 0001, Dengfeng Ke, 
LIMI-VC: A Light Weight Voice Conversion Model with Mutual Information Disentanglement.

Interspeech2023 Lixia Hao, Qi Gong, Jinsong Zhang 0001
The effect of stress on Mandarin tonal perception in continuous speech for Spanish-speaking learners.

Interspeech2023 Ruishan Li, Yingming Gao, Yanlu Xie, Dengfeng Ke, Jinsong Zhang 0001
Dual Audio Encoders Based Mandarin Prosodic Boundary Prediction by Using Multi-Granularity Prosodic Representations.

TASLP2022 Bin Wu, Sakriani Sakti, Jinsong Zhang 0001, Satoshi Nakamura 0001, 
Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-Resource ASR.

Interspeech2022 Jingwen Cheng, Yuchen Yan, Yingming Gao, Xiaoli Feng, Yannan Wang, Jinsong Zhang 0001
A study of production error analysis for Mandarin-speaking Children with Hearing Impairment.

Interspeech2022 Yujia Jin, Yanlu Xie, Jinsong Zhang 0001
A VR Interactive 3D Mandarin Pronunciation Teaching Model.

Interspeech2022 Longfei Yang, Jinsong Zhang 0001, Takahiro Shinozaki, 
Self-Supervised Learning with Multi-Target Contrastive Coding for Non-Native Acoustic Modeling of Mispronunciation Verification.

TASLP2021 Bin Wu, Sakriani Sakti, Jinsong Zhang 0001, Satoshi Nakamura 0001, 
Tackling Perception Bias in Unsupervised Phoneme Discovery Using DPGMM-RNN Hybrid Model and Functional Load.

Interspeech2021 Linkai Peng, Kaiqi Fu, Binghuai Lin, Dengfeng Ke, Jinsong Zhang 0001
A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis.

Interspeech2021 Yuqing Zhang 0003, Zhu Li, Binghuai Lin, Jinsong Zhang 0001
A Preliminary Study on Discourse Prosody Encoding in L1 and L2 English Spontaneous Narratives.

Interspeech2021 Yuqing Zhang 0003, Zhu Li, Bin Wu, Yanlu Xie, Binghuai Lin, Jinsong Zhang 0001
Relationships Between Perceptual Distinctiveness, Articulatory Complexity and Functional Load in Speech Communication.

Interspeech2020 Wang Dai, Jinsong Zhang 0001, Yingming Gao, Wei Wei, Dengfeng Ke, Binghuai Lin, Yanlu Xie, 
Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism.

Interspeech2020 Dan Du, Xianjin Zhu, Zhu Li, Jinsong Zhang 0001
Perception and Production of Mandarin Initial Stops by Native Urdu Speakers.

Interspeech2020 Yingming Gao, Xinyu Zhang, Yi Xu, Jinsong Zhang 0001, Peter Birkholz, 
An Investigation of the Target Approximation Model for Tone Modeling and Recognition in Continuous Mandarin Speech.

Interspeech2020 Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang 0001
Automatic Scoring at Multi-Granularity for L2 Pronunciation.

Interspeech2020 Binghuai Lin, Liyuan Wang, Xiaoli Feng, Jinsong Zhang 0001
Joint Detection of Sentence Stress and Phrase Boundary for Prosody.

Interspeech2020 Yanlu Xie, Xiaoli Feng, Boxue Li, Jinsong Zhang 0001, Yujia Jin, 
A Mandarin L2 Learning APP with Mispronunciation Detection and Feedback.

Interspeech2020 Longfei Yang, Kaiqi Fu, Jinsong Zhang 0001, Takahiro Shinozaki, 
Pronunciation Erroneous Tendency Detection with Language Adversarial Represent Learning.

Interspeech2019 Dan Du, Jinsong Zhang 0001
The Production of Chinese Affricates /ts/ and /tsh/ by Native Urdu Speakers.

Interspeech2019 Shuju Shi, Chilin Shih, Jinsong Zhang 0001
Capturing L1 Influence on L2 Pronunciation by Simulating Perceptual Space Using Acoustic Features.

#200  | Shaojun Wang | DBLP Google Scholar  
By venueInterspeech: 13ICASSP: 8TASLP: 2ICML: 2
By year2024: 42023: 22022: 52021: 82020: 52019: 1
ISCA sessionsspeech synthesis: 5speech, voice, and hearing disorders: 1speech perception, production, and acquisition: 1novel models and training methods for asr: 1multi-, cross-lingual and other topics in asr: 1spoken language modeling and understanding: 1single-channel speech enhancement: 1voice conversion and adaptation: 1speech and audio quality assessment: 1
IEEE keywordstext to speech: 3speech synthesis: 3speech recognition: 3computational modeling: 2end to end: 2task analysis: 2couplings: 1differentiable aligner: 1vae: 1hierarchical vae: 1computer architecture: 1voice conversion: 1static var compensators: 1fuses: 1mutual information: 1emotion decoupling: 1adaptation models: 1adaptive style fusion: 1linguistics: 1correlation: 1adaptive systems: 1singing voice conversion: 1automatic speech recognition: 1benchmark testing: 1monotonic alignment: 1asr: 1time frequency masks: 1beamforming: 1speech enhancement: 1speech region: 1speech enhancement and recognition: 1unet++: 1gaussian processes: 1grammar: 1error analysis: 1pointer generator network: 1generators: 1parameter generator: 1semiotics: 1text normalization: 1unsupervised: 1data acquisition: 1information bottleneck: 1unsupervised learning: 1instance discriminator: 1feature maps: 1network pruning: 1matrix algebra: 1pqr: 1wireless channels: 1linear dependency analysis: 1triples: 1interactive systems: 1natural language interfaces: 1speech based user interfaces: 1graph theory: 1dialog: 1generation: 1knowledge base: 1generative flow: 1non autoregressive: 1autoregressive processes: 1label smoothing: 1neural network: 1entropy: 1recurrent neural nets: 1language model: 1natural language processing: 1
Most publications (all venues) at2019: 162021: 152022: 112020: 92013: 8

Affiliations
URLs

Recent publications

TASLP2024 Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion.

ICASSP2024 Zeyu Yang, Minchuan Chen, Yanping Li, Wei Hu, Shaojun Wang, Jing Xiao 0006, Zijian Li, 
ESVC: Combining Adaptive Style Fusion and Multi-Level Feature Disentanglement for Expressive Singing Voice Conversion.

ICASSP2024 Ziyang Zhuang, Kun Zou, Chenfeng Miao, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao 0006, 
Improving Attention-Based End-to-End Speech Recognition by Monotonic Alignment Attention Matrix Reconstruction.

ICML2024 Chenfeng Miao, Qingying Zhu, Minchuan Chen, Wei Hu, Zijian Li, Shaojun Wang, Jing Xiao 0006, 
DFlow: A Generative Model Combining Denoising AutoEncoder and Normalizing Flow for High Fidelity Waveform Generation.

Interspeech2023 Minchuan Chen, Chenfeng Miao, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
Exploring multi-task learning and data augmentation in dementia detection with self-supervised pretrained models.

Interspeech2023 Fengyun Tan, Chaofeng Feng, Tao Wei, Shuai Gong, Jinqiang Leng, Wei Chu, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
Improving End-to-End Modeling For Mandarin-English Code-Switching Using Lightweight Switch-Routing Mixture-of-Experts.

TASLP2022 Suliang Bu, Yunxin Zhao, Tuo Zhao, Shaojun Wang, Mei Han, 
Modeling Speech Structure to Improve T-F Masks for Speech Enhancement and Recognition.

Interspeech2022 Chenfeng Miao, Ting Chen, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
A compact transformer-based GAN vocoder.

Interspeech2022 Chenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech Recognition.

Interspeech2022 Zongfeng Quan, Nick J. C. Wang, Wei Chu, Tao Wei, Shaojun Wang, Jing Xiao 0006, 
FFM: A Frame Filtering Mechanism To Accelerate Inference Speed For Conformer In Speech Recognition.

Interspeech2022 Ye Wang, Baishun Ling, Yanmeng Wang, Junhao Xue, Shaojun Wang, Jing Xiao 0006, 
Adversarial Knowledge Distillation For Robust Spoken Language Understanding.

ICASSP2021 Weiwei Jiang, Junjie Li, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
Improving Neural Text Normalization with Partial Parameter Generator and Pointer-Generator Network.

ICASSP2021 Shuang Liang, Chenfeng Miao, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
Unsupervised Learning for Multi-Style Speech Synthesis with Limited Data.

ICASSP2021 Hao Pan, Zhongdi Chao, Jiang Qian, Bojin Zhuang, Shaojun Wang, Jing Xiao 0006, 
Network Pruning Using Linear Dependency Analysis on Feature Maps.

ICASSP2021 Yanmeng Wang, Ye Wang, Xingyu Lou, Wenge Rong, Zhenghong Hao, Shaojun Wang
Improving Dialogue Response Generation Via Knowledge Graph Filter.

Interspeech2021 Suliang Bu, Yunxin Zhao, Shaojun Wang, Mei Han, 
Learning Speech Structure to Improve Time-Frequency Masks.

Interspeech2021 Junjie Li, Zhiyu Zhang, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
Improving Polyphone Disambiguation for Mandarin Chinese by Combining Mix-Pooling Strategy and Window-Based Attention.

Interspeech2021 Zhengchen Liu, Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder.

ICML2021 Chenfeng Miao, Shuang Liang, Zhengchen Liu, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture.

ICASSP2020 Chenfeng Miao, Shuang Liang, Minchuan Chen, Jun Ma 0018, Shaojun Wang, Jing Xiao 0006, 
Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow.