Ziyang Ma (马子阳)
Ph.D. student,
Shanghai Jiao Tong University,
800 Dongchuan Rd., Minhang District, Shanghai, China.
zym.22@sjtu.edu.cn
Nanyang Technological University,
50 Nanyang Ave, Singapore 639798.
ziyang012@e.ntu.edu.sg
Biography
Hi👋 nice to meet you!
I am currently enrolled in the Joint Ph.D. Programme of Shanghai Jiao Tong University (SJTU) and Nanyang Technological University (NTU), co-supervised by Prof. Xie Chen from SJTU and Prof. Chng Eng Siong from NTU. I am also a member of the Cross Media (X-) Language Intelligence Lab (X-LANCE), working closely with Prof. Kai Yu. As the first Ph.D. student supervised by Prof. Chen, I will try my best over the next five exciting years! 💪
During my undergraduate years, I was a research assistant at the InteLligent media research center (iLearn), working closely with Prof. Xuemeng Song and Prof. Liqiang Nie.
My research usually follows the KISS (Keep It Simple, Stupid) philosophy. My recent work focuses on speech, language, audio, and music processing with Self-Supervised Learning (SSL) and Large Language Models (LLMs). If you are also interested in these topics, please feel free to contact me.
Education
Ph.D., Computer Science and Engineering, Shanghai Jiao Tong University, 2022.09-Now
Ph.D., Computer Science and Engineering, Nanyang Technological University, 2022.09-Now
B.E., Computer Science and Technology, Shandong University, 2018.09-2022.06
NEWS
[2025.5] Check out our MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs).[arXiv][Demo][GitHub][Benchmark]
[2025.5] 1 paper was accepted by ISCA INTERSPEECH 2025.
[2025.5] 5 papers were accepted by ACL 2025.
[2025.4] 1 paper was accepted by IEEE TASLP.
[2025.3] 2 papers were accepted by ICME 2025.
[2025.3] 🔥 Check out our Spark-TTS (along with BiCodec and the VoxBox dataset), an LLM-based controllable TTS model with both voice cloning and voice generation abilities.
[2025.1] Check out our Audio-CoT, the first work to explore chain-of-thought reasoning in large audio language models (LALMs).
[2025.1] Full reproduction (including all data preparation, model training, inference, and checkpoints) of SLAM-Omni is now supported!
[2025.1] MUPT was accepted by ICLR 2025.
[2025.1] LSLM, SLAM-ASR, and ELLA-V have been selected for oral presentation at AAAI 2025.
[2024.12] 3 papers were accepted by ICASSP 2025.
[2024.12] 4 papers were accepted by AAAI 2025.
[2024.10] Check out our SLAM-AAC, a new member of the SLAM-LLM family with SOTA audio captioning performance.
[2024.10] 1 paper was accepted by IEEE TASLP.
[2024.10] Check out our F5-TTS, a bilingual DiT-based TTS model with flow-matching!
[2024.8] 1 paper was accepted by IEEE TMM.
[2024.8] 2 papers were accepted by IEEE SLT 2024.
[2024.7] Chinese Tiny LLM was accepted by the 1st Conference on Language Modeling (COLM).
[2024.7] MER24 Baseline Paper was accepted by the MRAC24 Workshop@ACM Multimedia.
[2024.7] Check out the FunAudioLLM family, including the speech understanding model SenseVoice and the speech generation model CosyVoice.
[2024.6] We are organizing the Speech Processing in LLM Era Special Session @ISCSLP 2024, which is now open for submission.
[2024.6] 4 papers were accepted by ISCA INTERSPEECH 2024.
[2024.5] SLAM-LLM, a toolkit focusing on speech, language, audio, and music processing with LLMs, has been released!
[2024.5] emotion2vec and ChatMusician were accepted by ACL 2024 Findings.
[2024.5] BAT was accepted by ICML 2024.
[2024.4] MER24 Challenge@IJCAI and MRAC24 Workshop@ACM Multimedia are coming! [Baseline Paper][Baseline Code][Challenge Homepage]
[2024.4] EAT was accepted by IJCAI 2024.
[2024.3] We won 1st place in Categorical Emotion Recognition at the Odyssey 2024 Emotion Recognition Challenge.[Technical Report]
[2024.1] Check out our Repo for EAT, a new audio representation model with both effectiveness and efficiency.
[2023.12] Check out our Repo for emotion2vec, the first universal speech emotion representation model.
[2023.12] 4 papers were accepted by IEEE ICASSP 2024.
[2023.9] Check out our Repo for Fast-HuBERT. We accelerate HuBERT pre-training with a 5.2× speedup and no performance drop.
[2023.9] 2 papers were accepted by IEEE ASRU 2023.
[2023.8] MT4SSL was shortlisted for the ISCA Interspeech Best Student Paper Award.
[2023.5] 4 papers were accepted by ISCA INTERSPEECH 2023.
[2023.2] 2 papers were accepted by IEEE ICASSP 2023.
[2022.11] Check out our Repo for MT4SSL, a multi-task learning framework for self-supervised learning.
[2022.09] We won 3rd place in the Avatar Track of AIWIN, held at WAIC 2022.[Report][Invited Talk]
Research
Selected Publications
Thanks to all the collaborators for their great work!
Check out Google Scholar for more information.
Speech, Language, Audio, Music Processing with SSL
Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, Xie Chen.
MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization.
in arXiv, 2025.
Ziyang Ma*, Mingjie Chen*, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain.
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark.
Oral in INTERSPEECH, 2024.
Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen.
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.
in International Joint Conference on Artificial Intelligence (IJCAI), 2024.
Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen.
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation.
in the annual meeting of the Association for Computational Linguistics (ACL), Findings, 2024.
Guanrou Yang, Ziyang Ma, Zhisheng Zheng, Yakun Song, Zhikang Niu, Xie Chen.
Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning.
in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang, Chao Zhang, Xie Chen.
Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation.
in INTERSPEECH, 2023.
Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen.
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets.
Oral & Best Student Paper Shortlist in INTERSPEECH, 2023.
Speech, Language, Audio, Music Processing with LLM
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen.
Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration.
Oral in the Annual AAAI Conference on Artificial Intelligence (AAAI), 2025.
Yexing Du*, Ziyang Ma*, Yifan Yang, Keqi Deng, Xie Chen, Bo Yang, Yang Xiang, Ming Liu, Bing Qin.
CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought.
in arXiv, 2024.
Wenxi Chen*, Ziyang Ma*, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen.
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs.
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
Xiquan Li, Wenxi Chen, Ziyang Ma, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Qiuqiang Kong, Xie Chen.
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning.
Oral in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
Guanrou Yang, Ziyang Ma, Zhifu Gao, Shiliang Zhang, Xie Chen.
CTC-Assisted LLM-Based Contextual ASR.
in IEEE Spoken Language Technology Workshop (SLT), 2024.
Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen.
MaLa-ASR: Multimedia-Assisted LLM-Based ASR.
Oral in INTERSPEECH, 2024.
35 authors including Ziyang Ma.
ChatMusician: Understanding and Generating Music Intrinsically with LLM.
in the annual meeting of the Association for Computational Linguistics (ACL), Findings, 2024.
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen.
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity.
in arXiv, 2024.
Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath.
BAT: Learning to Reason about Spatial Sounds with Large Language Models.
in International Conference on Machine Learning (ICML), 2024.
15 authors including Ziyang Ma.
LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT.
in arXiv, 2023.
Reasoning, Alignment and Post-training for Speech and Audio Processing
Ziyang Ma, Xiquan Li, Yakun Song, Wenxi Chen, Chenpeng Du, Jian Wu, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen.
Towards Reliable Large Audio Language Model.
in the annual meeting of the Association for Computational Linguistics (ACL), Findings, 2025.
32 authors including Ziyang Ma.
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models.
in arXiv, 2025.
Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu and Others.
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix.
in arXiv, 2025.
Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen.
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model.
in arXiv, 2025.
Generation, Interaction, and Dialog for Speech and Audio Processing
Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li and Others.
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens.
in arXiv, 2025.
57 authors including Ziyang Ma.
YuE: Scaling Open Foundation Models for Long-Form Music Generation.
in arXiv, 2025.
Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen.
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training.
in the annual meeting of the Association for Computational Linguistics (ACL), Findings, 2025.
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching.
in the annual meeting of the Association for Computational Linguistics (ACL), 2025.
Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen.
Language Model Can Listen While Speaking.
Oral in the Annual AAAI Conference on Artificial Intelligence (AAAI), 2025.
12 authors including Ziyang Ma.
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens.
in arXiv, 2024.
Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu.
VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching.
Oral in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
Synthetic Data for Speech and Audio Processing
Guanrou Yang, Fan Yu, Ziyang Ma, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen.
Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap.
Oral in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, Xie Chen.
Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition.
Oral in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
Experiences
Research Intern, SEED Speech Team, ByteDance, 2024.05-2025.05
Research Intern, Tongyi Speech Lab, Alibaba DAMO Academy, 2023.06-2024.02
Research Intern, NLC Group, Microsoft Research Asia (MSRA), 2022.02-2022.08
Investigated joint pre-training of speech and text to improve the accuracy of ASR and other downstream tasks.
The group was led by Furu Wei; I was supervised by Shujie Liu and worked closely with Yu Wu and Long Zhou.
Research Intern, Video Group, MEGVII Research, 2021.04-2021.06
Research Assistant, InteLligent media research center (iLearn), Shandong University, 2020.09-2021.09
Academic Service
Organizing Committee / Chair
Multimodal Emotion Recognition Challenge (MER25) @ACM Multimedia MRAC25 Workshop
Speech Processing in LLM Era @ISCSLP 2024 Special Session
Multimodal Emotion Recognition Challenge (MER24) @ACM Multimedia MRAC24 Workshop
Conference Reviewer / TPC Member
Conference on Neural Information Processing Systems (NeurIPS) 2025
ISCA Interspeech 2025
International Conference on Learning Representations (ICLR) 2025
IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP) 2023, 2024, 2025
IEEE Spoken Language Technology Workshop (IEEE SLT) 2024
ACL Rolling Review (ACL ARR) 2024, 2025
AAAI Conference on Artificial Intelligence 2022
ACM International Conference on Multimedia (ACM MM) 2022
Journal Reviewer
IEEE Transactions on Audio, Speech and Language Processing (IEEE TASLP)
IEEE Signal Processing Letters (IEEE SPL)
IEEE Transactions on Multimedia (IEEE TMM)
IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT)
Open-Source Projects
Projects
SLAM-LLM[GitHub]
SLAM-LLM is a deep learning toolkit that allows researchers and developers to train custom multimodal large language models (MLLMs), focusing on speech, language, audio, and music processing. A minimal conceptual sketch of the encoder-projector-LLM recipe it builds around is shown after this project list.
FunAudioLLM[GitHub][Technical Report][HuggingFace][Demo]
emotion2vec series[GitHub][emotion2vec(ACL2024)][HuggingFace][ModelScope]
MAP-Neo series[GitHub][Technical Report][HuggingFace]
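For intuition, here is a minimal, self-contained PyTorch sketch of the encoder-projector-LLM recipe that SLAM-style models (e.g., SLAM-ASR) build on: speech-encoder features are downsampled by stacking consecutive frames, linearly projected into the LLM embedding space, and prepended to the text-prompt embeddings. This is a hypothetical illustration, not SLAM-LLM's actual API; the class name, dimensions, and downsampling factor are assumptions chosen for the example.

import torch
import torch.nn as nn

class SpeechToLLMProjector(nn.Module):
    # Hypothetical example (not the SLAM-LLM API): stack every `downsample`
    # consecutive speech frames and linearly project them to the LLM hidden size,
    # so the result can be consumed by the LLM like a sequence of prompt tokens.
    def __init__(self, speech_dim=1024, llm_dim=4096, downsample=5):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Linear(speech_dim * downsample, llm_dim)

    def forward(self, feats):                      # feats: (B, T, speech_dim)
        B, T, D = feats.shape
        T = T - T % self.downsample                # drop tail frames that don't fit
        stacked = feats[:, :T].reshape(B, T // self.downsample, D * self.downsample)
        return self.proj(stacked)                  # (B, T // downsample, llm_dim)

# Dummy tensors stand in for a speech encoder's output and an LLM's text embeddings.
speech_feats = torch.randn(2, 103, 1024)           # e.g., HuBERT/WavLM-style features
text_embeds = torch.randn(2, 16, 4096)             # embedded text-prompt tokens
speech_embeds = SpeechToLLMProjector()(speech_feats)         # (2, 20, 4096)
llm_inputs = torch.cat([speech_embeds, text_embeds], dim=1)  # fed to the (frozen) LLM
print(llm_inputs.shape)                            # torch.Size([2, 36, 4096])

In the SLAM-ASR setup, the speech encoder and the LLM are kept frozen and only this lightweight projector is trained, which is what keeps the recipe embarrassingly simple.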
Accomplishments
Awards
SPS Travel Grant, IEEE, 2024.02
Best Presentation Award in Student Forum, the 18th National Conference on Man-Machine Speech Communication (NCMMSC), 2023.12
Interspeech Best Student Paper Shortlist, ISCA, 2023.08
Excellent Graduate, Department of Education, Shandong Province, China, 2022.06
"Intelligent Pedestal" Scholarship, Huawei, 2021.12
SIGMM Student Travel Grant, ACM, 2021.11
National Scholarship, Ministry of Education, China, 2021.10
Competitions
3rd in DCASE 2024 Challenge Task 6: Automated Audio Captioning, IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2024.06.
1st in Odyssey 2024 Emotion Recognition Challenge Task 1: Categorical Emotion Recognition, Odyssey 2024: The Speaker and Language Recognition Workshop, 2024.03.
3rd in DCASE 2023 Challenge Task 4B: Sound Event Detection with Soft Labels, IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2023.06.
3rd in the Avatar Track of AIWIN, the 5th World Artificial Intelligence Conference (WAIC 2022), Shanghai, China, 2022.09.[Report][Invited Talk]
Finalist (Top 284 of 26,112 teams) in the Mathematical Contest in Modeling (MCM), Consortium for Mathematics and Its Applications, USA, 2021.02
First Prize (Top 293 of 45,689 teams) in the Contemporary Undergraduate Mathematical Contest in Modeling (CUMCM), China Society for Industrial and Applied Mathematics, China, 2020.09
Activities
Invited Talk: Towards Interactive Speech Language Model, Nvidia, 2024.10
Invited Talk: Towards Interactive Speech Language Model, The Hong Kong University of Science and Technology (HKUST), 2024.08
Invited Talk: Speech & Audio Understanding Based on SSL and LLM, Nvidia, 2024.06
Invited Talk: INTERSPEECH 2023 Pre-presentation, SpeechHome, 2023.07
Invited Talk: Towards More Realistic, Powerful, and Accurate Speech-based Self-Supervised Learning, Renmin University of China (RUC), 2023.05
PhD Debate: Towards AIGC, AI TIME, 2023.01
Invited Talk: How to Build an Audio-Driven Talking Head? An Introduction and Solution Sharing, Datawhale, 2022.11
Member of Datawhale, 2022.09-Now
Teaching Assistant, Computer Science and Technology, Shandong University, 2021.03-2021.06
Member of Elite Class, Computer Science and Technology, Shandong University, 2020.09-2022.06