Ziyang Ma (马子阳)
Ph.D. student,
Shanghai Jiao Tong University.
800 Dongchuan RD. Minhang District,
Shanghai, China.
E-mail: zym.22@sjtu.edu.cn
Biography
Hi👋 nice to meet you!
I am currently a Ph.D. student at Shanghai Jiao Tong University (SJTU) and the SJTU Artificial Intelligence Institute, and a member of the Cross Media (X-) Language Intelligence Lab (X-LANCE) in the Department of Computer Science and Engineering. I am co-supervised by Prof. Xie Chen and Yanmin Qian, and work closely with Prof. Kai Yu. As the first Ph.D. student supervised by Prof. Chen, I will try my best in the next five exciting years! 💪
I was a research assistant at the InteLligent media research center (iLearn), working closely with Prof. Xuemeng Song and Liqiang Nie during my undergraduate years.
My research usually follows the KISS philosophy. My recent work focuses on speech, language, audio and music processing with Self-Supervised Learning (SSL) and Large Language Models (LLMs). If you are also interested, please feel free to contact me.
Education
Ph.D., Computer Science and Engineering, Shanghai Jiao Tong University, 2022.09-Now
B.E., Computer Science and Technology, Shandong University, 2018.09-2022.06
Interests
Self-Supervised Learning
Speech and Audio Processing
Natural Language Processing
Multimedia and Multimodal
NEWS
[2024.12] 🎉 4 papers, including 2 first-author papers, were accepted by AAAI2025.
[2024.10] Check out our SLAM-AAC, a new member of the SLAM-LLM family with SOTA audio captioning performance.
[2024.10] 🔥 Check out our F5-TTS, a bilingual DiT-based TTS model with flow-matching!
[2024.8] 2 papers were accepted by IEEE SLT2024.
[2024.7] Chinese Tiny LLM was accepted by the 1st Conference on Language Modeling (COLM).
[2024.7] MER24 Baseline Paper was accepted by MRAC24 Workshop@ACM Multimedia.
[2024.7] Check out the FunAudioLLM family, including a speech understanding model SenseVoice and a speech generation model CosyVoice.
[2024.6] We are organizing the Speech Processing in LLM Era @ISCSLP 2024 Special Session, which is now open for submissions.
[2024.6] 4 papers were accepted by ISCA INTERSPEECH2024.
[2024.5] SLAM-LLM, a toolkit focusing on speech, language, audio, music processing with LLM, has been released!
[2024.5] emotion2vec and ChatMusician were accepted by ACL 2024 Findings.
[2024.5] BAT was accepted by ICML 2024.
[2024.4] MER24 Challenge@IJCAI and MRAC24 Workshop@ACM Multimedia are coming! [Baseline Paper][Baseline Code][Challenge Homepage]
[2024.4] EAT was accepted by IJCAI 2024.
[2024.3] We won the 1st place in Categorical Emotion Recognition at Odyssey 2024 Emotion Recognition Challenge. [Technical Report]
[2024.1] Check out our Repo for EAT, a new audio representation model with both effectiveness and efficiency.
[2023.12] Check out our Repo for emotion2vec, the first universal speech emotion representation model.
[2023.12] 4 papers were accepted by IEEE ICASSP2024.
[2023.9] Check out our Repo for Fast-HuBERT. We accelerate HuBERT pre-training by 5.2X without a performance drop.
[2023.9] 2 papers were accepted by IEEE ASRU2023.
[2023.8] MT4SSL was nominated in ISCA Interspeech Best Student Paper Shortlist. Congrats!
Research
Selected Publications
Thanks to all the collaborators for their great work!
Check out Google Scholar for more information.
Speech, Language, Audio, Music Processing with SSL
Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen.
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer.
in International Joint Conference on Artificial Intelligence (IJCAI), 2024.
Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen.
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation.
in the annual meeting of the Association for Computational Linguistics (ACL), Findings, 2024.
Guanrou Yang, Ziyang Ma, Zhisheng Zheng, Yakun Song, Zhikang Niu, Xie Chen.
Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning.
in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang, Chao Zhang, Xie Chen.
Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation.
in INTERSPEECH, 2023.
Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen.
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets.
Oral & Best Student Paper Shortlist in INTERSPEECH, 2023.
Speech, Language, Audio, Music Processing with LLM
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen.
Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration.
in the Annual AAAI Conference on Artificial Intelligence (AAAI), 2025.
Wenxi Chen*, Ziyang Ma*, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen.
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs.
in arXiv, 2024.
Yexing Du*, Ziyang Ma*, Yifan Yang, Keqi Deng, Xie Chen, Bo Yang, Yang Xiang, Ming Liu, Bing Qin.
CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought.
in arXiv, 2024.
Guanrou Yang, Ziyang Ma, Zhifu Gao, Shiliang Zhang, Xie Chen.
CTC-Assisted LLM-Based Contextual ASR.
in IEEE Spoken Language Technology Workshop (SLT), 2024.
Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen.
MaLa-ASR: Multimedia-Assisted LLM-Based ASR.
Oral in INTERSPEECH, 2024.
ChatMusician: Understanding and Generating Music Intrinsically with LLM.
in the annual meeting of the Association for Computational Linguistics (ACL), Findings, 2024.
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen.
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity.
in arXiv, 2024.
Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath.
BAT: Learning to Reason about Spatial Sounds with Large Language Models.
in International Conference on Machine Learning (ICML), 2024.
LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT.
in arXiv, 2023.
Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, Xie Chen.
Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition.
Oral in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
Speech Generation and Dialog System
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching.
in arXiv, 2024.
Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen.
Language Model Can Listen While Speaking.
in the Annual AAAI Conference on Artificial Intelligence (AAAI), 2025.
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens.
in arXiv, 2024.
Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu.
Voiceflow: Efficient text-to-speech with rectified flow matching.
Oral in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
Experiences
Research Intern, Speech Lab, Alibaba DAMO Academy, 2023.06-2024.02
Research Intern, NLC Group, Microsoft Research Asia (MSRA), 2022.02-2022.08
Investigated joint pre-training of speech and text to help improve the accuracy of ASR and other downstream tasks.
Led by Furu Wei, supervised by Shujie Liu, and working closely with Yu Wu and Long Zhou.
Research Intern, Video Group, MEGVII Research, 2021.04-2021.06
Research Assistant, InteLligent media research center (iLearn), Shandong University, 2020.09-2021.09
Academic Service
Organizing Committee
Conference Reviewer / TPC Member
International Conference on Learning Representations (ICLR) 2025
IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP) 2023, 2024, 2025
IEEE Spoken Language Technology Workshop (IEEE SLT) 2024
ACL Rolling Review (ACL ARR) 2024
AAAI Conference on Artificial Intelligence 2022
ACM International Conference on Multimedia (ACM MM) 2022
Journal Reviewer
IEEE Signal Processing Letters (IEEE SPL)
IEEE Transactions on Multimedia (IEEE TMM)
IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT)
Open-Source Projects
Projects
SLAM-LLM[GitHub]
SLAM-LLM is a deep learning toolkit that allows researchers and developers to train custom multimodal large language models (MLLMs), focusing on speech, language, audio, and music processing.
FunAudioLLM[GitHub][Technical Report][HuggingFace][Demo]
emotion2vec series[GitHub][emotion2vec(ACL2024)][HuggingFace][ModelScope]
MAP-Neo series[GitHub][Technical Report][HuggingFace]
Dataset & Benchmark
EmoBox[GitHub][Benchmark][EmoBox(INTERSPEECH2024 Oral)]
GigaSpeech 2[Dataset][GitHub][arXiv]
Accomplishments
Awards
SPS Travel Grant, IEEE, 2024.02
Best Presentation Award in Student Forum, the 18th National Conference on Man-Machine Speech Communication (NCMMSC), 2023.12
Interspeech Best Student Paper Shortlist, ISCA, 2023.08
Excellent Graduate, Department of Education, Shandong Province, China, 2022.06
"Intelligent Pedestal" Scholarship, Huawei, 2021.12
SIGMM Student Travel Grant, ACM, 2021.11
National Scholarship, Ministry of Education, China, 2021.10
Competitions
3rd in DCASE 2024 Challenge Task6: Automated Audio Captioning, IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2024.06.
1st in Odyssey 2024 Emotion Recognition Challenge Task1: Categorical Emotion Recognition, Odyssey 2024 The Speaker and Language Recognition Workshop, 2024.03.
3rd in DCASE 2023 Challenge Task4b: Sound Event Detection with Soft Labels, IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2023.06.
3rd in Avatar Track of AIWIN, the 5th World Artificial Intelligence Conference (WAIC2022), Shanghai, China, 2022.09.[Report][Invited Talk]
Finalist (Top 284 of 26112 teams) in Mathematical Contest in Modeling (MCM), Consortium for Mathematics and Its Applications, America, 2021.02
First Prize (Top 293 of 45689 teams) in Contemporary Undergraduate Mathematical Contest in Modeling (CUMCM), China Society for Industrial and Applied Mathematics, China, 2020.09
Activities
Invited Talk: Towards Interactive Speech Language Model, Nvidia, 2024.10
Invited Talk: Towards Interactive Speech Language Model, The Hong Kong University of Science and Technology (HKUST), 2024.8
Invited Talk: Speech & Audio Understanding Based on SSL and LLM, Nvidia, 2024.6
Invited Talk: INTERSPEECH 2023 Pre-presentation, SpeechHome, 2023.07
Invited Talk: Towards More Realistic, Powerful, and Accurate Speech-based Self-Supervised Learning, The Renmin University of China (RUC), 2023.5
PhD Debate Towards AIGC, AI TIME, 2023.1
Invited Talk: How to conduct audio-driven talking head? An introduction and solution sharing, Datawhale, 2022.11
Member of Datawhale, 2022.09-Now
Teaching Assistant, Computer Science and Technology, Shandong University, 2021.03-2021.06
Member of Elite Class, Computer Science and Technology, Shandong University, 2020.09-2022.06