Ziyang Ma (马子阳)

Ph.D. student,
Shanghai Jiao Tong University,
800 Dongchuan RD. Minhang District, Shanghai, China.
zym.22@sjtu.edu.cn
Nanyang Technological University,
50 Nanyang Ave, Singapore 639798.
ziyang012@e.ntu.edu.sg

Biography

Hi👋 nice to meet you!

Currently I am in the Joint Ph.D. Programme of Shanghai Jiao Tong University (SJTU) and Nanyang Technological University (NTU), co-supervised by Prof. Xie Chen from SJTU and Prof. Chng Eng Siong from NTU. I am also a member of the Cross Media (X-) Language Intelligence Lab (X-LANCE), working closely with Prof. Kai Yu. As the first Ph.D. student supervised by Prof. Chen, I will try my best over the next five exciting years! 💪

My research usually follows the KISS philosophy. I have published 10+ first-author papers at top-tier conferences (NeurIPS, ICLR, ACL, AAAI, ICASSP, Interspeech, ASRU, etc.) and was nominated for the Best Student Paper Shortlist at Interspeech 2023.

My research is usually open-source. I develop the open-source emotion2vec series (emotion2vec, emotion2vec+, EmoBox, etc.), the SLAM-LLM series (SLAM-LLM Framework, SLAM-ASR, SLAM-AAC, SLAM-Omni, etc.), and a series of audio reasoning works (Audio-CoT, MMAR, MMAR-Rubrics, Mini-Omni-Reasoner, Qwen-Omni-Captioner, etc.). I am also a core contributor to open-source projects like Qwen3-Omni (-Instruct, -Thinking, and -Captioner), FunAudioLLM (CosyVoice, SenseVoice), and popular TTS models (F5-TTS, Spark-TTS, etc.).

I am seeking full-time positions. If you are interested, please feel free to contact me!

Education

  • Ph.D., Computer Science and Engineering, Shanghai Jiao Tong University, 2022.09-Now

  • Ph.D., Computer Science and Engineering, Nanyang Technological University, 2022.09-Now

  • B.E., Computer Science and Technology, Shandong University, 2018.09-2022.06

Interests

  • Self-Supervised Learning

  • Speech and Audio Processing

  • Speech Interaction: Dialogue System and Full-Duplex Modeling

  • Omni-Model Post-Training: Perception, Reasoning, Alignment and Evaluation

NEWS

  • [2026.3] We released Omni-Cloze, a new fine-grained audio-visual captioning benchmark.[Omni-Cloze Dataset][Omni-Captioner Paper][GitHub]

  • [2026.2] Final results of the Interspeech 2026 Audio Reasoning Challenge are now available on the leaderboard page. Check out our challenge report and MMAR-Rubrics data & code.

  • [2026.1] 4 papers were accepted by ICLR 2026.

  • [2026.1] 3 papers were accepted by ICASSP 2026.

  • [2026.1] SLAM-LLM was accepted by IEEE JSTSP (IF=13.6).

  • [2025.12] Interspeech 2026 Audio Reasoning Challenge is open for registration now!

  • [2025.11] emotion2vec+ large has hit the 50 million downloads milestone on ModelScope!

  • [2025.10] We released the Omni-Captioner Technical Report, the key technique behind Qwen3-Omni-Captioner.[GitHub][Technical Report][HuggingFace][ModelScope]

  • [2025.9] We released the Qwen3-Omni series, including -Instruct, -Thinking, and -Captioner.[GitHub][Technical Report][HuggingFace][ModelScope]

  • [2025.9] 2 papers were accepted by NeurIPS 2025.

  • [2025.8] 2 papers were accepted by EMNLP 2025.

  • [2025.8] MuQ was accepted by IEEE TASLP.

  • [2025.8] Audio-CoT was accepted by IEEE ASRU 2025.

  • [2025.7] 1 paper was accepted by the INTERSPEECH 2025 MLC-SLM Workshop.

  • [2025.7] EmoVoice was accepted by ACM Multimedia 2025.

  • [2025.5] Check out our MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs).[arXiv][Demo][GitHub][Benchmark]

  • [2025.5] 1 paper was accepted by ISCA INTERSPEECH 2025.

  • [2025.5] 5 papers were accepted by ACL 2025.

  • [2025.4] 1 paper was accepted by IEEE TASLP.

  • [2025.3] 2 papers were accepted by ICME 2025.

  • [2025.3] Check out our Spark-TTS (along with BiCodec and the VoxBox dataset), an LLM-based controllable TTS model with both voice cloning and voice generation abilities.

  • [2025.1] Check out our Audio-CoT, the first work to explore chain-of-thought reasoning in large audio language models (LALMs).

  • [2025.1] Full reproduction of SLAM-Omni (including all data preparation, model training, inference, and checkpoints) is now supported!

  • [2025.1] MUPT was accepted by ICLR 2025.

  • [2025.1] LSLM, SLAM-ASR and ELLA-V have been selected for Oral presentation at AAAI 2025.

  • [2024.12] 3 papers were accepted by ICASSP 2025.

  • [2024.12] 4 papers were accepted by AAAI 2025.

  • [2024.10] Check out our SLAM-AAC, a new member of the SLAM-LLM family with SOTA audio captioning performance.

  • [2024.10] 1 paper was accepted by IEEE TASLP.

  • [2024.10] Check out our F5-TTS, a bilingual DiT-based TTS model with flow-matching!

  • [2024.8] 1 paper was accepted by IEEE TMM.

  • [2024.8] 2 papers were accepted by IEEE SLT 2024.

  • [2024.7] Chinese Tiny LLM was accepted by the 1st Conference on Language Modeling (COLM).

  • [2024.7] The MER24 Baseline Paper was accepted by the MRAC24 Workshop@ACM Multimedia.

  • [2024.7] Check out the FunAudioLLM family, including the speech understanding model SenseVoice and the speech generation model CosyVoice.

  • [2024.6] We are organizing the Speech Processing in LLM Era Special Session @ISCSLP 2024, which is now open for submission.

  • [2024.6] 4 papers were accepted by ISCA INTERSPEECH 2024.

  • [2024.5] SLAM-LLM, a toolkit focusing on speech, language, audio, and music processing with LLMs, has been released!

  • [2024.5] emotion2vec and ChatMusician were accepted by ACL 2024 Findings.

  • [2024.5] BAT was accepted by ICML 2024.

  • [2024.4] MER24 Challenge@IJCAI and MRAC24 Workshop@ACM Multimedia are coming! [Baseline Paper][Baseline Code][Challenge Homepage]

  • [2024.4] EAT was accepted by IJCAI 2024.

  • [2024.3] We won the 1st place in Categorical Emotion Recognition at Odyssey 2024 Emotion Recognition Challenge.[Technical Report]

  • [2024.1] Check out our Repo for EAT, a new audio representation model that is both effective and efficient.

  • [2023.12] Check out our Repo for emotion2vec, the first universal speech emotion representation model.

  • [2023.12] 4 papers were accepted by IEEE ICASSP 2024.

  • [2023.9] Check out our Repo for Fast-HuBERT. We accelerate HuBERT pre-training with a 5.2× speedup and no performance drop.

  • [2023.9] 2 papers were accepted by IEEE ASRU 2023.

  • [2023.8] MT4SSL was nominated for the ISCA Interspeech Best Student Paper Shortlist.

  • [2023.5] 4 papers were accepted by ISCA INTERSPEECH 2023.

  • [2023.2] 2 papers were accepted by IEEE ICASSP 2023.

  • [2022.11] Check out our Repo for MT4SSL, a multi-task learning framework for self-supervised learning.

  • [2022.09] We won 3rd place in the Avatar Track of AIWIN, held by WAIC 2022.[Report][Invited Talk]

Research

Selected Publications

Thanks to all the collaborators for their great work!

Check out Google Scholar for more information.

Perception, Reasoning and Alignment for Speech and Audio Processing

SLAM-LLM Series

You can check out the SLAM-LLM GitHub Repository for more details about the SLAM-LLM series.

Interaction, Full-Duplex, Generation, Emotion, and Engagement

Self-Supervised Learning and Representation

Experiences

Research Intern, Qwen Omni Team, Alibaba, 2025.06-now

  • Investigate detailed perception and deep reasoning for audio and video.

  • Led by Junyang Lin and supervised by Jin Xu.

Research Intern, SEED Speech Team, ByteDance, 2024.05-2025.05

  • Investigated full-duplex modeling for speech interaction and dialogue systems.

  • Led by Yuxuan Wang and supervised by Zhuo Chen.

Research Intern, Tongyi Speech Lab, Alibaba DAMO Academy, 2023.06-2024.02

Research Intern, NLC Group, Microsoft Research Asia (MSRA), 2022.02-2022.08

  • Investigated joint pre-training of speech and text to improve the accuracy of ASR and other downstream tasks.

  • Led by Furu Wei, supervised by Shujie Liu, and worked closely with Yu Wu and Long Zhou.

Research Intern, Video Group, MEGVII Research, 2021.04-2021.06

  • Investigated vehicle re-identification with Transformer architectures.

  • Supervised by Chi Zhang.

Research Assistant, InteLligent media research center (iLearn), Shandong University, 2020.09-2021.09

Academic Service

Organizing Committee / Chair

  • Organizer @Interspeech 2026 Special Session (Post-Training of Speech Foundation Models)

  • Organizer @Interspeech 2026 Audio Reasoning Challenge

  • Session Chair @ICASSP 2026 Special Session (Question Answering and Reasoning on Audio and Time-Series Data)

  • Data Chair @ACM Multimedia MRAC25 Workshop (Multimodal Emotion Recognition Challenge (MER25))

  • Organizer @ISCSLP 2024 Special Session (Speech Processing in LLM Era)

  • Organizer @ACM Multimedia MRAC24 Workshop (Multimodal Emotion Recognition Challenge (MER24))

Conference Reviewer / TPC Member

  • ISCA Interspeech 2025

  • International Conference on Learning Representations (ICLR) 2025, 2026

  • Conference on Neural Information Processing Systems (NeurIPS) 2025

  • International Conference on Machine Learning (ICML) 2026

  • IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP) 2023, 2024, 2025, 2026

  • IEEE Spoken Language Technology Workshop (IEEE SLT) 2024

  • ACL Rolling Review (ACL ARR) 2024, 2025

  • AAAI Conference on Artificial Intelligence 2022

  • ACM International Conference on Multimedia (ACM MM) 2022

Journal Reviewer

  • IEEE Transactions on Audio, Speech and Language Processing (IEEE TASLP)

  • IEEE Signal Processing Letters (IEEE SPL)

  • IEEE Transactions on Multimedia (IEEE TMM)

  • IEEE Open Journal of Signal Processing (IEEE OJSP)

  • IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT)

Open-Source Projects

Projects

SLAM-LLM[GitHub][IEEE JSTSP]

  • SLAM-LLM is a deep learning toolkit that allows researchers and developers to train custom multimodal large language models (MLLMs), focusing on Speech, Language, Audio, and Music processing.

emotion2vec series[GitHub][HuggingFace][ModelScope]

  • emotion2vec is the first universal speech emotion representation model.[ACL 2024]

  • emotion2vec+ is a series of foundational models for speech emotion recognition (SER).

  • EmoBox contains speech emotion recognition (SER) data toolkit and benchmark.[Interspeech 2024 Oral]

Qwen3-Omni[GitHub][Technical Report][HuggingFace][ModelScope]

  • Qwen3-Omni-30B-A3B-Instruct contains both thinker and talker, supporting audio, video, and text input, with audio and text output.

  • Qwen3-Omni-30B-A3B-Thinking contains the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output.

  • Qwen3-Omni-30B-A3B-Captioner produces detailed, low-hallucination captions for arbitrary audio inputs.

FunAudioLLM[GitHub][Technical Report][HuggingFace][Demo]

  • SenseVoice is a speech foundation model with multiple speech understanding capabilities.[GitHub][ModelScope]

  • CosyVoice is a multi-lingual large voice generation model.[GitHub][ModelScope]

Accomplishments

Awards

  • Best Student Paper Award, Nanyang Speech Technology Forum (NYSF) 2025, Singapore, 2025.10

  • National Scholarship for PhD, Ministry of Education, China, 2025.10

  • SPS Travel Grant, IEEE, 2024.02

  • Best Presentation Award in Student Forum, the 18th National Conference on Man-Machine Speech Communication (NCMMSC), 2023.12

  • Interspeech Best Student Paper Shortlist, ISCA, Ireland, 2023.08

  • Excellent Graduate, Department of Education, Shandong Province, China, 2022.06

  • "Intelligent Pedestal" Scholarship, Huawei, 2021.12

  • SIGMM Student Travel Grant, ACM, 2021.11

  • National Scholarship for Undergraduate, Ministry of Education, China, 2021.10

Competitions

Activities

  • Member of Hongshan AI Fellow, 2025.10-Now

  • Invited Talk: Benchmarking Audio Deep Reasoning Ability, Shanghai AI Lab, 2025.6

  • Teaching Assistant, Practice of Intelligent Perception and Cognition, Shanghai Jiao Tong University, 2025.03-2025.06

  • Invited Talk: Towards Interactive Speech Language Model, Nvidia, 2024.10

  • Invited Talk: Towards Interactive Speech Language Model, The Hong Kong University of Science and Technology (HKUST), 2024.8

  • Invited Talk: Speech & Audio Understanding Based on SSL and LLM, Nvidia, 2024.6

  • Teaching Assistant, Practice of Intelligent Perception and Cognition, Shanghai Jiao Tong University, 2024.03-2024.06

  • Invited Talk: INTERSPEECH 2023 Pre-presentation, SpeechHome, 2023.07

  • Invited Talk: Towards More Realistic, Powerful, and Accurate Speech-based Self-Supervised Learning, Renmin University of China (RUC), 2023.5

  • PhD Debate Towards AIGC, AI TIME, 2023.1

  • Invited Talk: How to Conduct an Audio-Driven Talking Head? An Introduction and Solution Sharing, Datawhale, 2022.11

  • Member of Datawhale, 2022.09-Now

  • Teaching Assistant, Computer Science and Technology, Shandong University, 2021.03-2021.06

  • Member of Elite Class, Computer Science and Technology, Shandong University, 2020.09-2022.06