Ziyang Ma (马子阳)

Ph.D. student,
Shanghai Jiao Tong University.
800 Dongchuan RD. Minhang District,
Shanghai, China.
E-mail: zym.22@sjtu.edu.cn

Biography

Hi👋 nice to meet you!

Currently I am a Ph.D. student at Shanghai Jiao Tong University (SJTU) and the SJTU Artificial Intelligence Institute, and a member of the Cross Media (X-) Language Intelligence Lab (X-LANCE) in the Department of Computer Science and Engineering, co-supervised by Prof. Xie Chen and Prof. Yanmin Qian, and working closely with Prof. Kai Yu. As the first Ph.D. student supervised by Prof. Chen, I will try my best in the next five exciting years! 💪

During my undergraduate years, I was a research assistant at the InteLligent media research center (iLearn), working closely with Prof. Xuemeng Song and Prof. Liqiang Nie.

My research usually follows the KISS philosophy. My recent work focuses on speech, language, audio, and music processing with Self-Supervised Learning (SSL) and Large Language Models (LLMs). If you are also interested, please feel free to contact me.

Education

  • Ph.D., Computer Science and Engineering, Shanghai Jiao Tong University, 2022.09-Now

  • B.E., Computer Science and Technology, Shandong University, 2018.09-2022.06

Interests

  • Self-Supervised Learning

  • Speech and Audio Processing

  • Natural Language Processing

  • Multimedia and Multimodal Learning

NEWS

  • [2024.12] 🎉 4 papers, including 2 first-author papers, were accepted by AAAI 2025.

  • [2024.10] Check out our SLAM-AAC, a new member of the SLAM-LLM family with SOTA audio captioning performance.

  • [2024.10] 🔥 Check out our F5-TTS, a bilingual DiT-based TTS model with flow-matching!

  • [2024.8] 2 papers were accepted by IEEE SLT 2024.

  • [2024.7] Chinese Tiny LLM was accepted by the 1st Conference on Language Modeling (COLM).

  • [2024.7] The MER24 Baseline Paper was accepted by the MRAC24 Workshop@ACM Multimedia.

  • [2024.7] Check out FunAudioLLM family, including a speech understanding model SenseVoice and a speech generation model CosyVoice.

  • [2024.6] We are organizing the Speech Processing in LLM Era special session @ISCSLP 2024, which is now open for submission.

  • [2024.6] 4 papers were accepted by ISCA INTERSPEECH 2024.

  • [2024.5] SLAM-LLM, a toolkit focusing on speech, language, audio, music processing with LLM, has been released!

  • [2024.5] emotion2vec and ChatMusician were accepted by ACL 2024 Findings.

  • [2024.5] BAT was accepted by ICML 2024.

  • [2024.4] MER24 Challenge@IJCAI and MRAC24 Workshop@ACM Multimedia are coming! [Baseline Paper][Baseline Code][Challenge Homepage]

  • [2024.4] EAT was accepted by IJCAI 2024.

  • [2024.3] We won 1st place in Categorical Emotion Recognition at the Odyssey 2024 Emotion Recognition Challenge. [Technical Report]

  • [2024.1] Check out our Repo for EAT, an audio representation model that is both effective and efficient.

  • [2023.12] Check out our Repo for emotion2vec, the first universal speech emotion representation model.

  • [2023.12] 4 papers were accepted by IEEE ICASSP 2024.

  • [2023.9] Check out our Repo for Fast-HuBERT, which accelerates HuBERT pre-training with a 5.2× speedup and no performance drop.

  • [2023.9] 2 papers were accepted by IEEE ASRU 2023.

  • [2023.8] MT4SSL was shortlisted for the ISCA Interspeech Best Student Paper Award. Congrats!

Research

Selected Publications

Thanks to all the collaborators for their great work!

Check out Google Scholar for more information.

Speech, Language, Audio, Music Processing with SSL

Speech, Language, Audio, Music Processing with LLM

Speech Generation and Dialog System

Experiences

Research Intern, Speech Lab, Alibaba DAMO Academy, 2023.06-2024.02

Research Intern, NLC Group, Microsoft Research Asia (MSRA), 2022.02-2022.08

  • Investigated joint pre-training of speech and text to improve the accuracy of ASR and other downstream tasks.

  • Led by Furu Wei, supervised by Shujie Liu, and worked closely with Yu Wu and Long Zhou.

Research Intern, Video Group, MEGVII Research, 2021.04-2021.06

  • Investigated vehicle re-identification with Transformer architectures.

  • Supervised by Chi Zhang.

Research Assistant, InteLligent media research center (iLearn), Shandong University, 2020.09-2021.09

Academic Service

Organizing Committee

  • Speech Processing in LLM Era @ISCSLP 2024 Special Session

  • Multimodal Emotion Recognition Challenge (MER24) @ACM Multimedia MRAC24 Workshop

Conference Reviewer / TPC Member

  • International Conference on Learning Representations (ICLR) 2025

  • IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP) 2023, 2024, 2025

  • IEEE Spoken Language Technology Workshop (IEEE SLT) 2024

  • ACL Rolling Review (ACL ARR) 2024

  • AAAI Conference on Artificial Intelligence 2022

  • ACM International Conference on Multimedia (ACM MM) 2022

Journal Reviewer

  • IEEE Signal Processing Letters (IEEE SPL)

  • IEEE Transactions on Multimedia (IEEE TMM)

  • IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT)

Open-Source Projects

Projects

SLAM-LLM[GitHub]

  • SLAM-LLM is a deep learning toolkit that allows researchers and developers to train custom multimodal large language models (MLLMs), focusing on speech, language, audio, and music processing.

FunAudioLLM[GitHub][Technical Report][HuggingFace][Demo]

  • SenseVoice is a speech foundation model with multiple speech understanding capabilities.[GitHub][ModelScope]

  • CosyVoice is a multi-lingual large voice generation model.[GitHub][ModelScope]

emotion2vec series[GitHub][emotion2vec(ACL2024)][HuggingFace][ModelScope]

  • emotion2vec is the first universal speech emotion representation model.

  • emotion2vec+ is a series of foundational models for speech emotion recognition (SER).

MAP-Neo series[GitHub][Technical Report][HuggingFace]

  • MAP-Neo is a series of fully open-source large language models.

  • Matrix is the pretraining data and data processing pipeline for MAP-Neo.[Dataset]

Dataset & Benchmark

EmoBox[GitHub][Benchmark][EmoBox(INTERSPEECH2024 Oral)]

  • EmoBox is an out-of-the-box multilingual multi-corpus speech emotion recognition toolkit, along with a benchmark for both intra-corpus and cross-corpus settings.

GigaSpeech 2[Dataset][GitHub][arXiv]

  • GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.

Accomplishments

Awards

  • SPS Travel Grant, IEEE, 2024.02

  • Best Presentation Award in Student Forum, the 18th National Conference on Man-Machine Speech Communication (NCMMSC), 2023.12

  • Interspeech Best Student Paper Shortlist, ISCA, 2023.08

  • Excellent Graduate, Department of Education, Shandong Province, China, 2022.06

  • "Intelligent Pedestal" Scholarship, Huawei, 2021.12

  • SIGMM Student Travel Grant, ACM, 2021.11

  • National Scholarship, Ministry of Education, China, 2021.10

Competitions

Activities

  • Invited Talk: Towards Interactive Speech Language Model, Nvidia, 2024.10

  • Invited Talk: Towards Interactive Speech Language Model, The Hong Kong University of Science and Technology (HKUST), 2024.8

  • Invited Talk: Speech & Audio Understanding Based on SSL and LLM, Nvidia, 2024.6

  • Invited Talk: INTERSPEECH 2023 Pre-presentation, SpeechHome, 2023.07

  • Invited Talk: Towards More Realistic, Powerful, and Accurate Speech-based Self-Supervised Learning, Renmin University of China (RUC), 2023.5

  • Ph.D. Debate: Towards AIGC, AI TIME, 2023.1

  • Invited Talk: How to Build an Audio-Driven Talking Head? An Introduction and Solution Sharing, Datawhale, 2022.11

  • Member of Datawhale, 2022.09-Now

  • Teaching Assistant, Computer Science and Technology, Shandong University, 2021.03-2021.06

  • Member of Elite Class, Computer Science and Technology, Shandong University, 2020.09-2022.06