Duzhen Zhang (张笃振)

My name is Duzhen Zhang. I am a PhD student (2019.9 - 2024.6, expected) in artificial intelligence at the Institute of Automation, Chinese Academy of Sciences (CASIA), advised by Bo Xu and Tielin Zhang. Prior to that, I received my bachelor's degree in software engineering from Shandong University (2015.9 - 2019.6).

Previously, my research interests included Emotion Analysis in Conversational Systems and Brain-inspired Intelligence. Currently, I'm working on Continual Learning in Information Extraction and Multi-Modal Large Language Models (MM-LLMs). In the future, I plan to engage in research related to Continual Learning in LLMs and LLM4Health (e.g., health-centric evaluations and health biases).

I am open to collaboration. Please contact me if you are interested in any of the above topics.

I am also actively seeking a postdoctoral research position. If my background interests you, please feel free to reach out.

Email  /  Google Scholar  /  DBLP  /  GitHub  /  Résumé

News
  • [2024.02] 📣 Stay updated on our latest review articles regarding MM-LLMs.
  • [2023.10] 🎉 One paper, SPN-GA, was accepted by Machine Intelligence Research (MIR).
  • [2023.10] 🎉 One paper, CPFD, was accepted to the EMNLP2023 main conference. See you in Singapore!
  • [2023.09] 🎉 One paper, ODE-RNN4RL, was accepted to NeurIPS2023.
  • [2023.08] 🎉 One paper, RDP, was accepted to CIKM2023 (Oral).
  • [2023.05] 🎉 One paper, DualGATs, was accepted to the ACL2023 main conference.
  • [2023.04] 🎉 One paper, DLD, was accepted to SIGIR2023.
  • [2023.02] 🎉 One paper, FISS, was accepted to CVPR2023.
  • [2023.01] 🎉 One paper, SAMGN, was accepted by IEEE TMM.
Research Overview
Publications

* denotes equal contribution.

MM-LLMs: Recent Advances in MultiModal Large Language Models

Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu

Preprint
Paper /  Website
Details In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research on MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 122 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.

Tuning Synaptic Connections instead of Weights by Genetic Algorithm in Spiking Policy Network

Duzhen Zhang, Tielin Zhang, Shuncheng Jia, Qingyu Wang, Bo Xu

Machine Intelligence Research (MIR)
Paper /  Code
Details Learning from interaction is the primary way biological agents come to know their environment and themselves. Modern deep reinforcement learning (DRL) explores a computational approach to learning from interaction and has made significant progress in solving various tasks. However, DRL still falls far short of biological agents in energy efficiency. Although the underlying mechanisms are not fully understood, we believe that the integration of spiking communication between neurons and biologically plausible synaptic plasticity plays a prominent role. Following this biological intuition, we optimize a spiking policy network (SPN) with a genetic algorithm as an energy-efficient alternative to DRL. Our SPN mimics the sensorimotor neuron pathway of insects and communicates through event-based spikes. Inspired by biological findings that the brain forms memories by creating new synaptic connections and rewires these connections based on new experiences, we tune the synaptic connections instead of the weights of the SPN to solve given tasks. Experimental results on several robotic control tasks show that our method can match the performance of mainstream DRL methods while exhibiting significantly higher energy efficiency.
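To illustrate the connection-tuning idea, here is a minimal NumPy sketch of a genetic algorithm evolving binary connection masks while the weights stay fixed. The mask shape, population settings, and toy fitness function are hypothetical illustrations, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve_masks(fitness, shape=(8, 4), pop_size=20, generations=30, mutate_p=0.05):
    """Evolve binary connection masks with a simple genetic algorithm:
    selection keeps the best half, children are mutated copies (bit flips)."""
    pop = rng.integers(0, 2, size=(pop_size, *shape))
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        elite = pop[np.argsort(scores)[-(pop_size // 2):]]    # best half survives
        children = elite[rng.integers(0, len(elite), size=pop_size - len(elite))].copy()
        children[rng.random(children.shape) < mutate_p] ^= 1  # flip a few connections
        pop = np.concatenate([elite, children])
    return pop[np.argmax([fitness(m) for m in pop])]

# Toy fitness: reward masks close to a hypothetical target connectivity pattern.
target = rng.integers(0, 2, size=(8, 4))
best = evolve_masks(lambda m: -np.abs(m - target).sum())
```

Note that only the binary mask evolves; in the paper's setting, the synaptic weights themselves are never gradient-trained, which is what distinguishes connection tuning from standard weight optimization.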

Continual Named Entity Recognition without Catastrophic Forgetting

Duzhen Zhang, Wei Cong, Jiahua Dong, Yahan Yu, Xiuyi Chen,
Yonggang Zhang, Zhen Fang

EMNLP2023 (Main Conference)
Paper /  Code /  Poster /  Slide
Details Continual Named Entity Recognition (CNER) is a burgeoning area, which involves updating an existing model by incorporating new entity types sequentially. Nevertheless, continual learning approaches are often severely afflicted by catastrophic forgetting. This issue is intensified in CNER due to the consolidation of old entity types from previous steps into the non-entity type at each step, leading to what is known as the semantic shift problem of the non-entity type. In this paper, we introduce a pooled feature distillation loss that skillfully navigates the trade-off between retaining knowledge of old entity types and acquiring new ones, thereby more effectively mitigating the problem of catastrophic forgetting. Additionally, we develop a confidence-based pseudo-labeling for the non-entity type, i.e., predicting entity types using the old model to handle the semantic shift of the non-entity type. Following the pseudo-labeling process, we suggest an adaptive re-weighting type-balanced learning strategy to handle the issue of biased type distribution. We carried out comprehensive experiments on ten CNER settings using three different datasets. The results illustrate that our method significantly outperforms prior state-of-the-art approaches, registering an average improvement of 6.3% and 8.0% in Micro and Macro F1 scores, respectively.
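To make the pooling idea concrete, here is a hedged NumPy sketch of a pooled feature distillation loss: instead of matching old and new features element-wise, it matches features pooled along each axis, a softer constraint. The pooling axes and weighting here are illustrative assumptions; the paper's exact formulation may differ:

```python
import numpy as np

def pooled_feature_distillation(feat_new, feat_old):
    """L2 distance between features pooled along each axis, rather than an
    element-wise match: intended to retain old-type knowledge while leaving
    capacity to learn new entity types."""
    loss = 0.0
    for axis in range(feat_new.ndim):
        loss += np.mean((feat_new.mean(axis=axis) - feat_old.mean(axis=axis)) ** 2)
    return loss

feats_old = np.random.default_rng(1).normal(size=(2, 16, 32))  # (batch, tokens, hidden)
zero_loss = pooled_feature_distillation(feats_old, feats_old)  # identical features give 0
```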
ODE-based Recurrent Model-free Reinforcement Learning for POMDPs

Xuanle Zhao, Duzhen Zhang, Liyuan Han, Tielin Zhang, Bo Xu

NeurIPS2023
Paper
Details Neural ordinary differential equations (ODEs) are widely recognized as the standard for modeling physical mechanisms, which helps to perform approximate inference in unknown physical or biological environments. In partially observable (PO) environments, inferring unseen information from raw observations remains a challenge for agents. By using a recurrent policy with a compact context, context-based reinforcement learning provides a flexible way to extract unobservable information from historical transitions. To help the agent extract more dynamics-related information, we present a novel ODE-based recurrent model combined with a model-free reinforcement learning (RL) framework to solve partially observable Markov decision processes (POMDPs). We experimentally demonstrate the efficacy of our method across various PO continuous control and meta-RL tasks. Furthermore, our experiments illustrate that our method is robust against irregular observations, owing to the ability of ODEs to model irregularly sampled time series.
Attention-free Spikformer: Mixing Spike Sequences with Simple Linear Transforms

Qingyu Wang*, Duzhen Zhang*, Tielin Zhang, Bo Xu

Preprint
Paper
Details By integrating the self-attention capability and the biological properties of Spiking Neural Networks (SNNs), Spikformer applies the flourishing Transformer architecture to SNN design. It introduces a Spiking Self-Attention (SSA) module to mix sparse visual features using spike-form Query, Key, and Value, resulting in State-Of-The-Art (SOTA) performance on numerous datasets compared to previous SNN-like frameworks. In this paper, we demonstrate that the Spikformer architecture can be accelerated by replacing the SSA with an unparameterized Linear Transform (LT) such as Fourier and Wavelet transforms. These transforms are utilized to mix spike sequences, reducing the quadratic time complexity to log-linear time complexity. They alternate between the frequency and time domains to extract sparse visual features, showcasing powerful performance and efficiency. We conduct extensive experiments on image classification using both neuromorphic and static datasets. The results indicate that compared to the SOTA Spikformer with SSA, Spikformer with LT achieves higher Top-1 accuracy on neuromorphic datasets (i.e., CIFAR10-DVS and DVS128 Gesture) and comparable Top-1 accuracy on static datasets (i.e., CIFAR-10 and CIFAR-100). Furthermore, Spikformer with LT achieves approximately 29-51% improvement in training speed, 61-70% improvement in inference speed, and reduces memory usage by 4-26% due to not requiring learnable parameters.
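The Fourier variant of the parameter-free mixing can be sketched with NumPy's FFT, in the spirit of FNet-style token mixing; the spike-map shape below is an illustrative assumption:

```python
import numpy as np

def fourier_mix(x):
    """Parameter-free token mixing: a 2D FFT over the sequence and feature
    axes, keeping the real part. O(n log n) mixing with no learnable
    parameters, in place of quadratic self-attention."""
    return np.fft.fft2(x).real

# A toy binary spike map of shape (time_steps, features).
spikes = (np.random.default_rng(2).random((16, 8)) < 0.2).astype(float)
mixed = fourier_mix(spikes)
```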
Task Relation Distillation and Prototypical Pseudo Label for Continual Named Entity Recognition

Duzhen Zhang, Hongliu Li, Wei Cong, Rongtao Xu,
Jiahua Dong, Xiuyi Chen

CIKM2023 (Oral)
Paper /  Code /  Slide
Details Incremental Named Entity Recognition (INER) involves the sequential learning of new entity types without accessing the training data of previously learned types. However, INER faces the challenge of catastrophic forgetting specific to incremental learning, further aggravated by background shift (i.e., old and future entity types are labeled as the non-entity type in the current task). To address these challenges, we propose a method called task Relation Distillation and Prototypical pseudo label (RDP) for INER. Specifically, to tackle catastrophic forgetting, we introduce a task relation distillation scheme that serves two purposes: 1) ensuring inter-task semantic consistency across different incremental learning tasks by minimizing an inter-task relation distillation loss, and 2) enhancing the model's prediction confidence by minimizing an intra-task self-entropy loss. Simultaneously, to mitigate background shift, we develop a prototypical pseudo label strategy that distinguishes old entity types from the current non-entity type using the old model. This strategy generates high-quality pseudo labels by measuring the distances between token embeddings and type-wise prototypes. We conducted extensive experiments on ten INER settings of three benchmark datasets (i.e., CoNLL2003, I2B2, and OntoNotes5). The results demonstrate that our method achieves significant improvements over the previous state-of-the-art methods, with an average increase of 6.08% in Micro F1 score and 7.71% in Macro F1 score.
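The core of the prototypical pseudo-label idea, assigning tokens to old entity types by distance to type-wise prototypes, can be sketched as follows; the prototypes and embeddings below are toy assumptions, and the paper additionally uses the old model's predictions, which this sketch omits:

```python
import numpy as np

def nearest_prototype_labels(token_embs, prototypes):
    """Pseudo-label each token with the old entity type whose prototype
    (type-wise mean embedding) is nearest in Euclidean distance."""
    dists = np.linalg.norm(token_embs[:, None, :] - prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1)

prototypes = np.eye(3)                  # three hypothetical old-type prototypes
tokens = np.array([[0.9, 0.1, 0.0],     # noisy embeddings near each prototype
                   [0.0, 1.1, 0.0],
                   [0.1, 0.0, 0.8]])
labels = nearest_prototype_labels(tokens, prototypes)   # -> [0, 1, 2]
```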
DualGATs: Dual Graph Attention Networks for Emotion Recognition in Conversations

Duzhen Zhang, Feilong Chen, Xiuyi Chen

ACL2023 (Main Conference)
Paper /  Code /  Poster
Details Capturing complex contextual dependencies plays a vital role in Emotion Recognition in Conversations (ERC). Previous studies have predominantly focused on speaker-aware context modeling, overlooking the discourse structure of the conversation. In this paper, we introduce Dual Graph ATtention networks (DualGATs) to concurrently consider the complementary aspects of discourse structure and speaker-aware context, aiming for more precise ERC. Specifically, we devise a Discourse-aware GAT (DisGAT) module to incorporate discourse structural information by analyzing the discourse dependencies between utterances. Additionally, we develop a Speaker-aware GAT (SpkGAT) module to incorporate speaker-aware contextual information by considering the speaker dependencies between utterances. Furthermore, we design an interaction module that facilitates the integration of the DisGAT and SpkGAT modules, enabling the effective interchange of relevant information between the two modules. We extensively evaluate our method on four datasets, and experimental results demonstrate that our proposed DualGATs surpass state-of-the-art baselines on the majority of the datasets.
Decomposing Logits Distillation for Incremental Named Entity Recognition

Duzhen Zhang, Yahan Yu, Feilong Chen, Xiuyi Chen

SIGIR2023
Paper /  Poster
Details Incremental Named Entity Recognition (INER) aims to continually train a model with new data, recognizing emerging entity types without forgetting previously learned ones. Prior INER methods have shown that Logits Distillation (LD), which involves preserving predicted logits via knowledge distillation, effectively alleviates this challenging issue. In this paper, we discover that a predicted logit can be decomposed into two terms that measure the likelihood of an input token belonging to a specific entity type or not. However, the traditional LD only preserves the sum of these two terms without considering the change in each component. To explicitly constrain each term, we propose a novel Decomposing Logits Distillation (DLD) method, enhancing the model's ability to retain old knowledge and mitigate catastrophic forgetting. Moreover, DLD is model-agnostic and easy to implement. Extensive experiments show that DLD consistently improves the performance of state-of-the-art INER methods across ten INER settings in three datasets.
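One way to picture the decomposition, assuming a sigmoid reading of each logit (the paper's exact decomposition may differ; this sketch is an illustration of constraining both terms rather than only their sum):

```python
import numpy as np

def decomposed_logits_distillation(z_new, z_old):
    """Read each logit z as log p - log(1 - p) with p = sigmoid(z), and match
    the two terms separately, instead of matching only their difference z
    as plain logits distillation does."""
    def terms(z):
        p = 1.0 / (1.0 + np.exp(-z))
        return np.log(p), np.log1p(-p)
    a_new, b_new = terms(z_new)
    a_old, b_old = terms(z_old)
    return np.mean((a_new - a_old) ** 2) + np.mean((b_new - b_old) ** 2)

z_old = np.array([1.5, -0.5, 2.0])   # logits from the frozen old model
z_new = np.array([1.0,  0.0, 2.5])   # logits from the current model
loss = decomposed_logits_distillation(z_new, z_old)
```

Because both terms are constrained, two logit pairs with the same difference but different magnitudes no longer yield the same distillation loss.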
Federated Incremental Semantic Segmentation

Jiahua Dong*, Duzhen Zhang*, Yang Cong, Wei Cong,
Henghui Ding, Dengxin Dai

CVPR2023
Paper /  Code
Details Federated learning-based semantic segmentation (FSS) has drawn widespread attention via decentralized training on local clients. However, most FSS models assume that categories are fixed in advance, and thus suffer heavy forgetting of old categories in practical applications where local clients receive new categories incrementally while having no memory storage to access old classes. Moreover, new clients collecting novel classes may join the global training of FSS, which further exacerbates catastrophic forgetting. To surmount the above challenges, we propose a Forgetting-Balanced Learning (FBL) model to address heterogeneous forgetting on old classes from both intra-client and inter-client aspects. Specifically, under the guidance of pseudo labels generated via adaptive class-balanced pseudo labeling, we develop a forgetting-balanced semantic compensation loss and a forgetting-balanced relation consistency loss to rectify intra-client heterogeneous forgetting of old categories with background shift. It performs balanced gradient propagation and relation consistency distillation within local clients. Moreover, to tackle heterogeneous forgetting from the inter-client aspect, we propose a task transition monitor, which can identify new classes under privacy protection and store the latest old global model for relation distillation. Qualitative experiments reveal large improvements of our model over comparison methods. The code is available at https://github.com/JiahuaDong/FISS.
Structure Aware Multi-Graph Network for Multi-Modal Emotion Recognition in Conversations

Duzhen Zhang, Feilong Chen, Jianlong Chang, Xiuyi Chen, Qi Tian

IEEE TMM
Paper
Details Multi-Modal Emotion Recognition in Conversations (MMERC) is an increasingly active research field that leverages multi-modal signals to understand the feelings behind each utterance. Modeling contextual interactions and multi-modal fusion lie at the heart of this field, with graph-based models recently being widely used for MMERC to capture global multi-modal contextual information. However, these models generally mix all modality representations in a single graph, and utterances in each modality are fully connected, potentially ignoring three problems: (1) the heterogeneity of the multi-modal context, (2) the redundancy of contextual information, and (3) over-smoothing of the graph networks. To address these problems, we propose a Structure Aware Multi-Graph Network (SAMGN) for MMERC. Specifically, we construct multiple modality-specific graphs to model the heterogeneity of the multi-modal context. Instead of fully connecting the utterances in each modality, we design a structure learning module that determines whether edges exist between the utterances. This module reduces redundancy by forcing each utterance to focus on the contextual ones that contribute to its emotion recognition, acting like a message propagating reducer to alleviate over-smoothing. Then, we develop the SAMGN via Dual-Stream Propagation (DSP), which contains two propagation streams, i.e., intra- and inter-modal, performed in parallel to aggregate the heterogeneous modality information from multi-graphs. DSP also contains a gating unit that adaptively integrates the co-occurrence information from the above two propagations for emotion recognition. Experiments on two popular MMERC datasets demonstrate that SAMGN achieves new State-Of-The-Art (SOTA) results.
Biological Structure-Inspired Basic Network Operators Facilitate Brain-Inspired Intelligence Research (生物结构启发基本网络算子助力类脑智能研究)

Duzhen Zhang, Xiang Cheng, Yansong Wang, Xinhe Zhang, Tielin Zhang, Jiulin Du, Bo Xu

人工智能 (Artificial Intelligence)
Paper
Details Brain-inspired intelligence research sits at the deep intersection of brain science and artificial intelligence, aiming to draw inspiration from the brain's structures, functions, and mechanisms to inform AI software and hardware research. This article focuses on biological structures, summarizing basic network operators inspired by neural lateral interaction, the biological lottery-ticket network hypothesis, […] in the structural design of architectures. In the future, as multi-scale and multi-type biological connectome atlases are mapped, more and more network operators inspired by biological structures can be extracted, continually driving innovation in brain-inspired intelligence.
Complex Dynamic Neurons Improved Spiking Transformer Network for Efficient Automatic Speech Recognition

Qingyu Wang, Tielin Zhang, Minglun Han, Yi Wang, Duzhen Zhang, Bo Xu

AAAI2023
Paper
Details The spiking neural network (SNN) using leaky integrate-and-fire (LIF) neurons has been commonly used in automatic speech recognition (ASR) tasks. However, the LIF neuron is still relatively simple compared to that in the biological brain. Further research on more types of neurons with different scales of neuronal dynamics is necessary. Here we introduce four types of neuronal dynamics to post-process the sequential patterns generated from the spiking transformer to get the complex dynamic neuron improved spiking transformer neural network (DyTr-SNN). We found that the DyTr-SNN could handle the non-toy automatic speech recognition task well, representing a lower phoneme error rate, lower computational cost, and higher robustness. These results indicate that the further cooperation of SNNs and neural dynamics at the neuron and network scales might have much in store for the future, especially on the ASR tasks.
VLP: A Survey on Vision-Language Pre-training

Feilong Chen*, Duzhen Zhang*, Minglun Han, Xiuyi Chen,
Jing Shi, Shuang Xu, Bo Xu

Machine Intelligence Research (MIR)
Paper
Details In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial work has shown that they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.
TSAM: A Two-Stream Attention Model for Causal Emotion Entailment

Duzhen Zhang, Zhen Yang, Fandong Meng, Xiuyi Chen, Jie Zhou

COLING2022 (Oral)
Paper /  Code
Details Causal Emotion Entailment (CEE) aims to discover the potential causes behind an emotion in a conversational utterance. Previous works formalize CEE as independent utterance pair classification problems, with emotion and speaker information neglected. From a new perspective, this paper considers CEE in a joint framework. We classify multiple utterances synchronously to capture the correlations between utterances in a global view and propose a Two-Stream Attention Model (TSAM) to effectively model the speaker’s emotional influences in the conversational history. Specifically, the TSAM comprises three modules: Emotion Attention Network (EAN), Speaker Attention Network (SAN), and interaction module. The EAN and SAN incorporate emotion and speaker information in parallel, and the subsequent interaction module effectively interchanges relevant information between the EAN and SAN via a mutual BiAffine transformation. Extensive experimental results demonstrate that our model achieves new State-Of-The-Art (SOTA) performance and outperforms baselines remarkably.
Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog

Feilong Chen, Duzhen Zhang, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu

ACM MM2022
Paper
Details Visual dialog requires models to give reasonable answers according to a series of coherent questions and related visual concepts in images. However, most current work either focuses on attention-based fusion or pre-training on large-scale image-text pairs, ignoring the critical role of explicit vision-language alignment in visual dialog. To remedy this defect, we propose a novel unsupervised and pseudo-supervised vision-language alignment approach for visual dialog (AlignVD). Firstly, AlignVD utilizes the visual and dialog encoder to represent images and dialogs. Then, it explicitly aligns visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment (UVLA and PVLA). Specifically, UVLA utilizes a graph autoencoder, while PVLA uses dialog-guided visual grounding to conduct alignment. Finally, based on the aligned visual and textual representations, AlignVD gives a reasonable answer to the question via the cross-modal decoder. Extensive experiments on two large-scale visual dialog datasets have demonstrated the effectiveness of vision-language alignment, and our proposed AlignVD achieves new state-of-the-art results. In addition, our single model has won first place on the visual dialog challenge leaderboard with an NDCG of 78.70, surpassing the previous best ensemble model by about 1 point.
Recent Advances and New Frontiers in Spiking Neural Networks

Duzhen Zhang, Shuncheng Jia, Qingyu Wang

IJCAI2022
Paper
Details In recent years, spiking neural networks (SNNs) have received extensive attention in brain-inspired intelligence due to their rich spatially-temporal dynamics, various encoding methods, and event-driven characteristics that naturally fit the neuromorphic hardware. With the development of SNNs, brain-inspired intelligence, an emerging research field inspired by brain science achievements and aiming at artificial general intelligence, is attracting growing attention. This paper reviews recent advances and discusses new frontiers in SNNs from five major research topics, including essential elements (i.e., spiking neuron models, encoding methods, and topology structures), neuromorphic datasets, optimization algorithms, software, and hardware frameworks. We hope our survey can help researchers understand SNNs better and inspire new works to advance this field.
Multiscale Dynamic Coding Improved Spiking Actor Network for Reinforcement Learning

Duzhen Zhang, Tielin Zhang, Shuncheng Jia, Bo Xu

AAAI2022 (Oral)
Paper
Details With the help of deep neural networks (DNNs), deep reinforcement learning (DRL) has achieved great success on many complex tasks, from games to robotic control. Compared to DNNs with partial brain-inspired structures and functions, spiking neural networks (SNNs) consider more biological features, including spiking neurons with complex dynamics and learning paradigms with biologically plausible plasticity principles. Inspired by the efficient computation of cell assembly in the biological brain, whereby memory-based coding is much more complex than readout, we propose a multiscale dynamic coding improved spiking actor network (MDC-SAN) for reinforcement learning to achieve effective decision-making. The population coding at the network scale is integrated with the dynamic neurons coding (containing 2nd-order neuronal dynamics) at the neuron scale towards a powerful spatial-temporal state representation. Extensive experimental results show that our MDC-SAN performs better than its counterpart deep actor network (based on DNNs) on four continuous control tasks from OpenAI Gym. We think this is a significant attempt to improve SNNs from the perspective of efficient coding towards effective decision-making, just like that in biological networks.
Knowledge Aware Emotion Recognition in Textual Conversations via Multi-Task Incremental Transformer

Duzhen Zhang, Xiuyi Chen, Shuang Xu, Bo Xu

COLING2020 (Oral)
Paper
Details Emotion recognition in textual conversations (ERTC) plays an important role in a wide range of applications, such as opinion mining, recommender systems, and so on. ERTC, however, is a challenging task. For one thing, speakers often rely on the context and commonsense knowledge to express emotions; for another, most utterances in conversations carry neutral emotion, and as a result, the confusion between the few non-neutral utterances and the far more numerous neutral ones restrains emotion recognition performance. In this paper, we propose a novel Knowledge Aware Incremental Transformer with Multi-task Learning (KAITML) to address these challenges. Firstly, we devise a dual-level graph attention mechanism to leverage commonsense knowledge, which augments the semantic information of the utterance. Then we apply the Incremental Transformer to encode multi-turn contextual utterances. Moreover, we are the first to introduce multi-task learning to alleviate the aforementioned confusion and thus further improve the emotion recognition performance. Extensive experimental results show that our KAITML model outperforms the state-of-the-art models across five benchmark datasets.
Services

Conference Reviewers: AAAI2023, IJCAI2023, ACL2023, EMNLP2023, AAAI2024, CVPR2024, IJCAI2024, ACL2024

Journal Reviewers: Computer Speech & Language



website template