Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark

1Xidian University   2Wuhan University   3National University of Singapore
4Singapore Management University   5Nanyang Technological University

Accepted by WWW 2025 (Poster)
(* Correspondence)

Abstract

Empathetic Response Generation (ERG) is a key task in affective computing that aims to produce emotionally nuanced and compassionate responses to users' queries. However, existing ERG research is predominantly confined to the single text modality, which limits its effectiveness since human emotions are inherently conveyed through multiple modalities. To address this, we introduce an avatar-based Multimodal ERG (MERG) task, entailing rich text, speech, and facial vision information. We first present a large-scale, high-quality benchmark dataset, AvaMERG, which extends traditional text-based ERG by incorporating authentic human speech audio and dynamic talking-face avatar videos, encompassing a diverse range of avatar profiles and broadly covering topics from real-world scenarios. Further, we deliberately tailor a system, named Empatheia, for MERG. Built upon a Multimodal Large Language Model (MLLM) with a multimodal encoder, a speech generator, and an avatar generator, Empatheia performs MERG end to end, with a Chain-of-Empathy reasoning mechanism integrated for enhanced empathy understanding and reasoning. Finally, we devise a set of empathetic-enhanced tuning strategies that strengthen emotional accuracy as well as content and avatar-profile consistency across modalities. Experimental results on the AvaMERG data demonstrate that Empatheia consistently outperforms baseline methods on both textual ERG and MERG.

1. Task Definition

Given a multimodal dialogue Ď = (Qi | D<i), where Qi denotes the current i-th round multimodal user query and D<i represents the dialogue history, the MERG task is to produce a contextually appropriate and empathetic multimodal response Ri for Qi, with each utterance (i.e., Qi and Ri) consisting of three content-synchronized modalities: text, speech audio, and talking-face video, i.e., Qi = (tq_i, sq_i, vq_i) and Ri = (tr_i, sr_i, vr_i). This yields Di = {(Q1, R1), ..., (Qi, Ri)}, a multimodal dialogue of i rounds in total, each round comprising a user query and a model response. The task requires maintaining coherence and emotional congruence across these modalities, ensuring that the generated response Ri aligns well with the emotional cues in both the user input and the dialogue context.
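For intuition, the MERG input/output structure can be sketched as simple data containers; this is a minimal illustration of the formulation above, with hypothetical field names rather than any released data format:

```python
# Minimal sketch of the MERG dialogue structure (hypothetical field names).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Utterance:
    text: str          # t_i: transcript of the utterance
    speech_path: str   # s_i: speech audio for the utterance
    video_path: str    # v_i: talking-face video for the utterance

@dataclass
class Round:
    query: Utterance     # Q_i, from the user
    response: Utterance  # R_i, from the model

@dataclass
class Dialogue:
    rounds: List[Round] = field(default_factory=list)  # D_i = {(Q_1, R_1), ..., (Q_i, R_i)}

    def history(self, i: int) -> List[Round]:
        """D_{<i}: all rounds preceding the i-th query."""
        return self.rounds[:i]
```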



2. AvaMERG: An Avatar-based Multimodal Empathetic Response Generation Dataset

We introduce AvaMERG, a large-scale high-quality benchmark dataset for MERG, which extends traditional text-based ERG by integrating authentic human speech audio and dynamic talking-face avatar videos.
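To make the composition concrete, a single AvaMERG turn can be thought of as pairing a text transcript with its speech audio and talking-face video under a given avatar profile and topic. The snippet below is a purely hypothetical schema sketch; the keys and values are illustrative and do not reflect the released data format:

```python
# Hypothetical example record; all keys and values are illustrative only.
example_turn = {
    "dialogue_id": "example_000",
    "turn": 1,
    "topic": "real-world scenario topic",
    "avatar_profile": {"avatar_id": "profile_A", "voice_id": "voice_A"},
    "query": {
        "text": "user utterance transcript",
        "speech": "path/to/query_audio.wav",
        "video": "path/to/query_video.mp4",
    },
    "response": {
        "text": "empathetic reply transcript",
        "speech": "path/to/response_audio.wav",
        "video": "path/to/response_video.mp4",
    },
}
```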



3. Empatheia: MLLM for MERG

We present Empatheia, a benchmark system tailored for MERG. With a backbone LLM as the core reasoner, Empatheia integrates a multimodal encoder, a speech generator, and a talking-face avatar generator, forming an end-to-end system.
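The following is a high-level composition sketch of this pipeline, written against hypothetical module interfaces (the concrete implementation details are not specified here); it only illustrates how the encoder, LLM reasoner, and generators connect end to end:

```python
# High-level sketch of the Empatheia pipeline; module interfaces are hypothetical.
class Empatheia:
    def __init__(self, mm_encoder, llm, content_sync, style_disentangler,
                 speech_generator, avatar_generator):
        self.mm_encoder = mm_encoder                  # encodes text/speech/video input
        self.llm = llm                                # backbone LLM, the core reasoner
        self.content_sync = content_sync              # content synchronizer (Sec. 3.2)
        self.style_disentangler = style_disentangler  # style disentangler (Sec. 3.2)
        self.speech_generator = speech_generator      # speech generator
        self.avatar_generator = avatar_generator      # talking-face avatar generator

    def respond(self, query, history):
        # 1) Encode the multimodal query and dialogue history into LLM-readable features.
        features = self.mm_encoder(query, history)
        # 2) The LLM reasons over the features (Chain-of-Empathy, Sec. 3.1)
        #    and produces the textual empathetic response plus generation signals.
        text, signals = self.llm.generate(features)
        # 3) Synchronize content and disentangle avatar-profile style.
        content = self.content_sync(text, signals)
        style = self.style_disentangler(query, history)
        # 4) Render the speech audio and the talking-face avatar video.
        speech = self.speech_generator(content, style)
        video = self.avatar_generator(content, style, speech)
        return text, speech, video
```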



3.1 Chain-of-Empathy Reasoning

Inspired by Chain-of-Thought prompting, we design a Chain-of-Empathy (CoE) reasoning mechanism. Specifically, we guide the LLM through the following progressive steps so that it derives the final empathetic response more accurately and more interpretably (a prompt-template sketch follows the list).

▶ Step 1: Event scenario. Reflect on the event scenario arising from the ongoing dialogue.
▶ Step 2: User's emotion. Analyze both the implicit and explicit emotions conveyed by the user.
▶ Step 3: Emotion cause. Infer the underlying reasons for the user's emotions.
▶ Step 4: Response goal. Determine the goal of the response in this particular instance, such as alleviating anxiety, offering reassurance, or expressing understanding.
▶ Step 5: Empathetic response generation. Formulate a response that addresses the user's emotions and situation, ensuring it reflects the reasoning from the previous steps. The output should focus purely on providing a thoughtful and empathetic reply.
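A minimal prompt-template sketch of these CoE steps is shown below; the wording is illustrative and does not reproduce the exact prompts used by Empatheia:

```python
# Illustrative CoE prompt template (not the exact prompt used in the paper).
COE_PROMPT = """You are an empathetic assistant. Given the dialogue so far,
reason step by step before replying:

Step 1 (Event scenario): Summarize the event scenario emerging from the dialogue.
Step 2 (User's emotion): Identify the user's explicit and implicit emotions.
Step 3 (Emotion cause): Infer why the user feels this way.
Step 4 (Response goal): Decide the goal of your reply (e.g., alleviate anxiety,
offer reassurance, express understanding).
Step 5 (Empathetic response): Write the final reply, grounded in Steps 1-4.
Return only the final reply to the user.

Dialogue history:
{history}

Current user query:
{query}
"""

def build_coe_prompt(history: str, query: str) -> str:
    return COE_PROMPT.format(history=history, query=query)
```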

3.2 Content Synchronizer and Style Disentangler Modules

To ensure high-quality multimodal generation, we integrate the state-of-the-art StyleTTS2 and DreamTalk generators, and address content synchronization and stylistic coherence through two modules placed before the generators, a content synchronizer and a style disentangler, which maintain consistency in both content and style across modalities.
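The sketch below shows how such modules could sit in front of the generators; the interfaces are hypothetical assumptions for illustration, and the StyleTTS2/DreamTalk APIs are not reproduced here:

```python
# Sketch of the generation stage with the two pre-generator modules
# (hypothetical interfaces; generator internals are abstracted away).
def render_response(text, llm_signals, avatar_profile,
                    content_synchronizer, style_disentangler,
                    tts_generator, talking_face_generator):
    # Content synchronizer: keep the spoken and visual content consistent
    # with the generated text (e.g., aligned phrasing and timing cues).
    content_repr = content_synchronizer(text, llm_signals)

    # Style disentangler: separate profile-specific style (voice timbre,
    # facial identity, speaking manner) from the content itself, so the
    # avatar profile stays consistent across turns and modalities.
    voice_style, face_style = style_disentangler(avatar_profile, llm_signals)

    # Generators consume content and style separately.
    speech = tts_generator(content_repr, voice_style)    # StyleTTS2-based speech
    video = talking_face_generator(speech, face_style)   # DreamTalk-based avatar video
    return speech, video
```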


3.3 Empathetic-enhanced Training Strategy

With the Empatheia architecture described above, we endow the system with effective MERG capability via a series of training strategies.


4. Experiment


▶ Main results of Automatic Evaluation.

▶ Main results of Human Evaluation.

BibTeX

@article{zhang2025towards,
      title={Towards multimodal empathetic response generation: A rich text-speech-vision avatar-based benchmark},
      author={Zhang, Han and Meng, Zixiang and Luo, Meng and Han, Hong and Liao, Lizi and Cambria, Erik and Fei, Hao},
      journal={arXiv preprint arXiv:2502.04976},
      year={2025}
    }