Abstract
Text-to-speech (TTS) has shown great potential in industrial applications in recent years.
However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete
emotion labels, making fine-grained and continuous emotion manipulation either unstable or inaccessible to
users.
To address this limitation, we propose EmoSteer-TTS, a novel training-free approach that
achieves fine-grained speech emotion control (e.g., conversion, interpolation, and erasure)
through activation steering.
Our key observation is that modifying a small subset of the internal activations within a TTS model can
effectively alter the emotional tone of the generated speech.
Building on this insight, we develop a lightweight and efficient algorithm for activation extraction and
steering, which can be seamlessly integrated into a wide range of pretrained TTS models.
To support emotion steering, we also construct a carefully curated emotional speech dataset for deriving
steering vectors.
Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and robust control
over speech emotion.
To the best of our knowledge, this is the first method that achieves training-free and fine-grained emotion
control in TTS.
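As a rough illustration of the idea, an emotion steering vector can be derived as the difference between the mean internal activations elicited by emotional versus neutral reference speech. The function name, array shapes, and toy data below are illustrative assumptions, not the actual EmoSteer-TTS implementation:

```python
import numpy as np

def extract_steering_vector(neutral_acts, emotional_acts):
    """Derive an emotion steering vector as the difference between the mean
    hidden activations of emotional and neutral reference speech.

    Each input is an (n_samples, hidden_dim) array of activations collected
    from one layer of a pretrained TTS model (names are hypothetical)."""
    return emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)

# Toy random activations standing in for real TTS hidden states.
rng = np.random.default_rng(0)
neutral = rng.normal(0.0, 1.0, size=(32, 8))   # 32 neutral references
happy = rng.normal(0.5, 1.0, size=(32, 8))     # 32 "happy" references
v_happy = extract_steering_vector(neutral, happy)
print(v_happy.shape)  # (8,): one offset per hidden dimension
```

The difference-of-means construction is a common way to obtain steering directions; the paper's actual extraction algorithm may differ in detail.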
Note: The content on this page is for research purposes only.
Method Framework
Our EmoSteer-TTS approach enables training-free and fine-grained emotion control through activation steering:
Figure: Overview of the proposed EmoSteer-TTS. Emotion steering vectors and steering weights are
derived from pairs of
neutral and emotional reference speech samples. During inference, these vectors are used to modulate
the activations of the
TTS model, effectively guiding it to synthesize speech that reflects the desired emotion.
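A minimal sketch of the inference-time modulation described above: a scaled steering vector is added to a layer's hidden activations, and the scaling weight determines the kind of control. The weight semantics shown here (conversion, interpolation, erasure) are an illustrative assumption about how the steering weights could be used, not the exact EmoSteer-TTS formulation:

```python
import numpy as np

def steer(hidden, vector, alpha):
    """Modulate a layer's hidden activations with a scaled steering vector.

    alpha = 1.0 pushes fully toward the target emotion (conversion),
    0 < alpha < 1 blends neutral and emotional tone (interpolation),
    and alpha < 0 moves away from the emotion (erasure)."""
    return hidden + alpha * vector

hidden = np.zeros((4, 8))  # toy (timesteps, hidden_dim) activations
v = np.ones(8)             # toy steering vector for one emotion

converted = steer(hidden, v, 1.0)
interpolated = steer(hidden, v, 0.5)
erased = steer(hidden, v, -1.0)
```

In a real system, `steer` would be applied inside selected layers of the pretrained TTS model during synthesis, leaving the model weights untouched.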