Abstract
Text-to-speech (TTS) has shown great potential in industrial applications in recent years.
However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete
emotion labels, making fine-grained and continuous emotion manipulation either unstable or inaccessible to
users.
To address this limitation, we propose EmoSteer-TTS, a novel training-free approach that
achieves fine-grained speech emotion control (e.g., conversion, interpolation, and erasure)
through activation steering.
Our key observation is that modifying a small subset of the internal activations within a TTS model can
effectively alter the emotional tone of the generated speech.
Building on this insight, we develop a lightweight and efficient algorithm for activation extraction and
steering, which can be seamlessly integrated into a wide range of pretrained TTS models.
To support emotion steering, we also construct a carefully curated emotional speech dataset for deriving
steering vectors.
Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and robust control
over speech emotion.
To the best of our knowledge, this is the first method that achieves training-free and fine-grained emotion
control in TTS.
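As a rough illustration of the idea, an emotion steering vector can be derived as the difference between the mean internal activations elicited by emotional versus neutral reference speech. The function name, array shapes, and toy data below are illustrative assumptions, not the actual EmoSteer-TTS implementation:

```python
import numpy as np

def extract_steering_vector(neutral_acts, emotional_acts):
    """Derive an emotion steering vector as the difference between the mean
    hidden activations of emotional and neutral reference speech.

    Each input is an (n_samples, hidden_dim) array of activations collected
    from one layer of a pretrained TTS model (names are hypothetical)."""
    return emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)

# Toy random activations standing in for real TTS hidden states.
rng = np.random.default_rng(0)
neutral = rng.normal(0.0, 1.0, size=(32, 8))   # 32 neutral references
happy = rng.normal(0.5, 1.0, size=(32, 8))     # 32 "happy" references
v_happy = extract_steering_vector(neutral, happy)
print(v_happy.shape)  # (8,): one offset per hidden dimension
```

The difference-of-means construction is a common way to obtain steering directions; the paper's actual extraction algorithm may differ in detail.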
Note: The content on this page is for research purposes only.
Method Framework
Our EmoSteer-TTS approach enables training-free and fine-grained emotion control through activation steering:
Figure: Overview of the proposed EmoSteer-TTS. Emotion steering vectors and steering weights are
derived from pairs of
neutral and emotional reference speech samples. During inference, these vectors are used to modulate
the activations of the
TTS model, effectively guiding it to synthesize speech that reflects the desired emotion.
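A minimal sketch of the inference-time modulation described above: a scaled steering vector is added to a layer's hidden activations, and the scaling weight determines the kind of control. The weight semantics shown here (conversion, interpolation, erasure) are an illustrative assumption about how the steering weights could be used, not the exact EmoSteer-TTS formulation:

```python
import numpy as np

def steer(hidden, vector, alpha):
    """Modulate a layer's hidden activations with a scaled steering vector.

    alpha = 1.0 pushes fully toward the target emotion (conversion),
    0 < alpha < 1 blends neutral and emotional tone (interpolation),
    and alpha < 0 moves away from the emotion (erasure)."""
    return hidden + alpha * vector

hidden = np.zeros((4, 8))  # toy (timesteps, hidden_dim) activations
v = np.ones(8)             # toy steering vector for one emotion

converted = steer(hidden, v, 1.0)
interpolated = steer(hidden, v, 0.5)
erased = steer(hidden, v, -1.0)
```

In a real system, `steer` would be applied inside selected layers of the pretrained TTS model during synthesis, leaving the model weights untouched.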